					NCBI Toolkit
Presentation
  May 1, 2003
 Takuma Tsukahara
      Di Ren
     Ying Jin
 Natalya Muzinich
 Introduction
to the Toolkit


 Takuma Tsukahara
       What is the NCBI toolkit?
• Used to help write molecular biology
  programs.
• Initially used by NCBI itself to create and
  distribute GenBank, Entrez, BLAST, and
  other related services.
• Freely available to the public.
• Targeted Platforms- UNIX (known to
  compile fine on Linux), DOS, Mac
• Software/Hardware Requirements- None
   Some software developed with the toolkit
     (other than ones from NCBI)
• EMGLib- database devoted to the completely
  sequenced bacterial genomes and the yeast
  genome with search functions.
• MPSA- stand-alone software package for protein
  sequence analysis with a high integration level.
• SEGMAP- analyzes biological physical mapping
  data, and presents a graphical view from it.
           Goal of NCBI
      (act of Congress, 1988)
• Create automated systems for knowledge
  about molecular biology, biochemistry, and
  genetics
• Perform research into methods of analyzing
  molecular biology data
• Enable researchers and medical care
  personnel to use the systems developed
• Gather biotechnology information
  worldwide
           NCBI branches
• Basic Research Branch- group of scientists
  who develop algorithms and methods for
  analyzing molecular biology data
• Information Resources Branch- maintains
  the infrastructure at NCBI
• Information Engineering Branch- designs
  and builds databases and software tools for
  molecular biology by incorporating the new
  methods and approaches
  Databases and software tools
       designed by IEB
• GenBank, BLAST, PubMed, Entrez,
  LocusLink, GEO, dbEST, dbSTS,
  Genome Resources, NCBI ToolBox,
  Taxonomy Database, OMIM, dbSNP,
  dbMHC, Sequin, BankIt, RefSeq
  Project, and many more
     Gathering biological data
• NCBI has to deal with gathering biological
  data, which may come from different
  sources.
• So IEB developed the NCBI toolkit, which
  led to the creation of other software (GenBank,
  Entrez, BLAST…).
• This software can be used internally at
  NCBI to process and analyze data that
  comes from a variety of sources.
• This allows NCBI to build and maintain the
  unified databases.
 Accessibility of biological data
• Software developed by IEB will be used by
  many people:
• End-user scientists, bioinformatics
  specialists in commercial, academic, or
  government settings, and academic
  researchers
• So it must be platform and format
  independent
• IEB chose to use ASN.1
              What is ASN.1?
• Abstract Syntax Notation number One
• Originated to make interaction between
  computers easier.
• Works regardless of how the data is represented
  and whatever the application, complex or simple.
• As a matter of fact, NCBI uses ASN.1 to store
  and retrieve data such as nucleotide and protein
  sequences, structures, genomes, and MEDLINE
  records.
  Your cell phone might be using
             ASN.1!!
• ASN.1 is used everywhere in our life.
• Internet Explorer, Netmeeting and Outlook.
• Wireless applications from Nokia, Ericsson and
  Motorola.
• Provides security for credit card purchase over
  the Internet.
• Others: ATM transactions, plane take-offs and
  landings, when Federal Express tracks a
  package
        Why IEB chose ASN.1
• Molecular biology data comes from, and is
  used in, a variety of environments.
• Data gathered and integrated by IEB will come
  from many different sources, in many different
  models, and they may change over time.
• Data should have a longer life span than a
  particular software tool or language.
• So IEB chose ASN.1, which is independent
  from hardware or software architecture and
  language.
    Information on the Toolkit
• http://www.ncbi.nlm.nih.gov/
• Go to “Software Engineering”
  (IEB homepage)
• Download from FTP
• See documents on toolkit
• Look at the source code in the toolkit
        Components of Toolkit
• ASN.1- is used to describe biological data
• Data Model- Model for biotechnology
  information described in ASN.1. Mostly used
  for bibliographic data and sequence data.
• CoreLib- set of C language functions and
  macros that can be compiled and used on most
  hardware/operating systems. Used for
  programming guidelines.
• ASNLib- function library written with CoreLib.
  Reads and validates ASN.1 specifications,
  generates parse trees, and much more…
                  What is ASN.1?

• ASN.1 = Abstract Syntax Notation One

• It is an International Organization for Standardization
  (ISO) data representation format which is used
  to achieve interoperability between platforms.

• It permits computers and software systems of
  all types to reliably exchange both the data
  structure and content.
 How is ASN.1 related to the NCBI toolkit?

• NCBI uses ASN.1 for the storage and
  retrieval of data such as nucleotide and
  protein sequences, structures, genomes, and
  MEDLINE records.

• The software in the NCBI Toolbox is
  primarily designed to read ASN.1 format
  records.
    Let's go to the NCBI web site to
     see what ASN.1 looks like.

http://www.ncbi.nlm.nih.gov
          The grammar of ASN.1 language
 ASN.1 Structure: Type references, identifiers,
 values

• Type reference: object name, start with upper case letter
• Identifier: field name, start with lower case letter
• Values: only found in the value notation.
      Type references, identifiers, values


My-type ::= SEQUENCE {      -- My-type is a Type Reference
  first INTEGER,                   -- first is an identifier
  second INTEGER DEFAULT 2, -- second defaults to 2
  third VisibleString OPTIONAL -- third is an optional string
}                               -- end of object definition
              ASN.1 Value Notation
Two types of encoding: ASCII and Binary
Always use binary encoding for production code!
Example text encoding:

  My-type ::= {
       first 42
   }
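To give a feel for what the binary form looks like, here is a minimal sketch of how a single INTEGER is laid out under the Basic Encoding Rules (tag 0x02, a length byte, then big-endian two's-complement content). The function name `ber_encode_int32` is made up for this illustration; it is not a toolkit function, and a full My-type value would also need the enclosing SEQUENCE encoding, which is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the NCBI toolkit): encode a signed
   32-bit value as a BER INTEGER. Writes at most 6 bytes into out and
   returns the number of bytes written. */
size_t ber_encode_int32(int32_t value, unsigned char *out)
{
    unsigned char content[4];
    uint32_t bits = (uint32_t)value;
    size_t start = 0;
    size_t i;

    assert(out != NULL);

    /* Big-endian two's-complement bytes of the value. */
    content[0] = (unsigned char)(bits >> 24);
    content[1] = (unsigned char)(bits >> 16);
    content[2] = (unsigned char)(bits >> 8);
    content[3] = (unsigned char)bits;

    /* Drop leading bytes that carry no information: 0x00 before a byte
       with the high bit clear, or 0xFF before a byte with it set. */
    while (start < 3 &&
           ((content[start] == 0x00 && (content[start + 1] & 0x80) == 0) ||
            (content[start] == 0xFF && (content[start + 1] & 0x80) != 0))) {
        start++;
    }

    out[0] = 0x02;                        /* universal tag: INTEGER */
    out[1] = (unsigned char)(4 - start);  /* short-form length      */
    for (i = start; i < 4; i++) {
        out[2 + (i - start)] = content[i];
    }
    return 2 + (4 - start);
}
```

With this sketch, the value 42 becomes the three bytes 02 01 2A, while 128 needs a leading zero byte (02 02 00 80) so that it is not read back as a negative number.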
               Defining repeated lists of types
Specification:
My-type-set ::= SEQUENCE OF My-type

Value Notation:
My-type-set ::= {                      -- start SEQUENCE OF
  {                                   -- a My-type
    first 42
  },
  {                                   -- another My-type
    first 27 ,
    second 22 ,
    third "Everything set here"
   }
}                                     -- end of SEQUENCE OF
           NCBI Primitive Data Types
• BOOLEAN (TRUE or FALSE)
• INTEGER
• OCTET STRING (string of bytes)
• NULL
• REAL (although not really - make your own REAL type
  using integer mantissa, exponent)
• VisibleString (printable characters)
• StringStore (NCBI made this one up for sequence)
                  SEQUENCE, SET, OF
• SEQUENCE - a series of named types - in order

• SEQUENCE OF - repeating series of single type -
  in order

• SET - like SEQUENCE, but order does not matter (Never use
  this - especially in secure applications)

• SET OF - like SEQUENCE OF, but order does not matter
  (Don't use this either)
                         CHOICE
CHOICE: define a set of alternate types.

Specification:
Accession-Number ::= CHOICE {
            gi INTEGER,
            swiss-prot VisibleString
}

Text value notation:
Accession-Number ::= gi 413223
or
Accession-Number ::= swiss-prot "P22518"
Note: no { } around choice values
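In C, a CHOICE is naturally pictured as a tagged union: one field says which alternative is present, and a union holds its value. The sketch below is only an illustration of that idea; the type and function names are invented here and are not what asntool generates.

```c
#include <assert.h>
#include <stdio.h>

/* Which alternative of the CHOICE is present. */
typedef enum { ACC_GI, ACC_SWISS_PROT } AccessionKind;

/* Hypothetical C mirror of Accession-Number (illustrative only). */
typedef struct {
    AccessionKind kind;
    union {
        long gi;                 /* used when kind == ACC_GI         */
        const char *swiss_prot;  /* used when kind == ACC_SWISS_PROT */
    } u;
} AccessionNumber;

/* Write the text value notation into buf; returns what snprintf returns. */
int accession_to_text(const AccessionNumber *acc, char *buf, size_t size)
{
    assert(acc != NULL && buf != NULL);
    if (acc->kind == ACC_GI) {
        return snprintf(buf, size, "Accession-Number ::= gi %ld", acc->u.gi);
    }
    return snprintf(buf, size, "Accession-Number ::= swiss-prot \"%s\"",
                    acc->u.swiss_prot);
}
```

Only one member of the union is meaningful at a time, which matches the rule that a CHOICE value carries exactly one alternative.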
                  ENUMERATED
ENUMERATED: A named set of integer values. The parser will
  check that names are valid. Ideal for controlled vocabulary
  lists that won't change.
INTEGER can also be used in the same way, but the
  names won't be type checked. Ideal for controlled vocabulary
  lists that will be expanded over time.

Specification:
Sex ::= ENUMERATED
{ female (1),
    male (2),
    other (255)
}
Example:
Sex ::= other
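The name check described above can be sketched in C as a small lookup table: a known name maps to its number, and anything else is rejected. The enum and function names are assumptions made for this example, not toolkit identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* C mirror of the Sex ENUMERATED type (names chosen for this sketch). */
typedef enum { SEX_FEMALE = 1, SEX_MALE = 2, SEX_OTHER = 255 } Sex;

/* Map a value-notation name to its integer value.
   Returns -1 for a name that is not in the enumeration, which is the
   case a checking parser would report as an error. */
int sex_from_name(const char *name)
{
    static const struct { const char *name; int value; } table[] = {
        { "female", SEX_FEMALE },
        { "male",   SEX_MALE   },
        { "other",  SEX_OTHER  }
    };
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(name, table[i].name) == 0) {
            return table[i].value;
        }
    }
    return -1;
}
```

With a plain named INTEGER, the equivalent lookup would have to let unknown numbers through, which is why that form suits lists that are expected to grow.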
                           Modifiers
• OPTIONAL: marks a value as optional. Can be added to any
  type.

• DEFAULT: specifies a default value (can be any 'value').
  May be used with all types except OCTET STRING, NULL.


 My-type ::= SEQUENCE {             -- My-type is a Type Reference
       first INTEGER,                         -- first is an identifier
       second INTEGER DEFAULT 2,            -- second defaults to 2
       third VisibleString OPTIONAL     -- third is an optional string
}                                       -- end of object definition
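One way to picture the two modifiers in C, as a rough sketch: a DEFAULT field is filled in when the object is created, and an OPTIONAL field is a pointer that stays NULL when the value is absent. The struct and function names below are hypothetical and do not come from asntool output.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C mirror of My-type (illustrative only). */
typedef struct {
    int first;          /* required                         */
    int second;         /* DEFAULT 2                        */
    const char *third;  /* OPTIONAL: NULL means "not given" */
} MyType;

/* Build a MyType when only the required field is supplied. */
MyType my_type_new(int first)
{
    MyType v;
    v.first = first;
    v.second = 2;     /* apply the DEFAULT             */
    v.third = NULL;   /* OPTIONAL field starts absent  */
    return v;
}

/* Report whether the OPTIONAL field is present. */
int my_type_has_third(const MyType *v)
{
    assert(v != NULL);
    return v->third != NULL;
}
```

A value written as `{ first 42 }` would then correspond to `my_type_new(42)`, with `second` equal to 2 and `third` absent.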
                    Module

• The ASN.1 module provides a means of
  grouping together related values and types. It
  is similar to its programming language
  counterparts, the function and procedure.

• Each module can be stored in a separate file, or
  several related modules can be grouped
  together in one file for convenient use.
               An ASN.1 Specification
*** Demo Spec ***
Demo-module DEFINITIONS ::=                              -- Module-name
BEGIN
EXPORTS My-type,              -- My-type can be used by other modules
Another;
IMPORTS Foreign-type FROM Other-module;              -- can import types
                              -- we define an object called My-type
My-type ::= SEQUENCE {                 -- My-type is a Type Reference
      first INTEGER,                               -- first is an identifier
      second INTEGER DEFAULT 2,                  -- second defaults to 2
      third VisibleString OPTIONAL           -- third is an optional string
}                                            -- end of object definition
Another ::= Foreign-type          -- can reference other defined types
END                                      -- end of module, END required
ASN.1 Development Process
           ASNTOOL & AsnLib

• Function of ASN Tool
• How to use the asntool
   Some concepts related to asntool
   Options of asntool
   Exercises for using the asntool
• Application of asntool result
   Some Built-In Demos
   Exercise in using related programs
ASN TOOL
  Ying Jin
           Function of ASN Tool
•   Function as a compiler
    Read, write and check ASN.1 specification, ASCII
    values, and binary values conforming to the specification
•   Function as a filter
    Manipulate ASN.1 files and perform conversions
    between ASCII and Binary forms of ASN.1
•   Function as a code generator
    Make C functions for loading ASN.1 from a file, writing
    ASN.1 to a file, allocating and freeing memory for each
    ASN.1 object

    Note: BER stands for Basic Encoding Rules
    Concepts related to ASN Tool
• ASN module file
  e.g. Medline.asn file which is used to
       describe the Medline entries
• ASN value notation entries
  e.g. Medline.ent – single Medline-entry
       Medline.prt – a Pub-set with multiple
       Medline entries
• ASN binary file
  e.g. Medline.val – binary ASN.1
              Detailed Look

• Take a look at medline.asn:
  http://web.mit.edu/seven/src/ncbi/asn/medline.asn
• Take a look at medline.ent:
  http://web.mit.edu/seven/src/ncbi/demo/medline.ent
         ASN Tool Command Line
-m ASN.1 Module File [File In]
-v Print value File [File In] Optional
-p Print value File [File Out] Optional
-e Binary Value File [File Out] Optional
-t Binary Value Type [String] Optional
-o Header File [File Out] Optional
-d Binary Value File(type required) [File In] Optional
-B Base for filename, without extensions, for generated
    objects and code [File Out] Optional
-G Generate object loader .c and .h files
 -I In generated .c, add #include to this filename [String]
    Optional

• Run asntool to see more options
           Preparation for exercise
1. Login to solar.uits.indiana.edu
2. Go to dir: $ cd /bigscr/bioinfo/L519/ncbi-test
3. For bash, run $ source env.bash
   for tcsh, run $ source env.tcsh
  (* This is to setup the environment for you to run the asntool)
4. Create your own dir with your user name
   under current dir
   $ mkdir your_username
   $ cd your_username
   Preparation for exercise (cont.)
• asnpub.all contains following modules:
  – general.asn, general purpose data types
  – pub.asn, branch point for various publication
    types
  – biblio.asn, standard bibliographic citations for
    journals, books, manuscripts, patents based on
    ANSI standard
  – medline.asn, Medline entry (based on NCBI-
    Biblio)
   Preparation for exercise (cont.)


• Copy exercise1.command to your own
  directory
• Let's build asnpub.all by concatenating the modules:
  cat ../asn/general.asn ../asn/pub.asn ../asn/biblio.asn ../asn/medline.asn > asnpub.all
                           Exercises
1. Function as a compiler
1> asntool -m asnpub.all
   reads the publication modules and validates that they are correctly
   built. There are no errors in asnpub.all, so asntool is silent.
2> asntool -m asnpub.all -v ../demo/medline.ent
   reads the file medline.ent, which it expects to be of a type defined
   in asnpub.all. It checks for errors, reporting any it finds. There
   are none, so asntool is silent.
3> asntool -m asnpub.all -v ../demo/medline.ent -p stdout
   runs the same checks, then prints the value to stdout.
                  Exercises (cont.)
• Function as a filter
4> asntool -m asnpub.all -v ../demo/medline.prt -e medline.val
  reads the set of MEDLINE records from medline.prt and
  encodes them in binary ASN.1 in the file medline.val
5> asntool -m asnpub.all -d medline.val -t Pub-set -p stdout
  reads (decodes) the set of MEDLINE records from the binary
  ASN.1 file we just made and outputs them as value notation on
  stdout
                Exercises (cont.)
Function as a code generator
6> asntool -m asnpub.all -o asnpub.h
  Creates an asnpub.h file that contains a representation of
  the ASN.1 spec in C
7> asntool -m ../asn/medline.asn -Gt -M asnpub.all -B
  myobj -I asnpub.h
 Use medline.asn to create myobj.c/h files containing
  your object loaders. You use the functions in myobj.h.
                  Detailed Look
Take a look at:
• asnpub.h
• myobj.h
• myobj.c
Parsing ASN.1 specification files

          Natalya Muzinich
            May 1, 2003
                  getmesh.c
• Function: gets mesh terms from a medline entry
• Includes allpub.h, which contains C-language
  translations of ASN.1 structures.
• 2 parts:
  – Parse tree that determines memory locations of all
    structures and their components
  – Names for these locations:

   #define MEDLINE_MESH_term &at[244]
     MEDLINE_MESH_term: name to use in a C program
     &at[244]: its starting location in memory
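The idea behind such a header can be sketched with a tiny stand-in: an array of nodes plus macros that name individual entries, so that code can recognize a component by comparing addresses instead of strings. Everything below (the `TypeNode` struct, the four-entry table, the `DEMO_` names) is invented for illustration; the real table in allpub.h is much larger and its node type is defined by the toolkit.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for one node of the parse tree (not the toolkit's type). */
typedef struct {
    const char *name;
} TypeNode;

/* A miniature table of the kind a generated header defines.
   The entries are illustrative, not copied from allpub.h. */
static TypeNode at[] = {
    { "Medline-entry" },
    { "Medline-entry.uid" },
    { "Medline-mesh" },
    { "Medline-mesh.term" }
};

/* Names for table locations, in the same spirit as MEDLINE_MESH_term. */
#define DEMO_MEDLINE_ENTRY     (&at[0])
#define DEMO_MEDLINE_MESH_term (&at[3])

/* A node is identified by its address, so no string comparison is needed. */
int is_mesh_term(const TypeNode *node)
{
    assert(node != NULL);
    return node == DEMO_MEDLINE_MESH_term;
}
```

Because each macro expands to a fixed address, a test such as `atp == MEDLINE_MESH_term` is a single pointer comparison.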
            Program Arguments
From command line:
getmesh

GetMesh 1.0 arguments:
 -i Input data [Data In]
   Data Type = Medline-entry
 -b Input data is binary [T/F] Optional
   default = F
 -o Output list [File Out]

Source code:
#define NUMARGS 3
Args myargs[NUMARGS] = {
   { "Input data", NULL, "Medline-entry", NULL, FALSE, 'i', ARG_DATA_IN, 0.0,0,NULL},
   { "Input data is binary", "F", NULL, NULL, TRUE , 'b', ARG_BOOLEAN, 0.0,0,NULL},
   { "Output list", NULL, NULL, NULL, FALSE, 'o', ARG_FILE_OUT, 0.0,0,NULL}
};
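The table-driven approach can be illustrated with a much smaller stand-in. The sketch below is not the toolkit's GetArgs and `MiniArg` is not the toolkit's Args structure; it assumes every flag is followed by exactly one value and keeps only the fields needed to show how a table of tags, defaults, and "optional" markers drives the parsing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for one row of an argument table. */
typedef struct {
    const char *prompt;      /* text for a usage listing              */
    const char *defaultval;  /* NULL when there is no default         */
    int optional;            /* nonzero if the argument may be absent */
    char tag;                /* the letter that follows '-'           */
    const char *strvalue;    /* filled in by parse_args               */
} MiniArg;

/* Fill strvalue for every row from "-x value" pairs in argv.
   Returns 1 on success, 0 for an unknown flag, a malformed flag,
   or a missing required argument. */
int parse_args(int argc, char **argv, MiniArg *table, size_t n)
{
    size_t j;
    int i;

    assert(argv != NULL && table != NULL);

    /* Start every row at its default (possibly NULL). */
    for (j = 0; j < n; j++) {
        table[j].strvalue = table[j].defaultval;
    }

    /* Walk the "-x value" pairs and store each value in its row. */
    for (i = 1; i + 1 < argc; i += 2) {
        int found = 0;
        if (argv[i][0] != '-' || strlen(argv[i]) != 2) {
            return 0;
        }
        for (j = 0; j < n; j++) {
            if (table[j].tag == argv[i][1]) {
                table[j].strvalue = argv[i + 1];
                found = 1;
            }
        }
        if (!found) {
            return 0;
        }
    }

    /* Every required row must have received a value. */
    for (j = 0; j < n; j++) {
        if (!table[j].optional && table[j].strvalue == NULL) {
            return 0;
        }
    }
    return 1;
}
```

Adding a new command-line option then means adding one row to the table rather than writing more parsing code, which is the main appeal of the design.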
                Main Steps:
• Gets ASN.1 parse tree using AsnLoad():
if (! AsnLoad())
        Message(MSG_FATAL, "Unable to load allpub parse tree.");

• Gets or displays command line arguments
  using GetArgs:
if (! GetArgs("GetMesh 1.0", NUMARGS, myargs))
      return 1;
       Main Steps - continued
• Determines from command-line -b
  argument whether the input file is in ASCII
  or binary
if (myargs[1].intvalue)   /* binary input is TRUE */
      intype = 1;
else
      intype = 0;
       Main Steps - continued
• Opens the input file supplied with the -i
  argument (medline.ent)
if ((aip = AsnIoOpen(myargs[0].strvalue, intypes[intype])) == NULL)
  {
      Message(MSG_ERROR, "Couldn't open %s", myargs[0].strvalue);
      return 1;
  }
       Main Steps - continued
• Opens the output file for writing:
  if ((fp = FileOpen(myargs[2].strvalue, "w")) == NULL)
  {
      Message(MSG_ERROR, "Couldn't open %s", myargs[2].strvalue);
      return 1;
  }
          Key Portions of Code:
• atp = MEDLINE_ENTRY;
  - Defines expected type of the structure in the
  input file.
  - Gives an error if the type is not as expected.
       Example: getmesh cannot parse a pubmed structure, even though it
  contains a medline structure.
  - Makes this program structure-specific
      Key Portions of Code - ctd
• In a loop, read each component of the ASN.1
  structure from the input file:
       while ((atp = AsnReadId(aip, amp, atp)) != NULL)

• Check whether the last component read is
  MEDLINE_MESH_term:
       if (atp == MEDLINE_MESH_term)
• If it is, get its content:
       AsnReadVal(aip, atp, &value);
       FilePuts(value.ptrvalue, fp);
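The shape of this loop (read an identifier, compare it to the one you want, read the value) can be tried out without the toolkit using a mock stream. In the sketch below, `mock_read_id` and `mock_read_val` are stand-ins invented here for the roles AsnReadId and AsnReadVal play; they are not toolkit functions, and the input is an in-memory array rather than an ASN.1 file.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in parse-tree nodes (not the toolkit's types). */
typedef struct { const char *name; } TypeNode;
static TypeNode nodes[] = { { "uid" }, { "mesh.term" } };
#define NODE_UID  (&nodes[0])
#define NODE_TERM (&nodes[1])

/* One component in the mock input: its type and its text value. */
typedef struct { const TypeNode *type; const char *text; } Element;

/* Mock input stream over an array of components. */
typedef struct { const Element *items; size_t count; size_t pos; } MockStream;

/* Role of AsnReadId here: type of the next component, NULL at the end. */
const TypeNode *mock_read_id(const MockStream *s)
{
    return s->pos < s->count ? s->items[s->pos].type : NULL;
}

/* Role of AsnReadVal here: consume the current component, return its text. */
const char *mock_read_val(MockStream *s)
{
    assert(s->pos < s->count);
    return s->items[s->pos++].text;
}

/* Copy up to max mesh terms into out and return how many were stored. */
size_t collect_terms(MockStream *s, const char **out, size_t max)
{
    const TypeNode *t;
    size_t n = 0;

    while ((t = mock_read_id(s)) != NULL) {
        const char *text = mock_read_val(s);  /* always consume the value */
        if (t == NODE_TERM && n < max) {      /* keep only the mesh terms */
            out[n++] = text;
        }
    }
    return n;
}
```

Note that the value is consumed on every pass, whether or not it is kept; otherwise the stream would never advance past components the program is not interested in.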
           Main Steps - Last
• Close the input and output files:

aip = AsnIoClose(aip);

FileClose(fp);
               Other Examples
• indexpub.c
  – Builds an index to medline.ent based on Medline
    Unique Identifier
• getpub.c
  – Uses the index created by indexpub.c to retrieve a
    Medline-entry from medline.val by Medline uid.

  To display the arguments, run: indexpub -
                                 getpub -
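The slides do not describe how indexpub lays out its index, so the following is only a sketch of the general idea under an assumed layout: a list of (uid, file offset) pairs, sorted by uid so that a lookup can use binary search. All names here are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed index record: a Medline uid and the byte offset of its entry. */
typedef struct {
    long uid;
    long offset;
} IndexEntry;

/* Order index records by uid (for qsort and bsearch). */
static int cmp_uid(const void *a, const void *b)
{
    long x = ((const IndexEntry *)a)->uid;
    long y = ((const IndexEntry *)b)->uid;
    return (x > y) - (x < y);
}

/* Sort the index once after it has been built. */
void index_sort(IndexEntry *idx, size_t n)
{
    assert(idx != NULL || n == 0);
    qsort(idx, n, sizeof *idx, cmp_uid);
}

/* Return the file offset recorded for uid, or -1 if it is not indexed. */
long index_find(const IndexEntry *idx, size_t n, long uid)
{
    IndexEntry key;
    const IndexEntry *hit;

    assert(idx != NULL || n == 0);
    key.uid = uid;
    key.offset = 0;
    hit = bsearch(&key, idx, n, sizeof *idx, cmp_uid);
    return hit != NULL ? hit->offset : -1;
}
```

A retrieval program in this style would look up the offset, seek to it in the data file, and read a single entry from there instead of scanning the whole file.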
            Ways to Explore
• Check out /miscapps/ncbitoolkit/bin on Solar
• Create new ASN.1 specification instances by
  getting an instance from NCBI and modifying it
• Slightly modify one of the existing programs
• Use NCBI web interface output in helping
  understand programs
              Conclusions
• Why learn NCBI toolkit when I can run the
  same programs through a web interface?
     Interface often limits programs' options
• Steep learning curve (especially for novice
  programmers), but may be well worth it
• ASN.1 is a great data representation format

				