
D.Sharath Chandra                                            Priyanka Komakula
  III yr B.Tech.                                                    III yr B.Tech.
  KITS, Warangal.                                                   KITS, Warangal.
E-mail:                        E-mail:


In this paper we present a brief overview of bioinformatics. First, we describe the role
of computers in bioinformatics, using the identification of an unknown virus as an example.
We then present the creation of databases of protein structures. Next we move to the core
of the paper, which deals with the similarity searching techniques FASTA and PSI-BLAST.
For FASTA, we describe its application and its sequence format, with a multiple alignment
example that determines which two of horse, minke whale and red kangaroo are most closely
related. For PSI-BLAST, we present a flowchart of the search technique and an example that
explains why pig, or more generally animal-derived, insulin can be used to treat diabetes.
We then discuss applications of bioinformatics, mainly in the field of drug discovery.
Finally, we conclude by describing why bioinformatics has become such a popular career
option.

Bioinformatics, as the term suggests, has two components, 'Bio' and 'Informatics', and so
primarily denotes the convergence of two fields: biology and information technology.
Bioinformatics is the application of computer technology to the management of biological
information. Specifically, it is the science of developing computer databases and
algorithms to facilitate and expedite biological research. The Human Genome Project (HGP)
has been the biggest achievement of bioinformatics to date. Other areas of bioinformatics
include sequence alignment, protein structure prediction, systems biology, protein-protein
interactions and virtual evolution. [SR'04]

Computers in Bioinformatics:

When scientists talk about bioinformatics, or "doing bioinformatics", they mean the use of
computers to store, retrieve, analyze, predict and simulate the nature and properties of
biological macromolecules such as nucleic acids (e.g. deoxyribonucleic acid, or DNA) and
proteins (the products encoded by DNA). Computers, networked via the internet, help
biologists not only to store large volumes of data but also to retrieve data quickly from
anywhere in the world. Scientists and software enthusiasts have designed innumerable
computational tools for analyzing the physicochemical and structural properties of
biomolecules. These tools help biologists with difficult mathematical modeling and so save
time, money and effort. Bioinformatics has thus ushered in the age of 'in silico' biology,
in contrast to in vivo and in vitro studies.

For example, imagine a crisis, sometime in the future, in which a new biological virus
creates an epidemic of fatal disease in humans or animals. Laboratory scientists will
isolate its genetic material and determine the sequence. Computer programs will then take
over. Viruses contain protein molecules, which are suitable targets for drugs that
interfere with viral structure or function. From the viral DNA sequences, computer programs
will derive the amino acid sequences, and other programs will compute the structures and
functional properties of these proteins. First, data banks will be screened for related
proteins of known structure. If no related protein of known structure is found, and the
viral protein appears genuinely new, the structure prediction must be done entirely ab
initio.

Thus, knowing the viral protein structure and function will make it possible to design
therapeutic agents.
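
The step in which programs derive amino acid sequences from a DNA sequence can be illustrated with a minimal Python sketch. It assumes the standard genetic code, a single forward reading frame starting at the first base, and a hypothetical helper name `translate`; real pipelines must also handle alternative frames, the reverse strand and RNA genomes.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence (frame 0) into one-letter amino acid codes."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unrecognised codon
        if aa == "*":  # a stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)


# Made-up nine-base sequence: start codon, one alanine codon, stop codon.
print(translate("ATGGCCTAA"))
```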

Creation of Sequence Databases:

Most biological databases consist of long strings of nucleotides or amino acids. Each
sequence of nucleotides or amino acids represents a particular gene or protein,
respectively. Sequences are represented in shorthand, using single-letter designations.
While most biological databases contain nucleotide and protein sequence information, there
are also databases that include taxonomic information, such as the structural and
biochemical characteristics of organisms. Computer scripting languages such as Perl and
Python are often used to interface with biological databases and to parse the output of
bioinformatics programs.

    Similarity searching techniques:

Theoretical scientists have derived new and sophisticated algorithms that allow sequences
to be readily compared using probability theory. These comparisons become the basis for
determining gene function, developing phylogenetic relationships and simulating protein
models. The two popular database similarity searching techniques are FASTA and BLAST.
[HLC-95]


    A very common format for sequence data is derived from the conventions of FASTA, a
    program for FAST Alignment by W. R. Pearson.

    A sequence in FASTA format:

         Begins with a single-line description. A > must appear in the first column.
         Subsequent lines contain the sequence, one character per residue.
         Uses the one-letter codes for nucleotides or amino acids specified by IUB and IUPAC.
         Lines can have different lengths, i.e. 'ragged right' margins.
         Most programs will accept lower-case letters as amino acid codes.
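
The format rules above are simple enough to parse by hand. The following Python sketch is an illustration only: the function name `parse_fasta` and the two example records are invented here and do not come from any database.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {description: sequence} for FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.rstrip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):  # '>' in the first column starts a new record
            header = line[1:].strip()
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            # Lines may have ragged lengths and lower-case letters.
            records[header] += "".join(line.split()).upper()
    return records


# Two made-up records, one protein and one nucleotide.
example = """>seq1 hypothetical example record
MKTAYIAKQR
QISFVKSHFSRQ
>seq2 another made-up record
acgtacgt
"""
records = parse_fasta(example.splitlines())
for description, sequence in records.items():
    print(description, len(sequence))
```

Storing each record under its full description line keeps the sketch short; a fuller parser would usually split the identifier from the rest of the description.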

    Here is an example of a FASTA-format multiple alignment, used to determine which two of
    horse, minke whale and red kangaroo are most closely related.



    [Alignment table: ribonuclease sequences of horse, minke whale and red kangaroo, with a
    row of |, : and . conservation markers beneath each block of sequence]

    In this table, a | under the sequences indicates a position that is conserved (the same
    in all sequences), while : and . indicate positions at which all sequences contain
    residues of very similar physicochemical character (:) or somewhat similar
    physicochemical character (.).

    Percentage of identical residues in the aligned ribonuclease sequences:

           Horse and whale              74.2%
           Whale and red kangaroo       65.4%
           Horse and red kangaroo       57.4%

    These figures show that horse and whale share the most identical residues. A weakness of
    simple profiles is that the multiple sequence alignment must be provided in advance and
    is taken as fixed. PSI-BLAST gains power by integrating the alignment step with the
    collection of statistics. We therefore turn next to PSI-BLAST, which also uses the FASTA
    format. [BI'03]
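
The percentage of identical residues reported for each pair of species can be computed directly from two aligned sequences. The Python sketch below is one possible convention, not necessarily the one behind the figures quoted above: it assumes gaps are written as '-' and skips any column in which either sequence has a gap. The function name and the example strings are invented for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of gap-free aligned columns in which both residues are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # assumption: columns containing a gap are not counted
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Made-up aligned fragments: five gap-free columns, four of them identical.
print(percent_identity("ACDE-G", "ACDFHG"))
```

Other conventions divide by the full alignment length or by the length of the shorter sequence, so reported identities are only comparable when the same definition is used.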


    BLAST stands for Basic Local Alignment Search Tool. The program and its variants check
    each entry in the databank independently against the query sequence. It is so commonly
    used that your first encounter with bioinformatics tools and biological databases will
    probably be through the National Center for Biotechnology Information's (NCBI) BLAST
    web interface.
    Figure 1-1. Form for submitting a BLAST search against nucleotide databases at NCBI

Often the databank contains close matches to the query sequence. Less sensitive but faster
programs are quite capable of identifying close matches, and they suffice if that is all
that is required. The method used by BLAST goes back, in a sense, to the dot plot approach,
checking for well-matching local regions. For each entry in the database, it checks for
short contiguous regions that match a short contiguous region in the query sequence, using
a substitution scoring matrix but allowing no gaps. An approach in which candidate regions
of fixed length are identified initially can be made very fast by the use of lookup tables.
The BLAST family, including PSI-BLAST, was originally designed to address the problem that
full dynamic programming methods are rather slow for complete searches of a large databank.
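
The lookup-table idea can be sketched in a few lines of Python. This is a deliberate simplification rather than BLAST itself: it finds only exact matches of fixed-length words, whereas BLAST also scores near-matching words with a substitution matrix and then extends each seed. The function names and the word length k = 3 are assumptions made for the illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_word_index(query: str, k: int) -> Dict[str, List[int]]:
    """Map every length-k word in the query to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def find_seeds(query: str, subject: str, k: int = 3) -> List[Tuple[int, int]]:
    """Return (query_pos, subject_pos) pairs where identical k-letter words occur."""
    index = build_word_index(query, k)
    seeds = []
    for j in range(len(subject) - k + 1):
        # One dictionary lookup per subject word replaces a scan of the query.
        for i in index.get(subject[j:j + k], ()):
            seeds.append((i, j))
    return seeds


# Made-up sequences sharing the ungapped region CDEF.
print(find_seeds("ACDEFG", "XXCDEFYY"))
```

Because the index is built once per query, each database entry is scanned in time roughly proportional to its length, which is what makes the fixed-length seeding step fast.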
    A flowchart for PSI-BLAST:

        1. Probe each sequence in the chosen database independently for local regions of
           similarity to the query sequence, using a BLAST-type search but allowing gaps.
        2. Collect significant hits. Construct a multiple sequence alignment table between
           the query sequence and the significant local matches.
        3. Form a profile from the multiple sequence alignment.
        4. Reprobe the database with the profile, still looking only for local matches.
        5. Decide which hits are statistically significant and retain only these.
        6. Go back to step 2, until a cycle produces no change.
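
Steps 3 and 4 of the flowchart, forming a profile and reprobing with it, can be illustrated with the following Python sketch. It is not PSI-BLAST's actual scoring scheme: it assumes a uniform background frequency for the 20 amino acids, a fixed pseudocount, no sequence weighting, no gaps and no significance statistics. All names and the toy alignment are invented for the example.

```python
import math
from collections import Counter
from typing import Dict, List

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-column log-odds scores against a uniform background (an assumption)."""
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*alignment):
        counts = Counter(r for r in column if r in ALPHABET)  # gaps are ignored
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def best_local_score(profile: List[Dict[str, float]], sequence: str) -> float:
    """Best ungapped score obtained by sliding the profile along the sequence."""
    width = len(profile)
    best = float("-inf")
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(aa, 0.0) for col, aa in zip(profile, window))
        best = max(best, score)
    return best


# Toy alignment of three made-up four-residue hits.
profile = build_profile(["ACDE", "ACDE", "ACDF"])
print(best_local_score(profile, "GGACDEGG") > best_local_score(profile, "GGGGGGGG"))
```

In an iterative search, sequences scoring above a chosen threshold would be added to the alignment and the profile rebuilt, repeating until no new hits appear.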

Below is an example of a BLAST comparison. The following figures show why pig, or more
generally animal-derived, insulin can be used to treat diabetes. As the figures show, the
amino acid sequences of the animal insulins are very similar to the human form. Amino
acids, written in the one-letter code, that match exactly are displayed in the middle row
and marked with horizontal lines; those that do not match are marked with vertical bars.
The sequence identity with human insulin is 94% for rabbits, 89% for pigs, and 87% for cows.

                        Image of human insulin
PSI-BLAST, using an iterated pattern search, is much more powerful than simple pairwise
BLAST at picking up distant relationships. PSI-BLAST correctly identifies three times as
many homologues as BLAST. It is therefore a better method for analyzing genomes.

   Application of Bioinformatics:

Companies in the business of developing drugs, agricultural chemicals, hybrid plants,
plastics and other petroleum derivatives, and biological approaches to environmental
remediation, among others, are developing bioinformatics divisions and looking to
bioinformatics to provide new targets and to help replace scarce natural resources.
Bioinformatics can lead to important discoveries as well as help companies save time and
money in the long run. The continuing partnership between Inpharmatica and Sun
Microsystems is a good example of how life sciences firms can meet this bioinformatics
challenge.

Bioinformatics is first and foremost a component of the biological sciences. To work in
bioinformatics, therefore, one must understand the essentials of molecular biology on the
one hand and the fundamental principles of computer and information technology on the
other.

Bioinformatics has generated great excitement as a career option. Biologists with
knowledge of computers, computer enthusiasts who have brushed up on their biology,
mathematicians with a penchant for computation, biostatisticians, and physicists who can
handle computers are all being welcomed into the big new world of bioinformatics. One of
the biggest reasons bioinformatics is a hot field is the old law of supply and demand:
there are simply too few people adequately trained in both biology and computer science to
solve the problems that biologists need solved.

                       Bioinformatics is a tool and not an end in itself.



         [SR'04]  Science Reporter, November 2004.
         [HLC-95] Holm, L. and Sander, Network tools for protein structure, 1995.
         [BI'03]  Arthur, Bioinformatics, 2003.
    Internet resources:

         Human Genome Project and Bioinformatics
         Bioinformatics journal (
         BLAST -
         FASTA -