D.Sharath Chandra Priyanka Komakula
III yr B.Tech. III yr B.Tech.
KITS, Warangal. KITS, Warangal.
E-mail: firstname.lastname@example.org E-mail: email@example.com
In this paper we present a brief overview of bioinformatics. First, we describe the
role of computers in bioinformatics, using the identification of an unknown virus as an
example. Next, we present the creation of biological databases. We then move to the core of
the paper, which deals with similarity searching techniques: FASTA and PSI-BLAST.
For FASTA, we describe its application and the sequence format of multiple alignments with
an example that determines which two of horse, minke whale and red kangaroo are most
closely related. For PSI-BLAST, we present a flowchart of the search technique and an
example that explains why pig insulin, or animal-derived insulin in general, can be used
to treat diabetes. We then discuss applications of bioinformatics, mainly in the field of
drug discovery. Finally, we conclude by describing why bioinformatics has become such a
popular career option.
Bioinformatics, as reflected by the term, has two components, 'Bio' and
'Informatics', and as such primarily depicts the convergence of two fields, biology and
information technology. Bioinformatics is the application of computer technology to the
management of biological information. Specifically, it is the science of developing
computer databases and algorithms to facilitate and expedite biological research. The
Human Genome Project (HGP) has been the biggest achievement of bioinformatics to
date. Other areas of bioinformatics are sequence alignment, protein structure prediction,
systems biology, protein-protein interactions and virtual evolution. [SR'04]
Computers in Bioinformatics:
When scientists talk about bioinformatics or “doing bioinformatics”,
they mean the use of computers to store, retrieve, analyze, predict and simulate the nature
and properties of biological macromolecules such as nucleic acids (e.g. deoxyribonucleic
acid, or DNA) and proteins (the products of DNA). Computers and their networking via
the internet help biologists not only to store large volumes of data but also to retrieve
data quickly from anywhere in the world. Scientists and software enthusiasts have
designed numerous computational tools for analyzing various physicochemical and
structural properties of biomolecules. These tools help biologists with difficult
mathematical modeling and thus save time, money and effort. Bioinformatics has thereby
ushered in the age of ‘in silico’ biology, in contrast to in vivo and in vitro studies.
For example, imagine a crisis, sometime in the future, in which a new
biological virus creates an epidemic of fatal disease in humans or animals. Laboratory
scientists will isolate its genetic material and determine the sequence. Computer
programs will then take over. Viruses contain protein molecules which are suitable targets
for drugs that interfere with viral structure or function. From the viral DNA sequences,
computer programs will derive the amino acid sequences, and other programs will
compute the structures and functional properties of these proteins. First, data banks will
be screened for related proteins of known structure. If no related protein of known
structure is found, and the viral protein appears genuinely new, the structure prediction
must be done entirely ab initio.
Thus, knowing the viral protein structure and function will make it
possible to design therapeutic agents. [http://bioinformatics.oupjournals.org/]
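The step in which programs derive amino acid sequences from the viral DNA can be sketched in Python as below. This is a minimal illustration and not a tool referred to in this paper: it assumes the standard genetic code, a coding sequence that starts at position 0, and translation that stops at the first stop codon. The names `CODON_TABLE` and `translate` are our own.

```python
# Standard genetic code, written compactly: codons are enumerated in
# T, C, A, G order for each of the three codon positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA coding sequence into one-letter amino acid codes.

    Assumption: reading starts at position 0 and stops at the first stop
    codon. Codons with unknown characters are rendered as 'X'.
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE.get(dna[start:start + 3], "X")
        if residue == "*":  # stop codon ends the protein
            break
        protein.append(residue)
    return "".join(protein)


print(translate("ATGAAAGAATCTCCGTAA"))  # prints MKESP
```

A real pipeline would also have to locate the correct reading frame and handle both strands, which this sketch leaves out.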
Creation of Sequence Databases:
Most biological databases consist of long strings of nucleotides or amino
acids. Each sequence of nucleotides or amino acids represents a particular gene or
protein, respectively. Sequences are represented in shorthand, using single letter
designations. While most biological databases contain nucleotide and protein sequence
information, there are also databases which include taxonomic information such as the
structural and biochemical characteristics of organisms. Computer scripting languages
such as Perl and Python are often used to interface with biological databases and parse
output from bioinformatics programs.
Similarity searching techniques:
Theoretical scientists have derived new and sophisticated algorithms
which allow sequences to be readily compared using probability theory. These
comparisons become the basis for determining gene function, deriving phylogenetic
relationships and simulating protein models. The two popular database similarity
searching techniques are FASTA and BLAST. [HLC-95]
A very common format for sequence data is derived from the conventions of FASTA, a
program for FAST Alignment by W. R. Pearson.
A Sequence in FASTA format:
Begin with a single-line description. A > must appear in the first column.
Subsequent lines contain the sequence, one character per residue.
Use the one-letter codes for nucleotides or amino acids specified by IUB and IUPAC.
Lines can have different lengths, i.e., 'ragged right' margins.
Most programs will accept lower case letters as amino acid codes.
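The rules above are simple enough to parse with a few lines of Python. The sketch below is our own minimal reader, not the parser used by the FASTA program itself; the function name `parse_fasta` and the example records are hypothetical. It accepts ragged line lengths and lower case letters, as the format allows.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (description, sequence) pairs."""
    records = []
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # Sequence lines may be ragged and may use lower case letters.
            chunks.append(line.upper())
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


example = """>seq1 hypothetical example record
KESPAMKFER
qhmdsg
>seq2 another hypothetical record
ACGT
"""
for description, sequence in parse_fasta(example):
    print(description, len(sequence))
```

Sequence text that appears before the first > line is silently skipped here; a stricter reader might raise an error instead.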
Here is an example of a FASTA multiple alignment, which determines which two
among horse, minke whale and red kangaroo are most closely related.
>RNM_HORSE KESPAMKFERQHMDSGSTSSNTPTYCNQMMKRRNMTQGWCKPVNTFVHEPLADVQAICLQ 60
>RNM_BALAC RESPAMKFQRQHMDSGNSPGNNPYCNQMMMRRKMTQGRCKPVNTFVHESLEDVKAVCSQ 60
>RNP_MACRU -ETPAEKFQRQHMDTEHSTASSSNYCNLMMKARDMTSGRCKPLNTFIHEPKSVVDAVCHQ 59
|:|| || :|||| |: : ..... | || | | |. ||. | | | | : | | | :| | . |. |:| |
>RNM_HORSE KNITCKNGQSNCYQSSSSMHITDCRLTSGSKYPNCAYQTSQKERHIIVACEGNPYVPVHF 120
>RNM_BALAC KNVLCKNGRTNCYESNSTMHITDCRQTGSSKYPNCAYKTSQKEKHIIVACEGNPYVPVHF 120
>RNP_MACRU ENVTCKNGRTNCYKSNSRLSITNCRQTGASKYPNCQYETSNLNKQIIVACEG-QYVPVHF 118
:|: || | |::|| |:|.| : ||:|| |..|| || || |:||: :::||||||| ||||||
In this table, an | under the sequences indicates a position that is conserved (same in all
sequences), and : and . indicate positions at which all sequences contain residues of very similar physicochemical
character (:), or somewhat similar physicochemical character(.).
Percentage of identical residues in the aligned ribonuclease sequences:
Horse and whale 74.2%
Whale and red kangaroo 65.4%
Horse and red kangaroo 57.4%
These figures show that horse and whale share the most identical residues. A weakness of
simple profiles is that the multiple sequence alignment must be provided in advance, and is
taken as fixed. PSI-BLAST gains power by integrating the alignment step with the collection
of statistics. So we move to the next method, PSI-BLAST, where this FASTA format is also
used. [BI'03]
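Pairwise identity figures such as those in the ribonuclease comparison above can be computed directly from two aligned sequences. The following is a minimal sketch under one stated assumption: columns in which either sequence has a gap are excluded from both the count and the denominator. Other conventions exist (for example, dividing by the full alignment length), so the exact percentages depend on the definition chosen. The function name `percent_identity` and the short test sequences are hypothetical.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percentage of identical residues between two aligned sequences.

    Assumption: columns where either sequence has a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # gapped column, not counted
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


print(percent_identity("ACDE-G", "ACDFHG"))  # prints 80.0
```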
BLAST stands for Basic Local Alignment Search Tool. The program has variants which
check each entry in the databank independently against the query sequence. It is so
commonly used that the first encounter you have with bioinformatics tools and
biological databases will probably be through the National Center for Biotechnology
Information's (NCBI) BLAST web interface.
Figure 1-1. Form for submitting a BLAST search against nucleotide
databases at NCBI
Often the databank contains close matches to the query sequence. Less
sensitive but faster programs are quite capable of identifying the close matches, and are
sufficient if that is all that is required. The method used by BLAST goes back, in a sense,
to the dot plot approach, checking for well-matching local regions. For each entry in the
database, it checks for short contiguous regions that match a short contiguous region in
the query sequence, using a substitution scoring matrix but allowing no gaps. An approach
in which candidate regions of fixed length are identified initially can be made very fast by
the use of lookup tables. PSI-BLAST was originally designed to address the problem that
full dynamic programming methods are rather slow for complete searches of a large databank.
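The lookup-table idea can be illustrated with a short Python sketch. It is a deliberate simplification and not the actual BLAST implementation: it indexes every word of fixed length k in the query and reports only exact word matches in a database entry, whereas BLAST scores words with a substitution matrix and then extends the hits. The function names and the toy sequences are hypothetical.

```python
from collections import defaultdict


def build_word_index(query, k):
    """Map every length-k word of the query to its start positions."""
    index = defaultdict(list)
    for pos in range(len(query) - k + 1):
        index[query[pos:pos + k]].append(pos)
    return index


def find_seed_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a length-k word matches exactly.

    Simplification: exact matches only, no scoring matrix, no extension.
    """
    index = build_word_index(query, k)
    hits = []
    for s_pos in range(len(subject) - k + 1):
        # One dictionary lookup per subject word replaces a scan of the query.
        for q_pos in index.get(subject[s_pos:s_pos + k], ()):
            hits.append((q_pos, s_pos))
    return hits


print(find_seed_hits("ACDEFG", "XXCDEFXX"))  # prints [(1, 2), (2, 3)]
```

Both hits lie on the same diagonal (subject position minus query position equals 1), which is the kind of well-matching local region a dot plot would show as a short diagonal run.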
A flowchart for PSI-BLAST:
1. Probe each sequence in the chosen database independently for local regions of
similarity to the query sequence, using a BLAST-type search but allowing gaps.
Collect significant hits.
2. Construct a multiple sequence alignment table between the
query sequence and the significant local matches.
3. Form a profile from the multiple sequence alignment.
4. Reprobe the database with the profile, still looking only for local matches.
5. Decide which hits are statistically significant and retain only these.
6. Go back to step 2, until a cycle produces no change.
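The profile-forming step of this flowchart can be sketched as follows. This is only an illustration under simplifying assumptions: the profile is a plain table of residue frequencies per alignment column, and a candidate sequence is scored by summing the frequencies of its residues. PSI-BLAST itself uses a more elaborate position-specific scoring scheme, which is not reproduced here. The function names and the three-sequence toy alignment are hypothetical.

```python
from collections import Counter


def build_profile(alignment, gap="-"):
    """Return one {residue: frequency} dict per column of the alignment."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have equal length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != gap)
        total = sum(counts.values())
        # A column made only of gaps yields an empty frequency table.
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile


def score_against_profile(profile, sequence):
    """Sum the column frequency of each residue (simplified, ungapped score)."""
    return sum(col.get(res, 0.0) for col, res in zip(profile, sequence))


profile = build_profile(["ACD", "ACE", "AGD"])
print(score_against_profile(profile, "ACD") > score_against_profile(profile, "AGE"))
```

A sequence that agrees with the majority residue in each column scores higher, which is why reprobing with the profile can pick up relatives that a single query sequence would miss.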
The following is an example of a BLAST comparison. The pictures show the reason
why pig insulin, or animal-derived insulin in general, can be used to treat diabetes. As can be
seen in these figures, the amino acid sequences of the animal insulins are very similar to the
human form. Amino acids, symbolized using the one-letter code, which match exactly are displayed
in the middle row marked with horizontal lines. Those which do not match are marked with
vertical bars. The sequence identity is 94% for rabbits, 89% for pigs, and 87% for cows.
Image of human insulin
PSI-BLAST, using iterated pattern search, is much more powerful than simple
pairwise BLAST in picking up distant relationships. PSI-BLAST correctly identifies three
times as many homologues as BLAST. It is therefore a better method for analyzing genomes.
Application of Bioinformatics:
Companies in the business of developing drugs, agricultural chemicals, hybrid
plants, plastics and other petroleum derivatives, and biological approaches to
environmental remediation, among others, are developing bioinformatics divisions and
looking to bioinformatics to provide new targets and to help replace scarce natural
resources. Bioinformatics can lead to important discoveries as well as help companies
save time and money in the long run. The continuing partnership between Inpharmatica
and Sun Microsystems is a good example of how life sciences firms can meet this
Bioinformatics is first and foremost a component of the biological sciences.
Therefore, it is imperative to appreciate that, to work in bioinformatics, one must
understand the essentials of molecular biology on the one hand and the
fundamental principles of computer and information technology on the other.
Bioinformatics has created great excitement as a career option. Biologists with
knowledge of computers, computer enthusiasts with their biology basics freshened up,
mathematics enthusiasts with a penchant for computational wizardry, biostatisticians and
physicists who can handle computers are all being welcomed into the big new world of
bioinformatics. One of the biggest reasons bioinformatics is a hot field is the old
adage of supply and demand. There are simply too few people adequately trained in both
biology and computer science to solve the problems that biologists need to have solved.
Bioinformatics is a tool and not an end in itself.
[SR'04] Science Reporter, November 2004
[HLC-95] Holm, L. and Sander, Network tools for protein structure, 1995
Human Genome Project and Bioinformatics
Bioinformatics journal (http://bioinformatics.oupjournals.org/)
BLAST - http://ncbi.nlm.nih.gov/BLAST/
FASTA - ftp://ftp.virginia.edu/pub/fasta/