Bioinformatics an introduction by ert554898


Module 6
Introduction to Bioinformatics

Outline of presentation
   definitions
   the central dogma - DNA, RNA, proteins
   data in biology & molecular biology
   Biological information management &
   extracting meaning from sequences
   comparisons
   subcellular location prediction
   clustering techniques for sequences
What is Biotechnology?
 Biotechnology generally refers to the use
  of microorganisms to produce certain
  chemical compounds.
 Long before the term "biotechnology" was
  coined for the process of using living
  organisms to produce improved
  commodities, people were utilizing living
  micro-organisms to produce valuable
Biotech early applications
 Proving bread with leaven - prehistoric period
 Fermentation of juices to alcoholic beverages - prehistoric period
 Knowledge of vinegar formation from fermented juices -
  prehistoric period
 Cultivation of vine - before 2000 BC
 Manufacture of beer in Babylonia and Egypt - 3rd millennium BC
 Wine growing promoted by Roman Emperor Marcus Aurelius
  Probus - 3rd century AD
 Production of spirits of wine (ethanol) - 1150
 Vinegar manufacturing industry - 14th century
 Discovery of the fermentation properties of yeast by Erxleben -
 Description of lactic acid fermentation by Pasteur - 1857
 Detection of fermentation enzymes in yeast by Buchner - 1897
 Discovery of penicillin by Fleming - 1928/29
 Discovery of many other antibiotics - from about 1945
     Recent Biotech Development
   In the mid-forties, scale-up and commercial production of antibiotics such as
    penicillin occurred.
     –   The techniques used were isolation of an organism producing the chemical of interest
         using screening/selection procedures, and improvement of production yields via
         mutagenesis of the organism or optimization of media and fermentation conditions.
     –   This type of "antique" biotechnology is limited to chemicals produced in nature, limited by
         its trial-and-error approach, and requires a lengthy timeframe for yield improvement.
   About two decades ago, biotechnology became much more of a science (rather than
    an art).
     –   Regions of DNA (called genes) were found to contain information that would lead to
         synthesis of specific proteins (which are strings of amino acids).
     –   Each of these proteins has its own identity and function; many catalyze (facilitate)
         chemical reactions, and others are structural components of entities in cells.
     –   If one is now able to express a natural gene in a simple bacterium such as Escherichia coli
         (E. coli), a bacterium living in intestines that has become the model organism for much of
         biotechnology, one can have this bacterium make a lot of the protein coded for by the gene,
         regardless of its source.
     –   The techniques used for this development include (a) isolation of the gene coding for a
         protein of interest, (b) cloning of this gene into an appropriate production host, and (c)
         improving expression by using better promoters, tighter regulation, etc.; together these
         techniques are known as recombinant DNA techniques.
Commercial implications
   A large number of proteins, existing only in tiny quantities in nature,
    can now be mass-produced if needed.
   The spectrum of "bioreactors" (organisms used for production)
    has recently been broadened to include a variety of animals and
   About a decade ago, "protein engineering" became possible as an
    offshoot of the recombinant DNA technology.
     – Protein engineering differs from "classical" biotechnology in that it is
       concerned with producing new (man-made) proteins which have been
       modified or improved in some way.
     – The techniques involved in protein engineering are more complicated
       than before, and involve (a) various types of mutagenesis (to cause
       changes in specific locations or regions of a gene to produce a new
       gene product), (b) expression of the new gene to form a stable protein,
       (c) characterization of the structure and function of the protein
       produced, and (d) selection of new locations or regions to modify as a
       result of this characterization.
Mid-eighties and early-nineties
   It has become possible to transform
    (genetically modify) plants and animals
    that are important for food production.
    – "Transgenic" animals and plants, including
      cows, sheep, tomatoes, tobacco, potato, and
      cotton have now been obtained.
    – Genes introduced may make the organism
      more resistant to disease, may influence the
      rate of fruit ripening, or may increase
Overview of recombinant DNA-based biotechnology
   1953 - Double helix structure of DNA is first described by Watson
    and Crick.
   1973 - Cohen and Boyer develop genetic engineering techniques
    to "cut and paste" DNA and to amplify the new DNA in bacteria.
   1977 - The first human protein (somatostatin) is produced in a
    bacterium (E. coli).
   1982 - The first recombinant protein (human insulin) appears on
    the market.
   1983 - Polymerase chain reaction (PCR) technique conceived.
   1990 - Launch of the Human Genome Project (HGP), an
    international effort to sequence the human genome.
   1995 - The first genome sequence of an organism (Haemophilus
    influenzae) is determined.
   2000 - A first draft of the human genome sequence is completed.

I - Definitions
   Informatics
    – the science of information management

   Bioinformatics
    – the science of biological information

Working definitions?
   • “It is naturally characterised more by the
     problem domains it addresses than by a
     foundational set of philosophical or scientific
     principles.” (David Benton, TIBTECH, August
     1996, 14, 261-272)

   • “...bioinformatics’ one defining principle is its
     pragmatic openness to investigate the
     application of any computational, mathematical
     or statistical method or approach to a biological
     problem.” (Benton, ibidem)
II – Bio- & chemo-informatics

The central dogma (Kornberg)
   DNA
      Transcription: copy out in the same language
   RNA
      Translation: render into another language
   PROTEINS
   everything else!

   Inside the cell, the DNA acts like an "instruction manual": in its sequence, it provides all
   the information needed to function, but the actual work of translating the information into a
   medium that can be used directly by the cell is done by RNA, ribonucleic acid.

DNA
   alphabet of 4 “bases”
      • Adenine, Cytosine, Guanine and Thymine
      • ACGT
 information content encoded by the
  sequence of these bases
 storage of genetic information
 sequence usually determined by
  automated methods
RNA
   alphabet of 4 “bases”
        • Adenine, Cytosine, Guanine and Uracil
        • ACGU
   base sequence again determines information content
   mediates interpretation of the genetic message
   The messenger RNA (mRNA) serves as an intermediate
    between DNA and protein.
   Parts of the DNA are "transcribed" into mRNA, a single-
    stranded molecule.
   Transcription starts at a specific site on the DNA called a
    promoter. Each gene or operon has its own promoter(s).
   Transcription ends at a terminator sequence on the DNA.
   The transcripts contain the information to make protein.

Proteins
   alphabet of 20 amino acids (AAs)
    – e.g. A, V, G etc.
 sequence of AAs ultimately
  determines the function of a protein
 variety of functional and structural
 expression of the genetic message

Transcription (DNA to RNA)
 1:1 correspondence between DNA
  and RNA
 there may be “editing” to remove
  intervening sequences (“introns”)
 given a DNA sequence, it is possible
  to derive the RNA sequence and vice
  versa

Translation (RNA to protein)
   3:1 correspondence between RNA
    and proteins using the genetic code
                 – (next slide)
 protein sequences can be derived
  from nucleic acid sequences
 many possible NA sequences can
  produce a given protein sequence
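
The last point can be made concrete with a small sketch that is not part of the original slides. It assumes the standard genetic code, written as the conventional compact 64-letter string with codons enumerated in U, C, A, G order, and counts how many distinct RNA sequences encode a given peptide. The helper name `count_encodings` is hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code (assumed), one letter per codon, codons in UCAG order;
# "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons are available for each amino acid (and for stop).
CODONS_PER_RESIDUE = Counter(CODON_TABLE.values())


def count_encodings(peptide: str) -> int:
    """Count the RNA sequences that translate to `peptide` (one-letter codes)."""
    total = 1
    for residue in peptide:
        total *= CODONS_PER_RESIDUE[residue]
    return total


if __name__ == "__main__":
    print(count_encodings("CQRAQ"))
```

For the five-residue peptide CQRAQ the count is 2 x 2 x 6 x 4 x 2 = 192, which is why a protein sequence cannot be translated back into a unique nucleic acid sequence.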
Decoding a sequence of DNA
   Reverse and complement
    – by base pairing rules C-G, T-A
    – 5'-TCTGACTATTGAGCTCTCTGGCACAATGCA-3'(antisense strand)

   Transcribe
    – Transcribe message (mRNA) (same as sense except U’s for T’s)

   Translate
    – -- - --- ini cys gln arg ala gln stp ---

   Protein product
    – CysGlnArgAlaGln == (NH2)CQRAQ(COOH)
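
The four steps on this slide can be traced in a short sketch. It is an illustration, not the slide author's code, and makes these assumptions: the standard genetic code is used, the reading frame starts one base into the message (inferred from the slide's translation line), and the UUG codon that the slide labels "ini" is treated as an initiator that is not part of the product.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code (assumed), codons enumerated in UCAG order; "*" = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse, so the result reads 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(sense: str) -> str:
    """The mRNA equals the sense strand with U in place of T."""
    return sense.replace("T", "U")


def translate(mrna: str, offset: int = 0) -> str:
    """Translate complete codons starting `offset` bases into the message."""
    codons = (mrna[i:i + 3] for i in range(offset, len(mrna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


antisense = "TCTGACTATTGAGCTCTCTGGCACAATGCA"  # from the slide
sense = reverse_complement(antisense)
mrna = transcribe(sense)
frame = translate(mrna, offset=1)  # frame inferred from the slide
# Keep the residues between the "ini" codon (UUG, read as L) and the stop.
protein = frame[frame.index("L") + 1:frame.index("*")]

print(sense)
print(mrna)
print(frame)
print(protein)
```

Under these assumptions the sense strand is TGCATTGTGCCAGAGAGCTCAATAGTCAGA, the chosen frame reads ALCQRAQ*S, and the extracted product is CQRAQ, matching the protein product given on the slide.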
III – Molecular biology data
   Scientific literature
       • e.g. Nature, online resources
   Sequence databases
       • e.g. EMBL, SwissProt
   Structure databases
       • e.g. PDB
   Derived data
       • e.g. CATH, ProSite
The explosion of data
 250,000 articles per year
 210Mbp per year
 Doubling time 14 months
 4Gbp by 1999

   (no current figures available, but
    EMBL 13.5Gbp, August 2001)
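
A doubling time implies exponential growth, which a few lines can make explicit. This is a rough sketch assuming smooth growth at a constant 14-month doubling time; the function name `projected_size` is hypothetical and the model is not fitted to real database statistics.

```python
def projected_size(initial_gbp: float, months_elapsed: float,
                   doubling_months: float = 14.0) -> float:
    """Size after `months_elapsed` months of growth with a fixed doubling time."""
    return initial_gbp * 2 ** (months_elapsed / doubling_months)


if __name__ == "__main__":
    # Start from the 4 Gbp figure quoted on the slide.
    for months in (0, 14, 28, 42):
        print(f"{months:>3} months: {projected_size(4.0, months):6.1f} Gbp")
```

With these assumptions, 4 Gbp grows to 8 Gbp after 14 months and 16 Gbp after 28 months.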
Growth of number of entries

Growth of number of residues

    The history of the databases
   “Biology is mere stamp-collecting”
   1951      (Sanger & Tuppy) - 30 AAs of β-chain bovine insulin
   1965      (Holley) - nucleotide sequence of a yeast alanine tRNA
   1970s     (various)- various protein sequencing methods
   1972      (Dayhoff) - "Atlas of protein sequence and structure"
   1977      (Sanger, Maxam & Gilbert) DNA sequencing
   1980s     (Brenner and various others) automated sequencing
   1980s     community databases
   1987-92   genome sequencing projects
   1992      (Venter) Expressed Sequence Tags and patents
   Present   widespread use of automated sequencers
    An explosion of databases
   Community databases              Specialised content
    –   GenBank                     –   FlyBase
    –   EMBL                        –   Rebase
    –   DDBJ                        –   PRINTS
    –   SWISS-PROT                  –   WIT (ex-PUMA)
    –   PDB                         –   LIMB

     Currently there are over 200 molecular, structural,
     genetic and phenotypic databases.
Contents of the databases
   nucleic acid sequence databases
    – linear sequences of nucleic acids
   protein sequence databases
    – linear sequences of amino acids
   structure databases
    – mostly protein 3D structures
    – increasing numbers of carbohydrates,
      nucleic acids and small molecules
    – combinations of the above
 – GENOMES:         size very variable (in kbp)
   •   SV40         5.1           1.7 µm
   •   E. coli      4,000         1,360 µm
   •   Yeast        13,500        4,600 µm
   •   Drosophila   165,000       56,000 µm
   •   Human        2,900,000     990,000 µm
 – Completed genomes include
   •   Haemophilus influenzae
   •   Mycoplasma genitalium
   •   Escherichia coli
   •   Saccharomyces cerevisiae
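
The length column for the genomes above can be checked with simple arithmetic. The sketch below assumes a rise of 0.34 nm per base pair (the usual figure for B-form DNA, not stated on the slide) and reads the Drosophila size as 165,000 kbp. The lengths come out in micrometres, for example about 1.7 µm for SV40.

```python
RISE_PER_BP_NM = 0.34  # assumed rise per base pair of B-form DNA, in nanometres

# Genome sizes in kbp, taken from the slide (Drosophila read as 165,000).
GENOME_SIZES_KBP = {
    "SV40": 5.1,
    "E. coli": 4_000,
    "Yeast": 13_500,
    "Drosophila": 165_000,
    "Human": 2_900_000,
}


def contour_length_um(size_kbp: float) -> float:
    """Approximate end-to-end length of the DNA in micrometres."""
    base_pairs = size_kbp * 1_000
    return base_pairs * RISE_PER_BP_NM / 1_000  # convert nm to µm


if __name__ == "__main__":
    for organism, size in GENOME_SIZES_KBP.items():
        length = contour_length_um(size)
        print(f"{organism:<12}{size:>14,.1f} kbp{length:>14,.1f} µm")
```

On this estimate the human genome is close to one metre of DNA, which agrees with the rounded figure in the table.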
  The HGP and HUGO
     Goals:
            • identify all the estimated 80,000-100,000*
              genes in human DNA
            • determine the sequences of the 3 billion
              chemical bases that make up human DNA
            • store this information in databases
            • develop tools for data analysis
            • address the ethical, legal, and social issues
              (ELSI) that may arise from the project
HGP sequencing progress:
   Working draft sequence:
      • Goal: 90% by summer 2000
      • Completed June 2000
   Finished, high quality sequence:
      • Goal: 100% by 2003
      • 47% (1,660,078,000 bases)
Information management
 data representation
 data formats
 annotations
 curation
 distribution and updates
 access - data exchange
 integration of biological databases
Data representation
In order to achieve uniformity of presentation of the information in
the databases, how this information is stored is very precisely
specified.

Example from the EBI EMBL Nucleotide sequence database
manual:

“The nucleotide sequence data are generally present in the
database as they have been published, subject to some
conventions which have been adopted for the database as a whole.
The sequences are always listed in the direction 5' to 3', regardless
of the published order. Bases are numbered sequentially beginning
with 1 at the 5' end of the sequence.”
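
The numbering convention matters when positions from a database entry are used in code, because most programming languages count from 0. A minimal sketch, with a hypothetical helper name, of converting a 1-based inclusive range into a Python slice:

```python
def subsequence(sequence: str, start: int, end: int) -> str:
    """Return bases start..end, numbered from 1 at the 5' end, inclusive."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("positions must satisfy 1 <= start <= end <= length")
    # Python slices are 0-based and exclude the end index.
    return sequence[start - 1:end]


if __name__ == "__main__":
    print(subsequence("ACGTACGT", 1, 3))
```

Here positions 1 to 3 of ACGTACGT give ACG.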

Data formats
   “flat file” versus records
    – platform independent
    – OS & NOS independent
 text format ASCII
 indexed and linked

Annotations
 4Gbp of sequence  4Gbp database
 annotations, including
    – descriptors
    – organelle & organism
    – enzyme classification
    – authors etc.
    – references
    – corrections
Curation
   Community databases
    – now automatic
      • overnight updates
      • re-indexing
      • submission by forms
   Specialised databases
    – largely “hand crafted”
    – usually rely on 1 or 2 expert curators
Distribution and updates
   magnetic tape
    – now little used
   compact disk
    – Asia, E. Europe
   EMail servers
    – access to more and recent data
   WWW and via CGI
Data exchange
   must conform to a standard format
    – storage and exchange
    – transmission
    – Multipurpose Internet Mail Extensions

   all mediated by a tagged file format

   easy to locate information e.g. EMail
    – DE (DEscriptor)
    – RA (Reference Author)
   flexible
    – allows restructuring (additional tag)
    – unrecognised tags ignored
   the meaning, ordering, and number of tags
    are very precisely specified
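
A reader for a tagged format can be very small, which is part of the appeal. The sketch below assumes each line starts with a two-letter tag followed by spaces and shows how unrecognised tags are simply skipped. The sample entry and the `KNOWN_TAGS` subset are made up for illustration and are not a real database record.

```python
from collections import defaultdict

KNOWN_TAGS = {"ID", "AC", "DE", "KW", "OS", "RA"}  # illustrative subset


def parse_entry(text: str) -> dict[str, list[str]]:
    """Collect the content of each recognised tag; ignore everything else."""
    fields: dict[str, list[str]] = defaultdict(list)
    for line in text.splitlines():
        tag, content = line[:2], line[2:].strip()
        if tag in KNOWN_TAGS:
            fields[tag].append(content)
    return dict(fields)


# A made-up entry; "ZZ" stands for a tag this reader does not know.
SAMPLE = """\
ID   EXAMPLE1
DE   A made-up description line
DE   continued on a second line
ZZ   a tag added by a newer version of the format
RA   Author A., Author B.
//
"""

if __name__ == "__main__":
    for tag, values in parse_entry(SAMPLE).items():
        print(tag, values)
```

Because the reader only looks at tags it knows, a new tag can be added to the format without breaking older software.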
Tags - examples (i)
 ID - identification            (begins each entry; 1 per entry)
 AC - accession number          (>=1 per entry)
 DT - date                      (2 per entry)
 DE - description               (>=1 per entry)
 KW - keyword                   (>=1 per entry)
 OS - organism species          (>=1 per entry)
 OC - organism classification   (>=1 per entry)
 OG - organelle                 (0 or 1 per entry)
 RN - reference number          (>=1 per entry)
 RC - reference comment         (>=0 per entry)
 RP - reference positions       (>=1 per entry)

Tags - examples (ii)
 RX - reference cross-reference (>=0 per entry)
 RA - reference author(s)       (>=1 per entry)
 RT - reference title           (>=1 per entry)
 RL - reference location        (>=1 per entry)
 DR - database cross-reference  (>=0 per entry)
 FH - feature table header      (0 or 2 per entry)
 FT - feature table data        (>=0 per entry)
 CC - comments or notes         (>=0 per entry)
 XX - spacer line               (many per entry)
 SQ - sequence header           (1 per entry)
 bb - (blanks) sequence data    (>=1 per entry)
  // - termination line         (ends each entry; 1 per entry)
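
Because the number of lines per tag is specified, an entry can be checked mechanically. The following sketch splits a flat file on the `//` termination line and tests a few of the counts listed above. The `RULES` table covers only a subset of tags and the function names are hypothetical.

```python
from collections import Counter

# (minimum, maximum) occurrences per entry; None means no upper limit.
# Illustrative subset of the tag list above.
RULES = {
    "ID": (1, 1),
    "AC": (1, None),
    "DT": (2, 2),
    "DE": (1, None),
    "OG": (0, 1),
    "SQ": (1, 1),
}


def split_entries(text: str) -> list[list[str]]:
    """Split a flat file into entries, each ended by a '//' line."""
    entries, current = [], []
    for line in text.splitlines():
        if line.startswith("//"):
            entries.append(current)
            current = []
        elif line.strip():
            current.append(line)
    return entries


def check_entry(lines: list[str]) -> list[str]:
    """Return one message for every tag whose count breaks RULES."""
    counts = Counter(line[:2] for line in lines)
    problems = []
    for tag, (low, high) in RULES.items():
        found = counts[tag]
        if found < low or (high is not None and found > high):
            problems.append(f"{tag}: found {found}")
    return problems
```

An empty list from `check_entry` means the entry satisfies every rule in the table; anything else names the offending tags.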
IV - Integration of databases
 data is more meaningful when
  viewed in context
 integration achieved by links
      • a link maps an entity in one database to an
        entity in another database
   goal:
    – To make a collection of heterogeneous
      data sources appear to the user as a
      large, integrated and coherent entity.
Linking biological databases
   Links can connect information:
    – between two databases
    – about a protein between several databases
    – between a gene, the enzyme it codes for,
      its catalytic activity and the position of the
      reaction in a metabolic pathway and
      pharmacological data on inhibitors
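
The idea of a link as a mapping from an entity in one database to an entity in another can be sketched with a dictionary. All database names and identifiers below are invented for illustration, and `follow_links` is a hypothetical helper rather than part of any real integration system.

```python
# Each link maps (database, identifier) to (database, identifier).
# All names and identifiers are made up for illustration.
LINKS = {
    ("genes", "gene:001"): ("enzymes", "enzyme:A"),
    ("enzymes", "enzyme:A"): ("reactions", "reaction:R1"),
    ("reactions", "reaction:R1"): ("pathways", "pathway:P1"),
}


def follow_links(start: tuple[str, str]) -> list[tuple[str, str]]:
    """Walk from one entity through successive links, stopping on a cycle."""
    path, seen = [start], {start}
    current = start
    while current in LINKS:
        current = LINKS[current]
        if current in seen:
            break
        seen.add(current)
        path.append(current)
    return path


if __name__ == "__main__":
    for database, identifier in follow_links(("genes", "gene:001")):
        print(f"{database}: {identifier}")
```

Starting from a gene, the walk visits the enzyme, the reaction and the pathway in turn, which is the chain of links described on the slide.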

Linking facilitates
 discovery by database analysis
 data trawling and mining
 serendipitous discovery
 generalisation
 data validation and consistency

   starting from:
   For either human or bovine insulin,
    obtain the following:
    – nucleic acid sequence (DNA or RNA)
    – protein sequence
    – 3D co-ordinates

