Knowledge Discovery in Biological Databases



 David Gilbert and Aik Choon Tan
     {drg,actan}@brc.dcs.gla.ac.uk
          www.brc.dcs.gla.ac.uk
     Bioinformatics Research Centre
          University of Glasgow
                       Outline
•   Introduction
•   Motivation
•   Overview of KDBD
•   Machine Learning
    –   Domain representations
    –   Knowledge representations
    –   Search strategies
    –   Classification methods
    –   Evaluation & Interpretation
• Conclusion
                Bioinformatics
•Bio - Molecular Biology
•Informatics - Computer Science

•Bioinformatics - the study of the application of molecular biology,
  computer science, artificial intelligence, statistics and mathematics
  •to model, organise, understand and discover interesting knowledge
  associated with large-scale molecular biology databases,
  •to guide assays for biological experiments.
  • “Computational Biology” (USA)

Bioinformatics = Machine Learning + Data Mining + Biological Databases
               = (?Knowledge Discovery in (Biological) Databases?)
[Figure: Growth in Sequence Data]
[Figure: Growth in Structural Data (Berman et al 2002)]
Increasing complexity of Biological Data (from least to most complex):
  – Nucleotide sequences
  – Nucleotide structures
  – Gene expressions
  – Protein Structures
  – Protein functions
  – Protein-protein interaction (pathways)
  – Cell
  – Cell signalling
  – Tissues
  – Organs
  – Physiology
  – Organisms
         Computational bottlenecks
Caused by
• Data characteristics
  –   Lots of it
  –   heterogeneous
  –   distributed
  –   incomplete
  –   dirty
• (Traditional) complexity issues: time, space
• Induction: constructing
  discriminatory/descriptive functions from large
  data sets
         Computational bottlenecks
• Data representation
  –   sequences (DNA, RNA, amino-acid)
  –   trees (phylogenetic,…)
  –   graphs (protein structure, biochemical networks)
  –   matrices (micro-arrays, metabolic pathways)
Molecular biology overview
Biological activity: interaction!
Knowledge Discovery
  Biology – A Classification Problem
• Biology - The division of physical science which
  deals with organised beings or animals and plants,
  their morphology, physiology, origin, and
  distribution                         (OED)
 • Analysis via classification - steps:
    –Organise examples into families
    –Find common descriptions to characterise the family members
    –Look for more members in the Universe
    –If a new instance matches the characteristics of a family, attribute the
    family's properties to the new instance and add it as a member
Comparative genomics
One aspect: making inferences (Eisenberg et al, 2000)
Data Explosion: Genome, Proteome

S1     ggggctacgg ggggtggggc ttcgcgcccc gccggcctat aaaagcggcc gccgcggctc cgtgccgttg ccgaccttcg cctgcgccgc
S2     tgctgcttcgcgcccgtcgc ctccgccatg gctcccagga agttcttcgt gggtggcaac tggaagatga acggcgacaa gaagagcttg
S3     ggcgagctca tccacacgct gaatggcgcc aagctctcgg ccgacaccga ggtggtttgc ggagcccctt caatctacct tgattttgcc


                                                          acgt
S4     cgccagaagc ttgatgcaaa gattggagtt gcagcacaaa actgttacaa ggtaccgaag ggtgctttca caggagagat cagcccagca
S5     atgatcaaag atattggagc tgcatgggtg atcctgggcc actcagagcg gaggcatgtttttggagagt ctgatgagtt gattgggcag
S6     aaggtggctc atgctcttgc tgaaggcctc ggtgtcatcg cctgcattgg ggagaagctg gatgagagag aagctggcat aacggagaag
S7     gtggtctttg aacagaccaa agctattgct gataacgtga aggactggag taaggtggtt cttgcctatg agccagtttg ggctatcgga
S8     actggtaaaa ctgctactcc ccaacaggct caggaggttc atgagaagct gagaggctgg ctcaaaagcc acgtgtctga tgctgttgct
S9     cagtcaacta ggacgtcta tggaggttca gtcactggtg gcaactgtaa ggaactggcc tcccagcatg atgtggatgg cttccttgtt
S10    ggtgggacgt ctctcaagcc agagtttgtg gatattatca atgcaaaaca ttaaagcagc ctgtgaggag cagtccctta cggttaagag
S11    caagaaactg aagcaagaag ggaccttgtg ttgcacgtct ctcggtacag aggcttcttc tgaggctttc ccccaccacc acaattattg ttctagctgt
S12    gctgctaacc cccaccacct tgttggagtc ccattagtgt gagcccatct cagcagagtc tcctttctga actggcaaaatccttggtta tctgttgagc
Data, information, knowledge …
  • data : nucleotide sequence

  • information : where are the “genes”.
    [Figure: gene layout along the sequence: control statement, TATA box, start, gene, termination (stop)]
    Found using a classifier, pattern or rule which has been mined/discovered

  • knowledge : facts and rules
    If a gene X has a weak psi-blast assignment to a function F
      –and that gene is in an expression cluster
      –and sufficient members of that cluster are known to have function F,
      ⇒ then believe assignment of F to X.
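The rule above can be sketched as a small predicate. This is only an illustration: the function name, the data layout and the `min_support` threshold (standing in for "sufficient members") are assumptions, not something the slides specify.

```python
def believe_assignment(has_weak_hit, cluster_member_functions, function, min_support=3):
    """Return True if the rule accepts assigning `function` to gene X.

    has_weak_hit: X has a weak psi-blast assignment to `function`.
    cluster_member_functions: known functions of the other genes in X's
        expression cluster (None for unannotated members), or None if X
        is not in any expression cluster.
    min_support: hypothetical threshold standing in for "sufficient members".
    """
    if not has_weak_hit or cluster_member_functions is None:
        return False
    support = sum(1 for f in cluster_member_functions if f == function)
    return support >= min_support


# Made-up annotations for the other members of X's cluster.
cluster = ["kinase", "kinase", None, "kinase", "transporter"]
print(believe_assignment(True, cluster, "kinase"))       # True
print(believe_assignment(True, cluster, "transporter"))  # False
```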
Data, Information, Pattern, Knowledge
DATA
  APRKFFVGGN WKMNGKRKSL GELIHTLDGA
  KLSADTEVVC GAPSIYLDFARQKLDAKIGV
  AAQNCYKVPK GAFTGEISPA MIKDIGAAWV
  ILGHSERRHVFGESDELIGQ KVAHALAEGL
  GVIACIGEKLDEREAGITEKVVFQETKAIADNVK
  DWSKVVLAYEPVWAIGTGKTATPQQAQEVHE
  KLRGWLKTHVSDAVAVQSRIIYGGSVTGGNCK
  ELA SQHDVDGFLV GGASLKPEFV DIINAKH

INFORMATION
  Molecular Weight = 26528
  Number of Residues = 247
  Number of Alpha = 11
  Number of Beta = 8
  Content of Alpha = 43.32
  Content of Beta = 17.00

PATTERN
  [AV]-Y-E-P-[LIVM]-W-[SA]-I-G-T-[GK]

KNOWLEDGE
  The DNA sequence encodes an alpha-beta protein with a barrel architecture.
  The structure of the protein is a TIM-barrel.
                          An abstract view
• Given
  {p:9, p:1, q:8, p:3, q:2, q:6, p:5, q:4, p:7, q:0}

• Cluster:
  {p:9, p:1, p:3, p:5, p:7} {q:8, q:2, q:6, q:4, q:0}

• Background knowledge:         > + -

• Induce:
  0 is q
  X is q if X-2 is q and X > 0

• X is p if not(X is q)
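As a quick check, the induced rules can be written as a recursive predicate and run against the ten given examples. This is a minimal sketch; the function names are chosen here for illustration.

```python
def is_q(x: int) -> bool:
    """Induced rule: 0 is q; X is q if X-2 is q and X > 0."""
    if x == 0:
        return True
    if x > 0:
        return is_q(x - 2)
    return False


def classify(x: int) -> str:
    """X is p if not (X is q)."""
    return "q" if is_q(x) else "p"


given = ["p:9", "p:1", "q:8", "p:3", "q:2", "q:6", "p:5", "q:4", "p:7", "q:0"]
for item in given:
    label, value = item.split(":")
    # Every given example should be reproduced by the induced rules.
    assert classify(int(value)) == label
print("all", len(given), "examples classified correctly")
```

The rules use only the background operations listed (comparison and subtraction), and in effect they rediscover "q = even, p = odd" from the clustered examples.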
What is a pattern?
                     Types of Pattern
• Deterministic
   – is a boolean function which either matches a given object (i.e.
     sequence, structure) or not
   – e.g. regular expression for a sequence pattern: R-x-Y-[ST]
• Probabilistic
   – assigns to each sequence the probability that it was generated by the
     model. The higher the probability, the better the match between the
     sequence and the pattern
   – e.g. profile for a sequence pattern:

            1    2    3    4    5    6    7    8    9    10
      S1:   R    V    Q    R    A    Y    S    Y    V    N
      S2:   P    L    M    R    A    Y    S    I    A    S
      S3:   L    V    I    R    P    Y    T    P    V    S
      S4:   L    C    M    R    A    Y    T    P    T    S
      S5:   E    K    L    R    L    Y    S    I    A    S

     Observed residue frequencies at each position:
       1: R=.2 P=.2 L=.4 E=.2
       2: V=.4 L=.2 C=.2 K=.2
       3: Q=.2 M=.4 I=.2 L=.2
       4: R=1
       5: A=.6 P=.2 L=.2
       6: Y=1
       7: S=.6 T=.4
       8: Y=.2 I=.4 P=.4
       9: V=.4 A=.4 T=.2
      10: N=.2 S=.8
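A minimal sketch of how such a profile could be built and used follows. It counts residue frequencies per column of the five aligned sequences S1 to S5, then scores a sequence as the product of its per-position frequencies. The function names are illustrative, and real profile methods usually add pseudocounts and log-odds scoring, which this sketch omits.

```python
from collections import Counter

# The five aligned sequences S1..S5 from the example.
alignment = ["RVQRAYSYVN", "PLMRAYSIAS", "LVIRPYTPVS", "LCMRAYTPTS", "EKLRLYSIAS"]


def build_profile(seqs):
    """One dict per column mapping residue -> observed frequency."""
    n = len(seqs)
    return [{aa: count / n for aa, count in Counter(column).items()}
            for column in zip(*seqs)]


def score(profile, seq):
    """Probability of seq under the profile (0.0 for any unseen residue)."""
    p = 1.0
    for freqs, aa in zip(profile, seq):
        p *= freqs.get(aa, 0.0)
    return p


profile = build_profile(alignment)
print(profile[3])  # {'R': 1.0}
print(score(profile, "LVMRAYSIAS") > score(profile, "EKLRLYSIAS"))  # True
```

A sequence built from the most frequent residue at each position scores higher than S5, which uses several rare residues. Any sequence with a residue never observed in a column scores zero, which is why practical profiles smooth the counts.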
Motifs
Motif : a pattern associated with some biological meaning (e.g. function)

FAD binding site (PDB entry:chain, sequence segment):
1FDR:_      RVQRAYSYVNSP
1A8P:_      PLMRAYSIASPN
1NDH:_      LVIRPYTPVSSD
1CNF:_      LCMRAYTPTSMV
1B2R:A      EKLRLYSIASTR
1AMO:A      LQARYYSIASSS

Sequence pattern: RxY[ST]
Structural pattern: [Figure: FAD binding site with the FAD ligand]
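The deterministic sequence pattern RxY[ST] maps directly onto a regular expression, with `x` becoming the wildcard `.`. The sketch below checks it against the six FAD binding-site segments; the dictionary layout is just for illustration.

```python
import re

# R-x-Y-[ST]: R, any residue, Y, then S or T.
motif = re.compile(r"R.Y[ST]")

segments = {
    "1FDR:_": "RVQRAYSYVNSP",
    "1A8P:_": "PLMRAYSIASPN",
    "1NDH:_": "LVIRPYTPVSSD",
    "1CNF:_": "LCMRAYTPTSMV",
    "1B2R:A": "EKLRLYSIASTR",
    "1AMO:A": "LQARYYSIASSS",
}

for name, seq in segments.items():
    match = motif.search(seq)
    # A deterministic pattern gives a yes/no answer (plus a position here).
    print(name, match.group(), match.start())
```

Every segment matches at the same offset (index 3), which is what makes the pattern a useful signature for this site.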
KDD in BIOINFORMATICS
Raw Data → (Selection) → Target Data → (Pre-Processing) → Pre-Processed Data
→ (Transformation) → Transformed Data → (Machine Learning) → Pattern
→ (Interpretation/Evaluation) → KNOWLEDGE!!

Pre-Processed Data (unaligned sequences):
  S1:ACAATG
  S2:TCAACTATC
  S3:ACACAGC
  S4:AGAATC
  S5:ACCGATC

Transformed Data (aligned sequences):
  S1:ACA---ATG
  S2:TCAACTATC
  S3:ACAC--AGC
  S4:AGA---ATC
  S5:ACCG--ATC
       Characteristics of KDD
• Validity
• High-level Patterns/Languages -
  understandable by humans
• Accuracy - measures of certainty
  (probability)
• Interesting Results - novel, useful and
  nontrivial to compute
• Efficiency - running times for large-sized
  databases are predictable and acceptable
                                (Frawley et al. 1992)
                Data preparation
• Select and identify target database
• Extract target data set
• Transform the target data set into the input format
  of the learning algorithm
• Divide the target data set into groups (training, test
  sets)
• Takes most of KDD process time
• Issues:
   – Dealing with noisy data and missing attributes
   – Filtering target data set (e.g. statistical analysis for gene
     expression before performing clustering)
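The "divide into training and test sets" step can be sketched as a shuffled split. The function name, the 25% test fraction and the fixed seed are assumptions made for the example, not choices stated in the slides.

```python
import random


def split_train_test(examples, test_fraction=0.25, seed=0):
    """Shuffle the examples reproducibly and split them into (train, test)."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_test = int(round(len(items) * test_fraction))
    return items[n_test:], items[:n_test]


# Hypothetical labelled examples: (sequence, class label).
data = [("ACAATG", "+"), ("TCAACTATC", "+"), ("ACACAGC", "-"), ("AGAATC", "-"),
        ("ACCGATC", "+"), ("GGGTTT", "-"), ("ACGATC", "+"), ("TTTGGG", "-")]
train, test = split_train_test(data)
print(len(train), len(test))  # 6 2
```

Keeping the test set untouched until evaluation is what makes the later performance measurements meaningful.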
          Machine learning tasks
    (in bioinformatics as elsewhere…)
• Classification: predicting the class of an item
• Clustering: finding groups of items
• Characterisation: describing a group
• Deviation Detection: finding changes
• Linkage Analysis: finding relationships &
  associations
• Visualisation: presenting data visually to facilitate
  knowledge discovery by humans (human in the
  loop)
        Learning Approaches

• Unsupervised approach – given unlabelled
  examples, group together the examples
  with similar properties

• Supervised approach – given a set of positive
  and negative examples with predefined
  classes, construct classifiers that
  distinguish between the classes
                   Issues

•   Domain representation
•   Knowledge representation
•   Search strategy
•   Classification method
   Learning in bioinformatics context
• Automatically find pattern (given a training set)

• Characterisation: (positive examples only) patterns
  describing “interesting” properties of a family

• Classification: (positive and negative examples) patterns
  distinguishing S+ from S-, which may overlap

 • Formal language for descriptions (domain representation)
 • Scoring function to rate descriptions (knowledge representation)
 • Algorithm (search strategy and classification methods)
     Protein comparison & motif discovery
Applications: structure comparison, structure prediction, function
prediction, structure classification

[Figure: data flow between Structure Database, Structure Description,
Patterns and Structure Motif Database, with the operations Extract features,
Discover / Compare, and Match]

(Eidhammer, Jonassen & Taylor, “Structure Comparison and Structure
Patterns”, JCB, 7:5 pp 685-716, 2000.)
                          Steps
• Pattern matching: input is one pattern and one structure; output is
  “yes”/“no” (deterministic pattern) or a score (probabilistic
  pattern).
• Pattern discovery: find patterns matching some/all of the
  input structures (choose patterns whose fitness value with respect
  to the input structures is as high as possible)
• Comparison: input (pair of) structure descriptions, find
  (local/global) similarities, optimise similarity measure,
  output score.
   – Similarity may be represented as a pattern
  Pattern discovery in biosequences
• Group together sequences thought to have common
  biological (structural, functional) properties
   – families (biological - semantic level)
• Study their common syntactic properties ignoring
  biological (semantic) properties
   – patterns, clusters (mathematical - syntactic level)
• Test whether the discovered patterns make sense (back
  to semantic level)
     Approaches to pattern discovery
    • Pattern driven:
      enumerate all (or some) patterns up to
      certain complexity (length), for each
      calculate the score, and report the best
    • Sequence driven:
      look for patterns by aligning the given
      sequences
Brazma et al, Approaches to the automatic discovery of patterns in
biosequences, Journal of Computational Biology, 5(2):277-303, 1998
    Pattern driven algorithms
• Brute force - enumerate all patterns (for
  instance, all substrings) up to a given length
  (complexity)
• Evaluate their fitness with respect to the input
  sequences and output the best
• Unrealistic for patterns of even modest size, even
  for substring patterns (e.g., for substring patterns of length
  10 over the amino acid alphabet, there are 20^10, i.e. more than 10^13,
  different substrings to enumerate in this way)
• E.g. PRATT program (Jonassen, U.Bergen, via www.ebi.ac.uk)
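A brute-force, pattern-driven search can be sketched as follows. The fitness used here is simply the number of input sequences that contain the substring, which is an assumption for the example; it is not a description of how PRATT works. The input sequences are the five short DNA strings from the KDD overview.

```python
from itertools import product


def best_substring_patterns(sequences, length, alphabet):
    """Enumerate every substring pattern of the given length and keep the
    ones contained in the largest number of sequences."""
    best_support, best_patterns = 0, []
    for chars in product(alphabet, repeat=length):
        pattern = "".join(chars)
        support = sum(pattern in seq for seq in sequences)
        if support > best_support:
            best_support, best_patterns = support, [pattern]
        elif support == best_support and support > 0:
            best_patterns.append(pattern)
    return best_support, best_patterns


sequences = ["ACAATG", "TCAACTATC", "ACACAGC", "AGAATC", "ACCGATC"]
print(best_substring_patterns(sequences, 2, "ACGT"))  # (4, ['AC', 'AT'])

# Size of the search space for length-10 patterns over 20 amino acids.
print(20 ** 10)  # 10240000000000
```

With a 4-letter alphabet and length 2 there are only 16 candidates; the same loop over amino-acid substrings of length 10 would need about 10^13 iterations, which is the point of the "unrealistic" remark.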
Sequence driven algorithms
• Group similar sequences together (e.g., in
  pairs);
• For each group find a common pattern (e.g.,
  by dynamic programming);
• Group similar patterns together and repeat
  the previous step until there is only one
  group left
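Sequence driven approach
[Figure: binary merge tree in which sequences s1 to s5 are grouped step by
step into intermediate patterns p1, p2 and p3 and a final pattern p4]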
Characteristic string function for family F+
  function g : Σ* → {FALSE, TRUE}
  g(s) = TRUE  if s ∈ F+
         FALSE if s ∈ F-
[Figure: F+ and F- shown as regions of Σ*]
Classification & characterisation Problems
• Classification: + and - examples
• Characterisation: + examples
[Figure: 2 x 2 grid (clean vs noisy training data, classification vs
characterisation) showing the sample sets S+ and S- relative to the
families F+ and F- within Σ*]
 (Some) Performance Measurements
Sensitivity: $Sn = \frac{TP}{TP + FN}$, with $0 \le Sn \le 1$
Specificity: $Sp = \frac{TN}{TN + FP}$, with $0 \le Sp \le 1$
Positive Predictive Value: $PPV = \frac{TP}{TP + FP}$, with $0 \le PPV \le 1$
Correlation Coefficient:
  $cc = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(FP + TN)(TN + FN)(FN + TP)}}$, with $-1 \le cc \le 1$
  cc =  1.0: no FP or FN
  cc =  0.0: when f is random with respect to S+ and S-
  cc = -1.0: only FP and FN
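These measures are straightforward to compute from the four confusion-matrix counts. The sketch below uses the square-root-normalised (Matthews) form of the correlation coefficient, which is what keeps cc within [-1, 1]. The function name and the convention of returning 0.0 when a denominator is zero are assumptions for the example.

```python
import math


def performance(tp, tn, fp, fn):
    """Return (Sn, Sp, PPV, cc) from confusion-matrix counts."""
    def ratio(num, den):
        # Assumed convention: report 0.0 when the measure is undefined.
        return num / den if den else 0.0

    sn = ratio(tp, tp + fn)    # sensitivity
    sp = ratio(tn, tn + fp)    # specificity
    ppv = ratio(tp, tp + fp)   # positive predictive value
    denom = math.sqrt((tp + fp) * (fp + tn) * (tn + fn) * (fn + tp))
    cc = ratio(tp * tn - fp * fn, denom)
    return sn, sp, ppv, cc


print(performance(50, 50, 0, 0))  # (1.0, 1.0, 1.0, 1.0)
print(performance(0, 0, 5, 5))    # (0.0, 0.0, 0.0, -1.0)
```

The two calls reproduce the extreme cases listed above: no FP or FN gives cc = 1.0, and only FP and FN gives cc = -1.0.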
          Knowledge Representation
“If the predictive accuracies of two hypotheses are statistically
   equivalent then the hypothesis with better explanatory
   power will be preferred.
Otherwise the one with higher accuracy will be preferred.”
              (Muggleton et al., 1998)
           Input → Learner → Classifier
                                  •High accuracy
                                  •High explanatory power
     Biological Sequences

-nucleotide sequences
-protein sequences
Domain representation –Example 1

                                         xxx
                                     V         x
                                 x                 x
                                 x                 x
                                 x                 x
                                     C         H
                                  x \ / x
                                x    Zn x
                                  x / \ x
         Zn                        C    H
                              xxxx        xxxxxx


C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
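A PROSITE-style pattern such as this one can be translated mechanically into a regular expression. The converter below is a minimal sketch that handles only the elements used here (single residues, `x`, bracketed alternatives, and `(n)` or `(n,m)` repeat counts); other PROSITE syntax such as exclusions or anchors is not covered. The test sequence is made up purely to exercise the pattern.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into Python regex syntax."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        symbol, low, high = m.groups()
        symbol = "." if symbol == "x" else symbol
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(symbol + repeat)
    return "".join(parts)


regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(regex)  # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H

# Made-up sequence: C, 2 spacers, C, 3 spacers, L, 8 spacers, H, 3 spacers, H.
synthetic = "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H"
print(bool(re.search(regex, synthetic)))  # True
```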
                  Edit distance
• Levenshtein 1966
• Minimum number of edit operations to
  transform 1 string into another
  – insert, delete, substitute (1 symbol)
• Score is zero (identical) or positive
• E.g. “AIMS” & “AMOS”: two optimal solutions (score = 2 for each)

The possibilities?
AIM-S        (one deletion and one insertion)
| | |
A-MOS
              Which is better?
AIMS         (two substitutions)
|  |
AMOS
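The edit distance itself can be computed with the standard dynamic-programming recurrence, assuming unit cost for each insert, delete and substitute operation. This sketch returns only the score, not the alignments.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-symbol inserts, deletes and substitutions
    needed to transform string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        previous = current
    return previous[-1]


print(levenshtein("AIMS", "AMOS"))  # 2
print(levenshtein("AIMS", "AIMS"))  # 0
```

Both candidate alignments of AIMS and AMOS cost 2, so the score alone cannot say which is better; choosing between them needs a scoring scheme that weights gaps and substitutions differently.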
                       Multiple alignments
        • Analyse gene families
            – reveal (subtle) conserved family characteristics

Rows are sequences, columns are characters (alignment positions):

             1    2    3    4    5    6    7    8    9    10
S1           Y    D    G    G    A    V    -    E    A    L
S2           Y    D    G    G    -    -    -    E    A    L
S3           F    E    G    G    I    L    V    E    A    L
S4           F    D    -    G    I    L    V    Q    A    V
S5           Y    E    G    G    A    V    V    Q    A    L

consensus    y    d    G    G    AI   VL   V    e    A    l
         Multiple alignment - methods
• Simultaneous: N-wise alignment (adapted from pairwise approach)
   – uses an N-dimensional matrix.
   – Complexity is
      • O(m1·m2) [2 sequences of length m1 & m2]
      • O(m^n) [n sequences of length m]
   – Thus only good for short sequences.
• Manual (!)
• Progressive (heuristic) e.g. ClustalW:
   – compute pairwise sequence identities
   – construct binary tree (can output phylogenetic tree)
   – align similar sequences in pairs, add distantly related ones later.
   [Figure: guide tree joining sequences s1 to s5 through intermediate
   alignments a1 to a4]
     Multiple sequence alignment (globins)
CLUSTAL W (1.81) multiple sequence alignment


Human     VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV   60
Gorilla   VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV   60
Rabbit    VHLSSEEKSAVTALWGKVNVEEVGGEALGRLLVVYPWTQRFFESFGDLSSANAVMNNPKV   60
Pig       VHLSAEEKEAVLGLWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSNADAVMGNPKV   60
          ***:.***.** .*******:****************************..:***.****

Human     KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK   120
Gorilla   KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFKLLGNVLVCVLAHHFGK   120
Rabbit    KAHGKKVLAAFSEGLSHLDNLKGTFAKLSELHCDKLHVDPENFRLLGNVLVIVLSHHFGK   120
Pig       KAHGKKVLQSFSDGLKHLDNLKGTFAKLSELHCDQLHVDPENFRLLGNVIVVVLARRLGH   120
          ******** :**:** **********.*******:********:*****:* **::::*:

Human     EFTPPVQAAYQKVVAGVANALAHKYH   146
Gorilla   EFTPPVQAAYQKVVAGVANALAHKYH   146
Rabbit    EFTPQVQAAYQKVVAGVANALAHKYH   146
Pig       DFNPNVQAAFQKVVAGVANALAHKYH   146
          :*.* ****:****************
sequence alignments
& phylogenetic trees

Pair             Score
Human-Gorilla    99
Human-Rabbit     90
Gorilla-Rabbit   89
Human-Pig        84
Gorilla-Pig      84
Rabbit-Pig       83


((Human:0.00000,
Gorilla:0.00685)
:0.04110,
Rabbit:0.05479,
Pig:0.10959);
     What can we do with multiple alignments?
• Create (databases of) profiles derived from multiple
  alignments for protein families
   – profile = multiple alignment + observed character
     frequencies at each position
• Search with a sequence against a database of profiles
  (e.g. PROSITE database)
   – faster than sequence against sequence
   – gives a more general result (“the input sequence matches
     globin profile”)
• Search with a profile against a database of sequences
   – PSI-BLAST : can identify more distant relationships
     than by normal BLAST search
PSI-BLAST (position specific iterated BLAST)
[Figure: iterative cycle with the boxes Single protein sequence, Search
database (BLAST), Estimate statistical significance of local alignments,
Multiple alignment, and Profile; the cycle is iterated until convergence]
Protein structure
     Protein structure - levels
   PRIMARY STRUCTURE (amino acid sequence)

   VHLTPEEKSAVTALWGKVNVD
   EVGGEALGRLLVVYPWTQRFF
   ESFGDLSTPDAVMGNPKVKAH
   GKKVLGAFSDGLAHLDNLKGTF
   ATLSELHCDKLHVDPENFRLLG
   NVLVCVLAHHFGKEFTPPVQAA
   YQKVVAGVANALAHKYH

   SECONDARY STRUCTURE (helices, strands)
   TERTIARY STRUCTURE (fold)
   QUATERNARY STRUCTURE
              TOPS
Simplified descriptions of protein 3D
 structures and their use in searching
   and structural pattern recognition
Domain representation –Example 2
         (TOPS Approach)
TOPS Example – 2bopA0
[Figure: TOPS diagram of 2bopA0 with labels for chirality, H-bond, α-helix,
β-strand and loop]
What is a pattern?
• Several examples, with common parts highlighted
• A common description
[Figure: Plait motif, with the number of insert SSEs allowed at each
insertion point given as (0,N)]
                  Pattern matching
[Figure: Plait motif, with (0,N) inserts, matched against 2bop (2bopA0)]
Correspondences (alternative matches):
  1,2,4,6,7,8
  1,3,4,6,7,8
Discovering common patterns
and making multiple alignments
Compression: send the pattern once, and then for each domain, send the
uncovered parts
[Figure: pattern P located within Domain 1, Domain 2 and Domain 3]
               Topological description
• Consider sequence of SSEs (strand, helices), plus spatial
  adjacency within fold & approximate orientation
• Neglect details (lengths & structures of loops, exact lengths &
  spatial orientations of SSEs, sequence information...)
√ simplicity
   – implement very fast comparison algorithms, machine learning, ...
   – detect distant structural relationships

X simplicity
   – relate structures topologically which may have no meaningful biological
     relationship.
                 Enhanced TOPS
[Figure: TOPS combined with Sequence Information and Biochemical Features
gives TOPS+; TOPS+ is used with Pattern-Discovery/Matching Algorithms, a
Structure Comparison Algorithm and Scoring Functions; TOPS+ based PSSM/HMM
profiles lead to Structural and Functional Assignment]
(Veeramalai, 2002)
TOPS + Sequence with Biochemical features
[Figure: TOPS diagrams annotated with functional information: a DNA
binding-site (DNA) and a ligand binding-site (Ligand)]
(Veeramalai, 2002)
Feature Extraction
[Figure: panels A, B, C and E]
(Veeramalai, 2002)
       PSSM/HMM Profiles & Scoring Function
• Structure-based sequence & function extraction for protein domains
  (S1, S2, etc., Sn)
• TOPS-based multiple sequence alignment (S1, S2, Sn)
• Profile generation
   – SAM/HMMER: HMM Profile
   – IMPALA: PSSM Profile
   – Scoring Function
Key: ligand interaction; ligand interaction in loop; ligand interacting
aa’s; seq segment of a helix; seq segment of a strand; seq segment of a loop
(Veeramalai, 2002)
Methyltransferase Superfamily
[Figure: TOPS diagrams for 1vpt00, 2admA1, 1vid00, 1xvaA2 and 1hmy01, with
the Conserved Structural Pattern and the Structural/Functional (TOPS) Pattern]
Key to TOPS:
   – Ligand binding site in α-helix residues
   – Ligand binding site in β-strand residues
   – Ligand binding site in loop-region residues
   – Ion-binding site between SSEs & loop regions
(Veeramalai, 2002)
       Comparing structures - NADP binding domains
[Figure: dihydropteridine reductase (homo sapiens, rat) and dihydrofolate
reductase (homo sapiens, E. coli)]
Dendrogram from pairwise comparisons & hierarchical clustering
Leaves, in the order shown:
   – Dihydropteridine reductase (human)
   – Dihydropteridine reductase (rat)
   – Lactate dehydrogenase (pig)
   – Lactate dehydrogenase (bacterial)
   – Malate dehydrogenase (pig)
   – Malate dehydrogenase (bacterial)
   – Quinone oxido-reductase (bacteria)
   – Alcohol dehydrogenase (human)
   – D-3-phosphoglycerate dehydrogenase (bacteria)
   – NADH peroxidase (bacteria)
   – D-glyceraldehyde-3-phosphate dehydrogenase
   – Dihydrofolate reductase (bacterial)
   – Dihydrofolate reductase (human)
   – NADH peroxidase (bacteria)
      NAD comparisons
[Figure: comparisons based on Sequence, Structure (atomic coordinates) and
Structure (topology)]
     Hierarchical Machine Learning
•Integrate various machine learning techniques
•Incorporate patterns induced from different sources
•Produce user readable hypotheses
Gene Expression
                       Gene - informatics??
[Figure: kinds of information linked to a single gene, Gene X]
   – Sequence; Genome Location; Sequence Homologs In Other Genomes;
     Phylogenetic Inferences
   – Functional Chemistry; Cofactors & Metabolites; Metabolic Profiles;
     Metabolic Map Locator; Connectors To Other Maps
   – Experimental Data
   – Expression Info: Raw Images, Numerical Values, Cluster Genes
   – Structure: Raw Data, Electron Density, Structure Annotation,
     SS Assignment
(Adapted from Gibas & Jambeck, 2001)
                Gene expression
Pre-genomics era
     One gene = One gene product = One behaviour
     [Figure: g1, p1, p2]

Post-genomics era
     Many genes = Many gene products = Many behaviours
     [Figure: g1, g2, g3; p1, p1’, p1’’; p2, p3, p4]
Microarray experiment
                Spotting the arrays
RED = Present (P) = highly expressed, detected by the detector
YELLOW = Marginal (M) = expressed, “not sure” for the detector
GREEN = Absent (A) = maybe expressed, not detected by the detector
Classification Problem (Golub et al 1999)
ALL = acute lymphoblastic leukemia (lymphoid precursors)
AML = acute myeloid leukemia (myeloid precursor)
Characterisation Problem (Stuart et al 2001)

Temporal gene expression profiles during kidney development. Data are expressed as the mean at each time for clusters of genes as defined by k-means clustering (1-5). The distribution of individual profiles is also shown for the most heterogeneous group (2, all). 13, 15, 17, 19, embryonic days; N, newborn; W, 1 week old; A, adult.
Characterisation Problem (Stuart et al 2001)

Functional associations of gene clusters. Gene clusters varied remarkably in terms of major functional classifications of component genes.
Group 1 expressed earlier in nephrogenesis was most notable for genes involved in DNA replication (D), RNA production (R), protein synthesis
(P), and morphogenesis (M), consistent with an actively proliferating tissue.
Group 2 (which peaked in midnephrogenesis) was most notable for genes of the extracellular matrix (E) as well as morphogenetic genes (M).
Group 3 (with a peak in neonatal life) was dominated by retrotransposon transcripts (RT).
Group 4 was most notable for transport (T) and energy metabolism (EN) related genes.
Group 5 genes (significantly up-regulated in the adult vs. all previous times) was more heterogeneous and included genes specifying catabolic
enzymes (C), defense and immune recognition (DE), homeostasis of the organism as a whole (H), detoxification (DT), oxidative stress (RD), and
transport (T).
                              Gene expression matrix
      Rows = gene expression profiles
      Columns = Different conditions/time points
Genes                         A1 TSu74aA1 TSu74aA2 TSu74aA2 TSu74aA3 V10_SiA3 V10_DeB1 V12-A_B1 V12-A_B2 V12-B_B2 V12-B_B3 V12-C_B3 V12-C_C1 P1-A_S
AFFX-b-ActinMur/M12481_3_st       26.1 A            29.7 A             7.7 A            13.2 A            11.4 A            43.7 A            15.1
AFFX-YEL002c/WBP1_at               1.3 A             6.2 A             2.5 A             4.7 A             2.7 A             1.3 A               7
AFFX-YEL018w/_at                   6.1 A             0.6 A             1.8 A             3.1 A             2.1 A             1.4 A             0.6
AFFX-YEL024w/RIP1_at              11.9 A             7.2 A             2.7 A            10.4 A             2.4 A             8.2 A             7.6
AFFX-YEL021w/URA3_at              11.8 A             6.6 A             2.6 A            12.4 A             6.9 A             7.6 A               6
92539_at                        2475.9 P          2091.3 P          1391.6 P          1407.9 P          1947.2 P          1572.9 P          1999.6
92540_f_at                        96.9 P            77.4 P           138.7 P           144.8 P           122.6 P           126.6 P           128.8
92541_at                         863.2 P          1920.6 P          1248.1 P          1384.9 P             268 P           352.3 P           856.4
92542_at                         702.4 P           868.3 P           558.4 P           613.1 P           631.8 P           602.1 P           548.3
92543_at                          56.7 P            56.7 P            75.5 P            61.6 P            72.5 P            76.6 P            56.2




Figure annotations: Replicates; Signal (intensity); Detection
   Clustering Gene Expression Data
• A clustering problem consists of elements & a
  characteristic vector for each element
• A measure of similarity is defined between pairs of such
  vectors
• Elements = genes
• Vector = expression level of each gene
• Goal: Partition the elements into subsets (clusters) which
  satisfy:
   – Homogeneity: elements in the same cluster are highly
     similar to each other
   – Separation: elements from different clusters have low
     similarity to each other
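
The two criteria above can be made concrete with a small sketch. It assumes Pearson correlation as the similarity measure (the slide does not name one) and scores a given partition by its average within-cluster and between-cluster similarity. The gene names, values, and the function names `pearson` and `homogeneity_and_separation` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(u, v):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return cov / (norm_u * norm_v)


def homogeneity_and_separation(profiles, labels):
    """Average similarity of within-cluster pairs and of between-cluster pairs."""
    within, between = [], []
    for g1, g2 in combinations(profiles, 2):
        s = pearson(profiles[g1], profiles[g2])
        (within if labels[g1] == labels[g2] else between).append(s)
    homogeneity = sum(within) / len(within) if within else float("nan")
    separation = sum(between) / len(between) if between else float("nan")
    return homogeneity, separation


# Hypothetical toy data: four genes measured at four time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.0, 2.0],
}
labels = {"geneA": 0, "geneB": 0, "geneC": 1, "geneD": 1}
h, s = homogeneity_and_separation(profiles, labels)
print(h, s)
```

Under this reading, a good partition has a high first value (homogeneity) and a low second value (separation); other similarity measures, such as Euclidean distance, would fit the same scheme.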
 Hierarchical clustering
Figure axes: different experimental conditions/time points; genes
k-means clustering
Genes related to ‘casein’ in
  mammary gland tissues




               Lactation
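
For reference, k-means can be sketched as follows. This is a minimal version of the standard Lloyd iteration under stated assumptions (squared Euclidean distance, random initial centroids drawn from the data, a fixed seed); it is not the implementation used to produce the figure, and the toy vectors are hypothetical.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two expression vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def k_means(vectors, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Initial centroids are k distinct data points chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # converged: no vector changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical expression profiles: two clearly separated groups of genes.
vectors = [
    [1.0, 1.1, 0.9], [0.9, 1.0, 1.2], [1.1, 0.9, 1.0],
    [5.0, 5.2, 4.9], [5.1, 4.8, 5.0], [4.9, 5.0, 5.1],
]
assignments, centroids = k_means(vectors, k=2)
print(assignments)
```

The number of clusters k must be chosen in advance, and the result can depend on the initial centroids, so in practice the algorithm is usually run several times with different seeds.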
Linking gene expression data with
   morphological information
    stage(A, pregnancy) :-
                        gene_id(A, g1),
                        gene_id(A, g2),…,
                        has_ducts(A, medium),
                        fat_pad(A, medium).




Chart labels: Molecular Function 47%; Biological Process 38%; Cellular Component 15%
Challenges of KDBD
Diagram labels: Goal; “All possible data” (in the universe); Current Data; Hypotheses
    Learning in “Dirty” Biological
              Databases

•   Experimental errors
•   Wrong interpretation by biologists
•   Human error during annotation process
•   Non-standardised techniques
•   Biased data
                Expressive Capacity
           Hypothesis for Glutathione reductase (GR) Family
class('GR',A):-
protein(A,B,C,D),
sequence(B,'GxG(x)2G(x)16-19[DE]'),
structure(C,bbasandwich),
has_seq(strand1_helix1_strand2,B,C),
function(D,oxidoreductases).


If the protein has the sequence motif
GxG(x)2G(x)16-19[DE]
in β1-α1-β2 of the 3-layer β-β-α
sandwich structure and carries
out an oxidoreductase reaction, then
it belongs to the GR family.
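
The sequence part of this hypothesis can be checked with a regular expression. The sketch below assumes a PROSITE-like reading of the motif (x = any residue, (x)n = n arbitrary residues, [DE] = Asp or Glu). It tests the sequence condition only, not the structural or functional conditions of the rule; the name `find_motif` and the toy sequence are hypothetical.

```python
import re

# Assumed reading of the slide's motif GxG(x)2G(x)16-19[DE]:
# x = any residue, (x)n = n arbitrary residues, [DE] = Asp or Glu.
GR_MOTIF = re.compile(r"G.G.{2}G.{16,19}[DE]")


def find_motif(sequence):
    """Return (start, end) of the first motif match, or None."""
    match = GR_MOTIF.search(sequence.upper())
    return match.span() if match else None


# Hypothetical sequence built to contain the motif; not a real protein.
toy = "MK" + "GAGAAG" + "A" * 16 + "D" + "LL"
print(find_motif(toy))
```

A full test of the rule would also need the secondary-structure assignment, to confirm that the match falls within β1-α1-β2, and the functional annotation.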
   Single Vs Multiple Methods
• Advantage - complement each other
• Increase expressive power - discover useful
  & understandable knowledge



• Difficult to combine - lack of coherence
Open Question?
Diagram labels: “All data”; Current data (continues to expand); Training Set; Hypotheses
                              Conclusion
ggggctacgg ggggtggggc ttcgcgcccc gccggcctat aaaagcggcc gccgcggctc cgtgccgttg
ccgaccttcg cctgcgccgc tgctgcttcg cgcccgtcgc ctccgccatg gctcccagga agttcttcgt
gggtggcaac tggaagatga acggcgacaa gaagagcttg ggcgagctca tccacacgct gaatggcgcc
aagctctcgg ccgacaccga ggtggtttgc ggagcccctt caatctacct tgattttgcc cgccagaagc
ttgatgcaaa gattggagtt gcagcacaaa actgttacaa ggtaccgaag ggtgctttca caggagagat
              Acknowledgements
• Gilleain Torrance, Mallika Veeramalai, Olivier Sand,
  Ali Al-Shahib (Bioinformatics Research Centre,
  University of Glasgow)
• David Westhead, (EBI), Ioannis Michalopoulos,
  Leeds University
• Janet Thornton, UCL, Birkbeck, EBI
• Lorenz Wernisch, Birkbeck
• Juris Viksna, University of Latvia
• Inge Jonassen, Ingvar Eidhammer, U.Bergen
• Alvis Brazma, EBI
    Bioinformatics Research Centre
• Provides an environment for collaborative
  interdisciplinary research in Bioinformatics.
• Hosts researchers from
   – Department of Computing Science
   – Institute of Biomedical and Life Sciences.
• Physically located in the Institute of Biomedical and
  Life Sciences (Spring 2003)
• Strong links with
   – Sir Henry Wellcome Functional Genomics Facility.
   – Statistical Bioinformatics
   – Mathematical Biology
• Outreach programme (visitors etc)
   The Scottish Bioinformatics Forum (SBF)
• Network of Bioinformatics researchers and industries in
  Scotland
• A vehicle for developing Scotland as a Centre of
  Bioinformatics Excellence
• Nodes in Glasgow, Edinburgh, Dundee, Aberdeen, ...
• Promoting collaborative research
• Development of a Bioinformatics educational programme
• www.sbforum.org, sbforum-general@sbforum.org
           Contacts
   {actan,drg}@brc.dcs.gla.ac.uk
 Bioinformatics Research Centre
Department of Computing Science
      University of Glasgow
  http://www.brc.dcs.gla.ac.uk