Learning Center
Plans & pricing Sign in
Sign Out


VIEWS: 13 PAGES: 129

									The Lion, the Leopard, the Wolf, or the Boar…

Why not more?

                Bud Mishra

 Professor of Computer Science,
  Mathematics and Cell Biology
Courant Institute, NYU School of Medicine, Tata Institute of
 Fundamental Research, and Mt. Sinai School of Medicine
• Biology
   – Evolution by Duplication
   – Segmental Duplications in mammalian genomes
• Mathematical Challenges in Systems Biology
• Compare Genomes All-vs-All
   –   Models:
   –   Blast and SWAT:
   –   MUMMER:
   –   PRIZM
• Nondeterminism, Probability and Determinism: Computing the
  Priors and Mixed Strategies.
• Demos: IT WORKS!
Introduction to Biology
       Introduction to Biology
• Genome:
   – Hereditary information of an organism is encoded
     in its DNA and enclosed in a cell (unless it is a
     virus). All the information contained in the DNA of
     a single organism is its genome.
• DNA molecule can be thought of as a very long
  sequence of nucleotides or bases:
              S = {A, T, C, G}
• DNA is a double-stranded polymer and should be thought of as
  a pair of sequences over S. However, there is a relation of
  complementarity between the two sequences:
   – A , T, C , G
   – That is if there is an A (respectively, T, C, G) on one
     sequence at a particular position then the other sequence
     must have a T (respectively, A, G, C) at the same position.
• We will measure the sequence length (or the DNA length) in
  terms of base pairs (bp): for instance, human (H. sapiens)
  DNA is 3.3 £ 109 bp measuring about 6 ft of DNA polymer
  completely stretched out!
                   Inert & Rigid
• Complementary base pairs (A-T and C-G) are
  connected by hydrogen bonds and the base-pair
  forms a coplanar ―rung‖ connecting the two strands.
   – Cytosine and thymine are smaller (lighter) molecules, called
   – Guanine and adenine are bigger (bulkier) molecules, called
   – Adenine and thymine allow only for double hydrogen bonding,
     while cytosine and guanine allow for triple hydrogen bonding.
• Thus the chemical (through hydrogen bonding) and the
  mechanical (purine to pyrimidine) constraints on the pairing
  lead to the complementarity and makes the double stranded
  DNA both chemically inert and mechanically quite rigid
  and stable.
      J.B.S. Haldane
• ―If I were compelled to give my own
  appreciation of the evolutionary process…, I
  should say this: In the first place it is very
  beautiful. In that beauty, there is an element
  of tragedy…In an evolutionary line rising from
  simplicity to complexity, then often falling back
  to an apparently primitive condition before its
  end, we perceive an artistic unity …
• ―To me at least the beauty of evolution is far
  more striking than its purpose.‖
       • J.B.S. Haldane, The Causes of Evolution.
Human Condition

•Overlapping words of different sizes are analyzed for their frequencies along the whole
human genome
      •Red: 24-mers,
      • Green: 21-mers
      • Blue:18 mers
      • Gray:15 mers
• To the very left is a ubiquitous human transposon Alu. The high frequency is indicative
of its repetitive nature.
•To the very right is the beginning of a gene. The low frequency is indicative of its
uniqueness in the whole genome.
                                                                    Doublet Repeats
                                                                                                          • Serendipitous discovery of a new
                                                                                                            uncataloged class of short duplicate
                                                                                                            sequences; doublet repeats.
                                                                                                             – almost always < 100 bp
                                                                                                             – appear to use duplicative
                                                                                                               machinery that does not have the
                                                                                                               hallmarks of transposition,
                                                                                                               segmental duplication, or
                                                                                                               pseudogene formation.
Number of elements in 10 Mb window


                                                                                                          • (Top) . The distance between the
                                                                                                            two loci of a doublet is plotted
                                     400                                                                    versus the chromosomal position of
                                                                                                            the first locus.

                                                                                                          • (Bottom) : Distribution of doublets
                                        0.00   57500000.00   115000000.00   172500000.00   230000000.00     (black) and segmental duplications
                                               Start of 10 Mb window on chromosome 2
                                                                                                            (red) across human chromosome 2
 • J.B.S. Haldane (1932):
      – ―A redundant duplicate of a
        gene may acquire divergent
        mutations and eventually
        emerge as a new gene.‖
 • Susumu Ohno (1970):
      – ―Natural selection merely
        modified, while redundancy
     Evolution By Duplication
• Tandem:
  – Polymerase slippage
  – Unequal crossover: γ-globin duplication mediated
    by L1.
• Interspersed:
  – Transposons
  – Exon shuffling
  – Segmental duplication
• Whole genome
  – S. cerevisiae; early vertebrates; A. thaliana, etc
         Duplicated Genes
• Mutation during nonfunctionalization
• Gene sharing, Duplication-
  degeneration-complementary (DDC)
• Dosage compensation
• Multi-function constraint
    Chromosomal Aberrations
• Breakage
• Translocation (Among non-homologous
• Formation of acentric and dicentric chromosomes.
• Gene Conversions
• Amplifications and deletions
• Point mutations
• Jumping genes a Transposition of DNA segments
• Programmed rearrangements a E.g., antibody
     Recent Segmental Duplications
•3.5% ~ 5% of the human         Human
genome is found to contain
    • segmental duplications,
    with length > 5 or 1kb,
    identity > 90%.
•These duplications are
estimated to have emerged
about 40Mya under neutral
•The duplications are mostly
interspersed (non-tandem),
and happen both inter- and
intra-chromosomally.            From [Bailey, et al. 2002]
                                 The Model
                                           f --                                    f --

                               insertion           deletion or

                                            f +-                                    f +-
Duplication by recombination
between other repeats or
other mechanisms

                                                   deletion or
                               insertion            mutation
                                            f ++                                    f ++

Duplication by recombination                                     Mutation accumulation in the
between repeats                                                  duplicated sequences
                  The Mathematical Model
                                              Time after duplication

                           1-α-2β                       1-α-2β                                  1-α-2β
               h0--                      α                           α         α                              α          f --
                       γ     2β                γ         2β                            γ         2β
        h0     h0+-                      α                           α         α                              α
   H0                                                                                                                    f +-
                            1-α-β/2-γ                   1-α-β/2-γ                               1-α-β/2-γ
                 2γ          β/2              2γ         β/2                         2γ          β/2
               h0++                      α                           α         α                              α
   H1                                                                                                                    f ++
        h1      h1++
                            1-α-2γ                      1-α-2γ                                  1-α-2γ
                       0≤d<ε                   ε ≤ d < 2ε                            (k-1)ε ≤ d < kε

h1: proportion of duplications by repeat recombination;                              α: mutation rate in duplicated sequences;
    h1++: proportion of duplications by recombination of the specific repeat;
                                                                                     β: insertion rate of the specific repeat;
    h1- - : proportion of duplications by recombination of other repeats;
h0: proportion of duplications by other repeat-unrelated mechanism;                  γ: mutation rate in the specific repeat;
    h0++: proportion of h0 with common specific repeat in the flanking regions;      d: divergence level of duplications;
    h0+-: proportion of h0 with no common specific repeat in the flanking regions;   ε: divergence interval of duplications.
    h0- -: proportion of h0 with no specific repeat in the flanking regions;
                  Mechanisms of
Segmental Duplications in Mammalian Genomes
• To test and quantify the hypothesis, we designed a
  Markov model that incorporates:
   – evolutionary dynamics of the sequence elements
   – interactions between different elements
• Segmental duplication is a multi-mechanism process.
• 12% was caused by recombination between the most
  active interspersed repeats.
• Physical instability in the DNA sequences is also
  involved in duplication mechanism.
• The results showed dynamic interaction between
  duplications at different scales.
• Paralogs:
  – Compare one genome against itself
• Orthologs:
  – Compare one genome against another
• Compare multiple sets of genomes
• Metagenomes:
 Grand Challenges in
Computational Systems
            Mathematical challenges
              in Systems Biology
1.  How to fully decipher the (digital) information content of the genome
2.  How to do all-vs-all comparisons of 1000s of genomes
3.  How to extract protein and gene regulatory networks from 1 & 2
4.  How to integrate multiple high-throughout data types dependably
5.  How to visualize & explore large- scale, multi- dimensional data
6.  How to convert static network maps into dynamic mathematical
7. How to predict protein function ab initio
8. How to identify signatures for cellular states (e. g. healthy vs.
9.   How to build hierarchical models across multiple scales of time &
10. How to reduce complex multi- dimensional models to underlying
          Evolution by Mutation
• Mutant gene or DNA sequence:
    – Substitution, Insertion/Deletion, Recombination, Duplication, Gene
    – Spread through a population by genetic drift and/or natural
      selection and eventually is fixed in a species
• Nucleotide substitution can be divided into two classes:
    – Transitions: Substitution of a purine by purine (A, G) or a
      pyrimidine by a pyrimidine (C,T)
    – Transversions: The other types.
• More specific properties:
    – Frameshift mutation, nonsense mutation, synonymous or silent
      substitutions, non-synonymous or amino acid replacement
    – Transposons, gene conversions, horizontal gene transfer.
              Jukes Cantor
• The nucleotide substitution occurs at any site
  with equal frequency, and that at each site a
  nucleotide changes to of the three remaining
  nucleotides with a probability of a per year.
• Probability of a change of a nucleotide to
  another is r = 3 a
• qt = Proportion of identical nucleotides
  between two sequences
       qt+1 = (1-2r) qt +(2/3) r (1-qt)
            q(t) = 1 – ¾(1-e-8rt/3)
          Kimura 2 Parameters
• The rate of transitional substitution per site per year = a
• The rate of transversional substitution per site per year = 2b
• Total substitution rate, r = a + 2 b
• Pt is the proportion of identical transition-type pairs
  AG, GA, CT, TC
• Qt is the proportion of identical transversion-type
  pairs AG, GA, CT, TC
        P(t) = (1/4)(1- 2 e-4(a+b)t +e-8b t)
             Q(t) = (1/2)(1 –e-8b t)
             Other Models
• Tajima & Nei:
  – Substitutions that seem to be rather
    insesnsitive to various disturbing factors.
• Tamura:
  – Takes into account varying GC content
• Hasegawa et al. (HKY)
• Rzhetsky & Nei
• Tamura & Nei
     Blast and SWAT:
Blast, and the world Blasts
with you; SWAT, and you
       SWAT alone…
          Pair-wise Alignment:
             Task Definition
• Given
  – a pair of sequences (DNA or protein)
  – a method for scoring the similarity of a pair
    of characters
• Do
  – determine the correspondences between
    substrings in the sequences such that the
    similarity score is maximized
       DP Alignment Algorithms

• Needleman & Wunsch
  – Improvement for Affine Gap-penalty, by
    Smith & Waterman (Swat)
  – Dynamic programming:
    • Determine alignment of two sequences by
      determining alignment of all prefixes of the
 Comparing Syntenic Regions

 Comparison of Human and Fugu orthologous genomic
   sequence using Plains and EMBOSS. This sequence
   contains six exon regions. Both Plains and EMBOSS
  correctly identify the exon region 2 in Fugu with 3 in
Human, but only Plains correctly aligns the exon region 5
                in Fugu with 5 in Human.
        Heuristic Alignment
• FASTA [Pearson & Lipman, 1988]
• BLAST [Altschul et al., 1990]
  – BLAST heuristically finds high scoring
    segment pairs (HSPs):
    • identical length segments from 2 sequences
      with statistically significant match scores
    • i.e. ungapped local alignments
    • key tradeoff: sensitivity vs. speed
       The MUMmer System
• Delcher et al., Nucleic Acids Research, 1999
• Given genomes A and B
  – Find all maximal, unique, matching subsequences
  – Extract the longest possible set of matches that
    occur in the same order in both genomes
  – Close the gaps
  – Output the alignment
Find Longest Subsequence

            • Sort MUMs according to
              position in genome A
            • Solve variation of
              Longest Increasing
              Subsequence (LIS)
              problem to find
              sequences in ascending
              order in both genomes
               – Requires O(k log k) time
                 where k is number of
                The More the Merrier
 Algorithm              Homology Seed               Indexing        Reference
                                                    Scanned with
   BLAST                   exact k-mer                 DFA*            [31]
   WABA            wobble base degenerate k-mer                        [32]
                 randomly projected k-mer with <d   Sorted Array
LSH-ALL-PAIRS                                                          [33]
                                                     Hash Table
  BLASTZ             discontinuous exact k-mer                        [34][35]
                                                     Hash Table
PatternHunter       discontinuous exact k-mer                          [36]
                                                     Hash Table
    BLAT              exact or inexact k-mer                           [37]
   CHAOS              exact or inexact k-mer                           [38]
                                                     Hash Table
    PASH            discontinuous exact k-mer                          [39]
                                                     Suffix Tree
  REPuter              maximal exact repeat                            [40]
                                                    Factor Oracle
FORRepeats                 exact repeat                                [41]
• Many innovative sequence alignment tools are
  available for detailed comparative genomic studies.
   – Recent segmental duplications in mammalian genomes (with
     identity level >90%) can be detected using BLAST and many
     other tools.
• They use exact or inexact k-mers as homology seeds.
• As homology levels become lower, they encounter a
  dilemma between sensitivity and computational
   – homologous segments,
   – segmental duplications, or
   – homology-based phylogentic distances.
• To improve sensitivity they rely on exhaustive searches of exact
  matches with short mers or inexact matches with long mers.
    – They encounter too many false-positives, to be later filtered
      through an expensive post-processing step.
• Instead, more stringent search criteria (longer mers with more
  exact matches) may be used to improve efficiency.
    – Then these algorithms fail to detect low-homology regions, such as
      ancient duplication events.
• In order to detect less-recent duplications, orthologous genes
  have been used as ―anchors‖ to map out the duplication blocks.
  But, for obvious reasons, they are unsuitable for identifying
  duplications that are not subject to strong selection process.

(Paxia, Rastogi, Zhou, Mishra)

How to do all-vs-all comparisons of
       1000s of genomes
• It uses a Bayesian scheme.
  – It is efficient…Linear time.
  – It computes homologous regions between two
    genomes even when the homology level drops to
    a value around 65%.
     • Incorporates background knowledge about genome
       evolution, by experimenting with several priors
       (noninformative improper prior, exponential and Gamma
       priors, and priors based on Juke-Cantor one parameter
       and Kimura‘s two parameters models of evolution).
     • The results appear to be unaffected by these choices,
       while the computational efficiency is only mildly affected
       by what prior is employed.
      A Simple Observation
• Homology is hard to compute but easy
  to verify. Quadratic vs. Linear time.
  – Can a probabilistic approach replace
    nondeterminism? If so, we can expect a
    probabilistic linear time algorithm.
  – Unlikely! But if we can use priors based on
    the underlying distributions, there is hope.
  – Many computational biology problems
    share this feature!

 The genomic sequences under comparison: A) M. genitalium
  (Y-axis) and M. pneumoniae (X-axis), computed using 300
  probes from six iterations, taking about 5 seconds using a
                   non-optimized algorithm

 The mouse chromosome 10 and rat chromosome 1 share a syntenic
   region about 20Mb at the beginning of the two chromosomes

                Alignment between Mouse Chromosome 8 and
                           human chromosome 4.
 In silico experiment uses 21 mers (5,105,3), and takes about 45 seconds.

       Human vs. Mouse

M. genitalium vs. M. pneumoniae
              Homology Curve
• An m-mer is a word of length m, selected from either
• Consider a location in the first genome, G1[a] and a
  short window, starting at a.
                W1,a = G1[a, a+m−1]
• Compare this window with a word of equal length
  from the second genome starting at G2[b]:
               W2,b = G2[b, b+m − 1].
• Define the homology level for the locations G1[a]
  and G2[b] and a window of size m as
   h(2)(G1[a], G2[b], m) = (1/m) g=1m-1 IG1[a+g] = G2[a + g].
            Homology Curve
• Let us define, h(a) to denote the highest homology
  level for genome G1 at position a and computed with
  respect to G2:
    h(a) = max1 · b <|G2| - m h(2)(G1[a], G2[b], m)

• The ―homology curve‖ for the first genome, G1 with
  the respect to the second genome G2 is then defined
            h : [1..G1 −m − 1] → [0, 1]
                      : a a h(a)
• Replacing non-determinism with a
  probabilistic guessing scheme.
  – The probability distributions are based on
    biologically meaningful priors.
  – Using these priors it guesses a local homology
    curve, and designs and performs an in silico
  – It uses the results to verify its guess (in linear time)
    and refines the local homology curve and the
    probability distributions for the next iteration.
       Probabilistic guesses
• Use a Bayesian scheme and a boosting
  approach to modify the probability
  distributions of the ―guessing experiments‖
  from one iteration to next.
• At each iteration, a sequence of words with a
  specific distribution is selected from one
  genome, and is optimally partitioned into
  groups for ―in silico experiments‖ involving
  – exact-match search,
  – inexact-match search with one error,
  – inexact match search with two errors, etc.
        Probabilistic guesses
• These searches can be efficiently conducted
  over the second genome,
  – Assuming that the other genome has been
    preprocessed and stored in an efficient data
    structure (e.g., suffix arrays or hash table).
  – From the results of the experiments, a Bayesian
    estimator can compute the local homology levels
    for the genome, and use it to verify and sharpen
    the probabilistic distributions for the next iteration.
• The algorithm converges to the true local
  homology levels after a few iterations.
          In Silico Experiments
• Let b, B = | G2 |/b, w, W, m, N0, N1, . . ., Nk (k ≤ m, and
  in our applications usually k = 2) be some pre-specified
   – Choose k+1 random subsets, S0, S1, . . ., Sk, of words each
      of length m, randomly (i.i.d. uniform) from G1[a, a+w−1],
      such that
       |S0| = N0, |S1| = N1, . . . , and |Sk| = Nk.

   – Consider a block in the second genome of length b: Bb = G2
     [b, b+b−1]. Let X0 (X1, . . ., Xk, respectively) be defined
     as the number of m-mers from set S0 (S1, . . ., Sk,
     respectively) that match exactly (with one, two and so on up
     to k mismatches, respectively) to an m-mer in G2[b, b +b −
          Experiment Design
• By examining the sensitivity (∂(Xi/Ni)/∂h = a′i[h])
  we can divide the interval for h into three intervals:
  [1/4, θ1] ≈ [1/4, (m−2)/m], (θ1, θ2] ≈
  [(m−2)/m, (m−1)/m] and (θ1, 1] ≈ ((m−1)/m,
  1], such that the choices of (N0,N1,N2) are based on
  the following mixed strategies:
   N0 = (K/b) sq21 pi,I(h) dh
   N1 = (K/3bm) sq1q2 pi,I(h) dh
   N2 = (2K/9b(m2 –m)) s1/4q1 pi,I(h) dh
            In Silico Experiments
• Thus Xi’s for i in [0..k] are binomially distributed random
  variables whose parameters depend on the homology level
• We can estimate the local homology by the following robust
                          h h | X0, X1, … Xk i
                      = s01 h p(h | X0, …, Xk) dh
      = s01 h p(h) p(X0, …, Xk|h) dh/ s01 p(h) p(X0, …, Xk|h) dh
   – Similarly, compute the mean, standard deviation and
     confidence of the homology function over Bb.
   – Let b* = arg maxb mean(Bb). Then the homology
     function is estimated at a by mean(Bb*).
  Conditional Probabilities
       ri = b pm j=1i C[m,j] 3j
   si = j=1i C[m,j] hm−j (1 − h)j
          bi = (1 − si)(1 − ri)
     ai = 1 − bi = si + ri − si ri.

p(X1, X2, …, XK|h) / j aj Xj bjNj – Xj
                    Initial Priors
• Using Jukes-Cantor: the random variable r represents the rate
  of nucleotide substitution per site per year.
   – In this model, it is assumed that nucleotide substitution
      occurs at any nucleotide site with equal frequency and at
      each site a nucleotide changes to one of the three remaining
      nucleotides with a probability a per year: r = 3 a.
   – The substitution rate is often higher at functionally less
      important sites than at functionally more important sites.
   – Case 1: r » Exponential(λ): fexp(r) = λe−λr. In that case
                         p(h) »(4h − 1)3λ/8T−1
   – Case 2: r » Gamma(λ, ν): f Γ(r) = λn e−λrrn−1/Γ(n). In
      that case
            p(h) »(4h − 1)3λ/8T−1 ln[3/(4h − 1)/T]n−1
               Initial Priors
• A more complex structures arise as we
  consider multi-parameter models: e.g.,
  Kimura‘s Two-Parameter Method. In this
  model, the rate of transitional substitution per
  site per year (a) is assumed to be different
  from that of transversional substitution (2β).
              h = (1 − P)(1 − Q)
      P = (1/4) (1 − 2e−4(a+β)T + e−8βT)
             Q = (1/2) (1 − e8βT)
                Initial Priors
• Assume that α » Exponential(λa) and β »
  Exponential(λβ) and they are independent.
p(h) = (λa λβ/8T2)

  s01    (1 − P)2+(λa+λb)/8T(2h + P − 1)(λa−λb)/8T
               (h − 2P + P2)−λa/4T
         /(2h + P − 1)(h − 2P + P2) dP.
               Refining Priors
• In iteration i, let us consider an interval I with k
  homology estimates:
           h m1, s1 i, h m2, s2 i, …, h mk, sk i
• Assume that the homology values h1, h2, . . ., hk is
  sampled from a distribution
                     h » N(μ, σ2).
• Furthermore, we assume the following:
                    μ » N(ξ, t2)
             r = (σ2 + t2)−1 » Γ(a, β).
                New Prior
– Prior = Kummer‘s hypergeometric function of
  order 1
                   f(h|ξ, t, a , β)
         » s01/τ2 ra-2 1/√(1 − r t2) e−Br dr
           » 1F1(a − 1, a − 1/2,−B/ t2)
            ¼ ((h-x)2 + 2 b)/2 t2)-a+1
– Estimates
                        ξ = h μji
                  t 2 = h μ2 i − h μj i 2
                a/β = h (σ2i + t2)−1 i
         a/β2 = h σ2i + t2)−2 i - h σ2i + t2)−1 i2
    Optimizing the parameters
• The parameter choices are as follows:
   – Let the number of blocks (b) and the number of windows (w)
     be chosen a priori based on the needed resolution for
   – We may choose these parameters so that b = O(p(G2)) and
     w = O(p(G1)). We assign K = O(1) amount of work to a
     region defined by a combination of any single block with any
     single window.
   – Thus the amount of work is roughly K(G1G2)/(w b) =
     O(G1+G2) per iteration.
   – The mer size parameter ‗m‘ is chosen so that the probability
     of a ―hit‖ in a block containing a homologous sequence much
     higher than in a random block:
                   (b/4m) << E(h0,G)m.

How to identify signatures for
  cellular states (e. g. healthy
          vs. diseased)
                Microarray Analysis
                                 • Representations are reproducible
Normal DNA             Tumor DNA   samplings of DNA populations in
                                   which the resulting DNA has a new
                                   format and reduced complexity.
                                      – We array probes derived from low
Normal LCR             Tumor LCR        complexity representations of the
                                        normal genome
                                      – We measure differences in gene
             Label                      copy number between samples
                     Hybridize        – Since representations have a lower
                                        nucleotide complexity than total
                                        genomic DNA, we obtain a stronger
                                        specific hybridization signal relative
                                        to non-specific and noise
Copy Number Fluctuation

     A1      B1      C1

     A2      B2      C2

     A3      B3      C3
      A MAP (Maximum A Posteriori)
• Priors:                                 – pe is the probability of a
                                             particular probe being
   – Deletion + Amplification
• Data:                                   – pb is the average number of
   – Priors + Noise                          intervals per unit length.
                                           Max likelihood
                                              over (Pe,Pb)
• Goal: Find the most plausible        0.10

  hypothesis of regional


  changes and their associated

  copy numbers
• Generalizes HMM:The prior

  depends on two parameters            0.04

  pe and pb.
                                                                                               (pe,pb) max at (0.55,0.01)

                                                                                   355.4                          284.3
                                                                                  213.2    236.9                    165.9
                                                      0.4                   0.5               0.6                              0.7
A reasonable choice of priors yields
       good segmentation.
              sampling = 5
              p_e = 0.35
              p_b = 0.01
         2    chr = 8

Copy #



                             100   300   Probe # 500   700   900
  A reasonable choice of priors yields
         good segmentation.
           sampling = 5
           p_e = 0.55
     0.5   p_b = 0.0001
           chr = 2

Copy #



                 50       300             #
                                550 Probe800   1050   1300   1550
    Prior Selection: F criterion
                                                  Pf over (Pe,Pb)

• For each break we

  have a T2 statistic and                  0.08

  the appropriate tail                                                    0.1

  probability (p value)
                                           0.06                                                0.7                           0.8

  calculated from the
                                                                          0.2                                       0.50.6

  distribution of the


  statistic. In this case,
                                                                                                                      (pe,pb) max at (0.55,0.01)

  this is an F distribution.
                                           0.02                           0.6

• The best (pe,pb) is the                                                                                                          0.9
                                                                                                                              0.4 0.6

  one that leads to the
                                                            0.4                     0.5                            0.6                     0.7

  maximum min p-value.
                                                    N1 N 2
                                                   N1  N 2
                                                            x y               

                           T2 
                                  df1  df 2
                                                   x  x   y  y  
                                                      i       i
                                                                                    j     j
      Current Implementation
• NYU Array CGH
• Versatile:
   – Works well for BAC array, ROMA & Agilent
   – Raw Affymetrix data is noisier, but our new algorithm for
     ―background correction and summarization (BCS)‖ makes
     Affy-data significantly better than the competitors.
   – (BCS also generalizes to gene expression and performs
     better than RMA, Li-Wong, etc.)
• Global Analysis (LOH analysis, detecting TSG and
• Fast implementation, with visualization and
  integration (to be released soon…)
         Finding Cancer Genes
• LOH/Deletion Analysis analysis
• Hypothesize a TSG (Tumor Suppressor Gene)
• Score function for each possible genomic region containing the
   – Evolutionary history
   – Interactions
   – Parameters
• This score can be computed using estimation from data and
  also prior information on how the deletions arise. We use a
  simple approximation; we assume there is a Poisson process
  that generates breakpoints along the genome and an
  Exponential process that models the length of the deletions.
                            •   Several scenarios:
• There are S = 200         •   Model1: one TSG [300, 350]. All individuals
  individuals. Deletions        are diseased because of homozygous
                                deletion of the TSG.
  occur in each individual. •   Model2: one TSG [300, 350]; 50% of the
  A cell having incurred a      individuals are diseased because of
  deletion overlapping          homozygous deletion of the TSG and the rest
  one of the TSGs, will         are diseased because of hemizygous deletion
                                of the TSG.
  multiply quickly and      •   Model3: one TSG [300, 350]. All individuals
  generate many copies          are diseased because of hemizygous deletion
  of itself. These copies       of the TSG.
  evolve independently •        Model4: two TSGs; 50% of the people are
                                diseased because of homozygous deletion in
  for a while until we          the first TSG and the other half are diseased
  collect the information.      because of homozygous deletion in the
•                               second TSG.
Model Simulation

Host-Pathogen Interaction

     Gene Ontology
   Algorithmic Logic
   Invariant Extractor
                       A Picture
                   Biological System:
            Part-List + Functional Relations

 Regulatory, Metabolic &
   Signaling Relations
Redescription                                     Recomputation

Descriptive Relation                     Numerical Traces
Yeast-Cell Cycle Data:
     Spellman et al.
G0                            Invariants


           Another Example

• Pre-Apoptosis 6 time points data analysis
• Six time-point data at 2h, 4h, 6h, 8h,
  12h, 24h
  – Cells treated with SEB
  – Control: untreated cells
• Data from Jett-Lab (Walter-Reed)
Hypothesized pathway
•Note that GOALIE is not intimately tied to any particular
ontology: It can be customized to work with other controlled
vocabulary: e.g., UMLS, MetaCYC, Reactome…
•Thus GOALIE also provides a way to ―compare‖ different

       From the GO web site. The
       path back to each ontology
       from a gene.

       We will call each term in a
       path a split.
 Structure of a GO annotation

Each gene can have several annotated GOs and each GO
can have several splits. E.g. DNA topoisomerase II alpha
has 8 GO annotations and 11 splits
  Time Coarse GO categorization
• GOALIE partitions the temporal data in order to understand
  local behavior. Data are grouped by considering (weighted)
  windows of time points (2-4-6, 4-6-8, 6-8-12, 8-12-24)
• Next GOALIE uses a K-Means algorithm (genesis) to produce
  clusters for each window
• GOALIE then associates to each cluster a set of GO descriptors
  (with p-values and controlled FDR, false discovery rates)
• GOALIE computes ―patterns‖ of GO categories across the time
• All these steps can be run from one unifying interface that
  GOALIE provides. GOALIE will be embedded inside VALIS.
     Female, Sings,
     Male, Talks, Thin
       Male, Sings,
X3     Overweight

      Female, Sings,
X3 X4   X3           X2 X1

        X4 X2
X3 X4   X3        X2 X1


        X4 X2
X3 X4   X3        X2 X1


        X4 X2

X3        X1
     X3        Finale
X4        X2

             X2      : O, : FLS

        X3                        X1
             X3                              Finale
        X4                        X2

: O, : FLS        : O, : FLS
                                  : O, FLS     O


                    : O, FLS
             X2      : O, : FLS
             X4      : O UFLS

        X3                        X1
             X3                              Finale
        X4                        X2

: O, : FLS        : O, : FLS
                                  : O, FLS     O
                  : O UFLS


                    : O, FLS
                   : O UFLS
             X2      : O, : FLS
             X4      : O UFLS

        X3                        X1
             X3                              Finale
        X4                        X2

: O, : FLS        : O, : FLS
                                  : O, FLS     O
: O UFLS          : O UFLS


                    : O, FLS
                   : O UFLS
                                    It ain’t over ‘til
                      : O, : FLS   the fat lady sings
              X4      : O UFLS

        X3                         X1
              X3                                  Finale
        X4                         X2

 : O, : FLS        : O, : FLS
                                   : O, FLS              O
 : O UFLS          : O UFLS

                     : O, FLS
                    : O UFLS
                GOALIE: GO Algorithmic Logic for
                     Invariant Extraction

                                                    Clusters connection tree
                                                    Each level a “window”

Micro-array accessions

GO categories

    Clusters information   Connection information      Cluster Information
           GOALIE: GO Algorithmic Logic for
                Invariant Extraction

                                                          GO categories
                      GO categories    GO categories        describing
                        shared with      describing      “source” cluster
                       “destination”    “destination”         but not
  GO categories                        cluster but not
 describing genes
                          cluster                          “destination”
in “source” cluster
GOALIE: SEB Analysis Preliminary Results

1.   Time Course Window 1 to Time Course Window 2: Connection 1:9
     to 2:18.
     By inspecting the first cluster in the first window (Cluster~1:9), we note that
     one of the connection to the cluster2 in the second window (Cluster~2:18) is
     labeled (among many others) by the GO categories ―circulation‖
     (GO:0008015), and by the category ―negative regulation of heart rate‖
     (GO:0045822). This represents a constant set of biological processes shared
     by this cluster chain, traversing Cluster 3:17, to Cluster 4:13.
2.   Time Course Window 1 to Time Course Window 2: Connection 1:9
     to 2:6.
     The connection between Cluster 1:9 and Cluster 2:6 is interesting because it
     shows how the category ―regulation of lymphocyte proliferation‖
     (GO:0050670) becomes activated in the next time-window (Cluster 2:6),
     while the categories ―antigen presentation‖ and ―antigen processing‖
     became inactive. This should indicate that some of the genes in the clusters
     start a response to the pathogen in the second time point.
           Hidden Kripke Model
• ―Hidden Kripke Model‖ reconstruction via ontology based
  redescription of time-sliced clusters of time-course
  measurements (arrays) offers a novel viewpoint form which
  formulate biological hypotheses and eventually reconstruct
  ―biological first principles‖
• The GOALIE tool, in its embryonic state, has already been
  proven ―correct‖ and offered a wide range of insights into a
  number of biological datasets
   • Spellman‘s Yeast Cell Cycle
   • SEB host-pathogen data from WRAIR
• New datasets being analyzed now include
   • P. falciparum dataset [Bozdech et al, 1(1):085]
   • Subset of Genome Module Map dataset [Segal et al]
                     Story generation
                                    • Temporal Logic formulae
Dataset               Formula
 Dataset              Generator       can be rendered in
   Dataset                            English.
                                    • Temporal Logic formulae
                          Formula     can be generated
                                      automatically (with care).
    Logic Analysis
                                    • Each formula can be
                                      tested against a set of
                                      datasets; differences can
   Natural Language       HTML
   Story Generation        file
                                      then be noted.
#4How to integrate multiple high-throughout data types
        dependably; How to visualize & explore large-
 #5             scale, multi- dimensional data

                      Reps   MER57A           L1P





• Sequence                           • Bioperl
   – GENBANK, EMBL, DDBJ               • International open-
   – SWISS-PROT, GenPept,                source collaboration
                                       • 7 Years of development
   – PDB, SCOP,…
                                       • 600 Modules
• Genetic   (Flybase,
  Wormbase, GDB, AtDB, SGD,…)          • 100,000 lines of code
• Expression (RNA                      • Easy-to-use, stable, and
  Expression from microarrays,...)       consistent programming
• Metabolic (KEGG, WIT,…)                interface for
                                         bioinformatics application
• Literature (MEDLINE,…)                 programmers
            BioPerl Module Statistics

                                           Histogram of Module Sizes
Number of Modules

                                   mean 144



                         0   100    200   300    400   500   600   700   800   900   1000 1100

                                                Real Lines of Code
           Bioinformatics Data
• Majority of the data           • Example: GoldenPath
  kept in (indexed) flat           /UCSC Genome Browser
  files & Relational             • Rough Estimates:
  Databases                         – 800 Tables
• Flat Files                        – 14000 fields
   – Unstructured/Difficult to      – 1600 varchars [most
     update                           varchar(255) ]
• Relational Databases              – 1300 blobs
   – Strings are atomic          • Blobs:
     objects                        – exonStart, ExonEnd,
   – Blobs                            qStarts, tStarts, …
      Key Feature of Valis
• State of the art of rapid
  prototyping in bioinformatics
• Multilanguage Scripting
• Data storage
• Graphical User interfaces
        • A Valis script can be written
          in any supported language:
            – JScript, VBScript, Python,
              PERL, Lisp, R and SETL.
            – All the scripts see the same
              Valis class hierarchy.
            – For example, once a user
              learns that a Valis
              Sequence Object has a
              method called Input that will
              read the sequence from a
              file, the user can
              subsequently use this same
              primitive from all the
              different languages.
              Data Storage
  • Based on Extended B-Trees
  • At the lowest level there is an Heap of
  • Must correctly keep track of the
    reference counts of each record/object
    to implement value semantics
 • Once the processing is completed, it is very
   important to be able to quickly visualize the results.
 • For this reason Valis provides numerous visualization
   tools that allow a user to quickly display
     –   sequences, maps,
     –   microarray data,
     –   tables,
     –   graphs and annotations.
 • These widget can be customized from the scripts.
      How to convert static network maps into dynamic
          mathematical models; How to predict protein
           function ab initio; How to build hierarchical
         models across multiple scales of time & space;
            How to reduce complex multi- dimensional
                 models to underlying principles

#6                                         Glycogen

     #9              Glucose         Glucose-1-P Phosphorylase a      Glycolysis
#10                Glucokinase
                                           Phosphoglucose isomerase

      SIMPATHICA               Fructose-6-P
SimPathica System
         Simpathica is a multi-functional
Front End                               Analysis
Model data structures                       XSSYS (TL)
Equations Handling                          Time/Frequency
Octave/Matlab                               Combined TL/TF
code generation

Simulation                          Dash-          Visualization
                                    board          PtPlot
                        mArray DB
       Simpathica is a modular system
    Canonical Form:
           n m            n m
   X i  a i  X j  bi  X j ij i  1 n
                   g ij             h
             j 1            j 1
                                      nm
 C ( X (t ),, X (t ))  (g
                                    l  X j lj )  0
    l    1              n m
                                        j 1

•      Predefined Modular Structure
•      Automated Translation from
       Graphical to Mathematical Model
•      Scalability
           Purine Metabolism
• Purine Metabolism
  – Provides the organism with building blocks for the synthesis
    of DNA and RNA.
  – The consequences of a malfunctioning purine metabolism
    pathway are severe and can lead to death.
• The entire pathway is almost closed but also
  quite complex. It contains
  – several feedback loops,
  – cross-activations and
  – reversible reactions
• Thus is an ideal candidate for reasoning with
  computational tools.
Simple Model
Biochemistry of Purine Metabolism
             • The main metabolite in purine
               biosynthesis is 5-phosphoribosyl-a-1-
               pyrophosphate (PRPP).
                – A linear cascade of reactions
                   converts PRPP into inosine
                   monophosphate (IMP). IMP is the
                   central branch point of the purine
                   metabolism pathway.
                – IMP is transformed into AMP and
                – Guanosine, adenosine and their
                   derivatives are recycled (unless
                   used elsewhere) into hypoxanthine
                   (HX) and xanthine (XA).
                – XA is finally oxidized into uric acid
Purine Metabolism
 • Variation of the initial         •   Persistent increase in the initial
   concentration of PRPP does           concentration of PRPP does
   not change the steady state.         cause unwanted changes in the
   (PRPP = 10 * PRPP1)                  steady state values of some
   implies steady_state()
                                    •   If the increase in the level of
 • This query will be true when         PRPP is in the order of 70%
   evaluated against the                then the system does reach a
   modified simulation run (i.e.        steady state, and we expect to
   the one where the initial            see increases in the levels of
   concentration of PRPP is 10          IMP and of the hypoxanthine
   times the initial                    pool in a ―comparable‖ order of
   concentration in the first run       magnitude.
   – PRPP1).                            Always (PRPP =
                                        1.7*PRPP1) implies
•  Consider the following               •    In fact, the increase in IMP is
   statement:                                about 6.5 fold while the
• Eventually                                 hypoxanthine pool increase is
(Always (PRPP = 1.7 * PRPP1)                 about 60 fold.
   implies                              •    Since the above queries turn
   steady_state()                            out to be false over the
   and Eventually
                                             modified trace, we conclude
       Always(IMP < 2* IMP1))                that the model ―over-predicts‖
   and Eventually (Always
   (hx_pool < 10*hx_pool1)))                 the increases in some of its
                                             products and that it should
• where IMP1 and hx_pool1 are
   the values observed in the                therefore be amended
   unmodified trace. The above
   statement turns out to be false
   over the modified experiment
Final Model
Purine Metabolism
Quantum leaps
  ―provoke creative
Shotgun Mapping
        • Schematics
          –   Experiment Design
          –   Robotics
          –   BioChemistry
          –   Imaging
          –   Image Analysis
          –   Statistical Algorithms
          –   Visualization
Shotgun Optical Mapping of Genomes
                                        Gentig‘s Successes                                 E. coli

                                     • E. coli
                                     • P. falciparum ¦ D. radiodurans ¦ Y. Pestis
                                     • Rhodobacter sphaeroides ¦ Shigella flexneri ¦ Salmonella
                                     • Aspergillus fumigatus
                                     • …
                                                                                              P. falciparum

                                     • The automated Gentig system is routinely used
                                         – to map a microbe genome quickly & effortlessly
                                         – by a scientist with no quantitative or computational training.
Some Interesting Applications
•   Sequence Validation
•   Haplotyping
•   Sequencing
•   Comparative Genomics
•   Rearrangement events
•   Hemizygous Deletions
•   Epigenomics
•   Characterizing cDNAs
    – Expression Profiling
    – Alternate Splicing
Sequencing in Post-Genomic Era

• Haplotypic Sequencing of 6.6 Billion
  Base Pairs in a Diploid Human genome
• Less than $700
• Less than 24 Hours
• Draft Quality (Not Resequencing)
• Single Molecule Optical Mapping
  – Methylation Sensitive Restriction Enzymes
  – Multiple-Enzyme Maps
• Probe Hybridization on the Surface
• Sequencing by Hybridization
  – Localize algorithms
                  PSBH problem
• The PSBH problem:
   – Can overcome the complexity issues associated with the SBH
   – For each L-mer probe, in addition to knowing whether it hybridizes
     with the unknown sequence (with or without count) we are
     provided constraints on the location of the L-mer in the sequence:
   – The constraint takes the form of a set of permissible locations for
     each L-mer (which need not be contiguous).
   – However it is known that if the constraint limits each L-mer to no
     more than two exact locations on the sequence, then it admits an
     efficient algorithmic solution—If 3 or more locations are possible
     the reconstruction problem becomes NP-complete.
   – However if the location constraints are in the form of K contiguous
     locations then the reconstruction problem is exponential in K,
     rather than the sequence length m.
           Multiple PSBH problem
•   For our data set of probe maps for all 6-mers, we have multiple instances of
    each L-mer, for L=6 (about one every 4Kb on each strand of the DNA) in the
    sequence, and for each instance we can constrain the location to within about 
    200 base pairs depending on the optical resolution.
•   A special case of the PSBH problem, which we will call the ―Multiple Positional
    Sequence by Hybridization‖ (Multiple PSBH) problem, where we have separate
    constraints for each of the multiple instances of each L-mer.
     – By focusing on a small window of about 2000bp, in which most L-mers will
        occur only once, we can solve the standard PSBH problem where separate
        constraints for multiple instances of each L-mer are not important.
     – However if we solve each local PSBH problem for each 2000bp window
        separately, the worst case exponential time reconstruction is unlikely to
        apply to most windows, so if we simply limit the amount of time spent in
        each window to some upper bound linear in the window size, we can still be
        able to reconstruct the sequence in most windows in linear time.
Initial Experiments
The end
The end

To top