VIEWS: 13 PAGES: 129 POSTED ON: 4/25/2011
The Lion, the Leopard, the Wolf, or the Boar… Why not more? Comparative Genomics Bud Mishra Professor of Computer Science, Mathematics and Cell Biology ¦ Courant Institute, NYU School of Medicine, Tata Institute of Fundamental Research, and Mt. Sinai School of Medicine Outline • Biology – Evolution by Duplication – Segmental Duplications in mammalian genomes • Mathematical Challenges in Systems Biology • Compare Genomes All-vs-All – Models: – Blast and SWAT: – MUMMER: – PRIZM • Nondeterminism, Probability and Determinism: Computing the Priors and Mixed Strategies. • Demos: IT WORKS! Introduction to Biology Introduction to Biology • Genome: – Hereditary information of an organism is encoded in its DNA and enclosed in a cell (unless it is a virus). All the information contained in the DNA of a single organism is its genome. • DNA molecule can be thought of as a very long sequence of nucleotides or bases: S = {A, T, C, G} Complementarity • DNA is a double-stranded polymer and should be thought of as a pair of sequences over S. However, there is a relation of complementarity between the two sequences: – A , T, C , G – That is if there is an A (respectively, T, C, G) on one sequence at a particular position then the other sequence must have a T (respectively, A, G, C) at the same position. • We will measure the sequence length (or the DNA length) in terms of base pairs (bp): for instance, human (H. sapiens) DNA is 3.3 £ 109 bp measuring about 6 ft of DNA polymer completely stretched out! Inert & Rigid • Complementary base pairs (A-T and C-G) are connected by hydrogen bonds and the base-pair forms a coplanar ―rung‖ connecting the two strands. – Cytosine and thymine are smaller (lighter) molecules, called pyrimidines – Guanine and adenine are bigger (bulkier) molecules, called purines. – Adenine and thymine allow only for double hydrogen bonding, while cytosine and guanine allow for triple hydrogen bonding. • Thus the chemical (through hydrogen bonding) and the mechanical (purine to pyrimidine) constraints on the pairing lead to the complementarity and makes the double stranded DNA both chemically inert and mechanically quite rigid and stable. J.B.S. Haldane • ―If I were compelled to give my own appreciation of the evolutionary process…, I should say this: In the first place it is very beautiful. In that beauty, there is an element of tragedy…In an evolutionary line rising from simplicity to complexity, then often falling back to an apparently primitive condition before its end, we perceive an artistic unity … • ―To me at least the beauty of evolution is far more striking than its purpose.‖ • J.B.S. Haldane, The Causes of Evolution. 1932. Human Condition Mer-scape… •Overlapping words of different sizes are analyzed for their frequencies along the whole human genome •Red: 24-mers, • Green: 21-mers • Blue:18 mers • Gray:15 mers • To the very left is a ubiquitous human transposon Alu. The high frequency is indicative of its repetitive nature. •To the very right is the beginning of a gene. The low frequency is indicative of its uniqueness in the whole genome. Doublet Repeats • Serendipitous discovery of a new uncataloged class of short duplicate sequences; doublet repeats. – almost always < 100 bp – appear to use duplicative machinery that does not have the hallmarks of transposition, segmental duplication, or pseudogene formation. Number of elements in 10 Mb window 800 • (Top) . The distance between the 600 two loci of a doublet is plotted 400 versus the chromosomal position of the first locus. 200 • (Bottom) : Distribution of doublets 0 0.00 57500000.00 115000000.00 172500000.00 230000000.00 (black) and segmental duplications Start of 10 Mb window on chromosome 2 (red) across human chromosome 2 EBD • J.B.S. Haldane (1932): – ―A redundant duplicate of a gene may acquire divergent mutations and eventually emerge as a new gene.‖ • Susumu Ohno (1970): – ―Natural selection merely modified, while redundancy created.‖ Evolution By Duplication • Tandem: – Polymerase slippage – Unequal crossover: γ-globin duplication mediated by L1. • Interspersed: – Transposons – Exon shuffling – Segmental duplication • Whole genome – S. cerevisiae; early vertebrates; A. thaliana, etc Duplicated Genes • Mutation during nonfunctionalization (MDN) • Gene sharing, Duplication- degeneration-complementary (DDC) • Dosage compensation • Multi-function constraint Chromosomal Aberrations • Breakage • Translocation (Among non-homologous chromosomes.) • Formation of acentric and dicentric chromosomes. • Gene Conversions • Amplifications and deletions • Point mutations • Jumping genes a Transposition of DNA segments • Programmed rearrangements a E.g., antibody responses. Recent Segmental Duplications •3.5% ~ 5% of the human Human genome is found to contain • segmental duplications, with length > 5 or 1kb, identity > 90%. •These duplications are estimated to have emerged about 40Mya under neutral assumption. •The duplications are mostly interspersed (non-tandem), and happen both inter- and intra-chromosomally. From [Bailey, et al. 2002] The Model f -- f -- insertion deletion or mutation f +- f +- Duplication by recombination between other repeats or other mechanisms deletion or insertion mutation f ++ f ++ Duplication by recombination Mutation accumulation in the between repeats duplicated sequences The Mathematical Model Time after duplication 1-α-2β 1-α-2β 1-α-2β h0-- α α α α f -- h1-- γ 2β γ 2β γ 2β h0 h0+- α α α α H0 f +- 1-α-β/2-γ 1-α-β/2-γ 1-α-β/2-γ 2γ β/2 2γ β/2 2γ β/2 h0++ α α α α H1 f ++ h1 h1++ 1-α-2γ 1-α-2γ 1-α-2γ 0≤d<ε ε ≤ d < 2ε (k-1)ε ≤ d < kε h1: proportion of duplications by repeat recombination; α: mutation rate in duplicated sequences; h1++: proportion of duplications by recombination of the specific repeat; β: insertion rate of the specific repeat; h1- - : proportion of duplications by recombination of other repeats; h0: proportion of duplications by other repeat-unrelated mechanism; γ: mutation rate in the specific repeat; h0++: proportion of h0 with common specific repeat in the flanking regions; d: divergence level of duplications; h0+-: proportion of h0 with no common specific repeat in the flanking regions; ε: divergence interval of duplications. h0- -: proportion of h0 with no specific repeat in the flanking regions; Mechanisms of Segmental Duplications in Mammalian Genomes • To test and quantify the hypothesis, we designed a Markov model that incorporates: – evolutionary dynamics of the sequence elements – interactions between different elements • Segmental duplication is a multi-mechanism process. • 12% was caused by recombination between the most active interspersed repeats. • Physical instability in the DNA sequences is also involved in duplication mechanism. • The results showed dynamic interaction between duplications at different scales. TASKS • Paralogs: – Compare one genome against itself • Orthologs: – Compare one genome against another • Compare multiple sets of genomes • Metagenomes: Grand Challenges in Computational Systems Biology Mathematical challenges in Systems Biology 1. How to fully decipher the (digital) information content of the genome 2. How to do all-vs-all comparisons of 1000s of genomes 3. How to extract protein and gene regulatory networks from 1 & 2 4. How to integrate multiple high-throughout data types dependably 5. How to visualize & explore large- scale, multi- dimensional data 6. How to convert static network maps into dynamic mathematical models 7. How to predict protein function ab initio 8. How to identify signatures for cellular states (e. g. healthy vs. diseased) 9. How to build hierarchical models across multiple scales of time & space 10. How to reduce complex multi- dimensional models to underlying principles Evolution by Mutation • Mutant gene or DNA sequence: – Substitution, Insertion/Deletion, Recombination, Duplication, Gene Conversion – Spread through a population by genetic drift and/or natural selection and eventually is fixed in a species • Nucleotide substitution can be divided into two classes: – Transitions: Substitution of a purine by purine (A, G) or a pyrimidine by a pyrimidine (C,T) – Transversions: The other types. • More specific properties: – Frameshift mutation, nonsense mutation, synonymous or silent substitutions, non-synonymous or amino acid replacement substitution – Transposons, gene conversions, horizontal gene transfer. Jukes Cantor • The nucleotide substitution occurs at any site with equal frequency, and that at each site a nucleotide changes to of the three remaining nucleotides with a probability of a per year. • Probability of a change of a nucleotide to another is r = 3 a • qt = Proportion of identical nucleotides between two sequences qt+1 = (1-2r) qt +(2/3) r (1-qt) q(t) = 1 – ¾(1-e-8rt/3) Kimura 2 Parameters • The rate of transitional substitution per site per year = a • The rate of transversional substitution per site per year = 2b • Total substitution rate, r = a + 2 b • Pt is the proportion of identical transition-type pairs AG, GA, CT, TC • Qt is the proportion of identical transversion-type pairs AG, GA, CT, TC P(t) = (1/4)(1- 2 e-4(a+b)t +e-8b t) Q(t) = (1/2)(1 –e-8b t) Other Models • Tajima & Nei: – Substitutions that seem to be rather insesnsitive to various disturbing factors. • Tamura: – Takes into account varying GC content • Hasegawa et al. (HKY) • Rzhetsky & Nei • Tamura & Nei Blast and SWAT: Blast, and the world Blasts with you; SWAT, and you SWAT alone… Pair-wise Alignment: Task Definition • Given – a pair of sequences (DNA or protein) – a method for scoring the similarity of a pair of characters • Do – determine the correspondences between substrings in the sequences such that the similarity score is maximized DP Alignment Algorithms • Needleman & Wunsch – Improvement for Affine Gap-penalty, by Smith & Waterman (Swat) – Dynamic programming: • Determine alignment of two sequences by determining alignment of all prefixes of the sequences Comparing Syntenic Regions Comparison of Human and Fugu orthologous genomic sequence using Plains and EMBOSS. This sequence contains six exon regions. Both Plains and EMBOSS correctly identify the exon region 2 in Fugu with 3 in Human, but only Plains correctly aligns the exon region 5 in Fugu with 5 in Human. Heuristic Alignment • FASTA [Pearson & Lipman, 1988] • BLAST [Altschul et al., 1990] – BLAST heuristically finds high scoring segment pairs (HSPs): • identical length segments from 2 sequences with statistically significant match scores • i.e. ungapped local alignments • key tradeoff: sensitivity vs. speed The MUMmer System • Delcher et al., Nucleic Acids Research, 1999 • Given genomes A and B – Find all maximal, unique, matching subsequences (MUMs) – Extract the longest possible set of matches that occur in the same order in both genomes – Close the gaps – Output the alignment Find Longest Subsequence • Sort MUMs according to position in genome A • Solve variation of Longest Increasing Subsequence (LIS) problem to find sequences in ascending order in both genomes – Requires O(k log k) time where k is number of MUMs The More the Merrier Algorithm Homology Seed Indexing Reference Scanned with BLAST exact k-mer DFA* [31] Array WABA wobble base degenerate k-mer [32] randomly projected k-mer with <d Sorted Array LSH-ALL-PAIRS [33] mismatches Hash Table BLASTZ discontinuous exact k-mer [34][35] Hash Table PatternHunter discontinuous exact k-mer [36] Hash Table BLAT exact or inexact k-mer [37] T-Trie CHAOS exact or inexact k-mer [38] Hash Table PASH discontinuous exact k-mer [39] Suffix Tree REPuter maximal exact repeat [40] Factor Oracle FORRepeats exact repeat [41] Summary • Many innovative sequence alignment tools are available for detailed comparative genomic studies. – Recent segmental duplications in mammalian genomes (with identity level >90%) can be detected using BLAST and many other tools. • They use exact or inexact k-mers as homology seeds. • As homology levels become lower, they encounter a dilemma between sensitivity and computational efficiency – homologous segments, – segmental duplications, or – homology-based phylogentic distances. Summary • To improve sensitivity they rely on exhaustive searches of exact matches with short mers or inexact matches with long mers. – They encounter too many false-positives, to be later filtered through an expensive post-processing step. • Instead, more stringent search criteria (longer mers with more exact matches) may be used to improve efficiency. – Then these algorithms fail to detect low-homology regions, such as ancient duplication events. • In order to detect less-recent duplications, orthologous genes have been used as ―anchors‖ to map out the duplication blocks. But, for obvious reasons, they are unsuitable for identifying duplications that are not subject to strong selection process. #2 PRIZM (Paxia, Rastogi, Zhou, Mishra) How to do all-vs-all comparisons of 1000s of genomes PRIZM • It uses a Bayesian scheme. – It is efficient…Linear time. – It computes homologous regions between two genomes even when the homology level drops to a value around 65%. • Incorporates background knowledge about genome evolution, by experimenting with several priors (noninformative improper prior, exponential and Gamma priors, and priors based on Juke-Cantor one parameter and Kimura‘s two parameters models of evolution). • The results appear to be unaffected by these choices, while the computational efficiency is only mildly affected by what prior is employed. A Simple Observation • Homology is hard to compute but easy to verify. Quadratic vs. Linear time. – Can a probabilistic approach replace nondeterminism? If so, we can expect a probabilistic linear time algorithm. – Unlikely! But if we can use priors based on the underlying distributions, there is hope. – Many computational biology problems share this feature! #2 The genomic sequences under comparison: A) M. genitalium (Y-axis) and M. pneumoniae (X-axis), computed using 300 probes from six iterations, taking about 5 seconds using a non-optimized algorithm #2 The mouse chromosome 10 and rat chromosome 1 share a syntenic region about 20Mb at the beginning of the two chromosomes (arrow). #2 Alignment between Mouse Chromosome 8 and human chromosome 4. In silico experiment uses 21 mers (5,105,3), and takes about 45 seconds. Demo Human vs. Mouse M. genitalium vs. M. pneumoniae Homology Curve • An m-mer is a word of length m, selected from either genome. • Consider a location in the first genome, G1[a] and a short window, starting at a. W1,a = G1[a, a+m−1] • Compare this window with a word of equal length from the second genome starting at G2[b]: W2,b = G2[b, b+m − 1]. • Define the homology level for the locations G1[a] and G2[b] and a window of size m as h(2)(G1[a], G2[b], m) = (1/m) g=1m-1 IG1[a+g] = G2[a + g]. Homology Curve • Let us define, h(a) to denote the highest homology level for genome G1 at position a and computed with respect to G2: h(a) = max1 · b <|G2| - m h(2)(G1[a], G2[b], m) • The ―homology curve‖ for the first genome, G1 with the respect to the second genome G2 is then defined as: h : [1..G1 −m − 1] → [0, 1] : a a h(a) PRIZM • Replacing non-determinism with a probabilistic guessing scheme. – The probability distributions are based on biologically meaningful priors. – Using these priors it guesses a local homology curve, and designs and performs an in silico experiment. – It uses the results to verify its guess (in linear time) and refines the local homology curve and the probability distributions for the next iteration. Probabilistic guesses • Use a Bayesian scheme and a boosting approach to modify the probability distributions of the ―guessing experiments‖ from one iteration to next. • At each iteration, a sequence of words with a specific distribution is selected from one genome, and is optimally partitioned into groups for ―in silico experiments‖ involving – exact-match search, – inexact-match search with one error, – inexact match search with two errors, etc. Probabilistic guesses • These searches can be efficiently conducted over the second genome, – Assuming that the other genome has been preprocessed and stored in an efficient data structure (e.g., suffix arrays or hash table). – From the results of the experiments, a Bayesian estimator can compute the local homology levels for the genome, and use it to verify and sharpen the probabilistic distributions for the next iteration. • The algorithm converges to the true local homology levels after a few iterations. In Silico Experiments • Let b, B = | G2 |/b, w, W, m, N0, N1, . . ., Nk (k ≤ m, and in our applications usually k = 2) be some pre-specified parameters. – Choose k+1 random subsets, S0, S1, . . ., Sk, of words each of length m, randomly (i.i.d. uniform) from G1[a, a+w−1], such that |S0| = N0, |S1| = N1, . . . , and |Sk| = Nk. – Consider a block in the second genome of length b: Bb = G2 [b, b+b−1]. Let X0 (X1, . . ., Xk, respectively) be defined as the number of m-mers from set S0 (S1, . . ., Sk, respectively) that match exactly (with one, two and so on up to k mismatches, respectively) to an m-mer in G2[b, b +b − 1]. Experiment Design • By examining the sensitivity (∂(Xi/Ni)/∂h = a′i[h]) we can divide the interval for h into three intervals: [1/4, θ1] ≈ [1/4, (m−2)/m], (θ1, θ2] ≈ [(m−2)/m, (m−1)/m] and (θ1, 1] ≈ ((m−1)/m, 1], such that the choices of (N0,N1,N2) are based on the following mixed strategies: N0 = (K/b) sq21 pi,I(h) dh N1 = (K/3bm) sq1q2 pi,I(h) dh N2 = (2K/9b(m2 –m)) s1/4q1 pi,I(h) dh …. In Silico Experiments • Thus Xi’s for i in [0..k] are binomially distributed random variables whose parameters depend on the homology level h. • We can estimate the local homology by the following robust estimators: h h | X0, X1, … Xk i = s01 h p(h | X0, …, Xk) dh = s01 h p(h) p(X0, …, Xk|h) dh/ s01 p(h) p(X0, …, Xk|h) dh – Similarly, compute the mean, standard deviation and confidence of the homology function over Bb. – Let b* = arg maxb mean(Bb). Then the homology function is estimated at a by mean(Bb*). Conditional Probabilities ri = b pm j=1i C[m,j] 3j si = j=1i C[m,j] hm−j (1 − h)j bi = (1 − si)(1 − ri) ai = 1 − bi = si + ri − si ri. p(X1, X2, …, XK|h) / j aj Xj bjNj – Xj Initial Priors • Using Jukes-Cantor: the random variable r represents the rate of nucleotide substitution per site per year. – In this model, it is assumed that nucleotide substitution occurs at any nucleotide site with equal frequency and at each site a nucleotide changes to one of the three remaining nucleotides with a probability a per year: r = 3 a. – The substitution rate is often higher at functionally less important sites than at functionally more important sites. – Case 1: r » Exponential(λ): fexp(r) = λe−λr. In that case p(h) »(4h − 1)3λ/8T−1 – Case 2: r » Gamma(λ, ν): f Γ(r) = λn e−λrrn−1/Γ(n). In that case p(h) »(4h − 1)3λ/8T−1 ln[3/(4h − 1)/T]n−1 Initial Priors • A more complex structures arise as we consider multi-parameter models: e.g., Kimura‘s Two-Parameter Method. In this model, the rate of transitional substitution per site per year (a) is assumed to be different from that of transversional substitution (2β). h = (1 − P)(1 − Q) P = (1/4) (1 − 2e−4(a+β)T + e−8βT) Q = (1/2) (1 − e8βT) Initial Priors • Assume that α » Exponential(λa) and β » Exponential(λβ) and they are independent. p(h) = (λa λβ/8T2) s01 (1 − P)2+(λa+λb)/8T(2h + P − 1)(λa−λb)/8T (h − 2P + P2)−λa/4T /(2h + P − 1)(h − 2P + P2) dP. Refining Priors • In iteration i, let us consider an interval I with k homology estimates: h m1, s1 i, h m2, s2 i, …, h mk, sk i • Assume that the homology values h1, h2, . . ., hk is sampled from a distribution h » N(μ, σ2). • Furthermore, we assume the following: μ » N(ξ, t2) r = (σ2 + t2)−1 » Γ(a, β). New Prior – Prior = Kummer‘s hypergeometric function of order 1 f(h|ξ, t, a , β) » s01/τ2 ra-2 1/√(1 − r t2) e−Br dr » 1F1(a − 1, a − 1/2,−B/ t2) ¼ ((h-x)2 + 2 b)/2 t2)-a+1 – Estimates ξ = h μji t 2 = h μ2 i − h μj i 2 a/β = h (σ2i + t2)−1 i a/β2 = h σ2i + t2)−2 i - h σ2i + t2)−1 i2 Optimizing the parameters • The parameter choices are as follows: – Let the number of blocks (b) and the number of windows (w) be chosen a priori based on the needed resolution for homology. – We may choose these parameters so that b = O(p(G2)) and w = O(p(G1)). We assign K = O(1) amount of work to a region defined by a combination of any single block with any single window. – Thus the amount of work is roughly K(G1G2)/(w b) = O(G1+G2) per iteration. – The mer size parameter ‗m‘ is chosen so that the probability of a ―hit‖ in a block containing a homologous sequence much higher than in a random block: (b/4m) << E(h0,G)m. #7 How to identify signatures for cellular states (e. g. healthy vs. diseased) Microarray Analysis • Representations are reproducible Normal DNA Tumor DNA samplings of DNA populations in which the resulting DNA has a new format and reduced complexity. – We array probes derived from low Normal LCR Tumor LCR complexity representations of the normal genome – We measure differences in gene Label copy number between samples ratiometrically Hybridize – Since representations have a lower nucleotide complexity than total genomic DNA, we obtain a stronger specific hybridization signal relative to non-specific and noise Copy Number Fluctuation A1 B1 C1 A2 B2 C2 A3 B3 C3 A MAP (Maximum A Posteriori) Estimators • Priors: – pe is the probability of a particular probe being – Deletion + Amplification ―normal‖. • Data: – pb is the average number of – Priors + Noise intervals per unit length. Max likelihood over (Pe,Pb) • Goal: Find the most plausible 0.10 hypothesis of regional 213.2 28 changes and their associated 4. 3 0.08 236.9 copy numbers • Generalizes HMM:The prior 0.06 Pb depends on two parameters 0.04 30 8.0 pe and pb. (pe,pb) max at (0.55,0.01) 33 1 .7 0.02 355.4 284.3 260.6189.5 213.2 236.9 165.9 142.2 118.5 0.00 0.4 0.5 0.6 0.7 A reasonable choice of priors yields good segmentation. sampling = 5 p_e = 0.35 p_b = 0.01 2 chr = 8 1 Copy # 0 -1 -2 100 300 Probe # 500 700 900 A reasonable choice of priors yields good segmentation. sampling = 5 p_e = 0.55 0.5 p_b = 0.0001 chr = 2 0.0 Copy # -0.5 -1.0 50 300 # 550 Probe800 1050 1300 1550 Prior Selection: F criterion Pf over (Pe,Pb) • For each break we 0.1 0.10 have a T2 statistic and 0.08 0.2 the appropriate tail 0.1 0.3 0.4 probability (p value) 0.2 0 0.6.5 0.3 0.06 0.7 0.8 0.7 calculated from the 0.3 0.4 0.2 0.50.6 Pb distribution of the 0.8 0.7 0.04 statistic. In this case, 0.7 (pe,pb) max at (0.55,0.01) this is an F distribution. 0.3 0.4 0.02 0.6 0.7 0.9 1.0 • The best (pe,pb) is the 0.9 0.8 0.7 0.4 0.6 0.5 one that leads to the 0.00 0.4 0.5 0.6 0.7 Pe maximum min p-value. N1 N 2 N1 N 2 x y 2 T2 1 df1 df 2 x x y y i i 2 j j 2 Current Implementation • NYU Array CGH • Versatile: – Works well for BAC array, ROMA & Agilent – Raw Affymetrix data is noisier, but our new algorithm for ―background correction and summarization (BCS)‖ makes Affy-data significantly better than the competitors. – (BCS also generalizes to gene expression and performs better than RMA, Li-Wong, etc.) • Global Analysis (LOH analysis, detecting TSG and onco-genes) • Fast implementation, with visualization and integration (to be released soon…) Finding Cancer Genes • LOH/Deletion Analysis analysis • Hypothesize a TSG (Tumor Suppressor Gene) • Score function for each possible genomic region containing the TSG – Evolutionary history – Interactions – Parameters • This score can be computed using estimation from data and also prior information on how the deletions arise. We use a simple approximation; we assume there is a Poisson process that generates breakpoints along the genome and an Exponential process that models the length of the deletions. Simulations • Several scenarios: • There are S = 200 • Model1: one TSG [300, 350]. All individuals individuals. Deletions are diseased because of homozygous deletion of the TSG. occur in each individual. • Model2: one TSG [300, 350]; 50% of the A cell having incurred a individuals are diseased because of deletion overlapping homozygous deletion of the TSG and the rest one of the TSGs, will are diseased because of hemizygous deletion of the TSG. multiply quickly and • Model3: one TSG [300, 350]. All individuals generate many copies are diseased because of hemizygous deletion of itself. These copies of the TSG. evolve independently • Model4: two TSGs; 50% of the people are diseased because of homozygous deletion in for a while until we the first TSG and the other half are diseased collect the information. because of homozygous deletion in the • second TSG. Model Simulation #10 Host-Pathogen Interaction BioCOMP/Biospice GOALIE Gene Ontology Algorithmic Logic Invariant Extractor A Picture Biological System: Part-List + Functional Relations Regulatory, Metabolic & Signaling Relations Measurements Redescription Recomputation Descriptive Relation Numerical Traces KRIPKE MODELS Invariants Yeast-Cell Cycle Data: Spellman et al. G0 Invariants G1(I) M G1(II) G2 S Another Example • Pre-Apoptosis 6 time points data analysis • Six time-point data at 2h, 4h, 6h, 8h, 12h, 24h – Cells treated with SEB – Control: untreated cells • Data from Jett-Lab (Walter-Reed) Hypothesized pathway •Note that GOALIE is not intimately tied to any particular ontology: It can be customized to work with other controlled vocabulary: e.g., UMLS, MetaCYC, Reactome… •Thus GOALIE also provides a way to ―compare‖ different ontologies… From the GO web site. The path back to each ontology from a gene. We will call each term in a path a split. Structure of a GO annotation Each gene can have several annotated GOs and each GO can have several splits. E.g. DNA topoisomerase II alpha has 8 GO annotations and 11 splits Time Coarse GO categorization • GOALIE partitions the temporal data in order to understand local behavior. Data are grouped by considering (weighted) windows of time points (2-4-6, 4-6-8, 6-8-12, 8-12-24) • Next GOALIE uses a K-Means algorithm (genesis) to produce clusters for each window • GOALIE then associates to each cluster a set of GO descriptors (with p-values and controlled FDR, false discovery rates) • GOALIE computes ―patterns‖ of GO categories across the time points • All these steps can be run from one unifying interface that GOALIE provides. GOALIE will be embedded inside VALIS. Female, Sings, Overweight X1 Male, Talks, Thin X2 Male, Sings, X3 Overweight X4 Female, Sings, Underweight X3 X4 X3 X2 X1 X1 X4 X2 X3 X4 X3 X2 X1 X1 X4 X2 X3 X4 X3 X2 X1 X1 X4 X2 X2 X4 X3 X1 X3 Finale X4 X2 X1 X2 : O, : FLS X4 X3 X1 X3 Finale X4 X2 : O, : FLS : O, : FLS : O, FLS O X1 : O, FLS X2 : O, : FLS X4 : O UFLS X3 X1 X3 Finale X4 X2 : O, : FLS : O, : FLS : O, FLS O : O UFLS X1 : O, FLS : O UFLS X2 : O, : FLS X4 : O UFLS X3 X1 X3 Finale X4 X2 : O, : FLS : O, : FLS : O, FLS O : O UFLS : O UFLS X1 : O, FLS : O UFLS It ain’t over ‘til : O, : FLS the fat lady sings X2 X4 : O UFLS X3 X1 X3 Finale X4 X2 : O, : FLS : O, : FLS : O, FLS O : O UFLS : O UFLS A[: O UFLS] X1 : O, FLS : O UFLS GOALIE: GO Algorithmic Logic for Invariant Extraction Clusters connection tree Each level a “window” Micro-array accessions GO categories Clusters information Connection information Cluster Information GOALIE: GO Algorithmic Logic for Invariant Extraction GO categories GO categories GO categories describing shared with describing “source” cluster “destination” “destination” but not GO categories cluster but not describing genes cluster “destination” “source” in “source” cluster GOALIE: SEB Analysis Preliminary Results 1. Time Course Window 1 to Time Course Window 2: Connection 1:9 to 2:18. By inspecting the first cluster in the first window (Cluster~1:9), we note that one of the connection to the cluster2 in the second window (Cluster~2:18) is labeled (among many others) by the GO categories ―circulation‖ (GO:0008015), and by the category ―negative regulation of heart rate‖ (GO:0045822). This represents a constant set of biological processes shared by this cluster chain, traversing Cluster 3:17, to Cluster 4:13. 2. Time Course Window 1 to Time Course Window 2: Connection 1:9 to 2:6. The connection between Cluster 1:9 and Cluster 2:6 is interesting because it shows how the category ―regulation of lymphocyte proliferation‖ (GO:0050670) becomes activated in the next time-window (Cluster 2:6), while the categories ―antigen presentation‖ and ―antigen processing‖ became inactive. This should indicate that some of the genes in the clusters start a response to the pathogen in the second time point. Hidden Kripke Model • ―Hidden Kripke Model‖ reconstruction via ontology based redescription of time-sliced clusters of time-course measurements (arrays) offers a novel viewpoint form which formulate biological hypotheses and eventually reconstruct ―biological first principles‖ • The GOALIE tool, in its embryonic state, has already been proven ―correct‖ and offered a wide range of insights into a number of biological datasets • Spellman‘s Yeast Cell Cycle • SEB host-pathogen data from WRAIR • New datasets being analyzed now include • P. falciparum dataset [Bozdech et al, 1(1):085] • Subset of Genome Module Map dataset [Segal et al] Story generation • Temporal Logic formulae Dataset Formula Dataset Generator can be rendered in Dataset Dataset English. • Temporal Logic formulae Formula Formula Formula can be generated automatically (with care). Temporal Logic Analysis • Each formula can be tested against a set of datasets; differences can Natural Language HTML Story Generation file then be noted. #4How to integrate multiple high-throughout data types dependably; How to visualize & explore large- #5 scale, multi- dimensional data Chr1 Ns ATs Reps MER57A L1P CDs ΔG Dup Copy # Mer Freq VALIS vast¢active¢ living¢intelligence¢system Databases • Sequence • Bioperl – GENBANK, EMBL, DDBJ • International open- – SWISS-PROT, GenPept, source collaboration TREMBL, PIR • 7 Years of development – PDB, SCOP,… • 600 Modules • Genetic (Flybase, Wormbase, GDB, AtDB, SGD,…) • 100,000 lines of code • Expression (RNA • Easy-to-use, stable, and Expression from microarrays,...) consistent programming • Metabolic (KEGG, WIT,…) interface for bioinformatics application • Literature (MEDLINE,…) programmers BioPerl Module Statistics Histogram of Module Sizes 50 Number of Modules 40 mean 144 30 20 10 0 0 100 200 300 400 500 600 700 800 900 1000 1100 Real Lines of Code Bioinformatics Data • Majority of the data • Example: GoldenPath kept in (indexed) flat /UCSC Genome Browser files & Relational • Rough Estimates: Databases – 800 Tables • Flat Files – 14000 fields – Unstructured/Difficult to – 1600 varchars [most update varchar(255) ] • Relational Databases – 1300 blobs – Strings are atomic • Blobs: objects – exonStart, ExonEnd, – Blobs qStarts, tStarts, … Key Feature of Valis • State of the art of rapid prototyping in bioinformatics • Multilanguage Scripting • Data storage • Graphical User interfaces Multi-Scripting • A Valis script can be written in any supported language: – JScript, VBScript, Python, PERL, Lisp, R and SETL. – All the scripts see the same Valis class hierarchy. – For example, once a user learns that a Valis Sequence Object has a method called Input that will read the sequence from a file, the user can subsequently use this same primitive from all the different languages. Data Storage #4 • Based on Extended B-Trees • At the lowest level there is an Heap of pages • Must correctly keep track of the reference counts of each record/object to implement value semantics Visualization #5 • Once the processing is completed, it is very important to be able to quickly visualize the results. • For this reason Valis provides numerous visualization tools that allow a user to quickly display – sequences, maps, – microarray data, – tables, – graphs and annotations. • These widget can be customized from the scripts. How to convert static network maps into dynamic mathematical models; How to predict protein function ab initio; How to build hierarchical models across multiple scales of time & space; How to reduce complex multi- dimensional models to underlying principles #6 Glycogen P_i #9 Glucose Glucose-1-P Phosphorylase a Glycolysis Phosphoglucomutase #10 Glucokinase Glucose-6-P Phosphoglucose isomerase SIMPATHICA Fructose-6-P Phosphofructokinase SimPathica System Simpathica is a multi-functional system Front End Analysis Model data structures XSSYS (TL) Equations Handling Time/Frequency Octave/Matlab Combined TL/TF code generation Simulation Dash- Visualization Matlab Octave board PtPlot gnuplot mArray DB NYUMAD Simpathica is a modular system Canonical Form: n m n m X i a i X j bi X j ij i 1 n g ij h j 1 j 1 nm C ( X (t ),, X (t )) (g l X j lj ) 0 f l 1 n m j 1 Characteristics: • Predefined Modular Structure • Automated Translation from Graphical to Mathematical Model • Scalability Purine Metabolism • Purine Metabolism – Provides the organism with building blocks for the synthesis of DNA and RNA. – The consequences of a malfunctioning purine metabolism pathway are severe and can lead to death. • The entire pathway is almost closed but also quite complex. It contains – several feedback loops, – cross-activations and – reversible reactions • Thus is an ideal candidate for reasoning with computational tools. Simple Model Biochemistry of Purine Metabolism • The main metabolite in purine biosynthesis is 5-phosphoribosyl-a-1- pyrophosphate (PRPP). – A linear cascade of reactions converts PRPP into inosine monophosphate (IMP). IMP is the central branch point of the purine metabolism pathway. – IMP is transformed into AMP and GMP. – Guanosine, adenosine and their derivatives are recycled (unless used elsewhere) into hypoxanthine (HX) and xanthine (XA). – XA is finally oxidized into uric acid (UA). Purine Metabolism Queries • Variation of the initial • Persistent increase in the initial concentration of PRPP does concentration of PRPP does not change the steady state. cause unwanted changes in the (PRPP = 10 * PRPP1) steady state values of some metabolites. implies steady_state() • If the increase in the level of • This query will be true when PRPP is in the order of 70% evaluated against the then the system does reach a modified simulation run (i.e. steady state, and we expect to the one where the initial see increases in the levels of concentration of PRPP is 10 IMP and of the hypoxanthine times the initial pool in a ―comparable‖ order of concentration in the first run magnitude. – PRPP1). Always (PRPP = 1.7*PRPP1) implies steady_state() TRUE TRUE Queries • Consider the following • In fact, the increase in IMP is statement: about 6.5 fold while the • Eventually hypoxanthine pool increase is (Always (PRPP = 1.7 * PRPP1) about 60 fold. implies • Since the above queries turn steady_state() out to be false over the and Eventually modified trace, we conclude Always(IMP < 2* IMP1)) that the model ―over-predicts‖ and Eventually (Always (hx_pool < 10*hx_pool1))) the increases in some of its products and that it should • where IMP1 and hx_pool1 are the values observed in the therefore be amended unmodified trace. The above statement turns out to be false over the modified experiment trace.. False Final Model Purine Metabolism Quantum leaps ―provoke creative dreaming‖ Shotgun Mapping • Schematics – Experiment Design – Robotics – BioChemistry – Imaging – Image Analysis – Statistical Algorithms – Visualization Shotgun Optical Mapping of Genomes Gentig‘s Successes E. coli • E. coli • P. falciparum ¦ D. radiodurans ¦ Y. Pestis • Rhodobacter sphaeroides ¦ Shigella flexneri ¦ Salmonella enterica • Aspergillus fumigatus • … P. falciparum • The automated Gentig system is routinely used – to map a microbe genome quickly & effortlessly – by a scientist with no quantitative or computational training. Y.Pestis Some Interesting Applications • Sequence Validation • Haplotyping • Sequencing • Comparative Genomics • Rearrangement events • Hemizygous Deletions • Epigenomics • Characterizing cDNAs – Expression Profiling – Alternate Splicing Sequencing in Post-Genomic Era • Haplotypic Sequencing of 6.6 Billion Base Pairs in a Diploid Human genome • Less than $700 • Less than 24 Hours • Draft Quality (Not Resequencing) Ingredients • Single Molecule Optical Mapping – Methylation Sensitive Restriction Enzymes – Multiple-Enzyme Maps • Probe Hybridization on the Surface – PNA, LNA, TFO • Sequencing by Hybridization – Localize algorithms PSBH problem • The PSBH problem: – Can overcome the complexity issues associated with the SBH problem. – For each L-mer probe, in addition to knowing whether it hybridizes with the unknown sequence (with or without count) we are provided constraints on the location of the L-mer in the sequence: – The constraint takes the form of a set of permissible locations for each L-mer (which need not be contiguous). – However it is known that if the constraint limits each L-mer to no more than two exact locations on the sequence, then it admits an efficient algorithmic solution—If 3 or more locations are possible the reconstruction problem becomes NP-complete. – However if the location constraints are in the form of K contiguous locations then the reconstruction problem is exponential in K, rather than the sequence length m. Multiple PSBH problem • For our data set of probe maps for all 6-mers, we have multiple instances of each L-mer, for L=6 (about one every 4Kb on each strand of the DNA) in the sequence, and for each instance we can constrain the location to within about 200 base pairs depending on the optical resolution. • A special case of the PSBH problem, which we will call the ―Multiple Positional Sequence by Hybridization‖ (Multiple PSBH) problem, where we have separate constraints for each of the multiple instances of each L-mer. – By focusing on a small window of about 2000bp, in which most L-mers will occur only once, we can solve the standard PSBH problem where separate constraints for multiple instances of each L-mer are not important. – However if we solve each local PSBH problem for each 2000bp window separately, the worst case exponential time reconstruction is unlikely to apply to most windows, so if we simply limit the amount of time spent in each window to some upper bound linear in the window size, we can still be able to reconstruct the sequence in most windows in linear time. Initial Experiments The end The end