Pathways Databases: A Case Study in Computational Symbolic Theories
Author: Peter D. Karp
Presented by: Gunjan Gupta
April 13, 2004

Overview
1. What is gene co-expression and why discover gene co-expression?
2. Using DNA microarray data for discovering functional gene co-expression.
3. Discussion on how to use phylogenetic information to filter out false positives in the microarray data.
4. Detailed discussion of the proposed solution.

A *Complete Scientific Theory of a simple organism is very complex
A “simple” organism such as E. coli is very complex. For example, for E. coli:
- 791 chemical compounds and 744 enzyme-catalyzed biochemical reactions for the metabolic pathways.
- 4290 “protein coding” genes.
- A perhaps larger number of mostly unknown RNA-encoding genes.
*Explains all the observed processes in an organism.

Too complex (large) for a single scientist to grasp
More complex than a factory producing cars: imagine a car factory that not only produces cars, but also produces other factories of its own kind, and everything else needed to produce, reproduce, repair, fend off competition, fight foreign attacks, adapt to a changing environment, and evolve!

Traditional methods of representing theoretical “knowledge”
1. Technical papers in journals.
2. Diagrams, flowcharts, and tables.
3. Mostly natural-language descriptions, more suitable for humans.
4. Raw and partially processed data in large biological databases, for example indexed by search criteria or annotated in English.

Traditional methods not enough, why?
1. They cannot represent scientific theories for biological systems: too complex.
2. Automatic deductions are not possible because the theory is not stored in a structured, computer-readable representation.
3. Natural language processing is not yet developed enough to allow a computer to search human-readable data.
4. Only bits and pieces are understandable by one person at any given time.
5. Large-scale patterns and deductions might never be found in the absence of any one individual who can see the “big picture”.
6. Addition of details from experimentation does not necessarily improve our understanding of the whole system.

Pathway Databases for Biology
1. A database of scientific theory storing “qualitative information”: the semantics of the theory in a well-defined form.
2. Useful for representing theories that are mostly non-numerical in nature; a biological system is well suited.
3. Definition: a pathway is a linked set of biochemical reactions, linked as follows:
[Figure: example pathway diagram. Labels as extracted: Reaction 1, Reaction 2, Enzyme 3, Enzyme 4, Raw Material 5, Reaction 6, Substrate 7, Chemical 8. Annotation: theory of relationships between 1, 2, 3, 4, 5, 6 & 7 defined.]

Different Kinds of Biological Pathways Database (PDB)
1. Only metabolic pathways (majority) (time = generally fast)
   - e.g. production of ATP in the mitochondria
2. Signaling pathways (time = generally medium)
   - Intra-cellular: e.g. cell membrane to nucleus
   - Extra-cellular: e.g. growth factors for nerve or skin, say for injury
3. Genetic-regulatory pathways (time = generally medium/slow)
   - Gene expression control: make more as needed, for example synthesize amino acids deficient in food by increasing expression of related genes.
4. A combination of one or more of the above (e.g. PAD, EcoCyc).

Pathways Databases: an intersection of 4 fields
Genomics, Biochemistry, Databases, Artificial Intelligence

Examples of current PDBs
• Metabolic pathway database: EcoCyc (author): http://www.ecocyc.org/
• Signaling pathways: SPAD: http://www.grt.kyushu-u.ac.jp/spad/
• Genetic-regulatory pathways: BIOBASE: http://www.gene-regulation.com/

EcoCyc Project
• EcoCyc: Encyclopedia of Escherichia coli K12 Genes and Metabolism
  - At http://www.ecocyc.org/
  - Links to a bigger site, http://biocyc.org/server.html, containing PDBs for other organisms, including human.

What Metabolic & Genetic Pathway Info does EcoCyc Contain?
For each enzyme in the PDB:
• Detailed description of the reaction catalyzed by each enzyme.
• Genes to which the enzymes map, if available.
• The range of substrates the enzyme would accept.
• Chemicals that inhibit or activate the enzyme.
• Its subunit structure.
• Each small-molecule enzyme substrate.
Pathway types included for E. coli:
- Biosynthesis of cellular building blocks.
- Extraction of carbon from food.
- Extraction of chemical energy from food.

Tools & Visualization in EcoCyc
1. SRI International (http://www.sri.com/) developed the visualization and search tool called Pathway Tools as an intelligent user interface.
2. Allows the user to exploit the semantic information in the PDB and write complex queries.
3. Visualize results as a pathway graph called the Overview Diagram.
4. A variety of criteria for the queries:
   - Name matching
   - Classification hierarchy (taxonomy, metabolic pathways)
5. Example:
   - Find all reactions that are activated or inhibited by a given metabolite.
6. Superimposing genetics data on the visualization (see demo).
7. User can create a PGDB for a new organism and share it.

Visualization
[Figure: Overview Diagram with callouts “Metabolite categories”, “Individual Reactions”, and “Glycolysis region”.]

An application: “automated” inter-species comparison of reactions
1. Yellow shows reactions in E. coli that match those in another species, S. cerevisiae, found as a result of database lookup.
2. Mostly automated layout using Pathway Tools (some manual fitting was done by the author in this paper for this diagram).

Exploiting Knowledge Representation AI tools in a PDB
1. Building an ontology: a DB schema defining the precise relationships between entities. Use UML?
2. Encode a theory using the ontology.
3. The EcoCyc ontology consists of 1000 object-oriented classes encoding key concepts of biology and biochemistry.
4. Extend the ontology when new concepts are found that cannot be derived using the existing ontology.
5. Use KR techniques that exploit symbolic AI reasoning to, say:
   • Build an inference engine (see Bruce Porter’s Knowledge Machine, for example) on the PDB, or
   • Perform specific global inferences using specific relationship searches.

Example of Symbolic Inferences on EcoCyc
• Results of a search changed the simplistic notion of what a gene is:
  - In E. coli, 1 out of 7 enzymes was found to catalyze more than one reaction, and in almost 1 out of 7 cases a reaction is used in more than one pathway (overlapping sub-graphs or clusters). This can be treated as a discovered theory.
• Characterized the transcription factor relationships (see next slide).
• Other interesting theories found using the PDB:
  - Scale-free network topology that follows a power law for both metabolic and genetic networks.
  - Deletion of proteins with high connectivity is more likely to kill the organism.

Example 2: Characterizing transcription factors’ “inter-relationship” in a genetic pathway
[Figure: callouts “Most do not regulate themselves or other transcription factors” and “Just two dominate the relationships, making the tree very shallow”.]

EcoCyc Demo
• Demo 1: Search demo
  - Go to http://www.ecocyc.org/
  - Click on DB search.
  - Search for “Glycolysis” and show to class if time permits.
• Demo 2: Combining pathways with gene expression data
  - Go to http://www.ecocyc.org:1555/expression.html
  - Specify the data file as http://biocyc.org/coli.dat (might have to save it locally to work).
  - Select “absolute” display level.
  - Enter ratio for numerator as 1.

Issues/Comments
1. The idea itself is quite powerful: combining sequence data with DNA array data. Results seem to be quite good, but...
2. Too many steps: hard to quantify the results except empirically, because of the complexity. At least 5 levels of successive transformations: (1) genes to meta-genes via BLAST, (2) Pearson correlation, (3) order-based probability, (4) network thresholding, (5) clustering.
3. Some heuristics are not explained much. For clustering, for example, the probabilities are transformed into 2-d space even though the original problem was a graph: why not directly partition the graph?
4. Limited quality of clusters because of the 2-d translation. Not clear why and how the data fits into a 2-d space. Not clear if the translation using P-value is a metric.
5. The paper was obviously not written by plain computer scientists: lots of interesting discoveries and analysis after the method was used.

One line summary of the paper
Using phylogenetics to filter out gene co-expressions in microarray data that are not functionally relevant.
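
Sketch: symbolic queries over a toy pathway database
The slides mention queries such as "find all reactions that are activated or inhibited by a given metabolite", and the findings that some enzymes catalyze more than one reaction and some reactions are used in more than one pathway. The following is a minimal Python sketch of how such relationship searches might look. It is not the EcoCyc schema or the Pathway Tools API: the Reaction class, the PATHWAYS table, and all names (rxn_1, enz_A, met_X, pathway_1, ...) are hypothetical toy data invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """Hypothetical record for one reaction; not the EcoCyc ontology."""
    name: str
    enzymes: tuple = ()      # enzymes that catalyze this reaction
    activators: tuple = ()   # metabolites that activate it
    inhibitors: tuple = ()   # metabolites that inhibit it


# Toy data (assumed, not real biology): pathway name -> its reactions.
R1 = Reaction("rxn_1", enzymes=("enz_A",), inhibitors=("met_X",))
R2 = Reaction("rxn_2", enzymes=("enz_A", "enz_B"), activators=("met_X",))
R3 = Reaction("rxn_3", enzymes=("enz_C",))
PATHWAYS = {"pathway_1": [R1, R2], "pathway_2": [R2, R3]}


def multi_reaction_enzymes(pathways):
    """Return enzymes that catalyze more than one distinct reaction."""
    reactions_of = defaultdict(set)
    for reactions in pathways.values():
        for rxn in reactions:
            for enzyme in rxn.enzymes:
                reactions_of[enzyme].add(rxn.name)
    return sorted(e for e, names in reactions_of.items() if len(names) > 1)


def shared_reactions(pathways):
    """Return reactions that are used in more than one pathway."""
    pathways_of = defaultdict(set)
    for pathway_name, reactions in pathways.items():
        for rxn in reactions:
            pathways_of[rxn.name].add(pathway_name)
    return sorted(r for r, names in pathways_of.items() if len(names) > 1)


def regulated_by(pathways, metabolite):
    """Return reactions activated or inhibited by the given metabolite."""
    hits = set()
    for reactions in pathways.values():
        for rxn in reactions:
            if metabolite in rxn.activators or metabolite in rxn.inhibitors:
                hits.add(rxn.name)
    return sorted(hits)


if __name__ == "__main__":
    print(multi_reaction_enzymes(PATHWAYS))
    print(shared_reactions(PATHWAYS))
    print(regulated_by(PATHWAYS, "met_X"))
```

Because the relationships are stored as structured records and not as prose, each query is a short traversal. That is the point the slides make about computer-readable theories; a real PDB would run such searches against its own ontology.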