"Data Mining Poster"
Data Mining Tools for Curation of the Human Metabolome Database

Savita Shrivastava, Craig Knox, Paul Stothard, Russ Greiner, David Wishart
University of Alberta, Edmonton, Canada

Abstract: The Human Metabolome Database (HMDB) contains more than 1400 metabolite entries, each consisting of more than 80 data fields. Obtaining and evaluating the contents of these data fields has required the development of several custom software tools. These data mining programs extract information from several publicly accessible databases (KEGG, PubChem, PubMed, MetaCyc, ChEBI, PDB, Swiss-Prot, GenBank), and generate a series of web-based reports. These reports, by combining the results obtained from several independent sources, provide a useful means for evaluating the reliability of the metabolite information that is added to the HMDB. The HMDB is regularly updated as additional data becomes available and as source databases and data mining methods improve.

Introduction

The extensive information stored in the HMDB has been assembled by a team of curators using a collection of custom data mining programs developed specifically for building and updating the HMDB. These software tools use sequence and text comparison algorithms to obtain up-to-date metabolite information from some of the most reliable and complete resources. Two of the HMDB data mining tools, MetaboBuilder and MetabolizingInfo, are discussed below.

Building the MetaboCards

The HMDB contains more than 1400 metabolite entries, each consisting of over 80 data fields. The data pertaining to each metabolite is accessible as a “MetaboCard”. The MetaboCard serves as a curator-friendly summary of the current metabolite annotations stored in the HMDB (Fig. 1). The initial set of MetaboCards is assembled using a data mining program called MetaboBuilder, which searches a variety of databases using sequence and keyword queries. The results of each search are evaluated to determine whether they are relevant to the metabolite in question or should be discarded. MetaboBuilder also coordinates the updating of fields that are calculated from the contents of other fields, such as protein molecular weight and protein isoelectric point. The content that is gathered and generated by MetaboBuilder is stored in a relational database and in a flat file database to facilitate curator review.

Fig. 1 Data stored in the HMDB is available to users and curators in the form of MetaboCards. The cards are generated by a data mining program that retrieves information from several external and internal databases and scripts. Whenever possible, the contents of the MetaboCards are hyperlinked to additional information to aid in the curation process.

Evaluating Metabolizing Enzymes

Each of the automatically generated MetaboCards is reviewed by curators who look for missing or incorrect information. To assist the curators, the HMDB development team has prepared several tools that obtain information from additional resources, using data mining approaches that differ from those used to build the MetaboCards. One of the programs, called MetabolizingInfo, is used to evaluate the content of the MetaboCards relating to metabolizing enzymes. Currently more than 3,000 protein (and DNA) sequences are linked to the metabolite entries. The MetabolizingInfo program uses the name of each metabolite and its known synonyms to obtain publications from PubMed, metabolizing enzymes from Swiss-Prot, and metabolite and metabolizing enzyme information from KEGG. The searches are conducted using a combination of WWW agents and public database APIs. All of the retrieved information is ranked using a scoring system and presented to the curator as an HTML document (Fig. 2). Each of the entries in the document is hyperlinked to a complete database record (Fig. 3).

Fig. 2 The MetabolizingInfo program uses text-based searches to retrieve information from Swiss-Prot, PubMed, and KEGG. Records that pass a scoring cut-off are presented in a colour-coded HTML table. The table for corticosterone is shown above. Each external record ID is hyperlinked to its corresponding record for curator review. Some of these records are shown in Fig. 3.

Fig. 3 The HMDB data mining tools, such as the MetabolizingInfo program, provide web-based reports for human curators. These reports contain hyperlinks to records in a variety of external databases, including Swiss-Prot, PubMed, and KEGG. Shown above is a Swiss-Prot record, PubMed abstract, KEGG compound record, and KEGG enzyme record obtained for corticosterone. By using a combination of automated data mining and manual curation, the HMDB aims to be a comprehensive and reliable database of human metabolites.

Updating the HMDB

The HMDB will never be a “finished” database, since new research is always providing additional data. Furthermore, the HMDB data mining tools and curators constantly scrutinize and update existing content. The HMDB is available at http://www.hmdb.ca. We encourage users to provide us with their feedback.
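
The ranking step described for MetabolizingInfo (score each retrieved record, keep those above a cut-off, and present them as a hyperlinked HTML table) can be illustrated with a minimal Python sketch. The poster does not describe the actual scoring system, so the score used here (the fraction of metabolite names and synonyms found in a record's text) is an assumption, as are the names Record, score, build_report, and the default cut-off of 0.5. The sketch takes already-retrieved records as input and performs no database queries.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Record:
    """One retrieved record (hypothetical structure, not the HMDB's own)."""
    source: str     # e.g. "Swiss-Prot", "PubMed", "KEGG"
    record_id: str  # external record ID shown to the curator
    url: str        # link to the complete database record
    text: str       # title or description returned by the search


def score(record, names):
    """Assumed score: fraction of names/synonyms that occur in the record text."""
    text = record.text.lower()
    hits = sum(1 for name in names if name.lower() in text)
    return hits / len(names) if names else 0.0


def build_report(metabolite, synonyms, records, cutoff=0.5):
    """Rank records by score, drop those below the cut-off, render an HTML table."""
    names = [metabolite, *synonyms]
    scored = sorted(
        ((score(r, names), r) for r in records),
        key=lambda pair: pair[0],
        reverse=True,
    )
    rows = [
        f"<tr><td>{escape(r.source)}</td>"
        f'<td><a href="{escape(r.url)}">{escape(r.record_id)}</a></td>'
        f"<td>{s:.2f}</td></tr>"
        for s, r in scored
        if s >= cutoff
    ]
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

Calling build_report with a metabolite name, its synonyms, and a list of Record objects returns the table as a string, with the highest-scoring records first. A production tool would likely weight sources differently and add the colour coding mentioned in the Fig. 2 caption; neither is attempted here.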
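
The derived fields that MetaboBuilder keeps up to date, such as protein molecular weight, are calculated from other fields. As a rough illustration of that idea, the sketch below recomputes an average molecular weight from a protein sequence using standard average residue masses. The entry layout and the field names "sequence" and "molecular_weight" are hypothetical, the function names are invented for this example, and the isoelectric point calculation is not covered.

```python
# Average residue masses in daltons for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water per chain for the free N- and C-termini


def protein_molecular_weight(sequence):
    """Average molecular weight (Da) of an unmodified protein sequence."""
    seq = sequence.strip().upper()
    unknown = set(seq) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    if not seq:
        return 0.0
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


def refresh_derived_fields(entry):
    """Return a copy of the entry with the derived field recalculated.

    'sequence' and 'molecular_weight' are hypothetical field names.
    """
    updated = dict(entry)
    updated["molecular_weight"] = round(
        protein_molecular_weight(entry["sequence"]), 2
    )
    return updated
```

Recomputing the field from the sequence on every update, rather than editing it by hand, keeps the stored value consistent with the field it depends on.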