datamining in bioinformatics

Shared by: Mark Hardigan
posted:
1/22/2009
Data Mining and Bioinformatics

Sebastian Kropp
27 May 2004
Monash University
Faculty of Information Technology
Caulfield, VIC

Abstract

This paper looks at the use of data mining in the domain of bioinformatics. Knowledge-discovery techniques are becoming more and more important as the amount of collected data increases. Future progress in biology is made possible by advances in machine learning. The broad use of data mining techniques and their applicability in the different areas of bioinformatics are evaluated. The areas include the Genome project, the prediction of protein structures and the struggle of neurobiology to understand the human brain.

Contents

1 Introduction
2 Brain Functionality
3 Protein structure prediction
4 Discussion and conclusions

1 Introduction

Computer science and biology fuse in the relatively new discipline of bioinformatics. This interdisciplinary work is driven by the need to analyse and make sense of the vast amount of data that is produced when biological systems are studied.

Data mining has already been applied successfully to business problems. Insurance companies assess insurance risks [1], and other highly competitive markets like the telecommunication industry use data mining to predict customer churn. Similar knowledge-discovery methods are used throughout the economy to optimise productivity, and data mining as a tool for optimisation is fairly well understood in this area. Researchers are now trying to carry these positive experiences over to science.

Science, and especially biology, produces vast, complex and noisy data of unprecedented proportions. An example is the human genome project. The sequence of the whole human DNA poses a new challenge for data mining and computer science. Many data mining and machine learning algorithms are computationally expensive, some with exponential complexity, and sometimes require parallel computation.

Before we take a look at examples of data mining in biology, we need to define what it actually means.
The term data mining, also known as Knowledge Discovery in Databases (KDD), is explained in the book Principles of Data Mining [2] as “The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” and “The science of extracting useful information from large data sets or databases”. This definition is quite general. In some cases it is extended to include every means of knowledge extraction that helps to gain the best possible understanding of the data.

There are many ways in which data can be exploited. Data mining can be divided into two main groups: supervised and unsupervised techniques. Supervised algorithms require prior knowledge of and experience with the data, typically in the form of examples whose outcome is already known. Classification and decision trees are examples of this approach and can be used to verify a hypothesis. Unsupervised techniques do not need such knowledge; they discover relations on their own. Clustering, for example, detects similarities without any prior labelling. The Apriori algorithm, which finds frequent itemsets and association rules, is another fundamental data mining technique. Such statistical approaches usually lack the ability to detect non-linear relations but provide understandable results (decision trees, for example). Newer advances in artificial intelligence, like neural networks and genetic algorithms, support the pattern recognition process in finding non-linear relations.

There are a lot of patterns in biology which are not understood, and data mining helps to discover novel and hopefully useful information. Data mining is used in the prediction of gene relations in a genome, the understanding of region activation in the brain and the prediction of protein folding resulting from changes in the DNA.

2 Brain Functionality

The understanding of the human brain and of the functional composition of brain activities is one of the challenging tasks of biology today. Research in this area is heavily dependent on image recognition. Functional Magnetic Resonance Imaging (fMRI) is used as the basis of data retrieval.
The resulting 3D images show locations (Regions of Interest, RoI) of increased brain activity. Two kinds of functional associations in the human brain are of interest in the international study called “Computationally Intelligent Methods for Mining 3D Medical Images” [3]. The first is the association between damaged brain regions and the resulting neuropsychological deficits. This might help to assess probable damage before brain surgery. The second is to identify activation patterns for different tasks: subjects are asked to perform different tasks and the activation of brain regions is measured. This helps to identify the regions necessary for a specific task (for example, learning).

Current techniques are either too computationally expensive or not accurate enough. The study [3] tackles that problem in two ways. Adaptive recursive partitioning is used to reduce the domain, and a neural network is used to classify the resulting data. To identify discriminative regions in Alzheimer's disease patients, statistical methods, adaptive statistical methods and neural networks are compared. The neural network outperformed both statistical methods in the accuracy with which it predicted the affected regions.

3 Protein structure prediction

The aim of protein structure prediction is to determine the three-dimensional structure of proteins from their amino acid sequence [4]. Combining this information with knowledge of the structure of useful proteins leads to rational drug design and speeds up drug research. Determining protein structures experimentally is tedious and expensive, and molecular spectroscopy is needed to verify the resulting structure. Some factors make it extremely difficult to predict the structure. The most important is probably that molecular physical stability is not fully understood. This is where prediction comes into play, since generating the structures in simulation is not possible.

Mohammed J.
Zaki [5] has written an interesting paper called “Mining Protein Contact Maps”. The sequence of amino acids (the linear structure) determines the way a protein is folded. Since the physical model behind this is not understood, similarities between sequences and their three-dimensional structures can help to understand and predict the structural outcome of a protein. Such data-driven approaches are generally useful when the physical model is not understood.

The Protein Data Bank has records of the position of each atom in a known protein. Clustering, classification, association rules, hidden Markov models and many more data mining algorithms are applied to predict the structure a sequence will produce. These heuristic approaches deliver only a probability and not a certainty, which seems to be enough for now. Unarguably, knowing the physical model would lead to exact results. But even if the model were known, simulating the construction of a protein would be very complex. The probabilistic approach yields results faster.

These methods are applied to protein contact maps, which are matrices recording which amino acids of a protein are in contact with each other. Zaki used the hidden Markov model HMMSTR to predict whether two amino acids are likely to be in contact.

4 Discussion and conclusions

Data mining in bioinformatics has a revolutionary impact on biology. Research that does not apply data mining methods where the model is not known might miss essential discoveries. The data in genome and protein databases is growing constantly. New clusters of computers are crunching through quantities of numbers like never before. This has in turn led to new approaches in data mining, optimising the algorithms, and combinations of them, that are thrown at the biological data.
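
As a concrete aside on the protein contact maps described in Section 3, the following minimal Python sketch shows how such a matrix could be built from residue coordinates. It is an illustration under stated assumptions, not the method used in [5]: the function name contact_map, the 8.0 distance cut-off and the toy coordinates are all hypothetical choices made here, and real work would read atom positions from Protein Data Bank records instead.

```python
import math


def contact_map(coords, threshold=8.0):
    """Build a symmetric 0/1 contact matrix from 3D points.

    coords    -- one (x, y, z) tuple per residue (assumed representation)
    threshold -- distance cut-off; 8.0 is an assumed value, not from [5]
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Two residues count as "in contact" when they are close in space.
            if math.dist(coords[i], coords[j]) <= threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap


# Hypothetical toy coordinates, one point per residue; not real PDB data.
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (20.0, 0.0, 0.0),
]
toy_map = contact_map(toy_coords)
for row in toy_map:
    print(row)
```

In this toy case residues 0, 1 and 2 come out as mutual contacts, while residue 3 is too far from all the others. A predictor such as the hidden Markov model mentioned in Section 3 would try to estimate these entries from the sequence alone.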
Advances in artificial intelligence play a growing role in those techniques, since in most cases the data is not understood, and self-organizing maps (neural networks) and genetic algorithms continuously search for similarities and optimisations in an unsupervised manner.

References

[1] C. Apte, E. Grossman, E. Pednault, B. Rosen, F. Tipu, and B. White. Probabilistic estimation based data mining for discovering insurance risks. Technical report, IBM Corporation, Yorktown Heights, NY, September 1999.
[2] D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, Cambridge, 2001.
[3] Despina Kontos, Vasileios Megalooikonomou, and Filia Makedon. Computationally intelligent methods for mining 3D medical images. Technical report, Temple University, Department of Computer Science, Dartmouth College, University of the Aegean, 2002.
[4] Wikipedia. Protein structure prediction. World Wide Web page, http://en.wikipedia.org/wiki/Protein_structure_prediction.
[5] Mohammed J. Zaki. Mining protein contact maps. Technical report, Rensselaer Polytechnic Institute, Computer Science Department, 2000.