Bioinformatics
PVPSIT, Vijayawada, 30th September 2010
Allam Appa Rao, JNTUK
10/12/2010 Allam Appa Rao 1

Dijkstra: "Computer Science is no more about computers than astronomy is about telescopes." http://www.quotationspage.com/quote/788.html

Socrates taught us, the essence of Science is measuring, counting, and weighing together with reasoning from postulates or axioms.

Knuth: “Science is what we understand well enough to explain to a computer.”

"If you want to understand life, don't think about vibrant, throbbing gels and oozes, think about information technology." --- Richard Dawkins, University of California, Berkeley; Oxford University. The Blind Watchmaker, 1986, Norton, p. 112.

IT
Over the past few decades, rapid developments in molecular research technologies (MRT) and in information technologies (IT) have combined to produce a tremendous amount of information related to molecular biology.

Bioinformatics: Definition
Bioinformatics is the application of information technology and computer science to the field of molecular biology. The term bioinformatics was coined by Paulien Hogeweg in 1979 for the study of informatic processes in biotic systems.

Bioinformatics: Applications
Bioinformatics focuses on developing and applying computationally intensive techniques such as pattern recognition, data mining, machine learning algorithms, and visualization.

Bioinformatics: Activities
Common activities in bioinformatics include mapping and analyzing DNA and protein sequences, aligning different DNA and protein sequences to compare them, and creating and viewing 3-D models of protein structures.

DM: Definition
Diabetes Mellitus is a condition in which the body either does not produce enough, or does not properly respond to, insulin, a hormone produced in the pancreas.
Bioinformatics: Entailment
Bioinformatics entails the creation and advancement of databases, algorithms, computational and statistical techniques, and theory to solve formal and practical problems arising from the management and analysis of biological data.

Bioinformatics: Research
Major research efforts in the field include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, genome-wide association studies and the modelling of evolution.

DM: Epidemic
Diabetes Mellitus (DM) affects about 10-15% of Indians and is assuming epidemic proportions. Newer, robust therapeutic approaches for both its prevention and its treatment are needed.

Our Work: BDNF
• BDNF is a natural compound that is present in the human body, and hence therapeutics developed using this molecule are expected to have fewer side effects.
• If BDNF can indeed prevent and ameliorate DM, it would pave the way to newer therapeutic opportunities.
• BDNF as a therapeutic tool for the prevention and treatment of DM.

Protein folding?
The ability of a protein to fold reliably into a pre-determined conformation, despite a near-infinite number of possibilities, is still poorly understood despite much research. The structure of a protein is determined purely by the amino acid sequence, and the structure of the protein determines the function. The function of a protein depends entirely on the ability of the protein to fold rapidly and reliably to its native structure. Many proteins fold spontaneously into their native structure in aqueous solution.
It has been suggested that for a protein of 100 amino acids, a purely random conformational search would require around 10^29 years, and yet proteins are able to fold on a timescale of milliseconds to seconds. This suggests that only a small amount of conformational space is sampled during the folding process, and this in turn implies the existence of kinetic folding pathways. This paradox of how proteins fold rapidly and reliably to their native conformation is known as the protein folding problem.

Statistics is the essence of making sense

George Gamow (1904-68)
Application of Shannon’s information theory breaks genetics and molecular biology out of the descriptive mode into the quantitative mode.

Shannon - Information Flow
Diagram labels: Gene, P, Protein.
Information flow in an information theoretical context is the transfer of information from a variable h to a variable l in a given process. The measure of information flow, p, is defined as the uncertainty before the process started minus the uncertainty after the process terminated. This can be quantified as

$$ p = H(h \mid l) - H(h \mid l') $$

where H(h | l) is the conditional entropy (equivocation) of variable h (before the process started) given the variable l (before the process started), and H(h | l') is the conditional entropy (equivocation) of variable h (before the process started) given the variable l' (the value of variable l after the process finished). The conditional entropy can be written as

$$ H(X \mid Y) = H(X, Y) - H(Y) $$

where H(X,Y) is the joint entropy, and can be calculated as follows (P(x, y) is the joint probability; a base-2 logarithm gives the result in bits):

$$ H(X, Y) = -\sum_{x} \sum_{y} P(x, y) \log_2 P(x, y) $$

Information in living organisms
One of the prime characteristics of all living organisms is the information they contain for all operational processes.
Braitenberg, a German cybernetist, has submitted evidence ‘that information is an intrinsic part of the essential nature of life.’ The transmission of information plays a fundamental role in everything that lives.
• Without a doubt, the most complex information processing system in existence is the human body. If we take all human information processes together, that is, conscious ones (language, information-controlled functions of the organs, hormone system), this involves the processing of 10^24 bits daily.
• This astronomically high figure is higher by a factor of 1,000,000 than the total human knowledge of 10^18 bits stored in all the world’s libraries.

Alan Turing and Gatlinburg symposium on information theory in biology
• Information Source
• Transmission of Information
• Tasks to be completed
• Output
The logic of Turing machines has an isomorphism with the logic of the genetic information system.
Information source: DNA
Transmission through m/t/r RNA
Tasks: Transcription, translation
Output: Protein(s)

Information Theory, Evolution, and the Origin of Life
Hubert P. Yockey, pp 35

Shannon, Turing, Gamow and Rao
H (gene) → Information Transfer Process → L (protein)

The word "information" derives from the Latin, informare, which means "to put into form”
Latent (DNA):
• Existing or present
• but concealed or inactive
Manifest (Protein):
Readily seen, or understood: apparent, clear, evident, noticeable, observable

How does the code work?
Inherited disease: broken/damaged DNA → broken/damaged proteins
Viral disease: DNA/RNA → foreign proteins (akin to a computer virus)
• Template for construction of proteins
Manifestation: Latent Information → Manifested Information

Genomic Information
ACGTCCGGCCTTATACGCTAATAAGCGCTCTATGTCTATACGCGCGATGCCGTACGAG…
What good is all this genetic information?

Information Inheritance
• Human beings are endowed with the information encoded in the genetic material of inheritance, which controls development, reproduction and self-repair.
• The carrier of this information is a complex structure of DNA.

Anything Computable!
• Markov defined what became known as Markov algorithms (HMM) for biological computation
• Alonzo Church used Lambda calculus for computing
• Kurt Gödel defined Recursive functions

Genetic Information Flow
• Within the body, genetic information flows from DNA to protein and other products
– first, by the transcription of portions of the DNA into so-called messenger RNA and,
– second, by (translation) the assembly of individual amino acids into polypeptides, including proteins.
This is the process of life and living related to information flow.

Paradigm of 'computational thinking'
Advances in computational power and applications of computational methods have led to the crystallization of the paradigm of 'computational thinking'.
This paradigm of 'computational thinking' takes computer science far beyond mere programming and data management.
This paradigm of 'computational thinking' provides newer methods for understanding complex lifestyle diseases like Diabetes.

The Value of the Right Tool
3.2 Billion Nucleotides
A “Supercomputer” Cluster

The new Challenges in Computer Science
• Promoter recognition in genomic sequence,
• Understanding data from microarray experiments, and
• Accurate prediction of protein folds from sequences

OVERVIEW
Bioinformatics approach to extract information from genes
With the widespread availability of nucleotide and amino acid sequences, novel methods for extracting biologically and clinically relevant knowledge are feasible. Data is deposited on the Internet on websites such as GeneCards, available at http://www.genecards.org/mirror.shtml . Further information can be obtained from related sites: UniProt (http://www.uniprot.org) and SwissProt (http://www.expasy.org/sprot/). Using the FASTA and CLUSTAL_X programs, similarity scores can be calculated to choose items of interest.
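To make the idea of a similarity score concrete, here is a minimal sketch of a global alignment score computed by dynamic programming (the Needleman-Wunsch recurrence). It is not the heuristic that FASTA or CLUSTAL_X actually use; the function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen only for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of sequences a and b.

    Illustrative scoring scheme (an assumption, not a FASTA/CLUSTAL default):
    +1 per matching pair, -1 per mismatch, -1 per gap position.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Toy nucleotide strings, not real GeneCards or UniProt entries
    print(global_alignment_score("ACGT", "AGT"))
```

A higher score means the two sequences can be lined up with more matches and fewer gaps, which is the sense in which such scores help to choose items of interest.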
Further information can be obtained by mining text, either manually or, increasingly, using text-mining resources such as PathBinderH and the GENIA corpus.

METHODS AND PROCEDURES

Computational approach of protein analysis using affinity chromatography: application to proteomics
Computerization is necessary to build a database of analytical data from chromatography coupled with mass spectrometry. On the basis of this proteomic data, we can identify proteins that are expressed differently in healthy and diseased conditions and can therefore serve as biomarkers. Affinity chromatography is one of the fastest liquid chromatographic methods for the separation and purification of biomolecules because of its high molecular specificity. Proteomics, and the tools for identifying proteins and understanding the biochemistry of proteins and pathways, are in a new stage of development and evaluation.

Development and Validation of LC Method for the Determination of Famciclovir in Pharmaceutical Formulation Using an Experimental Design
A rapid and sensitive RP-HPLC method with UV detection (242 nm) for routine analysis of famciclovir in pharmaceutical formulations was developed. Chromatography was performed with a mobile phase containing a mixture of methanol and phosphate buffer (50:50, v/v) at a flow rate of 1.0 mL min−1. Quantitation was accomplished with the internal standard method. The procedure was validated for linearity (correlation coefficient = 0.9999), accuracy, robustness and intermediate precision. Experimental design was used for validation of robustness and intermediate precision.
To test robustness, three factors were considered: the percentage (v/v) of methanol in the mobile phase, flow rate and pH. All three (flow rate, the percentage of organic modifier and pH) had a considerable effect on the response. For the intermediate precision measure, the variables considered were analyst, equipment and number of days. The RSD value (0.86%, n=24) indicated an acceptable precision of the analytical method.
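As a reminder of how a relative standard deviation (RSD) such as the one quoted above is obtained, the following sketch computes RSD = 100 × (sample standard deviation / mean). The function name and the replicate values are hypothetical; they are not the 24 measurements from the famciclovir study.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation in percent, using the sample (n-1) standard deviation."""
    return 100.0 * stdev(values) / mean(values)


if __name__ == "__main__":
    # Hypothetical replicate assay results (percent of label claim), for illustration only
    replicates = [99.2, 100.4, 100.9, 99.6, 100.1, 99.8]
    print(f"RSD = {rsd_percent(replicates):.2f}%")
```

A small RSD means the replicate results cluster tightly around their mean, which is the sense in which it is read as acceptable precision.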
The proposed method was simple, sensitive, precise, accurate and quick, and is useful for routine quality control.

Mathematical Analysis of Diabetes Related Proteins Having High Sequence Complexity
We searched for proteins affecting diabetes, identified the common species in which these proteins were more prevalent, and performed protein composition analysis of those having high sequence complexity. About 90% of rat genes have counterparts in the mouse and human genomes, which is the reason for looking for proteins common to the three species (Rat Genome Sequencing Consortium 2004, www.ratbehaviour.org/Ratsmice.htm). The distribution pattern of the protein variates was examined, and bivariate plots were then drawn.
The bivariate plots show a similar clustering for Rattus norvegicus and Mus musculus but some variation in Homo sapiens. This is consistent with Rattus norvegicus and Mus musculus being relatively close in the phylogenetic tree (Sridhar GR et al.) and hence clustering similarly. The proteins that lie away from the cluster are outliers because they have different compositional characteristics.

PHYLOGENETIC TREES: DIABETIC COMPLICATIONS AND RELATED CONDITIONS

Bioinformatics analysis of diabetic retinopathy using functional protein sequences
Diabetic retinopathy is the leading cause of blindness among patients with diabetes mellitus. We evaluated the role of several proteins that are likely to be involved in diabetic retinopathy by performing multiple sequence alignment with the ClustalW tool and constructing a phylogram from functional protein sequences extracted from NCBI. The phylogram was constructed using the Neighbor-Joining algorithm. It was observed that aldose reductase and nitric oxide synthase are closely associated with diabetic retinopathy.
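The Neighbor-Joining step mentioned above can be sketched as follows. This is a minimal, textbook-style implementation working on a ready-made distance matrix; it is not the ClustalW code, and the taxon names and distances in the usage example are a toy matrix, not the retinopathy proteins from the study. The function name and the data layout (a dict keyed by frozenset pairs) are assumptions made for this sketch.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Run Neighbor-Joining and return the joins in the order they are made.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) -> distance between a and b.
    Each join is (node_i, node_j, branch_len_i, branch_len_j, new_node_label).
    """
    nodes = list(names)
    d = dict(dist)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # r[i]: sum of distances from node i to every other active node
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Choose the pair minimising Q(i, j) = (n - 2) * d(i, j) - r[i] - r[j]
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        # Branch lengths from i and j to the new internal node
        len_i = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        len_j = dij - len_i
        new = f"({i},{j})"
        # Distances from the new node to each remaining node
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij)
        nodes = [k for k in nodes if k not in (i, j)] + [new]
        joins.append((i, j, len_i, len_j, new))
    return joins


# Toy five-taxon distance matrix (illustrative values only)
TOY_NAMES = ["a", "b", "c", "d", "e"]
TOY_DIST = {frozenset(p): v for p, v in {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}.items()}

if __name__ == "__main__":
    for step in neighbor_joining(TOY_NAMES, TOY_DIST):
        print(step)
```

In practice the distance matrix would come from the multiple sequence alignment; taxa that are joined early and with short branches are the ones the phylogram places close together.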
It is likely that vascular endothelial growth factor, pro-inflammatory cytokines, advanced glycation end products, and adhesion molecules, which also play a role in diabetic retinopathy, may do so by modulating the activities of aldose reductase and nitric oxide synthase. These results imply that methods designed to normalize aldose reductase and nitric oxide synthase activities could be of significant benefit in the prevention and treatment of diabetic retinopathy.

COMPUTATIONAL PROTEIN SEQUENCES ANALYSIS FOR DIABETIC RETINOPATHY – A BIO INFORMATICS STUDY
The role of bioinformatics is to aid life scientists in gathering and processing genomic data to study protein function. Another important role is to aid researchers at pharmaceutical companies in making detailed studies of protein structures to facilitate drug design. The human genome, with 3 billion chemical nucleotide bases, has about 30,000 genes whose functions are known to a great extent. These genes dictate the synthesis of different proteins, which differ from one another in their amino acid sequence. The physiological functions of a protein depend upon this sequence.
The functions of the protein butyrylcholinesterase (BChE) are not known to a great extent. Therefore its amino acid sequence is compared with the sequences of 29 different proteins using computational techniques. Close similarity is observed with the protein EST2_human, which confirms similarities of physiological actions. Findings of this kind, obtained from computational techniques that are now indispensable, help life scientists to proceed with the work in their wet laboratories.
The amino acid sequence of BChE is compared with proteins that act as inhibitors of neovascularisation, and similarity is found with one of the inhibitors. Early onset of diabetic retinopathy is often found in patients who have insufficient BChE in their serum.
This suggests that BChE may act as an inhibitor of the neovascularisation that causes retinopathy.

Bioinformatics analysis of functional protein sequences reveals a role for brain-derived neurotrophic factor in obesity and type 2 diabetes mellitus
Using bioinformatics techniques and sequence analysis algorithms, a comparative study between humans and rodents revealed similarity in the behavior of genes involved in the control of energy homeostasis. Brain-derived neurotrophic factor (BDNF) modulates the secretion and actions of insulin, leptin, ghrelin, various neurotransmitters and peptides, and pro-inflammatory cytokines involved in energy homeostasis, suggesting that BDNF has a significant role in the pathobiology of obesity and type 2 diabetes mellitus.
Based on this evidence, we propose that obesity and type 2 diabetes could be disorders of the brain and that BDNF could serve as a biomarker in predicting their development. Hence, methods developed to selectively deliver BDNF to appropriate hypothalamic neurons may form a novel approach to their treatment.

Bioinformatics Analysis of Functional Protein Sequences Reveals a Role for Tumor Necrosis Factor-α and Nitric Oxide in Insulin Resistance Syndrome
Using bioinformatics techniques and sequence analysis algorithms, we identified that tumor necrosis factor-α (TNF-α) and nitric oxide (NO) have a significant role in the pathobiology of insulin resistance syndrome. This condition is common in subjects with abdominal obesity, hypertension, dyslipidemia, atherosclerosis, and coronary heart disease, and is accompanied by endothelial dysfunction due to reduced endothelial nitric oxide generation. TNF-α has neurotoxic actions, stimulates inducible NO synthase activity, and modulates the expression of neurotransmitters involved in the control of feeding and thermogenesis.
NO is a neurotransmitter and influences the secretion and actions of various hypothalamic peptides and neuropeptides.
Insulin suppresses the production of TNF-α but stimulates that of endothelial NO. This close interaction between TNF-α, NO, hypothalamic peptides, and insulin suggests that regulation of TNF-α and NO production and action could be critical in the management of insulin resistance syndrome and its associated conditions.

BUTYRYLCHOLINESTERASE

Phylogenetic Tree Construction of Butyrylcholinesterase Sequences in Life Forms
Butyrylcholinesterase is an enzyme with few known physiological functions. It is related to acetylcholine, which has been shown to be expressed in a variety of life forms. We performed a search using the human butyrylcholinesterase gene (HGNC:983; MIM:177400) and found the sequence in a broad spectrum of life forms, including plants, bacteria and animals. Butyrylcholinesterase therefore appears to have arisen early in evolution and to have been conserved.

Serum butyrylcholinesterase in type 2 diabetes mellitus: a biochemical and bioinformatics approach
Background
Butyrylcholinesterase is an enzyme that may serve as a marker of metabolic syndrome. We (a) measured its level in persons with diabetes mellitus and (b) constructed a family tree of the enzyme using nucleotide sequences downloaded from NCBI. Butyrylcholinesterase was estimated colorimetrically using a commercially available kit (Randox Lab, UK). Phylogenetic trees were constructed by a distance method (the Fitch and Margoliash method) and by the maximum parsimony method.
Results
There was a negative correlation between serum total cholesterol and butyrylcholinesterase (-0.407; p < 0.05) and between serum LDL cholesterol and butyrylcholinesterase (-0.435; p < 0.05). There was no statistically significant correlation among the other biochemical parameters. In the evolutionary tree construction, both methods gave similar trees, except for an inversion in the position of Sus scrofa (M62778) and Oryctolagus cuniculus (M62779) between the Fitch and Margoliash and the maximum parsimony methods.
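Correlation coefficients such as those reported in the Results can be computed as sketched below. The slides do not say which coefficient was used, so this sketch assumes the Pearson correlation; the function name and the paired values are hypothetical and are not the study data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sums of cross-products and of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical paired measurements, for illustration only
    cholesterol = [182, 210, 195, 240, 170, 225]
    bche = [7.9, 6.8, 7.1, 6.9, 7.4, 7.6]
    print(f"r = {pearson_r(cholesterol, bche):.3f}")
```

A negative value of r means that higher values of one variable tend to go with lower values of the other, which is how the inverse relation between cholesterol and butyrylcholinesterase is read.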
Conclusion
The level of the butyrylcholinesterase enzyme was inversely related to serum cholesterol; the dendrogram showed that the structures from evolutionarily close species were placed near each other.

Alzheimer's disease and Type 2 diabetes mellitus: the cholinesterase connection?
Alzheimer's disease and type 2 diabetes mellitus tend to occur together. We sought to identify protein(s) common to both conditions that could suggest a possible unifying pathogenic role. Using human neuronal butyrylcholinesterase (AAH08396.1) as the reference protein, we used the BLAST tool for protein-to-protein comparison in humans. We found three groups of sequences among a series of 12, with an E-value between 0–12, common to both Alzheimer's disease and diabetes: butyrylcholinesterase precursor K allele (NP_000046.1), acetylcholinesterase isoform E4-E6 precursor (NP_000656.1), and apoptosis-related acetylcholinesterase (1B41|A).
Butyrylcholinesterase and acetylcholinesterase related proteins were found to be common to both Alzheimer's disease and diabetes; they may play an etiological role by influencing insulin resistance and lipid metabolism.