Advances in Bioinformatics: Review and Applications

1 SEMANTIC DATA INTEGRATION AND KNOWLEDGE RETRIEVAL IN BIOINFORMATICS

1.1 INTRODUCTION

In the past 10 years, the amount of biological data representing experimental results and knowledge of the biological domain has increased. This growth has had a large impact on the field, because it enables biologists to work effectively through in silico experiments. The large number of databases is very useful for further analysis and experimentation. Life science data integration, rather than the computational methods used to solve other bioinformatics tasks such as sequence alignment, cancer classification, gene expression analysis and many more, is one of the most challenging problems facing Systems Biology and Bioinformatics today. The Semantic Web provides a common framework that promises to enable data integration, sharing and reuse of knowledge from multiple resources over the Internet. Its emergence in current research appears to be a sensible approach to bridging the gap between biology and computer science. In particular, the use of ontologies to capture domain knowledge in Bioinformatics can help to address the problem of heterogeneous and distributed databases.

Current research in Bioinformatics and the Life Sciences depends heavily on the efficient use of data from experimental wet lab activities and computational analysis, accessed through the Internet. As many types of biological data grow, various sources of biological data must often be integrated in order to build new knowledge through hypothesis-driven or inference-driven research. Unfortunately, these data are provided by many different organizations, in different formats, and are hosted over the Internet across a large number of independent and heterogeneous research fields.
Thus, the integration of these databases is vital, and very challenging, for any research task that is related to biological phenomena. Common obstacles to combining and integrating biological data include the large number of biological databases, database heterogeneity, errors in the data, the rapid growth of biological data and many more. Therefore, standard tools, methods and approaches from computer science are needed to address the problem of biological data integration.

1.2 BIOLOGICAL DATABASE

Modern biological research generates data in digital form that can be processed by machines. These data are generally stored systematically in databases of specific categories. All databases fall into the categories described in Table 1.1 below:

Table 1.1 Biological database

Categorization            Information             References
Bibliographic             Literature, evidence    -
Taxonomic                 Classification          -
Nucleic acid              DNA information         [1-2]
Genome                    Gene level information  [3-4]
Protein                   Protein information     [5-6]
Metabolic pathways        Metabolic pathways      [7-9]
Ontology                  Controlled vocabulary   [10-12]
Other molecular database  Others                  -

All of the databases above are freely accessible to the public. Knowledge integration in Systems Biology based on information about genes, enzymes, metabolic pathways and ontologies requires access to suitable databases. Except for ontologies, the databases mentioned above differ from one another in format and purpose. Ontologies have emerged to harmonize this heterogeneous environment of biological databases. An ontology helps researchers to integrate and understand databases and knowledge as a whole, and allows them to annotate data and load the ontology annotations into a specific database or for an organism of interest.
1.3 KNOWLEDGE MANAGEMENT

Knowledge Representation (KR) in computer science is used to represent knowledge in a way that facilitates inference from a set of knowledge in the area of artificial intelligence. According to Davis et al. (1993), knowledge representation can be understood in terms of five distinct roles that it plays, each crucial to the task at hand. The five roles are:

1) A knowledge representation is most fundamentally a surrogate
2) A set of ontological commitments
3) A fragmentary theory of intelligent reasoning
4) A medium for pragmatically efficient computation
5) A medium of human expression

Understanding these five roles is crucial to both research and practice, because it offers a way of viewing representation that helps to answer questions of fundamental significance in the field (Davis et al., 1993). In computer science, KR is commonly used to ensure shared understanding and unambiguous exchange between interoperating systems. Two conditions are required for interoperability between systems: a) adoption of a common syntax and b) adoption of a means for understanding the semantics (Antezana et al., 2009). Adoption of a common syntax enables an application to parse the data, while adoption of a means for understanding the semantics enables the application to use the data.

1.3.1 Ontology

Ontology and knowledge representation cannot be separated from the Life Science domain. Ontologies constitute the very core of computational knowledge representation (Antezana et al., 2009). The term ontology originated in a branch of philosophy before being adopted by AI researchers to describe formal domain knowledge. However, the philosophical notion differs in some ways from the one used by computer scientists who work on knowledge management. Several definitions of ontology have been proposed.
For instance, the most frequently used is the definition given by Gruber (1995), "an explicit specification of a conceptualization", which describes the concepts in a domain and the relationships among them. In other words, an ontology is an explicit specification of a domain model (the conceptualization). Guarino (1999) later defined ontology as 'a shared vocabulary plus a specification of its intended meaning'. An ontology can also be drawn as a conceptual graph that serves as a knowledge model for a particular domain. It contains a collection of concepts representing domain-specific entities, the relationships between the concepts, and their properties.

Classes are the major component of most ontologies. Classes are the entities that describe the concepts in the domain of the ontology. For example, a class of transports represents all types of transport. Subclasses of transports may include land transports and air transports. A specific object belonging to one of these subclasses is an instance of that subclass, and therefore of the class of transports. In addition, properties can define the characteristics and constraints of each class. An ontology together with all instances of its classes constitutes a knowledge base.

1.4 SEMANTIC WEB

The World Wide Web Consortium (W3C) has established the vision of a Web of linked data, called the Semantic Web. It enables people to create data stores on the Web, build vocabularies and write rules for handling data. According to Berners-Lee et al. (2001), the Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. Currently, the technologies provided by W3C to support the Semantic Web include the Resource Description Framework (RDF), the Web Ontology Language (OWL) and the SPARQL Protocol and RDF Query Language (SPARQL).
RDF is the simplest language and the first step towards the Semantic Web vision, and it is commonly serialized in XML. It allows data to be interchanged on the Web and represents information and meaning in the form of subject, predicate and object, as shown in Figure 1.1. While RDF is a directed, labeled graph data format for representing information on the Web, SPARQL is used to query RDF and can express queries across diverse data sources. SPARQL contains functionality for querying required and optional graph patterns along with their conjunctions and disjunctions. The results of SPARQL queries can be result sets or RDF graphs. However, RDF has limitations: it is not expressive enough to support a number of commonly required features, such as negation or disjunction.

Figure 1.1 RDF Graph (a subject, predicate, object triple relating the gene SecA to bacterial secretion through the predicate participated_in)

Figure 1.2 Architecture of Semantic Web (Berners-Lee et al., 2000)

Figure 1.2 shows the architecture of the Semantic Web proposed by Berners-Lee et al. (2000). In this architecture, ontology is the core of semantic information and the foundation that supports reasoning-based services. The Semantic Web is an area of ongoing research. It appears to be a promising way to bridge the gap between computer scientists and biologists in Systems Biology and Bioinformatics. Because Semantic Web technology makes data machine readable so that knowledge bases can be generated, it is in great demand in the biological domain. Many projects have given semantics a key role in the life sciences, helping the field to become more mature and useful.
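The triple model and basic graph pattern matching described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the node labels are one reading of Figure 1.1, the second and third triples and the `found_in` predicate are hypothetical, and the `match` helper is a stand-in for a real SPARQL engine, not an implementation of one.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Tiny in-memory graph. The first triple follows Figure 1.1;
# the other two are hypothetical and exist only for illustration.
GRAPH: List[Triple] = [
    ("Gene:SecA", "participated_in", "Bacterial Secretion"),
    ("Gene:SecY", "participated_in", "Bacterial Secretion"),
    ("Gene:SecA", "found_in", "L. lactis"),
]


def match(graph: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple that agrees with the pattern.

    A position left as None behaves like a SPARQL variable: it matches anything.
    """
    for triple in graph:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple


# Roughly the pattern: ?gene participated_in "Bacterial Secretion"
genes = sorted(t[0] for t in match(GRAPH, p="participated_in",
                                   o="Bacterial Secretion"))
print(genes)
```

A dedicated RDF library and SPARQL endpoint would replace both the list and the helper in practice; the sketch only shows why a uniform subject, predicate, object shape makes data from different sources queryable in the same way.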
The Semantic Web can support every activity in Systems Biology and Bioinformatics, from knowledge management to modeling and simulation:

a) Knowledge management
b) Analysis
c) Modeling and simulation

In knowledge management, biological data integration and retrieval can be done using Semantic Web technology. For instance, biological data are expressed in a knowledge representation such as an ontology and integrated from many distributed resources. These data or knowledge are collected and then stored in a centralized resource. Queries written in a language such as SPARQL can then retrieve and combine information that originated in different resources. The results of such queries can suggest hypotheses and inferences, which is valuable to biologists in planning further experiments.

1.4 SEMANTIC WEB SERVICES FOR KNOWLEDGE RETRIEVAL

Web services are modular, self-describing, self-contained applications that are accessible over the Internet. They are designed to support interoperable machine-to-machine interaction over a network. Web services have three basic components: SOAP, WSDL and UDDI. Figure 1.3 below shows the triangle architecture of web services. Various standards have been developed for different Web service tasks such as description, discovery and invocation. These technologies are primarily designed to be used in conjunction with other Web standards, e.g. XML for syntax and HTTP for communication.

Figure 1.3 Web Services Architecture

SOAP is the communication protocol designed to exchange messages between applications over the Web. It is fundamentally a stateless, one-way message exchange paradigm, but applications can create more complex interaction patterns by combining such one-way exchanges.
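As a rough illustration of the kind of message SOAP carries, the sketch below builds and reads a single one-way message with the Python standard library. The envelope namespace is the SOAP 1.2 one; the service namespace `http://example.org/pathway` and the operation name `getGenesByPathway` are hypothetical and do not refer to any real bioinformatics service.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace
SVC_NS = "http://example.org/pathway"                # hypothetical service namespace


def build_request(pathway: str) -> bytes:
    """Serialize one SOAP message: an Envelope holding a Header and a Body."""
    ET.register_namespace("env", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    ET.SubElement(envelope, f"{{{SOAP_NS}}}Header")  # optional, left empty here
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # Hypothetical operation and parameter carried in the Body.
    operation = ET.SubElement(body, f"{{{SVC_NS}}}getGenesByPathway")
    ET.SubElement(operation, f"{{{SVC_NS}}}pathway").text = pathway
    return ET.tostring(envelope, encoding="utf-8")


def read_pathway(message: bytes) -> str:
    """Receiver side: parse the message and extract the requested pathway name."""
    root = ET.fromstring(message)
    node = root.find(
        f"{{{SOAP_NS}}}Body/{{{SVC_NS}}}getGenesByPathway/{{{SVC_NS}}}pathway"
    )
    if node is None or node.text is None:
        raise ValueError("message does not contain a pathway parameter")
    return node.text


request = build_request("protein secretion")
print(read_pathway(request))
```

A request/response interaction is then simply two such one-way messages, one in each direction.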
SOAP provides a distributed processing model in which a SOAP message is delivered from a sender to an ultimate receiver via zero or more SOAP intermediaries. This distributed processing model can support many message exchange patterns including, but not limited to, one-way messages, request/response interactions, and peer-to-peer conversations.

Web Services Description Language (WSDL) is the language used to describe the mechanics of interacting with a particular Web service. In a WSDL interface, the abstract functionality of the Web service is defined in terms of the types of messages it sends and receives. An interface is a set of operations, and an operation is a sequence of input and output messages. An operation associates a message exchange pattern (MEP) with the message types that will be exchanged during execution. The message types are defined using a schema language such as (but not limited to) XML Schema. Binding descriptions associate the abstract interfaces with concrete message formats and transmission protocols.

Universal Description, Discovery and Integration (UDDI) is an emerging standard registry system for Web services. UDDI allows businesses to advertise their Web services by publishing their descriptions in a global registry. The registry has three main parts: White Pages, which list contact information about the company that developed the Web service; Yellow Pages, which organize Web services by categories such as geography and industry code; and Green Pages, which hold WSDL descriptions. UDDI supports the association of an unbounded set of properties with the description of a Web service via a construct called TModel. For example, a service may specify its category using an arbitrary classification system. Because the meaning of such properties is not codified, two different TModels may have the same meaning without this similarity being recognized.
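The TModel limitation can be made concrete with a small sketch. Everything below is hypothetical: the `TModel` class is a simplified stand-in rather than the UDDI data structure, and the `concept` field represents an ontology annotation of the kind a semantic layer could add. The point is only that a purely syntactic comparison misses two categories that mean the same thing, while a shared ontology concept lets them be matched.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TModel:
    """Simplified stand-in for a registry property (not the real UDDI structure)."""
    key: str                       # registry-assigned identifier
    name: str                      # free-text category name chosen by the provider
    concept: Optional[str] = None  # hypothetical ontology annotation


def same_category_syntactic(a: TModel, b: TModel) -> bool:
    # Without codified meaning, the registry can only compare identifiers.
    return a.key == b.key


def same_category_semantic(a: TModel, b: TModel) -> bool:
    # With a semantic layer, two entries match when they point at the same concept.
    return a.concept is not None and a.concept == b.concept


# Two providers describe the same kind of service in their own words.
provider_a = TModel("uuid:0001", "secretion pathway lookup", "onto:ProteinSecretion")
provider_b = TModel("uuid:0002", "protein export query", "onto:ProteinSecretion")

print(same_category_syntactic(provider_a, provider_b))  # keys differ
print(same_category_semantic(provider_a, provider_b))   # concepts agree
```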
Figure 1.4 The nature of semantic Web services

Emerging semantic technology is well suited to handling heterogeneity and integrating distributed resources across the biological domain. However, this technology alone is not enough to bring applications into more dynamic and complex environments. Semantic Web services result from the convergence of syntactically defined Web services and the Semantic Web, as shown in Figure 1.4. The Semantic Web services approach addresses the limitations of current Web service technology by augmenting service descriptions with a semantic layer. In the life sciences, Semantic Web infrastructure is already mature. Therefore, powerful applications for knowledge integration, pathway analysis, gene expression analysis, modeling and simulation can be developed to give significant results to biologists and bioinformaticians. Such applications can use annotations from domain ontologies and suitable planning engines to automatically discover, execute and compose Web services in order to solve biological problems such as knowledge integration, complex biological questions and the analysis of biological processes. Current research efforts in Semantic Web services include OWL-S, WSMO and SAWSDL.

1.4.1 Services Composition

Composition is the process of combining and coordinating a set of Semantic Web Services (SWS) to achieve a goal. While individual Bioinformatics Web services are useful, more than one service is usually required to perform a complete biological analysis. For example, as shown in Figure 1.5, users need to obtain a list of the genes that participate in the protein secretion pathway in the organism L. lactis. Assume that there are three distributed databases provided by three different resources, each exposed through Web services.
So, to obtain the complete list of genes, users need to combine several Web services.

Figure 1.5 Semantic web services composition scenario

There are three main methods for composing Web services: (1) manually composing the services, (2) using fully automatic composition software, or (3) using a hybrid approach, called semi-automatic composition. Automatic composition has more limitations than the other two methods. Automated composition is likely to be useful where transparent, seamless access is the overriding requirement, such as appointment and flight booking. For such tasks, users will be happy to accept the results, as long as they are reasonable, and they gain the advantage of not having to perform the tasks themselves. It is not likely to serve the needs of expert, knowledgeable, opinionated scientists who may invest large quantities of money and time in further experiments based on the results and who may be required to justify their methodologies under peer review. In other words, these scientists are unlikely to trust automated service invocation and composition, probably with justification, as it is unlikely to improve on their own selections. Automation must support biologists' activities, not replace them. In this way, Bioinformatics seems to be following the path of medical informatics, where early decision-making systems have given way to later decision-support systems.

The main and best-known research efforts towards discovering and composing biological Web services for Bioinformatics are BioMoby (Wilkinson et al., 2005) and Taverna (Tan et al., 2009). For composition, however, these tools are still difficult to use, because composition is manual and biologists need advanced knowledge to use them as scientific workflow tools.
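The scenario of Figure 1.5 can be sketched as a manual composition (method 1 above), in which the programmer fixes the order of the calls. The three functions below are hypothetical in-memory stand-ins for remote Web services, and the gene identifiers they return are placeholder values, not curated biological data.

```python
from typing import Dict, List


# Hypothetical stand-ins for three Web services offered by three providers.
def pathway_service(pathway: str) -> List[str]:
    """Return the genes annotated to a pathway (placeholder data)."""
    data = {"protein secretion": ["SecA", "geneB", "geneC"]}
    return data.get(pathway, [])


def organism_service(organism: str) -> List[str]:
    """Return the genes known for an organism (placeholder data)."""
    data = {"L. lactis": ["SecA", "geneC", "geneD"]}
    return data.get(organism, [])


def annotation_service(gene: str) -> Dict[str, str]:
    """Return a minimal annotation record for one gene (placeholder data)."""
    return {"gene": gene, "source": "annotation-db"}


def compose(pathway: str, organism: str) -> List[Dict[str, str]]:
    """Manually composed workflow: pathway genes, filtered by organism, then annotated."""
    in_pathway = pathway_service(pathway)
    in_organism = set(organism_service(organism))
    selected = [gene for gene in in_pathway if gene in in_organism]
    return [annotation_service(gene) for gene in selected]


result = compose("protein secretion", "L. lactis")
print([record["gene"] for record in result])
```

In a semi-automatic setting, a tool would propose which service can consume the output of the previous one, for example by comparing ontology annotations on inputs and outputs, while the scientist keeps the final choice of each step.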
The semi-automatic, or hybrid, approach is very useful for composing Semantic Web services in Bioinformatics. Nevertheless, Semantic Web service composition remains a very challenging and active research topic in the Semantic Web and Service-oriented Architecture (SOA) arena. Its complexity has several sources: the number of services on the Web is increasing, services are updated on the fly, and services are developed in a distributed way by many organizations with different models and features.

1.5 SUMMARY

In this chapter we have seen that Semantic Web technologies have the potential to overcome many of the limitations and can benefit the Systems Biology and Bioinformatics field in terms of data integration and knowledge retrieval. We also discussed some fundamental principles of biological databases, knowledge management, the Semantic Web and Semantic Web services. Semantic Web technology has great potential to make Bioinformatics and the life sciences more mature and meaningful in laboratory research. Furthermore, technologies such as Web services can add more dynamic behavior to service-based applications.

Knowledge management based on the concept of a Knowledge Portal uses web portal technology. The concept suits end users because it is transparent and reduces their burden by providing functions such as querying, visualizing and retrieving data and knowledge from integrated databases. Many resources provide a semantic systems biology portal that combines several databases in a centralized repository and exposes them through biological queries (SPARQL) combined with visualization of biological networks. The results of this research have been able to support hypotheses and inferences for further experiments in biology.

Ontologies and semantics are often used for biological data integration.
Some research uses ontologies and semantics only, while other research combines them with web portal technology. Many resources use semantics to integrate various data sources, such as genes and gene products; the collected data are encoded in RDF and merged with ontologies before being stored in a repository. The knowledge in these data is then manipulated using SPARQL queries.

Web services are also used for data integration, especially for heterogeneous and distributed data sources. Most data repositories are exposed by their providers to integration and retrieval facilities through Web services. For example, KEGG provides services to retrieve various data, from genome to pathway databases. Users can then use these Web services to develop client programs or software that retrieve and integrate data from the KEGG database.

Based on the reviews conducted, the need for data integration and retrieval in the biological domain is critical because of the huge number of databases available over the Internet. Furthermore, new knowledge and inferences can be derived from the integrated databases. In facing the challenge of data integration in this domain, ontologies, the Semantic Web and Semantic Web services play a very important role.

Acknowledgements

The authors acknowledge the Malaysian Genome Institute (MGI) and vot 73744 from the University Technology of Malaysia Research Management Centre (RMC).

REFERENCES

1. Berman, H.M., et al. 2002. The Nucleic Acid Database. Acta Crystallographica Section D-Biological Crystallography. 58: 889-898.
2. Tateno, Y. and T. Gojobori. 1997. DNA Data Bank of Japan in the age of information biology. Nucleic Acids Research. 25(1): 14-17.
3. Sayers, E.W. 2009. Database resources of the National Center for Biotechnology Information (vol 37, pg D5, 2008). Nucleic Acids Research. 37(9): 3124-3124.
4. Emmert, D.B. 1994. The European-Bioinformatics-Institute (EBI) Databases. Nucleic Acids Research. 22(17): 3445-3449.
5. Magrane, M. and UniProt Consortium. 2007. The UniProt Knowledgebase: a useful resource for developmental biology. Genetical Research. 89(3): 184-185.
6. Gasteiger, E. 2003. ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Research. 31(13): 3784-3788.
7. Kanehisa, M. and S. Goto. 2000. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research. 28(1): 27-30.
8. Karp, P.D. 2005. BioCyc pathway database collection and the pathway tools software. Abstracts of Papers of the American Chemical Society. 229: U1178-U1178.
9. Stein, L., et al. 2007. Reactome: a knowledge base of biological pathways and processes. Genome Biology. 8(3).
10. Smith, B., et al. 2007. The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology. 25(11): 1251-1255.
11. Harris, M.A., et al. 2006. The Gene Ontology (GO) project in 2006. Nucleic Acids Research. 34: D322-D326.
12. Bader, G.D., et al. 2010. The BioPAX community standard for pathway data sharing. Nature Biotechnology. 28(9): 935-942.
13. Davis, R., H. Shrobe, and P. Szolovits. 1993. What Is a Knowledge Representation. AI Magazine. 14(1): 17-33.
14. Antezana, E., M. Kuiper, and V. Mironov. 2009. Biological knowledge management: the emerging role of the Semantic Web technologies. Briefings in Bioinformatics. 10(4): 392-407.
15. Gruber, T.R. 1995. Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human-Computer Studies. 43(5-6): 907-928.
16. Guarino, N. 1999. Formal ontology and conceptual modeling. Data & Knowledge Engineering. 31(2): V-VI.
17. Berners-Lee, T., J. Hendler, and O. Lassila. 2001. The Semantic Web - A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American. 284(5): 34-+.
18. Decker, S., P. Mitra, and S. Melnik. 2000. Framework for the semantic Web: An RDF tutorial. IEEE Internet Computing. 4(6): 68-73.
19. Bechhofer, S., R. Volz, and P. Lord. 2003. Cooking the semantic web with the OWL API. Semantic Web – ISWC. 2870: 659-675.
20. Heese, R. 2006. Query graph model for SPARQL. Advances in Conceptual Modeling – Theory and Practice. 4231: 445-454.
21. Martin, D., M. Paolucci, and S. McIlraith. 2005. Bringing semantics to web services: The OWL-S approach. Semantic Web Services and Web Process Composition. 3387: 26-42.
22. Roman, D., J. Bruijn, and A. Mocan. 2006. WWW: WSMO, WSML, and WSMX in a nutshell. Semantic Web - ASWC 2006, Proceedings. 4185: 516-522.
23. Kopecky, J., T. Vitvar, C. Bournez, and J. Farrell. 2007. SAWSDL: Semantic annotations for WSDL and XML schema. IEEE Internet Computing. 11(6): 60-67.
24. Gottschalk, K., et al. 2002. Introduction to Web services architecture. IBM Systems Journal. 41(2): 170-177.
25. Curbera, F., et al. 2002. Unraveling the Web services Web – An introduction to SOAP, WSDL, and UDDI. IEEE Internet Computing. 6(2): 86-93.
26. Gordon, R.S. 2003. Understanding Web services: XML, WSDL, SOAP, and UDDI. Library Journal. 128(2): 111-111.
27. Paolucci, M., et al. 2002. Importing the semantic web in UDDI. Web Services, E-Business, and the Semantic Web. 2512: 225-236.
28. Wilkinson, M., H. Schoof, R. Ernst, and D. Haase. 2005. BioMOBY successfully integrates distributed heterogeneous bioinformatics Web services. The Planet exemplar case. Plant Physiology. 138(1): 4-16.
29. Tan, W., P. Missier, R. Madduri, and I. Foster. 2008. Building Scientific Workflow with Taverna and BPEL: A Comparative Study in caGrid. Service-Oriented Computing – ICSOC Workshops, 2009. 5472: 118-129.
30. Morello, E., L.G. Bermudez-Humaran, D. Llull, V. Sole, N. Miraglio, P. Langella, and I. Poquet. 2008. Lactococcus lactis, an efficient cell factory for recombinant protein production and secretion. Journal of Molecular Microbiology and Biotechnology. 14(1-3): 48-58.