					Advances in Bioinformatics: Review and Applications              1



                                    1
 SEMANTIC DATA INTEGRATION AND
    KNOWLEDGE RETRIEVAL IN
        BIOINFORMATICS



1.1 INTRODUCTION
In the past 10 years, the amount of biological data representing
experimental results and knowledge of the biological domain has
increased. This growth has had a large impact on the biological
field, enabling biologists to carry out in silico experiments. The
large number of available databases is very useful for further
analysis and experimentation. Life science data integration, rather
than the computational methods used for other bioinformatics tasks
such as sequence alignment, cancer classification, gene expression
analysis and many more, is now one of the most challenging problems
facing Systems Biology and Bioinformatics. The semantic web provides
a common framework that promises to enable data integration, sharing
and reuse of knowledge from multiple resources over the Internet.
The emergence of the semantic web in current research seems a
sensible approach to bridging the gap between biology and computer
science. In particular, the use of ontologies for domain knowledge
in Bioinformatics can help to solve the problem of heterogeneous
and distributed databases.

Current research in Bioinformatics and Life Science depends heavily
on the efficient use of data from experimental wet lab activities
and computational analysis through the Internet. While many types
of biological data are growing, various sources of biological data
must often be integrated in order to build new knowledge through
hypothesis-driven or inference-driven research. Unfortunately, these
data are provided and distributed over the Internet by many
different organizations, in different formats, and are hosted by a
large number of independent and heterogeneous research fields. Thus,
the integration of these databases is both vital and very
challenging for biological research.

Common obstacles to combining and integrating biological data
include the large number of biological databases, database
heterogeneity, errors in the data, the rapid growth of biological
data and many more. Therefore, standard tools, methods and
approaches are needed from the computer science field to address
the problem of biological data integration.


1.2 BIOLOGICAL DATABASE
Modern biological research generates data in digital form that can
be processed by machines. These data are generally stored
systematically in databases of specific categories. Broadly, all
databases fall into the categories described in Table 1.1 below:

                   Table 1.1 Biological database
 Categorization             Information                References
 Bibliographic              Literature, evidence       -
 Taxonomic                  Classification             -
 Nucleic acid               DNA information            [1-2]
 Genome                     Gene level information     [3-4]
 Protein                    Protein information        [5-6]
 Metabolic Pathways         Metabolic pathways         [7-9]
 Ontology                   Controlled vocabulary      [10-12]
 Other molecular database   Others                     -

All of the databases above are freely accessible to the public.
Knowledge integration in Systems Biology based on information about
genes, enzymes, metabolic pathways and ontologies requires access to
suitable databases. Except for ontology, the databases mentioned
above differ in format and purpose. Ontology has emerged to
harmonize this heterogeneous environment of biological databases.
It helps researchers to integrate and understand the databases and
the knowledge they hold, and allows them to annotate and load
ontology annotations for specific databases or organisms of
interest.


1.3 KNOWLEDGE MANAGEMENT
Knowledge Representation (KR) in computer science is used to
represent knowledge in a way that facilitates inference from a set
of knowledge in the area of artificial intelligence. According to
Davis et al. (1993), knowledge representation can be understood in
terms of five distinct roles, each crucial to the task at hand. The
five roles are:

1)      A knowledge representation is most fundamentally a
        surrogate
2)      A set of ontological commitments
3)      A fragmentary theory of intelligent reasoning
4)      A medium for pragmatically efficient computation
5)      A medium of human expression

Understanding these five roles is crucial to both research and
practice, because it provides a way to view representation and to
answer a question of fundamental significance in the field (Davis
et al., 1993).

In computer science, KR is commonly used to ensure a shared
understanding and an unambiguous exchange of information between
interoperating systems. Two conditions must be met for
interoperability between systems: a) adoption of a common syntax
and b) adoption of a means for understanding the semantics
(Antezana et al., 2009). A common syntax allows an application to
parse the data, while a means for understanding the semantics
enables the application to use the data.

1.3.1 Ontology
Ontology and knowledge representation cannot be separated from
the Life Science domain. Ontologies constitute the very core of
computational knowledge representation (Antezana et al., 2009).
The term ontology originated in a branch of philosophy before it
was adopted by AI researchers to describe formal domain knowledge.
However, the philosophical usage differs in some ways from that of
computer scientists who work on knowledge management. Several
definitions of ontology have been proposed. The most frequently
used is the one given by Gruber (1995): "an explicit specification
of a conceptualization", which describes the concepts in a domain
and the relationships among them. In other words, an ontology is
an explicit description of a domain model (the conceptualization).

Guarino (1999) later defined ontology as 'a shared vocabulary plus
a specification of its intended meaning'. An ontology can also be
drawn as a conceptual graph that serves as a knowledge model for a
particular domain. It contains a collection of concepts
representing domain-specific entities, the relationships between
those concepts, and their properties.

Classes are the major component of most ontologies. Classes are the
entities that describe the concepts in the domain of the ontology.
For example, a class of transports represents all types of
transport. Subclasses of transports may include land transports and
air transports. A specific object belonging to one of these
subclasses is an instance of that subclass and therefore of the
class of transports. In addition, properties define the
characteristics and constraints of each class. An ontology together
with all the instances of its classes constitutes a knowledge base.
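
The class, subclass and instance relationships described above can
be sketched in a few lines of Python. This is an illustration only:
the dictionary layout, the function names `superclasses` and `is_a`,
and the individuals `bus_42` and `helicopter_1` are assumptions made
for this example and are not part of any ontology language.

```python
# Toy knowledge base: an ontology (class hierarchy) plus instances.
# All names are hypothetical and mirror the transport example in the text.

# Ontology: each class maps to its direct superclass (None for the root).
SUBCLASS_OF = {
    "Transport": None,
    "LandTransport": "Transport",
    "AirTransport": "Transport",
}

# Instances: each individual maps to the class it belongs to.
INSTANCE_OF = {
    "bus_42": "LandTransport",
    "helicopter_1": "AirTransport",
}


def superclasses(cls):
    """Return cls followed by all of its ancestors, nearest first."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = SUBCLASS_OF[cls]
    return chain


def is_a(individual, cls):
    """True if the individual belongs to cls directly or via a subclass."""
    return cls in superclasses(INSTANCE_OF[individual])


print(is_a("bus_42", "Transport"))     # True
print(is_a("bus_42", "AirTransport"))  # False
```

The two dictionaries together play the role of the knowledge base:
one holds the ontology and the other holds the instances.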




1.4 SEMANTIC WEB
The World Wide Web Consortium (W3C) has established the vision of
a Web of linked data, called the Semantic Web. It enables people to
create data stores on the Web, build vocabularies and write rules
for handling data. According to Berners-Lee et al. (2001), the
Semantic Web is not a separate Web but an extension of the current
one, in which information is given well-defined meaning, better
enabling computers and people to work in cooperation.

Currently, the technologies and components provided by the W3C to
support the semantic web of data are the Resource Description
Framework (RDF), the Web Ontology Language (OWL) and the SPARQL
Protocol and RDF Query Language (SPARQL). RDF is the first step and
the simplest language towards the semantic web vision, and is
commonly serialized in XML. It allows data to be interchanged on
the Web and can be used to represent information and meaning in the
form of subject, predicate and object, as shown in Figure 1.1.
While RDF is a directed, labeled graph data format for representing
information on the Web, SPARQL is used to query RDF and can express
queries across diverse data sources. SPARQL contains functionality
for querying required and optional graph patterns along with their
conjunctions and disjunctions. The results of SPARQL queries can be
returned in the form of RDF graphs. However, RDF is limited: it is
not very expressive and cannot support a number of commonly
required features, such as negation or disjunction.

      Gene: SecA  ---- participated_in ---->  Bacterial Secretion

      subject            predicate                  object

                      Figure 1.1 RDF Graph
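
The statement in Figure 1.1 can be written down as a
(subject, predicate, object) triple and queried with a simple
pattern. The sketch below uses plain Python rather than an RDF
library. The names `Triple` and `match` are hypothetical, and the
second triple is invented so that the query has something to filter
out.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The first triple is the statement shown in Figure 1.1; the second is an
# invented example added only for illustration.
graph = [
    Triple("Gene:SecA", "participated_in", "Bacterial Secretion"),
    Triple("Gene:SecA", "has_label", "secA"),
]


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


hits = match(graph, predicate="participated_in")
print(hits[0].subject, "->", hits[0].obj)
```

A wildcard in the pattern plays roughly the part that a variable
plays in a SPARQL graph pattern, although a real SPARQL engine offers
far more than this.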




     Figure 1.2 Architecture of Semantic Web (Lee et al., 2000)

Figure 1.2 shows the architecture of the semantic web proposed by
Lee et al. (2000). In this architecture, ontology is the central
part of the semantic information and the foundation for
reasoning-based services. The semantic web is an area of ongoing
research, and it appears to be a promising way to bridge the gap
between computer scientists and biologists in Systems Biology and
Bioinformatics.

Because semantic web technology can make data machine readable in
order to generate a knowledge base, it is in high demand in the
biological area. Many projects have given semantics a key role in
making the life sciences more mature and useful. The semantic web
can support every activity in Systems Biology and Bioinformatics,
from knowledge management to modeling and simulation:

a)      Knowledge management
b)      Analysis
c)      Modeling and Simulation

In knowledge management, biological data integration and retrieval
can be done using semantic web technology. For instance, biological
data are expressed in a knowledge representation such as an
ontology and integrated from many distributed resources. The data
or knowledge are collected and then stored in a centralized
resource. Queries, for example in SPARQL, can then be used to
retrieve and combine information that originated in different
resources. The results of such queries can suggest hypotheses and
inferences, which makes them valuable to biologists in planning
further experiments.
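
The collect, store and query workflow described above can be
sketched as follows. Everything in the example is assumed for
illustration: the two source sets, the predicate names and the
function `genes_in_pathway` are invented, and a real system would
use an RDF store queried with SPARQL rather than Python sets.

```python
# Two hypothetical sources, each exposing (subject, predicate, object) triples.
pathway_db = {
    ("Gene:SecA", "participated_in", "Bacterial Secretion"),
    ("Gene:SecY", "participated_in", "Bacterial Secretion"),
}
protein_db = {
    ("Gene:SecA", "encodes", "Protein:SecA"),
}

# Integration step: collect the triples from every source into one store.
store = pathway_db | protein_db


def genes_in_pathway(store, pathway):
    """Required pattern: ?gene participated_in <pathway>.
    Optional pattern: ?gene encodes ?protein (None when no match exists)."""
    rows = []
    for s, p, o in sorted(store):
        if p == "participated_in" and o == pathway:
            proteins = [o2 for s2, p2, o2 in store
                        if s2 == s and p2 == "encodes"]
            rows.append((s, proteins[0] if proteins else None))
    return rows


for gene, protein in genes_in_pathway(store, "Bacterial Secretion"):
    print(gene, protein)
```

The second gene is still returned even though no protein is recorded
for it, which is the behavior an optional graph pattern is meant to
give.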


1.4 SEMANTIC WEB SERVICES FOR KNOWLEDGE
    RETRIEVAL
Web services are modular, self-describing, self-contained
applications that are accessible over the Internet. They are
designed to support interoperable machine-to-machine interaction
over a network. Web services have three basic components: SOAP,
WSDL and UDDI. Figure 1.3 below shows the triangle architecture of
web services.



Various standards have been developed for different Web Service
tasks such as description, discovery and invocation. These
technologies are primarily designed to be used in conjunction with
other Web standards, e.g. XML for syntax and HTTP for
communication.




                  Figure 1.3 Web Services Architecture

SOAP is a communication protocol designed to exchange messages
between applications over the Web. It is fundamentally a
stateless, one-way message exchange paradigm, but applications
can create more complex interaction patterns by combining such
one-way exchanges. SOAP provides a distributed processing
model where a SOAP message is delivered from a sender to an
ultimate receiver via zero or more SOAP intermediaries. This
distributed processing model can support many message exchange
patterns including but not limited to one-way messages,
request/response interactions, and peer-to-peer conversations.
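
As a rough illustration of what such a message looks like, the
sketch below builds a SOAP envelope with Python's standard XML
library. The envelope namespace is the one used by SOAP 1.1. The
service namespace, the `getGenes` operation and the `pathway`
element are hypothetical and do not describe any real service.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "http://example.org/genes"                # hypothetical service


def build_request(pathway):
    """Build a SOAP envelope carrying one hypothetical getGenes request."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{SERVICE_NS}}}getGenes")
    ET.SubElement(request, f"{{{SERVICE_NS}}}pathway").text = pathway
    return ET.tostring(envelope, encoding="unicode")


message = build_request("protein secretion")
print(message)
```

The function only constructs the message. Sending it, for example
over HTTP, and routing it through intermediaries are left out of the
sketch.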

Web Services Description Language (WSDL) is the language used to
describe the mechanics of interacting with a particular Web
service. The abstract functionality of the Web service is defined
in a WSDL interface in terms of the types of messages it sends and
receives. An interface is a set of operations and an operation is a
sequence of input and output messages. An operation associates a
message exchange pattern (MEP) with the message types that will
be exchanged during execution. The message types are defined
using a schema language such as (but not limited to) XML
Schema. Binding descriptions associate the abstract interfaces with
concrete message formats and transmission protocols.
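
The relationship between interface, operation and binding can be
pictured with a small data model. The sketch below is not WSDL
syntax. The class and field names, the `GeneLookup` interface and
the string "in-out" used as a pattern label are all assumptions made
for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Operation:
    name: str
    pattern: str                       # message exchange pattern label
    input_type: str                    # message types would come from a schema
    output_type: Optional[str] = None  # absent for one-way operations


@dataclass
class Interface:
    name: str
    operations: List[Operation] = field(default_factory=list)


@dataclass
class Binding:
    interface: Interface
    message_format: str                # concrete format, e.g. "SOAP"
    protocol: str                      # transmission protocol, e.g. "HTTP"


# Abstract part: one interface with one request/response operation.
gene_lookup = Interface(
    "GeneLookup",
    [Operation("getGenes", "in-out", "PathwayName", "GeneList")],
)

# Concrete part: the same interface tied to a format and a protocol.
binding = Binding(gene_lookup, message_format="SOAP", protocol="HTTP")
print(binding.interface.operations[0].name)
```

Keeping `Binding` separate from `Interface` mirrors the split between
the abstract description and its concrete realization.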

Universal Description, Discovery and Integration (UDDI) is an
emerging standard registry system for Web Services. UDDI allows
businesses to advertise their Web Services by publishing their
descriptions on a global registry. There are three main parts of
this registry: White Pages that list contact information about the
company that developed the Web service; Yellow Pages that
organize Web services by such categories as geography and
industry code; and Green Pages that hold WSDL descriptions.
UDDI supports the association of an unbounded set of properties
with the description of a Web Service via a construct called a
TModel. For example, a service may specify its category using an
arbitrary classification system. However, the meaning of a TModel
is not codified, so two different TModels may have the same meaning
without this similarity being recognized.
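
The TModel limitation can be made concrete with a short sketch. The
service records, the category keys and the concept name
`PathwayRetrieval` are invented for this example. The second function
shows one possible remedy, annotating each key with a concept from a
shared ontology, and is not a feature of UDDI itself.

```python
# Two hypothetical services whose providers chose different category keys
# for the same idea.
service_a = {"name": "PathwayFinder", "category": "tmodel:pathway-lookup"}
service_b = {"name": "RouteSearch", "category": "tmodel:metabolic-route-search"}


def same_category_syntactic(a, b):
    """Plain registry comparison: only identical keys count as equal."""
    return a["category"] == b["category"]


# Assumed semantic layer: each key is mapped to a concept in a shared
# ontology (the concept name is invented for illustration).
CONCEPT_OF = {
    "tmodel:pathway-lookup": "PathwayRetrieval",
    "tmodel:metabolic-route-search": "PathwayRetrieval",
}


def same_category_semantic(a, b):
    """Comparison through the concept each key is annotated with."""
    return CONCEPT_OF[a["category"]] == CONCEPT_OF[b["category"]]


print(same_category_syntactic(service_a, service_b))  # False
print(same_category_semantic(service_a, service_b))   # True
```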




             Figure 1.4 The nature of semantic Web services

The emerging semantic technology is well suited to supporting
heterogeneity and integration of distributed resources across the
biological domain. However, this technology alone is not enough to
bring applications into more dynamic and complex environments.
Semantic Web services are the result of the evolution of the
syntactic definition of Web services and the semantic Web, as
shown in Figure 1.4. The semantic web approach addresses the
limitations of current web service technology by augmenting the
service descriptions with a semantic layer. In the life sciences,
semantic web infrastructure is already mature. Therefore, powerful
applications such as knowledge integration, pathway analysis, gene
expression analysis, and modeling and simulation can be developed
to give significant results to biologists and bioinformaticians.
Such applications can use domain ontology annotations and suitable
planning engines to automatically discover, execute and compose web
services to solve biological problems such as knowledge
integration, complex biological questions and biological processes.
Current research efforts in semantic web services include OWL-S,
WSMO and SAWSDL.

1.4.1 Services Composition
Composition is the process of combining and coordinating a set of
Semantic Web Services (SWS) to achieve a goal. While individual
Bioinformatics Web services are useful, a complete biological
analysis often requires more than one service. For example, as
shown in Figure 1.5, a user needs the list of genes that
participate in the protein secretion pathway of the organism
L. lactis. Assume that there are three distributed databases
provided by three different resources, each exposed through web
services. To obtain the complete list of genes, the user needs to
combine several web services.




      Figure 1.5 Semantic web services composition scenario
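
The scenario in Figure 1.5 can be sketched as three functions
chained together by hand. Each function stands in for a remote Web
service. The function names, the identifiers and the returned gene
names are placeholders invented for this example, not data from any
real database.

```python
# Three hypothetical services, each standing in for a remote Web service
# backed by a different database. All data are invented placeholders.
def find_organism_id(name):
    return {"L. lactis": "org:llactis"}[name]


def find_pathway_id(organism_id, pathway_name):
    table = {("org:llactis", "protein secretion"): "path:secretion"}
    return table[(organism_id, pathway_name)]


def list_genes(pathway_id):
    return {"path:secretion": ["secA", "secY"]}[pathway_id]


def genes_for(organism, pathway):
    """Manual composition: the output of each service feeds the next one."""
    organism_id = find_organism_id(organism)
    pathway_id = find_pathway_id(organism_id, pathway)
    return list_genes(pathway_id)


print(genes_for("L. lactis", "protein secretion"))
```

Here the order of the calls is fixed by the programmer, which
corresponds to manual composition.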

There are three main methods for composing web services:
(1) composing the services manually, (2) using fully automatic
composition software, or (3) using a hybrid approach, called
semi-automatic composition. Automatic composition has more
limitations than the other two methods. Automated composition is
likely to be useful where transparent, seamless access is the
overriding requirement, such as making appointments and booking
flights. For such tasks, users will be happy to accept the results
as long as they are reasonable, and they gain the advantage of not
having to perform the tasks themselves. It is not likely to serve
the needs of expert, knowledgeable, opinionated scientists who may
invest large quantities of money and time in further experiments
based on the results and who may be required to justify their
methodologies under peer review. In other words, these scientists
are unlikely to trust automated service invocation and composition,
probably with justification, as it is unlikely to improve on their
own selections. Automation must support biologists' activities, not
replace them. In this way, Bioinformatics seems to be following the
path of medical informatics, where early decision-making systems
have given way to later decision-support systems.

The main and best-known research efforts towards discovering and
composing biological Web services for Bioinformatics are BioMOBY
(Wilkinson et al., 2005) and Taverna (Tan et al., 2009). For
composition, however, these tools are still difficult to use,
because composition is manual and biologists need advanced
knowledge to use them as scientific workflow tools. The
semi-automatic, or hybrid, approach is therefore very useful for
composing semantic web services in Bioinformatics. Nevertheless,
semantic web service composition remains a challenging and active
research topic in the semantic web and Service-Oriented
Architecture (SOA) arena. Its complexity arises from many aspects:
the number of services on the web is increasing, services are
updated on the fly, and services are developed in a distributed
fashion by many organizations with different models and features.
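
One way to picture the semi-automatic approach is a loop in which
software proposes the services that fit the data in hand and a
person chooses among them. The sketch below assumes that each
service is annotated with one input concept and one output concept.
The registry contents and the names `suggest_next` and `compose` are
hypothetical and do not describe BioMOBY or Taverna.

```python
from typing import NamedTuple


class Service(NamedTuple):
    name: str
    input_concept: str
    output_concept: str


# Hypothetical registry of semantically annotated services.
REGISTRY = [
    Service("OrganismLookup", "OrganismName", "OrganismId"),
    Service("PathwaySearch", "OrganismId", "PathwayId"),
    Service("PathwayBrowser", "OrganismId", "PathwayId"),
    Service("GeneList", "PathwayId", "GeneSet"),
]


def suggest_next(available_concept):
    """Automatic part: propose every service whose input fits the data."""
    return [s for s in REGISTRY if s.input_concept == available_concept]


def compose(start, goal, choose):
    """Build a chain from start to goal; `choose` is the human decision."""
    chain, concept = [], start
    while concept != goal:
        candidates = suggest_next(concept)
        if not candidates:
            raise ValueError(f"no service accepts {concept}")
        picked = choose(candidates)
        chain.append(picked.name)
        concept = picked.output_concept
    return chain


# Stand-in for the scientist: always pick the last candidate offered.
workflow = compose("OrganismName", "GeneSet",
                   choose=lambda options: options[-1])
print(workflow)
```

In an interactive tool the `choose` callback would present the
candidates to the scientist, so the final selection and its
justification remain with the expert.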



1.5 SUMMARY
In this chapter we have seen that Semantic Web technologies have
the potential to overcome many of the current limitations in
Systems Biology and Bioinformatics in terms of data integration and
knowledge retrieval. We also discussed the fundamental principles
of biological databases, knowledge management, the semantic web and
semantic web services. Semantic web technology has great potential
to make Bioinformatics and life science more mature and meaningful
in laboratory research. Furthermore, technologies such as web
services can add more dynamic behavior to service-based
applications.

Knowledge management that uses the concept of a Knowledge Portal
is built on web portal technology. The concept suits end users
because of its transparency, and it reduces their burden by
providing functionality to query, visualize and retrieve data and
knowledge from integrated databases. Many resources provide a
semantic systems biology portal that combines several databases in
a centralized repository and exposes them through biological
queries (SPARQL) combined with visualization of biological
networks. The results of this research were able to support
hypothesis and inference for further experiments in biology.

Ontology and semantics are often used for biological data
integration. Some research uses ontology and semantics only, while
other research combines them with web portal technology. Many
resources use semantics to integrate various data sources, such as
genes and gene products; the collected data are encoded in RDF and
merged with an ontology before being stored in a repository. The
knowledge in these data is then manipulated using SPARQL queries.
Web Services are also used for data integration, especially with
heterogeneous and distributed data sources. Commonly, a provider's
data repository is exposed to integration and retrieval facilities
through web services. For example, KEGG provides services to
retrieve various data, from genome to pathway databases. Users can
then use these web services to develop a client program or software
to retrieve and integrate KEGG data.

Based on the reviews that have been done, the need for data
integration and retrieval in the biological domain is critical
because of the huge number of databases available over the
Internet. Furthermore, new knowledge and inferences can be derived
from the integrated databases. Ontology, the semantic web and
semantic web services play a very important role in meeting the
challenge of data integration in this domain.

Acknowledgements
Malaysian Genome Institute (MGI) and vot 73744 from University
Technology of Malaysia Research Management Centre (RMC)


REFERENCES

[1]. Berman, H.M., et al. 2002. The Nucleic Acid Database. Acta
     Crystallographica Section D-Biological Crystallography. 58:
     889-898

[2]. Tateno, Y. and T. Gojobori. 1997. DNA Data Bank of Japan
     in the age of information biology. Nucleic Acids Research.
     25(1): 14-17.

[3]. Sayers, E.W. 2009. Database resources of the National
     Center for Biotechnology Information (vol 37, pg D5, 2008)
     Nucleic Acids Research. 37(9):3124-3124.

[4]. Emmert, D.B. 1994. The European-Bioinformatics-Institute
     (EBI) Databases. Nucleic Acids Research. 22(17):3445-3449.

[5]. Magrane, M. and U. Consortium. 2007. The UniProt
     Knowledgebase: a useful resource for developmental
     biology. Genetical Research. 89(3): 184-185

[6]. Gasteiger, E. 2003. ExPASy: the proteomics server for in-
     depth protein knowledge and analysis. Nucleic Acids
     Research. 31(13):3784-3788.

[7]. Kanehisa, M. and S. Goto. 2000. KEGG: Kyoto
     Encyclopedia of Genes and Genomes. Nucleic Acids
     Research. 28(1): 27-30.

[8]. Karp, P.D. 2005. BioCyc pathway database collection and
     the pathway tools software. Abstracts of Papers of the
     American Chemical Society. 229: U1178-U1178

[9]. Stein, L., et al. 2007. Reactome: a knowledge base of
     biological pathways and processes. Genome Biology. 8(3).

[10]. Smith, B., et al. 2007. The OBO Foundry: coordinated
      evolution of ontologies to support biomedical data
      integration. Nature Biotechnology. 25(11): 1251-1255.

[11]. Harris, M.A., et al. 2006. The Gene Ontology (GO) project in
      2006. Nucleic Acids Research. 34: D322-D326.

[12]. Bader, G.D., et al. 2010. The BioPAX community standard
      for pathway data sharing. Nature Biotechnology. 28(9): 935-
      942.

[13]. Davis, R., H. Shrobe, and P. Szolovits. 1993. What Is a
      Knowledge Representation. AI Magazine, 14(1): 17-33

[14]. Antezana, E., M. Kuiper, and V. Mironov. 2009. Biological
      knowledge management: the emerging role of the Semantic
      Web technologies. Briefings in Bioinformatics, 10(4): 392-
      407

[15]. Gruber, T.R. 1995. Toward principles for the design of
      ontologies used for knowledge sharing. International Journal
      of Human-Computer Studies, 43(5-6): 907-928

[16]. Guarino, N. 1999. Formal ontology and conceptual
      modeling. Data & Knowledge Engineering, 31(2): V-Vi.

[17]. Berners-Lee, T., J. Hendler, and O. Lassila. 2001. The
      Semantic Web - A new form of Web content that is
      meaningful to computers will unleash a revolution of new
      possibilities. Scientific American, 284(5): 34-+.

[18]. Decker, S., P. Mitra, and S. Melnik. 2000. Framework for the
      semantic Web: An RDF tutorial. IEEE Internet Computing.
      4(6): 68-73.

[19]. Bechhofer, S., R. Volz, and P. Lord. 2003. Cooking the
      semantic web with the OWL API. Semantic Web – ISWC.
      2870: 659-675

[20]. Heese, R. 2006. Query graph model for SPARQL. Advances
      in Conceptual Modeling – Theory and Practice. 4231: 445-
      454.

[21]. Martin, D., Paolucci M. and McIlraith S. 2005. Bringing
      semantics to web services: The OWL-S approach. Semantic
      Web Services and Web Process Composition, 3387: 26-42.

[22]. Roman, D., Bruijn J. and Mocan A. 2006. WWW: WSMO,
      WSML, and WSMX in a nutshell. Semantic Web - ASWC
      2006, Proceedings. 4185: 516-522.

[23]. Kopecky, J., Vitvar, T., Bournez, C. and Farrell, J. 2007.
      SAWSDL: Semantic annotations for WSDL and XML
      schema. IEEE Internet Computing, 11(6): 60-67.

[24]. Gottschalk, K., et al. 2002. Introduction to Web services
      architecture. IBM Systems Journal. 41(2): 170-177.

[25]. Curbera, F., et al. 2002. Unraveling the Web services Web –
      An introduction to SOAP, WSDL, and UDDI. IEEE Internet
      Computing. 6(2): 86-93.

[26]. Gordon, R.S. 2003. Understanding Web services: XML,
      WSDL, SOAP, and UDDI. Library Journal. 128(2):111-111.

[27]. Paolucci, M., et al. 2002. Importing the semantic web in
      UDDI. Web Services, E-Business, and the Semantic Web.
      2512: 225-236.

[28]. Wilkinson, M., Schoof H., Ernst R. and Haase D. 2005.
      BioMOBY successfully integrates distributed heterogeneous
      bioinformatics Web services. The Planet exemplar case.
      Plant Physiology. 138(1): 4-16.

[29]. Tan, W., Missier P., Madduri R. and Foster I. 2008. Building
      Scientific Workflow with Taverna and BPEL: A
      Comparative Study in caGrid. Service-Oriented Computing –
      ICSOC. Workshops, 2009. 5472: 118-129416.

[30]. Morello, E., Bermudez-Humaran LG., Llull D., Sole V.,
      Miraglio N., Langella P. and Poquet I. 2008. Lactococcus
      lactis, an efficient cell factory for recombinant protein
      production and secretion. Journal of Molecular
      Microbiology and Biotechnology, 14(1-3): 48-58.

				