Developing ontologies with Protégé by pengxiuhui


									Semantic Webs for Life Sciences

            PSB 2006




        Personnel

► Olivier Bodenreider
  National Library of Medicine, NIH, USA.
► Yves Lussier
  Department of Medical Informatics,
  Columbia University, USA.
► Robert Stevens
  University of Manchester, UK.




        Introduction

►The Web today
►The Semantic Web vision
►Talking about facts…
►The Resource Description Framework
►Naming things in a Semantic Web
►Ontologies on the SW
►RDFS and the Web Ontology Language
►Semantic Webs and Semantic Web applications

                A Web of Information in
                   Bioinformatics




…with a human
in the middle
     Human Centric Information

► Over 700 bioinformatics data resources
► Many analysis tools producing more data
► Still largely in human readable form only
► A human biologist sits at the centre and does all
  the semantic work




A Stack of Troubles


        Semantics


         Structure

         Syntax


         System

         Heterogeneity all Around

►Numerous, distributed, highly heterogeneous
 resources
►Differing platforms, APIs, storage paradigms, query
 languages, …
►Differing formats, syntax, etc.
►Differing schemas, implying differing conceptualisations
►Differing values for schema elements
►Semantic heterogeneity…


         Uniprot: a protein database?

ID   PRIO_HUMAN     STANDARD;       PRT;  253 AA.
AC   P04156;
DT   01-NOV-1986 (Rel. 03, Created)
DT   01-NOV-1986 (Rel. 03, Last sequence update)
DT   20-AUG-2001 (Rel. 40, Last annotation update)
DE   Major prion protein precursor (PrP) (PrP27-30) (PrP33-35C) (ASCR).
GN   PRNP.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX   NCBI_TaxID=9606;
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE=86300093; PubMed=3755672;
RA   Kretzschmar H.A., Stowring L.E., Westaway D., Stubblebine W.H.,
RA   Prusiner S.B., Dearmond S.J.;
RT   "Molecular cloning of a human prion protein cDNA.";
RL   DNA 5:315-324(1986).
RN   [2]
RP   SEQUENCE OF 8-253 FROM N.A.
RX   MEDLINE=86261778; PubMed=3014653;
RA   Liao Y.-C.J., Lebo R.V., Clawson G.A., Smuckler E.A.;
RT   "Human prion protein cDNA: molecular cloning, chromosomal mapping,
RT   and biological implications.";
RL   Science 233:364-367(1986).
RN   [3]
RP   SEQUENCE OF 58-85 AND 111-150 (VARIANT AMYLOID GSS).
RX   MEDLINE=91160504; PubMed=1672107;
RA   Tagliavini F., Prelli F., Ghiso J., Bugiani O., Serban D.,
RA   Prusiner S.B., Farlow M.R., Ghetti B., Frangione B.;
RT   "Amyloid protein of Gerstmann-Straussler-Scheinker disease (Indiana
RT   kindred) is an 11 kd fragment of prion protein with an N-terminal
RT   glycine at codon 58.";
RL   EMBO J. 10:513-519(1991).
RN   [4]
RP   STRUCTURE BY NMR OF 118-221.
RX   MEDLINE=20359708; PubMed=10900000;
RA   Calzolai L., Lysek D.A., Guntert P., von Schroetter C., Riek R.,
RA   Zahn R., Wuethrich K.;
RT   "NMR structures of three single-residue variants of the human prion
RT   protein.";
RL   Proc. Natl. Acad. Sci. U.S.A. 97:8340-8345(2000).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE
CC       HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED
CC       "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- POLYMORPHISM: THE FIVE TANDEM OCTAPEPTIDE REPEATS REGION IS HIGHLY
CC       UNSTABLE. INSERTIONS OR DELETIONS OF OCTAPEPTIDE REPEAT UNITS ARE
CC       ASSOCIATED TO PRION DISEASE.
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE
CC       BRAIN OF HUMANS AND ANIMALS INFECTED
CC       WITH NEURODEGENERATIVE DISEASES KNOWN AS
CC       TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION
CC       DISEASES,LIKE: CREUTZFELDT-JAKOB DISEASE (CJD),
CC       GERSTMANN-STRAUSSLER SYNDROME (GSS), FATAL
CC       FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS;
CC       SCRAPIE IN SHEEP AND GOAT; BOVINE SPONGIFORM
CC       ENCEPHALOPATHY (BSE) IN CATTLE; TRANSMISSIBLE
CC       MINK ENCEPHALOPATHY (TME); CHRONIC WASTING
CC       DISEASE (CWD) OF MULE DEER AND ELK; FELINE
CC       SPONGIFORM ENCEPHALOPATHY (FSE) IN CATS AND
CC       EXOTIC UNGULATE ENCEPHALOPATHY (EUE) IN
CC       NYALA AND GREATER KUDU. THE PRION DISEASES
CC       ILLUSTRATE THREE MANIFESTATIONS OF CNS
CC       DEGENERATION: (1) INFECTIOUS (2)
CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS.
CC       TME, CWD, BSE, FSE, EUE ARE ALL THOUGHT TO
CC       OCCUR AFTER CONSUMPTION OF PRION-INFECTED
CC       FOODSTUFFS.
DR   EMBL; M13667; AAA19664.1; -.
DR   EMBL; M13899; AAA60182.1; -.
DR   EMBL; D00015; BAA00011.1; -.
DR   PIR; A05017; A05017.
DR   PIR; A24173; A24173.
DR   PIR; S14078; S14078.
DR   PDB; 1E1G; 20-JUL-00.
DR   PDB; 1E1J; 20-JUL-00.
DR   PDB; 1E1P; 20-JUL-00.
DR   PDB; 1E1S; 21-JUL-00.
DR   PDB; 1E1U; 20-JUL-00.
DR   PDB; 1E1W; 20-JUL-00.
DR   MIM; 176640; -.
DR   MIM; 123400; -.
DR   MIM; 137440; -.
DR   MIM; 245300; -.
DR   MIM; 600072; -.
DR   MIM; 604920; -.
DR   InterPro; IPR000817; Prion.
DR   Pfam; PF00377; prion; 1.
DR   PRINTS; PR00341; PRION.
DR   SMART; SM00157; PRP; 1.
DR   PROSITE; PS00291; PRION_1; 1.
DR   PROSITE; PS00706; PRION_2; 1.
KW   Prion; Brain; Glycoprotein; GPI-anchor; Repeat; Signal;
KW   3D-structure; Polymorphism; Disease mutation.
FT   SIGNAL        1     22
FT   CHAIN        23    230       MAJOR PRION PROTEIN.
FT   PROPEP      231    253       REMOVED IN MATURE FORM (BY SIMILARITY).
FT   LIPID       230    230       GPI-ANCHOR (BY SIMILARITY).
FT   CARBOHYD    181    181       N-LINKED (GLCNAC...) (PROBABLE).
FT   DISULFID    179    214       BY SIMILARITY.
FT   DOMAIN       51     91       5 X 8 AA TANDEM REPEATS OF P-H-G-G-G-W-G-Q.
FT   REPEAT       51     59       1.
FT   REPEAT       60     67       2.
FT   REPEAT       68     75       3.
FT   REPEAT       76     83       4.
FT   REPEAT       84     91       5.
FT                                IN PATIENTS WHO HAVE A PRP MUTATION AT
FT                                CODON 178: PATIENTS WITH MET DEVELOP FFI,
FT                                THOSE WITH VAL DEVELOP CJD).
FT                                /FTId=VAR_006467.
FT   VARIANT     171    171       N -> S (IN SCHIZOAFFECTIVE DISORDER).
FT                                /FTId=VAR_006468.
FT   VARIANT     178    178       D -> N (IN FFI AND CJD).
FT                                /FTId=VAR_006469.
FT   VARIANT     180    180       V -> I (IN CJD).
FT                                /FTId=VAR_006470.
FT   VARIANT     183    183       T -> A (IN FAMILIAL SPONGIFORM
FT                                ENCEPHALOPATHY).
FT                                /FTId=VAR_006471.
FT   VARIANT     187    187       H -> R (IN GSS).
FT                                /FTId=VAR_008746.
FT   VARIANT     188    188       T -> K (IN EOAD; DEMENTIA ASSOCIATED TO
FT                                PRION DISEASES).
FT                                /FTId=VAR_008748.
FT   VARIANT     188    188       T -> R.
FT                                /FTId=VAR_008747.
FT   VARIANT     196    196       E -> K (IN CJD).
FT                                /FTId=VAR_008749.
FT                                /FTId=VAR_006472.
SQ   SEQUENCE   253 AA;  27661 MW;  43DB596BAAA66484 CRC64;
     MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP
     PQGGGGWGQP HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN
     KPSKPKTNMK HMAGAAAAGA VVGGLGGYML GSAMSRPIIH FGSDYEDRYY
     RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV NITIKQHTVT TTTKGENFTE
     TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV ILLISFLIFL
     IVG
//

    Representing facts on the Web

►Much knowledge represented as natural
 language…
►… and much more that is unusable by a computer


►Stating that a thing has a relationship to another
 thing
   ► “protein has function hydrolysis”
   ► “protein has name lymphocyte associated receptor of death”
   ► “protein expressed by gene trp!”


 Facts as RDF Triples


Subject    Predicate    Object

Gene       hasName      “TrpA”




        The Vision

►By describing facts in a computationally amenable
 form, computers can do some of the work
►RDF forms the semantic glue for the Semantic Web
►RDF triples form a graph of facts that a computer can
 traverse, query and draw inferences from
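
A minimal sketch of what "query" can mean on such a graph, assuming facts are held as plain (subject, predicate, object) tuples. The facts reuse the gene and protein example from these slides; the `match` helper and its "None means wildcard" convention are illustrative assumptions, not part of any RDF standard or library.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]

# Facts taken from the gene/protein example used in these slides.
FACTS = [
    ("Gene", "hasName", "TrpA"),
    ("Gene", "expresses", "Protein"),
    ("Protein", "hasName", "Tryptophan Synthetase"),
    ("Protein", "hasSubstrate", "Chemical"),
]


def match(
    facts: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every fact that agrees with the pattern; None matches anything."""
    for fact in facts:
        if all(want is None or want == got for want, got in zip((s, p, o), fact)):
            yield fact


# "What names are stated anywhere in the graph?"
names = [obj for _, _, obj in match(FACTS, p="hasName")]
print(names)
```

Real RDF query languages express the same idea with variables in place of the wildcards.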




  What Do We Need for this Vision?

► A standard way of finding and naming things
► A standard way of describing things
► Standard vocabularies for talking about things




        Semantic Web
        Technologies
► The Resource Description Framework (RDF)
► Uniform Resource Identifiers (URIs)
► RDF Schema (RDFS)
► The Web Ontology Language (OWL)
► RDF query languages
► RDF stores
► Web Services
► Rules languages

           Triple Basics
► Stating a thing about another thing
► Subject predicate object
► Subject verb object
► Animal hasName Aardvark
► A triple describes resources on the Web
► A resource can be any form of information – not just a page
► Resources identified by Uniform Resource Identifiers (URI), of
  which URLs are a kind
► The names used in a triple are Uniform Resource Names (URN)
► Names can also be literals
► All parts of triples have names, providing a vocabulary
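
A minimal sketch of the distinction drawn above between resources named by URIs and plain literal values. The class names and the `http://example.org/zoo#` namespace are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class URIRef:
    """A resource identified by a URI."""
    value: str


@dataclass(frozen=True)
class Literal:
    """A plain value such as a name or a number, not a resource."""
    value: str


@dataclass(frozen=True)
class Triple:
    subject: URIRef
    predicate: URIRef
    obj: Union[URIRef, Literal]  # only the object may be a literal


EX = "http://example.org/zoo#"  # hypothetical namespace

# "Animal hasName Aardvark": the name is a literal, the rest are resources.
fact = Triple(URIRef(EX + "animal42"), URIRef(EX + "hasName"), Literal("Aardvark"))
print(fact)
```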


                   A Family of Identifiers

URI
  URL
  URN
    LSID

URI = Uniform Resource Identifier
URL = Uniform Resource Locator
URN = Uniform Resource Name
LSID = Life Science Identifier

 Uniform Resource Identifiers
URI
  URL
  URN



► A URI does as the name suggests – it identifies
  unique things on the Web
► This identifier can be a Uniform Resource Locator
  or a Uniform Resource Name


 Uniform Resource Locators

► URLs identify unique locations on the Web
► Things or fragments of things
► They specify a protocol by which they are retrieved
  – http, ftp, mailto,…
► Also specify domain and host


   ► http://www.cs.man.ac.uk/~stevensr
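
The parts of a URL named above (protocol, host, and an optional fragment) can be picked apart with Python's standard `urllib.parse`. This is a small sketch; the second URL is a made-up example used only to show a fragment.

```python
from urllib.parse import urlparse

# The URL from the slide.
parts = urlparse("http://www.cs.man.ac.uk/~stevensr")
print(parts.scheme)    # the protocol used for retrieval
print(parts.hostname)  # the domain and host
print(parts.path)      # the thing located on that host

# A hypothetical URL that points at a fragment of a thing.
with_fragment = urlparse("http://example.org/page#section2")
print(with_fragment.fragment)
```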




     Uniform Resource Names
► A name, but not necessarily a location
    ► urn:isbn:
    ► urn:lsid:gene.ucl.ac.uk.lsid.biopathways.org:hugo:MVP

► Life Science Identifiers (LSIDs) are special kinds of
  URNs for biological entities

URN
  LSID
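
A minimal sketch of splitting an LSID into its parts, assuming the usual layout `urn:lsid:<authority>:<namespace>:<object>` with an optional trailing revision. The `parse_lsid` function and the `LSID` tuple are illustrative names, not an official API, and the sketch does not resolve anything.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split an LSID string on colons and check the urn:lsid prefix."""
    parts = text.split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    return LSID(*parts[2:])


# The LSID shown on the slide.
mvp = parse_lsid("urn:lsid:gene.ucl.ac.uk.lsid.biopathways.org:hugo:MVP")
print(mvp.authority, mvp.namespace, mvp.object_id, mvp.revision)
```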
         Life Science Identifiers

► OMG standard for uniquely identifying biological
  entities
► An LSID can be resolved by an LSR to deliver the entity
► The entity is immutable and versioned
► Metadata can be delivered alongside entity
► Different parties can provide different metadata




                      RDF Triple with URI


Subject                                                  Predicate   Object

urn:lsid:gene.ucl.ac.uk.lsid.biopathways.org:hugo:MVP    hasName     “MVP”
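
As a sketch, the triple above can be written out as one line in N-Triples style. In that syntax the predicate must also be a URI, so `http://example.org/vocab#hasName` is a hypothetical stand-in for the slide's `hasName`. The helper is illustrative and only escapes backslashes and double quotes, not every character the full syntax covers.

```python
def ntriples_line(subject: str, predicate: str, literal: str) -> str:
    """Format one subject/predicate/literal statement as an N-Triples-style line."""
    # Minimal escaping only: backslash and double quote.
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


line = ntriples_line(
    "urn:lsid:gene.ucl.ac.uk.lsid.biopathways.org:hugo:MVP",
    "http://example.org/vocab#hasName",  # hypothetical predicate URI
    "MVP",
)
print(line)
```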




Uniform Names For Things
    1   Are   7

    3   Are   7

    9   Are   1

    2   Are   7
    9   Are   2
    5   Are   4
    6   Are   8
    2   Are   1
    6   Are   3
    4   Are   8


    Aggregation By Names


[Figure: the numbered pairs merged into a graph over nodes 1 to 9, joined wherever the same name appears]
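
A minimal sketch of aggregation by names, using the numbered pairs listed under "Uniform Names For Things". It assumes each "x Are y" line is an undirected link between two named nodes, which is an interpretation of the slide rather than something it states. Because every source uses the same names, the separate pairs merge into one graph.

```python
from collections import defaultdict

# The ten pairs from the slide, read as undirected links (an assumption).
PAIRS = [(1, 7), (3, 7), (9, 1), (2, 7), (9, 2),
         (5, 4), (6, 8), (2, 1), (6, 3), (4, 8)]


def aggregate(pairs):
    """Merge pairs into an adjacency map keyed by node name."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def reachable(graph, start):
    """Return every node that can be reached from start by following links."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in graph[node] - seen:
            seen.add(neighbour)
            stack.append(neighbour)
    return seen


graph = aggregate(PAIRS)
print(sorted(graph[7]))
print(sorted(reachable(graph, 5)))
```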




A Collection of Triples


 Gene     hasName     “TrpA”

 Gene     expresses   Protein

Protein   hasName     Tryptophan Synthetase

Protein hasSubstrate Chemical




  Aggregation of Triples

Gene ──hasName──► “TrpA”
Gene ──expresses──► Protein
Protein ──hasName──► Tryptophan Synthetase
Protein ──hasSubstrate──► Chemical
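
A minimal sketch of aggregating triples and then traversing the result, using the four triples from "A Collection of Triples". Splitting them into a `gene_db` and a `protein_db` is an assumption made here to show two sources joining on the shared name `Protein`; the helper names are illustrative.

```python
# Two hypothetical sources that happen to use the same name, "Protein".
gene_db = [
    ("Gene", "hasName", "TrpA"),
    ("Gene", "expresses", "Protein"),
]
protein_db = [
    ("Protein", "hasName", "Tryptophan Synthetase"),
    ("Protein", "hasSubstrate", "Chemical"),
]


def build_graph(*sources):
    """Merge triples from any number of sources into subject -> predicate -> objects."""
    graph = {}
    for source in sources:
        for s, p, o in source:
            graph.setdefault(s, {}).setdefault(p, []).append(o)
    return graph


def follow(graph, start, *predicates):
    """Walk from start along a chain of predicates and return the end nodes."""
    frontier = [start]
    for p in predicates:
        frontier = [o for node in frontier for o in graph.get(node, {}).get(p, [])]
    return frontier


graph = build_graph(gene_db, protein_db)

# "What is the name of the protein that the gene expresses?"
print(follow(graph, "Gene", "expresses", "hasName"))
```

Neither source alone can answer the question; the answer only appears once the triples are aggregated.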

        Lingua Franca

►Everything can be represented as RDF triples
►A common data model
►Deliver underlying resources (DB) as RDF
►Common data format
►Common semantics for that format
►Provide a Semantic Bus
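
A minimal sketch of delivering a database record as RDF-style triples. The values come from the Uniprot entry shown earlier in these slides; the column names, both `http://example.org/...` namespaces and the `row_to_triples` helper are hypothetical.

```python
def row_to_triples(row, key,
                   base="http://example.org/uniprot/",   # hypothetical namespace
                   vocab="http://example.org/vocab#"):   # hypothetical vocabulary
    """Turn one record into triples: the key column names the subject,
    and every other non-empty column becomes one predicate/object pair."""
    subject = base + row[key]
    return [
        (subject, vocab + column, value)
        for column, value in row.items()
        if column != key and value is not None
    ]


record = {
    "accession": "P04156",
    "geneName": "PRNP",
    "organism": "Homo sapiens (Human)",
}

triples = row_to_triples(record, key="accession")
for triple in triples:
    print(triple)
```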




         RDF Vocabularies

► A collection of RDF statements with their names
  forms a vocabulary
► RDF names designed for representing Uniprot
  become a Uniprot vocabulary
► Creating standard vocabularies for a domain
  attacks the major semantic barrier
► For example, a vocabulary for talking about
  molecular function, biological processes, cellular
  components and sequence features etc.
► That is, vocabularies delivered by ontologies

Uniprot Keywords in RDF

Subject                Predicate      Object
Acetoin Biosynthesis   rdfs:comment   Protein involved in the synthesis of acetoin
Acetoin Biosynthesis   rdf:type       Uniprot Concept
Acetoin Biosynthesis   owl:sameAs     urn:lsid:uniprot.org:go:45151

                       After Eric Jain
             http://www.isb-sib.ch/~ejain/rdf/
                               Semantic Bus

[Figure: applications sit above the Semantic Bus; data sources plug in below it, each delivered as RDF]

Above the bus:  Knowledge discovery tools, Data mining tools, Semantic portals,
                Social networking, Smart discovery & retrieval

Below the bus:  BLASTp, UniProt, PubMed, Web pages, PDF docs, Instruments, Notes

        An RDF World

► Distributed heterogeneous resources present their
  data as RDF
► A common data model for a sea of data
► A “bus” into which resources can plug
► Common syntax, common data model
► But no common vocabulary for values on the bus
► Also need vocabularies from ontologies
► Build ontologies in the Web Ontology Language
  (OWL) and use them via RDF Schema

             SciFOAF

http://www.urbigene.com/foaf/

             UniProt RDF

http://www.isb-sib.ch/~ejain/rdf/intro.html

         A Shared Understanding

►Semantic understanding is perhaps the biggest trouble
►(Computer science) ontology – a technique for
 representing the things and relationships between
 things in our world
►Overcoming vocabulary problems
  ► Same word, different understandings (polysemy)
  ► Different terms, same understanding (synonymy)

►To compare understandings in different resources,
 we need a common understanding

          RDF Schema (RDFS)

► A collection of names in a graph forms a
  vocabulary
► RDF Schema (RDFS) is an RDF vocabulary for
  talking about ontologies
► Can talk about
   ► Classes
   ► Class/sub-class and other associative relationships

► Able to describe ontologies in RDFS
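
A minimal sketch of the kind of inference that class/sub-class statements allow: sub-class links are transitive, and an instance of a class is also an instance of its super-classes. The class and instance names below are hypothetical, and the naive loop is for illustration only, not how a real RDFS reasoner is built.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

# Hypothetical mini-ontology plus one instance.
triples = {
    ("TryptophanSynthetase", SUBCLASS, "Enzyme"),
    ("Enzyme", SUBCLASS, "Protein"),
    ("trpA_product", TYPE, "TryptophanSynthetase"),
}


def infer(facts):
    """Repeat two rules until nothing new appears:
    sub-class transitivity, and type propagation up the class hierarchy."""
    inferred = set(facts)
    changed = True
    while changed:
        new = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                if p2 != SUBCLASS or o != s2:
                    continue
                if p == SUBCLASS:
                    new.add((s, SUBCLASS, o2))  # A sub B, B sub C => A sub C
                elif p == TYPE:
                    new.add((s, TYPE, o2))      # x type B, B sub C => x type C
        changed = not new <= inferred
        inferred |= new
    return inferred


closure = infer(triples)
print(("trpA_product", TYPE, "Protein") in closure)
```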


         RDFS Vocabularies

► The RDF Vocabulary Description Language
► RDF Schema “semantically extends” RDF to
  enable us to talk about classes of resources, and
  the properties that will be used with them
► It does this by giving special meaning to certain
  RDF properties and resources
► RDF Schema provides the means to describe
  application specific RDF vocabularies



Web Ontology Language (OWL)

► Latest standard in ontology languages from the
  World Wide Web Consortium (W3C)
► Built on top of RDF
► OWL semantically extends RDF(S)
► Based on its predecessor language DAML+OIL
► OWL has a rich set of modelling constructors
► Three ‘species’
  ►   OWL-Lite
  ►   OWL-DL
  ►   OWL-Full

Components of an OWL Ontology

► Individuals
► Properties
► Classes




          Three Kinds of OWL

► OWL-Full
   ► The union of OWL and RDF(S)
   ► No restrictions on how/where language constructs can be used
   ► OWL-Full is not decidable

► OWL-DL
   ► Restricted version of OWL-Full
   ► Corresponds to a description logic
   ► Certain restrictions on how/where language constructs can be used
     in order to guarantee decidability

► OWL-Lite
   ► A subset of OWL-DL
   ► The simplest and easiest to implement of the three species
         OWL and RDF

► Valid OWL-Lite is valid OWL-DL, and valid OWL-DL is valid OWL-Full
► Not the other way around
► All can be delivered as RDF
► RDFS using all its expressivity can be delivered as
  OWL-Full




 What Makes a Semantic Web

► Stores of RDF statements about pages
► Semantically enabled browsers
► Using links provided by RDF
► Using vocabulary provided by RDF to search
► Presentation sits on top of semantic layer – so can
  differ




          What Makes a Semantic Web

[Figure: components around the Semantic Bus: Web pages, RDF stores, a Semantic Web browser, and query engines (RDQL, SeRQL, SPARQL)]

           Semantic Web Applications

[Figure: components around the Semantic Bus: RDF stores, Semantic Web applications, and query engines (RDQL, SeRQL, SPARQL)]

                5 reasons for SW 4 LS
► The problem matches up
   ► A fragmented, distributed, large-scale, mismatched, dynamic, variable,
     information-driven community
   ► Loosely coupled multiple suppliers and consumers making connections
► Culture of (controlled) sharing, curating and connecting content
   ► Lots of publicly available diverse information and knowledge content
► Return on investment is worth it.
   ► Business as usual!




        On a Threshold?

► Biology in a prime position to make Semantic Webs
► Lots of semantically heterogeneous distributed
  resources
► Web orientated
► Many ontologies potentially delivered as RDF
► Beginning to see SW applications




         Acknowledgements

► Slides provided by: Carole Goble, Matt Horridge,
  Nick Drummond, and many others from the
  University of Manchester
► Yeliz Yesilada for drawing and formatting




A Smooth Journey on the
Semantic Web Underground




       From Tim Berners-Lee
Another Level Down
Visualisation and Querying


       Graph Model


        RDF/XML




       Data Sources
Semantic Web Stack




         From Tim Berners-Lee

								