									Curated Databases


   Peter Buneman

 School of Informatics
University of Edinburgh
 Scotland   Corfu   England




ECDL                          2
                        The Population of Corfu
113.479 (2001)               http://www.corfunext.com/corfu_geography.htm

107,879 (as of 2001 )        http://en.wikipedia.org/wiki/Corfu ***

93,000                       http://www.corfunet.com/corfu/

109,512                      www.agni.gr/

110,000                      www.corfuvisit.net

70,000                       http://www.newadvent.org/cathen/04362a.htm

107,600                      http://www.greek‐hotels.com/

105,043                      http://www.merriam‐webster.com/dictionary/corfu

approximately 110,000        www.kassiopi.com/MenuContent.aspx?MenuId=6

approximately 120.000        http://www.gardeno‐corfu.com/

115,200 (2003 est)           http://encyclopedia.farlex.com/Corfu

around 110,000               http://www.sunshinetravel.gr/CORFUGUIDE/CORFU_TRAVEL_GUIDE 0‐1.htm

110.000                      http://www.dialashop.com/travel/corfu.html

about 110,00                 http://www.argobenitses.gr/greece.php

97,102 in 1981               http://geography.howstuffworks.com/europe/corfu.htm

107,880                      http://catalogue.horse21.net/greece+hotels/corfu+hotels/hotels5/luxury

109,512                      http://www.corfu‐property.gr/content/view/14/38/lang,en/

about 100,000                http://members.virtualtourist.com/m/6ce90/67541/

110,000 approximately        http://www.corfu‐island.org/features.htm

107,000                      http://www.nytimes.com/2009/09/11/greathomesanddestinations/11iht‐recorfu.html


*** The only site to give attribution: http://www.statistics.gr/portal/page/portal/ESYE
       These are both curated databases
             What is a curated database?




• A curated database is one that is maintained with a lot of 
  human effort
• Curare: Latin “to care for”
• Prime concern is quality of data
                      What is a database?
                 (for the purposes of this talk)
• Any structured collection of data that is subject to 
  change/revision 
       –   Ontologies
       –   XML  and other structured text files
       –   Structured wikis 
       –   Standard relational and object‐oriented databases




Curated databases have interesting properties…

• A digital reference work. Traditional dictionaries, gazetteers 
  and encyclopedias have been replaced by curated databases.
• Value lies in the organization and annotation of data
• Commonly constructed by copying parts of other (curated) databases.
• Rapidly increasing in scientific research. (> 1000 in molecular biology)
• Constantly checked/verified. Data quality and timeliness are 
  important.
• Often group efforts.  Produced by a dedicated organization or  
  collaboration.
• Increasingly seen as “publications” by scientists. (You get kudos if 
  someone uses your database – like a citation.)


     ... and they are very expensive

          In $/€/£ per byte

“Reliable” code / Curated data    10
“Production” code / Curated data  1
Book                              10^-1
[Movie]                           10^-3
Big physics (LHC) data            10^-7

        Why should digital librarians worry about 
                  curated databases?
• They are important digital artifacts. We should 
  preserve them for the “scholarly record”
• Archivists, librarians and ontologists – digital or 
  otherwise – construct catalogs and organising 
  metadata.  They build curated databases.
       – over 50% of the papers at this meeting relate to this?
       – The metadata is often more valuable than the data!
       – Shouldn’t digital librarians “curate” their own work as well 
         as that of others?


            What’s this with the “scholarly record”?
113.479 (2001)                http://www.corfunext.com/corfu_geography.htm

107,879 (as of 2001 )         http://en.wikipedia.org/wiki/Corfu ***

109,512                       www.agni.gr/

105,043                       http://www.merriam‐webster.com/dictionary/corfu

115,200 (2003 est)            http://encyclopedia.farlex.com/Corfu

97,102 in 1981                http://geography.howstuffworks.com/europe/corfu.htm

107,880                       http://catalogue.horse21.net/greece+hotels/corfu+hotels/hotels5/luxury




          Some of these figures are not guesswork: they were either copied
          from somewhere or calculated.  From where or how?

          Scientists do not trust data unless they know where it came from
          or how it was generated.

          And someone in the future might want to know what the 
          population of Crete was in 2009 !!!
                    A change for the better?




Storage:                                 Storage:
• Redundant                              • Single-source
• Persistent                             • Volatile
• Distributed                            • Centralised
• Readable by people                     • Internal DBMS format
Clear standards for citation             No standards for citation
Historical record (old data is useful)   No historical record
Well understood ownership/IP             Mind-boggling legal issues

       20th century libraries did some things better!

             Some computer science issues 

•   Archiving (CS usage)
•   Provenance
•   Annotation/citation
•   Data cleaning

       All of these are intimately connected.  

       For example, if you cite some part of a curated database, 
       the version you cited should be available (archiving)



         Some well-known curated databases
                              ID   11SB_CUCMA      STANDARD;      PRT;   480 AA.
                              AC   P13744;
                              DT   01-JAN-1990 (REL. 13, CREATED)
                              DT   01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
                              DT   01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
                              DE   11S GLOBULIN BETA SUBUNIT PRECURSOR.
                              OS   CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
                              OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
                              OC   VIOLALES; CUCURBITACEAE.
                              RN   [1]
                              RP   SEQUENCE FROM N.A.
                              RC   STRAIN=CV. KUROKAWA AMAKURI NANKIN;
                              RX   MEDLINE; 88166744.
                              RA   HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
                              RL   EUR. J. BIOCHEM. 172:627-632(1988).
                              RN   [2]
                              RP   SEQUENCE OF 22-30 AND 297-302.
                              RA   OHMIYA M., HARA I., MASTUBARA H.;
                              RL   PLANT CELL PHYSIOL. 21:157-167(1980).
                              CC   -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
                              CC   -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A
                              CC       BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
                              CC       DISULFIDE BOND.
                              CC   -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
                              DR   EMBL; M36407; G167492; -.
                              DR   PIR; S00366; FWPU1B.
                              DR   PROSITE; PS00305; 11S_SEED_STORAGE; 1.
                              KW   SEED STORAGE PROTEIN; SIGNAL.
                              FT   SIGNAL        1      21
                              FT   CHAIN        22     480       11S GLOBULIN BETA SUBUNIT.
                              FT   CHAIN        22     296       GAMMA CHAIN (ACIDIC).
                              FT   CHAIN       297     480       DELTA CHAIN (BASIC).
                              FT   MOD_RES      22      22       PYRROLIDONE CARBOXYLIC ACID.
                              FT   DISULFID    124     303       INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
                              FT   CONFLICT     27      27       S -> E (IN REF. 2).
                              FT   CONFLICT     30      30       E -> S (IN REF. 2).
                              SQ   SEQUENCE   480 AA; 54625 MW; D515DD6E CRC32;
                                   MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR
                                   RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
                                   IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV
                                   FADTRNVANQ IDPYLRKFYL AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE
                                   EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE
                                   TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY
                                   TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF
                                   KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE
                              //




  CIA World Factbook                            Uniprot
            Archiving / Database Preservation

• How do we preserve something that evolves (both in 
  content and structure)?
• Keep snapshots?
       – frequent: space consuming
       – infrequent: lose “history”


       Most curated databases have a hierarchical 
       structure that we can exploit…



       A Sequence of Versions




Pushing time down




 This relies on a deterministic / keyed model – there’s a 
 unique path to every data item.
        [B., Khanna, Tajima, Tan, TODS 27,2 (2004)]
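
A minimal sketch of the idea, assuming each version is a keyed tree held as nested Python dicts. The node layout and the names `new_node`, `merge_version`, `snapshot` and `explicit_timestamps` are illustrative assumptions, not the format of the cited paper. Every version is merged into one tree, and a timestamp only has to be written out where it differs from the one inherited from the parent.

```python
def new_node():
    # "versions": versions in which this node exists.
    # "children": keyed sub-nodes. "values": leaf value -> versions carrying it.
    return {"versions": set(), "children": {}, "values": {}}

def merge_version(node, tree, version):
    """Merge one version of a keyed tree into the archive, in place."""
    node["versions"].add(version)
    if isinstance(tree, dict):
        for key, subtree in tree.items():
            # The key gives a unique path, so the same item always lands on the same node.
            child = node["children"].setdefault(key, new_node())
            merge_version(child, subtree, version)
    else:
        node["values"].setdefault(tree, set()).add(version)

def snapshot(node, version):
    """Recover the tree as it was at the given version."""
    for value, versions in node["values"].items():
        if version in versions:
            return value
    return {key: snapshot(child, version)
            for key, child in node["children"].items()
            if version in child["versions"]}

def explicit_timestamps(node, inherited=None, path=()):
    """Yield only the timestamps that cannot be inherited from the parent."""
    if node["versions"] != inherited:
        yield path, sorted(node["versions"])
    for value, versions in node["values"].items():
        if versions != node["versions"]:
            yield path + (value,), sorted(versions)
    for key, child in node["children"].items():
        yield from explicit_timestamps(child, node["versions"], path + (key,))

# Two tiny versions; the Andorra entry is a placeholder that does not change.
v1990 = {"Liechtenstein": {"Population": "28,292"}, "Andorra": {"Name": "Andorra"}}
v1991 = {"Liechtenstein": {"Population": "28,476"}, "Andorra": {"Name": "Andorra"}}

archive = new_node()
merge_version(archive, v1990, 1990)
merge_version(archive, v1991, 1991)
```

In this sketch only the root and the two population values need an explicit timestamp; every other node inherits its parent's, which is where the space saving would come from.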
               An initial experiment

• Grabbed the last 20 available versions of Swissprot
• XML‐ized all of them
• Also recorded all OMIM versions for about 14 weeks 
  (100 of them)
• Combined into archive XML format file by pushing 
  time down.




         100 days of OMIM

[Figure: size (bytes x 10^6) against version. Curves: archive, inc diff, version, compressed inc diff (gzip(inc diff)), compressed archive (XMill(archive))]

Uncompressed
• Archive size is
     – ≤ 1.01 times diff repository size
     – ≤ 1.04 times size of largest version
Compressed
• Archive size is between 0.94 and 1 times compressed diff repository size
• gzip - unix compression tool
• XMill - XML compression tool




         ~ 5 years of UniProt

[Figure: size (bytes x 10^6) against version. Curves: archive, inc diff, version, compressed inc diff (gzip(inc diff)), compressed archive (XMill(archive))]

Uncompressed
• Archive size is
     – ≤ 1.08 times diff repository size
     – ≤ 1.92 times size of largest version
Compressed
• Archive size is between 0.59 and 1 times compressed diff repository size

       Snapshots are immediate. 
       Longitudinal/temporal queries are also easy

       Factbook [1990‐2006]
         Andorra
         China
         Liechtenstein
           Demography
             Population
               [1990]   28,292
               [1991]   28,476
               …
               [2006]   34,247
           Economy

           Plot, by year, the population of 
           Liechtenstein since 1990


                    A Working System


   • Implemented by Heiko Müller
   • For scale, we require external sorting of large XML 
     files
       • Designed and implemented by Ioannis Koltsidas,
         Heiko Müller and Stratis Viglas
   • Has a simple temporal query language
   • Experimented with recent (HTML) versions of the CIA 
     World Factbook


                    What the archive looks like
<T t="2002-2007">
    <FACTBOOK>
        <COUNTRY>
            <NAME>Afghanistan</NAME>
            <CATEGORY>
                <NAME>Communications</NAME>
                <PROPERTY>
                    <NAME>Internet users</NAME>
                    <TEXT>
                        <T t="2004-2005">1,000 (2002)</T>
                        <T t="2006-2007">30,000 (2005)</T>
                        <T t="2002-2003">NA</T>
                    </TEXT>
                </PROPERTY>
                <PROPERTY>
                    <NAME>Radios</NAME>
                    <TEXT>167,000 (1999)</TEXT>
                </PROPERTY>
                <PROPERTY>
                    <NAME>Telephones - main lines in use</NAME>
                    <TEXT>
                        <T t="2006">100,000 (2005)</T>
                        <T t="2007">280,000 (2005)</T>
                        <T t="2002-2003">29,000 (1998)</T>
                        <T t="2004-2005">33,100 (2002)</T>
…
                   How did the population of China
                     change from 2002‐2007?
<T t="2002-2007">
     <FACTBOOK>
         <COUNTRY>
             <CATEGORY>
                 <PROPERTY>
                     <NAME>Population</NAME>
                     <TEXT>
                         <T t="2002">1,284,303,705   (July   2002   est.)</T>
                         <T t="2003">1,286,975,468   (July   2003   est.)</T>
                         <T t="2004">1,298,847,624   (July   2004   est.)</T>
                         <T t="2005">1,306,313,812   (July   2005   est.)</T>
                         <T t="2006">1,313,973,713   (July   2006   est.)</T>
                         <T t="2007">1,321,851,888   (July   2007   est.)</T>
                     </TEXT>
                 </PROPERTY>
             </CATEGORY>
         </COUNTRY>
     </FACTBOOK>
</T>
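
A longitudinal query of this kind can be sketched over the archive XML with Python's standard `xml.etree.ElementTree`. This is only an illustration, not the query language of the working system: the helper names `years` and `series` are assumptions, timestamps are assumed to be a year or a year range, and a `<TEXT>` without nested `<T>` elements (one that inherits its timestamp) is ignored.

```python
import xml.etree.ElementTree as ET

ARCHIVE = """
<T t="2002-2007">
  <FACTBOOK><COUNTRY><CATEGORY>
    <PROPERTY>
      <NAME>Population</NAME>
      <TEXT>
        <T t="2002">1,284,303,705 (July 2002 est.)</T>
        <T t="2003">1,286,975,468 (July 2003 est.)</T>
        <T t="2004">1,298,847,624 (July 2004 est.)</T>
        <T t="2005">1,306,313,812 (July 2005 est.)</T>
        <T t="2006">1,313,973,713 (July 2006 est.)</T>
        <T t="2007">1,321,851,888 (July 2007 est.)</T>
      </TEXT>
    </PROPERTY>
  </CATEGORY></COUNTRY></FACTBOOK>
</T>
"""

def years(t_attr):
    """Expand a timestamp such as "2004-2005" or "2007" into a list of years."""
    first, _, last = t_attr.partition("-")
    return list(range(int(first), int(last or first) + 1))

def series(archive_xml, property_name):
    """Return {year: value} for every timestamped value of the named property."""
    root = ET.fromstring(archive_xml.strip())
    result = {}
    for prop in root.iter("PROPERTY"):
        text = prop.find("TEXT")
        if prop.findtext("NAME") != property_name or text is None:
            continue
        for stamped in text.iter("T"):
            for year in years(stamped.get("t")):
                # Collapse the layout whitespace inside the value.
                result[year] = " ".join((stamped.text or "").split())
    return dict(sorted(result.items()))

population_by_year = series(ARCHIVE, "Population")
```

Each key of `population_by_year` is a year and each value is the text the Factbook carried in that year, ready to be plotted.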




    How did land area of countries change in 2002‐2007?
<T t="2002-2007">
  <FACTBOOK KEY="">
    …
    <COUNTRY KEY="NAME Austria">
      <CATEGORY KEY="NAME Geography">
        <PROPERTY KEY="NAME Area">
          <SUBPROP>
            <NAME>land</NAME>
            <TEXT>
              <T t="2004-2007">82,444 sq km</T>
              <T t="2002-2003">82,738 sq km</T>
             </TEXT>
           </SUBPROP>
         </PROPERTY>
       </CATEGORY>
    </COUNTRY>
    …
    <COUNTRY KEY="NAME France">
      <CATEGORY KEY="NAME Geography">
        <PROPERTY KEY="NAME Area">
          <SUBPROP>
            <NAME>land</NAME>
            <TEXT>
              <T t="2002-2006">545,630 sq km</T>
              <T t="2007">640,053 sq km; 545,630 sq km (metropolitan France)</T>
             </TEXT>
…
          What are the differences between the factbooks
                 on 21/08/2007 and 10/09/2007?
<T t="21/08/2007-10/09/2007">
    <CIAWFB KEY="">
        <COUNTRY KEY="NAME Afghanistan">
            <CATEGORY KEY="NAME Communications">
                <PROPERTY KEY="NAME Internet users">
                    <T t="21/08/2007">
                        <TEXT>30,000 (2005)</TEXT>
                    </T>
                    <T t="10/09/2007">
                        <TEXT>535,000 (2006)</TEXT>
                    </T>
                </PROPERTY>
                <PROPERTY KEY="NAME Telephones - mobile cellular">
                    <T t="21/08/2007">
                        <TEXT>1.4 million (2005)</TEXT>
                    </T>
                    <T t="10/09/2007">
                        <TEXT>2.52 million (2006)</TEXT>
                    </T>
…




   http://homepages.inf.ed.ac.uk/hmueller/xarch/download.html


Heiko Müller’s Xarch
• Examples of use with
   ⁻ Ontologies
   ⁻ XML files
   ⁻ Relational databases
• Automatically converts 
  RDBs into  XML
• Efficiently extracts 
  snapshots
• Simple temporal query 
  language 



                Provenance – a huge issue
•      Where did this data come from?
•      How did it get here?
•      How was it constructed?
•      . . .

Two schools of research:
• Workflow (coarse‐grain) provenance – a complete 
  record of how some large scientific 
  analysis/simulation was performed.
• Data (fine‐grain) provenance – a record of how some small piece of 
  data (in a larger database) was produced.


Data 
provenance: an 
example

   Copy‐paste, or
   <cntrl>C <cntrl>V




 113.479 (2001)          http://www.corfunext.com/corfu_geography.htm

 107,879 (as of 2001 )   http://en.wikipedia.org/wiki/Corfu ***

 109,512                 www.agni.gr/

 105,043                 http://www.merriam‐webster.com/dictionary/corfu

 115,200 (2003 est)      http://encyclopedia.farlex.com/Corfu

 97,102 in 1981          http://geography.howstuffworks.com/europe/corfu.htm

 107,880                 http://catalogue.horse21.net/greece+hotels/corfu+hotels/hotels5/luxury

                  “Where provenance”

Possible explanations of how something was copied:

 This data item was extracted from location L1 in 
 document D1 and placed in location L2 in document D2

 or

 This data item was extracted from database D1 by 
 query Q1 and placed in database D2 by update U2

 (or some combination of the two)


                   Where Provenance

[Figure: the UniProt entry 11SB_CUCMA (P13744) shown earlier, with the following lines picked out]

DE   11S GLOBULIN BETA SUBUNIT PRECURSOR.
OS   CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC   VIOLALES; CUCURBITACEAE.
CC   -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
CC   -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A
CC       BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
CC       DISULFIDE BOND.
CC   -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
FT   CHAIN        22     480       11S GLOBULIN BETA SUBUNIT.
FT   CHAIN        22     296       GAMMA CHAIN (ACIDIC).
FT   CHAIN       297     480       DELTA CHAIN (BASIC).
FT   MOD_RES      22      22       PYRROLIDONE CARBOXYLIC ACID.
FT   DISULFID    124     303       INTERCHAIN (GAMMA-DELTA) (POTENTIAL).

Where does this information come from? Which curator?
Or was it the cited papers?
Was it copied from some other DB?
              Copy‐paste model of curated DBs
   Curated databases are not views!!




(a) A biologist copies some UniProt records into her DB.
(b) She fixes entries so that UniProt PTMs are not confused with hers.
(c) She copies in some publication details from OMIM.
(d) She corrects a mistake in a PubMed publication number.
[B., Chapman, Cheney, SIGMOD ’06]

                A very simple copy‐paste language
                        (uses a “deterministic” tree model)
(1) delete c5 from T;
(2) copy S1/a1/y into T/c1/y;
(3) insert {c2 : {}} into T;
(4) copy S1/a2 into T/c2;
(5) insert {y : {}} into T/c2;
(6) copy S2/b3/y into T/c2/y;
(7) copy S1/a3 into T/c3;
(8) insert {c4 : {}} into T;
(9) copy S2/b2 into T/c4;
(10) insert {y : 12} into T/c4;
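
The ten steps above can be replayed with a small interpreter over nested dicts that logs one provenance record per update ("complete" provenance). This is a sketch under stated assumptions: the contents of S1, S2 and T are hypothetical (the original trees are in a figure not reproduced here), and the `Session` class and its record layout are illustrative rather than the cited paper's implementation.

```python
from copy import deepcopy

def get(db, path):
    """Follow a slash-separated path; the path to each item is unique."""
    node = db
    for label in path.split("/"):
        node = node[label]
    return node

def put(db, path, value):
    *parents, last = path.split("/")
    get(db, "/".join(parents))[last] = value

class Session:
    def __init__(self, db):
        self.db = db
        self.prov = []  # (step, op, target path, source path or None)

    def _log(self, op, target, source=None):
        self.prov.append((len(self.prov) + 1, op, target, source))

    def insert(self, label, value, into):
        put(self.db, f"{into}/{label}", deepcopy(value))
        self._log("I", f"{into}/{label}")

    def delete(self, label, frm):
        del get(self.db, frm)[label]
        self._log("D", f"{frm}/{label}")

    def copy_into(self, source, target):
        put(self.db, target, deepcopy(get(self.db, source)))
        self._log("C", target, source)

# Hypothetical source and target trees.
db = {
    "S1": {"a1": {"x": 1, "y": 2}, "a2": {"x": 3}, "a3": {"x": 7, "y": 6}},
    "S2": {"b2": {"x": 4}, "b3": {"x": 5, "y": 8}},
    "T": {"c1": {"x": 1, "y": 3}, "c5": {"x": 9, "y": 7}},
}

s = Session(db)
s.delete("c5", "T")                   # (1)
s.copy_into("S1/a1/y", "T/c1/y")      # (2)
s.insert("c2", {}, "T")               # (3)
s.copy_into("S1/a2", "T/c2")          # (4)
s.insert("y", {}, "T/c2")             # (5)
s.copy_into("S2/b3/y", "T/c2/y")      # (6)
s.copy_into("S1/a3", "T/c3")          # (7)
s.insert("c4", {}, "T")               # (8)
s.copy_into("S2/b2", "T/c4")          # (9)
s.insert("y", 12, "T/c4")             # (10)
```

After the run, `s.prov` holds ten records, one per update, which is the storage cost the question above is about.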




           How costly is it to record all this?
                    How to reduce space

 • Complete provenance:  Record every update.
 • Transactional provenance: Record the links at the end of 
   some user‐defined transaction (sequence of updates)
 • Hierarchical (inferred) provenance.  Only record a link if it 
   cannot be inferred from the provenance of a higher node




Taken together these provide a substantial saving on 
storage. Overhead comparable with the size of the DB 
in some realistic simulations
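
The hierarchical idea can be sketched as follows, assuming provenance is a map from target path to source path with one entry per copied node. The functions `hierarchical` and `lookup` and the example paths are illustrative assumptions. The sketch covers copies only; a real store would also need records for inserts so that they override what a parent implies.

```python
def hierarchical(full):
    """Keep a link only if it cannot be inferred from the parent's link."""
    kept = {}
    for target, source in full.items():
        parent, _, label = target.rpartition("/")
        inferred = f"{full[parent]}/{label}" if parent in full else None
        if inferred != source:
            kept[target] = source
    return kept

def lookup(kept, target):
    """Recover a node's source from the nearest ancestor that has a stored link."""
    suffix = []
    node = target
    while node:
        if node in kept:
            return "/".join([kept[node], *reversed(suffix)])
        node, _, label = node.rpartition("/")
        suffix.append(label)
    return None

# Hypothetical links: T/c2 was copied from S1/a2, then T/c2/y was
# overwritten by a copy from S2/b3/y.
full = {
    "T/c2": "S1/a2",
    "T/c2/x": "S1/a2/x",
    "T/c2/y": "S2/b3/y",
}
kept = hierarchical(full)
```

Here `kept` drops the link for `T/c2/x`, because it follows from the link for `T/c2`, while the link for `T/c2/y` stays because it points somewhere the parent does not imply.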

        Query languages and where provenance
  R before    R after
  A  B        A  B
  1  3        1  5
  6  7        6  7

Three ways to get from "before" to "after":

(select A, 5 as B from R where A = 1)
union
(select * from R where A <> 1)

delete from R where A = 1; 
insert into R values (1,5)

update R set B = 5 where A = 1 

[B., Cheney, Vansummeren, TODS 33,4, 2008]
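
A quick check with Python's built-in `sqlite3` module, as a sketch, that the three formulations leave the same contents behind. SQLite is an assumption here and not part of the cited work; the parentheses around the two `select`s are dropped because SQLite does not accept them in a compound query. The database cannot tell the three apart afterwards. What differs is which values count as copied from the old R and which count as newly created, and that is the question where-provenance asks.

```python
import sqlite3

def fresh_r():
    """Return a connection holding R = {(1, 3), (6, 7)}."""
    con = sqlite3.connect(":memory:")
    con.execute("create table R (A integer, B integer)")
    con.executemany("insert into R values (?, ?)", [(1, 3), (6, 7)])
    return con

def contents(con):
    return sorted(con.execute("select A, B from R").fetchall())

# 1. Query: build the new relation from the old one.
con = fresh_r()
by_query = sorted(con.execute(
    "select A, 5 as B from R where A = 1 "
    "union "
    "select * from R where A <> 1"
).fetchall())

# 2. Delete the old tuple, then insert a new one.
con = fresh_r()
con.execute("delete from R where A = 1")
con.execute("insert into R values (1, 5)")
by_delete_insert = contents(con)

# 3. Update the tuple in place.
con = fresh_r()
con.execute("update R set B = 5 where A = 1")
by_update = contents(con)
```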
“The ship wherein Theseus and the youth of Athens returned had thirty
oars, and was preserved by the Athenians down even to the time of
Demetrius Phalereus, for they took away the old planks as they
decayed, putting in new and stronger timber in their place, insomuch
that this ship became a standing example among the philosophers, for
the logical question of things that grow; one side holding that the
ship remained the same, and the other contending that it was not the
same.” Plutarch, Vita Thesei, 22‐:




 Other forms of provenance in query languages
• Why‐provenance:  why is a tuple in the output, or what parts 
  of the input “contributed to” the tuple? [Widom et al]
• How‐provenance:  how (by what process) was this tuple 
  constructed. [Tannen et al]

 Large, heterogeneous source   →   Complex program or process   →   Database

 Small part of source          →   Simpler program/process      →   “Piece” of data: data value, tuple, etc.



                  Taken together, these are the “explanation”.
                      Workflow provenance




Taken from [Cohen, et al DILS 2006]
• Each step S1. . . S4 is itself a workflow.
• How does one record an “enactment” of the workflow?
• How much “context” does one record? 
    – from people
    – from databases that change
• Recent attempts to produce a general model
    – Open Provenance Model [Moreau et al. 2007]
    – Petri Net + Complex Object [Hidders et al., Inf. Syst. 2008]
              Provenance is a very general issue

 • Intrinsic to data quality.
 • It is starting to be used in several areas of CS:
       –   Semantics of update languages.
       –   Probabilistic databases
       –   Data integration
       –   Debugging schema transformations
       –   File/data synchronization
       –   Program debugging (program slicing)
       –   Security
 • The fundamental problem is finding the right 
   model/models – can we combine data and 
   workflow models?
       Annotation – closely related to provenance

• Much of the activity of curators is the annotation of 
  existing data.
• When we copy that data, we should also copy its 
  annotations
• The propagation of annotation follows (where‐) 
  provenance
• But the story is more complicated because we often 
  annotate views


       The Distributed Annotation Server (DAS)




                            Annotating Databases
Polygen [Wang & Madnick, VLDB 1990], DBNotes [Bhagwat et 
al., VLDB 2004]. The concern is the propagation of annotations 
from views to source and back. Again, there is an interesting theory.

   Source table:
                            Guinness      Stout       5.0      Eire
                            Heineken      Pilsner     5.0      Netherlands
                            Old Jock      Ale         6.7      Scotland
                            Guinness      Stout       7.5      Nigeria
                            Fischer       Blonde      6.0      France

   View π (projection):
                            Guinness      Stout
                            Heineken      Pilsner
                            Old Jock      Ale
                            Fischer       Blonde

   View σ (selection):
                            Guinness      Stout       5.0      Eire
                            Heineken      Pilsner     5.0      Netherlands
                            Fischer       Blonde      6.0      France

   Annotations shown on the figure: “Stijn says this is not a beer” and
   “Not strong” (their exact placement on the source table and the two
   views is not recoverable from the extracted text).

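
To make the propagation question concrete, the sketch below records, for every cell of a view, which source cells it was copied from, and uses that record to carry annotations forward (source to view) and backward (view to source). It is a minimal sketch, not Polygen or DBNotes: the column names, the selection predicate (strength at most 6.0) and the choice of which cell carries the annotation are assumptions made for illustration.

```python
from collections import defaultdict

# Rows from the slide; the column names are assumed for illustration.
COLS = ("name", "type", "abv", "country")
BEERS = [
    ("Guinness", "Stout", 5.0, "Eire"),
    ("Heineken", "Pilsner", 5.0, "Netherlands"),
    ("Old Jock", "Ale", 6.7, "Scotland"),
    ("Guinness", "Stout", 7.5, "Nigeria"),
    ("Fischer", "Blonde", 6.0, "France"),
]


def project(rows, keep):
    """Duplicate-eliminating projection that remembers where each cell came from.

    Returns (view_rows, where); where[(view_row, col)] is the set of source
    cells (source_row, col) copied into that view cell.
    """
    idx = [COLS.index(c) for c in keep]
    view, where, seen = [], defaultdict(set), {}
    for r, row in enumerate(rows):
        out = tuple(row[i] for i in idx)
        if out not in seen:
            seen[out] = len(view)
            view.append(out)
        for c in keep:
            where[(seen[out], c)].add((r, c))
    return view, where


def select(rows, pred):
    """Selection that remembers where each view cell came from."""
    view, where = [], defaultdict(set)
    for r, row in enumerate(rows):
        if pred(dict(zip(COLS, row))):
            for c in COLS:
                where[(len(view), c)].add((r, c))
            view.append(row)
    return view, where


def forward(annotations, where):
    """Carry annotations on source cells to the view cells that copy them."""
    out = defaultdict(set)
    for view_cell, sources in where.items():
        for src in sources:
            out[view_cell] |= annotations.get(src, set())
    return {cell: notes for cell, notes in out.items() if notes}


def backward(view_cell, where):
    """Source cells that an annotation placed on view_cell would reach."""
    return sorted(where[view_cell])


names, w_pi = project(BEERS, ("name", "type"))
weak, w_sigma = select(BEERS, lambda b: b["abv"] <= 6.0)

# Hypothetical annotation on the name of the Nigerian Guinness (source row 3).
notes = {(3, "name"): {"Stijn says this is not a beer"}}
print(forward(notes, w_pi))          # reaches the single Guinness row of the projection
print(forward(notes, w_sigma))       # row 3 is not selected, so nothing is carried
print(backward((0, "name"), w_pi))   # both Guinness rows of the source
```

The projection merges the two Guinness rows, so an annotation placed on the view has two possible homes in the source; this is one reason the slide calls the story more complicated when views are annotated.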
             How do you cite something in a database?

  Many scientific databases ask you to cite them, but:
  • they don’t tell you how, or
  • they tell you to give the URL, or
  • they tell you to cite a paper about the database.

Nutrition Education for Diverse Audiences [Internet]. Urbana (IL): University of 
Illinois Cooperative Extension Service, Illinet Department; [updated 2000 Nov 
28; cited 2001 Apr 25]. Diabetes mellitus lesson; [about 1 screen]. Available 
from http://www.aces.uiuc.edu/~necd/inter2_search.cgi?ind=854148396

NLM Recommended Formats for Bibliographic Citation.
Internet Supplement. NLM Technical report Bethesda, MD 20894, July 2001.


                What is a citation?

  Bard JB and Davies JA. Development, Databases and 
  the Internet. Bioessays. 1995 Nov; 17(11):999‐1001.
  [Location and descriptive information]


  Ann. Phys., Lpz 18 639‐641 
  Nature, 171,737‐738
  (We often want more than location)



              Automatically generating citations
A rule:
        { DB=IUPHAR, Version=$v, Family=$f, Receptor=$r, Contributors=$a,
             Editor=$e, Date=$d, DOI=$i} 
             ←
          /Root[ ]/Version[Number=$’v, Editor=$?e, DOI=$.i, Date=$.d] /Data[ ] 
             /Family[FamilyName=$’f, Contributor‐list/Contributor=$+a] 
             /Receptor[ReceptorName=$’r]


What gets generated (example):


   {   DB=IUPHAR, Version=11, Family=Calcitonin, 
       Receptor=CALCR, Contributors={Debbie Hay, David R. Poyner},
       Editor=Tony Harmar, Date=Jan 2006, DOI=10.1234  }
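
The idea behind the rule can be sketched in a few lines: follow the path from the root down to one receptor and collect the citation fields along the way. The XML below is a hypothetical document shaped like the path in the rule (the real IUPHAR schema may differ), the function name `cite` is made up, and the rule's variable markers ($’, $?, $., $+) are not interpreted; the sample values are those of the generated example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML shaped like the path in the rule; not the real IUPHAR schema.
SAMPLE = """
<Root>
  <Version Number="11" Editor="Tony Harmar" DOI="10.1234" Date="Jan 2006">
    <Data>
      <Family FamilyName="Calcitonin">
        <Contributor-list>
          <Contributor>Debbie Hay</Contributor>
          <Contributor>David R. Poyner</Contributor>
        </Contributor-list>
        <Receptor ReceptorName="CALCR"/>
      </Family>
    </Data>
  </Version>
</Root>
"""


def cite(root, version, family, receptor):
    """Build a citation record for one receptor by walking down the hierarchy."""
    v = root.find(f"./Version[@Number='{version}']")
    if v is None:
        raise KeyError(f"no version {version!r}")
    f = v.find(f"./Data/Family[@FamilyName='{family}']")
    if f is None:
        raise KeyError(f"no family {family!r}")
    r = f.find(f"./Receptor[@ReceptorName='{receptor}']")
    if r is None:
        raise KeyError(f"no receptor {receptor!r}")
    return {
        "DB": "IUPHAR",
        "Version": v.get("Number"),
        "Family": f.get("FamilyName"),
        "Receptor": r.get("ReceptorName"),
        # Every contributor listed for the family is credited.
        "Contributors": [c.text for c in f.iter("Contributor")],
        "Editor": v.get("Editor"),
        "Date": v.get("Date"),
        "DOI": v.get("DOI"),
    }


citation = cite(ET.fromstring(SAMPLE), "11", "Calcitonin", "CALCR")
print(citation)
```

Some fields come from the version level, some from the family and one from the receptor itself, which is why a citation to a single entry needs more than its location.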



   Other topics: Data quality and data cleaning

• Published data often looks clean but is intrinsically 
  messy
       – “Dead” fields in the underlying data
       – Multiple syntactic conventions
       – Abuse of / confusion over formats & schema
• Human errors require human correction
       – Automate error detection rather than error correction
• Cleaning is an essential prerequisite in any 
  integration or preservation task.
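
One way to read "automate error detection rather than error correction" is a checker that only reports fields whose values follow more than one syntactic convention and leaves the decision to a curator. The sketch below does that; the list of "shapes" and the sample records are made up for illustration (loosely modelled on fields such as radioactive and affinity) and would need tuning for real data.

```python
import re
from collections import defaultdict

# Illustrative syntactic "shapes"; a real checker would be tuned to the data.
SHAPES = [
    ("YES/NO", re.compile(r"(YES|NO)")),
    ("Y/N", re.compile(r"[YN]")),
    ("number", re.compile(r"\d+(\.\d+)?")),
    ("range", re.compile(r"\d+(\.\d+)?-\d+(\.\d+)?")),
    ("blank", re.compile(r"")),
]


def shape(value):
    """Name of the first shape that matches the whole value, else 'other'."""
    text = value.strip()
    for name, pattern in SHAPES:
        if pattern.fullmatch(text):
            return name
    return "other"


def suspicious_fields(records):
    """Report fields that use more than one shape; nothing is changed.

    Returns {field: {shape: [record indexes]}} so that a curator can look at
    the offending records and decide what, if anything, is wrong.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for i, record in enumerate(records):
        for field, value in record.items():
            seen[field][shape(value)].append(i)
    return {f: dict(s) for f, s in seen.items() if len(s) > 1}


# Made-up records for illustration.
records = [
    {"radioactive": "NO", "affinity": "9.1-8.8"},
    {"radioactive": "YES", "affinity": "9.1-8.8"},
    {"radioactive": "Y", "affinity": "8.5"},
]
for field, shapes in suspicious_fields(records).items():
    print(field, shapes)
```

Both fields are flagged here. The mixed YES/Y values are probably a mistake, while a single number next to a range may be perfectly legitimate; the checker cannot tell, which is the point of leaving correction to a person.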

           Other topics: Evolution of Structure

• Curated DBs evolve from humble origins.  Schemas 
  are often wrong; they are
       – designed by people who don’t understand schemas
       – designed before the domain is fully understood
• Do ontologies help (you can build an ontology 
  without worrying much about the schema) or do 
  they defer the problem and make it worse?




         The larger (economic and social) issues

• Who will archive/curate curated databases?
• Should they be open‐access?
       – who pays for their maintenance?
• What are the legal/IP issues?




             A case study: IUPHAR database
           (curated by Tony Harmar and team)
•   “Standard” curated database
•   Labour‐intensive (hundreds of contributors)
•   Valuable (supported by drug companies)
•   Simple, clean structure – as seen by users
[Figure labels: IUPHAR, DCC, 50m]
            We wanted to use our archiver

• Our first task was to convert the database into a hierarchical 
  structure (following the web presentation) so that we could 
  archive it.
• We used the Prata XML (Fan et al) publishing software
• This had some unexpected benefits…




       …
            <transduction>
              <secondary>Y</secondary>
              <transcomments/>
              <transcites/>
            </transduction>
           </transductions>
           <ligandTypes>
            <ligandType>
              <typeName>Agonist</typeName>
              <ligandComment>empty</ligandComment>
              <ligands>
               <ligand>
                 <radioactive>NO</radioactive>
                 <endogenous>YES</endogenous>
                 <alternative>NO</alternative>
                 <ligandSpeices>Human</ligandSpeices>
                 <ligandName>oxytocin</ligandName>
                 <affinity>9.1‐8.8</affinity>
                 <ligandAction>Full Agonist</ligandAction>
                 <ligandUnits>p<i>K</i><sub>d</sub></ligandUnits>
                 <ligandCites>9</ligandCites>
               </ligand>
               <ligand>
                 <radioactive>YES</radioactive>
                 <endogenous>NO</endogenous>
                 <alternative>NO</alternative>
                 <ligandSpeices>Human</ligandSpeices>
                 <ligandName>[<sup>3</sup>H]‐oxytocin</ligandName>
                 <affinity>9.1‐8.8</affinity>
                 <ligandAction>Full Agonist</ligandAction>
                 <ligandUnits>p<i>K</i><sub>d</sub></ligandUnits>
                 <ligandCites>9, 42</ligandCites>
               </ligand>
   …



• We can preserve all versions of the data (as intended)
• We can generate static web pages (less software, more 
  efficient)
• We can make the database citable
• Tony can trace the history of entries
• Tony can generate an old‐fashioned book (yes, he wants to do 
  this!)
• We have a “community model” for data exchange
• The data got cleaned up in the process
• The representation information (required by archivists) is 
  greatly simplified
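
Two of the benefits above, keeping all versions and tracing the history of an entry, can be illustrated with a small version‐merging store. This is a minimal sketch and not the archiver used in the talk: it assumes each snapshot has been flattened to a dictionary from key paths to values, that versions are numbered consecutively from 1, and the paths and values in the example are made up.

```python
def add_version(archive, version, snapshot):
    """Merge one snapshot into the archive.

    archive maps a key path to a list of runs {"first", "last", "value"};
    an unchanged value extends the latest run instead of being stored again.
    """
    for path, value in snapshot.items():
        runs = archive.setdefault(path, [])
        if runs and runs[-1]["value"] == value and runs[-1]["last"] == version - 1:
            runs[-1]["last"] = version
        else:
            runs.append({"first": version, "last": version, "value": value})


def history(archive, path):
    """All values an entry has had, with the versions in which each was current."""
    return [(r["first"], r["last"], r["value"]) for r in archive.get(path, [])]


def snapshot_at(archive, version):
    """Reconstruct the flattened database as it stood at a given version."""
    return {
        path: run["value"]
        for path, runs in archive.items()
        for run in runs
        if run["first"] <= version <= run["last"]
    }


# Made-up snapshots: one entry changes in version 3, another appears in version 2.
path_a = ("receptor-1", "ligand-1", "affinity")
path_b = ("receptor-1", "ligand-2", "affinity")
v1 = {path_a: "9.1-8.8"}
v2 = {path_a: "9.1-8.8", path_b: "8.5"}
v3 = {path_a: "9.0", path_b: "8.5"}

archive = {}
for number, snap in enumerate([v1, v2, v3], start=1):
    add_version(archive, number, snap)

print(history(archive, path_a))
print(snapshot_at(archive, 2) == v2)
```

Because entries are identified by their key path in the hierarchy, the history of an entry is a simple lookup and any past version can be rebuilt on demand.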




Selected pages from the book – generated by a 100‐line style sheet

Title page:
      The IUPHAR Receptor Database
      Tony Harmar and Ed Rosser
      © Draft date August 24, 2006

Contents page (ii):
          5.1.8  BENZODIAZEPINE INSENSITIVITY                      19
          5.1.9  BENZODIAZEPINE INSENSITIVITY                      20
          5.1.10 THE ρ-CONTAINING RECEPTORS                        20
          5.1.11 OTHER MODULATORY SITES                            20
          5.1.12 OTHER MODULATORY SITES                            21
          5.1.13 CONCLUDING REMARKS                                21
      5.2 References                                               21

  6 Prokineticin receptors                                         25
      6.1 Introduction                                             25
          6.1.1 GENERAL                                            25
      6.2 References                                               25

  7 Prolactin-releasing peptide receptor                           27
      7.1 References                                               27

  8 Acetylcholine receptors (nicotinic)                            31
      8.1 Introduction                                             31
          8.1.1 GENERAL                                            31
          8.1.2 FUNCTIONAL ROLES                                   31
          8.1.3 PHARMACOLOGICAL CHARACTERISTICS                    32
          8.1.4 RECEPTOR SUBUNIT ASSEMBLY                          32
          8.1.5 CONCLUDING REMARKS                                 32
      8.2 2*                                                       32
      8.3 6*                                                       32
      8.4 9*                                                       33
      8.5 1*                                                       33
      8.6 3*                                                       33
      8.7 4*                                                       33
      8.8 7*                                                       33
      8.9 References                                               33

  9 P2X receptors                                                  37
      9.1 Introduction                                             37
          9.1.1 GENERAL                                            37
          9.1.2 OVERALL STRUCTURE OF THE P2X RECEPTOR FAMILY       37
          9.1.3 THE PORE                                           37
          9.1.4 STOICHIOMETRY                                      38
          9.1.5 HETEROPOLYMERISATION OF P2X SUBUNITS               38
          9.1.6 OPERATIONAL CHARACTERISTICS                        39
      9.2 P2X2                                                     39
      9.3 P2X4                                                     39
      9.4 P2X5                                                     39
      9.5 P2X6                                                     39
      9.6 P2X7                                                     40
      9.7 References                                               40

[Book page 205]

Chapter 27

Endothelin receptors

Contributors: Anthony P. Davenport, Théophile Godfraind, Eliot H. Ohlstein, Robert
R. Ruffolo, Pedro D’Orléans-Juste and Janet J. Maguire

27.1      Introduction
27.1.1     GENERAL
In mammals, the endothelin (ET) family comprises three endogenous isoforms, ET-1, ET-2 and
ET-3 (refs. 1,2), and the receptors that mediate their effects have been classified as the endothelin
ETA and ETB receptors.

27.1.2     RECEPTOR STRUCTURE
The two endothelin receptors have been isolated and cloned from mammalian tissues (refs. 1-9). The
structures of the mature receptors have been deduced from the nucleotide sequences of the cDNAs.
The encoded proteins contain seven stretches of 20-27 hydrophobic amino acid (aa) residues in both
receptors. This structure is consistent with a seven-transmembrane domain (7TM), G protein-
coupled receptor belonging to the rhodopsin-type receptor superfamily. Both receptors have an
N-terminal signal sequence, which is rare among heptahelical receptors, with a relatively long
extracellular N-terminal portion preceding the first transmembrane domain. There are two separate
ligand-interaction sub-domains on each endothelin receptor. The extracellular loops, particularly
between TM4-TM6, determine selectivity.

27.1.3     RECEPTOR SIGNALLING
Endothelin is able to activate a number of signal transduction processes including phospholipase
(PL) A2, PLC and PLD, as well as cytosolic protein kinase activation. The receptors are able to
couple to various types of G protein. Both ETA and ETB receptors expressed in COS7 cells were
shown to couple to Gq, G11, Gs and Gi2, suggesting that endothelin receptors may simultaneously
stimulate multiple effectors via several types of G protein (ref. 10). ETA receptors expressed in CHO
cells couple to Gq and Gs but not Gi. ETB receptors couple to Gq and Gi. Coupling to Gs
occurs through the second and third intracellular loops of the receptor. In order to couple with
Gi through the third intracellular loop, palmitoylation of the C-terminal cysteine residues and C-
terminus are necessary, whereas to couple with Gq only palmitoylation of the C-terminal domain
is important (refs. 11,12).

[Book page 208: CHAPTER 27. ENDOTHELIN RECEPTORS]

Structural Information
 Species    TM      AA     Accession Number     Chromosomal Location        Gene Name           References
 Human      7       427    P25101               4q31.22                     EDNRA               1,5,57,58
 Rat        7       426    P26684               19q11                       Ednra               4
 Mouse      7       427    Q61614               8?                          Ednra               59

Functional Assays

 Isolated ring preparations of human coronary artery
 Species:                  Human
 Tissue:                   Vasoconstriction
 Response measured:        Coronary artery
 References:               60

 Isolated ring preparations of rat thoracic aorta
 Species:                  Rat
 Tissue:                   Vasoconstriction
 Response measured:        Aorta
 References:               77-79

Antagonist Ligands
 R’tive    Endog.   Alt.    Species   Name              Affinity   Action               Units     References
 YES       NO       YES     Human     [125I]PD164333    9.8-9.6    Antagonist           pKd       47
 YES       NO       YES     Human     [125I]PD151242    9.1-9      Antagonist           pKd       23
 YES       NO       YES     Rat       [3H]BQ123         8.5        Antagonist           pKd       21
 NO        NO       YES     Human     A127742           10.5       Antagonist           pIC50     29
 NO        NO       YES     Human     PD156707          9.2-8.7    Antagonist           pIC50     70,71
 NO        NO       YES     Human     SB234551          9          Antagonist           pIC50     26
 NO        NO       YES     Human     FR139317          7.9-7.3    Inverse Agonist      pIC50     60
 NO        NO       YES     Human     BQ123             7.4-6.4    Antagonist           pIC50     60

Agonist Ligands
 R’tive    Endog.   Alt.    Species   Name                    Affinity   Action           Units     References
 YES       NO       NO      Human     [125I]ET-1              10.5-9.1   Full Agonist     pKd       62-64
 YES       NO       YES     Human     [18F]ET-1               8.2        Full Agonist     pKd       67
 YES       NO       YES     Human     [125I]ET-2              9.1-8.9    Full Agonist     pKd       68
 YES       NO       YES     Human     [125I]sarafotoxin S6B   9.8-9.6    Full Agonist     pKd       68
 NO        YES      YES     Human     ET-2                    8.2        Full Agonist     pIC50     60
 NO        NO       YES     Human     sarafotoxin S6b         8.1-7.5    Full Agonist     pIC50     60
 NO        YES      YES     Human     ET-1                    8.5-7.8    Full Agonist     pIC50     60

27.4       References
 1 S.Kimura, M.Yanagisawa, T.Masaki, K.Goto, H.Kurihara, Y.Tomobe, M.Kobayashi, Y.Mitsui
   and Y.Yazaki Nature, 332, 411 - 415.
 2 S.Kimura, M.Yanagisawa, Y.Kasuya, K.Goto, T.Masaki, A.Inoue and T.Miyauchi Proc. Natl.
   Acad. Sci. U.S.A., 86, 2863 - 2867.


                                      Our library will “host” the book, but 
                                      not the database!
              Centralized vs. distributed publishing
   20th century libraries provided robust, distributed dissemination and 
   preservation of reference material




Valuable information was lost in earlier “data centers”. Is this still happening?



   Replication and distribution have always been the best guarantee of 
   preservation. We should do the same for curated databases – a database 
   LOCKSS?


          Many of the issues are non “technical”

• A good economic model for sustainability
       – Open access works for journal papers
       – Can it work for curated DBs?  They require long‐term 
         support.  And people who write reference manuals 
         sometimes expect to make money out of them.
• Intellectual property in curated databases is a 
  nightmare
       – legislation still largely based on the notion of copying.
• We can still help by providing good models of the 
  processes in curating and publishing databases


                         Summary
• Study of database curation and preservation is producing new 
  problems for databases and digital libraries
• We need to bring curated databases into the scope of 
  libraries and other archival institutions


                                The problems of curated 
                                databases will not be solved 
                                until we unify the roles of 
                                database curator and digital 
                                librarian/archivist

