Archiving Biological Databases

Wang-Chiew Tan
UC Santa Cruz

Archiving: Ubiquitous Problem

• Historians, Librarians (The Internet Archive)
• Software developers (revert to stable and working version)
• Scientists (collect and publish results, make revisions)

Scientists:
• Keep all states of data
• Pull out a past version whenever needed
• Support for basic operations




Biological Databases: Complex interdependencies

• Domino effect in data publishing
• Efficiently keep many versions

[Figure: flow of data among GERD, TRRD, EpoDB, BEAD, Swissprot, EMBL, GAIA, GenBank, Transfac and DDBJ]




Meaningful change description
(Exploit known key constraints)

Line
Number   Version 1:                   Version 2:

  1      <DB>                         <DB>
  2      <Person>                     <Person>
  3        <Name>Joe</>                 <Name>Jane</>
  4        <DateOfBirth>March</>        <DateOfBirth>May</>
  5        <Address>South Street</>     <Address>South Street</>
  6        <Zip>12345</>                <Zip>12345</>
  7      </>                          </>
  8      <Person>                     <Person>
  9        <Name>Jane</>                <Name>Joe</>
  10       <DateOfBirth>May</>          <DateOfBirth>March</>
  11       <Address>Pine Street</>      <Address>Pine Street</>
  12       <Zip>67890</>                <Zip>67890</>
  13     </>                          </>
  14     </>                          </>

Output of line diff (versions 1-2):

         3,4c
         <Name>Jane</>
         <DateOfBirth>May</>
         9,10c
         <Name>Joe</>
         <DateOfBirth>March</>

• Change descriptions to preserve “object continuity” through time
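The contrast between a line diff and a key-based change description can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it treats (Name, DateOfBirth) as the key of a Person, and both the dictionary representation and the function name `keyed_changes` are hypothetical, not taken from the talk.

```python
# Hypothetical representation of the two versions shown above.
v1 = [
    {"Name": "Joe", "DateOfBirth": "March", "Address": "South Street", "Zip": "12345"},
    {"Name": "Jane", "DateOfBirth": "May", "Address": "Pine Street", "Zip": "67890"},
]
v2 = [
    {"Name": "Jane", "DateOfBirth": "May", "Address": "South Street", "Zip": "12345"},
    {"Name": "Joe", "DateOfBirth": "March", "Address": "Pine Street", "Zip": "67890"},
]

KEY = ("Name", "DateOfBirth")  # assumed key constraint for Person


def keyed_changes(old, new, key=KEY):
    """Describe changes per object, matching objects by key instead of by line position."""
    old_by_key = {tuple(p[k] for k in key): p for p in old}
    new_by_key = {tuple(p[k] for k in key): p for p in new}
    changes = []
    for k in sorted(old_by_key.keys() | new_by_key.keys()):
        if k not in new_by_key:
            changes.append((k, "deleted", None))
        elif k not in old_by_key:
            changes.append((k, "inserted", None))
        else:
            before, after = old_by_key[k], new_by_key[k]
            fields = {
                f: (before.get(f), after.get(f))
                for f in sorted(set(before) | set(after))
                if before.get(f) != after.get(f)
            }
            if fields:
                changes.append((k, "updated", fields))
    return changes


for key_value, kind, fields in keyed_changes(v1, v2):
    print(key_value, kind, fields)
```

Under this key, the description is that Joe and Jane each changed address and zip code, whereas the line diff reports that the first person's name and date of birth changed, which loses the identity of each object across versions.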



Merge Example

• Inheritance of timestamps
• Time intervals
  – e.g. t=1,2,3,4,5,9 is written t=1-5,9

Archive (version 1):
  DB t=1
    Emp:  Id John, Sal 80K, Tel 123, Tel 887

Version 2:
  DB t=2
    Emp:  Id John, Sal 82K, Tel 123, Tel 887, Tel 467
    Emp:  Id Alice, Sal 85K, Tel 239

Archive (versions 1-2):
  DB t=1-2
    Emp t=1-2:  Id John, Sal 80K (t=1), Sal 82K (t=2), Tel 123, Tel 887, Tel 467 (t=2)
    Emp t=2:    Id Alice, Sal 85K, Tel 239
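The merge step, timestamp inheritance and interval notation can be sketched as follows. This is a simplified illustration, not the talk's implementation: it assumes every node has a label that is unique among its siblings (here the value is folded into the label, as in `Sal=80K`), and the names `Node`, `merge`, `intervals` and `render` are hypothetical.

```python
class Node:
    """One archive node: a label, the versions it exists in, and keyed children."""

    def __init__(self, label):
        self.label = label
        self.times = set()
        self.children = {}


def merge(archive, version, t):
    """Merge a version (nested dict keyed by label) into the archive as version t."""
    archive.times.add(t)
    for label, subtree in version.items():
        # Nodes with the same label (key) are identified across versions.
        child = archive.children.setdefault(label, Node(label))
        merge(child, subtree, t)


def intervals(times):
    """Write a set of version numbers as intervals, e.g. {1,2,3,4,5,9} as '1-5,9'."""
    ts = sorted(times)
    parts, i = [], 0
    while i < len(ts):
        j = i
        while j + 1 < len(ts) and ts[j + 1] == ts[j] + 1:
            j += 1
        parts.append(str(ts[i]) if i == j else f"{ts[i]}-{ts[j]}")
        i = j + 1
    return ",".join(parts)


def render(node, parent_times=None, depth=0):
    """Show a timestamp only where it differs from the parent's (inheritance)."""
    stamp = "" if node.times == parent_times else f" t={intervals(node.times)}"
    lines = ["  " * depth + node.label + stamp]
    for label in sorted(node.children):
        lines.extend(render(node.children[label], node.times, depth + 1))
    return lines


v1 = {"Emp[Id=John]": {"Sal=80K": {}, "Tel=123": {}, "Tel=887": {}}}
v2 = {
    "Emp[Id=John]": {"Sal=82K": {}, "Tel=123": {}, "Tel=887": {}, "Tel=467": {}},
    "Emp[Id=Alice]": {"Sal=85K": {}, "Tel=239": {}},
}

archive = Node("DB")
merge(archive, v1, 1)
merge(archive, v2, 2)
print("\n".join(render(archive)))
```

Only the nodes whose lifetime differs from their parent's carry an explicit timestamp in the rendered output, which is what keeps the archive close in size to a single version when little changes between versions.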




SWISS-PROT: a curated database

• used for much scientific research
• publishes a new version every few months
• keeps all versions
• plenty of constraints!

Sample entry:

    ID   11SB_CUCMA      STANDARD;     PRT;   480 AA.
    AC   P13744;
    DT   01-JAN-1990 (REL. 13, CREATED)
    DT   01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
    DT   01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
    DE   11S GLOBULIN BETA SUBUNIT PRECURSOR.
    OS   CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
    OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
    OC   VIOLALES; CUCURBITACEAE.
    RN   [1]
    RP   SEQUENCE FROM N.A.
    RC   STRAIN=CV. KUROKAWA AMAKURI NANKIN;
    RX   MEDLINE; 88166744.
    RA   HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
    RL   EUR. J. BIOCHEM. 172:627-632(1988).
    RN   [2]
    RP   SEQUENCE OF 22-30 AND 297-302.
    RA   OHMIYA M., HARA I., MASTUBARA H.;
    RL   PLANT CELL PHYSIOL. 21:157-167(1980).
    CC   -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
    CC   -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A
    CC       BASIC CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A
    CC       DISULFIDE BOND.
    CC   -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
    DR   EMBL; M36407; G167492; -.
    DR   PIR; S00366; FWPU1B.
    DR   PROSITE; PS00305; 11S_SEED_STORAGE; 1.
    KW   SEED STORAGE PROTEIN; SIGNAL.
    FT   SIGNAL        1      21
    FT   CHAIN        22     480      11S GLOBULIN BETA SUBUNIT.
    FT   CHAIN        22     296      GAMMA CHAIN (ACIDIC).
    FT   CHAIN       297     480      DELTA CHAIN (BASIC).
    FT   MOD_RES      22      22      PYRROLIDONE CARBOXYLIC ACID.
    FT   DISULFID    124     303      INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
    FT   CONFLICT     27      27      S -> E (IN REF. 2).
    FT   CONFLICT     30      30      E -> S (IN REF. 2).
    SQ   SEQUENCE   480 AA; 54625 MW; D515DD6E CRC32;
         MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR
         RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
         IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV
         FADTRNVANQ IDPYLRKFYL AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE
         EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE
         TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY
         TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF
         KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE
    //
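To make the role of constraints concrete, the sketch below groups the lines of a flat-file entry by their two-letter line code and pulls out the accession number. Treating the AC value as the key that identifies an entry across versions is an assumption made here for illustration; the function names `parse_entry` and `accession` are hypothetical, and the sample is an excerpt of the entry above.

```python
from collections import defaultdict

# Excerpt of the entry shown above (unindented, as in the flat file).
SAMPLE = """\
ID   11SB_CUCMA      STANDARD;     PRT;   480 AA.
AC   P13744;
DT   01-JAN-1990 (REL. 13, CREATED)
DT   01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
DT   01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
DE   11S GLOBULIN BETA SUBUNIT PRECURSOR.
SQ   SEQUENCE   480 AA; 54625 MW; D515DD6E CRC32;
     MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR
//
"""


def parse_entry(text):
    """Group the lines of one entry by line code; '//' ends the entry."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            break
        if not line.strip():
            continue
        code, rest = line[:2], line[5:].rstrip()
        # Lines with a blank code are sequence data continuing the SQ block.
        fields[code if code.strip() else "SQ"].append(rest)
    return dict(fields)


def accession(fields):
    """Return the first accession number, assumed here to act as the entry's key."""
    return fields["AC"][0].split(";")[0].strip()


entry = parse_entry(SAMPLE)
print(accession(entry), len(entry["DT"]), "DT lines")
```

With entries keyed this way, two versions of the database can be compared entry by entry rather than line by line.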


Experimental Results (OMIM)

[Figure: size (bytes × 10^6) against number of versions. Legend: archive, inc diff, version, compressed inc diff, compressed archive. Plot labels: gzip(inc diff), XMill(archive)]

Uncompressed
• Archive size is
  – 1.01 times diff repository size
  – 1.04 times size of largest version
  – Why not keep all versions?

Compressed
• Archive size is between 0.94 and 1 times compressed diff repository size

• gzip - unix compression tool
• XMill - XML compression tool
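For readers who want to reproduce the baseline series on their own data, here is a rough sketch of how an incremental diff repository and its gzip-compressed size might be measured. It assumes the repository consists of the first version plus the diff between each pair of successive versions, uses Python's `difflib` and `gzip` modules in place of the unix tools, and runs on made-up data; the function names are hypothetical and the numbers it prints say nothing about OMIM.

```python
import difflib
import gzip


def incremental_diffs(versions):
    """Return the line diff between each pair of successive versions."""
    diffs = []
    for old, new in zip(versions, versions[1:]):
        delta = difflib.unified_diff(
            old.splitlines(keepends=True), new.splitlines(keepends=True), n=0
        )
        diffs.append("".join(delta))
    return diffs


def repository_sizes(versions):
    """Size in bytes of (first version + incremental diffs), raw and gzip-compressed."""
    repo = (versions[0] + "".join(incremental_diffs(versions))).encode("utf-8")
    return len(repo), len(gzip.compress(repo))


# Made-up data: 200 records, with one record edited in each later version.
base = [f"record {i}: value 0\n" for i in range(200)]
second = list(base)
second[5] = "record 5: value 1\n"
third = list(second)
third[7] = "record 7: value 2\n"
versions = ["".join(base), "".join(second), "".join(third)]

raw, packed = repository_sizes(versions)
all_versions = sum(len(v.encode("utf-8")) for v in versions)
print(raw, packed, all_versions)
```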
Experimental Results (SWISS-PROT)

[Figure: size (bytes × 10^6) against number of versions. Legend: archive, inc diff, version, compressed inc diff, compressed archive]

Uncompressed
• Archive size is
  – 1.08 times diff repository size
  – 1.92 times size of largest version
  – Why not keep all versions with our archiving scheme?

Compressed
• Archive size is between 0.59 and 1 times compressed diff repository size

• gzip - unix compression tool
• XMill - XML compression tool
Bottom line
• New technique for archiving (using key constraints!)
• Can archive a whole year of SWISS-PROT or OMIM
  with < 15% overhead (relative to the size of the current file)
• Retrieval is a linear scan (can do better!)
• Works well with compression: the archive is < 30% of the current file
• Archive as often as you like! (Almost)
• Permits efficient support for basic temporal queries
  on objects
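The two query styles mentioned above, retrieving a past version by a linear scan and asking for the history of one object, can be sketched over a flattened archive. The representation below (a list of key paths with the set of versions in which each path exists) and the function names `retrieve` and `history` are assumptions for illustration only.

```python
# Hypothetical flattened archive: (key path, versions in which the path exists).
ARCHIVE = [
    (("DB",), {1, 2}),
    (("DB", "Emp[Id=John]"), {1, 2}),
    (("DB", "Emp[Id=John]", "Sal=80K"), {1}),
    (("DB", "Emp[Id=John]", "Sal=82K"), {2}),
    (("DB", "Emp[Id=John]", "Tel=123"), {1, 2}),
    (("DB", "Emp[Id=John]", "Tel=467"), {2}),
    (("DB", "Emp[Id=Alice]"), {2}),
    (("DB", "Emp[Id=Alice]", "Sal=85K"), {2}),
]


def retrieve(archive, t):
    """Linear scan: keep every path that exists in version t."""
    return [path for path, times in archive if t in times]


def history(archive, prefix):
    """Temporal query on one object: when did each of its parts exist?"""
    n = len(prefix)
    return {
        path[n:]: sorted(times)
        for path, times in archive
        if path[:n] == prefix and len(path) > n
    }


print(retrieve(ARCHIVE, 1))
print(history(ARCHIVE, ("DB", "Emp[Id=John]")))
```

Because every object is reachable by its key path, the history query touches only the entries under that path, whereas retrieving a whole version reads the entire archive once.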




Other desirable characteristics
• Trace evolution
• XML representation
• Handle large files
• Preserve order
• Pose sophisticated temporal queries
  – E.g. find all protein sequences whose description
    has been changed more than twice over the last
    year
• Handle evolving schema
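As an illustration of the kind of temporal query listed above, the sketch below finds entries whose description changed more than twice within a range of versions. It assumes the archive can yield, for each entry, the list of (version, description) pairs at which the description took a new value, and that "the last year" has already been mapped to a version range; the accession numbers and descriptions are invented.

```python
# Hypothetical per-entry description histories extracted from an archive:
# each pair is (version at which the value took effect, description).
HISTORIES = {
    "X00001": [(1, "PROTEIN A"), (2, "PROTEIN A PRECURSOR"),
               (3, "PROTEIN A1 PRECURSOR"), (4, "PROTEIN A1 BETA PRECURSOR")],
    "X00002": [(1, "PROTEIN B"), (3, "PROTEIN B PRECURSOR")],
    "X00003": [(1, "PROTEIN C")],
}


def description_changes(history, first_version, last_version):
    """Count value changes that took effect within [first_version, last_version]."""
    count = 0
    for (_, old), (version, new) in zip(history, history[1:]):
        if first_version <= version <= last_version and new != old:
            count += 1
    return count


def frequently_changed(histories, first_version, last_version, threshold=2):
    """Entries whose description changed more than `threshold` times in the range."""
    return sorted(
        acc
        for acc, hist in histories.items()
        if description_changes(hist, first_version, last_version) > threshold
    )


print(frequently_changed(HISTORIES, first_version=1, last_version=4))
```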


End



