Sikes_DS_2012.pptx - Alaska Entomological Society

					Arctos at the
University of Alaska
Museum Insect
Collection

Derek Sikes1
Gordon Jarrell2
Dusty McDonald1

1 University of Alaska Museum
  Fairbanks, AK

2 Museum of Southwestern
  Biology, NM




Alaska Entomological Society
5th Annual Meeting, Anchorage, AK
27-28 Jan 2012
Major repositories using the Arctos database:
(43 collections of specimens or observations, 1.4M records)
          in partnership with




          which is a member of
     TeraGrid – A nationwide network
      of 11 supercomputing facilities

          which is sponsored by
U. S. National Science Foundation’s
    Office of Cyberinfrastructure
                Arctos: A 15-year history
   MVZ: 1995 - Hired Stan Blum to develop relational data model (following modeling
    by Assoc. Systematic Collections).

   MVZ: 1997 - Hired John Wieczorek to implement model (desktop application) using
    Sybase and Versata. Partial implementation (e.g., no loans).

   UAM: 1998-2000 - John W. migrated mammal data to Oracle, set up Versata.

   UAM: 2002 - Dusty McDonald replaced Versata with ColdFusion, implemented full
    model (first web-based instance, aka Arctos).

   MSB: 2003 – Joined Arctos at UAM (first multi-hosting instance).

   MVZ and MCZ: 2005-2007 - Implemented separate instances of Arctos at Berkeley
    and Harvard (MVZ: first Postgres, then Oracle).

   MVZ: 2009 - Moved hosting of data to Alaska (Virtual Private Database version).
                                     Arctos

• Specimens (objects) - body parts, tissues, containers, etc.
• Images, media (stored at TACC)
• Projects, permits, publications
• Accessions, loans, usage
• Labels, as PDF files
• Agents, agent activity

[Diagram: ARCTOS]
  Specimen Catalog: label data (and more)
  Accessions
  Loans, usage
  Projects: contribute and/or use specimens
  Citations
  Publications: cite specimens

[Diagram: The rest of Cyberspace]
  GenBank
  Federated portals
  BerkeleyMapper
  “Media” in TeraGrid
BerkeleyMapper & Google Maps, with error circles
       Breadth of Data in Arctos
• Fish, amphibians, reptiles, mammals, birds and
      bird eggs/nests, plants, arthropods, fossils,
      molluscs AND their parasites
• Specimens and observations
• Media (images, audio, video)
• Publications, fieldnotes


Arctos is constantly evolving to incorporate new kinds
of data, e.g.:
• Better representation of non-publication
      documents (fieldnotes, correspondence)
• Cultural collections (art, anthropology...)
Nearly all that is known about an object (or
observation) can be included in Arctos.
Linking specimen records to archival documentation…
     ECN Session – Arthropod Collections Databases
1) What is the primary user audience? - large/ small museum
   management? taxonomic research? is a dedicated IT /
   programmer required? Single vs multi-user? (annual cost?)

2) GBIF - does the database provide data to GBIF?

3) Barcoding - does the database handle batch processing of
   specimens using barcodes? ( 'speed / ease of use')

4) Georeferencing - does it conform to the recommended 'best
   practices' guide published by GBIF?

5) What is the ease / difficulty of web setup?

6) Security - can a data entry technician accidentally delete or
   change (corrupt) large amounts of data? Is/are the database
   server(s) protected from disaster (eg floods, fires)?

7) Likes / dislikes & pros/cons
     ECN Session – Arthropod Collections Databases

1a) What is the primary user audience?

    Museums / collections data management (also: observations, Federal
    collections [USFWS], large private collections associated with a public institution)

1b) is a dedicated IT / programmer required?

    Yes, but the IT staff are shared among all participants.

1c) Single vs multi-user?

    Multi-user without practical limits.

1d) Annual cost?

    Negotiated per institution based on size and maintenance
    needs;
    currently ranging from $1,300 to $27,000
    ECN Session – Arthropod Collections Databases

2) GBIF - does the database provide data to GBIF?

   Arctos does this automagically every minute.


3) Barcoding - does the database handle batch processing of
    specimens using barcodes? ( 'speed / ease of use')

   Arctos attaches barcodes to “parts.” This lets you track
   things like tissues, extractions, slides and pinned bodies of
   each cataloged specimen separately.
     ECN Session – Arthropod Collections Databases

4) Georeferencing - does it conform to the recommended 'best
    practices' guide published by GBIF?

    Arctos fully supports georeferencing "best practices," in part
    because the authors of that document and of Arctos' spatial
    data structure are one and the same. (John Wieczorek)
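
    As an illustration only, a georeference with an error circle can be
    sketched as a point plus an uncertainty radius. The class and field
    names below are hypothetical and are not taken from Arctos; the
    distance uses the standard haversine formula on a spherical Earth.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


@dataclass
class PointRadius:
    """A georeference stored as a point plus an error radius.

    Field names are hypothetical, chosen for this sketch only.
    """
    lat: float          # decimal degrees
    lon: float          # decimal degrees
    max_error_m: float  # radius of the error circle, in metres


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points (haversine)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def could_contain(geo, lat, lon):
    """True if (lat, lon) lies inside the georeference's error circle."""
    return distance_m(geo.lat, geo.lon, lat, lon) <= geo.max_error_m


# A record georeferenced near Fairbanks with a 5 km error circle (made-up values).
fairbanks = PointRadius(lat=64.84, lon=-147.72, max_error_m=5000)
```

    Storing the radius with the point lets a mapper draw the circle and
    lets a query ask whether a site could plausibly be the collecting
    locality, rather than treating the coordinates as exact.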

5) What is the ease / difficulty of web setup?

    Acquire password. Enter data. (Arctos is only available via the web).
  ECN Session – Arthropod Collections Databases


           Preservation of specimens

            and their associated data

                      for perpetuity


NSF will help us get our data online but ensuring they stay online
             forever is a problem that hasn’t been solved
33,090 specimens
28 institutions / private collections
736 images
4,516 bibliographic images
428 users
DMNS Arachnology Data

In-house -> NSD -> Crash -> K EMu

Database errors...
Cabinets
     antiquated
     wooden
     damaged
= unsafe
Database
     home-made
     weak security
     mine alone
     not online
= unsafe

                     Arctos

[Diagram: ARCTOS]
  Specimen Catalog: label data (and more)
  Accessions
  Loans, usage
  Projects: contribute and/or use specimens
  Citations
  Publications: cite specimens

[Diagram: The rest of Cyberspace]
  GenBank
  Federated portals
  BerkeleyMapper
  “Media” in TeraGrid
     ECN Session – Arthropod Collections Databases

6) Security - can a data entry technician accidentally delete or change
    (corrupt) large amounts of data?

No – Data entry technicians enter data into a staging area

Data must be vetted before being loaded by someone with more
   access privileges

All non-select transactions are audited. We can (theoretically) roll
    back to any point in history, or roll any user's updates back to
    any point in history. We can re-create all actions by all users.
     ECN Session – Arthropod Collections Databases

6) Security - Is/are the database server(s) protected from disaster (eg
    floods, fires)?

Yes – running a RAID array

Backups
   – continuous logs to a remote NAS
   – local drives
   – Texas Advanced Computing Center
   – San Diego Supercomputer Center

“If we lose all the nightly backups (3 tectonic plates), I'm betting
nobody will be overly worried about Arctos data.

Or breathing.” – D. McDonald
     ECN Session – Arthropod Collections Databases

7) Likes / dislikes & pros/cons

    DISLIKES:

    - Learning curve fairly steep -> back to kindergarten

    - Can’t customize to my heart’s content, each change must be
        voted on & prioritized by other users

    - Web access generally slower than I like (we are all more
       critical of others than ourselves)

    - Only available when networked. Field work in remote areas
        requires special solutions if data are to be accessed.

    - User interface is ~ garish, clunky, industrial (but works)
     ECN Session – Arthropod Collections Databases

7) Likes / dislikes & pros/cons

    LIKES:

    - Rock-solid security, the data will outlive me

    - Web-published

    - Cutting-edge web integration (mapping, GenBank, etc)

    - No responsibility on my part to maintain backups, software
        updates, etc. Need only a networked computer

    - Arctos programmers & designers are biologists / users who
        really care about “doing it right”
          ECN Session – Arthropod Collections Databases

 6) Security - can a data entry technician accidentally delete or change
     (corrupt) large amounts of data?
   There are multiple roles and partitions at various levels. A data entry technician has write access to exactly one table, the
bulkloader. Additionally, one VPD limits his access to his own collection, another limits access to his own rows, and yet
another prevents him from marking records to load. In short, he can only un-do anything he's done, and then only in a
"staging area" separate from "real" data.
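
   The staging rules just described can be sketched in a few lines. This is a toy, application-level model for illustration
only: Arctos enforces these rules inside the database with VPD policies, which this sketch does not reproduce. All class, method,
and collection names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    collection: str
    can_approve: bool = False  # True for someone with more access privileges


@dataclass
class StagedRow:
    row_id: int
    collection: str
    entered_by: str
    data: dict
    ready_to_load: bool = False


class Bulkloader:
    """Toy staging area: rows wait here until someone with approval rights marks them to load."""

    def __init__(self):
        self.rows = {}
        self._next_id = 1

    def enter(self, user, data):
        """Any user may add a row; it is stamped with their name and collection."""
        row = StagedRow(self._next_id, user.collection, user.name, dict(data))
        self.rows[row.row_id] = row
        self._next_id += 1
        return row.row_id

    def _check_access(self, user, row):
        # Rule 1: access is limited to the user's own collection.
        if row.collection != user.collection:
            raise PermissionError("row belongs to another collection")
        # Rule 2: a technician is further limited to rows they entered.
        if not user.can_approve and row.entered_by != user.name:
            raise PermissionError("technicians may only touch their own rows")

    def edit(self, user, row_id, **changes):
        row = self.rows[row_id]
        self._check_access(user, row)
        row.data.update(changes)

    def delete(self, user, row_id):
        self._check_access(user, self.rows[row_id])
        del self.rows[row_id]

    def mark_to_load(self, user, row_id):
        # Rule 3: technicians cannot mark records to load.
        row = self.rows[row_id]
        self._check_access(user, row)
        if not user.can_approve:
            raise PermissionError("only vetted users may mark rows to load")
        row.ready_to_load = True
```

   The point of the design is that the worst a technician can do is damage their own unvetted rows; nothing reaches the
"real" data until a second person with more privileges has marked it to load.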

   A similar model is used throughout Arctos. We control access at the table and row level, and can easily implement
finer-grained control if that becomes necessary. Users (theoretically) get only the rights that they need, and have demonstrated
an understanding of, for the data they need, all the while having full access to shared data (like agents).

    Data like agents and taxonomy - things where character strings rather than data concepts matter to collections - are
trigger-protected based on usage. You can't update an agent name after it's been used as an author, for example. This is
pretty basic referential integrity, and Arctos is the only thing that has it.
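
    A usage-based trigger of this kind can be sketched with SQLite from Python. This is only an illustration of the idea:
Arctos runs on Oracle, and the table, column, and trigger names below are hypothetical, not the Arctos schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE agent (
    agent_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
CREATE TABLE publication_author (
    publication_id INTEGER NOT NULL,
    agent_id       INTEGER NOT NULL REFERENCES agent(agent_id)
);

-- Refuse to change an agent's name once that agent is used as an author.
CREATE TRIGGER protect_used_agent_name
BEFORE UPDATE OF name ON agent
WHEN EXISTS (SELECT 1 FROM publication_author WHERE agent_id = OLD.agent_id)
BEGIN
    SELECT RAISE(ABORT, 'agent name is in use as an author');
END;

INSERT INTO agent VALUES (1, 'A. Author'), (2, 'U. Nused');
INSERT INTO publication_author VALUES (100, 1);
""")


def rename_agent(agent_id, new_name):
    """Try to rename an agent; return False if the database refuses."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "UPDATE agent SET name = ? WHERE agent_id = ?",
                (new_name, agent_id),
            )
        return True
    except sqlite3.DatabaseError:
        return False
```

    Because the rule lives in the database, it applies no matter which form, script, or SQL prompt issues the update.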

   Data and user rules are all handled by the RDBMS, so we can plug in forms written by other people/projects, offer SQL
command-line access, webservices, etc., without worrying too much about security or referential integrity. (Specify, for
example, cannot safely support such access as all data and access rules live in the application layer.)

   All non-select transactions are audited. We can (theoretically) roll back to any point in history, or roll any user's updates
back to any point in history. We can re-create all actions by all users.
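
   As a rough sketch of why an audit trail makes this possible: if every change is logged in order, the state at any earlier
point can be rebuilt by replaying the log up to that point. The code below is a toy in-memory model with hypothetical names; it
says nothing about how Arctos or Oracle actually store or replay their audit data.

```python
from dataclasses import dataclass


@dataclass
class AuditEntry:
    seq: int      # position in the log, starting at 1
    user: str     # who made the change
    key: str      # which field was changed
    old: object   # value before the change (None if the field was new)
    new: object   # value after the change


class AuditedStore:
    """Key/value store in which every write is appended to an audit log."""

    def __init__(self):
        self.data = {}
        self.log = []

    def set(self, user, key, value):
        entry = AuditEntry(len(self.log) + 1, user, key, self.data.get(key), value)
        self.log.append(entry)
        self.data[key] = value

    def state_at(self, seq):
        """Rebuild the state as it was just after log entry number `seq`."""
        state = {}
        for entry in self.log:
            if entry.seq > seq:
                break
            state[entry.key] = entry.new
        return state

    def actions_by(self, user):
        """Every logged change made by one user, in order."""
        return [entry for entry in self.log if entry.user == user]


store = AuditedStore()
store.set("ann", "locality", "Fairbanks")
store.set("bob", "locality", "Anchorage")
store.set("ann", "sex", "female")
before_bob = store.state_at(1)  # the record as it stood before bob's edit
```

   Undoing one user's changes while keeping everyone else's is harder than this sketch suggests, since later edits may depend
on earlier ones, which may be part of why the slide says "theoretically."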

   In addition to ColdFusion's Application Security, we take full advantage of Oracle security - a breach of one just leads to
another layer. Oracle handles things like secondary user access and brute-force password crack attempts. An independent
semi-intelligent (and slightly paranoid) security wrapper watches for malicious behavior and blocks IP access if it detects
anything anomalous.
            ECN Session – Arthropod Collections Databases

   6) Security - Is/are the database server(s) protected from disaster (eg
       floods, fires)?
The server is running a RAID array - we can lose a disk or two and not lose any data (or stop working). Rollback logs are
continuously written to a remote NAS (Network Attached Storage) system. Daily backups are stored on the local drives, on
the NAS, and on tape in GVEA's "bunker." (They won't tell us what or where that is, but your electric bill and medical records
are in there and it makes the Department of Homeland Security happy.) Daily backups are also copied to the Texas Advanced
Computing Center at Austin (one copy on disk and another on tape) and to tape at the San Diego Supercomputer Center. We
may have another copy going to massively redundant disk at the National Center for Supercomputing Applications (University
of Illinois at Urbana-Champaign) by the time you get to Reno.

We can recover to the point of failure, or at least to within a couple minutes of it, with one copy of the most recent daily backup
and one copy of the rollback logs. (Depending on recent activity, we can usually actually recover from a week-or-so old daily +
the rollbacks.) We'll lose <24H of data if we lose all the rollbacks - the server and the NAS. Those are in two buildings, both
with serious security, separated by about a hundred yards of gravel parking lot. If we lose all the nightly backups (3 tectonic
plates), I'm betting nobody will be overly worried about Arctos data. Or breathing.

There are a couple dozen probes per day - I think it's fairly safe to say that Arctos security has been tested. (Actual attacks are
now kind of hard to detect due to the aforementioned paranoid IP killer, which generally shuts them off at the first probe, but
we used to get one per week or so.) A big DDoS attack would easily take us down, but (1) we're too boring to attract such a
thing, and (2) so what? - those things just eat servers, not data.
            ECN Session – Arthropod Collections Databases

   6) Security - Is/are the database server(s) protected from disaster (eg
       floods, fires)?
We've lost a few disks over the years, but never lost data or had a server go down due to it. (We've had lots of downtime, just
not equipment-related.) Our biggest threat is probably a disgruntled employee with too much access and a long-term plan, but
we could probably (with expensive consultant help) even recover from that, and there's no lack of tools to detect such
behavior.

That might all be a little overkill - I'd settle for daily backups on 2 major tectonic plates if absolutely necessary –

but I certainly think that you have an obligation to do more than install [database X] on some junker computer and maybe buy
a tape drive when you take public money to create or curate digital data.


[database X] may be free, but supporting it takes a real commitment in
hardware, infrastructure, and expertise that most Universities are poorly
equipped to make.
I don't know of a single large project that hasn't at some point lost digital data.

                             - Dusty McDonald, Arctos programmer
Lessons Learned

1) Proprietary software is generally a bad idea unless you have
       guaranteed, sustained budget for staff and upgrades.
2) Back-ups cannot merely be performed/scripted with the
       assumption that the job is done.
3) Back-ups should NOT be incremental, MUST be stored
       offsite, and MUST include separate images of operating
       system and databases
4) Restoration from bare metal must be fully documented and
       periodically performed to verify that the process DOES
       work.
5) Source code must be in a distributed public repository like
       Github.

- D. Shorthouse
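
One small piece of lesson 2 can be sketched in code: after a backup
script runs, check that the copy actually matches the original instead
of assuming it does. The function names are hypothetical, and a matching
checksum is not a substitute for the periodic bare-metal restore in
lesson 4.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(original, copy):
    """True only if the copy exists and is byte-for-byte identical to the original."""
    if not Path(copy).is_file():
        return False
    return sha256_of(original) == sha256_of(copy)
```

A backup job could call verify_copy() on each offsite copy and raise an
alert on False, so that a silently failing script is noticed before the
backup is needed.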
University of Connecticut Bird Collection data
were found... on a single floppy




                       2031 records in a flat file
University of Connecticut Bird Collection data
were found... and made available on-line
But... Something with the server setup is not
stable.

				