The Dryad data repository wiki

“Classifying Scientific Data Objects with Bibliographic Relationship Taxonomies”
ASIS&T 2007 Annual Meeting
Sunday, October 21, 2007

Jane Greenberg, Frances McColl Professor; Director, SILS/Metadata
Research Center
School of Information and Library Science
Univ. of North Carolina at Chapel Hill
       DRIADE (Digital Repository of Information and Data for Evolution)
       Accomplishments: basic metadata
         Functional requirements, underlying metadata
       Instantiation
         Motivation
         Research design
         Preliminary results
       Conclusions
       [Figure labels: adaptation; natural selection]
Small science, open access
   Small science repositories (SSR)
     Knowledge Network for Biocomplexity (KNB),
      Marine Metadata Initiative (MMI)
   Evolutionary biology (paleontology, population, physiology, systematics + genomics)
     Publication process
        Supplementary data (Molecular Biology and Evolution, American Naturalist)
          “Author,” “deposition date,” not “subject,” “species,” “geo. locator”
        Data deposition (GenBank, TreeBASE, Morphbank)
NESCent and SILS/Metadata Research
DRIADE’s Goals
   One-stop deposition and shopping for data objects
   Support the acquisition, preservation, resource discovery, and reuse of heterogeneous digital
   Balance a need for low barriers, with higher-level … data synthesis
DRIADE Team
   NESCent
      PI: Todd Vision, Director of Informatics and Associate Professor, Biology, UNC
      Hilmar Lapp, Assistant Director of Informatics
      Ryan Scherle, Data Repository Architect
   UNC/SILS/MRC
      PI: Jane Greenberg, Associate Professor, SILS and MRC
      Sarah Carrier, Research Assistant
      Abbey Thompson, DRIADE R.A./SILS Masters Student
      Hollie White, Doctoral Fellow
      Amy Bouck, Biology, Ph.D. student
DRIADE (Dryad): one stop
   [Diagram labels]
   Repositories: GenBank, TreeBASE, Morphbank, PaleoDB
   DRIADE: data objects, published research
   Journals & journal repositories
   Researcher/s: one stop, an option
Accomplishments: basic metadata
Accomplishments: addressing basic
metadata challenges
   Functional requirements
         Consensus building (Dec. ’06) + SSR workshops (May ’07)
               First class object, SHARING & REUSE
         Analysis of metadata functions in existing repositories (JCDL, 2007)
   Functional model
           OAIS (Open Archival Information System), DSpace
   Level one application profile (Carrier, et al., 2007)
         Namespace schemas:
            1. Dublin Core
            2. Data Documentation Initiative
            3. Ecological Metadata Language (EML)
            4. PREMIS
            5. Darwin Core
         Modular scheme:
            1. Journal citation
            2. Data objects
 Functional requirements
Project                                   GBIF   KNB   NSDL   ICPSR   MMI
Heterogeneous digital datasets             ▪      ▪     ▪      ▪       ▪
Long-term data                             ▪            ▪
Tools and incentives to researchers        ▪      ▪     ▪      ▪       ▪
Minimize technical expertise and time      ▪      ▪     ▪      ▪       ▪
Intellectual property                      ▪      ▪            ▪
Datasets coupled
 <DRIADE application profile>
Bibliographic Citation Module
1.    dcterms:bibliographicCitation/Citation information
2.    DOI
Data Object Module
1.    dc:creator/Name*
2.    dc:title/Data Set #
3.    dc:identifier/Data Set Identifier
4.    PREMIS:fixity/(hidden)
5.    dc:relation/DOI of Published Article*
6.    DDI:<depositr>/Depositor
7.    DDI:<contact>/Contact Information
8.    dc:rights/Rights Statement
9.    dc:description/Description #
10.   dc:subject/Keywords
11.   dc:coverage/Locality
12.   dc:coverage/Date Range Required*
13.   dc:software/Software*
14.   dc:format/File Format
15.   dc:format/File Size
16.   dc:date/(Hidden) Required
17.   dc:date/Date Modified*
18.   Darwin Core: species/Species, or Scientific*
Key
* = semi-automatic
# = manual
Everything else is automatic
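The key on this slide (automatic, semi-automatic, manual) can be read as a per-element attribute of the application profile. The following is a minimal sketch under that assumption, not DRIADE's actual implementation: the `Capture` enum, the `Element` class, and the `elements_needing_depositor` helper are hypothetical names, and only a subset of the Data Object Module is listed.

```python
from dataclasses import dataclass
from enum import Enum


class Capture(Enum):
    """How a value is supplied, following the slide's key."""
    AUTOMATIC = "automatic"            # no marker on the slide
    SEMI_AUTOMATIC = "semi-automatic"  # marked * on the slide
    MANUAL = "manual"                  # marked # on the slide


@dataclass(frozen=True)
class Element:
    source: str       # namespace-qualified element as written on the slide
    label: str        # display label
    capture: Capture  # who or what fills the value in


# A subset of the Data Object Module; other elements follow the same pattern.
DATA_OBJECT_MODULE = [
    Element("dc:creator", "Name", Capture.SEMI_AUTOMATIC),
    Element("dc:identifier", "Data Set Identifier", Capture.AUTOMATIC),
    Element("PREMIS:fixity", "(hidden)", Capture.AUTOMATIC),
    Element("dc:relation", "DOI of Published Article", Capture.SEMI_AUTOMATIC),
    Element("dc:rights", "Rights Statement", Capture.AUTOMATIC),
    Element("dc:description", "Description", Capture.MANUAL),
]


def elements_needing_depositor(module):
    """Return labels of elements that are not filled in fully automatically."""
    return [e.label for e in module if e.capture is not Capture.AUTOMATIC]


print(elements_needing_depositor(DATA_OBJECT_MODULE))
# ['Name', 'DOI of Published Article', 'Description']
```

Keeping the capture mode next to each element makes it easy to estimate how much of the deposit form a researcher has to touch, which relates to the stated goal of low barriers.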
DRIADE metadata application profile … organic …
   Level 1 – initial repository implementation
     Application profile: preservation, access/resource
      discovery (limited use of CVs)
   Level 2 – full repository implementation
     Level 1++: expanded usage, interoperability,
      preservation; administration; greater use of CV and
      authority control; data sharing and reuse
   Level 3 – “next generation” implementation
     Considering Web 2.0 functionalities, Semantic Web
(Classifying Scientific Data Objects with Bibliographic
              Relationships Taxonomies)
   DRIADE goals:
     Data preservation, sharing, use/re-use, validation
        Is it important to know the history of a data object?
        How can we support accurate and effective tracking of the
         life-cycle of a data object?

   Data objects = first class objects
     Data structures as works (Coleman, 2002)
        Units of analysis, intellectual products of activity
     Work = “propositions expressed (ideational content)”,
      and “expressions of the propositions” (Smiraglia, 2001)
     Data objects are content carriers (Greenberg, 2007)
   Research and … to explain a “work”;
    bibliographic families; and instantiation
    Coleman (2002)                    Tillett (1991, 1992)
    Cutter (1904)                     Vellucci (1995)
    Leazer (1999)                     Wilson (1983)
    Ranganathan: embodied/expressed   Yee (1995)
    Smiraglia (1999,2001,

   Metadata and Bibliographic Control
       FRBR (Functional Requirements for Bibliographic Records)
       DCAM (Dublin Core Abstract Model)
       RDF (Resource Description Framework)
Research design
   Objective: Build an open access repository that
    accurately reflects the “disciplinary knowledge structures”
    (relationships among data objects)
   Research questions
    1.   Do scientists (developing scientists) view data
         objects as works?
    2.   Do they understand different “instantiations”?
    3.   Do they think instantiation tracking is important?
   Method
     Instantiation identification test and survey
   Participants: Scientists, research and publication
    Procedures, etc.
   Verify participant suitability, contextual introduction
   Read instantiation definitions and view instantiation
    diagram (next slide)
   Example Instantiation scenario
       Scenario: Sherry collects data on the survival and growth of the
       plant Borrichia frutescens (the bushy seaside tansy)…

       Question: What is the relationship between Sherry’s paper data
       sheet and her Excel spreadsheet?

       Answer: Equivalent | Derivative | Whole-part | Sequential
       (circle one)

   8 scenarios (randomized); survey
   Data object relationships
 Equivalence
     A (= data set in Excel); A (= same data set in SAS); A (= same data set on paper)
 Derivative
     A (= data set); (= data set A annotated); C (= data set A revised)
 Whole-part
     A (= data set); A1 (= a subset of A)
 Sequential
     A1 (= part 1 of a data set); A2 (= part 2 of a data set)

Smiraglia, Tillett
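The four relationship types in the diagram can be treated as a small controlled vocabulary attached to links between data objects. The sketch below is one possible encoding, assuming hypothetical names (`Relationship`, `Link`, `related`) and identifiers modeled loosely on the diagram's examples; it is not taken from the DRIADE system.

```python
from dataclasses import dataclass
from enum import Enum


class Relationship(Enum):
    """The four relationship types shown on the slide."""
    EQUIVALENT = "equivalent"
    DERIVATIVE = "derivative"
    WHOLE_PART = "whole-part"
    SEQUENTIAL = "sequential"


@dataclass(frozen=True)
class Link:
    source: str                 # identifier of the starting data object
    relationship: Relationship  # type of relationship
    target: str                 # identifier of the related data object


# Hypothetical identifiers modeled on the diagram's examples.
LINKS = [
    Link("A (Excel)", Relationship.EQUIVALENT, "A (SAS)"),
    Link("A (Excel)", Relationship.EQUIVALENT, "A (paper)"),
    Link("A", Relationship.DERIVATIVE, "C (A revised)"),
    Link("A", Relationship.WHOLE_PART, "A1 (subset of A)"),
    Link("A1 (part 1)", Relationship.SEQUENTIAL, "A2 (part 2)"),
]


def related(links, source, relationship):
    """Return targets linked from `source` by the given relationship type."""
    return [
        link.target
        for link in links
        if link.source == source and link.relationship is relationship
    ]


print(related(LINKS, "A (Excel)", Relationship.EQUIVALENT))
# ['A (SAS)', 'A (paper)']
```

Storing each link as a source, type, and target keeps the vocabulary closed (four values) while leaving the set of data objects open.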
   7 participants (Biochemistry, Biology, Botany/Plant
       3 Faculty/research associates; 21+ yrs. research/pub.
       4 Research assistants; 3 (1-3 yrs.), 1 (5 yrs.)
        Instant./     A              B              Failing A   Failing B
        Equivalent    100%           100%           n/a         n/a
        Derivative    6 of 7 (86%)   6 of 7 (86%)   1 BS        1 MS
        Whole/part    100%           6 of 7         n/a         1 PhD
        Sequential    6 of 7 (86%)   3 of 7 (43%)   1 BS        2 BS, 1 MS
     Instant./rel.   Created X data before   Apply to work “freq.”   Other app.        Support last publ. (y/n)
     Equivalent      7                       6 (86%)                 1 (h/t)           6/1
     Derivative      7                       5 (71%)                 2 (h/t)           6/1
     Whole/part      7                       5 (71%)                 2 (s)             4/2 (1 na)
     Sequential      5                       3 (43%)                 1 (h/t), 1 (s)    2/3 (2 na)

    h/t = halftime; s = seldom; na = no answer
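The percentages in the two results tables are counts out of the 7 participants, rounded to whole numbers. A short check of that arithmetic, using a hypothetical helper named `share`:

```python
def share(count, total):
    """Return count/total as a whole-number percentage string."""
    return f"{round(100 * count / total)}%"


# Counts that appear in the tables, each out of 7 participants.
for count in (7, 6, 5, 3):
    print(f"{count} of 7 = {share(count, 7)}")
# 7 of 7 = 100%
# 6 of 7 = 86%
# 5 of 7 = 71%
# 3 of 7 = 43%
```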

Use of data by others?
    1-5, 1 person
    6-20, 3 people
    21-99, 1 person
    100+, 2 people (PhD)
How important is it to track data life-cycle (relationships)?
    0, not
    3, very
    4, extremely (2 PhD, 1 MS, 1 BS)
Results (participant comments)
   Validity: “Whenever anything changes, there’s
    the ability to make mistakes”
   Sheer quantity of data: “We have 30 years’ worth
    of data, tracking changes is important”
   The impact of changes in scientific nomenclature:
     “As time goes on, datasets must be maintained to
      reflect current understanding of taxonomy and
      nomenclature, to allow connection of old data and
      attributes of that data to be associated correctly with
      new data and attributes of that data. This is a giant
Conclusions and next steps
   Conclusions
       Use of data created by others increases over time
       In general, more seasoned scientists better grasp
       Sequential data presented the most difficulty (less seasoned sci.)
       Unanimous support, “very/extremely important”
   Bibliographic control relationships are applicable to
    describing the relationships between data objects
   Next steps:
       Continue to collect data (30-50)
       Study additional types of instantiations
       Develop applications that effectively track data object life-cycle
            How do we encode these relationships? Manually/automatically?
            Who should create them? Curator, scientists, collaboratively?
            Should we maintain a vocabulary?
            Consider implications for other disciplines
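One way to think about tracking a data object's life-cycle is to record, for each object, the object it was derived from, and then walk that chain back to the original. The sketch below assumes a simple one-parent model and hypothetical identifiers (`A`, `A-revised`, `A-revised-2`); how relationships would really be encoded, and by whom, remains the open question posed above.

```python
# Hypothetical derivation records: each key was derived from its value.
DERIVED_FROM = {
    "A-revised-2": "A-revised",
    "A-revised": "A",
}


def lineage(obj, derived_from):
    """Return the chain from `obj` back to its earliest recorded ancestor."""
    chain = [obj]
    seen = {obj}
    while chain[-1] in derived_from:
        parent = derived_from[chain[-1]]
        if parent in seen:
            # Guard against records that accidentally form a loop.
            raise ValueError(f"cycle detected at {parent!r}")
        chain.append(parent)
        seen.add(parent)
    return chain


print(lineage("A-revised-2", DERIVED_FROM))
# ['A-revised-2', 'A-revised', 'A']
```

A curator, a scientist, or an automated deposit tool could populate such records; the traversal stays the same regardless of who creates them.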
Jane Greenberg, Associate Professor, and Director, SILS Metadata
Research Center
