The Dryad data repository wiki

“Classifying Scientific Data Objects with Bibliographic Relationship Taxonomies”
ASIS&T 2007 Annual Meeting
Sunday, October 21, 2007




Jane Greenberg, Frances McColl Professor; Director, SILS/Metadata
Research Center
School of Information and Library Science
Univ. of North Carolina at Chapel Hill
janeg@ils.unc.edu
Overview
   DRIADE (Digital Repository of Information and Data for Evolution)
   Accomplishments: basic metadata challenges
     Functional requirements, underlying metadata
   Instantiation
     Motivation
     Research design
     Preliminary results
   Conclusions
   Side notes: adaptation; natural selection
   http://www.caffedriade.com/
Small science, open access
   Small science repositories (SSR)
     Knowledge Network for Biocomplexity (KNB), Marine Metadata Initiative (MMI)
   Evolutionary biology (ecology, paleontology, population genetics, physiology, systematics + genomics)
     Publication process
        Supplementary data (Molecular Biology and Evolution, American Naturalist)
          “Author,” “deposition date,” not “subject,” “species,” “geo. locator”
        Data deposition (Genbank, TreeBase, Morphbank)
   NESCent and SILS/Metadata Research Center
DRIADE’s Goals
   One-stop deposition and shopping for data objects
   Support the acquisition, preservation, resource discovery, and reuse of heterogeneous digital datasets
   Balance a need for low barriers, with higher-level … data synthesis
DRIADE Team
   NESCent
     PI: Todd Vision, Director of Informatics and Associate Professor, Biology, UNC
     Hilmar Lapp, Assistant Director of Informatics
     Ryan Scherle, Data Repository Architect
   UNC/SILS/MRC
     PI: Jane Greenberg, Associate Professor, SILS and MRC
     Sarah Carrier, Research Assistant
     Abbey Thompson, DRIADE R.A./SILS Masters Student
     Hollie White, Doctoral Fellow
     Amy Bouck, Biology, Ph.D. student
DRIADE (Dryad)
   [Diagram: DRIADE at the center, connected to the parties below]
   Depositor/s: one-stop deposition
   DRIADE: data objects supporting published research
   Specialized repositories: Genbank, TreeBase, Morphbank, PaleoDB
   Journals & journal repositories
   Researcher/s: one-stop shopping, an option
Accomplishments: basic metadata challenges
Accomplishments: addressing basic metadata challenges
   Functional requirements
     Consensus building (Dec. ‘06) + SSR workshops (May ‘07)
       First class object, **SHARING & REUSE**
     Analysis of metadata functions in existing repositories (JCDL, 2007)
   Functional model
     OAIS (Open Archival Information System), DSpace
   Level one application profile (Carrier, et al., 2007)
     Namespace schemas:
       1. Dublin Core
       2. Data Documentation Initiative (DDI)
       3. Ecological Metadata Language (EML)
       4. PREMIS
       5. Darwin Core
     Modular scheme:
       1. Journal citation
       2. Data objects
Functional requirements
Project                                         GBIF   KNB   NSDL   ICPSR   MMI
Goals/priorities
Heterogeneous digital datasets                   ▪      ▪     ▪      ▪       ▪
Long-term data stewardship                       ▪            ▪
Tools and incentives to researchers              ▪      ▪     ▪      ▪       ▪
Minimize technical expertise and time required   ▪      ▪     ▪      ▪       ▪
Intellectual property rights                     ▪      ▪            ▪
Datasets coupled w/published research
<DRIADE application profile>
Bibliographic Citation Module
1.    dcterms:bibliographicCitation/Citation information
2.    DOI
Data Object Module
1.    dc:creator/Name*
2.    dc:title/Data Set #
3.    dc:identifier/Data Set Identifier
4.    PREMIS:fixity/(hidden)
5.    dc:relation/DOI of Published Article*
6.    DDI:<depositr>/Depositor
7.    DDI:<contact>/Contact Information
8.    dc:rights/Rights Statement
9.    dc:description/Description #
10.   dc:subject/Keywords
11.   dc:coverage/Locality Required*
12.   dc:coverage/Date Range Required*
13.   dc:software/Software*
14.   dc:format/File Format
15.   dc:format/File Size
16.   dc:date/(Hidden) Required
17.   dc:date/Date Modified*
18.   Darwin Core: species/Species, or Scientific*
Key
* = semi-automatic
# = manual
Everything else is automatic
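The Data Object Module and its key can be written down as a small lookup table. The following is only a sketch in Python: the element names and entry modes are copied from the slide, while the tuple layout and the helper `fields_by_mode` are hypothetical and not part of DRIADE.

```python
from collections import Counter

# Entry modes from the slide's key: "*" = semi-automatic, "#" = manual,
# unmarked = automatic.
AUTOMATIC, SEMI_AUTOMATIC, MANUAL = "automatic", "semi-automatic", "manual"

# (element, label, entry mode) for the Data Object Module, in slide order.
# The tuple layout is an illustrative assumption, not DRIADE's storage format.
DATA_OBJECT_MODULE = [
    ("dc:creator", "Name", SEMI_AUTOMATIC),
    ("dc:title", "Data Set", MANUAL),
    ("dc:identifier", "Data Set Identifier", AUTOMATIC),
    ("PREMIS:fixity", "(hidden)", AUTOMATIC),
    ("dc:relation", "DOI of Published Article", SEMI_AUTOMATIC),
    ("DDI:<depositr>", "Depositor", AUTOMATIC),
    ("DDI:<contact>", "Contact Information", AUTOMATIC),
    ("dc:rights", "Rights Statement", AUTOMATIC),
    ("dc:description", "Description", MANUAL),
    ("dc:subject", "Keywords", AUTOMATIC),
    ("dc:coverage", "Locality", SEMI_AUTOMATIC),
    ("dc:coverage", "Date Range", SEMI_AUTOMATIC),
    ("dc:software", "Software", SEMI_AUTOMATIC),
    ("dc:format", "File Format", AUTOMATIC),
    ("dc:format", "File Size", AUTOMATIC),
    ("dc:date", "(Hidden)", AUTOMATIC),
    ("dc:date", "Date Modified", SEMI_AUTOMATIC),
    ("Darwin Core: species", "Species, or Scientific", SEMI_AUTOMATIC),
]


def fields_by_mode(mode):
    """Return 'element/label' strings for every field with the given entry mode."""
    return [f"{element}/{label}"
            for element, label, field_mode in DATA_OBJECT_MODULE
            if field_mode == mode]


# How many fields fall under each entry mode.
mode_counts = Counter(mode for _, _, mode in DATA_OBJECT_MODULE)

print(fields_by_mode(MANUAL))
print(dict(mode_counts))
```

Read this way, the profile asks the depositor to type in only two fields by hand; the rest are filled automatically or semi-automatically.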
DRIADE metadata application profile … organic …
   Level 1 – initial repository implementation
     Application profile: preservation, access/resource discovery (limited use of CVs)
   Level 2 – full repository implementation
     Level 1++: expanded usage, interoperability, preservation; administration; greater use of CVs and authority control; data sharing and reuse
   Level 3 – “next generation” implementation
     Considering Web 2.0 functionalities, Semantic Web
Instantiation
(Classifying Scientific Data Objects with Bibliographic Relationships Taxonomies)
Motivation
   DRIADE goals:
     Data preservation, sharing, use/re-use, validation, repeatability
        Is it important to know the history of a data object?
        How can we support accurate and effective tracking of the life-cycle of a data object?

   Data objects = first class objects
     Data structures as works (Coleman, 2002)
        Units of analysis, intellectual products of activity
     Work = “propositions expressed (ideational content)” and “expressions of the propositions” (Smiraglia, 2001)
     Data objects are content carriers (Greenberg, 2007)
Motivation
   Research and … to explain a “work”; bibliographic families; and instantiation
    Coleman (2002)
    Cutter (1904)
    Leazer (1999)
    Ranganathan: embodied/expressed
    Smiraglia (1999, 2001, 2002, etc.)
    Tillett (1991, 1992)
    Vellucci (1995)
    Wilson (1983)
    Yee (1995)

   Metadata and Bibliographic Control Models
       FRBR (Functional Requirements for Bibliographic Records)
       DCAM (Dublin Core Abstract Model)
       RDF (Resource Description Framework)
Research design
   Objective: Build an open access repository that accurately reflects the “disciplinary knowledge structures” (relationships among data objects)
   Research questions
    1.   Do scientists (developing scientists) view data objects as works?
    2.   Do they understand different “instantiations”?
    3.   Do they think instantiation tracking is important?
   Method
     Instantiation identification test and survey
   Participants: Scientists, research and publication
Procedures, etc.
   Verify participant suitability, contextual introduction
   Read instantiation definitions and view instantiation diagram (next slide)
   Example instantiation scenario
       Scenario: Sherry collects data on the survival and growth of the plant Borrichia frutescens (the bushy seaside tansy)…

       Question: What is the relationship between Sherry’s paper data sheet and her Excel spreadsheet?

       Answer: Equivalent | Derivative | Whole-part | Sequential (circle one)

   8 scenarios (randomized); survey
Data object relationships
   Equivalence: A (= data set in Excel); A (= same data set in SAS); A (= same data set on paper)
   Derivative: A (= data set); B (= data set A annotated); C (= data set A revised)
   Whole-part: A (= data set); A1 (= a subset of A)
   Sequential: A1 (= part 1 of a data set); A2 (= part 2 of a data set)
   (Smiraglia, Tillett)
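The four relationship types in the diagram can be modeled as typed links between data objects. The sketch below is illustrative only: the `Relationship` enum, the `Link` dataclass, the `related` helper, and the object labels are hypothetical names based on the slide's examples, not an actual DRIADE data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Relationship(Enum):
    """The four relationship types shown in the diagram."""
    EQUIVALENT = "equivalent"
    DERIVATIVE = "derivative"
    WHOLE_PART = "whole-part"
    SEQUENTIAL = "sequential"


@dataclass(frozen=True)
class Link:
    """A typed, directed link between two data objects (hypothetical model)."""
    source: str
    target: str
    relationship: Relationship


# Links mirroring the diagram's examples; the labels are illustrative.
LINKS = [
    Link("A (Excel)", "A (SAS)", Relationship.EQUIVALENT),
    Link("A (Excel)", "A (paper)", Relationship.EQUIVALENT),
    Link("A", "B (A annotated)", Relationship.DERIVATIVE),
    Link("A", "C (A revised)", Relationship.DERIVATIVE),
    Link("A", "A1 (subset of A)", Relationship.WHOLE_PART),
    Link("A1 (part 1)", "A2 (part 2)", Relationship.SEQUENTIAL),
]


def related(source: str, relationship: Relationship) -> List[str]:
    """Return the targets linked from `source` by the given relationship type."""
    return [link.target for link in LINKS
            if link.source == source and link.relationship is relationship]


print(related("A", Relationship.DERIVATIVE))
```

Keeping the relationship type on the link, not on the object, lets one data set take part in several relationships at once, as data set A does in the diagram.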
Results
   7 participants (Biochemistry, Biology, Botany/Plant Ecology)
       3 Faculty/research associates; 21+ yrs. research/pub.
       4 Research assistants; 3 (1-3 yrs.), 1 (5 yrs.)

        Instantiation/relationship   A              B              Failing A   Failing B
        Equivalent                   100%           100%           n/a         n/a
        Derivative                   6 of 7 (86%)   6 of 7 (86%)   1 BS        1 MS
        Whole/part                   100%           6 of 7 (86%)   n/a         1 PhD
        Sequential                   6 of 7 (86%)   3 of 7 (43%)   1 BS        2 BS, 1 MS
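The percentages in the table follow directly from the counts out of 7 participants. A minimal check in Python, where `percent` is a hypothetical helper name:

```python
def percent(correct: int, total: int) -> int:
    """Share of participants answering correctly, rounded to a whole percent."""
    return round(100 * correct / total)


TOTAL_PARTICIPANTS = 7

# Counts of correct answers as reported in the table.
for correct in (7, 6, 3):
    print(f"{correct} of {TOTAL_PARTICIPANTS} = {percent(correct, TOTAL_PARTICIPANTS)}%")
```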
Results
     Instantiation/relationship   Created X data before   Apply to work “freq.”   Other app.        Support last publ. (y/n)
     Equivalent                   7                       6 (86%)                 1 (h/t)           6/1
     Derivative                   7                       5 (71%)                 2 (h/t)           6/1
     Whole/part                   7                       5 (71%)                 2 (s)             4/2 (1 na)
     Sequential                   5                       3 (43%)                 1 (h/t), 1 (s)    2/3 (2 na)

    h/t = half the time; s = seldom; na = no answer

Use of data by others?
   1-5: 1 person
   6-20: 3 people
   21-99: 1 person
   100+: 2 people (PhD)
How important is it to track data life-cycle (relationships)?
   Not: 0
   Very: 3
   Extremely: 4 (2 PhD, 1 MS, 1 BS)
Results (participant comments)
   Validity: “Whenever anything changes, there’s the ability to make mistakes”
   Sheer quantity of data: “We have 30 years worth of data, tracking changes is important”
   The impact of changes in scientific nomenclature:
     “As time goes on, datasets must be maintained to reflect current understanding of taxonomy and nomenclature, to allow connection of old data and attributes of that data to be associated correctly with new data and attributes of that data. This is a giant problem.”
Conclusions and next steps
   Conclusions
       Use of data created by others increases over time
       In general, more seasoned scientists have a better grasp of instantiations
       Sequential data presented the most difficulty (less seasoned scientists)
       Unanimous support: tracking is “very” to “extremely important”
   Bibliographic control relationships are applicable to describing the relationships between data objects
   Next steps:
       Continue to collect data (30-50)
       Study additional types of instantiations
       Develop applications that effectively track data object life-cycle
            How do we encode these relationships? Manually/automatically?
            Who should create them? Curator, scientists, collaboratively?
            Should we maintain a vocabulary?
            Consider implications for other disciplines
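One possible answer to the encoding question, sketched here under stated assumptions and not a DRIADE decision, is to record each relationship as a subject-predicate-object statement in the spirit of RDF, which the deck lists among its models. The `dcterms:` names below are existing DCMI Metadata Terms, but mapping them onto the four relationship types is an assumption made for illustration. `ex:follows` and the `ex:` identifiers are hypothetical.

```python
# Assumed mapping from the four relationship types to predicates.
# The dcterms names exist in DCMI Metadata Terms; choosing them for these
# relationship types is an illustrative assumption, not from the slides.
PREDICATES = {
    "equivalent": "dcterms:isFormatOf",
    "derivative": "dcterms:isVersionOf",
    "whole-part": "dcterms:isPartOf",
    "sequential": "ex:follows",  # hypothetical term; no Dublin Core equivalent assumed
}


def encode(subject, relationship, obj):
    """Encode one relationship as a (subject, predicate, object) triple."""
    return (subject, PREDICATES[relationship], obj)


def to_lines(triples):
    """Render triples as simple one-statement-per-line text."""
    return [f"{s} {p} {o} ." for s, p, o in triples]


# Hypothetical identifiers modeled on the data sets in the relationship diagram.
triples = [
    encode("ex:dataset-A-sas", "equivalent", "ex:dataset-A-excel"),
    encode("ex:dataset-B", "derivative", "ex:dataset-A"),
    encode("ex:dataset-A1", "whole-part", "ex:dataset-A"),
    encode("ex:dataset-A2", "sequential", "ex:dataset-A1"),
]

for line in to_lines(triples):
    print(line)
```

A fixed predicate table like this is also one way to approach the vocabulary question. Whether a curator, a scientist, or software creates the statements, they all draw on the same small set of terms.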
THANK YOU!!
Jane Greenberg, Associate Professor, and Director, SILS Metadata
Research Center
janeg@ils.unc.edu

				