Application of
International GeoSample Number (IGSN)
to Sample Collections

Sri Vinay
Geoinformatics for Geochemistry (GfG)
Lamont Campus of Columbia University
Presentation Outline
   Unique identifiers and their
    application to sample and data
   System for Earth SAmple
    Registration (SESAR) and
    International GeoSample Number
   Current Status and Activities of
   IGSN Implementation Strategies
   Discussion
Unique Identifiers
 An identifier is an unambiguous label which
 specifies an entity.

 Unique identifiers are widely used to designate
 physical objects, assisting in trading (e.g., the
 Universal Product Code bar code system), and the
 extension of similar principles to digital and
 abstract entities is a prerequisite for digital
 commerce of rights and intellectual content.

 Although the design of unique identification
 schemes is a technical problem, it is also a
 business issue with implications for what is
 identified and how identified items are made
  “In a dynamic and distributed information
  environment, the effective management of both
  metadata records and the resources they describe
  requires a systematic way of generating and
  assigning unique identifiers.”
  (N. Friesen 2002: Recommendations for Globally Unique,
  Location-Independent, Persistent Identifiers)


Life Sciences -
 “The World-Wide Web provides a globally distributed
 communication framework that is essential for almost all
 scientific collaboration, including bioinformatics.
 However, several limits and inadequacies have become
 apparent, one of which is the inability to
 programmatically identify locally named objects that
 may be widely distributed over the network. This
 shortcoming limits our ability to integrate multiple
 knowledgebases, each of which gives partial information
 of a shared domain, as is commonly seen in
 (Clark, T., Martin, S., Liefeld, T., 2004: Globally distributed object identification for
 biological knowledgebases. Briefings in Bioinformatics, Vol. 5 (1), 59-70.)

 LSID = Life Science Identifier
Geosciences -

    Kai Lin (SDSC): “Ontology Based Resource Registration and
         Integration in GEON”, Lecture July 2005
Sample Naming in the

 Examples from the PetDB Database:

 Sample  Location          Publication      Cruise
 D3-1    SEIR              ANDERSON, 1980   VM3301 (Vema)
 D3-1    North Fiji Basin  EISSEN 1994      Starmer 1 (Nadir)
 D3-1    Shimada Smt       GRAHAM 1988      S1-79 (Sea Sounder)
 D3-1    Gorda Ridge       CLAGUE 1984      KK2-83NP (Kana Keoki)
 3-1     Lamont Smts       BATIZA 1982      RISE III (New Horizon)
Sample names are duplicated.

Sample names are modified or changed.

 Dredge sample 3, Amphitrite Cruise 1963/4:
 D3              Engel 1964
 D-3             Scheidegger 1981, Schilling 1971
 PD3             Tatsumoto 1965, 1966
 PD-3            Hedge 1970, Muehlenbach 1972
 PV D-3          Engel 1965
 AMPH3D          Pineau 1976
 AMPH-D3         MacDougall 1986
 AMPH D-3        Sun 1980, Schilling 1975
 AMPH 3-PD-3     Hart 1971
 S-10            Subbarao 1972

DSDP Leg 46, Hole 396B, Section 22, Sample 3, 28-33cm
 46396B 22 3,28-38              Dungan 1978
 396B 22 3,28-38                Muehlenbach 1979
 249                            Dungan 1978
 DSDP046-0396B-022-003/28-38    PetDB
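
The name variants above suggest why simple string clean-up does not recover a single identity for a sample. The following is a minimal sketch, not part of SESAR: it normalizes the Amphitrite dredge names from the slide and then shows the alternative of mapping every published alias to one identifier. The identifier `XXX000001` and the helper names are hypothetical.

```python
import re

# Names used in the literature for dredge sample 3 of the
# Amphitrite cruise, as listed on the slide.
NAMES = ["D3", "D-3", "PD3", "PD-3", "PV D-3", "AMPH3D",
         "AMPH-D3", "AMPH D-3", "AMPH 3-PD-3", "S-10"]


def normalize(name):
    """Uppercase the name and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())


# Even after normalization, several distinct strings remain
# for what is one physical sample.
normalized = sorted({normalize(n) for n in NAMES})

# Hypothetical identifier for illustration; not a real registered IGSN.
SAMPLE_ID = "XXX000001"
alias_to_id = {name: SAMPLE_ID for name in NAMES}


def resolve(name):
    """Look up the single identifier behind a published sample name."""
    return alias_to_id.get(name)
```

Normalization merges only the trivially different spellings (for example `D3` and `D-3`), whereas an explicit alias table keyed to one identifier resolves all of them.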
Geosciences -
 Integration of data in a distributed
    system requires unique identification of
   Currently, naming of samples is
     Different samples have identical names.
     Samples are renamed.
     Metadata that allow unique identification are often
      missing for terrestrial samples.
     Institutions have their own naming protocols, no
      assurance that names are unique on a global scale.
   Access to information about the samples
       Need to ensure proper evaluation and facilitate
        interpretation of sample-based data.
   Links to physical specimens
     to make observations & measurements and the science
      derived from them reproducible.
     to allow discovery & re-use of samples for improved use of
Urgency to Act

   Growing number of data systems with
    sample-based data
   Growing demand for ‘fine-grained’
    access to data at the level of
    individual samples
   New technologies for linking and
    integrating data (interoperability)
   Increasing need to share samples
Generating Unique IDs:
   “Registration-based schemes”
       Require a central clearinghouse
          Register personal or institutional names

          Register prefix or namespace (e.g. URN)

          Register metadata that allow the central clearinghouse
           to generate identifiers
   Schemes without registration
     use a computational process (naming protocol) to produce
      an ID based on metadata
     No central authority
No-Registration Scheme

   Risk of incorrect application of
    naming protocol
   Risk of name duplication
   Identifier might grow to impracticable
    length to ensure uniqueness
   Metadata missing for legacy samples
   Easy implementation
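
A scheme without registration can be illustrated with a few lines of code. This is only a sketch under assumed conventions: the field names (`cruise`, `station`, `sample_name`) and the joining rule are hypothetical and do not represent any published naming protocol. It shows how the ID grows with the number of metadata fields and how legacy samples with missing metadata collide.

```python
def protocol_id(metadata, fields=("cruise", "station", "sample_name")):
    """Build an ID by joining selected metadata fields (hypothetical protocol).

    Missing fields become empty strings, which is exactly where
    duplicate IDs can arise for poorly documented samples.
    """
    parts = [str(metadata.get(field, "")).strip().upper() for field in fields]
    return "-".join(parts)


# Two different samples that share the name D3-1 but come from different cruises.
sample_a = {"cruise": "VM3301", "sample_name": "D3-1"}
sample_b = {"cruise": "KK2-83NP", "sample_name": "D3-1"}

# Two different legacy samples for which only the name survived.
legacy_a = {"sample_name": "D3-1"}
legacy_b = {"sample_name": "D3-1"}

ids_with_metadata = (protocol_id(sample_a), protocol_id(sample_b))
ids_without_metadata = (protocol_id(legacy_a), protocol_id(legacy_b))
```

With cruise metadata the two IDs differ; without it, both legacy samples receive the same ID, and no central authority exists to detect the clash.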
SESAR - A Centralized Approach
   Response to urgent need for unique ID
   Easier to prevent duplicate registrations
   Easier to ensure links between parent and
    child samples
   Provide a central access point for
    Peer2Peer registration
   Facilitate international collaboration
   Build a Global Sample Catalog
SESAR – A Centralized Approach

   Proposed to NSF in July 2004
   SGER (EAR) award received for September
    2004 - August 2005
   First presented to community at Marine
    Curators’ Meeting at LDEO, September 2004
   Supplement received in Sept 2005 until May
   Workshop at SDSC January 2005
   Proposal to NSF August 2005
   Three year grant awarded in April 2006 (NSF-
International GeoSample Number:
A Global Unique Identifier for Earth

   IGSN structure: unique user code + string of random characters

   Managed at central clearinghouse (SESAR)
   Strict Syntax (9 characters: letters [A-Z]
    & numbers [0-9])
     Fits sample labels
     Fits data tables in publications
     Allows 2,176,782,336 sample identifiers per registrant

   Generated by SESAR or by users
   Does not replace personal or institutional
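
The syntax rule and the per-registrant capacity can be checked with a short sketch. The figure 2,176,782,336 equals 36^6, which is consistent with six free characters following a three-character user code; that 3 + 6 split is an inference from the slide's numbers, and the function names are hypothetical.

```python
import re

# Nine characters drawn from A-Z and 0-9, as stated on the slide.
IGSN_PATTERN = re.compile(r"[A-Z0-9]{9}")

# Assumption: a 3-character user code followed by 6 free characters.
USER_CODE_LENGTH = 3
SUFFIX_LENGTH = 9 - USER_CODE_LENGTH


def is_valid_igsn(text):
    """Return True if text satisfies the 9-character syntax."""
    return IGSN_PATTERN.fullmatch(text) is not None


def split_igsn(text):
    """Split a valid IGSN into (user code, remaining characters)."""
    if not is_valid_igsn(text):
        raise ValueError(f"not a valid IGSN: {text!r}")
    return text[:USER_CODE_LENGTH], text[USER_CODE_LENGTH:]


# 36 possible characters in each of the 6 free positions.
identifiers_per_registrant = 36 ** SUFFIX_LENGTH
```

For example, `split_igsn("DR0ABC123")` separates the name space `DR0` from the remaining six characters, and `identifiers_per_registrant` reproduces the number quoted above.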
Benefits of the IGSN & SESAR

   Ability to unambiguously identify samples
     allows data for a single sample to be linked & integrated
         advances interoperability among digital data management
          systems & the development of Geoinformatics.
         helps build more comprehensive data sets for samples.

         fosters new cross-disciplinary approaches in science.

     aids preservation and curation, orphaned samples can be
     ensures proper linking of data from samples and sub-samples.
     facilitates sharing of samples.
SESAR: Status
   Basic version of system functional since
    Fall 2004
   Nearly 3.6 Million GeoObjects registered
     All DSDP/ODP GeoObjects (holes, cores, core sections,
      core samples)
     Dredge and core collections from Scripps, WHOI, Lamont,
      Antarctic Research Facility (ARF)
     >40,000 mineral specimens from Harvard Museum
     Rocks & minerals from the US Polar Rock Repository

   IGSN implemented in Geoscience data
    systems (e.g. EarthChem, MetPetDB,
    PaleoStrat, CoreWall)
   Revised & extended version to be released
    in phases by end of 2007
SESAR: Sample Registration

   Obtain account via website
       Set up login/password
       Get a unique user code

   Submit sample information
       Via Batch Registration Forms (.xls workbooks)
       Via web site (currently off-line for upgrade)
       Via web services (under development)
Registration via Spreadsheet
Available Batch Registration Forms:
1. Coring GeoObjects
2. Dredges/trawl/grabs
3. Individual samples
4. Sections, Suites, & Sequences
Registration via Web Site:
Currently off-line for upgrade
Registration via Web Services:
Under Development
   Registration of objects via collaborating
    data systems
     Automatically register samples when sample metadata are
      entered into collaborating data systems (e.g. IODP, MGDS)
     Eliminates redundant metadata submission

   Systems communicate via web services
       Starting with REST based services. Could support SOAP in
   Authentication
       Investigating different technologies including GEON/GAMA
   Metadata exchange and validation
       XML schema
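
Metadata exchange between a collaborating data system and the registry could look roughly like the sketch below. The SESAR XML schema is not described in these slides, so the element names (`sample`, `igsn`, `name`, `parent_igsn`, `latitude`, `longitude`) and the required-field rule are assumptions made purely for illustration; no network call is shown.

```python
import xml.etree.ElementTree as ET

# Assumption: these two elements are mandatory in the hypothetical schema.
REQUIRED_ELEMENTS = ("igsn", "name")


def build_sample_xml(igsn, name, parent_igsn=None, latitude=None, longitude=None):
    """Serialize sample metadata to an XML string (hypothetical element names)."""
    sample = ET.Element("sample")
    ET.SubElement(sample, "igsn").text = igsn
    ET.SubElement(sample, "name").text = name
    if parent_igsn is not None:
        ET.SubElement(sample, "parent_igsn").text = parent_igsn
    if latitude is not None and longitude is not None:
        ET.SubElement(sample, "latitude").text = str(latitude)
        ET.SubElement(sample, "longitude").text = str(longitude)
    return ET.tostring(sample, encoding="unicode")


def validate_sample_xml(xml_text):
    """Return a list of problems; an empty list means the record passed."""
    root = ET.fromstring(xml_text)
    errors = []
    for tag in REQUIRED_ELEMENTS:
        element = root.find(tag)
        if element is None or not (element.text or "").strip():
            errors.append(f"missing {tag}")
    return errors


payload = build_sample_xml("XXX000001", "D3-1", latitude=-49.5, longitude=109.2)
```

A receiving service would run a check such as `validate_sample_xml(payload)` before accepting the record; a real deployment would validate against the published schema instead of this hand-written rule.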
 SESAR Service
Assist investigators to manage their

   Current Services:
       Long-term preservation of
        information about samples
       Lists of personal sample collections
       Store images, field notes, etc.
SESAR Service
   Services “Under Construction”
       Search & sort personal sample collections
       Create maps of sample locations
       Establish links to data (publications, data systems)
       Download tabular sample information to spreadsheets

   Antarctic Research Facility, FSU: ca. 7,000 cores
SESAR Service
  Extended Services for Sample
    Potential Services:
      Modules to manage administrative metadata
      Modules for creating & operating web interfaces to

    Advantages
      No IT infrastructure required (except a computer and an
       internet connection)
      No maintenance and risk & contingency management
      Access from anywhere by authorized individuals.
      Platform independent
The SESAR Global Sample

   SESAR integrates the World’s sample
   Allows users to find/discover
    existing samples
   Provides access to “sample profiles”
       View sample information in SESAR as provided
       Link to the specimen’s ‘home’ (archive)
       Link to data (publications, databases)
The Challenges
   Diversity of collections
      Repositories
      Museums
      Individual Investigators
      Structured science & field programs
             Metadata requirements
             Sample types & relations
             Vocabularies
   Global Scope
      Data Generated by International Collaborations
             IODP
             ICDP
             InterMARGINS, InterRidge
      Data are shared globally
             Scientific literature

   Multiple systems and catalogs
      Data Management Systems for Science Programs
             Ridge2000 - MGDS
             MARGINS - MGDS
             IODP
      Domain Specific Catalogs
             NGDC – IMLGS
      National Catalogs
             Canadian National Sample Management System
      SESAR
   Issues
      Redundancies
      Unacceptable demands on investigators
      Inconsistencies
      Fragmentation
      Competition rather than collaboration
   Adoption
      Sample curation
      Data publication
IGSN Implementation
   Work with investigators, curators and
    repositories to define & integrate registration
    process and IGSN into existing sample and data
    management workflows
          Joint Workshop of SESAR & NGDC, February 26 & 27,
           2007, Boulder, CO
          Registration of repository and museum collections ongoing
   Advance adoption of IGSN
          Work with editors to make IGSN a requirement for data
           publication (e.g. Editors’ Round Table, Societies)
          Work with funding agencies, large science programs (e.g.
           IODP, MARGINS, ANDRILL), CI projects (e.g. GEON,
           CHRONOS), and repositories on sample and data archiving
   Work with CI Partners on system design &
          Interoperability Workshop, January 2005 at SDSC
          Working with GEON on authentication scheme
          Working with IODP and KU/EarthChem on web services
Editors’ Breakout*
 - Reporting Data:
     - Published paper is point of record. All data should be
       reported. No “representative data”, no “data can be
       obtained from author”, no data available at personal
     - Submission to databases should be strongly

 - Unique sample identifier (IGSN)
     - This may solve the problem of poor sample metadata
     - This system is being implemented.
     - Essential component of successful database -
       contains sample metadata, allows samples to be
       followed through their analytical history.
     - Tracks samples and subsamples.

     - We should start using it now.
*at the GERM Meeting, May 2006, recommendations of Editors’
Breakout presented by Steve Goldstein
Support by Funding Agencies

 “We have also funded an effort (SESAR)
 to uniquely identify all samples so that
 various analyses on the same samples
 can be cross referenced and listed. I
 would also like you to indicate in your
 dissemination plan that your suite of
 samples will be registered with SESAR.”

 Letter of NSF Program Manager (OCE/MG&G) to a PI, processing
 paperwork for a grant (January 2007)
“Government, educational, and private sector organizations,
 individually as well as collectively, are encouraged to
 aggressively address the following Geoscience
 data-preservation challenges”
   identifying, organizing, documenting, and cataloging existing
    data collections, preferably in a digital format;
   constructing logical linkages and search engines that facilitate
    access to organizations and their geoscience sample and data
   dedicating adequate space — physical and digital — for storing
    and efficient accessing of existing and future samples and data
    sets;”
 Kerstin Lehnert: The Digital Specimen
Joint Workshop of SESAR & NGDC
Boulder, CO,
February 26 & 27, 2007

   Define procedures & best-practices for
     Creating & assigning IGSNs
     Submitting metadata for GeoObjects to SESAR

   Work towards an integrated system of
    sample catalogs
     Recommend ways to define & implement standards for metadata and
     Identify possibilities for streamlining procedures for submission of
      sample metadata to catalogs
Workshop Recommendations
   Streamlined Registration Process
     Registration process should be simple
     Options to integrate easily into existing sample and data
      management workflows
     Ability to adopt required metadata from existing forms in use
      to avoid redundant metadata submission to multiple systems
     Support automated registration from other systems via web
      services to avoid manual/redundant metadata submission

   Best Practices
     Objects should receive an IGSN at the time of labeling
     Objects should have an IGSN before being distributed among
      multiple investigators and users
     Parent objects should be registered before child objects
     Metadata should include geospatial info (coordinates preferred)
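
The recommendation that parent objects be registered before child objects amounts to ordering a batch by its parent-child links. The sketch below shows one way a submitting system might do this; the record layout (`name`, `parent`) and the object names are hypothetical and not taken from the SESAR forms.

```python
def registration_order(records):
    """Return object names ordered so every parent precedes its children.

    Each record is a dict with a 'name' and an optional 'parent'.
    Parents that are not part of the batch are assumed to be
    registered already and are skipped.
    """
    parent_of = {record["name"]: record.get("parent") for record in records}
    ordered = []
    done = set()

    def visit(name, trail=()):
        if name in done or name not in parent_of:
            return
        if name in trail:
            raise ValueError(f"circular parent-child relation at {name!r}")
        parent = parent_of[name]
        if parent:
            visit(parent, trail + (name,))
        done.add(name)
        ordered.append(name)

    for name in parent_of:
        visit(name)
    return ordered


# A batch listed child-first, as it might arrive from a spreadsheet.
batch = [
    {"name": "SAMPLE", "parent": "SECTION"},
    {"name": "SECTION", "parent": "CORE"},
    {"name": "CORE", "parent": "HOLE"},
    {"name": "HOLE", "parent": None},
]
order = registration_order(batch)
```

Here `order` lists the hole first and the sample last, so each child can reference a parent that already exists when it is submitted.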
Workshop Recommendations
   Batch Registration Forms
       It is preferred that forms for the MGDS, IMLGS, and SESAR have the
        same column headers, with the metadata listed under each header
        clearly defined. The order of the headers can vary.
       An XML schema for sample metadata should be developed to which the
        metadata in any spreadsheet can be exported.
       SESAR Batch Registration Forms should be customizable, e.g. buttons
        beneath the header should allow users to hide unnecessary columns. Columns
        for metadata that are identified as ‘recommended’ should always be
       SESAR should develop a manual for filling out the forms. The manual
        should include instructions regarding definition of parent – child relations.
        It needs to be decided if a site should get an IGSN. It is possible to link
        multiple stations taken at one site by including the site name as metadata.

   Vocabularies and Classification Schemes
       Adopt from existing standards as much as possible and work with
        repositories and other systems to use common schemes
       It is preferable for different systems (MGDS, IMLGS, SESAR) to allow
        multiple vocabularies
       List allowed vocabularies on the Marine Metadata Initiative (MMI) web
Registration Procedures to Support
Integration with Existing Workflows:
Under Implementation
    Trusted Agents
     A registrant can apply to become a Trusted Agent. Trusted Agents are
      authorized to generate unique IGSNs within their registered name
      space (user code). They can use tools, e.g. Excel, on the ship or in the
      field, to generate IGSNs within their given name space, have the
      samples labeled with IGSN, and submit the IGSN along with metadata
      via web services within a short time frame. Trusted Agents must sign
      an MOU outlining policy and procedures for handling IGSNs.
     Example IODP: Name Space “DR0”, “DR1”,…
    Trusted Agent Operation (diagram: Data System and SESAR)
     1. Generate Label with
     2. Ingest IGSN &
     3. Submit Metadata & IGSN to SESAR (Web Services)
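
A Trusted Agent needs a way to produce identifiers inside its own name space without contacting the clearinghouse. The sketch below is one possible approach, assuming a three-character user code and a six-character base-36 counter; the slides do not specify how agents generate the remaining characters, and the function names are hypothetical.

```python
import string

ALPHABET = string.digits + string.ascii_uppercase  # 0-9 then A-Z, 36 symbols
SUFFIX_LENGTH = 6  # assumption: 9 characters minus a 3-character user code


def to_base36(number, width):
    """Encode a non-negative integer as a fixed-width base-36 string."""
    if number < 0 or number >= 36 ** width:
        raise ValueError("counter outside the available name space")
    chars = []
    for _ in range(width):
        number, remainder = divmod(number, 36)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def make_igsn(user_code, counter):
    """Combine a registered user code with an encoded counter."""
    if len(user_code) != 3 or any(c not in ALPHABET for c in user_code):
        raise ValueError("user code must be 3 characters from A-Z and 0-9")
    return user_code + to_base36(counter, SUFFIX_LENGTH)


def next_batch(user_code, start, count):
    """Generate `count` consecutive identifiers beginning at `start`."""
    return [make_igsn(user_code, i) for i in range(start, start + count)]


labels = next_batch("DR1", 0, 3)
```

Because the counter never leaves the agent's own user code, identifiers created at sea or in the field cannot clash with those of other registrants; the agent only has to keep track of the last counter value used.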
Registration Procedures to Support
Integration with Existing Workflows:
Under Implementation
    Pre-Assigned IGSNs
     Upon request, SESAR provides forms (spreadsheets) with pre-
      assigned IGSNs to chief scientists/investigators/repositories to take
      on ship/field. Forms filled with metadata should be submitted to
      SESAR post-collection. E.g.: SCRIPPS.
     Other systems or repositories pre-populate their existing forms with
      IGSNs, obtained from SESAR, and provide to chief scientists. E.g.:
      MGDS provide forms with IGSNs to PIs in advance of R2K and
      MARGINS cruises. Post-cruise, MGDS will submit the sample
      metadata to SESAR.
    Pre-Assigned IGSN workflows (diagram)
     Directly with SESAR:
      1. Get forms with IGSN
      2. Enter metadata with
      3. Submit forms with metadata and
     Via a Data System:
      1. Get IGSN (Data System, from SESAR)
      2. Forms with IGSN
      3. Enter metadata with
      4. Forms with metadata and IGSN
      5. Submit Metadata & IGSN to SESAR (Web Services)
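
The pre-assigned workflow can be pictured as a form whose identifier column is filled in before the cruise and whose metadata columns are completed afterwards. The sketch below uses CSV instead of the .xls workbooks mentioned earlier, and the column headers and identifiers are hypothetical placeholders, not the actual SESAR form layout.

```python
import csv
import io

# Hypothetical column headers for illustration only.
COLUMNS = ["IGSN", "Sample Name", "Sample Type", "Latitude", "Longitude", "Parent IGSN"]


def blank_form(igsns):
    """Return CSV text with one row per pre-assigned IGSN and empty metadata."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    for igsn in igsns:
        writer.writerow([igsn] + [""] * (len(COLUMNS) - 1))
    return buffer.getvalue()


def completed_rows(form_text):
    """Return only the rows that were filled in, ready for submission."""
    reader = csv.DictReader(io.StringIO(form_text))
    return [row for row in reader if row["Sample Name"].strip()]


# Placeholder identifiers; real ones would be obtained from SESAR.
form = blank_form(["XXX000001", "XXX000002"])
```

After the cruise, rows with entered metadata are picked out by `completed_rows` and submitted; unused pre-assigned identifiers simply stay blank.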
Collaboration with Repositories &
   IODP
     Registered DSDP/ODP holes, cores, core sections, core samples
     “Trusted Agent” arrangement in progress

   MGDS
     Registered existing dredges, cores, and core samples
     Incorporating IGSN into existing MGDS forms

   LDEO (Lamont)
       Registered existing dredge and core collections
   WHOI
     Registering existing dredge and core collections
     Future arrangements like “Trusted Agent” to be discussed

     Used SESAR forms with pre-assigned IGSNs on cruise for
      dredge collections
     Metadata need to be updated
Collaboration with Repositories &
   Antarctic Research Facility (ARF)
       Registering existing dredge and core collections
   US Polar Rock Repository
     Registered existing rocks and minerals
     Need pre-assigned IGSNs and web service registration

   Harvard Museum
     Registered existing mineral specimens
     Project for adding simple sample curation module in progress

   OSU
     Start with IGSN for historic samples
     Then become trusted agent and issue IGSNs to new samples
      including those given to PIs
   NGDC
     May register some orphaned historical samples
     Work with curators/repository and SESAR to streamline and
      standardize metadata fields and entry forms
Collaboration with Repositories &
   Canadian National Marine Geoscience
     Likely to register existing collections
     May become “Trusted Agent” in future

   Limnological Research Center
     Likely to register via batch registration forms
     May use pre-assigned IGSNs or become “Trusted Agent” in future
   USGS
     Discussions are on-going with USGS to make them aware of
      SESAR effort
     Plan to contact state geological surveys

   Other Repositories
