

         UBIF – Unified Biosciences Information Framework

                                                    Gregor Hagedorn
                                    Federal Biological Research Center
                                     Königin-Luise-Str. 19, 14195 Berlin

(Talk held Wednesday, 2004-10-13, at the TDWG 2004 meeting in Christchurch, New Zealand)
UBIF Scope
 Form a foundation for common
  concepts, structures and types
  of TDWG and GBIF XML schemata like
   TCS – Taxon concept schema
   SDD – Structured descriptive data
   ABCD / DarwinCore – Specimen collection data
 In principle, a wider scope is desirable:
   The term "biosciences" instead of
    "biodiversity" was chosen to signal to
    other communities that adoption and
    contributions are welcome
 Advantages:
 Centralization of common types and concepts
     Reduces learning curve for developers
     Increases software reuse
     Simplifies data integration across standards
     Introduces peer review for fundamental concepts,
      hopefully reducing the number of design errors
 Disadvantages:
   Increased need for communication can
    slow development decisions
   Versioning and correction of errors may
    become more painful
UBIF Status
 Collaboration between SDD and ABCD
   Currently dominated by the needs of SDD
    which depends on many types of external data
   The current version is a proposal – if only a much-reduced
    UBIF is desired, any number of types can
    be moved into SDD alone
 A UBIF discussion Wiki is up and running
 Putting UBIF on the agenda here is a last-ditch
  attempt before four TDWG standards are finalized, all
  of which use different concepts for taxon names,
  geography, literature, etc.
UBIF Subtopics
1. Type library
    Simple types derived from other types
    Enumeration types (controlled vocabulary)
    Basic, reusable complex types
2. Top-level structure
      Dataset collection
      Derivation & content metadata
      Place for object linking (see EDI topic)
      "Payload" element
3. External data interface (EDI) / Proxy data objects
4. Basic text formatting conventions
UBIF Subtopic 1
1. Type library
    Simple types derived from other types
       Using the same type can define semantics
        across multiple standards
       Facets/constraints (incl. regular expressions)
        may improve data quality and data integration
    Enumeration types
       Provide controlled vocabulary
       Simplify data integration and
        interoperability across language barriers
    Basic, reusable complex types
       Examples: geographical coordinates,
        composite date/time
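
The three kinds of library types listed above can be illustrated with a minimal Python sketch. It is not taken from the UBIF schema: the names `parse_iso_date`, `TaxonRank` and `GeoCoordinates`, the date pattern, and the enumeration values are all hypothetical and only show how a pattern facet, a controlled vocabulary, and a reusable complex type might behave.

```python
import re
from dataclasses import dataclass
from enum import Enum

# Hypothetical pattern facet: a regular expression restricting a string type.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_iso_date(value: str) -> str:
    """Simple type derived from string, constrained by a pattern facet."""
    if not ISO_DATE.fullmatch(value):
        raise ValueError(f"not an ISO date: {value!r}")
    return value


class TaxonRank(Enum):
    """Hypothetical enumeration type acting as a controlled vocabulary."""
    FAMILY = "family"
    GENUS = "genus"
    SPECIES = "species"


@dataclass(frozen=True)
class GeoCoordinates:
    """Reusable complex type with range constraints on both fields."""
    latitude: float
    longitude: float

    def __post_init__(self):
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude out of range")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude out of range")
```

Because the enumeration values are fixed codes rather than free text, two data sets written in different languages can still be matched on them, which is the interoperability point made above.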
UBIF Subtopic 2
1. Type library
2. Top-level structure
    Dataset collection
       Collection of multiple independent Dataset objects
       Semantically neutral: relations between Dataset objects
        have to be discovered by the data consumer or may be
        implicit in the protocol requesting a specific assemblage of
        data sets.
    Derivation metadata
       Support tracing and debugging the online transformation
        history of the data
       Provide technical information about access providers and
        the path of (potentially multiple) portals involved
       Perhaps obsolete – protocol issue?
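
As a rough illustration of the structure described above, the following Python sketch builds a collection of independent datasets, each carrying its own derivation metadata. All element names (`Datasets`, `Dataset`, `DerivationMetadata`, `AccessProvider`, `Portal`) and the host names are assumptions made for this example, not names confirmed by the UBIF schema.

```python
import xml.etree.ElementTree as ET


def build_collection(datasets):
    """Wrap independent datasets in one semantically neutral container."""
    root = ET.Element("Datasets")  # hypothetical element name
    for provider, portals in datasets:
        dataset = ET.SubElement(root, "Dataset")
        derivation = ET.SubElement(dataset, "DerivationMetadata")
        ET.SubElement(derivation, "AccessProvider").text = provider
        # One Portal entry per hop, in the order the data passed through.
        for name in portals:
            ET.SubElement(derivation, "Portal").text = name
    return root


root = build_collection([
    ("provider-a.example", ["portal-1.example"]),
    ("provider-b.example", ["portal-1.example", "portal-2.example"]),
])
xml_text = ET.tostring(root, encoding="unicode")
```

The container itself says nothing about how the two datasets relate; a consumer would have to infer that from the request it made or from the data.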
UBIF Subtopic 2
1. Type library
2. Top-level structure
    Dataset collection
    Derivation metadata
    Content metadata
       Metadata describing the principal data collection
       A dataset may represent a source dataset or it may be
        derived (filtered, normalized, or enriched with secondary
       A dataset is never an aggregation of multiple data
        collection sources with different authorship, copyright, or
        other IPR – doing so requires consent and creates new
       Derivation and project metadata together should provide
        all necessary information for UDDI support
UBIF Subtopic 2-3
1. Type library
2. Top-level structure
     Place for object linking
     "Payload" element
        Last element, tag name not specified
        May come from same or different namespace
        Within a Datasets collection each Dataset object may have
         a payload from a different external schema
        It is the responsibility of the consumer to decide which
         dataset payload it is interested in or can process
3. External data interface (EDI) / Proxy data objects
     … see separate talk!
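
The payload rules above (last element, arbitrary tag name, possibly a foreign namespace, consumer decides) could be handled by a consumer roughly as follows. This is a sketch under assumptions: the namespace URIs `urn:example:ubif`, `urn:example:sdd` and `urn:example:abcd` and all element names are placeholders, not the real schema identifiers.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two datasets with payloads from different schemas.
DOC = """
<Datasets xmlns="urn:example:ubif">
  <Dataset>
    <ContentMetadata/>
    <sdd:Descriptions xmlns:sdd="urn:example:sdd"/>
  </Dataset>
  <Dataset>
    <ContentMetadata/>
    <abcd:Units xmlns:abcd="urn:example:abcd"/>
  </Dataset>
</Datasets>
"""


def namespace_of(element):
    """Return the namespace URI from an ElementTree '{uri}local' tag."""
    tag = element.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def payloads(root, wanted_namespace):
    """Yield the last child of each Dataset if it is in the wanted namespace."""
    for dataset in root.findall("{urn:example:ubif}Dataset"):
        children = list(dataset)
        if children and namespace_of(children[-1]) == wanted_namespace:
            yield children[-1]


root = ET.fromstring(DOC)
sdd_payloads = list(payloads(root, "urn:example:sdd"))
```

A consumer that only understands one external schema simply skips every dataset whose payload lives in another namespace.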
UBIF Subtopic 4
1. Type library
2. Top-level structure
3. External data interface (EDI) / Proxy data objects
4. Basic text formatting conventions
      Minimized set of XHTML-like inline formatting
      Examples: super/subscript, strong/emphasis
      "m3", "m³" and "m₃" express different semantics
      Earlier SDD version used a schema type
         Validated by XML parser
         Problems with mixed content and database interaction
    New proposal: encode tags (“m<sub>3</sub>”)
           No mixed content, but not validated – only convention
           May be ignored by processor – resulting in poorer display
           Good processors should recover formatting and render
           Good processors should strip when comparing/indexing
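
The proposed convention can be sketched in Python as follows. The element name `Label` and the exact tag set (`sub`, `sup`, `strong`, `em`) are assumptions for illustration; the source only names super/subscript and strong/emphasis as examples. The tags are escaped in the XML, so the parser sees plain text rather than mixed content, and a processor either renders the recovered tags or strips them.

```python
import re
import xml.etree.ElementTree as ET

# Assumed set of inline tags allowed by the convention.
INLINE_TAG = re.compile(r"</?(?:sub|sup|strong|em)>")


def strip_formatting(text):
    """Remove the inline tags so values can be compared or indexed."""
    return INLINE_TAG.sub("", text)


# The tags are escaped, so the element has text only and no child elements.
label = ET.fromstring("<Label>m&lt;sub&gt;3&lt;/sub&gt;</Label>")
text = label.text                # string with literal tags, ready to render
plain = strip_formatting(text)   # tag-free form for comparing and indexing
```

Since nothing validates the escaped tags, a processor that ignores the convention would display them literally, which is the poorer display mentioned above.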
What can we do here?
 I would love to go into details – is anybody willing
  to discuss things through, going through the schema?
 I believe it would be most beneficial if
  a tentative agreement could already be reached for
  literature and geography interfaces
  (= simplified, relatively flat data structures) – meet
  separately some evening, or whenever we can find time?
 If not here – where and how?
