UBIF – Unified Biosciences Information Framework

Gregor Hagedorn
Federal Biological Research Center
Königin-Luise-Str. 19, 14195 Berlin, Germany
(Talk held Wednesday, 2004-10-13, at the TDWG 2004 meeting in Christchurch, New Zealand)

UBIF Scope

- Form a foundation for common concepts, structures and types of TDWG and GBIF XML schemata like
  - TCS – Taxon concept schema
  - SDD – Structured descriptive data
  - ABCD / DarwinCore – Specimen collection data
- In principle a wider scope is desirable: the term “biosciences” instead of “biodiversity” was chosen to signal to other communities that adoption and contributions are welcome

Why?

Advantage: Centralization of common types and concepts
- Reduces learning curve for developers
- Increases software reuse
- Simplifies data integration across standards
- Introduces peer review for fundamental concepts, hopefully reducing the number of design errors

Disadvantage:
- Increased need for communication can slow development decisions
- Versioning and correction of errors may become more painful

UBIF Status

- Collaboration between SDD and ABCD
- Currently dominated by the needs of SDD, which depends on many types of external data
- The current version is a proposal – if only a much reduced UBIF is desired, any number of types can be moved into SDD alone
- A UBIF discussion Wiki is up and running
- Putting UBIF on the agenda here is a last-ditch attempt before four TDWG standards are finalized, all of which use different taxon name, geography, literature, etc. concepts

UBIF Subtopics

1. Type library
   - Simple types derived from other types
   - Enumeration types (controlled vocabulary)
   - Basic, reusable complex types
2. Top-level structure
   - Dataset collection
   - Derivation & content metadata
   - Place for object linking (see EDI topic)
   - “Payload” element
3. External data interface (EDI) / Proxy data objects
4. Basic text formatting conventions

UBIF Subtopic 1

1. Type library
   Simple types derived from other types
   - Using the same type can define semantics across multiple standards
   - Facets/constraints (incl. regular expressions) may improve data quality and data integration
   Enumeration types
   - Provide controlled vocabulary
   - Simplify data integration and interoperability across language barriers
   Basic, reusable complex types
   - Examples: geographical coordinates, composite date/time

UBIF Subtopic 2

1. Type library
2. Top-level structure
   Dataset collection
   - Collection of multiple independent Dataset objects
   - Semantically neutral: relations between Dataset objects have to be discovered by the data consumer or may be implicit in the protocol requesting a specific assemblage of data sets
   Derivation metadata
   - Support tracing and debugging the online transformation history of the data
   - Provide technical information about access providers and the path of (potentially multiple) portals involved
   - Perhaps obsolete – protocol issue?
   …

UBIF Subtopic 2

1. Type library
2. Top-level structure
   Dataset collection
   Derivation metadata
   Content metadata
   - Metadata describing the principal data collection
   - A dataset may represent a source dataset or it may be derived (filtered, normalized, or enriched with secondary information)
   - A dataset is never an aggregation of multiple data collection sources with different authorship, copyright, or other IPR – doing so requires consent and creates new IPR
   - Derivation and project metadata together should provide all necessary information for UDDI support

UBIF Subtopic 2-3

1. Type library
2. Top-level structure
   …
   Place for object linking
   “Payload” element
   - Last element, tag name not specified
   - May come from same or different namespace
   - Within a Datasets collection each Dataset object may have a payload from a different external schema
   - It is the responsibility of the consumer to decide which dataset payload it is interested in or can process
3. External data interface (EDI) / Proxy data objects
   … see separate talk!

UBIF Subtopic 4

1. Type library
2. Top-level
3. External data interface (EDI) / Proxy data objects
4. Basic text formatting conventions
   - Minimized set of XHTML-like inline formatting
   - Examples: super/subscript, strong/emphasis
   - “m³”, “m₃” and “m3” express different semantics
   Earlier SDD version used a schema type
   - Validated by XML parser
   - Problems with mixed content and database interaction
   New proposal: encode tags (“m<sub>3</sub>”)
   - No mixed content, but not validated – only convention
   - May be ignored by processor – resulting in poorer display
   - Good processors should recover formatting and render
   - Good processors should strip when comparing/indexing

What can we do here?

- I would love to go into details – is anybody willing to talk things through and go over schema details?
- I believe it would be most beneficial if a tentative agreement could already be reached for the literature and geography interfaces (= simplified, relatively flat data structures) – can we meet separately some evening, or when else can we find time?
- If not here – where and how?
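
Illustration for Subtopic 4 (encoded inline tags)

The following is only a minimal sketch of what a "good processor" for the encoded-tag proposal might do, not part of UBIF or SDD. It assumes the string has already been read from the XML document, so that the encoded tags appear as literal text such as “m<sub>3</sub>”. The tag names (sub, sup, strong, em) and the function names strip_formatting and render_html are assumptions made for this example; the talk only names super/subscript and strong/emphasis as examples.

```python
import re
from html import escape

# Assumed tag set: the talk lists super/subscript and strong/emphasis
# as examples only, so these names are hypothetical.
ALLOWED_TAGS = ("sub", "sup", "strong", "em")

# Matches an opening or closing tag from the allowed set, e.g. <sub> or </sub>.
_TAG_RE = re.compile(r"</?(?:%s)>" % "|".join(ALLOWED_TAGS))


def strip_formatting(text: str) -> str:
    """Remove recognized inline tags so values can be compared or indexed."""
    return _TAG_RE.sub("", text)


def render_html(text: str) -> str:
    """Escape all markup, then restore only the recognized inline tags."""
    escaped = escape(text, quote=False)
    for tag in ALLOWED_TAGS:
        escaped = escaped.replace(f"&lt;{tag}&gt;", f"<{tag}>")
        escaped = escaped.replace(f"&lt;/{tag}&gt;", f"</{tag}>")
    return escaped


if __name__ == "__main__":
    print(strip_formatting("m<sub>3</sub>"))      # text used for comparing/indexing
    print(render_html("a < b, m<sup>3</sup>"))    # text used for display
```

Because the convention is not validated by the XML parser, the sketch treats anything outside the assumed tag set as ordinary text: it is left in place when stripping and escaped when rendering.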