					Data, Metadata, and Ontology in Ecology

                        Matthew B. Jones

     National Center for Ecological Analysis and Synthesis (NCEAS)
                 University of California Santa Barbara

                     and many major collaborators:
Mark Schildhauer, Josh Madin, Jing Tao, Chad Berkley, Dan Higgins, Peter
McCartney, Chris Jones, Shawn Bowers, Bertram Ludaescher, and others

                             April 24, 2007
                Scaling-up Synthesis

• More than 400 projects at NCEAS
  – have produced over 1000 publications that
    synthesize and re-use existing data
  – massive investment in compiling, integrating,
    and analyzing data


• Building a custom database for each project is
  not logistically feasible


• Instead, need loosely-coupled systems that
  accommodate heterogeneity
               Dilemma: no unified model

• No single database suffices

  – Data warehouses use federated schemas
     • any data that does not fit is not captured
     • original data transformed to fit federation
         – this is a form of data integration for one purpose


  – Numerous data warehouses exist
     • not extensible for all data
     • VegBank, ClimDB, GenBank, PDB, etc.
                  Data Collections

• Metadata-based data collections


  – Loosely-coupled metadata and data collections
  – No constraints on data schemas
  – Data discovery based on metadata

  – Dynamic data loading and query based on
    metadata descriptions
                               What is EML?

A…
• modular
• extensible
• comprehensive


• Ecological Metadata Language                                    <EML>

  – Identity and Discovery Information
  – Coverage: Space, Time, Taxa
  – Methods
  – Physical Data Format
  – Logical Data Model
  – Access and Distribution
                                EML: Selected relationships

[Figure: timeline, 1992 to 2009, relating EML releases (1.0.0, 1.3.0, 1.4.x,
2.0.0, 2.0.1) and OBOE to related work: the ESA FLED Report, CSDGM 1.0, the
Michener '97 paper, the NBII BDP, XML 1.0, Dublin Core, and ISO 19115]
                          A simple EML example

eml
  packageId: sbclter.316.18
  system: knb
  dataset
    title: Kelp Forest Community Dynamics: Benthic Fish
    creator
      individualName
        surName: Reed
    contact
      individualName
        surName: Evans
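
The outline above can be written as XML and read back programmatically. The following is a minimal sketch, assuming the EML 2.0.1 namespace and a hand-written fragment that mirrors the slide; it is not claimed to be a complete or schema-valid EML document, and `summarize` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment mirroring the slide. The namespace URI assumes
# EML 2.0.1; the fragment is illustrative and not validated against the schema.
EML_DOC = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.0.1"
         packageId="sbclter.316.18" system="knb">
  <dataset>
    <title>Kelp Forest Community Dynamics: Benthic Fish</title>
    <creator>
      <individualName><surName>Reed</surName></individualName>
    </creator>
    <contact>
      <individualName><surName>Evans</surName></individualName>
    </contact>
  </dataset>
</eml:eml>
"""


def summarize(eml_text):
    """Pull the identity and discovery fields out of an EML string."""
    root = ET.fromstring(eml_text)
    dataset = root.find("dataset")  # child elements carry no namespace here
    return {
        "packageId": root.get("packageId"),
        "system": root.get("system"),
        "title": dataset.findtext("title"),
        "creators": [e.text for e in dataset.findall("creator/individualName/surName")],
        "contacts": [e.text for e in dataset.findall("contact/individualName/surName")],
    }


summary = summarize(EML_DOC)
print(summary)
```

Only the standard library is used, so the same fields could feed a simple discovery index without any EML-specific tooling.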
                   Data Discovery
Geographic, Temporal, and Taxonomic coverage
                        Logical Model: Attribute structure

  • Describes data tables and their
    variables/attributes
  • a typical data table with 10 attributes
         – some metadata are apparent, others are ambiguous
         – missing value code is present
         – definitions need to be explicit, as well as data typing
  Callouts on the table: date format, species codes, value bounds, code definitions

YEAR  MONTH  DATE        SITE  TRANSECT  SECTION  SP_CODE  SIZE  OBS_CODE  NOTES
2001  8      2001-08-22  ABUR  1         0-20     CLIN     5     06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     11    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     10    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     14    06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     7     06        .
2001  8      2001-08-22  ABUR  1         21-40    OPIC     19    06        .
2001  8      2001-08-22  ABUR  1         21-40    COTT     5     06        .
2001  8      2001-08-22  ABUR  2         0-20     CLIN     5     06        .
2001  8      2001-08-22  ABUR  2         21-40    NF       0     06        .
2001  8      2001-08-27  AHND  1         0-20     NF       0     03        .
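
To illustrate why explicit typing and missing-value codes matter, here is a minimal sketch of parsing one row of the table using per-attribute descriptions. The `ATTRIBUTES` structure and its keys (`type`, `missing`) are hypothetical stand-ins for what EML attribute metadata would supply, and treating "." as the missing-value code for NOTES is an assumption based on the sample rows.

```python
from datetime import date

# Hypothetical attribute descriptions; the keys are illustrative, not EML
# element names. OBS_CODE stays a string so the leading zero in "06" survives.
ATTRIBUTES = [
    {"name": "YEAR", "type": int},
    {"name": "MONTH", "type": int},
    {"name": "DATE", "type": date.fromisoformat},
    {"name": "SITE", "type": str},
    {"name": "TRANSECT", "type": int},
    {"name": "SECTION", "type": str},
    {"name": "SP_CODE", "type": str},
    {"name": "SIZE", "type": int},
    {"name": "OBS_CODE", "type": str},
    {"name": "NOTES", "type": str, "missing": "."},  # assumed missing-value code
]


def parse_row(line, attributes):
    """Convert one whitespace-delimited row into typed values."""
    fields = line.split()
    if len(fields) != len(attributes):
        raise ValueError(f"expected {len(attributes)} fields, got {len(fields)}")
    record = {}
    for attr, raw in zip(attributes, fields):
        if raw == attr.get("missing"):
            record[attr["name"]] = None  # declared missing value
        else:
            record[attr["name"]] = attr["type"](raw)
    return record


row = parse_row("2001 8 2001-08-22 ABUR 1 0-20 CLIN 5 06 .", ATTRIBUTES)
print(row)
```

Without the metadata, a generic reader would likely turn "06" into the number 6 and keep "." as a literal note.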
                   EML Measurement Scale


Category   Scale      Meaning                                         Example
Textual    Nominal    Categories                                      Male, Female
Textual    Ordinal    Ordered categories                              Low, Medium, High
Numeric    Interval   Equidistant on number scale                     3 Celsius
Numeric    Ratio      Equidistant on number scale, meaningful ratio   5 meter
Dates      Datetime   Points on calendar timescale                    6-Oct-2004
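
One practical use of a declared measurement scale is deciding which operations make sense on a variable. The sketch below encodes the five scales and, as an assumption drawn from the usual statistical reading of these scales rather than from EML itself, the operations that are meaningful on each; the names `Scale`, `MEANINGFUL`, and `is_meaningful` are hypothetical.

```python
from enum import Enum


class Scale(Enum):
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"
    DATETIME = "dateTime"


# Assumed mapping from scale to meaningful operations (illustrative only).
MEANINGFUL = {
    Scale.NOMINAL: {"equality"},
    Scale.ORDINAL: {"equality", "order"},
    Scale.INTERVAL: {"equality", "order", "difference"},
    Scale.RATIO: {"equality", "order", "difference", "ratio"},
    Scale.DATETIME: {"equality", "order", "difference"},
}


def is_meaningful(scale, operation):
    """Return True if the operation makes sense for values on this scale."""
    return operation in MEANINGFUL[scale]


# 5 meter is on a ratio scale, so "twice as long" is meaningful;
# 3 Celsius is on an interval scale, so "twice as warm" is not.
print(is_meaningful(Scale.RATIO, "ratio"))
print(is_meaningful(Scale.INTERVAL, "ratio"))
```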
             Logical Model: unit Dictionary

• Consistent assignment of measurement units

  – Quantitative definitions in terms of SI units


  – 'unitType' expresses dimensionality
     • time, length, mass, energy are all 'unitType's
     • second, meter, gram, pound, joule are all 'unit's

  [Figure: UnitType "Mass" with Units "gram" and "kilogram", related by a factor of x1000]
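
The unit dictionary idea can be sketched in a few lines: each unit names its unitType and a factor to the SI unit of that type, so conversion is allowed only within one unitType. The `UNITS` table and `convert` function below are hypothetical illustrations, not the EML unit dictionary format.

```python
# Hypothetical mini unit dictionary. "to_si" is the multiplier that converts
# a value in this unit to the SI unit of the same unitType.
UNITS = {
    "kilogram": {"unit_type": "mass", "to_si": 1.0},
    "gram": {"unit_type": "mass", "to_si": 0.001},
    "meter": {"unit_type": "length", "to_si": 1.0},
    "second": {"unit_type": "time", "to_si": 1.0},
}


def convert(value, from_unit, to_unit):
    """Convert between two units that share a unitType."""
    src, dst = UNITS[from_unit], UNITS[to_unit]
    if src["unit_type"] != dst["unit_type"]:
        raise ValueError(
            f"cannot convert {src['unit_type']} ({from_unit}) "
            f"to {dst['unit_type']} ({to_unit})"
        )
    return value * src["to_si"] / dst["to_si"]


print(convert(2500, "gram", "kilogram"))
```

Asking for `convert(1, "gram", "meter")` raises a `ValueError`, which is the kind of dimensional check a consistent unit assignment makes possible.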
                  Collating metadata



• Most scientists know all of this information
  about their data
  – EML simply provides a standardized format for
    recording the information


• Enables data exchange across organizations
  and software systems
              Building a community data network

• Simplified data sharing
• Immediate change tracking
• Redundant backup
• Data maintained by individuals
• Access controlled by individuals



[Figure: network diagram of the Knowledge Network for Biocomplexity (KNB).
Labeled nodes: KNB 1, KNB II, LTER, AND, GCE, ... (26), PISCO, ESA, NCEAS, OBFS]
                          EML-described data in the KNB




[Figure: cumulative count of data packages in the KNB by year, 2002 to 2006;
vertical axis from 0 to 12000]
                  Kepler: dynamic data loading
Kepler supports dynamic data loading:
• Data sources are discovered via metadata queries
• EML metadata allows arbitrary schemas to be loaded into an
  embedded database
• Data queries can be performed before data flows downstream

[Figure: data source from EcoGrid (metadata-driven ingestion)]
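             R processing script
             res <- lm(BARO ~ T_AIR)   # fit a linear regression of BARO on T_AIR
             res                       # print the fitted model
             plot(T_AIR, BARO)         # scatter plot of the two variables
             abline(res)               # overlay the fitted line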
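
As a rough illustration of metadata-driven loading into an embedded database, the sketch below creates a table from a metadata description and queries it before handing rows downstream. It uses Python's built-in `sqlite3` rather than anything from Kepler; the `METADATA` structure, the table name, and the sample values are all made up for illustration, and only the column names T_AIR and BARO come from the R script above.

```python
import sqlite3

# Hypothetical table description, as might be derived from EML attribute
# metadata. The rows are made-up placeholder values, not real measurements.
METADATA = {
    "table": "weather",
    "attributes": [("T_AIR", "REAL"), ("BARO", "REAL")],
}
ROWS = [(12.5, 1012.0), (14.0, 1010.5), (18.2, 1008.1)]


def load(conn, metadata, rows):
    """Create a table whose schema comes from metadata, then insert the rows."""
    table = metadata["table"]
    columns = ", ".join(f'"{name}" {sql_type}' for name, sql_type in metadata["attributes"])
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in metadata["attributes"])
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)


conn = sqlite3.connect(":memory:")
load(conn, METADATA, ROWS)

# Query before passing data downstream, e.g. keep only the warmer readings.
warm = conn.execute(
    "SELECT T_AIR, BARO FROM weather WHERE T_AIR > 13 ORDER BY T_AIR"
).fetchall()
print(warm)
```

Because the schema is built from the description at run time, the same loader works for any table the metadata can describe.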
                  Importance of semantics

• So far we've dealt only with the logical data model
   – any semantics in EML are expressed in natural language


• The computer doesn't really understand:
   – what is being measured
   – how measurements relate to one another
   – how semantics map to logical structure


• Analysis depends on understanding the semantic
  contextual relationships among data measurements
   – e.g., density measured within subplot
                         Observation ontology (OBOE)
Goal: semantically describe the structure of scientific observation
and measurement as found in a data set

• Observations are made about particular entities.
• Observations can provide context for other observations.
• Entities represent real-world objects or concepts that can be measured.
• Every measurement has a characteristic, which defines the property of
  the entity being measured.
• Provide extension points for loading specialized domain ontologies

                                                          slide from J. Madin
                                                 Semantic annotation
• Relational data lacks critical semantic information
• no way for computer to determine that “Ht.” represents a
“height” measurement
• no way for computer to determine if Plot is nested within Site or
vice-versa
• no way for computer to determine if the Temp applies to Site or
Plot or Species




  [Figure: the Observation Ontology and a data set, with the mapping between
  data and the ontology made via semantic annotation]
                                                          slide from J. Madin
                     hasContext     hasContext           hasContext

Entity:          Time        Space          Space                Organism

Characteristic:  Date        LocationName   Label      Area      TaxonomicName   Height

Data column:     Date        Site           Plot                 Species         Height
                 10/12       Hendricks      1                    AHYA            12.2
                 10/12       Hendricks      1                    AHYA            11.0
                 10/12       Hendricks      1                    AHYA             9.7
                 …           …              …                    …               …
                        hasContext              hasContext

Entity:          Organism    Space                   Organism

Characteristic:  Label       Replicate    Area       TaxonomicName   Abundance

Data column:     Tree        Plot                    Species         Count
                 A           1                       AHYA            3
                 A           2                       AHYA            2
                 A           3                       AHYA            8
                 …           …                       …               …
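
A semantic annotation like the two shown above can be represented as a mapping from each data column to an (entity, characteristic) pair. The sketch below is one reading of the slide layout, so the column-to-characteristic alignment is an assumption, and `shared_measurements` is a hypothetical helper. It finds the columns in two data sets that measure the same characteristic of the same kind of entity, which a computer cannot infer from column names alone.

```python
# Assumed annotations: data column -> (entity, characteristic).
HEIGHT_DATA = {
    "Date": ("Time", "Date"),
    "Site": ("Space", "LocationName"),
    "Plot": ("Space", "Label"),
    "Species": ("Organism", "TaxonomicName"),
    "Height": ("Organism", "Height"),
}
COUNT_DATA = {
    "Tree": ("Organism", "Label"),
    "Plot": ("Space", "Replicate"),
    "Species": ("Organism", "TaxonomicName"),
    "Count": ("Organism", "Abundance"),
}


def shared_measurements(a, b):
    """Map each (entity, characteristic) found in both data sets to its columns."""
    index_b = {pair: column for column, pair in b.items()}
    return {
        pair: (column, index_b[pair])
        for column, pair in a.items()
        if pair in index_b
    }


shared = shared_measurements(HEIGHT_DATA, COUNT_DATA)
print(shared)
```

Both data sets have a column named Plot, but the annotations describe different characteristics (Label versus Replicate), so only the Species columns are reported as comparable.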
Observation ontology


  [Figure: ontology diagram with its extension points marked]
                       slide from J. Madin
                 Observation
A high-level assertion that a thing was observed

                       Entity
All things (concrete and conceptual) that are observable

             Entity extension
An extension point for domain-specific terms

                    Context
Asserts a “containment” relationship between entities
Context is transitive

               Measurement
Observations are composed of measurements, which relate
measurable characteristics to the entity being observed
Characteristic
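
The concepts walked through above can be sketched as a small data model. This is a minimal illustration, not the OBOE ontology itself: the class and field names are hypothetical, entities are reduced to plain strings, and the direction of the context links (tree within plot within site) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Measurement:
    characteristic: str  # the property of the entity being measured
    value: object


@dataclass
class Observation:
    entity: str  # the thing that was observed
    measurements: list = field(default_factory=list)
    context: list = field(default_factory=list)  # observations giving direct context


def all_context(observation):
    """Collect direct and inherited context, since context is transitive."""
    found, pending = [], list(observation.context)
    while pending:
        current = pending.pop()
        if not any(current is seen for seen in found):
            found.append(current)
            pending.extend(current.context)
    return found


site = Observation("Space", [Measurement("LocationName", "Hendricks")])
plot = Observation("Space", [Measurement("Label", 1)], context=[site])
tree = Observation("Organism", [Measurement("Height", 12.2)], context=[plot])

print([obs.measurements[0].value for obs in all_context(tree)])
```

The tree observation names only the plot as its context, yet the site is reached as well by following the chain.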
                         Summary

• EML captures critical metadata
• OBOE adds critical semantic descriptions


• Data discovery and integration tools can be
  built that leverage metadata and ontologies


• Metadata and ontologies permit:
  – Loosely-coupled systems
  – Schema independence in data systems
  – Semantic data integration
  – Capturing data as collected, rather than as a derived
    product
           Vegetation Schema Questions

• Vegetation schema
  – Exchange standard or federation?
• Can we accommodate all data that is
  collected in vegetation plots?
  – or just a transformed subset
• XML? RDF? OWL? other?
• Should a vegetation schema link to other
  evolving community standards?
   – EML?
   – OBOE?
                     Questions?




• http://www.nceas.ucsb.edu/ecoinformatics/
• http://knb.ecoinformatics.org/
• http://seek.ecoinformatics.org/
• http://kepler-project.org/
    Acknowledgements
• Knowledge Representation Working Group
     • Mark Schildhauer, Matt Jones (NCEAS)
     • Shawn Bowers, Bertram Ludaescher, Dave
       Thau (UCD)
     • Deana Pennington (UNM)
     • Serguei Krivov, Ferdinando Villa (UVM)
     • Corinna Gries, Peter McCartney (ASU)
     • Rich Williams (Microsoft)
                         Acknowledgments

•   This material is based upon work supported by:
•   The National Science Foundation under Grant Numbers
    9980154, 9904777, 0131178, 9905838, 0129792, and
    0225676.
•   Collaborators: NCEAS (UC Santa Barbara), University of New
    Mexico (Long Term Ecological Research Network Office), San
    Diego Supercomputer Center, University of Kansas (Center for
    Biodiversity Research), University of Vermont, University of
    North Carolina, Napier University, Arizona State University, UC
    Davis
•   The National Center for Ecological Analysis and Synthesis, a
    Center funded by NSF (Grant Number 0072909), the University
    of California, and the UC Santa Barbara campus.
•   The Andrew W. Mellon Foundation.
•   Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON,
    RoadNet, EOL, Resurgence

				