

            NCI caDSR & Semantic Integration
                     Presented to
            Lawrence Berkeley National Labs

                     Denise Warzel
               Associate Director, caDSR
                     March 14, 2005
              Presentation Outline
   Putting NCI's semantic integration into context:
    – Driving factors behind NCI's metadata repository
   NCI Metadata repository (caDSR) role in caCORE
   caDSR Semantic Integration
    – Role of concept mapping
    – Metadata repository vs Vocabulary Services
    – Concept 'linkage' in caDSR
   UML Class diagrams represented in caDSR metadata
   caDSR tooling (if time)

    – Is this what you want to hear?
    – Priority?

D. Warzel
   NCICB
    – Peter Covitz
    – Denise Warzel
   Oracle
    – Edmond Mulaire
    – Ram Chilukuri
    – Prerna Aggarwal
    – Dan Ladino
    – Christophe Ludet
    – Shaji Kakkodi
    – Jane Jiang
   ScenPro
    – Bill McCurry
    – Tom Phillips
    – Robert Harding
    – Jennifer Brush
    – Larry Hebel
    – Smita Hastak
   ISO
    – ISO/IEC 11179 Parts 1-6

                        Current User Base
    Cancer Biomedical Informatics Grid (caBIG) – 820/466/180/ 61% *
    Center for Cancer Research (CCR) – 821/573/506/ 12%
    Clinical Data Interchange Standards Consortium (CDISC) – 3/0
    Center for Cancer Imaging (CIP) – 238/151/148/ 2%
    Cancer Therapy Evaluation Program (CTEP) – 8029/2432/2428/ .1%
    Division of Cancer Prevention (DCP) – 427/321/286/ 11%
    National Heart Lung and Blood Institute (NHLBI) – 0/0
    Early Detection Research Network (EDRN) – 121/1/1/ 100%
    Divisions of Population Sciences and Cancer Control (PS & CC) – 85/9
    Specialized Programs of Research Excellence (SPOREs) – 719/197/120/
    Cancer Ontologic Research Environment (caCORE) – 1028/810/810/ 0%
* Total CDEs in this Context / "Released" workflow status / "Released" and developed by this
    context / "Reused" from other contexts

            NCI's Semantic Integration
   Sharable data, aggregatable across research
    – Unambiguous data characteristics to convey semantic,
      syntactic and lexical meaning
            Human and Machine understandable

   Tools to create, maintain, deploy data standards
   Widely and publicly accessible

            caCORE Components
   caCORE is the open-source foundation upon which the NCICB
   builds its research information management systems

                Bioinformatics Objects

                    Data Standards

                Enterprise Vocabulary
            EVS and caDSR Distinctions
   caDSR is a metadata repository
    – maintains metadata that lets a user locate the correct
      defining characteristics of a datum (an instance of a
      specific concept) in sufficient detail for it to be
      collected and stored on a computer
   EVS is a terminology server
    – provides services for synonymy, mapping between
      vocabularies, hierarchical structures, subconcepts,
      superconcepts, broader, narrower, roles, semantic type,

      caCORE Infrastructure wiring
   [Diagram labels]
    – Public APIs
    – Domain object metadata
    – Common data elements
    – Vocabulary for CDE specification
    – Dictionary, thesaurus services
              Why ISO/IEC 11179?
   “What is this datum?”
    – Provides concrete guidance on the creation and maintenance of
      discrete data element attributes and metadata (semantics) enabling
      the formulation of data elements in a consistent, standard manner
   “Metadata Repository/Registry”
    – Framework for data element standardization and registration
      allows the creation of a shared data environment in much less time
      and with much less effort than it takes using conventional data
      management methodologies.
   Adoption of 11179 Allowed us to “Get on with it”

             Why ISO/IEC 11179?

   Using this framework:
    –   “what is it?”
    –   “how do I want to display it?”
    –   “categorize it?”
    –   “message it?”
    –   “where is it used and by whom?”
    –   “what is its history?” (lifecycle management)

       ISO/IEC 11179 Administered Item
    Administration Record and Common Attributes

      Unique Identifier
       • Data id + version
             (all NCI content shares a common RAI)
      Administrative Status
       • Workflow status
      Registration Status
      Creation Date
      Administrative Note(s)
      Effective Date
      Change Date(s)
      Change Description(s)
      Until Date
      Created By
      Modified By
      Name(s)
      Definition(s)
      Stewardship Information
      Submitter Information
      Reference Document(s)
      Classifications
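
The administration record can be pictured as a small data structure. The following is a minimal sketch, not the caDSR schema: the class names, field names, the `EXAMPLE-RAI` value and the `is_effective` helper are all hypothetical, and only a subset of the attributes listed on this slide is modelled. RAI is read here as the ISO/IEC 11179 registration authority identifier.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ItemIdentifier:
    """Unique identifier: shared RAI plus data id and version (hypothetical layout)."""
    rai: str
    data_id: int
    version: str

    def __str__(self) -> str:
        return f"{self.rai}:{self.data_id}:{self.version}"


@dataclass
class AdministeredItem:
    """A subset of the common attributes named on the slide."""
    identifier: ItemIdentifier
    names: List[str]
    definitions: List[str]
    workflow_status: str                      # administrative status
    registration_status: Optional[str] = None
    created_by: str = ""
    effective_date: Optional[date] = None
    until_date: Optional[date] = None
    classifications: List[str] = field(default_factory=list)

    def is_effective(self, on: date) -> bool:
        """True if `on` falls between Effective Date and Until Date (open ends allowed)."""
        started = self.effective_date is None or self.effective_date <= on
        not_ended = self.until_date is None or on <= self.until_date
        return started and not_ended


if __name__ == "__main__":
    ident = ItemIdentifier(rai="EXAMPLE-RAI", data_id=12345, version="1.0")
    item = AdministeredItem(ident, ["Gene"], ["The functional ..."], "Released")
    print(ident, item.is_effective(date(2005, 3, 14)))
```

Keeping the identifier as its own immutable value makes "data id + version" usable as a dictionary key when several versions of one item coexist.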

                    ISO/IEC 11179 and Extensions
   The Concept Class (coming in new 11179 specification)
   provides Semantic Linkage
   [Diagram labels]
    – Context (for administered item)
    – Concept Class
    – Data_Element_Concept
    – Property
    – Representation_Class



            Why vocabularies/ontology
   Goal: “Semantically unambiguous, interoperability”
   For Humans:
    – Words could be enough – within a specific context or domain
      where common lexicon is already used
   For Machines:
    – Words are not immutable, absent a specific context,
      difficult or impossible to ensure consistent and repeatable


   Are names and words for definitions enough to create unambiguous,
   interoperable, self-harmonizing metadata?
    – No
    – Within different domains same words mean different things
            “site” “trial” “agent”
    – Synonyms?
    – Phraseology?
    – Not 'computable'
   Was ISO/IEC 11179 flawed?
    – No, not if you have a central body creating metadata.
    – We needed to support simultaneous development of data elements in
      different research domains that could be harmonized later – with minimal
    – Draw from standard terminologies  incorporate into a Cancer


   Data Element curators are not necessarily vocabulary
   ISO/IEC 11179 provides the framework
    – But how to make it something that could be self harmonizing
      and computed without a human having to read and interpret

                          The Solution?
   Leverage EVS
   Separate the curation of concepts from the curation of
   ISO/IEC 11179 metadata
   Leverage 'semantics' of ISO/IEC 11179
   Start with the 'building blocks' of Administered items
    – Link to controlled vocabulary in the form of concept codes
    – During metadata curation –
            right place
            right time
              – Naming and defining
   Applying naming conventions to build up the subsequent

            Summary caDSR Semantic Integration
   [Diagram labels; version markers 3.0 and 3.1 appear on the figure]
    – Classification Schemes
    – Conceptual Domain: Agent
    – Object: Agent
    – Chemopreventive
    – Data Element Concept: Chemopreventive Agent
    – Value Domain: Chemopreventive Agent Name
    – Representation
    – Valid Values: Cyclooxygenase Inhibitor, Doxercalciferol, …
    – Data Element: Chemopreventive Agent Name
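
As a rough illustration of how the pieces in this summary relate, the sketch below pairs a Data Element Concept with an enumerated Value Domain to form a Data Element, using the names shown in the figure. It assumes the general ISO/IEC 11179 pattern (data element = data element concept + value domain); the class and method names are hypothetical and are not the caDSR API.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class DataElementConcept:
    name: str
    conceptual_domain: str


@dataclass(frozen=True)
class ValueDomain:
    name: str
    valid_values: FrozenSet[str]   # enumerated value domain

    def accepts(self, value: str) -> bool:
        return value in self.valid_values


@dataclass(frozen=True)
class DataElement:
    name: str
    concept: DataElementConcept
    value_domain: ValueDomain

    def validate(self, value: str) -> bool:
        """A value is valid if the element's value domain lists it."""
        return self.value_domain.accepts(value)


# Names taken from the figure; only the two valid values shown there are listed.
dec = DataElementConcept("Chemopreventive Agent", conceptual_domain="Agent")
vd = ValueDomain("Chemopreventive Agent Name",
                 frozenset({"Cyclooxygenase Inhibitor", "Doxercalciferol"}))
de = DataElement("Chemopreventive Agent Name", dec, vd)

if __name__ == "__main__":
    print(de.validate("Doxercalciferol"))
```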

            3.0 caDSR Implementation
   Enhance Semantic Integration
    – Concept Class enabled and concept relationship to data
            Replace “Alternate Names” concept linkage
            Add “rule” for linking concepts together
             – Order of concepts conveys semantic meaning
    – Add concept linkage to support Value Domains
            “Referenced Parent Concept” – non-enumerated
   Changes to UML to caDSR mapping
   Changes to UML Loader

      UML Classes as ISO/IEC 11179

                 Workflow and Tools
   1. Create UML Diagram with EA or similar UML tool.
   2. Export to XMI.
   3. Run Semantic Connector
   4. Run UML Loader
   5. Post Load Curation
       – value domains


                     UML Loader Mapping
   UML Model -> caDSR ISO/IEC 11179 Metadata
    – UML class:UML attribute -> Data Element, Value Domain
    – UML class -> Object Class = EVS Concept(s)
    – UML attribute -> Property
    – Permissible Value = EVS concept(s)
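
The mapping on this slide can be sketched in a few lines. This is an illustration under assumptions, not the UML Loader itself: `UmlClass`, `UmlAttribute`, `map_class` and the output keys are hypothetical names, and carrying the attribute's data type over to the value domain is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class UmlAttribute:
    name: str
    data_type: str
    concept_codes: List[str] = field(default_factory=list)


@dataclass
class UmlClass:
    name: str
    concept_codes: List[str] = field(default_factory=list)
    attributes: List[UmlAttribute] = field(default_factory=list)


def map_class(uml_class: UmlClass) -> List[Dict[str, object]]:
    """Emit one record per class:attribute pair.

    UML class      -> Object Class (tagged with EVS concept codes)
    UML attribute  -> Property     (tagged with EVS concept codes)
    class:attribute -> Data Element
    """
    records = []
    for attr in uml_class.attributes:
        records.append({
            "object_class": uml_class.name,
            "object_class_concepts": list(uml_class.concept_codes),
            "property": attr.name,
            "property_concepts": list(attr.concept_codes),
            "data_element": f"{uml_class.name}:{attr.name}",
            "value_domain_datatype": attr.data_type,  # assumption for this sketch
        })
    return records


if __name__ == "__main__":
    gene = UmlClass("Gene", ["C16612"], [UmlAttribute("symbol", "String")])
    for record in map_class(gene):
        print(record)
```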

    Mapping UML Models to caDSR
   Classes and attributes in UML Model are tagged with EVS
   Concepts before being loaded into ...
   Classes and ... mapped to Object Classes and ...
   Concept Codes used to perform semantic ... with existing ...
   [Diagram: UML Model -> UML Loader -> caDSR]
    – UML Model: Class (EVS Concept C12345), Class Attribute (EVS Concept C54321)
    – caDSR: Object Class (associated to C12345), a second item (associated to C54321),
      Data Element Concept, Concept, Value Domain, Permissible Value/Meaning, Data Element
    – Semantic Integration
       UML Domain Model Example

                    Gene Class in Detail
   [Diagram labels: Package Name, Class Name, Attributes, Data Type]

   Class Concept Tagged Values:
      ConceptCode = C16612
      ConceptPreferredName = Gene
      ConceptDefinition = The functional ...
      ConceptDefinitionSource = NCI-GLOSS

   Attribute Concept Tagged Values:
      ConceptCode = Cxxxxxx
      ConceptPreferredName = OMIMId
      ConceptDefinition = The identifier
      ConceptDefinitionSource = NCI-GLOSS

   Attribute Concept Tagged Values:
      ConceptCode = Cxxxxxx
      ConceptPreferredName = Symbol
      ConceptDefinition = An arbitrary sign...
      ConceptDefinitionSource = NCI-GLOSS
                             Concept Mapping
      Concept Attribute         Data
      Preferred Name            Derived from ConceptCode Tagged Value (e.g., C16612)
      Long Name                 Derived from ConceptPreferredName Tagged Value (e.g., Gene)
      Preferred Definition      Derived from ConceptDefinition Tagged Value (e.g., The functional ...)
      Version                   1.0
      Workflow Status           Default value specified during loading process (e.g., Draft New)
      Context                   Default value specified during loading process (e.g., caCORE)
      Begin Date                Current timestamp

    – Concepts are created if they do not already exist
    – If a Concept exists but with a different definition source,
      an alternate definition is created for that concept.
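
The two loading rules above (create the concept if it is missing; add an alternate definition when the definition source differs) can be sketched as follows. This is an illustration only: the `Concept` class, the `load_concept` function and the in-memory `registry` dictionary are hypothetical stand-ins, and skipping a source that is already recorded is an assumption the slide does not state.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Tuple


@dataclass
class Concept:
    preferred_name: str          # from ConceptCode
    long_name: str               # from ConceptPreferredName
    preferred_definition: str    # from ConceptDefinition
    definition_source: str       # from ConceptDefinitionSource
    version: str = "1.0"
    workflow_status: str = "Draft New"
    context: str = "caCORE"
    begin_date: datetime = field(default_factory=datetime.now)
    alternate_definitions: List[Tuple[str, str]] = field(default_factory=list)


def load_concept(registry: Dict[str, Concept], tags: Dict[str, str],
                 workflow_status: str = "Draft New",
                 context: str = "caCORE") -> Concept:
    """Create the concept if absent; otherwise record an alternate definition
    when the incoming definition source has not been seen for this concept."""
    code = tags["ConceptCode"]
    existing = registry.get(code)
    if existing is None:
        concept = Concept(
            preferred_name=code,
            long_name=tags["ConceptPreferredName"],
            preferred_definition=tags["ConceptDefinition"],
            definition_source=tags["ConceptDefinitionSource"],
            workflow_status=workflow_status,
            context=context,
        )
        registry[code] = concept
        return concept

    source = tags["ConceptDefinitionSource"]
    known = {existing.definition_source}
    known.update(src for _, src in existing.alternate_definitions)
    if source not in known:  # assumption: one definition per source
        existing.alternate_definitions.append((tags["ConceptDefinition"], source))
    return existing


if __name__ == "__main__":
    registry: Dict[str, Concept] = {}
    gene_tags = {"ConceptCode": "C16612", "ConceptPreferredName": "Gene",
                 "ConceptDefinition": "The functional ...",
                 "ConceptDefinitionSource": "NCI-GLOSS"}
    print(load_concept(registry, gene_tags).long_name)
```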

               Model metadata and
              Classification Schemes
   UML domain model mapped to a classification
   scheme (CS) (type = Project)
    – Versioning, lifecycle statuses, reference documents, etc.
   A UML domain model could optionally be
   organized into multiple packages (CSI) (type =
   UML Package)
    – Each package may correspond to a sub-project
    – UML Loader can be configured to create a CSI
      corresponding to each package in the UML domain model

     “Semantic Self-Harmonization”
   Concept code and order are compared to
   determine whether or not two entities are
   Reuse 'registered' by Classifications
   Concept codes can be used to search caDSR for
   content with relationships at any level of ISO/IEC
   11179 metamodel
    – Object Class, Property, Value domain, Value Meaning,
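
A minimal sketch of the comparison and search described on this slide, assuming each entity carries an ordered list of concept codes and that two entities match when both the codes and their order agree. The function names and the sample entities are hypothetical; `Ccode1`, `Ccode2` and `Ccode9` are placeholders, not real EVS codes.

```python
from typing import Dict, List, Sequence


def same_semantics(codes_a: Sequence[str], codes_b: Sequence[str]) -> bool:
    """Compare concept codes position by position, so order matters."""
    return tuple(codes_a) == tuple(codes_b)


def find_by_concept(entities: Dict[str, Sequence[str]], code: str) -> List[str]:
    """Return the names of all entities whose concept list contains `code`."""
    return sorted(name for name, codes in entities.items() if code in codes)


if __name__ == "__main__":
    entities = {
        "Object Class A": ["Ccode2", "Ccode1"],
        "Object Class B": ["Ccode2", "Ccode1"],
        "Object Class C": ["Ccode1", "Ccode2"],   # same codes, different order
        "Property D": ["Ccode9"],
    }
    print(same_semantics(entities["Object Class A"], entities["Object Class B"]))
    print(same_semantics(entities["Object Class A"], entities["Object Class C"]))
    print(find_by_concept(entities, "Ccode1"))
```

Because only codes are compared, no human has to read names or definitions to decide whether two independently curated items coincide.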

   Vocabulary “shifts”
    – merged/split, more granularity, new terms
    – Jan. 2005
            Primary Concept = Breast Cancer – Ccode1
            Qualifier Concept = Lobular – Ccode2
    – March 2005
            Lobular Breast Cancer = Ccode3
    – caDSR metadata maintenance
            Lexical and concept code
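
The concept-code side of this maintenance could look roughly like the sketch below: when the vocabulary merges a primary and a qualifier concept into one new concept, every concept list that contains both old codes is rewritten to use the new code. The function name and the choice to place the new code where the first old code stood are assumptions; `Ccode1`, `Ccode2` and `Ccode3` are the placeholders from this slide, and the lexical (name and definition) update is not shown.

```python
from typing import Iterable, List


def apply_merge(codes: Iterable[str], merged_from: Iterable[str],
                merged_to: str) -> List[str]:
    """Replace the merged concept codes with the new code.

    The list is returned unchanged unless every code in `merged_from`
    is present. The new code takes the position of the first old code.
    """
    codes = list(codes)
    old = set(merged_from)
    if not old <= set(codes):
        return codes
    result: List[str] = []
    inserted = False
    for code in codes:
        if code in old:
            if not inserted:
                result.append(merged_to)
                inserted = True
        else:
            result.append(code)
    return result


if __name__ == "__main__":
    # Jan. 2005: qualifier Lobular (Ccode2) + primary Breast Cancer (Ccode1)
    # March 2005: Lobular Breast Cancer (Ccode3)
    print(apply_merge(["Ccode2", "Ccode1"], ["Ccode1", "Ccode2"], "Ccode3"))
```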

            Introduction to caDSR Tools
       – CDE Browser to Search for and Download
       – Form Builder to Create user specified collections of CDEs
       – Side-by-Side Compare

       – CDE Curation Tool to Create Data Elements

       – Admin Tool to Curate and Administer caDSR - “Power Users”

       – Sentinel Tool (3.0)
              Generates end user 'Alerts' triggered by metadata changes

       – Batch Load to import Administered Items
              Excel Loader (MS Excel)
              UML Loader (XMI)
              Case Report Form Loader (MS Excel)

             Access, Develop, Manage, Consume
               CDE Browser
   View, Search, Download
    – Shopping cart feature
   FormBuilder to Build / Download Forms and Data
   "Context Browsing" Tree
    – By Classification Schemes
    – By Forms
   CDE Basic Search Criteria
    – Google-like search
    – Sortable search results by clicking on column headings
   [Screenshot captions: "Context Browsing", Basic Search]

            CDE Browser
   Advanced Search Criteria
    – Leverages ISO attributes
            Find all with "18254-3" permissible value
            Find all with "Gene*"
            Find all with "Released" workflow status
            Find all with "Standard" Registration status
   [Screenshot caption: Advanced Search]
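
The kinds of criteria listed above can be illustrated with a small in-memory filter. This is only a sketch of the idea, not the CDE Browser's implementation or query API: the `search` function, the record keys and the sample records are hypothetical, and the wildcard is handled with Python's standard `fnmatch` module.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Optional


def search(cdes: List[Dict], name_pattern: Optional[str] = None,
           workflow_status: Optional[str] = None,
           registration_status: Optional[str] = None,
           permissible_value: Optional[str] = None) -> List[Dict]:
    """Return the CDE records that satisfy every criterion that was supplied."""
    results = []
    for cde in cdes:
        if name_pattern is not None and not fnmatchcase(cde["long_name"], name_pattern):
            continue
        if workflow_status is not None and cde["workflow_status"] != workflow_status:
            continue
        if registration_status is not None and cde["registration_status"] != registration_status:
            continue
        if permissible_value is not None and permissible_value not in cde["permissible_values"]:
            continue
        results.append(cde)
    return results


# Hypothetical sample records for illustration only.
SAMPLE = [
    {"long_name": "Gene Symbol", "workflow_status": "Released",
     "registration_status": "Standard", "permissible_values": []},
    {"long_name": "Gene Title", "workflow_status": "Draft New",
     "registration_status": "Candidate", "permissible_values": []},
    {"long_name": "Lab Test Code", "workflow_status": "Released",
     "registration_status": "Standard", "permissible_values": ["18254-3"]},
]

if __name__ == "__main__":
    for hit in search(SAMPLE, name_pattern="Gene*", workflow_status="Released"):
        print(hit["long_name"])
```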

            Form Builder
   Create and Manage Forms
    – Organize CDEs into modules within a Form
    – Attach PDF or Word format
    – Classify Forms into groupings for specific end user communities
    – "Publish" "Un-Publish" for Browser Catalog visibility
   "Printer Friendly" version
   Download CDEs

            CDE Side-by-Side Compare
   CDE Side-by-Side
    – Build shopping cart, compare CDE metadata side by side
    – Download to Excel

            Curation Tool
   To Create, Edit or Version:
    – Data Element Concepts
    – Value Domains
    – Data Elements
   ISO 11179 Wizard
    – Construct ISO compliant Data Elements by building up the
    – Builds Names and Definitions from underlying
   "Get Associated"
    – Leverage ISO to retrieve related CDEs
   "Block Edit"
    – "shopping cart"
    – Assign classification
            Administration Tool
                      System Administration
                           User Accounts and
                           Lists of Values (LOVs)
                           used in content creation
                      Create “Framework”:
                           Conceptual Domains
                           Classification Schemes
                           (basis for organizing
                           CDEs in Browser)

            Sentinel Tool
                    Create “Alerts”
                     – User defined triggers based
                       on data element metadata
                     – “notify me of any change to
                       the Value Domain for any
                       CDE on the Adverse Event
                    Generates and emails a
                    report of changes
                    matching “Alert” criteria
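
One way to picture an alert is as a comparison of two metadata snapshots filtered by the attributes a user asked to watch. The sketch below assumes that reading; the function names and the snapshot layout are hypothetical, and the emailing step is left out.

```python
from typing import Dict, Iterable, Tuple


def changed_attributes(before: Dict[str, object],
                       after: Dict[str, object]) -> Dict[str, Tuple[object, object]]:
    """Map each attribute whose value differs to its (old, new) pair."""
    keys = sorted(set(before) | set(after))
    return {k: (before.get(k), after.get(k))
            for k in keys if before.get(k) != after.get(k)}


def alert_report(watched: Iterable[str], before: Dict[str, object],
                 after: Dict[str, object]) -> Dict[str, Tuple[object, object]]:
    """Keep only the changes to attributes named in the alert definition."""
    watched = set(watched)
    return {k: v for k, v in changed_attributes(before, after).items() if k in watched}


if __name__ == "__main__":
    old = {"value_domain": "VD v1.0", "workflow_status": "Released"}
    new = {"value_domain": "VD v2.0", "workflow_status": "Released"}
    print(alert_report(["value_domain"], old, new))
```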

            Batch Loading
   Excel Loaders
    – Formatted MS Worksheet
            Administered Item
   UML Loader
    – XMI representation of a UML Class Diagram
            Class -> Object Class
            Attribute -> Property
            Data Element Concept, Value Domain and Data Element derived from the above

   Example worksheet: OC
   caDSR DEFAULT VALUES: Workflow status = "Released" always. Version = 1.0 always.
   Create Date = date loaded by Loader. Created by = EVS. Long Name = EVS Preferred Name.

   Column                   Type              Mapped to                      Null?
   EVS Preferred Name       VARCHAR2 (20)     Long Name and Preferred Name   Not Null
   Definition               VARCHAR2 (2000)   PreferredDefinition            Not Null
   Definition Source        VARCHAR2 (2000)   Definition Source              Null
   Database                 VARCHAR2 (255)    Database                       Not Null
   Context Preferred Name   VARCHAR2 (20)     Requestors Context             Not Null
   Effective Begin Date     VARCHAR2 (30)     YY.MM.B                        Null
   Change Note              VARCHAR2 (2000)   Text                           Null
   Alternate Name Type      VARCHAR2 (20)     AlternateName.Type             Not Null

   Example rows. In each row: Definition Source = NCI, Database = NCI Thesaurus,
   Context Preferred Name = caBIG, Effective Begin Date = 11/18/2004,
   Change Note = Requested by Dianne Reeves, Alternate Name Type = NCI_Concept_Code.
    – Celsius Scale
            The temperature scale defined by the values 0 degree Celsius for the
            freezing point of water and 100 degrees Celsius for the boiling point
            of water. The Celsius degree (C) is the same size as a Kelvin and equal
            to (F - 32)/1.8. To convert Celsius to
    – HEENT
            HEENT is the Head, Ears, Eyes, Nose and Throat, and is referred to as a
            body system on a physical or examination. The term is typically used as
            'HEENT' in a physician or caregiver notes.
    – Gracely Pain Unpleasantness Scale
            The Gracely Pain Unpleasantness Scale is a visual analog scale of 0 to 20
            used by a subject to define their pain
            Together with the intensity scale these tools serve to differentiate the
            patient's sensory perception of pain

   National Institute of Neurological Disorders and Stroke
   National Icelandic Center for Oncology
   Cancergrid – UK
   Canadian Center for Health Informatics
