Vincent Ferretti

Opal: An Open Source Platform for Data Integration
Montréal 2010
Background
•   Opal is a data integration application developed for the Canadian
    Partnership for Tomorrow (CPT) project
       A network of five prospective cohorts on chronic diseases across
       Canada
       300 000 participants by 2012
•   Software development team of 9 people in Montreal and Toronto
•   Opal is a P3G OBiBa project
       Open source and free at www.obiba.org
       Generic enough to be used by any other biobank
What is Opal?
•   Data management software for a single biobank
       Tools for integrating and storing data from multiple sources
       Generic database
        – Unlimited number of variables per study
        – Unlimited number of variable descriptors (metadata)
        – Unlimited number of entities (participant, instrument, staff, …)
       Manage
        – Participant identifiers
        – Data encryption
        – Variable dictionary
       Tools for data export and reports
•   A young application
       A few months of programming; just starting to be used by some of
       the CPT cohorts
       Command line interface
       Working on a web interface
        – In collaboration with the Australian Wager team

[Figure: Opal connected to assessment centers, a bio-repository, and public registries]
What is Opal?
•   Software for biobank networking
       Provides a comprehensive software infrastructure for integrating,
       exchanging and analyzing harmonized data among biobanks




[Figure: Opal instances at biobanks B1, B2, and B3 linked to a shared Opal instance]
Challenge 1: Information Management System heterogeneity

Different
• data formats
• software
• variable concepts and definitions

Common way
• Extract-Transform-Load (ETL)
• Data warehousing

Some issues
• Data synchronization
• Data centralization

[Figure: ETL scripts load data from Biobank 1, Biobank 2, …, Biobank n into Opal]
Challenge 1: Information Management System heterogeneity
Alternative way
• Direct connection (if possible)
• Uniform virtual view of the data
• No scripts are required
• Data can remain in its original format and system
• Data can also be copied

[Figure: Biobank 1, Biobank 2, …, Biobank n connect through the Opal Uniform Interface to Opal and the Opal DB]
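
A minimal sketch of what a "uniform virtual view" over heterogeneous sources could look like. This is not Opal's API: tableAdapter, recordAdapter, uniformView, and the sample variables are hypothetical names chosen for illustration. Each adapter leaves its source in its native layout and exposes the same two methods.

```javascript
// Hypothetical sketch, not Opal's actual interface.
// Every adapter exposes variables() and value(participantId, variableName).

// Source 1: a table kept as a header row plus arrays of cells.
function tableAdapter(header, rows) {
  const idCol = header.indexOf('id');
  return {
    variables: () => header.filter((name) => name !== 'id'),
    value(participantId, variableName) {
      const row = rows.find((r) => r[idCol] === participantId);
      const col = header.indexOf(variableName);
      return row && col >= 0 ? row[col] : null;
    },
  };
}

// Source 2: records keyed by participant, as a document store might hold them.
function recordAdapter(records) {
  return {
    variables: () => [...new Set(Object.values(records).flatMap(Object.keys))],
    value(participantId, variableName) {
      const record = records[participantId];
      return record && variableName in record ? record[variableName] : null;
    },
  };
}

// The view only talks to the shared interface; no data is copied or converted.
function uniformView(adapters) {
  return {
    variables: () => [...new Set(adapters.flatMap((a) => a.variables()))],
    value(participantId, variableName) {
      for (const adapter of adapters) {
        const v = adapter.value(participantId, variableName);
        if (v !== null) return v; // first source that knows the value wins
      }
      return null;
    },
  };
}

const view = uniformView([
  tableAdapter(['id', 'AGE', 'SEX'], [['p1', 42, 'F'], ['p2', 57, 'M']]),
  recordAdapter({ p1: { SMOKER: false }, p3: { AGE: 63 } }),
]);
console.log(view.variables());
console.log(view.value('p1', 'SMOKER'), view.value('p3', 'AGE'));
```

Returning the first non-null value is a simplification; a real system would need a rule for sources that disagree.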
Challenge 2: Participant Privacy
•     To ensure participant privacy, data pooling without consent among
      biobanks must be prevented
•     General principles to prevent collusion
           Each organization must use its own participant identifiers (IDs)
           New specific IDs must be used when exchanging data on participants
           IDs are accessible to a very restricted number of people within an
           organization

    [David Chaum. Security without identification: transaction systems to make big brother obsolete.
    Communications of the ACM, 28(10):1030–1044, 1985.]
Opal Implementation of Privacy Principles

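[Figure: Opal in "My biobank" holds a participant data database and an ID database. The ID database maps each internal ID to ID 1, ID 2, and ID 3. Data is exchanged with Organization 1 under ID 1, with BioBank 2 under ID 2, and with Research Project 3 under ID 3.]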
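
The ID principles can be sketched as a small lookup structure. This is an illustration under assumptions, not Opal's implementation: the IdDatabase class, the exportRecord helper, the party names, and the choice of 8 random bytes per ID are all hypothetical.

```javascript
const nodeCrypto = require('crypto');

// Hypothetical ID database: one internal ID, one unrelated random ID per party.
class IdDatabase {
  constructor() {
    this.byInternal = new Map(); // internalId -> Map(party -> externalId)
    this.byExternal = new Map(); // "party|externalId" -> internalId
  }

  // Return the ID used with `party`, creating a fresh random one on first use.
  externalId(internalId, party) {
    if (!this.byInternal.has(internalId)) {
      this.byInternal.set(internalId, new Map());
    }
    const ids = this.byInternal.get(internalId);
    if (!ids.has(party)) {
      const id = nodeCrypto.randomBytes(8).toString('hex');
      ids.set(party, id);
      this.byExternal.set(`${party}|${id}`, internalId);
    }
    return ids.get(party);
  }

  // Resolve data coming back from `party` to the internal ID, or null.
  internalId(party, externalId) {
    return this.byExternal.get(`${party}|${externalId}`) ?? null;
  }
}

// Replace the internal ID with the party-specific ID before data leaves.
function exportRecord(db, record, party) {
  const { internalId, ...data } = record;
  return { id: db.externalId(internalId, party), ...data };
}

const idDb = new IdDatabase();
const record = { internalId: 'int-0001', AGE: 42 };
console.log(exportRecord(idDb, record, 'BioBank 2'));
console.log(exportRecord(idDb, record, 'Research Project 3'));
```

Because the two exported IDs are independent random values, the two recipients cannot join their copies on the ID alone; only the holder of the ID database can.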
Challenge 3: Confidentiality
•   Confidentiality is the need to protect data from unauthorized access
•   Opal includes a comprehensive Public Key Infrastructure (PKI) for
    organization authentication and data encryption
         Used for data transfer
         Used to connect two Opal instances together
•   Manage & store public-private key pairs
         create, delete, export, import
•   Provide authentication
•   Decrypt data in memory (no decrypted file on system)

[Figure: Opal in "My biobank" keeps a [private key, public key] pair in its key store; a public key and encrypted data pass between the assessment center and Opal]
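
A minimal sketch of the exchange on this slide, written with Node.js's built-in crypto module rather than Opal's own code. The hybrid envelope (AES-256-GCM for the record, RSA for the AES key) and the names encryptRecord and decryptRecord are assumptions for illustration; the slide does not state which algorithms Opal uses.

```javascript
const nodeCrypto = require('crypto');

// Biobank side: create a key pair. The private key never leaves the key store.
const { publicKey, privateKey } = nodeCrypto.generateKeyPairSync('rsa', {
  modulusLength: 2048,
});

// Assessment-center side: encrypt a record using the biobank's public key.
function encryptRecord(biobankPublicKey, record) {
  const aesKey = nodeCrypto.randomBytes(32); // one-time symmetric key
  const iv = nodeCrypto.randomBytes(12);
  const cipher = nodeCrypto.createCipheriv('aes-256-gcm', aesKey, iv);
  const data = Buffer.concat([
    cipher.update(JSON.stringify(record), 'utf8'),
    cipher.final(),
  ]);
  return {
    wrappedKey: nodeCrypto.publicEncrypt(biobankPublicKey, aesKey),
    iv,
    tag: cipher.getAuthTag(),
    data,
  };
}

// Biobank side: decrypt entirely in memory; nothing is written to disk.
function decryptRecord(biobankPrivateKey, envelope) {
  const aesKey = nodeCrypto.privateDecrypt(biobankPrivateKey, envelope.wrappedKey);
  const decipher = nodeCrypto.createDecipheriv('aes-256-gcm', aesKey, envelope.iv);
  decipher.setAuthTag(envelope.tag);
  const plain = Buffer.concat([decipher.update(envelope.data), decipher.final()]);
  return JSON.parse(plain.toString('utf8'));
}

const envelope = encryptRecord(publicKey, { id: 'participant-0001', AGE: 42 });
console.log(decryptRecord(privateKey, envelope));
```

The GCM authentication tag also makes decryption fail if the encrypted data was altered in transit.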
Challenge 4: Data Harmonization
•   Data elements collected in each cohort are not identical and cannot
    be shared as-is
•   The DataShaper Harmonization Platform
       Given a network of cohorts, a set of common variables of interest is
       defined using the DataSchema ontological framework
       Cohort-specific mapping algorithms between data elements and this
       common variable set are provided
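
As an illustration of cohort-specific mapping algorithms, the sketch below maps two differently coded smoking questions onto one common variable. Everything in it is hypothetical: the cohort names, the source variables SMK_STATUS and cigs_per_day, their codings, and the common variable CURRENT_SMOKER are invented for the example and are not taken from DataSchema.

```javascript
// Hypothetical common variable: CURRENT_SMOKER with values 'yes', 'no', or null.
// One mapping function per cohort, because each cohort collected it differently.
const mappings = {
  // Cohort A (assumed coding): SMK_STATUS 1 = current, 2 = former, 3 = never.
  cohortA: (row) => {
    if (row.SMK_STATUS == null) return null;
    return row.SMK_STATUS === 1 ? 'yes' : 'no';
  },
  // Cohort B (assumed coding): number of cigarettes smoked per day.
  cohortB: (row) => {
    if (row.cigs_per_day == null) return null;
    return row.cigs_per_day > 0 ? 'yes' : 'no';
  },
};

// Apply a cohort's mapping to its own rows; the raw variables are not exported.
function harmonize(cohort, rows) {
  const map = mappings[cohort];
  if (!map) throw new Error(`No mapping algorithm for ${cohort}`);
  return rows.map((row) => ({ id: row.id, CURRENT_SMOKER: map(row) }));
}

console.log(harmonize('cohortA', [{ id: 'a1', SMK_STATUS: 1 }, { id: 'a2', SMK_STATUS: 3 }]));
console.log(harmonize('cohortB', [{ id: 'b1', cigs_per_day: 0 }, { id: 'b2' }]));
```

After mapping, rows from both cohorts share the same variable name and value set, which is what makes them comparable.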
DataSchema
[Figure: a mapping algorithm connects each study's data elements to the DataSchema]
Variable Derivation in Opal
•   New variables are defined using scripts
    JavaScript language
         – Simple to write
         – Algorithms can be as complex as necessary
•   Integrated with DataShaper’s platform to obtain
    algorithms automatically (prototype)
    <attribute name="URI">
    http://www.datashaper.org/owl/2009/10/cpt.owl#CPT_498
    </attribute>
•    Opal computes values of the derived variables when needed

New AGE_GROUP variable
    var age = $('AGE');
    if(age.lt(40)) {
       "<40";
    } else if(age.lt(50)) {
       "40-50";
    } else if(age.lt(60)) {
       "50-60";
    } else {
       ">=60";
    }
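
The AGE_GROUP script on this slide can be tried outside Opal with a small stand-in for the `$` selector. The sketch below is not Opal's engine: wrap, derive, and the plain-boolean return of lt are assumptions, and the script is rewritten as a function with explicit return statements because plain JavaScript functions do not return the last evaluated expression.

```javascript
// Hypothetical stand-in for the value wrapper returned by $('NAME').
function wrap(raw) {
  return {
    value: () => raw,
    lt: (limit) => raw !== null && raw < limit, // assumed to yield a boolean
  };
}

// The derived variable is stored as a script, not as a column of values.
function ageGroup($) {
  const age = $('AGE');
  if (age.value() === null) return null; // missing age stays missing
  if (age.lt(40)) return '<40';
  if (age.lt(50)) return '40-50';
  if (age.lt(60)) return '50-60';
  return '>=60';
}

// Evaluate the script for one participant only when the value is requested.
function derive(script, participant) {
  const $ = (name) => wrap(name in participant ? participant[name] : null);
  return script($);
}

console.log(derive(ageGroup, { AGE: 47 }));
console.log(derive(ageGroup, { AGE: 60 }));
```

Computing on request means a change to the script takes effect on the next read, with no stored values to refresh.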
Challenge 5: Data Access

[Figure: Biobank 1, Biobank 2, Biobank 3, …, Biobank n each run an Opal instance holding the common derived variables]

•    How to share data among Opal instances?
          Depends on data access policies
1st Scenario: A Central Data Warehouse
•   Single central database
•   Data file transfers
         Encryption
         Privacy

[Figure: the Opal instances at Biobank 1, Biobank 2, Biobank 3, …, Biobank n, each holding the common derived variables, feed one central Opal]
2nd Scenario: Database Federation
•   A virtual central database
•   Opal provides web services to securely access data remotely

[Figure: Opal Central reaches the Opal instances at Biobank 1, Biobank 2, Biobank 3, …, Biobank n, each holding the common derived variables]
3rd Scenario: DataShield
• Opal comes with an embedded R server
• Only aggregating R functions are made available by Opal via web
  services
• Statisticians use the Opal R extension for accessing these services

[Figure: each Opal instance at Biobank 1, Biobank 2, Biobank 3, …, Biobank n holds the common derived variables and runs R; an R session using the Opal R extension calls them]
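
The idea of exposing only aggregating functions can be sketched as follows. The sketch is in JavaScript for consistency with the derivation script, whereas the services on this slide are R functions. makeSite, pooledMeanVariance, and the minimum-count rule are hypothetical and are not taken from DataShield.

```javascript
// Hypothetical site: individual-level rows stay private inside the closure.
function makeSite(rows, minCount = 5) {
  const aggregate = (variable) => {
    const values = rows
      .map((r) => r[variable])
      .filter((v) => typeof v === 'number');
    // Assumed safeguard: refuse aggregates over very small groups.
    if (values.length < minCount) {
      throw new Error('Too few observations to disclose');
    }
    return {
      n: values.length,
      sum: values.reduce((a, b) => a + b, 0),
      sumSq: values.reduce((a, b) => a + b * b, 0),
    };
  };
  return { aggregate }; // no method returns the rows themselves
}

// Analyst side: combine per-site aggregates into pooled statistics.
function pooledMeanVariance(sites, variable) {
  let n = 0;
  let sum = 0;
  let sumSq = 0;
  for (const site of sites) {
    const a = site.aggregate(variable);
    n += a.n;
    sum += a.sum;
    sumSq += a.sumSq;
  }
  const mean = sum / n;
  const variance = (sumSq - n * mean * mean) / (n - 1); // sample variance
  return { n, mean, variance };
}

const siteA = makeSite([{ AGE: 41 }, { AGE: 52 }, { AGE: 63 }], 3);
const siteB = makeSite([{ AGE: 38 }, { AGE: 47 }, { AGE: 59 }], 3);
console.log(pooledMeanVariance([siteA, siteB], 'AGE'));
```

The pooled mean and variance equal what a single analysis of all rows would give, yet no individual value crosses a site boundary.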
Conclusion
•   Opal can be used by both a single biobank and a network of biobanks
•   Opal is meant for data integration, providing means to respect
    confidentiality, privacy and data access policies
•   Integrates other P3G tools such as DataShaper and DataShield, as well as
    other OBiBa software applications
•   Opal is a young application
       Web interfaces still need development
•   Available on svn.obiba.org
Acknowledgments
Software Architect
    Philippe Laflamme

Development team
    Martin Boulanger
    Tony Debat
    Dwain Elson
    Nathalie Emond
    Chuping Liu
    Yannick Marcon
    Dennis Spathis

P3G
    Isabel Fortier
    Francois l’Heureux
    Paul Burton
    Genevieve Lachance
    Cedric Thibeault

Funders
    Cartagene
    CPAC
    Genome Quebec
    OICR
    Ontario Health Study

Canadian Partnership for Tomorrow Cohorts
    British Columbia Generation Project
    Alberta Tomorrow Project
    Ontario Health Study
    Cartagene
    Atlantic Provinces Partnership for Tomorrow’s Health