National Cancer Institute

caTissue Data Migration
Washington University/Siteman Cancer Center
Persistent Systems
At WashU
• Administrative data: 642 users, 147 collection
  protocols, 32 storage types, 3873 storage
  containers
• Biospecimen data: 25215 participants and
  95585 specimens

All legacy data has been migrated, and as of
  yesterday caTissue Core is in production
  at the tissue bank.
The Migration Process
• STEP ONE: CATIS (MS Access legacy data) → Oracle Migration
  Workbench → CATIS (Oracle database)
• STEP TWO: CATIS (Oracle database) → migrate data (SQL/PL SQL)
  and map value domains → caTissue Core staging database
• STEP THREE: caTissue Core staging database → Migration Tool →
  caTissue Core + CSM final database, used by the caTissue Web
  application
  – The Migration Tool also produces an error report and a
    duplicate participant report
Migrating to Staging Database
Map legacy data to the new schema; this is also an
  opportunity to cleanse the data.
• The most important and most time-consuming step
• Data issues
  – Value domain mismatch
  – Data type does not match
  – Bad data
     • Patients with specimens but no collection protocol
  – No concept of specimen collection group in legacy
    data
  – Errors due to typos
  – No storage container hierarchy or restrictions
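
The kinds of staging issues listed above (value domain mismatch, data type mismatch, patients without a collection protocol) can be sketched as a per-row conversion step. This is only an illustration, not the SQL/PL SQL actually used at WashU: the column names, the date formats, and the function names are all assumptions; the M/F mapping follows the example given later in these slides.

```python
from datetime import date, datetime

# Assumed mapping from legacy codes to the target value domain.
GENDER_MAP = {"M": "Male Gender", "F": "Female Gender"}


def map_gender(legacy_value):
    """Return the mapped value, or None when the legacy value is unmapped."""
    key = (legacy_value or "").strip().upper()
    return GENDER_MAP.get(key)


def parse_legacy_date(text, formats=("%m/%d/%Y", "%Y-%m-%d")):
    """Try each assumed format in turn; return a date, or None for bad data."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def stage_row(row):
    """Convert one legacy row (hypothetical column names).

    Issues are collected rather than raised so that every problem in the
    row can be reported and cleansed in one pass.
    """
    issues = []
    gender = map_gender(row.get("gender"))
    if gender is None:
        issues.append("value domain mismatch: gender=%r" % row.get("gender"))
    birth_date = parse_legacy_date(row.get("birth_date") or "")
    if birth_date is None:
        issues.append("data type mismatch: birth_date=%r" % row.get("birth_date"))
    protocol = row.get("collection_protocol")
    if not protocol:
        issues.append("bad data: no collection protocol")
    staged = {
        "gender": gender,
        "birth_date": birth_date,
        "collection_protocol": protocol,
    }
    return staged, issues
```

Collecting issues per row, instead of stopping at the first one, keeps the cleansing work list complete after a single pass over the legacy data.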
The Data Migration Tool
A Java program built on the caTissue API; it will soon be
  publicly available.
• Two modes of operation
   – Read from staging database
   – Read from error logs (error recovery mode)
• API validation catches any bad data
   – After correction, tool can be rerun for just those records.
• Duplicate participant matching information is generated
  during participant insertion.
   – Can be used to cleanse legacy data
• Objects to be migrated are configurable
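
The two modes of operation can be illustrated with a small sketch. It does not use the real caTissue API or reproduce the actual Java tool: `FakeApi`, `ValidationError`, `migrate`, and `records_from_error_log` are hypothetical names, and the "email is required" rule merely stands in for API validation.

```python
class ValidationError(Exception):
    """Stands in for whatever error a real API raises on bad data."""


class FakeApi:
    """Hypothetical target API: rejects records that have no email."""

    def __init__(self):
        self.saved = []

    def insert(self, record):
        if not record.get("email"):
            raise ValidationError("email is required")
        self.saved.append(record)


def migrate(records, api, error_log):
    """Insert each record; log failures instead of aborting the run.

    Returns the number of records inserted successfully.
    """
    inserted = 0
    for record in records:
        try:
            api.insert(record)
            inserted += 1
        except ValidationError as exc:
            error_log.append({"record": record, "error": str(exc)})
    return inserted


def records_from_error_log(error_log, corrections):
    """Error recovery mode: yield only the failed records.

    `corrections` maps a record id to the manually corrected fields.
    """
    for entry in error_log:
        record = dict(entry["record"])
        record.update(corrections.get(record["id"], {}))
        yield record
```

A first run reads from the staging records; a second run feeds `records_from_error_log(...)` into the same `migrate` function, so only the previously rejected records are retried.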
Data Cleansing Process
• Staging DB → Migration Tool
  – Successful records → Production DB
  – Error records → Error report & duplicate participant report
    → Manual correction → Staging DB
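
A duplicate participant report of the kind shown in this process could be produced along the following lines. The slides do not describe how caTissue matches participants, so the rule used here (same normalized first name, last name, and birth date) is an assumption, as are the field names.

```python
from collections import defaultdict


def normalize(name):
    """Lower-case a name and drop spaces and punctuation."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def duplicate_report(participants):
    """Group participant ids that share an assumed matching key.

    Returns a list of id groups, one per set of likely duplicates.
    """
    groups = defaultdict(list)
    for p in participants:
        key = (
            normalize(p["last_name"]),
            normalize(p["first_name"]),
            p["birth_date"],
        )
        groups[key].append(p["id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```

Each group in the report is a candidate for manual review; the sketch only flags likely duplicates and does not merge anything.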
Migrating to Production Database
• Performance issues with API
• API validation errors
  – Bad data
     • More specimens than the specified container size allows
     • Email addresses not unique
     • Collection protocols that reference non-existent users
     • State stored in the country field
  – Data type mismatch
     • Dates/numbers stored as strings
  – Value domain mismatch
     • M/F in legacy data versus Male Gender/Female Gender in caTissue
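
Several of the bad-data cases above can be detected in the staging data before the API rejects them. The checks below are a minimal sketch with hypothetical field names (`email`, `login`, `pi_login`, `container_id`, `country`) and a deliberately tiny list of state names; they are not the validation rules of the caTissue API.

```python
from collections import Counter

# Assumed, deliberately incomplete list used to spot states in a country field.
KNOWN_STATES = {"MO", "IL", "Missouri", "Illinois"}


def find_duplicate_emails(users):
    """Return email addresses used by more than one user (case-insensitive)."""
    counts = Counter(u["email"].strip().lower() for u in users)
    return sorted(email for email, n in counts.items() if n > 1)


def find_overfull_containers(capacities, specimens):
    """Return ids of containers holding more specimens than their capacity.

    `capacities` maps a container id to the number of positions it has.
    """
    used = Counter(s["container_id"] for s in specimens)
    return sorted(cid for cid, cap in capacities.items() if used[cid] > cap)


def find_protocols_with_unknown_users(protocols, users):
    """Return titles of protocols whose investigator login is not a known user."""
    known = {u["login"] for u in users}
    return sorted(p["title"] for p in protocols if p["pi_login"] not in known)


def find_state_in_country(addresses, states=KNOWN_STATES):
    """Return ids of addresses whose country field holds a state name."""
    return [a["id"] for a in addresses if a.get("country", "").strip() in states]
```

Running checks like these against the staging database shortens the correct-and-rerun cycle, since fewer records reach the slower API validation step with known problems.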

				