					                Data Modelling and
             Pre-processing for Efficient
             Data Mining in Cardiology

                           Kamil Matoušek
                                 and
                            Petr Aubrecht


IEEE ITAB ’06, Ioannina, October 28, 2006
                        Introduction
• Institute of Physiology, Charles University in Prague
    – cardiology
    – proprietary application Cardiag (Cardiology Diagnostic tool)
• DM: parameter extraction & acquisition of background knowledge,
  including information on disease symptoms
• multiple additional data sources
    – waveforms from the WFDB* in PhysioNet format, for comparison and
      as background knowledge
    – further data sources will be required during the data mining



*Massachusetts General Hospital/Marquette Foundation Waveform Database




           Requirements Analysis
• existing data formats & user experience
  – examinations: analysable electrocardiographic data
  – file types INT, ECG and MAP: time series, processed
    data, selected characteristic beats
• storage of both
  – unstructured data (measured time series)
  – structured information (features of characteristic beats,
    observed and marked symptoms)
• extensibility: measurements (lab.), source systems
• efficient data exploration and DM: significant
  parameters or factors & available diagnoses


Consolidated Cardio DB Schema
• core tables (red):
  – basic measured parameters, external data files, lab.
    examination; personal patient information; patient
    diagnoses encoded by ICD (International Classification of
    Diseases); applied/taken medicines

• extensions (green): particular laboratory results
• lookup tables (black)
• incomplete knowledge: support for partial
  descriptions
• model extensibility: different measurement data



Cardio DB Physical Structure
• Patient: PK ID_patient; Birth_Date, Gender, Height, Weight
• Pat_Info: PK ID_info; FK1 ID_patient; Date_and_time, Anamnesis,
  Objective_status, Epicrisis, Recomendation, Authorization
• Pat_Diags: PK ID_diag; FK1 ID_patient; FK2 Diagnosis_ID; Date_and_time
• Diagnoses: PK Diagnosis_ID; Diagnosis, ICD
• Pat_Meds: PK ID_med; FK1 ID_patient; FK2 Medicine_ID
• Medicines: PK Medicine_ID; Medicine
• Measurement: PK ID_measurement; FK1 ID_patient; Date_and_time
• Laboratory: PK ID_laboratory; FK1 ID_measurement; FK2 Parameter_ID; Value
• Parameters: PK Parameter_ID; Parameter, Parameter_Group
• Results: PK ID_result; FK1 ID_measurement; Link, Data, Author
• ECG_core_points: PK ID_point; FK1 ID_result; FK2 ID_point_type;
  Sample_no, Lead, Autogenerated
• Point_types: PK ID_point_type; Description
• Cardiag_ECG: PK ID_Card_ECG; FK1 ID_result; Channels, Filtered_samples,
  Filename, Sampling_interval, P_on, P_off, QRS_on, QRS_off, T_off,
  RR_distance
• Cardiag_INT: PK ID_Card_INT; FK1 ID_result; Channels,
  Samples_per_channel, d_code, h_code, Filename, Sampling_interval,
  clearSampleNo
• Cardiag_MAP: PK ID_Card_ECG; FK1 ID_result; Channels,
  Samples_per_channel, Filename, Sampling_interval, Samples_till_P_on,
  Samples_till_P_off, Samples_till_QRS_on, Samples_till_QRS_off,
  Samples_till_T_off
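
To make the relationships in the physical structure more concrete, the following is a minimal sketch of four of its tables (Patient, Measurement, Parameters, Laboratory) and one join across them. It is not the authors' implementation: the slides name Oracle 9i, whereas this sketch uses Python's built-in sqlite3 module as a stand-in. Table and column names follow the slide; the column types, the sample rows and the helper names build_demo_db and lab_values_for_patient are assumptions made for illustration.

```python
import sqlite3

# Subset of the Cardio DB tables. Names follow the slide;
# the column types are assumed, since the slide does not give them.
SCHEMA = """
CREATE TABLE Patient (
    ID_patient INTEGER PRIMARY KEY,
    Birth_Date TEXT,
    Gender     TEXT,
    Height     REAL,
    Weight     REAL
);
CREATE TABLE Measurement (
    ID_measurement INTEGER PRIMARY KEY,
    ID_patient     INTEGER NOT NULL REFERENCES Patient(ID_patient),
    Date_and_time  TEXT
);
CREATE TABLE Parameters (
    Parameter_ID    INTEGER PRIMARY KEY,
    Parameter       TEXT,
    Parameter_Group TEXT
);
CREATE TABLE Laboratory (
    ID_laboratory  INTEGER PRIMARY KEY,
    ID_measurement INTEGER NOT NULL REFERENCES Measurement(ID_measurement),
    Parameter_ID   INTEGER NOT NULL REFERENCES Parameters(Parameter_ID),
    Value          REAL
);
"""


def build_demo_db():
    """Create an in-memory database with one invented patient record."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # All rows below are made-up sample data, not taken from the study.
    conn.execute("INSERT INTO Patient VALUES (1, '1950-01-01', 'F', 165.0, 70.0)")
    conn.execute("INSERT INTO Measurement VALUES (10, 1, '2006-10-28 09:00')")
    conn.execute("INSERT INTO Parameters VALUES (100, 'potassium', 'ions')")
    conn.execute("INSERT INTO Laboratory VALUES (1000, 10, 100, 4.2)")
    conn.commit()
    return conn


def lab_values_for_patient(conn, patient_id):
    """Return (parameter name, value) pairs for all lab results of a patient."""
    return conn.execute(
        """
        SELECT p.Parameter, l.Value
        FROM Laboratory AS l
        JOIN Measurement AS m ON m.ID_measurement = l.ID_measurement
        JOIN Parameters  AS p ON p.Parameter_ID = l.Parameter_ID
        WHERE m.ID_patient = ?
        ORDER BY l.ID_laboratory
        """,
        (patient_id,),
    ).fetchall()
```

Calling lab_values_for_patient(build_demo_db(), 1) walks Laboratory to Measurement to Patient, which is the path a laboratory extension table takes to reach the core patient data.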




   Overall DM Process Schema




    Efficient DM Process Design
• processing of time series together with the
  accompanying structured information
• output: typical patterns that indicate
  manifestations of potential diseases or
  diagnoses
• waveform database resources can provide
   – background knowledge (e.g. for ILP)
   – explanation of results
       • understandable, practitioner-friendly



            Data Preprocessing I
• long-term project SumatraTT 2
• written in Java
• originally aimed at data preprocessing; analyses
  (e.g. of time series) were added later
• more than 100 modules
• I/O: plain text files, DBF, XML, SQL
  databases, WEKA, Excel etc.



                     Screenshot




           Data Preprocessing II
• data visualisation: first-touch preview, static,
  interactive, advanced (scatter plot, radix plot)
• internal database
   – convenient temporary storage in an embedded SQL
     database
• data cleaning
   – missing values, normalisation, discretisation,
     conversion to a unified (number, date) format
   – outliers, errors
• output to Feature DB
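
The cleaning steps listed above can be illustrated with a small sketch. It is not SumatraTT code (SumatraTT is a Java tool with its own modules); it is a plain Python illustration of three of the named operations under simple assumptions: missing values are filled with the mean of the observed values, normalisation is min-max scaling to [0, 1], and discretisation uses equal-width bins. The function names and the sample values are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalise(values):
    """Rescale values linearly to [0, 1]; a constant series maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretise_equal_width(values, bins):
    """Assign each value a bin index 0..bins-1 using equal-width intervals."""
    if bins < 1:
        raise ValueError("bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would land in bin `bins`, so clamp it to the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


# Invented example series with one missing value.
heart_rates = [60, None, 80, 100]
cleaned = impute_mean(heart_rates)
scaled = min_max_normalise(cleaned)
binned = discretise_equal_width(cleaned, bins=2)
```

Other choices (median imputation, z-score normalisation, equal-frequency bins) fit the same pipeline shape; which one is appropriate depends on the parameter being cleaned.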


SumatraTT FirstTouch Preview (screenshot)




                         Analysis
• trends
   – calculation of time-series features
• contingency tables
   – required by data mining algorithms
• specific analyses – by scripting (Java, BSH)
• cooperation support
   – automatic project documentation, similar to
     JavaDoc
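
As an illustration of the first two analysis types above, the sketch below computes one simple trend feature of a time series (the least-squares slope against the sample index) and a contingency table of two categorical attributes. It is a Python illustration under these assumptions, not the SumatraTT modules themselves; the function names and sample data are hypothetical.

```python
from collections import Counter


def linear_trend(values):
    """Least-squares slope of the values against their index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def contingency_table(rows, cols):
    """Count co-occurrences of two categorical attributes.

    Returns (row labels, column labels, table), where table[i][j] is the
    number of records with row label i and column label j.
    """
    if len(rows) != len(cols):
        raise ValueError("attributes must have the same number of records")
    counts = Counter(zip(rows, cols))
    row_labels = sorted(set(rows))
    col_labels = sorted(set(cols))
    table = [[counts[(r, c)] for c in col_labels] for r in row_labels]
    return row_labels, col_labels, table


# Invented example data: a steadily rising series and five patient records.
slope = linear_trend([1.0, 3.0, 5.0, 7.0])
symptom = ["yes", "yes", "no", "no", "yes"]
diagnosis = ["A", "B", "A", "A", "A"]
row_labels, col_labels, table = contingency_table(symptom, diagnosis)
```

The table has one row per symptom value and one column per diagnosis value, which is the input form that many association-based data mining algorithms expect.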


         Project Documentation




    SumatraTT Analysis Project




    SumatraTT – Ongoing Work
• calling external programs to automate repeated
  cycles with modified parameters
• remote data transmission
  – remote sensors
  – parallel processing
  – grid service
  – based on the WebRowSet (XML) standard
• more formats, transformations, visualisations, …


           Experts in DM Process
• interpretation of DM results
  – are they sufficient, and do they bring information
    that can be exploited in practice?
  – if not, modify & repeat parts of the DM process

• reusable knowledge base
  – shared ontology repository for expert cooperation
  – information about results and parameter changes
  – learning how to automatically change parameters of
    individual phases of data pre-processing and DM
  – previous mining results = background knowledge for
    subsequent DM tasks



                     Conclusions
• data formats studied; transformations of the individual data
  models into a common platform proposed
• consolidated Cardio DB for examination data and an
  efficient DM process designed, with the DB as a shared
  data storage platform for scientific experiments
• current implementation: Oracle 9i, experimental
  data from the Institute of Physiology, Charles
  University in Prague
• data transformations are being prepared using
  SumatraTT
• the utility of the initial structure & results is being
  evaluated


                   Future Tasks
• conceptual mapping of Cardio DB &
  knowledge repository to: general HL7
  concepts, XML based ecgML, DICOM
  waveforms, etc.
• automated mining of the results stored in the
  knowledge base (longer term)




              Acknowledgements
Academy of Sciences of the Czech Republic

• grant

  New Methods and Tools
  for Knowledge Discovery in Databases

• Information Society project

  Knowledge-Based Support
  for Diagnostics and Prediction in Cardiology


                        Questions?
Contact:    Kamil Matoušek, Petr Aubrecht
            Gerstner Laboratory
            Department of Cybernetics
            Faculty of Electrical Engineering
            Czech Technical University in Prague
            Technická 2
            166 27 Praha 6, Czech Republic

E-mail:     {matousek,aubrech}@labe.felk.cvut.cz

WWW:        http://krizik.felk.cvut.cz



