
CSE 300

Data Mining and Its Applications in Medicine

By Radhika

Data Mining and Medicine

• History
  – The past 20 years: growth alongside relational databases
    – More dimensions added to database queries
  – Medicine is one of the earliest and most successful application areas of data mining
  – Mid-1800s: London was hit by infectious disease
    – Two theories
      – Miasma theory: bad air propagated the disease
      – Germ theory: the disease was water-borne
  – Advantages
    – Discover trends even when we do not understand the reasons behind them
    – May also surface irrelevant patterns that confuse rather than enlighten
    – Protect against unaided human inference of patterns; provide quantifiable measures that aid human judgment
• Data mining
  – Looks for patterns that are persistent and meaningful
  – Also called Knowledge Discovery in Databases (KDD)

The Future of Data Mining

• The 10 biggest killers in the US
• Data mining = the process of discovering interesting, meaningful, and actionable patterns hidden in large amounts of data

Major Issues in Medical Data Mining

• Heterogeneity of medical data
  – Volume and complexity
  – Physician's interpretation
  – Poor mathematical categorization
  – Canonical form
  – Solution: standard vocabularies, interfaces between different data sources, integration, and the design of electronic patient records
• Ethical, legal, and social issues
  – Data ownership
  – Lawsuits
  – Privacy and security of human data
  – Expected benefits
  – Administrative issues

Why Data Preprocessing?

• Patient records consist of clinical and lab parameters and results of particular investigations, specific to tasks
  – Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – Noisy: containing errors or outliers
  – Inconsistent: containing discrepancies in codes or names
  – Temporal: parameters of chronic diseases are recorded over time
• No quality data, no quality mining results!
  – A data warehouse needs consistent integration of quality data
  – In the medical domain, handling incomplete, inconsistent, or noisy data requires people with domain knowledge

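A minimal sketch of this kind of cleaning on the Pima Indians diabetes data used later in these slides (the file name, column names, and the zero-means-missing convention are assumptions, not part of the slides):

    import pandas as pd

    # Hypothetical local copy of the UCI Pima Indians diabetes file (column names assumed).
    cols = ["pregnancies", "glucose", "blood_pressure", "skin_thickness",
            "insulin", "bmi", "pedigree", "age", "diabetes"]
    data = pd.read_csv("pima-indians-diabetes.csv", names=cols)

    # In this file, zeros in several clinical columns conventionally denote missing values.
    clinical = ["glucose", "blood_pressure", "skin_thickness", "insulin", "bmi"]
    data[clinical] = data[clinical].replace(0, float("nan"))

    # Keep only complete records; the slides report 392 of 768 rows surviving cleaning.
    clean = data.dropna()
    print(len(data), "raw rows ->", len(clean), "clean rows")
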
What is Data Mining? The KDD Process

Databases -> Data Cleaning -> Data Integration -> Data Warehouse
  -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation

From Tables and Spreadsheets to Data Cubes

• A data warehouse is based on a multidimensional data model that views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  – Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
  – A fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
• W. H. Inmon: "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process."

Data Warehouse vs. Heterogeneous DBMS

• Data warehouse: update-driven, high performance
  – Information from heterogeneous sources is integrated in advance and stored in the warehouse for direct query and analysis
  – Does not contain the most current information
  – Query processing does not interfere with processing at local sources
  – Stores and integrates historical information
  – Supports complex multidimensional queries

Data Warehouse vs. Operational DBMS

• OLTP (on-line transaction processing)
  – The major task of traditional relational DBMSs
  – Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
• OLAP (on-line analytical processing)
  – The major task of a data warehouse system
  – Data analysis and decision making
• Distinct features (OLTP vs. OLAP):
  – User and system orientation: customer vs. market
  – Data contents: current, detailed vs. historical, consolidated
  – Database design: ER + application vs. star + subject
  – View: current, local vs. evolutionary, integrated
  – Access patterns: update vs. read-only but complex queries

Why Separate Data Warehouse?

• High performance for both systems
  – DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
  – Warehouse tuned for OLAP: complex OLAP queries, multidimensional views, consolidation
• Different functions and different data:
  – Missing data: decision support requires historical data that operational DBs do not typically maintain
  – Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  – Data quality: different sources typically use inconsistent data representations, codes, and formats that have to be reconciled

Typical OLAP Operations

• Roll up (drill-up): summarize data
  – by climbing up a hierarchy or by dimension reduction
• Drill down (roll down): the reverse of roll-up
  – from a higher-level summary to a lower-level summary or detailed data, or by introducing new dimensions
• Slice and dice:
  – project and select
• Pivot (rotate):
  – reorient the cube; visualization; 3D to a series of 2D planes
• Other operations
  – Drill across: involving (across) more than one fact table
  – Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

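A rough illustration of roll-up, slice, and pivot on a tiny made-up sales table (the table, column names, and values are invented for demonstration; pandas assumed available):

    import pandas as pd

    # Tiny hypothetical fact table: one row per (item, quarter, city) with a dollars_sold measure.
    sales = pd.DataFrame({
        "item":    ["TV", "TV", "Phone", "Phone", "TV", "Phone"],
        "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
        "city":    ["NY", "NY", "NY", "NY", "LA", "LA"],
        "dollars_sold": [400, 350, 200, 250, 300, 150],
    })

    # Roll up: summarize dollars_sold by climbing from (item, quarter, city) to (item, quarter).
    rollup = sales.groupby(["item", "quarter"])["dollars_sold"].sum()

    # Slice: fix one dimension (quarter = Q1) and keep the rest.
    q1_slice = sales[sales["quarter"] == "Q1"]

    # Pivot (rotate): view the same numbers as items x quarters.
    pivoted = sales.pivot_table(values="dollars_sold", index="item",
                                columns="quarter", aggfunc="sum")
    print(rollup, q1_slice, pivoted, sep="\n\n")
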
Multi-Tiered Architecture

• Data sources: operational DBs and other sources
• Data storage: extract, transform, load, and refresh into the data warehouse and data marts, described by metadata and coordinated by a monitor & integrator
• OLAP engine: OLAP server that serves the stored data
• Front-end tools: analysis, query, reports, data mining

Steps of a KDD Process

• Learning the application domain:
  – relevant prior knowledge and the goals of the application
• Creating a target data set: data selection
• Data cleaning and preprocessing (may take 60% of the effort!)
• Data reduction and transformation:
  – find useful features, dimensionality/variable reduction, invariant representation
• Choosing the functions of data mining:
  – summarization, classification, regression, association, clustering
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation:
  – visualization, transformation, removing redundant patterns, etc.
• Use of the discovered knowledge

Common Techniques in Data Mining

• Predictive data mining
  – Considered the most important
  – Classification: relate one set of variables in the data to response variables
  – Regression: estimate some continuous value
• Descriptive data mining
  – Clustering: discovering groups of similar instances
  – Association rule extraction
    – over variables/observations
  – Summarization of group descriptions

Leukemia

• Different types of cells look very similar
• Given a number of samples (patients):
  – Can we diagnose the disease accurately?
  – Predict the outcome of treatment?
  – Recommend the best treatment based on previous treatments?
• Solution: data mining on micro-array data
• 38 training patients, 34 testing patients, ~7000 attributes per patient
• 2 classes: Acute Lymphoblastic Leukemia (ALL) vs. Acute Myeloid Leukemia (AML)

Clustering / Instance-Based Learning

• Uses specific instances to perform classification rather than general IF-THEN rules
• Nearest-neighbor classifier
• Among the most studied algorithms for medical purposes
• Clustering: partitioning a data set into several groups (clusters) such that
  – Homogeneity: objects belonging to the same cluster are similar to each other
  – Separation: objects belonging to different clusters are dissimilar to each other
• Three elements
  – The set of objects
  – The set of attributes
  – The distance measure

Measuring the Dissimilarity of Objects

• Find the best matching instance
• Distance function
  – Measures the dissimilarity between a pair of data objects
• Things to consider
  – Usually very different for interval-scaled, boolean, nominal, ordinal, and ratio-scaled variables
  – Weights should be associated with different variables based on the application and the data semantics
• The quality of a clustering result depends on both the distance measure adopted and its implementation

Minkowski Distance

• Minkowski distance: a generalization

    d(i, j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q )^(1/q),   q > 0

• If q = 2, d is the Euclidean distance
• If q = 1, d is the Manhattan distance
• Example: for Xi = (1, 7) and Xj = (7, 1), the Euclidean distance (q = 2) is sqrt(6^2 + 6^2) ≈ 8.48 and the Manhattan distance (q = 1) is 6 + 6 = 12

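A minimal check of these two cases in code (numpy assumed available):

    import numpy as np

    def minkowski(x, y, q):
        # Minkowski distance: (sum_k |x_k - y_k|^q)^(1/q), q > 0.
        return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** q) ** (1.0 / q)

    xi, xj = (1, 7), (7, 1)
    print(minkowski(xi, xj, 2))   # Euclidean: sqrt(72) ≈ 8.49
    print(minkowski(xi, xj, 1))   # Manhattan: 12.0
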
Binary Variables

• A contingency table for binary data:

                          Object j
                         1      0     sum
    Object i   1         a      b     a+b
               0         c      d     c+d
               sum      a+c    b+d     p

• Simple matching coefficient (dissimilarity):

    d(i, j) = (b + c) / (a + b + c + d)

Dissimilarity between Binary Variables

• Example:

                A1  A2  A3  A4  A5  A6  A7
    Object 1     1   0   1   1   1   0   0
    Object 2     1   1   1   0   0   0   1

• Contingency table for Object 1 vs. Object 2:

                          Object 2
                         1    0   sum
    Object 1   1         2    2    4
               0         2    1    3
               sum       4    3    7

• d(O1, O2) = (2 + 2) / (2 + 2 + 2 + 1) = 4/7

k-Means Clustering Algorithm

• Initialization
  – Arbitrarily choose k objects as the initial cluster centers (centroids)
• Iterate until there is no change
  – For each object Oi
    – Calculate the distances between Oi and the k centroids
    – (Re)assign Oi to the cluster whose centroid is closest to Oi
  – Update the cluster centroids based on the current assignment

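A compact sketch of the same loop using scikit-learn (assumed available) on made-up 2-D points:

    import numpy as np
    from sklearn.cluster import KMeans

    # Made-up 2-D objects; in practice each row would be a patient's attribute vector.
    X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]])

    # k-means repeats exactly the loop above: assign each object to the nearest
    # centroid, then recompute the centroids, until the assignment stops changing.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)           # cluster assignment of each object
    print(km.cluster_centers_)  # final centroids
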
k-Means Clustering Method

[Figure: scatter plots illustrating one k-means pass on 2-D objects: the current clusters and their cluster means, objects relocated to the nearest mean, and the resulting new clusters.]

Dataset

• Data set from the UCI repository
  – http://kdd.ics.uci.edu/
• 768 female Pima Indians evaluated for diabetes
• After data cleaning, 392 data entries remain

Hierarchical Clustering

• Groups observations based on dissimilarity
• Compacts the database into "labels" that represent groups of observations
• Measures of similarity/dissimilarity
  – Euclidean distance
  – Manhattan distance
• Types of clustering (linkage)
  – Single link
  – Average link
  – Complete link

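A small sketch of these linkage choices with SciPy (assumed available); the points are invented for illustration:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    # Invented observations; in practice rows would be patients, columns clinical variables.
    X = np.array([[1.0, 2.0], [1.2, 1.9], [5.0, 6.1], [5.2, 5.9], [9.0, 9.5]])

    d = pdist(X, metric="euclidean")                      # pairwise dissimilarities
    for method in ("single", "average", "complete"):
        Z = linkage(d, method=method)                     # build the hierarchy (dendrogram data)
        labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
        print(method, labels)
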
Hierarchical Clustering: Comparison

[Figure: the same six points clustered with single-link, complete-link, average-link, and centroid-distance linkage, showing how the choice of linkage changes the resulting groupings.]

Compare Dendrograms

[Figure: dendrograms over points 1-6 for the single-link, complete-link, average-link, and centroid-distance clusterings above.]

Which Distance Measure is Better?

• Each method has both advantages and disadvantages; the choice is application-dependent
• Single-link
  – Can find irregularly shaped clusters
  – Sensitive to outliers
• Complete-link, average-link, and centroid distance
  – Robust to outliers
  – Tend to break large clusters
  – Prefer spherical clusters

Dendrogram from the Dataset (single link)

[Figure: single-link dendrogram of the Pima observations.]
• Single link amounts to a minimum spanning tree through the observations
• The single observation that is last to join a cluster is a patient whose blood pressure is in the bottom quartile, skin thickness is in the bottom quartile, and BMI is in the bottom half
• Her insulin, however, was the largest in the data set, and she is a 59-year-old diabetic

Dendrogram from the Dataset (complete link)

[Figure: complete-link dendrogram of the Pima observations.]
• Uses the maximum dissimilarity between observations in one cluster when compared to another

Dendrogram from the Dataset (average link)

[Figure: average-link dendrogram of the Pima observations.]
• Uses the average dissimilarity between observations in one cluster when compared to another

Supervised versus Unsupervised Learning

• Supervised learning (classification)
  – Supervision: training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data are classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the task is to establish the existence of classes or clusters in the data

Classification and Prediction

• Derive models that can use patient-specific information to aid clinical decision making
• Requires an a priori decision on the predictors and the variables to predict
• There is no method for finding predictors that are not present in the data
• Numeric response
  – Least squares regression
• Categorical response
  – Classification trees
  – Neural networks
  – Support vector machines
• Decision models
  – Prognosis, diagnosis, and treatment planning
  – Embedded in clinical information systems

Least Squares Regression

• Find a linear function of the predictor variables that minimizes the sum of squared differences with the response
• A supervised learning technique
• In our dataset: predict insulin from glucose and BMI

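A sketch of that insulin-from-glucose-and-BMI fit with scikit-learn, assuming the cleaned Pima data sits in the `clean` DataFrame built in the preprocessing sketch (column names are assumptions):

    from sklearn.linear_model import LinearRegression

    # Predictors and response from the cleaned Pima DataFrame (assumed column names).
    X = clean[["glucose", "bmi"]]
    y = clean["insulin"]

    # Ordinary least squares: minimizes the sum of squared differences
    # between predicted and observed insulin values.
    model = LinearRegression().fit(X, y)
    print(model.intercept_, model.coef_)   # the fitted linear function
    print(model.score(X, y))               # R^2 on the training data
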
Decision Trees

• A decision tree
  – Each internal node tests an attribute
  – Each branch corresponds to an attribute value
  – Each leaf node assigns a classification
• ID3 algorithm
  – Uses training objects with known class labels to classify testing objects
  – Ranks attributes with an information gain measure
  – Builds trees of minimal height
    – the least number of tests needed to classify an object
  – Used in commercial tools, e.g., Clementine
  – ASSISTANT
    – Deals with medical datasets
    – Handles incomplete data
    – Discretizes continuous variables
    – Prunes unreliable parts of the tree
    – Classifies data

Decision Trees

[Figure: example decision tree.]

Algorithm for Decision Tree Induction

• Basic algorithm (a greedy algorithm)
  – Attributes are categorical (if continuous-valued, they are discretized in advance)
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all training examples are at the root
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
  – Examples are partitioned recursively based on the selected attributes

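A sketch of this induction on the cleaned Pima data with scikit-learn (again reusing the assumed `clean` DataFrame; using entropy makes the split criterion information gain):

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Assumed column names from the earlier preprocessing sketch.
    X = clean.drop(columns="diabetes")
    y = clean["diabetes"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # Greedy top-down induction; depth is capped to keep the tree readable.
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
    tree.fit(X_train, y_train)
    print(export_text(tree, feature_names=list(X.columns)))
    print("test accuracy:", tree.score(X_test, y_test))
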
Training Dataset

    Patient   Age      BMI      Hereditary   Vision      Risk of Condition X
    P1        <=30     high     no           fair        no
    P2        <=30     high     no           excellent   no
    P3        >40      high     no           fair        yes
    P4        31..40   medium   no           fair        yes
    P5        31..40   low      yes          fair        yes
    P6        31..40   low      yes          excellent   no
    P7        >40      low      yes          excellent   yes
    P8        <=30     medium   no           fair        no
    P9        <=30     low      yes          fair        yes
    P10       31..40   medium   yes          fair        yes
    P11       <=30     medium   yes          excellent   yes
    P12       >40      medium   no           excellent   yes
    P13       >40      high     yes          fair        yes
    P14       31..40   medium   no           excellent   no

Construction of a Decision Tree for "Condition X"

    Root: [P1..P14]  (Yes: 9, No: 5)  -- split on Age?
      Age <= 30: [P1,P2,P8,P9,P11]  (Yes: 2, No: 3)  -- split on Hereditary?
        Hereditary = no:  [P1,P2,P8]  (Yes: 0, No: 3)  -> NO
        Hereditary = yes: [P9,P11]    (Yes: 2, No: 0)  -> YES
      Age 31..40: [P4,P5,P6,P10,P14]  (Yes: 3, No: 2)  -- split on Vision?
        Vision = excellent: [P6,P14]     (Yes: 0, No: 2)  -> NO
        Vision = fair:      [P4,P5,P10]  (Yes: 3, No: 0)  -> YES
      Age > 40: [P3,P7,P12,P13]  (Yes: 4, No: 0)  -> YES

Entropy and Information Gain

• S contains s_i tuples of class C_i for i = 1, ..., m
• Information (entropy) required to classify an arbitrary tuple:

    I(s_1, s_2, ..., s_m) = - sum_{i=1..m} (s_i / s) * log2(s_i / s)

• Entropy of attribute A with values {a_1, a_2, ..., a_v}:

    E(A) = sum_{j=1..v} ((s_1j + ... + s_mj) / s) * I(s_1j, ..., s_mj)

• Information gained by branching on attribute A:

    Gain(A) = I(s_1, s_2, ..., s_m) - E(A)

Entropy and Information Gain

• Select the attribute with the highest information gain (or the greatest entropy reduction)
  – Such an attribute minimizes the information needed to classify the samples

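Applying these formulas to the "Condition X" training table above, a short sketch (plain Python) reproduces the choice of Age for the root split:

    from math import log2

    def info(counts):
        # I(s_1, ..., s_m) = -sum (s_i/s) * log2(s_i/s), ignoring empty classes.
        s = sum(counts)
        return -sum(c / s * log2(c / s) for c in counts if c)

    # Class distribution at the root: 9 yes, 5 no.
    i_root = info([9, 5])                                # ~0.940

    # Partition induced by Age: <=30 -> (2 yes, 3 no), 31..40 -> (3 yes, 2 no), >40 -> (4 yes, 0 no).
    parts = [[2, 3], [3, 2], [4, 0]]
    e_age = sum(sum(p) / 14 * info(p) for p in parts)    # ~0.694
    print("Gain(Age) =", i_root - e_age)                 # ~0.246
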
Rule Induction

• IF conditions THEN conclusion
• E.g., CN2
  – Concept description:
    – Characterization: provides a concise and succinct summarization of a given collection of data
    – Comparison: provides descriptions comparing two or more collections of data
• Training set, testing set
• Rules can be imprecise
• Predictive accuracy
  – P / (P + N)

Example Used in a Clinic

• A hip arthroplasty trauma surgeon predicts the patient's long-term clinical status after surgery
• The outcome is evaluated during follow-ups for 2 years
• 2 modeling techniques
  – Naive Bayesian classifier
  – Decision trees
• Bayesian classifier
  – P(outcome=good) = 0.55 (11/20 good)
  – The probability gets updated as more attributes are considered
  – P(timing=good | outcome=good) = 9/11 (0.846)
  – P(outcome=bad) = 9/20
  – P(timing=good | outcome=bad) = 5/9

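A sketch of the Bayesian update for the single attribute shown on this slide, using the raw relative frequencies (the clinic's actual model combines more attributes, and the slide's 0.846 figure may reflect a smoothed estimate):

    # Prior and likelihoods quoted on the slide.
    p_good = 11 / 20                 # P(outcome = good)
    p_bad = 9 / 20                   # P(outcome = bad)
    p_t_given_good = 9 / 11          # P(timing = good | outcome = good)
    p_t_given_bad = 5 / 9            # P(timing = good | outcome = bad)

    # Bayes theorem: P(good | timing=good) is proportional to P(timing=good | good) * P(good).
    num_good = p_t_given_good * p_good
    num_bad = p_t_given_bad * p_bad
    posterior_good = num_good / (num_good + num_bad)
    print(posterior_good)            # ~0.64: observing good timing raises P(good) from 0.55
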
Nomogram

[Figure: nomogram for the classifier above.]

Bayesian Classification

• Bayesian classifier vs. decision tree
  – Decision tree: predicts the class label
  – Bayesian classifier: a statistical classifier; predicts class membership probabilities
• Based on Bayes theorem; estimates the posterior probability
• Naive Bayesian classifier:
  – A simple classifier that assumes attribute independence
  – High speed when applied to large databases
  – Comparable in performance to decision trees

Bayes Theorem

• Let X be a data sample whose class label is unknown
• Let Hi be the hypothesis that X belongs to a particular class Ci
• P(Hi) is the prior probability that X belongs to class Ci
  – Can be estimated as ni / n from the training data samples
  – n is the total number of training data samples
  – ni is the number of training data samples of class Ci
• Bayes theorem:

    P(Hi | X) = P(X | Hi) * P(Hi) / P(X)

More Classification Techniques

• Neural networks
  – Similar to the pattern-recognition properties of biological systems
  – Most frequently used:
    – Multi-layer perceptrons
      – Inputs (with a bias term) connected by weights to hidden and output layers
    – Backpropagation neural networks
• Support vector machines
  – Separate the database into mutually exclusive regions
    – Transform to another problem space
    – Kernel functions (dot products)
    – The output for new points is predicted by their position
• Comparison with classification trees
  – It is not possible to know which features or combinations of features most influence a prediction

Multilayer Perceptrons

• Apply non-linear transfer functions to weighted sums of inputs
• Werbos (backpropagation) algorithm
  – Start from random weights
  – Training set, testing set

Support Vector Machines

• 3 steps
  – Support vector creation
  – The maximal distance between points is found
  – A perpendicular decision boundary is placed
• Allows some points to be misclassified (a soft margin)
• Pima Indian data with X1 (glucose) and X2 (BMI)

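A sketch of that two-feature SVM on the cleaned Pima data with scikit-learn (column names again assumed from the earlier preprocessing sketch):

    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Two predictors from the slide: X1 = glucose, X2 = BMI; target = diabetes onset.
    X = clean[["glucose", "bmi"]]
    y = clean["diabetes"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # C controls how many points may be misclassified (the soft margin);
    # the RBF kernel plays the role of the kernel function mentioned above.
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    svm.fit(X_train, y_train)
    print("test accuracy:", svm.score(X_test, y_test))
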
What is Association Rule Mining?

• Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories

    PatientID   Conditions
    1           High LDL, Low HDL, High BMI, Heart Failure
    2           High LDL, Low HDL, Heart Failure, Diabetes
    3           Diabetes
    4           High LDL, Low HDL, Heart Failure
    5           High BMI, High LDL, Low HDL, Heart Failure

• Example of an association rule:

    {High LDL, Low HDL} => {Heart Failure}

  People who have high LDL ("bad" cholesterol) and low HDL ("good" cholesterol) are at higher risk of heart failure.

Association Rule Mining

• Market basket analysis
  – Items that are frequently bought together are placed together
  – Healthcare
    – Understanding associations among patients with demands for similar treatments and services
  – Goal: find items for which the joint probability of occurrence is high
  – A basket is a set of binary-valued variables
  – The results form association rules, augmented with support and confidence

Association Rule Mining

• Association rule
  – An implication expression of the form X => Y, where X and Y are itemsets and X ∩ Y = ∅
• Rule evaluation metrics
  – Support (s): the fraction of transactions that contain both X and Y

      s = P(X ∪ Y) = (# transactions containing X ∪ Y) / (# transactions in D)

  – Confidence (c): measures how often items in Y appear in transactions that contain X

      c = P(Y | X) = (# transactions containing X ∪ Y) / (# transactions containing X)

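For the rule {High LDL, Low HDL} => {Heart Failure} over the five patients two slides back, a small sketch of both metrics:

    # Transactions from the patient/conditions table above.
    transactions = [
        {"High LDL", "Low HDL", "High BMI", "Heart Failure"},
        {"High LDL", "Low HDL", "Heart Failure", "Diabetes"},
        {"Diabetes"},
        {"High LDL", "Low HDL", "Heart Failure"},
        {"High BMI", "High LDL", "Low HDL", "Heart Failure"},
    ]
    X = {"High LDL", "Low HDL"}
    Y = {"Heart Failure"}

    n_x = sum(X <= t for t in transactions)           # transactions containing X
    n_xy = sum((X | Y) <= t for t in transactions)    # transactions containing X and Y
    support = n_xy / len(transactions)                # 4/5 = 0.8
    confidence = n_xy / n_x                           # 4/4 = 1.0
    print(support, confidence)
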
The Apriori Algorithm

• Starts with the most frequent 1-itemsets
• Includes only those items that pass the support threshold
• Uses the 1-itemsets to generate 2-itemsets, and so on
• Stops when the threshold is not satisfied by any candidate itemset

    L1 = {frequent 1-itemsets};
    for (k = 1; Lk != ∅; k++) do
        Candidate generation: Ck+1 = candidates generated from Lk;
        Candidate counting: for each transaction t in the database,
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with at least min_sup support;
    return ∪k Lk;

Apriori-Based Mining

Database D (min_sup = 0.5, i.e., a support count of at least 2):

    TID   Items
    10    a, c, d
    20    b, c, e
    30    a, b, c, e
    40    b, e

Scan D for 1-candidates and their counts: a:2, b:3, c:3, d:1, e:3
  -> Frequent 1-itemsets: a:2, b:3, c:3, e:3

2-candidates: ab, ac, ae, bc, be, ce
Scan D and count: ab:1, ac:2, ae:1, bc:2, be:3, ce:2
  -> Frequent 2-itemsets: ac:2, bc:2, be:3, ce:2

3-candidates: bce
Scan D and count: bce:2
  -> Frequent 3-itemsets: bce:2

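A compact Python sketch of this level-wise search (it omits the subset-pruning refinement of the full algorithm), checked against the small database above:

    from itertools import combinations

    def apriori(transactions, min_sup):
        # Level-wise search: frequent k-itemsets are used to generate (k+1)-candidates.
        n = len(transactions)
        frequent = {}
        level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
        k = 1
        while level:
            # Candidate counting: one scan of the database per level.
            counts = {c: sum(c <= t for t in transactions) for c in level}
            current = {c: s for c, s in counts.items() if s / n >= min_sup}
            frequent.update(current)
            # Candidate generation: join frequent k-itemsets whose union has k+1 items.
            level = []
            for a, b in combinations(sorted(current, key=sorted), 2):
                union = a | b
                if len(union) == k + 1 and union not in level:
                    level.append(union)
            k += 1
        return frequent

    db = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
    print(apriori([frozenset(t) for t in db], min_sup=0.5))
    # Matches the trace above, e.g. frozenset({'b', 'c', 'e'}) has count 2.
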
Principal Component Analysis

• Principal components (PCs)
  – With a large number of variables, it is highly likely that some subsets of the variables are strongly correlated with each other; reduce the variables while retaining the variability in the dataset
  – Linear combinations of the variables in the database
    – The variance of each PC is maximized
      – displays as much of the spread of the original data as possible
    – The PCs are orthogonal to each other
      – minimizes the overlap between the variables
    – Each component is normalized so that its sum of squares is unity
      – easier for mathematical analysis
  – Number of PCs < number of variables
    – Associations are found
    – A small number of PCs explains a large amount of the variance
  – Example: 768 female Pima Indians evaluated for diabetes
    – Number of times pregnant, two-hour oral glucose tolerance test (OGTT) plasma glucose, diastolic blood pressure, triceps skin fold thickness, two-hour serum insulin, BMI, diabetes pedigree function, age, diabetes onset within the last 5 years

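A sketch of this reduction on the cleaned Pima data with scikit-learn (reusing the assumed `clean` DataFrame; the variables are standardized first because PCA is driven by variance and the variables have different units):

    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Drop the class label and standardize the clinical variables.
    features = clean.drop(columns="diabetes")
    Z = StandardScaler().fit_transform(features)

    pca = PCA()                             # as many orthogonal components as variables
    scores = pca.fit_transform(Z)           # the data expressed in the new coordinates
    print(pca.explained_variance_ratio_)    # share of the variance explained by each PC
    print(pca.components_[0])               # loadings of the first principal component
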
PCA Example

[Figure: PCA results for the Pima diabetes variables.]

National Cancer Institute

• CancerNet: http://www.nci.nih.gov
• CancerNet for Patients and the Public
• CancerNet for Health Professionals
• CancerNet for Basic Researchers
• CancerLit

Conclusion

• About three-quarters of a billion people's medical records are available electronically
• Data mining in medicine is distinct from other fields because of the nature of the data: heterogeneous, and subject to ethical, legal, and social constraints
• The most commonly used technique is classification and prediction, with different techniques applied in different cases
• Association rules describe the data in the database
• Medical data mining can be the most rewarding, despite the difficulty

Thank you !!!