

Attribute Interactions in Medical Data Analysis

     A. Jakulin1, I. Bratko1,2, D. Smrke3, J. Demšar1, B. Zupan1,2,4

1.     University of Ljubljana, Slovenia.
2.     Jožef Stefan Institute, Ljubljana, Slovenia.
3.     Dept. of Traumatology, University Clinical Center, Ljubljana, Slovenia.
4.     Dept. of Human and Mol. Genetics, Baylor College of Medicine, USA.
1. Interactions:
  –   Correlation can be generalized to more than two
      attributes, to capture higher-order interactions.
2. Information theory:
  –   A non-parametric approach for measuring
      ‘association’ and ‘uncertainty’.
3. Applications:
  –   Automatic selection of informative visualizations to
      uncover previously unseen structure in medical data.
  –   Automatic constructive induction of new features.
4. Results:
  –   Better predictive models for hip arthroplasty.
  –   Better understanding of the data.
          Attribute Dependencies

[Diagram: attributes A and B (a feature) each linked to the label (outcome, diagnosis); the links mark the importance of attribute A, the importance of attribute B, and the attribute correlation between A and B.]
          3-Way Interaction

What is common to A, B and C together, and cannot be inferred from the pairs of attributes (the 2-way interactions).
          Shannon’s Entropy

Entropy of C given its empirical probability distribution (p = [0.2, 0.8]).

[Venn diagram of H(A) and H(C), overlapping in I(A;C); the whole area is the joint entropy H(AC).]

H(C|A) = H(C) - I(A;C)
Conditional entropy – the uncertainty remaining in C after knowing A.

I(A;C) = H(A) + H(C) - H(AC)
Mutual information or information gain – the knowledge of C which came with knowledge of A: how much do A and C have in common?

H(AC)
Joint entropy of A and C.
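These identities are easy to check numerically. A minimal sketch (plain Python over empirical counts, not the authors' implementation), working in bits:

```python
from collections import Counter
from math import log2

def entropy(samples):
    """Empirical Shannon entropy H(X) in bits."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(XY), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# The slide's distribution p = [0.2, 0.8]:
C = [0] * 2 + [1] * 8
print(round(entropy(C), 3))  # 0.722
```

If X and Y are identical, the joint entropy equals each marginal entropy, so I(X;Y) = H(X); if they are independent, I(X;Y) = 0.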
          Interaction Information

        I(A;B;C) := I(AB;C) - I(A;C) - I(B;C)
                  = I(A;B|C) - I(A;B)

• Interaction information can be:
   – NEGATIVE – redundancy among attributes (negative int.)
   – NEGLIGIBLE – no interaction
   – POSITIVE – synergy between attributes (positive int.)
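A hedged sketch of this definition, estimated from samples (function names are my own); the XOR example shows a purely synergistic, positive interaction:

```python
from collections import Counter
from math import log2

def H(xs):
    """Empirical Shannon entropy in bits."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def I2(xs, ys):
    """Mutual information I(X;Y) = H(X) + H(Y) - H(XY)."""
    return H(xs) + H(ys) - H(list(zip(xs, ys)))

def I3(a, b, c):
    """Interaction information I(A;B;C) = I(AB;C) - I(A;C) - I(B;C)."""
    ab = list(zip(a, b))
    return I2(ab, c) - I2(a, c) - I2(b, c)

# XOR: A and B are individually useless but jointly determine C.
A = [0, 0, 1, 1]
B = [0, 1, 0, 1]
C = [a ^ b for a, b in zip(A, B)]
print(I3(A, B, C))  # 1.0 (one full bit of synergy)
```

Conversely, three identical binary attributes give I3 = -1.0, i.e. complete redundancy, and independent attributes give a negligible value near zero.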
    History of Interaction Information

(Partial) history of independent reinventions:

•   McGill ‘54 (Psychometrika)                - interaction information
•   Han ‘80 (Information & Control)           - multiple mutual information
•   Yeung ‘91 (IEEE Trans. Inf. Theory)       - mutual information
•   Grabisch & Roubens ‘99 (game theory)      - Banzhaf interaction index
•   Matsuda ‘00 (Physical Review E)           - higher-order mutual inf.
•   Brenner et al. ‘00 (Neural Computation)   - average synergy
•   Demšar ’02 (machine learning)             - relative information gain
•   Bell ‘03 (NIPS02, ICA2003)                - co-information
•   Jakulin ’03 (machine learning)            - interaction gain
Utility of Interaction Information
1. Visualization of interactions in data
   •   Interaction graphs, dendrograms
2. Construction of predictive models
   •   Feature construction, combination, selection

Case studies:
• Predicting the success of hip arthroplasty (HHS).
• Predicting the contraception method used from
   demographic data (CMC).

Predictive modeling helps us focus only on
    interactions that involve the outcome.
Interaction Matrix for CMC Domain

Illustrates the interaction information for all pairs of attributes.
      red – positive, blue – negative, green – independent.
Interaction Graphs

Information gain, 100% · I(A;C)/H(C):
the attribute “explains” 1.98% of the label entropy.

A positive interaction, 100% · I(A;B;C)/H(C):
the two attributes are in synergy – treating them holistically may result in 1.85% extra uncertainty explained.

A negative interaction, 100% · I(A;B;C)/H(C):
the two attributes are slightly redundant – 1.15% of the label uncertainty is explained by each of the two attributes.
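The percentages on such graphs are interaction quantities normalized by the label entropy H(C). A sketch of how figures of this kind can be computed (toy data, not the CMC attributes; helper names are my own):

```python
from collections import Counter
from math import log2

def H(xs):
    """Empirical Shannon entropy in bits."""
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def I2(xs, ys):
    """Mutual information I(X;Y) = H(X) + H(Y) - H(XY)."""
    return H(xs) + H(ys) - H(list(zip(xs, ys)))

def rel_gain(a, c):
    """Information gain as a share of label entropy: 100 * I(A;C) / H(C)."""
    return 100 * I2(a, c) / H(c)

def rel_interaction(a, b, c):
    """Interaction information as a share of label entropy: 100 * I(A;B;C) / H(C)."""
    ab = list(zip(a, b))
    return 100 * (I2(ab, c) - I2(a, c) - I2(b, c)) / H(c)

# Toy XOR label: each attribute alone explains 0%, together 100% extra.
A = [0, 0, 1, 1]
B = [0, 1, 0, 1]
C = [a ^ b for a, b in zip(A, B)]
print(rel_gain(A, C), rel_interaction(A, B, C))  # 0.0 100.0
```

Negative values of `rel_interaction` correspond to the redundancy (blue) edges, positive values to synergy (red).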
              Interaction Dendrogram

[Dendrogram: cluster “tightness” ranges from loose (weakly interacting attributes) to tight (strongly interacting); each attribute is shaded by its information gain, from uninformative to informative.]
Interpreting the Dendrogram

[Annotated dendrogram marking: an unimportant interaction, a cluster of attributes, a positive interaction, a weakly negative interaction, and a useless attribute.]
Application to Harris Hip Score (HHS) Prediction

Attribute Structure for HHS

“Bipolar endoprosthesis and short duration of operation significantly increase the chances of a good outcome.”

“Presence of neurological disease is a high risk factor only in the presence of other complications during the operation.”

[Diagram of the attribute structure leading to late complications: part discovered from data, part designed by the physician.]
          A Positive Interaction

   Both attributes are useless alone, but useful together.
They should be combined into a single feature (e.g. with a
classification tree, a rule or a Cartesian product attribute).
         These two attributes are also correlated:
          correlation doesn’t imply redundancy.
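One way to act on such a finding is constructive induction: replace the two attributes by their Cartesian product. A sketch, assuming attributes are columns of discrete values (the helper name is my own):

```python
def cartesian_feature(a_col, b_col):
    """Join two discrete attributes into one feature over their Cartesian product."""
    return [(a, b) for a, b in zip(a_col, b_col)]

# XOR-style outcome: each attribute alone is uninformative,
# but the joint feature determines the class exactly.
A = [0, 0, 1, 1]
B = [0, 1, 0, 1]
AB = cartesian_feature(A, B)
print(AB)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

A classifier that treats `AB` as a single nominal attribute can then resolve the synergy that a myopic learner would miss.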
            A Negative Interaction


      Once we know the wife’s or the husband’s education,
    the other attribute will not provide much new information.
       But they do provide some, if you know how to use it!
Feature combination may work; feature selection throws data away.
                Prediction of HHS

Brier score – probabilistic evaluation (K classes, N instances):

    BS(p, p̂) = (1/N) Σ_{i=1..N} Σ_{j=1..K} (p_{i,j} - p̂_{i,j})²
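The Brier score can be sketched as follows; note that the 1/N normalization is my assumption here (variants of the score differ in whether they also divide by K):

```python
def brier_score(p_true, p_hat):
    """Squared difference between true and predicted class probability
    vectors, summed over the K classes and averaged over the N instances."""
    n = len(p_true)
    return sum(
        (t - h) ** 2
        for row_true, row_hat in zip(p_true, p_hat)
        for t, h in zip(row_true, row_hat)
    ) / n

# Two instances, two classes: one-hot truth vs. predicted probabilities.
truth = [[1.0, 0.0], [0.0, 1.0]]
preds = [[0.8, 0.2], [0.3, 0.7]]
print(round(brier_score(truth, preds), 2))  # 0.13
```

Lower is better; a perfect probabilistic prediction scores 0, which matches the direction of the comparison below.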
• Tree-Augmented NBC:                         0.227 ± 0.018
• Naïve Bayesian classifier:                  0.223 ± 0.014
• General Bayesian net:                       0.208 ± 0.006
• Simple feature selection with NBC:          0.196 ± 0.012
• FSS with background concepts:               0.196 ± 0.011
• 10 top interactions → FSS:                  0.189 ± 0.011
   – Tree-Augmented NB:                       0.207 ± 0.017
   – Search for feature comb.:                0.185 ± 0.012
                    The Best Model

These two (not very logical) combinations of features are only worth a 0.2% loss in performance.

The endoprosthesis and operation duration interaction provides little information that wouldn’t already be provided by these attributes: it interacts negatively with the model.
           A Causal Diagram

[Causal diagram around HHS: nodes include loss of consciousness, pulmonary disease, sitting ability, neurological disease, late luxation, luxation, hospitalization duration, injury and operation time; the legend distinguishes cause and moderator relationships.]
1. Visualization methods attempt to:
   •   Summarize the relationships between attributes in
       data (interaction graph, interaction dendrogram,
       interaction matrix).
   •   Assist the user in exploring the domain and
       constructing classification models (interactive
       interaction analysis).
2. What to do with interactions:
   •   Do make use of interactions! (rules, trees,
       dependency models)
       •   Myopia: naïve Bayesian classifier, linear SVM, perceptron,
           feature selection, discretization.
   •   Do not assume an interaction when there isn’t one!
       •   Fragmentation: classification trees, rules, general Bayesian
           networks, TAN.
