Gerstein Lab

Laura Mustavich
Introduction to Data Mining
Final Project Presentation
April 26, 2007
The Inspiration For
a Method
The Nature of Complex Diseases

 Most common diseases are complex
 Caused by multiple genes
 Often interacting with one another

This interaction is termed Epistasis
   When an allele at one locus masks the effect of
    an allele at another locus
The Failure of Traditional Methods

 Traditional gene hunting methods
  successful for rare Mendelian (single
  gene) diseases
 Unsuccessful for complex diseases:
     Since  many genes interact to cause the
      disease, the effect of any single gene is too
      small to detect
     They do not take this interaction into account
MDR: The Algorithm
Multifactor Dimensionality Reduction
 A data mining approach to identify
  interactions among discrete variables that
  influence a binary outcome
 A nonparametric alternative to traditional
  statistical methods such as logistic regression
 Driven by the need to improve the power
  to detect gene-gene interactions
MDR Step 0
Divide data (genotypes, discrete
  environmental factors, and affection
  status) into 10 distinct subsets
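
A minimal sketch of this partitioning step, assuming individuals are identified by row index and assigned to subsets at random. The function name `split_into_folds` and the fixed seed are illustrative choices, not part of the MDR software.

```python
import random

def split_into_folds(n_individuals, n_folds=10, seed=0):
    """Shuffle individual indices and deal them into n_folds disjoint subsets."""
    indices = list(range(n_individuals))
    random.Random(seed).shuffle(indices)
    # Fold i takes every n_folds-th shuffled index, starting at position i.
    return [indices[i::n_folds] for i in range(n_folds)]

# 79 is used only as an example sample size.
folds = split_into_folds(79)
```

Each subset then serves once as the testing set while the remaining nine form the training set.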
MDR Step 1
Select a set of n genetic or environmental
 factors (suspected of interacting
 epistatically) from the set of all variables in the
 training set
MDR Step 2
Create a contingency table for these
 multilocus genotypes, counting the
 number of affected and unaffected
 individuals with each multilocus genotype
MDR Step 3
Calculate the ratio of cases to controls for
 each multilocus genotype
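
Steps 2 and 3 can be sketched as follows, assuming each individual's multilocus genotype is stored as a tuple and status is coded 1 for cases and 0 for controls. Both encodings, the function names, and the toy data are assumptions made for illustration.

```python
from collections import Counter

def contingency_table(genotypes, status):
    """Count cases (1) and controls (0) for each multilocus genotype."""
    cases, controls = Counter(), Counter()
    for g, s in zip(genotypes, status):
        (cases if s == 1 else controls)[g] += 1
    return cases, controls

def case_control_ratios(cases, controls):
    """Ratio of cases to controls in every observed genotype cell."""
    ratios = {}
    for g in set(cases) | set(controls):
        # A cell with cases but no controls gets an infinite ratio.
        ratios[g] = cases[g] / controls[g] if controls[g] else float("inf")
    return ratios

# Toy data: five individuals typed at two loci.
genotypes = [("A/A", "C/T"), ("A/A", "C/T"), ("A/G", "C/C"),
             ("A/A", "C/T"), ("A/G", "C/C")]
status = [1, 1, 0, 0, 1]
cases, controls = contingency_table(genotypes, status)
ratios = case_control_ratios(cases, controls)
```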
MDR Step 4
Label each multilocus genotype as “high-risk”
  or “low-risk”, depending on whether the
  case-control ratio is above a certain threshold

****This is the dimensionality reduction step
   Reduces n-dimensional space to 1 dimension with 2 levels
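
A minimal sketch of the labeling step, assuming the per-genotype counts are held in plain dictionaries. The default threshold of 1.0 is an assumption, since the slide does not state a value, and `label_cells` is a hypothetical name.

```python
def label_cells(cases, controls, threshold=1.0):
    """Map each multilocus genotype to 1 (high-risk) or 0 (low-risk)."""
    labels = {}
    for g in set(cases) | set(controls):
        n_cases, n_controls = cases.get(g, 0), controls.get(g, 0)
        # Comparing n_cases with threshold * n_controls avoids dividing by zero.
        labels[g] = 1 if n_cases > threshold * n_controls else 0
    return labels

# Toy counts for three two-locus genotypes.
cases = {("A/A", "C/T"): 6, ("A/G", "C/C"): 2}
controls = {("A/A", "C/T"): 2, ("A/G", "C/C"): 5, ("G/G", "T/T"): 3}
labels = label_cells(cases, controls)
```

Whatever the number of factors, every individual ends up described by a single two-level variable, which is the reduction to one dimension.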
MDR Step 5
Use labels to classify individuals as cases or
 controls, and calculate the
 misclassification rate
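
The classification step can be sketched like this, assuming the risk labels are stored in a dictionary keyed by genotype. Treating a genotype that never appeared in training as low-risk is an assumption made here for illustration.

```python
def misclassification_rate(labels, genotypes, status, default=0):
    """Fraction of individuals whose risk label disagrees with their status."""
    errors = 0
    for g, s in zip(genotypes, status):
        predicted = labels.get(g, default)  # unseen genotype: assumed low-risk
        errors += predicted != s
    return errors / len(status)

# Toy example with single-locus genotypes.
rate = misclassification_rate(
    {"A/A": 1, "A/G": 0},
    ["A/A", "A/A", "A/G", "G/G"],
    [1, 0, 0, 1],
)
```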
Repeat steps 1-5 for:
 All possible combinations of n factors
 All possible values of n
 Across all 10 training and testing sets
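
The exhaustive search over factor combinations can be sketched with the standard library. The factor names are placeholders; the count at the end uses 36 attributes and orders 1 through 4, the figures reported for the case study later in this deck.

```python
from itertools import combinations
from math import comb

def candidate_models(factors, max_order):
    """Yield every combination of 1..max_order factors."""
    for n in range(1, max_order + 1):
        yield from combinations(factors, n)

models = list(candidate_models(["snp1", "snp2", "snp3", "snp4"], max_order=2))

# Number of candidate models for 36 attributes and orders 1 through 4.
n_models = sum(comb(36, n) for n in range(1, 5))
```

Each of these models is evaluated once per training and testing split, which is why the search grows expensive so quickly.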
The Best Model
   Minimizes prediction error:
    the average misclassification rate across all the 10
    cross-validation subsets
   Maximizes cross-validation consistency:
    the number of times a particular model was the best
    model across cross-validation subsets
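
Both selection criteria are simple summaries over the cross-validation subsets. A minimal sketch, assuming the winning model and the testing error have already been recorded for each subset (the model names and error values are made up):

```python
from collections import Counter

def cross_validation_consistency(best_per_fold):
    """Return the most frequently selected model and how many subsets chose it."""
    model, count = Counter(best_per_fold).most_common(1)[0]
    return model, count

def prediction_error(test_errors):
    """Average misclassification rate across the testing subsets."""
    return sum(test_errors) / len(test_errors)

best_per_fold = ["A+B"] * 6 + ["A+C"] * 3 + ["B+C"]
winner, consistency = cross_validation_consistency(best_per_fold)
error = prediction_error([0.5, 0.25, 0.75])
```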
Hypothesis test of best model:
   Evaluate magnitude of cross-validation
    consistency and prediction error estimates by
    permutation testing:
     Randomize    disease labels
     Repeat MDR analysis several times to get distribution
      of cross-validation consistencies and prediction errors
     Use distributions to determine p-values for your actual
      cross-validation consistencies and prediction errors
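
A minimal sketch of the permutation procedure, assuming the whole MDR analysis is wrapped in a function that returns one number where larger means better (for a prediction error, the comparison would be reversed). The `agreement` statistic below is a toy stand-in, not MDR itself, and the add-one correction is a common convention rather than something stated on the slide.

```python
import random

def permutation_p_value(statistic, features, labels, n_permutations=1000, seed=0):
    """Share of label permutations that score at least as high as the real data."""
    rng = random.Random(seed)
    observed = statistic(features, labels)
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between features and disease labels
        if statistic(features, shuffled) >= observed:
            at_least_as_extreme += 1
    # The add-one correction keeps the estimate away from an impossible p = 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

def agreement(features, labels):
    """Toy statistic: fraction of individuals whose feature equals their label."""
    return sum(f == l for f, l in zip(features, labels)) / len(labels)

features = [1] * 10 + [0] * 10
p = permutation_p_value(agreement, features, list(features))
```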
Permutation Testing: An illustration
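An Example Empirical Distribution
[Figure: empirical distribution, horizontal axis from 0.2 to 1.0, with the observed value 0.4500 marked]

Sample Quantiles:

      0%      0.045754
     25%      0.168814
     50%      0.237763
     75%      0.321027
     90%      0.423336
     95%      0.489813
     99%      0.623899
  99.99%      0.872345
    100%      1

The observed value, 0.4500, falls between the 90% and 95% quantiles.
The probability that we would see results as extreme as, or more extreme than,
0.4500 simply by chance is therefore between 5% and 10%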
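
The bracketing of the p-value can be checked mechanically from the quantile table. This sketch copies the quantiles from the slide; the function name is hypothetical.

```python
quantiles = [
    (0.0, 0.045754), (0.25, 0.168814), (0.50, 0.237763), (0.75, 0.321027),
    (0.90, 0.423336), (0.95, 0.489813), (0.99, 0.623899), (0.9999, 0.872345),
    (1.0, 1.0),
]

def tail_probability_bounds(observed, quantiles):
    """Bracket P(X >= observed) using the two quantiles around the observed value."""
    for (p_lo, q_lo), (p_hi, q_hi) in zip(quantiles, quantiles[1:]):
        if q_lo <= observed <= q_hi:
            return 1 - p_hi, 1 - p_lo
    raise ValueError("observed value lies outside the tabulated range")

low, high = tail_probability_bounds(0.4500, quantiles)
```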
   Facilitates simultaneous detection and
    characterization of multiple genetic loci
    associated with a discrete clinical endpoint by
    reducing the dimensionality of the multilocus data
   Non-parametric – no parameters are estimated
   Assumes no particular genetic model
   False-positive rate is minimized due to multiple
 Computationally intensive
  (especially with >10 loci)
 The curse of dimensionality:
  decreased predictive ability with high
  dimensionality and small sample due to
  cells with no data
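
The empty-cell problem follows from simple counting. The sketch below assumes biallelic markers with three possible genotypes per locus, which is an assumption about the data rather than something stated on the slide.

```python
def genotype_cells(n_loci, genotypes_per_locus=3):
    """Number of multilocus genotype cells for an n-locus model."""
    return genotypes_per_locus ** n_loci

def max_order_with_coverage(sample_size, genotypes_per_locus=3):
    """Largest n for which the sample could, at best, put one individual in every cell."""
    n = 0
    while genotypes_per_locus ** (n + 1) <= sample_size:
        n += 1
    return n

cells_order_4 = genotype_cells(4)
largest_order = max_order_with_coverage(79)
```

With 79 individuals, for example, a four-locus model already has more cells (81) than there are people to fill them.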
MDR Software
The Authors
Multifactor dimensionality reduction software
 for detecting gene-gene and gene-
 environment interactions. Hahn, Ritchie,
 Moore, 2003.
     Values Calculated by MDR
Measure             Formula/Interpretation
Balanced Accuracy   (Sensitivity+Specificity)/2; fitness measure
                    Accuracy is skewed in favor of the larger class, whereas balanced accuracy gives
                    equal weight to each class
Accuracy            (TP+TN)/(TP+TN+FP+FN)
                    Proportion of instances correctly classified
Sensitivity         TP/(TP+FN); proportion of actual positives correctly classified

Specificity         TN/(TN+FP); proportion of actual negatives correctly classified

Odds Ratio          (TP*TN)/(FP*FN); compares whether the probability of a certain event is the same
                    for two groups
χ²                  Chi-squared score for the attribute constructed by MDR from this attribute
Precision           TP/(TP+FP); the proportion of relevant cases returned

Kappa               2(TP*TN-FP*FN)/[(TP+FN)(FN+TN)+(TP+FP)*(FP+TN)]
                    A function of total accuracy and random accuracy
F-Measure           2*TP/(2*TP+FP+FN); a function of sensitivity and precision
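
A worked example of these measures for one hypothetical confusion matrix. The counts are invented for illustration, the function name is not from the MDR software, and kappa is computed as Cohen's kappa, whose numerator is 2(TP*TN - FP*FN). The sketch assumes both classes are non-empty.

```python
def mdr_measures(tp, fn, tn, fp):
    """Compute the tabulated measures from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        # The odds ratio is infinite when either FP or FN is zero.
        "odds_ratio": (tp * tn) / (fp * fn) if fp * fn else float("inf"),
        "precision": tp / (tp + fp),
        "kappa": 2 * (tp * tn - fp * fn)
                 / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical counts: 50 cases and 50 controls.
m = mdr_measures(tp=40, fn=10, tn=35, fp=15)
```

For these counts accuracy and balanced accuracy coincide because the two classes are the same size; they diverge as soon as one class outnumbers the other.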
Sign Test
n = number of cross-validation intervals
c = number of cross-validation intervals with testing
  accuracy ≥ 0.5

$$p = \sum_{k=c}^{n} \binom{n}{k} \left(\frac{1}{2}\right)^{n}$$

The probability of observing c or more cross-
  validation intervals with testing accuracy ≥ 0.5 if
  each case were actually classified randomly
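
The sign test p-value is a binomial upper tail with success probability 1/2, so it can be computed directly. A small sketch using only the standard library (the function name is an illustrative choice):

```python
from math import comb

def sign_test_p(c, n=10):
    """P(at least c of n intervals reach accuracy >= 0.5 under random classification)."""
    return sum(comb(n, k) for k in range(c, n + 1)) / 2 ** n

p_10 = sign_test_p(10)
p_8 = sign_test_p(8)
p_5 = sign_test_p(5)
```

With n = 10 these values round to 0.0010, 0.0547 and 0.6230, matching the sign test p-values shown in the results tables later in this deck.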
The Problem of
A Case Study
Genes Associated With Alcoholism
 ADH (alcohol dehydrogenase) and ALDH2 (acetaldehyde dehydrogenase 2)
  genes are associated with alcoholism
 The ADH enzymes and the ALDH2 enzyme are involved in alcohol metabolism


     ADH Genes
     Chromosome 4, a 370 kb region, listed 5’ to 3’:

     Gene      Class
     ADH7      Class IV
     ADH1C     Class I
     ADH1B     Class I
     ADH1A     Class I
     ADH6      Class V
     ADH4      Class II
     ADH5      Class III
Taste Receptors and Aversion to Alcohol
 • A person must be willing to drink in order to be an alcoholic
 • TAS2R38 affects the amount of alcohol a person is willing to drink
 • Therefore, it is related to alcoholism, although no direct
   association has been found
 • We hope to provide a direct link between TAS2R38 and alcoholism,
   by demonstrating that it acts epistatically with other genes
   associated with alcoholism

 PTC tasters: alcohol tastes bitter, so they drink less alcohol
 PTC non-tasters: alcohol tastes sweet, so they drink more alcohol
Actual Analysis
   A sample of cases and controls
    (alcoholics and non-alcoholics) from
    three East Asian populations: the Ami,
    Atayal, and Taiwanese
   Genotyped for 98 markers within several
    genes: ALDH2, all ADH genes, and 2
    taste receptor genes, TAS2R16 and
    TAS2R38 (PTC)
Computational Limitations
1.   The software package has a problem reading
     missing data

     I was forced to use only complete records,
     which reduced my (already small) sample to 79
2.   The computation time is prohibitively long for
     higher-order models, especially with a large
     number of attributes

     I was advised to restrict my attributes to
     markers within ADH1C and the 2 taste
     receptor genes, which left me with 36 attributes

     I considered models only up to order 4
Summary of Results:
All Populations
Instances: 79            Attributes: 36               Ratio: 1.3235
Order   Model                    Training Bal. Acc.    Testing Bal. Acc.   Sign Test (p)     CV Consistency

  1      X.04..ADH1C.dwstrm.Te          0.6049              0.4278              0 (1.0000)         5/10

  2                                     0.7076              0.4438              3 (0.9453)         6/10

  3      X.04..ADH1C.dwstrm.Te          0.785               0.3186              1 (0.9990)         4/10

  4                                     0.8453              0.3564              2 (0.9893)         6/10
Summary of Results: Ami
Instances: 30             Attributes: 36                Ratio: 0.8750

 Order           Model             Training Bal. Acc.    Testing Bal. Acc.   Sign Test (p)   CV Consistency

  1      X.07..TAS2R16.C_11431          0.7331                0.4598          5 (0.6230)          5/10

  2                                     0.8284                0.3476          2 (0.9893)          3/10

  3      X.07..PTC.C_8876467_1          0.9688                0.9545         10 (0.0010)         10/10

  4                                     0.9722                0.8712          8 (0.0547)          9/10
Cross Validation Statistics

Measure                          Training                       Testing

Balanced Accuracy                0.9688                         0.9545

Accuracy                         0.9667                          0.95

Sensitivity                         1                             1

Specificity                      0.9375                         0.9091

Odds Ratio                          ∞                             ∞

χ2                         23.6250 (p < 0.0001)           1.6364 (p = 0.2008)

Precision                        0.9333                          0.9

Kappa                            0.9333                          0.9

F-Measure                        0.9655                         0.9474

Sign Test:                              10 (p = 0.0010)
Cross-validation Consistency:           10/10
Whole Dataset Statistics:
   Training Balanced Accuracy:   0.9688
   Training Accuracy:            0.9667
   Training Sensitivity:         1.0000
   Training Specificity:         0.9375
   Training Odds Ratio:          ∞
   Training Χ²:                  26.2500 (p < 0.0001)
   Training Precision:           0.9333
   Training Kappa:               0.9333
   Training F-Measure:           0.9655
Graphical Model
Classification Rules

IF X.07..TAS2R16.C_11431    AND X.07..PTC.C_8876467_1    AND X.04..ADH1C.C_2688508    THEN Class

   A\A                          C\G                          C\C                           0
   A\A                          C\G                          C\T                           1
   A\A                          C\G                          T\T                           0
   A\A                          G\G                          C\C                           0
   A\A                          G\G                          C\T                           0
   A\A                          G\G                          T\T                           1
   A\G                          C\C                          C\T                           1
   A\G                          C\G                          C\C                           0
   A\G                          C\G                          C\T                           0
   A\G                          C\G                          T\T                           0
   A\G                          G\G                          C\C                           1
   A\G                          G\G                          C\T                           1
   A\G                          G\G                          T\T                           0
   G\G                          C\G                          C\T                           1
   G\G                          G\G                          C\C                           0
   G\G                          G\G                          C\T                           1
   G\G                          G\G                          T\T                           1
Locus Dendrogram
Future Work
   Simulations to calculate the power of MDR,
    especially in relation to sample size
   Comparison of MDR with logistic regression, and
    other proposed methods to detect epistasis, with
    respect to the current data set and simulated data
   Research how different methods to search the
    sample space can be incorporated into MDR
    implementation to improve computational efficiency