									    KDD-2001 Cup
The Genomics Challenge


   Christos Hatzis, Silico Insights
 David Page, University of Wisconsin
             Co-chairs


           August 26, 2001

        Special thanks: DuPont Pharmaceuticals Research
        Laboratories for providing data set 1, Chris Kostas from
        Silico Insights for cleaning and organizing data sets 2 and 3

              http://www.cs.wisc.edu/~dpage/kddcup2001/
          The Genomics Challenge

• High throughput technologies in genomics,
  proteomics and drug screening are creating
  large, complex datasets
• Bioinformatics datasets are typically
  underdetermined
   – very large number of features (complex domain)
   – small number of instances (high cost per data point)
• Multi-relational nature of data
   – reflect complex interactions between molecules,
     pathways and systems
   – Hierarchical organization of interacting layers
• Current tools and approaches do not
  adequately address the Genomics Challenge

                     Overview

• Cup organization
• Dataset description
   – Thrombin binding
   – Gene function/localization prediction

• Statistics
• Tasks and highlights
• Winners' talks (3 x 10 min)




                   Cup Organization

• KDD-2001 Cup web site
   – Posting of datasets, Q&A, answer keys
• Schedule
   –   Training dataset available: May 31
   –   Question period 1: June 1-10
   –   Test set available: July 13
   –   Question period 2: July 13-24
   –   Entries due: July 26
   –   Winners notified: August 1
   –   Results to participants: August 7
• Evaluation criteria
   – Task 1: weighted accuracy (average of true positive rate and true negative rate)
   – Tasks 2, 3: non-weighted accuracy
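
The two criteria can be sketched in a few lines of Python. This is only an illustration of the definitions above, not the organizers' scoring script; the function names are made up here, and the counts are the ones shown in the confusion matrix on the Task 1 Winner slide.

```python
def weighted_accuracy(tp, fn, fp, tn):
    """Average of the true positive rate and the true negative rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2


def accuracy(tp, fn, fp, tn):
    """Plain (non-weighted) accuracy: fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fn + fp + tn)


# Counts taken from the Task 1 winner's confusion matrix in this deck.
tp, fn, fp, tn = 95, 55, 128, 356
print(round(100 * weighted_accuracy(tp, fn, fp, tn), 4))  # 68.4435
print(round(100 * accuracy(tp, fn, fp, tn), 4))           # 71.1356
```

Under this definition a classifier that labels every compound inactive scores exactly 50%, whatever the class balance, which is presumably why the weighted form was chosen for the heavily skewed thrombin data.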


     Dataset 1: Molecular Bioactivity

Dataset provided by DuPont Pharmaceuticals for
  the KDD-2001 Cup competition

• Activity of compounds binding to thrombin
• Library of compounds included:
  – 1909 known molecules (42 actively binding
    thrombin)
• 139,351 binary features describe the 3-D
  structure of each compound
• 636 new compounds with unknown capacity to
  bind thrombin
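
For a sense of scale, the figures above can be turned into rough storage estimates. The sketch below is back-of-the-envelope arithmetic only: the 8-byte element width and the set-of-indices encoding are assumptions for illustration, not details of how the data set was distributed or how any entrant stored it.

```python
N_COMPOUNDS = 1909      # training molecules (from the slide)
N_FEATURES = 139_351    # binary features per molecule (from the slide)
N_ACTIVE = 42           # molecules that bind thrombin (from the slide)

cells = N_COMPOUNDS * N_FEATURES


def dense_megabytes(bytes_per_cell):
    """Size of a fully dense training matrix, in units of 10**6 bytes."""
    return cells * bytes_per_cell / 1e6


print(f"matrix cells:            {cells:,}")
print(f"dense, 8-byte floats:    {dense_megabytes(8):,.0f} MB")
print(f"dense, 1 byte per cell:  {dense_megabytes(1):,.0f} MB")
print(f"bit-packed:              {dense_megabytes(1 / 8):,.0f} MB")
print(f"active fraction:         {N_ACTIVE / N_COMPOUNDS:.1%}")

# Hypothetical sparse encoding: keep only the indices of features set to 1.
compound = frozenset({17, 4021, 98_765})   # made-up indices, for illustration only
print(4021 in compound)
```

A tool that expands the table into double-precision numbers would need roughly 2 GB for the training set alone, which is consistent with the 1G RAM remark in the Task 1 highlights.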


 Dataset 2: Protein Functional Annotation

• Yeast Genome dataset
  – Data on the protein-protein interactions from MIPS database
    (Munich Information Centre for Protein Sequences)
  – Expression profiles: DeRisi et al. (1997) Science 278: 680
• Relational dataset
  – Gene information
  – Interaction information
• Predict function, localization of unknown proteins

[Pie chart: 6449 total proteins]
  Known Proteins                          52%
  Similarity to Unknown Protein           16%
  Weak Similarity to Known Protein        13%
  No Similarity                            8%
  Questionable ORFs                        7%
  Strong Similarity to Known Protein       4%

                                           Statistics: I. Participation
[Bar chart: KDD Cup Participation, number of participant groups by year]
  Cup 97      16
  Cup 98      21
  Cup 99      24
  Cup 2000    30
  Cup 2001   136

[Pie chart: Total by Task (200 submissions); legend: Thrombin, Function, Localization; segment values: 45, 114, 41]
[Pie chart: Total by Affiliation (200 submissions); legend: Com, Gov, Univ, Other; segment values: 20, 66, 107, 7]

• 136 unique groups, 200 total entries by about 300-400
  participants
• Almost 5-fold increase over previous years
• More than half of the entries from commercial sector

       Statistics: II. Data Mining Software

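[Pie charts: software type by task; legend: Custom, Public Domain, Commercial]
  Task 1 segment values: 21, 5, 53
  Task 2 segment values: 9, 16, 6
  Task 3 segment values: 12, 19, 6
  Total segment values:  42, 88, 17
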
Note: Statistics from 157 responders who provided details on their approach

• Mostly custom software was used, especially
  for Task 1, where the number of features was
  too large for most commercial systems
• Gap points to need for commercial tools that
  can cope with bioinformatics datasets

                                                                   Statistics: III. Algorithms
[Bar chart: Fraction of Entries by Task (y-axis 0 to 0.7; series: Task 1, Task 2, Task 3).
 Techniques on the x-axis: Feature Selection, Feature Construction, Decision Tree,
 Ensemble Classifier, Naïve Bayes, k-Nearest Neighbor, Boosting, Neural Net,
 Association Rules, SVM, Bagging, Clustering, Statistical, Logistic Regression,
 Bayesian Net, Genetic Programming, Decision Table, Linear Regression, OLAP, ILP,
 Cross Validation]

•   Feature selection used in almost 70% of the entries for Task 1
•   Ensemble classifiers based on more than one algorithm used extensively
•   Decision trees among the most commonly used, with Naïve Bayes and k-NN
•   Cross-validation to deal with small dataset size
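
The slides do not say which selection criteria the entrants used. As one common possibility, the sketch below ranks binary features by their mutual information with the binary activity label and keeps the top k. The criterion, the function names, and the toy data are all assumptions made for illustration.

```python
from math import log2


def mutual_information(feature, labels):
    """Mutual information (in bits) between a binary feature and a binary label."""
    n = len(labels)
    mi = 0.0
    for f in (0, 1):
        for y in (0, 1):
            joint = sum(1 for a, b in zip(feature, labels) if a == f and b == y) / n
            p_f = sum(1 for a in feature if a == f) / n
            p_y = sum(1 for b in labels if b == y) / n
            if joint > 0:
                mi += joint * log2(joint / (p_f * p_y))
    return mi


def select_top_k(columns, labels, k):
    """Return the indices of the k feature columns with the highest score."""
    scores = [(mutual_information(col, labels), j) for j, col in enumerate(columns)]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [j for _, j in scores[:k]]


# Toy data (made up): 3 feature columns over 8 compounds.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
columns = [
    [1, 1, 0, 0, 0, 0, 0, 0],   # identical to the label
    [1, 0, 1, 0, 1, 0, 1, 0],   # unrelated to the label
    [1, 1, 1, 0, 0, 0, 0, 0],   # partly informative
]
print(select_top_k(columns, labels, k=2))  # [0, 2]
```

With far more features than compounds, a filter of this kind can pick up chance correlations, so in practice it would usually be applied inside each cross-validation fold rather than once on the full training set.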



              Task 1 Highlights

• Test set was challenging: a second round of
  compounds made by chemists, so the
  distribution changed.
• Far more features than data points; can’t run
  most commercial systems even with 1 GB of RAM.
• Varying degrees of correlation among
  features.
• Better than 60% weighted accuracy is
  impressive.
• Pure binary prediction task, yet the winner is a
  Bayes net learning system (after feature
  selection).

            Tasks 2 & 3: Relational Prediction
[Figure: data sources used to predict FUNCTION and LOCATION]
  Gene/Protein Level: Gene Sequence, Structural Motifs, Chromosomal Location
  Interactions: Gene Expression (Expression Clusters), Proteomic Clusters (Clusters A to E),
    Protein Interactions

               Task 2 Highlights

• Average of about 3 functions per protein.
• Multi-relational, as are many real-world
  databases.
• Yet top-scoring approaches were not pure
  relational learners.
• But top-scoring approaches did account for
  multi-relational structure of the data.
  – Krogel: novel form of feature construction to capture
    relational information in a feature vector.
  – Sese, Hayashi, and Morishita: instance-based
    learning, but using the interactions relation as part of
    the distance function.
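
The general idea of folding relational information into a flat feature vector can be sketched as follows. This is not a reconstruction of either winning entry; it is a minimal illustration in which each gene gets one count per function, taken over its labelled interaction partners. The gene names, functions, and interactions are made up.

```python
from collections import Counter

# Made-up toy data; gene names, functions, and interactions are hypothetical.
interactions = [("G1", "G2"), ("G1", "G3"), ("G2", "G4")]
known_function = {"G2": "transport", "G3": "transport", "G4": "metabolism"}
function_names = ["metabolism", "transport"]


def neighbours(gene, pairs):
    """Genes that share an interaction with `gene` (interactions are undirected)."""
    linked = set()
    for a, b in pairs:
        if a == gene:
            linked.add(b)
        elif b == gene:
            linked.add(a)
    return linked


def partner_function_counts(gene, pairs, labels, names):
    """Fixed-length vector: how many labelled partners carry each function."""
    counts = Counter(labels[p] for p in neighbours(gene, pairs) if p in labels)
    return [counts[name] for name in names]


print(partner_function_counts("G1", interactions, known_function, function_names))  # [0, 2]
```

A vector like this can be appended to a gene's own attributes and passed to any attribute-value learner, or used as an extra term in the distance function of an instance-based learner.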


              Task 3 Highlights

• Similar to task 2, but only one localization per
  protein.
• Similar lessons.
• High overlap in top scorers for both tasks.
• Question: did anyone “bootstrap” by using
  their predictions for function to help predict
  localization, or vice-versa?




            KDD-2001 Cup Winners

• Task 1:    Jie Cheng, CIBC

• Task 2:    Mark-A. Krogel, Magdeburg Univ.

• Task 3:    Hisashi Hayashi, Jun Sese, and
             Shinichi Morishita, Univ. of Tokyo




               Task 1 Winner

KDD Cup 2001 Results
Task 1: Thrombin

Name:                 Jie Cheng
Rank:                 1
Weighted Accuracy:    68.4435
Accuracy:             71.1356

                            Predicted
                      Positive    Negative    Total
Actual    Positive          95          55      150
          Negative         128         356      484
          Total            223         411      634

True Positive Rate:        63.3%
True Negative Rate:        73.6%

[Plot: Distribution of Prediction Accuracy Scores for Task 1: Thrombin Activity. x-axis: Score (30 to 100); y-axis: Cumulative Frequency (0 to 1). Annotated values: 68.444 and 1.000.]
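
The Task 1 scores can be recomputed from the confusion matrix, using the evaluation criterion stated earlier in the deck: weighted accuracy is the average of the true positive rate and the true negative rate. The following is a small sketch for checking the arithmetic. The function names are illustrative and not part of any Cup tooling.

```python
def accuracy(tp, fn, fp, tn):
    """Plain (non-weighted) accuracy in percent: correct predictions over all cases."""
    return 100.0 * (tp + tn) / (tp + fn + fp + tn)


def weighted_accuracy(tp, fn, fp, tn):
    """Average of true positive rate and true negative rate, in percent."""
    tpr = tp / (tp + fn)  # fraction of actual positives predicted positive
    tnr = tn / (tn + fp)  # fraction of actual negatives predicted negative
    return 100.0 * (tpr + tnr) / 2.0


# Task 1 winner's confusion matrix from the slide:
# actual positive: 95 predicted positive, 55 predicted negative
# actual negative: 128 predicted positive, 356 predicted negative
task1 = dict(tp=95, fn=55, fp=128, tn=356)

print(f"{weighted_accuracy(**task1):.4f}")  # 68.4435
print(f"{accuracy(**task1):.4f}")           # 71.1356
```

Because only 150 of the 634 test cases are positive, the two measures differ: plain accuracy is dominated by the negative class, while weighted accuracy gives both classes equal weight.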
               Task 2 Winner

KDD Cup 2001 Results
Task 2: Function

Name:                 Mark-A. Krogel
Rank:                 1
Accuracy:             93.6258
Weighted Accuracy:    84.8290

                            Predicted
                      Positive    Negative    Total
Actual    Positive         690         282      972
          Negative          58        4304     4362
          Total            748        4586     5334

True Positive Rate:       71.0%
True Negative Rate:       98.7%

[Plot: Distribution of Prediction Accuracy Scores for Task 2: Function Prediction. x-axis: Score (60 to 100); y-axis: Cumulative Frequency (0 to 1). Annotated values: 93.626 and 1.000.]
               Task 3 Winner

KDD Cup 2001 Results
Task 3: Localization

Name:                 Hisashi Hayashi, Jun Sese, and Shinichi Morishita
Rank:                 1
Accuracy:             72.1785

[Plot: Distribution of Prediction Accuracy Scores for Task 3: Localization Prediction. x-axis: Score (0 to 100); y-axis: Cumulative Frequency (0 to 1). Annotated values: 72.179 and 1.000.]
   KDD-2001 Honorable Mentions

Task 1:   Silander, Univ. of Helsinki

Task 2:   Lambert, Golden Helix;
          Sese & Hayashi & Morishita;
          Vogel & Srinivasan, A.I. Insight

Task 3:   Schonlau & DuMouchel & Volinsky &
                Cortes, RAND and AT&T Labs;
          Frasca & Zheng & Parekh & Kohavi,
                Blue Martini

