Microarrays: algorithms for knowledge discovery in oncology and molecular biology


Frank De Smet
          Katholieke Universiteit Leuven
          Faculteit Toegepaste Wetenschappen
          Departement Elektrotechniek (ESAT)

Supervisor: Prof. dr. ir. B. De Moor

Overview

• Introduction: basic concepts of microarray data
• Feature extraction:
      – Univariate analysis
      – Multivariate analysis: PCA
• Classification
• Clustering
• Conclusions and future research

PhD defense Frank De Smet, May 28, 2004

Transcription - Translation

Microarrays

[Figure: cDNA-microarray experiment in six numbered steps. Labels: DNA-clones, cDNA-microarray, mRNA test (tumour), mRNA reference, Red, Green, DB.]

Importance

• Clinical (oncology)
      – Clinical management of cancer is in many cases empirical and not all information that is clinically relevant can be extracted using the data that physicians have access to
      – Fundamental mechanisms behind carcinogenesis are not always taken into account
      But:
      – Expression patterns measured with microarrays in malignant cells reflect the phenotype of the tumour
• Molecular biology
      – Study of the expression behaviour of genes can help to determine their biological role or function

Data-mining framework

[Figure: data-mining framework. An expression matrix (Genes × Patients, with patient class labels 1 and 2) is reduced to Features. A cluster algorithm groups the patients into Cluster 1 and Cluster 2. A classifier assigns predicted labels (<1>, <2>) to patients of unknown class (?), shown alongside the labels 1 and 2.]

Expression matrix

[Figure: expression matrix. Columns: microarray experiments (Condition 1, Condition 2, OR time). Rows: gene expression profiles.]

Univariate analysis in microarray data

• Expression patterns measured under two different conditions
• Selection of the individual genes with the highest differential expression: p-values p_i (i = 1,...,n; p_1 < p_2 < ... < p_n)
• Rejection level α
   – p ≤ α: gene is declared differentially expressed
   – p > α: gene is declared not differentially expressed

[Figure labels: Condition 1, Condition 2, Positive, Negative.]

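The decision rule above can be sketched in a few lines. This is only an illustration: the function name `declare_differential` and the p-values are hypothetical, and the slide does not say which statistical test produces the p-values.

```python
def declare_differential(p_values, alpha):
    """Split gene indices by the rule p <= alpha (positive) versus p > alpha (negative).

    Genes are returned in order of increasing p-value, matching p_1 < p_2 < ... < p_n.
    """
    order = sorted(range(len(p_values)), key=lambda g: p_values[g])
    positives = [g for g in order if p_values[g] <= alpha]
    negatives = [g for g in order if p_values[g] > alpha]
    return positives, negatives


# Hypothetical p-values, one per gene (not taken from the presentation).
p_values = [0.40, 0.01, 0.03, 0.75, 0.20]
positives, negatives = declare_differential(p_values, alpha=0.05)
print("declared differentially expressed:", positives)
print("declared not differentially expressed:", negatives)
```

Raising or lowering `alpha` moves genes between the two lists, which is the trade-off between Type I and Type II errors.
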
Multiple testing

• Overlap of the p-values of the genes with and without actual differential expression: Type I and II errors
• In literature: control of the Type I error: too conservative for microarray data
• Here: balance of Type I and II error

                                      Actually differentially expressed?
Declared differentially expressed?    YES                    NO
YES (p ≤ α)                           TP                     FP (Type I error)     Pos
NO  (p > α)                           FN (Type II error)     TN                    Neg
                                      n1                     n0

Estimation of Type I and II error

[Figure: two p-value distributions, each split by the rejection level into TP, FN, FP and TN.
 Left panel: no real differential expression (randomised data set), uniform distribution.
 Right panel: non-accidental differential expression, superposition of two distributions.]

Calculations

1. Estimation of n1 and n0

$$V_i = \frac{i - p_i \, n}{1 - p_i}$$

2. Estimation of TP_i, TN_i, FP_i and FN_i

                                      Actually differentially expressed?
Declared differentially expressed?    YES                          NO
YES (p ≤ p_i)                         TP_i = i - p_i.n0            FP_i = p_i.n0            Pos_i = i
NO  (p > p_i)                         FN_i = n1 - i + p_i.n0       TN_i = (1 - p_i).n0      Neg_i = n - i
                                      n1                           n0

3. Estimation of sensitivity and specificity

$$SENS_i = \frac{TP_i}{TP_i + FN_i} = \frac{TP_i}{n_1}$$

$$SPEC_i = \frac{TN_i}{TN_i + FP_i} = \frac{TN_i}{n_0}$$

4. ROC curve

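The table entries and the sensitivity and specificity formulas translate directly into code. The sketch below is an illustration, with hypothetical function names `estimate_v` and `confusion_estimates`. Setting FN_i = 0 in the table and solving for n1 gives exactly V_i, but the slide does not state how the individual V_i are combined into one estimate of n1, so n0 is simply taken as an input here.

```python
def estimate_v(i, p_i, n):
    """V_i = (i - p_i * n) / (1 - p_i): the value n1 would have if FN_i were zero."""
    return (i - p_i * n) / (1.0 - p_i)


def confusion_estimates(i, p_i, n, n0):
    """Estimated counts when the i genes with p <= p_i are declared positive."""
    n1 = n - n0
    fp = p_i * n0            # FP_i = p_i.n0
    tp = i - fp              # TP_i = i - p_i.n0
    fn = n1 - tp             # FN_i = n1 - i + p_i.n0
    tn = (1.0 - p_i) * n0    # TN_i = (1 - p_i).n0
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "SENS": tp / n1, "SPEC": tn / n0}


# Inputs from the Golub et al. column of the acute leukemia example,
# using the rounded rejection level 0.18 (= p_3429).
est = confusion_estimates(i=3429, p_i=0.18, n=7129, n0=3876)
v = estimate_v(i=3429, p_i=0.18, n=7129)
print(round(est["SENS"], 3), round(est["SPEC"], 3))
```

With the rounded rejection level, the sensitivity comes out near 0.84 and the specificity at 0.82, close to the 84.03% and 82.06% in the acute leukemia example. The small gap is consistent with 0.18 being a rounded value.
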
ROC curve

• Optimal balance between Type I and II errors
• Area under the curve
      – Quantifies how well the genes whose expression is and is not affected by the difference between conditions can be discriminated using their p-values
      – Quality measure for microarray data

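A minimal sketch of how such a ROC curve, its area, and an optimal rejection level could be computed from sorted p-values, assuming n0 is already known. Three choices are assumptions of this sketch and are not stated on the slide: taking the optimum as the rank that maximises SENS + SPEC, clamping estimated sensitivities to [0, 1], and integrating with the trapezoidal rule. The p-values are hypothetical.

```python
def roc_points(p_sorted, n0):
    """Return (1 - SPEC_i, SENS_i) for each rank i, plus the end points (0,0) and (1,1)."""
    n1 = len(p_sorted) - n0
    points = [(0.0, 0.0)]
    for i, p in enumerate(p_sorted, start=1):
        sens = (i - p * n0) / n1            # TP_i / n1
        sens = min(1.0, max(0.0, sens))     # estimates can leave [0, 1]; clamp (assumption)
        points.append((p, sens))            # 1 - SPEC_i = 1 - (1 - p_i) = p_i
    points.append((1.0, 1.0))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points with increasing x."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


def best_rejection_level(p_sorted, n0):
    """Rank i and p_i maximising SENS_i + SPEC_i (assumed optimality criterion)."""
    pts = roc_points(p_sorted, n0)[1:-1]
    k = max(range(len(pts)), key=lambda j: pts[j][1] + (1.0 - pts[j][0]))
    return k + 1, p_sorted[k]


# Hypothetical sorted p-values for 8 genes, 4 of them assumed to be null genes.
p_sorted = [0.01, 0.02, 0.03, 0.05, 0.2, 0.4, 0.6, 0.8]
auc = area_under_curve(roc_points(p_sorted, n0=4))
i_opt, alpha_opt = best_rejection_level(p_sorted, n0=4)
print(auc, i_opt, alpha_opt)
```

An area close to 1 means the p-values separate affected from unaffected genes well, which is why the area can serve as a quality measure for a data set.
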
Example: Acute leukemia

                             Golub et al.        Armstrong et al.
                             ALL-AML             ALL-AML
n                            7129                12582
n0                           3876                3084
n1                           3253                9498
AUC (%)                      91.39               95.13
α_opt                        0.18 (= p_3429)     0.11 (= p_8633)
SENS_opt (%)                 84.03               87.26
SPEC_opt (%)                 82.06               88.56
SENS_opt + SPEC_opt (%)      166.09              175.82

Multivariate analysis in microarray data
Principal Component Analysis

[Figure: unsupervised PCA plots (axes PC1 and PC2) for two data sets. Breast cancer: degree of differentiation. Acute leukemia: ALL - AML.]

Classification

[Figure: unsupervised and supervised results for two data sets. Breast cancer: degree of differentiation. Acute leukemia: ALL - AML.]

Clustering: gene expression profiles

• Importance
      – Identification of groups of coexpressed genes
      – Coexpressed genes have a higher probability of having similar biological functions: e.g., they might interact with the same transcription factors (coregulation)
• First generation algorithms: disadvantages
      – Parameter fine-tuning
      – Each profile is assigned to a cluster
      – Computational complexity

Quality-based clustering (Heyer et al.)

Algorithm produces clusters
      – with a quality guarantee (fixed and user-defined threshold for diameter D)
      – with a maximum number of profiles

[Figure: candidate clusters of diameter D, e.g. candidate cluster 1: 3 profiles, ..., candidate cluster 5: 6 profiles, ..., candidate cluster 17: 2 profiles.]

Still some disadvantages!

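The candidate-cluster idea can be sketched as follows: grow one candidate cluster per profile without exceeding the diameter D, keep the largest candidate, remove its profiles, and repeat. This is a simplified illustration, not the published implementation. It assumes Euclidean distance, and the names `qt_clust`, `candidate_cluster` and `min_size`, as well as the toy profiles, are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two profiles (the distance measure is an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def candidate_cluster(seed, remaining, profiles, max_diameter):
    """Grow a cluster from `seed`, always adding the profile that keeps it most compact."""
    cluster = [seed]
    pool = [g for g in remaining if g != seed]
    while pool:
        def farthest_member(g):
            return max(distance(profiles[g], profiles[m]) for m in cluster)
        best = min(pool, key=farthest_member)
        if farthest_member(best) > max_diameter:
            break                      # adding any profile would exceed diameter D
        cluster.append(best)
        pool.remove(best)
    return cluster


def qt_clust(profiles, max_diameter, min_size=2):
    """Repeatedly keep the largest candidate cluster; leftover profiles stay unclustered."""
    remaining = list(range(len(profiles)))
    clusters = []
    while remaining:
        candidates = [candidate_cluster(s, remaining, profiles, max_diameter)
                      for s in remaining]
        best = max(candidates, key=len)
        if len(best) < min_size:
            break
        clusters.append(sorted(best))
        remaining = [g for g in remaining if g not in best]
    return clusters, remaining


# Hypothetical two-dimensional "profiles": two tight groups and one outlier.
profiles = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (9, 0)]
clusters, unclustered = qt_clust(profiles, max_diameter=0.5)
print(clusters, unclustered)
```

Building one candidate per remaining profile in every round is what makes this approach expensive, and the fixed `max_diameter` is the user-defined parameter that the adaptive method replaces.
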
Adaptive quality-based clustering (AQBC)

• A heuristic iterative two-step approach
      – Step 1: Quality-based approach:
        Find a cluster center in an area of the data set where the density of expression profiles, within a sphere with preliminary radius, is locally maximal
      – Step 2: Adaptive approach:
        Re-estimation of the radius

Step 1: Localization of a cluster center

[Figure: localization of a cluster center; label: radius R.]

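One simple way to look for a locally dense region is to move a center repeatedly to the mean of the profiles inside a sphere of radius R until it stops moving. The sketch below uses that scheme with a fixed radius. It is an assumed simplification for illustration, not the procedure from the presentation, and the name `localize_center` and the toy data are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def localize_center(profiles, start, radius, max_iter=100, tol=1e-9):
    """Move the center to the mean of the profiles within `radius` until it is stable."""
    center = list(start)
    for _ in range(max_iter):
        inside = [p for p in profiles if distance(p, center) <= radius]
        if not inside:
            break                                  # empty sphere: nothing to average
        new_center = [sum(coords) / len(inside) for coords in zip(*inside)]
        if distance(new_center, center) < tol:
            break                                  # converged
        center = new_center
    return center


# Hypothetical data: four nearby profiles and one distant profile.
profiles = [(0, 0), (0.2, 0), (0, 0.2), (0.2, 0.2), (3, 3)]
center = localize_center(profiles, start=(0.5, 0.5), radius=1.0)
print(center)
```

The distant profile never enters the sphere, so the center settles on the mean of the four nearby profiles.
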
Step 2: Re-calculation of the radius

$$p(r) = P_C \, p(r \mid C) + P_B \, p(r \mid B)$$

$$P(C \mid R_{new}) = \frac{P_C \, p(R_{new} \mid C)}{P_C \, p(R_{new} \mid C) + P_B \, p(R_{new} \mid B)} = S$$

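Given the two component densities, the new radius is the distance at which the posterior probability of belonging to the cluster equals the significance level S. The sketch below solves that equation by bisection. The slide does not give the forms of p(r | C) and p(r | B), so the densities used here (an exponential for the cluster, a uniform for the background) are purely hypothetical stand-ins. The sketch also assumes P_B = 1 - P_C and a posterior that decreases with r.

```python
import math


def posterior_cluster(r, prior_c, density_c, density_b):
    """P(C | r) by Bayes' rule for the mixture p(r) = P_C p(r|C) + P_B p(r|B)."""
    prior_b = 1.0 - prior_c                 # assumption: the two priors sum to 1
    weighted_c = prior_c * density_c(r)
    return weighted_c / (weighted_c + prior_b * density_b(r))


def radius_for_significance(s, prior_c, density_c, density_b, lo, hi, steps=60):
    """Bisection for R_new with P(C | R_new) = S, assuming a decreasing posterior."""
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if posterior_cluster(mid, prior_c, density_c, density_b) > s:
            lo = mid                        # still above S: the radius can grow
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical densities, chosen only to make the example solvable by hand.
def density_c(r):
    return math.exp(-r)                     # cluster: exponential with rate 1


def density_b(r):
    return 0.01                             # background: uniform on [0, 100]


r_new = radius_for_significance(0.95, 0.5, density_c, density_b, lo=0.0, hi=10.0)
print(r_new, posterior_cluster(r_new, 0.5, density_c, density_b))
```

For these toy densities the equation can be solved analytically (exp(-R_new) = 0.19), which gives a quick check on the bisection result.
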
Comparison

                            AQBC                                  QT_Clust (Heyer et al.)
User-defined parameters     1. Data set                           1. Data set
                            2. Significance level S               2. Radius R or diameter D
                            3. Minimum number of genes            3. Minimum number of genes
Quality measure             Significance level S:                 Radius or diameter:
                            statistical parameter                 arbitrary parameter
Cluster radius R            Automatically calculated for each     Constant and user-defined
                            cluster separately - not constant
Computational complexity    ~ O(n × e × VC)                       ~ O(n² × e × VC)
Number of clusters          Not predefined                        Not predefined
Inclusion of all genes      No                                    No
in clusters

Validation

Cluster number        MIPS functional category                      P-value (-log10)
AQBC    K-means                                                     AQBC    K-means
1       1             ribosomal proteins                            80      54
                      organisation of cytoplasm                     77      39
                      protein synthesis                             74      NR
                      cellular organisation                         34      NR
                      translation                                   9       NR
                      organisation of chromosome structure          1       4
2       4             mitochondrial organization                    18      10
                      energy                                        8       NR
                      proteolysis                                   7       NR
                      respiration                                   6       5
                      ribosomal proteins                            4       NR
                      protein synthesis                             4       NR
                      protein destination                           4       NR
5       2             DNA synthesis and replication                 18      16
                      cell growth, cell division, DNA synthesis     17      NR
                      recombination and DNA repair                  8       5
                      nuclear organization                          8       4
                      cell-cycle control and mitosis                7       8

Availability

[Figure: bar chart of the number of hits (y-axis: 0 to 350) over time; the x-axis date labels are not recoverable.]

Himanen et al. (2004) Transcript profiling of early lateral root initiation. Proc Natl Acad Sci, 101, 5146-5151.

Conclusions

Data-mining framework for microarray data

• Feature extraction
      – Univariate analysis
             • Estimation of n1 and n0
             • ROC curves: optimal balance between Type I and II error + quality measure
      – Multivariate analysis: PCA
• Classification: FDA and LS-SVM
• Clustering
      – Microarray experiments
      – Gene expression profiles: AQBC

Clinical data

Selected publications

•    De Smet, F., Marchal, K., Timmerman, D., Vergote, I., De Moor, B. and Moreau, Y. (2001) Gebruik van microroosters in de klinische oncologie. Tijdschr voor Geneeskunde, 57, 1225-1236.

•    De Smet, F., Mathys, J., Marchal, K., Thijs, G., De Moor, B. and Moreau, Y. (2002) Adaptive quality-based clustering of gene expression profiles. Bioinformatics, 18, 735-746.

•    Moreau, Y., De Smet, F., Thijs, G., Marchal, K. and De Moor, B. (2002) Functional bioinformatics of microarray data: from expression to regulation. Proceedings of the IEEE, 90, 1722-1743.

•    De Smet, F., Moreau, Y., Timmerman, D., Vergote, I. and De Moor, B. (2004) Balancing false positives and false negatives for the detection of differential expression in malignancies. Br J Cancer, submitted.

•    Epstein, E., Skoog, L., Isberg, P.E., De Smet, F., De Moor, B., Olofsson, P.A., Gudmundsson, S. and Valentin, L. (2002) An algorithm including results of gray-scale and power Doppler ultrasound examination to predict endometrial malignancy in women with postmenopausal bleeding. Ultrasound Obstet Gynecol, 20, 370-376.

Future research

• Specific
      – Ovarian cancer: transcriptomics
             • Prediction of chemosensitivity in stage III
             • Prediction of recurrence in stage I
      – Endometriosis: proteomics and transcriptomics
             • Detection of endometriosis
             • Prediction of relapse after surgery
• General
      – Microarrays: number of patients - validation - standardization
      – Proteomics
      – Combination and comparison of microarray, proteomic and clinical data

								