Clustering and MDS by sanmelody


Exploring Microarray Data

Javier Cabrera

                    Outline
1. Exploratory Analysis Steps.
2. Microarray Data as Multivariate Data.
3. Dimension Reduction
4. Correlation Matrix
5. Principal components Geometrical Interpretation
6. Linear Algebra basics
7. How many principal components
8. Biplots
9. Other graphical software for EDA : Ggobi
                          Process
Assume the data have already gone through the QC process, normalization, and outlier
   detection. At this point we have an exprSet: an
   array of expressions (rows are genes, columns are samples).

1. Select a gene subset by one of the methods described later. This will bring you
   down to a small subset of genes (hundreds or fewer).

2. Use PCA to further reduce the dimension, to somewhere between 1 and a few tens of components.

3. Apply a data analysis method: biplot, clustering, classification,
   MDS.
Microarray data as Multivariate Data
Microarray data: the gene expression matrix is a G x p matrix (G genes, p samples).

Genes are the variables =>
        G > p: many more variables than samples.

This makes microarray data very different from the data
   found in other applications.

Most multivariate analysis methods rely on having more observations
    than variables (G < p), so the
standard multivariate methods must be reexamined.

Dimension reduction becomes very important and requires:
1. Gene subset selection.
2. Principal Components for further dimension reduction
Dimension Reduction: gene subset selection
1. Use sample grouping:
   Response: Calculate the F-statistic for each individual
   gene and select those genes with the highest F-value.

2. Group of genes related to some pathway.

3. Correlated subsets
   (a) Maximum correlation statistic. For each gene, calculate
   the maximum correlation between that gene and any of
   the others. Select the genes that have the highest
   maximum correlation.

   (b) Maximum eigenvalue. Select random subsets of genes
    of a fixed size and calculate the largest or two largest
    eigenvalues of their covariance matrix. Choose the subset
    with the largest eigenvalues.

4. Coefficient of Variation
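
The selection scores listed above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names, the toy data and the planted gene are hypothetical, the F-statistic is the usual one-way ANOVA F computed gene by gene, and the coefficient of variation is taken as standard deviation over absolute mean.

```python
import numpy as np

def f_statistics(X, groups):
    """One-way ANOVA F-statistic for each gene (each row of X)."""
    groups = np.asarray(groups)
    labels = np.unique(groups)
    G, p = X.shape
    grand_mean = X.mean(axis=1, keepdims=True)
    ss_between = np.zeros(G)
    ss_within = np.zeros(G)
    for lab in labels:
        block = X[:, groups == lab]                  # samples in this group
        mean = block.mean(axis=1, keepdims=True)
        ss_between += block.shape[1] * (mean - grand_mean).ravel() ** 2
        ss_within += ((block - mean) ** 2).sum(axis=1)
    df_between = len(labels) - 1
    df_within = p - len(labels)
    return (ss_between / df_between) / (ss_within / df_within)

def max_correlation(X):
    """For each gene, the largest correlation with any other gene."""
    R = np.corrcoef(X)               # rows of X are the variables
    np.fill_diagonal(R, -np.inf)     # ignore each gene's correlation with itself
    return R.max(axis=1)

def coefficient_of_variation(X):
    """Standard deviation divided by absolute mean, per gene."""
    return X.std(axis=1, ddof=1) / np.abs(X.mean(axis=1))

def top_k(scores, k):
    """Indices of the k genes with the highest score."""
    return np.argsort(scores)[::-1][:k]

# Toy data (made up): 200 genes, 12 samples in 3 groups of 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12)) + 5.0
groups = [0] * 4 + [1] * 4 + [2] * 4
X[0, 4:8] += 20.0            # gene 0 differs strongly in group 1
X[2] = 3.0 * X[1] + 1.0      # gene 2 is a rescaled copy of gene 1

best_by_f = top_k(f_statistics(X, groups), 10)
max_corr = max_correlation(X)
cv = coefficient_of_variation(X)
```

On this toy matrix the planted gene 0 should come first in `best_by_f`, and genes 1 and 2 should have a maximum correlation of essentially 1 despite their different scales.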
Correlation Matrix

$$
R = \begin{pmatrix}
1 & r_{12} & \cdots & r_{1G} \\
r_{21} & 1 & \cdots & r_{2G} \\
\vdots & \vdots & \ddots & \vdots \\
r_{G1} & r_{G2} & \cdots & 1
\end{pmatrix}
$$

1.   Use covariance or correlation matrix?

     - It depends on our way of thinking about microarray data.
     - If two genes are highly correlated but on very different scales,
       do they belong in the same group? If so, use the correlation matrix.

2. Dim(R) = GxG and G is between 1000 and 25000. This is too big, so
    dimension reduction is needed.

3. Rank(R) <= p - 1, since R is computed from only p samples after centering each gene.
   Gene expression matrix X:
               Rows = Genes = Variables
               Columns = Microarrays = Subjects = Observations
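
A small numerical illustration of the points above, as a sketch only: the sizes G = 50 and p = 8, the random data and the scale factor are made up. Two genes with the same pattern on very different scales have a huge covariance but a correlation near 1, and the G x G correlation matrix has rank far below G.

```python
import numpy as np

rng = np.random.default_rng(1)
G, p = 50, 8                       # hypothetical sizes: 50 genes, 8 samples
X = rng.normal(size=(G, p))        # rows = genes = variables

# Gene 1 follows the same pattern as gene 0 but on a much larger scale.
X[1] = 1000.0 * X[0] + rng.normal(scale=50.0, size=p)

S = np.cov(X)                      # G x G covariance between genes
R = np.corrcoef(X)                 # G x G correlation between genes

print(S[0, 1])                     # large, dominated by the scale of gene 1
print(R[0, 1])                     # close to 1, scale-free

# R is built from p samples after centering each gene, so its rank is p - 1.
rank_R = np.linalg.matrix_rank(R, tol=1e-8)
print(rank_R, "out of", G)
```

Forming R explicitly is only feasible here because G is tiny; with thousands of genes the same information is better obtained from a decomposition of X itself.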
Sample Correlation Matrix (p-values in parentheses)

               Gene 141   Gene 187   Gene 246   Gene 509   Gene 1645  Gene 1955
Gene 141       1.0000     0.7983     -0.5058    0.7463     -0.4049    0.4676
                          (0.000)    (0.001)    (0.000)    (0.007)    (0.002)
Gene 187       0.7983     1.0000     -0.8111    0.9357     -0.6621    0.7891
               (0.000)               (0.000)    (0.000)    (0.000)    (0.000)
Gene 246       -0.5058    -0.8111    1.0000     -0.7717    0.7624     -0.7977
               (0.001)    (0.000)               (0.000)    (0.000)    (0.000)
Gene 509       0.7463     0.9357     -0.7717    1.000      -0.6388    0.6827
               (0.000)    (0.000)    (0.000)               (0.000)    (0.000)
Gene 1645      -0.4049    -0.6621    0.7624     -0.6388    1.000      -0.8143
               (0.007)    (0.000)    (0.000)    (0.000)               (0.000)
Gene 1955      0.4676     0.7891     -0.7977    0.6827     -0.8143    1.000
               (0.002)    (0.000)    (0.000)    (0.000)    (0.000)

[Figure: Scatterplot matrix of genes 141, 187, 246, 509, 1645 and 1955.]
 Principal Components Geometrical Intuition

- The data cloud is approximated
    by an ellipsoid.
- The axes of the ellipsoid
   represent the natural
   components of the data.
- The length of each semi-axis
   represents the variability of
   the corresponding component.

[Figure: data cloud in the (Variable X1, Variable X2) plane, with the ellipsoid axes labeled Component1 and Component2.]
             DIMENSION REDUCTION

-   When some of the components show very small variability, they can be
    omitted.
-   The graph shows that Component 2 has low variability, so it can be
    removed.
-   The dimension is reduced from dim=2 to dim=1.

[Figure: data cloud in the (Variable X1, Variable X2) plane, with axes labeled Component1 and Component2.]
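
The two-variable picture can be reproduced numerically. The sketch below uses made-up data (an elongated cloud in the X1, X2 plane) and an eigendecomposition of the 2 x 2 covariance matrix; the variable names are illustrative and not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
t = rng.normal(scale=3.0, size=n)          # large spread along the main axis
e = rng.normal(scale=0.3, size=n)          # small spread across it
data = np.column_stack([t + e, t - e])     # n points in the (X1, X2) plane

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)       # 2 x 2 covariance of X1 and X2

eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # put the largest component first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()        # share of variance per component
scores_1d = centered @ eigvecs[:, 0]       # keep Component 1 only: dim=2 -> dim=1

print(explained)
```

Here Component 1 lies close to the diagonal of the plane and carries almost all of the variance, so dropping Component 2 loses very little.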
Linear Algebra

Linear algebra is useful for writing computations in a convenient way.
   Since the number of genes (G) is very large, we need to write the
   computations so that we do not generate any GxG matrices.
Notice that the rows of X are the genes = variables.

Singular Value Decomposition: X = U D V’
                              Gxp   Gxp pxp pxp


In standard multivariate analysis X would be transposed so that the
    variables correspond to columns of X. But if we did it that way, D
    and V would both be GxG matrices, and that is what we are trying
    to avoid.
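
As a quick check of the dimensions, here is a sketch using NumPy's `np.linalg.svd` with `full_matrices=False`, which returns the thin decomposition. The sizes G = 2000 and p = 10 and the random matrix are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
G, p = 2000, 10                      # hypothetical sizes: many genes, few samples
X = rng.normal(size=(G, p))          # rows = genes, columns = samples

# Thin SVD: no G x G matrix is ever created.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
print(U.shape, d.shape, Vt.shape)    # (2000, 10) (10,) (10, 10)

# X is recovered as U D V', with D = diag(d).
X_rebuilt = (U * d) @ Vt
max_err = np.abs(X - X_rebuilt).max()
print(max_err)
```

NumPy returns the singular values as a vector `d` rather than as the diagonal matrix D, and returns V’ (here `Vt`) rather than V.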
Linear Algebra
-   Singular Value Decomposition:      X = U D V’
                                       Gxp   Gxp pxp pxp

-   The covariance matrix takes the form:        S = U D^2 U’
                                                GxG   Gxp pxp pxG
    (with the rows of X centered, and up to the constant factor 1/(p-1)).
    S is GxG but we do not need to write it down to do the dimension
    reduction.
-   Correlation matrix: subtract the row mean from each row of X, divide by
                     the row standard deviation, and calculate the covariance.

-   Principal Components (PCs): the p columns of U.
-   Eigenvalues (variances of the PCs): the p diagonal elements of D^2.
-   The first data reduction is to express S or R (GxG) as a
    function of U (Gxp) and D (pxp).
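
The sketch below illustrates how the eigenvalues and components can be read off the SVD of the row-centered matrix without ever forming S. The helper name `pca_from_svd` is hypothetical, the 1/(p-1) factor is the usual sample-covariance convention, and the explicit 30 x 30 covariance is computed only as a check on a deliberately tiny example.

```python
import numpy as np

def pca_from_svd(X):
    """PCA with genes (rows) as variables, never forming a G x G matrix."""
    G, p = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)      # center each gene across samples
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = d ** 2 / (p - 1)                  # variances of the components
    scores = Vt.T * d                           # p x p: sample coordinates on the PCs
    return U, eigvals, scores

# Tiny made-up example so that the G x G matrix can be formed as a check.
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 6))
U, eigvals, scores = pca_from_svd(X)

S = np.cov(X)                                   # 30 x 30, for verification only
direct = np.sort(np.linalg.eigvalsh(S))[::-1][:6]

print(np.allclose(eigvals, direct, atol=1e-8))  # same eigenvalues
print(np.allclose(S, (U * eigvals) @ U.T))      # S = U diag(eigvals) U'
```

For the correlation version, each centered row would also be divided by its standard deviation before the SVD.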
                Principal Components Table
                        Comp.1   Comp.2   Comp.3   Comp.4   Comp.5   Comp.6   Comp.7   Comp.8
Standard deviation      4.70972  4.50705  3.87907  1.8340   1.6120   1.5813   1.4073   1.3201
Proportion of Variance  0.24260  0.22217  0.16457  0.0367   0.0284   0.0273   0.0216   0.0190
Cumulative Proportion   0.24260  0.46477  0.62934  0.6661   0.6945   0.7219   0.7435   0.7626

                        Comp.9   Comp.10  Comp.11  Comp.12  Comp.13  Comp.14  Comp.15  Comp.16
Standard deviation      1.27977  1.21854  1.10437  1.0549   1.0238   0.9722   0.9511   0.9177
Proportion of Variance  0.01791  0.01623  0.01333  0.0121   0.0114   0.0103   0.0098   0.0092
Cumulative Proportion   0.78054  0.79678  0.81012  0.8222   0.8337   0.8440   0.8539   0.8632
Dimension reduction:
         Choosing the number of PC’s

1.   k components explain some percentage of the variance: 60%, 70%,
     80%.

2. k eigenvalues are greater than the average (1)

3.   Scree plot: Graph the eigenvalues and look for the last sharp
     decline and choose k as the number of points above the cut off.

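4. Test the null hypothesis that the last m eigenvalues are equal (0), using the statistic

$$
u = \left(G - \frac{2m + 11}{6}\right)\left(m \log \bar{\lambda} - \sum_{i=p-m+1}^{p} \log \lambda_i\right)
$$

where $\bar{\lambda}$ is the average of the last m eigenvalues.

The dfs = (m-1)(m+2)/2, and it is possible to start with a smaller p.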
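
The first two rules and the test statistic can be sketched in plain Python. Everything here is illustrative: the function names and the eigenvalues are made up, and the argument `n` stands in for the quantity that the slide's formula writes as G in the multiplier (textbook versions of this test differ in that term, so treat it as an assumption).

```python
import math

def k_by_variance(eigvals, target):
    """Smallest k whose leading eigenvalues reach the target share of variance."""
    total = sum(eigvals)
    running = 0.0
    for k, value in enumerate(eigvals, start=1):
        running += value
        if running / total >= target:
            return k
    return len(eigvals)

def k_above_average(eigvals):
    """Number of eigenvalues larger than the average eigenvalue."""
    mean = sum(eigvals) / len(eigvals)
    return sum(1 for v in eigvals if v > mean)

def equal_tail_statistic(eigvals, m, n):
    """u = (n - (2m + 11)/6) * (m*log(mean of last m) - sum of logs of last m)."""
    tail = eigvals[-m:]                      # eigvals sorted in decreasing order
    mean_tail = sum(tail) / m
    spread = m * math.log(mean_tail) - sum(math.log(v) for v in tail)
    u = (n - (2 * m + 11) / 6) * spread
    df = (m - 1) * (m + 2) // 2
    return u, df

eigvals = [5.0, 3.0, 1.0, 0.5, 0.5]          # made-up eigenvalues, decreasing

print(k_by_variance(eigvals, 0.70))          # 2
print(k_above_average(eigvals))              # 2
print(equal_tail_statistic(eigvals, 2, 20))  # last two equal, so u is 0 with 2 df
print(equal_tail_statistic(eigvals, 3, 20))  # positive u with 5 df
```

The statistic is zero exactly when the last m eigenvalues are identical and grows as they spread apart, which is why a large u argues for keeping more components.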
1.   The top 3 eigenvalues explain about 63% of the variability (cumulative proportion 0.629 in the principal components table).
2.   13 eigenvalues are greater than the average, 1.
3.   Scree plot:

[Figure: scree plot of the eigenvalues for Comp.1 through Comp.80, with a horizontal line marking the average eigenvalue.]

4.   Test statistic highly significant for p-m = 3.

          p-m                      9       8       7       6       5       4       3        2        1
          u                        0.23    0.63    2.13    8.09    12.73   25.45   262.16   439.51   552.35
          χ² critical value (5%)   5.99    11.07   16.92   23.68   31.41   40.11   49.80    60.48    72.15
Principal Components Graph: PC3 Vs PC2 Vs PC1

The four tumor groups (EW, BL, NB, RM) are represented by different colors.

[Figure: scatterplot matrix of PC1, PC2 and PC3.]
                              Biplots

Combination of two graphs into one:

1. Graph of the observations in the coordinates of the two
   principal components. (Scores)

2. Graph of the Variables projected into the plane of the two
   principal components. (Loadings)

3. The variables are represented as arrows, the observations as
   points or labels.
              Biplots: Linear Algebra
From SVD: X = UDV’ => X2 = U2 D2 V2’ (keeping the first two components)
          A = U2 D2^a and B = V2 D2^b, a+b=1 so X2 = AB’

The biplot is a graphical display of X in which two sets of
  markers are plotted.
One set of markers a1,…,aG represents the rows of X.
The other set of markers, b1,…, bp, represents the
  columns of X.
The biplot is the graph of A and B together in the same
   graph.
If the number of genes is too big, it is better to omit them and
   plot them in a separate graph, or to invert the graph.
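
A minimal sketch of the marker construction, assuming NumPy and a made-up 40 x 6 matrix. The function name `biplot_markers` is hypothetical, and X is used as given (any centering is assumed to have been done beforehand). The check confirms that the product of the two marker sets reproduces the rank-2 approximation for any split a + b = 1.

```python
import numpy as np

def biplot_markers(X, a=0.5):
    """Row markers A and column markers B from the first two SVD components."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    U2, d2, V2 = U[:, :2], d[:2], Vt[:2].T
    A = U2 * d2 ** a             # G x 2: one marker per row (gene)
    B = V2 * d2 ** (1.0 - a)     # p x 2: one marker per column (sample)
    X2 = (U2 * d2) @ V2.T        # rank-2 approximation of X
    return A, B, X2

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 6))     # hypothetical data: 40 genes, 6 samples

A, B, X2 = biplot_markers(X, a=1.0)
print(A.shape, B.shape)          # (40, 2) (6, 2)
print(np.allclose(A @ B.T, X2))  # True: the markers reproduce X2
```

The choice of a only changes how the singular values are shared between the two sets of markers; the inner products between row and column markers stay the same.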
        Biplots of the first two principal components

[Figure: biplot with PC1 on the horizontal axis and PC2 on the vertical axis; samples are plotted as labels (BL, NB, RM, EW) and genes as arrows labeled V1 to V100.]

- The data cloud is divided into 4 clear clusters.
- The arrows representing the genes fall in approximately three groups.
- Next step is to identify the gene groups and check their biological information.

Ggobi display finding four clusters of tumors using the projection pursuit (PP) index on the set of 63 cases. The main panel shows the two-dimensional
projection selected by the PP index, with the four clusters in different colors and glyphs. The top left panel shows
the main controls, and the bottom left panel displays the controls and the graph of the PP index that is being optimized. The graph
shows the index value for a sequence of projections ending at the current one.
            Exploratory Analysis Steps

1. Dimension Reduction: Gene subset selection.

2. Principal Components for further dimension reduction.

3. Biplot and Graphs

4. For samples: Select natural clusters of samples. Identify
   sample grouping with natural clusters.

5. For genes: Identify gene clusters and their function.

								