# Clustering and MDS by sanmelody

```
Exploring Microarray data

Javier Cabrera
Outline
1. Exploratory Analysis Steps.
2. Microarray Data as Multivariate Data.
3. Dimension Reduction
4. Correlation Matrix
5. Principal Components: Geometrical Interpretation
6. Linear Algebra basics
7. How many principal components
8. Biplots
9. Other graphical software for EDA: GGobi
Process
Assume the data have gone through QC, normalization, and outlier
detection. At this point we have an exprSet:
an array of expressions (rows are genes, columns are samples).

1. Select a gene subset by one of the methods. This will bring you
down to some small subset of genes (hundreds or fewer).

2. Use PCA to further reduce the dimension to between 1 and a few tens.

3. Apply a data analysis method: biplot, clustering, classification,
MDS.
Microarray data as Multivariate Data
Microarray Data: Gene expression matrix = G x p matrix

Genes are the variables =>
G > p: many more variables than samples

This makes microarray data very different from the data
found in other applications.

Most multivariate analysis methods rely on more observations
than variables (G < p), so standard multivariate methods
must be reexamined.

Dimension reduction becomes very important and requires:
1. Gene subset selection.
2. Principal Components for further dimension reduction
Dimension Reduction: gene subset selection
1. Use sample grouping:
Response: Calculate the F-statistic for each individual
gene and select those genes with the highest F-value.

2. Group of genes related to some pathway.

3. Correlated subsets
(a) Maximum correlation statistic. For each gene calculate
the maximum correlation between that gene and any of
the others. Select those genes that have the highest
maximum correlation.

(b) Maximum eigenvalue. Select random subsets of genes
of a preset size and calculate the largest or two largest
eigenvalues of their covariance matrix. Choose the subset
with the largest eigenvalues.

4. Coefficient of Variation
Correlation Matrix

R = \begin{pmatrix}
1 & r_{12} & \cdots & r_{1G} \\
r_{21} & 1 & \cdots & r_{2G} \\
\vdots & \vdots & \ddots & \vdots \\
r_{G1} & r_{G2} & \cdots & 1
\end{pmatrix}

1. Use covariance or correlation matrix?

- It depends on our way of thinking about microarray data.
- Suppose two genes are highly correlated but on very different scales.
Do they belong in the same group? If so, use the correlation matrix.

2. Dim(R) = G x G and G is between 1000 and 25000. This is too big
=> Dimension reduction.

3. Rank(R) <= p
Gene expression matrix X:
Rows = Genes = Variables
Columns = Microarrays = Subjects = Observations
Sample Correlation Matrix

            Gene 141   Gene 187   Gene 246   Gene 509   Gene 1645  Gene 1955
Gene 141     1.0000     0.7983    -0.5058     0.7463    -0.4049     0.4676
                       (0.000)    (0.001)    (0.000)    (0.007)    (0.002)
Gene 187     0.7983     1.0000    -0.8111     0.9357    -0.6621     0.7891
            (0.000)                (0.000)    (0.000)    (0.000)    (0.000)
Gene 246    -0.5058    -0.8111     1.0000    -0.7717     0.7624    -0.7977
            (0.001)    (0.000)                (0.000)    (0.000)    (0.000)
Gene 509     0.7463     0.9357    -0.7717     1.000     -0.6388     0.6827
            (0.000)    (0.000)    (0.000)                (0.000)    (0.000)
Gene 1645   -0.4049    -0.6621     0.7624    -0.6388     1.000     -0.8143
            (0.007)    (0.000)    (0.000)    (0.000)                (0.000)
Gene 1955    0.4676     0.7891    -0.7977     0.6827    -0.8143     1.000
            (0.002)    (0.000)    (0.000)    (0.000)    (0.000)

[Figure: Scatterplot Matrix of genes 141, 187, 246, 509, 1645, and 1955]
Principal Components Geometrical Intuition

- The data cloud is approximated by an ellipsoid.
- The axes of the ellipsoid represent the natural components of the data.
- The length of each semi-axis represents the variability of the
corresponding component.

[Figure: data cloud in the (Variable X1, Variable X2) plane with the axes
Component1 and Component2 drawn through it]
DIMENSION REDUCTION

-   When some of the components show very small variability they can
be omitted.
-   The graph shows that Component 2 has low variability, so it can be
removed.
-   The dimension is reduced from dim=2 to dim=1.

[Figure: elongated data cloud in the (Variable X1, Variable X2) plane with
a long Component1 axis and a short Component2 axis]
Linear Algebra

Linear algebra is useful to write computations in a convenient way.
Since the number of genes (G) is very large we need to write the
computations so we do not generate any GxG matrices.
Notice that the rows of X are the genes = variables.

Singular Value Decomposition: X = U D V'
where X is G x p, U is G x p, D is p x p, and V is p x p.

In standard Multivariate Analysis X would be transposed so the
variables correspond to columns of X. But if we do it that way D
and V would both be GxG matrices and that is what we are trying
to avoid.
Linear Algebra
-   Singular Value Decomposition: X = U D V'
where X is G x p, U is G x p, D is p x p, and V is p x p.

-   With the rows of X centered, the covariance matrix takes the form
(up to the constant factor 1/(p-1)): S = U D^2 U'
where S is G x G, U is G x p, D^2 is p x p, and U' is p x G.
S is G x G, but we do not need to write it down to do the dimension
reduction.
-   Correlation matrix: subtract the row mean from each row of X, divide
by the row standard deviation, and calculate the covariance of the result.

-   Principal Components (PC): the p columns of U.
-   Eigenvalues (variances of the PCs): the p diagonal elements of D^2.
-   The first data reduction is to express S or R (G x G) as a
function of U (G x p) and D (p x p).
Principal Components Table
                        Comp.1   Comp.2   Comp.3   Comp.4   Comp.5   Comp.6   Comp.7   Comp.8
Standard deviation      4.70972  4.50705  3.87907  1.8340   1.6120   1.5813   1.4073   1.3201
Proportion of Variance  0.24260  0.22217  0.16457  0.0367   0.0284   0.0273   0.0216   0.0190
Cumulative Proportion   0.24260  0.46477  0.62934  0.6661   0.6945   0.7219   0.7435   0.7626

                        Comp.9   Comp.10  Comp.11  Comp.12  Comp.13  Comp.14  Comp.15  Comp.16
Standard deviation      1.27977  1.21854  1.10437  1.0549   1.0238   0.9722   0.9511   0.9177
Proportion of Variance  0.01791  0.01623  0.01333  0.0121   0.0114   0.0103   0.0098   0.0092
Cumulative Proportion   0.78054  0.79678  0.81012  0.8222   0.8337   0.8440   0.8539   0.8632
Dimension reduction:
Choosing the number of PC’s

1.   k components explain some percentage of the variance: 60%, 70%,
80%.

2. k eigenvalues are greater than the average (1)

3.   Scree plot: Graph the eigenvalues and look for the last sharp
decline and choose k as the number of points above the cut off.

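4. Test the null hypothesis that the last m eigenvalues are equal (0)

u = \left(G - \frac{2m + 11}{6}\right)\left(m \log \bar{\lambda} - \sum_{i=p-m+1}^{p} \log \lambda_i\right)

where \lambda_i are the eigenvalues and \bar{\lambda} is the mean of the
last m of them. The statistic has (m-1)(m+2)/2 degrees of freedom, and
it is possible to start with a smaller p.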
1.   The top 3 eigenvalues explain about 63% of the variability
(cumulative proportion 0.629 in the Principal Components Table).
2.   13 eigenvalues are greater than the average (1).
3.   Scree Plot

[Figure: scree plot of the eigenvalues for Comp.1 through Comp.80, with
the average marked]

4.   The test statistic is highly significant for p-m = 3
(u = 262.16 against a chi-square value of 49.80).

p-m      9       8       7       6       5       4       3        2        1
u        0.23    0.63    2.13    8.09    12.73   25.45   262.16   439.51   552.35
χ²       5.99    11.07   16.92   23.68   31.41   40.11   49.80    60.48    72.15
Principal Components Graph: PC3 Vs PC2 Vs PC1
The four tumor groups (EW, BL, NB, RM) are represented by different colors.

[Figure: scatterplot matrix of PC1, PC2, and PC3]
Biplots

Combination of two graphs into one:

1. Graph of the observations in the coordinates of the two
principal components. (Scores)

2. Graph of the variables projected into the plane of the two
principal components.

3. The variables are represented as arrows, the observations as
points or labels.
Biplots: Linear Algebra
From SVD: X = U D V' => X_2 = U_2 D_2 V_2' (keeping the first two components)
A = U_2 D_2^a and B = V_2 D_2^b, with a + b = 1, so X_2 = A B'

The biplot is a graphical display of X in which two sets of
markers are plotted.
One set of markers, a_1, ..., a_G, represents the rows of X.
The other set of markers, b_1, ..., b_p, represents the
columns of X.
The biplot is the graph of A and B together in the same
graph.
If the number of genes is too big it is better to omit them from the
biplot and plot them in a separate graph, or to invert the graph.
Biplots of the first two principal components
- The data cloud is divided into 4 clear clusters.
- The arrows representing the genes fall in approximately three groups.
- The next step is to identify the gene groups and check their
biological information.

[Figure: biplot with PC1 on the horizontal axis and PC2 on the vertical
axis; samples are labeled BL, NB, RM, and EW, and genes are labeled V1 to V100]
GGobi display finding four clusters of tumors using the PP index on the set of 63 cases. The main panel shows the two-dimensional
projection selected by the PP index, with the four clusters in different colors and glyphs. The top left panel shows
the main controls, and the bottom left panel displays the controls and the graph of the PP index that is being optimized. The graph
shows the index value for a sequence of projections ending at the current one.
Exploratory Analysis Steps

1. Dimension Reduction: Gene subset selection.

2. Principal Components for further dimension reduction.

3. Biplot and Graphs

4. For samples: Select natural clusters of samples. Identify
sample grouping with natural clusters.

5. For genes: Identify gene clusters and their function.

```
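
The Linear Algebra slides argue that the principal components can be read off the thin SVD of the G x p matrix without ever forming a G x G covariance matrix. The following is a minimal sketch of that idea in Python with NumPy, on synthetic data; it is not taken from the slides. The matrix sizes, the 70% threshold, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an expression matrix (assumption, not real data):
# G genes in rows, p samples in columns.
G, p = 2000, 12
X = rng.normal(size=(G, p))

# Center each gene (row) across the samples.
Xc = X - X.mean(axis=1, keepdims=True)

# Thin SVD: U is G x p, d holds p singular values, Vt is p x p.
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of S = Xc Xc' / (p - 1), obtained without building S (G x G).
eigvals = d**2 / (p - 1)

# Rule 1: smallest k whose cumulative proportion of variance reaches 70%.
cum = np.cumsum(eigvals / eigvals.sum())
k_var = int(np.searchsorted(cum, 0.70) + 1)

# Rule 2: number of eigenvalues above the average eigenvalue.
k_avg = int(np.sum(eigvals > eigvals.mean()))

# Sample coordinates on the first k_var components: Xc' U_k = V_k D_k.
scores = Vt[:k_var].T * d[:k_var]
```

Here the columns of `U` play the role of the principal components and `scores` gives the p samples in the reduced space. The largest array built is G x p, which is the point the slides make about avoiding G x G matrices.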
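
The biplot factorization from the "Biplots: Linear Algebra" slide (row markers A and column markers B whose product is the rank-2 part of X) can be sketched in the same way. The function name `biplot_markers`, the synthetic data, and the choice a = 0 are assumptions made for illustration; plotting is left out.

```python
import numpy as np

def biplot_markers(X, a=0.5):
    """Return row markers A (G x 2) and column markers B (p x 2).

    A = U_2 D_2^a and B = V_2 D_2^(1 - a), so A @ B.T is the rank-2
    approximation of X for any a between 0 and 1.
    """
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    U2, d2, V2 = U[:, :2], d[:2], Vt[:2].T
    A = U2 * d2**a
    B = V2 * d2**(1.0 - a)
    return A, B

rng = np.random.default_rng(1)

# Synthetic, row-centered matrix: 50 "genes" by 8 "samples" (assumption).
X = rng.normal(size=(50, 8))
X = X - X.mean(axis=1, keepdims=True)

# With a = 0 the singular values are carried entirely by the column markers.
A, B = biplot_markers(X, a=0.0)
X2 = A @ B.T
```

Plotting the rows of `B` as points (samples) and the rows of `A` as arrows (genes) on the same axes would give the biplot described in the slides. Different values of `a` rescale the two marker sets but leave the product `A @ B.T` unchanged.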