Posted on: 3/23/2011. Public Domain.
Exploring Microarray Data
Javier Cabrera

Outline
1. Exploratory Analysis Steps
2. Microarray Data as Multivariate Data
3. Dimension Reduction
4. Correlation Matrix
5. Principal Components: Geometrical Interpretation
6. Linear Algebra Basics
7. How Many Principal Components
8. Biplots
9. Other Graphical Software for EDA: Ggobi

Process
Assume the data has gone through the QC process: normalization and outlier detection. At this point we have an exprSet: an array of expressions (rows are genes, columns are samples).
1. Select a gene subset by one of the methods. This will bring you down to a small subset of genes (hundreds or fewer).
2. Use PCA to further reduce the dimension, to somewhere between 1 and a few tens.
3. Apply a data analysis method: biplot, clustering, classification, MDS.

Microarray Data as Multivariate Data
Microarray data: the gene expression matrix is a $G \times p$ matrix.
Genes are the variables, so $G > p$: many more variables than samples.
This makes microarray data very different from the data found in other applications. Most multivariate analysis methods rely on more observations than variables ($G < p$), so the standard multivariate methods must be reexamined.
Dimension reduction becomes very important and requires:
1. Gene subset selection.
2. Principal components for further dimension reduction.

Dimension Reduction: Gene Subset Selection
1. Use sample grouping (response): calculate the F-statistic for each individual gene and select the genes with the highest F-values.
2. Group of genes related to some pathway.
3. Correlated subsets:
   (a) Maximum correlation statistic. For each gene, calculate the maximum correlation between that gene and any of the others. Select the genes that have the highest maximum correlation.
   (b) Maximum eigenvalue. Select random subsets of genes of a prefixed size and calculate the largest, or two largest, eigenvalues of their covariance matrix. Choose the subset with the largest eigenvalues.
4. Coefficient of variation.

Correlation Matrix
$$R = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1G} \\ r_{21} & 1 & \cdots & r_{2G} \\ \vdots & \vdots & \ddots & \vdots \\ r_{G1} & r_{G2} & \cdots & 1 \end{pmatrix}$$
1. Use the covariance or the correlation matrix?
   - It depends on our way of thinking about microarray data.
   - Two genes are highly correlated but on very different scales. Do they belong in the same group? If so, use correlation.
2. Dim(R) = $G \times G$ and G is between 1000 and 25000. This is too big, so dimension reduction is needed.
3. Rank(R) is at most p.

Gene expression matrix X: rows = genes = variables; columns = microarrays = subjects = observations.

Sample Correlation Matrix

            Gene 141   Gene 187   Gene 246   Gene 509   Gene 1645  Gene 1955
Gene 141      1.0000     0.7983    -0.5058     0.7463    -0.4049     0.4676
                        (0.000)    (0.001)    (0.000)    (0.007)    (0.002)
Gene 187      0.7983     1.0000    -0.8111     0.9357    -0.6621     0.7891
             (0.000)               (0.000)    (0.000)    (0.000)    (0.000)
Gene 246     -0.5058    -0.8111     1.0000    -0.7717     0.7624    -0.7977
             (0.001)    (0.000)               (0.000)    (0.000)    (0.000)
Gene 509      0.7463     0.9357    -0.7717     1.0000    -0.6388     0.6827
             (0.000)    (0.000)    (0.000)               (0.000)    (0.000)
Gene 1645    -0.4049    -0.6621     0.7624    -0.6388     1.0000    -0.8143
             (0.007)    (0.000)    (0.000)    (0.000)               (0.000)
Gene 1955     0.4676     0.7891    -0.7977     0.6827    -0.8143     1.0000
             (0.002)    (0.000)    (0.000)    (0.000)    (0.000)

Scatterplot Matrix
[Figure: scatterplot matrix of genes 141, 187, 246, 509, 1645 and 1955.]

Principal Components: Geometrical Intuition
- The data cloud is approximated by an ellipsoid.
- The axes of the ellipsoid represent the natural components of the data.
- The length of each semi-axis represents the variability of the corresponding component.
[Figure: data cloud in the plane of Variable X1 and Variable X2, with the ellipsoid axes labeled Component 1 and Component 2.]

Dimension Reduction
- When some of the components show very small variability, they can be omitted.
- The graph shows that Component 2 has low variability, so it can be removed.
- The dimension is reduced from dim = 2 to dim = 1.
[Figure: data cloud in the plane of Variable X1 and Variable X2, elongated along Component 1 and narrow along Component 2.]

Linear Algebra
Linear algebra is useful to write computations in a convenient way.
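As a small illustration of writing a computation in matrix form, the correlation matrix of a gene subset can be obtained with one matrix product after standardizing the rows, and the maximum correlation statistic from the gene subset selection list then follows from it. This is a minimal sketch, not the author's code: the use of NumPy, the function names `row_correlation` and `max_correlation_statistic`, and the simulated data are all assumptions made for illustration. It is meant for a subset of g genes small enough that a g x g matrix is affordable.

```python
import numpy as np

def row_correlation(X):
    """Correlation matrix between the rows (genes) of X, shape (g, g)."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]                                  # number of samples
    Z = X - X.mean(axis=1, keepdims=True)           # subtract each row mean
    Z = Z / Z.std(axis=1, ddof=1, keepdims=True)    # divide by each row sd
    return Z @ Z.T / (p - 1)                        # covariance of standardized rows

def max_correlation_statistic(X):
    """For each gene, the largest correlation with any other gene."""
    R = row_correlation(X)
    np.fill_diagonal(R, -np.inf)    # ignore each gene's correlation with itself
    return R.max(axis=1)

# Hypothetical data: 6 genes, 40 samples, gene 1 nearly a copy of gene 0.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 40))
X_demo[1] = X_demo[0] + 0.1 * rng.normal(size=40)

R_demo = row_correlation(X_demo)
max_corr = max_correlation_statistic(X_demo)
print(np.round(max_corr, 3))
```

The statistic here uses the signed correlation, following the wording of the selection rule; taking absolute values first would also pick up strongly negatively correlated genes.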
Since the number of genes (G) is very large, we need to write the computations so that we do not generate any $G \times G$ matrices. Notice that the rows of X are the genes = variables.

Singular Value Decomposition:
$$\underset{G \times p}{X} = \underset{G \times p}{U}\,\underset{p \times p}{D}\,\underset{p \times p}{V'}$$

In standard multivariate analysis X would be transposed so that the variables correspond to the columns of X. But if we did it that way, D and V would both be $G \times G$ matrices, and that is what we are trying to avoid.

Linear Algebra
- Singular Value Decomposition:
$$\underset{G \times p}{X} = \underset{G \times p}{U}\,\underset{p \times p}{D}\,\underset{p \times p}{V'}$$
- The covariance matrix takes the form (for row-centered X, up to the constant factor $1/(p-1)$):
$$\underset{G \times G}{S} = \underset{G \times p}{U}\,\underset{p \times p}{D^2}\,\underset{p \times G}{U'}$$
  S is $G \times G$, but we do not need to write it down to do the dimension reduction.
- Correlation matrix: subtract the mean of each row of X, divide by the row standard deviation, and calculate the covariance.
- Principal components (PCs): the p columns of U.
- Eigenvalues (variances of the PCs): the p diagonal elements of $D^2$.
- The first data reduction is to express S or R ($G \times G$) as a function of U ($G \times p$) and D ($p \times p$).

Principal Components Table

                        Comp.1   Comp.2   Comp.3   Comp.4  Comp.5  Comp.6  Comp.7  Comp.8
Standard deviation      4.70972  4.50705  3.87907  1.8340  1.6120  1.5813  1.4073  1.3201
Proportion of Variance  0.24260  0.22217  0.16457  0.0367  0.0284  0.0273  0.0216  0.0190
Cumulative Proportion   0.24260  0.46477  0.62934  0.6661  0.6945  0.7219  0.7435  0.7626

                        Comp.9   Comp.10  Comp.11  Comp.12 Comp.13 Comp.14 Comp.15 Comp.16
Standard deviation      1.27977  1.21854  1.10437  1.0549  1.0238  0.9722  0.9511  0.9177
Proportion of Variance  0.01791  0.01623  0.01333  0.0121  0.0114  0.0103  0.0098  0.0092
Cumulative Proportion   0.78054  0.79678  0.81012  0.8222  0.8337  0.8440  0.8539  0.8632

Dimension Reduction: Choosing the Number of PCs
1. k components explain some percentage of the variance: 60%, 70%, 80%.
2. k eigenvalues are greater than the average (which is 1 for a correlation matrix).
3. Scree plot: graph the eigenvalues, look for the last sharp decline, and choose k as the number of points above the cutoff.
4. Test the null hypothesis $H_0$ that the last m eigenvalues are equal, using
$$u = \left(G - \frac{2m + 11}{6}\right)\left(m \log \bar{\lambda} - \sum_{i=p-m+1}^{p} \log \lambda_i\right)$$
   where $\bar{\lambda}$ is the average of the last m eigenvalues. The degrees of freedom are $(m-1)(m+2)/2$, and it is possible to start with a smaller p.

Applied to the example data:
1. The top 3 eigenvalues explain 70% of the variability.
2. 13 eigenvalues are greater than the average of 1.
3. Scree plot:
[Figure: scree plot of the eigenvalues for Comp.1 through Comp.80, with a horizontal line marking the average.]
4. The test statistic is highly significant at p - m = 3 (u = 262.16 against a critical value of 49.80) but not at p - m = 4.

p - m   9      8      7      6      5      4      3       2       1
u       0.23   0.63   2.13   8.09   12.73  25.45  262.16  439.51  552.35
χ²      5.99   11.07  16.92  23.68  31.41  40.11  49.80   60.48   72.15

Principal Components Graph: PC3 vs PC2 vs PC1
The four tumor groups are represented by different colors.
[Figure: scatterplot matrix of PC1, PC2 and PC3; legend: EW, BL, NB, RM.]

Biplots
A combination of two graphs into one:
1. Graph of the observations in the coordinates of the two principal components (scores).
2. Graph of the variables projected onto the plane of the two principal components (loadings).
3. The variables are represented as arrows, the observations as points or labels.

Biplots: Linear Algebra
From the SVD, $X = U D V'$, keep the first two components to get the rank-2 approximation $X_2 = U_2 D_2 V_2'$.
Let $A = U_2 D_2^{a}$ and $B = V_2 D_2^{b}$ with $a + b = 1$, so that $X_2 = A B'$.
The biplot is a graphical display of X in which two sets of markers are plotted. One set of markers, $a_1, \ldots, a_G$, represents the rows of X. The other set of markers, $b_1, \ldots, b_p$, represents the columns of X.
The biplot is the graph of A and B together in the same graph.
If the number of genes is too big, it is better to omit them from the biplot and plot them in a separate graph, or to invert the graph.
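The marker construction above can be sketched with a thin SVD, which never forms a G x G matrix. This is a minimal sketch under stated assumptions rather than the author's implementation: NumPy is assumed, the function name `biplot_markers` and the simulated data are hypothetical, and centering the rows before the decomposition is a choice made here that the slides do not specify for the biplot.

```python
import numpy as np

def biplot_markers(X, a=0.5):
    """Return (A, B) such that A @ B.T is the rank-2 approximation of X.

    A holds one 2-d marker per row of X (genes),
    B holds one 2-d marker per column of X (samples).
    """
    X = np.asarray(X, dtype=float)
    b = 1.0 - a                                   # enforce a + b = 1
    # Thin SVD: for G >= p, U is G x p, d has p values, Vt is p x p.
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    U2, d2, V2 = U[:, :2], d[:2], Vt[:2].T        # keep the first two components
    A = U2 * d2**a                                # A = U2 D2^a (scales columns)
    B = V2 * d2**b                                # B = V2 D2^b
    return A, B

# Hypothetical data: 50 genes (rows) by 8 samples (columns).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 8))
X_demo = X_demo - X_demo.mean(axis=1, keepdims=True)   # assumption: center rows

A, B = biplot_markers(X_demo, a=0.5)
print(A.shape, B.shape)
```

Choosing a = 1, b = 0 puts the singular values entirely on the row (gene) markers, a = 0, b = 1 puts them on the column (sample) markers, and a = 0.5 splits them evenly. In every case the product A B' is the same rank-2 matrix.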
Biplots of the First Two Principal Components
- The data cloud is divided into 4 clear clusters.
- The arrows representing the genes fall into approximately three groups.
- The next step is to identify the gene groups and check their biological information.
[Figure: biplot on axes PC1 and PC2; samples are labeled BL, NB, RM and EW, and gene arrows are labeled V1 to V100.]

Ggobi display finding four clusters of tumors using the projection pursuit (PP) index on the set of 63 cases. The main panel shows the two-dimensional projection selected by the PP index, with the four clusters in different colors and glyphs. The top left panel shows the main controls, and the bottom left panel displays the controls and the graph of the PP index that is being optimized. The graph shows the index value for a sequence of projections ending at the current one.

Exploratory Analysis Steps
1. Dimension reduction: gene subset selection.
2. Principal components for further dimension reduction.
3. Biplot and graphs.
4. For samples: select natural clusters of samples. Identify sample grouping with natural clusters.
5. For genes: identify gene clusters and their function.
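The first two steps above can be sketched end to end on simulated data. This is only an illustrative sketch, not the author's workflow: NumPy is assumed, the function names `f_statistics` and `explore`, the default of 100 genes, the 70% variance threshold and the simulated expression matrix are all hypothetical choices. Gene selection uses the per-gene F-statistic, and the number of components k is chosen by cumulative proportion of variance.

```python
import numpy as np

def f_statistics(X, groups):
    """One-way ANOVA F-statistic for each row (gene) of X across sample groups."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    p, n_groups = X.shape[1], len(labels)
    grand = X.mean(axis=1)
    ss_between = np.zeros(X.shape[0])
    ss_within = np.zeros(X.shape[0])
    for g in labels:
        Xg = X[:, groups == g]
        mg = Xg.mean(axis=1)
        ss_between += Xg.shape[1] * (mg - grand) ** 2
        ss_within += ((Xg - mg[:, None]) ** 2).sum(axis=1)
    return (ss_between / (n_groups - 1)) / (ss_within / (p - n_groups))

def explore(X, groups, n_genes=100, var_fraction=0.7):
    """Gene subset selection followed by PCA through a thin SVD."""
    X = np.asarray(X, dtype=float)
    # Step 1: keep the genes with the highest F-values.
    keep = np.argsort(f_statistics(X, groups))[::-1][:n_genes]
    Z = X[keep]
    # Correlation scale: center and standardize each gene (row).
    Z = Z - Z.mean(axis=1, keepdims=True)
    Z = Z / Z.std(axis=1, ddof=1, keepdims=True)
    # Step 2: thin SVD, so no (genes x genes) matrix is formed.
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    eigenvalues = d**2 / (Z.shape[1] - 1)
    cum = np.cumsum(eigenvalues) / eigenvalues.sum()
    k = min(int(np.searchsorted(cum, var_fraction)) + 1, len(d))
    scores = Vt[:k].T * d[:k]        # sample coordinates on the first k PCs
    return keep, eigenvalues, k, scores

# Hypothetical data: 500 genes, 24 samples in 4 groups of 6;
# the first 30 genes carry a strong group effect.
rng = np.random.default_rng(2)
groups = np.repeat(np.arange(4), 6)
X_demo = rng.normal(size=(500, 24))
X_demo[:30] += 3.0 * groups

keep, eigenvalues, k, scores = explore(X_demo, groups, n_genes=30)
print(k, scores.shape)
```

The `scores` array holds one row per sample, which is the input a biplot, clustering or classification step would take next.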