Docstoc

PRINCIPAL COMPONENTS ANALYSIS _PCA_

Document Sample
PRINCIPAL COMPONENTS ANALYSIS _PCA_ Powered By Docstoc
					PRINCIPAL COMPONENTS
    ANALYSIS (PCA)

                    Steven M. Ho!and
  Department of Geology, University of Georgia, Athens, GA 30602-2501




                             May 2008
Introduction
Suppose we had measured two variables, length and width, and plotted them as shown below.
Both variables have approximately the same variance and they are highly correlated with one
another. We could pass a vector through the long axis of the cloud of points and a second
vector at right angles to the first, with both vectors passing through the centroid of the data.




Once we have made these vectors, we could find the coordinates of all of the data points rela-
tive to these two perpendicular vectors and replot the data, as shown here (both of these fig-
ures are from Swan and Sandilands, 1995).




In this new reference frame, note that variance is greater along axis 1 than it is on axis 2. Also
note that the spatial relationships of the points are unchanged; this process has merely ro-
tated the data. Finally, note that our new vectors, or axes, are uncorrelated. By performing
such a rotation, the new axes might have particular explanations. In this case, axis 1 could be
regarded as a size measure, with samples on the left having both small length and width and
samples on the right having large length and width. Axis 2 could be regarded as a measure of
shape, with samples at any axis 1 position (that is, of a given size) having different length to
width ratios. PC axes will generally not coincide exactly with any of the original variables.

Although these relationships may seem obvious, when one is dealing with many variables, this
process allows one to assess much more quickly any relationships among variables. For data


Principal Components Analysis
                                                                   1
sets with many variables, the variance of some axes may be great, whereas others may be
small, such that they can be ignored. This is known as reducing the dimensionality of a data
set, such that one might start with thirty original variables, but might end with only two or
three meaningful axes. The formal name for this approach of rotating data such that each
successive axis displays a decreasing among of variance is known as Principal Components
Analysis, or PCA. PCA produces linear combinations of the original variables to generate the
axes, also known as principal components, or PCs.



Computation
Given a data matrix with p variables and n samples, the data are first centered on the means
of each variable. This will insure that the cloud of data is centered on the origin of our prin-
cipal components, but does not affect the spatial relationships of the data nor the variances
along our variables. The first principal components (Y1) is given by the linear combination of
the variables X1, X2, ...,Xp


                           Y1 = a11 X1 + a12 X2 + ... + a1p Xp
or, in matrix notation


                                           Y1 = aT X
                                                 1

The first principal component is calculated such that it accounts for the greatest possible
variance in the data set. Of course, one could make the variance of Y1 as large as possible by
choosing large values for the weights a11, a12, ... a1p. To prevent this, weights are calculated
with the constraint that their sum of squares is 1.


                                  a2 + a2 + ... + a2 = 1
                                   11   12         1p

The second principal component is calculated in the same way, with the condition that it is
uncorrelated with (i.e., perpendicular to) the first principal component and that it accounts
for the next highest variance.


                           Y2 = a21 X1 + a22 X2 + ... + a2p Xp
This continues until a total of p principal components have been calculated, equal to the
original number of variables. At this point, the sum of the variances of all of the principal
components will equal the sum of the variances of all of the variables, that is, all of the origi-
nal information has been explained or accounted for. Collectively, all of these transforma-
tions of the original variables to the principal components is


                                            Y = AX
Calculating these transformations or weights requires a computer for all all but the smallest
matrices. The rows of matrix A are called the eigenvectors of matrix Sx, the variance-
covariance matrix of the original data. The elements of an eigenvector are the weights aij, and


Principal Components Analysis
                                                                       2
are also known as loadings. The elements in the diagonal of matrix Sy, the variance-
covariance matrix of the principal components, are known as the eigenvalues. Eigenvalues
are the variance explained by each principal component, and to repeat, are constrained to
decrease monotonically from the first principal component to the last. These eigenvalues are
commonly plotted on a scree plot to show the decreasing rate at which variance is explained
by additional principal components.




                                              ●
                                          2
                                                  ●
                                                      ●
                                                          ●
                                          0                   ●
                                                                  ●
                                                                      ●
                                                                          ●
                         log(variance)                                        ●
                                                                                  ●
                                         −2                                            ● ●
                                                                                             ● ●
                                                                                                   ●
                                                                                                        ●
                                                                                                            ●
                                         −4
                                                                                                                ●
                                                                                                                    ●
                                                                                                                        ●
                                                                                                                             ●
                                         −6                                                                                      ●


                                                                                                                                     ●
                                         −8                                                                                              ●



                                                              5                   10               15                   20

                                                                          principle component




The positions of each observation in this new coordinate system of principal components are
called scores and are calculated as linear combinations of the original variables and the
weights aij. For example, the score for the rth sample on the kth principal component is calcu-
lated as


                         Ykr = ak1 xk1 + ak2 xk2 + ... + akp xkp

In interpreting the principal components, it is often useful to know the correlations of the
original variables with the principal components. The correlation of variable Xi and principal
component Yj is



                                              rij =                       a2 V ar (Yj ) /sii
                                                                           ij


Because reduction of dimensionality, that is, focussing on a few principal components versus
many variables, is a goal of principal components analysis, several criteria have been proposed
for determining how many PCs should be investigated and how many should be ignored. One
common criteria is to ignore principal components at the point at which the next PC offers
little increase in the total variance explained. A second criteria is to include all those PCs up
to a predetermined total percent variance explained, such as 90%. A third standard is to ig-
nore components whose variance explained is less than 1 when a correlation matrix is used or
less than the average variance explained when a covariance matrix is used, with the idea being



Principal Components Analysis
                                                                                                               3
that such a PC offers less than one variable’s worth of information. A fourth standard is to
ignore the last PCs whose variance explained is all roughly equal.

Principal components are equivalent to major axis regressions. As such, principal compo-
nents analysis is subject to the same restrictions as regression, in particular multivariate nor-
mality. The distributions of each variable should be checked for normality and transforms
used where necessary to correct high degrees of skewness in particular. Outliers should be
removed from the data set as they can dominate the results of a principal components analy-
sis.



PCA in R
1) Do an R-mode PCA using prcomp() in R. To do a Q-mode PCA, the data set should be
transposed before proceeding. R-mode PCA examines the correlations or covariances among
variables, whereas Q-mode focusses on the correlations or covariances among samples.

    > mydata <- read.table(file="mydata.txt", header=TRUE,
    row.names=1, sep=",")

    > mydata.pca <- prcomp(mydata, retx=TRUE, center=TRUE,
    scale.=TRUE)
    # variable means set to zero, and variances set to one
    # sample scores stored in mydata.pca$x
    # loadings stored in mydata.pca$rotation
    # singular values (square roots of eigenvalues) stored
    # in mydata.pca$sdev
    # variable means stored in mydata.pca$center
    # variable standard deviations stored in mydata.pca$scale

    >   sd <- mydata.pca$sdev
    >   loadings <- mydata.pca$rotation
    >   rownames(loadings) <- colnames(mydata)
    >   scores <- mydata.pca$x

2) Do a PCA longhand in R:

    > R <- cor(mydata)
    # calculate a correlation matrix

    >   myEig <- eigen(R)
    #   find the eigenvalues and eigenvectors of correlation matrix
    #   eigenvalues stored in myEig$values
    #   eigenvectors (loadings) stored in myEig$vectors

    > sdLONG <- sqrt(myEig$values)
    # calculating singular values from eigenvalues

    > loadingsLONG <- myEig$vectors


Principal Components Analysis
                                                                      4
    > rownames(loadingsLONG) <- colnames(mydata)
    # saving as loadings, and setting rownames

    > standardize <- function(x) {(x - mean(x))/sd(x)}
    > X <- apply(mydata, MARGIN=2, FUN=standardize)
    # transforming data to zero mean and unit variance

    > scoresLONG <- X %*% loadingsLONG
    # calculating scores from eigenanalysis results


3) Compare results from the two analyses to demonstrate equivalency. Maximum differences
should be close to zero if the two approaches are equivalent.

    > range(sd - sdLONG)
    > range(loadings - loadingsLONG)
    > range(scores - scoresLONG)

4) Do a distance biplot (see Legendre & Legendre, 1998, p. 403)

    > quartz(height=7, width=7)
    > plot(scores[,1], scores[,2], xlab="PCA 1", ylab="PCA 2",
       type="n", xlim=c(min(scores[,1:2]), max(scores[,1:2])),
       ylim=c(min(scores[,1:2]), max(scores[,1:2])))
    > arrows(0,0,loadings[,1]*10,loadings[,2]*10, length=0.1,
       angle=20, col="red")
    # note that this scaling factor of 10 may need to be changed,
    # depending on the data set

    > text(loadings[,1]*10*1.2,loadings[,2]*10*1.2,
       rownames(loadings), col="red", cex=0.7)
    # 1.2 scaling insures that labels are plotted just beyond
    # the arrows

    > text(scores[,1],scores[,2], rownames(scores), col="blue",
       cex=0.7)




Principal Components Analysis
                                                            5
                                15
                                                       CJ001




                                10
                                                       CJ011




                       PCA 2

                                5
                                                  Na
                                                  KB
                                                  Li
                                                  W

                                                   Sr As
                                                 Ba Si CaCd
                                                     Th
                                               CJ003   AgTi
                                           CJ045
                                             CJ043
                                         CJ025 Au       Be        CJ005
                                           CJ041
                                          CJ023      d18O
                                          CJ047 CJ017Zn Cu
                                           JK013 Hg  DH Sb




                                0
                                              pH
                                            DC002
                                            JK012
                                            JK014      CJ009
                                                          SeTl
                                          CJ051
                                           CJ053
                                          CJ063
                                       CJ027
                                         CJ049
                                                         Mn
                                                          Co
                                                          Mo
                                                         YNi
                                                          Cr
                                                          Pb
                                                          Mg
                                                      CJ007
                                                      U Ce
                                                         La
                                                         VP
                                                         Fe
                                                          Al
                                                        SO4
                                        CJ015 CJ019
                                      CJ029
                                     CJ031
                                     CJ039
                                     CJ037
                                                          DC008 DC011
                                                                DC012               CJ021
                                     CJ035                                                            DC006




                                                 0                        5            10        15

                                                                           PCA 1




    > quartz(height=7, width=7)
    > biplot(scores[,1:2], loadings[,1:2], xlab=rownames(scores),
       ylab=rownames(loadings), cex=0.7)
    # using built-in biplot function



                                                     0.0                  0.2        0.4    0.6


                                                                                                         0.6
                                     15




                                                                                                         0.4



                                                         CJ001
                                     10




                                                       Na
                                                       KBCJ011
                                                       WLi
                               PC2




                                                                     As
                                                           Sr
                                                                                                         0.2
                                     5




                                                                     Ca
                                                     Ba         Th
                                                                Si      Cd
                                                                     Ag Ti
                                                   Au          Be
                                                  CJ003
                                                CJ045
                                                 CJ043
                                              CJ025
                                               CJ041        d18O CJ005
                                                                                                         0.0




                                               CJ023        DH
                                            pH CJ051 CJ017 Zn Cu
                                               CJ047
                                               JK013      CJ009 Sb
                                     0




                                                    Hg
                                                DC002
                                                JK012
                                                JK014
                                                CJ053             Se
                                               CJ063
                                            CJ027
                                              CJ049      CJ007 MnCoTl
                                                                 Mo
                                           CJ029 CJ019
                                             CJ015
                                          CJ031
                                          CJ039
                                          CJ037
                                          CJ035
                                                                YPb
                                                           DC008DC011 CJ021
                                                                V
                                                                  Ni
                                                                 Cr
                                                                 Mg
                                                                DC012
                                                             U CeP
                                                                La
                                                                Fe
                                                                 Al
                                                               SO4                               DC006




                                                       0                  5          10     15

                                                                              PC1




5) Do a correlation biplot (see Legendre & Legendre, 1998, p. 404)

    > quartz(height=7, width=7)
    > plot(scores[,1]/sd[1], scores[,2]/sd[2], xlab="PCA 1",
       ylab="PCA 2", type="n")
    > arrows(0,0,loadings[,1]*sd[1],loadings[,2]*sd[2],


Principal Components Analysis
                                                                                 6
      length=0.1, angle=20, col="red")
   > text(loadings[,1]*sd[1]*1.2,loadings[,2]*sd[2]*1.2,
      rownames(loadings), col="red", cex=0.7)
   > text(scores[,1]/sd[1],scores[,2]/sd[2], rownames(scores),
      col="blue", cex=0.7)
   # 1.2 scaling insures that labels are plotted just beyond
   # the arrows




                            4
                                                  CJ001

                                                  CJ011




                            3
                            2
                    PCA 2




                                              Na
                                              KB
                                               Li
                                               W
                            1



                                                  Sr     As
                                               Ba        Ca
                                          CJ003      Th
                                                     Si      Cd
                                       CJ045             Ag Ti CJ005
                                         CJ043
                                     CJ025
                                       CJ041Au
                                      CJ023                 Be
                                                        d18O Cu
                                   pH CJ047 CJ017 CJ009 Sb
                                       JK013            DH
                            0




                                        DC002
                                        JK012Hg         Zn     Se
                                                               Tl
                                        JK014                Mn
                                                              Co
                                      CJ051
                                       CJ053
                                      CJ063
                                   CJ027                U    YMo
                                                            LaNi
                                                              Cr
                                                            CePb
                                                              Mg
                                                             VP
                                     CJ049         CJ007 SO4 Fe
                                                              Al
                                            CJ019      DC008 DC011
                                                              DC012
                                    CJ015
                                  CJ029                                CJ021
                                 CJ031
                                 CJ039
                                 CJ037
                                 CJ035
                                                                                   DC006
                            −1




                                 −1           0           1             2      3      4

                                                              PCA 1




   >   quartz(height=7, width=7)
   >   biplot(mydata.pca)
   #   using built-in biplot function of prcomp(); note that only
   #   the top and right axes match the coordinates of the points;
   #   also note that still doesn't quite replicate the correlation
   #   biplot. It’s unclear what this function really does.




Principal Components Analysis
                                                             7
                                      −5      0       5         10    15

                                              CJ001




                                0.6
                                              CJ011




                                                                                   15
                                0.4




                                                                                   10
                          PC2

                                0.2
                                               Na
                                               K
                                               WB
                                                Li




                                                                                   5
                                                 Sr As
                                                    Ca
                                              Ba Th Ti
                                           CJ003 Si AgCd
                                         CJ045
                                          CJ043
                                        CJ025
                                            Au          CJ005
                                         CJ041
                                         CJ023        Be
                                                   d18O




                                0.0
                                         CJ047 CJ009 Cu
                                         DC002 DH Mn
                                         JK013
                                       pH Hg CJ017 Zn Sb




                                                                                   0
                                         JK012
                                          JK014
                                         CJ051
                                         CJ053          Se
                                                         Tl
                                                        Co
                                                        Mo
                                                        Ni
                                                        Cr
                                                        Pb
                                                       Mg
                                        CJ049 CJ007Ce
                                         CJ063
                                       CJ027           YP
                                                    U SO4
                                                       La
                                                       V
                                                       Fe
                                                        Al
                                            CJ019DC008
                                        CJ015         DC012
                                                      DC011CJ021
                                      CJ029
                                      CJ031
                                      CJ039
                                      CJ037
                                      CJ035                                DC006




                                                                                   −5
                                             0.0      0.2       0.4    0.6

                                                          PC1




6) Calculate the correlation coefficients between variables and principal components

    > correlations <- t(loadings)*sd
    # find the correlation of all the variables with all PC's

    > correlations <- cor(scores,mydata)
    # another way to find these correlations

7) Plot a scree graph

    > quartz(height=7, width=7)
    > plot(mydata.pca)
    # using built-in function for prcomp; may not show all PCs




Principal Components Analysis
                                                          8
                                                                    mydata.pca




                                       20
                                       15
                       Variances

                                       10
                                       5
                                       0




    > quartz(height=7, width=7)
    > plot(log(sd^2), xlab="principle component",
       ylab="log(variance)", type="b", pch=16)
    # using a general plot, with variance on a log scale




                                                 ●●
                                                      ●●●●●
                                       0




                                                              ●●●●●●
                                                                       ●●●●●
                                                                               ●●●●●
                                                                                       ●●●
                                                                                             ●●●
                                                                                                   ●●●
                                                                                                         ●
                                       −20
                       log(variance)

                                       −40
                                       −60




                                                                                                             ●



                                             0          5      10      15      20      25      30            35

                                                                principle component




8) Find variance along each principal component and the eigenvalues

    > newsd <- sd(scores)
    > max (sd - newsd)
    # finds maximum difference between the standard deviation form


Principal Components Analysis
                                                                                    9
    # prcomp and the standard deviation calculated longhand;
    # should be close to zero

    > eigenvalues <- sd^2
    > sum(eigenvalues)
    # should equal number of variables

    > length(mydata)
    # number of variables

9) Save loadings, scores, and singular values to files

    > write.table(loadings, file="loadings.txt")
    > write.table(scores, file="scores.txt")
    > write.table(sd, file="sd.txt")



References
Legendre, P., and L. Legendre, 1998. Numerical Ecology. Elsevier: Amsterdam, 853 p.

Swan, A.R.H., and M. Sandilands, 1995. Introduction to Geological Data Analysis. Blackwell
Science: Oxford, 446 p.




Principal Components Analysis
                                                         10

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:8
posted:10/15/2011
language:English
pages:11