
PRINCIPAL COMPONENT ANALYSIS (PCA)

The Problem of Data Reduction
- Summarize a data set with many (p) variables by a smaller set of (k) derived (synthetic, composite) variables: the n x p data matrix A is reduced to an n x k matrix X.
- Residual variation is the information in A that is not retained in X.
- This is a balancing act between clarity of representation (ease of understanding) and oversimplification (loss of important or relevant information).

Principal Component Analysis (PCA)
- Proposed by Pearson (1901) and Hotelling (1933).
- Probably the most widely used and well-known of the "standard" multivariate methods for data reduction.
- Takes a data matrix of n objects by p variables, which may be correlated, and summarizes it by uncorrelated axes (principal components, or principal axes) that are linear combinations of the original p variables.
- The first k components display as much as possible of the variation among objects.

Geometric Interpretation of PCA
- Objects are represented as a cloud of n points in a multidimensional space, with one axis for each of the p variables.
[Figure: scatter plot of the objects on Variables X1 and X2, with the centroid marked "+"]
- The centroid of the points is defined by the mean of each variable:
    X-bar_i = (1/n) * sum_m X_im
- The variance of each variable is the average squared deviation of its n values around the mean of that variable:
    V_i = (1/(n-1)) * sum_m (X_im - X-bar_i)^2
- The degree to which two variables are linearly correlated is represented by their covariance, where X_im and X_jm are the values of variables i and j in object m, X-bar_i and X-bar_j are their means, and the sum runs over all n objects:
    C_ij = (1/(n-1)) * sum_m (X_im - X-bar_i) * (X_jm - X-bar_j)
- The objective of PCA is to rigidly rotate the axes of this p-dimensional space to new positions (principal axes) with the following properties:
  - the axes are ordered such that principal axis 1 has the highest variance, axis 2 has the next highest variance, ...
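The mean, variance, and covariance formulas above can be sketched numerically. A minimal example, assuming NumPy is available; the two data vectors are illustrative, not taken from the text:

```python
import numpy as np

# Illustrative sample: n = 5 objects measured on two variables
x = np.array([8.0, 10.0, 7.0, 9.0, 6.0])
y = np.array([4.0, 6.0, 5.0, 7.0, 3.0])
n = len(x)

mean_x = x.sum() / n                                        # X-bar = (1/n) * sum_m X_m
var_x = ((x - mean_x) ** 2).sum() / (n - 1)                 # V, with the n-1 divisor
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)  # C_xy

# NumPy's built-ins use the same (n-1) definitions
assert np.isclose(var_x, np.var(x, ddof=1))
assert np.isclose(cov_xy, np.cov(x, y)[0, 1])
```

For these values the mean of x is 8.0, its variance 2.5, and the covariance 2.0, matching NumPy's `np.var(..., ddof=1)` and `np.cov`.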
and axis p has the lowest variance;
  - the covariance among each pair of the principal axes is zero (the principal axes are uncorrelated).

2D Example of PCA
- Variables X1 and X2 have a positive covariance, and each has a similar variance.
[Figure: scatter plot of X1 vs X2, with V1 = 6.67, V2 = 6.24, C_12 = 3.42, X-bar_1 = 8.35, X-bar_2 = 4.91]

The Configuration Is Centered
- Each variable is adjusted to a mean of zero (by subtracting the mean from each value).
[Figure: the same scatter plot after centering, with both axes passing through zero]

The Principal Components Are Computed
- PC 1 has the highest possible variance (9.88).
- PC 2 has a variance of 3.03.
- PC 1 and PC 2 have zero covariance.
[Figure: the centered scatter plot re-plotted on the PC 1 and PC 2 axes]

The Dissimilarity Measure in PCA
- PCA uses the Euclidean distance, calculated from the p variables, as the measure of dissimilarity among the n objects.
- PCA derives the best possible k-dimensional (k < p) representation of the Euclidean distances among the objects.

Generalization to p Dimensions
- In practice, nobody uses PCA with only 2 variables; the algebra for finding the principal axes readily generalizes to p variables.
- PC 1 is the direction of maximum variance in the p-dimensional cloud of points.
- PC 2 is the direction of the next highest variance, subject to the constraint that it has zero covariance with PC 1.
- PC 3 is the direction of the next highest variance, subject to the constraint that it has zero covariance with both PC 1 and PC 2, and so on ...
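The 2-D example above can be reproduced numerically. A sketch, assuming NumPy; the data are randomly generated for illustration, so the variances differ from the 9.88 and 3.03 quoted in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two positively correlated variables, n = 200 objects
x1 = rng.normal(8.35, 2.5, 200)
x2 = 0.5 * x1 + rng.normal(0.0, 1.8, 200)
A = np.column_stack([x1, x2])

A_c = A - A.mean(axis=0)                 # center each variable on zero
S = A_c.T @ A_c / (len(A) - 1)           # 2x2 variance-covariance matrix
evals, evecs = np.linalg.eigh(S)         # eigh returns ascending eigenvalues
order = np.argsort(evals)[::-1]          # reorder so PC 1 has the largest variance
evals, evecs = evals[order], evecs[:, order]

Z = A_c @ evecs                          # scores on PC 1 and PC 2
S_z = Z.T @ Z / (len(Z) - 1)             # covariance matrix of the scores
assert np.isclose(S_z[0, 1], 0.0, atol=1e-8)   # PC scores are uncorrelated
assert np.isclose(S_z[0, 0], evals[0])         # score variance = eigenvalue
```

The assertions check the two defining properties from the text: the principal components have zero covariance, and PC 1 carries the highest possible variance.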
up to PC p.
[Figure: the X1-X2 scatter plot with the PC 1 and PC 2 axes drawn through the point cloud]

Generalization to p Dimensions
- Given a sample of n observations on a vector x of p variables, the first principal component (PC 1) of the sample is defined by the linear transformation
    z_1 = a_1' x
  where the vector a_1 is chosen such that var[z_1] is maximum (subject to a_1' a_1 = 1).
- Likewise, the kth principal component (PC k) of the sample is defined by the linear transformation
    z_k = a_k' x
  where the vector a_k is chosen such that var[z_k] = lambda_k is maximum, subject to cov[z_k, z_l] = 0 for l < k and to a_k' a_k = 1.
- If we take the first k principal components, they define the k-dimensional "hyperplane of best fit" to the point cloud.
- Of the total variance of all p variables, PCs 1 to k represent the maximum possible proportion of that variance that can be displayed in k dimensions; i.e., the squared Euclidean distances among points calculated from their coordinates on PCs 1 to k are the best possible representation of their squared Euclidean distances in the full p dimensions.
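The "best k-dimensional fit" claim can be illustrated in code. A sketch with NumPy and randomly generated data (n, p, and k are arbitrary choices for the illustration): the proportion of variance retained by the first k PCs is the ratio of their eigenvalues to the total, and a rigid rotation leaves the full-dimensional Euclidean distances unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 100, 5, 2
A = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated variables
A_c = A - A.mean(axis=0)                                # center the cloud

S = A_c.T @ A_c / (n - 1)                               # p x p covariance matrix
evals, evecs = np.linalg.eigh(S)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

# Proportion of the total variance displayed by the first k PCs
explained = evals[:k].sum() / evals.sum()
assert 0.0 < explained <= 1.0

# A rigid rotation preserves distances: squared distances computed from
# all p PC scores equal those computed from the original variables
Z = A_c @ evecs
d_full = np.sum((A_c[0] - A_c[1]) ** 2)
d_pcs = np.sum((Z[0] - Z[1]) ** 2)
assert np.isclose(d_full, d_pcs)
```

Dropping the scores beyond PC k then discards only the smallest eigenvalues, which is why the first k axes give the best possible k-dimensional approximation of those distances.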
Covariance vs. Correlation
- Using the covariance among variables only makes sense if they are measured in the same units, and even then, variables with high variances will dominate the principal components.
- These problems are generally avoided by standardizing each variable to zero mean and unit variance, dividing each centered value by the standard deviation SD_i of variable i:
    X'_im = (X_im - X-bar_i) / SD_i
- The covariances between the standardized variables are correlations; after standardization, each variable has a variance of 1.000.
- Correlations can also be calculated from the variances and covariances:
    r_ij = C_ij / sqrt(V_i * V_j)

The Algebra of PCA
- The first step is to calculate the cross-products matrix of variances and covariances (or correlations) among every pair of the p variables.
- This matrix is square and symmetric; the diagonal entries are the variances and the off-diagonal entries are the covariances.

         X1      X2                  X1      X2
   X1  6.6707  3.4170          X1  1.0000  0.5297
   X2  3.4170  6.2384          X2  0.5297  1.0000
   Variance-covariance matrix  Correlation matrix

- In matrix notation, this is computed as
    S = X'X / (n - 1)
  where X is the n x p data matrix with each variable centered (and also standardized by its SD if correlations are used).
- The trace of the covariance matrix represents the total variance in the data; it is the mean squared Euclidean distance between each object and the centroid in p-dimensional space. (Trace = 12.9091 for the variance-covariance matrix above; trace = 2.0000 for the correlation matrix.)
- Finding the principal axes involves an eigenanalysis of the covariance matrix S. The eigenvalues (latent roots) of S are the solutions lambda of the characteristic equation
    |S - lambda * I| = 0
- The eigenvalues lambda_1, lambda_2, ...
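The standardization step can be checked directly: after each variable is centered and divided by its standard deviation, the covariance matrix of the standardized data equals the correlation matrix of the raw data. A sketch, assuming NumPy, with illustrative data in which one variable has a much larger variance:

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(0.0, 10.0, 50)            # large variance
x2 = 0.3 * x1 + rng.normal(0.0, 1.0, 50)  # small variance
A = np.column_stack([x1, x2])

# Standardize: subtract the mean, divide by the standard deviation
Z = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)

S_std = Z.T @ Z / (len(Z) - 1)            # covariance of standardized variables
R = np.corrcoef(A, rowvar=False)          # correlation matrix of the raw data

assert np.allclose(np.diag(S_std), 1.0)   # unit variances after standardizing
assert np.allclose(S_std, R)              # covariances have become correlations
```

Running PCA on `S_std` rather than on the raw covariance matrix is exactly the "correlation matrix" variant described in the text, and it stops the high-variance variable from dominating the first axis.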
lambda_p are the variances of the coordinates on each principal component axis; the sum of all p eigenvalues equals the trace of S (the sum of the variances of the original variables). For the example above:
    lambda_1 = 9.8783, lambda_2 = 3.0308, lambda_1 + lambda_2 = 12.9091 = trace of S
- Each eigenvector consists of p values, which represent the "contribution" of each variable to the principal component axis. The eigenvectors are uncorrelated (orthogonal): their dot products are zero.

   Eigenvectors    u1       u2
        X1       0.7291  -0.6844
        X2       0.6844   0.7291

    0.7291 * (-0.6844) + 0.6844 * 0.7291 = 0

- The coordinates of each object i on the kth principal axis, known as its scores on PC k, are computed as
    z_ki = u_1k * x_1i + u_2k * x_2i + ... + u_pk * x_pi
  or, in matrix form, Z = XU, where Z is the n x k matrix of PC scores, X is the n x p centered data matrix, and U is the p x k matrix of eigenvectors.
- The variance of the scores on each PC axis is equal to the corresponding eigenvalue for that axis; the eigenvalue represents the variance displayed ("explained" or "extracted") by that axis.
- The sum of the first k eigenvalues is the variance explained by the k-dimensional ordination. In the example (lambda_1 = 9.8783, lambda_2 = 3.0308, trace = 12.9091), PC 1 displays ("explains") 9.8783 / 12.9091 = 76.5% of the total variance.
[Figure: the scores plotted on the PC 1 and PC 2 axes]
- The covariance matrix computed among the p principal axes has a simple form: all off-diagonal values are zero (the principal axes are uncorrelated), and the diagonal values are the eigenvalues.
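The eigenanalysis of the worked example can be verified numerically. A sketch, assuming NumPy, using the 2 x 2 variance-covariance matrix given in the text:

```python
import numpy as np

# Variance-covariance matrix from the worked example in the text
S = np.array([[6.6707, 3.4170],
              [3.4170, 6.2384]])

evals, evecs = np.linalg.eigh(S)          # eigh suits symmetric matrices
evals = evals[::-1]                       # descending: lambda_1 >= lambda_2

assert np.isclose(evals[0], 9.8783, atol=1e-3)     # lambda_1
assert np.isclose(evals[1], 3.0308, atol=1e-3)     # lambda_2
assert np.isclose(evals.sum(), np.trace(S))        # eigenvalue sum = trace
assert np.isclose(evecs[:, 0] @ evecs[:, 1], 0.0)  # eigenvectors orthogonal
```

The recovered eigenvalues match the lambda_1 = 9.8783 and lambda_2 = 3.0308 quoted in the text, and their sum reproduces the trace of 12.9091.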
          PC1     PC2
   PC1  9.8783  0.0000
   PC2  0.0000  3.0308
   Variance-covariance matrix of the PC axes

Projection of Data into the New Space
- Let V be the p x p eigenvector matrix, where each row contains one eigenvector.
- The eigenvectors in V are arranged in order of decreasing eigenvalue: the 1st row is the eigenvector with the highest eigenvalue, the 2nd row the eigenvector with the next highest eigenvalue, and so on.
- Let V_k be the k x p matrix containing the first k (k < p) significant eigenvectors (the first k principal components), and let A be the n x p data matrix.
- The projection of the data matrix A into the new space defined by the first k principal components is
    X = A V_k'
  where X is the new n x k data matrix obtained after the projection.

What Are the Assumptions of PCA?
- PCA assumes that the relationships among variables are LINEAR: the cloud of points in p-dimensional space has linear dimensions that can be effectively summarized by the principal axes.
- If the structure in the data is NONLINEAR (the cloud of points twists and curves its way through p-dimensional space), the principal axes will not be an efficient and informative summary of the data.

Reference
- R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience.
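The projection X = A V_k' can be sketched end to end. A minimal example, assuming NumPy; n, p, and k are arbitrary illustrative choices, and the data matrix is centered before projecting, as the earlier slides require:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 20, 4, 2
A = rng.normal(size=(n, p))              # n x p data matrix
A = A - A.mean(axis=0)                   # center each variable

S = A.T @ A / (n - 1)                    # p x p covariance matrix
evals, evecs = np.linalg.eigh(S)
order = np.argsort(evals)[::-1]          # decreasing eigenvalue order
V_k = evecs[:, order[:k]].T              # k x p: rows are the top-k eigenvectors

X = A @ V_k.T                            # project: X = A V_k', shape n x k
assert X.shape == (n, k)

# The variance retained in X equals the sum of the top-k eigenvalues
top_k = np.sort(evals)[::-1][:k].sum()
assert np.isclose(np.var(X, axis=0, ddof=1).sum(), top_k)
```

Storing `V_k` as a k x p matrix of row eigenvectors, as the text describes, makes the projection a single matrix product with its transpose.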
