VIEWS: 4 PAGES: 32 POSTED ON: 12/7/2011 Public Domain
1.1 1. Applied multivariate methods 1.1 An overview of multivariate methods Experimental unit – The object in which information is collected upon. This information is organized into observed variable values. Univariate data – Information measured on one variable. Multivariate data – Information measured on multiple variables. 1.2 Examples Read the examples in Johnson. Example: Cereal data (cereal_data.xls – data stored in Excel file) Data available on the STAT 873 website. In January 1999, I collected the following data on dry cereal from a Dillon’s supermarket in Manhattan, KS. This data was collected as a stratified random sample where shelf of the cereal was the stratum. It is called “stratified” since I randomly selected 10 cereals within a shelf. Shelf #1 represents the bottom shelf and shelf #4 represents the top shelf. 2005 Christopher R. Bilder 1.2 Serving Sugar Fat Sodium ID Shelf Cereal Size (g) (g) (g) (mg) Kellogg’s Razzle Dazzle Rice 1 1 28 10 0 170 Crispies 2 1 Post Toasties Corn Flakes 28 2 0 270 3 1 Kellogg’s Corn Flakes 28 2 0 300 4 1 Food Club Toasted Oats 32 2 2 280 5 1 Frosted Cheerios 30 13 1 210 6 1 Food Club Frosted Flakes 31 11 0 180 7 1 Capn Crunch 27 12 1.5 200 Capn Crunch's Peanut Butter 8 1 27 9 2.5 200 Crunch 9 1 Post Honeycomb 29 11 0.5 220 10 1 Food Club Crispy Rice 33 2 0 330 11 2 Rice Crispies Treats 30 9 1.5 190 12 2 Kellogg's Smacks 27 15 0.5 50 13 2 Kellogg's Froot Loops 32 15 1 150 Capn Crunch's Peanut Butter 14 2 27 9 2.5 200 Crunch 15 2 Cinnamon Grahams 30 11 1 230 Marshmallow Blasted Froot 16 2 30 16 0.5 105 Loops 17 2 Koala Coco Krunch 30 13 1 170 18 2 Food Club Toasted Oats 33 10 1.5 150 19 2 Cocoa Pebbles 29 13 1 160 20 2 Oreo O's 27 11 2.5 150 21 3 Food Club Raisin Bran 54 17 1 280 22 3 Post Honey Bunches of Oats 30 6 1.5 190 23 3 Rice Chex 31 2 0 290 24 3 Kellogg's Corn Pops 31 14 0 120 Post Morning Traditions - 25 3 54 14 5 160 Raisin, Date, Pecan Post Shredded Wheat Spoon 26 3 49 0 0.5 0 Size 27 3 Basic 4 55 14 3 320 28 3 French Toast Crunch 30 12 1 180 29 3 Post Raisin Bran 59 20 1 300 Food Club Frosted Shredded 30 3 50 1 1 0 Wheat 2005 Christopher R. Bilder 1.3 Serving Sugar Fat Sodium ID Shelf Cereal Size (g) (g) (g) (mg) 31 4 Total Raisin Bran 55 19 1 240 32 4 Food Club Wheat Crunch 60 6 0 300 33 4 Oatmeal Crisp Raisin 55 19 2 220 34 4 Food Club Bran Flakes 31 5 0.5 220 35 4 Cookie Crisp 30 12 1 180 36 4 Kellogg's All Bran Original 31 6 1 65 37 4 Food Club Low Fat Granola 55 14 3 100 38 4 Oatmeal Crisp Apple Cinnamon 55 19 2 260 Post Fruit and Fibre - Dates, 39 4 55 17 3 280 Raisons, Walnuts 40 4 Total Corn Flakes 30 3 0 200 Notes: 1)Each row represents a different experimental unit 2)Each column represents a different measured variable 3)The experimental units are defined by the cereal type For most of the methods in this course, the experimental units are assumed to be independent. Are the experimental units in the cereal example independent? To be independent, the observed values for one cereal can not have an effect on another. For example, observed values for Marshmallow Blasted Froot Loops and Captain Crunch's Peanut Butter Crunch need to be independent of each other. We will assume the experimental units are independent for this example. 2005 Christopher R. Bilder 1.4 Example: Placekicking data (placekick.sas7bdat) For the 1995 National Football League season, I collected information on all placekicks attempted. The SAS data set file, placekick.sas7bdat, contains 1,425 of the observations. Two types of placekicks: Field goal – 3 points if successful Point after touchdown (PAT) – 1 point if successful See video at http://www.chrisbilder.com/gf_small.mov. Below is an example of some of the data collected. id date week loc time type field temp humid dir speed cloud precip 1 90395 1 NE 100 O G 73 64 S 11 SN 2 90395 1 NE 100 O G 73 64 S 11 SN 3 90395 1 NE 100 O G 73 64 S 11 SN sc_ sc_ fin_ fin_ id kicker team opp good how pat dist qrtr team opp win team opp alt 1 BAHR NE CL Y N 21 1 0 0 Y 17 14 138 2 BAHR NE CL Y N 21 1 3 7 Y 17 14 138 3 STOVER CL NE Y Y 20 2 13 6 N 14 17 138 id windy diff change elap30 1 0 0 1 24.7167 2 0 -4 0 15.85 3 0 7 0 0.45 2005 Christopher R. Bilder 1.5 Specific information about what each variable measures will be given later in the course. There is a lot of information given! How can all of this information be summarized and is any of it meaningful? What types of inferences can be made from this data set? Some of the topics for this semester: Principle components analysis and factor analysis examine ways to reduce the dimension (number of variables) of the data set. New variables are formed which are “hopefully” interpretable. Discriminant analysis is used to classify experimental units into groups which are already known before the analysis. Canonical discriminant analysis reduces the dimension of the data set and applies discriminant analysis methods. Logistic regression can be used for the same purpose as discriminant analysis, but for only two groups. Cluster analysis can help classify observations into groups which are unknown before the analysis. Multivariate analysis of variance (MANOVA) is an extension of analysis of variance (ANOVA). In ANOVA, hypothesis tests for equality of means are performed. In MANOVA, hypothesis tests for equality of mean vectors are performed. 2005 Christopher R. Bilder 1.6 A lot of these methods are what people use for “data mining”. See Table 1.1 on p. 8 of Johnson. This table will be very helpful to refer back to at the end of the course! 1.3 Types of variables Continuous variables – The have numerical values and can occur anywhere within some interval. There is no fixed number of values the variable can take on. Discrete variables – They can be numerical or nonnumerical. There are a fixed number of values the variable can take on. Example: Cereal data Shelf and cereal are discrete variables, and serving size, sugar, fat, and sodium are continuous variables. Note that serving size, sugar, fat, and sodium are probably reported on the cereal box as discrete variables. Example: Placekicking data Distance is an example of a continuous variable (assumed), and surface is an example of a discrete variable. 2005 Christopher R. Bilder 1.7 Most multivariate methods were developed based on continuous variables using a multivariate normal distribution assumption (more on this distribution in Section 1.5). 1.4 Data matrices and vectors See Appenidx A of Johnson (1998), any matrix algebra book, regression books (such as Chapter 5 of Neter, Kutner, Nachtscheim, and Wasserman (1996)) for a review of matrices. p = number of numerical variables of interest N = number of experimental units on which the variables are being measured xrj = value of the jth response variable on the rth experimental unit for r=1,…,N and j=1,…,p X = data matrix - xrj are arranged in matrix form so that xrj is the rth row and jth column element of X – rows represent experimental units and the columns represent variables. X has dimension Np. xr1 x xr = column vector of data for the rth r2 xrp experimental unit. xr = [xr1, xr2, …, xrp] = “transpose” of xr 2005 Christopher R. Bilder 1.8 x11 x12 x1p x1 x x22 x2p x2 X 21 xN1 xN2 xNp xN The determinant of a matrix is denoted by |A| for some matrix A. o The determinant for a 2 2 matrix is defined a11 a12 as a11a22 a12a21 . a21 a22 o The determinant for a 3 3 matrix can be defined as a11 a12 a13 a a22 a23 a21 a23 a21 a22 a22 a23 a11 a a12 a13 a33 a a33 a 21 a31 a32 a33 32 31 31 a32 See a matrix algebra book for more information on determinants. Example: Cereal data Consider only the shelf, sugar, fat, and sodium variables. The variables are adjusted for the serving size by taking their observed values divided by serving size. For example, for Kellog’s Razzle Dazzle Rice Crispies the “sugar” value is 2005 Christopher R. Bilder 1.9 (sugar grams per serving)/(# of grams per serving size) = 10/28 = 0.3571 The data looks as follows: ID Cereal Shelf Sugar Fat Sodium Kellog’s Razzle Dazzle Rice 1 Crispies 1 0.3571 0 6.0714 2 Post Toasties Corn Flakes 1 0.0714 0 9.6429 40 Total Corn Flakes 4 0.1 0 6.6667 The ID and Cereal variables are in the table to help identify the experimental units. p = 4 since the responses for the cereals are measured on 4 different variables N = 40 x11 = 1, x12 = 0.3571, x13 = 0, x14 = 6.0714, x21 = 1, x22 = 0.0714,… x11 x12 x13 x14 1 0.3571 0 6.0714 x x22 x23 x24 1 0.0714 0 9.6429 X 21 x40,1 x40,2 x40,3 x40,4 4 0.1 0 6.6667 2005 Christopher R. Bilder 1.10 x11 1 x 0.3571 x1 12 x13 0 x14 6.0714 Book notation notes: 1. The i, j, k,… subscripts are used for variables of interest. 2. The r, s, t,… subscripts are used for experimental units. Examples: Suppose denotes the correlation. Then ij denotes the correlation between response variables i and j. Suppose d denotes distance. Then drs denotes the distance between experimental units r and s. 2005 Christopher R. Bilder 1.11 1.5 The multivariate normal distribution Let x be a random variable with an univariate normal distribution with mean E(x) = and variance Var(x) = E[(x-)2]=2. This is represented symbolically as x~N(, 2). Note that x could be capitalized here if one wanted to represented it more formally. The probability distribution function of x is ( x )2 1 f(x | , ) e 2 2 for - <x< . 2 Example: Suppose x~N(50, 32). Below is the graph for f(x|,). Normal Probability Distribution for =50 and =3 0.15 0.1 f(x;,) f(x;,) 0.05 0 35 40 45 50 55 60 65 X 2005 Christopher R. Bilder 1.12 Note that x f(x; , ) 40 0.000514 50 0.132981 60 0.000514 The normal distribution for a random variable can be generalized to a multivariate normal distribution for a vector of random variables (random vector). Much of the theory for multivariate analysis methods relies on the assumption that a random vector, x1 x , xp has a multivariate normal distribution. The relationships between the xi’s can be measured by the covariances and/or the correlations. Let E(xi)=i Covariance of xi and xj: ij = Cov(xi, xj) = E[(xi-i)(xj-j)] Covariance of xi and xi (variance of xi): E[(xi-i)2]= ii Note that this is not denoted by 2 2005 Christopher R. Bilder 1.13 Correlation coefficient of xi and xj: ij Cov(xi ,x j ) ij ii jj Var(xi )Var(x j ) Remember that ij = ji and -1 ij 1. The means, covariances, correlations can be put into a mean vector, covariance matrix, and a correlation matrix: E(x1 ) 1 = E(x)= , E(xp ) p 11 12 1p 22 2p Cov(x )=E[(x-)(x- )]= , and 21 p1 p2 pp 1 12 1p 1 2p P Corr(x )= . 21 p1 p2 1 Remember that and P are symmetric and knowing is equivalent to knowing P. 2005 Christopher R. Bilder 1.14 NOTE: and are bolded although it may not look like it when it is printed. Word does not properly bold the symbol font. Multivariate normal distribution Similar to a univariate normal distribution where only and 2 are needed to know the distribution, only and (or P) are needed for the multivariate normal distribution. The notation, x~Np(,), means that x has a p-dimension multivariate normal distribution with mean and covariance matrix . The multivariate normal distribution function is: 1 1 ( x ) ( x ) f( x | , ) 1/ 2 e 2 (2)p / 2 for - <xi< for i=1,…,p and ||>0. Example: Bivariate normal distribution (p=2) – (graph_mult_normal.sas). Similar to as shown on p. 1.11, the bivariate normal x1 distribution can be graphed. Suppose x has a x2 15 multivariate normal distribution with = and 20 2005 Christopher R. Bilder 1.15 1 0.5 . Thus, we could write x N2 , or 0.5 1.25 15 1 0.5 x N2 , . Also, note that 20 0.5 1.25 1 0.45 P (for example, 12= 0.5 / 1 1.25 =0.45). 0.45 1 The multivariate normal distribution function is: 0.5 15 1 15 1 1 x x 1 2 20 0.5 1.25 20 f(x | , ) 1/ 2 e 1 0.5 (2)p / 2 0.5 1.25 1 1 15 1.25 0.5 15 x x 1 2 20 0.5 1 20 e (2) 1 p / 2 1/ 2 Below are 3D surface and contour plots of the normal distribution. 2005 Christopher R. Bilder 1.16 2005 Christopher R. Bilder 1.17 Examine the following: 1. The surface is centered at . 2. The surface is wider in the x2 direction than in the x1. This is because there is more variability. 3. Notice the shape of the surface – The contour plot is an ellipse and the surface plot looks like a 3D normal curve. 4. Volume underneath the surface is 1. Question: Suppose a random sample is taken from a population which can be characterized by this multivariate normal distribution. A scatter plot is constructed of the observed x1 and x2 values. Where would most of the (x1, x2) points fall? 15 1 0.5 Suppose = and . The only change 20 0.5 3 from the previous example is that 22 has increased. 1 0.289 Note that P . Below are 3D surface 0.289 1 and contour plots of the normal distribution. 2005 Christopher R. Bilder 1.18 2005 Christopher R. Bilder 1.19 Compare the above plots with the previous example. Questions to think about for the bivariate normal distribution: 1. What happens if the means change? 2. What happens if 12=0? 3. What happens if 11=22 and 12=0? 4. What happens if 12=1 or -1? If you do not know the answers, use the SAS program to find out what happens! Below is part of the SAS program used to construct the plots above. SAS does not have a mechanism where the multivariate normal distribution function can be 2005 Christopher R. Bilder 1.20 inputted and a resulting plot automatically constructed. Instead, we can tell SAS to find f(x|,) for many different values of x and then plot these (x1, x2, f(x|,) ) triples. 2005 Christopher R. Bilder 1.21 dm 'log;clear;output;clear;'; options ps=50 ls=70 pageno=1; goptions reset=global border ftext=swiss gunit=cm htext=0.4 htitle=0.5; goptions display noprompt; **********************************************************; ** **; ** AUTHOR: Chris Bilder **; ** COURSE: STAT 873 **; ** DATE: 1-8-01 **; ** UPDATE: 2-6-01 – **; ** corrected error in finding f(x), 8-22-03 **; ** PURPOSE: Graph multivariate normal distribution **; ** **; ** NOTES: **; ** **; **********************************************************; title1 'Chris Bilder, STAT 873'; proc iml; pi = 3.141592654; mu={15,20}; sigma={1 0.5, 0.5 1.25}; det = det(sigma); inv_sigma = inv(sigma); p=2; *initialize save matrix, note there are 61*61 x1 and x2 combinations in the do loop below; save = repeat(0, 3721, 3); *initialize counter for do loop; i=1; 2005 Christopher R. Bilder 1.22 *evaluate f(x) for many x1 and x2 values; do x1 = 10 to 25 by 0.25; do x2 = 10 to 25 by 0.25; *Creates 2x1 vector; x = x1 // x2; *Evaluate f(x); f_x = (1/(2*pi)**(p/2)) * det**(-1/2) * exp(-1/2 * t(x-mu)*inv_sigma*(x-mu)); *Create 1x3 vector for the results and then save it into the ith row of the save matrix; save[i,] = x1 || x2 || f_x; i=i+1; end; end; col={ "x1" "x2" "f_x"}; create set1 from save [colname=col]; append from save; quit; *Create 3D plot of the surface; proc g3d data=set1; title2 '3D surface '; plot x1*x2=f_x / grid zmin=0 zmax=0.3 zticknum=5 rotate=50; *use different rotate option on the above plot statement; * if you need to; run; *Create contour plot; proc gcontour data=set1; plot x1*x2=f_x / grid autolabel=(reveal) haxis=axis1 vaxis=axis2 legend=legend1 2005 Christopher R. Bilder 1.23 levels=0.01 0.03 0.05 0.07 0.09 0.11 0.13 0.15; *The reveal statement puts the level labels on the plot. If there are too many to fit, SAS will not plot them; title2 'Contour plot'; axis1 label = ('x2') length=10 order = (10 to 30 by 5); axis2 label=(a=90 'x1') length=8.71 order = (10 to 30 by 5); legend1 label=('f(x)') down=2; run; For homework, run this program! In order to get the plots from SAS into Word, I simply selected EDIT > COPY from SAS when the GRAPH window is the current window. Note that the ability to do this is new to SAS 9! Before, one had to export the plot out of SAS as a .gif or .jpg, and then insert the image into Word. If you do not have experience in SAS, I recommend taking a look at my lecture notes for OSU STAT 4091 at http://www.chrisbilder.com/stat4091 (see the schedule web page). This course was a 1 credit hour course introducing students to SAS 8.1. 2005 Christopher R. Bilder 1.24 1.6 Statistical computing Read on your own. 1.7 Multivariate outliers Read on your own. 2005 Christopher R. Bilder 1.25 1.8 Multivariate summary statistics Let x1, x2, …, xN be a random sample from a multivariate distribution (not necessarily normal) with a mean vector , covariance matrix of , and correlation matrix P. This can be represented symbolically as {xi }N 1 ~ (, ). i Note on notation: If it is understood that a random sample of size N is taken, then we could shorten this to x ~ (, ) . Estimates of , , and P based on the random sample are: 1N 1 xr x1 x 2 ... xN ˆ N r 1 N ˆ 1 ( xr )(xr ) N ˆ ˆ N 1 r 1 Notes: 1 N xr1 x1 N r 1 1 ˆ x 1 N x ˆ 1. ˆ 2 N r1 r 2 2 p xp ˆ 1 N N r1 xrp 2005 Christopher R. Bilder 1.26 11 12 ˆ ˆ 1p ˆ ˆ 21 22 ˆ 2p ˆ 1 N ˆ 2. ( xr )(xr ) ˆ ˆ N 1 r 1 p1 p2 ˆ ˆ pp ˆ 1 N where ii Var(xi ) ˆ (xri xi ) and 2 N 1 r 1 1 N ij Cov(xi ,x j ) ˆ (xri xi )(xrj x j ) N 1 r 1 3. rij and R are used to denote the estimates of ij and P, respectively. 1 N ij ˆ (xri xi )(xrj x j ) 4. rij N 1 r 1 ii jj ˆ ˆ 1 N 2 1 N 2 (xri xi ) N 1 r 1 (xrj x j ) N 1 r 1 N (xri xi )(xrj x j ) r 1 (x x )2 (x x )2 N N r 1 ri i r 1 rj j and 1 r12 r1p r 1 r2p R 21 rp1 rp2 1 2005 Christopher R. Bilder 1.27 Example: Cereal data (cereal_data.sas, cereal_data.xls) Let x1 denote sugar, x2 denote fat, and x3 denote sodium which are adjusted for the number of grams per serving. Suppose 40 xi1 xi i1 xi2 ~ (, ) 40 xi3 i1 meaning we have a random sample of size 40 from (, ) . Thus, we have x1, x2, …, x40. More simply, someone may say x ~ (, ) with N = 40. Using the observed values of x1, x2, …, x40 produces the estimates of , , and P: 1 N 1 xr1 0.3571 ... 0.1 N r 1 x1 40 0.2894 x2 xr 2 0 ... 0 0.0322 1 N 1 ˆ N r 1 40 x3 N 1 5.6147 1 xr 3 6.07 ... 6.67 N r 1 40 2005 Christopher R. Bilder 1.28 ˆ 1 N ( xr )(xr ) ˆ ˆ N 1 r 1 0.2894 0.2894 1 N xr 0.0322 xr 0.0322 40 1 r 1 5.6147 5.6147 0.3571 0.2894 0.3571 0.2894 1 0 0.0322 0 0.0322 ... 40 1 6.07 5.6147 6.07 5.6147 0.1 0.2894 0.1 0.2894 0 0.0322 0 0.0322 6.67 5.6147 6.67 5.6147 0.0224 0.0010 0.0602 0.0010 0.0008 0.0451 0.0602 0.0451 6.0640 1 0.2397 0.1636 R 0.2397 1 0.0661 0.1636 0.0661 1 2005 Christopher R. Bilder 1.29 One way to find these items in SAS is to use PROC PRINCOMP (PROC CORR can also find it). This procedure will be used in Chapter 5 to perform principal components analysis. Below is part of the SAS program and output used to find the above items. *Read in Excel file containing the cereal data'; * Note: The variable names are ID, shelf, cereal, size_g, sugar_g,; * fat_g, and sodium_mg; proc import out=set1 datafile= "c:\Chris\OSU\Stat5063\Chapter 1\cereal_data.xls" dbms=excel2000 replace; getnames=yes; run; *adjust for the serving size; data set2; set set1; sugar = sugar_g/size_g; fat = fat_g/size_g; sodium = sodium_mg/size_g; run; title2 'Find mean vector and covariance matrix'; proc princomp data=set2 covariance; var sugar fat sodium; run; title2 'Find mean vector and correlation matrix'; proc princomp data=set2; var sugar fat sodium; run; Chris Bilder, STAT 5063 5 Find mean vector and covariance matrix The PRINCOMP Procedure Observations 40 Variables 3 Simple Statistics sugar fat sodium Mean 0.2894155586 0.0321827691 5.614703818 StD 0.1495598652 0.0276878932 2.462527107 Covariance Matrix 2005 Christopher R. Bilder 1.30 sugar fat sodium sugar 0.022368153 0.000992690 -0.060242015 fat 0.000992690 0.000766619 -0.004509788 sodium -0.060242015 -0.004509788 6.064039752 Total Variance 6.0871745247 Eigenvalues of the Covariance Matrix Eigenvalue Difference Proportion Cumulative 1 6.06464374 6.04283353 0.9963 0.9963 2 0.02181021 0.02108963 0.0036 0.9999 3 0.00072058 0.0001 1.0000 Eigenvectors Prin1 Prin2 Prin3 sugar -.009970 0.998938 -.044987 fat -.000745 0.044981 0.998988 sodium 0.999950 0.009993 0.000296 Chris Bilder, STAT 5063 6 Find mean vector and correlation matrix The PRINCOMP Procedure Observations 40 Variables 3 Simple Statistics sugar fat sodium Mean 0.2894155586 0.0321827691 5.614703818 StD 0.1495598652 0.0276878932 2.462527107 Correlation Matrix sugar fat sodium sugar 1.0000 0.2397 -.1636 fat 0.2397 1.0000 -.0661 sodium -.1636 -.0661 1.0000 Eigenvalues of the Correlation Matrix Eigenvalue Difference Proportion Cumulative 1 1.32346996 0.38459531 0.4412 0.4412 2 0.93887466 0.20121928 0.3130 0.7541 3 0.73765538 0.2459 1.0000 Eigenvectors Prin1 Prin2 Prin3 sugar 0.667079 0.090841 0.739428 fat 0.587928 0.545387 -.597406 sodium -.457543 0.833247 0.310408 2005 Christopher R. Bilder 1.31 1.9 Standardized Data and/or Z Scores Sometimes it is easier to work with data which are on the same scale. “Standardizing data” or using “Z scores” can used to convert the data to a unitless scale. Let xrj j ˆ Zrj jj ˆ for r=1,…,N and j=1,…,p. Zrj is the Z score for the jth response variable on the rth experimental unit. The Zrj values will have a mean of 0 and variance of 1 for each response variable. The matrix z11 z12 z1p z z22 z2p Z 21 zN1 zN2 zNp is called the matrix of Z scores. Example: Cereal data (cereal_data.sas) Many of the SAS procedures that will be used in the course will standardize the data for us. The SAS procedure, PROC STANDARD, also standardizes data. Below is an example. proc standard data=set2 out=stand mean=0 std=1; var sugar fat sodium; run; 2005 Christopher R. Bilder 1.32 title2 'The standardized data'; proc print data=stand; var ID sugar fat sodium; run; Obs ID sugar fat sodium 1 1 0.45284 -1.16234 0.18547 2 2 -1.45752 -1.16234 1.63578 3 3 -1.45752 -1.16234 2.07087 4 4 -1.51722 1.09496 1.27320 0.3571-0.2894 Note that Z11= 0.4528 0.0224 Final notes: 1. On-line SAS help - http://support.sas.com/91doc/docMainpage.jsp 2. SAS help within SAS 2005 Christopher R. Bilder