# Chapter 1 – An overview of multivariate methods by yaofenji

```

1. Applied multivariate methods
1.1 An overview of multivariate methods

Experimental unit – The object on which information is
collected. This information is organized into
observed variable values.

Univariate data – Information measured on one variable.

Multivariate data – Information measured on multiple
variables.

1.2 Examples

Example: Cereal data (cereal_data.xls – data stored in Excel
file)

Data available on the STAT 873 website.

In January 1999, I collected the following data on dry
cereal from a Dillon’s supermarket in Manhattan, KS.
These data were collected as a stratified random sample
where the shelf of the cereal was the stratum: I randomly
selected 10 cereals within each shelf. Shelf #1 represents
the bottom shelf and shelf #4 represents the top shelf.

© 2005 Christopher R. Bilder

                                                            Serving   Sugar    Fat  Sodium
ID  Shelf  Cereal                                          size (g)     (g)    (g)    (mg)
 1    1    Kellogg’s Razzle Dazzle Rice Crispies                 28      10      0     170
 2    1    Post Toasties Corn Flakes                             28       2      0     270
 3    1    Kellogg’s Corn Flakes                                 28       2      0     300
 4    1    Food Club Toasted Oats                                32       2      2     280
 5    1    Frosted Cheerios                                      30      13      1     210
 6    1    Food Club Frosted Flakes                              31      11      0     180
 7    1    Capn Crunch                                           27      12    1.5     200
 8    1    Capn Crunch's Peanut Butter Crunch                    27       9    2.5     200
 9    1    Post Honeycomb                                        29      11    0.5     220
10    1    Food Club Crispy Rice                                 33       2      0     330
11    2    Rice Crispies Treats                                  30       9    1.5     190
12    2    Kellogg's Smacks                                      27      15    0.5      50
13    2    Kellogg's Froot Loops                                 32      15      1     150
14    2    Capn Crunch's Peanut Butter Crunch                    27       9    2.5     200
15    2    Cinnamon Grahams                                      30      11      1     230
16    2    Marshmallow Blasted Froot Loops                       30      16    0.5     105
17    2    Koala Coco Krunch                                     30      13      1     170
18    2    Food Club Toasted Oats                                33      10    1.5     150
19    2    Cocoa Pebbles                                         29      13      1     160
20    2    Oreo O's                                              27      11    2.5     150
21    3    Food Club Raisin Bran                                 54      17      1     280
22    3    Post Honey Bunches of Oats                            30       6    1.5     190
23    3    Rice Chex                                             31       2      0     290
24    3    Kellogg's Corn Pops                                   31      14      0     120
25    3    Raisin, Date, Pecan                                   54      14      5     160
26    3    Post Shredded Wheat Spoon Size                        49       0    0.5       0
27    3    Basic 4                                               55      14      3     320
28    3    French Toast Crunch                                   30      12      1     180
29    3    Post Raisin Bran                                      59      20      1     300
30    3    Food Club Frosted Shredded Wheat                      50       1      1       0
31    4    Total Raisin Bran                                     55      19      1     240
32    4    Food Club Wheat Crunch                                60       6      0     300
33    4    Oatmeal Crisp Raisin                                  55      19      2     220
34    4    Food Club Bran Flakes                                 31       5    0.5     220
35    4    Cookie Crisp                                          30      12      1     180
36    4    Kellogg's All Bran Original                           31       6      1      65
37    4    Food Club Low Fat Granola                             55      14      3     100
38    4    Oatmeal Crisp Apple Cinnamon                          55      19      2     260
39    4    Post Fruit and Fibre - Dates, Raisins, Walnuts        55      17      3     280
40    4    Total Corn Flakes                                     30       3      0     200

Notes:
1)Each row represents a different experimental unit
2)Each column represents a different measured variable
3)The experimental units are defined by the cereal type

For most of the methods in this course, the experimental
units are assumed to be independent. Are the
experimental units in the cereal example independent?

To be independent, the observed values for one
cereal cannot have an effect on those for another. For
example, observed values for Marshmallow Blasted
Froot Loops and Capn Crunch's Peanut Butter
Crunch need to be independent of each other.

We will assume the experimental units are
independent for this example.

Example: Placekicking data (placekick.sas7bdat)

For the 1995 National Football League season, I
collected information on all placekicks attempted. The
SAS data set file, placekick.sas7bdat, contains 1,425 of
the observations.

Two types of placekicks:
• Field goal – 3 points if successful
• Point after touchdown (PAT) – 1 point if successful

See video at http://www.chrisbilder.com/gf_small.mov.

Below is an example of some of the data collected.
id  date   week  loc  time  type  field  temp  humid  dir  speed  cloud  precip
1   90395  1     NE   100   O     G      73    64     S    11     SN
2   90395  1     NE   100   O     G      73    64     S    11     SN
3   90395  1     NE   100   O     G      73    64     S    11     SN

id  kicker  team  opp  good  how  pat  dist  qrtr  sc_team  sc_opp  win  fin_team  fin_opp  alt
1   BAHR    NE    CL   Y          N    21    1     0        0       Y    17        14       138
2   BAHR    NE    CL   Y          N    21    1     3        7       Y    17        14       138
3   STOVER  CL    NE   Y          Y    20    2     13       6       N    14        17       138

id  windy  diff  change  elap30
1   0      0     1       24.7167
2   0      -4    0       15.85
3   0      7     0       0.45



Specific information about what each variable measures
will be given later in the course.

There is a lot of information given! How can all of this
information be summarized and is any of it meaningful?
What types of inferences can be made from this data
set?

Some of the topics for this semester:
• Principal components analysis and factor analysis
examine ways to reduce the dimension (number of
variables) of the data set. New variables are formed
which are “hopefully” interpretable.
• Discriminant analysis is used to classify experimental
units into groups which are already known before the
analysis.
• Canonical discriminant analysis reduces the dimension
of the data set and applies discriminant analysis
methods.
• Logistic regression can be used for the same purpose as
discriminant analysis, but for only two groups.
• Cluster analysis can help classify observations into
groups which are unknown before the analysis.
• Multivariate analysis of variance (MANOVA) is an
extension of analysis of variance (ANOVA). In ANOVA,
hypothesis tests for equality of means are performed. In
MANOVA, hypothesis tests for equality of mean vectors
are performed.

A lot of these methods are what people use for “data
mining”.

See Table 1.1 on p. 8 of Johnson. This table will be very
helpful to refer back to at the end of the course!

1.3 Types of variables

Continuous variables – They have numerical values and
can occur anywhere within some interval. There is no
fixed number of values the variable can take on.

Discrete variables – They can be numerical or
nonnumerical. There are a fixed number of values the
variable can take on.

Example: Cereal data
Shelf and cereal are discrete variables, and serving size,
sugar, fat, and sodium are continuous variables. Note
that serving size, sugar, fat, and sodium are probably
reported on the cereal box as discrete variables.

Example: Placekicking data
Distance is an example of a continuous variable
(assumed), and surface is an example of a discrete
variable.

Most multivariate methods were developed based on
continuous variables using a multivariate normal distribution
assumption (more on this distribution in Section 1.5).

1.4 Data matrices and vectors

See Appendix A of Johnson (1998), any matrix algebra
book, or regression books (such as Chapter 5 of Neter,
Kutner, Nachtsheim, and Wasserman (1996)) for a
review of matrices.

• p = number of numerical variables of interest
• N = number of experimental units on which the
variables are being measured
• $x_{rj}$ = value of the jth response variable on the rth
experimental unit for r=1,…,N and j=1,…,p
• $\mathbf{X}$ = data matrix: the $x_{rj}$ are arranged in matrix form so that
$x_{rj}$ is the rth row and jth column element of $\mathbf{X}$. Rows
represent experimental units and columns
represent variables.
• $\mathbf{X}$ has dimension N×p.
• $\mathbf{x}_r = \begin{bmatrix} x_{r1} \\ x_{r2} \\ \vdots \\ x_{rp} \end{bmatrix}$ = column vector of data for the rth
experimental unit.
• $\mathbf{x}_r' = [x_{r1}, x_{r2}, \ldots, x_{rp}]$ = “transpose” of $\mathbf{x}_r$
• $$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_N' \end{bmatrix}$$
• The determinant of a matrix is denoted by |A| for some
matrix A.
  o The determinant for a 2×2 matrix is defined as
    $$\begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{12}a_{21}.$$
  o The determinant for a 3×3 matrix can be defined as
    $$\begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix} = a_{11}\begin{vmatrix} a_{22} & a_{23} \\ a_{32} & a_{33} \end{vmatrix} - a_{12}\begin{vmatrix} a_{21} & a_{23} \\ a_{31} & a_{33} \end{vmatrix} + a_{13}\begin{vmatrix} a_{21} & a_{22} \\ a_{31} & a_{32} \end{vmatrix}.$$
            
on determinants.

Example: Cereal data

Consider only the shelf, sugar, fat, and sodium variables.
The sugar, fat, and sodium variables are adjusted for
serving size by dividing their observed values by the
serving size. For example, for Kellogg’s Razzle Dazzle
Rice Crispies the “sugar” value is

(sugar grams per serving)/(# of grams per serving size)
= 10/28 = 0.3571

The data look as follows:

ID  Cereal                                 Shelf  Sugar   Fat  Sodium
 1  Kellogg’s Razzle Dazzle Rice Crispies    1    0.3571   0   6.0714
 2  Post Toasties Corn Flakes                1    0.0714   0   9.6429
 ⋮
40  Total Corn Flakes                        4    0.1      0   6.6667

The ID and Cereal variables are in the table to help
identify the experimental units.

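• p = 4 since the responses for the cereals are
measured on 4 different variables
• N = 40
• $x_{11}$ = 1, $x_{12}$ = 0.3571, $x_{13}$ = 0, $x_{14}$ = 6.0714, $x_{21}$ = 1, $x_{22}$
= 0.0714, …
• $$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & x_{13} & x_{14} \\ x_{21} & x_{22} & x_{23} & x_{24} \\ \vdots & \vdots & \vdots & \vdots \\ x_{40,1} & x_{40,2} & x_{40,3} & x_{40,4} \end{bmatrix} = \begin{bmatrix} 1 & 0.3571 & 0 & 6.0714 \\ 1 & 0.0714 & 0 & 9.6429 \\ \vdots & \vdots & \vdots & \vdots \\ 4 & 0.1 & 0 & 6.6667 \end{bmatrix}$$
• $$\mathbf{x}_1 = \begin{bmatrix} x_{11} \\ x_{12} \\ x_{13} \\ x_{14} \end{bmatrix} = \begin{bmatrix} 1 \\ 0.3571 \\ 0 \\ 6.0714 \end{bmatrix}$$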

Book notation notes:
1. The i, j, k,… subscripts are used for variables of
interest.
2. The r, s, t,… subscripts are used for experimental
units.

Examples:
Suppose $\rho$ denotes the correlation. Then $\rho_{ij}$ denotes
the correlation between response variables i and j.

Suppose d denotes distance. Then $d_{rs}$ denotes the
distance between experimental units r and s.

1.11
1.5 The multivariate normal distribution

Let x be a random variable with a univariate normal
distribution with mean $E(x) = \mu$ and variance $\mathrm{Var}(x) =
E[(x-\mu)^2] = \sigma^2$. This is represented symbolically as
$x \sim N(\mu, \sigma^2)$. Note that x could be capitalized here if one
wanted to represent it more formally. The probability
distribution function of x is

$$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\} \quad \text{for } -\infty < x < \infty.$$

Example: Suppose $x \sim N(50, 3^2)$. Below is the graph of
$f(x \mid \mu, \sigma)$.

[Figure: Normal probability distribution for $\mu$ = 50 and $\sigma$ = 3; horizontal axis: x (35 to 65); vertical axis: $f(x; \mu, \sigma)$ (0 to 0.15)]


Note that

 x    $f(x; \mu, \sigma)$
40    0.000514
50    0.132981
60    0.000514

The normal distribution for a random variable can be
generalized to a multivariate normal distribution for a vector
of random variables (random vector). Much of the theory for
multivariate analysis methods relies on the assumption that
a random vector,

$$\mathbf{x} = \begin{bmatrix} x_1 \\ \vdots \\ x_p \end{bmatrix},$$

has a multivariate normal distribution.

The relationships between the $x_i$’s can be measured by the
covariances and/or the correlations.

Let $E(x_i) = \mu_i$.

Covariance of $x_i$ and $x_j$: $\sigma_{ij} = \mathrm{Cov}(x_i, x_j) = E[(x_i-\mu_i)(x_j-\mu_j)]$
Covariance of $x_i$ and $x_i$ (variance of $x_i$): $E[(x_i-\mu_i)^2] = \sigma_{ii}$

Note that the variance is denoted by $\sigma_{ii}$, not by $\sigma_i^2$.

Correlation coefficient of $x_i$ and $x_j$:

$$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}} = \frac{\mathrm{Cov}(x_i, x_j)}{\sqrt{\mathrm{Var}(x_i)\,\mathrm{Var}(x_j)}}$$

Remember that $\rho_{ij} = \rho_{ji}$ and $-1 \le \rho_{ij} \le 1$.

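The means, covariances, and correlations can be put into a
mean vector, a covariance matrix, and a correlation matrix:

$$\boldsymbol{\mu} = E(\mathbf{x}) = \begin{bmatrix} E(x_1) \\ \vdots \\ E(x_p) \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_p \end{bmatrix},$$

$$\boldsymbol{\Sigma} = \mathrm{Cov}(\mathbf{x}) = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})'] = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{bmatrix}, \text{ and}$$

$$\mathbf{P} = \mathrm{Corr}(\mathbf{x}) = \begin{bmatrix} 1 & \rho_{12} & \cdots & \rho_{1p} \\ \rho_{21} & 1 & \cdots & \rho_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{p1} & \rho_{p2} & \cdots & 1 \end{bmatrix}.$$

Remember that $\boldsymbol{\Sigma}$ and $\mathbf{P}$ are symmetric, and that knowing
$\boldsymbol{\Sigma}$ is enough to obtain $\mathbf{P}$. Going from $\mathbf{P}$ back to
$\boldsymbol{\Sigma}$ also requires the variances $\sigma_{ii}$.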

NOTE: $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are bolded although it may not look like it
when it is printed. Word does not properly bold the
symbol font.

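Multivariate normal distribution
Similar to a univariate normal distribution where only $\mu$
and $\sigma^2$ are needed to know the distribution, only $\boldsymbol{\mu}$
and $\boldsymbol{\Sigma}$ (or $\mathbf{P}$ together with the variances) are needed
for the multivariate normal distribution. The notation,
$\mathbf{x} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, means that $\mathbf{x}$
has a p-dimensional multivariate normal distribution
with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$.

The multivariate normal distribution function is:

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\}$$

for $-\infty < x_i < \infty$, i=1,…,p, and $|\boldsymbol{\Sigma}| > 0$.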

Example: Bivariate normal distribution (p=2) –
(graph_mult_normal.sas).

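As on p. 1.11 for the univariate case, the bivariate normal
distribution can be graphed. Suppose $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ has a
multivariate normal distribution with $\boldsymbol{\mu} = \begin{bmatrix} 15 \\ 20 \end{bmatrix}$ and
$\boldsymbol{\Sigma} = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1.25 \end{bmatrix}$. Thus, we could write $\mathbf{x} \sim N_2(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ or
$\mathbf{x} \sim N_2\left( \begin{bmatrix} 15 \\ 20 \end{bmatrix}, \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1.25 \end{bmatrix} \right)$. Also, note that
$\mathbf{P} = \begin{bmatrix} 1 & 0.45 \\ 0.45 & 1 \end{bmatrix}$ (for example, $\rho_{12} = 0.5 / \sqrt{1 \times 1.25} = 0.45$).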

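The multivariate normal distribution function is:

$$f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{p/2} \begin{vmatrix} 1 & 0.5 \\ 0.5 & 1.25 \end{vmatrix}^{1/2}} \exp\left\{ -\frac{1}{2} \left( \mathbf{x} - \begin{bmatrix} 15 \\ 20 \end{bmatrix} \right)' \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1.25 \end{bmatrix}^{-1} \left( \mathbf{x} - \begin{bmatrix} 15 \\ 20 \end{bmatrix} \right) \right\}$$

$$= \frac{1}{(2\pi)^{p/2} \, 1^{1/2}} \exp\left\{ -\frac{1}{2} \left( \mathbf{x} - \begin{bmatrix} 15 \\ 20 \end{bmatrix} \right)' \begin{bmatrix} 1.25 & -0.5 \\ -0.5 & 1 \end{bmatrix} \left( \mathbf{x} - \begin{bmatrix} 15 \\ 20 \end{bmatrix} \right) \right\}$$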

Below are 3D surface and contour plots of the normal
distribution.

[Figures: 3D surface plot and contour plot of the bivariate normal distribution; images not reproduced]

Examine the following:
1. The surface is centered at $\boldsymbol{\mu}$.
2. The surface is wider in the $x_2$ direction than in the $x_1$ direction.
This is because $x_2$ has more variability ($\sigma_{22} = 1.25 > \sigma_{11} = 1$).
3. Notice the shape of the surface – The contour plot is
an ellipse and the surface plot looks like a 3D
normal curve.
4. Volume underneath the surface is 1.

Question: Suppose a random sample is taken from a
population which can be characterized by this
multivariate normal distribution. A scatter plot is
constructed of the observed x1 and x2 values. Where
would most of the (x1, x2) points fall?

15            1 0.5 
Suppose =   and                 . The only change
 20          0.5 3 
from the previous example is that 22 has increased.
 1      0.289 
Note that P                  . Below are 3D surface
 0.289    1  
and contour plots of the normal distribution.

 2005 Christopher R. Bilder
1.18

 2005 Christopher R. Bilder
1.19

Compare the above plots with the previous example.

Questions to think about for the bivariate normal
distribution:
1. What happens if the means change?
2. What happens if $\sigma_{12}$ = 0?
3. What happens if $\sigma_{11} = \sigma_{22}$ and $\sigma_{12}$ = 0?
4. What happens if $\rho_{12}$ = 1 or -1?

If you do not know the answers, use the SAS program
to find out what happens!

Below is part of the SAS program used to construct the
plots above. SAS does not have a mechanism where
the multivariate normal distribution function can be
inputted and a resulting plot automatically constructed.
Instead, we can tell SAS to find $f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$ for many
different values of $\mathbf{x}$ and then plot these
($x_1$, $x_2$, $f(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$) triples.

dm 'log;clear;output;clear;';
options ps=50 ls=70 pageno=1;
goptions reset=global border ftext=swiss gunit=cm htext=0.4
htitle=0.5;
goptions display noprompt;

**********************************************************;
**                                                      **;
** AUTHOR: Chris Bilder                                 **;
** COURSE: STAT 873                                     **;
** DATE: 1-8-01                                         **;
** UPDATE: 2-6-01 –                                     **;
**         corrected error in finding f(x), 8-22-03     **;
** PURPOSE: Graph multivariate normal distribution      **;
**                                                      **;
** NOTES:                                               **;
**                                                      **;
**********************************************************;

title1 'Chris Bilder, STAT 873';

proc iml;

pi = 3.141592654;
mu={15,20};
sigma={1 0.5,
0.5 1.25};
det = det(sigma);
inv_sigma = inv(sigma);
p=2;

*initialize save matrix, note there are 61*61 x1 and x2
combinations in the do loop below;
save = repeat(0, 3721, 3);

*initialize counter for do loop;
i=1;

*evaluate f(x) for many x1 and x2 values;
do x1 = 10 to 25 by 0.25;
do x2 = 10 to 25 by 0.25;

*Creates 2x1 vector;
x = x1 // x2;

*Evaluate f(x);
f_x = (1/(2*pi)**(p/2)) * det**(-1/2) *
exp(-1/2 * t(x-mu)*inv_sigma*(x-mu));

*Create 1x3 vector for the results and then save it
into the ith row of the save matrix;
save[i,] = x1 || x2 || f_x;

i=i+1;
end;
end;

col={ "x1" "x2" "f_x"};
create set1 from save [colname=col];
append from save;

quit;

*Create 3D plot of the surface;
proc g3d data=set1;
title2 '3D surface ';
plot x1*x2=f_x / grid zmin=0 zmax=0.3 zticknum=5
rotate=50;
*use different rotate option on the above plot statement;
* if you need to;
run;

*Create contour plot;
proc gcontour data=set1;
plot x1*x2=f_x / grid autolabel=(reveal)
haxis=axis1 vaxis=axis2 legend=legend1
levels=0.01 0.03 0.05 0.07 0.09 0.11 0.13 0.15;
*The reveal statement puts the level labels on the
plot. If there are too many to fit, SAS will not plot
them;
title2 'Contour plot';
axis1 label = ('x2')
length=10
order = (10 to 30 by 5);
axis2 label=(a=90 'x1')
length=8.71
order = (10 to 30 by 5);
legend1 label=('f(x)')
down=2;
run;

For homework, run this program!

In order to get the plots from SAS into Word, I simply
selected EDIT > COPY from SAS when the GRAPH
window is the current window. Note that the ability to do
this is new to SAS 9! Before, one had to export the plot
out of SAS as a .gif or .jpg, and then insert the image
into Word.

If you do not have experience in SAS, I recommend
taking a look at my lecture notes for OSU STAT 4091 at
http://www.chrisbilder.com/stat4091 (see the schedule
web page). This course was a 1 credit hour course
introducing students to SAS 8.1.

1.6 Statistical computing

1.7 Multivariate outliers

1.8 Multivariate summary statistics

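Let $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$ be a random sample from a multivariate
distribution (not necessarily normal) with mean vector $\boldsymbol{\mu}$,
covariance matrix $\boldsymbol{\Sigma}$, and correlation matrix $\mathbf{P}$. This can be
represented symbolically as $\{\mathbf{x}_i\}_{i=1}^{N} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma})$.

Note on notation: If it is understood that a random
sample of size N is taken, then we could shorten this to
$\mathbf{x} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma})$.

Estimates of $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, and $\mathbf{P}$ based on the random sample are:

$$\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{r=1}^{N} \mathbf{x}_r = \frac{1}{N} (\mathbf{x}_1 + \mathbf{x}_2 + \cdots + \mathbf{x}_N)$$

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{N-1} \sum_{r=1}^{N} (\mathbf{x}_r - \hat{\boldsymbol{\mu}})(\mathbf{x}_r - \hat{\boldsymbol{\mu}})'$$

Notes:

1. $$\hat{\boldsymbol{\mu}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{bmatrix} = \begin{bmatrix} \frac{1}{N} \sum_{r=1}^{N} x_{r1} \\ \frac{1}{N} \sum_{r=1}^{N} x_{r2} \\ \vdots \\ \frac{1}{N} \sum_{r=1}^{N} x_{rp} \end{bmatrix} = \begin{bmatrix} \hat{\mu}_1 \\ \hat{\mu}_2 \\ \vdots \\ \hat{\mu}_p \end{bmatrix}$$

2. $$\hat{\boldsymbol{\Sigma}} = \frac{1}{N-1} \sum_{r=1}^{N} (\mathbf{x}_r - \hat{\boldsymbol{\mu}})(\mathbf{x}_r - \hat{\boldsymbol{\mu}})' = \begin{bmatrix} \hat{\sigma}_{11} & \hat{\sigma}_{12} & \cdots & \hat{\sigma}_{1p} \\ \hat{\sigma}_{21} & \hat{\sigma}_{22} & \cdots & \hat{\sigma}_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{\sigma}_{p1} & \hat{\sigma}_{p2} & \cdots & \hat{\sigma}_{pp} \end{bmatrix}$$

where $\hat{\sigma}_{ii} = \widehat{\mathrm{Var}}(x_i) = \frac{1}{N-1} \sum_{r=1}^{N} (x_{ri} - \bar{x}_i)^2$ and
$\hat{\sigma}_{ij} = \widehat{\mathrm{Cov}}(x_i, x_j) = \frac{1}{N-1} \sum_{r=1}^{N} (x_{ri} - \bar{x}_i)(x_{rj} - \bar{x}_j)$

3. $r_{ij}$ and $\mathbf{R}$ are used to denote the estimates of $\rho_{ij}$ and
$\mathbf{P}$, respectively.

4. $$r_{ij} = \frac{\hat{\sigma}_{ij}}{\sqrt{\hat{\sigma}_{ii} \hat{\sigma}_{jj}}} = \frac{\frac{1}{N-1} \sum_{r=1}^{N} (x_{ri} - \bar{x}_i)(x_{rj} - \bar{x}_j)}{\sqrt{\left[ \frac{1}{N-1} \sum_{r=1}^{N} (x_{ri} - \bar{x}_i)^2 \right] \left[ \frac{1}{N-1} \sum_{r=1}^{N} (x_{rj} - \bar{x}_j)^2 \right]}} = \frac{\sum_{r=1}^{N} (x_{ri} - \bar{x}_i)(x_{rj} - \bar{x}_j)}{\sqrt{\left[ \sum_{r=1}^{N} (x_{ri} - \bar{x}_i)^2 \right] \left[ \sum_{r=1}^{N} (x_{rj} - \bar{x}_j)^2 \right]}}$$

and

$$\mathbf{R} = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{bmatrix}$$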


Example: Cereal data (cereal_data.sas, cereal_data.xls)

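Let $x_1$ denote sugar, $x_2$ denote fat, and $x_3$ denote sodium,
each adjusted for the number of grams per serving.
Suppose

$$\{\mathbf{x}_i\}_{i=1}^{40} = \left\{ \begin{bmatrix} x_{i1} \\ x_{i2} \\ x_{i3} \end{bmatrix} \right\}_{i=1}^{40} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

meaning we have a random sample of size 40 from
$(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Thus, we have $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{40}$. More simply,
someone may say $\mathbf{x} \sim (\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with N = 40.

Using the observed values of $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{40}$ produces
the estimates of $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, and $\mathbf{P}$:

$$\hat{\boldsymbol{\mu}} = \begin{bmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \bar{x}_3 \end{bmatrix} = \begin{bmatrix} \frac{1}{N} \sum_{r=1}^{N} x_{r1} \\ \frac{1}{N} \sum_{r=1}^{N} x_{r2} \\ \frac{1}{N} \sum_{r=1}^{N} x_{r3} \end{bmatrix} = \begin{bmatrix} \frac{1}{40} (0.3571 + \cdots + 0.1) \\ \frac{1}{40} (0 + \cdots + 0) \\ \frac{1}{40} (6.07 + \cdots + 6.67) \end{bmatrix} = \begin{bmatrix} 0.2894 \\ 0.0322 \\ 5.6147 \end{bmatrix}$$

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{N-1} \sum_{r=1}^{N} (\mathbf{x}_r - \hat{\boldsymbol{\mu}})(\mathbf{x}_r - \hat{\boldsymbol{\mu}})' = \frac{1}{40-1} \sum_{r=1}^{40} \left( \mathbf{x}_r - \begin{bmatrix} 0.2894 \\ 0.0322 \\ 5.6147 \end{bmatrix} \right) \left( \mathbf{x}_r - \begin{bmatrix} 0.2894 \\ 0.0322 \\ 5.6147 \end{bmatrix} \right)'$$

$$= \frac{1}{40-1} \left\{ \left( \begin{bmatrix} 0.3571 \\ 0 \\ 6.07 \end{bmatrix} - \begin{bmatrix} 0.2894 \\ 0.0322 \\ 5.6147 \end{bmatrix} \right) \left( \begin{bmatrix} 0.3571 \\ 0 \\ 6.07 \end{bmatrix} - \begin{bmatrix} 0.2894 \\ 0.0322 \\ 5.6147 \end{bmatrix} \right)' + \cdots + \left( \begin{bmatrix} 0.1 \\ 0 \\ 6.67 \end{bmatrix} - \begin{bmatrix} 0.2894 \\ 0.0322 \\ 5.6147 \end{bmatrix} \right) \left( \begin{bmatrix} 0.1 \\ 0 \\ 6.67 \end{bmatrix} - \begin{bmatrix} 0.2894 \\ 0.0322 \\ 5.6147 \end{bmatrix} \right)' \right\}$$

$$= \begin{bmatrix} 0.0224 & 0.0010 & -0.0602 \\ 0.0010 & 0.0008 & -0.0045 \\ -0.0602 & -0.0045 & 6.0640 \end{bmatrix}$$

$$\mathbf{R} = \begin{bmatrix} 1 & 0.2397 & -0.1636 \\ 0.2397 & 1 & -0.0661 \\ -0.1636 & -0.0661 & 1 \end{bmatrix}$$
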

One way to find these items in SAS is to use PROC
PRINCOMP (PROC CORR can also find them). This
procedure will be used in Chapter 5 to perform principal
components analysis. Below is part of the SAS program
and output used to find the above items.
*Read in Excel file containing the cereal data;
* Note: The variable names are ID, shelf, cereal, size_g, sugar_g,;
* fat_g, and sodium_mg;
proc import out=set1
datafile= "c:\Chris\OSU\Stat5063\Chapter 1\cereal_data.xls"
dbms=excel2000 replace;
getnames=yes;
run;

data set2;
set set1;
sugar = sugar_g/size_g;
fat = fat_g/size_g;
sodium = sodium_mg/size_g;
run;

title2 'Find mean vector and covariance matrix';
proc princomp data=set2 covariance;
var sugar fat sodium;
run;

title2 'Find mean vector and correlation matrix';
proc princomp data=set2;
var sugar fat sodium;
run;

Chris Bilder, STAT 5063                                          5
Find mean vector and covariance matrix

The PRINCOMP Procedure

Observations    40
Variables        3

Simple Statistics

               sugar             fat          sodium
Mean    0.2894155586    0.0321827691     5.614703818
StD     0.1495598652    0.0276878932     2.462527107

Covariance Matrix

                sugar             fat          sodium
sugar     0.022368153     0.000992690    -0.060242015
fat       0.000992690     0.000766619    -0.004509788
sodium   -0.060242015    -0.004509788     6.064039752

Total Variance    6.0871745247

Eigenvalues of the Covariance Matrix

     Eigenvalue    Difference    Proportion    Cumulative
1    6.06464374    6.04283353        0.9963        0.9963
2    0.02181021    0.02108963        0.0036        0.9999
3    0.00072058                      0.0001        1.0000

Eigenvectors

             Prin1       Prin2       Prin3
sugar     -.009970    0.998938    -.044987
fat       -.000745    0.044981    0.998988
sodium    0.999950    0.009993    0.000296

Chris Bilder, STAT 5063                                          6
Find mean vector and correlation matrix

The PRINCOMP Procedure

Observations    40
Variables        3

Simple Statistics

               sugar             fat          sodium
Mean    0.2894155586    0.0321827691     5.614703818
StD     0.1495598652    0.0276878932     2.462527107

Correlation Matrix

           sugar       fat    sodium
sugar     1.0000    0.2397    -.1636
fat       0.2397    1.0000    -.0661
sodium    -.1636    -.0661    1.0000

Eigenvalues of the Correlation Matrix

     Eigenvalue    Difference    Proportion    Cumulative
1    1.32346996    0.38459531        0.4412        0.4412
2    0.93887466    0.20121928        0.3130        0.7541
3    0.73765538                      0.2459        1.0000

Eigenvectors

             Prin1       Prin2       Prin3
sugar     0.667079    0.090841    0.739428
fat       0.587928    0.545387    -.597406
sodium    -.457543    0.833247    0.310408

1.9 Standardized Data and/or Z Scores

Sometimes it is easier to work with data which are on the
same scale. “Standardizing data” or using “Z scores”
can be used to convert the data to a unitless scale. Let

$$z_{rj} = \frac{x_{rj} - \hat{\mu}_j}{\sqrt{\hat{\sigma}_{jj}}}$$

for r=1,…,N and j=1,…,p. $z_{rj}$ is the Z score for the jth
response variable on the rth experimental unit. The $z_{rj}$
values will have a mean of 0 and variance of 1 for each
response variable.

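The matrix

$$\mathbf{Z} = \begin{bmatrix} z_{11} & z_{12} & \cdots & z_{1p} \\ z_{21} & z_{22} & \cdots & z_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ z_{N1} & z_{N2} & \cdots & z_{Np} \end{bmatrix}$$

is called the matrix of Z scores.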

Example: Cereal data (cereal_data.sas)
Many of the SAS procedures that will be used in the
course will standardize the data for us. The SAS
procedure, PROC STANDARD, also standardizes data.
Below is an example.
proc standard data=set2 out=stand mean=0 std=1;
var sugar fat sodium;
run;

title2 'The standardized data';
proc print data=stand;
var ID sugar fat sodium;
run;
Obs   ID       sugar            fat            sodium
1    1      0.45284        -1.16234          0.18547
2    2     -1.45752        -1.16234          1.63578
3    3     -1.45752        -1.16234          2.07087
4    4     -1.51722         1.09496          1.27320

Note that $z_{11} = \frac{0.3571 - 0.2894}{\sqrt{0.0224}} = 0.4528$ (computed from the unrounded values).

Final notes:
1. On-line SAS help -
http://support.sas.com/91doc/docMainpage.jsp
2. SAS help within SAS


```
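
The PROC IML program in Section 1.5 evaluates the bivariate normal distribution function on a grid of (x1, x2) values. For readers without SAS at hand, the same arithmetic can be sketched in plain Python. This is only an illustrative stand-in, not part of the course materials: the notes use SAS throughout, and the helper names `normal_pdf`, `det2`, `inv2`, and `bivariate_normal_pdf` are hypothetical. The sketch uses the 2×2 determinant formula from Section 1.4 and the mean vector and covariance matrix from the example.

```python
import math


def normal_pdf(x, mu, sigma):
    """Univariate normal f(x | mu, sigma); sigma is the standard deviation."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def det2(a):
    """Determinant of a 2x2 matrix: a11*a22 - a12*a21."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def inv2(a):
    """Inverse of a 2x2 matrix (assumes the determinant is nonzero)."""
    d = det2(a)
    return [[a[1][1] / d, -a[0][1] / d],
            [-a[1][0] / d, a[0][0] / d]]


def bivariate_normal_pdf(x, mu, sigma):
    """Bivariate normal f(x | mu, Sigma) for p = 2, following the formula in the notes."""
    p = 2
    diff = [x[0] - mu[0], x[1] - mu[1]]
    s_inv = inv2(sigma)
    # Quadratic form (x - mu)' Sigma^{-1} (x - mu)
    quad = sum(diff[i] * s_inv[i][j] * diff[j] for i in range(p) for j in range(p))
    return math.exp(-0.5 * quad) / ((2 * math.pi) ** (p / 2) * math.sqrt(det2(sigma)))


mu = [15.0, 20.0]
sigma = [[1.0, 0.5], [0.5, 1.25]]

print(det2(sigma))                          # |Sigma|
print(inv2(sigma))                          # Sigma^{-1}
print(bivariate_normal_pdf(mu, mu, sigma))  # height of the surface at its center
print(normal_pdf(40, 50, 3), normal_pdf(50, 50, 3), normal_pdf(60, 50, 3))
```

The last line can be compared with the small table of f(x; μ, σ) values for N(50, 3²), and changing `mu` or `sigma` is one way to explore the "questions to think about" for the bivariate normal distribution.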
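
The summary statistics in Section 1.8 and the Z scores in Section 1.9 can also be reproduced directly from the cereal table. The following is a minimal sketch in plain Python rather than SAS, which is an assumption of this example. The data tuples are typed in from the table in Section 1.2, and the variable names (`CEREALS`, `X`, `mean`, `cov`, `corr`, `Z`) are illustrative only. It applies the estimators with the N − 1 divisor exactly as defined in the notes.

```python
import math

# (serving size g, sugar g, fat g, sodium mg) for the 40 cereals, in ID order,
# copied from the cereal table in Section 1.2.
CEREALS = [
    (28, 10, 0, 170), (28, 2, 0, 270), (28, 2, 0, 300), (32, 2, 2, 280),
    (30, 13, 1, 210), (31, 11, 0, 180), (27, 12, 1.5, 200), (27, 9, 2.5, 200),
    (29, 11, 0.5, 220), (33, 2, 0, 330), (30, 9, 1.5, 190), (27, 15, 0.5, 50),
    (32, 15, 1, 150), (27, 9, 2.5, 200), (30, 11, 1, 230), (30, 16, 0.5, 105),
    (30, 13, 1, 170), (33, 10, 1.5, 150), (29, 13, 1, 160), (27, 11, 2.5, 150),
    (54, 17, 1, 280), (30, 6, 1.5, 190), (31, 2, 0, 290), (31, 14, 0, 120),
    (54, 14, 5, 160), (49, 0, 0.5, 0), (55, 14, 3, 320), (30, 12, 1, 180),
    (59, 20, 1, 300), (50, 1, 1, 0), (55, 19, 1, 240), (60, 6, 0, 300),
    (55, 19, 2, 220), (31, 5, 0.5, 220), (30, 12, 1, 180), (31, 6, 1, 65),
    (55, 14, 3, 100), (55, 19, 2, 260), (55, 17, 3, 280), (30, 3, 0, 200),
]

# Adjust sugar, fat, and sodium for serving size; each row is one observation x_r.
X = [[sugar / size, fat / size, sodium / size] for size, sugar, fat, sodium in CEREALS]
N, p = len(X), len(X[0])

# Estimated mean vector: average of each column.
mean = [sum(row[j] for row in X) / N for j in range(p)]

# Estimated covariance matrix with the N - 1 divisor.
cov = [[sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in X) / (N - 1)
        for j in range(p)] for i in range(p)]

# Estimated correlation matrix: r_ij = s_ij / sqrt(s_ii * s_jj).
corr = [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(p)]
        for i in range(p)]

# Z scores: subtract the column mean and divide by the column standard deviation.
Z = [[(row[j] - mean[j]) / math.sqrt(cov[j][j]) for j in range(p)] for row in X]

print("mean vector:", [round(m, 4) for m in mean])
for label, matrix in (("covariance", cov), ("correlation", corr)):
    print(label)
    for r in matrix:
        print("  ", [round(v, 4) for v in r])
print("Z scores, cereal 1:", [round(z, 5) for z in Z[0]])
```

The printed values can be compared with the PROC PRINCOMP and PROC STANDARD output shown in the notes.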