Chapter 1 – An overview of multivariate methods by yaofenji

VIEWS: 4 PAGES: 32

									                                                             1.1

1. Applied multivariate methods
1.1 An overview of multivariate methods

    Experimental unit – The object in which information is
    collected upon. This information is organized into
    observed variable values.

    Univariate data – Information measured on one variable.

    Multivariate data – Information measured on multiple
    variables.

1.2 Examples

    Read the examples in Johnson.

Example: Cereal data (cereal_data.xls – data stored in Excel
file)

    Data available on the STAT 873 website.

    In January 1999, I collected the following data on dry
    cereal from a Dillon’s supermarket in Manhattan, KS.
    This data was collected as a stratified random sample
    where shelf of the cereal was the stratum. It is called
    “stratified” since I randomly selected 10 cereals within a
    shelf. Shelf #1 represents the bottom shelf and shelf #4
    represents the top shelf.

                        2005 Christopher R. Bilder
                                                                                  1.2

                                               Serving       Sugar   Fat   Sodium
ID Shelf               Cereal                  Size (g)       (g)    (g)    (mg)
           Kellogg’s Razzle Dazzle Rice
1    1                                             28         10     0      170
                      Crispies
2    1      Post Toasties Corn Flakes              28         2       0     270
3    1         Kellogg’s Corn Flakes               28         2       0     300
4    1       Food Club Toasted Oats                32         2       2     280
5    1           Frosted Cheerios                  30         13      1     210
6    1      Food Club Frosted Flakes               31         11      0     180
7    1              Capn Crunch                    27         12     1.5    200
           Capn Crunch's Peanut Butter
8    1                                             27         9      2.5    200
                      Crunch
9    1           Post Honeycomb                    29         11     0.5    220
10   1        Food Club Crispy Rice                33         2       0     330
11   2          Rice Crispies Treats               30         9      1.5    190
12   2           Kellogg's Smacks                  27         15     0.5    50
13   2         Kellogg's Froot Loops               32         15      1     150
           Capn Crunch's Peanut Butter
14   2                                             27         9      2.5    200
                      Crunch
15   2          Cinnamon Grahams                   30         11     1      230
            Marshmallow Blasted Froot
16   2                                             30         16     0.5    105
                       Loops
17   2          Koala Coco Krunch                  30         13      1     170
18   2       Food Club Toasted Oats                33         10     1.5    150
19   2            Cocoa Pebbles                    29         13      1     160
20   2                Oreo O's                     27         11     2.5    150
21   3        Food Club Raisin Bran                54         17      1     280
22   3     Post Honey Bunches of Oats              30         6      1.5    190
23   3               Rice Chex                     31         2       0     290
24   3          Kellogg's Corn Pops                31         14      0     120
             Post Morning Traditions -
25   3                                             54         14     5      160
                Raisin, Date, Pecan
           Post Shredded Wheat Spoon
26   3                                             49         0      0.5     0
                        Size
27   3                Basic 4                      55         14     3      320
28   3         French Toast Crunch                 30         12     1      180
29   3           Post Raisin Bran                  59         20     1      300
           Food Club Frosted Shredded
30   3                                             50         1      1       0
                       Wheat
                               2005 Christopher R. Bilder
                                                                                  1.3
                                               Serving       Sugar   Fat   Sodium
ID Shelf             Cereal                    Size (g)       (g)    (g)    (mg)
31  4          Total Raisin Bran                  55          19      1      240
32  4      Food Club Wheat Crunch                 60           6      0      300
33  4        Oatmeal Crisp Raisin                 55          19      2      220
34  4       Food Club Bran Flakes                 31           5     0.5     220
35  4            Cookie Crisp                     30          12      1      180
36  4      Kellogg's All Bran Original            31           6      1      65
37  4     Food Club Low Fat Granola               55          14      3      100
38  4    Oatmeal Crisp Apple Cinnamon             55          19      2      260
          Post Fruit and Fibre - Dates,
39  4                                              55         17     3      280
               Raisons, Walnuts
40  4          Total Corn Flakes                   30         3      0      200


Notes:
 1)Each row represents a different experimental unit
 2)Each column represents a different measured variable
 3)The experimental units are defined by the cereal type

  For most of the methods in this course, the experimental
  units are assumed to be independent. Are the
  experimental units in the cereal example independent?

          To be independent, the observed values for one
          cereal can not have an effect on another. For
          example, observed values for Marshmallow Blasted
          Froot Loops and Captain Crunch's Peanut Butter
          Crunch need to be independent of each other.

          We will assume the experimental units are
          independent for this example.


                               2005 Christopher R. Bilder
                                                                  1.4
Example: Placekicking data (placekick.sas7bdat)

    For the 1995 National Football League season, I
    collected information on all placekicks attempted. The
    SAS data set file, placekick.sas7bdat, contains 1,425 of
    the observations.

    Two types of placekicks:
     Field goal – 3 points if successful
     Point after touchdown (PAT) – 1 point if successful

    See video at http://www.chrisbilder.com/gf_small.mov.

    Below is an example of some of the data collected.
id date week loc   time type field temp humid dir speed cloud precip
 1 90395 1   NE     100   O    G     73   64   S    11   SN
 2 90395 1   NE     100   O    G     73   64   S    11   SN
 3 90395 1   NE     100   O    G     73   64   S    11   SN
 

                                           sc_ sc_      fin_ fin_
id kicker team opp good how pat dist qrtr team opp win team opp alt
 1 BAHR    NE CL    Y        N 21     1     0   0   Y    17   14 138
 2 BAHR    NE CL    Y        N 21     1     3   7   Y    17   14 138
 3 STOVER CL NE     Y        Y 20     2     13  6   N    14   17 138
 

id windy diff change elap30
 1   0    0      1  24.7167
 2   0    -4     0    15.85
 3   0    7      0    0.45
 

                               2005 Christopher R. Bilder
                                                          1.5


   Specific information about what each variable measures
   will be given later in the course.

   There is a lot of information given! How can all of this
   information be summarized and is any of it meaningful?
   What types of inferences can be made from this data
   set?

Some of the topics for this semester:
  Principle components analysis and factor analysis
   examine ways to reduce the dimension (number of
   variables) of the data set. New variables are formed
   which are “hopefully” interpretable.
  Discriminant analysis is used to classify experimental
   units into groups which are already known before the
   analysis.
  Canonical discriminant analysis reduces the dimension
   of the data set and applies discriminant analysis
   methods.
  Logistic regression can be used for the same purpose as
   discriminant analysis, but for only two groups.
  Cluster analysis can help classify observations into
   groups which are unknown before the analysis.
  Multivariate analysis of variance (MANOVA) is an
   extension of analysis of variance (ANOVA). In ANOVA,
   hypothesis tests for equality of means are performed. In
   MANOVA, hypothesis tests for equality of mean vectors
   are performed.
                       2005 Christopher R. Bilder
                                                           1.6


  A lot of these methods are what people use for “data
  mining”.

  See Table 1.1 on p. 8 of Johnson. This table will be very
  helpful to refer back to at the end of the course!


1.3 Types of variables

    Continuous variables – The have numerical values and
    can occur anywhere within some interval. There is no
    fixed number of values the variable can take on.

    Discrete variables – They can be numerical or
    nonnumerical. There are a fixed number of values the
    variable can take on.

Example: Cereal data
   Shelf and cereal are discrete variables, and serving size,
   sugar, fat, and sodium are continuous variables. Note
   that serving size, sugar, fat, and sodium are probably
   reported on the cereal box as discrete variables.

Example: Placekicking data
   Distance is an example of a continuous variable
   (assumed), and surface is an example of a discrete
   variable.

                        2005 Christopher R. Bilder
                                                            1.7
Most multivariate methods were developed based on
continuous variables using a multivariate normal distribution
assumption (more on this distribution in Section 1.5).


1.4 Data matrices and vectors

    See Appenidx A of Johnson (1998), any matrix algebra
    book, regression books (such as Chapter 5 of Neter,
    Kutner, Nachtscheim, and Wasserman (1996)) for a
    review of matrices.

     p = number of numerical variables of interest
     N = number of experimental units on which the
      variables are being measured
     xrj = value of the jth response variable on the rth
      experimental unit for r=1,…,N and j=1,…,p
     X = data matrix - xrj are arranged in matrix form so that
      xrj is the rth row and jth column element of X – rows
      represent experimental units and the columns
      represent variables.
     X has dimension Np.
              xr1 
             x 
     xr    = column vector of data for the rth
                r2

              
              
              xrp 
      experimental unit.
     xr = [xr1, xr2, …, xrp] = “transpose” of xr
                         2005 Christopher R. Bilder
                                                                                  1.8
          x11 x12                   
                           x1p   x1 
         x       x22      x2p   x2 
                                     
    X    21
                                
                                
                                
          xN1 xN2         xNp   xN 
    The determinant of a matrix is denoted by |A| for some
     matrix A.
       o The determinant for a 2 2 matrix is defined
              a11 a12 
         as             a11a22  a12a21 .
             a21 a22 
       o The determinant for a 3 3 matrix can be defined
         as
    a11 a12 a13 
    a                     a22   a23            a21 a23          a21 a22 
         a22 a23   a11   a            a12                 a13
                                a33            a   a33          a       
      21

    a31 a32 a33           32                   31               31 a32 
                
             See a matrix algebra book for more information
             on determinants.


Example: Cereal data

   Consider only the shelf, sugar, fat, and sodium variables.
   The variables are adjusted for the serving size by taking
   their observed values divided by serving size. For
   example, for Kellog’s Razzle Dazzle Rice Crispies the
   “sugar” value is


                              2005 Christopher R. Bilder
                                                                             1.9
     (sugar grams per serving)/(# of grams per serving size)
     = 10/28 = 0.3571

     The data looks as follows:

ID               Cereal                           Shelf Sugar Fat Sodium
       Kellog’s Razzle Dazzle Rice
1                Crispies                               1   0.3571 0   6.0714
2       Post Toasties Corn Flakes                       1   0.0714 0   9.6429
                                                                    
40          Total Corn Flakes                           4    0.1   0   6.6667
     The ID and Cereal variables are in the table to help
     identify the experimental units.

      p = 4 since the responses for the cereals are
       measured on 4 different variables
      N = 40
      x11 = 1, x12 = 0.3571, x13 = 0, x14 = 6.0714, x21 = 1, x22
       = 0.0714,…
             x11    x12   x13   x14   1 0.3571 0 6.0714 
            x      x22   x23    x24   1 0.0714 0 9.6429 
      X                                                     
                21

                                                               
                                                               
             x40,1 x40,2 x40,3 x40,4   4     0.1    0 6.6667 




                          2005 Christopher R. Bilder
                                                           1.10
            x11   1 
            x   0.3571
     x1              
              12

            x13   0 
                        
            x14  6.0714 

Book notation notes:
   1. The i, j, k,… subscripts are used for variables of
      interest.
   2. The r, s, t,… subscripts are used for experimental
      units.

    Examples:
       Suppose  denotes the correlation. Then ij denotes
       the correlation between response variables i and j.

        Suppose d denotes distance. Then drs denotes the
        distance between experimental units r and s.




                        2005 Christopher R. Bilder
                                                                                   1.11
1.5 The multivariate normal distribution

   Let x be a random variable with an univariate normal
   distribution with mean E(x) =  and variance Var(x) =
   E[(x-)2]=2. This is represented symbolically as
   x~N(, 2). Note that x could be capitalized here if one
   wanted to represented it more formally. The probability
   distribution function of x is

                                            ( x  )2
                                 1
       f(x | , )                     e     2 2
                                                        for - <x< .
                                 2

Example: Suppose x~N(50, 32). Below is the graph for
f(x|,).
                             Normal Probability Distribution for
                                      =50 and =3
                     0.15



                      0.1
        f(x;,)
          f(x;,)




                     0.05



                       0
                            35    40         45          50         55   60   65
                                                         X

                                      2005 Christopher R. Bilder
                                                                 1.12


    Note that
         x f(x; , )
        40 0.000514
        50 0.132981
        60 0.000514

The normal distribution for a random variable can be
generalized to a multivariate normal distribution for a vector
of random variables (random vector). Much of the theory for
multivariate analysis methods relies on the assumption that
a random vector,
          x1 
     x   ,
          
          xp 
          
has a multivariate normal distribution.


The relationships between the xi’s can be measured by the
covariances and/or the correlations.

    Let E(xi)=i

    Covariance of xi and xj: ij = Cov(xi, xj) = E[(xi-i)(xj-j)]
    Covariance of xi and xi (variance of xi): E[(xi-i)2]= ii

        Note that this is not denoted by 2

                          2005 Christopher R. Bilder
                                                              1.13
    Correlation coefficient of xi and xj:

                ij             Cov(xi ,x j )
        ij               
                ii  jj       Var(xi )Var(x j )

    Remember that ij = ji and -1  ij  1.

The means, covariances, correlations can be put into a
mean vector, covariance matrix, and a correlation matrix:
             E(x1 )   1 
    = E(x)=           ,
                      
             E(xp ) p 
                      
                                  11 12    1p 
                                       22  2p 
      Cov(x )=E[(x-)(x- )]=                  , and
                                    21

                                                 
                                                 
                                  p1 p2    pp 
                  1 12          1p 
                       1       2p 
   P  Corr(x )=                     .
                    21

                                     
                                     
                  p1 p2         1

    Remember that  and P are symmetric and knowing  is
    equivalent to knowing P.



                                2005 Christopher R. Bilder
                                                                              1.14
    NOTE:  and  are bolded although it may not look like it
    when it is printed. Word does not properly bold the
    symbol font.


    Multivariate normal distribution
       Similar to a univariate normal distribution where only
          and 2 are needed to know the distribution, only 
       and  (or P) are needed for the multivariate normal
       distribution. The notation, x~Np(,), means that x
       has a p-dimension multivariate normal distribution
       with mean  and covariance matrix .

        The multivariate normal distribution function is:
                                                1
                                   1            ( x   )  ( x   )
             f( x | ,  )              1/ 2
                                              e 2                        

                             (2)p / 2 
        for - <xi< for i=1,…,p and ||>0.

Example: Bivariate normal distribution (p=2) –
(graph_mult_normal.sas).

    Similar to as shown on p. 1.11, the bivariate normal
                                                   x1 
    distribution can be graphed. Suppose x    has a
                                                   x2 
                                             15 
    multivariate normal distribution with =   and
                                              20 

                              2005 Christopher R. Bilder
                                                                                                       1.15
   1      0.5 
             . Thus, we could write x N2 ,   or
  0.5 1.25 
      15   1      0.5  
x N2    ,               . Also, note that
       20 0.5 1.25 
   1       0.45 
P                (for example, 12= 0.5 / 1 1.25 =0.45).
  0.45       1 

The multivariate normal distribution function is:
                                                                                0.5   15   
                                                              1  15    1
                                                                                       1
                                                               x                  x     
                                    1                         2   20  0.5 1.25  20  
                                                                                              
    f(x | , )                                  1/ 2
                                                         e                                        

                               1   0.5 
                    (2)p / 2          
                              0.5 1.25
                                                      1            
                               1  15   1.25 0.5  15   
                                x                    x     
             1                 2   20   0.5 1   20  
                                                                
                         e                                          
        (2) 1
             p / 2 1/ 2




Below are 3D surface and contour plots of the normal
distribution.




                                  2005 Christopher R. Bilder
                               1.16




 2005 Christopher R. Bilder
                                                         1.17
  Examine the following:
  1. The surface is centered at .
  2. The surface is wider in the x2 direction than in the x1.
     This is because there is more variability.
  3. Notice the shape of the surface – The contour plot is
     an ellipse and the surface plot looks like a 3D
     normal curve.
  4. Volume underneath the surface is 1.


Question: Suppose a random sample is taken from a
population which can be characterized by this
multivariate normal distribution. A scatter plot is
constructed of the observed x1 and x2 values. Where
would most of the (x1, x2) points fall?


            15            1 0.5 
Suppose =   and                 . The only change
             20          0.5 3 
from the previous example is that 22 has increased.
               1      0.289 
Note that P                  . Below are 3D surface
               0.289    1  
and contour plots of the normal distribution.




                     2005 Christopher R. Bilder
                               1.18




 2005 Christopher R. Bilder
                                                     1.19




Compare the above plots with the previous example.

Questions to think about for the bivariate normal
distribution:
  1. What happens if the means change?
  2. What happens if 12=0?
  3. What happens if 11=22 and 12=0?
  4. What happens if 12=1 or -1?

  If you do not know the answers, use the SAS program
  to find out what happens!

Below is part of the SAS program used to construct the
plots above. SAS does not have a mechanism where
the multivariate normal distribution function can be
                    2005 Christopher R. Bilder
                                                          1.20
inputted and a resulting plot automatically constructed.
Instead, we can tell SAS to find f(x|,) for many
different values of x and then plot these (x1, x2, f(x|,) )
triples.




                     2005 Christopher R. Bilder
                                                        1.21
dm 'log;clear;output;clear;';
options ps=50 ls=70 pageno=1;
goptions reset=global border ftext=swiss gunit=cm htext=0.4
         htitle=0.5;
goptions display noprompt;

**********************************************************;
**                                                      **;
** AUTHOR: Chris Bilder                                 **;
** COURSE: STAT 873                                     **;
** DATE: 1-8-01                                         **;
** UPDATE: 2-6-01 –                                     **;
**         corrected error in finding f(x), 8-22-03     **;
** PURPOSE: Graph multivariate normal distribution      **;
**                                                      **;
** NOTES:                                               **;
**                                                      **;
**********************************************************;

title1 'Chris Bilder, STAT 873';

proc iml;

 pi = 3.141592654;
 mu={15,20};
 sigma={1 0.5,
        0.5 1.25};
 det = det(sigma);
 inv_sigma = inv(sigma);
 p=2;


 *initialize save matrix, note there are 61*61 x1 and x2
  combinations in the do loop below;
 save = repeat(0, 3721, 3);

 *initialize counter for do loop;
 i=1;


                       2005 Christopher R. Bilder
                                                          1.22
 *evaluate f(x) for many x1 and x2 values;
 do x1 = 10 to 25 by 0.25;
   do x2 = 10 to 25 by 0.25;

        *Creates 2x1 vector;
        x = x1 // x2;

        *Evaluate f(x);
        f_x = (1/(2*pi)**(p/2)) * det**(-1/2) *
               exp(-1/2 * t(x-mu)*inv_sigma*(x-mu));

        *Create 1x3 vector for the results and then save it
         into the ith row of the save matrix;
        save[i,] = x1 || x2 || f_x;

      i=i+1;
   end;
 end;

 col={ "x1" "x2" "f_x"};
 create set1 from save [colname=col];
 append from save;

quit;


*Create 3D plot of the surface;
proc g3d data=set1;
  title2 '3D surface ';
  plot x1*x2=f_x / grid zmin=0 zmax=0.3 zticknum=5
                  rotate=50;
  *use different rotate option on the above plot statement;
  * if you need to;
run;


*Create contour plot;
proc gcontour data=set1;
  plot x1*x2=f_x / grid autolabel=(reveal)
     haxis=axis1 vaxis=axis2 legend=legend1
                         2005 Christopher R. Bilder
                                                          1.23
     levels=0.01 0.03 0.05 0.07 0.09 0.11 0.13 0.15;
      *The reveal statement puts the level labels on the
      plot. If there are too many to fit, SAS will not plot
      them;
  title2 'Contour plot';
  axis1 label = ('x2')
        length=10
        order = (10 to 30 by 5);
  axis2 label=(a=90 'x1')
        length=8.71
        order = (10 to 30 by 5);
  legend1 label=('f(x)')
          down=2;
run;



    For homework, run this program!

    In order to get the plots from SAS into Word, I simply
    selected EDIT > COPY from SAS when the GRAPH
    window is the current window. Note that the ability to do
    this is new to SAS 9! Before, one had to export the plot
    out of SAS as a .gif or .jpg, and then insert the image
    into Word.

    If you do not have experience in SAS, I recommend
    taking a look at my lecture notes for OSU STAT 4091 at
    http://www.chrisbilder.com/stat4091 (see the schedule
    web page). This course was a 1 credit hour course
    introducing students to SAS 8.1.




                        2005 Christopher R. Bilder
                                                     1.24
1.6 Statistical computing
Read on your own.

1.7 Multivariate outliers
Read on your own.




                       2005 Christopher R. Bilder
                                                          1.25
1.8 Multivariate summary statistics

Let x1, x2, …, xN be a random sample from a multivariate
distribution (not necessarily normal) with a mean vector ,
covariance matrix of , and correlation matrix P. This can be
represented symbolically as {xi }N 1 ~ (, ).
                                 i


    Note on notation: If it is understood that a random
    sample of size N is taken, then we could shorten this to
    x ~ (, ) .

Estimates of , , and P based on the random sample are:

       1N          1
       xr   x1  x 2  ...  xN 
    ˆ
       N r 1      N
    ˆ  1  ( xr   )(xr   )
               N
                    ˆ      ˆ
       N 1   r 1


Notes:

                     1 N        
                          xr1 
              x1   N r 1   1  ˆ
              x   1 N x        ˆ
      1.  
         ˆ    2    N r1 r 2    2 
                          
                                
                               p 
              xp                    ˆ
                     1 N          
                      N r1 xrp 
                               
                            2005 Christopher R. Bilder
                                                                           1.26
                                            11 12
                                             ˆ     ˆ               1p 
                                                                   ˆ
                                           ˆ 21 22
                                                   ˆ               2p 
                                                                   ˆ
          1 N
   ˆ
2.              ( xr   )(xr   )  
                          ˆ         ˆ                                  
        N  1 r 1                                                    
                                                                      
                                            p1 p2
                                             ˆ     ˆ               pp 
                                                                   ˆ
                                   1 N
  where ii  Var(xi ) 
           ˆ                            (xri  xi ) and
                                                     2

                                N  1 r 1
                              1 N
   ij  Cov(xi ,x j ) 
   ˆ                              (xri  xi )(xrj  x j )
                           N 1  r 1



3. rij and R are used to denote the estimates of ij and
   P, respectively.

                                        1 N
             ij
             ˆ                             (xri  xi )(xrj  x j )
4. rij                            N  1 r 1
             ii  jj
             ˆ ˆ             1 N                2 1
                                                             N
                                                                          2
                                     (xri  xi )  
                             N  1 r 1                     (xrj  x j ) 
                                                   N  1 r 1           
                 N
                 (xri  xi )(xrj  x j )
               r 1

             (x  x )2    (x  x )2 
             N               N

           r 1 ri
           
                     i
                          r 1 rj
                         
                                     j
                                         
                                         
   and
      1 r12                     r1p 
     r    1                     r2p 
   R                               
       21

                                    
                                    
     rp1 rp2                     1

                             2005 Christopher R. Bilder
                                                                 1.27


Example: Cereal data (cereal_data.sas, cereal_data.xls)

   Let x1 denote sugar, x2 denote fat, and x3 denote sodium
   which are adjusted for the number of grams per serving.
   Suppose
                           40
                     xi1  
        xi i1    xi2   ~ (, )
             40
                     
                     xi3  
                      i1

   meaning we have a random sample of size 40 from
   (, ) . Thus, we have x1, x2, …, x40. More simply,
   someone may say x ~ (, ) with N = 40.

   Using the observed values of x1, x2, …, x40 produces
   the estimates of , , and P:

                 1 N         1                        
                       xr1    0.3571  ...  0.1 
                  N r 1
         x1                  40
                                                        0.2894 
       x2     xr 2            0  ...  0    0.0322
                   1  N              1
    ˆ
           N r 1             40                            
         x3   N
                            1                         5.6147 
                                                                  
                   1
                   xr 3  
                            
                                    6.07  ...  6.67  
                  N r 1   40                         
                                                         




                            2005 Christopher R. Bilder
                                                           1.28

ˆ     1 N
          ( xr   )(xr   )
                    ˆ        ˆ
     N  1 r 1
                                                 
                     0.2894         0.2894 
      1 N                                 
             xr  0.0322   xr  0.0322 
    40  1 r 1 
                    5.6147  
                                    5.6147 
                                               
         
           0.3571 0.2894   0.3571 0.2894  
    1                                   
          0   0.0322    0   0.0322    ... 
  40  1                                          
           6.07  5.6147    6.07  5.6147  
                                             
         
                                                       
            0.1  0.2894     0.1  0.2894   
                                                 
             0   0.0322     0   0.0322   
                                             
           6.67  5.6147    6.67  5.6147   
                                          
                                                       

        0.0224 0.0010 0.0602
       0.0010 0.0008 0.0451
                               
        0.0602 0.0451 6.0640 
                               


        1    0.2397 0.1636 
R   0.2397     1    0.0661
                            
     0.1636 0.0661
                        1   
                             


                         2005 Christopher R. Bilder
                                                                     1.29
One way to find these items in SAS is to use PROC
PRINCOMP (PROC CORR can also find it). This
procedure will be used in Chapter 5 to perform principal
components analysis. Below is part of the SAS program
and output used to find the above items.
*Read in Excel file containing the cereal data';
* Note: The variable names are ID, shelf, cereal, size_g, sugar_g,;
* fat_g, and sodium_mg;
proc import out=set1
     datafile= "c:\Chris\OSU\Stat5063\Chapter 1\cereal_data.xls"
     dbms=excel2000 replace;
getnames=yes;
run;

*adjust for the serving size;
data set2;
  set set1;
  sugar = sugar_g/size_g;
  fat = fat_g/size_g;
  sodium = sodium_mg/size_g;
run;

title2 'Find mean vector and covariance matrix';
proc princomp data=set2 covariance;
  var sugar fat sodium;
run;

title2 'Find mean vector and correlation matrix';
proc princomp data=set2;
  var sugar fat sodium;
run;

                Chris Bilder, STAT 5063                         5
            Find mean vector and covariance matrix

                    The PRINCOMP Procedure
                   Observations          40
                   Variables               3

                         Simple Statistics
                       sugar               fat              sodium
     Mean       0.2894155586      0.0321827691         5.614703818
     StD        0.1495598652      0.0276878932         2.462527107

                           Covariance Matrix

                         2005 Christopher R. Bilder
                                                                               1.30
                         sugar                    fat                 sodium
sugar              0.022368153            0.000992690           -0.060242015
fat                0.000992690            0.000766619           -0.004509788
sodium            -0.060242015           -0.004509788            6.064039752

                     Total Variance           6.0871745247

              Eigenvalues of the Covariance Matrix
          Eigenvalue    Difference    Proportion                 Cumulative
   1      6.06464374    6.04283353        0.9963                     0.9963
   2      0.02181021    0.02108963        0.0036                     0.9999
   3      0.00072058                      0.0001                     1.0000

                              Eigenvectors
                           Prin1         Prin2                  Prin3
         sugar          -.009970      0.998938               -.044987
         fat            -.000745      0.044981               0.998988
         sodium         0.999950      0.009993               0.000296


                 Chris Bilder, STAT 5063                                     6
              Find mean vector and correlation matrix

                         The PRINCOMP Procedure
                        Observations          40
                        Variables               3

                       Simple Statistics
                    sugar                fat                        sodium
 Mean        0.2894155586      0.0321827691                    5.614703818
 StD         0.1495598652      0.0276878932                    2.462527107

                           Correlation        Matrix
                            sugar                fat         sodium
            sugar          1.0000             0.2397         -.1636
            fat            0.2397             1.0000         -.0661
            sodium         -.1636             -.0661         1.0000

              Eigenvalues of the Correlation Matrix
          Eigenvalue    Difference    Proportion    Cumulative
   1      1.32346996    0.38459531        0.4412        0.4412
   2      0.93887466    0.20121928        0.3130        0.7541
   3      0.73765538                      0.2459        1.0000

                             Eigenvectors
                          Prin1         Prin2                   Prin3
         sugar         0.667079      0.090841                0.739428
         fat           0.587928      0.545387                -.597406
         sodium        -.457543      0.833247                0.310408

                         2005 Christopher R. Bilder
                                                         1.31


1.9 Standardized Data and/or Z Scores

   Sometimes it is easier to work with data which are on the
   same scale. “Standardizing data” or using “Z scores”
   can used to convert the data to a unitless scale. Let
              xrj   j
                      ˆ
        Zrj 
                   jj
                  ˆ
   for r=1,…,N and j=1,…,p. Zrj is the Z score for the jth
   response variable on the rth experimental unit. The Zrj
   values will have a mean of 0 and variance of 1 for each
   response variable.

   The matrix
             z11 z12        z1p 
            z    z22       z2p 
        Z                      
               21

                                
                                
             zN1 zN2       zNp 
   is called the matrix of Z scores.

Example: Cereal data (cereal_data.sas)
   Many of the SAS procedures that will be used in the
   course will standardize the data for us. The SAS
   procedure, PROC STANDARD, also standardizes data.
   Below is an example.
       proc standard data=set2 out=stand mean=0 std=1;
         var sugar fat sodium;
       run;
                        2005 Christopher R. Bilder
                                                                   1.32

       title2 'The standardized data';
       proc print data=stand;
         var ID sugar fat sodium;
       run;
           Obs   ID       sugar            fat            sodium
            1    1      0.45284        -1.16234          0.18547
            2    2     -1.45752        -1.16234          1.63578
            3    3     -1.45752        -1.16234          2.07087
            4    4     -1.51722         1.09496          1.27320
       
                      0.3571-0.2894
   Note that Z11=                              0.4528
                          0.0224

Final notes:
  1. On-line SAS help -
     http://support.sas.com/91doc/docMainpage.jsp
  2. SAS help within SAS




                           2005 Christopher R. Bilder

								
To top