# Methods in Multivariate Analysis

Why multivariate analysis?

Course objectives:

- Perform exploratory data analysis (plots, descriptive statistics)
- Explore techniques for comparing means, analyzing covariance structure, reducing dimensionality, sorting and grouping
- Decide on appropriate multivariate technique(s) for a given problem
- Critically examine the assumptions that underlie multivariate methods and, if necessary, apply corrective measures such as data transformations
- Analyze a real data set using popular computer packages such as SAS, SPSS and Minitab

Introduction: Notation and Descriptive Statistics

Data: $p$ variables $X_1, X_2, \ldots, X_p$ measured on $n$ observations; $x_{jk}$ denotes the $j$th observation on the $k$th variable.

Data matrix:

$$
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
$$

Descriptive statistics:

Sample means

$$\bar{x}_k = \frac{1}{n}\sum_{j=1}^{n} x_{jk}, \qquad k = 1, 2, \ldots, p$$

Sample variances

$$s_k^2 = s_{kk} = \frac{1}{n}\sum_{j=1}^{n} (x_{jk} - \bar{x}_k)^2, \qquad k = 1, 2, \ldots, p$$

Sample standard deviations

$$\sqrt{s_{kk}}, \qquad k = 1, 2, \ldots, p$$

Sample covariances

$$s_{ik} = \frac{1}{n}\sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k), \qquad i, k = 1, 2, \ldots, p$$

Sample correlations

$$r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}}, \qquad i, k = 1, 2, \ldots, p$$

Arrays (Matrices)

$$
\bar{x} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix}
\; (p \times 1), \qquad
S_n = \begin{pmatrix}
s_{11} & s_{12} & \cdots & s_{1p} \\
s_{21} & s_{22} & \cdots & s_{2p} \\
\vdots & \vdots &        & \vdots \\
s_{p1} & s_{p2} & \cdots & s_{pp}
\end{pmatrix}
\; (p \times p), \qquad
R = \begin{pmatrix}
1      & r_{12} & \cdots & r_{1p} \\
r_{21} & 1      & \cdots & r_{2p} \\
\vdots & \vdots &        & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{pmatrix}
\; (p \times p)
$$

Notes: $S_n$ and $R$ are symmetric, and $-1 \le r_{ik} \le 1$.

Example: (Exercise 1.4)

Scatter plot; marginal dot diagram; boxplot.

$$\bar{x}_1 = \tfrac{1}{10}(127 + \cdots + 32.4) = 62.3, \qquad \bar{x}_2 = \tfrac{1}{10}(4.2 + \cdots + 2.4) = 2.9$$

$$s_1^2 = s_{11} = \tfrac{1}{10}\left((127 - 62.3)^2 + \cdots + (32.4 - 62.3)^2\right) = 900.4$$

$$s_2^2 = s_{22} = \tfrac{1}{10}\left((4.2 - 2.9)^2 + \cdots + (2.4 - 2.9)^2\right) = 1.3$$

$$s_{12} = \tfrac{1}{10}\left((127 - 62.3)(4.2 - 2.9) + \cdots + (32.4 - 62.3)(2.4 - 2.9)\right) = 22.7$$

$$r_{12} = \frac{22.7}{\sqrt{900.4}\sqrt{1.3}} = 0.7$$

$$\bar{x} = \begin{pmatrix} 62.3 \\ 2.9 \end{pmatrix}, \qquad
S_n = \begin{pmatrix} 900.4 & 22.7 \\ 22.7 & 1.3 \end{pmatrix}, \qquad
R = \begin{pmatrix} 1 & 0.7 \\ 0.7 & 1 \end{pmatrix}$$

Graphical Techniques

- Dot diagram, histogram, boxplot (all univariate)
- Scatter plot (bivariate association, arrays)
- 3D (spin) plots

Example:

(Example 1.5)

Linked scatter plots—

Scatter plots and boxplots of paper-quality data from Table 1.2.

Example (continued)

Modified scatter plots with (a) group of points selected and (b) points, including specimen 25, deleted and the scatter plots rescaled.

Three-dimensional scatter plots (lower-dimensional structure).

Example: (Example 1.6)

3D scatter plots of lizard data (see Table 1.3).

Example:

(Example 1.9)

Three-dimensional perspectives for the lumber stiffness data.

Plots for investigating sampling unit similarities

- Growth curves
- Stars
- Glyphs
- Boxes
- (Chernoff) faces
- Andrews plots

Example: (Profiles, Stars, Glyphs, Faces, Boxes), with displays arranged by variables and by cases (sampling units).

Example:

(Examples 1.10 & 1.13)

Combined growth curves for weight for seven female grizzly bears.

Chernoff faces over time.

It is important to look at the data before applying inferential methods.

Sample Geometry and Distance

The data matrix

$$
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
$$

can be viewed in two ways: its $n$ rows are $n$ points in $p$ dimensions (a scatter plot, 1st point through $n$th point), and its $p$ columns are $p$ vectors in $n$ dimensions (1st point through $p$th point).

Example: (Example 3.1)

$$X = \begin{pmatrix} 4 & 1 \\ -1 & 3 \\ 3 & 5 \end{pmatrix}, \qquad n = 3, \; p = 2$$

(a) Scatter plot: a plot of the data matrix $X$ as $n = 3$ points in $p = 2$ space.

(b) Variable plot:

$$y_1' = (4, \; -1, \; 3), \qquad y_2' = (1, \; 3, \; 5)$$

A plot of the data matrix $X$ as $p = 2$ vectors in $n = 3$ space.

Note (with $\bar{x}_1 = 2$):

$$\begin{pmatrix} 4 \\ -1 \\ 3 \end{pmatrix} = 2\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} + \begin{pmatrix} 2 \\ -3 \\ 1 \end{pmatrix}, \qquad y_1 = \bar{x}_1 \mathbf{1} + (y_1 - \bar{x}_1 \mathbf{1}) \quad \text{or} \quad y_1 = \bar{x}_1 \mathbf{1} + d_1$$

Similarly (with $\bar{x}_2 = 3$):

$$\begin{pmatrix} 1 \\ 3 \\ 5 \end{pmatrix} = 3\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} + \begin{pmatrix} -2 \\ 0 \\ 2 \end{pmatrix}, \qquad y_2 = \bar{x}_2 \mathbf{1} + (y_2 - \bar{x}_2 \mathbf{1}) \quad \text{or} \quad y_2 = \bar{x}_2 \mathbf{1} + d_2$$

Have (since $n = 3$)

$$d_1' d_1 = 3 s_{11}, \qquad d_2' d_2 = 3 s_{22}$$

$$d_1' d_2 = 3 s_{12} = L_{d_1} L_{d_2} \cos(\theta_{12}), \qquad r_{12} = \cos(\theta_{12}) = \frac{d_1' d_2}{L_{d_1} L_{d_2}}$$

These results are true in general (see page 119, Geometry of the Sample).

Note:

$$\theta_{12} = 90^\circ \Rightarrow r_{12} = 0, \qquad \theta_{12} = 0^\circ \Rightarrow r_{12} = 1, \qquad \theta_{12} = 180^\circ \Rightarrow r_{12} = -1$$

Our example ($n = 3$):

$$d_1' d_1 = 14, \qquad d_2' d_2 = 8, \qquad d_1' d_2 = -2$$

so

$$s_{11} = \frac{14}{3}, \qquad s_{22} = \frac{8}{3}, \qquad s_{12} = -\frac{2}{3}, \qquad \text{and} \qquad r_{12} = \frac{-2}{\sqrt{14}\sqrt{8}} = -.19$$

Many multivariate techniques are based on the concepts of "distance" and "volume."

Euclidean (straight-line) distance: for two points $P = (x_1, x_2, \ldots, x_p)$ and $Q = (y_1, y_2, \ldots, y_p)$,

$$d(P, Q) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2}$$

1. Each pair of $(x, y)$ coordinates contributes equally to the distance.
2. The locus of points at a constant distance from a fixed point $Q$ is a circle ($p = 2$).
3. Matrix notation: with

$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix}$$

$$d(P, Q) = \sqrt{(x - y)'(x - y)} \qquad \text{or} \qquad d^2(P, Q) = (x - y)'(x - y)$$

Statistical distance (accounts for differences in variation and association):

Suppose we have $n$ measurements on two independent random variables $X_1$ and $X_2$ with zero means and $\mathrm{Var}(X_1) > \mathrm{Var}(X_2)$: a scatter plot with greater variability in the $x_1$ direction than in the $x_2$ direction.

Weight the $x_2$ coordinate more heavily in determining, say, the distance of $P = (x_1, x_2)$ to the origin $O = (0, 0)$. One way: standardize,

$$x_1^* = \frac{x_1}{\sqrt{s_{11}}} = \left(\frac{1}{\sqrt{s_{11}}}\right) x_1, \qquad x_2^* = \frac{x_2}{\sqrt{s_{22}}} = \left(\frac{1}{\sqrt{s_{22}}}\right) x_2$$

$$d(O, P) = \sqrt{x_1^{*2} + x_2^{*2}} = \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}}} = \sqrt{a_{11} x_1^2 + a_{22} x_2^2}$$

All points at a constant squared distance from the origin lie on an ellipse ($p = 2$): the ellipse of constant statistical distance $d^2(O, P) = x_1^2 / s_{11} + x_2^2 / s_{22} = c^2$.

Generalize to $p$ dimensions: for $P = (x_1, x_2, \ldots, x_p)$ and $Q = (y_1, y_2, \ldots, y_p)$,

$$d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \cdots + \frac{(x_p - y_p)^2}{s_{pp}}} = \sqrt{a_{11}(x_1 - y_1)^2 + a_{22}(x_2 - y_2)^2 + \cdots + a_{pp}(x_p - y_p)^2}$$

For correlated variables ($p = 2$), with $P = (x_1, x_2)$ and origin $O = (0, 0)$:

$$d(O, P) = \sqrt{a_{11} x_1^2 + 2 a_{12} x_1 x_2 + a_{22} x_2^2}$$

or, with $Q = (y_1, y_2)$ fixed,

$$d(P, Q) = \sqrt{a_{11}(x_1 - y_1)^2 + 2 a_{12}(x_1 - y_1)(x_2 - y_2) + a_{22}(x_2 - y_2)^2}$$

This describes an ellipse of points at a constant statistical distance from the point $Q$. With

$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}, \qquad A = \begin{pmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{pmatrix}$$

we have

$$d(P, Q) = \sqrt{(x - y)' A (x - y)}$$

Generalize to $p$ dimensions: $P = (x_1, x_2, \ldots, x_p)$, with $Q = (y_1, y_2, \ldots, y_p)$ fixed. Let

$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix}, \qquad
A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1p} \\
a_{12} & a_{22} & \cdots & a_{2p} \\
\vdots & \vdots &        & \vdots \\
a_{1p} & a_{2p} & \cdots & a_{pp}
\end{pmatrix} \quad (p \times p)$$

Then

$$d(P, Q) = \sqrt{(x - y)' A (x - y)} \qquad \text{or} \qquad d^2(P, Q) = (x - y)' A (x - y)$$

If $Q = O = (0, 0, \ldots, 0)$,

$$d(O, P) = \sqrt{x' A x} \qquad \text{or} \qquad d^2(O, P) = x' A x$$

For statistical distance to be a valid distance, the $a_{ik}$'s must be such that the following properties are satisfied ($R$ an intermediate point):

$$d(P, Q) = d(Q, P), \qquad d(P, Q) > 0 \text{ if } P \ne Q, \qquad d(P, Q) = 0 \text{ if } P = Q, \qquad d(P, Q) \le d(P, R) + d(R, Q)$$

Notes:

1. If $A$ is a positive definite matrix (i.e. $x' A x > 0$ for $x \ne 0$), then the statistical distance is a valid distance; $d^2(P, Q)$ is called a positive definite quadratic form.
2. For $Q$ fixed, the locus of points $P$ at a constant statistical distance from $Q$ is an ellipsoid.
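A statistical distance of this form is straightforward to compute; a minimal sketch, taking $A$ to be the inverse of a hypothetical covariance matrix (the choice that yields a Mahalanobis-type distance):

```python
import numpy as np

def statistical_distance(x, y, A):
    """Statistical distance d(P, Q) = sqrt((x - y)' A (x - y))."""
    diff = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(diff @ A @ diff))

# A common choice of A is the inverse of a covariance matrix (hypothetical here)
S = np.array([[4.0, 1.0],
              [1.0, 9.0]])
A = np.linalg.inv(S)

P, O = [2.0, 3.0], [0.0, 0.0]
d_stat = statistical_distance(P, O, A)
d_eucl = statistical_distance(P, O, np.eye(2))   # A = I recovers Euclidean distance
print(d_stat, d_eucl)
```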

Some Matrix Results

Read Supplement 2A and review Chapter 2 in Johnson & Wichern. Pay particular attention to:

- Matrix algebra
- Determinant
- Inverse of a matrix
- Trace of a matrix
- Spectral decomposition

Eigenvalues and eigenvectors: given a square matrix $A$ ($k \times k$), find a scalar $\lambda$ and a vector $x \ne 0$ ($k \times 1$) such that

$$A x = \lambda x$$

Then $\lambda$ is an eigenvalue and $x$ is an eigenvector.

Notes:

1. $(A - \lambda I)x = 0 \Rightarrow |A - \lambda I| = 0$
2. $A(cx) = \lambda(cx)$ for any constant $c \ne 0$, so eigenvectors are determined only up to a scalar multiple

Notes:

Example:

$$A = \begin{pmatrix} 1 & 2 \\ -1 & 4 \end{pmatrix}$$

$$0 = |A - \lambda I| = \begin{vmatrix} 1 - \lambda & 2 \\ -1 & 4 - \lambda \end{vmatrix} = (1 - \lambda)(4 - \lambda) - (-1)(2) = \lambda^2 - 5\lambda + 6 = (\lambda - 3)(\lambda - 2)$$

Eigenvalues: $\lambda_1 = 3$, $\lambda_2 = 2$.

Eigenvectors: solve

$$A x_1 = 3 x_1, \qquad \begin{pmatrix} 1 & 2 \\ -1 & 4 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 3 \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

and find $x_1 = x_2 = 1$ (arbitrarily), so

$$x_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix} \qquad \text{with normalized eigenvector} \qquad e_1 = \begin{pmatrix} \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \end{pmatrix}$$

For $\lambda_2 = 2$, we have

$$x_2 = \begin{pmatrix} 2 \\ 1 \end{pmatrix} \qquad \text{and} \qquad e_2 = \begin{pmatrix} \tfrac{2}{\sqrt{5}} \\ \tfrac{1}{\sqrt{5}} \end{pmatrix}$$

Note: for a symmetric matrix $A$, eigenvectors corresponding to different eigenvalues are perpendicular (orthogonal), i.e. for pairs $(\lambda_i, e_i)$ and $(\lambda_j, e_j)$,

$$\lambda_i \ne \lambda_j \; \Rightarrow \; e_i' e_j = 0$$

Quadratic Forms:

For $A$ ($k \times k$) symmetric and $x \ne 0$, the quadratic form is $x' A x$.

Example:

$$A = \begin{pmatrix} 3 & -2 \\ -2 & 1 \end{pmatrix}, \qquad x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

$$x' A x = (x_1 \;\; x_2) \begin{pmatrix} 3 & -2 \\ -2 & 1 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = 3 x_1^2 - 4 x_1 x_2 + x_2^2$$

Nonnegative definite quadratic form: $x' A x \ge 0$ for all $x$.

Positive definite quadratic form: $x' A x > 0$ for all $x \ne 0$.

(Statistical) Distance and positive definite quadratic forms:

For a point $P$ with coordinates $x' = (x_1, x_2, \ldots, x_p)$ and the origin $O$ with coordinates $(0, 0, \ldots, 0)$,

$$d^2(P, O) = x' A x > 0 \qquad \text{for } x \ne 0, \; A \text{ positive definite}$$

For a point $Q$ with coordinates $y' = (y_1, y_2, \ldots, y_p)$,

$$d^2(P, Q) = (x - y)' A (x - y) > 0$$

Let $\lambda_1, \lambda_2, \ldots, \lambda_p$ be the eigenvalues of $A$ and $e_1, e_2, \ldots, e_p$ the normalized eigenvectors. Then

$$x' A x = c^2 > 0 \qquad \text{(an ellipse)}$$

describes the points at a constant distance $c$ from the origin ($p = 2$, $\lambda_1 < \lambda_2$).

More matrix results: for $A$ ($k \times k$) with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_k$:

1. $\mathrm{tr}(A) = a_{11} + a_{22} + \cdots + a_{kk} = \lambda_1 + \lambda_2 + \cdots + \lambda_k$
2. $|A| = \lambda_1 \lambda_2 \cdots \lambda_k$
3. $A$ is positive definite if and only if all $\lambda_i > 0$

Linear combinations:

$$
\begin{aligned}
Z_1 &= c_{11} X_1 + c_{12} X_2 + \cdots + c_{1p} X_p \\
Z_2 &= c_{21} X_1 + c_{22} X_2 + \cdots + c_{2p} X_p \\
&\;\;\vdots \\
Z_q &= c_{q1} X_1 + c_{q2} X_2 + \cdots + c_{qp} X_p
\end{aligned}
$$

or, in matrix form,

$$
\begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_q \end{pmatrix}
= \begin{pmatrix}
c_{11} & c_{12} & \cdots & c_{1p} \\
c_{21} & c_{22} & \cdots & c_{2p} \\
\vdots & \vdots &        & \vdots \\
c_{q1} & c_{q2} & \cdots & c_{qp}
\end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix}
\qquad \text{or} \qquad
\underset{(q \times 1)}{Z} = \underset{(q \times p)}{C}\;\underset{(p \times 1)}{X}
$$

Random Vectors

$$X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{pmatrix} \quad \text{(random vector)}$$

Mean vector:

$$\mu = E(X) = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix}$$

Covariance matrix:

$$\Sigma = E(X - \mu)(X - \mu)' = \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{12} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots      & \vdots      &        & \vdots \\
\sigma_{1p} & \sigma_{2p} & \cdots & \sigma_{pp}
\end{pmatrix} \quad (p \times p)$$

Correlation matrix: with $\rho_{ik} = \dfrac{\sigma_{ik}}{\sqrt{\sigma_{ii}}\sqrt{\sigma_{kk}}}$,

$$\rho = \begin{pmatrix}
1         & \rho_{12} & \cdots & \rho_{1p} \\
\rho_{12} & 1         & \cdots & \rho_{2p} \\
\vdots    & \vdots    &        & \vdots \\
\rho_{1p} & \rho_{2p} & \cdots & 1
\end{pmatrix}$$

Let

$$V^{1/2} = \begin{pmatrix}
\sqrt{\sigma_{11}} & 0                  & \cdots & 0 \\
0                  & \sqrt{\sigma_{22}} & \cdots & 0 \\
\vdots             & \vdots             &        & \vdots \\
0                  & 0                  & \cdots & \sqrt{\sigma_{pp}}
\end{pmatrix}$$

Then

$$\Sigma = V^{1/2} \rho \, V^{1/2} \qquad \text{and} \qquad \rho = (V^{1/2})^{-1} \Sigma \, (V^{1/2})^{-1}$$
1 2 1

Linear combinations of random variables:

The linear combination

$$c_1 X_1 + c_2 X_2 + \cdots + c_p X_p = c' X$$

has mean $c' \mu$ and variance $c' \Sigma c$, where $\mu = E(X)$ and $\Sigma = \mathrm{Cov}(X)$.

The $q$ linear combinations $Z = C X$ have

$$\mu_Z = C \mu_X, \qquad \Sigma_Z = C \Sigma_X C'$$

Similar results hold for sample values (observations) of linear combinations of variables (see Results 3.5 and 3.6).
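These propagation rules can be illustrated numerically; a sketch with hypothetical $\mu$, $\Sigma$, and $C$, including a Monte Carlo sanity check:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population parameters for p = 3 variables
mu = np.array([1.0, 2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

# q = 2 linear combinations Z = C X
C = np.array([[1.0, -1.0, 0.0],
              [0.5, 0.5, 1.0]])

mu_Z = C @ mu               # mean vector of Z
Sigma_Z = C @ Sigma @ C.T   # covariance matrix of Z

# Monte Carlo check: simulate X and compare sample moments
X = rng.multivariate_normal(mu, Sigma, size=50_000)
Z = X @ C.T
print(mu_Z, Z.mean(axis=0))       # close
print(Sigma_Z, np.cov(Z.T))       # close
```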

Total Variance

Two scalar measures of variability for the multivariate case. Let $\Sigma$ ($p \times p$) be the population covariance matrix and $S$ ($p \times p$) the sample covariance matrix (with $R$ the sample correlation matrix).

Total population variance $= \mathrm{tr}(\Sigma) = \sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp}$ = sum of the eigenvalues of $\Sigma$.

Total sample variance $= \mathrm{tr}(S) = s_{11} + s_{22} + \cdots + s_{pp}$ = sum of the eigenvalues of $S$.

Also, total standardized sample variance $= \mathrm{tr}(R) = 1 + 1 + \cdots + 1 = p$ = sum of the eigenvalues of $R$.

Generalized Variance

Generalized population variance $= |\Sigma|$ = product of the eigenvalues. Small $|\Sigma|$ implies the effective dimension is less than $p$, i.e. the variables are (nearly) linearly related.

Generalized sample variance $= |S|$ = product of the eigenvalues. Also, for standardized variables, generalized sample variance $= |R|$ = product of the eigenvalues.

Generalized variance has an interpretation as a volume (see Figures 3.6, 3.8, 3.9 & 3.10 and Examples 3.8 & 3.9).

Note: if $n \le p$ (i.e. sample size $\le$ number of variables), then $|S| = 0$ for all samples.

Multivariate Normal Distribution

Recall the univariate normal distribution:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$

The $p$-variate normal distribution:

$$f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x - \mu)' \Sigma^{-1} (x - \mu)}$$

where $\mu$ ($p \times 1$) is the mean vector and $\Sigma$ ($p \times p$) is the covariance matrix.

Note: the exponent term

$$(x - \mu)(\sigma^2)^{-1}(x - \mu) \quad \text{for } p = 1$$

is replaced by

$$(x - \mu)' \Sigma^{-1} (x - \mu) \quad \text{for } p > 1$$

The solid ellipsoid of $x$ values satisfying

$$(x - \mu)' \Sigma^{-1} (x - \mu) \le \chi_p^2(\alpha)$$

has probability $1 - \alpha$.

Some Nice Properties of Normal Distributions:

1. All marginal distributions are normal.

Example: if

$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \; \text{is } N_2(\mu, \Sigma) \; \text{with} \; \mu = \begin{pmatrix} 2 \\ 3 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 3 & -1 \\ -1 & 9 \end{pmatrix}$$

then $X_1 \sim N(\mu_1, \sigma_{11}) = N(2, 3)$.

2. Zero covariance is equivalent to independence.

Example: if

$$X = \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} \; \text{is } N_3(\mu, \Sigma) \; \text{with} \; \Sigma = \begin{pmatrix} 3 & 1 & 0 \\ 1 & 2 & 0 \\ 0 & 0 & 4 \end{pmatrix}$$

then $X_3$ is independent of $(X_1, X_2)$.

3. Any linear combination of normal random variables is normally distributed.

For $X$ ($p \times 1$) with mean $\mu$ and covariance $\Sigma$,

$$a_1 X_1 + a_2 X_2 + \cdots + a_p X_p = a' X \; \text{is} \; N(a'\mu, \; a' \Sigma a)$$

and the $q$ linear combinations $AX$ ($(q \times p)(p \times 1)$) have distribution

$$N_q(A\mu, \; A \Sigma A')$$
A strong case can be made for the normal distribution as an approximate sampling distribution.

Limit Results: let $X_1, X_2, \ldots, X_n$ be independent vector observations ($p \times 1$) from a population with mean $\mu$ and finite covariance $\Sigma$ ($p \times p$). Then for $n - p$ large,

$$\sqrt{n}(\bar{X} - \mu) \; \text{is approximately} \; N_p(0, \Sigma)$$

(if $\Sigma$ is unknown, replace $\Sigma$ with $S$), and

$$n(\bar{X} - \mu)' S^{-1} (\bar{X} - \mu) \; \text{is approximately} \; \chi_p^2$$

Assessing Normality

Look at the univariate marginal distributions using dot diagrams / histograms / boxplots. Symmetry? Outliers?

Normal scores (Q-Q) plot: let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$ be the ordered observations on a single variable $X$. Then a proportion $j/n$ of the sample observations will be $\le x_{(j)}$. ($j/n$ is often approximated by $(j - \frac{1}{2})/n$ for analytical convenience.) Define quantiles $q_{(j)}$ by

$$P(Z \le q_{(j)}) = \int_{-\infty}^{q_{(j)}} \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \, dz = p_{(j)} = \frac{j - \frac{1}{2}}{n}$$

Plot the pairs $(q_{(j)}, x_{(j)})$ with the same associated probability $(j - \frac{1}{2})/n$. If the data are approximately normally distributed, the Q-Q plot should be (nearly) a straight line.

The straightness of the Q-Q plot can be judged by calculating the correlation coefficient

$$r_Q = \frac{\sum_{j=1}^{n} (x_{(j)} - \bar{x})(q_{(j)} - \bar{q})}{\sqrt{\sum_{j=1}^{n} (x_{(j)} - \bar{x})^2} \sqrt{\sum_{j=1}^{n} (q_{(j)} - \bar{q})^2}}$$

(See Table 4.2 for critical values.)
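The $r_Q$ computation is a few lines with NumPy/SciPy; a sketch on hypothetical data:

```python
import numpy as np
from scipy.stats import norm

def qq_correlation(x):
    """Correlation r_Q between ordered data and standard normal quantiles."""
    x = np.sort(np.asarray(x, float))
    n = len(x)
    p = (np.arange(1, n + 1) - 0.5) / n   # probability levels (j - 1/2)/n
    q = norm.ppf(p)                       # standard normal quantiles q_(j)
    return np.corrcoef(x, q)[0, 1]

# Hypothetical sample; data drawn from a normal population give r_Q near 1
rng = np.random.default_rng(1)
print(round(qq_correlation(rng.normal(10, 2, size=50)), 3))
```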

Example:
Radiation measurements through the closed doors of microwave ovens are contained in Table 4.1. Q-Q plot is given in Figure 4.6.

Evaluating Multivariate Normality:

For multivariate normal variables, the set of all $x$ such that

$$(x - \mu)' \Sigma^{-1} (x - \mu) \le \chi_p^2(.5)$$

has probability .5. Thus we would expect roughly 50% of the sample observations to lie in the ellipsoid

$$(x - \bar{x})' S^{-1} (x - \bar{x}) \le \chi_p^2(.5)$$

If not, the normality assumption is suspect. (See Example 4.12 for $p = 2$.)

Chi-square Plot: for a sample of $n$ observations, consider the squared statistical distances

$$d_j^2 = (x_j - \bar{x})' S^{-1} (x_j - \bar{x}), \qquad j = 1, 2, \ldots, n$$

If $n$ and $n - p$ are both large, $d_1^2, d_2^2, \ldots, d_n^2$ should (if the data are normal) look like observations on a chi-square variable. (Strictly speaking, the $d_j^2$ are not independent.)

To construct a chi-square plot:

1. Calculate and order the squared distances $d_{(1)}^2 \le d_{(2)}^2 \le \cdots \le d_{(n)}^2$.
2. Plot the pairs $(q_{c,p}((j - \frac{1}{2})/n), \; d_{(j)}^2)$, where $q_{c,p}((j - \frac{1}{2})/n)$ is the $100(j - \frac{1}{2})/n$ quantile of a chi-square distribution with $p$ degrees of freedom (d.f.).

If the data are multivariate normal, the chi-square plot should look like a straight line through the origin with slope 1. (See Example 4.13 for $p = 2$.)
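The two steps above can be sketched directly; this computes the plotting pairs for a hypothetical bivariate normal sample:

```python
import numpy as np
from scipy.stats import chi2

def chisquare_plot_points(X):
    """Return (chi-square quantiles, ordered squared distances) for a chi-square plot."""
    X = np.asarray(X, float)
    n, p = X.shape
    xbar = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X.T))                  # S with divisor n - 1
    diffs = X - xbar
    d2 = np.einsum('ij,jk,ik->i', diffs, S_inv, diffs)  # d_j^2 for each row
    probs = (np.arange(1, n + 1) - 0.5) / n
    return chi2.ppf(probs, df=p), np.sort(d2)

# Hypothetical bivariate normal sample: points should fall near the line y = x
rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0], [[2, 1], [1, 3]], size=100)
q, d2 = chisquare_plot_points(X)
```

Plotting `d2` against `q` gives the chi-square plot described above.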

Example:
Table 4.3 contains four measures of stiffness for n = 30 boards.

A chi-square plot for the lumber stiffness data.

Example: the following table contains bone lengths at four ages for $n = 20$ boys. Look at $x_1, \ldots, x_4$.

Values of $d_j^2$ for the Ramus bone data.

(Slightly different) chi-square plot.

Scatter plots for the Ramus bone data.

Multivariate outliers may affect the mean, variance, or correlation:

1. Small shift in means and variances but little effect on the correlation
2. Little effect on means and variances but reduces the correlation somewhat
3. Effect on means, variances, and correlation

Transformations to Near Normality

If normality is not a valid assumption, it is often possible to transform (re-express) the data so they are more "normal looking."

Power transformations: the Box-Cox family of transformations is

$$x^{(\lambda)} = \begin{cases} \dfrac{x^\lambda - 1}{\lambda}, & \lambda \ne 0 \\ \ln(x), & \lambda = 0 \end{cases} \qquad \text{for } x > 0$$

(Yeo and Johnson give a generalized Box-Cox transformation.)

Pretend $x_1^{(\lambda)}, x_2^{(\lambda)}, \ldots, x_n^{(\lambda)}$ are normal for some choice of $\lambda$. Choose $\lambda$ by maximizing the profile log-likelihood

$$\ell(\lambda) = -\frac{n}{2} \ln\!\left[\frac{1}{n} \sum_{j=1}^{n} \left(x_j^{(\lambda)} - \overline{x^{(\lambda)}}\right)^2\right] + (\lambda - 1) \sum_{j=1}^{n} \ln x_j$$

where $\overline{x^{(\lambda)}}$ is the mean of the transformed observations.

Notes:

1. There is no guarantee that the marginal distribution will be normal, but it is usually better than the original data.
2. The common transformations $\lambda = 0$ (logarithm) and $\lambda = \frac{1}{2}$ (square root) are included in the set of $\lambda$ possibilities.
3. Generally round $\lambda$ to a simple value, e.g. $\lambda = .21 \Rightarrow \lambda = .25$.
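SciPy implements this maximization; a sketch on hypothetical right-skewed data (`scipy.stats.boxcox` returns both the transformed data and the maximizing $\lambda$):

```python
import numpy as np
from scipy import stats

# Hypothetical right-skewed positive data (lognormal), a classic Box-Cox candidate
rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=0.5, size=100)

# scipy chooses lambda by maximizing the Box-Cox log-likelihood
x_transformed, lam_hat = stats.boxcox(x)
print(round(lam_hat, 2))   # typically near 0 for lognormal data, since log(x) is normal

# In practice, round lambda to a simple value, e.g. lam_hat = .21 -> use .25
```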

Example: radiation (door closed) data (see Example 4.16).

Find $\hat{\lambda} = .30 \approx .25$, so take $\hat{\lambda} = \frac{1}{4}$ and transform to $x^{1/4}$.

A Q-Q plot of the transformed radiation data (door closed) is then nearly straight. (Integers in the plot indicate the number of points occupying the same location.)

Example: bivariate data; door-open and door-closed radiation data (see Example 4.17, Figures 4.14 and 4.25).

For several, say $p$, variables, determining $\lambda_1, \lambda_2, \ldots, \lambda_p$ jointly is computationally tedious, and the results are often not much different from determining the $\hat{\lambda}_i$'s individually (i.e. from the marginal distributions of the $X_i$'s).

Inferences About a Mean Vector

Univariate case: $X_1, X_2, \ldots, X_n$ independent $N(\mu, \sigma^2)$, with $\sigma^2$ unknown.

$$H_0: \mu = \mu_0 \qquad H_1: \mu \ne \mu_0$$

The test statistic

$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} \; \text{is} \; t_{n-1}$$

Reject $H_0$ if $|t| > t_{n-1}(\alpha/2)$.

Alternatively, since $t_{n-1}^2 = F_{1, n-1}$, reject $H_0$ if

$$t^2 = \left(\frac{\bar{x} - \mu_0}{s/\sqrt{n}}\right)^2 = n(\bar{x} - \mu_0)\left(s^2\right)^{-1}(\bar{x} - \mu_0) > t_{n-1}^2(\alpha/2) = F_{1, n-1}(\alpha)$$

Also, do not reject $H_0: \mu = \mu_0$ if

$$|t| = \frac{|\bar{x} - \mu_0|}{s/\sqrt{n}} \le t_{n-1}(\alpha/2)$$

or, equivalently, if $\mu_0$ lies in

$$\bar{x} \pm t_{n-1}(\alpha/2)\, \frac{s}{\sqrt{n}} \qquad \text{(the } 100(1 - \alpha)\% \text{ confidence interval for } \mu\text{)}$$

Multivariate generalization: assume $X_1, X_2, \ldots, X_n$ independent $N_p(\mu, \Sigma)$, with $\Sigma$ unknown.

$$H_0: \mu = \mu_0 \qquad H_1: \mu \ne \mu_0$$

Hotelling's $T^2$ ($n - 1 > p$):

$$T^2 = (\bar{x} - \mu_0)' \left(\frac{S}{n}\right)^{-1} (\bar{x} - \mu_0) = n(\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0)$$

Under $H_0$,

$$T^2 \; \text{is distributed as} \; \frac{(n-1)p}{n-p} F_{p, n-p}$$

Reject $H_0$ at level $\alpha$ if

$$T^2 = n(\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0) > \frac{(n-1)p}{n-p} F_{p, n-p}(\alpha)$$

(Note: $t^2 = T^2$ if $p = 1$.)

Example: sweat data (see Example 5.2), $p = 3$, $n = 20$, $\alpha = .10$.

$$H_0: \mu = \begin{pmatrix} 4 \\ 50 \\ 10 \end{pmatrix} \qquad H_1: \mu \ne \begin{pmatrix} 4 \\ 50 \\ 10 \end{pmatrix}$$

$$\bar{x} = \begin{pmatrix} 4.64 \\ 45.4 \\ 9.965 \end{pmatrix}, \qquad T^2 = 9.74$$

With $F_{3,17}(.10) = 2.44$,

$$\frac{(19)3}{17} F_{3,17}(.10) = 3.353(2.44) = 8.18$$

Since $T^2 = 9.74 > 8.18$, reject $H_0$ at the 10% level.

$H_0$ will be rejected if one or more component means, or some linear combination of component means, differs too much from the hypothesized values $(4, 50, 10)'$.
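The test can be sketched as a generic one-sample Hotelling $T^2$ function. The sweat data themselves are not reproduced here, so the demo recomputes only the critical value from the example; the small data set at the end is hypothetical:

```python
import numpy as np
from scipy.stats import f

def hotelling_t2(X, mu0):
    """One-sample Hotelling T^2 = n (xbar - mu0)' S^{-1} (xbar - mu0)."""
    X = np.asarray(X, float)
    n = X.shape[0]
    diff = X.mean(axis=0) - np.asarray(mu0, float)
    return n * diff @ np.linalg.solve(np.cov(X.T), diff)

def t2_critical(n, p, alpha):
    """Critical value (n - 1) p / (n - p) * F_{p, n-p}(alpha)."""
    return (n - 1) * p / (n - p) * f.ppf(1 - alpha, p, n - p)

# Sweat-data setting from the example: p = 3, n = 20, alpha = .10
print(round(t2_critical(n=20, p=3, alpha=0.10), 2))   # about 8.18, as in the example

# Hypothetical demo of the statistic itself
X = np.array([[4.2, 48.0, 9.5], [5.1, 44.0, 10.2], [4.8, 46.0, 9.9],
              [3.9, 45.0, 10.1], [5.0, 43.0, 9.8]])
print(hotelling_t2(X, [4, 50, 10]))
```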

Other approaches to hypothesis testing: likelihood ratio, Wilks' lambda, Hotelling's $T^2$.

Assume $X_j$ is $N_p(\mu, \Sigma)$ with $\Sigma$ unknown, and $H_0: \mu = \mu_0$. Let

$$\hat{\Sigma} = S_n = \frac{1}{n} \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})', \qquad
\hat{\Sigma}_0 = \frac{1}{n} \sum_{j=1}^{n} (x_j - \mu_0)(x_j - \mu_0)'$$

Likelihood ratio:

$$\Lambda = \frac{\max_{\Sigma} L(\mu_0, \Sigma)}{\max_{\mu, \Sigma} L(\mu, \Sigma)} = \left(\frac{|\hat{\Sigma}|}{|\hat{\Sigma}_0|}\right)^{n/2}$$

Reject $H_0$ for $\Lambda$ small.

Wilks' lambda:

$$\Lambda^{2/n} = \frac{|\hat{\Sigma}|}{|\hat{\Sigma}_0|} = \left(1 + \frac{T^2}{n-1}\right)^{-1}$$

Tests based on $\Lambda$, $\Lambda^{2/n}$, and $T^2$ are equivalent.

Confidence regions and simultaneous comparisons of component means:

A $100(1 - \alpha)\%$ confidence region for $\mu$ is the set of all $\mu$ such that

$$n(\bar{x} - \mu)' S^{-1} (\bar{x} - \mu) \le \frac{(n-1)p}{n-p} F_{p, n-p}(\alpha) = c^2$$

Example: microwave radiation data (see Example 5.3), testing

$$H_0: \mu = \begin{pmatrix} .60 \\ .60 \end{pmatrix}$$

A 95% confidence region for $\mu$ based on the microwave-radiation data.

Relationship between the confidence region and tests of $H_0: \mu = \mu_0$: any $\mu_0$ in the $100(1 - \alpha)\%$ confidence region is consistent with the data (do not reject $H_0$) at the $\alpha$ level.

$100(1 - \alpha)\%$ simultaneous confidence statements: $X_j$ is $N_p(\mu, \Sigma)$, $\Sigma$ unknown. Consider

$$a'\mu = a_1 \mu_1 + a_2 \mu_2 + \cdots + a_p \mu_p$$

The $100(1 - \alpha)\%$ simultaneous intervals for all $a$ are given by

$$a'\bar{x} \pm c \sqrt{\frac{a' S a}{n}} \qquad \text{where} \qquad c^2 = \frac{(n-1)p}{n-p} F_{p, n-p}(\alpha)$$

Notes:

1. $a' = (1, 0, \ldots, 0)$ gives $a'\mu = \mu_1$; $a' = (1, -1, \ldots, 0)$ gives $a'\mu = \mu_1 - \mu_2$; and so forth.
2. For a single fixed $a$, the $100(1 - \alpha)\%$ confidence interval for $a'\mu$ is given by

$$a'\bar{x} \pm t_{n-1}(\alpha/2) \sqrt{\frac{a' S a}{n}}$$

Example: test scores data (see Example 5.5), $p = 3$, $n = 87$.

Bonferroni intervals: $m$ fixed linear combinations of interest, chosen before the data are collected:

$$a_1'\mu, \; a_2'\mu, \; \ldots, \; a_m'\mu$$

With individual confidence levels $1 - \alpha_i$, $i = 1, 2, \ldots, m$, and $\alpha_i = \alpha/m$, the intervals

$$a_i'\bar{x} \pm t_{n-1}\!\left(\frac{\alpha}{2m}\right) \sqrt{\frac{a_i' S a_i}{n}}, \qquad i = 1, 2, \ldots, m$$

have overall confidence level greater than or equal to $1 - \alpha$.

Examples:

1. Suppose $m = 6$ and $\alpha_i = .01$; then $\alpha_i/2 = .005$ and the critical value is $t_{n-1}(.005)$. Now $\alpha_i = \alpha/m \Rightarrow \alpha = m\alpha_i = 6(.01) = .06$, and the overall confidence is at least $1 - \alpha = 1 - .06 = .94$.
2. Alternatively, take $m = 6$ and set $1 - \alpha = .95$; then $\alpha = .05$ and $\alpha/2m = .05/12 \approx .0042$, so the critical value is $t_{n-1}(.0042)$.

(See Example 5.6.)

Large Samples: for large samples, the normality assumption can be relaxed.

Tests: $H_0: \mu = \mu_0$ vs $H_1: \mu \ne \mu_0$. Reject $H_0$ at (approximate) level $\alpha$ if

$$n(\bar{x} - \mu_0)' S^{-1} (\bar{x} - \mu_0) > \chi_p^2(\alpha)$$

Simultaneous intervals: with $100(1 - \alpha)\%$ confidence, $a'\mu$ lies in

$$a'\bar{x} \pm \sqrt{\chi_p^2(\alpha)} \sqrt{\frac{a' S a}{n}} \qquad \text{for all } a$$

Note: the choice $a' = (0, \ldots, 0, 1, 0, \ldots, 0)$ with 1 in the $i$th position gives $\mu_i$ in

$$\bar{x}_i \pm \sqrt{\chi_p^2(\alpha)} \sqrt{\frac{s_{ii}}{n}}, \qquad i = 1, 2, \ldots, p$$

Example: music data (see Example 5.7), $p = 7$, $n = 96$.

Comparison of Several Multivariate Population Means

Two population means $\mu_1$ and $\mu_2$: what can we say about $\mu_1 - \mu_2$?

Paired samples:

1. "Before" and "after" measurements on the same experimental unit
2. Two treatments on the same (or similar) experimental unit

For the $j$th unit, $j = 1, 2, \ldots, n$:

$$X_{1j1} = \text{variable 1 under treatment 1}, \; \ldots, \; X_{1jp} = \text{variable } p \text{ under treatment 1}$$

$$X_{2j1} = \text{variable 1 under treatment 2}, \; \ldots, \; X_{2jp} = \text{variable } p \text{ under treatment 2}$$

Form the differences

$$D_{j1} = X_{1j1} - X_{2j1}, \; \ldots, \; D_{jp} = X_{1jp} - X_{2jp}$$

Let $D_j' = (D_{j1}, D_{j2}, \ldots, D_{jp})$ and assume $D_1, D_2, \ldots, D_n$ are independent $N_p(\delta, \Sigma_d)$, where $E(D_j) = \delta = \mu_1 - \mu_2$.

Inferences about $\delta = \mu_1 - \mu_2$ are based on the $T^2$ statistic

$$T^2 = n(\bar{D} - \delta)' S_d^{-1} (\bar{D} - \delta), \qquad \text{distributed as} \; \frac{(n-1)p}{n-p} F_{p, n-p}$$

Here $\bar{D}$ is the sample mean vector and $S_d$ is the sample covariance matrix of the differences. (For $n$ and $n - p$ large, $T^2$ is approximately distributed as $\chi_p^2$.)
H 0 :   0 (1   2 ) H1 :   0
H0

Tests: Reject

at level  if
 T 2  nd ' S d 1 d 

(n  1) p F p , n  p ( ) n p

100(1)% Confidence Ellipsoid for : All  such that
 n(d   )' S d 1 (d   ) 

(n  1) p F p , n  p ( ) n p

(Relate to hypothesis test)

100(1)% Simultaneous Confidence Intervals for Individual Mean Differences  i :
di  c
2 s di

n

where c 2 

(n  1) p F p,n p n p

2 Here d i is the ith element of d and s d i is the ith diagonal of S d

Note: For n  p large (p fixed, n  ),
(n  1) p 2 F p , n  p ( )   p ( ) n p

and assumption of normality can be relaxed.

Example: effluent data (see Example 6.1 and Table 6.1), $p = 2$, $n = 11$. Paired observations (samples split between two labs), with $X_1 = \text{BOD}$ and $X_2 = \text{SS}$.

Differences $d_{j1} = x_{1j1} - x_{2j1}$, $d_{j2} = x_{1j2} - x_{2j2}$:

| $j$      | 1   | 2   | 3   | 4   | 5  | 6   | 7   | 8  | 9  | 10 | 11  |
|----------|-----|-----|-----|-----|----|-----|-----|----|----|----|-----|
| $d_{j1}$ | -19 | -22 | -18 | -27 | -4 | -10 | -14 | 17 | 9  | 4  | -19 |
| $d_{j2}$ | 12  | 10  | 42  | 15  | -1 | 11  | -4  | 60 | -2 | 10 | -7  |

Test $H_0: \delta = 0$ (i.e. $\mu_1 = \mu_2$) vs $H_1: \delta \ne 0$ with $\alpha = .05$:

$$T^2 = 13.6, \qquad \frac{(11-1)2}{11-2} F_{2,9}(.05) = 9.47$$

Since $T^2 = 13.6 > 9.47$, reject $H_0$.

Repeated Measures Design for Comparing Treatments:

The same experimental unit receives each treatment, with a single response variable. Treatments are administered over time in random order. For the $j$th unit,

$$X_j = \begin{pmatrix} X_{j1} \\ X_{j2} \\ \vdots \\ X_{jq} \end{pmatrix}
\qquad \begin{matrix} \leftarrow \text{response to treatment 1} \\ \leftarrow \text{response to treatment 2} \\ \vdots \\ \leftarrow \text{response to treatment } q \end{matrix}$$

Responses on the $j$th unit are correlated. Treatments are compared by considering differences such as

$$X_{j1} - X_{j2} \qquad \text{with mean } \mu_1 - \mu_2$$

or, more generally, contrasts

$$\sum_k c_k X_{jk} \qquad \text{with mean } \sum_k c_k \mu_k, \quad \text{where } \sum_k c_k = 0$$

Examples:

1. $q = 3$, with means $\mu_1, \mu_2, \mu_3$. Contrast: $(1)\mu_1 + (-1)\mu_2 + (0)\mu_3 = \mu_1 - \mu_2$.

2. Several contrasts ($q = 4$):

$$\begin{pmatrix} X_{j1} - X_{j2} \\ X_{j2} - X_{j3} \\ 2X_{j1} - 2X_{j2} + X_{j3} - X_{j4} \end{pmatrix}
= \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 2 & -2 & 1 & -1 \end{pmatrix}
\begin{pmatrix} X_{j1} \\ X_{j2} \\ X_{j3} \\ X_{j4} \end{pmatrix} = C X_j$$

has mean vector

$$C\mu = \begin{pmatrix} \mu_1 - \mu_2 \\ \mu_2 - \mu_3 \\ 2\mu_1 - 2\mu_2 + \mu_3 - \mu_4 \end{pmatrix}$$

$C$ is called a contrast matrix.

3. Another contrast matrix:

$$C\mu = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 \\ 1 & 0 & -1 & \cdots & 0 \\ \vdots & & & & \vdots \\ 1 & 0 & 0 & \cdots & -1 \end{pmatrix}
\begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_q \end{pmatrix}
= \begin{pmatrix} \mu_1 - \mu_2 \\ \mu_1 - \mu_3 \\ \vdots \\ \mu_1 - \mu_q \end{pmatrix}$$

No difference in treatments becomes $C\mu = 0$ for any choice of the contrast matrix $C$.

Assume $X_j$ is $N_q(\mu, \Sigma)$, with data $x_1, x_2, \ldots, x_n$. The contrasts $Cx_j$ in the observations have sample mean $C\bar{x}$ and sample covariance $CSC'$. Test $C\mu = 0$ using a $T^2$ statistic:

$$H_0: C\mu = 0 \; \text{(no difference in treatments)} \qquad H_1: C\mu \ne 0 \; \text{(treatment differences exist)}$$

At level $\alpha$, reject $H_0$ if

$$T^2 = n(C\bar{x})'(CSC')^{-1}(C\bar{x}) > \frac{(n-1)(q-1)}{n-q+1} F_{q-1, n-q+1}(\alpha)$$

Joint confidence region for $C\mu$: all $C\mu$ such that

$$n(C\bar{x} - C\mu)'(CSC')^{-1}(C\bar{x} - C\mu) \le \frac{(n-1)(q-1)}{n-q+1} F_{q-1, n-q+1}(\alpha)$$

Simultaneous confidence intervals for all single contrasts $c'\mu$, with confidence coefficient $1 - \alpha$:

$$c'\mu: \quad c'\bar{x} \pm \sqrt{\frac{(n-1)(q-1)}{n-q+1} F_{q-1, n-q+1}(\alpha)} \sqrt{\frac{c' S c}{n}}$$

for all $c' = (c_1, c_2, \ldots, c_q)$ with $\sum_k c_k = 0$.

Example: sleeping dog data (see Example 6.2 and Table 6.2).

Design: halothane (present/absent) crossed with CO2 pressure (low/high).

- Treatment 1 = high CO2 without halothane
- Treatment 2 = low CO2 without halothane
- Treatment 3 = high CO2 with halothane
- Treatment 4 = low CO2 with halothane

Contrasts:

- Halothane contrast: $(\mu_3 + \mu_4) - (\mu_1 + \mu_2)$
- CO2 contrast: $(\mu_1 + \mu_3) - (\mu_2 + \mu_4)$
- Interaction: $(\mu_1 + \mu_4) - (\mu_2 + \mu_3)$

Here

$$C = \begin{pmatrix} -1 & -1 & 1 & 1 \\ 1 & -1 & 1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix}$$

Test $H_0: C\mu = 0$ vs $H_1: C\mu \ne 0$ with $q = 4$, $n = 19$, $\alpha = .05$, and $F_{3,16}(.05) = 3.24$. Since

$$T^2 = 116 > \frac{18(3)}{16}(3.24) = 10.94$$

reject $H_0$ and conclude there is a treatment effect.

Simultaneous 95% intervals:

- Halothane contrast $(\mu_3 + \mu_4) - (\mu_1 + \mu_2)$: $209.3 \pm 73.7$
- CO2 contrast $(\mu_1 + \mu_3) - (\mu_2 + \mu_4)$: $-60.1 \pm 54.7$
- Interaction $(\mu_1 + \mu_4) - (\mu_2 + \mu_3)$: $-12.8 \pm 66.0$

Conclusion: no interaction, a halothane effect, and a CO2 effect. The presence of halothane produces longer times between heartbeats; low CO2 pressure produces longer times between heartbeats.

Note: here the trials with halothane must necessarily follow those without halothane, so the trials were not assigned in random order.

Comparing Mean Vectors from Two Populations:

Independent random samples from the two populations:

$$X_{11}, X_{12}, \ldots, X_{1n_1} \; \text{from } N_p(\mu_1, \Sigma_1), \qquad X_{21}, X_{22}, \ldots, X_{2n_2} \; \text{from } N_p(\mu_2, \Sigma_2)$$

Data: $x_{11}, x_{12}, \ldots, x_{1n_1}$ leading to $\bar{x}_1, S_1$, and $x_{21}, x_{22}, \ldots, x_{2n_2}$ leading to $\bar{x}_2, S_2$.

Consider $\delta = \mu_1 - \mu_2$.

Case 1. Normal populations, $\Sigma_1 = \Sigma_2 = \Sigma$ ($n_1$, $n_2$ small). Form

$$S_{\text{pooled}} = \frac{(n_1 - 1) S_1 + (n_2 - 1) S_2}{n_1 + n_2 - 2}$$

Test $H_0: \mu_1 - \mu_2 = \delta_0$ vs $H_1: \mu_1 - \mu_2 \ne \delta_0$ (e.g. $\delta_0 = 0$). Set

$$c^2 = \frac{(n_1 + n_2 - 2)p}{n_1 + n_2 - p - 1} F_{p, \, n_1 + n_2 - p - 1}(\alpha)$$

Reject $H_0$ at level $\alpha$ if

$$T^2 = (\bar{x}_1 - \bar{x}_2 - \delta_0)' \left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right) S_{\text{pooled}}\right]^{-1} (\bar{x}_1 - \bar{x}_2 - \delta_0) > c^2$$

$100(1 - \alpha)\%$ confidence region for $\delta = \mu_1 - \mu_2$: all $\delta$ such that

$$(\bar{x}_1 - \bar{x}_2 - \delta)' \left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right) S_{\text{pooled}}\right]^{-1} (\bar{x}_1 - \bar{x}_2 - \delta) \le c^2$$

$100(1 - \alpha)\%$ simultaneous confidence intervals for all $a'(\mu_1 - \mu_2)$:

$$a'(\mu_1 - \mu_2): \quad a'(\bar{x}_1 - \bar{x}_2) \pm c \sqrt{a' \left(\frac{1}{n_1} + \frac{1}{n_2}\right) S_{\text{pooled}} \, a}$$

Note: the choice $a' = (0, \ldots, 0, 1, 0, \ldots, 0)$ with 1 in the $i$th position gives

$$\mu_{1i} - \mu_{2i}: \quad (\bar{x}_{1i} - \bar{x}_{2i}) \pm c \sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right) s_{ii, \text{pooled}}}, \qquad i = 1, 2, \ldots, p$$

One can also construct Bonferroni intervals for a moderate number of fixed linear combinations.

Example: soap data (see Example 6.3), $p = 2$, $n_1 = n_2 = 50$.

$$\bar{x}_1' = (8.3, \; 4.1), \qquad \bar{x}_2' = (10.2, \; 3.9), \qquad
S_{\text{pooled}} = \begin{pmatrix} 2 & 1 \\ 1 & 5 \end{pmatrix}, \qquad
\bar{x}_1 - \bar{x}_2 = \begin{pmatrix} -1.9 \\ 0.2 \end{pmatrix}$$

95% confidence ellipse for $\delta = \mu_1 - \mu_2$: is $\delta = \mu_1 - \mu_2 = 0$ plausible?

95% simultaneous intervals (with $c^2 = 6.26$):

$$a' = (1, 0): \quad \mu_{11} - \mu_{21}: \; -1.9 \pm \sqrt{6.26}\sqrt{\left(\tfrac{1}{50} + \tfrac{1}{50}\right) 2} = -1.9 \pm .71 = (-2.61, \; -1.19)$$

$$a' = (0, 1): \quad \mu_{12} - \mu_{22}: \; 0.2 \pm \sqrt{6.26}\sqrt{\left(\tfrac{1}{50} + \tfrac{1}{50}\right) 5} = 0.2 \pm 1.12 = (-.92, \; 1.32)$$

Remark: for testing $H_0: \delta = \mu_1 - \mu_2 = 0$, the linear combination $a'(\bar{x}_1 - \bar{x}_2)$ with $a = S_{\text{pooled}}^{-1}(\bar{x}_1 - \bar{x}_2)$ quantifies the largest population difference. That is, if $T^2$ rejects $H_0$, then $a'(\bar{x}_1 - \bar{x}_2)$ is likely to have a non-zero mean. One often tries to interpret this linear combination for its subject-matter importance.
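The soap-example computations follow directly from the summary statistics; a NumPy/SciPy sketch:

```python
import numpy as np
from scipy.stats import f

# Summary statistics from the soap example (p = 2, n1 = n2 = 50)
n1 = n2 = 50
p = 2
xbar_diff = np.array([-1.9, 0.2])
S_pooled = np.array([[2.0, 1.0],
                     [1.0, 5.0]])

# T^2 statistic for H0: mu1 - mu2 = 0
V = (1 / n1 + 1 / n2) * S_pooled
T2 = xbar_diff @ np.linalg.solve(V, xbar_diff)

# Critical value c^2 and simultaneous interval half-widths
c2 = (n1 + n2 - 2) * p / (n1 + n2 - p - 1) * f.ppf(0.95, p, n1 + n2 - p - 1)
half_widths = np.sqrt(c2) * np.sqrt(np.diag(V))

print(T2 > c2)                    # True: reject H0
print(np.round(half_widths, 2))   # matches the .71 and 1.12 half-widths above
```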

Example: electrical consumption data (see Example 6.4).

Case 2. $\Sigma_1 \ne \Sigma_2$, with $n_1 - p$ and $n_2 - p$ both large.

Test $H_0: \mu_1 - \mu_2 = \delta_0 = 0$ vs $H_1: \mu_1 - \mu_2 \ne 0$. Reject $H_0$ at level $\alpha$ if

$$T^2 = (\bar{x}_1 - \bar{x}_2)' \left[\frac{1}{n_1} S_1 + \frac{1}{n_2} S_2\right]^{-1} (\bar{x}_1 - \bar{x}_2) > \chi_p^2(\alpha)$$

$100(1 - \alpha)\%$ confidence region for $\delta = \mu_1 - \mu_2$: all $\delta$ such that

$$(\bar{x}_1 - \bar{x}_2 - \delta)' \left[\frac{1}{n_1} S_1 + \frac{1}{n_2} S_2\right]^{-1} (\bar{x}_1 - \bar{x}_2 - \delta) \le \chi_p^2(\alpha)$$

$100(1 - \alpha)\%$ simultaneous confidence intervals for all $a'(\mu_1 - \mu_2)$:

$$a'(\mu_1 - \mu_2): \quad a'(\bar{x}_1 - \bar{x}_2) \pm \sqrt{\chi_p^2(\alpha)} \sqrt{a' \left(\frac{1}{n_1} S_1 + \frac{1}{n_2} S_2\right) a}$$

Remark: for $n_1 = n_2 = n$,

$$\frac{1}{n_1} S_1 + \frac{1}{n_2} S_2 = \frac{1}{n}(S_1 + S_2) = \left(\frac{S_1 + S_2}{2}\right)\left(\frac{1}{n} + \frac{1}{n}\right) = S_{\text{pooled}}\left(\frac{1}{n} + \frac{1}{n}\right)$$

and the large-sample procedures are the same as those based on the pooled covariance matrix.

Example:
Electrical consumption data (See Example 6.5)

Multivariate Analysis of Variance (One-Way MANOVA):

Comparing $g \ge 2$ population mean vectors:

- Population 1: $X_{11}, X_{12}, \ldots, X_{1n_1}$
- Population 2: $X_{21}, X_{22}, \ldots, X_{2n_2}$
- ...
- Population $g$: $X_{g1}, X_{g2}, \ldots, X_{gn_g}$

Note: if the $n_l$ are small, also assume $N_p(\mu_l, \Sigma)$ populations.

Review of univariate ANOVA ($g$ populations):

Model:

$$X_{lj} = \mu + \tau_l + e_{lj}, \qquad e_{lj} \; \text{independent } N(0, \sigma^2), \quad \text{or} \quad X_{lj} \; \text{independent } N(\mu_l, \sigma^2)$$

with $\mu_l = \mu + \tau_l$. One choice of constraint:

$$\sum_{l=1}^{g} n_l \tau_l = 0$$

Observations:

$$x_{11}, x_{12}, \ldots, x_{1n_1} \; \text{from } N(\mu_1, \sigma^2); \qquad x_{21}, x_{22}, \ldots, x_{2n_2} \; \text{from } N(\mu_2, \sigma^2); \qquad \ldots; \qquad x_{g1}, x_{g2}, \ldots, x_{gn_g} \; \text{from } N(\mu_g, \sigma^2)$$

Are the population means the same ($\mu_1 = \mu_2 = \cdots = \mu_g$) or, equivalently, are there no treatment effects ($\tau_1 = \tau_2 = \cdots = \tau_g = 0$)?

Decompose each observation as

$$x_{lj} = \bar{x} + (\bar{x}_l - \bar{x}) + (x_{lj} - \bar{x}_l)$$

$$\text{observation} = \text{overall sample mean} + \text{estimated treatment effect} + \text{residual}$$

Example: hypothetical data (see Example 6.6), $g = 3$, $n_1 = 3$, $n_2 = 2$, $n_3 = 3$. The observations are 9, 6, 9 (population 1); 0, 2 (population 2); and 3, 1, 2 (population 3), so $\bar{x} = 4$, $\bar{x}_1 = 8$, $\bar{x}_2 = 1$, $\bar{x}_3 = 2$:

$$\begin{pmatrix} 9 & 6 & 9 \\ 0 & 2 & \\ 3 & 1 & 2 \end{pmatrix}
= \begin{pmatrix} 4 & 4 & 4 \\ 4 & 4 & \\ 4 & 4 & 4 \end{pmatrix}
+ \begin{pmatrix} 4 & 4 & 4 \\ -3 & -3 & \\ -2 & -2 & -2 \end{pmatrix}
+ \begin{pmatrix} 1 & -2 & 1 \\ -1 & 1 & \\ 1 & -1 & 0 \end{pmatrix}$$

$$\text{SS}_{\text{obs}} = 9^2 + 6^2 + \cdots + 2^2 = 216 = \sum_l \sum_j x_{lj}^2$$

$$\text{SS}_{\text{mean}} = 4^2 + 4^2 + \cdots + 4^2 = 128 = (n_1 + n_2 + \cdots + n_g)\bar{x}^2$$

$$\text{SS}_{\text{tr}} = 4^2 + 4^2 + \cdots + (-2)^2 = 78 = \sum_l n_l (\bar{x}_l - \bar{x})^2$$

$$\text{SS}_{\text{res}} = 1^2 + (-2)^2 + \cdots + 0^2 = 10 = \sum_l \sum_j (x_{lj} - \bar{x}_l)^2$$

$$\text{Total (corrected) SS} = \text{SS}_{\text{obs}} - \text{SS}_{\text{mean}} = 216 - 128 = 88 = \sum_l \sum_j (x_{lj} - \bar{x})^2$$

Univariate ANOVA Table:

| Source       | Sum of Squares (SS)                                             | d.f.             | Mean Square (MS)                                                     |
|--------------|-----------------------------------------------------------------|------------------|----------------------------------------------------------------------|
| Treatments   | $\text{SS}_{\text{tr}} = \sum_l n_l (\bar{x}_l - \bar{x})^2$    | $g - 1$          | $\text{MS}_{\text{tr}} = \text{SS}_{\text{tr}}/(g - 1)$              |
| Residual     | $\text{SS}_{\text{res}} = \sum_l \sum_j (x_{lj} - \bar{x}_l)^2$ | $\sum_l n_l - g$ | $\text{MS}_{\text{res}} = \text{SS}_{\text{res}}/(\sum_l n_l - g)$   |
| Total (corr) | $\sum_l \sum_j (x_{lj} - \bar{x})^2$                            | $\sum_l n_l - 1$ |                                                                      |

Test: reject $H_0: \tau_1 = \tau_2 = \cdots = \tau_g = 0$ (i.e. $\mu_1 = \mu_2 = \cdots = \mu_g$) at level $\alpha$ if

$$F = \frac{\text{MS}_{\text{tr}}}{\text{MS}_{\text{res}}} = \frac{\text{SS}_{\text{tr}}/(g - 1)}{\text{SS}_{\text{res}}/(\sum_l n_l - g)} > F_{g-1, \, \sum n_l - g}(\alpha)$$

Example: hypothetical data again (see Example 6.7). ANOVA table:

| Source       | SS | d.f. | MS | F                 |
|--------------|----|------|----|-------------------|
| Treatments   | 78 | 2    | 39 | $F = 39/2 = 19.5$ |
| Residual     | 10 | 5    | 2  |                   |
| Total (corr) | 88 | 7    |    |                   |

Test $H_0: \tau_1 = \tau_2 = \tau_3 = 0$ vs $H_1:$ at least one $\tau_l \ne 0$, with $g - 1 = 2$, $\sum_l n_l - g = 8 - 3 = 5$, and $\alpha = .01$: $F_{2,5}(.01) = 13.27$. Since $F = 19.5 > 13.27$, reject $H_0$ at the 1% level.

Since $g$ and the $n_l$ are fixed, the $F$ test is equivalent to rejecting $H_0$ for large values of $\text{SS}_{\text{tr}}/\text{SS}_{\text{res}}$, or for large values of $1 + \text{SS}_{\text{tr}}/\text{SS}_{\text{res}}$. The multivariate generalization rejects $H_0$ for small values of the reciprocal

$$\frac{1}{1 + \text{SS}_{\text{tr}}/\text{SS}_{\text{res}}} = \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{res}} + \text{SS}_{\text{tr}}}$$

Multivariate Case:

Independent random samples $X_{l1}, X_{l2}, \ldots, X_{ln_l}$ from $N_p(\mu_l, \Sigma)$, $l = 1, 2, \ldots, g$.

Model:

$$X_{lj} = \mu + \tau_l + e_{lj}, \qquad e_{lj} \sim N_p(0, \Sigma), \qquad \text{with constraint } \sum_l n_l \tau_l = 0$$

Note: each variable follows the univariate ANOVA model; the variables are correlated on the same unit.

Data decomposition:

$$x_{lj} = \bar{x} + (\bar{x}_l - \bar{x}) + (x_{lj} - \bar{x}_l)$$

This leads to the MANOVA table.

Example: hypothetical data (see Example 6.8). MANOVA table:

| Source       | Sum of Squares & Cross-products (SSCP)               | d.f.                |
|--------------|------------------------------------------------------|---------------------|
| Treatments   | $\begin{pmatrix} 78 & -12 \\ -12 & 48 \end{pmatrix}$ | $2 = 3 - 1$         |
| Residual     | $\begin{pmatrix} 10 & 1 \\ 1 & 24 \end{pmatrix}$     | $5 = 3 + 2 + 3 - 3$ |
| Total (corr) | $\begin{pmatrix} 88 & -11 \\ -11 & 72 \end{pmatrix}$ | $7 = 3 + 2 + 3 - 1$ |

General MANOVA Table—

Source                 Sum of Squares & Cross-products (SSCP)          Degrees of Freedom (d.f.)
Treatments (Between)   B = Σ_l n_l (x̄_l − x̄)(x̄_l − x̄)'              g − 1
Residual (Within)      W = Σ_l Σ_j (x_lj − x̄_l)(x_lj − x̄_l)'         Σ_l n_l − g
Total (Corr)           B + W = Σ_l Σ_j (x_lj − x̄)(x_lj − x̄)'         Σ_l n_l − 1

Reject H0: τ1 = τ2 = … = τg = 0 (equal means or no treatment effect) if Wilks' lambda

Λ* = |W| / |B + W| ≤ c

See Table 6.3 for the exact distribution of Wilks' lambda in certain special cases.

Notes:  If all x̄_l are equal, then B = 0 and Λ* = |W| / |W| = 1.
        In general, 0 ≤ Λ* ≤ 1.

Likelihood Ratio Test (Bartlett approximation)

Σ_l n_l = n, with n − p large.

Reject H0: all τ_l = 0 at (approximate) level α if

−(n − 1 − (p + g)/2) ln( |W| / |B + W| ) > χ²_{p(g−1)}(α)

Example:
Nursing home data (See Example 6.9)

Ownership: Private Nonprofit Government

Cost variables: X1 = cost of nursing labor X2 = cost of dietary labor X3 = cost of maintenance labor X4 = cost of housekeeping labor Sample sizes: n1 = 271 n2 = 138 n3 = 107

Sample mean vectors:
2.066  .480   x1    .082     .360  2.167  .596   x2    .124     .418  2.273  .521   x3    .125     .383 

Within and Between matrices:
W  (n1  1) S 1  (n 2  1) S 2  (n 3  1) S 3 182.962   4.408 8.200     1.695  .633 1.484    9.581 2.428 .394 6.538

B  n1 ( x1  x )(x1  x )' n 2 ( x 2  x )(x 2  x )' n 3 ( x 3  x )(x 3  x )' 3.475  1.111 1.225     .821 .453 .235    .584 .610 .230 .304 

Now |W| / |B + W| = .7714. Since

−(n − 1 − (p + g)/2) ln( |W| / |B + W| ) = 132.76 > χ²_8(.01) = 20.09

reject H0: there is a difference in (mean) labor costs among the types of nursing homes.

Exact test—

For g = 3,  ((n − p − 2)/p) · (1 − √Λ*)/√Λ*  is distributed as  F_{2p, 2(n−p−2)}

((516 − 4 − 2)/4) · (1 − √.7714)/√.7714 = 17.67

α = .01:  F_{2(4), 2(510)}(.01) ≈ χ²_8(.01)/8 = 2.51

Since 17.67 > 2.51, reject H0. (Result consistent with large sample test)
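Both statistics can be reproduced from only the quantities quoted above (Λ* = .7714, n = 516, p = 4, g = 3); a quick numerical check:

```python
import math

# Large-sample (Bartlett) and exact test statistics for the nursing-home data.
lam, n, p, g = 0.7714, 516, 4, 3

bartlett = -(n - 1 - (p + g) / 2) * math.log(lam)                    # ~ 132.76
exact_F = ((n - p - 2) / p) * (1 - math.sqrt(lam)) / math.sqrt(lam)  # ~ 17.67
print(round(bartlett, 2), round(exact_F, 2))
```

Both exceed their critical values (20.09 and 2.51), so the two tests agree.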

Which costs differ for which type of home? ______________

Bonferroni intervals for component treatment effect differences—

Consider τ_k − τ_l (or μ_k − μ_l), with τ_ki the ith component of τ_k.

p variables, g(g − 1)/2 pair-wise differences. Let n = Σ_l n_l; then each two-sample t-interval employs the critical value t_{n−g}(α/2m) where m = pg(g − 1)/2.

Result: With confidence at least 1 − α,

τ_ki − τ_li  belongs to  x̄_ki − x̄_li ± t_{n−g}( α / (pg(g − 1)) ) √( (w_ii/(n − g)) (1/n_k + 1/n_l) )

for all components i = 1, 2, …, p and all differences l < k = 2, …, g. Here w_ii is the ith diagonal element of W.

Example:
Nursing home data (See Example 6.10)
g  3, p  4, n  516 pg ( g  1) / 2  4(3)2 / 2  12

To get 95% intervals, need the
 .05 / 2  t 5163    t 513 (. 002 )  z (. 002 )  2.87  12 

point

Compare cost of maintenance labor (3rd variable) privategovernment:
ˆ13  ˆ33  ( x13  x 3 )  ( x 33  x 3 )  x13  x 33  .082  .125  .043
 13   33 :  .043  2.87
1.484  1 1      .043  .018 513  271 107 

or

(.061, .025)

non-profitgovernment:
 23   33 :
(.021 , .019 )

privatenon-profit:
 13   23 :
(.058 ,  .026 )

Interpret

Remarks:

1. Wilks' lambda can be expressed in terms of the eigenvalues λ̂1, λ̂2, …, λ̂s of W⁻¹B:

   Λ* = Π_{i=1}^{s} 1/(1 + λ̂_i),   s = min(p, g − 1)

2. For g = 2 groups (populations),

   T² = (n1 + n2 − 2) (1 − Λ*)/Λ*

   so the statistic T² for testing the equality of two population mean vectors, μ1 and μ2, using independent samples, can be obtained from the Wilks' lambda statistic in a MANOVA program.

3. Other tests of H0: τ1 = τ2 = … = τg = 0 (equal population means) are available. None of these tests is uniformly most powerful.

1 ,  2 ,  ,  s eigenvalues of W 1 B

  1 Roy’s largest root: 1  1

i Pillai’s statistic: V   1   i 1

s



i

Hotelling-Lawley statistic: U   i
i 1

s

(Except for certain cases, need special tables) 4. Tests of  1   2     g are sensitive to the normality assumption. Reduce the effects of inequality of covariance matrices if arrange for equal sample sizes. 5. Check data for normality by examining the ˆ residuals elj  x lj  x l , l  1,2,  , g . Transformations to near normality may help.

Example:
Rootstock data

g 6 p4 n1  n 2    n 6  8 n  48

B

W

B W 

X lj     l  e lj

l  1,2,  ,6 ; j  1,2,  ,8

Model:


l

l

0

l    l

1 Eigenvalues of W B : 1.876, .791, .229, .026

 Wilks’ lambda

Λ* = |W| / |B + W| = Π_{i=1}^{4} 1/(1 + λ̂_i) = .154

α = .05,   H0: τ1 = τ2 = … = τ6 = 0,   n − p = 48 − 4 = 44 large

−(48 − 1 − (4 + 6)/2) ln(.154) = 78.6 > χ²_{4(5)}(.05) = 31.4

Reject H0

 Roy’s largest root

θ = λ̂1/(1 + λ̂1) = 1.876/2.876 = .652

θ(.05) = .377 (special table)

Since θ = .652 > .377,

Reject H0

 Pillai’s statistic

V = Σ_{i=1}^{4} λ̂_i/(1 + λ̂_i) = 1.305

V(.05) = .645 (special table)

Since V = 1.305 > .645,

Reject H0

 Hotelling-Lawley statistic

U = Σ_{i=1}^{4} λ̂_i = 2.921

U(.05) = .952 (special table)

Since U = 2.921 > .952,

Reject H0
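All four statistics follow from the quoted eigenvalues of W⁻¹B; a quick numerical check (small rounding differences versus the quoted values come from the rounded eigenvalues):

```python
import math

# The four MANOVA test statistics from the eigenvalues of W^{-1}B
# for the rootstock data.
eig = [1.876, 0.791, 0.229, 0.026]

wilks = math.prod(1 / (1 + l) for l in eig)   # ~ .154
roy = eig[0] / (1 + eig[0])                   # ~ .652
pillai = sum(l / (1 + l) for l in eig)        # ~ 1.305
hotelling = sum(eig)                          # ~ 2.921
print(round(wilks, 3), round(roy, 3), round(pillai, 3), round(hotelling, 3))
```

All four lead to the same conclusion here; they need not agree in general.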

Univariate versus multivariate tests?

Example:
Hypothetical data (See Example 6.14)

Example:
Lizard data (See Example 6.15)

Two-way MANOVA—

Multivariate measurements recorded at various levels of two factors Factor 1: g levels Factor 2: b levels Balanced design: n replications at each of gb combinations of levels

Review of univariate model and ANOVA table

Model:  X_lkr = μ + τ_l + β_k + γ_lk + e_lkr,   l = 1, 2, …, g;  k = 1, 2, …, b;  r = 1, 2, …, n

e_lkr indep N(0, σ²)

Σ_l τ_l = Σ_k β_k = Σ_l γ_lk = Σ_k γ_lk = 0

E(X_lkr) = μ + τ_l + β_k + γ_lk

Mean response = Overall level + Factor 1 effect + Factor 2 effect + Interaction

Presence of interaction means factor effects are not additive—complicates interpretations

Each observation can be decomposed as

x_lkr = x̄ + (x̄_l· − x̄) + (x̄·k − x̄) + (x̄_lk − x̄_l· − x̄·k + x̄) + (x_lkr − x̄_lk)

Sum of squares:       SS_cor = SS_fac1 + SS_fac2 + SS_int + SS_res
Degrees of freedom:   gbn − 1 = (g − 1) + (b − 1) + (g − 1)(b − 1) + gb(n − 1)

F ratios:

F = [SS_fac1/(g − 1)] / [SS_res/(gb(n − 1))]           tests H0: all τ_l = 0

F = [SS_fac2/(b − 1)] / [SS_res/(gb(n − 1))]           tests H0: all β_k = 0

F = [SS_int/((g − 1)(b − 1))] / [SS_res/(gb(n − 1))]   tests H0: all γ_lk = 0
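The sum-of-squares identity above can be verified numerically; a sketch on arbitrary toy data (not from the text), using a balanced g × b × n array:

```python
import numpy as np

# Check SS_cor = SS_fac1 + SS_fac2 + SS_int + SS_res on a balanced layout.
rng = np.random.default_rng(0)
g, b, n = 3, 4, 5
x = rng.normal(size=(g, b, n))

xbar = x.mean()
xl = x.mean(axis=(1, 2))    # factor-1 level means
xk = x.mean(axis=(0, 2))    # factor-2 level means
xlk = x.mean(axis=2)        # cell means

ss_fac1 = b * n * ((xl - xbar) ** 2).sum()
ss_fac2 = g * n * ((xk - xbar) ** 2).sum()
ss_int = n * ((xlk - xl[:, None] - xk[None, :] + xbar) ** 2).sum()
ss_res = ((x - xlk[:, :, None]) ** 2).sum()
ss_cor = ((x - xbar) ** 2).sum()

assert np.isclose(ss_cor, ss_fac1 + ss_fac2 + ss_int + ss_res)
```

The decomposition is exact for balanced designs; it fails in general for unbalanced ones.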

Multivariate model and MANOVA table

X_lkr = μ + τ_l + β_k + γ_lk + e_lkr,   l = 1, 2, …, g;  k = 1, 2, …, b;  r = 1, 2, …, n

e_lkr indep N_p(0, Σ)

Σ_l τ_l = Σ_k β_k = Σ_l γ_lk = Σ_k γ_lk = 0

Observation vector can be decomposed as

x_lkr = x̄ + (x̄_l· − x̄) + (x̄·k − x̄) + (x̄_lk − x̄_l· − x̄·k + x̄) + (x_lkr − x̄_lk)

Sum of squares and degrees of freedom breakup similar to univariate case with vectors replacing individual observations and matrices of sum of squares and crossproducts replacing scalar sum of squares

A likelihood ratio test of H0: γ11 = γ12 = … = γgb = 0 (no interaction) uses Λ* = |SSP_res| / |SSP_int + SSP_res|. For large samples, reject H0 at level α if the corresponding Bartlett-type χ² statistic exceeds its critical value.

Test for interaction first: if interaction exists, there is no clear interpretation of the factor effects. Conduct p univariate tests to see whether interaction occurs in some responses but not others.

H0: τ1 = τ2 = … = τg = 0    H1: at least one τ_l ≠ 0
Let Λ* = |SSP_res| / |SSP_fac1 + SSP_res|. Small Λ* is consistent with H1. For large samples, reject H0 at level α if the corresponding Bartlett-type statistic is large.

H0: β1 = β2 = … = βb = 0    H1: at least one β_k ≠ 0
Let Λ* = |SSP_res| / |SSP_fac2 + SSP_res|. Small Λ* is consistent with H1. For large samples, reject H0 at level α similarly.

Simultaneous intervals—

Example:
Plastic film data (See Example 6.11)

Principal Components
Concerned with core structure of a single sample of observations on p variables Analysis of covariance structure

Objectives:

1. Data reduction
2. Data interpretation

Cov (X) = 

Account for elements of  with few (less than p) linear combinations of original variables X

Population principal components (to illustrate the ideas)

Have E(X) = μ, Cov(X) = Σ, with Σ (p×p) positive definite and eigenvalues λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.

Consider linear combinations

Y1 = a1'X = a11X1 + a12X2 + … + a1pXp
Y2 = a2'X = a21X1 + a22X2 + … + a2pXp
  ⋮
Yp = ap'X = ap1X1 + ap2X2 + … + appXp

E(Y) = Aμ,   Cov(Y) = A Σ A'

Principal components are those linear combinations that are uncorrelated with maximum variance (subject to coefficient vector length restriction)

First principal component: the linear combination a1'X that maximizes Var(a1'X) subject to a1'a1 = 1.

ith principal component: the linear combination ai'X that maximizes Var(ai'X) subject to ai'ai = 1 and Cov(ai'X, ak'X) = 0 for k < i.

Let e1, e2, …, ep be the normalized eigenvectors corresponding to the eigenvalues λ1 ≥ λ2 ≥ … ≥ λp ≥ 0 of Σ. The choices ai = ei, i = 1, 2, …, p give the population principal components. That is, Yi = ei'X are the principal components.

Have Var(Yi) = ei' Σ ei = ei'(λi ei) = λi(ei'ei) = λi

Also, since ei'ek = 0 for i ≠ k,  Cov(Yi, Yk) = ei' Σ ek = 0 for i ≠ k

“Total variance” = tr(Σ) = σ11 + σ22 + … + σpp

Results:

1. Σ_{i=1}^{p} Var(Xi) = σ11 + σ22 + … + σpp = λ1 + λ2 + … + λp = Σ_{i=1}^{p} Var(Yi)

2. (Proportion of total variance due to kth component) = λk / (λ1 + λ2 + … + λp),   k = 1, 2, …, p

3. For Yi = ei'X = ei1X1 + ei2X2 + … + eipXp,

   ρ_{Yi, Xk} = eik √λi / √σkk,   i, k = 1, 2, …, p

Correlation coefficients are sometimes useful for interpreting principal components (although they convey only univariate information). Can also use the size of the coefficients in the coefficient vector ei for interpretation.

Example:
Hypothetical data (See Example 8.1)
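A sketch of extracting population principal components from a covariance matrix with NumPy; the Σ below is an assumed small covariance matrix for illustration, not necessarily the Example 8.1 data:

```python
import numpy as np

# Population principal components from a covariance matrix Sigma
# (illustrative values).
Sigma = np.array([[1.0, -2.0, 0.0],
                  [-2.0, 5.0, 0.0],
                  [0.0, 0.0, 2.0]])

lam, E = np.linalg.eigh(Sigma)   # eigh returns ascending eigenvalues
lam, E = lam[::-1], E[:, ::-1]   # reorder: lambda_1 >= ... >= lambda_p

prop = lam / lam.sum()           # proportion of total variance per component
print(np.round(lam, 2))          # eigenvalues ~ 5.83, 2, 0.17
```

Note that the eigenvalues sum to tr(Σ) = 8, as Result 1 requires.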

_____________

Multivariate normal random vector X (i.e. X ~ N_p(μ, Σ)): elliptical contours of constant density

(x − μ)' Σ⁻¹ (x − μ) = c²

(λi, ei) eigenvalue-eigenvector pairs of Σ

Principal components lie along the axes of the constant-density ellipsoid.

Example:  p = 2,  μ = 0,  Cov(X1, X2) > 0

Principal Components Obtained from Standardized Variables—

Z1 = (X1 − μ1)/√σ11,   Z2 = (X2 − μ2)/√σ22,   …,   Zp = (Xp − μp)/√σpp

Z = (V^{1/2})⁻¹(X − μ),   E(Z) = 0,   Cov(Z) = ρ = Corr(X)

(λ1, e1), (λ2, e2), …, (λp, ep) eigenvalue-eigenvector pairs from ρ with λ1 ≥ λ2 ≥ … ≥ λp ≥ 0

Population principal components (standardized variables):

Yi = ei'Z = ei'(V^{1/2})⁻¹(X − μ),   i = 1, 2, …, p

Results:

1. Σ_{i=1}^{p} Var(Yi) = Σ_{i=1}^{p} Var(Zi) = p = λ1 + λ2 + … + λp

2. (Proportion of variance due to kth component) = λk / p

3. Correlation coefficients (sometimes useful):  ρ_{Yi, Zk} = eik √λi,   i, k = 1, 2, …, p

4. Principal components obtained from ρ are (in general) not the same as principal components obtained from Σ

Covariance matrices with special structures—

1. Σ = diag(σ11, σ22, …, σpp): the principal components are the original variables.

2. Equicorrelation:

   Σ = σ² [ 1  ρ  …  ρ ]
           [ ρ  1  …  ρ ]
           [ ⋮          ]
           [ ρ  ρ  …  1 ]      ρ > 0

   λ1 = σ²(1 + (p − 1)ρ),   λ2 = … = λp = σ²(1 − ρ)

   e1' = [1/√p, 1/√p, …, 1/√p]

   Y1 = e1'Z = (1/√p) Σ_{i=1}^{p} Zi

   (Proportion of variance explained by Y1) = λ1/(pσ²) = ρ + (1 − ρ)/p
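The equicorrelation result is easy to verify numerically; a sketch with σ² = 1 (i.e. the correlation matrix), ρ = .5, p = 4:

```python
import numpy as np

# Equicorrelation structure: largest eigenvalue 1 + (p-1)rho with an
# equal-loading eigenvector 1/sqrt(p).
p, rho = 4, 0.5
R = (1 - rho) * np.eye(p) + rho * np.ones((p, p))

lam, E = np.linalg.eigh(R)
lam1, e1 = lam[-1], E[:, -1]     # largest eigenvalue and its eigenvector

assert np.isclose(lam1, 1 + (p - 1) * rho)       # 2.5
assert np.allclose(np.abs(e1), 1 / np.sqrt(p))   # equal loadings (up to sign)
```

The remaining eigenvalues all equal 1 − ρ = .5, consistent with the formula above.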

Sample Principal Components— Everything goes through with S for Σ, R for ρ, and x̄ for μ.

Data matrix:

    [x11  x12  …  x1p]
X = [x21  x22  …  x2p]
    [ ⋮              ]
    [xn1  xn2  …  xnp]

S (p×p) with eigenvalue-eigenvector pairs (λ̂1, ê1), (λ̂2, ê2), …, (λ̂p, êp),   λ̂1 ≥ λ̂2 ≥ … ≥ λ̂p ≥ 0

ith principal component (sample):

ŷi = êi'x = êi1x1 + êi2x2 + … + êipxp

With n observations on each component, we have

Sample variance(ŷi) = λ̂i
Sample covariance(ŷi, ŷk) = 0,  i ≠ k

Total sample variance = Σ_{i=1}^{p} sii = λ̂1 + λ̂2 + … + λ̂p

(Proportion of total sample variance due to kth component) = λ̂k / (λ̂1 + λ̂2 + … + λ̂p),   k = 1, 2, …, p

Correlations:  r_{ŷi, xk} = êik √λ̂i / √skk,   i, k = 1, 2, …, p

Properties:

1. Sample principal components are linear combinations ŷi = ai'x maximizing the sample variances of the ŷi's subject to ai'ai = 1 and zero sample covariances for any pair ŷi, ŷk, k ≠ i.

2. Often center the observations, creating x − x̄ and defining components as ŷi = êi'(x − x̄), i = 1, 2, …, p. In this case, the components have sample mean 0.
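A sketch of the sample principal component construction on toy data (not data from the text), verifying that the component scores have sample variances λ̂i and zero sample covariances:

```python
import numpy as np

# Sample principal components: scores y_i = e_i'(x - xbar) have sample
# variance lambda_hat_i and are uncorrelated.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) @ np.array([[2.0, 0.5, 0.0],
                                         [0.0, 1.0, 0.3],
                                         [0.0, 0.0, 0.5]])

S = np.cov(X, rowvar=False)      # sample covariance matrix (divisor n-1)
lam, E = np.linalg.eigh(S)
lam, E = lam[::-1], E[:, ::-1]   # descending eigenvalues

Y = (X - X.mean(axis=0)) @ E     # centered component scores
SY = np.cov(Y, rowvar=False)

assert np.allclose(np.diag(SY), lam)               # variances = eigenvalues
assert np.allclose(SY - np.diag(np.diag(SY)), 0)   # zero covariances
assert np.isclose(lam.sum(), np.trace(S))          # total variance preserved
```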

The Number of Principal Components— How many components can be retained without loss of information? Things to consider: Amount of total sample variance explained Relative sizes of the eigenvalues Subject matter interpretation of the components Scree plot

A Scree Plot

Example:
Turtle data (See Example 8.4) SAS Output

Geometrical Interpretation of Sample Principal Components—

x j , j  1,2,  , n

n points in p dimensions

S positive definite (all eigenvalues positive) Constant distance ellipsoid: All x such that
( x  x )' S 1 ( x  x )  c 2

Sample principal components lie along axes of constant distance ellipsoid

Sample principal components and ellipses of constant distance

Sample principal components obtained from S different from those obtained from R (covariance matrix of standardized observations)
D^{−1/2} = [1/√s11     0     …     0   ]
           [   0    1/√s22   …     0   ]
           [   ⋮                   ⋮   ]
           [   0       0     …  1/√spp ]

Standardized observations:  zj = D^{−1/2}(xj − x̄)

Sample Cov(zj) = R, the (p×p) sample correlation matrix

Eigenvalue-eigenvector pairs for R: (λ̂1, ê1), (λ̂2, ê2), …, (λ̂p, êp),   λ̂1 ≥ λ̂2 ≥ … ≥ λ̂p ≥ 0

Sample principal components:

ŷi = êi'z = êi1z1 + êi2z2 + … + êipzp

Sample variance(ŷi) = λ̂i
Sample covariance(ŷi, ŷk) = 0,  i ≠ k

Total sample variance = p = λ̂1 + λ̂2 + … + λ̂p

(Proportion of total sample variance due to kth component) = λ̂k / p,   k = 1, 2, …, p

Correlations:  r_{ŷi, zk} = êik √λ̂i,   i, k = 1, 2, …, p

Example:
Stock price data (See Example 8.5) _________________

Graphing the Principal Components— Plots of principal components can reveal groupings, suspect observations (outliers), and provide checks on normality

Example:
Turtle data (See Example 8.7)

Example:
Ramus bone data

Large Sample Inferences—

Large sample inferences for principal components not particularly useful; however, testing for certain covariance (correlation) structures often of interest 1. Test of equicorrelation structure

Suggested test (due to Lawley) is essentially based on an “analysis of variance” on off-diagonal elements of R

Example: Mice data (See Example 8.9)

2. X ~ N_p(μ, Σ)

   H0: Σ = diag(σ11, σ22, …, σpp)   (i.e. ρ = I)
   H1: Σ ≠ Σ0

   Under H0, all variables X1, X2, …, Xp are uncorrelated (independent). Likelihood ratio test discussed in Exercise 8.9(a).

3. X ~ N_p(μ, Σ)

   H0: Σ = σ²I
   H1: Σ ≠ σ²I

   Under H0, all variables X1, X2, …, Xp are uncorrelated (independent) and have the same variance. Likelihood ratio test discussed in Exercise 8.9(b).

Final Comments—
1. Very small eigenvalues may indicate linear dependencies—examine the principal components associated with very small eigenvalues.
2. Principal components are often used as inputs to regression analysis, for plots of the data in two dimensions, and so forth.
3. Using principal components may result in savings of storage space for large databases.
4. Principal components obtained from the covariance matrix or the correlation matrix? Depends. (Can always do both and compare results.)

Factor Analysis
Factor analysis is an attempt to identify the underlying but unobservable factors that produce the correlation pattern of the observable variables Orthogonal factor model— Observable X has mean  covariance  Unobservable random variables
F1 , F2 ,  , Fm

common factors specific factors (errors)

1 ,  2 ,,  p

X 1   1  l11 F1    l1m Fm   1 X 2   2  l 21 F1   l 2 m Fm   2   X p   p  l p1 F1    l pm Fm   p

X    LF  
 l11 l 21 L   l p1  l12 l 22  l p2  l1m   l 2m        l pm  

matrix of factor loadings

Model

Covariance structure

It follows that
2  ii  l i2  l i22    l im   i 1

   Var(Xi) = communality +specific variance
2 2 2 2 With communality = hi  li1  li 2   lip

have

 ii  hi2   i

Example:
Hypothetical data (See Example 9.1)

Cov(X) = Σ has p variances and p(p−1)/2 covariances, or p(p+1)/2 distinct entries. When m = p, Σ = LL' with Ψ = 0, so Σ always factors. However, we seek m << p in order to get a “simple” explanation of the observed correlations among the variables Xi. Unfortunately, most covariance matrices cannot be factored as LL' + Ψ exactly when m is much less than p.

Example:
Nonexistence of a proper solution (See Example 9.2)
Σ = [ 1  .9  .7]
    [.9   1  .4]
    [.7  .4   1]

Try to factor as Σ = LL' + Ψ when m = 1.

Uniqueness problem—  T orthogonal matrix, i.e. TT' = T'T = I.

New loadings L* = LT and new factors F* = T'F satisfy

E(F*) = T'E(F) = 0,   Cov(F*) = T'Cov(F)T = T'I T = T'T = I

L*F* = LTT'F = LF

so

X − μ = LF + ε = L*F* + ε   and   Σ = LL' + Ψ = L*L*' + Ψ

Factor loadings are determined only up to an orthogonal matrix T (the basis for “factor rotation”); L and L* = LT give the same representation of Σ.

Methods of Estimation—

Principal component method: based on the spectral decomposition of a matrix

Σ = λ1e1e1' + λ2e2e2' + … + λpepep'

(λi, ei) eigenvalue-eigenvector pairs of Σ

If the last p − m eigenvalues are small, get the approximation

Σ ≈ λ1e1e1' + λ2e2e2' + … + λmemem'

The loadings matrix is

L = [√λ1 e1 | √λ2 e2 | … | √λm em]

and Ψ is the diagonal matrix with ψi = σii − Σ_{j=1}^{m} lij².

Consequently, variances are reproduced exactly.
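A sketch of the principal component factor solution on a small toy correlation matrix (not from the text), verifying that the diagonal is reproduced exactly:

```python
import numpy as np

# Principal component factor solution with m factors:
# L = [sqrt(l1) e1 | ... | sqrt(lm) em], Psi = diag(R - LL').
R = np.array([[1.0, 0.63, 0.45],
              [0.63, 1.0, 0.35],
              [0.45, 0.35, 1.0]])
m = 1

lam, E = np.linalg.eigh(R)
lam, E = lam[::-1], E[:, ::-1]           # descending eigenvalues

L = E[:, :m] * np.sqrt(lam[:m])          # p x m loading matrix
Psi = np.diag(R - L @ L.T)               # specific variances
fitted = L @ L.T + np.diag(Psi)

assert np.allclose(np.diag(fitted), np.diag(R))   # variances reproduced exactly
```

Off-diagonal entries of R are only approximated; their misfit is what the residual matrix below measures.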

In practice, work with the sample covariance matrix S (or sample correlation matrix R) and its eigenvalues λ̂i and eigenvectors êi.

Important property of the principal component solution: initial loadings do not change if more factors are added.

m = 1:  L̃ = [√λ̂1 ê1]       m = 2:  L̃ = [√λ̂1 ê1 | √λ̂2 ê2]       and so forth

Note:  l̃11² + l̃21² + … + l̃p1² = λ̂1

(Proportion of total sample variance due to jth factor) = λ̂j / (s11 + s22 + … + spp)   for S
                                                        = λ̂j / p                       for R

Adequacy of fit judged from the residual matrix

S − (L̃L̃' + Ψ̃)

whose sum of squared entries ≤ λ̂²_{m+1} + … + λ̂²_p

________________ In cases where units of variables are not commensurate, it is common practice to factor analyze the correlation matix R ________________

Example:
Consumer-preference data (See Example 9.3)

Example:
Stock price data (See Example 9.4)

Maximum likelihood method: assume X (p×1) has a multivariate normal distribution with Σ = LL' + Ψ. The likelihood is

Likelihood(μ, L, Ψ) = (2π)^{−np/2} |LL' + Ψ|^{−n/2} exp( −(1/2) Σ_{j=1}^{n} (xj − μ)'(LL' + Ψ)⁻¹(xj − μ) )

which is maximized over μ, L, Ψ by a computer search.

Uniqueness condition:  L'Ψ⁻¹L = Δ (diagonal)

Note: If Σ̂ = L̂L̂' + Ψ̂ is the MLE of Σ = LL' + Ψ, the MLE of ρ = V^{−1/2}ΣV^{−1/2} is ρ̂ = L̂z L̂z' + Ψ̂z where

L̂z = V̂^{−1/2} L̂,   Ψ̂z = V̂^{−1/2} Ψ̂ V̂^{−1/2}

Example:
Stock price data again (See Example 9.5)

Compare with principal component solution

Example:
Olympic decathlon data (See Example 9.6)

Test for the number of common factors—

H0:  Σ (p×p) = L (p×m) L' (m×p) + Ψ (p×p)

Assume X is multivariate normal.

Reject H0 at level α if

(n − 1 − (2p + 4m + 5)/6) ln( |L̂L̂' + Ψ̂| / |Sn| ) > χ²_{[(p−m)² − p − m]/2}(α)

Test requires  m < (1/2)(2p + 1 − √(8p + 1))

Example:
Stock price data, m = 2 (See Example 9.7)

________________

Factor Rotation—

Rotated loadings L̂* = L̂T where TT' = T'T = I.

The estimated covariance (correlation) matrix remains unchanged after rotation. Also, the residual matrix, specific variances ψ̂i, and communalities ĥi² are unchanged.

Example:
Examination score data (See Example 9.8)

Factor rotation for test scores

Criterion for rotation (Varimax or Normal Varimax):

Select the orthogonal matrix T so that L̂* = L̂T maximizes

V = (1/p) Σ_{j=1}^{m} [ Σ_{i=1}^{p} l̃*ij⁴ − ( Σ_{i=1}^{p} l̃*ij² )² / p ]

where l̃*ij = l̂*ij / ĥi (scale by communalities)

Example:
Consumer preference data again (See Example 9.9)

Factor method: Principal components

Groups (with rotated loadings): Variables 1,3 Variables 2,4,5 Factor 1  “nutritional” factor Factor 2  “taste” factor

See SAS output in text

Note: General factor tends to be destroyed after rotation

Factor rotation for hypothetical marketing data

Example:
Stock price data (See Example 9.10)

Factor method: Maximum likelihood

Notes: 1. Rotation particularly recommended for loadings obtained by maximum likelihood but, in this case, initial loadings and factors more appealing 2. Proportion of total variance explained by each factor changes after rotation but cumulative total variance explained is unchanged

Example:
Olympic decathlon data (See Example 9.11)

Factors: Explosive arm strength Explosive leg strength Running speed Running endurance

Factor Scores— In factor analysis, interest is usually on parameters, lij and i in the factor model. Sometimes, however, estimated values of common factors, called factor scores, may be required for diagnostic purposes or as inputs to a subsequent analysis. Factor scores
ˆ f j  estimates of values f j attained by F j

Note:

Unobserved quantities fj and j, j = 1,2,…,n, outnumber observed quantities xj

Heuristic, but reasonable, approaches to generating factor scores 1. Weighted least squares 2. Regression method Both approaches

ˆ  Treat estimates lˆij and  i as if they were the true values  Involve linear transformations of the observed data, perhaps centered or standardized ( fˆ j  A( x j  x ) )

 Computational formulas do not change if rotated loadings are substituted for the original loadings

Weighted least squares:

The Var(εi) = ψi are, in general, unequal, so pick factor scores to minimize

Σ_{i=1}^{p} εi²/ψi = ε'Ψ⁻¹ε = (x − μ − Lf)' Ψ⁻¹ (x − μ − Lf)

For each j,

f̂j = (L̂'Ψ̂⁻¹L̂)⁻¹ L̂'Ψ̂⁻¹(xj − x̄)   using the MLE L̂, Ψ̂

Regression method: the conditional distribution of F | x is multivariate normal. Have

f̂j = L̂'S⁻¹(xj − x̄),   j = 1, 2, …, n

If starting from the correlation matrix, use R in place of S and zj in place of (xj − x̄).

Examples:
Stock price data again (See Examples 9.12 & 9.13)

Simple factor scores can be created by summing (standardized) observed values of variables with high loadings on different factors. Data reduction accomplished by replacing original data with these factor scores. ________________

Factor scores may or may not be normally distributed. Plots of factor scores should be examined prior to using the scores in subsequent analyses.

Strategy for Factor Analysis 1. Perform a principal component factor analysis  Look for suspicious observations by plotting factor scores  Try a varimax rotation 2. Perform a maximum likelihood factor analysis including rotation 3. Compare the solutions in 1 and 2

4. Repeat steps 1—3 for other numbers of common factors m 5. Split large data sets in half and perform a factor analysis on each

Example:
Chicken bone data (See Example 9.14)

Residual matrix using maximum likelihood estimates (m = 3)

Factor scores for first two factors using maximum likelihood loadings (check for outliers)

Factor scores for two solution methods
First factor

Third factor

m = 3 factors?

Stability of solution—split data n1 = 137

n2 = 139

Discrimination and Classification
Goals: 1. Separate—create numerical “discriminants” that separate different groups as much as possible 2. Classify—create rules for assigning new item (observation) to one of several groups

Observations to populations Objects to groups

 Interchangeable

Why problem? 1. Incomplete knowledge of future performance 2. Perfect information requires destroying object 3. Unavailable or expensive information

Good classification should result in few misclassifications

Things to consider: Costs Prior probabilities

Two groups (populations)—

Ω = sample space, divided into two regions R1 and R2:

R1 ∪ R2 = Ω,   R1 ∩ R2 = ∅

General form of classification rule:

If x ∈ R1, assign x to π1;  if x ∈ R2, assign x to π2.

p=2

Definitions:
p1 = prior probability of belonging to π1
p2 = prior probability of belonging to π2
P(2|1) = probability of incorrectly classifying a π1 item as π2
P(1|2) = probability of incorrectly classifying a π2 item as π1
c(2|1) = cost of misclassifying a π1 object
c(1|2) = cost of misclassifying a π2 object

Criterion: Expected cost of misclassification (ECM)

ECM = c(2|1) P(2|1) p1 + c(1|2) P(1|2) p2

Minimum ECM rule leads to

R1:  f1(x)/f2(x) ≥ (c(1|2)/c(2|1)) (p2/p1)
R2:  f1(x)/f2(x) < (c(1|2)/c(2|1)) (p2/p1)

Example:
Hypothetical data (See Example 11.2) _______________

Other criteria:  Total probability of misclassification (TPM) (Ignores costs or assumes they are equal)  Highest posterior probability (Similar to TPM—see discussion in text)

Two multivariate normal populations—

Case 1:  Σ1 = Σ2 = Σ

f1(x): N_p(μ1, Σ),   f2(x): N_p(μ2, Σ)

ln[ f1(x)/f2(x) ] = (μ1 − μ2)'Σ⁻¹x − (1/2)(μ1 − μ2)'Σ⁻¹(μ1 + μ2)

Min ECM rule: Allocate x0 to π1 (region R1) if

(μ1 − μ2)'Σ⁻¹x0 − (1/2)(μ1 − μ2)'Σ⁻¹(μ1 + μ2) ≥ ln[ (c(1|2)/c(2|1)) (p2/p1) ]

Allocate x0 to π2 (region R2) otherwise.

In practice, substitute x̄1 for μ1, x̄2 for μ2, and

S_pooled = [(n1 − 1)S1 + (n2 − 1)S2] / (n1 + n2 − 2)

for Σ.

(See text for summary using sample quantities)

Example:
Hemophilia data (See Example 11.3)

Scatter plots of (log10(AHF activity), log10(AHF-like antigen)) for the normal group and the obligatory hemophilia A carriers.

Case 2:  Σ1 ≠ Σ2

f1(x): N_p(μ1, Σ1),   f2(x): N_p(μ2, Σ2)

Min ECM rule: Allocate x0 to π1 (region R1) if

−(1/2) x0'(Σ1⁻¹ − Σ2⁻¹)x0 + (μ1'Σ1⁻¹ − μ2'Σ2⁻¹)x0 − k ≥ ln[ (c(1|2)/c(2|1)) (p2/p1) ]

Allocate x0 to π2 (region R2) otherwise.

Notes:

1. In practice, substitute x̄1 for μ1, x̄2 for μ2, S1 for Σ1, S2 for Σ2.
2. The classification function is a quadratic function of x.

p=1

Quadratic rules for (a) two normal populations with unequal variances and (b) two distributions, one of which is non-normal—rule not appropriate

Evaluating classification functions— Want measure of performance in future samples. Calculate “error rates” or misclassification probabilities.

Measure of performance that doesn’t depend on form of parent population called apparent error rate (APER) APER = fraction in learning sample misclassified

Confusion matrix:

APER is (generally) an optimistic indication of the performance of the classification procedure in future samples. Proposal: calculate APER for a validation sample. Problems: 1. Need large samples. 2. The procedure evaluated is not the procedure eventually used.

Cross-validation (jackknife) estimates of error rates. Algorithm: 1. Start with the π1 group. Omit (hold out) one observation and develop the classification rule based on n1 − 1, n2 observations. 2. Classify the omitted observation. 3. Repeat steps 1–2 for all π1 observations. 4. Repeat steps 1–3 for all π2 observations. 5. Compute the error rate in the usual way.

Example:
Artificial data (See Example 11.6)

Example:
Salmon data (See Example 11.7)

Beware of any two-group procedure with error rates around 50%. The methods can be extended to handle g > 2 groups (populations).

Fisher’s Discriminant Function—
 Separation and allocation together (two populations)
 A different (original) approach

Consider the linear combination Y = a'X:

For f1(x):  y1j = a'x1j,   j = 1, 2, …, n1
For f2(x):  y2j = a'x2j,   j = 1, 2, …, n2

Measure of separation:  |ȳ1 − ȳ2| / sy

where

ȳ1 = (1/n1) Σ y1j = a'x̄1,   ȳ2 = (1/n2) Σ y2j = a'x̄2

and

sy² = [ Σ(y1j − ȳ1)² + Σ(y2j − ȳ2)² ] / (n1 + n2 − 2) = a' S_pooled a

Proposal:

max over a of  (ȳ1 − ȳ2)²/sy² = (a'x̄1 − a'x̄2)² / (a'S_pooled a)

Best a:  â = S_pooled⁻¹(x̄1 − x̄2)

Maximum separation:

(â'x̄1 − â'x̄2)² / (â'S_pooled â) = (x̄1 − x̄2)' S_pooled⁻¹ (x̄1 − x̄2) = Mahalanobis D²

Fisher’s sample linear discriminant:

ŷ = (x̄1 − x̄2)' S_pooled⁻¹ x = â'x

Notes:

1. Test H0: μ1 = μ2 using

   T² = [n1n2/(n1 + n2)] D²   with   [(n1 + n2 − p − 1)/((n1 + n2 − 2)p)] T² ~ F_{p, n1+n2−p−1}

   to see if discrimination (separation) is likely to be useful.

2. Initially, no assumption about the form of the parent populations except a common covariance matrix Σ.

3. Allocation: set  m̂ = (1/2)(ȳ1 + ȳ2) = (1/2)(x̄1 − x̄2)'S_pooled⁻¹(x̄1 + x̄2)

   Allocate x0 to π1 if  ŷ0 = (x̄1 − x̄2)'S_pooled⁻¹x0 ≥ m̂

   Allocate x0 to π2 otherwise. (This is the same as the allocation rule for the classifier based on two multivariate normal populations with common Σ.)

Anderson’s classification statistic:  ŵ = ŷ − m̂

A pictorial representation of Fisher’s procedure for two populations with p = 2.

Example:
Steel data X1 = yield point X2 = ultimate strength

Univariate t statistics for mean differences are each non-significant

x  1 ˆ y  ( x1  x 2 ) S pooled x  (1.633 1.819) 1   1.633x1  1.819x 2 x   2

y1

y2

Separation evident

Example:
Hemophilia data again (See Example 11.8) ________________ Additional comments: 1. “Significant” separation (large T2) does not necessarily imply good classification
ˆ 2. Coefficient vector a frequently scaled

Classification with Several Populations—

With several populations, procedures for classifying and those for separating are more distinct.

g populations:  π1, π2, …, πg
prior probabilities:  p1, p2, …, pg
costs:

c(k|i) = cost of allocating an item to k when it is from i

Minimum ECM Rule: Assign x to that population πk for which

Σ_{i≠k} p_i f_i(x) c(k|i)

is smallest.

Special case: Equal misclassification costs

The rule in the special case above is identical to the one that maximizes the “posterior” probability P(πk | x), where P(πk | x) = pk fk(x) / Σ_i pi fi(x).

Assign new x0 to population with highest posterior probability

Example:
Artificial data (See Example 11.9) ________________

Classification with multivariate normal populations— If fi(x) is N p (  i ,  i ) , classifier (discriminant) is quadratic in x.

If all the i =  (covariance matrices all equal), the classifier (discriminant) is linear in x

Another interpretation of linear classifier: Dropping an appropriate constant term, set

The allocatory rule is then

We see that this rule assigns x to the “closest” population. (The distance measure is penalized by ln(pi))

Example:
Artificial data (See Example 11.10)

Example:
Admissions data (See Example 11.11)

Scatter plot of (GPA, GMAT) for applicants to a graduate school of business who have been identified as admit, do not admit, or borderline.

SAS Output (Admissions data)

Another form of the minimum TPM rule for multivariate normal populations (equal covariances)—
ˆ Using max d k ( x) equivalent to

The classification regions defined above are bounded by hyperplanes

p = 2, g = 3

The classification regions R1, R2, R3 for the linear TPM rule (p1=.25, p2=.50, p3=.25)

Error Rate Estimate— Based on cross-validation or holdout procedure

Ê(AER) = ( Σ_{i=1}^{g} n_{i,M}^{(H)} ) / ( Σ_{i=1}^{g} n_i )
Estimate of expected actual error rate

Example:
Iris data (See Example 11.12) Confusion matrix

Fisher’s Discriminant Analyis for Several Groups—

Goal: separate several populations (groups) for
 Visual inspection (relationships among populations)
 Graphical descriptive purposes (locations, scatter plots in two dimensions)

Assume Σ1 = Σ2 = … = Σg = Σ, but no assumption of multivariate normality.

Let

μ̄ = (1/g) Σ_{i=1}^{g} μi,   Bμ = Σ_{i=1}^{g} (μi − μ̄)(μi − μ̄)'

Consider Y = a'X:

E(Y) = a'E(X | πi) = a'μi for population πi
Var(Y) = a'Cov(X)a = a'Σa for all populations

Overall mean:  μ̄Y = (1/g) Σ_{i=1}^{g} μiY = a'μ̄

(Sum of squared distances from populations to overall mean of Y) / (Variance of Y)

= Σ_{i=1}^{g} (μiY − μ̄Y)² / σY² = a'Bμ a / a'Σa

Select the Yi's to maximize this ratio subject to being uncorrelated.

Let λ1 ≥ λ2 ≥ … ≥ λs > 0 be the non-zero eigenvalues of Σ⁻¹Bμ and e1, e2, …, es the corresponding eigenvectors. Normalize ei so that ei'Σei = 1. The choices ai = ei yield the Fisher discriminants, Yi = ai'X.

In practice, substitute sample quantities for corresponding population parameters

Example:
Artificial data (See Example 11.13)

Example:
Crude oil data (See Example 11.14)

Crude-oil samples in discriminant space

Example:
Sports data (See Example 11.15)

The discriminant means y = [y1 , y2] for each sport

Using Fisher’s Discriminants to Classify—
It can be shown, for y_j = a_j′x, that

∑_{j=1}^s (y_j − μ_iY_j)² = (x − μ_i)′Σ⁻¹(x − μ_i) = −2d_i(x) + x′Σ⁻¹x + 2 ln p_i

for all i. (If all s discriminants are used and the priors are equal, classifying by the smallest ∑_{j=1}^s (y_j − μ_iY_j)² agrees with the normal-theory, equal-Σ, minimum TPM rule.)

 If fewer than s discriminants are used, there is some loss of information

Two or fewer discriminants suffice if:

Number of variables | Number of populations | Maximum number of discriminants
any p               | g = 2                 | 1
any p               | g = 3                 | 2
p = 2               | any g                 | 2

Example:
Artificial data (See Example 11.16)

The points ŷ, ȳ₁, ȳ₂, ȳ₃ in the classification plane

Final Comments—  Selection of variables  Testing for mean differences  Normality

Two non-normal populations for which linear discrimination is inappropriate

 Other procedures (can handle qualitative variables)
1. Logistic regression:

p(x) = exp(β₀ + β′x) / (1 + exp(β₀ + β′x))

2. Neural networks

3. Classification trees (CART)

Classification tree terminal nodes (regions) in the petal width, petal length sample space
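As an illustration of the logistic-regression form listed above, the fitted probability is a simple function of the linear score β₀ + β′x (a sketch with hypothetical coefficients; in practice β is estimated by maximum likelihood):

```python
import math

def logistic_prob(x, beta0, beta):
    """p(x) = exp(b0 + b'x) / (1 + exp(b0 + b'x)), written in the
    numerically equivalent form 1 / (1 + exp(-(b0 + b'x)))."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

p = logistic_prob([0.0, 0.0], 0.0, [1.0, 1.0])  # zero score gives p = 0.5
```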

Clustering and Related Topics
Problem: Given a group of N objects (or variables), devise a scheme for sorting the objects into g classes based on “similarity” or “distance” measures.

Notes:
1. Clustering goes by several names (e.g. numerical taxonomy, pattern recognition)
2. Classes (groups) are not determined a priori
3. In general, looking at all possible groupings is not feasible
4. Many clustering techniques are sensible but ad hoc; there is no well-defined statistical model
5. Clustering is a useful first step in “data mining” when faced with a large, complex data set with many variables and a lot of internal structure

Goals: 1. Uncover “natural” structure 2. Impose proper structure

Types of techniques—
1. Hierarchical (classes are grouped at successively higher levels)
 Agglomerative (successive fusions) methods, e.g. linkage procedures
 Divisive (successive partitions) methods
2. Non-hierarchical (one set of specific groups)
 K-means method

When items are clustered, proximity is usually indicated by some sort of distance (similarity) measure. When variables are clustered, proximity is usually indicated by a correlation coefficient or a like measure of association.

Clustering items: measurements may be quantitative or qualitative. Suppose measurements on p variables:
Item i : x′ = (x₁, x₂, …, x_p)
Item k : y′ = (y₁, y₂, …, y_p)

The distance between item i and item k is

d(x, y) = √((x − y)′(x − y)) = √( ∑_i (x_i − y_i)² )

Alternatively,

d(x, y) = √((x − y)′A(x − y)),  A positive definite

or the Minkowski distance

d(x, y) = ( ∑_i |x_i − y_i|^m )^{1/m}
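A one-line sketch of the Minkowski distance (m = 2 recovers the Euclidean distance, m = 1 the city-block distance; the function name is hypothetical):

```python
def minkowski(x, y, m=2):
    """d(x, y) = (sum_i |x_i - y_i|^m)^(1/m)."""
    return sum(abs(a - b) ** m for a, b in zip(x, y)) ** (1.0 / m)

d2 = minkowski([0, 0], [3, 4])        # Euclidean: 5.0
d1 = minkowski([0, 0], [3, 4], m=1)   # city-block: 7.0
```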

Qualitative measurements on p characteristics (variables): introduce a “dummy” variable to indicate presence or absence of each characteristic

X = 1 if present, 0 if absent

p = 3
Variable | 1 2 3
Item i   | 1 0 0
Item k   | 1 1 0

The squared distance ∑_j (x_ij − x_kj)² counts mismatches; here

∑_{j=1}^3 (x_ij − x_kj)² = 1

To allow for differential weighting of matches and mismatches, arrange the data in the form of a 2-way table of counts and compute “similarity” coefficients.

p = 3
                Item k
                 1   0 | Total
Item i   1 |     1   0 |   1
         0 |     1   1 |   2
Total      |     2   1 |   3

General frequency table:
                Item k
                 1     0   | Total
Item i   1 |     a     b   | a+b
         0 |     c     d   | c+d
Total      |    a+c   b+d  | a+b+c+d = p
Similarity Coefficients for Clustering Items
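The full coefficient table is not reproduced here, but two standard entries, the simple matching coefficient (a + d)/p and Jaccard's coefficient a/(a + b + c), can be sketched from the frequency-table counts (hypothetical helper name; counts taken from the p = 3 example above):

```python
def similarity_coefficients(a, b, c, d):
    """From the 2-way frequency table: a = 1-1 matches, d = 0-0 matches,
    b and c = mismatches, p = a + b + c + d."""
    p = a + b + c + d
    simple_matching = (a + d) / p       # counts 0-0 matches as agreement
    jaccard = a / (a + b + c)           # ignores 0-0 matches
    return simple_matching, jaccard

# Items i = (1, 0, 0) and k = (1, 1, 0): a = 1, b = 0, c = 1, d = 1
s, j = similarity_coefficients(a=1, b=0, c=1, d=1)
```

The choice between them matters when joint absences carry no information (e.g. rare characteristics).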

Example:
Individuals data (See Example 12.2) ________________

Notes:
1. Similarities can be constructed from distances, e.g.

s̃_ik = 1 / (1 + d_ik)

However, distances cannot always be constructed from similarities.
2. There are similarity measures for variables also (see text)

Hierarchical Clustering Methods— Basic algorithm for linkage procedures: N objects (items)

Intercluster distance (dissimilarity) for (a) single linkage, (b) complete linkage, and (c) average linkage
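The agglomerative idea can be sketched in pure Python for single linkage on 1-D items (hypothetical function and data; complete or average linkage would only change the inter-cluster distance function):

```python
def single_linkage(points, n_clusters):
    """Start with each item as its own cluster; repeatedly merge the pair
    of clusters whose closest members are nearest (single linkage)."""
    clusters = [[p] for p in points]

    def dist(c1, c2):  # single linkage: minimum inter-item distance
        return min(abs(a - b) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        pairs = [(dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)          # closest pair of clusters
        clusters[i] += clusters[j]    # fuse them
        del clusters[j]
    return clusters

groups = single_linkage([0.0, 1.0, 1.5, 10.0, 10.5], n_clusters=2)
```

Recording the merge distances at each fusion is what produces the dendrograms shown in the examples below.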

Example:
Artificial data—single linkage (See Example 12.4)

Examples:
Concordance data (See Examples 12.3, 12.5, 12.7 & 12.9)
Numerals in Eleven Languages

Concordant First Letters for Numbers in Eleven Languages

The subsequent distance assignments are

Single linkage

We have d₃₂ = 1, d₈₆ = 1 and d₈₇ = 1, so first form the groups (2 3) and (6 8). Single linkage dendrogram:

Complete linkage dendrogram for language data

Average linkage dendrogram for language data

Example:
Utility data (See Example 12.10)
Average linkage dendrogram for distances between 22 public utility companies

Example:
Scotch whiskey data (See Example 12.11)

Variables Nose Color Body Palate Finish

A dendrogram for similarities between 109 pure malt Scotch whiskies

Non-hierarchical Clustering Methods— Basic algorithm for K-means method: N objects (items)
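A minimal 1-D sketch of the K-means iteration (assign each item to its nearest center, recompute each center as the mean of its cluster, repeat); the helper name, starting centers, and data are hypothetical:

```python
def k_means(points, centers, iters=10):
    """Alternate nearest-center assignment and cluster-mean updates."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[k]
                   for k, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = k_means([0.0, 1.0, 1.5, 10.0, 10.5], centers=[0.0, 10.0])
```

The result depends on the starting centers, which is why the practical guidelines below suggest trying several starting points.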

Example:
Artificial data (See Example 12.12)

Example:
Utility data (See Example 12.13)

Cluster profiles (K = 5) for public utility data

Grouping by eye using pictorial representations—

Example:
Utility data (See Example 1.12)

Chernoff faces for 22 public utilities

Practical Guidelines—
1. Check for outliers
2. Try standardized and un-standardized data with several similarity (distance) measures
3. Try several clustering methods
4. Evaluate the stability of the solution
 Divide the data into 2 subsets; compare results
 Add small errors to the original data; compare results with and without errors
 If feasible, try several starting points

Multidimensional Scaling
Multidimensional scaling is a technique for representing multivariate data in low-dimensional space such that any distortion caused by the reduction in dimensionality is minimized.

Problem: For a set of observed similarities (distances) between every pair of N items, find a representation of the items in few dimensions such that the inter-item proximities “nearly match” the original similarities (distances).

The numerical measure of closeness (to the original similarities) is called stress.

Non-metric multidimensional scaling: only the rank orders of the N(N−1)/2 similarities are used to arrange the N items in a low-dimensional coordinate system.

Metric multidimensional scaling: the magnitudes of the N(N−1)/2 similarities are used to arrange the N items in a low-dimensional coordinate system. Metric multidimensional scaling is also called principal coordinate analysis.

Basic algorithm: N items (objects)

Two measures of stress:
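The two formulas are not reproduced here; one widely used measure, Kruskal's stress, compares the original distances d with the fitted distances d̂ from the low-dimensional configuration (a sketch; the function name is hypothetical):

```python
import math

def kruskal_stress(d, d_hat):
    """sqrt( sum (d_ik - d_hat_ik)^2 / sum d_ik^2 ) over all item pairs."""
    num = sum((a - b) ** 2 for a, b in zip(d, d_hat))
    den = sum(a ** 2 for a in d)
    return math.sqrt(num / den)

stress = kruskal_stress([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # perfect fit -> 0.0
```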

Stress usually interpreted according to the following guidelines:

Example:
Airline distance data (See Example 12.14)

A geometrical representation of cities produced by multidimensional scaling

Stress function for the airline distances between cities

Example:
Universities data (See Example 12.16)

A two-dimensional representation of universities produced by metric multidimensional scaling

A two-dimensional representation of universities produced by nonmetric multidimensional scaling

```