# Introduction to Multivariate Analysis by 9wII2w

```
Epidemiological Applications in Health Services Research

Introduction to Multivariate Analysis

   Introduction to variables and data
   Simple linear regression
   Correlation
   Population covariance
   Multiple regression
   Canonical correlation
   Discriminant analysis
   Logistic regression
   Survival analysis
   Principal component analysis
   Factor analysis
   Cluster analysis
Types of variables (Stevens' classification, 1951)
   Nominal
 distinct categories: race, religions, counties, sex
   Ordinal
 rankings: education, health status, smoking levels
   Interval
 equal differences between levels: time, temperature, glucose blood levels
   Ratio
 interval with natural zero: bone density, weight, height
Variables used in data analysis
   Dependent: result, outcome
 developing CHD

   Independent: explanatory
 age, sex, diet, exercise

   Latent constructs
 SES, satisfaction, health status

   Measurable indicators
 education, employment, revisit, miles walked
Variables in data example
Name         Description                # of characters   Position
STFIPS       FIPS CODE (STATE)          1                 2
STCENSUS                                1                 3
LEVEL                                   1                 4
STABBREV                                1                 5
AREANAME     NAME OF US/STATE/COUNTY    7                 6
POPULATION   1992 ABS ITEM002           7                 13
xyz                                                       20
Data
   Data screening and transformation
   Normality
   Independence
   Correlation (or lack of independence)
Variable types and measures of central tendency
   Nominal: mode
   Ordinal: median
   Interval: mean
   Ratio: geometric mean and harmonic mean
Simple linear regression
Y = A + BX

[Figure: the line Y = A + BX plotted against X, with intercept A and slope B]
Correlation
   Mean = μ
   Variance (SD)² = σ²
   Population covariance = σxy = E[(X - μx)(Y - μy)]
   Product moment coefficient = ρ = σxy / (σx σy)
   It lies between -1 and 1
Example physical and mental health indicators
Negative correlation
Population covariance
[Figure: scatter plots for ρ = 0.00, ρ = 0.33, ρ = 0.6, and ρ = 0.88]
Multiple regression and correlation
Simple linear Y = α + βX
Multiple regression Y = α + β1X1 + β2X2 + β3X3 + . . . + βpXp

[Figure: EF (ejection fraction) plotted with exercise and body fat]
Issues with regression
   Missing values
 random
 pattern
 mean substitution and maximum likelihood (ML)
   Dummy variables
 equal intervals!
   Multicollinearity
 independent variables are highly correlated
   Garbage can method
Canonical correlation
   An extension of multiple regression
   Multiple Y variables and multiple X
variables
   It finds several linear combinations of the X variables and the same
number of linear combinations of the Y variables.
   These combinations are called canonical variables, and the correlations
between the corresponding pairs of canonical variables are called
canonical correlations.
Correlation matrix
[Table: correlation matrix output; the cell values are not recoverable from the extracted text]
Discriminant analysis
   A method used to classify an individual into one of two or more groups
based on a set of measurements
   Examples:
 at risk for heart disease, cancer, diabetes, etc.

   It can be used for prediction and
description
Discriminant analysis

[Figure: groups A and B, with two individual points a and b]

   a and b are wrongly classified
   The discriminant function describes the probability of being classified
in the right group.
Logistic regression
   An alternative to discriminant analysis to
classify an individual in one of two
populations based on a set of criteria.
   It is appropriate for any combination of
discrete or continuous variables
   It uses maximum likelihood estimation to fit the model, which then
classifies individuals based on the independent variable list.
Survival analysis (event history analysis)
   Analyzes the length of time it takes for a specific event to occur.
   Time to death, organ failure, retirement, etc.
   Length of time = function of {explanatory variables (covariates)}
Survival data example
[Figure: patient follow-up timelines on an axis from 1980 to 1990, with outcomes labeled died, lost, or surviving]
Log-linear regression
   A regression model in which the
dependent variable is the log of survival
time (t) and the independent variables
are the explanatory variables.

Multiple regression Y = α + β1X1 + β2X2 + β3X3 + . . . + βpXp

Log (t) = α + β1X1 + β2X2 + β3X3 + . . . + βpXp + e
Cox proportional hazards model
    Another method to model the
relationship between survival time and a
set of explanatory variables.
    The proportion of the population who die up to time (t) is the
shaded area
[Figure: distribution of death times on an axis from 1980 to 1990, with the area up to time t shaded]
Cox proportional hazards model
   The hazard functions h(t) of groups 1 and 2 are proportional at every
time (t), so that
   h1(t)/h2(t) is constant.
Principal component analysis
   Aimed at simplifying the description of a
set of interrelated variables.
   All variables are treated equally.
   You end up with uncorrelated new
variables called principal components.
   Each one is a linear combination of the
original variables.
   The measure of the information
conveyed by each is the variance.
   The PCs are arranged in descending
order of the variance explained.
Principal component analysis
   A general rule is to select PCs that each explain at least 5% of the
variance, but you can set a higher cutoff for parsimony.
   Theory should guide the selection of the cutoff point.
   PCA is sometimes used to alleviate multicollinearity.
Factor analysis
   The objective is to understand the
underlying structure explaining the
relationship among the original variables.
   ... variables on the factors generated to determine the usability of a
certain variable.
   Theory again guides the interpretation of which structures the common
factors represent for the selected variables.
Cluster analysis
   A method for classifying individuals into previously unknown groups
   It proceeds from the most general to the most specific:
   Kingdom: Animalia
 Phylum: Chordata
 Subphylum: Vertebrata
 Class: Mammalia
 Order: Primates
 Family: Hominidae
 Genus: Homo
 Species: sapiens
Patient clustering
   Major: patients
 Types: medical
 Subtype: neurological
 Class: genetic
 Order: late onset
 Disease: Guillain-Barré syndrome
   Hierarchical: divisive or agglomerative
Conclusions
Presentation Schedule
   4 each on 4/22 and 4/27
   5 on 4/29
   Each presentation should be a maximum of 10 minutes, plus 5 minutes
for discussion
   E-mail me your software requirements
   Final projects due 5/7/99 by 5:00 pm in
my office.
Presentation Schedule 1

Date     Time          Who
4/22     1:00 - 1:15
         1:16 - 1:30
         1:31 - 1:45
         1:46 - 2:00
Presentation Schedule 2

Date     Time          Who
4/27     1:00 - 1:15
         1:16 - 1:30
         1:31 - 1:45
         1:46 - 2:00
         2:01 - 2:15
Presentation Schedule 3

Date     Time          Who
4/29     1:00 - 1:15
         1:16 - 1:30
         1:31 - 1:45
         1:46 - 2:00

```
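The slides on simple linear regression and correlation describe a fitted line Y = A + BX and a product moment coefficient that lies between -1 and 1. A minimal Python sketch of both calculations follows. The data values and the helper names (`covariance`, `correlation`, `fit_line`) are made up for illustration and do not come from the course material.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance: the average of (x - mean_x) * (y - mean_y)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Product moment coefficient: covariance divided by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))


def fit_line(xs, ys):
    """Least-squares intercept A and slope B for Y = A + BX."""
    b = covariance(xs, ys) / covariance(xs, xs)
    a = mean(ys) - b * mean(xs)
    return a, b


# Hypothetical data: weekly exercise hours (X) and body fat percentage (Y).
exercise_hours = [1.0, 2.0, 3.0, 4.0, 5.0]
body_fat_pct = [30.0, 27.0, 26.0, 22.0, 20.0]

r = correlation(exercise_hours, body_fat_pct)
a, b = fit_line(exercise_hours, body_fat_pct)
print(f"r = {r:.3f}, A = {a}, B = {b}")  # negative correlation, A = 32.5, B = -2.5
```

With these invented numbers the coefficient is close to -1, which matches the "negative correlation" case in the slides: body fat falls as exercise rises.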
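The slide on variable types pairs each measurement level with a measure of central tendency: mode for nominal, median for ordinal, mean for interval, and geometric or harmonic mean for ratio data. The sketch below shows one way to compute each with Python's standard `statistics` module. All data values are hypothetical, and the 1 to 5 coding of health status is an assumption made for the example.

```python
from statistics import geometric_mean, harmonic_mean, mean, median, mode

# Nominal: categories with no order, so only the most frequent value is meaningful.
religion = ["A", "B", "A", "C", "A"]
print("mode:", mode(religion))

# Ordinal: ranked categories (assumed coding: 1 = poor ... 5 = excellent).
health_status = [1, 2, 2, 3, 5]
print("median:", median(health_status))

# Interval: equal differences between levels, no natural zero (degrees Celsius).
temperature_c = [36.5, 37.0, 38.5]
print("mean:", round(mean(temperature_c), 2))

# Ratio: interval scale with a natural zero, so multiplicative summaries make sense.
weight_kg = [40.0, 90.0]
print("geometric mean:", round(geometric_mean(weight_kg), 2))
print("harmonic mean:", round(harmonic_mean(weight_kg), 2))
```

For positive data the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean. This gives a quick sanity check on ratio-scale summaries.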