									Epidemiological Applications in Health Services Research




     Introduction to Multivariate Analysis


                     Dr. Ibrahim Awad Ibrahim.
Areas to be addressed today
      Introduction to variables and data
      Simple linear regression
      Correlation
      Population covariance
      Multiple regression
      Canonical correlation
      Discriminant analysis
      Logistic regression
      Survival analysis
      Principal component analysis
      Factor analysis
      Cluster analysis
Types of variables (Stevens’
   classification, 1951)
      Nominal
        distinct categories: race, religion,
         counties, sex
      Ordinal
        rankings: education, health status,
         smoking levels
      Interval
        equal differences between levels: time,
         temperature, blood glucose levels
      Ratio
        interval with natural zero: bone density,
         weight, height
Variables used in data analysis
         Dependent: result, outcome
           developing CHD

         Independent: explanatory
           age, sex, diet, exercise

         Latent constructs
           SES, satisfaction, health status

         Measurable indicators
           education, employment, revisit, miles
            walked
Variables in data example
  Name                                # of characters   Position
  STFIPS FIPS CODE (STATE)            1                 2
  STCENSUS                            1                 3
  LEVEL                               1                 4
  STABBREV                            1                 5
  AREANAME NAME OF US/STATE/COUNTY    7                 6
  POPULATION 1992 ABS ITEM002         7                 13
  xyz                                                   20
     Data
   Data screening and transformation
   Normality
   Independence
   Correlation (or lack of independence)
Variable types and measures of
       central tendency
       Nominal: mode
       Ordinal: median
       Interval: Mean
       Ratio: Geometric mean and harmonic
        mean
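As a minimal sketch of the pairing above, the snippet below computes one measure per variable type using Python's standard `statistics` module. All data values and variable names are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical values chosen only to illustrate each measure.
blood_type = ["A", "O", "O", "B", "O", "AB"]   # nominal
health_status = [1, 2, 2, 3, 4, 5, 5]          # ordinal ranks (1 = poor, 5 = excellent)
temperature_c = [36.5, 37.0, 36.8, 38.2]       # interval
weight_kg = [50.0, 80.0, 64.0]                 # ratio

mode_value = statistics.mode(blood_type)           # most frequent category
median_value = statistics.median(health_status)    # middle rank
mean_value = statistics.mean(temperature_c)        # arithmetic mean
geometric = statistics.geometric_mean(weight_kg)   # needs strictly positive values
harmonic = statistics.harmonic_mean(weight_kg)     # needs positive values

print(mode_value, median_value, mean_value, geometric, harmonic)
```

The geometric and harmonic means rely on a natural zero, which is why they are listed only for ratio variables.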
    Simple linear regression
          Y = A + BX

          [Figure: straight line plotted on axes X and Y, with intercept A and slope B]
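A short sketch of how A and B in Y = A + BX could be estimated by ordinary least squares. The slide does not name an estimation method, so least squares is an assumption here, and the age and blood pressure values are hypothetical.

```python
def fit_line(x, y):
    """Return (A, B) minimizing the sum of squared residuals of Y = A + B*X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: age in years (X) and systolic blood pressure in mmHg (Y).
age = [30, 40, 50, 60, 70]
sbp = [118, 124, 131, 135, 142]
A, B = fit_line(age, sbp)
print(f"Y = {A:.2f} + {B:.3f} X")
```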
     Correlation
   Mean = $\mu$

   Variance (SD)$^2$ = $\sigma^2$
   Population covariance $\sigma_{xy} = E[(X - \mu_x)(Y - \mu_y)]$
   Product moment coefficient

       $\rho = \sigma_{xy} / (\sigma_x \sigma_y)$
   It lies between -1 and 1
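The definitions on this slide can be sketched in a few lines of plain Python. The function names and the health scores below are hypothetical; the covariance divides by n, treating the data as a whole population.

```python
import math

def covariance(x, y):
    """Population covariance: average product of deviations from the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

def correlation(x, y):
    """Product moment coefficient: covariance divided by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Hypothetical scores: physical health (x) and mental health (y) for five people.
physical = [2, 4, 6, 8, 10]
mental = [1, 3, 2, 5, 4]
r = correlation(physical, mental)
print(round(r, 3))
```

Reversing the sign of one variable reverses the sign of the coefficient, which is the negative-correlation case.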
Example physical and mental health
           indicators
Negative correlation
          Population covariance

          [Figure: four panels labeled $\rho$ = 0.00, $\rho$ = 0.33, $\rho$ = 0.6, and $\rho$ = 0.88]
Multiple regression and correlation
           Simple linear $Y = \alpha + \beta X$
Multiple regression $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p$

       [Figure: EF (ejection fraction) plotted against exercise and body fat]
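A minimal sketch of fitting a multiple regression with two X variables, using NumPy's least-squares solver. The exercise, body fat, and ejection fraction values are invented for illustration and are not taken from the lecture.

```python
import numpy as np

# Hypothetical data: exercise (hours/week), body fat (%), ejection fraction (%).
exercise = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
body_fat = np.array([32.0, 30.0, 27.0, 28.0, 22.0, 20.0])
ef = np.array([48.0, 50.0, 54.0, 55.0, 60.0, 63.0])

# Design matrix: a column of ones for the intercept, then one column per X.
X = np.column_stack([np.ones_like(exercise), exercise, body_fat])
coef, *_ = np.linalg.lstsq(X, ef, rcond=None)
intercept, b_exercise, b_body_fat = coef

# Proportion of the variation in EF accounted for by the fitted model.
fitted = X @ coef
r_squared = 1 - np.sum((ef - fitted) ** 2) / np.sum((ef - ef.mean()) ** 2)
print(coef, r_squared)
```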
     Issues with regression
   Missing values
     random
     pattern
     mean substitution and maximum likelihood (ML)
   Dummy variables
     equal intervals!
   Multicollinearity
     independent variables are highly
      correlated
   Garbage can method
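One common way to check for multicollinearity is the variance inflation factor (VIF). The slide does not mention VIF, so this is a swapped-in illustration: a sketch on simulated data with hypothetical variable names, where each X is regressed on the others.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r2 = 1 - residual @ residual / np.sum((target - target.mean()) ** 2)
        vifs.append(1 / (1 - r2))
    return np.array(vifs)

# Simulated (hypothetical) independent variables.
rng = np.random.default_rng(0)
weight = rng.normal(70, 10, size=200)
bmi = 0.35 * weight + rng.normal(0, 0.5, size=200)   # nearly a function of weight
age = rng.normal(50, 12, size=200)                   # unrelated to the other two
vif = variance_inflation_factors(np.column_stack([weight, bmi, age]))
print(vif)
```

Large values for weight and bmi flag the highly correlated pair, while the value for age stays close to 1.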
     Canonical correlation
   An extension of multiple regression
   Multiple Y variables and multiple X
    variables
   Finding several linear combinations of
    the X variables and the same number of linear
    combinations of the Y variables.
   These combinations are called canonical
    variables, and the correlations between
    the corresponding pairs of canonical
    variables are called CANONICAL
    CORRELATIONS
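A sketch of one way to compute canonical correlations with NumPy, assuming the X and Y data matrices have full column rank. It uses a QR decomposition followed by a singular value decomposition, which is one of several equivalent numerical routes. The simulated data are hypothetical.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the columns of X and the columns of Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered data.
    qx, _ = np.linalg.qr(Xc)
    qy, _ = np.linalg.qr(Yc)
    # The singular values of qx' qy are the cosines of the angles between
    # the two spaces, which equal the canonical correlations.
    return np.linalg.svd(qx.T @ qy, compute_uv=False)

# Simulated data: one shared signal links the first X and first Y variable.
rng = np.random.default_rng(1)
shared = rng.normal(size=300)
X = np.column_stack([shared + 0.5 * rng.normal(size=300), rng.normal(size=300)])
Y = np.column_stack([shared + 0.5 * rng.normal(size=300), rng.normal(size=300)])
rho = canonical_correlations(X, Y)
print(rho)
```

With a single X and a single Y variable, the one canonical correlation reduces to the absolute value of the ordinary correlation coefficient.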
       Correlation matrix
       [Table: correlation matrix output; the values are not recoverable from the extracted text]
     Discriminant analysis
   A method used to classify an individual
    into one of two or more groups based on a
    set of measurements
   Examples:
     at risk for
        heart disease
        cancer
        diabetes, etc.

   It can be used for prediction and
    description
     Discriminant analysis

   [Figure: groups A and B, with individual points a and b]

            a and b are wrongly classified
            a discriminant function is used to describe
             the probability of being classified in
             the right group.
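A minimal sketch of a two-group linear discriminant function (Fisher's approach), assuming equal prior probabilities and a common covariance matrix for the two groups. The simulated measurements and group labels are hypothetical.

```python
import numpy as np

def fit_discriminant(group_a, group_b):
    """Fisher's linear discriminant for two groups with a pooled covariance matrix."""
    mean_a, mean_b = group_a.mean(axis=0), group_b.mean(axis=0)
    n_a, n_b = len(group_a), len(group_b)
    pooled = ((n_a - 1) * np.cov(group_a, rowvar=False)
              + (n_b - 1) * np.cov(group_b, rowvar=False)) / (n_a + n_b - 2)
    weights = np.linalg.solve(pooled, mean_a - mean_b)
    cutoff = weights @ (mean_a + mean_b) / 2   # midpoint between the group means
    return weights, cutoff

def classify(x, weights, cutoff):
    """Assign 'A' when the discriminant score exceeds the cutoff, otherwise 'B'."""
    return np.where(x @ weights > cutoff, "A", "B")

# Simulated measurements (e.g. blood pressure and cholesterol) for two groups.
rng = np.random.default_rng(2)
group_a = rng.normal([140.0, 240.0], [10.0, 25.0], size=(100, 2))
group_b = rng.normal([120.0, 190.0], [10.0, 25.0], size=(100, 2))
weights, cutoff = fit_discriminant(group_a, group_b)

# Members that land on the wrong side of the cutoff are the wrongly classified ones.
misclassified = (np.sum(classify(group_a, weights, cutoff) == "B")
                 + np.sum(classify(group_b, weights, cutoff) == "A"))
print(misclassified, "of 200 wrongly classified")
```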
     Logistic regression
   An alternative to discriminant analysis to
    classify an individual into one of two
    populations based on a set of criteria.
   It is appropriate for any combination of
    discrete or continuous variables
   It uses maximum likelihood estimation
    to fit a model that classifies individuals based
    on the independent variable list.
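A sketch of maximum likelihood estimation for a logistic regression, using Newton-Raphson iterations in NumPy. The simulated data, coefficient values, and the 0.5 classification cutoff are assumptions for illustration. Statistical packages add convergence checks and standard errors that are omitted here.

```python
import numpy as np

def fit_logistic(X, y, iterations=25):
    """Maximum likelihood fit of P(y=1) = 1 / (1 + exp(-(a + b'x))) by Newton-Raphson."""
    design = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(design.shape[1])
    for _ in range(iterations):
        p = 1 / (1 + np.exp(-design @ beta))
        gradient = design.T @ (y - p)                            # score vector
        hessian = design.T @ (design * (p * (1 - p))[:, None])   # information matrix
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

# Simulated data: one continuous and one discrete independent variable.
rng = np.random.default_rng(3)
age = rng.uniform(30, 80, size=500)
smoker = rng.integers(0, 2, size=500)
true_logit = -6.0 + 0.08 * age + 1.0 * smoker
disease = (rng.uniform(size=500) < 1 / (1 + np.exp(-true_logit))).astype(float)

beta = fit_logistic(np.column_stack([age, smoker]), disease)
probability = 1 / (1 + np.exp(-(beta[0] + beta[1] * age + beta[2] * smoker)))
predicted = (probability >= 0.5).astype(float)   # classify with a 0.5 cutoff
print(beta)
```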
Survival analysis (event history
           analysis)
        Analyzes the length of time it takes a
         specific event to occur.
        Time to death, organ failure, retirement,
         etc.
        Length of time = function of {explanatory
         variables (covariates)}
       Survival data example
        [Figure: subject timelines on an axis from 1980 to 1990; three end in "died", one in "lost", and one is still "surviving"]
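A small sketch of how follow-up records like these are commonly encoded for analysis: a follow-up time plus an indicator of whether the event was observed. The entry and exit years below are hypothetical, since the figure's exact values are not available.

```python
# Hypothetical follow-up records: (entry year, exit year, status at exit).
records = [
    (1980, 1983, "died"),
    (1981, 1986, "died"),
    (1982, 1989, "died"),
    (1980, 1984, "lost"),
    (1983, 1990, "surviving"),
]

def to_survival_data(records):
    """Convert each record to (follow-up time in years, event indicator).

    The indicator is 1 when death was observed and 0 when the observation
    is censored, that is, the subject was lost or was still alive at the end.
    """
    return [(exit_year - entry_year, 1 if status == "died" else 0)
            for entry_year, exit_year, status in records]

survival_data = to_survival_data(records)
print(survival_data)
```

For lost and surviving subjects the follow-up time is only a lower bound on the true survival time, which is why these observations need special handling.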
                  Log-linear regression
                A regression model in which the
                 dependent variable is the log of survival
                 time (t) and the independent variables
                 are the explanatory variables.

Multiple regression $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p$

$\log(t) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_p X_p + e$
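A sketch of the log-linear model fitted by least squares on simulated survival times. It assumes every survival time is fully observed (no lost or still-surviving subjects). The covariates and coefficient values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
age = rng.uniform(40, 80, size=n)
treated = rng.integers(0, 2, size=n).astype(float)

# Simulated survival times (years) whose logarithm is linear in the covariates.
log_t = 3.0 - 0.03 * age + 0.5 * treated + rng.normal(0, 0.3, size=n)
t = np.exp(log_t)

# Regress log(t) on the explanatory variables.
X = np.column_stack([np.ones(n), age, treated])
coef, *_ = np.linalg.lstsq(X, np.log(t), rcond=None)
alpha, beta_age, beta_treated = coef

# exp(beta) is the multiplicative change in survival time per unit of X.
time_ratio_treated = np.exp(beta_treated)
print(coef, time_ratio_treated)
```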
Cox proportional hazards model
           Another method to model the
            relationship between survival time and a
            set of explanatory variables.
           Proportion of the population who die up
            to time (t) is the lined area

     [Figure: curve over a time axis from 1980 to 1990, with the area up to time t lined]
Cox proportional hazards model
          The hazard function (h) at time (t) is
           proportional among groups 1 & 2 so that
          h1(t)/h2(t) is constant over time.
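In the usual form of the Cox model the hazard is written as a baseline hazard multiplied by exp(beta * x). The sketch below uses a made-up baseline hazard and coefficient to show that the ratio of two groups' hazards then does not depend on time.

```python
import math

def baseline_hazard(t):
    """Hypothetical baseline hazard that rises with time (illustration only)."""
    return 0.01 + 0.002 * t

def hazard(t, x, beta):
    """Hazard at time t for covariate value x: baseline hazard times exp(beta * x)."""
    return baseline_hazard(t) * math.exp(beta * x)

beta = 0.7   # hypothetical coefficient for group membership (x = 1 versus x = 0)
times = [1, 2, 5, 10]
ratios = [hazard(t, 1, beta) / hazard(t, 0, beta) for t in times]
print(ratios)
```

Each ratio equals exp(beta), because the baseline hazard cancels out of the quotient.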
Principal component analysis
          Aimed at simplifying the description of a
           set of interrelated variables.
          All variables are treated equally.
          You end up with uncorrelated new
           variables called principal components.
          Each one is a linear combination of the
           original variables.
          The measure of the information
           conveyed by each is the variance.
          The PCs are arranged in descending
           order of the variance explained.
Principal component analysis
        A general rule is to select PCs that each explain
         at least 5% of the total variance, but you can set
         the cutoff higher for parsimony purposes.
        Theory should guide this selection of
         cutoff point.
        Sometimes it is used to alleviate
         multicollinearity.
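A sketch of principal component analysis through an eigendecomposition of the correlation matrix. Working from standardized variables is an assumption made here, and the simulated measurements are hypothetical. The last line applies the 5% rule of thumb.

```python
import numpy as np

def principal_components(data):
    """Return (proportion of variance, weights, scores) from the correlation matrix."""
    standardized = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(standardized, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1]           # descending variance
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    scores = standardized @ eigenvectors            # the new, uncorrelated variables
    return eigenvalues / eigenvalues.sum(), eigenvectors, scores

rng = np.random.default_rng(5)
size = rng.normal(size=300)
data = np.column_stack([
    size + 0.3 * rng.normal(size=300),   # three interrelated measurements
    size + 0.3 * rng.normal(size=300),
    size + 0.3 * rng.normal(size=300),
    rng.normal(size=300),                # one unrelated measurement
])
explained, weights, scores = principal_components(data)
kept = np.flatnonzero(explained >= 0.05)   # components explaining at least 5%
print(explained, kept)
```

Each column of `weights` holds the coefficients of one linear combination of the original (standardized) variables.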
     Factor analysis
   The objective is to understand the
    underlying structure explaining the
    relationship among the original variables.
   We use the factor loading of each of the
    variables on the factors generated to
    determine the usability of a certain
    variable.
   Theory again guides the interpretation
    of the structures depicted by the
    common factors encompassing the
    selected variables.
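A sketch of one simple way to obtain factor loadings, the principal component method, where each loading is an eigenvector entry scaled by the square root of its eigenvalue. The slide does not say which extraction method is intended, so this choice is an assumption. The latent constructs and indicators are simulated.

```python
import numpy as np

def principal_factor_loadings(data, n_factors):
    """Loadings from the principal component method: eigenvector * sqrt(eigenvalue)."""
    corr = np.corrcoef(data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1][:n_factors]
    loadings = eigenvectors[:, order] * np.sqrt(eigenvalues[order])
    # Eigenvector signs are arbitrary; flip so each factor's largest loading is positive.
    largest = loadings[np.abs(loadings).argmax(axis=0), np.arange(n_factors)]
    return loadings * np.sign(largest)

rng = np.random.default_rng(6)
n = 2000
ses = rng.normal(size=n)            # latent construct (not observed in practice)
satisfaction = rng.normal(size=n)   # second latent construct
data = np.column_stack([
    ses + 0.5 * rng.normal(size=n),            # three indicators of the first construct
    ses + 0.5 * rng.normal(size=n),
    ses + 0.5 * rng.normal(size=n),
    satisfaction + 0.5 * rng.normal(size=n),   # two indicators of the second construct
    satisfaction + 0.5 * rng.normal(size=n),
])
loadings = principal_factor_loadings(data, n_factors=2)
communalities = (loadings ** 2).sum(axis=1)   # variance of each variable explained
print(np.round(loadings, 2))
```

Variables that load highly on a factor are the ones that factor helps explain; a variable with low loadings on every factor contributes little to the structure.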
     Cluster analysis
   A classification method for individuals
    into previously unknown groups
   It proceeds from the most general to the
    most specific:
   Kingdom: Animalia
     Phylum: Chordata
       Subphylum: Vertebrata
          Class: Mammalia
             Order: Primates
                 Family: Hominidae
                      Genus: Homo
                          Species: sapiens
     Patient clustering
   Major: patients
     Types: medical
       Subtype: neurological
          Class: genetic
             Order: late onset
                  Disease: Guillain-Barré syndrome
   Hierarchical: divisive or agglomerative
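A minimal sketch of agglomerative hierarchical clustering, which starts with every individual in its own cluster and repeatedly merges the two closest clusters. Single linkage and Euclidean distance are assumptions made for this example, and the patient measurements are hypothetical.

```python
import numpy as np

def agglomerative_clusters(points, n_clusters):
    """Single-linkage agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]   # one cluster per individual
    distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = distances[np.ix_(clusters[a], clusters[b])].min()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = np.empty(len(points), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Hypothetical patients described by two standardized measurements.
patients = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
                     [5.0, 5.1], [5.2, 4.9], [4.9, 5.3]])
labels = agglomerative_clusters(patients, n_clusters=2)
print(labels)
```

A divisive method works in the opposite direction, starting from one all-inclusive group and splitting it step by step.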
Conclusions
       Presentation Schedule
   4 each on 4/22 and 4/27
   5 on 4/29
   Each presentation should be a maximum of
    10 minutes, plus 5 minutes for discussion
   E-mail me your software and hardware
    requirements for your presentation.
   Final projects due 5/7/99 by 5:00 pm in
    my office.
       Presentation Schedule 1

Date     Time        Who
4/22     1:00 - 1:15
         1:16 - 1:30
         1:31 - 1:45
         1:46 - 2:00
       Presentation Schedule 2

Date     Time        Who
4/27     1:00 - 1:15
         1:16 - 1:30
         1:31 - 1:45
         1:46 - 2:00
         2:01 - 2:15
         Presentation Schedule 3

Date   Time        Who
4/29   1:00 - 1:15
       1:16 - 1:30
       1:31 - 1:45
       1:46 - 2:00

								