Multivariate Analysis by A0R01wt


									Multivariate Analysis
   Multivariate thinking
    ◦ Body of thought processes that illuminate the
      interrelatedness between and within sets of variables
   The essence of multivariate thinking is to
    expose the inherent structure and
    meaning revealed within these sets of
    variables through application and
    interpretation of various statistical methods
Why the multivariate approach?
   Big idea- multiple response outcomes
   With univariate analyses we have just one dependent
    variable of interest
   Although any analysis of data involving more than one
    variable could be seen as ‘multivariate’, we typically
    reserve the term for multiple dependent variables
   So MV analysis is an extension of UV ones, or
    conversely, many of the UV analyses are special cases of
    MV ones
Why MV over the univariate approach?
   Complexity
    ◦ The subject/data studied may be more
      complex than what univariate methods can
      offer in terms of analysis
   Reality
    ◦ In some cases it would be inappropriate to
      conduct univariate analysis as the
      data/research demand a multivariate analysis
Why MV over the univariate approach?
   Experimental data
    ◦ Although experimental research can be and often is
      multivariate, typically subjects are assigned to groups and
      the manipulations regard corresponding changes to a
      single outcome
       Different doses of caffeine → test performance
       Causality is more easily deduced
   Non-experimental data
    ◦ Likewise survey/inventory data might be analyzed in
      univariate fashion, but typically it will require the
      multivariate approach to solve the questions stemming
      from it
       Correlational
Why not MV?
 In the past the computations were
  overwhelming even with smaller datasets,
  and so MV analyses were typically avoided
 Now this is not a problem but there are
  still reasons to not do a MV analysis
Why not MV?
   Ambiguity
    ◦ MV analysis may result in a less clear understanding of the data
         E.g. group differences on a linear combination of DVs (MANOVA) can be hard to interpret
          By contrast, differences on a single DV are easily interpreted in a univariate sense
    ◦ Ambiguity because of ignorance of the technique is not a valid
      reason however
   Unnecessary complexity
    ◦ Just because SEM looks neat/is popular doesn’t mean you have to do
      one, or that it is the best way to answer your research question
   No free lunch
    ◦ MV analyses come with their own rules and assumptions that may make
      analysis difficult or not as strong
Multivariate Pros and Cons Summary
   Advantages of using a multivariate statistic
    ◦ Richer realistic design
    ◦ Looks at phenomena in an overarching way (provides
      multiple levels of analysis)
    ◦ Each method differs in amount or type of Independent
      Variables (IVs) and DVs
    ◦ Can help control for Type I Error
   Disadvantages
    ◦ Larger Ns are often required
    ◦ More difficult to interpret
    ◦ Less known about the robustness of assumptions
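
The Type I error point above can be made concrete with a small calculation. As a sketch, assume k separate univariate tests, each run at level α and (an assumption made only for illustration) independent of one another. The chance of at least one false positive across the set is then:

```latex
\[
\alpha_{FW} = 1 - (1 - \alpha)^{k},
\qquad \text{e.g. } \alpha = .05,\; k = 5:\quad
\alpha_{FW} = 1 - (.95)^{5} \approx .226
\]
```

A single multivariate test on the k outcomes is one test at level α, which is the sense in which the MV approach can help here. Real outcomes are usually correlated, so the figure above is only a rough guide.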
Primary purposes of MV analysis
 Prediction and explanation
 Determining structure
   The goal in most research situations is to be
    able to predict outcomes based on prior information
    ◦ E.g. given a person’s gender and region, what will their
      attitude be on some social issue?
    ◦ Given a number of variables how well can we predict
      group membership?
   Explanation
    ◦ Which variables are most important in the prediction of
      some outcome?
    ◦ In many cases this is the end goal of an analysis, though a very
      problematic one
A caveat regarding ‘explanation’
 Determining variable importance can be suspect
 A variable deemed statistically significant in one
  study might not make the cut if the study were
  conducted again
 Depending on a number of factors, results may be
  sample specific
    ◦ i.e. you may not see the same ordering next time
 A different goal in MV analysis is to determine the structure
  of the data
  ◦ Is there an underlying dimension that can describe the
     data in a simpler fashion?
 Methods involve classification and/or data reduction
 Latent variables (constructs)
    ◦ Example:
       Observed variables Giddiness, Silliness, Irrationality, Possessiveness
        and Misunderstanding reduced to the underlying construct of ‘Love’
   Interest may be in reducing variables (Factor analysis),
    emphasis on group membership (Cluster analysis), stimulus
    structure (MDS) etc.
Prediction and Structure
   Both prediction and structure may be the
    goal of analysis
    ◦ SEM and path analysis
   How well does the model fit the data?
Multivariate Themes
 Multiplicity theme: multiple considerations at all levels of
  focus, with greater multiplicity generally leading to
  greater reliability, validity, and generalization
    ◦ Theories and
    ◦ Multiple Time Points
    ◦ Multiple Controls
    ◦ Multiple Samples
Things to consider
 Initial variable choice
 Comes down to:
    ◦   Familiarity with previous research
    ◦   Instrument used
    ◦   Expertise with field of study
    ◦   Common sense
   Much of the ‘hard work’ consists of developing a
    plan of attack and deciding on how to study the problem
Initial Examination of Data
   Preliminary analysis
    ◦ A thorough initial examination of the data is
      necessary for a full understanding of any
      research
    ◦ Such initial analyses provide a better grasp of
      what is happening in the data and may inform the
      MV analysis to a certain extent
   However, in the MV case, if the actual goal is
    interpretation of the UV analyses (as one
    often sees in MANOVA), the MV analysis is unnecessary
More to consider
   Intro now, more details as we discuss each method
   Assumptions: important for inferences beyond the sample
   Normality: Basic assumption of the General Linear Model;
    concerned with an elliptical pattern of residuals
     ◦ Skewness: Distribution of scores is tilted
       Direction established by tail
       greater skewness = less normality
    ◦ Kurtosis: Degree of peakedness of data
       3 Types: leptokurtic (thin); mesokurtic (normal); platykurtic (flat)
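
As a rough illustration of the two shape measures above, the sketch below computes skewness and excess kurtosis from the simple (biased) moment formulas. The function name `shape` and the two small data sets are made up for the example; statistical packages often apply small-sample corrections, so their values may differ slightly.

```python
def shape(xs):
    """Return (skewness, excess kurtosis) from the biased moment formulas."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment (variance)
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0       # 0 for a normal distribution
    return skewness, excess_kurtosis

symmetric = [1, 2, 3, 4, 5, 6, 7]        # flat, symmetric scores
right_tailed = [1, 1, 2, 2, 3, 4, 20]    # one long tail to the right

skew_sym, kurt_sym = shape(symmetric)
skew_tail, kurt_tail = shape(right_tailed)
print(skew_sym, kurt_sym)
print(skew_tail, kurt_tail)
```

A positive skewness indicates a tail to the right; negative excess kurtosis indicates a flatter (platykurtic) shape than the normal curve, positive a more peaked (leptokurtic) one.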
More to consider
   Linearity
    ◦ Data form a roughly straight, elongated oval of points when plotted
   Homoscedasticity
    ◦ Variance of one variable is equal at all levels of the other variables
       Assessed through standard deviations across variables and scatterplots
    ◦ Referred to as homogeneity of variance in ANOVA methods
   Homogeneity of regression
    ◦ Regression slopes between covariate and DV are equal across
      groups of IV
    ◦ The F test for unequal slopes should not be significant; if it is,
      the assumption for (M)ANCOVA is violated
More to consider
   Multicollinearity
    ◦   Correlation coefficient (r) between predictors is noticeably large
    ◦   Causes instability in the statistical procedure
    ◦   Can’t differentiate which variables are contributing to outcome
    ◦   Singularity
         Redundant variables bring the determinant of the matrix to zero
   Orthogonality
    ◦ Allows no association among variables
    ◦ Not realistic in real world data
    ◦ May allow greater interpretability versus data that are too
      highly correlated
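
The singularity point under Multicollinearity can be checked numerically. The sketch below uses invented data and helper names (`corr`, `det3`) and compares the determinant of a 3x3 correlation matrix when the third variable is unrelated by construction with the case where it is exactly the sum of the other two.

```python
import math

def corr(u, v):
    """Pearson correlation between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def det3(a, b, c):
    """Determinant of the 3x3 correlation matrix of variables a, b, c."""
    r12, r13, r23 = corr(a, b), corr(a, c), corr(b, c)
    return 1 + 2 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2

x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
x3 = [5, 3, 6, 1, 2, 4]                   # not a combination of x1 and x2
total = [a + b for a, b in zip(x1, x2)]   # redundant: exactly x1 + x2

det_ok = det3(x1, x2, x3)
det_singular = det3(x1, x2, total)
print(det_ok, det_singular)
```

With the redundant variable the determinant is zero up to floating-point error, so the matrix cannot be inverted; merely high (not perfect) correlations push the determinant toward zero and make the inversion unstable.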
More to consider
   Outliers
    ◦ Affect the mean (inflate/deflate), disguising the true relationship
    ◦ Distort the data: they create noise (error) and reduce power

    ◦ Transformations (log or square root) may be helpful
       Reshape the distribution, making it closer to normal
       However you now have a scale with which you are unfamiliar
        and which you cannot generalize back to the original scale
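
A small sketch of the trade-off just described, using made-up right-skewed scores and an illustrative `skewness` helper: the log transformation pulls in the long tail, but the mean of the logged scores does not translate back into the mean of the original scores.

```python
import math

def skewness(xs):
    """Skewness from the biased moment formula."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

raw = [1, 2, 2, 3, 4, 8, 16, 64]           # hypothetical right-skewed scores
logged = [math.log(x) for x in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)

mean_raw = sum(raw) / len(raw)
# Exponentiating the mean of the logs gives the geometric mean,
# which is not the arithmetic mean on the original scale.
back_transformed = math.exp(sum(logged) / len(logged))

print(skew_raw, skew_log)
print(mean_raw, back_transformed)
```

The logged scores are less skewed, yet the back-transformed mean is well below the raw mean, which is one concrete form of the "unfamiliar scale" problem.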
Some distinctions
   Types of data
    ◦ Nominal/Categorical
    ◦ Ordinal
    ◦ Continuous
      Interval or Ratio
   The types of variables involved will say
    much about what analyses are going to be
    appropriate and/or how one might
    proceed with a particular analysis
Types of data
 One thing to keep in mind is that these
  distinctions are largely arbitrary
 One can dichotomize a continuous measure into two categories
    ◦ A bad idea most of the time
 An ordinal measure (e.g. a Likert question) has a
  mean/construct that actually falls along a continuum
 How the data are to be considered is largely left to
  the researcher
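
One reason dichotomizing is usually a bad idea is that it throws away information. The simulation below is a sketch under stated assumptions (simulated normal data, a fixed random seed, a median split, and an illustrative `corr` helper); it compares the correlation with an outcome before and after the split.

```python
import math
import random

def corr(u, v):
    """Pearson correlation between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 1000
x = [rng.gauss(0, 1) for _ in range(n)]
# Outcome built to correlate about .6 with x in the population
y = [0.6 * a + 0.8 * rng.gauss(0, 1) for a in x]

median_x = sorted(x)[n // 2]
x_split = [1 if a >= median_x else 0 for a in x]   # "high" vs "low" groups

r_full = corr(x, y)
r_split = corr(x_split, y)
print(r_full, r_split)
```

The correlation based on the high/low split is noticeably weaker than the one based on the full measure, even though the underlying relationship has not changed.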
Sample vs. Population
 In typical research we are rarely dealing with a population
 The goal in research is not to simply describe our
  data but to generalize to the real world
 Many analyses and data collections are, for a variety of
  (not good) reasons, sample-specific and of little
  use to the scientific community
 Take care in the initial phase of research planning
  to help guard against such a situation
The linear combination of variables
   Whether of IVs or DVs, a linear
    combination of variables is often
    necessary to interpret the data
    ◦ This idea is essential to thinking multivariately
   MultReg
    ◦ Finding the linear combination of IVs that best
      predicts the DV
   MANOVA
    ◦ Finding the linear combination of DVs that maximizes
      the distinction between groups
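
To make the multiple regression case concrete, the sketch below finds least-squares weights for two IVs by solving the normal equations on centered data. The data and all names are invented for illustration. It then checks that the weighted composite correlates with the DV at least as strongly as either IV alone.

```python
import math

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def corr(u, v):
    cu, cv = center(u), center(v)
    return dot(cu, cv) / math.sqrt(dot(cu, cu) * dot(cv, cv))

# Hypothetical scores: y is roughly 2*x1 + x2 plus a little noise
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [3, 1, 4, 1, 5, 9, 2, 6]
noise = [0.5, -0.5, 0.5, -0.5, -0.5, 0.5, -0.5, 0.5]
y = [2 * a + b + e for a, b, e in zip(x1, x2, noise)]

c1, c2, cy = center(x1), center(x2), center(y)
s11, s22, s12 = dot(c1, c1), dot(c2, c2), dot(c1, c2)
s1y, s2y = dot(c1, cy), dot(c2, cy)

# Solve the 2x2 normal equations by Cramer's rule
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

composite = [b1 * a + b2 * b for a, b in zip(x1, x2)]  # the linear combination
multiple_R = corr(composite, y)
print(b1, b2, multiple_R, corr(x1, y), corr(x2, y))
```

The weights b1 and b2 define the linear combination; multiple R is simply the correlation between that composite and the DV.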
How many variables
   Considerations
    ◦   Cost
    ◦   Availability
    ◦   Meaningfulness
    ◦   Theory
   For ease of understanding and efficiency we
    typically want the fewest number of variables that
    will explain the most
    ◦ Ockham’s razor
Statistical power and effect size
   A problem that has plagued the social sciences is
    the lack of power to find subtle effects
   Some multivariate procedures will require
    relatively large amounts of data (e.g. SEM)
   Power and sample size are a required
    consideration before any attempt at research,
    multivariate or otherwise
   After the fact, emphasis should be placed on
    effect size and model fit, rather than p-values
   More later…
The matrices of interest
   Data matrix
    ◦ What you see in SPSS or whatever program you’re using
    ◦ Includes the cases and their corresponding values for the
      variables of interest
   Correlation matrix- R
    ◦ Contains information about the linear relationship between variables
       Standardized covariance: $r = \dfrac{\mathrm{cov}_{xy}}{s_x s_y}$
    ◦ Symmetrical
    ◦ Square
    ◦ Typically only the bottom portion is shown as the top portion is
      its mirror image and the diagonal contains all ones (each variable
      is perfectly correlated with itself)
The matrices of interest
   Variance/Covariance matrix - Σ
    ◦ Square and symmetrical
    ◦ Variance of each variable is on the diagonal,
      covariances with other variables on the off-diagonal
   In some cases you will have the option to
    use correlations or covariances as the
    unit of analysis, with some debate about
    which is better under what circumstances
The matrices of interest
 Sum of Squares and cross-products matrix - S
 Precursor to the Variance/Covariance matrix (the
  values before division by N-1)
 On the diagonal is a variable’s sum of the squared
  deviations from its mean
 Off-diagonal elements are the sum of the
  products of the deviation scores for the two variables
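
The relationship among the matrices described above can be traced in a few lines. This is a minimal sketch on a made-up data matrix with 4 cases and 3 variables, using plain Python lists; S, cov (for Σ) and R follow the slide notation.

```python
import math

# Data matrix: rows are cases, columns are variables (hypothetical values)
data = [
    [2.0, 4.0, 3.0],
    [4.0, 6.0, 1.0],
    [6.0, 5.0, 2.0],
    [8.0, 9.0, 6.0],
]
n = len(data)
p = len(data[0])

means = [sum(row[j] for row in data) / n for j in range(p)]
dev = [[row[j] - means[j] for j in range(p)] for row in data]  # deviation scores

# Sum of squares and cross-products matrix S
S = [[sum(d[i] * d[j] for d in dev) for j in range(p)] for i in range(p)]

# Variance/covariance matrix: S divided by N - 1
cov = [[S[i][j] / (n - 1) for j in range(p)] for i in range(p)]

# Correlation matrix R: each covariance divided by the two standard deviations
sd = [math.sqrt(cov[j][j]) for j in range(p)]
R = [[cov[i][j] / (sd[i] * sd[j]) for j in range(p)] for i in range(p)]

for row in R:
    print([round(r, 3) for r in row])
```

Each step only rescales the previous matrix, which is why all three are square and symmetrical and why R has ones on its diagonal.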
Methods of analysis
 A host of methods are available to the researcher
 The kind of question asked will help guide
  one in choosing the appropriate analysis;
  however, the data may be amenable to
  multiple methods, and almost always are
Degree of relationship
   Bivariate r
    ◦ The degree of linear relationship between two variables
    ◦ Partial and semi-partial
   Multiple R
    ◦ The relationship of a set of variables to another (dependent) variable
   Canonical R
    ◦ The granddaddy
    ◦ Relationship between sets of variables
   Methods are also available to assess the relationship among
    non-continuous variables
    ◦ E.g. Chi-square, Multiway Frequency Analysis
Group Differences
 Very popular research question in social
  sciences (too popular really)
 Is group A different from B?
    ◦ The answer is always yes, and with a large
      enough sample, statistically significantly so
 ANOVA and related methods
 MANOVA, the multivariate counterpart
 Repeated measures
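
The claim that a large enough sample makes any group difference statistically significant follows directly from the formula for the t statistic. The sketch below assumes two equal-sized groups, a standard deviation of 1 in each, and an observed mean difference that equals a tiny standardized difference of .02; the function name and the sample sizes are illustrative.

```python
import math

def t_statistic(d, n_per_group):
    """t for two equal-sized groups with unit SD and observed mean difference d."""
    standard_error = math.sqrt(2.0 / n_per_group)
    return d / standard_error

tiny_difference = 0.02   # 2% of a standard deviation
for n in (100, 10_000, 100_000):
    print(n, t_statistic(tiny_difference, n))
```

The difference itself never changes; only the standard error shrinks as n grows, until t passes the usual cutoff of about 1.96. This is why effect size, not the p-value, is the better guide to whether a group difference matters.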
Predicting group membership
 Turning the group difference question the
  other way around
 Discriminant function analysis
 Logistic regression
   Data reduction and classification
   Cluster analysis
    ◦ Seeks to identify homogeneous subgroups of cases or
      variables based on some measure of ‘distance’
    ◦ Identify a set of groups in which within-group variation is
      minimized and between-group variation is maximized
   Principal components and Factor analysis
    ◦ Reduce a large number of variables to a smaller set
    ◦ Often used in psych for the development of inventories
   Structural equation modeling
    ◦ Where factor analysis and regression meet
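
The cluster analysis criterion above (small within-group variation, large between-group variation) can be illustrated without running a full clustering algorithm. The sketch below uses six made-up scores on one variable and two hand-picked groupings, and splits the total sum of squares into within-group and between-group parts for each.

```python
def within_between(groups):
    """Return (within-group SS, between-group SS) for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    within = 0.0
    between = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        within += sum((v - group_mean) ** 2 for v in g)
        between += len(g) * (group_mean - grand_mean) ** 2
    return within, between

good = [[1, 2, 3], [10, 11, 12]]   # groups match the obvious gap in the scores
poor = [[1, 2, 10], [3, 11, 12]]   # same scores, groups mixed across the gap

w_good, b_good = within_between(good)
w_poor, b_poor = within_between(poor)
print(w_good, b_good)
print(w_poor, b_poor)
```

Both groupings have the same total, since within plus between always equals the total sum of squares. A clustering method searches for the assignment that puts as much of that total as possible into the between-group part.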
Time course of events
   How long is it before some event occurs?
   How does a DV change over the course of time?
   The former question can be answered with
    survival/failure analysis
    ◦ Survival rates for disease
    ◦ Time before failure for a particular electronic part
   The latter is often examined with time-series
    ◦ Many time periods are available for analysis
       E.g. monthly stock prices over the past five years
    ◦ Popular in the economics realm
Decision tree
 Although such guides may
  be useful, as mentioned
  before, multiple analyses
  may be appropriate for
  the data under consideration
 The best plan of attack is
  to have a well-defined
  research question, and
  collect data appropriate to
  the analysis that will best
  answer that question
Multivariate Methods: Quick Glance
            Organizational Chart based on: Type of Research Focus
                            (Group differences or Correlational).

     Research Focus       IVs: Number & Scale            DVs: Number & Scale     Multivariate Method
     Group Differences
                          1+ categorical & continuous    1 continuous            ANCOVA
                          1+ categorical                 2+ continuous           MANOVA
                          2+ continuous                  1+ categorical          DFA
                          1+ categorical or continuous   1 categorical           LR
     Correlational
                          2+ continuous                  1 continuous            MR
                          2+ continuous                  2+ continuous           CC
                          -                              2+ continuous           PCA & FA

Note: Scale and number of Independent (IV) and Dependent (DV) categorical or continuous
variables. + indicates 1 or more; ANCOVA = Analysis of Covariance; MANOVA = Multivariate
Analysis of Variance; DFA = Discriminant Function Analysis; LR=Logistic Regression; MR =
Multiple Regression; CC = Canonical Correlation; PCA/FA = Principal Components/Factor Analysis
Summary of Methods
 The multivariate methods we will look at are a set of tools
  for analyzing multiple variables in an integrated and powerful way
 They allow the examination of richer and perhaps more
  realistic designs than can be assessed with traditional
  univariate methods that only analyze one outcome variable
  and usually just one or two independent variables (IVs)
 Compared to univariate methods, multivariate methods
  allow us to analyze a complex array of variables, providing
  greater assurance that we can come to some synthesizing
  conclusions with less error and more validity than if we were
  to analyze variables in isolation.
