Blind Analysis Of Multivariate Data

Examining a Multivariate Database
Issues to be examined

Tools for examining a multivariate
database

The problem of missing data

The problem of outliers




Examining a Multivariate Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University


                                               Key Concepts
                                                   *****

                              Examining a Multivariate Database


Dangers of analyzing data without theory or a thorough understanding
of the data
       Reliability & validity
       Missing data
       Outliers
       Distributional dynamics of the variables
       Ratio of cases to variables
       Statistical assumptions about the data
Analytic tools for examining data:
       Histogram
       Stem & leaf diagram
       Scatterplot
       Box-Whisker plot
       Bar graph
       Normal probability plot
       Cross-tabulation table
       Descriptive statistics
Concept of skew:
       Right or positive skew
       Left or negative skew
Concept of kurtosis:
       Platykurtic
       Mesokurtic
       Leptokurtic
The problem of missing data in multivariate analysis:
       The impact of eliminating subjects
       The impact of eliminating variables
Causes of missing data
Missing at random (MAR) v. missing completely at random (MCAR)
Techniques for determining MAR v. MCAR
Remedies for missing data
       Deletion of cases
       Deletion of variables
       Imputation
       Model-based solutions
Problems with deleting cases
Problems with deleting variables






Key Concepts (cont.)

The concept of imputation
        Complete case approach
        All-available approach
Techniques for imputation:
        Case substitution
        Mean/median substitution
        Cold deck imputation
        Regression imputation
        Multiple imputation
Advantages & disadvantages of different imputation techniques
Model-based procedures for missing data
The problem of outliers & fringeliers
Univariate v. multivariate outliers
Sources of outliers
Critical questions about outliers
Techniques for identifying outliers:
        Histogram
        Stem & leaf diagram
        Scatterplots: 2 or 3 dimensional
        Box-Whisker plot
        Trend or time series plot
        Descriptive statistics
        Converting data to standard scores
        Multivariate tools
Ways of dealing with outliers
Problems with deleting outliers






                                  Lecture Outline

       Issues to examine


       Tools for examining data


       Problems with missing data


       Problems with outliers








       Blind Analysis of Multivariate Data
Blind analysis of a multivariate database without
theory and a good understanding of the data is
hazardous.


Research should be theory-driven, with a
thorough understanding of:

                The reliability and validity of the data

                The extent and impact of missing data

                Presence and impact of outliers

                Distributional characteristics of the
                 variables

                The ratio of cases to the number of
                 variables

                Whether the data meets the
                 assumptions of the statistical methods
                 to be used




Examples of Tools for Examining Data

     Histogram


     Stem and Leaf Diagram


     Scatterplot


     Box-Whisker Plot


     Bar Graph

     Normal Probability Plot


     Cross-Tabulation Table


     Descriptive Statistics






                                          Histogram
Useful in determining the symmetry or skew of a
metric variable, and in determining the presence
of extreme values or outliers.

Example

       Distribution of sentences received by 70
       felony offenders






Stem and Leaf Diagram
Useful in determining the symmetry or skew of a
metric variable, and in determining the presence
of extreme values or outliers.

Example

       Distribution of sentences received by 70
       felony offenders

         Frequency                   Stem & Leaf

              8.00      0*                      11111111
             19.00      0t                      2222222222333333333
             14.00      0f                      44444444555555
             10.00      0s                      6666667777
              7.00      0.                      8888889
              3.00      1*                      001
              2.00      1t                      22
              3.00      1f                      455
              4.00 Extremes                      (17), (18), (20), (25)

        Stem width:                     10.0
        Each leaf:                      1 case(s)






                           Bivariate Scatterplot

Useful in determining …
       Whether there is a relationship between two
       metric variables, its direction and relative
       magnitude,
       The presence of bivariate outliers, and
       Whether the relationship is linear or
       nonlinear






Example
       Scatterplot of sentence and number of prior
       convictions






                               Box-Whisker Plot
Useful in determining the symmetry or skew of a
metric variable, and in determining the presence
of extreme values or outliers.

Example

       Distribution of sentences given to 70 felony
       offenders






                                         Bar Graph

Useful for determining the frequency of cases in
the various categories of a nonmetric variable,
and as a reference for collapsing categories if
necessary.

Example

       Distribution of race/ethnicity among 70
       offenders






                       Normal Probability Plot

Useful in determining if a variable is normally
distributed

Example

       Sentences received by 70 convicted felons




Since the points are not on the line and "bow" to
the right, the distribution of sentences is skewed
to the right.





                      Cross-Tabulation Table
Useful in determining whether there is a
relationship between two nonmetric variables,
and whether any cells have low frequencies or
contain no cases at all.

Example
   Cross-classification of race by gender
   among 70 felons


Race/Ethnicity            Male      Female       Total

White                        7          18          25
African American            13           9          22
Hispanic                    15           8          23

Total                       35          35          70






                           Descriptive Statistics

Useful in profiling the central tendency,
variability, skew and kurtosis of a metric
variable.

Example

        Descriptive statistics on the sentences
        received by 70 felons




Valid cases:                70.0      Missing cases:                   .0     Percent missing:                 .0


 Mean        5.9571 Std Err        .5920                   Min               1.0000      Skewness          1.6771
 5% Trim     5.4286 Std Dev       4.9532                   Range            24.0000      Kurtosis          3.0632
 95% CI for Mean (4.7761, 7.1382)                          IQR               6.0000      S E Kurt           .5663
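
As a rough check on this output, the 70 sentences can be read back off the stem-and-leaf display shown earlier and summarized with Python's standard library. This is only a sketch: the `counts` dictionary is one reading of that display (stem width 10, one case per leaf), and skewness and kurtosis are left out because their values depend on which adjustment formula the package applies.

```python
import math
import statistics

# Sentence length (years) -> number of offenders, read off the
# stem-and-leaf display (assumed reading: stem width 10, 1 case per leaf).
counts = {1: 8, 2: 10, 3: 9, 4: 8, 5: 6, 6: 6, 7: 4, 8: 6, 9: 1,
          10: 2, 11: 1, 12: 2, 14: 1, 15: 2, 17: 1, 18: 1, 20: 1, 25: 1}
sentences = [value for value, n in counts.items() for _ in range(n)]

n = len(sentences)
mean = statistics.mean(sentences)
std_dev = statistics.stdev(sentences)   # sample SD (n - 1 denominator)
std_err = std_dev / math.sqrt(n)        # standard error of the mean
value_range = max(sentences) - min(sentences)

print(f"N = {n}, mean = {mean:.4f}, SD = {std_dev:.4f}, "
      f"SE = {std_err:.4f}, min = {min(sentences)}, range = {value_range}")
# prints: N = 70, mean = 5.9571, SD = 4.9532, SE = 0.5920, min = 1, range = 24
```

Under that reading of the display, the results agree with the mean, standard deviation, standard error, minimum, and range reported above.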






              The Problem of Missing Data

    A multivariate database is an N x k matrix.
           (N = subjects, k = variables)


Subjects                      X1                      X2                      .....                    Xk

      S1

      S2

     .....

      Sn




A complete data set is required to analyze the
interrelationships among all the variables.

If one or more values are missing, the
associated subject(s) or variable(s) must be
eliminated from the analysis, or the missing
data must be imputed (estimated) by some means.





                       Impact of Missing Data

Subjects                      X1                      X2                       X3                      X4


       1                      12                       2                     253                       64

       2                      18                       5                      (?)                      85

       3                      (?)                      6                     163                       94

       4                      22                       9                     315                       77

       5                      16                      (?)                    286                       64

       6                      28                       3                     173                       83

       7                      11                       2                     311                       94

       8                      19                       4                     289                       81

       9                      25                       8                     198                       69

     10                       20                       4                     274                       75



This is a 10 x 4 matrix, 40 data points, with 3
missing values.

If the variables with missing data are eliminated,
75% of the variables are lost for the analysis.





If subjects with missing data are eliminated, 30%
of the subjects (3 of 10) are lost for the analysis.
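
The cost of each deletion strategy can be counted directly from the 10 x 4 matrix above. The sketch below is illustrative only; it codes each (?) as `None`, and the variable names are made up for the example.

```python
# The 10 x 4 example matrix (columns X1..X4); None marks a missing value.
data = [
    [12,   2,    253,  64],
    [18,   5,    None, 85],
    [None, 6,    163,  94],
    [22,   9,    315,  77],
    [16,   None, 286,  64],
    [28,   3,    173,  83],
    [11,   2,    311,  94],
    [19,   4,    289,  81],
    [25,   8,    198,  69],
    [20,   4,    274,  75],
]

n_subjects = len(data)
n_variables = len(data[0])

missing_cells = sum(value is None for row in data for value in row)
# A subject is lost if any of its values is missing.
subjects_lost = sum(any(v is None for v in row) for row in data)
# A variable is lost if any subject is missing a value on it.
variables_lost = sum(any(row[j] is None for row in data)
                     for j in range(n_variables))

pct_subjects_lost = 100 * subjects_lost / n_subjects
pct_variables_lost = 100 * variables_lost / n_variables

print(f"{missing_cells} missing values out of {n_subjects * n_variables}")
print(f"Deleting variables loses {variables_lost} of {n_variables} "
      f"({pct_variables_lost:.0f}%)")
print(f"Deleting subjects loses {subjects_lost} of {n_subjects} "
      f"({pct_subjects_lost:.0f}%)")
```

Three missing cells out of 40 are enough to remove three of the four variables or three of the ten subjects, which is why the choice of remedy matters.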






        Eliminating Subjects or Variables
Elimination of Subjects              Elimination of Variables


Reduces power (1 - β),               May result in a
may lead to a Type II                specification error
error

Reduces df and may                   The model may over-fit
lead to a Type II error              the data & not
                                     cross-validate

May reduce the                       May produce a larger
representativeness of                error term due to
the sample                           unexplained variance in
                                     the dependent variable

May affect the external              May lead to a Type II
validity of the study                error

May result in                        May reduce the
inaccurate estimates of              explanatory power of
population variances                 the model
and covariances




                      Causes of Missing Data


 Recording error                     Change in definition
                                     of a variable

 Data entry error                    Refusal to answer
                                     a survey question

 Morbidity of subjects               Ignorance of the
                                     meaning of a survey
                                     question

 Missing record                      Agency disclosure
                                     policy

 Missing data field                  Survey response
                                     alternatives not
                                     applicable

 Change in record-keeping            Computer crash
 procedure






                        Types of Missing Data



Missing At Random (MAR)


     The pattern of the missing values in a
     variable (Y) is related to the observed
     values of one or more other variables (Xk),
     but not to the values of Y itself.



Missing Completely at Random (MCAR)


     The pattern of the missing values in a
     variable (Y) is not related to the values of
     any variable in the database, whether
     observed or missing.






  Diagnosing the Pattern of the Missing
          Data: MAR v MCAR

Technique 1: for metric variables

For the variable with the missing data, create
two groups of subjects:

     Group 0 = Subjects with missing data
     Group 1 = Subjects with complete data

Conduct a t-test to see if the groups differ
significantly on the other variables in the
database, assuming they are metric.

Technique 2: for nonmetric variables

For the variable with the missing data, create a
dummy variable with two groups of subjects …

     Group 0 = Subjects with missing data
     Group 1 = Subjects with complete data

Conduct a chi-square test to see if there is any
association between the dummy variable and
other nonmetric variables.




An Example of a Multivariate Database
         with Missing Data
                          ((?) = missing data)


       Sentence          Prior Convictions          Drug Score


           3                     2                       1
           1                     0                      (?)
           2                    (?)                      5
           3                     1                       7
           5                     0                       4
           1                     1                      (?)
           1                     2                      (?)
           2                    (?)                      1
           4                     2                       3
           2                     1                      (?)
           8                     3                       8
          10                     4                       7
          10                     1                       4
          20                    (?)                      9
          14                     3                       2
          14                     2                       5
           7                     4                       7
          23                    (?)                      6
          12                     0                       8
          15                     3                       6

Prior convictions has 4 missing values

Drug score has 4 missing values





Is the pattern of missing values in either of these
two variables related to sentence? Is the pattern
MAR or MCAR?

Creating Dummy Variables to Represent
      the Pattern of Missing Data
                  (0 = data missing, 1 = data not missing)



 Sentence       Prior         Missing       Drug Score      Missing
             Convictions      Priors                       Drug Score


     3            2              1              1              1
     1            0              1             (?)             0
     2           (?)             0              5              1
     3            1              1              7              1
     5            0              1              4              1
     1            1              1             (?)             0
     1            2              1             (?)             0
     2           (?)             0              1              1
     4            2              1              3              1
     2            1              1             (?)             0
     8            3              1              8              1
    10            4              1              7              1
    10            1              1              4              1
    20           (?)             0              9              1
    14            3              1              2              1
    14            2              1              5              1
     7            4              1              7              1
    23           (?)             0              6              1
    12            0              1              8              1
    15            3              1              6              1
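
Building the dummy variables is mechanical once missing values have an explicit code. A minimal sketch, assuming missing entries are stored as `None`; the helper name `missing_indicator` is made up for this example.

```python
# The 20-case example; None marks a missing value.
sentence = [3, 1, 2, 3, 5, 1, 1, 2, 4, 2, 8, 10, 10, 20, 14, 14, 7, 23, 12, 15]
priors = [2, 0, None, 1, 0, 1, 2, None, 2, 1,
          3, 4, 1, None, 3, 2, 4, None, 0, 3]
drug = [1, None, 5, 7, 4, None, None, 1, 3, None,
        8, 7, 4, 9, 2, 5, 7, 6, 8, 6]


def missing_indicator(values):
    """Return 0 where a value is missing and 1 where it is present."""
    return [0 if v is None else 1 for v in values]


missing_priors = missing_indicator(priors)
missing_drug = missing_indicator(drug)

print("Missing priors:    ", missing_priors)
print("Missing drug score:", missing_drug)
```

Each indicator can then be used as a grouping variable in a t-test or cross-tabulated against another indicator.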






  Is the Pattern of Missing Data in the
 Variable Priors Related to the Variable
               Sentence?
Step 1
Using the dummy variable "missing priors" …
    Compute the average sentence for the
    subjects coded 0 and those coded 1

Group                                          Mean Sentence

Missing data group (0)                           11.75 years
Not missing data group (1)                        6.88 years

Step 2
       Run a t-test on the difference between the
       means of the two groups
       t = 1.33, df = 18, p = 0.1986
       Since the difference between means is not
       significant, the missing data process is
       MCAR
       The process is not related to the length of
       sentence



    Is the Pattern of Missing Data in the
     Variable Drug Score Related to the
             Variable Sentence?
Step 1
Using the dummy variable "missing drug score"
    Compute the average sentence for the
    subjects coded 0 and those coded 1

Group                                          Mean Sentence

Missing data group (0)                            1.25 years
Not missing data group (1)                        9.50 years

Step 2
       Run a t-test on the difference between the
       means of the two groups
       t = 2.50, df = 18, p = 0.022
       Since the difference between means is
       significant, the missing data process is MAR
       The process that produced the missing data
       is related to the length of sentence
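
The two t-tests on sentence can be reproduced with a short standard-library sketch. It assumes an independent-samples test with pooled (equal) variances, which gives the df = 18 reported here; the function name `pooled_t` is made up for the example. The p-values are not computed because the standard library has no t distribution, so they would come from a t table or a statistics package.

```python
import math
import statistics

sentence = [3, 1, 2, 3, 5, 1, 1, 2, 4, 2, 8, 10, 10, 20, 14, 14, 7, 23, 12, 15]
# Dummy variables: 0 = data missing, 1 = data not missing.
missing_priors = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
missing_drug = [1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


def pooled_t(values, indicator):
    """Pooled-variance t statistic comparing the cases coded 0 (missing)
    with the cases coded 1 (not missing)."""
    g0 = [v for v, flag in zip(values, indicator) if flag == 0]
    g1 = [v for v, flag in zip(values, indicator) if flag == 1]
    n0, n1 = len(g0), len(g1)
    df = n0 + n1 - 2
    pooled_var = ((n0 - 1) * statistics.variance(g0)
                  + (n1 - 1) * statistics.variance(g1)) / df
    std_err = math.sqrt(pooled_var * (1 / n0 + 1 / n1))
    mean0, mean1 = statistics.mean(g0), statistics.mean(g1)
    return mean0, mean1, (mean0 - mean1) / std_err, df


mean0_p, mean1_p, t_priors, df_priors = pooled_t(sentence, missing_priors)
mean0_d, mean1_d, t_drug, df_drug = pooled_t(sentence, missing_drug)

print(f"Priors:     means {mean0_p:.2f} vs {mean1_p:.2f}, "
      f"t = {t_priors:.2f}, df = {df_priors}")
print(f"Drug score: means {mean0_d:.2f} vs {mean1_d:.2f}, "
      f"t = {t_drug:.2f}, df = {df_drug}")
```

The sign of t depends only on which group is subtracted from which; the sketch subtracts the "not missing" mean from the "missing" mean, so the drug score statistic comes out negative with the same magnitude as reported.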



  Is the Pattern of Missing Data in the
          Variable Drug Score
 Related to the Pattern of Missing Data
         In the Variable Priors?
Step 1
Using the dummy variable "missing drug score"
and "missing priors" …
       Construct a 2x2 cross-tabulation table


Priors                                                      Drug Score

                                       Missing (0)                                     Not (1)

Missing (0)                                        0                                            4

Not (1)                                            4                                         12

Step 2
Run a chi-square test on the cell frequencies.
Since one cell has zero frequency, run Fisher's
exact probability test as well.
2 = 1.25, df = 1, p = 0.246



Fisher's p = 0.538

Since the results are not significant, the missing
data process is MCAR
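
Both tests on the 2x2 table are small enough to compute by hand. The sketch below assumes a Pearson chi-square without continuity correction and a two-sided Fisher test that sums the probabilities of all tables no more likely than the observed one; the names are illustrative.

```python
import math

# Missing-data indicators cross-tabulated: rows = priors, columns = drug score.
table = [[0, 4],
         [4, 12]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Pearson chi-square: sum of (observed - expected)^2 / expected.
chi_square = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2) for j in range(2)
)
# With df = 1 the upper-tail probability has a closed form.
p_chi_square = math.erfc(math.sqrt(chi_square / 2))


def table_probability(a):
    """Hypergeometric probability of the 2x2 table whose top-left cell
    is a, holding the observed margins fixed."""
    return (math.comb(row_totals[0], a)
            * math.comb(row_totals[1], col_totals[0] - a)
            / math.comb(n, col_totals[0]))


observed = table_probability(table[0][0])
a_min = max(0, col_totals[0] - row_totals[1])
a_max = min(row_totals[0], col_totals[0])
# Two-sided Fisher p: tables at most as probable as the observed one.
p_fisher = sum(p for p in map(table_probability, range(a_min, a_max + 1))
               if p <= observed * (1 + 1e-9))

print(f"chi-square = {chi_square:.2f}, p = {p_chi_square:.3f}")
print(f"Fisher's exact p = {p_fisher:.3f}")
```

With an expected frequency of 0.8 in the empty cell, the chi-square approximation is unreliable, which is why the exact test is the safer result to report here.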
                 Remedies for Missing Data


 Delete the cases with missing data



 Delete the variables with missing data



 Imputation of the missing values



 Model-based procedures






                                    Case Deletion
Probably the most commonly used method.

Depending upon the number of cases deleted,
the deletion of cases …

      May reduce the power (1 - β) of the
       subsequent statistical tests, and may lead to
       a Type II error

      Will reduce the df of subsequent statistical
       tests, which may lead to a Type II error

      May reduce the representativeness of the
       sample, reducing the external validity of the
       study

      If the process of the missing data is MAR,
       may lead to incorrect generalizations of the
       results

      May bias the estimates of the variables'
       population variances and covariances …

                 Resulting in biased estimates of the
                 statistical model's parameters and their
                 associated standard errors




                             Variable Deletion

A poor strategy if the purpose of the study is
multivariate in nature.

Deletion of one or more variables may …

                Result in a specification error in the
                 model

                Result in a model that over-fits the data
                 and does not cross-validate

                Produce too large an error term due
                 to the unexplained variance in the
                 dependent variable

                Lead to a Type II error

                Reduce the explanatory power of the
                 model






              Imputation of Missing Values


Imputation refers to estimating the missing
values in one variable from the relationship
between that variable and other variables in the
database.




Complete Case Approach

Uses only cases with complete data across all
the variables in the database (called the casewise
approach in SPSS).



All-Available Approach

Uses any cases with complete data on a pair of
variables (called the pairwise approach in SPSS).
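
The difference between the two approaches shows up in how many cases each one retains. The sketch below reuses the 20-case sentencing example from earlier in the lecture, with missing values coded as `None`; the dictionary layout is just a convenience for the illustration.

```python
from itertools import combinations

data = {
    "sentence": [3, 1, 2, 3, 5, 1, 1, 2, 4, 2,
                 8, 10, 10, 20, 14, 14, 7, 23, 12, 15],
    "priors": [2, 0, None, 1, 0, 1, 2, None, 2, 1,
               3, 4, 1, None, 3, 2, 4, None, 0, 3],
    "drug": [1, None, 5, 7, 4, None, None, 1, 3, None,
             8, 7, 4, 9, 2, 5, 7, 6, 8, 6],
}
n_cases = len(data["sentence"])

# Complete case approach: keep a case only if every variable is observed.
complete_cases = [i for i in range(n_cases)
                  if all(column[i] is not None for column in data.values())]

# All-available approach: for each pair of variables, keep every case
# that is observed on both members of the pair.
pairwise_n = {
    (a, b): sum(x is not None and y is not None
                for x, y in zip(data[a], data[b]))
    for a, b in combinations(data, 2)
}

print(f"Complete cases: {len(complete_cases)} of {n_cases}")
for pair, count in pairwise_n.items():
    print(f"Cases available for {pair}: {count}")
```

The all-available approach keeps more cases for some pairs, but each statistic is then based on a different subset of subjects.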






                      Imputation Techniques


       Case substitution



       Mean/median substitution



       Cold deck imputation



       Regression imputation



       Multiple imputation






                               Case Substitution

Identify the case with missing data


Find the case in the database that is most
similar to the case with the missing data


Impute the missing values from the
corresponding values of the case with complete
data


If this procedure is used on too many cases it
may …

        Reduce the external validity of the study

        Result in misrepresentation of the
         population variances and covariances
         and …

                           Produce biased estimates of the
                           model's parameters and associated
                           standard errors
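
Case substitution needs some definition of "most similar", which the lecture leaves open. The sketch below assumes plain Euclidean distance on whichever variables the incomplete case has observed, applied to the 20-case sentencing example; the function names are hypothetical, and in practice variables measured on different scales would usually be standardized first.

```python
import math

sentence = [3, 1, 2, 3, 5, 1, 1, 2, 4, 2, 8, 10, 10, 20, 14, 14, 7, 23, 12, 15]
priors = [2, 0, None, 1, 0, 1, 2, None, 2, 1,
          3, 4, 1, None, 3, 2, 4, None, 0, 3]
drug = [1, None, 5, 7, 4, None, None, 1, 3, None,
        8, 7, 4, 9, 2, 5, 7, 6, 8, 6]

# Each case is (sentence, priors, drug score); donors must be complete.
cases = list(zip(sentence, priors, drug))
complete = [case for case in cases if None not in case]


def most_similar_donor(case):
    """Return the complete case closest to `case` on its observed variables
    (Euclidean distance on raw values: an assumption of this sketch)."""
    observed = [j for j, v in enumerate(case) if v is not None]
    target = [case[j] for j in observed]
    return min(complete,
               key=lambda donor: math.dist(target, [donor[j] for j in observed]))


def substitute(case):
    """Fill the missing values of `case` from its most similar donor."""
    donor = most_similar_donor(case)
    return tuple(d if v is None else v for v, d in zip(case, donor))


filled = [substitute(case) for case in cases]
print(substitute((20, None, 9)))   # donor is the case (15, 3, 6)
print(substitute((1, 0, None)))    # donor is the case (3, 1, 7)
```

Because every imputed value is copied from an existing case, heavy use of this procedure duplicates donors and can distort the variances and covariances, as noted above.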




                              Mean Substitution

Identify the variable with missing data

Compute some measure of central tendency of
the variable, e.g. arithmetic mean or median

Substitute the average of the variable for the
missing values

If a variable has too many missing values …

        The average will be a biased
         estimate of the true average

        The population variance will be
         underestimated

        The relationship between the variable with
         the missing data and other variables in the
         database will be underestimated, risking
         a Type II error
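
The variance problem is easy to demonstrate. The sketch below applies mean substitution to the prior convictions variable from the earlier 20-case example (missing values coded as `None`) and compares the sample variance before and after.

```python
import statistics

priors = [2, 0, None, 1, 0, 1, 2, None, 2, 1,
          3, 4, 1, None, 3, 2, 4, None, 0, 3]

observed = [v for v in priors if v is not None]
mean_observed = statistics.mean(observed)

# Mean substitution: every missing value becomes the observed mean.
imputed = [mean_observed if v is None else v for v in priors]

var_observed = statistics.variance(observed)   # based on 16 observed cases
var_imputed = statistics.variance(imputed)     # based on all 20 cases

print(f"Mean before and after: {mean_observed:.4f}, "
      f"{statistics.mean(imputed):.4f}")
print(f"Variance before and after: {var_observed:.4f}, {var_imputed:.4f}")
```

The mean is unchanged, but the imputed cases add nothing to the sum of squared deviations while the denominator grows from 15 to 19, so the variance estimate shrinks by the factor 15/19.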






                         Cold Deck Imputation


Substitute an estimate of the missing value from
an external source

                From a pilot study

                From a similar research study found in
                 the literature

                From expert opinion or judgment



Disadvantages

                An external source for the missing data
                 may not be available

                In other ways the disadvantages are
                 comparable to those associated with
                 mean/median substitution






                       Regression Imputation
     If the variables in the database are highly
      interrelated …

                Then it may be possible to estimate the
                missing values …

                By making the variable with the missing
                values (x_m) a dependent variable, and
                regressing it on the other variables
                (x_1 ... x_k) in the database


                     x_m = a + b_1 x_1 + b_2 x_2 + ... + b_k x_k


     By substituting the known values of the case
      with missing data in the model, the missing
      value can be estimated.

     The efficiency of this technique depends
      upon …

                The extent of the missing data, and
                The magnitude of the relationship
                between x_m and the other variables
                used in the regression model
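
A minimal sketch of regression imputation with two predictors, assuming NumPy is available. The data are hypothetical and were built so that the variable to be imputed is an exact linear function of the predictors, which makes the result easy to check by hand.

```python
import numpy as np

# Hypothetical complete cases: predictors x1, x2 and the variable x_m
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
xm = np.array([9.0, 8.0, 19.0, 18.0, 29.0])  # built as 1 + 2*x1 + 3*x2

# Estimate a, b1, b2 by ordinary least squares on the complete cases
design = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(design, xm, rcond=None)

# Case with x_m missing: substitute its known values into the fitted model
known = np.array([1.0, 6.0, 5.0])  # intercept term, x1 = 6, x2 = 5
imputed = float(known @ coef)
print(round(imputed, 2))
```

With real data the fit is never exact, so the imputed value carries the prediction error of the model.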



        Caveats in Regression Imputation

The interrelationship among the variables used
in the model must be high to produce accurate
estimates of the missing values

If the number of missing values in the imputed
variable is large …

        The imputation will reinforce sample
         specific relationships which will not
         cross-validate

        The population variances &
         covariances will be underestimated

        Problems may be encountered if the
         imputed variable is an independent
         variable: its imputed values are a linear
         function of the other predictors, which
         increases collinearity among them

        The imputed estimates may go beyond
         the bounds of a possible value, since
         regression analysis is not constrained by
         units of measurement





                             Multiple Imputation


Involves the use of several different imputation
techniques to produce a set of estimated values
for the missing value.


The different estimates are then averaged to
derive the imputed value


The goal is to derive an estimate of the missing
value(s) by using several different techniques,
and …

       By averaging the various estimates,

       Hopefully canceling out or offsetting the
       disadvantages of the different techniques
       employed






           Caveats in Multiple Imputation


     Multiple imputation may or may not cancel
      out the disadvantages of the techniques
      employed.


     The success of the technique will depend
      upon the peculiarities of the database.


     It may in fact compound the disadvantages
      of the techniques used.


     Using the average of various estimates may
      reinforce relationships peculiar to the
      sample, which may not cross-validate.


     As with other techniques, the more missing
      values imputed, the less reliable the imputed
      values.






                    Model-Based Imputation

      This involves a variety of techniques that
      either …

               Incorporate the missing data process
                into the analysis as a separate variable
                to assess the amount of variance
                accounted for by the missing data

               Or use maximum likelihood estimation
                to model the missing data process,
                and based upon the results, make the
                most accurate estimates of the missing
                values
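
The first option can be sketched as follows: code the missing-data process as a 0/1 variable and ask how much of the variance in another variable it accounts for. This is only one simple reading of the idea, with hypothetical data; y is fully observed and x has missing values.

```python
# Hypothetical data: y is fully observed, x has missing values (None)
y = [10, 12, 11, 13, 20, 22, 21, 23]
x = [3.1, 2.8, 3.0, 3.3, None, None, None, None]

# Code the missing-data process as a separate 0/1 variable
missing = [1 if v is None else 0 for v in x]

def mean(values):
    return sum(values) / len(values)

grand = mean(y)
group_means = {g: mean([yi for yi, m in zip(y, missing) if m == g])
               for g in (0, 1)}

ss_total = sum((yi - grand) ** 2 for yi in y)
ss_within = sum((yi - group_means[m]) ** 2 for yi, m in zip(y, missing))

# Share of the variance in y accounted for by the missingness indicator
r_squared = 1 - ss_within / ss_total
print(round(r_squared, 3))
```

A large share, as in this contrived example, suggests that the data are not missing completely at random.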






                          Outliers & Fringeliers


An outlier is an extreme value or case. A
fringelier is a marginally extreme value or case.


Such values can significantly affect and distort
multivariate analysis, leading to …

        Type I Errors

        Type II Errors

        Underestimation of significant
         findings

        Reversal of results


Of particular concern is the fact that a case can
be a multivariate outlier …

       While not being a univariate outlier on any
       individual variable.






                           Sources of Outliers

     Recording error or data entry error


     An unusual event which causes a
      one time change in a variable


     The beginning of a new phenomenon
      with few of the cases represented in the
      database


     Short term change in the way the
      variable is defined


     Differences in the way agencies or
      jurisdictions define a variable


     "Apparent" outliers resulting from a sample
      that is too small






     A case that is not an outlier on each
      individual variable, but is an outlier across
      several variables, i.e. a multivariate outlier






            Critical Questions About Outliers


    How extreme must a case be to be an
    outlier?


    Is the apparent outlier an error or a
    reliable value?


    Are there cases that are univariate
    “reasonable” yet multivariate outliers?


    What impact might the outlier have on
    the analysis of the data?


    Is the outlier part of an extreme trend
    for which there are few cases in the
    database, or is it simply a very exceptional
    case?


    How will the outliers be dealt with?





                    Ways to Identify Outliers

     Histogram

     Stem & leaf diagram

     Scatterplot: 2 or 3 dimensions

     Box-Whisker plot

     Trend or time series plot

     Descriptive statistics

                mean v. median

                minimum & maximum values

                skew & kurtosis

                interquartile range & standard deviation

     Convert data to standard scores (Z) &
      examine cases where |Z| ≥ 1.96

     Multivariate tools




     Identifying Outliers with a Histogram

Example

       Years served in prison by paroled felons




       [Figure: histogram of years served in prison, with the outliers marked]






                       Identifying Outliers with a
                         Stem & Leaf Diagram

Example

       Years served in prison by paroled felons




Frequency                      Stem &                Leaf

        16.00        0                         *       0000001111111111
        20.00        0                         t       22222222222233333333
        16.00        0                         f       4444444455555555
         9.00        0                         s       666667777
         2.00        0                         .       89
         3.00        1                         *       001
         4.00 Extremes                                 (12), (15), (16),(18)




       The values listed as Extremes are the outliers.






          Identifying Bivariate Outliers with a
                       Scatterplot

Example
   Years served in prison as a function of
   length of sentence


   [Figure: scatterplot of years served against length of sentence, with one case marked as a possible outlier]






                       Identifying Outliers with a
                           Box-Whisker Plot


Example

       Years served in prison by paroled felons




       [Figure: box-whisker plot of years served in prison, with the outliers marked]
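
A box-whisker plot typically marks as outliers the values lying more than 1.5 interquartile ranges beyond the quartiles. The sketch below applies that rule to hypothetical data (not the 70 cases behind the slide). Quartile definitions differ between packages, so the interpolation method used here is an assumption.

```python
import statistics

# Hypothetical years served (not the slide's 70 cases)
years = [0.4, 1.0, 1.5, 2.0, 2.7, 3.0, 3.75, 4.5, 5.2, 6.0, 7.3, 18.2]

# Quartiles by linear interpolation (one of several conventions)
q1, _, q3 = statistics.quantiles(years, n=4, method="inclusive")
iqr = q3 - q1

# Fences at 1.5 interquartile ranges beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = [v for v in years if v < lower or v > upper]
print(flagged)
```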






                       Identifying Outliers in a
                          Time Series Plot

Example

     Number of arrested parolees in a county jail


Some change in policy in 2/97 caused a
substantial and permanent change in the
number of jailed parolees.




[Figure: time series plot of the number of jailed parolees (vertical axis 0 to 1200) by year, 95 to 01]






      Identifying Outliers with Descriptive
                   Statistics

Example

        Years served in prison by paroled felons


Valid cases:                70.0      Missing cases:                   .0     Percent missing:                 .0


 Mean        4.6786 Std Err        .4383                   Min                .4000      Skewness          1.6584
 Median      3.7500 Variance    13.4455                    Max              18.2000      S E Skew           .2868
 5% Trim     4.2929 Std Dev       3.6668                   Range            17.8000      Kurtosis          3.1904
 95% CI for Mean (3.8043, 5.5529)                          IQR               4.0500      S E Kurt           .5663




The mean (4.68) > median (3.75), therefore the
distribution is skewed right

The skew (1.658) is positive

The most extreme value (18.2 years) may be an
outlier
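
The standard errors in the table can be reproduced from the number of valid cases alone. The sketch below assumes the usual formulas for the standard errors of skewness and kurtosis; the slides do not say which formulas the package used, but these agree with the printed values to four decimals. Dividing each statistic by its standard error gives a rough ratio for judging how far the distribution departs from normal.

```python
import math

n = 70                       # valid cases, from the table
skew, kurt = 1.6584, 3.1904  # skewness and kurtosis, from the table

# Assumed formulas for the standard errors of skewness and kurtosis
se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_kurt = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

print(round(se_skew, 4), round(se_kurt, 4))

# Ratio of each statistic to its standard error
print(round(skew / se_skew, 2), round(kurt / se_kurt, 2))
```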






  Identifying Outliers by Converting the
      Data to Standard Scores (Z)
Example
   Years served in prison by paroled felons
   (Mean = 4.6786 years, S = 3.6668)


               Time Served                                           Z Score
                                                                   Time Served

                       7.3                                            +0.71
                       5.2                                            +0.14
                      11.3                                            +1.81
                        …                                                …
                        …                                                …
                       8.6                                            +1.07
                      12.2                                            +2.05
                      16.3                                            +3.17
                      14.6                                            +2.71
                      18.2                                            +3.69
                       1.5                                            -0.87
                        …                                                …
                        …                                                …
                       2.7                                            -0.54

     Cases with |Z| ≥ 1.96 may be
     considered as outliers
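
The conversion is Z = (value - mean) / S. The sketch below recomputes the Z scores for the cases listed in the table from the stated mean and standard deviation, and flags those at or beyond 1.96 in absolute value (the absolute-value reading of the cutoff is an assumption).

```python
mean, sd = 4.6786, 3.6668  # from the slide

# Time served for the cases shown in the table
times = [7.3, 5.2, 11.3, 8.6, 12.2, 16.3, 14.6, 18.2, 1.5, 2.7]

# Standard score for each case, rounded as in the table
z_scores = [round((t - mean) / sd, 2) for t in times]

# Cases at or beyond the 1.96 cutoff
flagged = [t for t, z in zip(times, z_scores) if abs(z) >= 1.96]
print(z_scores)
print(flagged)
```

Four of the listed cases (12.2, 16.3, 14.6 and 18.2 years) are flagged; 11.3 years (Z = +1.81) falls just short of the cutoff.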



                               Multivariate Outliers


Subject                                X1                            X2                            X3

   1                                   1                             3                             18
   2                                   7                            14                             10
   3                                   3                             5                             17
   4                                   4                             4                             19
   5                                  15                            20                              3
   6                                  11                            16                              2
   7                                   2                            18                              2
   8                                  10                            12                              4
   9                                   5                             7                             18
  10                                   8                             9                              9



One of the 10 cases above is a multivariate
outlier …

     A case which may appear univariate
     "reasonable" …

     But which is extreme relative to the
     interrelationship among all three variables.



     Multivariate Outliers (cont.)




     [Figure: three-dimensional scatterplot of the 10 cases on X1, X2, and X3]




Case 7 is the multivariate outlier. Relative to the
three variables, its values are …


                   X1 = 2                                     X2 = 18          X3 = 2
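
The same conclusion can be checked numerically. The slides do not name a specific multivariate statistic; the sketch below uses the squared Mahalanobis distance, a common choice, as an assumption rather than as the author's method. It uses the ten cases from the table and assumes NumPy is available.

```python
import numpy as np

# The ten cases from the slide: columns X1, X2, X3
X = np.array([
    [1, 3, 18], [7, 14, 10], [3, 5, 17], [4, 4, 19], [15, 20, 3],
    [11, 16, 2], [2, 18, 2], [10, 12, 4], [5, 7, 18], [8, 9, 9],
], dtype=float)

dev = X - X.mean(axis=0)

# Univariate standard scores, one column per variable
z = dev / X.std(axis=0, ddof=1)

# Squared Mahalanobis distance of each case from the centroid
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", dev, cov_inv, dev)

most_extreme_case = int(np.argmax(d2)) + 1   # cases are numbered from 1
print(most_extreme_case)                     # 7
print(np.round(z[6], 2))                     # Case 7's univariate Z scores
```

Case 7 has the largest distance even though none of its three univariate Z scores reaches 1.96.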






               Is Case 7 a Univariate Outlier?

For variable X1

       Case 7 = 2

       It is not an outlier.

       Frequency                       Stem &               Leaf

                  4.00                           0     *       1234
                  3.00                           0     .       578
                  2.00                           1     *       01
                  1.00                           1     .       5




       [Figure: box-whisker plot of X1, N = 10]






Is Case 7 a Univariate Outlier? (cont.)




For Variable X2

            Case 7 = 18

            It is not an outlier

        Frequency                        Stem &               Leaf

                    2.00                           0     *       34
                    3.00                           0     .       579
                    2.00                           1     *       24
                    2.00                           1     .       68
                    1.00                           2     *       0




        [Figure: box-whisker plot of X2, N = 10]






Is Case 7 a Univariate Outlier? (cont.)




For Variable X3

            Case 7 = 2

            It is not an outlier

        Frequency                       Stem &               Leaf

                   4.00                           0     *       2234
                   1.00                           0     .       9
                   1.00                           1     *       0
                   4.00                           1     .       7889



        [Figure: box-whisker plot of X3, N = 10]






            Is Case 7 a Bivariate Outlier with
                 Respect to X1 and X2?


       [Figure: scatterplot of X1 and X2, with Case 7 marked]




Case 7

       X1 = 2 and X2 = 18

       In this bivariate relationship, Case 7 is an
       outlier





            Is Case 7 a Bivariate Outlier with
                 Respect to X1 and X3?

       [Figure: scatterplot of X1 and X3, with Case 7 marked]




Case 7

       X1 = 2 and X3 = 2

       In this bivariate relationship, Case 7 is an
       outlier





            Is Case 7 a Bivariate Outlier with
                 Respect to X2 and X3?


       [Figure: scatterplot of X2 and X3, with Case 7 marked]




Case 7

       X2 = 18 and X3 = 2

       In this bivariate relationship, Case 7 is not
       an outlier





 Identifying Multivariate Outliers among
        More than Three Variables
Graphical techniques cannot be used to identify
multivariate outliers when more than three
variables are involved

In this case …

       The model is estimated

       Predictions (Y') are made with the model
       using the original data

       Then the prediction errors (Y' - Y), called
       residuals, are plotted against the predictions
       (Y') and …

                 Likely multivariate outliers are identified
                 in the resulting scatterplot

Example: Sentence as a function of …
        Age

        Prior convictions

        Drug dependency





        Sentence = -17.42 + 0.9 age + 0.28 drugs + 0.5 priors

Identifying Multivariate Outliers Among More than Three Variables (cont.)




Plot of the residuals (Y' - Y) against the
predictions (Y')




[Figure: scatterplot of the residuals against the predictions, with possible multivariate outliers marked]
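
The residual approach can be sketched with simulated data. Everything below is hypothetical: the predictors, the coefficients, and the one case deliberately shifted to act as an outlier. The residuals follow the slide's convention (Y' - Y), and standardizing them and applying the 1.96 cutoff is an assumption carried over from the Z score slide. NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
n = 50

# Hypothetical predictors and outcome (not the slide's data)
age = rng.uniform(18, 60, n)
priors = rng.integers(0, 10, n).astype(float)
drugs = rng.integers(0, 2, n).astype(float)
sentence = 2.0 + 0.3 * age + 1.5 * priors + 2.0 * drugs + rng.normal(0, 1, n)

planted = 10                     # index of the case made artificially extreme
sentence[planted] += 30.0

# Estimate the model by ordinary least squares
X = np.column_stack([np.ones(n), age, priors, drugs])
coef, *_ = np.linalg.lstsq(X, sentence, rcond=None)

pred = X @ coef                  # predictions Y'
resid = pred - sentence          # prediction errors (Y' - Y)

# Standardize the residuals and list the cases beyond the cutoff
z = resid / resid.std(ddof=X.shape[1])
suspects = np.flatnonzero(np.abs(z) >= 1.96)
print(suspects)
```

Plotting resid against pred gives the scatterplot described above; the planted case stands well apart from the rest.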






                    What To Do With Outliers?
               There is no “silver bullet”. It is a matter
                of judgment.

               If the outlier is an error ... correct it

               Analyze the data with and without the
                outlier and see if it makes a difference

       Transform the data to reduce the
        influence of the outlier or skew in the
        data, assuming that the problem is due to
        sampling error

               Increase the sample size if the
                “apparent” outlier resulted from too
                small a sample

               Use a parameter-estimation algorithm
                that is less sensitive to outliers
                than OLS (e.g. a robust estimator, or
                maximum likelihood estimation with a
                heavy-tailed error model)
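
As an illustration of analyzing the data with and without the outlier, the sketch below computes the correlation between X1 and X2 from the ten-case example, first with all cases and then with Case 7 removed. The helper name pearson_r is hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# X1 and X2 for the ten cases in the multivariate outlier example
x1 = [1, 7, 3, 4, 15, 11, 2, 10, 5, 8]
x2 = [3, 14, 5, 4, 20, 16, 18, 12, 7, 9]

keep = [i for i in range(len(x1)) if i != 6]   # drop Case 7 (index 6)

r_all = pearson_r(x1, x2)
r_without = pearson_r([x1[i] for i in keep], [x2[i] for i in keep])
print(round(r_all, 2), round(r_without, 2))    # 0.65 0.94
```

Removing the single case raises the correlation from about 0.65 to about 0.94, which shows how much one multivariate outlier can suppress a relationship.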



