Blind Analysis Of Multivariate Data

```
Examining a Multivariate Database
Issues to be examined

Tools for examining a multivariate
database

The problem of missing data

The problem of outliers

Examining a Multivariate Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University

Key Concepts
*****

Examining a Multivariate Database

Dangers of analyzing data without theory or a thorough understanding
of the data
Reliability & validity
Missing data
Outliers
Distributional dynamics of the variables
Ratio of cases to variables
Analytic tools for examining data:
Histogram
Stem & leaf diagram
Scatterplot
Box-Whisker plot
Bar graph
Normal probability plot
Cross-tabulation table
Descriptive statistics
Concept of skew:
Right or positive skew
Left or negative skew
Concept of kurtosis:
Platykurtic
Mesokurtic
Leptokurtic
The problem of missing data in multivariate analysis:
The impact of eliminating subjects
The impact of eliminating variables
Causes of missing data
Missing at random (MAR) v. missing completely at random (MCAR)
Techniques for determining MAR v. MCAR
Remedies for missing data
Deletion of cases
Deletion of variables
Imputation
Model-based solutions
Problems with deleting cases
Problems with deleting variables

Key Concepts (cont.)

The concept of imputation
Complete case approach
All-available approach
Techniques for imputation:
Case substitution
Mean/median substitution
Cold deck imputation
Regression imputation
Multiple imputation
Model-based procedures for missing data
The problem of outliers & fringeliers
Univariate v. multivariate outliers
Sources of outliers
Techniques for identifying outliers:
Histogram
Stem & leaf diagram
Scatterplots: 2 or 3 dimensional
Box-Whisker plot
Trend or time series plot
Descriptive statistics
Converting data to standard scores
Multivariate tools
Ways of dealing with outliers
Problems with deleting outliers

Lecture Outline

• Issues to examine

• Tools for examining data

• Problems with missing data

• Problems with outliers

Blind Analysis of Multivariate Data
Blind analysis of a multivariate database without
theory and a good understanding of the data is
hazardous.

Research should be theory-driven, with a
thorough understanding of:

• The reliability and validity of the data

• The extent and impact of missing data

• The presence and impact of outliers

• Distributional characteristics of the variables

• The ratio of cases to the number of variables

• Whether the data meets the assumptions of the
  statistical methods to be used

Examples of Tools for Examining Data

• Histogram

• Stem and Leaf Diagram

• Scatterplot

• Box-Whisker Plot

• Bar Graph

• Normal Probability Plot

• Cross-Tabulation Table

• Descriptive Statistics

Histogram
Useful in determining the symmetry or skew of a
metric variable, and in determining the presence
of extreme values or outliers.

Example

Distribution of sentences received by 70
felony offenders

Stem and Leaf Diagram
Useful in determining the symmetry or skew of a
metric variable, and in determining the presence
of extreme values or outliers.

Example

Distribution of sentences received by 70
felony offenders

Frequency    Stem &  Leaf

     8.00        0 *  11111111
    19.00        0 t  2222222222333333333
    14.00        0 f  44444444555555
    10.00        0 s  6666667777
     7.00        0 .  8888889
     3.00        1 *  001
     2.00        1 t  22
     3.00        1 f  455
     4.00 Extremes    (17), (18), (20), (25)

Stem width:  10.0
Each leaf:   1 case(s)

Bivariate Scatterplot

Useful in determining …

• Whether there is a relationship between two
  metric variables, its direction and relative
  magnitude

• The presence of bivariate outliers

• Whether the relationship is linear or nonlinear

Example
Scatterplot of sentence and number of prior
convictions

Box-Whisker Plot
Useful in determining the symmetry or skew of a
metric variable, and in determining the presence
of extreme values or outliers.

Example

Distribution of sentences given to 70 felony
offenders

Bar Graph

Useful for determining the frequency of cases in
the various categories of a nonmetric variable,
and as a reference for collapsing categories if
necessary.

Example

Distribution of race/ethnicity among 70
offenders

Normal Probability Plot

Useful in determining if a variable is normally
distributed

Example

Sentences received by 70 convicted felons

Since the points are not on the line and "bow" to
the right, the distribution of sentences is skewed
to the right.

Cross-Tabulation Table
Useful in determining whether there is a
relationship between two nonmetric variables,
and whether any cells have low frequencies or
contain no cases at all.

Example
Cross-classification of race by gender
among 70 felons

Race/Ethnicity       Male    Female    Total

White                   7        18       25
African American       13         9       22
Hispanic               15         8       23

Total                  35        35       70

Descriptive Statistics

Useful in profiling the central tendency,
variability, skew and kurtosis of a metric
variable.

Example

Descriptive statistics on the sentences

Valid cases: 70.0    Missing cases: .0    Percent missing: .0

Mean       5.9571    Std Err    .5920    Min       1.0000    Skewness   1.6771
5% Trim    5.4286    Std Dev   4.9532    Range    24.0000    Kurtosis   3.0632
95% CI for Mean (4.7761, 7.1382)         IQR       6.0000    S E Kurt    .5663
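
Several figures in this table can be cross-checked against the stem and leaf diagram shown earlier for the same 70 sentences. The following is a minimal Python sketch, not part of the original lecture. It assumes each leaf is one case with the stem giving the tens digit, and that the four "Extremes" are ordinary cases; the variable names are illustrative.

```python
import statistics

# Sentences (years) read off the stem and leaf diagram: value -> frequency.
# Assumption: the four "Extremes" (17, 18, 20, 25) are counted as ordinary cases.
counts = {1: 8, 2: 10, 3: 9, 4: 8, 5: 6, 6: 6, 7: 4, 8: 6, 9: 1,
          10: 2, 11: 1, 12: 2, 14: 1, 15: 2, 17: 1, 18: 1, 20: 1, 25: 1}
sentences = [value for value, n in counts.items() for _ in range(n)]

n_cases = len(sentences)
mean = statistics.mean(sentences)
std_dev = statistics.stdev(sentences)        # sample SD (n - 1 denominator)
std_err = std_dev / n_cases ** 0.5           # standard error of the mean
value_range = max(sentences) - min(sentences)

print(f"N={n_cases} mean={mean:.4f} sd={std_dev:.4f} "
      f"se={std_err:.4f} range={value_range}")
```

The reconstructed values reproduce the N, mean, standard deviation, standard error and range reported in the table.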

The Problem of Missing Data

A multivariate database is an N x k matrix
(N = subjects, k = variables).

Subjects                      X1                      X2                      .....                    Xk

S1

S2

.....

Sn

A complete data set is required to analyze the
interrelationships among all the variables.

If one or more values are missing, the
associated subject(s) or variable(s) must be
eliminated from the analysis, or the missing
data must be imputed (estimated) by some means.

Impact of Missing Data

Subjects     X1     X2     X3     X4

    1        12      2    253     64
    2        18      5    (?)     85
    3       (?)      6    163     94
    4        22      9    315     77
    5        16    (?)    286     64
    6        28      3    173     83
    7        11      2    311     94
    8        19      4    289     81
    9        25      8    198     69
   10        20      4    274     75

This is a 10 x 4 matrix (40 data points) with 3
missing values.

If the variables with missing data are eliminated,
75% of the variables (3 of 4) are lost for the analysis.

If subjects with missing data are eliminated, 30%
of the subjects (3 of 10) are lost for the analysis.
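
The cost of each deletion strategy can be tallied directly from the 10 x 4 matrix. The following is a minimal Python sketch, not part of the original lecture, with `None` standing in for each (?) entry; the variable names are illustrative.

```python
# The 10 x 4 matrix from the slide; None marks a missing value.
data = [
    [12, 2, 253, 64],
    [18, 5, None, 85],
    [None, 6, 163, 94],
    [22, 9, 315, 77],
    [16, None, 286, 64],
    [28, 3, 173, 83],
    [11, 2, 311, 94],
    [19, 4, 289, 81],
    [25, 8, 198, 69],
    [20, 4, 274, 75],
]
n_subjects, n_variables = len(data), len(data[0])

# A subject (row) is lost if any of its values is missing.
subjects_lost = sum(any(v is None for v in row) for row in data)
# A variable (column) is lost if any subject is missing a value on it.
variables_lost = sum(any(row[j] is None for row in data)
                     for j in range(n_variables))

pct_subjects_lost = 100 * subjects_lost / n_subjects
pct_variables_lost = 100 * variables_lost / n_variables
print(f"subjects lost: {subjects_lost}/{n_subjects} = {pct_subjects_lost:.0f}%")
print(f"variables lost: {variables_lost}/{n_variables} = {pct_variables_lost:.0f}%")
```

Three missing values out of 40 data points (7.5%) cost 3 of 10 subjects or 3 of 4 variables, because each missing value falls in a different row and three different columns.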

Eliminating Subjects or Variables
Elimination of Subjects

• Reduces power (1 - β), which may lead to a Type II error
• Reduces df, which may lead to a Type II error
• May reduce the representativeness of the sample
• May affect the external validity of the study
• May result in inaccurate estimates of population
  variances and covariances

Elimination of Variables

• May result in a specification error
• The model may over-fit the data and not cross-validate
• May produce a larger error term due to unexplained
  variance in the dependent variable
• May lead to a Type II error
• May reduce the explanatory power of the model

Causes of Missing Data

• Recording error
• Data entry error
• Morbidity of subjects
• Missing record
• Missing data field
• Change in record keeping procedure
• Change in definition of a variable
• Refusal to answer a survey question
• Ignorance of the meaning of a survey question
• Agency disclosure policy
• Survey response alternatives not applicable
• Computer crash

Types of Missing Data

Missing At Random (MAR)

The pattern of the missing values in a
variable (Y) is related to one or more other
variables (Xk) in the database, either their
values or their pattern of missing values.

Missing Completely at Random (MCAR)

The pattern of the missing values in a
variable (Y) is not related to any other
variable (Xk) in the database.

Diagnosing the Pattern of the Missing
Data: MAR v MCAR

Technique 1: for metric variables

For the variable with the missing data, create
two groups of subjects:

Group 0 = Subjects with missing data
Group 1 = Subjects with complete data

Conduct a t-test to see if the groups differ
significantly on the other variables in the
database, assuming they are metric.

Technique 2: for nonmetric variables

For the variable with the missing data, create a
dummy variable with two groups of subjects …

Group 0 = Subjects with missing data
Group 1 = Subjects with complete data

Conduct a chi-square test to see if there is any
association between the dummy variable and
other nonmetric variables.

An Example of a Multivariate Database
with Missing Data

((?) = missing value)

Sentence    Prior Convictions    Drug Score

    3               2                1
    1               0               (?)
    2              (?)               5
    3               1                7
    5               0                4
    1               1               (?)
    1               2               (?)
    2              (?)               1
    4               2                3
    2               1               (?)
    8               3                8
   10               4                7
   10               1                4
   20              (?)               9
   14               3                2
   14               2                5
    7               4                7
   23              (?)               6
   12               0                8
   15               3                6

Prior convictions has 4 missing values

Drug score has 4 missing values

Is the pattern of missing values in either of these
two variables related to sentence? Is the pattern
MAR or MCAR?

Creating Dummy Variables to Represent
the Pattern of Missing Data
(0 = data missing, 1 = data not missing; (?) = missing value)

Sentence    Prior          Missing    Drug     Missing
            Convictions    Priors     Score    Drug Score

    3            2            1         1          1
    1            0            1        (?)         0
    2           (?)           0         5          1
    3            1            1         7          1
    5            0            1         4          1
    1            1            1        (?)         0
    1            2            1        (?)         0
    2           (?)           0         1          1
    4            2            1         3          1
    2            1            1        (?)         0
    8            3            1         8          1
   10            4            1         7          1
   10            1            1         4          1
   20           (?)           0         9          1
   14            3            1         2          1
   14            2            1         5          1
    7            4            1         7          1
   23           (?)           0         6          1
   12            0            1         8          1
   15            3            1         6          1

Is the Pattern of Missing Data in the
Variable Priors Related to the Variable
Sentence?

Step 1
Using the dummy variable "missing priors" …
Compute the average sentence for the
subjects coded 0 and those coded 1

Group                          Mean Sentence

Missing data group (0)          11.75 years
Not missing data group (1)       6.88 years

Step 2
Run a t-test on the difference between the
means of the two groups

t = 1.33, df = 18, p = 0.1986

Since the difference between means is not
significant, the missing data process is
MCAR. The process is not related to the
length of sentence.

Is the Pattern of Missing Data in the
Variable Drug Score Related to the
Variable Sentence?

Step 1
Using the dummy variable "missing drug score" …
Compute the average sentence for the
subjects coded 0 and those coded 1

Group                          Mean Sentence

Missing data group (0)           1.25 years
Not missing data group (1)       9.50 years

Step 2
Run a t-test on the difference between the
means of the two groups

t = 2.50, df = 18, p = 0.022

Since the difference between means is
significant, the missing data process is MAR.
The process that produced the missing data
is related to the length of sentence.
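
The two group comparisons can be reproduced from the 20-case example database. The following is a minimal Python sketch, not part of the original lecture. It assumes the slides used a pooled-variance (equal variances) t-test, which is consistent with df = 18; `pooled_t` and `missingness_test` are illustrative helper names. It computes the group means, t and df only, since a p-value would additionally need the t distribution from a statistics package.

```python
import statistics

# (sentence, prior convictions, drug score); None marks a missing value.
cases = [
    (3, 2, 1), (1, 0, None), (2, None, 5), (3, 1, 7), (5, 0, 4),
    (1, 1, None), (1, 2, None), (2, None, 1), (4, 2, 3), (2, 1, None),
    (8, 3, 8), (10, 4, 7), (10, 1, 4), (20, None, 9), (14, 3, 2),
    (14, 2, 5), (7, 4, 7), (23, None, 6), (12, 0, 8), (15, 3, 6),
]

def pooled_t(group_a, group_b):
    """Student's t for two independent groups, assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / df
    std_err = (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / std_err
    return t, df

def missingness_test(column):
    """Compare mean sentence of cases missing `column` with cases that have it."""
    missing = [c[0] for c in cases if c[column] is None]       # dummy = 0
    present = [c[0] for c in cases if c[column] is not None]   # dummy = 1
    t, df = pooled_t(missing, present)
    return statistics.mean(missing), statistics.mean(present), t, df

priors_result = missingness_test(1)   # missingness in prior convictions
drug_result = missingness_test(2)     # missingness in drug score

for label, (m0, m1, t, df) in (("priors", priors_result),
                               ("drug score", drug_result)):
    print(f"{label}: mean(0)={m0:.2f} mean(1)={m1:.3f} |t|={abs(t):.2f} df={df}")
```

The sign of t only reflects which group is subtracted from which, so the sketch reports its absolute value, as the slides do.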

Is the Pattern of Missing Data in the
Variable Drug Score
Related to the Pattern of Missing Data
In the Variable Priors?

Step 1
Using the dummy variables "missing drug score"
and "missing priors" …
Construct a 2x2 cross-tabulation table

                        Drug Score
Priors           Missing (0)    Not (1)

Missing (0)           0             4
Not (1)               4            12

Step 2
Run a chi-square test on the cell frequencies.
Since one cell has zero frequency, run Fisher's
exact probability test as well.

χ² = 1.25, df = 1, p = 0.264
Fisher's p = 0.538

Since the results are not significant, the missing
data process is MCAR
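
Both tests on the 2x2 table of missingness dummies can be worked by hand. The following is a minimal Python sketch, not part of the original lecture, using only the standard library. It assumes an uncorrected Pearson chi-square and a two-sided Fisher test that sums the probabilities of all tables no more likely than the observed one; the function and variable names are illustrative.

```python
from math import comb, erfc, sqrt

# Rows: priors missing (0) / not (1); columns: drug score missing (0) / not (1).
table = [[0, 4],
         [4, 12]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Pearson chi-square: sum of (observed - expected)^2 / expected over the cells.
chi_square = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2) for j in range(2)
)
# With df = 1 the upper-tail chi-square probability equals erfc(sqrt(x / 2)).
chi_square_p = erfc(sqrt(chi_square / 2))

def table_probability(k):
    """Hypergeometric probability of k cases in the top-left cell, margins fixed."""
    return (comb(col_totals[0], k) * comb(col_totals[1], row_totals[0] - k)
            / comb(n, row_totals[0]))

probabilities = [table_probability(k)
                 for k in range(min(row_totals[0], col_totals[0]) + 1)]
observed_p = probabilities[table[0][0]]
# Two-sided Fisher p: total probability of tables no more likely than observed.
fisher_p = sum(p for p in probabilities if p <= observed_p * (1 + 1e-9))

print(f"chi-square={chi_square:.2f} p={chi_square_p:.3f} Fisher p={fisher_p:.3f}")
```

This gives a chi-square of 1.25 with a tail probability of about 0.264, and a Fisher's exact probability of about 0.538.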

Remedies for Missing Data

• Delete the cases with missing data

• Delete the variables with missing data

• Imputation of the missing values

• Model-based procedures

Case Deletion
Probably the most commonly used method.

Depending upon the number of cases deleted,
the deletion of cases …

• May reduce the power (1 - β) of the
  subsequent statistical tests, and may lead to
  a Type II error

• Will reduce the df of subsequent statistical
  tests, which may lead to a Type II error

• May reduce the representativeness of the
  sample, reducing the external validity of the
  study

• If the process of the missing data is MAR,
  may lead to incorrect generalizations of the
  results

• May bias the estimates of the variables'
  population variances and covariances,
  resulting in biased estimates of the
  statistical model's parameters and their
  associated standard errors

Variable Deletion

A poor strategy if the purpose of the study is
multivariate in nature.

Deletion of one or more variables may …

• Result in a specification error in the model

• Result in a model that over-fits the data
  and does not cross-validate

• Produce too large an error term due to the
  unexplained variance in the dependent variable

• Lead to a Type II error

• Reduce the explanatory power of the model

Imputation of Missing Values

Imputation refers to estimating the missing
values in one variable from the relationship
between that variable and other variables in the
database.

Complete Case Approach

Uses only cases with complete data across all
the variables in the database (called listwise
deletion in SPSS; also known as casewise deletion).

All-Available Approach

Uses any cases with complete data on a pair of
variables (called pairwise deletion in SPSS).
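
The difference between the two approaches shows up in how many cases each one keeps. The following is a minimal Python sketch, not part of the original lecture, applied to the 20-case example database (sentence, prior convictions, drug score); the variable names are illustrative.

```python
from itertools import combinations

names = ("sentence", "priors", "drug_score")
# (sentence, prior convictions, drug score); None marks a missing value.
cases = [
    (3, 2, 1), (1, 0, None), (2, None, 5), (3, 1, 7), (5, 0, 4),
    (1, 1, None), (1, 2, None), (2, None, 1), (4, 2, 3), (2, 1, None),
    (8, 3, 8), (10, 4, 7), (10, 1, 4), (20, None, 9), (14, 3, 2),
    (14, 2, 5), (7, 4, 7), (23, None, 6), (12, 0, 8), (15, 3, 6),
]

# Complete case approach: keep only cases with no missing value at all.
complete_cases = [c for c in cases if None not in c]

# All-available approach: for each pair of variables, keep every case
# that has data on both members of that pair.
pairwise_n = {
    (names[i], names[j]): sum(c[i] is not None and c[j] is not None
                              for c in cases)
    for i, j in combinations(range(len(names)), 2)
}

print("complete cases:", len(complete_cases))
for pair, count in pairwise_n.items():
    print(pair, count)
```

Here the complete case approach keeps 12 of the 20 cases for every statistic, while the all-available approach uses 16 cases for each pair involving sentence and 12 for priors with drug score, so different correlations rest on different subsets of cases.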

Imputation Techniques

• Case substitution

• Mean/median substitution

• Cold deck imputation

• Regression imputation

• Multiple imputation

Case Substitution

Identify the case with missing data

Find the case in the database that is most
similar to the case with the missing data

Impute the missing values from the
corresponding values of the case with complete
data

If this procedure is used on too many cases it
may …

• Reduce the external validity of the study

• Result in misrepresentation of the
  population variances and covariances,
  and produce biased estimates of the
  model's parameters and associated
  standard errors

Mean Substitution

Identify the variable with missing data

Compute some measure of central tendency of
the variable, e.g. arithmetic mean or median

Substitute the average of the variable for the
missing values

If a variable has too many missing values …

• The average will be a biased
  estimate of the true average

• The population variance will be
  underestimated

• The relationship between the variable with
  the missing data and other variables in the
  database will be underestimated, risking
  a Type II error
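
The underestimated variance can be seen with the prior convictions variable from the 20-case example database, which has 4 missing values. The following is a minimal Python sketch, not part of the original lecture; the variable names are illustrative.

```python
import statistics

# Prior convictions for the 20 example cases; None marks a missing value.
priors = [2, 0, None, 1, 0, 1, 2, None, 2, 1,
          3, 4, 1, None, 3, 2, 4, None, 0, 3]

observed = [p for p in priors if p is not None]
fill = statistics.mean(observed)               # mean of the 16 observed values

# Mean substitution: every missing value is replaced by the same number.
imputed = [fill if p is None else p for p in priors]

var_observed = statistics.variance(observed)   # based on 16 real values
var_imputed = statistics.variance(imputed)     # based on 16 real + 4 filled values

print(f"fill={fill:.4f} var_observed={var_observed:.4f} "
      f"var_imputed={var_imputed:.4f}")
```

The filled-in cases sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing N. The variance drops from 1.7625 to about 1.3914 even though the mean is unchanged.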

Cold Deck Imputation

Substitute an estimate of the missing value from
an external source:

• From a pilot study

• From a similar research study found in
  the literature

• From expert opinion or judgment

Limitations:

• An external source for the missing data
  may not be available

• In other ways the disadvantages are
  comparable to those associated with
  mean/median substitution

Regression Imputation
If the variables in the database are highly
interrelated …

Then it may be possible to estimate the
missing values …

By making the variable with the missing
values (x_m) a dependent variable, and
regressing it on the other variables (x_k)
in the database

x_m = a + b_1 x_1 + b_2 x_2 + ... + b_k x_k

By substituting the known values of the case
with missing data in the model, the missing
value can be estimated.

The efficiency of this technique depends
upon …

• The extent of the missing data, and
• The magnitude of the relationship
  between x_m and the other variables
  used in the regression model
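
The procedure can be illustrated with the 20-case example database by imputing prior convictions from sentence. The following is a minimal Python sketch, not part of the original lecture. It assumes a single predictor (sentence) and ordinary least squares fitted on the 16 cases with both values present; the variable names are illustrative.

```python
import statistics

# (sentence, prior convictions); None marks a missing prior-convictions value.
pairs = [
    (3, 2), (1, 0), (2, None), (3, 1), (5, 0), (1, 1), (1, 2), (2, None),
    (4, 2), (2, 1), (8, 3), (10, 4), (10, 1), (20, None), (14, 3),
    (14, 2), (7, 4), (23, None), (12, 0), (15, 3),
]

complete = [(x, y) for x, y in pairs if y is not None]
xs = [x for x, _ in complete]
ys = [y for _, y in complete]
mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)

# Least-squares estimates for priors = intercept + slope * sentence.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in complete)
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Substitute the known sentence of each incomplete case into the model.
estimates = [(x, intercept + slope * x) for x, y in pairs if y is None]

print(f"priors = {intercept:.4f} + {slope:.4f} * sentence")
for sentence, estimate in estimates:
    print(f"sentence={sentence:>2}  imputed priors={estimate:.2f}")
```

The imputed values (about 1.31, 1.31, 3.17 and 3.48) are fractional even though prior convictions is a count, and the relationship between the two variables in this small sample is weak, so the estimates should be treated with caution.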

Caveats in Regression Imputation

The interrelationship among the variables used
in the model must be high to produce accurate
estimates of the missing values

If the number of missing values in the imputed
variable is large …

• The imputation will reinforce sample-
  specific relationships which will not
  cross-validate

• The population variances and
  covariances will be underestimated

• Problems may be encountered if the
  imputed variable is an independent
  variable, since regression assumes no
  collinearity

• The imputed estimates may go beyond
  the bounds of a possible value, since
  regression analysis is not constrained by
  units of measurement

Multiple Imputation

Involves the use of several different imputation
techniques to produce a set of estimated values
for the missing value.

The different estimates are then averaged to
derive the imputed value

The goal is to derive an estimate of the missing
value(s) by using several different techniques
and, by averaging the various estimates,
hopefully cancel out or offset the disadvantages
of the techniques employed.

Caveats in Multiple Imputation

• Multiple imputation may or may not cancel
  out the disadvantages of the techniques
  employed.

• The success of the technique will depend
  upon the peculiarities of the database.

• It may in fact compound the disadvantages
  of the techniques used.

• Using the average of various estimates may
  reinforce relationships peculiar to the
  sample, which may not cross-validate.

• As with other techniques, the more missing
  values imputed, the less reliable the imputed
  values.

Model-Based Imputation

This involves a variety of techniques that
either …

• Incorporate the missing data process
  into the analysis as a separate variable
  to assess the amount of variance
  accounted for by the missing data

• Or use maximum likelihood estimation
  to model the missing data process and,
  based upon the results, make the
  most accurate estimates of the missing
  values

Outliers & Fringeliers

An outlier is an extreme value or case. A
fringelier is a marginally extreme value or case.

Such values can significantly affect and distort
the results of an analysis, leading to:

• Type I errors

• Type II errors

• Underestimation of significant
  findings

• Reversal of results

Of particular concern is the fact that a case can
be a multivariate outlier while not being a
univariate outlier on any individual variable.
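
A small two-variable illustration shows how this can happen. The following is a minimal Python sketch, not part of the original lecture. The slides do not name a specific multivariate tool, so squared Mahalanobis distance is swapped in here as one common choice. The ten (x, y) points are hypothetical illustration data, and the chi-square cutoff of 5.991 (df = 2, alpha = .05) is only an approximate reference for so small a sample.

```python
import statistics

# Hypothetical data: nine points near the line y = x, plus the case (3, 6),
# which is unremarkable on x and on y separately but breaks the joint pattern.
points = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5),
          (6, 6), (7, 7), (2, 3), (6, 5), (3, 6)]
xs = [p[0] for p in points]
ys = [p[1] for p in points]
mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
var_x, var_y = statistics.variance(xs), statistics.variance(ys)
cov_xy = (sum((x - mean_x) * (y - mean_y) for x, y in points)
          / (len(points) - 1))
det = var_x * var_y - cov_xy ** 2          # determinant of the covariance matrix

def mahalanobis_sq(x, y):
    """Squared Mahalanobis distance of (x, y) from the centroid (2 variables)."""
    dx, dy = x - mean_x, y - mean_y
    return (dx * dx * var_y - 2 * dx * dy * cov_xy + dy * dy * var_x) / det

# Univariate check: standard scores on each variable separately.
z_x = [(x - mean_x) / var_x ** 0.5 for x in xs]
z_y = [(y - mean_y) / var_y ** 0.5 for y in ys]
# Multivariate check: distance that takes the x-y correlation into account.
d_squared = [mahalanobis_sq(x, y) for x, y in points]

CHI2_CRIT_DF2_05 = 5.991                   # chi-square critical value, df = 2
for point, zx, zy, d2 in zip(points, z_x, z_y, d_squared):
    flag = "multivariate outlier?" if d2 > CHI2_CRIT_DF2_05 else ""
    print(f"{point}  z_x={zx:+.2f}  z_y={zy:+.2f}  D2={d2:.2f}  {flag}")
```

No case has a standard score beyond 1.96 in absolute value on either variable, yet the case (3, 6) has a squared distance of about 6.7 and is the only one above the cutoff.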

Sources of Outliers

• Recording error or data entry error

• An unusual event which causes a
  one-time change in a variable

• The beginning of a new phenomenon
  with few of the cases represented in the
  database

• Short-term change in the way the
  variable is defined

• Differences in the way agencies or
  jurisdictions define a variable

• "Apparent" outliers resulting from a sample
  that is too small

• A case that is not an outlier on each
  individual variable, but is an outlier across
  several variables, i.e. a multivariate outlier

How extreme must a case be to be an
outlier?

Is the apparent outlier an error or a
reliable value?

Are there cases that are “reasonable” on each
individual variable yet are multivariate outliers?

What impact might the outlier have on
the analysis of the data?

Is the outlier part of an extreme trend
for which there are few cases in the
database, or is it simply a very exceptional
case?

How will the outliers be dealt with?

Ways to Identify Outliers

• Histogram

• Stem & leaf diagram

• Scatterplot: 2 or 3 dimensions

• Box-Whisker plot

• Trend or time series plot

• Descriptive statistics:
    mean v. median
    minimum & maximum values
    skew & kurtosis
    interquartile range & standard deviation

• Convert data to standard scores (Z) &
  examine cases where |Z| ≥ 1.96

• Multivariate tools

Identifying Outliers with a Histogram

Example

Years served in prison by paroled felons
(outliers marked on the histogram)

Identifying Outliers with a
Stem & Leaf Diagram

Example

Years served in prison by paroled felons

Frequency    Stem &  Leaf

    16.00        0 *  0000001111111111
    20.00        0 t  22222222222233333333
    16.00        0 f  4444444455555555
     9.00        0 s  666667777
     2.00        0 .  89
     3.00        1 *  001
     4.00 Extremes    (12), (15), (16), (18)

Outliers: the four cases listed as Extremes

Identifying Bivariate Outliers with a
Scatterplot

Example
Years served in prison as a function of
length of sentence
(one possible outlier marked on the scatterplot)

Identifying Outliers with a
Box-Whisker Plot

Example

Years served in prison by paroled felons
(outliers marked on the box-whisker plot)

Identifying Outliers in a
Time Series Plot

Example

Number of arrested parolees in a county jail

Some change in policy in 2/97 caused a
substantial and permanent change in the
number of jailed parolees.

[Time series plot: number of jailed parolees (vertical axis, 0 to 1200) by year (horizontal axis, 95 to 01)]

Identifying Outliers with Descriptive
Statistics

Example

Years served in prison by paroled felons

Valid cases: 70.0     Missing cases: .0     Percent missing: .0

Mean      4.6786    Std Err     .4383    Min      .4000    Skewness  1.6584
Median    3.7500    Variance  13.4455    Max    18.2000    S E Skew   .2868
5% Trim   4.2929    Std Dev    3.6668    Range  17.8000    Kurtosis  3.1904
95% CI for Mean (3.8043, 5.5529)         IQR     4.0500    S E Kurt   .5663

The mean  median, therefore the distribution is
skewed right

The skew (1.658) is positive

The most extreme value (18.2 years) may be an
outlier


Identifying Outliers by Converting the
Data to Standard Scores (Z)
Example
Years served in prison by paroled felons
(Mean = 4.6786 years, S = 3.6668)

Time Served (years)      Z Score

        7.3               +0.71
        5.2               +0.14
       11.3               +1.81
        …                   …
        …                   …
        8.6               +1.07
       12.2               +2.05
       16.3               +3.17
       14.6               +2.71
       18.2               +3.69
        1.5               -0.87
        …                   …
        …                   …
        2.7               -0.54

Cases with |Z| > 1.96 may be
considered outliers
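
For example, using the mean and S given above, the
most extreme case (18.2 years) converts as

Z = (X - Mean) / S = (18.2 - 4.6786) / 3.6668 = +3.69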


Multivariate Outliers

Subject     X1     X2     X3

   1         1      3     18
   2         7     14     10
   3         3      5     17
   4         4      4     19
   5        15     20      3
   6        11     16      2
   7         2     18      2
   8        10     12      4
   9         5      7     18
  10         8      9      9

One of the 10 cases above is a multivariate
outlier …

A case which may appear "reasonable" on each
variable taken alone …

But which is extreme relative to the
interrelationship among all three variables.


Multivariate Outliers (cont.)

[Figure: three-dimensional scatterplot of X1 (0 to 16), X2 (0 to 24), and X3 (0 to 24)]

Case 7 is the multivariate outlier. Relative to the
three variables, its values are …

X1 = 2                                     X2 = 18          X3 = 2


Is Case 7 a Univariate Outlier?

For variable X1

Case 7 = 2

It is not an outlier.

Frequency                       Stem &               Leaf

4.00                           0     *       1234
3.00                           0     .       578
2.00                           1     *       01
1.00                           1     .       5

[Figure: box-whisker plot of X1 (N = 10), vertical axis 0 to 16]


Is Case 7 a Univariate Outlier? (cont.)

For Variable X2

Case 7 = 18

It is not an outlier

Frequency                        Stem &               Leaf

2.00                           0     *       34
3.00                           0     .       579
2.00                           1     *       24
2.00                           1     .       68
1.00                           2     *       0

[Figure: box-whisker plot of X2 (N = 10), vertical axis 0 to 30]


Is Case 7 a Univariate Outlier? (cont.)

For Variable X3

Case 7 = 2

It is not an outlier

Frequency                       Stem &               Leaf

4.00                           0     *       2234
1.00                           0     .       9
1.00                           1     *       0
4.00                           1     .       7889

[Figure: box-whisker plot of X3 (N = 10), vertical axis 0 to 30]


Is Case 7 a Bivariate Outlier with
Respect to X1 and X2?

[Figure: scatterplot of X1 and X2, with Case 7 labeled]

X1 = 2 and X2 = 18

In this bivariate relationship, Case 7 is an
outlier


Is Case 7 a Bivariate Outlier with
Respect to X1 and X3?

[Figure: scatterplot of X1 and X3, with Case 7 labeled]

X1 = 2 and X3 = 2

In this bivariate relationship, Case 7 is an
outlier


Is Case 7 a Bivariate Outlier with
Respect to X2 and X3?

[Figure: scatterplot of X2 and X3, with Case 7 labeled]

X2 = 18 and X3 = 2

In this bivariate relationship, Case 7 is not
an outlier


Identifying Multivariate Outliers among
More than Three Variables
Graphical techniques cannot be used to identify
multivariate outliers when more than three
variables are involved

In this case …

A regression model is estimated

Predictions (Y') are made with the model
using the original data

Then the prediction errors (Y' - Y), called
residuals, are plotted against the predictions
(Y') and …

Likely multivariate outliers are identified
in the resulting scatterplot

Example: Sentence as a function of …

• Age
• Prior convictions
• Drug dependency


Sentence = -17.42 + 0.9 age + 0.28 drugs + 0.5 priors

Identifying Multivariate Outliers Among More than Three Variables (cont.)

Plot of the residuals (Y' - Y) against the
predictions (Y')

[Figure: residual plot, with possible multivariate outliers marked]


What To Do With Outliers?

• There is no “silver bullet”. It is a matter
  of judgment.

• If the outlier is an error ... correct it

• Analyze the data with and without the
  outlier and see if it makes a difference

• Transform the data to reduce the
  influence of the outlier or skew in the
  data, assuming that the problem is due to
  sampling error

• Increase the sample size if the
  “apparent” outlier resulted from too
  small a sample

• Use a parameter estimating algorithm
  that is less sensitive to outliers
  (maximum likelihood estimation
  v. OLS)


```
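
The ten-case example on the slides (X1, X2, X3) can be checked numerically. The sketch below is illustrative only: it converts each variable to standard scores, as the slides describe, and then computes a squared Mahalanobis distance for each case. Mahalanobis distance is not named on the slides; it is assumed here as one common multivariate tool for this job. The function names (`z_scores`, `solve`, `mahalanobis_sq`) are hypothetical, and the sample (n - 1) standard deviation and covariance are assumptions.

```python
from statistics import mean, stdev

# The ten cases (X1, X2, X3) from the slide's table.
CASES = [
    (1, 3, 18), (7, 14, 10), (3, 5, 17), (4, 4, 19), (15, 20, 3),
    (11, 16, 2), (2, 18, 2), (10, 12, 4), (5, 7, 18), (8, 9, 9),
]


def z_scores(values):
    """Standard scores using the sample standard deviation (n - 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column into place.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def mahalanobis_sq(cases):
    """Squared Mahalanobis distance of each case from the centroid."""
    n, p = len(cases), len(cases[0])
    means = [mean(col) for col in zip(*cases)]
    devs = [[x - m for x, m in zip(case, means)] for case in cases]
    # Sample covariance matrix of the variables.
    cov = [[sum(d[i] * d[j] for d in devs) / (n - 1) for j in range(p)]
           for i in range(p)]
    # d' S^-1 d, computed by solving S x = d instead of inverting S.
    return [sum(di * xi for di, xi in zip(d, solve(cov, d))) for d in devs]


z_by_variable = [z_scores(col) for col in zip(*CASES)]
largest_abs_z = max(abs(z) for zs in z_by_variable for z in zs)
d2 = mahalanobis_sq(CASES)
most_extreme_case = d2.index(max(d2)) + 1  # subjects are numbered from 1

print(f"largest |Z| on any single variable: {largest_abs_z:.2f}")
print(f"case with the largest squared distance: {most_extreme_case}")
```

With these data no case reaches |Z| = 1.96 on any single variable, yet Case 7 has the largest distance from the centroid, which matches the slides' point that a multivariate outlier can look reasonable one variable at a time. With only ten cases the distances are best read as a ranking rather than compared against a formal cutoff.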