# Frequency Distributions by nikeborome

```									   SW388R7
Data Analysis &
Computers II     Analyzing Missing Data
Slide 1

Introduction

Problems

Using Scripts

Slide 2: Missing data and data analysis

• Missing data is a problem in multivariate analysis because a case
  will be excluded from the analysis if it is missing data for any
  variable included in the analysis.

• If our sample is large, we may be able to allow cases to be
  excluded.

• If our sample is small, we will try to use a substitution method so
  that we can retain enough cases to have sufficient power to detect
  effects.

• In either case, we need to make certain that we understand the
  potential impact that missing data may have on our analysis.

Slide 3: Tools for evaluating missing data

• SPSS has a specific package for evaluating missing data, but it is
  not included under the UT license.

• In place of this package, we will first examine missing data using
  standard SPSS statistics and procedures.

• After studying the standard SPSS procedures that we can use to
  examine missing data, we will use an SPSS script that will produce
  the output needed for missing data analysis without requiring us
  to issue all of the SPSS commands individually.

Slide 4: Key issues in missing data analysis

• We will focus on three key issues for evaluating missing data:
    - The number of cases missing per variable
    - The number of variables missing per case
    - The pattern of correlations among variables created to
      represent missing and valid data.

• Further analysis may be required depending on the problems
  identified in these analyses.

Slide 5: Problem 1

Slide 6: Identifying the number of cases in the data set

This problem asks whether any variable is missing data for half or
more of the cases.

Our first task is to identify the number of cases that meets that
criterion.

If we scroll to the bottom of the data set, we see that there are
270 cases in the data set.

270 ÷ 2 = 135.

If any variable included in the analysis has 135 or more missing
cases, the answer to the problem will be true.

Slide 7: Request frequency distributions

We will use the output for frequency distributions to find the
number of missing cases for each variable.

Select the Descriptive Statistics | Frequencies… command from the
Analyze menu.

Slide 8: Completing the specification for frequencies

First, move the five
variables included in the
problem statement to
the list box for variables.

Second, click on the OK
button to complete the
request for statistical
output.

Slide 9: Number of missing cases for each variable

The Statistics table at the top of the Frequencies output details
the number of missing cases for each variable in the analysis.
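
# A minimal Python sketch (not SPSS) of what the Statistics table reports:
# the number of missing cases per variable. The three rows below are
# made-up illustration data, not cases from the course data set; None
# stands in for a missing value, and missing_per_variable() is a
# hypothetical helper.
rows = [
    {"wrkstat": 1, "hrs1": 40, "wrkslf": 2},
    {"wrkstat": 5, "hrs1": None, "wrkslf": None},
    {"wrkstat": 1, "hrs1": 35, "wrkslf": None},
]

def missing_per_variable(rows, variables):
    """Return {variable: number of cases where the value is missing}."""
    return {v: sum(1 for r in rows if r.get(v) is None) for v in variables}

missing_counts = missing_per_variable(rows, ["wrkstat", "hrs1", "wrkslf"])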

Slide 10

With 270 subjects in the data set, variables missing data for 135 or
more cases would correctly be characterized as missing data for half
or more of the cases in the data set.

One variable was incorrectly characterized as missing half or more
of the 270 cases: "self-employment" [wrkslf] was missing data for 20
of the 270 cases (7.4%).

None of the variables in this analysis was missing data for half or
more of the 270 cases in the data set.

False is the correct answer.
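
# A minimal Python sketch of the check worked through above. Only the
# figures quoted on the slide are used: 270 cases, and 20 missing cases
# for wrkslf. The function name is hypothetical.
def missing_half_or_more(n_missing, n_cases):
    """True when a variable is missing data for half or more of the cases."""
    return n_missing >= n_cases / 2

n_cases = 270
wrkslf_missing = 20
wrkslf_percent = round(100 * wrkslf_missing / n_cases, 1)
wrkslf_flagged = missing_half_or_more(wrkslf_missing, n_cases)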

Slide 11: Problem 2

Slide 12: Create a variable that counts missing data

We want to know how many of the five variables are missing data for
each case in the data set.

We will create a variable containing this information that uses an
SPSS function to count the number of variables with missing data.

To compute a new variable, select the Compute… command from the
Transform menu.

Slide 13: Enter specifications for new variable

First, type in the name for the new variable, nmiss, in the Target
Variable text box.

Second, scroll down the list of functions and highlight the NMISS
function.

Third, click on the up arrow button to move the NMISS function into
the Numeric Expression text box.

Slide 14: Enter specifications for new variable

The NMISS function is moved into the Numeric Expression text box.

To specify the variables to count missing data for, first highlight
the first variable to include in the function, wrkstat.

Second, click on the right arrow button to move the variable name
into the function arguments.

Slide 15: Enter specifications for new variable

First, before adding another variable to the function, type a comma
to separate the names of the variables.

Second, highlight the second variable to include in the function,
hrs1.

Third, click on the right arrow button to move the variable name
into the function arguments.

Slide 16: Complete specifications for new variable

Continue adding variables to the function until all of the variables
specified in the problem have been added.

Be sure to type a comma between the variable names.

When all of the variables have been added, click on the OK button to
complete the specifications.
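
# A minimal Python sketch of what the NMISS function computes: for each
# case, the number of listed variables that are missing. The two cases are
# made-up illustration data (None marks a missing value); nmiss() is a
# hypothetical helper, not SPSS syntax.
VARIABLES = ["wrkstat", "hrs1", "wrkslf", "wrkgovt", "prestg80"]

def nmiss(case, variables=VARIABLES):
    """Count how many of the named variables are missing for one case."""
    return sum(1 for v in variables if case.get(v) is None)

complete_case = {"wrkstat": 1, "hrs1": 40, "wrkslf": 2, "wrkgovt": 2, "prestg80": 51}
sparse_case = {"wrkstat": 5, "hrs1": None, "wrkslf": None, "wrkgovt": None, "prestg80": None}
nmiss_values = [nmiss(complete_case), nmiss(sparse_case)]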

Slide 17: The nmiss variable in the data editor

If we scroll the worksheet
to the right, we see the new
variable that SPSS has just
computed for us.

Slide 18: A frequency distribution for nmiss

To answer the question of how many cases had each possible number of
missing values, we create a frequency distribution.

Select the Descriptive Statistics | Frequencies… command from the
Analyze menu.

Slide 19: Completing the specification for frequencies

First, move the nmiss
variable to the list of
variables.

Second, click on the OK
button to complete the
request for statistical
output.

Slide 20: The frequency distribution

SPSS produces a frequency distribution for the nmiss variable.

The table shows how many cases had valid values for all 5 variables
and how many had 1 missing value; 1 case had 2 missing values; and
14 cases had missing values for 4 variables.

Slide 21

The problem asked whether or not 14
cases had missing data for more than half
the variables. For a set of five variables,
cases that had 3, 4, or 5 missing values
would meet this requirement.

The number of cases with 3 missing
variables is 0 (not shown in table), with 4
missing variables is 14, and with 5 missing
variables is 0, for a total of 14.

The answer to the problem is true.
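
# A minimal Python sketch of the tally above. Only the frequencies the
# slides state are entered (1 case with 2 missing values, 14 cases with 4,
# none with 3 or 5); the counts for 0 and 1 missing values are left out
# because they do not affect the answer.
n_variables = 5
stated_frequencies = {2: 1, 3: 0, 4: 14, 5: 0}   # nmiss value -> cases

cases_over_half = sum(
    count for n_missing, count in stated_frequencies.items()
    if n_missing > n_variables / 2
)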

Slide 22: Problem 3

Slide 23: Compute valid/missing dichotomous variables

To evaluate the pattern of missing data, we need to compute
dichotomous valid/missing variables for each of the five variables
included in the analysis.

We will compute the new variable using the Recode command.

To create the new variable, select Recode | Into Different
Variables… from the Transform menu.

Slide 24: Enter specifications for new variable

First, move the first variable in the analysis, wrkstat, into the
Numeric Variable -> Output Variable text box.

Second, type the name for the new variable into the Name text box.
My convention is to add an underscore character to the end of the
variable name.

If this would make the variable more than 8 characters long, delete
characters from the end of the original variable name.

Slide 25: Enter specifications for new variable

Next, type the label for the new variable into the Label text box.
My convention is to add "(Valid/Missing)" to the end of the variable
label for the original variable.

Finally, click on the Change button to add the dichotomous variable
to the Numeric Variable -> Output Variable text box.

Slide 26: Enter specifications for new variable

To specify the values for the
new variable, click on the Old
and New Values… button.

Slide 27: Change the value for missing data

The dichotomous variable should be coded 1 if the variable has a
valid value, 0 if the variable has a missing value.

First, mark the System- or user-missing option button.

Second, type 0 in the Value text box.

Third, click on the Add button to include this change in the
Old->New list box.

Slide 28: Change the value for valid data

First, mark the All other values option button.

Second, type 1 in the Value text box.

Third, click on the Add button to include this change in the
Old->New list box.

Slide 29: Complete the value specifications

Having entered the values
for recoding the variable
into dichotomous values, we
click on the Continue button
to complete this dialog box.

Slide 30: Complete the recode specifications

Having entered specifications for the
new variable and the values for
recoding the variable into dichotomous
values, we click on the OK button to
produce the new variable.

Slide 31: The dichotomous variable

The procedure for creating a dichotomous
valid/missing variable is repeated for the
four other variables in the analysis: hrs1,
wrkslf, wrkgovt, and prestg80.
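
# A minimal Python sketch of the recode rule: 1 for a valid value, 0 for a
# missing one, applied to every variable. The case is made-up illustration
# data (None marks a missing value). The trailing-underscore names follow
# the naming convention described earlier; the shortened name for
# prestg80 is an assumption, and valid_missing_flags() is a hypothetical
# helper.
NEW_NAMES = {
    "wrkstat": "wrkstat_", "hrs1": "hrs1_", "wrkslf": "wrkslf_",
    "wrkgovt": "wrkgovt_", "prestg80": "prestg8_",
}

def valid_missing_flags(case):
    """Return the dichotomous valid/missing variables for one case."""
    return {new: 0 if case.get(old) is None else 1 for old, new in NEW_NAMES.items()}

flags = valid_missing_flags(
    {"wrkstat": 1, "hrs1": None, "wrkslf": 2, "wrkgovt": 2, "prestg80": None}
)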

Slide 32: Filtering cases with excessive missing variables

If we include the cases that have more than half of the variables
missing, we will inflate the correlations. To prevent this, we
exclude these cases before creating the correlation matrix.

We do this by selecting in, or filtering, cases that are missing
fewer than half of the variables, i.e. fewer than 3 missing
variables.

To filter cases included in further analysis, we choose the Select
Cases… command from the Data menu.

Slide 33: Enter specifications for selecting cases

First, click on the If
condition is satisfied
option button on the
Select panel.

Second, click on the If…
button to enter the
criteria for including
cases.

Slide 34: Enter specifications for selecting cases

First, enter the criteria
for including cases:

nmiss < 3

Second, click
on the Continue
button to
complete the If
specification.

Slide 35: Complete the specifications for selecting cases

To complete the
specifications, click
on the OK button.
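
# A minimal Python sketch of the selection rule nmiss < 3. The list is
# illustrative: the slides state that 14 of the 270 cases have 4 missing
# variables, and the remaining values are filled with 0 as a placeholder.
nmiss_column = [4] * 14 + [0] * 256

selected = [n for n in nmiss_column if n < 3]   # cases kept for analysis
excluded = len(nmiss_column) - len(selected)    # cases filtered out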

Slide 36: Cases excluded from further analyses

SPSS marks the cases that will not be
included in further analyses by drawing
a slash mark through the case number.

We can verify that the selection is
working correctly by noting that the
case which is omitted had 4 missing
variables.

Slide 37: Correlating the dichotomous variables

To compute a correlation matrix for the dichotomous variables,
select the Correlate | Bivariate command from the Analyze menu.

Slide 38: Specifications for correlations

First, move the
dichotomous variables
to the variables list box.

Second, click on
the OK button to
complete the
request.

Slide 39: The correlation matrix

The correlation matrix is symmetric along the diagonal (shown by the
blue line). The correlation for any pair of variables is included
twice in the table. So we only count the correlations below the
diagonal (the cells with the yellow background).

Correlations (columns 1 to 5 are the row variables, in the same order)

                                                      (1)     (2)     (3)     (4)     (5)
(1) LABOR FRCE STATUS            Pearson Correlation  .a      .a      .a      .a      .a
    (Valid/Missing)              Sig. (2-tailed)      .       .       .       .       .
                                 N                    256     256     256     256     256
(2) NUMBER OF HOURS WORKED       Pearson Correlation  .a      1       -.049   .a      -.042
    LAST WEEK (Valid/Missing)    Sig. (2-tailed)      .       .       .437    .       .501
                                 N                    256     256     256     256     256
(3) R SELF-EMP OR WORKS FOR      Pearson Correlation  .a      -.049   1       .a      -.010
    SOMEBODY (Valid/Missing)     Sig. (2-tailed)      .       .437    .       .       .877
                                 N                    256     256     256     256     256
(4) GOVT OR PRIVATE EMPLOYEE     Pearson Correlation  .a      .a      .a      .a      .a
    (Valid/Missing)              Sig. (2-tailed)      .       .       .       .       .
                                 N                    256     256     256     256     256
(5) RS OCCUPATIONAL PRESTIGE     Pearson Correlation  .a      -.042   -.010   .a      1
    SCORE (1980) (Valid/Missing) Sig. (2-tailed)      .       .501    .877    .       .
                                 N                    256     256     256     256     256
a. Cannot be computed because at least one of the variables is constant.

Slide 40: The correlation matrix

The correlations marked with footnote letter a could not be computed
because one of the variables was a constant, i.e. the dichotomous
variable has the same value for all cases.

This happens when one of the valid/missing variables has no missing
cases, so that all of the cases have a value of 1 and none have a
value of 0.

Correlations (columns 1 to 5 are the row variables, in the same order)

                                                      (1)     (2)     (3)     (4)     (5)
(1) LABOR FRCE STATUS            Pearson Correlation  .a      .a      .a      .a      .a
    (Valid/Missing)              Sig. (2-tailed)      .       .       .       .       .
                                 N                    256     256     256     256     256
(2) NUMBER OF HOURS WORKED       Pearson Correlation  .a      1       -.049   .a      -.042
    LAST WEEK (Valid/Missing)    Sig. (2-tailed)      .       .       .437    .       .501
                                 N                    256     256     256     256     256
(3) R SELF-EMP OR WORKS FOR      Pearson Correlation  .a      -.049   1       .a      -.010
    SOMEBODY (Valid/Missing)     Sig. (2-tailed)      .       .437    .       .       .877
                                 N                    256     256     256     256     256
(4) GOVT OR PRIVATE EMPLOYEE     Pearson Correlation  .a      .a      .a      .a      .a
    (Valid/Missing)              Sig. (2-tailed)      .       .       .       .       .
                                 N                    256     256     256     256     256
(5) RS OCCUPATIONAL PRESTIGE     Pearson Correlation  .a      -.042   -.010   .a      1
    SCORE (1980) (Valid/Missing) Sig. (2-tailed)      .       .501    .877    .       .
                                 N                    256     256     256     256     256
a. Cannot be computed because at least one of the variables is constant.
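
# A minimal Python sketch of why footnote a appears: Pearson's r divides by
# the spread of each variable, which is zero for a constant. pearson_r()
# is a hypothetical helper and the two short lists are made-up
# illustration data, not the course data set.
from math import sqrt

def pearson_r(x, y):
    """Return Pearson's r, or None when either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        return None   # at least one variable is constant
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / sqrt(ss_x * ss_y)

all_valid = [1, 1, 1, 1]        # no missing cases: every flag is 1
some_missing = [1, 0, 1, 1]     # one missing case
r_constant = pearson_r(all_valid, some_missing)
r_self = pearson_r(some_missing, some_missing)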

Slide 41: The correlation matrix

In the cells for which the correlation could be computed, the
probabilities indicating significance are 0.437, 0.501, and 0.877.

The correlation of -.042 between the missing/valid pair for "number
of hours worked in the past week" [hrs1] and "occupational prestige
score" [prestg80] was not statistically significant (p=0.501) and
should not be interpreted as indicating a non-random pattern of
missing data.

Correlations (columns 1 to 5 are the row variables, in the same order)

                                                      (1)     (2)     (3)     (4)     (5)
(1) LABOR FRCE STATUS            Pearson Correlation  .a      .a      .a      .a      .a
    (Valid/Missing)              Sig. (2-tailed)      .       .       .       .       .
                                 N                    256     256     256     256     256
(2) NUMBER OF HOURS WORKED       Pearson Correlation  .a      1       -.049   .a      -.042
    LAST WEEK (Valid/Missing)    Sig. (2-tailed)      .       .       .437    .       .501
                                 N                    256     256     256     256     256
(3) R SELF-EMP OR WORKS FOR      Pearson Correlation  .a      -.049   1       .a      -.010
    SOMEBODY (Valid/Missing)     Sig. (2-tailed)      .       .437    .       .       .877
                                 N                    256     256     256     256     256
(4) GOVT OR PRIVATE EMPLOYEE     Pearson Correlation  .a      .a      .a      .a      .a
    (Valid/Missing)              Sig. (2-tailed)      .       .       .       .       .
                                 N                    256     256     256     256     256
(5) RS OCCUPATIONAL PRESTIGE     Pearson Correlation  .a      -.042   -.010   .a      1
    SCORE (1980) (Valid/Missing) Sig. (2-tailed)      .       .501    .877    .       .
                                 N                    256     256     256     256     256
a. Cannot be computed because at least one of the variables is constant.

Slide 42

False is the correct answer.

None of the correlations among the missing/valid variables were
statistically significant. The correlation matrix does not indicate
a non-random pattern of missing data.

Fourteen cases were excluded from the calculations for the
correlation matrix because they were missing more than half of the
variables.

Correlations (columns 1 to 5 are the row variables, in the same order)

                                                      (1)     (2)     (3)     (4)     (5)
(1) LABOR FRCE STATUS            Pearson Correlation  .a      .a      .a      .a      .a
    (Valid/Missing)              Sig. (2-tailed)      .       .       .       .       .
                                 N                    256     256     256     256     256
(2) NUMBER OF HOURS WORKED       Pearson Correlation  .a      1       -.049   .a      -.042
    LAST WEEK (Valid/Missing)    Sig. (2-tailed)      .       .       .437    .       .501
                                 N                    256     256     256     256     256
(3) R SELF-EMP OR WORKS FOR      Pearson Correlation  .a      -.049   1       .a      -.010
    SOMEBODY (Valid/Missing)     Sig. (2-tailed)      .       .437    .       .       .877
                                 N                    256     256     256     256     256
(4) GOVT OR PRIVATE EMPLOYEE     Pearson Correlation  .a      .a      .a      .a      .a
    (Valid/Missing)              Sig. (2-tailed)      .       .       .       .       .
                                 N                    256     256     256     256     256
(5) RS OCCUPATIONAL PRESTIGE     Pearson Correlation  .a      -.042   -.010   .a      1
    SCORE (1980) (Valid/Missing) Sig. (2-tailed)      .       .501    .877    .       .
                                 N                    256     256     256     256     256
a. Cannot be computed because at least one of the variables is constant.
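
# A minimal Python sketch of the decision above, using the three
# probabilities reported in the matrix. The 0.05 cutoff is an assumption;
# the slides do not state the alpha level used.
ALPHA = 0.05   # assumed significance level
p_values = {
    ("hrs1", "wrkslf"): 0.437,
    ("hrs1", "prestg80"): 0.501,
    ("wrkslf", "prestg80"): 0.877,
}

significant_pairs = [pair for pair, p in p_values.items() if p < ALPHA]
nonrandom_pattern = bool(significant_pairs)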

Slide 43: Using scripts

• The process of evaluating missing data requires numerous SPSS
  procedures and outputs that are time-consuming to produce.

• These procedures can be automated by creating an SPSS script. A
  script is a program that executes a sequence of SPSS commands.

• Though writing scripts is not part of this course, we can take
  advantage of a script that I use to reduce the burdensome tasks of
  evaluating missing data.

Slide 44: Using a script for missing data

• The script
  “EvaluatingAssumptions_MissingData_Outliers_2004.SBS”
  will produce all of the output we have used for evaluating missing
  data, as well as other outputs described in the textbook.

• Navigate to the link “SPSS Scripts and Syntax” on the course web
  page.

• Download
  “EvaluatingAssumptions_MissingData_Outliers_2004.exe”
  to your computer and install it, following the directions on the
  web page.

Slide 45: Open the data set in SPSS

Before using a script, a data
set should be open in the
SPSS data editor.

Slide 46: Invoke the script

To invoke the script, select the Run Script… command.

Slide 47: Select the missing data script

First, navigate to the folder where you put the script. If you
followed the directions, you will have a file with an ".SBS"
extension in the C:\StudentData\SW388R7 folder.

If you only see a file with an “.EXE” extension in the folder, you
should double click on that file to extract the script file to the
C:\StudentData\SW388R7 folder.

Second, click on the script name to highlight it.

Third, click on the Run button to start the script.

Slide 48: The script dialog

The script dialog box acts
similarly to SPSS dialog
boxes. You select the
variables to include in the
analysis and choose options
for the output.

Slide 49: Complete the specifications

We accept the
default option to
Check missing data.

Select the variables for the analysis. This
analysis uses the variables for the last
problem we worked. For the missing data
check, it does not matter what role we
assign to the variables.

Click on the OK
button to produce
the output.

Slide 50: The script finishes

Since it may take a while to produce the output, and since there are
times when it appears that nothing is happening, an alert appears to
tell you when the script is finished.

When you see this alert, click on the OK button and view the output.

Note: the script dialog box does not close by itself. This is
purposeful so that you can test assumptions or detect outliers
without having to redo variable selection.

Slide 51: Output from the script

The script will produce lots of