Analyzing Patterns of Missing Data


          While SPSS contains a rich set of procedures for analyzing patterns of missing data, they
          are not included in the set of tools licensed by the University. However, we can
          replicate much of the analysis with other SPSS procedures.

          The first set of tasks in the missing data analysis involves the creation of diagnostic
          variables that support the analysis: first, a variable that counts the number of variables
          with missing data for each case; second, one new dichotomous variable for each original
          variable that indicates whether or not the original variable had a missing data value;
          and third, a single pattern variable for each case that summarizes the missing or valid
          status of values for all of the variables in the analysis.

          Using the diagnostic variable that counts the missing values for each case, we can
          identify cases with large concentrations of missing data as candidates for elimination
          from the analysis. After we remove specific cases with large numbers of missing
          variables, we do a frequency distribution for the remaining cases to see if any variables
          have so many missing cases that the variable should be considered a candidate for
          exclusion.

          Next, we compute a frequency distribution for the pattern variable to identify patterns
          that occur often in the data, indicating a problematic missing data process.

          Next, using the valid/missing variables as a grouping variable, we examine whether or
          not the missing cases are statistically different from the valid cases for all of the other
          variables in the analysis. If the variable is metric, we do a t-test for group differences;
          if the variable is non-metric, we do a chi-square test of independence to detect group
          differences.

          Finally, we do a correlation matrix of the valid/missing variables to detect
          concentrations of missing data across multiple variables.
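
          As a rough illustration of this last step, the Python sketch below computes a Pearson
          correlation between valid/missing indicators coded 1 for valid and 0 for missing. It is
          not SPSS syntax; the indicator values and the helper name pearson are hypothetical and
          serve only to show the calculation.

```python
from math import sqrt

# Hypothetical valid/missing indicators for six cases (1 = valid, 0 = missing).
indicators = {
    "x1_": [1, 1, 0, 0, 1, 0],
    "x2_": [1, 1, 0, 0, 1, 1],
    "x3_": [1, 1, 1, 1, 1, 1],  # never missing, so it has no variation
}


def pearson(a, b):
    """Pearson correlation of two equal-length lists, or None if either is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return None  # correlation is undefined for a constant indicator
    return cov / sqrt(var_a * var_b)


r_12 = pearson(indicators["x1_"], indicators["x2_"])
r_13 = pearson(indicators["x1_"], indicators["x3_"])
print(round(r_12, 3), r_13)
```

          A high positive correlation between two indicators suggests that the two variables tend
          to be missing for the same cases.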

                                     1. Download the data set

          Download the HATMISS data set from the course web page and save it in your
          C:\SW388R7 folder.




                         2. Tallying the Number of Missing Variables

          One of the major information items we need for the missing data analysis is the number
          of variables that have missing data for each case in the sample.

          We will create a new variable, which we will name num_miss, that will contain the
          number of variables with missing values among the first ten in the data set, x1
          through x10. We include only the first ten variables in this calculation to maintain
          consistency with the text.

          The SPSS function NMISS counts the number of variables that have missing values. We
          will use this function to calculate the value of our num_miss variable for each case.
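
          To make the counting logic concrete, here is a minimal Python sketch of what the count
          does for each case. It is an illustration outside SPSS, not the NMISS function itself:
          the three cases, their values, and the helper name count_missing are hypothetical, and
          None stands in for a missing value.

```python
# Hypothetical cases; None marks a missing value.
cases = [
    {"x1": 3.3, "x2": 0.9, "x3": None},
    {"x1": None, "x2": None, "x3": 8.6},
    {"x1": 4.1, "x2": 0.6, "x3": 6.9},
]
variables = ["x1", "x2", "x3"]  # the walkthrough itself uses x1 through x10


def count_missing(case, names):
    """Count how many of the named variables are missing for one case."""
    return sum(1 for name in names if case[name] is None)


# Store the count on each case under the same name the walkthrough uses.
for case in cases:
    case["num_miss"] = count_missing(case, variables)

print([case["num_miss"] for case in cases])  # [1, 2, 0]
```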




                            Computing the Number Missing by Case


          First, select the 'Compute…' command from the 'Transform' menu.

          Second, type the name of the variable we want to create, 'num_miss', in the 'Target
          Variable:' text box.

          Third, scroll down the list of functions and highlight the 'NMISS' function.

          Fourth, click on the move arrow to move the function to the 'Numeric Expression:'
          text area.




                            Specifying the Variables in the Function


                                              First, type the names of the variables to include
                                              in the function as a comma-delimited list between
                                              the parentheses after the function.




                           Second, click on the
                           OK button to produce
                           the new variable.


                                                                                 Third, the new
                                                                                 variable appears in
                                                                                 a column to the
                                                                                 right of the existing
                                                                                 columns of data.




    3. Creating Dichotomous Valid/Missing Variables for Diagnosing Missing Data

          To determine whether or not the pattern of missing data is random, we create, for each
          original variable, a diagnostic variable that indicates whether the original variable
          is missing or valid for each case in the data set. Each diagnostic variable is
          dichotomous, using the value 1 for 'Valid' and the value 0 for 'Missing'.

          Since we may need to refer back to the original variables in the course of the missing
          data analysis, I recommend a naming convention for the diagnostic variables that makes
          it easy to identify the original variable. If the original variable name has fewer than eight
          characters, an underscore is appended to the end of the original variable name, e.g. the
          diagnostic variable for race would be race_. If the original variable name is eight
          characters, the last character is replaced with an underscore, e.g. the diagnostic
          variable name for response would be respons_. If replacing the last character with an
          underscore duplicates the name assigned to another diagnostic variable for an eight-
          character variable name, we drop the last two characters from the original name and
          append an underscore followed by a sequence letter or digit, e.g. the diagnostic
          variable name for response would be respon_1 if we had already used the name respons_
          for a diagnostic variable.
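
          The naming rules above can be sketched as a small Python function. This is only an
          illustration of the convention, not part of SPSS: the function name diagnostic_name and
          the variable respons1 are hypothetical, and applying the collision rule to names
          shorter than eight characters is an assumption the text does not cover.

```python
MAX_NAME_LENGTH = 8  # the eight-character name limit the convention is built around


def diagnostic_name(original, used):
    """Return a diagnostic variable name for `original`.

    `used` holds the diagnostic names already assigned, so collisions can be detected.
    """
    if len(original) < MAX_NAME_LENGTH:
        candidate = original + "_"                         # race -> race_
    else:
        candidate = original[:MAX_NAME_LENGTH - 1] + "_"   # response -> respons_
    if candidate not in used:
        return candidate
    # Collision: keep the first six characters, then add an underscore and a sequence digit.
    # (Using this fallback for names shorter than eight characters is an assumption.)
    for sequence in "123456789":
        candidate = original[:MAX_NAME_LENGTH - 2] + "_" + sequence
        if candidate not in used:
            return candidate
    raise ValueError("no unused diagnostic name available for " + original)


used = set()
for name in ["race", "respons1", "response"]:
    new_name = diagnostic_name(name, used)
    used.add(new_name)
    print(name, "->", new_name)
# race -> race_
# respons1 -> respons_
# response -> respon_1
```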

          When we assign variable labels to the diagnostic variables, we can add a keyword to the
          original variable label to designate it as a missing/valid diagnostic variable, e.g. the
          variable label for the diagnostic variable that had an original variable label of Grade
          Level could be Grade Level (Valid/Missing).

          We will demonstrate the process of creating dichotomous Valid/Missing variables for
          diagnosing missing data using the variables in the HATMISS.SAV data set. If the copy of
          HATMISS.SAV that you are working with does not have variable labels and value labels,
          do the exercise Applying a Data Dictionary to apply the data labels from the HATCO.SAV
          data set to the HATMISS.SAV data set. A quick test for the presence of variable labels is
          to position the mouse over a variable name in the data editor. If a variable label
          appears in a yellow tips box, a variable label has been added for that variable.
                       Recoding Diagnostic Variables for Missing Data

          First, select the 'Recode | Into Different Variables…' command from the 'Transform'
          menu.

          Second, move the first original variable, x1, from the input list to the list box
          'Numeric Variable -> Output Variable.'

          Third, type the new variable name, x1_, into the 'Name:' text box on the 'Output
          Variable' panel.

          Fourth, type the variable label for the new x1_ variable, 'Delivery Speed
          (Valid/Missing)', in the 'Label:' text box on the 'Output Variable' panel.

          Fifth, click on the Change button to move the new name to the 'Numeric Variable ->
          Output Variable' list.




                         Opening the Dialog for Old and New Values




                                     To specify which old values are to be
                                     recoded into new values, click on the
                                     'Old and New Values…' button.




                                        Add the Value for Missing Data

          First, click on the 'System- or user-missing' option button on the 'Old Value' panel.

          Second, type 0 into the 'Value:' text box in the 'New Value' panel.

          Third, to add these value changes to the list of recodes, click on the 'Add' button.
          The change is added to the 'Old --> New:' list.




                                             Add the Value for Valid Data

          First, click on the 'All other values' option button on the 'Old Value' panel.

          Second, type 1 into the 'Value:' text box in the 'New Value' panel.

          Third, to add these value changes to the list of recodes, click on the 'Add' button.
          The change is added to the 'Old --> New:' list.




                                 Completing the Values Dialog Box




                                      Since this is the last
                                      value specification,
                                      we click on the
                                      continue button to
                                      close the dialog box.




              Adding Diagnostic Variables for the Remaining Variables

                                     First, add the original name, the new diagnostic variable name,
                                     and the variable label for the diagnostic variable for all of the
                                     other variables through x14. The same value changes which we
                                     specified for x1_ will be applied to these variables.




                                                                                      Second, click on the
                                                                                      OK button to complete
                                                                                      the recode request.
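
          The recode built in these dialogs maps every missing value to 0 and every other value
          to 1, and writes the result to a new variable whose name ends in an underscore. A
          minimal Python sketch of that mapping follows. It is an illustration outside SPSS: the
          two cases, their values, and the helper name valid_flag are hypothetical, and None
          stands in for a missing value.

```python
# Hypothetical cases; None marks a missing value.
cases = [
    {"x1": 3.3, "x2": None},
    {"x1": None, "x2": 0.9},
]
originals = ["x1", "x2"]  # the walkthrough recodes x1 through x14

# The value labels the diagnostic variables are given in the walkthrough.
VALUE_LABELS = {0: "Missing", 1: "Valid"}


def valid_flag(value):
    """Return 0 for a missing value and 1 for any other value."""
    return 0 if value is None else 1


# Add one diagnostic variable per original variable, e.g. x1 -> x1_.
for case in cases:
    for name in originals:
        case[name + "_"] = valid_flag(case[name])

for case in cases:
    print({name + "_": VALUE_LABELS[case[name + "_"]] for name in originals})
# {'x1_': 'Valid', 'x2_': 'Missing'}
# {'x1_': 'Missing', 'x2_': 'Valid'}
```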




                     Adding Value Labels to the Diagnostic Variables



          To add value labels to the diagnostic variables, first we go to the Variable View
          worksheet in the Data Editor.

          Second, highlight the Values cell for the variable we want to work with.

          Third, click on the gray button which appears in the cell to bring up the Value Labels
          dialogue box.




                                Adding the Value Label for Missing

                                             First, we type a 0 in the 'Value' text
                                             box on the 'Value Labels' Panel.



                                                       Second, we type 'Missing' in
                                                       the 'Value Label' text box on
                                                       the 'Value Labels' Panel.



                                           Third, we click on the Add button to
                                           add this value label to the list box.




                                     Add the Value Label for Valid

                                                                      First, we type a 1 in
                                                                      the 'Value' text box on
                                                                      the 'Value Labels'
                                                                      Panel.




                                                                       Second, we type
                                                                       'Valid' in the 'Value
                                                                       Label' text box on the
                                                                       'Value Labels' Panel.

                                            Third, we click on
                                            the Add button to add
                                            this value label to the
                                            list box.




                                     Apply the Value Labels



          Click on the OK button to apply the value labels.




                        Displaying the Value Labels for the Variables


                                              To display the value labels in the SPSS
                                              Data Editor window, we first return to the
                                              Data View worksheet of the Data Editor.
                                              There we select the 'Value Labels'
                                              command from the View menu. When the
                                              command is in effect, a check mark will
                                              appear before the command.

                                              To restore the display to the numeric code
                                              display, we select the 'Value Labels'
                                              command a second time to toggle it off.




                                       The Diagnostic Variables

                                     The value labels for the variables appear in the SPSS Data
                                     Editor. The display would be improved by adjusting the width
                                     of the data columns. This display can be used to examine the
                                     pattern of missing values as the text does in table 2.3.




                        4. Adding a Pattern Variable to the Data Set

          Another indication of a problematic missing data process is the frequent occurrence of
          the same pattern of missing data among the variables. While patterns can be detected
          by sorting and scanning the data set, this task is facilitated by the creation of a pattern
          variable. The pattern variable is a string variable containing one character for each
          variable in the data set. Each character in the pattern variable is set to a character
          indicating missing data or a character indicating valid data. To make the pattern more
          visually intuitive, the characters selected should have the same width when printed. If
          we do not use same width characters, we cannot scan down values to compare them
          because the column alignment of the characters is not the same from one value to the
          next. We will use an X for missing data and a tilde, ~, for valid data, because both are
          full width characters.

          To create the pattern variable, we first create a one-character string variable for each
          of the original variables. Then, we use the SPSS 'CONCAT' function to join the string
          variables into a single variable.
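
          The two steps can be sketched in Python as follows: map each value to one character,
          then join the characters in variable order. This is an illustration outside SPSS, not
          the CONCAT function itself; the example case and the helper name missing_pattern are
          hypothetical, and None stands in for a missing value.

```python
def missing_pattern(case, names, missing_char="X", valid_char="~"):
    """Build a pattern string with one character per variable, in the order given."""
    return "".join(missing_char if case[name] is None else valid_char for name in names)


# Hypothetical case with three variables; the walkthrough uses x1 through x10.
case = {"x1": 3.3, "x2": None, "x3": 8.6}
miss_str = missing_pattern(case, ["x1", "x2", "x3"])
print(miss_str)  # ~X~
```

          Because the pattern has one character per variable, its length always equals the
          number of variables included, which is why the string width is set to match.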




                   Recode the Original Variables into String Variables


          First, select the 'Recode | Into Different Variables…' command from the Transform
          menu.

          Second, click on the Reset button to clear the previously recoded variables.

          Third, move the variable 'Delivery Speed [x1]' to the 'Numeric Variable -> Output
          Variable' list box.

          Fourth, type the name for the new variable 'x1_x' in the 'Name:' text box on the
          'Output Variable' panel.

          Fifth, click on the Change button to move the name 'x1_x' to the 'Numeric Variable ->
          Output Variable' list box.




                         Opening the Dialog for Old and New Values




                                     To specify which old values are to be recoded
                                     into new values, click on the 'Old and New
                                     Values…' button.




                                     Add the Value for Missing Data

          First, click on the 'System- or user-missing' option button on the 'Old Value' panel.

          Second, click on the 'Output variables are strings' check box.

          Third, set the 'Width' of the output variables to 1 character.

          Fourth, type 'X' into the 'Value:' text box in the 'New Value' panel.

          Fifth, to add these value changes to the list of recodes, click on the 'Add' button.
          The change is added to the 'Old --> New:' list.




                                     Add the Value for Valid Data


          First, click on the 'All other values' option button on the 'Old Value' panel.

          Second, type '~' (a tilde) into the 'Value:' text box in the 'New Value' panel. I chose
          a tilde rather than a blank because it is easier to see.

          Third, to add these value changes to the list of recodes, click on the 'Add' button.
          The change is added to the 'Old --> New:' list.




                                 Completing the Values Dialog Box




                                     Since this is the last value
                                     specification, we click on
                                     the continue button to
                                     close the dialog box.




               Adding String Variables for the Other Original Variables


                            First, add the
                            original name and
                            the new string
                            variable name for all
                            of the other variables
                            through x10. The
                            same value changes
                            which we specified
                            for x1 will be applied
                            to these variables.      Second, click on the
                                                     OK button to complete
                                                     the recode request.




                                     The String Variables


                                     The recoded string variables for variables
                                     Delivery Speed (x1) through Satisfaction Level
                                     (x10) are added to the data editor window.




               Create the Variable Containing the Concatenated Data

          First, select the 'Compute…' command from the Transform menu to create a new variable.

          Second, after clicking on the Reset button to clear the last recoded variable, type the
          name for the new variable 'miss_str' into the 'Target Variable:' text box.

          Third, click on the 'Type & Label…' button to set the type of variable to string.

          Fourth, in the 'Type' panel mark the 'String' option button.

          Fifth, set the 'Width:' of the new variable to 10 characters, one for each of the ten
          string variables.

          Sixth, click on the 'Continue' button to close the 'Type and Label' dialog box.




                    Enter the Formula for the Concatenated Variable

          First, highlight the 'CONCAT' function in the 'Functions:' list box and move it to the
          'String Expression:' text area.

          Second, type the names of the string variables as a comma-delimited list between the
          parentheses following the CONCAT function name.

          Third, click the OK button to complete the compute variable function.




                                 The Missing Data Pattern Variable


                                     One variable now contains a string that has one
                                     character for each string variable. This variable
                                     contains the pattern of missing and valid data
                                     for each case in the data set.

                                     We have made a lot of changes to the HATMISS
                                     data set that we should save, so we click on the
                                     Save File tool.

                                     This completes the creation of the diagnostic
                                     variables we need to conduct the missing data
                                     analysis.
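
          With a pattern value available for every case, the frequency distribution of patterns
          mentioned in the overview amounts to a simple tally. The Python sketch below shows the
          idea outside SPSS; the six pattern values are hypothetical and shortened to three
          characters for readability.

```python
from collections import Counter

# Hypothetical pattern values, one per case (X = missing, ~ = valid).
patterns = ["~~~", "X~~", "~~~", "X~~", "~~X", "~~~"]

# Tally how often each pattern occurs, most frequent first.
frequencies = Counter(patterns).most_common()
for pattern, count in frequencies:
    print(pattern, count)
# ~~~ 3
# X~~ 2
# ~~X 1
```

          A pattern containing X that recurs across many cases is the kind of result that points
          to a problematic missing data process.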




        5. Removing Cases with a Large Proportion of Missing Variables

          To identify the cases that we should consider removing, we will sort the data set in
          descending order by the number of missing variables. The candidates for elimination
          will appear at the top of the data set.

          Once we have located the cases that we want to eliminate, we specify a filter condition
          to eliminate the cases from further analysis. The cases are not deleted from the data
          set, so we can include them in later analysis should we desire to do so.




                                              Sorting the Cases

          It will be easier to identify problem cases if we sort the cases by the 'num_miss'
          variable. First, select the 'Sort Cases…' command from the Data menu.

          Second, click on the 'Descending' option button in the 'Sort Order' panel so that
          the cases with the largest number of missing values appear at the top of the data
          set.

          Third, in the 'Sort Cases' dialog, move the 'num_miss' variable to the 'Sort by:'
          list box.

          Fourth, click on the OK button to sort the data set.




                               The Cases Sorted by Number Missing

                                     At the top of the sorted data set, we see the six cases which had
                                     missing values on 5, 6, or 7 of the original ten values (missing 50% or
                                     more of the data). These are the cases that will be excluded from
                                     further analysis.
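
          The sort-and-inspect step amounts to ranking cases by num_miss and checking which
          ones are missing at least half of the ten variables. A minimal Python sketch of
          that logic follows; the case ids and counts are hypothetical and are not the
          HATMISS values.

```python
N_VARIABLES = 10  # the walkthrough counts missing values over ten variables

# Hypothetical num_miss counts keyed by case id (not the HATMISS values).
num_miss = {101: 0, 102: 6, 103: 2, 104: 5, 105: 7}

# Sort case ids in descending order of num_miss.
ranked = sorted(num_miss, key=num_miss.get, reverse=True)

# Cases missing at least half of the variables are candidates for exclusion.
candidates = [cid for cid in ranked if num_miss[cid] / N_VARIABLES >= 0.5]

print(ranked)
print(candidates)
```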




                                     Excluding the Cases


          We exclude the cases with too many missing values by not selecting them for
          inclusion in later analyses. First, we select the 'Select Cases…' command from the
          Data menu.

          Second, we mark the 'If condition is satisfied' option button in the Select panel.

          Third, we click on the 'If…' button to specify the condition for inclusion.




                                           Specifying the If Condition


          First, move the 'num_miss' variable to the condition text area on the right.

          Second, complete the condition by typing '< 5' (less than 5) after the variable
          name. This 'If condition' specifies that a case will be included if the value of
          its 'num_miss' variable is less than 5, i.e. 4, 3, 2, 1, or 0. Cases that have a
          'num_miss' value of 5 or greater will not be included.

          Third, click on the Continue button to signal completion of the 'If' condition.




                                Specify Filtering for Unselected Cases

                     We have two options for
                     removing cases that do
                     not satisfy the selection
                     criteria: deletion from
                     the data set and filtering
                     from the data set.

                     Deletion physically
                     removes the cases from
                     the data set
                     permanently.

                     Filtering leaves the
                     cases in the data set,
                     but marks them for
                     exclusion from the
                     analyses. With the
                     cases still in the data
                     set, we can choose to
                     include them in a later
                     analysis.

                     First, mark the 'Filtered' option button on the 'Unselected Cases Are'
                     panel.

                     Second, click on the OK button to complete the selection process.




                                     The Data Set with Filtered Cases

                           The cases that did not meet the selection criteria are marked with a diagonal line or slash
                           through their case number. In addition, SPSS added a new variable to the data set,
                           'filter_$', which has a value of 1 if the case is included, and a value of 0 if the case is not
                           included.

                           When applying a selection criterion, it is good practice to spot check our cases to make
                           certain we specified the 'IF' condition correctly. In this problem, we see that cases 1
                           through 6, which have num_miss values greater than 4, all have a slash through their case
                           number. Cases 7 through 11, which have num_miss values less than 5, do not have a slash
                           and will still be included in the analyses.
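
          The filtering behaviour and the spot check can be mimicked in a few lines of
          Python. This is a sketch, not what SPSS does internally: the key 'filter_' is a
          stand-in for the 'filter_$' variable, and the num_miss values are hypothetical.

```python
def apply_filter(cases, max_missing=4):
    """Add a 0/1 inclusion flag to each case without removing any case."""
    for case in cases:
        # 1 = included in analyses, 0 = filtered out (same coding as 'filter_$').
        case["filter_"] = 1 if case["num_miss"] <= max_missing else 0
    return cases


# Hypothetical num_miss values, already sorted in descending order.
cases = apply_filter([{"num_miss": n} for n in (7, 6, 5, 4, 1, 0)])

# Spot check: the flag must agree with the 'num_miss < 5' condition for every case.
for case in cases:
    assert (case["num_miss"] < 5) == (case["filter_"] == 1)

print([c["filter_"] for c in cases])
```

          Because every case stays in the list and only the flag changes, the filtered
          cases can be brought back into a later analysis by ignoring the flag.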




                      6. Summary Statistics for the Unfiltered Cases

          Filtering cases with 50% or more missing data removed six cases from the data set,
          reducing our effective sample size to 64 cases. We next look at a frequency distribution
          for each variable to see if any variables have such a high proportion of missing data that
          they should be considered candidates for removal from the analysis.

          We can see the distribution of missing data on each of our variables by using the
          Frequencies command, which produces the SPSS output equivalent to Table 2.2 on page
          56 of the text. We will use a Frequencies command instead of a Descriptives command,
          because the Frequencies command will provide a count of the remaining missing cases
          for each variable.
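
          For a single variable, the summary requested here comes down to a count of valid
          cases, a count of missing cases, and the mean and standard deviation of the valid
          values. The Python sketch below illustrates that calculation; the values are
          hypothetical, None marks a missing value, and the function name summarize is
          introduced only for this example.

```python
from statistics import mean, stdev


def summarize(values):
    """Return valid count, missing count, mean, and sample standard deviation."""
    valid = [v for v in values if v is not None]
    return {
        "valid": len(valid),
        "missing": len(values) - len(valid),
        "mean": mean(valid),
        "std_dev": stdev(valid),  # sample (n - 1) standard deviation
    }


# Hypothetical values for one variable; None marks a missing value.
x1 = [4.0, None, 2.0, 6.0, None]
summary = summarize(x1)
print(summary)
```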




                              Requesting the Frequency Distributions


          First, select the 'Descriptive Statistics | Frequencies…' command from the Analyze
          menu.

          Second, move the variables Delivery Speed through Satisfaction Level (x1 through
          x10) to the 'Variable(s):' list box.

          Third, clear the check mark from the 'Display frequency tables' check box.
          Frequency tables for continuous variables would generate a large volume of output
          that we do not need.

          Fourth, click on the 'Statistics…' button to request the mean and standard
          deviation.




                                       Requesting Specific Statistics




          First, mark the check boxes for 'Mean' and 'Std. Deviation'. All other check boxes
          should be clear.

          Second, click on the Continue button to close the 'Frequencies: Statistics' dialog
          box.

          Third, when the 'Frequencies: Statistics' dialog is closed, click on the OK button
          to request the output.




                                     The Frequencies Output

                                     The frequencies table contains all of the information items in
                                     Table 2.2 of the text. The horizontal orientation of the table
                                     makes it difficult to read, so we will change its orientation.




                              Changing the Orientation of the Table

                                              First, double click on the table
                                              to activate it for editing. When
                                              the table is activated, it displays
                                              a hatched line border.




                       Second, select the
                       'Transpose Rows and
                       Columns' command
                       from the Pivot menu.




                                     The Transposed Frequencies Table



                   The number of cases in the column
                   labeled Valid are the number of
                   cases that are not missing data for
                   that variable. From studying this
                   column, we see that Delivery
                   Speed, Price Level, and Price
                   Flexibility have the lowest number
                   of valid cases, and thus the largest
                   number of missing cases. For each
                   of these variables, there are still a
                   large number of cases that do not
                   have missing data, so we would
                   not automatically eliminate these
                   variables from the analysis.

                   There is no specific number for the
                   proportion of missing cases that
                   would require the variable to be
                   eliminated. A variable that has
                   50% or more missing data would
                   not have much credibility, and
                   probably a variable with 40%
                   missing data should be eliminated.
                   However, a variable with 20 to
                   30% missing data might or might
                   not be retained depending on its
                   importance to the research
                   question. Whatever we decide
                   about missing data, we should
                   identify our decisions in the
                   research report.




                                7. Tabulating Missing Data Patterns

          In a previous exercise, Adding a Pattern Variable to the Data Set, we created a pattern
          variable that contained a single string of ten characters representing valid or missing
          data for the first ten variables in the data set. To create Table 2.4 on page 58, we run a
          frequency distribution on the pattern variable. This frequency distribution will tell us if
          there are one or two patterns of missing data that occur with sufficient frequency to
          require further investigation.
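
          Tabulating the pattern variable is a plain frequency count sorted from the most
          common pattern to the least common. A small Python sketch of the idea is shown
          below; the pattern strings are hypothetical, and the characters 'V' (valid) and
          'M' (missing) are assumed for illustration.

```python
from collections import Counter

# Hypothetical pattern strings: 'V' = valid, 'M' = missing (assumed characters).
patterns = ["VVV", "MVV", "VVV", "VVV", "MVM", "VVV", "MVV"]

# most_common() returns (pattern, count) pairs from highest count to lowest.
table = Counter(patterns).most_common()

for pattern, count in table:
    print(f"{pattern}  {count}  {100 * count / len(patterns):.1f}%")
```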




             Request a Frequency Distribution for the Pattern Variable


          First, select the 'Descriptive Statistics | Frequencies…' command from the Analyze
          menu.

          Second, in the Frequencies dialog box, move the pattern variable, 'miss_str', to the
          'Variable(s):' list box. Also be sure the 'Display frequency tables' check box is
          marked in this dialog box.

          Third, click on the 'Format…' button in the Frequencies dialog box.

          Fourth, in the 'Frequencies: Format' dialog, mark the 'Descending counts' option in
          the 'Order by' panel. This orders the frequency table from the highest count to the
          lowest.

          Fifth, click on the Continue button to close the 'Frequencies: Format' dialog.

          Sixth, click on the OK button to complete the frequency request.




                                The Frequency of Different Patterns


                       The results in the frequency
                       table show the incidence
                       of different patterns. They
                       agree with the data in
                       Table 2.4 of the text,
                       though the patterns are in a
                       different order.

                       As the text identifies, the
                       most prevalent pattern is
                       X1 missing and all other
                       variables non-missing, with a
                       frequency of 6. It is followed
                       by X1 and X3 missing, with
                       a frequency of 4. All other
                       patterns have a lower
                       frequency of occurrence.

                       This analysis tells us that
                       we do not have a single
                       missing-data pattern that
                       occurs with sufficient
                       frequency to impact the
                       statistical analysis.




   8. T-tests and Chi-square Tests for Diagnosing Randomness of Missing Data

          In previous exercises, we created dichotomous grouping variables for the variables X1
          through X10, where the grouping variable was assigned a 1 if the data was valid and a 0
          if the data was missing. We will use these grouping variables to determine whether the
          valid and missing groups differ in their relationship to other variables in the data set. If
          the missing and valid groups are statistically equivalent on other variables, then the
          missing cases can be characterized as random, and of no consequence to our analysis. If
          the missing group shows a statistically significant relationship to the other variable, it
          suggests that there is a missing data process that requires further understanding.

          The statistical tests that we use in this analysis are chi-square tests of independence, if
          the variable to be tested is nonmetric, or t-tests for two independent samples, if the
          variable to be tested is metric. The authors use the separate variance output for all t-
          tests instead of examining individual tests of homogeneity. We will follow this practice.

          When this analysis is conducted, there are usually a large number of statistical
          relationships tested. We know that using an alpha level of 0.05 in these tests implies
          that, when there is truly no relationship, we will still find a significant result in
          about one out of every twenty tests. With a large number of tests, we will get some
          statistically significant relationships even when there is no serious problem with our
          data. We are less concerned with individual test results than with the overall pattern
          of relationships.
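
          The arithmetic behind this caution can be made concrete. The Python sketch below
          computes the expected number of false positives and the chance of at least one
          false positive for a given number of tests. It assumes that every null hypothesis
          is true and that the tests are independent, which is a simplification, since
          tests run on the same cases are usually correlated.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of significant results when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha=0.05):
    """Chance of one or more false positives, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 9, 20, 90):
    print(n, round(expected_false_positives(n), 2), round(prob_at_least_one(n), 3))
```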

          NOTE. I cannot reconcile the findings on these tests to the discussion of findings on
          page 58 of the text. The statistical results are consistent with table 2.5 on page 59,
          while the text discussion appears to be a carryover from the fourth edition of the text,
          which does not contain the same statistical results as the fifth edition.




                              The Statistical Tests to Be Computed

          We will use the grouping variable 'Delivery Speed (Valid/Missing)' (X1_) to explore
          differences among the next nine variables in the data set, 'Price Level' through
          'Satisfaction Level' (X2 through X10). In each statistical test, we are testing the null
          hypothesis of no relationship associated with the grouping variable, 'Delivery Speed
          (Valid/Missing)'. If we reject the null hypothesis, we would conclude that persons who
          did not answer the question on Delivery Speed had a different pattern of responses than
          did persons who did provide Delivery Speed.

          The variable 'Firm Size' (x8) is a nonmetric variable and we will do a chi-square test of
          independence for this variable.

          The variables 'Price Level' (x2), 'Price Flexibility' (x3), 'Manufacturer Image' (x4), 'Service'
          (x5), 'Salesforce Image' (x6), 'Product Quality' (x7), 'Usage Level' (x9), and 'Satisfaction
          Level' (x10) are all metric and we will do t-tests for these variables.




                               The Chi-square Test of Independence


                                                               First, we select the 'Descriptive
                                                               Statistics | Crosstabs…' command
                                                               from the Analyze menu.




                          Second, we move the dependent
                          variable, 'Firm Size (x8)', to the
                          'Row(s)' list box.



                        Third, we move the independent,
                        or grouping, variable 'Delivery
                        Speed (Valid/Missing)' to the
                        'Column(s)' list box.




                          Fourth, we click on the
                          'Statistics…' button at the
                          bottom of the Crosstabs
                          dialog to request the
                          statistical test.




                                     Requesting the Chi-square Test




                                     First, we mark the Chi-
                                     square test check box to
                                     request the statistical
                                     test. For this problem,
                                     we clear all of the other   Second, click on the
                                     check boxes in this         Continue button to
                                     dialog box.                 complete the request
                                                                 for statistical options.




                                               Specifying Cell Contents

          First, we click on the 'Cells…' button to specify what we want in the cells of the
          crosstabs table.

          Second, we mark the check boxes for 'Observed' Counts and 'Column' Percentages. If
          any other check boxes are marked, we clear them.

          Third, click on the Continue button to conclude our specifications for cell
          contents.

          Fourth, click on the OK button in the Crosstabs dialog to request the output.




                                               Chi-square Test Results


          First, the chi-square statistical test produced a significant Sig value, so we
          reject the null hypothesis and conclude that designation of firm size was different
          for missing cases than for valid cases.

          Second, looking at the column percents in the crosstabulation table, we see that
          subjects who had a missing value for delivery speed were much more likely to be
          large firms than were subjects who had valid data for delivery speed (68.4% to
          17.8%).

          This relationship requires further consideration as a missing data process that
          could affect our analysis.
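
          For readers who want to see the calculation behind the test, the Python sketch
          below computes a Pearson chi-square statistic for a 2x2 table, together with
          column percentages. The counts are hypothetical and are not the counts behind the
          68.4% and 17.8% reported above. The sketch assumes that firm size has two
          categories, and it applies no continuity correction.

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows = firm size (small, large), columns = x1_ (missing, valid).
counts = [[5, 30], [15, 10]]
chi2, p = chi_square_2x2(counts)

# Percentage of large firms within each column, as in the column percents.
pct_large = [100 * counts[1][j] / (counts[0][j] + counts[1][j]) for j in range(2)]
print(round(chi2, 3), round(p, 5), [round(x, 1) for x in pct_large])
```

          The column percentages describe the direction of the relationship, while the
          chi-square statistic indicates whether a difference of that size is plausible
          under independence.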




                                                  Requesting the T-tests


          First, we select the 'Compare Means | Independent-Samples T Test…' command from the
          Analyze menu.

          Second, we move the variables 'Price Level' (x2), 'Price Flexibility' (x3),
          'Manufacturer Image' (x4), 'Service' (x5), 'Salesforce Image' (x6), 'Product
          Quality' (x7), 'Usage Level' (x9), and 'Satisfaction Level' (x10) to the list box
          for 'Test Variable(s):'.

          Third, we move the variable 'Delivery Speed (Valid/Missing)' to the text box for
          'Grouping Variable:'. SPSS lists the name of the variable, 'x1_'.

          Fourth, we click on the 'Define Groups…' button.




                              Specifying the Groups by Code Number

          First, we enter 0 in the 'Group 1:' text box. 0 indicates missing data on the
          original Delivery Speed variable.

          Second, we enter 1 in the 'Group 2:' text box. 1 indicates valid data on the
          original Delivery Speed variable.

          Third, we click on the Continue button to close the 'Define Groups' dialog box.

          Fourth, we note that SPSS completed the group identifiers in the 'Grouping
          Variable:' text box.

          Fifth, we click on the OK button to request the t-test results.




                                               Results of the T-tests

                               Using the 'Equal variances not assumed' rows of the table, we see that cases with missing
                               and valid data on 'Delivery Speed' differ significantly in their average scores on
                               'Manufacturer Image' and 'Service.' There is no significant difference in means for
                               'Price Level' and 'Price Flexibility.' If we scroll down the list, we find significant
                               differences for 'Usage Level' and 'Satisfaction Level' as well. These significant findings
                               reinforce the notion that 'Delivery Speed' might be involved in a missing data process
                               that requires further understanding before proceeding with the analysis.
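
                               The 'Equal variances not assumed' rows report Welch's unequal-variance t-test. As a rough
                               illustration outside SPSS, the sketch below computes the Welch t statistic and its
                               approximate degrees of freedom in plain Python. The function name welch_t and the data
                               values are invented for the example; they are not taken from the course data set.

```python
import math

def welch_t(group_a, group_b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Sample variances (n - 1 in the denominator).
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)
    # Squared standard errors of the two group means.
    se_a, se_b = var_a / n_a, var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical values; None marks a missing Delivery Speed score.
delivery_speed = [4.1, None, 1.8, None, 3.4, 2.7, None, 5.0, 3.9, None]
service = [2.4, 3.9, 2.5, 4.3, 3.0, 2.1, 3.7, 2.9, 2.6, 4.0]

# Group 1 = cases missing Delivery Speed (indicator 0),
# Group 2 = cases with valid Delivery Speed (indicator 1).
missing_group = [s for d, s in zip(delivery_speed, service) if d is None]
valid_group = [s for d, s in zip(delivery_speed, service) if d is not None]

t_stat, df = welch_t(missing_group, valid_group)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

                               The sketch stops at t and df. Obtaining the two-tailed significance that SPSS prints
                               would additionally require the t distribution, which a statistics package supplies.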




    9. The Correlation Matrix for Diagnosing Randomness of Missing Data

          To continue our missing data analysis, we run a correlation matrix for the dichotomous
          grouping variables: 'Delivery Speed (Valid/Missing)', 'Price Level (Valid/Missing)', 'Price
          Flexibility (Valid/Missing)', 'Manufacturer Image (Valid/Missing)', 'Service (Valid/Missing)',
          'Salesforce Image (Valid/Missing)', 'Product Quality (Valid/Missing)', 'Usage Level
          (Valid/Missing)', and 'Satisfaction Level (Valid/Missing)'.

          We examine the pattern of correlations to see if there are large correlations among
          multiple pairs of variables that do not have an obvious explanation. A large correlation
          between two of these variables means that cases missing one value tend to be missing
          the other as well. An obvious explanation would be that subjects only answered these
          questions if their answer to another question was some particular value, e.g., answering
          the question about job satisfaction only if they are employed.

          If there are variables that show a strong pattern of systematic missing data without an
          obvious explanation, we should evaluate the impact that this pattern has on our
          research questions, and decide whether to include, eliminate, or substitute for these
          variables.
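
          The idea behind this step can be sketched outside SPSS as follows: code each value as
          1 (valid) or 0 (missing), correlate the resulting indicators, and list the pairs with
          large correlations. This is only an illustration in plain Python. The data values, the
          pearson helper, and the 0.4 cutoff are assumptions made for the example, not part of
          the original analysis.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when an indicator never varies
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data; None marks a missing value.
data = {
    "Delivery Speed": [4.1, None, 1.8, None, 3.4, 2.7, None, 5.0],
    "Usage Level": [32, None, 43, 48, None, 45, None, 44],
    "Price Level": [0.6, None, 1.4, 1.5, 1.6, 2.8, 1.2, None],
}

# Valid/missing indicators: 1 = valid, 0 = missing.
indicators = {
    name: [0 if value is None else 1 for value in values]
    for name, values in data.items()
}

names = list(indicators)
matrix = {
    (a, b): pearson(indicators[a], indicators[b])
    for a in names
    for b in names
}

THRESHOLD = 0.4  # assumed cutoff for a "large" correlation
flagged = [
    (a, b, round(matrix[(a, b)], 3))
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(matrix[(a, b)]) >= THRESHOLD
]
print(flagged)
```

          Because each indicator takes only the values 0 and 1, the Pearson correlation computed
          here is the same quantity as the phi coefficient for the two indicators.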




                                        Requesting the Correlation Matrix

                     First, select the 'Correlate | Bivariate…' command from the
                     Analyze menu.

                     Second, move the Valid/Missing diagnostic variables for the
                     metric variables to the 'Variables:' list box.

                     Third, accept the defaults of 'Pearson' for 'Correlation
                     Coefficients', 'Two-tailed' for 'Test of significance', and
                     'Flag significant correlations'.

                     Fourth, click on the OK button to produce the correlation
                     matrix.




                                        The Correlation Matrix Output

                                     Our correlation matrix shows the same pattern as table 2.6 on page 60 of the text. As
                                     discussed there, there is only one moderate correlation in this table, between
                                     Salesforce Image and Satisfaction Level.

                                     The pattern of missing data is restricted to these two variables, so we do not have a
                                     serious problem.



