POSTED ON: 4/1/2011
SW388R7 Data Analysis & Computers II

Analyzing Missing Data Slide 1
Introduction
Problems
Using Scripts

Missing data and data analysis Slide 2
Missing data is a problem in multivariate analysis because a case will be excluded from the analysis if it is missing data for any variable included in the analysis.
If our sample is large, we may be able to allow cases to be excluded.
If our sample is small, we will try to use a substitution method so that we can retain enough cases to have sufficient power to detect effects.
In either case, we need to make certain that we understand the potential impact that missing data may have on our analysis.

Tools for evaluating missing data Slide 3
SPSS has a specific package for evaluating missing data, but it is not included under the UT license.
In place of this package, we will first examine missing data using standard SPSS statistics and procedures.
After studying the standard SPSS procedures that we can use to examine missing data, we will use an SPSS script that produces the output needed for missing data analysis without requiring us to issue all of the SPSS commands individually.

Key issues in missing data analysis Slide 4
We will focus on three key issues for evaluating missing data:
The number of cases missing per variable
The number of variables missing per case
The pattern of correlations among variables created to represent missing and valid data
Further analysis may be required depending on the problems identified in these analyses.

Problem 1 Slide 5

Identifying the number of cases in the data set Slide 6
This problem asks whether a variable is missing data for half or more of the cases. Our first task is to identify the number of cases that meets that criterion.
If we scroll to the bottom of the data set, we see that there are 270 cases in the data set. 270 ÷ 2 = 135. If any variable included in the analysis has 135 or more missing cases, the answer to the problem will be true.

Request frequency distributions Slide 7
We will use the output for frequency distributions to find the number of missing cases for each variable.
Select the Descriptive Statistics | Frequencies… command from the Analyze menu.

Completing the specification for frequencies Slide 8
First, move the five variables included in the problem statement to the list box for variables.
Second, click on the OK button to complete the request for statistical output.

Number of missing cases for each variable Slide 9
The Statistics table at the top of the Frequencies output lists the number of missing cases for each variable in the analysis.

Answering the problem Slide 10
With 270 subjects in the data set, variables missing data for 135 or more cases would correctly be characterized as missing data for half or more of the cases in the data set.
One variable was incorrectly characterized as missing half or more of the 270 cases: "self-employment" [wrkslf] was missing data for 20 of the 270 cases (7.4%).
None of the variables in this analysis was missing data for half or more of the 270 cases in the data set.
False is the correct answer.

Problem 2 Slide 11

Create a variable that counts missing data Slide 12
We want to know how many of the five variables in the analysis had missing data for each case in the data set.
We will create a variable containing this information, using an SPSS function that counts the number of variables with missing data.
To compute a new variable, select the Compute… command from the Transform menu.

Enter specifications for new variable Slide 13
First, type the name for the new variable, nmiss, in the Target variable text box.
Second, scroll down the list of functions and highlight the NMISS function.
Third, click on the up arrow button to move the NMISS function into the Numeric Expression text box.

Enter specifications for new variable Slide 14
The NMISS function is moved into the Numeric Expression text box.
To add the list of variables to count missing data for, we first highlight the first variable to include in the function, wrkstat.
Second, click on the right arrow button to move the variable name into the function arguments.

Enter specifications for new variable Slide 15
First, before we add another variable to the function, we type a comma to separate the names of the variables.
Second, to add the next variable, we highlight the second variable to include in the function, hrs1.
Third, click on the right arrow button to move the variable name into the function arguments.

Complete specifications for new variable Slide 16
Continue adding variables to the function until all of the variables specified in the problem have been added. Be sure to type a comma between the variable names.
When all of the variables have been added to the function, click on the OK button to complete the specifications.

The nmiss variable in the data editor Slide 17
If we scroll the worksheet to the right, we see the new variable that SPSS has just computed for us.

A frequency distribution for nmiss Slide 18
To answer the question of how many cases had each of the possible numbers of missing values, we create a frequency distribution.
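For readers who want to cross-check the counts outside SPSS, the two counting steps used so far (missing cases per variable for Problem 1, and missing variables per case as computed by NMISS for Problem 2) can be sketched in Python. This is only an illustration under stated assumptions: the four toy cases and their values are made up and are not the course data set, None stands in for an SPSS system- or user-missing value, and the helper names missing_per_variable and nmiss are hypothetical.

```python
from collections import Counter

# The five variables named in the problems.
VARIABLES = ["wrkstat", "hrs1", "wrkslf", "wrkgovt", "prestg80"]

# Hypothetical toy cases (not the course data set); None marks a missing value.
cases = [
    {"wrkstat": 1, "hrs1": 40, "wrkslf": 2, "wrkgovt": 2, "prestg80": 51},
    {"wrkstat": 5, "hrs1": None, "wrkslf": 2, "wrkgovt": 1, "prestg80": 36},
    {"wrkstat": 1, "hrs1": None, "wrkslf": None, "wrkgovt": None, "prestg80": None},
    {"wrkstat": 2, "hrs1": 20, "wrkslf": 1, "wrkgovt": 2, "prestg80": None},
]


def missing_per_variable(rows, names):
    """Count, for each variable, the cases whose value is missing."""
    return {name: sum(row[name] is None for row in rows) for name in names}


def nmiss(row, names):
    """Count how many of the named variables are missing for one case."""
    return sum(row[name] is None for name in names)


# Problem 1 style check: is any variable missing for half or more of the cases?
per_variable = missing_per_variable(cases, VARIABLES)
flagged = [name for name in VARIABLES if per_variable[name] >= len(cases) / 2]

# Problem 2 style check: frequency distribution of missing variables per case,
# then the number of cases missing more than half of the variables.
nmiss_distribution = Counter(nmiss(row, VARIABLES) for row in cases)
over_half = sum(
    count for n, count in nmiss_distribution.items() if n > len(VARIABLES) / 2
)

print(per_variable)
print(flagged)
print(sorted(nmiss_distribution.items()))
print(over_half)
```

With the real data set, the same two thresholds apply: 135 or more missing cases out of 270 for a variable, and 3 or more missing variables out of 5 for a case.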
Select the Descriptive Statistics | Frequencies… command from the Analyze menu.

Completing the specification for frequencies Slide 19
First, move the nmiss variable to the list of variables.
Second, click on the OK button to complete the request for statistical output.

The frequency distribution Slide 20
SPSS produces a frequency distribution for the nmiss variable.
170 cases had valid, non-missing values for all 5 variables. 85 cases had one missing value; 1 case had 2 missing values; and 14 cases had missing values for 4 variables.

Answering the problem Slide 21
The problem asked whether or not 14 cases had missing data for more than half the variables. For a set of five variables, cases that had 3, 4, or 5 missing values would meet this requirement.
The number of cases with 3 missing variables is 0 (not shown in table), with 4 missing variables is 14, and with 5 missing variables is 0, for a total of 14.
The answer to the problem is true.

Problem 3 Slide 22

Compute valid/missing dichotomous variables Slide 23
To evaluate the pattern of missing data, we need to compute dichotomous valid/missing variables for each of the five variables included in the analysis.
We will compute the new variable using the Recode command.
To create the new variable, select the Recode | Into Different Variables… command from the Transform menu.

Enter specifications for new variable Slide 24
First, move the first variable in the analysis, wrkstat, into the Numeric Variable -> Output Variable text box.
Second, type the name for the new variable into the Name text box. My convention is to add an underscore character to the end of the variable name. If this would make the variable name more than 8 characters long, delete characters from the end of the original variable name.

Enter specifications for new variable Slide 25
Next, type the label for the new variable into the Label text box. My convention is to add the phrase (Valid/Missing) to the end of the variable label for the original variable.
Finally, click on the Change button to add the name of the dichotomous variable to the Numeric Variable -> Output Variable text box.

Enter specifications for new variable Slide 26
To specify the values for the new variable, click on the Old and New Values… button.

Change the value for missing data Slide 27
The dichotomous variable should be coded 1 if the variable has a valid value, 0 if the variable has a missing value.
First, mark the System- or user-missing option button.
Second, type 0 in the Value text box.
Third, click on the Add button to include this change in the Old->New list box.

Change the value for valid data Slide 28
First, mark the All other values option button.
Second, type 1 in the Value text box.
Third, click on the Add button to include this change in the Old->New list box.

Complete the value specifications Slide 29
Having entered the values for recoding the variable into dichotomous values, we click on the Continue button to complete this dialog box.

Complete the recode specifications Slide 30
Having entered specifications for the new variable and the values for recoding the variable into dichotomous values, we click on the OK button to produce the new variable.

The dichotomous variable Slide 31
The procedure for creating a dichotomous valid/missing variable is repeated for the four other variables in the analysis: hrs1, wrkslf, wrkgovt, and prestg80.
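
The recode described above (1 for a valid value, 0 for a missing value, with an underscore added to a name kept to at most 8 characters) can be sketched in Python as a rough illustration. The example case is made up, None stands in for an SPSS missing value, and the helper names flag_name and valid_missing_flags are hypothetical; this is not the SPSS Recode command itself.

```python
VARIABLES = ["wrkstat", "hrs1", "wrkslf", "wrkgovt", "prestg80"]


def flag_name(name):
    """Add "_" to the end of the name, trimming the original to fit 8 characters."""
    return name[:7] + "_"


def valid_missing_flags(row, names):
    """Code each variable 1 if the case has a valid value, 0 if it is missing."""
    return {flag_name(name): 0 if row[name] is None else 1 for name in names}


# Hypothetical case; None marks a missing value.
case = {"wrkstat": 1, "hrs1": None, "wrkslf": 2, "wrkgovt": None, "prestg80": 36}
flags = valid_missing_flags(case, VARIABLES)

print(flags)
```

Note that prestg80 is already 8 characters long, so under the naming convention its indicator becomes prestg8_, while the shorter names simply gain the underscore.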

Filtering cases with excessive missing variables Slide 32
If we include the cases that have more than half of the variables missing, we will inflate the correlations. To prevent this, we exclude these cases before creating the correlation matrix.
We do this by selecting in, or filtering, cases that are missing fewer than half of the variables, i.e. fewer than 3 of the 5 variables.
To filter the cases included in further analysis, we choose the Select Cases… command from the Data menu.

Enter specifications for selecting cases Slide 33
First, click on the If condition is satisfied option button on the Select panel.
Second, click on the If… button to enter the criteria for including cases.

Enter specifications for selecting cases Slide 34
First, enter the criteria for including cases: nmiss < 3
Second, click on the Continue button to complete the If specification.

Complete the specifications for selecting cases Slide 35
To complete the specifications, click on the OK button.

Cases excluded from further analyses Slide 36
SPSS marks the cases that will not be included in further analyses by drawing a slash mark through the case number.
We can verify that the selection is working correctly by noting that the case which is omitted had 4 missing variables.

Correlating the dichotomous variables Slide 37
To compute a correlation matrix for the dichotomous variables, select the Correlate | Bivariate command from the Analyze menu.

Specifications for correlations Slide 38
First, move the dichotomous variables to the variables list box.
Second, click on the OK button to complete the request.
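
The filter and correlation steps above can be sketched in Python as a minimal illustration. The five toy cases are made up, only three of the five indicators are shown to keep the example short, and the helper name pearson is hypothetical. The function returns None when either indicator is constant, which corresponds to the situation in which a correlation cannot be computed; it does not compute the significance values that SPSS reports.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists; None if either is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # a constant variable has no defined correlation
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)


# Hypothetical valid/missing indicators (1 valid, 0 missing) plus nmiss,
# where nmiss was counted over all five variables in the analysis.
rows = [
    {"wrkstat_": 1, "hrs1_": 1, "wrkslf_": 1, "nmiss": 0},
    {"wrkstat_": 1, "hrs1_": 0, "wrkslf_": 1, "nmiss": 1},
    {"wrkstat_": 1, "hrs1_": 1, "wrkslf_": 0, "nmiss": 1},
    {"wrkstat_": 1, "hrs1_": 1, "wrkslf_": 1, "nmiss": 0},
    {"wrkstat_": 1, "hrs1_": 0, "wrkslf_": 0, "nmiss": 4},
]

# Keep only cases missing fewer than 3 of the 5 variables (nmiss < 3).
kept = [row for row in rows if row["nmiss"] < 3]

hrs1_flags = [row["hrs1_"] for row in kept]
wrkslf_flags = [row["wrkslf_"] for row in kept]
wrkstat_flags = [row["wrkstat_"] for row in kept]

r_hrs1_wrkslf = pearson(hrs1_flags, wrkslf_flags)
r_wrkstat_hrs1 = pearson(wrkstat_flags, hrs1_flags)  # wrkstat_ is constant here

print(len(kept), r_hrs1_wrkslf, r_wrkstat_hrs1)
```

In this toy data the wrkstat_ indicator equals 1 for every retained case, so its correlations are undefined, while the hrs1_ and wrkslf_ pair yields an ordinary coefficient.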

The correlation matrix Slide 39

Correlations (N = 256 in every cell)

Columns: (1) LABOR FRCE STATUS (Valid/Missing); (2) NUMBER OF HOURS WORKED LAST WEEK (Valid/Missing); (3) R SELF-EMP OR WORKS FOR SOMEBODY (Valid/Missing); (4) GOVT OR PRIVATE EMPLOYEE (Valid/Missing); (5) RS OCCUPATIONAL PRESTIGE SCORE (1980) (Valid/Missing)

Row                                     Statistic             (1)    (2)    (3)    (4)    (5)
(1) LABOR FRCE STATUS                   Pearson Correlation   .a     .a     .a     .a     .a
                                        Sig. (2-tailed)       .      .      .      .      .
(2) NUMBER OF HOURS WORKED LAST WEEK    Pearson Correlation   .a     1      -.049  .a     -.042
                                        Sig. (2-tailed)       .      .      .437   .      .501
(3) R SELF-EMP OR WORKS FOR SOMEBODY    Pearson Correlation   .a     -.049  1      .a     -.010
                                        Sig. (2-tailed)       .      .437   .      .      .877
(4) GOVT OR PRIVATE EMPLOYEE            Pearson Correlation   .a     .a     .a     .a     .a
                                        Sig. (2-tailed)       .      .      .      .      .
(5) RS OCCUPATIONAL PRESTIGE SCORE      Pearson Correlation   .a     -.042  -.010  .a     1
                                        Sig. (2-tailed)       .      .501   .877   .      .

a. Cannot be computed because at least one of the variables is constant.

The correlation matrix is symmetric along the diagonal (shown by the blue line). The correlation for any pair of variables is included twice in the table. So we only count the correlations below the diagonal (the cells with the yellow background).

The correlation matrix Slide 40
(Same Correlations table as on Slide 39.)
The correlations marked with footnote letter a could not be computed because one of the variables was a constant, i.e. the dichotomous variable has the same value for all cases.
This happens when one of the valid/missing variables has no missing cases, so that all of the cases have a value of 1 and none have a value of 0.

The correlation matrix Slide 41
(Same Correlations table as on Slide 39.)
In the cells for which the correlation could be computed, the probabilities indicating significance are 0.437, 0.501, and 0.877.
The correlation of -.042 between the missing/valid pair for "number of hours worked in the past week" [hrs1] and "occupational prestige score" [prestg80] was not statistically significant (p=0.501) and should not be interpreted as indicating a non-random pattern of missing data.

Answering the problem Slide 42
(Same Correlations table as on Slide 39.)
False is the correct answer.
None of the correlations among the missing/valid variables were statistically significant. The correlation matrix does not indicate a non-random pattern of missing data.
Fourteen cases were excluded from the calculations for the correlation matrix because they were missing more than half of the variables.

Using scripts Slide 43
The process of evaluating missing data requires numerous SPSS procedures and outputs that are time-consuming to produce. These procedures can be automated by creating an SPSS script. A script is a program that executes a sequence of SPSS commands.
Though writing scripts is not part of this course, we can take advantage of a script that I use to reduce the burdensome tasks of evaluating missing data.

Using a script for missing data Slide 44
The script “EvaluatingAssumptions_MissingData_Outliers_2004.SBS” will produce all of the output we have used for evaluating missing data, as well as other outputs described in the textbook.
Navigate to the link “SPSS Scripts and Syntax” on the course web page.
Download the script file “EvaluatingAssumptions_MissingData_Outliers_2004.exe” to your computer and install it, following the directions on the web page.

Open the data set in SPSS Slide 45
Before using a script, a data set should be open in the SPSS data editor.

Invoke the script Slide 46
To invoke the script, select the Run Script… command in the Utilities menu.

Select the missing data script Slide 47
First, navigate to the folder where you put the script. If you followed the directions, you will have a file with an ".SBS" extension in the C:\StudentData\SW388R7 folder.
If you only see a file with an ".EXE" extension in the folder, you should double-click on that file to extract the script file to the C:\StudentData\SW388R7 folder.
Second, click on the script name to highlight it.
Third, click on the Run button to start the script.

The script dialog Slide 48
The script dialog box acts similarly to SPSS dialog boxes. You select the variables to include in the analysis and choose options for the output.

Complete the specifications Slide 49
We accept the default option to Check missing data.
Select the variables for the analysis. This analysis uses the variables for the last problem we worked.
For the missing data check, it does not matter what role we assign to the variables.
Click on the OK button to produce the output.

The script finishes Slide 50
Since it may take a while to produce the output, and since there are times when it appears that nothing is happening, there is an alert to tell you when the script is finished.
When you see this alert, click on the OK button and view your output.
Note: the script dialog box does not close by itself. This is intentional, so that you can test assumptions or detect outliers without having to redo variable selection.

Output from the script Slide 51
The script will produce lots of output. Additional descriptive material in the titles should help link specific outputs to specific tasks.