
INTRODUCTION TO SPSS FOR WINDOWS
Version 15.0
Summer 2007

Contents

Purpose of handout & Compatibility between different versions of SPSS
SPSS windows & menus
Getting data into SPSS & Editing data
Reading an SPSS viewer/output (.spo) file & Editing your output
Saving data as an SPSS data (.sav) file
Saving your output (statistical results and graphs)
Exporting SPSS Output
Printing your work & Exiting SPSS
Running SPSS using syntax or command language (.sps files)
Creating a new variable
Recoding or combining categories of a variable

Summarizing your data
   Frequency tables (& bar charts) for categorical variables
   Contingency tables for categorical variables
   Descriptive statistics (& histograms) for numerical variables
   Descriptive statistics (& boxplots) by groups for numerical variables
   Using the Split File option for summaries by groups
   Using the Select Cases option for summaries for a subgroup of subjects/observations

Graphing your data
   Bar chart
   Histogram & Boxplot
   Normal probability plot
   Error bar plot
   Scatter plot
   Adding a line or loess smooth to a scatter plot
   Stem-and-leaf plot

Hypothesis tests & Confidence intervals
   One sample t test & Confidence interval for a mean
   Paired t test & Confidence interval for the difference between means
   Two sample t test & Confidence interval for the difference between means
   Sign test and Wilcoxon signed rank test
   Mann Whitney U test (or Wilcoxon rank sum test)
   One-way ANOVA (Analysis of variance) & Post-hoc tests
   Kruskal-Wallis test
   One-sample binomial test
   McNemar’s test
   Chi-square test for contingency tables
   Fisher’s exact test
   Trend test for contingency tables/ordinal variables
   Binomial, McNemar’s, Chi-square and Fisher’s exact tests using summary data
   Confidence interval for a proportion

Correlation & Regression
   Pearson and Spearman rank correlation coefficient
   Linear regression
   Linear regression via ANOVA commands
   Logistic regression

Purpose of handout
SPSS for Windows provides a powerful statistical and data management system in a graphical environment. The user interfaces make statistical analysis more accessible for casual users and more convenient for experienced users. Most tasks can be accomplished simply by pointing and clicking the mouse.

The objective of this handout is to get you oriented with SPSS for Windows. It teaches you how to enter and save data in SPSS, how to edit and transform data, how to explore your data by producing graphics and summary descriptives, and how to use pointing and clicking to run statistical procedures. It is also intended to serve as a reference guide for SPSS procedures that you will need to know to do your homework assignments.

Compatibility between different versions of SPSS
SPSS for Windows data files (files ending in .sav) and syntax (command) files (files ending in .sps) are compatible between different versions of SPSS (at least, versions 11.0 or newer).
However, SPSS output/viewer files (files ending in .spo) are NOT always compatible between different versions. Usually SPSS output files created with an old version can be read by a new version, but an output file created using a new version cannot be read by an old version. One option for avoiding compatibility problems between different versions of SPSS is to export your output in HTML or MS Word format. The compatibility between Windows and Mac versions of SPSS is limited.

SPSS Windows & Menus
An overview of the SPSS windows, menus, toolbars, and dialog boxes is given in the SPSS Tutorials under Help. You can also find information under Topics, Case Studies, Statistics Coach, and Command & Syntax (if you are using syntax commands).

Window Types

SPSS Data Editor. When you start an SPSS session, you usually see the Data Editor window (otherwise you will see a Viewer window). The Data Editor displays the contents of the working data file. There are two views in the Data Editor window: 1) Data View, which displays the data in a spreadsheet format with variable names listed as column headings, and 2) Variable View, which displays information about the variables in your data set. In the Data View you can edit or enter data, and in the Variable View you can change the format of a variable, add format and variable labels, etc.

SPSS Viewer/Output. Statistical results and graphs are displayed in the Viewer window. The (output) Viewer window is divided into two panes. The right-hand pane contains all the output and the left-hand pane contains a tree structure of the results. You can use the left-hand pane for navigating through, editing and printing your results.

Chart Editor. The Chart Editor is used to edit graphs. When you double-click on a figure or graph, it will reappear in a Chart Editor window.

SPSS Syntax Editor. The Syntax Editor is used to create SPSS command syntax for using the SPSS production facility.
Usually you will be using the point-and-click facilities of SPSS, and hence, you will not need to use the Syntax Editor. More information about the Syntax Editor and using the SPSS syntax is given in the SPSS Help Tutorials under Working with Syntax. A few instructions to get you started are given later in the handout in the section Running SPSS using the Syntax Editor (or Command Language).

Menus

Data Editor Menu:
File. Use the File menu to create a new SPSS file, open an existing file, or read in spreadsheet or database files created by other software programs (e.g., Excel).
Edit. Use the Edit menu to modify or copy data and output files.
View. Choose which buttons are available in the window or how the window should look.
Data. Use the Data menu to make changes to SPSS data files, such as merging files, transposing variables, or creating subsets of cases for subset analysis.
Transform. Use the Transform menu to make changes to selected variables in the data file (e.g., to recode a variable) and to compute new variables based on existing variables.
Analyze. Use the Analyze menu to select the various statistical procedures you want to use, such as descriptive statistics, cross-tabulation, hypothesis testing and regression analysis.
Graphs. Use the Graphs menu to display the data using bar charts, histograms, scatterplots, boxplots, or other graphical displays. All graphs can be customized with the Chart Editor.
Utilities. Use the Utilities menu to view variable labels for each variable.
Add-ons. Information about other SPSS software.
Window. Choose which window you want to view.
Help. Index of help topics, tutorials, SPSS home page, Statistics Coach, and version of SPSS.

Viewer Menu:
The menu is similar to the Data Editor menu, but has two additional options:
Insert. Use the Insert menu to edit your output.
Format. Use the Format menu to change the format of your output.

Chart Editor Menu: Use SPSS Help to learn more about the Chart Editor.

Toolbars
Most Windows applications provide buttons arranged along the top of a window that act as shortcuts to executing various functions. In SPSS, you will find such buttons (icons) at the top of the Data Editor, Viewer, Chart Editor, and Syntax windows. The icons are usually symbolic representations of the procedure they execute when pushed; unfortunately, their meanings are not intuitively obvious until one has already used them. Hence, the best way to learn these buttons is to use them and note what happens.

The Status Bar
The Status Bar runs along the bottom of a window and alerts the user to the status of the system. Typical messages one will see are “SPSS Processor is ready” and “Running procedure…”. The Status Bar will also provide up-to-date information concerning special manipulations of the data file, like whether only certain cases are being used in an analysis or whether the data has been weighted according to the value of some variable.

File Types
Data Files. A file with an extension of .sav is assumed to be a data file in SPSS for Windows format. A file with an extension of .por is a portable SPSS data file. The contents of a data file are displayed in the Data Editor window.
Viewer (Output) Files. A file with an extension of .spo is assumed to be a Viewer file containing statistical results and graphs.
Syntax (Command) Files. A file with an extension of .sps is assumed to be a Syntax file containing SPSS syntax and commands.

Getting Data into SPSS & Editing Data
When you read data into SPSS or edit data, the data will be displayed in the Data Editor window. An overview of the basic structure of an SPSS data file is given in the SPSS Help Tutorials:
1. Choose Help on the menu bar
2. Choose Tutorial
3. Choose Reading Data

Reading Data from an SPSS Data (.sav) File
To read a data file from your computer/floppy disk/flash drive that was created and saved using SPSS (the filename should end with the suffix .sav):
1. Choose Open an existing data source
2. Double click on the filename or
3. Single click on the filename and choose OK
Or
1. Choose Cancel
2. Choose File on the menu bar
3. Choose Open
4. Choose Data...
5. Edit the directory or disk drive to indicate where the data is located.
6. Double click on the filename or
7. Single click on the filename and choose Open

Reading Data from a Text Data File
To read a raw/text (ASCII) data file from your computer/floppy disk/flash drive, where the data for each observation is on a separate line and a space is used to separate variables on the same line (i.e., the file format is freefield). The filename should end with the suffix .dat.
1. Choose File on the menu bar
2. Choose Read Text Data
3. Choose Files of Type *.dat
4. Edit the directory or disk drive to indicate where the data is located
5. Double click on the filename or
6. Single click on the filename and choose Open
7. Follow the Import Wizard instructions.

You can also get to the Import Wizard as follows:
1. Choose File on the menu bar
2. Choose Open
3. Choose Data...
4. Choose Files of Type *.dat
5. Edit the directory or disk drive to indicate where the data is located
6. Double click on the filename or
7. Single click on the filename and choose Open
8. Follow the Import Wizard instructions.

Instructions on how to read a text data file in fixed format are located in the SPSS Help Tutorials under Reading Data from a Text File.

Reading Data from Other Types of External Files
SPSS allows you to read a variety of other types of external files, such as Excel spreadsheet files, SAS data files, Lotus 1-2-3 spreadsheet files, and dBASE database files. To read data from other types of external files, you follow the same steps as you would for reading an SPSS save file, except that you specify the file type according to what package was used to create the save file.
For further instruction on how to read data from other types of external files, see the SPSS for Windows Base System User's Guide on data files or the SPSS Help Tutorials.

Entering and Editing Data Using the Data Editor
The Data Editor provides a convenient spreadsheet-like facility for entering, editing, and displaying the contents of your data file. A Data Editor window opens automatically when you start an SPSS session. Instruction on using the Data Editor to enter data is given in the SPSS Help Tutorials. Note that if you are already familiar with entering data into a different spreadsheet program (e.g., MS Excel), you might find it easier to enter your data in the program you are familiar with and then read the data into SPSS.

Entering Data. Basic data entry in the Data Editor is simple:
Step 1. Create a new (empty) Data Editor window. At the start of an SPSS session a new (empty) Data Editor window opens automatically. During an SPSS session you can create a new Data Editor window as follows:
   1. Choose File
   2. Choose New
   3. Choose Data
Step 2. Move the cursor to the first empty column.
Step 3. Type a value into the cell. As you type, the value appears in the cell editor at the top of the Data Editor window. Each time you press the Enter key, the value is entered in the cell and you move down to the next row. By entering data in a column, you automatically create a variable and SPSS gives it the default variable name var00001.
Step 4. Choose the first cell in the next column. You can use the mouse to click on the cell or use the arrow keys on the keyboard to move to the cell. By default, SPSS names the data in the second column var00002.
Step 5. Repeat steps 3 and 4 until you have entered all the data.
If you entered an incorrect value(s) you will need to edit your data. See the following section on Editing Data.

Editing Data. With the Data Editor, you can modify a data file in many ways.
For example, you can change values, cut, copy, and paste values, or add and delete cases.

To Change a Data Value:
1. Click on a data cell. The cell value is displayed in the cell editor.
2. Type the new value. It replaces the old value in the cell editor.
3. Press the Enter key. The new value appears in the data cell.

To Cut, Copy, and Paste Data Values:
1. Select (highlight) the cell value(s) you want to cut or copy.
2. Pull down the Edit box on the main menu bar.
3. Choose Cut. The selected cell values will be copied, then deleted. Or
4. Choose Copy. The selected cell values will be copied, but not deleted.
5. Select the target cell(s) (where you want to put the cut or copied values).
6. Pull down the Edit box on the main menu bar.
7. Choose Paste. The cut or copied values will be “pasted” in the target cells.

To Delete a Case (i.e., a Row of Data):
1. Click on the case number on the left side of the row. The whole row will be highlighted.
2. Pull down the Edit box on the main menu bar.
3. Choose Clear.

To Add a Case (i.e., a Row of Data):
1. Select any cell in the row below the position where you want to insert the new case.
2. Pull down the Data box on the main menu bar.
3. Choose Insert.

Defining Variables. The default name for new variables is the prefix var and a sequential five-digit number (e.g., var00001, var00002, var00003). To change the name, format and other attributes of a variable:
1. Double click on the variable name at the top of a column or,
2. Click on the Variable View tab at the bottom of the Data Editor window.
3. Edit the variable name under the column labeled Name. The variable name must be eight characters or less in length. You can also specify the number of decimal places (under Decimals), assign a descriptive name (under Label), define missing values (under Missing), define the type of variable (under Measure; e.g., scale, ordinal, nominal), and define the values for nominal variables (under Values).
After the data is entered (or several times during data entry), you will want to save it as an SPSS save file. See the section on Saving Data As An SPSS Save File.

Reading an SPSS Viewer/Output (.spo) File
Statistical results and graphs are displayed in the Viewer window. An overview of how to use the Viewer is given in the SPSS Help Tutorials under Working with Output.

If you saved the results of a Viewer window during an earlier SPSS session, you can use the following commands to display the Viewer (output) results in a current SPSS session. However, SPSS output/viewer files (files ending in .spo) are NOT always compatible between different versions. Usually SPSS output files created with an older version can be read by a new version, but an output file created using a new version cannot be read by an older version. One option for avoiding compatibility problems between different versions of SPSS is to export your output in HTML or MS Word format. The compatibility between Windows and Mac versions of SPSS is limited.

To read a Viewer file from your computer/floppy disk/flash drive that was created and saved using SPSS (the filename should end with the suffix .spo):
1. Choose File on the menu bar
2. Choose Open
3. Choose Output...
4. Edit the directory or disk drive to indicate where the data is located
5. Double click on the filename or
6. Single click on the filename and choose Open

Editing Your Output
Editing the statistical results and graphs in the Viewer window is beyond the scope of this handout. Instructions on how to edit your output are given in the SPSS Help Tutorials under Working with Output and Creating and Editing Charts. You can use either the tree structure in the left-hand pane or the results displayed in the right-hand pane to select, move or delete parts of the output.
To edit a table or object (an object is a group of results), you first need to double click on the table/object so an “editing box” appears around the table/object, and then select the value you want to modify. An “editing box” is a ragged box outlining the table. If you only single click, you will get a box with straight/plain lines outlining the table. In general, to create “nice looking” tables of your results it is often easier to hand enter the values into a blank MS Word table than to edit an SPSS table/object (either in SPSS or MS Word).

To edit a chart, you first need to double click on the chart so it appears in a new Chart Editor window. After you are done editing the chart, close the window and then export the chart, for example to a Windows metafile and then into an MS Word file.

By default in SPSS a P-value is displayed as .000 if the P-value is less than .001. You can report the P-value as <.001, or you can have SPSS display more significant digits:
1. In an SPSS (output) Viewer window, double click (with the left mouse button) on the table containing the p-value you want to display differently. An “editing box” should appear around the table.
2. Click on the p-value using the right mouse button.
3. Choose Cell Properties. (If you do not get this option, you need to double click on the table to get the ragged box.)
4. Change the number of decimals to the desired number (default is 3).
5. Choose OK
or
6. Double click on the p-value with the left mouse button and SPSS will display the p-value with more significant digits.
If the p-value is very small, the p-value will be displayed in scientific notation (e.g., 1.745E-10 = 0.0000000001745).

Saving Data as an SPSS Data (.sav) File
To save data as a new SPSS Data file onto your computer/floppy disk/flash drive:
1. Display the Data Editor window (i.e., execute the following commands while in the Data Editor window displaying the data you want to save.)
2. Choose File on the menu bar.
3. Choose Save As...
4. Edit the directory or disk drive to indicate where the data should be saved. SPSS will automatically add the .sav suffix to the filename.
5. Choose Save

To save data changes in an existing SPSS Save file:
1. Display the Data Editor window (i.e., execute the following commands while in the Data Editor window displaying the data you want to save.)
2. Choose File on the menu bar
3. Choose Save
Caution. The Save command saves the modified data by overwriting the previous version of the file.

You can save your data in other formats besides an SPSS save file (e.g., as an ASCII file, Excel file, SAS data set). To save your data with a given format you follow the same steps as saving data in a new SPSS Save file, except that you specify the Save as Type as the desired format.

Saving Your Output (Statistical Results and Graphs)
To save the statistical results and graphs displayed in the Viewer window as a new SPSS Output file:
1. Display the Viewer window (i.e., execute the following commands while in the Viewer window displaying the results you want to save.)
2. Choose File on the menu bar.
3. Choose Save As...
4. Edit the directory or disk drive to indicate where the output should be saved. SPSS will automatically add the .spo suffix to the filename.
5. Choose Save

To save Viewer changes in an existing SPSS Output file:
1. Display the Viewer window (i.e., execute the following commands while in the Viewer window displaying the results you want to save.)
2. Choose File on the menu bar.
3. Choose Save.
Caution. The Save command saves the modified Viewer window by overwriting the previous version of the file.

Note that you will not be able to open SPSS output that was created with a newer version than the version of SPSS that you are using to open the output. Hence, you may want to avoid this problem by exporting your output in HTML or MS Word format. Also, charts often do not export properly into an HTML or Word file.
Usually you need to export charts separately into a Windows metafile (.wmf). Sometimes the output, including charts, can be copied and pasted directly into a Word file.

Exporting SPSS Output
Sometimes you will want to save your SPSS output in a different file format than an SPSS output file, because you want to avoid compatibility problems between different versions of SPSS, you want to further edit your output in a Word document, or you want to include graphs or figures in another document file. The basic steps in exporting SPSS output to another file type are, while in an SPSS (output) Viewer window:
1. Choose File
2. Choose Export
3. Choose what you want to export:
   Output Document: exports all the output
   Output Document (No Charts): exports only the numerical results
   Charts Only: exports only charts (i.e., graphs & figures)
   Note that charts often do not export properly into an HTML or Word file. Usually you need to export charts separately into a Windows metafile (.wmf).
4. Define further what you want to export:
   All Objects: this option also exports other extraneous information (rarely useful)
   All Visible Objects: use this option to export all the output.
   Selected Objects: this allows you to export only the objects you have selected in the Viewer window.
5. Choose the file type
   HTML and Word/RTF are good file types for numerical results (no charts).
   Windows Metafile (.WMF) is a good file type for charts if you want to include figures in an MS Word document.
   Note that the file type options are dependent on what you are exporting.
6. Choose the location and file name for the output you want to export.
7. Choose OK

Printing Your Work in SPSS
To print statistical results and graphs in the Viewer window or data in the Data Editor window:
1. Display the output or data you want to print (i.e., execute the following commands while in an output or data window)
2. Choose File on the menu bar.
3. Choose Print...
4. Choose All visible output or Selection (if you have selected parts of the output). When printing from a data file, the options are All, Selection and Page # to Page #.
5. Choose OK

Exiting SPSS
To exit SPSS:
1. Choose File on the menu bar
2. Choose Exit SPSS
If you have made changes to the data file or the output file since the last time you saved these files, before exiting SPSS you will be asked whether you want to save the contents of the Data Editor window and Viewer window. If you are unsure whether you want to save the contents of the data or output window, choose Cancel, then display the window(s) and, if you want to save the contents of the window, follow the instructions in this handout for saving data or output windows. SPSS will use the overwrite method when saving the contents of the window.

Running SPSS using Syntax (or Command Language)
This handout describes how to run various statistical summaries and procedures using the point-and-click menus in SPSS. However, it is possible to run SPSS commands using SPSS syntax/command language. If you are running similar analyses repeatedly, it can be more efficient to run your analysis using SPSS syntax. How to run SPSS using the syntax/command language is beyond the scope of this handout. Help on running SPSS using the syntax/command language can be found in the SPSS Tutorials under Working with Syntax.

To get you started using SPSS syntax, follow the point-and-click instructions for running a particular analysis, but select Paste instead of OK at the last step. An SPSS Syntax Editor window will open containing the SPSS syntax for running the analysis. To run the analysis you can choose Run on the menu bar, or you can highlight the syntax you want to run, click the right mouse button, and select Run Current. You can add more syntax to the Syntax Editor window by using the point-and-click method, selecting Paste instead of OK at the last step.
The additional syntax will be added at the bottom of the Syntax Editor window. You can also write syntax directly into the syntax file and/or use copy, paste and editing commands to modify the syntax. Remember to save your syntax file before exiting SPSS. The file should end in .sps. You can open a syntax file by selecting File on the menu bar, Open, and then Syntax…

Here’s an example of SPSS syntax. This syntax runs a two sample test comparing HDL cholesterol (hdl) for subjects without and with a family history of heart attack (fhha, coded 0 for no and 1 for yes). This syntax creates 3 indicator variables, neversmoke, formersmoke, and currentsmoke, for smoking status (smoke). Note that a period (.) is used to denote the end of a string of syntax and Execute. is sometimes required to run the syntax.

Creating a New Variable
To create a new variable:
1. Display the Data Editor window (i.e., execute the following commands while in the Data Editor window displaying the data file you want to use to create a new variable).
2. Choose Transform on the menu bar
3. Choose Compute...
4. Enter the new variable name in the Target Variable box.
5. Enter the definition of the new variable in the Numeric Expression box (e.g., SQRT(visan), LN(age), or MEAN(age)) or
6. Select variable(s) and combine with desired arithmetic operations and/or functions.
7. Choose OK
After creating a new variable(s), you will probably want to save the new variable(s) by re-saving your data using the Save command under File on the menu bar (See Saving Data as an SPSS Save File). Further instructions on creating a new variable are given in the SPSS Help Tutorials under Modifying Data Values.

Example: Creating a (New) Transformed Variable
You can use the SPSS commands for creating a new variable to create a transformed variable.
Suppose you have a variable indicating triglyceride level, trig, and you want to transform this variable using the natural logarithm to make the distribution less skewed (i.e., you want to create a new variable which is the natural logarithm of triglyceride levels).
1. Display the Data Editor window
2. Choose Transform on the menu bar
3. Choose Compute...
4. Enter, say, lntrig, in the Target Variable box.
5. Enter Ln(trig) in the Numeric Expression box.
6. Choose OK
Now, a new variable, lntrig, which is the natural logarithm of trig, will be added to your data set. Remember to save your data set before exiting SPSS (e.g., while in the SPSS Data window, choose Save under File or click on the floppy disk icon).

Recoding or Combining Categories of a Variable
To recode or combine categories of a variable:
1. Display the Data Editor window (i.e., execute the following commands while in the Data Editor window displaying the data file you want to use to recode variables).
2. Choose Transform on the menu bar
3. Choose Recode
4. Choose Into Same Variable... or Into Different Variable...
5. Select a variable to recode from the variable list on the left and then click on the arrow located in the middle of the window. This defines the input variable.
6. If recoding into a different variable, enter the new variable name in the box under Name:, then choose Change. This defines the output variable.
7. Choose Old and New Values...
8. Choose Value or Range under Old Value and enter old value(s).
9. Choose New Value and enter new value, then choose Add.
10. Repeat the process until all old values have been redefined.
11. Choose Continue
12. Choose OK
After creating a new variable(s), you will probably want to save the new variable(s) by re-saving your data using the Save command under File on the menu bar (See Saving Data as an SPSS Save File).
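
The Old and New Values step above boils down to a list of "old value becomes new value" rules, with any value not covered by a rule either copied or set to some default. As a rough illustration outside SPSS, here is a minimal Python sketch of that logic. The function name recode and the sample codes are made up for this illustration; they are not SPSS commands or data from this handout.

```python
def recode(values, rules, copy_others=True, default=None):
    """Apply old -> new recoding rules to a list of codes.

    rules is a dict mapping an old value to its new value (hypothetical
    helper for illustration only). Values with no rule are copied
    unchanged when copy_others is True; otherwise they become default.
    """
    recoded = []
    for value in values:
        if value in rules:
            recoded.append(rules[value])
        elif copy_others:
            recoded.append(value)
        else:
            recoded.append(default)
    return recoded


# Made-up codes: change code 1 to 4 and copy every other value.
original = [1, 2, 3, 1, 2]
recoded = recode(original, {1: 4})
print(recoded)
```

Because the function returns a new list, the original codes are left untouched, which mirrors the reason for recoding into a different variable rather than into the same one.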

Example: Recoding a Categorical Variable
You can use the commands for recoding a variable to change the coding values of a categorical variable. You may want to change a coding value for a particular category to modify which category SPSS uses as the referent category in a statistical procedure. For example, suppose you want to perform linear regression using the ANOVA (or General Linear Model) commands, and one of your independent variables is smoking status, smoke, that is coded 1 for never smoked, 2 for former smoker and 3 for current smoker. By default SPSS will use current smoker as the referent category because current smoker has the largest numerical (code) value. If you want never smoked to be the referent category you need to recode the value for never smoked to a value larger than 3. Although you can recode the smoking status into the same variable, it is better to recode the variable into a new/different variable, newsmoke, so you do not lose your original data if you make an error while recoding.
1. Display the Data Editor window
2. Choose Transform
3. Choose Recode
4. Choose Into Different Variables...
5. Select the variable smoke as the Input variable
6. Enter newsmoke as the name of the Output variable, and then choose Change.
7. Choose Old and New Values...
8. Choose Value under Old Value. (It may already be selected.)
9. Enter 1 (code for never smoker)
10. Choose Value under New Value. (It may already be selected.)
11. Enter 4 (or any value greater than 3)
12. Choose Add
13. Choose All Other Values under Old Value.
14. Choose Copy Old Value(s) under New Value.
15. Choose Add
16. Choose Continue
17. Choose OK
Remember to save your data set before exiting SPSS.

Example: Creating Indicator or Dummy Variables
You can use the commands for recoding a variable to create indicator or dummy variables in SPSS. Suppose you have a variable indicating smoking status, smoke, that is coded 1 for never smoked, 2 for former smoker and 3 for current smoker.
To create three new indicator or dummy variables for never, former and current smoking:
1. Display the Data Editor window
2. Choose Transform
3. Choose Recode
4. Choose Into Different Variables...
5. Select the variable smoke as the Input variable
6. Enter neversmoke as the name of the Output variable, and then choose Change.
7. Choose Old and New Values...
8. Choose Value under Old Value. (It may already be selected.)
9. Enter 1 (code value for never smoker)
10. Choose Value under New Value. (It may already be selected.)
11. Enter 1 (to indicate never smoker)
12. Choose Add
13. Choose All Other Values under Old Value.
14. Choose Value under New Value.
15. Enter 0
16. Choose Add
17. Choose Continue
18. Choose OK
Now, you have created a binary indicator variable for never smoker (coded 1 if never smoker, 0 if former or current smoker). Next, create a binary indicator variable for former smoker.
1. Display the Data Editor window
2. Choose Transform
3. Choose Recode
4. Choose Into Different Variables...
5. Select the variable smoke as the Input variable
6. Enter formersmoke as the name of the Output variable, and then choose Change. (Or change (edit) never to former, and then choose Change).
7. Choose Old and New Values...
8. Choose 1→1 under Old→New and then choose Remove.
9. Choose Value under Old Value.
10. Enter 2 (code value for former smoker)
11. Choose Value under New Value.
12. Enter 1 (to indicate former smoker)
13. Choose Add
14. Choose Continue
15. Choose OK
Now, you have created a binary indicator variable for former smoker (coded 1 if former smoker, 0 if never or current smoker). To create a binary indicator variable for current smoker you would use similar commands to those for creating the indicator variable for former smoker, except that now the value of 3 for smoke is coded as 1 and all other values are coded as 0.
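
All three indicator variables described above follow the same rule: the indicator is 1 when smoke equals that category's code and 0 otherwise. The following minimal Python sketch illustrates the rule outside SPSS. The smoke values are made-up data, the helper name indicator is hypothetical, and the variable names simply mirror those used in this handout.

```python
def indicator(values, code):
    """Return 1 where a value equals code and 0 elsewhere."""
    return [1 if value == code else 0 for value in values]


# Made-up smoking status codes: 1 = never, 2 = former, 3 = current.
smoke = [1, 2, 3, 2, 1, 1]

neversmoke = indicator(smoke, 1)
formersmoke = indicator(smoke, 2)
currentsmoke = indicator(smoke, 3)

# Each subject belongs to exactly one category, so the three
# indicators should sum to 1 for every subject.
row_sums = [n + f + c for n, f, c in zip(neversmoke, formersmoke, currentsmoke)]
print(neversmoke, formersmoke, currentsmoke, row_sums)
```

Checking that the indicators sum to 1 for every subject is a quick way to catch a recoding mistake, such as a leftover 1→1 rule from a previous variable.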
Example: Creating a Categorical Variable From a Numerical Variable

You can use the commands for recoding a variable to create a categorical variable from a numerical variable (i.e., group values of the numerical variable into categories). For example, suppose you have a variable that is the number of pack years smoked, packyrs, and you want to create a categorical variable with the four categories 0, >0 to 10, >10 to 30, and >30 pack years smoked.

1. Display the Data Editor window
2. Choose Transform
3. Choose Recode
4. Choose Into Different Variables...
5. Select the variable packyrs as the Input variable
6. Enter a name for the new variable, packcat, for the Output variable, and then choose Change.
7. Choose Old and New Values...
8. Choose Value under Old Value. (It may already be selected.)
9. Enter 0
10. Choose Value under New Value.
11. Enter 0 (to indicate 0 pack years)
12. Choose Add
13. Choose Range under Old Value.
14. Enter 0.01 and 10 in the two blank boxes.
15. Choose Value under New Value
16. Enter 1 (to indicate >0 to 10 pack years)
17. Choose Add
18. Choose Range under Old Value.
19. Enter 10.01 and 30 in the two blank boxes.
20. Choose Value under New Value
21. Enter 2 (to indicate >10 to 30 pack years)
22. Choose Add
23. Choose Range, value through HIGHEST under Old Value.
24. Enter 30.01 in the blank box.
25. Choose Value under New Value
26. Enter 3 (to indicate >30 pack years)
27. Choose Add
28. Choose Continue
29. Choose OK

These ranges assume that pack years are recorded to no more than two decimal places. A value that falls between two ranges (e.g., 10.005) matches none of the rules and is set to missing in the new variable.

Note that you may want to use different coding values depending on which category you want to be used as the referent category in certain statistical procedures.

Remember to save your data set before exiting SPSS.

Summarizing Your Data

Frequency Tables (& Bar Charts) for Categorical Variables

To produce frequency tables and bar charts for categorical variables:

1. Choose Analyze from the menu bar
2. Choose Descriptive Statistics
3. Choose Frequencies...
4. Variable(s): To select the variables you want from the source list on the left, highlight a variable by pointing and clicking the mouse and then click on the arrow located in the middle of the window. Repeat the process until you have selected all the variables you want.
5. Choose Charts (Skip to step 8 if you do not want bar charts.)
6. Choose Bar Chart(s)
7. Choose Continue
8. Choose OK

Example: Frequency table and bar chart for the categorical variable, smoking status. Smoking status is the selected variable and Bar charts under Charts... has been selected.

Frequency table and bar chart of smoking status

Smoking status
            Frequency   Percent   Valid Percent   Cumulative Percent
never             590      59.0            59.0                 59.0
former            293      29.3            29.3                 88.3
current           117      11.7            11.7                100.0
Total            1000     100.0           100.0

[Bar chart: percent (0 to 60) by smoking status (never, former, current)]

Contingency Tables for Categorical Variables

To produce contingency tables for categorical variables:

1. Choose Analyze from the menu bar.
2. Choose Descriptive Statistics
3. Choose Crosstabs...
4. Row(s): Select the row variable you want from the source list on the left and then click on the arrow located next to the Row(s) box. Repeat the process until you have selected all the row variables you want.
5. Column(s): Select the column variable you want from the source list on the left and then click on the arrow located next to the Column(s) box. Repeat the process until you have selected all the column variables you want.
6. Choose Cells...
7. Choose the cell values (e.g., observed counts; row, column, and margin (total) percentages). Note that an option is selected when its little box is not empty.
8. Choose Continue
9. Choose OK

Example: Contingency table of smoking status by coronary heart disease (CHD). Smoking status is the row variable and CHD is the column variable. Observed counts and row percentages will be displayed.
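A row percentage is simply a cell count divided by its row total. The following minimal Python sketch (not SPSS syntax; the function name row_percentages is hypothetical) shows the arithmetic using the observed counts from the SPSS crosstabulation for this example.

```python
# Observed counts from the smoking status by incident CHD example,
# given as (no CHD, CHD) for each smoking category.
counts = {
    "never": (537, 53),
    "former": (257, 36),
    "current": (106, 11),
}


def row_percentages(table):
    """Return each row's counts as percentages of that row's total (1 decimal)."""
    result = {}
    for label, row in table.items():
        total = sum(row)
        result[label] = tuple(round(100 * c / total, 1) for c in row)
    return result


row_pct = row_percentages(counts)
for label, (pct_no, pct_yes) in row_pct.items():
    print(f"{label:8s} no CHD: {pct_no}%   CHD: {pct_yes}%")
```

Column percentages would be computed the same way, dividing by column totals instead of row totals.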
Smoking status * Incident CHD Crosstabulation

                                          Incident CHD
Smoking status                            no        yes       Total
never     Count                           537       53        590
          % within Smoking status         91.0%     9.0%      100.0%
former    Count                           257       36        293
          % within Smoking status         87.7%     12.3%     100.0%
current   Count                           106       11        117
          % within Smoking status         90.6%     9.4%      100.0%
Total     Count                           900       100       1000
          % within Smoking status         90.0%     10.0%     100.0%

Descriptive Statistics (& Histograms) for Numerical Variables

To produce descriptive statistics and histograms for numerical variables:

1. Choose Analyze on the menu bar
2. Choose Descriptive Statistics
3. Choose Frequencies...
4. Variable(s): To select the variables you want from the source list on the left, highlight a variable by pointing and clicking the mouse and then click on the arrow located in the middle of the window. Repeat the process until you have selected all the variables you want.
5. Choose Display frequency tables to turn off the option. Note that the option is turned off when the little box is empty.
6. Choose Statistics
7. Choose summary measures (e.g., mean, median, standard deviation, minimum, maximum, skewness or kurtosis).
8. Choose Continue
9. Choose Charts (Skip to step 12 if you do not want histograms.)
10. Choose Histogram(s)
11. Choose Continue
12. Choose OK

An alternate way to produce only the descriptive statistics is to choose Descriptives... instead of Frequencies... at step 3, and then select the variables you want. By default SPSS computes the mean, standard deviation, minimum and maximum. Choose Options... to select other summary measures.

Example: Descriptive summaries and histogram for the numerical variable age. Age is the variable to summarize. You can select more than one variable to analyze. Remember to turn off the Display frequency tables option. Mean, standard deviation, minimum and maximum were selected under Statistics..., and histogram was selected under Charts...

Summaries for Age

Statistics: Age
N   Valid            1000
    Missing             0
Mean                72.14
Std. Deviation      5.275
Minimum                65
Maximum                90

[Histogram of Age: frequency (0 to 120) by age (60 to 95); Mean = 72.14, Std. Dev. = 5.275, N = 1,000]

Descriptive Statistics (& Boxplots) by Groups for Numerical Variables

To produce descriptive statistics and boxplots by groups for numerical variables:

1. Choose Analyze on the menu bar
2. Choose Descriptive Statistics
3. Choose Explore...
4. Dependent List: To select the variables you want to summarize from the source list on the left, highlight a variable by pointing and clicking the mouse and then click on the arrow located next to the dependent list box. Repeat the process until you have selected all the variables you want.
5. Factor List: To select the variables you want to use to define the groups from the source list on the left, highlight a variable by pointing and clicking the mouse and then click on the arrow located next to the factor list box.
6. Choose Plots... (If you do not want boxplots, choose Statistics for the Display option and skip to step 11.)
7. Choose Factor levels together from the Boxplot box.
8. Select the Stem-and-leaf option in the Descriptive box to turn off the option.
9. Choose Continue
10. Choose Both for the Display option
11. Choose OK

Example: Total cholesterol by family history of heart attack (yes or no). In this example total cholesterol is the dependent variable. You can select more than one variable. Summaries will be computed for each group defined by family history of heart attack. Both numerical summaries (statistics) and plots are selected. Under Statistics... Descriptives is usually selected by default. Under Plots... select the Boxplot option and unselect Stem-and-leaf.

Descriptives: Total cholesterol by Family history of heart attack

Family history
of heart attack                                           Statistic   Std. Error
no    Mean                                                   221.93        1.417
      95% Confidence Interval for Mean   Lower Bound         219.15
                                         Upper Bound         224.72
      5% Trimmed Mean                                        221.63
      Median                                                 219.76
      Variance                                             1350.641
      Std. Deviation                                         36.751
      Minimum                                                   111
      Maximum                                                   363
      Range                                                     252
      Interquartile Range                                        49
      Skewness                                                 .184         .094
      Kurtosis                                                 .363         .188
yes   Mean                                                   220.53        2.150
      95% Confidence Interval for Mean   Lower Bound         216.30
                                         Upper Bound         224.76

The Explore command by default produces a lot of different summaries, so you need to select what to report. All summaries are shown for all groups; the table has been cropped in this example.

[Boxplot of Total Cholesterol by Family History of Heart Attack: total cholesterol (100 to 400) for the no and yes groups; outlying observations are labeled with their case numbers]

Using the Split File Option for Summaries by Groups for Categorical and Numerical Variables

The Split File option in SPSS is a convenient way to produce summaries and graphs, and to run statistical procedures, by groups. To activate the option:

1. Choose Data on the menu bar of the Data Editor window
2. Choose Split File
3. Choose Compare groups or Organize output by groups. The two options display the output differently. Try each option to see which works best for your needs.
4. Choose the variable that defines the groups.
5. Choose OK

Now all the summaries, graphs, and statistical procedures you request will be done (automatically) for each group. To turn off this option:

1. Choose Data on the menu bar of the Data Editor window
2. Choose Split File
3. Choose Analyze all cases, do not create groups
4. Choose OK

Example: Use the Split File option to run summaries by family history of heart attack (yes or no). The Compare groups option will try to display the results for each group side by side when feasible. The Organize output by groups option will display the results separately for each group, starting with the group with the lowest numerical code value.
Using the Select Cases Option for Summaries for a Subgroup of Subjects/Observations

The Select Cases option in SPSS is a convenient way to produce summaries and run statistical procedures for a subgroup of subjects, or to temporarily exclude subjects from the analysis. To activate this option:

1. Choose Data on the menu bar of the Data Editor window
2. Choose Select Cases...
3. Choose If condition is satisfied
4. Choose If...
5. Enter the expression that indicates the subjects/observations you want to select.
6. Choose Continue
7. Choose OK

Now all the summaries, graphs, and statistical procedures you request will be done using only the selected subjects/observations. To turn off this option:

1. Choose Data on the menu bar of the Data Editor window
2. Choose Select Cases...
3. Choose All cases
4. Choose OK

Example: Select subjects not on lipid lowering medications (i.e., subjects with lipid = 0, indicating no medications). Select If condition is satisfied and then If...

Caution! Usually you do not want to delete observations from your dataset, so do not select the option that deletes the unselected cases.

Typical expressions will involve combinations of the following symbols:

Symbol   Definition
=        equal
~=       not equal
>=       greater than or equal
<=       less than or equal
>        greater than
<        less than
&        and
|        or

Graphing Your Data

You can produce very fancy figures and graphs in SPSS, but doing so is beyond the scope of this handout. Instructions on producing figures and graphs can be found in SPSS Help under Topics → Contents → Chart Galleries, Standard Charts, and Chart Editor, as well as in the SPSS Tutorials under Creating and Editing Charts. The commands for making charts are located under Graphs (and then Legacy Dialogs, if using Version 15) on the menu bar. The commands for making simple figures and graphs are relatively easy to use, and some instruction is given below.
The Interactive option under Graphs is another way to produce charts in SPSS interactively, as well as fancier versions of the basic charts (e.g., 3-dimensional bar charts).

Bar Charts

The easiest way to produce simple bar charts is to use the Bar Chart option with the Frequencies... command. See Frequency Tables (& Bar Charts) for Categorical Variables. You can produce only one bar chart at a time using the Bar command.

1. Choose Graphs (& then Legacy Dialogs, if Version 15) from the menu bar.
2. Choose Bar...
3. Choose Simple, Clustered, or Stacked
4. Choose what the data in the bar chart represent (e.g., summaries for groups of cases).
5. Choose Define
6. Select a variable from the variable list on the left and then click on the arrow next to the Category axis.
7. Choose what the bars represent (e.g., number of cases or percentage of cases)
8. Choose OK

[Bar charts: percent (0% to 60%) by smoking status (never, former, current); legend: family history of heart attack (no, yes)]

Histograms

The easiest way to produce simple histograms is to use the Histogram option with the Frequencies... command. See Descriptive Statistics (& Histograms) for Numerical Variables. You can produce only one histogram at a time using the Histogram command.

1. Choose Graphs (& then Legacy Dialogs, if Version 15) from the menu bar
2. Choose Histogram...
3. Select a variable from the variable list on the left and then click on the arrow in the middle of the window.
4. Choose Display normal curve if you want a normal curve superimposed on the histogram.
5. Choose OK

[Histogram of body mass index: frequency (0 to 120) by body mass index (10 to 50); Mean = 26.2366, Std. Dev. = 4.8667, N = 1,000]

Boxplots

The easiest way to produce simple boxplots is to use the Boxplot option with the Explore... command. See Descriptive Statistics (& Boxplots) by Groups for Numerical Variables.
You can produce only one boxplot at a time using the Boxplot command.

1. Choose Graphs (& then Legacy Dialogs, if Version 15) from the menu bar.
2. Choose Boxplot...
3. Choose Simple or Clustered
4. Choose what the data in the boxplots represent (e.g., summaries for groups of cases).
5. Choose Define
6. Select a variable from the variable list on the left and then click on the arrow next to the Variable box.
7. Select the variable from the variable list that defines the groups and then click on the arrow next to Category Axis.
8. Choose OK

[Boxplot of serum fasting glucose (0 to 400) by ADA diabetes status (normal, impaired fasting glucose, diabetic); outlying observations are labeled with their case numbers]

Normal Probability Plots

To produce Normal probability plots:

1. Choose Graphs (& then Legacy Dialogs, if Version 15) from the menu bar.
2. Choose Q-Q... to get a plot of the quantiles (Q-Q plot) or choose P-P... to get a plot of the cumulative proportions (P-P plot)
3. Select the variables from the source list on the left and then click on the arrow located in the middle of the window.
4. Choose Normal as the Test Distribution. The Normal distribution is the default Test Distribution. Other Test Distributions can be selected by clicking on the down arrow and clicking on the desired Test Distribution.
5. Choose OK

SPSS will produce both a Normal probability plot and a detrended Normal probability plot for each selected variable. Usually the Q-Q plot is the most useful for assessing whether the distribution of the variable is approximately Normal.

[Normal Q-Q plots of serum fasting glucose and of body mass index: expected Normal value against observed value]

Error Bar Plot

To produce an error bar plot of the mean of a numerical variable (or the means for different groups of subjects):

1. Choose Graphs (& then Legacy Dialogs, if Version 15) from the menu bar.
2. Choose Error Bar...
3. Choose Simple or Clustered
4. Choose what the data in the error bars represent (e.g., summaries for groups of cases).
5. Choose Define
6. Select a variable from the variable list on the left and then click on the arrow next to the Variable box.
7. Select the variable from the variable list that defines the groups and then click on the arrow next to Category Axis.
8. Select what the bars represent (e.g., confidence interval, ±standard deviation, ±standard error of the mean)
9. Choose OK

[Error bar plot: mean +- 2 SD of serum fasting glucose (50 to 300) by ADA diabetes status (normal, impaired fasting glucose, diabetic)]

A bar chart of the mean with error bars can be made using the commands for making a bar chart.

[Bar chart: mean serum fasting glucose (0 to 300) by ADA diabetes status (normal, impaired fasting glucose, diabetic); error bars: +/- 2 SD]

Scatter Plot

To produce a scatter plot between two numerical variables:

1. Choose Graphs (& then Legacy Dialogs, if Version 15) on the menu bar.
2. Choose Scatter/Dot...
3. Choose Simple
4. Choose Define
5. Y Axis: Select the y variable you want from the source list on the left and then click on the arrow next to the y axis box.
6. X Axis: Select the x variable you want from the source list on the left and then click on the arrow next to the x axis box.
7. Choose Titles...
8. Enter a title for the plot (e.g., y vs. x).
9. Choose Continue
10. Choose OK

[Scatter plot titled "HDL cholesterol vs BMI": HDL cholesterol (0 to 140) against body mass index (10 to 50)]

Adding a Linear Regression Line to a Scatter Plot

To add a linear regression (least-squares) line to a scatter plot of two numerical variables:

1. While in the Viewer window double click on the scatter plot. The scatter plot should now be displayed in a window titled Chart Editor.
2. Choose Elements.
3. Choose Fit Line at Total. (A line should be added to the plot, because the next 2 steps are the default options.)
4. Choose Linear (in the Properties window)
5. Choose Apply (in the Properties window). Additional options:
   o Choose Mean under Confidence Intervals (in the Properties window) to add a confidence interval for the linear regression line to the scatter plot, or
   o Choose Individual under Confidence Intervals to add a prediction interval for individual observations to the scatter plot.
6. Click on the "X" in the upper right hand corner of the Chart Editor window, or choose File and then Close, to return to the Viewer window.

[Scatter plot titled "HDL cholesterol vs BMI" with fitted regression line: HDL cholesterol (0 to 140) against body mass index (10 to 50); R Sq Linear = 0.121]

Adding a Loess (Scatter Plot) Smooth to a Scatter Plot

To add a Loess smooth to a scatter plot of two numerical variables:

1. While in the Viewer window double click on the scatter plot. The scatter plot should now be displayed in a window titled Chart Editor.
2. Choose Elements.
3. Choose Fit Line at Total.
4. Choose Loess (in the Properties window). The default options for % of points to fit (50%) and kernel (Epanechnikov) are usually the most appropriate.
5. Choose Apply (in the Properties window). If a line was added to the plot in step 3, it will be replaced by the loess smooth.
6. Click on the "X" in the upper right hand corner of the Chart Editor window, or choose File and then Close, to return to the Viewer window.

[Scatter plot titled "HDL cholesterol vs BMI" with loess smooth: HDL cholesterol (0 to 140) against body mass index (10 to 50)]

Stem-and-leaf Plot

To produce a stem-and-leaf plot:

1. Choose Analyze on the menu bar
2. Choose Descriptive Statistics
3. Choose Explore...
4. Dependent List: To select the variables you want from the source list on the left, highlight a variable by pointing and clicking the mouse and then click on the arrow located next to the dependent list box. Repeat the process until you have selected all the variables you want.
5. Choose Plots...
6. Choose Stem-and-leaf from the Descriptive box. Note that the option is already selected if the little box is not empty.
7. Choose None from the Boxplot box
8. Choose Continue
9. Choose Plots for the Display option
10. Choose OK

Severity of Illness Index Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
     2.00        4 .  34
     7.00        4 .  6688899
    10.00        5 .  0001112344
     3.00        5 .  568
     1.00 Extremes    (>=62)

 Stem width:   10.00
 Each leaf:    1 case(s)

Hypothesis Tests & Confidence Intervals

One-Sample t Test

1. Choose Analyze from the menu bar.
2. Choose Compare Means
3. Choose One-Sample T Test...
4. Test Variable(s): Select the variables you want from the source list on the left by pointing and clicking the mouse, and then click on the arrow located in the middle of the window.
5. Edit the Test Value. The Test Value is the value of the mean under the null hypothesis. The default value is zero.
6. Choose OK

Confidence Interval for a Mean (from one sample of data)

1. Choose Analyze from the menu bar.
2. Choose Compare Means
3. Choose One-Sample T Test...
4. Test Variable(s): Select the variables you want from the source list on the left by pointing and clicking the mouse, and then click on the arrow located in the middle of the window.
5. The Test Value should be 0, which is the default value.
6. By default a 95% confidence interval will be computed. Choose Options... to change the confidence level.
7. Choose OK

SIDS Example. There were 48 SIDS cases in King County, Washington, during the years 1974 and 1975. The birth weights (in grams) of these 48 cases were:

2466 3941 2807 3118 2098 3175
3317 3742 3062 3033 2353 3515
2013 3515 3260 2892 1616 4423
2750 2807 2807 3005 3374 3572
2722 2495 3459 3374 1984 2495
3005 2608 2353 4394 3232 3062
2013 2551 2977 3118 2637 1503
2722 2863 2013 3232 2863 2438

The mean (and standard deviation) of these measurements is 2891 (623) grams.

We want to know if the mean birth weight in the population of SIDS infants is different from that of normal children, 3300 grams.
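The arithmetic behind this comparison can be sketched by hand before turning to SPSS. The Python below is a cross-check only, not part of SPSS: it computes the sample mean, standard deviation and standard error from the 48 birth weights listed above, the t statistic for a null mean of 3300 grams, and a 95% confidence interval for the mean. The critical value 2.0117 (the 97.5th percentile of the t distribution with 47 degrees of freedom) is taken from standard t tables and is an assumption of this sketch rather than something the code computes.

```python
import math
import statistics

# Birth weights (grams) of the 48 SIDS cases listed above.
weights = [
    2466, 3941, 2807, 3118, 2098, 3175,
    3317, 3742, 3062, 3033, 2353, 3515,
    2013, 3515, 3260, 2892, 1616, 4423,
    2750, 2807, 2807, 3005, 3374, 3572,
    2722, 2495, 3459, 3374, 1984, 2495,
    3005, 2608, 2353, 4394, 3232, 3062,
    2013, 2551, 2977, 3118, 2637, 1503,
    2722, 2863, 2013, 3232, 2863, 2438,
]

n = len(weights)
mean = statistics.mean(weights)
sd = statistics.stdev(weights)      # sample standard deviation (divides by n - 1)
se = sd / math.sqrt(n)              # standard error of the mean

# One-sample t statistic for the null hypothesis that the mean is 3300 grams.
mu0 = 3300
t_stat = (mean - mu0) / se
df = n - 1

# 95% confidence interval for the mean.
# Assumed critical value: t(0.975, df = 47) from a t table.
T_CRIT_47 = 2.0117
ci = (mean - T_CRIT_47 * se, mean + T_CRIT_47 * se)

print(f"n = {n}, mean = {mean:.3f}, sd = {sd:.3f}, se = {se:.3f}")
print(f"t = {t_stat:.3f} on {df} df")
print(f"95% CI for the mean: {ci[0]:.0f} to {ci[1]:.0f} grams")
```

The p-value is not computed here because the Python standard library has no t distribution; SPSS reports it as Sig. (2-tailed).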
We could construct a 95% confidence interval to see if the interval contains the value of 3300 grams, or we could perform a one-sample t test to test if the mean in the SIDS population is equal to 3300 (versus not equal to 3300).

To construct a 95% confidence interval

When computing the confidence interval for a mean, make sure the Test Value is 0.

One-Sample Statistics
                 N    Mean        Std. Deviation   Std. Error Mean
birth weight     48   2891.1250   623.39177        89.97885

(Number of subjects, mean, standard deviation, and standard error of the mean.)

One-Sample Test (Test Value = 0)
                                                                  95% Confidence Interval
                                                                  of the Difference
                 t        df   Sig. (2-tailed)   Mean Difference  Lower        Upper
birth weight     32.131   47   .000              2891.12500       2710.1109    3072.1391

The 95% confidence interval for the mean birth weight is 2710 to 3072 grams. Ignore the t test results (t, df, sig.) because these results are for testing if the mean birth weight is equal to 0 (versus not equal to 0).

To perform a one-sample t test of whether the mean in the SIDS population is equal to 3300 versus not equal to 3300

To run the one-sample t test to test if the mean birth weight is equal to 3300, you need to change the Test Value from the default value of 0 to 3300.

One-Sample Statistics
                 N    Mean        Std. Deviation   Std. Error Mean
birth weight     48   2891.1250   623.39177        89.97885

One-Sample Test (Test Value = 3300)
                                                                  95% Confidence Interval
                                                                  of the Difference
                 t        df   Sig. (2-tailed)   Mean Difference  Lower        Upper
birth weight     -4.544   47   .000              -408.87500       -589.8891    -227.8609

t = test statistic value = -4.544
df = degrees of freedom = 47
Sig. (2-tailed) = two-tailed p-value = <.001
Ignore the results for the 95% confidence interval of the difference, because it is the confidence interval for the mean minus 3300.

Paired t Test

1. Choose Analyze from the menu bar.
2. Choose Compare Means
3. Choose Paired-Samples T Test...
4. Paired Variable(s): Select the two paired variables you want from the source list on the left by pointing and clicking the mouse on both variables, and then click on the arrow located in the middle of the window. Repeat the process until you have selected all the paired variables you want to test.
5. Choose OK

Confidence Interval for the Difference Between Means from Paired Samples

By default a 95% confidence interval for the difference between the means of the paired samples will be computed when performing a paired t test. Choose Options... to change the confidence level.

Prozac Example. To study the effect of Prozac on anxiety, 10 subjects were given one week of treatment with Prozac and one week of treatment with a placebo. The order of the treatments was randomized for each subject. An anxiety questionnaire was used to measure a subject's anxiety on a scale of 0 to 30. Higher scores indicate more anxiety.

Subject   Placebo   Prozac   Difference
1         22        19        3
2         18        11        7
3         17        14        3
4         19        17        2
5         22        23       -1
6         12        11        1
7         14        15       -1
8         11        19       -8
9         19        11        8
10         7         8       -1

Mean difference, d = 1.3
Standard deviation of the differences, sd = 4.5

Paired t test and confidence interval for the difference between paired means

The order of the variables in calculating the difference is determined by the order of the variables in the data set (and not the order in which you select the variables).

Paired Samples Statistics
                   Mean      N    Std. Deviation   Std. Error Mean
Pair 1   placebo   16.1000   10   4.95424          1.56667
         prozac    14.8000   10   4.68568          1.48174

(Summaries for each sample of data, or variable.)

Paired Samples Correlations
                            N    Correlation   Sig.
Pair 1   placebo & prozac   10   .556          .095

(Correlation between the paired values; usually not useful.)

Paired Samples Test
                            Paired Differences
                                                                 95% Confidence Interval
                                                                 of the Difference
                            Mean      Std. Deviation  Std. Error  Lower      Upper      t      df   Sig. (2-tailed)
                                                      Mean
Pair 1   placebo - prozac   1.30000   4.54728         1.43798     -1.95293   4.55293    .904   9    .390

difference = placebo - prozac
mean difference = 1.3
standard deviation of the differences = 4.5
standard error of the differences = 1.4
95% confidence interval for the mean difference is -1.9 to 4.6

Paired t test
t = test statistic value = .904
df = degrees of freedom = 9
Sig. (2-tailed) = two-sided p-value = 0.39

Two-Sample t Test

1. Choose Analyze on the menu bar.
2. Choose Compare Means
3. Choose Independent-Samples T Test...
4. Test Variable(s): Select the test variable you want from the source list on the left and then click on the arrow located next to the test variable box. Repeat the process until you have selected all the variables you want.
5. Grouping Variable: Select the variable which defines the groups and then click on the arrow located next to the grouping variable box.
6. Choose Define Groups...
7. Click on the blank box next to Group 1, then enter the code value (numeric or character/string) for group 1.
8. Click on the blank box next to Group 2, then enter the code value (numeric or character/string) for group 2.
9. Choose Continue
10. Choose OK

Confidence Interval for the Difference Between Means from Independent Samples

By default a 95% confidence interval for the difference between the means of two independent samples will be computed when performing a two-sample t test. Choose Options... to change the confidence level.

Model Cities Example. Two groups of people were studied: those who had been randomly allocated to a Fee-For-Service medical insurance group and those who had been randomly allocated to a Prepaid insurance group. We would like to compare the quality of health care received in the two groups, but first we would like to know how comparable the groups are on other characteristics that might affect medical outcome.
For example, we would like to know if the mean age in the two groups is similar. Random allocation should make a large difference in mean age unlikely, but there is always a chance that the groups differ.

Group                     n      Mean   Standard deviation
Prepaid (GHC)             1167   24.0   15.3
Fee-for-service (KCM)     3207   26.4   17.1

We could compare the average age between the two groups using a two-sample t test or a confidence interval for the difference between the average ages of the two groups.

Two-sample t test and 95% confidence interval for the difference between means (from independent samples)

After you select the Grouping Variable, SPSS will put in question marks to prompt you to define the code values for the two groups. Select Define Groups... to enter the code values. In this example the group codes are numeric, 0 (for GHC) and 1 (for KCM).

T-Test

Group Statistics
       prov   N      Mean      Std. Deviation   Std. Error Mean
age    GHC    1167   23.9846   15.30787         .44810
       KCM    3207   26.3676   17.10260         .30200

(Summaries for each sample/group.)

Independent Samples Test: Levene's Test for Equality of Variances
                                     F        Sig.
age   Equal variances assumed        47.068   .000
      Equal variances not assumed

SPSS by default tests if the variances are equal using Levene's test. A small p-value (sig.) indicates the variances may be different.
F = test statistic value = 47.1
sig. = p-value = <.001

Independent Samples Test: t-test for Equality of Means
                                     t        df         Sig. (2-tailed)   Mean Difference   Std. Error Difference
age   Equal variances assumed        -4.188   4372       .000              -2.38306          .56896
      Equal variances not assumed    -4.410   2293.698   .000              -2.38306          .54037

Two-sample t test. SPSS by default always performs both versions of the two-sample t test, one assuming equal variances and one assuming unequal variances.
t = test statistic value = -4.2 (equal var.), -4.4 (unequal var.)
df = degrees of freedom = 4372 (equal var.), 2294 (unequal var.)
Sig. (2-tailed) = two-sided p-value = <.001 (equal var.), <.001 (unequal var.)
mean difference = difference between means = -2.4 (equal and unequal var.)
std. error difference = standard error of the difference between means = .6 (equal var.), .5 (unequal var.)

Independent Samples Test: 95% Confidence Interval of the Difference
                                     Lower      Upper
age   Equal variances assumed        -3.49851   -1.26760
      Equal variances not assumed    -3.44273   -1.32338

The 95% confidence interval for the difference between means is -3.5 to -1.3 (assuming equal variances) and -3.4 to -1.3 (assuming unequal variances).

Sign Test and Wilcoxon Signed-Rank Test

1. Choose Analyze from the menu bar.
2. Choose Nonparametric Tests
3. Choose 2 Related Samples...
4. Test Pair(s) List: Select the two paired variables you want from the source list on the left hand side by pointing and clicking the mouse on both variables, and then click on the arrow located in the middle of the window. Repeat the process until you have selected all the paired variables you want to test.
5. Choose Sign or Wilcoxon (or both) as the Test Type.
6. Choose OK

Aspirin Example. To compare 2 types of Aspirin, A and B, 1 hour urine samples were collected from 10 people after each had taken either A or B. A week later the same routine was followed after giving the "other" type to the same 10 people.

Person                 Type A   Type B   Difference
1                      15       13        2
2                      26       20        6
3                      13       10        3
4                      28       21        7
5                      17       17        0
6                      20       22       -2
7                       7        5        2
8                      36       30        6
9                      12        7        5
10                     18       11        7
Mean                   19.2     15.6      3.6   (= d)
Standard deviation      8.63     7.78     3.098 (= sd)

A Sign test or Wilcoxon Signed Rank test could be used to compare the two types of Aspirin.

The order of the variables in calculating the difference is determined by the order of the variables in the data set (and not the order in which you select the variables). Select Wilcoxon or Sign (or both). Under Options you can select the summaries Descriptive (n, mean, etc.) and Quartiles (median, 25th and 75th percentile).

Descriptive Statistics
                                                                  Percentiles
           N    Mean      Std. Deviation   Minimum   Maximum   25th      50th (Median)   75th
aspirina   10   19.2000   8.62554          7.00      36.00     12.7500   17.5000         26.5000
aspirinb   10   15.6000   7.77746          5.00      30.00     9.2500    15.0000         21.2500

Sign Test

Frequencies
                                                 N
aspirinb - aspirina   Negative Differences(a)    8
                      Positive Differences(b)    1
                      Ties(c)                    1
                      Total                      10
a aspirinb < aspirina
b aspirinb > aspirina
c aspirinb = aspirina

Test Statistics(b)
                          aspirinb - aspirina
Exact Sig. (2-tailed)     .039(a)
a Binomial distribution used.
b Sign Test

Exact Sig. (2-tailed) = exact, two-sided p-value = 0.039. The p-value is exact because it is computed using the Binomial distribution instead of using an approximation based on the Normal distribution.

Wilcoxon Signed Ranks Test

Ranks
                                          N      Mean Rank   Sum of Ranks
aspirinb - aspirina   Negative Ranks      8(a)   5.38        43.00
                      Positive Ranks      1(b)   2.00        2.00
                      Ties                1(c)
                      Total               10
a aspirinb < aspirina
b aspirinb > aspirina
c aspirinb = aspirina

(Information used in the test statistic; not usually reported. Use the previous descriptives.)

Test Statistics(b)
                          aspirinb - aspirina
Z                         -2.442(a)
Asymp. Sig. (2-tailed)    .015
a Based on positive ranks.
b Wilcoxon Signed Ranks Test

Wilcoxon Signed Rank Test: Asymp. Sig. (2-tailed) = two-sided p-value = 0.015. Asymp. is an abbreviation for asymptotic, which means the p-value is computed using a large-sample approximation based on the Normal distribution.

Mann-Whitney U Test (or Wilcoxon Rank Sum Test)

1. Choose Analyze on the menu bar.
2. Choose Nonparametric Tests
3. Choose 2 Independent Samples...
4. Test Variable(s): Select the test variable you want from the source list on the left and then click on the arrow located next to the test variable box. Repeat the process until you have selected all the variables you want.
5. Grouping Variable: Select the variable which defines the grouping and then click on the arrow located next to the grouping variable box. The grouping variable must be numeric for the variable to appear on the left hand side.
6. Choose Define Groups...
7. Click on the blank box next to Group 1, then enter the code value (it must be numeric) for group 1.
8. Click on the blank box next to Group 2, then enter the code value (it must be numeric) for group 2.
9. Choose Continue to return to the Two Independent Samples dialog box.
10. Choose Mann-Whitney U as the Test Type. Note that the option is already selected if the little box is not empty.
11. Choose OK

Legionnaires Example. During July and August, 1976, a large number of Legionnaires attending a convention died of a mysterious and unknown cause. Chen et al. (1977) examined the hypothesis of nickel contamination as a toxin. They examined the nickel levels in the lungs of nine cases and nine controls. There was no attempt to match cases and controls. The data are as follows (μg/100g dry weight):

Legionnaire cases   65   24   52   86   120   82   399   87   139
Controls            12   10   31    6     5    5    29    9    12

The Mann-Whitney U test could be used to compare the two groups.

After you select the Grouping Variable, SPSS will put in question marks to prompt you to define the code values for the two groups. Select Define Groups... to enter the code values. Note: The codes must be numeric, otherwise the grouping variable will not appear on the left hand side. In this example the group codes are 1 for legionnaires and 2 for controls.

Mann-Whitney Test

Ranks
         group   N    Mean Rank   Sum of Ranks
nickel   1       9    13.78       124.00
         2       9    5.22        47.00
         Total   18

(Information used in the test statistic; not usually reported. The descriptives under Options are not useful; you can produce relevant descriptives, e.g., the median and interquartile range for each group, using the Explore command.)

Test Statistics(b)
                                   nickel
Mann-Whitney U                     2.000
Wilcoxon W                         47.000
Z                                  -3.403
Asymp. Sig. (2-tailed)             .001
Exact Sig. [2*(1-tailed Sig.)]     .000(a)
a Not corrected for ties.
b Grouping Variable: group

Mann-Whitney test

Asymp. Sig. (2-tailed) = two-sided p-value = 0.001. This p-value is computed using a large-sample approximation based on the Normal distribution, and it corrects for ties in the data, if present.

Exact Sig. [2*(1-tailed Sig.)] = two-sided p-value = <.001. This p-value is an exact p-value, but it does not correct for ties in the data, if present.

In this example, given the small sample sizes and few ties in the data, the exact p-value would be appropriate to report.

One-way ANOVA (Analysis of Variance)
(E.g., to compare two or more means from two or more independent samples)

1. Choose Analyze on the menu bar
2. Choose Compare Means
3. Choose One-Way ANOVA...
4. Dependent: Select the variable from the source list on the left that you want to compare between the groups and then click on the arrow next to the dependent variable box. You can run multiple one-way ANOVAs by selecting more than one dependent variable.
5. Factor: Select the variable from the source list on the left which defines the groups.
6. Choose OK

To perform pairwise comparisons to determine which groups are different while controlling for multiple testing, use the Post Hoc... option. There are many methods to choose from (e.g., Bonferroni and R-E-G-W-Q).

Other useful options can be found under Options... For example, choose Descriptive to get descriptive statistics for each group (e.g., mean, standard deviation, minimum value, and maximum value). Choose Homogeneity-of-variance to perform Levene's test of whether the group variances are all equal versus not all equal. A small p-value for Levene's test may indicate that the variances are not all equal.

CHD Example. We can use one-way ANOVA to compare HDL levels between subjects with different hypertensive status (0=normotensive, 1=borderline, 2=definite).

Hypertensive Group   n      Mean   Standard Deviation
Normotensive         1568   55.8   15.5
Borderline           547    55.7   16.2
Definite             1310   53.5   15.2

You can select 1 or more variables to compare between groups.
The variable selected as the Factor defines the groups. The variable can be numeric or character/string.

Oneway

ANOVA
HDL cholesterol
                  Sum of Squares   df     Mean Square   F       Sig.
Between Groups      4344.834       2      2172.417      9.045   .000
Within Groups     821904.577       3422    240.183
Total             826249.411       3424

One-way analysis of variance:
Sig. = p-value = <.001
F = test statistic = 9.0; df = degrees of freedom
Sometimes the test statistic and its degrees of freedom are reported along with the p-value; in this example, F=9.0 with degrees of freedom 2 and 3422.
Sum of squares and mean square are used to compute the test statistic; they are usually not reported.

Descriptives
Under Options you can request Descriptives for each group to be computed. This information can be used to describe the differences between the groups.

HDL cholesterol
                                                          95% Confidence Interval for Mean
               N      Mean    Std. Deviation   Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
normotensive   1568   55.82   15.500           .391         55.05         56.59         21        138
borderline      547   55.67   16.202           .693         54.30         57.03         24        149
definite       1310   53.47   15.192           .420         52.64         54.29         15        129
Total          3425   54.90   15.534           .265         54.38         55.42         15        149

Post Hoc Tests
Under Post Hoc… you can request that further comparisons be done between each possible pair of groups to determine which groups are different from each other. These are multiple comparison procedures, which control for the number of tests/comparisons being performed. There are many methods to choose from; below is an example of the Bonferroni method and the Ryan-Einot-Gabriel-Welsch method.

Multiple Comparisons
Dependent Variable: HDL cholesterol
Bonferroni
                                                                                          95% Confidence Interval
(I) Hypertension status   (J) Hypertension status   Mean Difference (I-J)   Std. Error   Sig.    Lower Bound   Upper Bound
normotensive              borderline                  .157                  .770         1.000   -1.69          2.00
                          definite                   2.356(*)               .580          .000     .97          3.74
borderline                normotensive               -.157                  .770         1.000   -2.00          1.69
                          definite                   2.198(*)               .789          .016     .31          4.09
definite                  normotensive              -2.356(*)               .580          .000   -3.74          -.97
                          borderline                -2.198(*)               .789          .016   -4.09          -.31
* The mean difference is significant at the .05 level.

The Bonferroni method shows all pairwise comparisons/differences along with a p-value (Sig.) adjusted for the number of comparisons. In this example, subjects with normal blood pressure and subjects with borderline hypertension have similar HDL cholesterol levels, but subjects with definite hypertension have different HDL cholesterol levels than both subjects with normal blood pressure and subjects with borderline hypertension.

Homogeneous Subsets

HDL cholesterol
Ryan-Einot-Gabriel-Welsch Range
                               Subset for alpha = .05
Hypertension status   N        1        2
definite              1310     53.47
borderline             547              55.67
normotensive          1568              55.82
Sig.                           1.000    .867
Means for groups in homogeneous subsets are displayed.

The Ryan-Einot-Gabriel-Welsch (R-E-G-W-Q) method places groups that are similar in the same subset and groups that are different in different subsets. In this example, subjects with normal blood pressure and borderline hypertension are in one subset and subjects with definite hypertension are in a different subset. Hence, subjects with definite hypertension have different HDL cholesterol levels than subjects with normal blood pressure and borderline hypertension, but subjects with normal blood pressure and borderline hypertension have similar HDL cholesterol levels.


Kruskal-Wallis Test

1. Choose Analyze on the menu bar.
2. Choose Nonparametric Tests
3. Choose K Independent Samples...
4. Test Variable(s): Select the test variable you want from the source list on the left and then click on the arrow located next to the test variable box.
Repeat the process until you have selected all the variables you want to test.
5. Grouping Variable: Select the variable which defines the grouping and then click on the arrow located next to the grouping variable box.
6. Choose Define Range...
7. Click on the blank box next to Minimum, then enter the smallest numeric code value for the groups.
8. Click on the blank box next to Maximum, then enter the largest numeric code value for the groups.
9. Choose Continue
10. Choose Kruskal-Wallis H as the Test Type. Note that the option is already selected if the little box is not empty.
11. Choose OK

CAUTION: The group variable must be numeric and you must correctly enter the smallest numeric code value and the largest numeric code value. SPSS will allow you to select a character/string variable as the grouping variable, and will also allow you to enter the numeric code values incorrectly. The results displayed for the Kruskal-Wallis test in these cases will be incorrect, but no error or warning message will be displayed.

CHD Example. We can use the Kruskal-Wallis test to compare serum insulin levels between subjects with different hypertensive status (0=normotensive, 1=borderline, 2=definite).

Hypertensive Group   n      Median   IQR*
Normotensive         1568   12       9, 15
Borderline            547   12       9, 17
Definite             1310   14       11, 20
*IQR, interquartile range = 25th percentile, 75th percentile

Kruskal-Wallis test
You can select 1 or more variables to compare between groups.
The variable selected as the Grouping Variable defines the groups. THE VARIABLE SHOULD BE NUMERIC.
In this example the smallest numeric code is 0 (for normal) and the largest numeric code is 2 (for definite).

Kruskal-Wallis Test

Ranks
                Hypertension status   N      Mean Rank
Serum insulin   normotensive          1568   1526.31
                borderline             547   1685.28
                definite              1310   1948.03
                Total                 3425

The Ranks table contains information used in the test statistic; it is not usually reported. The descriptives under Options are not useful; you can produce relevant descriptives (e.g., median and interquartile range for each group) using the Explore command.

Test Statistics(a,b)
               Serum insulin
Chi-Square     130.816
df             2
Asymp. Sig.    .000
a Kruskal Wallis Test
b Grouping Variable: Hypertension status

Kruskal-Wallis test:
Asymp. Sig. = p-value = <.001
Asymp. is an abbreviation for asymptotic, which means the p-value is computed using a large sample approximation based on the Normal distribution.
Chi-Square = test statistic value = 130.8
df = degrees of freedom = 2


One-Sample Binomial Test

1. Choose Analyze from the menu bar.
2. Choose Nonparametric Tests
3. Choose Binomial...
4. Test Variable List: Select the test variable you want from the source list on the left and then click on the arrow located next to the test variable box. Repeat the process until you have selected all the variables you want.
5. Test Proportion: Click on the box next to Test Proportion and enter/edit the proportion value specified by your null hypothesis.
6. Choose OK

Example. In the TRAP study, 125 of the 527 patients who were negative for lymphocytotoxic antibodies at baseline became antibody positive. The expected rate for being antibody positive is 30%. We could use the one-sample binomial test to test if the rate is different in the TRAP study population.

Positive is a variable coded 1 if positive and 0 if negative.
Make sure to edit the test proportion value; in this case it is .30 or 30%. The default is .50.

NPar Tests

Binomial Test
                      Category   N     Observed Prop.   Test Prop.   Asymp. Sig. (1-tailed)
positive   Group 1    yes        125   .24              .3           .001(a,b)
           Group 2    no         402   .76
           Total                 527   1.0
a Alternative hypothesis states that the proportion of cases in the first group < .3.
b Based on Z Approximation.

One-sample binomial test: the two-sided p-value is given by 2 x .001 = .002 (Note: SPSS reports the one-sided p-value).


McNemar's Test

1. Choose Analyze from the menu bar.
2. Choose Descriptive Statistics
3. Choose Crosstabs...
4. Row(s): Select the row variable you want from the source list on the left and then click on the arrow located next to the Row(s) box. Repeat the process until you have selected all the row variables you want.
5. Column(s): Select the column variable you want from the source list on the left and then click on the arrow located next to the Column(s) box. Repeat the process until you have selected all the column variables you want.
6. Choose Cells...
7. For cell values choose total under percentages.
8. Choose Continue
9. Choose Statistics...
10. Choose McNemar
11. Choose Continue
12. Choose OK

There is also another way to run McNemar's test (but the test pair variables must be numeric, and an asymptotic (Asymp.) p-value, based on a large sample approximation to the Normal distribution, is reported instead of a p-value based on exact methods).

1. Choose Analyze from the menu bar.
2. Choose Nonparametric Tests
3. Choose 2 Related Samples...
4. Test Pair(s) List: Select the two paired variables you want from the source list on the left: highlight both variables by pointing and clicking the mouse, and then click on the arrow located in the middle of the window. Repeat the process until you have selected all the paired variables you want.
5. Choose McNemar as the Test Type.
6. Choose Wilcoxon to turn off that option. Note that the option is turned off when the little box is empty.
7. Choose OK

Example. Suppose we want to compare two different treatments for a rare form of cancer. Since relatively few cases of this disease are seen, we want the two treatment groups to be as comparable as possible. To accomplish this goal, we set up a matched study such that a random member of each matched pair gets treatment A (chemotherapy), whereas the other member gets treatment B (surgery). The patients are assigned to pairs (621 pairs) matched on age (within 5 years), sex, and clinical condition. The patients are followed for 5 years, with survival as the outcome variable.
The 5-year survival rate for treatment A is 17.1% (106/621) and for treatment B is 15.3% (95/621). We could use McNemar's test to compare the survival rates of the two treatments.

McNemar's test
It doesn't matter for McNemar's test which variable is selected for the Row(s): or Column(s). You can run more than one test at a time.
Under Statistics… select McNemar.
Under Cells…, in this example, select Total percentages.

Crosstabs

TreatmentA * TreatmentB Crosstabulation
                                       TreatmentB
                                       died     survived   Total
TreatmentA   died       Count          510      5          515
                        % of Total     82.1%    .8%        82.9%
             survived   Count          16       90         106
                        % of Total     2.6%     14.5%      17.1%
Total                   Count          526      95         621
                        % of Total     84.7%    15.3%      100.0%

Survival rate for Treatment A is 17.1%.
Survival rate for Treatment B is 15.3%.

Chi-Square Tests
                     Value   Exact Sig. (2-sided)
McNemar Test                 .027(a)
N of Valid Cases     621
a Binomial distribution used.

McNemar's test: Exact Sig. (2-sided) = exact two-sided p-value = 0.027. The p-value is exact because it is computed using the Binomial distribution instead of using an approximation to the Normal distribution.


Chi-square Test, Fisher's Exact test and Trend test for Contingency Tables

If the Chi-square test is requested for a 2 x 2 table, SPSS will also compute the Fisher's Exact test. If the Chi-square test is requested for a table larger than 2 x 2, SPSS will also compute the Mantel-Haenszel test for linear (or linear by linear) association between the row and column variables.

1. Choose Analyze from the menu bar.
2. Choose Descriptive Statistics
3. Choose Crosstabs...
4. Row(s): Select the row variable you want from the source list on the left and then click on the arrow located next to the Row(s) box. Repeat the process until you have selected all the row variables you want.
5. Column(s): Select the column variable you want from the source list on the left and then click on the arrow located next to the Column(s) box. Repeat the process until you have selected all the column variables you want.
6. Choose Cells...
7. Choose the cell values (e.g., observed and expected counts; row, column, and margin (total) percentages). Note the option is selected when the little box is not empty.
8. Choose Continue
9. Choose Statistics...
10. Choose Chi-square
11. Choose Continue
12. Choose OK

Asthma Example. An investigator studied the relationship of parental smoking habits and the presence of asthma in the oldest child. Type A families are defined as those in which both parents smoke and Type B families are those in which neither parent smokes. Of 100 type A families, 15 eldest children have asthma, and of 200 type B families, 6 children have asthma. We could use a chi-square test or Fisher's exact test to test whether the proportion of first born children with asthma is different in these two types of families.

It doesn't matter for the chi-square, Fisher's Exact or trend test which variable is selected for the Row(s): or Column(s). You can run more than one test at a time.
Under Statistics… select Chi-square.
Under Cells…, in this example, select Row percentages.

Crosstabs

familytype * asthma Crosstabulation
                                          asthma
                                          No       Yes      Total
familytype   A   Count                    85       15       100
                 % within familytype      85.0%    15.0%    100.0%
             B   Count                    194      6        200
                 % within familytype      97.0%    3.0%     100.0%
Total            Count                    279      21       300
                 % within familytype      93.0%    7.0%     100.0%

15% of first born in family type A have asthma.
3% of first born in family type B have asthma.

Chi-Square Tests
                           Value       df   Asymp. Sig. (2-sided)   Exact Sig. (2-sided)   Exact Sig. (1-sided)
Pearson Chi-Square         14.747(b)   1    .000
Continuity Correction(a)   12.961      1    .000
Likelihood Ratio           13.745      1    .000
Fisher's Exact Test                                                 .000                   .000
N of Valid Cases           300
a Computed only for a 2x2 table
b 0 cells (.0%) have expected count less than 5. The minimum expected count is 7.00.

Fisher's Exact test: Exact Sig. (2-sided) = exact two-sided p-value = <.001

Chi-square test:
Pearson Chi-square (without continuity correction), p-value = <.001
Pearson Chi-square with continuity correction, p-value = <.001
Asymp. Sig. (2-sided) = two-sided p-value. Asymp. is an abbreviation for asymptotic, which means the p-value is computed using a large sample approximation based on the Normal distribution. Check that all cells have expected cell counts of 5 or greater.
Value = test statistic value
df = degrees of freedom

Trend Test Example. A clinical trial of a drug therapy to control pain was performed. The investigators wanted to investigate whether adverse responses to the drug increased with larger drug doses. Subjects received either a placebo or one of four drug doses. In this example dose is an ordinal variable, and it is reasonable to expect that as the dose increases the rate of adverse events will increase.

Dose      n    Adverse event % (n)
Placebo   32   18.8% (6)
500 mg    32   21.9% (7)
1000 mg   32   28.1% (9)
2000 mg   32   31.3% (10)
4000 mg   32   50.0% (16)

There are several different methods for performing a trend test with ordinal variables. One test, which is available in SPSS, is the Mantel-Haenszel chi-square, also called the Mantel-Haenszel test for linear association or linear by linear association chi-square test.

                                 Adverse events
                                 No       Yes      Total
dose    0      Count             26       6        32
               % within dose     81.3%    18.8%    100.0%
        500    Count             25       7        32
               % within dose     78.1%    21.9%    100.0%
        1000   Count             23       9        32
               % within dose     71.9%    28.1%    100.0%
        2000   Count             22       10       32
               % within dose     68.8%    31.3%    100.0%
        4000   Count             16       16       32
               % within dose     50.0%    50.0%    100.0%
Total          Count             112      48       160
               % within dose     70.0%    30.0%    100.0%

Chi-Square Tests
                               Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square             9.107(a)   4    .058
Likelihood Ratio               8.836      4    .065
Linear-by-Linear Association   8.876      1    .003
N of Valid Cases               160
a 0 cells (.0%) have expected count less than 5. The minimum expected count is 9.60.

In this example, there is a significant trend (p-value = 0.003, chi-square trend test), and we would conclude that the rate of adverse responses increases with drug dose.


Using Standardized Residuals in R x C tables
When the contingency table has more than 2 rows and 2 columns it can be hard to determine the association or the largest differences. Standardized residuals are often helpful in describing the association, if the chi-square test indicates there is a statistically significant association. The (adjusted) standardized residual re-expresses the difference between the observed cell count and expected cell count in terms of standard deviation units below or above the value 0 (the expected difference if there is no association), and if there is no association the adjusted standardized residuals have approximately a standard Normal distribution. Hence, values less than -2 or greater than 2 indicate large differences and values less than -3 or greater than 3 indicate very large differences.

Under Cells…, select Adjusted standardized for Residuals.

Education vs Stage of Disease at Diagnosis Example. The chi-square test indicated a significant association between education level and stage of disease at diagnosis (Chi-square test, p-value = 0.016).

                                            Stage of Disease
Education                                   I        II       III
≤12 years          Count                    20       24       35
                   % within education       25.3%    30.4%    44.3%
                   Adjusted Residual        -2.6     -.5      3.3
College            Count                    37       32       23
                   % within education       40.2%    34.8%    25.0%
                   Adjusted Residual        .8       .6       -1.4
College graduate   Count                    40       29       21
                   % within education       44.4%    32.2%    23.3%
                   Adjusted Residual        1.8      -.1      -1.8

The adjusted standardized residuals indicate that the biggest differences between the observed and expected cell counts (i.e., the most unusual differences under the assumption of no association between education and stage of disease) are for subjects with ≤12 years of education, where there are fewer subjects with Stage I and more subjects with Stage III or IV than expected if there was no association between education and stage of disease.
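The adjusted residuals in the table above can be checked by hand. The following is a minimal Python sketch (not SPSS syntax), assuming the usual formula: adjusted residual = (observed - expected) / sqrt(expected x (1 - row total/n) x (1 - column total/n)). The function name adjusted_residuals is made up for this illustration; the counts are those of the education by stage table.

```python
from math import sqrt

def adjusted_residuals(table):
    """Adjusted standardized residuals for an R x C table of counts.

    Hypothetical helper for illustration; not an SPSS function.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, observed in enumerate(row):
            # Expected count if there is no association
            expected = row_totals[i] * col_totals[j] / n
            # Variance of (observed - expected) under no association
            variance = expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            out_row.append((observed - expected) / sqrt(variance))
        result.append(out_row)
    return result

# Rows: <=12 years, College, College graduate; columns: Stage I, II, III or IV
counts = [[20, 24, 35],
          [37, 32, 23],
          [40, 29, 21]]

residuals = adjusted_residuals(counts)
for label, row in zip(["<=12 years", "College", "College graduate"], residuals):
    print(label, [round(r, 1) for r in row])
```

Rounded to one decimal place, the values agree with the Adjusted Residual rows of the SPSS output, which suggests SPSS is using this formula here.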
Also, to a lesser extent, among the subjects with a college graduate degree there are more subjects with Stage I and fewer subjects with Stage III or IV than expected if there was no association between education and stage of disease.


One sample binomial test, McNemar's test, Fisher's Exact test and Chi-square test for 2 x 2 and R x C Contingency Tables Using Summary Data

There is an easy way in SPSS to perform a one sample binomial test, a McNemar's test, a Fisher's Exact test or a Chi-square test for a 2 x 2 or R x C table when you only have summary data (i.e., the number of observations in each cell).

One sample binomial test. Suppose you observe 15 cases of myocardial infarction (MI) in 5000 men over a 1 year period and you want to test if the rate of MI is equal to a previously reported incidence rate of 5 per 1000 (or 0.005).

1. In a new (empty) SPSS Data Editor window enter the following 2 rows of data:

   MI   Observed
   0    4985
   1    15

   The values of 0 and 1 used to indicate MI (no/yes) are arbitrary. The variable names are also arbitrary (e.g., you can leave them as var0001 and var0002).

2. Next, you want to weight cases by Observed:
   Choose Data
   Choose Weight Cases...
   Choose Weight cases by
   Choose Observed and then the arrow button so the variable appears in the Frequency variable box.
   Choose OK

3. Now, run the one sample binomial test:
   Choose Analyze
   Choose Nonparametric Tests
   Choose Binomial...
   Choose MI so that it appears in the Test Variable List
   Change (edit) Test Proportion to .005.
   Choose OK

McNemar's test. Suppose you have the following summary table of presence and absence of DKA before and after therapy for paired data,

                        After therapy
                        No DKA   DKA
Before      No DKA      128      7
therapy     DKA         19       7

1. In a new (empty) SPSS Data Editor window enter the following 4 rows of data (here 1 = no DKA and 0 = DKA):

   Before   After   Observed
   1        1       128
   1        0       7
   0        1       19
   0        0       7

   The values of 0 and 1 used to indicate DKA and no DKA are arbitrary.
   The variable names are also arbitrary (e.g., you can leave them as var0001, var0002, and var0003).

2. Next, you want to weight cases by Observed:
   Choose Data
   Choose Weight Cases...
   Choose Weight cases by
   Choose Observed and then the arrow button so the variable appears in the Frequency variable box.
   Choose OK

3. Now, run McNemar's test:
   Choose Analyze
   Choose Nonparametric Tests
   Choose 2 Related Samples...
   Choose Before and After so that they appear in the Test Pair(s) List.
   Choose McNemar as the Test Type
   Choose Wilcoxon to turn off that option
   Choose OK

Chi-square test and Fisher's Exact test for a 2 x 2 table. Suppose you have the following summary table for oral contraceptive (OC) use by presence or absence of cancer (case or control),

                   OC Use
                   No     Yes
Cases (cancer)     111    6
Controls           387    8

1. In a new (empty) SPSS Data Editor window enter the following 4 rows of data:

   Case   OCuse   Observed
   1      0       111
   1      1       6
   0      0       387
   0      1       8

   The values of 0 and 1 used to indicate case/control and OC use (no/yes) are arbitrary. The variable names are also arbitrary (e.g., you can leave them as var0001, var0002, and var0003).

2. Next, you want to weight cases by Observed:
   Choose Data
   Choose Weight Cases...
   Choose Weight cases by
   Choose Observed and then the arrow button so the variable appears in the Frequency variable box.
   Choose OK

3. Now, run the Chi-square (& Fisher's Exact) test:
   Choose Analyze
   Choose Descriptive Statistics
   Choose Crosstabs...
   Choose Case and OCuse as the row and column variables
   Choose Statistics...
   Choose Chi-square
   Choose Continue
   Choose OK

The commands are similar for running the Chi-square test for tables larger than 2 x 2. Suppose you have the following summary table for education level by stage of disease at diagnosis:

                        Stage of Disease
Education level         I     II    III or IV
High school or less     20    24    35
College                 37    32    23
College graduate        40    29    21

1. In a new (empty) SPSS Data Editor window enter the following 9 rows of data:

   Educ   Stage   Observed
   1      1       20
   1      2       24
   1      3       35
   2      1       37
   2      2       32
   2      3       23
   3      1       40
   3      2       29
   3      3       21

   The values used to indicate education level and stage are arbitrary, and the variable names are also arbitrary.

Then follow steps 2 and 3 given above for the 2 x 2 table (except use the variables Educ and Stage, instead of Case and OCuse).


Confidence Interval for a Proportion

Constructing a confidence interval for a proportion or rate is rather awkward in SPSS, but you can do it with the raw data or with summary data (as long as the sample size is large enough to use the Normal approximation methods for binomial data).

To construct a confidence interval using the raw data you need 1) a binary indicator variable equal to 1 if the characteristic is present for a subject and equal to 0 if it is absent for a subject, and 2) a variable that is equal to 1 for all subjects. For example, suppose you want to construct a confidence interval for the proportion of males in your data set. First you need a binary indicator variable for males, e.g., you could have a variable named Gender which is equal to 1 if the subject is a male and equal to 0 if the subject is a female. Second you need to create a variable that is equal to 1 for all subjects (e.g., use the Compute statement and create a variable Allones = 1). Now,

1. Choose Analyze on the menu bar
2. Choose Descriptive Statistics
3. Choose Ratio...
4. Numerator: Select the binary indicator variable from the source list on the left and then click on the arrow located in the middle of the window (e.g., select Gender)
5. Denominator: Select the variable equal to 1 for all subjects from the source list on the left and then click on the arrow located in the middle of the window (e.g., select Allones)
6. Choose Statistics...
7. Choose Mean under Central Tendency
8. Choose Confidence intervals (default is a 95% confidence interval)
9. Choose Continue
10. Choose OK

To illustrate how you would construct a confidence interval with summary data, suppose in a data set of 3425 subjects, 1341 are males and 2084 are females:

1. In a new (empty) SPSS Data Editor window enter the following 2 rows of data:

   Gender   Observed   Allones
   0        2084       1
   1        1341       1

2. Next, you want to weight cases by Observed:
   Choose Data
   Choose Weight Cases...
   Choose Weight cases by
   Choose Observed and then the arrow button so the variable appears in the Frequency variable box.
   Choose OK

3. Now,
   Choose Analyze on the menu bar
   Choose Descriptive Statistics
   Choose Ratio...
   Numerator: Select Gender
   Denominator: Select Allones
   Choose Statistics...
   Choose both Mean and Confidence intervals under Central Tendency
   Choose Continue
   Choose OK

Example of the SPSS output using the previous summary data.

Ratio Statistics

Ratio Statistics for Gender / Allones
Mean                                               .392
95% Confidence Interval for Mean   Lower Bound     .375
                                   Upper Bound     .408
Price Related Differential                         1.000
Coefficient of Dispersion                          .
Coefficient of Variation           Median Centered .
The confidence intervals are constructed by assuming a Normal distribution for the ratios.

The observed proportion was .392 or 39.2%. A 95% confidence interval is 37.5% to 40.8%.


Correlation & Regression

Pearson and Spearman Rank Correlation Coefficient

1. Choose Analyze on the menu bar
2. Choose Correlate
3. Choose Bivariate...
4. Variable(s): Select the variables from the source list on the left and then click on the arrow located in the middle of the window.
5. Choose Pearson and/or Spearman as the Correlation Coefficients. Note that the option is selected if the box has a check mark in it.
6. Choose Two-tailed as the Test of Significance. SPSS will test whether the correlation is equal to zero versus not equal to zero.
7. Choose OK

Note that you can use the Crosstabs command to calculate confidence intervals for the correlation.

Example.
Pain-related beliefs, catastrophizing, and coping have been shown to be associated with measures of physical and psychosocial functioning among patients with chronic musculoskeletal and rheumatologic pain. However, little is known about the relative importance of these process variables in the functioning of patients with temporomandibular disorders (TMD). Correlation coefficients could be calculated to examine the association between catastrophizing, depression (Beck Depression Inventory), pain-related activity interference and jaw opening (maximum assisted opening). (Reference: JA Turner, SF Dworkin, L Mancl, KH Huggins, EL Truelove. "The roles of beliefs, catastrophizing, and coping in the functioning of patients with temporomandibular disorders." Pain, 92, 41-51, 2001.)

Typically, you would only report either the Pearson or the Spearman (rank) correlation coefficients, but you might calculate both to see if you get different results or conclusions.

The correlations are shown below. Note that SPSS will display the correlation between variable 1 and variable 2 and between variable 2 and variable 1, which are equivalent, and similarly the correlations between all possible pairs of variables. So, all results displayed below the diagonal of the matrix of results are redundant.

Correlations
1st entry = Pearson correlation coefficient
2nd entry = Sig. (2-tailed) = p-value
3rd entry = N = the number of observations or subjects with non-missing data for both variables

Correlations
                                                   Catastrophizing   Beck inventory score   Interference   Maximum assisted opening
Catastrophizing            Pearson Correlation     1                 .602(**)               .451(**)       -.029
                           Sig. (2-tailed)                           .000                   .000           .758
                           N                       118               118                    118            116
Beck inventory score       Pearson Correlation     .602(**)          1                      .445(**)       -.079
                           Sig. (2-tailed)         .000                                     .000           .397
                           N                       118               118                    118            116
Interference               Pearson Correlation     .451(**)          .445(**)               1              -.068
                           Sig. (2-tailed)         .000              .000                                  .468
                           N                       118               118                    118            116
Maximum assisted opening   Pearson Correlation     -.029             -.079                  -.068          1
                           Sig. (2-tailed)         .758              .397                   .468
                           N                       116               116                    116            116
** Correlation is significant at the 0.01 level (2-tailed).

Correlation between Catastrophizing and Interference = .45; P-value = <.001; N = 118 subjects.

Nonparametric Correlations
1st entry = Spearman rank correlation coefficient
2nd entry = Sig. (2-tailed) = p-value
3rd entry = N = the number of observations or subjects with non-missing data for both variables

Correlations
Spearman's rho
                                                       Catastrophizing   Beck inventory score   Interference   Maximum assisted opening
Catastrophizing            Correlation Coefficient     1.000             .625(**)               .451(**)       -.013
                           Sig. (2-tailed)             .                 .000                   .000           .892
                           N                           118               118                    118            116
Beck inventory score       Correlation Coefficient     .625(**)          1.000                  .455(**)       -.110
                           Sig. (2-tailed)             .000              .                      .000           .241
                           N                           118               118                    118            116
Interference               Correlation Coefficient     .451(**)          .455(**)               1.000          -.046
                           Sig. (2-tailed)             .000              .000                   .              .621
                           N                           118               118                    118            116
Maximum assisted opening   Correlation Coefficient     -.013             -.110                  -.046          1.000
                           Sig. (2-tailed)             .892              .241                   .621           .
                           N                           116               116                    116            116
** Correlation is significant at the 0.01 level (2-tailed).

Rank correlation between Catastrophizing and Interference = .45; P-value = <.001; N = 118 subjects.


Confidence Interval for a Correlation Coefficient

Typically the Crosstabs command is used to produce contingency tables for categorical variables. One of the options under Statistics… is used to compute the correlation coefficient, which you might want to calculate for ordinal variables. However, you can also use this option for quantitative variables.

The Crosstabs command is found by selecting Analyze and then Descriptive Statistics.
In this example the correlation between the quantitative variables catastrophizing and interference will be calculated.
Select Statistics… and then select Correlations.

SPSS will produce a contingency table of the cross-tabulation of the two variables which you can ignore.
SPSS will display the correlation coefficient and standard error estimate for the correlation coefficient, which can be used to calculate confidence intervals.

Symmetric Measures

                                                Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
  Interval by Interval   Pearson's R            .451    .068                   5.445          .000(c)
  Ordinal by Ordinal     Spearman Correlation   .451    .076                   5.449          .000(c)
  N of Valid Cases                              118

  a Not assuming the null hypothesis.
  b Using the asymptotic standard error assuming the null hypothesis.
  c Based on normal approximation.

An approximate 95% confidence interval for the correlation coefficient is given by

  Correlation coefficient ± 1.96 x Asymp. Std. Error

In this example, the 95% confidence interval for the Pearson correlation coefficient is given by

  .451 ± 1.96 x .068 or .32, .58

The 95% confidence interval for the Spearman rank correlation coefficient is given by

  .451 ± 1.96 x .076 or .30, .60

Linear Regression

1. Choose Analyze on the menu bar
2. Choose Regression
3. Choose Linear...
4. Dependent: Select the dependent variable from the source list on the left and then click on the arrow next to the dependent variable box.
5. Independent(s): Select the independent variable and then click on the arrow next to the independent variable(s) box. Repeat the process until you have selected all the independent variables you want.
6. Choose Statistics...
7. Choose Estimates. SPSS will print the regression coefficient estimate, standard error, t statistic and p-value for each independent variable (as well as the intercept/constant). By default the option should be selected (i.e., the box has a check mark in it).
8. Choose Model fit. SPSS will print the multiple R, R squared, Adjusted R-squared, standard error of the regression line, and the ANOVA table. By default the option should be selected.
9. Choose Continue
10. Choose Enter as the Method. Enter is the default method for independent variable entry. Other methods of variable entry can be selected by clicking on the down arrow and clicking on the desired method of entry.
11. Choose OK

Additional options are available under Statistics..., Plots..., Save..., Method, and Options... For example:

Statistics...

Estimates. Default option, which prints the usual linear regression results.

Model fit. Default option, which prints the usual linear regression results.

Confidence intervals (for the regression coefficient estimates)

Covariance matrix (and correlation matrix for the regression coefficient estimates).

R squared change. If independent variables are entered in Blocks (using the Block option; see below), this option computes the change in the R squared between models with different blocks of independent variables. It is also useful for computing a partial F test for a categorical variable with more than two categories by entering the indicator variables for the categorical variable in the second block (Block 2 of 2) and all other independent variables in the first block (Block 1 of 2) and using the R squared change option.

Part and Partial Correlations. This option computes the Pearson correlation coefficient between the dependent variable and each independent variable (Zero-order correlation) and the correlation coefficient between the dependent variable and an independent variable after controlling for all the other independent variables in the regression model (Partial correlation). Squaring the partial correlation gives you the partial R-squared for an independent variable. This option also computes a Part correlation, which is the correlation between the dependent variable and an independent variable after (only) the independent variable has been adjusted for all the other independent variables in the regression model. The square of the Part correlation is equal to the change in the R-squared when an independent variable is added to the regression model with all the other independent variables.
(Multi-)Collinearity diagnostics. This option computes various statistics for detecting collinearity between the independent variables. For example, Tolerance is the proportion of a variable's variance not accounted for by other independent variables in the equation. A variable with a very low tolerance contributes little information to a model, and can cause computational problems. Another statistic is the VIF (variance inflation factor). Large values are an indicator of multicollinearity between independent variables.

Plots... which are useful for doing regression diagnostics:

  Histogram or Normal Probability Plot (P-P plot) (of the standardized residuals)
  Produce all partial (residual) plots
  Other scatter plots

Save... which produces variables that are useful for doing regression diagnostics:

  Predicted Values (unstandardized, standardized, adjusted)
  Residuals (unstandardized, standardized, studentized, deleted)
  Distances (Mahalanobis, Cook's, Leverage)
  Influence Statistics (dfBeta, dfFit)

Note that SPSS creates a new variable for each selected Save... option and adds the new variables to the data file. The variable names are defined in the Variable View of the Data Editor. Once you are done using these variables you may want to delete them from the data file or save them (by re-saving the data file).

Method. Click on the down arrow to the right of Method to display the methods available for independent variable entry (enter, stepwise, remove, backward, forward). Enter is the default option. The other options enter independent variables into the model using various stepwise methods.

Options... You can modify the entry and removal criteria used by the stepwise, remove, backward, and forward independent variable entry methods. You can define how observations with missing data are handled.

Previous, Block # of #, Next. You can use these options to enter independent variables in blocks into the regression model.
You can select different methods of variable entry for each block. This option is also useful for computing partial F tests with the R squared change option.

Example. Simple linear regression of forced expiratory volume (volume, 1 second) on height (cm).

The dependent variable in this example is forced expiratory volume (fev1). There is only 1 independent variable in this example, height. Additional options can be found under Statistics, Plots, Save, & Options.

Statistics… options: Usually you want the default options Estimates and Model fit selected. In this example, the (95%) confidence interval for the regression coefficients is also selected.

Plots… options: By default no options are selected. In this example, the normal probability plot of the residuals is requested.

Regression

Variables Entered/Removed(b)

  Model   Variables Entered   Variables Removed   Method
  1       height(a)           .                   Enter

  a All requested variables entered.
  b Dependent Variable: fev1

This table gives information on the independent variables and dependent variable in the regression model, and the method of entering the independent variables into the regression model.

Model Summary(b)

  Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
  1       .562(a)   .315       .314                .55337

  a Predictors: (Constant), height
  b Dependent Variable: fev1

R Square = proportion of the total variation in the dependent variable explained by the independent variable(s) = .315 or 31.5%.
R is the square root of R Square.
Adjusted R Square "adjusts" the R square for the number of variables in the model.
Std. error of the estimate = standard deviation of the error or residuals. Not usually reported, but used in estimating the standard error of the regression coefficients.

ANOVA(b)

  Model            Sum of Squares   df    Mean Square   F         Sig.
  1   Regression   112.380          1     112.380       366.997   .000(a)
      Residual     244.054          797   .306
      Total        356.434          798

  a Predictors: (Constant), height
  b Dependent Variable: fev1

ANOVA = analysis of variance table. Not needed when there is only 1 independent variable in the model. The F test is equivalent to the t test for testing if the slope is equal to zero in the output that follows (F = t²).

Coefficients(a)

                   Unstandardized         Standardized
                   Coefficients           Coefficients                      95% Confidence Interval for B
  Model            B        Std. Error    Beta           t         Sig.     Lower Bound   Upper Bound
  1   (Constant)   -4.330   .335                         -12.943   .000     -4.987        -3.673
      height       .039     .002          .562           19.157    .000     .035          .043

  a Dependent Variable: fev1

Unstandardized coefficients B = regression coefficient. In this example B = 0.039 is the slope and B = -4.330 is the intercept.
Std. Error = standard error of the regression coefficient.
Standardized coefficients Beta = standardized regression coefficient.
t = t statistic for testing if the regression coefficient is equal to zero (versus not equal to zero).
Sig. = p-value for testing if the regression coefficient is equal to zero (versus not equal to zero).
95% confidence interval for B = 95% confidence interval for the regression coefficient.

In this example, you would report the slope (.039), standard error of the slope (.002) and the p-value (<.001), or the slope (.039) and 95% confidence interval (.035 to .043).

Charts

[Figure: Normal P-P Plot of Regression Standardized Residual. Dependent Variable: fev1. Horizontal axis: Observed Cum Prob; vertical axis: Expected Cum Prob.]

Normal probability plot of the residuals. The points fall along a straight line, indicating the residuals have, at least approximately, a Normal distribution.

Linear Regression Example with three independent variables

The dependent variable is forced expiratory volume (fev1). The independent variables are height, age and gender. The Enter method means all 3 independent variables will be included in the regression model.

Statistics… options: By default, Estimates and Model fit are selected.
In this example, part and partial correlations and collinearity diagnostics are also selected.

Plots… options: Normal probability plot (of the standardized residuals) and partial (residual) plots are selected.

Regression

Variables Entered/Removed(b)

  Model   Variables Entered         Variables Removed   Method
  1       gender, age, height(a)    .                   Enter

  a All requested variables entered.
  b Dependent Variable: fev1

This table gives information on the independent variables, method of variable entry, and dependent variable.

Model Summary(b)

  Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
  1       .601(a)   .361       .358                .53531

  a Predictors: (Constant), gender, age, height
  b Dependent Variable: fev1

R-square is .361 or 36.1% (adjusted R-square is 35.8%). About 36% of the variation in the dependent variable can be explained by the 3 independent variables.

ANOVA(b)

  Model            Sum of Squares   df    Mean Square   F         Sig.
  1   Regression   128.623          3     42.874        149.621   .000(a)
      Residual     227.811          795   .287
      Total        356.434          798

  a Predictors: (Constant), gender, age, height
  b Dependent Variable: fev1

The overall F test indicates that one or more of the independent variables is significant (P < .001). Degrees of freedom of the F test are 3 and 795.

Coefficients(a)

               Unstandardized       Standardized
               Coefficients         Coefficients                    Correlations                    Collinearity Statistics
               B       Std. Error   Beta           t        Sig.    Zero-order   Partial   Part     Tolerance   VIF
  (Constant)   -.780   .593                        -1.315   .189
  height       .028    .003         .399           9.143    .000    .562         .308      .259     .423        2.364
  age          -.025   .004         -.200          -6.857   .000    -.206        -.236     -.194    .944        1.059
  gender       .273    .059         .201           4.591    .000    .478         .161      .130     .420        2.379

  a Dependent Variable: fev1

Height, age, and gender are all statistically significant (P < .001), i.e., the regression coefficients are different from zero.

The partial correlations (and partial R-squares, .308² = .095, (-.236)² = .056, and .161² = .026) indicate the correlation with the dependent variable adjusted for the other variables in the regression model.
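The squared partial correlations quoted above can be checked by hand, and the same table also lets you recover the change in R-squared (the squared part correlation) and the VIF (the reciprocal of the tolerance). The sketch below is plain Python rather than SPSS syntax; the dictionary layout and the function name are illustrative only, and the inputs are the rounded values printed in the Coefficients table, so the last digit can differ slightly from what SPSS reports.

```python
# Values copied from the Coefficients table of the three-predictor fev1 example.
# The structure and names below are illustrative; this is not SPSS syntax.
coefficients = {
    "height": {"partial": 0.308, "part": 0.259, "tolerance": 0.423},
    "age": {"partial": -0.236, "part": -0.194, "tolerance": 0.944},
    "gender": {"partial": 0.161, "part": 0.130, "tolerance": 0.420},
}


def summarize(row):
    """Return partial R-squared, R-squared change, and VIF for one predictor."""
    return {
        "partial_r2": row["partial"] ** 2,  # squared partial correlation
        "r2_change": row["part"] ** 2,      # squared part correlation
        "vif": 1.0 / row["tolerance"],      # VIF is the reciprocal of tolerance
    }


summary = {name: summarize(row) for name, row in coefficients.items()}

for name, stats in summary.items():
    print(
        f"{name:7s} partial R2 = {stats['partial_r2']:.3f}  "
        f"R2 change = {stats['r2_change']:.3f}  VIF = {stats['vif']:.3f}"
    )
```

Because the tolerance values are rounded to three decimals in the output, a VIF recomputed this way may not match the printed VIF exactly (for gender it comes out slightly above the 2.379 shown).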
A low tolerance value (say, < .20) or a high variance inflation factor (VIF) (say, > 5 or 10) may indicate a multicollinearity problem.

[Figure: Normal P-P Plot of Regression Standardized Residual. Dependent Variable: fev1. Horizontal axis: Observed Cum Prob; vertical axis: Expected Cum Prob.]

Normal probability plot of the residuals. The points fall approximately along a straight line, indicating the residuals have (approximately) a Normal distribution.

[Figure: Partial Regression Plot, fev1 versus height. Dependent Variable: fev1.]

[Figure: Partial Regression Plot, fev1 versus age. Dependent Variable: fev1.]

Partial regression plots for height and age with lowess smooths. The plot for height is assessing the relationship between height and fev1 after adjusting for age and gender (e.g., is the relationship linear). Similarly, the plot for age is assessing the relationship between age and fev1 adjusting for height and gender.

Note that SPSS will also produce a partial residual plot for gender. In general, the partial residual plots for categorical/nominal variables are not very useful. Boxplots of the residuals for each category of a categorical/nominal variable are useful for regression diagnostics. To produce the boxplots you could use the Save… options to save the residuals from a regression and then the Boxplot commands to plot the residuals.

Linear Regression via ANOVA Commands

It is possible to use the analysis of variance commands of SPSS to perform a linear regression analysis, because the methods are mathematically equivalent. Performing a linear regression analysis via analysis of variance in SPSS is more complicated than using the linear regression commands. However, the advantage of using the analysis of variance commands to perform a linear regression is that you do not have to create indicator variables for categorical variables or create interaction terms.
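To make concrete what the analysis of variance commands spare you from doing by hand, the sketch below builds indicator (dummy) variables for a three-category variable and the matching interaction terms with a continuous variable. It is plain Python rather than SPSS syntax; the function names and the toy data are hypothetical, and using the highest-coded category as the reference is an assumption chosen to mirror the default this handout describes for the Univariate procedure.

```python
def indicator_columns(values, reference=None):
    """Code a categorical variable as 0/1 indicators, omitting the reference.

    If no reference is given, the largest code is used (an assumption that
    mirrors the default described in this handout).
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[-1]
    kept = [level for level in levels if level != reference]
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, rows


def interaction_columns(indicator_rows, covariate):
    """Multiply each indicator by the continuous covariate, row by row."""
    return [[d * x for d in row] for row, x in zip(indicator_rows, covariate)]


# Hypothetical toy data: a status variable coded 1, 2, 3 and height in cm.
status = [1, 2, 3, 1]
height = [170.0, 165.0, 180.0, 158.0]

kept_levels, dummies = indicator_columns(status)
products = interaction_columns(dummies, height)

print("indicator for levels:", kept_levels)
print("indicator rows:", dummies)
print("interaction rows:", products)
```

Subjects in the reference category get zeros in every indicator and interaction column, which is why the regression coefficients for the other categories are interpreted as differences from the reference group.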
To perform a linear regression via analysis of variance commands

1. Choose Analyze on the menu bar
2. Choose General Linear Model
3. Choose Univariate...
4. Dependent: Select the dependent variable from the source list on the left and then click on the arrow next to the dependent variable box.
5. Fixed Factor(s): Select the independent variables that are categorical/qualitative and then click on the arrow next to the fixed factor(s) box. Repeat the process until you have selected all the categorical variables you want.
6. Covariate(s): Select the independent variables that are continuous/quantitative and then click on the arrow next to the covariate(s) box. Repeat the process until you have selected all the continuous variables you want.
7. Choose Model...
8. Choose Custom
9. Factors & Covariates: Select/highlight all the variables, then under Build Terms select Main Effects. You may need to click on the down arrow to display the Main Effects option. After you have selected Main Effects, select the arrow under Build Terms. All the variables should now appear in the Model box on the right hand side.
10. Choose Continue
11. Choose Options...
12. Choose Parameter Estimates under Display
13. Choose Continue
14. Choose OK

For categorical variables the last category (i.e., the category with the largest numeric coding value) will be the referent group/category.

SPSS will compute the F test for each continuous independent variable and for each categorical independent variable. By selecting to have the parameter estimates displayed, SPSS will also compute the regression coefficient estimates, standard errors, t (statistic) values, p-values, and 95% confidence intervals that you get from the linear regression commands.

To include interaction terms in the regression model, in Step 9 highlight the two variables for which you want to create a (two-way) interaction term. Under Build Terms select Interaction, and then select the arrow under Build Terms.
A two-way interaction between two variables (variable 1 * variable 2) should now appear in the Model box on the right hand side.

Example. Linear regression of forced expiratory volume on height (continuous variable) and diabetes status (categorical variable; normal, impaired fasting glucose, diabetic).

Forced expiratory volume (fev1) is the dependent variable. Diabetes is a categorical variable with 3 categories. Height is a continuous variable.

Under Model…, select Custom, then select each of the variables separately until they all appear under Model: or select Main Effects under Build Term(s), select all Factors & Covariates, and then select the arrow under Build Term(s).

Under Options…, select Parameter estimates to have the usual linear regression results displayed in the output.

Univariate Analysis of Variance

Tests of Between-Subjects Effects
Dependent Variable: fev1

  Source            Type III Sum of Squares   df    Mean Square   F         Sig.
  Corrected Model   114.617(a)                3     38.206        125.606   .000
  Intercept         51.195                    1     51.195        168.308   .000
  diabetes          2.237                     2     1.118         3.677     .026
  height            111.378                   1     111.378       366.168   .000
  Error             241.817                   795   .304
  Total             3773.779                  799
  Corrected Total   356.434                   798

  a R Squared = .322 (Adjusted R Squared = .319)

The overall test for the significance of diabetes is displayed (p-value = 0.026).

Parameter Estimates
Dependent Variable: fev1

                                                        95% Confidence Interval
  Parameter         B        Std. Error   t         Sig.   Lower Bound   Upper Bound
  Intercept         -4.392   .337         -13.025   .000   -5.054        -3.730
  [diabetes=1.00]   .126     .049         2.549     .011   .029          .223
  [diabetes=2.00]   .046     .056         .830      .407   -.063         .156
  [diabetes=3.00]   0(a)     .            .         .      .             .
  height            .039     .002         19.136    .000   .035          .043

  a This parameter is set to zero because it is redundant.

This table displays the usual linear regression results. In this example diabetes = 3 (diabetic) is the reference group.

Example. Adding an interaction between diabetes status and height in the regression model

To add an interaction between two variables, set Build Term(s) to show Interaction, select two variables under Factors & Covariates and then select the arrow under Build Term(s).

Univariate Analysis of Variance

Tests of Between-Subjects Effects
Dependent Variable: fev1

  Source              Type III Sum of Squares   df    Mean Square   F         Sig.
  Corrected Model     114.946(a)                5     22.989        75.492    .000
  Intercept           42.741                    1     42.741        140.354   .000
  diabetes            .272                      2     .136          .447      .639
  height              94.349                    1     94.349        309.823   .000
  diabetes * height   .328                      2     .164          .539      .583
  Error               241.488                   793   .305
  Total               3773.779                  799
  Corrected Total     356.434                   798

  a R Squared = .322 (Adjusted R Squared = .318)

This table displays the significance of the diabetes status by height interaction (p-value = 0.58).

Parameter Estimates
Dependent Variable: fev1

  Parameter                  B        Std. Error   t        Sig.
  Intercept                  -4.373   .673         -6.498   .000
  [diabetes=1.00]            -.168    .818         -.206    .837
  [diabetes=2.00]            .614     .963         .637     .524
  [diabetes=3.00]            0(a)     .            .        .
  height                     .039     .004         9.506    .000
  [diabetes=1.00] * height   .002     .005         .361     .719
  [diabetes=2.00] * height   -.003    .006         -.593    .553
  [diabetes=3.00] * height   0(a)     .            .        .

  a This parameter is set to zero because it is redundant.

This table displays the usual linear regression results, which include the results for diabetes status, height and the interaction between diabetes status and height.

Logistic Regression

1. Choose Analyze on the menu bar
2. Choose Regression
3. Choose Binary Logistic...
4. Dependent: Select the dependent variable from the source list on the left and then click on the arrow next to the dependent variable box.
5. Covariate(s): Select the independent variable and then click on the arrow next to the Covariate(s) box. Repeat the process until you have selected all the independent variables you want.
6. Choose Enter as the Method. Enter is the default method for independent variable entry.
Other methods of variable entry can be selected by clicking on the down arrow and clicking on the desired method of entry.
7. Choose OK

Additional options are available under >a*>b, Categorical..., Save..., Method, or Options... For example:

>a*>b (for adding two-way interactions)

You can add an interaction between two independent variables to the regression model by selecting two variables from the source list on the left (hold down the Ctrl key while selecting the two variables) and then clicking on >a*>b (after you highlight two variables from the source list on the left, the >a*>b button should be available to select).

Categorical...

You can use the categorical option to have SPSS create indicator or dummy variables for categorical variables.

1. Choose Categorical
2. Categorical Covariates: Select a covariate that is categorical and then click on the arrow next to the Covariates box.
3. Choose Indicator as the Contrast: Indicator is the default method for creating indicator variables. Other methods can be selected by clicking on the down arrow and clicking on the desired method.
4. Choose the reference category as the last category (i.e., the category with the largest numeric coding value) or the first category (i.e., the category with the smallest numeric coding value).
5. Choose Change.
6. Repeat steps 2 through 5 until you have defined all categorical variables.
7. Choose Continue.

Save...

Predicted Values (Probabilities and Group Membership). This option creates new variables that are the predicted probabilities and the predicted group membership. The predicted group membership (0 or 1) is based on whether the predicted probability is less than (group membership=0) or greater than or equal to (group membership=1) the classification cutoff. By default the classification cutoff value is 0.5. You can change the cutoff value using Options...
Residuals (Unstandardized, Logit, Studentized, Standardized, Deviance)

Influence (Cook's, leverage, dfBeta)

Note that SPSS creates a new variable for each selected Save... option and adds the new variables to the data file. The variable names are defined in the Viewer window. Once you are done using these variables you may want to delete them from the data file or save them (by re-saving the data file).

Method…

Click on the down arrow to the right of Method to display the methods available for independent variable entry (enter, forward:conditional, forward:LR, forward:Wald, backward:conditional, backward:LR, backward:Wald).

Options...

Confidence interval for odds ratio (CI for exp(B))
Hosmer-Lemeshow goodness-of-fit
You can modify the entry and removal criteria used by the backward and forward variable entry methods.

Previous, Block # of #, Next

You can use these options to enter independent variables in blocks into the regression model. You can select different methods of variable entry for each block.

Example. Logistic regression will be used to determine the relationship between any use of health services (coded 0 = no use, 1 = any use) and age, health index, gender and race. Subjects in the study (Model Cities Data Set) were followed for a varying amount of time, so the number of months followed (expos) will also be included as an independent variable in the logistic regression model.

The dependent variable, anyuse, is binary. There are 5 independent variables. Female and Race are categorical/nominal variables.

You can use the Categorical option to define which variables are categorical and SPSS will create the indicator variables. By default the category with the largest numerical value (last) will be the reference group. Here, the category with the smallest numerical value was selected as the reference group.

Under Options you can select to have the 95% confidence intervals for the odds ratios displayed in the output.
Also, you can run the Hosmer-Lemeshow goodness-of-fit test.

Logistic Regression

Case Processing Summary

  Unweighted Cases(a)                        N      Percent
  Selected Cases     Included in Analysis    3199   73.1
                     Missing Cases           1175   26.9
                     Total                   4374   100.0
  Unselected Cases                           0      .0
  Total                                      4374   100.0

  a If weight is in effect, see classification table for the total number of cases.

This table gives information on the number of observations used in the logistic regression. Subjects with missing data are excluded.

Dependent Variable Encoding

  Original Value   Internal Value
  .00              0
  1.00             1

SPSS will always recode the dependent variable to a 0 or 1 binary variable (internal value), and will estimate the odds ratio for the event coded as 1 (vs the event coded as 0). If your dependent variable is not coded 0 or 1, check this table to determine the interpretation of the odds ratios.

Categorical Variables Codings

                                 Parameter coding
                     Frequency   (1)     (2)
  race     white     497         .000    .000
           other     455         1.000   .000
           black     2247        .000    1.000
  female   male      1450        .000
           female    1749        1.000

This table gives the definition of the indicator variables. E.g.,
  race(1) = other
  race(2) = black
  (race = white is the reference group)
  female(1) = female
  (male is the reference group)

Caution! Make sure you understand the interpretation of the indicator variables that SPSS creates. It is very easy to get confused. In this example the variable race is coded 1=white, 2=other, 3=black. A common mistake would be to interpret race(1) = white and race(2) = other.

Block 0: Beginning Block

Ignore all the output under Block 0. The output displays information for the logistic regression model with no independent variables in the model.

Block 1: Method = Enter

Omnibus Tests of Model Coefficients

                    Chi-square   df   Sig.
  Step 1   Step     301.534      6    .000
           Block    301.534      6    .000
           Model    301.534      6    .000

Unless you are using stepwise methods to enter variables or entering variables in different blocks you can ignore this output.
Model Summary

  Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
  1      2609.415(a)         .090                   .151

  a Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

These are “R-square” measures for logistic regression, which are usually not very useful.

Classification Table(a)

                                 Predicted
                                 anyuse            Percentage
  Observed                       .00      1.00     correct
  Step 1   anyuse   .00          0        542      .0
                    1.00         0        2657     100.0
           Overall percentage                      83.1

  a The cut value is .500

Ignore this table also. It is describing how the logistic regression predicts any use if a predicted probability > 0.5 is used to indicate any use. All subjects are predicted to have use.

Hosmer and Lemeshow Test

  Step   Chi-square   df   Sig.
  1      8.368        8    .398

Contingency Table for Hosmer and Lemeshow Test

                anyuse = .00           anyuse = 1.00
                Observed   Expected    Observed   Expected    Total
  Step 1   1    124        123.653     197        197.347     321
           2    101        97.310      218        221.690     319
           3    79         81.589      241        238.411     320
           4    73         67.769      248        253.231     321
           5    57         54.600      263        265.400     320
           6    33         41.820      287        278.180     320
           7    32         29.724      288        290.276     320
           8    16         21.258      304        298.742     320
           9    13         15.538      307        304.462     320
           10   14         8.740       304        309.260     318

The Hosmer-Lemeshow goodness-of-fit statistic is formed by grouping the data into g groups (usually g=10) based on the percentiles of the estimated probabilities and calculating the Pearson chi-square statistic from the 2 x g table of observed and estimated expected frequencies. A small p-value indicates a lack of fit. Large differences between the observed and expected values can be used to help identify where there is lack-of-fit when present.

The last table of the output usually has the results we are most interested in. It lists the odds ratios, p-values and 95% confidence intervals for the odds ratios.

Variables in the Equation

                                                                     95.0% C.I. for EXP(B)
                        B       S.E.   Wald      df   Sig.   Exp(B)   Lower   Upper
  Step 1(a)  expos      .077    .006   167.398   1    .000   1.080    1.068   1.093
             age        .009    .003   8.118     1    .004   1.009    1.003   1.016
             female(1)  .501    .099   25.363    1    .000   1.650    1.358   2.005
             race                      12.715    2    .002
             race(1)    -.424   .190   4.964     1    .026   .655     .451    .950
             race(2)    -.530   .149   12.689    1    .000   .588     .440    .788
             health     .048    .010   23.603    1    .000   1.049    1.029   1.070
             Constant   -.337   .196   2.958     1    .085   .714

  a Variable(s) entered on step 1: expos, age, female, race, health.

Exp(B) = Odds Ratio
95.0% C.I. for EXP(B) = 95% confidence interval for the odds ratio
Sig. = P-value for the individual odds ratio, or the overall significance of a categorical/nominal variable if there is no Exp(B) listed.
B = the logistic regression coefficient, the log odds ratio
S.E. = the standard error of the logistic regression coefficient
Wald = the Wald test statistic for testing if B=0 (or equivalently odds ratio = 1), or if all B’s = 0 for a categorical variable with more than 2 categories.
d.f. = degrees of freedom of the test statistic.

It is often helpful to write on your output the definition of the indicator variables, so you don’t get confused about the interpretation of the results. It is also helpful to change Exp(B) to odds ratio, and Sig. to P-value.

                                             95.0% C.I. for odds ratio
                                Odds Ratio   Lower   Upper   P-value
  Step 1(a)  expos              1.080        1.068   1.093   .000
             age                1.009        1.003   1.016   .004
             female (vs male)   1.650        1.358   2.005   .000
             race                                            .002
             other vs white     .655         .451    .950    .026
             black vs white     .588         .440    .788    .000
             health             1.049        1.029   1.070   .000
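Several quantities in the logistic regression output can be recomputed by hand as a check on your reading of the tables. The odds ratio is exp(B), an approximate 95% confidence interval is exp(B ± 1.96 x S.E.), the Wald statistic for a single coefficient is (B / S.E.) squared, and the Hosmer-Lemeshow statistic is the Pearson chi-square over the observed and expected counts. The sketch below is plain Python rather than SPSS syntax; the function names are illustrative, and because the inputs are the rounded values printed in the output, the results can differ from SPSS in the last digit or two.

```python
import math


def odds_ratio_ci(b, se, z=1.96):
    """Odds ratio and approximate 95% CI from a logit coefficient and its S.E."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


def wald_statistic(b, se):
    """Wald chi-square (1 df) for testing B = 0."""
    return (b / se) ** 2


def hosmer_lemeshow(observed0, expected0, observed1, expected1):
    """Pearson chi-square over the 2 x g table of observed and expected counts."""
    cells = list(zip(observed0, expected0)) + list(zip(observed1, expected1))
    return sum((o - e) ** 2 / e for o, e in cells)


# B and S.E. for female(1), copied from the Variables in the Equation table.
or_female, lower, upper = odds_ratio_ci(0.501, 0.099)
wald_female = wald_statistic(0.501, 0.099)

# Counts copied from the Contingency Table for Hosmer and Lemeshow Test.
obs0 = [124, 101, 79, 73, 57, 33, 32, 16, 13, 14]
exp0 = [123.653, 97.310, 81.589, 67.769, 54.600,
        41.820, 29.724, 21.258, 15.538, 8.740]
obs1 = [197, 218, 241, 248, 263, 287, 288, 304, 307, 304]
exp1 = [197.347, 221.690, 238.411, 253.231, 265.400,
        278.180, 290.276, 298.742, 304.462, 309.260]
hl_chi_square = hosmer_lemeshow(obs0, exp0, obs1, exp1)

print(f"female vs male: OR = {or_female:.3f}, 95% CI {lower:.3f} to {upper:.3f}")
print(f"Wald = {wald_female:.2f}")
print(f"Hosmer-Lemeshow chi-square = {hl_chi_square:.3f}")
```

The recomputed odds ratio and Hosmer-Lemeshow statistic should agree closely with the 1.650 and 8.368 in the output, while the confidence limits and Wald statistic may be off slightly because B and S.E. are displayed to only three decimals.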
