JMP Tutorial 3: Methods for Analyzing Categorical Data
Data Files: Homicide_season.JMP Preeclampsia.JMP DrugOffers.JMP Seatbelt_injury.JMP
Background: In this handout, we will investigate four case studies that illustrate the following concepts: Goodness-of-Fit Tests, Comparing Two Population Proportions (Fisher’s Exact Test), Tests of Homogeneity, and Tests of Independence.

Learning Outcomes:
1. To understand when each of these four tests is appropriate.
2. To use JMP to carry out each of these four tests (this includes drawing appropriate conclusions based on the JMP output).
GOODNESS-OF-FIT TESTS

Example: "Is There a Season for Homicide?"

Criminologists have long debated whether there is a relationship between weather and violent crime. The author of the paper "Is There a Season for Homicide?" (Criminology (1988), p. 287-296) classified 1361 homicides according to season, resulting in the data found in the file Homicide_season.JMP. Look at this file (shown below), and note how the data are set up in JMP. There are two columns: one holds the categorical variable season, and the other holds the number of homicides in the sample that happened during the corresponding season. Make sure that JMP is interpreting the values of the second column as frequencies!
Research Question: Do these data support the theory that the homicide rate differs over the four seasons? To investigate this question, select Analyze > Distribution. Place Season in the Y, Columns box. JMP returns the following:
Fill in the relative frequencies in the table below:

Season    Observed Number    Observed Proportion
Winter    328
Spring    334
Summer    372
Fall      327
Total     1361

Questions:
1. Which category has the highest observed percentage? How about the lowest observed percentage?
2. Is it likely that these counts (and percentages) will change if another sample of homicides is taken?
3. If the homicide rate is in truth the same across all seasons, what is the expected proportion for each group? Explain.
Traditionally, the Goodness-of-Fit test is used to investigate a research question such as this. To perform this test, we first must convert the research question into a set of relevant hypotheses:

H0: The homicide rate does NOT differ across seasons. That is, the proportions of homicides that occur during each season are the same.
Ha: The homicide rate differs across seasons.

Alternatively, we may state the hypotheses as follows:

H0:
Ha:
To carry out this test in JMP, select Test Probabilities from the red drop-down menu next to Season.
The following table will then appear in the output window. In the column labeled Hypoth Prob, we need to enter the hypothesized values for the proportion of homicides occurring during each season (.25 for all in this case).
When you are finished entering the hypothesized probabilities, click the Done button and the following test information will be displayed.
The p-value for the Pearson Chi-Square Goodness-of-Fit test is .2578. This is greater than 5%, so we fail to reject the null hypothesis, which states that the proportions of homicides that occur during each season are the same. In other words, we do NOT have enough evidence to conclude that the homicide rate differs across seasons.

Details Behind the Pearson Chi-Square Goodness-of-Fit Test

Season    Observed Number    Observed Proportion    Expected Number
Winter    328
Spring    334
Summer    372
Fall      327
Total     1361

The chi-square test statistic is given by:

$$\text{Test Statistic} = \sum \frac{(\text{Observed Number} - \text{Expected Number})^2}{\text{Expected Number}}$$

where the sum runs over all categories. This follows a chi-square distribution with df = c - 1 (where c represents the number of categories).
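The calculation behind this test can also be checked outside JMP. The following Python sketch is only an illustration and is not part of the JMP workflow: the helper name chi2_upper_tail and the variable names are made up for this example. It computes the expected counts under H0, the test statistic, and the p-value, using a standard closed-form expression for the upper tail of a chi-square distribution with whole-number degrees of freedom.

```python
from math import erfc, exp, gamma, sqrt


def chi2_upper_tail(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        return exp(-half) * sum(half ** k / gamma(k + 1) for k in range(df // 2))
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{k=0}^{(df-3)/2} (x/2)^(k+1/2) / Gamma(k+3/2)
    series = sum(half ** (k + 0.5) / gamma(k + 1.5) for k in range((df - 1) // 2))
    return erfc(sqrt(half)) + exp(-half) * series


# Observed homicide counts by season, taken from the handout.
observed = {"Winter": 328, "Spring": 334, "Summer": 372, "Fall": 327}
# Hypothesized probabilities under H0: the same proportion in every season.
hypothesized = {season: 0.25 for season in observed}

n = sum(observed.values())
expected = {season: n * p for season, p in hypothesized.items()}

test_statistic = sum(
    (observed[s] - expected[s]) ** 2 / expected[s] for s in observed
)
df = len(observed) - 1  # c - 1, where c is the number of categories
p_value = chi2_upper_tail(test_statistic, df)

print(f"Expected count per season: {expected['Winter']:.2f}")
print(f"Chi-square = {test_statistic:.4f}, df = {df}, p-value = {p_value:.4f}")
```

With the counts above, each expected number is 1361 × 0.25 = 340.25, the test statistic is about 4.03 on 3 degrees of freedom, and the p-value agrees with the .2578 reported by JMP.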
COMPARING TWO POPULATION PROPORTIONS (FISHER’S EXACT TEST)

Example: "Preeclampsia Associated with Autoimmune Disease"

These data come from a case-control study conducted at Gundersen Lutheran Medical Center in La Crosse, WI, to examine the potential relationship between preeclampsia and autoimmune disease in pregnant women. In a case-control study, we take a random sample of cases (people with the disease in question) and controls (people similar to those in the case group except for the fact that they do not have the disease). Ultimately, the proportions of people with some potential risk factor are compared across the two groups.

In this study, researchers suspected that having an autoimmune disorder is related to an increased risk of developing preeclampsia during pregnancy. Therefore, we will be comparing the proportion of women in the Case Group who had an autoimmune disorder to the proportion of women in the Control Group who had an autoimmune disorder. The data are presented in the table below:

                       Autoimmune Disease    No Autoimmune Disease
Preeclampsia (Case)    18                    367
Control                7                     378
The data can also be found in the file Preeclampsia.JMP:
The contingency table in JMP is obtained by selecting Analyze > Fit Y by X. Put Disease in the X box and Risk Factor in the Y box. The resulting mosaic plot and contingency table are shown below.
By selecting to view only the Row %’s, we can compare the following:

Proportion of women in the Case Group with Autoimmune Disorder =
Proportion of women in the Control Group with Autoimmune Disorder =

It certainly appears that the proportion of women with the potential risk factor is higher in the Preeclampsia (Case) group. To test this formally, we can use the results of Fisher's Exact Test, which is included in the JMP output below the contingency table.
Research Question: Is the proportion of patients with an autoimmune disorder greater for those who have preeclampsia than for those who do not have preeclampsia?

We can convert this question to the following set of hypotheses:

H0: The proportion of women with autoimmune disorder is the same for those who have preeclampsia as for those who do not have preeclampsia.
Ha: The proportion of women with autoimmune disorder is greater for those who have preeclampsia than for those who do not.
The three p-values (Prob values) given are for testing the following:

1. Left, p-value = .9934 is for testing if the proportion of women with the potential risk factor is larger for the control group. The fact that this p-value is NOT significant (i.e., the p-value is greater than 5%) suggests that there is NO evidence that the proportion of women with autoimmune disease is larger for the control group.
2. Right, p-value = .02 is for testing if the proportion of women with the potential risk factor is larger for the preeclampsia (case) group. The fact that this p-value is significant (i.e., less than 5%) suggests that the proportion of women with autoimmune disease is larger for the case group. This was the research hypothesis for the doctors who conducted this study.
3. 2-Tail, p-value = .04 is for testing if the proportion of women with the potential risk factor differs between the two groups. The fact that this p-value is significant suggests that the proportion of women having an autoimmune disorder differs between the cases and controls.
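To see where these three p-values come from, the test can be sketched by hand. With the row and column totals held fixed, the number of cases who have an autoimmune disease follows a hypergeometric distribution, and each p-value is a sum of those probabilities. The Python below is an illustrative sketch only: the function name fisher_exact_2x2 is made up for this example, and the two-tail rule it uses (add up every table no more likely than the observed one) is a common convention that is assumed here to match the JMP output.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Return (left, right, two_tail) p-values for the table [[a, b], [c, d]].

    The tails refer to the count in the top-left cell (a).
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)
    lowest = max(0, col1 - row2)
    highest = min(col1, row1)
    # Hypergeometric probability of each possible top-left count,
    # holding the row and column totals fixed.
    pmf = {
        k: comb(row1, k) * comb(row2, col1 - k) / total_tables
        for k in range(lowest, highest + 1)
    }
    left = sum(p for k, p in pmf.items() if k <= a)
    right = sum(p for k, p in pmf.items() if k >= a)
    # Two-tail: add up every table that is no more likely than the observed one
    # (the small tolerance guards against floating-point ties).
    cutoff = pmf[a] * (1 + 1e-9)
    two_tail = sum(p for p in pmf.values() if p <= cutoff)
    return left, right, min(two_tail, 1.0)


# Counts from the handout: rows are Case and Control,
# columns are Autoimmune Disease and No Autoimmune Disease.
left_p, right_p, two_tail_p = fisher_exact_2x2(18, 367, 7, 378)
print(f"Left = {left_p:.4f}, Right = {right_p:.4f}, 2-Tail = {two_tail_p:.4f}")
```

Because the case and control groups are the same size (385 women each), the distribution is symmetric, so the two-tail p-value in this sketch is exactly twice the right-tail p-value.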
TESTS OF HOMOGENEITY

Example: "Youth Risk Behavior Surveillance System: Comparing Drug Offers/Sales Across Grades"

Recall the data from the Youth Risk Behavior survey (see Tutorial 1). Here, we will focus on only two variables: Grade Level and Whether Someone Has Offered/Sold/Given You an Illegal Drug on School Property (in the last 12 months). The data are in the file DrugOffers.JMP:

Research Question: Is there evidence to suggest that the proportion of students who were offered/sold/given an illegal drug on school property in the last 12 months differs across grade level?

We can convert this question to the following hypotheses:

H0: The proportion of students who were offered/sold/given an illegal drug on school property DOES NOT DIFFER across grade level.
Ha: The proportion of students who were offered/sold/given an illegal drug on school property DIFFERS across grade level.
To begin our analysis, select Fit Y by X from the Analyze menu and place Grade in the X, Factor box and Offered Drugs in the Y, Response box. The resulting output is shown below:
From the Pearson Chi-Square Test p-value (p = .0243), we can see that we have evidence that the proportion of students who were offered/sold/given an illegal drug on school property in the last 12 months does in fact differ across grade level. Question: Can you describe where the differences exist?
Details Behind the Pearson Chi-Square Test of Homogeneity or Independence

[Tables: The Observed Counts; The Expected Counts]

The chi-square test statistic is given by:

$$\text{Test Statistic} = \sum \frac{(\text{Observed Number} - \text{Expected Number})^2}{\text{Expected Number}}$$

where the sum runs over all cells of the contingency table. This follows a chi-square distribution with df = (r - 1)(c - 1) (where r = # of rows and c = # of columns).
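The expected counts and the test statistic for a two-way table can be sketched in a few lines of Python. Since the drug-offer counts are not reproduced in the text of this handout, the sketch below reuses the preeclampsia table from the previous section purely as an illustration; the variable names are made up for this example. It relies on the standard rule that, under the null hypothesis, the expected count in each cell is (row total × column total) / grand total.

```python
# Observed counts from the preeclampsia example in this handout.
# Columns: Autoimmune Disease, No Autoimmune Disease.
observed = [
    [18, 367],  # Preeclampsia (Case)
    [7, 378],   # Control
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Sum (observed - expected)^2 / expected over every cell of the table.
test_statistic = sum(
    (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print("Expected counts:", expected)
print(f"Chi-square = {test_statistic:.4f}, df = {df}")
```

The same steps apply to larger tables, such as grade level by drug offer, once the observed counts are read from the JMP contingency table.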
TESTS OF INDEPENDENCE

Example: "Seatbelt Use and the Extent of Injury Sustained in an Automobile Accident"

These data (in the file Seatbelt_injury.JMP) come from a study to examine the relationship between the extent of injuries in an automobile accident and use of seatbelts. In particular, individuals were classified as sustaining no injuries, minor injuries, major injuries, or death. Their use of seatbelts was also recorded as none, lap belt only, or lap belt and shoulder harness.

Research Question: Is Seatbelt use dependent on Extent of Injury Sustained?

To investigate this question, we set up the following hypotheses:

H0: Seatbelt use is INDEPENDENT of Extent of Injury Sustained.
Ha: Seatbelt use is NOT INDEPENDENT of Extent of Injury Sustained.

To formally test whether seat belt use and extent of injury sustained are independent, we will use Pearson's Chi-Square Test of Independence. To obtain the results of the Chi-Square test for these data, select Analyze > Fit Y by X and place Belt Use in the X, Factor box and Injury in the Y, Response box. JMP returns the mosaic plot and contingency table:
The following test results are also given by JMP whenever we examine the relationship between two categorical variables.
The p-value = .0674, which is significant at the α = 10% level (though not at the 5% level). Thus, there is sufficient evidence at the α = 10% level (p = .0674) to conclude that seat belt use and injury sustained in an automobile accident are not independent. Question: Can you describe the nature of the relationship between the two variables? That is, can you describe where the differences exist?