POSTED ON: 3/26/2010
DATA ANALYSIS WORKBOOK
LAB 10
SIMPLE ANALYSIS OF CONTINGENCY TABLES AND THE CHI-SQUARE TEST OF INDEPENDENCE

OVERVIEW

A contingency table consists of two or more columns and two or more rows. In accord with the convention established in previous chapters, the columns represent the values of the independent variable, and the rows represent the values of the dependent variable. We present three examples in this chapter: the relation between fear of crime and gender, the relation between approval/disapproval of legal abortion and marital status, and the relation between approval/disapproval of homosexuality and marital status. Table 1 contains the information on these variables. Fear of crime and gender are both dichotomous. Approval/disapproval of abortion is dichotomous, while marital status is nominal and polytomous. Finally, approval/disapproval of homosexuality is polytomous and ordinal, but we use just the nominal information.

Table 1. Values of the Dependent and Independent Variables Used to Study the Relations between Fear of Crime and Gender, Approval/Disapproval of Abortion and Marital Status, and Approval/Disapproval of Homosexuality and Marital Status

Dependent Variables                                 Value   Label
FEAR (Afraid to walk alone at night)                1       Yes
                                                    2       No
ABANY (Approval/Disapproval of Legal Abortion)      1       Yes
                                                    2       No
HOMOSEX (Approval/Disapproval of Homosexuality)     1       Wrong
                                                    2       Mostly Wrong
                                                    3       Mostly Right
                                                    4       Not Wrong

Independent Variables                               Value   Label
SEX                                                 1       Male
                                                    2       Female
MARITAL (Marital Status)                            1       Married
                                                    2       Widowed
                                                    3       Divorced
                                                    4       Separated
                                                    5       Never Married

As Table 1 suggests, we use a contingency table to display the relationship between two discrete or categorical variables that are either measured at the nominal level or treated as nominal. Since the dependent variable is measured at the nominal level (unlike the analyses described in previous chapters), we work with conditional percentages or proportions rather than conditional means.
We use the differences between the conditional percentages to describe the form of the relation. We use the chi-square test of independence to test the null hypothesis that the form of the relation is zero. Statisticians have developed many measures of the strength of the relation in a contingency table. Since we believe that they add little or no information to that provided by the form, we do not discuss them here.

STATISTICS AND DATA ANALYSIS

Concepts: form, cell frequencies, marginal distributions, conditional percentages, independence, expected and observed frequencies, chi-square test, degrees of freedom

The Information in a Contingency Table

The box below contains the computer output from an SPSS contingency table analysis of the relation between fear of crime and the respondent’s gender. After describing this output, we show how to present the results in a paper.

FEAR * SEX Crosstabulation

                                    SEX
                               Male     Female    Total
FEAR    Yes   Count             141        580      721
              % within SEX     18.2%     56.3%    40.0%
        No    Count             632        451     1083
              % within SEX     81.8%     43.7%    60.0%
Total         Count             773       1031     1804
              % within SEX    100.0%    100.0%   100.0%

An important feature of a contingency table is its size. The size of a contingency table depends on the number of rows and columns it contains. We use the letters “I” and “J” to refer to the number of rows and columns. We refer to a table with I rows and J columns as an I by J table. All contingency tables contain at least two rows and two columns. The variables FEAR and SEX are both dichotomous; therefore we refer to the contingency table in this box as a “two-by-two” (2x2) table. The columns of the table represent the values of gender: “male” and “female.” The rows represent the responses to the question “Are you afraid to walk alone at night in your neighborhood?”: “yes” and “no.” The number in the upper-left-hand cell of the table, 141, equals the number of respondents who are men and are afraid of crime.
The number 632 equals the number of respondents who are men and do not fear crime. We refer to these numbers as the cell frequencies. The numbers below the cell frequencies are conditional percentages. They are conditional because of the way we compute them. The percentage 18.2% equals the percentage of men who are afraid of crime. To compute this percentage, we divide the cell frequency 141 by the number of men (773) and multiply the ratio by 100. The percentage of women who fear crime equals 56.3%. We use these percentages to compute the form of the relation. Because each set of conditional percentages sums to 100%, the remaining percentages provide no additional information. For example, the percentage of men who do not fear crime is 81.8% (100% - 18.2% = 81.8%), and the percentage of women who do not fear crime is 43.7% (100% - 56.3% = 43.7%).

The output somewhat obscures the relation between fear and sex because of other information in the output. This additional information consists of the number of respondents used in the analysis (1804) and the marginal distributions of fear and sex, all in columns and rows labeled “total.” The marginal distributions in a contingency table give the univariate (unconditional) distributions of the dependent and independent variables. (We refer to them as “marginal” because they appear on the margins of the table.) For example, we see that 721 and 1083 respondents (40.0% and 60.0%) are afraid and not afraid, respectively, to walk alone at night. We also see that 773 and 1031 respondents are men and women, respectively.1 We use these marginal distributions to calculate the “expected frequencies” (described below) in testing the null hypothesis of independence. Table 2 presents the contingency table in a more compact form, one that we hope is easier to read.
It contains just the conditional percentages of those afraid to walk alone at night plus the number of men and women on which those percentages are based. You can use just this information to construct the full table in the box above.2 Note that you could instead focus on the percentage unafraid of crime in constructing Table 2. The choice of the focal value of the dependent variable is a substantive one, not a statistical one. You should pay attention to the size of the column totals. Because the column total is the base, or denominator, used to compute the percentages, the statistical reliability of a percentage decreases as the size of the column total decreases. One rule of thumb that researchers use is to ignore, or at least treat very cautiously, percentages based on fewer than 25 cases.

Table 2. Fear of Crime by Sex

                                            Sex of Respondent
                                            Male      Female
Per Cent Afraid to Walk Alone at Night      18.2%     56.3%
n                                           773       1031

1 SPSS reports the percentages associated with the column totals of men and women as 100%. The reason is the decision to condition on the column totals when computing the cell percentages. That is, 18.2% plus 81.8% and 56.3% plus 43.7% both sum to 100%. Focusing on the marginal distribution of sex, we can convert the numbers of men and women into percentages (42.8% and 57.2%, respectively) by dividing 773 and 1031 by 1,804 and multiplying each proportion by 100.

2 You obtain the number of men and women who fear crime by multiplying each column total by the conditional percentages converted to proportions. You subtract these frequencies from the column totals to obtain the number of men and women who do not fear crime. You add across the columns to obtain the marginal distribution of FEAR. Finally, you obtain the total number of respondents by summing either the number of men and women or the number who fear and the number who do not fear crime.

Computing The Form of the Relationship for a Contingency Table

The way we compute the form of the relationship in a contingency table differs slightly according to the dimensions of the table and the level at which the variables are measured. We illustrate this point by describing three examples: a 2x2 table, a 2x5 table, and a 4x5 table.

In calculating the form of the relation in a 2x2 table, we begin by choosing the row that corresponds to a particular value of the dependent variable, for example, the “yes” response to the fear of crime question. Having made this choice, we follow the convention (established in previous labs) and subtract the percentage in the first column from the percentage in the second column. In the case of Table 2, for example, we subtract 18.2% from 56.3%. As equation 1 shows, the form is 38.1%. We adopt the somewhat idiosyncratic convention of using the symbol “$d_{yx}$” to refer to the form. (The “d” stands for “difference,” but its similarity to the “b” used to represent the slope in a regression analysis is convenient.)

(1) form: $d_{yx} = 56.3\% - 18.2\% = 38.1\%$

Figure 1 graphs this relation. It shows that the percentage who fear crime increases as the “value” of sex “increases” from men to women. As in the case of computing male - female differences in means, the sign of the form reflects the arbitrary coding of males and females as 1 and 2, respectively. Having made this assignment, however, we treat these values as if the order were real. Note that in this and the subsequent graphs, the vertical axis contains the entire range of percentages from 0 to 100%. You should do this when you draw your graphs by hand. Some computer packages, e.g., SPSS, do not do this. Instead, they use a reduced range. The problem with this procedure is that the resulting graph can exaggerate the size of the relation.3

Table 3 contains the relation between the approval/disapproval of legal abortion for “any” reason and marital status. The abortion item is dichotomous.
The respondent had only two choices, “yes” or “no,” in answering the following question: “Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion for any reason she wants?” In contrast, marital status is a nominal, polytomous variable with five categories: married, widowed, divorced, separated, and never married. We constructed Table 3 to focus on the conditional percentage who approve of abortion. (The percentage who disapprove of legal abortion is redundant since all five sets of conditional percentages sum to 100%. Consequently, we do not report this information.)

[Figure 1. Graph of the Relation between Fear of Crime and Gender. Vertical axis: Per Cent Who Fear Crime (0% to 100%). Horizontal axis: Sex (1 = Men, 2 = Women).]

3 Huff referred to such graphs as “gee whiz” graphs in How to Lie with Statistics, a useful book on statistics that was popular in the 1950’s and 1960’s.

Because marital status is nominal, we describe the pattern of differences among the conditional percentages just as we describe the pattern of conditional means in an analysis of variance. (See Chapter 8.) Looking at Table 3, we see that the main contrast occurs between divorced and never married respondents versus the three other marital status categories. Approximately 46% of divorced respondents and 49% of never married respondents approve of legal abortion for any reason. In contrast, only 36%, 30%, and 35% of married, widowed, and separated respondents approve of legal abortion. This pattern is evident in Figure 2, which graphs the relation between abortion approval and marital status. The difference between these two sets of categories is roughly ten to twenty percentage points.

Table 3. Approval of Abortion by Marital Status

                           Marital Status
                           Married   Widowed   Divorced   Separated   Never Married
Approve Legal Abortion     36.3%     29.9%     45.5%      35.4%       48.8%
n                          868       201       213        82          365

[Figure 2. Graph of the Relation between Approval of Abortion and Marital Status. Vertical axis: Per Cent Who Approve of Abortion for Any Reason (0% to 100%). Horizontal axis: Marital Status (1 = Married, 2 = Widowed, 3 = Divorced, 4 = Separated, 5 = Never Married).]

In some contingency tables the categories of the independent variable may be ordinal or interval. In such cases, the analyst should see whether the conditional percentages either increase or decrease monotonically with the values of the independent variable and, if this is the case, include this information in his or her interpretation.4

Table 4 contains the final relation described in this chapter, the relation between attitude toward homosexuality and marital status. The NORC homosexuality item offers the respondent four answers to the following question: “What about sexual relations between two adults of the same sex--do you think it is always wrong, almost always wrong, wrong only sometimes, or not wrong at all?” Unlike the previous two tables, Table 4 contains the conditional percentage for each of the four choices. When the dependent variable in a contingency table contains more than two categories, the analyst can investigate between-row as well as between-column contrasts when describing the form of the relation. In the case of variables measured at the ordinal level or higher, however, we typically simplify the description by implicitly collapsing the dependent variable into two categories. We do this by focusing on either the top row, the bottom row, or a combination of either top or bottom rows that makes sense substantively. In addition, we make a choice that presents a relatively accurate picture of the relation between the two variables.
In the case of attitudes toward homosexuality, we could focus on: (1) the per cent who think it is always wrong, (2a) the per cent who think it is either always wrong or mostly wrong, (2b) the per cent who think it is either wrong only sometimes or not wrong at all, or (3) the per cent who think it is not wrong at all. In describing the form of the relation, we choose to focus on the percentage who say that homosexuality is always wrong, since this response occurs more frequently than the others and since the main contrast in the table involves this response.

Table 4. Attitude toward Homosexuality by Marital Status

Attitude toward        Marital Status
Homosexuality          Married   Widowed   Divorced   Separated   Never Married
Always Wrong           81.6%     83.4%     75.8%      80.0%       68.1%
Mostly Wrong            3.7%      4.4%      3.8%       0.0%        5.7%
Wrong Sometimes         5.2%      4.4%      5.7%       3.8%        8.7%
Not Wrong               9.5%      7.8%     14.7%      16.3%       17.4%
n                      886       205       211        80          367

4 In a strictly monotonic relation, each successive conditional percentage is either greater than all previous conditional percentages or less than all of them. In a simple monotonic relation each conditional percentage can also equal the previous conditional percentage(s).

Having made this choice, we see that the main contrast in Table 4 is between never married respondents and the rest, with divorced respondents falling about halfway between these two extremes. Whereas approximately 80% of married, widowed, and separated respondents say that homosexuality is always wrong, only 68% of the never married respondents hold this belief, and the 76% of divorced respondents fall in between. This pattern is apparent in Figure 3, which graphs the relation between marital status and the percentage who respond that homosexuality is always wrong.
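
As a recap of the 2x2 case discussed earlier in this section, the following minimal Python sketch recomputes the conditional percentages and the form of the relation (equation 1) from the cell frequencies in the FEAR * SEX crosstabulation. The names `observed`, `conditional_percentages`, `pct`, and `d_yx` are illustrative assumptions; they are not part of the SPSS output.

```python
# Cell frequencies from the FEAR * SEX crosstabulation.
# Outer keys are the columns (SEX); inner keys are the rows (FEAR).
observed = {
    "male": {"yes": 141, "no": 632},
    "female": {"yes": 580, "no": 451},
}


def conditional_percentages(column):
    """Return each row's percentage within one column, rounded to one decimal."""
    total = sum(column.values())
    return {row: round(100 * count / total, 1) for row, count in column.items()}


pct = {sex: conditional_percentages(cells) for sex, cells in observed.items()}

# Form of the relation for the focal row ("yes"): second column minus first column.
# As in equation 1, the difference is taken between the rounded percentages.
d_yx = round(pct["female"]["yes"] - pct["male"]["yes"], 1)

print(pct["male"])    # conditional percentages for men
print(pct["female"])  # conditional percentages for women
print(d_yx)           # form of the relation
```

Because the sketch follows equation 1 and subtracts the rounded percentages, it reproduces 38.1; subtracting the unrounded percentages and rounding afterwards would give about 38.0.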
[Figure 3. Graph of the Relation between Rejection of Homosexuality and Marital Status. Vertical axis: Per Cent Who Say Homosexuality is Always Wrong (0% to 100%). Horizontal axis: Marital Status (1 = Married, 2 = Widowed, 3 = Divorced, 4 = Separated, 5 = Never Married).]

TESTING THE NULL HYPOTHESIS OF STATISTICAL INDEPENDENCE

The typical null hypothesis tested in contingency analysis is the hypothesis of independence, or no relation, between the independent and dependent variables. The alternative hypothesis states that the two variables are related. As in analysis of variance, the hypothesis of independence is an omnibus test for most contingency tables, so that the distinction between a one- and two-tail test does not apply. Only in the case of a 2x2 table, which is analogous to the t-test of no difference between two population means or of a zero population slope in regression analysis, can the analyst choose between the two tests. To simplify this lab, we drop this distinction for all tables.

The null hypothesis implies that the conditional (column) percentages for any row in the contingency table will equal one another and, therefore, will equal the marginal (unconditional) distribution of the dependent variable. In the case of the relation between fear and gender, for example, the null hypothesis implies that the percentages of men, of women, and, therefore, of all people who fear crime are the same. The alternative hypothesis states that at least some of the column percentages for at least some of the rows differ. Equations 2a and 2b state the null and alternative hypotheses more formally. In this notation, the symbol “$\pi$” stands for the population proportion, and the letters i and j denote the rows and columns, respectively.

(2a) $H_0: \pi_{i|j} = \pi_{i\cdot}$, for all i and all j.
(2b) $H_1: \pi_{i|j} \neq \pi_{i\cdot}$, for at least some i and for at least some j.
Looking at either the SPSS crosstabulation or Table 2, however, we see that the percentage of women afraid of crime is much greater than the 40% of all people who fear crime, while the percentage of men who fear crime is much less. These data suggest that the null hypothesis is false, that is, that fear of crime and sex are related. Of course, there is always the possibility that the difference between men and women is due to chance. We need a statistical procedure to test this possibility.

We use the chi-square test of independence as the statistical procedure for testing the null hypothesis. In this procedure the analyst computes the squared differences between the observed cell frequencies and the cell frequencies he or she would expect if the null hypothesis were true. The larger the squared differences, the smaller the chance of obtaining the observed frequencies when the null hypothesis is true and, therefore, the stronger the grounds for rejecting the null hypothesis.

Calculating the Expected Frequencies

(3) $f_{eij} = f_{o \cdot j} \times \frac{f_{oi \cdot}}{n}$

Equation 3 contains the formula for calculating the expected frequencies for the hypothesis of independence. To obtain the expected frequency, $f_{eij}$, for the cell in the ith row and jth column, we multiply the observed total of the jth column, $f_{o \cdot j}$, by the observed proportion of cases in the ith row, $f_{oi \cdot}/n$. For example, we obtain 309 as the expected frequency of men who fear crime by multiplying 773, the number of men, by .40, the proportion of all respondents who fear crime (721/1804 = .40). (See the contingency table at the beginning of this chapter.)

Table 5. Observed and Expected Frequencies for the Relation between Fear of Crime and Sex

Afraid to Walk         Sex of Respondent            Row
Alone at Night         (1) Male     (2) Female      Totals
(1) Yes                141          580             721
                       (309)        (412)
(2) No                 632          451             1083
                       (464)        (619)
Column Totals          773          1031            1804

*Observed frequencies are on top; expected frequencies are in parentheses.
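
The expected-frequency calculation in equation 3 can be sketched in a few lines of Python. This is a minimal illustration using the observed FEAR by SEX cell frequencies; the variable names are assumptions chosen for readability.

```python
# Observed cell frequencies: rows are FEAR (yes, no), columns are SEX (male, female).
observed = [
    [141, 580],  # yes
    [632, 451],  # no
]

row_totals = [sum(row) for row in observed]           # marginal distribution of FEAR
column_totals = [sum(col) for col in zip(*observed)]  # marginal distribution of SEX
n = sum(row_totals)                                   # total number of respondents

# Equation 3: expected frequency = column total * (row total / n).
expected = [
    [column_total * row_total / n for column_total in column_totals]
    for row_total in row_totals
]

for row in expected:
    print([round(value, 2) for value in row])
```

Rounded to the nearest integer, these values match the expected frequencies shown in parentheses in Table 5 (309, 412, 464, and 619).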
Table 5 contains both the observed and expected frequencies for the relation between fear of crime and gender. We put the expected frequencies below the observed frequencies. We also printed the row and column totals so that you can calculate the expected frequencies yourself.5

5 The expected frequencies are rarely integers. SPSS, however, rounds them off to the nearest integer.

Computing the Chi-Square Statistic

The formula in equation 4a is for the chi-square statistic used to test the null hypothesis of independence. Equation 4b demonstrates this formula for the data from the contingency table introduced at the beginning of this chapter. For each cell the analyst divides the squared difference between the observed and expected frequencies by the expected frequency. When the null hypothesis is true, the squared differences between the observed and expected frequencies will be small relative to the expected frequency. When the null hypothesis is false, these quantities will be large. Although the statistic measures the goodness of fit between the observed and expected frequencies, statisticians refer to this statistic as “chi-square” ($\chi^2$), after the approximate sampling distribution of the goodness-of-fit statistic.

(4a) $\chi^2 = \sum \frac{(f_{oij} - f_{eij})^2}{f_{eij}}$

(4b) $\chi^2 = \frac{(141 - 309)^2}{309} + \frac{(632 - 464)^2}{464} + \frac{(580 - 412)^2}{412} + \frac{(451 - 619)^2}{619} = 266.09$

(The value 266.09 is based on the unrounded expected frequencies; the rounded values shown in equation 4b give approximately 266.27.)

To see whether the chi-square statistic in equation 4b (266.09) is significant, we evaluate it against the chi-square distribution. The shape of the chi-square distribution depends on the number of degrees of freedom. As shown in equation 5a, the number of degrees of freedom for the test of independence equals the product of the number of rows minus one (r - 1) times the number of columns minus one (c - 1).6 In the case of a 2x2 table, the number of degrees of freedom equals one. The chi-squares for Tables 2, 3, and 4 are 266.09, 28.55, and 39.64, respectively.
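
The calculation in equations 4a and 4b, together with the degrees of freedom, can be sketched in Python as follows. This is a minimal illustration rather than a replacement for the SPSS procedure: the function name `chi_square_independence` is hypothetical, and the critical value 6.635 (df = 1, .01 level) is a standard tabled value that is assumed here, not taken from the text.

```python
observed = [
    [141, 580],  # FEAR = yes: male, female
    [632, 451],  # FEAR = no:  male, female
]


def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, f_o in enumerate(row):
            f_e = column_totals[j] * row_totals[i] / n  # expected frequency
            statistic += (f_o - f_e) ** 2 / f_e         # one term of equation 4a
    df = (len(table) - 1) * (len(table[0]) - 1)         # (r - 1)(c - 1)
    return statistic, df


chi2, df = chi_square_independence(observed)

# Assumed tabled critical value of chi-square for df = 1 at the .01 level.
CRITICAL_VALUE_01_DF1 = 6.635

print(round(chi2, 2), df, chi2 > CRITICAL_VALUE_01_DF1)
```

The sketch uses unrounded expected frequencies, which is why it reproduces the reported 266.09 rather than the slightly larger value obtained from the rounded frequencies.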
The corresponding degrees of freedom for Tables 2, 3, and 4 are 1, 4, and 12. All chi-squares are significant at or well beyond the .01 level.

(5a) $df = rc - (r - 1) - (c - 1) - 1 = (r - 1)(c - 1)$
(5b) $df = (2 - 1)(2 - 1) = 1$

A final note of caution on the use of chi-square to test the null hypothesis of independence is that the distribution of the goodness-of-fit statistic is only approximately chi-square. The approximation becomes better as the size of the expected frequencies increases. Consequently, the approximation will be good for most large samples. A rule of thumb says that problems can occur when the expected frequency for one or more of the cells is less than five. This condition will occur (even, sometimes, in the case of large samples) when the marginal distribution of one or both variables is sufficiently skewed to produce a small expected frequency. In such cases, other procedures for testing the null hypothesis of independence are available, but we do not discuss them here.

DATA ANALYSIS EXAMPLE

Assume that a researcher wants to study the relation between fear of crime and sex. The first box below contains the research, null, and alternative hypotheses. The next box contains the information on the variables, values, and cases used in the analysis of the relation between fear of crime and sex.

6 Technically, the chi-square distribution is the distribution of a sample estimate of the population variance divided by the true population variance. The number of degrees of freedom equals the number of degrees of freedom associated with the sample variance (typically n - 1). In contingency table analysis, the degrees of freedom equal (r - 1)(c - 1), first, because the cell frequencies, rather than individual cases, constitute the observations. In a table with r rows and c columns, therefore, the product rc equals the total number of observations.
Second, the number of degrees of freedom equals the total number of observations minus the number of independent parameter estimates that constrain the calculation of the expected frequencies. As shown in equation 5a, one constraint occurs because the expected frequencies have to sum to the sample size. A second set of constraints occurs because we use r row proportions (which are sample estimates of the population distribution of the dependent variable) in calculating the expected frequencies. Only r - 1 are independent estimates, however, because the row proportions have to sum to 1. Finally, a third set of constraints occurs because we also use the c column totals in the calculation of the expected frequencies. (These column totals constitute c - 1 independent estimates of the population distribution of the independent variable.) You can gain an intuitive understanding of the concept of degrees of freedom in the test of independence by convincing yourself that you have to use equation 3 only (r - 1)(c - 1) times to calculate the expected frequencies for an r x c table. You obtain the expected frequencies for the remaining cells simply by subtracting the sum of the expected frequencies you have already obtained from the appropriate row and column totals.

Research Hypothesis: A person’s sex affects his or her fear of crime.
H0: Fear of crime and sex are independent.
H1: Fear of crime and sex are related.

Cases: all

                         Dependent Variable     Independent Variable
Index/Name               v78/Fear               v2/Sex
Description              Afraid to walk         Person’s Gender
Level of Measurement     nominal                nominal
Min. Code/Value          (1) yes                (1) male
Max. Code/Value          (2) no                 (2) female

Results

The final box (below) contains the computer output you will see when you test the null hypothesis of independence. This output accompanies the contingency table presented at the beginning of this chapter.
It contains a number of statistics that researchers can use to test the null hypothesis of independence. The one you will use, however, is the Pearson chi-square, which is highlighted.7

Chi-Square Tests

                                 Value      df    Asymp. Sig.    Exact Sig.    Exact Sig.
                                                  (2-sided)      (2-sided)     (1-sided)
Pearson Chi-Square               266.092    1     .000
Continuity Correction            264.509    1     .000
Likelihood Ratio                 280.267    1     .000
Fisher's Exact Test                                              .000          .000
Linear-by-Linear Association     265.944    1     .000
N of Valid Cases                 1804

a Computed only for a 2x2 table
b 0 cells (.0%) have expected count less than 5. The minimum expected count is 308.94.

7 Modern analyses of contingency tables typically use the likelihood-ratio chi-square rather than the Pearson chi-square. The reason they use this statistic is that it can be partitioned into different components (analogous to the partition of the sums of squares in an analysis of variance). In addition, they focus on “odds” rather than percentages or proportions. The odds of occurrence for some value equal the frequency of that value divided by the frequency of another value. The unconditional odds of occurrence for fear, for example, are 721/1083 = .666. In computing the form of the relationship, the analyst computes an odds ratio by dividing the conditional odds for one group (e.g., females) by the conditional odds for another group (e.g., males). The odds ratio for the relation between fear of crime and sex, for example, is (580/451)/(141/632) = 5.76. We would interpret this result by saying that the odds of being afraid of crime are about five and three-quarters times greater for women than for men.

Interpretation

This interpretation is based on information from both the contingency table presented at the beginning of this chapter and the highlighted information in the box above. I reject the null hypothesis that fear of crime and sex are independent (p < .01). Fifty-six per cent of the women, compared to 18 per cent of the men, report that they are afraid to walk in their neighborhood alone at night.
The difference in percentages, 38.1 percentage points, is statistically significant at the .01 level of significance. A possible explanation of this result is that women’s much greater vulnerability to rape makes them more fearful, particularly when walking alone at night.

CONTINGENCY TABLE LAB EXERCISES

Research Hypothesis 1: White men who own guns are more likely than white men who do not own guns to favor capital punishment.
Research Hypothesis 2: A person’s education affects his or her attitude towards marijuana.
Research Hypothesis 3: A person’s race affects the intensity of his or her religious identification.

(revised 6/04)