					DATA ANALYSIS WORKBOOK                                                          LAB 10




    SIMPLE ANALYSIS OF CONTINGENCY TABLES AND
       THE CHI-SQUARE TEST OF INDEPENDENCE

OVERVIEW
A contingency table consists of two or more columns and two or more rows. In accord with
the convention established in previous chapters, the columns represent the values of the
independent variable, and the rows represent the values of the dependent variable. We
present three examples in this chapter: the relation between fear of crime and gender, the
relation between approval/disapproval of legal abortion and marital status, and the relation
between approval/disapproval of homosexuality and marital status. Table 1 contains the
information on these variables. Fear of crime and gender are both dichotomous.
Approval/disapproval of abortion is dichotomous, while marital status is nominal,
polytomous. Finally, approval/disapproval of homosexuality is polytomous, ordinal, but we
use just the nominal information.

 Table 1. Value of the Dependent and Independent Variables Used to Study the
          Relations between Fear of Crime and Gender, Approval/Disapproval of
          Abortion and Marital Status, and Approval/Disapproval of Homosexuality
          and Marital Status

Dependent Variables                         Value   Label
FEAR (Afraid to walk alone at night)          1     Yes
                                              2     No
ABANY (Approval/Disapproval of Legal          1     Yes
Abortion)                                     2     No
HOMOSEX (Approval/Disapproval of              1     Wrong
Homosexuality)                                2     Mostly Wrong
                                              3     Mostly Right
                                              4     Not Wrong

Independent Variables                       Value   Label
SEX                                           1     Male
                                              2     Female
MARITAL (Marital Status)                      1     Married
                                              2     Widowed
                                              3     Divorced
                                              4     Separated
                                              5     Never Married


As Table 1 suggests, we use a contingency table to display the relationship between two
discrete or categorical variables either measured at the nominal level or treated as nominal.



Since the dependent variable is measured at the nominal level (unlike the analyses described
in previous chapters), we work with conditional percentages or proportions rather than
conditional means. We use the differences between the conditional percentages to describe
the form of the relation. We use the chi-square test of independence to test the null
hypothesis that the form of the relation is zero. Statisticians have developed many measures
of the strength of the relation in a contingency table. Since we believe that they add little or
no information to that provided by the form, we do not discuss them here.
STATISTICS AND DATA ANALYSIS
Concepts: form, cell frequencies, marginal distributions, conditional percentages,
          independence, expected and observed frequencies, chi-square test, degrees of
          freedom
The Information in a Contingency Table
The box below contains the computer output from an SPSS contingency table analysis of the
relation between fear of crime and the respondent’s gender. After describing this output, we
show how to present the results in a paper.

FEAR      * SEX   Crosstabulation
                                     SEX                   Total
                                     Male     Female
    FEAR          Yes     Count       141        580        721
                        % within    18.2%      56.3%      40.0%
                           SEX
                  No      Count       632        451       1083
                        % within    81.8%      43.7%      60.0%
                           SEX
       Total              Count        773      1031       1804
                        % within    100.0%    100.0%     100.0%
                           SEX

An important feature of a contingency table is its size. The size of a contingency table depends
on the number of rows and columns it contains. We use the letters “I” and “J” to refer to the
number of rows and columns. We refer to a table with I rows and J columns as an I by J
table. All contingency tables contain at least two rows and two columns. The variables
FEAR and SEX are both dichotomous; therefore we refer to the contingency table in this box
as a “two-by-two” (2x2) table.
The columns of the table represent the values of gender: “male” and “female.” The rows
represent the responses to the question “Are you afraid to walk alone at night
in your neighborhood?”--yes and no. The number in the upper-left hand cell of the table,
141, equals the number of respondents who are men and are afraid of crime. The number
632 equals the number of respondents who are men and do not fear crime. We refer to these
numbers as the cell frequencies.
The numbers below the cell frequencies are conditional percentages. They are conditional
because of the way we compute them. The percentage 18.2% equals the percentage of men who
are afraid of crime. To compute this percentage, we divide the cell frequency 141 by the
number of men (773) and multiply the ratio by 100. The percentage of women who fear
crime equals 56.3%. We use these percentages to compute the form of the relation. Because
each set of conditional percentages sums to 100%, the percentages in the second row provide
no additional information. For example, the percentage of men who do not fear crime is 81.8%
(100% - 18.2% = 81.8%), and the percentage of women who do not fear crime is 43.7%
(100% - 56.3% = 43.7%).
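As a minimal sketch of the arithmetic just described, the conditional (column) percentages can be reproduced from the cell frequencies in a few lines of Python. The workbook itself uses SPSS; the function name `column_percentages` is our own illustrative choice, not part of any package.

```python
# Observed cell frequencies from the FEAR * SEX crosstabulation.
# Rows are FEAR (yes, no); columns are SEX (male, female).
observed = [
    [141, 580],  # afraid to walk alone at night: yes
    [632, 451],  # afraid to walk alone at night: no
]


def column_percentages(table):
    """Return percentages conditional on the column (independent) variable."""
    col_totals = [sum(col) for col in zip(*table)]
    return [
        [100.0 * cell / total for cell, total in zip(row, col_totals)]
        for row in table
    ]


pct = column_percentages(observed)
for label, row in zip(["Yes", "No"], pct):
    print(label, [f"{p:.1f}%" for p in row])
```

Each column of `pct` sums to 100, which is why the second row carries no information beyond the first.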
The output somewhat obscures the relation between fear and sex because of other
information in the output. This additional information contains the number of respondents
used in the analysis (1804), and the marginal distributions of fear and sex, all in columns
and rows labeled “total.” The marginal distributions in a contingency table give the
univariate (unconditional) distributions of the dependent and independent variables. (We
refer to them as “marginal” because they appear on the margins of the table.) For example,
we see that 721 and 1083 respondents (40.0% and 60.0%) are afraid and not afraid,
respectively, to walk alone at night. We also see that 773 and 1031 respondents are men and
women, respectively.1 We use these marginal distributions to calculate the “expected
frequencies” (described below) in testing the null hypothesis of independence.
Table 2 (below) presents the contingency table in a more compact form, one
that we hope is easier to read. It contains just the conditional percentages of those afraid to
walk alone at night plus the number of men and women on which those percentages are
based. You can use just this information to construct the full table in the box above.2 Note that
you could instead focus on the percentage unafraid of crime in constructing Table 2. The choice of
the focal value of the dependent variable is a substantive one, not a statistical one.
You should pay attention to the size of the column totals. The column total is the base or
denominator used to compute the percentages, so the statistical reliability of a percentage
decreases as the size of the column total decreases. One rule of thumb that researchers use is to
ignore, or at least treat very cautiously, percentages based on fewer than 25 cases.

Computing The Form of the Relationship for a Contingency Table
The way we compute the form of the relationship in a contingency table differs slightly
according to the dimensions of the table and the level at which the variables are measured.
We illustrate this point by describing three examples: a 2x2 table, a 2x5 table, and a 4x5
table.

Table 2. Fear of Crime by Sex

                                               Sex of Respondent
                                               Male                   Female

    Per Cent Afraid to Walk                    18.2%                   56.3%
    Alone at Night

              n                                 773                    1031

1 SPSS reports the percentages associated with the frequencies of men and women as 100%. The reason is
the decision to condition on the column totals when computing the cell percentages. That is, 18.2% plus
81.8% and 56.3% plus 43.7% both sum to 100%. Focusing on the marginal distribution of sex, we can
convert the number of men and women into percentages (42.8% and 57.2%, respectively) by dividing 773 and
1031 by 1,804 and multiplying each proportion by 100.
2 You obtain the number of men and women who fear crime by multiplying each column total by the
conditional percentages converted to proportions. You subtract these frequencies from the column totals to
obtain the number of men and women who do not fear crime. You add across the columns to obtain the
marginal distribution of FEAR. Finally, you obtain the total number of respondents by summing either the
number of men and women or the number who fear and the number who do not fear crime.
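The reconstruction described in footnote 2 can be sketched in Python as follows. This is only an illustration under the assumption that rounding to the nearest whole case recovers the original counts (it does here, but need not with coarser percentages); `rebuild_counts` is a hypothetical helper name.

```python
# Table 2: percentage afraid and column totals (n) for men and women.
pct_afraid = {"male": 18.2, "female": 56.3}
n = {"male": 773, "female": 1031}


def rebuild_counts(pct_afraid, n):
    """Recover approximate cell counts from percentages and column totals."""
    # Column total times the conditional proportion, rounded to whole cases.
    afraid = {g: round(n[g] * pct_afraid[g] / 100.0) for g in n}
    # Whatever is left in each column is the "not afraid" frequency.
    not_afraid = {g: n[g] - afraid[g] for g in n}
    return afraid, not_afraid


afraid, not_afraid = rebuild_counts(pct_afraid, n)
row_totals = (sum(afraid.values()), sum(not_afraid.values()))  # marginal of FEAR
total = sum(n.values())                                        # all respondents
print(afraid, not_afraid, row_totals, total)
```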

In calculating the form of the relation in a 2x2 table, we begin by choosing the row that
corresponds to a particular value of the dependent variable--for example, the “yes” response
to the fear of crime question. Having made this choice, we follow the convention
(established in previous labs) and subtract the percentage in the first column from the
percentage in the second column. In the case of Table 2, for example, we subtract 18.2%
from 56.3%. As equation 1 shows, the form is 38.1%. We adopt the somewhat idiosyncratic
convention of using the symbol “$d_{yx}$” to refer to the form. (The “d” stands for “difference,”
but its similarity to the “b” used to represent the slope in a regression analysis is convenient.)
            (1) form: $d_{yx} = 56.3\% - 18.2\% = 38.1\%$

Figure 1 graphs this relation. It shows that the percentage who fear crime increases as the
“value” of sex “increases” from men to women. As in the case of computing male - female
differences in means, the sign of the form reflects the arbitrary coding of males and females
as 1 and 2, respectively. Having made this assignment, however, we treat these values as if
the order is real.
Note that in this and the subsequent graphs, the vertical axis contains the entire range of
percentages from 0 to 100%. You should do this when you draw your graphs by hand.
Some computer packages, e.g., SPSS, do not do this. Instead, they use a reduced range. The
problem with this procedure is that the resulting graph can exaggerate the size of the
relation.3
Table 3 contains the relation between the approval/disapproval of legal abortion for “any”
reason and marital status. The abortion item is dichotomous. The respondent had only two
choices--“yes” or “no”--in answering the following question: “Please tell me whether or not
you think it should be possible for a pregnant woman to obtain a legal abortion for any
reason she wants?” In contrast, marital status is a nominal, polytomous variable with five
categories: married, widowed, divorced, separated, and never married. We constructed Table
3 to focus on the conditional percentage who approve of abortion. (The percentage who
disapprove of legal abortion is redundant since all five sets of conditional percentages sum to
100%. Consequently, we do not report this information.)




3 Huff referred to such graphs as “gee whiz” graphs in How to Lie with Statistics, a useful book on statistics
that was popular in the 1950’s and 1960’s.




[Figure 1. Graph of the Relation between Fear of Crime and Gender. Vertical axis: Per Cent Who Fear Crime, 0% to 100%. Horizontal axis: Sex (1 = Men, 2 = Women).]
Because marital status is nominal, we describe the pattern of differences among the
conditional percentages just as we describe the pattern of conditional means in an analysis of
variance. (See Chapter 8.) Looking at Table 3, we see that the main contrast occurs between
divorced and never married respondents versus the three other marital status categories.
Approximately 46% of divorced respondents and 49% of never married respondents approve
of legal abortion for any reason. In contrast, only 36%, 30%, and 35% of married, widowed,
and separated respondents approve of legal abortion. This pattern is evident in Figure 2, which
graphs the relation between abortion approval and marital status. The difference
between these two sets of categories is between roughly ten and twenty percentage points.

Table 3. Approval of Abortion by Marital Status
                                                  Marital    Status
                   Married      Widowed           Divorced   Separated   Never Married
Approve Legal
Abortion             36.3%         29.9%            45.5%     35.4%         48.8%

       n              868            201            213        82              365
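To check the size of the contrast described for Table 3, the sketch below computes every pairwise difference between the two sets of marital status categories. The grouping into `higher` and `lower` follows the text's reading of the table; the variable names are our own.

```python
# Percentage approving of legal abortion for any reason, from Table 3.
approve = {
    "Married": 36.3,
    "Widowed": 29.9,
    "Divorced": 45.5,
    "Separated": 35.4,
    "Never Married": 48.8,
}
higher = ["Divorced", "Never Married"]
lower = ["Married", "Widowed", "Separated"]

# All pairwise differences (in percentage points) between the two sets.
diffs = [approve[h] - approve[l] for h in higher for l in lower]
smallest, largest = min(diffs), max(diffs)
print(f"{smallest:.1f} to {largest:.1f} percentage points")
```

The smallest contrast is divorced versus married and the largest is never married versus widowed.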






[Figure 2. Graph of the Relation between Approval of Abortion and Marital Status. Vertical axis: Per Cent Who Approve of Abortion for Any Reason, 0% to 100%. Horizontal axis: Marital Status (1 = Married, 2 = Widowed, 3 = Divorced, 4 = Separated, 5 = Never Married).]
In some contingency tables the categories of the independent variable may be ordinal or interval.
In such cases, the analyst should see whether the conditional percentages either increase or
decrease monotonically with the values of the independent variable and, if this is the case,
include this information in his or her interpretation.4
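The monotonicity check can be sketched in Python as below, following the strict and simple definitions given in footnote 4. The function `is_monotonic` and the example percentages are hypothetical illustrations; they are not taken from the workbook's data.

```python
def is_monotonic(values, strict=False):
    """Check whether values only increase or only decrease across categories.

    With strict=True, ties between successive values are not allowed.
    """
    pairs = list(zip(values, values[1:]))
    if strict:
        return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


# Hypothetical conditional percentages across the ordered values of an
# ordinal independent variable (illustrative values only).
rising_with_tie = [20.0, 35.0, 35.0, 50.0]
print(is_monotonic(rising_with_tie))               # simple monotonic
print(is_monotonic(rising_with_tie, strict=True))  # not strictly monotonic
```

Because monotonicity presumes an ordering of the columns, the check is meaningful only when the independent variable is ordinal or interval, not for a nominal variable such as marital status.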
Table 4 contains the final relation described in this chapter, the relation between attitude
toward homosexuality and marital status. The NORC homosexuality question offers the
respondent four answers to the following question: “What about sexual relations between
two adults of the same sex--do you think it is always wrong, almost always wrong, wrong
only sometimes, or not wrong at all?”
Unlike the previous two tables, Table 4 contains the conditional percentage for each of the
four choices. When the dependent variable in a contingency table contains more than two
categories, the analyst can investigate between-row as well as between-column contrasts
when describing the form of the relation. In the case of variables measured at the ordinal
level or higher, however, we typically simplify the description by implicitly collapsing the
dependent variable into two categories. We do this by focusing on either the top row, the
bottom row, or a combination of either top or bottom rows that makes sense substantively.
In addition, we make a choice that presents a relatively accurate picture of the relation between
the two variables. In the case of attitudes toward homosexuality, we could focus on: (1) the
per cent who think it is always wrong, (2a) the per cent who either think it is always wrong
or think it is mostly wrong, (2b) the per cent who either think it is wrong only sometimes or
think it is not wrong at all, or (3) the percentage who think it is not wrong at all.

Table 4. Attitude toward Homosexuality by Marital Status
Attitude toward                                  Marital Status
Homosexuality        Married     Widowed     Divorced     Separated     Never Married
Always Wrong           81.6%       83.4%        75.8%         80.0%             68.1%
Mostly Wrong            3.7%        4.4%         3.8%          0.0%              5.7%
Wrong Sometimes         5.2%        4.4%         5.7%          3.8%              8.7%
Not Wrong               9.5%        7.8%        14.7%         16.3%             17.4%
         n               886         205          211            80               367

4 In a strictly monotonic relation, each successive conditional percentage is either greater than or less than all
previous conditional percentages. In a simple monotonic relation each conditional percentage can also equal
the previous conditional percentage(s).

In describing the form of the relation, we choose to focus on the percentage who say that
homosexuality is always wrong since this response occurs more frequently than the others,
and since the main contrast in the table involves this response. Having made this choice, we
see that the main contrast in Table 4 is between never married respondents and the rest, with
divorced respondents falling about halfway between these two extremes. Whereas
approximately 80% of married, widowed, and separated respondents say that homosexuality
is always wrong, only 68% of the never married respondents hold this belief, and the 76% of
divorced respondents fall in between. This pattern is apparent in Figure 3,
which graphs the relation between marital status and the percentage who respond that
homosexuality is always wrong.
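Collapsing the dependent variable, as in options (1) and (2a) for the attitude item, amounts to summing the chosen rows within each column. The following Python sketch uses the Table 4 percentages; the helper `collapse` is an illustrative name of our own, and the results are rounded to one decimal place.

```python
# Conditional percentages from Table 4, in the column order listed below.
statuses = ["Married", "Widowed", "Divorced", "Separated", "Never Married"]
table4 = {
    "Always Wrong":    [81.6, 83.4, 75.8, 80.0, 68.1],
    "Mostly Wrong":    [3.7, 4.4, 3.8, 0.0, 5.7],
    "Wrong Sometimes": [5.2, 4.4, 5.7, 3.8, 8.7],
    "Not Wrong":       [9.5, 7.8, 14.7, 16.3, 17.4],
}


def collapse(table, rows):
    """Sum the chosen rows column by column to form one collapsed category."""
    return [round(sum(vals), 1) for vals in zip(*(table[r] for r in rows))]


always = collapse(table4, ["Always Wrong"])                            # option (1)
always_or_mostly = collapse(table4, ["Always Wrong", "Mostly Wrong"])  # option (2a)
for status, a, b in zip(statuses, always, always_or_mostly):
    print(f"{status:14s} {a:5.1f} {b:5.1f}")
```

Under either choice the never married respondents remain the lowest group, so the main contrast described in the text does not depend on which of these two collapsed categories is used.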

TESTING THE NULL HYPOTHESIS OF STATISTICAL INDEPENDENCE
The typical null hypothesis tested in contingency analysis is the hypothesis of independence,
or no relation between the independent and dependent variables. The alternative hypothesis states that the
two variables are related. As in analysis of variance, the hypothesis of independence is an
omnibus test for most contingency tables, so that the distinction between one- and two-tailed
tests does not apply. Only in the case of a 2x2 table--which is analogous to the t-test of no
difference between two population means or of a zero population slope in regression
analysis--can the analyst choose between the two tests. To simplify this lab, we drop this
distinction for all tables.
The null hypothesis implies that the conditional (column) percentages for any row in the
contingency table will equal one another and, therefore, will equal the marginal
(unconditional) distribution of the dependent variable. In the case of the relation between
fear and gender, for example, the null hypothesis implies that the percentage of men, women,
and, therefore, all people who fear crime are the same. The alternative hypothesis states that
at least some of the column percentages for at least some of the rows differ. Equations 2a
and 2b state the null and alternative hypotheses more formally. In this notation, the symbol
"π " stands for the population proportion, and the letters i and j denote the rows and
columns, respectively.



[Figure 3. Graph of the Relation between Rejection of Homosexuality and Marital Status. Vertical axis: Per Cent Who Say Homosexuality is Always Wrong, 0% to 100%. Horizontal axis: Marital Status (1 = Married, 2 = Widowed, 3 = Divorced, 4 = Separated, 5 = Never Married).]

(2a) $H_0: \pi_{i|j} = \pi_{i\cdot}$, for all i and all j.

(2b) $H_1: \pi_{i|j} \neq \pi_{i\cdot}$, for at least some i and for at least some j.

Looking at either the SPSS crosstabulation or Table 2, however, we see that the percentage of women afraid of
crime is much greater than the 40% of all people who fear crime, while the percentage of
men who fear crime is much less. These data suggest that the null hypothesis is false--that
fear of crime and sex are related. Of course, there is always the possibility that the difference
between men and women is due to chance. We need a statistical procedure to test this
possibility. We use the chi-square test of independence as the statistical procedure for
testing the null hypothesis. In this procedure the analyst computes the squared differences
between the observed cell frequencies and the cell frequencies he or she would expect
if the null hypothesis were true. The larger the squared differences, the smaller the chance of
obtaining the observed frequencies when the null hypothesis is true and, therefore, the
stronger the grounds for rejecting the null hypothesis.
Calculating the Expected Frequencies
(3) $f_{eij} = f_{o \cdot j} \times \frac{f_{oi \cdot}}{n}$

Equation 3 contains the formula for calculating the expected frequencies for the hypothesis
of independence. To obtain the expected frequency, $f_{eij}$, for the cell in the ith row and jth
column, we multiply the observed total of the jth column, $f_{o \cdot j}$, by the observed proportion of
cases in the ith row, $f_{oi \cdot}/n$. For example, we obtain 309 as the expected frequency of men
who fear crime by multiplying 773, the number of men, by .40, the proportion of all
respondents who fear crime (721/1804 = .40). (See the contingency table on page 2.) Table 5,
given below, contains both the observed and expected frequencies for the relation between
fear of crime and gender. We put the expected frequencies below the observed frequencies.
We also printed the row and column totals so that you can calculate the expected frequencies
yourself.5

Table 5. Observed and Expected Frequencies for the Relation between Fear of Crime
and Sex

Afraid to Walk                      Sex of Respondent                  Row
Alone at Night               (1) Male            (2) Female            Totals

  (1) Yes                       141                  580                 721
                               (309)                (412)
  (2) No                        632                  451                1083
                               (464)                (619)
Column Totals                   773                 1031                1804
*Observed frequencies are on top; expected frequencies are in parentheses.
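Equation 3 can be applied to every cell at once. The Python sketch below is a minimal illustration using the observed frequencies for fear of crime and sex; `expected_frequencies` is a name of our own choosing, not an SPSS or library function.

```python
# Observed frequencies: rows are FEAR (yes, no); columns are SEX (male, female).
observed = [
    [141, 580],
    [632, 451],
]


def expected_frequencies(table):
    """Expected counts under independence: column total times row proportion."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[col * row / n for col in col_totals] for row in row_totals]


expected = expected_frequencies(observed)
for row in expected:
    print([f"{value:.2f}" for value in row])
```

Rounding the unrounded results to the nearest integer gives the values in parentheses in Table 5 (309, 412, 464, and 619).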
Computing the chi-square statistic
The formula in equation 4a is for the chi-square statistic used to test the null hypothesis of
independence. Equation 4b demonstrates this formula for the data from the contingency table
introduced at the beginning of this chapter. For each cell the analyst divides the squared
difference between the observed and expected frequencies by the expected frequency. When
the null hypothesis is true, the squared differences between the observed and expected frequencies
will be small, relative to the expected frequency. When the null hypothesis is false, these
quantities will be large. Although the statistic measures the goodness of fit between the
observed and expected frequencies, statisticians refer to this statistic as “chi-square” ($\chi^2$),
the approximate sampling distribution of the goodness-of-fit statistic.

(4a) $\chi^2 = \sum \frac{(f_{oij} - f_{eij})^2}{f_{eij}}$

(4b) $\chi^2 = \frac{(141 - 309)^2}{309} + \frac{(632 - 464)^2}{464} + \frac{(580 - 412)^2}{412} + \frac{(451 - 619)^2}{619} = 266.09$

The value 266.09 is computed from the unrounded expected frequencies; the rounded values shown
in equation 4b give approximately 266.27.

5 The expected frequencies are rarely integers. SPSS, however, rounds them off to the nearest integer.
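The chi-square statistic of equation 4a can be computed directly from the observed table. The sketch below uses unrounded expected frequencies, which is how the value 266.09 arises; the function name `pearson_chi_square` is our own and the code is an illustration, not the SPSS procedure.

```python
# Observed frequencies: rows are FEAR (yes, no); columns are SEX (male, female).
observed = [
    [141, 580],
    [632, 451],
]


def pearson_chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f_o in enumerate(row):
            f_e = col_totals[j] * row_totals[i] / n  # expected frequency
            chi2 += (f_o - f_e) ** 2 / f_e
    return chi2


chi2 = pearson_chi_square(observed)
print(round(chi2, 2))
```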
To see whether the chi-square statistic in equation 4b (266.09) is significant, we evaluate it
against the chi-square distribution. The shape of the chi-square distribution depends on the
number of degrees of freedom. As shown in equation 5a, the number of degrees of freedom
for the test of independence equals the product of the number of rows minus one (r - 1) times
the number of columns minus one (c - 1).6 In the case of a 2x2 table, the number of degrees
of freedom equals one. The chi-squares for Tables 2, 3, and 4 are 266.09, 28.55, and
39.64, respectively. The degrees of freedom are 1, 4, and 12. All chi-squares are significant
at or well beyond the .01 level.

         (5a) df = rc - (r - 1) - (c - 1) - 1 = (r - 1)(c - 1)

         (5b) df = (2 - 1)(2 - 1) = 1
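The degrees of freedom and significance levels reported for the three tables can be checked with the Python sketch below. It relies only on the standard library and on two closed-form expressions for the upper tail of the chi-square distribution, so it covers just df = 1 and even df, which is all these examples need; a statistics package would handle the general case. Both function names are our own.

```python
import math


def degrees_of_freedom(rows, cols):
    """Degrees of freedom for the test of independence."""
    return (rows - 1) * (cols - 1)


def chi_square_p_value(chi2, df):
    """Upper-tail probability of chi-square; this sketch handles df = 1 or even df."""
    if df == 1:
        # A chi-square with 1 df is a squared standard normal deviate.
        return math.erfc(math.sqrt(chi2 / 2.0))
    if df > 0 and df % 2 == 0:
        # For even df the tail is a finite sum: exp(-x/2) * sum (x/2)^i / i!.
        half = chi2 / 2.0
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    raise ValueError("this sketch handles only df = 1 or even df")


# Chi-squares reported in the text, with the table dimensions (rows, columns).
reported = {"Table 2": (266.09, 2, 2), "Table 3": (28.55, 2, 5), "Table 4": (39.64, 4, 5)}
p_values = {}
for name, (chi2, r, c) in reported.items():
    df = degrees_of_freedom(r, c)
    p_values[name] = chi_square_p_value(chi2, df)
    print(name, df, p_values[name] < 0.01)
```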
A final note of caution on the use of chi-square to test the null hypothesis of independence is
that the distribution of the goodness-of-fit statistic is only approximately chi-square. The
approximation becomes better as the size of the expected frequencies increases.
Consequently, the approximation will be good for most large samples. A rule of thumb says
that problems can occur when the expected frequency for one or more of the cells is less
than five. This condition will occur (even, sometimes, in the case of large samples) when the
marginal distribution of one or both variables is sufficiently skewed to produce a small
expected frequency. In such cases, other procedures for testing the null hypothesis of
independence are available, but we do not discuss them here.
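The rule of thumb about small expected frequencies is easy to check before relying on the chi-square approximation. In the Python sketch below, `cells_below_threshold` is an illustrative helper of our own, and the `skewed` table is hypothetical data invented to show how a large sample with a lopsided marginal distribution can still produce small expected counts.

```python
def cells_below_threshold(table, threshold=5.0):
    """Return (row, column, expected) for cells whose expected count is too small."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    flagged = []
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            expected = c * r / n
            if expected < threshold:
                flagged.append((i, j, expected))
    return flagged


fear_by_sex = [[141, 580], [632, 451]]  # observed frequencies from the text
skewed = [[2, 3], [498, 497]]           # hypothetical: n = 1000, first row very rare

print(cells_below_threshold(fear_by_sex))
print(cells_below_threshold(skewed))
```

For the fear of crime table no cell is flagged, which agrees with the SPSS note that the minimum expected count is 308.94.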
DATA ANALYSIS EXAMPLE
Assume that a researcher wants to study the relation between fear of crime and sex. The box
on the next page contains the research, null and alternative hypotheses. The next box
contains the information on the variables, values, and cases used in the analysis of the
relation between fear of crime and sex.

6 Technically, the chi-square distribution is the distribution of a sample estimate of the population variance
divided by the true population variance. The number of degrees of freedom equals the number of degrees of
freedom associated with the sample variance (typically n - 1). In contingency table analysis, the degrees of
freedom equal (r - 1)(c - 1), first, because the cell frequencies, rather than individual cases, constitute the
observations. In a table with r rows and c columns, therefore, the product rc equals the total number of
observations. Second, the number of degrees of freedom equals the total number of observations minus the
number of independent parameter estimates that constrain the calculation of the expected frequencies. As
shown in equation 5a, one constraint occurs because the expected frequencies have to sum to the sample size. A
second set of constraints occurs because we use r row proportions (which are sample estimates of the population
distribution of the dependent variable) in calculating the expected frequencies. Only r - 1 are independent
estimates, however, because the row proportions have to sum to 1. Finally, a third set of constraints occurs
because we also use the c column totals in the calculation of the expected frequencies. (These column totals
constitute c - 1 independent estimates of the population distribution of the independent variable.) You can gain
an intuitive understanding of the concept of degrees of freedom in the test of independence by convincing
yourself that you have to use equation 3 only (r - 1)(c - 1) times to calculate the expected frequencies for an r x c
table. You obtain the expected frequencies for the remaining cells simply by subtracting the
appropriate sum of the expected frequencies you have already obtained from the row and column totals.





       Research Hypothesis: A person’s sex affects his or her fear of crime
       H 0 : Fear of crime and sex are independent.
       H1 : Fear of crime and sex are related.



                                              Cases: all

           Dependent        Variable                               Independent        Variable

   Index/Name                   v78/Fear                     Index/Name                  v2/Sex

   Description                  Afraid to walk               Description              Person’s Gender

     Level of                   nominal                        Level of                  nominal
     Measurement                                               Measurement

  Min. Code/Value              (1) yes                      Min. Code/Value              (1) male

  Max. Code/Value              (2) no                       Max. Code/Value              (2) female

Results

The final box (below) contains the computer output you will see when you test the
null hypothesis of independence. This output accompanies the contingency table presented
at the beginning of this chapter (see page 2). It contains a number of statistics that
researchers can use to test the null hypothesis of independence. The one you will use, however, is
the Pearson chi-square, which is highlighted.7

Chi-Square Tests
                                  Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                                 (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square              266.092      1      .000
Continuity Correction           264.509      1      .000
Likelihood Ratio                280.267      1      .000
Fisher's Exact Test                                               .000         .000
Linear-by-Linear Association    265.944      1      .000
N of Valid Cases                   1804
a Computed only for a 2x2 table
b 0 cells (.0%) have expected count less than 5. The minimum expected count is 308.94.

Interpretation
This interpretation is based on information from both the contingency table presented on
page 2 and the highlighted information in the box above.
I reject the null hypothesis that fear of crime and sex are independent (p < .01). Fifty-six per
cent of the women, compared to 18 per cent of the men, report that they are afraid to walk in
their neighborhood alone at night. The difference in percentages, 38.1%, is statistically
significant at the .01 level of significance. A possible explanation of this result is that
women’s much greater vulnerability to rape makes them more fearful, particularly when
walking alone at night.

7 Modern analyses of contingency tables typically use the likelihood-ratio chi-square rather than the Pearson
chi-square. The reason they use this statistic is that it can be partitioned into different components (analogous to
the partition of the sums of squares in an analysis of variance). In addition, they focus on “odds” rather than
percentages or proportions. The odds of occurrence for some value equals the frequency of that value divided
by the frequency of another value. The unconditional odds of occurrence for fear, for example, is 721/1083 =
.666. In computing the form of the relationship, the analyst computes an odds ratio by dividing the conditional
odds for one group (e.g., females) by the conditional odds for another group (e.g., males). The odds ratio for
the relation between fear of crime and sex, for example, is (580/451)/(141/632) = 5.76. We would interpret this
result by saying that the odds of being afraid of crime are about five and three-quarters times greater for women
than for men.
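The odds and odds ratio mentioned in footnote 7 follow directly from the cell frequencies. The short Python sketch below reproduces the footnote's arithmetic; the function name `odds` is our own illustrative choice.

```python
def odds(frequency, other_frequency):
    """Odds of one value: its frequency divided by the frequency of the other value."""
    return frequency / other_frequency


# Cell frequencies from the FEAR * SEX crosstabulation (afraid, not afraid).
odds_female = odds(580, 451)
odds_male = odds(141, 632)
odds_ratio = odds_female / odds_male

# Unconditional odds of being afraid, from the row totals.
unconditional_odds = odds(721, 1083)

print(f"{unconditional_odds:.3f} {odds_ratio:.2f}")
```

Unlike the percentage difference, the odds ratio does not change if the rows and columns of the table are interchanged.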

CONTINGENCY TABLE LAB EXERCISES

Research Hypothesis 1: White men who own guns are more likely than white men who do not
                       own guns to favor capital punishment.

Research Hypothesis 2: A person’s education affects his or her attitude towards marijuana.

Research Hypothesis 3: A person’s race affects the intensity of his or her religious
                       identification.




C:\workbook\white\inst10.doc       (revised 6/04)



