# Sociology Martin

```
Sociology 601 (Martin)
Lecture 14: November 4-6, 2008
• Contingency Tables for Categorical Variables (8.1)
o   some useful probabilities and hypothesis tests based on
contingency tables
o   independence redefined.

• The Chi-Squared Test (8.2)

• When to use Chi-squared tests (8.3)
o   chi-squared residuals
Definitions for a 2X2 contingency table

• Let X and Y denote two categorical variables
o   Variable X can have one of two values: X = 1 or X = 2
o   Variable Y can have one of two values: Y = 1 or Y = 2

• nij denotes the count of responses in the cell where X = i and Y = j
(for example, n12 counts the cases with X = 1 and Y = 2)
Structure for a 2X2 contingency table
• Values for X and Y variables are arrayed as follows:

                     Value of Y:
                     1            2
Value of X:   1      n11          n12          total X=1
              2      n21          n22          total X=2
                     total Y=1    total Y=2    (grand total)
Some useful definitions
• The unconditional probability P(Y = 1):
= (n11 + n21 )/ (n11 + n12 + n21 + n22 )
= the marginal probability that Y equals 1
• The conditional probability P(Y = 1, given X = 1):
= n11 / (n11 + n12)
= P ((Y = 1) | (X = 1))
• The joint probability P(Y = 1 and X = 1):
= n11 / (n11 + n12 + n21 + n22 )
= P ((Y = 1) ∩ (X = 1))
= the cell probability for cell (1,1)
Example:
                                 Support Law Enforcement?
                                 Yes        No         Tot
Support health care   Yes        292        25         317
spending?             No         14         9          23
                      Tot        306        34         340

o   What is the unconditional probability of favoring increased
spending on law enforcement?
o   What is the conditional probability of favoring increased spending
on law enforcement for respondents who opposed increased
spending on health?
o   What is the joint probability of favoring increased spending on law
enforcement and opposing increased spending on health?
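
. * Sketch, not from the original slides: Stata's display command works as a
. * calculator, so the three answers can be checked directly from the table.
. * unconditional P(law enforcement = yes)
. display 306/340
. * conditional P(law enforcement = yes | health = no)
. display 14/23
. * joint P(law enforcement = yes and health = no)
. display 14/340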
Hypothesis tests based on
contingency tables:

• Usually we ask: is the distribution of Y when X=1 different than the
distribution of Y when X=2?

• Null Hypothesis: the conditional distributions of Y, given X, are
equal.
Ho: P((Y = 1) | (X = 1)) – P((Y = 1) | (X = 2)) = 0
alternatively, Ho: P(Y | X=1) – P(Y | X=2) = 0

• This type of question often comes up because of its causal
implications.
o   For example: “Are childless adults more likely to vote for school funding than
parents?”
A confusing new definition for independence
• Previously we used the term independence to refer to groups of
observations.
o    “White and Hispanic respondents were sampled independently.”
• In this chapter, we use independence to refer to a property of
variables, not observations.
o    “Political orientation is independently distributed with respect to ethnicity”
o    Two categorical variables are independent if the conditional distributions of one
variable are identical at each category of the other variable.

                  Democrat    Independent    Republican    Total
white             440         140            420           1000
black             44          14             42            100
Hispanic          110         35             105           250
Total             594         189            567           1350
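
. * Sketch, not from the original slides: tabi is the immediate form of
. * tabulate and takes cell counts typed row by row, with \ between rows.
. * The row option prints each group's conditional distribution of party;
. * under independence the three rows of percentages should be identical.
. tabi 440 140 420 \ 44 14 42 \ 110 35 105, row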
Contingency tables in STATA
• The 1991 General Social Survey contains data on party
identification and gender for 980 respondents.
o   See Table 8.1, page 250 in A&F
• Here is a program for inputting the data into STATA
interactively:

input str10 gender str12 party number
female    democrat     279
male      democrat     165
female    independent   73
male      independent   47
female    republican   225
male      republican   191
end
Contingency tables in STATA
• Here is a command to create a contingency table, and its
output

. tabulate gender party [freq=number]

           |              party
    gender |  democrat  independe  republica |     Total
-----------+---------------------------------+----------
    female |       279         73        225 |       577
      male |       165         47        191 |       403
-----------+---------------------------------+----------
     Total |       444        120        416 |       980

• The following slide adds row, column, and cell %
. tabulate gender party [freq=number], row column cell

+-------------------+
| Key               |
|-------------------|
|     frequency     |
|  row percentage   |
| column percentage |
|  cell percentage  |
+-------------------+

           |              party
    gender |  democrat  independe  republica |     Total
-----------+---------------------------------+----------
    female |       279         73        225 |       577
           |     48.35      12.65      38.99 |    100.00
           |     62.84      60.83      54.09 |     58.88
           |     28.47       7.45      22.96 |     58.88
-----------+---------------------------------+----------
      male |       165         47        191 |       403
           |     40.94      11.66      47.39 |    100.00
           |     37.16      39.17      45.91 |     41.12
           |     16.84       4.80      19.49 |     41.12
-----------+---------------------------------+----------
     Total |       444        120        416 |       980
           |     45.31      12.24      42.45 |    100.00
           |    100.00     100.00     100.00 |    100.00
           |     45.31      12.24      42.45 |    100.00
8.2 Developing a new statistical significance
test for contingency tables.
                              support tax reform?
                              Yes          No           Tot
support             Yes       150          100          250
environment?        No        200          50           250
                    Tot       350          150          500

• “Is the level of support for the environment dependent on
the level of support for tax reform?”
o   If so, these two measures are likely to have some causal link worth
investigating.
With a 2x2 table, we can use a z-test for
independent-sample proportions.

. prtesti 250 .6 250 .8

Two-sample test of proportion                      x: Number of obs =      250
                                                   y: Number of obs =      250
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |         .6   .0309839                      .5392727    .6607273
           y |         .8   .0252982                      .7504164    .8495836
-------------+----------------------------------------------------------------
        diff |        -.2        .04                     -.2783986   -.1216014
             |  under Ho:   .0409878    -4.88   0.000
------------------------------------------------------------------------------
        diff = prop(x) - prop(y)                                  z =  -4.8795
    Ho: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.0000         Pr(|Z| < |z|) = 0.0000          Pr(Z > z) = 1.0000
Moving beyond 2x2 tables:
• Comparing conditional probabilities is fine when there are only two
comparisons and two possible outcomes for each comparison.

• The Chi-Square (χ²) test is a new technique for making comparisons
more flexible.

• χ² starts from a null hypothesis that every cell should have the frequency
you would expect if the variables were independently distributed.

• fe is the expected count for each cell.

• fe = total N * unconditional row probability * unconditional column probability

• A test for the whole table will combine tests for fe for every cell.
Testing independence of support for tax reform
and environmental spending:

• New Approach: Chi Squared test for independence of
attitudes toward taxes and the environment.

• Test statistic:
o χ² = Σ ((fo – fe)² / fe)

o where fo is the observed count in each cell

o and where fe is the expected count for each cell,

assuming that attitudes toward taxes will be the same for
people who support environmental issues as for people
who do not support environmental issues.
Assumptions and hypothesis
for a chi-squared test:
• Assumptions:
o   two categorical variables (for this course)
o   random sample or stratified random sample
o   fe ≥ 5 for all cells
• Hypothesis: Ho: the two variables are statistically
independent.
o   this means that the distribution of each variable is
independent of the score of the other variable
Calculating expected cell counts:
• The expected cell count is the count we would
expect in a cell if environmental support for tax
reform advocates and tax reform opponents were
identical, or were the same as environmental support
for the whole sample

•   fe(1,1) = 500 * (350/500) [taxes] * (250/500) [environment] = 175
•   fe(1,2) = 75
•   fe(2,1) = 175
•   fe(2,2) = 75
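
. * Sketch, not from the original slides: the expected option of tabi prints
. * fe under each observed count (rows: environment yes/no; columns: tax reform yes/no).
. tabi 150 100 \ 200 50, expected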
Using expected cell counts to calculate a test
statistic
• The test statistic is analogous to a t-statistic…
o   but the form of the equation makes it difficult to see that
the χ² statistic is a difference between the observed and
expected values, divided by an estimate of the typical
variation we would expect from random sampling error.

• Test statistic:
o χ² = Σ ((fo – fe)² / fe)

= ((150 – 175)²/175 + (100 – 75)²/75
+ (200 – 175)²/175 + (50 – 75)²/75)
= 3.5714 + 8.3333 + 3.5714 + 8.3333 = 23.81
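
. * Sketch, not from the original slides: the same sum typed into display,
. * then tabi with chi2 and cchi2 to print the statistic and each cell's contribution.
. display (150-175)^2/175 + (100-75)^2/75 + (200-175)^2/175 + (50-75)^2/75
. tabi 150 100 \ 200 50, chi2 cchi2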
Degrees of freedom for a Chi-squared statistic:
• We now have a test statistic: χ² = 23.81
• How do we assign a p-value to this?
• Step 1: calculate the degrees of freedom.
o   Given the row and column marginal totals, how many
cells need we fill in before we can do the rest
automatically?
o   Answer: 1 in this case, so df = 1.
o   General answer: df = (r-1)*(c-1), where r is the number
of rows and c is the number of columns.
p-value for a Chi-squared statistic:
• Assign a p-value to the statistic: χ² = 23.81, df = 1

• Given the degrees of freedom, look up the p-value.
o Go to Table C on page 670.
o Go down to the row for df = 1
o Move across χ² values to the largest tabled value that is
smaller than the measured χ²
o Look up the corresponding p-value at the top of the
column: p < .001
o The chi-squared test is always a 1-tailed test: we always
use the right tail of the distribution.
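
. * Sketch, not from the original slides: chi2tail(df, x) returns the
. * right-tail probability of a chi-squared distribution, so it can
. * stand in for the table lookup.
. display chi2tail(1, 23.81)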
Do your own chi-squared test:
• You watch 50 beachcombers to see if they are wearing
sandals and if they are wearing shorts

                        wearing shorts?
                        Yes         No          Tot
sandals?      Yes       20          10          30
              No        10          10          20
              Tot       30          20          50

• Q: Does a beachcomber’s chance of wearing sandals depend
on whether they are wearing shorts?
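
. * Sketch, not from the original slides: one way to check a hand calculation
. * (rows: sandals yes/no; columns: shorts yes/no).
. tabi 20 10 \ 10 10, chi2 expected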
Chi-Squared Tests for more than 2X2 Tables
• Here is a command to run a chi-squared test on the gender
and partyid data from the 1991 GSS (see section 8.1)

. tabulate gender party [freq=number], chi2

           |              party
    gender |  democrat  independe  republica |     Total
-----------+---------------------------------+----------
    female |       279         73        225 |       577
      male |       165         47        191 |       403
-----------+---------------------------------+----------
     Total |       444        120        416 |       980

          Pearson chi2(2) =   7.0095   Pr = 0.030
• Add expected cell counts
. tabulate gender party [freq=number], chi2 expected

+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
+--------------------+

           |              party
    gender |  democrat  independe  republica |     Total
-----------+---------------------------------+----------
    female |       279         73        225 |       577
           |     261.4       70.7      244.9 |     577.0
-----------+---------------------------------+----------
      male |       165         47        191 |       403
           |     182.6       49.3      171.1 |     403.0
-----------+---------------------------------+----------
     Total |       444        120        416 |       980
           |     444.0      120.0      416.0 |     980.0

          Pearson chi2(2) =   7.0095   Pr = 0.030
• Add chi-squared contribution of each cell
. tabulate gender party [freq=number], chi2 expected cchi2

+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
|  chi2 contribution |
+--------------------+

           |              party
    gender |  democrat  independe  republica |     Total
-----------+---------------------------------+----------
    female |       279         73        225 |       577
           |     261.4       70.7      244.9 |     577.0
           |       1.2        0.1        1.6 |       2.9
-----------+---------------------------------+----------
      male |       165         47        191 |       403
           |     182.6       49.3      171.1 |     403.0
           |       1.7        0.1        2.3 |       4.1
-----------+---------------------------------+----------
     Total |       444        120        416 |       980
           |     444.0      120.0      416.0 |     980.0
           |       2.9        0.2        3.9 |       7.0

          Pearson chi2(2) =   7.0095   Pr = 0.030
8.3 When not to do a chi-squared test
1.) Do not do a Chi-squared test when the expected value of a
cell is less than 5.
                 Party identification (expected counts in parentheses)
age            Democrat      Indep.      Republican     Total
<65            42 (40)       5 (8)       33 (32)        80
≥65            8 (10)        5 (2)       7 (8)          20
total          50            10          40             100

The Problem: The total χ² is 6.28, so p<.05, but 4.5 of the total
comes from one cell with fe = 2.
(It is okay to do a Chi-squared test if a cell has an expected value above 5
and an observed value below 5!)
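
. * Sketch, not from the original slides: cchi2 shows how much each cell
. * adds to the total, and expected shows which cells have fe below 5.
. tabi 42 5 33 \ 8 5 7, chi2 expected cchi2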
A small sample alternative to a chi-squared test

When the sample size is too small for a chi-squared test, you
may treat the contingency table as a small sample
comparison of two population proportions.

This means you should do a Fisher’s exact test for population
proportions.

A Fisher’s exact test will also work okay on large samples, but
you sometimes will bog down the computer with lengthy
computations. (This is especially likely to happen when the
tables are 5X4 or larger).
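
. * Sketch, not from the original slides: the exact option requests Fisher's
. * exact test, shown here for the age by party table with the small expected count.
. tabi 42 5 33 \ 8 5 7, exact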
Fisher’s exact test in STATA
. * output fisher's exact test
. * (not necessary in this case because of large n)

. tabulate gender party [freq=number], exact

           |              party
    gender |  democrat  independe  republica |     Total
-----------+---------------------------------+----------
    female |       279         73        225 |       577
      male |       165         47        191 |       403
-----------+---------------------------------+----------
     Total |       444        120        416 |       980

           Fisher's exact =                 0.031

(For a comparable chi2 test, chi2 = 7.01 and p = .030)
When not to do a chi-squared test (continued)
2.) Do not do a Chi-squared test for cell values that are not
observed frequencies.

                 Voted in last election?
sex              Yes           No            Total
women            35%           15%           50%
men              20%           30%           50%
total            55%           45%           100%

The Problem: If you use percentages, you misstate the sample
size as 100.
When not to do a chi-squared test (continued)
3.) Do not do a Chi-squared test to find a difference in
population proportions for dependent samples.

Number supporting death penalty:
                 After hearing speech:
Before speech:   Yes          No           Total
Yes              80           20           100
No               40           60           100
total            120          80           200
The Problem: You want to know if the speech changed
people’s opinions. A χ² test would tell you if opinions after
the speech depend on opinions before the speech.
Residual Analysis for Chi-Squared Tests
This part of section 8.3 will not be on the exam. I don’t use
this stuff, but Agresti covers it, and you should have a
reference in case a referee ever asks for it.

The problem:
If a Chi-squared test produces a statistically significant
result, we only know that somewhere in the table the data
depart from what independence predicts.

To find the level of statistical significance associated with a
single cell value, we conduct a residual analysis.
Residual Analysis for Chi-Squared Tests
Terms:
residual: ( fo - fe ) The difference between an observed and
an expected cell frequency.

adjusted residual: The standardized difference between an
observed and an expected cell frequency. (Like a z-score for
cells in independence tests.)

a.r. = (fo – fe) / sqrt( fe (1 – row proportion)(1 – column proportion) )
Residual Analysis for Chi-Squared Tests
Example:
Party identification
sex         Democrat Indep.       Republican Total
female      279(261.4) 73 (70.65) 225(244.9) 577
male        165(182.6) 47 (49.35) 191(171.1) 403
total       444           120      416       980

adjusted residual for cell (1,1)
= (279-261.4)/sqrt((261.4)(1-444/980)(1-577/980))
= 2.295 (treat as a z-score, so one-tailed p = .011)
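
. * Sketch, not from the original slides: the same adjusted residual with display;
. * 1 - normal(z) gives the one-tailed p-value for a z-score.
. display (279-261.4)/sqrt(261.4*(1-444/980)*(1-577/980))
. display 1 - normal(2.295)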
Residual Analysis for Chi-Squared Tests
1.) the adjusted residual is like a z-score for a two-sided test
of the difference between the proportion in the cell and the
average proportion for all other cells in the column.
( not the z-score for fo - fe )

2.) An adjusted residual of “z” = 1.96 for one cell does not
mean that the whole Chi-squared test is statistically
significant at the .05 level. (A Chi-squared test adjusts for
the fact that you are doing df t-tests at the same time.)

```