					               Sociology 601 (Martin)
           Lecture 14: November 4-6, 2008
• Contingency Tables for Categorical Variables (8.1)
  o   some useful probabilities and hypothesis tests based on
      contingency tables
  o   independence redefined.


• The Chi-Squared Test (8.2)

• When to use Chi-squared tests (8.3)
  o   chi-squared residuals
      Definitions for a 2X2 contingency table


• Let X and Y denote two categorical variables
  o   Variable X can have one of two values: X = 1 or X = 2
  o   Variable Y can have one of two values: Y = 1 or Y = 2


• nij denotes the count of responses in the cell where X = i and Y = j
        Structure for a 2X2 contingency table
• Values for X and Y variables are arrayed as follows:

                     Value of Y:

                     1             2

Value      1         n11           n12     total X=1
of X:
           2         n21           n22     total X=2

                     total Y=1 total Y=2   (grand
                                           total)
             Some useful definitions
• The unconditional probability P(Y = 1):
     = (n11 + n21 )/ (n11 + n12 + n21 + n22 )
     = the marginal probability that Y equals 1
• The conditional probability P(Y = 1, given X = 1):
     = n11 / (n11 + n12)
     = P ((Y = 1) | (X = 1))
• The joint probability P(Y = 1 and X = 1):
     = n11 / (n11 + n12 + n21 + n22 )
     = P ((Y = 1) ∩ (X = 1))
     = the cell probability for cell (1,1)
                             Example:
                                     Support Law Enforcement?
                                     Yes        No         Tot
  Support health Yes                 292        25         317
  care spending? No                  14         9          23
                 Tot                 306        34         340

  o   What is the unconditional probability of favoring increased
      spending on law enforcement?
  o   What is the conditional probability of favoring increased spending
      on law enforcement for respondents who opposed increased
      spending on health?
  o   What is the joint probability of favoring increased spending on law
      enforcement and opposing increased spending on health?
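
One way to check answers to the three questions above is to compute them directly from the cell counts. The following is a minimal sketch in Python (used here only for illustration; the lecture itself works in Stata), and the variable names are hypothetical.

```python
# Counts from the example table.
# Rows: support health care spending (yes, no).
# Columns: support law enforcement spending (yes, no).
table = [[292, 25],
         [14, 9]]

grand_total = sum(sum(row) for row in table)  # 340

# Unconditional (marginal) probability of favoring law enforcement spending.
p_law_yes = (table[0][0] + table[1][0]) / grand_total

# Conditional probability of favoring law enforcement spending,
# given opposition to health care spending (second row only).
p_law_yes_given_health_no = table[1][0] / sum(table[1])

# Joint probability of favoring law enforcement spending
# AND opposing health care spending (one cell over the grand total).
p_law_yes_and_health_no = table[1][0] / grand_total

print(round(p_law_yes, 3),
      round(p_law_yes_given_health_no, 3),
      round(p_law_yes_and_health_no, 3))
# prints: 0.9 0.609 0.041
```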
                    Hypothesis tests based on
                      contingency tables:

• Usually we ask: is the distribution of Y when X=1 different than the
  distribution of Y when X=2?

• Null Hypothesis: the conditional distributions of Y, given X, are
  equal.
  Ho: P ((Y = 1) | (X = 1)) – P((Y = 1) | (X = 2)) = 0
  alternatively, Ho: π(Y|X=1) – π(Y|X=2) = 0

• This type of question often comes up because of its causal
  implications.
   o   For example: “Are childless adults more likely to vote for school funding than
       parents?”
  A confusing new definition for independence
• Previously we used the term independence to refer to groups of
  observations.
   o    “White and Hispanic respondents were sampled independently.”
• In this chapter, we use independence to refer to a property of
  variables, not observations.
   o    “Political orientation is independently distributed with respect to ethnicity.”
   o    Two categorical variables are independent if the conditional distributions of one
        variable are identical at each category of the other variable.

                  Democrat           Independent       Republican        Total
white             440                140               420               1000
black             44                 14                42                100
hispanic          110                35                105               250
Total             594                189               567               1350
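
The table above satisfies this definition of independence: every ethnic group has the same conditional distribution of party identification. A short Python sketch (illustrative only; the helper name `conditional_distribution` is hypothetical) makes the check explicit.

```python
import math

# Counts from the table above: Democrat, Independent, Republican.
counts = {
    "white":    [440, 140, 420],
    "black":    [44, 14, 42],
    "hispanic": [110, 35, 105],
}

def conditional_distribution(row):
    """Proportion of the row total falling in each column."""
    total = sum(row)
    return [cell / total for cell in row]

conditionals = {group: conditional_distribution(row)
                for group, row in counts.items()}

# The variables are independent in this table if every group's
# conditional distribution matches the first group's distribution.
reference = conditionals["white"]
independent = all(
    math.isclose(p, q)
    for dist in conditionals.values()
    for p, q in zip(dist, reference)
)

for group, dist in conditionals.items():
    print(group, [round(p, 2) for p in dist])
print("independent:", independent)
```

Each row prints the proportions 0.44, 0.14, and 0.42, which are also the marginal proportions in the Total row.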
              Contingency tables in STATA
• The 1991 General Social Survey contains data on party
  identification and gender for 980 respondents.
   o   See Table 8.1, page 250 in A&F
• Here is a program for inputting the data into STATA
  interactively:

input str10 gender str12 party number
female    democrat     279
male      democrat     165
female    independent   73
male      independent   47
female    republican   225
male      republican   191
end
           Contingency tables in STATA
• Here is a command to create a contingency table, and its
  output

. tabulate gender party [freq=number]

           |              party
    gender | democrat independe republica |        Total
-----------+---------------------------------+----------
    female |       279         73        225 |       577
      male |       165         47        191 |       403
-----------+---------------------------------+----------
     Total |       444        120        416 |       980


• The following slide adds row, column, and cell %
. tabulate gender party [freq=number], row column cell

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| row percentage    |
| column percentage |
| cell percentage   |
+-------------------+

           |              party
    gender | democrat independe republica |        Total
-----------+---------------------------------+----------
    female |       279          73       225 |       577
           |     48.35      12.65      38.99 |    100.00
           |     62.84      60.83      54.09 |     58.88
           |     28.47       7.45      22.96 |     58.88
-----------+---------------------------------+----------
      male |       165          47       191 |       403
           |     40.94      11.66      47.39 |    100.00
           |     37.16      39.17      45.91 |     41.12
           |     16.84       4.80      19.49 |     41.12
-----------+---------------------------------+----------
     Total |       444        120        416 |       980
           |     45.31      12.24      42.45 |    100.00
           |    100.00     100.00     100.00 |    100.00
           |     45.31      12.24      42.45 |    100.00
  8.2 Developing a new statistical significance
          test for contingency tables.
                                      support tax reform?
                                      Yes          No               Tot
  support              Yes            150          100              250
  environment?         No             200          50               250
                       Tot            350          150              500

• “Is the level of support for the environment dependent on
  the level of support for tax reform?”
   o   If so, these two measures are likely to have some causal link worth
       investigating.
        With a 2x2 table, we can use a z-test for
           independent-sample proportions.

. prtesti 250 .6 250 .8

Two-sample test of proportion                      x: Number of obs =       250
                                                   y: Number of obs =       250
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |         .6   .0309839                      .5392727     .6607273
           y |         .8   .0252982                      .7504164     .8495836
-------------+----------------------------------------------------------------
        diff |        -.2        .04                     -.2783986    -.1216014
             | under Ho:    .0409878    -4.88   0.000
------------------------------------------------------------------------------
        diff = prop(x) - prop(y)                                   z = -4.8795
    Ho: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.0000         Pr(|Z| < |z|) = 0.0000          Pr(Z > z) = 1.0000
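
The z statistic and the "under Ho" standard error in this output can be reproduced by hand. The Python sketch below (illustrative only; variable names are hypothetical) pools the two sample proportions, as the null hypothesis of equal proportions implies.

```python
import math

# Inputs passed to prtesti: group sizes and sample proportions.
n1, p1 = 250, 0.6   # tax reform support among environment supporters
n2, p2 = 250, 0.8   # tax reform support among environment opponents

# Under Ho the two groups share one proportion, estimated by pooling.
p_pooled = (n1 * p1 + n2 * p2) / (n1 + n2)

# Standard error of the difference under Ho.
se_under_h0 = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))

z = (p1 - p2) / se_under_h0

# Two-sided p-value from the standard normal distribution.
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(round(se_under_h0, 7), round(z, 4))
# prints: 0.0409878 -4.8795
```

Squaring this z gives about 23.81, the same value as the chi-squared statistic computed for this table later in the lecture.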
                  Moving beyond 2x2 tables:
• Comparing conditional probabilities is fine when there are only two
  comparisons and two possible outcomes for each comparison.

• The Chi-Squared (χ²) test is a more flexible technique: it handles
  tables with more than two rows or more than two columns.

• The χ² test starts from the null hypothesis that every cell has the
  frequency you would expect if the variables were independently distributed.

• fe is the expected count for each cell.

• fe = total N * unconditional row probability * unconditional column probability

• A test for the whole table will combine tests for fe for every cell.
 Testing independence of support for tax reform
          and environmental spending:

• New Approach: Chi Squared test for independence of
  attitudes toward taxes and the environment.

• Test statistic:
   o χ² = Σ ((fo – fe)² / fe)

   o where fo is the observed count in each cell

   o and where fe is the expected count for each cell,

     assuming that attitudes toward taxes will be the same for
     people who support environmental issues as for people
     who do not support environmental issues.
              Assumptions and hypothesis
                 for a chi-squared test:
• Assumptions:
  o   two categorical variables (for this course)
  o   random sample or stratified random sample
  o   fe ≥ 5 for all cells
• Hypothesis: Ho: the two variables are statistically
  independent.
  o   this means that the distribution of each variable is
      independent of the score of the other variable
           Calculating expected cell counts:
• The expected cell count is the count we would
  expect in a cell if environmental support for tax
  reform advocates and tax reform opponents were
  identical, or were the same as environmental support
  for the whole sample

•   fe(1,1) = 500 * (350/500) [taxes] * (250/500) [environment] = 175
•   fe(1,2) = 75
•   fe(2,1) = 175
•   fe(2,2) = 75
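
The same expected counts follow from the rule fe = (row total * column total) / N, which is algebraically identical to N * row probability * column probability. A minimal Python sketch, with a hypothetical helper `expected_counts`:

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: support environment (yes, no); columns: support tax reform (yes, no).
observed = [[150, 100],
            [200, 50]]

expected = expected_counts(observed)
print(expected)
# prints: [[175.0, 75.0], [175.0, 75.0]]
```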
  Using expected cell counts to calculate a test
                   statistic
• The test statistic is analogous to a t-statistic…
  o   but the form of the equation makes it difficult to see that
      the χ² statistic is a difference between the observed and
      expected values, divided by an estimate of the typical
      variation we would expect from random sampling error.

• Test statistic:
   o χ² = Σ ((fo – fe)² / fe)

   = (150 – 175)²/175 + (100 – 75)²/75
   + (200 – 175)²/175 + (50 – 75)²/75
   = 3.5714 + 8.3333 + 3.5714 + 8.3333 = 23.81
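
The hand calculation above can be checked with a few lines of Python. This is a sketch for illustration only; the function names are hypothetical.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_squared(observed):
    """Pearson chi-squared: sum over all cells of (fo - fe)^2 / fe."""
    expected = expected_counts(observed)
    return sum((fo - fe) ** 2 / fe
               for obs_row, exp_row in zip(observed, expected)
               for fo, fe in zip(obs_row, exp_row))

observed = [[150, 100],
            [200, 50]]

chi2 = chi_squared(observed)
print(round(chi2, 2))
# prints: 23.81
```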
 Degrees of freedom for a Chi-squared statistic:
• We now have a test statistic: χ² = 23.81
• How do we assign a p-value to this?
• Step 1: calculate the degrees of freedom.
  o   Given the row and column marginal totals, how many
      cells need we fill in before we can do the rest
      automatically?
  o   Answer: 1 in this case, so df = 1.
  o   General answer: df = (r-1)*(c-1), where r is the number
      of rows and c is the number of columns.
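
As a small illustration of the general rule, the following Python sketch (the function name is hypothetical) computes df from the shape of a table stored as a list of rows.

```python
def degrees_of_freedom(table):
    """df = (rows - 1) * (columns - 1) for an r x c contingency table."""
    n_rows = len(table)
    n_cols = len(table[0])
    return (n_rows - 1) * (n_cols - 1)

# The 2x2 tax reform / environment table.
df_2x2 = degrees_of_freedom([[150, 100], [200, 50]])

# The 2x3 gender / party table from section 8.1.
df_2x3 = degrees_of_freedom([[279, 73, 225], [165, 47, 191]])

print(df_2x2, df_2x3)
# prints: 1 2
```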
         p-value for a Chi-squared statistic:
• Assign a p-value to the statistic: χ² = 23.81, df = 1

• Given the degrees of freedom, look up the p-value.
  o Go to Table C on page 670.
  o Go down to the row for df = 1.
  o Move across χ² values to the largest tabled value that is
    smaller than the measured χ².
  o Look up the corresponding p-value at the top of the
    column: p < .001
  o The chi-squared test is always a 1-tailed test: we always
    use the right tail of the distribution.
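
For df = 1 the table lookup can be replaced by a direct calculation, because a chi-squared variable with one degree of freedom is the square of a standard normal variable. The Python sketch below relies on that fact and therefore applies only when df = 1; the function name is hypothetical.

```python
import math

def chi2_right_tail_df1(chi2):
    """Right-tail p-value for a chi-squared statistic with df = 1.

    With df = 1, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)),
    where Z is a standard normal variable.
    """
    return math.erfc(math.sqrt(chi2 / 2))

p_value = chi2_right_tail_df1(23.81)
print(p_value < 0.001)
# prints: True
```

The same function returns roughly .05 for a statistic of 3.84, which is the familiar .05 critical value in the df = 1 row of a chi-squared table.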
           Do your own chi-squared test:
• You watch 50 beachcombers to see if they are wearing
  sandals and if they are wearing shorts
                                wearing shorts?
                                Yes         No           Tot
  sandals?          Yes         20          10           30
                    No          10          10           20
                    Tot         30          20           50

• Q: Does a beachcomber’s chance of wearing sandals depend
  on whether they are wearing shorts?
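
After working the exercise by hand, the result can be compared against a short Python sketch like the one below. It is illustrative only: the function name is hypothetical, and the p-value shortcut is valid only because this table has df = 1.

```python
import math

def chi_squared(observed):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n  # expected count
            total += (fo - fe) ** 2 / fe
    return total

# Rows: sandals (yes, no); columns: shorts (yes, no).
beach = [[20, 10],
         [10, 10]]

chi2_beach = chi_squared(beach)

# Right-tail p-value; this shortcut holds for df = 1 only.
p_beach = math.erfc(math.sqrt(chi2_beach / 2))

print(round(chi2_beach, 2), p_beach > 0.05)
# prints: 1.39 True
```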
  Chi-Squared Tests for more than 2X2 Tables
• Here is a command to run a chi-squared test on the gender
  and partyid data from the 1991 GSS (see section 8.1)

. tabulate gender party [freq=number], chi2

           |              party
    gender | democrat independe republica |        Total
-----------+---------------------------------+----------
    female |       279         73        225 |       577
      male |       165         47        191 |       403
-----------+---------------------------------+----------
     Total |       444        120        416 |       980

          Pearson chi2(2) =    7.0095   Pr = 0.030
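
The Pearson statistic and p-value reported by Stata can be reproduced by hand. The Python sketch below is illustrative only, with hypothetical names; the p-value line uses the closed form exp(-x/2), which is the right-tail probability of a chi-squared distribution with exactly 2 degrees of freedom and does not apply to other df.

```python
import math

def chi_squared(observed):
    """Pearson chi-squared statistic for an r x c table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return sum((fo - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i, row in enumerate(observed)
               for j, fo in enumerate(row))

# Rows: female, male; columns: democrat, independent, republican.
gss = [[279, 73, 225],
       [165, 47, 191]]

chi2_gss = chi_squared(gss)
df_gss = (len(gss) - 1) * (len(gss[0]) - 1)   # (2-1)*(3-1) = 2

# Right-tail p-value, valid for df = 2 only.
p_gss = math.exp(-chi2_gss / 2)

print(round(chi2_gss, 2), df_gss, round(p_gss, 3))
# prints: 7.01 2 0.03
```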
• Add expected cell counts
. tabulate gender party [freq=number], chi2 expected

+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
+--------------------+

           |              party
    gender | democrat independe republica |        Total
-----------+---------------------------------+----------
    female |       279          73       225 |       577
           |     261.4       70.7      244.9 |     577.0
-----------+---------------------------------+----------
      male |       165         47        191 |       403
           |     182.6       49.3      171.1 |     403.0
-----------+---------------------------------+----------
     Total |       444        120        416 |       980
           |     444.0      120.0      416.0 |     980.0

          Pearson chi2(2) =   7.0095   Pr = 0.030
• Add chi-squared contribution of each cell
. tabulate gender party [freq=number], chi2 expected cchi2

+--------------------+
| Key                |
|--------------------|
|      frequency     |
| expected frequency |
| chi2 contribution  |
+--------------------+
           |              party
    gender | democrat independe republica |        Total
-----------+---------------------------------+----------
    female |       279         73        225 |       577
           |     261.4       70.7      244.9 |     577.0
           |       1.2        0.1        1.6 |       2.9
-----------+---------------------------------+----------
      male |       165         47        191 |       403
           |     182.6       49.3      171.1 |     403.0
           |       1.7        0.1        2.3 |       4.1
-----------+---------------------------------+----------
     Total |       444        120        416 |       980
           |     444.0      120.0      416.0 |     980.0
           |       2.9        0.2        3.9 |       7.0

         Pearson chi2(2) =   7.0095    Pr = 0.030
        8.3 When not to do a chi-squared test
1.) Do not do a Chi-squared test when the expected value of a
   cell is less than 5.
age            Party identification
               Democrat      Indep.            Republican      Total
<65            42 (40)       5 (8)             33 (32)        80
≥65             8 (10)       5 (2)             7 (8)          20
total          50            10                40             100

The Problem: The total χ² is 6.28, so p<.05, but 4.5 of the total
  comes from one cell with fe = 2.
(It is okay to do a Chi-squared test if a cell has an expected value above 5
    and an observed value below 5!)
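
A quick screen for this problem is to compute every expected count and flag cells below 5 before trusting the test. The Python sketch below (illustrative only; names are hypothetical) does this for the age by party table and also shows how much of the statistic comes from the flagged cell.

```python
# Rows: under 65, 65 and over; columns: Democrat, Independent, Republican.
observed = [[42, 5, 33],
            [8, 5, 7]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

expected = [[r * c / n for c in col_totals] for r in row_totals]

# Per-cell contributions (fo - fe)^2 / fe to the chi-squared statistic.
contributions = [[(fo - fe) ** 2 / fe for fo, fe in zip(o_row, e_row)]
                 for o_row, e_row in zip(observed, expected)]
chi2_total = sum(sum(row) for row in contributions)

# Cells whose expected count violates the fe >= 5 assumption.
small_cells = [(i, j, fe)
               for i, row in enumerate(expected)
               for j, fe in enumerate(row) if fe < 5]

print(round(chi2_total, 2), small_cells, contributions[1][1])
# prints: 6.28 [(1, 1, 2.0)] 4.5
```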
 A small sample alternative to a chi-squared test

When the sample size is too small for a chi-squared test, you
 may treat the contingency table as a small sample
 comparison of two population proportions.

This means you should do a Fisher’s exact test for population
  proportions.

A Fisher’s exact test will also work okay on large samples, but
  you sometimes will bog down the computer with lengthy
  computations. (This is especially likely to happen when the
  tables are 5X4 or larger).
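
To show what the exact test computes, here is a minimal Python sketch of a two-sided Fisher's exact test for a 2x2 table only. It is an illustration under stated assumptions, not Stata's implementation: the function name is hypothetical, it does not handle larger tables, and it uses one common two-sided convention (summing the probabilities of all tables no more likely than the observed one, with the margins held fixed).

```python
from math import comb

def fisher_exact_2x2(table):
    """Two-sided Fisher's exact p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that cell (1,1) equals x,
        # given fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)    # smallest feasible value of cell (1,1)
    highest = min(row1, col1)       # largest feasible value of cell (1,1)
    p_observed = prob(a)

    # Sum probabilities of all tables as or less likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# The beachcomber table from earlier in the lecture, used as sample input.
p_exact = fisher_exact_2x2([[20, 10], [10, 10]])
print(0 < p_exact <= 1)
# prints: True
```

The loop visits every feasible table, which is why exact tests become slow as tables and sample sizes grow.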
             Fisher’s exact test in STATA
. * output fisher's exact test
. * (not necessary in this case because of large n.)

. tabulate gender party [freq=number], exact

           |              party
    gender | democrat independe republica |        Total
-----------+---------------------------------+----------
    female |       279         73        225 |       577
      male |       165         47        191 |       403
-----------+---------------------------------+----------
     Total |       444        120        416 |       980

           Fisher's exact =                    0.031

(For a comparable chi2 test, chi2 = 7.01 and p = .030)
   When not to do a chi-squared test (continued)
2.) Do not do a Chi-squared test for cell values that are not
   observed frequencies.

      sex          Voted in last election?
                   Yes          No            Total
      women        35%           15%           50%
      men          20%           30%           50%
      total        55%           45%           100%


The Problem: If you use percentages, you misstate the sample
  size as 100.
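
The size of the error depends on the true sample size. The Python sketch below is illustrative only: the lecture does not give the real n for this table, so the value of 1000 respondents is a hypothetical assumption chosen to show the effect.

```python
def chi_squared(observed):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return sum((fo - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i, row in enumerate(observed)
               for j, fo in enumerate(row))

# Percentages entered as if they were counts (implied n = 100).
percent_table = [[35, 15],
                 [20, 30]]

# HYPOTHETICAL true sample size; not given in the lecture.
true_n = 1000
count_table = [[pct * true_n // 100 for pct in row] for row in percent_table]

chi2_from_percents = chi_squared(percent_table)
chi2_from_counts = chi_squared(count_table)

print(round(chi2_from_percents, 2), round(chi2_from_counts, 2))
# prints: 9.09 90.91
```

With the proportions held fixed, the statistic scales in direct proportion to n, so treating percentages as counts understates it whenever the true n exceeds 100 and overstates it whenever the true n is below 100.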
  When not to do a chi-squared test (continued)
3.) Do not do a Chi-squared test to find a difference in
   population proportions for dependent samples.

                  Number supporting death penalty:
     Before       After hearing speech:
     speech:      Yes         No        Total
     Yes          80           20        100
     No           40           60        100
     total        120          80        200
The Problem: You want to know if the speech changed
  people’s opinions. A χ² test would tell you if opinions after
  the speech depend on opinions before the speech.
      Residual Analysis for Chi-Squared Tests
This part of section 8.3 will not be on the exam. I don’t use
  this stuff, but Agresti covers it, and you should have a
  reference in case a referee ever asks for it.

The problem:
  If a Chi-squared test produces a statistically significant
  result, we only know that somewhere in the table the data
  depart from what independence predicts.

  To find the level of statistical significance associated with a
  single cell value, we conduct a residual analysis.
      Residual Analysis for Chi-Squared Tests
Terms:
  residual: ( fo - fe ) The difference between an observed and
  an expected cell frequency.

  adjusted residual: The standardized difference between an
  observed and an expected cell frequency. (Like a z-score for
  cells in independence tests.)


    a.r. = (fo - fe) / sqrt( fe (1 - row proportion)(1 - column proportion) )
     Residual Analysis for Chi-Squared Tests
Example:
            Party identification
sex         Democrat Indep.       Republican Total
female      279(261.4) 73 (70.65) 225(244.9) 577
male        165(182.6) 47 (49.35) 191(171.1) 403
total       444           120      416       980

adjusted residual for cell (1,1)
= (279 - 261.4) / sqrt((261.4)(1 - 444/980)(1 - 577/980))
= 2.295 (treat as a z-score, so one-tailed p = .011)
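
The same calculation can be repeated for every cell. The Python sketch below is illustrative only (the function name is hypothetical); it uses unrounded expected counts, so its result for cell (1,1) differs slightly in the third decimal from the hand calculation, which uses the rounded value 261.4.

```python
import math

def adjusted_residuals(observed):
    """Adjusted residual for each cell:
    (fo - fe) / sqrt(fe * (1 - row proportion) * (1 - column proportion))."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(observed):
        out_row = []
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n
            spread = math.sqrt(fe
                               * (1 - row_totals[i] / n)
                               * (1 - col_totals[j] / n))
            out_row.append((fo - fe) / spread)
        result.append(out_row)
    return result

# Rows: female, male; columns: Democrat, Independent, Republican.
gss = [[279, 73, 225],
       [165, 47, 191]]

residuals = adjusted_residuals(gss)
print(round(residuals[0][0], 2), round(residuals[1][0], 2))
# prints: 2.29 -2.29
```

In a table with only two rows, the two adjusted residuals in each column are equal in size and opposite in sign.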
      Residual Analysis for Chi-Squared Tests
Cautions about the adjusted residual:
  1.) the adjusted residual is like a z-score for a two-sided test
  of the difference between the proportion in the cell and the
  average proportion for all other cells in the column.
  (It is not the z-score for fo - fe.)

  2.) An adjusted residual of “z” = 1.96 for one cell does not
  mean that the whole Chi-squared test is statistically
  significant at the .05 level. (A Chi-squared test adjusts for
  the fact that you are doing df t-tests at the same time.)
