Sociology 601 (Martin) Lecture 14: November 4-6, 2008

• Contingency Tables for Categorical Variables (8.1)
    o some useful probabilities and hypothesis tests based on contingency tables
    o independence redefined
• The Chi-Squared Test (8.2)
• When to use Chi-squared tests (8.3)
    o chi-squared residuals

Definitions for a 2X2 contingency table

• Let X and Y denote two categorical variables
    o Variable X can have one of two values: X = 1 or X = 2
    o Variable Y can have one of two values: Y = 1 or Y = 2
• nij denotes the count of responses in cell (i, j) of the table

Structure for a 2X2 contingency table

• Values for X and Y variables are arrayed as follows:

                      Value of Y:
                      1             2
    Value    1        n11           n12           total X=1
    of X:    2        n21           n22           total X=2
                      total Y=1     total Y=2     (grand total)

Some useful definitions

• The unconditional probability P(Y = 1):
    = (n11 + n21) / (n11 + n12 + n21 + n22)
    = the marginal probability that Y equals 1
• The conditional probability P(Y = 1, given X = 1):
    = n11 / (n11 + n12)
    = P((Y = 1) | (X = 1))
• The joint probability P(Y = 1 and X = 1):
    = n11 / (n11 + n12 + n21 + n22)
    = P((Y = 1) ∩ (X = 1))
    = the cell probability for cell (1,1)

Example:

                              Support Law Enforcement?
                              Yes      No      Tot
    Support health    Yes     292      25      317
    care spending?    No       14       9       23
                      Tot     306      34      340

    o What is the unconditional probability of favoring increased spending on law enforcement?
    o What is the conditional probability of favoring increased spending on law enforcement for respondents who opposed increased spending on health?
    o What is the joint probability of favoring increased spending on law enforcement and opposing increased spending on health?

Hypothesis tests based on contingency tables:

• Usually we ask: is the distribution of Y when X=1 different from the distribution of Y when X=2?
• Null Hypothesis: the conditional distributions of Y, given X, are equal.
    Ho: P((Y = 1) | (X = 1)) - P((Y = 1) | (X = 2)) = 0
    alternatively, Ho: π(Y|X=1) - π(Y|X=2) = 0
• This type of question often comes up because of its causal implications.
    o For example: “Are childless adults more likely to vote for school funding than parents?”

A confusing new definition for independence

• Previously we used the term independence to refer to groups of observations.
    o “White and hispanic respondents were sampled independently.”
• In this chapter, we use independence to refer to a property of variables, not observations.
    o “Political orientation is independently distributed with respect to ethnicity”
    o Two categorical variables are independent if the conditional distributions of one variable are identical at each category of the other variable.

                Democrat    Independent    Republican    Total
    white          440          140            420        1000
    black           44           14             42         100
    hispanic       110           35            105         250
    Total          594          189            567        1350

Contingency tables in STATA

• The 1991 General Social Survey contains data on Party Identification and Gender for 980 respondents.
    o See Table 8.1, page 250 in A&F
• Here is a program for inputting the data into STATA interactively:

    input str10 gender str12 party number
    female democrat 279
    male democrat 165
    female independent 73
    male independent 47
    female republican 225
    male republican 191
    end

Contingency tables in STATA (continued)

• Here is a command to create a contingency table, and its output:

    . tabulate gender party [freq=number]

               |              party
        gender |  democrat  independe  republica |     Total
    -----------+---------------------------------+----------
        female |       279         73        225 |       577
          male |       165         47        191 |       403
    -----------+---------------------------------+----------
         Total |       444        120        416 |       980

• The following slide adds row, column, and cell percentages:

    . tabulate gender party [freq=number], row column cell

    +-------------------+
    | Key               |
    |-------------------|
    |     frequency     |
    |  row percentage   |
    | column percentage |
    |  cell percentage  |
    +-------------------+

               |              party
        gender |  democrat  independe  republica |     Total
    -----------+---------------------------------+----------
        female |       279         73        225 |       577
               |     48.35      12.65      38.99 |    100.00
               |     62.84      60.83      54.09 |     58.88
               |     28.47       7.45      22.96 |     58.88
    -----------+---------------------------------+----------
          male |       165         47        191 |       403
               |     40.94      11.66      47.39 |    100.00
               |     37.16      39.17      45.91 |     41.12
               |     16.84       4.80      19.49 |     41.12
    -----------+---------------------------------+----------
         Total |       444        120        416 |       980
               |     45.31      12.24      42.45 |    100.00
               |    100.00     100.00     100.00 |    100.00
               |     45.31      12.24      42.45 |    100.00

8.2 Developing a new statistical significance test for contingency tables.

                              support tax reform?
                              Yes      No      Tot
    support          Yes      150     100      250
    environment?     No       200      50      250
                     Tot      350     150      500

• “Is the level of support for the environment dependent on the level of support for tax reform?”
    o If so, these two measures are likely to have some causal link worth investigating.

With a 2x2 table, we can use a two-sample test for independent-sample proportions (STATA reports it as a z-test).

    . prtesti 250 .6 250 .8

    Two-sample test of proportion                      x: Number of obs =      250
                                                       y: Number of obs =      250
    ------------------------------------------------------------------------------
        Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               x |         .6   .0309839                      .5392727    .6607273
               y |         .8   .0252982                      .7504164    .8495836
    -------------+----------------------------------------------------------------
            diff |        -.2        .04                     -.2783986   -.1216014
                 |  under Ho:   .0409878    -4.88   0.000
    ------------------------------------------------------------------------------
            diff = prop(x) - prop(y)                                  z =  -4.8795
        Ho: diff = 0

        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(Z < z) = 0.0000         Pr(|Z| < |z|) = 0.0000          Pr(Z > z) = 1.0000

Moving beyond 2x2 tables:

• Comparing conditional probabilities is fine when there are only two comparisons and two possible outcomes for each comparison.
• The Chi-Square (χ²) test is a new technique that makes comparisons more flexible.
• The χ² test works from a null hypothesis that every cell should have the frequency you would expect if the variables were independently distributed.
• fe is the expected count for each cell.
• fe = total N * unconditional row probability * unconditional column probability
• A test for the whole table will combine tests for fe for every cell.

Testing independence of support for tax reform and environmental spending:

• New Approach: Chi-squared test for independence of attitudes toward taxes and the environment.
• Test statistic:
    o χ² = Σ (fo - fe)² / fe, summed over every cell in the table
    o where fo is the observed count in each cell
    o and where fe is the expected count for each cell, assuming that attitudes toward taxes will be the same for people who support environmental issues as for people who do not support environmental issues.

Assumptions and hypothesis for a chi-squared test:

• Assumptions:
    o two categorical variables (for this course)
    o random sample or stratified random sample
    o fe ≥ 5 for all cells
• Hypothesis: Ho: the two variables are statistically independent.
    o this means that the distribution of each variable is independent of the score of the other variable

Calculating expected cell counts:

• The expected cell count is the count we would expect in a cell if environmental support were identical for tax reform advocates and tax reform opponents, that is, the same as environmental support for the whole sample.
• fe(1,1) = 500 * (350/500) [taxes] * (250/500) [environment] = 175
• fe(1,2) = 75
• fe(2,1) = 175
• fe(2,2) = 75

Using expected cell counts to calculate a test statistic

• The test statistic is analogous to a t-statistic…
    o but the form of the equation makes it difficult to see that the χ² statistic is a difference between the observed and expected values, divided by an estimate of the typical variation we would expect from random sampling error.
• Test statistic:
    o χ² = Σ (fo - fe)² / fe
         = (150 - 175)²/175 + (100 - 75)²/75 + (200 - 175)²/175 + (50 - 75)²/75
         = 3.5714 + 8.3333 + 3.5714 + 8.3333
         = 23.81

Degrees of freedom for a Chi-squared statistic:

• We now have a test statistic: χ² = 23.81
• How do we assign a p-value to this?
• Step 1: calculate the degrees of freedom.
    o Given the row and column marginal totals, how many cells need we fill in before we can do the rest automatically?
    o Answer: 1 in this case, so df = 1.
    o General answer: df = (r-1)*(c-1), where r is the number of rows and c is the number of columns.

p-value for a Chi-squared statistic:

• Assign a p-value to the statistic: χ² = 23.81, df = 1
• Given the degrees of freedom, look up the p-value.
    o Go to Table C on page 670.
    o Go down to the row for df = 1
    o Move across χ² values to the largest tabled value that is smaller than the measured χ²
    o Look up the corresponding p-value at the top of the column: p < .001
    o The chi-squared test is always a 1-tailed test: we always use the right tail of the distribution.

Do your own chi-squared test:

• You watch 50 beachcombers to see if they are wearing sandals and if they are wearing shorts

                          wearing shorts?
                          Yes      No      Tot
    sandals?     Yes       20      10       30
                 No        10      10       20
                 Tot       30      20       50

• Q: Does a beachcomber’s chance of wearing sandals depend on whether they are wearing shorts?

Chi-Squared Tests for more than 2X2 Tables

• Here is a command to run a chi-squared test on the gender and partyid data from the 1991 GSS (see section 8.1)

    . tabulate gender party [freq=number], chi2

               |              party
        gender |  democrat  independe  republica |     Total
    -----------+---------------------------------+----------
        female |       279         73        225 |       577
          male |       165         47        191 |       403
    -----------+---------------------------------+----------
         Total |       444        120        416 |       980

              Pearson chi2(2) =   7.0095   Pr = 0.030

• Add expected cell counts

    . tabulate gender party [freq=number], chi2 expected

    +--------------------+
    | Key                |
    |--------------------|
    |     frequency      |
    | expected frequency |
    +--------------------+

               |              party
        gender |  democrat  independe  republica |     Total
    -----------+---------------------------------+----------
        female |       279         73        225 |       577
               |     261.4       70.7      244.9 |     577.0
    -----------+---------------------------------+----------
          male |       165         47        191 |       403
               |     182.6       49.3      171.1 |     403.0
    -----------+---------------------------------+----------
         Total |       444        120        416 |       980
               |     444.0      120.0      416.0 |     980.0

              Pearson chi2(2) =   7.0095   Pr = 0.030

• Add chi-squared contribution of each cell

    . tabulate gender party [freq=number], chi2 expected cchi2

    +--------------------+
    | Key                |
    |--------------------|
    |     frequency      |
    | expected frequency |
    | chi2 contribution  |
    +--------------------+

               |              party
        gender |  democrat  independe  republica |     Total
    -----------+---------------------------------+----------
        female |       279         73        225 |       577
               |     261.4       70.7      244.9 |     577.0
               |       1.2        0.1        1.6 |       2.9
    -----------+---------------------------------+----------
          male |       165         47        191 |       403
               |     182.6       49.3      171.1 |     403.0
               |       1.7        0.1        2.3 |       4.1
    -----------+---------------------------------+----------
         Total |       444        120        416 |       980
               |     444.0      120.0      416.0 |     980.0
               |       2.9        0.2        3.9 |       7.0

              Pearson chi2(2) =   7.0095   Pr = 0.030

8.3 When not to do a chi-squared test

1.) Do not do a Chi-squared test when the expected value of a cell is less than 5.

                     Party identification
    age        Democrat     Indep.      Republican    Total
    <65        42 (40)      5 (8)       33 (32)         80
    ≥65         8 (10)      5 (2)        7 (8)          20
    total      50           10          40             100

    (expected counts in parentheses)

The Problem: The total χ² is 6.28, so p < .05, but 4.5 of the total comes from one cell with fe = 2.
(It is okay to do a Chi-squared test if a cell has an expected value above 5 and an observed value below 5!)

A small sample alternative to a chi-squared test

When the sample size is too small for a chi-squared test, you may treat the contingency table as a small sample comparison of two population proportions. This means you should do a Fisher’s exact test for population proportions.

A Fisher’s exact test will also work okay on large samples, but you will sometimes bog down the computer with lengthy computations. (This is especially likely to happen when the tables are 5X4 or larger.)

Fisher’s exact test in STATA

    . * output fisher's exact test
    . * (not necessary in this case because of large n)
    . tabulate gender party [freq=number], exact

               |              party
        gender |  democrat  independe  republica |     Total
    -----------+---------------------------------+----------
        female |       279         73        225 |       577
          male |       165         47        191 |       403
    -----------+---------------------------------+----------
         Total |       444        120        416 |       980

               Fisher's exact =                 0.031

(For a comparable chi2 test, chi2 = 7.01 and p = .030)

When not to do a chi-squared test (continued)

2.) Do not do a Chi-squared test for cell values that are not observed frequencies.

               Voted in last election?
    sex        Yes       No        Total
    women      35%       15%        50%
    men        20%       30%        50%
    total      55%       45%       100%

The Problem: If you use percentages, you misstate the sample size as 100.

When not to do a chi-squared test (continued)

3.) Do not do a Chi-squared test to find a difference in population proportions for dependent samples.

    Number supporting death penalty:

                        After hearing speech:
    Before speech:      Yes      No      Total
    Yes                  80      20       100
    No                   40      60       100
    total               120      80       200

The Problem: You want to know if the speech changed people’s opinions. A χ² test would tell you if opinions after the speech depend on opinions before the speech.

Residual Analysis for Chi-Squared Tests

This part of section 8.3 will not be on the exam. I don’t use this stuff, but Agresti covers it, and you should have a reference in case a referee ever asks for it.

The problem: If a Chi-squared test produces a statistically significant result, we only know that somewhere in the table the data depart from what independence predicts. To find the level of statistical significance associated with a single cell value, we conduct a residual analysis.

Residual Analysis for Chi-Squared Tests (continued)

Terms:

residual: (fo - fe)
The difference between an observed and an expected cell frequency.

adjusted residual:
The standardized difference between an observed and an expected cell frequency. (Like a z-score for cells in independence tests.)

    a.r. = (fo - fe) / sqrt( fe * (1 - row proportion) * (1 - column proportion) )

Residual Analysis for Chi-Squared Tests (continued)

Example:

                      Party identification
    sex       Democrat        Indep.         Republican      Total
    female    279 (261.4)     73 (70.65)     225 (244.9)      577
    male      165 (182.6)     47 (49.35)     191 (171.1)      403
    total     444             120            416              980

adjusted residual for cell (1,1)
    = (279 - 261.4) / sqrt( (261.4)(1 - 444/980)(1 - 577/980) )
    = 2.295 (treat as a z-score, so p = .011)

Residual Analysis for Chi-Squared Tests (continued)

Cautions about the adjusted residual:

1.) The adjusted residual is like a z-score for a two-sided test of the difference between the proportion in the cell and the average proportion for all other cells in the column. (It is not the z-score for fo - fe.)

2.) An adjusted residual of “z” = 1.96 for one cell does not mean that the whole Chi-squared test is statistically significant at the .05 level. (A Chi-squared test adjusts for the fact that you are doing df t-tests at the same time.)
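
The hand calculations in sections 8.2 and 8.3 can also be checked outside STATA. What follows is a minimal Python sketch, not part of the original lecture. The function names (expected_counts, chi_squared, right_tail_p, adjusted_residual) are made up for illustration, and the p-value helper is an assumption-laden shortcut that only covers df = 1 and df = 2, where simple closed forms are available. Because it uses unrounded expected counts, its results may differ from the slide values in the last digit.

```python
import math


def expected_counts(table):
    """Expected counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_squared(table):
    """Return (chi-squared statistic, degrees of freedom) for a table of counts."""
    fe = expected_counts(table)
    stat = sum(
        (fo - e) ** 2 / e
        for observed_row, expected_row in zip(table, fe)
        for fo, e in zip(observed_row, expected_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def right_tail_p(stat, df):
    """Right-tail p-value, using closed forms for df = 1 and df = 2 only."""
    if df == 1:
        # A chi-squared variable with 1 df is a squared standard normal.
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        # A chi-squared variable with 2 df is exponential with mean 2.
        return math.exp(-stat / 2)
    raise ValueError("this sketch only handles df = 1 or df = 2")


def adjusted_residual(table, i, j):
    """Adjusted residual for cell (i, j), with rows and columns indexed from 0."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    fe = row_totals[i] * col_totals[j] / n
    row_prop = row_totals[i] / n
    col_prop = col_totals[j] / n
    return (table[i][j] - fe) / math.sqrt(fe * (1 - row_prop) * (1 - col_prop))


if __name__ == "__main__":
    # Rows: support environment (yes, no); columns: support tax reform (yes, no).
    taxes = [[150, 100], [200, 50]]
    # Rows: female, male; columns: democrat, independent, republican.
    gender_party = [[279, 73, 225], [165, 47, 191]]

    for name, table in [("taxes", taxes), ("gender/party", gender_party)]:
        stat, df = chi_squared(table)
        print(name, round(stat, 4), df, right_tail_p(stat, df))

    print("adjusted residual, cell (1,1):", adjusted_residual(gender_party, 0, 0))
```

The same functions can be applied to the beachcomber exercise by passing its four cell counts as a 2X2 list of rows.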
