Business Statistics 41000
Autumn 2006 Final Exam

Name ____________Solutions_____________________ (please print)

DO NOT TURN THIS PAGE OVER UNTIL YOU ARE TOLD TO DO SO.

You have 3 hours to complete the exam. When time is called please stop writing immediately. The layout of the exam, including the number of questions and the point value of each question, is on the next page. Unless otherwise indicated, each part of each question is worth 2 points. You may use a calculator and two 8.5 by 11 inch “cheat sheets”. No other reference materials are allowed. Please show your work and clearly indicate your answer in the space provided. You may be awarded partial credit in case of arithmetic errors or incomplete answers, but only if your work is legible. Unsupported answers (e.g., just writing “fail to reject”) receive zero credit.

Students in my class are required to adhere to the standards of conduct in the GSB Honor Code and the GSB Standards of Scholarship. The GSB Honor Code also requires students to sign the following GSB Honor pledge: I pledge my honor that I have not violated the Honor Code during this examination. I further understand that discussing the contents of this exam with anyone prior to all students completing the exam is a violation of the Honor Code.

Sign here to acknowledge: _____________________________________

There are 8 questions.
Question 1, 10 parts, 20 points _____
Question 2, 9 parts, 18 points _____
Question 3, 6 parts, 12 points _____
Question 4, 4 parts, 8 points _____
Question 5, 10 T/F questions, 10 points _____
Question 6, 10 parts, 21 points _____
Question 7, 10 parts, 19 points _____
Question 8, 6 parts, 12 points _____
Total 120 points

Summary statistics for exam scores:
Mean 83, Median 86, Std. Dev. 16, 75th %tile 95, 25th %tile 71

Note to Autumn 2006 students: Because this was a challenging exam, the graders asked (and I approved) granting of ½-point partial credit for certain questions. Actual credit assigned may therefore deviate slightly from the point allocations indicated in these solutions.

Question 1

Below is a scatter plot. Each observation corresponds to an NFL football game. Before each game, one team is considered the favorite (the team considered more likely to win) and the other the underdog. Before each game, oddsmakers set a number called the point spread. Suppose you place a bet that the favorite will win the game. To win your bet, the favorite must “beat the spread”. That is, they must beat the underdog by more points than the spread. On the horizontal axis below is the spread, set before the game. On the vertical axis is diff, which is the points scored by the favorite minus points scored by the underdog during the actual game (a positive value for diff means the favorite won).

[Scatter plot of diff vs. spread, with the following solution annotations: If I had to “draw a line” through this scatter plot by hand, it would look something like this (slope about 1, intercept about zero). About 95% of points would be within +/- 2*(14) points of the line. Only TWO teams that had spread > 10 before the game actually lost! About nine favorites lost by 30 or more points.]

(a) In this sample, about how many times did the favorite lose by 30 or more points?
(i) 1 (ii) 5 (iii) 9 (iv) 21
Answer: (iii) 9 (stated in the plot annotation).

(b) In this sample, about how many teams favored by 10 or more points (spread > 10) lost the actual game?
(i) 0 (ii) 2 (iii) 7 (iv) 17
Answer: (ii) 2 (stated in the plot annotation).

(c) The sample mean of diff is:
(i) positive (ii) negative (iii) about zero
Answer: (i) positive. The hand-drawn line has slope about 1 and intercept about zero, and the spread is positive, so diff averages well above zero.

(d) Which of the two variables has a larger sample variance?
(i) spread (ii) diff (iii) their variances are roughly equal
Answer: (ii) diff. Since diff is roughly spread plus scatter with a standard deviation of about 14, Var(diff) is Var(spread) plus the error variance, which is larger.

(e) The sample correlation between spread and diff is closest to
(i) -0.65 (ii) -0.10 (iii) 0.25 (iv) 0.89

(f) In a regression of diff on spread, the intercept estimate, a, is closest to
(i) -40 (ii) 0 (iii) 30 (iv) 50
Answer: (ii) 0 (the hand-drawn line has intercept about zero).

(g) In a regression of diff on spread, the slope estimate, b, is closest to
(i) -3 (ii) 0 (iii) 1 (iv) 3
Answer: (iii) 1 (the hand-drawn line has slope about 1).

(h) In a regression of diff on spread, the estimated standard deviation of the errors, se, is approximately
(i) -7 (ii) 1 (iii) 14 (iv) 28
Answer: (iii) 14 (about 95% of points fall within +/- 2*(14) of the line).

Now suppose we believe this data is representative of the “true” relationship between point spreads and actual scores in NFL football games (the “population”). Also suppose we are willing to assume the errors are iid Normal. Suppose the Chicago Bears are favored by 14 points in this week’s game (spread = 14).

(i) What is the 95% plug-in predictive interval for the score difference in the actual game (Bears’ points minus opponent’s points)?

a + b*(spread) +/- 2*se = 14 +/- 2*(14) = (-14, 42)

(j) If we believe our estimates for a, b, and se are correct and the errors are iid Normal, what is the (approximate) probability the Bears win the game?

If we believe our model, diff ~ N( 14, 14² ).
So Prob( diff > 0 ) = Prob( a normal RV falls above one SD below its mean ) = .84

Question 2

The country returns dataset we’ve used this quarter consists of monthly returns on portfolios of assets traded on major stock exchanges in various countries. Below are the summary statistics for the Germany portfolio.

Summary measures for germany:
Count      107
Mean       0.0129
Median     0.0100
Variance   0.0031
Skewness  -0.1116

(a) Construct a 95% confidence interval for the “true” expected return on the Germany portfolio.

0.0129 +/- 2*[ sqrt( .0031 / 107 ) ] = ( .002135, .02367 )

(b) Construct a 95% plug-in predictive interval for the next monthly German return.
0.0129 +/- 2*sqrt(.0031) = ( -0.09846 , 0.1243 )

(c) Suppose we want to test the claim that: “In any given month, there is a 50% chance that the Germany portfolio has a higher return than the France portfolio.” During our 107-month sample, there were 48 months in which the Germany portfolio had a higher return than the France portfolio. Test the appropriate null hypothesis at the 5% level.

po = .5, phat = 48/107 = .4486
z = (.4486 - .5) / sqrt( .5*(1-.5)/107 ) = -1.063
FAIL TO REJECT

Below are the results from two statistical tests run on the German data in StatPro. On the left is the “Runs Test for Randomness”. In the Runs Test, the null hypothesis is that the data are iid. On the right is the “Chi-Square Test of Normality”. In this test, the null hypothesis is that the data are normally distributed.

Runs Test Results for germany          Test of normal fit
Number of obs          107             Chi-square statistic  18.131
Number above cutoff     60             p-value                0.034
Number below cutoff     47
Number of runs          59
E(R)                53.710
Stdev(R)             5.071
Z-value              1.043
p-value (2-tailed)   0.297

In class we’ve talked about the iid Normal model for stock returns. Naturally, this entails two assumptions: [1] returns are iid, and [2] returns are normally distributed. Using the tests above is one way to check that these assumptions are reasonable based on the returns we have observed.

(d) For the Runs Test, do you reject the null hypothesis at the 5% level? What does this test tell you about our assumptions ([1] and/or [2])? Give a brief explanation.

The p-value for the runs test is > .05, so we FAIL TO REJECT. (1 point) This tells you that, based on the data, we don’t have definitive evidence that the returns are NOT iid. (1 point)

(e) For the Chi-Square Test of Normality, do you reject the null hypothesis at the 5% level? What does this test tell you about our assumptions ([1] and/or [2])? Give a brief explanation.

The p-value for the normality test is < .05, so we REJECT. (1 point) This tells you that, based on the data, we have evidence that the Germany returns are NOT normally distributed. (1 point)

(f) Which of the time series plots below shows the Germany returns? C (Answer A, B, or C)

[Three time series plots, labeled A, B, and C, omitted. Solution annotations: A — the results of the runs test suggest the data should look iid, and this series is not iid. B — all of the stock return data we’ve studied this quarter has been continuous, and this series is discrete. C is the answer.]

Each of the statistical tools we’ve developed this quarter depends on a set of assumptions we make about the data. If those assumptions are violated, the results you get can be very misleading.

(g) Does the confidence interval you constructed in part (a) require assumption [1], [2], both, or neither? Based on your answers to parts (d) and (e), is this confidence interval valid? Briefly explain.

The confidence interval requires [1], but not [2]. (1 point) Since we can’t reject that the data are iid, the confidence interval is probably ok. (1 point)
(Note: The reason we don’t need [2] for the confidence interval is the Central Limit Theorem, which tells us that if we have a reasonably-sized sample and the data are iid, xbar will be normal regardless of whether the individual observations in our data are normal.)

(h) Does the plug-in predictive interval you constructed in part (b) require assumption [1], [2], both, or neither? Based on your answers to parts (d) and (e), is this predictive interval valid? Briefly explain.

The plug-in predictive interval requires BOTH [1] and [2]. (1 point) Since in part (e) we rejected the null hypothesis that the data are normally distributed, our predictive interval is likely NOT valid. (1 point)
(Note: The reason you actually NEED normality for the predictive interval is that we’re trying to predict one single outcome. So there’s no “averaging” going on, and the CLT doesn’t save you!)

(i) Would your answers to (g) and/or (h) change if we had 7 observations instead of 107? Again give a brief explanation.
Yes: The confidence interval in part (g) would no longer be valid. (1 point) This is because we need a reasonably large sample to apply the Central Limit Theorem, and n = 7 observations is not enough. (1 point; to get credit you must mention the Central Limit Theorem)

Question 3

On April 2, 2007, the Chicago Cubs will open their season with a three-game series against the Cincinnati Reds. Like a lot of baseball fans, I am not sure how good the Cubs will be next year. Suppose I think there are three possibilities:

C = -1 if the Cubs are a BAD team
  =  0 if the Cubs are an AVERAGE team
  =  1 if the Cubs are a GOOD team

Based on what I know right now, I assign the following probability distribution for C:

  c    p(c)
 -1    0.25
  0    0.40
  1    0.35

Suppose I’m sure the Reds will be an average team next season. If the Cubs are also an average team, they have a 50% chance to win each game the two teams play. If the Cubs are good this goes up to 65%, while if they are bad it is only 35%. Note that these probabilities are for EACH game the two teams play. Also suppose I am willing to assume that outcomes in different games are iid.

(a) Let S = 1 if the Cubs sweep their season-opening series against the Reds (meaning they win all three games). If we assumed the Cubs are a good team, what is the probability they sweep the series, p(S=1|C=1)?

.65³ = .274625

Explanation: Since games are iid and, given the Cubs are good, they have a .65 probability of winning each game, the probability of three wins in a row is .65³.

(Note: Please don’t report answers to six decimal places when you’re actually taking the exam!! I’m using Excel to write these solutions and am reporting all six decimal places to avoid rounding errors. In practice, you will not be counted off for reasonable rounding errors when your exam is graded.)

(b) What is the probability the Cubs are a bad team AND they sweep the series, p(S=1,C=-1)?
Similar to part (a), p(S=1|C=-1) = .35³ = .042875
So p(S=1, C=-1) = p(S=1|C=-1)*p(C=-1) = .042875*.25 = .010719

(c) What is the marginal probability the Cubs sweep the series, p(S=1)? [Hint: It may help you to write out the joint distribution of C and S in our usual two-way table format on a separate sheet, but you don’t have to.]

Here’s the relevant row of the joint distribution:

           C = -1     C = 0     C = 1
S = 1    0.010719      0.05   0.096119

pS(1) = 0.010719 + 0.05 + 0.096119 = 0.156838

Explanation: I filled in the “S=1 row” of the joint table similarly to part (b), then added across C to find the marginal probability of S=1.

(d) Suppose the Cubs do sweep the series with the Reds. What is the probability they are a good team, p(C=1|S=1)?

By definition, “Conditional = Joint/Marginal”, or in this case p(C=1|S=1) = p(C=1,S=1) / p(S=1).
From the table above, p(C=1|S=1) = 0.096119/0.156838 = 0.612856

Intuition: One way to interpret conditional probability is how the probabilities we assign would change based on observed outcomes. Going into the series, we thought there was a 35% chance the Cubs were good. Sweeping the series is a favorable indicator of the Cubs’ ability, since it’s much more likely a good team would sweep than a bad team. So we now think there’s a 61% chance the Cubs are good!

(e) Major League Baseball teams play a total of 162 games each season. Let G be the number of games the Cubs win next year. Suppose (unrealistically) we believe games are iid and that the Cubs will have a 60% chance to win every game. What is the distribution of G?

Binomial( 162, .6 ) (1 point for saying “Binomial”, 1 point for correct n and p)

(f) Using our “empirical rule” approximation and under the same assumptions as part (e), give an interval that is (approximately) 95% likely to contain the number of wins the Cubs have next season.
We know that if Y ~ Binomial(n,p), then E(Y) = np and Var(Y) = np(1-p).
So E(G) = 162*.6 = 97.2 and Var(G) = 162*.6*.4 = 38.88
97.2 +/- 2*sqrt(38.88) = ( 84.73 , 109.67 )
Obviously this is an approximation, since you can’t win .73 or .67 of a game!!

Question 4

After acing your business statistics course and reading about how to count cards online, you decide to move to Las Vegas to gamble for a living. Let’s suppose that you’ve gotten very good at one particular card game. Let W be a random variable equal to your net winnings (winnings minus your original bet, in dollars) for each hand that you play when you place a $1 bet. Suppose that E(W) = .0125 and Var(W) = .25. If you place a bigger bet, your net winnings for that hand are b*W, where b is the size of your bet. If you play n times, your total net winnings are T = W1 + W2 + … + Wn, where Wi is your winnings on the ith hand. Assume that your winnings on different hands are iid, and that each Wi has the same distribution as W (defined above).

(a) Suppose you bet $60 per hand. For each hand, what is the expected value and variance of your winnings?

X = winnings when you bet $60 = 60*W
By our linear formulas, E(X) = 60*E(W) = 0.75 and Var(X) = (60)²Var(W) = 900 (1 point each)

(b) Based on the time it takes to deal the cards and play out each hand, suppose that you play 50 hands per hour. Assuming you bet $60 on each hand, give an interval which is 95% likely to contain your total net winnings after one hour.

When you play 50 hands, T = X1 + X2 + … + X50, where each Xi is iid and has the mean and variance from part (a). Therefore,
E(T) = E(X1 + … + X50) = E(X1) + E(X2) + … + E(X50) = 50*(.75) = $37.50
Var(T) = Var(X1 + … + X50) = Var(X1) + … + Var(X50) = 50*900 = 45,000
So the 95% interval is 37.50 +/- 2*sqrt(45000) = ( -386.76 , 461.76 )

(c) Now suppose you play this game 40 hours per week for one year. (Whew, this is starting to sound like work.) There are 52 weeks in a year.
What is your expected income per year from gambling? [1 point]

50 hands/hour * 40 hours/week * 52 weeks/year = you play 104,000 hands per year
By the same calculation as in part (b), your expected annual income is 104,000*(.75) = $78,000
[Full credit if you got 104,000 hands per year but multiplied by E(W) = .0125 instead of E(X) = .75]

(d) Your friend (also a GSB student, but he hasn’t taken my class) tells you, “Gambling for a living sounds like fun, but doing the same thing for 40 hours a week is too much like working.” Instead of playing for 40 hours each week, he says you should play 8 hours per week and bet $300 on each hand. He claims this would result in the same expected annual income, and you’d have a lot more time to party! Is he right? What would change if you followed your friend’s advice? Provide calculations to back up your answer. [3 points]

Your friend is correct that your expected income is the same. However, since you are now betting $300 on each hand, the standard deviation (or variance) of your income is much higher! [1 point for recognizing standard deviation changes, 1 point each for calculating the standard deviation of your income in each case.]

In part (c), SD( annual income ) = sqrt[ 104,000*Var(X) ] = sqrt( 104,000*900 ) = $9,674.71
Y = winnings per hand when you bet $300 = 300*W, so Var(Y) = (300)²Var(W) = 22,500
If you only play 8 hours/week, it’s now 50*8*52 = 20,800 hands per year, but since the variance of your winnings on each hand is much larger,
SD( annual income ) = sqrt[ 20,800*Var(Y) ] = $21,633.31 !!

Aside: What’s going on here? Well, when you bet $10 on a single hand, the variance of your winnings is (10)²Var(W) = 100*Var(W), while if you play ten iid hands betting $1 each time, the variance is only 10*Var(W). Intuitively, when you play ten iid hands, your wins and losses will tend to “cancel out”, so variance is smaller!

Question 5

True or False. Clearly print either T or F in the slot ___ before each statement.
Each correct answer is worth ONE POINT.

(a) __F__ (Adding a constant leaves sample variance unchanged!)
If we add 7 to each value of a variable in our sample, the sample variance is increased by 49.

(b) __F__ (Sample correlation has NO units.)
Suppose we observe a sample of people in the workforce. For each person, if x is age in years and y is income in dollars, then the sample correlation between x and y is measured in year-dollars.

(c) __F__ (E(X) is a “weighted average” of outcomes, weighted by probabilities.)
For a discrete random variable X, the expected value E(X) is the outcome with the highest probability of occurring.

(d) __T__ (Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y), while Var(X-Y) = Var(X) + Var(Y) - 2Cov(X,Y))
If X and Y are random variables and have negative correlation, then the variance of their sum, X+Y, is smaller than the variance of their difference, X-Y.

(e) __T__ ( 1 – P(both successes) – P(both failures) = 1 - .5² - .5² = .5 )
If we conduct two independent Bernoulli trials and each has a .5 probability of success, the probability that EXACTLY ONE out of the two trials is a success is .5.

(f) __F__ ( The error is definitely NOT independent of Y!! )
In the simple regression model, Y = α + βX + ε, the error ε is assumed independent of both the regressor X and the dependent variable, Y.

(g) __F__ (As in Question 2, part (h): the plug-in interval predicts one single outcome, so the CLT does not help and normal errors are needed.)
In the simple regression model, suppose we are NOT willing to assume the errors ε are normal. The 95% plug-in predictive interval, a + bX +/- 2se, should still be valid provided we have a large enough sample.

(h) __F__ (The sampling distribution is the probability distribution of our estimator, not the parameter.)
In statistical inference, the sampling distribution is the probability distribution of the unknown parameter we are trying to estimate.
(i) __T__ (Probability of being MORE than 2 SD’s from the mean is < .05)
For an unbiased estimator with a normal sampling distribution, a p-value of less than .05 means the estimate was more than TWO standard errors away from the hypothesized value.

(j) __T__ Suppose we just conducted a statistical test and that we rejected the null hypothesis at the 5% level. Assuming the assumptions underlying the test were correct (for example, the data were i.i.d.), one way to interpret the phrase “at the 5% level” is as the probability we were wrong: that is, we were willing to admit there is a 5% probability that we rejected the null when it was actually true.
(The “level” of a test is often referred to as the probability of a “Type I error”; that is, rejecting the null when it is actually true. Think of it this way: we said we’d reject at the 5% level if |z| > 2, but there IS a 5% probability that a normal r.v. could be more than 2 s.d.’s away from its mean by pure chance!)

Question 6

Capital punishment (the practice of executing people convicted of crimes, usually murder) is highly controversial but still practiced in most of the United States. We are interested in investigating the relationship between capital punishment and violent crime. Suppose we have the following data:

mrdrate_i = Murders per 100,000 population during a particular year in a given state
exec_i    = Number of executions performed in the state in that year
unemp_i   = Percentage unemployment rate in the state during that year

We have data for 50 states plus the District of Columbia in three years (1987, 1990, and 1993), for a total of 153 observations. The following table shows regression results from StatPro:

Results of multiple regression for mrdrate

Summary measures
Multiple R     ??
R-Square       ??
Adj R-Square   0.05
StErr of Est   8.96

ANOVA Table
Source        df       SS       MS        F      p-value
Explained      2    799.8    399.9    4.9798    0.0081
Unexplained  150  12045.5     80.3

Regression coefficients
           Coefficient   Std Err
Constant      0.35        2.69
exec          0.17        0.19
unemp         1.26        0.44

(a) Using the regression results above, test the null hypothesis that “controlling for state-wide economic conditions, the presence of capital punishment has no impact on the murder rate”. Do you reject the null hypothesis at the 5% level?

The statement can be translated as Ho: β1 = 0. So we are comparing an estimate, b1 = .17, to the hypothesized value, 0.
Therefore z = (.17 – 0)/.19 = .895
FAIL TO REJECT

(b) The p-value associated with your test from part (a) is:
(i) .954 (ii) .396 (iii) .171 (iv) .032
Answer: (ii). Explanation: Since the z-value is a little less than one, the p-value must be a little greater than .32.

(c) What is the R-square of this regression?

R² = (Explained SS)/(Total SS) = Explained/(Explained + Unexplained) = 799.8 / ( 799.8 + 12045.5 ) = .0622

(d) In a given year, suppose the state of Texas performs 22 executions and has a statewide unemployment rate of 8.4%. Construct a 95% plug-in predictive interval for the number of murders per 100,000 population.

a + b1(22) + b2(8.4) +/- 2*se = 14.674 +/- 2*8.96 = ( -3.246 , 32.594 )

When I ran this regression, I also asked StatPro to output columns of Fitted Values and Residuals. Below are the three observations corresponding to the District of Columbia in 1987, 1990, and 1993 (‘9’ is the state code for D.C. in this dataset).

state  year  mrdrate  exec  unemp  Fitted Values  Residuals
  9     87     36.2     0    6.3
  9     90     77.8     0    6.6
  9     93     78.5     0    8.5        ??            ??

(e) For the 1993 observation (last row), what numbers should appear in the “Fitted Values” and “Residuals” columns?
Fitted value = .35 + .17(0) + 1.26(8.5) = 11.06
Residual = 78.5 – 11.06 = 67.44

(f) Our usual assumption for regression models is that the errors satisfy εi ~ iid N(0, σ²). If we believe this assumption here, about how many standard deviations is the residual you calculated in part (e) away from its mean?

Residuals have mean zero, and se is our estimate of σ, so the above residual is approximately (67.44 – 0)/8.96 = 7.53 s.d.’s above the mean!!

Early in the class we talked about how outliers can affect sample means and variances. They can also have a HUGE impact in regression analysis! Below is the scatter plot of murder rates versus unemployment. The three District of Columbia observations we saw on the previous page are circled.

[Scatter plot of mrdrate (0–80) vs. unemp (2–14), correlation = 0.240, with the D.C. observations for 1987, 1990, and 1993 circled as extreme outliers.]

[Note: If you’ve been listening to me all quarter, this plot would have been one of the FIRST things you looked at!!]

The table below shows the same regression, but with the three D.C. observations omitted from the sample.

Results of multiple regression for mrdrate

Summary measures
Multiple R     ??
R-Square       ??
Adj R-Square   0.2086
StErr of Est   ??

ANOVA Table
Source        df      SS       MS        F      p-value
Explained      2    447.8    223.9    20.634    0.0000
Unexplained  147   1595.0     10.9

Regression coefficients
           Coefficient   Std Err
Constant      2.56        0.99
exec          0.30        0.07
unemp         0.67        0.16

(g) What is se for the new regression? [Note: With the three D.C. observations omitted, there are now 150 observations.]

se = sqrt[ Unexplained SS / (n-k-1) ] = sqrt( 1595 / 147 ) = 3.294

(h) Based on the new results (with D.C. omitted), construct a 95% confidence interval for β1, the coefficient on exec in this multiple regression model.

b1 +/- 2*se(b1) = .30 +/- 2*(.07) = ( .16, .44 )

Note: Recognize that zero is OUTSIDE this 95% CI, so if we did the hypothesis test from part (a) again, we would REJECT the null that executions have no relationship with murder rates.
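The part (g) and (h) arithmetic can be checked with a short script (a sketch only: the numbers are copied from the printed StatPro tables, and the variable names are mine):

```python
import math

# Values copied from the regression table with D.C. omitted
unexplained_ss = 1595.0      # Unexplained SS from the ANOVA table
n, k = 150, 2                # observations and regressors
b1, se_b1 = 0.30, 0.07       # coefficient and std err for exec

# Part (g): residual standard error, se = sqrt(Unexplained SS / (n - k - 1))
se = math.sqrt(unexplained_ss / (n - k - 1))

# Part (h): 95% confidence interval for beta1, b1 +/- 2*se(b1)
ci_low, ci_high = b1 - 2 * se_b1, b1 + 2 * se_b1

print(round(se, 3))                          # 3.294
print(round(ci_low, 2), round(ci_high, 2))   # 0.16 0.44
```

Since zero lies outside (0.16, 0.44), the implied z-statistic 0.30/0.07 ≈ 4.3 confirms the rejection noted above.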
(i) Now compare the two sets of regression results. Remember, the only difference is that the three District of Columbia observations were included in the table at the beginning of the problem, while the regression on the previous page has the D.C. observations omitted. Suppose you went back and re-did part (a) after throwing out the D.C. observations. How would your answer change? In particular, how does your conclusion about the relationship between murder rates and capital punishment change when these three data points are excluded? Why does this happen? Briefly explain. Your answer to (i) is worth 3 points.

• If you throw out the D.C. observations and re-run the regression, you would now conclude the effect of capital punishment is statistically significant. Why?...
• Notice that D.C. has ZERO executions (no capital punishment). D.C. also has an insanely LARGE murder rate in all three years, particularly ’90 and ’93 (see part f).
• Therefore, when you discard the D.C. observations, the relationship between executions and murder rate looks MORE POSITIVE (and in this case also turns out to be statistically significant).
[1 point for each statement]

Note: BE VERY CAREFUL how you interpret this regression. First off, obviously the D.C. observations are outliers. In general, there is no clear-cut answer as to whether you should throw them out, but either way you must be aware of the influence they have on your results (this is why I teach you to LOOK AT YOUR DATA)!! It’s also misleading to say in part (j) that the model “fits better”, because obviously throwing out extreme points will make your model look like a better fit! Also, remember that correlation is not causation: even if you throw D.C. out, you should NOT say “capital punishment causes murders”. It could be the case that, over time, certain states had higher violent crime rates to begin with, and adopted capital punishment as a way to address the problem.
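The leverage a few extreme points can exert is easy to demonstrate with a toy example (synthetic numbers, not the exam dataset): a handful of points that lie on a line, plus one D.C.-style observation with a low x-value and a huge y-value, are enough to change the sign of a least-squares slope.

```python
# Least-squares slope for a simple regression, computed from scratch
def slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

x = [0, 1, 2, 3, 4, 5]
y = [2.6, 2.9, 3.2, 3.5, 3.8, 4.1]   # lies exactly on a line with slope 0.3

clean = slope(x, y)                        # ~0.3
with_outlier = slope(x + [0], y + [40.0])  # add one extreme point at x = 0

print(clean)         # ~0.3
print(with_outlier)  # negative: the single outlier flips the estimated slope
```

This is the same mechanism at work in the exam data, just in the opposite direction: D.C.'s huge murder rate at zero executions drags the exec coefficient down when it is included.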
(j) How do R-square and se change when we throw out the D.C. observations, and why? Briefly explain.

se is the sample standard deviation of the residuals. When you throw out the three large residuals, se decreases.
R-square is (Explained SS)/(Explained SS + Unexplained SS). When you throw out the three large residuals, “Unexplained SS” goes down, and thus R-square increases.
[1 point each; good intuitive explanations are ok.]

Question 7 (Simpson’s Paradox)

Suppose a certain university has two programs: Engineering, and Arts & Sciences. Students who wish to attend the university must choose which program to apply to (they cannot apply to both programs). They then are either accepted or rejected. For each applicant, define the following random variables:

E = 1 if the person applied to Engineering and 0 if they applied to Arts & Sciences
A = 1 if the person is accepted and 0 if they are rejected
G = 1 if the applicant is female, and 0 if male

The admissions office has supplied us with some data, which we have used to construct the following probability model.

For female applicants (G=1), we have:     For male applicants (G=0), we have:

              E                                         E
           0     1                                  0     1
A   0    .08   .48                        A   0   .32   .12
    1    .12   .32                            1   .48   .08

Half of all applicants are women, p( G = 1 ) = .5

(a) Without knowing what program she applied to, what is the probability that a female applicant is accepted, p( A=1 | G=1 )? [Hint: The table on the left gives you joint probabilities for E and A given that G=1.]

.44 (the marginal prob. of A=1 from the left-hand table)

(b) For a male applicant (G=0), without knowing which program he applied to, what is the distribution of A?

Bernoulli(.56) (the marginal prob. of A=1 from the right-hand table is .56; you could also have written out the values (0,1) and their probabilities.)

A recent study published in a major news magazine found that male applicants are more likely to be accepted at this university than female applicants. This results in some very unpleasant publicity for the university.
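The marginal and conditional acceptance probabilities at issue here can all be computed directly from the two joint tables (a sketch; the dictionaries below simply transcribe the tables, keyed by (A, E)):

```python
# Joint distributions p(A=a, E=e | G), copied from the problem's tables.
# Keys are (A, E): A = 1 if accepted, E = 1 if applied to Engineering.
female = {(0, 0): .08, (0, 1): .48, (1, 0): .12, (1, 1): .32}  # G = 1
male   = {(0, 0): .32, (0, 1): .12, (1, 0): .48, (1, 1): .08}  # G = 0

def p_accept(tbl):
    """Marginal p(A=1): sum the accepted row over both programs."""
    return tbl[(1, 0)] + tbl[(1, 1)]

def p_accept_given_program(tbl, e):
    """Conditional p(A=1 | E=e) = joint / marginal."""
    return tbl[(1, e)] / (tbl[(0, e)] + tbl[(1, e)])

print(p_accept(female), p_accept(male))    # ~0.44 vs ~0.56: women look worse off
print(p_accept_given_program(female, 1),
      p_accept_given_program(male, 1))     # both ~0.4: Engineering treats genders alike
print(p_accept_given_program(female, 0),
      p_accept_given_program(male, 0))     # both ~0.6: Arts & Sciences does too
```

Each program admits both genders at the same rate; as the parts below work out, the overall gap comes entirely from which program applicants choose.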
(c) Based only on your answers to (a) and (b), could this study be correct?

Yes, it looks like female applicants have a lower probability of being accepted than males.

The deans of the two programs consult with each other. Each assures the other that a female applicant is just as likely to be accepted as a male. They also both feel that the quality of female and male applicants is comparable. They think something must be wrong with the study.

(d) Given that an applicant is female and applied to the engineering program, what is the probability she is accepted, p( A=1 | E=1, G=1 )?

.32 / (.32 + .48) = .4 (This is just p(A=1|E=1) from the left-hand table)

(e) Given that an applicant is male and applied to the engineering program, what is the probability he is accepted, p( A=1 | E=1, G=0 )?

.08 / (.08 + .12) = .4 (This is just p(A=1|E=1) from the right-hand table)

(f) Without knowing the gender of an applicant, what is the probability s/he is accepted into the engineering program?

.4 (You don’t have to do any math, since if the conditional probabilities are equal for both genders, the marginal must be the same.)

(g) For the arts & sciences program, does the probability of being accepted depend on whether an applicant is male or female? (That is, are p( A=1 | E=0, G=1 ) and p( A=1 | E=0, G=0 ) the same?)

No. Similar to parts (d) and (e):
For women, .12/(.12 + .08) = .6
For men, .48/(.48 + .32) = .6

(h) Given that an applicant is female, what is the probability she applies to the engineering program? [1 point]

.8 (This is just the marginal probability of E=1 from the left-hand table)

(i) Given that an applicant is male, what is the probability he applies to the engineering program? [1 point]

.2 (This is just the marginal probability of E=1 from the right-hand table)

(j) Does this university discriminate against women? Explain. [3 points]

No.
Even though we observe in part (c) that female applicants are less likely to be admitted (if we don’t control for which program they applied to), we saw in parts (d)-(g) that each program is actually equally likely to admit male and female applicants! The reason that female applicants look less likely to be admitted is that, according to parts (h)-(i), a higher fraction of female applicants choose to apply to the Engineering program, which is less likely to accept applicants of both genders. Once we look at department-level admissions, we find that neither department discriminates.

(1 point for saying “No” because each department is equally likely to admit males versus females; 2 points for saying that the difference arises because more female applicants choose to apply to the department that’s harder to get into.)

Question 8

(Test-taking tip: Do this question LAST.)

Suppose we are estimating the simple linear regression model Yi = α + β·xi + εi. Assume that the errors are distributed εi ~ iid N(0, σ²). Suppose we are going to estimate this model using TWO observations. We have two KNOWN x-values, x1 and x2, and we are about to observe two Y-values, Y1 and Y2. So based on what we know now, the x-values are known constants, and the Y-values are random variables. Define:

B = (Y2 − Y1) / (x2 − x1)

We are thinking about using B as an estimator of the slope, β. We are interested in asking: what is the sampling distribution of this estimator? Suppose I make the following claim: “B is an unbiased estimator of β with a normal sampling distribution.”

(a) If my claim is true, what is P( B > β )? [Hint: It may help you to draw a picture!]

.5

“Unbiased estimator with a normal sampling distribution” means that E(B) = β and that B is normal — in other words, the different values we could see for our estimator B look like a bell curve centered at β. And the probability a normal r.v. is bigger than its mean is .5!

(b) Let’s say you knew that Var(B) = σ_B², where σ_B² is some number.
Assuming my claim is true and you know what σ_B² is, construct a 95% confidence interval for β.

B +/- 2*sqrt(σ_B²), or equivalently B +/- 2*σ_B

Note: I do not expect most b-stats students (even “A” students) to be able to do parts (c)-(e) in a timed exam situation. However, notice you can get (a) and (b) just by knowing what it means for an estimator to be unbiased, the definition of a sampling distribution, and how the sampling distribution is used to build a confidence interval!

Now let’s see if we can verify my claim about B (you shouldn’t just believe everything somebody tells you about a strange estimator!!).

[Hint: The rest of this question is actually much easier if you do a little algebra up front. I’ll help get you started. Since B is an estimator of β, see if you can rewrite B as B = β + “error”, where “error” depends on ε1, ε2, x1, and x2. The easiest way to do this is to start with the formula above for B, and substitute α + βx1 + ε1 in for Y1 and α + βx2 + ε2 in for Y2. You get:

B = [ (α + βx2 + ε2) − (α + βx1 + ε1) ] / (x2 − x1)

Now see if you can cancel some terms and get β by itself. Also remember, when you’re doing the problems below, x1 and x2 are known constants, while ε1 and ε2 are iid N(0, σ²) random variables.]

(c) What is E(B)?

B = [ (α + βx2 + ε2) − (α + βx1 + ε1) ] / (x2 − x1)
  = ( βx2 + ε2 − βx1 − ε1 ) / (x2 − x1)
  = [ β(x2 − x1) + ε2 − ε1 ] / (x2 − x1)
  = β + (ε2 − ε1)/(x2 − x1)

Therefore, E(B) = β + E[ (ε2 − ε1)/(x2 − x1) ] = β.
Note: This verifies that “B is unbiased”.

(d) What is Var(B)?

Using the above expression for B,

Var(B) = Var( β + (ε2 − ε1)/(x2 − x1) ) = Var(ε2 − ε1)/(x2 − x1)² = 2σ²/(x2 − x1)²

Explanation: Key steps include (i) recognizing that β is a constant, so it doesn’t affect variance; (ii) we know that Var(a*X) = a²Var(X), so 1/(x2 − x1) gets squared when you factor it out; and (iii) Var(ε2 − ε1) = Var(ε2) + Var(ε1) − 2*Cov(ε2, ε1) = σ² + σ² − 0 = 2σ².
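A quick Monte Carlo simulation backs up the algebra in parts (c) and (d) (a sketch; α, β, σ, x1, x2 are arbitrary illustrative values, not from the exam):

```python
import random

# Illustrative (made-up) parameter values
alpha, beta, sigma = 1.0, 2.0, 3.0
x1, x2 = 0.0, 1.0

random.seed(0)
draws = []
for _ in range(200_000):
    # Generate the two observations from the model Y = alpha + beta*x + eps
    y1 = alpha + beta * x1 + random.gauss(0, sigma)
    y2 = alpha + beta * x2 + random.gauss(0, sigma)
    draws.append((y2 - y1) / (x2 - x1))   # one realization of B

mean_b = sum(draws) / len(draws)
var_b = sum((b - mean_b) ** 2 for b in draws) / len(draws)

print(mean_b)  # ~ beta = 2, consistent with E(B) = beta
print(var_b)   # ~ 2*sigma**2 / (x2 - x1)**2 = 18, consistent with part (d)
```

The simulated mean and variance of the draws land close to β and 2σ²/(x2 − x1)², which is exactly what the derivation predicts.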
(e) Suppose that 0 ≤ xi ≤ 1; that is, both values x1 and x2 must be between zero and one. However, you get to pick the x-values in advance (then you get a Y-value for each x, and plug them into our estimator, B). If your goal is to estimate β as accurately as possible, what values should you choose for x1 and x2?

Since B is unbiased, "estimate β as accurately as possible" means we want Var(B) to be as small as possible. Therefore, from part (d), since (x2 − x1)² shows up in the denominator of the variance, we want x2 and x1 as far apart as we can get. So you should choose x1 = 0 and x2 = 1 (or x1 = 1, x2 = 0). (Of course, we'd also like σ² to be as small as possible, but that's usually not something we have control over!)

(f) Would my claim still be correct if we were NOT willing to assume the errors εi were normally distributed? Briefly explain. [Note: It is possible to get full credit for part (f) even if you don't get parts (c) through (e).]

No. E(B) and Var(B) would still be the same (in particular, B would still be unbiased), but the sampling distribution would NOT be normal. When the errors are normal, B is normal because it is a linear combination of two normal r.v.'s, ε1 and ε2. If the errors aren't normal, B will not be normal (the Central Limit Theorem won't save us here, because there are only TWO observations!).
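(Again not part of the exam: part (e) can also be checked by simulation. The sketch below, using the same hypothetical values of α, β, and σ as before, compares the variance of B when the x-values are a full unit apart versus half a unit apart; by the formula in (d), halving the spacing should roughly quadruple the variance.)

```python
import random

random.seed(1)
alpha, beta, sigma = 1.0, 2.0, 0.5   # hypothetical illustration values
n_sims = 100_000

def var_of_B(x1, x2):
    """Simulate the sampling variance of B = (Y2 - Y1)/(x2 - x1)."""
    draws = []
    for _ in range(n_sims):
        y1 = alpha + beta * x1 + random.gauss(0, sigma)
        y2 = alpha + beta * x2 + random.gauss(0, sigma)
        draws.append((y2 - y1) / (x2 - x1))
    m = sum(draws) / n_sims
    return sum((b - m) ** 2 for b in draws) / n_sims

var_wide = var_of_B(0.0, 1.0)      # x-values as far apart as allowed
var_narrow = var_of_B(0.25, 0.75)  # spacing halved

print(var_narrow / var_wide)       # should be close to 4
```

This matches the answer to (e): pushing x1 and x2 to the endpoints of [0, 1] minimizes Var(B).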
