# Hypothesis Testing


```
Agenda






Review
  - Regression
  - Measures of fit: R2 and correlation
  - Measures of fit: Standard Error of the Estimate (SEE)
  - Handout questions
Hypothesis tests about the predicted Ŷ
Hypothesis tests about the coefficients
  - Standard error, t-test of the coefficient
  - p-value

Class 16

Regression

- Introduction to regression and correlation
- Correlation coefficient and measures of fit
- Using Excel for regression
- Hypothesis testing and prediction with regression
- Multiple regression (2 classes): Regression with many RHS X variables
- Advanced use of dummy variables
- Forecasting models using regression
- Advanced topics in regression
- Demand estimation using regression
- Causation v. correlation: Some solutions

Terms we used last week

- Linear Regression
- Dependent variable, LHS variable, Y variable
- Independent variable, RHS variable, X variable
- Intercept (a.k.a. constant)
- Coefficient
- Predicted Y, Ŷ
- Error (a.k.a. residual)
- SSE (Sum of Squared Errors)
- Minimizing SSE
- Correlation coefficient
- SST (Sum of Squares Total)
- SSR (Sum of Squares of the Regression)
- R2
- Best fitting equation
- SEE (Standard Error of the Estimate)

Terms we learn today

- Confidence Intervals (of predictions, of coefficients)
- Hypothesis tests
- Standard error of the coefficient
- T-test

Knowing SQ.FT, we can better explain/predict rent

[Figure: scatter plot with SQ FT on the horizontal axis (0 to 4000) and rent on the vertical axis (0 to 2500). Legend: Actual Rent, Predicted Rent (linear), Average Rent (linear).]

- AVERAGE: red thick line.
- ENTIRE DASHED LINE: how much the rent of that apartment differs from the average (for SST).
- DASHED RED LINE: the variation the regression has "explained" as due to different SQFT (for SSR).
- DASHED BLACK LINE: how far the actual rent is from the regression prediction (line); the unexplained error/residual (for SSE).

Deriving R2
R2 is the proportion of SST explained by the regression:

  R2 = SSR / SST = 1 - SSE / SST

(Remember SST = SSR + SSE)
- SST: the total variation in Y around its mean (total sum of squares)
- SSR: the variation in Y that is caused or explained by Y's relationship with X (regression sum of squares: explained by regression)
- SSE: the variation in Y that remains unexplained (sum of squared errors)

Reviewing Correlation
Here is a correlation table from Excel:

               RENT     BEDROOMS  ROOMS    SQFT     DISTANCE TO T  AREA A
RENT            1
BEDROOMS        0.6848   1
ROOMS           0.6815   0.7937    1
SQFT            0.7957   0.7413    0.7098   1
DISTANCE TO T  -0.0823  -0.2388   -0.1370  -0.1861   1
AREA A         -0.6620  -0.2824   -0.3977  -0.4247  -0.4308         1

The computer just didn't bother filling in the top of the table. The correlation between BEDROOMS and RENT is the same as the correlation between RENT and BEDROOMS, etc.
The correlation coefficient r is just the square root of R2, with the sign taken from the sign of the regression slope b:

  r = +/- sqrt(R2)

Standard Error of the Estimate
Standard Error of the Estimate (SEE):

  SEE = sqrt(SSE / df)

  SSE = sum of squared errors
  df  = degrees of freedom = n - k - 1  (n observations, k X variables)

- Data points close to the regression line -> better prediction -> smaller SSE (sum of squared errors)
- More observations -> more df
- More df -> smaller SEE (Standard Error of the Estimate) -> more accurate prediction, better fit

Review Questions from handout, and Answers
1. Report the regression equation:
   Rent = 316.51 + 288.04 BEDROOMS    R2 = .469    SEE = 257.27
2. Predict the rent for a 3 bedroom apartment. If the rent for a specific 3-bedroom apartment is actually \$1250, is its rent more or less than predicted?
   Rent = 316.51 + 3*288.04 = 1180.63
   A \$1250 3-bedroom apartment is \$69.37 more than we predicted.
3. What does R2 tell us? Where is it given?
   The R2 tells us that 46.9% of the total variation of Y is explained by the regression on Bedrooms.
4. If R2 weren't there, how could you compute it?
   R2 = SSR/SST = 3566190/7603699
   R2 = (Multiple R)^2
   R2 = 1 - SSE/SST = 1 - 4037509/7603699

Review Questions and Answers
5. Comparing the regression of Rent on BEDROOMS with the one we did last class of Rent on SQFT, BEDROOMS had a coefficient of 288.04 while SQFT had a coefficient of .5477. Does this mean that the number of bedrooms is more important than size (measured in sq. ft.) in determining the rent of an apartment?
   No. The two coefficients are in different units. One sq. foot is about a hundredth of a room, so of course ONE sq. foot increases rent less than a room (which has a hundred square feet or more).
6. Based on this regression output: What is the sign of the correlation coefficient? How do you know?
   The sign is the same as the sign on the coefficient of BEDROOMS, i.e. positive.
7. Based on the regression output, what is the value of the correlation coefficient? How do you know? How could you compute it?
   The value is +0.6848. Its absolute value is the Multiple R, in the top row of the regression output. Its sign is as above.

Review Questions & Answers cont. (based on the correlation table from Excel)
8. Which of the variables is most highly correlated with rent?
   Square feet (r = 0.7957).
9. Which is more highly correlated with square feet: Area A or rooms? Why?
   Rooms. Its correlation with SQFT (0.7098) has a larger absolute value than that of Area A (-0.4247).
10. If we run the simple regression of rent on each other variable separately, which will have the highest SEE? Why?
   The one with the highest SEE will be the one with the smallest correlation (in absolute value) with Rent: DISTANCE TO T (r = -0.0823).

The R2 is .633 in this regression of Rent = a + b SQFT
Q: What is the R2 in the regression of SQFT = a + b Rent?

SUMMARY OUTPUT
Regression Statistics
  Multiple R          0.795744275
  R Square            0.633208951
  Adjusted R Square   0.627195983
  Standard Error      213.824224
  Observations        63

ANOVA
              df   SS            MS        F            Significance F
  Regression   1   4814730.258   4814730   105.3072209  6.57029E-15
  Residual    61   2788968.726   45720.8
  Total       62   7603698.984

              Coefficients  Standard Error  t Stat    P-value      Lower 95%    Upper 95%
  Intercept   199.7159426   72.28093328     2.7631    0.0076       55.18124307  344.2506
  SQFT        0.547704941   0.053372505     10.26193  6.57029E-15  0.440979992  0.65443

Q: What is the R2 in the regression of SQFT = a + b Rent?
A: It also has to be R2 = .633, since the correlation coefficient does not depend on which variable we put on the left side and which on the right.

Regression Statistics
  Multiple R          0.7957
  R Square            0.6332
  Adjusted R Square   0.6272
  Standard Error      310.6589
  Observations        63

ANOVA
              df   SS            MS         F         Significance F
  Regression   1   10163088.05   10163088   105.3072  6.57029E-15
  Residual    61   5887045.217   96508.94
  Total       62   16050133.27

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
  Intercept   230.0515      107.4278        2.1415   0.0362   15.2362    444.8667
  RENT        1.1561        0.1127          10.2619  0.0000   0.9308     1.3814

Lesson from this on causality versus correlation

- When two variables are correlated, just because we choose to put one on the LHS and call it "dependent Y" does NOT mean that Y actually depends on X.
- We are imposing our assumptions when we do this.
- We always want to think hard about what causes what.

From the standard error of the estimate, we can calculate confidence intervals of Ŷ

The distribution of Ŷpredicted
Using the regression line, our best guess or prediction of Y is:
  Ŷpredicted = a + b X
But actually
  Yactual = a + b X + error
The SEE (Standard Error of the Estimate) measures how spread out this error (residual) is.

[Figure: probability distribution of Y, centered at Ŷ]

Confidence interval

The Ŷpredicted error has a t-distribution (which is practically the same as a normal distribution if there are lots of observations).

The following is approximately true:
- 68% of the time, our prediction for Y at a given X will be within +/- 1 standard error of the actual Y.
  - This is called the 68% confidence interval (2-tailed test).
  - It means, "We are 68% certain that the actual Y lies within +/- 1 standard error of the Ypredicted."
- 95% of the time, our prediction will be within +/- 2 (or more specifically 1.96) standard errors of the actual Y.
  - This is the 95% confidence interval (2-tailed test).

This is the 95% confidence interval

[Figure: distribution centered at Ŷpredicted, with the interval marked from Ŷpredicted - 2 standard errors to Ŷpredicted + 2 standard errors]

Hypothesis tests

We can use the standard error of the estimate (and the related confidence intervals) to test hypotheses.
Notation: H0: the null hypothesis, the hypothesis we are testing.
In the handout, is the \$1250 apartment from Q2 "overpriced", i.e. more than it should be? Here,
- H0: This apartment is not overpriced
- H0: 1250 = α + β BEDROOMS, where α and β are the true parameters

Hypothesis tests

- H0: This apartment is not overpriced
- We predict that Rent = 316.5 + 288 * 3 = 1181
- H0: 1250 = α + β BEDROOMS, where α and β are the true parameters
- The 95% confidence interval around 1181 is:
    1181 +/- 2 * SEE = 1181 +/- 2 * 257 = 1181 +/- 514
  (257 is the SEE on the handout and in the regression output for Rent on BEDROOMS)

Review: Ways we measure the regression's "fit" or accuracy

- How much of the variation in rent (Y) do we explain by knowing sq. ft. (X)?
  -> R2
- How closely do rent and square feet move together?
  -> r (or ρ): the correlation coefficient
- How sure are we that the predicted Y will be accurate enough to fall within a specific range (that we specify)?
  -> Confidence intervals based on the SEE

Regression of Rent on BEDROOMS

SUMMARY OUTPUT
Regression Statistics
  Multiple R          0.6848
  R Square            0.4690
  Adjusted R Square   0.4603
  Standard Error      257.2716
  Observations        63

ANOVA
              df   SS           MS           F
  Regression   1   3566189.77   3566189.77   53.88
  Residual    61   4037509.21   66188.68
  Total       62   7603698.98

              Coefficients  Standard Error  t Stat  P-value
  Intercept   316.5140      84.3362         2.5673  0.0127
  BEDROOMS    288.0369      39.2408         7.3402  6E-10

Review Questions and Answers, cont.
11. Give the 95% confidence interval for your prediction in Q2. In words, explain what this tells us. In light of this, what do you conclude about the apartment discussed in Q2?
   Rent = 1180.63 +/- 1.96 (257.27)
   The range is from 676.38 (1180.63 - 504.25) to 1684.88 (1180.63 + 504.25).
   We are 95% certain that the true rent for an apartment with 3 bedrooms will be within this range. Since \$1250 is in this range, it doesn't seem overpriced.
   More accurately: we are not 95% certain that the \$1250 apartment is "overpriced".

Statistics and hypotheses about the coefficient: the standard error, t statistic

SUMMARY OUTPUT
Regression Statistics
  Multiple R          0.79574
  R Square            0.63321
  Adjusted R Square   0.62720
  Standard Error      213.82422
  Observations        63

ANOVA
              df   SS            MS            F            Significance F
  Regression   1   4814730.258   4814730.258   105.3072209  6.5703E-15
  Residual    61   2788968.726   45720.79878
  Total       62   7603698.984

              Coefficients  Standard Error  t Stat    P-value      Lower 95%   Upper 95%
  Intercept   199.71594     72.28093        2.7631    0.0076       55.1812431  344.2506
  SQFT        0.54770       0.05337         10.26193  6.57029E-15  0.44097999  0.65443

The distribution of the slope
  Ŷ = a + b X
  Predicted rent = a + b SQFT
- We estimate b. We only have a sample of apartments, and a different sample will have a somewhat different b.
- b is our best guess for the true slope of X (β).

[Figure: distribution of β, centered at b]

Confidence interval

- 68% of the time, the true slope β will be within +/- 1 standard error (se) of the estimated b.
  - This is the 68% confidence interval (2-tailed).
- 95% of the time, the true slope β will be within +/- 2 (or 1.96) standard errors (se) of the estimated b.
  - This is the 95% confidence interval.
- These numbers come from the t-distribution. (The t-distribution is practically the same as a normal distribution if there are lots of observations.)

t statistic
To test the hypothesis that the true slope β equals the value "x", calculate

  t = (b - x) / s.e.

If | t | > 2, we are 95% certain that β is NOT "x".
If | t | < 2, we are NOT 95% certain that β is NOT "x" ...
but we are never sure that β is "x".

The regression output gives the t statistic of the hypothesis β = 0

  t-statistic = (estimated coefficient b - 0) / s.e. (of b)

If the | t-statistic | is >= 2.0 (or, more accurately, 1.96):
- We are 95% certain that the true coefficient β is NOT zero.
OR
- We are 95% certain that this X has a non-zero effect on Y, i.e. that X has SOME impact on Y.
We call this being "significant" or "significantly different from zero" at the 95% level.
Note: the regression output also gives the 95% confidence interval (Lower 95%, Upper 95%).

The t Stat on the Excel output tests the hypothesis that the true slope or coefficient (β) on SQFT is zero.
Since |t-stat| on SQFT > 1.96, we're >95% certain that SQFT has an impact on Rent.

Regression Statistics
  Multiple R          0.7957
  R Square            0.6332
  Adjusted R Square   0.6272
  Standard Error      213.8242
  Observations        63

ANOVA
              df   SS        MS             F         Significance F
  Regression   1   4814730   4814730.2584   105.3072  0.0000
  Residual    61   2788969   45720.7988
  Total       62   7603699

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
  Intercept   199.7159      72.2809         2.7631   0.0076   55.1812    344.2506
  SQFT        0.5477        0.0534          10.2619  0.0000   0.4410     0.6544

Another way to think about the t-statistic of the H0 that β = 0
If the | t-statistic | is >= 2.0, then

  |coefficient| / s.e. >= 2.0

For a positive coefficient, this becomes
  coefficient >= 2 s.e.
or
  coefficient - 2 s.e. >= 0

In other words, when |t stat| > 2, the coefficient (b) is more than 2.0 s.e.'s away from zero.
OR
Zero does not lie within the 95% confidence interval around the coefficient.
-> So I am 95% certain that β ≠ 0.

Is zero within this 95% confidence interval?

[Figure: distribution centered at b with its 95% confidence interval marked. If 0 lies at the marked point in the tail, 0 is outside the 95% confidence interval.]

Using t statistics to test H0: β = 0

- If | t-statistic | > 1.96: we are 95% certain that the coefficient is not zero.
- If | t-statistic | > 1.645: we are 90% certain that the coefficient is not zero.
- If | t-statistic | > 2.576: we are 99% certain that the coefficient is not zero.
- If | t-statistic | > 1: we are 68% certain that the coefficient is not zero.


Review Questions and Answers, cont.
12. You call a realtor who tells you that each extra bedroom will add an additional \$200 to the rent on average. Based on the regression, do you argue with him that he must be wrong? Explain.
   Answer: Our best guess is that an extra bedroom adds \$288 to the rent. However, the 95% confidence interval is 288.0 +/- 1.96*39.2, or from 211.1 to 364.9. 200 does not fall in this range. Therefore, we are more than 95% certain that each bedroom adds more than \$200.

   (P.S. I've done this confidence interval as a 2-tailed test.)

Next to the t-statistic is the p-value (of the coefficient)
The p-value is 1 minus the probability that the true value is not zero.
Or: the probability that the true value is zero or of a different sign.
More precisely, it is the probability of seeing a |t| at least this large if the true coefficient were zero.
If we are just 95% certain the coefficient is not zero, then the p-value is .05.

Dummy Variables

How to use regression with data on the category of an observation.
Examples:
- male or female?
- area A or area B?
- computer terminal included or not?

Dummy (or Indicator) Variables
Example: We want to know how neighborhood (Allston v. Brookline) affects rent.
Neighborhood is a "categorical variable". It puts observations into categories.
- We can't use "A" and "B" as data
- Define AREA A = 1 if Area A
                = 0 if Area B
- Run regression: RENT = b0 + b1 AREA A
A dummy variable is a variable that takes the value of 1 or 0.

Regression Statistics
  Multiple R          0.6620
  R Square            0.4382
  Adjusted R Square   0.4290
  Standard Error      264.6300
  Observations        63

ANOVA
              df   SS             MS             F        Significance F
  Regression   1   3331928.8915   3331928.8915   47.5793  0.0000
  Residual    61   4271770.0926   70029.0179
  Total       62   7603698.9841

              Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
  Intercept   1451.3333     88.2100         16.4532  0.0000   1274.9465  1627.7202
  AREA A      -657.2037     95.2777         -6.8978  0.0000   -847.7232  -466.6842

Rent and Area A dummy variable
  Rent = 1451.3 - 657.2 AREA A    R2 = .438

  Predicted rent in Area A: 1451.3 - 657.2 (1) = 794.1
  Predicted rent in Area B: 1451.3 - 657.2 (0) = 1451.3

Question to think about

What would the regression be if we had made a dummy variable "Area B" instead?

Hint: The choice of which should be "1", A or B, is arbitrary.
- Either way, you should predict the same rent. For instance, we should predict 794.1 for Allston.

Answer to Dummy variable question
What would the regression be if we had made a dummy variable "Area B" instead?
Hint: The choice of which should be "1", A or B, is arbitrary. Either way, you should predict the same rent.

Old way:
  Rent = 1451.3 - 657.2 AREA A    R2 = .438
  Predicted rent in Area A: 1451.3 - 657.2 (1) = 794.1
  Predicted rent in Area B: 1451.3 - 657.2 (0) = 1451.3

New way:
  Rent = 794.1 + 657.2 AREA B    R2 = .438
  Predicted rent in Area A: 794.1 + 657.2 (0) = 794.1
  Predicted rent in Area B: 794.1 + 657.2 (1) = 1451.3

```
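The handout arithmetic in these notes (R2, the SEE, the confidence interval around a predicted rent, and the t statistic on a coefficient) can be checked with a few lines of code. What follows is a minimal Python sketch and is not part of the original slides. It uses only figures copied from the Rent-on-BEDROOMS output. The function and variable names are illustrative assumptions, and it uses the large-sample critical value 1.96 as the notes do (the exact t value for 61 degrees of freedom is slightly larger).

```python
import math

# Figures copied from the Rent-on-BEDROOMS regression output in the notes.
N_OBS = 63            # observations
K_X_VARS = 1          # number of X variables (BEDROOMS only)
SSR = 3566189.77      # regression sum of squares
SSE = 4037509.21      # sum of squared errors
SST = 7603698.98      # total sum of squares
INTERCEPT = 316.5140
SLOPE = 288.0369      # coefficient on BEDROOMS
SLOPE_SE = 39.2408    # standard error of that coefficient
CRIT_95 = 1.96        # assumption: large-sample critical value, as in the notes


def r_squared(ssr, sst):
    """R2 = SSR / SST."""
    return ssr / sst


def standard_error_of_estimate(sse, n, k):
    """SEE = sqrt(SSE / df), with df = n - k - 1."""
    return math.sqrt(sse / (n - k - 1))


def interval(center, se, crit=CRIT_95):
    """Two-tailed interval: center +/- crit * se."""
    half_width = crit * se
    return center - half_width, center + half_width


def t_statistic(b, se, hypothesized=0.0):
    """t = (b - x) / s.e. for the hypothesis that the true slope equals x."""
    return (b - hypothesized) / se


r2 = r_squared(SSR, SST)
see = standard_error_of_estimate(SSE, N_OBS, K_X_VARS)

# Q2 and Q11: predicted rent for 3 bedrooms and its 95% interval.
predicted_rent = INTERCEPT + SLOPE * 3
rent_low, rent_high = interval(predicted_rent, see)

# Q12: is the realtor's $200 per bedroom compatible with the estimate?
slope_low, slope_high = interval(SLOPE, SLOPE_SE)
t_zero = t_statistic(SLOPE, SLOPE_SE)                     # H0: slope = 0
t_200 = t_statistic(SLOPE, SLOPE_SE, hypothesized=200.0)  # H0: slope = 200

print(f"R2 = {r2:.4f}, SEE = {see:.2f}")
print(f"Predicted rent = {predicted_rent:.2f}, 95% CI = ({rent_low:.2f}, {rent_high:.2f})")
print(f"Slope 95% CI = ({slope_low:.1f}, {slope_high:.1f})")
print(f"t (H0: slope = 0) = {t_zero:.4f}, t (H0: slope = 200) = {t_200:.4f}")
```

Passing a different `hypothesized` value to `t_statistic` tests any other claimed slope in the same way, which is the "t = (b - x) / s.e." rule from the t statistic slide.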
DOCUMENT INFO
Stats:
 views: 276 posted: 1/3/2008 language: English pages: 42