



COMPLETE BUSINESS STATISTICS

by
AMIR D. ACZEL & JAYAVEL SOUNDERPANDIAN

6th edition (SIE)



Chapter 10

Simple Linear Regression and Correlation



10    Simple Linear Regression and Correlation

• Using Statistics
• The Simple Linear Regression Model
• Estimation: The Method of Least Squares
• Error Variance and the Standard Errors of Regression
  Estimators
• Correlation
• Hypothesis Tests about the Regression Relationship
• How Good is the Regression?
• Analysis of Variance Table and an F Test of the
  Regression Model
• Residual Analysis and Checking for Model Inadequacies
• Use of the Regression Model for Prediction
• The Solver Method for Regression



10 LEARNING OBJECTIVES
After studying this chapter, you should be able to:
• Determine whether a regression experiment would
  be useful in a given instance
• Formulate a regression model
• Compute a regression equation
• Compute the covariance and the correlation
  coefficient of two random variables
• Compute confidence intervals for regression
  coefficients
• Compute a prediction interval for the dependent
  variable



10     LEARNING OBJECTIVES (continued)

After studying this chapter, you should be able to:
• Test hypotheses about regression coefficients
• Conduct an ANOVA experiment using regression
  results
• Analyze residuals to check if the assumptions about the
  regression model are valid
• Solve regression problems using spreadsheet templates
• Apply the covariance concept to linear composites of
  random variables
• Use the LINEST function to carry out a regression


10-1 Using Statistics

• Regression refers to the statistical technique of modeling the
  relationship between variables.
• In simple linear regression, we model the relationship
  between two variables.
• One of the variables, denoted by Y, is called the dependent
  variable and the other, denoted by X, is called the
  independent variable.
• The model we will use to depict the relationship between X and
  Y will be a straight-line relationship.
• A graphical sketch of the pairs (X, Y) is called a scatter
  plot.


10-1 Using Statistics

[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y)]

This scatterplot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. We notice that:

• Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.
• The scatter of points tends to be distributed around a positively sloped straight line.
• The pairs of values of advertising expenditures and sales are not located exactly on a straight line.
• The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.
• The line represents the nature of the relationship on average.


Examples of Other Scatterplots

[Figure: Six scatterplots showing other possible patterns in the relationship between X and Y]


Model Building

The inexact nature of the relationship between advertising and sales suggests that a statistical model might be useful in analyzing the relationship. A statistical model separates the systematic component of a relationship from the random component:

    Data = Statistical model: Systematic component + Random errors

In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE).

In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.

10-2 The Simple Linear Regression Model

The population simple linear regression model:

    Y = β0 + β1X + ε

where β0 + β1X is the nonrandom or systematic component and ε is the random component, and:

• Y is the dependent variable, the variable we wish to explain or predict
• X is the independent variable, also called the predictor variable
• ε is the error term, the only random component in the model, and thus the only source of randomness in Y
• β0 is the intercept of the systematic component of the regression relationship
• β1 is the slope of the systematic component

The conditional mean of Y:

    E[Y|X] = β0 + β1X

Picturing the Simple Linear Regression Model

[Figure: Regression plot showing the line E[Y] = β0 + β1X, with intercept β0, slope β1, and the error εi separating an observed point Yi from the line]

The simple linear regression model gives an exact linear relationship between the expected or average value of Y, the dependent variable, and X, the independent or predictor variable:

    E[Yi] = β0 + β1Xi

Actual observed values of Y differ from the expected value by an unexplained or random error:

    Yi = E[Yi] + εi = β0 + β1Xi + εi

Assumptions of the Simple Linear Regression Model

• The relationship between X and Y is a straight-line relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term εi.
• The errors εi are normally distributed with mean 0 and variance σ². The errors are uncorrelated (not related) in successive observations. That is: ε ~ N(0, σ²).

[Figure: Identical normal distributions of errors, all centered on the regression line E[Y] = β0 + β1X]

10-3 Estimation: The Method of Least Squares

Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.

The estimated regression equation:

    Y = b0 + b1X + e

where b0 estimates the intercept of the population regression line, β0; b1 estimates the slope of the population regression line, β1; and e stands for the observed errors, the residuals from fitting the estimated regression line b0 + b1X to a set of n points.

The estimated regression line:

    Ŷ = b0 + b1X

where Ŷ (Y-hat) is the value of Y lying on the fitted regression line for a given value of X.


Fitting a Regression Line

[Figure: Four panels showing the data, three errors from a fitted line, three errors from the least squares regression line, and how errors from the least squares regression line are minimized]


Errors in Regression

[Figure: The fitted regression line Ŷ = b0 + b1X, an observed data point (Xi, Yi), and the predicted value Ŷi for Xi]

    Error: ei = Yi − Ŷi


Least Squares Regression

The sum of squared errors in regression is:

    SSE = Σ ei² = Σ (yi − ŷi)²    (summing over i = 1, …, n)

The least squares regression line is the one that minimizes the SSE with respect to the estimates b0 and b1.

The normal equations:

    Σ yi = n·b0 + b1·Σ xi

    Σ xiyi = b0·Σ xi + b1·Σ xi²

[Figure: SSE plotted as a surface over (b0, b1); at the least squares values of b0 and b1, SSE is minimized with respect to both b0 and b1]

Sums of Squares, Cross Products, and Least Squares Estimators

Sums of squares and cross products:

    SSx = Σ(x − x̄)² = Σx² − (Σx)²/n

    SSy = Σ(y − ȳ)² = Σy² − (Σy)²/n

    SSxy = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n

Least squares regression estimators:

    b1 = SSxy / SSx

    b0 = ȳ − b1·x̄


Example 10-1

    Miles    Dollars      Miles²        Miles × Dollars
    1211     1802          1,466,521      2,182,222
    1345     2405          1,809,025      3,234,725
    1422     2005          2,022,084      2,851,110
    1687     2511          2,845,969      4,236,057
    1849     2332          3,418,801      4,311,868
    2026     2305          4,104,676      4,669,930
    2133     3016          4,549,689      6,433,128
    2253     3385          5,076,009      7,626,405
    2400     3090          5,760,000      7,416,000
    2468     3694          6,091,024      9,116,792
    2699     3371          7,284,601      9,098,329
    2806     3998          7,873,636     11,218,388
    3082     3555          9,498,724     10,956,510
    3209     4692         10,297,681     15,056,628
    3466     4244         12,013,156     14,709,704
    3643     5298         13,271,449     19,300,614
    3852     4801         14,837,904     18,493,452
    4033     5147         16,265,089     20,757,852
    4267     5738         18,207,288     24,484,046
    4498     6420         20,232,004     28,877,160
    4533     6059         20,548,088     27,465,448
    4804     6426         23,078,416     30,870,504
    5090     6321         25,908,100     32,173,890
    5233     7026         27,384,288     36,767,056
    5439     6964         29,582,720     37,877,196
    ------   -------    -----------    -----------
    79,448   106,605    293,426,946    390,185,014

    SSx = Σx² − (Σx)²/n = 293,426,946 − (79,448)²/25 = 40,947,557.84

    SSxy = Σxy − (Σx)(Σy)/n = 390,185,014 − (79,448)(106,605)/25 = 51,402,852.4

    b1 = SSxy/SSx = 51,402,852.4/40,947,557.84 = 1.255333776 ≈ 1.26

    b0 = ȳ − b1·x̄ = 106,605/25 − (1.255333776)(79,448/25) = 274.85
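As a quick check on the hand calculation above, here is a minimal Python sketch (not the book's Excel template) that recomputes the least squares estimates from the raw Example 10-1 data. Last digits can differ slightly from the printed intermediate sums, which contain minor transcription errors.

```python
# A minimal sketch recomputing b0 and b1 for Example 10-1 (miles vs. dollars)
# from the sums-of-squares formulas; not the book's spreadsheet template.
miles = [1211, 1345, 1422, 1687, 1849, 2026, 2133, 2253, 2400, 2468,
         2699, 2806, 3082, 3209, 3466, 3643, 3852, 4033, 4267, 4498,
         4533, 4804, 5090, 5233, 5439]
dollars = [1802, 2405, 2005, 2511, 2332, 2305, 3016, 3385, 3090, 3694,
           3371, 3998, 3555, 4692, 4244, 5298, 4801, 5147, 5738, 6420,
           6059, 6426, 6321, 7026, 6964]
n = len(miles)

ss_x = sum(x * x for x in miles) - sum(miles) ** 2 / n                # SSx
ss_xy = (sum(x * y for x, y in zip(miles, dollars))
         - sum(miles) * sum(dollars) / n)                             # SSxy
b1 = ss_xy / ss_x                             # slope, ≈ 1.2553
b0 = sum(dollars) / n - b1 * sum(miles) / n   # intercept, ≈ 274.85
print(f"b1 = {b1:.4f}, b0 = {b0:.2f}")
```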

Template (partial output) that can be
used to carry out a Simple Regression

Template (continued) that can be used
to carry out a Simple Regression

Template (continued) that can be used
to carry out a Simple Regression




   Residual Analysis. The plot shows the absence of a relationship
   between the residuals and the X-values (miles).

Template (continued) that can be used
to carry out a Simple Regression




  Note: The normal probability plot is approximately linear. This
  would indicate that the normality assumption for the errors has not
  been violated.





10-4 Error Variance and the Standard Errors of Regression Estimators

Degrees of freedom in regression:

    df = n − 2    (n total observations less one degree of freedom for each parameter estimated, b0 and b1)

Square and sum all regression errors to find SSE:

    SSE = Σ(Y − Ŷ)² = SSy − (SSxy)²/SSx = SSy − b1·SSxy

An unbiased estimator of σ², denoted by s²:

    MSE = SSE / (n − 2)

Example 10-1:

    SSE = SSy − b1·SSxy = 66,855,898 − (1.255333776)(51,402,852.4) = 2,328,161.2

    MSE = SSE/(n − 2) = 2,328,161.2/23 = 101,224.4

    s = √MSE = √101,224.4 = 318.158
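The same error-variance computation, as a short sketch that takes the Example 10-1 summary statistics (SSy, SSxy, b1) as constants:

```python
# A sketch of SSE, MSE, and s for Example 10-1 from the constants above.
import math

SS_y = 66_855_898.0
SS_xy = 51_402_852.4
b1 = 1.255333776
n = 25

sse = SS_y - b1 * SS_xy   # sum of squared errors, ≈ 2,328,161.2
mse = sse / (n - 2)       # unbiased estimate of sigma^2, ≈ 101,224.4
s = math.sqrt(mse)        # standard error of the regression, ≈ 318.158
print(f"SSE = {sse:,.1f}, MSE = {mse:,.1f}, s = {s:.3f}")
```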

Standard Errors of Estimates in Regression

The standard error of b0 (intercept):

    s(b0) = s·√(Σx² / (n·SSx)),  where s = √MSE

The standard error of b1 (slope):

    s(b1) = s / √SSx

Example 10-1:

    s(b0) = 318.158·√(293,426,946 / ((25)(40,947,557.84))) = 170.338

    s(b1) = 318.158 / √40,947,557.84 = 0.04972
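A sketch of the two standard-error formulas, again using the Example 10-1 constants:

```python
# Standard errors of the intercept and slope estimates for Example 10-1.
import math

s = 318.158              # sqrt(MSE)
n = 25
SS_x = 40_947_557.84
sum_x_sq = 293_426_946   # sum of the squared x values

se_b0 = s * math.sqrt(sum_x_sq / (n * SS_x))  # s(b0), ≈ 170.34
se_b1 = s / math.sqrt(SS_x)                   # s(b1), ≈ 0.0497
print(f"s(b0) = {se_b0:.3f}, s(b1) = {se_b1:.5f}")
```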

Confidence Intervals for the Regression Parameters

A (1 − α)100% confidence interval for β0:

    b0 ± t(α/2, n−2) · s(b0)

A (1 − α)100% confidence interval for β1:

    b1 ± t(α/2, n−2) · s(b1)

Example 10-1 (95% confidence intervals):

    b0 ± t(0.025, 23)·s(b0) = 274.85 ± (2.069)(170.338) = 274.85 ± 352.43 = [−77.58, 627.28]

    b1 ± t(0.025, 23)·s(b1) = 1.25533 ± (2.069)(0.04972) = 1.25533 ± 0.10287 = [1.15246, 1.35820]

[Figure: The 95% confidence interval for the slope, centered on the least squares point estimate b1 = 1.25533; 0 is not a possible value of the regression slope at 95%]
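The same 95% intervals can be computed directly; a sketch using scipy for the t critical value:

```python
# 95% confidence intervals for beta0 and beta1 in Example 10-1.
from scipy import stats

n = 25
b0, se_b0 = 274.85, 170.338
b1, se_b1 = 1.25533, 0.04972

t_crit = stats.t.ppf(0.975, df=n - 2)   # t(0.025, 23) ≈ 2.069
print(f"beta0: [{b0 - t_crit * se_b0:.2f}, {b0 + t_crit * se_b0:.2f}]")   # ≈ [-77.6, 627.3]
print(f"beta1: [{b1 - t_crit * se_b1:.5f}, {b1 + t_crit * se_b1:.5f}]")   # ≈ [1.1525, 1.3582]
```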

Template (partial output) that can be used
to obtain Confidence Intervals for β0 and β1


10-5 Correlation

The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.

The population correlation, denoted by ρ, can take on any value from −1 to 1:

    ρ = −1        indicates a perfect negative linear relationship
    −1 < ρ < 0    indicates a negative linear relationship
    ρ = 0         indicates no linear relationship
    0 < ρ < 1     indicates a positive linear relationship
    ρ = 1         indicates a perfect positive linear relationship

The absolute value of ρ indicates the strength or exactness of the relationship.


Illustrations of Correlation

[Figure: Six scatterplots illustrating ρ = −1, ρ = 0, ρ = 1, ρ = −0.8, ρ = 0, and ρ = 0.8]


Covariance and Correlation

The covariance of two random variables X and Y:

    Cov(X, Y) = E[(X − μX)(Y − μY)]

where μX and μY are the population means of X and Y respectively.

The population correlation coefficient:

    ρ = Cov(X, Y) / (σXσY)

The sample correlation coefficient*:

    r = SSxy / √(SSx·SSy)

Example 10-1:

    r = 51,402,852.4 / √((40,947,557.84)(66,855,898)) = 51,402,852.4/52,321,943.29 = 0.9824

*Note: If ρ < 0, then b1 < 0; if ρ = 0, then b1 = 0; if ρ > 0, then b1 > 0.
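A one-line check of the Example 10-1 sample correlation from the sums of squares:

```python
# Sample correlation coefficient for Example 10-1.
import math

SS_x, SS_y, SS_xy = 40_947_557.84, 66_855_898.0, 51_402_852.4
r = SS_xy / math.sqrt(SS_x * SS_y)   # ≈ 0.9824
print(f"r = {r:.4f}")
```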

Hypothesis Tests for the Correlation Coefficient

    H0: ρ = 0    (no linear relationship)
    H1: ρ ≠ 0    (some linear relationship)

Test statistic:

    t(n−2) = r / √((1 − r²)/(n − 2))

Example 10-1:

    t = 0.9824 / √((1 − 0.9651)/(25 − 2)) = 0.9824/0.0389 = 25.25

    t(0.005, 23) = 2.807 < 25.25

H0 is rejected at the 1% level.
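A sketch of the same t test in Python, with scipy supplying the critical value:

```python
# t test of H0: rho = 0 for Example 10-1.
import math
from scipy import stats

r, n = 0.9824, 25
t_stat = r / math.sqrt((1 - r**2) / (n - 2))   # ≈ 25.25
t_crit = stats.t.ppf(1 - 0.005, df=n - 2)      # 1% two-tailed critical value ≈ 2.807
print(f"t = {t_stat:.2f}, critical value = {t_crit:.3f}")
print("Reject H0" if abs(t_stat) > t_crit else "Do not reject H0")
```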

10-6 Hypothesis Tests about the Regression Relationship

[Figure: Three scatterplots in which no linear relationship exists: constant Y, unsystematic variation, and a nonlinear relationship]

A hypothesis test for the existence of a linear relationship between X and Y:

    H0: β1 = 0
    H1: β1 ≠ 0

Test statistic for the existence of a linear relationship between X and Y:

    t(n−2) = b1 / s(b1)

where b1 is the least squares estimate of the regression slope and s(b1) is the standard error of b1. When the null hypothesis is true, the statistic has a t distribution with n − 2 degrees of freedom.

Hypothesis Tests for the Regression Slope

Example 10-1:

    H0: β1 = 0
    H1: β1 ≠ 0

    t = b1/s(b1) = 1.25533/0.04972 = 25.25

    t(0.005, 23) = 2.807 < 25.25

H0 is rejected at the 1% level, and we may conclude that there is a relationship between charges and miles traveled.

Example 10-4:

    H0: β1 = 1
    H1: β1 ≠ 1

    t = (b1 − 1)/s(b1) = (1.24 − 1)/0.21 = 1.14

    t(0.05, 58) = 1.671 > 1.14

H0 is not rejected at the 10% level. We may not conclude that the beta coefficient is different from 1.
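For Example 10-1 the slope test reduces to a couple of lines; a sketch that also reports a two-tailed p-value:

```python
# t test of H0: beta1 = 0 for Example 10-1.
from scipy import stats

n = 25
b1, se_b1 = 1.25533, 0.04972
t_stat = b1 / se_b1                              # ≈ 25.25
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-tailed p-value, effectively 0
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```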


10-7 How Good is the Regression?

The coefficient of determination, r², is a descriptive measure of the strength of the regression relationship, a measure of how well the regression line fits the data.

[Figure: A data point's total deviation from ȳ decomposed into the explained deviation (regression) and the unexplained deviation (error)]

    (y − ȳ)    =    (y − ŷ)    +    (ŷ − ȳ)
     Total         Unexplained      Explained
    deviation      deviation        deviation
                   (error)          (regression)

    Σ(y − ȳ)² = Σ(y − ŷ)² + Σ(ŷ − ȳ)²
       SST    =    SSE    +    SSR

    r² = SSR/SST = 1 − SSE/SST    (the percentage of total variation explained by the regression)


The Coefficient of Determination

Y                   Y                                           Y




                X                        X                                                          X
          SST                     SST                                                     SST
                                                                                    S
 r2 = 0   SSE       r2 = 0.50   SSE SSR                          r2 = 0.90          S     SSR
                                                                                    E


                                               7000
 Example 10 -1:                                6000

                                               5000



                                     Dollars
     SSR 64527736.8
 r 2
                    0.96518                  4000

     SST   66855898                            3000

                                               2000

                                                      1000 1500 2000 2500 3000 3500 4000 4500 5000 5500
                                                                            Miles
                                                                       10-36
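Both forms of the r² formula give the same answer for Example 10-1; a two-line check:

```python
# Coefficient of determination for Example 10-1, computed both ways.
SSR, SSE = 64_527_736.8, 2_328_161.2
SST = SSR + SSE                       # 66,855,898.0
print(f"r^2 = {SSR / SST:.5f}")       # SSR/SST       ≈ 0.96518
print(f"r^2 = {1 - SSE / SST:.5f}")   # 1 - SSE/SST   ≈ 0.96518
```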

10-8 Analysis-of-Variance Table and an F Test of the Regression Model

    Source of     Sum of          Degrees of
    Variation     Squares         Freedom       Mean Square     F Ratio

    Regression    SSR             1             MSR             MSR/MSE
    Error         SSE             n − 2         MSE
    Total         SST             n − 1         MST

Example 10-1:

    Source of     Sum of          Degrees of
    Variation     Squares         Freedom       Mean Square     F Ratio    p-Value

    Regression    64,527,736.8    1             64,527,736.8    637.47     0.000
    Error          2,328,161.2    23               101,224.4
    Total         66,855,898.0    24
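A sketch of the Example 10-1 F test, with scipy supplying the p-value:

```python
# F test of the regression model for Example 10-1.
from scipy import stats

SSR, SSE, n = 64_527_736.8, 2_328_161.2, 25
MSR = SSR / 1          # regression mean square (1 degree of freedom)
MSE = SSE / (n - 2)    # error mean square
F = MSR / MSE          # ≈ 637.47
p = stats.f.sf(F, dfn=1, dfd=n - 2)   # upper-tail p-value, ≈ 0.000
print(f"F = {F:.2f}, p = {p:.3g}")
```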

Template (partial output) that displays Analysis of
Variance and an F Test of the Regression Model

10-9 Residual Analysis and Checking for Model Inadequacies

[Figure: Four residual plots, plotting the residuals against x (or ŷ) and against time]

• Homoscedasticity: residuals appear completely random; no indication of model inadequacy.
• Heteroscedasticity: the variance of the residuals increases as x changes.
• Residuals exhibiting a linear trend with time.
• A curved pattern in the residuals, resulting from an underlying nonlinear relationship.

Normal Probability Plot of the Residuals

[Figures: Normal probability plots of residuals whose distribution is flatter than normal, more peaked than normal, positively skewed, and negatively skewed]

10-10 Use of the Regression Model for Prediction

• Point Prediction
  - A single-valued estimate of Y for a given value of X, obtained by inserting the value of X into the estimated regression equation.
• Prediction Interval
  - For a value of Y given a value of X
    • Variation in the regression line estimate
    • Variation of points around the regression line
  - For an average value of Y given a value of X
    • Variation in the regression line estimate


    Errors in Predicting E[Y|X]

Y         Upper limit on slope                         Y     Upper limit on intercept
                                     Regression line                                        Regression line



                        Lower limit on slope
Y                                                      Y                        Lower limit on intercept




               X                 X                                      X               X

    1) Uncertainty about the                               2) Uncertainty about the
    slope of the regression line                           intercept of the regression line
                                                                               10-45


Prediction Interval for E[Y|X]

Y   Prediction band for E[Y|X]                • The prediction band for E[Y|X]
                                 Regression
                                 line
                                                  is narrowest at the mean value
                                                  of X.
Y                                             •   The prediction band widens as
                                                  the distance from the mean of
                                                  X increases.
               X                 X            •   Predictions become very
                                                  unreliable when we
Prediction Interval for E[Y|X]                    extrapolate beyond the range of
                                                  the sample itself.
                                                                                       10-46

Additional Error in Predicting an Individual Value of Y

[Figure: Two panels: (3) variation around the regression line; the prediction band for an individual value of Y is wider than the prediction band for E[Y|X]]


Prediction Interval for a Value of Y

A (1 − α)100% prediction interval for Y:

    ŷ ± t(α/2, n−2) · s · √(1 + 1/n + (x − x̄)²/SSx)

Example 10-1 (X = 4,000):

    {274.85 + (1.2553)(4,000)} ± (2.069)(318.16)·√(1 + 1/25 + (4,000 − 3,177.92)²/40,947,557.84)

    = 5,296.05 ± 676.62 = [4,619.43, 5,972.67]
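A sketch of this prediction interval at X = 4,000, using the constants derived earlier:

```python
# 95% prediction interval for an individual Y at X = 4,000 (Example 10-1).
import math
from scipy import stats

b0, b1, s = 274.85, 1.2553, 318.16
n, x_bar, SS_x = 25, 3_177.92, 40_947_557.84
x = 4_000

y_hat = b0 + b1 * x                      # point prediction, ≈ 5,296.05
t_crit = stats.t.ppf(0.975, df=n - 2)    # ≈ 2.069
half = t_crit * s * math.sqrt(1 + 1 / n + (x - x_bar) ** 2 / SS_x)
print(f"[{y_hat - half:.2f}, {y_hat + half:.2f}]")   # ≈ [4619.43, 5972.67]
```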

Prediction Interval for the Average Value of Y

A (1 − α)100% prediction interval for E[Y|X]:

    ŷ ± t(α/2, n−2) · s · √(1/n + (x − x̄)²/SSx)

Example 10-1 (X = 4,000):

    {274.85 + (1.2553)(4,000)} ± (2.069)(318.16)·√(1/25 + (4,000 − 3,177.92)²/40,947,557.84)

    = 5,296.05 ± 156.48 = [5,139.57, 5,452.53]
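The interval for the mean response drops the leading 1 under the square root, so it is much narrower; the same sketch adapted:

```python
# 95% prediction interval for E[Y|X] at X = 4,000 (Example 10-1).
import math
from scipy import stats

b0, b1, s = 274.85, 1.2553, 318.16
n, x_bar, SS_x = 25, 3_177.92, 40_947_557.84
x = 4_000

y_hat = b0 + b1 * x
t_crit = stats.t.ppf(0.975, df=n - 2)
half = t_crit * s * math.sqrt(1 / n + (x - x_bar) ** 2 / SS_x)
print(f"[{y_hat - half:.2f}, {y_hat + half:.2f}]")   # ≈ [5139.57, 5452.53]
```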

Template Output with Prediction
Intervals

10-11 The Solver Method for Regression

The Solver macro available in Excel can also be used to conduct a simple linear regression. See the text for instructions.
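Outside Excel, the same least squares fit can be obtained numerically; a sketch using numpy's least squares solver on the first five Example 10-1 observations (so the estimates differ from the full-sample fit):

```python
# Least squares fit via numpy, analogous to what Solver/LINEST compute.
import numpy as np

x = np.array([1211, 1345, 1422, 1687, 1849], dtype=float)  # first five miles values
y = np.array([1802, 2405, 2005, 2511, 2332], dtype=float)  # first five dollars values

A = np.column_stack([np.ones_like(x), x])         # design matrix [1, x]
(b0, b1), *_ = np.linalg.lstsq(A, y, rcond=None)  # minimizes SSE
print(f"b0 = {b0:.2f}, b1 = {b1:.4f}")
```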

10-12 Linear Composites of Dependent Random Variables

• The Case of Independent Random Variables:
  For independent random variables X1, X2, …, Xn, the expected value of the sum is given by:

    E(X1 + X2 + … + Xn) = E(X1) + E(X2) + … + E(Xn)

  For independent random variables X1, X2, …, Xn, the variance of the sum is given by:

    V(X1 + X2 + … + Xn) = V(X1) + V(X2) + … + V(Xn)

10-12 Linear Composites of Dependent Random Variables (continued)

• The Case of Independent Random Variables with Weights:
  For independent random variables X1, X2, …, Xn, with respective weights α1, α2, …, αn, the expected value of the sum is given by:

    E(α1X1 + α2X2 + … + αnXn) = α1E(X1) + α2E(X2) + … + αnE(Xn)

  For independent random variables X1, X2, …, Xn, with respective weights α1, α2, …, αn, the variance of the sum is given by:

    V(α1X1 + α2X2 + … + αnXn) = α1²V(X1) + α2²V(X2) + … + αn²V(Xn)

Covariance of Two Random Variables X1 and X2

• The covariance between two random variables X1 and X2 is given by:

    Cov(X1, X2) = E{[X1 − E(X1)][X2 − E(X2)]}

• An equivalent expression for the covariance uses the correlation ρ between X1 and X2:

    Cov(X1, X2) = ρ·SD(X1)·SD(X2)

10-12 Linear Composites of Dependent Random Variables (continued)

• The Case of Dependent Random Variables with Weights:
  For dependent random variables X1, X2, …, Xn, with respective weights α1, α2, …, αn, the variance of the sum is given by (see the sketch below):

    V(α1X1 + α2X2 + … + αnXn) = α1²V(X1) + α2²V(X2) + … + αn²V(Xn)
                                + 2α1α2·Cov(X1, X2) + … + 2αn−1αn·Cov(Xn−1, Xn)

								