# S245 17 Linear Regression

Linear Regression

Hypothesis testing and Estimation
Assume that we have collected data on two variables X and Y. Let

    (x1, y1), (x2, y2), (x3, y3), …, (xn, yn)

denote the pairs of measurements on the two variables X and Y for n cases in a sample (or population).
The Statistical Model
Each y_i is assumed to be randomly generated from a normal distribution with
mean μ_i = α + βx_i and standard deviation σ
(α, β and σ are unknown).

[Figure: the line Y = α + βX, with intercept α and slope β; each y_i is distributed about its mean α + βx_i with standard deviation σ.]
The Data
The Linear Regression Model

• The data fall roughly about a straight line Y = α + βX.

[Scatter plot: the data scattered about the line Y = α + βX.]
The Least Squares Line

Fitting the best straight line
to “linear” data
Let

    Y = a + bX

denote an arbitrary equation of a straight line, where a and b are known values.
This equation can be used to predict, for each value of X, the value of Y.
For example, if X = x_i (as for the i-th case) then the predicted value of Y is:

    ŷ_i = a + b x_i
The residual

    r_i = y_i − ŷ_i = y_i − (a + b x_i)

can be computed for each case in the sample:

    r_1 = y_1 − ŷ_1,  r_2 = y_2 − ŷ_2,  …,  r_n = y_n − ŷ_n.

The residual sum of squares (RSS) is

    RSS = Σ_{i=1}^n r_i² = Σ_{i=1}^n (y_i − ŷ_i)² = Σ_{i=1}^n (y_i − (a + b x_i))²

a measure of the "goodness of fit" of the line Y = a + bX to the data.
The optimal choice of a and b will result in the residual sum of squares

    RSS = Σ_{i=1}^n r_i² = Σ_{i=1}^n (y_i − ŷ_i)² = Σ_{i=1}^n (y_i − (a + b x_i))²

attaining a minimum. If this is the case then the line

    Y = a + bX

is called the Least Squares Line.
The equation for the least squares line
Let

    S_xx = Σ_{i=1}^n (x_i − x̄)²

    S_yy = Σ_{i=1}^n (y_i − ȳ)²

    S_xy = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)
Computing Formulae:

    S_xx = Σ (x_i − x̄)² = Σ x_i² − (Σ x_i)²/n

    S_yy = Σ (y_i − ȳ)² = Σ y_i² − (Σ y_i)²/n

    S_xy = Σ (x_i − x̄)(y_i − ȳ) = Σ x_i y_i − (Σ x_i)(Σ y_i)/n
Then the slope of the least squares line can be shown to be:

    b = S_xy / S_xx = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)²

and the intercept of the least squares line can be shown to be:

    a = ȳ − b x̄ = ȳ − (S_xy / S_xx) x̄
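As a quick illustration, the two formulas above can be coded directly. This is a minimal sketch (the function name and the toy data are mine, not from the notes):

```python
# Least squares slope and intercept via S_xy/S_xx and a = ybar - b*xbar.
def least_squares(x, y):
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    Sxx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    b = Sxy / Sxx                   # slope
    a = sum_y / n - b * sum_x / n   # intercept
    return a, b

# Points lying exactly on Y = 2 + 3X are recovered exactly.
a, b = least_squares([0, 1, 2, 3], [2, 5, 8, 11])
```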
The residual sum of squares

    RSS = Σ (y_i − ŷ_i)² = Σ (y_i − (a + b x_i))²

Computing formula:

    RSS = S_yy − S_xy²/S_xx
Estimating σ, the standard deviation in the regression model:

    s = √[ Σ (y_i − ŷ_i)² / (n − 2) ] = √[ Σ (y_i − (a + b x_i))² / (n − 2) ]

Computing formula:

    s = √[ (S_yy − S_xy²/S_xx) / (n − 2) ]

This estimate of σ is said to be based on n − 2 degrees of freedom.
Sampling distributions of the estimators

The sampling distribution of the slope of the least squares line:

    b = S_xy / S_xx = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)²

It can be shown that b has a normal distribution with mean and standard deviation

    μ_b = β   and   σ_b = σ/√S_xx = σ/√(Σ (x_i − x̄)²)

Thus

    z = (b − μ_b)/σ_b = (b − β) / (σ/√S_xx)

has a standard normal distribution, and

    t = (b − μ_b)/s_b = (b − β) / (s/√S_xx)

has a t distribution with df = n − 2.
(1 − α)100% Confidence Limits for slope β:

    b ± t_{α/2} s/√S_xx

t_{α/2} is the critical value for the t-distribution with n − 2 degrees of freedom.
Testing the slope

    H0: β = β0  vs  HA: β ≠ β0

The test statistic is:

    t = (b − β0) / (s/√S_xx)

which has a t distribution with df = n − 2 if H0 is true.

The Critical Region
Reject H0: β = β0 in favour of HA: β ≠ β0 if

    |t| = |b − β0| / (s/√S_xx) ≥ t_{α/2},   df = n − 2

This is a two-tailed test. One-tailed tests are also possible.
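The confidence limits and the slope test can be sketched in a few lines of Python. The critical value t_{α/2} is passed in (it would come from a t table); `slope_inference` is my name, and the numbers in the call are the ones worked out in the cigarette example later in these notes:

```python
import math

def slope_inference(b, s, Sxx, t_crit, beta0=0.0):
    """(1-alpha)100% confidence limits for the slope beta and the
    t statistic for testing H0: beta = beta0 (df = n - 2)."""
    se = s / math.sqrt(Sxx)        # estimated standard error of b
    ci = (b - t_crit * se, b + t_crit * se)
    t_stat = (b - beta0) / se
    return ci, t_stat

# Cigarette example: b = 0.2284, s = 8.35, Sxx = 14322.55,
# t_.025 = 2.262 with df = 9.
ci, t_stat = slope_inference(0.2284, 8.35, 14322.55, 2.262)
```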
The sampling distribution of the intercept of the least squares line:

    a = ȳ − b x̄ = ȳ − (S_xy/S_xx) x̄

It can be shown that a has a normal distribution with mean and standard deviation

    μ_a = α   and   σ_a = σ √(1/n + x̄²/Σ (x_i − x̄)²)

Thus

    z = (a − μ_a)/σ_a = (a − α) / (σ √(1/n + x̄²/Σ (x_i − x̄)²))

has a standard normal distribution, and

    t = (a − μ_a)/s_a = (a − α) / (s √(1/n + x̄²/Σ (x_i − x̄)²))

has a t distribution with df = n − 2.
(1 − α)100% Confidence Limits for intercept α:

    a ± t_{α/2} s √(1/n + x̄²/S_xx)

t_{α/2} is the critical value for the t-distribution with n − 2 degrees of freedom.
Testing the intercept

    H0: α = α0  vs  HA: α ≠ α0

The test statistic is:

    t = (a − α0) / (s √(1/n + x̄²/Σ (x_i − x̄)²))

which has a t distribution with df = n − 2 if H0 is true.

The Critical Region
Reject H0: α = α0 in favour of HA: α ≠ α0 if

    |t| = |a − α0| / s_a ≥ t_{α/2},   df = n − 2
Example
The following data show the per capita consumption of cigarettes per month
(X) in various countries in 1930, and the death rates from lung cancer for men
in 1950.

TABLE : Per capita consumption of cigarettes per month (Xi) in n = 11
countries in 1930, and the death rates, Yi (per 100,000), from lung cancer for
men in 1950.

Country (i)        Xi     Yi
Australia          48     18
Canada             50     15
Denmark            38     17
Finland           110     35
Great Britain     110     46
Holland            49     24
Iceland            23      6
Norway             25      9
Sweden             30     11
Switzerland        51     25
USA               130     20
[Scatter plot: death rates from lung cancer (1950) vs per capita consumption of cigarettes; labelled points include Great Britain, Finland, Switzerland, Holland, USA, Australia, Denmark, Sweden, Norway and Iceland.]
Fitting the Least Squares Line

    Σ x_i = 664        Σ x_i² = 54,404

    Σ y_i = 226        Σ y_i² = 6,018

    Σ x_i y_i = 16,914
First compute the following three quantities:

    S_xx = 54404 − (664)²/11 = 14322.55

    S_yy = 6018 − (226)²/11 = 1374.73

    S_xy = 16914 − (664)(226)/11 = 3271.82
Computing the estimates of the slope (b), intercept (a) and standard deviation (s):

    b = S_xy/S_xx = 3271.82/14322.55 = 0.2284

    a = ȳ − b x̄ = 226/11 − 0.2284 (664/11) = 6.756

    s = √[ (S_yy − S_xy²/S_xx) / (n − 2) ] = 8.35
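These hand computations can be checked in a few lines of Python (a sketch; the data are the 11 country pairs behind the totals Σx = 664 and Σy = 226 used above):

```python
import math

# Cigarette consumption (x) and lung cancer death rate (y), n = 11.
x = [48, 50, 38, 110, 110, 49, 23, 25, 30, 51, 130]
y = [18, 15, 17, 35, 46, 24, 6, 9, 11, 25, 20]
n = len(x)

Sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
Syy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = Sxy / Sxx                                      # slope, approx. 0.2284
a = sum(y) / n - b * sum(x) / n                    # intercept, approx. 6.756
s = math.sqrt((Syy - Sxy ** 2 / Sxx) / (n - 2))    # approx. 8.35
```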

95% Confidence Limits for slope β:

    b ± t_{α/2} s/√S_xx = 0.2284 ± (2.262)(8.35)/√14322.55

    0.0706 to 0.3862

t_.025 = 2.262 is the critical value for the t-distribution with 9 degrees of freedom.
95% Confidence Limits for intercept α:

    a ± t_{α/2} s √(1/n + x̄²/S_xx) = 6.756 ± (2.262)(8.35)√(1/11 + (664/11)²/14322.55)

    −4.34 to 17.85

t_.025 = 2.262 is the critical value for the t-distribution with 9 degrees of freedom.
[Scatter plot as before, with the fitted least squares line Y = 6.756 + (0.2284)X drawn through the points.]

95% confidence limits for slope: 0.0706 to 0.3862
95% confidence limits for intercept: −4.34 to 17.85
Testing for a positive slope

    H0: β = 0  vs  HA: β > 0

The test statistic is:

    t = b / (s/√S_xx)

The Critical Region: reject H0: β = 0 in favour of HA: β > 0 if

    t = b / (s/√S_xx) > t_0.05 = 1.833,   df = 11 − 2 = 9

(a one-tailed test).

Since

    t = 0.2284 / (8.35/√14322.55) = 3.27 > 1.833

we reject H0: β = 0 and conclude HA: β > 0.
Confidence Limits for Points on the
Regression Line
• The intercept a is a specific point on the regression
line.
• It is the y – coordinate of the point on the regression
line when x = 0.
• It is the predicted value of y when x = 0.
• We may also be interested in other points on the
regression line. e.g. when x = x0
• In this case the y-coordinate of the point on the regression line when x = x0 is a + b x0.

[Figure: the line y = a + bx with the point at x = x0, height a + b x0, marked.]
(1 − α)100% Confidence Limits for a + b x0:

    a + b x0 ± t_{α/2} s √(1/n + (x0 − x̄)²/S_xx)

t_{α/2} is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.
Prediction Limits for new values of the
Dependent variable y
• An important application of the regression line is
prediction.
• Knowing the value of x (x0), what is the value of y?
• The predicted value of y when x = x0 is:

    y = α + β x0

• This in turn can be estimated by:

    ŷ = a + b x0

The predictor

    ŷ = a + b x0

• Gives only a single value for y.
• A more appropriate piece of information would
be a range of values.
• A range of values that has a fixed probability of
capturing the value for y.
• A (1- a)100% prediction interval for y.
(1 − α)100% Prediction Limits for y when x = x0:

    a + b x0 ± t_{α/2} s √(1 + 1/n + (x0 − x̄)²/S_xx)

t_{α/2} is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.
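Both sets of limits share the same shape and can be computed by one helper (a sketch; `interval_limits` is my name, and the constants in the call are from the fire damage example that follows):

```python
import math

def interval_limits(x0, a, b, s, n, xbar, Sxx, t_crit, predict=False):
    """Confidence limits for the mean a + b*x0, or (predict=True)
    prediction limits for a new observation y at x = x0; df = n - 2."""
    extra = 1.0 if predict else 0.0   # prediction adds one extra unit of variance
    half = t_crit * s * math.sqrt(extra + 1.0 / n + (x0 - xbar) ** 2 / Sxx)
    centre = a + b * x0
    return centre - half, centre + half

# Fire example at x0 = 3: a = 10.28, b = 4.92, s = 2.316, n = 15,
# xbar = 3.28, Sxx = 34.784, t_.025 = 2.160 (df = 13).
conf_lo, conf_hi = interval_limits(3, 10.28, 4.92, 2.316, 15, 3.28, 34.784, 2.160)
pred_lo, pred_hi = interval_limits(3, 10.28, 4.92, 2.316, 15, 3.28, 34.784, 2.160,
                                   predict=True)
```

The prediction limits are always wider than the confidence limits at the same x0, since they must also cover the random disturbance of the new observation.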
Example
In this example we are studying building fires in a
city and interested in the relationship between:

1. X = the distance between the closest fire hall
and the building that put out the alarm
and

2. Y = cost of the damage (1000\$)

The data was collected on n = 15 fires.
The Data
Fire   Distance Damage
1       3.4     26.2
2       1.8     17.8
3       4.6     31.3
4       2.3     23.1
5       3.1     27.5
6       5.5     36.0
7       0.7     14.1
8       3.0     22.3
9       2.6     19.6
10      4.3     31.3
11      2.1     24.0
12      1.1     17.3
13      6.1     43.2
14      4.8     36.4
15      3.8     26.1
Scatter Plot

[Scatter plot: Damage (1000$) vs Distance (miles) for the n = 15 fires.]
Computations

From the data table:

    Σ x_i = 49.2        Σ x_i² = 196.16

    Σ y_i = 396.2       Σ y_i² = 11376.5

    Σ x_i y_i = 1470.65
Computations Continued

    x̄ = Σ x_i / n = 49.2/15 = 3.28

    ȳ = Σ y_i / n = 396.2/15 = 26.4133
Computations Continued

    S_xx = Σ x_i² − (Σ x_i)²/n = 196.16 − (49.2)²/15 = 34.784

    S_yy = Σ y_i² − (Σ y_i)²/n = 11376.5 − (396.2)²/15 = 911.517

    S_xy = Σ x_i y_i − (Σ x_i)(Σ y_i)/n = 1470.65 − (49.2)(396.2)/15 = 171.114
Computations Continued

    b = β̂ = S_xy/S_xx = 171.114/34.784 = 4.92

    a = α̂ = ȳ − b x̄ = 26.4133 − 4.919(3.28) = 10.28

    s = √[ (S_yy − S_xy²/S_xx)/(n − 2) ] = √[ (911.517 − (171.114)²/34.784)/13 ] = 2.316
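The computations above can be reproduced directly from the raw data (a minimal sketch):

```python
import math

# Fire data: distance (miles) and damage (1000$), n = 15.
dist   = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8]
damage = [26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3, 19.6, 31.3, 24.0, 17.3,
          43.2, 36.4, 26.1]
n = len(dist)

Sxx = sum(x * x for x in dist) - sum(dist) ** 2 / n
Syy = sum(y * y for y in damage) - sum(damage) ** 2 / n
Sxy = sum(x * y for x, y in zip(dist, damage)) - sum(dist) * sum(damage) / n

b = Sxy / Sxx                                      # approx. 4.92
a = sum(damage) / n - b * sum(dist) / n            # approx. 10.28
s = math.sqrt((Syy - Sxy ** 2 / Sxx) / (n - 2))    # approx. 2.316
```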
95% Confidence Limits for slope β:

    b ± t_{α/2} s/√S_xx

    4.07 to 5.77

t_.025 = 2.160 is the critical value for the t-distribution with 13 degrees of freedom.
95% Confidence Limits for intercept α:

    a ± t_{α/2} s √(1/n + x̄²/S_xx)

    7.21 to 13.35

t_.025 = 2.160 is the critical value for the t-distribution with 13 degrees of freedom.
Least Squares Line

[Scatter plot of Damage (1000$) vs Distance (miles) with the fitted line y = 4.92x + 10.28.]
(1 − α)100% Confidence Limits for a + b x0:

    a + b x0 ± t_{α/2} s √(1/n + (x0 − x̄)²/S_xx)

t_{α/2} is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.
95% Confidence Limits for a + b x0 :

x0     lower   upper
1      12.87   17.52
2      18.43   21.80
3      23.72   26.35
4      28.53   31.38
5      32.93   36.82
6      37.15   42.44
95% Confidence Limits for a + b x0

[Plot: the fitted line with the 95% confidence limits for the mean drawn as curves about it.]
(1 − α)100% Prediction Limits for y when x = x0:

    a + b x0 ± t_{α/2} s √(1 + 1/n + (x0 − x̄)²/S_xx)

t_{α/2} is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.
95% Prediction Limits for y when x = x0

x0     lower   upper
1       9.68   20.71
2      14.84   25.40
3      19.86   30.21
4      24.75   35.16
5      29.51   40.24
6      34.13   45.45
95% Prediction Limits for y when x = x0

[Plot: the fitted line with the 95% prediction limits drawn as curves about it; they are wider than the confidence limits.]
Linear Regression
Summary
Hypothesis testing and Estimation
(1 − α)100% Confidence Limits for slope β:

    b ± t_{α/2} s/√S_xx

t_{α/2} is the critical value for the t-distribution with n − 2 degrees of freedom.
Testing the slope

    H0: β = β0  vs  HA: β ≠ β0

The test statistic is:

    t = (b − β0) / (s/√S_xx)

which has a t distribution with df = n − 2 if H0 is true.
(1 − α)100% Confidence Limits for intercept α:

    a ± t_{α/2} s √(1/n + x̄²/S_xx)

t_{α/2} is the critical value for the t-distribution with n − 2 degrees of freedom.
Testing the intercept

    H0: α = α0  vs  HA: α ≠ α0

The test statistic is:

    t = (a − α0) / (s √(1/n + x̄²/Σ (x_i − x̄)²))

which has a t distribution with df = n − 2 if H0 is true.
(1 − α)100% Confidence Limits for a + b x0:

    a + b x0 ± t_{α/2} s √(1/n + (x0 − x̄)²/S_xx)

t_{α/2} is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.
(1 − α)100% Prediction Limits for y when x = x0:

    a + b x0 ± t_{α/2} s √(1 + 1/n + (x0 − x̄)²/S_xx)

t_{α/2} is the α/2 critical value for the t-distribution with n − 2 degrees of freedom.
Correlation
Definition

The statistic:

    r = S_xy / √(S_xx S_yy) = Σ (x_i − x̄)(y_i − ȳ) / √[ Σ (x_i − x̄)² Σ (y_i − ȳ)² ]

is called Pearson's correlation coefficient.
Properties

1. -1 ≤ r ≤ 1, |r| ≤ 1, r2 ≤ 1
2. |r| = 1 (r = +1 or -1) if the points
(x1, y1), (x2, y2), …, (xn, yn) lie along a
straight line. (positive slope for +1,
negative slope for -1)
The test for independence (zero correlation)
H0: X and Y are independent
HA: X and Y are correlated
The test statistic:

    t = √(n − 2) · r / √(1 − r²)
The Critical region
Reject H0 if |t| > ta/2 (df = n – 2)

This is a two-tailed critical region, the critical
region could also be one-tailed
Example
We return to the building fires example: X = the distance between the closest fire hall and the building that put out the alarm, Y = cost of the damage (1000$), for the same n = 15 fires. The data, scatter plot, and the computations S_xx = 34.784, S_yy = 911.517 and S_xy = 171.114 are exactly as given earlier.
The correlation coefficient

    r = S_xy / √(S_xx S_yy) = 171.114 / √((34.784)(911.517)) = 0.961
The test for independence (zero correlation)
The test statistic:

    t = √(n − 2) · r / √(1 − r²) = √13 · 0.961 / √(1 − 0.961²) = 12.525

We reject H0: independence if |t| > t_0.025 = 2.160 (df = 13).
Since 12.525 > 2.160, H0: independence is rejected.
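The whole correlation test can be sketched as one function (the function name is mine; the data are the fire example pairs):

```python
import math

def corr_test(x, y):
    """Pearson's r and the t statistic (df = n - 2) for testing
    H0: zero correlation."""
    n = len(x)
    Sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    Syy = sum(v * v for v in y) - sum(y) ** 2 / n
    Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    r = Sxy / math.sqrt(Sxx * Syy)
    t = math.sqrt(n - 2) * r / math.sqrt(1 - r * r)
    return r, t

dist   = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8]
damage = [26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3, 19.6, 31.3, 24.0, 17.3,
          43.2, 36.4, 26.1]
r, t = corr_test(dist, damage)
```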
Relationship between Regression and Correlation

Recall:

    r = S_xy / √(S_xx S_yy)

Also:

    β̂ = S_xy/S_xx = √(S_yy/S_xx) · S_xy/√(S_xx S_yy) = √(S_yy/S_xx) · r = (s_y/s_x) · r

since

    s_x = √(S_xx/(n − 1))   and   s_y = √(S_yy/(n − 1))

Thus the slope of the least squares line is simply the ratio of the standard deviations × the correlation coefficient.
The test for independence (zero correlation)
H0: X and Y are independent
HA: X and Y are correlated
uses the test statistic:

    t = √(n − 2) · r / √(1 − r²)

Note:

    β̂ = √(S_yy/S_xx) · r   and   r = √(S_xx/S_yy) · β̂
The two tests

1. The test for independence (zero correlation)
   H0: X and Y are independent
   HA: X and Y are correlated

2. The test for zero slope
   H0: β = 0
   HA: β ≠ 0

are equivalent.

The test statistic for independence is:

    t = √(n − 2) · r / √(1 − r²)

Substituting r = S_xy/√(S_xx S_yy):

    t = √(n − 2) · [S_xy/√(S_xx S_yy)] / √(1 − S_xy²/(S_xx S_yy))
      = √(n − 2) · (S_xy/√S_xx) / √(S_yy − S_xy²/S_xx)
      = (S_xy/S_xx) / [ √((S_yy − S_xy²/S_xx)/(n − 2)) / √S_xx ]
      = β̂ / (s/√S_xx)

the same statistic used for testing for zero slope.
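The equivalence can also be checked numerically on the fire data: the statistic computed both ways agrees to floating point accuracy (a sketch):

```python
import math

dist   = [3.4, 1.8, 4.6, 2.3, 3.1, 5.5, 0.7, 3.0, 2.6, 4.3, 2.1, 1.1, 6.1, 4.8, 3.8]
damage = [26.2, 17.8, 31.3, 23.1, 27.5, 36.0, 14.1, 22.3, 19.6, 31.3, 24.0, 17.3,
          43.2, 36.4, 26.1]
n = len(dist)

Sxx = sum(x * x for x in dist) - sum(dist) ** 2 / n
Syy = sum(y * y for y in damage) - sum(damage) ** 2 / n
Sxy = sum(x * y for x, y in zip(dist, damage)) - sum(dist) * sum(damage) / n

# t computed from the slope ...
b = Sxy / Sxx
s = math.sqrt((Syy - Sxy ** 2 / Sxx) / (n - 2))
t_slope = b / (s / math.sqrt(Sxx))

# ... and from the correlation coefficient.
r = Sxy / math.sqrt(Sxx * Syy)
t_corr = math.sqrt(n - 2) * r / math.sqrt(1 - r * r)
```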
Regression (in general)
In many experiments we will have collected data on a single variable Y (the dependent variable) and on p (say) other variables X1, X2, X3, ..., Xp (the independent variables).

One is interested in determining a model that describes the relationship between Y (the response (dependent) variable) and X1, X2, …, Xp (the predictor (independent) variables).

This model can be used for
– Prediction
– Controlling Y by manipulating X1, X2, …, Xp
The Model:
is an equation of the form

    Y = f(X1, X2, ..., Xp | θ1, θ2, ..., θq) + ε

where θ1, θ2, ..., θq are unknown parameters of the function f, and ε is a random disturbance (usually assumed to have a normal distribution with mean 0 and standard deviation σ).
Examples:
1. Y = Blood Pressure, X = age
The model

    Y = α + βX + ε,  thus θ1 = α and θ2 = β.

This model is called: the simple Linear Regression Model
[Figure: scatter plot with the fitted line Y = α + βX.]
2. Y = average of five best times for running the 100m, X = the year
The model

    Y = α e^(−βX) + γ + ε,  thus θ1 = α, θ2 = β and θ3 = γ.

This model is called: the exponential Regression Model
[Figure: the times (roughly 8 to 12.5 seconds) against year (1930 to 2010), with the fitted curve Y = α e^(−βX) + γ.]
3. Y = gas mileage (mpg) of a car brand
   X1 = engine size
   X2 = horsepower
   X3 = weight
The model

    Y = β0 + β1 X1 + β2 X2 + β3 X3 + ε.

This model is called: the Multiple Linear Regression Model
The Multiple Linear Regression Model
In Multiple Linear Regression we assume the following model:

    Y = β0 + β1 X1 + β2 X2 + ... + βp Xp + ε

This model is called the Multiple Linear Regression Model. Again β0, β1, β2, ..., βp are unknown parameters of the model, and ε is a random disturbance assumed to have a normal distribution with mean 0 and standard deviation σ.
The importance of the Linear model

1. It is the simplest form of a model in which each independent variable has some effect on the dependent variable Y.
   – When fitting models to data one tries to find the simplest form of a model that still adequately describes the relationship between the dependent variable and the independent variables.
   – The linear model is often the first model to be fitted, and is abandoned only if it turns out to be inadequate.

2. In many instances a linear model is the most appropriate model to describe the dependence relationship between the dependent variable and the independent variables.
   – This will be true if the dependent variable increases at a constant rate as any of the independent variables is increased while holding the other independent variables constant.

3. Many non-linear models can be linearized (put into the form of a linear model) by appropriately transforming the dependent variable and/or any or all of the independent variables.
   – This important fact ensures the wide utility of the linear model (i.e. the fact that many non-linear models are linearizable).
An Example
The following data comes from an experiment
that was interested in investigating the source
from which corn plants in various soils obtain
their phosphorous.
–The concentration of inorganic phosphorous (X1)
and the concentration of organic phosphorous (X2)
was measured in the soil of n = 18 test plots.
–In addition the phosphorous content (Y) of corn
grown in the soil was also measured. The data is
displayed below:
Inorganic      Organic        Plant Available
Phosphorous    Phosphorous    Phosphorous
X1             X2             Y
0.4            53             64
0.4            23             60
3.1            19             71
0.6            34             61
4.7            24             54
1.7            65             77
9.4            44             81
10.1           31             93
11.6           29             93
12.6           58             51
10.9           37             76
23.1           46             96
23.1           50             77
21.6           44             93
23.1           56             95
1.9            36             54
26.8           58             168
29.9           51             99
Coefficients
Intercept        56.2510241 (b0)

X1               1.78977412 (b1)

X2               0.08664925 (b2)

Equation:
Y = 56.2510241 + 1.78977412 X1 + 0.08664925 X2
Summary of the Statistics
used in
Multiple Regression
The Least Squares Estimates:

    b0, b1, b2, ..., bp

– the values that minimize

    RSS = Σ_{i=1}^n (y_i − ŷ_i)² = Σ_{i=1}^n (y_i − (b0 + b1 x_1i + b2 x_2i + ... + bp x_pi))²
The Analysis of Variance Table Entries

a) Adjusted Total Sum of Squares (SSTotal)

    SSTotal = Σ_{i=1}^n (y_i − ȳ)²,   d.f. = n − 1

b) Residual Sum of Squares (SSError)

    RSS = SSError = Σ_{i=1}^n (y_i − ŷ_i)²,   d.f. = n − p − 1

c) Regression Sum of Squares (SSReg)

    SSReg = SS(b1, b2, ..., bp) = Σ_{i=1}^n (ŷ_i − ȳ)²,   d.f. = p

Note:

    Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + Σ (y_i − ŷ_i)²

i.e. SSTotal = SSReg + SSError
The Analysis of Variance Table

Source       Sum of Squares d.f.   Mean Square                       F

Regression       SSReg        p   SSReg/p = MSReg                 MSReg/s2
Error            SSError    n-p-1 SSError/(n-p-1) =MSError = s2

Total            SSTotal     n-1
Uses:
1. To estimate σ² (the error variance).
   – Use s² = MSError to estimate σ².
2. To test the hypothesis
   H0: β1 = β2 = ... = βp = 0.
   Use the test statistic

    F = MSReg/MSError = MSReg/s² = [SSReg/p] / [SSError/(n − p − 1)]

   – Reject H0 if F > F_α(p, n − p − 1).
3. To compute other statistics that are useful in describing the relationship between Y (the dependent variable) and X1, X2, ..., Xp (the independent variables).

a) R² = the coefficient of determination = SSReg/SSTotal

    = Σ (ŷ_i − ȳ)² / Σ (y_i − ȳ)²

   = the proportion of variance in Y explained by X1, X2, ..., Xp

   1 − R² = the proportion of variance in Y that is left unexplained by X1, X2, ..., Xp = SSError/SSTotal.
b) Ra² = "R² adjusted" for degrees of freedom
   = 1 − [the proportion of variance in Y that is left unexplained by X1, X2, ..., Xp, adjusted for d.f.]

    = 1 − MSError/MSTotal
    = 1 − [SSError/(n − p − 1)] / [SSTotal/(n − 1)]
    = 1 − (n − 1)/(n − p − 1) · SSError/SSTotal
    = 1 − (n − 1)/(n − p − 1) · (1 − R²)
c) R = √R² = the Multiple correlation coefficient of Y with X1, X2, ..., Xp

    = √(SSReg/SSTotal)

   = the maximum correlation between Y and a linear combination of X1, X2, ..., Xp

Comment: The statistics F, R2, Ra2 and R are
equivalent statistics.
Using Statistical Packages

To perform Multiple Regression
Using SPSS

Note: The use of another statistical package
such as Minitab is similar to using SPSS
After starting the SPSS program the following dialogue box appears:
If you select Opening an existing file and press OK the
following dialogue box appears
The following dialogue box appears:
If the variable names are in the file ask it to read the
names. If you do not specify the Range the program will
identify the Range:

Once you “click OK”, two windows will appear
One that will contain the output:
The other containing the data:
To perform any statistical analysis, select the Analyze menu.
Then select Regression and Linear.
The following Regression dialogue box appears
Select the Dependent variable Y.
Select the Independent variables X1, X2, etc.
If you select the Method - Enter.
All variables will be put into the equation.

There are also several other methods that can be
used :

1. Forward selection
2. Backward Elimination
3. Stepwise Regression
Forward selection
1. This method starts with no variables in the equation.
2. Carries out statistical tests on variables not in the equation to see which have a significant effect on the dependent variable.
3. It then adds the most significant.
4. Continues until all variables not in the equation have no significant effect on the dependent variable.
Backward Elimination
1. This method starts with all variables in the
equation
2. Carries out statistical tests on variables in the
equation to see which have no significant
effect on the dependent variable.
3. Deletes the least significant.
4. Continues until all variables in the equation
have a significant effect on the dependent
variable.
Stepwise Regression (uses both forward and
backward techniques)
1. This method starts with no variables in the
equation
2. Carries out statistical tests on variables not in
the equation to see which have a significant
effect on the dependent variable.
3. It then adds the most significant.
4. After a variable is added it checks to see if any
variables added earlier can now be deleted.
5. Continues until all variables not in the
equation have no significant effect on the
dependent variable.
All of these methods are procedures for attempting to find the best equation.

The best equation is the one that is the simplest (not containing variables that are unimportant).
Once the dependent variable, the independent variables and the
Method have been selected if you press OK, the Analysis will
be performed.
The output will contain the following table:

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .822a   .676       .673                4.46
a. Predictors: (Constant), WEIGHT, HORSE, ENGINE

R² and R² adjusted measure the proportion of variance in Y that is explained by X1, X2, X3, etc. (67.6% and 67.3%).
R is the Multiple correlation coefficient (the maximum
correlation between Y and a linear combination of X1,
X2, X3, etc)
The next table is the Analysis of Variance Table:

ANOVA(b)

Model          Sum of Squares   df    Mean Square   F         Sig.
1  Regression  16098.158          3   5366.053      269.664   .000a
   Residual     7720.836        388     19.899
   Total       23818.993        391
a. Predictors: (Constant), WEIGHT, HORSE, ENGINE
b. Dependent Variable: MPG

The F test is testing whether the regression coefficients of the predictor variables are all zero, namely that none of the independent variables X1, X2, X3, etc. has any effect on Y.
The final table in the output:

Coefficients(a)

                Unstandardized Coefficients   Standardized Coefficients
Model           B            Std. Error       Beta          t        Sig.
1  (Constant)   44.015       1.272                          34.597   .000
   ENGINE       -5.53E-03    .007             -.074         -.786    .432
   HORSE        -5.56E-02    .013             -.273         -4.153   .000
   WEIGHT       -4.62E-03    .001             -.504         -6.186   .000
a. Dependent Variable: MPG

This gives the estimates of the regression coefficients, their standard errors, and the t test for testing whether each is zero.
Note: Engine size has no significant effect on
Mileage
The estimated equation from the table below:
Coefficientsa

Standardi
zed
Unstandardized        Coefficien
Coefficients             ts
Model                    B        Std. Error     Beta         t       Sig.
1       (Constant)      44.015        1.272                 34.597      .000
ENGINE       -5.53E-03         .007        -.074      -.786     .432
HORSE        -5.56E-02         .013        -.273     -4.153     .000
WEIGHT       -4.62E-03         .001        -.504     -6.186     .000
a. Dependent Variable: MPG

Is:
5.53          5.56         4.62
Mileage  44.0       Engine       Horse       Weight  Error
1000          100          1000
Note the equation is:

5.53          5.56         4.62
Mileage  44.0       Engine       Horse       Weight  Error
1000          100          1000

Mileage decreases:

1. With increases in Engine Size (not significant, p = 0.432)
2. With increases in Horsepower (significant, p = 0.000)
3. With increases in Weight (significant, p = 0.000)
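The fitted equation above can be used directly for prediction. A minimal sketch; the coefficients come from the SPSS table, but the example car's values are made up for illustration:

```python
# Fitted multiple regression equation from the SPSS Coefficients table:
# Mileage = 44.0 - (5.53/1000)*Engine - (5.56/100)*Horse - (4.62/1000)*Weight
def predicted_mpg(engine, horse, weight):
    """Predicted mileage (MPG) for given engine size, horsepower and weight."""
    return 44.0 - 5.53e-3 * engine - 5.56e-2 * horse - 4.62e-3 * weight

# Hypothetical car (values not from the slides): 200 cu in engine, 100 HP, 3000 lb
print(round(predicted_mpg(200, 100, 3000), 2))  # 23.47
```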
Logistic regression
Recall the simple linear regression model:

y = b0 + b1x + e

where we are trying to predict a continuous
dependent variable y from a continuous
independent variable x.
This model can be extended to the multiple linear
regression model:

y = b0 + b1x1 + b2x2 + … + bpxp + e

Here we are trying to predict a continuous
dependent variable y from several continuous
independent variables x1, x2, … , xp.
Now suppose the dependent variable y is
binary: it takes on two values, "Success" (1) or
"Failure" (0).

We are interested in predicting y from a
continuous independent variable x.

This is the situation in which Logistic
Regression is used.
Example
We are interested in how the success (y) of a new
antibiotic cream in curing "acne problems"
depends on the amount (x) that is applied daily.
The values of y are 1 (Success) or 0 (Failure).
The values of x range over a continuum.
The Logistic Regression Model
Let p denote P[y = 1] = P[Success].
This quantity will increase with the value of x.

The ratio p/(1 - p) is called the odds ratio.

This quantity will also increase with the value of
x, ranging from zero to infinity.

The quantity ln( p/(1 - p) ) is called the log odds ratio.
Example: odds ratio, log odds ratio
Suppose a die is rolled:
Success = "roll a six", p = 1/6

The odds ratio:

p/(1 - p) = (1/6) / (1 - 1/6) = (1/6) / (5/6) = 1/5

The log odds ratio:

ln( p/(1 - p) ) = ln(1/5) = ln(0.2) = -1.60944
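The die-roll numbers are easy to check; a quick sketch:

```python
import math

p = 1 / 6            # P[roll a six]
odds = p / (1 - p)   # odds ratio = (1/6)/(5/6) = 1/5
log_odds = math.log(odds)

print(round(odds, 4))      # 0.2
print(round(log_odds, 5))  # -1.60944
```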
The Logistic Regression Model
Assumes the log odds ratio is linearly
related to x, i.e.:

ln( p/(1 - p) ) = b0 + b1x

In terms of the odds ratio:

p/(1 - p) = e^(b0 + b1x)
The Logistic Regression Model
Solving for p in terms of x:

p/(1 - p) = e^(b0 + b1x)

p = e^(b0 + b1x) (1 - p)

p + p e^(b0 + b1x) = e^(b0 + b1x)

or

p = e^(b0 + b1x) / (1 + e^(b0 + b1x))
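The formula for p is a one-line function; a minimal sketch (the sanity-check values of b0 and b1 are made up):

```python
import math

def logistic_p(x, b0, b1):
    """P[y = 1] under the logistic model: p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

# Sanity check: p = 0.5 exactly when b0 + b1*x = 0, i.e. x = -b0/b1
print(logistic_p(2.0, b0=-2.0, b1=1.0))  # 0.5
```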
Interpretation of the parameter b0
(determines the intercept)

[Figure: graph of p vs x (x from 0 to 10, p from 0 to 1); at x = 0, p = e^b0 / (1 + e^b0)]
Interpretation of the parameter b1
(determines when p is 0.50, along with b0)

[Figure: graph of p vs x; p = e^(b0 + b1x) / (1 + e^(b0 + b1x)) = 1/2 when b0 + b1x = 0, i.e. x = -b0/b1]
Also:

dp/dx = d/dx [ e^(b0 + b1x) / (1 + e^(b0 + b1x)) ]

      = [ e^(b0 + b1x) b1 (1 + e^(b0 + b1x)) - e^(b0 + b1x) b1 e^(b0 + b1x) ] / (1 + e^(b0 + b1x))^2

      = e^(b0 + b1x) b1 / (1 + e^(b0 + b1x))^2

      = b1/4   when x = -b0/b1

Thus b1/4 is the rate of increase in p with respect to x when p = 0.50.
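The claim that the slope equals b1/4 at p = 0.50 can be checked numerically with a central difference; a small sketch using made-up values b0 = -2, b1 = 1:

```python
import math

def logistic_p(x, b0, b1):
    """p = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))"""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

# Numerical derivative of p at x = -b0/b1, the point where p = 0.50
b0, b1 = -2.0, 1.0   # example values, not from the slides
x0 = -b0 / b1
h = 1e-6
slope = (logistic_p(x0 + h, b0, b1) - logistic_p(x0 - h, b0, b1)) / (2 * h)
print(round(slope, 6), b1 / 4)  # both print 0.25: slope = b1/4 at p = 0.50
```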
Interpretation of the parameter b1
(determines the slope when p is 0.50)

[Figure: graph of p vs x; at p = 0.50 the tangent line has slope = b1/4]
The data
The data for each case consist of:

1. a value for x, the continuous independent variable
2. a value for y (1 or 0) (Success or Failure)

Total of n = 250 cases
case    x    y        case    x    y
  1    0.8   0         230   4.7   1
  2    2.3   1         231   0.3   0
  3    2.5   0         232   1.4   0
  4    2.8   1         233   4.5   1
  5    3.5   1         234   1.4   1
  6    4.4   1         235   4.5   1
  7    0.5   0         236   3.9   0
  8    4.5   1         237   0.0   0
  9    4.4   1         238   4.3   1
 10    0.9   0         239   1.0   0
 11    3.3   1         240   3.9   1
 12    1.1   0         241   1.1   0
 13    2.5   1         242   3.4   1
 14    0.3   1         243   0.6   0
 15    4.5   1         244   1.6   0
 16    1.8   0         245   3.9   0
 17    2.4   1         246   0.2   0
 18    1.6   0         247   2.5   0
 19    1.9   1         248   4.1   1
 20    4.6   1         249   4.2   1
                       250   4.9   1
Estimation of the parameters

The parameters are estimated by Maximum
Likelihood estimation, which requires a statistical
package such as SPSS.
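SPSS carries out this maximization internally. As a rough illustration of what maximum likelihood estimation involves, here is a sketch of Newton-Raphson fitting for simple logistic regression on simulated data (the data and true parameter values here are made up, not the slide's n = 250 data set):

```python
import math
import random

def fit_logistic(xs, ys, iters=25):
    """Maximum likelihood fit of ln(p/(1-p)) = b0 + b1*x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p                # gradient of the log-likelihood
            g1 += (y - p) * x
            w = p * (1.0 - p)          # weights of the information matrix
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01    # invert the 2x2 information matrix
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: b += H^-1 g
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Simulated data from a true model with b0 = -2, b1 = 1
random.seed(1)
xs = [random.uniform(0.0, 5.0) for _ in range(2000)]
ys = [1 if random.random() < 1.0 / (1.0 + math.exp(-(-2.0 + x))) else 0 for x in xs]
b0_hat, b1_hat = fit_logistic(xs, ys)
print(b0_hat, b1_hat)  # estimates close to the true values -2 and 1
```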
Using SPSS to perform Logistic regression
Open the data file, then choose:
Analyze -> Regression -> Binary Logistic
The following dialogue box appears.

Select the dependent variable (y) and the independent
variable (x) (covariate).
Press OK.
Here is the output.

The Estimates and their S.E.

The parameter estimates:

Parameter      b        SE
X           1.0309   0.1334
Constant   -2.0475   0.332

i.e. b1 = 1.0309 and b0 = -2.0475.
Interpretation of the parameter b0
(determines the intercept)

intercept = e^b0 / (1 + e^b0) = e^(-2.0475) / (1 + e^(-2.0475)) = 0.1143
Interpretation of the parameter b1
(determines when p is 0.50, along with b0)

x = -b0/b1 = 2.0475/1.0309 = 1.986
Another interpretation of the parameter b1

b1/4 is the rate of increase in p with respect to x when p = 0.50:

b1/4 = 1.0309/4 = 0.258
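All three interpretations can be computed directly from the SPSS estimates; a short sketch:

```python
import math

b0, b1 = -2.0475, 1.0309  # estimates from the SPSS output

p_at_zero = math.exp(b0) / (1 + math.exp(b0))  # p when x = 0 (the intercept)
x_at_half = -b0 / b1                           # x at which p = 0.50
slope_at_half = b1 / 4                         # slope of p vs x at that point

print(round(p_at_zero, 4))      # 0.1143
print(round(x_at_half, 3))      # 1.986
print(round(slope_at_half, 3))  # 0.258
```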
The Logistic Regression Model

The dependent variable y is binary: it takes on
two values, "Success" (1) or "Failure" (0).

We are interested in predicting y from a
continuous independent variable x.
The Logistic Regression Model
Let p denote P[y = 1] = P[Success].
This quantity will increase with the value of x.

The ratio p/(1 - p) is called the odds ratio.

This quantity will also increase with the value of
x, ranging from zero to infinity.

The quantity ln( p/(1 - p) ) is called the log odds ratio.
The Logistic Regression Model
Assumes the log odds ratio is linearly
related to x, i.e.:

ln( p/(1 - p) ) = b0 + b1x

In terms of the odds ratio:

p/(1 - p) = e^(b0 + b1x)
The Logistic Regression Model
In terms of p:

p = e^(b0 + b1x) / (1 + e^(b0 + b1x))
The graph of p vs x

[Figure: sigmoid curve of p = e^(b0 + b1x) / (1 + e^(b0 + b1x)) against x from 0 to 10]
The Multiple Logistic Regression Model
Here we attempt to predict the outcome of a
binary response variable Y from several
independent variables X1, X2, … etc.:

ln( p/(1 - p) ) = b0 + b1X1 + … + bpXp

or

p = e^(b0 + b1X1 + … + bpXp) / (1 + e^(b0 + b1X1 + … + bpXp))
Multiple Logistic Regression: an example
In this example we are interested in determining
the risk of infants (who were born prematurely)
of developing BPD (bronchopulmonary
dysplasia).
More specifically, we are interested in developing
a predictive model which will determine the
probability of developing BPD from
X1 = gestational age and X2 = birth weight.
For n = 223 infants in a prenatal ward the
following measurements were determined:

1. X1 = gestational age (weeks),
2. X2 = birth weight (grams), and
3. Y = presence of BPD
The data (first 26 of the n = 223 cases shown):

case   Gestational Age   Birthweight   Presence of BPD
  1         28.6             1119             1
  2         31.5             1222             0
  3         30.3             1311             1
  4         28.9             1082             0
  5         30.3             1269             0
  6         30.5             1289             0
  7         28.5             1147             0
  8         27.9             1136             1
  9         30                972             0
 10         31               1252             0
 11         27.4              818             0
 12         29.4             1275             0
 13         30.8             1231             0
 14         30.4             1112             0
 15         31.1             1353             1
 16         26.7             1067             1
 17         27.4              846             1
 18         28               1013             0
 19         29.3             1055             0
 20         30.4             1226             0
 21         30.2             1237             0
 22         30.2             1287             0
 23         30.1             1215             0
 24         27                929             1
 25         30.3             1159             0
 26         27.4             1046             1
The results

Variables in the Equation

Step 1a            B        S.E.     Wald     df    Sig.    Exp(B)
Birthweight       -.003     .001     4.885     1    .027     .998
GestationalAge    -.505     .133    14.458     1    .000     .604
Constant         16.858    3.642    21.422     1    .000   2.1E+07
a. Variable(s) entered on step 1: Birthweight, GestationalAge.

ln( p/(1 - p) ) = 16.858 - .003 BW - .505 GA

p/(1 - p) = e^(16.858 - .003 BW - .505 GA)

p = e^(16.858 - .003 BW - .505 GA) / (1 + e^(16.858 - .003 BW - .505 GA))
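The fitted equation gives an estimated risk for any combination of birth weight and gestational age. A sketch; the example infant's values (GA = 28 weeks, 1100 g) are mine, not from the slides:

```python
import math

def bpd_risk(bw, ga):
    """Estimated P[BPD] from the fitted model: logit(p) = 16.858 - .003*BW - .505*GA."""
    z = 16.858 - 0.003 * bw - 0.505 * ga
    return math.exp(z) / (1 + math.exp(z))

# Hypothetical infant: birth weight 1100 g, gestational age 28 weeks
print(round(bpd_risk(1100, 28), 3))  # 0.358
```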
Graph: Showing Risk of BPD vs GA and Birth Weight

[Figure: estimated risk of BPD vs birth weight (700 to 1700 g), one curve for each GA = 27, 28, 29, 30, 31, 32 weeks; risk decreases with both birth weight and gestational age]
Non-Parametric Statistics

```