					                          Simple Linear Regression
In many scientific investigations, one wants to know how one quantity is
related to another, for example the distance traveled and the time spent
driving, or a person's age and height. Generally, there are two types of
relationship between a pair of variables: deterministic and probabilistic.

 Deterministic relationship

     $$ s = s_0 + vt $$

     s: distance traveled
     s_0: initial distance
     v: speed (the slope of the line)
     t: time traveled

     [Figure: straight-line plot of distance, with slope v]

                         Probabilistic Relationship

  On many occasions we face a different situation: one variable is related
  to another, but not exactly, as in the following.



Here we cannot predict a person's height from their age with certainty, as we
could predict distance from s = s_0 + vt.
                      Linear Regression

Statistically, a relationship between two variables like the one shown above
can be characterized with a linear model:

                          $$ y = a + bx + \varepsilon $$

Here, x is called the independent variable
      y is called the dependent variable
      ε is the error term
      a is the intercept
      b is the slope

[Figure: plot of y with the error ε marked]
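
As a minimal sketch of the difference between the two kinds of relationship,
the snippet below generates y from the linear model with and without an error
term. The function name simulate_linear, the parameter values, and the choice
of normally distributed errors are assumptions made for illustration; the
slides do not specify a distribution for ε.

```python
import random


def simulate_linear(xs, a, b, sigma, seed=0):
    """Return y = a + b*x + eps for each x, with eps ~ Normal(0, sigma).

    The normal error distribution is an illustrative assumption.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return [a + b * x + rng.gauss(0.0, sigma) for x in xs]


xs = [1.0, 2.0, 3.0, 4.0]

# sigma = 0 removes the error term: a deterministic relationship.
ys_exact = simulate_linear(xs, a=2.0, b=0.5, sigma=0.0)

# sigma > 0 scatters the points around the line: a probabilistic one.
ys_noisy = simulate_linear(xs, a=2.0, b=0.5, sigma=1.0)

print(ys_exact)
print(ys_noisy)
```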



                         Least Squares Line
   Given some pairs of data for independent and dependent variables,
   we may draw many lines through the scattered points



The least squares line is the line through the points that minimizes the
vertical distances between the points and the line. More precisely, it
minimizes the sum of the squared error terms ε.
                 Least Squares Method

For notational convenience, the line fitted through the points is often
written as ŷ = a + bx, while the linear model we wrote before is
y = a + bx + ε.
 If we use the value on the line, ŷ, to estimate y, the difference is (y - ŷ).
  For points above the line the difference is positive; for points below the
  line it is negative.
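
The sign rule above can be sketched in a few lines. The three points and the
line ŷ = 1 + x are hypothetical values chosen for illustration, and the helper
name residuals is not from the slides.

```python
def residuals(xs, ys, a, b):
    """Return y - y_hat for each point, where y_hat = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical points and a hypothetical line y_hat = 1 + x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 2.5, 4.0]

for x, y, r in zip(xs, ys, residuals(xs, ys, a=1.0, b=1.0)):
    if r > 0:
        position = "above the line"
    elif r < 0:
        position = "below the line"
    else:
        position = "on the line"
    print(f"x={x}, y={y}, y - y_hat = {r:+.2f} ({position})")
```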

[Figure: a data point y and the line ŷ = a + bx, with the vertical difference (y - ŷ) marked]
                     Error Sum of Squares

For some points the values of (y - ŷ) are positive (points above the line) and for
others they are negative (points below the line). If we simply add them up, the
positive and negative values cancel. Therefore we square each difference and sum the
squares. This sum is called the Error Sum of Squares (SSE):

                      $$ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

 The constants a and b are estimated so that the error sum of squares is
 minimized, hence the name least squares.
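
A minimal sketch of why the differences are squared, and of how SSE ranks
candidate lines. The four points and the two candidate lines are hypothetical;
the function name sse is chosen here for illustration.

```python
def sse(xs, ys, a, b):
    """Error sum of squares of the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 1.0, 3.0]

# For the flat line y_hat = 2.5 the raw differences cancel...
raw_sum = sum(y - 2.5 for y in ys)
# ...but the squared differences do not.
flat_sse = sse(xs, ys, a=2.5, b=0.0)

# A second candidate line, y_hat = x, can be compared by its SSE.
other_sse = sse(xs, ys, a=0.0, b=1.0)

print(raw_sum, flat_sse, other_sse)
```

Of these two candidates, the one with the smaller SSE fits the points better
in the least squares sense.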
           Estimating Regression Coefficients

Solving for the regression coefficients a and b by minimizing SSE gives the
 following solutions.

            $$ b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} $$

            $$ a = \bar{y} - b\bar{x} $$

  where x_i is the ith value of the independent variable,
        y_i is the value of the dependent variable corresponding to x_i,
        x_bar and y_bar are the mean values of x and y.
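
The two formulas translate directly into code. This is a sketch under stated
assumptions: the function name fit_line and the sample points are hypothetical,
and the guard against identical x values is an addition not discussed in the
slides.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least squares line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are equal; the slope is undefined")
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical points that lie exactly on y = 1 + 2x.
a, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a, b)
```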
                        Interpretation of a and b

The constant b is the slope, which gives the change in y (dependent variable) for a
change of one unit in x (independent variable). If b > 0, x and y are positively correlated,
meaning y increases as x increases and decreases as x decreases. If b < 0, x and y are
negatively correlated. The constant a is the intercept, the value of ŷ at x = 0.

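[Figure: two plots of y against x, one for each sign of the slope; the second is labeled with intercept a and b < 0]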
                        Correlation Coefficient

Although we now have a regression line describing the relationship between the
dependent variable and the independent variable, the line alone does not tell us
how strong that relationship is, as the following graphs show.

[Figure: two scatter plots of y against x, labeled (1) and (2), each with its fitted line]

Obviously the relationship between x and y in (1) is stronger than that in (2), even
though the line in (2) is still the best-fit line. The statistic that characterizes the
strength of the relationship is the correlation coefficient, or its square, R^2.
                         How Is R^2 Calculated?

                $$ y - \bar{y} = (\hat{y} - \bar{y}) + (y - \hat{y}) $$

If we use y_bar to estimate y, the error is (y - y_bar). If we use ŷ to estimate y, the
error is (y - ŷ). The error is therefore reduced from (y - y_bar) to (y - ŷ), and
(ŷ - y_bar) is the improvement over using y_bar. This is true for all points in the
graph. To measure the total improvement, we would sum the improvements (ŷ - y_bar)
over all points, but we face the same problem as when calculating a variance: positive
and negative values cancel. So we square each difference and sum the squared
differences over all points.
                                R Square
Regression Sum of Squares

   $$ SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 $$

Total Sum of Squares

   $$ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 $$

   $$ R^2 = \frac{SSR}{SST} $$

  R^2 indicates the proportion of the variance in y explained by the regression.
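
A minimal sketch of R^2 as the ratio of the regression sum of squares to the
total sum of squares. The function name r_squared and the three data points are
hypothetical; the fitted values are assumed to come from the least squares
line, since the ratio is only read as a share of variance in that case.

```python
def r_squared(ys, y_hats):
    """Return SSR / SST given observed values and fitted values."""
    y_bar = sum(ys) / len(ys)
    ssr = sum((y_hat - y_bar) ** 2 for y_hat in y_hats)
    sst = sum((y - y_bar) ** 2 for y in ys)
    return ssr / sst


# Hypothetical data; a = 13/6 and b = 0.5 are the least squares
# estimates for these three points.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 2.5, 4.0]
a, b = 13 / 6, 0.5
y_hats = [a + b * x for x in xs]

print(r_squared(ys, y_hats))
```

A value near 1 means the fitted values track the observations closely; a value
near 0 means the line explains little beyond the mean of y.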

  We already calculated SSE (Error Sum of Squares) while estimating a and b. In fact,
  the following relationship holds:

   $$ SST = SSR + SSE $$
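
The relationship referred to here is the decomposition SST = SSR + SSE. A short
sketch of why it holds for the least squares line, using the estimates of a and
b given earlier. The shorthand S_xy and S_xx is introduced only for this
derivation and does not appear in the slides.

```latex
\begin{align*}
SST &= \sum_{i=1}^{n} (y_i - \bar{y})^2
     = \sum_{i=1}^{n} \bigl[(\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)\bigr]^2 \\
    &= SSR + SSE + 2\sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \hat{y}_i).
\end{align*}
% Because a = \bar{y} - b\bar{x}, the fitted values satisfy
%   \hat{y}_i - \bar{y} = b(x_i - \bar{x}),
%   y_i - \hat{y}_i = (y_i - \bar{y}) - b(x_i - \bar{x}).
\begin{align*}
\sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \hat{y}_i)
  &= b\Bigl[\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
     - b\sum_{i=1}^{n} (x_i - \bar{x})^2\Bigr] \\
  &= b\,(S_{xy} - b\,S_{xx}) = 0
  \quad\text{since } b = S_{xy} / S_{xx}.
\end{align*}
```

The cross term vanishes only for the least squares estimates, so the
decomposition should not be expected to hold for an arbitrary line.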

                            A Simple Linear Regression Example
   The following survey data show how much a family spends on food in relation
   to household income (x = income in thousand $, y = percent of income left
   after spending on food).
          x     y    x-x_bar    y-y_bar  (x-x_bar)(y-y_bar)  (x-x_bar)^2      y_hat  (y-y_bar)^2  (y_hat-y_bar)^2  (y-y_hat)^2
        6.5    81   1.185714   1.571429         1.863265306   1.40591837  73.254325   2.46938776      38.12130132  59.99548121
          4    96   -1.31429   16.57143        -21.77959184   1.72734694    86.2722   274.612245      46.83527158  94.63009284
        2.5    93   -2.81429   13.57143        -38.19387755   7.92020408  94.082925   184.183673      214.7501205  1.172726556
        7.2    68   1.885714   -11.4286        -21.55102041   3.55591837   69.60932   130.612245      96.41767056  2.589910862
        8.1    63   2.785714   -16.4286        -45.76530612   7.76020408  64.922885   269.897959      210.4148973  3.697486723
        3.4    84   -1.91429   4.571429        -8.751020408    3.6644898   89.39649   20.8979592      99.35942913  29.12210432
        5.5    71   0.185714   -8.42857        -1.565306122    0.0344898  78.461475   71.0408163      0.935272739  55.67360918
sum    37.2   556                              -135.7428571   26.0685714              953.714286      706.8339631  246.8814117
mean      5.31429 (x_bar)   79.4286 (y_bar)
slope     -5.2071
intercept 107.101
SST       953.714
SSR       706.834
SSE       246.881
SSR+SSE   953.715 (equals SST up to rounding)
R-square  0.74114
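
The whole example can be reproduced with a short script. This is a sketch using
only the data and formulas above; the variable names are chosen here for
illustration. Full-precision results may differ from the table in the third
decimal place, because the y_hat column appears to have been computed from
rounded coefficients.

```python
# Survey data from the table: x = income (thousand $),
# y = percent of income left after spending on food.
xs = [6.5, 4.0, 2.5, 7.2, 8.1, 3.4, 5.5]
ys = [81.0, 96.0, 93.0, 68.0, 63.0, 84.0, 71.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Slope and intercept from the least squares formulas.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Fitted values and the three sums of squares.
y_hats = [intercept + slope * x for x in xs]
sst = sum((y - y_bar) ** 2 for y in ys)
ssr = sum((y_hat - y_bar) ** 2 for y_hat in y_hats)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
r_square = ssr / sst

print(f"slope={slope:.4f} intercept={intercept:.3f}")
print(f"SST={sst:.3f} SSR={ssr:.3f} SSE={sse:.3f} R-square={r_square:.5f}")
```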
