
Chapter 4

Numerical Descriptive Techniques
4.2 Measures of Central Location
 Usually, we focus our attention on two types of
  measures when describing population
  characteristics:
      Central location (e.g. average)
      Variability or spread

        The measure of central location
        reflects the locations of all the actual
        data points.
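Statistical measures used to describe the characteristics of data:

 1. Central location
 2. Variability
Measuring central location

These measures indicate the center of the data distribution, or the
common tendency of the data. Three measures are mainly used to
describe central location:
   1. Mean
   2. Median
   3. Mode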
How does a measure of central location reflect the locations of all the
actual data points?
    • With one data point, clearly the central location is at the point
      itself.
    • With two data points, the central location should fall in the middle
      between them (in order to reflect the location of both of them).
    • But if a third data point appears on the left-hand side of the
      midrange, it should "pull" the central location to the left.
The Arithmetic Mean

 This is the most popular and useful measure of
  central location

       \text{Mean} = \frac{\text{Sum of the observations}}{\text{Number of observations}}
The Arithmetic Mean

   Sample mean (n is the sample size):
       \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

   Population mean (N is the population size):
       \mu = \frac{\sum_{i=1}^{N} x_i}{N}

   • Example 4.1
      The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33,
      14, 8, 0, 9, 22 hours. Find the mean time on the Internet.

           \bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{0 + 7 + \dots + 22}{10} = 11.0
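The calculation in Example 4.1 can be reproduced with a few lines of Python. This is only a sketch; the names `times` and `arithmetic_mean` are illustrative and do not come from the slides.

```python
# Internet-use times (hours) reported by the 10 adults in Example 4.1.
times = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]


def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


mean_time = arithmetic_mean(times)
print(mean_time)  # 11.0
```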
    • Example 4.2
       Suppose the telephone bills of Example 2.1 represent
       the population of measurements. The population mean is

           \mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{42.19 + 38.45 + \dots + 45.77}{200} = 43.59
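Mean (arithmetic mean)

1. The sum of all observations divided by the number of observations.
2. The arithmetic mean is the averaging point (center of balance) of the data.
3. Advantage: it uses every data value.
   Disadvantage: it is easily affected by extreme values.
Example:
Boss Kuo: "Ms. Lin (the accountant), please work out the average monthly
salary of all the employees in our company and let me know. Thank you!"
Ms. Lin answered with a smile: "One moment, please. I will work it out."
(Half an hour later) Ms. Lin: "Sir, the average monthly salary in our
company is NT$35,660."
Boss Kuo: "Very good. Businesses are so hard to run these days, and yet
our company pays such good salaries. That is quite something. Work hard,
everyone, and the company will treat you well!"
Ms. Lin kept smiling, but thought to herself: "Like hell. The one who
should work hard is you, and the only one the company treats well is you."
An average monthly salary of NT$35,660 is not bad for a small company, so
why is Ms. Lin unhappy? She has worked as the accountant for 3 years, yet
her salary is only NT$22,500. It turns out that the salaries of the
company's fifteen employees are as follows:
14,500; 15,000; 16,000; 16,500; 17,000; 17,900;
18,500; 19,000; 21,000; 22,500; 25,000; 30,000; 35,000;
250,000 (Boss Kuo)
The Median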
  The Median of a set of observations is the value that
   falls in the middle when the observations are arranged
   in order of magnitude.

 Example 4.3
 Find the median of the time on the internet for the 10 adults of Example 4.1.

 Even number of observations:
      0, 0, 5, 7, 8, 9, 12, 14, 22, 33
      The median is the average of the two middle observations: (8 + 9)/2 = 8.5

 Comment
 Suppose only 9 adults were sampled (exclude, say, the longest time (33)).

 Odd number of observations:
      0, 0, 5, 7, 8, 9, 12, 14, 22
      The median is the middle observation: 8
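The even and odd cases of Example 4.3 can be sketched in Python as follows. The function name `median` and the variable names are illustrative choices, not part of the slides.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: mean of the two middle ones


times = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

median_10 = median(times)                          # all 10 adults
median_9 = median([t for t in times if t != 33])   # longest time excluded

print(median_10, median_9)  # 8.5 8
```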
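Median

After the collected data are arranged in order, the value in the middle
of the sequence is the median.
 (1) N odd: the median is at position (N+1)/2 in the sequence.
 (2) N even: take the average of the two middle values.

At least half (50%) of the observations are greater than or equal to the
median, and at least half (50%) are less than or equal to it.
It is not affected by extreme values, but it is harder to use for
statistical inference.
The Mode
 • The Mode of a set of observations is the value that
   occurs most frequently.
 • A set of data may have one mode (or modal class), or two
   or more modes.
 • The modal class: for large data sets the modal class is
   much more relevant than a single-value mode.
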
    Example 4.5
     Find the mode for the data in Example 4.1. Here are the
     data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22

       Solution

            • All observations except "0" occur once. There are two "0"s. Thus,
              the mode is zero.
            • Is this a good measure of central location?
            • The value "0" does not reside at the center of this set
              (compare with the mean = 11.0 and the median = 8.5).
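A small Python sketch of the mode for Example 4.5 follows. It returns every value tied for the highest frequency, since a data set may have more than one mode; the function name `modes` is an illustrative assumption.

```python
from collections import Counter


def modes(values):
    """Return all values that occur most frequently, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


times = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
print(modes(times))            # [0]
print(modes([1, 1, 2, 2, 3]))  # [1, 2]  (two modes)
```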
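Mode

The value that occurs most frequently among the observations.
It is not affected by extreme values; there may be several modes or
none; it is insensitive to changes in the number of observations or in
their values.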
Relationship among Mean, Median, and Mode
 • If a distribution is symmetrical, the mean, median
   and mode coincide.
 • If a distribution is asymmetrical, and skewed
   to the left or to the right, the three measures
   differ.
     A positively skewed distribution ("skewed to the right"):
         Mode < Median < Mean
     A negatively skewed distribution ("skewed to the left"):
         Mean < Median < Mode
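Comparing and choosing among the measures of central location:

1. Nominal (categorical) scale: mode
2. Ordinal scale: mode, median
3. Interval scale: mean, median, and mode can all be used
4. When a single measure cannot describe the data clearly, or cannot
   distinguish between data sets, several measures can be used together.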
 4.3 Measures of variability
 Measures of central location fail to tell the whole story
  about the distribution.
 A question of interest still remains unanswered:

     How much are the observations spread out
     around the mean value?
Observe two hypothetical data sets:

 Small variability
   The average value provides a good representation of the
   observations in the data set.

 Larger variability
   The same average value does not provide as good a representation
   of the observations in the data set as before.
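The mean, median, and mode describe the central location of the data.
If two data sets have the same central location, how do we compare them?
Answer: compare how much the two data sets differ in dispersion.
Comparing dispersion is sometimes more important than comparing
central location (mean).
Computing dispersion, or variability: dispersion is measured around a
center (the mean, median, or mode); usually the mean is used as the
center when measuring how spread out the observations are.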
Measures of dispersion

1. Range
2. Variance
3. Standard deviation
4. Coefficient of variation (CV)
The range
 • The range of a set of observations is the difference
   between the largest and smallest observations:
       Range = Largest observation - Smallest observation
 • Its major advantage is the ease with which it can be
   computed.
 • Its major shortcoming is its failure to provide
   information on the dispersion of the observations
   between the two end points.
 • But, how do all the observations spread out?
   The range cannot assist in answering this question.
Range

1. R = maximum - minimum
2. It measures overall dispersion by the difference between the two
   ends of the data.
3. In general, the larger R is, the greater the dispersion. However, it
   considers only the largest and smallest observations and not all of
   the observations, so it cannot accurately reflect or describe the
   data as a whole.
The Variance
 • This measure reflects the dispersion of all the
   observations.
 • The variance of a population of size N, x_1, x_2, …, x_N,
   whose mean is \mu, is defined as
       \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}
 • The variance of a sample of n observations
   x_1, x_2, …, x_n, whose mean is \bar{x}, is defined as
       s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
Why not use the sum of deviations?
       Consider two small populations:
           A: 8, 9, 10, 11, 12
           B: 4, 7, 10, 13, 16
       The mean of both populations is 10, but measurements in B are
       more dispersed than those in A. A measure of dispersion should
       agree with this observation.

       Can the sum of deviations be a good measure of dispersion?
           A: 8-10 = -2,  9-10 = -1,  11-10 = +1,  12-10 = +2    Sum = 0
           B: 4-10 = -6,  7-10 = -3,  13-10 = +3,  16-10 = +6    Sum = 0
       The sum of deviations is zero for both populations; therefore,
       it is not a good measure of dispersion.
The Variance
 Let us calculate the variance of the two populations:

     \sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2

     \sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18
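A short Python sketch can confirm both points made for populations A and B: the deviations from the mean sum to zero in each case, while the population variance separates the two sets. The function names are illustrative assumptions.

```python
def sum_of_deviations(values):
    """Sum of (x_i - mean); always zero up to rounding."""
    mu = sum(values) / len(values)
    return sum(x - mu for x in values)


def population_variance(values):
    """Average squared deviation from the population mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


A = [8, 9, 10, 11, 12]
B = [4, 7, 10, 13, 16]

print(sum_of_deviations(A), sum_of_deviations(B))      # 0.0 0.0
print(population_variance(A), population_variance(B))  # 2.0 18.0
```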
Why is the variance defined as the average squared deviation?
Why not use the sum of squared deviations as a measure of
variation instead?
After all, the sum of squared deviations increases in
magnitude when the variation of a data set increases!
The Variance

Let us calculate the sum of squared deviations for both data sets.
Which data set has a larger dispersion?

     Data set A: observations at 1 and 3 (mean 2)
     Data set B: observations 1 and 5 (mean 3)
     Data set B is more dispersed around the mean.

     SumA = (1-2)^2 + … + (1-2)^2 + (3-2)^2 + … + (3-2)^2 = 10
     SumB = (1-3)^2 + (5-3)^2 = 8

     SumA > SumB. This is inconsistent with the
     observation that set B is more dispersed.
The Variance
     However, when calculated on a "per observation"
     basis (variance), the data set dispersions are
     properly ranked.

         \sigma_A^2 = SumA/N = 10/10 = 1   (the ten squared deviations in SumA come from N = 10 observations)
         \sigma_B^2 = SumB/N = 8/2 = 4
The Variance
 Example 4.7
   The following sample consists of the number of jobs
   six students applied for: 17, 15, 23, 7, 9, 13. Find its
   mean and variance.
 Solution

   \bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14 \text{ jobs}

   s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
       = \frac{1}{6 - 1}\left[(17 - 14)^2 + (15 - 14)^2 + \dots + (13 - 14)^2\right]
       = 33.2 \text{ jobs}^2
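The definition-based calculation in Example 4.7 can be checked with the sketch below. The helper `sample_variance` is an illustrative name; Python's standard `statistics.variance` is used only as a cross-check, since it also divides by n - 1.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


jobs = [17, 15, 23, 7, 9, 13]

x_bar = sum(jobs) / len(jobs)
s2 = sample_variance(jobs)

print(x_bar)         # 14.0
print(round(s2, 1))  # 33.2
# Cross-check against the standard library's sample variance.
print(math.isclose(s2, statistics.variance(jobs)))  # True
```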
The Variance – Shortcut method

   s^2 = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]

       = \frac{1}{6 - 1}\left[\left(17^2 + 15^2 + \dots + 13^2\right) - \frac{(17 + 15 + \dots + 13)^2}{6}\right]

       = 33.2 \text{ jobs}^2
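The shortcut formula needs only two running totals, the sum of the observations and the sum of their squares. A minimal sketch for the jobs data follows; the function name `shortcut_sample_variance` is hypothetical.

```python
import math


def shortcut_sample_variance(values):
    """Sample variance from the sum of x and the sum of x squared."""
    n = len(values)
    total = sum(values)                        # sum of x_i
    total_squares = sum(x * x for x in values)  # sum of x_i^2
    return (total_squares - total ** 2 / n) / (n - 1)


jobs = [17, 15, 23, 7, 9, 13]
total = sum(jobs)
total_squares = sum(x * x for x in jobs)
s2_shortcut = shortcut_sample_variance(jobs)

print(total, total_squares)   # 84 1342
print(round(s2_shortcut, 1))  # 33.2
```

The result matches the definition-based value of 33.2 jobs^2, as the slide states.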
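Variance

1. The variance is always ≥ 0; a variance of zero means that all the
   observations are identical.
2. It is well suited to statistical inference.
3. The unit of the variance is the square of the unit of the
   observations (a compound unit), which carries no direct statistical
   meaning and is hard to interpret.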
Standard Deviation (SD)
 • The standard deviation of a set of observations is
   the square root of the variance.

     Sample standard deviation:      s = \sqrt{s^2}

     Population standard deviation:  \sigma = \sqrt{\sigma^2}
Standard Deviation
 Example 4.8
     To examine the consistency of shots for a new
      innovative golf club, a golfer was asked to hit 150
      shots, 75 with a currently used (7-iron) club, and 75
      with the new club.
     The distances were recorded.
     Which 7-iron is more consistent?
 Standard Deviation
 Example 4.8 – solution
 Excel printout, from the "Descriptive Statistics" sub-menu:

                          Current       Innovation
   Mean                   150.5467      150.1467
   Standard Error         0.668815      0.357011
   Median                 151           150
   Mode                   150           149
   Standard Deviation     5.792104      3.091808
   Sample Variance        33.54847      9.559279
   Kurtosis               0.12674       -0.88542
   Skewness               -0.42989      0.177338
   Range                  28            12
   Minimum                134           144
   Maximum                162           156
   Sum                    11291         11261
   Count                  75            75

 The innovation club is more consistent, and because the means are
 close, is considered a better club.
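Standard deviation

1. The standard deviation is the square root of the variance.
   Because the variance is expressed in squared (compound) units and
   is hard to interpret, its square root is taken to remove this
   drawback; the result is called the standard deviation.
2. The unit of the standard deviation is the same as that of the
   original data.
3. The variance and the standard deviation are good measures of the
   dispersion of data, and the most commonly used ones.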
Interpreting Standard Deviation
 • The standard deviation can be used to
     • compare the variability of several distributions
     • make a statement about the general shape of a
       distribution.
 • The empirical rule: If a sample of observations has a
   mound-shaped distribution, the interval
     (\bar{x} - s, \bar{x} + s) contains approximately 68% of the measurements
     (\bar{x} - 2s, \bar{x} + 2s) contains approximately 95% of the measurements
     (\bar{x} - 3s, \bar{x} + 3s) contains approximately 99.7% of the measurements
Interpreting Standard Deviation
 Example 4.9
  A statistics practitioner wants to describe the
  way returns on investment are distributed.
      The mean return = 10%
      The standard deviation of the return = 8%
      The histogram is bell shaped.
Interpreting Standard Deviation
Example 4.9 – solution
 The empirical rule can be applied (bell shaped histogram)
 Describing the return distribution
      Approximately 68% of the returns lie between 2% and 18%
                                                   [10 – 1(8), 10 + 1(8)]
      Approximately 95% of the returns lie between -6% and 26%
                                                   [10 – 2(8), 10 + 2(8)]
      Approximately 99.7% of the returns lie between -14% and 34%
                                                    [10 – 3(8), 10 + 3(8)]
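The three intervals of Example 4.9 follow mechanically from the mean and the standard deviation. The sketch below assumes, as the example does, that the histogram is bell shaped; the function name and the dictionary of approximate coverages are illustrative.

```python
def empirical_rule_intervals(mean, sd):
    """Intervals mean ± k*sd for k = 1, 2, 3 with their approximate coverage (%)."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}  # valid only for mound-shaped data
    return [(mean - k * sd, mean + k * sd, pct) for k, pct in coverage.items()]


intervals = empirical_rule_intervals(mean=10, sd=8)  # returns in percent
for low, high, pct in intervals:
    print(f"about {pct}% of returns lie between {low}% and {high}%")
# about 68.0% of returns lie between 2% and 18%
# about 95.0% of returns lie between -6% and 26%
# about 99.7% of returns lie between -14% and 34%
```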
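The empirical rule

If the data follow a normal (bell-shaped) distribution:
1. About 68% of the data fall within one standard deviation of the mean.
2. About 95% of the data fall within two standard deviations of the mean.
3. About 99.7% of the data fall within three standard deviations of the mean.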
The Coefficient of Variation (CV)

 • The coefficient of variation of a set of measurements is
   the standard deviation divided by the mean value.

     Sample coefficient of variation:      cv = \frac{s}{\bar{x}}

     Population coefficient of variation:  CV = \frac{\sigma}{\mu}

 • This coefficient provides a proportionate measure of
   variation.
     A standard deviation of 10 may be perceived as
     large when the mean value is 100, but only
     moderately large when the mean value is 500.
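The coefficient of variation (CV), a measure of relative dispersion

   CV = standard deviation / mean

Dividing the standard deviation by the mean expresses variation in
relative terms.

Measures of dispersion such as the range, the variance, and the standard
deviation measure only the absolute dispersion of the data.
When the dispersion of two data sets is to be compared, the variance and
the standard deviation are affected by differences in the size of the
means and in the units of measurement.
Suppose that
Company A's operating revenue in year 83 had a mean of 3,371 and a
 standard deviation of 383 (both in units of 10,000 dollars). Its
 coefficient of variation is:

     CV = \frac{383}{3371} = 0.1136

Company B's operating revenue in year 83 had a mean of 6,000 and a
 standard deviation of 400 (same units).
Which company's revenue is relatively more stable?
 Company B's standard deviation is larger, but its mean revenue of
 6,000 is much larger than Company A's; the two companies clearly differ
 in scale. To compare the relative dispersion of their revenue we must
 therefore use the coefficient of variation. Company B's coefficient of
 variation is 400/6000 = 0.0667, which is smaller than Company A's.
 Hence Company B's operating revenue is relatively less dispersed: over
 the 12 months of year 83 its revenue was more stable, with less
 variation, than Company A's.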
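The comparison of the two companies can be reproduced in a few lines. This is a sketch using the summary figures quoted in the example; the function name is an illustrative assumption.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a proportion of the mean."""
    return sd / mean


cv_a = coefficient_of_variation(sd=383, mean=3371)  # Company A
cv_b = coefficient_of_variation(sd=400, mean=6000)  # Company B

print(round(cv_a, 4), round(cv_b, 4))  # 0.1136 0.0667
# The company with the smaller CV is relatively more stable.
print("B" if cv_b < cv_a else "A")     # B
```

Company B has the larger standard deviation but the smaller coefficient of variation, which is the point of the example.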
4.4 Measures of Relative Standing
    and Box Plots
 • Percentile
     • The pth percentile of a set of measurements is the
       value for which
       • p percent of the observations are less than that value
       • (100 - p) percent of all the observations are greater than
         that value.
     • Example
       • Suppose your score is the 60th percentile of a SAT test.
         Then 60% of all the scores lie below your score and 40% lie
         above it.
Quartiles
 Commonly used percentiles
     First (lower)decile           = 10th percentile
     First (lower) quartile, Q1,   = 25th percentile
     Second (middle)quartile,Q2,   = 50th percentile
     Third quartile, Q3,           = 75th percentile
     Ninth (upper)decile           = 90th percentile
Quartiles

 Example
 Find the quartiles of the following set of
 measurements 7, 8, 12, 17, 29, 18, 4, 27, 30, 2,
 4, 10, 21, 5, 8
     Solution
       Sort the observations:
       2, 4, 4, 5, 7, 8, 10, 12, 17, 18, 18, 21, 27, 29, 30   (15 observations)

       The first quartile
       • At most (.25)(15) = 3.75 observations should appear below the
         first quartile. Check off the first 3 observations on the
         left-hand side.
       • At most (.75)(15) = 11.25 observations should appear above the
         first quartile. Check off 11 observations on the right-hand side.
       • The one observation left unchecked, 5, is the first quartile.

       Comment: If the number of observations is even, two observations
       remain unchecked. In this case choose the midpoint between these
       two observations.
Location of Percentiles
 • Find the location of any percentile using the formula

        L_P = (n + 1)\frac{P}{100}

   where L_P is the location of the Pth percentile.

 Example 4.11
     Calculate the 25th, 50th, and 75th percentile of the data in
     Example 4.1
   Location of Percentiles

 Example 4.11 – solution
     After sorting the data we have 0, 0, 5, 7, 8, 9, 12, 14, 22, 33.

          L_{25} = (10 + 1)\frac{25}{100} = 2.75

     The 2.75th location lies three quarters of the way from location 2
     (value 0) to location 3 (value 5). It translates to the value
     0 + (.75)(5 – 0) = 3.75.

          L_{50} = (10 + 1)\frac{50}{100} = 5.5

     The 50th percentile is halfway between the fifth
     and sixth observations (in the middle between 8
     and 9), that is 8.5.

          L_{75} = (10 + 1)\frac{75}{100} = 8.25

     The 75th percentile is one quarter of the distance
     between the eighth and ninth observations, that is
     14 + .25(22 – 14) = 16.
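The location formula and the interpolation step of Example 4.11 can be sketched as follows. The function name `percentile` and the handling of locations that fall before the first or after the last observation are assumptions added here; the slides do not cover those edge cases. Statistical packages often use other interpolation conventions, so their results may differ slightly.

```python
def percentile(values, p):
    """P-th percentile using the location L_P = (n + 1) * P / 100."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) * p / 100
    whole = int(location)        # observation just below the location
    fraction = location - whole  # how far to move toward the next observation
    if whole < 1:                # assumption: clamp to the smallest value
        return ordered[0]
    if whole >= n:               # assumption: clamp to the largest value
        return ordered[-1]
    below, above = ordered[whole - 1], ordered[whole]
    return below + fraction * (above - below)


times = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
p25, p50, p75 = (percentile(times, p) for p in (25, 50, 75))
print(p25, p50, p75)  # 3.75 8.5 16.0
```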
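Quartiles and Variability
 • Quartiles can provide an idea about the shape of
   a histogram.
     Positively skewed histogram: Q1 and Q2 lie close together and Q3
     lies far to the right (Q3 – Q2 is larger than Q2 – Q1).
     Negatively skewed histogram: Q1 lies far to the left and Q2 and Q3
     lie close together (Q2 – Q1 is larger than Q3 – Q2).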
Interquartile Range
 This is a measure of the spread of the middle
  50% of the observations
 Large value indicates a large spread of the
  observations

           Interquartile range = Q3 – Q1
Box Plot
     This is a pictorial display that provides the main
      descriptive measures of the data set:
        •   L - the largest observation
        •   Q3 - The upper quartile
        •   Q2 - The median
        •   Q1 - The lower quartile
        •   S - The smallest observation
      [Figure: a box from Q1 to Q3 with a line at Q2; whiskers extend from
      the box toward S and L, each reaching at most 1.5(Q3 – Q1) beyond
      the box.]
  Box Plot
   Example 4.14 (Xm02-01)
     Bills: 42.19, 38.45, 29.23, 89.35, 118.04, 110.46, ...

     Smallest = 0
     Q1 = 9.275
     Median = 26.905
     Q3 = 84.9425
     Largest = 119.63
     IQR = 75.6675

     Left hand boundary  = 9.275 – 1.5(IQR) = -104.226
     Right hand boundary = 84.9425 + 1.5(IQR) = 198.4438

     Outliers = (). No outliers are found.

     [Figure: box plot on an axis marked -104.226, 0, 9.275, 26.905,
     84.9425, 119.63, 198.4438]
Box Plot
 • Additional Example - GMAT scores
   Create a box plot for the data regarding the GMAT scores of
   200 applicants (see GMAT.XLS)
     GMAT: 512, 531, 461, 515, ...

     Smallest = 449
     Q1 = 512
     Median = 537
     Q3 = 575
     Largest = 788
     IQR = 63

     Left hand boundary  = 512 – 1.5(IQR) = 417.5
     Right hand boundary = 575 + 1.5(IQR) = 669.5

     Outliers = (788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675)

     [Figure: box plot on an axis marked 417.5, 449, 512, 537, 575,
     669.5, 788]
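The boundaries used in both box plot examples come from the same rule: anything more than 1.5 interquartile ranges outside the box is flagged as an outlier. A minimal sketch follows; the function name `outlier_fences` and the short `scores` list (a handful of values taken from the summary above, not the full data set) are illustrative.

```python
import math


def outlier_fences(q1, q3, k=1.5):
    """Lower and upper boundaries: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# GMAT example: Q1 = 512, Q3 = 575
gmat_low, gmat_high = outlier_fences(512, 575)
print(gmat_low, gmat_high)  # 417.5 669.5

# Telephone bills example: Q1 = 9.275, Q3 = 84.9425
bill_low, bill_high = outlier_fences(9.275, 84.9425)
print(round(bill_low, 3), round(bill_high, 4))  # -104.226 198.4438

# Flag values outside the GMAT fences.
scores = [449, 512, 537, 575, 675, 788]
outliers = [s for s in scores if s < gmat_low or s > gmat_high]
print(outliers)  # [675, 788]
```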
Box Plot
GMAT - continued

      [Figure: box plot with S = 449, Q1 = 512, Q2 = 537, Q3 = 575 and
      right hand boundary 669.5; 25% of the scores lie below Q1, 50%
      between Q1 and Q3, and 25% above Q3.]

      Interpreting the box plot results
        • The scores range from 449 to 788.
        • About half the scores are smaller than 537, and about half are
          larger than 537.
        • About half the scores lie between 512 and 575.
        • About a quarter lie below 512 and a quarter above 575.
        • The histogram is positively skewed.
Box Plot
 Example 4.15 (Xm04-15)
   A study was organized to compare the quality of
   service in 5 drive through restaurants.
   Interpret the results.
 Example 4.15 – solution
   Minitab box plot

   [Figure: box plots of service time (C6, axis from 100 to 300) by
   restaurant (C7): 1 Popeyes, 2 Wendy's, 3 McDonalds, 4 Hardee's,
   5 Jack in the Box. Figure annotations: "Times are symmetric" (upper
   part of the plot) and "Times are positively skewed" (lower part of
   the plot).]

   • Jack in the Box is the slowest in service.
   • Hardee's service time variability is the largest.
   • Wendy's service time appears to be the shortest and most consistent.
4.5 Measures of Linear Relationship

 The covariance and the coefficient of correlation
  are used to measure the direction and strength
  of the linear relationship between two variables.
      Covariance - is there any pattern to the way two
       variables move together?
      Coefficient of correlation - how strong is the linear
       relationship between two variables
 Covariance

   Population covariance:
       COV(X, Y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}
   \mu_x (\mu_y) is the population mean of the variable X (Y).
   N is the population size.

   Sample covariance:
       cov(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
   \bar{x} (\bar{y}) is the sample mean of the variable X (Y).
   n is the sample size.
Covariance

 Compare the following three sets

 Set 1
  xi    yi      (x – x̄)   (y – ȳ)   (x – x̄)(y – ȳ)
  2     13      -3         -7         21
  6     20       1          0          0
  7     27       2          7         14
  x̄ = 5, ȳ = 20                      cov(x, y) = 17.5

 Set 2
  xi    yi      (x – x̄)   (y – ȳ)   (x – x̄)(y – ȳ)
  2     27      -3          7        -21
  6     20       1          0          0
  7     13       2         -7        -14
  x̄ = 5, ȳ = 20                      cov(x, y) = -17.5

 Set 3
  xi    yi
  2     20
  6     27
  7     13
  x̄ = 5, ȳ = 20                      cov(x, y) = -3.5
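The three covariances in the table can be verified with a direct implementation of the sample covariance formula. This is a sketch; `sample_covariance` is an illustrative name.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


x = [2, 6, 7]
cov_1 = sample_covariance(x, [13, 20, 27])  # y rises with x
cov_2 = sample_covariance(x, [27, 20, 13])  # y falls as x rises
cov_3 = sample_covariance(x, [20, 27, 13])  # no clear pattern

print(cov_1, cov_2, cov_3)  # 17.5 -17.5 -3.5
```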
Covariance
 If the two variables move in the same direction,
  (both increase or both decrease), the covariance
  is a large positive number.
 If the two variables move in opposite directions,
  (one increases when the other one decreases),
  the covariance is a large negative number.
 If the two variables are unrelated, the covariance
  will be close to zero.
The coefficient of correlation
   Population coefficient of correlation:
       \rho = \frac{COV(X, Y)}{\sigma_x \sigma_y}
   Sample coefficient of correlation:
       r = \frac{cov(x, y)}{s_x s_y}
 • This coefficient answers the question: How strong is
   the association between X and Y?
The coefficient of correlation

   \rho or r = +1   Strong positive linear relationship   (COV(X,Y) > 0)
   \rho or r =  0   No linear relationship                (COV(X,Y) = 0)
   \rho or r = -1   Strong negative linear relationship   (COV(X,Y) < 0)
 The coefficient of correlation

 If the two variables are very strongly positively
  related, the coefficient value is close to +1
  (strong positive linear relationship).
 If the two variables are very strongly negatively
  related, the coefficient value is close to -1
  (strong negative linear relationship).
 No straight line relationship is indicated by a
  coefficient close to zero.
The coefficient of correlation and the
covariance – Example 4.16

 Compute the covariance and the coefficient of
  correlation to measure how GMAT scores and
  GPA in an MBA program are related to one
  another.
 Solution
      We believe GMAT affects GPA. Thus
        • GMAT is labeled X
        • GPA is labeled Y
 The coefficient of correlation and the
 covariance – Example 4.16

 Student    x        y        x^2          y^2       xy
 1          599      9.6      358,801      92.16     5,750.4
 2          689      8.8      474,721      77.44     6,063.2
 3          584      7.4      341,056      54.76     4,321.6
 4          631      10       398,161      100       6,310
 ...
 11         593      8.8      351,649      77.44     5,218.4
 12         683      8        466,489      64        5,464
 Total      7,587    106.4    4,817,755    957.2     67,559.2

 Shortcut formulas
     cov(x, y) = \frac{1}{n - 1}\left[\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}\right]
     s^2 = \frac{1}{n - 1}\left[\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right]

 cov(x, y) = (1/(12 – 1))[67,559.2 – (7,587)(106.4)/12] = 26.16
 s_x = {(1/(12 – 1))[4,817,755 – (7,587)^2/12]}^.5 = 43.56
 s_y = 1.12 (calculated in the same way as s_x)
 r = cov(x, y)/(s_x s_y) = 26.16/[(43.56)(1.12)] = .5362
The coefficient of correlation and the
covariance – Example 4.16 – Excel
 • Use the Covariance option in Data Analysis.
 • If your version of Excel returns the population covariance and
   variances, multiply each one by n/(n – 1) to obtain the
   corresponding sample values.
 • Use the Correlation option to produce the correlation matrix.
   Variance-Covariance Matrix

     Population values              Sample values (population values × 12/(12 – 1))
              GPA      GMAT                  GPA      GMAT
     GPA      1.15                  GPA      1.25
     GMAT     23.98    1739.52      GMAT     26.16    1897.66
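The conversion from population values to sample values is a single multiplication by n/(n – 1). The sketch below applies it to the three entries of the matrix; the dictionary keys and the helper name `population_to_sample` are illustrative.

```python
def population_to_sample(value, n):
    """Rescale a population variance or covariance to its sample counterpart."""
    return value * n / (n - 1)


n = 12
population_values = {
    "var(GPA)": 1.15,
    "cov(GPA, GMAT)": 23.98,
    "var(GMAT)": 1739.52,
}
sample_values = {
    name: round(population_to_sample(value, n), 2)
    for name, value in population_values.items()
}
print(sample_values)
# {'var(GPA)': 1.25, 'cov(GPA, GMAT)': 26.16, 'var(GMAT)': 1897.66}
```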
The coefficient of correlation and the
covariance – Example 4.16 – Excel
 Interpretation
      The covariance (26.16) indicates that GMAT score
       and performance in the MBA program are positively
       related.
      The coefficient of correlation (.5365) indicates that
       there is a moderately strong positive linear
       relationship between GMAT and MBA GPA.
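The covariance and the coefficient of correlation for Example 4.16 can be recomputed from the column totals alone, using the shortcut formulas. The sketch below takes the totals as given in the example (n = 12, Σx = 7,587, Σy = 106.4, Σx² = 4,817,755, Σy² = 957.2, Σxy = 67,559.2); the variable names are illustrative. Carrying full precision gives r ≈ .5365, while rounding the intermediate values first gives .5362.

```python
import math

n = 12
sum_x, sum_y = 7587, 106.4
sum_x2, sum_y2 = 4_817_755, 957.2
sum_xy = 67_559.2

# Shortcut formulas for the sample covariance and sample variances.
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)
var_y = (sum_y2 - sum_y ** 2 / n) / (n - 1)

s_x, s_y = math.sqrt(var_x), math.sqrt(var_y)
r = cov_xy / (s_x * s_y)

print(round(cov_xy, 2))                # 26.16
print(round(s_x, 2), round(s_y, 2))    # 43.56 1.12
print(round(r, 4))                     # 0.5365
```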
The Least Squares Method
 • We are seeking a line that best fits the data when two
   variables are (presumably) related to one another.
 • We define "best fit line" as a line for which the sum of
   squared differences between it and the data points is
   minimized:

       \text{Minimize } \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

   where y_i is the actual y value of point i, and \hat{y}_i is the
   y value of point i calculated from the equation
   \hat{y}_i = b_0 + b_1 x_i.
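The Least Squares Method
   [Figure: scatter of points in the X–Y plane with candidate lines; the
   vertical distances between the points and a line are its errors.]
   Different lines generate different errors,
   and thus different sums of squared errors.
   There is a line that minimizes the sum of squared errors.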
The Least Squares Method

The coefficients b_0 and b_1 of the line that minimizes the
sum of squares of errors are calculated from the data:

     b_1 = \frac{cov(x, y)}{s_x^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}

     b_0 = \bar{y} - b_1 \bar{x}

     where \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} and \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
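The Least Squares Method

 Example 4.17
   Find the least squares line for Example 4.16 (Xm04-16.xls)

     b_1 = \frac{cov(x, y)}{s_x^2} = \frac{26.16}{1897.66} = .0138

     \bar{x} = \frac{\sum x_i}{n} = \frac{7,587}{12} = 632.25

     \bar{y} = \frac{\sum y_i}{n} = \frac{106.4}{12} = 8.87

     b_0 = \bar{y} - b_1 \bar{x} = 8.87 - (.0138)(632.25) = .145

   [Figure: Scatter Diagram with the fitted line y = 0.1496 + 0.0138x;
   x axis from 500 to 800, y axis from 6 to 12. The intercept 0.1496
   differs from .145 because it is computed from unrounded values.]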
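The least squares coefficients of Example 4.17 can be computed from the column totals of Example 4.16 without rounding the intermediate values. This is a sketch under that assumption; the variable names and the `predict` helper are illustrative. At full precision the intercept comes out at about 0.1496, which agrees with the equation shown on the scatter diagram rather than with the hand-rounded .145.

```python
n = 12
sum_x, sum_y = 7587, 106.4
sum_x2 = 4_817_755
sum_xy = 67_559.2

# Sample covariance and sample variance of x via the shortcut formulas.
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)

x_bar = sum_x / n
y_bar = sum_y / n

b1 = cov_xy / var_x      # slope
b0 = y_bar - b1 * x_bar  # intercept


def predict(x):
    """Fitted y value on the least squares line for a given x."""
    return b0 + b1 * x


print(round(b1, 4), round(b0, 4))  # 0.0138 0.1496
print(round(predict(x_bar), 2))    # 8.87 (the line passes through the means)
```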

								