Frequency distributions:
Testing of goodness of fit and contingency tables

       Chi-square statistics
• Widely used for the analysis of nominal data
• Introduced by Karl Pearson in 1900
• Its theory and application were expanded by Pearson and R. A. Fisher
• This lecture covers the Chi-square test, the G test, and the Kolmogorov-Smirnov goodness-of-fit test for continuous data
                   The χ² test:
  χ² = Σ (observed freq. - expected freq.)² / expected freq.

• Obtain a sample of nominal scale data and infer whether the population conforms to a certain theoretical distribution, e.g. a genetic study
• Test Ho that the observations (not the variables) are independent of each other in the population
• Based on the differences between the actual observed frequencies (not %) and the expected frequencies
                   The 2 test:
  2 =  (observed freq. - expected freq.)2/ expected freq.


• As a measure of how far a sample distribution
  deviates from a theoretical distribution
• Ho: no difference between the observed and
  expected frequency (HA: they are different)
• If Ho is true: the difference and Chi-square
   SMALL
• If Ho is false: both measurements  Large
For Questionnaire

                    Example (1)
    • In a questionnaire, 259 adults were asked
      what they thought about cutting air pollution
      by increasing tax on vehicle fuel.
    • 113 people agreed with this idea but the rest
      disagreed.
    • Perform a Chi-square test to determine the probability of the results being obtained by chance.
For Questionnaire
                       Agree              Disagree
       Observed        113                259 - 113 = 146
       Expected        259/2 = 129.5      259/2 = 129.5

       Ho: Observed = Expected
       χ² = (113 - 129.5)²/129.5 + (146 - 129.5)²/129.5
          = 2.102 + 2.102 = 4.204
       df = k - 1 = 2 - 1 = 1

       From the Chi-square table (Table B1 in Zar's book):
       χ²(α = 0.05, df = 1) = 3.841; for χ² = 4.204, 0.025 < p < 0.05

       Therefore, reject Ho. The probability of the results being obtained by chance is between 0.025 and 0.05.
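       The same arithmetic can be reproduced with a minimal Python sketch, assuming scipy is available (observed counts and the 50:50 expectation as on the slide):

       from scipy.stats import chisquare

       observed = [113, 146]          # agree, disagree
       expected = [259 / 2, 259 / 2]  # Ho: half agree, half disagree

       stat, p = chisquare(observed, f_exp=expected)
       print(stat, p)                 # chi-square about 4.20, p between 0.025 and 0.05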
For Genetics

                   Practical (1)
    • Calculate the Chi-square for data consisting of 100 flowers against a hypothesized color ratio of 3:1 (red:green) and test Ho
    • Ho: the sample data come from a population
      having a 3:1 ratio of red to green flowers
    • Observation: 84 red and 16 green
    • Expected frequency for 100 flowers:
        – 75 red and 25 green
                                     Please Do it Now
For Genetics

                   Practical (2)
    • Calculate the Chi-square for data consisting of 100 flowers against a hypothesized color ratio of 3:1 (red:green) and test Ho
    • Ho: the sample data come from a population
      having a 3:1 ratio of red to green flowers
    • Observation: 67 red and 33 green
    • Expected frequency for 100 flowers:
        – 75 red and 25 green
                                     Please Do it Now
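    A minimal sketch for the two flower practicals, assuming scipy is available; it only automates the χ² = Σ(O - E)²/E arithmetic against a 3:1 expectation, so work through the practicals by hand first:

    from scipy.stats import chisquare

    def ratio_test(red, green, ratio=(3, 1)):
        # expected counts under the hypothesized red:green ratio
        n = red + green
        total = sum(ratio)
        expected = [n * ratio[0] / total, n * ratio[1] / total]
        return chisquare([red, green], f_exp=expected)

    print(ratio_test(84, 16))  # Practical 1: chi-square statistic and p-value
    print(ratio_test(67, 33))  # Practical 2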
For Genetics
                 For > 2 categories
    • Ho: The sample of Drosophila comes from a population having a 9:3:3:1 ratio of pale body-normal wing (PNW) : pale-vestigial wing (PVW) : dark-normal wing (DNW) : dark-vestigial wing (DVW)

    • Student’s observations in the lab:
           PNW     PVW    DNW     DVW      Total
           300     77     89      36       502


    Calculate the chi-square and test Ho
For Genetics

    • Ho: The sample of Drosophila (F2) comes from a population having a 9:3:3:1 ratio of pale body-normal wing (PNW) : pale-vestigial wing (PVW) : dark-normal wing (DNW) : dark-vestigial wing (DVW)

                        PNW       PVW       DNW       DVW       Total
    Observed            300       77        89        36        502
    Exp. proportion     9/16      3/16      3/16      1/16      1
    Expected            282.4     94.1      94.1      31.4      502
    O - E               17.6      -17.1     -5.1      4.6       0
    (O - E)²            309.8     292.4     26.0      21.2
    (O - E)²/E          1.1       3.1       0.3       0.7

                    χ² = 1.1 + 3.1 + 0.3 + 0.7 = 5.2
                    df = 4 - 1 = 3
                    χ²(α = 0.05, df = 3) = 7.815; for χ² = 5.2, 0.10 < p < 0.25

                      Therefore, accept Ho.
For Questionnaire

     Cross Tabulation or Contingency Tables

     – Further examination of the data on the opinion on increasing the fuel tax to cut down air pollution (Example 1):
     – Ho: the decision is independent of sex

                   Males           Females
     Agree           13 (a)          100 (b)
     Disagree       116 (c)           30 (d)

     Expected frequency for cell b = (a + b)[(b + d)/N]

                   Males                         Females                       n
     Agree           13                            100                         113
      (expected)    113(129/259) = 56.28          113(130/259) = 56.72
     Disagree       116                             30                         146
      (expected)    146(129/259) = 72.72          146(130/259) = 73.28
       n            129                            130                         259
Cross tabulation or contingency tables:
– Ho: the decision is independent of sex

              Males       Females       n
Agree            13          100        113
 (expected)    56.28        56.72
Disagree        116           30        146
 (expected)    72.72        73.28
  n             129          130        259

χ² = (13 - 56.28)²/56.28 + (100 - 56.72)²/56.72 + (116 - 72.72)²/72.72 + (30 - 73.28)²/73.28
   = 117.63
df = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1
χ²(α = 0.05, df = 1) = 3.841; p < 0.001
Therefore, reject Ho and accept HA that the decision is dependent on sex.
Quicker method for a 2 x 2 cross tabulation:

            Class A         Class B         n
State 1        a               b            a + b
State 2        c               d            c + d
    n        a + c           b + d          n = a + b + c + d

χ² = n(ad - bc)² / [(a + b)(c + d)(a + c)(b + d)]

            Males         Females
Agree         13             100         113
Disagree     116              30         146
             129             130         259

χ² = 259(13 × 30 - 116 × 100)² / [(113)(146)(129)(130)] = 117.64
χ²(α = 0.05, df = 1) = 3.841; p < 0.001; therefore, reject Ho.
Yates' continuity correction:

• The χ² distribution is continuous, while the frequencies being analyzed are discontinuous (whole numbers).
• To improve the analysis, Yates' correction is often applied (Yates, 1934):

• χ² = Σ (|observed freq. - expected freq.| - 0.5)² / expected freq.

• For a 2 x 2 contingency table:
  χ² = n(|ad - bc| - 0.5n)² / [(a + b)(c + d)(a + c)(b + d)]
   Yates' Correction (Example 1):
• χ² = n(|ad - bc| - 0.5n)² / [(a + b)(c + d)(a + c)(b + d)]

              Males          Females
    Agree        13             100          113
    Disagree    116              30          146
                129             130          259

χ² = 259(|13 × 30 - 116 × 100| - 0.5 × 259)² / [(113)(146)(129)(130)]
   = 114.935 (smaller than 117.64, less biased)

   χ²(α = 0.05, df = 1) = 3.841; p < 0.001; therefore, reject Ho.
   Practical 3:
• χ² = n(|ad - bc| - 0.5n)² / [(a + b)(c + d)(a + c)(b + d)]
• For a drug test, Ho: The survival of the animals is independent of whether the drug is administered

                      Dead           Alive           n
   Treated            12             30              42
   Not treated        27             31              58
   n                  39             61              100

Use Yates' correction to calculate χ² and test the hypothesis.


                                   Please do it at home
Bias in Chi-square calculations

• If the expected frequencies are very small, the calculated χ² is biased: it tends to be larger than the theoretical χ² value and we tend to reject Ho too readily.

• Rules: every expected frequency should be > 1, and no more than 20% of them should be < 5.0.

• The test may be conservative at significance levels < 5%, especially when the expected frequencies are all equal.

• If the expected frequencies are small: (1) increase the sample size if possible, or use the G test; or (2) combine categories if possible.
 The G test (log-likelihood ratio)
                      G = 2 Σ O ln(O/E)
• Similar to the χ² test
• Many statisticians believe that the G test is superior to the χ² test (although at present it is not as popular)
• For a 2 x 2 cross tabulation:

                Class A        Class B
   State 1         a               b
   State 2         c               d

   The expected frequency for cell a = (a + b)[(a + c)/n]

   Practical 3          Dead              Alive          n
   Treated              12 (16.38)        30 (25.62)     42
   Not treated          27 (22.62)        31 (35.38)     58
   n                    39                61            100
                    G = 2 Σ O ln(O/E)

                     Dead             Alive            n
Treated              12 (16.38)       30 (25.62)       42
Not treated          27 (22.62)       31 (35.38)       58
n                    39               61              100

(1) Calculate G:
G = 2 [12 ln(12/16.38) + 30 ln(30/25.62) + 27 ln(27/22.62) + 31 ln(31/35.38)]
G = 2 (1.681) = 3.362

(2) Calculate Williams' correction: q = 1 + [(w² - 1)/6nd], where w is the number of frequency cells, n is the total number of measurements and d is the degrees of freedom, (r - 1)(c - 1)
   q = 1 + [(4² - 1)/(6)(100)(1)] = 1.025

 G (adjusted) = χ² = 3.362/1.025 = 3.28 (< 3.31 from the χ² test)
 χ²(α = 0.05, df = 1) = 3.841; p > 0.05; therefore, accept Ho.
• Ho: The sample of Drosophila (F2) comes from a population having a 9:3:3:1 ratio of pale body-normal wing (PNW) : pale-vestigial wing (PVW) : dark-normal wing (DNW) : dark-vestigial wing (DVW)

                       PNW      PVW      DNW      DVW      Total
Observed               300      77       89       36       502
Expected               282.4    94.1     94.1     31.4
O ln(O/E)              18.14    -15.44   -4.96    4.92

G value:               G = 2 (18.14 - 15.44 - 4.96 + 4.92) = 5.32
Williams' correction:  q = 1 + [(4² - 1)/6(502)(3)] = 1.00166
G (adjusted):          5.32/1.00166 = 5.311

                  χ²(α = 0.05, df = 3) = 7.815; for G = 5.31, 0.10 < p < 0.25

                  Therefore, accept Ho.
   The Kolmogorov-Smirnov goodness of fit test =
       Kolmogorov-Smirnov one-sample test

• A goodness-of-fit test for data in ordered categories (whereas the χ² test treats the categories as nominal)
• Example: 35 cats were tested one at a time and allowed to choose among 5 diets with different moisture content (1 = very moist to 5 = very dry)

• Ho: Cats prefer all five diets equally

               1      2      3      4      5      n
Observed       2      18     10     4      1      35
Expected       7      7      7      7      7      35
            Kolmogorov-Smirnov one-sample test
• Ho: Cats prefer all five diets equally

                     1      2      3      4      5      n
O                    2      18     10     4      1      35
E                    7      7      7      7      7      35
Cumulative O         2      20     30     34     35
Cumulative E         7      14     21     28     35
|di|                 5      6      9      6      0

dmax = maximum |di| = 9
(dmax) α, k, n = (dmax) 0.05, 5, 35 = 7 (Table B8: k = no. of categories)
Therefore, reject Ho.               0.002 < p < 0.005

• When applicable (i.e. the categories are ordered), the K-S test is more powerful than the χ² test when n is small or when the observed frequencies are small.
• Note: if the order of the same data is changed to 2, 1, 4, 18 and 10, the χ² test gives the same result (it is independent of the order), but the calculated dmax from the K-S test will be different.
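The dmax calculation for the cat-diet example can be sketched in a few lines of numpy (the critical value is still read from Table B8):

import numpy as np

observed = np.array([2, 18, 10, 4, 1])
expected = np.full(5, 35 / 5)             # 7 per diet under Ho

d = np.abs(np.cumsum(observed) - np.cumsum(expected))
print(d, d.max())                         # dmax = 9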
      Kolmogorov-Smirnov one-sample test for
            continuous ratio scale data

• Example 22.11 (page 479 in Zar)
• Ho: Moths are distributed uniformly from ground level to a height of 25 m
• HA: Moths are not distributed uniformly from ground level to a height of 25 m
• Use of Table B9
Ho: Moths are distributed uniformly from ground level to a height of 25 m

                                      Fi            Fi/15            Xi/25
                Xi        fi          cumulative    relative cum.    relative expected
no.             height    frequency   frequency     frequency        frequency            Di          D'i
        1        1.4       1             1     0.0667            0.056    0.0107        0.0560
        2        2.6       1             2     0.1333            0.104    0.0293        0.0373
        3        3.3       1             3     0.2000            0.132    0.0680        0.0013
        4        4.2       1             4     0.2667            0.168    0.0987        0.0320
        5        4.7       1             5     0.3333            0.188    0.1453        0.0787
        6        5.6       2             7     0.4667            0.224    0.2427        0.1093
        7        6.4       1             8     0.5333            0.256    0.2773        0.2107
        8        7.7       1             9     0.6000            0.308    0.2920        0.2253
        9        9.3       1            10     0.6667            0.372    0.2947        0.2280
       10       10.6       1            11     0.7333            0.424    0.3093        0.2427
       11       11.5       1            12     0.8000            0.460    0.3400        0.2733
       12       12.4       1            13     0.8667            0.496    0.3707        0.3040
       13       18.6       1            14     0.9333            0.744    0.1893        0.1227
       14       22.3       1            15     1.0000            0.892    0.1080        0.0413

                                                            Max Di = 0.3707,  Max D'i = 0.3040

                                            Table B9: D 0.05, 15 = 0.3376 < Dmax

                                                        Therefore, reject Ho.      0.02 < p < 0.05
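A sketch, assuming scipy: scipy.stats.kstest of the 15 moth heights against a uniform(0, 25) distribution should return a D statistic close to the hand-computed Dmax of 0.3707 (5.6 occurs twice in the data):

from scipy.stats import kstest, uniform

heights = [1.4, 2.6, 3.3, 4.2, 4.7, 5.6, 5.6, 6.4, 7.7,
           9.3, 10.6, 11.5, 12.4, 18.6, 22.3]
result = kstest(heights, uniform(loc=0, scale=25).cdf)
print(result.statistic, result.pvalue)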
             Kolmogorov-Smirnov one-sample test
              for grouped data (example 22.11)
Xi                  0-5 m     5-10 m 10-15 m           15-20 m    20-25 m         n
observed fi             5           5      3                 1          1        15
expected fi             3           3      3                 3          3        15
Cumulative O fi         5          10     13                14         15
Cumulative E fi         3           6      9                12         15
abs di                  2           4      4                 2          0
d max                   4
d max 0.05, 5, 15       5 (use Table B8)
Thus, accept Ho 0.05<p<0.10

   • The power is reduced by grouping the data (here Ho is not rejected) and therefore grouping should be avoided whenever possible.
   • K-S test can be used to test normality of data
• Recognizing the distribution of your data
  is important
  – Provides a firm base on which to establish
    and test hypotheses
  – If data are normally distributed, you can use parametric tests;
  – otherwise, transform the data towards a normal distribution,
  – or perform non-parametric tests
• For a reliable test of normality of interval data, n must be large enough (e.g. > 15)
  – It is difficult to tell whether a small data set (e.g. n = 5) is normally distributed
•   Inspection of the frequency histogram
•   Probability plot
•   Chi-square goodness of fit
•   Kolmogorov-Smirnov one-sample test
•   Symmetry and Kurtosis: D'Agostino-Pearson K² test (Chapters 6 & 7, Zar 99)
     Inspection of the frequency histogram




• Construct the frequency histogram
• Calculate the mean and median (mode as well, if possible)
• Check the shape of the distribution and the location of
  these measurements
                             Probability plot
  e.g. 1     (Percentile = cumulative frequency / 61;  z = NORMSINV(Percentile))


                         Cumulative                                Probit     Upper
Class        frequency   frequency    Percentile     z            (5 + z)    class limit
   0-   2            1            1   0.0164    -2.1347     2.8653          2
   2-   4            2            3   0.0492    -1.6529     3.3471          4
   4-   6            3            6   0.0984    -1.2910     3.7090          6
   6-   8            5           11   0.1803    -0.9141     4.0859          8
   8-   10           8           19   0.3115    -0.4917     4.5083         10
 10 -   12          11           30   0.4918    -0.0205     4.9795         12
 12 -   14           8           38   0.6230     0.3132     5.3132         14
 14 -   16           9           47   0.7705     0.7405     5.7405         16
 16 -   18           6           53   0.8689     1.1210     6.1210         18
 18 -   20           4           57   0.9344     1.5096     6.5096         20
 20 -   22           3           60   0.9836     2.1347     7.1347         22
 22 -   24           1           61   1.0000
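In Python, scipy.stats.norm.ppf plays the role of Excel's NORMSINV; a sketch of the probit column for this table (the last class, percentile 1.0, is omitted because ppf(1) is infinite):

from scipy.stats import norm

cum_freq = [1, 3, 6, 11, 19, 30, 38, 47, 53, 57, 60]
n = 61
for cf in cum_freq:
    z = norm.ppf(cf / n)          # Excel NORMSINV equivalent
    print(round(cf / n, 4), round(z, 4), round(5 + z, 4))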
                 Probability plot (e.g. 1)

[Figure: frequency histogram (bin size = 2) and plot of expected vs. observed cumulative p; the points fall close to the fitted line y = 0.8502x + 0.0736, R² = 0.9711]
                             Probability plot
  e.g. 2


                         Cumulative                                Probit     Upper
Class        frequency   frequency    Percentile     z            (5 + z)    class limit
   0-   2           10           10   0.1111    -1.2206     3.7794          2
   2-   4           24           20   0.2222    -0.7647     4.2353          4
   4-   6           11           44   0.4889    -0.0279     4.9721          6
   6-   8            9           55   0.6111     0.2822     5.2822          8
   8-   10           4           64   0.7111     0.5566     5.5566         10
 10 -   12           5           68   0.7556     0.6921     5.6921         12
 12 -   14           2           73   0.8111     0.8820     5.8820         14
 14 -   16           2           75   0.8333     0.9674     5.9674         16
 16 -   18           4           77   0.8556     1.0606     6.0606         18
 18 -   20           1           81   0.9000     1.2816     6.2816         20
 20 -   22           8           82   0.9111     1.3476     6.3476         22
 22 -   24          10           90   1.0000
                 Probability plot (e.g. 2)

[Figure: frequency histogram (bin size = 2) and plot of expected vs. observed cumulative p; fitted line y = 1.0876x - 0.2443, R² = 0.8576]

• Obviously, the data are not distributed along the line.
• Based on the frequency distribution of the data, the distribution is positively skewed (higher frequencies at the lower classes).
• A concave curve indicates positive skew, which suggests a log-normal distribution (i.e. log-transformation of the upper class limits is required)
   – very common, e.g. mortality rates

• A convex curve indicates negative skew
   – less common (e.g. some binomial distributions)
• An S-shaped curve suggests 'bad' kurtosis: a departure from normality, although the mean, median and mode remain equal
• Leptokurtic distribution: data bunched around the mean, giving a sharp peak
• Platykurtic distribution: a broad summit which falls rapidly in the tails


• Bimodal distributions (e.g. toxicity data) produce a sigmoid probability plot
• Multi-modal distributions (e.g. data from animals with several age-classes) produce an undulating, wave-like curve
                            Chi-Square Goodness of Fit
The heights of 70 students: Chi-square goodness of fit to a normal distribution (Example 6.1 in Zar).

                           O          Xi            z =                           Expected    Expected freq.
                           observed   upper class   (Xi - mean)/s  P(z)   P(Xi)   frequency   (tails pooled)
no.        Height class    frequency  limit                                       n(P(Xi))    n(P(Xi))        (O-E)²/E
       1   <62.5                  0       62.5        -2.32   0.0102    0.0102     0.7172
       2    62.5 - 63.5           2       63.5        -2.02   0.0219    0.0117     0.8191          1.5363       0.1400
       3    63.5 - 64.5           2       64.5        -1.71   0.0434    0.0214     1.4987          1.4987       0.1677
       4    64.5 - 65.5           3       65.5        -1.41   0.0791    0.0358     2.5048          2.5048       0.0979
       5    65.5 - 66.5           5       66.5        -1.11   0.1338    0.0546     3.8238          3.8238       0.3618
       6    66.5 - 67.5           4       67.5        -0.81   0.2099    0.0762     5.3318          5.3318       0.3327
       7    67.5 - 68.5           6       68.5        -0.50   0.3069    0.0970     6.7906          6.7906       0.0921
       8    68.5 - 69.5           5       69.5        -0.20   0.4198    0.1129     7.8996          7.8996       1.0643
       9    69.5 - 70.5           8       70.5         0.10   0.5397    0.1199     8.3939          8.3939       0.0185
      10    70.5 - 71.5           7       71.5         0.40   0.6561    0.1164     8.1467          8.1467       0.1614
      11    71.5 - 72.5           7       72.5         0.70   0.7593    0.1032     7.2220          7.2220       0.0068
      12    72.5 - 73.5          10       73.5         1.01   0.8428    0.0835     5.8479          5.8479       2.9481
      13    73.5 - 74.5           6       74.5         1.31   0.9046    0.0618     4.3251          4.3251       0.6486
      14    74.5 - 75.5           3       75.5         1.61   0.9463    0.0417     2.9219          2.9219       0.0021
      15    75.5 - 76.5           2       76.5         1.91   0.9721    0.0258     1.8029          1.8029       0.0215
      16    76.5 - 77.5           0       77.5         2.21   0.9866    0.0145     1.0161          1.5392       1.5392
      17    77.5 - 78.5           0       78.5         2.52   0.9941    0.0075     0.5231
                                                                                            Chi-square =        7.6026
                                                                                        Chi-sq 0.05, 12 =       21.026


                                         Accept Ho: the data are normally distributed
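A hedged sketch of how the expected class frequencies could be generated in Python from the fitted normal distribution (mean 70.17, sd 3.310); the small tail classes would still be pooled by hand as in the table:

import numpy as np
from scipy.stats import norm

mean, sd, n = 70.17, 3.310, 70
upper_limits = np.arange(62.5, 78.6, 1.0)                 # 62.5, 63.5, ..., 78.5
cdf = norm.cdf(upper_limits, mean, sd)
p_class = np.diff(np.concatenate(([0.0], cdf, [1.0])))    # class probabilities, incl. open tails
print((n * p_class).round(3))                             # expected frequencies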
Xi              fi
Mid height      freq        fiXi        fi(Xi)²
       63            2       126          7938
       64            2       128          8192
       65            3       195         12675
       66            5       330         21780
       67            4       268         17956
       68            6       408         27744
       69            5       345         23805
       70            8       560         39200
       71            7       497         35287
       72            7       504         36288
       73           10       730         53290
       74            6       444         32856
       75            3       225         16875
       76            2       152         11552
sum                  70      4912       345438

           mean = 4912/70 = 70.17
           sd = 3.310, where s² = (345438 - 4912²/70)/(70 - 1)

[Figure: observed and expected frequencies plotted against height (in), Xi]
                          Kolmogorov-Smirnov one-sample test
The heights of 70 students: Kolmogorov-Smirnov goodness of fit to a normal distribution.

                                Xi           observed    cumulative   cumulative    z =              cumulative
                                upper class  O           O            relative O    (Xi - mean)/s    expected rel.
no.         Height class        limit        frequency   frequency    frequency                      frequency       Di        D'i
        1   <62.5                      62.5         0          0.0        0.0000        -2.32       0.0102    0.0102    0.0102
        2         62.5 - 63.5          63.5         2          2.0        0.0286        -2.02       0.0219    0.0066    0.0219
        3         63.5 - 64.5          64.5         2          4.0        0.0571        -1.71       0.0434    0.0138    0.0148
        4         64.5 - 65.5          65.5         3          7.0        0.1000        -1.41       0.0791    0.0209    0.0220
        5         65.5 - 66.5          66.5         5         12.0        0.1714        -1.11       0.1338    0.0377    0.0338
        6         66.5 - 67.5          67.5         4         16.0        0.2286        -0.81       0.2099    0.0186    0.0385
        7         67.5 - 68.5          68.5         6         22.0        0.3143        -0.50       0.3069    0.0073    0.0784
        8         68.5 - 69.5          69.5         5         27.0        0.3857        -0.20       0.4198    0.0341    0.1055
        9         69.5 - 70.5          70.5         8         35.0        0.5000         0.10       0.5397    0.0397    0.1540
       10         70.5 - 71.5          71.5         7         42.0        0.6000         0.40       0.6561    0.0561    0.1561
       11         71.5 - 72.5          72.5         7         49.0        0.7000         0.70       0.7593    0.0593    0.1593
       12         72.5 - 73.5          73.5        10         59.0        0.8429         1.01       0.8428    0.0001    0.1428
       13         73.5 - 74.5          74.5         6         65.0        0.9286         1.31       0.9046    0.0240    0.0617
       14         74.5 - 75.5          75.5         3         68.0        0.9714         1.61       0.9463    0.0251    0.0178
       15         75.5 - 76.5          76.5         2         70.0        1.0000         1.91       0.9721    0.0279    0.0007
       16         76.5 - 77.5          77.5         0         70.0        1.0000         2.21       0.9866    0.0134    0.0134
       17         77.5 - 78.5          78.5         0         70.0        1.0000         2.52       0.9941    0.0059    0.0059

                                                                                            D max           0.0593     0.1593


      Another method can be found in                                                        D 0.05, 70      0.1598 > D max

      example 7.14 (Zar 99)                                                                 Accept Ho
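      A sketch of the Di and D'i columns in numpy, using the class frequencies and the fitted normal (note that because the mean and sd are estimated from the same data, the tabled K-S critical values are conservative; Zar's example 7.14 gives an adjusted procedure):

      import numpy as np
      from scipy.stats import norm

      mean, sd, n = 70.17, 3.310, 70
      upper = np.arange(62.5, 78.6, 1.0)
      obs = np.array([0, 2, 2, 3, 5, 4, 6, 5, 8, 7, 7, 10, 6, 3, 2, 0, 0])

      cum_rel_obs = np.cumsum(obs) / n
      cum_exp = norm.cdf(upper, mean, sd)

      Di = np.abs(cum_rel_obs - cum_exp)
      Dprime = np.abs(np.concatenate(([0.0], cum_rel_obs[:-1])) - cum_exp)
      print(Di.max(), Dprime.max())     # about 0.059 and 0.159; compare with D 0.05, 70 = 0.1598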
Symmetry (Skewness)
and Kurtosis
Skewness
• A measure of the asymmetry of a
  distribution.
• The normal distribution is
  symmetric, and has a skewness
  value of zero.
• A distribution with a significant
  positive skewness has a long
  right tail.
• A distribution with a significant
  negative skewness has a long left
  tail.
• As a rough guide, a skewness value more than twice its standard error is taken to indicate a departure from symmetry.
Symmetry (Skewness)
and Kurtosis
Kurtosis
• A measure of the extent to which
  observations cluster around a central
  point.
• For a normal distribution, the value
  of the kurtosis statistic is 0.
• Positive kurtosis indicates that the observations cluster more and have longer tails than those in the normal distribution (leptokurtic).
• Negative kurtosis indicates that the observations cluster less and have shorter tails (platykurtic).
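In Python, the D'Agostino-Pearson K² test is available as scipy.stats.normaltest, and skew() and kurtosis() give the two component statistics (kurtosis is reported as excess kurtosis, 0 for a normal distribution). A sketch on illustrative, randomly generated data, not the lecture's:

import numpy as np
from scipy.stats import normaltest, skew, kurtosis

rng = np.random.default_rng(0)
x = rng.normal(loc=70, scale=3.3, size=70)   # hypothetical sample

print(skew(x), kurtosis(x))
stat, p = normaltest(x)                      # D'Agostino-Pearson K² statistic
print(stat, p)                               # large p: no evidence against normality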
• You should read Chapters 1-7 of Zar (1999), which have been covered by the five lectures so far.
• The frequency distribution of a sample can often be identified with a theoretical distribution, such as the normal distribution.
• Five methods for assessing a sample distribution: inspection of the frequency histogram; probability plot; Chi-square goodness of fit; Kolmogorov-Smirnov one-sample test; and the D'Agostino-Pearson K² test.
• Probability plots can be used for testing normal and log-normal distributions.
• Graphical methods often provide evidence of non-normal distributions, such as skewness and kurtosis (Excel or SPSS can determine the degree of these two measurements).
• The Chi-square goodness of fit or Kolmogorov-Smirnov one-sample test can also be used to test an unknown distribution against a theoretical distribution (other than the normal distribution).
Binomial & Poisson Distributions
     and their Application

    (Chapters 24 & 25, Zar 1999)
                 Binomial
• Consider nominal scale data that come from
  a population with only two categories
  – members of a mammal litter may be classified
    as male or female
  – victims of an epidemic as dead or alive
  – progeny of a Drosophila cross as white-eyed or
    red-eyed
               Binomial Distributions

The proportion of the population belonging to one of the two categories is denoted:

   – p; then the other is q = 1 - p

   – e.g. if 48% are male and 52% are female, then p = 0.48 and q = 0.52
                 Binomial Distributions
• e.g. if p = 0.4 and q = 0.6: taking a random sample of 10, you expect 4 males and 6 females; however, you might get 1 male and 9 females

• The probability of two independent events both occurring is the product of the probabilities of the two separate events, e.g. for a sample of two:
   – (p)(q) = (0.4)(0.6) = 0.24 (a male, then a female);
   – (p)(p) = 0.16 (two males); and
   – (q)(q) = 0.36 (two females)
               Binomial Distributions

• e.g. if p = 0.4 and q = 0.6: taking a random sample of 10, you expect 4 males and 6 females

• The probability of either of two mutually exclusive outcomes is the sum of their probabilities, e.g. for having one male and one female in a sample of two:
         pq + qp = 2pq = 2(0.4)(0.6) = 0.48

• The probabilities of all male, all female and one of each sum to one:
   0.16 + 0.36 + 0.48 = 1
             Binomial Distributions
If a random sample of size n is taken from a binomial population, the probability of X individuals being in one category (the other category = n - X) is

       P(X) = [n!/(X!(n - X)!)] p^X q^(n-X)

For n = 5, X = 3, p = q = 0.5:
      P(X) = (5!/(3!2!))(0.5^3)(0.5^2)
           = (10)(0.125)(0.25) = 0.3125

For X = 0, 1, 2, 4, 5:
P(X) = 0.03125, 0.15625, 0.31250, 0.15625, 0.03125, respectively
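The same probabilities come directly from scipy.stats.binom.pmf (a sketch, assuming scipy):

from scipy.stats import binom

print(binom.pmf(3, n=5, p=0.5))              # 0.3125
print(binom.pmf(range(6), n=5, p=0.5))       # P(X) for X = 0..5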
                                 Binomial distributions
    • For example: the data consist of the observed frequencies of females in 54 litters of 5 offspring per litter. X = 0 denotes a litter having no females, X = 1 a litter having one female, etc.; f is the observed number of litters, and ef is the number of litters expected if the null hypothesis is true. Computation of the values of ef requires the values of P(X).
    • Ho: The sexes of the offspring are from a binomial distribution with p = q = 0.5
                                                                                  Observed          efi
X   n     n-X    n!/(X!(n-x)!)     p    q        p^X    q^(n-X)       P(X)
                                                                             Xi          fi   (P(X))(n)
0   5      5           1         0.5   0.5         1   0.03125    0.03125
                                                                             0           3      1.688
1   5      4           5         0.5   0.5       0.5    0.0625    0.15625    1          10      8.438
2   5      3          10         0.5   0.5      0.25     0.125    0.31250    2          14     16.875
3   5      2          10         0.5   0.5     0.125      0.25    0.31250    3          17     16.875
4   5      1           5         0.5   0.5    0.0625        0.5   0.15625    4           9      8.438
5   5      0           1         0.5   0.5   0.03125          1   0.03125    5           1      1.688


        χ² = Σ (observed freq. - expected freq.)² / expected freq.
        χ² = (3 - 1.688)²/1.688 + 0.2948 + 0.4898 + 0.0009 + 0.0375 + 0.2801
           = 2.117
        df = k - 1 = 6 - 1 = 5; χ² 0.05, 5 = 11.07, so accept Ho. p > 0.05

                                              P(X) = [n!/(X!(n - X)!)] p^X q^(n-X)
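        A sketch of the litter example in Python: expected litter counts from the binomial pmf with p = 0.5, followed by the chi-square comparison used on the slide:

        import numpy as np
        from scipy.stats import binom, chisquare

        observed = np.array([3, 10, 14, 17, 9, 1])          # litters with 0..5 females
        n_litters = observed.sum()                          # 54
        expected = n_litters * binom.pmf(np.arange(6), n=5, p=0.5)

        stat, p = chisquare(observed, f_exp=expected)
        print(expected.round(3))                            # 1.688, 8.438, 16.875, ...
        print(stat, p)                                      # chi-square about 2.1, p > 0.05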
               Poisson Distributions
  Important in describing random occurrences, these occurrences being either objects in space or events in time.

      P(X) = e^(-μ) μ^X / X!

• When n is large and p is very small, the binomial distribution approaches the Poisson distribution.
• Interesting property: σ² = μ
               Poisson Distributions
                     P(X) = e^(-μ) μ^X / X!
• e.g. The data are the numbers of sparrow nests in areas of a given size (8,000 m²). A total of 40 areas of the same size were surveyed. Xi is the number of nests in an area; fi is the observed frequency of areas with Xi nests; and P(Xi) is the probability of Xi nests per area if the nests are distributed randomly.
• Ho: the population of sparrow nests is distributed randomly
                          Example 25.3 (Zar 1999)
     • Ho: the population of sparrow nests is distributed randomly

             O                                 E (expected fi)
     Xi      fi      fiXi      P(Xi)           [P(Xi)](n)        (O-E)²/E
     0       9       0         0.33287         13.3148           1.398280
     1       22      22        0.36616         14.6463           3.692154
     2       6       12        0.20139          8.0555           0.524488
     3       2       6         0.07384          2.9537           0.307921
     4       1       4         0.02031          0.8123           0.043392
     >=5     0
     sum     40      44                        Chi-square = 5.966234
             mean = 44/40 = 1.1
                                               df = k - 2 = 3
                                               Chi-square (0.05, 3) = 7.815
                                               Accept Ho

     P(0) = e^(-1.1) = 0.33287
     P(1) = (0.33287)(1.1)/1 = 0.36616                     P(X) = e^(-μ) μ^X / X!
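     A sketch of the expected frequencies for Example 25.3, assuming scipy: the Poisson pmf with the sample mean (1.1 nests per area) multiplied by the 40 areas reproduces the E column:

     import numpy as np
     from scipy.stats import poisson

     observed = np.array([9, 22, 6, 2, 1, 0])                 # areas with 0, 1, 2, 3, 4, >=5 nests
     mu = (observed * np.arange(6)).sum() / observed.sum()    # 44/40 = 1.1

     expected = 40 * poisson.pmf(np.arange(5), mu)            # P(0)..P(4) times n
     print(mu, expected.round(4))                             # 13.3148, 14.6463, 8.0555, ...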
For further reading on Binomial and Poisson
  distributions: Zar’s chapters 24 and 25

				