Comparing Datasets and Comparing a Dataset with a Standard

    How different is enough?
                Concepts:
•   Independence of each data point
•   Test statistics
•   Central Limit Theorem
•   Standard error of the mean
•   Confidence interval for a mean
•   Significance levels
•   How to apply in Excel

                     module 7         2
  Independent measurements:
• Each measurement must be independent
  (shake up the basket of tickets)
• Examples of non-independent
  measurements:
  – Public responses to questions (one result
    affects the next person’s answer)
  – Samplers placed too close together so air
    flows are affected

            Test statistics:
• Some number that is calculated
  based on the data
• In the Student’s t test, for example, the
  test statistic is t
• If t is >= 1.96, and you have a normally
  distributed population (and a large sample,
  so that t behaves like z), you know you are
  to the right of the inner portion of the
  curve: 95% of the data lies symmetrically
  between t = -1.96 on the left and t = 1.96
  on the right
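
A quick way to check the 1.96 / 95% pairing is to ask a standard normal curve how much area lies between -z and +z. This is only a sketch using Python's standard library; the function name `central_area` is made up for illustration.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0.0, sigma=1.0)

def central_area(z):
    """Fraction of the standard normal curve lying between -z and +z."""
    return std_normal.cdf(z) - std_normal.cdf(-z)

for z in (1.65, 1.96, 2.58):
    print(f"z = {z:.2f}: {central_area(z):.1%} of the curve lies between -z and +z")
```

The three z values are the ones usually paired with 90%, 95% and 99%.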
     Test statistics correspond to
          significance levels
• “P” stands for percentile
• The Pth percentile is the value that a
  fraction p of the data falls below, and a
  fraction 1-p falls above




 Two major types of questions:
• Comparing the mean against a standard
  – Does the air quality here meet the NAAQS?


• Comparing two datasets
  – Is the air quality different in 2006 than 2005?
  – Or, is the air quality better?
  – Or, is the air quality worse?


  Comparing mean to a standard:

 • Did the air quality meet the CARB annual
   standard of 12 µg/m3?

Year   Ft Smith avg   Ft Smith min   Ft Smith max   N (Ft Smith)
'05    14.78          0.1            37.9           77

Central Limit Theorem (magic!)
• Even if the underlying population is not
  normally distributed
• If we repeatedly take datasets
• These different datasets will have means
  that cluster around the true mean
• And the distribution of these means is
  normally distributed!
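
The "magic" can be tried out with a small simulation. This is a sketch under stated assumptions: the population is an exponential distribution with true mean 1 (chosen only because it is clearly not normal), and the sample sizes are arbitrary.

```python
import math
import random
import statistics

random.seed(7)  # fixed seed so repeated runs give the same numbers

def sample_means(n_per_sample, n_samples):
    """Means of repeated samples drawn from a skewed (exponential) population."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(n_per_sample))
        for _ in range(n_samples)
    ]

# Exponential population with rate 1: true mean = 1, true SD = 1.
means = sample_means(n_per_sample=50, n_samples=2000)
expected_se = 1.0 / math.sqrt(50)

# Share of sample means that land within 1.96 standard errors of the true mean.
within = sum(abs(m - 1.0) <= 1.96 * expected_se for m in means) / len(means)

print(f"average of the sample means: {statistics.fmean(means):.3f} (true mean 1)")
print(f"spread of the sample means:  {statistics.stdev(means):.3f} "
      f"(sigma / sqrt(N) = {expected_se:.3f})")
print(f"share within 1.96 standard errors: {within:.1%}")
```

Even though single draws are strongly skewed, the sample means cluster around the true mean, and roughly 95% of them should fall within 1.96 standard errors of it.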

                   module 7                  8
     magic concept #2: Standard error
              of the mean
• Represents uncertainty around the mean
• As sample size N gets bigger, your error
  gets smaller!
• The bigger the N, the more tightly you can
  estimate the mean
• LIKE standard deviation for a population,
  but this is for YOUR sample

$$ SE = \frac{\sigma}{\sqrt{N}} $$
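
To see how quickly the error shrinks, the formula can be evaluated for a few sample sizes. A minimal sketch: the SD of 8.66 is the Ft Smith value used later in these slides, and the other sample sizes are arbitrary.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sd divided by the square root of n."""
    return sd / math.sqrt(n)

sd = 8.66  # Ft Smith standard deviation quoted in the CI calculation
for n in (10, 77, 1000):
    print(f"N = {n:4d}: standard error = {standard_error(sd, n):.2f}")
```

Because of the square root, N has to grow four-fold to cut the standard error in half.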

For a “large” sample (N > 60), or when very close
to a normal distribution:
A confidence interval for a population mean is:

$$ \bar{x} \pm Z \left( \frac{s}{\sqrt{n}} \right) $$

The choice of Z determines the confidence level: 90%, 95%, etc.
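
How the confidence level picks Z can be sketched with Python's standard library. The helper name `z_for_confidence` is hypothetical; it assumes a two-sided interval, so half of the leftover probability sits in each tail.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided critical value: the z that leaves (1 - level) / 2 in each tail."""
    return NormalDist().inv_cdf(0.5 + level / 2.0)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence -> Z = {z_for_confidence(level):.3f}")
```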

        For a “small” sample:
Replace the Z value with a t value to get:

$$ \bar{x} \pm t \left( \frac{s}{\sqrt{n}} \right) $$

   where “t” comes from Student’s t distribution,
   and depends on the sample size.


          Student’s t distribution versus
              Normal Z distribution
[Figure: T-distribution and Standard Normal Z distribution. Density (0.0 to 0.4) plotted against Value (-5 to 5) for the Z distribution and for T with 5 d.f.]

Compare t and Z values:

Confidence level   t value with 5 d.f.   Z value
90%                2.015                 1.65
95%                2.571                 1.96
99%                4.032                 2.58
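
The practical effect of the larger t values is a wider interval. The sketch below reuses the critical values from the table; the sample SD of 8.0 and N of 6 (which gives 5 d.f.) are hypothetical numbers chosen only for illustration.

```python
import math

# Critical values copied from the table: level -> (t with 5 d.f., Z).
CRITICAL = {"90%": (2.015, 1.65), "95%": (2.571, 1.96), "99%": (4.032, 2.58)}

def half_width(critical, s, n):
    """Half-width of a confidence interval: critical value times s / sqrt(n)."""
    return critical * s / math.sqrt(n)

s, n = 8.0, 6  # hypothetical small sample (6 points -> 5 d.f.)
for level, (t_val, z_val) in CRITICAL.items():
    print(f"{level}: +/- {half_width(t_val, s, n):.2f} using t, "
          f"+/- {half_width(z_val, s, n):.2f} using Z")
```

Using Z on a sample this small would understate the uncertainty at every confidence level.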



              What happens as
             sample gets larger?
[Figure: T-distribution and Standard Normal Z distribution. Density (0.0 to 0.4) plotted against Value (-5 to 5) for the Z distribution and for T with 60 d.f.]

    What happens to CI as
     sample gets larger?

$$ \bar{x} \pm Z \left( \frac{s}{\sqrt{n}} \right) \qquad \bar{x} \pm t \left( \frac{s}{\sqrt{n}} \right) $$

For large samples: Z and t values become almost
identical, so CIs are almost identical.
   First, graph and review data:
• Use box plot add-in
• Evaluate spread
• Evaluate how far apart mean and
  median are
• (assume the sampling design and
  the QC are good)

Excel summary stats:




1. Use the box-plot add-in
2. Calculate summary stats

[Figure: box plot of the Ft Smith data, vertical axis from 0 to 40]

N        77
Min      0.1
25th     7.5
Median   13.7
75th     18.1
Max      37.9
Mean     14.8
SD       8.7
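
The same summary statistics can be computed outside Excel. This is a sketch only: the raw Ft Smith measurements are not in these slides, so the list below is made-up data, and `summarize` is a hypothetical helper name. Quartile conventions differ between tools; the "inclusive" method used here interpolates between data points.

```python
import statistics

def summarize(values):
    """Summary statistics in the same layout as the slide's table."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "N": len(values),
        "Min": min(values),
        "25th": q1,
        "Median": median,
        "75th": q3,
        "Max": max(values),
        "Mean": statistics.fmean(values),
        "SD": statistics.stdev(values),  # sample standard deviation
    }

# Made-up measurements, NOT the Ft Smith data.
example = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0]
for name, value in summarize(example).items():
    print(f"{name:7s} {value:.1f}")
```

A mean and median that sit close together, as they do for Ft Smith (14.8 vs 13.7), suggest the data are not badly skewed.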
             Our question:
• How confident can we be (90%? 95%? more?)
  that this mean of 14.78 is really greater
  than the standard of 12?
• We saw that N = 77, and the mean and median
  are not too different
• So use z (normal) rather than t



   The mean is 14.8 ± what?
• We know the equation for the CI is

$$ \bar{x} \pm Z \left( \frac{s}{\sqrt{n}} \right) $$

• The width of the confidence interval
  represents how sure we want to be
  that this CI includes the true mean
• Now all we need to decide is how
  confident we want to be
            CI calculation:
• For 95%, z = 1.96 (often rounded to 2)
• Standard error (sigma / square root of N) =
  (8.66 / square root of 77) = 0.987, or about 0.99
• Half-width of the CI around the mean =
  2 x 0.99, or about 2
• We can be 95% sure that the true mean is
  included in (mean ± 2), or 14.8 - 2 = 12.8 at
  the low end, to 14.8 + 2 = 16.8 at the high end
• This interval does NOT include 12!
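
The arithmetic above can be repeated for several confidence levels at once. A minimal sketch: the mean, SD, N and standard are the values from these slides, the z values are the usual rounded ones, and `z_interval` is a hypothetical helper name.

```python
import math

def z_interval(mean, sd, n, z):
    """Return (low, high) for mean +/- z * sd / sqrt(n)."""
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Summary values from the slides: mean 14.78, SD 8.66, N 77, standard 12.
MEAN, SD, N, STANDARD = 14.78, 8.66, 77, 12.0

for label, z in (("90%", 1.65), ("95%", 1.96), ("99%", 2.58)):
    low, high = z_interval(MEAN, SD, N, z)
    includes = low <= STANDARD <= high
    print(f"{label}: {low:.2f} to {high:.2f}   includes 12? {includes}")
```

With z = 1.96 left unrounded, the 95% half-width comes out near 1.93 rather than 2.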
    Excel can also calculate a
  confidence interval around the
              mean:




The mean plus and minus 1.93 is a 95%
confidence interval that does NOT
include 12!
 We know we are more than 95%
 confident, but how confident can
 we be that Ft Smith mean > 12?
• Calculate where on the curve our mean of
  14.8 is, in terms of the z (normal) score,
• Or if N small, use the t score:




To find where we are on the curve,
       calc the test statistic:
• Ft Smith mean = 14.8,
  sigma = 8.66, N = 77
• Calculate the test statistic, which in this
  case is the z factor (we decided we can use
  the z rather than the t distribution)
• If N was < 60, the test stat is t, but
  calculated the same way

$$ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{N}} $$

where \bar{x} is the data's mean and \mu is the standard of 12.
         Calculate z easily:
• Our mean of 14.8 minus the standard of 12
  (treat the real mean µ (mu) as the standard) is
  the numerator (= 2.8)
• The standard error is sigma / square root of N =
  0.987, or about 0.99 (same as for the CI)
• So z = 2.8 / 0.987 = 2.84
• So where is this z on the curve?
• Remember that at z = 3 we are to the right of
  more than 99% of the curve (about 99.9%)
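
The same calculation as a short function, for checking the arithmetic. A sketch only: `z_statistic` is a hypothetical name, and the inputs are the rounded values from the bullets above.

```python
import math

def z_statistic(sample_mean, standard, sd, n):
    """(sample mean - standard) divided by the standard error sd / sqrt(n)."""
    return (sample_mean - standard) / (sd / math.sqrt(n))

z = z_statistic(sample_mean=14.8, standard=12.0, sd=8.66, n=77)
print(f"z = {z:.2f}")
```

Starting from the unrounded mean of 14.78 instead of 14.8 gives a slightly smaller z of about 2.82.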


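       Where on the curve?
[Figure: normal curve with Z = 2 and Z = 3 marked on the right-hand side]

So z = 2.84 falls between Z = 2 and Z = 3: it is between about
97.7% and 99.9% probable that the true mean is greater than 12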
Can calculate exactly where on the
       curve, using Excel:
• Use the NORMSDIST function, with z
• If z (or t) = 2.84, in Excel: =NORMSDIST(2.84)
• Yields 99.8% probability that the
  true mean is greater than 12
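
For readers without Excel, the same cumulative probability can be sketched with Python's standard library. The wrapper name `normsdist` is chosen here only to mirror the Excel function; it is not a built-in Python function.

```python
from statistics import NormalDist

def normsdist(z):
    """Area under the standard normal curve to the left of z."""
    return NormalDist().cdf(z)

p = normsdist(2.84)
print(f"area to the left of z = 2.84: {p:.1%}")
```

This applies to the z (normal) case; for a small sample the t distribution would be needed instead, which this sketch does not cover.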