# Comparing Datasets

```
Comparing Datasets and Comparing a Dataset with a Standard

How different is enough?
Concepts:
•   Independence of each data point
•   Test statistics
•   Central Limit Theorem
•   Standard error of the mean
•   Confidence interval for a mean
•   Significance levels
•   How to apply in Excel

module 7         2
Independent measurements:
• Each measurement must be independent
(shake up the basket of tickets)
• Examples of non-independent
measurements:
– Public responses to questions (one result
affects the next person’s answer)
– Samplers placed too close together so air
flows are affected

Test statistics:
• A number that is calculated
based on the data
• In Student’s t test, for example, the test
statistic is t
• If t is >= 1.96, and you have a
normally distributed population and a
large sample, you know you are in the
right-hand tail of the curve: 95% of the
data lies in the inner portion,
symmetrically between t = -1.96 on the
left and t = 1.96 on the right

Test statistics correspond to
significance levels
• “P” stands for percentile
• The pth percentile is the value that a
fraction p of the data falls below and a
fraction 1 - p falls above

Two major types of questions:
• Comparing the mean against a standard
– Does the air quality here meet the NAAQS?

• Comparing two datasets
– Is the air quality different in 2006 than 2005?
– Or, is the air quality better?
– Or, is the air quality worse?

Comparing mean to a standard:

• Did the air quality meet the CARB annual
standard of 12 µg/m3?

Year    Ft Smith avg    Ft Smith Min    Ft Smith Max    N_Fort Smith
’05     14.78           0.1             37.9            77

Central Limit Theorem (magic!)
• Even if the underlying population is not
normally distributed
• If we repeatedly draw datasets (samples)
from it
• The means of these different datasets will
cluster around the true mean
• And the distribution of these means is
approximately normal, more closely so as
the sample size grows!

magic concept #2: Standard error
of the mean
• Represents uncertainty
around the mean
• Standard error = sigma / √N
• As sample size N gets
bigger, your error gets
smaller!
• The bigger the N, the more
tightly you can estimate the
mean
• LIKE standard deviation
for a population, but this is
for the mean

For a “large” sample (N > 60), or when very close
to a normal distribution:
A confidence interval for a population mean is:

    x̄ ± Z × (s / √n)

Choice of Z determines 90%, 95%, etc.

For a “small” sample:
Replace the Z value with a t value to get:

    x̄ ± t × (s / √n)

where “t” comes from Student’s t distribution,
and depends on the sample size.

Student’s t distribution versus
Normal Z distribution

[Figure: T-distribution and Standard Normal Z distribution.
Density (0.0 to 0.4) versus Value (-5 to 5), showing the
Z distribution and T with 5 d.f.]

compare t and Z values:

Confidence level    t value with 5 d.f.    Z value
90%                 2.015                  1.65
95%                 2.571                  1.96
99%                 4.032                  2.58

What happens as
sample gets larger?

[Figure: T-distribution and Standard Normal Z distribution.
Density (0.0 to 0.4) versus Value (-5 to 5), showing the
Z distribution and T with 60 d.f.]

What happens to CI as
sample gets larger?

    x̄ ± Z × (s / √n)

    x̄ ± t × (s / √n)

For large samples: Z and t values become almost
identical, so CIs are almost identical.

First, graph and review data:
• Use box plot add-in
• Evaluate how far apart mean and
median are
• (assume the sampling design and
the QC are good)

Excel summary stats:

1. Use the box-plot
2. Calculate summary stats

[Figure: box plot of Ft Smith, vertical axis from 0 to 40]

Ft Smith summary stats:
N         77
Min       0.1
25th      7.5
Median    13.7
75th      18.1
Max       37.9
Mean      14.8
SD        8.7

Our question:
• How confident (90%? 95%?) can we be
that this mean of 14.78 is really greater
than the standard of 12?
• We saw that N = 77, and that the mean and
median are not too different
• So use z (normal) rather than t

The mean is 14.8 ± what?
• We know the equation for CI is

    x̄ ± Z × (s / √n)

• The width of the confidence interval
represents how sure we want to be
that this CI includes the true mean
• Now all we need to decide is how
confident we want to be

CI calculation:
• For 95%, z = 1.96 (often rounded to 2)
• Standard error (sigma / √N) = (8.66 / square root of
77) = 0.987
• CI half-width around the mean = 2 x 0.987, or about 2
• We can be 95% sure that the true mean is
included in (mean ± 2), or 14.8 - 2 = 12.8 at the
low end, to 14.8 + 2 = 16.8 at the high end
• This does NOT include 12!

Excel can also calculate a
confidence interval around the
mean:

The mean plus and minus 1.93 is a 95%
confidence interval that does NOT
include 12!

We know we are more than 95%
confident, but how confident can
we be that Ft Smith mean > 12?
• Calculate where on the curve our mean of
14.8 is, in terms of the z (normal) score
• Or, if N is small, use the t score

To find where we are on the curve,
calc the test statistic:
• Ft Smith mean = 14.8,
sigma = 8.66, N = 77
• Calculate the test statistic, which in this
case is the z factor (we decided we can use
the z rather than the t distribution):

    z = (x̄ - μ) / (σ / √N)

  where x̄ is the data’s mean and μ is the
  standard of 12
• If N was < 60, the test stat is t, but
calculated the same way

Calculate z easily:
• Our mean of 14.8 minus the standard of 12
(treat the real mean μ (mu) as the standard) is
the numerator (= 2.8)
• The standard error is sigma / square root of N =
0.987 (same as for CI)
• So z = 2.8 / 0.987 = 2.84
• So where is this z on the curve?
• Remember at z = 3 we are to the right of ~
99%

Where on the curve?

[Figure: normal curve with Z=2 and Z=3 marked]

So it is between 95% and 99% probable that the true mean
is above 12

Can calculate exactly where on the
curve, using Excel:
• Use the NORMSDIST function, with z
• If z (or t) = 2.84, in Excel: =NORMSDIST(2.84)

Yields 99.8% probability that the
true mean is above 12

```
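
The Ft Smith arithmetic from the slides (standard error, 95% confidence interval, z score, and position on the normal curve) can be reproduced outside Excel. The following is a minimal sketch in Python using only the standard library. The function name `mean_vs_standard` and its parameter names are hypothetical, chosen for illustration, and the sketch assumes the large-sample case where z is used rather than t. `NormalDist().cdf(z)` is used here as a stand-in for the Excel NORMSDIST step.

```python
from math import sqrt
from statistics import NormalDist


def mean_vs_standard(mean, sd, n, standard, confidence=0.95):
    """Compare a sample mean against a fixed standard using the z (normal) curve.

    Hypothetical helper for illustration. Assumes a "large" sample, so the
    normal distribution is used instead of Student's t.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1

    # Standard error of the mean: sigma / sqrt(N)
    se = sd / sqrt(n)

    # Two-sided critical z for the requested confidence (1.96 for 95%)
    z_crit = std_normal.inv_cdf(0.5 + confidence / 2)

    # Confidence interval: mean +/- z * standard error
    half_width = z_crit * se
    ci = (mean - half_width, mean + half_width)

    # Test statistic: how many standard errors the mean sits above the standard
    z = (mean - standard) / se

    # Area of the standard normal curve to the left of z
    area_below_z = std_normal.cdf(z)

    return {
        "se": se,
        "half_width": half_width,
        "ci": ci,
        "z": z,
        "area_below_z": area_below_z,
    }


# Ft Smith values from the slides: mean 14.8, SD 8.66, N = 77, standard of 12
result = mean_vs_standard(mean=14.8, sd=8.66, n=77, standard=12.0)

print(f"standard error   = {result['se']:.3f}")
print(f"95% CI           = {result['ci'][0]:.2f} to {result['ci'][1]:.2f}")
print(f"z                = {result['z']:.2f}")
print(f"area left of z   = {result['area_below_z']:.3f}")
```

With the slide values, the interval half-width comes out near the 1.93 that Excel reports, the lower end of the interval stays above 12, and z lands near 2.84. For a small sample, the same structure would apply with a t critical value in place of `z_crit`, which the standard library does not provide.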