Quantitative Analysis and Research Methods

Document Sample
Quantitative Analysis and Research Methods Powered By Docstoc
					Environmental Health: Science, Policy and Social Justice
Fall quarter 2008
CAL workshop
Week 6

Sampling and Estimation
(due Friday Nov. 14)


                                      Normal Distribution
A discussion of some simulation experiments appears in Chapter 7 of your custom statistics text
(Kuzma-Bohnenblust). You need to go through the assignment individually. As it is a simulation
using a random number generator, the odds of two results being the same are astonishingly low…

By using Excel’s Random Number Generation Function and do this as a simulation, we can
generate many independent random samples, which would be difficult with actual data sets. You
will generate 100 samples with n = 10 are taken from a population with  = 100 and  = 25.

Before you start Random Number generation, you need to make sure that Excel’s recalculation
option is disabled. If it is not, ever time you hit the Enter key, Excel will generate a new set of
random digits. Go to Tools  Options  Calculation, and toggle on the Manual option. (For
Macs: got to Excel menu  Preferences  Calculation  check on “Manual”) As indicated in
that dialog box, if you want a new set of random numbers, all you need to do is hit the F9 key
(“apple” “+” “t” in Macs). Make sure that the Recalculate before save option is also turned off.
If you see a check mark in the box, click on the box to turn off the feature. Note: The manual
calculation mode also affects the auto fill function. When you click and drag, the new cells will
look the same as the cells you are copying. Simply hit F9 (“apple” “+” “t” in Macs) to complete
the auto fill.


Simulation

From the menu bar, go to Tools, then select Data Analysis, then Random Number Generation.
In the window that opens, first select the type of Distribution: for this exercise select “Normal”.
In the Number of Variables enter 10 (this is the number of observations per trial, or sample
size), and in Number of Random Variables enter 100 (this is the number of trials, or samples).
You will need to add a random seed, or everyone’s results will come out the same (if you
remember the number, and reinsert it, EXCEL will generate the same series of random numbers,
providing you keep all of the other options the same). Lastly, you need to give the mean and
standard deviation of the distribution you are trying to simulate; remember that your “samples”
are taken from a population with  = 100 and  = 25.

The numbers that are generated will likely have a “General” format. You should go back and
reformat them as “Number,” with 2 or 3 decimal places – number of decimal places isn’t that
critical. Reformatting is done in the Format→Cells menu. Each sample of 10 observations will
occupy a row on your EXCEL worksheet, utilizing columns A through J. After EXCEL
generates 100 samples of size 10, insert a row or two at the top of the page, so you can add
column labels. Or, you can tell EXCEL to start generating the numbers in cell A3. You should



                                                                                                      1
complete this assignment using the same block of random numbers for all three parts of this
simulation of sampling from the normal distribution.



Part 1

This part shows that the sample mean x is an unbiased estimator of the population mean, .
    1. At the end of each row (use column L, leave column K blank), use the Average function
        to determine the mean, x, of each of the 100 samples. Don’t forget the Fill handle!
    2. Make a histogram of your 100 sample means: See instructions at the end of the
        document. How does you histogram compare to statement #1 of the Central Limit
        Theorem (p.82)?
    3. Now at the bottom of your list of means in column L, find the mean of your 100 sample
        means: use the average function for the mean of all sample means. How does it compare
        to the population mean, ? How do the results of your simulation compare to statement
        #2 of the Central Limit Theorem (p.82)?
    4. What conclusion can you draw from this experiment?


Part II

This part demonstrates that the sample variance, s2, is an unbiased estimator of the population
variance 2.
    1. For ease of interpretation, we will use both the =STDEV command and the =VAR
        command to obtain information on variability. For each of the 100 samples, calculate the
        standard deviation (s) in column M and the variance (s2) in column N, if you are using
        the proposed cell-location format.
    2. Calculate the average standard deviation and average variance of the 100 samples that
        you generated. How does it compare to the population standard deviation,  and
        variance, 2, (i.e. the value used in the Random Number Generation)?
    3. Calculate the standard deviation and variance of the 100 sample means. How do the
        results of your simulation compare to statement #3 of the Central Limit Theorem (p.82)?
    4. What is the standard deviation and variance for all 1000 observations? How do they
        compare to the values based on averaging the 100 sample standard deviations and
        variances?
    5. Create a histogram of all 1000 observations following the instructions at the end of the
        document. Compare with the histogram of the means you created in step 2 of Part I. Pay
        attention to the scale of the x-axis! You may re-do the histogram of the means using the
        same range of bins that you used to make the histogram of all 1000 observations: now the
        difference should be clear!
    6. What conclusion can you draw from this experiment?


Part III (optional, but highly recommended!)

This part demonstrates one of the fundamental concepts of statistical inference: the interval
estimate based on a sample does not always include the value of the parameter we are
trying to estimate! The objective of this simulation is to show that the interval estimates are




                                                                                                  2
correct most of the time. We will use a 90% confidence interval estimate for our samples. The
90% confidence interval estimator work should be in columns O, P, and Q.

    1. In column O, calculate the lower confidence limit (LCL) = x – 1.645(/n), e.g. “=M3-
       1.645*(25/SQRT(10))”. Remember that x (x bar) is the mean of the sample for any
       given row, not the population mean. In this experiment, however, we assume that we
       know the population variance (2), so the standard deviation () is known, and is thus a
       constant, = 25.
    2. In column P, calculate the upper confidence limit (UCL) = x + 1.645(/n), e.g.
       “=M3+1.645*(25/SQRT(10))”.
    3. In column Q, use the =AND command e.g. “=AND($M$107>P3,$M$107<Q3)”,
       (instructor or stats aid can demonstrate) to have EXCEL determine if the population
       mean (i.e. µ = 100) is within the 90% confidence interval.
    4. Use the =COUNTIF command to have EXCEL count the number of 90% confidence
       intervals that contain µ=100 (TRUE), e.g. “=COUNTIF(R3:R102, TRUE)”. Also count
       the number of times that µ doesn’t fall within the 90% confidence interval (FALSE), e.g.
       “=COUNTIF(R3:R102, FALSE)” to make sure that the sum of those that do and those
       that don’t = 100.
    5. What conclusions can you draw from this experiment using the 90 % confidence
       interval?


In the Workspace folder for our program, you will find a folder labeled Results. In that folder,
there is an EXCEL file Estimation Simulation-program results. After you have completed
your simulations, add your name and place your results in the appropriate columns.


    Creating a histogram in Excel

    a. You may let Excel create a column chart for you, using its default settings. From
    the menu bar go to Tools - Data Analysis - Histogram. In the dialog box, enter the
    input range for the “means” values. Leave the bin range blank. Make sure that you also
    select the Output range (the place where excel will display the frequency table of the
    histogram) or select New Worksheet Ply if you want to display these in a separate
    worksheet), and also check the box for Chart Output, so that Excel displays the
    histogram. In some situations, a chart like this could be useful, but you can improve the
    formatting by creating your own categories (bins), as described below.

    b. If you want to have more control over the quality of the histogram, you may want to
    create your own categories (bins). Use the =max and =min functions to determine the
    range of values of your data. This should aid in setting up your bins. Once again,
    you should have approximately n bins. In one column (name it “bins”) create a series
    of bins of 5, starting with the next “5” number above the minimum of your range of data
    (eg, if your minimum is 32, start at 35) and going up to the maximum of your range of
    data. When you make the histogram now, you can specify the cell range for the bins that
    you created in the “Bins” cell of the dialogue box, and the same data range for the
    “Input” cell as above (Excel sorts them for you!). Select Output range or New
    Worksheet Ply and check the box for Chart Output, as above.




                                                                                                   3