Docstoc

Statistics with Excel

Document Sample
Statistics with Excel Powered By Docstoc
					Basic Statistics with Excel
Outline – Excel as a Tool for . . .
   Summarizing data
   Fitting data (regression)
   Hypothesis testing

This is a tutorial. It consists of both information and
exercises to introduce Excel as a tool for statistical
analysis. Work through it at your own pace. It assumes that
you are familiar with Excel and can do simple calculations
and make plots. If you can‟t, seek more help.
             Some Excel Tidbits
   Excel can display dates in many formats, but it stores
    dates as numbers.
       1 = January 1, 1900
       37622 = January 1, 2003
   A single quote forces a cell entry to be text
   To “fix” a cell, use $ in front of the column letter
    and/or row number. Then the address won‟t update
    when you copy the cell.
   Excel can count the number of values in a range
       =count(N2:N35)
       =countif(N2:N35,“-777”)
       =countif(N2:N35,“>-777”)
       Summarizing Data: Histograms
      When your data set consists of a single variable
       measured multiple times, a histogram shows
       how the values are distributed.         0.3
Indicate total #                                                                                                                        About 26% of
of values on                                  0.25                                                                                      measurements
graph or in                                             27 total measurements                                                           are between 55
                   Fraction of Measurements




caption.
                                               0.2
                                                                                                                                        and 56.
                                              0.15

This is  50 and                                                                                                                     This distribution
< 51. Next                                     0.1                                                                                   is left-skewed.
category is  51                              0.05

and < 52. No
overlap.                                        0                                                                                       All categories
                                                    7

                                                           8

                                                                  9

                                                                         0

                                                                                1

                                                                                       2

                                                                                              3

                                                                                                     4

                                                                                                            5

                                                                                                                   6

                                                                                                                          7

                                                                                                                                 8
                                                                                                                                        the same size.
                                                 -4

                                                        -4

                                                               -4

                                                                      -5

                                                                             -5

                                                                                    -5

                                                                                           -5

                                                                                                  -5

                                                                                                         -5

                                                                                                                -5

                                                                                                                       -5

                                                                                                                              -5
                                                46

                                                     47

                                                           48

                                                                  49

                                                                         50

                                                                                51

                                                                                       52

                                                                                              53

                                                                                                     54

                                                                                                            55

                                                                                                                   56

                                                                                                                          57

                                                                         Measured Calibration Factor
Exercise: Construct a Histogram
    Now that you have seen a histogram
  generated by Excel, you should practice
    by making one yourself. On the next
  slide, you will be directed to load a data
  file and make a histogram to display the
 data. Each bullet on the slide is a hint for
 what you should do next to complete the
 histogram. You should be able to figure
  out how to make Excel do these things.
Exercise: Construct a Histogram
   Load calibration.xls from the ChE 408 homepage
   Count the total number of entries under cal factor
       =count(b2:b28)
   Set up a list of categories for the histogram
       ‟46-47
   Count the number of entries in each category
       =countif(b$2:b$28,”<47”)-countif(b$2:b$28,”<46”)
   Calculate the fraction of entries in each category
   Insert a column chart on a new page, fraction vs.
    category
End of Histogram Exercise
Continue to learn about more ways to
          summarize data.
    More Ways to Summarize Data
   Plots
       Time series (use “xy”, not “line”, to see trends
        over course of experiment)
       y vs. x (use “xy”, not “line”, to investigate
        correlations)
       Column chart (to make histogram)
       Pie chart (to show proportions; rarely used)
   Descriptive Statistics
       Measures of location
       Measures of variability
       Tools > Data Analysis > Descriptive Statistics
                                                      Plots to Summarize Data
                              60                                                                                      60

                                                 Look! No gray                                                                 Dependent variable.
                                                  background.                                                                  Cal Factor depends
Measured Calibration Factor




                                                                                        Measured Calibration Factor
                              55                                                                                      55
                                                                                                                               on sample size.


                              50                                                                                      50

                                                 Axes scaled so                                                                     Independent variable
                                                 data covers graph.
                              45                                                                                      45
                                3-Jul   13-Jul     23-Jul    2-Aug    12-Aug   22-Aug                                      0   50    100      150      200    250   300
                                                   Calibration Date                                                                  Amount of Sample, Torr


                              Figure 2a. Example of a time-series                                                     Figure 2b. Example of a „y vs. x‟
                              plot, showing that values from the                                                      plot, showing that lower values of
                              first day appear to be lower and more                                                   “calibration factor” are obtained with
                              scattered than the rest.                                                                smaller amounts of sample.
Descriptive Statistics – “Location”

   Mean (average)
       Strongly affected by unusual points
       =average(b2:b28)
   Median (50% of data higher, 50% of data
    lower)
       Seldom strongly affected by unusual points
       =median(b2:b28)
Descriptive Statistics – “Variability”
    Standard Deviation, =stdev(b2:b28)
        A measure of how widely the data is spread around
         the mean
        Strongly affected by unusual points
    Relative Standard Deviation
        =stdev(b2:b28)/average(b2:b28)
        Usually given as percentage
            Format > Cells > Percent
    Variance
        =(stdev(b2:b28))^2
Descriptive Statistics – “Variability”
   Interquartile range, IQR
       Width of the middle 50 % of the data
       Seldom strongly affected by unusual points
       IQR = q0.75 – q0.25
           q0.75: 75% of data is lower, =percentile(b2:b28,0.75)
           q0.25: 25% of data is lower, =percentile(b2:b28,0.25)
   Range
       Values spanned by the data
       = max(b2:b28) – min(b2:b28)
       Strongly affected by unusual points
Descriptive Statistics – “Variability”

    Confidence Interval
        Mean +/- Uncertainty covers the confidence
         interval
        “95 % confidence limits” means that if you
         calculate a mean and confidence interval
         from a set of replicate measurements (the
         sample), you can be 95 % sure that if you
         made an infinite number of measurements
         (the population), their mean would lie within
         the confidence interval
                Descriptive Statistics
The mode is the value                Calibration Factor
                                                                    The 95%
that appears most often.   Mean
                           Standard Error
                                                      54.24020706
                                                      0.420502872   confidence
Each value appears just    Median
                           Mode
                                                      54.86773908
                                                         #N/A       interval is
once in this data set.     Standard Deviation
                           Sample Variance
                                                      2.184997019
                                                      4.774211972   54.2 ± 0.9.
                           Kurtosis                   4.710663042
                           Skewness                  -1.881680628
                           Range                      10.22476031
                           Minimum                    46.67110279
                           Maximum                     56.8958631
Kurtosis and skewness      Sum
                           Count
                                                      1464.485591
                                                               27
describe the symmetry      Confidence Level(95.0%)    0.864356599


of the distribution. We
won‟t discuss these
further in this class.  This table is as-generated by Excel.
                        Round to appropriate sig figs before
                        reporting these numbers.
 Exercise: Summarizing Data
   Now that you have seen graphical and
 statistical methods for summarizing data,
  you should practice. On the next slide,
you will be directed to load a data file and
  generate plots and descriptive statistics.
 Each bullet on the slide is a hint for what
you should do next. You should be able to
  figure out how to make Excel do these
                   things.
    Exercise: Summarizing Data
   Load calibration.xls from the ChE 408
    homepage
   Construct a time-series plot for the calibration
    factor
   Plot calibration factor vs. sample amount
   Use descriptive statistics to summarize the
    “location” and “variability” of the calibration
    factor
       Calculate each statistic individually
       Use Tools > Data Analysis > Descriptive
        Statistics
  End of Summarizing Data
          Exercise
Continue to learn about fitting equations
                to data.
Fitting Data
   For linear regression, use Tools > Data Analysis
    > Regression
       Don‟t use trendline – no statistics
       X Variable  slope
       R2 – Coefficient of Determination
           A value near 1.0 means that the value of y depends strongly
            on the value of x. Does NOT mean the dependence is linear.
       Use residual plots to show linearity
           Residuals should be random around zero
       Use p-values to show significance of linear fit
           p = probability that points are arranged like this by chance
  Linear Regression: Example of Excel Output
Round to appropriate
                                                                             99% of the variation in y is explained by variation
sig figs before reporting                                                    in x. The remainder may be random error, or may
these numbers.                                                               be explained by some factor other than x.
                                      SUMMARY OUTPUT

                                                                                Ratio of variability explained
   y = (0.58±0.02)x + (0.015±0.010)




                                            Regression Statistics
                                      Multiple R           0.99476849           by model to leftover variability.
                                      R Square             0.98956435
                                      Adjusted R Square 0.98923824              High number means model
                                      Standard Error       0.01439391
                                      Observations                 34
                                                                                explains most variation in data.

                                      ANOVA
                                                                                                                      Probability of
                                                             df            SS        MS        F Significance F       getting that value of
                                      Regression                   1    0.628683679 0.629   3034.41 2.782E-33
                                      Residual                    32    0.006629909 2E-04
                                                                                                                      F by randomly
                                      Total                       33    0.635313588                                   sampling from
                                                         Coefficients Standard Error t Stat
                                                                                                                      normally-
                                                                                            P-value Lower 95% Upper 95%
                                      Intercept             0.015017    0.004778837 3.142                             distributed data.
                                                                                             0.0036 0.0052828 0.02475116
                                      Slope                0.5825802    0.010575927 55.09 2.8E-33   0.5610378 0.60412264




                                                           Probability of getting a slope or
                                                           intercept this much different from                      Confidence limits on
                                                           zero by randomly sampling from                          slope and intercept.
                                                           normally distributed data.
                                          Residual Plot
This plot is as-generated by
                                                                           The random distribution of
Excel. Fix the formatting before                                           residuals around zero suggests
including this plot in a report.                                           that the model accounts for all
                                                                           predictable variation in y, and
                                                                           all that is left is random
                                              n-Butane Residual Plot
                                                                           uncertainty.
                                  0.03
                                  0.02
A residual for a                  0.01
given value of x is
                      Residuals




                                      0
the difference                        0.000
                                  -0.01       0.200   0.400    0.600   0.800    1.000
between the                       -0.02
measured value of                 -0.03
y, and the value of               -0.04
y calculated using                -0.05
the regression                                           n-Butane
model.
 Exercise: Linear Regression
     Now that you have seen the kind of
information Excel gives you when you fit
a straight line to data, you should practice.
 On the next slide, you will be directed to
load a data file, plot the data, and fit a line
 to it. Each bullet on the slide is a hint for
 what you should do next. You should be
  able to figure out how to make Excel do
                 these things.
         Exercise – Fitting Data
   Load Jul2627data.xls from the ChE 408 homepage
   Plot i-butane vs. n-butane. Does this plot appear
    linear?
   Perform linear regression on i-butane vs. n-butane.
       Are these values strongly correlated?
       Are these values linearly correlated?
       What is the linear equation relating these values?
       What uncertainty would you place on the values of the
        slope and intercept for this linear relationship?
  End of Summarizing Data
          Exercise
Continue to learn about fitting equations
                to data.
Hypothesis Testing Can Answer
     Questions Like . . .

   Is this value significantly different from the
    one I expected?
   Is the variability of this data significantly
    different from what I expected?
   Are these two sets of data significantly
    different from each other?
    When you say “significant”, back it up
               with statistics.
          Hypothesis Testing
   H0 = null hypothesis
       There is no significant difference
   H1 = alternative hypothesis
       There is a significant difference (two-sided)
       This is significantly higher / lower (one-sided)
   You cannot prove the null hypothesis. You
    can at best say the data offer no significant
    evidence against it.
        Testing “Location”
    p-value (Significance Level)
   Significance level of 5 % (p < 0.05) means
    at most a 5 % chance that this difference is
    due to random uncertainty and not real.
   Excel calculates this as part of a t-test
   p-value > significance level: accept H0
       Cannot prove that observations and
        expectations differ
         Testing “Location”: t-test
   t  (difference between samples) / (variability)
   Excel will automatically calculate t-values to
    compare:
       Means of two datasets with equal variances
       Means of two datasets with unequal variances
       Two sets of paired data
   abs(t-score) < abs(t-critical): accept H0
       Cannot prove that observations and expectations
        differ
              t-Test – Independent Samples
 Round to appropriate
                                                        Change this if you want to know whether
 sig figs before reporting                              the means of the two samples differ by at
 these numbers.                                         least some specified amount.

                                                                                   Probability of drawing two
t > tcritical(one-tail), so the t-Test: Two-Sample Assuming Unequal Variances      random samples from a
mean of sample #1 is                                         Variable 1 Variable 2
                                                                                   normally distributed population
significantly larger than Mean                                54.99931 50.90014 and getting the mean of sample
                                Variance                      1.262476 7.290012 #1 this much larger than the
the mean of sample #2. Observations                                  22          5
                                Hypothesized Mean Difference          0            mean of sample #2. The mean
                                df                                    4            of sample #1 is larger at a
                                t Stat
                                P(T<=t) one-tail
                                                              3.329922
                                                              0.014554
                                                                                   significance level of =0.02 (or
                                t Critical one-tail           2.131846             “at the 2 % significance level”),
                                P(T<=t) two-tail              0.029108             because p < 0.02.
                              t Critical two-tail          2.776451

t > tcritical(two-tail), so the
mean of sample #1 is
significantly different from                                          P < 0.03, so the means of the two
the mean of sample #2.                                                samples are different at the 3 %
                                                                      significance level.
Exercise: Hypothesis Testing
  Now that you have seen the results of
  some t-tests and p-tests in Excel, you
 should practice. On the next slide, you
 will be directed to load a data file and
  make decisions about the data. Each
bullet on the slide is a hint for what you
 should do next. You should be able to
 figure out how to make Excel do these
                  things.
    Exercise - Hypothesis Testing
   Load calibration.xls from the ChE 408
    homepage
   Calculate the mean, std dev, and variance for
    only the first day.
   Calculate the mean, std dev, and variance
    excluding the first day.
   Use a t-test and a p-test to learn whether the
    calibration factor measured the first day differs
    significantly from the rest
       Tools > Data analysis > t-Test: Two sample . . .
       Can use “equal variances” if ratio of variances < 3.
End of Hypothesis Testing
        Exercise
 This concludes the Excel tutorial.

				
DOCUMENT INFO
Shared By:
Categories:
Stats:
views:119
posted:3/18/2010
language:English
pages:31