# Statistics with Excel

Document Sample

```					Basic Statistics with Excel
Outline – Excel as a Tool for . . .
   Summarizing data
   Fitting data (regression)
   Hypothesis testing

This is a tutorial. It consists of both information and
exercises to introduce Excel as a tool for statistical
analysis. Work through it at your own pace. It assumes that
you are familiar with Excel and can do simple calculations
and make plots. If you can‟t, seek more help.
Some Excel Tidbits
   Excel can display dates in many formats, but it stores
dates as numbers.
   1 = January 1, 1900
   37622 = January 1, 2003
   A single quote forces a cell entry to be text
   To “fix” a cell, use \$ in front of the column letter
and/or row number. Then the address won‟t update
when you copy the cell.
   Excel can count the number of values in a range
   =count(N2:N35)
   =countif(N2:N35,“-777”)
   =countif(N2:N35,“>-777”)
Summarizing Data: Histograms
   When your data set consists of a single variable
measured multiple times, a histogram shows
how the values are distributed.         0.3
Indicate total #                                                                                                                        About 26% of
of values on                                  0.25                                                                                      measurements
graph or in                                             27 total measurements                                                           are between 55
Fraction of Measurements

caption.
0.2
and 56.
0.15

This is  50 and                                                                                                                     This distribution
< 51. Next                                     0.1                                                                                   is left-skewed.
category is  51                              0.05

and < 52. No
overlap.                                        0                                                                                       All categories
7

8

9

0

1

2

3

4

5

6

7

8
the same size.
-4

-4

-4

-5

-5

-5

-5

-5

-5

-5

-5

-5
46

47

48

49

50

51

52

53

54

55

56

57

Measured Calibration Factor
Exercise: Construct a Histogram
Now that you have seen a histogram
generated by Excel, you should practice
by making one yourself. On the next
slide, you will be directed to load a data
file and make a histogram to display the
data. Each bullet on the slide is a hint for
what you should do next to complete the
histogram. You should be able to figure
out how to make Excel do these things.
Exercise: Construct a Histogram
   Load calibration.xls from the ChE 408 homepage
   Count the total number of entries under cal factor
   =count(b2:b28)
   Set up a list of categories for the histogram
   ‟46-47
   Count the number of entries in each category
   =countif(b\$2:b\$28,”<47”)-countif(b\$2:b\$28,”<46”)
   Calculate the fraction of entries in each category
   Insert a column chart on a new page, fraction vs.
category
End of Histogram Exercise
Continue to learn about more ways to
summarize data.
More Ways to Summarize Data
   Plots
   Time series (use “xy”, not “line”, to see trends
over course of experiment)
   y vs. x (use “xy”, not “line”, to investigate
correlations)
   Column chart (to make histogram)
   Pie chart (to show proportions; rarely used)
   Descriptive Statistics
   Measures of location
   Measures of variability
   Tools > Data Analysis > Descriptive Statistics
Plots to Summarize Data
60                                                                                      60

Look! No gray                                                                 Dependent variable.
background.                                                                  Cal Factor depends
Measured Calibration Factor

Measured Calibration Factor
55                                                                                      55
on sample size.

50                                                                                      50

Axes scaled so                                                                     Independent variable
data covers graph.
45                                                                                      45
3-Jul   13-Jul     23-Jul    2-Aug    12-Aug   22-Aug                                      0   50    100      150      200    250   300
Calibration Date                                                                  Amount of Sample, Torr

Figure 2a. Example of a time-series                                                     Figure 2b. Example of a „y vs. x‟
plot, showing that values from the                                                      plot, showing that lower values of
first day appear to be lower and more                                                   “calibration factor” are obtained with
scattered than the rest.                                                                smaller amounts of sample.
Descriptive Statistics – “Location”

   Mean (average)
   Strongly affected by unusual points
   =average(b2:b28)
   Median (50% of data higher, 50% of data
lower)
   Seldom strongly affected by unusual points
   =median(b2:b28)
Descriptive Statistics – “Variability”
   Standard Deviation, =stdev(b2:b28)
   A measure of how widely the data is spread around
the mean
   Strongly affected by unusual points
   Relative Standard Deviation
   =stdev(b2:b28)/average(b2:b28)
   Usually given as percentage
   Format > Cells > Percent
   Variance
   =(stdev(b2:b28))^2
Descriptive Statistics – “Variability”
   Interquartile range, IQR
   Width of the middle 50 % of the data
   Seldom strongly affected by unusual points
   IQR = q0.75 – q0.25
   q0.75: 75% of data is lower, =percentile(b2:b28,0.75)
   q0.25: 25% of data is lower, =percentile(b2:b28,0.25)
   Range
   Values spanned by the data
   = max(b2:b28) – min(b2:b28)
   Strongly affected by unusual points
Descriptive Statistics – “Variability”

   Confidence Interval
   Mean +/- Uncertainty covers the confidence
interval
   “95 % confidence limits” means that if you
calculate a mean and confidence interval
from a set of replicate measurements (the
sample), you can be 95 % sure that if you
made an infinite number of measurements
(the population), their mean would lie within
the confidence interval
Descriptive Statistics
The mode is the value                Calibration Factor
The 95%
that appears most often.   Mean
Standard Error
54.24020706
0.420502872   confidence
Each value appears just    Median
Mode
54.86773908
#N/A       interval is
once in this data set.     Standard Deviation
Sample Variance
2.184997019
4.774211972   54.2 ± 0.9.
Kurtosis                   4.710663042
Skewness                  -1.881680628
Range                      10.22476031
Minimum                    46.67110279
Maximum                     56.8958631
Kurtosis and skewness      Sum
Count
1464.485591
27
describe the symmetry      Confidence Level(95.0%)    0.864356599

of the distribution. We
won‟t discuss these
further in this class.  This table is as-generated by Excel.
Round to appropriate sig figs before
reporting these numbers.
Exercise: Summarizing Data
Now that you have seen graphical and
statistical methods for summarizing data,
you should practice. On the next slide,
you will be directed to load a data file and
generate plots and descriptive statistics.
Each bullet on the slide is a hint for what
you should do next. You should be able to
figure out how to make Excel do these
things.
Exercise: Summarizing Data
   Load calibration.xls from the ChE 408
homepage
   Construct a time-series plot for the calibration
factor
   Plot calibration factor vs. sample amount
   Use descriptive statistics to summarize the
“location” and “variability” of the calibration
factor
   Calculate each statistic individually
   Use Tools > Data Analysis > Descriptive
Statistics
End of Summarizing Data
Exercise
Continue to learn about fitting equations
to data.
Fitting Data
   For linear regression, use Tools > Data Analysis
> Regression
   Don‟t use trendline – no statistics
   X Variable  slope
   R2 – Coefficient of Determination
   A value near 1.0 means that the value of y depends strongly
on the value of x. Does NOT mean the dependence is linear.
   Use residual plots to show linearity
   Residuals should be random around zero
   Use p-values to show significance of linear fit
   p = probability that points are arranged like this by chance
Linear Regression: Example of Excel Output
Round to appropriate
99% of the variation in y is explained by variation
sig figs before reporting                                                    in x. The remainder may be random error, or may
these numbers.                                                               be explained by some factor other than x.
SUMMARY OUTPUT

Ratio of variability explained
y = (0.58±0.02)x + (0.015±0.010)

Regression Statistics
Multiple R           0.99476849           by model to leftover variability.
R Square             0.98956435
Adjusted R Square 0.98923824              High number means model
Standard Error       0.01439391
Observations                 34
explains most variation in data.

ANOVA
Probability of
df            SS        MS        F Significance F       getting that value of
Regression                   1    0.628683679 0.629   3034.41 2.782E-33
Residual                    32    0.006629909 2E-04
F by randomly
Total                       33    0.635313588                                   sampling from
Coefficients Standard Error t Stat
normally-
P-value Lower 95% Upper 95%
Intercept             0.015017    0.004778837 3.142                             distributed data.
0.0036 0.0052828 0.02475116
Slope                0.5825802    0.010575927 55.09 2.8E-33   0.5610378 0.60412264

Probability of getting a slope or
intercept this much different from                      Confidence limits on
zero by randomly sampling from                          slope and intercept.
normally distributed data.
Residual Plot
This plot is as-generated by
The random distribution of
Excel. Fix the formatting before                                           residuals around zero suggests
including this plot in a report.                                           that the model accounts for all
predictable variation in y, and
all that is left is random
n-Butane Residual Plot
uncertainty.
0.03
0.02
A residual for a                  0.01
given value of x is
Residuals

0
the difference                        0.000
-0.01       0.200   0.400    0.600   0.800    1.000
between the                       -0.02
measured value of                 -0.03
y, and the value of               -0.04
y calculated using                -0.05
the regression                                           n-Butane
model.
Exercise: Linear Regression
Now that you have seen the kind of
information Excel gives you when you fit
a straight line to data, you should practice.
On the next slide, you will be directed to
load a data file, plot the data, and fit a line
to it. Each bullet on the slide is a hint for
what you should do next. You should be
able to figure out how to make Excel do
these things.
Exercise – Fitting Data
   Load Jul2627data.xls from the ChE 408 homepage
   Plot i-butane vs. n-butane. Does this plot appear
linear?
   Perform linear regression on i-butane vs. n-butane.
   Are these values strongly correlated?
   Are these values linearly correlated?
   What is the linear equation relating these values?
   What uncertainty would you place on the values of the
slope and intercept for this linear relationship?
End of Summarizing Data
Exercise
Continue to learn about fitting equations
to data.
Questions Like . . .

   Is this value significantly different from the
one I expected?
   Is the variability of this data significantly
different from what I expected?
   Are these two sets of data significantly
different from each other?
When you say “significant”, back it up
with statistics.
Hypothesis Testing
   H0 = null hypothesis
   There is no significant difference
   H1 = alternative hypothesis
   There is a significant difference (two-sided)
   This is significantly higher / lower (one-sided)
   You cannot prove the null hypothesis. You
can at best say the data offer no significant
evidence against it.
Testing “Location”
p-value (Significance Level)
   Significance level of 5 % (p < 0.05) means
at most a 5 % chance that this difference is
due to random uncertainty and not real.
   Excel calculates this as part of a t-test
   p-value > significance level: accept H0
   Cannot prove that observations and
expectations differ
Testing “Location”: t-test
   t  (difference between samples) / (variability)
   Excel will automatically calculate t-values to
compare:
   Means of two datasets with equal variances
   Means of two datasets with unequal variances
   Two sets of paired data
   abs(t-score) < abs(t-critical): accept H0
   Cannot prove that observations and expectations
differ
t-Test – Independent Samples
Round to appropriate
Change this if you want to know whether
sig figs before reporting                              the means of the two samples differ by at
these numbers.                                         least some specified amount.

Probability of drawing two
t > tcritical(one-tail), so the t-Test: Two-Sample Assuming Unequal Variances      random samples from a
mean of sample #1 is                                         Variable 1 Variable 2
normally distributed population
significantly larger than Mean                                54.99931 50.90014 and getting the mean of sample
Variance                      1.262476 7.290012 #1 this much larger than the
the mean of sample #2. Observations                                  22          5
Hypothesized Mean Difference          0            mean of sample #2. The mean
df                                    4            of sample #1 is larger at a
t Stat
P(T<=t) one-tail
3.329922
0.014554
significance level of =0.02 (or
t Critical one-tail           2.131846             “at the 2 % significance level”),
P(T<=t) two-tail              0.029108             because p < 0.02.
t Critical two-tail          2.776451

t > tcritical(two-tail), so the
mean of sample #1 is
significantly different from                                          P < 0.03, so the means of the two
the mean of sample #2.                                                samples are different at the 3 %
significance level.
Exercise: Hypothesis Testing
Now that you have seen the results of
some t-tests and p-tests in Excel, you
should practice. On the next slide, you
will be directed to load a data file and
make decisions about the data. Each
bullet on the slide is a hint for what you
should do next. You should be able to
figure out how to make Excel do these
things.
Exercise - Hypothesis Testing
   Load calibration.xls from the ChE 408
homepage
   Calculate the mean, std dev, and variance for
only the first day.
   Calculate the mean, std dev, and variance
excluding the first day.
   Use a t-test and a p-test to learn whether the
calibration factor measured the first day differs
significantly from the rest
   Tools > Data analysis > t-Test: Two sample . . .
   Can use “equal variances” if ratio of variances < 3.
End of Hypothesis Testing
Exercise
This concludes the Excel tutorial.

```
DOCUMENT INFO
Shared By:
Categories:
Stats:
 views: 119 posted: 3/18/2010 language: English pages: 31