Basic Statistics with Excel Outline – Excel as a Tool for . . . Summarizing data Fitting data (regression) Hypothesis testing This is a tutorial. It consists of both information and exercises to introduce Excel as a tool for statistical analysis. Work through it at your own pace. It assumes that you are familiar with Excel and can do simple calculations and make plots. If you can‟t, seek more help. Some Excel Tidbits Excel can display dates in many formats, but it stores dates as numbers. 1 = January 1, 1900 37622 = January 1, 2003 A single quote forces a cell entry to be text To “fix” a cell, use $ in front of the column letter and/or row number. Then the address won‟t update when you copy the cell. Excel can count the number of values in a range =count(N2:N35) =countif(N2:N35,“-777”) =countif(N2:N35,“>-777”) Summarizing Data: Histograms When your data set consists of a single variable measured multiple times, a histogram shows how the values are distributed. 0.3 Indicate total # About 26% of of values on 0.25 measurements graph or in 27 total measurements are between 55 Fraction of Measurements caption. 0.2 and 56. 0.15 This is 50 and This distribution < 51. Next 0.1 is left-skewed. category is 51 0.05 and < 52. No overlap. 0 All categories 7 8 9 0 1 2 3 4 5 6 7 8 the same size. -4 -4 -4 -5 -5 -5 -5 -5 -5 -5 -5 -5 46 47 48 49 50 51 52 53 54 55 56 57 Measured Calibration Factor Exercise: Construct a Histogram Now that you have seen a histogram generated by Excel, you should practice by making one yourself. On the next slide, you will be directed to load a data file and make a histogram to display the data. Each bullet on the slide is a hint for what you should do next to complete the histogram. You should be able to figure out how to make Excel do these things. Exercise: Construct a Histogram Load calibration.xls from the ChE 408 homepage Count the total number of entries under cal factor =count(b2:b28) Set up a list of categories for the histogram ‟46-47 Count the number of entries in each category =countif(b$2:b$28,”<47”)-countif(b$2:b$28,”<46”) Calculate the fraction of entries in each category Insert a column chart on a new page, fraction vs. category End of Histogram Exercise Continue to learn about more ways to summarize data. More Ways to Summarize Data Plots Time series (use “xy”, not “line”, to see trends over course of experiment) y vs. x (use “xy”, not “line”, to investigate correlations) Column chart (to make histogram) Pie chart (to show proportions; rarely used) Descriptive Statistics Measures of location Measures of variability Tools > Data Analysis > Descriptive Statistics Plots to Summarize Data 60 60 Look! No gray Dependent variable. background. Cal Factor depends Measured Calibration Factor Measured Calibration Factor 55 55 on sample size. 50 50 Axes scaled so Independent variable data covers graph. 45 45 3-Jul 13-Jul 23-Jul 2-Aug 12-Aug 22-Aug 0 50 100 150 200 250 300 Calibration Date Amount of Sample, Torr Figure 2a. Example of a time-series Figure 2b. Example of a „y vs. x‟ plot, showing that values from the plot, showing that lower values of first day appear to be lower and more “calibration factor” are obtained with scattered than the rest. smaller amounts of sample. Descriptive Statistics – “Location” Mean (average) Strongly affected by unusual points =average(b2:b28) Median (50% of data higher, 50% of data lower) Seldom strongly affected by unusual points =median(b2:b28) Descriptive Statistics – “Variability” Standard Deviation, =stdev(b2:b28) A measure of how widely the data is spread around the mean Strongly affected by unusual points Relative Standard Deviation =stdev(b2:b28)/average(b2:b28) Usually given as percentage Format > Cells > Percent Variance =(stdev(b2:b28))^2 Descriptive Statistics – “Variability” Interquartile range, IQR Width of the middle 50 % of the data Seldom strongly affected by unusual points IQR = q0.75 – q0.25 q0.75: 75% of data is lower, =percentile(b2:b28,0.75) q0.25: 25% of data is lower, =percentile(b2:b28,0.25) Range Values spanned by the data = max(b2:b28) – min(b2:b28) Strongly affected by unusual points Descriptive Statistics – “Variability” Confidence Interval Mean +/- Uncertainty covers the confidence interval “95 % confidence limits” means that if you calculate a mean and confidence interval from a set of replicate measurements (the sample), you can be 95 % sure that if you made an infinite number of measurements (the population), their mean would lie within the confidence interval Descriptive Statistics The mode is the value Calibration Factor The 95% that appears most often. Mean Standard Error 54.24020706 0.420502872 confidence Each value appears just Median Mode 54.86773908 #N/A interval is once in this data set. Standard Deviation Sample Variance 2.184997019 4.774211972 54.2 ± 0.9. Kurtosis 4.710663042 Skewness -1.881680628 Range 10.22476031 Minimum 46.67110279 Maximum 56.8958631 Kurtosis and skewness Sum Count 1464.485591 27 describe the symmetry Confidence Level(95.0%) 0.864356599 of the distribution. We won‟t discuss these further in this class. This table is as-generated by Excel. Round to appropriate sig figs before reporting these numbers. Exercise: Summarizing Data Now that you have seen graphical and statistical methods for summarizing data, you should practice. On the next slide, you will be directed to load a data file and generate plots and descriptive statistics. Each bullet on the slide is a hint for what you should do next. You should be able to figure out how to make Excel do these things. Exercise: Summarizing Data Load calibration.xls from the ChE 408 homepage Construct a time-series plot for the calibration factor Plot calibration factor vs. sample amount Use descriptive statistics to summarize the “location” and “variability” of the calibration factor Calculate each statistic individually Use Tools > Data Analysis > Descriptive Statistics End of Summarizing Data Exercise Continue to learn about fitting equations to data. Fitting Data For linear regression, use Tools > Data Analysis > Regression Don‟t use trendline – no statistics X Variable slope R2 – Coefficient of Determination A value near 1.0 means that the value of y depends strongly on the value of x. Does NOT mean the dependence is linear. Use residual plots to show linearity Residuals should be random around zero Use p-values to show significance of linear fit p = probability that points are arranged like this by chance Linear Regression: Example of Excel Output Round to appropriate 99% of the variation in y is explained by variation sig figs before reporting in x. The remainder may be random error, or may these numbers. be explained by some factor other than x. SUMMARY OUTPUT Ratio of variability explained y = (0.58±0.02)x + (0.015±0.010) Regression Statistics Multiple R 0.99476849 by model to leftover variability. R Square 0.98956435 Adjusted R Square 0.98923824 High number means model Standard Error 0.01439391 Observations 34 explains most variation in data. ANOVA Probability of df SS MS F Significance F getting that value of Regression 1 0.628683679 0.629 3034.41 2.782E-33 Residual 32 0.006629909 2E-04 F by randomly Total 33 0.635313588 sampling from Coefficients Standard Error t Stat normally- P-value Lower 95% Upper 95% Intercept 0.015017 0.004778837 3.142 distributed data. 0.0036 0.0052828 0.02475116 Slope 0.5825802 0.010575927 55.09 2.8E-33 0.5610378 0.60412264 Probability of getting a slope or intercept this much different from Confidence limits on zero by randomly sampling from slope and intercept. normally distributed data. Residual Plot This plot is as-generated by The random distribution of Excel. Fix the formatting before residuals around zero suggests including this plot in a report. that the model accounts for all predictable variation in y, and all that is left is random n-Butane Residual Plot uncertainty. 0.03 0.02 A residual for a 0.01 given value of x is Residuals 0 the difference 0.000 -0.01 0.200 0.400 0.600 0.800 1.000 between the -0.02 measured value of -0.03 y, and the value of -0.04 y calculated using -0.05 the regression n-Butane model. Exercise: Linear Regression Now that you have seen the kind of information Excel gives you when you fit a straight line to data, you should practice. On the next slide, you will be directed to load a data file, plot the data, and fit a line to it. Each bullet on the slide is a hint for what you should do next. You should be able to figure out how to make Excel do these things. Exercise – Fitting Data Load Jul2627data.xls from the ChE 408 homepage Plot i-butane vs. n-butane. Does this plot appear linear? Perform linear regression on i-butane vs. n-butane. Are these values strongly correlated? Are these values linearly correlated? What is the linear equation relating these values? What uncertainty would you place on the values of the slope and intercept for this linear relationship? End of Summarizing Data Exercise Continue to learn about fitting equations to data. Hypothesis Testing Can Answer Questions Like . . . Is this value significantly different from the one I expected? Is the variability of this data significantly different from what I expected? Are these two sets of data significantly different from each other? When you say “significant”, back it up with statistics. Hypothesis Testing H0 = null hypothesis There is no significant difference H1 = alternative hypothesis There is a significant difference (two-sided) This is significantly higher / lower (one-sided) You cannot prove the null hypothesis. You can at best say the data offer no significant evidence against it. Testing “Location” p-value (Significance Level) Significance level of 5 % (p < 0.05) means at most a 5 % chance that this difference is due to random uncertainty and not real. Excel calculates this as part of a t-test p-value > significance level: accept H0 Cannot prove that observations and expectations differ Testing “Location”: t-test t (difference between samples) / (variability) Excel will automatically calculate t-values to compare: Means of two datasets with equal variances Means of two datasets with unequal variances Two sets of paired data abs(t-score) < abs(t-critical): accept H0 Cannot prove that observations and expectations differ t-Test – Independent Samples Round to appropriate Change this if you want to know whether sig figs before reporting the means of the two samples differ by at these numbers. least some specified amount. Probability of drawing two t > tcritical(one-tail), so the t-Test: Two-Sample Assuming Unequal Variances random samples from a mean of sample #1 is Variable 1 Variable 2 normally distributed population significantly larger than Mean 54.99931 50.90014 and getting the mean of sample Variance 1.262476 7.290012 #1 this much larger than the the mean of sample #2. Observations 22 5 Hypothesized Mean Difference 0 mean of sample #2. The mean df 4 of sample #1 is larger at a t Stat P(T<=t) one-tail 3.329922 0.014554 significance level of =0.02 (or t Critical one-tail 2.131846 “at the 2 % significance level”), P(T<=t) two-tail 0.029108 because p < 0.02. t Critical two-tail 2.776451 t > tcritical(two-tail), so the mean of sample #1 is significantly different from P < 0.03, so the means of the two the mean of sample #2. samples are different at the 3 % significance level. Exercise: Hypothesis Testing Now that you have seen the results of some t-tests and p-tests in Excel, you should practice. On the next slide, you will be directed to load a data file and make decisions about the data. Each bullet on the slide is a hint for what you should do next. You should be able to figure out how to make Excel do these things. Exercise - Hypothesis Testing Load calibration.xls from the ChE 408 homepage Calculate the mean, std dev, and variance for only the first day. Calculate the mean, std dev, and variance excluding the first day. Use a t-test and a p-test to learn whether the calibration factor measured the first day differs significantly from the rest Tools > Data analysis > t-Test: Two sample . . . Can use “equal variances” if ratio of variances < 3. End of Hypothesis Testing Exercise This concludes the Excel tutorial.