A Brief History of Statistics
Statistics, the science of learning from data, is a relatively new discipline. One
can divide the history of Statistics into three periods using the years 1900 and
1970 as dividing points.
• In the early days of Statistics (before 1900), much of the statistical work
was devoted to data analysis, including the construction of graphical displays.
There was little work done on inferential statistics. The foundations
of Bayesian inference had been developed by Bayes and Laplace in the
18th and early 19th centuries.
• The foundations of statistical inference were developed in the period be-
tween 1900 and 1970. Karl Pearson developed the chi-square goodness of ﬁt
procedure around the year 1900 and R. A. Fisher developed the notions of
suﬃciency and maximum likelihood in this period. Statistical procedures
are evaluated in terms of their long-run behavior in repeated sampling.
For this reason, these procedures are known as frequentist methods. Prop-
erties such as unbiasedness and mean square error are used to evaluate
procedures. Some prominent Bayesians such as Harold Jeﬀreys, Jimmie
Savage, and I. J. Good made substantial contributions during this period,
but the frequentist methods became the standard inferential methods in
the statistician’s toolkit.
• In the last 40 years, there has been great development of new statistical
methods, especially computationally demanding methods such as the
bootstrap and nonparametric smoothing. Due to the recent availability
of high-speed computers together with new simulation-based fitting algorithms,
Bayesian methods have become increasingly popular. In contrast
to the middle period of Statistics, when frequentist methods were dominant,
we currently live in a frequentist/Bayesian world where statisticians
routinely use Bayesian methods in situations where this inferential
perspective has particular advantages.
One fundamental inference problem is learning about the association pattern
in a 2 by 2 contingency table. Suppose we sample data values that are cat-
egorized with respect to the presence and absence of two variables A and B
and one observes the following table of counts.
             Var B
Var A     yes    no
  yes      a      b
  no       c      d
There are two common questions that one is interested in answering. First,
is there a signiﬁcant association structure in the table? Second, if variables A
and B are indeed dependent, one is interested in estimating the strength of
the association.
As an example, suppose one observes the following table counts.
             Var B
Var A     yes    no
  yes      10     0
  no        2     5
One constructs a statistical test of the hypothesis of independence to see
if there is significant association in the table. The standard test of independence
is based on Pearson's chi-squared test. One implements this testing
procedure in R with the function chisq.test and one observes the following
output for these data.
Pearson’s Chi-squared test with Yates’ continuity correction
X-squared = 6.971, df = 1, p-value = 0.008284
In chisq.test(y) : Chi-squared approximation may be incorrect
We note that the p-value of the test statistic is 0.008284 which indicates that
there is signiﬁcant evidence that the two variables are dependent. But we see
a warning in the output saying that the accuracy of this p-value computation
is in doubt.
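For readers working outside R, the same test is available in other statistics libraries. Here is a minimal sketch using Python's scipy (the library choice is an assumption, not part of the original analysis); with the default continuity correction, it matches the chisq.test output above:

```python
from scipy.stats import chi2_contingency

# Observed 2 x 2 table from the example: rows = Var A (yes/no), cols = Var B (yes/no)
table = [[10, 0],
         [2, 5]]

# correction=True applies Yates' continuity correction, matching R's chisq.test default
stat, pvalue, df, expected = chi2_contingency(table, correction=True)
print(stat, pvalue, df)  # roughly 6.971, 0.008284, 1
```

scipy issues no warning about small expected counts, so the caution in the R output still applies: the chi-square approximation itself is in doubt here regardless of the software used.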
What is going wrong? The chi-squared test is based on the test statistic

X = Σ (o − e)² / e,

where the sum is taken over the four cells of the table and o and e represent,
respectively, the observed cell count and the estimated expected cell
count under the independence assumption. Asymptotically, under the
assumption of independence, X has a chi-squared distribution with
one degree of freedom. The displayed p-value is the probability that a chi-square(1)
random variable exceeds the observed value of the test statistic. When the cell counts are large, the distribution
of X is approximately chi-square. But, when the counts are small (as in this
example), the distribution of X may not be approximately chi-square(1) and
so the accuracy of the p-value calculation is in doubt.
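As a check on the mechanics, the expected counts and the test statistic can be computed by hand. The sketch below uses Python with numpy (an assumed choice); note that the R output above reports the Yates-corrected version of X, which subtracts 1/2 from each |o − e| before squaring:

```python
import numpy as np

obs = np.array([[10., 0.],
                [2., 5.]])

# Estimated expected counts under independence: e_ij = (row total)(column total)/n
n = obs.sum()
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n

# Yates-corrected version of X = sum (o - e)^2 / e, as reported by chisq.test
X = (((np.abs(obs - expected) - 0.5) ** 2) / expected).sum()
print(round(X, 3))  # 6.971, in agreement with the R output
```

The expected counts in the "no/no" and "yes/no" cells are roughly 2.1 and 2.9, well below the usual rule of thumb of 5, which is why the approximation is suspect.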
What can one do in this situation? A standard alternative test procedure is
Fisher’s exact test where the p-value is computed based on the hypergeometric
distribution. If one implements this test using the R function fisher.test,
one sees the following output.
Fisher’s Exact Test for Count Data
p-value = 0.003394
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
One obtains a new p-value of 0.003394 which is signiﬁcantly diﬀerent from
the “large sample” p-value of 0.008284, indicating that the accuracy of the
chi-square approximation was relatively poor. But this analysis raises new
questions.
1. What sampling model?
There are diﬀerent sampling models that can produce the observed table
counts. For example, one may be taking a single random sample of a
particular size and classifying each observation with respect to the two
variables – this is the multinomial sampling model. Alternatively, one
may be taking two independent samples; the “A-sample” is classiﬁed with
respect to variable B, and a second “not A-sample” is also classiﬁed with
respect to variable B – this is the product of binomials sampling model.
Or perhaps one assumes that the observed margins of the table are ﬁxed
and the only random quantity is the one count in the top left of the table
– this gives rise to the hypergeometric distribution under independence
that is the basis for Fisher’s exact test.
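Under this fixed-margins model, Fisher's exact p-value can be computed directly from the hypergeometric distribution. A sketch in Python using scipy (an assumed library choice), following the usual two-sided rule of summing the probabilities of all tables no more likely than the one observed:

```python
from scipy.stats import hypergeom

a, b, c, d = 10, 0, 2, 5
n = a + b + c + d                       # 17 observations in total
# Top-left cell count, with both margins fixed: population n, a+b "A yes"
# individuals, a+c draws (the "B yes" column)
rv = hypergeom(M=n, n=a + b, N=a + c)

# Two-sided p-value: total probability of tables as or less likely than observed
observed_prob = rv.pmf(a)
support = range(max(0, (a + c) - (c + d)), min(a + b, a + c) + 1)
pvalue = sum(rv.pmf(k) for k in support if rv.pmf(k) <= observed_prob)
print(round(pvalue, 6))  # 0.003394, matching the fisher.test output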
2. Does the choice of sampling model matter?
If one is unsure about the sampling method that produces the table, one
might hope that the test of signiﬁcance is insensitive to the choice of
sampling model. But this is not the case. The p-value is dependent on
the choice of model. Actually, there is a debate among frequentists on the
“proper” choice of sampling model in the test of independence.
3. What about estimating the association?
In this example, since the hypothesis of independence is clearly rejected,
the focus should be on the estimation of the association. A standard measure
of association in a two by two table is the odds ratio defined by

α = p11 p22 / (p12 p21),

where (assuming a multinomial sampling model) pij is the probability of
an observation in the ith row and jth column of the table. The maximum
likelihood estimate of α is given by

α̂ = ad / (bc).
For these data, we observe a = 10, b = 0, c = 2 and d = 5, resulting in
an infinite estimate for α. This is indicated in the fisher.test output.
Also the standard (asymptotic) 95% conﬁdence interval for α for these
data is given by (2.164093, ∞). We see that the observed zero count has
made it difficult to get reasonable point and interval estimates of the odds
ratio.
In this problem, we see some pitfalls in applying frequentist testing meth-
ods for this simple problem. Since there are small counts, standard methods
relying on asymptotic approximations seem unsuitable. But the computation
of an “exact” p-value is also unclear in this situation, since this computation
relies on the sampling distribution which may be unknown.
Frequentist methods also perform poorly for the estimation problem since
the infinite estimate of α is not reasonable. If one thinks about the cell probabilities,
one would expect all of these probabilities to be positive,
resulting in a finite value of the odds ratio. But there is no way to incorporate
these “prior beliefs” that the probabilities are positive into the frequentist estimation
problem. A standard ad hoc solution is to add a fake count of 1/2
to each cell count and to estimate α by computing the maximum likelihood
estimate on these adjusted counts:

α̂ = (a + 1/2)(d + 1/2) / ((b + 1/2)(c + 1/2)).
But it is not obvious that 1/2 is the correct choice of fake count to get a “best”
estimate of the odds ratio.
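The raw and adjusted estimates are easy to compare numerically; a short Python sketch (the 1/2 adjustment is the ad hoc fix described above, sometimes called the Haldane-Anscombe correction):

```python
a, b, c, d = 10, 0, 2, 5

# Raw MLE ad/(bc): the zero count b = 0 drives the estimate to infinity
raw = (a * d) / (b * c) if b * c > 0 else float("inf")

# Add a fake count of 1/2 to each cell before estimating
adjusted = ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))
print(raw, adjusted)  # inf 46.2
```

The adjusted estimate of 46.2 is at least finite, but, as noted above, nothing singles out 1/2 as the right amount to add.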
Pros and Cons of the Two Modes of Statistical Inference
The previous example illustrates some of the problems in applying frequentist
inferential methods, and so it is desirable to consider the alternative Bayesian
paradigm for inference. Here we make a short list of some positive and negative
aspects of the frequentist and Bayesian approaches to inference.
Positive Aspects of Frequentist Inference:
1. There are a number of good methods such as maximum likelihood and
most powerful tests, and good criteria for evaluating procedures such as
unbiasedness and mean square error.
2. These methods are automatic to apply and have wide applicability.
3. One is generally interested in evaluating procedures by their performance
in repeated sampling.
Negative Aspects of Frequentist Inference:
1. There is no general method for inference. One has to be clever to devise
good statistical procedures in situations where standard methods fail.
2. Frequentist methods can perform poorly. For example, frequentist methods
do not perform well for sparse contingency tables with one or more
zero counts.
3. One is unable to incorporate prior knowledge into the inference.
Positive Aspects of Bayesian Inference:
1. One has one recipe (Bayes’ rule) for statistical inference.
2. One can formally incorporate prior information into the analysis.
3. Nuisance parameters are easily handled in a Bayesian analysis.
Negative Aspects of Bayesian Inference:
1. Bayesian thinking requires more thought, with the introduction of a prior
distribution.
2. From a calculation perspective, it can be diﬃcult to implement Bayesian
methods, although powerful computational tools exist.