TASK FORCE SUMMARY
Don’t ‘pretend’ it’s something it’s not
Hypothesis generating vs. Hypothesis testing
Or exploratory vs. confirmatory
Both can be of great value and they are not mutually exclusive
even within a study
Populations can be anything; make sure it is clear which one you are trying to speak to
Getting from that population to your data can be quite a complex undertaking; make sure it is clear how the data were arrived at
Critical in experimental design
Do not think you are random
Humans are terrible at randomizing, so let software decide assignment (a minimal sketch follows this list)
In cases of non-experimental design, ‘comparison’ groups may be implemented, but they are not true controls and should not be presented as such
Control can be introduced via design and analysis
Random assignment and control do not by themselves provide causality
Causal claims are subjective ones, made based on evidence, control of confounds, contiguity, common sense, etc.
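The sketch: software-driven assignment with a documented seed. The participant IDs and two-arm design are hypothetical, and any real study would also document the procedure itself.

    # Software-driven random assignment (illustrative sketch, not any
    # official procedure). A fixed, documented seed keeps it auditable.
    import numpy as np

    rng = np.random.default_rng(seed=20230101)
    participants = [f"P{i:03d}" for i in range(1, 41)]   # hypothetical IDs
    arms = np.repeat(["treatment", "control"], len(participants) // 2)
    rng.shuffle(arms)                  # the software, not a human, decides
    assignment = dict(zip(participants, arms))
    print(assignment["P001"])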
Precision in naming is a must
Variable names should reflect operational definitions of constructs
For example: ‘Intelligence’ no; ‘IQ test score’ yes
Nothing about how that value is derived should be left to question
Range and calculations must be made extremely clear
Reliability standards in psychology are low, and somehow getting worse
The easiest way to ruin a study and waste a lot of time is using a poor measure; it only
takes one to muck up everything
You are much better off assuming that a previously used instrument was a bad idea than
assuming that it’s ok because someone else used it before
Even when using a well-known instrument you should report the reliability for your
study whenever possible; one common estimate, Cronbach’s alpha, is sketched below.
This not only informs readers about the populations for which a measure may or may not be reliable, it is
also crucial for meta-analysis
Recall that there is no single ‘Reliability’ for an instrument; there are reliability estimates for
that instrument for various populations
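The sketch: Cronbach’s alpha computed from a simulated respondents-by-items matrix. The function name and the fake Likert data are illustrative only.

    # Cronbach's alpha for a respondents-by-items score matrix.
    import numpy as np

    def cronbach_alpha(items):
        # alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1).sum()
        total_var = items.sum(axis=1).var(ddof=1)
        return k / (k - 1) * (1 - item_vars / total_var)

    rng = np.random.default_rng(0)
    scores = rng.integers(1, 6, size=(200, 10)).astype(float)  # fake Likert data
    print(f"alpha = {cronbach_alpha(scores):.2f}")  # random items -> alpha near 0

Report this per study and per population, alongside the estimate’s type (internal consistency here, not test-retest).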
Methods of collection must be sound, and every aspect of them must be
communicated so others can be confident there is no bias
“Missing” data can be accounted for in a variety of ways these days,
and the worst way to handle it is to completely ignore incomplete cases, which can
introduce extreme bias into a study. A sketch contrasting deletion with imputation follows.
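This sketch assumes scikit-learn is available and (optimistically) that the data are missing at random; the tiny DataFrame is fabricated purely for illustration.

    # Listwise deletion vs. model-based imputation (illustrative only).
    import numpy as np
    import pandas as pd
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, 5.0],
                       "y": [2.1, np.nan, 6.2, 8.1, 9.9]})

    dropped = df.dropna()  # listwise deletion throws away 2 of 5 cases
    imputed = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(df),
                           columns=df.columns)  # keeps all 5
    print(len(dropped), "cases kept after deletion;", len(imputed), "after imputation")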
Power and sample size
Don’t be lazy, get a big sample.
It is very easy to calculate the sample size needed for typical analyses (a sketch follows below)
However, there are many problems with such estimates, both theoretical and
practical, as we will discuss later
The main thing is that it should be clear how the present sample size was arrived at
Obviously any problems that arise should be made known
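For instance, an a priori calculation for an independent-samples t test takes one line with statsmodels; note that the assumed planning effect size (d = 0.5 here) is exactly where the theoretical trouble lives.

    # A priori sample size for a two-sided independent-samples t test.
    from statsmodels.stats.power import TTestIndPower

    n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                              power=0.80,
                                              alternative="two-sided")
    print(f"n per group: {n_per_group:.1f}")  # about 64 per group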
You will be able to make such problems known easily with a thorough initial
examination of the data:
Search for outliers, miskeys, etc.
Test statistical assumptions
Identify missing data
Inspecting your data is not fishing, snooping, or whatever; it is
required for doing minimally adequate research
Visual methods are best and highlight issues easily; a minimal screening pass is sketched below
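The screening sketch, with a hypothetical file name and columns:

    # Quick initial examination: summaries, missingness, and pictures.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("study_data.csv")  # hypothetical file name
    print(df.describe())                # miskeys/impossible values jump out
    print(df.isna().sum())              # where the missing data live
    df.hist(figsize=(8, 6))             # distributions: skew, gaps, spikes
    df.plot.box()                       # outliers at a glance
    plt.show()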
From the article: “if you assess hypotheses without examining your
data, you risk publishing nonsense.”
“If you assess hypotheses without examining your data, you will
publish nonsense.” Fixed.
Your analysis is determined before data collection, not after
If you do not know what analysis to run and you’ve already collected the data, you just
wasted a lot of time
Theory → Research hypotheses → Analysis ‘family’ → Appropriate measures for those analyses → Data collection
The only exception to this is when using archival data, but then if doing that, you have a
whole host of other problems to deal with.
“Do not choose an analytic method to impress your readers or to deflect criticism.”
Unfortunately it seems common in psych for researchers to choose the analysis before the
research question, mostly for the former reason (at which point they do it poorly and produce
the opposite effect on readers who do know the analysis)
While “the simpler classical approaches” are fine, I do not agree that they should
have special status, if for no other reason than that neither data nor sufficiently
considered research questions conform to their use except on rare occasion.
Furthermore, we also have the tools to do much better yet just as easily understood
analyses, and saying an analysis is ‘complex’ is often more a statement about
familiarity than it is about difficulty.
Regarding programs specifically
“There are many good computer programs for analyzing data.”
“If a computer program does not provide the analysis you need, use
another program rather than let the computer shape your thinking.”
Regarding not letting the program do your thinking for you.
“Do not report statistics found on a printout without understanding how
they are computed or what they mean.”
“There is no substitute for common sense.”
Is it just me or are these very clear and easily understood statements?
Would you believe I’ve actually had to defend them?
“You should take efforts to assure that the underlying assumptions
required for the analysis are reasonable given the data.”
Despite this, it is often difficult to find any mention of
checking assumptions, or of appropriate and modern ways
of dealing with the problem of not meeting them (a basic check is sketched below).
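The sketch, on simulated two-group data; plots should accompany, not be replaced by, such tests.

    # Normality and homogeneity-of-variance checks via scipy.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    g1 = rng.normal(0.0, 1.0, 50)   # hypothetical group data
    g2 = rng.normal(0.3, 1.5, 50)

    print(stats.shapiro(g1))        # normality within each group
    print(stats.shapiro(g2))
    print(stats.levene(g1, g2))     # equality of variances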
“Never use the unfortunate expression ‘accept the null hypothesis.’”
Outcomes are fuzzy, that’s ok.
“Always present effect sizes for primary outcomes.”
“Always present effect sizes.” Fixed.
Small effects may still have practical importance, or the finding
may matter more to others than it does to you.
Reporting uncertainty of estimate is important. Do it. And do it
for the effect sizes.
“Interval estimates should be given for any effect sizes involving
principal outcomes.”
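One way to honor both recommendations at once: a standardized mean difference with a bootstrap interval, sketched here on simulated data (the 95% level and 5,000 resamples are conventional choices, not requirements).

    # Cohen's d with a bootstrap percentile confidence interval.
    import numpy as np

    rng = np.random.default_rng(42)
    a = rng.normal(0.5, 1.0, 60)    # hypothetical groups
    b = rng.normal(0.0, 1.0, 60)

    def cohens_d(x, y):
        pooled_sd = np.sqrt(((len(x) - 1) * x.var(ddof=1) +
                             (len(y) - 1) * y.var(ddof=1)) /
                            (len(x) + len(y) - 2))
        return (x.mean() - y.mean()) / pooled_sd

    boots = [cohens_d(rng.choice(a, len(a)), rng.choice(b, len(b)))
             for _ in range(5000)]   # resample with replacement
    lo, hi = np.percentile(boots, [2.5, 97.5])
    print(f"d = {cohens_d(a, b):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")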
First, pairwise methods… were designed to control a familywise error
rate based on the sample size and number of comparisons. Preceding
them with an omnibus F test in a stagewise testing procedure defeats this
design, making it unnecessarily conservative.
Second, researchers rarely need to compare all possible means to
understand their results or assess their theory; by setting their sights
large, they sacrifice their power to see small.
Third, the lattice of all possible pairs is a straightjacket; forcing
themselves to wear it often restricts researchers to uninteresting
hypotheses and induces them to ignore more fruitful ones.
Again, fairly straightforward in its recommendation: do not ‘lay
waste with t-tests’. A single planned contrast, sketched below, often answers the actual question.
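For example, if the only real question is whether the treatments beat control, one contrast answers it; this sketch assumes three equal-n groups and fabricated data.

    # One planned contrast (mean of two treatments vs. control) instead
    # of every pairwise t test.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    n = 30
    groups = [rng.normal(m, 1.0, n) for m in (0.0, 0.6, 0.7)]  # control, tx1, tx2
    weights = np.array([-1.0, 0.5, 0.5])   # encodes the single question

    means = np.array([g.mean() for g in groups])
    mse = np.mean([g.var(ddof=1) for g in groups])  # pooled error variance
    se = np.sqrt(mse * np.sum(weights**2) / n)
    t = weights @ means / se
    df = len(groups) * (n - 1)
    p = 2 * stats.t.sf(abs(t), df)
    print(f"t({df}) = {t:.2f}, p = {p:.4f}")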
“There is a variant of this preoccupation with all possible pairs that
comes with the widespread practice of printing p values or asterisks
next to every correlation in a correlation matrix… One should ask
instead why any reader would want this information.”
People do not need an asterisk to tell them whether a correlation is
strong or not.
The correlation is an effect size and should be treated accordingly
Humans are good pattern recognizers; if there is a trend they will likely
spot it on their own, or you might make it more apparent in summary
statements that highlight such patterns. Putting asterisks all over the
place doesn’t imply anything more than that you are going to prop up
poor results with statistical significance, or worse, that some ‘fishing’
went on. A pattern-first alternative is sketched below.
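The sketch plots the correlation matrix so that magnitudes, not asterisks, carry the message; variable names and data are fabricated.

    # A correlation heatmap: correlations shown as effect sizes.
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    df = pd.DataFrame(rng.normal(size=(100, 5)),
                      columns=["anx", "dep", "stress", "sleep", "gpa"])
    corr = df.corr()

    fig, ax = plt.subplots()
    im = ax.imshow(corr, vmin=-1, vmax=1, cmap="RdBu_r")
    ax.set_xticks(range(len(corr)), corr.columns)
    ax.set_yticks(range(len(corr)), corr.columns)
    fig.colorbar(im, label="r")
    plt.show()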
Establishing causality is tricky business, especially since
it can’t technically be done
There is no causality statistic, and neither causal modeling
nor experimentation establish it in and of themselves
However, we do assume causal relations based on
evidence and careful consideration of the problem
itself; just be prepared for a difficult undertaking
when attempting to establish them.
Tables and figures
People simply do not take enough time or put enough thought into how their results are displayed
Like anything else, you need to be able to hold your audience’s attention
People spend a lot of time going back over tables and figures, more than they spend
rereading the text.
It is very easy to display a lot of pertinent information in a fairly simple graph, and this
is the goal: max info, min clutter (a sketch closes this section).
Furthermore, what can be displayed in a meaningful way graphically is not restricted
Any number of graphs you’ve never come across may be the best
This is where you can really be creative, allow yourself to be!
Unfortunately, many limit themselves to the limitations of their statistical program, and
in trying to spruce up bad graphics (e.g. the 3-D bar chart) end up making interpretation worse
Stats programs are in general behind in their offerings compared to the graphics programs
available (obviously), and some are so archaic as to actually make customizing simple
graphs a labor-intensive enterprise.
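The promised sketch of max info, min clutter, using nothing fancier than matplotlib (groups and values are made up): means, their 95% intervals, and nothing else.

    # Group means with 95% confidence intervals; no 3-D, no chartjunk.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(5)
    data = {g: rng.normal(m, 1.0, 40) for g, m in
            [("control", 0.0), ("tx A", 0.5), ("tx B", 0.8)]}

    means = [v.mean() for v in data.values()]
    cis = [1.96 * v.std(ddof=1) / np.sqrt(len(v)) for v in data.values()]

    fig, ax = plt.subplots()
    ax.errorbar(range(3), means, yerr=cis, fmt="o", capsize=4)
    ax.set_xticks(range(3), list(data.keys()))
    ax.set_ylabel("Outcome (mean, 95% CI)")
    ax.spines[["top", "right"]].set_visible(False)
    plt.show()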
Credibility, generalizability, and robustness
Results do not reside in a vacuum; they must be placed within the context
of prior and ongoing relevant studies
Do not overgeneralize. In the grand scheme of things one study is
rarely worth much, and no study has value without that context
Thoughtfully make recommendations on issues to be addressed by
future research and how they might be addressed
“Further research must be done…” was already known before you
started coming up with theories to test. Might as well say “Future
research should be printed in black ink.”; it’d be about as useful.
The real problem
The initial approach laid out
Fisher, R. A. (1925). Statistical Methods for Research Workers.
Fisher, R. A. (1935). The Design of Experiments.
Neyman, J. (1937). “Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability”. Philosophical Transactions of the Royal Society of London, Series A.
Immediate criticism
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association.
Berkson, J. (1942). Tests of significance considered as evidence. Journal of the American Statistical Association.
Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46.
Recent criticism
Harlow, Mulaik, & Steiger (1997). What If There Were No Significance Tests?
Problems with power
Cohen, J. (1969). Statistical Power Analysis for the Behavioral Sciences.
On the utility of exploration
Tukey, J. W. (1977). Exploratory Data Analysis.
Emphasis on use of relevant graphics
Tufte, E. R. (1983). The Visual Display of Quantitative Information.
Effect sizes
Correlation coefficient: Pearson, K. (1896). Regression, heredity and panmixia. Philosophical Transactions A.; Peirce, C. S. (1884). The numerical measure of the success of predictions. Science.
Standardized mean difference: Cohen, J. (1969). Statistical Power Analysis for the Behavioral Sciences.
Issues regarding causality
Aristotle, Physics II 3.
Hume, D. (1739). A Treatise of Human Nature.
Related methods: SEM, propensity score matching.
Some ‘Modern’ methods
Efron, B. (1979). “Bootstrap Methods: Another Look at the Jackknife”. The Annals of Statistics, 7(1).
Robust methods: Huber, P. J. (1981). Robust Statistics.
Bayes, T. (1764). Essay Towards Solving a Problem in the Doctrine of Chances.
Robbins, H. (1956). An Empirical Bayes Approach to Statistics. Proceedings of the Third Berkeley Symposium on Mathematical Statistics.
Structural equation modeling: Wright, S. (1921). “Correlation and causation”. Journal of Agricultural Research, 20.
The real problem
The real issue is that most of these problems and issues have existed since the beginning
of statistical science, have been noted since the beginning, and have had many solutions offered
for decades, and yet much of psych research exists apparently oblivious to this, or…
Are researchers simply ignoring them?
Task Force on Statistical Inference initial meetings and recommendations
Official paper 1999
Follow-up study 2006
Statistical Reform in Psychology: Is Anything Changing?
Cumming et al.
Change, but Little Reform Yet
“At least in these 10 journals1, NHST continues to dominate overwhelmingly. CI reporting is
increasing but still low, and CIs are seldom used for interpretation. Figures with error bars are
now common, but bars are usually SEs, not the recommended CIs...2
If we can’t expect the ‘top’ journals to change in a reasonable amount of time, what are
we to make of our science?