The Insignificance of Significance Testing

This week, scientists from around the world called for an end to the over-reliance on statistical significance testing as a means of establishing what constitutes good science. The problem, it seems, is that the general public, and many researchers, do not understand the significance of statistical significance. Tests of statistical significance suffer from many problems, conveniently overlooked by publishers and the general public alike:

The issue of large sample sizes: Large sample sizes result in small differences being statistically significant, which in turn leads to…
Statistical significance is often disconnected from practical significance: Often you will find statistically significant differences between groups that in real terms are meaningless (e.g. a variation of next to nothing between two groups).
Non-significance means non-publishable: Because journals rely so heavily on significant results, research with non-significant results tends not to be published. The net effect is a distortion of findings in the field.
The 0.05 level is arbitrary: On discussion boards, there have often been proposals to make this significance threshold more stringent. The fear is that this will result in fewer findings meeting the grade.
Inferential statistics rest on assumptions about the data that are often simply overlooked.
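
To make the first two points concrete, here is a minimal sketch in Python. It assumes a two-sample z-test with equal group sizes and unit variance, and a hypothetical standardized difference of 0.02 standard deviations; the function name and the sample sizes are illustrative only. The effect size never changes, yet the verdict flips once the sample is large enough.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test for a standardized
    mean difference d, assuming equal group sizes and unit variance."""
    z = d * sqrt(n_per_group / 2)  # the test statistic grows with sqrt(n)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical effect: the groups differ by 0.02 standard deviations,
# which would be practically negligible in most applied settings.
d = 0.02
p_values = {n: two_sided_p(d, n) for n in (100, 10_000, 100_000)}

for n, p in p_values.items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n per group = {n:>7,}: d = {d}, p = {p:.4g} ({verdict})")
```

The same trivial difference is "not significant" with 100 or 10,000 people per group and "significant" with 100,000, which is why a p-value on its own says little about whether a difference matters.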

We have long known that effect sizes more often than not fail to replicate (Stanley, Carter, & Doucouliagos, 2018; Trafimow et al., 2018). Indeed, some researchers have been far more scathing, calling the practice sorcery (Lambdin, 2012) or a cult (Ziliak & McCloskey, 2009). There have been calls for the practice to stop (Cumming, 2014), with some journals going as far as to stop publishing studies using this method (Trafimow & Marks, 2015), a position reinforced by the writer two years later (Trafimow, 2017).

Other researchers have not been so harsh on significance testing, noting that what is required is to supplement this type of testing with different methods and better research before drawing firm conclusions (McGough & Faraone, 2009; Wasserstein & Lazar, 2016). Moreover, the reliance on statistical testing overlooks cases where it is not fit for purpose, such as when trying to uncover incremental changes (Gelman, 2018).

In this regard, proposed strategies to improve the quality of the science being conducted include observation-oriented modelling (Grice, Yepez, Wilson, & Shoda, 2017) and more use of visual representation methods (Campitelli, Macbeth, Ospina, & Marmolejo-Ramos, 2017). Just last week I published in the Journal of Employment Counseling, arguing for the application of evaluation methods in research (Englert & Plimmer, 2019). In short, statistical tests on their own are never enough to draw the types of conclusions about effectiveness that psychologists often want to make (Fritz, Scherndl, & Kühberger, 2013).

So, given all of these problems with statistical testing, why does it continue? In my view, this is both a scientific and a commercial problem. Science is an incredibly challenging pursuit. From designing research studies to collecting data, myriad areas require a high level of sophistication, detail and robustness, so doing good science is inherently tricky. To establish relationships between variables takes a lot of effort. Simple science, represented by an inferential test between two measures, is a far easier pursuit. For these reasons, many studies will use an inferential test such as a correlation and consider this sufficient.

The low-level type of science discussed is also what the marketplace expects. Consumers aren’t inclined to look at studies in detail, nor to engage in detailed in-house research. They want to have a box to tick and move on. They want outcomes (“is this person in or out of the selection process”) rather than nuance (“well it depends on a range of variables not least of all their manager, and the recognition that their trait profile places them in a band with 68% of the population”). People want to be able to categorise. They want to know that a person is in or out; a study is good or bad; a test is useful or not. There are, however, nuances to all of these issues that require far more sophisticated research approaches and more sophisticated questions from the marketplace.

So, in short, do I see this problem going away in a hurry? Unfortunately not. Neither journals nor the marketplace incentivise the type of psychological science that the world needs, and this is very much the case in I/O psychology. The best we can do is keep pushing for change, but it will be a long, slow ride, so bring your lunch! I will leave the last word to Nature, which provides simple steps we can take to bring the science back into I/O psychology when utilising tests of statistical significance:

“We must learn to embrace uncertainty. One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence. Specifically, we recommend that authors describe the practical implications of all values inside the interval, especially the observed effect (or point estimate) and the limits.”
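
As a rough illustration of that recommendation, the sketch below computes a 95% interval using a normal approximation and then describes the practical implication of each limit and of the point estimate. The estimate, standard error, the "smallest meaningful effect" threshold and the function names are all hypothetical values chosen for illustration; they do not come from the Nature piece.

```python
from statistics import NormalDist


def compatibility_interval(estimate, std_error, level=0.95):
    """Normal-approximation interval around a point estimate."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return estimate - z * std_error, estimate + z * std_error


def describe(value, smallest_meaningful):
    """Label a value by its practical implication (hypothetical rule)."""
    if abs(value) >= smallest_meaningful:
        return "practically meaningful"
    return "practically negligible"


# Hypothetical study result: a standardized difference of 0.30
# with a standard error of 0.12, judged against an assumed
# smallest meaningful effect of 0.20.
estimate, std_error = 0.30, 0.12
smallest_meaningful = 0.20

lower, upper = compatibility_interval(estimate, std_error)
values = [("lower limit", lower), ("point estimate", estimate), ("upper limit", upper)]
report = {name: describe(v, smallest_meaningful) for name, v in values}

for name, value in values:
    print(f"{name}: {value:.2f} ({report[name]})")
```

Here the interval excludes zero, so the result would count as "significant", yet the data are also compatible with an effect too small to matter. Reporting all three values keeps that uncertainty in view.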

—

References

Campitelli, G., Macbeth, G., Ospina, R., & Marmolejo-Ramos, M. (2017). Three strategies for the critical use of statistical methods in psychological research. Educational and Psychological Measurement, 77(5), 881-895.

Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29.

Englert, P., & Plimmer, G. (2019). Moving from classical test theory to the evaluation of usefulness: A theoretical and practical examination of alternative approaches to the development of career tools for job seekers. Journal of Employment Counseling, 56(1), 20-32.

Fritz, A., Scherndl, T., & Kühberger, A. (2013). A comprehensive review of reporting practices in psychological journals: Are effect sizes really enough? Theory & Psychology, 23(1), 98-122.

Gelman, A. (2018). The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. Personality and Social Psychology Bulletin, 44(1), 16-23.

Grice, J. W., Yepez, M., Wilson, N. L., & Shoda, Y. (2017). Observation-Oriented Modeling: Going beyond “is it all a matter of chance”? Educational and Psychological Measurement, 77(5), 855-867.

Lambdin, C. (2012). Significance tests as sorcery: Science is empirical—significance tests are not. Theory & Psychology, 22(1), 67-90.

McGough, J. J., & Faraone, S. V. (2009). Estimating the size of treatment effects: Moving beyond p values. Psychiatry (Edgmont), 6(10), 21-29.

Wasserstein, R. L., Schirm, A. L., & Lazar, N. A. (2019). Moving to a world beyond “p < 0.05”. The American Statistician, 73(sup1), 1-19. DOI: 10.1080/00031305.2019.1583913

Stanley, T. D., Carter, E. C., & Doucouliagos, H. (2018). What meta-analyses reveal about the replicability of psychological research. Psychological Bulletin.

Trafimow, D. (2017). Using the coefficient of confidence to make the philosophical switch from a posteriori to a priori inferential statistics. Educational and Psychological Measurement, 77(5), 831-854.

Trafimow, D., Amrhein, V., Areshenkoff, C. N., Barrera-Causil, C. J., Beh, E. J., Bilgiç, Y. K., … & Chaigneau, S. E. (2018). Manipulating the alpha level cannot cure significance testing. Frontiers in Psychology, 9.

Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, 37(1), 1-. DOI: 10.1080/01973533.2015.1012991

Wasserstein, R. L., Lazar, N. A., American Statistical Association, Gelman, A., Loken, E., Johnson, V. E., … & Peng, R. (2016). Editorial. Basic and Applied Social Psychology.

Ziliak, S. T., & McCloskey, D. N. (2009). The cult of statistical significance. Joint Statistical Meetings, Section on Statistical Education, 2302–2316.