Flaws in Null Hypothesis Testing

Student’s Name

Institutional Affiliation

Course

Instructor’s Name

Date Due

Flaws in Null Hypothesis Testing

Jacob Cohen and Frank Schmidt were among the most prominent critics of null hypothesis significance testing (NHST), identifying what they saw as fundamental flaws that undermine its usefulness in psychological science. The main issue is misinterpretation of the p-value. According to Cohen (1990), the p-value is not the probability that the null hypothesis is true or that the results are due to chance. Instead, it is the probability of obtaining data at least as extreme as those observed, assuming the null hypothesis is true. The distinction is subtle but important: a small p-value does not by itself confirm the alternative hypothesis, and it says nothing about the size of an effect or its practical importance (Lindsay, 2025). In addition, this misinterpretation tends to force results into a binary ‘significant or not significant’ verdict. This artificial division overshadows more productive questions of effect size and practical significance.
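
The definition of the p-value can be made concrete with a small calculation. The sketch below is an illustration only, not drawn from the cited authors: it assumes a hypothetical coin-flip experiment (60 heads in 100 flips) and a helper named `binomial_two_sided_p`, and it computes the exact probability, under a null hypothesis of a fair coin, of any outcome no more probable than the one observed.

```python
from math import comb

def binomial_two_sided_p(k, n, p0=0.5):
    """Exact two-sided p-value: the probability, under H0 (success rate p0),
    of any outcome no more probable than the observed count k out of n."""
    pmf = [comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-12)  # small tolerance for floating-point ties
    return sum(prob for prob in pmf if prob <= cutoff)

# Hypothetical data: 60 heads in 100 flips of a coin assumed fair under H0.
p = binomial_two_sided_p(60, 100)
print(f"p = {p:.4f}")
```

The result is a statement about the data given the null hypothesis. It is not the probability that the coin is fair, and it carries no information about how large the bias would be if one existed.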

This binary thinking is closely tied to the problem of low statistical power, an error much discussed by both authors. Low-power studies, usually the result of small sample sizes, have little chance of detecting an effect even when it exists; hence, they suffer a high Type II error rate. As a result, the literature contains many false negatives, while publication bias, under which only statistically significant results tend to get published, inflates the share of Type I errors among published findings. Schmidt (1996) argues that NHST has never been a helpful exercise because, in psychological research, the null hypothesis is almost always false: there is nearly always some tiny, non-zero difference between groups. To him, testing a hypothesis known a priori to be false is irrational. The focus on rejecting the null hypothesis distracts from a much more important question: "How big is the effect?" Taken together, these problems are considerable.
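
The link between sample size and Type II error can be sketched numerically. The example below is a simplified illustration under stated assumptions: a two-sided, two-sample z-test with equal group sizes and a true standardized difference of 0.5. The function name `approx_power` and the chosen sample sizes are hypothetical, and the normal approximation slightly overstates power relative to a t-test.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for a true
    standardized mean difference d with n_per_group observations per group."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected value of the test statistic
    # Probability that the statistic lands in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

for n in (20, 64, 200):
    power = approx_power(0.5, n)
    print(f"n = {n:3d} per group: power = {power:.2f}, Type II rate = {1 - power:.2f}")
```

Under these assumptions, a study with 20 participants per group would miss a medium-sized effect more often than it detects it, which is the situation the critics describe.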

NHST has fostered a research culture preoccupied with getting p-values below an arbitrary threshold at the expense of replication, theoretical rigor, and cumulative knowledge. This is one source of the replication crisis, because statistically significant results based on small, unstable effects often fail to replicate in subsequent studies. As these critics advocate, the solution is to move away from sole dependence on NHST (Corotto, 2022). Researchers should instead estimate and report effect sizes and confidence intervals, which convey the size and precision of an effect. They should also put more emphasis on research design, power analysis, and the systematic accumulation of results through meta-analysis to build a larger and more informative body of scientific knowledge.
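
As a rough illustration of what reporting an effect size and a confidence interval involves, the sketch below computes Cohen's d and an interval for the raw mean difference. The scores are invented purely for demonstration, the function name `effect_size_and_ci` is hypothetical, and the interval uses a normal critical value, which is a large-sample approximation; with samples this small, a t critical value would be more appropriate.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def effect_size_and_ci(group_a, group_b, confidence=0.95):
    """Return Cohen's d and an approximate CI for the raw mean difference."""
    n_a, n_b = len(group_a), len(group_b)
    diff = mean(group_a) - mean(group_b)
    var_a, var_b = stdev(group_a) ** 2, stdev(group_b) ** 2
    # Pooled standard deviation, the usual denominator for Cohen's d.
    pooled_sd = sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    d = diff / pooled_sd
    # Standard error of the difference and a normal-approximation interval.
    se = sqrt(var_a / n_a + var_b / n_b)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return d, (diff - z_crit * se, diff + z_crit * se)

# Invented scores, for illustration only.
treatment = [5, 6, 7, 8, 9]
control = [3, 4, 5, 6, 7]
d, (low, high) = effect_size_and_ci(treatment, control)
print(f"Cohen's d = {d:.2f}, 95% CI for the difference = ({low:.2f}, {high:.2f})")
```

Unlike a bare p-value, this output states how large the difference is and how precisely it has been estimated.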

In addition, this flawed system promotes a culture in which a small p-value is confused with the practical importance of a finding, so science devotes enormous attention and resources to the wrong pursuits. With a large enough sample, a trivially small effect can be statistically significant yet have no theoretical or real-world relevance. Conversely, a modest but potentially beneficial effect can be labeled “non-significant” in an underpowered study. This distortion encourages questionable research practices such as p-hacking and data peeking as researchers try to reach the conventional .05 significance threshold for publication. The scientific literature becomes a distorted mirror of reality, full of inflated effect sizes and false positives that misguide future research and theory development (Lindsay, 2025). When the tool of inference is so easily misused and manipulated, it undermines the very purpose of science, which is to understand and accurately describe the world, and slows the accumulation of real knowledge.
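
The dependence of statistical significance on sample size can be shown with a short calculation. The sketch below is a simplified illustration: it assumes a two-sample z-test with equal group sizes and asks what p-value would result if the observed standardized difference equaled a given value exactly. The function name `expected_p_value` and the specific effect sizes and sample sizes are hypothetical choices.

```python
from math import sqrt
from statistics import NormalDist

def expected_p_value(d, n_per_group):
    """Two-sided p-value a two-sample z-test would give if the observed
    standardized mean difference were exactly d."""
    z_stat = d * sqrt(n_per_group / 2)
    return 2 * NormalDist().cdf(-abs(z_stat))

# A negligible effect in a very large sample versus a modest effect in a small one.
tiny_effect_big_n = expected_p_value(0.01, 200_000)
modest_effect_small_n = expected_p_value(0.4, 15)
print(f"d = 0.01, n = 200,000 per group: p = {tiny_effect_big_n:.4f}")
print(f"d = 0.40, n = 15 per group:      p = {modest_effect_small_n:.4f}")
```

Under these assumptions, the negligible effect falls below the .05 threshold while the effect forty times larger does not, so the p-value alone cannot indicate which finding matters.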

References

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304.

Corotto, F. S. (2022). Wise Use of Null Hypothesis Tests: A Practitioner's Handbook. Elsevier.

Lindsay, R. M. (2025). The null hypothesis statistical testing paradigm undermines knowledge acquisition in management accounting research: It needs to be abandoned. In Advances in Management Accounting (pp. 1-55). Emerald Publishing Limited.

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115.