http://tap.sagepub.com
Theory & Psychology
Statistical Significance Tests, Effect Size Reporting and the Vain Pursuit of Pseudo-Objectivity
Bruce Thompson
Theory Psychology 1999; 9; 191
DOI: 10.1177/095935439992007
The online version of this article can be found at: http://tap.sagepub.com/cgi/content/abstract/9/2/191
Published by:
http://www.sagepublications.com
Downloaded from http://tap.sagepub.com at LRITC/LIBRARY on January 27, 2010
Response
Statistical Significance Tests, Effect Size Reporting and the Vain Pursuit of Pseudo-objectivity
Bruce Thompson Texas A&M University and Baylor College of Medicine
Abstract. Two themes are argued in this comment on the use of statistical significance tests. First, effect sizes are an important aspect of results that should be reported. However, 10 empirical studies of articles in various disciplines (some of these studies covering several different journals) demonstrate that effect sizes are still not usually being reported, notwithstanding the admonitions of the 1994 American Psychological Association (APA) Publication Manual. Second, using statistical significance tests does not (and cannot) make scientists (or their science) objective.
KEY WORDS: effect size, replication, statistical significance
I appreciate the opportunity to respond to the thoughtful comments of Frick (1999). Frick (1996) and Cortina and Dunlap (1997) have been among the more reflective defenders of contemporary statistical testing practices (see Thompson, 1998). This and similar dialogues (e.g. Biskin, 1998; Robinson & Levin, 1997; Thompson, 1996, 1997; Vacha-Haase & Nilsson, 1998; Vacha-Haase & Thompson, 1998) help to focus the incremental but inexorable evolution of our discipline. Furthermore, as I have noted elsewhere, ‘seeing reasonable people discussing areas of agreement and disagreement may stimulate readers to think further about how we all can be more reflective in our scholarly practice’ (Thompson, in press).
Space limitations preclude a detailed response to all the arguments offered by Frick (1999). More importantly, the critical areas of disagreement must not become lost in the relative minutiae. Here I focus on two themes: (a) effect sizes are not being reported, notwithstanding the admonitions of the 1994 American Psychological Association (APA) Publication Manual and (b) using statistical significance tests does not (and cannot) make scientists (or their science) objective.
Theory & Psychology Copyright © 1999 Sage Publications. Vol. 9(2): 191–196 [0959-3543(199904)9:2;191–196;007664]
Resistance to Reporting Effect Sizes
As I noted in my initial article (Thompson, 1999), since the early summer of 1994 the APA Publication Manual has ‘encouraged’ authors to report effect sizes. In that article I also cited five empirical studies indicating that ‘encouraging’ better behavior has not been effective in altering behavior. I am now aware of an additional five empirical studies corroborating these previous empirical results (cf. Keselman et al., 1998; Nilsson & Vacha-Haase, 1998).
Frick (1999) notes that ‘Thompson and I agree that effect size should be reported’ (p. 183). Obviously, both of us should be distressed by the cumulating evidence that ‘encouraging’ better practice simply is not working. Some comment on the etiology of this continuing failure seems warranted.
In Thompson (in press) I explored two possible reasons why editorial policies ‘requiring’ effect size reports will eventually be necessary (e.g. Heldref Foundation, 1997; Murphy, 1997; Thompson, 1994). First, the APA Publication Manual admonition is inherently vague. That is, in my role as an editor of two journals, I find the admonition unclear as to how I am to evaluate and enforce whether authors submitting manuscripts have felt ‘encouraged’ to report effect sizes. Second, the admonition is inherently self-contradictory. As I noted in Thompson (in press), ‘To present an “encouragement” in the context of strict absolute standards regarding the esoterics of author note placement, pagination, and margins is to send the message, “these myriad requirements count, this encouragement doesn’t.” ’
Frick (1999) spends considerable time reviewing the myriad possible ways to compute effect sizes (Kirk, 1996, also presents a very helpful review). At this point I am not prepared to advocate the use of a single effect estimate in all cases. Rather, I continue to focus my energies on getting authors to report some/any effect sizes (i.e. on getting effect sizes to be required by editors).
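As a rough illustration of how little computation a basic effect size report requires, the sketch below computes two of the estimates named in this comment, Cohen's d and omega-squared, for a simple two-group layout. The function names and the scores are hypothetical and invented purely for illustration; the formulas are the usual pooled-standard-deviation d and the one-way fixed-effects omega-squared, and other estimators may be preferable in a given design.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def omega_squared(*groups):
    """Omega-squared for a one-way, fixed-effects ANOVA layout."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    ms_within = ss_within / df_within
    ss_total = ss_between + ss_within
    return (ss_between - df_between * ms_within) / (ss_total + ms_within)


# Hypothetical scores, invented purely for illustration.
treatment = [4, 5, 6, 7, 8]
control = [2, 3, 4, 5, 6]

d = cohens_d(treatment, control)
w2 = omega_squared(treatment, control)
print(f"Cohen's d = {d:.2f}, omega-squared = {w2:.3f}")
```

Either number could be reported alongside the test statistic whether or not the test reaches statistical significance.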
Statistical Significance as a Precondition
Frick (1999) argues that ‘p_calculated measures the strength of the evidence against the claim that the observed pattern was caused only by chance fluctuations’ (p. 187), and seems to be saying that effect sizes should not be reported or interpreted in the absence of statistical significance. I, on the other hand, am fully prepared to attend to effect sizes as long as I believe that they are both noteworthy and replicable, regardless of statistical significance.
For example, if I am the 200th author finding an ω² of 3 percent or a Cohen’s d of .2 for a cancer intervention’s impacts on permanent cure, I would not be deterred by the pseudo-tragedy that I and my predecessors have all reported results involving a p_calculated of .06. I concur with
Rosnow and Rosenthal’s (1989) view that, ‘surely, God loves the .06 [level of statistical significance] nearly as much as the .05’ (p. 1277).
In a context in which ‘virtually any study can be made to show significant results if one uses enough subjects’ (Hays, 1981, p. 293), must we ignore even demonstrably noteworthy and replicable results merely because they are not statistically significant? It would be ironic if those championing statistical tests on the basis that the tests pre-empt the interpretation of ‘chance’ results would then ignore all non-significant results that were clearly replicated. As Robinson and Levin (1997, p. 25) argued, a replication should be ‘worth a thousandth p value.’
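The dependence of p on sample size that Hays describes can be sketched numerically. The example below holds a standardized difference fixed at d = .2 and varies only the number of participants per group. It assumes two equal-sized groups and uses a large-sample normal approximation rather than an exact t test; the function name and the sample sizes are illustrative choices, not values from any study.

```python
from math import sqrt
from statistics import NormalDist


def approx_two_sided_p(d, n_per_group):
    """Approximate two-sided p for a two-group comparison.

    Assumes equal group sizes and a large-sample normal approximation,
    so z = d * sqrt(n / 2). This is a sketch, not an exact t test.
    """
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


effect = 0.2  # the same effect size in every row; only n changes
for n in (20, 50, 100, 200, 500, 1000):
    p = approx_two_sided_p(effect, n)
    print(f"n per group = {n:5d}   approximate p = {p:.4f}")
```

Under these assumptions the identical effect moves from clearly non-significant to clearly significant as n grows, which is why the p value alone says little about whether the effect is noteworthy.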
Effect Sizes as Grist for Meta-analyses
I have noted incidentally that reporting effect sizes even for results that are not statistically significant provides grist for the meta-analysis mill. And Frick (1999) notes that meta-analyses of small effects across studies can yield more statistical power to detect such effects, which otherwise might go unnoticed.
However, Frick (1999) argues that ‘it is not clear that researchers should report statistical effect size just for the sake of future meta-analyses. . . . [M]ost studies are never a part of a meta-analysis’ (p. 186). Of course, one will never know if results in a particular study will be useful in some future meta-analysis. This uncertainty seems an insufficient justification for an incomplete characterization of results. It would seem prudent for the researcher to be safe so that the someday future meta-analyst will not have to be sorry.
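To make the meta-analytic point concrete, here is a minimal sketch of fixed-effect, inverse-variance pooling of standardized mean differences. The five studies are hypothetical: each is assumed to report d = .2 with 177 participants per group, which gives a p of roughly .06 when taken alone under a large-sample normal approximation. The variance formula is the common large-sample approximation for d; a real synthesis would also need to consider heterogeneity and small-sample corrections.

```python
from math import sqrt
from statistics import NormalDist


def d_variance(d, n1, n2):
    """Common large-sample approximation to the sampling variance of d."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def two_sided_p(estimate, std_error):
    """Two-sided p from a normal approximation."""
    z = estimate / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def fixed_effect_pool(studies):
    """Inverse-variance pooled d and its standard error.

    `studies` is a list of (d, n1, n2) tuples.
    """
    weights = [1 / d_variance(d, n1, n2) for d, n1, n2 in studies]
    pooled = sum(w * s[0] for w, s in zip(weights, studies)) / sum(weights)
    return pooled, sqrt(1 / sum(weights))


# Hypothetical studies: same effect, same size, none significant alone.
studies = [(0.2, 177, 177)] * 5

d0, n1, n2 = studies[0]
single_p = two_sided_p(d0, sqrt(d_variance(d0, n1, n2)))
pooled_d, pooled_se = fixed_effect_pool(studies)
pooled_p = two_sided_p(pooled_d, pooled_se)

print(f"single study: d = {d0:.2f}, p = {single_p:.3f}")
print(f"pooled (k = {len(studies)}): d = {pooled_d:.2f}, p = {pooled_p:.6f}")
```

None of this pooling is possible unless each primary study reported its effect size (or enough information to recover it) in the first place.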
The ‘Nil’ Null and the Theory-Relevance of Effect Sizes
A very important (though sometimes missed; see Thompson, 1998) aspect of Cohen’s influential 1994 article was his drawing of a distinction between statistical tests of ‘nil’ as against non-nil null hypotheses. Cohen (1994) was disturbed by a world in which researchers mindlessly point and click at statistical tests evaluating hypotheses of no difference or of zero relationship, regardless of previous findings or expectations. Of course, researchers do not always have to test nil nulls (Thompson, 1997), but such practices have in some cases become matters of habit.
Frick (1999) observes that, ‘Current theories in psychology, for the most part, do not make predictions about effect size’ (p. 186). In my view, the empirically demonstrated resistance of authors to reporting effect sizes (cf. Keselman et al., 1998; Nilsson & Vacha-Haase, 1998) tends to lead to a chicken-and-egg paradox in which theories often don’t make predictions about effects because empirical evidence is missing, and researchers do not focus on effects because we have weak theories. We might eventually escape this paradox if authors routinely reported their effect sizes.
It is my view that the use of nil nulls, when they are thought or known to be false, compromises the value of resultant p statistics calculated on the grounding premise that the nil null is exactly descriptive of the population. As I noted in Thompson (1997), ‘in many contexts the use of a “nil” hypothesis as the hypothesis we assume [in order to derive a single, statistically “determined” p_calculated] can render me largely disinterested in whether a result is “nonchance” ’ (p. 30).
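The difference between testing a nil null and a non-nil null can be sketched as follows. The example assumes a hypothetical new study that observes d = .25 with 500 participants per group, and a hypothetical prior literature suggesting an effect of about d = .20. Both values, and the function name, are invented for illustration, and the p values rest on a large-sample normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def p_against_null(observed_d, null_d, n1, n2):
    """Two-sided large-sample p for the null hypothesis delta = null_d."""
    se = sqrt((n1 + n2) / (n1 * n2) + observed_d ** 2 / (2 * (n1 + n2)))
    z = (observed_d - null_d) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical values, invented purely for illustration.
observed_d, n1, n2 = 0.25, 500, 500
prior_d = 0.20  # effect assumed to be suggested by earlier studies

p_nil = p_against_null(observed_d, 0.0, n1, n2)
p_non_nil = p_against_null(observed_d, prior_d, n1, n2)

print(f"against the nil null (delta = 0):       p = {p_nil:.5f}")
print(f"against the non-nil null (delta = .20): p = {p_non_nil:.3f}")
```

In this sketch the nil null is rejected easily, which tells us little if no one believed the effect was zero, while the test against the prior estimate asks the more informative question of whether the new result departs from what earlier work found.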
Statistical Tests and Pseudo-objectivity
Frick (1999) argues that, ‘I think the statement of importance, in addition to being subjective and irrelevant, is outside the basic purview of science’ (p. 187). He also notes that ‘the information in a scientific report should be expressed objectively’ and that ‘the guiding rule should be whether the researcher has something useful and objective to say’ (p. 187). However, he also acknowledges that ‘many aspects of the scientific enterprise are not objective’ (p. 187).
It is my view that scientists are not objective, and neither is the science done by non-objective scientists. As Piel (1978) argued, ‘In the social sciences the act of observation perturbs the observer. The social scientist is himself [sic] caught up in the web of circumstances under study; he cannot escape his role as an actor in society’ (p. 9). As I noted in Thompson (1993),
Statistics can be employed to evaluate the probability of an event. But importance is a question of human values, and math cannot be employed as an atavistic escape (à la Fromm’s Escape from Freedom) from the existential human responsibility for making value judgments. . . . Like it or not, empirical science is inescapably a subjective business. (p. 365)
I believe that some researchers focus exclusively on the mindless pursuit of tabular asterisks, because they feel they can then feign objectivity. These researchers feel safe from the horror that at a national conference someone might actually stand up and disagree that their results are noteworthy. When tabular asterisks serve as the ‘objective’ arbiters of result noteworthiness, researchers must agree which results are important, and no one has to argue or be embarrassed. This seems very civil.
But even regarding statistical tests, I have a Tukey-esque data exploration bias which means that I analyze data lots of different ways. I do not like statistics textbooks with charts of decision diamonds leading to the pseudo-definitive one correct analysis for a given problem. This means that I may end up with divergent p_calculated values relevant to a given issue, and that, ‘As in all of statistical inference, subjective judgment cannot be avoided. Neither can reasonableness!’ (Huberty & Morris, 1988, p. 573).
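A small sketch of how two defensible analyses of the same data can yield divergent p values: the code below runs an exact two-sided permutation test twice on one hypothetical data set, once comparing group means and once comparing group medians. The data (which include one extreme score) and the function name are invented for illustration; permutation tests are used here only because they need no distributional tables, not because they are the analyses discussed in the text.

```python
from itertools import combinations
from statistics import mean, median


def exact_permutation_p(group_a, group_b, statistic):
    """Exact two-sided permutation p for a difference in `statistic`.

    Enumerates every way of splitting the pooled scores into groups of
    the original sizes and counts splits at least as extreme as observed.
    """
    pooled = list(group_a) + list(group_b)
    observed = abs(statistic(group_a) - statistic(group_b))
    indices = range(len(pooled))
    extreme = total = 0
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(statistic(a) - statistic(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical scores; group_a contains one extreme value.
group_a = [1, 2, 3, 4, 30]
group_b = [5, 6, 7, 8, 9]

p_mean = exact_permutation_p(group_a, group_b, mean)
p_median = exact_permutation_p(group_a, group_b, median)

print(f"p for difference in means:   {p_mean:.3f}")
print(f"p for difference in medians: {p_median:.3f}")
```

Neither analysis is the single correct one; choosing between them is a judgment about which summary of the data matters for the question at hand.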
I see nothing wrong (and lots honest) about researchers reporting effect
sizes and exposing the values that give their results import. Is the cancer researcher who reports a 3 per cent ω², argues that practice should change to accommodate this new intervention, but says nothing explicit about valuing people and life, really more ‘objective’ than similar researchers who declare their values?
On the other hand, I see plenty wrong with defining small p_calculated values as ‘objective’ measures of result importance. First, improbable events are not intrinsically important. At a recent national meeting I conducted multivariate statistics training with Carl Huberty. Huberty held a drawing for a copy of his seminal book on discriminant analysis in which (we’ll say here exactly) 50 people participated.
One woman drew a winner’s name from a box, and the name was her own. This was an unlikely event. Because the drawing was fair, the fact that the woman drew her own name was statistically irrelevant. The probability of her name being drawn was still .02 (1/50). Thus, the result was still statistically significant (p < .05). But no one died or ascended directly to heaven, nor did the Earth spin off its axis. The event was unlikely, but trivial. On the other hand, no one died during the training session. (Neither Carl nor I are really that boring.) This was a very likely event. But it was nevertheless important!
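The arithmetic of the drawing can be written out in a few lines. The sketch assumes, as the text stipulates, exactly 50 entrants and a fair draw, and it uses the conventional .05 cutoff; the variable names are illustrative.

```python
from fractions import Fraction

entrants = 50
alpha = Fraction(5, 100)  # the conventional .05 cutoff

# In a fair drawing every entrant has the same chance of being named.
p_named_winner = Fraction(1, entrants)

# Count how many of the possible outcomes fall below the cutoff.
improbable_outcomes = sum(
    1 for _ in range(entrants) if p_named_winner < alpha
)

print(f"probability that a given entrant wins: {float(p_named_winner):.2f}")
print(f"outcomes with p < .05: {improbable_outcomes} of {entrants}")
```

Whoever's name is drawn, the outcome has a probability below .05, so improbability by itself cannot be what makes a result important.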
Second, p_calculated numbers contain no information about human values, and so can’t be used as indices of value. The computation of p does not incorporate information about the researcher’s values. And the conclusion of a valid deductive process simply cannot incorporate any information not present in the argument’s premises.
References
American Psychological Association. (1994). Publication manual of the American Psychological Association (4th ed.). Washington, DC: Author.
Biskin, B.H. (1998). Comment on significance testing. Measurement and Evaluation in Counseling and Development, 31, 58–62.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cortina, J.M., & Dunlap, W.P. (1997). Logic and purpose of significance testing. Psychological Methods, 2, 161–172.
Frick, R.W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1, 379–390.
Frick, R.W. (1999). Defending the statistical status quo. Theory & Psychology, 9, 183–189.
Hays, W.L. (1981). Statistics (3rd ed.). New York: Holt, Rinehart & Winston.
Heldref Foundation. (1997). Guidelines for contributors. Journal of Experimental Education, 65, 287–288.
Huberty, C.J., & Morris, J.D. (1988). A single contrast test procedure. Educational and Psychological Measurement, 48, 567–578.
Keselman, H.J., Huberty, C.J., Lix, L.M., Olejnik, S., Cribbie, R., Donahue, B., Kowalchuk, R.K., Lowman, L.L., Petoskey, M.D., Keselman, J.C., & Levin, J.R. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA and ANCOVA analyses. Review of Educational Research, 68, 350–386.
Kirk, R. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746–759.
Murphy, K.R. (1997). Editorial. Journal of Applied Psychology, 82, 3–5.
Nilsson, J., & Vacha-Haase, T. (1998, August). A review of statistical significance reporting in the Journal of Counseling Psychology. Paper presented at the annual meeting of the American Psychological Association, San Francisco, CA.
Piel, G. (1978). Research for action. Educational Researcher, 7(2), 8–12.
Robinson, D., & Levin, J. (1997). Reflections on statistical and substantive significance, with a slice of replication. Educational Researcher, 26(5), 21–26.
Rosnow, R.L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276–1284.
Thompson, B. (1993). The use of statistical significance tests in research: Bootstrap and other alternatives. Journal of Experimental Education, 61, 361–377.
Thompson, B. (1994). Guidelines for authors. Educational and Psychological Measurement, 54, 837–847.
Thompson, B. (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher, 25(2), 26–30.
Thompson, B. (1997). Editorial policies regarding statistical significance tests: Further comments. Educational Researcher, 26(5), 29–32.
Thompson, B. (1998). In praise of brilliance, where that praise really belongs. American Psychologist, 53, 799–800.
Thompson, B. (1999). If statistical significance tests are broken/misused, what practices should supplement or replace them? Theory & Psychology, 9, 165–181.
Thompson, B. (in press). Journal editorial policies regarding statistical significance tests: Heat is to fire as p is to importance. Educational Psychology Review.
Vacha-Haase, T., & Nilsson, J.E. (1998). Statistical significance reporting: Current trends and usages within MECD. Measurement and Evaluation in Counseling and Development, 31, 46–57.
Vacha-Haase, T., & Thompson, B. (1998). Further comments on statistical significance tests. Measurement and Evaluation in Counseling and Development, 31, 63–67.
Bruce Thompson is Professor and Distinguished Research Scholar, Department of Educational Psychology, Texas A&M University, and Adjunct Professor of Community Medicine, Baylor College of Medicine (Houston, TX). He is a member of the APA Task Force on Statistical Inference. He is the author of 133 articles, and author/editor of six books. He previously edited Measurement and Evaluation in Counseling and Development, and currently edits Educational and Psychological Measurement, the Journal of Experimental Education and the JAI Press book series ‘Advances in Social Science Methodology’. Address: Department of Educational Psychology, Texas A&M University, College Station, TX 77843–4225, USA. The author and related reprints can both be accessed on the Internet via URL: http://acs.tamu.edu/~bbt6147/