Psychological Reports, 1999, 85, 3-18. © Psychological Reports 1999
ANSWERING TWO CRITICISMS OF HYPOTHESIS TESTING¹
LES LEVENTHAL
University of Manitoba
Summary.-Two generations of methodologists have criticized hypothesis testing by claiming that most point null hypotheses are false and that hypothesis tests do not provide the probability that the null hypothesis is true. These criticisms are answered. (1) The point-null criticism, if correct, undermines only the traditional two-tailed test, not the one-tailed test or the little-known directional two-tailed test. The directional two-tailed test is the only hypothesis test that, properly used, provides for deciding the direction of a parameter, that is, deciding whether a parameter is positive or negative or whether it falls above or below some interesting nonzero value. The point-null criticism becomes unimportant if we replace traditional one- and two-tailed tests with the directional two-tailed test, a replacement already recommended for most purposes by previous writers. (2) If one interprets probability as a relative frequency, as most textbooks do, then the concept of probability cannot meaningfully be attached to the truth of an hypothesis; hence, it is meaningless to ask for the probability that the null is true. (3) Hypothesis tests provide the next best thing, namely, a relative frequency probability that the decision about the statistical hypotheses is correct. Two arguments are offered.
Methodologists have attacked hypothesis testing for half a century. Two criticisms have been repeated frequently. The first is that point null hypotheses are almost always false and therefore testing a point null with an hypothesis test tells us nothing new. The second is that hypothesis tests do not tell us what we want to know: the probability that the null hypothesis is true. I will argue that "the probability the null is true" is meaningless when using a relative frequency interpretation of probability but that hypothesis tests can provide the next best thing, namely, a relative frequency probability that the decision about the statistical hypotheses is correct. I will argue that the point-null criticism does not undermine the one-tailed test or the little-known directional two-tailed test. The directional two-tailed test will be emphasized since it has been recently endorsed as a replacement for traditional tests.
The directional test has been discussed since the 1950s (e.g., Bahadur, 1952; Hand, McCarter, & Hand, 1985; Harris, 1997a, 1997b; Kaiser, 1960; Lehmann, 1957a, 1957b; Leventhal, 1994b, 1999; Leventhal & Huynh, 1996a, 1996b; Shaffer, 1972). The most detailed descriptions are in Harris (1997a), Leventhal (1999), and Leventhal and Huynh (1996b).
¹I am grateful to Cam-Loi Huynh and Toby Martin of the Department of Psychology at the University of Manitoba for comments on an earlier draft of this article. Please address correspondence and reprint requests to Les Leventhal, Department of Psychology, University of Manitoba, Winnipeg R3T 2N2, Canada or e-mail (leventha@cc.umanitoba.ca).
L. LEVENTHAL
DIRECTIONAL TWO-TAILED TEST
The most frequent use of an hypothesis test may be to decide the direction of a parameter. A parameter has a direction when it differs from zero (positive or negative) or differs from an interesting nonzero value (larger or smaller than the value). The directional two-tailed test is the only hypothesis test that, properly used, provides for a decision in either direction (Leventhal, 1999; Leventhal & Huynh, 1996b). One way to understand the test is to view it as a single test assessing three statistical hypotheses. For example, when evaluating the difference between two population means with a t test, the hypotheses are
$$
\begin{aligned}
H_1 &: \mu_1 - \mu_2 < 0 \\
H_2 &: \mu_1 - \mu_2 = 0 \quad \text{(null hypothesis)} \\
H_3 &: \mu_1 - \mu_2 > 0.
\end{aligned}
$$
$H_2$ is the null hypothesis, the hypothesis generating the sampling distribution containing the rejection regions. $H_1$ and $H_3$ are alternative hypotheses. In contrast is the conventional two-tailed test, which uses nondirectional hypotheses

$$
\begin{aligned}
H_0 &: \mu_1 - \mu_2 = 0 \\
H_A &: \mu_1 - \mu_2 \neq 0
\end{aligned}
$$
and does not provide for a directional decision. Leventhal and Huynh (1996b) give side-by-side comparisons of the directional two-tailed test to conventional tests, including explanatory diagrams, worked examples, and calculations for power, error rates, and sample size. Like the nondirectional test, the directional two-tailed test has a rejection region in each tail of the sampling distribution and has two critical values. Given the same $\alpha$, the tests use the same critical values and make the same decision about the null. But after null rejection, the tests accept different alternatives. The nondirectional test accepts a nondirectional alternative and the directional test accepts one of the directional alternatives, the one matching the data's observed direction. Although the tests use the same critical values, they differ in power, sample size, and the possibility of Type III error, that is, deciding in one direction when the other direction is correct (Kaiser, 1960; Leventhal, 1999; Leventhal & Huynh, 1996b).
Harris (1997a) and Leventhal (1999) recommended replacing traditional one- and two-tailed tests with the directional two-tailed test for most purposes. Unlike the traditional two-tailed test, the directional test provides for a directional decision and provides for the calculation of power, sample size, and Type III error risk when a directional decision is planned. When rejection regions are unequal for the directional test, it can have almost as much power as a one-tailed test without forfeiting the right, as does the one-tailed test, to decide in the direction opposite to a prediction.
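The decision rule of the directional two-tailed test can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's procedure: it assumes a z statistic that is standard normal under the null in place of the t test used in the text, and the function name `directional_two_tailed` and the parameters `alpha_lower` and `alpha_upper` are hypothetical. Making the two tail areas unequal illustrates the unequal rejection regions mentioned above.

```python
from statistics import NormalDist

def directional_two_tailed(z, alpha_lower=0.025, alpha_upper=0.025):
    """Return the hypothesis accepted by a directional two-tailed z test.

    Assumption: z is standard normal when the null (H2) is true.
    alpha_lower and alpha_upper are the areas of the lower and upper
    rejection regions; their sum is the significance level.
    """
    std_normal = NormalDist()
    lower_crit = std_normal.inv_cdf(alpha_lower)        # negative critical value
    upper_crit = std_normal.inv_cdf(1.0 - alpha_upper)  # positive critical value
    if z <= lower_crit:
        return "H1"         # decide the parameter is below the null value
    if z >= upper_crit:
        return "H3"         # decide the parameter is above the null value
    return "undecided"      # any of H1, H2, H3 may be true

print(directional_two_tailed(2.3))    # H3
print(directional_two_tailed(-2.3))   # H1
print(directional_two_tailed(1.2))    # undecided
# Unequal regions: .04 in the predicted (upper) tail, .01 in the other tail.
print(directional_two_tailed(1.8, alpha_lower=0.01, alpha_upper=0.04))  # H3
```

With equal regions the same statistic of 1.8 falls short of the upper critical value, so the unequal split buys power in the predicted direction while keeping a rejection region in the opposite tail.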
HYPOTHESIS TESTING 5
Two generations of methodologists have argued that all or most point null hypotheses are false (e.g., Abelson, 1995, p. 10; Bakan, 1966; Berger, 1985, pp. 148, 153; Braver, 1975; Cohen, 1990, 1994; Harris, 1985, p. 2; Hays, 1973, p. 415; Hick, 1952; Hodges & Lehmann, 1954; Kirk, 1982, p. 41, 1996; Loftus, 1991; Lykken, 1968; Meehl, 1967, 1978, 1990; Moxley, 1979; Murphy, 1990; Rogers, Howard, & Vessey, 1993; Schmidt, 1992; Tukey, 1991). But see Frick (1995) and Hagen (1997), who disagree with some versions of the argument. A point null states an exact point value for the parameter under test. Some writers come close to claiming the most extreme form of the argument, that a point null is false no matter what its value or how one arrives at it. Others are more moderate, claiming that two population means are never exactly equal and that a population correlation is never exactly zero, because complex human behavior is related to just about everything, at least to a small extent. More specialized claims are that a theoretical point prediction (stated by the null) is never exactly true (e.g., Serlin & Lapsley, 1985, 1993) and that a theoretical function or curve (stated by the null) is never exactly duplicated by empirical data (e.g., Berkson, 1938). The point-null criticism takes many forms.
The nondirectional two-tailed test typically evaluates a point null. [The test can also evaluate a range null, which states a range of values (e.g., Greenwald, 1975; Hays, 1973, p. 851; Serlin & Lapsley, 1985, 1993).] Since the statistical hypotheses for the nondirectional test are exhaustive and mutually exclusive, knowing the null is false implies that the alternative hypothesis is true. Hence, if we know a point null is false, a nondirectional test that evaluates it is useless. We already know which hypothesis is true.
The vulnerability of the one-tailed test to the point-null criticism is complicated to discuss because some textbooks express its null as an inexact value, for example,
$$
\begin{aligned}
H_0 &: \mu_1 - \mu_2 \le 0 \quad \text{(inexact null)} \\
H_A &: \mu_1 - \mu_2 > 0,
\end{aligned}
$$
and some textbooks express its null as a point value, for example,
$$
\begin{aligned}
H_0 &: \mu_1 - \mu_2 = 0 \quad \text{(point null)} \\
H_A &: \mu_1 - \mu_2 > 0.
\end{aligned}
$$
Kaiser (1960) argued that either is correct, and I agree. [See Glass & Stanley (1970, p. 289) for a different opinion.] Consider the hypothesis set with the point null. Data which would reject $H_0: \mu_1 - \mu_2 = 0$ at $\alpha = .05$ would also reject at $\alpha < .05$ any point null stating a $\mu_1 - \mu_2$ value less than zero (Hays, 1981, p. 232, pp. 254-257; Kirk, 1984, p. 250). Accordingly, the inexact null, $H_0: \mu_1 - \mu_2 \le 0$, expresses the hypotheses actually under test more accurately
than does the point null. With an inexact null, $\alpha$ is usually understood to mean the maximum probability of Type I error (e.g., Hodges & Lehmann, 1954, p. 264; Kaiser, 1960, p. 161; Kirk, 1982, p. 31; Shaffer, 1972, pp. 195, 196). Although either null is correct, the inexact null makes it clear that the one-tailed test is not vulnerable to the point-null criticism (see also Meehl, 1967, p. 108).²
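A short numerical sketch may help show why $\alpha$ is the maximum Type I error probability under an inexact null. It assumes an upper one-tailed z test rather than the t test of the text, and `delta` (the true difference divided by its standard error) and the function name are hypothetical conveniences.

```python
from statistics import NormalDist

def one_tailed_rejection_prob(delta, alpha=0.05):
    """P(reject H0) for an upper one-tailed z test.

    Assumption: the test statistic is distributed N(delta, 1), where
    delta is the true difference divided by its standard error.
    """
    std_normal = NormalDist()
    crit = std_normal.inv_cdf(1.0 - alpha)
    # Rejecting means the statistic exceeds crit; shift by the true delta.
    return 1.0 - std_normal.cdf(crit - delta)

# True differences at or below zero all satisfy the inexact null.
for delta in (-2.0, -1.0, -0.5, 0.0):
    print(delta, round(one_tailed_rejection_prob(delta), 4))
```

Under these assumptions the rejection probability rises toward $\alpha$ as the true difference approaches zero from below and equals $\alpha$ only at the boundary value.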
The preceding discussion can be found in the literature, some of it countless times. What is new is how the directional two-tailed test affects the discussion. Even though Kaiser discussed the test in a core psychology journal as early as 1960, the test has been largely forgotten, save for renewed attention by Harris (1997a, 1997b), Leventhal (1994b, 1999), and Leventhal and Huynh (1996a, 1996b). The directional two-tailed test evaluates a point null but is not useless, even when we know the null is false. The test evaluates three exhaustive and mutually exclusive statistical hypotheses. Knowing the null is false implies that one of the alternative hypotheses is true, but not which one. The test makes its contribution by telling us which directional alternative to accept after null rejection. If direction is not important, the test is useless when we know the null is false. But direction is almost always important.
Thus, the point-null criticism, if correct, undermines only the nondirectional two-tailed test. Even this problem has little consequence since, as argued by Harris (1997a) and Leventhal (1999), we would do well to replace conventional tests with the directional two-tailed test for most purposes. Some nondirectional tests, however, such as the F test of analysis of variance, can compare more than two conditions and cannot be replaced by a single directional two-tailed test. Such a nondirectional test is vulnerable to the point-null criticism.
SECOND CRITICISM: PROBABILITY THAT THE NULL IS TRUE
Two generations of methodologists have argued that hypothesis tests do
²Two notes: First, if a one-tailed test used a point null, the statistical hypotheses would not be exhaustive; hence, knowing that the null is false would not imply that the alternative hypothesis is true. The test would make a contribution by indicating whether to accept the alternative hypothesis. So, whichever null is used, the point-null argument does not undermine the one-tailed test. Second, a reviewer of this article claimed that Meehl's Paradox (Meehl, 1967) undermines the one-tailed test. Meehl discussed using a one-tailed test to evaluate a directional prediction made by a worthless theory. According to Meehl, if alternative hypothesis $H_A$ expresses the prediction and statistical power is perfect, the probability of the test confirming the prediction and therefore corroborating the worthless theory would be at least .5. In my opinion, this claim, even if true, fails to undermine one-tailed (or directional two-tailed) tests. The miscreant here is not the hypothesis test but the theory which can do no better than make a directional prediction. A better theory would make a more precise prediction such as predicting the shape of a function or predicting an exact point value. If a theoretical prediction is merely directional, then any accurate method of testing the prediction, be it another statistical technique or a direct telephone line to the Almighty, will result in at least a .5 probability of corroborating a worthless theory (Leventhal, 1994a).
not provide the probability that the null hypothesis is true (e.g., Bakan, 1966; Berger, 1985, pp. 119-120; Carver, 1978; Cohen, 1990, 1994; Cortina & Dunlap, 1997; Frick, 1996; Gigerenzer, 1993; Gigerenzer & Murray, 1987; Greenwald, Gonzales, Harris, & Guthrie, 1996; Hagen, 1997; Kirk, 1996; Morrison & Henkel, 1970; Oakes, 1990, p. 129; Pollard, 1993; Sedlmeier & Gigerenzer, 1989; Wilson, 1961). Cohen (1994) put the criticism succinctly, stating that the null hypothesis significance test "does not tell us what we want to know," which is, "Given these data, what is the probability that $H_0$ is true?" (p. 997). Thus, while significance level $\alpha$ is the probability of rejecting the null when the null is true and $p$ is the probability of the obtained or more extreme data when the null is true, we want instead the probability that the null is true.
The problem with this criticism is that the null hypothesis is a single case and the relative frequency theory of probability, the theory used in most statistics texts written for social science, cannot meaningfully attach a probability to a single case. Probability theorists have proposed many theories of probability (e.g., Howson, 1995), and one concern is how a probability theory deals with a single case. The reason for this concern may be that, in everyday language, we often use the concept of probability with a single case. We ask the probability it will rain tomorrow where we live. We ask the probability that North Korea will attack South Korea. In science, we ask the probability that a particular scientific theory is true. These are all single, unique cases. Statistics is also concerned with single cases. Take, for example, hypothesis testing. Suppose we specify a population and investigate population mean $\mu$ by conducting an hypothesis test evaluating
Once the value of 50 is fixed for the null hypothesis, the null becomes a single case. Cohen and other methodologists ask for the probability that the null is true. Can one apply a probability to a single case? The answer depends on what probability theory one uses.
Most social science statistics textbooks interpret probability as a relative frequency. Philosophers advocating this interpretation include Reichenbach, von Mises, and van Fraassen. According to Popper (1965), relative frequency theory
. . . treats every numerical probability statement as a statement about the relative frequency with which an event of a certain kind occurs within a sequence of occurrences, . . . the statement 'The probability of the next throw with this die being five equals 1/6' is not really an assertion about the next throw; rather it is an assertion about a whole class of throws of which the next throw is merely an element. The statement in question says no more than the relative frequency of fives, within this class of throws, equals 1/6 (all emphasis his, p. 149). . . . The idea of probability is therefore applicable only to sequences of events [emphasis his] . . . (p. 153).
Thus, one cannot meaningfully attach a relative frequency probability to a single case. Some single cases, like a die throw, are repeatable. Attaching a relative frequency probability to a die throw would simply be a linguistic convenience, a way of saving words, for when we say that the probability of obtaining a five on the next throw (a single case) is 1/6, we mean that die throws are repeatable, that the next throw is a member of an hypothetical collection of throws, and that the relative frequency of fives in this collection is 1/6. This way of saving words is so convenient that we use it often. Nevertheless, the concept of probability applies to a collection, not to a single case. Some single cases, like an hypothesis, are not repeatable. An hypothesis is true or not true. One cannot attach a relative frequency probability to the truth of an hypothesis, not even as a linguistic convenience. There is no repeating event, no collection of things, upon which to count a relative frequency. If there is no relative frequency, there is nothing to which the word probability can refer. While attaching a probability to a die throw is a linguistically convenient way of referring to a relative frequency in a collection of throws, probability refers to nothing at all when attached to an hypothesis. Therefore, one cannot sensibly ask for the relative frequency probability that an hypothesis is true (see especially Popper, 1965, pp. 254-262, but see also Good, 1971, p. 494; Hagen, 1997; Oakes, 1990; Savage, 1972, p. 4; von Mises, 1957, pp. 8-20, 1964, p. 1). Accordingly, it makes no sense to criticize hypothesis testing on the grounds that it does not provide a relative frequency probability that the null hypothesis is true.
Although one cannot meaningfully speak of the relative frequency probability that the null is true, one can indeed meaningfully speak of the relative frequency probability that a decision about the statistical hypotheses is correct. Like die throws, decisions are repeatable events, and the probability of a correct decision is the relative frequency of correct decisions in a collection of decisions. Consider, for example, a directional two-tailed t test at $\alpha = .05$ evaluating
$$
\begin{aligned}
H_1 &: \mu < 50 \\
H_2 &: \mu = 50 \quad \text{(null hypothesis)} \\
H_3 &: \mu > 50.
\end{aligned}
$$
The test uses the following decision rule: If data fall in either of the 2.5% rejection regions, reject $H_2$ in favor of the alternative hypothesis matching the data's direction. If data do not fall in a rejection region, decide that
³Statistical hypotheses evaluated by an hypothesis test differ from the scientific hypotheses, or theories, from which statistical hypotheses are derived. Scientific hypotheses usually have greater scope, are more complex and abstract, and rest on a foundation of previous research. Nevertheless, arguments against attaching a relative frequency probability are generally the same for both scientific and statistical hypotheses.
any of the statistical hypotheses may be true. The decision that any of the hypotheses may be true is always correct. But it is uninformative since it tells us what we already know. Saying that the probability is .95 that a decision is correct means that, if we select a large number of random samples, conduct the above hypothesis test on each sample, and reach a decision using the above decision rule, then 95% of the tests will produce a correct decision regarding the statistical hypotheses. Here, the relative frequency of correct decisions in the collection of decisions is .95. Some correct decisions will be uninformative. They result from nonsignificant data for which one decides that any of the three hypotheses may be true.
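The relative frequency reading of "the probability is .95 that a decision is correct" can be illustrated with a small simulation. This is a sketch under stated assumptions, not the analysis in the text: it uses a z test with a known population standard deviation instead of a t test, and the population values (`sigma=10.0`, `n=25`), the seed, and all function names are hypothetical.

```python
import random
from statistics import NormalDist, fmean

def directional_decision(sample, null_mean, sigma, alpha=0.05):
    """Directional two-tailed z test; sigma is assumed known."""
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z = (fmean(sample) - null_mean) / (sigma / len(sample) ** 0.5)
    if z <= -crit:
        return "below"
    if z >= crit:
        return "above"
    return "undecided"

def is_correct(decision, true_mean, null_mean):
    """'undecided' is always correct; a direction must match the truth."""
    if decision == "undecided":
        return True
    if decision == "above":
        return true_mean > null_mean
    return true_mean < null_mean

def correct_rate(true_mean, null_mean=50.0, sigma=10.0, n=25,
                 n_samples=5000, seed=1):
    """Relative frequency of correct decisions over many random samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        decision = directional_decision(sample, null_mean, sigma)
        hits += is_correct(decision, true_mean, null_mean)
    return hits / n_samples

# True mean equal to, slightly above, and well above the null value of 50.
for true_mean in (50.0, 51.0, 56.0):
    print(true_mean, correct_rate(true_mean))
```

Under these assumptions the rate should sit near .95 when the true mean equals the null value and climb toward 1.0 as the true mean moves away from it.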
FINDING THE PROBABILITY OF A CORRECT DECISION
Hypothesis tests provide a probability that the decision about the statistical hypotheses is correct. There are two arguments. The first argument has three steps: (1) Confidence intervals provide a probability that a decision about parameter direction is correct. (2) Certain hypothesis tests predict the directional decision made with certain confidence intervals. (3) If an hypothesis test predicts the directional decision made with a confidence interval, one can use the hypothesis test to claim the same probability about the decision that would have been provided by the confidence interval had it been computed. Since confidence intervals play an important role in this argument, note that textbooks usually give a relative frequency interpretation to confidence interval probabilities. For example, suppose one calculates a 95% confidence interval for population mean $\mu$ and finds the interval to be $60 \pm 4$. Given a relative frequency interpretation, the summary statement, "the probability or confidence is .95 that the interval $60 \pm 4$ contains or covers $\mu$," means that if we were to select a large number of random samples and compute a 95% confidence interval for each sample, then 95% of the samples would produce an interval that contains $\mu$.⁴ The three steps of the argument are explained below.
(1) Confidence Intervals Provide a Probability That a Decision About Direction Is Correct
Although confidence intervals are typically used to estimate the size of a parameter, they can also be used for the more narrow purpose of estimating
⁴It is sometimes said that a single random sample, before it is drawn, has a probability of .95 of producing an interval containing $\mu$. Strictly speaking, however, just as attaching a relative frequency probability to the next die throw says nothing about the next throw, attaching a relative frequency probability to a random sample before it is drawn says nothing about the random sample. Both usages are a linguistic convenience because probability actually refers to a relative frequency found in a collection, .95 in one collection and 1/6 in the other. Attaching a probability to a random sample says only that the sample is a member of the collection. So this relative frequency interpretation of a confidence interval probability attaches a relative frequency probability to a single case, a random sample, only as a linguistic convenience.
only the direction of a parameter (see Frick, 1996; Hand, et al., 1985; Harris, 1997a; Hsu, 1996, p. 4; Koopmans, 1987, pp. 281-282; Leventhal & Huynh, 1996b).
Two-sided intervals.-Two examples: First, one can use a 95% two-sided confidence interval for $\mu_1 - \mu_2$ to decide the direction of the difference between population means. The value 95% is called the confidence coefficient. If interval values are uniformly positive, one decides with at least 95% confidence that the true difference is positive. If interval values are uniformly negative, one decides with the same confidence that the true difference is negative. If the interval contains zero (i.e., lower limit < 0 < upper limit), one decides that the true difference is positive, negative, or zero, a correct but uninformative decision. This correct but uninformative decision parallels the correct but uninformative decision made by the directional two-tailed test above when data do not fall in a rejection region. Second, one can use a 95% two-sided interval to decide whether $\mu_1 - \mu_2$ is larger or smaller than an interesting value, say a theoretically predicted value of +3. If interval values are uniformly smaller than +3, one decides with at least 95% confidence that the true difference is smaller than +3. If interval values are uniformly larger than +3, one decides with the same confidence that the true difference is larger than +3. If the interval contains +3 (i.e., lower limit < +3 < upper limit), one decides that the true difference is above, below, or equal to +3.
One-sided intervals.-There are two kinds of one-sided confidence intervals, one having a lower limit without an upper limit and one having an upper limit without a lower limit (Kirk, 1984, p. 322; Koopmans, 1987, pp. 236, 307, and 318; but see especially Leventhal & Huynh, 1996b, p. 289). Example: One can use a 95% one-sided, lower-limit confidence interval for $\mu_1 - \mu_2$ to test a theoretical prediction that the difference between population means is positive. If the lower limit of the interval is, say, +5, then all values in the interval are positive and one concludes with at least 95% confidence that the true difference is positive. If the lower limit is, say, -5, then the interval contains zero (i.e., lower limit < 0) and one concludes that the true difference is positive, negative, or zero.
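The interval-based decision rules for two-sided and one-sided intervals can be written as one small function. The following is a minimal sketch; the function name `decide_direction`, the `reference` parameter, and the example limits are hypothetical, and a one-sided interval is represented by using an infinite value for its missing limit.

```python
import math

def decide_direction(lower, upper, reference=0.0):
    """Decide a parameter's direction from a confidence interval.

    Use -math.inf or math.inf for the missing limit of a one-sided interval.
    """
    if lower > reference:
        return "above"      # every interval value exceeds the reference value
    if upper < reference:
        return "below"      # every interval value falls short of it
    return "undecided"      # the interval contains the reference value

print(decide_direction(1.2, 6.8))                  # above
print(decide_direction(-4.0, 2.5, reference=3.0))  # below
print(decide_direction(5.0, math.inf))             # above
print(decide_direction(-5.0, math.inf))            # undecided
```

The last two calls mirror the lower-limit examples of +5 and -5 given above.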
Conditional versus unconditional probability.-A confidence interval's probability, expressed by the confidence coefficient, is not conditioned on the true value of the parameter. That is, a parameter can take any value and yet the probability that the interval covers the value remains the same (Giere, 1977, p. 65; Kendall & Stuart, 1961, p. 99; Kyburg, 1974, pp. 48-50; Lehmann, 1959, p. 82).⁵ For example, Giere (1977) discussed a confidence
⁵There are, however, special cases of confidence intervals in which the probability that the interval covers the parameter can vary with the value of the parameter (Casella & Berger, 1990, p. 405). But these special cases are not likely to be used in the analysis of research.
interval for population proportion p: "This 'confidence interval' is constructed in such a way that it has an assignable probability [the confidence coefficient] of including the true value of p, no matter what this value may be" (p. 65).
Confidence interval probabilities differ from hypothesis test probabilities. Hypothesis test probabilities are conditional, meaning that they apply on the condition that the parameter has a particular value. They are $\alpha$, the probability of rejecting the null when the null is true; $1 - \alpha$, the probability of retaining the null when the null is true; power, the probability of rejecting the null when the null is false; $\beta$, the probability of retaining the null when the null is false; $\gamma$, the probability of deciding in one direction when the other is true; and $p$, where, for example, two-tailed $p$ is the probability of obtaining data at least as extreme in either direction as the obtained data when the null is true. A $p$ value of .031 is a conditional probability because the probability of .031 applies when the true value of the parameter is the value expressed by the null. To emphasize the difference between confidence interval and hypothesis test probabilities, I will call confidence interval probabilities unconditional because they do not depend on the true value of the parameter. They depend only on the data, which must have the appropriate values, be trustworthy, and satisfy any statistical assumptions.⁶
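As a small illustration of a conditional probability, the two-tailed $p$ can be computed from a test statistic only by assuming the null value is the true parameter value. The sketch below assumes a z statistic that is standard normal under the null; the function name and the statistic of 2.157 are hypothetical.

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Two-tailed p for a z statistic, computed as if the null were true."""
    # The standard normal curve used here is the sampling distribution
    # generated by the null, which is what makes p conditional.
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

print(round(two_tailed_p(2.157), 3))
```

Under this assumption a statistic near 2.16 corresponds to a two-tailed $p$ close to the .031 used as an example above.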
If the probability that an interval covers the parameter is unconditional, it follows that, when reaching a decision about parameter direction with a confidence interval, the probability that the decision is correct is also unconditional. Accordingly, if we select a large number of random samples and, for each sample, compute a 95% confidence interval for $\mu_1 - \mu_2$ to decide the direction of the difference, 95% of the decisions will be correct whatever the true value of $\mu_1 - \mu_2$. Actually, 95% is the minimum value. Reason: If the true $\mu_1 - \mu_2$ difference is zero, 95% of the intervals will contain zero, producing for each interval the decision that the true difference is positive, negative, or zero. These are the only correct decisions when the true difference is zero, and none are informative. If the true difference is slightly positive, then (a) slightly over 2.5% of the intervals will contain uniformly positive values, producing for each interval the correct and informative decision that the true difference is positive, (b) slightly under 95% of the intervals will contain zero, producing for each interval the correct but uninformative
⁶Another way to put the criticism of hypothesis testing by Cohen, Carver, and others is that what we really want to know and what hypothesis tests do not provide is an unconditional probability that the null hypothesis is true, that is, a probability for the truth of the null that depends only on the data, $P(H_0 \mid D)$. See Cohen (1994, p. 998) and Carver (1978, p. 385). Another way to put my reply is that a relative frequency probability cannot be meaningfully attached to the truth of the null but that hypothesis tests provide the next best thing, an unconditional relative frequency probability of a correct decision, that is, a relative frequency probability for a correct decision that depends only on the data, $P(\text{correct decision} \mid D)$.
decision that the true difference is positive, negative, or zero, and (c) slightly under 2.5% of the intervals will contain uniformly negative values, producing for each interval the informative but incorrect decision that the true difference is negative. Hence, when the true difference is slightly positive, slightly over 97.5% of all decisions are correct, and slightly over 2.5% of all decisions are both correct and informative. When the true difference is sufficiently greater than zero, almost 100% of the intervals will contain uniformly positive values, resulting in nearly 100% of the decisions being correct and informative. Thus, for the 95% two-sided confidence interval, the probability of a correct decision varies from .95 to nearly 1.0, depending on the true value of $\mu_1 - \mu_2$. This probability is conditional. The minimum of .95, however, is an unconditional probability since it holds whatever the value of $\mu_1 - \mu_2$. Put differently, the probability of a correct decision is always at least .95. The probability of a correct and informative decision varies from 0.0 to nearly 1.0, depending on the true value of $\mu_1 - \mu_2$. This probability is conditional. The minimum of 0.0, however, is unconditional since it holds whatever the value of $\mu_1 - \mu_2$. But such a small minimum is useless.
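The percentages in this argument can be checked numerically. The sketch below assumes a z-based interval (estimate plus or minus a critical value times a known standard error) rather than a t-based one, and expresses the true difference as `delta`, the true $\mu_1 - \mu_2$ divided by the standard error; both the function name and `delta` are hypothetical.

```python
from statistics import NormalDist

def interval_decision_probs(delta, alpha=0.05):
    """Decision probabilities for a (1 - alpha) two-sided z interval.

    Assumption: the standardized estimate is distributed N(delta, 1).
    """
    std_normal = NormalDist()
    crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    p_all_positive = 1.0 - std_normal.cdf(crit - delta)   # interval above zero
    p_all_negative = std_normal.cdf(-crit - delta)        # interval below zero
    p_contains_zero = 1.0 - p_all_positive - p_all_negative
    if delta > 0:
        informative = p_all_positive
    elif delta < 0:
        informative = p_all_negative
    else:
        informative = 0.0   # no directional decision is correct at zero
    return {"correct": p_contains_zero + informative,
            "correct_and_informative": informative}

for delta in (0.0, 0.01, 1.0, 3.0, 6.0):
    print(delta, interval_decision_probs(delta))
```

Under these assumptions the probability of a correct decision is .95 at a true difference of zero, jumps to just over .975 for any slightly nonzero difference, and approaches 1.0 as the difference grows, so .95 is the floor that holds whatever the true value.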
(2) Certain Hypothesis Tests Predict the Directional Decision Made With Confidence Intervals
Textbooks describe how confidence intervals predict decisions about the null made with hypothesis tests, making hypothesis tests unnecessary. The present interest is roughly the reverse: how certain hypothesis tests predict decisions about direction made with confidence intervals, making confidence intervals unnecessary if parameter direction, not size, is important. If parameter size is important, hypothesis tests will not replace confidence intervals.
The following describes how the directional two-tailed test predicts the directional decision made with the two-sided confidence interval and how the one-tailed test predicts the directional decision made with the one-sided confidence interval. [Predicting a decision about a population proportion can be a problem; see Koopmans (1987, pp. 281-282).] The nondirectional two-tailed test does not provide for a directional decision and therefore cannot predict the directional decision made with a confidence interval. The examples below are t tests.
Directional two-tailed test.-For the same data, the directional test at $\alpha$ and the $1 - \alpha$ two-sided confidence interval reach the same decision about the direction of a parameter.⁷ Hence, the directional test can predict the directional
⁷The confidence coefficient is represented by $1 - \alpha$. Here, $\alpha$ denotes the (unconditional) probability that the interval does not cover the parameter (Kyburg, 1974, p. 50) and $1 - \alpha$ denotes the (unconditional) probability that the interval covers the parameter (Kendall & Stuart, 1961, p. 99; Lehmann, 1959, p. 174). For an hypothesis test having significance level $\alpha$, $\alpha$ denotes the (conditional) probability of rejecting the null when it is true. Hence, when the $1 - \alpha$ confidence
decision reached with the two-sided interval. Two examples: First, if a directional test at the .05 level evaluating
$$
\begin{aligned}
H_1 &: \mu_1 - \mu_2 < 0 \\
H_2 &: \mu_1 - \mu_2 = 0 \quad \text{(null hypothesis)} \\
H_3 &: \mu_1 - \mu_2 > 0
\end{aligned}
$$
rejects $H_2$ in favor of $H_3$, then a 95% two-sided confidence interval for $\mu_1 - \mu_2$ will contain uniformly positive values, resulting in the decision that the true $\mu_1 - \mu_2$ difference is positive. If the test rejects $H_2$ in favor of $H_1$, the interval will contain uniformly negative values, resulting in the decision that the true difference is negative. If the test does not reject $H_2$, the interval will contain zero, resulting in the decision that the true difference is positive, negative, or zero. Second, if a directional test at the .05 level evaluating
$$
\begin{aligned}
H_1 &: \mu < 55 \\
H_2 &: \mu = 55 \quad \text{(null hypothesis)} \\
H_3 &: \mu > 55
\end{aligned}
$$
rejects $H_2$ in favor of $H_1$, then a 95% two-sided confidence interval for $\mu$ will contain values uniformly smaller than 55, resulting in the decision that $\mu$ is smaller than 55. If the test rejects $H_2$ in favor of $H_3$, the interval will contain values uniformly larger than 55, resulting in the decision that $\mu$ is larger than 55. If the test does not reject $H_2$, the interval will contain 55, resulting in the decision that $\mu$ is above, below, or equal to 55.
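The agreement between the directional test and the two-sided interval can be checked directly for a range of sample means. The sketch below is illustrative only: it assumes a z test and a z-based interval with a known standard error in place of the t procedures of the text, and the standard error of 1.5, the grid of means, and the function names are hypothetical.

```python
from statistics import NormalDist

CRIT = NormalDist().inv_cdf(0.975)   # two-tailed critical value, alpha = .05

def decide_by_test(mean, se, null_value):
    """Directional two-tailed z test at the .05 level."""
    z = (mean - null_value) / se
    if z < -CRIT:
        return "below"
    if z > CRIT:
        return "above"
    return "undecided"

def decide_by_interval(mean, se, null_value):
    """Direction read from the 95% two-sided interval."""
    lower, upper = mean - CRIT * se, mean + CRIT * se
    if lower > null_value:
        return "above"
    if upper < null_value:
        return "below"
    return "undecided"

means = [50.0 + 0.25 * k for k in range(41)]   # 50.0, 50.25, ..., 60.0
agree = all(decide_by_test(m, 1.5, 55.0) == decide_by_interval(m, 1.5, 55.0)
            for m in means)
print(agree)   # True
```

The two functions are algebraic rearrangements of the same inequality, which is why one can stand in for the other when only direction matters.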
One-tailed test.-For t h e same data, the one-tailed test at a and the 1 - a one-sided confidence interval, given that they predict the same direc- tion, reach the same decision about the direction of a parameter. Hence, the one-tailed test can predict the directional decision reached with the one- sided interval. For example, if a one-tailed test at the .05 level evaluating
rejects Ho in favor of H A , then a 95% lower limit interval for p w d contain only values above 55, resulting in the decision that p is above 55. If the test
interval is paired with an hypothesis test a t a to discuss the similarity in their decisions, as done here, the value of a for both techniques is the same but the meaning is not. It is some- times useful to interpret confidence intervals as "inverted" hypothesis tests, with the 1 - a confidence inrerval consisting of all null hypothesis values retained at significance level a. Here, confidence coefficient 1 - a simply means that the tests were conducted at significance level a (Lehmann, 1959, p. 176); however, this is not the only way ro interpret confidence intervals (Lehmann, 1959, p. 174). Indeed, Pratr (1971) said, ". . . I would not think i r satisfactory or even ossible to limir the interpretation of confidence intervals to h e i r role as inverted tests . . ." &. 196).
14 L. LEVENTHAL
does not reject Ho, the interval wdl contain 5 5 , resulting in the decision that p is above, below, or equal to 55.
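The agreement between the directional test and the two-sided interval can be checked numerically. The Python sketch below is an illustration only and is not part of the original argument. It assumes a z test with a known standard deviation in place of a t test, so that the standard library suffices; the function names, the sample size of 25, the standard deviation of 10, and the sample means are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def directional_test(mean, sigma, n, null, alpha=0.05):
    """Directional two-tailed z test (assumed known sigma).

    Returns 'above', 'below', or 'undecided' for the parameter
    relative to the null value.
    """
    z = (mean - null) / (sigma / sqrt(n))
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # alpha/2 in each tail
    if z > crit:
        return "above"
    if z < -crit:
        return "below"
    return "undecided"


def two_sided_interval_decision(mean, sigma, n, null, alpha=0.05):
    """Directional decision read off the 1 - alpha two-sided interval."""
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * sigma / sqrt(n)
    lower, upper = mean - half_width, mean + half_width
    if lower > null:
        return "above"      # interval lies entirely above the null value
    if upper < null:
        return "below"      # interval lies entirely below the null value
    return "undecided"      # interval contains the null value


# Hypothetical sample means tested against the null value 55.
for sample_mean in (50.0, 56.0, 60.0):
    print(sample_mean,
          directional_test(sample_mean, 10, 25, 55),
          two_sided_interval_decision(sample_mean, 10, 25, 55))
```

Under these assumptions the two functions return the same label for every sample mean, because the rejection rule and the interval limits are algebraic rearrangements of the same inequality.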
(3) Claiming an Unconditional Probability With an Hypothesis Test

Since directional two-tailed and one-tailed tests predict the directional decision made with a confidence interval, they justify attaching an unconditional probability to a decision about the statistical hypotheses, the same unconditional probability that would have been provided by the confidence interval had it been computed. The unconditional probability that transfers from the confidence interval to the hypothesis test is a relative frequency probability that refers to the relative frequency of correct decisions in an hypothetical collection of decisions. The probability is at least $1 - \alpha$, where $\alpha$ is the significance level of the test.
For two generations, methodologists such as Bakan (1966), Carver (1978), and Cohen (1994) have argued that hypothesis tests do not provide unconditional probabilities. Nevertheless, directional two-tailed and one-tailed tests do in fact provide an unconditional probability that their decision is correct. One who reaches a decision with one of these tests can be certain that, were the appropriate confidence interval computed, it would reach the same decision and confer the stamp of unconditional probability upon it. In the same sense that a confidence interval predicts the outcome of an hypothesis test and is therefore said to provide this information (statisticians have lectured researchers about this for decades), certain hypothesis tests predict part of the outcome of certain confidence intervals, the part relating to a directional decision, and therefore the tests provide that information.
The above argument has an advantage and a disadvantage. The advantage is that it uses statistical precedents. Previous writers have stated that a confidence interval's coverage probability does not vary with the true value of the parameter and have reasoned that, if method 1 (a confidence interval) predicts the outcome of method 2 (an hypothesis test), then the information yield of method 2 can be attributed to method 1. The disadvantage of the argument is that it omits the nondirectional two-tailed test since the test cannot predict the directional decision made with a confidence interval. One can ignore confidence intervals and use another argument to claim that hypothesis tests provide the unconditional probability of a correct decision. This argument applies to all three tests. The examples below are t tests at $\alpha = .05$.
For the directional two-tailed test, consider hypotheses

$H_1: \mu < 55$
$H_2: \mu = 55$ (null hypothesis)
$H_3: \mu > 55$.

This test has a 2.5% rejection region in each tail of the sampling distribution. Select a large number of random samples and repeat the test for each sample. When the null is true, 95% of the samples will not fall in a rejection region, producing for each sample the decision that any of the statistical hypotheses may be true. These are the only correct decisions when the null is true, and none are informative. If the true value of $\mu$ is slightly larger than the null value, then (a) slightly over 2.5% of the samples will fall in the right tail rejection region, producing for each sample the correct and informative decision to reject $H_2$ in favor of $H_3$, (b) slightly under 95% of the samples will not fall in a rejection region, producing for each sample the correct but uninformative decision that any of the hypotheses may be true, and (c) slightly under 2.5% of the samples will fall in the left tail rejection region, producing for each sample the informative but incorrect decision to reject $H_2$ in favor of $H_1$. Hence, when the true value of $\mu$ is slightly larger than the null value, slightly over 97.5% of all decisions will be correct, and slightly over 2.5% of all decisions will be both correct and informative. When the true value is sufficiently larger than the null value, almost 100% of the samples will fall in the right tail rejection region, resulting in nearly 100% of the decisions being both correct and informative. Hence, regardless of the true value of $\mu$, the percentage of correct decisions will always be 95% or larger. Accordingly, when $\alpha$ is .05, the unconditional probability of a correct decision is at least .95.* The conditional probability of a correct and informative decision varies from 0.0 to nearly 1.0, depending on the difference between the true value and the null. When the true value of $\mu$ is smaller than the null value, a similar picture emerges.

*It is sometimes argued that, if a null is false, the only type of error typically controlled (Type I error, with conditional probability $\alpha$) cannot occur, and therefore we should pay more attention to controlling Type II error (e.g., Kirk, 1996; Schmidt, 1992). But the minimum unconditional probability of a correct decision, $1 - \alpha$, depends on $\alpha$. So, even when one knows the point null of a directional two-tailed test is false, $\alpha$ should be kept small.

For the nondirectional two-tailed test, consider hypotheses

$H_0: \mu = 55$ (null hypothesis)
$H_A: \mu \ne 55$.

This test has a 2.5% rejection region in each tail of the sampling distribution. When the null is true, 95% of the samples will not fall in a rejection region, producing correct decisions that either hypothesis may be true. These are the only correct decisions when the null is true, and none are informative. If the true value of $\mu$ differs slightly from the null value, then (a) slightly over 2.5% of the samples will fall in one rejection region and slightly under 2.5% will fall in the other, both producing correct and informative decisions to reject the null, and (b) slightly under 95% of the samples will not fall in a rejection region, producing correct but uninformative decisions that either hypothesis may be true. When the true value is sufficiently different from the null value, nearly 100% of the samples will fall in one of the rejection regions, resulting in nearly 100% of the decisions being both correct and informative. Hence, the unconditional probability of a correct decision is at least .95, and the conditional probability of a correct and informative decision varies from 0.0 to nearly 1.0, depending on the difference between the true value and the null.
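The long-run percentages described for the two-tailed tests can be computed directly. The following Python sketch is illustrative and rests on assumptions not made in the text: it uses a z test with a standard normal sampling distribution instead of a t test, and it expresses the distance between the true value and the null value in standard-error units through a hypothetical parameter `delta`. The function names are likewise hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # assumed standard normal sampling distribution


def tail_rates(delta, alpha=0.05):
    """Long-run shares of samples falling in (left tail, no rejection, right tail).

    delta is the assumed (true value - null value) / standard error.
    """
    crit = Z.inv_cdf(1 - alpha / 2)
    right = 1 - Z.cdf(crit - delta)
    left = Z.cdf(-crit - delta)
    return left, 1 - left - right, right


def directional_rates(delta, alpha=0.05):
    """(P(correct), P(correct and informative)) for the directional two-tailed test."""
    left, none, right = tail_rates(delta, alpha)
    if delta > 0:
        return none + right, right   # a left-tail rejection is the only error
    if delta < 0:
        return none + left, left     # a right-tail rejection is the only error
    return none, 0.0                 # null true: only non-rejection is correct


def nondirectional_rates(delta, alpha=0.05):
    """(P(correct), P(correct and informative)) for the nondirectional test."""
    left, none, right = tail_rates(delta, alpha)
    if delta == 0:
        return none, 0.0
    return 1.0, left + right         # null false: every decision is correct


for delta in (0.0, 0.1, 1.0, 5.0):
    correct, informative = directional_rates(delta)
    print(f"delta={delta}: correct={correct:.4f}, "
          f"correct and informative={informative:.4f}")
```

Scanning `delta` over positive and negative values shows the proportion of correct decisions staying at or above .95 under these assumptions, while the proportion of correct and informative decisions ranges from 0 toward 1.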
For the one-tailed test, consider hypotheses

$H_0: \mu \le 55$ (null hypothesis)
$H_A: \mu > 55$.

This test has a 5% rejection region in the right tail of the sampling distribution. When the true value of $\mu$ is 55 or less, 95% or more of the samples will not fall in the rejection region, producing correct decisions that either hypothesis may be true. These are the only correct decisions when the null is true, and none are informative. If the true value of $\mu$ is slightly greater than 55, then slightly over 5% of the samples will fall in the rejection region, producing correct and informative decisions to reject the null, and slightly under 95% of the samples will not fall in the rejection region, producing correct but uninformative decisions that either hypothesis may be true. When the true value of $\mu$ is sufficiently greater than 55, nearly 100% of the samples will fall in the rejection region, resulting in nearly 100% of the decisions being both correct and informative. Hence, the unconditional probability of a correct decision is at least .95, and the conditional probability of a correct and informative decision varies from 0.0 to nearly 1.0, depending on the difference between the true value and the null.
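The one-tailed case differs in that the null hypothesis is composite, so the rate of correct decisions under the null can exceed 95%. A minimal Python sketch of this calculation follows. It assumes a z test in place of a t test, and the function name and the standardized distance `delta` (true value minus 55, in standard-error units) are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # assumed standard normal sampling distribution


def one_tailed_rates(delta, alpha=0.05):
    """(P(correct), P(correct and informative)) for a right-tailed test.

    delta is the assumed (true value - 55) / standard error, so
    delta <= 0 means the null hypothesis is true.
    """
    crit = Z.inv_cdf(1 - alpha)          # the whole alpha sits in the right tail
    reject = 1 - Z.cdf(crit - delta)
    if delta <= 0:
        return 1 - reject, 0.0           # null true: rejection is the only error
    return 1.0, reject                   # null false: every decision is correct


for delta in (-1.0, 0.0, 0.1, 5.0):
    correct, informative = one_tailed_rates(delta)
    print(f"delta={delta}: correct={correct:.4f}, "
          f"correct and informative={informative:.4f}")
```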
I have argued that all three hypothesis tests provide the unconditional probability of a correct decision. It is a separate issue whether this probability is useful. One might argue that it is not useful since the probability of a decision being both correct and informative varies between 0.0 and nearly 1.0, depending on the true value of the parameter, a value that is not known; however, I believe the unconditional probability is useful. After conducting an hypothesis test, one knows whether the null was rejected and therefore whether the decision was informative. If the decision was informative, one can take comfort in the fact that the hypothesis test just used has at least a $1 - \alpha$ unconditional probability of producing a correct decision.
REFERENCES
ABELSON, R. P. (1995) Statistics as principled argument. Hillsdale, NJ: Erlbaum.
BAHADUR, R. R. (1952) A property of the t-statistic. Sankhyā: The Indian Journal of Statistics, 12, 79-88.
BAKAN, D. (1966) The test of significance in psychological research. Psychological Bulletin, 66, 423-437.
BERGER, J. O. (1985) Statistical decision theory and Bayesian analysis. (2nd ed.) New York: Springer-Verlag.
BERKSON, J. (1938) Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33, 526-542.
BRAVER, S. L. (1975) On splitting the tails unequally: a new perspective on one- versus two-tailed tests. Educational and Psychological Measurement, 35, 283-301.
CARVER, R. P. (1978) The case against statistical significance testing. Harvard Educational Review, 48, 378-399.
CASELLA, G., & BERGER, R. (1990) Statistical inference. Pacific Grove, CA: Wadsworth & Brooks/Cole.
COHEN, J. (1990) Things I have learned (so far). American Psychologist, 45, 1304-1312.
COHEN, J. (1994) The earth is round (p < .05). American Psychologist, 49, 997-1003.
CORTINA, J. M., & DUNLAP, W. P. (1997) On the logic and purpose of significance testing. Psychological Methods, 2, 161-172.
FRICK, R. W. (1995) Accepting the null hypothesis. Memory & Cognition, 23, 132-138.
FRICK, R. W. (1996) The appropriate use of null hypothesis testing. Psychological Methods, 1, 379-390.
GIERE, R. N. (1977) Testing versus information models of statistical inference. In R. G. Colodny (Ed.), Logic, laws, & life. (Pittsburgh Series in the Philosophy of Science, Vol. 6) Pittsburgh, PA: Univer. of Pittsburgh Press. Pp. 19-70.
GIGERENZER, G. (1993) The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: methodological issues. Hillsdale, NJ: Erlbaum. Pp. 311-339.
GIGERENZER, G., & MURRAY, D. J. (1987) Cognition as intuitive statistics. Hillsdale, NJ: Erlbaum.
GLASS, G. V., & STANLEY, J. C. (1970) Statistical methods in education and psychology. Englewood Cliffs, NJ: Prentice-Hall.
GOOD, I. J. (1971) Comment on a paper by O. Kempthorne. In V. P. Godambe & D. A. Sprott (Eds.), Foundations of statistical inference. Toronto: Holt, Rinehart & Winston.
GREENWALD, A. G. (1975) Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1-20.
GREENWALD, A. G., GONZALEZ, R., HARRIS, R. J., & GUTHRIE, D. (1996) Effect sizes and p-values: what should be reported and what should be replicated? Psychophysiology, 33, 175-183.
HAGEN, R. L. (1997) In praise of the null hypothesis significance test. American Psychologist, 52, 15-24.
HAND, J., MCCARTER, R. E., & HAND, M. R. (1985) The procedures and justification of a two-tailed directional test of significance. Psychological Reports, 56, 495-498.
HARRIS, R. J. (1985) A primer of multivariate statistics. (2nd ed.) New York: Academic Press.
HARRIS, R. J. (1997a) Reforming significance testing via three-valued logic. In L. Harlow & S. Mulaik (Eds.), What if there were no significance tests? Mahwah, NJ: Erlbaum. Pp. 145-173.
HARRIS, R. J. (1997b) Significance tests have their place. Psychological Science, 8, 8-11.
HAYS, W. L. (1973) Statistics for the social sciences. (2nd ed.) New York: Holt, Rinehart & Winston.
HAYS, W. L. (1981) Statistics. (3rd ed.) New York: Holt, Rinehart & Winston.
HICK, W. E. (1952) A note on one-tailed and two-tailed tests. Psychological Review, 59, 316-318.
HODGES, J. L., & LEHMANN, E. L. (1954) Testing the approximate validity of statistical hypotheses. Journal of the Royal Statistical Society, London, Series B (Methodological), 16, 261-268.
HOWSON, C. (1995) Theories of probability. British Journal for the Philosophy of Science, 46, 1-32.
HSU, J. C. (1996) Multiple comparisons, theory and methods. London: Chapman & Hall.
KAISER, H. F. (1960) Directional statistical decisions. Psychological Review, 67, 160-167.
KENDALL, M., & STUART, A. (1961) The advanced theory of statistics. Vol. 2. New York: Hafner.
KIRK, R. E. (1982) Experimental design: procedures for the behavioral sciences. (2nd ed.) Belmont, CA: Brooks/Cole.
KIRK, R. E. (1984) Elementary statistics. (2nd ed.) Monterey, CA: Brooks/Cole.
KIRK, R. E. (1996) Practical significance: a concept whose time has come. Educational and Psychological Measurement, 56, 746-759.
KOOPMANS, L. H. (1987) Introduction to contemporary statistical methods. (2nd ed.) Boston, MA: Duxbury.
KYBURG, H., JR. (1974) The logical foundations of statistical inference. Boston, MA: Reidel.
LEHMANN, E. L. (1957a) A theory of some multiple decision problems: I. The Annals of Mathematical Statistics, 28, 1-25.
LEHMANN, E. L. (1957b) A theory of some multiple decision problems: II. The Annals of Mathematical Statistics, 28, 547-572.
LEHMANN, E. L. (1959) Testing statistical hypotheses. New York: Wiley.
LEVENTHAL, L. (1994a) Nudging aside Meehl's paradox. Canadian Psychology, 35, 283-298. [Erratum: 36, 88]
LEVENTHAL, L. (1994b) Statistically significant poor performance in listening tests. Journal of the Audio Engineering Society, 42, 585-587.
LEVENTHAL, L. (1999) Updating the debate on one- versus two-tailed tests with the directional two-tailed test. Psychological Reports, 84, 707-718.
LEVENTHAL, L., & HUYNH, C-L. (1996a) Analyzing listening tests with the directional two-tailed test. Journal of the Audio Engineering Society, 44, 850-863.
LEVENTHAL, L., & HUYNH, C-L. (1996b) Directional decisions for two-tailed tests: power, error rates, and sample size. Psychological Methods, 1, 278-292.
LOFTUS, G. R. (1991) On the tyranny of hypothesis testing in the social sciences. Contemporary Psychology, 36, 102-105.
LYKKEN, D. T. (1968) Statistical significance in psychological research. Psychological Bulletin, 70, 151-159.
MEEHL, P. E. (1967) Theory-testing in psychology and physics: a methodological paradox. Philosophy of Science, 34, 103-115.
MEEHL, P. E. (1978) Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
MEEHL, P. E. (1990) Appraising and amending theories: the strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1, 108-141.
MISES, R. VON. (1957) Probability, statistics, and truth. (2nd rev. ed.) London: Allen & Unwin.
MISES, R. VON. (1964) Mathematical theory of probability and statistics. New York: Academic Press.
MORRISON, D. E., & HENKEL, R. E. (Eds.) (1970) The significance test controversy. Chicago, IL: Aldine.
MOXLEY, R. A., JR. (1979) Subjective, individual and aggregate references in educational research. Instructional Science, 8, 169-205.
MURPHY, K. R. (1990) If the null hypothesis is impossible, why test it? American Psychologist, 45, 403-404.
OAKES, M. (1990) Statistical inference. Newton Lower Falls, MA: Epidemiology Resources, Inc.
POLLARD, P. (1993) How significant is "significance"? In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: methodological issues. Hillsdale, NJ: Erlbaum. Pp. 449-460.
POPPER, K. R. (1965) The logic of scientific discovery. New York: Harper & Row.
PRATT, J. W. (1971) A comment on a paper by O. Kempthorne: Probability, statistics and the knowledge business. In V. P. Godambe & D. A. Sprott (Eds.), Foundations of statistical inference. Toronto: Holt, Rinehart & Winston of Canada. Pp. 496-497.
ROGERS, J. L., HOWARD, K. I., & VESSEY, J. T. (1993) Using significance tests to evaluate equivalence between two experimental groups. Psychological Bulletin, 113, 553-565.
SAVAGE, L. J. (1972) The foundations of statistics. New York: Dover.
SCHMIDT, F. L. (1992) What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47, 1173-1181.
SEDLMEIER, P., & GIGERENZER, G. (1989) Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309-316.
SERLIN, R. C., & LAPSLEY, D. K. (1985) Rationality in psychological research. American Psychologist, 40, 73-83.
SERLIN, R. C., & LAPSLEY, D. K. (1993) Rational appraisal of psychological research and the good-enough principle. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: methodological issues. Hillsdale, NJ: Erlbaum. Pp. 199-228.
SHAFFER, J. P. (1972) Directional statistical hypotheses and comparisons among means. Psychological Bulletin, 77, 195-197.
TUKEY, J. W. (1991) The philosophy of multiple comparisons. Statistical Science, 6, 100-116.
WILSON, K. V. (1961) Subjectivist statistics for the current crisis. Contemporary Psychology, 6, 229-231.

Accepted June 30, 1999.