Null Hypothesis


Warner, R. M. (2021). Applied statistics I: Basic bivariate techniques (3rd ed.). Thousand Oaks, CA: Sage Publications. ISBN: 978-1-5063-5280-0.

CHAPTER 8 THE ONE-SAMPLE T TEST: INTRODUCTION TO STATISTICAL SIGNIFICANCE TESTS

8.1 INTRODUCTION

The previous chapter on confidence intervals (CIs) explained how sampling error is taken into account when we set up a CI on the basis of a sample mean to make inferences about a corresponding population mean. SEM provides information about sampling error. This chapter introduces a different use for sampling error: We can use sampling error to answer yes/no questions about possible values of μ. Null-hypothesis significance testing (NHST), usually just called significance testing, uses the sample mean and SEM to answer questions about hypothesized (or proposed) values of μ. First, a possible value for μ is proposed; then, using sample data, we try to answer this yes/no question: Does the proposed value of μ seem reasonable, given the value of the sample mean? A sample mean M that is very far away from the proposed value of μ can lead us to doubt that the proposed value for μ is correct.

When you see statements like “The outcome was statistically significant at p < .05,” they are telling you that statistical significance tests were used. Statistical significance tests use the same information from data as CIs (sample mean M, sample standard deviation SD, sample size N, and SEM). However, although a CI gives us a range of plausible values for the unknown population mean μ, a statistical significance test answers a yes/no question about a specific suggested hypothesized value for μ. Cumming (2014) and others argue that the yes/no question involved in statistical significance tests is not the best way to understand what sample means can tell us about potential values of population means.

This chapter describes the logic involved in statistical significance tests and the procedures for doing them. Even if future research de-emphasizes or drops statistical significance tests, as some authorities recommend, you need to know about statistical significance. As of this date (2019), most published studies in disciplines such as psychology report statistical significance tests and p values. You need to understand what p values can and cannot tell us about research results.

The logic of statistical significance tests is usually introduced using the one-sample t test as an example. A one-sample t test is used to answer yes/no questions about one population mean on the basis of one sample mean. Unfortunately, the one-sample t test is an awkward place to begin for students in behavioral, social, and medical research. Studies in these areas rarely use the one-sample t test, and it can be difficult to think of plausible situations in which you would want to know the results of this test. Research in behavioral and social sciences almost always examines questions about the ways scores on two or more variables are related. Later chapters will show how the logic of statistical significance tests can be used to evaluate the strength of relationships between variables. Those applications of significance tests make more sense to most behavioral science students. We begin with the one-sample t test only because it is relatively simple.

8.2 SIGNIFICANCE TESTS AS YES/NO QUESTIONS ABOUT PROPOSED VALUES OF POPULATION MEANS

In Chapter 7, the body temperature data in the file shoemaker.sav were used to set up a CI for mean body temperature in degrees Fahrenheit. The 95% CI based on that data did not include the value of 98.6°F that most people believe is the mean temperature for healthy human populations. In this chapter, we begin by proposing (or hypothesizing) that μ = 98.6°F; then we examine sample data to decide whether that proposed value of μ is, or is not, plausible.

The procedure for NHST involves familiar operations: computing descriptive statistics such as M, SD, and SEM and looking up critical or cutoff values of t in a table for t with df = (N – 1). New steps involve setting up null and alternative hypotheses about proposed values of μ. Each individual step is simple; however, it can be difficult for beginning students to keep all these steps in mind. It is important to go through these steps “by hand”; the more you repeat them, the better you may understand the logic.

After you escape into the “real world” and write research reports, you will not have to write out all the logic involved in NHST; most of the logic will be implicit. Research reports rarely provide detailed information about all the steps that are outlined in this chapter. SPSS and similar programs will generate final numerical results for you; you won’t need to do arithmetic and table lookups. However, you need to understand the logic so that you can understand the meaning and limitations of p values and NHST.

Sometimes real-world data analysts have access to data for an entire population of interest. Statistical significance tests are not needed when information is available for the entire population.
Significance tests are used when we want to make inferences (or estimates or guesses) about unknown population characteristics such as μ, using only data from a sample. The following sections describe the steps that are involved in NHST.

8.3 STATING A NULL HYPOTHESIS

The term hypothesis can refer to a verbal statement (e.g., “I think my partner is cheating”). For statistical significance tests, hypotheses correspond to equations. To set up a yes/no question about a proposed value for μ, the unknown population mean, we begin by stating a null hypothesis (H0) in this form:

H0: μ = μhyp (for example, H0: μ = 98.6°F). (8.1)

In words, this null hypothesis says, “I hypothesize that the true population mean body temperature equals 98.6°F.” Depending on the variable that is examined in the study, the proposed value for the population mean stated in the null hypothesis could have other values, such as a driving speed of 35 mph, a diameter of 10 cm, or an IQ of 100 points. Most books refer to the H0 equality statement in Equation 8.1 as a null hypothesis. It makes more sense to think of this as a hypothesis that can potentially be nullified or rejected on the basis of the obtained sample mean. On the basis of information from the sample temperature data (N, M, SD) we will be able to make one of two decisions:

• Reject H0. If we reject H0, that is equivalent to saying that we do not believe μhyp (which is 98.6°F in the body temperature example) is a plausible value for μ.

• Do not reject H0. If we do not reject H0, that is equivalent to saying that we cannot rule out μhyp as a plausible value for μ.

We cannot say “Accept H0.” This would be logically equivalent to saying “I have proved that μ exactly equals μhyp (98.6°F).” The logic used in NHST does not provide support for that kind of conclusion. If a research report says “accept H0,” the author has misunderstood statistical significance testing. Never, never say “accept H0”!

Neither decision (reject H0 or do not reject H0) can be made with certainty when we have only sample data. For either decision, reject or do not reject, there is a risk that the decision is wrong. In theory, NHST provides ways to evaluate the risk or probability of a Type I decision error (a decision to reject H0 when H0 is correct). Note that a researcher can make a Type I decision error even if he or she has done everything correctly. Uncertainty about decisions is inherent in the process of using sample data to make inferences about populations.

Note that the logic of NHST differs from the way we reason about evidence in everyday life. In everyday life, a person thinks of a hypothesis (such as “My dating partner is cheating on me”) and then looks for evidence to support that hypothesis. In everyday life, we tend to look for confirmatory evidence, that is, evidence that supports our initial hypotheses (Abelson & Rosenberg, 1958). NHST requires us to look for disconfirmatory evidence. In effect, in many research situations, researchers set up a null hypothesis that they don’t believe and then look for evidence to reject that null hypothesis. This requires us to think in terms of double negatives (e.g., I have evidence against a null hypothesis that I want to believe is wrong). This setup is counterintuitive; it differs from our natural inclinations in everyday reasoning. The logic of NHST focuses on evidence that is inconsistent with a null hypothesis (or, to be more precise, evidence we would be unlikely to obtain if H0 is correct).

The first step in NHST is setting up a nullifiable hypothesis. An example of a nullifiable hypothesis in NHST is H0: μ = 98.6°F. The evidence that would lead us to doubt or reject H0 is a value of M that is “very far” from μhyp (i.e., a sample mean very different from 98.6°F). The computations made in statistical significance tests make it possible to quantify precisely what we mean by “very far.” Often (but not always) data analysts hope to reject (or “nullify”) H0. In many studies, a researcher specifies a null hypothesis he or she does not believe and then hopes to obtain evidence to reject that null hypothesis. Sometimes rejecting H0 means that, from the researcher’s point of view, the study was a success.

8.4 SELECTING AN ALTERNATIVE HYPOTHESIS

Two hypotheses are needed for NHST: a null hypothesis (denoted H0) and an alternative hypothesis (denoted Halt or sometimes H1). As noted previously, the equation for a null hypothesis is of the form H0: μ = μhyp (where μhyp is a specific value chosen by the data analyst, such as 98.6°F). Note that a null hypothesis could be incorrect in any of three different ways: μ could be unequal to μhyp, greater than μhyp, or less than μhyp. (True population mean body temperature could be ≠ 98.6°F, > 98.6°F, or < 98.6°F.) Alternative hypotheses are statements of alternative realities: In the body temperature example, if H0 is incorrect, what range of outcomes for sample mean temperature would you expect? Each version of Halt specifies a different range. For a one-sample t test, a data analyst selects one of the following three alternative hypotheses.
Alternative Hypothesis 1: The population mean is hypothesized to differ from μhyp, but we do not specify a direction of difference. The equation for a two-tailed or nondirectional alternative hypothesis is:

Halt: μ ≠ μhyp. (8.2)

This version of Halt is called two-tailed because we will reject H0 for values of M that are either much higher or much lower than μhyp. These values of M correspond to t values in either the lower or upper tail of the t distribution. To evaluate distance from μhyp (such as 98.6°F), values of M are converted into t ratios, and then t ratios are used to assess distance from the mean in unit-free terms. (Use of a t ratio to assess the distance of M from a hypothesized value of μ is analogous to the use of a z score to assess the distance of a single X score from a sample mean.)

The two-tailed version of Halt is also called nondirectional because the direction of difference between M and the specific value of μhyp (such as 98.6°F) is not specified. It is called two-tailed because we reject H0 for values of M or t that correspond to either the lower or upper tail of a t distribution. In practice, the terms two-tailed test and nondirectional test can be used interchangeably. Using this nondirectional or two-tailed version of Halt, the researcher collects data, examines sample M, and rejects H0 if the sample mean M is either far above or far below μhyp. In this example, we would reject H0: μ = 98.6°F as implausible if we obtain a sample M that is either much lower or much higher than 98.6°F. Later in the chapter you’ll see how we quantify what we mean by “much higher”: Exactly how far away from μhyp does M need to be to reject H0? Can we reject H0: μ = 98.6°F if we obtain M = 99.0°F? M = 94.5°F? M = 101.3°F?

When we can specify the expected direction of difference, we can use one of the following one-tailed or directional alternative hypotheses.

Alternative Hypothesis 2 (upper tail): Halt: μ > μhyp. (8.3)

Alternative Hypothesis 3 (lower tail): Halt: μ < μhyp. (8.4)

These tests are called directional because they specify one of two possible directions in which μ might differ from μhyp. They are called one-tailed because for Halt 2, H0 is rejected only for outcome values of M and t that fall in the upper tail, and for Halt 3, H0 is rejected only for outcome values of M and t that fall in the lower tail. The terms one-tailed test and directional test can be used interchangeably. The direction of difference should be stated when test results are reported.

Why are there both two-tailed and one-tailed alternative hypotheses? In some situations, use of a one-tailed test makes it easier to reject the null hypothesis (and often data analysts want to reject the null hypothesis). I suggest that you use Halt 1 (the two-tailed test or nondirectional alternative hypothesis) in most situations. When you learn about other statistics later, such as the F test, you will find that some tests are always one tailed. The choice between one- and two-tailed Halt options is an issue primarily for t tests.

8.5 THE ONE-SAMPLE T TEST

We need to quantify the distance of M from μhyp precisely so that we can decide whether M is “very far” from μhyp. We want the distance to be in unit-free terms so that we can evaluate it as a large or small distance by looking at standardized distributions of z or t values. In earlier chapters, when we wanted to specify the distance of an individual X score from the sample mean M, we computed a z score, z = (X – M)/SD. Because z was unit free, we could look up the value of z in a table of the standard normal distribution (Appendix A at the end of the book) to evaluate areas below or above z, to decide whether the X score was very far away from M. For example, if an X score was in the top 2% of the area of the normal distribution, we could say that it was unusually high. (This works only if the X scores are normally distributed.) To evaluate the distance of M from μhyp, we do something analogous. The changes we need to make are as follows:

a. Instead of dividing the (M – μhyp) distance by SD to obtain a standardized test ratio, we divide it by the standard error of M (SEM).

b. We call the test statistic t, rather than z.

c. We call the statistic a t ratio to remind ourselves that we need to use tables for t distributions (not the normal distribution) to look up areas in tails of distributions.

A t ratio quantifies “how far” M is from μhyp in unit-free terms. Values of M that are far from μhyp correspond to values of t that are large in absolute value. A large absolute value of t tells you that the mean in your sample lies far out in the tails of the sampling distribution for M. To say this another way, a large t ratio tells you that you obtained a value of M that would be an unusual result if we assume μ = 98.6°F. To set up the t ratio for the one-sample test, we begin with the difference or distance (M – μhyp). We divide that difference by the value that provides information about variation in the sampling distribution: SEM. (Computation of SEM from SD and N was explained in Chapter 7.) This gives us the following equation for the one-sample t test:

t = (M – μhyp)/SEM. (8.5)

When this difference (M – μhyp) is divided by SEM, the resulting t ratio is unit free. (Standardizing something by dividing it by SD or SEM is a trick in that small “bag of tricks” that is used repeatedly in statistics.) The value of t describes the distance of M from μhyp in terms of “number of standard errors.” If M happens to be exactly equal to μhyp, then M – μhyp = 0, and t will equal 0. (Values of exactly 0 are not common outcomes.) This outcome for the sample mean is consistent with H0: μ = μhyp (e.g., μ = 98.6°F); however, it is not proof that μ is exactly equal to μhyp (because values of M are always influenced by sampling error).

Just as values of z have a fixed relationship with tail areas for the normal distribution, values of t have a fixed relationship with tail areas for t distributions. We will call tail areas that lie beyond specific values of t “p values.” Tables of the t distributions (in Appendix B at the end of the book) can be used to look up tail areas (p values) for selected values of df and t (as discussed in Section 8.7). We decide whether M is “very far” from μhyp by asking whether the value of t lies far out in the tails of the t distribution.

Data analysts examine p values to decide whether the data outcome (M) would be a surprising or unusual outcome if H0 is true. If M would be very unlikely to occur if H0 is true, this evidence makes it reasonable to reject H0 as implausible. However, this decision can be incorrect. When H0 is true (e.g., if mean body temperature really equals 98.6°F), researchers can expect to obtain values of M that are “far away” from 98.6°F about 5% of the time, because of sampling error (just like the members of the imaginary statistics class who drew different random samples from the same population). When we reject H0, we are betting that we didn’t get one of the unusual outcomes that arise because of sampling error. That bet, and the decision to reject, can both be incorrect.

Recapitulating the steps so far:

• We obtain (M – μhyp) to evaluate the distance of M from μhyp.

• To evaluate “how far” M is from μhyp in unit-free terms, we convert the (M – μhyp) distance into a t ratio by dividing by SEM.

• Other factors being equal, as (M – μhyp) gets larger, t gets larger.

• We can set up “reject regions” for values of t. Typical reject regions are values in the lowest 5% of the t distribution and values in the highest 5% of the t distribution. Depending on the nature of the alternative hypothesis, the reject region may consist of area in just one tail or in both tails.

• If the obtained value of t is large enough in absolute value to fall into a reject region, that tells us that the value of M is “far away” from the proposed or hypothesized value of μ.

• A value of M that is very far from μhyp is evidence that leads us to doubt H0. In the temperature example, if M is much lower than 98.6°F, we have evidence that the hypothesis μ = 98.6°F may be incorrect. (Note that extreme values of M are unlikely, but not impossible, outcomes when H0 is correct.)
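The first two steps in this recap (obtain M – μhyp, then divide by SEM) can be sketched in a few lines of Python. This is a minimal illustration, not SPSS's procedure: the function name `one_sample_t` and its parameter names are hypothetical, and the numbers are the summary statistics of the driving speed example analyzed later in this chapter (M = 39, μhyp = 35, SD = 6.103, N = 9).

```python
import math


def one_sample_t(mean, mu_hyp, sd, n):
    """Return (t, df, sem) for a one-sample t test from summary statistics.

    Hypothetical helper for illustration; it assumes sd is the sample
    standard deviation (computed with N - 1 in the denominator).
    """
    if n < 2:
        raise ValueError("n must be at least 2 to compute SEM and df")
    sem = sd / math.sqrt(n)        # standard error of the mean
    t = (mean - mu_hyp) / sem      # distance of M from mu_hyp in standard errors
    df = n - 1                     # degrees of freedom for the one-sample t test
    return t, df, sem


# Summary statistics from the chapter's driving speed example
t, df, sem = one_sample_t(mean=39, mu_hyp=35, sd=6.103, n=9)
print(f"SEM = {sem:.3f}, t({df}) = {t:.3f}")  # SEM = 2.034, t(8) = 1.966
```

The sign of t carries the direction of the difference: a positive t means the sample mean is above the hypothesized value, which matters when a directional alternative hypothesis is used.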

8.6 CHOOSING AN ALPHA (Α) LEVEL

Recall from earlier chapters that extreme or unusual values are often defined as values that fall in the upper and/or lower tails of a distribution (such as a z or t distribution). Researchers often decide to call values “unusual” if they fall in the lowest and highest 2.5% tails of the distribution. We can call these tail areas “reject” regions.

• If we obtain a t ratio that falls “far out” in one of the tails of the t distribution, in one of the reject regions, this tells us that M was far from the proposed value of μhyp.

• If M is far from μhyp, we can also say that this sample value of M would be unlikely to occur if μhyp is correct.

• This in turn may lead us to doubt that the proposed value of μhyp is correct.

To specify reject regions for values of t, we need the following information:

1. The df for the t distribution. For the one-sample t test, df = N – 1.

2. The nature of the alternative hypothesis (two tailed or one tailed).

3. We also need to choose an alpha (α) level, as explained in this section.

Even if M is very far from μhyp, and t is very large in absolute value, a decision to reject the proposed value of μhyp can be an error. If the proposed value of μ is correct, and we reject it, that is a Type I decision error. We can’t completely get rid of risk for decision error, but in theory, we can limit the risk. It is common for people to decide that a 5% risk for Type I decision error is acceptable. In the formal logic of NHST, this is what we do to limit the risk for Type I decision error. We “set” the α level to an acceptably low level of risk, often, α = .05. The α level tells us what percentage of areas in the tails of the t distribution we will identify as “reject” regions. To define specific reject regions in terms of ranges of values of t, the α level is combined with information about the type of alternative hypothesis (directional or nondirectional) and the value of df for the t distribution. Let’s set up some reject regions to show how these three pieces of information (α level, directional or nondirectional test, and df) are used in combination.

• Suppose we have a sample of N = 31 cases; df = N – 1 = 30.

• Suppose we use the nondirectional alternative hypothesis (Halt: μ ≠ 98.6°F).

• Suppose we choose α = .05 as the acceptable risk for decision error.

Using these three pieces of information, we can specify reject regions for a yes/no question (in terms of ranges of values of the t ratio given in Equation 8.5).

This logic is convoluted. Try not to worry about that too much. You will get to a place where you can make yes/no decisions about proposed values of μ very quickly. The reason to spell out these steps is to show that, in actual practice, there are many places in this process where things can go wrong. When things go wrong, the final yes/no or reject/do not reject decisions we make on the basis of NHST logic often have much higher risks for error than we want them to have.

As an example, the values of t that correspond to two-tailed α levels of .05 and .01 are marked in Figure 8.1 for a t distribution with df = 30. These cutoff values were obtained from the table in Appendix B at the end of the book. The following decision rules should limit the risk of Type I decision error to α = .05 or less (if all assumptions for the use of NHST are satisfied and all the rules are followed).

• Reject H0 if t < –2.042 (values of t in this range tell us that M is more than 2 standard errors below the hypothesized value of μ; that is, M is “very far” from μ).

• Do not reject H0 if t is between –2.042 and +2.042 (values of t in this range tell us that M is not very far away from the hypothesized value of μ).

• Reject H0 if t > +2.042 (these values of t tell us that M is more than 2 standard errors above the hypothesized value of μ; that is, M is “very far” from μ).
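The three decision rules above can be written as a small function. The sketch below is illustrative only: the function name `nhst_decision` and the labels "two-sided", "greater", and "less" are hypothetical choices, the cutoff 2.042 is the two-tailed value for α = .05 with df = 30 quoted in the rules, and the three t values in the loop are made up to land in each region.

```python
def nhst_decision(t, critical_t, alternative="two-sided"):
    """Apply a reject / do not reject rule to an obtained t ratio.

    critical_t is the positive cutoff taken from a t table for the chosen
    alpha, df, and number of tails. A one-tailed test needs a different
    cutoff than a two-tailed test at the same alpha.
    """
    if critical_t <= 0:
        raise ValueError("critical_t must be the positive cutoff value")
    if alternative == "two-sided":      # Halt: mu != mu_hyp
        reject = abs(t) > critical_t
    elif alternative == "greater":      # Halt: mu > mu_hyp (upper tail only)
        reject = t > critical_t
    elif alternative == "less":         # Halt: mu < mu_hyp (lower tail only)
        reject = t < -critical_t
    else:
        raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
    return "reject H0" if reject else "do not reject H0"


CRITICAL_T_DF30 = 2.042  # alpha = .05, two tailed, df = 30 (from the text)

for t in (-2.50, 1.20, 2.10):  # made-up t ratios, one per region
    print(f"t = {t:+.2f}: {nhst_decision(t, CRITICAL_T_DF30)}")
# t = -2.50: reject H0
# t = +1.20: do not reject H0
# t = +2.10: reject H0
```

Note that the function never returns "accept H0"; the only two outcomes are the two decisions the chapter allows.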

The decision about the proportion or percentage of the area of the sampling distribution to use as the definition for unusual or extreme outcomes of M is called the choice of alpha (α) level. An α level of .05 is the most “popular” or conventional choice of α level. Most readers assume you are using α = .05 unless you specifically state otherwise.
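What an α level of .05 means in practice can be checked with a small simulation. The sketch below is not from the source; it is a hedged illustration that repeatedly draws random samples of N = 31 from a normal population in which H0 is exactly true and counts how often |t| exceeds the df = 30 cutoff of 2.042. The population standard deviation (0.7), the number of simulated studies, and the seed are arbitrary assumptions, and the names `t_ratio` and `type_i_error_rate` are hypothetical.

```python
import math
import random

CRITICAL_T = 2.042  # alpha = .05, two tailed, df = 30 (value quoted in the text)


def t_ratio(sample, mu_hyp):
    """One-sample t ratio computed from raw scores."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return (mean - mu_hyp) / (sd / math.sqrt(n))


def type_i_error_rate(n_studies=4000, n=31, mu=98.6, sigma=0.7, seed=1):
    """Proportion of simulated studies that reject H0 even though H0 is true.

    mu is both the true population mean and the hypothesized value, so every
    rejection counted here is a Type I decision error. sigma is an arbitrary
    assumption; the rejection rate does not depend on it.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        if abs(t_ratio(sample, mu_hyp=mu)) > CRITICAL_T:
            rejections += 1
    return rejections / n_studies


print(f"Proportion of true null hypotheses rejected: {type_i_error_rate():.3f}")
```

Under these idealized conditions (normal population, random sampling, one planned test) the proportion should land close to .05. The simulation says nothing about what happens when those conditions are not met.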

This multiple-step logic is complicated (and even a little wobbly). The statisticians who developed the elements of this logic (including Fisher, Neyman, Pearson, and others) disagreed about the way the logic should work (Lenhard, 2006). The NHST methods we use today are a mixture of their ideas that probably none of them would have approved.

Here’s the catch (and it’s a big one). If the assumptions that NHST is based on are violated, and rules that should be followed when using NHST are broken, this complicated logic does not, in fact, limit the risk for error to α = .05. When rules for the use of NHST are broken, the true risk for error may be much higher than 5%. In actual practice, assumptions are often violated and rules are often broken. Those rules and assumptions are discussed later in this chapter.

8.7 SPECIFYING REJECT REGIONS ON THE BASIS OF Α, HALT, AND DF

We will reject H0 (that μ equals some specific value, such as 98.6°F) if the t ratio tells us that M is very far from μhyp. To do this, we need to define reject and do not reject regions in terms of specific values for t. To define the reject region(s) for values of t, we need to know these three things:

• Choice of Halt. This tells us whether to include only one tail or both tails in the reject region.

• Choice of α. This tells us how much area is included in one or both tails of the t distribution; often α is 5% or 1%.

• Sample df (N – 1). This tells us which t distribution to use to find critical values that cut off tail areas.
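As a cross-check on table lookups, critical values can also be computed numerically from these three inputs. The sketch below is an assumption-laden illustration, not the method used to build Appendix B: it integrates the Student's t density with Simpson's rule and finds the cutoff by bisection, using only the Python standard library. The function names (`t_pdf`, `t_cdf`, `critical_t`) and the `tails` parameter are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """P(T <= t): Simpson's rule on [0, |t|], then symmetry around 0."""
    a = abs(t)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # area between 0 and |t|
    return 0.5 + area if t >= 0 else 0.5 - area


def critical_t(alpha, df, tails=2):
    """Positive cutoff with upper-tail area alpha / tails (bisection search)."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    target = 1 - alpha / tails
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(f"{critical_t(0.05, df=15, tails=2):.3f}")  # 2.131
print(f"{critical_t(0.05, df=15, tails=1):.3f}")  # 1.753
print(f"{critical_t(0.05, df=30, tails=2):.3f}")  # 2.042
```

For a lower-tail test the cutoff is the negative of the one-tailed value, because the t distribution is symmetrical. In everyday work a statistics package would normally supply these values; the point here is only to show how α, the number of tails, and df jointly determine the cutoff.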

Suppose your sample has N = 16 (df = 15); that you use Halt: μ ≠ μhyp; and you choose α = .05. The reject regions correspond to α = .05, two tailed. Thus, you need the critical values that divide a t distribution with 15 df into the bottom 2.5%, middle 95%, and top 2.5% areas. Critical values of t can be found in the t distribution table in Appendix B at the end of the book. An excerpt from this table appears in Figure 8.2. To locate the critical (or cutoff) values for t with α = .05, two tailed, look under the heading “Level of Significance for Two-Tailed Test” in the column “.05.” To find df, look in the left-hand column in Figure 8.2. For df = 15, α = .05, two tailed, the reject regions are below –2.131 and above +2.131; we reject H0 if t is either less than –2.131 or greater than +2.131. The two-tailed reject regions are shown as a graph in Figure 8.3a.

If we use the second version of Halt (μ > μhyp), the reject region consists of only the upper tail; we have α = .05, one tailed (upper tail only). Look under “Level of Significance for One-Tailed Test” in the column “.05” to find the critical value for df = 15; this critical t value is 1.753. We reject H0 if t > 1.753. This reject region appears in Figure 8.3b. Because the t distribution is symmetrical, once you know that +1.753 identifies the top 5%, you also know that t = –1.753 corresponds to the bottom 5%. If Halt: μ < μhyp, we reject H0 for values of t below –1.753, as shown in Figure 8.3c.

The values of t used to label the reject and do not reject regions for the three different versions of Halt appear in Figures 8.3a, 8.3b, and 8.3c. (Reject regions could also be given in terms of values of M, but it is more conventional to think about them in terms of values of t.) The reject regions in Figure 8.3 correspond to values of M that are so far away from μhyp (with distance between M and μhyp expressed in terms of the unit-free t test) that they would be very unlikely to occur if H0 is true.

Later you will see that there is an easier way to decide whether to reject or not reject H0 than comparing an obtained value of t with these reject regions from a t distribution. You can just examine p values in SPSS output instead of t values. The reject/do not reject decision becomes quite simple when you do this:

• If obtained p < α, reject H0 (the outcome is called “statistically significant”).

• If obtained p > α, do not reject H0 (the outcome is called “not statistically significant,” sometimes abbreviated ns).

The α level is selected by a data analyst before looking at the data. Often α is set at .05. The p value is obtained from your computer output. SPSS reports a p value as “Sig.” SPSS usually reports two-tailed p values (for tests such as t ratios that can be either one or two tailed). If you use a two-tailed alternative hypothesis, just evaluate whether the SPSS “Sig.” or p value is less than α. If you use a one-tailed alternative hypothesis, you need to convert the two-tailed SPSS “Sig.” or p value into a one-tailed p value.
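The p value rule and the two-tailed to one-tailed conversion can be sketched as follows. This is a hypothetical illustration, not SPSS functionality: the names `one_tailed_p` and `is_significant` are made up, and the t values of ±1.9 are arbitrary. The conversion halves the two-tailed p value when the difference is in the predicted direction; when it is in the opposite direction, the sketch returns 1 minus that half, which follows from the symmetry of the t distribution and is an addition not spelled out in the text.

```python
def one_tailed_p(two_tailed_p, t, alternative):
    """Convert a two-tailed p value into a one-tailed p value.

    alternative is 'greater' (Halt: mu > mu_hyp) or 'less' (Halt: mu < mu_hyp).
    The sign of t shows whether M differs from mu_hyp in the predicted direction.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    predicted_direction = t > 0 if alternative == "greater" else t < 0
    half = two_tailed_p / 2
    return half if predicted_direction else 1 - half


def is_significant(p, alpha=0.05):
    """Decision rule: reject H0 only when the obtained p value is below alpha."""
    return p < alpha


two_tailed = 0.06                                        # value reported as "Sig. (2-tailed)"
print(is_significant(two_tailed))                        # False
print(round(one_tailed_p(two_tailed, 1.9, "greater"), 2))   # 0.03
print(round(one_tailed_p(two_tailed, -1.9, "greater"), 2))  # 0.97
```

The example shows why switching from a two-tailed to a one-tailed test after seeing the results is tempting: the same data go from p = .06 to p = .03. The choice of alternative hypothesis has to be made before the data are examined.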

Figure 8.3 Reject Regions for Two-Tailed and One-Tailed t Tests (Example: df = 15)

A one-tailed p value is half of the corresponding two-tailed p value. If SPSS says that (the two-tailed) p = .06, then the corresponding one-tailed p value is .03. For a one-tailed or directional t test, compare the one-tailed p value (in this example, p = .03) with α. You must also check that the direction of difference of M from μhyp is consistent with the direction of difference in your alternative hypothesis. To avoid possible confusion between one- and two-tailed p values, and for other reasons, I recommend that you use nondirectional (two-tailed) tests in most situations.

8.8 QUESTIONS FOR THE ONE-SAMPLE T TEST

The question examined by the one-sample t test can be worded three different ways. For the body temperature example, we use 98.6°F as the value for μhyp.

1. Can we reject H0: μ = μhyp? (The decision can be either to reject or not reject H0.)

2. Is μhyp a plausible value for μ? (The decision can be either yes or no.)

3. Is M significantly different from μhyp? (The decision can be either that M is significantly different from μhyp or that M is not significantly different from μhyp.)

The third version of the question is most consistent with the ways NHST is usually reported for most statistical significance tests discussed later in the book.

8.9 ASSUMPTIONS FOR THE USE OF THE ONE-SAMPLE T TEST

• Scores for the X variable must be quantitative. (If they are not, it makes no sense to compute a mean.)

• Scores for the X variable should be independent of one another. (In the following example with driving speeds, the speeds would be nonindependent if there were heavy traffic or if cars were racing one another. The independence assumption was discussed in Chapter 2.) If scores are not independent, the estimate of SD may be too small.

• Some sources state that the distribution of X scores in the sample must be normal. Technically, this is not correct. The assumptions made when this test was developed were that scores are normally distributed in the population, and the scores in the sample were randomly selected from the population. We usually have little information about the distribution of scores in populations. We usually have convenience samples, instead of random samples from the population of interest.

• Pay attention to non-normal features of data that could make the sample mean a poor way to describe central tendency, such as extreme outliers, a mode at zero, and bimodal distributions with modes far apart. If M does not make sense to describe scores in the sample, then the one-sample t test won’t make sense either.

Violations of assumptions can lead to p values that underestimate the true risk for Type I error.

8.10 RULES FOR THE USE OF NHST

If you want to make yes/no decisions, you should do things in the correct sequence. Before you collect data, decide on N, decide on procedures for the identification and handling of outliers, formulate the null and alternative hypotheses, and select the α level. Do one significance test (or a small number of tests). Do not run dozens or hundreds of tests and then hand-pick a few with small p values to report. After you have done significance tests, do not go back and rerun tests with variations in procedure to see if you can obtain different results. For example, do not change from a two-tailed to a one-tailed test, do not change the α level, do not drop outliers and rerun the analysis, and do not collect more data and rerun the analysis. Running large numbers of analyses in search of small p values is called p-hacking. Violations of rules can also lead to p values that underestimate the true risk for Type I decision error. Unfortunately, in real-world research, violations of rules and assumptions are fairly common. Therefore we should not have too much faith in p values.

8.11 FIRST ANALYSIS OF MEAN DRIVING SPEED DATA (USING A NONDIRECTIONAL TEST)

We are now ready to apply the one-sample t test using a hypothetical example. Suppose that a cranky resident of a college town is upset about students’ driving speeds. The posted speed limit is 35 mph. The citizen plans to gather data on driving speed to evaluate if she can plausibly complain to the police that the actual average driving speed for the population of all student drivers is significantly different from the posted speed limit. (In a later example, a one-tailed test using a directional alternative hypothesis is used.)

In the traditional approach to NHST, it is important to decide on N, α, and the nature of the alternative hypothesis before the collection and analysis of data.

Figure 8.4 Reject Regions for α = .05, Two Tailed, With 8 df, Corresponding to Shaded Areas

Step 7: Obtain descriptive statistics. For small data sets, this can be done by hand. Descriptive statistics can also be obtained using the SPSS frequencies procedure discussed in Chapter 3. For this hypothetical data set, M = 39, SD = 6.103, N = 9, and SEM = SD/√N = 6.103/√9 = 6.103/3 = 2.034.

Step 8: Find the t ratio and its df. The one-sample t ratio can be calculated by hand or obtained using the SPSS one-sample t procedure. On the basis of the null hypothesis, μhyp = 35. Given N, for a one-sample t test, df = N – 1 = 8. From the previous step, M = 39 and SEM = 2.034. Combining this information, we have t = (M – μhyp)/SEM = (39 – 35)/2.034 = 4/2.034 = 1.966. Screenshots of the output of SPSS’s one-sample t-test procedure appear in Figures 8.5 and 8.6. Enter the value for μhyp (which is 35, in this example) into the space for “Test Value.”

Step 9: Find the CI for M that corresponds to the selected α level.

Current recommendations for reporting from many sources call for inclusion of CI information when significance tests are reported. To obtain the CI for M, the one-sample t-test procedure is run a second time (using test value = 0, as demonstrated in Chapter 7). For the distribution on the left side of Figure 8.1, the middle area under the distribution corresponds to C (95%); the combined areas in the upper and lower tails correspond to α (2.5% + 2.5% = 5%). Thus, the distribution is divided into the lower 2.5%, the middle 95%, and the top 2.5%. C + α = 1.00, the entire area. The equation to obtain level of confidence C for a CI that corresponds to the α level used for a two-tailed t test is:

C = 1 – α (for example, α = .05 corresponds to C = .95, or 95%).
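Because C + α = 1.00, the confidence level is simply C = 1 – α. The sketch below applies that relation and then builds the interval for the difference (M – μhyp) from the driving speed data. It assumes the critical value t = 2.306 for 8 df (the value given in the chapter's description of the two-tailed reject regions); the function names are illustrative only.

```python
import math

# Critical t for df = 8, alpha = .05, two tailed (value cited in the chapter).
T_CRIT = 2.306


def confidence_level(alpha):
    """Confidence level C that corresponds to a two-tailed alpha: C = 1 - alpha."""
    return 1.0 - alpha


def ci_for_difference(mean, mu_hyp, sd, n, t_crit):
    """CI for (M - mu_hyp): difference plus or minus t_crit * SEM."""
    sem = sd / math.sqrt(n)
    diff = mean - mu_hyp
    margin = t_crit * sem
    return diff - margin, diff + margin


lower, upper = ci_for_difference(39.0, 35.0, 6.103, 9, T_CRIT)
print(f"C = {confidence_level(0.05):.2f}")
print(f"95% CI for (M - mu_hyp): [{lower:.2f}, {upper:.2f}]")
```

The limits agree with the interval [–.69, +8.69] shown in the SPSS output for the mean difference. Adding μhyp = 35 to both limits would give the corresponding interval for M itself.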

8.12 SPSS ANALYSIS: ONE-SAMPLE T TEST FOR MEAN DRIVING SPEED (USING A NONDIRECTIONAL OR TWO-TAILED TEST)

The SPSS one-sample t procedure was used in the previous chapter (where it was used to set up a 95% CI for M); screenshots for the menu selections appeared there. You can use the same procedure to perform the one-sample t test for M (using μhyp as the test value). Make the following menu selections: <Analyze> → <Compare Means> → <One-Sample T Test>. Enter the value of μhyp specified in the null hypothesis into the space for “Test Value”; in this example, μhyp is 35. Output appears in Figure 8.5. (We will ignore the CI information in Figure 8.5 and focus on the t test.) In the output in Figure 8.5, the obtained value of t = 1.966. This agrees with the value of t reported above from by-hand computation. This t ratio has 8 df (df = N – 1, where N = 9). The value under the heading “Mean Difference” refers to the numerator of the t ratio, that is, M – μhyp. Using M = 39 and μhyp = 35, the difference between sample mean speed and hypothesized mean speed is (39 – 35) = 4. The sample mean was 4 mph higher than the hypothesized population mean of 35 mph. The confidence interval in Figure 8.5 is for the difference between M and μhyp (not for M). The 95% CI for (M – μhyp) is [–.69, +8.69].

8.13 “EXACT” P VALUES

A new piece of information appears in the SPSS output in Figure 8.5. In the column headed “Sig. (2-tailed)” we find the “exact” p value that corresponds to the obtained value of t. This p value is the sum of the two tail areas that lie beyond the obtained t value of ±1.966. “Exact” is in quotation marks because many common data analysis practices result in p values that greatly underestimate the true risk for Type I error that p is supposed to estimate. The p value in computer output is exact only in the sense that it corresponds exactly to the tail area(s) using the obtained t value to “cut off” the tails.
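To make the "sum of the two tail areas" idea concrete, here is a minimal sketch that computes a two-tailed p value directly from the t distribution using only the Python standard library. It is not how SPSS computes p; it integrates the t density numerically with Simpson's rule, and the function names and the number of integration steps are assumptions chosen for illustration.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Area beyond +|t| and -|t|, using Simpson's rule on [0, |t|].

    steps must be even for Simpson's rule.
    """
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3      # area between 0 and |t|
    return 1.0 - 2.0 * area_zero_to_t   # what remains in the two tails


ALPHA = 0.05
p = two_tailed_p(1.966, 8)
print(f"two-tailed p = {p:.4f}")
print("reject H0" if p < ALPHA else "do not reject H0")
```

For t = 1.966 with 8 df the result should be close to the rounded value of about .085 that SPSS reports, which is larger than α = .05. As a sanity check on the method, `two_tailed_p(2.306, 8)` should come out very near .05, because 2.306 is the two-tailed critical value for 8 df.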

• If p < .05, reject the null hypothesis. More generally, reject H0 if p < α.

• If p > .05, do not reject the null hypothesis. More generally, do not reject H0 if p > α.

Proponents of the New Statistics suggest that we report the exact p value from the SPSS output (e.g., p = .0845, two tailed) and avoid making yes/no decisions about a null hypothesis. In other words, we don’t state that we reject or do not reject the null hypothesis; we don’t say that the result is statistically significant or not statistically significant. Reporting an exact p value makes it possible for readers who still prefer the traditional approach to NHST to make their own decisions whether an outcome is “significant” or not. Reporting an exact p value also avoids the following problem: What can you say if p = .051 or p = .06? For an outcome such as p = .051, you should not say that the outcome was “almost” significant. Reporting exact p values reminds us that values of p represent a continuum and that we do not have to think of .05 as a “cliff.”

8.14 REPORTING RESULTS FOR A TWO-TAILED ONE-SAMPLE T TEST

When you report results for significance tests in research papers, much of the logic is implicit. For example, you convey the information that you used Halt: μ ≠ 35 by saying that the p value is two-tailed. The example “Results” section below follows the New Statistics guidelines: Report an exact p value; do not state a decision whether the result is “statistically significant.”

Results

A one-sample t test was conducted to assess whether mean speed for a sample of N = 9 cars differed from the posted speed limit of 35 mph. For this sample, M = 39, SD = 6.103, and SEM = 2.034. The one-sample t statistic was t(8) = 1.966, p = .0845, two tailed. Cars in this sample drove an average of 4 mph faster than the posted speed limit. The 95% CI for this difference was [–.69, +8.69].

A person who prefers traditional NHST reasoning could go on to say that, using α = .05, two tailed, as the criterion for statistical significance, this difference was not statistically significant. Proponents of the New Statistics advise against this yes/no kind of thinking. When scores are given in meaningful units, it is useful to think about differences in terms of those units. In this example, sample mean driving speed exceeded the speed limit by 4 mph. In the United States, police usually do not bother to give speeding tickets unless driving speed is at least 5 mph above the speed limit (and often much higher than that). From a practical or real-world perspective, a sample mean speed only 4 mph above the posted limit is negligible. We could say this outcome has no practical significance, and it is not statistically significant. In Chapter 9, you will learn how to add effect size information when you report significance tests. Here are several things to notice about “Results” sections.

• For t tests, you must specify whether the reported p value is based on a two-tailed or one-tailed (nondirectional or directional) alternative hypothesis.

• Older textbooks sometimes reported whether p was less than or greater than a chosen α level, for example, p < .05 or p > .05, or sometimes ns as an abbreviation for not significant. Reporting an exact p value from SPSS (e.g., p = .0845, two tailed) is now preferred.

• If you don’t specify a choice of α level within the “Results” section or earlier in a research report, readers generally assume α = .05, and they may use that to draw their own yes/no conclusions about the null hypothesis.

8.15 SECOND ANALYSIS OF DRIVING SPEED DATA USING A ONE-TAILED OR DIRECTIONAL TEST

Let’s return to the car speed data. Wait! The cranky resident is really interested only in the possibility that students are driving faster on average than 35 mph (not slower). The resident could decide to do a one-tailed test. These would be the null and alternative hypotheses:

H0: μ = 35
Halt: μ > 35

For this directional version of Halt, the decision rule becomes: Reject H0 if obtained t > +1.86. If obtained t < +1.86, do not reject H0. We now examine the obtained t value compared with this one-tailed decision rule. From the same SPSS output in Figure 8.5, the obtained t was +1.966. This did not fall into the reject region using the two-tailed test, but for a one-tailed test, t = +1.966 falls in the upper tail reject region. The SPSS output reports a two-tailed p value, as noted earlier. You can obtain the one-tailed p by taking half of the two-tailed p. For the driving speed example, SPSS reported p = .0845, two tailed. (Some SPSS procedures allow you to request one-tailed p values as an option, but many procedures produce two-tailed p values by default.) The corresponding one-tailed p value = .0845/2 = .04225. For the one-tailed test (Halt: μ > 35), the decision to reject H0 could be based on either:

• Obtained t of 1.966 falls within the one-tailed reject region at the upper end of the distribution,

and/or

• one-tailed p value (calculated by taking half of the SPSS reported two-tailed p value) is .04225, which is less than the α of .05.

Whether the decision is made on the basis of the t value or the p value, the results are the same.
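The two routes to the one-tailed decision can be sketched in Python. The helper `one_tailed_p` is hypothetical; it halves the two-tailed p only when the obtained t falls in the direction named by Halt, which is the situation in the driving speed example. The handling of a t in the opposite direction (returning 1 minus half the two-tailed p) is an addition not spelled out in the text, included so that the function does not silently give a small p for a wrong-direction result.

```python
# Critical t for df = 8, alpha = .05, one tailed (upper tail), as given in the chapter.
T_CRIT_ONE_TAILED = 1.86
ALPHA = 0.05


def one_tailed_p(two_tailed_p, t, direction):
    """Convert a two-tailed p to a one-tailed p.

    direction is '>' for Halt: mu > mu_hyp, or '<' for Halt: mu < mu_hyp.
    """
    if direction not in (">", "<"):
        raise ValueError("direction must be '>' or '<'")
    in_predicted_direction = (t > 0) if direction == ">" else (t < 0)
    half = two_tailed_p / 2
    # If t is in the wrong direction, the one-tailed area is the large remainder.
    return half if in_predicted_direction else 1 - half


t_obtained = 1.966
p_one = one_tailed_p(0.0845, t_obtained, ">")

reject_by_t = t_obtained > T_CRIT_ONE_TAILED   # compare t with the critical value
reject_by_p = p_one < ALPHA                    # compare p with alpha

print(f"one-tailed p = {p_one:.5f}")
print(f"reject H0 by t: {reject_by_t}, reject H0 by p: {reject_by_p}")
```

Both comparisons lead to the same decision here, as the text states. Had the sample mean fallen below 35 mph, the same two-tailed p would not support rejecting H0 under Halt: μ > 35.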

8.16 REPORTING RESULTS FOR A ONE-TAILED ONE-SAMPLE T TEST

Using a one-tailed test, we can report the t-test result as follows:

Results

A one-sample t test was conducted to assess whether mean speed for a sample of N = 9 cars differed from the posted speed limit of 35 mph. The alternative hypothesis was that the mean population speed was greater than 35 mph. For this sample, M = 39, SD = 6.103, and SEM = 2.034. The result was t(8) = 1.966, p = .04225, one tailed. Cars in this sample drove an average of 4 mph faster than the posted speed limit. The 95% CI for this difference was [–.69, +8.69].

Authors who prefer the traditional approach to NHST would go on to say that, using α = .05, one tailed, as the criterion, this difference would be judged statistically significant. Note that everything in the write-up is the same as for the two-tailed test, except for the reported p value (now one-tailed) and any verbal statement about statistical significance.

8.17 ADVANTAGES AND DISADVANTAGES OF ONE-TAILED TESTS

In the previous example, the grumpy resident could reject H0 using a one-tailed test, p < .05, but she was not able to reject H0 using the two-tailed test. Data analysts sometimes prefer one-tailed tests because these can yield a decision to reject H0 in situations where a two-tailed test does not. Note that the converse is never true; it is not possible for a two-tailed test to be significant when the one-tailed test (in the direction of the observed difference) is not. It is possible for both tests to be significant or for both tests to be not significant.

One-tailed tests have a potential advantage compared with two-tailed tests. If we use the 5% of most unusual outcomes as the reject region, the critical value of t needed to reject H0 is smaller in absolute value than the critical value used for the nondirectional or two-tailed test. M does not have to be as far away from μhyp to reject H0 when a one-tailed test is used.

However, there are two potential problems with a one-tailed test. First, if you are wrong in your guess whether M will be larger or smaller than μhyp, you cannot reject H0. Second, you need to have a reasonable basis for the choice between the directional hypotheses Halt: μ > 35 and Halt: μ < 35, and this choice must be made before you peek at your data. Sometimes data analysts have no idea what is going to turn up in their data. It is bad practice to run a two-tailed test, have it turn out not significant, then run a one-tailed test and report only the one-tailed test.

What I did here as a teaching example (first run a two-tailed test, find a nonsignificant result, then run a one-tailed test) should not be done in practice. If you insist on thinking in traditional terms (reject or do not reject H0), you should decide on Halt and the α level before you look at your data, and stick with this choice even if you do not like the decision.

8.18 TRADITIONAL NHST VERSUS NEW STATISTICS RECOMMENDATIONS

Throughout this chapter you have seen references to two schools of thought.

• Traditional use of statistical significance tests

• The New Statistics

The traditional approach to NHST involves making decisions in yes/no terms; for example, we reject or do not reject a null hypothesis. Some journal editors have used p < .05 as a threshold for deciding whether a study is a success (and worth publishing) or a failure (and not worth publishing). This creates a bias toward publishing only decisions to reject null hypotheses; however, sometimes the decision not to reject a null hypothesis provides better information. You need to understand the logic of NHST because it is still widely used in current research reports and almost universally used in past research reports in many research fields. However, you also need to understand that there are problems and limitations of this yes/no approach to research questions. Gradually, there is growing appreciation of these problems.

Proponents of the New Statistics (e.g., Cumming, 2014, and others) argue against thinking in terms of yes/no decisions about research results. They recommend that researchers place more emphasis on confidence intervals (discussed in the previous chapter) and on effect size (discussed in the next chapter). They recommend that if p values are still included in research reports, authors should state the obtained p value as additional information and avoid using terms such as significant and nonsignificant. Their recommendation is based on concerns about the misuse and misinterpretations of p values (among other things).

I think many of you may find the New Statistics approach attractive. You don’t need to set up reject regions! You don’t have to judge your study a failure if p > .05! At least one journal (Basic and Applied Social Psychology) no longer accepts reports of p values (Trafimow & Marks, 2015). However, the New Statistics view has not entirely replaced traditional thinking (at least not yet). My current recommendation is to report “exact” p values, but don’t place too much faith in them, and always include confidence interval and effect size information. You will learn about effect size in the next chapter.

8.19 THINGS YOU SHOULD NOT SAY ABOUT P VALUES

1. If SPSS shows “Sig. (2-tailed)” as .000, do not say that p = .000. A p value is a risk for Type I error, and theoretically, this risk is never zero. The tails of t distributions are infinite; tail areas are never exactly zero, they just become smaller and smaller as t increases. If SPSS shows “Sig. (2-tailed)” as .000, report this as “p < .001, two tailed.”

2. Given small p values such as p < .001, do not say that the result is “highly significant.” In everyday language, people understand “significant” to mean important or worthy of notice. As you’ll see in upcoming chapters, a small p value does not necessarily imply that a treatment is highly effective.

3. Do not say that a p value such as .05 is “marginally significant” or “approaches significance” or is “close to significant” or “trends toward significance.” This will make readers and reviewers cringe, whether they advocate traditional use of significance tests or prefer the New Statistics approach. In the minds of traditionalists, p is either less than .05 or it isn’t. It is either significant or not. (To paraphrase the late Groucho Marx, “Close is no cigar.”) From the perspective of the New Statistics, just say that p = .052, without invoking an α = .05 criterion to decide what the p value means.
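The first rule in this list (never report p = .000) is easy to build into a small formatting helper. The function below is a hypothetical sketch, not an SPSS feature: it reports any p below .001 as "p < .001" and otherwise prints the value to three decimals without a leading zero, matching the way p values are written in this chapter. It deliberately attaches no verbal label such as "significant."

```python
def format_p(p, tails="two tailed"):
    """Format a p value for a results section (hypothetical helper).

    Values that would display as .000 are reported as 'p < .001',
    because a tail area is never exactly zero.
    """
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if p < 0.001:
        return f"p < .001, {tails}"
    text = f"{p:.3f}".lstrip("0")   # drop the leading zero: 0.052 -> .052
    return f"p = {text}, {tails}"


print(format_p(0.0))      # p < .001, two tailed
print(format_p(0.052))    # p = .052, two tailed
```

If more decimal places are wanted (the chapter sometimes reports four), the format specification in the helper would need to change accordingly.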

8.20 SUMMARY

Most of this chapter outlines procedures used in traditional approaches to interpretation of p values. Statistics textbooks prior to 2000 generally presented a traditional approach to significance testing, with a strong focus on yes/no significance tests. (Some books still do.) In recent years, advocates of the New Statistics have urged us to move away from yes/no decisions and to focus more on confidence intervals and effect size information. Effect sizes are discussed in the next chapter. Although proponents of the New Statistics (e.g., Cumming, 2014) do not necessarily dismiss p values as completely useless, they make the following recommendations.

• Do not report statistical significance test results and p values in isolation; include additional information, such as confidence intervals (discussed in the previous chapter) and effect sizes (discussed in the following chapter).

• If you report p values, give “exact” p values, and do not use them as a basis to make yes/no decisions (such as reject or do not reject the null hypothesis, or an outcome is “significant” vs. “nonsignificant”).

• Understand that because assumptions and rules for significance test procedures are often violated in practice, the true risk for decision error is usually much higher than the reported p value.

• Do not judge a study a “success” just because p < .05 or a “failure” because p > .05. (However, keep in mind that some readers and reviewers continue to think that way.)

COMPREHENSION QUESTIONS

1. Rerun the analysis for the carspeed.sav data and compare results across analyses (the chapter reports results for 95% CI and α = .05, two tailed).
a. Use α = .01, two tailed, and a 99% CI.
b. Use α = .10, two tailed, and a 90% CI.
As α increases, does it become easier or more difficult to reject H0?

2. Using the data in shoemakertemp.sav, test the null hypothesis H0: μ = 98.6°F using a nondirectional alternative hypothesis. Conduct a one-sample t test using α = .05, two tailed, as the criterion for significance. Also obtain the 95% CI. Can you reject H0: μ = 98.6°F? How does this result compare with the CI obtained for the same data in the previous chapter?

3. What is a null hypothesis?

4. Describe three possible alternative hypotheses.

5. What is an alpha level? What determines the value of α?

6. Sketch reject regions for each of the following situations:

7. What is the difference between a directional and a nondirectional significance test?

8. Other factors being equal, which type of significance test requires a value of t that is larger (in absolute value) to reject H0: a directional or a nondirectional test?

9. When a researcher reports a p value, p stands for “probability” or risk. What probability or risk does this refer to?

10. Do we typically want p to be large or small?

11. What is the conventional standard for an “acceptably small” p value?

DIGITAL RESOURCES

Find free study tools to support your learning, including eFlashcards, data sets, and web resources, on the accompanying website at edge.sagepub.com/warner3e.

Descriptions of Images and Figures

The image is a diagram of a t distribution that shows the percentage of area under the curve. The image shows values of t with 30 df that correspond to 5 percent area in the combined upper and lower tails or the 2.5 percent of area at the ends of each tail. The area between plus 2.042 on the right and minus 2.042 on the left is equal to 95 percent. The area beyond plus 2.042 is the upper 2.5 percent and beyond minus 2.042 is lower 2.5 percent. Below plus 2.042 is a statement that t is with 8 df.

The image is an extract from the critical values for the t distribution and has been adapted from the table by Fisher and Yates. The table lists different confidence intervals and levels of significance for one-tailed and two-tailed tests. It also shows the df ranges that result in the critical values. Details are below:

The 95 percent confidence interval has been circled, as has the .05 level of significance for the one-tailed and two-tailed tests. The df 15 level and the level of significance for two-tailed test values of 1.753 and 2.131 are also circled.

The image shows the reject regions for two-tailed and one-tailed t tests. There are three diagrams, for different values of H subscript alt. The df level is equal to 15.

a) H subscript alt: mu is not equal to 35. Two-tailed test or nondirectional test: reject H subscript 0 for values of t in both lower and upper tails. The image is of a normal distribution. The area between plus 2.131 on the right and minus 2.131 on the left is the central region with the statement: Do not reject if t is between minus 2.131 and plus 2.131. The area beyond plus 2.131 is the upper 2.5 percent and beyond minus 2.131 is lower 2.5 percent. Both these regions are to be rejected. Statements state: Reject if t is less than minus 2.131 and Reject if t is greater than 2.131.

b) H subscript alt: mu is greater than 35. One-tailed test or directional test: reject H subscript 0 only for t values in upper tail or values of M greater than 35. The image is of a normal distribution. The area beyond plus 1.753 on the right is the reject region of 5 percent with the statement: reject H subscript 0. The area to the left of plus 1.753 has the statement: Do not reject H subscript 0.

c) H subscript alt: mu is less than 35. One-tailed test or directional test: reject H subscript 0 only for t values in lower tail or values of M less than 35. The image is of a normal distribution. The area beyond minus 1.753 on the left is the reject region of 5 percent with the statement: reject H subscript 0. The area to the right of minus 1.753 has the statement: Do not reject H subscript 0.

The image is a diagram of a t distribution that shows the percentage of rejected area under the curve for alpha equals .05, two tailed with df equaling 8. The image shows values of t that correspond to 5 percent area in the combined upper and lower tails or the 2.5 percent of area at the ends of each tail. The area between plus 2.306 on the right and minus 2.306 on the left is equal to 95 percent. The area beyond plus 2.306 is the upper 2.5 percent and beyond minus 2.306 is lower 2.5 percent. Both these regions are shaded and have a statement: reject H subscript 0. Below plus 2.306 is a statement that t is with 8 df.

The image is a diagram of a t distribution that shows the percentage of rejected area under the curve for alpha equals .05, one tailed with df equaling 8. The image shows values of t that correspond to 5 percent area in right tail that are to be rejected. The area to the left of plus 1.86 is equal to 95 percent. This region has a statement: Do not reject H subscript 0. The area beyond plus 1.86 on the right is the 5 percent reject area and has the statement: reject H subscript 0.

CHAPTER 9 ISSUES IN SIGNIFICANCE TESTS: EFFECT SIZE, STATISTICAL POWER, AND DECISION ERRORS

9.1 BEYOND P VALUES

Many past research reports have focused on p values. When they obtain p < .05, some researchers conclude that the study was a “success” and that the findings are substantial enough to be important in the real world. There are many problems with p values.

• The many assumptions and rules for null-hypothesis significance testing (NHST) are often violated in practice. The p values obtained when rules are violated often underestimate the risk for Type I decision error. Some assumptions (such as independence of observations) are not widely understood. Other assumptions, such as the use of a random or representative sample that is selected from or similar to the population of interest, are widely ignored.

• Setting up p < .05 as the “target” in data analysis creates a temptation for data analysts to engage in p-hacking. P-hacking occurs when researchers do things such as run large numbers of analyses and focus on the few with p < .05, add and drop outliers, or change from a two-tailed to a one-tailed test, in order to achieve p < .05. A p value obtained after p-hacking will underestimate the true risk for Type I error.

• Obtained p values are often misinterpreted (in a variety of ways). Language sometimes used to describe p values (such as “highly significant”) can mislead readers into thinking that an intervention had a strong impact (as you will see in this chapter, p depends on sample size as well as treatment impact).

• The complicated logic of NHST is problematic (in addition to being confusing).

This chapter discusses issues that are important to keep in mind when looking at p values.

1. We’ll look at effect size, which provides information about the strength or impact of a treatment or predictor variable that is independent of sample size. Effect sizes should always be included in descriptions of results (along with other information, such as identification of outliers and missing values, t, df, p, M, SD, and confidence intervals [CIs]).

2. The difference between “statistical significance” versus practical, clinical, or everyday importance or significance is discussed.

3. We’ll examine how the values of N and effect size both influence the magnitude of obtained t and p values.

4. There is a brief introduction to statistical power analysis. Before doing a study, a researcher can often make an order-of-magnitude guess about effect size. Effect sizes reported in past research may be helpful. Using this guess about population effect size, and the choice of a one- versus a two-tailed α criterion for statistical significance, a data analyst can use a statistical power table to find minimum values of N that provide a reasonable chance of detecting the effect (e.g., at least an 80% chance of obtaining a statistically significant outcome).

5. Type I and Type II decision errors are discussed, along with ways to control the risks for these types of error.

6. The interpretation of both “null” outcomes (p > .05) and “significant” outcomes is discussed.

7. Guidelines for reporting results are provided, along with a list of things you should not say.

9.2 COHEN’S D: AN EFFECT SIZE INDEX

An effect size provides information about the size of differences between group means, or the impact of treatments, that is independent of sample size and often in unit-free terms that can be compared across studies. The effect size Cohen’s d provides an index that assesses the magnitude of the difference between M and μhyp independent of sample size. Its magnitude (like that of other effect size indexes) is not related to N. SPSS does not provide Cohen’s d as part of the output of t tests. However, it provides the information you need to compute Cohen’s d by hand: M, the test value μhyp, and SD. For the one-sample t test:

Cohen’s d = (M – μhyp)/SD.

Cohen’s d tells us the magnitude of the (M – μhyp) difference relative to the value of SD, the standard deviation of individual X scores. (This differs from the t ratio, where the divisor, SEM, depended on the value of N.) Cohen’s d is unit free, it can be positive or negative, and its range of possible values is related to the range of possible scores for the X variable. Cohen (1994) suggested guidelines for verbal interpretation of values of Cohen’s d (see Table 9.1). (Other tables for verbal labels of effect size sometimes provide slightly different values; these verbal labels are only approximate.)

Table 9.1 Suggested Verbal Labels for Cohen’s d Effect Size Index in Behavioral and Social Science Research

Cohen’s d effect size can be calculated for the driving speed data in the carspeed.sav file. The one-sample t test for these data was discussed in Chapter 8. For these data, M = 39, μhyp (test value) = 35, and SD = 6.103. For this example, Cohen’s d = (M – μhyp)/SD = (39 – 35)/6.103 = .655. We can say that M is about .66 or two thirds of a standard deviation above μhyp of 35. Using Cohen’s standards, d = .66 for the driving speed study would be called a medium effect size. Mean speed (39 mph) observed in the study was two thirds of a standard deviation higher than the proposed or hypothesized value of mean speed (35 mph). That difference was not statistically significant when a two-tailed test was used; it was significant, p < .05, for a one-tailed test.

Although the simple difference in original units of X is often not included in discussions of effect size, note that the difference (M – μhyp), in miles per hour in this example, can also be interpreted as information about effect size. This difference is useful information when variables are measured in meaningful units. Pek and Flora (2018) recommended that when a variable is measured in meaningful units, this type of difference (M – μhyp) should be discussed and evaluated. In the driving speed example, the difference between sample speed M = 39 and the posted speed limit that was the hypothesized value (μhyp = 35) was only 4 mph. In many regions of the United States there is an informal understanding that going 5 mph faster than the posted speed limit is not a serious violation. In this example, because the sample mean exceeded 35 mph by only 4 mph, and because the sample was not selected to be representative of the population of all student drivers, the cranky resident probably could not persuade local police that there is a serious speeding problem in the population of all student drivers, even if the difference is judged to be statistically significant.

Many authorities strongly recommend that effect size and CI information should be included for all important outcomes in a study. Some suggest that the importance of t and p values should be de-emphasized, and a few even recommend that these should be abandoned.
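The Cohen's d calculation for the driving speed data can be reproduced in a couple of lines. This is an illustrative sketch with a made-up function name; it also prints the raw difference in miles per hour, since the text recommends discussing that difference when the variable is measured in meaningful units.

```python
def cohens_d_one_sample(mean, mu_hyp, sd):
    """Cohen's d for a one-sample design: (M - mu_hyp) / SD."""
    return (mean - mu_hyp) / sd


M, MU_HYP, SD = 39.0, 35.0, 6.103   # values reported in the text

d = cohens_d_one_sample(M, MU_HYP, SD)
raw_difference = M - MU_HYP          # difference in original units (mph)

print(f"Cohen's d = {d:.3f}")                        # Cohen's d = 0.655
print(f"M - mu_hyp = {raw_difference:.0f} mph")      # M - mu_hyp = 4 mph
```

Unlike the t ratio, nothing in this calculation involves N, which is why d does not grow simply because more cases were collected.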
Effect size indexes such as Cohen’s d can be used several ways:

1. When you write a research report, you should report and interpret effect size information. For the one-sample t test, report Cohen’s d, and if the variable is measured in meaningful units, also report M – μhyp.

2. When you read published research reports, you can use effect size information to better understand the outcome and to compare strength of effects in the study with other studies. If effect size is not provided by the author, you can usually calculate effect size from the information provided.

3. An estimate of effect size can be used to help decide the sample size needed in future studies. This is called statistical power analysis; that is discussed in a later section of this chapter.

4. In systematic research reviews called meta-analyses, researchers combine effect size information across numerous studies to summarize them. That is beyond the scope of the current textbook; for further information, see Field and Gillett (2010).

9.3 FACTORS THAT AFFECT THE SIZE OF T RATIOS

9.4 STATISTICAL SIGNIFICANCE VERSUS PRACTICAL IMPORTANCE The term significant means something different in sta s cs than in everyday use. In everyday use, the word significant usually means large, substan al, of prac cal or clinical value, or worthy of no ce. By contrast, sta s cal significance has a specific technical meaning; outcomes of studies are judged “sta s cally significant” when results would be unlikely to arise just from sampling error, on the basis of the logic of NHST. It is useful to dis nguish between “sta s cal significance” and clinical or prac cal significance (Kirk, 1996). A result that is sta s cally significant may be too small to have much real-world value. A difference between M and μhyp can be sta s cally significant and yet be too small in actual units to have much prac cal or clinical significance, as in the car speed example. Sta s cal significance alone is not a guarantee of prac cal significance or usefulness (Vacha- Haase, 2001). We evaluate sta s cal significance by examining a test sta s c (such as a t ra o) and accompanying informa on such as df and p value. We evaluate prac cal or clinical significance for a one-sample t test by examining Cohen’s d and if the variable is measured in meaningful units, the M – μhyp difference. Hypothe cal examples illustrate situa ons in which a sta s cally significant result may not have much clinical or prac cal value. Suppose a researcher obtains a sample mean IQ of M = 99 for a large sample of twins (similar to results reported by Record, McKeown, & Edwards, 1970). This sample mean value of M = 99 can be evaluated rela ve to the mean IQ for the general popula on. On the basis of test norms, the popula on mean μ for IQ = 100. We can use this to set up the null hypothesis, H0: μhyp = 100. The popula on standard devia on σ for IQ is 15 (IQ test scores are scaled to these values). Cohen’s d for this study = (M – μhyp)/σ = 1/15 = .0667. On the basis of Table 9.1, this is a small effect size. 
If this effect is found in a sample with N = 2,500, using Equation 9.2, we obtain $t = .0667 \times \sqrt{2{,}500} = .0667 \times 50 = 3.335$, with 2,499 df and p of about .0004. This t ratio would be judged “statistically significant.” We could say that twins had a “statistically significantly” lower IQ than the population IQ norm of 100. However, is this twin IQ difference large enough to be important or significant in clinical or everyday terms? No. If you spent time talking with two persons, one with an IQ of 99 and one with an IQ of 100, you would not be able to detect a difference in mental ability (or whatever it is that IQ tests measure). It is not a large enough difference to have any practical or clinical importance. (On the other hand, if twins scored on average 30 points lower than the population mean IQ of 100, we would be concerned about negative effects of being born a twin.) The 1-point IQ difference in this example was statistically significant only because N was large.

Consider another purely hypothetical question. Suppose a drug company wants to convince people to buy a new weight-loss drug. Consider one possible outcome: The mean number of pounds people lose by taking this drug is M = 1 lb. (We assume μhyp = 0, weight loss of 0 lb, as the test value, the mean weight loss for a population that does not take the drug.) If thousands of cases are included in this study, a weight loss of just 1 lb could be judged “statistically significant.” Now consider a second possible outcome: The mean weight loss is M = 25 lb. Most people would agree that a drug that causes only 1 lb of weight loss is not “significant” in a practical, clinical, or everyday sense, even if it was found to be “statistically significant.” On the other hand, most people would agree that a weight loss of 25 lb could be considered significant or important from a practical or clinical perspective. (We would also want this 25-lb weight loss outcome to be statistically significant, as assurance that it is probably not due just to sampling error.)

Sometimes media reports say things such as “The new drug had a statistically significant effect on weight loss, p < .001.” Worse yet, the media report may say that “the new drug had a highly significant effect on weight loss.” The phrase “highly significant” is misleading and should never be used! Statements like these can lead readers to think that a drug caused a large amount of weight loss. It should be clear by now that the magnitude of a p value, or the phrase “statistically significant,” by itself, does not tell you how many pounds people in the study lost. Readers need additional information, specifically, the average number of pounds lost, to make a realistic evaluation of the drug’s effectiveness. Additional information (such as SD and minimum and maximum weight loss) would also be helpful. To evaluate potential effects of the new drug, readers also need to ask, What kinds of people were included in the study? Were the participants doing additional things, such as exercise and diet modification? How long did they take the drug and in what dose? How long was weight loss maintained after the drug was stopped? Was there a control group that did not receive the drug? And so forth.
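Returning to the twin IQ example earlier in this section, a short calculation can show how the same small effect size crosses the threshold for statistical significance once N is large. The following is a minimal Python sketch and is not part of the original text; the function names are hypothetical, and the cutoff of 1.96 is the large-sample two-tailed critical value for α = .05, used here as an approximation to the exact critical t.

```python
import math


def cohens_d(sample_mean, mu_hyp, sigma):
    """Cohen's d for a one-sample test: (M - mu_hyp) / sigma."""
    return (sample_mean - mu_hyp) / sigma


def t_from_d(d, n):
    """Equation 9.2: t = d * sqrt(N)."""
    return d * math.sqrt(n)


# Twin IQ example: M = 99, mu_hyp = 100, sigma = 15.
# The text reports the size of the effect without its sign, rounded to .0667.
d = round(abs(cohens_d(99, 100, 15)), 4)

# Assumption: 1.96 approximates the two-tailed critical t for alpha = .05
# when N is large; for small N the exact critical value is somewhat larger.
CRITICAL_VALUE = 1.96

for n in (25, 100, 2500):
    t = t_from_d(d, n)
    verdict = "exceeds" if t > CRITICAL_VALUE else "does not exceed"
    print(f"N = {n:>5}: d = {d:.4f}, t = {t:.3f} ({verdict} {CRITICAL_VALUE})")
```

With d held fixed at .0667, only the largest of these N values yields a t beyond the cutoff. The 1-point difference itself never changes; only the sample size does.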
Do not use the phrase “highly significant” to describe research outcomes with small p values. That language leads people to believe the results of a study have great practical or clinical importance, when in fact p < .001 can arise when a small effect is combined with a very large sample size. When you see the phrase “highly significant” in media reports, be skeptical. You need more information (such as the actual difference between means, or Cohen’s d) to evaluate whether the results of the study indicate that an intervention or treatment had strong, or even noticeable, effects.

9.5 STATISTICAL POWER

In most (although not all) applications of NHST, researchers hope to reject H0. Statistical power is defined as the probability of obtaining a value of t that is large enough to reject H0 when H0 is actually false. Refer back to Table 9.2 to see four possible outcomes when decisions are made whether to reject or not reject a null hypothesis. The outcome of interest, at this point, is the one in the upper right-hand corner of the table: the probability of correctly rejecting H0 when H0 is false, which is called statistical power. Researchers want statistical power to be reasonably high; often, statistical power of .80 is suggested as a reasonable goal. Recall that we can reject H0 when the obtained value of t is sufficiently large. Equation 9.2 is repeated here to remind you how the magnitude of t in a sample is related to sample effect size and sample N: $t = d \times \sqrt{N}$.

In Equation 9.2, d represents Cohen’s d, and N is sample size. This equation suggests that if we want to obtain a large value of t in a future study, in theory, we could do that by examining a large effect size (d) or by using a large N or both. However, any value of d we guess for population effect size may be incorrect, and even if we did know d, the magnitude of t in a future study will also be affected by sampling error. We cannot simply put values of d and N into Equation 9.2 and solve the equation for t and assume that our study will result in that value of t. In practice, values of t (like values of M) vary because of sampling error. The logic used to estimate statistical power given values of d and N is discussed in Appendix 9A.

In practice, tables can be used to look up estimated statistical power for combinations of planned values of N and guessed values of d (Cohen, 1988, 1992a, 1992b). An example of a statistical power table, adapted from Jaccard and Becker (2009), appears in Table 9.3. Given an estimate for the population value of Cohen’s d and for planned sample size N, you can look up expected statistical power in the body of the table. Alternatively, you can look down the column for an estimated population effect size, find the cell for power = .80, and look at the N for that row to find the minimum N required. This table applies only to tests that use α = .05, two tailed. Different tables would be needed for other α levels or one-tailed tests.

For example, suppose that a researcher believes that the magnitude of difference she is trying to detect using a one-sample t test corresponds to a population effect size of Cohen’s d = .50 and plans to use α = .05, two tailed. The researcher can read down the column of values for estimated power under the column headed d = .50 until reaching the table entry of .80. Then, she would look to the left (of this value of .80) for the corresponding value of N.
On the basis of the values in Table 9.3, the value of N required to have statistical power of about .80 to detect an effect size of d = .5 in a one-sample t test with α = .05, two tailed, is between 30 and 40.

Table 9.3 Power Table for the One-Sample t Test Using α = .05, Two Tailed
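The table lookup just described can also be approximated with a few lines of code. The sketch below is an illustration under stated assumptions, not the method used to build Table 9.3: it replaces the t distribution with the standard normal distribution, so its values may differ slightly from a published power table, especially for small N. The function names `approx_power` and `min_n_for_power` are hypothetical.

```python
import math
from statistics import NormalDist


def approx_power(d, n, alpha=0.05):
    """Approximate power of a two-tailed one-sample test.

    Assumption: the normal distribution stands in for the t distribution,
    which slightly overstates power when N is small.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = .05
    shift = d * math.sqrt(n)            # expected test statistic, d * sqrt(N)
    upper_tail = 1 - z.cdf(z_crit - shift)
    lower_tail = z.cdf(-z_crit - shift)
    return upper_tail + lower_tail


def min_n_for_power(d, target=0.80, alpha=0.05, n_max=10_000):
    """Smallest N (up to n_max) whose approximate power reaches the target."""
    for n in range(2, n_max + 1):
        if approx_power(d, n, alpha) >= target:
            return n
    return None


for n in (20, 30, 40, 50):
    print(f"d = .50, N = {n}: approximate power = {approx_power(0.5, n):.2f}")
print("Approximate minimum N for power of .80:", min_n_for_power(0.5))
```

Under this approximation, the smallest N with power of at least .80 for d = .50 falls between 30 and 40, in line with the lookup described above. Because the normal approximation ignores the extra uncertainty from estimating the SD, exact t-based calculations call for a slightly larger N.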

The true strength of the population effect size we are trying to detect is not known. For example, the degree to which the actual population mean μ differs from the hypothesized value, μhyp, as indexed by the population value of Cohen’s d, is not known in advance of the study. If we knew the answer to that question, we would not need to do a study! The sample size needed for adequate statistical power can be approximated only by making an educated guess about the true magnitude of the effect, as indexed by d. If the guess about the population effect size d is wrong, then the estimate of power based on that guess will also be wrong. Information from past studies can often be used to make at least approximate estimates of population effect size.

Statistical power analysis is useful when planning a future study. It is important to think about whether the expected effect size, alpha level, and sample size provide you with a reasonably large chance (reasonably high power) to obtain a statistically significant outcome. People who write proposals to compete for research funds from government grant agencies are generally required to include a rationale for decisions about planned sample size on the basis of power.

There are several places to obtain information for statistical power analysis. Jaccard and Becker (2009) provide power tables for some additional situations. SPSS has an add-on procedure for statistical power, and numerous other computer programs (some free) can do power analyses. Free online power calculators are widely available (for example, at http://powerandsamplesize.com/Calculators/). Usually researchers rely on computer programs instead of tables for power analysis. A researcher provides program input information about type of analysis (e.g., a one-sample t test), planned α level, whether a one- or two-tailed test is desired, and expected effect size. Programs usually provide either the estimated power for an input value of N or the minimum N needed to achieve a requested level of power.

You should not report a post hoc power analysis.
That is, do not look up your obtained Cohen’s d effect size and N and then report power for a study that has already been conducted. Sometimes people do this when a result was not statistically significant and then say, incorrectly, that the result “would have been” statistically significant if N had been larger. That statement is incorrect for several reasons. First, if more data were collected to augment the sample size, Cohen’s d might go down (or up). Second, this doesn’t take the effects of sampling error in future data collection into account. Third, it’s a sneaky way of pleading, please take my nonsignificant results seriously; they would have been better if I had had a larger sample. It is legitimate to consider the Cohen’s d from a past study when making statistical power estimates for future research. It may help you evaluate how large your sample sizes need to be in future studies. (However, that is different from making a claim that “the result would have been significant if I had a larger N in this completed study.”)

9.6 TYPE I AND TYPE II DECISION ERRORS

When we use NHST logic to make binary decisions (reject or do not reject a null hypothesis), we end up in one of the four situations described in Table 9.4. We cannot know the actual “state of the world,” that is, whether our null hypothesis about a population mean is really true or really false. We only know which of the two decisions we make: Reject H0 or do not reject H0. Usually (but not always) researchers want to reject H0.

Table 9.4 Type I Versus Type II Decision Errors in NHST

To make this more specific, consider this hypothetical example. A researcher plans to give all members of a sample a new weight loss drug. The X outcome measure is number of pounds lost. As a basis for comparison the null hypothesis is H0: μ = 0. The researcher will test whether M for the sample differs significantly from the proposed value of μ in this null hypothesis. The researcher probably hopes for a large positive value of M as evidence that members of the sample are losing weight, enough weight to reject the hypothesis of no weight loss. The four possible outcomes for this hypothetical study appear in Table 9.5. A researcher knows which row of the table he or she has landed in (the researcher knows whether the decision was to reject or not reject). Researchers don’t know which column they are in, because for many research questions, we never get to the point where we know what is happening in the population.

A researcher who has decided to reject H0 has either committed a Type I error or has reported a correct decision to reject H0. (The researcher can never be sure which.) A researcher who has decided not to reject H0 has either committed a Type II error or has reported a correct decision not to reject H0. (The researcher can never be sure which.)

We want the probability or risk for both types of error to be low, that is, we want both α and β to be low. When a data analyst selects an α level, such as α = .05, that choice theoretically sets an upper limit for the risk for Type I error. If α is set at .05, then in theory, we have a maximum risk of 5% for Type I error. However, the limit of risk for Type I error works in practice only if the assumptions and rules for NHST are followed, and in many situations they are not. The actual risk for Type I error in many research situations is often much higher than the nominal (selected) α level. The risk for Type II error, β, cannot be exactly known; but we know something about factors that tend to make β larger or smaller.
In the previous section we talked about statistical power: the probability of rejecting H0 when it is false. Power is (1 – β), and we want power to be high, usually on the order of .80.

Table 9.5 Type I Versus Type II Decision Errors in NHST

What does it mean for H0 to be false? H0 is true only if μ is exactly equal to 0 (or exactly equal to the proposed value in the null hypothesis, such as 98.6 or 35 or 100 in previous examples). However, H0 can be false in billions of ways. If we consider H0: μ = 35, H0 is false if μ really equals any number other than 35 (e.g., 45, 12, 35.01, 99, 34.3, and so forth). H0 can be false to varying degrees; in a sense, H0: μ = 35 is “less false” if μ is really 35.2 or 34.9 than if μ is really 30 or 51. Population effect size is the degree to which H0 is false. For example, if Cohen’s d (for the difference between the real and hypothesized population means) is d = 1.00, this indicates that the difference between hypothesis and reality is large; if d = .05, this indicates that the difference between hypothesis and reality is small. The values of β and (1 – β) vary depending on the population effect size. We never know the exact population effect size, but we can think about the values of β and (1 – β) that we would expect, in theory, for possible different values of d and for fixed decisions about N and α. Appendix 9A explains this in more detail.

These are the factors that influence β, risk for Type II error (and also 1 – β, statistical power):

• As α increases, β decreases. However, researchers are reluctant to increase α, risk for Type I error. Increasing α is not a common way to try to reduce risk for Type II error.

• As sample size N increases, risk for Type II error β decreases, and statistical power increases. This is consistent with intuitions you probably have by now: You have a higher probability to reject H0 when sample size is large.

• As population effect size such as Cohen’s d increases, risk for Type II error β decreases, and statistical power increases. Design decisions that are often under researcher control are related to effect size. This is discussed more extensively in Chapter 12 on the independent-samples t test, a test you are more likely to use and a situation that will be easier for you to think about.
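The three factors listed above can be seen together in one large-sample approximation. This is a sketch and not a formula from the original text: it assumes a two-tailed test, a positive population effect size d, and a normal approximation in place of the t distribution. The symbols Φ (the standard normal cumulative distribution function) and z with subscript 1 − α/2 (the two-tailed critical value, about 1.96 for α = .05) are introduced here for illustration.

```latex
1 - \beta \;\approx\; \Phi\!\left( d\sqrt{N} - z_{1-\alpha/2} \right)
```

Raising α lowers the critical value, and raising N or d raises the product d√N; each change increases the right-hand side, so power goes up and β goes down. The approximation omits the very small probability of rejecting H0 in the wrong tail.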

These are the factors that influence risk for Type I error, α:

• The α level that the data analyst chooses as criterion for statistical significance.

• Adherence to the assumptions and rules for NHST. If there are violations of assumptions and rules, the true risk for Type I error is often much higher than α.

If a study has an N too small to have a reasonable chance to detect an effect (to reject H0 when H0 is false), it is called underpowered. Researchers try to avoid underpowered studies by using the statistical power analysis methods in the previous section. They decide on the type of statistical analysis, the alpha level, and the nature of the test (one vs. two tailed). They make educated guesses about possible population effect size, such as Cohen’s d. They decide on an adequate level of statistical power, 1 – β, often .80. They look up these numbers in a table for statistical power to find the minimum value of N that will provide the desired level of power under those conditions. (Or they input this information into a statistical power calculating program.)

9.7 MEANINGS OF “ERROR”

Note that the term error has different meanings in everyday life than the term error in statistics. In everyday life, error means mistake. For example, if a student adds a set of numbers incorrectly when calculating a sample mean, that is an error in the everyday sense: a mistake. The assumptions and rules involved in NHST were designed to keep the risks for committing each of these kinds of error low. However, even a researcher who follows all the rules exactly still has risk for decision errors.

In statistics, we talk about many kinds of error, and each has a technical definition. So far you have learned about sampling error. Because of sampling error, the values of means vary across samples drawn from the same population. Sampling error is not a “mistake.” This is just the way the world works. Prediction error has also been mentioned: If the mean from a single sample is used to estimate an unknown population mean, it will probably not exactly equal the population mean μ; if we use M to estimate μ, we will make a prediction error.
In this chapter you learned about two new kinds of error; these are the two kinds of error that can occur when making a reject/do not reject decision about a null hypothesis. (Additional types of error arise later, such as measurement error.)

Of course, people who handle data can make mistakes (errors, in the everyday sense of the word): errors in computation or copying numerical values or interpreting numbers. Mistakes may be surprisingly common in published research reports (Green et al., 2018). The technical types of error that arise in statistics (such as sampling error and prediction error) do not arise because the data analyst has made a mistake. Procedures such as statistical significance tests involve inherent uncertainty. Even when a data analyst has done all the steps correctly, the data analyst can make a decision error, such as rejecting H0 when it is true. This kind of error is unavoidable in inferential statistics. We can’t get rid of it no matter how careful we are, but we can try to reduce the risk for error, and we must take risk for error into account when we report results.

9.8 USE OF NHST IN EXPLORATORY VERSUS CONFIRMATORY RESEARCH

In a confirmatory study, a researcher usually has a small number of hypotheses. These may have been selected during earlier exploratory research, or specified by a theory, or they may be variations of hypotheses in previous confirmatory studies. Confirmatory studies are often (but not always) experiments. Confirmatory studies often have few variables and a limited number of statistical significance tests. This is the context in which Fisher and colleagues developed the logic for NHST. Researchers may face fewer temptations to violate some of the rules of NHST in confirmatory studies than in exploratory research. However, there still are many ways to violate rules and assumptions for NHST in confirmatory research, for example, by trying out different methods of handling outliers and switching from two-tailed to one-tailed tests.

In exploratory studies, research questions are often open ended. For example, in a nonexperimental survey, an analyst may evaluate many variables to see which one(s) best predict an outcome such as life satisfaction. Fishing for predictors in a large set of “candidate” variables potentially opens up a much wider range of ways to violate rules for NHST.

Some journals seem to accord greater value to confirmatory studies than to exploratory work. Perhaps because of this, there is a temptation for researchers who have done exploratory studies (who have tried out many different combinations of variables, rules for identification, handling of outliers, etc.) to cherry-pick a small set of results and write research reports that make it sound as if the study were confirmatory.

Exploratory and confirmatory studies both have value. In many research areas, truly confirmatory studies are possible only after a period of exploratory work.
However, reporting hand-picked p values from large numbers of tests in exploratory studies violates a fundamental rule for the use of NHST: Do only a small number of significance tests. When a small number of selected results from an exploratory study are reported as if they were obtained through a confirmatory study, p values can greatly underestimate the true risk for Type I error.

A specific study may provide information to do both confirmatory and exploratory analyses. When this is the case, the first part of a “Results” section can report a limited number of analyses for which the researcher had specific hypotheses in advance. A later section titled “Exploratory Results” can report additional interesting results that were not predicted in advance.

In general, we should not place much faith in p values obtained in exploratory research that includes numerous variables and analyses (particularly when many of these variables and analyses are not reported). We can have somewhat more faith in p values from truly confirmatory studies that include small numbers of variables and analyses, provided that assumptions and rules for NHST were not violated.
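One way to see why p values from many tests can understate the risk of Type I error is to compute the chance of at least one false rejection across a set of tests. The sketch below is an illustration and not part of the original text. It rests on two simplifying assumptions that real studies rarely satisfy exactly: the tests are independent of one another, and every null hypothesis is true. The function name `familywise_error_rate` is hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one Type I error across n_tests tests.

    Assumptions: the tests are independent and every null hypothesis is
    true, so each test wrongly rejects H0 with probability alpha.
    """
    return 1 - (1 - alpha) ** n_tests


for k in (1, 10, 50, 100):
    risk = familywise_error_rate(0.05, k)
    print(f"{k:>3} tests at alpha = .05: "
          f"risk of at least one Type I error = {risk:.3f}")
```

Under these assumptions, the risk of at least one Type I error is already about .40 with 10 tests, far above the nominal α of .05 attached to any single test.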

If we do not place faith in p values and significance tests, what results should we focus on? Effect sizes and confidence intervals are not completely problem free; however, they are often less problematic than p values.

9.9 INFLATED RISK FOR TYPE I DECISION ERROR FOR MULTIPLE TESTS

In introductory statistics books, students learn to do one analysis at a time. However, in practice, researchers often run large numbers of statistical significance tests. A single research report may include 50 or 100 significance tests. When you run large numbers of significance tests and then selectively report only the outcomes with the smallest p values, the true risk for Type I decision error is much higher than the nominal α level chosen to evaluate significance. When a data analyst runs dozens or hundreds of significance tests in search of p < .05, this is called p-hacking. This leads to p values that seriously underestimate the true risk for Type I decision error. P-hacking includes but is not limited to adding and dropping scores, cases, groups, and variables when you do analyses and reporting only a few results. These practices violate the implicit, but crucial, assumption that we do only one (or a limited number of) significance tests.

9.10 INTERPRETATION OF NULL OUTCOMES

When a study yields a nonsignificant result, we cannot “accept H0.” A researcher may obtain a nonsignificant test result (even when H0 is actually false) for many different reasons:

1. The effect size the researcher is trying to detect (e.g., the magnitude of the difference between μ and μhyp) is very small.

2. The number of cases in the study (N) may be too small to provide adequate statistical power for the significance test.

3. A nonsignificant result can be a Type II decision error that has occurred because of sampling error.

4. Failure to reject H0 when it is false can occur because of other design flaws, such as measures that are unreliable or invalid, or lack of control for other important variables.

Editors of some journals are reluctant to publish papers that do not report statistically significant results (p < .05). This is unfortunate for several reasons.

• Sometimes not rejecting H0 is the correct answer.

• Nonsignificant outcomes may be consigned to file drawers or wastebaskets, and other researchers don’t know about these results.

• Researchers face a temptation to engage in p-hacking (torture the data until they obtain p < .05).

• Authors of research summaries such as meta-analyses have difficulty locating all relevant studies when many are unpublished; it is important to include studies that found no effect when assessing average effect across all past studies.

9.11 INTERPRETATION OF STATISTICALLY SIGNIFICANT OUTCOMES

Reports of “statistically significant” outcomes should also be viewed with caution. It is important to understand that a “statistically significant” outcome can be obtained even when H0 is correct.

Here are some common reasons why a decision to reject H0 and call a test result “statistically significant” may be incorrect.

9.11.1 Sampling Error

A statistically significant outcome may arise because of sampling error. That is, even when the null hypothesis H0: μ = μhyp is correct, some values of the sample mean that are quite far away from μhyp can arise just because of sampling error or chance. By definition, when the nominal alpha level is set at .05, values of M that are far enough away from μhyp to meet the criterion for the decision to reject H0 occur about 5% of the time when the null hypothesis is actually correct.

9.11.2 Human Error

Human error in computation and reporting of statistics is common (Green et al., 2018). Usually errors are in favor of a researcher’s preferred outcome (people rarely recheck their numbers when they have results they like).

9.11.3 Misleading p Values

Obtained p values can underestimate the true risk for Type I error. The decision to reject H0 is often made by noticing whether obtained p is less than .05. If the obtained p value underestimates the true risk for Type I error, then the decision to reject H0 may be incorrect.

9.12 UNDERSTANDING PAST RESEARCH

When you read past research, think about these questions.

Were too many significance tests done for p values to be believable? There is no universally agreed upon rule about the number of tests that is acceptable. I suggest that if you see more than 10 p values in a research report, you should begin to suspect that at least a few of them are due to Type I error. Ideally, authors should acknowledge this problem (inflated risk for Type I error when multiple tests are performed) in the discussion sections of papers. If an author reports that an important variable was measured 12 different ways and then reports statistically significant results for only 1 of these measures, you might suspect that the other 11 measures did not turn out to be significant when they were examined. Sometimes numerous tests are done but not included in a paper.
The use of too many significance tests is problematic whether you see them in the published paper or not.

Evaluate p values critically. Realize that violations of assumptions and rules (that probably are not explicitly reported in most research reports) can make p values poor estimates of the true risk for Type I error. Realize that a very small p value does not necessarily imply that the effect is large in practical or clinical terms.

Look for effect size information. If effect size is not reported, there should be sufficient information for you to calculate this by hand. All you need to find Cohen’s d is M, SD, and μhyp (the proposed or hypothesized value of μ). Also evaluate whether the effect size is large enough to have any practical or clinical importance. When variables are measured in meaningful units, M – μhyp is useful information.

Look for confidence intervals.

Ask if it is reasonable to generalize from the types of cases in this study to larger populations in the real world. Ask if the situation in the study is comparable with real-world situations.

9.13 PLANNING FUTURE RESEARCH

Research methods textbooks specific to your field of interest provide much information about planning research. From the perspective of NHST, here are some important issues.

Make decisions ahead of time about significance tests (test statistic, α level, directional or nondirectional test).

Make decisions ahead of time about the identification and handling of outliers.

Estimate the population effect size. Effect sizes from past studies (your own past research or other people’s) may be used to do this. It is better to underestimate population effect size than to overestimate it.

Use your estimated effect size and type of test (e.g., one-sample t test, α = .05, two tailed) to look up estimated power for your effect size and planned N. Or, using .80 for power, figure out the minimum N needed to have 80% power.

9.14 GUIDELINES FOR REPORTING RESULTS

The information to include in a research report depends on the specific test. For a one-sample t test, include N, M, SD, df, SEM, t, and (exact) p; whether p is one tailed or two tailed; effect size information such as Cohen’s d and/or M – μhyp; and a CI for M (or for M – μhyp). The following elements should be included in a written report for a one-sample t test.

• A statement of what test was done, for what variable.

• Sample size (N), M, SD, and SEM.

• The CI for M (or the CI for the M – μhyp difference).

• Obtained t with its df and exact p. State whether p is one tailed or two tailed.

• Traditionally, a statement of whether a test was statistically significant and/or whether the null hypothesis can be rejected has usually been included. Proponents of the New Statistics suggest that we should avoid yes/no thinking and instead focus on confidence intervals and effect sizes.

• Effect size (such as Cohen’s d) and, if units of measurement are interpretable, a difference such as M – μhyp may also be useful as information about practical significance.

Here is an example of a complete “Results” section for a one-sample t test that includes all information listed above.

Results

A one-sample t test was conducted to assess whether mean speed for a sample of N = 9 cars differed from the posted speed limit of 35 mph. A two-tailed test was used. For this sample, M = 39, SD = 6.103, and SEM = 2.034. The 95% CI for M was [34.31, 43.69]. The result was t(8) = 1.966, p = .0845, two tailed. Cohen’s d effect size was .66; by Cohen’s standards, this represents a medium effect. However, the obtained 4 mph difference between the sample mean (M = 39) and the posted speed limit (35 mph) was too small to have much practical importance.
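The quantities in a report like this one can be recomputed from N, M, SD, and the hypothesized value alone, which is a useful check when reading or writing results. The following minimal sketch is not part of the original text; the function name `one_sample_summary` is hypothetical. The critical t of 2.306 (two tailed, α = .05, df = 8) is an assumption taken from a standard t table instead of being computed, and the exact p value is not calculated here because the Python standard library has no t distribution.

```python
import math


def one_sample_summary(n, mean, sd, mu_hyp, t_crit):
    """Recompute SEM, t, Cohen's d, and a CI for M from summary statistics."""
    sem = sd / math.sqrt(n)        # standard error of the mean
    diff = mean - mu_hyp           # M - mu_hyp, in the original units
    t = diff / sem                 # one-sample t ratio
    d = diff / sd                  # Cohen's d
    half_width = t_crit * sem      # margin of error for the CI
    return {
        "df": n - 1,
        "sem": sem,
        "t": t,
        "d": d,
        "ci": (mean - half_width, mean + half_width),
    }


# Car speed example: N = 9, M = 39, SD = 6.103, mu_hyp = 35.
# Assumption: 2.306 is the tabled two-tailed critical t for df = 8, alpha = .05.
summary = one_sample_summary(n=9, mean=39.0, sd=6.103, mu_hyp=35.0, t_crit=2.306)

print(f"SEM = {summary['sem']:.3f}")
print(f"t({summary['df']}) = {summary['t']:.3f}")
print(f"Cohen's d = {summary['d']:.2f}")
print(f"95% CI for M = [{summary['ci'][0]:.2f}, {summary['ci'][1]:.2f}]")
```

Note that t can be obtained either as (M – μhyp)/SEM, as here, or as d × √N from Equation 9.2; the two routes give the same value.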

We could add that, using α = .05, two tailed, this difference was not statistically significant. A discussion section following these results should consider limitations such as the following:

• An accidental sample may not be representative of (similar to) the population of all drivers in this town. If the sample contained mostly male (rather than female) drivers, or was obtained mostly during rush hour, the sample mean may overestimate driving speed for cars more generally.

• This sample size (N = 9) is too small to draw meaningful conclusions.

• This report makes no mention of screening for outliers (was one driver clocked at 90 mph?).

You may be able to think of additional questions.

9.15 WHAT YOU CANNOT SAY

A major problem with p values is that they cannot answer the question we really want to answer. We would like to know something about the probability that the null (or the alternative) hypothesis is correct, given the information in our sample data. Instead, a p value tells us (often very inaccurately) about the probability of obtaining the values of M and t we got in our sample, given that the null hypothesis is correct (Cohen, 1994). I don’t suggest that you try to say that in a research report (it may confuse your readers). Here are examples of things you should not say. Never make any of the following statements:

• p = .000

• p was “highly” significant

• p was “almost” significant (or synonymous terms such as “close to” or “marginally” significant)

For “small” p values, such as p = .04, we cannot say:

• Results were not due to chance, or could not be explained by chance (we don’t know that!)

• Results will replicate in future studies

• H0 is false

• We accept (or have proved) the alternative hypothesis

• Because p is small, this is an important difference

We also cannot use (1 – p), for example (1 – .04 = .96), to make probability statements such as:

• There is a 96% chance that results will replicate

• There is a 96% chance that the null hypothesis is false

For p values on the order of p = .37, we cannot say, “Accept the null hypothesis.”

The language we use to report results should not overstate the strength of the evidence, imply large effect sizes in the absence of careful evaluation of effect size, overgeneralize the findings, or imply causality when rival explanations cannot be ruled out. We should never say, “This study proves that….” Any one study has limitations. As suggested in Chapter 1, it is better to think in terms of degrees of belief. As we obtain increasing amounts of good-quality evidence, we may become more confident of a belief. We should also pay attention to inconsistent evidence that would reduce our belief. We can say things such as:

• The evidence in this study is consistent with the hypothesis that …

• The evidence in this study is not consistent with the hypothesis that …

Hypothesis can be replaced by similar terms, such as prediction.

9.16 SUMMARY

In practice, many assumptions and rules for the use of NHST are frequently violated. Samples are often not randomly selected from real populations of interest or evaluated for their representativeness relative to real-world populations. Researchers often report large numbers of significance tests. The desire to obtain statistically significant results can tempt researchers to engage in “data fishing”; researchers may “torture” their data until it confesses (Mills, 1993). For example, they may run many different analyses or delete extreme scores until they obtain statistically significant results. When any of these violations of rules and assumptions are present, reported p values do not accurately represent the true risk for incorrectly rejecting H0.

In addition to difficulties and disputes about the logic of statistical significance testing, there are additional reasons why the results of a single study should not be interpreted as conclusive evidence that the null hypothesis is either true or false. A study can be flawed in many ways that make the results uninformative, and even when a study is well designed and carefully conducted, statistically significant outcomes sometimes arise just by chance. Therefore, the results of a single study should never be treated as conclusive evidence. To have enough evidence to be confident that we know how variables are related, it is necessary to have many replications of a result based on methodologically rigorous studies.

Despite logical and practical problems with NHST, most experts do not recommend that NHST and reports of p values be entirely abandoned. NHST can help researchers evaluate whether chance or sampling error are likely explanations for an observed outcome of a study.

We can’t completely get rid of risk for error, no matter how well we behave. But we should avoid behaviors that we know make our risk for error worse. These behaviors have been given many names (p-hacking, fishing, data torturing, questionable research practices). I will remind you of these problems as you learn additional statistical tests.

Thou shalt not place too much faith in p values.

APPENDIX 9A: FURTHER EXPLANATION OF STATISTICAL POWER

We can incorporate sampling error into understanding statistical power by visualizing two sampling distributions. The first describes the sampling distribution for M (and for t) if H0 is correct. The second describes the sampling distribution for M if μ equals a specific value different from the value specified in H0. In the following example, let’s consider testing hypotheses about intelligence scores. Suppose that the null hypothesis is H0: μ = 100.

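The numbers used throughout this appendix can be reproduced with a short script. What follows is a minimal sketch under stated assumptions, not the author's procedure. It takes H0: μ = 100, the alternative μ = 115, and N = 10 from the text. It assumes SD = 15 (implied by d = 1 for a 15-point difference) and hard-codes the two-tailed critical t of 2.262 for df = 9 from a standard t table. The helpers `t_pdf` and `t_cdf` are hypothetical names; they stand in for the applet the appendix mentions by integrating the t density numerically. Like the appendix, the sketch uses a central t distribution centered on 115, which is an approximation; dedicated power software typically uses a noncentral t distribution and may give slightly different values.

```python
import math

# Values taken from the appendix text.
MU_HYP = 100.0   # population mean under H0
MU_ALT = 115.0   # one specific alternative (Cohen's d = 1)
N = 10           # sample size, so df = N - 1 = 9

# Assumptions that this passage does not state outright.
SD = 15.0        # implied by d = 1 for a 15-point difference
T_CRIT = 2.262   # two-tailed critical t, alpha = .05, df = 9 (standard t table)


def t_pdf(x, df):
    """Density of the central t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] plus symmetry around 0."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # area under the density between 0 and |x|
    return 0.5 + area if x >= 0 else 0.5 - area


df = N - 1

# Step 1: sampling distribution of M if H0 is true.
sem = round(SD / math.sqrt(N), 2)       # rounded to two decimals, as in the text
m_crit_upper = MU_HYP + T_CRIT * sem    # reject H0 for M above this value
m_crit_lower = MU_HYP - T_CRIT * sem    # or for M below this value

# Steps 2 and 3: sampling distribution of M if mu is really 115.
t_alt = (m_crit_upper - MU_ALT) / sem   # distance from 115 to the upper critical M
power = 1 - t_cdf(t_alt, df)            # area to the right of t_alt

print(f"SEM = {sem:.2f}")
print(f"critical values of M: {m_crit_lower:.2f} and {m_crit_upper:.2f}")
print(f"t = {t_alt:.2f}, approximate power = {power:.3f}")
```

Changing `N`, `SD`, or `MU_ALT` shows how sample size and effect size move the approximate power, provided `T_CRIT` is updated to match the new df.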
The next step is to ask what values of M would be expected to occur if H0 is false (one of many ways that H0 can be false is if μ is actually equal to 115). An actual population mean of μ = 115 corresponds to a Cohen’s d effect size of 1 (i.e., the actual population mean 115 is 1 standard deviation higher than the value of μhyp = 100 given in the null hypothesis). The lower panel of Figure 9.1 illustrates the theoretical sampling distribution of M if the population mean is really equal to 115. We would expect most values of M to be fairly close to 115 if the real population mean is 115, and we can use SEM to predict the amount of sampling error that is expected to arise for values of M across many samples.

The final step involves asking this question: On the basis of the distribution of outcomes for M that would be expected if μ is really equal to 115 (as shown in the bottom panel of Figure 9.1), how often would we expect to obtain values of the sample mean M that are larger than the critical value of M = 110.72 (as shown in the upper panel of Figure 9.1)? Note that values of M below the lower critical value of M = 89.28 would occur so rarely when μ really is equal to 115 that we can ignore this set of possible outcomes.

To work out the probability of obtaining a sample mean M greater than 110.72 when actual μ = 115, we find the t ratio that tells us the distance between the “real” population mean, μ = 115, and the critical value of M = 110.72. This value is t = (M – μ)/SEM = (110.72 – 115)/4.74 = –.90. The likelihood that we will obtain a sample value for M that is large enough to be judged statistically significant given the decision rule developed previously (i.e., reject H0 for M > 110.72) can now be evaluated by finding the proportion of the area in a t distribution with 9 df that lies to the right of t = –.90. Tables of the t distribution, such as the one in Appendix B, do not provide this information; however, it is easy to find Java applets on the Web that calculate exact tail areas for any specific value of t and df. Using one such applet, the proportion of the area to the left of M = 110.72 and t = –.90 (for the distribution centered at μ = 115) was found to be .20, and the proportion of area to the right of M = 110.72 and t = –.90 was found to be .80.

The shaded region on the right-hand side of the distribution in the lower part of Figure 9.1 corresponds to statistical power in this specific situation; that is, if we test the null hypothesis H0: μ = 100 using α = .05, two tailed, and a t test with df = 9, and if the real value of Cohen’s d = 1.00 (i.e., the real population mean is equal to μ = 115), then there is an 80% chance that we will obtain a value of M (and therefore a value of t) that is large enough to reject H0.

We can check the results of this example against the power estimates from Table 9.3. The example in this appendix involves a one-sample t test with an effect size d = 1.00, a sample size N = 10, and α = .05, two tailed. Table 9.3 provides estimates of statistical power for a one-sample t test with α = .05, two tailed. If you find the column for d = 1.00, and the row for N = 10, the table entry that corresponds to the estimated power is .80. This agrees with the power estimate that was based on an examination of the distributions of the values of M shown in Figure 9.1.

COMPREHENSION QUESTIONS

1. What is a Type I error?
2. What factors influence the magnitude of risk for Type I error?
3. What is a Type II error?
4. What factors influence the magnitude of risk for Type II error?
5. What is statistical power? What information is needed to decide what sample size is required to obtain some desired level of power (such as 80%)?
6. How are the risk for a Type II error and statistical power related?
7. What can a researcher do to decrease risk for Type II error (and increase statistical power)?
8. Why do reported or “nominal” p values often seriously underestimate the true risk for a Type I error?
9. In your own words, what does it mean to say “p < .05”?
10. Describe at least two potential problems with NHST.
11. What conclusions can be drawn from a study with a null result?
12. What conclusions can be drawn from a study with a “statistically significant” result?
13. Briefly discuss: What information do you look at to evaluate whether an effect obtained in an experiment is large enough to have “practical” or “clinical” significance?
14. Suppose a researcher writes in a journal article that “the obtained p was .032; thus there is only a 3.2% chance that the null hypothesis is correct.” Is this a correct or an incorrect statement?
15. Identify two common violations of assumptions that make the actual risk for Type I error much higher than the “nominal” risk for Type I error that is set by choosing an alpha level. (There are more than two.)
16. Using Table 9.3, suppose you plan a study using a one-sample t test. On the basis of past research, you think that the effect size you are trying to detect may be approximately equal to Cohen’s d = .30. You plan to use α = .05, nondirectional (two tailed). (a) If you want to have power of .80, what minimum N do you need in the sample? (b) If you can afford to have only N = 20 participants in the sample, approximately what is the expected level of statistical power?

DIGITAL RESOURCES

Find free study tools to support your learning, including eFlashcards, data sets, and web resources, on the accompanying website at edge.sagepub.com/warner3e.

Descriptions of Images and Figures

The image is a combination diagram with two graphs that illustrates statistical power and risk for Type II error.

1. The first diagram shows the distribution of values of M if H0: μ = 100 and SEM = 4.74. The graph is that of a normal distribution where the X axis ranges from 90 to 110. The critical value of M is 110.72. The tail region that corresponds to α/2 at the tail end of the distribution on either side equals .025. This has been shaded.
2. The second diagram shows the distribution of values of M if μ = 115 and SEM = 4.74. The graph is that of a normal distribution where the X axis ranges from 110 to 125. The critical value of M is 110.72. The region that corresponds to power (1 – β) = .80 of the distribution has been shaded. This is the entire region to the right of 110.72.