Null Hypothesis
Warner, R. M. (2021). Applied statistics II: Multivariable and Multivariate Techniques. Los Angeles, CA: Sage Publications. ISBN: 978-1-5443-9872-3
CHAPTER 1
THE NEW STATISTICS

1.1 REQUIRED BACKGROUND

This book begins with analyses that involve three variables, for example, an independent variable, a dependent variable, and a variable that is statistically controlled when examining the association between these, often called a covariate. Later chapters describe situations that involve multiple predictors, multiple outcomes, and/or multiple covariates. The bivariate analyses covered in introductory statistics books are the building blocks for these analyses. Therefore, you need a thorough understanding of bivariate analyses (i.e., analyses for one independent and one dependent variable) to understand the analyses introduced in this book.

The following topics are covered in Volume I (Applied Statistics I: Basic Bivariate Techniques [Warner, 2020]) and most other introductory statistics books. If you are unfamiliar with any of these topics, you should review them before you move forward.

• The use of frequency tables, histograms, boxplots, and other graphs of sample data to describe approximate distribution shape and extreme outliers. This is important for data screening.
• Understanding that some frequently used statistics, such as the sample mean, are not robust against the impact of outliers and violations of other assumptions.
• Computing and interpreting sample variance and standard deviation and the concept of degrees of freedom (df).
• Interpretation of standard scores (z scores) as unit-free information about the location of a single value relative to a distribution.
• The concept of sampling error, indexes of sampling error such as SE_M, and the way sampling error is used in setting up confidence intervals (CIs) and statistical significance tests.
• Choice of appropriate bivariate statistics on the basis of types of variables involved (categorical vs. quantitative and between-groups designs vs. repeated measures or paired or correlated samples).
• The most commonly used statistics, including independent-samples t, between-S analysis of variance (ANOVA), correlation, and bivariate regression. Ideally, you should also be familiar with paired-samples t and repeated-measures or paired-samples ANOVA. The multivariable and multivariate analyses covered in this book are built on these basic analyses.
• The logic of statistical significance tests (null-hypothesis statistical testing [NHST]), interpretation of p values, and limitations and problems with NHST and p values.
• Distributions used in familiar significance tests (normal, t, F, and χ²) and the use of tail areas to describe outcomes as unusual or extreme.
• The concept of variance partitioning. In correlation and regression, r² is the proportion of variance in Y that can be predicted from X, and (1 – r²) is the proportion of variance in Y that cannot be predicted from X. In ANOVA, SS_between provides information about proportion of
variance in Y that is predictable from group membership, and SS_within provides information about variance in Y that is not predictable from group membership.
• Effect size.
• The difference between statistical significance and practical or clinical importance.
• Factors that influence statistical power, particularly effect size and sample size.

1.2 WHAT IS THE “NEW STATISTICS”?

In the past, many data analysts relied heavily on statistical significance tests to evaluate results and did not always report effect size. Even when used correctly, significance tests do not tell us everything we want to know; misuse and misinterpretation are common (Greenland et al., 2016). Misuse of significance tests has led to selective publication of only results with p < .05; publication of these selected results has sometimes led to widespread reports of “findings” that are not reproduced when replication studies are performed. The focus on “new” and “statistically significant” outcomes means that we sometimes don’t discard incorrect results. Progress in science requires that we weed out mistakes, as well as make new discoveries.

Proponents of the “New Statistics” (such as Cumming, 2014) do not claim that their recommendations are really new. Many statisticians have called for changes in the way results are evaluated and reported, at least since the 1960s (including but not limited to Cohen, 1988, 1992, 1994; Daniel, 1998; Morrison & Henkel, 1970; and Rozeboom, 1960). However, practitioners of statistics are often slow to respond to calls for change, or to adopt new methods (Sharpe, 2013).

The main changes called for by New Statistics advocates include:

• Understanding the limitations of significance tests.
• The need to report effect sizes and CIs.
• Greater use of meta-analysis to summarize effect size information across studies.

All introductory statistics books I know of cover statistical significance tests and CIs, and most discuss effect size. Adopting the New Statistics perspective does not require you to learn anything new.
New Statistics advocates only ask you to think about topics such as statistical significance tests from a more critical perspective. Even though you have probably studied CIs and effect size before, review can be enlightening. This chapter also includes a brief introduction to meta-analysis.

1.3 COMMON MISINTERPRETATIONS OF P VALUES

Advocates of the New Statistics have pointed out that misunderstandings about interpretation of p values are widespread. In a survey of researchers that asked which statements about p values they believed to be correct, large numbers of them endorsed incorrect interpretations (Mittag & Thompson, 2000). Statistics education needs to be improved so that people who use NHST understand its limitations.

There are numerous problems with p values that lead to misunderstandings.
• A p value cannot tell us what we want to know. We would like to know, on the basis of our data, something about the likelihood that a research hypothesis (usually an alternative hypothesis) is true. Instead, a p value tells us, often very inaccurately, about the probability of obtaining values of M and t at least as extreme as those we found using our sample data, given that the null hypothesis is correct (Cohen, 1994).
• Common practices, such as running multiple tests and selecting only a few to report on the basis of small p values, make p values very inaccurate information about risk for Type I decision error.
• Even if we follow the rules and do everything “right,” there will always be risk for decision error. Ideal descriptions of NHST require us to obtain a random sample from the population of interest, satisfy all the assumptions for the test statistic, have no problems with missing values or outliers, do one significance test, and then stop. Even if we could do this (and usually we can’t), there would still be nonzero risks for both Type I and Type II decision errors. Because of sampling error, there is an intrinsic uncertainty that we cannot get rid of.
• There is a fairly common misunderstanding that p values tell us something about the size, strength, or importance of an effect. Published papers sometimes include statements like “with p < .001, the effect was highly significant.” In everyday language, significant means important, large, or worthy of notice. However, small p values can be obtained even for trivial effects if sample N is large enough. We need to distinguish between p values and effect size. Chapter 9 in Volume I (Warner, 2020) discusses this further.

From Volume I (Warner, 2020), here are examples of some things you should not say about p values. A more complete list of misconceptions to avoid is provided by Greenland et al. (2016). Never make any of the following statements:

• p = .000 (the risk for Type I error can become very small, but in theory, it is never 0).
• p was “highly” significant. This leads readers to think that your effect was “significant” in the way we define significant in everyday language: large, important, or worthy of notice. Other kinds of effect size information (not p values) are required to evaluate the practical or clinical significance of the outcome of a study.
• p was “almost” significant (or synonymous terms such as close to or marginally significant). This language will make people who use NHST in traditional ways, and New Statistics advocates, cringe.

For “small” p values, such as p = .04, we cannot say:

• Results were not due to chance or could not be explained by chance. (We cannot know that!)
• Results are likely to replicate in future studies.
• The null hypothesis (H0) is false.
• We accept (or have proved) the alternative hypothesis.

We also cannot use (1 − p), for example (1 − .04) = .96, to make probability statements such as:
• There is a 96% chance that results will replicate.
• There is a 96% chance that the null hypothesis is false.

For p values larger than .05, we cannot say, “Accept the null hypothesis.”

The language we use to report results should not overstate the strength of the evidence, imply large effect sizes in the absence of careful evaluation of effect size, overgeneralize the findings, or imply causality when rival explanations cannot be ruled out. We should never say, “This study proves that….” Any one study has limitations. As suggested in Volume I (Warner, 2020):

It is better to think about research in terms of degrees of belief. As we obtain additional high-quality evidence, we may become more confident of a belief. If high-quality inconsistent evidence arises, that should make us rethink our beliefs.

We can say things such as:

• The evidence in this study is consistent with the hypothesis that …
• The evidence in this study was not consistent with the hypothesis that …

Hypothesis can be replaced by similar terms, such as prediction.

Misunderstandings of p values, and what they can and cannot tell us, have been one of several contributing factors in a “replication crisis.”

1.4 PROBLEMS WITH NHST LOGIC

The version of NHST presented in statistics textbooks and used by many researchers in social and behavioral science is an amalgamation of ideas developed by Fisher, Neyman, and Pearson (Lenhard, 2006). Neyman and Pearson strongly disagreed with important aspects of Fisher’s thinking, and probably none of them would endorse current NHST logic and practices. Here are some commonly identified concerns about NHST logic.

NHST turns an uncertainty continuum into a true/false decision. Cohen (1994) and Rosnow and Rosenthal (1989) argued that we should think in terms of a continuum of likelihood:

A successful piece of research doesn’t conclusively settle an issue, it just makes some theoretical proposition to some degree more likely….
How much more likely this single research makes the proposition depends on many things, but not on whether p is equal to or greater than .05: .05 is not a cliff but a convenient reference point along the possibility-probability continuum. (Cohen, 1994)

Surely, God loves the .06 nearly as much as the .05. (Rosnow & Rosenthal, 1989)

One way to avoid treating .05 as a cliff is to report “exact” p values, as recommended by the American Psychological Association (APA) Task Force on Statistical Inference (Wilkinson & Task Force on Statistical Inference, APA Board of Scientific Affairs, 1999). The APA recommended that authors report “exact” values, such as p = .032, instead of a yes/no judgment of whether a
result is significant or nonsignificant on the basis of p < .05 or p > .05. The possibly annoying quotation marks for “exact” are meant as a reminder that in practice, obtained p values often seriously underestimate the true risk for Type I error.

NHST cannot tell us what we want to know. We would like to know something like the probability that our research or alternative hypothesis is true, or the probability that the finding will replicate in future research, or how strong the effects were. In fact, NHST can tell us only the (theoretical) probability of obtaining results at least as extreme as those in our data, given that H0 is true (Cohen, 1994). NHST does not even do this well, given problems with its use in actual research practice.

Some philosophers of science argue that progress in science requires us to discard faulty or incorrect evidence. However, when researchers reject H0, this is not “falsification” in that sense.

NHST is trivial because H0 is always false. Any nonzero difference (between μ1 and μ2) can be judged statistically significant if the sample size is sufficiently large (Kline, 2013).

NHST requires us to think in terms of double negatives (and people aren’t very good at understanding double negatives). First, we set up a null hypothesis (of no treatment effect) that we almost always do not believe, and then we try to obtain evidence that would lead us to doubt this hypothesis. Double negatives are confusing and inconsistent with everyday “psycho-logic” (Abelson & Rosenberg, 1958). In everyday reasoning, people have a strong preference to seek confirmatory evidence. People (including researchers) are confused by double negatives.

NHST is misused in many research situations. Assumptions and rules for proper use of NHST are stringent and are often violated in practice (as discussed in the next two sections). These violations often invalidate the inferences people want to make from p values.
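The earlier point that any nonzero difference can be judged statistically significant when the sample is large enough can be illustrated with a short calculation. The sketch below is not from the text: it assumes a two-sided z test for two independent group means with a known, equal population SD, and the numbers (a 0.5-point IQ difference, SD = 15) are hypothetical values chosen only to represent a trivial effect.

```python
import math
from statistics import NormalDist


def two_sided_p_for_mean_difference(diff, sd, n_per_group):
    """Two-sided p value for a z test comparing two independent means.

    Assumes (for illustration only) a known population SD that is equal
    in both groups and equal group sizes.
    """
    se_diff = sd * math.sqrt(2.0 / n_per_group)  # SE of (M1 - M2)
    z = diff / se_diff
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical trivial effect: 0.5 IQ points with SD = 15 (d is about 0.03).
for n in (100, 1_000, 10_000, 100_000):
    p = two_sided_p_for_mean_difference(0.5, 15.0, n)
    print(f"n per group = {n:>7}: p = {p:.4g}")
```

The effect size is identical in every row; only N changes, and the p value shrinks as N grows. This is one reason effect size information has to be reported alongside p.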
Despite these criticisms, an argument can be made that NHST serves a valuable purpose when it is not misused. It can help assess whether results obtained in a study would be likely or unlikely to occur just because of sampling error when H0 is true (Abelson, 1997; Garcia-Pérez, 2017). However, information about sampling error is also provided by CIs, in a form that may be less likely to lead to misunderstanding and yes/no thinking (Cumming, 2012).

1.5 COMMON MISUSES OF NHST

In actual practice, applications of NHST often do not conform to the ideal requirements for their use. Three sets of conditions are important for the proper use of NHST. I describe these as assumptions, rules, and handling of specific problems such as outliers. (These are fuzzy distinctions.)

In actual practice, it is difficult to satisfy all the requirements for p to be an accurate estimate of risk for Type I error. When these requirements are not met, values of p that appear in computer program results are biased; usually they underestimate the true risk for Type I error. When the true risk for Type I error is underestimated, both readers and writers of research reports may be overconfident that studies provide support for claims about findings. This can lead to publication and press-release distribution of false-positive results (Woloshin, Schwartz, Casella, Kennedy, & Larson, 2009). Inconsistent and even contradictory media reports of research findings may erode public trust and respect for science.
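One way a nominal p value comes to understate the true risk for Type I error is the running of many tests, a practice taken up in Section 1.5.2. The sketch below is an illustration rather than anything given in the text. It assumes k independent tests, each with a true null hypothesis and all assumptions met, so that each p value is uniformly distributed between 0 and 1. Under those assumptions the chance of at least one false positive is 1 − (1 − α)^k, and a small simulation is included as a check. The function names are hypothetical.

```python
import random


def familywise_error_rate(alpha, k):
    """Probability of at least one Type I error across k independent tests."""
    return 1.0 - (1.0 - alpha) ** k


def simulate_any_false_positive(alpha, k, n_studies=20_000, seed=1):
    """Fraction of simulated studies with at least one p < alpha when H0 is true.

    Assumption: under a true H0 with all test assumptions met, each p value
    is uniform on (0, 1), so it is drawn directly with rng.random().
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if any(rng.random() < alpha for _ in range(k)):
            hits += 1
    return hits / n_studies


for k in (1, 5, 20, 100):
    analytic = familywise_error_rate(0.05, k)
    simulated = simulate_any_false_positive(0.05, k)
    print(f"k = {k:>3}: analytic = {analytic:.3f}, simulated = {simulated:.3f}")
```

Real analyses are rarely independent of one another, so these figures are a rough guide and not an exact description of any particular study.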
1.5.1 Violations of Assumptions

Most statistics textbooks precede the discussion of each new statistic with a list of formal mathematical assumptions about distribution shapes, independence of observations and residuals, and so forth. The list of assumptions for parametric analyses such as the independent-samples t test and one-way between-S ANOVA includes:

• Data on quantitative variables are assumed to be normally distributed in the population from which samples were randomly drawn.
• Variances of scores in populations from which samples for groups were randomly drawn are assumed to be equal across groups (the homogeneity of variance assumption).
• Observations must be independent of one another. (Some textbooks do not explain this very important assumption clearly. See Chapter 2 in Volume I [Warner, 2020].)

For Pearson’s r and bivariate regression, additional assumptions include:

• The relation between X and Y is linear.
• The variances of Y scores at each level of X are equal.
• Residuals from regression are uncorrelated with one another.

Advanced analyses often require additional assumptions. Textbooks often provide information about evaluation of assumptions. However, most introductory data analysis exercises do not require students to detect or remedy violations of assumptions. The need for preliminary data screening and procedures for screening aren’t clear in most introductory books. For NHST results to be valid, we need to evaluate whether assumptions are violated. However, journal articles often do not report whether assumptions were evaluated and whether remedies for violations were applied (Hoekstra, Kiers, & Johnson, 2012).

1.5.2 Violations of Rules for Use of NHST

I use the term rules to refer to other important guidelines about proper use of NHST. These are not generally included in lists of formal assumptions about distribution and independence of observations. These rules are often implicit; however, they are very important.
These include the following:

Select the sample randomly from the (actual) population of interest (Knapp, 2017). This is important whether you think about NHST in the traditional or classic manner, as a way to answer a yes/no question about the null hypothesis, or in terms of the New Statistics, with greater focus on CIs and less focus on p values. Bad practices in sampling limit generalizability of results and also compromise the logic of procedures of NHST. In practice, researchers often use convenience samples. When they want to generalize results, they imagine hypothetical populations similar to the sample in the study (invoking the idea of “proximal similarity” [Trochim, 2006] as justification for generalization beyond the sample). The use of convenience samples does not correspond closely to the situations the original developers of inferential statistics had in mind. For example, in industrial quality control, a population could be all the objects made by a factory in a month; the sample could be a random
subset of these objects. The logic of NHST inferential statistics makes more sense for random sampling. Studies based on accidental or convenience samples create much more difficult inference problems.

Select the statistical test and criterion for statistical significance (e.g., α = .05, two tailed) prior to analysis. This is important if you want to interpret p values as they have often been interpreted in the past, as a basis to make a yes/no decision about a null hypothesis. This rule is often violated in practice. For example, data analysts may use asterisks that appear next to correlations in tables and report that for one asterisk, p < .05; for two asterisks, p < .01; and for three asterisks, p < .001. Using asterisks to report a significance level separately for each correlation could be seen as implicitly setting the α criterion after the fact. On the other hand, many authorities recommend that instead of selecting specific α criteria, you should report an exact p value and not use the p value to make a yes/no decision about the believability of the null hypothesis. In other words, do not use p values as the basis to make statements such as “the result was statistically significant” or “reject H0.” Advocates of the New Statistics recommend that we should not rely on p values to make yes/no decisions.

Perform only one significance test (or at most a small number of tests). The opposite of this is: Perform numerous statistical tests, and/or numerous variations of the same basic analysis, and then report only a few “statistically significant” results. This practice is often called p-hacking. Other names for p-hacking include data fishing, “the garden of forking paths” (Gelman & Loken, 2013), or my personal favorite, torturing the data until they confess (Mills, 1993). Introductory statistics books usually discuss the problem of inflated risk for Type I error in the context of post hoc tests for ANOVA.
They do not always make it clear that this problem is even more serious when people run dozens or hundreds of t tests or correlations.

1.5.3 Ignoring Artifacts and Data Problems That Bias p Values

Many artifacts that commonly appear in real data influence the magnitude of parameter estimates (such as M, SD, r, and b, among others) and p values. These include, but are not limited to:

• Univariate, bivariate, and multivariate outliers.
• Missing data that are not missing randomly.
• Measurement problems such as unreliability. For example, the obtained value of r_xy is attenuated (reduced) by unreliability of measures for X and Y.
• Mismatch of distribution shapes (for Pearson’s r and regression statistics) that constrain the range of possible r values.

1.5.4 Summary

Consider an F ratio in a one-way between-S ANOVA. The logic for NHST goes something like this: If we formulate hypotheses and establish criteria for statistical significance and sample size prior to data collection, and if the null hypothesis is true, and if we take a random sample from the population of interest, and if all assumptions for the statistic are satisfied, and if we have not broken important rules for proper use of NHST, and if there are no artifacts such as outliers and missing values, then p should be an unbiased estimate of the likelihood of obtaining a value of F
as large as, or larger than, the F ratio we obtained from our data. (Additional ifs could be added in many situations.)

This is a long conditional statement. The point is: Values of statistics such as F and p can provide the information described in ideal or imaginary situations in textbooks only when all of these conditions are satisfied. In actual research, one or many of these assumptions about conditions are violated. Therefore, statistics such as F and p rarely provide a firm basis for the conclusions described for ideal or imaginary research situations in textbooks. Problems with any of these (assumptions, rules, and artifacts) can result in biased p values that in turn may lead to false-positive decisions. In real-life applications of statistics, it may be impossible to avoid all these problems.

For all these reasons, I suggest that most p values should be taken with a very large grain of salt. P values are least likely to be misleading in simple experiments with a limited number of analyses, such as ANOVA with post hoc tests. They are highly likely to be misleading in studies that include large numbers of variables that are combined in different ways using many different analyses. It is difficult to prioritize these problems; my guess is that violations of rules (such as running large numbers of significance tests and p-hacking) and neglect of sources of artifact (such as outliers) often create greater problems with p values in practice than violations of some of the formal assumptions about distributions of scores in populations (such as homogeneity of variance).

It requires some adjustment in thinking to realize that, to a very great extent, the numbers we obtain at the end of an analysis are strongly influenced by decisions made during data collection and analysis (Volume I [Warner, 2020]). Beginning students may think that final numerical results represent some “truth” about the world.
We need to understand that with different data analysis decisions, we could have ended up with quite different answers. Greater transparency in reporting (Simmons, Nelson, & Simonsohn, 2011) helps readers understand the degree to which results may have been influenced by a data analyst’s decisions.

1.6 THE REPLICATION CRISIS

Misuse and misinterpretation of statistics (particularly p values) is one of many factors that have contributed to rising concerns about the reproducibility of high-profile research findings in psychology. To evaluate reproducibility of research results, Brian Nosek and Jeff Spies founded the Center for Open Science in 2013 (Open Science Collaboration, 2015). Their aim was to increase openness, integrity, and reproducibility of scientific research. Participating scientists come from many fields, including astronomy, biology, chemistry, computer science, education, engineering, neuroscience, and psychology.

Results reported for the first group of studies evaluated were disturbing. They conducted replications of 100 studies (both correlational and experimental) published in three psychology journals, using large samples (to provide adequate statistical power) and original materials if available. The average effect sizes were about half as large as the original results. Only 39 of the 100 replications yielded statistically significant
outcomes (all original studies were “statistically significant”). This was not quite as bad as it sounds, because many original effect sizes associated with nonsignificant outcomes were within 95% CIs on the basis of replication effect sizes (Baker, 2015; Open Science Collaboration, 2015). These results attracted substantial attention and concern.

Failures to replicate have also been noted in biomedical research. Ioannidis (2005) examined 49 highly regarded medical studies from 13 prior years. He compared initial claims for intervention effectiveness with results in later studies with larger samples; 7 (16%) of the original studies were contradicted, and another 7 (16%) had smaller effects than the original study. Later studies have yielded even less favorable results. Begley and Ellis (2012) reported that biotechnology firm Amgen tried to confirm results from 53 landmark studies about issues such as new approaches to targeting cancers and alternative clinical uses for existing therapeutics. Findings were confirmed for only 6 (11%) studies. Baker and Dolgin (2017) noted that early results from the Cancer Reproducibility Project’s examination of 6 cancer biology studies were mixed.

Do these replication failures indicate a “crisis”? That is debatable. Only a small subset of published studies were tested. Some of the original studies were chosen for replication because they reported surprising or counterintuitive results. Examination of p values is not the best way to assess whether results have been reasonably well replicated; p values are “fickle” and difficult to reproduce (Halsey, Curran-Everett, Vowler, & Drummond, 2015). It may be better to evaluate reproducibility using effect sizes or CIs instead of p values. Critics of the reproducibility projects argue that the replication methods and analyses were flawed (Gilbert, King, Pettigrew, & Wilson, 2016).
It would be premature to conclude that large proportions of all past published research results would not replicate; however, concerns raised by failures to replicate should be taken seriously.

A failure to reproduce results does not necessarily mean that the original or past study was wrong. The replication study may be flawed, or the results may be context dependent (and might appear only in the specific circumstances in an earlier study, and not under the conditions in the replication study).

Concerns about reproducibility have led to a call for new approaches to reporting results, often called the New Statistics, along with a movement toward preregistration of study plans and Open Science, in which researchers more fully share information about study design and statistical analyses. Many changes in research practice will be needed to improve reproducibility of research results (Wicherts et al., 2016).

Misuse and misinterpretation of statistical significance tests (and p values) to make yes/no decisions about whether studies are “successful” have contributed to problems in replication. Some have even argued that NHST and p values are an inherently flawed approach to evaluation of research results (Krueger, 2001; Rozeboom, 1960). Cumming (2014) and others argue that a shift in emphasis (away from statistical significance tests and toward reports of effect size, CIs, and meta-analysis) is needed. However, many published
papers still do not include effect size and CIs for important results (Watson, Lenz, Schmit, & Schmit, 2016).

1.7 SOME PROPOSED REMEDIES FOR PROBLEMS WITH NHST

1.7.1 Bayesian Statistics

Some authorities argue that we got off on the wrong foot (so to speak) when we adopted NHST in the early 20th century. Probability is a basic concept in statistical significance testing. The examples used to explain probability suggest that it is a simple concept. For example, if you draw 1 card at random from a deck of 52 cards with equal numbers of diamonds, hearts, spades, and clubs, what is the probability that the card will be a diamond? This example does not even begin to convey how complicated the notion of probability becomes in more complex situations (such as inference from sample to population). NHST is based on a “frequentist” understanding of probability; this is not the only possible way to think about probability, and other approaches (such as Bayesian) may work better for some research problems. A full discussion of this problem is beyond the scope of this chapter; see Kruschke and Liddell (2018), Little (2006), Malakoff (1999), or Williamson (2013).

Researchers in a few areas of psychology use Bayesian methods. However, students typically receive little training in these methods. Whatever benefits this might have, a major shift toward the use of Bayesian methods in behavioral or social sciences seems unlikely to happen any time soon.

1.7.2 Replace α = .05 with α = .005

It has recently been suggested that problems with NHST could be reduced by setting the conventional α criterion to .005 instead of the current .05 (Benjamin et al., 2017). This would establish a more stringent standard for announcement of “new” findings. However, given the small effect sizes in many research areas, enormous sample sizes would be needed to have reasonable statistical power with α = .005. This would be prohibitively costly.
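For a rough sense of the sample sizes involved, the sketch below compares the n per group needed at α = .05 and α = .005. It is an illustration under stated assumptions, not a calculation from the text: it uses the common normal-approximation formula n = 2 × ((z for 1 − α/2) + (z for power))² / d² for a two-sided, two-group comparison of means with equal group sizes. The effect size d = 0.2 and the power of .80 are hypothetical values chosen for the example, and the function name is made up.

```python
import math
from statistics import NormalDist


def required_n_per_group(d, alpha, power):
    """Approximate n per group for a two-sided, two-group comparison of means.

    Normal approximation only; exact t-based calculations give slightly
    larger values. d is the standardized mean difference (Cohen's d).
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) / d) ** 2)


# Hypothetical small effect (d = 0.2) and conventional power of .80.
for alpha in (0.05, 0.005):
    n = required_n_per_group(d=0.2, alpha=alpha, power=0.80)
    print(f"alpha = {alpha}: about {n} cases per group")
```

Under these assumptions, moving from α = .05 to α = .005 raises the required n per group by roughly 70%, on top of a requirement that is already several hundred cases per group for a small effect.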
Bates (2017) and Schimmack (2017) argued that this approach is neither necessary nor sufficient and that it would make replication efforts even more unlikely. A change to this smaller α level is unlikely to be widely adopted.

1.7.3 Less Emphasis on NHST

The “new” statistics advocated by Cumming (2012, 2014) calls for a shift of focus. He recommended that research reports should focus more on confidence intervals, effect size information, and meta-analysis to combine effect size information across studies.

How “new” is the New Statistics? As noted by Cumming (2012) and others, experts have been calling for these changes for more than 40 years (e.g., Morrison & Henkel, 1970; Cohen, 1990, 1994; Wilkinson & Task Force on Statistical Inference, APA Board of Scientific Affairs, 1999). Cumming (2012, 2014) bolstered these arguments with further discussion of the ways that CIs
(vs. p values) may lead data analysts to think about their data. Some argue that the New Statistics is not really “new” (Palij, 2012; Savalei & Dunn, 2011); CIs and significance tests are based upon the same information about sampling error. In practice, many readers may choose to convert CIs into p values so that they can think about them in more familiar terms. However, effect size reporting is critical; it provides information that is not obvious from examination of p values.

Unlike a shift to Bayesian approaches, or the use of α = .005, including CIs and effect sizes in research reports would not be difficult or costly. In general, researchers have been slow to adopt these recommendations (Sharpe, 2013). The journal Basic and Applied Social Psychology (Trafimow & Marks, 2015) now prohibits publication of p values and related NHST results.

The following sections review the major elements of the New Statistics: CIs and effect size. CIs and effect size are both discussed in Volume I (Warner, 2020) for each bivariate statistic. A brief introduction to meta-analysis is also provided.

1.8 REVIEW OF CONFIDENCE INTERVALS

A confidence interval is an interval estimate for some unknown population characteristic or parameter (such as μ, the population mean) based on information from a sample (such as M, SD, and N). CIs can be set up for basic bivariate statistics using simple formulas. Unfortunately, SPSS does not provide CIs for some statistics, such as Pearson’s r. For more advanced statistics, CIs can be set up using methods such as bootstrapping, which is discussed in Chapter 15, on structural equation modeling, later in this book.

1.8.1 Review: Setting Up CIs

Consider an example of the CI for one sample mean, M. Suppose a data analyst has IQ scores for a sample of N = 100 cases, with these sample estimates: M = 105, SD = 15. In addition to reporting that mean IQ in the sample was M = 105, an interval estimate (a 95% CI) can be constructed, with lower and upper boundaries.
The procedure used in this example can be used only when the sample statistic is known to have a normally shaped sampling distribution and when N is large enough that the standard normal or z distribution can be used to figure out what range of values lies within the center 95% of the distribution. (With smaller samples, t distributions are usually used.) These are the steps to set up a CI:

Decide on C (level of confidence) (usually this is 95%).

Assuming that your sample statistic has a normally shaped sampling distribution, use the “critical values” from a z or standard normal distribution that correspond to the middle 95% of values. For a standard normal distribution, the middle 95% corresponds to the interval between z_lower = –1.96 and z_upper = +1.96. (Rounding these z values to –2 and +2 is reasonable when thinking about estimates.)
Find the standard error (SE) for the sample statistic. The SE depends on sample size and standard deviation. For a sample mean, SEM = SD/√N. Other sample statistics (such as r, b, and so forth) also have SEs that can be estimated.

On the basis of SD = 15 and N = 100, we can compute the standard error of the sampling distribution for M: SEM = 15/√100 = 15/10 = 1.5.

Now we combine SEM with M and the z critical values that correspond to the middle 95% of the standard normal distribution to compute the CI limits:

Lower limit = M + z_lower × SEM = 105 – 1.96 × 1.5 = 105 – 2.94 = 102.06.
Upper limit = M + z_upper × SEM = 105 + 1.96 × 1.5 = 105 + 2.94 = 107.94.

This would be reported as “95% CI [102.06, 107.94].”

This procedure can be generalized and used with many other (but not all) sample statistics. To use this procedure, an estimate of the value of SE_statistic is needed, and the sampling distribution for the statistic must be normal:

Lower limit = statistic + z_lower × SE_statistic. (1.1)
Upper limit = statistic + z_upper × SE_statistic. (1.2)
The statistic can be (M1 – M2), r, or a raw-score regression slope b, for example. In more advanced analyses such as structural equation modeling, it is sometimes not possible to calculate the SE values for path coefficients directly, and it may be unrealistic to expect sampling distributions to be normal in shape. In these situations, Equations 1.1 and 1.2 cannot be used to set up CIs. The chapters that introduce structural equation modeling and logistic regression discuss different procedures to set up CIs for these situations.

1.8.2 Interpretation of CIs

It is incorrect to say that there is a 95% probability that the true population mean μ lies within a 95% CI. (It either does, or it doesn’t, and we cannot know which.) We can make a long-range prediction: if we have a population with known mean and standard deviation, set a fixed sample size, and draw thousands of random samples from that population, then 95% of the CIs set up using this information will contain μ and the other 5% will not contain μ. Cumming and Finch (2005) provided other correct interpretations for CIs.
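The z-based procedure from Section 1.8.1 and the long-run interpretation described above can be sketched in a few lines of Python. This is a minimal illustration, not a procedure taken from the book: the function names, the simulated population (μ = 100, σ = 15), the number of simulated samples, and the random seed are all assumptions chosen for the example, and the z approximation is used throughout because N = 100 is reasonably large.

```python
import random
from statistics import NormalDist, mean, stdev


def z_confidence_interval(m, sd, n, confidence=0.95):
    """Return (lower, upper) limits of a z-based CI for a sample mean."""
    # Critical value for the middle `confidence` proportion (about 1.96 for 95%).
    z_upper = NormalDist().inv_cdf(0.5 + confidence / 2)
    sem = sd / n ** 0.5  # SEM = SD / sqrt(N)
    return m - z_upper * sem, m + z_upper * sem


def coverage(mu, sigma, n, n_samples=2000, confidence=0.95, seed=1):
    """Proportion of simulated CIs that contain the known population mean."""
    rng = random.Random(seed)  # hypothetical seed, only for repeatability
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = z_confidence_interval(mean(sample), stdev(sample), n, confidence)
        hits += lower <= mu <= upper
    return hits / n_samples


# The IQ example from the text: M = 105, SD = 15, N = 100.
lower, upper = z_confidence_interval(105, 15, 100)
print(f"95% CI [{lower:.2f}, {upper:.2f}]")

# Long-run interpretation: the proportion should be close to .95.
print(f"Proportion of CIs containing mu: {coverage(100, 15, 100):.3f}")
```

The first call reproduces the limits computed by hand in the IQ example. The simulated proportion varies with the seed and the number of samples; it describes the long-run behavior of the procedure, not the probability that any single interval contains μ.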
1.8.3 Graphing CIs

Upper and lower limits of CIs may be reported in text, tables, or graphs. One common type of graph is an error bar chart, as shown in Figure 1.1. (Bar charts can also be set up with error bars.) For either error bar or bar chart graphs, the graph may be rotated, such that error bars run from left to right instead of from bottom to top.

The data in Figure 1.1 are excerpted from an actual study. Undergraduates reported positive affect and the number of servings of fruit and vegetables they consumed in a typical day. Earlier research suggested that higher fruit and vegetable intake was associated with higher positive affect. Given the large sample size, number of servings could be treated as a group variable (i.e., the first group ate no servings of fruits and vegetables per day, the second group ate one serving per day, etc.). This was useful because past research suggested that the increase in positive affect might not be linear.

The vertical “whiskers” in Figure 1.1 show the 95% CI limits for each group mean. The horizontal line that crosses the Y axis at about 32.4 helps clarify that the CI for the zero servings of fruits and vegetables group did not overlap with the CIs for the groups of persons who ate three, four, or five servings per day. In graphs of this type, the author must indicate whether the error bars correspond to a CI (and what level of confidence). Some graphs use similar-looking error bar markers to indicate the interval between –1 SEM and +1 SEM or the interval between –1 SD and +1 SD.
Figure 1.1 Mean Positive Affect for Groups With Different Fruit and Vegetable Intake (With 95% CI Error Bars)
Source: Adapted from Warner, Frye, Morrell, and Carey (2017).

1.8.4 Understanding Error Bar Graphs

A reader can make two kinds of inferences from error bars in this type of graph (Figure 1.1). First, error bars can be used to guess which group means differed significantly. Cumming (2012, 2014) cautioned that analysts should not automatically convert CI information into p values for significance tests when they think about their results. However, if readers choose to do that, it is important to understand the way CIs and two-tailed p values are related. In general, if the CIs for two group means do not overlap in graphs such as Figure 1.1, the difference between means is statistically significant (assuming that the level of confidence corresponds to the α level, i.e., 95% confidence and α = .05, two tailed). On the other hand, the difference between a pair of group means can be statistically significant even if the CIs for the means overlap slightly. Whether the difference is statistically significant depends on the amount of overlap between CIs (Cumming & Finch, 2005; Knezevic, 2008). The nonoverlapping CIs for the zero-servings group and five-servings group indicate that if a t test were done to compare these two group means, using α = .05, two tailed, this difference would be statistically significant. There is some overlap in the CIs for the two-servings and three-servings groups. This difference might or might not be statistically significant using α = .05, two tailed.

The second kind of information a reader should look for is practical or clinical significance. Mean positive affect was about 34 for the five-servings group and 32 for the zero-servings group. Is that difference large enough to value or care about?
Would a typical person be motivated to raise fruit and vegetable consumption from zero to five servings if that meant a chance to increase positive affect by two points? (Maybe there are easier ways to “get happy.”) Numbers on the scale for positive affect scores are meaningless unless some context is provided. In this example, the minimum possible score for positive affect was 10 points, and the maximum was 50 points. A 2-point difference on a 50-point rating scale does not seem like very much.

Also note that this graph “lies with statistics” in a way that is very common in both research reports and the mass media. The Y axis begins at about 30 points rather than the actual minimum value of 10 points. How different would this graph look if the Y axis included the entire possible range of values from 10 to 50?

In the final analyses in our paper (Warner et al., 2017), fruit and vegetable intake uniquely predicted about 2% of the variance in positive affect after controlling for numerous other variables that included exercise and sleep quality. That 2% was statistically significant. However, on the basis of 2% of the variance and a two-point difference in positive affect ratings for the low versus high fruit and vegetable consumption groups, I would not issue a press release urging people to eat fruit and get happy. Other variables (such as gratitude) have much stronger
associations with positive affect. (It may be of theoretical interest that consumption of fruits and vegetables, but not sugar or fat consumption, was related to positive affect. Fruit and vegetable consumption is related to other important outcomes such as physical health.)

The point is: Information about actual and potential range of scores for the outcome variable can provide context for interpretation of scores (even when they are in essentially meaningless units). Readers also need to remember that the selection of a limited range of values to include on the Y axis creates an exaggerated perception of group differences.

1.8.5 Why Report CIs Instead of, or in Addition to, Significance Tests?

Cumming (2012) and others suggest these possible advantages of focusing on CIs rather than p values:

Reporting the CI can move us away from the yes/no thinking involved in statistical significance tests (unless we use the CI only to reconstruct the statistical significance test).

CIs make us aware of the lack of precision of our estimates (of values such as means). Information about lack of precision is more compelling when scores on a predicted variable are in meaningful units. Consider systolic blood pressure, given in millimeters of mercury (mm Hg). If the 95% CI for systolic blood pressure in a group of drug-treated patients ranges from 115 mm Hg (not considered hypertensive) to 150 mm Hg (hypertensive), potential users of the drug will be able to see that mean outcomes are not very predictable. (On the other hand, if the CI ranges from 115 to 120 mm Hg, mean outcomes can be predicted more accurately.)

CIs may be more stable across studies than p values. In studies of replication and reproducibility, overlap of CIs across studies may be a better way to assess consistency than asking if studies yield the same result on the binary outcome judgment: significant or not significant. P values are “fickle”; they tend to vary across samples (Halsey et al., 2015). Asendorpf et al.
(2013) recommended that evaluation of whether two studies produce consistent results should focus on CI overlap rather than on “vote counting” (i.e., noticing whether both studies had p < .05).

Data analysts hope that CIs will be relatively narrow, because if they are not, it indicates that estimates of the mean have considerable sampling error. The width of a CI depends on these factors:

As SD increases (other factors being equal), the width of the CI increases.

As level of confidence increases (other factors being equal), the width of the CI increases.

As N increases (other factors being equal), the width of the CI decreases.

Despite calls to include CIs in research reports, many authors still do not do so (Sharpe, 2013). This might be partly because, as Cohen (1994) noted, they are often “so embarrassingly large!”

1.9 EFFECT SIZE

Bivariate statistics introduced in Volume I (Warner, 2020) were accompanied by a discussion of one (or sometimes more than one) effect size index. For χ², effect sizes include Cramer’s V and ϕ. Pearson’s r and r² directly provide effect size information. For statistics such as the independent-samples t test, several effect sizes can be used; these include point biserial r (r_pb),
Cohen’s d, η, and η². It is also possible to think about the (M1 – M2) difference in practical or clinical effect size terms if the dependent variable is measured in meaningful units such as dollars, kilograms, or inches. For ANOVA, η and η² are commonly used. Rosnow and Rosenthal (2003) discussed additional, less widely used effect size indexes.

1.9.1 Generalizations About Effect Sizes
1. Effect size is independent of sample size. For example, the magnitude of Pearson’s r does not systematically increase as N increases.²
2. Some effect sizes have a fixed range of possible values (r ranges from –1 to +1), but other effect sizes do not (Cohen’s d is rarely higher than 3 in absolute value, but it does not have a fixed limit).
3. Many effect sizes are in unit-free (or standardized) terms. For example, the magnitude of Pearson’s r is not related to the units in which X and Y are measured.
4. On the other hand, effect size information can be presented in terms of the original units of measurement (e.g., M1 – M2). This is useful when original units of measurement were meaningful (Pek & Flora, 2018).
5. Some effect sizes can be directly converted (at least approximately) into other effect sizes (Rosnow & Rosenthal, 2003).
6. Cohen’s (1988) guidelines for verbal labeling of effect sizes are widely used; these appear in Table 1.1. Alternative guidelines based on Fritz, Morris, and Richler (2012) appear in Table 1.2.
7. The value of a test statistic (such as the independent-samples t test) depends on both effect size and sample size or df. This is explained further in the next section.
8. Many journals now call for reporting of effect size information. However, many published research reports still do not include this information.
9. Judgments about the clinical or practical importance of research results should be based on effect size information, not based on p values (Sullivan & Feinn, 2012).
10. If you read a journal article that does not include effect size information, there is usually enough information for you to compute an effect size yourself. (There should be!)
11. Computer programs such as SPSS often do not provide effect sizes; however, effect sizes can be computed from the information provided.
12. In the upcoming discussion of meta-analysis, examples often focus on effect sizes such as Cohen’s d that describe the difference between group means for treatment and control groups. However, raw or standardized regression slope coefficients can also be treated as effect sizes in meta-analysis (Nieminen, Lehtiniemi, Vähäkangas, Huusko, & Rautio, 2013; Peterson & Brown, 2005).
13. CIs can be set up for many effect size estimates (Kline, 2013; Thompson, 2002b). Ultimately, it would be desirable to report these along with effect size. In the short term, just getting everyone to report effect size for primary results is probably a more reasonable goal.
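Several of the generalizations above (computing an effect size from reported summary information, and converting one effect size into another) can be illustrated with a short Python sketch. The function names and all numerical inputs below are hypothetical, chosen only for illustration; the formulas are the standard textbook ones for two independent groups with a pooled standard deviation, and are offered as an assumption about what a reader would typically compute rather than as the specific formulas given in Volume I.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled within-group standard deviation for two independent groups."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d from group means, SDs, and ns."""
    return (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)


def d_from_t(t, n1, n2):
    """Cohen's d recovered from a reported independent-samples t."""
    return t * math.sqrt(1 / n1 + 1 / n2)


def r_pb_from_t(t, df):
    """Point biserial r from t and df; the sign follows the sign of t."""
    return math.copysign(math.sqrt(t ** 2 / (t ** 2 + df)), t)


def eta_squared(ss_effect, ss_total):
    """Eta squared for ANOVA: SS_effect divided by SS_total."""
    return ss_effect / ss_total


# Hypothetical summary statistics, not taken from any study.
d = cohens_d(105, 15, 50, 100, 15, 50)
t = d / math.sqrt(1 / 50 + 1 / 50)  # the t that corresponds to this d
print(f"d = {d:.3f}, t = {t:.3f}")
print(f"d recovered from t = {d_from_t(t, 50, 50):.3f}")
print(f"r_pb = {r_pb_from_t(t, 98):.3f}")
print(f"eta squared = {eta_squared(20.0, 200.0):.2f}")
```

A reader who finds only t and the group sizes in a journal article can use `d_from_t` or `r_pb_from_t`; a reader who finds only an ANOVA source table can use `eta_squared`.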
Table 1.1 Suggested Verbal Labels for Cohen’s d and Other Common Effect Sizes
a. The minimum values suggested by Fritz et al. are much higher than the ones proposed by Cohen (1988).
b. Analyses such as logistic regression (in which the dependent variable is a group membership, such as alive vs. dead) provide information about relative or comparative risk, for example, how much more likely is a smoker to die than a nonsmoker? This may be in the form of relative risk (RR) and an odds ratio (OR). See Chapter 16.

1.9.2 Test Statistics Depend on Effect Size Combined With Sample Size

Consider the independent-samples t test. M1 and M2 denote the group means, SD1 and SD2 are the group standard deviations, and n1 and n2 denote the number of cases in each group. One of the effect sizes used with the independent-samples t is Cohen’s d (the standardized distance or difference between the sample means M1 and M2). The difference between the sample means is standardized (converted to a unit-free distance) by dividing (M1 – M2) by the pooled standard deviation s_p:
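The relations this passage relies on can be written out as follows. This is a reconstruction from the definitions just given (group means, standard deviations, group sizes, and a pooled standard deviation), using the standard pooled-variance t test; that these expressions match the exact form and numbering of the book's own equations, including Equation 1.4, is an assumption.

```latex
% Standardized mean difference, with the usual pooled standard deviation
\[
d = \frac{M_1 - M_2}{s_p},
\qquad
s_p = \sqrt{\frac{(n_1 - 1)\,SD_1^{2} + (n_2 - 1)\,SD_2^{2}}{n_1 + n_2 - 2}}
\]

% The independent-samples t ratio rewritten in terms of d
\[
t = \frac{M_1 - M_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}
  = d\,\sqrt{\frac{n_1 n_2}{n_1 + n_2}}
\]

% Special case of equal group sizes, n_1 = n_2 = n, so that df = 2n - 2
\[
t = d\,\sqrt{\frac{n}{2}} \;\approx\; \frac{d\,\sqrt{df}}{2}
\]

% Illustration with a fixed, small effect size d = 0.10
\[
n = 50:\; t = 0.10\sqrt{25} = 0.5,
\qquad
n = 5000:\; t = 0.10\sqrt{2500} = 5.0
\]
```

With d fixed at 0.10, t grows from 0.5 to 5.0 purely because the per-group n grows from 50 to 5,000.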
Examining Equation 1.4 makes it clear that if effect size d is held constant, the absolute value of t increases as the df (sample size) increases. Thus, even when an effect size such as d is
extremely small, as long as it is not zero, we can obtain a value of t large enough to be judged statistically significant if sample size is made sufficiently large. Conversely, if the sample size given by df is held constant, the absolute value of t increases as d increases. This dependence of magnitude of the test statistic on both effect size and sample size holds for other statistical tests (I have provided only a demonstration for one statistic, not a proof).

This is the important point: A very large value of t, and a correspondingly very small value of p, can be obtained even when the effect size d is extremely small. A small p value does not necessarily tell us that the results indicate a large or strong effect (particularly in studies with very large N’s).

Furthermore, both the value of N and the value of d depend on researcher decisions. For an independent-samples t test, other factors being equal, d often increases when the researcher chooses types of treatments and/or dosages of treatments that cause large differences in the response variable and when the researcher controls within-group error variance through standardization of procedures and recruitment of homogeneous samples. Some undergraduate students became upset when I explained this: “You mean you can make the results turn out any way you want?” Yes, within some limits. When we obtain statistics in samples, such as values of M or Cohen’s d or p, these values depend on our design decisions. They are not facts of nature. See Volume I (Warner, 2020), Chapter 12, for further discussion.

1.9.3 Using Effect Size to Evaluate Theoretical Significance

Judgments about theoretical significance are sometimes made on the basis of the magnitude of standardized effect size indexes such as d or r. One way to think about the importance of research results is to ask, Given the effect size, how much does this variable add to our ability to predict some outcome of interest, or to “explain variance”?
Is the added predictive information sufficient to be “worthwhile” from a theoretical perspective? Is it useful to continue to include this variable in future theories, or are its effects so trivial as to be negligible? For example, if X and Y have r_xy = .10 and, therefore, r² = .01, then only 1% of the variance in Y is linearly predictable from X. By implication, the other 99% of the variance is related to other variables (or is due to nonlinear associations or is inherently unpredictable). Is it worth expending a lot of energy on further study of a variable that predicts only 1% of the variance? When an effect size is this small, very large N’s are needed in future studies in order to have sufficient statistical power (i.e., a reasonably high probability of obtaining a statistically significant outcome). Researchers need to make their own judgments as to whether it is worth pursuing a variable that predicts such a small proportion of variance.

There are two reasons why authors may not report effect sizes. One is that SPSS does not provide effect size information for some common statistics, such as ANOVA. This lack is easy to deal with, because SPSS does provide the information needed to calculate effect size information by hand, and the computations are simple. This information is provided for each statistic in Volume I (Warner, 2020). For example, an η² effect size for ANOVA can be obtained by dividing SS_effect by SS_total. There may be another reason. Cohen (1994) noted that CIs are
often embarrassingly large; effect sizes may often be embarrassingly small. It just does not sound very impressive to say, “I have accounted for 1% of the variance.”

A long time ago, Mischel (1968) pointed out that correlations between personality measures and behaviors tended to be no larger than r = .30. This triggered a crisis and disputes in personality research. Social psychologists argued that the power of situations was much greater than personality. Epstein and O’Brien (1985) argued that it is possible to obtain higher correlations in personality with broader assessments and that typical effect sizes in social psychology were not much higher. However, at the time, r = .30 seemed quite low. This may have been because earlier psychological research in areas such as behavior analysis and psychophysics tended to yield much larger effects (stronger correlations). I wonder whether Cohen’s labeling of r = .3 as a medium to large effect was based on the observation that in many areas of psychology, effects much larger than this are not common. Nevertheless, accounting for 9% of the variance does not sound impressive.

Prentice and Miller (1992) pointed out that in some situations, even small effects may be impressive. Some behaviors are probably not easy to change, and a study that finds some change in this behavior can be impressive even if the amount of change is small. They cited this example: Physical attractiveness shows strong relationships with some responses (such as interpersonal attraction). It is impressive to note that even in the courtroom, attractiveness has an impact on behavior; unattractive defendants were more likely to be judged guilty and to receive more punishment. If physical attractiveness has effects in even this context, its effects may apply to a very wide range of situations.

Sometimes social and behavioral scientists have effect size envy, imagining that effect sizes in other research domains are probably much larger.
In fact, effect sizes in much biomedical research are similar to those in psychology (Ferguson, 2009). Rosnow and Rosenthal (2003) cited an early study that examined whether taking low-dose aspirin could reduce the risk for having a heart attack. Pearson’s r (or ϕ) between these two dichotomous variables was r = .034. The percentage of men who did not have heart attacks in the aspirin group (51.7%) was significantly higher than the percentage of men who did not have heart attacks in the placebo group (48.3%). Assuming that these results are generalizable to a larger population (and that is always a question), a 3.4% improvement in health outcome applied to 1 million men could translate into prevention of about 34,000 heart attacks. From a public health perspective, r = .034 can be seen as a large effect. From the perspective of an individual, the evaluation could be different. An individual might reason, I might change my risk for heart attack from 51.7% (if I do not take aspirin) to 48.3% (if I do take aspirin). From that perspective, the effect of aspirin might appear to be less substantial.

1.9.4 Use of Effect Size to Evaluate Practical or Clinical Importance (or Significance)

It is important to distinguish between statistical significance and practical or clinical significance (Kirk, 1996; Thompson, 2002a). We have clear guidelines for how to judge statistical significance (on the basis of p values). What do we mean by clinical or practical significance, and how can we make judgments about this? In everyday use, the word significant often means “sufficiently
important to be worthy of attention.” When research results are reported as statistically significant, readers tend to think that the treatment caused effects large enough to be noticed and valued in everyday life. However, the term statistically significant has a specific technical meaning, and as noted in the previous section, a result that is statistically significant at p < .001 may not correspond to a large effect size.

For a study comparing group means, practical significance corresponds to differences between group means that are large enough to be valued (a large M1 – M2 difference). In a regression study, practical significance corresponds to large and “valuable” increases in an outcome variable as scores on the independent variable increase (e.g., a large raw-score regression slope b).

Standardized effect sizes such as Cohen’s d are sometimes interpreted in terms of clinical significance. However, examining the difference between group means (M1 – M2) in their original units of measurement can be a more useful way to evaluate the clinical or practical importance of results (Pek & Flora, 2018). M1 – M2 provides understandable information when variables are measured in meaningful and familiar units. Age in years, salary in dollars or euros or other currency units, and body weight in kilograms or pounds are examples of variables in meaningful units. Everyday people can understand results reported in these terms. For example, if a study compared final body weight between treatment (1) and control (2) groups and found mean weights of M1 = 153 lb in the treatment group and M2 = 155 lb in the control group, everyday folks (as well as clinicians) probably would not think that a 2-lb difference is large enough to be noticeable or valuable. Most people would not be very interested in this new treatment, particularly if it is expensive or difficult. On the other hand, if the two group means differed by 20 or 30 lb, probably most people would view that as a substantial difference.
Similar comparisons can be made for other treatment outcomes (such as blood pressure with vs. without drug treatment). Unfortunately, when people read about new treatments in the media, reports often say that a treatment effect was “statistically significant” or even “highly statistically significant.” Those phrases can mislead people to think that the difference between group means (for weight, blood pressure, or other outcomes) in the study was extremely large.

Here are examples of criteria that could be used to judge whether results of studies are clinically or practically significant, that is, whether outcomes are different enough to matter:

Are group means so far apart that one mean is above, and the other mean is below, some diagnostic cutoff value? For example, is systolic blood pressure in a nonhypertensive range for the treatment group and a hypertensive range for a control group?

Would people care about an effect this size? This is relatively easy to judge when the variable is money. Judge and Cable (2004) examined annual salaries for tall versus short persons. They reported these mean annual salaries (in U.S. dollars): tall men, $79,835; short men, $52,704; tall women, $42,425; short women, $32,613. As always in research, there are many reasons we
should hesitate to generalize their results to other situations or apply them to ourselves individually. However, tall men earned mean salaries more than $47,000 higher than short women. I am a short woman, and this result certainly got my attention.

In economics, value or “mattering” is called utility. Systematic studies could be done to see what values people (clients, clinicians, and others) attach to specific outcomes. For a person who earns very little money, a $1,000 salary increase may have a lot of value. For a person who earns a lot of money, the same $1,000 increase might be trivial. Utility of specific outcomes might well differ across persons according to characteristics such as age and sex.

How large does a difference have to be for most people to even notice or detect it? At a bare minimum, before we speak of an effect detected in a study as an important finding, it should be noticeable in everyday life (cf. Donlon, 1984; Stricker, 1997).

1.9.5 Uses for Effect Sizes

Effect sizes should be included in research reports. Standardized effect sizes (such as Cohen’s d or r) provide a basis for labeling strength of relationships between variables as weak, moderate, or strong. Standardized effect sizes can be compared with those found in other studies and in past research. Additional information, such as raw-score regression slopes and group means in original units of measurement, can help readers understand the real-world or clinical implications of findings (at least if the original units of measurement were meaningful).

Effect size estimates from past research can be used to do statistical power analysis to make sample-size decisions for future research.

Finally, effect size information can be combined and evaluated across studies using meta-analysis to summarize existing information.

1.10 BRIEF INTRODUCTION TO META-ANALYSIS

A meta-analysis is a summary of effect size information from past research.
It involves evaluating the mean and variance of effect sizes combined across past studies. This section provides only a brief overview. For details about meta-analysis, see Borenstein, Hedges, Higgins, and Rothstein (2009) or Field and Gillett (2010).

1.10.1 Information Needed for Meta-Analysis

The following steps are involved in information collection:

Clearly identify the question of interest. For example, how does number of bystanders (X) predict whether a person offers help (Y)? What is the difference in mean depression scores (Y) between persons who do and do not receive cognitive behavioral therapy (CBT) (X)?

Establish criteria for inclusion (vs. exclusion) of studies ahead of time. Decide which studies to include and exclude. This involves many judgments. Poor-quality studies may be discarded. Studies that are retained must be similar enough in conception and design that comparisons make sense (you can’t compare apples and oranges). Reading meta-analyses in your own area of interest can be helpful.

Do a thorough search for past research about this question. This should include published studies, located using library databases, and unpublished data, obtained through personal contacts.
Create a data file that has at least the following information for each study:

Author names and year of publication for each study.

Number in sample (and within groups).

Effect size information (you may have to calculate this if it is not provided). The most common effect sizes are Cohen’s d and r. However, other types of effect size may be used.³

If applicable, group sizes, means, and standard deviations.

Additional information to characterize studies. If the number of studies included in the meta-analysis is large, it may be possible to analyze these variables as possible “moderators,” that is, variables that are related to different effect sizes. In studies of CBT, the magnitude of treatment effect might depend on number of treatment sessions, type of depression, client sex, or even the year when the study was done. There are also “study quality” and study type variables, for example, Was the study double blind or not? Was there a nontreatment control group? Was it a within-S or between-S design? It is a good idea to have more than one reader code this information and to check for interobserver reliability.

1.10.2 Goals of Meta-Analysis

Estimate mean effect size. When effect sizes are averaged across studies they are usually weighted by sample size (or sometimes by other characteristics of studies).

Evaluate the variance of effect sizes across studies. The variation among effect sizes indicates whether results of studies seem to be homogeneous (that is, they all tended to yield similar effect sizes) or heterogeneous (they yielded different effect sizes). If effect sizes are heterogeneous and the number of studies is reasonably large, a moderator analysis is possible.

Evaluate whether certain moderator variables are related to difference in effect sizes. For example, are smaller effect sizes obtained in recent CBT studies than in those done many years ago?

The mechanics of doing a meta-analysis can be complex.
For example, the analyst must choose between a fixed- and a random-effects model (for discussion, see Field & Gillett, 2010); a random-effects model is probably more appropriate in many situations. SPSS does not have a built-in meta-analysis procedure; Field and Gillett (2010) provide free downloadable SPSS syntax files on their website, and references to software created by others, including routines in R. See the following sources for guidelines about reporting meta-analysis: Liberati et al. (2009) and Rosenthal (1995).

1.10.3 Graphic Summaries of Meta-Analysis

Forest plots are commonly used to describe results from meta-analysis. Figure 1.2 shows a hypothetical forest plot. Suppose that three studies were done to compare depression scores between a group that has had CBT and a control group that has not had therapy. For each study, the effect size, Cohen’s d, is the difference between posttest depression scores for the CBT and control groups (divided by the pooled within-group standard deviation). A 95% CI is obtained for Cohen’s d for each study. The vertical line down the center of the table is the “line of no effect” that corresponds to d = 0. This would be the expected result if population means did not differ between CBT and control
conditions. In this example, a negative value of d means that the treatment group had a better outcome (i.e., lower depression after treatment) than the control group.
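The quantities a forest plot displays (a 95% CI for each study's Cohen's d, a weighted mean effect size with its own CI, a test of the overall effect, and a heterogeneity statistic) can be sketched as follows. This is only an illustration under stated assumptions: the three studies and their values are invented and are not the data behind Figure 1.2; the sampling variance of d uses a common large-sample approximation; and the studies are combined with fixed-effect inverse-variance weights, which is one widely used choice and not necessarily the weighting used for the figure. The function and variable names are hypothetical.

```python
import math
from statistics import NormalDist

# Hypothetical studies: (label, Cohen's d, n in CBT group, n in control group).
studies = [
    ("Study 1", -0.60, 40, 40),
    ("Study 2", -1.10, 25, 25),
    ("Study 3", -0.90, 60, 60),
]


def d_variance(d, n1, n2):
    """Large-sample approximation to the sampling variance of Cohen's d."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def fixed_effect_summary(studies, confidence=0.95):
    """Per-study CIs plus a fixed-effect (inverse-variance) summary."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    rows, weights, ds = [], [], []
    for label, d, n1, n2 in studies:
        se = math.sqrt(d_variance(d, n1, n2))
        rows.append((label, d, d - z_crit * se, d + z_crit * se))
        weights.append(1 / se ** 2)  # more precise studies get more weight
        ds.append(d)
    mean_d = sum(w * d for w, d in zip(weights, ds)) / sum(weights)
    se_mean = math.sqrt(1 / sum(weights))
    z = mean_d / se_mean  # test of overall effect against d = 0
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p value
    # Heterogeneity statistic Q, referred to chi-square with k - 1 df.
    q = sum(w * (d - mean_d) ** 2 for w, d in zip(weights, ds))
    return {
        "rows": rows,
        "mean_d": mean_d,
        "ci": (mean_d - z_crit * se_mean, mean_d + z_crit * se_mean),
        "z": z,
        "p": p,
        "Q": q,
        "df": len(studies) - 1,
    }


summary = fixed_effect_summary(studies)
for label, d, lo, hi in summary["rows"]:
    print(f"{label}: d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
print(f"Total: d = {summary['mean_d']:.2f}, "
      f"95% CI [{summary['ci'][0]:.2f}, {summary['ci'][1]:.2f}]")
print(f"Overall effect: z = {summary['z']:.2f}, p = {summary['p']:.4f}")
print(f"Heterogeneity: Q = {summary['Q']:.2f}, df = {summary['df']}")
```

A random-effects analysis would add an estimate of between-study variance to each study's variance before weighting; that step is omitted here to keep the sketch short.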
Figure 1.2 Hypothetical Forest Plot for Studies That Assess Posttreatment Depression in Therapy and Control Groups
Source: Adapted with permission from the Royal Australian College of General Practitioners from: Ried K. “Interpreting and understanding meta-analysis graphs: A practical guide.” Australian Family Physician, 2006; 35(8):635–38. Available at www.racgp.org.au/afp/200608/10624.

Reading across the line for Study 1: Author names and year are provided, then N, mean, and SD for the CBT and control groups. The horizontal line to the right, with a square in the middle, corresponds to the 95% CI for Cohen’s d for Study 1. The size of the square is proportional to total N for that study. The weight given to information from each study in a meta-analysis can be based on one or more characteristics of studies, such as sample size. The final column provides the exact numerical results that correspond to the graphic version of the 95% CI for Cohen’s d for each study.

The row denoted “Total” shows the 95% CI for the weighted mean of Cohen’s d across all three studies, first in graphic and then in numerical form. The “Total” row has a diamond-shaped symbol; the end points of the diamond indicate the 95% CI for the average effect size across studies. This CI did not include 0.

The values in the lower left of the figure answer two questions about the set of effect sizes across all studies. First, does the weighted mean of Cohen’s d combined across studies differ significantly from 0? The test for the overall effect, z = 2.09, p = .04, indicates that the null hypothesis that the overall average effect was zero can be rejected using α = .05, two tailed. The mean Cohen’s d that describes difference of depression scores for CBT compared with control group was –.87. This suggests that average mean depression was almost 1 standard deviation lower for persons who received CBT.
That would be labeled a large effect using Cohen’s standards (Table 1.1); it lies in between “minimal reportable effect” and a moderate effect using the guidelines of Fritz et al. (2012) (Table 1.2).
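The test for the overall effect described above can be approximately reproduced from the per-study estimates and 95% CI limits shown in Figure 1.2. The following is a minimal sketch, assuming a fixed-effect inverse-variance model and a normal approximation in which each standard error is recovered from the width of the reported CI. The function name and data layout are illustrative and do not come from the figure's source.

```python
from math import sqrt
from statistics import NormalDist

# (estimate, CI lower, CI upper) for Studies 1-3, read from Figure 1.2.
STUDIES = [(-0.52, -2.04, 1.00), (-0.50, -1.66, 0.66), (-1.83, -3.40, -0.26)]


def pool_fixed_effect(studies, confidence=0.95):
    """Inverse-variance weighted mean of study estimates (fixed-effect assumption)."""
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Assumption: each CI is symmetric, so SE = (upper - lower) / (2 * z_crit)
    # and the study weight is 1 / SE**2.
    weights = [(2 * z_crit / (hi - lo)) ** 2 for _, lo, hi in studies]
    pooled = sum(w * est for w, (est, _, _) in zip(weights, studies)) / sum(weights)
    se = 1 / sqrt(sum(weights))  # standard error of the pooled estimate
    z = pooled / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p value
    return pooled, (pooled - z_crit * se, pooled + z_crit * se), z, p


pooled, ci, z, p = pool_fixed_effect(STUDIES)
print(f"pooled = {pooled:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}], "
      f"|z| = {abs(z):.2f}, p = {p:.2f}")
```

Under these assumptions the pooled estimate, its CI, and the z test come out close to the values in the “Total” row and the lower left of the figure. Small discrepancies are expected because the inputs are rounded to two decimals.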
Second, are the effect sizes sufficiently similar or close together that they can be viewed as homogeneous? The result of the test for heterogeneity was χ² = 2.03, df = 2, p = .36. The null hypothesis of homogeneity is not rejected. If the χ² test result were significant, this would suggest that some studies yielded different effect sizes than others. If the meta-analysis included numerous studies, it would be possible to look for moderator variables that might predict which studies have larger and which have smaller effects. An actual meta-analysis of CBT effectiveness suggested that effects were larger for studies done in the early years of CBT and smaller in studies done in recent years (Johnsen & Friborg, 2015). In other words, the year when each study was done was a moderator variable; effect sizes were larger, on average, in earlier years than in more recent years.
1.11 RECOMMENDATIONS FOR BETTER RESEARCH AND ANALYSIS
Extensive recommendations have been made for improvements in data analysis and research practices. These could substantially improve understanding of results from individual studies, reduce p-hacking, reduce the number of false-positive results, and improve replicability of research results.
Cumming (2012) recommended focusing more on CIs and effect sizes (and less on p values) in reports and interpretations of research results. In addition, meta-analyses should be used to summarize effect size information across studies.
When effect size information is not examined, small p values are sometimes misunderstood as evidence of effects strong enough to be “worthy of notice,” in situations where treatment effects may be too small to be valued, and perhaps too small to even be noticed by everyday observation.
Use of language should be precise. It is unfortunate that the phrase “statistically significant” includes a word (significant) that means “noteworthy and important” in everyday use.
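The gap between a small p value and an effect that is “worthy of notice” can be illustrated numerically. The sketch below holds a very small effect size fixed and shows how the approximate p value changes as sample size grows. It assumes two independent groups of equal size and uses a normal approximation to the t distribution; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_two_tailed_p(d, n_per_group):
    """Approximate two-tailed p for a two-group comparison with effect size d.

    Assumptions: equal group sizes, and a normal approximation to the t
    distribution (reasonable only when n_per_group is not small).
    """
    test_stat = d * sqrt(n_per_group / 2)  # t = d / sqrt(1/n + 1/n)
    return 2 * (1 - NormalDist().cdf(abs(test_stat)))


# A fixed, very small effect (d = 0.10) evaluated at increasing sample sizes.
for n in (50, 500, 5000):
    print(f"d = 0.10, n per group = {n:5d}: p = {approx_two_tailed_p(0.10, n):.2g}")
```

The effect size is identical in every row; only N changes. With 50 cases per group the result is far from significant, while with 5,000 per group the same d yields p well below .001.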
Authors should try to convey accurate information about effect size in a way that distinguishes between statistical and practical significance. If you describe p < .001 as “highly significant,” this leads many readers to think that the effect of a treatment or intervention is strong enough to be valuable in the real world and worthy of notice. However, p values depend on N, as well as effect size. A very weak treatment effect can have a very small p value if N is sufficiently large.
Data analysts need to avoid p-hacking, “undisclosed flexibility,” and lack of transparency in research reports (Simmons et al., 2011). Authors also need to avoid HARKing: hypothesizing after results are known (Kerr, 1998). HARKing occurs when a researcher makes up an explanation for a result that was not expected and presents it as though it had been predicted in advance. For a detailed p-hacking checklist (things to avoid) see Wicherts et al. (2016). When p-hacking occurs, reported p values can greatly understate the true risk for Type I error, and this often leads data analysts and readers to believe that evidence against the null hypothesis is much stronger than it actually is. This in turn leads to overconfidence about findings and perhaps publication of false-positive results.
The most extensive list of recommendations about changes needed to improve replicability of research comes from Asendorpf et al. (2013). All of the following are based on their recommendations. The entire following list is an abbreviated summary of their ideas; see their paper for detailed discussion.
1.11.1 Recommendations for Research Design and Data Analysis
Use larger sample sizes. Other factors being equal, this increases statistical power and leads to narrower CIs.
Use reliable measures. When measures have low reliability, correlations between quantitative measures are attenuated (i.e., made smaller), and within-group SS terms in ANOVA become larger.
Use suitable methods of statistical analysis.
Avoid multiple underpowered studies. An underpowered study has too few cases to have adequate statistical power to detect the effect size.
Consider error introduced by multiple testing in underpowered studies.
The literature is scattered with inconsistent results because underpowered studies produce different sets of significant (or nonsignificant) relations between variables. Even worse, it is polluted by single studies reporting overestimated effect sizes, a problem aggravated by the confirmation bias in publication and a tendency to reframe studies post hoc to feature whatever results came out significant. (Asendorpf et al., 2013)
Do not evaluate whether results of a replication are consistent with the original study by “vote counting” of NHST results (e.g., did both studies have p < .05?). Instead note whether the CIs for the studies overlap substantially and whether the sample mean for the original study falls within the CI for the sample mean in the replication study.
1.11.2 Recommendations for Authors
Increase transparency of reporting (include complete information about sample size decisions, criteria used for statistical significance, all variables that were measured and all groups included, and all analyses that were conducted).
Specify how possible sources of bias such as outliers and missing values were evaluated and remedied.
If cases, variables, or groups are dropped from final analysis, explain how many were dropped and why.
Preregister research plans and predictions. For resources in psychology, see “Preregistration of Research Plans” (n.d.).
Publish materials, data, and details of analysis (e.g., on a webpage or in a repository; see “Recommended Data Repositories,” n.d.).
Publish working papers and engage in online research discussion forums to promote dialog among researchers working on related topics.
Conduct replications and make it possible for others to conduct replications.
Distinguish between exploratory and “confirmatory” analyses.
It is obvious that these are difficult for authors to do, particularly those at early stages in their careers. Publication of large numbers of studies that yield statistically significant results is a de facto requirement for getting hired, promoted, tenured, and grant-funded. Publication pressure can lower research quality (Sarewitz, 2016). Requirements to replicate studies and report more detail about data analysis decisions will make the process of publication far more time consuming. Efforts to adhere to these guidelines will almost certainly lead to publishing fewer papers. This could be good for the research field (Nelson, Simmons, & Simonsohn, 2012).
Changes in individual researcher behavior can only occur if researchers are taught better practices and if institutions such as departments, universities, and grant-funding agencies provide incentives that encourage researchers to produce smaller numbers of high-quality studies instead of rewarding publication of large numbers of studies.
1.11.3 Recommendations for Journal Editors and Reviewers
Promote good research practice by encouraging honest reporting of less-than-perfect results.
Do not insist on “confirmatory” studies; this discourages honest reporting when analyses are exploratory.
Publish null findings (those with p > .05) to minimize publication bias (provided that the studies are well designed). (Of course, a nil result should not be interpreted as evidence that the null hypothesis of no treatment effect is true. It is just a failure to find evidence that is inconsistent with the null hypothesis.)
Notice when a research report presents an unlikely outcome and raise questions about it. For example, Asendorpf et al. (2013) noted, “If an article reports 10 successful replications … each with a power of .60, the probability that all of the studies could have achieved statistical significance is less than 1%,” even if the finding is actually “true.”
Allow reviewers to discuss papers with authors.
Journals may give badges to papers with evidence of adherence to good practice such as study preregistration. Psychological Science does this; other journals are beginning to as well.
Require authors to make raw data available to reviewers and readers.
Reserve space for publication of replication studies, including failures to replicate.
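The “less than 1%” figure quoted from Asendorpf et al. (2013) in the list above follows from simple probability. Here is a minimal sketch, assuming the 10 studies are independent, each has power .60, and the effect is real; the function names are illustrative.

```python
from math import comb


def prob_all_significant(power, n_studies):
    """Probability that every one of n independent studies reaches significance."""
    return power ** n_studies


def prob_k_significant(power, n_studies, k):
    """Binomial probability that exactly k of n independent studies are significant."""
    return comb(n_studies, k) * power ** k * (1 - power) ** (n_studies - k)


p_all = prob_all_significant(0.60, 10)
print(f"P(all 10 significant) = {p_all:.4f}")
print(f"P(exactly 6 significant) = {prob_k_significant(0.60, 10, 6):.3f}")
```

With power of .60, the most likely outcome is about 6 significant results out of 10. A report of 10 out of 10 is therefore a signal that something other than chance, such as selective reporting, may be involved.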
1.11.4 Recommendations for Teachers of Research Methods and Statistics
To a great extent, textbooks and instructors teach what researchers are doing, and researchers, reviewers, and journal editors do what they have been taught to do. This discourages change. Incorporating issues such as the limitations of p values, the importance of reporting CIs and effect size, the risk for going astray into p-hacking during lengthy data analysis, and so forth, into courses will help future researchers take these issues into account.
Students need to understand the limitations of information from statistical significance tests and the problems created by inadequate statistical power, running multiple analyses, and selectively reporting only “significant” outcomes. In other words, they need to learn how to avoid p-hacking. Some of these ideas might be introduced in early courses; these topics are essential in intermediate and advanced courses. Many technical books cover these issues, but most textbooks do not.
Graduate courses should focus more on “getting it right” and less on “getting it published.”
Students need to know about a priori power analysis as a tool for deciding sample size (as opposed to the practice of continuing to collect data until p < .05 can be obtained, one of many forms of p-hacking). Some undergraduate statistics textbooks include an introduction to statistical power. Earlier chapters in this book provided basic information about power for each bivariate statistic.
The problems with inflated risk for Type I error that are raised by multiple analyses and multiple experiments should be discussed.
Transparency in reporting should be encouraged. Students need to work on projects that use real data sets with the typical problems faced in actual research (such as missing values and outliers). Students should be required to report details about data screening and any remedies applied to data to minimize sources of artifact such as outliers.
Students can reanalyze raw data from published studies or conduct replication studies as projects in research methods and statistics courses.
Instructors should promote critical thinking about research designs and research reports.
1.11.5 Recommendations About Institutional Incentives and Norms
Departments and universities should focus on quality instead of quantity of publications when making hiring, salary, and promotion decisions.
Grant agencies should insist on replications.
1.12 SUMMARY
The title of an article in Slate describes the current situation: “Science Isn’t Broken. It’s Just a Hell of a Lot Harder Than We Gave It Credit For” (Aschwanden, 2015). Self-correction and quality control mechanisms for science (including peer review and replication) do not work perfectly, but they can be made to work better. Progress in science requires weeding out false-positive results as well as generating new findings. Unfortunately, while generating new findings is incentivized, weeding out false positives is not.
P-hacking without active intention to deceive is probably the most common reason for false-positive results. Attempts to identify false-positive results (whether in one’s own work or in the work of others) can be painful. Ideally this will happen in a culture of cooperation and constructive commentary, rather than competition and attack. Public abuse of individual researchers whose work cannot be replicated is not a good way to move forward. All of us have (at least on occasion) complained about nasty reviews. We need to remember, when we become upset about the “them” who wrote those nasty reviews, that “them” is “us,” and treat one another kindly. Criticism can be provided in constructive ways.
The stakes are high. Press releases of inconsistent or contradictory results in mass media may reduce public respect for, and trust in, science. This in turn may reduce support for research funding and higher education. If researchers make exaggerated claims on the basis of limited evidence, and claims are frequently contradicted, this provides ammunition for antiscience and anti-intellectual elements in our society.
Change in research practices does not have to be all or nothing. It is easy to report CIs and effect sizes (as suggested by Cumming, 2014, and others). Meta-analyses are becoming more common in many fields. We can make more thoughtful assessments of effect sizes and distinguish between statistical and practical or clinical importance (Kirk, 1996; Thompson, 2002a). The many additional recommendations listed in the preceding section may have to be implemented more gradually, as institutional support for change increases.
COMPREHENSION QUESTIONS
1. If Researcher B tries to replicate a statistically significant finding reported by Researcher A, and Researcher B finds a nonsignificant result, does this prove that Researcher A’s finding was incorrect? Why or why not?
2. What needs to be considered when comparing an original study by Researcher A and a replication attempt by Researcher B?
3. Is psychology the only discipline in which failures to replicate studies have been reported? (If not, what other disciplines? Your answer might include examples that go beyond those in this chapter.)
4. What does a p value tell you about:
a) Probability that the results of a study will replicate in the future?
b) Effect size (magnitude of treatment effect)?
c) Probability that the null hypothesis is correct?
d) What does a p value tell you?
5. “NHST logic involves a double negative.” Explain.
6. What does it mean to say that H0 is always false?
7. In words, what does Cohen’s d tell you about the magnitude of differences between two sample means? Does d have a restricted range? Can it be negative?
8. How does the value of the t ratio depend on the values of d and df?
9. How does the width of a CI depend on the level of confidence, N, and SD?
10. Review: What is the difference between SEM and SD? Which will be larger?
11. Consider Equation 1.4. Which term provides information about effect size? Which term provides information about sample size?
12. Describe violations of assumptions or rules that can bias values of p. Don’t worry whether to call something an assumption versus a rule versus an artifact; these concepts overlap.
13. What are the major alternatives that have been suggested to the use of α < .05 (NHST)?
14. What is p-hacking? What common researcher practices can be described as p-hacking? What effect does p-hacking have on the believability of research results?
15. What is HARKing, and how can it be misleading?
16. How could p-hacking contribute to the problems that sometimes arise when people try to replicate research studies?
17. Is it correct to say that a study with p < .001 shows stronger treatment effects than a study that reports p < .05? Why or why not?
18. How does theoretical significance differ from practical or clinical significance? What kinds of information are useful in evaluating practical or clinical significance?
19. When people report CIs instead of p values, how might this lead them to think about data differently?
20. Can you tell from a graph or bar chart that shows 95% CIs for the means of two groups whether the t test that compares group means using α = .05 would be statistically significant? Explain your reasoning.
21. If a computer program or research report does not provide effect size information, is there any way for you to figure it out?
22. Explain the difference between (M1 – M2) and Cohen’s d. Which is standardized? What kind of information does each of these potentially provide about effect size?
23. In addition to reporting effect size in research reports, discuss two other uses for effect size.
24. What three questions does a meta-analysis usually set out to answer?
25. Find a forest plot (either using a Google image search or by looking at studies in your research area). Unless you already understand odds ratios, make sure that the outcome variable is quantitative (some forest plots provide information about odds ratios; we have not discussed those yet). To the extent that you can, evaluate the following: Does the plot include all the information you would want to have? What does it tell you about the magnitude of effect in each study? The magnitude of effect averaged across all studies?
26. Describe three changes (in the behavior of individual researchers) that could improve future research quality. Describe two changes (in the behavior of institutions) that could help individuals make these changes. Do any of these changes seem easy to you? Which changes do you think are the most difficult (or unlikely)?
27. Has this chapter changed your understanding or thinking about how you will conduct research and analyze data in the future? If so, how?
NOTES
1. The eminent philosopher of science Karl Popper (cited in Meehl, 1978) argued that to advance science, we need to look for evidence that might disconfirm our preferred hypotheses. NHST is not Popperian falsification. Meehl (1978) pointed out that NHST actually does the opposite. It is a search for evidence to disconfirm the null hypothesis (not evidence to disconfirm the research or alternative hypothesis). When we use NHST (with sufficiently large samples), our preferred alternative hypotheses are not in jeopardy. Meehl argued that NHST is not a good way to advance knowledge in the social and behavioral sciences. It does not pose real challenges to our theories and is not well suited to deal with the sheer complexity of research questions in social and behavioral sciences. We make progress not only by generating new hypotheses and findings but also by discarding incorrect ideas and faulty evidence. Selective reporting of small p values does not help us discard incorrect ideas.
2. An exception is that if N, the number of data points, becomes very small, the size of a correlation becomes large. If you have only N = 2 pairs of X, Y values, a straight line will fit perfectly, and r will equal 1 or –1. For values of N close to 2, values of r will be inflated because of “overfitting.”
3. Odds ratios or relative risk measures, which can be obtained from logistic regression, are also common effect sizes in meta-analyses. See Chapter 16 later in this book.
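The point in Note 2 can be checked with a small simulation. The sketch below draws X and Y independently (so the population correlation is 0) and reports the average absolute sample r for several values of N. The function names, the number of replications, and the random seed are arbitrary choices made for illustration.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_abs_r(n, reps=2000, seed=1):
    """Average |r| across simulated samples of size n in which X and Y are unrelated."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]
        total += abs(pearson_r(x, y))
    return total / reps


for n in (2, 3, 5, 30):
    print(f"N = {n:2d}: mean |r| = {mean_abs_r(n):.2f}")
```

With N = 2 every sample gives r of 1 or –1 even though the variables are unrelated; the average absolute r shrinks toward 0 as N grows.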
DIGITAL RESOURCES
Find free study tools to support your learning, including eFlashcards, data sets, and web resources, on the accompanying website at edge.sagepub.com/warner3e.
Descriptions of Images and Figures
The horizontal axis represents the daily servings of fruit and vegetables and ranges from 0 to 5, in intervals of 1. The vertical axis represents the mean positive affect and ranges from 30 to 36, in intervals of 2. The confidence interval is marked at the value 32.4 on the vertical axis and a line parallel to the horizontal axis is given to represent the same. There are six error bars in the plot; they are approximately summarized below:
Error bar 1: The lower whisker of the error bar is at 31 and the upper whisker is at 32.
Error bar 2: The lower whisker of the error bar is at 31.5 and the upper whisker is at 33.8.
Error bar 3: The lower whisker of the error bar is at 30.7 and the upper whisker is at 33.
Error bar 4: The lower whisker of the error bar is at 32.6 and the upper whisker is at 34.7.
Error bar 5: The lower whisker of the error bar is at 33.2 and the upper whisker is at 35.8.
Error bar 6: The lower whisker of the error bar is at 32.6 and the upper whisker is at 37.
The number of individuals per group is as follows: 524; 162; 162; 120; 98; 51.

The figure is a table with the following details:
Study IDs.
Total number in the group and mean of outcome in both therapy and control groups.
Outcome effect measure shown graphically and numerically.
Influence of studies on overall meta-analysis.
The details of the table are as follows:
For Study 1:
Therapy group N: 34; Therapy group mean (SD): 9.77 (2.93); Control group N: 34; Control group mean (SD): 10.29 (3.43); weight: 27.5%.
For Study 2:
Therapy group N: 36; Therapy group mean (SD): 8.40 (1.90); Control group N: 36; Control group mean (SD): 8.90 (3.00); weight: 46.9%.
For Study 3:
Therapy group N: 30; Therapy group mean (SD): 10.26 (2.96); Control group N: 30; Control group mean (SD): 12.09 (3.24); weight: 25.6%.
For Total 95% CI:
Therapy group N: 100; Control group N: 100; weight: 100%.
It also lists the following:
Test for heterogeneity: Chi-square = 2.03, df = 2, p = .36.
Test for overall effect: z = 2.09, p = .04.
The details of the Cohen’s d and 95% CI graph are as below:
The Cohen’s d and 95% CI graph shows a line of no effect in the center as the vertical axis.
The horizontal axis ranges from negative 4.0 to 4.0 in intervals of 2.0 and represents the scale of treatment effect.
The negative values represent “favors intervention” and the positive values represent “favors control”.
The boxes are situated in line with the outcome value of the individual studies. The whiskers through the boxes depict the length of the confidence intervals (CI).
The diamond in the last row of the graph illustrates the overall result of the meta-analysis. The middle of the diamond sits on the value for the overall effect estimate and the width of the diamond depicts the width of the overall CI.
The details of the boxes and whiskers are as below:
For Study 1: Outcome value: Negative 0.52; Value of left whisker: Negative 2.04; Value of right whisker: 1.00.
For Study 2: Outcome value: Negative 0.50; Value of left whisker: Negative 1.66; Value of right whisker: 0.66.
For Study 3: Outcome value: Negative 1.83; Value of left whisker: Negative 3.40; Value of right whisker: Negative 0.26.
For Total: Overall effect estimate value: Negative 0.85; The left of the diamond: Negative 1.64; The right of the diamond: Negative 0.05.
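The study weights and the heterogeneity test listed in this description can be approximately recovered from the outcome values and whisker positions given above. This is a minimal sketch, assuming inverse-variance weights under a fixed-effect model, with each standard error derived from the CI width. The closed-form p value is valid only for df = 2 (three studies), and the function name is hypothetical.

```python
from math import exp

# (estimate, CI lower, CI upper) for Studies 1-3, as listed in the description above.
STUDIES = [(-0.52, -2.04, 1.00), (-0.50, -1.66, 0.66), (-1.83, -3.40, -0.26)]
Z_CRIT = 1.959964  # two-tailed critical z for a 95% CI


def heterogeneity(studies):
    """Return percentage weights, Cochran's Q, and its p value (df = 2 only)."""
    if len(studies) != 3:
        raise ValueError("the closed-form p value assumes exactly 3 studies (df = 2)")
    # Assumption: SE = CI width / (2 * Z_CRIT); weight = 1 / SE**2.
    weights = [(2 * Z_CRIT / (hi - lo)) ** 2 for _, lo, hi in studies]
    total = sum(weights)
    pooled = sum(w * est for w, (est, _, _) in zip(weights, studies)) / total
    q = sum(w * (est - pooled) ** 2 for w, (est, _, _) in zip(weights, studies))
    p = exp(-q / 2)  # chi-square upper-tail probability has this form when df = 2
    return [100 * w / total for w in weights], q, p


pct_weights, q, p = heterogeneity(STUDIES)
print("weights (%):", [round(w, 1) for w in pct_weights])
print(f"Q = {q:.2f}, df = 2, p = {p:.2f}")
```

The recovered weights differ from the listed percentages by a fraction of a point because the inputs are rounded; the Q statistic and its p value should be close to the listed chi-square test.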