Comparing Two or More Groups -Stats6

Reading-AppliedstatisticsIChap13.pdf

Warner, R. M. (2021). Applied statistics I: Basic bivariate techniques (3rd ed.). Thousand Oaks, CA: Sage Publications. ISBN: 978-1-5063-5280-0.

CHAPTER 13 ONE-WAY BETWEEN-SUBJECTS ANALYSIS OF VARIANCE

13.1 RESEARCH SITUATIONS WHERE ONE-WAY ANOVA IS USED

One-way between-subjects analysis of variance (usually called ANOVA) is used in situations where researchers compare means on a quantitative Y outcome variable across two or more groups. It is called analysis of variance because the goal is to partition or divide the variance of scores on the Y outcome into variance that can be predicted from group membership and variance that cannot be predicted from group membership. In practice this partition is made by computing sums of squares (SS). In an experiment, variance related to group membership, also called between-group variance, is related to the dosage or type of treatment received by each group. Variance within groups (variance that is not related to group membership) is called experimental error. Within-group variance is due to the influence of all other uncontrolled variables (other than the treatment in the study) on Y scores. Researchers generally hope that between-group variance or SS will be relatively large and that within-group variance or SS will be small, because this outcome suggests that dosage or type of treatment affects the outcome variable Y.

You saw an ANOVA table in the bivariate regression output in Chapter 11. However, when people refer to ANOVA, they usually refer to comparison of means across groups. The t test provides information about the distance between the means on a quantitative outcome variable for just two groups, whereas a one-way ANOVA compares means on a quantitative variable across any number of groups. The categorical predictor variable in an ANOVA may represent either naturally occurring groups (in a nonexperimental study) or groups formed by a researcher and then exposed to different interventions (in an experiment).
The term between-S (like the term independent samples) tells us that each participant is a member of one and only one group, that there are no repeated measures, and that participants are not matched or paired across samples. When data consist of repeated measures or paired or matched samples, repeated-measures or within-S ANOVA (discussed in a later chapter) is required. In ANOVA, the categorical predictor variable is called a factor. The groups are called levels of this factor. Levels of a factor in an experiment may represent different types of treatment and/or control groups or different dosage levels of the same treatment. In the hypothetical research example introduced in Section 13.3, the factor is called “type of stress,” and the levels of this factor represent four different types of stress situations: type 1, no stress; type 2, cognitive stress from a mental arithmetic task; type 3, stressful social role play; and type 4, stress during a mock job interview. The outcome variable is a quantitative measure of anxiety. The question is whether mean anxiety differs across these four situations.

When there is just one categorical variable or factor, it is often called Factor A; the number of levels or groups is denoted a. In Chapter 16, on factorial ANOVA, you will see that you can have more than one factor, and in those situations factors are often named A, B, C, and so on. Comparisons among several group means could be made by calculating t tests for each pairwise comparison among the means of these four treatment groups. However, as described earlier, doing numerous significance tests leads to an inflated risk for Type I error. If a study includes k groups, there are k(k – 1)/2 pairs of means; thus, for a set of four groups, the researcher would need to do (4 × 3)/2 = 6 different t tests to make all possible pairwise comparisons. If α = .05 is used as the criterion for significance for each test, and the researcher conducts six significance tests, the probability that this set of six decisions contains at least one instance of Type I error is greater than .05. One way that ANOVA limits the risk for Type I error is by obtaining a single omnibus test that examines all possible comparisons among means in the study. Researchers often want to examine selected pairwise comparisons of means as a follow-up analysis to obtain more information about the pattern of differences among groups.

13.2 QUESTIONS IN ONE-WAY BETWEEN-S ANOVA

The overall null hypothesis for one-way ANOVA is that the means of the k populations that correspond to the groups in the study are all equal:

$$H_0: \mu_1 = \mu_2 = \dots = \mu_k. \tag{13.1}$$

When each group has been exposed to different types or dosages of a treatment, as in a typical experiment, this null hypothesis corresponds to an assumption that the treatment has no effect on the outcome variable. The alternative hypothesis in this situation is not that all population means are unequal; the alternative hypothesis is that there is at least one inequality between one pair of means in the set.
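The inflation of Type I error risk described at the end of Section 13.1 can be made concrete with a short sketch. The function names below are illustrative, and the familywise formula 1 – (1 – α)^m assumes the m tests are independent, which is only an approximation for pairwise t tests that share groups.

```python
from math import comb


def n_pairwise(k: int) -> int:
    """Number of distinct pairs among k group means: k(k - 1)/2."""
    return comb(k, 2)


def familywise_error(alpha: float, m: int) -> float:
    """Probability of at least one Type I error across m tests.

    Assumption: the m tests are independent. Pairwise t tests on the
    same groups are not fully independent, so treat this as a rough guide.
    """
    return 1.0 - (1.0 - alpha) ** m


k = 4                               # four groups, as in the stress example
m = n_pairwise(k)                   # (4 x 3)/2 = 6 pairwise comparisons
risk = familywise_error(0.05, m)    # about .26, well above .05
print(m, round(risk, 3))
```

Under the independence assumption, six tests at α = .05 give roughly a one-in-four chance of at least one Type I error, which is why a single omnibus test is attractive.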
The best question ever asked by a student in any of my statistics classes was deceptively simple: “Why is there variance?” In the hypothetical experimental study described in the following section, the outcome variable is a self-report measure of anxiety, and the group membership variable is type of stress. We want to know, How much of the variance in anxiety can be predicted from type of stress? Is stress a major reason why anxiety scores differed among persons in this study? Why do some persons report more anxiety than other persons? To what extent are the differences in amount of anxiety systematically associated with the independent variable (type of stress), and to what extent are differences in the amount of self-reported anxiety due to other factors (such as trait levels of anxiety, physiological arousal, drug use, sex, other anxiety-arousing events that each participant has experienced on the day of the study, etc.)?

In statistics, the term error usually does not mean the same thing as in everyday life. In everyday life, we use the word error to mean “mistake.” In ANOVA, the term error refers to the parts of scores that cannot be predicted from type of treatment or group membership. The part of anxiety scores that we cannot predict from type of stress is presumably due to the effects of other variables that we have not included in the study, such as personality, other upsetting events that may have happened to the person just before the study, recent use of drugs such as alcohol, caffeine, and tobacco, and possibly a multitude of other unknown variables.

Questions in ANOVA:

1. The first question in one-way ANOVA is this: When all group means are considered as a set, are there any significant differences between means? An overall F ratio will tell us whether there are any significant differences among the group means, but it does not tell us which specific means differ. It is possible that each group mean differs from every other group mean, but it is also possible that only one or a few pairs of means differ.

2. The second question in one-way ANOVA is this: Which specific pairs (or combinations) of group means differ significantly? There are two ways to answer this question. A data analyst either decides which comparisons are of interest ahead of time and sets up planned contrasts or explores data using post hoc follow-up tests to see which means differ significantly. Both approaches are discussed in this chapter.

• Planned contrasts (sometimes just called contrasts and also called a priori comparisons) can be set up to examine a limited number of differences between means that the data analyst has decided ahead of time are of interest. These are called unprotected tests (that is, not protected against inflated risk for Type I error) because, except for limiting the number of significance tests, there are no other corrections for inflated risk for Type I error.

• Post hoc tests (such as the Tukey honestly significant difference [HSD] test) can be used to examine many or all the possible comparisons among means. These are called protected tests because most of them use more conservative per comparison criteria for statistical significance (like per comparison alpha [PCα] in the Bonferroni procedure).

Later in your study of statistics, you will discover that many analyses involve a similar approach: first, an omnibus test that includes all groups and/or all variables, then follow-up analyses to evaluate which groups or which variables show significant differences.

13.3 HYPOTHETICAL RESEARCH EXAMPLE

Suppose that an experiment is done to compare the effects of four situations: Group 1 is tested in a “no-stress,” baseline situation; Group 2 does a mental arithmetic task; Group 3 does a stressful social role play; and Group 4 does a mock job interview. For this study, the X variable is a categorical variable with codes 1, 2, 3, and 4 that represent which of these four types of stress each participant received. This categorical X predictor variable is called a factor; in this case, the factor is called “type of stress”; the four levels of this factor correspond to no stress, mental arithmetic, stressful role play, and a mock job interview. At the end of each session, the participants self-report their anxiety on a scale that ranges from 0 = no anxiety to 20 = extremely high anxiety. Scores on anxiety are, therefore, scores on a quantitative Y outcome variable. Imagine that there is a convenience sample of N = 28 participants. (Capital N denotes the total number of participants in the study.) Imagine that participants were randomly assigned to one of the four levels of stress. This results in k = 4 groups with n = 7 participants in each group, for a total of N = 28 participants in the entire study. Lowercase n indicates the number of cases per group. The SPSS Data View worksheet that contains data for this imaginary study appears in Figure 13.1, and data are available in the SPSS file stress_anxiety.sav. The goal of data analysis is to find out:

1. Whether mean anxiety levels differed across these four situations.
2. Which situations elicited the highest and lowest anxiety.
3. Which treatment group means differed significantly from the baseline (no-stress) condition.
4. Whether mean anxiety differed among the mental arithmetic, role play, and mock job interview stress situations.

Figure 13.1 Data View Worksheet for Stress and Anxiety Study in stress_anxiety.sav

13.4 ASSUMPTIONS AND DATA SCREENING FOR ONE-WAY ANOVA

The assumptions for one-way ANOVA are the same as those described for the independent-samples t test. The scores on the dependent variable must be quantitative. Observations must be independent of one another, both within and between groups. Ideally, scores should be approximately normally distributed within each group, and variances should be approximately equal across groups. ANOVA, like the t test, is robust against violations of the normality and equal variance assumptions if within-group n’s are reasonably large. Finally, there should not be extreme outliers. Preliminary screening involves the same procedures as for the t test: Histograms can be examined separately for each group to assess normality of distribution shape; boxplots for groups can identify potential outliers within groups. The Levene test (or another test of homogeneity of variance) can be requested as part of the output and used to assess whether the homogeneity of variance assumption is violated. Because preliminary data screening for one-way between-S ANOVA uses the same procedures as those shown in Chapter 12, on the independent-samples t test, these procedures are not repeated here.

13.5 COMPUTATIONS FOR ONE-WAY BETWEEN-S ANOVA

13.5.1 Overview

ANOVA begins with familiar statistics. For each group, we obtain M, s, SS, and n. Recall that a sum of squares or SS is obtained by finding M for the group of interest, computing a (Y – M) deviation for each individual Y score, squaring the deviation for each score, and summing the squared deviations. For the independent-samples t test, we needed to find SS only for Groups 1 and 2. In ANOVA, several different forms of SS are obtained. SStotal is obtained by:

• Finding the grand mean for the entire data set, denoted MY.
• Obtaining the (Y – MY) deviation for every score in the data set.
• Squaring each deviation.
• Summing the squared deviations.
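The four steps for SStotal can be sketched in a few lines. The scores below are made up purely for illustration (they are not the stress_anxiety.sav data), and the function name is hypothetical.

```python
def sum_of_squares(scores):
    """Sum of squared deviations of each score from the mean of the batch."""
    mean = sum(scores) / len(scores)            # step 1: find the mean
    deviations = [y - mean for y in scores]     # step 2: (Y - M) deviations
    squared = [d ** 2 for d in deviations]      # step 3: square each deviation
    return sum(squared)                         # step 4: sum the squares


# Illustrative scores only; not taken from the chapter's data file.
scores = [2, 4, 4, 6, 9]
ss_total = sum_of_squares(scores)           # 28.0 (the mean is 5)
variance = ss_total / (len(scores) - 1)     # SS / df with df = N - 1, here 7.0
print(ss_total, variance)
```

Dividing the SS by its df, as in the last step, gives the sample variance for the whole batch of scores.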

For a batch of data, recall that the sample variance s² = SS/df. For SStotal, df = N – 1, where N is the total number of scores in the entire data set. We could use SStotal to find the total variance s² for all Y scores in the study; however, to find out what proportion of variance in Y is related to group membership, it is more convenient to focus on SS than s². In ANOVA, an SS divided by its df is usually called a mean square (MS). A one-way ANOVA divides SStotal into two sources of variance, often called SSbetween groups and SSwithin groups. The formulas to obtain the latter two SS terms can appear confusing, so let’s just focus on the information provided by each term. SSbetween groups tells us how far the values of M1, M2,…, Mk are from the grand mean. If the group means are all exactly equal, SSbetween groups will be 0. SS terms can never be negative, and there is no fixed upper limit for values. A “large” value of SSbetween groups (also called SSbetween) tells us that:

• group means are far away from the grand mean, and/or
• group means are far away from one another.

What information do we need to consider to decide whether SSbetween is “large”? First, we need to divide SS by its df. Deviations of group means from the grand mean, like deviations of individual scores from a sample mean, must sum to 0. If there are k group means, only the first k – 1 deviations of group means from the grand mean are free to vary. Thus, for SSbetween, df = k – 1 (where k is the number of groups). Dividing an SS by its df corrects for the number of independent deviations used to calculate the SS. An SS divided by its df is called a mean square. MSbetween is, in effect, the variance of the group means. For technical reasons, statisticians do not refer to MS as a variance (but essentially, that’s what it is). Second, we need to compare SSbetween with information about error variance or within-group variance. The error variance term is called SSwithin. There are several ways to compute SSwithin. The easiest way to think about it is this: First, find SS for the set of scores within each treatment group. For Treatment Group 1, find the group mean, M1; compute the deviation of each Y score in that group from M1; square the deviations; and sum the squared deviations. This yields SS1, and this tells us about variation of scores within Group 1. For a study with k = 4 groups and n = 7 cases within each group, you obtain the following:

SSwithin is the sum of four within-group SS terms: SS1 + SS2 + SS3 + SS4. The df for SSwithin is the sum of df1, df2, df3, and df4; this can also be written as n1 + n2 + n3 + n4 – k, or N – k, where N is the total number of persons in the study and k is the number of groups. If SS1 (or the SS for any group) = 0, that tells us that all scores within Group 1 were equal to one another. As the value of SS1 gets larger, we have evidence that a sample of people who received the same treatment have different score values, and these differences are due to other variables that influenced the outcome. In the hypothetical study of stress and anxiety, anxiety scores may be influenced by recent drug use, depression, or events in the lab.

After you calculate SStotal, SSbetween, and SSwithin, you will find that this equality holds (as long as you have not made arithmetic errors):

$$SS_{\text{total}} = SS_{\text{between}} + SS_{\text{within}}. \tag{13.2}$$

This equation describes the partition (division) of total variation of Y into two sources of variance: differences among group means (SSbetween) and differences among scores within the same treatment groups (SSwithin). We hope that most of the variation between groups is due to the different types or amounts of treatment received by groups, and we usually hope that SSbetween will be large. We know that SSwithin provides information about response differences among people who received the same type of treatment and that SSwithin tells us about magnitude of experimental error; we want SSwithin to be small. Recall that one of the effect sizes for t was η² and that η² was the proportion of variance of Y scores that is predictable from or related to group membership. In one-way ANOVA, η² = SSbetween/SStotal. Thus, the SS terms provide effect size information. To obtain a statistical significance test, we set up an F ratio:

$$F = MS_{\text{between}} / MS_{\text{within}}. \tag{13.3}$$

Because it is a ratio of MS terms, F cannot be negative. F would be zero if all group means were equal. There is no fixed upper limit for values of F. To decide whether F is large enough to be statistically significant, we need to find a critical value of F from the table in Appendix C at the end of this book. The reject region for F is always one tailed (values in the top 5% of an F distribution, for instance). To locate the critical value that corresponds to the top 5% of the distribution, you need to know about df. The independent-samples t test required only one df term. An F ratio compares two different MS terms, and each of those MS terms has its own df, so we need to specify two different df terms:

$$df_{\text{between}} = k - 1, \tag{13.4}$$

$$df_{\text{within}} = N - k, \tag{13.5}$$

where k is the number of groups and N is the total number of cases.

To summarize: The by-hand computation for one-way ANOVA (with k groups and a total of N observations) involves the following steps. Complete formulas are provided in the following sections.

1. Compute SSbetween, SSwithin, and SStotal.
2. Find effect size: η² = SSbetween/SStotal.
3. Compute MSbetween by dividing SSbetween by its df, k – 1.
4. Compute MSwithin by dividing SSwithin by its df, N – k.
5. Compute an F ratio: MSbetween/MSwithin.
6. Compare this F value obtained with the critical value of F from a table of the F distribution with (k – 1) and (N – k) df (using the table in Appendix C at the end of the book that corresponds to the desired alpha level; for example, the first table provides critical values for α = .05). If the F value obtained exceeds the tabled critical value of F for the predetermined alpha level and the applicable degrees of freedom, reject the null hypothesis that all the population means are equal.
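The six steps can be sketched as one small function. This is a minimal illustration rather than a replacement for SPSS: the function name and the tiny data set are hypothetical, and the critical value 5.14 is the standard tabled F for α = .05 with 2 and 6 df, entered by hand in the same way a reader would look it up in an F table.

```python
def one_way_anova(groups, critical_f):
    """By-hand one-way between-S ANOVA for a list of groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Step 1: the three sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    ss_total = sum((y - grand_mean) ** 2 for g in groups for y in g)

    # Step 2: effect size.
    eta_squared = ss_between / ss_total

    # Steps 3 and 4: mean squares (each SS divided by its df).
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Step 5: F ratio. Step 6: compare with the tabled critical value.
    f_ratio = ms_between / ms_within
    return {
        "ss_between": ss_between, "ss_within": ss_within, "ss_total": ss_total,
        "eta_squared": eta_squared, "ms_between": ms_between,
        "ms_within": ms_within, "f": f_ratio, "reject_h0": f_ratio > critical_f,
    }


# Made-up scores for k = 3 groups of n = 3 (not the chapter's data).
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
result = one_way_anova(groups, critical_f=5.14)   # tabled F(.05; 2, 6)
print(result)
```

For these illustrative scores SSbetween = 54 and SSwithin = 6, which sum to SStotal = 60, so η² = .90 and F = 27.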

In practice, these computations are done by programs such as SPSS; you can decide whether the outcome is statistically significant by examining the p value for the F test and evaluate effect size by calculating an η².

13.5.2 SSbetween: Information About Distances Among Group Means

The following notation will be used:

Let k be the number of groups in the study.
Let n1, n2,…, nk be the number of scores in Groups 1, 2,…, k.
Let Yij be the score of subject j in Group i (i = 1, 2,…, k).
Let M1, M2,…, Mk be the means of scores in Groups 1, 2,…, k.
Let N be the total N in the entire study; N = n1 + n2 + … + nk.
Let MY be the grand mean of all scores in the study (i.e., the total of all the individual scores, divided by N, the total number of scores).

Once we have calculated the means of each individual group (M1, M2,…, Mk) and the grand mean MY, we can summarize information about the distances of the group means, Mi, from the grand mean, MY, by computing SSbetween as follows:

$$SS_{\text{between}} = \sum_{i=1}^{k} n_i (M_i - M_Y)^2. \tag{13.6}$$
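Equation 13.6 can be checked in a few lines. The sketch below uses the group means, group sizes, and grand mean that the chapter reports for the hypothetical stress and anxiety data (M1 = 9.86, M2 = 14.29, M3 = 13.57, M4 = 17.00, MY = 13.68, n = 7 per group); the variable names are illustrative.

```python
# Values reported in the chapter for the hypothetical stress study.
group_means = [9.86, 14.29, 13.57, 17.00]
group_ns = [7, 7, 7, 7]
grand_mean = 13.68

# Equation 13.6: weight each squared distance from the grand mean by n_i.
ss_between = sum(
    n * (m - grand_mean) ** 2 for n, m in zip(group_ns, group_means)
)
print(round(ss_between, 2))   # about 182, up to rounding of the means
```

Because the means are rounded to two decimals, the result differs slightly from the value a program computes from the raw scores.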

For the hypothetical data in Figure 13.1, the mean anxiety scores for Groups 1 through 4 were as follows: M1 = 9.86, M2 = 14.29, M3 = 13.57, and M4 = 17.00. The grand mean on anxiety, MY, is 13.68. Each group had n = 7 scores. Therefore, for this study,

$$SS_{\text{between}} = 7(9.86 - 13.68)^2 + 7(14.29 - 13.68)^2 + 7(13.57 - 13.68)^2 + 7(17.00 - 13.68)^2.$$

SSbetween ≈ 182 (this agrees with the value of SSbetween in the SPSS output presented in Figure 13.8 except for a small amount of rounding error).

13.5.3 SSwithin: Information About Variability of Scores Within Groups

To summarize information about the variability of scores within each group, we compute MSwithin. For each group, for groups numbered i = 1, 2,…, k, we first find the sum of squared deviations of scores relative to each group mean, SSi. The SS for scores within Group i is found by taking this sum:

$$SS_i = \sum_{j=1}^{n_i} (Y_{ij} - M_i)^2. \tag{13.7}$$

That is, for each of the k groups, find the deviation of each individual score from the group mean; square and sum these deviations for all the scores in the group. These within-group SS terms for Groups 1, 2,…, k are summed across the k groups to obtain the total SSwithin:

$$SS_{\text{within}} = \sum_{i=1}^{k} SS_i = SS_1 + SS_2 + \dots + SS_k. \tag{13.8}$$

For this data set, we can find the SS term for Group 1 (for example) by taking the sum of the squared deviations of each individual score in Group 1 from the mean of Group 1, M1. The values are shown for by-hand computations; it can be instructive to do this as a spreadsheet, entering the value of the group mean for each participant as a new variable and computing the deviation of each score from its group mean and the squared deviation for each participant.

The grand mean MY = 13.68. The SStotal term includes 28 squared deviations, one for each participant in the data set, as follows:

$$SS_{\text{total}} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - M_Y)^2.$$

As noted earlier, the SSbetween and SSwithin terms will sum to SStotal:

$$SS_{\text{total}} = SS_{\text{between}} + SS_{\text{within}}. \tag{13.10}$$

In ANOVA, the mean square between groups is calculated by dividing SSbetween by its degrees of freedom:

$$MS_{\text{between}} = \frac{SS_{\text{between}}}{k - 1}. \tag{13.12}$$

For the data in the hypothetical study of stress and anxiety, SSbetween = 182, dfbetween = 4 – 1 = 3, and MSbetween = 182/3 = 60.7. The df for each SS within-group term is given by n – 1, where n is the number of participants in each group. Thus, in this example, SS1 had n – 1 or df = 6. When we form SSwithin, we add up SS1 + SS2 + ··· + SSk. There are (n – 1) df associated with each SS term, and there are k groups, so the total dfwithin = k × (n – 1). This can also be written as

$$df_{\text{within}} = N - k, \tag{13.13}$$

where N is the total number of scores (n1 + n2 + ··· + nk) and k is the number of groups. We obtain MSwithin by dividing SSwithin by its corresponding df:

$$MS_{\text{within}} = \frac{SS_{\text{within}}}{N - k}. \tag{13.14}$$

For the hypothetical stress and anxiety data in Figure 13.1, MSwithin = 122/24 = 5.083. Finally, we can set up a test statistic for the null hypothesis H0: μ1 = μ2 = ··· = μk by taking the ratio of MSbetween to MSwithin:

$$F = \frac{MS_{\text{between}}}{MS_{\text{within}}}. \tag{13.15}$$
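The F ratio in Equation 13.15, and the idea of a one-tailed reject region, can be illustrated with a small simulation. The sketch below takes the MS values reported for the stress data (60.702 and 5.083) and approximates the α = .05 critical value for 3 and 24 df by simulating many experiments in which the null hypothesis is true. The function names, the seed, and the number of simulated experiments are arbitrary choices for illustration; a table or statistical program gives the exact critical value.

```python
import random


def f_ratio(groups):
    """F = MS_between / MS_within for a list of groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def simulated_critical_f(k, n, alpha=0.05, n_sims=10000, seed=1):
    """Approximate the critical F when H0 is true (all population means equal).

    Every simulated score is drawn from the same normal population, so any
    differences among group means are due to sampling error alone.
    """
    rng = random.Random(seed)
    f_values = sorted(
        f_ratio([[rng.gauss(0, 1) for _ in range(n)] for _ in range(k)])
        for _ in range(n_sims)
    )
    return f_values[int((1 - alpha) * n_sims)]   # cuts off roughly the top 5%


f_obtained = 60.702 / 5.083                    # MS values reported in the text
critical = simulated_critical_f(k=4, n=7)      # close to the tabled value
print(round(f_obtained, 2), round(critical, 2))
print("reject H0:", f_obtained > critical)
```

The simulated cutoff will vary slightly with the seed, but it should land near the tabled critical value for 3 and 24 df, far below the obtained F.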

Figure 13.2 Reject Region for F Distribution With 3 and 24 df Using α = .05

For the stress and anxiety data, F = 60.702/5.083 = 11.94. This F ratio is evaluated using the F distribution with (k – 1) and (N – k) df. For this data set, k = 4 and N = 28, so df values for the F ratio are 3 and 24. An F distribution has a shape that differs from the normal or t distribution. Because an F is a ratio of two mean squares and MS cannot be less than 0, the minimum possible value of F is 0. On the other hand, there is no fixed upper limit for the value of F. Therefore, the distribution of F tends to be positively skewed, with a lower limit of 0, as in Figure 13.2. The reject region for significance tests with F ratios consists of only one tail (at the upper end of the distribution). The first table in Appendix C at the end of the book shows the critical values of F for α = .05. The second and third tables in Appendix C provide critical values of F for α = .01 and α = .001. In the hypothetical study of stress and anxiety, the F ratio has df equal to 3 and 24. Using α = .05, the critical value of F from the first table in Appendix C with df = 3 in the numerator (across the top of the table) and df = 24 in the denominator (along the left-hand side of the table) is 3.01. Thus, in this situation, the α = .05 decision rule for evaluating statistical significance is to reject H0 when values of F > +3.01 are obtained. A value of 3.01 cuts off the top 5% of the area in the right-hand tail of the F distribution with df equal to 3 and 24, as shown in Figure 13.2. The obtained F = 11.94 would therefore be judged statistically significant.

13.6 PATTERNS OF SCORES AND MAGNITUDES OF SSBETWEEN AND SSWITHIN

It is important to understand what information about pattern in the data is contained in these SS and MS terms. SSbetween is a function of the distances among the group means (M1, M2,…, Mk); the farther apart these group means are, the larger SSbetween tends to be. Most researchers hope to find significant differences among groups, and therefore, they want SSbetween (and F) to be relatively large. SSwithin is the total of squared within-group deviations of scores from group means. SSwithin would be 0 in the unlikely event that all scores within each group were equal to one another. The greater the variability of scores within each group, the larger the value of SSwithin.

Consider the example shown in Table 13.1, which shows hypothetical data for which SSbetween would be 0 (because all the group means are equal); however, SSwithin is not 0 (because the scores vary within groups). Table 13.2 shows data for which SSbetween is not 0 (group means differ) but SSwithin is 0 (scores do not vary within groups). Table 13.3 shows data for which both SSbetween and SSwithin are nonzero. Finally, Table 13.4 shows a pattern of scores for which both SSbetween and SSwithin are 0.

Table 13.1 Data for Which SSbetween Is 0 (Because All the Group Means Are Equal), but SSwithin Is Not 0 (Because Scores Vary Within Groups)

Table 13.2 Data for Which SSbetween Is Not 0 (Because Group Means Differ), but SSwithin Is 0 (Because Scores Do Not Vary Within Groups)

Table 13.3 Data for Which SSbetween and SSwithin Are Both Nonzero

Table 13.4 Data for Which Both SSwithin and SSbetween Equal 0
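The four patterns described for Tables 13.1 through 13.4 can be reproduced with small made-up data sets. The scores below are illustrative stand-ins chosen to show each pattern; they are not the values printed in the book's tables, and the function name is hypothetical.

```python
def ss_between_within(groups):
    """Return (SS_between, SS_within) for a list of groups of scores."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return ss_between, ss_within


# Illustrative scores only (three groups of three scores each).
patterns = {
    "equal means, scores vary within groups": [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
    "means differ, no variation within groups": [[1, 1, 1], [2, 2, 2], [3, 3, 3]],
    "means differ, scores vary within groups": [[1, 2, 3], [2, 3, 4], [3, 4, 5]],
    "all scores identical": [[2, 2, 2], [2, 2, 2], [2, 2, 2]],
}
results = {label: ss_between_within(g) for label, g in patterns.items()}
for label, (ssb, ssw) in results.items():
    print(f"{label}: SS_between = {ssb}, SS_within = {ssw}")
```

Note that an F ratio cannot be formed for the second and fourth patterns, because MSwithin would be 0.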

13.7 CONFIDENCE INTERVALS FOR GROUP MEANS

Once we know the mean, variance, and n for each group, we can set up a confidence interval (CI) around the mean for each group or a CI for any difference between a pair of group means. Procedures for CIs were reviewed in Chapter 12, on the independent-samples t test, and are not repeated here.

13.8 EFFECT SIZES FOR ONE-WAY BETWEEN-S ANOVA

By comparing the sizes of these SS terms that represent variability of scores between and within groups, we can make a summary statement about the comparative size of the effects of the independent and extraneous variables. The proportion of the total variability (SStotal) that is due to between-group differences is given by

$$\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}. \tag{13.16}$$

In the context of a well-controlled experiment, these between-group differences in scores are, presumably, due primarily to the manipulated independent variable; in a nonexperimental study that compares naturally occurring groups, this proportion of variance is reported only to describe the magnitudes of differences between groups, and it is not interpreted as evidence of causality. An eta squared (η²) is an effect size index given as a proportion of variance; if η² = .50, then 50% of the variance in the Yij scores is related to between-group differences. This is the same eta squared that was introduced in the previous chapter as an effect size index for the independent-samples t test; verbal labels that can be used to describe effect sizes are provided in Table 12.2. If the scores in a two-group t test are partitioned into components using the logic just described here and then summarized by creating sums of squares, the η² value obtained will be identical to the η² that was calculated from the t and df terms. It is also possible to calculate eta squared from the F ratio and its df; this is useful when reading journal articles that report F tests without providing effect size information:

$$\eta^2 = \frac{df_{\text{between}} \times F}{df_{\text{between}} \times F + df_{\text{within}}}. \tag{13.17}$$

An eta squared is interpreted as the proportion of variance in scores on the Y outcome variable that is predictable from group membership (i.e., from the score on X, the predictor variable). Suggested verbal labels for eta squared effect sizes were given in Table 12.2. One alternative effect size measure sometimes used in ANOVA is called omega squared (ω²) (see Hays, 1994). The eta squared index describes the proportion of variance due to between-group differences in the sample, but it is a biased estimate of the proportion of variance that is theoretically due to differences among the populations. The ω² index is essentially a (downwardly) adjusted version of eta squared that provides a more conservative estimate of variance among population means; however, eta squared is more widely used in statistical power analysis and as an effect size measure in the literature. Cohen’s f² is yet another effect size, often used in statistical power analysis. Cohen’s f² = η²/(1 – η²).

13.9 STATISTICAL POWER ANALYSIS FOR ONE-WAY BETWEEN-S ANOVA

Table 13.5 is an example of a statistical power table that can be used to make decisions about sample size when planning a one-way between-S ANOVA with k = 3 groups and α = .05. Using Table 13.5, given the number of groups, the number of participants, the predetermined alpha level, and the anticipated population effect size estimated by eta squared, the researcher can look up the minimum n of participants per group that is required to obtain various levels of statistical power. The researcher needs to make an educated guess: How large an effect is expected in the planned study? If similar studies have been conducted in the past, the eta squared values from past research can be used to estimate effect size; if not, the researcher may have to make a guess on the basis of less exact information. The researcher chooses the alpha level (usually .05), calculates dfbetween (which equals k – 1, where k is the number of groups in the study), and decides on the desired level of statistical power (usually .80, or 80%). Using this information, the researcher can use the tables in Cohen (1988) or in Jaccard and Becker (2009) to look up the minimum sample size per group that is needed to achieve the power of 80%. For example, using Table 13.5, for an alpha level of .05, a study with three groups and dfbetween = 2, a population eta squared value of .15, and a desired level of power of .80, the minimum number of participants required per group would be 19.

Table 13.5 Statistical Power for One-Way Between-S ANOVA With k = 3 Groups Using α = .05
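The effect size indices from Section 13.8, which also serve as inputs when planning sample size, can be sketched as follows. The SS, F, and df values are the ones reported for the hypothetical stress data (SStotal is taken as 182 + 122 = 304, following Equation 13.2), and the function names are illustrative.

```python
def eta_squared_from_ss(ss_between, ss_total):
    """Eta squared as a proportion of the total sum of squares."""
    return ss_between / ss_total


def eta_squared_from_f(f, df_between, df_within):
    """Eta squared recovered from a reported F ratio and its two df terms."""
    return (df_between * f) / (df_between * f + df_within)


def cohens_f_squared(eta_sq):
    """Cohen's f squared = eta squared / (1 - eta squared)."""
    return eta_sq / (1 - eta_sq)


eta_ss = eta_squared_from_ss(182, 182 + 122)    # about .60
eta_f = eta_squared_from_f(11.94, 3, 24)        # about .60 again
f2_planning = cohens_f_squared(0.15)            # f squared for eta squared = .15
print(round(eta_ss, 3), round(eta_f, 3), round(f2_planning, 3))
```

The two routes to η² agree up to rounding, which is why an F ratio and its df are enough to recover an effect size from a published report.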

Java applets are available on the web for statistical power analysis; typically, if the user identifies a Java applet that is appropriate for the specific analysis (such as between-S one-way ANOVA) and enters information about alpha, the number of groups, population effect size, and desired level of power, the applet provides the minimum per group sample size required to achieve the user-specified level of statistical power.

13.10 PLANNED CONTRASTS

The idea behind planned contrasts is that the researcher identifies a limited number of comparisons between group means before looking at the data. The test statistic that is used for each comparison is essentially identical to a t ratio, except that the denominator is usually based on the MSwithin for the entire ANOVA, rather than just the variances for the two groups involved in the comparison. Sometimes an F is reported for the significance of each contrast, but F is equivalent to t² in situations where only two group means are compared or where a contrast has only 1 df. For the means of Groups a and b, the null hypothesis for a simple contrast between Ma and Mb is as follows:

$$H_0: \mu_a = \mu_b$$

or

$$H_0: \mu_a - \mu_b = 0.$$

The test statistic can be in the form of a t test:

$$t = \frac{M_a - M_b}{\sqrt{MS_{\text{within}} \left( \dfrac{1}{n_a} + \dfrac{1}{n_b} \right)}}. \tag{13.18}$$

A contrast can also combine more than two group means. For example, with four groups, the null hypothesis might be

$$H_0: (+1)\mu_1 + \left(-\tfrac{1}{3}\right)\mu_2 + \left(-\tfrac{1}{3}\right)\mu_3 + \left(-\tfrac{1}{3}\right)\mu_4 = 0.$$

In words, this null hypothesis says that when we combine the means using certain weights (such as +1, –1/3, –1/3, and –1/3), the resulting composite is predicted to have a value of 0. This is equivalent to saying that the mean outcome averaged or combined across Groups 2 to 4 (which received three different types of medication) is equal to the mean outcome in Group 1 (which received no medication). Weights that define a contrast among group means are called contrast coefficients. Usually, contrast coefficients are constrained to sum to 0, and the coefficients themselves are usually given as integers for reasons of simplicity. If we multiply this set of contrast coefficients by 3 (to get rid of the fractions), we obtain the following set of contrast coefficients that can be used to see if the combined mean of Groups 2 to 4 differs from the mean of Group 1 (+3, –1, –1, –1). If we reverse the signs, we obtain the set (–3, +1, +1, +1), which still corresponds to the same contrast. The F test for a contrast detects the magnitude, and not the direction, of differences among group means; therefore, it does not matter if the signs on a set of contrast coefficients are reversed.

In SPSS, users can select an option that allows them to enter a set of contrast coefficients to make many different types of comparisons among group means. To see some possible contrasts, imagine a situation in which there are k = 5 groups. This set of contrast coefficients simply compares the means of Groups 1 and 5 (ignoring all the other groups): (+1, 0, 0, 0, –1). This set of contrast coefficients compares the combined mean of Groups 1 to 4 with the mean of Group 5: (+1, +1, +1, +1, –4). Contrast coefficients can be used to test for specific patterns, such as a linear trend (scores on the outcome variable might tend to increase linearly if Groups 1 through 5 correspond to equally spaced dosage levels of a drug): (–2, –1, 0, +1, +2).

A curvilinear trend can also be tested; for instance, the researcher might expect to find that the highest scores on the outcome variable occur at moderate dosage levels of the independent variable. If the five groups received five equally spaced different levels of background noise and the researcher predicts the best task performance at a moderate level of noise, an appropriate set of contrast coefficients would be (–1, 0, +2, 0, –1). When a user specifies contrast coefficients, it is necessary to have one coefficient for each level or group in the ANOVA; if there are k groups, each contrast that is specified must include k coefficients. A user may specify more than one set of contrasts, although usually the number of contrasts does not exceed k – 1 (where k is the number of groups). The following simple guidelines are usually sufficient to understand what comparisons a given set of coefficients makes:

1. Groups with positive coefficients are compared with groups with negative coefficients; groups that have coefficients of 0 are omitted from such comparisons.

2. It does not matter which groups have positive versus negative coefficients; a difference can be detected by the contrast analysis whether or not the coefficients code for it in the direction of the difference.

3. For contrast coefficients that represent trends, if you draw a graph that shows how the contrast coefficients change as a function of group number (X), the line shows pictorially what type of trend the contrast coefficients will detect. Thus, if you plot the coefficients (–2, –1, 0, +1, +2) as a function of the group numbers 1, 2, 3, 4, and 5, you can see that these coefficients test for a linear trend. The test will detect a linear trend whether it takes the form of an increase or a decrease in mean Y values across groups.
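As a rough illustration of these guidelines, the following Python sketch computes the value of a contrast and a t ratio for it from raw scores. It assumes the usual equal-variance form of the contrast t test (the weighted sum of group means divided by the square root of MS_within times the sum of c²/n). The function name `contrast_t` and the scores are hypothetical; they are not data from this chapter.

```python
from math import sqrt

def contrast_t(groups, coeffs):
    """Return (contrast value, t ratio, df) for one planned contrast.

    Assumes homogeneity of variance, so MS_within is the error term.
    """
    if len(groups) != len(coeffs):
        raise ValueError("need one coefficient per group")
    if abs(sum(coeffs)) > 1e-9:
        raise ValueError("contrast coefficients should sum to 0")

    means = [sum(g) / len(g) for g in groups]

    # Pooled within-group variance (MS_within) from the one-way ANOVA.
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    df_within = sum(len(g) for g in groups) - len(groups)
    ms_within = ss_within / df_within

    # Weighted linear composite of the group means.
    value = sum(c * m for c, m in zip(coeffs, means))
    se = sqrt(ms_within * sum(c ** 2 / len(g) for c, g in zip(coeffs, groups)))
    return value, value / se, df_within

# Hypothetical scores for k = 4 groups (illustration only).
groups = [[2, 3, 4], [5, 6, 7], [6, 7, 8], [8, 9, 10]]

value, t, df = contrast_t(groups, [3, -1, -1, -1])
value_rev, t_rev, _ = contrast_t(groups, [-3, 1, 1, 1])

print(value, round(t, 3), df)
print(value_rev, round(t_rev, 3))
```

Reversing the signs of the coefficients flips the sign of t but leaves its magnitude unchanged, which is the point of the second guideline.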

When a researcher uses more than one set of contrasts, he or she may want to know whether those contrasts are logically independent, uncorrelated, or orthogonal. There is an easy way to check whether the contrasts implied by two sets of contrast coefficients are orthogonal or independent. Essentially, to check for orthogonality, you just compute a (shortcut) version of a correlation between the two lists of coefficients. First, you list the coefficients for Contrasts 1 and 2 (make sure that each set of coefficients sums to 0, or this shortcut will not produce valid results).

Contrast 1: (–2, –1, 0, +1, +2)
Contrast 2: (+1, –1, 0, 0, 0)

You cross-multiply each pair of corresponding coefficients (i.e., the coefficients that are applied to the same group) and then sum these cross products. In this example, you get

(–2)(+1) + (–1)(–1) + (0)(0) + (+1)(0) + (+2)(0) = –2 + 1 + 0 + 0 + 0 = –1.

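In this case, the sum of the cross products is –1. This means that the two contrasts above are not independent or orthogonal; some of the information that they contain about differences among means is redundant. Consider a second example that illustrates a situation in which the two contrasts are orthogonal or independent:

Contrast 1: (–2, –1, 0, +1, +2)
Contrast 2: (–1, 0, +2, 0, –1)

Here the sum of the cross products is (–2)(–1) + (–1)(0) + (0)(+2) + (+1)(0) + (+2)(–1) = 2 + 0 + 0 + 0 – 2 = 0.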
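The cross-product check can be sketched in a few lines of Python. The function name `cross_product_sum` is hypothetical; the three coefficient sets are the pairwise, linear, and curvilinear contrasts for k = 5 groups that appear in this section, and equal n's are assumed.

```python
def cross_product_sum(c1, c2):
    """Sum of cross products for two sets of contrast coefficients.

    A result of 0 indicates orthogonal contrasts (equal n's assumed).
    """
    if len(c1) != len(c2):
        raise ValueError("each contrast needs one coefficient per group")
    if sum(c1) != 0 or sum(c2) != 0:
        raise ValueError("each set of coefficients must sum to 0")
    return sum(a * b for a, b in zip(c1, c2))

linear = [-2, -1, 0, 1, 2]       # linear trend across Groups 1 to 5
pairwise = [1, -1, 0, 0, 0]      # Group 1 versus Group 2
curvilinear = [-1, 0, 2, 0, -1]  # peak at the middle group

not_orthogonal = cross_product_sum(linear, pairwise)
orthogonal = cross_product_sum(linear, curvilinear)

print(not_orthogonal)  # nonzero: the contrasts carry redundant information
print(orthogonal)      # zero: the contrasts are orthogonal
```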

In this second example, the curvilinear contrast is orthogonal to the linear trend contrast. In a one-way ANOVA with k groups, it is possible to have up to (k – 1) orthogonal contrasts. The preceding discussion of contrast coefficients assumed that the groups in the one-way ANOVA had equal n’s. When the n’s in the groups are unequal, it is necessary to adjust the values of the contrast coefficients so that they take unequal group size into account; this is done automatically in programs such as SPSS.

13.11 POST HOC OR “PROTECTED” TESTS

If the researcher wants to make all possible comparisons among groups or does not have a theoretical basis for choosing a limited number of comparisons before looking at the data, it is possible to use test procedures that limit the risk for Type I error by using “protected” tests. Protected tests use a more stringent criterion than would be used for planned contrasts in judging whether any given pair of means differs significantly. One method for setting a more stringent test criterion is the Bonferroni procedure, described in Chapter 10. The Bonferroni procedure requires that the data analyst use a more conservative (smaller) alpha level to judge whether each individual comparison between group means is statistically significant. For instance, in a one-way ANOVA with k = 5 groups, there are k × (k – 1)/2 = 10 possible pairwise comparisons of group means. If the researcher wants to limit the overall experiment-wise risk for Type I error (EWα) for the entire set of 10 comparisons to .05, one possible way to achieve this is to set the per-comparison alpha (PCα) level for each individual significance test between means at EWα/(number of post hoc tests to be performed). For example, if the experimenter wants an experiment-wise

α of .05 when doing 10 post hoc comparisons between groups, the alpha level for each individual test would be set at EWα/10, or .05/10, or .005 for each individual test. The t test could be calculated using the same formula as for an ordinary t test, but it would be judged significant only if its obtained p value were less than .005. The Bonferroni procedure is extremely conservative, and many researchers prefer less conservative methods of limiting the risk for Type I error. (One way to make the Bonferroni procedure less conservative is to set the experiment-wise alpha to some higher value, such as .10.)

Dozens of post hoc or protected tests have been developed to make comparisons among means in ANOVA that were not predicted in advance. Some of these procedures are intended for use with a limited number of comparisons; other tests are used to make all possible pairwise comparisons among group means. Some of the better known post hoc tests include the Scheffé test, the Newman-Keuls test, and the Tukey HSD test. The Tukey HSD test has become popular because it is moderately conservative and easy to apply; it can be used to perform all possible pairwise comparisons of means and is available as an option in widely used computer programs such as SPSS. The menu for the SPSS one-way ANOVA procedure includes the Tukey HSD test as one of many options for post hoc tests; SPSS calls it the Tukey procedure.

The Tukey HSD test (and several similar post hoc tests) uses a different method of limiting the risk for Type I error. Essentially, the Tukey HSD test uses the same formula as a t ratio, but the resulting test ratio is labeled q rather than t, to remind the user that it should be evaluated using a different sampling distribution. The Tukey HSD test and several related post hoc tests use critical values from a distribution called the “Studentized range statistic,” and the test ratio is often denoted by the letter q:

$$q = \frac{M_a - M_b}{\sqrt{MS_{\text{within}} / n}} \tag{13.19}$$

where a and b denote any two groups, MS_within is the within-group mean square from the ANOVA, and n is the number of participants per group. Values of the q ratio are compared with critical values from tables of the Studentized range statistic (see the table in Appendix F at the end of the book). The Studentized range statistic is essentially a modified version of the t distribution. Like t, its distribution depends on the numbers of subjects within groups, but the shape of this distribution also depends on k, the number of groups. As the number of groups (k) increases, the number of pairwise comparisons also increases. To protect against inflated risk for Type I error, larger differences between group means are required for rejection of the null hypothesis as k increases. The distribution of the Studentized range statistic is broader and flatter than the t distribution and has thicker tails; thus, when it is used to look up critical values of q that cut off the most extreme 5% of the area in the upper and lower tails, the critical values of q are larger than the corresponding critical values of t.

This formula for the Tukey HSD test could be applied by computing a q ratio for each pair of sample means and then checking to see if the obtained q for each comparison exceeded the critical value of q from the table of the Studentized range statistic. However, in practice, a computational shortcut is often preferred. The formula is rearranged so that the cutoff for judging a difference between groups to be statistically significant is given in terms of differences between means rather than in terms of values of a q ratio.

$$HSD = q_{\text{critical}} \times \sqrt{\frac{MS_{\text{within}}}{n}} \tag{13.20}$$

Then, if the obtained difference between any pair of means (such as Ma – Mb) is greater in absolute value than this HSD, this difference between means is judged statistically significant. An HSD criterion is computed by looking up the appropriate critical value of q, the Studentized range statistic, from a table of this distribution (see the table in Appendix F). The critical q value is a function of both n, the average number of subjects per group, and k, the number of groups in the overall one-way ANOVA. As in other test situations, most researchers use the critical value of q that corresponds to α = .05, two tailed. This critical q value obtained from the table is multiplied by the error term to yield HSD. This HSD is used as the criterion to judge each obtained difference between sample means. The researcher then computes the absolute value of the difference between each pair of group means (M1 – M2), (M1 – M3), and so forth. If the absolute value of a difference between group means exceeds the HSD value just calculated, then that pair of group means is judged to be significantly different.

When a Tukey HSD test is requested from SPSS, SPSS provides a summary table that shows all possible pairwise comparisons of group means and reports whether each of these comparisons is significant. If the overall F for the one-way ANOVA is statistically significant, it implies that there should be at least one significant contrast among group means. However, it is possible to have situations in which a significant overall F is followed by a set of post hoc tests that do not reveal any significant differences among means. This can happen because protected post hoc tests are somewhat more conservative than the overall one-way ANOVA and thus require slightly larger between-group differences as a basis for a decision that differences are statistically significant.
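The two procedures in this section can be sketched in Python as follows. This is a minimal illustration, not a replacement for the SPSS output: the function names `bonferroni_alpha` and `tukey_hsd` are hypothetical, the group means and MS_within are made up, and `q_crit=3.5` is a placeholder standing in for a value that would be looked up in a table of the Studentized range statistic.

```python
from itertools import combinations
from math import sqrt

def bonferroni_alpha(k_groups, alpha_ew=0.05):
    """Number of pairwise comparisons and the per-comparison alpha."""
    n_comparisons = k_groups * (k_groups - 1) // 2
    return n_comparisons, alpha_ew / n_comparisons

def tukey_hsd(means, ms_within, n_per_group, q_crit):
    """HSD criterion and a significance flag for every pair of means.

    q_crit must come from a Studentized range table for the chosen
    alpha, the number of groups, and the within-group df.
    """
    hsd = q_crit * sqrt(ms_within / n_per_group)
    results = {}
    for a, b in combinations(sorted(means), 2):
        diff = means[a] - means[b]
        results[(a, b)] = (diff, abs(diff) > hsd)
    return hsd, results

# Bonferroni: k = 5 groups, experiment-wise alpha of .05.
n_comp, alpha_pc = bonferroni_alpha(5)
print(n_comp, alpha_pc)

# Tukey HSD with hypothetical means, MS_within, n, and placeholder q.
means = {"A": 10.0, "B": 11.0, "C": 15.0}
hsd, results = tukey_hsd(means, ms_within=4.0, n_per_group=16, q_crit=3.5)
print(hsd)
for pair, (diff, significant) in results.items():
    print(pair, diff, significant)
```

Expressing the cutoff as a difference between means, rather than as a q ratio, lets one criterion be reused for every pair when group sizes are equal.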
13.12 ONE-WAY BETWEEN-S ANOVA IN SPSS

To run the one-way between-S ANOVA procedure in SPSS, make the following menu selections from the menu bar at the top of the Data View worksheet, as shown in Figure 13.3: <Analyze> → <Compare Means> → <One-Way ANOVA>. This opens the dialog box in Figure 13.4. Enter the name of one (or several) dependent variables into the pane labeled “Dependent List”; enter the name of the categorical variable that provides group membership information into the box labeled “Factor.” For this example, additional windows were accessed by clicking on the buttons marked Post Hoc, Contrasts, and Options. The screenshots that correspond to this series of dialog boxes appear in Figures 13.4 through 13.7.

Figure 13.3 SPSS Menu Selections for One-Way Between-S ANOVA

Figure 13.4 One-Way ANOVA Dialog Box

Figure 13.5 One-Way ANOVA: Post Hoc Multiple Comparisons Dialog Box

Figure 13.6 Specification of a Planned Contrast

The null hypothesis about a weighted linear composite of means that is represented by this set of contrast coefficients:

$$H_0: (+3)\mu_1 + (-1)\mu_2 + (-1)\mu_3 + (-1)\mu_4 = 0$$

or

$$H_0: 3\mu_1 - \mu_2 - \mu_3 - \mu_4 = 0$$

or

$$H_0: \mu_1 = \frac{\mu_2 + \mu_3 + \mu_4}{3}$$

From the menu of post hoc tests, this example uses the one SPSS calls “Tukey” (this corresponds to the Tukey HSD test). To define a contrast that compares the mean of Group 1 (no stress) with the mean of the three stress treatment groups combined, these contrast coefficients are entered one at a time: +3, –1, –1, –1. From the list of options, “Descriptive” statistics and “Homogeneity of variance test” were selected by placing checks in the boxes next to the names of these tests.

Figure 13.7 One-Way ANOVA: Options Dialog Box

13.13 OUTPUT FROM SPSS FOR ONE-WAY BETWEEN-S ANOVA

The output for this one-way ANOVA is reported in Figure 13.8. The first panel provides descriptive information about each of the groups: mean, standard deviation, n, a 95% CI for the mean, and so forth. The second panel shows the results for the Levene test of the homogeneity of variance assumption; this is an F ratio with (k – 1) and (N – k) df. The obtained F was not significant for this example; there was no evidence that the homogeneity of variance assumption had been violated. The third panel shows the ANOVA source table with the overall F; this was statistically significant, and this implies that there was at least one significant contrast between group means. In practice, a researcher would not report both planned contrasts and post hoc tests; however, both were presented for this example to show how they are obtained and reported.

Figure 13.9 shows the output for the planned contrast that was specified by entering these contrast coefficients: (+3, –1, –1, –1). These contrast coefficients correspond to a test of the null hypothesis that the mean anxiety of the no-stress group (Group 1) was not significantly different from the mean anxiety of the three stress intervention groups (Groups 2–4) combined. SPSS reported a t test for this contrast (some textbooks and programs use an F test). This t test was statistically significant, and examination of the group means indicated that the mean anxiety level was significantly higher for the three stress intervention groups combined, compared with the control group.
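To make the planned contrast concrete, the short Python sketch below applies the coefficients (+3, –1, –1, –1) to the four group means reported in the Results section of this chapter. The variable names are hypothetical, and only the means are used; the standard error needed for the t test depends on MS_within, which comes from the ANOVA source table in Figure 13.8 and is not reproduced here.

```python
# Group means as reported in the Results section of this chapter.
means = {
    "no stress": 9.86,
    "mental arithmetic": 14.29,
    "stress role play": 13.57,
    "mock job interview": 17.00,
}

# Contrast coefficients entered in SPSS, one per group.
coeffs = {
    "no stress": 3,
    "mental arithmetic": -1,
    "stress role play": -1,
    "mock job interview": -1,
}

# Value of the weighted linear composite of means.
contrast_value = sum(coeffs[g] * means[g] for g in means)

# Equivalent comparison: Group 1 versus the average of Groups 2 to 4.
stress_groups = [g for g in means if g != "no stress"]
combined_mean = sum(means[g] for g in stress_groups) / len(stress_groups)
difference = means["no stress"] - combined_mean

print(round(contrast_value, 2))  # negative: Group 1 is below the others
print(round(combined_mean, 2))
print(round(difference, 2))
```

Dividing the contrast value by 3 gives the difference between the no-stress mean and the combined mean, which is why multiplying the coefficients by a constant does not change what is being compared.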

Figure 13.8 SPSS Output for One-Way ANOVA

Figure 13.9 SPSS Output for Planned Contrasts

Figure 13.10 shows the results for the Tukey HSD tests that compared all possible pairs of group means. The table “Multiple Comparisons” gives the difference between means for all possible pairs of means (note that each comparison appears twice; that is, Group a is compared with Group b, and in another row, Group b is compared with Group a). Examination of the “Sig.” or p values indicates that several of the pairwise comparisons were significant at the .05 level. The results are displayed in a more easily readable form in the last panel under the heading “Homogeneous Subsets.” Each subset consists of group means that were not significantly different from one another using the Tukey test. The no-stress group was in a subset by itself; in other words, it had significantly lower mean anxiety than any of the three stress intervention groups. The second subset consisted of the stress role play and mental arithmetic groups, which did not differ significantly in anxiety. The third subset consisted of the mental arithmetic and mock job interview groups.

Figure 13.10 SPSS Output for Post Hoc Test (Tukey HSD)

Note that it is possible for a group to belong to more than one subset; the anxiety score for the mental arithmetic group was not significantly different from the stress role play or the mock job interview groups. However, because the stress role play group differed significantly from the mock job interview group, these three groups did not form one subset. Note also that it is possible for all the Tukey HSD comparisons to be nonsignificant even when the overall F for the one-way ANOVA is statistically significant. This can happen because the Tukey HSD test requires a slightly larger difference between means to achieve significance.

In this imaginary example, as in some research studies, the outcome measure (anxiety) is not a standardized test for which we have norms. The numbers by themselves do not tell us whether the mock job interview participants were moderately anxious or twitching, stuttering wrecks. Studies that use standardized measures can make comparisons with test norms to help readers understand whether the group differences were large enough to be of clinical or practical importance. Alternatively, qualitative data about the behavior of participants can also help readers understand how substantial the group differences were.

Figure 13.11 Bar Chart for Group Means With 95% Confidence Intervals

SPSS one-way ANOVA does not provide an effect size measure, but this can easily be calculated by hand. In this case, eta squared is found by taking the ratio SSbetween/SStotal from the ANOVA source table: η² = .60.
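The SS values from Figure 13.8 are not reproduced in the text, but eta squared can also be recovered from the overall F ratio and its degrees of freedom, because F = (SSbetween/dfbetween)/(SSwithin/dfwithin). The Python sketch below is a minimal illustration under that identity; the function names are hypothetical, and the F and df values are the ones reported in the Results section of this chapter.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of variance in Y predictable from group membership."""
    return ss_between / ss_total

def eta_squared_from_f(f_ratio, df_between, df_within):
    """Same quantity, rewritten in terms of F and its df.

    Follows from SS_between / SS_within = F * df_between / df_within.
    """
    return (f_ratio * df_between) / (f_ratio * df_between + df_within)

# Overall ANOVA reported in this chapter: F(3, 24) = 11.94.
eta2 = eta_squared_from_f(11.94, df_between=3, df_within=24)
print(round(eta2, 2))
```

The rounded result agrees with the η² = .60 obtained from the source table.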

Graphs used to represent group means with confidence intervals for one-way ANOVA are like those used in Chapter 12, on the independent-samples t test; menu selections are not repeated here. A bar chart with 95% CI error bars appears in Figure 13.11.

13.14 REPORTING RESULTS FROM ONE-WAY BETWEEN-S ANOVA

Following is an example of a “Results” section for the one-way between-S ANOVA in the study of anxiety and stress.

Results

A one-way between-S ANOVA was done to compare the mean scores on an anxiety scale (0 = not at all anxious, 20 = extremely anxious) for participants who were randomly assigned to one of four groups: Group 1, control group/no stress; Group 2, mental arithmetic; Group 3, stressful role play; and Group 4, mock job interview. Examination of a histogram of anxiety scores indicated that the scores were approximately normally distributed with no extreme outliers. Prior to the analysis, the Levene test for homogeneity of variance was used to examine whether there were serious violations of the homogeneity of variance assumption across groups, but no significant violation was found, F(3, 24) = .718, p = .72. The overall F for the one-way ANOVA was statistically significant, F(3, 24) = 11.94, p < .001. This corresponded to an effect size of η² = .60; about 60% of the variance in anxiety scores was predictable from the type of stress intervention. This is a large effect. The means and standard deviations for the four groups are shown in Table 13.6.

Table 13.6 Mean Anxiety Scores Across Types of Stress

One planned contrast (comparing the mean of Group 1, no stress, with the combined means of Groups 2–4, the stress intervention groups) was performed. This contrast was tested using α = .05, two tailed; the t test that assumed equal variances was used because the homogeneity of variance assumption was not violated. For this contrast, t(24) = –5.18, p < .001. The mean anxiety score for the no-stress group (M = 9.86) was significantly lower than the mean anxiety score for the three combined stress intervention groups (M = 14.95). In addition, all possible pairwise comparisons were made using the Tukey HSD test. On the basis of this test (using α = .05), it was found that the no-stress group scored significantly lower on anxiety than all three stress intervention groups. The stressful role play (M = 13.57) was significantly less anxiety producing than the mock job interview (M = 17.00). The mental arithmetic task produced a mean level of anxiety (M = 14.29) that was intermediate between the other stress conditions, and it did not differ significantly from either the stress role play or the mock job interview. Overall, the mock job interview produced the highest levels of anxiety. Figure 13.11 shows a bar chart of group means with 95% CIs.

Data analysts usually just report one type of follow-up analysis, either post hoc tests or planned contrasts, but not both.

13.15 ISSUES IN PLANNING A STUDY

When an experiment is designed to compare treatment groups, the researcher needs to decide how many groups to include, how many participants to include in each group, how to assign participants to groups, and what types or dosages of treatments to administer to each group. In an experiment, the levels of the factor may correspond to different dosage levels of the same treatment variable (such as 0 mg caffeine, 100 mg caffeine, 200 mg caffeine) or to qualitatively different types of interventions (as in the stress study, where the groups received, respectively, no stress, mental arithmetic, role play, or mock job interview stress interventions). In addition, it is necessary to think about issues of experimental control, as described in research methods textbooks; for example, in experimental designs, researchers need to avoid confounds of other variables with the treatment variable. Research methods textbooks (e.g., Cozby & Bates, 2017) provide more detailed discussion of design issues; a few guidelines are listed here.

a) If the treatment variable has a curvilinear relation to the outcome variable, it is necessary to have a sufficient number of groups to describe this relation accurately (at least 3 groups). On the other hand, it may not be practical or affordable to have a very large number of groups; if an absolute minimum n of 10 (or, better, 30) participants per group are included, then a study that included 15 groups would require a total N of 150 (or 450) participants.

b) In studies of interventions that may create expectancy effects, it is necessary to include one or several kinds of placebo and/or no-treatment/control groups for comparison. For instance, studies of the effect of biofeedback on heart rate (HR) sometimes include a group that gets real biofeedback (a tone is turned on when HR increases), a group that gets noncontingent feedback (a tone is turned on and off at random in a way that is unrelated to HR), and a group that receives instructions for relaxation and sits quietly in the lab without any feedback (Burish, 1981).

c) Random assignment of participants to conditions is desirable to try to ensure equivalence of the groups prior to treatment. For example, in a study where HR is the dependent variable, it would be desirable to have participants whose HRs were equal across all groups prior to the administration of any treatments. However, it cannot be assumed that random assignment will always succeed in creating equivalent groups. In studies where equivalence among groups prior to treatment is in doubt, the researcher should collect data on participant characteristics and compare groups prior to treatment to verify whether the groups are equivalent.

d) It is crucial to make sure that no other variable is confounded with the treatment variable; the presence of a confound makes differences among group means uninterpretable.

The same factors that affect the size of the t ratio also affect the sizes of F ratios: distances among group means, the amount of variability of scores within each group, and the number, n, of participants per group. Other things being equal, an F ratio tends to be larger when there are large between-group differences among dosage levels (or participant characteristics). As an example, a study that used three different noise levels in decibels as the treatment variable, for example, 35, 65, and 95 dB, would result in larger differences in group means on arousal and a larger F ratio than a study that looked, for example, at these three noise levels, which are closer together: 60, 65, and 70 dB. Like the t test, the F ratio involves a comparison of between-group and within-group variability; the selection of homogeneous participants, standardization of testing conditions, and control over extraneous variables will tend to reduce the magnitude of within-group variability of scores, which in turn tends to produce a larger F (or t) ratio. A researcher can often increase the size of an F (or t) ratio by increasing the differences in the dosage levels of treatments given to groups and/or reducing the effects of extraneous variables through experimental control, and/or increasing the number of participants. In studies that involve comparisons of naturally occurring groups, such as age groups, a similar principle applies: A researcher is more likely to see age-related changes in mental processing speed in a study that compares ages 20, 50, and 80 than in a study that compares ages 20, 25, and 30.

13.16 SUMMARY

One-way between-S ANOVA provides a method for comparison of more than two group means. However, the overall F test for the ANOVA does not provide enough information to completely describe the pattern in the data. It is often necessary to perform additional comparisons among specific group means to provide a complete description of the pattern of differences among group means.
These can be a priori (also called planned contrast) comparisons if a limited number of differences are predicted in advance and a small number of significance tests are performed. If the researcher did not make predictions in advance about differences between group means, then he or she may use protected or post hoc tests to do any follow-up comparisons; the Bonferroni procedure and the Tukey HSD test were described here, and many other post hoc procedures are available.

The most important concept from this chapter is the idea that a score can be divided into components (one part that is related to group membership or treatment effects and a second part that is due to the effects of all other “extraneous” variables that uniquely influence individual participants). Information about the relative sizes of these components can be summarized across all the participants in a study by computing the sum of squared deviations (SS) for the between-group and within-group deviations. On the basis of the SS values, it is possible to compute an effect size estimate (η²) that describes the proportion of variance predictable from group membership (or treatment variables) in the study. Researchers usually hope to design their studies in a manner that makes the proportion of explained variance reasonably high and that produces statistically significant differences among group means. However, researchers should remember that the proportion of variance due to group differences in the artificial world of research may not correspond to the “true” strength of the influence of the variable out in the “real world.” In experiments, we create an artificial world by holding some variables constant and by manipulating the treatment variable; in nonexperimental research, we create an artificial world through our selection of participants and measures. Research results should be interpreted and generalized cautiously.

APPENDIX 13A: ANOVA MODEL AND DIVISION OF SCORES INTO COMPONENTS

When we do a one-way ANOVA, the analysis involves partition of each score into two components: a component of the score that is associated with group membership and a component of the score that is not associated with group membership. The sums of squares (SS) summarize information about the magnitudes of these components across all scores, and a ratio of SS terms will estimate what proportion of variance in scores is associated with type or amount of treatment or other characteristics that differ across groups. Examining proportion of variance associated with treatment group membership is one way of approaching the general research question: Why is there variance in the scores on the outcome variable?

Table 13.7 Partition of Heart Rate Scores in Hypothetical Sex Differences in Heart Rate Data

To illustrate the partition of scores into components, consider the hypothetical data in Table 13.7. Suppose we measure HR for each of six persons in a small sample and obtain the following set of scores: 81, 85, and 92 for the three female participants and 70, 63, and 68 for the three male participants. In this example, the X group membership variable is sex, coded 1 = female and 2 = male, and the Y quantitative outcome variable is HR. We can partition each HR score into a component that is related to group membership (i.e., related to sex) and a component that is not related to sex group membership. We can call the factor that represents sex group membership Factor A. We will denote the HR score for Person j in Group i as Yij. For example, Cathy’s HR of 92 corresponds to Y13, the score for the third person in Group 1. We will denote the grand mean (i.e., the mean HR for all N = 6 persons in this data set) as MY; for this set of scores, the value of MY is 76.5. We will denote the mean for Group i as Mi. For example, the mean HR for the female group (Group 1) is M1 = 86; the mean HR for the male group (Group 2) is M2 = 67.

Once we know the individual scores, the grand mean, and the group means, we can work out a partition of each individual HR score into two components. The question we are trying to answer is, Why is there variance in HR? In other words, why do some people in this sample have HRs that are higher than the sample mean of 76.5, while others have HRs that are lower than 76.5? Is sex one of the variables that predicts whether a person’s HR will be relatively high or low? For each person, we can compute a deviation of that person’s individual Yij score from the grand mean; this tells us how far above (or below) the grand mean of HR each person’s score was. Note that in Table 13.7, the value of the grand mean is the same for all six participants. The value of the group mean was different for members of Group i = 1 (women) than for members of Group i = 2 (men). We can use these means to compute the following three deviations from means for each person:

$$\text{DevGrand} = Y_{ij} - M_Y \tag{13.21}$$

$$\text{DevGroup} = Y_{ij} - M_i \tag{13.22}$$

$$\text{Effect} = M_i - M_Y \tag{13.23}$$

The total deviation of the individual score from the grand mean, DevGrand, can be divided into two components: the deviation of the individual score from the group mean, DevGroup, and the deviation of the group mean from the grand mean, effect.

$$(Y_{ij} - M_Y) = (Y_{ij} - M_i) + (M_i - M_Y) \tag{13.24}$$

For example, look at the row of data for Cathy in Table 13.7. Cathy’s observed HR was 92. Cathy’s HR has a total deviation from the grand mean of (92 – 76.5) = +15.5; that is, Cathy’s HR was 15.5 beats per minute higher than the grand mean for this set of data. Compared with the group mean HR for the female group, Cathy’s HR corresponds to a deviation of (92 – 86) = +6; that is, Cathy’s HR was 6 beats per minute higher than the mean HR for the female group. Finally, the “effect” component of Cathy’s score is found by subtracting the grand mean (MY) from the mean of the group to which Cathy belongs (Mfemale or M1): (86 – 76.5) = +9.5. This value of 9.5 tells us that the mean HR for the female group (86) was 9.5 beats per minute higher than the mean HR for the entire sample (and this effect is the same for all members within each group). We could, therefore, say that the “effect” of membership in the female group is to increase the predicted HR score by 9.5.
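The partition can be reproduced for all six persons with a short Python sketch. It uses the six HR scores given in the text; the variable names and the tuple layout of `rows` are hypothetical choices for illustration.

```python
# HR scores from the text: Group 1 = female, Group 2 = male.
groups = {1: [81, 85, 92], 2: [70, 63, 68]}

all_scores = [y for scores in groups.values() for y in scores]
grand_mean = sum(all_scores) / len(all_scores)

rows = []
for i, scores in groups.items():
    group_mean = sum(scores) / len(scores)
    effect = group_mean - grand_mean          # same for everyone in Group i
    for j, y in enumerate(scores, start=1):
        dev_grand = y - grand_mean            # total deviation
        dev_group = y - group_mean            # within-group deviation
        rows.append((i, j, y, dev_grand, dev_group, effect))

print("grand mean:", grand_mean)
for row in rows:
    i, j, y, dev_grand, dev_group, effect = row
    # The total deviation is the sum of the two components.
    print(i, j, y, dev_grand, dev_group, effect, dev_grand == dev_group + effect)
```

The third row corresponds to Cathy (Group 1, Person 3) and reproduces the deviations of +15.5, +6, and +9.5 worked out in the text.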

Using this logic, we can represent Cathy’s HR score (or the score of any individual) using this sum:

$$Y_{ij} = M_Y + (M_i - M_Y) + e_{ij} \tag{13.25}$$

where MY is the grand mean; (Mi – MY) is the difference between the group mean and grand mean for the group Cathy belongs to, in this instance, the effect of sex on HR; and eij is the deviation of Cathy’s HR from the mean of the group she belongs to. The specific numerical values for Cathy are 76.5 (grand mean of HR for study) + 9.5 (effect for Group i that represents sex difference in HR) + 6 (Cathy’s deviation from the group mean, her difference from the mean for the female group) = 92 (Cathy’s HR). If we are interested in showing that there is a sex difference in mean HR, we will want the effect for Group i (sex) to be large. This effect is the same for all women (+9.5) and all men (–9.5). Because this example involves comparison of naturally occurring groups, men versus women, we should not interpret a group difference as evidence of causality but merely as a description of group differences.

Another way to look at these numbers is to think about a predictive equation based on a theoretical model. When we do a one-way ANOVA, we seek to predict each individual score (Yij) on the quantitative outcome variable from the following theoretical components: the population mean (denoted by μ), the “effect” for Group i (often denoted by αi, because α is the Greek letter that corresponds to A, which is the name usually given to a single factor), and the residual associated with the score for Person j in Group i (denoted by εij). Estimates for μ, αi, and εij can be obtained from the sample data as follows:

$$\hat{\mu} = M_Y \tag{13.26}$$

$$\hat{\alpha}_i = M_i - M_Y \tag{13.27}$$

$$\hat{\varepsilon}_{ij} = Y_{ij} - M_i \tag{13.28}$$

Note that the sample deviations used to estimate αi and εij (in Equations 13.27 and 13.28) are the same components that appeared in Equation 13.24. An individual observed HR score Yij can be represented as a sum of these theoretical components, as follows:

$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij} \tag{13.29}$$
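As a numerical check on this model, the Python sketch below rebuilds each of the six HR scores from the sample estimates of μ, αi, and εij, and then sums the squared components to obtain SStotal, SSbetween, SSwithin, and η². The scores are the ones given in the text; the variable names are hypothetical.

```python
# HR scores from the text.
groups = {"female": [81, 85, 92], "male": [70, 63, 68]}

all_scores = [y for scores in groups.values() for y in scores]
mu_hat = sum(all_scores) / len(all_scores)      # estimate of mu (grand mean)

ss_total = sum((y - mu_hat) ** 2 for y in all_scores)
ss_between = 0.0
ss_within = 0.0
reconstructed = []

for name, scores in groups.items():
    group_mean = sum(scores) / len(scores)
    alpha_hat = group_mean - mu_hat             # estimate of the group effect
    for y in scores:
        eps_hat = y - group_mean                # estimate of the residual
        reconstructed.append(mu_hat + alpha_hat + eps_hat)
        ss_between += alpha_hat ** 2            # one term per person
        ss_within += eps_hat ** 2

eta2 = ss_between / ss_total

print(reconstructed)                 # matches the observed scores
print(ss_total, ss_between, ss_within)
print(round(eta2, 2))
```

Because every score is exactly the sum of its three components, the between-group and within-group sums of squares add up to the total sum of squares.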

In words, Equation 13.29 says that we can predict (or reconstruct) the observed score for Person j in Group i by taking the grand mean μ, adding the “effect” αi that is associated with membership in Group i, and, finally, adding the residual εij that tells us how much Individual j’s score differed from the mean of the group Person j belonged to. The terms in the formal Equation 13.29 correspond to deviations that can be computed from the sample data:

Other where the i subscript indicates which treatment group each person is in, and the j subscript numbers the people within each treatment group. In words, an individual person’s total devia on from the grand mean of Y is made up of the effect for the treatment group that the individual is in (αi) and a residual or predic on error unique to that person, εij. Each of these terms can be es mated from sample data as follows: Other The (Mi – MY) difference es mates the “effect” of treatment, αi; this difference tells us whether, on average, people who received Treatment i had scores higher or lower than the grand mean.

Other The (Yij – Mi) difference tells us how much Person j’s score differed from the mean of all persons who received Treatment i. These equa ons tell us that a total devia on of each individual Y score from the grand mean can be divided into two parts: the effect for the treatment for Group i + the residual for Person j. In other words, we can divide the total devia on of each person’s score from the grand mean into two components: αi, the part of the score that is associated with or predictable from group membership (in this case, sex), and εij, the part of the score that is not associated with or predictable from group membership (the part of the HR score that is due to all other variables that influence HR, such as smoking, anxiety, drug use, level of fitness, health, etc.). In words, then, Equa on 13.29 says that we can predict each person’s HR from the following informa on: Person j’s HR = grand mean + effect of Person j’s sex on HR + effects of all other variables that influence Person j’s HR, such as Person j’s anxiety, health, drug use, fitness, and anxiety. The collec ve influence of “all other variables” on scores on the outcome variable, HR in this example, is called the residual, or “error.” In most research situa ons, researchers hope that the components of scores that represent group differences (in this case, sex differences in HR) will be rela vely large and that the components of scores that represent within-group variability in

HR (in this case, differences among women and men in HR and thus differences due to all variables other than sex) will be rela vely small. When we do an ANOVA, we summarize the informa on about the sizes of these two devia ons or components (Yij – Mi) and (Mi – MY) across all the scores in the sample. We cannot summarize informa on just by summing these devia ons; recall that the sum of devia ons of scores from a sample mean always equals 0, so Σ(Yij – Mi) = 0 and Σ(Mi – MY) = 0. When we summarized informa on about distances of scores from a sample mean by compu ng a sample variance, we avoided this problem by squaring the devia ons from the mean prior to summing them. APPENDIX 13B: EXPECTED VALUE OF F WHEN H0 IS TRUE For the independent-samples t test, if H0 is true, we expect values of t to be very close to 0. For the F test, however, if H0 is true, we expect values of F to be close to 1. The popula on variances that are es mated by the F ra o of sample mean squares (on the basis of the algebra of expected mean squares) are as follows: Other

(13.30)

where is the popula on variance of the alpha group effects, that is, the amount of variance in the Y scores that is associated with or predictable from the group membership variable, and

is the popula on error variance, that is, the variance in scores that is due to all variables other than the group membership variable or manipulated treatment variable. Earlier, the null hypothesis for a one-way ANOVA was given as

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$$

An alternative way to state this null hypothesis is that all the αi effects are equal to 0 (and therefore equal to one another). Therefore, another form of the null hypothesis for ANOVA is as follows:

$$H_0: \sigma^2_\alpha = 0 \qquad (13.31)$$

It follows that if H0 is true, then the expected value of the F ratio is close to 1. If F is much greater than 1, we have evidence that $\sigma^2_\alpha$ may be larger than 0.

How is F distributed across thousands of samples? First, note that the MS terms in the F ratio cannot be negative, because they are based on sums of squared terms. (MSbetween can be 0 in rare cases where all the group means are exactly equal; MSwithin can be 0 in even rarer cases where all the scores within each group are equal.)

Sums of squared independent normal variables have a chi-square distribution; thus, each of the mean squares is distributed as a (scaled) chi-square variate. An F distribution is the distribution of a ratio of two independent chi-square variates, each divided by its degrees of freedom. Like chi-square, the graph of an F distribution has a lower tail that ends at 0, and it tends to be skewed with a long tail off to the right. (See Figure 13.2 for a graph of the distribution of F with df = 3 and 24.) We reject H0 for large F values (those that lie in the upper 5% in the right-hand tail). Thus, F is almost always treated as a one-tailed test. (It is possible to look at the lower tail of the F distribution to evaluate whether the obtained sample F is too small for it to be likely to have arisen by chance, but this is rarely done.)

Note that for the two-group situation, F is equivalent to t². Both F and t are ratios of an estimate of between-group differences to within-group differences; between-group differences are interpreted as being due primarily to the manipulated independent variable, while within-group variability is due to the effects of extraneous variables. Both t and F are interpretable as signal-to-noise ratios, where the “signal” is the effect of the manipulated independent variable; in most research situations, we hope that this term will be relatively large. The size of the signal is evaluated relative to the magnitude of “noise,” the variability due to all other extraneous variables; in most research situations, we hope that this will be relatively small. Thus, in most research situations, we hope for values of t or F that are large enough for the null hypothesis to be rejected. A significant F can be interpreted as evidence that the between-groups independent variable had a detectable effect on the outcome variable. In rare circumstances, researchers hope to affirm the null hypothesis, that is, to demonstrate that the independent variable has no detectable effect, but this claim is actually quite difficult to prove.
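The claim that F stays close to 1 when H0 is true can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: it draws four groups of seven scores from the same normal population (so df = 3 and 24, matching the figure cited above), and the helper names `one_way_f` and `simulate_null_f`, the seed, and the number of simulated samples are all hypothetical choices. The value 3.01 is the tabled .05 critical value of F for df = 3 and 24, rounded to two decimals.

```python
import random
from statistics import mean

def one_way_f(groups):
    """Return F = MS_between / MS_within for a list of lists of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = mean(x for g in groups for x in g)
    ss_between = ss_within = 0.0
    for g in groups:
        m = mean(g)
        ss_between += len(g) * (m - grand) ** 2
        ss_within += sum((x - m) ** 2 for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def simulate_null_f(n_sims=2000, k=4, n=7, seed=1):
    """Simulate F ratios when every group comes from the same population."""
    rng = random.Random(seed)
    return [
        one_way_f([[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)])
        for _ in range(n_sims)
    ]

null_fs = simulate_null_f()
mean_f = mean(null_fs)                                       # should be near 1
share_rejected = sum(f > 3.01 for f in null_fs) / len(null_fs)  # should be near .05
print(round(mean_f, 2), round(share_rejected, 3))
```

Under these assumptions the average simulated F should land a little above 1, no F can be negative, and roughly 5% of the simulated F ratios should exceed the critical value even though H0 is true in every sample.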
If we obtain an F ratio large enough to reject H0, what can we conclude? The alternative hypothesis is not that all group means differ from each other significantly but that there is at least one (and possibly more than one) significant difference between group means:

$$H_1: \mu_i \neq \mu_j \ \text{for at least one pair of groups } i \neq j \qquad (13.32)$$

A significant F tells us that there is probably at least one significant difference among group means; by itself, it does not tell us where that difference lies. It is necessary to do additional tests to identify the one or more significant differences. When the formal algebra of expected mean squares is applied to work out the population variances that are estimated by the sample values of MSwithin and MSbetween, the following results are obtained if the group n’s are equal (Winer, Brown, & Michels, 1991):

$$MS_{\text{within}} \ \text{estimates} \ \sigma^2_\varepsilon,$$

the population variance due to error (all other extraneous variables);

$$MS_{\text{between}} \ \text{estimates} \ \sigma^2_\varepsilon + n\sigma^2_\alpha,$$

where $\sigma^2_\alpha$ is the population variance due to the effects of membership in the naturally occurring group or to the manipulated treatment variable. Thus, when we look at F-ratio estimates, we obtain this comparison:

$$F = \frac{MS_{\text{between}}}{MS_{\text{within}}} \ \text{estimates} \ \frac{\sigma^2_\varepsilon + n\sigma^2_\alpha}{\sigma^2_\varepsilon}$$

The null hypothesis (that all the population means are equal) can be stated in a different form, which says the variance of the population means is zero:

$$H_0: \sigma^2_\alpha = 0$$

If H0 is true and $\sigma^2_\alpha = 0$, then the expected value of the F ratio is 1. Values of F that are substantially larger than 1, and that exceed the critical value of F from tables of the F distribution, are taken as evidence that this null hypothesis may be false.

APPENDIX 13C: COMPARISON OF ANOVA AND T TEST

The one-way between-S ANOVA is a generalization of the independent-samples t test. The t ratio provides a test of the null hypothesis that two population means are equal. For the independent-samples t test, the null hypothesis has the following form:

$$H_0: \mu_1 = \mu_2 \qquad (13.33)$$

For a one-way ANOVA with k groups, the null hypothesis is as follows:

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_k \qquad (13.34)$$

The computation of the independent-samples t test required that we find the following for each group: M, the sample mean; s, the sample standard deviation; and n, the number of scores in each group. For a one-way ANOVA, the same computations are performed; the only difference is that we have to obtain (and summarize) this information for k groups (instead of only two groups, as in the t test).

Table 13.8 summarizes the information included in the computation of the independent-samples t test and the one-way ANOVA to make it easier to see how similar these analyses are. For each analysis, we need to obtain information about differences between group means (for the t test, we compute M1 – M2; for the ANOVA, because we have more than two groups, we need to find the variance of the group means M1, M2, …, Mk). The variance among the k group means is called MSbetween; the formulas to compute this and other intermediate terms in ANOVA appear below. We need to obtain information about the amount of variability of scores within each group; for both the independent-samples t test and ANOVA, we can begin by computing an SS term that summarizes the squared distances of all the scores in each group from their group mean. We then convert that information into a summary about the amount of variability of scores within groups, summarized across all the groups (for the independent-samples t test, a summary of within-group score variability is provided by the pooled variance s²p; for ANOVA, a summary of within-group score variability is called MSwithin).

Table 13.8 Comparison Between the Independent-Samples t Test and One-Way Between-Subjects ANOVA
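The parallel between the two analyses can also be checked numerically. The following sketch uses made-up HR scores for two small groups (the data and the function names `pooled_t` and `one_way_anova` are hypothetical, not taken from the chapter) and assumes the pooled-variance form of the t test. With two groups it should show that the pooled variance equals MSwithin, that t² equals F, and that SStotal splits into SSbetween plus SSwithin.

```python
from math import isclose, sqrt
from statistics import mean, variance

def pooled_t(g1, g2):
    """Independent-samples t using pooled variance; returns (t, pooled variance)."""
    n1, n2 = len(g1), len(g2)
    sp2 = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
    t = (mean(g1) - mean(g2)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, sp2

def one_way_anova(groups):
    """Return SS terms, MS terms, and F for a list of groups of scores."""
    scores = [x for g in groups for x in g]
    grand = mean(scores)
    ss_between = ss_within = 0.0
    for g in groups:
        m = mean(g)
        ss_between += len(g) * (m - grand) ** 2    # squared (M_i - M_Y), once per score
        ss_within += sum((x - m) ** 2 for x in g)  # squared (Y_ij - M_i)
    ss_total = sum((x - grand) ** 2 for x in scores)
    ms_between = ss_between / (len(groups) - 1)
    ms_within = ss_within / (len(scores) - len(groups))
    return {"ss_total": ss_total, "ss_between": ss_between, "ss_within": ss_within,
            "ms_between": ms_between, "ms_within": ms_within,
            "f": ms_between / ms_within}

# Hypothetical HR scores, for illustration only.
women = [92, 86, 80, 90, 82]
men = [70, 64, 68, 62, 71]

t, sp2 = pooled_t(women, men)
result = one_way_anova([women, men])
print(isclose(t ** 2, result["f"]), isclose(sp2, result["ms_within"]))
```

The same `one_way_anova` function accepts any number of groups, which is the sense in which ANOVA generalizes the two-group t test.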

APPENDIX 13D: NONPARAMETRIC ALTERNATIVE TO ONE-WAY BETWEEN-S ANOVA: INDEPENDENT-SAMPLES KRUSKAL-WALLIS TEST

The Kruskal-Wallis test assumes at least ordinal level of measurement and independent observations; it can be used when there are more than two groups. It has the same assumption as the Mann-Whitney U test that the distributions of scores have the same shape in all samples that are compared. That limits the usefulness of this procedure. Means of ranks are compared across groups (scores are ranked within the entire sample, not separately for each group). The null hypothesis is that the overall locations of these distributions, including the mean rank, are equal across groups. Compared with one-way ANOVA, this analysis (based on ranks) is less sensitive to outliers.

The data in stress_anxiety.sav used in earlier examples were also used to demonstrate use of this test. The SPSS menu selections appear in Figure 13.12. These menu selections open the Nonparametric Tests: Two or More Independent Samples dialog box that appears in Figure 13.13. Note that this has three tabs (upper left-hand corner).

Under the “Settings” tab I made the radio button selection to “Automatically choose the tests based on the data.” Output appears in Figure 13.14. SPSS provides minimal information. In this example, p < .001 tells us that there is a statistically significant difference between the mean ranks of the two groups.
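To make the rank-based logic concrete, here is a minimal Python sketch of the Kruskal-Wallis H statistic. It is not the SPSS procedure: it uses the standard textbook formula H = 12 / (N(N + 1)) × Σ(Ri² / ni) – 3(N + 1), where Ri is the sum of ranks in Group i, and it omits the correction for ties. The data and the function names `average_ranks` and `kruskal_wallis_h` are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to N; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

def kruskal_wallis_h(groups):
    """Return (H without tie correction, list of mean ranks per group)."""
    scores = [x for g in groups for x in g]
    ranks = average_ranks(scores)  # ranked within the entire sample
    n_total = len(scores)
    weighted_sum, mean_ranks, pos = 0.0, [], 0
    for g in groups:
        group_ranks = ranks[pos:pos + len(g)]
        pos += len(g)
        weighted_sum += sum(group_ranks) ** 2 / len(g)
        mean_ranks.append(sum(group_ranks) / len(g))
    h = 12.0 / (n_total * (n_total + 1)) * weighted_sum - 3 * (n_total + 1)
    return h, mean_ranks

# Hypothetical scores for three groups; the second call adds an extreme outlier.
h, mean_ranks = kruskal_wallis_h([[12, 15, 11], [18, 21, 17], [25, 30, 27]])
h_outlier, _ = kruskal_wallis_h([[12, 15, 11], [18, 21, 17], [25, 300, 27]])
print(round(h, 2), mean_ranks)  # 7.2 [2.0, 5.0, 8.0]
```

Replacing 30 with 300 leaves H unchanged because only the rank order matters, which is why the test is less sensitive to outliers than one-way ANOVA. H is usually evaluated against a chi-square distribution with k – 1 degrees of freedom.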

Figure 13.12 SPSS Menu Selections for Nonparametric Analysis to Compare Independent Samples

Figure 13.13 Settings Tab and Radio Button Selection for Automatic Choice of Analysis

Figure 13.14 Output for Independent-Samples Kruskal-Wallis Test

COMPREHENSION QUESTIONS

1. A nonexperimental study was done to assess the impact of the accident at the Three Mile Island (TMI) nuclear power plant on nearby residents (Baum, Gatchel, & Schaeffer, 1983). Data were collected from residents of the following four areas:

Group 1: Three Mile Island, where a nuclear accident occurred (n = 38)
Group 2: Frederick, with no nuclear power plant nearby (n = 27)
Group 3: Dickerson, with an undamaged coal power plant nearby (n = 24)
Group 4: Oyster Creek, with an undamaged nuclear power plant nearby (n = 32)

Several different measures of stress were taken for people in these four groups. The researchers hypothesized that residents who lived near TMI (Group 1) would score higher on a wide variety of stress measures than people who lived in the other three areas included as comparisons. One-way ANOVA was performed to assess differences among these four groups on each outcome. Selected results are reported below for you to discuss and interpret. Here are results for two of their outcome measures: stress (total reported stress symptoms) and depression (score on the Beck Depression Inventory). Each cell lists the mean, followed by the standard deviation in parentheses.

a) Write a “Results” section in which you report whether these overall differences were statistically significant for each of these two outcome variables (using α = .05). You will need to look up critical values for F, and in this instance, you will not be able to include an exact p value.

b) Include an eta squared effect size index for each of the F ratios (you can calculate this by hand from the information given in the table). Be sure to state the nature of the differences: Did the TMI group score higher or lower on these stress measures relative to the other groups?

c) Would your conclusions change if you used α = .01 instead of α = .05 as your criterion for statistical significance?

d) Name a follow-up test that could be done to assess whether all possible pairwise comparisons of group means were significant.

e) Write out the contrast coefficients to test whether the mean for Group 1 (people who lived near TMI) differed from the average for the other three comparison groups.

f) Here is some additional information about scores on the Beck Depression Inventory. For purposes of clinical diagnosis, Beck, Steer, and Brown (1996) suggested the following cutoffs:

0–13: Minimal depression
14–19: Mild depression
20–28: Moderate depression
29–63: Severe depression

In light of this additional information, what would you add to your discussion of the outcomes for depression in the TMI group versus groups from other regions? (Did the TMI accident make people severely depressed?)

2. Sigall and Ostrove (1975) did an experiment to assess whether the physical attractiveness of a defendant on trial for a crime had an effect on the severity of the sentence given in mock jury trials. Each of the participants in this study was randomly assigned to one of the following three treatment groups; every participant received a packet that described a burglary and gave background information about the accused person. The three treatment groups differed in the type of information they were given about the accused person’s appearance. Members of Group 1 were shown a photograph of an attractive person; members of Group 2 were shown a photograph of an unattractive person; members of Group 3 saw no photograph. Some of their results are described here. Each participant was asked to assign a sentence (in years) to the accused person; the researchers predicted that more attractive persons would receive shorter sentences.

a) Prior to assessment of the outcome, the researchers did a manipulation check. Members of Groups 1 and 2 rated the attractiveness (on a 1-to-9 scale, with 9 being the most attractive) of the person in the photo. They reported that for the attractive photo, M = 7.53; for the unattractive photo, M = 3.20, F(1, 108) = 184.29. Was this difference statistically significant (using α = .05)?

b) What was the effect size for the difference in (2a)?

c) Was their attempt to manipulate perceived attractiveness successful?

d) Why does the F ratio in (2a) have just df = 1 in the numerator?

e) The mean length of sentence given in the three groups was as follows:

Group 1: Attractive photo, M = 2.80
Group 2: Unattractive photo, M = 5.20
Group 3: No photo, M = 5.10

They did not report a single overall F comparing all three groups; instead, they reported selected pairwise comparisons. For Group 1 versus Group 2, F(1, 108) = 6.60, p < .025. Was this difference statistically significant? If they had done an overall F test to assess the significance of differences of means among all three groups, do you think this overall F would have been statistically significant?

f) Was the difference in mean length of sentence in part (2e) in the predicted direction?

g) Calculate and interpret an effect size estimate for this obtained F.

h) What additional information would you need about these data to do a Tukey honestly significant difference test to see whether Groups 2 and 3, as well as Groups 1 and 3, differed significantly?

3. Suppose that a researcher has conducted a simple experiment to assess the effect of background noise level on verbal learning. The manipulated independent variable is the level of white noise in the room (Group 1, low level of 65 dB; Group 2, high level of 70 dB). (Here are some approximate reference values for decibel noise levels: 45 dB, whispered conversation; 65 dB, normal conversation; 80 dB, vacuum cleaner; 90 dB, chainsaw or jackhammer; 120 dB, rock music played very loudly.) The outcome measure is the number of syllables correctly recalled from a 20-item list of nonsense syllables. Participants ranged in age from 17 to 70 and had widely varying levels of hearing acuity; some of them habitually studied in quiet places, and others preferred to study with the television or radio turned on. There were five participants in each of the two groups. The researcher found no significant difference in mean recall scores between these groups.

a) Describe three specific changes to the design of this noise/learning study that would be likely to increase the size of the t ratio (and, therefore, make it more likely that the researcher would find a significant effect).

b) Also, suppose that the researcher has reason to suspect that there is a curvilinear relation between noise level and task performance. What change would this require in the research design?

4. Suppose that Kim is a participant in a study that compares several coaching methods to see how they affect math SAT scores. The grand mean of math SAT scores for all participants (MY) in the study is 550. The group that Kim participated in had a mean math SAT score of 565. Kim’s individual score on the math SAT was 610.

a) What was the estimated residual component (εij) of Kim’s score, that is, the part of Kim’s score that was not related to the coaching method? (Both parts of this question call for specific numerical values as answers.)

b) What was the “effect” (αi) component of Kim’s score?

5. What pattern in grouped data would make SSwithin = 0? What pattern within data would make SSbetween = 0?

6. Assuming that a researcher hopes to demonstrate that a treatment or group membership variable makes a significant difference in outcomes, which term does the researcher hope will be larger, MSbetween or MSwithin? Why?

7. Explain the following equation:

$$(Y_{ij} - M_Y) = (Y_{ij} - M_i) + (M_i - M_Y)$$

What do we gain by breaking the (Yij – MY) deviation into two separate components, and what does each of these components represent? Which of the terms on the right-hand side of the equation do researchers typically hope will be large, and why?

8. What is H0 for a one-way ANOVA? If H0 is rejected, does that imply that each mean is significantly different from every other mean?

9. What information do you need to decide on a sample size that will provide adequate statistical power?

10. In the equation αj = (Mj – MY), what do we call the αj term?

11. If there is an overall significant F in a one-way ANOVA, can we conclude that the group membership or treatment variable caused the observed differences in the group means? Why or why not?

12. How can eta squared (η²) be calculated from SS values in an ANOVA summary table?

13. Which of these types of tests is more conservative: planned contrasts or post hoc (protected) tests?

14. Name two common post hoc procedures for the comparison of means in an ANOVA.

NOTES

1. A harmonic mean uses sums of reciprocals of group n’s, instead of a sum of values of n. Use k to represent the number of groups and n1, n2, …, nk to represent the numbers of cases within these k groups. To find the harmonic mean of these n’s:

$$\text{harmonic mean of } n = \frac{k}{\dfrac{1}{n_1} + \dfrac{1}{n_2} + \cdots + \dfrac{1}{n_k}}$$

For a set of three groups with n1 = 20, n2 = 5, and n3 = 2, for example,

$$\text{harmonic mean of } n = \frac{3}{\dfrac{1}{20} + \dfrac{1}{5} + \dfrac{1}{2}} = \frac{3}{0.75} = 4$$

Using the harmonic mean results in an average n that is closer to the smaller group n’s. The arithmetic mean in this example is (n1 + n2 + n3)/3 = 9.

DIGITAL RESOURCES

Find free study tools to support your learning, including eFlashcards, data sets, and web resources, on the accompanying website at edge.sagepub.com/warner3e.

Descriptions of Images and Figures

The details of the three columns of the worksheet are as follows.