Comparing Two Independent Groups Stats


Warner, R. M. (2021).  Applied statistics I: Basic bivariate techniques (3rd ed.). Thousand Oaks, CA: Sage Publications. ISBN: 978-1-5063-5280-0.

CHAPTER 12 THE INDEPENDENT-SAMPLES T TEST

12.1 RESEARCH SITUATIONS WHERE THE INDEPENDENT-SAMPLES T TEST IS USED

The independent-samples t test is used to evaluate whether the means of a Y dependent variable differ significantly across two groups. This test is used much more often in actual research than the one-sample t test. The X independent variable is dichotomous; it identifies membership in one of two groups, using scores such as 1 versus 2 to identify each person as a member of one group. The Y dependent variable must be quantitative.1 Groups can be naturally occurring (such as women vs. men or Democrats vs. Republicans). Alternatively, they can be groups created in an experiment. The independent-samples t test requires a between-S design; that is, each person is a member of one and only one of the groups, people are not matched or paired across groups, and there are not repeated measures for the same persons in the two groups. If participants are matched, paired, or observed under both treatment conditions, a different analysis is required, the paired-samples t test, discussed in a later chapter.

Consider this simple hypothetical experiment. A student wants to know whether mean heart rate is higher when people consume caffeine than when they do not. The X predictor variable in this study is dosage level of caffeine (coded 1 = no caffeine, 2 = 150 mg of caffeine, about the amount in one cup of coffee). The specific numerical values used to label groups make no difference in the results; small integers are usually chosen as values for X, for convenience. The outcome variable (Y) in this example is heart rate (hr). Any drug (even caffeine) can have placebo effects, so it would be important to keep participants and researchers blind to condition. This could be done by giving each member of group 1 a cup of decaffeinated coffee and each member of group 2 a cup of coffee with 150 mg of caffeine (it would be necessary to check that participants could not taste the difference, to avoid placebo effects, and that they drank all the coffee).

Researchers usually hope to find a statistically significant difference in mean scores on Y between the groups. In this example the student researcher might expect that mean hr is higher in the caffeine group than the no-caffeine group. If the group means do not differ significantly, this suggests that there is no treatment effect (i.e., caffeine has no effect on heart rate).

When we obtain statistics such as means for more than one group, numerical subscripts are used to identify groups. For the independent-samples t test, the information we need is:

Group 1: M1, s1, and n1

Group 2: M2, s2, and n2

For the one-sample t test, one sample mean M was used to estimate or test hypotheses about one population mean μ. In this chapter we consider means for two samples and their corresponding hypothetical populations:

μ1 is the hypothetical population mean hr for people who have consumed no caffeine, and

μ2 is the hypothetical population mean hr for people who have consumed 150 mg of caffeine.

If caffeine does not affect heart rate, these population means would be equal: μ1 = μ2. A large difference between M1 and M2 in the sample data in the study would suggest that this hypothesis may be incorrect. Formally, the null hypothesis for the independent-samples t test can be written two ways:

$$H_0: \mu_1 = \mu_2. \tag{12.1}$$

$$H_0: (\mu_1 - \mu_2) = 0. \tag{12.2}$$

These two statements are logically equivalent; the second version of the null hypothesis is more convenient. A translation of the null hypotheses into words is, Caffeine has no effect on heart rate. If mean heart rate is the same for people who do versus people who do not consume caffeine, this tells us that there is no treatment effect for caffeine. In this example, caffeine is the dichotomous independent variable with values 1 and 2. Equations 12.1 and 12.2 do not explicitly name the dependent variable.

Just as we used M to estimate μ for the one-sample t test, we now use (M1 – M2) to estimate (μ1 – μ2). What information from the sample data would lead us to suspect that the null hypothesis in Equation 12.2 may be incorrect? If your answer was “a large value of M1 – M2,” you are correct. A large difference between M1 and M2 is unlikely (although not impossible) if H0 is true. How do we evaluate whether the M1 – M2 difference is large enough to lead us to doubt that H0 is true? Once again, we set up a t ratio to compare our sample statistic (M1 – M2) with the standard error of that sample statistic. We will need to know the standard error of the difference, denoted SE(M1–M2). The general form of a t ratio is:

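$$t = \frac{\text{sample statistic} - \text{hypothesized population parameter}}{SE_{\text{sample statistic}}}$$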

For the independent-samples t test, the sample statistic is (M1 – M2), so the t ratio is:

$$t = \frac{(M_1 - M_2) - (\mu_1 - \mu_2)}{SE_{M_1 - M_2}}$$

Because the hypothesized value of (μ1 – μ2) is zero, we can drop that term:

$$t = \frac{(M_1 - M_2) - 0}{SE_{M_1 - M_2}}$$

This reduces to the following formula for the independent-samples t test:

$$t = \frac{M_1 - M_2}{SE_{M_1 - M_2}} \tag{12.3}$$

This t ratio has df = (n1 + n2 – 2).2

If the absolute value of t exceeds the critical value from the t distribution with (n1 + n2 – 2) df, using α = .05, two tailed, we can say that we have found a statistically significant difference between the sample means and that we can reject the null hypothesis with p < .05, two tailed. In practice, it is easier to examine the p value that corresponds to t and reject H0 if the obtained p is less than .05 (or less than the specific α level chosen in advance). In other words, all parts of this procedure are the same as for the one-sample t test, except that now we examine the difference between two sample means, (M1 – M2), instead of a single value of M.

12.2 A HYPOTHETICAL RESEARCH EXAMPLE

Consider the following imaginary experiment. Twenty participants are recruited as a convenience sample; each participant is randomly assigned to one of two groups. The groups are given different doses of caffeine. Group 1 receives 0 mg of caffeine; Group 2 receives 150 mg of caffeine. Half an hour later, each participant’s heart rate is measured; this is the quantitative Y outcome variable. The goal of the study is to assess whether caffeine may increase mean hr. In this situation, X (the independent variable) is a dichotomous variable (amount of caffeine, 0 vs. 150 mg); Y, hr, is the quantitative dependent variable. The scores for these variables are in Figure 12.1 and in the file hrcaffeine.sav. Group membership is identified by scores in the “caffeine” column.

Figure 12.1 SPSS Data View: Caffeine/Heart Rate Experiment Data

12.3 ASSUMPTIONS FOR USE OF INDEPENDENT-SAMPLES T TEST

Scores for the Y outcome (or dependent) variable should satisfy the following assumptions.

12.3.1 Y Scores Are Quantitative

Because we will compute a mean for the Y scores in each group, the scores on the Y outcome variable must be quantitative. It would not make sense to compute a group mean for scores on a Y variable that is categorical.

12.3.2 Y Scores Are Independent of Each Other Both Between and Within Groups

When a design is between-S, that is, each person is a member of only one group, Y scores will probably be independent of each other between groups. If matching, pairing, or repeated measures are used, the paired-samples t test must be used instead of the independent-samples t test. Chapter 2 discussed the differences between between-S and within-S designs. The paired-samples t test is introduced in Chapter 14.

Y scores should also be independent of each other within groups. Usually this assumption is satisfied if each person is assessed alone. When subjects in the same treatment condition are tested in pairs or groups, or if they happen to be roommates or couples, or if they have an opportunity to influence each other’s behavior, the scores may not be independent. As an example, consider this hypothetical study. Students consumed high-carbohydrate or high-protein drinks and rated their moods. To collect data quickly, the student tested them in groups. One participant threw up the protein shake, an event that probably affected the mood of others who were present. This made Y scores dependent (related) within the testing groups. Testing participants individually would have avoided this problem. In practice, there is little you can do during data screening to detect within-group nonindependence of scores. You need to know how data were collected.

12.3.3 Y Scores Are Sampled From Normally Distributed Populations With Equal Variances

When the independent-samples t test was developed, statisticians assumed that Y scores in samples were randomly selected from populations that correspond to the groups in the study and that scores in those populations are normally distributed. We have no way to evaluate whether that assumption is correct. In practice, researchers examine the distribution of Y scores in samples and hope that if scores in samples are normally distributed, the populations from which the samples were selected also have reasonably normal distributions.

Another assumption about the population distributions of Y for this test is called homogeneity of variance. The variances of the Y scores should be equal or homogeneous in the two populations that correspond to the samples compared in the study. We can write this assumption formally as follows:

$$H_0: \sigma_1^2 = \sigma_2^2, \tag{12.4}$$

where σ₁² and σ₂² denote the (unknown) population variances for the populations that correspond to the groups in our study. We can judge whether this assumption may be violated by comparing the sample variances for the scores in the two groups (s₁² and s₂²) using the Levene test. SPSS reports the Levene³ test F ratio as part of the standard SPSS output for the independent-samples t test. If the p value for the Levene test is small (e.g., p < .01), this is evidence of possible violation of the homogeneity of variance assumption. (Note that when we test possible violations of assumptions, we would prefer that p be large and not significant.)

It is widely believed that the violation of the assumption that population scores are normally distributed does not cause serious problems if each group has n > 25 or 30, group n’s are equal, and two-tailed tests are used (Boneau, 1960; Hogg, Tanis, & Zimmerman, 2014; Sawilowsky & Blair, 1992). The sampling distribution of M turns out to be close to normal, as predicted by the central limit theorem, even when this assumption is violated. The independent-samples t test is robust against violations of the homogeneity of variance assumption; that is, p values obtained using the independent-samples t test are good estimates of the true risk for Type I error even when this assumption is violated. This has been demonstrated in studies where statisticians set up data for the two hypothetical populations such that μ1 = μ2 and the equal variances assumption is violated. When thousands of random samples are drawn from these populations, and independent-samples t tests are applied, the number of Type I errors obtained is close to what we expect on the basis of the selected α level. Myers and Well (1995) described robustness to violation of the equal variances assumption as follows:

If the two sample sizes are equal, there is little distortion in Type I error rate unless n is very small and the ratio of the variances is quite large … when n’s are unequal, whether the Type I error rate is inflated or deflated depends upon the direction of the relation between sample size and population variance. (pp. 69–70)

According to Myers and Well (1995), even when the ratio of σ₁²/σ₂² was as high as 100, t tests based on samples of n = 5 using α = .05 resulted in Type I errors in only 6.6% of the batches of data. The impact of violations of the homogeneity assumption on the risk for committing a Type I error is greater when the n’s in the samples are small (less than 30), when the group n’s are unequal, and when a one-tailed test is used (Sawilowsky & Blair, 1992). Some authorities suggest that even smaller group n’s are sufficient for robustness against violation of the equal population variance assumption.
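The kind of simulation study described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reproduction of any published study: the populations are normal with μ1 = μ2 = 0, each group has n = 5, the critical value 2.306 is the tabled two-tailed .05 value for df = 8, and the function names, seed, and number of replications are all hypothetical choices. The Type I error rates it prints will depend on those choices and need not match published figures.

```python
import math
import random


def pooled_t(group1, group2):
    """Equal-variances-assumed (pooled) t ratio for two lists of scores."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    ss1 = sum((y - m1) ** 2 for y in group1)
    ss2 = sum((y - m2) ** 2 for y in group2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    se_diff = math.sqrt(pooled_var / n1 + pooled_var / n2)
    return (m1 - m2) / se_diff


def type1_error_rate(sd1, sd2, n=5, reps=10000, t_crit=2.306, seed=1):
    """Proportion of simulated studies that reject H0 when mu1 == mu2.

    t_crit = 2.306 is the two-tailed .05 critical value for df = 8,
    which matches n = 5 per group; change it if n changes.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        g1 = [rng.gauss(0.0, sd1) for _ in range(n)]
        g2 = [rng.gauss(0.0, sd2) for _ in range(n)]
        if abs(pooled_t(g1, g2)) > t_crit:
            rejections += 1
    return rejections / reps


equal_rate = type1_error_rate(sd1=1.0, sd2=1.0)     # equal population variances
unequal_rate = type1_error_rate(sd1=1.0, sd2=10.0)  # variance ratio of 100
print(f"equal variances:    {equal_rate:.3f}")
print(f"variance ratio 100: {unequal_rate:.3f}")
```

Comparing the two printed proportions with the nominal α = .05 gives a rough sense of how much the Type I error rate shifts when the equal variances assumption is violated with small, equal n's.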

Despite these assurances about robustness of the independent-samples t test with equal variances assumed, SPSS takes an old-fashioned approach. SPSS output for this includes the Levene F test (which really isn’t needed when groups have 25 or more members) and also two versions of the independent-samples t test. One of these t tests (called the equal variances not assumed t test) is adjusted to correct for violations of the homogeneity assumption. Most authorities now think that the adjustment is too conservative. In practice, when you look at independent-samples t test output, you can safely ignore the Levene test and the equal variances not assumed t test. The equal variances not assumed version of the independent-samples t test is calculated using a different formula for t, and it has a downwardly adjusted df value, sometimes denoted df′. SPSS reports both versions of the independent-samples t test. Usually, you will report the equal variances assumed version of the t test.

12.3.4 No Outliers Within Groups

The presence of outliers is not stated as a formal assumption for most tests; however, extreme outliers violate the assumption of normality and, in practice, can cause serious problems in data analysis. Recall that means are not robust against the presence of outliers. Because the independent-samples t test compares group means, it follows that outliers are also a problem for the t test. As noted earlier, handling outliers can be problematic. For some kinds of variables (such as salary or reaction time) you can anticipate that outliers are likely, and in these situations, you should decide on rules for identification and handling of outliers before data collection. You must document the presence and handling of outliers in your final report so that readers know how many cases were dropped and why. For other kinds of variables, such as ratings on 5-point or 10-point scales, outliers are rare.

A boxplot or histogram of Y scores, separately for each group, provides a way to evaluate whether outliers are present in either group.

It can be instructive for a beginning student to run analyses such as t tests once with outliers included and again with outliers excluded to see how results differ. Occasionally, it can make sense to report both versions of an analysis (with and without outliers) in a research report so that readers can evaluate the situation for themselves. However, it is extremely bad practice to run a t test, obtain a result you do not like, and then throw away outliers and redo the analysis in an attempt to make data do what you want. You should decide on rules for identification and handling of outliers before you do analyses.

12.3.5 Relative Importance of Violations of These Assumptions

Violations of some assumptions can cause serious problems, while violations of other assumptions are less problematic. This list includes the set of potential violations that are serious and cannot be ignored.


· The independent-samples t test cannot be used if the dependent variable is categorical or if scores on the dependent variable are ranks within groups.

· The independent-samples t test cannot be used if assumptions of independence of observations (between and within groups) are violated.

· The presence of outliers can lead to p values that either over- or underestimate the true risk for Type I error. Values of group means may be strongly influenced by outliers. You need to document the presence of outliers and explain whether you dropped or retained them.

· An implicit assumption in procedures for all null hypothesis significance tests is that you do one test, then stop. In practice researchers often report large numbers of significance tests. If a researcher reports a set of 10 or 20 t tests, the risk for obtaining at least one Type I error in the set is much higher than the α level used to evaluate individual t values. Bonferroni corrected per comparison alpha levels (PCα), discussed in the correlation chapter, can be used to limit inflated risk for Type I error when numerous significance tests for a set of different t ratios are reported.

· As for all other analyses, results are meaningful only if we have samples that are similar to (representative of) the hypothetical populations of interest, if we have a manipulation that represents real-world situations, and if we have a reliable and valid measure of the outcome variable. (For example, a study that compared 0 mg caffeine and 3,000 mg caffeine [approximately 20 cups of coffee] would not tell us much about caffeine consumption in everyday life; such high levels probably never occur. It would also be unethical because that much caffeine might make participants sick.)

· If we want to make causal inferences, the research situation must be a well-controlled experiment. See Chapter 2 or a research methods textbook.

Violations of the assumptions that the populations have normally distributed scores with equal variances don’t cause serious problems unless there are other issues (such as very small samples and outliers within groups). (That’s good, because small samples don’t provide enough information for us to make inferences about the shapes of population distributions.)
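The point about reporting many significance tests can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the familywise risk formula 1 – (1 – α)^k assumes the k tests are independent, which real sets of t tests often are not. The Bonferroni per-comparison alpha is simply the overall α divided by the number of tests.

```python
def familywise_error(alpha, k):
    """Risk of at least one Type I error across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(overall_alpha, k):
    """Bonferroni per-comparison alpha: overall alpha divided by k tests."""
    return overall_alpha / k


for k in (1, 10, 20):
    risk = familywise_error(0.05, k)
    pc_alpha = bonferroni_alpha(0.05, k)
    print(f"k = {k:2d}  risk of at least one Type I error = {risk:.3f}  PC alpha = {pc_alpha:.4f}")
```

Under the independence assumption, 10 tests at α = .05 carry roughly a 40% risk of at least one Type I error, which is why a per-comparison alpha of .005 would be used instead.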

12.4 PRELIMINARY DATA SCREENING: EVALUATING VIOLATIONS OF ASSUMPTIONS AND GETTING TO KNOW YOUR DATA

Let’s return to the imaginary experiment in which one group of participants receives 0 mg of caffeine and the other group receives 150 mg of caffeine. To evaluate whether the independence of observations is violated, we need to know the research situation. If we know that each participant is tested under only one treatment condition and that there was no matching or pairing of participants for the samples, then the assumption that scores are independent between groups should be satisfied. If we know that each participant was tested individually and that the participants did not have any chance to influence one another’s levels of physiological arousal or heart rate, then the assumption that observations are independent within groups should be satisfied.

Data analysts can evaluate whether scores within each sample have reasonably normal distribution shapes and no extreme outliers and whether the homogeneity of variance assumption appears to be violated. Distribution shape for scores on a quantitative variable in the two groups or samples can be assessed by examining histograms of the dependent variable scores separately within each group. To do this, first request separate output for the groups by using the menu selections <Data> → <Split File>, as shown in Figure 12.2. This opens the Split File dialog box, also shown in Figure 12.2. (Note that you do not use the <Split into Files> command.) In the Split File dialog box, select the radio button for “Organize output by groups.” In the “Groups Based on” pane, enter the name of the grouping variable (caffeine), then click OK. Subsequent graphs and analyses will be reported separately for the caffeine and no-caffeine groups until this command is turned off. The SPSS <Analyze> → <Descriptive Statistics> → <Frequencies> procedure was used to obtain a histogram and descriptive statistics for each group.

Figure 12.2 Using the SPSS <Split File> Command to Obtain Output for Separate Groups

Figure 12.3 Histogram of Heart Rate Scores for the No-Caffeine Group (Caffeine = 1)

Figure 12.4 Histogram of Heart Rate Scores for the 150 mg Caffeine Group (Caffeine = 2)

The histograms in Figures 12.3 and 12.4 show the distributions of heart rate scores separately within each of the two groups. The smooth curves superimposed on the graphs represent ideal normal distribution shapes. These distributions are not close to normal in shape. Normal distribution shapes rarely appear when groups’ n’s are very small. Normal distribution in the samples is not a requirement for use of the independent-samples t test. (However, if you see distribution shapes within the sample that suggest something unusual is going on, like the examples of severely non-normal distribution shapes with modes at the lowest and/or highest possible score values, as in Table 5.2, you may want to stop and question whether group means are good descriptions of what is average in the samples. Other analyses might be preferable in these situations.)

Before additional data screening, turn off the <Split File> command. To do this, make the menu selections <Data> → <Split File> and then select the radio button for “Analyze all cases, do not create groups” (as shown in Figure 12.2), then click OK.

The presence of outliers can be assessed by examining boxplots for the distribution of hr scores separately for each treatment group. The menu selections are <Graphs> → <Legacy Dialogs> → <Boxplot>. In the Boxplot dialog box (Figure 12.5), choose “Summaries for groups of cases.” In the Define Simple Boxplot: Summaries for Groups of Cases dialog box (Figure 12.6), specify the dependent variable (hr); the group variable (caffeine) is placed on the category axis. The resulting boxplots appear in Figure 12.7. In this example, although two scores in the first group appeared to be outliers, none of the scores were judged to be extreme outliers, and no scores were removed from the data before further analysis. (Recall that a circle represents an outlier and an asterisk represents an extreme outlier in a boxplot.)
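For readers working outside SPSS, the boxplot rule can be approximated in Python. This is a sketch under assumptions: it flags scores more than 1.5 box lengths (interquartile ranges) beyond the quartiles as outliers and more than 3 box lengths as extreme outliers, and it uses the quartile definition in Python's `statistics.quantiles`, which may differ slightly from the hinges SPSS uses. The function name and the heart rate scores are hypothetical; they are not the hrcaffeine.sav data.

```python
import statistics


def classify_outliers(scores):
    """Return (score, label) pairs for scores that fall outside the box.

    Scores more than 1.5 IQRs beyond Q1 or Q3 are labeled "outlier";
    scores more than 3 IQRs beyond are labeled "extreme outlier".
    """
    q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    flagged = []
    for y in sorted(scores):
        distance = max(q1 - y, y - q3)  # positive only when y is outside the box
        if distance > 3 * iqr:
            flagged.append((y, "extreme outlier"))
        elif distance > 1.5 * iqr:
            flagged.append((y, "outlier"))
    return flagged


# Hypothetical heart rates for one group (made up for illustration).
group_hr = [60, 62, 64, 66, 68, 70, 72, 74, 88, 105]
print(classify_outliers(group_hr))
```

Running the function separately on each group's scores mirrors what the side-by-side boxplots show: which cases, if any, would be drawn as circles or asterisks.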

Figure 12.5 Boxplot Dialog Box

Figure 12.6 Define Simple Boxplot: Summaries for Groups of Cases Dialog Box

Figure 12.7 Boxplots for Heart Rate Scores Within Each Group (No Caffeine, Caffeine)

Overall, it appears that the heart rate data satisfy the assumptions for an independent-samples t test reasonably well. There is no reason to suspect that the independence of scores assumption is violated either within or between groups. Scores on hr are quantitative and reasonably normally distributed. The only potential problem identified by this preliminary data screening is the presence of two high-end outliers in Group 1 (not extreme for heart rate). The decision was made not to remove these scores. If outliers had been more extreme (e.g., hr scores above 130), we would need to consider whether participants with high scores fell outside the range of inclusion criteria for the study (i.e., healthy young adults). We might have decided early on to include only persons with “normal” heart rates (which we might define as the range between 50 and 110 beats per minute). We might also consider whether extreme scores might be due to data-recording errors.

12.5 COMPUTATION OF INDEPENDENT-SAMPLES T TEST

Do not be alarmed at the number of equations in this section. The computations needed for independent-samples t tests are mostly things you have done before. When you compare means across groups, it should become almost a reflex: Compute means, variances, and standard deviations within each group. In real-life situations, you can use SPSS to do almost all the computations you need. Here is an outline of the computations.

1. Find M, s, and n for each group. Numerical subscripts are used to indicate group membership; for example, M1 is the mean for Group 1, and M2 is the mean for Group 2. If you prefer, verbal labels may be helpful (e.g., Mcontrol and Mcaffeine).

2. Find the overall df for the independent-samples t test: df = (n1 – 1) + (n2 – 1).

3. Using M1 and M2 (the means for Groups 1 and 2), find the (M1 – M2) difference.

4. Find the standard error of the (M1 – M2) difference, denoted SE(M1–M2). This is the only computation that is new. (This computation uses Equations 12.11 through 12.15, coming up soon.)

5. Find t: t = (M1 – M2)/SE(M1–M2).

6. To evaluate statistical significance, you can do one of the following:

   a. Compare your obtained t value with the critical values that define reject and do not reject regions in Appendix B at the end of this book. These reject and do not reject regions depend on the α level you have selected, whether you have a directional or nondirectional alternative hypothesis, and the df for the t ratio. (Reject and do not reject regions were discussed in Chapter 8, on the one-sample t test.)

   b. Compare your obtained p value (given by a computer program) with the α level criterion you have chosen. For example, if you chose α = .05, two tailed, as the criterion for statistical significance, and if p = .043, two tailed, you can judge the difference between means statistically significant and report the result as p = .043, two tailed. Note that SPSS provides two-tailed p values.

All statistical results listed above are provided in SPSS output. You will also need to calculate an effect size index by hand, either Cohen’s d, point biserial r, or η² (eta squared); these are described in a later section.

Within each of the two groups, M and s are calculated as follows. The subscripts 1 and 2 indicate whether scores belong to Group 1 or Group 2. Scores on the quantitative outcome variable are denoted by Y1 when they come from Group 1 and Y2 when they belong to Group 2.

First calculate the mean for each group:

$$M_1 = \sum Y_1 / n_1, \tag{12.5}$$

$$M_2 = \sum Y_2 / n_2, \tag{12.6}$$

where Y1 is the set of scores in Group 1, n1 is the number of scores in Group 1, Y2 is the set of scores in Group 2, and n2 is the number of scores in Group 2.

Calculate the sum of squares (SS) for each group:

$$SS_1 = \sum (Y_1 - M_1)^2, \tag{12.7}$$

$$SS_2 = \sum (Y_2 - M_2)^2, \tag{12.8}$$

where Y1 refers to the set of scores in Group 1, M1 is the mean of the Y scores in Group 1, Y2 refers to the set of scores in Group 2, and M2 is the mean of the Y scores in Group 2.

Here is a reminder about sequence of operations when you compute SS. Within each group, calculate a (Y – M) deviation from the mean for each individual score, then square each deviation, then sum the squared deviations within each group.

Calculate the variance (s²) separately within each group:

$$s_1^2 = SS_1 / (n_1 - 1), \tag{12.9}$$

$$s_2^2 = SS_2 / (n_2 - 1). \tag{12.10}$$
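The within-group computations in Equations 12.5 through 12.10 can be checked with a short Python sketch. The function name and the scores are hypothetical (they are not the hrcaffeine.sav data); the code simply follows the sequence of operations described above: find the mean, compute each deviation from the mean, square it, sum the squares, and divide by n – 1.

```python
def describe_group(scores):
    """Return (M, SS, s2, n) for one group of scores."""
    n = len(scores)
    mean = sum(scores) / n                      # M = sum of Y / n
    ss = sum((y - mean) ** 2 for y in scores)   # SS = sum of squared deviations
    variance = ss / (n - 1)                     # s2 = SS / (n - 1)
    return mean, ss, variance, n


# Hypothetical heart rate scores for one group (made up for illustration).
no_caffeine = [52, 56, 58, 60, 64]
m1, ss1, var1, n1 = describe_group(no_caffeine)
print(m1, ss1, var1, n1)  # 58.0 80.0 20.0 5
```

The same function would be called a second time on the Group 2 scores to obtain M2, SS2, s2 for Group 2, and n2.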

When you examine SPSS output you will see that SPSS presents two versions of the independent-samples t test. The first version, which is called the “equal variances assumed” or pooled-variances t test, is almost always reported. The second version, called the “equal variances not assumed” or “separate variances” t test, was developed to be used in situations where the equal variances assumption is violated. However, violations of the equal variances assumption generally don’t make p values poor estimates of risk for Type I error. The computations in this section are for the equal variances assumed version of the t test. You will probably never use the equal variances not assumed version of the t test.

When the “equal variances assumed” version of the t test is used, the variances within the two groups are pooled or averaged, and this average is called s²pooled or sₚ². The term pooled just means averaged. To obtain sₚ², the pooled or averaged within-group variance, we average the two within-group variances s₁² and s₂². The first version of the formula works whether n1 = n2 or not. It “weights” the variances by the sample sizes (that is, sₚ² will be closer to s² for the group with the larger n).

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} \tag{12.11}$$

If n1 = n2, this formula reduces to the following. This version of the formula makes it even clearer that sₚ² is the average of s² for the two groups:

$$s_p^2 = (s_1^2 + s_2^2) / 2. \tag{12.12}$$
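A brief Python sketch can show how the weighting works. It assumes the usual weighted form of the pooled variance, in which each group's variance is weighted by its n – 1; the function name and the numbers are hypothetical.

```python
def pooled_variance(var1, n1, var2, n2):
    """Pooled within-group variance, weighting each s2 by its (n - 1)."""
    return ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)


# Equal n's: the pooled variance is the simple average of 20 and 30.
print(pooled_variance(20.0, 5, 30.0, 5))             # 25.0

# Unequal n's: the result moves toward the variance of the larger group.
print(round(pooled_variance(20.0, 5, 30.0, 21), 3))  # 28.333
```

With n1 = 5 and n2 = 21, the pooled value lies much closer to 30 (the variance of the larger group) than to 20, which is what "weighting by sample size" means here.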

The alternative to the pooled-variances or equal variances assumed independent-samples t test procedure is the equal variances not assumed (also called separate variances) t test procedure.⁴ The formula for SE(M1–M2) for the equal variances not assumed t test keeps the two variances separate instead of pooling them; this t test also uses a downwardly adjusted df. You probably will never need to use the equal variances not assumed t test, but it appears in your SPSS output whether you request it or not.

Given the values of sₚ², n1, and n2, we can calculate the standard error of the difference between sample means, SE(M1–M2):

$$SE_{M_1 - M_2} = \sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}} \tag{12.13}$$

A t ratio generally has the following form:

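$$t = \frac{\text{sample statistic} - \text{hypothesized population parameter}}{SE_{\text{sample statistic}}}$$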

For the independent-samples t test, the value of (μ1 – μ2) is usually hypothesized to be 0. If the difference between means is 0, that is equivalent to a null hypothesis that caffeine has no effect on heart rate; that is, mean heart rate is the same whether people receive caffeine or not.

Next we calculate the independent-samples t ratio:

$$t = \frac{M_1 - M_2}{SE_{M_1 - M_2}} \tag{12.14}$$

Then calculate the degrees of freedom for the independent-samples t ratio:

$$df = n_1 + n_2 - 2 \tag{12.15}$$

(For the equal variances not assumed t test, df is calculated using a complicated formula; it is smaller than n1 + n2 – 2, and it is usually given to two or more decimal places.)
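Putting the steps of this section together, the equal variances assumed t ratio can be sketched in Python as follows. This is an illustration, not SPSS's implementation: the function name is hypothetical, the pooled variance and standard error use the standard pooled-variance formulas, and the two sets of heart rate scores are made up (they are not the hrcaffeine.sav data).

```python
import math


def independent_samples_t(group1, group2):
    """Return (t, df, se_diff) for the pooled-variances t test."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    ss1 = sum((y - m1) ** 2 for y in group1)
    ss2 = sum((y - m2) ** 2 for y in group2)
    var1, var2 = ss1 / (n1 - 1), ss2 / (n2 - 1)
    # Pooled variance: each group's variance weighted by its n - 1.
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    # Standard error of the difference between the two sample means.
    se_diff = math.sqrt(pooled_var / n1 + pooled_var / n2)
    t = (m1 - m2) / se_diff
    df = n1 + n2 - 2
    return t, df, se_diff


# Hypothetical scores (made up for illustration).
no_caffeine = [52, 56, 58, 60, 64]
caffeine = [60, 64, 66, 68, 72]
t, df, se_diff = independent_samples_t(no_caffeine, caffeine)
print(f"t({df}) = {t:.3f}, SE = {se_diff:.3f}")
```

For these made-up scores, M1 = 58, M2 = 66, both variances equal 20, so the pooled variance is 20, the standard error is the square root of 8, and t is about –2.83 with 8 df. The sign is negative only because the group with the lower mean was entered first.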

12.6 STATISTICAL SIGNIFICANCE OF INDEPENDENT-SAMPLES T TEST

I recommend that you report the exact p value for the equal variances assumed (or pooled-variances) version of the t test. This is a two-tailed test. For example, if “Sig.” as reported by SPSS is .032, report p = .032, two tailed. Remember that if SPSS gives you a “Sig.” value of .000, you should report this as p < .001. A p value estimates risk for Type I error, and that risk can never be 0.

A two-tailed exact p value corresponds to the combined areas of the upper and lower tails of the t distribution that lie beyond the obtained sample values of ±t.
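To make the idea of "combined tail areas" concrete, the sketch below approximates a two-tailed p value by numerically integrating the t distribution's density between 0 and |t| and subtracting twice that area from 1. It is a teaching illustration using only Python's standard library, with hypothetical function names; statistical packages use more refined algorithms, and `math.gamma` will overflow for very large df.

```python
import math


def t_density(x, df):
    """Height of the t distribution curve at x for the given df."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t, df, steps=2000):
    """Combined area in both tails beyond -|t| and +|t|.

    Uses Simpson's rule on the interval [0, |t|]; steps must be even.
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t


print(round(two_tailed_p(2.101, 18), 3))  # close to .05, the tabled value for df = 18
```

Because the t distribution is symmetric, the area between –|t| and +|t| is twice the area between 0 and |t|, and whatever remains is split evenly between the two tails.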

If you want to report your outcome as a significance test using the conventional α = .05, two tailed, level of significance, an obtained p value less than .05 is interpreted as evidence that the t value is large enough so that it would be unlikely to occur by chance (because of sampling error) if the null hypothesis were true. In other words, if we set α = .05, two tailed, as the criterion for significance, and if p < .05, two tailed, we would say that the means of the groups are significantly different.

If an analyst decides to use a one-tailed (directional) test before peeking at the data, a one-tailed p value can be obtained by dividing the two-tailed p in the SPSS output by 2 (e.g., if two-tailed p = .06, then the corresponding one-tailed p = .03). In this situation, the analyst must also check that the direction of difference between the means corresponds to the difference in the alternative hypothesis. If the alternative hypothesis is H_alt: μ1 > μ2, the null hypothesis can be rejected if M1 > M2 but not if M1 < M2.
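The direction check can be written out as a small Python sketch. The function and argument names are hypothetical. When the observed difference is in the predicted direction, the one-tailed p is half the two-tailed p; when it is in the opposite direction, the one-tailed p is 1 minus that half (this follows from the symmetry of the t distribution), so the null hypothesis cannot be rejected.

```python
def one_tailed_p(two_tailed, m1, m2, predicted="m1_greater"):
    """One-tailed p for a directional hypothesis chosen before seeing the data.

    predicted is "m1_greater" (H_alt: mu1 > mu2) or "m1_less" (H_alt: mu1 < mu2).
    """
    if predicted == "m1_greater":
        in_predicted_direction = m1 > m2
    elif predicted == "m1_less":
        in_predicted_direction = m1 < m2
    else:
        raise ValueError("predicted must be 'm1_greater' or 'm1_less'")
    half = two_tailed / 2
    return half if in_predicted_direction else 1 - half


print(one_tailed_p(0.06, m1=70.0, m2=60.0))  # difference in the predicted direction
print(one_tailed_p(0.06, m1=60.0, m2=70.0))  # difference in the opposite direction
```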

I have used annoying quotation marks for “exact” p. I do this as a reminder that the “exact” value of p given by programs such as SPSS, often reported to 3 decimal places, is not necessarily correct. When assumptions are violated—and they often are—the p values given by a computer program often greatly underestimate the true risk for Type I decision error.

A judgment about statistical significance can also be made directly from the obtained value of t, its df, and the α level. If the absolute value of t exceeds the tabled critical value of t for n1 + n2 – 2 df, the null hypothesis of equal means is rejected, and the researcher concludes that there is a significant difference between the means. In the preceding empirical example of data from an experiment on the effects of caffeine on heart rate, n1 = 10 and n2 = 10; therefore, df = n1 + n2 – 2 = 18. If we use α = .05, two tailed, then from the table of critical values of t in Appendix B at the end of this book, the reject regions for this test (given in terms of obtained values of t) would be as follows:

Reject H0 if obtained t < –2.101 or if obtained t > +2.101.

Note that these values of t also bound the middle 95% of the area of a t distribution with 18 df. These t values (“critical” values) are also needed to set up a confidence interval (CI) for M1 – M2.

When large numbers of t tests are reported, authors often use asterisks in tables to denote p values smaller than three different conventionally used α levels. Conventionally, * indicates p <.05, ** indicates p < .01, and *** indicates p < .001. Using these asterisks to make decisions about statistical significance amounts to setting the alpha level after examining the results; this is not good practice. When tables include numerous significance test outcomes, inflated risk for Type I error is likely to be present. Unless the author specifically notes that procedures for correction for inflated risk have been used, such as Bonferroni corrected per comparison alphas, tables with large numbers of asterisks should be viewed skeptically. I recommend against the use of asterisks to indicate significance of individual tests in large tables or lists. (I am guilty of using asterisks myself in the past, and I now repent.)

12.7 CONFIDENCE INTERVAL AROUND M1 – M2

The general formula for a CI was discussed earlier. Assuming that the statistic has a sampling distribution shaped like a t distribution and that we know how to find the SE or standard error of the sample statistic, we can set up a 95% CI for a sample statistic by computing:

$$\text{Sample statistic} \pm t_{\text{critical}} \times SE_{\text{sample statistic}}$$

(12.16)

where tcritical corresponds to the absolute value of t that bounds the middle 95% of the area under the t distribution curve with appropriate degrees of freedom. For example, when df = 18, the critical values for a 95% CI are the values of t that bound the middle 95% of the t distribution with 18 df; from Appendix B at the end of the book, tcritical = ±2.101. The formulas for the upper and lower bounds of a 95% CI for the difference between the two sample means, M1 – M2, are as follows:

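$$\text{Lower bound of 95\% CI} = (M_1 - M_2) - t_{\text{critical}} \times SE_{M_1 - M_2}$$

(12.17)

$$\text{Upper bound of 95\% CI} = (M_1 - M_2) + t_{\text{critical}} \times SE_{M_1 - M_2}$$

(12.18)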

For df = 18, you would use the value 2.101 for tcritical. SPSS provides lower and upper bounds for the 95% CI for (M1 – M2) as part of the independent-samples t test output. The independent-samples t test procedure does not provide graphs of CIs, but these can be set up using other procedures.
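The bounds of the 95% CI can also be checked by hand. The following sketch uses the summary statistics for the heart rate and caffeine example (means 57.8 and 67.9, standard deviations 7.208 and 9.085, n = 10 per group) and the tabled value tcritical = 2.101. It assumes the pooled (equal variances assumed) standard error of M1 – M2; the variable names are illustrative.

```python
import math

# Summary statistics from the heart rate and caffeine example.
m1, m2 = 57.8, 67.9      # group means (no caffeine, caffeine)
s1, s2 = 7.208, 9.085    # group standard deviations
n1, n2 = 10, 10          # group sizes
t_critical = 2.101       # tabled value for df = 18, alpha = .05, two tailed

# Pooled variance, weighting each group variance by its df (assumption:
# the equal variances assumed version of the test is being used).
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

# Standard error of the difference between the two sample means.
se_diff = math.sqrt(sp2 * (1 / n1 + 1 / n2))

diff = m1 - m2
lower = diff - t_critical * se_diff
upper = diff + t_critical * se_diff

print(f"M1 - M2 = {diff:.1f}, SE = {se_diff:.3f}")
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```

The interval runs from roughly –17.8 to –2.4 and does not include 0, which agrees with the decision to reject the null hypothesis at α = .05, two tailed.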

12.8 SPSS COMMANDS FOR INDEPENDENT-SAMPLES T TEST

Results for all the computations described above are given in SPSS output. To obtain an independent-samples t value using the SPSS data file hrcaffeine.sav, make the following menu selections, starting from the top-level menu above the data worksheet (as shown in Figure 12.8): <Analyze> → <Compare Means> → <Independent-Samples T Test>. This opens the Independent-Samples T Test dialog box, as shown in Figure 12.9. The name of the (one or more) dependent variable(s) should be placed in the pane marked “Test Variable(s).” For this empirical example, the name of the dependent variable is hr. The name of the grouping variable should be placed in the box labeled “Grouping Variable”; for this example, the grouping variable is caffeine. In addition, it is necessary to click the Define Groups button; this opens the Define Groups dialog box that appears in Figure 12.10. Enter the code numbers that identify the groups that are to be compared (in this case, the codes are 1 for the 0-mg caffeine group and 2 for the 150-mg caffeine group; however, different numbers can be used to identify groups). Click the OK button to run the specified tests. The output for the independent-samples t test appears in Figure 12.11.

Figure 12.8 SPSS Menu Selections to Obtain Independent-Samples t Test

Figure 12.9 Screenshot of SPSS Dialog Box for Independent-Samples t Test

Figure 12.10 SPSS Define Groups Dialog Box for Independent-Samples t Test

Figure 12.11 Output From SPSS Independent-Samples t Test Procedure

12.9 SPSS OUTPUT FOR INDEPENDENT-SAMPLES T TEST

The top panel of the output in Figure 12.11, titled “Group Statistics,” presents the basic descriptive statistics for each group; the n of cases, mean, standard deviation, and standard error of the mean for each of the two groups are presented here. (Students should verify that they can duplicate these results by computing the means and standard deviations by hand from the data in Figure 12.1.) The difference between the means, M1 – M2, in this example is 57.8 – 67.9 = –10.1 beats per minute. The group that did not consume caffeine had a mean heart rate 10.1 beats per minute lower than the group that consumed caffeine; this is a large enough difference, in real units, to be of clinical or practical interest. (Notice that the sign for this difference depends on whether M1 represents the smaller or larger of the two means.)

Figure 12.12 Highlighted Equal Variances Assumed Version of Independent-Samples t Test

If the data analyst wants information about potential violation of the equal variances assumption, this is provided by the Levene test (on the left side in the t test results table in Figure 12.11). If the Levene F is not significant (i.e., p > .05), there is no evidence of a problem with the equal variance assumption, and the equal variances t test can be reported. In this example, the Levene F value is small (F = 1.571), and it is not statistically significant (p = .226). If p for the Levene F is small (p < .05 or p < .01), there is evidence that the homogeneity of variance assumption has been violated. If the researcher is worried about this (most people don’t worry), he or she may prefer to report the more conservative “equal variances not assumed” version of t. This is on the lower line below the heading “t-test for Equality of Means.” My results sections include the outcome of the Levene F test.

I will report the “equal variances assumed” version of the t test that appears on the upper line below the heading “t-test for Equality of Means”; this is information contained in the rectangle in Figure 12.12. The reason why the two versions of the t test are so similar in this example is that the group variances were nearly equal, and the group n’s were the same. The equal variances t test result is statistically significant, t(18) = –2.75, p = .013, two tailed. (The value in parentheses after t is the df.) Using α = 0.05, two tailed, as the criterion for significance, the 10.1-point difference in heart rate between the caffeine and no-caffeine groups is statistically significant. Mean heart rate was higher in the caffeine group.

If the researcher had specified a one-tailed test corresponding to the alternative hypothesis Halt: (μ1 – μ2) < 0, this result could be reported as follows: The equal variances t test was statistically significant, t(18) = –2.75, p = .0065, one tailed. (The one-tailed p value is obtained by dividing the two-tailed p value of .013 by 2; this is appropriate only when the difference is in the predicted direction, as it is here.)

12.10 EFFECT SIZE INDEXES FOR T

Many different effect size indexes can be reported for the independent-samples t test; this discussion includes only the most widely reported.

12.10.1 M1 – M2

When the dependent variable Y is measured in meaningful units, the difference between sample means can be useful information (Pek & Flora, 2018), although many authors do not refer to that difference as an effect size. The difference between means can sometimes be interpreted as information about practical, clinical, or everyday importance. In this hypothetical example, people who consumed 150 mg of caffeine (about one cup of coffee) had heart rates about 10 beats per minute higher than those who did not consume caffeine. That is a noticeable difference, but not large enough that people need to be worried about it.

To make judgments about clinical or practical significance of differences between means, we need to understand the meanings of different score values; even then, people can have different subjective evaluations. Imagine a situation in which people who receive chemotherapy for a specific type of cancer live on average 3 weeks longer than people who decline chemotherapy. Apart from the question of whether this difference is statistically significant, we have the question, How much practical value does a 3-week difference have? A medical researcher might be pleased to find a treatment that extends life by 3 weeks. As a patient, however, I might not want to undergo possibly severe negative side effects unless the average extension of life was 2 or 3 months. In situations like this, clinicians and patients should remember that group averages often do not predict individual outcomes well. If median improvement in length of survival is 3 months, half of the patients in the study had shorter, and half had longer, improvements in length of survival. Ability to generalize results from a study to your own personal situation should also take into account how similar you are, and how similar your disease condition is, to persons included in the study.

In the extremes it may be easy to say whether a treatment such as a weight loss pill has practical or real-world significance. Most people would not think that a mean weight loss of 1 lb is enough to be meaningful or valuable. On the other hand, most people might think that a mean weight loss of 30 lb is enough to have practical, clinical, or real-world value. For in-between amounts of weight loss, people may differ in how much they think is sufficient to be of value, relative to costs and risks of the treatment.

When variables are not measured in meaningful units, M1 – M2 may not provide useful real-world information (although it may still be interesting to compare values of M1 – M2 across different studies that use the same measures). For example, suppose you are told that female teachers receive average teaching evaluation scores of 24, while male teachers receive average evaluation scores of 27. You can see that the mean rating is higher for male than female teachers in this example, but you would need much more information to evaluate whether the difference is large. It is usually helpful to know the possible minimum and possible maximum score value and the actual minimum and maximum values found in the sample (this information is sometimes not included, but it should be). Other effect size indexes use standard deviation or variance of scores to evaluate effect size.

The value of M1 – M2 is not related to sample size (n1, n2) and does not always correspond to the p value. Possible values of M1 – M2 depend on the original units of measurement.

12.10.2 Eta Squared (η2)

An eta squared (η2) coefficient is an estimate of the proportion of variance in Y dependent variable scores that is predictable from group membership (or associated with group membership). (Note that η is lowercase “h” in Symbol font.) SPSS does not provide η2; it can be calculated directly from the t ratio and df:

$$\eta^2 = \frac{t^2}{t^2 + df}$$

(12.19)

η2 is similar to r2 in many ways. Both r2 and η2:

· Are standardized, unit free, and not related to original units of measurement.

· Have fixed ranges of possible values (from 0 to 1, with 0 = no association, 1 = perfect association).

· Are interpreted as proportion or percentage of variance in Y that can be predicted from X.

· Are not related to N, sample size.

Large values of r2 and η2 are not always statistically significant; statistical significance depends on both sample size and effect size combined.

Both η2 and r2 describe the proportion of variance in Y that can be predicted from X only in the sample. When we find that we can account for a proportion of variance in the sample of scores in a study—for example, η2 = .40, or 40%—this obtained proportion of variance is highly dependent on the nature of the sample and on other design decisions. It tells us about the explanatory power of the independent variable only in the somewhat artificial world and small sample that we are studying. For instance, if we find (in a study of the effects of caffeine on heart rate in a healthy college student sample) that 40% of the variance in the hr scores is predictable from exposure to caffeine, that does not mean that “out in the world,” caffeine is such an important variable that it accounts for 40% of the variability in hr. In fact, out in the real world, there may be a substantial proportion of variation in hr that is associated with other variables, such as age, smoking, physical health, and body weight. These other variables may not be important sources of difference in heart rate in a laboratory study; in fact, participants may have been selected in ways that get rid of many of these variables. A researcher might recruit only nonsmokers, for example.

In an experiment, we create an artificial world by manipulating an independent variable and holding other variables constant. In a nonexperimental study, we create an artificial world through the necessarily arbitrary choices that we make when we decide which participants and variables to include and exclude. We must be cautious when we attempt to generalize beyond the artificial world of research to broader, often purely hypothetical populations. When we find that a variable is a strong predictor in our study, that result only suggests that it might be an important variable outside the laboratory. We need more evidence to evaluate whether it is an important predictive variable in other settings than the unique and artificial setting in which we conducted our research. In the cases where participants are randomly sampled from actual populations of interest, we can be somewhat more confident about making generalizations. When we use convenience samples, we must be cautious about generalizing results. The value of M in a convenience sample may not be a good estimate of μ for the hypothetical population of interest. Similarly, the value of η2 in a convenience sample may not be a good estimate of η2 in the population of interest.

12.10.3 Point Biserial r (rpb)

A t ratio can be converted into a point biserial correlation (rpb). The value of rpb can range from 0 to 1, with 0 indicating no relationship between scores on the dependent variable and group membership:

$$r_{pb} = \sqrt{\frac{t^2}{t^2 + df}}$$

(12.20)

Note that rpb is just the square root of η2. This is called point biserial r because two lists (or series) of scores are correlated with a binary group membership variable. When results are combined across many studies using meta-analytic procedures, rpb is often the preferred effect size index. Sometimes rpb is referred to as r, and in fact, it is equivalent to Pearson’s r with dichotomous scores for one of the variables. The sign of r can be assigned to show whether higher mean Y scores occurred in the treatment or control group. For instance, if 150 mg caffeine is treatment, coded Group 2, and 0 mg caffeine is the control condition, coded Group 1, then r can be reported with a positive sign if mean heart rate is higher in Group 2 (the treatment group).

Similar to r2 and η2, rpb is standardized or unit free; its value does not depend on the original units of measurement, and its value does not depend on N, sample size. In general, once a sign has been assigned, rpb has a fixed range of values from –1 to +1 (from perfect negative to perfect positive association). However, if group n’s are not equal in your study, possible outcomes for rpb will be limited to a narrower range. The caveat about generalization is the same for rpb as for η2: The value you obtain in your sample may not be generalizable to the real world.
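The equivalence between rpb and Pearson’s r can be verified numerically. The sketch below uses a small set of made-up heart rate scores (hypothetical data, not the scores from the chapter’s data file), computes the pooled-variances t, converts it to rpb, and compares the result with Pearson’s r between the group codes (1, 2) and the scores. The helper function names are illustrative.

```python
import math

# Hypothetical scores, invented for illustration only.
no_caffeine = [54, 60, 57, 63, 56]   # Group 1
caffeine = [66, 61, 70, 64, 69]      # Group 2


def mean(values):
    return sum(values) / len(values)


def pooled_t(y1, y2):
    """Equal variances assumed t ratio and its df for two independent groups."""
    n1, n2 = len(y1), len(y2)
    m1, m2 = mean(y1), mean(y2)
    ss1 = sum((y - m1) ** 2 for y in y1)
    ss2 = sum((y - m2) ** 2 for y in y2)
    df = n1 + n2 - 2
    sp2 = (ss1 + ss2) / df
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


def pearson_r(x, y):
    """Pearson correlation computed from deviations from the means."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


t_value, df = pooled_t(no_caffeine, caffeine)
r_from_t = math.sqrt(t_value**2 / (t_value**2 + df))

codes = [1] * len(no_caffeine) + [2] * len(caffeine)
scores = no_caffeine + caffeine
r = pearson_r(codes, scores)

print(f"t({df}) = {t_value:.3f}")
print(f"rpb from t = {r_from_t:.4f}, Pearson r = {r:.4f}")
```

The two values agree in absolute magnitude. Pearson’s r is positive here because the group coded 2 has the higher mean, while t is negative because it is computed from M1 – M2.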

12.10.4 Cohen’s d

Cohen’s d differs from the previous effect sizes; it evaluates the difference between sample means in terms of number of standard deviations. For the independent-samples t test, Cohen’s d is calculated as follows:

$$d = \frac{M_1 - M_2}{s_p}$$

(12.21)

where sp, the pooled within-group standard deviation, is calculated by taking the square root of the value of sp2 from Equation 12.11 or 12.12.

Like r2, η2, and rpb, Cohen’s d is unit free, standardized, and not dependent upon the original units of measurement. Its value does not depend on N. The sign of Cohen’s d depends on whether M1 > M2 or M2 > M1. There is not a fixed range for possible values of d, although values lower than –2 or higher than +2 are uncommon.

In words, d indicates the distance between the two group means in terms of within-group standard deviations. It helps us visualize how much overlap there is between two distributions of scores. The following examples illustrate small versus large values of Cohen’s d. Figure 12.13 shows a small effect size. Data from numerous studies suggests that men tend to have self-esteem scores about .22 (two tenths) SD higher than those of women (i.e., Cohen’s d = .22). This is a small effect. Figure 12.13 shows the overlap between these two distributions of scores. The normal distribution on the left represents self-esteem scores for women, with the mean located at d = 0. The distribution on the right represents self-esteem scores for men, with the mean located at d = .22.

Figure 12.13 Small Cohen’s d Effect Size and Overlap of Female (Left) Versus Male (Right) Distributions of Self-Esteem Scores

Figure 12.14 Large Cohen’s d Effect Size and Overlap of Female (Left) Versus Male (Right) Distributions of Heights

This small effect would not be noticed in real life for two reasons. First, we can’t evaluate one another’s self-esteem very accurately. Second, even if we had stickers on our faces showing self-esteem scores, we would still have a very difficult time seeing a difference between men and women, because there is substantial overlap between these two distributions. The value d = 0 shows the mean for the female distribution (the distribution on the left). More than half of the men have higher scores than the mean score for women, d = 0, but not many more than half. Slightly fewer than half of the men have self-esteem scores below the mean score for women (d = 0).

A large effect size is shown in Figure 12.14. On the basis of data from the United Kingdom, mean male height is about 2 standard deviations higher than mean female height (i.e., Cohen’s d = 2.00). Large values of Cohen’s d (such as d = 1.00 and higher) correspond to real-world differences that people are likely to notice in everyday experience. Here also, the left-hand distribution represents height scores for women (with mean located at d = 0) and the right-hand distribution is height scores for men (with mean located at d = 2.00). Not very many women have heights that are higher than the mean height for men. There is much less overlap between the distributions than for the small effect size shown in Figure 12.13. Sex differences in height are large enough to be noticeable in everyday life, even though some women are taller than some men.
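The amount of overlap in these two examples can be quantified with a short sketch. It assumes, as the figures do, that scores in both groups are normally distributed with equal standard deviations; under that assumption, the proportion of the higher-scoring group that falls above the other group’s mean is the standard normal cumulative probability at d. The function name is illustrative.

```python
from statistics import NormalDist

# Assumption: both groups are normal with the same standard deviation,
# so the means are d standard deviations apart on a common scale.


def share_above(d: float) -> float:
    """Proportion of the higher-scoring group above the lower group's mean."""
    return NormalDist().cdf(d)


for label, d in (("self-esteem", 0.22), ("height", 2.00)):
    men_above_female_mean = share_above(d)
    women_above_male_mean = 1 - share_above(d)
    print(
        f"{label}: d = {d:.2f}, "
        f"men above female mean = {men_above_female_mean:.3f}, "
        f"women above male mean = {women_above_male_mean:.3f}"
    )
```

For d = .22, only about 59% of men score above the female mean; for d = 2.00, about 98% of men are taller than the mean woman, and only about 2% of women are taller than the mean man.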

12.10.5 Computation of Effect Sizes for Heart Rate and Caffeine Data

For the pooled-variances t test using the heart rate and caffeine data, we have the following information from the SPSS output in Figure 12.11:

n1 = 10, n2 = 10

df = 18

M1 – M2 = 57.8 – 67.9 = –10.1

Because Group 2 is the caffeine group, we can say that mean heart rate was 10.1 points higher for the caffeine group.

t = –2.754.

s1 = 7.208 and s2 = 9.085.

To obtain η2:

$$\eta^2 = \frac{t^2}{t^2 + df} = \frac{(-2.754)^2}{(-2.754)^2 + 18} = \frac{7.585}{25.585} = .296$$

About 30% of the variance in heart rate in this study was predictable from caffeine dose.

To obtain rpb:

Take the square root of η2: rpb = √.296 = .544.

To obtain Cohen’s d:

First find sp from s1 and s2.

When n1 = n2, we can use Equation 12.12 (if n’s are not equal, use Equation 12.11):

$$s_p^2 = \frac{s_1^2 + s_2^2}{2} = \frac{7.208^2 + 9.085^2}{2} = \frac{51.955 + 82.537}{2} = \frac{134.492}{2} = 67.246$$

To obtain sp, take the square root of sp2: sp = 8.200.

Then d = (M1 – M2)/sp = –10.1/8.200 = –1.23.

This value of d tells us that the mean of the no-caffeine group was 1.23 standard deviations lower than the mean of the caffeine group (and the mean of the caffeine group was 1.23 standard deviations higher than the mean of the no-caffeine group).
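The hand computations in this section can be reproduced in a few lines. The sketch below uses only the summary values reported above (t = –2.754, df = 18, the two means, and the two standard deviations) and assumes equal group sizes, so that the pooled variance is the simple average of the two group variances. Variable names are illustrative.

```python
import math

# Values from the SPSS output for the heart rate and caffeine data.
t_value, df = -2.754, 18
m1, m2 = 57.8, 67.9
s1, s2 = 7.208, 9.085

# Eta squared: proportion of variance in heart rate predictable from group.
eta_squared = t_value**2 / (t_value**2 + df)

# Point biserial r: square root of eta squared (unsigned).
r_pb = math.sqrt(eta_squared)

# Pooled standard deviation when n1 = n2 (assumption: equal group sizes).
s_pooled = math.sqrt((s1**2 + s2**2) / 2)

# Cohen's d: mean difference in pooled standard deviation units.
cohens_d = (m1 - m2) / s_pooled

print(f"eta squared = {eta_squared:.3f}")
print(f"rpb         = {r_pb:.3f}")
print(f"sp          = {s_pooled:.3f}")
print(f"Cohen's d   = {cohens_d:.2f}")
```

Rounded to the same number of decimal places, these results match the values obtained by hand: η2 = .296, rpb = .544, sp = 8.200, and d = –1.23.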

Using Cohen’s standards5 to evaluate effect size in Table 12.1, all these values are judged to be large to very large effect sizes.

12.10.6 Summary of Effect Sizes

Table 12.1 summarizes the characteristics of these effect sizes. Effect size values do not depend on N. By comparison, the magnitude of the independent-samples t ratio does depend on N. If other factors are held constant, as N increases, t also increases in absolute magnitude. In a few respects, t is similar to some effect sizes: it is unit free or standardized and not in the original units of measurement; it has a sign that indicates the direction of the relationship (which group mean is higher). By itself, t cannot be interpreted as a proportion of variance; however, t and df can be converted into η2, which does provide information about proportion of variance. A t ratio does not have a limited range of possible values. Neither a t ratio nor its accompanying p value provides information about effect size.

· Researchers report t, df, and p as information about statistical significance; these numbers do not tell us anything about effect size. On the basis of t and p, we make judgments only about statistical significance (and not about significance or importance in practical, clinical, or real-world domains).

· Researchers should also report one or more of the effect sizes listed above as information about strength or size of effect (independent of sample size). Kirk (1996) suggested that we can interpret these values in terms of clinical or practical or real-world “significance.” Unfortunately both researchers and research consumers sometimes confuse statistical significance (p < .05) with practical, clinical, or real-world “significance.” I prefer to speak of practical, clinical, or real-world importance (and avoid use of the potentially confusing term significance6).

Table 12.1 Summary of Characteristics of Effect Size Indexes

We can also ask whether a finding has theoretical value or importance. If variable X accounts for more than 50% of the variance in a Y outcome, we might decide that variable X should be included in our theory about what causes Y. On the other hand, if variable X can account for only 1% of the variance in Y (even if X is a “statistically significant” predictor of Y), we would want to include more useful explanatory variables in a theory that attempts to explain Y. There is no clear cutoff for a minimum proportion of explained variance.

Cohen (1988) suggested guidelines for interpretations of effect sizes; Table 12.2 summarizes these labels. You may want to compare this with Table 10.3 in Chapter 10, which includes some additional information about the way effect sizes are related to whether effects are detectable in everyday life. These labels are based on recommendations made by Cohen for the evaluation of effect sizes in social and behavioral research; however, in other research domains, it might make sense to use different labels for effect size (that is, require larger values of rpb and other effect sizes before calling them “medium” or “large” effects). Effect size guidelines suggested by Cohen differ slightly when given in terms of different effect size indexes.

Table 12.2 Suggested Verbal Labels for Cohen’s d and Other Effect Size Indexes

For r < .30, effects are often not detectable by informal observation in everyday life. For instance, a sex difference in self-esteem ratings, d = .22, is too small to be noticed in everyday life. For r > .50, effects may be detectable in everyday life (for instance, the sex difference in height, with d = 2.00, is something people notice in everyday life).

Effect sizes have three major uses:

1. At least one index of effect size should be reported with every statistical significance test. For the independent-samples t test, it is common to report η2, rpb, or Cohen’s d. When the dependent variable is measured in meaningful units, discussion should also focus on the M1 – M2 difference as a way to think about the clinical or practical or real-world importance of the finding.

2. When you plan future research, you can use effect sizes from past research to estimate the minimum sample size you need to have adequate statistical power in your planned study. This is called statistical power analysis. Usually people want to have at least 80% power (i.e., approximately 80% chance of obtaining a statistically significant outcome for the guessed value of population effect size, such as η2). When a study has such small n’s that there is a very low probability of obtaining a statistically significant outcome given the population effect size, it is called underpowered.

3. When an author summarizes past research, he or she obtains and combines (averages) effect size information for each of dozens or hundreds of studies. This is called meta-analysis. For example, we might want to know whether mean depression after therapy for patients differs across numerous studies that compare client-centered therapy (treatment) with no therapy (control). An effect size such as Cohen’s d or rpb provides important information about the direction of difference (there might be a few studies in which mean depression was lower for the no-therapy group). If past studies have not reported effect sizes, effect sizes can almost always be obtained from other numerical results in the papers. In meta-analysis, it is important to include direction of effect.
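The second use listed above, statistical power analysis, can be illustrated with a rough calculation. The sketch below is an approximation, not the exact procedure used by power analysis software: it replaces the t distribution with the normal distribution, using n per group ≈ 2[(z for α/2 + z for power)/d]², and therefore gives slightly smaller n’s than a t-based calculation would. The function name and the example values of d are illustrative.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n needed in each group for a two-tailed test.

    Normal approximation (an assumption of this sketch); exact
    t-based methods give slightly larger sample sizes.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical z for two-tailed alpha
    z_power = z.inv_cdf(power)           # z corresponding to desired power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"guessed d = {d:.1f}: about {n_per_group(d)} participants per group")
```

The pattern matters more than the specific numbers: as the guessed population effect size becomes smaller, the sample size required for 80% power increases sharply.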

12.11 FACTORS THAT INFLUENCE THE SIZE OF T

12.11.1 Effect Size and N

Formulas for statistical significance tests such as the independent-samples t test can be written in a way that makes it clear that the t test combines information about effect size and sample size or N or df (Rosenthal & Rosnow, 1991). In words:

Magnitude of t = effect size × size of study (N or df)

(12.22)

If effect size is held constant, the expected magnitude of t increases as N increases. If N is held constant, the expected magnitude of t increases as effect size increases. With a little bit of thought it should be clear that:

When effect size and N are both very large, the value of t will almost always be large enough to judge the outcome statistically significant (and values of p will be very small).

When effect size and N are both extremely small, the value of t will almost always be too small to judge the outcome statistically significant (and values of p will be large).

In practice, when effect size is very small, you need a larger N to have a reasonable chance of obtaining a statistically significant outcome. When the effect size is very large, you may be able to obtain a statistically significant outcome using quite a small sample.

A specific formula for the independent-samples t test given by Rosenthal and Rosnow (1991) is:

$$t = d \times \frac{\sqrt{df}}{2}$$

(12.23)

where d is Cohen’s d, calculated as:

$$d = \frac{M_1 - M_2}{s_p}$$

If we substitute the formula for Cohen’s d into Equation 12.23, we have:

$$t = \frac{M_1 - M_2}{s_p} \times \frac{\sqrt{df}}{2}$$

(12.24)

The specific values of t that occur in studies will vary because of sampling error. This equation tells us that if we hold other terms in the equation constant:

· As df (sample size) goes up, t tends to increase (and p tends to become smaller).

· As (M1 – M2) goes up, t tends to increase (and p tends to become smaller).

On the other hand:

· As sp goes up, t tends to decrease (and p tends to increase).

Notice an important implication of Equation 12.24. Even when effect sizes such as (M1 – M2) or Cohen’s d are extremely small, as long as they do not turn out to be exactly zero in your sample, you can judge even very small mean differences statistically significant for larger values of N. You cannot use Equation 12.24 to predict your outcome value of t exactly from sample size and effect size, because this equation doesn’t take sampling error into account, and we don’t know population effect size. However, you can substitute different values into Equation 12.24 to get a sense of how increase in sample size makes it possible to detect very small effect sizes (i.e., judge them to be statistically significant).

For instance, research that compares mean IQ for single-birth children (Group 1) with mean IQ for identical twins (Group 2) yields sample means of about M1 = 100 and M2 = 99. (A 1-point difference in IQ is not noticeable in everyday life; you might notice IQ score differences of 20 or 30 points.) For most IQ tests, s = 15. Using Equation 12.24, we can compare potential differences in outcomes for a study with df = 100 versus a study with df = 10,000. With df = 100, the 1-point mean IQ difference is unlikely to yield a t value large enough to be statistically significant. When df = 10,000, the obtained t ratio is likely to be large enough to judge this 1-point difference statistically significant. (The t values are not exact; this equation does not take sampling error into account.)

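For df = 100:

$$t = \frac{100 - 99}{15} \times \frac{\sqrt{100}}{2} = .067 \times 5 = .33$$

For df = 10,000:

$$t = \frac{100 - 99}{15} \times \frac{\sqrt{10{,}000}}{2} = .067 \times 50 = 3.33$$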
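The same comparison can be extended to other sample sizes with a short sketch. It assumes that Equation 12.24 has the Rosenthal and Rosnow form t = [(M1 – M2)/sp] × (√df/2), and, like that equation, it ignores sampling error, so the resulting t values are only rough expectations. The function name is illustrative.

```python
import math


def approx_t(mean_diff: float, s_pooled: float, df: int) -> float:
    """Rough expected t: Cohen's d times sqrt(df) / 2 (ignores sampling error)."""
    d = mean_diff / s_pooled
    return d * math.sqrt(df) / 2


# IQ example: 1-point mean difference, within-group standard deviation of 15.
for df in (100, 1_000, 10_000):
    t_value = approx_t(100 - 99, 15, df)
    print(f"df = {df:6d}: approximate t = {t_value:.2f}")
```

The effect size is identical in every row (d is about .07); only the sample size changes, and that alone moves t from a clearly nonsignificant value to one that exceeds conventional critical values.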

In practice, researchers sometimes can control sample size; sometimes they can control the magnitude of the other two elements in Equation 12.24. Decisions about “dosage level” or type of treatment often can increase the M1 – M2 difference. Decisions about the kinds of people to include in the study and the degree of standardization of data collection situations can influence the magnitude of sp, the within-group standard deviation.

Researchers do not always have control over sample size. Sometimes researchers do not have funds to pay participants, treatments or data collection procedures are very costly, or the study has to be completed in a very short period of time. When a researcher knows that the sample cannot be large, he or she needs to think about ways to increase the (M1 – M2) difference and/or decrease sp.

On the other hand, sometimes the results of large-N studies are reported in misleading ways. When N is very large, an effect can be judged statistically significant even when the effect size is too small to be of any real life or clinical or practical importance. Consider the twin versus individual child IQ study again. When the difference between mean IQs is tested in samples of 10,000 or more, it is almost always statistically significant. However, this difference could be deemed too small to be of any practical or clinical importance.

Unfortunately, researchers who conduct large-N studies and obtain p values < .001 sometimes call their results “extremely significant.” (Do not say that!) Here’s the problem. In everyday life, when we use the word significant, we mean large or worthy of notice (or at the very least detectable). When we hear the word significant we tend to assume that differences between groups are large enough to matter to people and clinicians. Calling the results of a study “highly significant” can mislead many readers into thinking that the effects are large enough to be valuable or at least noticeable in real life.

Statistical significance and practical or clinical importance do not always go together, particularly when N is extremely large. Here’s how to avoid confusion:

· Emphasize effect sizes in reports (instead of statistical significance tests).

· Explain effect sizes clearly and evaluate them honestly.

· Discuss simple information such as M1 – M2 when units of measurement are meaningful.

· Never say “extremely significant.”

12.11.2 Dosage Levels for Treatment, or Magnitudes of Differences for Participant Characteristics, Between Groups

The value of M1 – M2 can be affected by design decisions that involve the types of groups, types of treatment, or dosages of treatment for the two groups. Consider these two hypothetical studies of caffeine effects on heart rate:

Study A: Group 1 receives 0 mg caffeine, Group 2 receives 50 mg caffeine

Study B: Group 1 receives 0 mg caffeine, Group 2 receives 500 mg caffeine

Assuming that caffeine does have an effect on heart rate, we would expect the means for heart rate to be much farther apart in Study B than in Study A. By increasing the difference between treatment dosage amounts, researchers can often increase M1 – M2 and, therefore, other factors being equal, increase t.

Studies of naturally occurring groups can also be thought of in these terms. Suppose you want to study age group (X) differences in mean reaction time (Y).

Study A: Group 1 is ages 20–29, Group 2 is ages 30–39

Study B: Group 1 is ages 20–29, Group 2 is ages 70–79

Other factors being equal, you would expect mean reaction times to differ much more in Study B than in Study A.

Researchers must be very careful about something else that can influence the magnitude of the M1 – M2 difference: confounds of other variables with type or dosage of treatment. In the 0 mg caffeine versus 150 mg caffeine study, if the people in the 0 mg caffeine group have heart rate measured in a very relaxing setting, while those in the 150 mg group are assessed in a stressful setting, there is a complete confound between stress and caffeine dosage. Whether it is statistically significant or not, we cannot interpret a large M1 – M2 difference as information about the effects of caffeine. Some or all heart rate differences might be due to the amount of stress in the situation. In this example, a confound of high stress with high caffeine would make the M1 – M2 difference larger. Some confounds may make an M1 – M2 difference smaller (for example, if heart rate was measured by a nasty and threatening experimenter in the 0 mg caffeine group and by a relaxed and friendly experimenter in the 150 mg caffeine group, the effects of caffeine and the confound might cancel each other out and lead to a small M1 – M2 difference). The presence of one or more confounds makes an M1 – M2 difference, and the t ratio based on that difference, uninterpretable.

12.11.3 Control of Within-Group Error Variance

Researcher decisions can also influence sp, the pooled or averaged within-group standard deviation. The within-group standard deviation sp is often called experimental error. Experimental error tends to be large in drug studies where participants within each treatment group differ from one another on characteristics such as age, anxiety, history of drug use, and so forth. Experimental error is also large if participants within the same treatment groups are tested in different ways in different situations. Consider the caffeine/heart rate study again: Group 1 receives no caffeine, and Group 2 receives 150 mg caffeine. Now consider these different scenarios.

Study A: Participants within both groups are very similar in age, health, and amount of past caffeine consumption; all are nonsmokers; all have average fitness; none are evaluated during midterms or final exams; and none are tested by an anxious experimenter.

Study B: Participants within both groups vary in age, health, and amount of past caffeine consumption; some smoke, some do not; they have varying levels of aerobic fitness; some are tested during midterms and finals, others before spring break; and several different experimenters interact with the participants, some of whom are much more anxious than others.

In Study A, if participant characteristics are very similar or homogeneous, and experimental procedures are standardized and consistent, participants in each group should not show much variation in heart rates. Thus, in Study A, sp should be relatively small. On the other hand, in Study B, people who are in the same treatment group have different health backgrounds and are tested under different circumstances; you would expect wide variation in their heart rates. In Study B, sp would be relatively large. If other factors (effect size and N) are held constant, there would be a better chance of obtaining a large t value for Study A than for Study B. Recruiting similar participants can help with statistical power, but it also reduces generalizability of findings. The participants in Study A are not diverse.
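The contrast between Study A and Study B can be sketched numerically. The short Python example below assumes the pooled-variances form of the t ratio, t = (M1 – M2) / (sp × √(1/n1 + 1/n2)). The means, the two values of sp, and the group sizes are hypothetical numbers chosen only for illustration; they are not results from any study in this chapter.

```python
import math

def pooled_t(m1, m2, sp, n1, n2):
    """t ratio for two independent groups, given the pooled within-group SD (sp)."""
    se_diff = sp * math.sqrt(1 / n1 + 1 / n2)  # standard error of M1 - M2
    return (m1 - m2) / se_diff

# Hypothetical values: the same 10-bpm mean difference and the same n in both
# studies; only the within-group standard deviation differs.
t_study_a = pooled_t(58.0, 68.0, sp=6.0, n1=10, n2=10)   # homogeneous participants
t_study_b = pooled_t(58.0, 68.0, sp=15.0, n1=10, n2=10)  # heterogeneous participants

print(f"Study A (sp = 6):  t = {t_study_a:.2f}")
print(f"Study B (sp = 15): t = {t_study_b:.2f}")
```

With M1 – M2 and n held constant, the smaller sp in the Study A scenario yields the larger absolute value of t.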

12.11.4 Summary for Design Decisions

Members of my undergraduate class became upset when I explained the way research design decisions can affect the values of t. They said, “You mean you can make a study turn out any way you want?” The answer is, within some limits, yes. The t ratio from the independent-samples t test is likely to be large in the following situations. (For each factor, such as N, add the condition “other factors being equal.”)

· N is large (a very large N study can yield a statistically significant t ratio even if the population effect is very small).

· Population effect size such as η2 is large (this is often related to treatment dosages or types of participants being compared).

· M1 – M2 is large (however, M1 – M2 is not interpretable if confounds are present).

· sp is small (this happens when participant characteristics and assessment situations are homogeneous within groups).

Depending on their research questions and resources, the degree to which researchers can control each of these factors may vary.
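These factors can be read directly off the t ratio. As a sketch, assuming the pooled-variances form of the test and, for the simplified version on the right, equal group sizes (n per group) with Cohen’s d defined as (M1 – M2)/sp:

```latex
t = \frac{M_1 - M_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}
\quad\Longrightarrow\quad
t = \frac{M_1 - M_2}{s_p}\,\sqrt{\frac{n}{2}} = d\,\sqrt{\frac{n}{2}}
\qquad (n_1 = n_2 = n)
```

Written this way, t increases as M1 – M2 increases, as sp decreases, and as n increases, other factors being equal.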

12.12 RESULTS SECTION

Following is an example of a “Results” section for the study of the effect of caffeine consumption on heart rate.

Results

An independent-samples t test was performed to assess whether mean heart rate differed significantly for a group of 10 participants who consumed no caffeine (Group 1) compared with a group of 10 participants who consumed 150 mg of caffeine. Preliminary data screening indicated that scores on heart rate were reasonably normally distributed within groups. There were two high-end outliers in Group 1, but they were not extreme; outliers were retained in the analysis. The mean heart rates differed significantly, t(18) = –2.75, p = .013, two tailed. Mean heart rate for the no-caffeine group (M = 57.8, SD = 7.2) was about 10 beats per minute lower than mean heart rate for the caffeine group (M = 67.9, SD = 9.1). The effect size, as indexed by η2, was .30; this is a very large effect. The 95% CI for the difference between sample means, M1 – M2, had a lower bound of –17.81 and an upper bound of –2.39. This study suggests that consuming 150 mg of caffeine may significantly increase heart rate, with an increase on the order of 10 bpm.

The assumption of homogeneity of variance was assessed using the Levene test, F = 1.57, p = .226; this indicated no significant violation of the equal variance assumption. Readers generally assume that the “equal variances assumed” version of the t test (also called the pooled-variances t test) was used unless otherwise stated. If you see df reported to several decimal places, this tells you that the “equal variances not assumed” t test was used.
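Readers who want to check the values in a Results section like this one can do so from the reported summary statistics alone. The following Python sketch assumes that SciPy is available and uses only the rounded means, standard deviations, and n’s given above, so it should reproduce the reported t, p, η2, and CI up to rounding error. The fractional df at the end uses the Welch-Satterthwaite approximation, which I am assuming is the adjustment behind the “equal variances not assumed” output; it is computed here only to show why that df is usually not a whole number.

```python
import math
from scipy import stats

# Summary statistics as reported in the Results section (rounded values).
m1, sd1, n1 = 57.8, 7.2, 10   # Group 1: no caffeine
m2, sd2, n2 = 67.9, 9.1, 10   # Group 2: 150 mg caffeine

# Pooled-variances (equal variances assumed) t test from summary statistics.
result = stats.ttest_ind_from_stats(m1, sd1, n1, m2, sd2, n2, equal_var=True)
t_value, p_value = result.statistic, result.pvalue
df = n1 + n2 - 2

# Eta squared computed from t and df.
eta_sq = t_value**2 / (t_value**2 + df)

# 95% CI for M1 - M2, based on the pooled standard error of the difference.
sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
se_diff = sp * math.sqrt(1 / n1 + 1 / n2)
t_crit = stats.t.ppf(0.975, df)
ci_lower = (m1 - m2) - t_crit * se_diff
ci_upper = (m1 - m2) + t_crit * se_diff

# Welch-Satterthwaite df (assumed form of the "equal variances not assumed" df).
v1, v2 = sd1**2 / n1, sd2**2 / n2
df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(f"t({df}) = {t_value:.2f}, p = {p_value:.3f}, eta squared = {eta_sq:.2f}")
print(f"95% CI for M1 - M2: [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"Welch df = {df_welch:.3f}")
```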

12.13 GRAPHING RESULTS: MEANS AND CIS

Cumming and Finch (2005) suggested that authors should emphasize confidence intervals along with effect sizes. Graphs of CIs help focus reader attention on these. Several types of CI graphs can be presented for the independent-samples t test. We could set up a graph of the CI for the (M1 – M2) difference using either an error bar or a bar chart. The lower and upper limits of this CI are provided in the independent-samples t test output. It is more common to show a CI for each of the group means (M1 and M2). This can be done with either the SPSS error bar or bar chart procedure. To obtain an error bar graph for M1 and M2, make the menu selections shown in Figure 12.15, Figure 12.16, and Figure 12.17.

In Figure 12.18 the separate vertical lines for each group (no caffeine, 150 mg caffeine) have two features. The dot represents the group mean. The T-shaped bars identify the lower and upper limits of the 95% CI for each group. Be careful when you examine error bar plots in journals or conference posters. Error bars that resemble the ones in Figure 12.18 sometimes represent the mean ± 1 standard deviation, or the mean ± 1 SEM, instead of a 95% CI. Graphs should be clearly labeled so that viewers know what the error bars represent.
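To see how much the choice of error bar matters, the three kinds of bars can be computed for the same group. This Python sketch assumes SciPy is available and uses the rounded summary statistics for the no-caffeine group (SD = 7.2, n = 10); the function name is hypothetical.

```python
import math
from scipy import stats

def error_bar_half_widths(sd, n, level=0.95):
    """Half-width of three common error bars for one group mean."""
    sem = sd / math.sqrt(n)                            # standard error of the mean
    t_crit = stats.t.ppf(1 - (1 - level) / 2, n - 1)   # critical t with df = n - 1
    return {"1 SD": sd, "1 SEM": sem, "CI": t_crit * sem}

# Rounded summary statistics for the no-caffeine group.
widths = error_bar_half_widths(sd=7.2, n=10)
for label, half_width in widths.items():
    print(f"mean +/- {label}: half-width = {half_width:.2f} bpm")
```

For a group of this size, bars that show ± 1 SEM are less than half as long as bars that show a 95% CI, so a graph that does not label its error bars can be badly misread.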

Figure 12.15 SPSS Menu Selections for Error Bar Procedure

Figure 12.16 Error Bar Dialog Box

Cumming and Finch (2005) pointed out that when two 95% CIs, like the ones in Figure 12.18, do not overlap, you know that the t test for the difference between group means must be statistically significant using α = .05, two tailed. On the other hand, if the CIs do overlap, the t test that compares group means may still be statistically significant (because the CI for [M1 – M2] has a larger df than the CIs for M1 and for M2).

Figure 12.17 Define Simple Error Bar: Summaries for Groups of Cases Dialog Box

Figure 12.18 Error Bars for Mean Heart Rates in Hypothetical Caffeine Experiment

A bar chart is another way to represent information about CIs. The menu selections to open the bar chart procedure were shown earlier (<Graphs> → <Legacy Dialogs> → <Bar>). In the Define Simple Bar: Summaries for Groups of Cases dialog box, in Figure 12.19, select the radio button for “Other statistic (e.g., mean)” and move the dependent variable name (heart rate) into the box labeled “Variable.” It will appear as MEAN([hr]). The height of each bar will correspond to the mean heart rate for one group. Enter the name of the group or category variable into the box labeled “Category Axis.” Click the Options button. In the Options dialog box, also shown in Figure 12.19, check the box for “Display error bars.” Leave the default radio button selection under “Confidence Intervals” as 95.0 for “Level (%),” unless otherwise desired. This will produce a 95% CI for each group mean.

The resulting bar chart appears in Figure 12.20. By default, SPSS uses 0 as the starting value for the Y axis. When bar charts were used to represent the frequency of cases for each group earlier, using 0 as the lowest value for Y was recommended; cutting out large portions of the Y axis that represent possible values for Y can yield a graph that exaggerates the magnitude of differences in group sizes.

Figure 12.19 Define Simple Bar: Summaries for Groups of Cases, With 95% CI

Figure 12.20 Bar Chart for Treatment Group Means With 95% CI

Figure 12.21 Edited Bar Chart: 95% CIs for Two Group Means

When bars represent group means, starting the Y axis at 0 often does not make sense. For heart rate, it would make sense to use the lowest value for heart rate that you could call a normal healthy heart rate as your minimum. In this situation it would be reasonable to use a value such as 40 as the lowest value marked on the Y axis. This change can be made in the chart editor (commands are not shown). The edited bar chart appears in Figure 12.21.

12.14 DECISIONS ABOUT SAMPLE SIZE FOR THE INDEPENDENT-SAMPLES T TEST

Statistical power analysis provides a more formal way to address this question: How does the probability of obtaining a t ratio large enough to reject the null hypothesis (H0: μ1 = μ2) vary as a function of sample size and effect size? Statistical power is the probability of obtaining a test statistic large enough to reject H0 when H0 is false. Researchers generally want to have a reasonably high probability of rejecting the null hypothesis; power of 80% is sometimes used as a reasonable guideline. Cohen (1988) provided tables that can be used to look up power as a function of effect size and n or to look up n as a function of effect size and the desired level of power.

An example of a power table that can be used to look up the minimum required n per group to obtain adequate statistical power is given in Table 12.3. This table assumes that the researcher will use the conventional α = .05, two tailed, criterion for significance. For other alpha levels, tables can be found in Jaccard and Becker (2009) and Cohen (1988). To use this table, the researcher must first decide on the desired level of power (power of .80 is often taken as a reasonable minimum). Then, the researcher needs to make an educated guess about the population effect size that the study is designed to detect. In an area where similar studies have already been done, the researcher may calculate η2 values on the basis of the t or F ratios reported in published studies and then use the average effect size from past research as an estimate of the population effect size. (Recall that η2 can be calculated by hand from the values of t and df using Equation 12.19 if the value of η2 is not reported in the journal article.) If no similar past studies have been done, the researcher can make an educated guess; in such situations, it is safer to guess that the effect size in a new research area may be small.
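The conversion from published t ratios to η2 values is easy to automate. This Python sketch assumes the usual relation η2 = t2 / (t2 + df), which I take to be the content of Equation 12.19. The three (t, df) pairs standing in for past studies are hypothetical.

```python
def eta_squared_from_t(t, df):
    """Eta squared recovered from a reported t ratio and its degrees of freedom."""
    return t**2 / (t**2 + df)

# Check against the caffeine example reported in the Results section.
eta_caffeine = eta_squared_from_t(-2.75, 18)

# Hypothetical (t, df) pairs standing in for published studies in one area.
past_studies = [(2.10, 38), (3.40, 58), (1.75, 28)]
eta_values = [eta_squared_from_t(t, df) for t, df in past_studies]
mean_eta = sum(eta_values) / len(eta_values)

print(f"Caffeine example: eta squared = {eta_caffeine:.2f}")
print("Past studies:", [round(e, 3) for e in eta_values])
print(f"Average eta squared across past studies = {mean_eta:.3f}")
```

The sign of t does not matter here, because t is squared.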

Table 12.3 Sample Size as a Function of Effect Size and Desired Statistical Power

Suppose a researcher believes that the population effect size is on the order of η2 = .20. Looking at the row that corresponds to power of .80 and the column that corresponds to η2 of .20, the cell entry of 17 indicates the minimum number of participants required per group to have power of about .80. In this situation, I would still suggest a minimum n per group of 25 to 30, to ensure robustness against possible violations of assumptions and to obtain reasonably narrow CIs for each group mean.
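A table lookup of this kind can be cross-checked by computing power directly from the noncentral t distribution. The Python sketch below assumes SciPy is available, equal n per group, α = .05 two tailed, and the approximate conversion d = 2√(η2 / (1 – η2)) from η2 to Cohen’s d. These are my assumptions, not necessarily the method used to construct Table 12.3, and the function names are hypothetical.

```python
import math
from scipy import stats

def d_from_eta_squared(eta_sq):
    """Approximate Cohen's d from eta squared, assuming equal group sizes."""
    return 2 * math.sqrt(eta_sq / (1 - eta_sq))

def power_two_sample_t(d, n_per_group, alpha=0.05):
    """Power of the two-tailed pooled-variances t test with equal n per group."""
    df = 2 * n_per_group - 2
    ncp = d * math.sqrt(n_per_group / 2)      # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability that t falls in either rejection region when H0 is false.
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

def min_n_per_group(d, target_power=0.80, alpha=0.05):
    """Smallest n per group whose power reaches the target."""
    n = 2
    while power_two_sample_t(d, n, alpha) < target_power:
        n += 1
    return n

d = d_from_eta_squared(0.20)
n_needed = min_n_per_group(d, target_power=0.80)
print(f"d = {d:.2f}; minimum n per group for power .80 = {n_needed}")
print(f"power at that n = {power_two_sample_t(d, n_needed):.3f}")
```

Under these assumptions the search should stop at the same n per group as the table entry cited above. Small discrepancies from published tables are possible for other cells, because tables differ in their rounding and approximations.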

SPSS has an add-on program to calculate statistical power. Java applets for statistical power for the independent-samples t test and many other procedures are available at http://www.stat.uiowa.edu/~rlenth/Power (Lenth, 2018). Federal agencies that provide research grants now expect statistical power analysis as a part of grant proposals; that is, researchers must demonstrate that given reasonable, educated guesses about effect size, the planned sample size is adequate to provide good statistical power (e.g., at least an 80% chance of judging the effect to be statistically significant). It is not worth undertaking a study if the researcher knows a priori that the sample size is probably not adequate to detect the effects.

Given the imprecision of procedures for estimating the necessary sample sizes, the values contained in this and the other power tables presented in this book are approximate. Larger n’s than the minimum suggested by power tables are often desirable. Even if statistical power analysis suggests that n less than 30 per group might be adequate, samples smaller than that are not advisable. When n’s are very small, consider a nonparametric test such as the Mann-Whitney U test, but note that this test requires other assumptions that may be difficult to satisfy in practice (Appendix 12A).

Do not conduct a post hoc (or postmortem) power analysis when you report results and publish comments such as “Given the sample effect size in my study, my results would have been statistically significant if I had a larger sample.” That is unwarranted speculation. However, if you can see that your sample size was too small, you will want to keep the need for larger samples in mind when you design future studies.

12.15 ISSUES IN DESIGNING A STUDY

12.15.1 Avoiding Potential Confounds

A confound of one (or more) other variables with your treatment variable makes the M1 – M2 difference uninterpretable. A confound may make M1 – M2 larger than it should be in some situations and smaller than it should be in other situations.

Suppose that you want to know whether patients have lower mean anxiety scores after Rogerian therapy (Group 1) or Freudian psychodynamic therapy (Group 2). Suppose that these two types of therapy are given by different therapists (Dr. Goodman does the Rogerian therapy and Dr. Deadwood does the psychodynamic therapy). This would be a perfect confound between therapist personality and ability and type of therapy. If the Group 1 patients do better than those in Group 2, we cannot tell whether this is due to differences in the type of therapy or differences between the two doctors. This is a perfect or complete confound, and it makes the results of this study uninterpretable. The M1 – M2 difference can be due to type of therapy, personality and ability of the therapist, or both. (Even if Dr. Goodman did the therapy in both groups, there could be problems, because she might have greater faith in one type of therapy than the other, and this could produce placebo or expectancy effects.)

Confounds do not have to be complete confounds to be problematic. Consider a group of patients in a drug study. If the drug group has 55% women and the placebo group has only 39% women, there is a partial confound between type of drug and sex. M1 might differ from M2 because Group 1 (the drug group) includes proportionally more women, while Group 2 (the placebo group) includes proportionally more men, instead of or in addition to any drug effects.

Confounds can be obvious, but sometimes they are subtle. Random assignment of participants to groups is supposed to make groups equivalent in composition, but sometimes this doesn’t work as well as expected. When background information is available about participants, it’s good to compare the groups to see whether they are equivalent.

Self-selection into treatment is problematic. If your study includes a meditation training group and a control group, and participants are allowed to choose their groups, you will probably have different kinds of people in the meditation group than in the no-treatment control group.

12.15.2 Decisions About Type or Dosage of Treatment

Researcher decisions about the types or amounts of treatments (or other group characteristics) can influence the M1 – M2 difference between means. Usually, researchers want to maximize this difference. However, there are limits. We cannot give human beings 10,000 mg of caffeine to maximize the effects of caffeine on heart rate (for ethical as well as practical reasons). It would not be useful to give rats amounts of artificial sweetener that would correspond to human consumption of 50 diet sodas per day, because that dosage would not correspond to any real-world situation.

If naturally occurring groups are compared (for example, older adults vs. younger adults), it will usually be easier to find differences when groups differ substantially. For instance, a study that compares reaction time between a group of persons ages 60 to 70 and a group of persons ages 20 to 30 is more likely to find a difference than a study that compares a group of persons in their 20s with a group in their 30s.

12.15.3 Decisions About Participant Recruitment and Standardization of Procedures

Researcher decisions about types of participants to recruit, and about standardization of procedures, can affect the magnitude of sp, the pooled or averaged within-group standard deviation. Recruiting homogeneous participants such as 18-year-old healthy men helps keep sp low (compared with studies with wider ranges of age and health), but it also limits the potential generalizability of results. It is a good idea to standardize situations and testing procedures to keep sp small, but rigid protocols can result in experiences that make the situation feel even more artificial.

12.15.4 Decisions About Sample Size

Sometimes participants or cases are difficult or costly to obtain. A neuroscience study might involve surgical procedures and lengthy training and testing procedures. In such situations, standardization of procedures and optimal choice of treatment dosage levels is particularly important.

When researchers have access to very large N’s (on the order of tens of thousands), there is a different problem. Even effects that are extremely small (when evaluated by looking at M1 – M2, or η2, or Cohen’s d) can be statistically significant when N is very large. Researchers should resist the temptation to overemphasize statistical significance in these situations. Clear information about effect size should be provided in terms readers can understand. This is particularly important when important real-life decisions (such as medical decisions) are at stake.
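A quick calculation shows how this happens. The Python sketch below assumes SciPy is available and uses a hypothetical population effect of d = 0.02 with a hypothetical n of 50,000 per group. It also assumes that the sample means and sp fall exactly at their expected values, so that t = d√(n/2).

```python
import math
from scipy import stats

d = 0.02            # hypothetical, extremely small standardized mean difference
n = 50_000          # hypothetical n per group (N = 100,000)

t_value = d * math.sqrt(n / 2)              # t ratio for equal n per group
df = 2 * n - 2
p_value = 2 * stats.t.sf(abs(t_value), df)  # two-tailed p value
eta_sq = t_value**2 / (t_value**2 + df)     # proportion of variance predicted

print(f"t({df}) = {t_value:.2f}, p = {p_value:.4f}")
print(f"eta squared = {eta_sq:.5f}")
```

The test is statistically significant by the conventional α = .05 criterion even though group membership predicts only about one hundredth of 1% of the variance in Y. This is why effect size, and not just p, needs to be reported.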

One possible reason why researchers have been slow to adopt the reporting of effect size information and CIs is that effect sizes are often embarrassingly low, while CIs are often embarrassingly wide.

To summarize: Researcher decisions about treatment type and dosage, and the presence of confounds, will affect the magnitude of M1 – M2. Confounds make M1 – M2 differences uninterpretable even if they are statistically significant. Researcher decisions about participant recruitment and procedures can reduce the magnitude of sp but may also reduce generalizability. Very low n’s result in underpowered studies, that is, studies in which a statistically significant t value is unlikely even if the null hypothesis is false. Very large n’s can lead to situations in which effects that are too small to have any real-world practical or clinical importance are judged statistically significant. In between these extremes, statistical power tables can help researchers evaluate the sample sizes needed for adequate statistical power.

12.16 SUMMARY

This chapter discussed a simple and widely used statistical test (the independent-samples t test) and provided additional information about effect size, statistical power, and factors that affect the size of t. The t test is sometimes used by itself to report results in relatively simple studies that compare group means on a few outcome variables; it is also used as a follow-up in more complex designs that involve larger numbers of groups or outcome variables.

A t-test value (and corresponding effect sizes) is not a fact of nature. Researchers have some control over factors that influence the size of t, in both experimental and nonexperimental research situations. Because the size of t depends to a great extent on our research decisions, we should be cautious about making inferences about the strength of effects in the real world on the basis of the obtained effect sizes in our samples.

For the independent-samples t test, researchers often report one of the following effect size measures: Cohen’s d, rpb, or η2. Eta squared is an effect size commonly used to do power analysis for future similar studies. When researchers want to summarize information across many past studies (as in a meta-analysis), rpb (often just called r) is often the effect size of choice. Past research has not always included effect size information, but readers can usually calculate effect sizes from the information in published journal articles.

Notice that the independent-samples t test, like correlation and regression, provides a partition of the total variance in Y outcome scores into two parts; η2 is the proportion of variance in Y that differs between groups (variance that may be due to different types or amounts of treatment). In regression, r2 was the proportion of variance in Y that could be linearly predicted from X. Similarly, (1 – r2) was the proportion of variance in Y that could not be linearly predicted from X; for the independent-samples t test, (1 – η2) is the proportion of variance in Y that is not predictable from group membership or from the score on the predictor variable.

The r2 and η2 are both called proportion of predicted (or sometimes explained) variance. Predicted variance is variance in Y that is related to scores on the X predictor variable. By contrast, (1 – r2) and (1 – η2) are the parts of the variance in Y that are not predictable from the X independent variable. These are interpreted as proportions of error variance.
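The partition of variance can be verified by hand for a small data set. In the Python sketch below, the ten scores are hypothetical. The code computes the total, between-group, and within-group sums of squares (SS), forms η2 as SSbetween / SStotal, and then confirms that the same value is obtained from t and df.

```python
from statistics import mean

# Hypothetical heart rate scores for two small groups.
group_1 = [52, 55, 58, 60, 64]
group_2 = [61, 66, 68, 71, 74]
groups = (group_1, group_2)
all_scores = group_1 + group_2
grand_mean = mean(all_scores)

# Partition of the total sum of squares into between- and within-group parts.
ss_total = sum((y - grand_mean) ** 2 for y in all_scores)
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((y - mean(g)) ** 2 for g in groups for y in g)

eta_sq = ss_between / ss_total      # proportion of predicted variance
error_proportion = 1 - eta_sq       # proportion of error variance

# Cross-check: the same eta squared from the pooled-variances t ratio.
df = len(all_scores) - 2
sp = (ss_within / df) ** 0.5
se_diff = sp * (1 / len(group_1) + 1 / len(group_2)) ** 0.5
t_value = (mean(group_1) - mean(group_2)) / se_diff
eta_sq_from_t = t_value**2 / (t_value**2 + df)

print(f"SS total = {ss_total:.1f} = {ss_between:.1f} + {ss_within:.1f}")
print(f"eta squared = {eta_sq:.3f}; from t and df = {eta_sq_from_t:.3f}")
```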

The term error in everyday life means “mistake.” In statistics, error has many different meanings, depending on context. Errors in prediction don’t happen because the data analyst made a mistake (although mistakes in data analysis can happen, of course). Errors in prediction happen because many other variables, other than the X variable used as a predictor, influence the scores on the Y outcome variable. Error refers collectively to all the variables in the world that are related to Y, but that we did not control in the study or include in the statistical analysis. This may clarify why proportions of error variance are so high in most research! Error also includes any chance or random or unpredictable elements in Y. If you go on to learn about analyses that include multiple predictor variables, you will see that use of multiple predictors sometimes reduces the proportion of error variance.

To describe the problem of error variance another way, consider the tongue-in-cheek Harvard Law of Animal Behavior: “Under carefully controlled experimental circumstances, an animal will behave as it damned well pleases.”

This chapter is long and detailed because it introduces issues that arise when comparing means across groups; many of the following chapters describe analyses that also compare means across groups. This set of analyses is called analysis of variance. The same issues (assumptions, data screening, effect size, and so forth) continue to be important for those analyses, and I’ll often refer you back to this chapter for more complete discussion.

APPENDIX 12A: A NONPARAMETRIC ALTERNATIVE TO THE INDEPENDENT-SAMPLES T TEST

One of several nonparametric alternatives to the independent-samples t test is the Mann-Whitney U test. It assumes that scores on the outcome or Y variable are at least at the ordinal (rank) level of measurement, that two groups are compared, and that observations are independent. The null hypothesis is that the means for ranks of scores (ranked within the entire sample, not within each group separately) are equal, or more specifically, that the two sample distributions have the same overall location. Because it uses ranks, this test is less affected by outliers than the independent-samples t test. The Mann-Whitney U test does not require the assumption of normal distribution shape. However, this test has a very restrictive assumption that may not often be satisfied in practice: that the shapes of the distributions of scores in the two samples are the same. That limits the usefulness of this test.

To conduct this test using SPSS, make the following menu selections, as shown in Figure 12.22: <Analyze> → <Nonparametric Tests> → <Independent Samples>. This opens the Nonparametric Tests: Two or More Independent Samples dialog box in Figure 12.23. This dialog box has two tabs. In the “Objective” tab (Figure 12.23), select “Compare medians across groups.” (This is misleading, because the null hypothesis for the Mann-Whitney is not that the medians are equal.) In the “Fields” tab (Figure 12.24), move the variable names into the appropriate panes; the quantitative outcome variable will be in the “Test Fields” pane, and the categorical variable will be in the “Groups” box. Click OK to run the analysis; results appear in Figure 12.25.

Figure 12.22 Menu Selections for Nonparametric Statistics With Independent Groups

Figure 12.23 Objectives for Independent-Samples Test

Figure 12.24 Specification of Variables (Assignment to Fields)

Figure 12.25 Mann-Whitney U Test Results

Details for computation of Mann-Whitney U are not presented here. When sample sizes are reasonably large (i.e., N > 30 for the entire data set), the Mann-Whitney U test begins by converting the Y scores (ignoring group membership) into ranks. These ranks replace the original Y scores in the two samples or groups. Nonparametric tests vary in the way they handle tied ranks. The null hypothesis is that these distributions of ranks are the same across the two groups.
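Although full computational details are beyond the scope of this appendix, the ranking step is simple enough to sketch. In the Python example below, the scores are hypothetical, and tied scores receive the average of the ranks they occupy. That is one common convention for ties, adopted here as an assumption. U for each group is then obtained from its rank sum as U = R – n(n + 1)/2.

```python
def midranks(values):
    """Rank values from 1 to N; tied scores get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of scores tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1   # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks

# Hypothetical heart rate scores; note the tie at 60 across the two groups.
group_1 = [52, 55, 58, 60, 64]
group_2 = [60, 66, 68, 71, 74]
n1, n2 = len(group_1), len(group_2)

# Rank all scores together, ignoring group membership.
ranks = midranks(group_1 + group_2)
r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])   # rank sums for each group

u1 = r1 - n1 * (n1 + 1) / 2
u2 = r2 - n2 * (n2 + 1) / 2
print(f"Rank sums: R1 = {r1}, R2 = {r2}")
print(f"U1 = {u1}, U2 = {u2} (U1 + U2 = n1 * n2 = {n1 * n2})")
```

A U value near 0 for one group means that almost all of its scores rank below the scores in the other group.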

SPSS does not display the Mann-Whitney U statistic, only the corresponding p value. When the hrcaffeine.sav data were analyzed, the result was p = .023. The distribution of heart rate ranks in the two samples (no caffeine vs. caffeine) differed significantly. Whether the independent-samples t test (Figure 12.11) or the Mann-Whitney U test (Figure 12.25) is used to compare heart rate across groups, in this data set, the conclusion was the same. That does not always happen. Parametric tests such as the t test may have greater statistical power than corresponding nonparametric tests in some situations, but parametric tests are not always more powerful.

Your decision whether to perform an independent-samples t test or a Mann-Whitney U test will depend on the most common practices in your discipline. Many journals accept the independent-samples t as an appropriate analysis even when some assumptions (such as normality of distribution shapes) are violated. If practitioners in your research area prefer to report nonparametric statistics such as Mann-Whitney U, it is probably better to follow common practice.

COMPREHENSION QUESTIONS
