CHAPTER 3 FREQUENCY DISTRIBUTION TABLES

3.1 INTRODUCTION

Most published research examines relationships between variables. However, before you can examine how scores on two or more variables are related, you need to examine scores for each variable separately. In this chapter, you’ll learn how frequency distribution tables (or, simply, frequency tables), set up by hand or using SPSS, are used to understand the behavior of scores for just one variable. Examples in this book use IBM SPSS® Version 25 unless otherwise noted. SPSS data sets used in this book can be downloaded from the textbook website at edge.sagepub.com/warner3e. Instructions for most analyses described in this book using the R open-access program are provided in Rasco (2020). Appendix 3A at the end of this chapter provides a brief introduction to SPSS and basic file management.

The following examples use the small set of hypothetical data in the SPSS data file named temphr10.sav that appears in Figure 3.1. File structure is similar for many other statistical programs. The data set in Figure 3.1 has 10 cases (rows 1 through 10) and five variables (columns 1 through 5). In most of my examples, each case corresponds to one person. Cases can be other kinds of entities (such as trees, asteroids, or nations). Each column contains the 10 scores for the variable named at the top (for example, the variable sex). Each row lists the scores for one case on all five variables.
This hypothetical data set includes five variables as examples of different types of variables. The categorical variable sex has codes of 1 = male and 2 = female (additional codes to represent categories such as nonbinary could have been included). Heart rate (hr) is a quantitative variable (number of beats per minute); integer values are reported. Temp_Fahrenheit and temp_Celsius are also quantitative variables; these represent body temperature reported to one decimal place (e.g., 97.1°F). Likert_rating is a score on a five-point scale indicating degree of agreement for this question: “I believe the president is doing a good job,” with the following response options: 1 = strongly disagree (SD), 2 = disagree (D), 3 = don’t know or neutral (N), 4 = agree (A), and 5 = strongly agree (SA). Notice the top-level menu bar in Figure 3.1; the four top-level SPSS menus discussed in examples are circled.

Before setting up tables or graphs or doing analyses, always pause and ask, What questions can this answer? For categorical variables such as sex we can ask:

Were any group membership codes “impossible” values? Did some individuals report scores that were not included in the possible response options?
Do some responses identify types of people the researcher does not intend to include in the study? For example, if a study requires participants to have normal vision, and some potential participants report limited vision, data for those persons may be excluded from later analyses.
How many categories or groups does the categorical variable have?
What are the relative sizes of these groups?
Which score is most common or frequent? Which group has the most cases? The mode corresponds to the group with the largest number of cases. (There can be more than one mode.)
Do all groups have large enough numbers of cases (for example, more than 10 cases) to be used in later analyses that compare groups?
Were there missing values? Did some individuals not provide responses?
Frequency tables provide answers to all these questions. In Figure 3.1, the categorical variable sex has two categories or groups (additional groups such as nonbinary sex could have been included). You can see that there are seven male respondents (people with scores of 1 on sex) and three female respondents (people with scores of 2 on sex). This tells you (obviously) that there are two groups in the sample and that there are more male than female respondents.

For quantitative variables such as heart rate, we can ask:

What are the lowest and highest scores? What is the range of scores? Range is the difference between the highest and lowest scores. This provides a preliminary idea of variability.
Are all the values plausible (for example, in a frequency table for heart rate, are all scores plausible values for heart rate)?
Were there any missing values?
What is an average or a typical score? Locating a score in the “middle” of a frequency table provides a preliminary idea about something more formally called central tendency.
For selected score values, what are the locations of these scores compared with the distribution in the table? For example, consider your own heart rate. What percentage of scores in the sample were lower than your heart rate? Is your heart rate unusually low or unusually high, or near the middle?

Frequency tables for quantitative variables provide preliminary information about variability and central tendency. In Figure 3.1, you can see that the lowest score for the quantitative variable heart rate (hr) is 62; the highest score is 82. The difference between highest and lowest scores, 20, is the range. This tells us something about variability in hr. Further information about central tendency (also called average) and about variability for quantitative variables can be obtained from descriptive statistics such as standard deviation and variance and from graphs such as histograms. These are discussed in the next two chapters.

A frequency table of scores for quantitative variables provides the context we need to evaluate an individual score (e.g., to decide whether that score is high relative to the sample). If your score on hr is 95, you can see that it is unusually high compared with this sample. In addition, thinking about frequency distributions is an important skill needed to understand later topics throughout the book.

3.2 USE OF FREQUENCY TABLES FOR DATA SCREENING

We look at frequency tables to get to know the data and to identify potential errors and problems with data before we do other analyses. This process is called preliminary data screening.
Introductory statistics textbooks often present students with sample data sets that are assumed not to have errors or missing information. In real-world applications, data often have problems, and it is important to look for them. These problems include:

Information is sometimes missing for some members of a sample.
Some scores can be unusually large or small; unusual or extreme scores can be problematic in some analyses.
Some groups contain too few cases for meaningful analyses.
Real data sets often contain mistakes (incorrect, or even impossible, score values).

Implausible or incorrect score values can arise in many ways. If a person is asked to report hair color and reports “plaid,” that is an unlikely response. If a heart rate is recorded as 275 beats per minute, the heart rate monitor is probably malfunctioning. However, a score value can appear plausible and still be incorrect; if a heart rate monitor is not properly calibrated, a person whose heart rate is given as 110 beats per minute might really have a heart rate of 95 beats per minute.

In an ideal world, researchers would proofread every single number in the data file against original data sources, if these exist. However, original data sources are sometimes not available, and complete proofreading of data may be extremely time-consuming and costly. At a minimum, spot checks (checking some score values in SPSS against original sources of data) provide an opportunity to detect problems that might be more widespread throughout the data set and would require much closer checking.

If you find scores that are clearly impossible or at least highly unlikely, the best option is to obtain valid scores from other sources if that is possible. If a student reports a grade point average (GPA) of 6 when college GPAs are on a 0-to-4 scale, and you have access to university records and can find that student’s GPA, you could use the university record to replace the incorrect self-reported value. If a respondent reports large numbers of silly or impossible values, you might decide to drop that person’s data entirely.

There is increasing concern about completeness and transparency in data reporting (Simmons, Nelson, & Simonsohn, 2011). Research reports should include information about problems detected during preliminary screening. This information is often obtained from frequency tables and graphs (such as histograms). The numbers or percentages of incorrect scores, extreme scores, and missing values should be reported. Authors also need to specify what, if anything, was done to remedy these problems. You might say something like “Data for five students were dropped because they reported unlikely or inconsistent information” or “Data from three sessions had to be dropped because of equipment malfunction.” Whatever problems with data you find, and whatever actions you take, you need to keep a detailed record and include this information in published research reports. Problems (such as missing values) are discussed in greater detail elsewhere (e.g., Volume II [Warner, 2020]).

3.3 FREQUENCY TABLES FOR CATEGORICAL VARIABLES

A frequency distribution table for a categorical variable provides information about the number of groups and the number of cases in each group.
There is one line in the frequency table for each possible score value; that line includes information about the number of cases and other information based on those numbers (such as percentages).

To set up a frequency table by hand, the first step is to list all possible score values; you need one line for each score value. For the categorical variable sex, the initial table setup is of the form shown in Table 3.1. Each score value identifies a group. The table is completed by counting the number of persons who have scores of 1 versus scores of 2 on the variable sex. For small data sets this can easily be done by hand. For larger data sets it is more convenient to use a program such as SPSS. In Table 3.1, on the basis of the data in Figure 3.1, you would enter 7 as the number of male participants and 3 as the number of female participants. The total number of cases in the sample, denoted N, is 10.

3.4 ELEMENTS OF FREQUENCY TABLES

The names of the elements of frequency tables are usually given as follows; there is some variation among computer programs and textbooks.

3.4.1 Frequency Counts (n or f)
The frequency is the number of scores in each group, for a categorical variable, or the number of persons who have a specific score value on a quantitative variable, such as hr = 73. Here are three common notations for frequency; these can be used interchangeably.

Lowercase n is the most common notation for number of scores per group in research reports. For the data in Figure 3.1, there were n = 7 male and n = 3 female participants. Numerical or word subscripts can be used to indicate the name or code number for each group, for example, n_1 or n_male = 7, and n_2 or n_female = 3.
Lowercase f, which stands for frequency, is a common notation in textbooks; f is the same as n. In Figure 3.1, f_male = 7 and f_female = 3. You will rarely see f in later discussions of statistics; group size is usually given as n.

Frequency (the entire word) is the notation SPSS uses to refer to n or f. In Figure 3.1, the frequency of male participants is 7; for female participants, the frequency is 3.

3.4.2 Total Number of Scores in a Sample (N)

Uppercase N is used in research reports to represent the total number of cases in a sample. Total (the word) is used in SPSS frequency tables to represent N. N is the sum of ns (or fs) across all groups. For the temphr10.sav file in Figure 3.1, N = 10 cases: N = n_male + n_female = 7 + 3 = 10.

3.4.3 Missing Values (if Any)

If scores are not available for some cases, the cells for those scores are usually left blank. (It is possible to identify some numerical value, such as 999, as an indicator of missing data.) SPSS counts frequencies for missing values and includes this count in the frequency table.
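The counting described in Sections 3.4.1 through 3.4.3 can also be sketched outside SPSS. The Python fragment below is only an illustration, not the procedure used in this book: the function name frequency_counts and the order of the scores are made up, and only the group counts (7 and 3) and the idea of 999 as a missing-value code come from the text.

```python
from collections import Counter

MISSING = 999  # assumed sentinel code for a missing response (see Section 3.4.3)

# Scores on sex for the 10 cases in Figure 3.1 (1 = male, 2 = female).
# The order of cases is hypothetical; only the counts 7 and 3 come from the text.
sex = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2]

def frequency_counts(scores, missing=MISSING):
    """Return (n per valid score value, number of missing scores, total cases)."""
    valid = [s for s in scores if s is not None and s != missing]
    counts = Counter(valid)               # n (or f) for each score value
    n_missing = len(scores) - len(valid)  # blank cells or sentinel codes
    return counts, n_missing, len(scores)

counts, n_missing, total_n = frequency_counts(sex)
for value in sorted(counts):
    print(f"score {value}: n = {counts[value]}")
print(f"missing = {n_missing}, N = {total_n}")
```

When there are no missing scores, the total returned here equals the sum of the group ns, which matches N = n_male + n_female = 10.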
3.4.4 Proportions

Values of n (or f) are usually converted into proportions (P) relative to N. Proportion is sometimes called rf (relative frequency; i.e., the frequency in one group relative to or as a part of total N). SPSS omits proportion and instead reports percentages. A proportion is obtained for each group by dividing the number of persons in that group (n_i or f_i or frequency) by the total number of people in the entire data set (N). Proportions are a useful way to summarize group size information for categorical variables.

To compute proportions for groups:

Find n_i or f_i for each group by counting cases in each group. This count can be called n_i or f_i or frequency.
To find the proportion P (or rf) for each group: P_i = n_i/N or f_i/N. The proportion P for each group is the number of people in that group divided by the total number of people in the data set or sample.

For the data in Figure 3.1, the proportions of people in the male and female groups are: P_male = 7/10 = .70 and P_female = 3/10 = .30.
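As a rough illustration of the formula P_i = n_i/N, the short Python sketch below converts group counts into proportions and percentages. The function name proportions is hypothetical; the counts are the ones given for Figure 3.1.

```python
def proportions(counts):
    """Convert group counts n_i into proportions P_i = n_i / N."""
    total_n = sum(counts.values())  # N is the sum of the group ns
    return {group: n / total_n for group, n in counts.items()}

# Group sizes from Figure 3.1.
counts = {"male": 7, "female": 3}
props = proportions(counts)

for group, p in props.items():
    # A percentage is the proportion multiplied by 100.
    print(f"{group}: P = {p:.2f}, percent = {100 * p:.0f}%")
```

Because every case belongs to exactly one group, the proportions sum to 1 (and the percentages to 100).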
3.4.5 Percentages

A percentage is a proportion multiplied by 100; for the data in Figure 3.1, the sample is 70% male and 30% female. The sum of percentages across all groups must equal 100%; for example, %male + %female = 100%. Proportion and percentage can be interpreted as probabilities. If one person is selected at random from this set of data, there is a 70% probability the person will be male and a 30% probability the person will be female.

3.4.6 Cumulative Frequencies or Cumulative Percentages

These are useful only for quantitative variables (not for categorical variables), and they are discussed later.

3.5 USING SPSS TO OBTAIN A FREQUENCY TABLE

To use SPSS to obtain a frequency table for the data in Figure 3.1, open the temphr10.sav file. (See Appendix 3A if you need to know more about getting started in SPSS.) At the top-level menu bar, make the following menu selections, as shown in Figure 3.2: <Analyze> → <Descriptive Statistics> → <Frequencies>.
These menu selections open the Frequencies dialog box that appears in Figure 3.3. The list of all variables in the data set appears on the left side. Move the variable sex to the pane on the right under the heading Variable(s). To do this, highlight the variable name sex using the cursor. An arrow appears that indicates movement from left to right; click the arrow to move the variable sex into the Variable(s) pane. The list of one or more variables in the Variable(s) pane tells SPSS which variable or variables you want to examine. You can obtain frequencies for more than one variable at a time; this example uses just one variable. To run the analysis, click OK in the main Frequencies dialog box.

Part of the SPSS frequency output appears in Figure 3.4. In this example, focus on the columns enclosed in the box; ignore the columns with headings “Valid Percent” and “Cumulative Percent.” Cumulative percentage is meaningful for quantitative variables, but it is not meaningful for categorical variables. The “Valid Percent” column provides information that differs from the “Percent” column only when there are missing values for the variable, as discussed in the next example and in Appendix 3B.

The SPSS results in Figure 3.4 confirm results obtained from by-hand counting of cases. There are two groups; the group sizes are 7 male (70%) and 3 female (30%) participants.

To summarize information about a categorical variable, a table could include the information in Table 3.2; however, this is more information than required for a research report. Given N, readers can deduce any one column from other columns. When proportion or percentage is reported, total N must also be reported. Unless we know N, we can’t reproduce the numbers (ns) in groups. In journal articles, it is common to report percentages to characterize a sample (in this example, we would just say that the sample is 70% male). In addition, readers need to know when percentages are based on very small samples.
Suppose someone claims that “80% of dentists prefer Gooey toothpaste.” That makes it sound as if at least 100 dentists were asked about preference, and 80 of them preferred Gooey. However, if 4 out of 5 dentists prefer Gooey, that is also 80%.
A research report might include information about the sex composition of the sample in its “Participants” section. This can be in sentence form: “The sample of N = 10 persons was 70% male and 30% female; there were no missing values for sex.” If only two or three categorical variables are used to describe the kinds of people in the study, a few sentences suffice. If a study includes a larger number of categorical variables, percentages for each categorical variable may be summarized in a table.
3.6 MODE, IMPOSSIBLE SCORE VALUES, AND MISSING VALUES

Consider the data for a different hypothetical categorical variable, hair color, in Figure 3.5. Only the first 20 out of 50 lines in the file hair color.sav are displayed. This example illustrates additional things to look for in frequency tables.

This file illustrates the use of case numbers to identify individual participants. A case number is a unique identifying number for each person in the study; these numbers can be arbitrary. It is good practice to include case numbers for several reasons. Case numbers are needed if:

You plan to check scores in the SPSS file against original data sources.
You want to follow up, or contact, individual cases.
You want to match scores from two or more data sources for each person (such as data from surveys done at different points in time).
You want to identify persons with extreme or implausible scores.
You want to identify persons who have missing values.

In Figure 3.5, note that the case number 206 appears 3 times. This indicates a problem in record keeping. For online surveys, duplicate case numbers can indicate that the same person logged in and completed the survey multiple times.

Notice that the case in line 14 has an empty cell for hair color. This indicates a missing value; the person did not provide an answer to the survey question. Variables with large numbers of missing values are a problem.

Pink and blue are not natural hair colors, but a few people dye their hair these colors.

Next, examine the frequency table obtained for the hair color data set (using the same SPSS procedures as in the earlier example) in Figure 3.6. I crossed out the “Cumulative Percent” column in Figure 3.6 as a reminder that this is not applicable for categorical variables. In the left-hand column, each hair color numerical score (1, 2, 3, etc.) is replaced by a verbal label. Appendix 3A explains how verbal labels can be assigned to numerical score values.
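The two record-keeping checks described for Figure 3.5 (repeated case numbers and empty cells) are easy to automate. The Python sketch below is an illustration under stated assumptions: the rows are invented, None stands in for an empty cell, and the function names are hypothetical. Only the idea that case number 206 appears three times is taken from the text.

```python
from collections import Counter

# Hypothetical (case_number, hair_color_code) rows; None marks an empty cell.
rows = [(201, 1), (202, 2), (206, 1), (204, None), (206, 3), (206, 1)]

def duplicate_case_numbers(rows):
    """Return {case_number: times_seen} for case numbers that appear more than once."""
    seen = Counter(case for case, _ in rows)
    return {case: n for case, n in seen.items() if n > 1}

def cases_with_missing(rows):
    """Return the case numbers whose score is missing."""
    return [case for case, score in rows if score is None]

print("duplicate case numbers:", duplicate_case_numbers(rows))
print("cases with missing hair color:", cases_with_missing(rows))
```

A check like this only flags cases for follow-up; deciding what to do with a duplicated or incomplete record still requires going back to the original data sources.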
Excluding the hair color response “12,” seven different hair colors were reported. The modal hair color was brown. For a categorical variable, the mode is the score that has the highest frequency. It is possible to report secondary modes; the next most frequent hair type, black, could be characterized as the second highest mode.

If you decide that you need a minimum of 10 cases in a group to use that group in later analyses, the only groups that qualify are brown and black hair. Each of the other hair colors (blond, red, gray, pink, and blue) was reported by fewer than 10 persons.

Two lines near the bottom of the table indicate problems. A score of “12” for hair color did not have a verbal label. Assume that a response of “12” was not anticipated by the researcher and that “12” was not presented to participants as a possible response option. This kind of error can be avoided during data collection by requiring fixed-choice responses instead of allowing people to write in open-ended responses. This response is nonsense or out of range.
Figure 3.5 Hypothetical Data for Self-Reported Hair Color

The line labeled “Missing System,” with a frequency of 4, tells us that four persons did not answer the hair color question. When more than a few (let’s say more than 5%) of persons don’t provide data, this signals a problem. For analyses covered in this volume, we will usually allow SPSS to exclude missing values from analyses automatically (using listwise deletion). (However, this is not best practice; see Volume II [Warner, 2020] for discussion.) The percentage of missing values should be included in research reports for all variables. To understand why percentages in the “Valid Percent” column differ from those in the “Percent” column, see Appendix 3B.
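The following Python sketch shows how the hair color frequencies relate to the “Percent” and “Valid Percent” columns. It is an illustration, not SPSS output. The counts are back-calculated from the percentages reported for Figure 3.6 in Section 3.7 with N = 50 (for example, 32% of 50 is 16). The sketch assumes the usual definitions: Percent divides by all cases, and Valid Percent divides by the cases that are not missing. The function name percent_columns is hypothetical.

```python
# Counts implied by the percentages reported for Figure 3.6 with N = 50.
# "12" is the unlabeled out-of-range code; it is not treated as missing.
counts = {
    "brown": 16, "black": 12, "blond": 8, "red": 3,
    "gray": 3, "pink": 2, "blue": 1, "12": 1,
}
n_missing = 4  # the "Missing System" line

total_n = sum(counts.values()) + n_missing  # all cases, including missing
n_valid = total_n - n_missing               # cases with a recorded score

def percent_columns(counts, total_n, n_valid):
    """Map each score label to (frequency, percent, valid percent)."""
    return {
        label: (n, 100 * n / total_n, 100 * n / n_valid)
        for label, n in counts.items()
    }

table = percent_columns(counts, total_n, n_valid)
mode = max(counts, key=counts.get)  # score with the highest frequency
large_groups = [g for g, n in counts.items() if n >= 10]

for label, (n, pct, valid_pct) in table.items():
    print(f"{label:>6} {n:3d} {pct:5.1f} {valid_pct:5.1f}")
print(f"missing: {n_missing} ({100 * n_missing / total_n:.1f}% of all cases)")
print("mode:", mode, "| groups with n >= 10:", large_groups)
```

With no missing values the two denominators are equal, which is why the two percentage columns differ only when some scores are missing.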
Figure 3.6 Frequency Table for Hypothetical Hair Color Data

3.7 REPORTING DATA SCREENING FOR CATEGORICAL VARIABLES

On the basis of Figure 3.6, you could report the following. The total N for the study was 50 (you would also include other information, such as how the sample was selected, age and other demographic or background variables, and so forth). The sample included these hair color groups: 32% brown, 24% black, 16% blond, 6% red, 6% gray, 4% pink, 2% blue, 2% out-of-range scores, and 8% missing values.

3.8 FREQUENCY TABLES FOR QUANTITATIVE VARIABLES

3.8.1 Ungrouped Frequency Distribution

For quantitative variables, SPSS sets up a frequency distribution table with one row for each possible score value. This is called an ungrouped frequency distribution. Each line provides information about the number of persons who had the same score, for instance, the number of persons whose heart rate score was 72. Ungrouped frequency distribution tables can have many lines and sometimes don’t provide the clearest sense of pattern in your data. However, they can answer the same questions as frequency tables for categorical variables, that is, are there out-of-range or impossible or missing values?

To illustrate frequency tables for quantitative variables, it is useful to consider a larger data set. The data for the next example are in the SPSS data file temphr130.sav. To obtain the output shown in Figure 3.7, follow the instructions to run the frequencies procedure in Figures 3.2 and 3.3; choose the variable hr instead of the variable sex. Examination of this frequency table tells us the following things:

There were 130 persons in the sample (total N = 130).
The smallest (minimum) value for heart rate was 57 beats per minute.
The highest (maximum) value for heart rate was 89 beats per minute.
Figure 3.7 Ungrouped Frequency Distribution Table for Values of hr in Data File temphr130.sav

The range of heart rate values (89 − 57) was 32.
All scores are plausible values for heart rate. If there were hr values below 40 or above 150, we would suspect that the heart rate monitor malfunctioned or that there were some extremely unusual people in the sample.
There were no missing values in this data set (there is not a line for missing values).

3.8.2 Evaluation of Score Location Using Cumulative Percentage

The “Cumulative Percent” column is not useful for categorical variables; however, it is useful for quantitative variables. Cumulative percentage is the percentage of cases who have scores equal to or below a specific score value. For example, the cumulative percentage for a heart rate score of 61 is the sum of the percentages for all lower hr scores and cases with a score of 61 (i.e., we sum the percentages of cases for hr scores of 57, 58, 59, 60, and 61). The cumulative percentage for 61 = 1.5% + .8% + .8% + 1.5% = 4.6%. This tells us that 4.6% of the people in this sample had hr scores of 61 or lower.

For a specific value of heart rate such as 77 beats per minute, we can ask whether that score is low or high, compared with other scores in the sample. There are two different ways to word the question, and they have slightly different answers. We can ask:

What percentage of people have hr scores of 77 or lower? (The answer is 65.4%.)
What percentage of people have hr scores below 77 (i.e., 76 or below)? (The answer is 60%.)

Notice that these two questions differ. To answer the first question, we include people with scores of 77. To answer the second question, we exclude people with scores of 77.

We answer the first question (what percentage of people have hr scores of 77 or lower?) by reporting the cumulative percentage for 77 from Figure 3.7; 65.4% of cases had hr scores of 77 or lower.
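The difference between “at or below” and “below” can be made concrete with a small Python sketch. This is an illustration only: the ten hr scores are invented (they are not the temphr130.sav data, although they share the minimum of 62 and maximum of 82 reported for Figure 3.1), and the function names are hypothetical.

```python
from collections import Counter

def cumulative_percent_table(scores):
    """Rows of (score, frequency, percent, cumulative percent), lowest score first."""
    counts = Counter(scores)
    total_n = len(scores)
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]  # cases at or below this score value
        rows.append((value, counts[value],
                     100 * counts[value] / total_n,
                     100 * running / total_n))
    return rows

def percent_at_or_below(scores, x):
    """Cumulative percentage for x: includes cases with a score equal to x."""
    return 100 * sum(s <= x for s in scores) / len(scores)

def percent_below(scores, x):
    """Percentage of cases strictly lower than x: excludes scores equal to x."""
    return 100 * sum(s < x for s in scores) / len(scores)

# Hypothetical heart rate scores, for illustration only.
hr = [62, 64, 68, 70, 70, 72, 74, 77, 80, 82]

for row in cumulative_percent_table(hr):
    print("%3d %2d %5.1f %5.1f" % row)
print("at or below 77:", percent_at_or_below(hr, 77))
print("below 77:", percent_below(hr, 77))
```

For a score that appears in the table, the “below” answer is the cumulative percentage of the next lower score, which is the lookup the text describes for 76 and 77.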
The cumulative percentage of scores at or below a specific score value (such as 77) can also be called the percentile rank of that score.

We answer the second question (what percentage of people have hr scores below 77?) by looking at the cumulative percentage for the first score that is smaller than 77, in this case, the cumulative percentage for a score of 76. From Figure 3.7, 60% of cases had hr scores lower than 77.

For variables that apply to humans, you can personalize your experience with statistics by thinking about your own scores. If you know that your heart rate is 92, it is higher than all hr scores in the hypothetical data in Figure 3.7. This tells you (in the parlance of some high school students) whether your heart rate is “weird” or unusual relative to this set of scores. If your heart rate is 75, your score is near the middle of these scores. Later you will learn more formal methods to evaluate score locations.

Often researchers want to know a “typical” or average value. For now, we’ll define this as a score with a cumulative percentage at or near 50% (this is close to the median, discussed in the next chapter). A score of 73 had a cumulative percentage of 49.2, so we can say that a score of 73 was about average or typical for this sample.

3.8.3 Grouped or Binned Frequency Distributions

When the number of different score values is large, it may be more convenient to set up a grouped frequency distribution, that is, to report the number of cases for groups defined by ranges of score values; this process is known as binning. A grouped frequency distribution could show numbers of cases for ranges of heart rate scores, such as:

61 to 65
66 to 70
71 to 75

SPSS provides ungrouped (rather than grouped) frequency tables, by default. Default procedures are the “decisions” SPSS makes unless you specify something different. Ungrouped frequency distribution tables provide much of the information you need to evaluate potential problems with scores on quantitative variables.

When we begin to look at graphs in the next few chapters, it is often helpful to have grouped scores. SPSS automatically groups data when it sets up histograms. Grouped frequency tables are rarely necessary. However, methods to create these by hand are provided in Appendix 3C if you want to do this.

3.9 FREQUENCY TABLES FOR CATEGORICAL VERSUS QUANTITATIVE VARIABLES

Frequency distribution tables are very useful for categorical variables; they provide all the information you need to describe the pattern of scores and identify potential problems with data. However, for quantitative variables, information beyond frequency tables is needed. Additional summary information for quantitative variables (descriptive statistics such as the mean and graphs such as histograms) is described in the next two chapters.

3.10 REPORTING DATA SCREENING FOR QUANTITATIVE VARIABLES

Frequency tables don’t tell us everything we need to know when screening quantitative variables. Information from graphs such as histograms and boxplots will also be needed.
Therefore, discussion of reporting data screening for quantitative variables is presented after these topics.

3.11 WHAT WE HOPE TO SEE IN FREQUENCY TABLES FOR CATEGORICAL VARIABLES

Here are the things we hope to see in frequency tables:

Groups that correspond to all the groups we wanted to include in the study.
A reasonable minimum number of cases in all groups that will be compared in later analyses. Standards for minimum group size vary depending on the cost and difficulty of obtaining cases. For a variety of reasons, I suggest that 30 per group is a reasonable minimum in many situations, but that is not an ironclad rule.
There should not be groups that correspond to kinds of cases we planned to exclude from the study.
There should be few missing values (less than 5% missing is a reasonable standard).
There should be no “impossible” responses to group membership questions.

Categorical variables can correspond to naturally occurring groups or to groups formed by a researcher (often in the context of an experiment).

3.11.1 Categorical Variables That Represent Naturally Occurring Groups

Often, scores for naturally occurring groups are used to characterize the sample (i.e., describe the kinds of cases included in the sample). If researchers want to generalize results to some hypothetical population, the composition of the sample (e.g., proportions male and female) should be similar in the sample and population; the sample should be representative of the population of interest.

Sometimes researchers want to compare naturally occurring groups exposed to different risk or protective factors by self-selection into the situation (for instance, smokers vs. nonsmokers; meditators vs. nonmeditators). If this is the goal, these naturally occurring groups should be as similar as possible on other variables (e.g., similar proportions of male and female, similar average age).

Groups should have large enough ns to make comparisons reasonable. Suppose a researcher wants to compare level of education across religious groups. The categorical variable religion could have many categories. Depending on where the sample is obtained, some religious groups may contain very small numbers. It may not be possible to include these in later analyses.

3.11.2 Categorical Variables That Represent Treatment Groups

For an experiment, all groups should have sufficient cases for analyses to be believable. A minimum of about 10 cases per group is desirable. In experiments, approximately equal group sizes are usually preferred, although unequal group sizes can be handled by most statistical procedures.
3.12 WHAT WE HOPE TO SEE IN FREQUENCY TABLES FOR QUANTITATIVE VARIABLES

We hope to see few or no missing or implausible values. The range of score values in the sample should correspond at least approximately to the hypothetical population of interest. A very small range sometimes makes it difficult to find any association between variables in later analysis. For example, consider amount of television watching time as a quantitative variable. Suppose a researcher obtains a convenience sample of college students and that within the sample, the minimum amount of time is 0 hours and the maximum amount is 4 hours per week. However, the researcher would like to know something about the effects of TV exposure on mood in a broader adult population, some of whom watch 3 or 4 hours of TV each day (and thus 21 to 28 hours a week). If the sample does not include any people who watch TV this much, results won’t be generalizable to populations with much higher viewing time.
On the other hand, if a researcher wants to hold a variable (fairly) constant within a study, a narrow range of scores may be desirable. If the population of interest is college students between the ages of 18 and 22, then a range of 18 to 22 in the sample would be desirable. Much higher ages (37, 55, and so forth) would be problematic.

3.13 SUMMARY

Researchers should make decisions about cases and values they want to include or exclude from their data before data collection. For example, persons with some kinds of vision limitations may not provide useful information about responses to optical illusions. On the other hand, some studies might be set up to examine persons with specific types of vision limitations. Inclusion and exclusion criteria can be set up in terms of categories (e.g., exclude smokers) or in terms of quantitative scores (e.g., include only ages between 18 and 65). Excluding cases after data analyses have been performed is a questionable research practice that can lead to incorrect conclusions (John, Loewenstein, & Prelec, 2012).

Research reports should include information about problems with data (e.g., the numbers or percentages of impossible or missing or extreme values). Anything done to remedy problems (such as deleting scores from analysis) should be stated clearly, and the rationale for the decisions should be provided. (Don’t make up a different story for each case you take out of a data set; have consistent rules.) Readers need this information to evaluate generalizability of results, potential limitations of the study, and potential problems in data analysis.

You need to “clean up” your data before you do any additional analyses. There’s an old maxim in computer programming: Garbage in, garbage out. If your scores have errors or come from types of cases that were not supposed to be included in your study, results of analyses will be incorrect. The time to look for errors is at the beginning, before you have invested a lot of time in running analyses.
APPENDIX 3A: GETTING STARTED IN IBM SPSS® VERSION 25

First, make sure your computer is set up to run SPSS. You can purchase or rent your own copy of SPSS (be sure to look for academic, educator, or student discounts, if you qualify). If you use SPSS through an organization site license in a business or university, ask your instructor or information technology department for site-specific instructions. When you have access to SPSS on your laptop, you should see the SPSS program icon on your desktop or in the start menu. If you can’t find it, ask someone to help you locate the icon in program file folders and create a shortcut on the desktop.

Next, obtain the SPSS data files for this book. Data files for examples in this book can be downloaded from the SAGE textbook website at edge.sagepub.com/warner3e. I suggest that you create a folder (you might call it MySPSS) for your SPSS files. Initially you need to know about two types of files. File names that end in .sav are SPSS data files. File names that end in .spv are output files. Each file type has its own icon.
If you aren’t familiar with the management of computer folders and files, YouTube and Google can help. Search for “find files in Windows 10” (or whatever operating system you use). These resources can often help answer questions about procedures both inside and outside SPSS.

3.A.1 The Bare Minimum: Using an Existing SPSS Data File to Obtain, Print, and Save Results

For a minimal session in SPSS, you need to:

CHAPTER 4 DESCRIPTIVE STATISTICS

4.1 INTRODUCTION

Information about quantitative variables can be summarized using simple descriptive statistics, including the mode, median, and mean. These describe central tendency or “typical” responses. We obtain information about variation of scores by examining minimum and maximum values, range, variance, and standard deviation. These descriptive statistics cannot be applied to categorical variables, because scores on categorical variables are only labels for group membership; it would not make sense to rank or add scores for a categorical variable such as hair color.

Descriptive statistics such as means tell us what kinds of cases are included in a sample. For example, what was a typical age for a person in the study, and how much did people vary in age? Readers need to know the characteristics of the sample to assess potential generalizability of results. For example, if the average age of persons in a sample is 18, results may not be generalizable to persons between 40 and 80 years of age.

4.2 QUESTIONS ABOUT QUANTITATIVE VARIABLES

You have already seen that a frequency table for a quantitative variable provides information about the number of missing values and the presence of implausible scores. The minimum, maximum, and range can also be obtained (preliminary information about variability), and a “middle” can be identified approximately by examining the score that has a cumulative frequency close to 50% (to describe a typical or average score).
One or more modes can be identified, although modes are not always useful when obtained from ungrouped frequency tables. Additional information about central tendency and variability can be obtained by computing descriptive statistics on the basis of the values of all scores. These descriptive statistics answer the following questions:

What is a typical or common response (central tendency)? The sample mean and median describe central tendency for quantitative variables.

How much do responses vary or differ? Additional information about variability is obtained from variance and standard deviation, introduced in the next sections.

4.3 NOTATION

Here is all the notation you need for now:
N represents the total number of cases in a data set; n refers to the number of cases within a group.

X is used to represent scores for a variable. Subscripts can be used to make it clear when a score belongs to an individual person. The scores for persons 1, 2, and 3 in a sample can be denoted X1, X2, and X3.

When a list of scores in a data set or group is summed, as in X1 + X2 + X3 + … + XN, the shorthand notation ∑ is used. (This is the Greek capital letter sigma, and it is read as “summation.”) “∑X” means “add all scores for variable X.”

4.4 SAMPLE MEDIAN

The mode, discussed earlier for categorical variables, can also be used with quantitative scores; the mode is the score with the highest frequency. The median, discussed in this section, defines the average as the score value that has 50% of people’s scores above it and 50% of people’s scores below it. The mean, discussed in the next section, defines the average as the sum of all the scores divided by the number of scores. For some samples, the values of the mode, median, and mean are equal.

When the values of the mean, median, and mode differ, the question arises: Which one or two of these statistics do the best job of communicating information about typical or “average” values? Readers of research reports and mass media articles need to ask, Did the author define average using the mode, median, or mean? Readers need to understand the circumstances under which these numbers provide different perceptions of what is “average.”

This example uses data that appeared in Figure 3.1 in the preceding chapter. To obtain the sample median (denoted “Mdn”), we first need to rank scores from highest to lowest, as shown in Figure 4.1. We then count the numbers of cases up from the bottom of the list and down from the top of the list. It is easiest to explain the median by looking at the list of individual scores, as in the following example.
(There are also procedures to locate the median in grouped frequency tables, not included here.) To obtain the sample median for a set of N scores in a small data set:

1. Arrange the X scores in rank order from highest to lowest (as in Figure 4.1). If you have data in an SPSS or Excel file, there are commands to sort or rank scores in a column.

2. Determine the number of scores, c, that corresponds to approximately half of N, the total number of scores. The value c tells you how many scores you need to count (up from the bottom and down from the top) to locate the median. If N is an odd number, c = (N – 1)/2. If N is an even number, c = (N – 2)/2.

3. Count c scores down from the top and c scores up from the bottom of the list. If N is an odd number, this counting procedure identifies one score in the middle of the ranked list of scores, and that score is the median. If N is an even number, this will identify two scores. When there are two scores in the middle of the ranked list, add them and divide the sum by 2 to obtain the median.
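The counting procedure above can be sketched in a few lines of Python. This is only an illustration of the by-hand method, not something SPSS requires. The function name median_by_counting is hypothetical, and the list hr holds the ten heart rate scores from temphr10.sav, reconstructed here from the deviations reported in Section 4.6.

```python
def median_by_counting(scores):
    """Find the median by ranking scores and counting c scores in from each end."""
    ranked = sorted(scores, reverse=True)  # highest to lowest, as in Figure 4.1
    n = len(ranked)
    if n == 0:
        raise ValueError("need at least one score")
    # c = number of scores to count down from the top and up from the bottom
    if n % 2 == 1:
        c = (n - 1) // 2
    else:
        c = (n - 2) // 2
    middle = ranked[c:n - c]  # one score remains if N is odd, two if N is even
    return sum(middle) / len(middle)


# Heart rate scores (beats per minute), reconstructed from Section 4.6.
hr = [70, 69, 71, 62, 75, 74, 80, 73, 75, 82]
print(median_by_counting(hr))
```

For N = 10 the function counts in c = 4 scores from each end, leaving 74 and 73 in the middle, and averages them to obtain 73.5.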
Figure 4.1 Finding the Median for an Even Number of Scores

Consider the list of N = 10 heart rate scores in Figure 4.1. For N = 10, c = (10 – 2)/2 = 4. Count down four scores from the top of the list of scores; count up four scores from the bottom, as shown by the arrows in Figure 4.1. Because N is an even number, two scores remain in the middle (scores of 73 and 74). Add these and divide the sum by 2; in this example, the median = (73 + 74)/2 = 73.5.

Figure 4.1 shows a list of the 10 scores; each line is the score for one case. One of the score values, the score of 75, appears twice in this list because two persons had heart rates of 75. This figure helps make the interpretation of a median clear: A sample median is the score value that separates cases into two equally sized groups: Half of the people in the sample have scores above the median, and half have scores below the median. We can also say the median defines “average” as a value for which 50% of people have scores below and 50% of people have scores above that value.

The mechanics of finding a sample median can be tedious if the number of scores is large, if scores must be rank-ordered by hand, or if the median must be obtained from a grouped frequency distribution. I prefer that students focus on the way the median defines average instead of details of by-hand methods to obtain the median.

4.5 SAMPLE MEAN (M)

The sample mean (denoted M) is obtained by first summing all the X score values in a sample of N scores and then dividing that sum by N, the number of scores:

M = ∑X/N   (4.1)
Here and in later chapters, we’ll think about each equation in terms of the information included, how that information is combined, and the conditions under which the value of the statistic tends to be larger or smaller. Adding the X scores summarizes score information across all participants. The size of ∑X depends on two things. Other factors being equal, the larger the X scores are, the larger ∑X will be. Other things being equal, the larger N is, the larger ∑X will be. To obtain a sample mean that represents the size of a typical score and that is independent of N, we correct for sample size by dividing ∑X by N.

The “bag of tricks” (frequently used arithmetic operations) in basic statistics is small. The formula for the mean illustrates two of these tricks. First, to summarize information for a set of scores, we sum them. Second, to correct for the effect of sample size, we divide the sum by the number of pieces of information included in the sum. As you encounter more advanced statistics, you will see these tricks used again. Learn to look at ∑X and think, this term summarizes information about values of all the scores across cases. When you see “divide by N,” remember that dividing by N corrects for the effect of sample size on the sum.

Equations can be converted into sentences in everyday language that tell you what information is included and what is being done to that information. Equation 4.1 is more than just instructions for computation. It is also a statement or “sentence” that tells us the following:

What information is the sample statistic M based on? It includes all the individual X scores and the N of cases in the sample.

Under what circumstances will the statistic (M) turn out to have a large or small value? M is large when individual X scores are large and positive. Because we divide by N when computing M to correct for sample size, the magnitude of M is independent of N.
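The two “tricks” in the formula for the mean (sum the scores, then divide by N) can be checked with a short Python sketch. It is an illustration only. The variable names are hypothetical, and the hr scores are the ten heart rate values from temphr10.sav as reconstructed from the deviations in Section 4.6.

```python
# Heart rate scores (beats per minute), reconstructed from Section 4.6.
hr = [70, 69, 71, 62, 75, 74, 80, 73, 75, 82]

sum_x = sum(hr)  # ∑X: summarizes information across all cases
n = len(hr)      # N: number of scores
m = sum_x / n    # M = ∑X / N: corrects the sum for sample size

# Listing every score twice doubles N and doubles ∑X, but leaves M unchanged.
hr_twice = hr + hr
m_twice = sum(hr_twice) / len(hr_twice)

print(sum_x, n, m)
print(sum(hr_twice), len(hr_twice), m_twice)
```

Here ∑X is 731 for N = 10 and 1,462 for N = 20, yet M is 73.1 in both cases, which illustrates why dividing by N makes the mean independent of sample size.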
4.6 AN IMPORTANT CHARACTERISTIC OF M: THE SUM OF DEVIATIONS FROM M = 0

For each score in a sample, we can compute a deviation (or distance from the sample mean). If you know or can guess your heart rate, you can calculate your own deviation from the mean of the sample data in temphr10.sav. For example, if your hr score X = 75, and the mean M for the sample is 73.1, then the deviation of your score from the mean is (X – M) = (75 – 73.1) = +1.9 beats per minute. This deviation tells you that your heart rate is 1.9 beats per minute higher than the mean of this sample. We can find a deviation from the mean for every score in a sample. We find the deviation of an individual X score from M by subtracting M from X:

Deviation of X from M = (X – M)   (4.2)
For each case, this deviation tells us two things. On the basis of the sign of the deviation, we know whether a person’s X score was below or above the mean. From the absolute magnitude of the deviation, we can see whether the score was close to M (small deviation) or far from M (large deviation).

Figure 4.2 shows the heart rate scores and the deviation from the mean for each score (the variable named devmean is X – M for each case). For the person in line 1 in Figure 4.2, hr = 70, and devmean, the deviation of the heart rate score from the mean of heart rate, is (70 – 73.1) = –3.1. For the person in line 1, heart rate is about 3 beats per minute lower than the sample mean.

What happens if we sum the 10 devmean scores? Their sum (–3.1 – 4.1 – 2.1 – 11.1 + 1.9 + 0.9 + 6.9 – 0.1 + 1.9 + 8.9) is 0. In fact, the sum of deviations of scores from the mean in a sample is always 0. This is not a formal proof, only a demonstration. (Proofs are presented in mathematical statistics books.) The mean is the value relative to which deviations of scores in a sample sum to 0. We can state this property of the sum of the (X – M) terms in equation form:

∑(X – M) = 0
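The demonstration with the devmean scores can be repeated in Python. This is a sketch for checking the arithmetic, not a proof. The hr list is reconstructed from the ten deviations listed in this section (each score is 73.1 plus its deviation), and the variable names devmean and sum_dev are chosen for illustration.

```python
# Heart rate scores (beats per minute): 73.1 plus each deviation listed in the text.
hr = [70, 69, 71, 62, 75, 74, 80, 73, 75, 82]

m = sum(hr) / len(hr)          # sample mean M
devmean = [x - m for x in hr]  # deviation (X - M) for each case
sum_dev = sum(devmean)         # sum of the deviations

# 73.1 cannot be stored exactly in binary floating point, so the raw sum may
# differ from 0 by a tiny rounding error; round before displaying.
print([round(d, 1) for d in devmean])
print(round(sum_dev, 10))
```

The reason the sum is always 0 follows from the definition of the mean: ∑(X – M) = ∑X – (N × M), and because M = ∑X/N, the term N × M equals ∑X, so the difference is 0.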
The mean and the median are both in the “middle” of the distribution, but in different ways. The median is the score value that divides the top 50% of cases from the bottom 50% of cases. The mean is the score value for which the sum of the positive deviations equals the sum of the negative deviations (in absolute value). The mean sometimes equals the median, and both of these can equal the mode; however, these values are not always equal.

Why is it useful to think about “average” on the basis of the sum of deviations from the mean? Deviations from means are basic building blocks for almost every analysis you will learn. In a later section of this chapter, you will see that sample variance and standard deviation are also calculated using deviations from the mean. You will see in later chapters that most of the analyses you will learn to evaluate relationships between pairs of variables are calculated using deviations from means (and squared deviations from means). These analyses may appear to be quite different from one another, but you will be able to see later that they are a “family” of techniques based on the same information about data (they are all constructed using deviations from the mean). The mean is part of this family of analyses. This is a major reason why researchers often report sample means (rather than medians or modes). The median and mode, while often useful on their own, are not members of this family of analyses.

The mean has two advantages. First, it is related to many other popular types of statistical analyses. Second, it is the measure of central tendency or average most often included in research reports; it is what readers of research reports usually expect to see. However, means have potential disadvantages, as discussed in the next sections. First, means can be influenced greatly by extreme scores; and second, for some kinds of distribution shapes, the mean is not a good indication of typical or common responses.

4.7 DISADVANTAGE OF M: IT IS NOT ROBUST AGAINST INFLUENCE OF EXTREME SCORES

The sample mean is not robust against the influence of extreme scores or outliers. Informally, we can say that a statistic is robust if it yields reasonable results even when its assumptions are violated. Sample means and some other statistics do not “behave well” when outliers are present. Ideally, we do not want the value of a statistic to change substantially if we add or drop a few extreme scores. Unfortunately, adding a few extreme scores to a sample can greatly change the value of the mean. It is not desirable for any statistic to depend heavily on the values of one or a few scores.

To demonstrate the impact of one extreme score on the mean, we’ll look at the heart rate data with and without an extremely high score at the upper end. The first column in Figure 4.3 shows the original hr scores from temphr10.sav (which you saw in Figure 3.1).
In the column to the right, the new variable called hrOutlier has the same scores as hr for the first nine cases, but the last score is changed from 82 to a more extreme value of 160. Here are the values of mean, median, and mode for the two sets of scores in Figure 4.3:

hr: mean = 73.1, median = 73.5, mode = 75
hrOutlier: mean = 80.9, median = 73.5, mode = 75
Figure 4.3 Demonstration of Effect of High-End Outlier on Sample Mean
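The comparison in Figure 4.3 can be reproduced with Python’s standard statistics module. This is a sketch rather than the SPSS procedure used in this book. The helper name central_tendency is hypothetical, and the hr scores are reconstructed from the deviations listed in Section 4.6; hr_outlier replaces the last score (82) with 160, as described for the hrOutlier variable.

```python
import statistics

# Heart rate scores (beats per minute), reconstructed from Section 4.6.
hr = [70, 69, 71, 62, 75, 74, 80, 73, 75, 82]
# Same first nine scores; the last score is changed from 82 to 160.
hr_outlier = hr[:-1] + [160]


def central_tendency(scores):
    """Return the mean, median, and mode for a list of scores."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "mode": statistics.mode(scores),
    }


summary_hr = central_tendency(hr)
summary_outlier = central_tendency(hr_outlier)

for name, summary in [("hr", summary_hr), ("hrOutlier", summary_outlier)]:
    print(name, summary)
```

Changing a single score moves the mean from 73.1 to 80.9, while the median (73.5) and the mode (75) stay the same.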
We can evaluate the robustness of mean, median, and mode by asking whether each central tendency statistic changes when one score value is changed. The median and the mode did not change. The value of the mean changed substantially. With the score of 82 replaced by 160, the sample mean is ∑X/N = (62 + 69 + 70 + … + 160)/10 = 809/10 = 80.9. This is much larger than the mean obtained without the score of 160 (which was 73.1). In the original data (hr scores in the first column), the value of M was in the “middle”; there were five scores above and five scores below M. M was a “central” value that was reasonably close to most people’s individual scores and close to the median. On the other hand, for the hrOutlier scores, there are nine heart rate scores below and one heart rate score above the mean of 80.9.

This example illustrates two things:

When one very high score is added to this sample, the value of M increases (while the values of the median and mode do not change). This demonstrates that the mean is less robust against the impact of extreme scores than the median and mode.

With one or more extremely high scores added, the value of the sample mean M is higher than the median; and in this example, M is actually higher than the majority of the individual scores in the sample. Under these circumstances the sample mean M is not a very good way to describe “average” or typical responses. Note that adding an extremely low score pulls the mean downward, typically below the median.

4.8 BEHAVIOR OF MEAN, MEDIAN, AND MODE IN COMMON REAL-WORLD SITUATIONS

This section previews the use of graphs to represent score frequencies for quantitative variables (graphs are discussed more extensively in Chapter 5). Figure 4.4 shows a frequency table for a set of hypothetical scores. A corresponding histogram presents the same information graphically; the height of each bar in the histogram corresponds to the frequency of that score (i.e., the number of people who had that score value).
4.8.1 Example 1: Bell-Shaped Distribution

First let’s consider a hypothetical batch of scores for which the mean, median, and mode have similar values. Suppose you have a survey question that asks people to rate their degree of agreement with this statement: “I think that the U.S. economy is doing well.” Response options are scores of 1 = strongly disagree (SD), 2 = disagree (D), 3 = neutral (N), 4 = agree (A), and 5 = strongly agree (SA). We might obtain a frequency distribution like the one in Figure 4.4. Note that the answer given by the largest number of people corresponds to 3 (neutral), the next highest frequency responses were 2 (disagree) and 4 (agree), and the most extreme responses, 1 (strongly disagree) and 5 (strongly agree), were uncommon. For now, we will call this pattern a “bell-shaped” distribution. (Later, we’ll talk more formally about normal distributions.) Bell-shaped distributions tend to have values of the mean, median, and mode that are close to one another.
In the graph in the lower part of Figure 4.4, the number above the bar for each score value (such as 0) corresponds to the frequency of that score in the table (in the upper part of Figure 4.4). For example, in this hypothetical data set, a score of 1 had a frequency of 6. A score of 3 had a frequency of 33 (i.e., 33 people chose the answer 3). The histogram or graph at the bottom of Figure 4.4 represents the same information about frequencies using bars with heights that correspond to frequency. This distribution can be informally defined as bell shaped; there is a peak in the middle, and the pattern is symmetrical; that is, the left-hand side of the distribution is approximately a mirror image of the right-hand side.
Figure 4.4 Hypothetical Likert Scale Ratings With Bell-Shaped Frequency Distribution: (a) Frequency Table and (b) Corresponding Histogram

When a distribution is approximately bell shaped, the values of the mean, median, and mode are close together. For the data in Figure 4.4, the mean, median, and mode all have values close to 3 (and all are good descriptions of “typical” score values). In Figure 4.4, most scores are near 3; the nearby values of 2, 3, and 4 include about 80% of the scores. For a bell-shaped distribution, any of these values (mean, median, or mode) provides a good sense of average or typical scores. Choice among these statistics is not a problem when we have an approximately bell-shaped distribution. Data that have this bell-shaped distribution typically work well in most bivariate analyses covered later in this book.

4.8.2 Example 2: Bimodal or Polarized Distribution

Next consider a set of hypothetical Likert scale ratings for this statement: “I am a political liberal.” A Likert questionnaire item has two components: a statement of opinion or belief and a set of response options that represent different degrees of agreement with that statement. A 1-to-5 scale is often used. The frequency distribution of ratings for a hypothetical Likert format question in Figure 4.5 is an example of bimodal or polarized ratings (i.e., scores tend to be at one extreme or the other). Most people gave a rating of 1 (strongly disagree) or 5 (strongly agree). The highest mode is for a rating of 5. The frequency for a rating of 1 was almost as high. Very few people gave ratings between these extremes. Figure 4.5 shows the frequency table and histogram for this hypothetical outcome.
Figure 4.5 Hypothetical Likert Scale Ratings for Polarized or Bimodal Responses: (a) Frequency Table and (b) Histogram

In this example, because the distribution in Figure 4.5 is bimodal, with one mode at the highest possible score and a second mode at the lowest possible score, neither the mean (M = 3.11) nor the median (Mdn = 3) describes typical or average response very well. In fact, very few people gave ratings close to 3. We get a better sense of “typical” responses if we report the two modes. People either love liberal policies or hate them. The point of this example is that in some frequency distributions, the mean and median may not be good ways to describe typical or average response.

4.8.3 Example 3: Skewed Distribution

Some variables represent counts of events or behaviors, for example, How many children do you have? How many speeding tickets have you received? Distributions for variables like these often have many responses of 0, 1, or 2 (with a smallest possible value of 0). However, the highest responses can be 8, 10, or more. For these types of variables, the shape of a distribution is often asymmetrical or skewed. A frequency table of hypothetical answers to the question “How many children do you want to have in the future?” appears in Figure 4.6.
Figure 4.6 Frequency Distribution for Hypothetical Scores on Number of Children Wanted: (a) Frequency Table and (b) Histogram

The distribution in Figure 4.6 is described as “positively skewed” because there is a longer (and thinner) tail at the positive end of the distribution. In this positively skewed distribution, there are a few extreme scores at the high end (e.g., the persons who said they wanted 11 and 16 children). In this example, the mean of 2 is not a good indication of typical responses (more than half of the people in this sample reported wanting fewer than 2 children, i.e., either 1 or 0 children). As noted earlier, the mean is not robust against the effect of outliers; the extreme high scores of 11 and 16 made the mean (M = 2) for this set of scores higher than the median (Mdn = 1).

We could call the scores of 0 and 1 both modes. These two modes are a better way to describe “typical” scores. We could say that there is a mode at 0 and a smaller mode at 1; large percentages of people reported wanting 0 children (31.8%) or 1 child (25.8%). In this situation, the most informative way to report data would be to report one or two modes, or perhaps the entire frequency table. The median would also be a reasonable way to describe “typical” responses. The mean is somewhat high, although not so high that it would be completely unreasonable to report it.

4.8.4 Example 4: No Clear Mode

It is possible for the numbers of cases to be about the same for all score values. Suppose the list of scores is as follows: [2, 4, 5, 7, 8, 8, 9]. The value 8 could technically be called the mode because it has a frequency of 2 and all other scores have a frequency of 1. However, calling 8 a mode, or using it to describe a typical score value, would not make sense in this situation.
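A quick frequency count shows why the “mode” in the list [2, 4, 5, 7, 8, 8, 9] carries so little information. The sketch below uses Python’s standard library and is only an illustration; the variable names (freq, max_freq, modes, modal_share) are chosen for this example.

```python
from collections import Counter
import statistics

scores = [2, 4, 5, 7, 8, 8, 9]

freq = Counter(scores)                # frequency of each score value
max_freq = max(freq.values())         # highest frequency in the table
modes = statistics.multimode(scores)  # every value that has the highest frequency
modal_share = max_freq / len(scores)  # proportion of cases at the modal value

for value in sorted(freq):
    print(value, freq[value])
print("mode(s):", modes, "share of cases:", round(modal_share, 2))
```

The value 8 wins only because it occurs twice while every other value occurs once, so it describes just 2 of the 7 cases.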
When in doubt, report more information. The entire frequency distribution table, or the values of mode, mean, and median along with a graph, provide the most complete information to readers. The goal of reporting should be to impart the clearest possible understanding of pattern in the data. Despite potential problems with sample means, it is more common to see reports of means (than of medians or modes) in science research reports. This happens because sample means are the basis for computation of the most widely used bivariate statistics (such as t tests and analysis of variance). If you see a report of an “average” in mass media or in a research report, you need to know whether the description of average is based on a mean, median, or mode. For some kinds of frequency distributions, such as the bimodal distribution in Figure 4.5, a mean can be misleading information.

4.9 CHOOSING AMONG MEAN, MEDIAN, AND MODE

Students sometimes ask questions such as “Which is better: The mean, the median, or the mode?” This is usually not the best kind of question to ask when choosing among statistics. It is more useful to ask, Under what circumstances does the mean provide the most useful information? Under what circumstances is the median (or mode) a better choice? Here are some guidelines for choice among the mean, median, and mode:

If a frequency distribution or histogram has one or more modes that are not near the center of the distribution, the mean may not be a good way to describe typical response. It may be better to report one or more modal scores. In Figure 4.5, we could say that there was a polarization of opinion; people either strongly agreed with liberal policies (score of 5) or strongly disagreed (score of 1), with very few persons reporting neutral feelings.

If a frequency distribution is skewed (with a long thin tail on one side), the median, and one or more modes, may be a better way to describe what is typical or average than the mean.
Positive skewness (with extreme scores on the high end) is common in social science data. Negative skewness is possible (with a few extreme scores at the low end) but less common.

If a distribution is bell shaped or approximately normal, the values of the mean, median, and mode will be close together. The mean is a good way to describe central tendency for bell-shaped distributions; the median and mode will have similar values.

When in doubt, or if the situation is complicated, it may be better to report the entire frequency distribution (and/or histogram) along with values for the mean, median, and one or more modes.

Good practice: Do preliminary data screening by examining a frequency distribution table and graph to evaluate whether the mean, median, and/or mode(s) are better ways to describe central tendency. If implausible score values appear, go back and reexamine the data to correct errors.
Note the number of missing values. State whether extreme scores or multiple modes were detected (or whether the distribution is approximately normal). State clearly what statistic is used (mean, median, or mode) to describe average responses.

Bad practice: Obtain a mean, median, or mode without examining a frequency table or graph. Select the index of central tendency that “fits the narrative.” For example, if you want to report a high average, you can select whichever of these three statistics has the highest value, whether it makes sense or not. This is deceptive. Fail to make clear which index of central tendency is reported, and fail to note potential problems with it.

Chapter 1 mentioned “lying with statistics.” Reports of central tendency can be deceptive when they present only selected information that creates the impression the author wants to create. When an author wants readers to think, “Wow, that average is really high,” the author might choose to report the highest of the three values (mean, median, or mode). Conversely, if the author wants readers to think, “Wow, that average is really low,” the author might choose to report the lowest value among mean, median, and mode. An author who cherry-picks the highest “average” is presenting misleading (although perhaps not technically false) information.

4.10 USING SPSS TO OBTAIN DESCRIPTIVE STATISTICS FOR A QUANTITATIVE VARIABLE

Previous sections discussed statistics for central tendency; the following sections discuss statistics to describe variability. In this section, SPSS is used to obtain all these descriptive statistics (to describe both central tendency and variability) from data in the file named temphr10.sav using the SPSS frequencies procedure. To run Frequencies, make these menu selections (as in the example in Chapter 3): <Analyze> → <Descriptive Statistics> → <Frequencies>. This opens the main dialog box for the frequencies procedure; in this window, move the variable hr into the Variables window.
Click the Statistics button in the top right-hand corner of the main dialog box for the frequencies procedure to open the Frequencies: Statistics dialog box (shown on the right-hand side of Figure 4.7). There is a checkbox menu; click the checkboxes as shown to select central tendency statistics and statistics to describe variability (in the area headed “Dispersion”). The statistics that describe variability are explained in upcoming sections. Click Continue to exit from the Frequencies: Statistics box and return to the main Frequencies dialog box, and click OK in the main dialog box to run the analysis. Output appears in Figure 4.8.
Figure 4.7 SPSS Frequencies: Statistics Dialog Box to Obtain Descriptive Statistics for Quantitative Variables

The values for mean, median, and mode in Figure 4.8 agree with the values obtained in earlier sections by hand, and they are close together. This example verifies the by-hand computations for mean and median done in previous sections for the same set of scores.
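Readers without access to SPSS can build a comparable ungrouped frequency table in Python. The sketch below is an assumption-laden stand-in, not a copy of the SPSS output in Figure 4.8: the column layout (score, frequency, percent, cumulative percent) is modeled on a typical frequency table, the variable names are hypothetical, and the hr scores are reconstructed from the deviations listed in Section 4.6.

```python
from collections import Counter

# Heart rate scores (beats per minute), reconstructed from Section 4.6.
hr = [70, 69, 71, 62, 75, 74, 80, 73, 75, 82]

n = len(hr)
freq = Counter(hr)

# One row per distinct score: (score, frequency, percent, cumulative percent).
table = []
cumulative = 0
for score in sorted(freq):
    cumulative += freq[score]
    table.append((score, freq[score], 100 * freq[score] / n, 100 * cumulative / n))

minimum = min(hr)
maximum = max(hr)
score_range = maximum - minimum  # range = maximum - minimum

for row in table:
    print(row)
print("Min:", minimum, "Max:", maximum, "Range:", score_range)
```

The cumulative percent column reaches 50% at a score of 73 and 60% at 74, which is consistent with a median of 73.5 located between those two scores.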
Figure 4.8 Output for Descriptive Statistics for Hypothetical Heart Rate Data in temphr10.sav

The next section describes variability or variation in quantitative scores. You will see how descriptive statistics for variation (including minimum, maximum, range, variance, and standard deviation) can be obtained by hand and how they are interpreted.

4.11 MINIMUM, MAXIMUM, AND RANGE: VARIATION AMONG SCORES

The simplest way to describe variation among scores begins by rank-ordering scores from lowest to highest. The lowest score value is the minimum (often abbreviated as Min); the highest score value is the maximum (Max). As noted in Chapter 3, the range is maximum – minimum. For the heart rate data in Figure 4.1, Min = 62, Max = 82, and range = 20.

Why does this information matter? It helps us characterize the variety of people we have in the sample. When a variable has real-world uses, clinical or other interpretation guidelines can help us understand what the minimum and maximum scores in a sample tell us. For example, guidelines published by the Mayo Clinic state that the normal adult resting heart rate ranges from approximately 60 to 100 beats per minute. A well-conditioned athlete might have a heart rate of about 50 beats per minute. The people in this hypothetical sample all have hr scores within the lower half of the normal range. This tells us that the sample consisted of people with heart rates in the low normal range, and this suggests a sample of persons with good cardiovascular fitness. If the sample had Min hr = 90 and Max hr = 120, this would indicate that many or most of the members of the sample have unusually high heart rates.

When a frame of reference for the evaluation of scores is available, it should be used when characterizing the sample. For example, if depression is assessed, one might ask, Are some scores high enough to warrant diagnoses of mild, moderate, or severe depression?
In a study of a new antidepressant drug, for example, readers would want to know whether most patients were mildly or severely depressed.

4.12 THE SAMPLE VARIANCE s²

We can obtain more useful information about variability by using information for all the individual scores. If all people had the same heart rate score, there would be no variance (e.g., a sample with hr scores of 72, 72, 72, …, 72 will have variance of 0). Variance in hr exists when people have different values of hr. Variability is evaluated by examining how far individual people’s scores are from the mean.

4.12.1 Step 1: Deviation of Each Score From the Mean

Equation 4.2 appeared earlier, and it is repeated here as Equation 4.4. The first step in calculation of variance is to compute the deviation of each person’s score from the sample mean M. (X – M) answers the question, How far is a person’s X score above or below the mean?

Deviation of X from M = (X – M)   (4.4)
For the data in temphr10, the deviation of the first X score from the mean is (70 – 73.1), that is, the score for the first case minus the mean of hr scores. Why do some people have higher, and some people lower, hr scores? Because people have different characteristics, such as physical fitness, smoking, and anxiety, that make their heart rates higher or lower than other people’s. Statistical analyses you will learn later in the course provide ways to evaluate how much of the individual differences in hr might be related to each variable, such as anxiety.

4.12.2 Step 2: Sum of Squared Deviations

Next, we need to summarize information about distances from the mean across all the people in the sample. You might think that you could summarize information by summing the deviations, the values of (X – M), across all people in the data set. However, recall from Section 4.6 that this sum of deviations from the mean is always zero. It might occur to you that this problem could be avoided by summing the absolute values of these deviations. However, there is another approach that yields more useful results.

Here we introduce another tool in the statistician’s bag of tricks. When deviations sum to 0, we get around that problem by squaring the deviations before we sum them. Squaring deviations makes all the terms in this sum positive. To summarize information about individual score distances from the mean: First, we square each person’s deviation from the mean. (Squaring a negative value yields a positive value, so squaring deviations gets rid of the problem that positive and negative deviations would cancel each other out by summing to 0.) Then we sum those squared deviations. The resulting sum is called the sum of squares (or sum of squared deviations), abbreviated SS. In upcoming steps, SS will be used to compute sample variance and standard deviation.

We return to the question: How much do people’s scores in a sample vary or differ relative to the sample mean?
In words, the answer to this question is: We find out how far each X score is from the mean by computing a deviation, we square each deviation, then we sum the squared deviations to summarize information about distance from the mean. This gives the formula for SS, the sum of squared deviations of scores from their mean:

SS = ∑(X – M)²  (4.5)

SS = ∑(X²) – (∑X)²/N  (4.6)
Equation 4.5 makes it easier to see what information about scores is included when you compute SS. Equation 4.6 is easier for by-hand computation of SS from scores. They yield the same results. Notice that we square each individual deviation first; then we add those squared deviations. Appendix 4A describes rules about precedence in the order of arithmetic operations. Operations that are enclosed in parentheses are done before operations outside the parentheses. For example, if you see the expression ∑(Y²), you square the value of each Y, and then sum the squared values. If you see the expression (∑Y)², you sum the values of Y and then square that sum.

Sometimes textbook examples use numbers that give a whole-number result for SS; however, in real data, SS is usually not a whole number. I suggest that you retain at least three decimal places during computations. Final results for most statistics are often rounded to two decimal places. See Appendix 4B for a discussion of rounding.

In Figure 4.9 (data from temphr10.sav) the squared deviation from the mean for each individual person appears in the last column (the variable named deviationsq). Adding the scores for deviationsq gives the value of SS for this data set: SS = 288.90. For larger data sets, it is more convenient to have a computer program do this.
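The by-hand steps for SS translate directly into a few lines of code. The following is a minimal Python sketch, not the book's SPSS procedure. The function names ss_from_deviations and ss_computational are illustrative, and the second function assumes the usual computational form, SS = ∑(X²) – (∑X)²/N. The score set [60, 65, 66, 68, 71, 74, 74] is one of the small illustrative sets used later in this chapter (it is not the temphr10.sav data), and X = [1, 3, 5, 2] is the set used in Appendix 4A.

```python
def ss_from_deviations(scores):
    """SS as the sum of squared deviations from the mean (definitional form)."""
    n = len(scores)
    mean = sum(scores) / n
    # Square each deviation first, then add the squared deviations.
    return sum((x - mean) ** 2 for x in scores)


def ss_computational(scores):
    """SS from raw scores: sum of squared X minus (sum of X) squared over N."""
    n = len(scores)
    sum_x = sum(scores)                       # sum the scores
    sum_x_sq = sum(x ** 2 for x in scores)    # square each score, then sum
    return sum_x_sq - (sum_x ** 2) / n


# Order of operations: squaring then summing differs from summing then squaring.
x = [1, 3, 5, 2]
sum_of_squares = sum(v ** 2 for v in x)   # 1 + 9 + 25 + 4 = 39
square_of_sum = sum(x) ** 2               # 11 squared = 121

scores = [60, 65, 66, 68, 71, 74, 74]
ss_a = ss_from_deviations(scores)
ss_b = ss_computational(scores)

print(sum_of_squares, square_of_sum)
print(round(ss_a, 3), round(ss_b, 3))  # the two forms agree up to rounding error
```

Both functions should return the same SS apart from tiny floating-point differences; the definitional form shows what SS measures, while the computational form avoids computing each deviation separately.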
Figure 4.9 Deviations and Squared Deviations of Heart Rate Scores From Mean

Note that SS cannot be a negative number (because we are summing squared deviations, and squared numbers cannot be negative). Other factors being equal, SS tends to be larger when:

The individual (X – M) deviations from the mean are larger in absolute value.
The number of squared deviations included in the sum increases.
The minimum possible value of SS (which is 0) occurs when all the X scores are equal and, therefore, equal to M. For example, in the set of scores [73, 73, 73, 73, 73], the SS term would equal 0. There is no limit, in practice, for the maximum value of SS.

To interpret SS as information about variability, we need to correct for the fact that SS tends to be larger when the number of squared deviations included in the sum is large. Dividing by N, the number of scores in the sample, seems like the obvious solution. However, this does not provide the best answer.

4.12.3 Step 3: Degrees of Freedom

It might seem logical to divide SS by N to correct for the increase in size of SS as N increases. However, this yields values that are slightly too small; Gosset (discussed in Tankard, 1984) worked out the reason for the problem and discovered a simple solution. When we look at the pieces of information used to compute SS (i.e., the deviation of each score from the sample mean), it is possible to see that we do not have N independent deviations (or pieces of information) available to compute the SS; in fact, we have only (N – 1) pieces of information.

To explain why deviations from the mean in a sample of N scores provide only (N – 1) independent pieces of information about distance from the mean, recall that the sum of all deviations of scores from the mean must equal 0. Suppose we have N = 3 scores in a sample (call these scores X1, X2, and X3) and that their mean is M. First, we convert each X score into a deviation by subtracting the sample mean M. We know that the sum of these deviations must equal zero. That yields this simple equation:

(X1 – M) + (X2 – M) + (X3 – M) = 0
When we compute (X1 – M) + (X2 – M) (on the left side of the equation), this gives us the value that the remaining deviation, (X3 – M), must have. Only the first two deviations are “free to vary,” that is, free to take on any possible value. Once we know the value of any two of the deviations, the value of the last deviation is determined (it must be whatever number is needed to make the sum of all deviations equal 0). This is only a demonstration, not a formal proof.

A degrees of freedom (df) term tells us how many independent pieces of information we have available when we compute SS or another statistic, such as a variance (in other words, how many deviations from the mean are “free to vary”). The modified divisor, N – 1, is the degrees of freedom for SS. The use of df instead of N as a divisor is another frequently used tool in the statistician’s bag of tricks. Later analyses also use df terms, although df often has values other than (N – 1) in other situations. Degrees of freedom for the SS and sample variance are obtained using Equation 4.7:

df = N – 1  (4.7)
4.12.4 Putting the Pieces Together: Computing a Sample Variance

The variance for a sample is usually denoted s². A sample variance is obtained by dividing SS by its degrees of freedom:

s² = SS/df = SS/(N – 1)
(Some textbooks use ŝ² to denote a sample variance calculated as SS/N. In actual practice, this notation is almost never used when statistics are applied to real-world data, and you will not see ŝ² again in this book.)

Return to the data in Figure 4.9. The first column shows heart rate scores for each person. The second column shows the deviation of each person’s score from the mean (the variable name is deviation). The third column shows each person’s squared deviation (the variable name is deviationsq). If we sum the squared deviations, we obtain SS = 288.90. For this sample of N = 10 cases, df = (N – 1) = 9. For the hr data in Figure 4.9, s² = 288.90/9 = 32.1.

It is useful to think about situations that would make the sample variance s² take on larger or smaller values. The smallest possible value of s² occurs when all the scores in the sample have the same value; for example, the set of scores [73, 73, 73, 73, 73, 73] would have SS = 0 and s² = 0. The value of s² is larger for a sample in which individual deviations from the sample mean are relatively large. SS will be larger for a batch of data in which scores take on a wide range of different values, such as [44, 52, 66, 97, 101, 119, 120], than in a data set in which the scores differ by very small amounts, such as [60, 65, 66, 68, 71, 74, 74]. SS cannot be negative; however, there is no fixed upper limit for possible values of SS. In large data sets, and for variables that have values in the thousands or tens of thousands, values of SS can be extremely large.

The information s² provides about differences among hr score values is in terms of “squared hr in beats per minute”; however, the original hr scores were in beats per minute. It would be useful to convert variance back into the units of the original data. Now we use another tool in the statistician’s bag of tricks: To convert something given in X² units back into original X units, we take the square root.
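To make the division by df concrete, here is a short Python sketch (an illustration only, not SPSS output). It reuses the two sets of scores mentioned above and the reported SS = 288.90 for the N = 10 heart rate scores; the helper name sample_variance is hypothetical. Python's standard statistics.variance divides by N – 1 and statistics.pvariance divides by N, which allows a check against the hand calculation.

```python
import statistics


def sample_variance(scores):
    """Sample variance: SS divided by its degrees of freedom, N - 1."""
    n = len(scores)
    mean = sum(scores) / n
    ss = sum((x - mean) ** 2 for x in scores)
    return ss / (n - 1)


# Reported values for the heart rate data: SS = 288.90, N = 10, so df = 9.
s2_hr = 288.90 / (10 - 1)

wide = [44, 52, 66, 97, 101, 119, 120]   # scores spread over a wide range
narrow = [60, 65, 66, 68, 71, 74, 74]    # scores that differ by small amounts
s2_wide = sample_variance(wide)
s2_narrow = sample_variance(narrow)

# Degrees of freedom: deviations sum to 0, so the last one is fixed by the others.
mean_narrow = sum(narrow) / len(narrow)
deviations = [x - mean_narrow for x in narrow]
last_implied = -sum(deviations[:-1])

print(round(s2_hr, 2))
print(round(s2_wide, 2), round(s2_narrow, 2))
print(round(deviations[-1], 6), round(last_implied, 6))
```

Dividing the same SS by N (as statistics.pvariance does) gives a slightly smaller value than dividing by N – 1, which is the pattern described in Step 3.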
4.13 SAMPLE STANDARD DEVIATION (s OR SD)

The sample standard deviation is usually denoted s in textbooks. In research reports it is often denoted SD. Taking the square root of s² gives the value of s. Another way to find s is to compute it from the values of SS and df:

s = √(SS/df) = √(SS/(N – 1))
For the set of N = 10 heart rate scores given above, the variance s² was 32.1; the sample standard deviation s equals the square root of this value, 5.67. The sample standard deviation tells us something about typical distances of scores from the mean. The frequency distribution of female height data in the next section provides an intuitive sense of what a standard deviation (s) tells us about data.

4.14 HOW A STANDARD DEVIATION DESCRIBES VARIATION AMONG SCORES IN A FREQUENCY TABLE

A larger sample of scores is used in the next example to get a sense of what standard deviation tells us about distances of scores from the sample mean. Note that this works well only when the data have a bell-shaped or approximately normal distribution. The hypothetical data in this sample consist of height (in inches) for N = 120 women; see Figure 4.10.

The frequency table in Figure 4.10 shows that the minimum height was 58 and the maximum height was 70 inches. The location of the mean, M = 64.5, is marked on the right side of the frequency table. The vertical arrows on the right indicate what range of height values are included when you compute the following.

The values of M and SD can be combined to set up ranges of score values; that is, we can combine information about the mean and information about typical distances from the mean. This can be done using integer multiples of SD, such as M ± 1 × SD and M ± 2 × SD. For M = 64.5 and SD = 2.5, we obtain the following values for the hypothetical female height data:

M – (1 × SD) = 64.5 – 2.5 = 62.0 and M + (1 × SD) = 64.5 + 2.5 = 67.0
M – (2 × SD) = 64.5 – 5.0 = 59.5 and M + (2 × SD) = 64.5 + 5.0 = 69.5
The shorter vertical arrow next to the frequency table in Figure 4.10 extends from M – (1 × SD) to M + (1 × SD), score values from 62 to 67. This corresponds to the frequencies enclosed in the smaller ellipse. The longer vertical arrow ranges from M – (2 × SD) to M + (2 × SD), score values from 59.5 to 69.5. This corresponds to scores in the larger ellipse. Most women in the sample had heights that were included in the range M – (2 × SD) to M + (2 × SD); only three women (2.5%) had scores below 59.5, and only two women (1.7%) had scores above 69.5.

In words: When we combine information about distance from the mean (SD) with the location of the mean (M), we obtain information about the range of values within which most of the X scores lie; this is called the range rule. The range rule works only for bell-shaped distributions, as in the present example.
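As a hedged illustration of the arithmetic in Sections 4.13 and 4.14, the Python sketch below takes the square root of the heart rate variance and builds the M ± k × SD ranges for the height example. The function name sd_interval is an assumption made for this example; the inputs (s² = 32.1, M = 64.5, SD = 2.5, N = 120, Min = 58, Max = 70, and the counts of 3 and 2 women outside the wider range) are the values reported in the text.

```python
import math


def sd_interval(mean, sd, k):
    """Return the (lower, upper) limits of mean - k*sd and mean + k*sd."""
    return (mean - k * sd, mean + k * sd)


# Standard deviation is the square root of the variance (heart rate example).
s_hr = math.sqrt(32.1)

# Ranges for the hypothetical female height data (M = 64.5, SD = 2.5).
one_sd = sd_interval(64.5, 2.5, 1)
two_sd = sd_interval(64.5, 2.5, 2)

# Percentage of the N = 120 women who fall outside M plus or minus 2 SD.
n = 120
pct_below = 100 * 3 / n
pct_above = 100 * 2 / n

# Observed range of heights next to 4 x SD, for comparison.
observed_range = 70 - 58
four_sd = 4 * 2.5

print(round(s_hr, 2))
print(one_sd, two_sd)
print(round(pct_below, 1), round(pct_above, 1))
print(observed_range, four_sd)
```

The same function works for any multiple k, but interpreting the resulting ranges as "where most scores lie" depends on the distribution being roughly bell shaped.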
Figure 4.10 Hypothetical Data for Female Height in Inches for N = 120 Women With M = 64.5 and SD = 2.5

Here are some approximate (not exact) relationships of SD with data values that can help you understand what SD = 2.5 tells us.

In the preceding example, the range for height scores (70 – 58) was 12. The range rule suggests that, for a bell-shaped distribution, the range is often a little more than 4 × SD. For these data, 4 × SD = 4 × 2.5 = 10. Turning this statement around, the range rule suggests that SD is often a little less than one quarter of the range (here, 12/4 = 3). Knowing that SD is related to range may help you understand SD. Remember that the range rule works only for bell-shaped distributions.

The value of SD tells us about typical distances of scores from the sample mean. Few scores are lower than 2 × SD units below M or higher than 2 × SD units above M. In other words, 2 × SD is a large distance from the mean; only a small percentage of scores are that far away from M. If a research report tells you that the distribution of scores is close to normal with known values for M and SD, this is sufficient information for you to guess the range.

Using SD = 2.5, deviations from the mean height of 2.5 inches or less (either positive or negative) were very common; that is, most women had heights between 62 and 67 inches. Almost all people had deviations from the mean that were less than 2 × SD in absolute value (2 × SD = 2 × 2.5 = 5 inches); that is, almost all women had heights between 59.5 and 69.5 inches.

Good practice: To choose the most appropriate statistics to describe central tendency and variability, the data analyst should examine a frequency distribution table or graph. If the distribution is approximately normal, M and SD are good ways to describe these. If the distribution is clearly non-normal, Mdn and interquartile range may be preferred. For distributions that are not bell shaped, see the next chapter for better ways to describe variation among scores.

4.15 WHY IS THERE VARIANCE?
Why do scores differ across people? This is the most fundamental question in applied statistics. For data about humans, the question becomes: What makes people different? Why do some people have higher, and some people lower, heart rates? Why are some people taller and others shorter?

Some characteristics do not differ across people (they are constant). Most people have five fingers on each hand. The rare exceptions are people who have genes for a different number of fingers, or people who have lost fingers because of injury. However, characteristics such as heart rate do differ across persons and situations. Suppose you measure hr for all members of a group. Some persons will have low hr; their hr may be lower than average because they are physically fit and do not smoke. Others have high hr; these elevated hr scores might be due to anxiety or caffeine consumption.

A first goal of statistical analysis is to quantify or describe how much people differ. Range, variance, and standard deviation provide this information. We will consider a more interesting question in upcoming chapters: Can we explain or predict these differences in heart rate? Can we understand why people differ? You probably already have some intuitions about factors that are related to hr, for example, smoking and physical fitness. When we go on to bivariate analyses, we will ask how hr scores are statistically related to other variables, such as amount of anxiety or stress. Results of these analyses can lead to inferences that stress predicts, or perhaps influences, heart rate.

In later chapters you’ll see that the overall variance for a variable such as hr can be divided (or partitioned) into proportions of variance that can be predicted from or are related to other variables (such as physical fitness, smoking, anxiety, and caffeine use). Some variables may predict large proportions of variance in heart rate (possibly these are the variables that have the strongest influence on hr).
Other variables may predict little or none of the variance in hr. For those of us who are excited about statistics, this is where the fun begins; this is where we can make discoveries or test past research claims about discoveries.

4.16 REPORTS OF DESCRIPTIVE STATISTICS IN JOURNAL ARTICLES

Most journal articles report descriptive statistics for numerous variables. Information about categorical variables (that describe groups in the study) can usually be provided in sentence form. Usually information for numerous quantitative variables is summarized in table form.

The following data are from Warner, Frye, Morrell, and Carey (2017). The predictor variable of most interest was number of servings of fruit and vegetables consumed per day (NCIfv, servings of fruits and vegetables from a National Cancer Institute food frequency questionnaire). Past research suggested that people who eat more fruits and vegetables tend to have higher scores on measures of well-being such as life satisfaction and positive mood. The outcome variable of most interest was life satisfaction (LS). Before doing analyses to evaluate whether NCIfv predicts LS, we need to know about the behavior of scores for each of these variables.

This survey was completed by 492 students from a university in New England, including 152 male and 340 female students. They were recruited from introductory courses, 79 from a nutrition course and 413 from psychology classes. All participants were between ages 18 and 24; the modal age was 18. Descriptive statistics for quantitative variables appear in Table 4.1.
Tables of descriptive statistics often use abbreviated names for variables that are used throughout the paper. Notes at the bottom of the table identify the variables and provide additional information about them. Direction of scoring must be clear (for example, we need to know that a score of 5 indicates better sleep, rather than more sleep problems). It is helpful to list variables in sets (in this example, a list of well-being outcome measures, a list of behavioral predictors, and a list of dietary predictors). An earlier “Methods” section in the research report would provide more information about how variables were measured. Information about distribution shapes should be included; this is discussed in Chapter 5.

4.17 ADDITIONAL ISSUES IN REPORTING DESCRIPTIVE STATISTICS

Many additional kinds of information can be included in summary tables. The minimum information usually provided for each quantitative variable is M and SD. Table 4.1 included the possible minimum and maximum scores for each variable, on the basis of the way scores were obtained for these variables. Readers who are not familiar with the variables will find this helpful to evaluate the obtained scores. Here are other things a summary table might include: the minimum and maximum scores obtained in the sample, numbers of missing values for each variable, and information about reliability for each variable. If you do research in a specific area, look at tables in published research reports to see if additional information is usually included in summary tables for descriptive statistics.
a. NCIfv is the number of servings of fruit and vegetables in a typical day on the basis of a National Cancer Institute food frequency questionnaire, with responses recoded on a scale from 0 to 8. The modal response was 0. Because of this, and because this variable is the most important predictor variable in this study, the entire frequency distribution should be presented in a separate table (not shown in this chapter).

For a categorical variable such as sex, report proportions of male and female respondents as descriptive information (not the mean and standard deviation of scores for sex).

When a study includes many groups and/or many variables, all groups and all variables should be identified and reported in descriptive tables. This lets readers know if you have selectively excluded some groups or variables from the analyses you report later.

4.18 SUMMARY

Research reports often describe scores on quantitative data using the sample mean M, the standard deviation SD (or s), and the variance s². Readers tend to assume that scores for quantitative variables have an approximately bell-shaped distribution (if they are not informed otherwise), and they interpret the descriptive statistics accordingly.

The “bag of tricks” used to compute many statistics is actually quite small, and you have seen several of these tricks in this chapter:

When a sum of deviations would be zero, square terms before summing them.
When correcting for the number of deviations (or pieces of information) included in a sum, divide by df instead of by N.
To put information back into the original terms of measurement, take the square root.

These “tricks” are used again in many future analyses.

You have seen that the sample mean is not always the best description of central tendency. In some frequency distributions, M is much larger (or smaller) than the median, and the magnitude of the mean is influenced strongly by a few extreme scores.
When frequencies have more than one mode, or are skewed, M is sometimes not the best description of the “typical” response. When you report a mean, you need to tell readers something about the shape of the frequency distribution to provide the background information needed to understand potential problems with the mean.

Statistics books provide so many examples of bell-shaped distributions that students may assume that all data have this distribution shape. However, many common kinds of variables do not have bell-shaped distributions. Graphs, discussed in Chapter 5, can be used to evaluate whether scores have a bell-shaped distribution or some other distribution shape. When reporting information about variables, remember that readers may assume a bell-shaped distribution if you do not explain clearly that the distribution shape is different.
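A small numeric illustration may help here. The Python sketch below uses a made-up, positively skewed set of scores (the values are hypothetical, chosen only for illustration) and the standard library statistics module to show how far apart the mean, median, and mode can be.

```python
import statistics

# Hypothetical, positively skewed scores: most are small, one is extreme.
scores = [1, 1, 1, 2, 2, 3, 4, 30]

mean = statistics.mean(scores)      # pulled upward by the extreme score of 30
median = statistics.median(scores)  # middle of the rank-ordered scores
mode = statistics.mode(scores)      # most frequent score value

print(mean, median, mode)
```

In this made-up sample, the single score of 30 pulls the mean well above the median and the mode, so reporting the mean alone would overstate the "typical" score.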
If you read mass media reports about “averages,” you need to know whether average was estimated using the mode, median, or mean; under some circumstances, these three descriptive statistics can yield very different values. The next chapter provides further information about obtaining and interpreting graphs of frequency distributions and additional questions we can ask about distributions of scores on a quantitative variable.

APPENDIX 4A: ORDER OF ARITHMETIC OPERATIONS

Many equations combine two or more arithmetic operations, for example, ∑X² includes both squaring and summing X scores. When operations are combined, the result often differs depending upon the order in which operations are done. Consider this set of scores: X = [1, 3, 5, 2]. If you square each X value and then sum the squared values, you would obtain (1 + 9 + 25 + 4) = 39. If you sum the X’s and then square that sum, you would obtain (1 + 3 + 5 + 2)² = 11² = 121. It is important to know which arithmetic operation to do first.

CHAPTER 5 GRAPHS: BAR CHARTS, HISTOGRAMS, AND BOXPLOTS

5.1 INTRODUCTION

Information about scores that was presented in the form of frequency tables in Chapters 3 and 4 can be presented in simple graphs. This chapter describes some widely used types of graphs: pie charts and bar charts for categorical variables, and histograms and boxplots for quantitative variables. Each approach (frequency table vs. graph) has advantages and potential disadvantages:

An advantage of frequency tables is that they provide exact information about the numbers or percentages of persons who had each score value. The corresponding disadvantage of graphs is that when they are poorly labeled, it is difficult to identify exact numbers and percentages.
A disadvantage of graphs is that they can be constructed in ways that create deceptive impressions. Frequency tables generally are not deceptive.
An advantage of graphs is that they provide appealing visual information that grabs readers’ attention; this is particularly useful in mass media reports, PowerPoint or Prezi presentations, and poster presentations at professional conferences. A disadvantage of frequency tables is that they do not have much visual appeal.
An advantage of graphs for quantitative variables (such as histograms) is that they provide easily understandable information about distribution shape. This chapter describes several distribution shapes commonly seen in real data (examples appear in Tables 5.1 and 5.2). The bell-shaped curve (more formally, the normal distribution or Gaussian distribution) is of particular interest. The normal distribution will be discussed further in Chapter 6. A disadvantage of frequency tables is that it can be difficult (although it is possible) to evaluate distribution shape by inspection of a frequency table.

Ideally, preliminary data screening includes frequency tables (Chapter 3), descriptive statistics (Chapter 4), and graphs (the present chapter). Frequency tables are rarely included in published research reports. Graphs of frequency distributions are not often reported in journal articles, although they can be. Information in frequency tables can be used to label graphs accurately.
SPSS does not produce publication-quality graphics. For beginners, this is not a major problem; the graphs are adequate for preliminary data screening. Advanced users may prefer other programs to generate graphics. The R supplement for this book (Rasco, 2020) demonstrates use of the ggplot procedure; this produces better quality graphics. I modified most SPSS graphics in this book by editing to increase font sizes and add information.

In real-world data analysis, descriptive statistics, frequency tables, and graphs should be examined before a data analyst conducts the main analysis that is of primary interest (such as a t test or analysis of variance [ANOVA]). These provide information needed for preliminary data screening. Published research reports typically include only a few sentences about preliminary screening (if they mention it at all). Hoekstra, Kiers, and Johnson (2012) noted that many authors don’t report much about data screening; they argue that the validity of statistical results is often questionable because assumptions required for statistical analysis are not satisfied (and often not even checked). Potential violations of some of the assumptions that are introduced later can be assessed by examining graphs.

5.2 PIE CHARTS FOR CATEGORICAL VARIABLES

Pie charts are almost universally despised by scientists, and you are unlikely to see them in academic journals; however, they are popular in mass media, so you should be familiar with them. Consider the frequency table for hypothetical scores for the categorical variable marital status (Figure 5.1). Recall that the “Cumulative Percent” column automatically provided by SPSS makes no sense for categorical variables. Focus on the “Frequency” and “Percent” columns.

To request a pie chart, use the familiar Frequencies procedure, beginning with these SPSS menu selections: <Analyze> → <Descriptive Statistics> → <Frequencies>.
Click the Charts button to open the Frequencies: Charts dialog box in Figure 5.2; within that window, select the radio button for “Pie charts,” then click Continue and OK. Edited pie chart output appears in Figure 5.3.
The frequency table in Figure 5.1 tells us that the group with the largest number of members is “never married”; this corresponds to the solid “slice” in the pie chart. The frequency table has a great advantage over the pie chart; it provides exact frequencies and percentages, while the pie chart only approximates group sizes (unless the slices are labeled using numbers or percentages).
Kopf (2015) reviewed reasons why many data analysts hate pie charts. For example, people are not good at estimating percentages from the areas of the slices. Pie charts require the use of colors (or textures such as dots or stripes) to differentiate slices; most science journals do not publish figures in color. Tufte (2001), who authored several books about excellence in graphing, regards most multicolored figures as unsightly; he argues that graphs should use as little ink as possible to provide complete information.

Pie charts have only two virtues. They provide colorful slides in presentations, and this is something that some data analysts (in marketing, for example) may like. Also, they lend themselves well to humor. (Search online for “funny pie charts” to find examples, or create your own comic version. Perhaps you can persuade your instructor to give a prize or extra credit for the most comical or ingenious examples.) If you become a science researcher, you will probably never use pie charts.

5.3 BAR CHARTS FOR FREQUENCIES OF CATEGORICAL VARIABLES

The SPSS Frequencies procedure, which was used in previous chapters to obtain frequency tables, can also provide bar charts (or bar graphs). To open the Frequencies dialog box that appears in Figure 5.4, make these menu selections: <Analyze> → <Descriptive Statistics> → <Frequencies>. Click the Charts button on the right-hand side of the Frequencies dialog box to open the Frequencies: Charts box; a bar chart is obtained by selecting the radio button for “Bar charts” in the Frequencies: Charts dialog box (also shown in Figure 5.4). The Y axis may be given in frequencies (number of cases) or percentages. Click Continue to return to the main Frequencies dialog box. Click OK to run the procedure.

The hypothetical marital status scores in Figure 5.1 were used to set up the bar graph in Figure 5.5. The height of each bar represents group size.
I edited the bar graph produced by SPSS (using the SPSS Chart Editor and Microsoft Paint) in the following ways: I increased font sizes for the X and Y axis labels and added the exact number of cases per group (from the frequency table) above each bar.
5.4 GOOD PRACTICE FOR CONSTRUCTION OF BAR CHARTS

Bar charts and other graphs should provide accurate information that is easy to understand. It is easier for readers to understand graphs when they follow simple rules and conventional standards.

A separate bar represents the frequency (or proportion or percentage of cases) for each group.
The height of the bar corresponds to the number or frequency in each group (or the proportion or percentage of cases in each group). The labels on the Y axis should make clear whether frequency, proportion, or percentage is reported. However, the relative heights of the bars are the same no matter which label is used. (Usually bars are vertical, but it is possible to set up bar charts in which bars are horizontal.)
Names of groups are specified by labels on the X axis.
Bars should have equal widths. (This rule is not always followed.)
The height of the graph (Y axis) is usually less than the width of the X axis (the height of Y is often about 75% the length of X).
The Y axis begins at 0 (or at another minimum value of Y).
The top of each bar is labeled with an exact numerical value (a frequency or a percentage). SPSS does not do this for you; I added this information using SPSS Chart Editor.
Information about total N must be provided.
In a footnote or the body of the text, the source of data should be stated. Readers tend to assume that numbers are based on new data collected by the researcher; if there is another source (such as Gallup polls or the U.S. census), that source must be identified.
Bars in bar graphs for categorical variables usually do not touch one another. (This reminds readers that bars represent distinct groups.)

When you generate bar charts for frequencies in SPSS, many of these good form requirements are taken care of by default (e.g., bars are equal widths, and the Y axis begins at 0).
5.5 DECEPTIVE BAR GRAPHS

The most common way to make a bar chart for group frequencies “lie” is to set up the Y axis so that it does not start at 0. To illustrate this deception, I modified the graph in Figure 5.5 so that the Y axis begins at 2 (instead of 0). The modified bar chart in Figure 5.6 is potentially misleading because people tend to look at the ratio of bar heights (or bar areas) when they compare group sizes; people often do not pay close attention to the specific values indicated on the Y axis. In Figure 5.6, the differences in group sizes appear larger than in Figure 5.5. In Figure 5.6, the never married group appears to have about 10 times as many members as the widowed group (measure the height of the bar for never married and divide this by the height of the bar for the widowed group). When actual group sizes are considered (20 never married, 3 widowed), the never married group is only about 7 times as large as the widowed group.
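The effect of a truncated Y axis can be worked out with simple arithmetic. The Python sketch below is an illustration under stated assumptions: the function name apparent_ratio and the group sizes (50 and 40 cases, with the axis starting at 30) are hypothetical and do not come from Figure 5.5 or Figure 5.6. It compares the ratio of drawn bar heights with the ratio of the actual frequencies.

```python
def apparent_ratio(freq_a, freq_b, y_axis_start=0):
    """Ratio of drawn bar heights when the Y axis begins at y_axis_start."""
    if min(freq_a, freq_b) <= y_axis_start:
        raise ValueError("both frequencies must exceed the Y axis starting value")
    return (freq_a - y_axis_start) / (freq_b - y_axis_start)


# Hypothetical group sizes, not taken from the marital status data.
group_a, group_b = 50, 40

honest = apparent_ratio(group_a, group_b)          # axis starts at 0
truncated = apparent_ratio(group_a, group_b, 30)   # axis starts at 30

print(honest)     # 1.25: bar heights match the true ratio of group sizes
print(truncated)  # 2.0: the larger group now looks twice as large
```

The closer the starting value of the Y axis is to the smaller frequency, the more the drawn ratio exaggerates the true difference between groups.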
Another way to make this type of graph deceptive is by using cartoon figures instead of bars. Huff and Geis (1993) provide examples. Here is a graph similar to examples in Huff and Geis: Suppose you graphed the number of new houses built in two different years using the heights of cartoon figures of houses, instead of the heights of bars, to represent frequencies (see Figure 5.7).
In this imaginary example, twice as many new houses (10,000) were built in 2019 as in 2009 (5,000). The heights of the cartoon houses correspond to these frequencies. However, a reader is likely to use the area of each cartoon (or even the perceived volume, if the cartoon appears three-dimensional) to make the comparison. On the basis of the areas of the house images, a reader might form the impression that the number of new houses was something like 6 times greater in 2019. Replacement of bars with equal widths by cartoon images (particularly if they have unequal widths) can be very deceptive (Huff & Geis, 1993).

When you create bar charts, make them as informative and honest as possible; include exact numerical information such as frequencies or percentages for each group. When you read graphs such as bar charts, pay more attention to the numbers on the Y axis (the frequencies and percentages for each group) than to the picture (bars or cartoon figures). Be aware that when the Y axis starts at a value larger than 0, a bar chart may be deceptive.

5.6 HISTOGRAMS FOR QUANTITATIVE VARIABLES

A histogram provides visual information about the distribution of scores on quantitative variables. A histogram is like a bar chart in that each bar represents the number or percentage or proportion of cases that have each score value. However, because quantitative variables often have more different score values than categorical variables, histograms tend to have more bars. Marks on the X axis represent score values, and marks on the Y axis represent the frequency (or proportion or percentage) of cases in the sample that had each score value. Some introductory textbooks state that bars in histograms should touch one another. In SPSS, the bars in both bar charts and histograms have small spaces between them.

A new question we can ask when we examine a histogram is, What is the shape of the distribution? The easiest way to learn about distribution shape is to consider examples.
Some distribution shapes that often appear in real data appear in Tables 5.1 and 5.2. The first line in Table 5.1 shows a bell-shaped (more formally called a normal or Gaussian) distribution. Other rows in Table 5.1 show distributions that are variations of this shape.
However, when we look at real data, we often see distributions that do not look anything like normal or bell-shaped curves. Table 5.2 shows distributions with shapes that are clearly not close to bell-shaped or normal. Tables 5.1 and 5.2 do not include all possible distribution shapes; there are many others. To decide whether one or more of these distribution shapes best describe the data in your sample, you can obtain a histogram and compare it with the examples in these tables. Visual examination of a histogram is usually sufficient to make reasonable evaluations about distribution shapes. In Chapter 6, you’ll see that there are quantitative methods to evaluate how well data fit a specific distribution shape; however, these are rarely used in practice.

The bell-shaped distribution in row 1 of Table 5.1 is discussed extensively in statistics. Informally, we can describe this bell-shaped distribution shape as follows. There is a “hump” in the center of a bell-shaped distribution. In a perfectly normal distribution, the mean, median, and mode are exactly equal and correspond to the center of the distribution (and all correspond to the top of the hump). Frequencies (the heights of the bars in the histogram) decline gradually as scores become either larger or smaller than the mean, median, or mode; this creates a shape something like a bell. The distribution is symmetrical around the mean. That is, the upper half of the histogram is a mirror image of the lower half.

Comprehension questions will ask you to examine histograms and evaluate whether the distribution is bell-shaped with minor variations or is described better by quite different
distribution shapes. This is a somewhat subjective judgment call. Sometimes the best decision is to say that none of the common distribution shapes is a good description for a histogram you obtain for your data. People who often work with specific types of variables (such as reaction
time) will learn the specific distribution shapes for those variables.

5.7 OBTAINING A HISTOGRAM USING SPSS

The hypothetical female height data in femaleheight.sav are used to set up a histogram. You may have wondered why height and temperature are used as variables in early examples. Each of these variables can be given in different units (for example, height can be inches or centimeters; temperature can be given in degrees Fahrenheit or Celsius). (The United States is one of very few nations that still uses nonmetric units such as inches.) The following example shows how to obtain histograms; examples demonstrate that converting units of measurement from inches to centimeters does not change the shape of the frequency distribution (although unit conversion does change the values of M, SD, and other descriptive statistics).

You will find it useful to be able to convert scores from one unit to another, and to do other computations. The SPSS Compute Variable command can be used to do this and has many additional potential uses. In this situation, we use this command to obtain (approximate) height in centimeters by multiplying height in inches by 2.54. To open the Compute Variable dialog box in Figure 5.8, select these menu options: <Transform> → <Compute Variable>. In the left-hand window, type the name of the new variable (in this example, heightcm). In the right-hand window, type a numerical expression that includes the name of one (or more) existing variable(s) that is used to assign values to the new variable (in this example, the numerical expression is “2.54*heightinch”). After you click OK, the new variable heightcm will appear as a new column on the right-hand side of your data worksheet.

Now let’s compare the distributions of height in inches and height in centimeters. The familiar Frequencies procedure (used to obtain descriptive statistics and pie and bar charts for categorical variables) can be used to request histograms for quantitative variables.
Use these SPSS menu selections: <Analyze> → <Descriptive Statistics> → <Frequencies>. Move both variables (heightinch and heightcm) into the Variable(s) pane. Click the Charts button and select the radio button for “Histograms.” You may also want to check the box for “Show normal curve on histogram.” Click the Statistics button and use checkboxes in the Frequencies: Statistics dialog box to choose the desired descriptive statistics. Click OK to run the procedure. Output for descriptive statistics appears in Figure 5.9 and the histograms in Figure 5.10.
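For readers who want to check the arithmetic outside SPSS, here is a minimal Python sketch of the same unit conversion. The ten height values are hypothetical (they are not the scores in femaleheight.sav), and the variable names simply mirror the SPSS names used above. The sketch multiplies each score by 2.54, compares descriptive statistics before and after, and then compares standardized distances from the mean as a rough check that the shape of the distribution is unchanged.

```python
import statistics

# Hypothetical heights in inches; NOT the scores in femaleheight.sav.
heightinch = [60, 62, 63, 64, 64, 65, 66, 67, 68, 70]

# Same idea as the Compute Variable expression 2.54*heightinch.
heightcm = [2.54 * x for x in heightinch]

# Ratios of descriptive statistics after versus before the conversion.
mean_ratio = statistics.mean(heightcm) / statistics.mean(heightinch)
sd_ratio = statistics.stdev(heightcm) / statistics.stdev(heightinch)
var_ratio = statistics.variance(heightcm) / statistics.variance(heightinch)


def standardized(scores):
    """Distance of each score from the mean, in standard deviation units."""
    m = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [(x - m) / sd for x in scores]


# If the shape is unchanged, standardized scores match case by case.
same_shape = all(
    abs(a - b) < 1e-9
    for a, b in zip(standardized(heightinch), standardized(heightcm))
)

print(round(mean_ratio, 4), round(sd_ratio, 4), round(var_ratio, 4), same_shape)
```

In this sketch the mean and standard deviation ratios come out at about 2.54 and the variance ratio at about 2.54 squared (roughly 6.45), while the standardized scores are the same in both units.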
As you might expect, transformation of scores from inches to cm changed all values for descriptive statistics, such as mean and standard deviation. For example, the mean for height in centimeters is 2.54 times the mean for height in inches. Each descriptive statistic for height in centimeters is 2.54 times the corresponding statistic for height in inches (except that variance for height in centimeters is 2.54² times the variance in inches).

Did this transformation change the shape of the distribution? Figure 5.10 shows that the distributions of height scores given in inches and centimeters have identical shapes, even though individual scores and descriptive statistics such as M and SD are in different units, and the units along the X axis differ. I have marked M and SD in the two histograms above. The sample mean is approximately in the middle, marked by the letter M on the X axis. Recall that SD summarizes information about distances of scores from the mean; SD is shown as horizontal arrows that indicate distance from the mean of X. For height in inches, SD was 2.5 inches. The end points of the arrows that indicate the distance of one SD below M and one SD above M are:
In Chapter 6, you’ll learn about the mathematical definition of normal distribution shape (expressed in the form of a somewhat complicated equation). That equation generates the smooth curves superimposed on the histograms above.

5.8 DESCRIBING AND SKETCHING BELL-SHAPED DISTRIBUTIONS

When sample data are approximately normally distributed, you need only three pieces of information to specify the distribution, communicate information about it to someone else, and/or draw a sketch of that distribution. These pieces of information are: The distribution shape (normal). The sample mean M. The sample standard deviation SD.
For example, scores on many IQ tests are normally distributed with a mean of 100 and a standard deviation of 15 (or sometimes 16). This is enough information to sketch the shape of the distribution, as shown in Figure 5.11. The range rule (from the previous chapter) will help you identify the approximate locations of the minimum and maximum value on the X axis, and this range is divided into six parts (the range is approximately equal to 6 × SD if the distribution is normal). Note that you won’t be able to label the Y axis in this graph. You can label the following seven points along the X axis if you know M and SD. The seven X axis values marked in Figure 5.11 are calculated from M and SD as follows:
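The seven points are the mean plus or minus 0, 1, 2, and 3 standard deviations. A short Python sketch of that calculation, using the IQ values given in the text (M = 100, SD = 15), is shown here; the variable names are only illustrative.

```python
M = 100   # mean for many IQ tests, as stated in the text
SD = 15   # standard deviation, as stated in the text

# One point at the mean and three on each side, spaced one SD apart:
# M - 3 SD, M - 2 SD, M - 1 SD, M, M + 1 SD, M + 2 SD, M + 3 SD.
x_axis_points = [M + k * SD for k in range(-3, 4)]

print(x_axis_points)  # [55, 70, 85, 100, 115, 130, 145]
```

The distance from the lowest to the highest of these points is 145 – 55 = 90, which equals 6 × SD, in line with the range rule.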
Score locations relative to the mean can be approximately described as follows (we will describe distance from the mean more precisely in Chapter 6). A score can be called “not very far from the mean” if it lies within the range M – 1 SD and M + 1 SD. For example, an IQ of 110 is not very far from the mean. An X score can be called “far from the mean” if it is below M – 2 SD or above M + 2 SD. For example, an IQ of 135 is far above the mean; and an IQ of 69 is far below the mean. A score can be called “unusually far from the mean” if it is less than M – 3 SD or greater than M + 3 SD. For example, an IQ of 50 is unusually far below the mean, while an IQ of 150 is unusually far above the mean.

Another way to look at this: If you had a set of 1,000 IQ scores with M = 100 and SD = 15, and you selected one case at random, the most likely outcome would be an IQ in the range from 85 to 115. You could obtain a case with an IQ in the range 150 and up, but that would be an unusual or unlikely outcome.

If you know your own IQ, or any specific IQ score, you can locate that score on the X axis, and immediately see the following: Is your IQ score above or below the mean M? Is it far from the mean, or unusually far from the mean? Refer to Figure 5.11. An IQ of 90 is below the mean, but it is not very far from the mean. An IQ of 160 is above the mean, and it is unusually far from the mean (in other words, very few people have IQ scores equal to or greater than 160).

5.9 GOOD PRACTICES IN SETTING UP HISTOGRAMS
Most rules for good practice in bar chart construction also apply to the construction of histograms: A separate bar represents the frequency (or proportion or percentage of cases) for each score value (or for a range of score values, as described later in this section). The height of each bar corresponds to the number or frequency in each group (or the proportion or percentage of cases in each group). Labels on the Y axis should make clear whether frequency, proportion, or percentage is reported. (However, the relative heights of the bars are the same regardless of labels.) Score values are specified by labels on the X axis. Bars should have equal widths. The height of the graph (Y axis) is usually less than the width of the X axis. The Y axis begins at 0. In bar charts, it is good practice to label the top of each bar with an exact numerical value (a frequency or a percentage). There may not be enough space on a histogram to include such labels. Clearly labeled tick marks on the Y axis help readers evaluate frequencies. Information about total N must be provided. In a footnote or the body of the text, source of data should be stated. Readers tend to assume that numbers are based on new data collected by the researcher; if there is another source (such as Gallup polls or the U.S. census), that source must be identified.

For many quantitative variables, it is necessary to divide scores into bins or groups to set up a histogram with a reasonable shape. SPSS does this automatically for you and uses a “secret sauce” recipe to optimize the number of bins. When practical, the width of bins should be equal. Appendix 3C describes how scores can be grouped or binned to set up a grouped frequency table. If heart rate scores range from 57 to 89, we can set up an ungrouped frequency table with one row for each score value; we could set up an ungrouped histogram with one bar for each score value. Grouping isn’t essential in frequency tables.
However, it is much easier to evaluate distribution shapes in histograms when score values are grouped or binned. For hr scores from 57 to 89, the following seven groups or bins could be created. Each bar in the histogram would correspond to the number of cases with scores in each range: one bar for hr between 56 and 60, one bar for scores between 61 and 65, and so on.
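The grouping just described can be sketched in Python. The heart rate scores below are hypothetical (they are not taken from the textbook data files), and the helper name bin_label is an assumption made for this illustration. The sketch uses the seven equal-width bins described in the text, starting with 56 to 60 and ending with 86 to 90, and counts how many scores fall in each bin; these counts are the bar heights of a grouped histogram.

```python
from collections import Counter

# Hypothetical heart rate scores between 57 and 89 (not the textbook data).
hr = [57, 59, 62, 64, 65, 68, 70, 70, 72, 74, 75, 77, 79, 81, 83, 86, 89]

BIN_WIDTH = 5
FIRST_LOWER = 56  # the lowest bin covers scores 56 through 60


def bin_label(score):
    """Return the label of the 5-point bin (e.g. '56-60') for an integer score."""
    index = (score - FIRST_LOWER) // BIN_WIDTH
    lower = FIRST_LOWER + index * BIN_WIDTH
    return f"{lower}-{lower + BIN_WIDTH - 1}"


# Count the cases in each bin.
counts = Counter(bin_label(score) for score in hr)

# Seven bins: 56-60, 61-65, 66-70, 71-75, 76-80, 81-85, 86-90.
bins = [f"{lower}-{lower + BIN_WIDTH - 1}" for lower in range(56, 87, BIN_WIDTH)]

for label in bins:
    print(label, counts.get(label, 0))
```

Listing every bin, including any bin with a count of 0, keeps empty ranges visible so that gaps in the distribution are not hidden.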
The following example varies the number of bins to show how the number and widths of bins affect the shape of a histogram. Examples below show unpublished BMI (body mass index) scores for N = 1,250 students at my university. Using metric values of weight and height, BMI equals weight (in kilograms) divided by height (in meters) squared. (Using pounds and inches, BMI equals weight divided by height squared, multiplied by 703.) Healthy or normal BMI is usually defined as 18.5 to 24.9. Larger BMI values indicate greater body weight relative to height.

Three situations are shown: a histogram with only one bin for all BMI values from 14 to 40; a histogram with “too many” bins, each for a very narrow range of BMI values; and a histogram with an optimal number of bins. If all BMI scores are placed in one bin, there would be only one bar in the histogram, as shown in Figure 5.12. A histogram with only one bar does not provide any information about distribution shape. If there are varying numbers of observations in a large number of very narrow bins, the histogram might resemble the jagged graph in Figure 5.13. This is better than the previous graph, but it can be improved.
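The two BMI formulas given above can be checked against each other with a small Python sketch. The function names and the example person (70 kg, 1.75 m) are hypothetical, and the pounds-per-kilogram factor is an approximate conversion constant.

```python
def bmi_metric(weight_kg, height_m):
    """BMI from weight in kilograms and height in meters."""
    return weight_kg / height_m ** 2


def bmi_us(weight_lb, height_in):
    """BMI from weight in pounds and height in inches (factor of 703)."""
    return 703 * weight_lb / height_in ** 2


# Hypothetical person: 70 kg and 1.75 m tall.
weight_kg, height_m = 70.0, 1.75

# Convert to U.S. units to check that both formulas agree.
weight_lb = weight_kg * 2.20462     # approximate pounds per kilogram
height_in = height_m * 100 / 2.54   # 2.54 centimeters per inch

print(round(bmi_metric(weight_kg, height_m), 1))
print(round(bmi_us(weight_lb, height_in), 1))
```

Both calculations give a BMI of about 22.9 for this hypothetical person, which falls inside the 18.5 to 24.9 range described as healthy.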
Figure 5.14 shows the histogram for BMI scores when SPSS was allowed to decide on the “optimal” number of bins. I marked the clinical cutoffs for normal BMI (18.5 and 24.9) on this histogram as points of reference. In Figure 5.14, it is clear that the distribution shape for BMI was approximately normal except for a few outliers at the high end of the distribution, that is, a few cases with unusually high BMI. The SPSS default choice for the number and widths of bins provided a relatively smooth histogram to use for evaluation of the distribution shape for this data set. (SPSS does not publish the details of how this decision is made, and rules for this can be complex.)

When your variable can be evaluated in terms of clinical guidelines (a BMI between 18.5 and 24.9 is generally described as indicating healthy body weight), it can be useful to evaluate distribution shape relative to these clinical cutoffs. A large proportion of students had BMI scores in the “healthy” range. A fairly substantial minority of students had BMI scores that would be judged overweight or very overweight; a few had BMI scores that would be described as underweight. (A frequency table would provide the information needed to find the exact percentages of persons who were over- or underweight.) This distribution is positively skewed; it has a longer tail at the high end. Skewness is discussed further in Chapter 6.

It is desirable to have bins that correspond to the same ranges of score values, but this is not feasible in some situations. Figure 5.15 shows a histogram for real data: the percentages of households whose annual incomes fall into ranges such as less than $5,000, between $5,001 and $10,000, and so forth. In the histogram in Figure 5.15, each bar (except for the last two bars on the right) indicates the percentage of households that reported incomes in $5,000 increments: the first bar $0 to $5,000, the second bar $5,001 to $10,000, and so forth.
The last two bars represent people whose incomes exceed $200,000 per year.
If the graph continued to use $5,000 increments at the upper end of the income distribution, many additional bars would be needed to represent the full range of incomes in the United States. If you drew an X axis wide enough to include all these additional bars, the graph would have to be at least five times wider than shown in Figure 5.15. To avoid that problem, information about incomes greater than $200,000 was compressed into two bars. When looking at graphs like this, readers need to notice how the last few bars were defined. A first impression might be that there is a mode for incomes between $200,000 and $205,000, but this impression is incorrect. In fact, there is an extremely long and thin tail for this income distribution (the distribution is extremely positively skewed).
It is easier to evaluate shapes of distributions by examining histograms or other graphs than by looking at frequency tables. It is easier to evaluate shape when the number and widths of bins are optimized; SPSS does this by default. In practice, histograms do not provide much information about sample distribution shape when sample sizes are very small. I suggest that a minimum of 30 scores may be required to get a sense of any sample distribution shape.

5.10 BOXPLOT (BOX AND WHISKERS PLOT)

A boxplot (also called a box and whiskers plot) provides a different way to assess the distribution of scores for a quantitative variable. Boxplots are particularly useful in situations where outliers (unusually high or low scores) are present, or where distribution shape does not resemble a bell-shaped curve. Recall that the sample mean M is not robust against the effects of extremely low and extremely high scores (outliers). The value of M can change substantially when outliers are added to or dropped from a sample. When outliers are present, and when histograms are not bell shaped, an alternative approach based on the median instead of the
mean may be preferable. Boxplots represent central tendency using the median and dispersion or variability using percentiles. The median corresponds to the 50th percentile in a distribution of scores. That is, 50% of the scores lie below the median and 50% lie above it. The median is more robust than the mean against the influence of outliers.

To describe distances of individual scores from a sample mean M in a bell-shaped distribution, we used the standard deviation (SD). The standard deviation is computed using the deviations of all individual scores from the sample mean. This may not be a good way to represent distances from the center of the distribution when the sample mean itself is misleading (because of the influence of outliers, for example). A boxplot avoids problems that arise because of extreme scores and non-normal distribution shape by: Using the median (50th percentile) as the index of central tendency (instead of M). Using the 25th and 75th percentiles (instead of SD), and other distances that are calculated from these percentiles, to describe distance of scores from the center of the distribution.

You can obtain the 25th and 75th percentiles as descriptive information from the SPSS frequencies procedure; results appear in Figure 5.16. The 25th percentile identifies scores in the bottom 25%, and the 75th percentile identifies scores in the top 25%. The space between the 25th and 75th percentiles corresponds to the middle 50% of scores (it corresponds to a range in the center of the distribution a little less than that between M – 1 SD and M + 1 SD).

5.10.1 How to Set Up a Boxplot by Hand

It is easier to understand SPSS output for a boxplot if you first construct a boxplot by hand. To set up a boxplot by hand, you need the following information: median, 25th percentile, and 75th percentile for the variable of interest. It is also useful to have the minimum and maximum scores and a frequency table of all score values.
Information needed to set up a boxplot for the femaleheight.sav data, obtained from the SPSS frequencies procedure, appears in Figure 5.16. If you follow these instructions, you should obtain a graph similar to Figure 5.17.
Here are the steps to set up a boxplot by hand. Place the full range of score values for the variable of interest (female height) on the Y axis. Draw a shaded box in the center of the graph, as shown in Figure 5.17. The lower end of this box corresponds to the 25th percentile, the line that bisects the box corresponds to the median, and the upper end of the box corresponds to the 75th percentile. For the female height data, from Figure 5.16, these values are 63, 64.5, and 66.

Calculate the interquartile range (IQR): IQR = 75th percentile – 25th percentile. On the basis of the information in Figure 5.16, IQR = 66 – 63 = 3. This is the height of the shaded box; 50% of the scores lie within this box.

Calculate the locations of the inner fences. On the basis of the information in Figure 5.16: Upper inner fence = median + 1.5 × IQR = 64.5 + 1.5 × 3 = 64.5 + 4.5 = 69. Lower inner fence = median – 1.5 × IQR = 64.5 – 1.5 × 3 = 64.5 – 4.5 = 60. If scores are normally distributed, about 95% of the scores lie between the inner fences. Use these values (60 and 69) to draw the ends of the “whiskers” or T-shaped bars, as shown in Figure 5.17.

An additional set of boundaries used to make judgments about outliers, the outer fences, do not appear in SPSS boxplots. The outer fences are usually defined as the median + 3 × IQR and median – 3 × IQR. Often, about 99% of the scores lie within the outer fences.

The inner and outer fences in boxplots can be used to identify outliers and extreme outliers. Sort the scores in the female height data set from lowest to highest. If any of the lowest scores fall below the lower inner fence, indicate them as outliers on the boxplot (using open circles). If any of the highest scores fall above the upper inner fence, mark them as outliers (also using open circles). Then check whether any of these can be called extreme outliers. Any score that lies more than 3 × IQR away from the median is marked as an extreme outlier (using an asterisk).
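The by-hand steps above can be sketched in Python as a check on the arithmetic. The sketch follows this section's definitions, in which fences are measured from the median; many other references and software packages instead measure 1.5 × IQR from the 25th and 75th percentiles, so results can differ. The three percentile values are those quoted from Figure 5.16; the function name classify and the handful of test scores are only illustrative.

```python
# Values read from Figure 5.16, as quoted in the text.
q1, median, q3 = 63.0, 64.5, 66.0

# Interquartile range: the height of the shaded box.
iqr = q3 - q1

# Fences as defined in this section (measured from the median).
inner_lower, inner_upper = median - 1.5 * iqr, median + 1.5 * iqr
outer_lower, outer_upper = median - 3.0 * iqr, median + 3.0 * iqr


def classify(score):
    """Label a score using the inner and outer fences defined above."""
    if score < outer_lower or score > outer_upper:
        return "extreme outlier"   # would be drawn as an asterisk
    if score < inner_lower or score > inner_upper:
        return "outlier"           # would be drawn as an open circle
    return "not an outlier"


print(iqr, (inner_lower, inner_upper), (outer_lower, outer_upper))

# A few illustrative heights in inches.
for score in (58, 59, 64.5, 69, 70):
    print(score, classify(score))
```

Under these definitions the inner fences are 60 and 69 and the outer fences are 55.5 and 73.5, so heights of 58, 59, and 70 inches count as outliers but not as extreme outliers, while a height of exactly 69 inches lies on the fence and is not flagged.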
In the graph in Figure 5.17, I added a number associated with each outlier. Note that
these numbers are not the score values; they are the lines in the SPSS data file where these outliers are found. To find the score values on rows 13, 84, and 120, look at the height values on these lines of the femaleheight.sav data file.

We can summarize information for the boxplot in Figure 5.17 as follows: For female height in inches, the median was 64.5 in.; 50% of heights were between 63 and 66 in. There were two high-end outliers (both were scores of 70 in.) and three low-end outliers (two scores of 59 in. and one score of 58 in.). None of the outliers were judged to be extreme.

5.10.2 How to Obtain a Boxplot Using SPSS

SPSS will be used to set up a boxplot for a different data set, a set of hypothetical BMI scores for 200 men and 200 women in the file bmi.sav. BMI scores were truncated (that is, all values to the right of the decimal points were dropped). To locate the SPSS boxplot procedure, go to the top-level menu heading for <Graphs> and make these menu selections: <Graphs> → <Legacy Dialogs> → <Boxplot>, as shown in Figure 5.18.

In most real-life situations, researchers want to compare boxplots for the same variable for two or more groups, as in the following example. BMI is an index of body weight corrected for height. Using data in the file bmi.sav, we will examine BMI scores separately for men and women. In the first Boxplot dialog box (Figure 5.19), highlight the box for “Simple” boxplot and select the radio button for “Summaries for groups of cases.” In the Define Simple Boxplot: Summaries for Groups of Cases dialog box (in Figure 5.20), the name of the variable for the plot (heightinches) is moved into the variables list. The resulting boxplot graph appears in Figure 5.21.

On the basis of output from the SPSS frequencies procedure (not shown here), the median BMI was 23 for men and 22 for women. There were numerous outliers for both groups, mostly higher BMI scores. You need to know that when two scores have the same value, SPSS draws just one circle.
Each circle indicates the row number in the SPSS data file where the outlier score is located. You can determine the number of scores identified as outliers by counting these numbers. To find the number of nonextreme outliers, count the case numbers for the open circles and ignore the case numbers for the asterisks. An outlier that is not extreme is denoted using an open circle, while outliers labeled as extreme appear as asterisks.
Figure 5.20 Define Simple Boxplot: Compare BMI Scores for Male Versus Female Groups
Figure 5.21 Boxplot Set Up Using SPSS: Separate Boxplots for Male and Female Groups on the Basis of Descriptive Statistics

The boxplot for men shows only two low-end outliers (on rows 20 and 23); these are not extreme. If you look at the data file bmi.sav, you will find that the scores on rows 20 and 23 both correspond to BMI scores of 16. There were numerous high-end outliers. The boxplot for men shows nine scores on rows 102, 111, 120, 124, 131, 167, 189, 190, and 199 that were labeled outliers (but not as extreme outliers). Three scores (on rows 126, 157, and 197) were identified as extremely high outliers. The case on row 3 lies in between these groups of scores. You can determine whether the in-between BMI score on row 3 is an extreme outlier by comparing the BMI value in the data file on row 3 (which is 34) with the BMI values for the two neighboring
values. The BMI score on row 157 is also 34, and case number 157 is not tagged as an extreme outlier in this boxplot; therefore row 3 would also not be identified as an extreme outlier.

We can report results for the male BMI boxplot as follows. Values obtained from the SPSS frequencies procedure, not shown here, are used to identify the exact values for the 25th, 50th, and 75th percentiles and the minimum and maximum scores. For men, median BMI was 23; 50% of male BMI scores were between 22 and 25. There were 2 low-end outliers for male BMI; neither was extreme. There were 13 high-end outliers; 10 were not extreme and 3 were extreme outliers. For men, minimum BMI was 16 and maximum BMI was 41.

Median BMI for women was 22; 50% of female BMI scores were between 20 and 23. The female group had no low-end outliers for BMI. There were five nonextreme high-end outliers (rows 204, 302, 318, 353, and 398). There were also two extreme high-end outliers; these BMI scores appear on rows 290 and 374. Minimum BMI for women was 17 and maximum BMI was 33.

If we compare men and women, it appears that men tend to have higher BMIs than women. It would also be useful to examine the frequency distributions for BMI using suggested clinical cutoffs to evaluate the percentage of persons whose BMIs were within the range 18.5 to 24.9, which is considered healthy. Histograms would help to evaluate distribution shapes.

5.11 TELLING STORIES ABOUT DISTRIBUTIONS

After you examine graphs such as histograms or boxplots, you should be able to tell an honest and reasonably complete story about the pattern you see. Imagine this game: Your task is to get a person who has not seen the histogram or other graph to draw a picture of the graph, based only on verbal information you provide. You win the game if you and your partner can do this more quickly and accurately than other teams. Ready? Go!
If you have a roughly normal or bell-shaped distribution, you can communicate this to your partner very quickly with three pieces of information (normal, M, SD). That should be sufficient for your partner to sketch a graph. If the distribution appears somewhat normal but with some variations, such as positive skewness or outliers (see Table 5.1 for examples), you need to add that information (for example, three outliers at the high end).

On the other hand, if your distribution does not resemble a bell-shaped curve (see Table 5.2), you need different stories or pieces of information. It may be sufficient to say “reverse J-shaped” or bimodal or uniform. However, you will need to give your partner more information (for example, the maximum score was 10). Distributions that have one or more modes and non-normal shapes require more information. Where was each mode located? Were some modes higher than others? Think about what your results mean.
Figure 5.22 Histogram for Polarized Degree of Agreement Ratings

How can the hypothetical results in Figure 5.22 be described? Opinion is highly polarized; that is, people are at either the negative or positive extreme in this hypothetical example. There are two modes (52% of people strongly disagree and 30% strongly agree). Very few people chose intermediate levels of agreement. Most people strongly disagree, but the number of people who strongly agree is a substantial minority. Use of a mean or median (a value somewhere around 2.5) to describe central tendency would be misleading in this situation; 2.5 is near the neutral point, but very few people chose ratings near neutral. A concise way to communicate this would be: “Fifty-two percent strongly disagreed with this statement, 30% strongly agreed, and very small percentages of people chose intermediate levels of agreement. Opinion was strongly polarized.” If the author of a research report makes a blanket statement that all variables had approximately normal distributions, or allows readers to assume that all distributions were normal, and then tells readers that the mean degree of agreement with this statement was 2.5, this information by itself provides a misleading description of the results.

5.12 USES OF GRAPHS IN ACTUAL RESEARCH

Data screening: Identify potential errors or problems with data (such as recording errors, implausible scores, and missing values). Researchers need to report the number of scores that are problematic and indicate what they did to correct these problems. For beginning students, it may be sufficient to report the percentage of missing scores and the number of outliers and extreme outliers for each variable. I suggest that beginning students run analyses with outliers included and with outliers excluded; if results are substantially the same, report one of these analyses and add a footnote to indicate that the other analysis yielded similar results.
For both beginning and advanced students, keep a record of any problems you detect in data, and
anything that you do to deal with the problems. Discussion of better ways to handle outliers and missing values is provided in Volume II (Warner, 2020).

Evaluation of whether assumptions for analyses are violated: When you learn statistical techniques such as t tests, ANOVA, and regression, you will see that each analysis is based on some assumptions. Some analyses work fairly well, under certain circumstances, when their assumptions are violated; others do not. There is a widespread, but not exactly accurate, belief that scores in samples need to be normally distributed to satisfy the assumptions for many common analyses. I think it would be more accurate to say that, in practice, some kinds of departure from normality in the sample (such as the presence of extreme outliers, or reverse J-shaped or polarized distributions) create problems in many common analyses. The ways that violations of assumptions and rules can lead to incorrect conclusions are discussed in later chapters about significance tests.

Report information needed to characterize and describe your sample: For categorical variables, this is often in sentence form, for example, “The sample consisted of 100 male and 150 female university students, with a mean age of 19.1 years.”

Here are some of the stories (or descriptions) about distributions that might appear in a research report. You might say, “The histogram appears approximately normal with no extreme outliers.” You can state this in the “Data Screening” section of your research report. (For some statistics you will need to check additional assumptions.) You might need to say, “The histogram appears approximately normal except for a specific number of outliers.” In this situation you face the “what to do with outliers” problem. Ideally, you decide what to do with outliers prior to data collection. You need to document the number of outliers and what you decided to do with them (such as drop from analysis, recode into different values, or leave them in).
Do not experiment with different ways of handling outliers until you find results you like; this is p-hacking.

You might need to say, “The distribution is very skewed, and skewness cannot be corrected by modifying or removing a few outliers.” Log or other nonlinear transformations (analyzing log(X) instead of X) may be applied only if this is conventional in your field, only if values differ by orders of magnitude, and only if the transformation was planned ahead. In some situations that involve outliers, nonparametric analysis may be preferable. When scores are converted to ranks, extreme outliers and skewness are not problems. (Newer robust techniques, not covered in this book, may be better choices; Field, 2018.)

If the distribution looks nothing like a normal distribution (e.g., uniform, J-shaped, U-shaped, mode at zero), proceed with caution. Entirely different analyses than the ones in this book may be required.

5.13 DATA SCREENING: SEPARATE BAR CHARTS OR HISTOGRAMS FOR GROUPS

When the independent-samples t test is introduced you will see that it’s useful to create graphs such as histograms and boxplots separately for each of the groups. If a t test compares mean height for a male group versus mean height for a female group, we should examine the distribution of heights separately within each group, as in the following example. The data set malefemaleht.sav contains hypothetical heights in inches for 120 women (not the same sample as in femaleht.sav) and 120 men.
Figure 5.23 SPSS Dialog Box to Obtain Boxplots for Separate Groups

To obtain separate boxplots for male and female groups, make the same menu selections as for the previous boxplot example: <Graphs> → <Legacy Dialogs> → <Boxplot>. In the Boxplot dialog box (the one that appeared previously in Figure 5.17), select “Simple” and choose the radio button for “Summaries for groups of cases.” In the next dialog box, shown in Figure 5.23, enter the name of the quantitative variable (heightinch) in the space for “Variable,” and enter the name of the categorical variable (Sex) in the space for “Category Axis,” then click OK.

See the boxplots in Figure 5.24. Several kinds of information can be obtained through visual examination. Median height for the female group is lower than median height for the male group. The female group has one low-end outlier for height (in row 1 of the file). The male group has one high-end outlier for height (in row 240) and one low-end outlier (in row 121).

To obtain separate histograms for male and female groups, the <Data> → <Split File> command is used. (Note that you do not select <Split into Files>.) In the Split File dialog box in Figure 5.25, click the radio button for “Organize output by groups,” move the categorical variable (Sex) into the “Groups Based on” pane, then click OK. All subsequent analyses will be done separately for men and women. (Note that you must turn this command off to go back to using the full data set. To do that, make the same menu selections, <Data> → <Split File>, and choose the radio button for “Analyze all cases, do not create groups.”)

The separate histograms for female and male height scores appear in Figures 5.26 and 5.27. Height scores are not perfectly normally distributed within either group; they could be described as approximately normal. Among the three outliers identified in the boxplots, the only one that stands out clearly in the histograms is the male height of 78 inches (6 ft 6 in., or about 198 cm). This is unusually tall, but the number is not so large that you would think it impossible.
Figure 5.25 Command to Organize Output by Groups

Note that if you converted all heights from inches to centimeters, the appearance of the boxplots and histograms would not change; however, the numerical values of heights and the descriptive statistics would change.
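The point about units can be checked numerically. The following is a minimal sketch in Python (not part of the SPSS workflow), using made-up heights rather than the malefemaleht.sav data; the `skewness` helper is a hypothetical name and uses a simple mean-of-cubed-z-scores index. It suggests that the mean and standard deviation are rescaled by the conversion factor, while an index of shape is unchanged.

```python
import statistics

def skewness(scores):
    """Simple shape index: the mean of cubed z scores (hypothetical helper)."""
    m = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return sum(((x - m) / sd) ** 3 for x in scores) / len(scores)

# Hypothetical heights in inches (NOT the malefemaleht.sav data).
heights_in = [62.0, 64.5, 65.0, 66.0, 67.5, 68.0, 70.0, 78.0]
heights_cm = [h * 2.54 for h in heights_in]  # 1 inch = 2.54 cm

mean_in, mean_cm = statistics.fmean(heights_in), statistics.fmean(heights_cm)
sd_in, sd_cm = statistics.stdev(heights_in), statistics.stdev(heights_cm)
skew_in, skew_cm = skewness(heights_in), skewness(heights_cm)

print(f"mean: {mean_in:.2f} in vs. {mean_cm:.2f} cm")  # values differ by a factor of 2.54
print(f"SD:   {sd_in:.2f} in vs. {sd_cm:.2f} cm")      # values differ by a factor of 2.54
print(f"skew: {skew_in:.3f} vs. {skew_cm:.3f}")        # shape index is the same
```

Because the conversion multiplies every score by the same constant, each score keeps its position relative to the others, which is why the graphs look the same apart from the axis labels.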
5.14 USE OF BAR CHARTS TO REPRESENT GROUP MEANS

In this chapter, the heights of bars in bar charts represent the number (or proportion or percentage) of cases in each group. When later topics (such as the independent-samples t test) are introduced, bar charts have another use. Suppose that you want to compare mean height for two groups: female and male. You can set up a bar chart in which the height of each bar represents the mean height for each group, as in Figure 5.28.
One difference you may notice is that, in this chart, the Y axis begins at 60 (instead of 0, which was the recommended value for the Y axis origin when bar charts were used for frequencies). Here’s why. For group frequencies, 0 cases per group is a possible value. For means of variables such as adult height, 0 is not a possible value of height. It makes sense to choose a value of Y that is below the minimum height in the sample, but higher than 0, for a bar chart in which bars represent means. If you read research reports, you are more likely to encounter bar charts that represent group means than bar charts for group sizes or frequencies. You will learn more about setup and interpretation of this type of bar chart in chapters about the independent-samples t test and ANOVA.

5.15 OTHER EXAMPLES

5.15.1 Scatterplots

In some studies, researchers want to evaluate whether scores on one quantitative variable (such as extraversion) are related to scores on another quantitative variable (such as physical energy). A preliminary graph called a scatterplot is used to examine the relationship between variables prior to doing statistical analyses such as correlation or regression. An example of a scatterplot appears in Figure 5.29. In this hypothetical study, each person provided self-report scores for extraversion (rated on a scale from 1, not at all extraverted, to 5, highly extraverted) and for energy (1 = very low energy, 6 = very high energy). Each data point in the scatterplot represents the combination of scores on extraversion (on the X axis) and energy (on the Y axis) for one case. For example, the case marked with a circle in Figure 5.29 represents a person with an extraversion score of 4 and an energy score of 3.

The three ellipses in Figure 5.29 identify areas of the graph that can be compared. On the left, an ellipse encloses energy scores for people whose scores on extraversion were low (below 2). On the right, an ellipse encloses the energy scores for persons whose extraversion ratings were high (above 4). You can see that for the people with low scores on extraversion, energy scores also tended to be low. For persons with high scores for extraversion, energy scores tended to be high. People with moderate scores on one variable also had moderate scores on the other variable. This is an example of a positive linear relationship. In a later chapter this kind of relationship between two quantitative variables will be assessed using Pearson correlation.
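As a preview of how a positive linear relationship can be summarized with one number, here is a minimal sketch of Pearson’s r computed from its definition. The scores are invented for illustration (they are not the data behind Figure 5.29), and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: sum of cross-products over the root of the two sums of squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(ss_x * ss_y)

# Hypothetical self-report scores for 10 people (assumed data, for illustration only).
extraversion = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]   # 1 = not at all, 5 = highly extraverted
energy       = [1, 2, 2, 3, 3, 4, 3, 5, 5, 6]   # 1 = very low, 6 = very high

r = pearson_r(extraversion, energy)
print(f"r = {r:.2f}")  # a positive value indicates that high scores tend to go with high scores
```

A value of r near +1 corresponds to points that cluster tightly around a rising line; a value near 0 corresponds to no linear trend.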
5.15.2 Maps

Maps are useful formats for some kinds of graphs. For example, the Centers for Disease Control and Prevention has produced graphics in the form of maps to show the spread of obesity in the United States over time. A PowerPoint presentation that shows a series of maps from 1985 to 2010 appears at https://www.cdc.gov/obesity/downloads/obesity_trends_2010.ppt. Figure 5.30 shows a more recent map for prevalence of obesity in the United States in 2017. States shaded darker gray have higher percentages of obesity. (The corresponding map online at https://www.cdc.gov/obesity/data/prevalence-maps.html is keyed in color.) At a glance you can see several features of the data. High rates of obesity occurred in the Deep South, Iowa, and West Virginia. Colorado, Hawaii, and the District of Columbia had low rates. U.S. residents can see how obesity rates in their states compare with those of other states.

5.15.3 Historical Example

Most people think of Florence Nightingale as a pioneer of nursing; her work also had an enormous impact on medicine and hospital design (Lienhard, 2002). During the Crimean War, she sent reports to Britain about the number of soldiers who died each month and their causes of death. She used polar diagrams (this is not currently a popular form of graph) to communicate this information. Figure 5.31 is adapted from part of her graphics (Nightingale, 1858). Her major finding was that far more soldiers were dying from preventable diseases (sometimes acquired in the military hospitals) than from wounds. Up until the 19th century, this was true in many wars. The point she wanted to make was that far more sanitary conditions and better nutrition were needed to keep the army (and civilian populations) healthy. This was not something the War Department wanted to hear. Nevertheless, she persisted.
Figure 5.31 Florence Nightingale’s Graph: Number of British Soldiers Who Died in the Crimean War During Each Month Divided Into Three Causes of Death

5.16 SUMMARY
Why are both data analysts and mathematical statisticians so interested in the bell-shaped curve or normal distribution? Here are some of the reasons.

Scores for many (but not all) variables tend to be normally distributed. When a variable is approximately normally distributed, we can summarize information about the distribution of scores using just three pieces of information: that the distribution shape is normal, the value of the sample mean M, and the value of the sample standard deviation SD. For example, IQ scores are normally distributed with M = 100 and SD = 15. We can write that as N(100, 15), where N means “normally distributed,” and the values of M and SD are in parentheses. (Note that in this context, N is a description of the shape of a distribution, not sample size.)

When scores are normally distributed, it is easy to evaluate the location of an individual score (or an individual event, such as the value of M obtained in one study) relative to the overall distribution of scores. This is discussed further in Chapter 6.

Scores on individual variables are not the only values that tend to be normally distributed. You will see that if many values of M are obtained from different random samples from the same population, values of M also tend to be normally distributed. This makes it possible for us to set up confidence intervals for M and to conduct statistical significance tests.

Here are questions to ask when you look at a histogram:

Is the distribution bell shaped and symmetrical?
If yes: Good. You can report that the distribution was approximately normal.

Does the distribution have outliers at the lower and/or upper ends?
If yes: You need to report the number of outliers at the upper and lower ends and whether any of these were extreme outliers.

Is the distribution skewed (asymmetrical)?
If yes: You should mention this in your description of data. In Chapter 6, you will learn how to assess amount of skewness.

Is the distribution bimodal or multimodal?
If yes: Consider whether the distribution may consist of two or more overlapping distributions for different groups of people, such as men versus women.

Is the distribution U-shaped or polarized (as in Figure 5.22)?
If yes: Treat people who gave ratings of 1, ratings of 2, and so forth, as members of five separate groups.

Is the distribution uniform?
If the variable is quantitative, ask, Are the scores ranks? It is unlikely that you will see a uniform distribution for quantitative variables unless they are ranks.

You can describe distribution shape by thinking about the answers to these questions.

CHAPTER 12 THE INDEPENDENT-SAMPLES T TEST

12.1 RESEARCH SITUATIONS WHERE THE INDEPENDENT-SAMPLES T TEST IS USED

The independent-samples t test is used to evaluate whether the means of a Y dependent variable differ significantly across two groups. This test is used much more often in actual research than the one-sample t test. The X independent variable is dichotomous; it identifies
membership in one of two groups, using scores such as 1 versus 2 to identify each person as a member of one group. The Y dependent variable must be quantitative. Groups can be naturally occurring (such as women vs. men or Democrats vs. Republicans). Alternatively, they can be groups created in an experiment. The independent-samples t test requires a between-S design; that is, each person is a member of one and only one of the groups, people are not matched or paired across groups, and there are not repeated measures for the same persons in the two groups. If participants are matched, paired, or observed under both treatment conditions, a different analysis is required, the paired-samples t test, discussed in a later chapter.

Consider this simple hypothetical experiment. A student wants to know whether mean heart rate is higher when people consume caffeine than when they do not. The X predictor variable in this study is dosage level of caffeine (coded 1 = no caffeine, 2 = 150 mg of caffeine, about the amount in one cup of coffee). The specific numerical values used to label groups make no difference in the results; small integers are usually chosen as values for X, for convenience. The outcome variable (Y) in this example is heart rate (hr). Any drug (even caffeine) can have placebo effects, so it would be important to keep participants and researchers blind to condition. This could be done by giving each member of Group 1 a cup of decaffeinated coffee and each member of Group 2 a cup of coffee with 150 mg of caffeine (it would be necessary to check that participants could not taste the difference, to avoid placebo effects, and that they drank all the coffee).

Researchers usually hope to find a statistically significant difference in mean scores on Y between the groups. In this example the student researcher might expect that mean hr is higher in the caffeine group than the no-caffeine group.
If the group means do not differ significantly, this suggests that there is no treatment effect (i.e., caffeine has no effect on heart rate). When we obtain statistics such as means for more than one group, numerical subscripts are used to identify groups. For the independent-samples t test, the information we need is the mean, standard deviation, and number of cases within each group: M1, s1, and n1 for Group 1, and M2, s2, and n2 for Group 2.
For the one-sample t test, one sample mean M was used to estimate or test hypotheses about one population mean μ. In this chapter we consider means for two samples and their corresponding hypothetical populations: M1 estimates μ1, and M2 estimates μ2.
If caffeine does not affect heart rate, these population means would be equal: μ1 = μ2. A large difference between M1 and M2 in the sample data in the study would suggest that this hypothesis may be incorrect. Formally, the null hypothesis for the independent-samples t test can be written two ways:

$$H_0: \mu_1 = \mu_2 \qquad (12.1)$$

$$H_0: \mu_1 - \mu_2 = 0 \qquad (12.2)$$
These two statements are logically equivalent; the second version of the null hypothesis is more convenient. A translation of the null hypotheses into words is, Caffeine has no effect on heart rate. If mean heart rate is the same for people who do versus people who do not consume caffeine, this tells us that there is no treatment effect for caffeine. In this example, caffeine is the dichotomous independent variable with values 1 and 2. Equations 12.1 and 12.2 do not explicitly name the dependent variable.

Just as we used M to estimate μ for the one-sample t test, we now use (M1 – M2) to estimate (μ1 – μ2). What information from the sample data would lead us to suspect that the null hypothesis in Equation 12.2 may be incorrect? If your answer was “a large value of M1 – M2,” you are correct. A large difference between M1 and M2 is unlikely (although not impossible) if H0 is true. How do we evaluate whether the M1 – M2 difference is large enough to lead us to doubt that H0 is true? Once again, we set up a t ratio to compare our sample statistic (M1 – M2) with the standard error of that sample statistic. We will need to know the standard error of the difference, denoted SE(M1–M2). Because the value of μ1 – μ2 specified by H0 is 0, the t ratio takes this form:

$$t = \frac{M_1 - M_2}{SE_{(M_1 - M_2)}}$$
If the absolute value of t is larger than the critical value from the t distribution with (n1 + n2 – 2) df, using α = .05, two tailed, we can say that we have found a statistically significant difference between the sample means and that we can reject the null hypothesis with p < .05, two tailed. In practice, it
is easier to examine the p values that correspond to t and reject H0 if obtained p is less than .05 (or less than the specific α level chosen in advance). In other words, all parts of this procedure are the same as for the one-sample t test, except that now we examine the difference between two sample means, (M1 – M2), instead of a single value of M.

12.2 A HYPOTHETICAL RESEARCH EXAMPLE

Consider the following imaginary experiment. Twenty participants are recruited as a convenience sample; each participant is randomly assigned to one of two groups. The groups are given different doses of caffeine. Group 1 receives 0 mg of caffeine; Group 2 receives 150 mg of caffeine. Half an hour later, each participant’s heart rate is measured; this is the quantitative Y outcome variable. The goal of the study is to assess whether caffeine may increase mean hr. In this situation, X (the independent variable) is a dichotomous variable (amount of caffeine, 0 vs. 150 mg); Y, hr, is the quantitative dependent variable. The scores for these variables are in Figure 12.1 and in the file hrcaffeine.sav. Group membership is identified by scores in the “caffeine” column.
Figure 12.1 SPSS Data View: Caffeine/Heart Rate Experiment Data

12.3 ASSUMPTIONS FOR USE OF INDEPENDENT-SAMPLES T TEST

Scores for the Y outcome (or dependent) variable should satisfy the following assumptions.

12.3.1 Y Scores Are Quantitative
Because we will compute a mean for the Y scores in each group, the scores on the Y outcome variable must be quantitative. It would not make sense to compute a group mean for scores on a Y variable that is categorical.

12.3.2 Y Scores Are Independent of Each Other Both Between and Within Groups

When a design is between-S, that is, each person is a member of only one group, Y scores will probably be independent of each other between groups. If matching, pairing, or repeated measures are used, the paired-samples t test must be used instead of the independent-samples t test. Chapter 2 discussed the differences between between-S and within-S designs. The paired-samples t test is introduced in Chapter 14.

Y scores should also be independent of each other within groups. Usually this assumption is satisfied if each person is assessed alone. When subjects in the same treatment condition are tested in pairs or groups, or if they happen to be roommates or couples, or if they have an opportunity to influence each other’s behavior, the scores may not be independent.

As an example, consider this hypothetical study. Students consumed high-carbohydrate or high-protein drinks and rated their moods. To collect data quickly, the student tested them in groups. One participant threw up the protein shake, an event that probably affected the mood of others who were present. This made Y scores dependent (related) within the testing groups. Testing participants individually would have avoided this problem. In practice, there is little you can do during data screening to detect within-group nonindependence of scores. You need to know how data were collected.

12.3.3 Y Scores Are Sampled From Normally Distributed Populations With Equal Variances

When the independent-samples t test was developed, statisticians assumed that Y scores in samples were randomly selected from populations that correspond to the groups in the study and that scores in those populations are normally distributed.
We have no way to evaluate whether that assumption is correct. In practice, researchers examine the distribution of Y scores in samples and hope that if scores in samples are normally distributed, the populations from which the samples were selected also have reasonably normal distributions.

Another assumption about the population distributions of Y for this test is called homogeneity of variance. The variances of the Y scores should be equal or homogeneous in the two populations that correspond to the samples compared in the study. We can write this assumption formally as follows:
$$\sigma_1^2 = \sigma_2^2,$$

where $\sigma_1^2$ and $\sigma_2^2$ denote the (unknown) population variances for the populations that correspond to the groups in our study. We can judge whether this assumption may be violated by
comparing the sample variances for the scores in the two groups using the Levene test. SPSS reports the Levene test F ratio as part of the standard SPSS output for the independent-samples t test. If the p value for the Levene test is small (e.g., p < .01), this is evidence of possible violation of the homogeneity of variance assumption. (Note that when we test possible violations of assumptions, we would prefer that p be large and not significant.)

It is widely believed that the violation of the assumption that population scores are normally distributed does not cause serious problems if each group has n > 25 or 30, group n’s are equal, and two-tailed tests are used (Boneau, 1960; Hogg, Tanis, & Zimmerman, 2014; Sawilowsky & Blair, 1992). The sampling distribution of M turns out to be close to normal, as predicted by the central limit theorem, even when this assumption is violated.

The independent-samples t test is robust against violations of the homogeneity of variance assumption; that is, p values obtained using the independent-samples t test are good estimates of the true risk for Type I error even when this assumption is violated. This has been demonstrated in studies where statisticians set up data for the two hypothetical populations such that μ1 = μ2 and the equal variances assumption is violated. When thousands of random samples are drawn from these populations, and independent-samples t tests are applied, the number of Type I errors obtained is close to what we expect on the basis of the selected α level. Myers and Well (1995) described robustness to violation of the equal variances assumption as follows:

If the two sample sizes are equal, there is little distortion in Type I error rate unless n is very small and the ratio of the variances is quite large … when n’s are unequal, whether the Type I error rate is inflated or deflated depends upon the direction of the relation between sample size and population variance. (pp. 69–70)
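The kind of simulation study described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the cited studies: both populations are normal with equal means, the function names `pooled_t` and `type1_rate` are hypothetical, the number of replications is arbitrary, and 2.306 is the two-tailed critical value of t for α = .05 with 8 df (two groups of n = 5).

```python
import random
import statistics

def pooled_t(y1, y2):
    """Equal-variances-assumed t ratio for two independent samples."""
    n1, n2 = len(y1), len(y2)
    m1, m2 = statistics.fmean(y1), statistics.fmean(y2)
    v1, v2 = statistics.variance(y1), statistics.variance(y2)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)  # pooled variance
    se_diff = (sp2 * (1 / n1 + 1 / n2)) ** 0.5              # SE of (M1 - M2)
    return (m1 - m2) / se_diff

def type1_rate(n, sd1, sd2, crit, reps=5000, seed=1):
    """Proportion of samples in which a true H0 (mu1 = mu2 = 0) is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        y1 = [rng.gauss(0, sd1) for _ in range(n)]
        y2 = [rng.gauss(0, sd2) for _ in range(n)]
        if abs(pooled_t(y1, y2)) > crit:
            rejections += 1
    return rejections / reps

CRIT_DF8 = 2.306  # critical t, alpha = .05, two tailed, df = 5 + 5 - 2 = 8

rate_equal = type1_rate(5, 1.0, 1.0, CRIT_DF8)     # population variances equal
rate_unequal = type1_rate(5, 1.0, 10.0, CRIT_DF8)  # population variance ratio = 100

print(f"Type I error rate, equal variances:   {rate_equal:.3f}")
print(f"Type I error rate, variance ratio 100: {rate_unequal:.3f}")
```

If the test is robust, both rates should land near the nominal .05 even though the second pair of populations badly violates the equal variances assumption. Changing `n`, the standard deviations, or making the two n’s unequal lets you explore the conditions under which the rate drifts away from α.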
According to Myers and Well (1995), even when the ratio $\sigma_1^2/\sigma_2^2$ was as high as 100, t tests based on samples of n = 5 using α = .05 resulted in Type I errors in only 6.6% of the batches of data.

The impact of violations of the homogeneity assumption on the risk for committing a Type I error is greater when the n’s in the samples are small (less than 30), when the group n’s are unequal, and when a one-tailed test is used (Sawilowsky & Blair, 1992). Some authorities suggest that even smaller group n’s are sufficient for robustness against violation of the equal population variance assumption.

Despite these assurances about robustness of the independent-samples t test with equal variances assumed, SPSS takes an old-fashioned approach. SPSS output for this test includes the Levene F test (which really isn’t needed when groups have 25 or more members) and also two versions of the independent-samples t test. One of these t tests (called the equal variances not assumed t test) is adjusted to correct for violations of the homogeneity assumption. Most authorities now think that the adjustment is too conservative. In practice, when you look at independent-samples t test output, you can safely ignore the Levene test and the equal variances not assumed t test. The equal variances not assumed version of the independent-samples t test is calculated using a different formula for t, and it has a downwardly adjusted df value, sometimes denoted df′. SPSS reports both versions of the independent-samples t test. Usually, you will report the equal variances assumed version of the t test.

12.3.4 No Outliers Within Groups
The absence of outliers is not stated as a formal assumption for most tests; however, extreme outliers violate the assumption of normality and, in practice, can cause serious problems in data analysis. Recall that means are not robust against the presence of outliers. Because the independent-samples t test compares group means, it follows that outliers are also a problem for the t test.

As noted earlier, handling outliers can be problematic. For some kinds of variables (such as salary or reaction time) you can anticipate that outliers are likely, and in these situations, you should decide on rules for identification and handling of outliers before data collection. You must document the presence and handling of outliers in your final report so that readers know how many cases were dropped and why. For other kinds of variables, such as ratings on 5-point or 10-point scales, outliers are rare.

A boxplot or histogram of Y scores, separately for each group, provides a way to evaluate whether outliers are present in either group. It can be instructive for a beginning student to run analyses such as t tests once with outliers included and again with outliers excluded to see how results differ. Occasionally, it can make sense to report both versions of an analysis (with and without outliers) in a research report so that readers can evaluate the situation for themselves. However, it is extremely bad practice to run a t test, obtain a result you do not like, and then throw away outliers and redo the analysis in an attempt to make data do what you want. You should decide on rules for identification and handling of outliers before you do analyses.

12.3.5 Relative Importance of Violations of These Assumptions

Violations of some assumptions can cause serious problems, while violations of other assumptions are less problematic. This list includes the set of potential violations that are serious and cannot be ignored.
The independent-samples t test cannot be used if the dependent variable is categorical or if scores on the dependent variable are ranks within groups.
The independent-samples t test cannot be used if assumptions of independence of observations (between and within groups) are violated.
The presence of outliers can lead to p values that either over- or underestimate the true risk for Type I error. Values of group means may be strongly influenced by outliers. You need to document the presence of outliers and explain whether you dropped or retained them.
An implicit assumption in procedures for all null hypothesis significance tests is that you do one test, then stop. In practice researchers often report large numbers of significance tests. If a researcher reports a set of 10 or 20 t tests, the risk for obtaining at least one Type I error in the set is much higher than the α level used to evaluate individual t values. Bonferroni-corrected per-comparison alpha levels (PCα), discussed in the correlation chapter, can be used to limit inflated risk for Type I error when numerous significance tests for a set of different t ratios are reported.
As for all other analyses, results are meaningful only if we have samples that are similar to (representative of) the hypothetical populations of interest, if we have a manipulation that represents real-world situations, and if we have a reliable and valid measure of the outcome variable. (For example, a study that compared 0 mg caffeine and 3,000 mg caffeine, approximately 20 cups of coffee, would not tell us much about caffeine consumption in everyday life; such high levels probably never occur. It would also be unethical because that much caffeine might make participants sick.)
If we want to make causal inferences, the research situation must be a well-controlled experiment. See Chapter 2 or a research methods textbook.
Violations of the assumptions that the populations have normally distributed scores with equal variances don’t cause serious problems unless there are other issues (such as very small samples and outliers within groups). (That’s good, because small samples don’t provide enough information for us to make inferences about the shapes of population distributions.)

12.4 PRELIMINARY DATA SCREENING: EVALUATING VIOLATIONS OF ASSUMPTIONS AND GETTING TO KNOW YOUR DATA

Let’s return to the imaginary experiment in which one group of participants receives 0 mg of caffeine and the other group receives 150 mg of caffeine. To evaluate whether the independence of observations is violated, we need to know the research situation. If we know that each participant is tested under only one treatment condition and that there was no matching or pairing of participants for the samples, then the assumption that scores are independent between groups should be satisfied. If we know that each participant was tested individually and that the participants did not have any chance to influence one another’s levels of physiological arousal or heart rate, then the assumption that observations are independent within groups should be satisfied.

Data analysts can evaluate whether scores within each sample have reasonably normal distribution shapes and no extreme outliers and whether the homogeneity of variance assumption appears to be violated. Distribution shape for scores on a quantitative variable in the two groups or samples can be assessed by examining histograms of the dependent variable scores separately within each group. To do this, first request separate output for the groups by using the menu selections <Data> → <Split File>, as shown in Figure 12.2. This opens the Split File dialog box, also shown in Figure 12.2. (Note that you do not use the <Split into Files> command.)
In the Split File dialog box, select the radio button for “Organize output by groups.” In the “Groups Based on” pane, enter the name of the grouping variable (caffeine), then click OK. Subsequent graphs and analyses will be reported separately for the caffeine and no-caffeine groups until this command is turned off. The SPSS <Analyze> → <Descriptive Statistics> → <Frequencies> procedure was used to obtain a histogram and descriptive statistics for each group.
Figure 12.2 Using the SPSS <Split File> Command to Obtain Output for Separate Groups
Figure 12.3 Histogram of Heart Rate Scores for the No-Caffeine Group (Caffeine = 1)

The histograms in Figures 12.3 and 12.4 show the distributions of heart rate scores separately within each of the two groups. The smooth curves superimposed on the graphs represent ideal normal distribution shapes. These distributions are not close to normal in shape. Normal distribution shapes rarely appear when group n’s are very small. Normal distribution in the samples is not a requirement for use of the independent-samples t test. (However, if you see distribution shapes within the sample that suggest something unusual is going on, like the examples of severely non-normal distribution shapes with modes at the lowest and/or highest
possible score values, as in Table 5.2, you may want to stop and question whether group means are good descriptions of what is average in the samples. Other analyses might be preferable in these situations.)

Before additional data screening, turn off the <Split File> command. To do this, make the menu selections <Data> → <Split File> and then select the radio button for “Analyze all cases, do not create groups” (as shown in Figure 12.2), then click OK.

The presence of outliers can be assessed by examining boxplots for the distribution of hr scores separately for each treatment group. The menu selections are <Graphs> → <Legacy Dialogs> → <Boxplot>. In the Boxplot dialog box (Figure 12.5), choose “Summaries for groups of cases.” In the Define Simple Boxplot: Summaries for Groups of Cases dialog box (Figure 12.6), specify the dependent variable (hr); the group variable (caffeine) is placed on the category axis. The resulting boxplots appear in Figure 12.7. In this example, although two scores in the first group appeared to be outliers, none of the scores were judged to be extreme outliers, and no scores were removed from the data before further analysis. (Recall that a circle represents an outlier and an asterisk represents an extreme outlier in a boxplot.)
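The boxplot screening rule can also be written out as a short computation. The sketch below is a rough approximation under stated assumptions, not a reproduction of SPSS output: it assumes the common cutoffs of 1.5 box lengths (outlier, drawn as a circle) and 3 box lengths (extreme outlier, drawn as an asterisk) beyond the edges of the box, it uses quartiles from Python’s `statistics.quantiles` in place of the hinges that SPSS computes, and the heart rate scores are invented (they are not the hrcaffeine.sav data). The name `flag_outliers` is hypothetical.

```python
import statistics

def flag_outliers(scores):
    """Return (score, label) pairs for scores beyond the 1.5 and 3 box-length cutoffs."""
    # Assumption: these quartiles stand in for the hinges used in SPSS boxplots.
    q1, _median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    box_length = q3 - q1
    flagged = []
    for y in sorted(scores):
        # Distance beyond the nearer edge of the box (zero or negative inside the box).
        distance = max(q1 - y, y - q3)
        if distance > 3 * box_length:
            flagged.append((y, "extreme outlier"))
        elif distance > 1.5 * box_length:
            flagged.append((y, "outlier"))
    return flagged

# Hypothetical heart rates, screened separately within each group.
hr_by_group = {
    "no caffeine": [52, 54, 55, 56, 57, 58, 59, 60, 70, 72],
    "caffeine":    [60, 62, 64, 65, 66, 67, 68, 70, 71, 73],
}

for group, scores in hr_by_group.items():
    print(group, flag_outliers(scores))
```

Screening each group with its own quartiles mirrors the separate boxplots: a score that is unremarkable in the combined data can still be an outlier relative to its own group.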
Figure 12.5 Boxplot Dialog Box
Figure 12.6 Define Simple Boxplot: Summaries for Groups of Cases Dialog Box
Figure 12.7 Boxplots for Heart Rate Scores Within Each Group (No Caffeine, Caffeine)

Overall, it appears that the heart rate data satisfy the assumptions for an independent-samples t test reasonably well. There is no reason to suspect that the independence of scores assumption is violated either within or between groups. Scores on hr are quantitative and reasonably normally distributed. The only potential problem identified by this preliminary data screening is the presence of two high-end outliers in Group 1 (not extreme for heart rate). The decision was made not to remove these scores. If outliers had been more extreme (e.g., hr scores above 130), we would need to consider whether participants with high scores fell outside the range of inclusion criteria for the study (i.e., healthy young adults). We might have decided early on to include only persons with “normal” heart rates (which we might define as the range between 50 and 110 beats per minute). We might also consider whether extreme scores might be due to data-recording errors.

12.5 COMPUTATION OF INDEPENDENT-SAMPLES T TEST

Do not be alarmed at the number of equations in this section. The computations needed for independent-samples t tests are mostly things you have done before. When you compare means across groups, it should become almost a reflex: Compute means, variances, and standard deviations within each group. In real-life situations, you can use SPSS to do almost all the computations you need. Here is an outline of the computations.
Compare your obtained t value with the critical values that define reject and do not reject regions in Appendix B at the end of this book. These reject and do not reject regions depend on the α level you have selected, whether you have a directional or nondirectional alternative hypothesis, and the df for the t ratio. (Reject and do not reject regions were discussed in Chapter 8, on the one-sample t test.)
Compare your obtained p value (given by a computer program) with the α level criterion you have chosen. For example, if you chose α = .05, two tailed, as the criterion for statistical significance, and if p = .043, two tailed, you can judge the difference between means statistically significant and report the result as p = .043, two tailed. Note that SPSS provides two-tailed p values.
All statistical results listed above are provided in SPSS output. You will also need to calculate an effect size index by hand, either Cohen’s d, point biserial r, or η² (eta squared); these are described in a later section.
Within each of the two groups, M and s are calculated as follows. The subscripts 1 and 2 indicate whether scores belong to Group 1 or Group 2. Scores on the quantitative outcome variable are denoted by Y1 when they come from Group 1 and Y2 when they belong to Group 2. First calculate the mean for each group: M1 = ΣY1/n1 and M2 = ΣY2/n2. Then calculate the variance within each group, s1² = Σ(Y1 – M1)²/(n1 – 1) and s2² = Σ(Y2 – M2)²/(n2 – 1); the standard deviations s1 and s2 are the square roots of these variances.
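The within-group computations can be sketched in a few lines of Python. This is only an illustration: the scores below are hypothetical (they are not the data in Figure 12.1), and the helper name describe is made up for this example. The standard library functions variance and stdev use n – 1 in the denominator, which matches the sample statistics used in a t test.

```python
from statistics import mean, stdev, variance

# Hypothetical heart rate scores (beats per minute); NOT the data from Figure 12.1.
group1 = [52, 55, 58, 60, 61, 63, 64, 66, 70, 71]  # e.g., no caffeine
group2 = [60, 63, 65, 68, 70, 71, 73, 75, 78, 82]  # e.g., caffeine


def describe(scores):
    """Return n, M, s squared, and s for one group (sample formulas, n - 1)."""
    return {
        "n": len(scores),
        "M": mean(scores),
        "s2": variance(scores),
        "s": stdev(scores),
    }


stats1 = describe(group1)
stats2 = describe(group2)
print("Group 1:", stats1)
print("Group 2:", stats2)
```

In practice these values would be read from the SPSS “Group Statistics” panel; computing them separately is mainly a way to check your understanding.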
When you examine SPSS output you will see that SPSS presents two versions of the independent-samples t test. The first version, which is called the “equal variances assumed” or pooled-variances t test, is almost always reported. The second version, called the “equal variances not assumed” or “separate variances” t test, was developed to be used in situations where the equal variances assumption is violated. However, violations of the equal variances assumption generally don’t make p values poor estimates of risk for Type I error. The computations in this section are for the equal variances assumed version of the t test. You will probably never use the equal variances not assumed version of the t test.
When the “equal variances assumed” version of the t test is used, the variances within the two groups are pooled or averaged, and this average is called s²pooled or sp². The term pooled just means averaged. To obtain sp², the pooled or averaged within-group variance, we average the two within-group variances s1² and s2²: sp² = [(n1 – 1)s1² + (n2 – 1)s2²]/(n1 + n2 – 2). When n1 = n2, this reduces to the simple average sp² = (s1² + s2²)/2. The first version of the formula works whether n1 = n2 or not. It “weights” the variances by the sample sizes (that is, sp² will be closer to s² for the group with the larger n).
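A short Python sketch can make the weighting concrete. It assumes the usual pooled-variance formula, sp² = [(n1 – 1)s1² + (n2 – 1)s2²]/(n1 + n2 – 2), and the usual pooled-variances t ratio, t = (M1 – M2)/√[sp²(1/n1 + 1/n2)] with df = n1 + n2 – 2. The function names and the numbers are hypothetical, chosen only for illustration.

```python
from math import sqrt


def pooled_variance(n1, s1_sq, n2, s2_sq):
    """Average the two within-group variances, weighting each by its df (n - 1)."""
    return ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)


def pooled_t(m1, m2, n1, s1_sq, n2, s2_sq):
    """Equal variances assumed t ratio and its df."""
    sp_sq = pooled_variance(n1, s1_sq, n2, s2_sq)
    se_diff = sqrt(sp_sq * (1 / n1 + 1 / n2))  # standard error of M1 - M2
    return (m1 - m2) / se_diff, n1 + n2 - 2


# Unequal n: the pooled value is pulled toward the variance of the larger group.
unequal = pooled_variance(10, 4.0, 30, 8.0)  # about 7.05, closer to 8 than to 4
# Equal n: the pooled value is the simple average of the two variances.
equal = pooled_variance(10, 4.0, 10, 8.0)    # exactly 6.0

t_value, df = pooled_t(50.0, 54.0, 10, 4.0, 10, 8.0)
print(unequal, equal, t_value, df)
```

With equal group sizes the pooled variance is just the midpoint of s1² and s2², which is why the two versions of the formula agree in that case.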
12.6 STATISTICAL SIGNIFICANCE OF INDEPENDENT-SAMPLES T TEST
I recommend that you report the exact p value for the equal variances assumed (or pooled-variances) version of the t test. This is a two-tailed test. For example, if “Sig.” as reported by SPSS is .032, report p = .032, two tailed. Remember that if SPSS gives you a “Sig.” value of .000, you should report this as p < .001. A p value estimates risk for Type I error, and that risk can never be 0. A two-tailed exact p value corresponds to the combined areas of the upper and lower tails of the t distribution that lie beyond the obtained sample values of ±t.
If you want to report your outcome as a significance test using the conventional α = .05, two tailed, level of significance, an obtained p value less than .05 is interpreted as evidence that the t value is large enough so that it would be unlikely to occur by chance (because of sampling error) if the null hypothesis were true. In other words, if we set α = .05, two tailed, as the criterion for significance, and if p < .05, two tailed, we would say that the means of the groups are significantly different.
If an analyst decides to use a one-tailed (directional) test before peeking at the data, a one-tailed p value can be obtained by dividing the two-tailed p in the SPSS output by 2 (e.g., if two-tailed p = .06, then the corresponding one-tailed p = .03). In this situation, the analyst must also check that the direction of difference between the means corresponds to the difference in the alternative hypothesis. If Halt: μ1 > μ2, the null hypothesis can be rejected if M1 > M2 but not if M1 < M2.
I have used annoying quotation marks for “exact” p. I do this as a reminder that the “exact” value of p given by programs such as SPSS, often reported to 3 decimal places, is not necessarily correct. When assumptions are violated (and they often are), the p values given by a computer program often greatly underestimate the true risk for Type I decision error.
A judgment about statistical significance can also be made directly from the obtained value of t, its df, and the α level. If t is large enough to exceed the tabled critical values of t for n1 + n2 – 2 df, the null hypothesis of equal means is rejected, and the researcher concludes that there is a significant difference between the means. In the preceding empirical example of data from an experiment on the effects of caffeine on heart rate, n1 = 10 and n2 = 10, therefore df = n1 + n2 – 2 = 18. If we use α = .05, two tailed, then from the table of critical values of t in Appendix B at the end of this book, the reject regions for this test (given in terms of obtained values of t) would be as follows: Reject H0 if obtained t < –2.101 or obtained t > +2.101.
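For readers who want to see where a tabled critical value comes from, the following Python sketch finds the value of t that cuts off the middle 95% of a t distribution with 18 df. It is an illustration only, not the method used to build Appendix B: it integrates the t density with Simpson's rule and searches by bisection, and the function names are made up for this example.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def central_area(c, df, steps=2000):
    """Area between -c and +c, using Simpson's rule on [0, c] and symmetry."""
    h = c / steps
    total = t_pdf(0.0, df) + t_pdf(c, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3


def critical_t(df, alpha=0.05):
    """Two-tailed critical value: the c for which the central area is 1 - alpha."""
    lo, hi = 0.0, 50.0
    for _ in range(60):  # bisection search
        mid = (lo + hi) / 2
        if central_area(mid, df) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


crit = critical_t(18)
print(round(crit, 3))          # about 2.101
reject = abs(-2.75) > crit     # obtained t from the caffeine example
print("Reject H0:", reject)
```

The obtained t of –2.75 for the caffeine data falls in the lower reject region, which agrees with the two-tailed p value of .013 being smaller than .05.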
Note that these values of t also correspond to the middle 95% of the area of a t distribution with 18 df. These t values (“critical” values) are also needed to set up a confidence interval (CI) for M1 – M2.
When large numbers of t tests are reported, authors often use asterisks in tables to denote p values smaller than three different conventionally used α levels. Conventionally, * indicates p < .05, ** indicates p < .01, and *** indicates p < .001. Using these asterisks to make decisions about statistical significance amounts to setting the alpha level after examining the results; this is not good practice. When tables include numerous significance test outcomes, inflated risk for Type I error is likely to be present. Unless the author specifically notes that procedures for correction for inflated risk have been used, such as Bonferroni corrected per comparison alphas, tables with large numbers of asterisks should be viewed skeptically. I recommend against the use of asterisks to indicate significance of individual tests in large tables or lists. (I am guilty of using asterisks myself in the past, and I now repent.)
12.7 CONFIDENCE INTERVAL AROUND M1 – M2
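The general formula for a CI was discussed earlier. Assuming that the statistic has a sampling distribution shaped like a t distribution and that we know how to find the SE or standard error of the sample statistic, we can set up a 95% CI for a sample statistic by computing: sample statistic ± tcritical × SE of the sample statistic. For the difference between two means, this becomes (M1 – M2) ± tcritical × SE(M1 – M2), where tcritical is the tabled critical value of t for n1 + n2 – 2 df.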
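As a rough worked example, the Python sketch below sets up a 95% CI for M1 – M2 using the general form (M1 – M2) ± tcritical × SE. The means and the obtained t come from the caffeine example in this chapter; the standard error is back-computed from the rounded t value (SE ≈ (M1 – M2)/t), which is an approximation made here for illustration, and tcritical = 2.101 is the standard tabled value for 18 df. The limits in the actual SPSS output may differ slightly because of rounding.

```python
def ci_for_mean_difference(m1, m2, se_diff, t_crit):
    """Return (lower, upper) limits of the CI for M1 - M2."""
    diff = m1 - m2
    margin = t_crit * se_diff
    return diff - margin, diff + margin


m1, m2 = 57.8, 67.9          # group means reported in this chapter
t_obtained = -2.75           # rounded t reported in this chapter
se_diff = (m1 - m2) / t_obtained  # ASSUMPTION: SE recovered from rounded t, about 3.67
t_crit = 2.101               # two-tailed critical t, df = 18, alpha = .05

lower, upper = ci_for_mean_difference(m1, m2, se_diff, t_crit)
print(round(lower, 2), round(upper, 2))
```

Because the interval lies entirely below 0, it is consistent with the decision to reject the null hypothesis of equal means at α = .05, two tailed.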
12.8 SPSS COMMANDS FOR INDEPENDENT-SAMPLES T TEST
Results for all the computations described above are given in SPSS output. To obtain an independent-samples t value using the SPSS data file hrcaffeine.sav, make the following menu selections, starting from the top-level menu above the data worksheet (as shown in Figure 12.8): <Analyze> → <Compare Means> → <Independent-Samples T Test>. This opens the Independent-Samples T Test dialog box, as shown in Figure 12.9. The name of the (one or more) dependent variable(s) should be placed in the pane marked “Test Variable(s).” For this empirical example, the name of the dependent variable is hr. The name of the grouping variable should be placed in the box labeled “Grouping Variable”; for this example, the grouping variable is caffeine. In addition, it is necessary to click the Define Groups button; this opens the Define Groups dialog box that appears in Figure 12.10. Enter the code numbers that identify the groups that are to be compared (in this case, the codes are 1 for the 0-mg caffeine group and 2 for the 150-mg caffeine group; however, different numbers can be used to identify groups). Click the OK button to run the specified tests. The output for the independent-samples t test appears in Figure 12.11.
12.9 SPSS OUTPUT FOR INDEPENDENT-SAMPLES T TEST
The top panel of the output in Figure 12.11, titled “Group Statistics,” presents the basic descriptive statistics for each group; the n of cases, mean, standard deviation, and standard error of the mean for each of the two groups are presented here. (Students should verify that they can duplicate these results by computing the means and standard deviations by hand from the data in Figure 12.1.) The difference between the means, M1 – M2, in this example is 57.8 – 67.9 = –10.1 beats per minute. The group that did not consume caffeine had a mean heart rate 10.1 beats per minute lower than the group that consumed caffeine; this is a large enough difference, in real units, to be of clinical or practical interest. (Notice that the sign for this difference depends on whether M1 represents the smaller or larger of the two means.)
If the data analyst wants information about potential violation of the equal variances assumption, this is provided by the Levene test (on the left side in the t test results table in Figure 12.11). If the Levene F is not significant (i.e., p > .05), there is no evidence of a problem with the equal variance assumption, and the equal variances t test can be reported. In this example, the Levene F value is small (F = 1.571), and it is not statistically significant (p = .226). If p for the Levene F is small (p < .05 or p < .01), there is evidence that the homogeneity of variance assumption has been violated. If the researcher is worried about this (most people don’t worry), he or she may prefer to report the more conservative “equal variances not assumed” version of t. This is on the lower line below the heading “t-test for Equality of Means.” My results sections include the outcome of the Levene F test.
I will report the “equal variances assumed” version of the t test that appears on the upper line below the heading “t-test for Equality of Means”; this is information contained in the rectangle in Figure 12.12. The reason why the two versions of the t test are so similar in this example is that the group variances were nearly equal, and the group n’s were the same. The equal variances t test result is statistically significant, t(18) = –2.75, p = .013, two tailed. (The value in parentheses after t is the df.) Using α = .05, two tailed, as the criterion for significance, the 10.1-point difference in heart rate between the caffeine and no-caffeine groups is statistically significant. Mean heart rate was higher in the caffeine group.
If the researcher had specified a one-tailed test corresponding to the alternative hypothesis Halt: (μ1 – μ2) < 0, this result could be reported as follows: The equal variances t test was statistically significant, t(18) = –2.75, p = .0065, one tailed. (The one-tailed p value is obtained by dividing the two-tailed p value of .013 by 2.)
12.10 EFFECT SIZE INDEXES FOR T
Many different effect size indexes can be reported for the independent-samples t test; this discussion includes only the most widely reported.
12.10.1 M1 – M2
When the dependent variable Y is measured in meaningful units, the difference between sample means can be useful information (Pek & Flora, 2018), although many authors do not refer to that difference as an effect size. The difference between means can sometimes be interpreted as information about practical, clinical, or everyday importance. In this hypothetical example, people who consumed 150 mg of caffeine (about one cup of coffee) had heart rates about 10 beats per minute higher than those who did not consume caffeine. That is a noticeable difference, but not large enough that people need to be worried about it.
To make judgments about clinical or practical significance of differences between means, we need to understand the meanings of different score values; even then, people can have different subjective evaluations. Imagine a situation in which people who receive chemotherapy for a specific type of cancer live on average 3 weeks longer than people who decline chemotherapy. Apart from the question of whether this difference is statistically significant, we have the question, How much practical value does a 3-week difference have? A medical researcher might be pleased to find a treatment that extends life by 3 weeks. As a patient, however, I might not want to undergo possibly severe negative side effects unless the average extension of life was 2 or 3 months. In situations like this, clinicians and patients should remember that group averages often do not predict individual outcomes well. If median improvement in length of survival is 3 months, half of the patients in the study had shorter, and half had longer, improvements in length of survival.
Ability to generalize results from a study to your own personal situation should also take into account how similar you are, and how similar your disease condition is, to persons included in the study.
In the extremes it may be easy to say whether a treatment such as a weight loss pill has practical or real-world significance. Most people would not think that a mean weight loss of 1 lb is enough to be meaningful or valuable. On the other hand, most people might think that a mean weight loss of 30 lb is enough to have practical, clinical, or real-world value. For in-between amounts of weight loss, people may differ in how much they think is sufficient to be of value, relative to costs and risks of the treatment.
When variables are not measured in meaningful units, M1 – M2 may not provide useful real-world information (although it may still be interesting to compare values of M1 – M2 across different studies that use the same measures). For example, suppose you are told that female teachers receive average teaching evaluation scores of 24, while male teachers receive average evaluation scores of 27. You can see that the mean rating is higher for male than female teachers in this example, but you would need much more information to evaluate whether the difference is large. It is usually helpful to know the possible minimum and possible maximum score value and the actual minimum and maximum values found in the sample (this information is sometimes not included, but it should be). Other effect size indexes use standard deviation or variance of scores to evaluate effect size.
Both r² and η²:
Are standardized, unit free, and not related to original units of measurement.
Have fixed ranges of possible values (from 0 to 1, with 0 = no association, 1 = perfect association).
Are interpreted as proportion or percentage of variance in Y that can be predicted from X.
Are not related to N, sample size.
Large values of r² and η² are not always statistically significant; statistical significance depends on both sample size and effect size combined.
Both η² and r² describe the proportion of variance in Y that can be predicted from X only in the sample. When we find that we can account for a proportion of variance in the sample of scores in a study (for example, η² = .40, or 40%), this obtained proportion of variance is highly dependent on the nature of the sample and on other design decisions. It tells us about the explanatory power of the independent variable only in the somewhat artificial world and small sample that we are studying. For instance, if we find (in a study of the effects of caffeine on heart rate in a healthy college student sample) that 40% of the variance in the hr scores is predictable from exposure to caffeine, that does not mean that “out in the world,” caffeine is such an important variable that it accounts for 40% of the variability in hr. In fact, out in the real world, there may be a substantial proportion of variation in hr that is associated with other variables, such as age, smoking, physical health, and body weight. These other variables may not be important sources of difference in heart rate in a laboratory study; in fact, participants may have been selected in ways that get rid of many of these variables. A researcher might recruit only nonsmokers, for example.
In an experiment, we create an artificial world by manipulating an independent variable and holding other variables constant. In a nonexperimental study, we create an artificial world through the necessarily arbitrary choices that we make when we decide which participants and variables to include and exclude. We must be cautious when we attempt to generalize beyond the artificial world of research to broader, often purely hypothetical populations. When we find that a variable is a strong predictor in our study, that result only suggests that it might be an important variable outside the laboratory. We need more evidence to evaluate whether it is an important predictive variable in other settings than the unique and artificial setting in which we conducted our research. In the cases where participants are randomly sampled from actual populations of interest, we can be somewhat more confident about making generalizations.
When we use convenience samples, we must be cautious about generalizing results. The value of M in a convenience sample may not be a good estimate of μ for the hypothetical population of interest. Similarly, the value of η² in a convenience sample may not be a good estimate of η² in the population of interest.
Note that rpb is just the square root of η². This is called point biserial r because two lists (or series) of scores are correlated with a binary group membership variable. When results are combined across many studies using meta-analytic procedures, rpb is often the preferred effect size index. Sometimes rpb is referred to as r, and in fact, it is equivalent to Pearson’s r with dichotomous scores for one of the variables. The sign of r can be assigned to show whether higher mean Y scores occurred in the treatment or control group. For instance, if 150 mg caffeine is treatment, coded Group 2, and 0 mg caffeine is the control condition, coded Group 1, then r can be reported with a positive sign if mean heart rate is higher in Group 2 (the treatment group).
Similar to r² and η², rpb is standardized or unit free; its value does not depend on the original units of measurement, and its value does not depend on N, sample size. In general, rpb has a fixed range of values from –1 to +1 (from perfect negative to perfect positive association). However, if group n’s are not equal in your study, possible outcomes for rpb will be limited to a narrower range. The caveat about generalization is the same for rpb as for η²: The value you obtain in your sample may not be generalizable to the real world.
12.10.4 Cohen’s d
Cohen’s d differs from the previous effect sizes; it evaluates the difference between sample means in terms of number of standard deviations. For the independent-samples t test, Cohen’s d is calculated as follows: d = (M1 – M2)/sp,
where sp, the pooled within-group standard deviation, is calculated by taking the square root of the value of sp² from Equation 12.11 or 12.12. Like r², η², and rpb, Cohen’s d is unit free, standardized, and not dependent upon the original units of measurement. Its value does not depend on N. The sign of Cohen’s d depends on whether M1 > M2 or M2 > M1. There is not a fixed range for possible values of d, although values lower than –2 or higher than +2 are uncommon.
In words, d indicates the distance between the two group means in terms of within-group standard deviations. It helps us visualize how much overlap there is between two distributions of scores. The following examples illustrate small versus large values of Cohen’s d. Figure 12.13 shows a small effect size. Data from numerous studies suggest that men tend to have self-esteem scores about .22 (two tenths) SD higher than those of women (i.e., Cohen’s d = .22). This is a small effect. Figure 12.13 shows the overlap between these two distributions of scores. The normal distribution on the left represents self-esteem scores for women, with the mean located at d = 0. The distribution on the right represents self-esteem scores for men, with the mean located at d = .22.
Figure 12.13 Small Cohen’s d Effect Size and Overlap of Female (Left) Versus Male (Right) Distributions of Self-Esteem Scores
This small effect would not be noticed in real life for two reasons. First, we can’t evaluate one another’s self-esteem very accurately. Second, even if we had stickers on our faces showing self-esteem scores, we would still have a very difficult time seeing a difference between men and women. There is substantial overlap between these two distributions. The value d = 0 shows the mean for the female distribution (the distribution on the left). More than half of the men have higher scores than the mean score for women, d = 0, but not many more than half. Slightly fewer than half of the men have self-esteem scores below the mean score for women (d = 0).
A large effect size is shown in Figure 12.14. On the basis of data from the United Kingdom, mean male height is about 2 standard deviations higher than mean female height (i.e., Cohen’s d = 2.00). Large values of Cohen’s d (such as d = 1.00 and higher) correspond to real-world differences that people are likely to notice in everyday experience. Here also, the left-hand distribution represents height scores for women (with mean located at d = 0) and the right-hand distribution is height scores for men (with mean located at d = 2.00). Not very many women have heights that are higher than the mean height for men. There is much less overlap between the distributions than for the small effect size shown in Figure 12.13. Sex differences in height are large enough to be noticeable in everyday life, even though some women are taller than some men.
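The overlap described for Figures 12.13 and 12.14 can be put into numbers with a few lines of Python. This sketch assumes that both distributions are normal with equal standard deviations (the idealized situation drawn in the figures); the function name is made up for this example. Under that assumption, the proportion of the higher-scoring group that falls above the mean of the lower-scoring group is the standard normal cumulative probability at d.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, SD 1


def proportion_above_other_mean(d):
    """Proportion of the higher group scoring above the lower group's mean.

    ASSUMPTION: both groups are normal with the same SD, separated by d SDs.
    """
    return standard_normal.cdf(d)


men_above_female_mean_self_esteem = proportion_above_other_mean(0.22)
men_above_female_mean_height = proportion_above_other_mean(2.00)
women_above_male_mean_height = 1 - proportion_above_other_mean(2.00)

print(round(men_above_female_mean_self_esteem, 3))  # about .587
print(round(men_above_female_mean_height, 3))       # about .977
print(round(women_above_male_mean_height, 3))       # about .023
```

For d = .22, roughly 59% of men score above the female mean, only a little more than half. For d = 2.00, roughly 2% of women are taller than the mean for men, which matches the statement that not very many women exceed the male mean height.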
12.10.5 Computation of Effect Sizes for Heart Rate and Caffeine Data
For the pooled-variances t test using the heart rate and caffeine data, we have the following information from the SPSS output in Figure 12.11: t = –2.75, df = 18, M1 = 57.8, M2 = 67.9, and n1 = n2 = 10.
This value of d tells us that the mean of the no-caffeine group was 1.23 standard deviations lower than the mean of the caffeine group (and the mean of the caffeine group was 1.23 standard deviations higher than the mean of the no-caffeine group). Using Cohen’s standards⁵ to evaluate effect size in Table 12.1, all these values are judged to be large to very large effect sizes.
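The effect sizes for the caffeine data can be checked from t, df, and the group sizes. The Python sketch below assumes the standard conversions η² = t²/(t² + df), rpb = √η², and d = t√(1/n1 + 1/n2), which is algebraically the same as (M1 – M2)/sp for the pooled-variances t test. The input values t = –2.75, df = 18, and n1 = n2 = 10 are the ones reported in this chapter; the function names are made up for this example.

```python
from math import sqrt


def eta_squared(t, df):
    """Proportion of variance in Y predictable from group membership."""
    return t * t / (t * t + df)


def point_biserial_r(t, df):
    """Unsigned r_pb; attach a sign afterward to show which group mean is higher."""
    return sqrt(eta_squared(t, df))


def cohens_d_from_t(t, n1, n2):
    """Cohen's d recovered from the pooled-variances t ratio."""
    return t * sqrt(1 / n1 + 1 / n2)


t, df, n1, n2 = -2.75, 18, 10, 10
eta2 = eta_squared(t, df)
r_pb = point_biserial_r(t, df)
d = cohens_d_from_t(t, n1, n2)

print(round(eta2, 2), round(r_pb, 2), round(d, 2))  # 0.3 0.54 -1.23
```

The negative sign of d simply reflects that M1 (no caffeine) is lower than M2 (caffeine); its absolute value, 1.23, is the value interpreted in the text.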
12.10.6 Summary of Effect Sizes
Table 12.1 summarizes the characteristics of these effect sizes. Effect size values do not depend on N. By comparison, the magnitude of the independent-samples t ratio does depend on N. If other factors are held constant, as N increases, t also increases in absolute magnitude. In a few respects, t is similar to some effect sizes: it is unit free or standardized and not in the original units of measurement; it has a sign that indicates the direction of the relationship (which group mean is higher). By itself, t cannot be interpreted as a proportion of variance; however, t and df can be converted into η², which does provide information about proportion of variance. A t ratio does not have a limited range of possible values. Neither a t ratio nor its accompanying p value provides information about effect size.
Researchers report t, df, and p as information about statistical significance; these numbers do not tell us anything about effect size. On the basis of t and p, we make judgments only about statistical significance (and not about significance or importance in practical, clinical, or real-world domains).
Researchers should also report one or more of the effect sizes listed above as information about strength or size of effect (independent of sample size). Kirk (1996) suggested that we can interpret these values in terms of clinical or practical or real-world “significance.” Unfortunately, both researchers and research consumers sometimes confuse statistical significance (p < .05) with practical, clinical, or real-world “significance.” I prefer to speak of practical, clinical, or real-world importance (and avoid use of the potentially confusing term significance⁶).
We can also ask whether a finding has theoretical value or importance. If variable X accounts for more than 50% of the variance in a Y outcome, we might decide that variable X should be included in our theory about what causes Y. On the other hand, if variable X can account for only 1% of the variance in Y (even if X is a “statistically significant” predictor of Y), we would want to include more useful explanatory variables in a theory that attempts to explain Y. There is no clear cutoff for a minimum proportion of explained variance.
Cohen (1988) suggested guidelines for interpretations of effect sizes; Table 12.2 summarizes these labels. You may want to compare this with Table 10.3 in Chapter 10, which includes some additional information about the way effect sizes are related to whether effects are detectable in everyday life. These labels are based on recommendations made by Cohen for the evaluation of effect sizes in social and behavioral research; however, in other research domains, it might make sense to use different labels for effect size (that is, require larger values of rpb and other effect sizes before calling them “medium” or “large” effects). Effect size guidelines suggested by Cohen differ slightly when given in terms of different effect size indexes.
For r < .30, effects are often not detectable by informal observation in everyday life. For instance, a sex difference in self-esteem ratings, d = .22, is too small to be noticed in everyday life. For r > .50, effects may be detectable in everyday life (for instance, the sex difference in height, with d = 2.00, is something people notice in everyday life).
Effect sizes have three major uses:
1. At least one index of effect size should be reported with every statistical significance test. For the independent-samples t test, it is common to report η², rpb, or Cohen’s d. When the dependent variable is measured in meaningful units, discussion should also focus on the M1 – M2 difference as a way to think about the clinical or practical or real-world importance of the finding.
2. When you plan future research, you can use effect sizes from past research to estimate the minimum sample size you need to have adequate statistical power in your planned study. This is called statistical power analysis. Usually people want to have at least 80% power (i.e., approximately 80% chance of obtaining a statistically significant outcome for the guessed value of population effect size, such as η²). When a study has such small n’s that there is a very low probability of obtaining a statistically significant outcome given the population effect size, it is called underpowered.
3. When an author summarizes past research, he or she obtains and combines (averages) effect size information for each of dozens or hundreds of studies. This is called meta-analysis. For example, we might want to know whether mean depression after therapy for patients differs across numerous studies that compare client-centered therapy (treatment) with no therapy (control). An effect size such as Cohen’s d or rpb provides important information about the direction of difference (there might be a few studies in which mean depression was lower for the no-therapy group). If past studies have not reported effect sizes, effect sizes can almost always be obtained from other numerical results in the papers. In meta-analysis, it is important to include direction of effect.
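The power analysis idea in the second use above can be illustrated with a rough calculation. The Python sketch below uses a common normal-approximation shortcut for the independent-samples t test, n per group ≈ 2[(z for 1 – α/2 + z for power)/d]², with the guessed population effect size expressed as Cohen’s d rather than η². This shortcut is an assumption made for illustration; it is not the procedure used by dedicated power analysis programs, and it tends to give an n slightly smaller than exact methods based on the t distribution. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def approx_n_per_group(d, alpha=0.05, power=0.80):
    """Approximate n per group for a two-tailed independent-samples t test.

    ASSUMPTION: normal approximation; treat the result as a rough lower bound.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05, two tailed
    z_power = z.inv_cdf(power)          # about 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


n_small = approx_n_per_group(0.2)   # small effect
n_medium = approx_n_per_group(0.5)  # medium effect
print(n_small, n_medium)            # 393 63
```

The contrast between the two results shows why small guessed effect sizes call for much larger samples: cutting d from .5 to .2 raises the approximate n per group from about 63 to about 393.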
12.11 FACTORS THAT INFLUENCE THE SIZE OF T
12.11.1 Effect Size and N
Formulas for statistical significance tests such as the independent-samples t test can be written in a way that makes it clear that the t test combines information about effect size and sample size or N or df (Rosenthal & Rosnow, 1991). In words: t = effect size × size of study (a function of N or df).
If effect size is held constant, the expected magnitude of t increases as N increases.
If N is held constant, the expected magnitude of t increases as effect size increases.
With a little bit of thought it should be clear that:
When effect size and N are both very large, the value of t will almost always be large enough to judge the outcome statistically significant (and values of p will be very small).
When effect size and N are both extremely small, the value of t will almost always be too small to judge the outcome statistically significant (and values of p will be large).
In practice, when effect size is very small, you need a larger N to have a reasonable chance of obtaining a statistically significant outcome. When the effect size is very large, you may be able to obtain a statistically significant outcome using quite a small sample. A specific formula for the independent-samples t test given by Rosenthal and Rosnow (1991) is: t = [(M1 – M2)/sp] × (√df/2). (Equation 12.24)
As df (sample size) goes up, t tends to increase (and p tends to become smaller).
As (M1 – M2) goes up, t tends to increase (and p tends to become smaller).
On the other hand:
As sp goes up, t tends to decrease (and p tends to increase).
Notice an important implication of Equation 12.24. Even when effect sizes such as (M1 – M2) or Cohen’s d are extremely small, as long as they do not turn out to be exactly zero in your sample, you can judge even very small mean differences statistically significant for larger values of N.
You cannot use Equation 12.24 to predict your outcome value of t exactly from sample size and effect size, because this equation doesn’t take sampling error into account, and we don’t know population effect size. However, you can substitute different values into Equation 12.24 to get a sense of how increase in sample size makes it possible to detect very small effect sizes (i.e., judge them to be statistically significant). For instance, research that compares mean IQ for single-birth children (Group 1) with mean IQ for identical twins (Group 2) yields sample means of about M1 = 100 and M2 = 99. (A 1-point difference in IQ is not noticeable in everyday life; you might notice IQ score differences of 20 or 30 points.) For most IQ tests, s = 15. Using Equation 12.24, we can compare potential differences in outcomes for a study with df = 100 versus a study with df = 10,000. With df = 100, the 1-point mean IQ difference is unlikely to yield a t value large enough to be statistically significant. When df = 10,000, the obtained t ratio is likely to be large enough to judge this 1-point difference statistically significant. (The t values are not exact; this equation does not take sampling error into account.)
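The IQ comparison can be worked through numerically. The Python sketch below assumes that Equation 12.24 has the Rosenthal and Rosnow (1991) form t = [(M1 – M2)/sp] × (√df/2), that is, Cohen’s d multiplied by a term that grows with sample size. The means and SD are the ones given in the text; the function name is made up for this example. As the text notes, the resulting t values are rough expectations, not predictions of what any one sample will produce.

```python
from math import sqrt


def approx_t(m1, m2, sp, df):
    """Approximate t as effect size (d) times a function of study size."""
    d = (m1 - m2) / sp          # Cohen's d
    return d * sqrt(df) / 2     # ASSUMED form of Equation 12.24


t_df_100 = approx_t(100, 99, 15, 100)
t_df_10000 = approx_t(100, 99, 15, 10_000)

print(round(t_df_100, 2))    # 0.33
print(round(t_df_10000, 2))  # 3.33
```

For α = .05, two tailed, the critical value of t is close to 2 for both of these df values, so the same 1-point IQ difference (d ≈ .07) falls far short of significance with df = 100 but exceeds the critical value with df = 10,000.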
In practice, researchers sometimes can control sample size; sometimes they can control the magnitude of the other two elements in Equation 12.24. Decisions about “dosage level” or type of treatment often can increase the M1 – M2 difference. Decisions about the kinds of people to include in the study and the degree of standardization of data collection situations can influence the magnitude of sp, the within-group standard deviation.
Researchers do not always have control over sample size. Sometimes researchers do not have funds to pay participants, treatments or data collection procedures are very costly, or the study has to be completed in a very short period of time. When a researcher knows that the sample cannot be large, he or she needs to think about ways to increase the (M1 – M2) difference and/or decrease sp.
On the other hand, sometimes the results of large-N studies are reported in misleading ways. When N is very large, an effect can be judged statistically significant even when the effect size is too small to be of any real life or clinical or practical importance. Consider the twin versus individual child IQ study again. When the difference between mean IQs is tested in samples of 10,000 or more, it is almost always statistically significant. However, this difference could be deemed too small to be of any practical or clinical importance. Unfortunately, researchers who conduct large-N studies and obtain p values < .001 sometimes call their results “extremely significant.” (Do not say that!)
Here’s the problem. In everyday life, when we use the word significant, we mean large or worthy of notice (or at the very least detectable). When we hear the word significant we tend to assume that differences between groups are large enough to matter to people and clinicians. Calling the results of a study “highly significant” can mislead many readers into thinking that the effects are large enough to be valuable or at least noticeable in real life.
Statistical significance and practical or clinical importance do not always go together, particularly when N is extremely large. Here’s how to avoid confusion:
Emphasize effect sizes in reports (instead of statistical significance tests).
Explain effect sizes clearly and evaluate them honestly.
Discuss simple information such as M1 – M2 when units of measurement are meaningful.
Never say “extremely significant.”
12.11.2 Dosage Levels for Treatment, or Magnitudes of Differences for Participant Characteristics, Between Groups

The value of M1 – M2 can be affected by design decisions that involve the types of groups, types of treatment, or dosages of treatment for the two groups. Consider these two hypothetical studies of caffeine effects on heart rate:
Study A: Group 1 receives 0 mg caffeine, Group 2 receives 50 mg caffeine
Study B: Group 1 receives 0 mg caffeine, Group 2 receives 500 mg caffeine
Assuming that caffeine does have an effect on heart rate, we would expect the means for heart rate to be much farther apart in Study B than in Study A. By increasing the difference between treatment dosage amounts, researchers can often increase M1 – M2 and, therefore, other factors being equal, increase t. Studies of naturally occurring groups can also be thought of in these terms. Suppose you want to study age group (X) differences in mean reaction time (Y).
Study A: Group 1 is ages 20–29, Group 2 is ages 30–39
Study B: Group 1 is ages 20–29, Group 2 is ages 70–79
Other factors being equal, you would expect mean reaction times to differ much more in Study B than in Study A.

Researchers must be very careful about something else that can influence the magnitude of the M1 – M2 difference: confounds of other variables with type or dosage of treatment. In the 0 mg caffeine versus 150 mg caffeine study, if the people in the 0 mg caffeine group have heart rate measured in a very relaxing setting, while those in the 150 mg group are assessed in a stressful setting, there is a complete confound between stress and caffeine dosage. Whether it is statistically significant or not, we cannot interpret a large M1 – M2 difference as information about the effects of caffeine. Some or all heart rate differences might be due to the amount of stress in the situation. In this example, a confound of high stress with high caffeine would make the M1 – M2 difference larger. Some confounds may make an M1 – M2 difference smaller (for example, if heart rate was measured by a nasty and threatening experimenter in the 0 mg caffeine group and by a relaxed and friendly experimenter in the 150 mg caffeine group, the effects of caffeine and the confound might cancel each other out and lead to a small M1 – M2 difference). The presence of one or more confounds makes an M1 – M2 difference, and the t ratio based on that difference, uninterpretable.
12.11.3 Control of Within-Group Error Variance

Researcher decisions can also influence sp, the pooled or averaged within-group standard deviation. The within-group standard deviation sp is often called experimental error. Experimental error tends to be large in drug studies where participants within each treatment group differ from one another on characteristics such as age, anxiety, history of drug use, and so forth. Experimental error is also large if participants within the same treatment groups are tested in different ways in different situations. Consider the caffeine/heart rate study again: Group 1 receives no caffeine, and Group 2 receives 150 mg caffeine. Now consider these different scenarios.
Study A: Participants within both groups are very similar in age, health, and amount of past caffeine consumption; all are nonsmokers; all have average fitness; none are evaluated during midterms or final exams; and none are tested by an anxious experimenter.
Study B: Participants within both groups vary in age, health, and amount of past caffeine consumption; some smoke, some do not; they have varying levels of aerobic fitness; some are tested during midterms and finals, others before spring break; and several different experimenters interact with the participants, some of whom are much more anxious than others.

In Study A, if participant characteristics are very similar or homogeneous, and experimental procedures are standardized and consistent, participants in each group should not show much variation in heart rates. Thus, in Study A, sp should be relatively small. On the other hand, in Study B, people who are in the same treatment group have different health backgrounds and are tested under different circumstances; you would expect wide variation in their heart rates. In Study B, sp would be relatively large. If other factors (effect size and N) are held constant, there would be a better chance of obtaining a large t value for Study A than for Study B. Recruiting similar participants can help with statistical power, but it also reduces generalizability of findings. The participants in Study A are not diverse.

12.11.4 Summary for Design Decisions

Members of my undergraduate class became upset when I explained the way research design decisions can affect the values of t. They said, “You mean you can make a study turn out any way you want?” The answer is, within some limits, yes. The independent-samples t ratio is likely to be large for these situations and decisions. (For each factor, such as N, add the condition “other factors being equal.”)
N is large (a very large N study can yield a statistically significant t ratio even if the population effect is very small).
Population effect size such as η² is large (this is often related to treatment dosages or types of participants being compared).
M1 – M2 is large (however, M1 – M2 is not interpretable if confounds are present).
sp is small (this happens when participant characteristics and assessment situations are homogeneous within groups).

Depending on their research questions and resources, the degree to which researchers can control each of these factors may vary.

12.12 RESULTS SECTION

Following is an example of a “Results” section for the study of the effect of caffeine consumption on heart rate.

Results

An independent-samples t test was performed to assess whether mean heart rate differed significantly for a group of 10 participants who consumed no caffeine (Group 1) compared with a group of 10 participants who consumed 150 mg of caffeine (Group 2). Preliminary data screening indicated that scores on heart rate were reasonably normally distributed within groups. There were two high-end outliers in Group 1, but they were not extreme; outliers were retained in the
analysis. The mean heart rates differed significantly, t(18) = –2.75, p = .013, two tailed. Mean heart rate for the no-caffeine group (M = 57.8, SD = 7.2) was about 10 beats per minute lower than mean heart rate for the caffeine group (M = 67.9, SD = 9.1). The effect size, as indexed by η², was .30; this is a very large effect. The 95% CI for the difference between sample means, M1 – M2, had a lower bound of –17.81 and an upper bound of –2.39. This study suggests that consuming 150 mg of caffeine may significantly increase heart rate, with an increase on the order of 10 bpm. The assumption of homogeneity of variance was assessed using the Levene test, F = 1.57, p = .226; this indicated no significant violation of the equal variance assumption.

Readers generally assume that the equal variances assumed version of the t test (also called the pooled-variances t test) was used unless otherwise stated. If you see df reported to several decimal places, this tells you that the equal variances not assumed t test was used.

12.13 GRAPHING RESULTS: MEANS AND CIS

Cumming and Finch (2005) suggested that authors should emphasize confidence intervals along with effect sizes. Graphs of CIs help focus reader attention on these. Several types of CI graphs can be presented for the independent-samples t test. We could set up a graph of the CI for the (M1 – M2) difference using either an error bar or a bar chart. The lower and upper limits of this CI are provided in the independent-samples t test output. It is more common to show a CI for each of the group means (M1 and M2). This can be done with either the SPSS error bar or bar chart procedure. To obtain an error bar graph for M1 and M2, make the menu selections shown in Figure 12.15, Figure 12.16, and Figure 12.17. In Figure 12.18 the separate vertical lines for each group (no caffeine, 150 mg caffeine) have two features. The dot represents the group mean. The T-shaped bars identify the lower and upper limits of the 95% CI for each group.
Be careful when you examine error bar plots in journals or conference posters. Error bars that resemble the ones in Figure 12.18 sometimes represent the mean ± 1 standard deviation, or the mean ± 1 SEM, instead of a 95% CI. Graphs should be clearly labeled so that viewers know what the error bars represent.
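To see how much the three conventions can differ, here is a minimal sketch that computes the half-width of each kind of error bar for the caffeine group reported in the Results section (SD = 9.1, n = 10). The function name is illustrative. The critical value 2.262 is the two-tailed .05 value of t for df = 9 from a standard t table, entered by hand rather than computed.

```python
import math

def error_bar_half_widths(sd, n, t_crit):
    """Return half-widths for three common error bar conventions."""
    sem = sd / math.sqrt(n)          # standard error of the mean
    return {"1 SD": sd, "1 SEM": sem, "95% CI": t_crit * sem}

# Caffeine group from the Results section: SD = 9.1, n = 10.
# 2.262 is the two-tailed .05 critical t for df = 9 (standard t table).
widths = error_bar_half_widths(sd=9.1, n=10, t_crit=2.262)
for label, half_width in widths.items():
    print(f"mean ± {label}: ± {half_width:.2f} bpm")
```

For these values the bars extend roughly 9.1, 2.9, and 6.5 bpm on either side of the mean, so an unlabeled error bar could be read in very different ways.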
Figure 12.16 Error Bar Dialog Box

Cumming and Finch (2005) pointed out that when two 95% CIs, like the ones in Figure 12.18, do not overlap, you know that the t test for the difference between group means must be statistically significant using α = .05, two tailed. On the other hand, if the CIs do overlap, it is still possible that the t test that compares group means may be statistically significant (because the CI for [M1 – M2] is based on a larger df than the CIs for M1 and for M2).
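The second case (overlapping CIs alongside a significant t test) can be illustrated with the rounded summary statistics from the Results section. This is a sketch under stated assumptions: the intervals are recomputed from the reported M, SD, and n rather than read from Figure 12.18, and the critical t values (2.262 for df = 9, 2.101 for df = 18) are taken from a standard t table.

```python
import math

# Summary statistics from the Results section (n = 10 per group).
m1, sd1, n1 = 57.8, 7.2, 10   # no caffeine
m2, sd2, n2 = 67.9, 9.1, 10   # 150 mg caffeine

# Two-tailed .05 critical t values from a standard t table.
T_CRIT_DF9, T_CRIT_DF18 = 2.262, 2.101

def mean_ci(m, sd, n, t_crit):
    """95% CI for a single group mean."""
    half = t_crit * sd / math.sqrt(n)
    return (m - half, m + half)

ci1 = mean_ci(m1, sd1, n1, T_CRIT_DF9)
ci2 = mean_ci(m2, sd2, n2, T_CRIT_DF9)
cis_overlap = ci1[1] > ci2[0]   # sufficient check here because m1 < m2

# Pooled-variances t test and 95% CI for the difference M1 - M2.
sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
se_diff = sp * math.sqrt(1 / n1 + 1 / n2)
t = (m1 - m2) / se_diff
diff_ci = (m1 - m2 - T_CRIT_DF18 * se_diff, m1 - m2 + T_CRIT_DF18 * se_diff)

print(f"CI for M1: {ci1[0]:.2f} to {ci1[1]:.2f}")
print(f"CI for M2: {ci2[0]:.2f} to {ci2[1]:.2f}")
print(f"CIs overlap: {cis_overlap}")
print(f"t(18) = {t:.2f}, CI for M1 - M2: {diff_ci[0]:.2f} to {diff_ci[1]:.2f}")
```

With these rounded inputs the two single-mean intervals overlap slightly, yet the CI for the difference excludes zero, which agrees with the reported t(18) = –2.75.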
A bar chart is another way to represent information about CIs. The menu selections to open the bar chart procedure were shown earlier (<Graphs> → <Legacy Dialogs> → <Bar>). In the Define Simple Bar: Summaries for Groups of Cases dialog box, in Figure 12.19, select the radio button for “Other statistic (e.g., mean)” and move the dependent variable name (heart rate) into the box labeled “Variable.” It will appear as MEAN([hr]). The height of each bar will correspond to the mean heart rate for one group. Enter the name of the group or category variable into the box labeled “Category Axis.” Click the Options button. In the Options dialog box, also shown in Figure 12.19, check the box for “Display error bars.” Leave the default radio button selection under “Confidence Intervals” as 95.0 for “Level (%),” unless otherwise desired. This will produce a 95% CI for each group mean. The resulting bar chart appears in Figure 12.20. By default, SPSS uses 0 as the starting value for the Y axis. When bar charts were used to represent the frequency of cases for each group earlier, using 0 as the lowest value for Y was recommended; cutting out large portions of the Y axis that represent possible values for Y can yield a graph that exaggerates the magnitude of differences in group sizes.
When bars represent group means, starting the Y axis at 0 often does not make sense. For heart rate, it would make sense to use the lowest value for heart rate that you could call a normal healthy heart rate as your minimum. In this situation it would be reasonable to use a value such as 40 as the lowest value marked on the Y axis. This change can be made in the chart editor (commands are not shown). The edited bar chart appears in Figure 12.21.

12.14 DECISIONS ABOUT SAMPLE SIZE FOR THE INDEPENDENT-SAMPLES T TEST

Statistical power analysis provides a more formal way to address this question: How does the probability of obtaining a t ratio large enough to reject the null hypothesis (H0: μ1 = μ2) vary as a function of sample size and effect size? Statistical power is the probability of obtaining a test statistic large enough to reject H0 when H0 is false. Researchers generally want to have a reasonably high probability of rejecting the null hypothesis; power of 80% is sometimes used as a reasonable guideline. Cohen (1988) provided tables that can be used to look up power as a function of effect size and n or to look up n as a function of effect size and the desired level of power. An example of a power table that can be used to look up the minimum required n per group to obtain adequate statistical power is given in Table 12.3. This table assumes that the researcher will use the conventional α = .05, two tailed, criterion for significance. For other alpha levels, tables can be found in Jaccard and Becker (2009) and Cohen (1988). To use this table, the researcher must first decide on the desired level of power (power of .80 is often taken as a reasonable minimum). Then, the researcher needs to make an educated guess about the population effect size that the study is designed to detect.
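The definition of power as the probability of rejecting H0 when H0 is false can also be checked by simulation instead of a table lookup. The sketch below is one possible approach, not the method behind Table 12.3. It assumes normally distributed scores, equal n and equal variances, and hypothetical inputs of Cohen’s d = 1.0 (about η² = .20 under the equal-n relation η² = d² / (d² + 4)) with 17 cases per group. The critical value 2.037 (two-tailed .05, df = 32) is entered from a standard t table.

```python
import math
import random
import statistics

def simulate_power(d, n_per_group, t_crit, n_sims=5000, seed=1):
    """Estimate power of the pooled-variances t test by Monte Carlo simulation.

    Scores are drawn from normal populations with SD = 1 whose means differ by d.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        g1 = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        g2 = [rng.gauss(d, 1.0) for _ in range(n_per_group)]
        # With equal n, the pooled variance is the mean of the two variances.
        pooled_var = (statistics.variance(g1) + statistics.variance(g2)) / 2
        se_diff = math.sqrt(pooled_var * 2 / n_per_group)
        t = (statistics.mean(g1) - statistics.mean(g2)) / se_diff
        if abs(t) > t_crit:
            rejections += 1
    return rejections / n_sims

# Hypothetical inputs: d = 1.0, n = 17 per group, critical t for df = 32.
power = simulate_power(d=1.0, n_per_group=17, t_crit=2.037)
print(f"Estimated power: {power:.2f}")
```

The estimate should land close to .80, with some run-to-run variation that depends on the seed and the number of simulated studies.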
In an area where similar studies have already been done, the researcher may calculate η² values on the basis of the t or F ratios reported in published studies and then use the average effect size from past research as an estimate of the population effect size. (Recall that η² can be calculated by hand from the values of t and df using Equation 12.19 if the value of η² is not reported in the journal article.) If no similar past studies have been done, the researcher can make an educated guess; in such situations, it is safer to guess that the effect size in a new research area may be small.
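As a quick illustration of recovering η² from a published t and df, the sketch below assumes that Equation 12.19 has the standard form η² = t² / (t² + df); the equation itself is not reproduced in this section. The function name is illustrative.

```python
def eta_squared_from_t(t, df):
    """Eta squared for an independent-samples t test: t^2 / (t^2 + df)."""
    return t**2 / (t**2 + df)

# Values reported in this chapter's Results section: t(18) = -2.75
print(f"{eta_squared_from_t(-2.75, 18):.2f}")
```

Applied to t(18) = –2.75, this reproduces the η² of about .30 given in the Results section.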
Suppose a researcher believes that the population effect size is on the order of η² = .20. Looking at the row that corresponds to power of .80 and the column that corresponds to η² of .20, the cell entry of 17 indicates the minimum number of participants required per group to have power of about .80. In this situation, I would still suggest a minimum n per group of 25 to 30, to ensure robustness against possible violations of assumptions and to obtain reasonably narrow CIs for each group mean. SPSS has an add-on program to calculate statistical power. Java applets for statistical power for the independent-samples t test and many other procedures are available at http://www.stat.uiowa.edu/~rlenth/Power (Lenth, 2018). Federal agencies that provide research grants now expect statistical power analysis as a part of grant proposals; that is, researchers must demonstrate that given reasonable, educated guesses about effect size, the planned sample size is adequate to provide good statistical power (e.g., at least an 80% chance of judging the effect to be statistically significant). It is not worth undertaking a study if the researcher knows a priori that the sample size is probably not adequate to detect the effects.

Given the imprecision of procedures for estimating the necessary sample sizes, the values contained in this and the other power tables presented in this book are approximate. Larger n’s than the minimum suggested by power tables are often desirable. Even if statistical power analysis suggests that n less than 30 per group might be adequate, samples smaller than that are not advisable. When n’s are very small, consider a nonparametric test such as the Mann-Whitney U test, but note that this test requires other assumptions that may be difficult to satisfy in practice (Appendix 12A).

Do not conduct a post hoc (or postmortem) power analysis when you report results and publish comments such as “Given the sample effect size in my study, my results would have been statistically significant if I had a larger sample.” That is unwarranted speculation. However, if you can see that your sample size was too small, you will want to keep the need for larger samples in mind when you design future studies.

12.15 ISSUES IN DESIGNING A STUDY

12.15.1 Avoiding Potential Confounds

A confound of one (or more) other variables with your treatment variable makes the M1 – M2 difference uninterpretable. A confound may make M1 – M2 larger than it should be in some situations and smaller than it should be in other situations. Suppose that you want to know whether patients have lower mean anxiety scores after Rogerian therapy (Group 1) or Freudian psychodynamic therapy (Group 2).
Suppose that these two types of therapy are given by different therapists (Dr. Goodman does the Rogerian therapy and Dr. Deadwood does the psychodynamic therapy). This would be a perfect confound between therapist personality and ability and type of therapy. If the Group 1 patients do better than those in Group 2, we cannot tell whether this is due to differences in the type of therapy or differences between the two doctors. This is a perfect or complete confound, and it makes the results of this study uninterpretable. The M1 – M2 difference can be due to type of therapy, personality and ability of the therapist, or both. (Even if Dr. Goodman did the therapy in both groups, there could be problems, because she might have greater faith in one type of therapy than the other, and this could produce placebo or expectancy effects.)

Confounds do not have to be complete confounds to be problematic. Consider a group of patients in a drug study. If the drug group has 55% women and the placebo group has only 39% women, there is a partial confound between type of drug and sex. M1 might differ from M2 because the M1 group includes more women, while the M2 group includes more men, instead of or in addition to any drug effects.
Confounds can be obvious, but sometimes they are subtle. Random assignment of participants to groups is supposed to make groups equivalent in composition, but sometimes this doesn’t work as well as expected. When background information is available about participants, it’s good to compare the groups to see whether they are equivalent. Self-selection into treatment is problematic. If your study includes a meditation training group and a control group, and participants are allowed to choose their groups, you will probably have different kinds of people in the meditation group than in the no-treatment control group.

12.15.2 Decisions About Type or Dosage of Treatment

Researcher decisions about the types or amounts of treatments (or other group characteristics) can influence the M1 – M2 difference between means. Usually, researchers want to maximize this difference. However, there are limits. We cannot give human beings 10,000 mg of caffeine to maximize the effects of caffeine on heart rate (for ethical as well as practical reasons). It would not be useful to give rats amounts of artificial sweetener that would correspond to human consumption of 50 diet sodas per day, because that dosage would not correspond to any real-world situation. If naturally occurring groups are compared (for example, older adults vs. younger adults), it will usually be easier to find differences when groups differ substantially. For instance, a study that compares reaction time between a group of persons ages 60 to 70 and a group of persons ages 20 to 30 is more likely to find a difference than a study that compares a group of persons in their 20s with a group in their 30s.

12.15.3 Decisions About Participant Recruitment and Standardization of Procedures

Researcher decisions about types of participants to recruit, and about standardization of procedures, can affect the magnitude of sp, the pooled or averaged within-group standard deviation.
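The effect of sp on t can be seen by holding the mean difference and n constant while changing only the within-group standard deviations. The numbers below are hypothetical (a 10 bpm difference, n = 10 per group, SDs of 5 versus 12) and the function names are illustrative; the pooled standard deviation uses the usual pooled-variances formula.

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled within-group standard deviation, sp."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def t_ratio(mean_diff, sd1, n1, sd2, n2):
    """Pooled-variances t ratio for a given mean difference."""
    sp = pooled_sd(sd1, n1, sd2, n2)
    return mean_diff / (sp * math.sqrt(1 / n1 + 1 / n2))

# Hypothetical studies with the same 10 bpm mean difference and n = 10 per group.
t_homogeneous = t_ratio(10, 5, 10, 5, 10)       # similar participants, SD = 5
t_heterogeneous = t_ratio(10, 12, 10, 12, 10)   # diverse participants, SD = 12

print(f"Homogeneous groups:   t = {t_homogeneous:.2f}")
print(f"Heterogeneous groups: t = {t_heterogeneous:.2f}")
```

In this made-up comparison the homogeneous study gives t of about 4.47 and the heterogeneous study about 1.86; only the first exceeds the two-tailed .05 critical value of 2.101 for df = 18.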
Recruiting homogeneous participants such as 18-year-old healthy men helps keep sp low (compared with studies with wider ranges of age and health), but it also limits the potential generalizability of results. It is a good idea to standardize situations and testing procedures to keep sp small, but rigid protocols can result in experiences that make the situation feel even more artificial.

12.15.4 Decisions About Sample Size

Sometimes participants or cases are difficult or costly to obtain. A neuroscience study might involve surgical procedures and lengthy training and testing procedures. In such situations, standardization of procedures and optimal choice of treatment dosage levels is particularly important. When researchers have access to very large N’s (on the order of tens of thousands), there is a different problem. Even effects that are extremely small (when evaluated by looking at M1 – M2, or η², or Cohen’s d) can be statistically significant when N is very large. Researchers should resist the temptation to overemphasize statistical significance in these situations. Clear information about effect size should be provided in terms readers can understand. This is particularly important when important real-life decisions (such as medical decisions) are at stake. One possible reason why researchers have been slow to adopt the reporting of effect size information and CIs is that effect sizes are often embarrassingly low, while CIs are often embarrassingly wide.

To summarize:
Researcher decisions about treatment type and dosage, and the presence of confounds, will affect the magnitude of M1 – M2.
Confounds make M1 – M2 differences uninterpretable even if they are statistically significant.
Researcher decisions about participant recruitment and procedures can reduce the magnitude of sp but may also reduce generalizability.
Very low n’s result in underpowered studies, that is, studies in which a statistically significant t value is unlikely even if the null hypothesis is false.
Very large n’s can lead to situations in which effects that are too small to have any real-world practical or clinical importance are judged statistically significant.
In between these extremes, statistical power tables can help researchers evaluate the sample sizes needed for adequate statistical power.

12.16 SUMMARY

This chapter discussed a simple and widely used statistical test (the independent-samples t test) and provided additional information about effect size, statistical power, and factors that affect the size of t. The t test is sometimes used by itself to report results in relatively simple studies that compare group means on a few outcome variables; it is also used as a follow-up in more complex designs that involve larger numbers of groups or outcome variables. A t-test value (and corresponding effect sizes) is not a fact of nature. Researchers have some control over factors that influence the size of t, in both experimental and nonexperimental research situations.
Because the size of t depends to a great extent on our research decisions, we should be cautious about making inferences about the strength of effects in the real world on the basis of the obtained effect sizes in our samples. For the independent-samples t test, researchers often report one of the following effect size measures: Cohen’s d, rpb, or η². Eta squared is an effect size commonly used to do power analysis for future similar studies. When researchers want to summarize information across many past studies (as in a meta-analysis), rpb (often just called r) is often the effect size of choice. Past research has not always included effect size information, but readers can usually calculate effect sizes from the information in published journal articles.

Notice that the independent-samples t test, like correlation and regression, provides a partition of the total variance in Y outcome scores into two parts; η² is the proportion of variance in Y that differs between groups (variance that may be due to different types or amounts of treatment). In regression, r² was the proportion of variance in Y that could be linearly predicted from X. Similarly, (1 – r²) was the proportion of variance in Y that could not be linearly predicted from X; for the independent-samples t test, (1 – η²) is the proportion of variance in Y that is not predictable from group membership or from the score on the predictor variable. The r² and η² are both called proportion of predicted (or sometimes explained) variance. Predicted variance is variance in Y that is related to scores on the X predictor variable. By contrast, (1 – r²) and (1 – η²) are the parts of the variance in Y that are not predictable from the X independent variable. These are interpreted as proportions of error variance.

The term error in everyday life means “mistake.” In statistics, error has many different meanings, depending on context. Errors in prediction don’t happen because the data analyst made a mistake (although mistakes in data analysis can happen, of course). Errors in prediction happen because many other variables, other than the X variable used as a predictor, influence the scores on the Y outcome variable. Error refers collectively to all the variables in the world that are related to Y, but that we did not control in the study or include in the statistical analysis. This may clarify why proportions of error variance are so high in most research! Error also includes any chance or random or unpredictable elements in Y. If you go on to learn about analyses that include multiple predictor variables, you will see that use of multiple predictors sometimes reduces the proportion of error variance. To describe the problem of error variance another way, consider the tongue-in-cheek Harvard Law of Animal Behavior: “Under carefully controlled experimental circumstances, an animal will behave as it damned well pleases.”

This chapter was long and detailed because it introduces issues that arise when comparing means across groups; many of the following chapters describe analyses that also compare means across groups. This set of analyses is called analysis of variance.
The same issues (assumptions, data screening, effect size, and so forth) continue to be important for those analyses, and I’ll often refer you back to this chapter for more complete discussion.