Data Screening Basics Stats

TT24
Reading-AppliedstatisticsIChap5-6.pdf

Warner, R. M. (2021). Applied statistics I: Basic bivariate techniques (3rd ed.). Thousand Oaks, CA: Sage Publications. ISBN: 978-1-5063-5280-0.

CHAPTER 5 GRAPHS: BAR CHARTS, HISTOGRAMS, AND BOXPLOTS

5.1 INTRODUCTION

Information about scores that was presented in the form of frequency tables in Chapters 3 and 4 can be presented in simple graphs. This chapter describes some widely used types of graphs: pie charts and bar charts for categorical variables, and histograms and boxplots for quantitative variables. Each approach (frequency table vs. graph) has advantages and potential disadvantages:

An advantage of frequency tables is that they provide exact information about the numbers or percentages of persons who had each score value. The corresponding disadvantage of graphs is that when they are poorly labeled, it is difficult to identify exact numbers and percentages.

A disadvantage of graphs is that they can be constructed in ways that create deceptive impressions. Frequency tables generally are not deceptive.

An advantage of graphs is that they provide appealing visual information that grabs readers’ attention; this is particularly useful in mass media reports, PowerPoint or Prezi presentations, and poster presentations at professional conferences. A disadvantage of frequency tables is that they do not have much visual appeal.

An advantage of graphs for quantitative variables (such as histograms) is that they provide easily understandable information about distribution shape. This chapter describes several distribution shapes commonly seen in real data (examples appear in Tables 5.1 and 5.2). The bell-shaped curve (more formally, the normal distribution or Gaussian distribution) is of particular interest. The normal distribution will be discussed further in Chapter 6. A disadvantage of frequency tables is that it can be difficult (although it is possible) to evaluate distribution shape by inspection of a frequency table.

Ideally, preliminary data screening includes frequency tables (Chapter 3), descriptive statistics (Chapter 4), and graphs (the present chapter). Frequency tables are rarely included in published research reports.
Graphs of frequency distributions are not often reported in journal articles, although they can be. Information in frequency tables can be used to label graphs accurately. SPSS does not produce publication-quality graphics. For beginners, this is not a major problem; the graphs are adequate for preliminary data screening. Advanced users may prefer other programs to generate graphics. The R supplement for this book (Rasco, 2020) demonstrates use of the ggplot procedure; this produces better quality graphics. I modified most SPSS graphics in this book by editing to increase font sizes and add information.

In real-world data analysis, descriptive statistics, frequency tables, and graphs should be examined before a data analyst conducts the main analysis that is of primary interest (such as a t test or analysis of variance [ANOVA]). These provide information needed for preliminary data screening. Published research reports typically include only a few sentences about preliminary screening (if they mention it at all). Hoekstra, Kiers, and Johnson (2012) noted that many authors don’t report much about data screening; they argue that the validity of statistical results is often questionable because assumptions required for statistical analysis are not satisfied (and often not even checked). Potential violations of some of the assumptions that are introduced later can be assessed by examining graphs.

5.2 PIE CHARTS FOR CATEGORICAL VARIABLES

Pie charts are almost universally despised by scientists, and you are unlikely to see them in academic journals; however, they are popular in mass media, so you should be familiar with them. Consider the frequency table for hypothetical scores for the categorical variable marital status (Figure 5.1). Recall that the “Cumulative Percent” column automatically provided by SPSS makes no sense for categorical variables. Focus on the “Frequency” and “Percent” columns. To request a pie chart, use the familiar Frequencies procedure, beginning with these SPSS menu selections: <Analyze> → <Descriptive Statistics> → <Frequencies>. Click the Charts button to open the Frequencies: Charts dialog box in Figure 5.2; within that window, select the radio button for “Pie charts,” then click Continue and OK. Edited pie chart output appears in Figure 5.3.

Figure 5.2 Use of Frequencies: Charts Dialog Box to Request a Pie Chart

Figure 5.3 Pie Chart for Hypothetical Marital Status Data, N = 42

The frequency table in Figure 5.1 tells us that the group with the largest number of members is “never married”; this corresponds to the solid “slice” in the pie chart. The frequency table has a great advantage over the pie chart; it provides exact frequencies and percentages, while the pie chart only approximates group sizes (unless the slices are labeled using numbers or percentages). Kopf (2015) reviewed reasons why many data analysts hate pie charts. For example, people are not good at estimating percentages from the areas of the slices. Pie charts require the use of colors (or textures such as dots or stripes) to differentiate slices; most science journals do not publish figures in color. Tufte (2001), who authored several books about excellence in graphing, regards most multicolored figures as unsightly; he argues that graphs should use as little ink as possible to provide complete information.

Pie charts have only two virtues. They provide colorful slides in presentations, and this is something that some data analysts (in marketing, for example) may like. Also, they lend themselves well to humor. (Search online for “funny pie charts” to find examples, or create your own comic version. Perhaps you can persuade your instructor to give a prize or extra credit for the most comical or ingenious examples.) If you become a science researcher, you will probably never use pie charts.

5.3 BAR CHARTS FOR FREQUENCIES OF CATEGORICAL VARIABLES

The SPSS Frequencies procedure, which was used in previous chapters to obtain frequency tables, can also provide bar charts (or bar graphs). To open the Frequencies dialog box that appears in Figure 5.4, make these menu selections: <Analyze> → <Descriptive Statistics> → <Frequencies>. Click the Charts button on the right-hand side of the Frequencies dialog box to open the Frequencies: Charts box; a bar chart is obtained by selecting the radio button for “Bar charts” in the Frequencies: Charts dialog box (also shown in Figure 5.4). The Y axis may be given in frequencies (number of cases) or percentages. Click Continue to return to the main Frequencies dialog box. Click OK to run the procedure.

The hypothetical marital status scores in Figure 5.1 were used to set up the bar graph in Figure 5.5. The height of each bar represents group size. I edited the bar graph produced by SPSS (using the SPSS Chart Editor and Microsoft Paint) in the following ways: I increased font sizes for the X and Y axis labels and added the exact number of cases per group (from the frequency table) above each bar.

Figure 5.5 Bar Chart for Hypothetical Marital Status Groups, Total N = 42

5.4 GOOD PRACTICE FOR CONSTRUCTION OF BAR CHARTS

Bar charts and other graphs should provide accurate information that is easy to understand. It is easier for readers to understand graphs when they follow simple rules and conventional standards.

1. A separate bar represents the frequency (or proportion or percentage of cases) for each group. The height of the bar corresponds to the number or frequency in each group (or the proportion or percentage of cases in each group). The labels on the Y axis should make clear whether frequency, proportion, or percentage is reported. However, the relative heights of the bars are the same no matter which label is used. (Usually bars are vertical, but it is possible to set up bar charts in which bars are horizontal.)

2. Names of groups are specified by labels on the X axis.

3. Bars should have equal widths. (This rule is not always followed.)

4. The height of the graph (Y axis) is usually less than the width of the X axis (the height of Y is often about 75% the length of X).

5. The Y axis begins at 0 (or at another minimum value of Y).

6. The top of each bar is labeled with an exact numerical value (a frequency or a percentage). SPSS does not do this for you; I added this information using SPSS Chart Editor.

7. Information about total N must be provided.

8. In a footnote or the body of the text, the source of data should be stated. Readers tend to assume that numbers are based on new data collected by the researcher; if there is another source (such as Gallup polls or the U.S. census), that source must be identified.

9. Bars in bar graphs for categorical variables usually do not touch one another. (This reminds readers that bars represent distinct groups.)

When you generate bar charts for frequencies in SPSS, many of these good form requirements are taken care of by default (e.g., bars have equal widths, and the Y axis begins at 0).

5.5 DECEPTIVE BAR GRAPHS

The most common way to make a bar chart for group frequencies “lie” is to set up the Y axis so that it does not start at 0. To illustrate this deception, I modified the graph in Figure 5.5 so that the Y axis begins at 2 (instead of 0). The modified bar chart in Figure 5.6 is potentially misleading because people tend to look at the ratio of bar heights (or bar areas) when they compare group sizes; people often do not pay close attention to the specific values indicated on the Y axis. In Figure 5.6, the differences in group sizes appear larger than in Figure 5.5. In Figure 5.6, the never married group appears to have about 10 times as many members as the widowed group (measure the height of the bar for never married and divide this by the height of the bar for the widowed group). When actual group sizes are considered (20 never married, 3 widowed), the never married group is only about 7 times as large as the widowed group.

Figure 5.7 Deceptive Bar Chart: Use of Cartoons Instead of Bars to Represent Frequencies

Another way to make this type of graph deceptive is by using cartoon figures instead of bars. Huff and Geis (1993) provide examples. Here is a graph similar to examples in Huff and Geis: Suppose you graphed the number of new houses built in two different years using the heights of cartoon figures of houses, instead of the heights of bars, to represent frequencies (see Figure 5.7).

In this imaginary example, twice as many new houses (10,000) were built in 2019 as in 2009 (5,000). The heights of the cartoon houses correspond to these frequencies. However, a reader is likely to use the area of each cartoon (or even the perceived volume, if the cartoon appears three-dimensional) to make the comparison. On the basis of the areas of the house images, a reader might form the impression that the number of new houses was something like 6 times greater in 2019. Replacement of bars with equal widths by cartoon images (particularly if they have unequal widths) can be very deceptive (Huff & Geis, 1993).

When you create bar charts, make them as informative and honest as possible; include exact numerical information such as frequencies or percentages for each group. When you read graphs such as bar charts, pay more attention to the numbers on the Y axis (the frequencies and percentages for each group) than to the picture (bars or cartoon figures). Be aware that when the Y axis starts at a value larger than 0, a bar chart may be deceptive.

5.6 HISTOGRAMS FOR QUANTITATIVE VARIABLES

A histogram provides visual information about the distribution of scores on quantitative variables. A histogram is like a bar chart in that each bar represents the number or percentage or proportion of cases that have each score value. However, because quantitative variables often have more different score values than categorical variables, histograms tend to have more bars. Marks on the X axis represent score values, and marks on the Y axis represent the frequency (or proportion or percentage) of cases in the sample that had each score value. Some introductory textbooks state that bars in histograms should touch one another. In SPSS, the bars in both bar charts and histograms have small spaces between them.

A new question we can ask when we examine a histogram is, What is the shape of the distribution? The easiest way to learn about distribution shape is to consider examples. Some distribution shapes that often appear in real data are shown in Tables 5.1 and 5.2. The first line in Table 5.1 shows a bell-shaped (more formally called a normal or Gaussian) distribution. Other rows in Table 5.1 show distributions that are variations of this shape.

However, when we look at real data, we often see distributions that do not look anything like normal or bell-shaped curves. Table 5.2 shows distributions with shapes that are clearly not close to bell-shaped or normal. Tables 5.1 and 5.2 do not include all possible distribution shapes; there are many others.

To decide whether one or more of these distribution shapes best describe the data in your sample, you can obtain a histogram and compare it with the examples in these tables. Visual examination of a histogram is usually sufficient to make reasonable evaluations about distribution shapes. In Chapter 6, you’ll see that there are quantitative methods to evaluate how well data fit a specific distribution shape; however, these are rarely used in practice.

The bell-shaped distribution in row 1 of Table 5.1 is discussed extensively in statistics. Informally, we can describe this bell-shaped distribution shape as follows. There is a “hump” in the center of a bell-shaped distribution. In a perfectly normal distribution, the mean, median, and mode are exactly equal and correspond to the center of the distribution (and all correspond to the top of the hump). Frequencies (the heights of the bars in the histogram) decline gradually as scores become either larger or smaller than the mean, median, or mode; this creates a shape something like a bell. The distribution is symmetrical around the mean. That is, the upper half of the histogram is a mirror image of the lower half.

Comprehension questions will ask you to examine histograms and evaluate whether the distribution is bell-shaped with minor variations or is described better by quite different distribution shapes. This is a somewhat subjective judgment call. Sometimes the best decision is to say that none of the common distribution shapes is a good description for a histogram you obtain for your data. People who often work with specific types of variables (such as reaction time) will learn the specific distribution shapes for those variables.

5.7 OBTAINING A HISTOGRAM USING SPSS

The hypothetical female height data in femaleheight.sav are used to set up a histogram. You may have wondered why height and temperature are used as variables in early examples. Each of these variables can be given in different units (for example, height can be inches or centimeters; temperature can be given in degrees Fahrenheit or Celsius). (The United States is one of very few nations that still uses nonmetric units such as inches.) The following example shows how to obtain histograms; examples demonstrate that converting units of measurement from inches to centimeters does not change the shape of the frequency distribution (although unit conversion does change the values of M, SD, and other descriptive statistics).

You will find it useful to be able to convert scores from one unit to another, and to do other computations. The SPSS Compute Variable command can be used to do this and has many additional potential uses. In this situation, we use this command to obtain (approximate) height in centimeters by multiplying height in inches by 2.54. To open the Compute Variable dialog box in Figure 5.8, select these menu options: <Transform> → <Compute Variable>. In the left-hand window, type the name of the new variable (in this example, heightcm). In the right-hand window, type a numerical expression that includes the name of one (or more) existing variable(s) that is used to assign values to the new variable (in this example, the numerical expression is “2.54*heightinch”). After you click OK, the new variable heightcm will appear as a new column on the right-hand side of your data worksheet.
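For readers who want to see the same unit conversion outside SPSS, here is a minimal Python sketch. It is only an analogue of the Compute Variable step, not the SPSS procedure itself. The list names height_inch and height_cm mirror the variable names used in the text, and the five height values are made up for illustration; they are not the femaleheight.sav data.

```python
# Hypothetical heights in inches (illustrative values only,
# not taken from femaleheight.sav).
height_inch = [62.0, 64.5, 66.0, 59.0, 70.0]

CM_PER_INCH = 2.54  # conversion factor used in the text

# Analogue of the SPSS numerical expression 2.54*heightinch:
# every score is multiplied by the same constant.
height_cm = [CM_PER_INCH * h for h in height_inch]

for inch, cm in zip(height_inch, height_cm):
    print(f"{inch:5.1f} in = {cm:6.2f} cm")
```

Because each score is multiplied by the same constant, the order of the scores and their relative spacing are unchanged, which is why the shape of the distribution is preserved.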

Now let’s compare the distributions of height in inches and height in centimeters. The familiar Frequencies procedure (used to obtain descriptive statistics and pie and bar charts for categorical variables) can be used to request histograms for quantitative variables. Use these SPSS menu selections: <Analyze> → <Descriptive Statistics> → <Frequencies>. Move both variables (heightinch and heightcm) into the Variable(s) pane. Click the Charts button and select the radio button for “Histograms.” You may also want to check the box for “Show normal curve on histogram.” Click the Statistics button and use checkboxes in the Frequencies: Statistics dialog box to choose the desired descriptive statistics. Click OK to run the procedure. Output for descriptive statistics appears in Figure 5.9 and the histograms in Figure 5.10.

Figure 5.8 SPSS Compute Statement to Convert Height From Inches to Centimeters

Figure 5.10 Histograms for Hypothetical Female Height Data

As you might expect, transformation of scores from inches to centimeters changed all values for descriptive statistics, such as mean and standard deviation. For example, the mean for height in centimeters is 2.54 times the mean for height in inches. Each descriptive statistic for height in centimeters is 2.54 times the corresponding statistic for height in inches (except that variance for height in centimeters is 2.54² times the variance in inches). Did this transformation change the shape of the distribution? Figure 5.10 shows that the distributions of height scores given in inches and centimeters have identical shapes, even though individual scores and descriptive statistics such as M and SD are in different units, and the units along the X axis differ.

I have marked M and SD in the two histograms above. The sample mean is approximately in the middle, marked by the letter M on the X axis. Recall that SD summarizes information about distances of scores from the mean; SD is shown as horizontal arrows that indicate distance from the mean of X. For height in inches, SD was 2.5 inches. The end points of the arrows that indicate the distance of one SD below M and one SD above M are:

Lower end point: M – 1 SD = 64.5 – 2.5 = 62
Upper end point: M + 1 SD = 64.5 + 2.5 = 67
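The scaling claims and the end-point arithmetic above can be checked with a short Python sketch. The function names describe and one_sd_endpoints and the eight height values are illustrative assumptions; only the factor 2.54 and the values M = 64.5 and SD = 2.5 (which give the end points 62 and 67) come from the text.

```python
import statistics

def describe(scores):
    """Return (mean, sample SD, sample variance) for a list of scores."""
    return (statistics.mean(scores),
            statistics.stdev(scores),
            statistics.variance(scores))

def one_sd_endpoints(m, sd):
    """Return the points one SD below and one SD above the mean."""
    return (m - sd, m + sd)

# Hypothetical heights in inches (made-up values for illustration).
height_inch = [60.0, 62.0, 63.5, 64.5, 65.0, 66.0, 67.5, 69.0]
height_cm = [2.54 * h for h in height_inch]

m_in, sd_in, var_in = describe(height_inch)
m_cm, sd_cm, var_cm = describe(height_cm)

# Mean and SD scale by 2.54; variance scales by 2.54 squared.
print(round(m_cm / m_in, 4), round(sd_cm / sd_in, 4), round(var_cm / var_in, 4))

# End points using the values reported in the text (M = 64.5, SD = 2.5).
print(one_sd_endpoints(64.5, 2.5))
```

The ratios hold for any set of scores, since multiplying every score by a constant multiplies the mean and SD by that constant and the variance by its square.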

In Chapter 6, you’ll learn about the mathematical definition of normal distribution shape (expressed in the form of a somewhat complicated equation). That equation generates the smooth curves superimposed on the histograms above.

5.8 DESCRIBING AND SKETCHING BELL-SHAPED DISTRIBUTIONS

When sample data are approximately normally distributed, you need only three pieces of information to specify the distribution, communicate information about it to someone else, and/or draw a sketch of that distribution. These pieces of information are:

1. The distribution shape (normal).

2. The sample mean M.

3. The sample standard deviation SD.

Figure 5.11 Sketch Based on Three Pieces of Information: Normal Shape, M = 100, SD = 15

For example, scores on many IQ tests are normally distributed with a mean of 100 and a standard deviation of 15 (or sometimes 16). This is enough information to sketch the shape of the distribution, as shown in Figure 5.11. The range rule (from the previous chapter) will help you identify the approximate locations of the minimum and maximum value on the X axis, and this range is divided into six parts (the range is approximately equal to 6 × SD if the distribution is normal). Note that you won’t be able to label the Y axis in this graph. You can label the following seven points along the X axis if you know M and SD. The seven X axis values marked in Figure 5.11 are calculated from M and SD as follows: M – 3 SD = 55, M – 2 SD = 70, M – 1 SD = 85, M = 100, M + 1 SD = 115, M + 2 SD = 130, and M + 3 SD = 145.
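As a small illustration, the seven X axis markers can be computed as M + k × SD for k = –3 to +3. The Python sketch below does this for M = 100 and SD = 15, and also encodes the informal distance labels introduced in the next paragraph. The function names sd_markers and describe_distance are hypothetical, and the wording returned for scores between 1 and 2 SD from the mean is an assumption, because the text gives no label for that range.

```python
def sd_markers(mean, sd):
    """Return the seven X axis values M - 3 SD, ..., M, ..., M + 3 SD."""
    return [mean + k * sd for k in range(-3, 4)]

def describe_distance(score, mean, sd):
    """Label a score using the informal cutoffs given in the text."""
    z = abs(score - mean) / sd  # distance from the mean in SD units
    if z > 3:
        return "unusually far from the mean"
    if z > 2:
        return "far from the mean"
    if z <= 1:
        return "not very far from the mean"
    # The text does not name the range between 1 and 2 SD.
    return "between 1 and 2 SD from the mean (no label given in the text)"

print(sd_markers(100, 15))  # [55, 70, 85, 100, 115, 130, 145]

for iq in (110, 135, 69, 50, 150):
    print(iq, "->", describe_distance(iq, 100, 15))
```

The same functions work for any approximately normal variable once M and SD are known.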

Score locations relative to the mean can be approximately described as follows (we will describe distance from the mean more precisely in Chapter 6). A score can be called “not very far from the mean” if it lies between M – 1 SD and M + 1 SD. For example, an IQ of 110 is not very far from the mean. A score can be called “far from the mean” if it is below M – 2 SD or above M + 2 SD. For example, an IQ of 135 is far above the mean, and an IQ of 69 is far below the mean. A score can be called “unusually far from the mean” if it is less than M – 3 SD or greater than M + 3 SD. For example, an IQ of 50 is unusually far below the mean, while an IQ of 150 is unusually far above the mean.

Another way to look at this: If you had a set of 1,000 IQ scores with M = 100 and SD = 15, and you selected one case at random, the most likely outcome would be an IQ in the range from 85 to 115. You could obtain a case with an IQ in the range 150 and up, but that would be an unusual or unlikely outcome. If you know your own IQ, or any specific IQ score, you can locate that score on the X axis, and immediately see the following: Is your IQ score above or below the mean M? Is it far from the mean, or unusually far from the mean? Refer to Figure 5.11. An IQ of 90 is below the mean, but it is not very far from the mean. An IQ of 160 is above the mean, and it is unusually far from the mean (in other words, very few people have IQ scores equal to or greater than 160).

5.9 GOOD PRACTICES IN SETTING UP HISTOGRAMS

Most rules for good practice in bar chart construction also apply to the construction of histograms:

1. A separate bar represents the frequency (or proportion or percentage of cases) for each score value (or for a range of score values, as described later in this section).

2. The height of each bar corresponds to the number or frequency in each group (or the proportion or percentage of cases in each group).

3. Labels on the Y axis should make clear whether frequency, proportion, or percentage is reported. (However, the relative heights of the bars are the same regardless of labels.)

4. Score values are specified by labels on the X axis.

5. Bars should have equal widths.

6. The height of the graph (Y axis) is usually less than the width of the X axis.

7. The Y axis begins at 0.

8. In bar charts, it is good practice to label the top of each bar with an exact numerical value (a frequency or a percentage). There may not be enough space on a histogram to include such labels. Clearly labeled tick marks on the Y axis help readers evaluate frequencies.

9. Information about total N must be provided.

10. In a footnote or the body of the text, the source of data should be stated. Readers tend to assume that numbers are based on new data collected by the researcher; if there is another source (such as Gallup polls or the U.S. census), that source must be identified.

11. For many quantitative variables, it is necessary to divide scores into bins or groups to set up a histogram with a reasonable shape. SPSS does this automatically for you and uses a “secret sauce” recipe to optimize the number of bins. When practical, the width of bins should be equal.

Appendix 3C describes how scores can be grouped or binned to set up a grouped frequency table. If heart rate scores range from 57 to 89, we can set up an ungrouped frequency table with one row for each score value; we could set up an ungrouped histogram with one bar for each score value. Grouping isn’t essential in frequency tables. However, it is much easier to evaluate distribution shapes in histograms when score values are grouped or binned. For hr scores from 57 to 89, the following seven groups or bins could be created. Each bar in the histogram would correspond to the number of cases with scores in each range: one bar for hr between 56 and 60, one bar for scores between 61 and 65, and so on.
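The binning just described can be sketched in a few lines of Python. This is an illustration of equal-width grouping under stated assumptions, not the rule SPSS uses to choose bins. The function names make_bins and count_in_bins are hypothetical, the bin width of 5 and the starting value of 56 follow the ranges quoted in the text, and the heart rate scores are made up.

```python
def make_bins(start, highest, width):
    """Return inclusive (lower, upper) integer bins that cover start..highest."""
    bins = []
    lower = start
    while lower <= highest:
        bins.append((lower, lower + width - 1))
        lower += width
    return bins

def count_in_bins(scores, bins):
    """Count how many scores fall inside each inclusive bin."""
    counts = {b: 0 for b in bins}
    for s in scores:
        for lower, upper in bins:
            if lower <= s <= upper:
                counts[(lower, upper)] += 1
                break
    return counts

# Bins of width 5 starting at 56, as in the text: 56-60, 61-65, and so on.
bins = make_bins(56, 89, 5)

# Hypothetical heart rate scores between 57 and 89 (illustrative only).
hr = [57, 60, 63, 64, 68, 70, 72, 72, 75, 79, 81, 89]
counts = count_in_bins(hr, bins)

for (lower, upper), n in counts.items():
    print(f"{lower}-{upper}: {'*' * n} ({n})")
```

Each printed row plays the role of one bar in a grouped histogram; changing the width argument shows how the number of bins changes the apparent shape.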

The following example varies the number of bins to show how the number and widths of bins affect the shape of a histogram. Examples below show unpublished BMI (body mass index) scores for N = 1,250 students at my university. Using metric values of weight and height, BMI equals weight (in kilograms) divided by height (in meters) squared. (Using pounds and inches, BMI equals weight divided by height squared, multiplied by 703.) Healthy or normal BMI is usually defined as 18.5 to 24.9. Larger BMI values indicate greater body weight relative to height. Three situations are shown: a histogram with only one bin for all BMI values from 14 to 40; a histogram with “too many” bins, each for a very narrow range of BMI values; and a histogram with an optimal number of bins.

If all BMI scores are placed in one bin, there would be only one bar in the histogram, as shown in Figure 5.12. A histogram with only one bar does not provide any information about distribution shape. If there are varying numbers of observations in a large number of very narrow bins, the histogram might resemble the jagged graph in Figure 5.13. This is better than the previous graph, but it can be improved.
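The two BMI formulas given above can be written as a short Python sketch. The function names bmi_metric, bmi_imperial, and in_healthy_range are hypothetical, and the example person (70 kg, 1.75 m) is made up. The unit conversion constants (about 2.20462 lb per kg and 39.3701 in per m) are standard approximate values that are not stated in the text.

```python
def bmi_metric(weight_kg, height_m):
    """BMI = weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2

def bmi_imperial(weight_lb, height_in):
    """BMI = weight in pounds divided by height in inches squared, times 703."""
    return 703 * weight_lb / height_in ** 2

def in_healthy_range(bmi, low=18.5, high=24.9):
    """Check a BMI value against the 18.5 to 24.9 range quoted in the text."""
    return low <= bmi <= high

# Hypothetical person: 70 kg and 1.75 m.
metric = bmi_metric(70, 1.75)

# Same person in pounds and inches (approximate conversion factors).
imperial = bmi_imperial(70 * 2.20462, 1.75 * 39.3701)

print(round(metric, 1), round(imperial, 1), in_healthy_range(metric))
```

The two results agree closely but not exactly, because 703 is a rounded conversion factor.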

Figure 5.12 Histogram for BMI Scores: One Bin Includes All BMI Values

Figure 5.13 Large Number of Bins (“Too Many” Bins) With Jagged Distribution Shape

Figure 5.14 Optimal Number of Bins Determined by SPSS for BMI Histogram

Figure 5.14 shows the histogram for BMI scores when SPSS was allowed to decide on the “optimal” number of bins. I marked the clinical cutoffs for normal BMI (18.5 and 24.9) on this histogram as points of reference. In Figure 5.14, it is clear that the distribution shape for BMI was approximately normal except for a few outliers at the high end of the distribution, that is, a few cases with unusually high BMI. The SPSS default choice for the number and widths of bins provided a relatively smooth histogram to use for evaluation of the distribution shape for this data set. (SPSS does not publish the details of how this decision is made, and rules for this can be complex.)

When your variable can be evaluated in terms of clinical guidelines (a BMI between 18.5 and 24.9 is generally described as indicating healthy body weight), it can be useful to evaluate distribution shape relative to these clinical cutoffs. A large proportion of students had BMI scores in the “healthy” range. A fairly substantial minority of students had BMI scores that would be judged overweight or very overweight; a few had BMI scores that would be described as underweight. (A frequency table would provide the information needed to find the exact percentages of persons who were over- or underweight.) This distribution is positively skewed; it has a longer tail at the high end. Skewness is discussed further in Chapter 6.

It is desirable to have bins that correspond to the same ranges of score values, but this is not feasible in some situations. Figure 5.15 shows a histogram for real data: the percentages of households whose annual incomes fall into ranges such as less than $5,000, between $5,001 and $10,000, and so forth. In the histogram in Figure 5.15, each bar (except for the last two bars on the right) indicates the percentage of households that reported incomes in $5,000 increments: the first bar $0 to $5,000, the second bar $5,001 to $10,000, and so forth. The last two bars represent people whose incomes exceed $200,000 per year. If the graph continued to use $5,000 increments at the upper end of the income distribution, many additional bars would be needed to represent the full range of incomes in the United States. If you drew an X axis wide enough to include all these additional bars, the graph would have to be at least five times wider than shown in Figure 5.15. To avoid that problem, information about incomes greater than $200,000 was compressed into two bars. When looking at graphs like this, readers need to notice how the last few bars were defined. A first impression might be that there is a mode for incomes between $200,000 and $205,000, but this impression is incorrect. In fact, there is an extremely long and thin tail for this income distribution (the distribution is extremely positively skewed).

Figure 5.15 Annual Household Income in the United States in 2010

It is easier to evaluate shapes of distributions by examining histograms or other graphs than by looking at frequency tables. It is easier to evaluate shape when the number and widths of bins are optimized; SPSS does this by default. In practice, histograms do not provide much information about sample distribution shape when sample sizes are very small. I suggest that a minimum of 30 scores may be required to get a sense of any sample distribution shape.

5.10 BOXPLOT (BOX AND WHISKERS PLOT)

A boxplot (also called a box and whiskers plot) provides a different way to assess the distribution of scores for a quantitative variable. Boxplots are particularly useful in situations where outliers (unusually high or low scores) are present, or where distribution shape does not resemble a bell-shaped curve. Recall that the sample mean M is not robust against the effects of extremely low and extremely high scores (outliers). The value of M can change substantially when outliers are added to or dropped from a sample. When outliers are present, and when histograms are not bell shaped, an alternative approach based on the median instead of the mean may be preferable.

Boxplots represent central tendency using the median and dispersion or variability using percentiles. The median corresponds to the 50th percentile in a distribution of scores. That is, 50% of the scores lie below the median and 50% lie above it. The median is more robust than the mean against the influence of outliers. To describe distances of individual scores from a sample mean M in a bell-shaped distribution, we used the standard deviation (SD). The standard deviation is computed using the deviations of all individual scores from the sample mean. This may not be a good way to represent distances from the center of the distribution when the sample mean itself is misleading (because of the influence of outliers, for example). A boxplot avoids problems that arise because of extreme scores and non-normal distribution shape by:

• Using the median (50th percentile) as the index of central tendency (instead of M).

• Using the 25th and 75th percentiles (instead of SD), and other distances that are calculated from these percentiles, to describe distance of scores from the center of the distribution.

You can obtain the 25th and 75th percentiles as descriptive information from the SPSS Frequencies procedure; results appear in Figure 5.16. The 25th percentile identifies scores in the bottom 25%, and the 75th percentile identifies scores in the top 25%. The space between the 25th and 75th percentiles corresponds to the middle 50% of scores (it corresponds to a range in the center of the distribution a little less than that between M – 1 SD and M + 1 SD).

5.10.1 How to Set Up a Boxplot by Hand

It is easier to understand SPSS output for a boxplot if you first construct a boxplot by hand. To set up a boxplot by hand, you need the following information: median, 25th percentile, and 75th percentile for the variable of interest. It is also useful to have the minimum and maximum scores and a frequency table of all score values. Information needed to set up a boxplot for the femaleheight.sav data, obtained from the SPSS Frequencies procedure, appears in Figure 5.16. If you follow these instructions, you should obtain a graph similar to Figure 5.17.

Figure 5.17 Boxplot Set Up by Hand for Hypothetical Female Height Data With Numerical Labels on the Basis of Descriptive Statistics

Here are the steps to set up a boxplot by hand.

1. Place the full range of score values for the variable of interest (female height) on the Y axis.

2. Draw a shaded box in the center of the graph, as shown in Figure 5.17. The lower end of this box corresponds to the 25th percentile, the line that bisects the box corresponds to the median, and the upper end of the box corresponds to the 75th percentile. For the female height data, from Figure 5.16, these values are 63, 64.5, and 66.

3. Calculate the interquar le range (IQR): IQR = 75th percen le – 25th percen le. On the basis of the informa on in Figure 5.16, IQR = 66 – 63 = 3. This is the height of the shaded box; 50% of the scores lie within this box.

4. Calculate the locations of the inner fences. On the basis of the information in Figure 5.16:

• Upper inner fence = median + 1.5 × IQR = 64.5 + 1.5 × 3 = 64.5 + 4.5 = 69.
• Lower inner fence = median – 1.5 × IQR = 64.5 – 1.5 × 3 = 64.5 – 4.5 = 60.
• If scores are normally distributed, about 95% of the scores lie between the inner fences.

5. Use these values (60 and 69) to draw the ends of the “whiskers” or T-shaped bars, as shown in Figure 5.17.

An additional set of boundaries used to make judgments about outliers, the outer fences, does not appear in SPSS boxplots. The outer fences are usually defined as the median + 3 × IQR and median – 3 × IQR. Often, about 99% of the scores lie within the outer fences.

The inner and outer fences in boxplots can be used to identify outliers and extreme outliers. Sort the scores in the female height data set from lowest to highest. If any of the lowest scores fall below the lower inner fence, indicate them as outliers on the boxplot (using open circles). If any of the highest scores fall above the upper inner fence, mark them as outliers (also using open circles). Then check whether any of these can be called extreme outliers. Any score that lies more than 3 × IQR away from the median is marked as an extreme outlier (using an asterisk). In the graph in Figure 5.17, I added a number associated with each outlier. Note that these numbers are not the score values; they are the lines in the SPSS data file where these outliers are found. To find the score values on rows 13, 84, and 120, look at the height values on these lines of the femaleheight.sav data file.

We can summarize information for the boxplot in Figure 5.17 as follows: For female height in inches, the median was 64.5 in.; 50% of heights were between 63 and 66 in. There were two high-end outliers (both were scores of 70 in.) and three low-end outliers (two scores of 59 in. and one score of 58 in.). None of the outliers were judged to be extreme.

5.10.2 How to Obtain a Boxplot Using SPSS

SPSS (see Note 1 at the end of the chapter) will be used to set up a boxplot for a different data set, a set of hypothetical BMI scores for 200 men and 200 women in the file bmi.sav. BMI scores were truncated (that is, all values to the right of the decimal points were dropped). To locate the SPSS boxplot procedure, go to the top-level menu heading for <Graphs> and make these menu selections: <Graphs> → <Legacy Dialogs> → <Boxplot>, as shown in Figure 5.18. In most real-life situations, researchers want to compare boxplots for the same variable for two or more groups, as in the following example. BMI is an index of body weight corrected for height. Using data in the file bmi.sav, we will examine BMI scores separately for men and women.
In the first Boxplot dialog box (Figure 5.19), highlight the box for “Simple” boxplot and select the radio button for “Summaries for groups of cases.” In the Define Simple Boxplot: Summaries for Groups of Cases dialog box (in Figure 5.20), the name of the variable for the plot (heightinches) is moved into the variables list. The resulting boxplot graph appears in Figure 5.21. On the basis of output from the SPSS frequencies procedure (not shown here), the median BMI was 23 for men and 22 for women. There were numerous outliers for both groups, mostly higher BMI scores. You need to know that when two scores have the same value, SPSS draws just one circle. Each circle indicates the row number in the SPSS data file where the outlier score is located. You can determine the number of scores identified as outliers by counting these numbers. To find the number of nonextreme outliers, count the case numbers for the open circles and ignore the case numbers for the asterisks. An outlier that is not extreme is denoted using an open circle, while outliers labeled as extreme appear as asterisks.
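The fence arithmetic from Section 5.10.1 can also be checked with a few lines of code. The following is a minimal Python sketch, not part of the SPSS workflow. It assumes the median-based fence rule given in this chapter and uses the female height values reported from Figure 5.16. The function names `fences` and `classify` are hypothetical, and other programs may place fences differently (see Note 1).

```python
# Percentiles for female height, as reported in the text (Figure 5.16).
Q1, MEDIAN, Q3 = 63.0, 64.5, 66.0


def fences(q1, median, q3):
    """Return the IQR plus inner and outer fences (median-based rule from the chapter)."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        "inner": (median - 1.5 * iqr, median + 1.5 * iqr),
        "outer": (median - 3.0 * iqr, median + 3.0 * iqr),
    }


def classify(score, f):
    """Label a score as 'extreme', 'outlier', or 'typical' relative to the fences."""
    low_outer, high_outer = f["outer"]
    low_inner, high_inner = f["inner"]
    if score < low_outer or score > high_outer:
        return "extreme"   # would be drawn as an asterisk
    if score < low_inner or score > high_inner:
        return "outlier"   # would be drawn as an open circle
    return "typical"


f = fences(Q1, MEDIAN, Q3)
print(f["iqr"], f["inner"], f["outer"])

# Heights mentioned in the text: the low-end and high-end outliers, plus one typical score.
for height in (58, 59, 64, 70):
    print(height, classify(height, f))
```

With these inputs the inner fences come out at 60 and 69, so heights of 58, 59, and 70 in. are flagged as outliers but not as extreme outliers, which agrees with the by-hand summary.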

Figure 5.18 Menu Selections to Access Boxplot Dialog Box

Figure 5.19 Initial Boxplot Dialog Box: Select “Simple” and “Summaries for groups of cases”

Figure 5.20 Define Simple Boxplot: Compare BMI Scores for Male Versus Female Groups

Figure 5.21 Boxplot Set Up Using SPSS: Separate Boxplots for Male and Female Groups on the Basis of Descriptive Statistics

The boxplot for men shows only two low-end outliers (on rows 20 and 23); these are not extreme. If you look at the data file bmi.sav, you will find that the scores on rows 20 and 23 both correspond to BMI scores of 16. There were numerous high-end outliers. The boxplot for men shows nine scores on rows 102, 111, 120, 124, 131, 167, 189, 190, and 199 that were labeled outliers (but not as extreme outliers). Three scores (on rows 126, 157, and 197) were identified as extremely high outliers. The case on row 3 lies in between these groups of scores. You can determine whether the in-between BMI score on row 3 is an extreme outlier by comparing the BMI value in the data file on row 3 (which is 34) with the BMI values for the two neighboring values. The BMI score on row 157 is also 34, and case number 157 is not tagged as an extreme outlier in this boxplot; therefore row 3 would also not be identified as an extreme outlier.

We can report results for the male BMI boxplot as follows. Values obtained from the SPSS frequencies procedure, not shown here, are used to identify the exact values for the 25th, 50th, and 75th percentiles and the minimum and maximum scores. For men, median BMI was 23; 50% of male BMI scores were between 22 and 25. There were 2 low-end outliers for male BMI; neither was extreme. There were 13 high-end outliers; 10 were not extreme and 3 were extreme outliers. For men, minimum BMI was 16 and maximum BMI was 41.

Median BMI for women was 22; 50% of female BMI scores were between 20 and 23. The female group had no low-end outliers for BMI. There were five nonextreme high-end outliers (rows 204, 302, 318, 353, and 398). There were also two extreme high-end outliers; these BMI scores appear on rows 290 and 374. Minimum BMI for women was 17 and maximum BMI was 33.

If we compare men and women, it appears that men tend to have higher BMIs than women. It would also be useful to examine the frequency distributions for BMI using suggested clinical cutoffs to evaluate the percentage of persons whose BMIs were within the range 18.5 to 24.9, which is considered healthy. Histograms would help to evaluate distribution shapes.

5.11 TELLING STORIES ABOUT DISTRIBUTIONS

After you examine graphs such as histograms or boxplots, you should be able to tell an honest and reasonably complete story about the pattern you see. Imagine this game: Your task is to get a person who has not seen the histogram or other graph to draw a picture of the graph, based only on verbal information you provide. You win the game if you and your partner can do this more quickly and accurately than other teams. Ready? Go!

If you have a roughly normal or bell-shaped distribution, you can communicate this to your partner very quickly with three pieces of information (normal, M, SD). That should be sufficient for your partner to sketch a graph. If the distribution appears somewhat normal but with some variations, such as positive skewness or outliers (see Table 5.1 for examples), you need to add that information (for example, three outliers at the high end).

On the other hand, if your distribution does not resemble a bell-shaped curve (see Table 5.2), you need different stories or pieces of information. It may be sufficient to say “reverse J-shaped” or bimodal or uniform. However, you will need to give your partner more information (for example, the maximum score was 10). Distributions that have one or more modes and non-normal shapes require more information. Where was each mode located? Were some modes higher than others? Think about what your results mean.

How can the hypothetical results in Figure 5.22 be described? Opinion is highly polarized; that is, people are at either the negative or positive extreme in this hypothetical example. There are two modes (52% of people strongly disagree and 30% strongly agree). Very few people chose intermediate levels of agreement. Most people strongly disagree, but the number of people who strongly agree is a substantial minority. Use of a mean or median (a value somewhere around 2.5) to describe central tendency would be misleading in this situation; 2.5 is near the neutral point, but very few people chose ratings near neutral. A concise way to communicate this would be: “Fifty-two percent strongly disagreed with this statement, 30% strongly agreed, and very small percentages of people chose intermediate levels of agreement. Opinion was strongly polarized.” If the author of a research report makes a blanket statement that all variables had approximately normal distributions, or allows readers to assume that all distributions were normal, and then tells readers that the mean degree of agreement with this statement was 2.5, this information by itself provides a misleading description of the results.
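To see why the mean is a poor summary of a polarized distribution, it can help to compute it. The sketch below is illustrative only: the 52% and 30% figures come from the text, but the 6% values for ratings 2 through 4 are assumptions, because the exact percentages in Figure 5.22 are not given here.

```python
from statistics import mean

# Percentage of respondents at each rating (1 = strongly disagree, 5 = strongly agree).
# 52 and 30 come from the text; the three 6% entries are assumed for illustration.
percent_by_rating = {1: 52, 2: 6, 3: 6, 4: 6, 5: 30}

# Expand the percentages into 100 hypothetical individual ratings.
ratings = [rating for rating, pct in percent_by_rating.items() for _ in range(pct)]

m = mean(ratings)

# Percentage of respondents whose rating is within one scale point of the mean.
near_mean = sum(pct for rating, pct in percent_by_rating.items() if abs(rating - m) < 1)

print(f"mean rating = {m:.2f}")
print(f"percent within 1 point of the mean = {near_mean}%")
```

Under these assumed percentages the mean falls near the middle of the scale, yet only a small share of respondents gave a rating anywhere near it. Reporting the two modes describes these data far better than reporting the mean.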

5.12 USES OF GRAPHS IN ACTUAL RESEARCH

1. Data screening: Identify potential errors or problems with data (such as recording errors, implausible scores, and missing values). Researchers need to report the number of scores that are problematic and indicate what they did to correct these problems. For beginning students, it may be sufficient to report the percentage of missing scores and the number of outliers and extreme outliers for each variable. I suggest that beginning students run analyses with outliers included and with outliers excluded; if results are substantially the same, report one of these analyses and add a footnote to indicate that the other analysis yielded similar results. For both beginning and advanced students, keep a record of any problems you detect in data, and anything that you do to deal with the problems. Discussion of better ways to handle outliers and missing values is provided in Volume II (Warner, 2020).

2. Evaluation of whether assumptions for analyses are violated: When you learn statistical techniques such as t tests, ANOVA, and regression, you will see that each analysis is based on some assumptions. Some analyses work fairly well, under certain circumstances, when their assumptions are violated; others do not. There is a widespread, but not exactly accurate, belief that scores in samples need to be normally distributed to satisfy the assumptions for many common analyses. I think it would be more accurate to say that, in practice, some kinds of departure from normality in the sample (such as the presence of extreme outliers, or reverse J-shaped or polarized distributions) create problems in many common analyses. The ways that violations of assumptions and rules can lead to incorrect conclusions are discussed in later chapters about significance tests.

3. Report information needed to characterize and describe your sample: For categorical variables, this is often in sentence form, for example, “The sample consisted of 100 male and 150 female university students, with a mean age of 19.1 years.”
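The suggestion in item 1 (run the analysis with outliers included and again with outliers excluded, and keep a record of what was done) can be sketched as follows. This is a hypothetical Python illustration: the scores, the flagged position, and the `report` dictionary are made up for the example, and in practice the flagged cases would come from your own boxplot screening.

```python
from statistics import mean, stdev

# Hypothetical height scores; the last value plays the role of a high-end outlier.
scores = [62, 63, 63, 64, 64, 65, 65, 66, 66, 67, 78]

# Zero-based positions flagged as outliers during screening (assumed for this example).
outlier_positions = {10}

kept = [x for i, x in enumerate(scores) if i not in outlier_positions]

# Record both versions of the descriptive statistics and how many cases were dropped.
report = {
    "with_outliers": {"n": len(scores), "M": mean(scores), "SD": stdev(scores)},
    "without_outliers": {"n": len(kept), "M": mean(kept), "SD": stdev(kept)},
    "n_removed": len(scores) - len(kept),
}

for label in ("with_outliers", "without_outliers"):
    stats = report[label]
    print(f"{label}: n = {stats['n']}, M = {stats['M']:.2f}, SD = {stats['SD']:.2f}")
print("cases removed:", report["n_removed"])
```

Keeping both sets of results side by side makes it easy to judge whether they are substantially the same and to document the decision in a footnote.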

Here are some of the stories (or descriptions) about distributions that might appear in a research report.

a) You might say, “The histogram appears approximately normal with no extreme outliers.” You can state this in the “Data Screening” section of your research report. (For some statistics you will need to check additional assumptions.)

b) You might need to say, “The histogram appears approximately normal except for a specific number of outliers.” In this situation you face the “what to do with outliers” problem. Ideally, you decide what to do with outliers prior to data collection. You need to document the number of outliers and what you decided to do with them (such as drop from analysis, recode into different values, or leave them in). Do not experiment with different ways of handling outliers until you find results you like; this is p-hacking.

c) You might need to say, “The distribution is very skewed, and skewness cannot be corrected by modifying or removing a few outliers.” A log or other nonlinear transformation (analyzing log(X) instead of X) may be applied only if it is conventional in your field, only if values differ by orders of magnitude, and only if it was planned ahead.

d) In some situations that involve outliers, nonparametric analysis may be preferable. When scores are converted to ranks, extreme outliers and skewness are not problems. (Newer robust techniques, not covered in this book, may be better choices; Field, 2018.)

e) If the distribution looks nothing like a normal distribution (e.g., uniform, J-shaped, U-shaped, mode at zero), proceed with caution. Entirely different analyses than the ones in this book may be required.
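Items c) and d) mention two ways of re-expressing scores: a log transformation and conversion to ranks. The sketch below shows what each does to a small set of made-up values. It is an illustration under stated assumptions, not a recommendation to transform data: the income and visit values are hypothetical, and `average_ranks` is a helper written for this example (ties receive the mean of their ranks).

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


# Hypothetical incomes that differ by orders of magnitude.
incomes = [1_000, 10_000, 100_000, 1_000_000]
log_incomes = [math.log10(x) for x in incomes]
print(log_incomes)

# Hypothetical numbers of doctor visits with one very high score.
visits = [3, 0, 40, 1, 2, 1]
print(average_ranks(visits))
```

After the log transformation the incomes are evenly spaced, and after ranking the score of 40 is only one rank above the next highest score, so it no longer dominates the results.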

5.13 DATA SCREENING: SEPARATE BAR CHARTS OR HISTOGRAMS FOR GROUPS

When the independent-samples t test is introduced you will see that it’s useful to create graphs such as histograms and boxplots separately for each of the groups. If a t test compares mean height for a male group versus mean height for a female group, we should examine the distribution of heights separately within each group, as in the following example. The data set malefemaleht.sav contains hypothetical heights in inches for 120 women (not the same sample as in femaleht.sav) and 120 men.

Figure 5.23 SPSS Dialog Box to Obtain Boxplots for Separate Groups

To obtain separate boxplots for male and female groups, make the same menu selections as for the previous boxplot example: <Graphs> → <Legacy Dialogs> → <Boxplot>. In the Boxplot dialog box (the one that appeared previously in Figure 5.19), select “Simple” and choose the radio button for “Summaries for groups of cases.” In the next dialog box, shown in Figure 5.23, enter the name of the quantitative variable (heightinch) in the space for “Variable,” and enter the name of the categorical variable Sex in the space for “Category Axis,” then click OK. See the boxplots in Figure 5.24.

Several kinds of information can be obtained through visual examination. Median height for the female group is lower than median height for the male group. The female group has one low-end outlier for height (in row 1 of the file). The male group has one high-end outlier height (in row 240) and one low-end outlier (in row 121).

To obtain separate histograms for male and female groups, the <Data> → <Split File> command is used. (Note that you do not select <Split into Files>.) In the Split File dialog box in Figure 5.25, click the radio button for “Organize output by groups,” move the categorical variable (Sex) into the “Groups Based on” pane, then click OK. All subsequent analyses will be done separately for men and women. (Note that you must turn this command off to go back to using the full data set. To do that, make the same menu selections, <Data> → <Split File>, and choose the radio button for “Analyze all cases, do not create groups.”)

The separate histograms for female and male height scores appear in Figures 5.26 and 5.27. Height scores are not perfectly normally distributed within either group; they could be described as approximately normal. Among the three outliers identified in the boxplots, the only one that stands out clearly in the histograms is the male height of 78 inches (6 ft 6 in.), or about 198 cm. This is unusually tall, but the number is not so large that you would think it impossible.

Figure 5.24 Separate Boxplots for Height for Female and Male Groups (Data From malefemaleht.sav)

Figure 5.25 Command to Organize Output by Groups

Note that if you converted all heights from inches to centimeters, the appearance of the boxplots and histograms would not change; however, the numerical values of heights and the descriptive statistics would change.
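The point about units can be checked numerically. The following Python sketch uses a small set of hypothetical heights (not the malefemaleht.sav data) and the standard conversion of 2.54 cm per inch. It shows that the mean and SD change with the unit of measurement, while each score's position relative to the rest of the distribution, expressed here as a z score, does not.

```python
from statistics import mean, stdev

CM_PER_INCH = 2.54

# Hypothetical heights in inches, for illustration only.
heights_in = [58, 62, 63, 64, 64, 65, 66, 67, 70]
heights_cm = [h * CM_PER_INCH for h in heights_in]


def z_scores(xs):
    """Distance of each score from the mean, in SD units."""
    m, sd = mean(xs), stdev(xs)
    return [(x - m) / sd for x in xs]


z_in = z_scores(heights_in)
z_cm = z_scores(heights_cm)

print(f"inches:      M = {mean(heights_in):.2f}, SD = {stdev(heights_in):.2f}")
print(f"centimeters: M = {mean(heights_cm):.2f}, SD = {stdev(heights_cm):.2f}")

# The relative positions agree up to floating-point rounding.
print(all(abs(a - b) < 1e-9 for a, b in zip(z_in, z_cm)))

# The tall male height mentioned in the text: 78 inches in centimeters.
print(round(78 * CM_PER_INCH))
```

Because a change of units multiplies every score by the same constant, the shape of the histogram and the layout of the boxplot stay the same; only the axis labels and descriptive statistics are rescaled.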

Figure 5.26 Histogram for Hypothetical Female Height Scores in malefemaleht.sav

Figure 5.27 Histogram for Hypothetical Male Height Scores in malefemaleht.sav

5.14 USE OF BAR CHARTS TO REPRESENT GROUP MEANS

In this chapter, the heights of bars in bar charts represent the number (or proportion or percentage) of cases in each group. When later topics (such as the independent-samples t test) are introduced, bar charts have another use. Suppose that you want to compare mean height for two groups: female and male. You can set up a bar chart in which the height of each bar represents the mean height for each group, as in Figure 5.28.

Figure 5.28 Bar Chart to Represent Mean Heights (in Inches) for Female Versus Male Groups

One difference you may notice is that, in this chart, the Y axis begins at 60 (instead of 0, which was the recommended value for the Y axis origin when bar charts were used for frequencies). Here’s why. For group frequencies, 0 cases per group is a possible value. For means of variables such as adult height, 0 is not a possible value of height. It makes sense to choose a value of Y that is below the minimum height in the sample, but higher than 0, for a bar chart in which bars represent means. If you read research reports, you are more likely to encounter bar charts that represent group means than bar charts for group sizes or frequencies. You will learn more about setup and interpretation of this type of bar chart in chapters about the independent-samples t test and ANOVA.

5.15 OTHER EXAMPLES

5.15.1 Scatterplots

In some studies, researchers want to evaluate whether scores on one quantitative variable (extraversion) are related to scores on another quantitative variable (physical energy). A preliminary graph called a scatterplot is used to examine the relationship between variables prior to doing statistical analyses such as correlation or regression. An example of a scatterplot appears in Figure 5.29. In this hypothetical study, each person provided self-report scores for extraversion (rated on a scale from 1, not at all extraverted, to 5, highly extraverted) and for energy (1 = very low energy, 6 = very high energy). Each data point in the scatterplot represents the combination of scores on extraversion (on the X axis) and energy (on the Y axis) for one case. For example, the case marked with a circle in Figure 5.29 represents a person with an extraversion score of 4 and an energy score of 3.

The three ellipses in Figure 5.29 identify areas of the graph that can be compared. On the left, an ellipse encloses energy scores for people whose scores on extraversion were low (below 2). On the right, an ellipse encloses the energy scores for persons whose extraversion ratings were high (above 4). You can see that for the people with low scores on extraversion, energy scores also tended to be low. For persons with high scores for extraversion, energy scores tended to be high. People with moderate scores on one variable also had moderate scores on the other variable. This is an example of a positive linear relationship. In a later chapter this kind of relationship between two quantitative variables will be assessed using Pearson correlation.
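As a preview of how a positive linear relationship is quantified, the sketch below computes Pearson's r from its definition for a few made-up (extraversion, energy) pairs. The data are hypothetical and are not the scores plotted in Figure 5.29, although they include the case with extraversion 4 and energy 3 mentioned above. The helper name `pearson_r` is also an assumption of this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: sum of cross-products of deviations, scaled by both sums of squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical self-report scores: extraversion (1 to 5) and energy (1 to 6).
extraversion = [1, 1, 2, 3, 3, 4, 5, 5]
energy = [1, 2, 2, 3, 4, 3, 5, 6]

r = pearson_r(extraversion, energy)
print(f"r = {r:.2f}")
```

A value of r close to +1 corresponds to the pattern described above: low scores on one variable tend to go with low scores on the other, and high scores with high scores.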

Figure 5.29 Scatterplot of Physical Energy Scores (Y Axis) with Extraversion Scores (X Axis)

Figure 5.30 Prevalence of Self-Reported Obesity Among U.S. Adults by State in 2017

5.15.2 Maps

Maps are useful formats for some kinds of graphs. For example, the Centers for Disease Control and Prevention has produced graphics in the form of maps to show the spread of obesity in the United States over time. A PowerPoint presentation that shows a series of maps from 1985 to 2010 appears at https://www.cdc.gov/obesity/downloads/obesity_trends_2010.ppt. Figure 5.30 shows a more recent graph for prevalence of obesity in the United States in 2017. States shaded darker gray have higher percentages of obesity. (The corresponding map online at https://www.cdc.gov/obesity/data/prevalence-maps.html is keyed in color.) At a glance you can see several features of the data. High rates of obesity occurred in the deep south, Iowa, and West Virginia. Colorado, Hawaii, and the District of Columbia had low rates. U.S. residents can see how obesity rates in their states compare with those of other states.

5.15.3 Historical Example

Most people think of Florence Nightingale as a pioneer of nursing; her work also had an enormous impact on medicine and hospital design (Lienhard, 2002). During the Crimean War, she sent reports to Britain about the number of soldiers who died each month and their causes of death. She used polar diagrams (this is not currently a popular form of graph) to communicate this information. Figure 5.31 is adapted from part of her graphics (Nightingale, 1858). Her major finding was that far more soldiers were dying from preventable diseases (sometimes acquired in the military hospitals) than from wounds. Up until the 19th century, this was true in many wars. The point she wanted to make was that far more sanitary conditions and better nutrition were needed to keep the army (and civilian populations) healthy. This was not something the War Department wanted to hear. Nevertheless, she persisted.

Figure 5.31 Florence Nightingale’s Graph: Number of British Soldiers Who Died in the Crimean War During Each Month Divided Into Three Causes of Death

5.16 SUMMARY

Why are both data analysts and mathematical statisticians so interested in the bell-shaped curve or normal distribution? Here are some of the reasons.

1. Scores for many (but not all) variables tend to be normally distributed.

2. When a variable is approximately normally distributed, we can summarize information about the distribution of scores using just three pieces of information:
a. That the distribution shape is normal
b. The value of the sample mean M
c. The value of the sample standard deviation SD

For example, IQ scores are normally distributed with M = 100 and SD = 15. We can write that as N(100, 15), where N means “normally distributed,” and the values of M and SD are in parentheses. (Note that in this context, N is a description of the shape of a distribution, not sample size.)

3. When scores are normally distributed, it is easy to evaluate the location of an individual score (or an individual event, such as the value of M obtained in one study) relative to the overall distribution of scores. This is discussed further in Chapter 6.

4. Scores on individual variables are not the only values that tend to be normally distributed. You will see that if many values of M are obtained from different random samples from the same population, values of M also tend to be normally distributed. This makes it possible for us to set up confidence intervals for M and to conduct statistical significance tests.
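Points 2 and 3 can be illustrated with the N(100, 15) example. The sketch below uses `NormalDist` from Python's standard `statistics` module to locate an individual score within that distribution. The score of 130 and the helper name `z_score` are assumptions chosen for illustration; Chapter 6 covers these ideas in detail.

```python
from statistics import NormalDist

# N(100, 15): the IQ example from the text (M = 100, SD = 15).
iq = NormalDist(mu=100, sigma=15)


def z_score(x, dist):
    """Distance of a score from the mean, in SD units."""
    return (x - dist.mean) / dist.stdev


# Location of a hypothetical individual score of 130.
z = z_score(130, iq)
below = iq.cdf(130)  # proportion of the distribution below 130

# Proportion of scores between M - 1 SD and M + 1 SD.
within_1sd = iq.cdf(115) - iq.cdf(85)

# 25th and 75th percentiles of the same normal distribution.
q1, q3 = iq.inv_cdf(0.25), iq.inv_cdf(0.75)

print(f"z for a score of 130 = {z:.1f}")
print(f"proportion below 130 = {below:.3f}")
print(f"proportion within 1 SD of M = {within_1sd:.3f}")
print(f"25th and 75th percentiles = {q1:.1f}, {q3:.1f}")
```

In a normal distribution the 25th and 75th percentiles fall inside the interval from M – 1 SD to M + 1 SD, which matches the earlier remark that the box in a boxplot covers a range a little narrower than that interval.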

Here are questions to ask when you look at a histogram:

• Is the distribution bell shaped and symmetrical? If yes: Good. You can report that the distribution was approximately normal.

• Does the distribution have outliers at the lower and/or upper ends? If yes: You need to report the number of outliers at the upper and lower ends and whether any of these were extreme outliers.

• Is the distribution skewed (asymmetrical)? If yes: You should mention this in your description of data. In Chapter 6, you will learn how to assess amount of skewness.

• Is the distribution bimodal or multimodal? If yes: Consider whether the distribution may consist of two or more overlapping distributions for different groups of people, such as men versus women.

• Is the distribution U-shaped or polarized (as in Figure 5.22)? If yes: Treat people who gave ratings of 1, ratings of 2, and so forth, as members of five separate groups.

• Is the distribution uniform? If the variable is quantitative, ask, Are the scores ranks? It is unlikely that you will see a uniform distribution for quantitative variables unless they are ranks.

You can describe distribution shape by thinking about the answers to these questions. Some of these descriptions are not mutually exclusive. For example, a positively skewed distribution may also have high-end outliers, and it may have a large mode at zero.

In a typical research report, authors would like to be able to say something like this at the beginning of the “Results” section: “All quantitative variables were approximately normally distributed with no extreme outliers.” Real data often do not behave so nicely, of course. An author might have to say something more like this: “Number of doctor visits had a reverse J-shaped distribution with five high-end outliers.”

COMPREHENSION QUESTIONS

1. In the bar graphs in most of this chapter (except those in Section 5.14), the height of the Y axis provides what information?

2. Suppose you generate a bar graph using SPSS. You also have a frequency table for the same data. What information from the frequency table might you add to the bar graph to make the information in the bar graph more precise?

3. What is a common practice that can make a bar graph deceptive? Can you think of at least one other way bar graphs can be made deceptive?

4. What can you see in a histogram of quantitative scores that is less easy to see in a frequency table?

5. Consider the histogram in Figure 5.32.
a) What were the minimum and maximum number of servings of fruits and vegetables people said they ate per day? What was the range?
b) What was the modal amount of fruit and vegetable consumption?
c) Diet experts often recommend at least five servings of fruits and vegetables per day. How well are the people in this sample doing at meeting that standard?
d) What percentage of persons reported eating one serving per day? This is a frustrating question to answer, given this bar chart. If you had access to these data, what other SPSS output would you want to see to answer this question precisely?

6. Briefly describe, in your own words, three things you look for to decide whether a histogram looks like a “reasonably normal” distribution.

7. Describe the shape of each of the histograms in Table 5.3. Sometimes more than one term can be applied; for example, skewed distributions may also have outliers.

8. What type of plot appears in Figure 5.33? What do the values on the Y axis correspond to? (Score values? Frequencies?) What information can you report from this plot? There are omissions in labeling. What labels could be added to this chart?

Figure 5.32 Results From Warner, Frye, Morrell, and Carey (2017): Number of Servings of Fruits and Vegetables Eaten on a Typical Day, N = 1,250

Table 5.3 Examples of Histograms

Figure 5.33 Figure for Comprehension Question 8: What Is It?

NOTE

1. Many different rules have been proposed to determine the locations of inner fences in boxplots. See Frigge, Hoaglin, and Iglewicz (1989) and McGill, Tukey, and Larsen (1978) for further discussion of different methods. Boxplot results can differ across computer programs, or between by-hand and SPSS-generated boxplots, if different choices are made among different rules.

DIGITAL RESOURCES

Find free study tools to support your learning, including eFlashcards, data sets, and web resources, on the accompanying website at edge.sagepub.com/warner3e.

Descriptions of Images and Figures

The image is a frequency table that shows hypothetical marital status scores. There are five columns: valid count, frequency, percent, valid percent and cumulative percent. Details are as below:

• valid count, frequency, percent, valid percent, cumulative percent
• never married, 20, 47.6, 47.6, 47.6
• engaged, 4, 9.5, 9.5, 57.1
• married, 11, 26.2, 26.2, 83.3
• divorced, 4, 9.5, 9.5, 92.9
• widowed, 3, 7.1, 7.1, 100
• Total, 42, 100, 100

There are two boxes, and the one on the right has a variable titled maritalstatus. Below is a selected check box named display frequency tables. At the bottom are option buttons for the following: OK, Paste, Reset, Cancel and Help. On the right are the radio buttons Statistics, charts, format and help. The Charts option has been depressed. The frequencies charts dialog box has four chart type check options: none, bar charts, pie charts and histograms. The Pie charts option has been checked. The chart values tab has two choices, frequencies and percentages. Frequencies has been selected. At the bottom are the option buttons Continue, Cancel and Help.

There are five options: never married, married, divorced, engaged and widowed. The largest is the never married pie, followed by married, engaged, divorced and widowed.

There are two boxes, and the one on the left has a variable titled marital. Below is a check box named display frequency tables. At the bottom are option buttons for the following: OK, Paste, Reset, Cancel and Help. On the right are the radio buttons Statistics, charts, format and help. The Charts option has been depressed. The frequencies charts dialog box has four chart type check options: none, bar charts, pie charts and histograms. The bar charts option has been checked. The chart values tab has two choices, frequencies and percentages. Frequencies has been selected. At the bottom are the option buttons Continue, Cancel and Help.

The X axis denotes the marital status of never married, engaged, married, divorced and widowed. The Y axis denotes the frequencies. The details are as follows:

• never married: 20
• engaged: 4
• married: 11
• divorced: 4
• widowed: 3

The X axis denotes the year (2009 and 2019), and the Y axis denotes the number of new houses built and ranges from 0 to 10,000. Instead of bars to signify the frequency, the image has a drawing of a house. The heights of the cartoon houses correspond to the different frequencies.

At the top of the spreadsheet, titled femaleheight.sav, are the following menu buttons: file, edit, view, data, transform, analyze, graphs, utilities, extensions, window and help. Below these buttons are icon buttons to open a file, save, print, and other table editing options. The Transform menu button, on being clicked, results in a drop-down menu with the following options: compute variable, programmability transformation, count values within cases, shift values, recode into same variables, recode into different variables, automatic recode, create dummy variables, visual binning, rank cases, date and time wizard, create time series, replace missing values, random number generators and run pending transforms. The compute variable button has been depressed, leading to a dialog box to compute variables. At the top left, there is a box titled Target variable, where heightcm has been filled in the field. Below this is a Type and label button, which has two entries: heightinch and heightcm. Heightinch has been selected. On the right, a numeric expression field has the entry 2.54 * heightinch. A keypad with standard numbers and symbols is below this. On the right is a Function group section with the following entries: all, arithmetic, CDF and noncentral CDF, conversion, current date or time, date arithmetic, and date creation.

Below this is an empty box titled Functions and special variables. An IF statement box has the statement Optional case selection condition. At the bottom of the dialog box are option buttons for the following: OK, Paste, Reset, Cancel and Help.

The details of the statistics figures are mentioned below (each entry gives the value for height in inches, then for height in centimeters):

• N Valid: 120, 120
• N Missing: 0, 0
• Mean: 64.48, 163.7665
• Median: 64.5, 163.8300
• Mode: 64, 162.56
• Std. Deviation: 2.463, 6.25614
• Variance: 6.067, 39.139
• Minimum: 58, 147.32
• Maximum: 70, 177.80
• Percentiles:
  o 25: 63, 160.02
  o 50: 64.5, 163.83
  o 75: 66, 167.64

In the first diagram, the X axis denotes the height in inches, which ranges from 58 to 72, rising in increments of 2. The Y axis denotes the frequency and ranges from 0 to 20, rising in increments of 5. The SD has been specified as 2.5 on either side of the mean. A curve drawn through the bars of the histogram is approximately bell shaped. The second diagram’s X axis denotes the height in centimeters and ranges from 140 to 180, rising in increments of 10. The Y axis denotes the frequency and ranges from 0 to 20, rising in increments of 5. The SD has been specified as 6.25 on either side of the mean. A curve drawn through the bars of the histogram is approximately bell shaped.

The X axis ranges from 55 to 145, rising in increments of 15. The mean is the highest point of the curve, at 100, and the curve is symmetrical on both sides. There are three arrows on either side of the mean.

The X axis ranges from 10 to 40, rising in increments of 10, and the Y axis has just one number, 1250. The histogram is a single large bar stretching from 15 to 40 along the X axis and reaching up to the single value 1250 on the Y axis.

The X axis denotes the BMI and ranges from 10 to 40, rising in increments of 10. The Y axis denotes the frequency and ranges from 0 to 60, rising in increments of 10. There are several bars, signifying many bins, and the distribution has several spikes in the center, with the right end becoming almost flat. The curve drawn through the bars resembles a normal distribution that is positively skewed.

The X axis denotes the BMI and ranges from 10 to 40, rising in increments of 10. The Y axis denotes the frequency and ranges from 0 to 200, rising in increments of 50. There are several bars, signifying the bins, and the distribution spikes in the center, with the right end becoming almost flat. The curve drawn through the bars is approximately normal except for a few outliers at the high end of the distribution. The bars for 18.5 and 24.9 have been specifically marked out.

The X axis denotes the household income in dollars and ranges from 5,000 to 200,000. The income levels of 50,000, 85,000 and 135,000 dollars have been shown on the X axis. The Y axis denotes the percentage of households and ranges from 0 to 6 percent. The image has several bars towards the left, and the tail is extremely long and thin on the right.

The image is a set of statistical figures for female height.

• N: Valid – 120
• N: Missing – 0
• Median – 64.5
• Percentiles:
  o 25 – 63
  o 50 – 64.5
  o 75 – 66

The Y axis consists of the height in inches and ranges from 58 to 70, rising in increments of 2. There is a box in the center of the graph, with the median marked at its center at 64.5. The hinges are 63 and 66. The inner fence boundaries are at 60 and 69. At the top there are two scores of 70 on lines 12 and 102. Similarly, there are two scores of 59 on lines 13 and 84 and one score of 58 on line 120 at the bottom.

At the top of the spreadsheet are the following menu buttons: file, edit, view, data, transform, analyze, graphs, utilities, extensions, window and help. Below these buttons are icon buttons to open a file, save, print, go back and forward, and other table editing options. The graphs menu option has been clicked and a drop-down menu shows the following: chart builder, graphboard template chooser, Weibull plot, compare subgroups, regression variable plots, and legacy dialogs. Legacy dialogs has been depressed, leading to the next group of menu options: bar, 3-D bar, line, area, pie, high-low, boxplot, error bar, population pyramid, scatter or dot and histogram. The spreadsheet has five columns and 15 rows filled with numerical data.

There are two types of boxplot choices available: simple and clustered. The data in the chart can be of two types: summaries for groups of cases and summaries of separate variables. The box for “Simple” boxplot and the radio button for “Summaries for groups of cases” have been selected. At the bottom of the dialog box are buttons for the following: Define, Cancel and Help.

On the extreme left there are three variables: weight, height and dollar casenum less than 201. On the right, the variable BMI has been indicated separately. Below, the category axis has been defined as Sex. The Label cases by option has been left blank. Below this are boxes to show rows and columns. At the bottom of the dialog box are option buttons for the following: OK, Paste, Reset, Cancel and Help.

The X axis denotes the male and female groups and the Y axis denotes the BMI, which ranges from 15 to 45. The male box chart has a median of 23. There is a circle with two numbers, 20 and 23, below the box plot. There are several other outliers for the group, mainly at the higher level. There are 4 circles and three asterisks, and the numbers for the outliers are: 131, 102, 124, 120, 111, 167, 189, 190, 199, 157, 126, 197.

The female box chart has a median of 22. There are several outliers for the group, mainly at the higher level. There are 5 circles and an asterisk, and the numbers for the outliers are: 374, 290, 353, 318, 204, 302, 398.

The image is a histogram that shows the distribution of responses to the statement “The current U.S. president is doing an excellent job”. The X axis denotes the responses, which range from strongly disagree, disagree, neutral, agree, to strongly agree. The Y axis denotes the percentage of responses. There are five bars, and their heights are:

• Strongly disagree: 52 percent
• Disagree: 10 percent
• Neutral: 2 percent
• Agree: 6 percent
• Strongly agree: 30 percent

On the extreme left there is an empty box for variables. On the right, the variable Heightinch has been indicated separately. Below, the category axis has been defined as Sex. The Label cases by option has been left blank. An Options button is present on the far right. Below this are boxes to show rows and columns. At the bottom of the dialog box are option buttons for the following: OK, Paste, Reset, Cancel and Help.

There are two boxplots in the image indicating heights for female and male groups. The X axis denotes the sex, whether male or female, and the Y axis denotes the height in inches. This ranges from 55 to 80, rising in increments of 5. The female boxplot has a median of 64 and one low-end outlier. This figure lies at a lower plane than the male boxplot. The male boxplot, with median around 70, has one high-end outlier and one low-end outlier.

The image is a screenshot of the menu bar in SPSS. At the top of the spreadsheet are the following menu buttons: file, edit, view, data, transform, analyze, graphs, utilities, extensions, window and help. Below these buttons are icon buttons to open a file, save, print, go back and forward, and other table editing options. The Data button has been depressed, and the following options are visible in the drop-down menu: define variable properties, set measurement levels of unknown, copy data properties, new custom attribute, define date and time, define multiple response sets, identify duplicate cases, compare datasets, sort cases, sort variables, transpose, adjust string widths across files, merge files, restructure, rake weights, propensity score matching, case control matching, aggregate, copy dataset, and split into files. The split file dialog box has a large box for the variable, which has been filled with the variable Heightinch. On the right are checkboxes such as: analyze all cases, do not create groups; compare groups; and organize output by groups. The last has been checked. Below this is a box that shows Groups based on. This has been filled as Sex. There are two check boxes, Sort files by grouping variables and File is already sorted. The first option has been checked. A statement at the bottom states: Current Status: Analysis by groups is off. At the bottom of the dialog box are option buttons for the following: OK, Paste, Reset, Cancel and Help.

The X axis denotes the height in inches and ranges from 58 to 72, rising in increments of 2 inches. The Y axis denotes the frequency and ranges from 0 to 20. There are 13 bars that show the heights. The mean, standard deviation and number have been provided as 64.47, 2.463 and 120 respectively. A normal curve has been drawn over the bars.

The X axis denotes the height in inches and ranges from 60 to 80, rising in increments of 5 inches. The Y axis denotes the frequency and ranges from 0 to 25. There are 13 bars that show the heights. The mean, standard deviation and number have been provided as 69.24, 2.533 and 120 respectively. A normal curve has been drawn over the bars.

The X axis denotes the sex, male and female. The Y axis denotes the mean height, and ranges from 60 to 70. There are two bars, female and male. The female bar has a height of 64, while the male bar has a height of 69.

The X axis denotes the extraversion level and ranges from 1 to 5. The Y axis denotes the energy levels and ranges from 1 to 6. There are several scatter dots spread across the graph, mainly concentrated across the region above 3 on the Y axis and to the right of 2 on the X axis. One specific dot has been singled out and marked. This is the case with an extraversion score of 4 and an energy score of 3. There are three ellipses that cover three groups of scatter dots. The first encloses energy scores for people whose scores on extraversion were below 2. The second includes scatter dots where both energy and extraversion scores were average. The third covers those dots where the extraversion and energy scores were very high.

The legend below the map states the following: Very light: less than 20 percent; Light: 20 percent to 25 percent; Medium: 25 percent to 30 percent; Dark: 30 to 35 percent; Very dark: above 35 percent; and White: insufficient data. The states that fall into the categories have been mentioned below:

• Very light: less than 20 percent
  o None
• Light: 20 percent to 25 percent
  o Colorado, D.C., Hawaii
• Medium: 25 percent to 30 percent
  o Washington, Montana, Oregon, Idaho, Wyoming, California, Nevada, Utah, Arizona, New York, Vermont, New Hampshire, Connecticut, New Jersey, Massachusetts, New Mexico, Maine, Florida, Minnesota
• Dark: 30 to 35 percent
  o North Dakota, South Dakota, Nebraska, Kansas, Missouri, Wisconsin, Illinois, Indiana, Michigan, Tennessee, Ohio, Pennsylvania, Virginia, Georgia, South Carolina, North Carolina, Delaware, Texas, Alaska
• Very dark: above 35 percent
  o Rhode Island, Oklahoma, Louisiana, Arkansas, Mississippi, Alabama, West Virginia, Iowa, Kentucky
• White: insufficient data
  o None

In the diagram, each month has a separate slice of a pie. The length and width of each slice vary with the number of deaths. Each slice is subdivided by cause of death: battle, disease and other causes. The main cause of death appears to be disease. The diagram covers the months from April 1854 to March 1855. April to June had very few deaths. July was slightly higher, while August and September were higher than the previous months. There was a dip in October, but November levels are similar to those of September. Both October and November have no deaths due to battle, and the death count is due to disease or other causes. December deaths are far higher than before, mainly due to disease. January 1855 has the highest number of deaths in the graph, and as before, the majority of them are due to disease. February has higher battle deaths, but disease was the bigger killer. March sees the overall count reduced, although disease still took more soldiers than battles.

The X axis represents the number of servings of fruit and vegetables and the Y axis the percentage eaten. There are 9 bars, and their values are as follows:

• 0: 42
• 1: 12
• 2: 11
• 3: 10
• 4: 5
• 5: 2
• 6: 3
• 7: 2
• 8: 3

The Y axis ranges from 6 to 28, rising at an increment of 2. The median is at 10 and there are 3 outliers at the top and 1 at the bottom.

CHAPTER 6 THE NORMAL DISTRIBUTION AND Z SCORES

6.1 INTRODUCTION

In the previous chapter, you learned to evaluate score location by examining cumulative percentages in frequency tables. You can obtain information such as the percentage of persons who have scores below a specific value of X by examining cumulative percentages in frequency tables. You already know something about score locations in everyday life. To evaluate how tall you are, you look at other people of the same sex and ask, Are most of them taller or shorter than I am? If you see that more than half of them are shorter, you know your height is above average. If something like 90% of people are shorter than you, you know you are much taller than average. We will need a method to describe locations in distributions that can be generalized to more situations (and that does not require all the information in a frequency table). When distributions have an approximately normal shape, we can evaluate locations of specific X outcomes quickly by converting X values into a unit-free index of distance from the mean. The only information we need for that is M and SD for the distribution of X scores.

6.2 LOCATIONS OF INDIVIDUAL SCORES IN NORMAL DISTRIBUTIONS

The new method for score location introduced in this chapter involves two steps: First, we compute a z score that corresponds to the original X score. A z score, also called a standard score or a standardized score, tells us the distance of an X score (which might be in dollars, kilograms, or degrees Celsius) from the sample mean, in unit-free or standardized terms. Then, we use a table of areas for the standard normal distribution to look up the percentage of scores that fall below that z score. This method works well only if the distribution shape for scores is reasonably close to normal. To do this, we need to define normal distribution shape more precisely. A normal distribution, also called a Gaussian distribution, appears approximately bell shaped in a histogram. However, many bell-shaped curves do not correspond exactly to normal distributions. What defines a normal distribution is a fixed relationship between distance from the mean and area under the curve. This relationship is given in detail in tables of the standard normal distribution. Appendix 6A provides a brief explanation of the mathematics of the normal distribution. The area below the value of z that corresponds to an X score in a normal distribution is roughly equivalent to the cumulative percentage of scores below that X value in the frequency table.

6.3 STANDARDIZED OR Z SCORES

A z score is an index of the distance of an X score from the sample mean that has been converted into unit-free or standardized terms. Suppose that X is height; X scores can be given in different units, such as inches or centimeters. When we convert an X score into a z score, we obtain an index of distance from that mean that is not related to the original units of measurement.

6.3.1 First Step in Finding a z Score for X: The Distance of X From M

The first step toward evaluating the location of a specific score is to find the distance (or deviation) of the X score from the sample mean M in the original units of measurement, such as inches. That distance, also called a deviation from the mean, is (X – M). You have seen this term before. Deviation of individual score X from a sample mean M is:

(X – M)   (6.1)

For example, I am 62 in. tall (X = 62). Let’s assume the mean height of women in a sample is M = 64.5 in., and the standard deviation SD = 2.5. For me, (X – M) = (62 – 64.5) = –2.5. I am 2.5 in. below average height for women in the sample. The sign of (X – M) tells you whether X is below the mean (if X – M is negative) or above the mean (if X – M is positive). This is part of the information we want. However, the value of (X – M) doesn’t tell us what percentage of persons are shorter or taller than X inches.

6.3.2 Second Step: Divide the (X – M) Distance by SD to Obtain a Unit-Free or Standardized Distance of Score From the Mean

To evaluate how far an individual X score is from M, we can compute a z score (also called a standard score or standardized score):

z = (X – M)/SD   (6.2)
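As a minimal sketch, the two steps (deviation from the mean, then division by SD) can be written as a short Python function. The function name `z_score` is chosen here for illustration; the values X = 62, M = 64.5, and SD = 2.5 are the ones used in the height example.

```python
def z_score(x, mean, sd):
    """Standardized distance of x from the mean: z = (X - M) / SD."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (x - mean) / sd


# Step 1: deviation from the mean in original units (inches).
deviation = 62 - 64.5

# Step 2: divide the deviation by SD to get a unit-free distance.
z = z_score(62, 64.5, 2.5)

print(deviation, z)  # -2.5 -1.0
```

The negative sign carries the same meaning as in the text: the score lies below the mean.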

The values of M and SD differ depending on the unit of measurement (e.g., feet, centimeters, or inches). When we convert X to z, we obtain z scores that are independent of the original unit of measurement. We can say that z scores are standardized or unit free. Standardization is a very frequently used tool in the statistician’s bag of tricks. You will see this again in many future situations. As an example, consider one individual female height score (my own), given in both inches and centimeters. The example in Table 6.1 demonstrates that we end up with the same z score even if the units of measurement for X differ. The left-hand column in Table 6.1 provides all the needed information in inches, and the right-hand column gives the corresponding information in centimeters. At the bottom of each column, a z score is computed using the values of X, M, and SD. Note that you convert inches to centimeters by multiplying by 2.54. A woman whose height is 62 in. is 157.48 cm tall.

Table 6.1 Conversion of X Scores (in Inches and Centimeters) to z Score
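The comparison that Table 6.1 describes can be sketched in a few lines of Python. This is an illustration under the stated example values (X = 62 in., M = 64.5 in., SD = 2.5 in.), not a reproduction of the table; the helper `z_score` and the constant name `CM_PER_INCH` are assumptions made for the sketch.

```python
import math

CM_PER_INCH = 2.54  # conversion factor given in the text


def z_score(x, mean, sd):
    """z = (X - M) / SD."""
    return (x - mean) / sd


# Left-hand column: everything in inches.
x_in, m_in, sd_in = 62.0, 64.5, 2.5

# Right-hand column: the same three quantities converted to centimeters.
x_cm, m_cm, sd_cm = (v * CM_PER_INCH for v in (x_in, m_in, sd_in))

z_in = z_score(x_in, m_in, sd_in)
z_cm = z_score(x_cm, m_cm, sd_cm)

print(round(x_cm, 2), round(m_cm, 2), round(sd_cm, 2))  # 157.48 163.83 6.35
print(math.isclose(z_in, z_cm))                         # True
```

Because X – M and SD are both multiplied by the same factor, the factor cancels in the ratio, which is why the two z values agree up to floating-point rounding.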

The point of this example is that the value of z is the same (within rounding error) whether X height is given in inches or centimeters. I am 62 in. tall (or approximately 157.5 cm). Whether height is given in inches or centimeters, I am 1 standard deviation below the average height for women in this example. This is a demonstration (not a proof) that the value of z does not depend on original units of measurement. I suggest that you obtain a z score for your own height. For female height in inches, use M = 64.5 and SD = 2.5; for male height in inches, use M = 67.5 and SD = 2.5. To convert inches to centimeters, multiply these values by 2.54. Your z score tells you whether you are above or below average in height relative to the imaginary data in this example. (You can find estimates of male and female height for many different nations online, and use these values if you want to compare your height with national averages.)

6.4 CONVERTING Z SCORES BACK INTO X UNITS

If you know that scores are normally distributed, and you know the values of z, M, and SD, you can convert a z score back into the original X score by “reversing” the operations in Equation 6.2. First you multiply z by SD, then you add M, as in Equation 6.3:

X = (z × SD) + M   (6.3)

If I know that height is normally distributed, that my z score is –1, and that for height in inches, M = 64.5 and SD = 2.5, then I can find X; X = (–1 × 2.5) + 64.5 = 62.

6.5 UNDERSTANDING VALUES OF Z

A z score can be verbally interpreted. My height is 62 in., and relative to the values of M and SD in the previous section, this corresponds to z = –1.00. A z score of –1.00 tells me that this height is 1 standard deviation below the mean. More generally, once we have a z value, we can say, This score is z standard deviations below the mean (if z is negative) or this score is z standard deviations above the mean (if z is positive). We don’t know yet whether a distance of z = –1.00 is not very far, or very far, below the mean. Is z = –1.00 so far below the mean that when people see me, they think, wow, that’s the shortest woman I’ve ever seen? We need a way to evaluate whether the absolute value of z indicates a notably large, or small, difference from average. If scores for the variable of interest, such as height, are normally distributed, we can use graphs or tables of z scores for a standard normal distribution to find areas that lie below (or above) z. These are interpreted like cumulative percentages. If I want to compare my height with other heights in a normally distributed sample, I obtain approximately the same information about location if I look at the cumulative percentage in a frequency table or the area below z in a normal distribution. To evaluate location using cumulative percentage, I needed a lot of information (all the scores and frequencies in a frequency table). To evaluate score location using z scores, I need only three pieces of information: the information that the distribution shape is normal, with mean M and standard deviation SD. That leads to the next question: How do we evaluate whether a distribution of scores is approximately normal?

6.6 QUALITATIVE DESCRIPTION OF NORMAL DISTRIBUTION SHAPE

The term normal has a different meaning in statistics than in everyday life. In everyday life, when we call something normal, that usually means typical or common. In clinical practice in medicine and psychology, normal describes scores that do not exceed levels that are diagnostic of a disease, or that are within the range of commonly observed values. In statistics, normal is often used to refer to a bell-shaped distribution of scores; in mathematical statistics, normal refers to a specific bell-shaped curve defined by the equation in Appendix 6A. Examples of histograms that are approximately normal in shape appeared in Table 5.1 in the preceding chapter. Some, but not all, variables used in behavioral and social sciences have distributions that approximate this bell-shaped curve. Until now we have described normal distribution shape informally by saying that normal distributions have:

• A relatively smooth “bell-shaped” curve with a peak or hump in the middle and tails that taper down at both the low and high ends.
• Equal values for the mean, median, and mode (at the center of the distribution).
• A symmetrical shape. Symmetry exists if the left and right halves of the distribution are mirror images of each other. (In other words, if you folded a cutout of the distribution shape at the mean, the left and right sides would match.)
• Few extreme scores at the low and high ends of the distribution.

6.7 MORE PRECISE DESCRIPTION OF NORMAL DISTRIBUTION SHAPE

So far, we have used qualitative visual examination of histograms to evaluate whether a distribution is bell shaped. Now consider the way mathematicians define normal distribution shape. The ideal normal distribution is defined precisely by Equation 6.4 in Appendix 6A at the end of this chapter. This equation corresponds to a smooth curve with a fixed and known relationship between area under the curve and distance from the mean. Distance from the mean is given in z score units. For a standard normal distribution (a normal distribution with mean = 0 and standard deviation = 1), Figure 6.1 specifies the relationship between distance from the mean (distances are given as z scores) and area under the curve. The tails in a normal distribution are infinite (which can’t be shown clearly in this graph); and yet the total area under the curve is 1.00.

The Y axis (not shown) provides information about frequency or probability. When we use this diagram, we focus on areas under the curve (identified by the percentages, such as 34.13%, in Figure 6.1) rather than values on the Y axis. These areas are comparable with the cumulative probabilities in a histogram. The correspondence of areas (or percentages or probabilities) to z values can be summarized using the 68/95/99.7 rule. Approximately 68% of the area in Figure 6.1 lies between the values of z = –1.00 and z = +1.00. About 95% of the area lies between z = –2.00 and z = +2.00. About 99.7% of the area lies between z = –3.00 and z = +3.00. We can select any two values of z and examine the percentage of area between them. Areas can be combined by addition or subtraction. For example, what percentage of cases correspond to z values above z = +1.00? This can be found by summing the probabilities for the “slices” above z = +1.00: 13.59% + 2.14% + .13% = 15.86%. That is, 15.86% of cases in a perfectly normal distribution have z values greater than +1.00. Note that z = 0.00 corresponds to the mean of this distribution. The sum of all the slices in Figure 6.1 is 100%. The sum of the slices above the mean (above z = 0.00) is 50%, and the area below z = 0.00 is also 50%.
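The areas quoted in the 68/95/99.7 rule can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module as a stand-in for the printed figure; the helper name `area_between` is an assumption made for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)  # standard normal: mean 0, SD 1


def area_between(z_low, z_high):
    """Proportion of area under the standard normal curve between two z values."""
    return std_normal.cdf(z_high) - std_normal.cdf(z_low)


# Areas within 1, 2, and 3 standard deviations of the mean, as percentages.
for k in (1, 2, 3):
    print(k, round(100 * area_between(-k, k), 2))
# 1 68.27
# 2 95.45
# 3 99.73

# Area above z = +1.00, found by subtraction from the total area of 1.
area_above_1 = 1 - std_normal.cdf(1.0)
print(round(100 * area_above_1, 2))  # 15.87
```

The last value is 15.87% rather than 15.86% because the figure's slices are each rounded to two decimals before they are summed; the difference is rounding error only.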

Figure 6.1 Areas in Normal Distribution That Correspond to z Scores

Because the distribution is perfectly symmetrical (a mirror image), the percentage of area below a specific negative value of z (such as z = –1.00) is the same as the percentage of area that is above the corresponding positive value of z (in this example, z = +1.00); that is, 15.86% of the area lies above z = +1.00, and 15.86% of the area lies below z = –1.00. Because the total area under the curve is 100%, once we know the percentage of cases that lie above a value of z, we can find the percentage of cases below z by subtraction. Because 15.86% of scores lie above z = +1.00, we know that (100% – 15.86%) = 84.14% of cases lie below z = +1.00.

6.8 AREAS UNDER THE NORMAL DISTRIBUTION CURVE CAN BE INTERPRETED AS PROBABILITIES

If you were to draw a case at random from a normally distributed population of scores, the probability that it would have a z score greater than z = +1.00 is 15.86%. The probability that a randomly drawn case will have z > 0.00 is 50%. In other words, areas can be interpreted as probabilities. For integer values of z, such as z = +1.00, the diagram in Figure 6.1 can be used to answer questions about area and probabilities. However, z values are often not integers. Areas that correspond to other (noninteger) values of z can be obtained from tables of the standard normal distribution, as discussed in the next section. To summarize information about areas in the standard normal distribution:

• The total area under the curve is 100%.
• The area below the mean = 50%; the area above the mean = 50%. The mean is z = 0.
• The area above a specific value of +z, such as z = 1.96, is the same as the area below –z (–1.96).
• Areas can be combined by addition and subtraction.

Standard normal distribution tables generally give area in terms of proportion; people often talk about areas in terms of percentages. To convert proportion to percentage, multiply proportion by 100.

6.9 READING TABLES OF AREAS FOR THE STANDARD NORMAL DISTRIBUTION

The equation in Appendix 6A can be used to generate normal distributions for any values of the mean and standard deviation that you want. For example, a normal distribution for IQ scores would have a mean of 100 and a standard deviation of 15. The standard normal distribution has a mean of 0 and a standard deviation of 1; it corresponds to a distribution of z scores. Figure 6.1 provides only areas related to integer values of z. In practice we will often need areas that correspond to noninteger values. More detailed information about z score distances from the mean, and areas under the normal distribution, is given in tables of the standard normal distribution. See the table in Appendix A at the back of this book. Part of that table appears in Figure 6.2 (for selected values of z that range from 1.83 to 2.12). Figure 6.3 shows enlarged versions of the diagrams that appear at the top and bottom of the table; these diagrams indicate which slices or areas correspond to the numbers in the table. For each value of z, the table provides two kinds of information about z score location. Column A lists the z values. Column B gives the area between z = 0.00 and the z value you want to evaluate. (Recall that z = 0.00 corresponds to X = M.) Column C gives the area that lies beyond the z value you want to evaluate (out in the tail of the distribution).

There are several ways to use this table to describe the location of a score with z = +1.96. Here is the easiest. Suppose we want to know the proportion of area that lies above, and the proportion that lies below, z = +1.96. Locate the value of z = 1.96 in column A in Figure 6.2. The corresponding number in column C, the “tail area,” is .025. We can convert from proportion to percentage by multiplying by 100; 2.5% of the area in this distribution lies above z = +1.96. By subtraction, 97.5% of the area in this distribution lies below z = +1.96. The percentage of area below a z score is like the cumulative percentage in a frequency table. We could say this score is at the 97.5th percentile. This tells us that a person who has a z score of +1.96 has an unusually high score. We can also think in terms of probability. If a person is randomly selected from this distribution of scores, there is a 2.5% probability that the person will have a higher score, and a 97.5% probability that the person will have a lower score, than z = +1.96. (We can convert z scores back into units for X if we want to make these statements in terms of X score values.) When z is negative, use the diagrams at the bottom of the table to identify which slices of area in the distribution correspond to ranges of z values. Because the distribution is symmetrical, we know the following: The area between z = 0.00 and z = +1.96 is the same as the area between z = 0.00 and z = –1.96 (.475). The area below z = –1.96 is the same as the area above z = +1.96. This table can be used to answer these questions for any values of z:

• What percentage of area lies above z?
• What percentage of area lies below z?
• What percentage of area lies between two specific z values, z1 and z2?
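The three questions above can also be answered in software instead of a printed table. The following sketch uses `statistics.NormalDist` from the Python standard library; the function name `table_columns` and the variable names are hypothetical, and the height values (M = 64.5, SD = 2.5) are reused from the earlier height example only to illustrate converting z back into X units.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1


def table_columns(z):
    """Return (column B, column C) for a z value, as proportions.

    Column B: area between z = 0.00 and |z|.
    Column C: area beyond |z| (the tail area).
    """
    below = std_normal.cdf(abs(z))
    return below - 0.5, 1.0 - below


col_b, col_c = table_columns(1.96)
print(round(col_b, 3), round(col_c, 3))  # 0.475 0.025

# Area below z = +1.96 (like a cumulative percentage).
area_below = std_normal.cdf(1.96)
print(round(100 * area_below, 1))        # 97.5

# Area between two z values, here z1 = -1.00 and z2 = +2.00.
area_mid = std_normal.cdf(2.0) - std_normal.cdf(-1.0)
print(round(100 * area_mid, 2))          # 81.86

# Converting z = +1.96 back into X units with X = (z * SD) + M.
x_height = 1.96 * 2.5 + 64.5
print(round(x_height, 1))                # 69.4
```

By symmetry, `table_columns(-1.96)` returns the same pair, which mirrors the way the printed table is used for negative z values.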

If using this table to answer questions about specific z values seems complicated or confusing, don’t despair. In practice, we are usually interested in a small set of specific z values, discussed in the next section.

Figure 6.2 Excerpt From Table of Standard Normal Distribution (See Appendix A at the End of the Book)

Figure 6.3 Detail From Standard Normal Distribution Table

6.10 DIVIDING THE NORMAL DISTRIBUTION INTO THREE REGIONS: LOWER TAIL, MIDDLE, AND UPPER TAIL

Textbooks sometimes drill students in the use of the normal distribution table with questions such as “What percentage of area lies between z = –1.00 and z = +2.00?” These artificial examples do not correspond to the kinds of questions that are of real interest to data analysts. Data analysts usually want to answer a simple question: Is an X score or other outcome close to, far from, or extremely far away from the mean? Data analysts sometimes choose different numerical values to define “far from.” The following z values are common ways of thinking about distance from the mean.

• Values between z = –1.00 and z = +1.00 are “close” to the mean.
• Values between z = –2.00 and z = +2.00 (but outside the range –1.00 and +1.00) are “in between” close and far from the mean.
• Values below z = –2.00 or above z = +2.00 are “far from” the mean.
• Values below z = –3.00 or above z = +3.00 are “very far from” the mean.
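These verbal labels can be expressed as a small lookup, sketched below. The function name `describe_distance` is hypothetical, and two choices are assumptions not settled by the list: a z value exactly at a boundary (such as z = 2.00) is assigned to the nearer-to-the-mean category, and a value beyond ±3.00 is given only the most extreme label even though it is also "far from" the mean.

```python
def describe_distance(z):
    """Map a z score to the verbal distance labels used in the list above."""
    size = abs(z)  # only the distance matters, not the direction
    if size > 3.0:
        return "very far from the mean"
    if size > 2.0:
        return "far from the mean"
    if size > 1.0:
        return "in between close and far from the mean"
    return "close to the mean"


for z in (-0.4, 1.5, -2.3, 3.6):
    print(z, describe_distance(z))
# -0.4 close to the mean
# 1.5 in between close and far from the mean
# -2.3 far from the mean
# 3.6 very far from the mean
```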

A normal curve divided into these areas appears in Figure 6.4. Individual researchers are free to use other values of z as criteria for distances. Researchers are often interested in the situation where the areas beyond ±z sum to exactly 5%. A normal distribution can be divided into three areas:

• 2.5% of the area below –z, the “lower tail,”
• 95% of the area in the center, and
• 2.5% of the area above +z, the “upper tail.”

Figure 6.4 Areas That Are Close, Far, and Very Far From the Mean (in z Score Units)

Figure 6.5 Normal Distribution Divided Into Areas Below z = –1.96, Between z = –1.96 and +1.96, and Above z = +1.96

These areas appear in Figure 6.5. The exact value of z that “cuts off” 2.5% of area in each tail, with 95% of the area in the center, is z = ±1.96. Another common way to divide the distribution into lower tail, center, and upper tail appears in Figure 6.6. In Chapter 7 (on confidence intervals) we will focus on the range of values that is “not very far from the mean,” that is, the middle 95%. There is a 95% chance that a randomly selected case will have a score that lies in the center area. In Chapter 8 (on significance tests), we focus on the areas in the lower and upper tails. There is a 2.5% chance that a randomly selected score will lie in the lower tail and a 2.5% chance that a randomly selected score will lie in the upper tail. These two areas combined describe outcomes that can be called “far away from” the mean. You should develop a sense that z scores larger than 2 in absolute value (the rounded value for 1.96) indicate that an outcome is usually considered far from the mean (and therefore unusual or unlikely). Also, z-score values larger than 3 in absolute value are very far from the mean (and therefore very unusual or unlikely).
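The cutoff z for a given central area can be recovered with the inverse of the cumulative normal distribution. The sketch below uses `NormalDist.inv_cdf` from Python's standard `statistics` module; the function name `central_cutoff` is an assumption made for illustration. It covers the middle 95% division and the middle 99% division shown in Figure 6.6.

```python
from statistics import NormalDist


def central_cutoff(central_area):
    """Return the positive z such that central_area lies between -z and +z."""
    if not 0.0 < central_area < 1.0:
        raise ValueError("central_area must be a proportion between 0 and 1")
    tail = (1.0 - central_area) / 2.0      # area left over in each tail
    return NormalDist().inv_cdf(1.0 - tail)


print(round(central_cutoff(0.95), 2))  # 1.96 (2.5% in each tail)
print(round(central_cutoff(0.99), 2))  # 2.58 (0.5% in each tail)
```

The same function can be used with other central areas if a researcher prefers a different criterion for "far from" the mean.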

Figure 6.6 Bottom .5%, Middle 99%, and Top .5% of Normal Distribution

6.11 OUTLIERS RELATIVE TO A NORMAL DISTRIBUTION

A score that has a very large distance from the mean (and therefore a very large absolute value of z) is called an outlier. It is possible to use z scores to identify scores as outliers. Tabachnick and Fidell (2018) suggested that scores with z values less than –3.29 or greater than +3.29 can be called outliers. Scores can be identified as outliers using other criteria, for example, location in a boxplot. Boxplots and z scores may not identify the same scores as outliers. Many other rules can be used to identify outliers (Aguinis, Gottfredson, & Joo, 2013). Outliers create problems in many statistical analyses. For example, the value of the sample mean M is not robust against the effect of outliers. When you see outliers in a sample, at a minimum, you need to report:

• The method you used to identify cases as outliers
• The number of outliers
• The decisions you made about handling outliers
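As an illustration of the z-score criterion, the following Python sketch flags scores with z beyond ±3.29 and prints the method and the number of outliers. The heart rate values are made up for the example, the function name is hypothetical, and the use of the sample standard deviation (dividing by N – 1, as `statistics.stdev` does) is an assumption about how SD is computed.

```python
import statistics


def find_z_outliers(scores, cutoff=3.29):
    """Return the scores whose z value exceeds the cutoff in absolute value."""
    m = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample SD (divides by N - 1); an assumption
    return [x for x in scores if abs((x - m) / sd) > cutoff]


# Hypothetical heart rates: 30 ordinary scores plus one extreme score.
heart_rates = [68, 70, 72] * 10 + [110]
outliers = find_z_outliers(heart_rates)

print("Criterion: z greater than 3.29 in absolute value")
print("Number of outliers:", len(outliers))
print("Outlier values:", outliers)
```

The criterion has to be fixed before the data are examined; the code only applies a rule that was chosen in advance.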

There are several possible ways to handle outliers, and none of them is a perfect solution:

1. Leave the outliers in.
2. Remove outliers from the data set before analysis (using methods described in Appendix 6B).
3. Modify the values of outliers (i.e., change the score value of an outlier to the next nearest score value that is not an outlier; this is called Winsorizing; Aguinis et al., 2013).
4. Use a nonparametric analysis that can reduce the effects of outliers; for instance, report the median instead of the mean.
5. Use robust statistical methods (these are beyond the scope of this book; see Field and Wilcox, 2017, for an introduction).

Ideally you decide the method you will use to identify outliers, and the method you will use to handle them, before you collect data. For example, you could use z scores (which work well for normally distributed samples) or boxplots (which are preferable for non-normally distributed samples) to identify outliers. You must describe the criteria for outliers, the number of outliers, and the handling of outliers in the research report.

It can be useful as a student exercise to “experiment” with outliers in data (data that you will not publish!). You can evaluate how results of analyses change when outliers are retained versus removed. In actual research, you should commit to decisions ahead of time. It is bad practice to “experiment” with outliers in data you plan to publish. You should not drop outliers in various ways until you obtain the outcome you want, and then report one final outcome without explaining that it was “cherry-picked” from a large number of different analyses.

6.12 SUMMARY OF FIRST PART OF CHAPTER

At this point you should be able to do the following.

• Convert an X score (for example, height in centimeters) into a z score, given values of M and SD.
• Given a z score and values of M and SD, find the original X score.
• Given a diagram of the normal distribution (as in Figure 6.1), find the percentage of area above or below any integer value of z, or the percentage of area between any two integer values of z.
• Using the table of the normal distribution in Appendix A at the end of the book, find the percentage of area above or below any noninteger value of z, or the percentage of area between any two noninteger values of z. However, this is less important for further work in statistics than understanding the idea of dividing a normal distribution into regions (lower tail, center, and upper tail).
• Decide whether an X value is far away from the mean, on the basis of its absolute value of z. I suggest that you call values of z greater than 2 in absolute value “far from” the mean and values of z greater than 3 in absolute value “very far from” the mean.
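The first two skills in this list amount to one line of arithmetic each: z = (X – M)/SD and X = M + z × SD. A minimal Python sketch follows; the function names are hypothetical, and the values M = 74 and SD = 4.5 are borrowed from the heart rate example used later in the chapter purely for illustration.

```python
def x_to_z(x, m, sd):
    """Convert a raw score X into a z score: distance from M in SD units."""
    return (x - m) / sd


def z_to_x(z, m, sd):
    """Convert a z score back into the original units of X."""
    return m + z * sd


m, sd = 74.0, 4.5          # illustrative mean and standard deviation
z = x_to_z(83.0, m, sd)    # (83 - 74) / 4.5
x = z_to_x(z, m, sd)       # 74 + z * 4.5 recovers the original score

print("z =", z)
print("X =", x)
```

A heart rate of 83 lies 9 points above the mean, which is 2 standard deviations, so by the rule of thumb above it sits at the boundary of “far from” the mean.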

6.13 WHY WE ASSESS DISTRIBUTION SHAPE

You should always examine histograms or boxplots or other graphs for scores on quantitative variables before you do additional analyses. These graphs provide information you need to do the following things:

1. Describe distribution shapes for your variables.
2. Detect outliers.
3. Evaluate whether data meet requirements and assumptions for statistical analyses you plan to do.

The third point (evaluating possible violations of assumptions) will be discussed for each new analysis when it is introduced; you do not need to worry about it now.

Refer back to Tables 5.1 and 5.2 to see examples of histograms that represent common distribution shapes. Table 5.1 shows some approximately normal distributions with slight departures from normal shape, such as outliers and mild to moderate skewness. Other histograms were clearly non-normal (such as the uniform and reverse J-shaped distributions). At this point, when you look at a histogram for sample data, try to find a good match for your histogram shape in these tables. It is okay if you cannot find a match. Some distributions in samples don’t have any simple shape.

1. Distributions that resemble those in Table 5.1 can be judged “reasonably normal” in shape, with appropriate modifications to descriptions such as “moderately positively skewed.”
2. Distributions that resemble those in Table 5.2 are not at all close to normal in shape. Some of these distribution shapes can cause serious problems if they are analyzed using the basic bivariate techniques in this book and may require different and more advanced analyses.
3. Whether a distribution is approximately normal in shape or not, pay attention to outliers. Outliers can have a disproportionate impact on results. You must acknowledge the presence of outliers and decide what to do with them (even if your decision is to leave them in).
4. It is possible to do quantitative tests for departure from normality, as described in Appendix 6C at the end of this chapter. However, quantitative tests for skewness, kurtosis, and overall departure from normal distribution shape are often not very useful in practice. Results of these tests often depend more on sample size than on degree of departure from normality (these tests almost always signal problems with distribution shape, even for distributions that are similar to normal, when samples are large, for example, N > 200). Furthermore, some statistical tests (not all tests) are fairly robust to violations of assumptions about normal distribution of scores in the population.
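The dependence on sample size described in point 4 can be seen with a little arithmetic. The sketch below holds a modest skewness value constant and divides it by its standard error at three sample sizes. Two things here are assumptions rather than material from the chapter: the skewness value of 0.40 is made up, and the standard error uses a commonly cited large-sample formula, sqrt(6N(N – 1) / ((N – 2)(N + 1)(N + 3))), which may not match every program's output exactly.

```python
import math


def se_skewness(n):
    """Commonly cited standard error of the skewness index for a sample of size n."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


skewness = 0.40  # hypothetical, fairly mild skewness held constant across samples

for n in (20, 200, 2000):
    z_ratio = skewness / se_skewness(n)
    print(f"N = {n:5d}  SE = {se_skewness(n):.3f}  z ratio = {z_ratio:.2f}")
```

The same degree of skewness yields a small z ratio when N = 20 and a very large one when N = 2,000, so a “significant” result mostly reflects the number of cases.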

6.14 DEPARTURE FROM NORMALITY: SKEWNESS

One common departure from ideal normal distribution shape is skewness. Skewness is asymmetry; an ideal normal distribution is perfectly symmetrical. If you could “fold” a normal distribution along the line that corresponds to the mean, the two halves would match. Skewness describes the degree to which a histogram deviates from perfect symmetry. We say that a distribution is positively skewed if it is “heavy” at the lower end and has a longer, thin tail at the upper end. Conversely, we say that a distribution is negatively skewed if it has a longer, thinner tail at the lower end. Figure 6.7 shows schematic examples of positive and negative skewness. Visual examination of a histogram is often sufficient to decide whether there is notable skewness. A quantitative index of skewness can be requested from SPSS (see Appendix 6C), but it isn’t needed in most situations.

Some common situations cause data to be skewed. For example, there may be a lower limit to score values (a person cannot have fewer than 0 children) or an upper limit to scores (a student cannot obtain more than 100% correct on an exam). Variables such as annual income tend to be strongly positively skewed because minimum income is 0, but there is virtually no limit to income at the upper end of the distribution. Figure 6.8 shows substantial positive skewness (along with a possible floor effect and high-end outliers). A floor effect occurs when there is a limit to possible scores at the low end of a distribution. For example, a student cannot earn an exam score less than 0 points. If an exam is extremely difficult, many students will earn very low scores, and few students will earn high scores, as in the hypothetical distribution in Figure 6.8; for most students, the quiz was too hard.

Figure 6.7 Examples of Distribution Shapes for Positive, Zero, and Negative Skewness

Figure 6.8 Hypothetical Example of Positive Skewness: Number of Correct Items on a Quiz With a Possible Range of 0 to 8 Points

Figure 6.9 shows substantial negative skewness and a possible ceiling effect. In Figure 6.9 most scores are piled up near 100 points (out of 100 possible points). A ceiling effect occurs when an exam is “too easy” for most students.

Skewness should be mentioned when data are described in research reports. Positive skewness is common in real data. Sometimes an appearance of skewness is due to a few high-end outliers. An index to describe degree of skewness is available; see Appendix 6C for further discussion. Usually visual examination of a histogram is sufficient to evaluate skewness.

What should you do if you see skewness in your sample data?

• Include mention of skewness in your description of the distribution.
• If skewness is not extreme (as in the examples in Table 5.1), you may not need to do anything to try to get rid of skewness. If skewness is extreme (as in Figures 6.8 and 6.9), you may want to consider options such as outlier removal to reduce skewness.
• Decisions about the identification and removal of outliers should be made before you collect data. If you make these decisions after you peek at your data, you must explain this when you report information about your data.

Figure 6.9 Example of Negative Skewness: Hypothetical Exam Scores on a Scale From 0 to 100

6.15 ANOTHER DEPARTURE FROM NORMALITY: KURTOSIS

Kurtosis is a different kind of departure from ideal normal distribution shape. Kurtosis is included here only because virtually all introductory statistics textbooks for behavioral and social sciences discuss it; however, no one ever mentions it after early chapters about distribution shape. Any problems that a large value for kurtosis might suggest, such as outliers, can be more easily evaluated using simpler methods (such as boxplots). The sketches for platykurtic and leptokurtic distribution shapes in some textbooks do not accurately depict the distribution shapes for varying amounts of positive and negative kurtosis (according to Westfall, 2014). Kurtosis tells us whether a disproportionate number of scores are farther away from the mean (or closer to the mean) than for an ideal normal distribution. The patterns of scores in the center of platykurtic (sometimes described as “flatter” than normal) and leptokurtic (sometimes described as more “peaked” than normal) distributions can vary in ways that do not correspond to the graphs that appear in some textbooks. It is inaccurate to describe kurtosis simply as degree of “peakedness” (Westfall, 2014).

In practical applications of statistics, you can ignore kurtosis. Visual examination of distribution shape in histograms and boxplots provides more useful information about related potential problems, such as extreme outliers. More complete information about kurtosis (for curious readers) is provided in Appendix 6C.

6.16 OVERALL NORMALITY

There are tests (such as Kolmogorov-Smirnov and Shapiro-Wilk) to evaluate overall departure from normality (skewness, kurtosis, and outliers). When samples are small (N < 20 or 30), we don’t have enough information to evaluate possible departures from normality. On the other hand, when sample sizes are large (N > 200), these tests almost always indicate significant departures from normality. The results of these tests of normality often depend more on sample size than on distribution shape (University College London, Great Ormond Street Institute of Child Health, 2010). In most situations, simple visual examination of a histogram is enough to evaluate whether sample data are reasonably normally distributed. Quantitative tests for overall departure from normal distribution shape (essentially, comparing the shape of the histogram in your sample with an ideal normal distribution) appear in Appendix 6C.

Some textbooks say that a normal distribution of scores in a sample is a required assumption for the use of many common statistics. Strictly speaking, that is incorrect (Field, 2018). (An assumption involved in developing many of the statistics you will use was that scores were randomly sampled from a normally distributed population, but we usually don’t have enough information to evaluate distribution shape in the population.)

In practice, some departures from normal distribution shape, such as extreme outliers, do cause problems in data analysis. Distribution shape is discussed in later chapters when it is important for specific analyses.

6.17 PRACTICAL RECOMMENDATIONS FOR PRELIMINARY DATA SCREENING AND DESCRIPTIONS OF SCORES FOR QUANTITATIVE VARIABLES

When you work with quantitative variables, you should do the following things.

• In all research, decide the value of N before you begin to collect data. (Do not collect data, repeatedly analyze it, collect more data because you are not happy with results, and then stop at a point where you have results you like.)
• Choose the method for outlier identification (such as boxplots or z scores) before you collect data.
• Establish rules for inclusion or exclusion of cases ahead of data collection. (For example, you may want to include a limited range of ages, or only right-handed persons, in your sample.)
• Decide how you will handle outliers before you collect data.
• If you anticipate skewness, think about what you might do to reduce skewness ahead of time. In many cases, if skewness is not extreme, you don’t need to do anything about it.
• Collect data.
• Obtain a frequency table; identify impossible or questionable score values and note percentage of missing values.
• Obtain a histogram and visually examine it to evaluate distribution shape and skewness. Unless skewness is extreme, you probably don’t need to do anything about it.
• To evaluate outliers, obtain a boxplot and/or z scores for all cases. Either boxplots or z scores can be used to identify outliers. Note the number and locations of outliers.
• Your decision whether to use mean or median (as well as choices among later statistics) may depend on distribution shape and whether outliers are present.
• Document every decision you made.
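The frequency-table step in this checklist can be sketched in a few lines of Python. This is only an illustration, not the SPSS procedure: the function name `screen_scores` is hypothetical, the scores are invented, and marking missing values with `None` is an assumption about how the data were coded.

```python
from collections import Counter


def screen_scores(scores, lowest_possible, highest_possible):
    """Return a frequency table, the percentage missing, and impossible values.

    Missing values are assumed to be coded as None.
    """
    observed = [x for x in scores if x is not None]
    frequencies = Counter(observed)
    pct_missing = 100.0 * (len(scores) - len(observed)) / len(scores)
    impossible = sorted(
        value for value in frequencies
        if value < lowest_possible or value > highest_possible
    )
    return frequencies, pct_missing, impossible


# Hypothetical daily servings of fruit and vegetables (possible range 0 to 8).
servings = [0, 0, 1, 2, 2, 3, None, 8, 11, None]
frequencies, pct_missing, impossible = screen_scores(servings, 0, 8)

for value in sorted(frequencies):
    print(value, frequencies[value])
print("Percent missing:", pct_missing)
print("Impossible values:", impossible)
```

Here the value 11 falls outside the possible range and would need to be checked against the original records before any analysis.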

6.18 REPORTING INFORMATION ABOUT DISTRIBUTION SHAPE, MISSING VALUES, OUTLIERS, AND DESCRIPTIVE STATISTICS FOR QUANTITATIVE VARIABLES

You use all the information discussed in Chapters 3 through 6 to describe the behavior of each quantitative variable early in your research report. Try to communicate the pattern of information as clearly as possible. Information about distribution shape can be summarized in statements such as:

Heart rates were approximately normally distributed, with N = 100, M = 74, and SD = 4.5. There were no missing values. Using z > 3.29 in absolute value as the criterion for identifying outliers, there were no outliers.

The initial data set had N = 340 heart rate scores, with M = 76 and SD = 6.5. There were 20 missing values. Using z > 3.29 in absolute value as the criterion for identifying outliers, there were 10 outliers, all at the upper end of the distribution. On the basis of prior plans for data handling, the 20 missing values and 10 outliers were removed from the data set, leaving N = 310 cases for analysis. For these 310 cases, M = 68 and SD = 5.7.

Number of daily servings of fruit and vegetables had a possible range of scores from 0 to 8. Scores were not normally distributed; there was a mode at 0. The distribution was positively skewed. Because this variable is important in the study, and because its distribution shape cannot be described simply, a histogram is provided as more complete information (see Figure 6.10).

In the initial sample of N = 1,000 survey respondents, ratings of approval of the president were made on a 1- to 7-point scale, with 1 indicating lowest approval and 7 indicating highest approval. Fifty-four percent of persons contacted refused to answer the survey question; ratings were available for only 460 persons. The distribution of ratings was bimodal; 41% of persons gave ratings of 1, 34% of persons gave ratings of 7, and smaller percentages gave intermediate ratings. Because of this extremely non-normal distribution shape, medians and modes are reported instead of means when summarizing ratings. [Note that outliers would not be a problem for this kind of rating scale.]

Figure 6.10 Number of Daily Servings of Fruits and Vegetables (N = 1,250)

Readers should be able to answer questions like these on the basis of your description: What did the distribution look like? How many missing data were there? What were the numbers and locations of outliers? How were outliers handled? Past research reports have not always included complete information about data screening, distribution shapes, missing values, outliers, and decisions about handling these. There is growing concern about the need for more transparent and complete reporting (Simmons, Nelson, & Simonsohn, 2011).

6.19 SUMMARY

There are several reasons why you need to know about distribution shapes.

1. Parsimony: When we know the shape of a distribution (for example, that it has a normal shape), we need only two additional pieces of information to draw an exact picture of that distribution: the mean M and the standard deviation SD. We do not need all the additional pieces of information in a frequency table. If we do not know distribution shape, we need to provide much more information to provide a complete description of different scores or responses. In some situations, we may need to report a complete frequency table to provide full information.

2. Problems in the distribution: Information about distribution shape is needed to identify potential problems such as outliers and skewness.

3. Describing quantitative variables in research reports: If there are few variables, you might summarize information about each variable in a sentence such as “Scores on X were approximately normally distributed,” or “Scores on X were extremely positively skewed, with 3% missing values, and two low-end outliers identified by location in a boxplot.” If there are many variables, a table could summarize this information.

The skills you need to remember from this chapter are:

• How to convert an X score into a z score (given values of M and SD).
• How to convert a z score back into an X score (given values of M and SD).
• How to find the percentage of area above or below any value of z or the percentage of area between any two values of z.
• How to identify outliers.
• How to summarize information about a quantitative variable, including at least distribution shape, missing values, outliers, and descriptive statistics such as M and SD.

The material in this chapter is extremely important; two widely used statistical procedures (confidence intervals and statistical significance tests) depend on understanding the way areas of normal distributions (and other similar distributions) are used to identify “common” versus “uncommon” (or rare or unexpected) outcomes.

APPENDIX 6A: THE MATHEMATICS OF THE NORMAL DISTRIBUTION

A function is an equation that generates values for a Y variable on the basis of values of one or more X variables. The very simple function to convert height in inches (X) to height in centimeters (Y) is Y = 2.54 × X. This is a linear function; if you plot values of Y (vertical axis) against values of X (horizontal axis), the equation corresponds to a straight line, as shown in Figure 6.11.

The equation (function) for the normal (or Gaussian) distribution is much more complicated, and it generates a curve (not a straight line). The equation uses a lot of notation you have not seen yet. The key things to notice are that:

• Y represents the height of the curve (on the vertical axis).
• (X – μ) represents the distance of an X score from the mean (on the horizontal axis of the plot of the function).

Equation 6.4 generates a value for Y (the height of the distribution) as a function of the distance of an X score from the mean.

$$Y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(X - \mu)^2}{2\sigma^2}} \qquad (6.4)$$

where

• π is a mathematical constant with an approximate value of 3.1416.
• e is a mathematical constant with an approximate value of 2.7183.
• μ is the mean of the population, that is, the center of the distribution. For now, we will use M for the mean of sample data to estimate μ. In later chapters we will use μ to represent unknown or hypothesized means of populations.
• σ is the standard deviation of the population; it corresponds to the dispersion of the distribution. For now, we use SD from a sample to estimate σ.
• (X – μ) is the distance of each X score from the population mean.

We can generate many different normal distributions by using any values we want for μ and σ. For example, IQ scores have a population mean of μ = 100 and a population standard deviation of σ = 15; if we substitute these values into Equation 6.4, the graph will be a normal distribution curve with μ = 100 and σ = 15, as shown in Figure 6.12. When the values μ = 0 and σ = 1 are used, this equation generates the “standard” normal distribution as shown in Figure 6.13. The standard version of the normal distribution is extremely useful. We can convert X scores that occur in any units into z scores with M = 0 and SD = 1. Then we can use the standard normal distribution to find areas that correspond to the z scores.
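To see how the equation turns a score into a curve height, here is a minimal Python sketch. It assumes the standard form of the normal density, Y = 1 / (σ√(2π)) × e raised to the power –(X – μ)² / (2σ²); the function name `normal_curve_height` is hypothetical.

```python
import math


def normal_curve_height(x, mu, sigma):
    """Height Y of the normal curve at score x, for a given mu and sigma."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


# IQ example from the text: mu = 100 and sigma = 15.
for iq in (70, 85, 100, 115, 130):
    print(iq, round(normal_curve_height(iq, 100.0, 15.0), 5))

# Standard normal distribution: mu = 0 and sigma = 1.
print(round(normal_curve_height(0.0, 0.0, 1.0), 4))
```

The heights at 85 and 115 are equal, as are those at 70 and 130, because the curve depends only on the squared distance from the mean; the curve is highest at the mean itself.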

Figure 6.11 The Linear Function That Shows How Height in Inches Is Related to Height in Centimeters

Qualitatively, the normal distribution curve has the following properties:

• The mean is in the center.
• The curve is perfectly symmetrical (and it has zero skewness and zero excess kurtosis, as discussed in later sections).

Quantitatively (mathematically) a perfect standard normal distribution has the following additional property:

• The curve corresponds to a fixed relationship of distance from mean (given in z-score units) with area under the curve, exactly as shown in Figure 6.13.

A curve that appears bell shaped from visual examination of a histogram but that lacks this exact fixed relationship between distance from the mean and area is not an exactly normal distribution. We may be able to describe it as approximately normal. For many situations in applied statistics, approximately normal is good enough.

In addition, the tails of the mathematical normal distribution function extend to infinity. At the same time, the total area under the curve is finite and scaled to equal 1.00. (This may seem contradictory, but functions that involve the mathematical constant e often do amazing things.)

You will not have to use these equations to find values of Y, or areas under the curve, in this course. Statisticians have done that already, and the results are available in tables such as the one in Appendix A at the end of this book. SPSS also provides exact tail areas for many analyses, and when you have that information, you may not need to do any table lookup.

In practice we will not be interested in the probability of an individual X score. Instead, we will want to know the probability that scores lie within some range of X values. To find the area under the normal curve that lies between any pair of values on the horizontal axis, z1 and z2, statisticians solve for the integral (area) of the normal distribution function (Equation 6.4) from z1 to z2. The areas obtained this way are widely available online and in table form, for example, in Appendix A at the end of this book. You will not have to solve integrals in this course. You can use tables (such as in Appendix A at the end of this book) to find the area (probability) between any pair of values for z.
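For curious readers, the idea of “area as an integral” can be illustrated by adding up many thin slices under the standard normal curve. The Python sketch below uses the trapezoid rule, a simple numerical approximation chosen here for illustration; it is not the method used to build the table in Appendix A, and the function names are hypothetical.

```python
import math


def standard_normal_height(z):
    """Height of the standard normal curve (mu = 0, sigma = 1) at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def area_between(z1, z2, steps=10_000):
    """Approximate the area under the curve from z1 to z2 with the trapezoid rule."""
    width = (z2 - z1) / steps
    # The two end points count half; every interior point counts once.
    total = 0.5 * (standard_normal_height(z1) + standard_normal_height(z2))
    for i in range(1, steps):
        total += standard_normal_height(z1 + i * width)
    return total * width


print(round(area_between(-1.96, 1.96), 3))  # approximately 0.95
print(round(area_between(-1.0, 1.0), 3))    # approximately 0.683
```

Widening the range to something like –8 to +8 gives an area very close to 1.00, which illustrates how tails that extend indefinitely can still enclose a finite total area.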

Figure 6.12 Normal Distribution for IQ Scores, μ = 100 and σ = 15

Figure 6.13 The Standard Normal Distribution (μ = 0 and σ = 1)

APPENDIX 6B: HOW TO SELECT AND REMOVE OUTLIERS IN SPSS

If a researcher decides on rules for the identification and removal of outliers before looking at the data, and detects outliers using these rules, the following SPSS commands can be used to remove (filter out) outliers. In the following example, SPSS Select Cases commands are used to retain temperatures that are below 100 degrees Fahrenheit (and temporarily filter out any temperatures of 100 or higher).

To do this, make the following menu selections: <Data> → <Select Cases>. In the Select Cases dialog box (Figure 6.14), click the radio button for “If condition is satisfied.” Then click the If button immediately below that to open the second dialog box, Select Cases: If (Figure 6.15). Type in the logical expression “temp_Fahrenheit < 100.” A logical expression generally includes a variable; operators such as greater than, equal to, or less than; and specific numerical values (see Table 6.2). The full command this creates is “Select cases if temp_Fahrenheit is less than 100.” By implication, cases with values of temp_Fahrenheit greater than or equal to 100 are not selected. Data for the cases that satisfy this condition will be included in later analyses. Cases that do not meet this condition (that is, persons with temperatures of 100 or higher) will be excluded from future analyses.

Under the Output heading in the Select Cases dialog box in Figure 6.14, I left the radio button selection as “Filter out unselected cases.” If you choose “Delete unselected cases,” cases will be removed permanently. Permanent deletion is usually not a good idea. A research report must include information about any cases that are selected out. The number of cases, the score values, and the reason for selecting them out should be stated. Usually scores are removed because they are outliers, but there can be other reasons to remove scores.

When you look at the data file in Figure 6.16 you’ll see that the row numbers for two excluded cases (with temperature scores of 101.3 and 100.4) are marked out with cross hatches. If the frequencies procedure is run to obtain the sample mean M, those two values will not be included.
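Outside SPSS, the same “filter, don’t delete” logic can be sketched in Python. This is an analogy, not the SPSS command: the list of temperatures is invented (apart from the two excluded values, 101.3 and 100.4, taken from the example above), and the variable names are hypothetical.

```python
import statistics

# Hypothetical temperatures in degrees Fahrenheit; the original list is never modified.
temps_fahrenheit = [98.6, 97.9, 101.3, 99.1, 100.4, 98.2]

# Same logical expression as in the Select Cases example: temp_Fahrenheit < 100.
selected = [t for t in temps_fahrenheit if t < 100]
filtered_out = [t for t in temps_fahrenheit if not (t < 100)]

print("Cases retained:", len(selected))
print("Cases filtered out:", filtered_out)
print("Mean of retained cases:", round(statistics.mean(selected), 2))
```

Keeping the excluded values in a separate list makes it easy to report how many cases were set aside and what their scores were.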

Figure 6.14 Select Cases Dialog Box

Figure 6.17 “Select Cases” Radio Button to Select All Cases (Stop Excluding Outliers)

It is possible to use the Select Cases procedure to remove scores at both ends of the distribution. This could be done with a logical expression such as “temp_Fahrenheit > 97 AND temp_Fahrenheit < 100.” This would include scores only if they are both greater than 97 and less than 100 and exclude scores outside that range. Remember that you must report the number of outliers that were removed or modified and explain the rules and rationale.

APPENDIX 6C: QUANTITATIVE ASSESSMENTS OF DEPARTURE FROM NORMALITY

You may have noticed that the sample mean is a function of X to the first power and that the sample variance is a function of X². These can be called the first and second “moments” of a distribution. There are two additional moments for normal distributions, based on deviations from the mean raised to the third and fourth powers; these are called skewness and kurtosis. An ideal normal distribution has skewness of 0 and kurtosis of 3 (or excess kurtosis of 0).

6.C.1 Index for Skewness

The most common formula to quantify degree of skewness is:

$$\text{Skewness} = \frac{\sum (X - M_X)^3}{N \times SD^3} \qquad (6.5)$$

If you think about this formula, perhaps you can see that when an (X – MX) deviation is positive, its cube is positive; when (X – MX) is negative, its cube is negative. Skewness therefore provides information about the comparative magnitudes of positive versus negative deviations from the mean. A positive value for the skewness index indicates more extreme scores at the upper end of the distribution. SPSS provides a skewness index (along with the standard error of skewness, or SEskewness) that can be used to test whether skewness differs significantly from zero. However, in most research situations, visual examination of a histogram is sufficient to evaluate skewness.

To decide whether skewness is severe, you can divide skewness by the standard error of skewness given in SPSS’s output for descriptive statistics and evaluate this ratio using standards for z scores. If the z ratio is greater than 3 in absolute value, it indicates “statistically significant” skewness. Consider the fruit and vegetable servings data that appeared in Figure 6.10. For the daily number of servings of fruits and vegetables variable (NCIfv), skewness = 1.273, SEskewness = .110, and 1.273/.110 = 11.57. This distribution is very positively skewed (it has a longer tail on the upper end). When skewness is present, values of the median and mode are usually not close together, and the curve for a normal distribution does not fit well. Descriptive statistics, including the skewness index, appear in Figure 6.18.

6.C.2 Index for Kurtosis

Kurtosis has been widely misunderstood; it is sometimes described as information about “peakedness” of distribution shape. That is incorrect (Westfall, 2014). Thinking about the computational formula helps us see why. A common formula for kurtosis is:

$$\text{Kurtosis} = \frac{\sum (X - M_X)^4}{N \times SD^4} \qquad (6.6)$$

When deviations from the mean are taken to the fourth power, greater weight is given to extreme scores. Kurtosis provides more information about extreme scores in the tails than about the shape of the peak. Different distribution shapes can arise for varying degrees of kurtosis. Westfall (2014) offers examples to demonstrate that kurtosis does not provide information about the shape of distribution peaks.
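The two indexes can be sketched in Python as the average cubed and fourth-power deviations, each divided by the matching power of SD. Several things here are assumptions: the function name is hypothetical, the scores are invented, SD is computed by dividing by N (`statistics.pstdev`), and no small-sample adjustment is applied, so values reported by statistical packages may differ somewhat from these.

```python
import statistics


def moment_indexes(scores):
    """Return (skewness, kurtosis, excess kurtosis) from third and fourth moments."""
    n = len(scores)
    m = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # divides by N; an assumption, see lead-in
    skewness = sum((x - m) ** 3 for x in scores) / (n * sd ** 3)
    kurtosis = sum((x - m) ** 4 for x in scores) / (n * sd ** 4)
    return skewness, kurtosis, kurtosis - 3.0


symmetric = [1, 2, 3, 4, 5]            # hypothetical scores, perfectly symmetrical
long_upper_tail = [0, 0, 0, 1, 1, 2, 8]  # hypothetical scores with a high-end tail

print(moment_indexes(symmetric))
print(moment_indexes(long_upper_tail))

# z ratio for the fruit and vegetable example in the text: skewness / SEskewness.
print(round(1.273 / 0.110, 2))
```

The symmetrical scores give a skewness of zero because positive and negative cubed deviations cancel, while the single score of 8 in the second list dominates both the third and fourth powers.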

Figure 6.18 Descriptive Statistics Including Skewness for Daily Fruit and Vegetable Consumption Data

A further source of confusion regarding kurtosis is that using Equation 6.6, the kurtosis for a standard normal distribution is 3. However, data analysts often subtract the value of 3 from the number obtained using Equation 6.6 and call the resulting value “excess kurtosis.” The label confusion gets worse because SPSS and some other programs label the value they report as a kurtosis index, when they are actually reporting excess kurtosis. In SPSS, kurtosis of 0 indicates no departure from the normal distribution in terms of kurtosis. SPSS also provides a standard error for kurtosis (SEkurtosis). Kurtosis can be evaluated by calculating z = kurtosis/SEkurtosis and using the standard normal distribution to find tail areas that correspond to the absolute value of z. I do not think you need to evaluate, or worry about, kurtosis in applied statistics.

A normal distribution can have any values for the first two moments (mean and standard deviation). However, a perfectly normal distribution must have skewness of 0 and (excess) kurtosis of 0.

6.C.3 Test for Overall Departure From Normal Distribution Shape

If a quantitative test for overall distribution shape is desired, the SPSS Explore procedure can be used to obtain additional (rarely reported) tests for departure from normality. To do this, make the following menu selections: <Analyze> → <Descriptive Statistics> → <Explore>. In the Explore dialog box (Figure 6.19), move the name of the variable(s) you want to examine into the “Dependent List” pane, then click OK. Selected output from the Explore procedure follows in Figure 6.20. Explore also provides a Q-Q plot (Figure 6.21), which graphs the cumulative frequency in the empirical data against the cumulative frequency that would be expected in a perfectly normal distribution; for perfectly normally distributed data, the points would all fall on a straight line.

In Figure 6.20, both tests (Kolmogorov-Smirnov and Shapiro-Wilk) have p < .001; the distribution shape for scores in the sample would be judged significantly different from an ideal normal distribution shape. If a distribution of scores is close to normal, the Q-Q plot (Figure 6.21) should be close to a straight line.
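The logic behind a Q-Q plot can be sketched without any plotting: sort the scores, work out what z score a perfectly normal distribution would expect at each rank, and pair it with the observed z score. The Python sketch below uses `statistics.NormalDist` from the standard library. The function name is hypothetical, the scores are invented, and the plotting position (rank – 0.5)/N is one common convention adopted here as an assumption; SPSS may use a different one.

```python
from statistics import NormalDist, mean, stdev


def qq_points(scores):
    """Pair each expected normal z score with the observed z score at the same rank."""
    ordered = sorted(scores)
    n = len(ordered)
    m, sd = mean(ordered), stdev(ordered)
    standard_normal = NormalDist()  # mu = 0, sigma = 1
    points = []
    for rank, x in enumerate(ordered, start=1):
        proportion = (rank - 0.5) / n          # assumed plotting position
        expected_z = standard_normal.inv_cdf(proportion)
        observed_z = (x - m) / sd
        points.append((expected_z, observed_z))
    return points


# Hypothetical scores with one high-end value that pulls away from the line.
for expected_z, observed_z in qq_points([2, 3, 3, 4, 4, 4, 5, 5, 6, 12]):
    print(round(expected_z, 2), round(observed_z, 2))
```

If the scores were close to normal, the two columns would be nearly equal at every rank; here the last pair differs noticeably because of the extreme score of 12.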

Figure 6.19 SPSS Explore Dialog Box

Figure 6.20 Tests for Overall Departure From Normality

Figure 6.21 Q-Q Plot From the SPSS Explore Procedure

I do not think these tests are useful in applied statistics. As noted earlier, the results of these tests of normality often depend more on sample size than on distribution shape (University College London, Great Ormond Street Institute of Child Health, 2010). When sample sizes are large, tests for non-normal distribution shape almost always indicate that the data are significantly non-normally distributed, even when the histogram shows a reasonably normal shape. Evaluations about distribution made by visual examination of histograms are sufficient in most real-world situations.

APPENDIX 6D: WHY ARE SOME REAL-WORLD VARIABLES APPROXIMATELY NORMALLY DISTRIBUTED?

Your reaction to the equations for the normal distribution in Appendix 6A may be like a story told by Wigner (1960), briefly paraphrased here. A statistician shows a student the equation for the normal distribution and says that this equation describes the distribution of heights in a population. The student asks, “What is that symbol?” The statistician replies, “Pi (π), the ratio of the circumference of a circle to its diameter.” The student says, “You must be joking. What can the circumference of a circle possibly have to do with the distribution of heights?” This is indeed very odd when you stop to think about it. It is an example of what Wigner called “the unreasonable effectiveness of mathematics.”

The mathematics for the normal distribution evolved from probability theories, developed in the 18th century as a way of answering questions asked by gamblers, such as, What is the probability that I will get heads six times in a row when tossing a coin? Later statisticians discovered that the mathematics developed to predict gambling outcomes turned out to describe distribution shapes for some variables, such as height, fairly well. The normal distribution equation and related functions provide the foundation for some of the most common procedures in statistics that you will learn, including confidence intervals and statistical significance tests. The mathematical answer to a question about gambling turned out to have applications the initial developers could not have imagined. It is strange that developments in mathematics sometimes precede empirical discoveries about the way the world works.

Many (but not all!) real-world variables, such as height, have scores that are approximately (but rarely exactly) normally distributed. Why does this happen? Chance or randomness plays a part.
Here is some historical background. Gamblers have always been interested in probability. For example, what is the probability of getting heads when a fair coin is tossed? If the coin is fair, the probability of heads is 50% or .50; the probability of tails also is 50% or .50. Gamblers also wanted to answer questions such as, What is the probability of getting six heads in a row? Each coin toss is independent of other coin tosses, which means that we can work out the probability of a series of outcomes by multiplication. Given that the proportion of heads = .50 for each toss, the probability of getting heads six times in a row is .5 × .5 × .5 × .5 × .5 × .5 = (.5)^6 = .015625, or about 1.6%. In other words, this is a very unlikely outcome. This outcome can happen only one way (heads on all six tosses). Other outcomes can happen several ways; for example, we can obtain three heads (H) and three tails (T) in many sequences, such as HHHTTT, HTHTHT, THTHTH, and so forth. Because there are so many ways to obtain three heads, that outcome has a higher probability than HHHHHH. (Formally, this situation can be described using terms such as Bernoulli trials and binomial distribution; some introductory statistics books include entire chapters about the binomial distribution.)
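The counting argument above can be written out as a short calculation. This is a minimal sketch using the standard binomial formula (number of sequences with k heads, times the probability of any one sequence); the function name prob_heads is hypothetical and is not part of the original text.

```python
from math import comb


def prob_heads(k, n=6, p=0.5):
    """Probability of exactly k heads in n independent tosses.

    comb(n, k) counts the distinct sequences that contain k heads;
    each such sequence has probability p**k * (1 - p)**(n - k).
    """
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Number of sequences and probability for each possible number of heads.
for k in range(7):
    print(f"{k} heads: {comb(6, k):2d} sequences, probability {prob_heads(k):.6f}")
```

Six heads can occur in only 1 of the 64 equally likely sequences, whereas three heads can occur in 20 of them, which is why three heads is so much more probable than HHHHHH.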

What happens if a large number of gamblers (let’s say 200) each do six coin tosses? What will the distribution of outcomes look like? Your intuition may tell you that many gamblers will obtain three heads and three tails, or four heads and two tails, or two heads and four tails. Very few gamblers will obtain six heads. A teacher could run this experiment by having 200 students in a large classroom each toss a coin six times and then have each student report the number of heads obtained. That might be fun, but there is a quicker way to get comparable results. A physical device called a quincunx or Galton board can be used to represent binary outcomes (such as heads vs. tails) for a series of events (such as six consecutive coin tosses).

To see how this process works, I suggest that you do the following two things. First, view this brief YouTube video: https://www.youtube.com/watch?v=6YDHBFVIvIs. This video shows how dropping small metal balls through a physical device can represent a series of binary outcomes. Keep this analogy in mind: Each time a ball hits a nail and goes left or right, this can represent any random binary outcome, for example, whether a gambler gets heads or tails on a coin toss. Second, go to the webpage at https://www.mathsisfun.com/data/quincunx.html to run an app that models the same process (Pierce, 2017). A screenshot appears in Figure 6.22. The small dots that are arranged in a triangle correspond to the nails in the physical Galton board in the video. The larger dots show the individual balls as they fall down through the set of nails. (This diagram does not show the funnel at the top to feed balls into the device.)
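The same experiment can also be approximated in a few lines of code. The sketch below is not the online app or the physical Galton board; it is a hypothetical stand-in that uses Python's random number generator to let 200 simulated gamblers each toss a fair coin six times. The function name simulate_gamblers and the seed value are assumptions made for illustration, and the counts will differ from run to run unless a seed is fixed.

```python
import random
from collections import Counter


def simulate_gamblers(n_gamblers=200, n_tosses=6, p_heads=0.5, seed=None):
    """Return a list of bin counts: entry k is the number of gamblers
    who obtained exactly k heads in n_tosses tosses."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_gamblers):
        # Each toss is heads with probability p_heads (like a ball going right).
        heads = sum(rng.random() < p_heads for _ in range(n_tosses))
        counts[heads] += 1
    return [counts[k] for k in range(n_tosses + 1)]


bins = simulate_gamblers(seed=1)
for heads, n in enumerate(bins):
    # A crude text histogram: one '#' per gambler in the bin.
    print(f"{heads} heads: {'#' * n} ({n})")
```

Rerunning with a different seed changes the exact counts, but the middle bins should generally hold far more gamblers than the bins for zero or six heads.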

Figure 6.22 Screenshot From an Online Quincunx Simulator

Imagine this situation from the perspective of one of the larger balls (if you can). A large ball drops straight down and hits the first peg. It can be deflected either toward the right side or the left side. By chance, the ball should go right 50% of the time and left 50% of the time. Now the ball hits a second peg. Again, it can go right or left. By the time the ball reaches the bottom and falls into one of the bins, it has hit six pegs (one in each row of the board in Figure 6.22). Each peg represents one random event (for example, one coin toss) with two possible outcomes: right or left, heads or tails. If we put one ball into the top and let it drop through, that represents the outcome for one gambler doing a series of six coin tosses. That gambler’s ball will end up in just one of the seven bins (there are seven bins because number of heads can be 0, 1, 2, 3, 4, 5, or 6). Now suppose that we line up 200 gamblers and have each one put a ball into the top. We are now doing a simulation to find out what percentage of the 200 gamblers get six heads, what percentage get equal numbers of heads and tails, and so forth.

To run the coin toss experiment using the app in the screenshot in Figure 6.22, you input the following information. First, how many coin tosses do you want to know about? How many times will each gambler toss a coin? How many gamblers will you examine? Suppose your question is, What are the outcomes for a sequence of six coin tosses? If you choose 6 for size, this sets up a Galton board that has six rows of “nails” or choice points. To specify that the chance of left/right or tails/heads is 50%/50%, move the slider in the “Left/Right” line. Each time a large ball falls through this mass of choice points, it has a 50% chance of going left (tails) and a 50% chance of going right (heads). (You can adjust the speed of the simulation, but this has no effect on the results.) To obtain the final result in Figure 6.22, I stopped the simulation when 200 “balls” had fallen into bins in the bottom. This corresponds to the imaginary situation in which 200 gamblers (or students) have each tossed a coin six times.

The histogram at the bottom of Figure 6.22 summarizes the outcomes when 200 gamblers each report how many heads they obtained. Number of heads can range from zero to six (for six coin tosses). Only 1 gambler obtained six heads (and no tails), while 71 gamblers obtained three heads (and three tails). Balls that ended up in the far right-hand bin (and there will not be many of them) have gone right six times. If going right represents heads, the balls in the far right-hand bin represent gamblers who obtained six heads. The bin on the far left of the page represents zero heads (or six tails). The bins in between the two extreme bins represent outcomes that include three heads and three tails, two heads and four tails, and so forth. The outcome of this simulation appears at the bottom of Figure 6.22. This is the number of balls in each bin, and it can be interpreted as a histogram, with number of heads as the X variable and number of gamblers as the Y variable. Many gamblers ended up in bins near the middle because they obtained three heads and three tails or four heads and two tails. Smaller numbers of gamblers end up in the bins at the far left (zero heads) and the far right (six heads).

This informal example illustrates something that is more formally called a binomial distribution. Using simple mathematics, it is possible to predict the proportions of outcomes in the bins at the bottom. As the number of events in the sequence (in this example, the number of coin tosses) and the number of gamblers increase, and if the probability of heads on each trial is 50%, the binomial distribution converges to a normal distribution. If you are a gambler, the moral of the story is, don’t bet on getting six heads in a row. It’s very unlikely to happen.

So what does this have to do with height? Let’s suppose that there are six genes that influence height. Let’s also suppose that each gene has two different forms (alleles) and that one form causes a person to be taller, and the other causes a person to be shorter. Let’s suppose that at conception, the probability of getting the “tall” versus the “short” allele for each of the six genes is 50%. We can make the analogy that getting the tall allele for each gene is like getting heads on each coin toss. Very few people will get all six tall alleles and end up being very tall or zero tall alleles and end up being very short; most people will receive some of both alleles and end up being moderate in height, with a height distribution that resembles the frequencies for coin toss sequences in Figure 6.22.

Of course, this is not an accurate description of factors that influence height or stature. There may be more than 400 genes that influence height (Boston Children’s Hospital, 2014). Some genes may have more than two forms or alleles. The selection of alleles an individual child can inherit depends on the alleles his or her parents have.
In addition, environmental factors such as malnutrition affect adult stature. (Look up stature if you want to know additional interesting things about factors that affect height.) However, human heights are fairly normally distributed, even if the process is far more complicated than illustrated by the quincunx or Galton board example.

Formally, a mathematician would say, As N (of gamblers) increases, and if the number of events in the sequence is sufficiently large, and if the probability of success on each trial (such as getting heads) is 50%, this distribution converges to a normal distribution. Mathematicians (including de Moivre in 1733, Laplace in 1783, Adrain in 1808, and Gauss in 1809) worked out the mathematical details (that is, the equations in Appendix 6A and the properties of the normal distribution). Quetelet, in 1835, was the first to notice that this distribution shape approximately matches the outcomes for some measurements of human characteristics, such as height. This, as Wigner (1960) pointed out, is a dramatic example of “the unreasonable effectiveness of mathematics.” The development of the binomial and normal or Gaussian distributions was related to questions about outcomes in random processes such as coin tossing. The observation that heights and some other variables have approximately normal distribution shapes came later. Keep in mind that not all variables have normal distributions, and in fact, empirical distributions in real data are often not close to normal in shape (Micceri, 1989).

APPENDIX 6E: SAVING z SCORES FOR ALL CASES

If you want z scores for all the cases in your data set, so that you can use z-score values to identify outliers, this is the quickest way to do it. Make the following SPSS menu selections: <Analyze> → <Descriptive Statistics> → <Descriptives>. In the SPSS Descriptives dialog box, shown in Figure 6.23, move the variable for which you want z scores into the “Variable(s)” pane. This example uses BMI (body mass index). Place a check in the checkbox in the lower left corner next to “Save standardized values as variables.” Click OK to run. When you return to Data View, the far right-hand column has a new variable, ZBMI, which contains the z scores for BMI. To sort the scores in this column, right-click the column and select <Sort Ascending> from the pull-down menu (Figure 6.24). This will produce the sorted list depicted in Figure 6.25.
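For readers working outside SPSS, the same steps (standardize, sort, inspect the extremes) can be sketched in code. This is an illustration under stated assumptions, not the SPSS procedure itself: the BMI values are made up, the function names z_scores and flag_outliers are hypothetical, the standard deviation uses n - 1 in the denominator, and the 3.29 cutoff is the Tabachnick and Fidell rule applied to Figure 6.25.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize each score: z = (X - M) / SD, with SD computed using n - 1."""
    m = mean(values)
    s = stdev(values)
    return [(x - m) / s for x in values]


def flag_outliers(values, cutoff=3.29):
    """Return sorted (z, raw score) pairs whose |z| exceeds the cutoff."""
    pairs = zip(z_scores(values), values)
    return sorted((z, x) for z, x in pairs if abs(z) > cutoff)


# Hypothetical BMI scores: 20 ordinary values plus one extreme case.
bmi = [20, 21, 22, 23, 24] * 4 + [60]

# Sorted z scores, analogous to sorting the ZBMI column in ascending order.
for z, x in sorted(zip(z_scores(bmi), bmi)):
    print(f"BMI = {x:5.1f}   z = {z:6.2f}")

print("Flagged as outliers:", flag_outliers(bmi))
```

Only the extreme score is flagged in this made-up data set; whether to remove such a case remains a judgment call, as discussed in Appendix 6B.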

Figure 6.23 SPSS Descriptives Dialog Box

Figure 6.24 Command to Sort Scores in Column

Figure 6.25 Sorted z Scores for BMI

Using Tabachnick and Fidell’s (2018) suggested rule (to identify cases with z scores greater than 3.29 in absolute value as outliers), Cases 1254 through 1260 would be called outliers. See Appendix 6B for a way to remove these cases from the active data file using the Select Cases command.

COMPREHENSION QUESTIONS

1. Suppose you obtain a score on a test of “obfuscation perspicacity.” (Obviously you would have no idea what this test measures given this odd name.) Suppose you know that scores on this test are normally distributed. What information would you need about test scores for other students to figure out whether your score was low, high, close to the mean, or far from the mean? What would you do with that information (what equations or figures or tables would you use) to figure out the percentage of persons who have scores lower than yours?

2. Suppose you have normally distributed IQ scores with M = 100 and SD = 15. Find the value of z for each of these IQ scores, look up the area below z (if possible; some of these values are not included in tables and diagrams, and for these, just give approximations). Briefly describe location in words using terms such as slightly below average or extremely above average. (Note that distinctions of “far” versus “very far” versus “extremely far” are somewhat subjective.) Show your work for the z score.

3. Finding X from z (given that X is IQ with M = 100 and SD = 15):

o If a person has a z score of –.75, what is the corresponding IQ score?
o If a person has a z score of +1.96, what is the corresponding IQ score?
o If a person has a z score of 0.00, what is the corresponding IQ score?

4. What values of z divide the normal distribution in each of the following ways?

Middle 68%
a) Fifty percent above and 50% below
b) Bottom .5%, middle 99%, top .5%
c) Bottom 10%, middle 80%, top 10%
d) Bottom 2.5%, middle 95%, top 2.5%
e) Bottom 99%, top 1%

5. Does it make sense to find percentile rank using z scores and the standard normal distribution table when the empirical frequency distribution is extremely non-normal? Why or why not?

DIGITAL RESOURCES

Find free study tools to support your learning, including eFlashcards, data sets, and web resources, on the accompanying website at edge.sagepub.com/warner3e.

Descriptions of Images and Figures

At the top is a scale that shows the percentage of area that falls under the curve between different z scores. Minus 1 to plus 1 is about 68 percent. Minus 2 to plus 2 is around 95 percent and minus 3 to plus 3 is close to 99.7 percent. There are 6 z scores, 3 each on both the positive and negative side of 0. The area covered is:

• 0 to plus 1 and 0 to minus 1 each correspond to 34.13 percent
• Plus 1 to plus 2 and minus 1 to minus 2 each correspond to 13.59 percent
• Plus 2 to plus 3 and minus 2 to minus 3 each correspond to 2.14 percent
• The outer edges beyond minus 3 and plus 3 each correspond to 0.13 percent

The values of z from 1.83 to 2.12 and the values under B and C have been shown as a table. At the top and bottom, there are figures showing the extent of area covered by B and C. One specific z value, 1.96, which corresponds to .4750 for B and .0250 for C, has been highlighted. Table values:

There are four diagrams that show the area between 0 and z as well as beyond z for positive and negative values of z. The first diagram highlights the area between 0 and a positive value of z in a normal distribution diagram. The area to the left of 0 has been marked as Area below 0 equals 50 percent. The second diagram, to the right of the first, highlights the area beyond positive z in a normal distribution diagram. This has been shown as the Area above positive z. The third diagram, below the first, highlights the area between 0 and a negative value of z in a normal distribution diagram. The fourth diagram, to the right of the third, highlights the area beyond negative z in a normal distribution diagram.

The X axis denotes the z score and ranges from minus 3 to plus 3, with 0 as the center. The area between minus 1 and plus 1 is under the highest part of the curve and has been termed Close to M. The area between minus 1 and minus 2 and plus 1 and plus 2 has been termed Between. The area between minus 2 and minus 3 and plus 2 and plus 3 has been termed Far. The outer edges beyond minus 3 and plus 3 have been termed Very far.

The X axis denotes the z score and ranges from minus 3 to 1.96, with 0 as the center. The area between plus 1.96 and minus 1.96 has been termed Middle 95 percent. The area beyond minus 1.96 is called the Bottom 2.5 percent and beyond plus 1.96 has been termed Top 2.5 percent.

The X axis denotes the z score and ranges from minus 2.576 to plus 2.576, with 0 as the center. The area between plus 2.576 and minus 2.576 has been termed Middle 99 percent. The area beyond minus 2.576 is .005 (0.5 percent) and beyond plus 2.576 is .005 (0.5 percent).

The first image shows a curve with the high end of the curve on the left and a long tail on the right. The curve shows positive skewness and the long tail is on the high end. The SPSS skewness is greater than 0. The second image is of a normal curve that has perfect symmetry. The tails at either end are even and the high part of the curve is over the center. The SPSS skewness is equal to 0. The last image shows a curve that has a long tail to the left and the high part of the curve is on the extreme right. The curve shows negative skewness and SPSS skewness is less than 0.

The X axis denotes the items on a quiz and ranges from 0 to 8. The Y axis denotes the number of correct answers and ranges from 0 to 250. There are nine bars, drawn as a histogram. The heights of the bars from left to right are: 210, 70, 60, 45, 30, 15, 20, 15, 20. A curve follows the bars; its tail on the right is long and the curve is higher towards the left.

The X axis denotes the scores on an exam and ranges from 0 to 90. The Y axis denotes the frequency and ranges from 0 to 30. There are fifteen bars visible on the histogram. Most of the bars on the left are close to 0, and the ones closer to the right are higher. A curve follows the bars; its tail on the left is long and the curve is higher towards the right.
The X axis indicates the number of servings of fruits and vegetables per day and ranges from 0 to 8. The Y axis denotes the percentage and ranges from 0 to 50 percent. There are 9 bars and their approximate heights are:

• 0: 42 percent
• 1: 15 percent
• 2: 14 percent
• 3: 9 percent
• 4: 5 percent
• 5: 2 percent
• 6: 3 percent
• 7: 2 percent
• 8: 3 percent

The source of the data has been attributed to Warner, Frye, Morrell, and Carey.

The horizontal axis represents the height in inches and ranges from 58 to 70 inches, in increments of 2. The vertical axis represents the height in centimeters and ranges from 140 to 180 centimeters, in increments of 10. A best fit line connects the data points that represent how the heights in the different units are related. The best fit line rises from the bottom left of the graph to the top right, showing a positive correlation. A text inside the graph states Female cm equals 2.54 times female inch.

The X axis denotes the IQ score and ranges from 55 to 145, with 100 as the center. The standard deviation has been provided as 15 and the mean is 100. The area on either side of the mean, that is, between 100 and 115 on the right as well as between 85 and 100 on the left, is equal to 34.13 percent. The area between 115 and 130 on the right and 70 and 85 on the left is equal to 13.59 percent each. The area between 130 and 145 on the right and 55 and 70 on the left is equal to 2.14 percent each. The area beyond 145 on the right and 55 on the left is 0.13 percent each. At the top of the figure, three lines show the area under the curve. The area under minus 1 to plus 1 is 68 percent. The area under minus 2 to plus 2 is 95 percent. The area under minus 3 and plus 3 is around 99 percent.

The X axis denotes the z score and ranges from minus 3 to plus 3, with 0 as the center. The standard deviation has been provided as 1 and the mean is 0. The area on either side of the mean, that is, between 1 and 0 on the right as well as between minus 1 and 0 on the left, is equal to 34.13 percent. The area between plus 1 and plus 2 on the right and minus 1 and minus 2 on the left is equal to 13.59 percent each. The area between plus 2 and plus 3 on the right and minus 2 and minus 3 on the left is equal to 2.14 percent each. The area beyond plus 3 and minus 3 on either side is 0.13 percent. At the top of the figure, three lines show the area under the curve. The area under minus 1 to plus 1 is 68 percent. The area under minus 2 to plus 2 is 95 percent. The area under minus 3 and plus 3 is around 99 percent.

On the left are variables, namely, sex, hr, temp underscore Fahrenheit, and temp underscore Celsius. The right has check boxes to select cases. There are five choices: all cases, if condition is satisfied, random sample of cases, based on time or case range, and use filter variable. The second choice, If condition is satisfied, has been checked. Below this is the output section. There are three choices in the check boxes here: filter out unselected cases, copy selected cases to a new dataset, and delete unselected cases. The first option has been selected. A statement “Current Cases: Do not filter cases” is below this. At the bottom of the dialog box are option buttons for the following: OK, Paste, Reset, Cancel and Help.

On the left is a set of variables, namely, sex, hr, temp underscore Fahrenheit, and temp underscore Celsius. Temp underscore Fahrenheit has been selected. A box on the right shows the condition, temp underscore Fahrenheit less than 100. Below this is a keypad with numbers and special characters that is used to input the variable specifications.

At the extreme right is a box showing Function groups, of which the following are visible: all, arithmetic, CDF and noncentral CDF, conversion, current date or time, date arithmetic and date creation.

The image is a screenshot of the select cases dialog box demonstrating the use of the radio button for selecting all cases without excluding any outliers. On the left are variables, namely, sex, hr, temp underscore Fahrenheit, and temp underscore Celsius. The right has radio buttons to select cases. There are three choices visible: all cases, if condition is satisfied, and random sample of cases. The first choice, All cases, has been selected.

NCIFV

• N: valid – 492
• N: missing – 0
• Mean – 1.86
• Median – 1
• Mode – 0
• Std Deviation – 2.327
• Skewness – 1.273
• Std Error of Skewness – .110

On the left are the variables sex, hr, and temp underscore Celsius. On the right is a box showing the dependent list, into which the variable temp underscore Fahrenheit has been moved. There are two empty boxes below this. One is the Factor List and the second is the Label cases by box. Below these boxes are radio buttons to select display options: Both, Statistics and Plots. On the far right are buttons for Statistics, Plots and Options. At the bottom of the dialog box are option buttons for the following: OK, Paste, Reset, Cancel and Help.

The output shown is from the Kolmogorov-Smirnov and Shapiro-Wilk tests. Details are below:

• Kolmogorov-Smirnov
o X; Statistic, .236; df, 102; sig., .000
• Shapiro-Wilk
o X; Statistic, .775; df, 102; sig., .000

The values for Kolmogorov-Smirnov have a note at the end which states: Lilliefors Significance Correction.

The X axis denotes the observed values and ranges from 0 to 25, rising in increments of 5. The Y axis denotes the expected values and ranges from minus 2 to 4. The line drawn through the plotted points is a straight line that covers most of the points.

A note at the top states: The Quincunx is an amazing machine. Pegs and balls and probability! Have a play then read the Quincunx explained. The image consists of a set of controls to manipulate the size of the board, its speed, and the chance of a ball going left or right. The size has been fixed as 6, the left and right indicator is 50 percent and the speed is 10. There are buttons on the right to Restart and for Data. A Play and pause button is below this. A histogram that shows number of gamblers as the Y axis and the number of heads as the X axis is at the bottom of the image. The height of the bars from left to right is: 2, 12, 52, 71, 45, 14, 1. Three balls are descending from the top; one has been marked as one gambler and another sign states one toss. The source of the data is Math Is Fun.

On the left, there is a set of variables which can be chosen for further analysis by moving to the right. The variables that can be seen are beer2, beerozday, beerperday, beerservingounce, BMI4group, BMInormal, c1, c2, and c3. The selected variable is BMI, which has moved to the box on the right. The extreme right has two buttons for Options and Style. Below, there is a check box to Save standardized values as variables. This has been selected.

At the bottom of the dialog box are option buttons for the following: OK, Paste, Reset, Cancel and Help.

To sort the scores, right-click the column and select Sort Ascending from the pull-down menu. The complete list of available options in the menu is: cut, copy, copy with variable names, copy with variable labels, paste, clear, insert variable, sort ascending, sort descending, variable information, descriptive statistics and spelling.

The image is a table that shows what the sorted z scores for BMI look like. The data are listed below: