Data Analysis


2 Illustrating Data


Chapter Learning Objectives

After reading this chapter, you should be able to do the following:

1. Organize measures into frequency distributions, ordered arrays, and stem-and-leaf plots.

2. Create pie charts, bar graphs, and frequency polygons using Excel.

3. Describe the components of data normality.

4. Judge data normality by performing manual calculations and by using Excel output.

5. Develop tools to identify outliers.

tan82773_02_ch02_029-060.indd 29 3/3/16 9:58 AM

© 2016 Bridgepoint Education, Inc. All rights reserved. Not for resale or redistribution.

Section 2.1 From Description to Display

Introduction

People who like to organize things will especially like this chapter. What we cover here can be particularly helpful in an age where we are exposed to much more data than we can absorb. When the material is irrelevant, this data overload is not a problem, but when the information is important, we need ways to retain it. This chapter offers some solutions involving visual data displays, which an anecdote will help to illustrate.

During World War II, a British analyst was assigned to recommend to aircraft builders the points on airframes that should be reinforced with armor plating. Too much armor plating and the aircraft would lose maneuverability and range; too little and it would become too vulnerable to enemy fire. The analyst examined aircraft returning from combat, noted which areas showed damage, and drew pictures of the places where they had been hit. He recommended reinforcing the areas where the returning planes had not been damaged. How counterintuitive was that? As illogical as his approach seems, he reasoned that if the damage had been fatal to either the pilot or the aircraft’s ability to fly, the airplanes he examined would not have returned. So damage to the other areas was apparently the most serious, and those were the areas that needed the most protection.

This story is a lesson in the value of clarifying relationships with visual displays. Certainly, mathematical manipulation and statistical procedures are required at times, but often a necessary first step to understanding a data set is to arrange the data so that they can be visually analyzed. The understanding researchers gain from observation can then guide the mathematical analyses that follow.

Chapter 1 emphasized the descriptors and the statistical shorthand that allow us to classify and describe groups of data. That chapter limited descriptions to the scale of the data and the measures of central tendency and variability that allow data summaries. This chapter uses visual display for some of the same purposes and expands the applications for descriptive statistics.

2.1 From Description to Display

The study of statistics has an incremental nature: Each step becomes part of a more involved process later, which makes grasping the early topics important, since they are building blocks for subsequent ones. For now, we will use what we know about data scale and descriptive statistics to arrange measures into the tables and figures that reveal the multiple dimensions of numerical data. Although the stakes for us may be different than they were for the British warplane analyst, the issues are important nevertheless.

Most audiences are more engaged by a visual display than by a text presentation. When a good deal of data must be communicated in a short time, a visual display serves as a good place to begin. The discussions that follow suggest some of the more common procedures for representing different kinds of data, if only to introduce them briefly. For someone interested in a more in-depth discussion, books by authors such as Friendly (2000) and Tufte (2001) will be helpful. Tufte in particular has a reputation for innovative and informative data displays.

Data distributions of one sort or another are ubiquitous. A glance at the latest news reports indicates how unemployment numbers have changed during the year. Checking how the stock market has fluctuated over today’s trading session indicates highs, lows, and the volume of trading. The fact that data fluctuate makes them interesting. Data that either all have the same value or that always occur in the same proportions leave little to be analyzed. They interest us much less than data for which proportions and frequencies change.

Frequency Distributions

Scores on most measures vary, but the variation will generally have some repetition. Whether they are college admissions test results or the scores on a statistics quiz, not all scores are equally likely; some will occur more frequently than others. Frequency distributions indicate the number of measures in a data set that have the same characteristic. They allow us to display scores in terms of both their variability and their frequency of occurrence.

Suppose a state board administers a licensing test for marriage and family counselors. Rather than report every individual score, the board finds it more economical to report test results in categories:

Meritorious

Exceeds Expectations

Pass

Pass with Exceptions

Fail

Consider the following example: A group of 25 graduates of State U’s marriage and family counseling program takes the test. Table 2.1 shows the group’s results.


Table 2.1: A frequency distribution for licensing test results

Licensing test results f

Meritorious 4

Exceeds expectations 6

Pass 8

Pass with exceptions 4

Fail 3

Total 25

Table 2.1 depicts a frequency distribution, with the symbol f indicating the number of scores that occur in a particular category. If each individual score had been entered rather than being grouped into categories, the result would have been a table with 25 discrete entries. Instead, the data in Table 2.1 represent a grouped frequency distribution. Such a table provides a compact presentation when there are many scores.

Ordered and Disordered Arrays

Table 2.1 is divided into categories, but if each of the 25 results were listed in ranked order from the four that were meritorious down to the three fails, the display would reflect an ordered array. If instead of listing them from highest to lowest, the board arbitrarily piled all the scores into the table, it would show, not surprisingly, a disordered array. In such a table, for example, although the meritorious scores would still occur as a group, they would be in no particular order. Table 2.1 is a much shorter display than either an ordered or a disordered array.

When sample sizes are comparatively small (15 or 20 scores from a larger population, for example), the type of presentation is not an issue, but presentation would be a greater issue if the frequency distribution included data for every aspiring marriage and family counselor in the state who took the licensing test. Even if hundreds of scores were being reported, a grouped frequency distribution would have the same number of rows as Table 2.1. Frequency distributions, then, can make a presentation compact.

Jokela (2012) studied whether associations between individuals’ personality traits and whether they have children are affected by when they were born. Table 2.2 is part of his subjects’ description. It shows the birth cohort, or particular period of birth, and gender for 6,259 subjects (2,971 men and 3,288 women) in a relatively compact display.

Class Intervals

The “groups” in grouped frequency distributions (the birth cohorts in Table 2.2) are called class intervals. Although they provide an economical data presentation and make a great deal of data accessible to even a casual observer, some details are inevitably lost. It is not apparent from studying Table 2.1, for example, which numerical test scores belong to a particular class interval. We can address that deficiency by incorporating a list of score ranges, which might be the following:

28–34 Meritorious

21–27 Exceeds Expectations

14–20 Pass

7–13 Pass with Exceptions

0–6 Fail

With the ranges, we know how scores were classified, but it still is not apparent exactly how one individual whose score is in the “pass” interval, for example, scored. The person could have scored anywhere from 14 to 20. We know only the category. The same difficulty emerges in Table 2.2. The table shows 347 female subjects in the 1920–1929 birth cohort, but it does not make any distinction within the 1920–1929 group, a range of 9 years.
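As a rough illustration of how raw scores collapse into class intervals, the sketch below tallies 25 whole-number scores against the score ranges listed above. The scores are the hypothetical individual results this chapter uses when it calculates the actual mean; the names `CLASS_INTERVALS` and `classify` are introduced here only for the example.

```python
from collections import Counter

# Apparent limits for each class interval, taken from the score ranges above.
CLASS_INTERVALS = [
    ("Meritorious", 28, 34),
    ("Exceeds Expectations", 21, 27),
    ("Pass", 14, 20),
    ("Pass with Exceptions", 7, 13),
    ("Fail", 0, 6),
]

# Hypothetical whole-number scores for the 25 test takers (the same
# illustrative scores the chapter lists when it computes the actual mean).
scores = [34, 33, 33, 29, 26, 26, 24, 23, 23, 22,
          20, 19, 19, 18, 17, 15, 15, 14,
          12, 11, 9, 8, 6, 3, 1]


def classify(score):
    """Return the label of the class interval whose range contains score."""
    for label, low, high in CLASS_INTERVALS:
        if low <= score <= high:
            return label
    raise ValueError(f"score {score} is outside every class interval")


# Count how many scores fall into each class interval.
frequencies = Counter(classify(s) for s in scores)

for label, low, high in CLASS_INTERVALS:
    print(f"{low:>2}-{high:<2}  {label:<22} f = {frequencies[label]}")
```

The tally reproduces the f column of Table 2.1, and it also shows what grouping discards: once a score has been classified, only its category remains.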

If we cannot know precisely how a particular individual scored, or the exact year in which a subject was born (Table 2.2), the data can at least be roughly ranked. Clearly, those in Table 2.1 who “exceeded expectations” did better than those in the pass category, although exactly how much better is not indicated.

Estimating the Mean from a Class Interval

Indicating the score frequencies in the class intervals reduces the scores to values that can be ranked approximately. Even without the individual scores, we can use the class intervals to estimate the mean of the scores. To estimate the mean from class intervals,

1. Determine the midpoint of each class interval.
2. Multiply each midpoint by the number of scores in its interval.
3. Sum the products.
4. Divide the sum of the products by the total number of scores.

Table 2.2: A grouped frequency distribution of subjects’ birth cohort

Birth year Men (2,971) Women (3,288)

1914–1919 0 0

1920–1929 316 347

1930–1939 498 614

1940–1949 732 795

1950–1959 816 802

1960–1969 585 707

1970–1979 24 23

Source: Jokela, M. (2012). Birth-cohort effects in the association between personality and fertility. Psychological Science, 23, 835–841.

Try It!: #1 According to the discussion of the scale of data in Chapter 1, what scale do data categories such as meritorious, exceeds expectations, and so on indicate?


To see how accurate the estimated mean is, using the data in Table 2.1, we will first calculate the actual mean. Perhaps for the licensing test data in the grouped frequency distribution above, the individual scores were the following:

Meritorious: 34, 33, 33, 29

Exceeds Expectations: 26, 26, 24, 23, 23, 22

Pass: 20, 19, 19, 18, 17, 15, 15, 14

Pass with Exceptions: 12, 11, 9, 8

Fail: 6, 3, 1

Using the formula for the mean, M = ∑x / n, verify that 460 / 25 = 18.40.

Now, to estimate the mean based on the class intervals, follow these four steps:

1. Determine the midpoint of each class interval by

a) adding the two possible extreme scores within each interval (not the actual scores) and then

b) dividing by 2.

For

Meritorious: (28 + 34)/2 = 31

Exceeds Expectations: (21 + 27)/2 = 24

Pass: (14 + 20)/2 = 17

Pass with Exceptions: (7 + 13)/2 = 10

Fail: (0 + 6)/2 = 3

2. Multiply the midpoint values from Step 1 by the number of scores in the interval.

31 × 4 = 124

24 × 6 = 144

17 × 8 = 136

10 × 4 = 40

3 × 3 = 9

3. Sum Step 2’s products (the midpoints times the number of values).

124 + 144 + 136 + 40 + 9 = 453

4. Divide the sum of the products from Step 3 by the number of scores.

453/25 = 18.12

The actual mean is 18.40. The estimated mean is 18.12.
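The four steps above can be sketched in a few lines of Python. This is only an illustration under the chapter's own numbers: the class limits and frequencies come from Table 2.1 and the score ranges, the individual scores are the hypothetical ones listed earlier, and the variable names are made up for the example.

```python
# Each entry: (lower apparent limit, upper apparent limit, frequency).
intervals = [
    (28, 34, 4),  # Meritorious
    (21, 27, 6),  # Exceeds Expectations
    (14, 20, 8),  # Pass
    (7, 13, 4),   # Pass with Exceptions
    (0, 6, 3),    # Fail
]

# Steps 1-3: midpoint of each interval, times its frequency, summed.
sum_of_products = sum((low + high) / 2 * f for low, high, f in intervals)

# Step 4: divide by the total number of scores.
n = sum(f for _, _, f in intervals)
estimated_mean = sum_of_products / n

# Actual mean from the hypothetical individual scores, for comparison.
scores = [34, 33, 33, 29, 26, 26, 24, 23, 23, 22,
          20, 19, 19, 18, 17, 15, 15, 14,
          12, 11, 9, 8, 6, 3, 1]
actual_mean = sum(scores) / len(scores)

print(f"estimated mean = {estimated_mean:.2f}")
print(f"actual mean    = {actual_mean:.2f}")
print(f"difference     = {actual_mean - estimated_mean:.2f}")
```

The estimate needs only the class limits and the frequencies, which is why it remains available when the raw scores are not.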


Because this is an estimate, there will generally be a minor discrepancy between the value estimated from the class intervals and the actual value of the mean. In this example, the difference between the estimated and actual mean is 0.28. As the number of values in the data set increases, the discrepancy will usually diminish. The point is that with only the values that constitute the class intervals and the number of scores in each interval, it is possible to estimate the value of the mean. That can be helpful in a data summary when the original scores are unavailable, as is the case for data in Table 2.2. Whenever the value of M is estimated from the class intervals, any reporting of the value must clearly state that it is an estimate and that it was not calculated directly from the raw data.

The Difference Between Apparent and Actual Limits

For the licensing data, the scores are all whole numbers: integers. This makes creating the class intervals easy, but researchers often work with data that include decimal values, and class limits must accommodate any value between the highest and lowest integers. The highest and lowest integers in the category represent the apparent limits of the class interval. For example, in Table 2.1’s meritorious category, the apparent limits are 28 and 34. If the scores do not involve decimal values, determining class limits does not pose a problem, but sometimes decimals are part of the data being represented. A student’s grade point average, for example, is likely to have a decimal value. Ordinary grading procedures also often include decimals. If the lower limit for A work is 90% and the upper limit for B work is 89%, to which class interval does 89.5% belong?

To accommodate any value, class intervals must have actual limits in addition to apparent limits. In the case of grade averages and a great many other kinds of data, the class interval actually extends from a half point below the lowest whole number in the interval to a half point above the highest. That means the lower limit for an A would be 89.5%. For the 21–27 class interval (exceeds expectations), the actual limits are 20.5 to 27.5. If we subtract the lower from the upper actual limit, we have the width of the class interval: 27.5 − 20.5 = 7.0.

That difference between the actual limits is the same as the number of whole numbers in the 21–27 apparent limits. In this case, that includes 21, 22, 23, 24, 25, 26, 27 or seven whole numbers.
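A minimal sketch of the relationship between apparent and actual limits follows. The function name `actual_limits` and its `unit` parameter (the size of one measurement step, assumed here to be one whole point) are assumptions made for the example, not part of the chapter.

```python
def actual_limits(apparent_low, apparent_high, unit=1.0):
    """Extend apparent limits by half a measurement unit in each direction."""
    half = unit / 2
    return apparent_low - half, apparent_high + half


# The 21-27 class interval (exceeds expectations).
low, high = actual_limits(21, 27)
width = high - low

# Count the whole numbers inside the apparent limits for comparison.
whole_numbers = list(range(21, 27 + 1))

print(f"actual limits: {low} to {high}")
print(f"width of the class interval: {width}")
print(f"whole numbers in 21-27: {len(whole_numbers)}")
```

The width from the actual limits and the count of whole numbers in the apparent limits agree, which is the point the paragraph above makes.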

In our licensing example, the use of actual limits involves a problem apparent limits did not present: The lower actual limit for exceeds expectations is the same as the upper actual limit for pass. Both are 20.5. So when scores happen to include whole numbers and decimals, where does a score like 20.5 belong? Sheskin’s (2004) solution is to adopt a rule. Such a rule could dictate, for example, that if the first digit of the score in question is an odd number, the score falls in one interval (perhaps the upper), and if that digit is an even number, it falls in the lower interval. Under that rule, someone scoring 20.5 would receive a pass rating, because the first digit, 2, is even. Still, which rule is followed does not matter so long as it is equitable and followed consistently.

Creating Grouped Frequency Distributions

Speaking of consistency, grouped frequency distributions are also developed according to a couple of conventions:


• Each class interval must have the same range. Whether the class limits are apparent or actual, the ranges of the different intervals must be equal. In the licensing scores example, the range of the apparent limits is 6.0 for each interval: 34 − 28 = 6.0 for the meritorious interval, 27 − 21 = 6.0 for the exceeds-expectations interval, and so on.

• A score must fit into just one group. This is simple enough when scores involve only whole numbers, but with decimal values, the difference between actual and apparent limits becomes relevant.

No rules dictate how many intervals are too few or too many, and of course that is a nonissue when the data have their own categories, like the licensing results. But without prescribed categories, researchers must decide on the number of categories to use. As a rough rule of thumb, Sheskin (2004) suggests taking the square root of the number of scores to determine the number of class intervals. So if, for example, the data set has 50 scores, the square root of 50 (√50 = 7.071), or about 7 class intervals, would be a reasonable number. Sheskin’s proposal is only a suggestion, however. When the data set is large, the rule may not be very helpful. In the Jokela (2012) study, shown in Table 2.2, there were 2,971 male subjects. Fifty-five class intervals (√2,971 = 54.507) would probably create a larger table than anyone wants to use in a presentation or research report.
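The square-root rule of thumb is easy to try out. The sketch below assumes the result is rounded to the nearest whole number, which matches the two figures quoted above; the function name `suggested_intervals` is hypothetical.

```python
import math


def suggested_intervals(n_scores):
    """Rule-of-thumb number of class intervals: square root of the score count.

    Rounding to the nearest whole number is an assumption of this sketch.
    """
    return round(math.sqrt(n_scores))


for n in (50, 2971):
    print(f"n = {n:>5}: sqrt = {math.sqrt(n):.3f}, "
          f"about {suggested_intervals(n)} class intervals")
```

The output shows why the rule is only a starting point: for the large sample it suggests far more intervals than the seven birth cohorts actually used in Table 2.2.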

The researcher’s objective is to find a reasonable balance between the efficiency that a few categories provide and the precision that more categories yield. For example, if we were to reduce the five intervals in Table 2.1 to just pass and fail categories, the result would be a very compact table, but a good deal of information about the level at which a particular individual passed would be lost.

Score Frequencies and Score Aggregates

Table 2.1 provides a simple summary of frequency, or how the scores on the licensing exam are distributed for 25 test-takers. Other arrangements of these data offer different pictures of how the scores are distributed.

• Frequency indicates how many scores are in each class interval (Table 2.1).

• Relative frequency indicates the proportion or percentage of the total that scores in the class interval represent. Relative frequencies can be reported as common fractions, but proportions or percentages of the whole are more common. The proportions are calculated by dividing the number of scores in the class interval by the total number of scores.

• Sometimes it is helpful to see a running total of scores as one proceeds from one class interval to the next. A cumulative relative frequency value adds each successive class interval to the proportions of scores that precede it so that the last interval will indicate 1.0, or 100%. The cumulative relative frequency for exceeds expectations will be the relative frequency for that class interval (0.24), plus the relative frequency for the preceding class interval (meritorious; 0.16): 0.24 + 0.16 = 0.40.

Expanding Table 2.1 by adding columns for relative frequency and for cumulative relative frequency results in Table 2.3.
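The relative and cumulative relative frequencies described above can be computed directly from the f column of Table 2.1. The following is a sketch, not a prescribed procedure; the variable names are chosen for the example.

```python
from itertools import accumulate

# Frequencies from Table 2.1, listed from the highest category to the lowest.
frequencies = {
    "Meritorious": 4,
    "Exceeds expectations": 6,
    "Pass": 8,
    "Pass with exceptions": 4,
    "Fail": 3,
}

total = sum(frequencies.values())

# Relative frequency: scores in the interval divided by the total number of scores.
relative = {label: f / total for label, f in frequencies.items()}

# Accumulate the counts first and divide once at the end, so that
# floating-point rounding error does not build up across additions.
running_counts = accumulate(frequencies.values())
cumulative = {label: count / total
              for label, count in zip(frequencies, running_counts)}

for label in frequencies:
    print(f"{label:<22} {frequencies[label]:>2}  "
          f"{relative[label]:.2f}  {cumulative[label]:.2f}")
```

Because the last running count equals the total number of scores, the final cumulative value is always 1.00, a quick check that no interval was skipped.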

Table 2.3: Frequencies, relative frequencies, and cumulative relative frequencies

Licensing test results f Relative f Cumulative relative f

Meritorious 4 0.16 0.16

Exceeds expectations 6 0.24 0.40

Pass 8 0.32 0.72

Pass with exceptions 4 0.16 0.88

Fail 3 0.12 1.00

Total 25

Stem-and-Leaf Displays

Sometimes, rather than collapsing or abbreviating the data list, scores need to be organized so that when they are all presented, they are easy to understand. Some data displays accommodate all of the data and still manage to remain fairly compact. One such display is the stem-and-leaf display, or stem plot. Rather than collapsing the scores into class intervals (and losing some of the information about their original values), the stem-and-leaf approach displays all the original scores. The stem-and-leaf display has its name because each score is reduced to a stem and a leaf. The “stem” in the display is all the digits in the number preceding the last digit in the score. The “leaf” is the last digit in the score.

Figure 2.1 depicts a stem-and-leaf display of the 25 test scores on which Tables 2.1 and 2.3 are based.

Figure 2.1: A stem-and-leaf display of test scores

Stem | Leaves
3 | 3 3 4
2 | 0 2 3 3 4 6 6 9
1 | 1 2 4 5 5 7 8 9 9
0 | 1 3 6 8 9

A stem-and-leaf display condenses data to a series of stems (all values in the number except for the last digit) and leaves (the last digit of each value).

Try It!: #2 What would the stem be for a score of 1,012?

At first glance the display appears a little odd, but the beauty of stem-and-leaf displays is that the data list is complete. Single-digit original scores appear on the bottom row, where the stem (the number preceding the final digit) is 0. The stem in this particular display is just a series of single digits because all of the scores are either single-digit or two-digit numbers. If there were a score of 100, the stem for that score would be a two-digit number, 10.

• The single-digit original scores on the bottom row, then, are 1, 3, 6, 8, and 9.

• The second-row test scores are those for which the first digit is a 1 (the stem is 1). Those scores are 11, 12, 14, 15, 15, 17, 18, 19, and 19.

• The third-row test scores are those for which the first number (the stem) is a 2. These, of course, are the test scores in the 20s.

• And the top row, with a stem of 3, contains the three highest scores: 33, 33, and 34.

Once a person is oriented to stems and leaves, the display is not difficult to interpret. A glance makes it clear, for example, that the bulk of these test scores are in the 10s and 20s.
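One way to see how the display is assembled is to split each score into its stem and leaf programmatically. The sketch below does this for the chapter's 25 hypothetical test scores, using integer division by 10; the variable names are illustrative only.

```python
from collections import defaultdict

# The 25 hypothetical test scores used throughout this section.
scores = [34, 33, 33, 29, 26, 26, 24, 23, 23, 22,
          20, 19, 19, 18, 17, 15, 15, 14,
          12, 11, 9, 8, 6, 3, 1]

# Map each stem (all digits but the last) to its leaves (the last digits).
stems = defaultdict(list)
for score in sorted(scores):
    stem, leaf = divmod(score, 10)  # e.g. 34 -> stem 3, leaf 4
    stems[stem].append(leaf)

# Print the highest stem first, as in Figure 2.1.
for stem in sorted(stems, reverse=True):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
```

Every original score can be rebuilt from a row as stem × 10 + leaf, which is what distinguishes this display from a grouped frequency distribution.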

Data Cross-Tabulations

Beyond simply listing data, the stem-and-leaf display suggests that the way data are organized can make what are often quite subtle relationships easier to recognize. Other types of displays also do this very well. For the sake of the licensing test example, assume that the 25 people represent all those from a particular city who took the test in a given year. Assume further that they are the products of two different universities in that city. We know from the earlier tables that the test had just three outright failures and an additional four who passed with exceptions, a kind of conditional pass. A researcher might find it important to determine whether students from the two universities performed similarly. Cross-tabulating the data is one way to present them so that such questions are easier to answer.

Tables 2.1 and 2.3 organized test results according to just the categories that constitute the class intervals. If the results add the university the student attended, a data table (Table 2.4) can be developed so that the columns indicate the test results, and the rows indicate the university attended.

Table 2.4: Cross-tabulating test results with the institution

Institution | Meritorious | Exceeds expectations | Pass | Pass with exceptions | Fail
University A | 0 | 1 | 4 | 4 | 3
University B | 4 | 5 | 4 | 0 | 0

Note: The five result columns are the class intervals.

This cross-tabulation reveals information about the relative success of students from the two universities. If we aggregate the data across institutions, it is not apparent, for example, that no one from University B failed the test, nor is it clear that all those who scored at the meritorious level were from University B. Cross-tabulating data allows a second variable to be represented and provides for a more sophisticated level of analysis.

Try It!: #3 How many “stems” would a stem-and-leaf plot have if scores represented every integer from 1 to 99?

If, for example, we also had access to marital status information, the rows could be divided further to reflect that variable, which would allow us to compare results by test result, by university, and by marital status. If you have read enough to know what information is likely to be important in your report, the published research will guide you to the additional variables that ought to be gathered and presented in a data display.
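The cross-tabulation in Table 2.4 can be held in a small data structure, and its margins show how it relates to the earlier tables. This is a sketch using the counts from Table 2.4; the names `crosstab`, `row_totals`, and `column_totals` are invented for the example.

```python
CATEGORIES = ["Meritorious", "Exceeds expectations", "Pass",
              "Pass with exceptions", "Fail"]

# Counts from Table 2.4: one row per university, one column per category.
crosstab = {
    "University A": [0, 1, 4, 4, 3],
    "University B": [4, 5, 4, 0, 0],
}

# Row totals: how many test takers came from each university.
row_totals = {university: sum(counts)
              for university, counts in crosstab.items()}

# Column totals: aggregating across institutions gives back Table 2.1.
column_totals = [sum(column) for column in zip(*crosstab.values())]

for university, counts in crosstab.items():
    print(f"{university}: {counts} (total {row_totals[university]})")
print(f"All test takers: {column_totals}")
```

Summing down the columns recovers the frequencies of Table 2.1, which makes concrete what aggregation hides: the totals alone do not reveal that every failure came from University A.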

Cross-tabulation is a visually simple way to represent multiple variables. Linn (2003) wanted to indicate the percentage of students in two different grades who were performing at proficient or above in two different states on two different parts of the National Assessment of Educational Progress (NAEP). If you count them, Linn’s data contain four different variables, displayed in Table 2.5:

Table 2.5: Representing multiple variables in a cross-tabulation: percentage of students performing at proficient and beyond on the NAEP

Grade | 1998 Reading, Colorado | 1998 Reading, Massachusetts | 1996 Mathematics, Colorado | 1996 Mathematics, Massachusetts
4 | 34 | 37 | 22 | 24
8 | 30 | 36 | 25 | 28

Source: Linn, R.L. (2003). Accountability: Responsibility and reasonable expectations. Educational Researcher, 32(7), 3–13.

Table 2.5 reveals that the percentage of students scoring at proficient and beyond on both the 1998 Reading test and the 1996 Mathematics test was modestly greater in Massachusetts than in Colorado. The table also reveals that the gap increased slightly from the fourth to the eighth grade. Cross-tabulations readily reveal trends and comparisons such as these.

2.2 Graphs and Other Data Figures

Sometimes, rather than visual displays that group or arrange the scores, a more graphic presentation is helpful. That was certainly the case for the aircraft analyst discussed in the chapter introduction. Pie charts and bar graphs are both quite common because they require very little explanation. As compact and efficient as the stem-and-leaf display is, the unfamiliar observer must be oriented to it before the data make sense. This is less often the case with pie and bar graphs.

Pie Charts

Perhaps better than any other type of graph or figure, the pie chart, or pie graph, clarifies proportions. Scholars have been using pie charts to illustrate proportional differences for probably two hundred years. Technically speaking, a pie chart is a circle that is divided into sectors. Each sector’s share of the circle’s total area matches the percentage of the whole that its category represents. For instance, suppose a pie chart is used to illustrate where people in a particular county live, with the following percentages:

• 25% are city dwellers,

• 20% live in the suburbs,

• 25% live in small towns, and

• the remaining 30% live in rural areas.

In the circle used for the pie chart,

• 1/4 of the area will be the city sector,

• 1/5 of the area will be for those in the suburbs,

• 1/4 of the area will represent the small-town residents, and

• the remaining 3/10 of the area will represent the rural dwellers.

When we are interested in how much of the whole is explained by individual categories, a pie chart is usually more illustrative than a table. This is particularly the case when the data sets are large.

Perhaps a sociologist is interested in the ethnic group makeup of the residents in a particular county. Examining census data might produce the following statistics (depicted as a pie chart in Figure 2.2):

African Americans 23,375

Asian Americans 18,217

Caucasian Americans 32,667

Hispanic Americans 40,886

Native Americans 11,364

Other 5,887

Figure 2.2: A pie chart of the ethnic makeup of the county

This pie chart depicts census data by ethnic group for a single county. Pie charts are useful in showing a percentage of data as proportional to the whole.


To make this pie chart in Excel, enter the data in two columns just as they are listed above Figure 2.2. Drag the cursor so as to highlight both columns and then select the Insert tab at the top of the page. Select the Pie option. The default result is a two-dimensional pie chart.

Treated as a list, the data are certainly precise but perhaps are not as communicative as they might be. If the intent is to indicate how different ethnic groups compare as proportions of the entire population of the county, a pie chart is probably more helpful. Figure 2.2 shows the proportions of each ethnic group within the county population.

This particular graph does not indicate the numbers on which the proportions are based, but the exact counts can be listed separately. In any event, the raw numbers may not matter to someone who wants a graphic demonstration of the fact that Hispanic residents constitute the largest single ethnic group in the county, that the second largest group is the Caucasian group, that Native Americans are about half as numerous as African Americans, and so on.
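The proportions that the pie chart conveys visually can also be worked out from the census counts listed above. The sketch below is illustrative only; converting each share to a sector angle assumes nothing beyond the fact that a full circle spans 360 degrees, and the variable names are made up for the example.

```python
# Census counts for the county, as listed above Figure 2.2.
counts = {
    "African Americans": 23_375,
    "Asian Americans": 18_217,
    "Caucasian Americans": 32_667,
    "Hispanic Americans": 40_886,
    "Native Americans": 11_364,
    "Other": 5_887,
}

total = sum(counts.values())

# Each group's share of the whole, which is what a pie sector depicts.
shares = {group: count / total for group, count in counts.items()}

# List the groups from largest to smallest share, with the sector angle.
for group, share in sorted(shares.items(), key=lambda item: item[1],
                           reverse=True):
    print(f"{group:<20} {share:6.1%}  sector of about {share * 360:5.1f} degrees")

# Ratio mentioned in the text: Native Americans relative to African Americans.
ratio = counts["Native Americans"] / counts["African Americans"]
print(f"Native Americans / African Americans = {ratio:.2f}")
```

The printed ratio confirms the "about half as numerous" comparison, something a reader can only approximate by eye from the chart itself.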

Pie charts illustrate large proportional differences better than small ones. Note that Figure 2.2 makes it difficult to assess how much of the whole the Native American population constitutes, or what proportion is Other. Pie charts generally work better when comparing an individual “slice” to the whole rather than one slice to another.

Bar Graphs

Bar graphs or bar charts use a series of bars of different lengths to represent the different quantities of some variable. The bars can be either horizontal or vertical.

Gaps between the bars indicate that the categories in the graph are not continuous; they are discrete or independent categories. For example, such a chart might illustrate the popularity of different academic majors at a university or, as in Figure 2.3, show the ethnic makeup of a county.

Figure 2.3: A bar graph of the ethnic makeup of the county

Unlike a pie chart, a bar graph shows data values for each population group, allowing for a more exact representation of each group’s proportion within the whole population.


One advantage of bar graphs over pie charts is that bar graphs supply the data values along the y (vertical) axis or along the bar, as Figure 2.3 illustrates. The presence of the data values makes it a good deal easier to get a rough idea of the approximate totals for each ethnic group and simplifies comparisons from group to group. In a bar graph with discrete categories, the order of the bars is usually not significant, although there may be an order the researcher wishes to emphasize. In Figure 2.3, the order of ethnic groups happens to be alphabetical.
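For readers working outside a spreadsheet, the idea behind Figure 2.3 can be approximated with a text-only bar graph. This is a rough sketch rather than a substitute for a charting tool: the scale of one `#` per 1,000 residents is an arbitrary choice, and `bar_length` is a hypothetical helper.

```python
# Census counts for the county, in the alphabetical order used in Figure 2.3.
counts = {
    "African Americans": 23_375,
    "Asian Americans": 18_217,
    "Caucasian Americans": 32_667,
    "Hispanic Americans": 40_886,
    "Native Americans": 11_364,
    "Other": 5_887,
}

RESIDENTS_PER_MARK = 1_000  # arbitrary scale chosen for this sketch


def bar_length(count):
    """Number of '#' marks for a count, rounded to the nearest whole mark."""
    return round(count / RESIDENTS_PER_MARK)


# One horizontal bar per discrete category, with the data value at the end.
for group, count in counts.items():
    print(f"{group:<20} {'#' * bar_length(count)} {count:,}")
```

Printing the value beside each bar mirrors the advantage noted above: the reader gets both the visual comparison and the approximate totals.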

To create this bar graph in Excel, perform these steps using the same data set used to create the pie chart:

1. Highlight both columns of data.

2. Select the Insert tab at the top of the page and choose Bar.

3. Select All Chart Types at the bottom of the page (the default charts all use horizontal bars).

4. Select the upper-left-column graph.

5. Place your cursor on the Series 1 notation at the right.

6. Press the Delete key on your keyboard, and then click OK.

Bar graphs can also depict different variables within a single data set. Using the 2000 U.S. census results, Lopez (2003) examined the ethnic-group characteristics of school-aged children. She used a bar graph to indicate the percentage of mixed-race children under age 18 for each region in the United States. Figure 2.4 shows the resulting bar graph.

The bar graph makes it clear that mixed-race children are a much greater proportion of the population in the West and Southwest than they are in the South or the Midwest, for example.

Figure 2.4: Percentage of mixed-race children under 18 by region

Bar graphs are capable of depicting multiple variables, as this graph of mixed-race children within each U.S. region illustrates.

Source: Lopez, 2003.


Histograms

The ethnicity data in the categories in both Figures 2.3 and 2.4 are nominal scale, meaning the categories are not continuous and the order of the categories is unimportant. Sometimes, data categories continue from one to the next, so that each category indicates an incremental increase or decrease in the level of the same characteristic. This kind of bar graph is a histogram.

The subtle visual difference between histograms and other graphs is the absence of a gap between the bars or columns, which serves as a reminder that the data continue without interruption into the next category. Earlier, this chapter discussed actual versus apparent limits in class intervals; here the lack of interruption indicates that limits in a histogram are actual limits.

Researchers commonly use histograms to illustrate test-score data. For example, a stress test is administered to personnel in several hospital emergency rooms. Higher scores indicate greater stress, with scores as follows:

Level    Frequency
1        2
2        4
3        3
4        6
5        5
6        4
7        4
8        3
9        1

Figure 2.5 plots these data as a histogram.

Figure 2.5: Histogram showing stress-test score distribution among hospital personnel

A histogram is a bar graph with continuous (as opposed to discrete) categories. The order of categories in a histogram is not random, but rather dictated by the magnitude of the variable—in this example, the degree of stress among the personnel tested.

Figure 2.5 axes: stress levels 1 to 9 on the x-axis, labeled "Stress Levels (1–9) Among 32 Hospital Personnel"; frequencies 0 to 7 on the y-axis.


In a histogram, the order of the intervals is never random. It is dictated by the magnitude of the variable, which in this case, is the degree of stress. The histogram indicates that 4 is the most common stress level, but that stress among emergency personnel in the hospitals ranges from the two people who registered 1 (very low stress) to one person who measured 9 (very high stress).
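The reading of the histogram above can be checked with a few lines of Python. This is only a minimal sketch: the dictionary name `stress_freq` and the text bars drawn with `#` are illustrative assumptions, not part of the chapter's Excel workflow.

```python
# Frequency table from the stress-test example: level -> number of personnel.
stress_freq = {1: 2, 2: 4, 3: 3, 4: 6, 5: 5, 6: 4, 7: 4, 8: 3, 9: 1}

# Total number of people tested.
total = sum(stress_freq.values())

# The most common level is the one with the largest frequency.
most_common_level = max(stress_freq, key=stress_freq.get)

# Draw one row per level, ordered by magnitude as a histogram requires.
for level in sorted(stress_freq):
    print(f"{level} | {'#' * stress_freq[level]}")

print("n =", total, "| most common level =", most_common_level)
```

The rows are printed in order of stress level rather than in order of frequency, which is the property that distinguishes a histogram from a bar graph of discrete categories.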

Cartesian Coordinates

Cartesian coordinates are values of x and y. The horizontal line is the x-axis, or the abscissa, and the vertical axis, at right angles to the x-axis, is the y-axis, or the ordinate. The abscissa and the ordinate intersect at the point of origin, where x = 0 and y = 0, as shown in Figure 2.6.

A score with the coordinates x = 2 and y = 3 would occur at the point marked by the star.

If the graph included negative values, they would be plotted either to the left of the point of origin (for negative values of x) or below it (for negative values of y). When all values are positive, it is common to delete the upper left, lower left, and lower right quadrants and present only the upper right quadrant. Most graphs show only this quadrant.

Figure 2.6: A Cartesian coordinate system

A Cartesian coordinate system depicts values along an x-axis and a y-axis.

Figure 2.6 axes: x (horizontal) and y (vertical), each marked from -3 to 3.

Giorgios Kollidas/Hemera/Thinkstock

Cartesian coordinates, named after the 17th-century French mathematician René Descartes, are the values x and y that allow one to locate scores on a graph.


2.3 The Normal Distribution

Another graphic display called the frequency polygon is quite common in statistics books because it is used to display the normal curve or normal distribution. The frequency polygon is formed by plotting, for each score, a point whose height above the x-axis in a set of Cartesian coordinates indicates the frequency with which the score occurs. With the lowest-score values to the left, a line joins the frequency of each successive score from lowest to highest. What is often referred to as a bell-shaped curve or just a bell curve is more precisely a frequency polygon. The curve is based on enough individual scores that the straight lines between consecutive scores are too short to appear straight; they appear instead as part of one long, continuous, curved line. A few test scores cannot depict results this way, but scores from all students nationally would show the line less as a series of straight lines joined together and more like a smooth, continuous curve. On the other hand, scores from 50 or 60 students are unlikely to reflect a normal distribution.

The normal curve has a number of applications. That bell shape, for which the curve is low on either side and highest in the middle, suggests the way many characteristics tend to be distributed when we measure them for large numbers of people. The shape is a reminder that many characteristics, mental traits in particular, are often distributed predictably in populations. Even before collecting data, researchers can sometimes know which scores in a distribution are likely to occur with the greatest frequency and which with the least.

Much of what we know about the normal distribution reflects the work of Karl Gauss (1777–1855), a gifted German mathematician. Indeed, acknowledging his contribution, the normal distribution is sometimes called a “Gaussian distribution.” He was perhaps the most important of those early scholars to recognize the existence of normal distributions and to define their properties.

The Elements of Normality

According to Gauss, if we were to measure very large numbers of people on a trait like verbal aptitude, a frequency polygon of the results is likely to have the characteristics Figure 2.7 displays. Specifically, these characteristics include the following:

1. The distribution is divided down the middle: each half is a mirror-image of the other. In other words, the distribution is symmetrical.

2. The distribution has a single mode (i.e., it is unimodal), with only one most frequently occurring number (mode). That means in effect that the bell has just one peak, which, because the distribution is symmetrical, will be in the middle of the distribution.

3. A rule sometimes called the “1/6th rule” (although it happens to be characteristic of normal distributions) will apply. This rule states that the standard deviation of the scores in the population will be about one-sixth of the range (s = R/6). If the range of scores for a certain measure is 36, and the data are normally distributed in the population, s = 6. The Scholastic Aptitude Test (SAT) has scores from 200 (the minimum) to 800 (the maximum); based on the one-sixth rule, we might conclude the standard deviation is 100 points.
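The arithmetic behind the one-sixth rule can be sketched in Python. The function name `expected_sd` is a hypothetical helper introduced here for illustration; it simply divides the range by six, as the rule describes.

```python
def expected_sd(minimum, maximum):
    """Estimate the standard deviation as one-sixth of the range (s = R/6)."""
    return (maximum - minimum) / 6

# A measure whose range is 36 (for example, scores running from 0 to 36).
sd_range_36 = expected_sd(0, 36)

# SAT scores run from 200 to 800, so the range is 600.
sd_sat = expected_sd(200, 800)

print(sd_range_36, sd_sat)
```

Both results match the values in the text (6 and 100), but keep in mind that the rule describes normally distributed populations, not arbitrary samples.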


These statements of normality regarding a data distribution have much more technical descriptions, but these will serve us well. With what we already know about descriptive statistics, these characteristics allow us to make reasonably good judgments about whether a data distribution is normal. This is important because to some degree, the types of analyses that are available depend upon whether data are normal.

Figure 2.7 depicts more than the three characteristics of symmetry, unimodality, and s = R/6; it also notes the percentage of the distribution that occurs in specified areas. These percentages, along with the z scores and population standard deviation values noted below the curve, will be important in succeeding chapters. For now, note the three characteristics described above.

Figure 2.7: The normal curve

Figure 2.7 shows a frequency polygon or bell curve, depicting the standard normal distribution for a given set of data.

Source: “Normal Distribution.” (2014). MathIsFun.com. Retrieved from http://www.mathsisfun.com/data/standard-normal-distribution.html

Figure 2.7 values marked below the curve: -4 to 4 in steps of 0.5.

Skew

An easy way to determine whether data are symmetrical is to compare the measures of central tendency. If the mean (M), the median (Mdn), and the mode (Mo) all have the same value, the distribution is symmetrical, as well as unimodal.

When the measures of central tendency do not agree, it is because some scores on one side of the distribution are not counterbalanced by scores a similar distance from the mean on the other side of the distribution. This imbalance creates what is called skew (sk). If skew is present in the measures, we indicate it with the following symbols: sk ≠ 0. When there is no skew, sk = 0.

Comparing the mean, median, and mode can indicate whether data are skewed, but in order to determine the degree of skew, we must use a calculation. One of the simpler formulas for calculating skew is as follows:


Formula 2.1

$$sk = \frac{M - Mdn}{Mdn}$$

where

sk = skew,

M = the mean of the values, and

Mdn = the median of the values.

Using the following 25 test scores from the licensing-test example:

1, 3, 6, 8, 9, 11, 12, 14, 15, 15, 17, 18, 19, 19, 20, 22, 23, 23, 24, 26, 26, 29, 33, 33, 34

Verify that

M = 18.40

Mdn = 19

Using Formula 2.1,

$$sk = \frac{M - Mdn}{Mdn} = \frac{18.40 - 19}{19} = -0.0316$$

The negative value indicates negative skew. A positive value would indicate positive skew. Negative skew means that the slope to the left of the mean and median is more gradual than that to the right. Particularly in small groups, some skew is common. For purposes of analysis, skew values within ±1.0 indicate modest skew. Figure 2.8 shows graphs of distributions with negative (A) and positive (B) skew.
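The calculation in Formula 2.1 can be reproduced with Python's standard `statistics` module. This is a sketch for checking the hand calculation; the function name `simple_skew` is a hypothetical label for Formula 2.1, not an established library routine.

```python
from statistics import mean, median

# The 25 licensing-test scores from the example.
scores = [1, 3, 6, 8, 9, 11, 12, 14, 15, 15, 17, 18, 19, 19, 20,
          22, 23, 23, 24, 26, 26, 29, 33, 33, 34]

def simple_skew(values):
    """Formula 2.1: (mean - median) / median.

    Undefined when the median is zero, because of the division.
    """
    m = mean(values)
    mdn = median(values)
    return (m - mdn) / mdn

sk = simple_skew(scores)
print("M =", round(mean(scores), 2), "Mdn =", median(scores), "sk =", round(sk, 4))
```

Because the mean (18.40) is below the median (19), the numerator is negative and so is the result, which agrees with the sign interpretation given above.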

We noted that many of the more common statistical procedures require that data be relatively normal. When data are not normal (lack of symmetry is one characteristic of non-normal data), the analytical procedures must accommodate the lack of normality. Chapter 10 deals with such procedures, often used when someone is analyzing salary data or home prices, which are rarely normal. In those instances, a few very high values pull the mean above the median and create positive skew in the data.

As a practical matter, we are interested in the degree of skew. In the calculation above, sk = -0.0316. The negative value indicates that there is some negative skew, as with the example in Figure 2.8, but it is less extreme. Anytime skew is within ±1.0, researchers do not need to make special adjustments for the lack of normality. Traditional analytic procedures that assume normality are appropriate in these cases, as skew does not pose a problem.

Note that the mean is slightly lower than the median in the 25 test scores. The way skew is calculated in Formula 2.1 indicates that any time the mean is lower than the median, the result will be a negative skew value. On the other hand, if M > Mdn the data have positive skew. In effect, the mean is “pulled” in the direction of the skew, so that even before a skew value is calculated, a comparison of the two measures of central tendency usually indicates whether skew is positive or negative.


Outliers

The mean and median differ when values on one side of the distribution are more extreme than those on the other side. The imbalance creates skew: the more extreme the unbalanced scores, the more extreme the skew. We noted earlier that many mental traits are normally distributed, but that statement refers to their populations. Samples, particularly samples smaller than several hundred, will rarely be normally distributed. Consider these data:

20, 25, 30, 35, 40

For these five values, Mdn = 30 and M = 30. There is no skew; the data are symmetrical.

The distribution is certainly not normal with so few data, but it is symmetrical. But when we add values of 5 and 45 to the set, the results change.

5, 20, 25, 30, 35, 40, 45

For these seven values, Mdn remains at 30, but M becomes 28.571.

Figure 2.8: Skewed distribution

In a negative skew, the slope to the left of the median and mean is more gradual than that to the right. The opposite is true for a positively skewed distribution.

Figure 2.8 panels: A. Negative Skew; B. Positive Skew


The additional values created some negative skew. That happens because, of the two new values added to the distribution, 5 is more distant from the mean than 45. The effect of adding 5 is to pull the mean away from the median, creating negative skew.

As the most extreme score in the set, the 5 can be termed an outlier. Outliers are scores that are uncharacteristic of the other scores in the data set. Outliers in just one direction create skew. If instead of 45, the upper value had been 55, the result would be no skew because the 5 and 55 are equidistant from the mean (which in that case, would have remained 30).
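The effect of the added values on the mean and median can be verified directly. The following is a small sketch using Python's `statistics` module; the list names are illustrative assumptions.

```python
from statistics import mean, median

symmetric = [20, 25, 30, 35, 40]       # the original five values
with_outlier = [5] + symmetric + [45]  # 5 is farther from the mean than 45
balanced = [5] + symmetric + [55]      # 5 and 55 are equidistant from 30

for label, data in [("original", symmetric),
                    ("5 and 45 added", with_outlier),
                    ("5 and 55 added", balanced)]:
    # A mean below the median signals negative skew.
    print(f"{label}: M = {round(mean(data), 3)}, Mdn = {median(data)}")
```

Only the second data set shows a mean (28.571) below its median (30); in the other two the mean and median coincide, so there is no skew.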

A researcher might be tempted to restore symmetry by simply eliminating the offending values. In some research cases, these outliers are removed, but data are seldom eliminated when samples are small. In our case, ignoring the 5 and the 45 would restore symmetry but eliminate nearly 29% (2/7) of the data in the process. Besides, as we will see in the next section, normality involves more than skewness.

We have seen that the orientation of the mean to the median indicates whether the data are skewed. We also have noted that normal data are unimodal. Therefore, we need to determine the mode as well, even though mode is a statistic typically associated with nominal scale data and often not very informative in small data sets of any scale. Nevertheless, we need to know whether there is just one mode, as well as how similar the data are.

Kurtosis

Symmetry and unimodality offer no assurance of normality. A data distribution can be symmetrical and unimodal but not normal. The third dimension of normality, kurtosis, deals with how spread-out the data are. Kurtosis, which comes from a Greek word meaning “bulging” or “convex,” is part of three different descriptions of data distribution:

• Data which are too heterogeneous, or too varied to be normal, are platykurtic, or “flat-kurtic.”

• Data which are too homogeneous, or too similar to create a normal distribution, are leptokurtic, or “narrow-kurtic.”

• Normally distributed data are mesokurtic, literally “middle-kurtic.”

Figure 2.9 illustrates mesokurtic, leptokurtic, and platykurtic data distributions.

Although kurtosis can be calculated, the calculations are tedious, and kurtosis is more commonly judged by comparing the standard deviation (s) of the data to their range (R). Recall from Chapter 1 that the standard deviation is a measure of how much the data typically vary from the mean of the distribution. A large standard deviation indicates that, on average, individual measures differ substantially from the mean; they are not homogeneous. In a normal distribution, the standard deviation is about one-sixth the range. So if scores occur from, say, 20 to 55 (making R = 35), the standard deviation in a normal distribution will be about 6 points, since 1/6th of 35 = 5.833.

Try It!: #4 If the mode is least affected by outliers and the mean is the most influenced, what will be the order of the mean, median, and mode from left to right in a distribution with negative skew?


• If s < R/6, the distribution is leptokurtic; the data are too similar for normality.

• If s > R/6, the distribution is platykurtic; the data are too varied to be normal.

Note that the s = R/6 rule may not be helpful with small data sets. Small samples tend to be platykurtic because just one or two extreme scores have a disproportionate effect on the balance of the sample. As the sample grows, this effect is minimized, but it can be substantial with only a few scores. For the sake of manageability, many of the data sets we deal with in this book contain fewer than 15 values, and in groups that small, ordinarily s > R/6. Just remember that the one-sixth rule is much more relevant to populations than to samples.
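The comparison of s with R/6 can be sketched as a small Python function. The name `judge_kurtosis` and the 10% `tolerance` band used to decide what counts as "about" one-sixth are assumptions made purely for illustration; the chapter does not specify a cutoff.

```python
from statistics import stdev

# The 25 licensing-test scores used throughout the chapter.
scores = [1, 3, 6, 8, 9, 11, 12, 14, 15, 15, 17, 18, 19, 19, 20,
          22, 23, 23, 24, 26, 26, 29, 33, 33, 34]

def judge_kurtosis(values, tolerance=0.10):
    """Compare the sample standard deviation with one-sixth of the range.

    `tolerance` (an assumed value) is the fraction by which s may differ
    from R/6 and still be treated as roughly mesokurtic.
    """
    s = stdev(values)
    expected = (max(values) - min(values)) / 6
    if s < expected * (1 - tolerance):
        return "leptokurtic"
    if s > expected * (1 + tolerance):
        return "platykurtic"
    return "mesokurtic"

print(round(stdev(scores), 3), round((max(scores) - min(scores)) / 6, 3),
      judge_kurtosis(scores))
```

For these scores s is about 9.17 while R/6 is 5.5, so the rule labels the sample platykurtic, as the text leads us to expect for a small group.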

Just as zero (0) indicates a symmetrical distribution when skew is the issue, zero is the reference point when examining calculated kurtosis values, as the list following Figure 2.9 shows.

TIP: The Excel function KURT will not calculate kurtosis for sets of fewer than four numbers.

Figure 2.9: The three types of kurtosis

Kurtosis distinguishes distributions that are symmetrical but not necessarily normal.

Figure 2.9 panels: A. A Platykurtic Distribution; B. A Leptokurtic Distribution; C. A Mesokurtic Distribution


• Zero indicates a mesokurtic distribution, neither leptokurtic nor platykurtic.

• Positive kurtosis values are associated with a leptokurtic distribution.

• Negative kurtosis values are associated with a platykurtic distribution.

Kurtosis values in the ±1.0 range are ideal for statistical analyses that require normal data. Values in the ±2.0 range are not considered normal, but they are, if not ideal, at least acceptable for statistical analyses.

As we noted, the formula for kurtosis is quite grueling to calculate by hand and in fact is rarely completed except as a descriptive statistic that Excel or one of the dedicated computer software packages produces. For nearly everything that the typical analyst or researcher does, the R/6 rule provides enough information to make a judgment about how data are distributed.

Using Excel to Calculate Skew and Kurtosis

Excel makes calculating kurtosis painless and provides additional information as well, as the following example shows.

If the 25 licensing test scores are entered in one column in Excel, the commands for calculating descriptive statistics, including skew and kurtosis values, are as follows:

1. From the top menu, click the Data tab. Then click Analysis to open the Data Analysis window.

2. In the Data Analysis window, select Descriptive Statistics, and click OK.

3. For Input Range, drag the cursor over the cells in which the 25 scores occur.

4. Click Output range and designate a cell below which there will be sufficient room for the output, which takes up about 15 lines and two columns.

Table 2.6 shows the Excel output for the test scores. (Note that although there are multiple duplicated numbers, Excel lists only one value for the mode.)

The negative value for kurtosis indicates that these 25 values make up a platykurtic distribution, which, as we have emphasized, is typical for relatively small groups. The distribution is a little too flat to be normal, although well within the limits for normality that statistical tests require.

Table 2.6: Excel display showing descriptive statistics

Statistic             Column 1
Mean                  18.40
Standard error        1.833939
Median                19
Mode                  33
Standard deviation    9.169696
Sample variance       84.08333
Kurtosis              -0.63032
Skewness              -0.06491
Range                 33
Minimum               1
Maximum               34
Sum                   460
Count                 25
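For readers without Excel, most of the values in Table 2.6 can be approximated in Python. The sketch below uses the sample-adjusted skewness and kurtosis formulas that Excel's SKEW and KURT functions are generally documented to use; treat that correspondence as an assumption to confirm against Excel's own documentation. The function names `excel_skew` and `excel_kurt` are hypothetical labels chosen for this example.

```python
from math import sqrt
from statistics import mean, median, stdev, variance

scores = [1, 3, 6, 8, 9, 11, 12, 14, 15, 15, 17, 18, 19, 19, 20,
          22, 23, 23, 24, 26, 26, 29, 33, 33, 34]

def excel_skew(values):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n, m, s = len(values), mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

def excel_kurt(values):
    """Sample excess kurtosis; needs at least four values, as the TIP notes."""
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis requires at least four values")
    m, s = mean(values), stdev(values)
    fourth = sum(((x - m) / s) ** 4 for x in values)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

standard_error = stdev(scores) / sqrt(len(scores))

print("Mean", round(mean(scores), 2))
print("Standard error", round(standard_error, 6))
print("Median", median(scores))
print("Standard deviation", round(stdev(scores), 6))
print("Sample variance", round(variance(scores), 5))
print("Kurtosis", round(excel_kurt(scores), 4))
print("Skewness", round(excel_skew(scores), 4))
print("Range", max(scores) - min(scores))
```

The mode is left out deliberately: several scores occur twice in this data set, so any single reported mode depends on how the software breaks the tie.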


The skew value, by the way, is slightly different from the value we calculated earlier: sk = -0.0316. Formula 2.1 is one of the simpler formulas for calculating skew, so the result is a value that is less precise than what Excel produces. Any variation between results calculated using Formula 2.1 and those calculated in Excel will generally be minor.

2.4 Determining What Is Representative

Earlier, we noted that extreme scores have the potential to distort the descriptive statistics in a data distribution. Particularly when the group is relatively small, extreme scores can substantially affect both the mean and the standard deviation. This effect prompts a question: Because nearly all distributions have scores that differ from the mean of the group, at what point does a score become an extreme score? That is, what defines an outlier?

Percentile Ranks

There are several ways to answer such a question about outliers. One approach requires a quick introduction to percentiles. Recall that the median (Mdn) is the point in a distribution below which half of all scores occur. In terms of percentages, then, 50% of the scores fall below the median in a distribution where the scores are arranged from lowest to highest; the median marks the 50th percentile rank. Percentile ranks define the percentage of scores occurring below a point.

If we again divide each half of the distribution into halves, the result is fourths of the distribution, called quartiles. Arranged in order, the licensing test results are as follows:

1, 3, 6, 8, 9, 11, 12, 14, 15, 15, 17, 18, 19, 19, 20, 22, 23, 23, 24, 26, 26, 29, 33, 33, 34.

Because the results comprise 25 scores, the median is the 13th score, the first of the two 19s. To find the middle of the lower half of the distribution we exclude the median value. The middle point of the remaining 12 scores is halfway between the sixth and seventh scores, or between 11 and 12, which would be 11.5. From this data we conclude that:

• 11.5 marks the 25th percentile rank, or quartile 1 (Q1).

• Midway between the uppermost 12 scores (from the second 19 to 34) is between 24 and 26; 25 marks the 75th percentile rank, or quartile 3 (Q3).
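The hand method just described (exclude the median, then take the middle of each half) can be sketched in Python. The function name `quartiles` is a hypothetical helper; for an even number of scores it simply splits the data into two equal halves, which is an assumption, since the text only works through an odd-sized example.

```python
from statistics import median

scores = [1, 3, 6, 8, 9, 11, 12, 14, 15, 15, 17, 18, 19, 19, 20,
          22, 23, 23, 24, 26, 26, 29, 33, 33, 34]

def quartiles(values):
    """Return (Q1, median, Q3) using the median-excluded halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, skip the median itself when forming the upper half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

q1, mdn, q3 = quartiles(scores)
print("Q1 =", q1, "Mdn =", mdn, "Q3 =", q3)
```

Other software may use slightly different interpolation rules for quartiles, so small differences from this hand method are possible with other data sets.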

The Interquartile Range

The portion of the distribution from the 25th to the 75th percentile rank constitutes the interquartile range (IQR), which is shaded in Figure 2.10.

Because scores in the middle half of the distribution are more likely to be repeated than those at either end of the distribution, the interquartile range generally contains the most representative scores. Lockhart (1998) used the IQR to identify outliers. Anything outside the IQR (in this case, lower than 11.5 or higher than 25) could be considered a score that distorts the distribution.


To exclude everything outside the IQR as an outlier, however, excludes half the distribution—a rather extreme move. Lockhart (1998) suggested further that we calculate the IQR and then consider anything more than 1.5 times IQR above Q3 or below Q1 an outlier.

Consider the following illustration of this principle. A psychologist is working with clients in an addiction program and is researching how many drug-free days each client has achieved. A random sample of seven clients yields the following numbers of drug-free days: 1, 7, 13, 17, 25, 27, 63.

Verify that

Mdn = 17

Q1 = 7

Q3 = 27

IQR = 20

For lower outliers,

Q1 - (1.5 × IQR) = 7 - (1.5 × 20) = -23; any score below -23 is an outlier.

For upper outliers,

Q3 + (1.5 × IQR) = 27 + (1.5 × 20) = 57; any score above 57 is an outlier.

Among these seven scores, only 63 is an outlier according to Lockhart’s (1998) definition. Although judgment is always involved and therefore some subjectivity, Lockhart’s approach will at least result in consistent decisions about which data to exclude to obtain a more accurate picture of which scores may unduly distort a distribution.
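The 1.5 × IQR rule applied above can be sketched in Python. The example relies on `statistics.quantiles` with its "exclusive" method, which happens to reproduce the hand-calculated quartiles for these seven values; that agreement is an observation about this data set, not a guarantee for every data set.

```python
from statistics import quantiles

# Drug-free days for the seven clients in the example.
days = [1, 7, 13, 17, 25, 27, 63]

# Cut points that divide the sorted data into four equal groups.
q1, mdn, q3 = quantiles(days, n=4, method="exclusive")
iqr = q3 - q1

# Fences at 1.5 times the IQR below Q1 and above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [d for d in days if d < lower_fence or d > upper_fence]
print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
print("fences:", lower_fence, upper_fence, "outliers:", outliers)
```

Writing the fences out as named values makes the rule easy to apply consistently, which is the main benefit the text attributes to this approach.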

If Lockhart’s (1998) solution for answering the question of which scores are least like the other scores seems too complex, alternatives may be used. Recall Figure 2.7 (and refer to Chapter 3, where the normal distribution will be examined more closely). When data are normal, the area included between two standard deviations below the mean to two standard deviations above the mean will always include about 95% of a normal distribution. We could rely on that fact to devise a rule for outliers and exclude scores more than two standard deviations from the mean (±2s). For the group of seven values (drug-free days) above, M = 21.857 and s = 20.359. By the ±2s rule, anything lower than -18.860 (21.857 - 2 × 20.359) or higher than 62.575 (21.857 + 2 × 20.359) can be excluded as an outlier.

Figure 2.10: The interquartile range

Figure 2.10 shows the interquartile range, which for the licensing test scores occurs between the first and third quartiles. The IQR consists of the most representative scores for a data set.

Figure 2.10 content: the ordered scores 1, 3, 6, 8, 9, 11, 12, 14, 15, 15, 17, 18, 19, 19, 20, 22, 23, 23, 24, 26, 26, 29, 33, 33, 34, with 11.5 marked as the 25th percentile, 19 as the median, and 25 as the 75th percentile.

Since there is no such thing as negative days clean (our lowest value is 1, and the lowest possible value would be 0), the calculation indicates that by our rule, at least, no data on the low side of this set are outliers. On the high side, 63 is the only outlier, the same outcome reached by Lockhart’s (1998) approach, but that will not always be the case. The point is to develop a reasonable rule and be consistent and transparent in applying it.
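The two-standard-deviation rule can be checked the same way. This is a brief sketch with Python's `statistics` module; the variable names are illustrative.

```python
from statistics import mean, stdev

# Drug-free days for the seven clients in the example.
days = [1, 7, 13, 17, 25, 27, 63]

m = mean(days)
s = stdev(days)  # sample standard deviation (n - 1 in the denominator)

# Limits two standard deviations below and above the mean.
low = m - 2 * s
high = m + 2 * s

outliers = [d for d in days if d < low or d > high]
print("M =", round(m, 3), "s =", round(s, 3))
print("limits:", round(low, 3), round(high, 3), "outliers:", outliers)
```

The lower limit is negative, so no possible count of drug-free days can fall below it, and 63 is the only value beyond the upper limit.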

The Distorting Effect of Outliers

The problem with outliers, of course, is that extreme scores make the mean, which is supposed to indicate central tendency, something less than central. Outliers can also dramatically inflate the value of the standard deviation. Remember that the standard deviation is based on the square of the difference between a score and the mean of the group. Squaring the difference between an extreme score and the mean has a much greater impact than squaring the difference between the mean and a score close to the mean. It therefore has a disproportionate effect on the magnitude of the statistic. When that happens, what are supposed to be descriptive statistics may not describe the data very well.
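To make the effect of squaring concrete, the sketch below reuses the seven drug-free-day values from the earlier example and measures how much of the sum of squared deviations comes from the single score of 63. Applying the data this way is an illustration added here, not a calculation taken from the chapter.

```python
from statistics import mean, stdev

days = [1, 7, 13, 17, 25, 27, 63]
m = mean(days)

# Squared deviation of each score from the mean.
squared_devs = {d: (d - m) ** 2 for d in days}
total_ss = sum(squared_devs.values())

# Share of the total sum of squares contributed by the extreme score.
share_of_63 = squared_devs[63] / total_ss

# Standard deviation with and without the extreme score.
without_63 = [d for d in days if d != 63]
print("share of sum of squares from 63:", round(share_of_63, 2))
print("s with 63:", round(stdev(days), 3), "| s without 63:", round(stdev(without_63), 3))
```

One score out of seven accounts for roughly two-thirds of the sum of squares, and dropping it cuts the standard deviation approximately in half.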

The point of organizing tables and calculating graphs, figures, skew, kurtosis, and all the descriptive statistics is to describe data sets and illustrate relationships. It is easy to get buried in a lot of complex calculations and arrangements as we move into the more advanced analyses, but we cannot lose sight of one fairly constant objective in human subjects research: to clarify and simplify the data. For example, if the point of the analysis is the range of scores included in the meritorious category, it makes little sense to exclude the most extreme scores when developing a feel for the highest scores. That those highest scores may create positive skew in the distribution as a whole in that case is unimportant.

This advice is not meant to diminish the importance of identifying and excluding outliers sometimes. A case study illustrates why. In the Alberta schools the author attended, a remnant of British educational practice required students to take rigorous government examinations called “departmental” in the ninth and twelfth grades. (In Britain, they are known as O-levels and A-levels.) The results of these exams had much to do with students’ subsequent educational options. In the author’s twelfth grade year in a very small high school, six or seven students took the departmental exam in physics. One of the author’s classmates produced a perfect score, something that had never occurred before in the history of the province. For a researcher who wished to review those scores to develop a feel for the level of physics performance typical of seniors in that high school, it would make sense to exclude that one perfect score before calculating a mean, or perhaps to calculate the median score instead of the mean because medians are less affected than means by outliers. With just a few scores to begin with, a perfect score on a very difficult examination holds too much potential to distort a mean.


Apply It! Placement Test Outliers

A psychologist is examining data from 40 service men and women who have recently returned from a United Nations peacekeeping assignment. They are being evaluated for post-traumatic stress and each has completed an instrument designed to measure stress disorders. The psychologist records their test scores and then uses Excel to calculate descriptive statistics. The Descriptive Statistics function returns the following values for these 40 test scores.

Mean                  54.65
Median                54.22
Standard deviation    10.27
Kurtosis              4.65
Skewness              1.20
Range                 62
Minimum score         15
Maximum score         76

The psychologist expects that these scores will roughly follow the characteristics of a normal distribution, with allowances for the fact that this is a comparatively small group. Checking the data reveals the following:

• s = 10.27. Since R = 62, and in a normal distribution s = 1/6 × R, the actual standard deviation is quite close to the expected value: 1/6 × 62 = 10.333.

• M ≈ Mdn (the difference is a modest 0.43).

• The skewness value of 1.2 indicates a slight amount of positive skew, not surprising since M > Mdn.

• The kurtosis value is worrisome.

The kurtosis value of 4.65 is far outside the ±2.0 range for normal data. With the balance of the descriptive statistics reasonably close to what was expected, the psychologist decides to reexamine the raw data using the interquartile range. Since the 25th percentile (Q1) is a score of 49, and the 75th percentile (Q3) is 60, the interquartile range (IQR) = 11.

For lower outliers,

Q1 - (1.5 × IQR) = 49 - (1.5 × 11) = 32.5; any score below 32.5 is an outlier.

For upper outliers,

Q3 + (1.5 × IQR) = 60 + (1.5 × 11) = 76.5; any score above 76.5 is an outlier.

Digital Vision/Photodisc/Thinkstock

When analyzing a small sample of test scores an extreme score has high potential to distort the mean. In this example, excluding the outlier will give a more accurate mean of an average score.

(continued)


Writing Up Statistics

Skew and kurtosis values are rarely presented in a research report unless the study itself concerns the normality of the data. Researchers typically determine these values to know how they should treat the data. Values for the mean and standard deviation, on the other hand, are frequently reported. In their study of bipolar behavior among young adults, Malik, Goodwin, Hoppitt, and Holmes (2014) reported results from a Mood Disorder Questionnaire (MDQ). Table 2.7 shows their results in part.

Table 2.7: Demographic characteristics, emotional measures, trauma history, and general imagery measures for high MDQ and low MDQ groups

Characteristic/measure      High MDQ (n = 50)     Low MDQ (n = 50)
                            M         SD          M         SD
Age                         25.48     10.11       28.02     9.66
Emotional measures
  MDQ                       10.62     2.03        2.45      2.05
  Beck depression inv.      8.60      6.93        3.00      3.66
  Eysenck personality       5.58      2.49        2.40      5.42

Source: Malik, A., Goodwin, G., Hoppitt, L., & Holmes, E. (2014). Hypomanic experience in young adults confers vulnerability to intrusive imagery after experimental trauma: Relevance for bipolar disorder. Clinical Psychological Science, 2, 675–684.

Apply It! Placement Test Outliers (continued)

Based on these rules, only the minimum score of 15 is an outlier. The psychologist looks again at the scores and finds a transcription error. The minimum score should have been recorded as 51, not 15. With the score changed from 15 to 51 and descriptive statistics re-calculated, the results are as follows:

Mean                  55.55     Median           54.23
Standard deviation     8.05     Kurtosis         −0.16
Skewness               0.38     Range            35
Minimum score         42        Maximum score    76

Note that the new value for kurtosis is a much more acceptable −0.16, just slightly negative. That value means that the sample is slightly platykurtic, as relatively small samples often are.
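The change in the mean can be checked without the raw scores, because correcting one score changes the total by a known amount. The sketch below uses only the reported group size, the original mean, and the two versions of the mistyped score; the variable names are illustrative.

```python
n = 40                      # number of test scores
old_mean = 54.65            # mean reported before the correction
wrong_score, correct_score = 15, 51

# Recover the original total, swap the mistyped score, and average again.
old_total = old_mean * n
new_total = old_total - wrong_score + correct_score
new_mean = new_total / n

print(round(new_mean, 2))
```

The correction adds 36 points to the total, or 36/40 = 0.9 points to the mean, which agrees with the recalculated mean of 55.55. The median, by contrast, moves only from 54.22 to 54.23.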

Not everything we measure will be normally distributed, but when we have reason to expect that data will be, and something occurs which calls normality into question, a little investigative work might be in order.

Apply It! boxes written by Shawn Murphy


The table shows a cross-tabulation, which allows the authors to report the two groups’ means and standard deviations (note that Malik et al., 2014, use “SD” rather than the “s” that we use to designate standard deviation) for four different personality characteristics.

Summary and Resources

Chapter Summary

Everyone has to analyze data. Whether someone is trying to determine if crossing a street is safe, read a co-worker’s body language, or conduct a complex numerical analysis of many variables and hundreds of data points, the tasks are analytical. When the data are numerous and the decision is important, beginning with organization is usually helpful. Learning to organize and present data is the principal aim of this chapter (Objectives 1 and 2). As the size of the data set increases, so does the need for the information that good organization can provide. Proper organization can make muddled data comprehensible; an ordered array is a good deal more informative than a disordered array. Whether a frequency distribution, pie chart, stem-and-leaf plot, or some other presentation is most appropriate depends upon what information we need to know. Frequency distributions provide a clear indicator of data repetition; pie charts reveal comparative proportions; and a frequency polygon can provide a rough estimate of whether data are normally distributed.

The normal distribution is a central concept in statistics and data analysis. When data are normally distributed, we know from the beginning roughly how data will be distributed relative to the mean. We know what proportions of a data distribution will occur where, a concept that will prove very helpful as we examine z scores in Chapter 3. Sometimes, even when organized into the proper display, a data set does not reflect the characteristics of a normal distribution. Consistent with Objectives 3 and 4, we learned to judge normality based on a few descriptive statistics. If we can compare the mean to the median and the standard deviation to the range, we can make a rough estimate of whether data are normal.

Individual scores can threaten normality when they are uncharacteristic of the population as a whole. This is not to suggest that a normally distributed population has no variability; variability is what gives the frequency distribution its characteristic curve. But when the extreme scores are predominantly in one direction, or when there are too many extreme scores relative to the group as a whole, the resulting data can mislead the interpretation. Outliers create skewed distributions, so having a mechanism for identifying skew can be helpful (Objective 5).

Key Terms

abscissa The horizontal (x) axis in a graph based on Cartesian coordinates.

actual limits In a class interval, includes upper and lower scores with decimal values, which allows consistent classification of any value into one of the class intervals.

apparent limits The highest and lowest integers in a category for a particular class interval.

bar graphs Graphical data presentations that indicate proportions as comparative vertical or horizontal columns. Bar graphs are often used to indicate the frequency with which nominal data occur.

Cartesian coordinates The values associated with the horizontal and vertical axes, often respectively termed x and y, that indicate the location of a score on a graph.

class intervals The groups in a grouped frequency distribution.

disordered arrays A presentation of data without organization into classes or groups or by magnitude.

frequency distribution A graphical data presentation that indicates the number of times each score in a group occurs.

frequency polygon A graphical data distribution in which straight lines connect the midpoints of the bars that make up a histogram. The normal curve is a frequency polygon but with so many successive values that the individual lines are imperceptible.

grouped frequency distribution A graphical data presentation in which the data are grouped according to some characteristic held in common.

histograms A graphical presentation of interval or ratio scale data in which the “bars” in the bar graph-like presentation represent continuous categories reflected in the absence of gaps between the bars.

interquartile range (IQR) The middle half of the distribution stretching from the first to the third quartile, or from the measure representing the 25th percentile to that representing the 75th percentile.

kurtosis A descriptor indicating the level of homogeneity among the measures in a data distribution.

leptokurtic A descriptor for distributions of highly similar data gathered too closely around the mean for the resulting distribution to be normal.

mesokurtic Literally “middle” kurtic, distributions with an intermediate level of homogeneity among the data that are characteristic of the conventional bell-shaped curve, which represents a normal distribution.

nominal scale Data categories that are not continuous and therefore the order of the categories is unimportant.

normal (or Gaussian) distribution Distribution characterized by the following: symmetry, unimodality, and having data distributed such that the range has about six times the value of the standard deviation. When presented in a frequency distribution, a normal distribution takes on the bell-shape by which it is commonly described.

ordered array A presentation in which data are organized into classes or groups or by magnitude.

ordinate The vertical (y) axis in a graph based on Cartesian coordinates.

outlier A score that is substantially different from most of the other scores in a data distribution. If they are not balanced by outliers in the opposite direction, these extreme scores create skew.

percentile rank A point in a data distribution below which a specified percentage of the scores occur.

pie chart A circle that is divided into sectors. The size of each sector is defined by the percentage of the total area of the circle a whole category represents; also called a pie graph.

platykurtic A descriptor for distributions in which the data are too heterogeneous to constitute a normal distribution. Samples are typically more platykurtic than the populations they represent.


point of origin In a graph based on Cartesian coordinates, the point where the values plotted on the abscissa and those on the ordinate both equal zero.

quartiles Indicate the fourths of a data distribution. The first quartile (Q1) is from the point representing the 25th percentile down. The second quartile is from the median (Mdn) of the distribution to Q1, and so on.

skew A descriptor that indicates that a data distribution lacks symmetry, a characteristic which occurs when scores on one side of the mode are more extreme than those on the other. Positive skew indicates that in the frequency polygon plotted to represent the data, the curve to the right of the mode is more gradual than that to the left. When the skew is negative, the curve to the left of the mode is more gradual.

stem-and-leaf plots Also called stem-and-leaf displays. Graphical data presentations which arrange data in a vertical hierarchy from the smallest to the largest. All values except the last in each measure constitute the “stem” of the display. The last value is the “leaf.” For two-digit numbers, the stem is the 10s value, and the leaf is the 1s value. The leaves of all measures that share a stem appear to the right of their common stem.

unimodal A distribution with just one most frequently occurring value, one mode.

Review Questions

Answers to the odd-numbered questions are provided in Appendix A.

1. The following data are available from a depression scale:

2, 20, 23, 25, 12, 27, 14, 9, 22, 27, 19, 14, 33, 22, 25, 43

Arrange them in a stem-and-leaf display.

2. What scale of data array do the values in Question 1 represent?

3. What’s the difference between a bar graph with vertical columns and a histogram?

4. A data set has the following characteristics:

a. Mdn = 45
b. M = 40
c. R = 36
d. s = 6

Describe the distribution in terms of skew and kurtosis.

5. How can a data distribution be symmetrical and mesokurtic but not normal?

6. What percentage of the entire distribution does the interquartile range include?

7. If the data in Question 1 were characteristic of a population, how would you describe the population in terms of normality?

8. Using the data in Question 1, which scores would be excluded as outliers using the ±2 standard deviation rule?


9. Researchers use the Macro Anger Detector (MAD) to measure aggression level. Scores for 12 clients are as follows:

9, 15, 22, 14, 19, 23, 30, 14, 24, 8, 11, 28

a. Are these data skewed? What is the evidence?
b. How would you describe these data regarding kurtosis? Why is this result predictable?

For purposes of a graph, class intervals for age are as follows:

10–19, 20–29, 30–39, 40+

What is the actual limit between the first (10–19) and second (20–29) class intervals?

10. For the data in Question 9,

a. what data score marks the 50th percentile?
b. what data scores mark the extremes of the interquartile range?

11. Assume that the data in Question 9 are accurate for the range of some population. What value will the population standard deviation have if MAD data are normally distributed?

12. A grouped frequency distribution for hostility values among incarcerated felons is as follows:

Range     Hostility                   f
12–15     Overtly hostile             3
8–11      Occasional hostility        5
4–7       Passively hostile           7
0–3       No evidence of hostility    4

Calculate an approximation of the mean of the original scores.

Answers to Try It! Questions

1. These classifications allow one to rank scores, the primary characteristic of ordinal data.

2. For a score of 1012, the stem would be 101, all the values preceding the last.

3. The plot would have 10 stems: one for the single-digit values, one for scores in the teens, the 20s, and so on up to the 90s.

4. In a distribution with negative (left) skew, because the mean is most affected by the outliers and the mode the least, the order of the measures of central tendency will usually be mean, median, and mode, from left to right. In small data sets, it is possible for the median and mode to be reversed, but in populations and large samples, the order will be M, Mdn, Mo, left to right.
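The stem-and-leaf rule in answers 2 and 3 (the stem is every digit except the last) can be sketched in Python. This is a minimal illustration that assumes non-negative whole-number scores; the function name `stem_and_leaf` and the example scores are made up for the purpose and are not data from the chapter.

```python
from collections import defaultdict


def stem_and_leaf(scores):
    """Group whole-number scores by stem (all digits except the last)."""
    plot = defaultdict(list)
    for score in sorted(scores):
        stem, leaf = divmod(score, 10)  # e.g., 1012 -> stem 101, leaf 2
        plot[stem].append(leaf)
    return dict(plot)


print(divmod(1012, 10))

example_scores = [7, 12, 15, 21, 21, 34]  # made-up values for illustration
display = stem_and_leaf(example_scores)
for stem, leaves in display.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Single-digit scores receive a stem of 0, which corresponds to the separate stem for single-digit values described in answer 3.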
