Descriptive Statistics Data Analysis


Dear Mom, I hope you’re doing well. I haven’t heard from you in over two weeks, and I worry you’re still mad at me about the incident. I’d prefer to talk to you in person, but you don’t seem to be getting my phone calls and you didn’t answer the door when I visited. So, I must resort to explaining myself in an email.

First, when I dropped by your house with my new book project, I had no idea it would cause so much trouble. It’s only a statistics text, after all. Yes, I probably should’ve called ahead, but in my defense, you asked me to bring a copy over some time. How was I to know you’d be hosting a party for the Shady Oaks Estates Ladies Club when I got there? And how was I to know you’d immediately start showing the book to all your guests before you even looked at it yourself?

Asteroid Belts and Spandex Cars: Using Descriptive Statistics to Answer Your Most Weighty Questions

CHAPTER 3

Jarman, Kristin H. The Art of Data Analysis: How to Answer Almost Any Question Using Basic Statistics. John Wiley & Sons, Incorporated, 2013. ProQuest Ebook Central, http://ebookcentral.proquest.com/lib/ashford-ebooks/detail.action?docID=1175199. Created from ashford-ebooks on 2019-08-20 18:47:06.

Copyright © 2013. John Wiley & Sons, Incorporated. All rights reserved.


Second, I agree “Your mama’s so heavy . . .” jokes are rude. That’s kind of the point. But please believe me, none of the insults in the book were meant to refer to you or any of your friends, especially not Marta.

Third, there’s no need to wash out my mouth with soap. Yes, some of the jokes had foul language, rude references, and politically incorrect words. But I’ve removed the worst offenders and I hope you’ll find the revised chapter much less objectionable.

Maybe you’re right and insults like these have no place in a legitimate textbook. But you see, descriptive statistics are a lot like adjectives. An adjective, as you know, is a word that describes a person, place, or thing. A descriptive statistic is a number that describes a dataset. As you pointed out, “Your mama’s so heavy . . .” jokes are full of colorful adjectives. What better way to illustrate descriptive statistics than to use them to answer the question, “How heavy is she?”

I sincerely apologize for the ruckus this chapter caused at your last party. I never meant to offend anyone. Please convey this apology to all your friends. Also, tell Marta the joke about the lady so heavy she deep fries her toothpaste was not a reference to her. Call me sometime.

Sincerely, Your Loving Daughter

A recent Google search for “your mama” jokes came back with more than two million hits. Two million websites insulting “your mama.” (And when I say “your mama,” I don’t mean your mom, the woman who gave birth to you and raised you to be the fine citizen you are today. I mean that other guy’s mom. You know the one.) Even discounting duplicates, that’s a lot of jokes, a lot of descriptions of this unfortunately-sized woman. Take these three jokes, for example:

Your mama’s so heavy, she shows up on radar.
Your mama’s so heavy, she’s been zoned for commercial development.
Your mama’s so heavy, her belt size is Equator.

These three descriptions paint very different pictures, and if you were to use these descriptions to estimate the woman’s size, you’d most likely come up with three very different answers. In other words, you’d have variation in your data. Virtually all real-world datasets have variation, or differences between observations. So how do you describe a dataset in the presence of variation? You use descriptive statistics. Descriptive statistics are numbers that summarize properties of a dataset. For example, suppose I had ten insults describing “your mama,” a woman whom I’d never laid eyes on. I might take each of those insults and use it to calculate the weight and waistline of this woman. These calculated values would make up my dataset, and from them, I might be able to say the following:

“Your mama” is typically estimated to weigh 3,182 lbs.
Her belt size falls somewhere between Wide Load and Equator.
There’s an 80% chance she drives a spandex car.

The typical weight. The range of her belt size. The likelihood she’s big enough to require a car made of spandex. All of these values are descriptive statistics that tell you something about the woman whose size is in question.

Descriptive statistics are estimates, values calculated from a sample, values that approximate some property of the entire population. The average is a commonly used estimate. It’s calculated from a sample of data and it approximates the typical value of a population. More on this descriptive statistic later.

Life in the information age is full of descriptive statistics. Whenever a drug commercial warns its product might cause spontaneous bleeding, or a cable newscaster declares the economy is on life support, those statements are based on descriptive statistics. Whenever you search for a website, participate in a Facebook poll, or submit a customer review of your new cell phone, the information you provide gets combined with others’ into a dataset, a dataset that’s eventually summarized by someone: a search engine company, a friend, a marketing department. In short, wherever you’ve got data, you’ve got descriptive statistics, and they can be used to summarize virtually anything. The likelihood a person will search “funky chicken” on the Internet. The average number of whoopee cushions sold in stores last quarter. Or, as you’ll see in this chapter, the typical belt size of a woman who fits the description of “your mama.”

THEY MAMA

With over a half million web pages devoted to “your mama,” gathering jokes is easy. However, it would take years for me to collect every insult in the world, especially considering the fact that new ones are constantly being created. Fortunately, I don’t need every insult to describe this demographic, a group of ladies I’ll call “they mama.” I only need a sample, a subset of insults that represents the whole population of all “your mama” jokes in the world. As mentioned in Chapter 2, the most objective type of sample is a random sample, where every insult has the same likelihood of being included. Random sampling is especially important for a case study such as this, where my own personal joke preferences could cause me to pick certain insults over others, resulting in a skewed picture of “they mama.” To collect my sample, I went to two of the biggest user-contributed joke websites, Aha! Jokes and Yo’ Mama Jokes Galore, and chose two hundred “your mama so heavy” jokes at random.
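Drawing a simple random sample like this takes only a few lines of code. The sketch below is illustrative only: the pool of 1,000 numbered placeholder jokes and the seed value are assumptions, not the author’s actual data or procedure.

```python
import random

# Hypothetical stand-in for the full pool of jokes gathered from the two sites.
joke_pool = [f"joke #{i}" for i in range(1, 1001)]

# A fixed seed (arbitrary choice) makes the draw repeatable.
rng = random.Random(2013)

# Sample without replacement: every joke has the same chance of being chosen.
sample = rng.sample(joke_pool, k=200)

print(len(sample), "jokes drawn; first three:", sample[:3])
```

Because the computer does the choosing, personal joke preferences have no way to tilt the sample.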

Like the “your mama” who weighs herself on the Richter scale, the difference between ladies in “they mama” is huge. For example, the woman who needs to grease the door when she enters the house is petite compared to the one who’s been named Miss Arizona . . . Class Battleship. Miss Arizona Class Battleship is tiny compared to the woman who influences the tides. With so much variation, I need descriptive statistics to give me an idea of just how big the ladies in this demographic are. But first, I need to turn these colorful yet vague insults into data.

QUALITATIVE VERSUS QUANTITATIVE DATA

All data fall into two categories: qualitative and quantitative. Qualitative observations or data are typically categories, groups or characteristics. Hair color and favorite foods are examples of qualitative observations. Quantitative observations or data are numerical values. Weight and belt size are examples of quantitative observations.

These two types of data are generally treated differently. The reason? Plain and simple arithmetic. Qualitative observations cannot be sorted into a numerical order. For example, suppose you’re analyzing the hair color of a group of ladies. You might take each lady and categorize her into one of a few groups: blonde, brown, red, black, and gray. The color brown isn’t larger or smaller than red. It’s just different. And without a mathematical relationship between observations, we’re somewhat limited in our ability to mathematically summarize qualitative data. Quantitative observations, on the other hand, are meaningful numerical values and so they can be sorted. If you’re weighing the ladies in “they mama,” for example, 4,500 lbs is heavier than 4,400 lbs, which is heavier than 4,350 lbs, and so on. This mathematical ordering allows us to use the full arsenal of arithmetic, algebra, and even calculus to summarize quantitative data. I’ll describe “they mama” both ways, starting with the simpler of the two approaches.

Qualitative Analysis

After reading over my list of insults gathered from the Internet, I began to notice a pattern. While the details of the different jokes vary, many of them compare “your mama’s” size to the same small number of objects: cars, whales, buildings, and so on. These comparisons give me a convenient way to categorize the insults, turning a bunch of one liners into qualitative data. After poring over the list a few more times, I settled on seven categories of ladies: (1) Unimpressive, (2) Large Mammals, (3) Planes, Trains, and Automobiles, (4) Buildings, (5) Geographical and Geological Phenomena, (6) Astronomical Objects, and (7) Who Knows? The first category, Unimpressive, includes women whose size, while large, is nothing particularly special. The sizes of the ladies in groups two through six are implied by the category labels and should be obvious. The last category, Who Knows?, is a catch-all group of insults without any obvious, concrete reference. Many of these insults have something to do with the woman’s eating habits or clothing challenges. Examples of jokes falling into each category are listed in Figure 3.1.

You might wonder why I didn’t make planes a separate group from trains and automobiles, or why I didn’t separate the women who cause geological phenomena from the women who take up large geographic areas. The answer is simple. I made a judgment call. I categorized the ladies this way because it makes sense to me. If you prefer a different grouping, I urge you to find your own jokes and repeat the following analysis for yourself.

Once all the ladies in “they mama” have been categorized, I have a set of qualitative observations to work with. Like all qualitative data, these have no clear mathematical ordering, and so they cannot be analyzed by any method that arithmetically compares different observations. So, I’ll do what people usually do with data like these. I’ll start with something called a frequency distribution.


Figure 3.1. “Your mama’s so heavy . . .”

Category                                 Example
Unimpressive                             …when she steps on the scale, it says, “To be continued…”.
Large Mammals                            …she got baptized at SeaWorld.
Planes, Trains, and Automobiles          …she shows up on radar.
Buildings                                …she’s been zoned for commercial development.
Geographical and Geological Phenomena    …when she went to the beach, she caused a tsunami.
Astronomical Objects                     …she wears an asteroid belt.
Who Knows?                               …she deep fries her toothpaste.

The frequency distribution is a common way to summarize a set of observations. For qualitative data, the frequency distribution is just a list of counts: the number of observations falling into each category. Figure 3.2 lists the frequency distribution for “they mama.” In the figure, the relative frequency is also included. The relative frequency represents the fraction (or alternatively, percentage) of all the observations falling into each category. This fraction is the ratio of the number of counts in each category to the total number of observations. The percentage is just the fraction multiplied by 100.

The frequency distribution in Figure 3.2 shows the relative popularity of different types of insults. For example, the smallest women in “they mama” are Unimpressive. Twenty-seven, or 14 percent, of the ladies fall into this group. On the other end of the body mass spectrum sits the Astronomical Objects category. This category may contain the largest women, but with only 4% of the insults, it’s the least popular. The categories Large Mammals to Astronomical Objects include those women who are both impossibly large and compared to something concrete. Adding up the frequencies of these categories tells us these ladies make up 52% of all the jokes.
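The counting described above can be sketched in a few lines of Python. The ten category labels below are a made-up mini-sample chosen for illustration, not the author’s 200 insults.

```python
from collections import Counter

# Hypothetical mini-sample of category labels (not the author's data).
labels = [
    "Who Knows?", "Buildings", "Unimpressive", "Who Knows?",
    "Large Mammals", "Who Knows?", "Buildings", "Astronomical Objects",
    "Unimpressive", "Who Knows?",
]

# Frequency distribution: the number of observations in each category.
counts = Counter(labels)
total = sum(counts.values())

# Relative frequency: each count divided by the total number of observations.
relative = {category: n / total for category, n in counts.items()}

for category, n in counts.most_common():
    fraction = relative[category]
    print(f"{category:<22} {n:>2}  {fraction:.2f}  {100 * fraction:.0f}%")
```

The relative frequencies always add up to 1 (or 100%), which makes a handy check on the arithmetic.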


With relatively few observations and relatively few categories, tables like Figure 3.2 do a fair job of illustrating the frequency distribution. However, nothing compares to a good graph. Bar charts are particularly useful for frequency distributions. A bar chart displays each category as a box whose height represents the number of observations in that category. Figure 3.3 shows a bar chart of the “your mama” insults, plotted as the relative frequency in terms of percentage.

A quick glance at Figure 3.3 tells us a lot about these data. The mode, the most popular category in “they mama,” is Who Knows? This category contains almost twice as many insults as any other. Astronomical Objects, with a relative frequency roughly ten times lower than Who Knows?, is the least popular. The categories Unimpressive; Planes, Trains, and Automobiles; Buildings; and Geographical and Geological Phenomena are comparable, each having a relative frequency between roughly 10 and 20%.
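Finding the mode and drawing a rough bar chart can be sketched as follows. The relative frequencies are the ones listed in Figure 3.2; the text-only chart made of `#` characters is just a stand-in for a real plotting tool.

```python
# Relative frequencies as listed in Figure 3.2.
relative_frequency = {
    "Unimpressive": 0.14,
    "Large Mammals": 0.04,
    "Planes, Trains, and Automobiles": 0.13,
    "Buildings": 0.11,
    "Geographical and Geological Phenomena": 0.21,
    "Astronomical Objects": 0.04,
    "Who Knows?": 0.35,
}

# The mode is the category with the highest relative frequency.
mode = max(relative_frequency, key=relative_frequency.get)
print("Mode:", mode)

# Crude text bar chart: one '#' per percentage point.
for category, freq in relative_frequency.items():
    bar = "#" * round(freq * 100)
    print(f"{category:<40} {bar} {freq:.0%}")
```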

Aside from the mode, which tells you what category is most typical in your dataset, relative frequencies are the most common descriptive statistics used in the analysis of qualitative data. For example, the batting average of a baseball player is a relative frequency, the number of base hits divided by the total number of times at bat. The results of political polls are typically relative frequencies, expressed as a percentage of people who approve of a particular candidate. Relative frequencies are particularly useful for calculating probabilities. This topic will be discussed more in the following chapter and throughout the rest of the book.

Figure 3.2. Frequency distribution of “they mama.”

Category                                 Counts (Number of Insults)    Relative Frequency
Unimpressive                             27                            0.14
Large Mammals                            …                             0.04
Planes, Trains, and Automobiles          …                             0.13
Buildings                                …                             0.11
Geographical and Geological Phenomena    …                             0.21
Astronomical Objects                     …                             0.04
Who Knows?                               …                             0.35
Total                                    200                           1.0

Quantitative Analysis

The frequency distribution in Figure 3.3 provides us with a good general summary of “they mama,” but it doesn’t really answer the question at hand. In order to determine exactly how heavy “your mama” is, I need more precise information, measurements that indicate her belt size or her weight. In other words, I need quantitative data. The kind of numerical values that’ll only be possible with a little creative calculation.

Figure 3.3. “They mama” frequency distribution: how heavy is she?

For example, imagine a woman who’s so large she drives a spandex car. What kind of car is it? Is it a spandex Corolla or a spandex Suburban? The answer makes a difference. According to cars.com, a Toyota Corolla is about 69 inches wide, where a Chevy Suburban measures in at roughly 79 inches wide. That’s a difference of 10 inches in width, which translates to a 30-inch difference in the belt size of a woman who fills such a car.
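The 30-inch figure follows if the waistline is treated as a circle whose diameter matches the width of the car. That geometry is an assumption made here for illustration; the text doesn’t spell it out. Since circumference is π times diameter, the difference in belt size is

$$\Delta C = \pi \,(d_{\text{Suburban}} - d_{\text{Corolla}}) = \pi \,(79 - 69) \approx 31.4 \text{ inches},$$

or roughly 30 inches.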

Many of my insults are like this one, leaving much room for variation in the measurement. I could go through the jokes one by one, estimating a range of sizes for each and folding them together into one grand dataset. However, this would be a tedious process with many details to explain. So, rather than becoming distracted by creative calculations, I’ll pick one insult and conduct a quantitative analysis on that. Here it is: “Your mama’s so heavy she has her own zip code.” Let’s say the U.S. Postal Service assigns zip codes to ladies based on the size of their waistline. In other words, assume “your mama’s” waistline takes up an area roughly the size of a typical zip code region. Exactly how big is this area?

The U.S. Census Bureau spends millions of dollars every year keeping track of the U.S. population—demographics, number of houses, income, number of children, and yes, the land area of every zip code in the country. I’m pretty sure the government never expected the data to be used to measure “your mama’s” waistline, but it makes these data publicly available in any case. I should mention there’s something special about these zip code areas. Where most datasets are merely samples, or subsets of the entire population, these data are complete, including every zip code in the country. In other words, I don’t just have a sample here. I have the entire population.

As of this writing, the 2010 data are not yet available, but the 2000 Census zip code tabulation areas (ZCTAs) can be downloaded from www.census.gov. There are a little over 32,000 values in this dataset, each one representing the land area, in square miles, of a U.S. zip code. Unless you’re squeamish about lots of values, I encourage you to download these data and repeat some of the following analysis yourself.

THE DESCRIPTIVE POWER OF STATISTICS

Imagine we’re outside on a sunny day. You’re blindfolded and I’m trying to describe a cloud in the sky. I might tell you where it is, whether high in the sky or near the horizon. I might tell you how big the cloud is, whether it takes up one tiny quadrant or looms over the entire landscape. I also might describe its texture, where it’s thick and dense and where it’s light and transparent. These three characteristics, location, size, and texture, would help you form a more detailed picture of the cloud in your mind. These same characteristics are the ones most commonly used to describe quantitative observations, a data cloud, if you will. In statistics jargon, the location is often called central location, the size is often referred to as variation, and the texture is often represented with the frequency distribution. I’ll start by describing the last of these characteristics first.

Frequency Distributions and Histograms

The frequency distribution measures the relative popularity of every category in a qualitative dataset. For a quantitative dataset, it measures the texture of the data cloud, where the cloud is thick and dense with many observations, and where it is thin with few or none. Constructing the frequency distribution for quantitative data is a lot like the process for qualitative data, where observations in each category are tabulated and then plotted using a bar chart. In fact, for discrete quantitative data—observations that can only take on distinct values like integers from one to ten—the process is exactly the same. You simply count up the number of ones, twos, threes, and so on, and then list or graph the results. The differences arise when you have continuous quantitative data—values that take on all possible numbers in some specified range. Why? With continuous values, you can spend your entire life counting observations at any given value, but there will always be another value to count. For example, if your data can take on any value between 50 and 60 and you count the number of 56s and 57s, you still have 56.1, 56.2, 56.5, and lots of others. If you then count the number of 56.1s, 56.2s, and so on, you still have 56.15, 56.23, 57.28, and lots of others. You can stretch the decimal places all you want and there will still be more numbers to count. Infinitely many of them, in fact. Try plotting a bar graph with infinitely many categories on the x-axis. Go ahead. I’ll wait.

For quantitative data, the frequency distribution is typically calculated by splitting the observations into discrete bins, or ranges of values, and counting the number of observations falling into each bin. For example, if you have observations between 50 and 60, you might construct ten bins, one for observations falling between 50 and 51, one for observations falling between 51 and 52, and so on. By binning the data in this way, you can tabulate frequencies and relative frequencies for each bin and then list or graph the results.
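The ten-bin example above can be sketched in Python. The twelve observations are invented for illustration, and the convention that each bin includes its left edge but not its right one (with the maximum placed in the last bin) is an assumption; the text doesn’t say how values on a boundary are handled.

```python
# Hypothetical continuous observations between 50 and 60.
observations = [50.3, 51.7, 52.2, 52.9, 54.0, 55.5,
                56.1, 56.15, 56.23, 57.28, 58.8, 59.6]

low, high, n_bins = 50.0, 60.0, 10
width = (high - low) / n_bins

counts = [0] * n_bins
for x in observations:
    # Bin i covers [low + i*width, low + (i+1)*width);
    # a value equal to the maximum is placed in the last bin.
    i = min(int((x - low) / width), n_bins - 1)
    counts[i] += 1

# List the frequency and relative frequency for each bin.
for i, count in enumerate(counts):
    left = low + i * width
    relative = count / len(observations)
    print(f"{left:.0f}-{left + width:.0f}: {count:>2}  relative {relative:.2f}")
```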

I’ve said this before, but it’s worth repeating. The best way to look at a large amount of data is with a graph. A graph helps you visualize your observations and it can be incredibly helpful in identifying outliers, extreme or unusual values that can impact your results. You’ve already seen how a bar graph can be used to visualize the frequency distribution for qualitative data. The same type of graph can be used to visualize the frequency distribution for quantitative data. This type of graph is so popular, it has its own name: a histogram.

When plotting a histogram, deciding how to bin the data is a bit of an art, but some guidelines are available. Plotting programs such as Excel divide the range into √N equal-size bins, where N is the number of observations (Microsoft Corporation 2012), but the best choice often depends more on what makes the most sense for your specific problem than what any spreadsheet software recommends. In general, bins are calculated from the maximum, the largest data value, and the minimum, the smallest data value, by dividing this range into equal length intervals. The number of bins you use can dramatically change the shape of the histogram, especially when your sample is small and you have only a few observations per bin. I recommend experimenting with different bin sizes. It’ll help you decide if the shape you’re seeing reflects the true nature of your data or simply your choice of bin widths.

For the zip code tabulation areas, the minimum area is 0.0019 square miles, about the size of a large building, and the maximum is 5,497 square miles, about the size of Connecticut. With just over 32,000 values, the square root criterion suggests 179 bins for this histogram, each with a width of about 31 square miles. However, this choice of bin number produces a fairly useless histogram, with over 90% of all zip code areas falling into the first two bins. Such a graph, with only two visible bars and no hints as to the shape of the distribution, doesn’t really teach us anything. Instead, I’ll increase the number of bins to 1,500. This breaks up the first few bins, making it easier to see what happens at the low end of the zip code areas. This histogram is plotted in Figure 3.4.
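The bin arithmetic in this paragraph can be checked with a short sketch. The exact number of values isn’t given (only “just over 32,000”), so the code assumes 32,000; the minimum and maximum are taken from the text.

```python
import math

n_values = 32_000                   # assumption: the text says "just over 32,000"
minimum, maximum = 0.0019, 5497.0   # square miles, from the text

# Square root criterion: use about sqrt(N) equal-width bins.
n_bins = round(math.sqrt(n_values))
bin_width = (maximum - minimum) / n_bins

# Width of each bin when the count is raised to 1,500 instead.
fine_width = (maximum - minimum) / 1500

print(n_bins, round(bin_width, 1), round(fine_width, 1))
```

With 1,500 bins, each bar spans only a few square miles, which is why the crowded low end of the histogram finally becomes visible.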

Figure 3.4. Frequency distribution of zip code tabulation areas.

Figure 3.5. An illustration of the bell-shaped frequency distribution.

If you’ve been in your statistics class for more than a few weeks, you’ve probably heard about the bell-shaped frequency distribution. Illustrated in Figure 3.5, the bell-shaped distribution is the poster child of all frequency distributions, with its central peak and gently sloping, symmetric sides. Many datasets have a bell-shaped distribution. The zip code dataset isn’t one of them. There are a large number of very small zip code areas, less than about 16 square miles. Beyond that, the relative frequencies taper off slowly, from 16 all the way out to 5,497 square miles (although the histogram has been cut off at 700 square miles so you can see details of the smaller zip code areas). It’s like a cloud that’s thick and dense on one end, and thin at the other. Because the histogram looks like it’s been stretched to the right, this frequency distribution shape is called right-skewed. Right-skewed distributions are fairly common when it comes to measuring counts, distances, and areas. There’s also a left-skewed distribution, whose histogram looks like it’s been stretched to the left, but those are less common.

At this point, you might be asking yourself why you should care about the shape of the histogram. After all, we’re only trying to estimate the typical belt size of a woman who has her own zip code. Does it really matter whether the data are symmetric, left-skewed, or right-skewed? As you’ll see in the following section, it does.

Central Location

The term central location refers to the center of a data cloud, in other words, the spot around which all the data values are clustered. Two descriptive statistics are commonly used to measure central location: the sample mean and the median. Of those two, the sample mean, or the average, is most common.

Sample Mean (Average)

Even if you’ve never had a single teacher utter the word “statistics” in class, you’ve probably run across the average. Calculated by adding all the data values together and dividing by the number you have, the average pinpoints the arithmetic center of a dataset.

For measurement values x_1, x_2, . . . , x_N, the average is

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N}.$$

For example, the average of the four numbers 4, 5, 5, and 6 is (4 + 5 + 5 + 6)/4 = 5.


Median

The median is another common way to measure the central location of a data cloud. However, unlike the average, the median isn’t calculated from an arithmetic formula. Rather, it’s the midpoint, or middle measurement value. To calculate the median, sort your list of numbers from smallest to largest. If there’s an odd number of values in the list, pick the one in the middle. If there’s an even number, then average the two middle values. For example, the median of 4, 5, 5, and 6 is the average of the two middle numbers, 5 and 5, which is, of course, 5.
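The sort-and-pick recipe above translates directly into code. This is a minimal sketch; the function name `median` is chosen here for illustration, and Python’s standard library offers an equivalent in `statistics.median`.

```python
def median(values):
    """Middle value of a list, following the sort-and-pick recipe."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd number of values: pick the one in the middle.
        return ordered[mid]
    # Even number of values: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([4, 5, 5, 6]))
print(median([6, 4, 5]))
```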

If a frequency distribution is symmetric, meaning the two halves of the histogram are mirror images of one another, the large values will balance the small values both arithmetically and in terms of the middle position. In a case like this, both the average and median will lie in the middle of the histogram, very close to one another. The sample mean and median of the bell-shaped data shown in Figure 3.5, for example, are both ten, the center position on the histogram. On the other hand, if the frequency distribution isn’t symmetric, all bets are off. Extreme values and nonsymmetric bumps in a dataset can cause these two descriptive statistics to be quite different. The sample mean of the zip code areas is 85.8 square miles. The median, 37.7 square miles, is less than half that value.

Why should the sample mean and median be so different? These two statistics both measure the center of a data cloud, but they do it in two very different ways. Think of the numbers 4, 5, and 6, for example. The average of these three values is (4 + 5 + 6)/3 = 5. The median, the middle number, is also five. Now stretch the six to a nine, giving the values 4, 5, and 9. The average of these numbers has increased to (4 + 5 + 9)/3 = 6, but the median is still five. Stretch the 9 to a 12 and the average increases to 7. The median? Still 5. The average takes into account all the values in the dataset, even the very large or small ones. The median chooses the middle value, regardless of what’s happening at the extremes. In other words, the mean is impacted by big or small values, while the median isn’t nearly as much.
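The stretching experiment is easy to repeat. The sketch below uses the `mean` and `median` functions from Python’s standard `statistics` module on the three lists from the text; the `results` dictionary is just a convenient place to keep the answers.

```python
from statistics import mean, median

# The three lists from the text: the largest value is stretched from 6 to 9 to 12.
datasets = [[4, 5, 6], [4, 5, 9], [4, 5, 12]]

# Map each list to its (mean, median) pair.
results = {tuple(values): (mean(values), median(values)) for values in datasets}

for values, (avg, mid) in results.items():
    print(values, "mean:", avg, "median:", mid)
```

The mean climbs from 5 to 6 to 7 as the largest value grows, while the median stays put at 5.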

According to the sample mean and the median of the zip code tabulation areas, a typical “your mama’s” waistline consumes anywhere from about 38 square miles (the median) to about 86 square miles (the sample mean). In other words, she takes up more space than Newark, New Jersey, and less than Amarillo, Texas. Which one is a better indicator of the central location of this data cloud? As you’ll see in coming chapters, the sample mean is more common and, in many ways, easier to work with when doing data analysis. However, for purely descriptive purposes, it’s a good idea to calculate both, because comparing the mean and median can tell you a lot about your data. If the two values agree with one another, chances are good you’re looking at a nicely formed, symmetric frequency distribution. When they don’t, you probably want to dive a little deeper into the data. Plot a histogram. Look at the extreme values. You might just have an oddly shaped or skewed distribution, something that’s useful to know before you make any conclusions.

Variation

Variation refers to the size of a data cloud. Understanding variation is one of the most important parts of any statistical analysis. Why? Because people rely on statistics to make weighty decisions, and variation has a big impact on everything from the simplest data summary to the most sophisticated nonlinear analysis. Just think about the sample mean and median of the zip code data. These two descriptive statistics both measure central location, and so it seems they should agree with one another. But they don’t. And this is because variation gives rise to a right-skewed frequency distribution in the data.

Variation pops up any time there’s more than one person, place, thing, or measurement in a group. It doesn’t matter if you’re counting the number of molecules in a test tube or the number of defective parts coming off an assembly line. Variation just happens. Like central location, there are several ways to measure variation. I’ll introduce three of the most common.

Range

The range is the simplest descriptive statistic for variation. It’s the largest value, the maximum, minus the smallest value, the minimum. In other words, the range is the span of your data cloud. For example, the range of zip code areas is the difference between the largest and smallest values, or 5497.000 − 0.002 = 5496.998 square miles. That’s the total variation of waistline sizes in “they mama.” Simple, right?


Unfortunately, simple is not always better. Because it only takes into account the largest and smallest measurement values, the range is generally easy to calculate and easy to understand. But it ignores the bulk of the observations, the ones in the middle, and so it can also be terribly misleading. For example, suppose you have five observations: 5, 5, 5, 5, and 20. The average of these observations is 8. The range is 15. Without looking at every data value, you might be misled to think the bulk of observations are around 8, with values ranging from about zero to 15. But the actual frequency distribution looks nothing like this. In fact, all of the values are very tightly clustered at 5, with only a single extreme value, an outlier, at 20. Just like it impacts the average, this outlier impacts the range. Without this one value, the average would drop to 5, and the range would fall to zero.
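The effect of that single outlier can be checked in a few lines. This sketch uses the five observations from the text; the helper name `value_range` is made up for this example.

```python
def value_range(values):
    """Range: the maximum minus the minimum."""
    return max(values) - min(values)

observations = [5, 5, 5, 5, 20]
without_outlier = [x for x in observations if x != 20]

# Average and range with the outlier included.
average_all = sum(observations) / len(observations)
range_all = value_range(observations)

# Average and range once the outlier is dropped.
average_trimmed = sum(without_outlier) / len(without_outlier)
range_trimmed = value_range(without_outlier)

print(average_all, range_all)
print(average_trimmed, range_trimmed)
```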

The range is notoriously impacted by skewed distributions and outliers. This can be a good thing when you’re doing an analysis of the extremes, but most descriptive summaries are concerned with the majority, not the unusual. In the analysis of zip code areas, the range may be nearly 5,500 square miles, but it’s clear from Figure 3.4 the vast majority of values are smaller than 400 square miles. This is where most of the data cloud lies.

Figure 3.6. Describing the central location, variation, and texture of a typical data cloud.

Interquartile Range

Percentiles are sometimes used to understand a dataset. A percentile is the point below which some specified percentage of the observations fall. Think of your dataset as a list of values, sorted from smallest to largest. The 10th percentile is the value that’s one-tenth of the way down the list. Ten percent of the values are smaller than the 10th percentile. Ninety percent of them are larger. The 50th percentile is the median, the value halfway down the list. Quartiles refer to the 25th, 50th, and 75th percentiles, those values 25%, 50%, and 75% of the way down the list.

Consider the list of data values: 2, 3, 4, 4, 5, 6, 6, 7, 7, 8. This list has ten numbers. The tenth percentile is the value one-tenth of the way down the sorted list. That’s the first number on the list, or 2. The 30th percentile is 30% of the way down the list, the third value, which is 4. The 50th percentile, the median, is the fifth number, which is 5. The 75th percentile is 6.5.
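The "way down the list" rule can be sketched in a few lines of Python. This is only one reading of that rule, under a stated assumption: the position is the percentage times the number of values, counted from 1, and when the position falls between two values the result is interpolated between them (the midpoint when it falls exactly halfway). The function name `percentile` is hypothetical, and statistical packages use other conventions that can give slightly different answers.

```python
def percentile(values, pct):
    """Value pct percent of the way down the sorted list.

    pct is a whole-number percentage from 0 to 100. Assumed convention:
    position = pct/100 * n, counted from 1; fractional positions are
    interpolated between the two neighboring values.
    """
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    ordered = sorted(values)
    # Integer arithmetic avoids floating-point surprises in the position.
    whole, remainder = divmod(pct * len(ordered), 100)
    if whole == 0:
        return ordered[0]            # position falls before the first value
    if remainder == 0:
        return ordered[whole - 1]    # position lands exactly on a value
    lower, upper = ordered[whole - 1], ordered[whole]
    return lower + (upper - lower) * remainder / 100

data = [2, 3, 4, 4, 5, 6, 6, 7, 7, 8]
for pct in (10, 30, 50, 75):
    print(pct, percentile(data, pct))
# 10 2
# 30 4
# 50 5
# 75 6.5
```

The 75th percentile lands 7.5 values down the list, halfway between the seventh value (6) and the eighth (7), which is where 6.5 comes from.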

The interquartile range (IQR) measures the spread of the middle half of data values. This statistic is calculated as the difference between the 75th and 25th percentiles. For example, the list 2, 3, 4, 5, 5, 6, 6, 6, 7, 8, 9, 10 has twelve numbers. The 75th percentile is 7 and the 25th percentile is 4. The IQR is 7 − 4 = 3. This means the middle half of the data values fall within three of one another. Because the IQR uses data values well inside the extremes, this statistic is never larger than the range, and it is usually much smaller. It also tends to be less sensitive to outliers. For example, the total range of zip code areas is nearly 5,500 square miles. The IQR is 90.4 − 9.0 = 81.4 square miles. This means the middle half of the values fall within 81.4 square miles of one another. The other half of the values are responsible for the rest of the total variation, all 5,419 square miles of it.
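Here is a small Python sketch of the twelve-number IQR example, followed by a made-up variation that swaps the largest value for 100 to show how differently the range and the IQR react to an outlier. The helper names are hypothetical, and the percentile rule (position = percentage times the number of values, counted from 1, interpolating between neighbors) is an assumed reading of "the way down the list", not a standard library routine.

```python
def percentile(values, pct):
    """Value pct percent (a whole number, 0 to 100) of the way down the sorted list."""
    ordered = sorted(values)
    whole, remainder = divmod(pct * len(ordered), 100)
    if whole == 0:
        return ordered[0]
    if remainder == 0:
        return ordered[whole - 1]
    lower, upper = ordered[whole - 1], ordered[whole]
    return lower + (upper - lower) * remainder / 100

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return percentile(values, 75) - percentile(values, 25)

def data_range(values):
    """Range: largest value minus smallest value."""
    return max(values) - min(values)

values = [2, 3, 4, 5, 5, 6, 6, 6, 7, 8, 9, 10]
print(iqr(values), data_range(values))              # 3 8

# Hypothetical variation: replace the largest value with an outlier.
with_outlier = values[:-1] + [100]
print(iqr(with_outlier), data_range(with_outlier))  # 3 98
```

The outlier stretches the range from 8 to 98, while the IQR stays at 3 because the 25th and 75th percentiles never look at the extremes.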

Standard Deviation and Variance

The standard deviation and its alternative form, the variance, are the most common measures of variation. Like the average, the standard deviation is an arithmetic value that uses all of the observations in its calculation. Here’s the formula.


For measurement values $x_1, x_2, \ldots, x_N$, with average $\bar{x}$, the standard deviation is

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_N - \bar{x})^2}{N - 1}}.$$

The variance is the standard deviation squared, in other words, $s^2$. For the list of values 4, 5, 5, and 6, with an average of 5, the variance is

$$s^2 = \frac{(4 - 5)^2 + (5 - 5)^2 + (5 - 5)^2 + (6 - 5)^2}{3} = \frac{2}{3} \approx 0.67$$

and the standard deviation is

$$s = \sqrt{\frac{(4 - 5)^2 + (5 - 5)^2 + (5 - 5)^2 + (6 - 5)^2}{3}} \approx 0.82.$$
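The same arithmetic can be checked with a short Python sketch. The function names `sample_variance` and `sample_std_dev` are invented for this example; Python's standard `statistics` module offers `variance` and `stdev`, which also divide by N − 1, and the last lines use them as a cross-check.

```python
import math
import statistics

def sample_variance(values):
    """Sum of squared deviations from the average, divided by N - 1."""
    n = len(values)
    average = sum(values) / n
    return sum((x - average) ** 2 for x in values) / (n - 1)

def sample_std_dev(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(sample_variance(values))

values = [4, 5, 5, 6]
print(round(sample_variance(values), 2))  # 0.67
print(round(sample_std_dev(values), 2))   # 0.82

# Cross-check against the standard library.
print(math.isclose(sample_variance(values), statistics.variance(values)))  # True
print(math.isclose(sample_std_dev(values), statistics.stdev(values)))      # True
```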

The standard deviation doesn’t measure the size of a data cloud, at least not directly. Rather, it measures the average deviation of your data around the sample mean. The standard deviation is always smaller than the range. And for bell-shaped data, it’s typically a little smaller than the IQR as well. For the bell-shaped data in Figure 3.5, for example, the total range is 20, the IQR is 5.5, and the standard deviation is 3.9. For skewed data, the standard deviation is still smaller than the range, but it can be smaller or larger than the IQR, depending on how stretched the frequency distribution is. For example, the total range of the zip code areas is about 5,500 square miles, and the IQR is 81. The standard deviation is 192 square miles. That’s over twice the interquartile range, a strong indication of just how skewed these data are. Figure 3.7, while not to scale, illustrates how these three measures of variation relate to one another and to the other descriptive statistics introduced in this chapter.

All of these measures of variation can be useful when analyzing quantitative data; however, the standard deviation plays a particularly important role in statistics. It may not be as straightforward as the range or the IQR, and it may be sensitive to outliers, but it’s firmly rooted in the mathematical foundation of uncertainty, and, frankly, statisticians love it. As you’ll see in following chapters, the standard deviation leads to many useful statements about the size of a data cloud, helping you to calculate the margin of error in a set of observations, compare two or more datasets together, and determine if a trend you’re seeing is real or coincidental.

Figure 3.7. Describing “your mama”: how big is she?

HOW HEAVY IS SHE?

Dear Mom, It’s been two months now, and still I haven’t heard from you. I hope it’s because you’ve been busy with your duties as president of the Shady Oaks Ladies Club and not because you’re still mad at me.

Along those lines, I have some good news. I’ve finished my chapter on descriptive statistics and guess what? There’s no reason for your friends to be upset. My qualitative analysis showed that over half of the ladies in the “your mama” demographic are impossibly heavy, and by that I mean large mammal to planetary object heavy. On top of that, my quantitative analysis, though limited, shows that a typical “your mama” is so heavy, she could sell shade to the whole Shady Oaks Estates subdivision, all five phases of it! In other words, there’s no real resemblance between these ladies and your friends. That means none of your friends could possibly be the inspiration for this chapter. None of them have any reason to be offended. Not even Marta.

Sincerely, Your Loving Daughter

BIBLIOGRAPHY

Aha! Jokes. “Yo’ Mama Jokes.” www.ahajokes.com, accessed September 24, 2012.

Briney, Amanda. May 29, 2011. “Largest National Parks in the United States.” http://geography.about.com/od/unitedstatesofamerica/a/largest-national-parks.htm.

Cars.com. “Find New and Used Cars.” www.cars.com, accessed January 12, 2012.

Microsoft Corporation. “Explore Histograms.” http://office.microsoft.com/en-us/excel-help/explore-histograms-HA001110948.aspx, accessed September 24, 2012.

Payne, Hugh. 2007. Yo’ Mama Is So. . . . New York: Black Dog & Leventhal.

United States Census Bureau. Zip Code Tabulation Areas (ZCTAs™). http://www.census.gov/geo/ZCTA/zcta.html, accessed October 15, 2011.

“Yo Mama Jokes Galore!” http://www.yomamajokesgalore.com/, accessed September 24, 2012.
