Business & Data Analytics Discussion

Chapter3BusinessAnalytics_Albright.pdf

Home >Business & Finance homework help >Business & Data Analytics Discussion

CHAPTER 3 Finding Relationships among Variables

DATA ANALYTICS AT NYPD If your impression of the New York Police Department (NYPD) is based on the hit TV show NYPD Blue (1993 to 2005), think again. The NYPD has a history of apply- ing analytics to make its 36,000 uniformed officers more effective and efficient at their jobs. As described in Levine et al. (2017), the NYPD developed CompStat in 1993, a structured process for examining crime statistics. Then on the heels of the terrorism attack on September 11, 2001, they created a Counterterrorism Bureau, the first of its kind in a U.S. municipal police department. Their latest use of analytics, begun in 2008, is the Domain Awareness System (DAS), a network of sensors, databases, devices, software, and infrastructure that delivers up-to-date information and analytics to mobile devices and precinct desktops. This sys-

tem is now deployed across every police precinct in the city, on all officers’ smartphones, and on the tablets in all police vehicles. It provides officers with the timely information they need to respond quickly and appropriately to situations as they arise.

Before DAS, the NYPD had a tremendous amount of information about crimes and offenders, but it didn’t have the means to get this information to the officers who needed it in a timely manner. For example, suppose a 911 call about a domestic disturbance was received. Before DAS, the caller spoke to an operator, whose transcription of the call was passed to a dispatcher. The dispatcher then assigned the job to officers in the vicinity, telling them there was a domestic incident at a stated address. With this limited information, the officers had to manage the situation as well as possible, often having to make life-or-death decisions without any historical context. Imagine how much better the officers could react if they knew ahead of time that there was a warrant for the arrest of the person at this address or that this was the fifth such call from this address in the past few months. With DAS, this type of information is immediately available from one of the officer’s phones. (The officer in the passenger seat looks up the information on the phone; the officer driving is instructed to keep his eyes on the road.) In fact, the officers can now access the caller’s statements as transcribed by the operator, information they never had before DAS.

DAS also implements predictive analytics for more strategic decisions, such as where a precinct commanding officer should place units on patrol. Before DAS, this decision was made by marking locations of crimes in the past month on a map, circling “hot spots” on the map where most crimes seemed to occur, and positioning units at the hot spots. Pre- dictive analytics can improve on this process by predicting where future crimes are likely to occur.

Prior to DAS, there was a weekly CompStat meeting about strategic decisions that revolved around a static piece of paper listing the numbers of crimes of various types for the week, the last 28 days, and year-to-date. There was some summarization on this sheet, such as how this week’s numbers compared to the numbers from a year ago, but not very much. There was almost no way for those at the meeting to use these numbers for strate- gic decisions because they couldn’t see the patterns that led to the numbers. This is now

B um

bl e

D ee

/S hu

tte rs

to ck

.c om

09953_ch03_ptg01_084-131.indd 84 04/03/19 12:10 PM

Copyright 2020 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.

3-1 Introduction 8 5

possible with DAS. Each number is now presented in an interactive form on a computer screen. If a user clicks on a number, DAS shows all the records included in that number and marks them on a map. Data visualization software then enables the user to explore trends and patterns with bar charts or time series charts, for example. This information, formerly available only to headquarters staff, is now available to commanding officers and their crime analysts.

The Data system has created a “big data” problem for the NYPD. Audio gunshot detectors, 911 calls, license plate readers (LPRs), closed-circuit television cameras, and environmental sensors all feed real-time data into DAS. As of 2016, it also con- tains two billion readings from license plates (with photos), 54 million 911 calls, 15 million complaints, 12 million detective reports, 11 million arrests, 2 million war- rants, and 30 days of video from 9000 cameras. How can all this data be analyzed? Much of the analysis is based on pattern recognition. For example, pattern recogni- tion is implemented on the LPR data. Predictive analytics inform officers where and when a specific license plate is likely to be scanned in the next few minutes. (This is performed only on vehicles on an NYPD watch list.) Time-and-place patterns occur when a vehicle passes the same LPR multiple times on a specific day of the week and a specific time of day, such as every Tuesday at 2 PM. Routing patterns occur when a vehicle passes a fixed LPR and then later passes a second fixed LPR, such as enter- ing the Queensboro Bridge and then driving south on FDR Drive, and it does this on multiple days. These types of patterns give officers a better chance of finding suspected vehicles by focusing their searches.

The DAS system has been a huge success. NYPD officers tend to be skeptical of new systems involving technology, but DAS was designed for cops by cops. Therefore, they appear to trust it and use it. As a measure of its popularity with NYPD officers, there were 100,000 mobile job clicks per week by the end of 2015.

3-1 Introduction The previous chapter introduced a number of summary measures, graphs, and tables to describe the distribution of a single variable. For a variable such as baseball salary, the entire focus was on how salaries were distributed over some range. This is an important first step in any exploratory data analysis—to look closely at variables one at a time—but it is almost never the last step. The primary interest is usually in relationships between variables. For example, it is natural to ask what drives baseball salaries. Does it depend on qualitative factors, such as the player’s team or position? Does it depend on quantitative factors, such as the number of hits the player gets or the number of strikeouts? To answer these questions, you have to examine relationships between various variables and salary.

This chapter again discusses numerical summary measures, graphs, and tables, but they now involve at least two variables at a time. The most useful numeric summary measure is correlation, a measure that applies primarily to numeric variables. The most useful graph is a scatterplot, which again applies primarily to numeric variables. Other tools are used for relationships involving categorical variables. For example, to break down a numeric variable by a categorical variable, it is often useful to create side-by- side box plots. Finally, we discuss Excel®’s arguably most powerful tool, the pivot table. A pivot table enables you to break down one variable by others so that relationships can be uncovered quickly.

As you read this chapter, remember that the diagram in the file Data Analysis Taxonomy.xlsx is available. This diagram gives you the big picture of which analyses are appropriate for which data types and which tools are best for performing the various analyses.

A key issue in this chapter is that different tools should be used to examine relation- ships, depending on whether the variables involved are numeric or categorical.

09953_ch03_ptg01_084-131.indd 85 04/03/19 12:10 PM

8 6 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

3-2 Relationships among Categorical Variables Consider a data set with at least two categorical variables, Smoking and Drinking. Each person is categorized into one of three smoking categories: nonsmoker (NS), occa- sional smoker (OS), and heavy smoker (HS). Similarly, each person is categorized into one of three drinking categories: nondrinker (ND), occasional drinker (OD), and heavy drinker (HD). Do the data indicate that smoking and drinking habits are related? For exam- ple, do nondrinkers tend to be nonsmokers? Do heavy smokers tend to be heavy drinkers?

As discussed in the previous chapter, the most meaningful way to describe a categori- cal variable is with counts, possibly expressed as percentages of totals, and corresponding charts of the counts. The same is true of examining relationships between two categorical variables. You can find the counts of the categories of either variable separately, and more importantly, you can find counts of the joint categories of the two variables, such as the count of all nondrinkers who are also nonsmokers. Again, corresponding percentages of totals and charts help tell the story.

It is customary to display all such counts in a table called a crosstabs (for cross- tabulations). This is also sometimes called a contingency table. Example 3.1 illustrates these tables.

Use a crosstabs, a table of counts of joint categories, to discover relationships between two categorical variables.

EXAMPLE

3.1 RELATIONSHIP BETWEEN SMOKING AND DRINKING The file Smoking Drinking.xlsx lists the smoking and drinking habits of 8761 adults. (This is fictional data.) The cate- gories have been coded so that “N,” “O,” and “H” stand for “Non,” “Occasional,” and “Heavy,” and “S” and “D” stand for “Smoker” and “Drinker.” Is there any indication that smoking and drinking habits are related? If so, how are they related?

Objective To use a crosstabs to explore the relationship between smoking and drinking.

Solution The first question is the data format. If you are lucky, you will be given a table of counts. However, it is also possible that you will have to create these counts. In the file for this example, the data are in long columns, part of which is shown in Figure 3.1. Presumably, there could be other variables describing these people, but only the Smoking and Drinking variables are relevant here.

1 2 3 4 5 6 7 8 9

10 11

A B C Person Smoking Drinking

1 NS OD 2 NS HD 3 OS HD 4 HS ND 5 NS OD 6 NS ND 7 NS OD 8 NS ND 9 OS HD

10 HS HD

Figure 3.1 Smoking and Drinking Data

09953_ch03_ptg01_084-131.indd 86 04/03/19 12:10 PM

Figure 3.2 Crosstabs of Smoking and Drinking

1 2 3 4 5

E F G H I Crosstabs from COUNTIFS formulas

NS OS HS Total ND 2118 435 163 2716 OD 2061 1067 552 3680

6 7 8 9

10 11 12

HD 733 899 733 2365 Total 4912 2401 1448 8761

NS OS HS Total ND 78.0% 16.0% 6.0% 100.0% OD 56.0% 29.0% 15.0% 100.0%

13 14 15 16 17 18 19

HD 31.0% 38.0% 31.0% 100.0%

NS OS HS ND 43.1% 18.1% 11.3% OD 42.0% 44.4% 38.1% HD 14.9% 37.4% 50.6%

20 Total 100.0% 100.0% 100.0%

Shown as percentages of row totals

Shown as percentages of column totals

To create the crosstabs, start by entering the category headings in row 3 and column E of Figure 3.2. The goal is to fill the table in rows 4–6 with counts of joint categories, along with row and column sums. If you are thinking about using the COUNTIF function to obtain the joint counts, you are close. However, the COUNTIF function lets you specify only a single criterion, and there are now two criteria, one for smoking and one for drinking. Fortunately, Excel has a function designed exactly for this: COUNTIFS. It enables you to specify any number of range-criterion pairs. In fact, you can fill the entire table with a single formula entered in cell F4 and copied to the range F4:H6:

=COUNTIFS($B$2:$B$8762,F$3,$C$2:$C$8762,$E4)

The first two arguments are for the condition on smoking; the last two are for the condition on drinking. You can then sum across rows and down columns to get the totals.

The resulting counts appear in the top table in Figure 3.2. For example, among the 8761 people, 4912 are nonsmokers, 2365 are heavy drinkers, and 733 are nonsmokers and heavy drinkers. Because the totals are far from equal (there are many more nonsmokers than heavy smokers, for example), any relationship between smoking and drinking is difficult to detect in these raw counts. Therefore, it is useful to express the counts as percentages of row totals in the middle table and as percent- ages of column totals in the bottom table.

The latter two tables indicate, in complementary ways, that there is definitely a relationship between smoking and drinking. If there were no relationship, the rows in the middle table would be practically identical, as would the columns in the bottom table. (Make sure you understand why this is true.) But they are far from iden- tical. For example, the middle table indicates that only 6% of the nondrinkers are heavy smokers, whereas 31% of the heavy drinkers are heavy smokers. Similarly, the bottom table indicates that 43.1% of the nonsmokers are nondrinkers, whereas only 11.3% of the heavy smokers are nondrinkers. In short, these tables indicate that smoking and drinking habits tend to go with one another. These tendencies are reinforced by the column charts of the two percentage tables in Figure 3.3.

Relationships between the two variables are usually more evident when the counts are expressed as percentages of row totals or percentages of column totals.

3-2 relationships among Categorical Variables 8 7

09953_ch03_ptg01_084-131.indd 87 04/03/19 12:10 PM

Figure 3.3 Column Charts of Smoking and Drinking Percentages

0.0%

10.0%

20.0%

30.0%

40.0%

50.0%

60.0%

ND OD HD

Percentages of column totals

0.0%

10.0%

20.0%

30.0%

40.0%

50.0%

60.0%

70.0%

80.0%

90.0%

NS OS HS

Percentages of row totals

ND OD HD NS OS HS

Counts Versus Percentages

There is no single correct way to display the data in a crosstabs. Ultimately, the data are always counts, but they can be shown as raw counts, percentages of row totals, percentages of column totals, or even percentages of the overall total. Nev- ertheless, when you are looking for relationships between two categorical vari- ables, showing the counts as percentages of row totals or percentages of column totals usually makes any relationships emerge more clearly. Corresponding charts are also very useful.

Fundamental Insight

Creating Charts from Crosstabs

It takes almost no work to create these charts. To get the one on the left, high- light the range E10:H13 and insert a column chart from the Insert ribbon. Do the same with the range E16:H19 to get the chart on the right, except that it will have smoking on the horizontal axis and drinking in the legend. To reverse their roles, click the Switch Row/Column button on the Chart Tools Design ribbon.

Excel Tip

Although this example illustrates that it doesn’t take too much work to create cross- tabs and corresponding charts, you will see a much quicker and easier way when pivot tables are discussed later in this chapter.

a. For each year separately, recode Nationality so that all nationalities with a count of 1 or 2 are listed as Other.

b. For each year, create a crosstabs of Gender versus the recoded Nationality and an associated column chart. Does there seem to be any relationship between Gender and the recoded Nationality? Is the pattern about the same in the two years?

2. The file P02_03.xlsx contains data from a survey of 399 people regarding a government environmental policy.

Problems Solutions for problems whose numbers appear within a colored box can be found in the Student Solution Files.

Level A 1. The file P02_01.xlsx indicates the gender and national-

ity of the MBA incoming class in two successive years at the Kelley School of Business at Indiana University.

8 8 C h a P t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

09953_ch03_ptg01_084-131.indd 88 04/03/19 8:35 PM

3-3 relationships among Categorical Variables and a Numeric Variable 8 9

Level B 5. Recall from Chapter 2 that HyTex Company is a direct

marketer of technical products and that the file Catalog Marketing.xlsx contains recent data on 1000 HyTex customers. To understand these customers, first recode Salary and Amount Spent as indicated in part a, and then create each of the requested crosstabs and an asso- ciated column chart in parts b to e. Express each count as a percentage, so that for any value of the first variable listed, the percentages add up to 100%. Do any patterns stand out? a. Find the first, second, and third quartiles of Salary,

and then recode Salary as 1 to 4, depending on which quarter of the data each value falls into. For exam- ple, the first salary, $16,400, is recoded as 1 because $16,400 is less than the first quartile, $29,975. Recode Amount Spent similarly, based on its quar- tiles. (Hint: The recoding can be done most easily with lookup tables.)

b. Age versus the recoded Amount Spent c. Own Home versus the recoded Amount Spent d. History versus the recoded Amount Spent e. The recoded Salary versus the recoded Amount Spent

6. The smoking/drinking example in this section used the function COUNTIFS function to find the counts of the joint categories. Without using this function (or pivot tables), devise another way to get the counts. The raw data are in the file Smoking Drinking.xlsx. (Hint: One possibility is to concatenate the values in columns B and C into a new column D. But feel free to find the counts in any way you like.)

a. Create a crosstabs and an associated column chart for Gender versus Opinion. Express the counts as per- centages so that for either gender, the percentages add to 100%. Discuss your findings. Specifically, do the two genders tend to differ in their opinions about the environmental policy?

b. Repeat part a with Age versus Opinion. c. Recode Salary to be categorical with categories “Less

than $40K,” “Between $40K and $70K,” “Between $70K and $100K,” and “Greater than $100K” (where you can treat the breakpoints however you like). Then repeat part a with this new Salary variable versus Opinion.

3. The file P02_02.xlsx contains data on 256 movies that grossed at least $1 million in 2017. a. Recode Distributor so that all distributors with fewer

than 10 movies are listed as Other. Similarly, recode Genre so that all genres with fewer than 10 movies are listed as Other.

b. Create a crosstabs and an associated column chart for these two recoded variables. Express the counts as percentages so that for any distributor, the percentages add to 100%. Discuss your findings.

4. Recall from Chapter 2 that the file Supermarket Trans- actions.xlsx contains over 14,000 transactions made by supermarket customers over a period of approximately two years. To understand which customers purchase which products, create a crosstabs and an associated column chart for each of the following. For each, express the counts as percentages so that for any value of the first variable listed, the percentages add to 100%. Do any patterns stand out? a. Gender versus Product Department b. Marital Status versus Product Department c. Annual Income versus Product Department

3-3 Relationships among Categorical Variables and a Numeric Variable

This section describes a very common situation where the goal is to break down a numeric variable such as salary by a categorical variable such as gender. This is precisely what pivot tables were built for, as you will see later in the chapter. For now, however, Excel’s numeric and graphical tools will be used. This general problem, typically referred to as the comparison problem, is one of the most important problems in data analysis. It occurs whenever you want to compare a numeric variable across two or more subpopulations. Here are some examples:

• The subpopulations are males and females, and the numeric variable is salary. • The subpopulations are different regions of the country, and the numeric variable is the

cost of living. • The subpopulations are different days of the week, and the numeric variable is the

number of customers going to a particular fast-food chain. • The subpopulations are different machines in a manufacturing plant, and the numeric

variable is the number of defective parts produced per day. • The subpopulations are patients who have taken a new drug and those who have taken a

placebo, and the numeric variable is the recovery rate from a particular disease. • The subpopulations are undergraduates with various majors (business, English, history,

and so on), and the numeric variable is the starting salary after graduating.

The comparison problem, where a numeric variable is compared across two or more subpopulations, is one of the most important prob- lems faced by data analysts in all fields of study.

09953_ch03_ptg01_084-131.indd 89 04/03/19 12:10 PM

9 0 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

The list could continue. The discussion of the comparison problem begins in this chapter, where methods are used to investigate whether there appear to be differences across the sub- populations on the numeric variable of interest. In later chapters, inferential methods—con- fidence intervals and hypothesis tests—are used to see whether the differences in samples from the subpopulations can be generalized to the subpopulations as a whole.

Breaking Down by Category

There is arguably no more powerful data analysis technique than breaking down a numeric variable by a categorical variable. The methods in this chapter, espe- cially side-by-side box plots and pivot tables, get you started with this general comparison problem. They allow you to see quickly, with charts and/or numeric summary measures, how two or more categories compare. More advanced tech- niques for comparing across categories are discussed in later chapters.

Fundamental Insight

Stacked and Unstacked Formats There are two possible data formats you might see, stacked and unstacked. Consider salary data on males and females. (There could be other variables in the data set, but they aren’t relevant here.) The data are stacked if there are two “long” variables, Gender and Salary, as indicated in Figure 3.4. The idea is that the male salaries are stacked in with the female salaries. This is the format you will see in the majority of data sets. However, you will occasionally see data in unstacked format, as shown in Figure 3.5. (Note that both

The stacked format is by far the most common. There are one or more long numeric variables and another long variable that specifies which category each observation is in.

Figure 3.4 Stacked Data

1 2 3 4 5 6 7 8 9

10 11 12 13 14 15 16 17 18 19 20 21 22

A B Gender Salary Male 81600 Female 61600 Female 64300 Female 71900 Male 76300 Female 68200 Male 60900 Female 78600 Female 81700 Male 60200 Female 69200 Male 59000 Male 68600 Male 51900 Female 64100 Male 67600 Female 81100 Female 77000 Female 58800 Female 87800 Male 78900

09953_ch03_ptg01_084-131.indd 90 04/03/19 12:10 PM

3-3 relationships among Categorical Variables and a Numeric Variable 9 1

tables list exactly the same data. See the file Stacked Unstacked Data.xlsx.) Now there are two “short” variables, Female Salary and Male Salary. In addition, it is very possible that the two variables have different lengths. This is the case here because there are more females than males.

Keep this distinction in mind. Some statistical software packages require that the data for a comparison problem be in one of these formats, usually the stacked format. In any case, it is fairly easy to rearrange unstacked data into stacked format or vice versa.

We now return to the baseball data, which is in stacked format, to see which, if any, of the categorical variables makes a difference in player salaries.

Figure 3.5 Unstacked Data

1 2 3 4 5 6 7 8 9

10 11 12 13

A B Female Salary Male Salary

61600 81600 64300 76300 71900 60900 68200 60200 78600 59000 81700 68600 69200 51900 64100 67600 81100 78900 77000 58800 87800

EXAMPLE

3.2 BASEBALL SALARIES The file Baseball Salaries Extra.xlsx contains the same 2018 baseball data examined in the previous chapter. In addition, several extra categorical variables are included:

• Pitcher (Yes for all pitchers, No for the others)

• League (American or National)

• Division (National West, American East, and so on)

• Yankees (Yes if team is New York Yankees, No otherwise)

• Playoff 2017 Team (Yes for the eight teams that made it to the 2017 playoffs, No for the others) • World Series 2017 Team (Yes for Houston Astros and Los Angeles Dodgers, No for others)

Do pitchers (or any other positions) earn more than others? Does one league pay more than the other, or do any divisions pay more than others? How does the notoriously high Yankees payroll compare to the others? Do the successful teams from 2017 tend to have larger 2018 payrolls?

Objective To learn methods for breaking down baseball salaries by various categorical variables.

Solution It is useful to look first at some numeric summary measures for salary. These are the same summary measures from the previous chapter, but now we want to break them down by position or by one of the other categorical variables. You could do this with Excel formulas, but they are not quite as straightforward as before. If you want to create counts or averages,

09953_ch03_ptg01_084-131.indd 91 04/03/19 12:10 PM

you can use Excel’s COUNTIF and AVERAGEIF functions. However, there are no MEDIANIF, MINIF, MAXIF, STDEVIF functions or others you might need. Still, there is a way, using array formulas.

The results appear in Figure 3.6. This table lists selected summary measures for each of the positions in the data set. A typical formula, the standard deviation for catchers, is shown at the bottom. It uses the STDEV.S function but with an IF function inside. This IF function is essentially a filter on catchers, so the function takes the standard deviation of catcher salaries only. However, this is an array formula, so after you type it, you must press Ctrl + Shift + Enter (all three keys at once). This puts curly brackets around the formula, but you do not actually type these curly brackets. If you want to see salaries broken down by team or any other categorical variable, you can easily run this analysis again with a different cate- gorical variable.

Figure 3.6 Summary Measures of Salary by Position

1 2 3 4 5 6 7 8 9

A B C D E F G H I J

10 11 12 13 14 15 16 17 18

Mean Median Min Max Quartile 1 Quartile 3 1st percentile 5th percentile 95th percentile 99th percentile Std Dev Mean Abs Dev

$3,511,360 $1,250,000

$545,000 $22,177,778

$557,471 $3,975,000

$545,000 $546,240

$14,675,000 $22,049,778

$5,009,483 $3,430,724

Position Catcher $12,460,183 $13,262,214

$570,000 $18,666,667 $11,318,607 $17,062,500

$1,091,500 $3,177,500

$18,500,000 $18,633,334

$6,572,010 $4,512,040

Designated Hitter $8,737,151 $5,333,333

$545,000 $30,000,000

$585,000 $14,333,333

$545,336 $546,880

$24,200,000 $28,560,000

$8,977,109 $7,670,583

First Baseman $3,829,218 $1,100,000

$545,000 $34,000,000

$555,000 $5,000,000

$545,000 $545,000

$16,000,000 $27,700,000

$5,650,088 $3,880,380

Pitcher $4,928,349 $2,400,000

$546,900 $24,000,000

$657,000 $8,500,000

$547,008 $547,960

$16,108,674 $20,880,000

$5,489,065 $4,327,677

$3,570,414 $1,025,000

$545,000 $20,000,000

$564,125 $5,112,500

$545,000 $546,305

$15,286,666 $18,120,000

$4,837,949 $3,617,415

$6,029,821 $2,916,666

$545,000 $23,000,000

$565,000 $12,000,000

$546,100 $548,300

$19,319,795 $21,504,000

$6,896,281 $5,900,264

Second Baseman Third BasemanShortstop

Summary stats for Salary broken down by categories of Position

Typical formula Cell C14 =STDEV.S(IF(’Salaries 2018’!$C$2:$C$878=C3,’Salaries 2018’!$J$2:$J$878))

$5,130,183 $1,675,000

$545,000 $34,083,333

$561,500 $6,450,000

$545,000 $546,275

$21,375,000 $27,674,880

$6,978,709 $5,217,029

Outfielder

There are a lot of numbers to digest in Figure 3.6, so it is difficult to get a clear picture of dif- ferences across positions. It is more enlightening to see a graphical summary of this information. There are several types of graphs you can use. Our favorite way, discussed shortly, is to create side-by-side box plots. A second possibility is to use pivot tables and corresponding pivot charts, as discussed later in this chapter.

For now, we illustrate side-by-side box plots. Assuming you have Excel 2016 or later, you can create box plots of Salary by Position. First, select the two columns of data, Position and Salary. Then select a box plot from the Statistical charts list on the Insert ribbon. Excel automatically uses the distinct categories in the Position variable to create a separate box plot for each position. You can interpret each of these box plots as explained in the previous chapter. More importantly, you can compare them.

The results appear in Figure 3.7. Now the differences between positions emerge fairly clearly. A few of the conclusions that can be made follow.

Side-by-side box plots are our favorite graphical way of comparing the distribu- tion of a numeric variable across categories of some categorical variable.

Figure 3.7 Box Plots of Salary by Position

Box Plots of Salary by Position

$35,000,000

$40,000,000

$30,000,000

$25,000,000

$20,000,000

$15,000,000

$10,000,000

$5,000,000

Outfie lder

Pitc her

Fir st

Baseman

Desig nated

Hitt erThird

Baseman Catch

Sh orts

top

Se co

Baseman

9 2 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

09953_ch03_ptg01_084-131.indd 92 04/03/19 12:10 PM

• The salaries for all positions except designated hitter are skewed to the right (mean greater than median, long lines and out- liers to the right).

• As a whole, first basemen tend to be the highest paid players, followed by third basemen. The designated hitters also make a lot, but there are only a few of them in the data set.

• As a whole, pitchers don’t make as much as first basemen and third basemen, but there are a lot of pitchers who are high-earning outliers.

• Except for a few notable exceptions, catchers receive the lowest salaries.

Because these side-by-side box plots are so easy to obtain, you can generate a lot of them to provide insights into the sal- ary data. Several interesting examples appear in Figures 3.8–3.10. These box plots lead to the following conclusions:

• Pitchers make somewhat less than other players, although there are many outliers in each group.

• The Yankees payroll is indeed much larger than the payrolls for the rest of the teams. In fact, it is so large that its stars’ salaries aren’t even considered outliers relative to the rest of the team.

• Aside from the many outliers, the playoff teams from 2017 tend to have slightly larger payrolls than the non-playoff teams. The one question we cannot answer, however, at least not without additional data, is whether these larger payrolls are a cause or an effect of being successful.

You can often create a cate- gorical variable on the fly with an IF formula and then use it for side-by-side box plots. We did this with the Yankees, for example.

Figure 3.8 Box Plots of Salary by Pitcher/Non-Pitcher

Box Plots of Salary by Pitcher/Non-Pitcher

$35,000,000

$40,000,000

$30,000,000

$25,000,000

$20,000,000

$15,000,000

$10,000,000

$5,000,000

$0 No Yes

Figure 3.9 Box Plots of Salary by Yankees/Non-Yankees

Box Plots of Salary by Yankees/Non-Yankees

$35,000,000

$40,000,000

$30,000,000

$25,000,000

$20,000,000

$15,000,000

$10,000,000

$5,000,000

$0 No Yes

3-3 relationships among Categorical Variables and a Numeric Variable 9 3

09953_ch03_ptg01_084-131.indd 93 04/03/19 12:10 PM

Figure 3.10 Box Plots of Salary by Playoff Team/Non- Playoff Team

Box Plots of Salary by Playoff Team/Non-Playoff Team

$35,000,000

$40,000,000

$30,000,000

$25,000,000

$20,000,000

$15,000,000

$10,000,000

$5,000,000

$0 No Yes

Problems Solutions for problems whose numbers appear within a colored box can be found in the Student Solution Files.

Level A 7. Recall that the file Baseball Salaries Extra.xlsx con-

tains data on 877 Major League Baseball (MLB) players during the 2018 season. Find the mean, median, standard deviation, and first and third quartiles of Salary, broken down by each of the following categories. Comment on your findings. a. Team b. Division c. Whether they played for the Yankees d. Whether they were in the playoffs

8. The file P02_07.xlsx includes data on 204 employees at the (fictional) company Beta Technologies. Find the mean, median, and standard deviation of Annual Salary, broken down by each of the following categories. Com- ment on your findings. a. Gender b. A recoded version of Education, with new values 1 for

Education less than 4, 2 for Education equal to 4, and 3 for Education greater than 4

c. A recoded version of Age, with people aged less than 34 listed as Young, those aged at least 34 and less than 50 listed as Middle-aged, and those aged at least 50 listed as Older

9. The file Golf Stats.xlsx contains data on the 200 top golf- ers each year from 2003 to 2017. (This data set is used in an example in the next section.) For the 2017 data, create a recoded Age variable, with values “Twenties,” “Thirties,” and “Forties,” based on their ages in the 2017 sheet. Then calculate the mean, median, and standard

deviation of the following 2017 variables, broken down by the recoded Age. Comment on whether it appears that golfers peak in their thirties. a. Earnings b. Yards/Drive and Driving Accuracy c. Greens in Regulation d. Putting Average (Golfers want this to be small.)

10. (Requires Excel version 2016 or later for box plots) Recall from Chapter 2 that HyTex Company is a direct marketer of technical products and that the file Catalog Marketing.xlsx contains recent data on 1000 HyTex customers. Find the mean, median, and stan- dard deviation of Amount Spent, broken down by the following variables. Then create side-by-side box plots of Amount Spent, broken down by the same variables. Comment on how the box plots complement the sum- mary measures. a. Age b. Gender c. Close d. Region e. Year of first purchase. (Hint: For this one, use Excel’s

YEAR function to create a Year column.) f. The combination of Married and Own Home. (For

this one, create a code variable, with values from 1 to 4, for the four combinations of Married and Own Home. Alternatively, create a text variable with values such as “Not married, Owns home.”)

11. The file P02_35.xlsx contains data from a survey of 500 randomly selected households. a. Create a new column Has Second Income with values

“Yes” and “No” depending on whether the household has a reported second income.

b. Find the mean, median, and standard deviation of First Income, broken down by the variable you created in

9 4 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

09953_ch03_ptg01_084-131.indd 94 04/03/19 12:10 PM

3-3 relationships among Categorical Variables and a Numeric Variable 9 5

part a. Is there any indication that first income tends to be any larger or smaller, or has more or less varia- tion, depending on whether there is a second income?

c. Repeat part b for each of the Monthly Payment and Debt variables.

12. (Requires Excel version 2016 or later for box plots) The file P02_02.xlsx contains data about 256 movies that grossed at least $1 million in 2017. a. Recode Genre so that all genres with fewer than 10

movies are listed as Other. b. Find the mean, median, and standard deviation of

Total US Gross, broken down by the recoded Genre variable. Also, create side-by-side box plots of Total US Gross, again broken down by the recoded Genre variable. Comment on what the results say about the popularity of different genres.

13. (Requires Excel version 2016 or later for box plots) The Wall Street Journal CEO Compensation Study analyzed chief executive officer (CEO) pay from many U.S. com- panies with fiscal year 2008 revenue of at least $5 billion that filed their proxy statements between October 2008 and March 2009. The data are in the file P02_30.xlsx. a. Create a new variable Total 2008, the sum of Salary

2008 and Bonus 2008. (Actually, this is not “total” compensation because it omits the very lucrative com- pensation from stock options.) Also, recode Company Type so that the Technology and Telecommunications are collapsed into a Tech/Telecomm category.

b. Find the mean, median, and standard deviation of Total 2008, broken down by the recoded Company Type. Also, create side-by-side box plots of Total 2008, again broken down by the recoded Company Type. What do the results tell you about differences in level or variability across company types?

14. (Requires Excel version 2016 or later for box plots) The file P02_55.xlsx contains monthly sales (in millions of dollars) of beer, wine, and liquor. The data have not been seasonally adjusted, so there might be seasonal patterns that can be discovered. a. Create a new Month Name variable with values Jan,

Feb, and so on. (Use Excel’s MONTH function and then a lookup table.)

b. Create side-by-side box plots of Total Sales, broken down by Month Name. Is there any evidence of differ- ences across months for either the level of sales or the variability of sales?

15. (Requires Excel version 2016 or later for box plots) The file P03_15.xlsx contains monthly data on the various components of the Consumer Price Index (CPI). The source claims that these data have not been seasonally adjusted. The following parts ask you to check this claim. a. Create a new Month Name variable with values Jan,

Feb, and so on. (Use Excel’s MONTH function and then a lookup table.)

b. Create side-by-side box plots of each component of the CPI (including the All Items variable), broken

down by the Month Name variable from part a. What results would you expect for “not seasonally adjusted” data? Are your results in line with this?

16. (Requires Excel version 2016 or later for box plots) The file P02_11.xlsx contains data on 148 houses that were recently sold in a (fictional) suburban community. The data set includes the selling price of each house, along with its appraised value, square footage, number of bed- rooms, and number of bathrooms. a. Create two new variables, Ratio 1 and Ratio 2. Ratio 1

is the ratio of Appraised Value to Selling Price, and Ratio 2 is the ratio of Selling Price to Square Feet. Identify any obvious outliers in these two Ratio variables.

b. Find the mean, median, and standard deviation of each Ratio variable, broken down by Bedrooms. Also, cre- ate side-by-side box plots of each Ratio variable, again broken down by Bedrooms. Comment on the results.

c. Repeat part b with Bedrooms replaced by Bathrooms. d. If you repeat parts b and c with any obvious outlier(s)

from part a removed, do the conclusions change in any substantial way?

Level B 17. The file P02_32.xlsx contains blood pressures for 1000

people, along with variables that can be related to blood pressure. These other variables have a number of missing values, probably because some people didn’t want to report certain information. For each of the Alcohol, Exercise, and Smoke variables, use StatTools to find the mean, median, and standard deviation of Blood Pressure, broken down by whether the data for that variable are missing. For exam- ple, there should be one set of statistics for people who reported their alcohol consumption and another for those who didn’t report it. Based on your results, does it appear that there is any difference in blood pressure between those who reported and those who didn’t?

18. The file P03_18.xlsx contains the times in the Chicago marathon for the top runners each year (the top 10,000 in 2006 and the top 20,000 in 2007 and 2008). a. Merge the top 1,000 values (that is, the lowest 1,000

times) from each year into a new Combined sheet, and in the new sheet, create a variable Year that lists the year.

b. The Time variable, shown as something like 2:16:12, is really stored as a time, the fraction of day start- ing from midnight. So 2:16:12, for example, which means 2 hours, 16 minutes, and 12 seconds, is stored as 0.0946, meaning that 2:16:12 AM is really 9.46% of the way from midnight to the next midnight. This isn’t very useful. Do whatever it takes to recode the times into a new Minutes variable with two decimals, so that 2:16:12 becomes 136.20 minutes. (Hint: Look up Time functions in Excel’s online help.)

c. Create a new variable Nationality to recode Country as “KEN, ETH,” “USA,” or “Other,” depending on

09953_ch03_ptg01_084-131.indd 95 04/03/19 12:10 PM

9 6 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

whether the runner is from Kenya/Ethiopia (the usual winners), the USA, or some other country.

d. Find the mean, median, standard deviation, and first and third quartiles of Minutes, broken down by Nationality. Also, create side-by-side box plots of Minutes, again broken down by Nationality. Comment on the results.

e. Repeat part d, replacing Nationality by Gender. 19. (Requires Excel version 2016 or later for box plots) The

file P03_19.xlsx contains daily values of the S&P Index from 1970 to mid-2015. It also contains percentage changes in the index from each day to the next. a. Create a new variable President that lists the U.S.

presidents Nixon through Obama on each date. You can look up the presidents and dates online.

b. Find the mean, median, standard deviation, and first and third quartiles of % Change, broken down by President. Also, create side-by-side box plots of % Change, again broken down by President. Comment on the results.

20. The file P02_56.xlsx contains monthly values of indexes that measure the amount of energy necessary to heat or cool buildings due to outside temperatures. (See the explanation in the Source sheet of the file.) These are reported for each

state in the United States and also for several regions, as listed in the Locations sheet, from 1931 to 2000. a. For each of the Heating Degree Days and Cooling

Degree Days sheets, create a new Season variable with values “Winter,” “Spring,” “Summer,” and “Fall.” Winter consists of December, January, and February; Spring consists of March, April, and May; Summer consists of June, July, and August; and Fall consists of September, October, and November.

b. Find the mean, median, and standard deviation of Heating Degree Days (HDD), broken down by Season, for the 48 contiguous states location (code 5999). (Ignore the first and last rows for the given loca- tion, the ones that contain -9999, the code for missing values.) Also, create side-by-side box plots of HDD, broken down by season. Comment on the results. Do they go in the direction you would expect? Do the same for Cooling Degree Days (which has no missing data).

c. Repeat part b for California (code 0499). d. Repeat part b for the New England group of states

(code 5801).

3-4 Relationships Among Numeric Variables This section discusses methods for finding relationships among numeric variables. For example, you might want to examine the relationship between heights and weights of peo- ple, or between salary and years of experience of employees. To study such relationships, we introduce two new summary measures, correlation and covariance, and a type of chart called a scatterplot.

Note that these measures can be applied to any variables that are displayed numeri- cally. However, they are appropriate only for truly numeric variables, not for categorical variables that have been coded numerically. In particular, many people create dummy (0–1) variables for categorical variables such as Gender and then include these dummies in a table of correlations. This is certainly possible, and it is not really “wrong.” However, if you want to investigate relationships involving categorical variables, it is better to employ the tools in the previous two sections.

3-4a Scatterplots We first discuss scatterplots, a graphical method for detecting relationships between two numeric variables.1 Then we will discuss the numeric summary measures, correlation and covariance, in the next subsection. (We do it in this order because correlation and covari- ance make more sense once you understand scatterplots.) A scatterplot is a scatter of points, where each point denotes the values of an observation for two selected variables. The two variables are often labeled generically as X and Y, so a scatterplot is sometimes called an X-Y chart. The whole purpose of a scatterplot is to make a relationship (or the lack of it) apparent. Do the points tend to rise upward from left to right? Do they tend to fall downward from left to right? Does the pattern tend to be linear, nonlinear, or no particular shape? Do any points fall outside the general pattern? The answers to these questions provide information about the possible relationship between the two variables. The process is illustrated in Example 3.3.

In general, don’t use correlations that involve coded categorical variables such as 021 dummies. The methods from the previous two sections are more appropriate.

1 Various terms are used for this type of graph. We use the term scatterplot. You might also see the terms scatter plot (two words), scatter chart, scatter diagram, or scattergram. They all mean exactly the same thing.

09953_ch03_ptg01_084-131.indd 96 04/03/19 12:10 PM

3-4 relationships among Numeric Variables 9 7

EXAMPLE

3.3 GOLF STATS ON THE PGA TOUR For over a decade, the Professional Golf Association (PGA) has kept statistics on all PGA Tour players, and these stats are published on the Web. We imported yearly data from 2003–2017 into the file Golf Stats.xlsx. The file includes an observation for each of the top 200 earners for each year, including age, earnings, events played, rounds played, 36-hole cuts made (only the top scorers on Thursday and Friday get to play on the weekend; the others don’t make the cut), top 10s, and wins. It also includes stats about efficiency in the various parts of the game (driving, putting, greens in regulation, and sand saves), as well as good holes (eagles and birdies) and bad holes (bogies). A sample of the data for 2017 appears in Figure 3.11, with the data sorted in decreasing order of earnings and a few variables not shown. What relationships can be uncovered in these data for any particular year?

Figure 3.11 Golf Stats

4 5 6 7 8 9

Rank Cuts Made Top 10s 12 12

8 7

11 10

7 7 8 7

Greens in Regula on 67.2 70.0 69.5 69.0 68.6 67.0 67.0 63.5 66.4 63.8

Put t ing Average 1.694 1.711 1.755 1.739 1.761 1.721 1.759 1.721 1.785 1.738

Sand Save Pct 48.2 55.8 44.3 50.9 59.1 68.7 53.0 48.2 52.1 58.7

Driving Accuracy 55.1 60.0 57.0 58.6 58.7 63.9 57.9 55.8 67.8 62.8

Yards/Drive 309.7 295.6 315.0 303.3 305.8 300.3 298.6 311.1 289.5 289.9

Earnings $9,921,560 $9,433,033 $8,732,193 $8,380,570 $6,123,248 $6,083,198 $5,866,391 $5,612,397 $4,766,936 $4,396,470

Wins 5 3 4 3 1 1 2 1 1 1

Rounds 79 78 70 76 83 77 90 80 98

100

Events 25 23 20 22 23 21 25 24 28 30

Age 24 24 33 26 23 29 34 27 34 31

Player Just in Thomas Jordan Spieth Dust in Johnson Hideki Matsuyama Jon Rahm Rickie Fowler Marc Leishman Brooks Koepka Kevin Kisner Brian Harman11

10 9 8

1 1 2 3

4 5 6 7

A B C D E F G H I J K L M N

18 19 17 19 21 18 22 19 24 21

Objective To use scatterplots to search for relationships in the golf data.

Solution This example is typical in that there are many numeric variables, and it is up to you to search for possible relationships. A good first step is to ask some interesting questions and then try to answer them with scatterplots. For example, do younger players play more events? Are earnings related to age? Which is related most strongly to earnings: driving, putting, or greens in regulation? Do the answers to these questions remain the same from year to year? This example is all about exploring the data, and we will answer only a few of the questions that could be asked. Fortunately, scatterplots are easy to create, so you can do a lot of exploring very quickly.

To create a scatterplot, you highlight any two variables of interest and select a scatter chart of the top left type from the Insert ribbon. At this point, you will probably want to modify the chart by deleting the legend, inserting some titles, and possibly changing some formatting. Also, you might want to swap the roles of the X and Y variables.

Scatterplots are great for initial exploration of the data. If a scatterplot suggests a relationship between two variables, other methods can then be used to examine this relationship in more depth.

Selecting Multiple Ranges Efficiently

How do you select two long variables such as Age and Earnings? Here are the steps that make it easy. (1) Select the Age label in cell B1. (2) Hold down the Shift and Ctrl keys and press the down arrow key. This selects the Age column. (3) Hold down the Ctrl key and select the Earnings label in cell H1. (4) Hold down the Shift and Ctrl keys and press the down arrow key. Now both columns are selected.

Excel Tip

09953_ch03_ptg01_084-131.indd 97 04/03/19 12:10 PM

Several scatterplots appear in Figures 3.12 through 3.15. The scatterplots in Figure 3.12 indicate the possibly surprising results that age is practically unrelated to the number of events played and earnings. Each scatter is basically a shapeless swarm of points, and a shapeless swarm always indicates “no relationship.” The scatterplots in Figure 3.13 confirm what you would expect. Specifically, players who play in more events tend to earn more, although there are a number of exceptions to this pattern. Also, players who make more 36-hole cuts tend to earn more.

Figure 3.12 Scatterplots of Age Versus Events and Earnings

60.000

50.000

40.000

30.000

20.000

10.000

Age vs Events

0.000 0.000 5.000 10.000 15.000 20.000 25.000 30.000 35.000

60.000

50.000

40.000

30.000

20.000

10.000

Age vs Earnings

0.000 0.000 5,000,000.000 10,000,000.000 15,000,000.000

Figure 3.13 Scatterplots of Earnings Versus Events and Cuts Made

$12,000,000

$10,000,000

$8,000,000

$6,000,000

$4,000,000

$2,000,000

Earnings vs Events

$0 5.000 10.000 15.000 20.000 25.000 30.000 35.000

$12,000,000

$10,000,000

$8,000,000

$6,000,000

$4,000,000

$2,000,000

Earnings vs Cuts Made

$0 0.0000.000 5.000 10.000 15.000 20.000 25.000 30.000

Data Labels in Scatterplots

Unfortunately, there is no automatic way to enter a label such as “Dustin Johnson” next to a point in a scatterplot. If you click twice on a point (don’t double-click, but slowly click twice), you can select this point. Then if you right-click, you have the option of adding a data label. However, this data label is always the value of the Y variable. In this case, it would be Dustin’s earnings, not his name.

Excel Tip

Golfers will be particularly interested in the scatterplots in Figures 3.14 and 3.15. The scatterplots in Figure 3.14 indicate a slight positive relationship between earnings and driving length (yards per drive) but almost no relationship between earn- ings and driving accuracy (percentage of fairways hit). The scatterplots in Figure 3.15 indicate a slight positive relationship between earnings and greens hit in regulation and a negative relationship between earnings and putting average. Does this mean that better putters earn less? Absolutely not! The putting stat is the average number of putts per hole, so a lower value is better. Therefore, the downward relationship indicated in the chart is expected. In fact, the driving and putting scatterplots tend to confirm the old saying in golf: Drive for show, putt for dough.

9 8 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

09953_ch03_ptg01_084-131.indd 98 04/03/19 12:10 PM

You could obviously ask many more questions about the relationships in this golf data set and then attempt to answer them with scatterplots. For example, are the relationships (or lack of them) in the scatterplots consistent through the years? Or would it be better to use Earnings per Round instead of Earnings as the Y variable? Or are there other variables not shown here, such as the percentage of holed putts of less than 10 feet, that are more highly related to Earnings? You now have a powerful tool, scatterplots, for examining relationships, and the tool is easy to implement. We urge you to use it—a lot.

Figure 3.15 Scatterplots of Earnings Versus Greens in Regulation and Putting Average

$12,000,000

$10,000,000

$8,000,000

$6,000,000

$4,000,000

$2,000,000

Earnings vs Greens in Regulation

$0 58.0 60.0 62.0 64.0 66.0 68.0 72.070.0

$12,000,000

$10,000,000

$8,000,000

$6,000,000

$4,000,000

$2,000,000

Earnings vs Putting Average

$0 1.650 1.700 1.750 1.800 1.850 1.900

Figure 3.14 Scatterplots of Earnings Versus Driving Length and Driving Accuracy

$12,000,000

$10,000,000

$8,000,000

$6,000,000

$4,000,000

$2,000,000

Earnings vs Yards/Drive

$0 260.0 270.0 280.0 290.0 300.0 310.0 320.0

$12,000,000

$10,000,000

$8,000,000

$6,000,000

$4,000,000

$2,000,000

Earnings vs Driving Accuracy

$0 0.0 20.0 40.0 60.0 80.0

Trend Lines in Scatterplots Chapters 10 and 11 discuss regression, a method for quantifying relationships between variables. We can provide a brief introduction to regression at this point by discussing the very useful Trendline tool in Excel. Once you have a scatterplot, Excel enables you to superimpose one of several trend lines on the scatterplot. Essentially, a trend line is a line or curve that “fits” the scatter as well as possible. This could indeed be a straight line, or it could be one of several types of curves. (By the way, you can also superimpose a trend line on a time series graph, exactly as described here for scatterplots.)

To illustrate the Trendline option, we created the scatterplot of driving length versus driving accuracy in Figure 3.16. If you are a golfer, you are probably not surprised to see

Excel allows you to superimpose a trend line, linear or curved, on a scatterplot. It is an easy way to quantify the relationship apparent in the scatterplot.

Figure 3.16 Scatterplot with Trendline Superimposed 320.0

310.0

300.0

290.0

280.0

270.0

Yards/Drive vs Driving Accuracy

260.0 0.0 20.0

y = –1.0704x + 358.56

40.0 60.0 80.0

3-4 relationships among Numeric Variables 9 9

09953_ch03_ptg01_084-131.indd 99 04/03/19 12:10 PM

1 0 0 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

that the longest hitters tend to be less accurate. This scatterplot is definitely downward sloping, and it appears to follow a straight line reasonably well.

Therefore, it is reasonable to fit a linear trend line to this scatterplot, as you see in the figure. To do this, right-click any point on the chart, select Add Trendline, and fill out the resulting dialog box as shown in Figure 3.17. Note that we have checked the Display Equation on Chart option. The equation you see is a regression equation. It states that driving length (y) is 358.56 minus 1.0704 times driving accuracy (x). This line is certainly not a perfect fit because there are many points well above the line and others below the line. Still, it quantifies the downward relationship reasonably well.

Figure 3.17 Trendline Options Dialog Box

The tools in this subsection, scatterplots and trend lines superimposed on scatterplots, are among the most valuable tools you will learn in the book. When you are interested in a possible relationship between two numeric variables, these are the tools you should use first.

09953_ch03_ptg01_084-131.indd 100 04/03/19 12:10 PM

3-4 relationships among Numeric Variables 1 0 1

3-4b Correlation and Covariance Many numeric summary measures for a single variable were discussed in Chapter 2. The two measures discussed in this section, correlation and covariance, involve two variables. Specifically, each measures the strength and direction of a linear relationship between two numeric variables. Intuitively, the relationship is “strong” if the points in a scatterplot cluster tightly around some straight line. If this straight line rises from left to right, the relationship is positive and the measures will be positive numbers. If it falls from left to right, the relationship is negative and the measures will be negative numbers.

To measure the covariance or correlation between two numerical variables X and Y— indeed, to form a scatterplot of X versus Y—X and Y must be “paired” variables. That is, they must have the same number of observations, and the X and Y values for any observation should be naturally paired. For example, each observation could be the height and weight for a particular person, the time in a store and the amount purchased for a particular customer, and so on.

With this in mind, let Xi and Yi be the paired values for observation i, and let n be the number of observations. Then the covariance between X and Y , denoted by Covar(X, Y), is given by the following formula.

Formula for Covariance

Covar1X, Y2 5 an

i5 1 1Xi 2 X2 1Yi 2 Y2

n 2 1 (3.1)

Formula for Correlation

Correl1X, Y2 5 Covar1X, Y2

Stdev1X2 3 Stdev1Y2 (3.2)

You will probably never have to use Equation (3.1) directly—Excel has a built-in COVAR function that does it for you—but the formula does indicate what covariance is all about. It is essentially an average of products of deviations from means. If X and Y vary in the same direction, then when X is above its mean, Y will tend to be above its mean, and when X is below its mean, Y will tend to be below its mean. In either case, the product of deviations will be positive—a positive times a positive or a negative times a negative— so the covariance will be positive. The opposite is true when X and Y vary in opposite directions. Then the covariance will be negative.

Covariance has a serious limitation as a descriptive measure because it is very sensitive to the units in which X and Y are measured. For example, the covariance can be inflated by a factor of 1000 simply by measuring X in dollars rather than thousands of dollars. This limits the usefulness of covariance as a descriptive measure, and we will use it very little in the book.2

In contrast, the correlation, denoted by Correl(X, Y), remedies this problem. It is a unitless quantity that is unaffected by measurement scales. For example, the correlation is the same regardless of whether the variables are measured in dollars, thousands of dollars, or millions of dollars. The correlation is defined by Equation (3.2), where Stdev(X) and Stdev(Y) denote the standard deviations of X and Y . Again, you will probably never have to use this formula for calculations—Excel does it for you with the built-in CORREL function—but it does show how correlation and covariance are related to one another.

Covariance is too sensitive to the measurement scales of X and Y to make it interpretable, so it is better to rely on correlation, which is unaffected by measure- ment scales.

2 Don’t write off covariance too quickly, however. If you plan to take a finance course in investments, you will see plenty of covariances.

09953_ch03_ptg01_084-131.indd 101 04/03/19 12:10 PM

1 0 2 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

The correlation is not only unaffected by the units of measurement of the two vari- ables, but it is always between 21 and 11. The closer it is to either of these two extremes, the closer the points in a scatterplot are to a straight line, either in the negative or positive direction. On the other hand, if the correlation is close to 0, the scatterplot is typically a “cloud” of points with no apparent relationship. However, although it is not common, it is also possible that the points are close to a curve and have a correlation close to 0. This is because correlation is relevant only for measuring linear relationships.

When there are several numeric variables in a data set, it is useful to create a table of covariances and/or correlations. Each value in the table then corresponds to a particular pair of variables. However, we first make three important points about the roles of scatter- plots, correlations, and covariances.

• A correlation is a single-number summary of a scatterplot. It never conveys as much information as the full scatterplot; it only summarizes the information in the scatterplot. Still, it is often more convenient to report a table of correlations for many variables than to report an unwieldy number of scatterplots.

• You are usually on the lookout for large correlations, those near 21 or 11. But how large is “large”? There is no generally agreed-upon cutoff, but by looking at a number of scatterplots and their corresponding correlations, you will start to get a sense of what a correlation such as 20.5 or 10.7 really means in terms of the strength of the linear relationship between the variables. (In addition, a concrete meaning will be given to the square of a correlation in Chapters 10 and 11.)

• Do not even try to interpret covariances numerically except possibly to check whether they are positive or negative. For interpretive purposes, concentrate on correlations.

Correlation is useful only for measuring the strength of a linear relationship. Strongly related variables can have correlation close to 0 if the relationship is nonlinear.

Scatterplots Versus Correlations

It is important to remember that a correlation is a single-number measure of the linear relationship between two numeric variables. Although a correlation is a very useful measure, it is hard to imagine exactly what a correlation such as 0.3 or 0.8 actually means. In contrast, a scatterplot of two numeric variables indicates the relationship between the two variables very clearly. In short, a scatterplot conveys much more information than the corresponding correlation.

Fundamental Insight

EXAMPLE

3.3 GOLF STATS (CONTINUED) In the previous subsection, you saw how relationships between several of the golf variables can be detected with scatterplots. What further insights are possible by looking at correlations between these variables?

Objective To use correlations to understand relationships in the golf data.

Solution With the many numeric variables in the golf data set, it is indeed unwieldy to create scatterplots for all pairs of variables (unless you have an add-in like StatTools), but it is fairly easy to create a table of correlations. You could use Excel’s CORREL function to calculation each correlation, but if there are many variables, the following steps allow you to fill a table of correla- tions much more quickly. (Refer to Figure 3.18 as you read this.)

1. Highlight all columns of variables to be correlated, including their variable names at the top, and create range names for them with the Create from Selection button on the Formulas ribbon.

2. Copy the variable names to the top row and left column of your correlation table. 3. Create a copyable formula involving the CORREL and INDIRECT functions to calculate all correlations in the table.

09953_ch03_ptg01_084-131.indd 102 04/03/19 12:10 PM

The resulting table of correlations appears in Figure 3.18. The typical formula shown at the bottom was copied to the entire table. The INDIRECT function refers to the range names created in step 1 to get the ranges for the CORREL function. How- ever, note that range names replace illegal characters like / and spaces with underscores. Therefore, the corresponding changes had to be made to the variable names in the table. Also, there are missing data in some of the variables. When calculating the correlation between any two variables, the CORREL function ignores any rows with missing data for these two variables.

Figure 3.18 Correlations for Golf Data

1 2 3 4 5 6 7 8 9

A B C D E F G H I J K L

10 11 12 13 14 15 16

Age Events Rounds Cuts_Made Earnings Yards_Drive Driving_Accuracy Greens_in_Regulation Putting_Average Sand_Save_Pct

Typical formula Cell C5

1.000 –0.085 –0.098 –0.105 –0.191 –0.381

0.341 0.020 0.226

–0.071

–0.085 1.000 0.943 0.698 0.156

–0.036 –0.002

0.035 0.095

–0.128

–0.098 0.943 1.000 0.888 0.334 0.124 0.045 0.241

–0.045 –0.041

–0.105 0.698 0.888 1.000 0.546 0.272 0.058 0.367

–0.179 0.058

–0.191 0.156 0.334 0.546 1.000 0.387

–0.081 0.308

–0.438 0.130

–0.381 –0.036

0.124 0.272 0.387 1.000

–0.590 0.330

–0.044 –0.089

0.341 –0.002

0.045 0.058

–0.081 –0.590

1.000 0.233 0.167

–0.085

0.020 0.035 0.241 0.367 0.308 0.330 0.233 1.000 0.228

–0.160

0.226 0.095

–0.045 –0.179 –0.438 –0.044

0.167 0.228 1.000

–0.267

–0.071 –0.128 –0.041

0.058 0.130

–0.089 –0.085 –0.160 –0.267

1.000

Age Events Rounds Earnings Yards_Drive Greens_in_Regulation Putting_Average Sand_Save_PctDriving_Accuracy

Table of correlations

Cuts_Made

=CORREL(INDIRECT($B5),INDIRECT(C$3))

You can ignore the 1.000 values along the diagonal because a variable is always perfectly correlated with itself. Besides these, you are typically looking for relatively large values, either positive or negative. When the table is fairly large, conditional formatting is useful. For example, you can format all correlations between 0.6 and 0.999 as red and all correlations between 20.999 and 20.5 as green, as has been done here. There are three large positive values, all above 0.69, involving events, rounds, and cuts made. None of these should come as a surprise. There is only one large negative correlation, the one between driving length and driving accuracy, and you already saw the corresponding scatterplot in Figure 3.16. So if you want to know what a correlation of approximately 20.6 really means, you can look at the scatterplot in this figure. It indicates a definite downward trend, but there is still quite a lot of variability around the best-fitting straight line.

Again, a correlation is only a summary of a scatterplot. Therefore, you can learn more about any interesting-looking correlations by creating the corresponding scatterplot. For example, the scatterplot corresponding to the 0.888 correlation between Cuts Made and Rounds appears in Figure 3.19. (We also superimposed a trend line.) This chart shows the strong lin- ear relationship between cuts made and rounds played, but it also shows that there is still considerable variability around the best-fitting straight line, even with a correlation as large as 0.888.

You typically scan a table of correlations for the large correlations, either positive or negative. Conditional formatting is useful, especially if the table is a large one.

Figure 3.19 Scatterplot of Rounds Versus Cuts Made

0 0 5 10 15 20

y = 3.7577x + 18.11

25 30

100

120 Rounds vs Cuts Made

3-4 relationships among Numeric Variables 1 0 3

09953_ch03_ptg01_084-131.indd 103 04/03/19 12:10 PM

1 0 4 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

b. Create a separate table of correlations for each of the selected years that includes Earnings/Event, Yards/Drive, Driving Accuracy, Greens in Regulation, Putting Average, Sand Save Pct, and Birdies/Round. Explain whether these correlations help answer the questions posed above.

c. There is a saying in golf: “Drive for show, putt for dough.” Create a separate set of scatterplots for each of the selected years of Earnings/Event (Y axis) versus each of Yards/ Drive, Driving Accuracy, and Putting Average. Discuss whether these scatterplots tend to support the saying.

25. The file P03_25.xlsx contains data about 211 movies released in 2006 and 2007. The question to be explored in this problem is whether the total gross for a movie can be predicted from how it does in its first week or two. a. Create a table of correlations between the five variables

7-day Gross, 14-day Gross, Total US Gross, Interna- tional Gross, and US DVD Sales. Does it appear that the last three variables are related to either of the first two?

b. Explore the basic question further by creating a scatter- plot of each of Total US Gross, International Gross, and US DVD Sales (Y axis) versus each of 7-day Gross and 14-day Gross (X axis)—six scatterplots in all. Do these support the claim that you can tell how well a movie will do by seeing how it does in its first week or two?

26. The file P02_39.xlsx contains SAT and ACT test scores by state for high school graduates of the 2016– 2017 school year. These are broken down by reading/ writing and math/science scores. The file also lists the percentages of students taking these two tests by state. Create correlations and scatterplots to explore the following relationships and comment on the results. a. The relationship between the total SAT score and the

percentage taking the SAT exam. b. The relationship between the ERW and Math compo-

nents of the SAT exam. c. The relationship between the composite ACT score

and the percentage taking the ACT exam. d. The relationship between the four components of the

ACT exam. 27. The file P02_16.xlsx contains traffic data from 256

weekdays on four variables. Each variable lists the number of arrivals during a specific 5-minute period of the day. a. What would it mean, in the context of traffic, for the

data in the four columns to be positively correlated? Based on your observations of traffic, would you expect positive correlations?

b. Create a table of correlations and check whether these data behave as you would expect.

28. The file P02_11.xlsx contains data on 148 houses that were recently sold in a (fictional) suburban community. The data set includes the selling price of each house, along with its appraised value, square footage, number of bedrooms, and number of bathrooms. a. Create a table of correlations between all of the

variables. Comment on the magnitudes of the correla- tions. Specifically, which of the last three variables,

Problems Solutions for problems whose numbers appear within a colored box can be found in the Student Solution Files.

Level A 21. The file P02_07.xlsx includes data on 204 employees at

the (fictional) company Beta Technologies. a. Create a table of correlations between the variables

Age, Prior Experience, Beta Experience, Education, and Annual Salary. Which of the first four of these variables is most highly correlated (in a positive direction) with Annual Salary?

b. Create scatterplots of Annual Salary (Y axis) versus each of Age, Prior Experience, Beta Experience, and Education.

c. For the variable from part a most highly correlated with Annual Salary, create a (linear) trend line in its scatterplot with the corresponding equation shown in the chart. What does this equation imply about the relationship between the two variables? Be specific.

22. The file P03_22.xlsx lists financial data on movies released from 1980 to 2011 with budgets of at least $20 million. a. Reduce the size of this data set by deleting all movies

with a budget of more than $100 million. Also, delete all movies where US Gross and/or Worldwide Gross is listed as Unknown.

b. For the remaining movies, create a table of correla- tions between the variables Budget, US Gross, and Worldwide Gross. Comment on the results. Are there any surprises?

c. For the movies remaining after part a, create a scat- terplot of Worldwide Gross (Y axis) versus US Gross and another scatterplot of US Gross (Y axis) versus Budget. Briefly explain any patterns you see in these scatterplots. Do they seem to be consistent with the corresponding correlations?

23. The file P02_10.xlsx contains midterm and final exam scores for 96 students in a corporate finance course. a. Do the students’ scores for the two exams tend to go

together, so that those who do poorly on the midterm tend to do poorly on the final, and those who do well on the midterm tend to do well on the final? Create a scat- terplot, along with a correlation, to answer this question.

b. Superimpose a (linear) trend line on the scatterplot, along with the equation of the line. Based on this equation, what would you expect a student with a 75 on the midterm to score on the final exam?

24. Recall that the file Golf Stats.xlsx contains data on the 200 top golfers each year from 2003 to 2017. The question to be explored in this problem is what drives earnings, and whether this is consistent from year to year. a. For at least three of the years, create two new vari-

ables, Birdies/Round and Earnings/Event. The latter is potentially a better measure of earnings because some players enter more events than others.

09953_ch03_ptg01_084-131.indd 104 04/03/19 12:10 PM

3-4 relationships among Numeric Variables 1 0 5

Square Feet, Bedrooms, and Bathrooms, are highly correlated with Selling Price?

b. Create four scatterplots to show how the other four variables are related to Selling Price. In each, Selling Price should be on the Y axis. Are these in line with the correlations in part a?

c. You might think of the difference, Selling Price minus Appraised Value, as the “error” in the appraised value, in the sense that this difference is how much more or less the house sold for than the appraiser expected. Find the correlation between this difference and Selling Price, and find the correlation between the absolute value of this difference and Selling Price. If either of these cor- relations is reasonably large, what is it telling us?

Level B 29. The file P03_29.xlsx contains daily prices of four pre-

cious metals: gold, silver, platinum, and palladium. The question to be explored here is whether changes in these commodities move together through time. a. Create time series graphs of the four series. Do the

series appear to move together? b. Create four new difference variables, one for each

metal. Each should list this month’s price minus the previous month’s price. Then create time series graphs of the differences.

c. Create a table of correlations between the differences created in part b. Based on this table, comment on whether the changes in the prices of these metals tend to move together over time.

d. For all correlations in part c above 0.6, create the cor- responding scatterplots of the differences (for exam- ple, gold differences versus silver differences). Do these, along with the time series graphs from parts a and b, provide a clearer picture of how these series move together over time?

30. The file P03_30.xlsx contains daily data on exchange rates of various currencies versus the U.S. dollar. It is of interest to financial analysts and economists to see whether exchange rates move together through time. You could find the correlations between the exchange rates themselves, but it is often more useful with time series data to check for correlations between differences from day to day. a. Create a column of differences for each currency. b. Create a table of correlations between all of the origi-

nal variables. Then on the same sheet, create a second table of correlations between the difference variables. On this same sheet, enter two cutoff values, one pos- itive such as 0.6 and one negative such as 20.5, and use conditional formatting to color all correlations (in both tables) above the positive cutoff green and all correlations below the negative cutoff red. Do it so that the 1’s on the diagonal are not colored.

c. Based on the second table and your coloring, can you conclude that these currencies tend to move together in the same direction? If not, what can you conclude?

31. The file P02_35.xlsx contains data from a survey of 500 randomly selected (fictional) households. a. Create a table of correlations between the last five

variables (First Income to Debt). On the sheet with these correlations, enter a “cutoff ” correlation such as 0.5 in a blank cell. Then use conditional format- ting to color green all correlations in the table at least as large as this cutoff, but don’t color the 1’s on the diagonal. The coloring should change auto- matically as you change the cutoff. This is always a good idea for highlighting the “large” correlations in a correlations table.

b. Do some investigation to see how missing values are handled when calculating correlations. There are two basic possibilities (and both of these are options in some software packages). First, it could delete all rows that have missing values for any variables and then calculate all of the correlations based on the remaining data. Second, when it creates the correla- tion for any pair of variables, it could (temporarily) delete only the rows that have missing data for these two variables and then calculate the correlation on what remains for these two variables. Why would you prefer the second option?

32. We have indicated that if you have two categorical vari- ables and you want to check whether they are related, the best method is to create a crosstabs, possibly with the counts expressed as percentages. But suppose both categorical variables have only two categories and these variables are coded as dummy 0–1 variables. Then there is nothing to prevent you from finding the correlation between them with the same Equation (3.2) from this section, that is, with Excel’s CORREL function. How- ever, if we let C1i, j2 be the count of observations where the first variable has value i and the second variable has value j, there are only four joint counts that can have any bearing on the relationship between the two vari- ables: C10,02 , C10,12 , C11,02 , and C11,12 . Let C1112 be the count of 1s for the first variable and let C2112 be the count of 1s for the second variable. Then it is clear that C1112 5 C11,02 1 C11,12 and C2112 5 C10,12 1 C11,12, so C1112 and C2112 are determined by the joint counts. It can be shown algebraically that the correlation between the two 091 variables is

nC11,12 2 C1112C2112 !C1112 1n 2 C1112 2!C2112 1n 2 C2112 2

To illustrate this, the file P03_32.xlsx contains two 091 variables. (The values were generated randomly.) Create a crosstabs to find the required counts, and use the above formula to calculate the correlation. Then use StatTools (or Excel’s CORREL function) to find the correlation in the usual way. Do your two results match? (Again, we do not necessarily recommend finding correlations between 091 variables. A cross- tabs is more meaningful and easier to interpret.)

09953_ch03_ptg01_084-131.indd 105 04/03/19 12:10 PM

1 0 6 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

3-5 Pivot Tables We now discuss one of Excel’s most powerful—and easy-to-use—tools, pivot tables. Pivot tables allow you to break the data down by categories so that you can see, for example, average sales by gender, by region of country, by time of day, or any combination of these. Sometimes pivot tables are used to display counts, such as the number of customers broken down by gender and region of country. These tables of counts, often called crosstabs or contingency tables, have been used by statisticians for years. However, crosstabs typically list only counts, whereas pivot tables can list counts, sums, averages, and other summary measures.

It is easiest to understand pivot tables by means of examples, so we illustrate several possibilities in the following example.

EXAMPLE

3.4 CUSTOMER ORDERS AT ELECMART The file Elecmart Sales.xlsx (see Figure 3.20) contains data on 400 customer orders during a period of several months for the fictional Elecmart company.3 There are several categorical variables and several numeric variables. The categorical variables include the day of week, time of day, region of country, type of credit card used, gender of customer, and buy category of the customer (high, medium, or low) based on previous behavior. The numeric variables include the number of items ordered, the total cost of the order, and the price of the highest-priced item purchased. How can the manager of Elecmart use pivot tables to summarize the data so that she can understand the buying patterns of her customers?

Objective To use pivot tables to break down the customer order data by a number of categorical variables.

Figure 3.20 Elecmart Data

1 2 3 4 5 6 7 8 9

10 11 12 13 14 15

A B C D E F G H I J Date Day Time Region Card Type Gender Buy Category Items Ordered Total Cost High Item

Sun Sun Sun Sun Sun Sun Mon Mon Mon Tue Tue Tue Wed Wed

Morning Morning

Evening Evening Evening Morning Morning

A�ernoon

A�ernoon A�ernoon A�ernoon A�ernoon

Morning

West West West Northeast West Northeast West South West Midwest Northeast South Northeast Northeast

ElecMart Other ElecMart Other ElecMart Other Other Other Other Other ElecMart Other ElecMart Other

Female Female Female Female Male Female Male Male Male Female Female Male Male Male

High Medium Medium Low Medium Medium Low High Low Low Medium Medium High Low

4 1 5 1 4 5 1 4 2 1 2 2 3 1

$136.97 $25.55

$113.95 $6.82

$147.32 $142.15

$18.65 $178.34

$25.83 $18.13 $54.52 $61.93

$147.68 $27.24

$79.97 $25.55 $90.47

$6.82 $83.21 $50.90 $18.65

$161.93 $15.91 $18.13 $54.38 $56.32 $96.64 $27.24

4-Mar 4-Mar 4-Mar 4-Mar 4-Mar 4-Mar 5-Mar 5-Mar 5-Mar 6-Mar 6-Mar 6-Mar 7-Mar 7-Mar

3 Users of previous editions of the book will notice that the dates have been changed. This changes a few values in the reported pivot tables.

09953_ch03_ptg01_084-131.indd 106 04/03/19 12:10 PM

Solution First, we preview the results you can obtain. Pivot tables are useful for breaking down numeric variables by categories, or for counting observations in categories and possibly expressing the counts as percentages. So, for example, you might want to see how the average total cost for females differs from the similar average for males. Or you might simply want to see the per- centage of the 400 sales made by females. Pivot tables allow you to find such averages and percentages easily.

Actually, you could find such averages or percentages without using pivot tables. For example, you could sort on gender and then find the average of the Female rows and the average of the Male rows. However, this takes time, and more complex breakdowns are even more difficult and time-consuming. They are all easy and quick with pivot tables. Besides that, the result- ing tables can be accompanied with corresponding charts that require virtually no extra effort to create. Pivot tables are a man- ager’s dream. Fortunately, Excel makes them a manager’s reality.

We begin by building a pivot table to find the sum of Total Cost broken down by time of day and region of country. Although we show this in a number of screen shots, just to help you get the knack of it, the process takes only a few seconds after you gain some experience with pivot tables.

To start, click the PivotTable button at the far left on the Insert ribbon (see Figure 3.21). This produces the dialog box in Figure 3.22. The top section allows you to specify the table or range that contains the data. (You can also specify an external data source, but we will not cover this option here.) The bot- tom section allows you to select the location where you want the results to be placed. If you start by selecting any cell inside the data set, Excel’s guess for the table or range is usually correct, although you can override it if necessary. Make sure the

Pivot tables are perfect for breaking down data by cate- gories. Many people refer to this as “slicing and dicing” the data.

Beginning in Excel 2013, there is a Recommended Pivot Tables option. It guesses which pivot tables you might want.

Figure 3.21 PivotTable Button on Insert Ribbon

Figure 3.22 Create PivotTable Dialog Box

3-5 pivot tables 1 0 7

09953_ch03_ptg01_084-131.indd 107 04/03/19 12:10 PM

range selected for this example is A1:J401. This selected range should include the variable names at the top of each column. Then click OK. Note that with these settings, the pivot table will be placed in a new worksheet with a generic name such as Sheet1. You will probably want to rename it something like Pivot Table.

This produces a blank pivot table, as shown in Figure 3.23. Also, assuming any cell inside this blank pivot table is selected, the PivotTable Tools “super tab” is visible. This super tab has two ribbons, Analzye and Design. (The Analyze ribbon was named Options in Excel 2007 and 2010.) The Analyze ribbon appears in Figure 3.24, and the Design ribbon appears in Figure 3.25. Each of these has a variety of buttons for manipulating pivot tables, some of which we will explore shortly. Finally, the PivotTable Fields pane in Figure 3.26 is visible. By default, it is docked at the right of the screen, but you can move it if you like.

Figure 3.23 Blank Pivot Table

1 2 3 4 5 6 7 8 9

10 11 12 13 14 15 16 17 18 19 20 21

A B C D

Click in this area to work with the PivotTable report

PivotTable1

Figure 3.24 PivotTable Analyze Ribbon

Figure 3.25 PivotTable Design Ribbon

1 0 8 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

09953_ch03_ptg01_084-131.indd 108 04/03/19 12:10 PM

Note that the two pivot table ribbons and the PivotTable Fields pane are visible only when the active cell is inside a pivot table. If you click outside the pivot table, say, in cell D1, all three of these will disappear. Don’t worry. You can get them back by selecting any cell inside the pivot table.

The PivotTable Fields pane indicates that a pivot table has four areas. These are for Filters, Rows, Columns, and Values. They correspond to the four areas in Figure 3.26 where you can put fields.4

A Rows field has categories that go down the left side of a pivot table, a Columns field has categories that go across the top of a pivot table, a Filters field lets you filter the pivot table by its categories, and a Values field contains the data you want to summarize. Typically (but not always), you will place categorical variables in the Filters, Rows, and/or Columns areas, and you will place numeric variables in the Values area.

In the present example, check the Time, Region, and Total Cost boxes in the upper half of the PivotTable Fields pane. You immediately get the pivot table in Figure 3.27. It shows the sum of Total Cost, broken down by time of day and region of coun- try. For example, the total cost of orders in the morning in the South was $3835.86, and the total cost of orders in the morning (over all regions) was $18,427.31.

Excel applies two rules to variables checked at the top of the PivotTable Fields pane:

• When you check a text variable or a date variable in the field list, it is added to the Rows area. • When you check a numeric variable in the field list, it is added to the Values area and summarized with the Sum function.

This is exactly what happens when you check Time, Region, and Total Cost. However, this is just the beginning. With very little work, you can do a lot more. Some of the many possi- bilities are explained in the remainder of this example.

First, notice that the pivot table in Figure 3.27 has both Rows fields, Time and Region, in column A. This is one of three possible layouts: Compact, Outline, or Tabular. These are available from the Report Layout dropdown list on the Design ribbon. When you create a

Figure 3.26 PivotTable Fields Pane

There are three different layouts for pivot tables, but the differences are relatively minor. Ultimately, it is a matter of taste.

4 In discussing pivot tables, Microsoft uses the term field rather than variable, so we will do so as well.

3-5 pivot tables 1 0 9

09953_ch03_ptg01_084-131.indd 109 04/03/19 12:10 PM

1 1 0 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

pivot table (in an .xlsx file), you get the compact layout by default. If you would rather have the tabular or outline layout, it is easy to switch to them. In particular, the tabular layout, shown in Figure 3.28, is closer to what was used in pre-2007 versions of Excel. (Outline layout, not shown here, is very similar to tabular layout except for the placement of its subtotals.)

A 1 2 3 Row Labels Sum of Total Cost

24265.6A�ernoon

18834.3Evening

18427.31Morning

Midwest 3187.16 Northeast South

8159.78 5729.72

West 7188.94

Midwest 2552.89 Northeast South

5941.49 3864.12

West 6475.8

Midwest 3878.22 Northeast South

5084.57 3835.86

West

4 5 6 7 8 9

10 11 12 13 14 15 16 17

5628.6618 19 61527.21Grand Total

Figure 3.27 Sum of Total Cost by Time and Region (Compact Layout)

A 1 2 3

3187.16A�ernoon

A�ernoon Total

Evening Total

2552.89Evening

18834.3 Morning

Morning Total

Midwest 8159.78Northeast

South 5729.72 7188.94West

Midwest Northeast South West

24265.6

5941.49 3864.12

6475.8

3878.22 5084.57 3835.86

4 5 6 7 8 9

10 11 12 13 14 15 16 17

5628.66 18 19

18427.31 Grand Total

Time Sum of Total CostRegion

61527.21

Figure 3.28 Sum of Total Cost by Time and Region (Tabular Layout)

One significant advantage to using tabular (or outline) layout instead of compact layout is that you can see which fields are in the Rows and Columns areas. Take another look at the pivot table in Figure 3.27. It is fairly obvious that categories such as Afternoon and Morning have to do with time of day and that categories such as Midwest and South have to do with region of country. However, there are no labels that explicitly name the Rows fields. In contrast, the tabular layout in Figure 3.28 names them explicitly, and it makes filtering more transparent. Some critics have been very vocal about their dislike for compact layout due to the lack of meaningful labels, and we tend to agree. Therefore, the remaining screenshots in this chapter show tabular layout.

09953_ch03_ptg01_084-131.indd 110 04/03/19 12:10 PM

1 2 3 4 5 6 7 8

A B C D FE

Grand Total

Morning A�ernoon Evening

Sum of Total Cost Time Midwest

Region Northeast South West Grand Total

3878.22 3187.16 2552.89 9618.27

5084.57 8159.78 5941.49

19185.84

3835.86 5729.72 3864.12 13429.7

5628.66 7188.94

6475.8 19293.4

18427.31 24265.6 18834.3

61527.21

Figure 3.29 Placing Region in the Columns Area

1 2 3 4 5 6 7 8 9 10 11 12 13 14

A B C D FE G

Morning

Grand Total

Morning Total

Female Male

Sum of Total Cost Time Gender Midwest

Region Northeast South West Grand Total

A�ernoon

A�ernoon Total Evening

Evening Total

2880.13 998.09

3878.22 2092.39 1094.77 3187.16 1411.32 1141.57 2552.89 9618.27

2864.79 2219.78 5084.57

6206.8 1952.98 8159.78 1103.44 4838.05 5941.49

19185.84

2774.01 1061.85 3835.86 5073.03

656.69 5729.72 1181.66 2682.46 3864.12 13429.7

4211.19 1417.47 5628.66 5210.59 1978.35 7188.94 1885.91 4589.89

6475.8 19293.4

12730.12 5697.19

18427.31 18582.81

5682.79 24265.6 5582.33

13251.97 18834.3

61527.21

Figure 3.30 Adding a Second Field to the Rows Area

Note that times of day in column A are now in chronological order, not in alphabetical order as they were by default. It is easy to make this change. Select the top of the Morning cell (A7) until the cursor becomes a four-way arrow and drag it above Afternoon.

You can also categorize by a third field such as Gender. As before, if you check Gender in the PivotTable Fields pane, it goes to the Rows area by default, but you could then drag it to another area. The pivot table in Figure 3.30 shows the result of placing Gender in the Rows area. For example, the sum of Total Cost for morning orders by females in the West is 4211.19. You can place as many fields as you like in any area. The only downside is that the pivot table starts to get cluttered. If you want to remove a field from the pivot table, just uncheck its its item in the top part of the PivotTable Fields pane or drag it off from the bottom part of the pane.

If you checked the Gender field to get Figure 3.30, you'll notice that Gender is below Time in the Rows box. This makes Time the "outer" Rows field and Gender the "inner" Rows field in the pivot table. To reverse their roles, drag Gender above Time in the Rows box. This leads to the pivot table in Figure 3.31. The two pivot tables provide slightly different views of the same summarization, and you can choose the one you like best. This ability to categorize by multiple fields and rearrange the fields as you like is a big reason why pivot tables are so powerful and useful—and easy to use.

Changing Field Settings You can change various settings in the Field Settings dialog box. You can get to this dialog box in at least two ways. First, there is a Field Setting button on the Analyze ribbon. Second, you can right-click any of the pivot table cells and select Field Settings. The field settings are particularly useful for fields in the Values area.

For now, right-click any number in the pivot table in Figure 3.31 and select Value Field Settings to obtain the dialog box in Figure 3.32. This allows you to choose how you want to summarize the Total Cost variable—by Sum, Average, Count,

Changing the locations of fields in pivot tables is easy. We favor dragging the fields to the various areas, but you can experiment with the other options.

Changing Locations of Fields Starting with the pivot table in Figure 3.28, you can choose where to place either Time or Region; it does not have to be in the Rows area. To place the Region variable in the Columns area, for example, drag the Region button from the Rows area of the PivotTable Fields pane to the Columns area. The pivot table changes automatically, as shown in Figure 3.29. (This is called pivoting.)

3-5 pivot tables 1 1 1

09953_ch03_ptg01_084-131.indd 111 04/03/19 12:10 PM

or several others. You can also click the Number Format button to choose from the usual number formatting options, and you can click the Show Values As tab to display the data in various ways (more on this later). If you choose Average and format as currency with two decimals, the resulting pivot table appears as in Figure 3.33. Now each number is the aver- age of Total Cost for all orders in its combination of categories. For example, the average of Total Cost for all morning orders by females in the South is $146.00, and the average of all orders by femailes in the South is $143.31.

Figure 3.31 Pivot Table with Rows Fields Reversed 1

2 3 4 5 6 7 8 9 10 11 12 13

A B C D FE G

Female Total

Grand Total

Morning A�ernoon Evening

9618.27 19185.84 13429.7 19293.4 61527.21

Sum of Total Cost Gender Time Midwest

Region Northeast South West Grand Total

Female

Male Total

Male

2880.13 2092.39 1411.32 6383.84

998.09 1094.77 1141.57 3234.43

2864.79 6206.8

1103.44 10175.03

2219.78 1952.98 4838.05 9010.81

2774.01 5073.03 1181.66

9028.7 1061.85

656.69 2682.46

4401

4211.19 5210.59 1885.91

11307.69 1417.47 1978.35 4589.89 7985.71

12730.12 18582.81

5582.33 36895.26

5697.19 5682.79

13251.97 24631.95

The key to summarizing the data the way you want it summarized is the Value Field Settings dialog box. Get used to it because you will use it often.

Figure 3.32 Value Field Settings Dialog Box

Figure 3.33 Pivot Table with Average of Total Cost 1

2 3 4 5 6 7 8 9 10 11 12 13

A B C D FE G

Female

Female Total

Grand Total

Morning A�ernoon Evening

Average of Total C Gender Time Midwest Northeast South West Grand Total

Male

Male Total

$160.01 $123.08 $176.42 $148.46 $124.76 $121.64 $103.78 $115.52 $135.47

$143.24 $172.41 $183.91 $164.11 $170.75 $162.75 $172.79 $170.02 $166.83

$146.00 $153.73 $107.42 $143.31 $132.73 $109.45 $167.65 $146.70 $144.41

$150.40 $192.98 $171.45 $171.33 $141.75 $141.31 $148.06 $145.19 $159.45

$149.77 $164.45 $155.06 $157.67 $146.08 $138.60 $154.09 $148.39 $153.82

Region

1 1 2 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

09953_ch03_ptg01_084-131.indd 112 04/03/19 12:10 PM

Filtering When you place a categorical field in the Rows, Columns, or Filters area, all its categories show by default. But it is often use- ful to filter out, or hide, some of these categories. This lets you focus on the categories of most interest. There are several ways to filter; you can choose the method you prefer.

1. Click the dropdown arrow next to any Rows or Columns field in the pivot table and check the categories you want to show. Alternatively, choose Label Filters from the dropdown list, which provide several useful options.

2. Add a field to the Filters area. Then filter with its dropdown arrow. 3. Add a slicer (or a timeline for date fields) from the Filters group on the Analyze ribbon. These are graphical items for

making filters more transparent. (Slicers were introduced in Excel 2010. Timelines were introduced in Excel 2013.)

Figures 3.34, 3.35, and 3.36 illustrate these three options. Figure 3.34 uses filters on Time and Region to hide the After- noon, South, and West categories. Note that the grand totals are only for the filtered data. For example, $142.91 in cell B7 is the average of Total Cost only for the morning and evening orders in the Midwest. Also note the filter signs on the Time and Region dropdown arrows. You can click either of these to change or remove the filters.

1 2 3 4 5 6 7

A B C D

Grand Total

Morning Evening

Average of Total Cost Region Northeast Grand Total

$149.16 $134.36 $142.91

$154.08 $174.75 $164.57

$151.91 $160.27 $155.87

Time Midwest

Figure 3.34 Filtering on the Fields in the Rows and Columns Areas

Figure 3.35 uses a filter for Day in the Filters area. It indicates that multiple categories were checked in the Day dropdown list. However, without opening this list, you can’t tell which days were selected. (They are Friday and Saturday.) This lack of transparency is a drawback of filters in the Filters area.

Figure 3.35 Using a Day Filter in the Filters Area 1

2 3 4 5 6 7 8

A B C D FE

Grand Total

Morning Afternoon Evening

Day (Multiple Items)

Average of Total Cost Region Northeast South West Grand Total

$175.20 $133.13 $144.21 $149.93

$172.40 $200.93 $161.78 $182.85

$151.51 $167.23 $138.82 $153.05

$162.39 $182.74 $133.56 $159.96

$166.50 $173.95 $144.15 $163.52

Time Midwest

Figure 3.36 illustrates a slicer on Day and a timeline on Date. When you select Insert Slicer from the Analyze ribbon, you can check as many fields as you like. You get a slicer for each field you check. (For a timeline, you are likely to select only a single date variable.) To choose categories from a slicer, select as many as you want in the usual way: hold down the Ctrl key to select multiple items or the Shift key to select contiguous items. It is clear from this slicer and this timeline that only orders on Fridays and Saturdays during March and April are summarized. This is the advantage of slicers and timelines. They make filters transparent. By the way, to create Figure 3.36, starting from Figure 3.35, we unchecked the Day field in the PivotTable Fields pane. However, this did not remove the filter on Day. When we inserted the slicer on Day, Friday and Saturday were already selected. Just keep this in mind. When you delete a filtered field from a pivot table, the filter is still remembered.

Slicers and timelines aren’t really part of a pivot table in the sense that when they are selected, the PivotTable Fields pane isn’t opened. However, they each have their own ribbon. For example, when you select a slicer, a Slicer Tools Options tab appears, with an associated ribbon. This lets you make several cosmetic changes to the slicer.

Sorting You can sort by categories or values in a pivot table. We already mentioned one sort, where Morning was moved above Afternoon by dragging. However, dragging was required only because the “natural” order Morning, Afternoon, Evening is not

3-5 pivot tables 1 1 3

09953_ch03_ptg01_084-131.indd 113 04/03/19 12:10 PM

alphabetical or reverse alphabetical order. More commonly, you will sort by clicking the dropdown arrow next to a Rows or Columns field and using one of the Sort options at the top. These include A-Z, Z-A, and More Sort Options. You can experi- ment with the latter.

Sorting by values— the values in the body of the pivot table—is also possible. For example, if you select cell E5 in Figure 3.36 and click the A-Z button on the Data ribbon, not only will cells E5, E6, and E7 will be sorted in increasing order, but (as you would hope) the entire range A5:F7 will be rearranged as well. This is just one possibility. You can also right-click a value in the pivot table, select Sort and then More Sort Options to get the dialog box in Figure 3.37. This lets you sort in increasing or decreasing order from top to bottom or from left to right. Don’t be afraid to experiment with these options. You can always press Ctrl+z to undo your sort.

IHG

Day

Sun

Mon

Tue

Wed

Thu

Fri

Sat

FEDCBA

Date

Mar - Apr 2018 MONTHS

2018

JAN FEB MAR APR MAY JUN JUL

Morning

A�ernoon

Evening

Grand Total

$167.57

$93.57

$218.29

$131.62

$113.10

$101.05

$94.31

$105.13

$161.47

$143.65

$148.04

$147.72

$119.90

$86.22

$149.87

$134.26

$135.61

$111.34

$139.91

$127.45

Midwest

Region

Northeast South West Grand Total

Average of Total Cost

Time

Figure 3.36 Filtering with a Slicer and a Timeline

Figure 3.37 More Sort Options for Values

Pivot Charts It is easy to accompany pivot tables with pivot charts. These charts are really Excel charts, but they are better because they adapt automatically to the underlying pivot table. If you make a change to the pivot table, such as pivoting the Rows and Columns fields, the pivot chart makes the same change automatically. To create a pivot chart, click anywhere inside the pivot table, select the PivotChart button on the Analyze ribbon, and select a chart type. That’s all there is to it. The resulting pivot

1 1 4 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

09953_ch03_ptg01_084-131.indd 114 04/03/19 12:10 PM

chart (using the default column chart option) for the pivot table in Figure 3.36 appears in Figure 3.38. If you decide to pivot the Rows and Columns fields, the pivot chart changes automatically, as shown in Figure 3.39. Note that the categories on the horizontal axis are always based on the Rows field, and the categories in the legend are always based on the Columns field.

$0.00 Morning A�ernoon Evening

$250.00

$200.00

$150.00

$100.00

$50.00

Time

Region

Midwest

Northeast

South

West

Average of Total CostFigure 3.38 Pivot Chart Based on Pivot Table

$0.00 Midwest Northeast South West

$250.00

$200.00

$150.00

$100.00

$50.00

Region

Time

Morning

A�ernoon

Evening

Average of Total CostFigure 3.39 Pivot Chart after Pivoting Rows and Columns Fields

Note that when you activate a pivot chart, the PivotTable Tools “super tab” changes to PivotChart Tools. This super tab includes three ribbons for manipulating pivot charts: Analyze, Design, and Format. There is not enough space here to discuss the many options on these ribbons, but they are intuitive and easy to use. As usual, don’t be afraid to experiment.

Multiple Variables in the Values Area Multiple variables can be placed in the Values area. In addition, a given variable in the Values area can be summarized by more than one summarizing function. This can create a rather busy pivot table, so we indicate our favorite way of doing it. Starting with the pivot table in Figure 3.40, drag Total Cost in the top of the PivotTable Fields pane (the item that is already checked) to the Values area. The bottom part of the PivotTable Fields pane should now appear as in Figure 3.41, and the pivot table should now appear as in Figure 3.42. Note in particular the Values button in the Rows area of Figure 3.41. This button controls the placement of the data in the pivot table. You have a number of options for this button: (1) leave it where it is, (2) drag it above the Time button, (3) drag it to the Columns area, below the Region button, or (4) drag it to the Columns area, above the Region button. You can experiment with these options, but we tend to prefer option (2), which leads to the pivot table in Figure 3.43.

3-5 pivot tables 1 1 5

09953_ch03_ptg01_084-131.indd 115 04/03/19 12:10 PM

1 1 6 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

In a similar manner, you can experiment with the buttons in the Values area. However, the effect here is less striking. If you drag the Sum of Total Cost button above the Average of Total Cost button in the field list, the effect is simply to switch the ordering of these summaries in the pivot table.

Summarizing by Count The field in the Values area, whatever it is, can be summarized by the Count function. This is useful when you want to know, for example, how many of the orders were placed by females in the South. When summarizing by Count, the key is to under- stand that the field placed in the Values area is irrelevant, so long as you summarize it by the Count function. To illustrate, start

Morning Average of Total Cost $149.16 3878.22

3187.16

2552.89

$122.58

$134.36

$154.08 5084.57

8159.78

5941.49

$170.00

$174.75

$142.07 3835.86

5729.72

3864.12

$146.92

$143.12

$148.12 5628.66

7188.94

6475.8

$175.34

$154.19

$148.61 18427.31

24265.6

18834.3

$157.57

$154.38

$135.47 9618.27

$166.83 $144.41 $159.45 19185.84 13429.7 19293.4 61527.21

$153.82

Average of Total Cost

Sum of Total Cost

Afternoon

Evening

Total Sum of Total Cost Total Average of Total Cost

Time Values Region Midwest Northeast South West Grand Total

1 2 3 4 5 6 7 8 9

10 11 12

A B C ED F G

Figure 3.42 Pivot Table with Two Values Fields

1 2 3 4 5 6 7 8

A B C D FE

Grand Total

Morning A�ernoon Evening

Average of Total Cost Region Northeast South West Grand Total

$149.16 $122.58 $134.36 $135.47

$154.08 $170.00 $174.75 $166.83

$142.07 $146.92 $143.12 $144.41

$148.12 $175.34 $154.19 $159.45

$148.61 $157.57 $154.38 $153.82

Time Midwest

Figure 3.40 Basic Pivot Table

Filters Columns

Region

Values

Rows

Time Average of To...

Sum of Total ...

Drag fields between areas below:Figure 3.41 Choosing Two Values Fields

09953_ch03_ptg01_084-131.indd 116 04/03/19 12:10 PM

3-5 pivot tables 1 1 7

with the pivot table in Figure 3.40, where Total Cost is summarized with the Average function. Next, right-click any number in the pivot table, select Value Field Settings, and select the Count from the Summarize Values By options. The default Custom Name you will see in this dialog box, Count of Total Cost, is misleading, because Total Cost has nothing to do with the counts obtained. Therefore, we like to change this Custom Name label to Count. While in the Value Field Settings dialog box, change the number format to Number with zero decimals. The resulting pivot table appears in Figure 3.44. For example, 27 of the 400 orders were placed in the morning in the South, and 115 of the 400 orders were placed in the Northeast. (Do you now see why the counts have nothing to do with Total Cost?) This type of pivot table, with counts for various categories, is the same as the crosstabs discussed in Section 3-2. However, these counts can be created much more quickly with a pivot table.

When data are summarized by counts, there are a number of ways they can be dis- played. The pivot table in Figure 3.44 shows “raw counts.” Depending on the type of information you want, it might be more useful to display the counts as percentages. Three particular options are typically chosen: as percentages of total, as percentages of row totals, and as percentages of column totals. When shown as percentages of total, the percentages in the table sum to 100%; when shown as percentages of row totals, the percentages in each row sum to 100%; when shown as percentages of column totals, the percentages in each column sum to 100%. Each of these options can be useful, depending on the question you are trying to answer. For example, if you want to know whether the daily pattern of orders varies from region to region, showing the counts as percentages of column totals is useful so that you can compare columns. But if you want to see whether the regional ordering pattern varies by time of day, showing the counts as percentages of row totals is useful so that you can compare rows.

Figure 3.43 Rearranged Pivot Table with Two Values Fields

Morning $149.16

3878.22 3187.16 2552.89

$122.58 $134.36

$154.08

5084.57 8159.78 5941.49

$170.00 $174.75

$142.07

3835.86 5729.72 3864.12

$146.92 $143.12

$148.12

5628.66 7188.94

6475.8

$175.34 $154.19

$148.61

18427.31 24265.6 18834.3

$157.57 $154.38

$135.47 9618.27

$166.83 $144.41 $159.45 19185.84 13429.7 19293.4 61527.21

$153.82

Afternoon Evening Morning Afternoon Evening

Total Sum of Total Cost

Sum of Total Cost

Average of Total Cost

Total Average of Total Cost

TimeValues Region Midwest Northeast South West Grand Total

1 2 3 4 5 6 7 8 9

10 11 12

A B C ED F G

Counts can be displayed in a number of ways. You should choose the way that best answers the question you are asking.

Morning Afternoon Evening Grand Total

Count Time

Region Midwest Northeast South West Grand Total

1 2 3 4 5 6 7 8

26 26 19 71

33 48 34

115

27 39 27 93

38 41 42

121

124 154 122 400

A B C ED FFigure 3.44 Pivot Table with Counts

To display the counts as percentages, right-click any number in the pivot table, select Show Values As, and select the option you want. (You can also get to these options from the Show Values As tab in the Value Field Settings dialog box.) For example, if you choose % of Column Totals, the resulting pivot table and corresponding pivot chart appear in Figure 3.45. As you can see by comparing columns, the pattern of regional orders varies somewhat by time of day.

09953_ch03_ptg01_084-131.indd 117 04/03/19 12:10 PM

1 1 8 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

Sometimes it is useful to see the raw counts and the percentages. This can be done easily by dragging any variable to the Values area, summarizing it by Count, and displaying it as “Normal.” Figure 3.46 shows one possibility, where we have changed the custom names of the two Count variables to make them more meaningful. Alternatively, the counts and percentages could be shown in two separate pivot tables.

Figure 3.45 Pivot Table and Pivot Chart with Counts As Percentages of Column Totals

A B C ED F

Morning Afternoon Evening

50.00%

40.00%

30.00%

20.00%

10.00%

0.00%

Region

Time

Count

Midwest

Northeast

South

West

Morning Afternoon Evening

Count Time

Region Midwest Northeast South West Grand Total

Grand Total

1 2 3 4 5 6 7 8 9

10 11 12 13 14 15 16 17 18 19 20

36.62% 36.62% 26.76%

100.00%

29.03% 41.94% 29.03%

100.00%

31.40% 33.88% 34.71%

100.00%

31.00% 38.50% 30.50%

100.00%

28.70% 41.74% 29.57%

100.00%

Right-Clicking to Choose Options

We keep saying to make changes in the Value Field Settings dialog box. However, you can also make changes directly by right-clicking a value in the pivot table. For example, when you right-click a number in the Values area, you see Number Format, Summarize Values By, and Show Values As menu items, among others.

Pivot Table Tip

Grouping Finally, categories in a Rows or Columns variable can be grouped. This is especially useful when a Rows or Columns variable has many distinct values. Because a pivot table creates a row or column for each distinct value, the results can be unwieldy. We present two possibilities. First, suppose you want to summarize Sum of Total Cost by Date. If you have Excel 2016 or later, the dates will automatically be grouped in some way, probably by month or by quarter. (Prior to Excel 2016, they weren’t grouped at all. You saw a row for each separate date.) In any case, you can right-click any date and select Group. Then you can choose the grouping you prefer.

For example, starting with a blank pivot table, we dragged Date to the Rows area and Total Cost to the Values area to get the pivot table in Figure 3.47 (shown in compact layout). By right-clicking any of the month labels and selecting Group, the dialog box in Figure 3.48 opens. The dates are currently grouped by Month and Day, meaning that if you expand any month in column A, you will see a row for each day in that month. If you select Quarter, Month, and Day in Figure 3.48, you will see the pivot table in Figure 3.49 (after collapsing the months).

09953_ch03_ptg01_084-131.indd 118 04/03/19 12:10 PM

3-5 pivot tables 1 1 9

Figure 3.46 Pivot Table with Raw Counts and Percentages of Column Totals

MorningCount

% of column

71 100.00% 100.00% 100.00% 100.00% 100.00%

115 93 121 400

Afternoon Evening Morning Afternoon Evening

Total Count Total % of column

TimeValues Region Midwest Northeast South West Grand Total

1 2 3 4 5 6 7 8 9

10 11 12

26 26 19

40.32% 33.14% 26.54%

33 48 34

26.50% 42.53% 30.97%

27 39 27

28.56% 42.66% 28.77%

38 41 42

29.17% 37.26% 33.56%

124 154 122

29.95% 39.44% 30.61%

A B C ED F G

1 2 3 4 5 6 7 8

A B

Grand Total

Mar Apr May Jun

$102.97 $146.79 $174.89 $191.51 $153.82

Row Labels Average of Total Cost

Figure 3.47 Default Grouping on Date

Figure 3.48 Grouping Dialog Box

09953_ch03_ptg01_084-131.indd 119 04/03/19 12:10 PM

1 2 0 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

As a second possibility for grouping, suppose you want to see how the average of Total Cost varies by the amount of the highest priced item in the order. You can drag Total Cost to the Values area, summarized by Average, and drag High Item to the Rows area. Because High Item has nearly 400 distinct values, the resulting pivot table is virtually worthless. Therefore, group- ing is useful. This time there are no natural groupings as there are for a date variable, so it is up to you to create the groupings. Excel provides the suggested grouping in Figure 3.50, but you can override it. For example, changing the bottom entry to 50 leads to the pivot table in Figure 3.51. Some experimentation is typically required to obtain the grouping that presents the results in the most appropriate way.

Figure 3.49 Another Grouping on Date 1

2 3 4 5 6 7 8 9

10 11 12

A B

Mar

Apr May Jun

Qtr1

Qtr2 Qtr1 Total

Qtr2 Total

Row Labels Average of Total Cost

Grand Total

$102.97 $102.97

$146.79 $174.89 $191.51 $170.54 $153.82

Grouping on Dates

Suppose you have multiple years of data and you would like a monthly grouping such as January 2017 through December 2019. If you simply select Months in the Grouping dialog box, all of the Januaries, for example, will be lumped together. The trick is to select both Months and Years in the dialog box.

Pivot Table Tip

A 1 2 3

$72.78 $139.66 $172.71 $253.55 $324.26 $328.92 $361.53 $415.17

4 6.82-56.82 56.82-106.82 106.82-156.82 156.82-206.82 206.82-256.82 256.82-306.82 306.82-356.82 356.82-406.82

5 6 7 8 9

10 11 12

High Item Average of Total Cost

Grand Total $153.82

Figure 3.51 Pivot Table after Grouping by 50 on High Item

Figure 3.50 Grouping Dialog Box for a Non-Date Variable

09953_ch03_ptg01_084-131.indd 120 04/03/19 12:10 PM

By now, we have illustrated the pivot table features that are most commonly used. Be aware, however, that there are many more features available. These include, but are not limited to, the following:

• Showing/hiding subtotals and grand totals (check the Layout options on the Design ribbon)

• Dealing with blank rows, that is, categories with no data (right-click any number, choose PivotTable Options, and check the options on the Layout & Format tab)

• Displaying the data behind a given number in a pivot table (double-click any number in the Values area to get a new worksheet)

• Formatting a pivot table with various styles (check the style options on the Design ribbon)

• Moving or renaming pivot tables (check the PivotTable and Action groups on the Analyze ribbon)

• Refreshing pivot tables as the underlying data changes (check the Refresh dropdown list on the Analyze ribbon)

• Creating pivot table formulas for calculated fields or calculated items (check the Formulas dropdown list on the Analyze ribbon)

• Basing pivot tables on external databases (see the next chapter)

Not only are these (and other) features available, but Excel usually provides more than one way to implement them. The suggestions given here are just some of the ways they can be implemented. The key to learning pivot table features is to exper- iment. There are entire books written about pivot tables, but we don’t recommend them. You can learn a lot more, and a lot more quickly, by experimenting with data such as the Elecmart data. Don’t be afraid to mess up. Pivot tables are very forgiv- ing, and you can always start over.

How Excel Stores Pivot Table Data

When you create a pivot table, Excel stores a snapshot of your source data in memory in a pivot cache. The amount of memory depends on the size of the data source, but it can be large. Fortunately, if you create another pivot table based on the same data source, Excel is intelligent enough to use the same pivot cache, thus conserving memory. (This sharing behavior began with Excel 2010. There is a way to mimic the pre-2010 behavior, that is, to create a separate pivot cache for each pivot table, but it is not discussed here.)

Excel Tip

We complete this section by providing one last example to illustrate how pivot tables can answer business questions quickly.

EXAMPLE

3.5 FROZEN LASAGNA DINNERS The file Lasagna Triers.xlsx contains data on over 800 potential customers being tracked by a (fictional) company that has been marketing a frozen lasagna dinner. The file contains a number of demographics on these customers, as indicated in Figure 3.52: their age, weight, income, pay type, car value, credit card debt, gender, whether they live alone, dwelling type, monthly number of trips to the mall, and neighborhood. It also indicates whether they have tried the company’s frozen lasagna. The company wants to understand why some potential customers are triers and others are not. Does gender make a difference? Does income make a difference? In general, what distinguishes triers from nontriers? How can the company use pivot tables to explore these questions?

3-5 pivot tables 1 2 1

09953_ch03_ptg01_084-131.indd 121 04/03/19 12:10 PM

1 2 2 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

Objective To use pivot tables to explore which demographic variables help to distinguish lasa- gna triers from nontriers.

Solution The key is to set up a pivot table that shows counts of triers and nontriers for different categories of any of the potential explanatory variables. For example, one such pivot table shows the percentages of triers and nontriers for males and females separately. If the percentages are different for males than for females, the company will know that gender has an effect. On the other hand, if the percentages for males and females are about the same, the company will know that gender does not make much of a difference.

The typical pivot table should be set up as shown in Figure 3.53. The Rows variable is any demographic variable you want to investigate—in this case, Gender. The Columns variable is Have Tried (Yes or No). The Values variable can be any variable, as long as it is expressed as a count. Finally, it is useful to show these counts as percentage of row totals. This way you can easily look down column C to see whether the percentage in one category (Female) who have tried the product is any different from the percentage in another category (Male) who have tried the product. As you can see, males are somewhat more likely to try the product than females: 60.92% versus 54.27%. This is also apparent from the associated pivot chart.

Pivot tables, with counts in the Values area, are a great way to discover which variables have the largest effect on a Yes/No variable.

Figure 3.52 Lasagna Trier Data

1 2 3 4 5 6 7 8 9

10 11

A B C D E F G H I J K L M Person Age Weight Income Pay Type Car Value CC Debt Gender Live Alone Dwell Type Mall Trips Nbhd Have Tried

1 48 175 65500 Hourly 2190 3510 Male No Home 7 East No 2 33 202 29100 Hourly 2110 740 Female No Condo 4 East Yes 3 51 188 32200 Salaried 5140 910 Male No Condo 1 East No 4 56 244 19000 Hourly 700 1620 Female No Home 3 West No 5 28 218 81400 Salaried 26620 600 Male No Apt 3 West Yes 6 51 173 73000 Salaried 24520 950 Female No Condo 2 East No 7 44 182 66400 Salaried 10130 3500 Female Yes Condo 6 West Yes 8 29 189 46200 Salaried 10250 2860 Male No Condo 5 West Yes 9 28 200 61100 Salaried 17210 3180 Male No Condo 10 West Yes

10 29 209 9800 Salaried 2090 1270 Female Yes Apt 7 East Yes

Figure 3.53 Pivot Table and Pivot Chart for Examining the Effect of Gender

0.00%

10.00%

20.00%

30.00%

40.00%

50.00%

60.00%

70.00%

Yes

Female Male

1 2 3 4 5 6 7 8 9 10 11 12 13

A B C D E F G H I

Female Male Grand Total

45.73%

42.17% 39.08%

54.27%

57.83% 60.92%

100.00%

100.00% 100.00%

Count Gender

Have Tried No Yes Grand Total

Once this generic pivot table and associated pivot chart are set up, you can easily explore other demographic variables by swapping them for Gender. For example, Figure 3.54 indicates that people who live alone are (not surprisingly) much more likely to try this frozen microwave product than people who don’t live alone.

As another example, Figure 3.55 indicates that people with larger incomes are slightly more likely to try the product. There are two things to note about this income pivot table. First, because there are so many individual income values, grouping is useful. You can experiment with the grouping to get the most meaningful results. Second, you should be a bit skeptical about the last group, which has 100% triers. It is possible that there are only one or two people in this group. (It turns out that there are four.) For this reason, it is a good idea to create two pivot tables of the counts, one showing percentage of row totals and one showing the raw counts. This second pivot table is shown at the bottom of Figure 3.55.

09953_ch03_ptg01_084-131.indd 122 04/03/19 12:10 PM

The problem posed in this example is a common one in real business situations. One variable indicates whether people are in one group or another (triers or nontriers), and there are a lot of other variables that could potentially explain why some people are in one group and others are in the other group. There are a number of sophisticated techniques for attacking this classification problem, and some of these are discussed in Chapter 17. However, you can go a long way toward understanding which variables are important by the simple pivot table method illustrated here.

Figure 3.54 Pivot Table and Pivot Chart for Examining the Effect of Live Alone

80.00%

70.00%

60.00%

50.00%

40.00%

30.00%

20.00%

10.00%

0.00% No Yes

Yes

IHGFEDCBA 1 2 3 4 5 6 7 8 9

10 11 12 13

Count Live Alone

Have Tried No Yes Grand Total

100.00% 100.00% 100.00%

45.57% 25.52% 42.17%

No Yes Grand Total

54.43% 74.48% 57.83%

Figure 3.55 Pivot Table and Pivot Chart for Examining the Effect of Income

IHGFEDCBA 1 2 3 4 5 6 7 8 9

10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Count Income

Have Tried No Yes Grand Total

0-49999 50000-99999 100000-149999 150000-199999 Grand Total

Count Income

Have Tried No Yes Grand Total

0-49999 50000-99999 100000-149999 150000-199999 Grand Total

307 152

32 4

495

563 243

46 4

856

256 91 14

361

45.47% 37.45% 30.43%

0.00% 42.17%

54.53% 62.55% 69.57%

100.00% 57.83%

Count

Income

Have Tried

Count

Income

Have Tried

120.00% 100.00%

80.00% 60.00% 40.00% 20.00%

0.00%

0-49999

50000-99999

100000-149999

150000-199999

Yes

0-49999

400 300 200 100

50000-99999

100000-...

150000-... No

Yes

100.00% 100.00% 100.00% 100.00% 100.00%

3-5 pivot tables 1 2 3

09953_ch03_ptg01_084-131.indd 123 04/03/19 12:10 PM

1 2 4 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

41. The Wall Street Journal CEO Compensation Study analyzed CEO pay from many U.S. companies with fiscal year 2008 revenue of at least $5 billion that filed their proxy statements between October 2008 and March 2009. The data are in the file P02_30.xlsx. a. Create a pivot table and a corresponding pivot chart

that simultaneously shows average of Salary 2008 and average of Bonus 2008, broken down by Company Type. Comment on any striking results in the chart.

b. In the Data sheet, create a new column, Total 2008, which is the sum of Salary 2008 and Bonus 2008. Then create two pivot tables and corresponding pivot charts on a single sheet. The first should show the counts of CEOs broken down by Company Type, and the second should simultaneously show the average of Total 2008, the minimum of Total 2008, and the maximum of Total 2008, all broken down by Company Type. Comment on any striking results in these charts.

42. One pivot table element we didn’t explain is a calcu- lated item. This is usually a new category for some categorical variable that is created from existing cate- gories. It is easiest to learn from an example. Open the file Elecmart Sales.xlsx from this section, create a pivot table, and put Day in the Rows area. Proceed as follows to create two new categories, Weekday and Weekend. a. Select any day and select Calculated Item from the

Formulas dropdown list on the PivotTable Tools Options ribbon. This will open a dialog box. Enter Weekend in the Name box and enter the formula =Sat 1 Sun in the formula box. (You can dou- ble-click the items in the Items list to help build this formula.) When you click OK, you will see Weekend in the pivot table.

b. Do it yourself. Create another calculated item, Week- day, for Mon through Fri.

c. Filter out all of the individual days from the row area, so that only Weekday and Weekend remain, and then find the sum of Total Cost for these two new cate- gories. How can you check whether these sums are what you think they should be? (Notes about calcu- lated items: First, if you have Weekend, Weekday, and some individual days showing in the Rows area, the sum of Total Cost will double-count these individual days, so be careful about this. Second, be aware that if you create a calculated item from some variable such as Day, you are no longer allowed to drag that vari- able to the Filters area.)

43. Building on the previous problem, another pivot table element we didn’t explain is a calculated field. This is usually a new numerical variable built from numerical variables that can be summarized in the Values area. It acts somewhat like a new column in the spreadsheet data, but there is an important difference. Again, it is easiest to learn from an example. Open the file Elecmart Sales.xlsx and follow the instructions below.

Problems Solutions for problems whose numbers appear within a colored box can be found in the Student Solution Files.

Level A 33. Solve problem 1 with pivot tables and create correspond-

ing pivot charts. Express the counts as percentages of row totals. What do these percentages indicate about this particular data set? Then repeat, expressing the counts as percentages of column totals.

34. Solve problem 2 with pivot tables and create corre- sponding pivot charts. Express the counts as percent- ages of row totals. What do these percentages indicate about this particular data set? Then repeat, expressing the counts as percentages of column totals.

35. Solve problem 3 with pivot tables and create correspond- ing pivot charts. Express the counts as percentages of row totals. What do these percentages indicate about this particular data set? Then repeat, expressing the counts as percentages of columns totals.

36. Solve problem 4 with pivot tables and create corre- sponding pivot charts. Express the counts as percentage of row totals. What do these percentages indicate about this particular data set? Then repeat, expressing the counts as percentages of column totals.

37. Solve problem 7 with pivot tables and create correspond- ing pivot charts. However, find only means and standard deviations, not medians or quartiles. (This is one draw- back of pivot tables. Medians, quartiles, and percentiles are not in the list of summary measures.)

38. Solve problem 8 with pivot tables and create corre- sponding pivot charts. However, find only means and standard deviations, not medians. (This is one drawback of pivot tables. Medians are not among their summary measures.)

39. Solve problem 9 with pivot tables and create corre- sponding pivot charts. However, find only means and standard deviations, not medians. (This is one drawback of pivot tables. Medians are not among their summary measures.)

40. The file P03_40.xlsx contains monthly data on the num- ber of vehicles crossing the border from Mexico into four southwestern states. a. Restructure this data set on a new sheet so that there

are three long columns: Month, State, and Crossings. Essentially, you should stack the original columns B through E on top of one another to get the Crossings column, and you should also indicate which state each row corresponds to in the State column. The Month column should have four replicas of the original Month column.

b. Create a pivot table and corresponding pivot table chart based on the restructured data. It should break down the average of Crossings by Year and State. Comment on any patterns you see in the chart.

09953_ch03_ptg01_084-131.indd 124 04/03/19 12:10 PM

3-5 pivot tables 1 2 5

a. Create a new column in the data, CostPerItem, which is Total Cost divided by Items Ordered. Then create a pivot table and find the average of CostPerItem, bro- ken down by Region. Explain exactly how this value was calculated. Would such an average be of much interest to a manager at Elecmart? Why or why not?

b. Select any average in the pivot table and then select Calculated Field from the Formulas dropdown list on the Analyze ribbon. This will open a dialog box. Enter CF_CostPerItem in the name box (we added CF, for calculated field, because we are not allowed to use the CostPerItem name that already exists), enter the for- mula =TotalCost/ItemsOrdered, and click OK. You should now see a new column in the pivot table, Sum of CF_CostPerItem, with different values than in the Average of CostPerItem column. Do some investiga- tion to understand how each “sum” was calculated. From a manager’s point of view, does it make any sense? (Note on calculated fields: When you summa- rize a calculated field, it doesn’t matter whether you express it as sum, average, max, or any other sum- mary measure. It is calculated in exactly the same way in each case.)

44. The file P02_18.xlsx contains daily values of the S&P Index from 1970 to mid-2015. It also contains percent- age changes in the index from each day to the next. Cre- ate a pivot table with average of % Change in the Values area and Date in the Rows area. You will see every single date, with no real averaging taking place. This problem lets you explore how you can group naturally on a date variable. For each part below, explain the result briefly. (Note that if you are using Excel 2016, grouping by date will be done automatically.) a. Group by Month. b. Group by Year. c. Group by Month and Year (select both in the Group

dialog box). Can you make it show the year averages from part b?

d. Group by Quarter. e. Group by Month and Quarter. Can you make it show

the averages from part c? f. Group by Quarter and Year. g. Group by Month, Quarter, and Year.

45. Using the Elecmart Sales.xlsx file from this section, experiment with slicers as follows. a. Create a pivot table that shows the average of Total

Cost, broken down by Region in the Rows area and Time in the Columns area. Then insert two slicers, one for Region and one for Time. Select the West and Northeast buttons on the Region slicer and the Morning and Afternoon buttons on the Time slicer. Explain what happens in the pivot table.

b. Create a pivot table that shows the average of Total Cost, broken down by Region in the Rows area and

Time in the Columns area. Insert a Day slicer and select the Sat and Sun buttons. Explain what aver- ages are now showing in the pivot table. Verify this by deleting the slicer and placing Days in the Filters area, with Sat and Sun selected.

46. We used the Lasagna Triers.xlsx file in this section to show how pivot tables can help explain which variables are related to the buying behavior of customers. Illus- trate how the same information could be obtained with slicers. Specifically, set up the pivot table as in the exam- ple, but use a slicer instead of a Rows variable. Then set it up exactly as in the example, with a Rows variable, but include a slicer for some other variable. Comment on the type of results you obtain with these two versions. Do slicers appear to provide any advantage in this type of problem?

Level B 47. Solve problem 5 with pivot tables and create corre-

sponding pivot charts. If you first find the quartiles of Salary and Amount Spent (by any method), is it possi- ble to create the desired crosstabs by grouping, without recoding these variables?

48. Solve problem 17 with pivot tables. However, find only means and standard deviations, not medians. This is one drawback of pivot tables. Medians are not among their summary measures. Can you think of a way to calculate medians by category?

49. The file P03_22.xlsx lists financial data on movies released from 1980 to 2011 with budgets of at least $20 million. a. Create three new variables, Ratio1, Ratio2, and

Decade. Ratio1 should be US Gross divided by Budget, Ratio2 should be Worldwide Gross divided by Budget, and Decade should list 1980s, 1990s, or 2000s, depending on the year of the release date. If either US Gross or Worldwide Gross is listed as “Unknown,” the corresponding ratio should be blank. (Hint: For Decade, use the YEAR function to fill in a new Year column. Then use a lookup table to populate the Decade column.)

b. Use a pivot table to find counts of movies by various distributors. Then go back to the data and create one more column, Distributor New, which lists the distrib- utor for distributors with at least 30 movies and lists Other for the rest. (Hint: Use a lookup table to popu- late Distributor New, but also use an IF to fill in Other where the distributor is missing.)

c. Create a pivot table and corresponding pivot chart that shows average and standard deviation of Ratio1, broken down by Distributor New, with Decade in the Filters area. Comment on any striking results.

d. Repeat part c for Ratio2.

09953_ch03_ptg01_084-131.indd 125 04/03/19 12:10 PM

1 2 6 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

50. The file P03_50.xlsx lists NBA salaries for five seasons. (Each NBA season straddles two calendar years.) a. Merge all of the data into a single new sheet called All

Data. In this new sheet, add a new column Season that lists the season, such as 2006–2007.

b. Note that many of the players list a position such as C–F or F–C. Presumably, the first means the player is primarily a center but sometimes plays forward, whereas the second means the opposite. Recode these so that only the primary position remains (C in the first case, F in the second). To complicate matters further, the source lists positions differently in 2007–2008 than in other years. It lists PG and SG (point guard and shooting guard) instead of just G, and it lists SF and PF (small forward and power forward) instead of just F. Recode the positions for this season to be consistent with the other seasons (so that there are only three positions: G, F, and C).

c. Note that many players have (p) or (t) in their Con- tract Thru value. The Source sheet explains this. Create two new columns in the All Data sheet, Years Remaining and Option. The Years Remaining column should list the years remaining in the contract. For

example, if the season is 2004–2005 and the contract is through 2006–2007, years remaining should be 2. The Option column should list Player if there is a (p), Team if there is a (t), and blank if neither.

d. Use a pivot table to find the average Salary by Season. Change it to show average Salary by Team. Change it to show average Salary by Season and Team. Change it to show average Salary by Primary Position. Change it to show average Salary by Team and Primary Position, with filters for Season, Contract Years, Years Remaining, and Option. Comment on any striking findings.

51. The file P02_29.xlsx contain monthly percentages of on-time arrivals at several of the largest U.S. airports. a. Explain why the current format of either data set lim-

its the kind of information you can obtain with a pivot table. For example, does it allow you find the average on-time arrival percentage by year for any selected subset of airports, such as the average for O’Hare, Los Angeles International, and La Guardia?

b. Restructure the data appropriately and then use a pivot table to answer the specific question in part a.

3-6 Conclusion Finding relationships among variables is arguably the most important task in data analysis. This chapter has equipped you with some very powerful tools for detecting relationships. As we have discussed, the tools vary depending on whether the variables are categorical or numerical. (Again, refer to the diagram in the Data Analysis Taxonomy.xlsx file.) Tables and charts of counts are useful for relationships among categorical variables. Summary measures broken down by categories and side-by- side box plots are useful for finding relationships between a categorical and a numerical variable. Scatterplots and correlations are useful for finding relationships among numerical variables. Finally, pivot tables are useful for all types of variables.

Summary of Key Terms TERM EXPLANATION EXCEL PAGES EQUATION Crosstabs (or contingency table)

Table of counts of joint categories of two categorical variables

COUNTIFS function or pivot table

Comparison problem Comparing a numeric variable across two or more subpopulations

Stacked or unstacked data formats

Stacked means long columns, one for categories and another for values; unstacked means a separate values column for each category

Scatterplot (or X-Y chart)

Chart for detecting a relationship between two numeric variables; one point for each observation

Scatter from Insert ribbon

trend line Line or curve fit to scatterplot (or time series graph)

Right-click on chart point, select Add Trendline

09953_ch03_ptg01_084-131.indd 126 04/03/19 12:10 PM

3-6 Conclusion 1 2 7

Problems

Conceptual Questions C.1. When you are trying to discover whether there is a

relationship between two categorical variables, why is it useful to transform the counts in a crosstabs to per- centages of row or column totals? Once you do this, how can you tell if the variables are related?

C.2. Suppose you have a crosstabs of two “Yes/No” cate- gorical variables, with the counts shown as percent- ages of row totals. What will these percentages look like if there is absolutely no relationship between the variables? Besides this case, list all possible types of relationships that could occur. (There aren’t many.)

C.3. If you suspect that a company’s advertising expendi- tures in a given month affect its sales in future months, what correlations would you look at to confirm your suspicions? How would you find them?

C.4. Suppose you have customer data on whether they have bought your product in a given time period, along with various demographics on the custom- ers. Explain how you could use pivot tables to see which demographics are the primary drivers of their “yes/no” buying behavior.

C.5. Suppose you have data on student achievement in high school for each of many school districts. In spreadsheet format, the school district is in column A, and various student achievement measures are in columns B, C, and so on. If you find fairly low correlations (magni- tudes from 0 to 0.4, say) between the variables in these achievement columns, what exactly does this mean?

C.6. In the final round of most professional golf tourna- ments, two players play together in a “twosome.” It often appears that good play or bad play is contagious. That is, if one player in the twosome plays well, the other plays well, and if one player plays badly, the other player plays badly. What data would you collect,

and how would you use the data, to check whether this “contagious” property is true?

C.7. Suppose you have a large data set for some sport. Each row might correspond to a particular team (as in the file P03_57.xlsx on football outcomes, for example) or it might even correspond to a given play. Each row contains one or more measures of success as well as many pieces of data that could be drivers of success. How might you find the most important drivers of success if the success measure is categorical (such as Win or Lose)? How might you find the most important drivers of success if the success measure is numerical and basically continuous (such as Points Scored in basketball)?

C.8. If two variables are highly correlated, does this imply that changes in one cause changes in the other? If not, give at least one example from the real world that illus- trates what else could cause a high correlation.

C.9. Suppose there are two commodities A and B with strongly negatively correlated daily returns, such as a stock and gold. Is it possible to find another commod- ity with daily returns that are strongly negatively cor- related with both A and B?

C.10. In checking whether several times series, such as monthly exchange rates of various currencies, move together, why do most analysts look at correlations between their differences rather than correlations between the original series?

Solutions for problems whose numbers appear within a colored box can be found in the Student Solution Files.

Level A 52. Suppose you would like to create a separate table of cor-

relations for each category of a categorical variable. The only alternative, at least with the software you have, is to sort on the categorical variable, insert some blank rows between values of different categories, copy the headings

TERM EXPLANATION EXCEL PAGES EQUATION

Covariance Measure of linear relationship between two numeric variables, but affected by units of measurement

COVAR function 101 3.1

Correlation Measure of linear relationship between two numeric variables, always from 21 to 11

CORREL function 102 3.2

pivot table Table for breaking down data by category; can show counts, averages, or other summary measures

PivotTable from Insert ribbon

108

pivot chart Chart corresponding to a pivot table PivotTable Tools Analyze ribbon

117

Slicers, timelines Graphical elements for filtering in pivot tables

PivotTable Tools Analyze ribbon

127

Key Terms (continued)

09953_ch03_ptg01_084-131.indd 127 04/03/19 12:10 PM

1 2 8 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

to each section, and then ask for correlations from each. Do this with the movie data in the file P03_25.xlsx. Specifically, separate the data into three data sets based on Genre: one for Comedy, one for Drama, and one for all the rest. For this problem, you can ignore the third group. For each of Comedy and Drama, create a table of correlations between 7-day Gross, 14-day Gross, Total US Gross, International Gross, and US DVD Sales. Comment on whether the correlation structure is much different for these two popular genres.

53. The file P03_53.xlsx lists campaign contributions, by number of contributors and contribution amount, by state (including Washington DC) for the four leading con- tenders in the 2008 presidential race. Create a scatterplot and corresponding correlation between Dollar Amount (Y axis) and Contributors for each of the four contend- ers. For each scatterplot, superimpose a linear trend line and show the corresponding equation. Interpret each equation and compare them across candidates. Finally, identify the state for any points that aren’t on or very close to the corresponding trend line.

54. The file P03_54.xlsx lists data for 593 movies released in 2011. Obviously, some movies are simply more pop- ular than others, but success in 2011, measured by 2011 gross or 2011 tickets sold, could also be influenced by the release date. To check this, create a new variable, Days Out, which is the number of days the movie was out during 2011. For example, a movie released on 12/15 would have Days Out equal to 17 (which includes the release day). Create two scatterplots and correspond- ing correlations, one of 2011 Gross (Y axis) versus Days Out and one of 2011 Tickets Sold (Y axis) ver- sus Days Out. Describe the behavior you see. Do you think a movie’s success can be predicted very well just by knowing how many days it has been out?

55. The file P03_55.xlsx lists the average salary for each MLB team from 2004 to 2011, along with the number of team wins in each of these years. a. Create a table of correlations between the Wins col-

umns. What do these correlations indicate? Are they higher or lower than you expected?

b. Create a table of correlations between the Salary col- umns. What do these correlations indicate? Are they higher or lower than you expected?

c. For each year, create a scatterplot and the associated correlations between Wins for that year (Y axis) and Salary for that year. Does it appear that teams are buy- ing their way to success?

d. The coloring in the Wins columns indicates the play- off teams. Create a new Yes/No column for each year, indicating whether the team made it to the playoffs that year. Then create a pivot table for each year showing average of Salary for that year, broken down by the Yes/ No column for that year. Do these pivot tables indicate that teams are buying their way into the playoffs?

56. The file P03_56.xlsx lists the average salary for each NBA team from the 2004–2005 season to the

2009–2010 season, along with the number of team wins each of these years. Answer the same questions as in the previous problem for this basketball data.

57. The file P03_57.xlsx lists the average salary for each NFL team from 2002 to 2009, along with the number of team wins each of these years. Answer the same ques- tions as in problem 55 for this football data.

58. The file P03_58.xlsx lists salaries of MLB players in the years 2007 to 2009. Each row corresponds to a partic- ular player. As indicated by blank salaries, some play- ers played in one of these years, some played in two of these years, and the rest played in all three years. a. Create a new Yes/No variable, All 3 Years, that indi-

cates which players played all three years. b. Create two pivot tables and corresponding pivot

charts. The first should show the count of players by position who played all three years. The sec- ond should show the average salary each year, by position, for all players who played all three years. (For each of these, put the All 3 Years variable in the Filters area.) Explain briefly what these two pivot tables indicate.

c. Consider the data set consisting of only the players who played all three years. Using this data set, cre- ate a table of correlations of the three salary variables. What do these correlations indicate about player salaries?

59. The file P03_59.xlsx lists the results of about 20,000 runners in the 2008 New York Marathon. a. For all runners who finished in 3.5 hours or less, create

a pivot table and corresponding pivot chart of average of Time by Gender. (To get a fairer com-parison in the chart, change it so that the vertical axis starts at zero.) For the same runners, and on the same sheet, create another pivot table and pivot chart of counts by Gender. Comment on the results.

b. For all runners who finished in 3.5 hours or less, create a pivot table and corresponding pivot chart of average of Time by Age. Group by Age so that the teens are in one category, those in their twenties are in another category, and so on. For the same runners, and on the same sheet, create another pivot table and pivot chart of counts of these age groups. Comment on the results.

c. For all runners who finished in 3.5 hours or less, create a single pivot table of average of Time and of counts, broken down by Country. Then filter so that only the 10 countries with the 10 lowest average times appear. Finally, sort on average times so that the fastest countries rise to the top. Guess who the top two are! (Hint: Try the Value Filters for the Country variable.) Comment on the results.

60. The file P02_12.xlsx includes data on the 50 top grad- uate programs in the United States, according to a U.S. News & World Report survey. a. Create a table of correlations between all of the

numerical variables. Discuss which variables are highly correlated with which others.

09953_ch03_ptg01_084-131.indd 128 04/03/19 12:10 PM

3-6 Conclusion 1 2 9

b. The Overall score is the score schools agonize about. Create a scatterplot and corresponding correlation of each of the other variables versus Overall, with Over- all always on the Y axis. What do you learn from these scatterplots?

61. Recall from an example in the previous chapter that the file Supermarket Transactions.xlsx contains over 14,000 transactions made by supermarket customers over a period of approximately two years. Set up a sin- gle pivot table and corresponding pivot chart, with some instructions to a user (like the supermarket manager) in a text box, on how the user can get answers to any typical question about the data. For example, one possibility (of many) could be total revenue by product department and month, for any combination of gender, marital status, and homeowner. (The point is to get you to explain pivot table basics to a nontechnical user.)

62. The file P03_15.xlsx contains monthly data on the vari- ous components of the Consumer Price Index. a. Use Excel formulas to create differences for each of

the variables. b. Create a times series graph for each CPI component,

including the All Items component. Then create a time series graph for each difference variable. Comment on any patterns or trends you see.

c. Create a table of correlations between the differences. Comment on any large correlations (or the lack of them).

d. Create a scatterplot for each difference variable versus the difference for All Items (Y axis). Comment on any patterns or outliers you see.

Level B 63. The file P03_63.xlsx contains financial data on 85

U.S. companies in the Computer and Electronic Prod- uct Manufacturing sector (NAICS code 334) with 2009 earnings before taxes of at least $10,000. Each of these companies listed R&D (research and development) expenses on its income statement. Create a table of cor- relations between all of the variables and use conditional formatting to color green all correlations involving R&D that are strongly positive or negative. (Use cutoff values of your choice to define “strongly.”) Then create scatterplots of R&D (Y axis) versus each of the other most highly correlated variables. Comment on any pat- terns you see in these scatterplots, including any obvi- ous outliers, and explain why (or if) it makes sense that these variables are highly correlated with R&D. If there are highly correlated variables with R&D, can you tell which way the causality goes?

64. The file P03_64.xlsx lists monthly data since 1950 on the well-known Dow Jones Industrial Average (DJIA), as well as the less well-known Dow Jones Transporta- tion Average (DJTA) and Dow Jones Utilities Average (DJUA). Each of these is an index based on 20 to 30 leading companies (which change over time).

a. Create monthly differences in three new columns. The Jan-50 values will be blank because there are no Dec-49 values. Then, for example, the Feb-50 differ- ence is the Feb-50 value minus the Jan-50 value.

b. Create a table of correlations of the three difference columns. Does it appear that the three Dow indexes tend to move together through time?

c. It is possible (and has been claimed) that one of the indexes is a “leading indicator” of another. For exam- ple, a change in the DJUA in September might predict a similar change in the DJIA in the following Decem- ber. To check for such behavior, create four “lags” of the difference variables. Do this for each of the three difference variables. You should end up with 12 lag variables. Explain in words what these lag variables contain. For example, what is the Dec-50 lag 3 of the DJIA difference?

d. Create a table of correlations of the three differences and the 12 lags. Use conditional formatting to color green all correlations greater than 0.5 (or any other cutoff you choose). Does it appear that any index is indeed a leading indicator of any other? Explain.

65. The file P03_65.xlsx lists a lot of data for each NBA team for the seasons 2004–2005 to 2008–2009. The variables are divided into groups: (1) Overall success, (2) Offensive, and (3) Defensive. The basic question all basketball fans (and coaches) ponder is what causes suc- cess or failure. a. Explore this question by creating a correlation matrix

with the variable Wins (the measure of success) and all of the variables in groups (2) and (3). Based on these correlations, which five variables appear to be the best predictors of success? (Keep in mind that negative correlations can also be important.)

b. Explore this question in a different way, using the Playoff Team column as a measure of success. Here, it makes sense to proceed as in the Lasagna Triers example in Section 3-5, using the variables in groups (2) and (3) as the predictors. However, these predic- tors are all basically continuous, so grouping would be required for all of them in the pivot table, and grouping is always somewhat arbitrary. Instead, create a copy of the Data sheet. Then for each variable in groups (2) to (13), create a formula that returns 1, 2, 3, or 4, depend- ing on which quarter of that variable the value falls in (1 if it is less than or equal to the first quartile, and so on). (This sounds like a lot of work, but a single copy- able formula will work for the entire range.) Now use these discrete variables as predictors and proceed as in the Lasagna Triers example. List the five variables that appear to be the best (or at least good) predictors of making the playoffs.

66. The file P03_66.xlsx lists a lot of data for each NFL team for the years 2004 to 2009. The variables are divided into groups: (1) Overall success, (2) Team Offense, (3) Pass- ing Offense, (4) Rushing Offense, (5) Turnovers Against,

09953_ch03_ptg01_084-131.indd 129 04/03/19 12:10 PM

1 3 0 C h a p t e r 3 F i n d i n g r e l a t i o n s h i p s a m o n g V a r i a b l e s

(6) Punt Returns, (7) Kick Returns, (8) Field Goals, (9) Punts, (10) Team Defense, (11) Passing Defense, (12) Rushing Defense, and (13) Turnovers Caused. The basic question all football fans (and coaches) ponder is what causes success or failure. Answer the same ques- tions as in the previous problem for this football data, but use all of the variables in groups (2) to (13) as possi- ble predictors.

67. The file P02_57.xlsx contains data on mortgage loans in 2008 for each state in the United States. The file is different from others in this chapter in that each state has its own sheet with the same data in the same format. Each state sheet breaks down all mortgage applications by loan purpose, applicant race, loan type, outcome, and denial reason (for those that were denied). The question is how a single data set for all states can be created for analysis. The Typical Data Set sheet indicates a simple way of doing this, using the powerful but little-known

INDIRECT function. This sheet is basically a template for bringing in any pieces of data from the state sheets you would like to examine. a. Do whatever it takes to populate the Typical Data

Set sheet with information in the range B7:D11 and B14:D14 (18 variables in all) of each state sheet. Add appropriate labels in row 3, such as Asian Dollar Amount Applied For.

b. Create a table of correlations between these variables. Color yellow all correlations between a given appli- cant race, such as those between Asian Mortgage Application, Asian Dollar Amount Applied For, and Asian Average Income. Comment on the magnitudes of these. Are there any surprises?

c. Create scatterplots of White Dollar Amount Applied For (X axis) versus the similar variable for each of the other five applicant races. Comment on any patterns in these scatterplots, and identify any obvious outliers.

CASE 3.1 Customer Arrivals at Bank98 Bank98 operates a main location and three branch locations in a medium-size city. All four locations perform similar services, and customers typically do business at the location nearest them. The bank has recently had more congestion— longer waiting lines—than it (or its customers) would like. As part of a study to learn the causes of these long lines and to suggest possible solutions, all locations have kept track of customer arrivals during one-hour intervals for the past 10 weeks. All branches are open Monday through Friday from 9 a.m. until 5 p.m. and on Saturday from 9 a.m. until noon. For each location, the file C03_01.xlsx contains the number of customer arrivals during each hour of a 10-week period.

The manager of Bank98 has hired you to make some sense of these data. Specifically, your task is to present charts and/ or tables that indicate how customer traffic into the bank locations varies by day of week and hour of day. There is also interest in whether any daily or hourly patterns you observe are stable across weeks. Although you don’t have full information about the way the bank currently runs its operations—you know only its customer arrival pattern and the fact that it is currently experiencing long lines—you are encouraged to append any suggestions for improving opera- tions, based on your analysis of the data.

CASE 3.2 Saving, Spending, and Social Climbing The best-selling book The Millionaire Next Door by Thomas J. Stanley and William D. Danko (Longstreet Press, 1996) presents some very interesting data on the characteristics of millionaires. We tend to believe that people with expen- sive houses, expensive cars, expensive clothes, country club memberships, and other outward indications of wealth are the millionaires. The authors define wealth, however, in terms of savings and investments, not consumer items. In this sense, they argue that people with a lot of expensive things and even large incomes often have surprisingly little wealth. These people tend to spend much of what they make on consumer items, often trying to keep up with, or impress, their peers.

In contrast, the real millionaires, in terms of savings and investments, frequently come from “unglamorous”

professions (particularly teaching), own unpretentious homes and cars, dress in inexpensive clothes, and otherwise lead rather ordinary lives.

Consider the (fictional) data in the file C03_02.xlsx. For several hundred couples, it lists their education level, their annual combined salary, the market value of their home and cars, the amount of savings they have accumulated (in sav- ings accounts, stocks, retirement accounts, and so on), and a self-reported “social climber index” on a scale of 1 to 10 (with 1 being very unconcerned about social status and material items and 10 being very concerned about these). Prepare a report based on these data, supported by relevant charts and/or tables, that could be used in a book such as The Millionaire Next Door. Your conclusions can either support or contradict those of Stanley and Danko.

09953_ch03_ptg01_084-131.indd 130 04/03/19 12:10 PM

3-6 Conclusion 1 3 1

CASE 3.3 Churn in the Cellular Phone Market The term churn is very important to managers in the cellu- lar phone business. Churning occurs when a customer stops using one company’s service and switches to another com- pany’s service. Obviously, managers try to keep churning to a minimum, not only by offering the best possible service, but by trying to identify conditions that lead to churning and taking steps to stop churning before it occurs. For exam- ple, if a company learns that customers tend to churn at the end of their two-year contract, they could offer customers an incentive to stay a month or two before the end of their two-year contract. The file C03_03.xlsx contains data on

over 2000 customers of a particular cellular phone company. Each row contains the activity of a particular customer for a given time period, and the last column indicates whether the customer churned during this time period. Use the tools in this chapter (and possibly the previous chapter) to learn (1) how these variables are distributed, (2) how the variables in columns B–R are related to each other, and (3) how the variables in columns B–R are related to the Churn variable in column S. Write a short report of your findings, including any recommendations you would make to the company to reduce churn.

CASE 3.4 Southwest Border Apprehensions and Unemployment

Illegal immigration across the southwest border of the United States has been a hotly debated topic in recent years. Many people argue that illegal immigration is related to the U.S. unemployment rate: when the unemployment rate is up, there are more attempts to cross the border illegally. The file C03_04.xlsx contains data on the number of apprehensions

on the southwest border and the U.S. unemployment rate. There are both yearly data since 1981 and monthly data since 2000. Using these data, make an argument, one way or the other, for whether apprehensions are related to the unem- ployment rate.

APPENDIX Using StatTools to Find Relationships The StatTools add-in from Palisade can be used to speed up several procedures discussed in this chapter:

• It can calculate a table of summary measures of a numeric variable, broken down by a categorical variable. (This is illustrated in the Introduction to StatTools video that accompanies the book.)

• It can create one or more scatterplots from its list of Sum- mary Graphs on the StatTools ribbon. It can even color the

points in a scatterplot based on the category of a categorical variable.

• It can create a table of correlations (or covariances) from the Summary Statistics dropdown list on the StatTools ribbon.

Note, however, that StatTools doesn’t do pivot tables. There is no need because Excel does them very nicely.

09953_ch03_ptg01_084-131.indd 131 04/03/19 12:10 PM