Introducing Data

jroero
2.ExploringandVisualizingData.pdf

1/31/2021 QNT/275T: Statistics for Decision Making home

https://learn.zybooks.com/zybook/QNT_275T_54402574/chapter/2/print 1/37

2.1 What is data?

Data

Data is information, especially facts or numbers, usually collected or computed for purposes of analysis. Ex: The world population was about 300 million in the year 1000, 500 million in 1500, and 7 billion in 2000. Ex: Analysis of the population data suggests the world's population is growing rapidly, which may influence various decisions like use of natural resources.

Datum vs. data

Historically, datum is defined as a single item, and data as the plural of datum. Language evolves, however, and the use of the term datum is diminishing. This material follows the increasingly common usage of the term data for both the singular and plural.

The amount of collected data has grown tremendously. A first reason is that computers became ubiquitous around the 1980s and can easily record data. More recently, the World Wide Web became ubiquitous in the early 2000s, transforming how people do business, communicate, and recreate, in ways such that data is easily recorded and analyzed. Smartphones and tablets of the 2010s provide nearly continuous computer/web access. Plus, numerous items like streets, cars, and buildings have recently been equipped with sensors and cameras, allowing for more data collection. Some estimates are that 90% of all data ever collected was generated in just the past couple of years.

The figure below shows the worldwide data collected per year, in zettabytes. A zettabyte is one sextillion, or 10^21, bytes. The table below lists common sources of data.

Figure 2.1.1: Worldwide data collected per year.

Table 2.1.1: Common sources of data.

Social networks (human-generated data): social networks (Facebook, Twitter, etc.); blogs and comments; personal documents; pictures (Instagram, Snapchat, etc.); videos (YouTube, etc.); Internet searches; mobile data (text messages); user-generated maps; e-mail.

Traditional business systems:
  Data produced by public agencies: medical records.
  Data produced by businesses: commercial transactions; banking/stock records; e-commerce; credit cards.

Internet of things:
  Data from sensors, fixed: home automation; weather/pollution sensors; traffic sensors/webcams; scientific sensors; security/surveillance videos/images.
  Data from sensors, mobile (tracking): mobile phone location; cars; satellite images.
  Data from computer systems: logs; web logs.

Source: United Nations Statistics Division, 2015 (*1)

PARTICIPATION ACTIVITY 2.1.1: Data.

1) The amount of data collected worldwide in 2016 is about ___ zettabytes.
  10
  100

©zyBooks 01/31/21 11:33 922949 Julio Romero

QNT_275T_54402574


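2) A zettabyte is _____ bytes.
  one trillion
  1,000 trillion
  one sextillion

3) Which was a substantial source of computerized data before the year 1990?
  Social network data (like Facebook)
  Internet search data
  Medical records
  None of the above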

Data analytics

The abundance of collected data provides new opportunities for analysis. Data analytics is the field of analyzing data to gain insight, draw conclusions, or make decisions.

With so much data being collected today, one can imagine that data analytics is a growing field with increasing job opportunities. Big data is a term commonly used to refer to data analytics on large amounts of data, which is the form in which much data exists today. Big data refers to very large data sets that cannot be processed by traditional methods, and is characterized by high volume, rapid velocity of collection, and variety in type and quality. Articles summarizing jobs in big data are abundant, and summaries and predictions both describe large increases in job opportunities, such as this 2014 Forbes article on big data jobs.

Below are real-world applications of data analysis.

Example 2.1.1: Data analysis catches cheating teachers.

Standardized exams are commonly given to students in public schools. The average scores for a teacher's students are commonly used to evaluate a teacher or a school. A researcher performed data analysis to detect whether some teachers were cheating. For example, if a particular teacher's students answered the last 10 or so questions correctly more frequently than another teacher's students, one might assume that the teacher filled in those last questions (correctly) for students who didn't complete the exams. Or, if a teacher's students did well above average one year, but those same students performed below average the year before and the year after, one might assume that the teacher gave students the answers.

In the book Freakonomics, Steven Levitt described analyses he performed on several years of exam data from Chicago public schools. He found that at least 5% of teachers were cheating. As a result of his data analysis, several teachers were fired, and cheating subsequently decreased.

Levitt's paper describing the analysis (*2)
Two-minute video of Levitt discussing the analysis

Example 2.1.2: Sports and data analytics.

In the early 2000s, the Oakland Athletics had one of the smallest budgets in professional baseball. The team leaders used data analytics to gain an edge. Traditionally, baseball players were sought based on widely known factors like a player's batting average (how often the player got a hit), runs batted in (how many runs the player caused by making a hit), and similar numbers. Instead, through data analysis, the team leaders found that less-popular factors, like on-base percentage, were more important. The team thus hired players strong in those less-advertised factors and paid such players lower salaries because they were not in high demand. The technique worked, and the Oakland Athletics made the playoffs in both 2002 and 2003, despite having nearly the lowest salaries in the league.

This real story is the basis of the popular movie Moneyball starring Brad Pitt. Many teams, in baseball and other sports, have since adopted such data-analytic techniques. With the advent of computers, and thus more recording of data and more ability to analyze such data, data-analytic techniques are used in more arenas to gain insight and achieve better results. Ex: Online dating sites, stock market investing, language translation, and much more.

Source: Oakland A's stadium (Travis Wise / CC-BY-SA-2.0 via Flickr) (*3)

2002 playoff teams (in red above): NY Yankees, Anaheim Angels, Oakland Athletics, Minnesota Twins, Atlanta Braves, San Francisco Giants, Arizona Diamondbacks, St. Louis Cardinals.

Types of data analytics

Three types of data analytics exist:

Descriptive data analytics seeks to describe data, providing insight and knowledge. Ex: Based on collected data, the world population in 2015 is about 7 billion.

Predictive data analytics seeks to make predictions from data. Ex: Using models based on birth rates, death rates, medical care improvements, and other data, the United Nations predicts the world population will reach 11.2 billion in 2100.

Prescriptive data analytics seeks to make decisions (prescriptions) based on data. Ex: Population predictions for specific countries help the United Nations decide where to focus agricultural development efforts.

PARTICIPATION ACTIVITY 2.1.2: Worldwide population.

Animation captions:

1. Descriptive analytics: Describes the data, perhaps to provide insight. Ex: From census and other data, the U.N. estimates the 2015 world population as 7 billion.

2. Predictive analytics: Based on models, predicts future information. Ex: Based on models, the U.N. predicts the world population in 2100 could be as high as 16 billion.

3. Prescriptive analytics: Makes decisions based on descriptive/predictive analyses. Ex: Growth is great in India, so the U.N. may focus more agricultural efforts there.

PARTICIPATION ACTIVITY 2.1.3: Descriptive, predictive, and prescriptive analytics.

Match each description to prescriptive analytics, predictive analytics, or descriptive analytics.

Given existing data, strives to summarize the data, perhaps to gain insight.

Given data and a model, strives to determine future values.

Strives to make decisions or recommendations based on data.

Types of data

Variables

Data is typically represented using variables. A variable is an item that can have different ("varying") values. Ex: A person's age is a variable and can have the value 10, 33, 99, or other values. Variables are often considered as being of two possible types:

A quantitative variable can take on a numeric value (quantitative data) that can be measured and ordered. Ex: A person's age, the outside temperature, and a meal's price are quantitative variables. Example numeric values are an age of 33 or 99 years, a temperature of 40 or 45 degrees, and a price of 12 or 15 dollars.

A categorical variable can take on the value (usually a label) of one of several categories. Ex: A person's gender, seasons, and U.S. companies are categorical variables. Gender can be male or female, seasons can be fall, winter, spring, or summer, and U.S. companies can be Wal-Mart, McDonalds, UPS, etc. A categorical variable is often called a qualitative variable (known by qualities, rather than quantities).

Most numbers represent quantitative data, but exceptions exist. Ex: A person's phone number is a number but is not quantitative data; a phone number isn't measured, nor ordered; people don't say: "Joe's phone number is greater than Mary's." In general, if adding the numbers makes sense, the variable is likely quantitative, else categorical. (People may add ages but don't add phone numbers.)

A reason for distinguishing variable types is that each type is handled differently in data analytics. Ex: A categorical variable typically involves counting the instances of each category, often then depicted with a bar chart or pie chart. But a quantitative variable is commonly plotted versus another quantitative variable, often depicted with a scatter plot or line chart. Those chart types are described in other sections.
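The different handling of the two variable types can be sketched in a few lines of Python. This is only an illustration: the car records and the names `cars`, `maker_counts`, and `average_age` are invented for the example and do not come from this material.

```python
from collections import Counter
from statistics import mean

# Hypothetical records, invented for illustration only.
cars = [
    {"maker": "Ford", "age": 3},
    {"maker": "Toyota", "age": 7},
    {"maker": "Ford", "age": 12},
    {"maker": "Honda", "age": 5},
]

# Categorical variable: count the instances of each category.
maker_counts = Counter(car["maker"] for car in cars)

# Quantitative variable: arithmetic such as an average is meaningful.
average_age = mean(car["age"] for car in cars)

print(maker_counts)
print(average_age)
```

The counts are the kind of values a bar chart or pie chart would display, while averaging the makers would make no sense.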

PARTICIPATION ACTIVITY 2.1.4: Quantitative vs. categorical variables.

Indicate whether each variable is quantitative or categorical.

1) A car's age.

2) A car's maker.

3) A house's square footage.

4) A house's color.

5) A house's address.

6) "Qualitative variable" is likely another term for which type?

7) "Numerical variable" is likely another term for which type?

Types of categorical variables

Two types of categorical variables are often distinguished:

A nominal variable's categories have no ordering, existing in name only, like apples, oranges, and grapes. ("Nominal" means "in name only".)

An ordinal variable's categories have an ordering, like disagree, neutral, and agree.

The difference is sometimes relevant. Ex: On a chart, the ordinal variables would almost always be sorted along the x-axis, listed as "small medium large" rather than arbitrarily as "small large medium."
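As a small sketch of why the ordering matters, the Python below sorts the same observations alphabetically and by category order. The list `observed` is hypothetical data, and `SIZE_ORDER` is simply one way to record an ordinal variable's ordering.

```python
# The ordering of an ordinal variable's categories is stated explicitly.
SIZE_ORDER = ["small", "medium", "large"]

# Hypothetical observations, invented for illustration.
observed = ["large", "small", "medium", "small"]

# Alphabetical sorting ignores the category ordering.
alphabetical = sorted(observed)

# Sorting by position in SIZE_ORDER respects the ordering.
by_rank = sorted(observed, key=SIZE_ORDER.index)

print(alphabetical)
print(by_rank)
```

For a nominal variable no such ordering list exists, so any arrangement of the categories is equally valid.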

PARTICIPATION ACTIVITY 2.1.5: Categorical variables: Nominal versus ordinal.

Indicate whether each variable is nominal or ordinal.

1) A car comes in 5 possible colors: red, grey, brown, black, and white.

2) A movie has 5 possible ratings: G, PG, PG-13, R, and NC-17. (See movie ratings if unfamiliar.)

3) An Amazon product has 5 possible ratings: 1 star, 2 stars, 3 stars, 4 stars, or 5 stars.

4) A survey asks users to enter a number indicating political affiliation: 1 for Libertarian, 2 for Democratic, 3 for Republican, and 4 for Other.

5) A form asks a person to indicate a country of birth.

Types of quantitative variables

Two types of quantitative variables are often distinguished:

A continuous variable's values are infinite along a continuum of values within a range, typically real numbers. Continuous variables usually represent measurements, like height (0.00104 meters) or temperature (98.6 degrees).

A discrete variable's values are finite within a range, typically integers. Discrete variables usually represent countable items, like people in a family (5) or cars in a city (502,434). Generally, if "number of" can be added to the beginning, the variable is discrete, like "number of people in a family", but not "number of height". Note: "Discrete" means separate or distinct, not to be confused with "discreet", which means careful or unobtrusive.

PARTICIPATION ACTIVITY 2.1.6: Continuous vs. discrete quantitative variables.

Indicate whether the variable is continuous or discrete.

1) Width of a house.

2) Height of a human.

3) Gallons in a car's gas tank.

4) Fingers on a human's hands.

5) Hairs on a human's head.

6) Air molecules in a house.

References

(*1) United Nations Global Working Group on Big Data for Official Statistics Task Team on Cross-Cutting Issues. "Deliverable 2: Revision and Further Development of the Classification of Big Data." United Nations Statistics Division. 12 October 2015, unstats.un.org/unsd/trade/events/2015/abudhabi/gwg/GWG%202015%20-%20item%202%20(iv)%20-%20Big%20Data%20Classification.pdf.

(*2) Jacob, Brian A. and Steven D. Levitt. "Rotten Apples: An Investigation of the Prevalence and Predictors of Teacher Cheating." Quarterly Journal of Economics. Volume 118, Issue 3, 1 August 2003, Pages 843-877, doi.org/10.1162/00335530360698441.

(*3) Wise, Travis. "Oakland A's." Flickr. 12 May 2007, www.flickr.com/photos/photographingtravis/16666072878.


2.2 What is data visualization?

Introduction to data visualization

Data visualization is the display of data in a format, such as a table or chart, that seeks to convey particular information to a viewer. Data presented in a text-only format often does not convey information well. Ex: Given this text-only data on 2013 median house prices in southern California counties, finding the price for a particular county is inconvenient: Los Angeles $405,000; Orange $661,000; Riverside $306,000; San Bernardino $192,000; San Diego $473,000; Ventura $464,000.

Instead, displaying the data visually as a table better conveys the information. A table displays data using rows and columns.
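As a rough illustration of arranging the same county data into rows and columns, the Python sketch below builds a plain-text table. The prices are the 2013 values quoted in this section. The helper name `format_table` and its layout choices are assumptions made for the example.

```python
# 2013 median house prices quoted in this section.
prices = {
    "Los Angeles": 405_000,
    "Orange": 661_000,
    "Riverside": 306_000,
    "San Bernardino": 192_000,
    "San Diego": 473_000,
    "Ventura": 464_000,
}

def format_table(rows, headers):
    """Return the rows as left-aligned text columns (illustrative helper)."""
    width = max(len(headers[0]), *(len(name) for name in rows))
    lines = [f"{headers[0]:<{width}}  {headers[1]}"]
    for name, price in rows.items():
        lines.append(f"{name:<{width}}  ${price:,}")
    return "\n".join(lines)

table = format_table(prices, ("County", "Median house price"))
print(table)
```

Each county now sits on its own row, so a reader can scan down the first column to find a county and read its price from the second.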

Table 2.2.1: Southern California median house prices by county (2013).

County | Median house price
Los Angeles | $405,000
Orange | $661,000
Riverside | $306,000
San Bernardino | $192,000
San Diego | $473,000
Ventura | $464,000

As another example, the following data represents California median house prices from 2000-2010: 2000 $241,000; 2001 $262,000; 2002 $316,000; 2003 $372,000; 2004 $451,000; 2005 $523,000; 2006 $556,000; 2007 $560,000; 2008 $348,000; 2009 $275,000; 2010 $305,000. A table conveys the information better than text, but if the goal is to illustrate the housing price "bubble" that grew and then burst in 2008, a chart is even better.

Table 2.2.2: California median house prices, 2000-2010.

Year | California median house price
2000 | $241,000
2001 | $262,000
2002 | $316,000
2003 | $372,000
2004 | $451,000
2005 | $523,000
2006 | $556,000
2007 | $560,000
2008 | $348,000
2009 | $275,000
2010 | $305,000

Figure 2.2.1: California median house prices, 2000-2010.

PARTICIPATION ACTIVITY 2.2.1: Data visualization.

Refer to the tables and charts above.

1) Refer to the table above showing California house prices by county (2013). A company is considering moving offices to San Bernardino county. What is the median house price in that county?

2) Refer to the table above showing California house prices from 2000-2010. In what year did the price bubble burst? That is, in what year was the price drastically lower than the previous year?

3) Refer to the figure above showing California house prices from 2000-2010. In what year was the peak of house prices?

4) Referring to the figure above showing California house prices from 2000-2010, what was the relative difference between the highest and lowest California house prices? Answer with: double, triple, or quadruple.

Uses of data visualization

Expressing data as a table or chart allows the viewer to comprehend data more quickly than data presented as a list of numbers. A chart is particularly helpful in analyzing large datasets, where a list or even a table of the data would be incomprehensible. Visual representation is also more intuitively grasped than numbers. The pie chart below shows the per capita availability of milk in the United States in 2013. The viewer is able to quickly grasp that plain 2% milk has the greatest availability, and gains an intuitive sense of how much more 2% milk is available than any other category.

Figure 2.2.2: Charts allow for quick analysis of data.

Credit: Brilliant.org (*1) (*2)

A chart can help the viewer see trends in the data. The chart below shows the price of gold from 1971 to 2019. The overall trend is upward, with the exception of market crashes in 1981 and 2013 and short downward trends in that period. Thus, someone looking to invest in gold might conclude that gold is generally a good long-term investment if timed properly.

Figure 2.2.3: Charts help identify trends.

Credit: Macrotrends (*3)

Charts also allow the viewer to identify relationships and patterns in the data. The chart below shows the gold prices, in yellow, and stock prices, in green, from 1971 to 2017. An investor might note that gold and stock prices do not always move together, and that, as of the end of 2011, gold was relatively expensive compared to stock prices. This type of chart might be used to ascertain whether the time is right to buy gold, or if the price of gold is at a peak and likely to come back down.

Figure 2.2.4: Charts help identify relationships between data.

Credit: Sunshine Profits (*4)

PARTICIPATION ACTIVITY 2.2.2: Using data visualizations.

Answer true or false.

1) Charts are useful for small datasets, but become too crowded when used with large datasets.

2) Charts can be used to identify relationships between different variables.

Considerations for data visualization

While data visualization is useful, and even necessary, in exploring and understanding large datasets, a variety of tables and charts are available, and a number of factors must be considered when choosing how to present the data.

First, the size and cardinality of the dataset must be considered. Cardinality is the number of unique elements in a dataset. Ex: The set of student IDs of students in a class has high cardinality, since each ID is unique, whereas the set of student ages will have lower cardinality, since many students will have the same ages. Certain chart types, such as pie charts or bar charts, are well-suited to data with low cardinality, but not well-suited for high-cardinality data, as illustrated by the pie charts below. The chart on the left displays high-cardinality data, and is difficult to read, while the chart on the right displays low-cardinality data.

Figure 2.2.5: Pie charts are better suited for low-cardinality data than high-cardinality data.

Other charts, such as scatter plots, line charts, and histograms, work very well for high-cardinality data. In the figure below, the histogram collects the data into eight equal-sized bins and shows the distribution of a large number of unique data points. The scatter plot shows the relationship between two variables with high cardinality.

Figure 2.2.6: Histograms and scatter plots are well-suited for high-cardinality data.
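Cardinality and histogram binning can both be computed directly. The Python below is a minimal sketch: the function names `cardinality` and `equal_width_bins` and all sample data are assumptions made for illustration, and the binning assumes the values are not all equal.

```python
def cardinality(values):
    """Number of unique elements in the dataset."""
    return len(set(values))

def equal_width_bins(values, bin_count):
    """Count how many values fall in each of bin_count equal-width bins.

    Assumes the values are not all equal (the bin width would be zero).
    """
    low, high = min(values), max(values)
    width = (high - low) / bin_count
    counts = [0] * bin_count
    for v in values:
        # The maximum value is placed in the last bin.
        index = min(int((v - low) / width), bin_count - 1)
        counts[index] += 1
    return counts

# Hypothetical data, invented for illustration.
student_ids = [1001, 1002, 1003, 1004, 1005, 1006]
student_ages = [19, 20, 19, 21, 20, 19]
readings = [0, 1, 2, 3, 4, 5, 6, 7, 8]

id_cardinality = cardinality(student_ids)
age_cardinality = cardinality(student_ages)
bin_counts = equal_width_bins(readings, 8)

print(id_cardinality, age_cardinality, bin_counts)
```

Every ID is unique, so the IDs have the highest possible cardinality for six students, while the ages repeat. The bin counts are the bar heights a histogram with eight bins would draw.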

The type of chart used also depends on the kind of data being presented and the information to be conveyed. A dataset that has only one variable, or where only one variable needs to be presented, can be visualized using a pie chart, histogram, or box plot. A dataset with two or more variables that are related may be best suited to visualization with a type of scatter plot or line chart. A dataset in which one of the variables is categorical works with a bar graph, pie chart, or violin plot. Ex: The bar chart below shows the number of exoplanets discovered using each of ten methods. The method type is a categorical variable.

Figure 2.2.7: Bar charts are appropriate for plotting categorical data.

PARTICIPATION ACTIVITY 2.2.3: Choosing an appropriate chart.

1) Which chart better conveys the most common number of discovered planets in a star system?


2) Which chart better conveys how gas mileage for cars has changed over time?

3) Which chart is more appropriate for showing data from three unrelated categories?

References

(*1) Moore, Karleigh, et al. "Data Presentation - Pie Charts." Brilliant.org. Retrieved 16 July 2018, brilliant.org/wiki/data-presentation-pie-charts/.

(*2) USDA Economic Research Service. "Table. dymfg." United States Department of Agriculture. www.ers.usda.gov/data-products/food-availability-per-capita-data-system/.

(*3) Macrotrends. "Gold Prices - 100 Year Historical Chart". Last accessed: 1 August 2019. https://www.macrotrends.net/1333/historical-gold-prices-100-year-chart.

(*4) Sunshine Profits. "Precious metals investment terms A to Z." Last accessed: 1 August 2019. https://www.sunshineprofits.com/gold-silver/dictionary/dow-jones-gold/.

2.3 Data and spreadsheets

Introduction to spreadsheets

A spreadsheet application is a common computer application for organizing data like text or numbers, for using formulas to calculate a mathematical quantity using existing data as inputs, and for creating charts to visualize data.

Widely used spreadsheet applications include Microsoft Excel, Google Sheets, and Apache OpenOffice. Examples in this material are presented using Microsoft Excel. Where appropriate, the differences in syntax and functions among applications are also discussed.

A spreadsheet contains cells organized into columns labeled A, B, ... and rows labeled 1, 2, ...; a spreadsheet user can type data in each cell.

PARTICIPATION ACTIVITY 2.3.1: Spreadsheets: Entering data.

Below is a basic spreadsheet. Enter data by clicking on a cell, typing data like a number or text, and pressing enter.

PARTICIPATION ACTIVITY 2.3.2: Spreadsheets: Introduction.

Animation captions:

1. A spreadsheet consists of cells organized into columns and rows. The column headings are letters and the row headings are numbers.

2. A user can enter data, like words or numbers, into each cell. The spreadsheet is a convenient way to create a table of data.

PARTICIPATION ACTIVITY 2.3.3: Introduction to spreadsheets.

Refer to the spreadsheet in the animation above.

1) How many cells are shown?

2) How many cells contain data?

3) Cell B1 contains the word "Weight". What does cell A2 contain?

4) Which cell contains the number 12?

Autofill

Autofill is a useful spreadsheet feature that recognizes a pattern and fills in additional pattern values.

Spreadsheet Practice 2.3.1: Autofill.

The following table, captured in Google Sheets, specifies the start of a pattern.

Autofill cells: Enter the first few values of a pattern. Highlight the cells. Drag the green box by the fill handle, denoted by the arrow in the image below, to the desired number of cells.

Patterns may include text, numbers, or dates, among others.

If a pattern is not recognized, the autofill feature repeats the pattern provided.
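The behavior described above can be approximated in Python. This is a simplified sketch and not how any spreadsheet application actually implements autofill: the hypothetical function `autofill` recognizes only constant-step numeric patterns, whereas real spreadsheets also recognize patterns such as days of the week and dates.

```python
def autofill(seed, extra):
    """Return `extra` new values that extend the seed cells.

    A constant-step numeric pattern is continued. Any other seed is
    treated as unrecognized, so the seed values are repeated.
    """
    numeric = all(isinstance(x, (int, float)) for x in seed)
    if numeric and len(seed) >= 2:
        step = seed[1] - seed[0]
        if all(b - a == step for a, b in zip(seed, seed[1:])):
            return [seed[-1] + step * (i + 1) for i in range(extra)]
    # Pattern not recognized: cycle through the seed values again.
    return [seed[i % len(seed)] for i in range(extra)]

# Hypothetical seed cells, invented for illustration.
continued = autofill([5, 10, 15], 2)
repeated = autofill(["red", "green", "blue"], 4)

print(continued)
print(repeated)
```

The first call continues the step of 5, while the second finds no numeric pattern and falls back to repeating the provided values.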

PARTICIPATION ACTIVITY 2.3.4: Autofill.

1) For the spreadsheet below, selecting cells A1:A3, then dragging the selection down, fills cells A4:A5 with _____.
  7, 8
  8, 10

2) For the spreadsheet below, selecting cells A1:C1, then dragging the selection right, fills the cells to the right with _____.
  Monday, Tuesday
  Thursday, Friday

3) Selecting cells A1:A3, then dragging the selection down will fill subsequent cells with _____.
  Alfa Romeo, MINI Cooper
  Mivalino, Mopetta

Spreadsheet formulas

An important spreadsheet feature allows a user to type a formula in a cell to compute that cell's value based on other cells' values. A formula begins with = followed by a math expression using operators like +, -, *, /, and parentheses (). Another cell's value can be included using that cell's column letter and row number. Ex: = 3 * A1.

PARTICIPATION ACTIVITY 2.3.5: Spreadsheets: Formulas.

Animation captions:

1. To use a formula, a cell is selected and a formula is typed either in the cell or the formula bar.
2. Each formula begins with an equals sign (=).
3. To find the 15% tip for a $20 restaurant bill, the formula = 0.15*A2 is used. A2 can either be typed or selected using the cursor.
4. Pressing the return or enter key displays the output in the cell. However, the formula in the formula bar does not disappear.

PARTICIPATION ACTIVITY 2.3.6: Spreadsheets: Entering formulas.

Type 10 in A1, 20 in B1, and 30 in C1. In D1, type:   = A1 + B1 + C1. The cell should show 60.

Next, update A1 to be 15; D1 should automatically change to 65.

Finally, click on D1, and modify the formula to use * instead of +.
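The three steps of this activity can be mirrored in Python to show why D1 changes when A1 changes. This is only an analogy: the dictionary `cells` and the functions `d1_sum` and `d1_product` are illustrative names standing in for the spreadsheet and its formulas.

```python
# Cell values stored by address, as typed in the activity.
cells = {"A1": 10, "B1": 20, "C1": 30}

def d1_sum(c):
    """Stands in for the formula = A1 + B1 + C1."""
    return c["A1"] + c["B1"] + c["C1"]

def d1_product(c):
    """Stands in for the formula = A1 * B1 * C1."""
    return c["A1"] * c["B1"] * c["C1"]

first = d1_sum(cells)       # 10 + 20 + 30

cells["A1"] = 15            # Update A1; the formula is re-evaluated.
second = d1_sum(cells)      # 15 + 20 + 30

third = d1_product(cells)   # 15 * 20 * 30

print(first, second, third)
```

A formula stores the calculation rather than its result, so re-evaluating it after A1 changes yields 65 instead of 60.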

PARTICIPATION ACTIVITY 2.3.7: Entering formulas.

Refer to the animation above.

1) What value appears if the formula = A2 + B2 is entered in C2?

2) What formula should be used to find the 15% tip on a $30 bill?

Spreadsheet functions

A spreadsheet function is a predefined formula that supports common tasks such as computing the average, minimum, or maximum of a group of cells. Spreadsheets commonly support a number of functions encompassing engineering, statistics, finance, and other applications. Users can also define custom functions, which is an advanced topic that is not discussed in this section.

Table 2.3.1: Common functions.

Function name | Description | Function syntax
ABS | Returns the absolute value of a number. | ABS(value)
COUNT | Returns a count of the number of numeric values in a dataset. | COUNT(value1, [value2, …])
MAX | Returns the maximum value in a numeric dataset. | MAX(value1, [value2, …])
MIN | Returns the minimum value in a numeric dataset. | MIN(value1, [value2, …])
SUM | Returns the sum of a series of numbers and/or cells. | SUM(value1, [value2, …])

Above, the function syntax defines how the function is used, and specifies the function's name and accepted arguments. A function's arguments are surrounded by parentheses and specify the data that the function operates on. Arguments may be numbers, cells, a range of cells, or a combination thereof. The [ ] arguments are optional.

To call a function in a spreadsheet, = is followed by the function's name and then arguments separated by commas. Ex: =SUM(A1, A2, A3) calculates the sum of cells A1, A2, and A3. The range operator (:) defines a reference to a group of cells. Ex: =SUM(A1:A4, B10) calculates the sum of cells A1, A2, A3, A4, and B10.
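To make the range operator concrete, the Python sketch below expands a range into individual cell addresses and then adds the referenced values, mimicking =SUM(A1:A4, B10). The names `expand_range` and `sheet_sum` and the cell values are assumptions made for illustration. The sketch handles only single-letter columns and treats missing cells as 0.

```python
import re

def expand_range(ref):
    """Expand 'A1:A4' into ['A1', 'A2', 'A3', 'A4'].

    A single cell such as 'B10' is returned as a one-item list.
    Only single-letter columns are handled in this sketch.
    """
    if ":" not in ref:
        return [ref]
    start, end = ref.split(":")
    (c1, r1), (c2, r2) = (
        re.fullmatch(r"([A-Z])(\d+)", x).groups() for x in (start, end)
    )
    return [
        f"{chr(col)}{row}"
        for col in range(ord(c1), ord(c2) + 1)
        for row in range(int(r1), int(r2) + 1)
    ]

def sheet_sum(cells, *refs):
    """Mimic =SUM(ref1, ref2, ...); missing cells count as 0 here."""
    return sum(cells.get(addr, 0) for ref in refs for addr in expand_range(ref))

# Hypothetical cell contents, invented for illustration.
cells = {"A1": 1, "A2": 2, "A3": 3, "A4": 4, "B10": 10}
total = sheet_sum(cells, "A1:A4", "B10")

print(expand_range("A1:A4"))
print(total)
```

The range A1:A4 is shorthand for listing each cell, so the call adds the same five cells as =SUM(A1, A2, A3, A4, B10).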

PARTICIPATION ACTIVITY 2.3.8: Functions.

Complete each formula.

1) Determine the sum of cells A1 and C3.

= SUM( )

2) Determine the sum of all cells from A5 to A27 using the range operator.

= SUM( )

3) Determine the largest value in cells D2 and D3.

=

4) Determine the number of numeric values in cells F5 to F21, cell G12, and cell H9.

=

Spreadsheet functions

Table 2.3.2: Spreadsheet functions: Common functions.

Function name | Description | Function syntax
ABS | Returns the absolute value of a number. | ABS(value)
COUNT | Returns a count of the number of numeric values in a dataset. | COUNT(value1, [value2, …])
MAX | Returns the maximum value in a numeric dataset. | MAX(value1, [value2, …])
MIN | Returns the minimum value in a numeric dataset. | MIN(value1, [value2, …])
SUM | Returns the sum of a series of numbers and/or cells. | SUM(value1, [value2, …])

Challenge activities

CHALLENGE ACTIVITY 2.3.1: Spreadsheets: Formulas and functions.


CHALLENGE ACTIVITY 2.3.2: Tables and spreadsheets.


What value is in cell D5? If no value, type: Empty

A B C D

1 7 41 6

2 1 29 19

3 42 41 36

4 17 12 16

5


2.4 Bar charts

Bar charts

A bar chart depicts data values for a categorical variable, using rectangular bars having lengths proportional to category values. The chart is drawn using two axes: a category axis that displays the category names and a value axis that displays the counts. Ex: The animation below shows the number of employees of each of the 4 largest private employers in the United States in 2017.

PARTICIPATION ACTIVITY 2.4.1: Bar chart.

Categories are commonly ordered along the category axis. In the animation, the nominal variable Company's categories were ordered by each category's data value, highest (Wal-Mart's 2,300,000) to lowest (Kroger's 449,000). If instead the categories represented years (1970, 1980, etc.) or some other measure of time, such an ordinal variable's categories would be ordered with time increasing to the right.

Animation captions:

1. A bar chart shows the counts in each category. The categories and the corresponding counts are obtained from a table.
2. The category axis of a bar chart provides a label for each category.
3. The value axis indicates the data values.
4. Each category's value is shown using a bar with appropriate height.
5. All charts should have a title.


Each listed category has a category label, such as "Amazon" above. If labels don't fit when written horizontally, the labels can be rotated, such as by 30 degrees as above. Rotations of 60 or 90 degrees are also common.

The appropriate increment for the value axis is important for readability. Using small increments clutters the chart with too many values: Above, an increment of 100,000 would yield 25 values. Too few values, as with increments of 1,000,000 above, can make visually estimating a category's value difficult. Above, using increments of 500,000 leads to values that can easily be estimated (Wal-Mart can be seen to have a value of about 2,300,000) without clutter.
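The trade-off between clutter and readability can be made concrete by counting how many labeled values a given increment produces. The sketch below assumes a hypothetical value axis running from 0 to 2,500,000 (the axis maximum is an assumption, chosen to sit just above a largest bar of about 2,300,000); the function name axis_values is also hypothetical.

```python
def axis_values(axis_max, increment):
    """Return the labeled values on a value axis, from `increment` up to `axis_max`."""
    return list(range(increment, axis_max + 1, increment))

AXIS_MAX = 2_500_000  # assumed axis maximum, not taken from the chart itself

for increment in (100_000, 500_000, 1_000_000):
    values = axis_values(AXIS_MAX, increment)
    print(f"increment {increment:>9,}: {len(values)} labeled values")
```

Under these assumptions the smallest increment crowds the axis with labels, while the largest leaves only two labels to estimate from.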

Grid lines help the viewer estimate the value for a category. If easier estimation is desired, additional grid lines can be drawn between number increments (but kept minimal to reduce clutter). If precise values need to be conveyed, data values known as data labels can be shown next to the bars, or even inside the bars, as in the chart further below. However, precise values are not typically needed if the information goal is to show relative differences among categories.

PARTICIPATION ACTIVITY 2.4.2: Bar chart basics.

1) A bar chart excels at showing precise values.

2) A bar chart excels at showing relative values.

3) The more gridlines drawn in a bar chart, the better.

4) A data label is a category name, such as "IBM" in the above example.

5) In the above bar chart on U.S. employers, the category axis is named "Employees".

6) For the above U.S. employers chart, categories were ordered by value. An alphabetical ordering would have been just as good.

Horizontal bar charts

A bar chart can be drawn vertically or horizontally. A horizontal bar chart is useful for long labels, like "Wal-Mart Stores", which need not be written at an angle as was done above. A horizontal bar chart is also useful when numerous categories exist because the categories increase the height rather than width, and due to the nature of paper and computers, width is usually more limited while height is less limited. In contrast, a vertical chart is often preferred due to "height" intuitively representing amount. This preference is especially the case when negative values are shown (which would appear going downwards).

Some authors and tools use the term "bar chart" to refer exclusively to a horizontal bar chart. In that case, a column chart is a term used for a vertical bar chart. However, the term bar chart is widely used for vertical charts by many respected authors and tools. Thus, this material uses the term bar chart for either orientation, adding the word "horizontal" or "vertical" as appropriate.

Figure 2.4.1: A horizontal bar chart with data labels for the largest private U.S. employers in 2016.


PARTICIPATION ACTIVITY 2.4.3: Horizontal bar charts.

1) A horizontal bar chart may be preferable if the category labels are long.

2) A horizontal bar chart may be preferable if many categories exist.

3) A horizontal bar chart may be preferable to depict different companies' annual profits, some of which are negative.

4) A horizontal bar chart may be preferable for conveying the number of floors for the world's 5 tallest buildings.

Relative-frequency bar chart

A basic bar chart's value axis provides the raw data value for each category. Instead, a relative-frequency bar chart shows each category's portion of the total data, typically as a percentage. The data total is first computed, then the percentage for each category is computed, and finally those percentages are drawn as a bar chart.
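The three steps (total, percentage per category, draw) can be sketched in a few lines of Python. The category names and counts below are hypothetical, and the text bars are only a stand-in for a real chart.

```python
def relative_frequencies(counts):
    """Convert raw category counts to percentages of the total."""
    total = sum(counts.values())  # step 1: compute the data total
    return {category: 100 * value / total for category, value in counts.items()}  # step 2

# Hypothetical raw counts for three categories.
counts = {"Red": 12, "Green": 18, "Blue": 30}
percentages = relative_frequencies(counts)

# Step 3: draw each category's bar (one '#' per 5 percentage points).
for category, pct in percentages.items():
    bar = "#" * round(pct / 5)
    print(f"{category:<6} {bar} {pct:.0f}%")
```

Because every value is divided by the same total, the percentages always add up to 100%.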

PARTICIPATION ACTIVITY 2.4.4: Relative-frequency bar chart.

Example 2.4.1: Professional European soccer player birth months.

The following relative-frequency bar chart depicts the birth months for professional European soccer players. Most players were born in the first few months of the year. A possible explanation is the January 1 cutoff date for youth soccer leagues, meaning kids born in January are the oldest on their teams, while kids born in December are the youngest. Older kids are likely to be better players initially, causing coaches to give them more attention and playing time, and also causing those kids to enjoy playing and thus practicing more.

The above example illustrates the power of data analytics and data visualization. Having understood such data, many parents now choose to postpone their child's entry onto a sports team so that the child is not the youngest on the team. In fact, similar data exists for school kids: Kids born just after cutoff dates tend to be more successful in school (getting more attention from teachers, and causing those kids to feel smarter and enjoy school more), and thus many parents now delay their child's school enrollment by a year.

Animation captions (Participation Activity 2.4.4):

1. A relative-frequency bar chart shows the percent that each category is of the total. First, the total is calculated.
2. Next, the value of each category is divided by the total and multiplied by 100 to find the relative frequency as a percentage of the total.
3. The value axis can stop short of 100%, but should usually begin at 0%.
4. Finally, each category's bar is drawn.

PARTICIPATION ACTIVITY 2.4.5: Relative-frequency bar charts.

Consider a dataset composed of 4 small shirts, 10 medium shirts, and 6 large shirts.

1) If the dataset is represented by a bar chart, what is the height of the bar representing the small shirts?

2) If the dataset is represented by a relative-frequency bar chart, what is the height of the bar representing the small shirts?

%

3) If the dataset is represented by a relative-frequency bar chart, what is the height of the bar representing the medium shirts?

%

Excel-Practice 2.4.1: Bolt production.

A quality assurance manager records the number of defective and non-defective bolts taken randomly from a factory on a given day. To create a chart that displays the number of defective and non-defective bolts, select the cells containing both the title and the data.

To create a vertical bar chart, click the Insert tab and select Clustered Column under 2D Column.

To edit the graph, double click a chart element and adjust the parameter values in the dialog box.


To create a horizontal bar chart, select Clustered Bar under 2D Bar instead.

Grouped bar chart

A grouped bar chart depicts two or more groups on a single bar chart, with each group using a different colored (or shaded) bar. A legend indicates what group each color represents in a chart. Ex: The below bar chart shows the number of men vs. women in the U.S. workforce over time. The categories are decades (1970, 1980, ...), the category values are number of people, and the two groups are men and women.

Because the categories represent time (decades), a vertical chart is preferred so that time proceeds to the right.

Figure 2.4.2: Sex of the U.S. workforce.

Source: U.S. Dept. of Labor (*4)

PARTICIPATION ACTIVITY 2.4.6: Grouped bar chart.

Consider the above grouped bar chart showing men and women in the U.S. workforce.

1) How does the number of women in the workforce compare with the number of men for each decade? (Fewer / More / Same)

2) How has the difference between men and women changed as time has progressed? (Increased / Decreased / Same)

3) How has the total number of workers changed as time has progressed? (Increased / Decreased / Unable to determine)

Example 2.4.2: Case study: CDC data on the utilization of physician assistants and advance practice nurses.

A physician assistant (PA) is a state-licensed health professional who practices medicine under a physician's supervision. An advanced practice nurse (APN) is a registered nurse with advanced training. A study published by the U.S. Centers for Disease Control on physician assistant and advance practice nurse care in hospital outpatient departments finds that the supply of PAs and APNs is expanding and that PAs and APNs are playing increasingly diverse roles in the healthcare system.

The chart below appears in a section titled "Does PA/APN utilization differ by hospital location?" The categories being grouped are types of hospital locations: large central metropolitan, large fringe metropolitan, small or medium metropolitan, and nonmetropolitan. The groups are physician and PA/APN.

PARTICIPATION ACTIVITY 2.4.7: Utilization of physician assistants and advance practice nurses.

Consider the bar chart above on the utilization of physician assistants and advanced practice nurses.

1) The bar chart is a horizontal bar chart.

2) The groups are PAs and APNs.

3) The utilization of PAs and APNs decreases as the hospital's location becomes less metropolitan.

4) Each category shown is distinct.

Stacked bar chart

A stacked bar chart is a grouped bar chart where the bars are stacked on each other. A stacked bar chart is useful for showing each category's total, while still showing the breakdown of groups within each category. However, the relative size of each group in a category becomes harder to see because the bars are not side-by-side. The following shows a stacked bar chart and grouped bar chart for the same data.

Example 2.4.3: Case study: CDC data on births in the United States.


Source: CDC 5


A study published by the U.S. Centers for Disease Control on the 2016 birth data presents several health indicators such as fertility rates, birth rates, and multiple birth rates.

The chart below appears in a section titled "The preterm birth rate rose for the second straight year in 2016." Preterm birth refers to a pregnancy that is shorter than normal, especially births that occur after no more than 37 weeks of pregnancy. The chart displays the percentage of preterm births for years 2014, 2015, and 2016 categorized by the mother's age group. The age groups are: under 20, 20-29, 30-39, and 40 and over.

PARTICIPATION ACTIVITY 2.4.8: Preterm births in the United States.

Refer to the bar chart above on preterm births in the United States.

1) What is the percentage of preterm births in 2016 for mothers under 20 years old? (3.20% / 7.20% / 10.40%)

2) What is the percent increase in preterm births for all ages between 2015 and 2016? (0.22% / 9.63% / 9.85%)

3) Which age group showed the largest percentage of late preterm births in 2016? (Under 20 / 20-29 / 30-39 / Over 40)

The concept of a relative frequency chart is commonly applied to a stacked bar chart. Such a chart clearly shows how a particular group's proportion of the total changes across categories (such as across years).
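As a rough sketch of how such a chart's bar segments are computed, each category's groups are divided by that category's own total, so every stacked bar reaches 100%. The quarters, group names, and amounts below are hypothetical and are not taken from any figure in this material.

```python
def stacked_shares(groups_by_category):
    """For each category, convert group values to percentages of that category's total."""
    shares = {}
    for category, groups in groups_by_category.items():
        total = sum(groups.values())
        shares[category] = {group: 100 * value / total for group, value in groups.items()}
    return shares

# Hypothetical sales by quarter, split into two groups.
sales = {
    "Q1": {"Online": 20, "In-store": 80},
    "Q2": {"Online": 45, "In-store": 105},
}
shares = stacked_shares(sales)

for category, groups in shares.items():
    segments = ", ".join(f"{group} {pct:.0f}%" for group, pct in groups.items())
    print(f"{category}: {segments}")
```

In this made-up data both groups grow in raw terms from Q1 to Q2, yet only the Online share of the bar grows, which is the kind of shift a relative frequency stacked bar chart makes visible.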

Figure 2.4.3: Massachusetts state spending on healthcare versus all other state spending, using a relative frequency stacked bar chart.

Source: CDC 6


PARTICIPATION ACTIVITY 2.4.9: Relative frequency stacked bar chart.

Refer to the above figure showing Massachusetts state spending.

1) In 2000, about what percent of state spending was on healthcare? (25% / 40% / 75%)

2) Did the relative percentage of healthcare spending to total spending increase or decrease from 2000 to 2013? (Increase / Decrease / Cannot determine)

3) Based solely on viewing the bar chart, what is the best prediction for relative percentage of healthcare spending to total spending in 2020? (25% / 50%)

Excel-Practice 2.4.2: Bolt production.

Often, data is presented in groups or categories. For instance, suppose the bolt manufacturer produces three types of bolts: hex bolts, carriage bolts, and shoulder bolts. Then the number of defective versus non-defective bolts can be grouped according to the type of bolt.

To create a grouped vertical bar chart, click the Insert tab and select Clustered Column under 2D Column.

To create a stacked vertical bar chart, click the Insert tab and select Stacked Column under 2D Column.

To create a grouped horizontal bar chart, click the Insert tab and select Clustered Bar under 2D Bar.

Source: Kaiser Family Foundation


To create a stacked horizontal bar chart, click the Insert tab and select Stacked Bar under 2D Bar.

Challenge activities

CHALLENGE ACTIVITY 2.4.1: Relative-frequency bar charts.

Draw the bar chart for: 40% apples, 40% bananas, and 20% carrots.

References

(*1) "Fortune 500 Lists". www.fortune.com/fortune500/list. Last accessed 24 Sept 2018.

(*2) "The Disadvantages of Summer Babies". Freakonomics, 2 Nov. 2011, freakonomics.com/2011/11/02/the-disadvantages-of-summer-babies/.

(*3) Konnikova, Maria. "Youngest Kid, Smartest Kid". The New Yorker, 19 Nov. 2013, www.newyorker.com/tech/elements/youngest-kid-smartest-kid.

(*4) "Civilian Labor Force by Sex". U.S. Dept. of Labor, www.dol.gov/wb/stats/NEWSTATS/facts/civilian_lf_sex_2016_txt.htm.

(*5) Esther Hing, M.P.H. and Sayeedha Uddin, M.D., M.P.H. "NCHS Data Brief No. 77: Physician Assistant and Advance Practice Nurse Care in Hospital Outpatient Departments 2008-2009." U.S. Centers for Disease Control.

(*6) Joyce A. Martin, M.P.H, Brady E. Hamilton, Ph.D., and Michelle J.K. Osterman, M.H.S. "NCHS Data Brief No. 287: Births in the United States, 2016". 2011.

2.5 Pie charts

Pie charts

A pie chart shows relative frequency for categories using a circle, with each category shown as a slice of appropriate size. The appearance is one of a sliced pie, leading to the chart's name. Because length differences are interpreted more precisely than size differences, bar charts are often preferred. However, pie charts remain common, perhaps in part because curved shapes are more aesthetically pleasing than rectangular shapes.
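A slice of "appropriate size" means each category's share of the total is mapped onto the same share of the full 360-degree circle. The sketch below illustrates that arithmetic with hypothetical commuting counts; the function name slice_angles is also an assumption for illustration.

```python
def slice_angles(counts):
    """Map each category's share of the total to its share of a 360-degree circle."""
    total = sum(counts.values())
    return {category: 360 * value / total for category, value in counts.items()}

# Hypothetical counts of how 40 people commute.
counts = {"Walk": 10, "Bike": 5, "Bus": 25}
angles = slice_angles(counts)

for category, angle in angles.items():
    share = 100 * counts[category] / sum(counts.values())
    print(f"{category:<5} {share:.1f}% of total -> {angle:.0f} degree slice")
```

The slice angles always sum to 360 degrees, just as the relative frequencies sum to 100%.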

Example 2.5.1: Case study: CDC data on health and access to care among employed and unemployed adults.

A study published by the U.S. Centers for Disease Control identifies the health insurance status and type of insurance among adults in the U.S. between 18 and 64 years old according to employment status in 2009-2010. Key findings showed that unemployed individuals are less likely to have private insurance, less likely to receive needed prescription medications, and less likely to receive needed medical care than those who are employed. The pie chart below shows the insurance status and type of insurance an individual has according to employment status.

PARTICIPATION ACTIVITY 2.5.1: Health and access to care among employed and unemployed adults.

1) 6.6% of unemployed individuals had public insurance.

2) The percentage of uninsured among unemployed individuals was nearly three times as high as those who are employed.

3) The percentage of uninsured individuals in the U.S. is 69.2%.

Exploded pie charts

An exploded pie chart is a pie chart with one or more slices separated from the rest of the pie. By exploding a pie chart, the exploded pieces are emphasized. In practice, the most important category - usually the category with the largest count or percentage - is often exploded. For pie charts with many categories, exploding all slices may enhance readability.

Example 2.5.2: U.S. Bureau of Labor Statistics time use survey.

Each year, the U.S. Bureau of Labor Statistics (BLS) conducts a survey called the American Time Use Survey (ATUS) that measures the amount of time people spend doing various activities. The BLS website provides interactive charts that display data by group, such as sex, employment status, and age.

The exploded pie chart below illustrates the time use average percentages per weekday for full-time college and university students with an emphasis on time spent sleeping.

PARTICIPATION ACTIVITY 2.5.2: Exploded pie charts.

Refer to the exploded pie chart above.

1) College students spend more of the non-sleeping time on _____ than any other activity. (work and related activities / educational activities / leisure and sports)


Credit: BLS.gov 1


2) The full pie represents ____ hours. (24 / 100)

Misuse of pie charts

While pie charts can be a useful tool, many situations exist in which pie charts can be difficult to read and interpret. Many optional pie chart effects can also make the pie chart unclear or distort the presented data.

Pie charts with too many slices

A pie chart should not have more than five or six slices. With more slices, the relative percentage of each slice becomes difficult to visually determine. While the percentages can be specified with labels, as in the chart below, too many labels also clutter the chart and make the chart difficult to interpret.

Example 2.5.3: Too many slices.

The pie chart below shows the twenty largest economies in the world. While interpreting the largest slices is straightforward, the relative sizes of the slices smaller than Germany and the UK are difficult to compare, and the slices representing Indonesia, Netherlands, Turkey, Switzerland, and Saudi Arabia do not convey much information at all.

Example 2.5.4: Pie charts with a legend.

To avoid the confusion that can occur with many labels, pie charts are sometimes presented with a legend. Below is a pie chart that shows U.S. high-school student time use. The use of a legend to represent the categories interferes with the intuition provided by graphics, instead requiring a reader to look back and forth to determine which slice corresponds to which activity.

Pie charts with poor color choices

Pie charts can also become difficult to read if the colors of the slices are chosen badly. The Largest Economies pie chart above uses the flags of individual nations as the colors of the slices. While the intention may have been to convey more clearly which nation is being represented, the effect is to distract the reader's eye from the sizes of the slices. Colors that are too similar are also bad choices, as in the Time Spent pie chart above, where the two variations of blue used for the Sleeping and the Other categories are difficult to distinguish.

3D pie charts

Even a basic, two-dimensional pie chart can be rendered useless if too many pieces, poor labeling, and confusing color choices are used. 3D effects, exploding charts, and unusual shapes often tend to obscure the data being presented even further.

Python allows for the creation of three-dimensional pie charts. However, 3D pie charts distort the shape of the charts, making estimating the size of the slices difficult, at best, and allowing data to be misinterpreted, at worst.

Credit: User Wikideas1 (*2) (*3)

Credit: BLS.gov (*4)

Example 2.5.5: 3D pie charts.

The chart below from the Internal Revenue Service shows the relative sizes of the major categories of spending for the federal government in 2016. The distortion of the slices from the 3D effect and the orientation of the chart make the relative sizes of the slices almost impossible to estimate. Ex: "Law enforcement and general government" and "Physical, human, and community development" appear to be nearly the same size, but the former only makes up 2% of the spending, while the latter makes up 7%. The third slice at the back of the graph, "Net interest on the debt", appears larger than either of the other two slices but only represents 6% of federal spending. Similarly, the "National defense, veterans, and foreign affairs" category at the front of the graph appears considerably larger than the "Social programs" category, even though spending in the former category is less.

Exploded pie charts

Exploded pie charts can also be used badly. While using this effect on one or two slices does emphasize the categories being represented, the process also necessarily changes the relative sizes of the slices, and can make the exploded slices appear larger than appropriate. Using the effect on multiple slices increases the difficulty in visually comparing the relative sizes of the slices, and can make the chart disorienting, as in the pie chart below.

Example 2.5.6: Exploded pie chart.

The exploded pie chart below shows the sources of iTunes revenue. Four of the seven slices have been exploded. Rather than emphasizing those four slices, the unevenness of the white dividing lines and the varying distances that the slices have been moved make interpreting the chart much more difficult.

Non-circular pie charts

Finally, in an attempt to be creative or interesting, non-circular pie charts are sometimes used. Unfortunately, visually comparing the sizes of slices in a non-circular pie chart is often nearly impossible.

Example 2.5.7

The pie chart of the sources of revenue from the Star Wars movies is shaped like the ship the Millennium Falcon. While the shape is clever, the slices of the "pie" do not have a uniform length, and thus the relative sizes of the slices are even more difficult to interpret. Ex: The slice representing "rentals" is wider than the slice representing "other", even though the "other" slice represents $100 million more revenue than the "rentals" slice.

Credit: IRS.gov (*5)

Credit: Horace Dediu (*6)

PARTICIPATION ACTIVITY 2.5.3: Identifying misused pie charts.

1) Which of the following pie charts is formatted most appropriately?

2) Which of the following pie charts is formatted most appropriately?

Credit: Michael Cerwonka (*7)

References

(*1) "American Time Use Survey." Bureau of Labor Statistics. 20 December 2016, https://www.bls.gov/tus/charts/students.htm.

(*2) User Wikideas1. "20 Largest Economies Pie Chart." Wikimedia Commons, the free media repository. 25 February 2017, en.wikipedia.org/wiki/File:20_Largest_economies_pie_chart.pdf.

(*3) "GDP Ranking." The World Bank Data Catalog. 29 June 2018, datacatalog.worldbank.org/dataset/gdp-ranking.

(*4) "Time Use Survey activity, How Do You Spend Your Time?." Bureau of Labor Statistics. Retrieved 17 July 2018, www.bls.gov/k12/content/teachers/pdf/atus_activity1_intro.pdf.

(*5) "Major Categories of Federal Income and Outlays for Fiscal Year 2016." Internal Revenue Service 1040A Instructions 2017. p. 89. Retrieved 17 July 2018, www.irs.gov/pub/irs-pdf/i1040a.pdf.

(*6) Dediu, Horace. "Measuring the iBook Market." Asymco. 28 February 2013, www.asymco.com/2013/02/28/measuring-the-ibook-market/.

(*7) Cerwonka, Michael. "Tell Jabba I've Got His Money: Star Wars Revenue Throughout Our Galaxy." Wired. 25 May 2012, www.wired.com/2012/05/tell-jabba-ive-got-his-money-star-wars-revenue-throughout-our-galaxy/.

2.6 Scatter plots

Scatter plots with quantitative variables

A scatter plot depicts the relationship between two variables on a rectangular coordinate system, where each axis corresponds to one variable. Scatter plots are used for both quantitative and categorical data.

In data analytics, scatter plots are especially useful in visualizing the relationship between variables in a multi-dimensional dataset. Ex: A marketing manager for a beverage company may want to check the relationship between average temperature and revenue within a specific time period or between revenue and marketing budget.

Example 2.6.1: Number of engineering faculty versus school rank.

To inform a decision of whether to hire new engineering faculty, a dean collected data showing number of faculty versus engineering school rank (using the U.S. News and World Report ranking; lower rank is better) for eight University of California campuses (UC Berkeley, UCLA, etc.). A table of the data (2014) is shown below. For example, the school with 247 engineering faculty is ranked number 3 in the country, while the school with only 78 engineering faculty is ranked number 81.

Engineering faculty    USNWR rank
247                    3
194                    14
155                    16
124                    19
198                    31
115                    38
91                     69
78                     81

Below is a scatter plot showing engineering faculty size vs. engineering school rank in 2014 for the eight campuses in the University of California system. Each row in the above table becomes a coordinate, leading to the following scatter plot.

The scatter plot clearly shows the relationship between number of faculty and rank. The data suggests, but does not prove, that increasing the number of engineering faculty may be important to improving rank.

A scatter plot often has numerous data points that are "scattered" about the rectangular coordinate system, leading to the name "scatter plot". Below is a scatter plot showing all 128 college football team rankings (lower is better; number 1 is best) and the total salary for each team's head coach. (Yes, college head coaches often have multi-million-dollar salaries.) A viewer quickly sees that more than half of coaches earn over $1 million, that many earn $3 or $4 million, and that most poorly ranked teams' coaches earn just a few hundred thousand dollars. The viewer can also see that several teams, in the lower left, seem to be getting a great bargain.

Figure 2.6.1: College football rankings vs. head coach salary.

PARTICIPATION ACTIVITY 2.6.1: Scatter plot.

Consider the scatter plot above showing number of engineering faculty versus rank for eight UC schools.

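1) What is the approximate rank for the school having about 150 engineering faculty? (About 5 / About 20 / About 80)

2) If a UC school had about 100 engineering faculty, what rank might that school expect to achieve? (80 / 55 / 20)

3) Above about 200 on the x-axis are two points, one at about 15 and another at about 30. Which of the following is the best inference? (Two schools having about 200 faculty have different ranks. / One school has two different ranks. / The data contains an error.)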

Independent and dependent variables

Commonly, a scatter plot shows how one variable depends on another. Above, the dean was interested in showing how rank might depend on the number of faculty. To distinguish dependent and independent variables, a question might be: If one variable's value is 5 (or some other value), how is the other variable affected? The variable that is controlled by an observer or is a reason for variation is the independent variable, while the variable that is then determined based on that variable is the dependent variable. The independent variable is usually plotted on the x-axis and the dependent variable on the y-axis. Typically, a desired independent variable value is first found along the x-axis, and the corresponding dependent variable value is found along the y-axis.

Below is an example showing how height depends on age for males aged 2 to 20 years. Ex: A question might be "If a male is aged 12 years, what height might he be?". Age is the independent variable and is thus plotted on the x-axis.

Example 2.6.2: Height vs. age for males aged 2 to 20.

The following scatter plot shows the median height for U.S. males aged 2 to 20 years.


PARTICIPATION ACTIVITY 2.6.2: Independent and dependent variables.

1) A dean wishes to know whether increasing faculty may lead to better rank. The number of faculty is the _____ variable.

2) Parents and doctors wish to compare a child's height to other children's heights of the same age. The height is the _____ variable.

3) A news station records temperature and humidity for various days and then creates a scatter plot to determine if any relationship exists. Is either temperature or humidity an obvious independent variable?

4) For the above scatter plot of male height, at about what age do males stop growing taller?

Regression curve

A regression curve is a curve added to a scatter plot that shows the relationship between two variables. A linear regression curve is also called a regression line or a trend line. The first figure below shows a linear trend line for the earlier data showing the relationship between coach salary and team rank, and the second figure shows an exponential regression curve for the earlier data showing the relationship between number of engineering faculty and rank.

Figure 2.6.2: Linear trend line added to scatter plot for data showing the relationship between head coach salary and team ranking.

Answer choices for Participation Activity 2.6.2:

1) dependent / independent

2) dependent / independent

3) Yes / No

4) 16 / 20


Figure 2.6.3: Exponential regression curve added to scatter plot for data showing the relationship between number of engineering faculty and rank.
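A linear trend line is commonly fit by ordinary least squares, which picks the slope and intercept that minimize the squared vertical distances between the points and the line. The sketch below is one minimal way to compute such a line in plain Python. The function name `fit_line` and the sample values are assumptions made for illustration; they are not the data behind the figures, and the sketch covers only the linear case, not the exponential curve.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up sample: x could be a salary in $ millions, y a ranking.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [90.0, 70.0, 65.0, 40.0, 35.0]
slope, intercept = fit_line(xs, ys)

# Residuals are the vertical gaps between each point and the fitted line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
```

The residuals here are nonzero, which illustrates that a regression line summarizes the points and does not need to pass through any of them.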

PARTICIPATION ACTIVITY 2.6.3: Regression curves.

1) Each point in a scatter plot lies somewhere on the regression curve.

2) Given data where the x-value represents the area to be painted in square feet and the y-value represents the amount of paint in gallons, the regression curve is likely _____.

3) Given data where the x-value represents a person's height and the y-value represents a person's age for ages 1-50, the regression curve is likely _____.

Scatter plots with categorical variables

Earlier examples showed the use of scatter plots to determine the relationship between two numerical variables. Ex: The relationship between petal length and petal width of iris flowers is linear. However, when one of the two variables is categorical, other plotting techniques should be used. Three types of plots are commonly used when dealing with categorical variables: (1) strip plots, (2) jittered strip plots, and (3) swarm plots.

Strip plots

A strip plot is a scatter plot where a categorical variable is plotted on one axis and a numerical variable on the other. Points in the same category are stacked on top of each other and form a single column or strip. A strip plot is useful in summarizing information about the dataset. Ex: The strip plot below shows the relationship between iris species and sepal length. The horizontal axis shows the different species of iris and the vertical axis shows the sepal length in centimeters.

Figure 2.6.4: Strip plot for the sepal lengths of iris species.

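Answer choices for Participation Activity 2.6.3:

1) True / False

2) (x: area to be painted, y: amount of paint) linear / non-linear

3) (x: height, y: age) linear / non-linear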


Jittered strip plots

Data points may overlap when the dataset is dense. Data points may share the same value or the marks may be too large compared to the plot's resolution. To get a better sense of the dataset, a data analyst might use scatter plot jittering. Jittering is the addition of random noise to the plot in order to prevent or minimize overlapping data points. Ex: The jittered strip plot below shows 6 samples of Iris virginica had sepal lengths between 7.5 cm and 8 cm, which is not clear from the strip plot above.

Figure 2.6.5: Jittered strip plot for the sepal lengths of iris species.
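Jittering can be sketched as adding a small random offset to each point's position along the categorical axis while leaving the measured value alone. The code below is a minimal illustration under assumptions: the function name `jitter_positions`, the noise width, the seed, and the sample values are all hypothetical and are not taken from the iris dataset.

```python
import random


def jitter_positions(category_indices, width=0.2, seed=None):
    """Return x positions: each category index plus uniform noise in [-width, width]."""
    rng = random.Random(seed)
    return [i + rng.uniform(-width, width) for i in category_indices]


# Hypothetical sample: three measurements in category 0 share the value 5.0,
# so they would be drawn on top of each other in a plain strip plot.
categories = [0, 0, 0, 1, 1]
values = [5.0, 5.0, 5.0, 6.3, 6.3]

xs = jitter_positions(categories, width=0.2, seed=42)
points = list(zip(xs, values))  # only the x positions change, never the values
```

Keeping the noise width well below half the spacing between categories keeps each point visibly inside its own strip. Fixing the seed makes the plot reproducible.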

Swarm plots

Jittered plots do not entirely prevent overlapping. An alternative approach to jittering is to use a swarm plot. A swarm plot uses a random algorithm to set a minimum distance between points.

Figure 2.6.6: Swarm plot for the sepal lengths of iris species.
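The idea of enforcing a minimum distance between points can be sketched with a simple greedy placement. This is an assumption made for illustration: it is deterministic, it is not the algorithm of any particular plotting library, and the names `swarm_offsets` and `_collides` are hypothetical. Each point keeps its measured value and is shifted sideways in whole marker widths until it no longer overlaps a point placed earlier.

```python
def _collides(step, value, placed, diameter):
    """True if a marker at (step * diameter, value) would overlap a placed marker."""
    for other_step, other_value in placed:
        dx = (step - other_step) * diameter
        dy = value - other_value
        if dx * dx + dy * dy < diameter * diameter:
            return True
    return False


def swarm_offsets(values, diameter):
    """Return one horizontal offset per value so that no two marker centers
    are closer than `diameter`. Offsets are returned in input order."""
    if diameter <= 0:
        raise ValueError("diameter must be positive")
    order = sorted(range(len(values)), key=lambda i: values[i])
    placed = []                  # (step, value) of markers positioned so far
    steps = [0] * len(values)
    for i in order:
        k = 0
        while True:
            # Try 0, then +1 and -1, then +2 and -2, ... marker widths from center.
            candidates = [0] if k == 0 else [k, -k]
            free = [s for s in candidates
                    if not _collides(s, values[i], placed, diameter)]
            if free:
                steps[i] = free[0]
                placed.append((free[0], values[i]))
                break
            k += 1
    return [s * diameter for s in steps]


# Hypothetical sample: three identical values, one close neighbor, one far away.
values = [5.0, 5.0, 5.0, 5.05, 6.0]
offsets = swarm_offsets(values, diameter=0.1)
```

Unlike jittering, this placement guarantees that markers do not overlap, at the cost of offsets that carry no meaning beyond layout.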

PARTICIPATION ACTIVITY 2.6.4: Strip plots and swarm plots.

Refer to the strip plots and swarm plots above.

1) The number of samples of each iris species in the dataset can be determined by looking at a strip plot.

2) Jittering gives a more accurate sense of a dataset's features.

References

(*1) "NCAA College Football Predictive Rankings and Ratings." TeamRankings. www.teamrankings.com/college-football/ranking/predictive-by-other.

(*2) "NCAA Salaries." USA Today. sports.usatoday.com/ncaa/salaries/.

Answer choices for Participation Activity 2.6.4:

1) True / False

2) True / False

2.7 Line charts

Introduction to line charts

A line chart (or line graph) depicts data trends by using straight lines to connect successive data points in a scatter plot. The straight lines show the general direction that data changes over time. Because trends involve time, line charts commonly use a time metric for the horizontal axis. Ex: Given the following data on Apple stock prices from March 2015 to March 2016, the following line chart shows how Apple's stock price changes (vertical axis) as each month passes (horizontal axis).

Table 2.7.1: Apple stock prices.

Date            Apple stock price (USD)
Mar 2, 2015     129.09
Apr 1, 2015     124.25
May 1, 2015     128.95
Jun 1, 2015     130.54
Jul 1, 2015     126.60
Aug 3, 2015     118.44
Sep 1, 2015     107.72
Oct 1, 2015     109.58
Nov 2, 2015     121.18
Dec 1, 2015     117.34
Jan 4, 2016     105.35
Feb 1, 2016     96.43
Mar 1, 2016     100.53

Source: Yahoo! Finance, 2016 (*1)

Figure 2.7.1: Apple stock charts (March 2015 - March 2016) with and without lines.

The main benefit of a line graph is to quickly convey whether values are increasing, decreasing, or remaining constant between data points. Steeper lines indicate more rapid increases or decreases, while flatter lines indicate little change between data points. Ex: The line graph above clearly shows that the steepest increase in the stock value was from October 2015 to November 2015, which may lead investors to research what happened to Apple in October 2015.

Lines also help convey that values exist between data points. Ex: Although the Apple line chart shows two consecutive data points for July 1 and August 3, the stock price took on many values in between those dates. The line connecting July 1 and August 3 does not represent real data, but rather, a basic trend of the data change between data points.

PARTICIPATION ACTIVITY 2.7.1: Interpreting line chart trends.

Consider the following line chart of Alphabet (Google's parent company) stock from March 2015 to March 2016:


1) Google's stock price decreased from December 2015 to January 2016.

2) Google's stock price never increases more than two months in a row.

3) Google's stock price had the largest increase from November to December.

4) Google's stock price reached a 52-week low in June.

5) Relative to the rest of the graph, Google's stock price remains mostly constant from April 2015 to June 2015.

6) $575 is a likely price for Google stock on July 15, 2015.

A linear trend line is a straight line that depicts the general direction data changes from the first to last data point, often added to summarize the entire chart. A good linear trend line is typically computed using various techniques such as linear regression (discussed elsewhere), and is not a simple connection of the first and last points.

In the Apple stock line chart below, a linear trend line is added in red and starts slightly above the first data point. While the stock price had two large increases from March 2015 to March 2016, the linear trend line clearly shows that the stock price tended to decrease during that time.

Figure 2.7.2: Apple stock prices (March 2015 - March 2016) line chart with overall trend line.
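Both readings of the Apple chart, the steepest segment and the overall trend, can be checked numerically from the prices in Table 2.7.1. The sketch below is one way to do so in plain Python. It assumes the months are evenly spaced (index 0 to 12) and uses ordinary least squares for the trend line, which may differ from how the figure's line was produced; the variable names are illustrative.

```python
# Apple monthly prices (USD) from Table 2.7.1.
dates = ["Mar 2, 2015", "Apr 1, 2015", "May 1, 2015", "Jun 1, 2015",
         "Jul 1, 2015", "Aug 3, 2015", "Sep 1, 2015", "Oct 1, 2015",
         "Nov 2, 2015", "Dec 1, 2015", "Jan 4, 2016", "Feb 1, 2016",
         "Mar 1, 2016"]
prices = [129.09, 124.25, 128.95, 130.54, 126.60, 118.44, 107.72,
          109.58, 121.18, 117.34, 105.35, 96.43, 100.53]

# Change along each line segment; a steeper segment means a larger change.
changes = [later - earlier for earlier, later in zip(prices, prices[1:])]
i = max(range(len(changes)), key=lambda k: changes[k])
steepest_increase = (dates[i], dates[i + 1], changes[i])

# Least-squares trend line over month index 0..12
# (assumption: months are treated as evenly spaced).
n = len(prices)
months = list(range(n))
mean_x = sum(months) / n
mean_y = sum(prices) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, prices))
         / sum((x - mean_x) ** 2 for x in months))
intercept = mean_y - slope * mean_x  # trend line value at the first data point
```

Under these assumptions, the largest rise is the segment from Oct 1, 2015 to Nov 2, 2015, the slope is negative (roughly -2.55 USD per month), and the line starts near 131.9, slightly above the first price of 129.09. All three agree with the description of the chart.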

PARTICIPATION ACTIVITY 2.7.2: Interpreting linear trend lines on a line chart.

Consider the following line chart of Alphabet (owner of Google) stock prices from March 2015 to March 2016:

Answer choices for Participation Activity 2.7.1:

1) True / False

2) True / False

3) True / False

4) (52-week low) True / False

5) True / False

6) ($575) True / False


1) What is Google's stock price in March 2015?

2) What is the difference in Google's stock price between March 2015 and March 2016?

3) Based only on the chart, does Google's stock price seem more likely to increase or decrease from March 2016 to March 2017 (1 year)?

Multiple datasets are commonly shown in one line chart to highlight differences. Each dataset is distinguished by different color, data point shape, and/or line style, as noted by a legend. Ex: The line chart below shows how temperatures in Los Angeles, California, USA and Durban, South Africa differ per month due to being on opposite sides of the equator. The chart also shows how Los Angeles has more extreme temperature swings than Durban.

Figure 2.7.3: Line chart showing multiple datasets: average high temperatures for Los Angeles, USA, and Durban, South Africa.

PARTICIPATION ACTIVITY 2.7.3: Interpreting multiple datasets on a line chart.

Consider the following line chart of average high temperatures for Pueblo, Colorado, USA, Tianshui, China, and Asunción, Paraguay.

Answer choices for Participation Activity 2.7.2:

1) Around $500 / Around $575 / Around $720

2) About $250 / About $150

3) Increase / Decrease

Sources for Figure 2.7.3: Wikipedia (Los Angeles) (*2), Wikipedia (Durban) (*3)


1) What is the average high temperature for Pueblo in March?

2) Which city has the lowest average high temperature at any point during the year?

3) In which month are the average high temperatures for Pueblo, Tianshui, and Asunción the most similar?

4) Which city has average high temperatures most unlike the other cities?

Misuse of line charts

Using line charts to represent categorical data

A line chart should not be used for nominal categorical data. Lines suggest some relation from one item to the next, but nominal variables have no ordering so can have no such relation. Ex: The plot below on the left inappropriately shows lines, even though no relationship exists between South Dakota and New Hampshire, for example. However, representing the unemployment rate of each state with a point, without connecting the individual points, would be appropriate. Using a bar chart is also common.

Figure 2.7.4: A line chart is not appropriate for categorical data.

PARTICIPATION ACTIVITY 2.7.4: Line charts.

When is a line chart appropriate?

1) The x-axis is the year ranging from 2000 to 2015, the y-axis is the amount of rainfall.

2) The x-axis shows a movie rating: G, PG, PG-13, R, and NC-17. The y-axis is the number of movies released in 2015 of a given rating.

3) The x-axis is a color: black, blue, green, red, silver, or white. The y-axis is the number of cars of a given color.

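Sources for the chart in Participation Activity 2.7.3: Wikipedia (Pueblo) (*4), Wikipedia (Tianshui) (*5), Wikipedia (Asunción) (*6)

Answer choices for Participation Activity 2.7.3:

1) About 90°F / About 60°F / About 53°F

2) Asunción / Tianshui / Pueblo

3) April / May / September

4) Asunción / Tianshui / Pueblo

Answer choices for Participation Activity 2.7.4:

1) (x-axis: year, y-axis: rainfall) Appropriate / Not appropriate

2) (x-axis: movie rating, y-axis: number of movies) Appropriate / Not appropriate

3) (x-axis: color, y-axis: number of cars) Appropriate / Not appropriate

Omitting labels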


Common mistakes are to forget to label an axis or to forget to provide units. Without such information, a viewer cannot appropriately interpret the information.

Figure 2.7.5: Average monthly temperatures in Hawaii, with some labels/units missing.

PARTICIPATION ACTIVITY 2.7.5: Missing labels or units.

Refer to the above figure.

1) The units for Temperature are _____.

2) The label for the x-axis is ______.

3) The label for the y-axis is _____.

4) If a y-axis is labeled % or ratio, as in % revenue, or ratio of male/female births, additional units are _______.

References

(*1) "Apple Inc." Yahoo! Finance. 2016, finance.yahoo.com/echarts?s=aapl.

(*2) Wikipedia Contributors. "Los Angeles." Wikipedia, The Free Encyclopedia. Retrieved 17 July 2018, en.wikipedia.org/wiki/Los_Angeles.

(*3) Wikipedia Contributors. "Durban." Wikipedia, The Free Encyclopedia. Retrieved 17 July 2018, en.wikipedia.org/wiki/Durban.

(*4) Wikipedia Contributors. "Pueblo, Colorado." Wikipedia, The Free Encyclopedia. Retrieved 17 July 2018, en.wikipedia.org/wiki/Pueblo,_Colorado.

(*5) Wikipedia Contributors. "Tianshui." Wikipedia, The Free Encyclopedia. Retrieved 17 July 2018, en.wikipedia.org/wiki/Tianshui.

(*6) Wikipedia Contributors. "Asunción." Wikipedia, The Free Encyclopedia. Retrieved 17 July 2018, en.wikipedia.org/wiki/Asunción.

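Answer choices for Participation Activity 2.7.5:

1) present / missing

2) (x-axis label) present / missing

3) (y-axis label) present / missing

4) (y-axis labeled % or ratio) required / not necessary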
