
Business Analytics Third Edition

David R. Anderson University of Cincinnati

Dennis J. Sweeney University of Cincinnati

Thomas A. Williams Rochester Institute of Technology

Jeffrey D. Camm Wake Forest University

Michael J. Fry University of Cincinnati

James J. Cochran University of Alabama

Jeffrey W. Ohlmann University of Iowa

Australia • Brazil • Mexico • Singapore • United Kingdom • United States

Copyright 2019 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it.

Analytic Solver®

www.solver.com/aspe

Your new textbook, Business Analytics, 3e, uses this software in the eBook (MindTap Reader). Here’s how to get it for your course.

For Instructors:

Setting Up the Course Code

To set up a course code for your course, please email Frontline Systems at [email protected], or call 775-831-0300, press 0, and ask for the Academic Coordinator. Course codes MUST be set up each time a course is taught.

A course code is free, and it can usually be issued within 24 to 48 hours (often the same day). It will enable your students to download and install Analytic Solver, use the software free for 2 weeks, and continue to use the software for 20 weeks (a typical semester) for a small charge, using the course code to obtain a steep discount. It will enable Frontline Systems to assist students with installation, and provide technical support to you during the course.

Please give the course code, plus the instructions on the reverse side, to your students. If you’re evaluating the book for adoption, you can use the course code yourself to download and install the software as described on the reverse.

Instructions for Students: See reverse.

For Students:

Installing Analytic Solver®

1) To download and install Analytic Solver (also called Analytic Solver® Basic) from Frontline Systems to work with Microsoft® Excel® for Windows®, please visit:

www.solver.com/student

2) Fill out the registration form on this page, supplying your name, school, email address (key information will be sent to this address), course code (obtain this from your instructor), and textbook code (enter CCFOEBA3).

3) On the download page, click the Download Now button, and save the downloaded file (SolverSetup.exe).

4) Close any Excel® windows you have open. Make sure you’re still connected to the Internet.

5) Run SolverSetup.exe to install the software. Be patient – all necessary files will be downloaded and installed.

If you have problems downloading or installing, your best options in order are to (i) visit www.solver.com and click the Live Chat link at the top; (ii) email [email protected] and watch for a reply; (iii) call 775-831-0300 and press 4 (tech support). Say that you have Analytic Solver for Education, and have your course code and textbook code available.

If you have problems setting up or solving your model, or interpreting the results, please ask your instructor for assistance. Frontline Systems cannot help you with homework problems.

If you have this textbook but you aren’t enrolled in a course, visit www.solver.com or call 775-831-0300 and press 0 for advice on courses and software licenses.

If you have a Mac, your options are to (i) use a version of Analytic Solver through a web browser at https://analyticsolver.com, or (ii) install “dual-boot” or VM software, Microsoft Windows®, and Excel® for Windows®. Excel® for Mac will NOT work.

This is an electronic version of the print textbook. Due to electronic rights restrictions, some third party content may be suppressed. Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. The publisher reserves the right to remove content from this title at any time if subsequent rights restrictions require it. For valuable information on pricing, previous editions, changes to current editions, and alternate formats, please visit www.cengage.com/highered to search by ISBN#, author, title, or keyword for materials in your areas of interest.

Important Notice: Media content referenced within the product description or the product text may not be available in the eBook version.

Unless otherwise noted, all content is © Cengage.

ALL RIGHTS RESERVED. No part of this work covered by the copyright herein may be reproduced or distributed in any form or by any means, except as permitted by U.S. copyright law, without the prior written permission of the copyright owner.

For product information and technology assistance, contact us at Cengage Customer & Sales Support, 1-800-354-9706 or support.cengage.com.

For permission to use material from this text or product, submit all requests online at www.cengage.com/permissions.

Screenshots are ©Microsoft Corporation unless otherwise noted.

Library of Congress Control Number: 2017963667

ISBN: 978-1-337-40642-0

Cengage

20 Channel Center Street Boston, MA 02210 USA

Cengage is a leading provider of customized learning solutions with employees residing in nearly 40 different countries and sales in more than 125 countries around the world. Find your local representative at www.cengage.com.

Cengage products are represented in Canada by Nelson Education, Ltd.

To learn more about Cengage platforms and services, visit www.cengage.com.

To register or access your online learning solution or purchase materials for your course, visit www.cengagebrain.com.

Business Analytics, Third Edition

Jeffrey D. Camm, James J. Cochran, Michael J. Fry, Jeffrey W. Ohlmann, David R. Anderson, Dennis J. Sweeney, Thomas A. Williams

Senior Vice President, Higher Ed Product, Content, and Market Development: Erin Joyner

Vice President, B&E, 4LTR, and Support Programs: Mike Schenk

Senior Product Team Manager: Joe Sabatino

Senior Product Manager: Aaron Arnsparger

Senior Digital Content Designer: Brandon Foltz

Content Developer: Anne Merrill

Content Project Manager: D. Jean Buttrom

Product Assistant: Renee Schnee

Senior Marketing Manager: Nathan Anderson

Production Service: SPi Global

Senior Art Director: Michelle Kunkler

Cover and Text Designer: Beckmeyer Design

Cover Image: iStockPhoto.com/tawanlubfah

Intellectual Property Analyst: Reba Frederics

Intellectual Property Project Manager: Nick Barrows

Printed in the United States of America Print Number: 01 Print Year: 2018

Brief Contents

ABOUT THE AUTHORS XIX

PREFACE XXIII

CHAPTER 1 Introduction 2

CHAPTER 2 Descriptive Statistics 18

CHAPTER 3 Data Visualization 82

CHAPTER 4 Descriptive Data Mining 138

CHAPTER 5 Probability: An Introduction to Modeling Uncertainty 166

CHAPTER 6 Statistical Inference 220

CHAPTER 7 Linear Regression 294

CHAPTER 8 Time Series Analysis and Forecasting 372

CHAPTER 9 Predictive Data Mining 422

CHAPTER 10 Spreadsheet Models 464

CHAPTER 11 Monte Carlo Simulation 500

CHAPTER 12 Linear Optimization Models 556

CHAPTER 13 Integer Linear Optimization Models 606

CHAPTER 14 Nonlinear Optimization Models 646

CHAPTER 15 Decision Analysis 678

APPENDIX A Basics of Excel 724

APPENDIX B Database Basics with Microsoft Access 736

APPENDIX C Solutions to Even-Numbered Questions (MindTap Reader)

REFERENCES 774

INDEX 776


Contents

ABOUT THE AUTHORS XIX

PREFACE XXIII

CHAPTER 1 Introduction 2
1.1 Decision Making 4
1.2 Business Analytics Defined 5
1.3 A Categorization of Analytical Methods and Models 6
    Descriptive Analytics 6
    Predictive Analytics 6
    Prescriptive Analytics 7
1.4 Big Data 7
    Volume 9
    Velocity 9
    Variety 9
    Veracity 9
1.5 Business Analytics in Practice 11
    Financial Analytics 11
    Human Resource (HR) Analytics 12
    Marketing Analytics 12
    Health Care Analytics 12
    Supply-Chain Analytics 13
    Analytics for Government and Nonprofits 13
    Sports Analytics 13
    Web Analytics 14
Summary 14
Glossary 15

CHAPTER 2 Descriptive Statistics 18
Analytics in Action: U.S. Census Bureau 19
2.1 Overview of Using Data: Definitions and Goals 19
2.2 Types of Data 21
    Population and Sample Data 21
    Quantitative and Categorical Data 21
    Cross-Sectional and Time Series Data 21
    Sources of Data 21
2.3 Modifying Data in Excel 24
    Sorting and Filtering Data in Excel 24
    Conditional Formatting of Data in Excel 27
2.4 Creating Distributions from Data 29
    Frequency Distributions for Categorical Data 29
    Relative Frequency and Percent Frequency Distributions 30
    Frequency Distributions for Quantitative Data 31
    Histograms 34
    Cumulative Distributions 37
2.5 Measures of Location 39
    Mean (Arithmetic Mean) 39
    Median 40
    Mode 41
    Geometric Mean 41
2.6 Measures of Variability 44
    Range 44
    Variance 45
    Standard Deviation 46
    Coefficient of Variation 47
2.7 Analyzing Distributions 47
    Percentiles 48
    Quartiles 49
    z-Scores 49
    Empirical Rule 50
    Identifying Outliers 52
    Box Plots 52
2.8 Measures of Association Between Two Variables 55
    Scatter Charts 55
    Covariance 57
    Correlation Coefficient 60
2.9 Data Cleansing 61
    Missing Data 61
    Blakely Tires 63
    Identification of Erroneous Outliers and Other Erroneous Values 65
    Variable Representation 67
Summary 68
Glossary 69
Problems 71
Case Problem: Heavenly Chocolates Web Site Transactions 79
Appendix 2.1 Creating Box Plots with Analytic Solver (MindTap Reader)

CHAPTER 3 Data Visualization 82
Analytics in Action: Cincinnati Zoo & Botanical Garden 83
3.1 Overview of Data Visualization 85
    Effective Design Techniques 85
3.2 Tables 88
    Table Design Principles 89
    Crosstabulation 90
    PivotTables in Excel 93
    Recommended PivotTables in Excel 97
3.3 Charts 99
    Scatter Charts 99
    Recommended Charts in Excel 101
    Line Charts 102
    Bar Charts and Column Charts 106
    A Note on Pie Charts and Three-Dimensional Charts 107
    Bubble Charts 109
    Heat Maps 110
    Additional Charts for Multiple Variables 112
    PivotCharts in Excel 115
3.4 Advanced Data Visualization 117
    Advanced Charts 117
    Geographic Information Systems Charts 120
3.5 Data Dashboards 122
    Principles of Effective Data Dashboards 123
    Applications of Data Dashboards 123
Summary 125
Glossary 125
Problems 126
Case Problem: All-Time Movie Box-Office Data 136
Appendix 3.1 Creating a Scatter-Chart Matrix and a Parallel-Coordinates Plot with Analytic Solver (MindTap Reader)

CHAPTER 4 Descriptive Data Mining 138
Analytics in Action: Advice from a Machine 139
4.1 Cluster Analysis 140
    Measuring Similarity Between Observations 140
    Hierarchical Clustering 143
    k-Means Clustering 146
    Hierarchical Clustering versus k-Means Clustering 147
4.2 Association Rules 148
    Evaluating Association Rules 150
4.3 Text Mining 151
    Voice of the Customer at Triad Airline 151
    Preprocessing Text Data for Analysis 153
    Movie Reviews 154
Summary 155
Glossary 155
Problems 156
Case Problem: Know Thy Customer 164
Available in the MindTap Reader:
    Appendix 4.1 Hierarchical Clustering with Analytic Solver
    Appendix 4.2 k-Means Clustering with Analytic Solver
    Appendix 4.3 Association Rules with Analytic Solver
    Appendix 4.4 Text Mining with Analytic Solver
    Appendix 4.5 Opening and Saving Excel files in JMP Pro
    Appendix 4.6 Hierarchical Clustering with JMP Pro
    Appendix 4.7 k-Means Clustering with JMP Pro
    Appendix 4.8 Association Rules with JMP Pro
    Appendix 4.9 Text Mining with JMP Pro

CHAPTER 5 Probability: An Introduction to Modeling Uncertainty 166
Analytics in Action: National Aeronautics and Space Administration 167
5.1 Events and Probabilities 168
5.2 Some Basic Relationships of Probability 169
    Complement of an Event 169
    Addition Law 170
5.3 Conditional Probability 172
    Independent Events 177
    Multiplication Law 177
    Bayes’ Theorem 178
5.4 Random Variables 180
    Discrete Random Variables 180
    Continuous Random Variables 181
5.5 Discrete Probability Distributions 182
    Custom Discrete Probability Distribution 182
    Expected Value and Variance 184
    Discrete Uniform Probability Distribution 187
    Binomial Probability Distribution 188
    Poisson Probability Distribution 191
5.6 Continuous Probability Distributions 194
    Uniform Probability Distribution 194
    Triangular Probability Distribution 196
    Normal Probability Distribution 198
    Exponential Probability Distribution 203
Summary 207
Glossary 207
Problems 209
Case Problem: Hamilton County Judges 218

CHAPTER 6 Statistical Inference 220
Analytics in Action: John Morrell & Company 221
6.1 Selecting a Sample 223
    Sampling from a Finite Population 223
    Sampling from an Infinite Population 224
6.2 Point Estimation 227
    Practical Advice 229
6.3 Sampling Distributions 229
    Sampling Distribution of x̄ 232
    Sampling Distribution of p̄ 237
6.4 Interval Estimation 240
    Interval Estimation of the Population Mean 240
    Interval Estimation of the Population Proportion 247
6.5 Hypothesis Tests 250
    Developing Null and Alternative Hypotheses 250
    Type I and Type II Errors 253
    Hypothesis Test of the Population Mean 254
    Hypothesis Test of the Population Proportion 265
6.6 Big Data, Statistical Inference, and Practical Significance 268
    Sampling Error 268
    Nonsampling Error 269
    Big Data 270
    Understanding What Big Data Is 271
    Big Data and Sampling Error 272
    Big Data and the Precision of Confidence Intervals 273
    Implications of Big Data for Confidence Intervals 274
    Big Data, Hypothesis Testing, and p Values 275
    Implications of Big Data in Hypothesis Testing 277
Summary 278
Glossary 279
Problems 281
Case Problem 1: Young Professional Magazine 291
Case Problem 2: Quality Associates, Inc. 292

CHAPTER 7 Linear Regression 294
Analytics in Action: Alliance Data Systems 295
7.1 Simple Linear Regression Model 296
    Regression Model 296
    Estimated Regression Equation 296
7.2 Least Squares Method 298
    Least Squares Estimates of the Regression Parameters 300
    Using Excel’s Chart Tools to Compute the Estimated Regression Equation 302
7.3 Assessing the Fit of the Simple Linear Regression Model 304
    The Sums of Squares 304
    The Coefficient of Determination 306
    Using Excel’s Chart Tools to Compute the Coefficient of Determination 307
7.4 The Multiple Regression Model 308
    Regression Model 308
    Estimated Multiple Regression Equation 308
    Least Squares Method and Multiple Regression 309
    Butler Trucking Company and Multiple Regression 310
    Using Excel’s Regression Tool to Develop the Estimated Multiple Regression Equation 310
7.5 Inference and Regression 313
    Conditions Necessary for Valid Inference in the Least Squares Regression Model 314
    Testing Individual Regression Parameters 318
    Addressing Nonsignificant Independent Variables 321
    Multicollinearity 322
7.6 Categorical Independent Variables 325
    Butler Trucking Company and Rush Hour 325
    Interpreting the Parameters 327
    More Complex Categorical Variables 328
7.7 Modeling Nonlinear Relationships 330
    Quadratic Regression Models 331
    Piecewise Linear Regression Models 335
    Interaction Between Independent Variables 337
7.8 Model Fitting 342
    Variable Selection Procedures 342
    Overfitting 343
7.9 Big Data and Regression 344
    Inference and Very Large Samples 344
    Model Selection 348
7.10 Prediction with Regression 349
Summary 351
Glossary 352
Problems 354
Case Problem: Alumni Giving 369
Appendix 7.1 Regression with Analytic Solver (MindTap Reader)

CHAPTER 8 Time Series Analysis and Forecasting 372
Analytics in Action: ACCO Brands 373
8.1 Time Series Patterns 375
    Horizontal Pattern 375
    Trend Pattern 377
    Seasonal Pattern 378
    Trend and Seasonal Pattern 379
    Cyclical Pattern 382
    Identifying Time Series Patterns 382
8.2 Forecast Accuracy 382
8.3 Moving Averages and Exponential Smoothing 386
    Moving Averages 387
    Exponential Smoothing 391
8.4 Using Regression Analysis for Forecasting 395
    Linear Trend Projection 395
    Seasonality Without Trend 397
    Seasonality with Trend 398
    Using Regression Analysis as a Causal Forecasting Method 401
    Combining Causal Variables with Trend and Seasonality Effects 404
    Considerations in Using Regression in Forecasting 405
8.5 Determining the Best Forecasting Model to Use 405
Summary 406
Glossary 406
Problems 407
Case Problem: Forecasting Food and Beverage Sales 415
Appendix 8.1 Using the Excel Forecast Sheet 416
Appendix 8.2 Forecasting with Analytic Solver (MindTap Reader)

CHAPTER 9 Predictive Data Mining 422
Analytics in Action: Orbitz 423
9.1 Data Sampling, Preparation, and Partitioning 424
9.2 Performance Measures 425
    Evaluating the Classification of Categorical Outcomes 425
    Evaluating the Estimation of Continuous Outcomes 431
9.3 Logistic Regression 432
9.4 k-Nearest Neighbors 436
    Classifying Categorical Outcomes with k-Nearest Neighbors 436
    Estimating Continuous Outcomes with k-Nearest Neighbors 438
9.5 Classification and Regression Trees 439
    Classifying Categorical Outcomes with a Classification Tree 439
    Estimating Continuous Outcomes with a Regression Tree 445
    Ensemble Methods 446
Summary 449
Glossary 450
Problems 452
Case Problem: Grey Code Corporation 462
Available in the MindTap Reader:
    Appendix 9.1 Data Partitioning with Analytic Solver
    Appendix 9.2 Logistic Regression Classification with Analytic Solver
    Appendix 9.3 k-Nearest Neighbor Classification and Estimation with Analytic Solver
    Appendix 9.4 Single Classification and Regression Trees with Analytic Solver
    Appendix 9.5 Random Forests of Classification or Regression Trees with Analytic Solver
    Appendix 9.6 Data Partitioning with JMP Pro
    Appendix 9.7 Logistic Regression Classification with JMP Pro
    Appendix 9.8 k-Nearest Neighbor Classification and Estimation with JMP Pro
    Appendix 9.9 Single Classification and Regression Trees with JMP Pro
    Appendix 9.10 Random Forests of Classification and Regression Trees with JMP Pro

CHAPTER 10 Spreadsheet Models 464
Analytics in Action: Procter & Gamble 465
10.1 Building Good Spreadsheet Models 466
    Influence Diagrams 466
    Building a Mathematical Model 466
    Spreadsheet Design and Implementing the Model in a Spreadsheet 468
10.2 What-If Analysis 471
    Data Tables 471
    Goal Seek 473
    Scenario Manager 475
10.3 Some Useful Excel Functions for Modeling 480
    SUM and SUMPRODUCT 481
    IF and COUNTIF 483
    VLOOKUP 485
10.4 Auditing Spreadsheet Models 487
    Trace Precedents and Dependents 487
    Show Formulas 487
    Evaluate Formulas 489
    Error Checking 489
    Watch Window 490
10.5 Predictive and Prescriptive Spreadsheet Models 491
Summary 492
Glossary 492
Problems 493
Case Problem: Retirement Plan 499

CHAPTER 11 Monte Carlo Simulation 500
Analytics in Action: Polio Eradication 501
11.1 Risk Analysis for Sanotronics LLC 502
    Base-Case Scenario 502
    Worst-Case Scenario 503
    Best-Case Scenario 503
    Sanotronics Spreadsheet Model 503
    Use of Probability Distributions to Represent Random Variables 504
    Generating Values for Random Variables with Excel 506
    Executing Simulation Trials with Excel 510
    Measuring and Analyzing Simulation Output 510
11.2 Simulation Modeling for Land Shark Inc. 514
    Spreadsheet Model for Land Shark 515
    Generating Values for Land Shark’s Random Variables 517
    Executing Simulation Trials and Analyzing Output 519
    Generating Bid Amounts with Fitted Distributions 522
11.3 Simulation with Dependent Random Variables 527
    Spreadsheet Model for Press Teag Worldwide 527
11.4 Simulation Considerations 532
    Verification and Validation 532
    Advantages and Disadvantages of Using Simulation 532
Summary 533
Glossary 534
Problems 534
Case Problem: Four Corners 547
Appendix 11.1 Common Probability Distributions for Simulation 549
Available in the MindTap Reader:
    Appendix 11.2 Land Shark Inc. Simulation with Analytic Solver
    Appendix 11.3 Distribution Fitting with Analytic Solver
    Appendix 11.4 Correlating Random Variables with Analytic Solver
    Appendix 11.5 Simulation Optimization with Analytic Solver

CHAPTER 12 Linear Optimization Models 556
Analytics in Action: General Electric 557
12.1 A Simple Maximization Problem 558
    Problem Formulation 559
    Mathematical Model for the Par, Inc. Problem 561
12.2 Solving the Par, Inc. Problem 561
    The Geometry of the Par, Inc. Problem 562
    Solving Linear Programs with Excel Solver 564
12.3 A Simple Minimization Problem 568
    Problem Formulation 568
    Solution for the M&D Chemicals Problem 568
12.4 Special Cases of Linear Program Outcomes 570
    Alternative Optimal Solutions 571
    Infeasibility 572
    Unbounded 573
12.5 Sensitivity Analysis 575
    Interpreting Excel Solver Sensitivity Report 575
12.6 General Linear Programming Notation and More Examples 577
    Investment Portfolio Selection 578
    Transportation Planning 580
    Advertising Campaign Planning 584
12.7 Generating an Alternative Optimal Solution for a Linear Program 589
Summary 591
Glossary 592
Problems 593
Case Problem: Investment Strategy 604
Appendix 12.1 Solving Linear Optimization Models Using Analytic Solver (MindTap Reader)

CHAPTER 13 Integer Linear Optimization Models 606
Analytics in Action: Petrobras 607
13.1 Types of Integer Linear Optimization Models 607
13.2 Eastborne Realty, an Example of Integer Optimization 608
    The Geometry of Linear All-Integer Optimization 609
13.3 Solving Integer Optimization Problems with Excel Solver 611
    A Cautionary Note About Sensitivity Analysis 614
13.4 Applications Involving Binary Variables 616
    Capital Budgeting 616
    Fixed Cost 618
    Bank Location 621
    Product Design and Market Share Optimization 623
13.5 Modeling Flexibility Provided by Binary Variables 626
    Multiple-Choice and Mutually Exclusive Constraints 626
    k Out of n Alternatives Constraint 627
    Conditional and Corequisite Constraints 627
13.6 Generating Alternatives in Binary Optimization 628
Summary 630
Glossary 631
Problems 632
Case Problem: Applecore Children’s Clothing 643
Appendix 13.1 Solving Integer Linear Optimization Problems Using Analytic Solver (MindTap Reader)

CHAPTER 14 Nonlinear Optimization Models 646
Analytics in Action: InterContinental Hotels 647
14.1 A Production Application: Par, Inc. Revisited 647
    An Unconstrained Problem 647
    A Constrained Problem 648
    Solving Nonlinear Optimization Models Using Excel Solver 650
    Sensitivity Analysis and Shadow Prices in Nonlinear Models 651
14.2 Local and Global Optima 652
    Overcoming Local Optima with Excel Solver 655
14.3 A Location Problem 657
14.4 Markowitz Portfolio Model 658
14.5 Forecasting Adoption of a New Product 663
Summary 666
Glossary 667
Problems 667
Case Problem: Portfolio Optimization with Transaction Costs 675
Appendix 14.1 Solving Nonlinear Optimization Problems with Analytic Solver (MindTap Reader)

CHAPTER 15 Decision Analysis 678
Analytics in Action: Phytopharm 679
15.1 Problem Formulation 680
    Payoff Tables 681
    Decision Trees 681
15.2 Decision Analysis without Probabilities 682
    Optimistic Approach 682
    Conservative Approach 683
    Minimax Regret Approach 683
15.3 Decision Analysis with Probabilities 685
    Expected Value Approach 685
    Risk Analysis 687
    Sensitivity Analysis 688
15.4 Decision Analysis with Sample Information 689
    Expected Value of Sample Information 694
    Expected Value of Perfect Information 694
15.5 Computing Branch Probabilities with Bayes’ Theorem 695
15.6 Utility Theory 698
    Utility and Decision Analysis 699
    Utility Functions 703
    Exponential Utility Function 706
Summary 708
Glossary 708
Problems 710
Case Problem: Property Purchase Strategy 721
Appendix 15.1 Using Analytic Solver to Create Decision Trees (MindTap Reader)

APPENDIX A Basics of Excel 724

APPENDIX B Database Basics with Microsoft Access 736

APPENDIX C Solutions to Even-Numbered Questions (MindTap Reader)

REFERENCES 774

INDEX 776

About the Authors

Jeffrey D. Camm. Jeffrey D. Camm is the Inmar Presidential Chair and Associate Dean of Analytics in the School of Business at Wake Forest University. Born in Cincinnati, Ohio, he holds a B.S. from Xavier University (Ohio) and a Ph.D. from Clemson University. Prior to joining the faculty at Wake Forest, he was on the faculty of the University of Cincinnati. He has also been a visiting scholar at Stanford University and a visiting professor of business administration at the Tuck School of Business at Dartmouth College.

Dr. Camm has published over 35 papers in the general area of optimization applied to problems in operations management and marketing. He has published his research in Science, Management Science, Operations Research, Interfaces, and other professional journals. Dr. Camm was named the Dornoff Fellow of Teaching Excellence at the University of Cincinnati, and he was the 2006 recipient of the INFORMS Prize for the Teaching of Operations Research Practice. A firm believer in practicing what he preaches, he has served as an analytics consultant to numerous companies and government agencies. From 2005 to 2010 he served as editor-in-chief of Interfaces. In 2016, Dr. Camm was awarded the Kimball Medal for service to the operations research profession, and in 2017 he was named an INFORMS Fellow.

James J. Cochran. James J. Cochran is Associate Dean for Research, Professor of Applied Statistics, and the Rogers-Spivey Faculty Fellow at the University of Alabama. Born in Dayton, Ohio, he earned his B.S., M.S., and M.B.A. degrees from Wright State University and a Ph.D. from the University of Cincinnati. He has been at the University of Alabama since 2014 and has been a visiting scholar at Stanford University, Universidad de Talca, the University of South Africa, and Pole Universitaire Leonard de Vinci.

Professor Cochran has published over three dozen papers in the development and application of operations research and statistical methods. He has published his research in Management Science, The American Statistician, Communications in Statistics—Theory and Methods, Annals of Operations Research, European Journal of Operational Research, Journal of Combinatorial Optimization, Interfaces, Statistics and Probability Letters, and other professional journals. He was the 2008 recipient of the INFORMS Prize for the Teaching of Operations Research Practice and the 2010 recipient of the Mu Sigma Rho Statistical Education Award. Professor Cochran was elected to the International Statistics Institute in 2005 and named a Fellow of the American Statistical Association in 2011 and a Fellow of the Institute for Operations Research and the Management Sciences (INFORMS) in 2017. He also received the Founders Award in 2014, the Karl E. Peace Award in 2015, and the Waller Distinguished Teaching Career Award in 2017 from the American Statistical Association. A strong advocate for effective operations research and statistics education as a means of improving the quality of applications to real problems, Professor Cochran has organized and chaired teaching effectiveness workshops in Montevideo, Uruguay; Cape Town, South Africa; Cartagena, Colombia; Jaipur, India; Buenos Aires, Argentina; Nairobi, Kenya; Buea, Cameroon; Suva, Fiji; Kathmandu, Nepal; Osijek, Croatia; Havana, Cuba; Ulaanbaatar, Mongolia; and Chişinău, Moldova. He has served as an operations research consultant to numerous companies and not-for-profit organizations. He served as editor-in-chief of INFORMS Transactions on Education from 2006 to 2012 and is on the editorial board of Interfaces, International Transactions in Operational Research, and Significance.

Michael J. Fry. Michael J. Fry is Professor and Head of the Department of Operations, Business Analytics, and Information Systems in the Carl H. Lindner College of Business at the University of Cincinnati. Born in Killeen, Texas, he earned a B.S. from Texas A&M University, and M.S.E. and Ph.D. degrees from the University of Michigan. He has been at the University of Cincinnati since 2002, where he has been named a Lindner Research Fellow and has served as Assistant Director and Interim Director of the Center for Business Analytics. He has also been a visiting professor at Cornell University and at the University of British Columbia.

Professor Fry has published over 20 research papers in journals such as Operations Research, Manufacturing & Service Operations Management, Transportation Science, Naval Research Logistics, IIE Transactions, and Interfaces. His research interests are in applying quantitative management methods to the areas of supply chain analytics, sports analytics, and public-policy operations. He has worked with many different organizations for his research, including Dell, Inc., Copeland Corporation, Starbucks Coffee Company, Great American Insurance Group, the Cincinnati Fire Department, the State of Ohio Election Commission, the Cincinnati Bengals, and the Cincinnati Zoo. In 2008, he was named a finalist for the Daniel H. Wagner Prize for Excellence in Operations Research Practice, and he has been recognized for both his research and teaching excellence at the University of Cincinnati.

Jeffrey W. Ohlmann. Jeffrey W. Ohlmann is Associate Professor of Management Sciences and Huneke Research Fellow in the Tippie College of Business at the University of Iowa. Born in Valentine, Nebraska, he earned a B.S. from the University of Nebraska, and M.S. and Ph.D. degrees from the University of Michigan. He has been at the University of Iowa since 2003.

Professor Ohlmann’s research on the modeling and solution of decision-making problems has produced over 20 research papers in journals such as Operations Research, Mathematics of Operations Research, INFORMS Journal on Computing, Transportation Science, European Journal of Operational Research, and Interfaces. He has collaborated with companies such as Transfreight, LeanCor, Cargill, the Hamilton County Board of Elections, and three National Football League franchises. Due to the relevance of his work to industry, he was bestowed the George B. Dantzig Dissertation Award and was recognized as a finalist for the Daniel H. Wagner Prize for Excellence in Operations Research Practice.

David R. Anderson. David R. Anderson is Professor Emeritus of Quantitative Analysis in the Carl H. Lindner College of Business at the University of Cincinnati. Born in Grand Forks, North Dakota, he earned his B.S., M.S., and Ph.D. degrees from Purdue University. Professor Anderson has served as Head of the Department of Quantitative Analysis and Operations Management and as Associate Dean of the College of Business Administration. In addition, he was the coordinator of the College’s first Executive Program.

At the University of Cincinnati, Professor Anderson has taught introductory statistics for business students as well as graduate-level courses in regression analysis, multivariate analysis, and management science. He has also taught statistical courses at the Department of Labor in Washington, D.C. He has been honored with nominations and awards for excellence in teaching and excellence in service to student organizations.

Professor Anderson has coauthored 10 textbooks in the areas of statistics, management science, linear programming, and production and operations management. He is an active consultant in the field of sampling and statistical methods.

Dennis J. Sweeney. Dennis J. Sweeney is Professor Emeritus of Quantitative Analysis and Founder of the Center for Productivity Improvement at the University of Cincinnati. Born in Des Moines, Iowa, he earned a B.S.B.A. degree from Drake University and his M.B.A. and D.B.A. degrees from Indiana University, where he was an NDEA Fellow. During 1978–1979, Professor Sweeney worked in the management science group at Procter & Gamble; during 1981–1982, he was a visiting professor at Duke University. Professor Sweeney served as Head of the Department of Quantitative Analysis and as Associate Dean of the College of Business Administration at the University of Cincinnati.


Professor Sweeney has published more than 30 articles and monographs in the areas of management science and statistics. The National Science Foundation, IBM, Procter & Gamble, Federated Department Stores, Kroger, and Cincinnati Gas & Electric have funded his research, which has been published in Management Science, Operations Research, Mathematical Programming, Decision Sciences, and other journals.

Professor Sweeney has coauthored 10 textbooks in the areas of statistics, management science, linear programming, and production and operations management.

Thomas A. Williams. Thomas A. Williams is Professor Emeritus of Management Science in the College of Business at Rochester Institute of Technology. Born in Elmira, New York, he earned his B.S. degree at Clarkson University. He did his graduate work at Rensselaer Polytechnic Institute, where he received his M.S. and Ph.D. degrees.

Before joining the College of Business at RIT, Professor Williams served for seven years as a faculty member in the College of Business Administration at the University of Cincinnati, where he developed the undergraduate program in Information Systems and then served as its coordinator. At RIT he was the first chairman of the Decision Sciences Department. He teaches courses in management science and statistics, as well as graduate courses in regression and decision analysis.

Professor Williams is the coauthor of 11 textbooks in the areas of management science, statistics, production and operations management, and mathematics. He has been a consultant for numerous Fortune 500 companies and has worked on projects ranging from the use of data analysis to the development of large-scale regression models.

Preface

Business Analytics 3E is designed to introduce the concept of business analytics to undergraduate and graduate students. This textbook contains one of the first collections of materials that are essential to the growing field of business analytics. In Chapter 1 we present an overview of business analytics and our approach to the material in this textbook. In simple terms, business analytics helps business professionals make better decisions based on data. We discuss models for summarizing, visualizing, and understanding useful information from historical data in Chapters 2 through 6. Chapters 7 through 9 introduce methods for both gaining insights from historical data and predicting possible future outcomes. Chapter 10 covers the use of spreadsheets for examining data and building decision models. In Chapter 11, we demonstrate how to explicitly introduce uncertainty into spreadsheet models through the use of Monte Carlo simulation. In Chapters 12 through 14 we discuss optimization models to help decision makers choose the best decision based on the available data. Chapter 15 is an overview of decision analysis approaches for incorporating a decision maker’s views about risk into decision making. In Appendix A we present optional material for students who need to learn the basics of using Microsoft Excel. The use of databases and manipulating data in Microsoft Access is discussed in Appendix B.

This textbook can be used by students who have previously taken a course on basic statistical methods as well as students who have not had a prior course in statistics. Business Analytics 3E is also amenable to a two-course sequence in business statistics and analytics. All statistical concepts contained in this textbook are presented from a business analytics perspective using practical business examples. Chapters 2, 5, 6, and 7 provide an introduction to basic statistical concepts that form the foundation for more advanced analytics methods. Chapters 3, 4, and 9 cover additional topics of data visualization and data mining that are not traditionally part of most introductory business statistics courses, but they are exceedingly important and commonly used in current business environments. Chapter 10 and Appendix A provide the foundational knowledge students need to use Microsoft Excel for analytics applications. Chapters 11 through 15 build upon this spreadsheet knowledge to present additional topics that are used by many organizations that are leaders in the use of prescriptive analytics to improve decision making.

Updates in the Third Edition

The third edition of Business Analytics is a major revision. We have heavily modified our data mining chapters to allow instructors to choose their preferred means of teaching this material in terms of software usage. Chapters 4 and 9 now both contain conceptual homework problems that can be solved by students without using any software. Additionally, we now include online appendices on both Analytic Solver and JMP Pro as software for teaching data mining so that instructors can choose their favored way of teaching this material. Chapter 4 also now includes a section on text mining, a fast-growing topic in business analytics. We have moved our chapter on Monte Carlo simulation to Chapter 11, and we have completely rewritten this chapter to greatly expand the material that can be covered using only native Excel. Other changes in this edition include additional content on big-data concepts, data cleansing, new data visualization topics in Excel, and additional homework problems.

●● Software Updates for Data Mining Chapters. Chapters 4 and 9 have received extensive updates. The end-of-chapter problems are now written so that they can be solved using any data-mining software. To allow instructors to choose different software for use with these chapters, we have created online appendices for both Analytic Solver and JMP Pro. Analytic Solver has undergone major changes since the previous edition of this textbook. Therefore, we have reworked all examples, problems, and cases using Analytic Solver Basic V2017, the version of this software now available to students. We have created new appendices for Chapters 4 and 9 that introduce the use of JMP Pro 13 for data mining. JMP Pro is powerful software that is nonetheless easy to learn and easy to use. We have also added five homework problems to Chapters 4 and 9 that can be solved without using any software. This allows instructors to cover the basics of data mining without using any additional software. The online appendices for Chapters 4 and 9 also include software-specific instructions for how to solve the end-of-chapter problems using Analytic Solver or JMP Pro. Problem and case solutions using both Analytic Solver and JMP Pro are also available to instructors.

●● New Section on Text Mining. Chapter 4 now includes a section on text mining. With the proliferation of unstructured data, generating insights from text is becoming increasingly important. We have added two problems related to text mining to Chapter 4. We also include online appendices for using either Analytic Solver or JMP Pro to perform basic text mining functions.

●● Revision of Monte Carlo Simulation Chapter. Our chapter on simulation models has been heavily revised. In the body of the chapter, we construct simulation models solely using native Excel functionality. This pedagogical design choice is based on the authors’ own experiences and motivated by the following factors: (1) it primarily avoids software incompatibility issues for students with different operating systems (Apple OS versus Microsoft Windows); and (2) it separates simulation concepts from software steps so that students realize that they do not need a specific software package to utilize simulation in their future careers.

To support our approach, we have added many more topics and examples that can be taught using native Excel functions. Our coverage now guides the instructor and student through many different types of simulation models and output analysis using only native Excel. However, if instructors wish to utilize specialized Monte Carlo simulation software, the examples and problems in the chapter can all be solved with specialized software. To demonstrate, we include an updated online appendix for using Analytic Solver to create simulation models and perform output analysis.

We have also moved our chapter on simulation models to Chapter 11, prior to the chapters on optimization models. We believe this presents a better ordering of topics because it follows immediately after Chapter 10, which covers good design techniques for Excel spreadsheet models. In particular, we have added a new section in Chapter 10 on using the Scenario Manager tool in Excel that creates a natural bridge to the coverage of simulation models in Chapter 11. The end-of-chapter problems and case in Chapter 11 can all be solved using native Excel. Problem and case solutions for Chapter 11 using both native Excel and Analytic Solver are available to instructors.

●● Additional Material on Big-Data Topics. We have added new sections in Chapters 6 and 7 to enhance our coverage of topics related to big data. In Chapter 6, we introduce the concept of big data, and we discuss some additional challenges and implications of applying statistical inference when you have very large sample sizes. In Chapter 7, we expand on these concepts by discussing the estimation and use of regression models with very large sample sizes.

●● New Data Analysis and Data Visualization Tools in Excel. Excel 2016 introduces several new tools for data analysis and data visualization. Chapter 2 now covers how to create a box plot in native Excel. In Chapter 3 we have added coverage of how to create more advanced data visualization tools in native Excel such as treemaps and geographic information system (GIS) charts.

●● New Section on Data Cleansing. Chapter 2 now includes a section on data cleansing. This section introduces concepts related to missing data, outliers, and variable representation. These are exceptionally important concepts that face all analytics professionals when dealing with real data that can have missing values and errors.

●● Excel Forecast Sheet. As we did in the second edition, Chapter 8 includes an appendix for using the Forecast Sheet tool in Excel 2016. Excel's Forecast Sheet tool implements a time series forecasting model known as the Holt-Winters additive seasonal smoothing model.

●● New End-of-Chapter Problems. The third edition of this textbook includes new problems in Chapters 2, 3, 4, 6, 9, 10, 11, 13, and 14. As we have done in past editions, Excel solution files are available to instructors for problems that require the use of Excel.


●● Software Appendices Moved Online. Chapter appendices that deal with the use of Analytic Solver or JMP Pro have been moved online as part of the MindTap Reader eBook. This preserves the flow of material in the textbook and allows instructors to easily cover all material using only native Excel if that is preferred. The online appendices offer extensive coverage of Analytic Solver for concepts covered in Chapters 2, 3, 4, 7, 8, 9, 11, 12, 13, 14, and 15. Online appendices for Chapters 4 and 9 cover the usage of JMP Pro for data mining. Contact your Cengage representative for more information on how your students can access the MindTap Reader.

Continued Features and Pedagogy

In the third edition of this textbook, we continue to offer all of the features that have been successful in the first two editions. Some of the specific features that we use in this textbook are listed below.

●● Integration of Microsoft Excel: Excel has been thoroughly integrated throughout this textbook. For many methodologies, we provide instructions for how to perform calculations both by hand and with Excel. In other cases where realistic models are practical only with the use of a spreadsheet, we focus on the use of Excel to describe the methods to be used.

●● Notes and Comments: At the end of many sections, we provide Notes and Comments to give the student additional insights about the methods presented in that section. These insights include comments on the limitations of the presented methods, recommendations for applications, and other matters. Additionally, margin notes are used throughout the textbook to provide additional insights and tips related to the specific material being discussed.

●● Analytics in Action: Each chapter contains an Analytics in Action article. These articles present interesting examples of the use of business analytics in practice. The examples are drawn from many different organizations in a variety of areas including healthcare, finance, manufacturing, marketing, and others.

●● DATAfiles and MODELfiles: All data sets used as examples and in student exercises are also provided online on the companion site as files available for download by the student. DATAfiles are Excel files that contain data needed for the examples and problems given in the textbook. MODELfiles contain additional modeling features such as extensive use of Excel formulas or the use of Excel Solver, Analytic Solver, or JMP Pro.

●● Problems and Cases: With the exception of Chapter 1, each chapter contains an extensive selection of problems to help the student master the material presented in that chapter. The problems vary in difficulty and most relate to specific examples of the use of business analytics in practice. Answers to even-numbered problems are provided in an online supplement for student access. With the exception of Chapter 1, each chapter also includes an in-depth case study that connects many of the different methods introduced in the chapter. The case studies are designed to be more open-ended than the chapter problems, but enough detail is provided to give the student some direction in solving the cases.

MindTap

MindTap is a customizable digital course solution that includes an interactive eBook, auto-graded exercises from the textbook, algorithmic practice problems with solutions feedback, Exploring Analytics visualizations, Adaptive Test Prep, and more! All of these materials offer students better access to resources to understand the materials within the course. For more information on MindTap, please contact your Cengage representative.


For Students

Online resources are available to help the student work more efficiently. The resources can be accessed through www.cengagebrain.com.

●● Analytic Solver: Instructions to download an educational version of Frontline Systems’ Analytic Solver Basic V2017 are included with the purchase of this textbook. These instructions can be found within the inside front cover of this text, within MindTap in the ‘Course Materials’ folder, and online on the text companion site at www.cengagebrain.com.

Note that there is now a small charge for one-semester access to Analytic Solver. For more information on pricing and available discounts for student users, please contact Frontline Systems directly at 775-831-0300 and press 0 for advice on courses and software licenses.

●● JMP Pro: Most universities have site licenses of SAS Institute’s JMP Pro software on both Mac and Windows. These are typically offered through your university’s software licensing administrator. Faculty may contact the JMP Academic team to find out if their universities have a license or to request a complimentary instructor copy at www.jmp.com/contact-academic. For institutions without a site license, students may rent a 6- or 12-month license for JMP at www.onthehub.com/jmp.

For Instructors

Instructor resources are available to adopters on the Instructor Companion Site, which can be found and accessed at www.cengage.com, including:

●● Solutions Manual: The Solutions Manual, prepared by the authors, includes solutions for all problems in the text. It is available online as well as in print. Excel solution files are available to instructors for those problems that require the use of Excel. Solutions for Chapters 4 and 9 are available using both Analytic Solver and JMP Pro for data mining problems. Solutions for Chapter 11 are available using both native Excel and Analytic Solver for simulation problems.

●● Solutions to Case Problems: These are also prepared by the authors and contain solutions to all case problems presented in the text. Case solutions for Chapters 4 and 9 are provided using both Analytic Solver and JMP Pro. Case solutions for Chapter 11 are available using both native Excel and Analytic Solver.

●● PowerPoint Presentation Slides: The presentation slides contain a teaching outline that incorporates figures to complement instructor lectures.

●● Test Bank: Cengage Learning Testing Powered by Cognero is a flexible, online system that allows you to:

●● author, edit, and manage test bank content from multiple Cengage Learning solutions,

●● create multiple test versions in an instant, and

●● deliver tests from your Learning Management System (LMS), your classroom, or wherever you want.

The Test Bank is also available in Microsoft Word.

Acknowledgments

We would like to acknowledge the work of reviewers and users who have provided comments and suggestions for improvement of this text. Thanks to:

Matthew D. Bailey Bucknell University

Phillip Beaver University of Denver

M. Khurrum S. Bhutta Ohio University

Paolo Catasti Virginia Commonwealth University


Q B. Chung Villanova University

Elizabeth A. Denny University of Kentucky

Mike Taein Eom University of Portland

Yvette Njan Essounga Fayetteville State University

Lawrence V. Fulton Texas State University

Tom Groleau Carthage College

James F. Hoelscher Lincoln Memorial University

Eric Huggins Fort Lewis College

Faizul Huq Ohio University

Marco Lam York College of Pennsylvania

Thomas Lee University of California, Berkeley

Roger Myerson Northwestern University

Ram Pakath University of Kentucky

Susan Palocsay James Madison University

Andy Shogan University of California, Berkeley

Dothan Truong Embry-Riddle Aeronautical University

Kai Wang Wake Technical Community College

Ed Wasil American University

We are indebted to our Senior Product Team Manager, Joe Sabatino; our Senior Product Manager, Aaron Arnsparger; our Senior Digital Content Designer, Brandon Foltz; our Senior Marketing Manager, Nathan Anderson; our Content Developer, Anne Merrill; our Content Project Manager, Jean Buttrom; and others at Cengage Learning for their counsel and support during the preparation of this text.

Jeffrey D. Camm
James J. Cochran
Michael J. Fry
Jeffrey W. Ohlmann
David R. Anderson
Dennis J. Sweeney
Thomas A. Williams

Chapter 1

Introduction

Contents

1.1 DECISION MAKING

1.2 BUSINESS ANALYTICS DEFINED

1.3 A CATEGORIZATION OF ANALYTICAL METHODS AND MODELS
Descriptive Analytics
Predictive Analytics
Prescriptive Analytics

1.4 BIG DATA
Volume
Velocity
Variety
Veracity

1.5 BUSINESS ANALYTICS IN PRACTICE
Financial Analytics
Human Resource (HR) Analytics
Marketing Analytics
Health Care Analytics
Supply-Chain Analytics
Analytics for Government and Nonprofits
Sports Analytics
Web Analytics


You apply for a loan for the first time. How does the bank assess the riskiness of the loan it might make to you? How does Amazon.com know which books and other products to recommend to you when you log in to their web site? How do airlines determine what price to quote to you when you are shopping for a plane ticket? How can doctors better diagnose and treat you when you are ill or injured?

You may be applying for a loan for the first time, but millions of people around the world have applied for loans before. Many of these loan recipients have paid back their loans in full and on time, but some have not. The bank wants to know whether you are more like those who have paid back their loans or more like those who defaulted. By comparing your credit history, financial situation, and other factors to the vast database of previous loan recipients, the bank can effectively assess how likely you are to default on a loan.

Similarly, Amazon.com has access to data on millions of purchases made by customers on its web site. Amazon.com examines your previous purchases, the products you have viewed, and any product recommendations you have provided. Amazon.com then searches through its huge database for customers who are similar to you in terms of product purchases, recommendations, and interests. Once similar customers have been identified, their purchases form the basis of the recommendations given to you.

Prices for airline tickets are frequently updated. The price quoted to you for a flight between New York and San Francisco today could be very different from the price that will be quoted tomorrow. These changes happen because airlines use a pricing strategy known as revenue management. Revenue management works by examining vast amounts of data on past airline customer purchases and using these data to forecast future purchases. These forecasts are then fed into sophisticated optimization algorithms that determine the optimal price to charge for a particular flight and when to change that price. Revenue management has resulted in substantial increases in airline revenues.

Finally, consider the case of being evaluated by a doctor for a potentially serious medical issue. Hundreds of medical papers may describe research studies done on patients facing similar diagnoses, and thousands of data points exist on their outcomes. However, it is extremely unlikely that your doctor has read every one of these research papers or is aware of all previous patient outcomes. Instead of relying only on her medical training and knowledge gained from her limited set of previous patients, wouldn’t it be better for your doctor to have access to the expertise and patient histories of thousands of doctors around the world?

A group of IBM computer scientists initiated a project to develop a new decision technology to help in answering these types of questions. That technology is called Watson, named after the founder of IBM, Thomas J. Watson. The team at IBM focused on one aim: how the vast amounts of data now available on the Internet can be used to make more data-driven, smarter decisions.

Watson became a household name in 2011, when it famously won the television game show, Jeopardy! Since that proof of concept in 2011, IBM has reached agreements with the health insurance provider WellPoint (now part of Anthem), the financial services company Citibank, Memorial Sloan-Kettering Cancer Center, and automobile manufacturer General Motors to apply Watson to the decision problems that they face.

Watson is a system of computing hardware, high-speed data processing, and analytical algorithms that are combined to make data-based recommendations. As more and more data are collected, Watson has the capability to learn over time. In simple terms, according to IBM, Watson gathers hundreds of thousands of possible solutions from a huge data bank, evaluates them using analytical techniques, and proposes only the best solutions for consideration. Watson provides not just a single solution, but rather a range of good solutions with a confidence level for each.

For example, at a data center in Virginia, to the delight of doctors and patients, Watson is already being used to speed up the approval of medical procedures. Citibank is beginning to explore how to use Watson to better serve its customers, and cancer specialists at more than a dozen hospitals in North America are using Watson to assist with the diagnosis and treatment of patients.1

This book is concerned with data-driven decision making and the use of analytical approaches in the decision-making process. Three developments spurred recent explosive growth in the use of analytical methods in business applications. First, technological advances produce incredible amounts of data for businesses. Examples include improved point-of-sale scanner technology; the collection of data through e-commerce and social networks; data obtained by sensors on all kinds of mechanical devices, such as aircraft engines, automobiles, and farm machinery, through the so-called Internet of Things; and data generated from personal electronic devices. Naturally, businesses want to use these data to improve the efficiency and profitability of their operations, better understand their customers, price their products more effectively, and gain a competitive advantage. Second, ongoing research has resulted in numerous methodological developments, including advances in computational approaches to effectively handle and explore massive amounts of data, faster algorithms for optimization and simulation, and more effective approaches for visualizing data. Third, these methodological developments were paired with an explosion in computing power and storage capability. Better computing hardware, parallel computing, and, more recently, cloud computing (the remote use of hardware and software over the Internet) have enabled businesses to solve big problems more quickly and more accurately than ever before.

In summary, the availability of massive amounts of data, improvements in analytic methodologies, and substantial increases in computing power have all come together to result in a dramatic upsurge in the use of analytical methods in business and a reliance on the discipline that is the focus of this text: business analytics. As stated in the Preface, the purpose of this text is to provide students with a sound conceptual understanding of the role that business analytics plays in the decision-making process. To reinforce the applications orientation of the text and to provide a better understanding of the variety of applications in which analytical methods have been used successfully, Analytics in Action articles are presented throughout the book. Each Analytics in Action article summarizes an application of analytical methods in practice.

1.1 Decision Making

It is the responsibility of managers to plan, coordinate, organize, and lead their organizations to better performance. Ultimately, managers’ responsibilities require that they make strategic, tactical, or operational decisions. Strategic decisions involve higher-level issues concerned with the overall direction of the organization; these decisions define the organization’s overall goals and aspirations for the future. Strategic decisions are usually the domain of higher-level executives and have a time horizon of three to five years. Tactical decisions concern how the organization should achieve the goals and objectives set by its strategy, and they are usually the responsibility of midlevel management. Tactical decisions usually span a year and thus are revisited annually or even every six months. Operational decisions affect how the firm is run from day to day; they are the domain of operations managers, who are the closest to the customer.

Consider the case of the Thoroughbred Running Company (TRC). Historically, TRC had been a catalog-based retail seller of running shoes and apparel. TRC sales revenues grew quickly as it changed its emphasis from catalog-based sales to Internet-based sales. Recently, TRC decided that it should also establish retail stores in the malls and downtown areas of major cities. This strategic decision will take the firm in a new direction that it hopes will complement its Internet-based strategy. TRC middle managers will therefore have to make a variety of tactical decisions in support of this strategic decision, including how many new stores to open this year, where to open these new stores, how many distribution centers will be needed to support the new stores, and where to locate these distribution centers. Operations managers in the stores will need to make day-to-day decisions regarding, for instance, how many pairs of each model and size of shoes to order from the distribution centers and how to schedule their sales personnel’s work time.

1 “IBM’s Watson Is Learning Its Way to Saving Lives,” Fastcompany web site, December 8, 2012; “IBM’s Watson Targets Cancer and Enlists Prominent Providers in the Fight,” ModernHealthcare web site, May 5, 2015.

Regardless of the level within the firm, decision making can be defined as the following process:

1. Identify and define the problem.
2. Determine the criteria that will be used to evaluate alternative solutions.
3. Determine the set of alternative solutions.
4. Evaluate the alternatives.
5. Choose an alternative.

Step 1 of decision making, identifying and defining the problem, is the most critical. Only if the problem is well-defined, with clear metrics of success or failure (step 2), can a proper approach for solving the problem (steps 3 and 4) be devised. Decision making concludes with the choice of one of the alternatives (step 5).
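The five steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the problem statement, the three candidate cities, the criteria, the weights, and the scores are all hypothetical and are not taken from the TRC example, and a weighted-sum score is just one simple way to evaluate alternatives.

```python
# Hypothetical illustration of the five-step decision process.
# Every name, weight, and score below is an invented assumption.

# Step 1: identify and define the problem.
problem = "Choose one city for a new retail store"

# Step 2: determine the criteria (here, importance weights that sum to 1).
criteria_weights = {"foot_traffic": 0.5, "rent_cost": 0.3, "competition": 0.2}

# Step 3: determine the alternatives, each scored from 1 (worst) to 10 (best).
alternatives = {
    "City A": {"foot_traffic": 8, "rent_cost": 4, "competition": 6},
    "City B": {"foot_traffic": 6, "rent_cost": 9, "competition": 7},
    "City C": {"foot_traffic": 9, "rent_cost": 3, "competition": 4},
}


def weighted_score(scores, weights):
    """Return the weighted sum of an alternative's criterion scores."""
    return sum(weights[name] * scores[name] for name in weights)


# Step 4: evaluate every alternative against the criteria.
evaluations = {
    name: weighted_score(scores, criteria_weights)
    for name, scores in alternatives.items()
}

# Step 5: choose the alternative with the highest evaluation.
best_alternative = max(evaluations, key=evaluations.get)

for name, score in evaluations.items():
    print(f"{name}: {score:.1f}")
print("Chosen:", best_alternative)
```

Changing the weights in step 2 can change the choice in step 5, which is one way to see why a clear problem definition and clear criteria come before any evaluation.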

There are a number of approaches to making decisions: tradition (“We’ve always done it this way”), intuition (“gut feeling”), and rules of thumb (“As the restaurant owner, I schedule twice the number of waiters and cooks on holidays”). The power of each of these approaches should not be underestimated. Managerial experience and intuition are valuable inputs to making decisions, but what if relevant data were available to help us make more informed decisions? With the vast amounts of data now generated and stored electronically, it is estimated that the amount of data stored by businesses more than doubles every two years. How can managers convert these data into knowledge that they can use to be more efficient and effective in managing their businesses?

1.2 Business Analytics Defined

What makes decision making difficult and challenging? Uncertainty is probably the number one challenge. If we knew how much the demand will be for our product, we could do a much better job of planning and scheduling production. If we knew exactly how long each step in a project will take to be completed, we could better predict the project’s cost and completion date. If we knew how stocks will perform, investing would be a lot easier.

Another factor that makes decision making difficult is that we often face such an enormous number of alternatives that we cannot evaluate them all. What is the best combination of stocks to help me meet my financial objectives? What is the best product line for a company that wants to maximize its market share? How should an airline price its tickets so as to maximize revenue?

Business analytics is the scientific process of transforming data into insight for making better decisions.2 Business analytics is used for data-driven or fact-based decision making, which is often seen as more objective than other alternatives for decision making.

As we shall see, the tools of business analytics can aid decision making by creating insights from data, by improving our ability to more accurately forecast for planning, by helping us quantify risk, and by yielding better alternatives through analysis and optimization. A study based on a large sample of firms, conducted by researchers at MIT’s Sloan School of Management and the University of Pennsylvania, concluded that firms guided by data-driven decision making have higher productivity and market value and increased output and profitability.3

If I were given one hour to save the planet, I would spend 59 minutes defining the problem and one minute resolving it.

—Albert Einstein

Some firms and industries use the simpler term, analytics. Analytics is often thought of as a broader category than business analytics, encompassing the use of analytical techniques in the sciences and engineering as well. In this text, we use business analytics and analytics synonymously.

2 We adopt the definition of analytics developed by the Institute for Operations Research and the Management Sciences (INFORMS).

3 E. Brynjolfsson, L. M. Hitt, and H. H. Kim, “Strength in Numbers: How Does Data-Driven Decisionmaking Affect Firm Performance?” (April 18, 2013). Available at SSRN, http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1819486.


1.3 A Categorization of Analytical Methods and Models

Business analytics can involve anything from simple reports to the most advanced optimization techniques (methods for finding the best course of action). Analytics is generally thought to comprise three broad categories of techniques: descriptive analytics, predictive analytics, and prescriptive analytics.

Descriptive Analytics

Descriptive analytics encompasses the set of techniques that describes what has happened in the past. Examples are data queries, reports, descriptive statistics, data visualization including data dashboards, some data-mining techniques, and basic what-if spreadsheet models.

A data query is a request for information with certain characteristics from a database. For example, a query to a manufacturing plant’s database might be for all records of shipments to a particular distribution center during the month of March. This query provides descriptive information about these shipments: the number of shipments, how much was included in each shipment, the date each shipment was sent, and so on. A report summarizing relevant historical information for management might be conveyed by the use of descriptive statistics (means, measures of variation, etc.) and data-visualization tools (tables, charts, and maps). Simple descriptive statistics and data-visualization techniques can be used to find patterns or relationships in a large database.
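
The shipment query described above can be sketched in a few lines. This is only an illustration: the table name, column names, distribution center, and records are all hypothetical, and an in-memory SQLite database stands in for the plant's database.

```python
import sqlite3

# Hypothetical shipments table; every name and value here is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shipments (id INTEGER, center TEXT, ship_date TEXT, units INTEGER)"
)
conn.executemany(
    "INSERT INTO shipments VALUES (?, ?, ?, ?)",
    [
        (1, "Atlanta", "2018-03-02", 120),
        (2, "Dallas", "2018-03-05", 80),
        (3, "Atlanta", "2018-03-19", 200),
        (4, "Atlanta", "2018-04-01", 150),
    ],
)

# Data query: all shipments to one distribution center during March.
march_rows = conn.execute(
    "SELECT id, ship_date, units FROM shipments "
    "WHERE center = ? AND ship_date BETWEEN ? AND ? ORDER BY ship_date",
    ("Atlanta", "2018-03-01", "2018-03-31"),
).fetchall()

# Descriptive information about the shipments returned by the query.
num_shipments = len(march_rows)
total_units = sum(units for _, _, units in march_rows)
mean_units = total_units / num_shipments
print(num_shipments, total_units, mean_units)
```

The query itself only retrieves records; the count, total, and mean computed afterward are the kind of descriptive statistics a report might summarize.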

Data dashboards are collections of tables, charts, maps, and summary statistics that are updated as new data become available. Dashboards are used to help management monitor specific aspects of the company’s performance related to their decision-making responsibilities. For corporate-level managers, daily data dashboards might summarize sales by region, current inventory levels, and other company-wide metrics; front-line managers may view dashboards that contain metrics related to staffing levels, local inventory levels, and short-term sales forecasts.

Data mining is the use of analytical techniques for better understanding patterns and relationships that exist in large data sets. For example, by analyzing text on social network platforms like Twitter, data-mining techniques (including cluster analysis and sentiment analysis) are used by companies to better understand their customers. By categorizing certain words as positive or negative and keeping track of how often those words appear in tweets, a company like Apple can better understand how its customers are feeling about a product like the Apple Watch.
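
The word-counting idea in the tweet example can be sketched as follows. The word lists and posts are invented for illustration, and real sentiment analysis is considerably more involved (negation, sarcasm, context); this sketch only tallies how often pre-labeled words appear.

```python
import re
from collections import Counter

POSITIVE = {"love", "great", "amazing"}  # hypothetical positive word list
NEGATIVE = {"hate", "broken", "slow"}    # hypothetical negative word list

def sentiment_counts(posts):
    """Count how often positive and negative words appear across posts."""
    counts = Counter()
    for post in posts:
        # Lowercase the text and split it into simple word tokens.
        for word in re.findall(r"[a-z']+", post.lower()):
            if word in POSITIVE:
                counts["positive"] += 1
            elif word in NEGATIVE:
                counts["negative"] += 1
    return counts

# Made-up posts about a product.
posts = [
    "I love my new watch, the battery is great",
    "Screen is broken already. Hate it.",
    "Great design and amazing screen",
]
counts = sentiment_counts(posts)
print(counts["positive"], counts["negative"])
```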

Predictive Analytics

Predictive analytics consists of techniques that use models constructed from past data to predict the future or ascertain the impact of one variable on another. For example, past data on product sales may be used to construct a mathematical model to predict future sales. This model can factor in the product’s growth trajectory and seasonality based on past patterns. A packaged-food manufacturer may use point-of-sale scanner data from retail outlets to help in estimating the lift in unit sales due to coupons or sales events. Survey data and past purchase behavior may be used to help predict the market share of a new product. All of these are applications of predictive analytics.
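
As a rough illustration of building a model from past sales, the sketch below fits a straight-line trend by least squares and extends it one period ahead. The sales figures are made up, and the model captures only the growth trajectory; seasonality, mentioned above, would need additional terms.

```python
def fit_trend(periods, sales):
    """Least-squares fit of sales = intercept + slope * period."""
    n = len(periods)
    mean_x = sum(periods) / n
    mean_y = sum(sales) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(periods, sales))
    sxx = sum((x - mean_x) ** 2 for x in periods)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical past sales for five periods (deliberately linear).
periods = [1, 2, 3, 4, 5]
sales = [110, 120, 130, 140, 150]

intercept, slope = fit_trend(periods, sales)
forecast_6 = intercept + slope * 6  # predicted sales for period 6
print(intercept, slope, forecast_6)
```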

Linear regression, time series analysis, some data-mining techniques, and simulation, often referred to as risk analysis, all fall under the banner of predictive analytics. We discuss all of these techniques in greater detail later in this text.

Data mining, previously discussed as a descriptive analytics tool, is also often used in predictive analytics. For example, a large grocery store chain might be interested in developing a targeted marketing campaign that offers a discount coupon on potato chips. By studying historical point-of-sale data, the store may be able to use data mining to predict which customers are the most likely to respond to an offer on discounted chips by purchasing higher-margin items such as beer or soft drinks in addition to the chips, thus increasing the store’s overall revenue.

Appendix B at the end of this book describes how to use Microsoft Access to conduct data queries.


Simulation involves the use of probability and statistics to construct a computer model to study the impact of uncertainty on a decision. For example, banks often use simulation to model investment and default risk in order to stress-test financial models. Simulation is also often used in the pharmaceutical industry to assess the risk of introducing a new drug.
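
A minimal sketch of simulation applied to default risk follows. All inputs (number of loans, default probability, loss per default) are assumptions chosen for illustration, and defaults are treated as independent, which real stress tests would not assume.

```python
import random

def simulate_losses(n_loans, p_default, loss_per_default, n_trials, seed=0):
    """Simulate total portfolio loss over many trials of random defaults."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    losses = []
    for _ in range(n_trials):
        # Each loan defaults independently with probability p_default.
        defaults = sum(1 for _ in range(n_loans) if rng.random() < p_default)
        losses.append(defaults * loss_per_default)
    return losses

# Hypothetical portfolio: 100 loans, 5% default chance, $10,000 lost per default.
losses = simulate_losses(
    n_loans=100, p_default=0.05, loss_per_default=10_000, n_trials=2_000
)
mean_loss = sum(losses) / len(losses)
# Loss level exceeded in roughly the worst 5% of trials.
worst_5pct = sorted(losses)[int(0.95 * len(losses))]
print(mean_loss, worst_5pct)
```

The spread between the average loss and the loss in the worst trials is what quantifies the impact of uncertainty on the decision.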

Prescriptive Analytics

Prescriptive analytics differs from descriptive and predictive analytics in that prescriptive analytics indicates a course of action to take; that is, the output of a prescriptive model is a decision. Predictive models provide a forecast or prediction, but do not provide a decision. However, a forecast or prediction, when combined with a rule, becomes a prescriptive model. For example, we may develop a model to predict the probability that a person will default on a loan. If we add a rule stating that a loan should not be awarded when the estimated probability of default is more than 0.6, then the predictive model, coupled with the rule, is prescriptive analytics. These types of prescriptive models that rely on a rule or set of rules are often referred to as rule-based models.
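
The loan example can be written as a short rule-based model. The 0.6 cutoff comes from the text; the applicant names and predicted probabilities are hypothetical stand-ins for the output of some predictive model.

```python
DEFAULT_THRESHOLD = 0.6  # cutoff from the loan example in the text

def loan_decision(prob_default, threshold=DEFAULT_THRESHOLD):
    """Combine a predicted default probability with a rule to get a decision."""
    return "deny" if prob_default > threshold else "approve"

# Hypothetical predicted default probabilities for three applicants.
applicants = {"A": 0.15, "B": 0.60, "C": 0.72}
decisions = {name: loan_decision(p) for name, p in applicants.items()}
print(decisions)
```

Applicant B sits exactly at the cutoff and is approved, because the rule denies a loan only when the probability is more than 0.6.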

Other examples of prescriptive analytics are portfolio models in finance, supply network design models in operations, and price-markdown models in retailing. Portfolio models use historical investment return data to determine which mix of investments will yield the highest expected return while controlling or limiting exposure to risk. Supply-network design models provide plant and distribution center locations that will minimize costs while still meeting customer service requirements. Given historical data, retail price-markdown models yield revenue-maximizing discount levels and the timing of discount offers when goods have not sold as planned. All of these models are known as optimization models, that is, models that give the best decision subject to the constraints of the situation.
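
To make "the best decision subject to the constraints" concrete, here is a toy portfolio sketch. The returns, risk scores, and risk limit are invented, risk is simplified to a weighted average of scores, and the search simply enumerates allocations in 10% steps; it is not the optimization method used in practice or later in this text.

```python
from itertools import product

expected_return = [0.03, 0.07, 0.11]  # hypothetical annual returns per asset
risk_score = [1, 4, 9]                # hypothetical risk score per asset
MAX_RISK = 5.0                        # hypothetical limit on weighted risk

best_weights, best_return = None, float("-inf")
# Enumerate every allocation in 10% steps whose weights sum to 100%.
for a, b in product(range(11), repeat=2):
    c = 10 - a - b
    if c < 0:
        continue
    weights = [a / 10, b / 10, c / 10]
    risk = sum(w * r for w, r in zip(weights, risk_score))
    if risk > MAX_RISK + 1e-9:  # constraint: stay within the risk limit
        continue
    ret = sum(w * r for w, r in zip(weights, expected_return))
    if ret > best_return:       # objective: highest expected return
        best_weights, best_return = weights, ret

print(best_weights, best_return)
```

The unconstrained choice would put everything in the highest-return asset; the risk limit is what forces a mix.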

Another type of modeling in the prescriptive analytics category is simulation optimization, which combines the use of probability and statistics to model uncertainty with optimization techniques to find good decisions in highly complex and highly uncertain settings. Finally, the techniques of decision analysis can be used to develop an optimal strategy when a decision maker is faced with several decision alternatives and an uncertain set of future events. Decision analysis also employs utility theory, which assigns values to outcomes based on the decision maker’s attitude toward risk, loss, and other factors.
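
A small sketch of the decision-analysis setting just described: several alternatives, uncertain future events, and a choice among them. The payoff table and probabilities are hypothetical, and the sketch ranks alternatives by expected monetary value only; it does not model utility or attitude toward risk.

```python
# Hypothetical payoffs (in $ millions) for each alternative and future event.
payoffs = {
    "small plant": {"strong demand": 8, "weak demand": 7},
    "large plant": {"strong demand": 20, "weak demand": -9},
}
# Hypothetical probabilities of the uncertain future events.
probabilities = {"strong demand": 0.8, "weak demand": 0.2}

def expected_value(alternative):
    """Probability-weighted average payoff of one decision alternative."""
    return sum(probabilities[s] * payoffs[alternative][s] for s in probabilities)

expected_values = {alt: expected_value(alt) for alt in payoffs}
best_alternative = max(expected_values, key=expected_values.get)
print(expected_values, best_alternative)
```

A risk-averse decision maker might still prefer the small plant, since the large plant can lose money; that trade-off is what utility theory is meant to capture.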

In this text we cover all three areas of business analytics: descriptive, predictive, and prescriptive. Table 1.1 shows how the chapters cover the three categories.

1.4 Big Data

Walmart handles over 1 million purchase transactions per hour. Facebook processes more than 250 million picture uploads per day. Six billion cell phone owners around the world generate vast amounts of data by calling, texting, tweeting, and browsing the web on a daily basis.4 As Google CEO Eric Schmidt has noted, the amount of data currently created every 48 hours is equivalent to the entire amount of data created from the dawn of civilization until the year 2003. It is through technology that we have truly been thrust into the data age. Because data can now be collected electronically, the available amounts of it are staggering. The Internet, cell phones, retail checkout scanners, surveillance video, and sensors on everything from aircraft to cars to bridges allow us to collect and store vast amounts of data in real time.

In the midst of all of this data collection, the new term big data has been created. There is no universally accepted definition of big data. However, probably the most accepted and most general definition is that big data is any set of data that is too large or too complex to be handled by standard data-processing techniques and typical desktop software. IBM describes the phenomenon of big data through the four Vs: volume, velocity, variety, and veracity, as shown in Figure 1.1.5

4 SAS White Paper, “Big Data Meets Big Data Analytics,” SAS Institute, 2012.
5 IBM web site: http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg.


Table 1.1 Coverage of Business Analytics Topics in This Text

Chapter  Title                                                 Descriptive  Predictive  Prescriptive
1        Introduction                                          ●            ●           ●
2        Descriptive Statistics                                ●
3        Data Visualization                                    ●
4        Descriptive Data Mining                               ●
5        Probability: An Introduction to Modeling Uncertainty  ●
6        Statistical Inference                                 ●
7        Linear Regression                                                  ●
8        Time Series and Forecasting                                        ●
9        Predictive Data Mining                                             ●
10       Spreadsheet Models                                    ●            ●           ●
11       Monte Carlo Simulation                                             ●           ●
12       Linear Optimization Models                                                     ●
13       Integer Linear Optimization Models                                             ●
14       Nonlinear Optimization Models                                                  ●
15       Decision Analysis                                                              ●

Figure 1.1 The Four Vs of Big Data (Source: IBM)

Volume (Data at Rest): Terabytes to exabytes of existing data to process
Velocity (Data in Motion): Streaming data, milliseconds to seconds to respond
Variety (Data in Many Forms): Structured, unstructured, text, multimedia
Veracity (Data in Doubt): Uncertainty due to data inconsistency and incompleteness, ambiguities, latency, deception, model approximations


Volume

Because data are collected electronically, we are able to collect more of it. To be useful, these data must be stored, and this storage has led to vast quantities of data. Many companies now store in excess of 100 terabytes of data (a terabyte is 1,024 gigabytes).

Velocity

Real-time capture and analysis of data present unique challenges both in how data are stored and in the speed with which those data can be analyzed for decision making. For example, the New York Stock Exchange collects 1 terabyte of data in a single trading session, and having current data and real-time rules for trades and predictive modeling is important for managing stock portfolios.

Variety

In addition to the sheer volume and speed with which companies now collect data, more complicated types of data are now available and are proving to be of great value to businesses. Text data are collected by monitoring what is being said about a company’s products or services on social media platforms such as Twitter. Audio data are collected from service calls (on a service call, you will often hear “this call may be monitored for quality control”). Video data collected by in-store video cameras are used to analyze shopping behavior. Analyzing information generated by these nontraditional sources is more complicated in part because of the processing required to transform the data into a numerical form that can be analyzed.

Veracity

Veracity has to do with how much uncertainty is in the data. For example, the data could have many missing values, which makes reliable analysis a challenge. Inconsistencies in units of measure and the lack of reliability of responses in terms of bias also increase the complexity of the data.

Businesses have realized that understanding big data can lead to a competitive advantage. Although big data represents opportunities, it also presents challenges in terms of data storage and processing, security, and available analytical talent.

The four Vs indicate that big data creates challenges in terms of how these complex data can be captured, stored, and processed; secured; and then analyzed. Traditional databases more or less assume that data fit into nice rows and columns, but that is not always the case with big data. Also, the sheer volume (the first V) often means that it is not possible to store all of the data on a single computer. This has led to new technologies like Hadoop, an open-source programming environment that supports big data processing through distributed storage and distributed processing on clusters of computers. Essentially, Hadoop provides a divide-and-conquer approach to handling massive amounts of data, dividing the storage and processing over multiple computers. MapReduce is a programming model used within Hadoop that performs the two major steps for which it is named: the map step and the reduce step. The map step divides the data into manageable subsets and distributes them to the computers in the cluster (often termed nodes) for storing and processing. The reduce step collects answers from the nodes and combines them into an answer to the original problem. Without technologies like Hadoop and MapReduce, and relatively inexpensive computer power, processing big data would not be cost-effective; in some cases, processing might not even be possible.
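
The map and reduce steps can be illustrated with a word count, the customary introductory example. This sketch runs in a single Python process and does not use Hadoop or any Hadoop API; the function names and text chunks are assumptions, and each chunk stands in for the subset of data that one node would hold.

```python
from collections import defaultdict

def map_step(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Group the emitted values by key so each word's counts sit together."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_step(groups):
    """Reduce: combine the grouped values into one answer per word."""
    return {key: sum(values) for key, values in groups.items()}

# Each string stands in for the data subset assigned to one node.
chunks = ["big data big insight", "data drives decisions", "big decisions"]

mapped = [pair for chunk in chunks for pair in map_step(chunk)]
word_counts = reduce_step(shuffle(mapped))
print(word_counts)
```

In a real cluster the map calls would run in parallel on different nodes, and the grouping between the two steps would move data across the network.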

While some sources of big data are publicly available (Twitter, weather data, etc.), much of it is private information. Medical records, bank account information, and credit card transactions, for example, are all highly confidential and must be protected from computer hackers. Data security, the protection of stored data from destructive forces or unauthorized users, is of critical importance to companies. For example, credit card transactions are potentially very useful for understanding consumer behavior, but compromise of these data could lead to unauthorized use of the credit card or identity theft. A 2016 study of 383 companies in 12 countries conducted by the Ponemon Institute and IBM found that the average cost of a data breach is $4 million.6 Companies such as Target, Anthem, JPMorgan Chase, Yahoo!, and Home Depot have faced major data breaches costing millions of dollars.

The complexities of the 4 Vs have increased the demand for analysts, but a shortage of qualified analysts has made hiring more challenging. More companies are searching for data scientists, who know how to effectively process and analyze massive amounts of data because they are well trained in both computer science and statistics. Next we discuss three examples of how companies are collecting big data for competitive advantage.

Kroger Understands Its Customers7

Kroger is the largest retail grocery chain in the United States. It sends over 11 million pieces of direct mail to its customers each quarter. The quarterly mailers each contain 12 coupons that are tailored to each household based on several years of shopping data obtained through its customer loyalty card program. By collecting and analyzing consumer behavior at the individual household level and better matching its coupon offers to shopper interests, Kroger has been able to realize a far higher redemption rate on its coupons. In the six-week period following distribution of the mailers, over 70% of households redeem at least one coupon, leading to an estimated coupon revenue of $10 billion for Kroger.

MagicBand at Disney8

The Walt Disney Company has begun offering a wristband to visitors to its Orlando, Florida, Disney World theme park. Known as the MagicBand, the wristband contains technology that can transmit more than 40 feet and can be used to track each visitor’s location in the park in real time. The band can link to information that allows Disney to better serve its visitors. For example, prior to the trip to Disney World, a visitor might be asked to fill out a survey on his or her birth date and favorite rides, characters, and restaurant table type and location. This information, linked to the MagicBand, can allow Disney employees using smartphones to greet you by name as you arrive, offer you products they know you prefer, wish you a happy birthday, and have your favorite characters show up as you wait in line or have lunch at your favorite table. The MagicBand can be linked to your credit card, so there is no need to carry cash or a credit card. And during your visit, your movement throughout the park can be tracked and the data can be analyzed to better serve you during your next visit to the park.

General Electric and the Internet of Things9

The Internet of Things (IoT) is the technology that allows data, collected from sensors in all types of machines, to be sent over the Internet to repositories where they can be stored and analyzed. This ability to collect data from products has enabled the companies that produce and sell those products to better serve their customers and offer new services based on analytics. For example, each day General Electric (GE) gathers nearly 50 million pieces of data from 10 million sensors on medical equipment and aircraft engines it has sold to customers throughout the world. In the case of aircraft engines, through a service agreement with its customers, GE collects data each time an airplane powered by its engines takes off and lands. By analyzing these data, GE can better predict when maintenance is needed, which helps customers avoid unplanned maintenance and downtime and helps ensure safe operation. GE can also use the data to better control how the plane is flown, leading to a decrease in fuel cost by flying more efficiently. In 2014, GE realized approximately $1.1 billion in revenue from the IoT.

Although big data is clearly one of the drivers for the strong demand for analytics, it is important to understand that in some sense big data issues are a subset of analytics. Many very valuable applications of analytics do not involve big data, but rather traditional data sets that are very manageable by traditional database and analytics software. The key to analytics is that it provides useful insights and better decision making using the data that are available—whether those data are “big” or “small.”

6 2016 Cost of Data Breach Study: Global Analysis, Ponemon Institute and IBM, June, 2016.
7 Based on “Kroger Knows Your Shopping Patterns Better than You Do,” Forbes.com, October 23, 2013.
8 Based on “Disney’s $1 Billion Bet on a Magical Wristband,” Wired.com, March 10, 2015.
9 Based on “G.E. Opens Its Big Data Platform,” NYTimes.com, October 9, 2014.


1.5 Business Analytics in Practice

Business analytics involves tools ranging from those as simple as reports and graphs to those as sophisticated as optimization, data mining, and simulation. In practice, companies that apply analytics often follow a trajectory similar to that shown in Figure 1.2. Organizations start with basic analytics in the lower left. As they realize the advantages of these analytic techniques, they often progress to more sophisticated techniques in an effort to reap the derived competitive advantage. Therefore, predictive and prescriptive analytics are sometimes referred to as advanced analytics. Not all companies reach that level of usage, but those that embrace analytics as a competitive strategy often do.

Analytics has been applied in virtually all sectors of business and government. Organizations such as Procter & Gamble, IBM, UPS, Netflix, Amazon.com, Google, the Internal Revenue Service, and General Electric have embraced analytics to solve important problems or to achieve a competitive advantage. In this section, we briefly discuss some of the types of applications of analytics by application area.

Financial Analytics

Applications of analytics in finance are numerous and pervasive. Predictive models are used to forecast financial performance, to assess the risk of investment portfolios and projects, and to construct financial instruments such as derivatives. Prescriptive models are used to construct optimal portfolios of investments, to allocate assets, and to create optimal capital budgeting plans. For example, GE Asset Management uses optimization models to decide how to invest its own cash received from insurance policies and other financial products, as well as the cash of its clients, such as Genworth Financial. The estimated benefit from the optimization models was $75 million over a five-year period.10 Simulation is also often used to assess risk in the financial sector; one example is the deployment by Hypo Real Estate International of simulation models to successfully manage commercial real estate risk.11

10 L. C. Chalermkraivuth et al., “GE Asset Management, Genworth Financial, and GE Insurance Use a Sequential-Linear Programming Algorithm to Optimize Portfolios,” Interfaces 35, no. 5 (September–October 2005).
11 Y. Jafry, C. Marrison, and U. Umkehrer-Neudeck, “Hypo International Strengthens Risk Management with a Large-Scale, Secure Spreadsheet-Management Framework,” Interfaces 38, no. 4 (July–August 2008).

Figure 1.2 The Spectrum of Business Analytics (Source: Adapted from SAS)

Horizontal axis: Degree of Complexity. Vertical axis: Competitive Advantage.
Techniques, in the order shown: Standard Reporting, Data Query, Data Visualization, Descriptive Statistics, Data Mining, Forecasting, Predictive Modeling, Simulation, Decision Analysis, Rule-Based Models, Optimization.
Categories, from lower left to upper right: Descriptive, Predictive, Prescriptive.


Human Resource (HR) Analytics

A relatively new area of application for analytics is the management of an organization’s human resources (HR). The HR function is charged with ensuring that the organization (1) has the mix of skill sets necessary to meet its needs, (2) is hiring the highest-quality talent and providing an environment that retains it, and (3) achieves its organizational diversity goals. Google refers to its HR Analytics function as “people analytics.” Google has analyzed substantial data on its own employees to determine the characteristics of great leaders, to assess factors that contribute to productivity, and to evaluate potential new hires. Google also uses predictive analytics to continually update its forecast of future employee turnover and retention.12

Marketing Analytics

Marketing is one of the fastest-growing areas for the application of analytics. A better understanding of consumer behavior through the use of scanner data and data generated from social media has led to an increased interest in marketing analytics. As a result, descriptive, predictive, and prescriptive analytics are all heavily used in marketing. A better understanding of consumer behavior through analytics leads to the better use of advertising budgets, more effective pricing strategies, improved forecasting of demand, improved product-line management, and increased customer satisfaction and loyalty. For example, each year, NBCUniversal uses a predictive model to help support its annual upfront market, a period in late May when each television network sells the majority of its on-air advertising for the upcoming television season. Over 200 NBC sales and finance personnel use the results of the forecasting model to support pricing and sales decisions.13

In another example of high-impact marketing analytics, automobile manufacturer Chrysler teamed with J.D. Power and Associates to develop an innovative set of predictive models to support its pricing decisions for automobiles. These models help Chrysler to better understand the ramifications of proposed pricing structures (a combination of manufacturer’s suggested retail price, interest rate offers, and rebates) and, as a result, to improve its pricing decisions. The models have generated an estimated annual savings of $500 million.14

Health Care Analytics

The use of analytics in health care is on the increase because of pressure to simultaneously control costs and provide more effective treatment. Descriptive, predictive, and prescriptive analytics are used to improve patient, staff, and facility scheduling; patient flow; purchasing; and inventory control. A study by McKinsey Global Institute (MGI) and McKinsey & Company15 estimates that the health care system in the United States could save more than $300 billion per year by better utilizing analytics; these savings are approximately the equivalent of the entire gross domestic product of countries such as Finland, Singapore, and Ireland.

The use of prescriptive analytics for diagnosis and treatment is relatively new, but it may prove to be the most important application of analytics in health care. For example, working with the Georgia Institute of Technology, Memorial Sloan-Kettering Cancer Center developed a real-time prescriptive model to determine the optimal placement of radioactive seeds for the treatment of prostate cancer.16 Using the new model, 20–30% fewer seeds are needed, resulting in a faster and less invasive procedure.

12 J. Sullivan, “How Google Is Using People Analytics to Completely Reinvent HR,” Talent Management and HR web site, February 26, 2013.
13 S. Bollapragada et al., “NBC-Universal Uses a Novel Qualitative Forecasting Technique to Predict Advertising Demand,” Interfaces 38, no. 2 (March–April 2008).
14 J. Silva-Risso et al., “Chrysler and J. D. Power: Pioneering Scientific Price Customization in the Automobile Industry,” Interfaces 38, no. 1 (January–February 2008).
15 J. Manyika et al., “Big Data: The Next Frontier for Innovation, Competition and Productivity,” McKinsey Global Institute Report, 2011.

Supply-Chain Analytics

The core service of companies such as UPS and FedEx is the efficient delivery of goods, and analytics has long been used to achieve efficiency. The optimal sorting of goods, vehicle and staff scheduling, and vehicle routing are all key to profitability for logistics companies such as UPS and FedEx.

Companies can benefit from better inventory and processing control and more efficient supply chains. Analytic tools used in this area span the entire spectrum of analytics. For example, the women’s apparel manufacturer Bernard Claus, Inc. has successfully used descriptive analytics to provide its managers a visual representation of the status of its supply chain.17 ConAgra Foods uses predictive and prescriptive analytics to better plan capacity utilization by incorporating the inherent uncertainty in commodities pricing. ConAgra realized a 100% return on its investment in analytics in under three months, an unheard-of result for a major technology investment.18

Analytics for Government and Nonprofits

Government agencies and other nonprofits have used analytics to drive out inefficiencies and increase the effectiveness and accountability of programs. Indeed, much of advanced analytics has its roots in the U.S. and British militaries dating back to World War II. Today, the use of analytics in government is becoming pervasive in everything from elections to tax collection. For example, the New York State Department of Taxation and Finance has worked with IBM to use prescriptive analytics in the development of a more effective approach to tax collection. The result was an increase in collections from delinquent payers of $83 million over two years.19 The U.S. Internal Revenue Service has used data mining to identify patterns that distinguish questionable annual personal income tax filings. In one application, the IRS combines its data on individual taxpayers with data received from banks on mortgage payments made by those taxpayers. When taxpayers report a mortgage payment that is unrealistically high relative to their reported taxable income, they are flagged as possible underreporters of taxable income. The filing is then further scrutinized and may trigger an audit.
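
The flagging idea in the IRS example can be sketched as a join of two data sources followed by a simple ratio rule. This is purely illustrative: the taxpayer IDs, amounts, and the 0.5 cutoff are invented, and nothing here reflects the actual criteria the IRS uses.

```python
MAX_RATIO = 0.5  # hypothetical cutoff for mortgage payments relative to income

def flag_filings(reported_income, mortgage_payments, max_ratio=MAX_RATIO):
    """Flag taxpayers whose mortgage payments look too high for their income."""
    flagged = []
    for taxpayer_id, income in reported_income.items():
        # Combine the two sources: look up the bank-reported payment, if any.
        payment = mortgage_payments.get(taxpayer_id)
        if payment is None:
            continue  # no bank data for this taxpayer
        if income <= 0 or payment / income > max_ratio:
            flagged.append(taxpayer_id)
    return sorted(flagged)

# Hypothetical annual figures: income from tax filings, payments from banks.
reported_income = {"T1": 60_000, "T2": 30_000, "T3": 90_000}
mortgage_payments = {"T1": 18_000, "T2": 27_000}

flagged = flag_filings(reported_income, mortgage_payments)
print(flagged)
```

A flag is only a prompt for further scrutiny, consistent with the text: the filing is reviewed and may or may not lead to an audit.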

Likewise, nonprofit agencies have used analytics to ensure their effectiveness and accountability to their donors and clients. Catholic Relief Services (CRS) is the official international humanitarian agency of the U.S. Catholic community. The CRS mission is to provide relief for the victims of both natural and human-made disasters and to help people in need around the world through its health, educational, and agricultural programs. CRS uses an analytical spreadsheet model to assist in the allocation of its annual budget based on the impact that its various relief efforts and programs will have in different countries.20

Sports Analytics

The use of analytics in sports has gained considerable attention since 2003, when renowned author Michael Lewis published Moneyball. Lewis’ book tells the story of how the Oakland Athletics used an analytical approach to player evaluation in order to assemble a competitive team with a limited budget. The use of analytics for player evaluation and on-field strategy is now common, especially in professional sports. Professional sports teams use analytics to assess players for the amateur drafts and to decide how much to offer players in contract negotiations;21 professional motorcycle racing teams use sophisticated optimization for gearbox design to gain competitive advantage;22 and teams use analytics to assist with on-field decisions such as which pitchers to use in various games of a Major League Baseball playoff series.

16 E. Lee and M. Zaider, “Operations Research Advances Cancer Therapeutics,” Interfaces 38, no. 1 (January–February 2008).
17 T. H. Davenport, ed., Enterprise Analytics (Upper Saddle River, NJ: Pearson Education Inc., 2013).
18 “ConAgra Mills: Up-to-the-Minute Insights Drive Smarter Selling Decisions and Big Improvements in Capacity Utilization,” IBM Smarter Planet Leadership Series. Available at: http://www.ibm.com/smarterplanet/us/en/leadership/conagra/, retrieved December 1, 2012.
19 G. Miller et al., “Tax Collection Optimization for New York State,” Interfaces 42, no. 1 (January–February 2013).
20 I. Gamvros, R. Nidel, and S. Raghavan, “Investment Analysis and Budget Allocation at Catholic Relief Services,” Interfaces 36, no. 5 (September–October 2006).

The use of analytics for off-the-field business decisions is also increasing rapidly. Ensuring customer satisfaction is important for any company, and fans are the customers of sports teams. The Cleveland Indians professional baseball team used a type of predictive modeling known as conjoint analysis to design its premium seating offerings at Progressive Field based on fan survey data. Using prescriptive analytics, franchises across several major sports dynamically adjust ticket prices throughout the season to reflect the relative attractiveness and potential demand for each game.

Web Analytics

Web analytics is the analysis of online activity, which includes, but is not limited to, visits to web sites and social media sites such as Facebook and LinkedIn. Web analytics obviously has huge implications for promoting and selling products and services via the Internet. Leading companies apply descriptive and advanced analytics to data collected in online experiments to determine the best way to configure web sites, position ads, and utilize social networks for the promotion of products and services. Online experimentation involves exposing various subgroups to different versions of a web site and tracking the results. Because of the massive pool of Internet users, experiments can be conducted without risking the disruption of the overall business of the company. Such experiments are proving to be invaluable because they enable the company to use trial-and-error in determining statistically what makes a difference in their web site traffic and sales.
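
One common way to determine "statistically what makes a difference" in an online experiment is a two-proportion z-test on conversion rates. The text does not name a specific test, so this choice is an assumption, as are the visitor and purchase counts below.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of two site versions with a pooled z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 5,000 visitors see each version of the web site.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(round(z, 2), round(p_value, 4))
```

A small p-value suggests the difference in conversion rates between the two versions is unlikely to be due to chance alone.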

Summary

This introductory chapter began with a discussion of decision making. Decision making can be defined as the following process: (1) identify and define the problem, (2) determine the criteria that will be used to evaluate alternative solutions, (3) determine the set of alternative solutions, (4) evaluate the alternatives, and (5) choose an alternative. Decisions may be strategic (high level, concerned with the overall direction of the business), tactical (midlevel, concerned with how to achieve the strategic goals of the business), or operational (day-to-day decisions that must be made to run the company).

Uncertainty and an overwhelming number of alternatives are two key factors that make decision making difficult. Business analytics approaches can assist by identifying and mitigating uncertainty and by prescribing the best course of action from a very large number of alternatives. In short, business analytics can help us make better-informed decisions.

There are three categories of analytics: descriptive, predictive, and prescriptive. Descriptive analytics describes what has happened and includes tools such as reports, data visualization, data dashboards, descriptive statistics, and some data-mining techniques. Predictive analytics consists of techniques that use past data to predict future events or ascertain the impact of one variable on another. These techniques include regression, data mining, forecasting, and simulation. Prescriptive analytics uses data to determine a course of action. This class of analytical techniques includes rule-based models, simulation, decision analysis, and optimization. Descriptive and predictive analytics can help us better understand the uncertainty and risk associated with our decision alternatives. Predictive and prescriptive analytics, also often referred to as advanced analytics, can help us make the best decision when facing a myriad of alternatives.

21 N. Streib, S. J. Young, and J. Sokol, “A Major League Baseball Team Uses Operations Research to Improve Draft Preparation,” Interfaces 42, no. 2 (March–April 2012).
22 J. Amoros, L. F. Escudero, J. F. Monge, J. V. Segura, and O. Reinoso, “TEAM ASPAR Uses Binary Optimization to Obtain Optimal Gearbox Ratios in Motorcycle Racing,” Interfaces 42, no. 2 (March–April 2012).

Big data is a set of data that is too large or too complex to be handled by standard data-processing techniques or typical desktop software. The increasing prevalence of big data is leading to an increase in the use of analytics. The Internet, retail scanners, and cell phones are making huge amounts of data available to companies, and these companies want to better understand these data. Business analytics helps them understand these data and use them to make better decisions.

We concluded this chapter with a discussion of various application areas of analytics. Our discussion focused on financial analytics, human resource analytics, marketing analytics, health care analytics, supply-chain analytics, analytics for government and nonprofit organizations, sports analytics, and web analytics. However, the use of analytics is rapidly spreading to other sectors, industries, and functional areas of organizations. Each remaining chapter in this text will provide a real-world vignette in which business analytics is applied to a problem faced by a real organization.

Glossary

Advanced analytics: Predictive and prescriptive analytics.
Big data: Any set of data that is too large or too complex to be handled by standard data-processing techniques and typical desktop software.
Business analytics: The scientific process of transforming data into insight for making better decisions.
Data dashboard: A collection of tables, charts, and maps to help management monitor selected aspects of the company’s performance.
Data mining: The use of analytical techniques for better understanding patterns and relationships that exist in large data sets.
Data query: A request for information with certain characteristics from a database.
Data scientists: Analysts trained in both computer science and statistics who know how to effectively process and analyze massive amounts of data.
Data security: Protecting stored data from destructive forces or unauthorized users.
Decision analysis: A technique used to develop an optimal strategy when a decision maker is faced with several decision alternatives and an uncertain set of future events.
Descriptive analytics: Analytical tools that describe what has happened.
Hadoop: An open-source programming environment that supports big data processing through distributed storage and distributed processing on clusters of computers.
Internet of Things (IoT): The technology that allows data collected from sensors in all types of machines to be sent over the Internet to repositories where it can be stored and analyzed.
MapReduce: Programming model used within Hadoop that performs the two major steps for which it is named: the map step and the reduce step. The map step divides the data into manageable subsets and distributes it to the computers in the cluster for storing and processing. The reduce step collects answers from the nodes and combines them into an answer to the original problem.
Operational decision: A decision concerned with how the organization is run from day to day.
Optimization model: A mathematical model that gives the best decision, subject to the situation’s constraints.
Predictive analytics: Techniques that use models constructed from past data to predict the future or to ascertain the impact of one variable on another.
Prescriptive analytics: Techniques that analyze input data and yield a best course of action.
Rule-based model: A prescriptive model that is based on a rule or set of rules.
Simulation: The use of probability and statistics to construct a computer model to study the impact of uncertainty on the decision at hand.
Simulation optimization: The use of probability and statistics to model uncertainty, combined with optimization techniques, to find good decisions in highly complex and highly uncertain settings.
Strategic decision: A decision that involves higher-level issues and that is concerned with the overall direction of the organization, defining the overall goals and aspirations for the organization’s future.
Tactical decision: A decision concerned with how the organization should achieve the goals and objectives set by its strategy.
Utility theory: The study of the total worth or relative desirability of a particular outcome that reflects the decision maker’s attitude toward a collection of factors such as profit, loss, and risk.

Chapter 2 Descriptive Statistics

Contents

ANALYTICS IN ACTION: U.S. CENSUS BUREAU

2.1 OVERVIEW OF USING DATA: DEFINITIONS AND GOALS

2.2 TYPES OF DATA
    Population and Sample Data
    Quantitative and Categorical Data
    Cross-Sectional and Time Series Data
    Sources of Data

2.3 MODIFYING DATA IN EXCEL
    Sorting and Filtering Data in Excel
    Conditional Formatting of Data in Excel

2.4 CREATING DISTRIBUTIONS FROM DATA
    Frequency Distributions for Categorical Data
    Relative Frequency and Percent Frequency Distributions
    Frequency Distributions for Quantitative Data
    Histograms
    Cumulative Distributions

2.5 MEASURES OF LOCATION
    Mean (Arithmetic Mean)
    Median
    Mode
    Geometric Mean

2.6 MEASURES OF VARIABILITY
    Range
    Variance
    Standard Deviation
    Coefficient of Variation

2.7 ANALYZING DISTRIBUTIONS
    Percentiles
    Quartiles
    z-Scores
    Empirical Rule
    Identifying Outliers
    Box Plots

2.8 MEASURES OF ASSOCIATION BETWEEN TWO VARIABLES
    Scatter Charts
    Covariance
    Correlation Coefficient

2.9 DATA CLEANSING
    Missing Data
    Blakely Tires
    Identification of Erroneous Outliers and other Erroneous Values

APPENDIX 2.1: CREATING BOX PLOTS WITH ANALYTIC SOLVER (MINDTAP READER)

Analytics in Action

U.S. Census Bureau

The Bureau of the Census is part of the U.S. Department of Commerce and is more commonly known as the U.S. Census Bureau. The U.S. Census Bureau collects data related to the population and economy of the United States using a variety of methods and for many purposes. These data are essential to many government and business decisions.

Probably the best-known data collected by the U.S. Census Bureau is the decennial census, which is an effort to count the total U.S. population. Collecting these data is a huge undertaking involving mailings, door-to-door visits, and other methods. The decennial census collects categorical data such as the sex and race of the respondents, as well as quantitative data such as the number of people living in the household. The data collected in the decennial census are used to determine the number of representatives assigned to each state, the number of Electoral College votes apportioned to each state, and how federal government funding is divided among communities.

The U.S. Census Bureau also administers the Current Population Survey (CPS). The CPS is a cross-sectional monthly survey of a sample of 60,000 households used to estimate employment and unemployment rates in different geographic areas. The CPS has been administered since 1940, so an extensive time series of employment and unemployment data now exists. These data drive government policies such as job assistance programs. The estimated unemployment rates are watched closely as an overall indicator of the health of the U.S. economy.

The data collected by the U.S. Census Bureau are also very useful to businesses. Retailers use data on population changes in different areas to plan new store openings. Mail-order catalog companies use the demographic data when designing targeted marketing campaigns. In many cases, businesses combine the data collected by the U.S. Census Bureau with their own data on customer behavior to plan strategies and to identify potential customers. The U.S. Census Bureau is one of the most important providers of data used in business analytics.

In this chapter, we first explain the need to collect and analyze data and identify some common sources of data. Then we discuss the types of data that you may encounter in practice and present several numerical measures for summarizing data. We cover some common ways of manipulating and summarizing data using spreadsheets. We then develop numerical summary measures for data sets consisting of a single variable. When a data set contains more than one variable, the same numerical measures can be computed separately for each variable. In the two-variable case, we also develop measures of the relationship between the variables.

2.1 Overview of Using Data: Definitions and Goals

Data are the facts and figures collected, analyzed, and summarized for presentation and interpretation. Table 2.1 shows a data set containing information for stocks in the Dow Jones Industrial Index (or simply “the Dow”) on October 17, 2017. The Dow is tracked by many financial advisors and investors as an indication of the state of the overall financial markets and the economy in the United States. The share prices for the 30 companies listed in Table 2.1 are the basis for computing the Dow Jones Industrial Average (DJI), which is tracked continuously by virtually every financial publication.

A characteristic or a quantity of interest that can take on different values is known as a variable; for the data in Table 2.1, the variables are Symbol, Industry, Share Price, and Volume. An observation is a set of values corresponding to a set of variables; each row in Table 2.1 corresponds to an observation.
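As a minimal sketch of this vocabulary, the Python snippet below stores two rows of Table 2.1 as observations keyed by company. The dictionary layout and the names `observations`, `variables`, and `share_prices` are illustrative assumptions, not something the text prescribes.

```python
# Two observations (rows) copied from Table 2.1, keyed by company name.
# Each inner key is a variable; each inner dict is one observation.
observations = {
    "Apple": {"Symbol": "AAPL", "Industry": "Technology",
              "Share Price": 160.47, "Volume": 18_997_275},
    "Boeing": {"Symbol": "BA", "Industry": "Manufacturing",
               "Share Price": 258.62, "Volume": 2_515_865},
}

# The variables are the characteristics recorded for every observation.
variables = list(observations["Apple"].keys())
print(variables)

# Reading one variable across observations shows how its value varies.
share_prices = {name: obs["Share Price"] for name, obs in observations.items()}
print(share_prices)
```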


Practically every problem (and opportunity) that an organization (or individual) faces is concerned with the impact of the possible values of relevant variables on the business outcome. Thus, we are concerned with how the value of a variable can vary; variation is the difference in a variable measured over observations (time, customers, items, etc.).

The role of descriptive analytics is to collect and analyze data to gain a better understanding of variation and its impact on the business setting. The values of some variables are under direct control of the decision maker (these are often called decision variables). The values of other variables may fluctuate with uncertainty because of factors outside the direct control of the decision maker. In general, a quantity whose values are not known with certainty is called a random variable, or uncertain variable. When we collect data, we are gathering past observed values, or realizations of a variable. By collecting these past realizations of one or more variables, our goal is to learn more about the variation of a particular business situation.

Decision variables used in optimization models are covered in Chapters 12, 13 and 14. Random variables are covered in greater detail in Chapters 5 and 11.

TABLE 2.1 Data for Dow Jones Industrial Index Companies

Company              Symbol  Industry                Share Price ($)      Volume
Apple                AAPL    Technology                       160.47  18,997,275
American Express     AXP     Financial                         91.69   2,939,556
Boeing               BA      Manufacturing                    258.62   2,515,865
Caterpillar          CAT     Manufacturing                    130.54   2,380,342
Cisco Systems        CSCO    Technology                        33.60   9,303,117
Chevron Corporation  CVX     Chemical, Oil, and Gas           120.22   4,844,293
DuPont               DD      Chemical, Oil, and Gas            83.93  34,861,021
Disney               DIS     Entertainment                     98.36   5,942,501
General Electric     GE      Conglomerate                      23.19  58,639,089
Goldman Sachs        GS      Financial                        236.09   7,088,445
The Home Depot       HD      Retail                           163.35   4,189,197
IBM                  IBM     Technology                       146.54   6,372,393
Intel                INTC    Technology                        39.79  15,532,818
Johnson & Johnson    JNJ     Pharmaceuticals                  140.79  11,717,348
JPMorgan Chase       JPM     Banking                           97.62  10,335,687
Coca-Cola            KO      Food and Drink                    46.52   7,699,367
McDonald’s           MCD     Food and Drink                   165.40   2,379,725
3M                   MMM     Conglomerate                     217.75   2,150,810
Merck                MRK     Pharmaceuticals                   63.22   7,028,492
Microsoft            MSFT    Technology                        77.59  16,823,989
Nike                 NKE     Consumer Goods                    52.00   9,492,675
Pfizer               PFE     Pharmaceuticals                   36.20  14,019,661
Procter & Gamble     PG      Consumer Goods                    92.80   5,316,062
Travelers            TRV     Insurance                        128.62   1,808,224
UnitedHealth Group   UNH     Healthcare                       203.89   8,949,715
United Technologies  UTX     Conglomerate                     119.36   2,026,513
Visa                 V       Financial                        107.54   5,979,405
Verizon              VZ      Telecommunications                48.40  14,842,814
Wal-Mart             WMT     Retail                            85.98   5,851,546
ExxonMobil           XOM     Chemical, Oil, and Gas            82.96   6,444,106


2.2 Types of Data

Population and Sample Data

Data can be categorized in several ways based on how they are collected and the type of data collected. In many cases, it is not feasible to collect data from the population of all elements of interest. In such instances, we collect data from a subset of the population known as a sample. For example, with the thousands of publicly traded companies in the United States, tracking and analyzing all of these stocks every day would be too time consuming and expensive. The Dow represents a sample of 30 stocks of large public companies based in the United States, and it is often interpreted to represent the larger population of all publicly traded companies. It is very important to collect sample data that are representative of the population data so that generalizations can be made from them. In most cases (although not true of the Dow), a representative sample can be gathered by random sampling from the population data. Dealing with populations and samples can introduce subtle differences in how we calculate and interpret summary statistics. In almost all practical applications of business analytics, we will be dealing with sample data.
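Random sampling from a population can be sketched in a few lines of Python. The population below is a made-up list of 5,000 placeholder labels standing in for publicly traded companies (an assumption for illustration, not real ticker symbols), and the sample size of 30 simply mirrors the size of the Dow.

```python
import random

# Hypothetical population: placeholder labels standing in for thousands of
# publicly traded companies (illustrative only, not real symbols).
population = [f"STOCK{i:04d}" for i in range(1, 5001)]

# A seeded generator makes the draw repeatable from run to run.
rng = random.Random(42)

# Simple random sample without replacement: every company has the same
# chance of being selected, and no company is selected twice.
sample = rng.sample(population, k=30)

print(len(sample), "companies sampled out of", len(population))
print(sample[:5])
```

Unlike this sketch, the companies in the Dow are chosen deliberately rather than drawn at random, which is why the text singles the Dow out as an exception.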

Quantitative and Categorical Data

Data are considered quantitative data if numeric and arithmetic operations, such as addition, subtraction, multiplication, and division, can be performed on them. For instance, we can sum the values for Volume in the Dow data in Table 2.1 to calculate a total volume of all shares traded by companies included in the Dow. If arithmetic operations cannot be performed on the data, they are considered categorical data. We can summarize categorical data by counting the number of observations or computing the proportions of observations in each category. For instance, the data in the Industry column in Table 2.1 are categorical. We can count the number of companies in the Dow that are in the telecommunications industry. Table 2.1 shows three companies in the financial industry: American Express, Goldman Sachs, and Visa. We cannot perform arithmetic operations on the data in the Industry column.
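The distinction can be illustrated with a short Python sketch that sums a quantitative variable and counts a categorical one. It uses only five rows of Table 2.1, so the totals are for this subset rather than for the whole Dow; the variable names are illustrative.

```python
from collections import Counter

# Five rows of Table 2.1: (symbol, industry, volume).
rows = [
    ("AXP", "Financial", 2_939_556),
    ("GS", "Financial", 7_088_445),
    ("V", "Financial", 5_979_405),
    ("VZ", "Telecommunications", 14_842_814),
    ("AAPL", "Technology", 18_997_275),
]

# Quantitative data: arithmetic such as summation is meaningful.
total_volume = sum(volume for _, _, volume in rows)

# Categorical data: summarize by counting observations per category...
industry_counts = Counter(industry for _, industry, _ in rows)

# ...or by computing the proportion of observations in each category.
industry_share = {name: count / len(rows) for name, count in industry_counts.items()}

print(total_volume)
print(industry_counts)
print(industry_share)
```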

Cross-Sectional and Time Series Data

For statistical analysis, it is important to distinguish between cross-sectional data and time series data. Cross-sectional data are collected from several entities at the same, or approximately the same, point in time. The data in Table 2.1 are cross-sectional because they describe the 30 companies that comprise the Dow at the same point in time (October 17, 2017). Time series data are collected over several time periods. Graphs of time series data are frequently found in business and economic publications. Such graphs help analysts understand what happened in the past, identify trends over time, and project future levels for the time series. For example, the graph of the time series in Figure 2.1 shows the DJI value from January 2006 to March 2017. The figure illustrates that the DJI climbed to above 14,000 in 2007. However, the financial crisis in 2008 led to a significant decline in the DJI to between 6,000 and 7,000 by 2009. Since 2009, the DJI has been generally increasing and topped 21,000 in 2017.

To ensure that the companies in the Dow form a representative sample, companies are periodically added and removed from the Dow. It is possible that the companies in the Dow today have changed from what is shown in Table 2.1.

Sources of Data

Data necessary to analyze a business problem or opportunity can often be obtained with an appropriate study; such statistical studies can be classified as either experimental or observational. In an experimental study, a variable of interest is first identified. Then one or more other variables are identified and controlled or manipulated to obtain data about how these variables influence the variable of interest. For example, if a pharmaceutical firm conducts an experiment to learn about how a new drug affects blood pressure, then blood pressure is the variable of interest. The dosage level of the new drug is another variable that is hoped to have a causal effect on blood pressure. To obtain data about the effect of the new drug, researchers select a sample of individuals. The dosage level of the new drug is controlled by giving different dosages to the different groups of individuals. Before and after the study, data on blood pressure are collected for each group. Statistical analysis of these experimental data can help determine how the new drug affects blood pressure.

Nonexperimental, or observational, studies make no attempt to control the variables of interest. A survey is perhaps the most common type of observational study. For instance, in a personal interview survey, research questions are first identified. Then a questionnaire is designed and administered to a sample of individuals. Some restaurants use observational studies to obtain data about customer opinions with regard to the quality of food, quality of service, atmosphere, and so on. A customer opinion questionnaire used by Chops City Grill in Naples, Florida, is shown in Figure 2.2. Note that the customers who fill out the questionnaire are asked to provide ratings for 12 variables, including overall experience, the greeting by hostess, the table visit by the manager, overall service, and so on. The response categories of excellent, good, average, fair, and poor provide categorical data that enable Chops City Grill management to maintain high standards for the restaurant’s food and service.

In some cases, the data needed for a particular application exist from an experimental or observational study that has already been conducted. For example, companies maintain a variety of databases about their employees, customers, and business operations. Data on employee salaries, ages, and years of experience can usually be obtained from internal personnel records. Other internal records contain data on sales, advertising expenditures, distribution costs, inventory levels, and production quantities. Most companies also maintain detailed data about their customers.

Anyone who wants to use data and statistical analysis to aid in decision making must be aware of the time and cost required to obtain the data. The use of existing data sources is desirable when data must be obtained in a relatively short period of time. If important data are not readily available from a reliable existing source, the additional time and cost involved in obtaining the data must be taken into account. In all cases, the decision maker should consider the potential contribution of the statistical analysis to the decision-making process. The cost of data acquisition and the subsequent statistical analysis should not exceed the savings generated by using the information to make a better decision.

In Chapter 15 we discuss methods for determining the value of additional information that can be provided by collecting data.

FIGURE 2.1 Dow Jones Index Values Since 2006

[Line chart of DJI value (vertical axis, 5,000 to 22,000) by year (horizontal axis, 2006 to 2017).]


Notes + Comments

1. Organizations that specialize in collecting and maintaining data make available substantial amounts of business and economic data. Companies can access these external data sources through leasing arrangements or by purchase. Dun & Bradstreet, Bloomberg, and Dow Jones & Company are three firms that provide extensive business database services to clients. Nielsen and Ipsos are two companies that have built successful businesses collecting and processing data that they sell to advertisers and product manufacturers. Data are also available from a variety of industry associations and special-interest organizations.

2. Government agencies are another important source of existing data. For instance, the web site data.gov was launched by the U.S. government in 2009 to make it easier for the public to access data collected by the U.S. federal government. The data.gov web site includes hundreds of thousands of data sets from a variety of U.S. federal departments and agencies. In general, the Internet is an important source of data and statistical information. One can obtain access to stock quotes, meal prices at restaurants, salary data, and a wide array of other information simply by performing an Internet search.

FIGURE 2.2 Customer Opinion Questionnaire Used by Chops City Grill Restaurant

Date: ____________ Server Name: ____________

Our customers are our top priority. Please take a moment to fill out our survey card, so we can better serve your needs. You may return this card to the front desk or return by mail. Thank you!

SERVICE SURVEY          Excellent  Good  Average  Fair  Poor

Overall Experience      ❑          ❑     ❑        ❑     ❑
Greeting by Hostess     ❑          ❑     ❑        ❑     ❑
Manager (Table Visit)   ❑          ❑     ❑        ❑     ❑

Overall Service         ❑          ❑     ❑        ❑     ❑
Professionalism         ❑          ❑     ❑        ❑     ❑
Menu Knowledge          ❑          ❑     ❑        ❑     ❑
Friendliness            ❑          ❑     ❑        ❑     ❑

Wine Selection          ❑          ❑     ❑        ❑     ❑
Menu Selection          ❑          ❑     ❑        ❑     ❑
Food Quality            ❑          ❑     ❑        ❑     ❑
Food Presentation       ❑          ❑     ❑        ❑     ❑
Value for $ Spent       ❑          ❑     ❑        ❑     ❑

What comments could you give us to improve our restaurant?

Thank you, we appreciate your comments. The staff of Chops City Grill.


2.3 Modifying Data in Excel

Projects often involve so much data that it is difficult to analyze all of the data at once. In this section, we examine methods for summarizing and manipulating data using Excel to make the data more manageable and to develop insights.

Sorting and Filtering Data in Excel

Excel contains many useful features for sorting and filtering data so that one can more easily identify patterns. Table 2.2 contains data on the 20 top-selling automobiles in the United States in March 2011. The table shows the model and manufacturer of each automobile as well as the sales for the model in March 2011 and March 2010.

Figure 2.3 shows the data from Table 2.2 entered into an Excel spreadsheet, and the percent change in sales for each model from March 2010 to March 2011 has been calculated. This is done by entering the formula =(D2-E2)/E2 in cell F2 and then copying the contents of this cell to cells F3 to F20. (We cannot calculate the percent change in sales for the Ford Fiesta because it was not being sold in March 2010.)
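The same calculation can be sketched outside Excel. The Python function below mirrors the spreadsheet formula (new minus old, divided by old) for three rows of Table 2.2. The function name `percent_change` and the choice to return `None` when the earlier sales figure is zero are assumptions made for this illustration.

```python
def percent_change(new, old):
    """Return (new - old) / old, or None when old is zero (change undefined)."""
    if old == 0:
        return None
    return (new - old) / old

# (March 2011 sales, March 2010 sales) for three models from Table 2.2.
sales = {
    "Honda Accord": (33_616, 29_120),
    "Toyota Camry": (31_464, 36_251),
    "Ford Fiesta": (9_787, 0),  # not sold in March 2010
}

for model, (sales_2011, sales_2010) in sales.items():
    change = percent_change(sales_2011, sales_2010)
    label = "n/a" if change is None else f"{change:.1%}"
    print(f"{model}: {label}")
```

Guarding against a zero denominator is the programmatic counterpart of leaving the Ford Fiesta's cell without a value.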

Suppose that we want to sort these automobiles by March 2010 sales instead of by March 2011 sales. To do this, we use Excel’s Sort function, as shown in the following steps.

Step 1. Select cells A1:F21
Step 2. Click the Data tab in the Ribbon
Step 3. Click Sort in the Sort & Filter group
Step 4. Select the check box for My data has headers
Step 5. In the first Sort by dropdown menu, select Sales (March 2010)
Step 6. In the Order dropdown menu, select Largest to Smallest (see Figure 2.4)
Step 7. Click OK

TABLE 2.2 20 Top-Selling Automobiles in United States in March 2011

Data file: Top20Cars

Rank (by March 2011 Sales)  Manufacturer  Model           Sales (March 2011)  Sales (March 2010)
1                           Honda         Accord                      33,616              29,120
2                           Nissan        Altima                      32,289              24,649
3                           Toyota        Camry                       31,464              36,251
4                           Honda         Civic                       31,213              22,463
5                           Toyota        Corolla/Matrix              30,234              29,623
6                           Ford          Fusion                      27,566              22,773
7                           Hyundai       Sonata                      22,894              18,935
8                           Hyundai       Elantra                     19,255               8,225
9                           Toyota        Prius                       18,605              11,786
10                          Chevrolet     Cruze/Cobalt                18,101              10,316
11                          Chevrolet     Impala                      18,063              15,594
12                          Nissan        Sentra                      17,851               8,721
13                          Ford          Focus                       17,178              19,500
14                          Volkswagen    Jetta                       16,969               9,196
15                          Chevrolet     Malibu                      15,551              17,750
16                          Mazda         3                           12,467              11,353
17                          Nissan        Versa                       11,075              13,811
18                          Subaru        Outback                     10,498               7,619
19                          Kia           Soul                        10,028               5,106
20                          Ford          Fiesta                       9,787                   0

Source: Manufacturers and Automotive News Data Center

The result of using Excel’s Sort function for the March 2010 data is shown in Figure 2.5. Now we can easily see that, although the Honda Accord was the best-selling automobile in March 2011, both the Toyota Camry and the Toyota Corolla/Matrix outsold the Honda Accord in March 2010. Note that while we sorted on Sales (March 2010), which is in column E, the data in all other columns are adjusted accordingly.
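For readers working outside Excel, sorting whole rows by one column can be sketched in Python. The snippet uses the five top-ranked rows of Table 2.2; the tuple layout and the name `by_2010` are illustrative choices, not part of the text.

```python
# (manufacturer, model, March 2011 sales, March 2010 sales) from Table 2.2.
cars = [
    ("Honda", "Accord", 33_616, 29_120),
    ("Nissan", "Altima", 32_289, 24_649),
    ("Toyota", "Camry", 31_464, 36_251),
    ("Honda", "Civic", 31_213, 22_463),
    ("Toyota", "Corolla/Matrix", 30_234, 29_623),
]

# Sort whole rows by the March 2010 column (index 3), largest to smallest.
# Each row moves as a unit, so the other columns stay with their model.
by_2010 = sorted(cars, key=lambda row: row[3], reverse=True)

for manufacturer, model, sales_2011, sales_2010 in by_2010:
    print(f"{manufacturer} {model}: {sales_2010:,} (2010), {sales_2011:,} (2011)")
```

As in the Excel result, the Toyota Camry and Toyota Corolla/Matrix come out ahead of the Honda Accord when ranked on March 2010 sales.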

Now let’s suppose that we are interested only in seeing the sales of models made by Toyota. We can do this using Excel’s Filter function:

Step 1. Select cells A1:F21
Step 2. Click the Data tab in the Ribbon
Step 3. Click Filter in the Sort & Filter group
Step 4. Click on the Filter Arrow in column B, next to Manufacturer
Step 5. If all choices are checked, you can easily deselect all choices by unchecking (Select All). Then select only the check box for Toyota.
Step 6. Click OK

The result is a display of only the data for models made by Toyota (see Figure 2.6). We now see that of the 20 top-selling models in March 2011, Toyota made three of them. We can further filter the data by choosing the down arrows in the other columns. We can make all data visible again by clicking on the down arrow in column B and checking (Select All) and clicking OK, or by clicking Filter in the Sort & Filter Group again from the Data tab.
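Filtering on a single column value can likewise be sketched in Python. The `Car` record type and the helper `filter_by_manufacturer` are hypothetical names introduced for this example, and only six rows of Table 2.2 are included.

```python
from typing import NamedTuple

class Car(NamedTuple):
    rank: int
    manufacturer: str
    model: str
    sales_2011: int
    sales_2010: int

# Six rows of Table 2.2, including all three Toyota models.
cars = [
    Car(1, "Honda", "Accord", 33_616, 29_120),
    Car(2, "Nissan", "Altima", 32_289, 24_649),
    Car(3, "Toyota", "Camry", 31_464, 36_251),
    Car(4, "Honda", "Civic", 31_213, 22_463),
    Car(5, "Toyota", "Corolla/Matrix", 30_234, 29_623),
    Car(9, "Toyota", "Prius", 18_605, 11_786),
]

def filter_by_manufacturer(rows, manufacturer):
    """Keep only the rows whose manufacturer matches; the input is unchanged."""
    return [car for car in rows if car.manufacturer == manufacturer]

toyota = filter_by_manufacturer(cars, "Toyota")
for car in toyota:
    print(car.rank, car.model, car.sales_2011, car.sales_2010)
```

Because the original list is left untouched, showing all rows again simply means going back to `cars`, much like re-checking (Select All) in Excel.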

FIGURE 2.3 Data for 20 Top-Selling Automobiles Entered into Excel with Percent Change in Sales from 2010

[Excel worksheet, data file Top20CarsPercent. Columns A through F hold Rank (by March 2011 Sales), Manufacturer, Model, Sales (March 2011), Sales (March 2010), and Percent Change in Sales from 2010 for the 20 models in Table 2.2, in rows 2 through 21. Column F values, in rank order: 15.4%, 31.0%, –13.2%, 39.0%, 2.1%, 21.0%, 20.9%, 134.1%, 57.9%, 75.5%, 15.8%, 104.7%, –11.9%, 84.5%, –12.4%, 9.8%, –19.8%, 37.8%, 96.4%, and ----- (Ford Fiesta).]


FIGURE 2.4 Using Excel’s Sort Function to Sort the Top-Selling Automobiles Data

[Excel worksheet with the Sort dialog box open: My data has headers is checked, Sort by is set to Sales (March 2010), Sort On is set to Values, and Order is set to Largest to Smallest.]

FIGURE 2.5 Top-Selling Automobiles Data Sorted by March 2010 Sales

[Excel worksheet with the same columns A through F, with rows ordered from largest to smallest March 2010 sales, beginning with the Toyota Camry (36,251), Toyota Corolla/Matrix (29,623), and Honda Accord (29,120).]


FIGURE 2.6 Top-Selling Automobiles Data Filtered to Show Only Automobiles Manufactured by Toyota

[Excel worksheet filtered on Manufacturer, showing only the three Toyota rows: Camry (rank 3; 31,464 in March 2011; 36,251 in March 2010; –13.2%), Corolla/Matrix (rank 5; 30,234; 29,623; 2.1%), and Prius (rank 9; 18,605; 11,786; 57.9%).]

FIGURE 2.7 Using Conditional Formatting in Excel to Highlight Automobiles with Declining Sales from March 2010

[Excel worksheet with the same columns A through F in rank order, with the negative values in the Percent Change in Sales from 2010 column highlighted.]

Conditional Formatting of Data in Excel

Conditional formatting in Excel can make it easy to identify data that satisfy certain conditions in a data set. For instance, suppose that we wanted to quickly identify the automobile models in Table 2.2 for which sales had decreased from March 2010 to March 2011. We can quickly highlight these models:

Step 1. Starting with the original data shown in Figure 2.3, select cells F1:F21
Step 2. Click the Home tab in the Ribbon
Step 3. Click Conditional Formatting in the Styles group
Step 4. Select Highlight Cells Rules, and click Less Than from the dropdown menu
Step 5. Enter 0% in the Format cells that are LESS THAN: box
Step 6. Click OK

FIGURE 2.8 Using Conditional Formatting in Excel to Generate Data Bars for the Top-Selling Automobiles Data

[Excel worksheet for the 20 top-selling models, columns A through F: Rank (by March 2011 Sales), Manufacturer, Model, Sales (March 2011), Sales (March 2010), Percent Change in Sales from 2010, with data bars applied.]

The results are shown in Figure 2.7. Here we see that the models with decreasing sales (Toyota Camry, Ford Focus, Chevrolet Malibu, and Nissan Versa) are now clearly visible. Note that Excel’s Conditional Formatting function offers tremendous flexibility. Instead of highlighting only models with decreasing sales, we could choose Data Bars from the Conditional Formatting dropdown menu in the Styles group of the Home tab in the Ribbon. The result of using the Blue Data Bar Gradient Fill option is shown in Figure 2.8. Data bars are essentially a bar chart embedded in the cells that shows the magnitude of the cell values. The widths of the bars in this display are proportional to the values of the variable for which the bars have been drawn; a value of 20 creates a bar twice as wide as that for a value of 10. Negative values are shown to the left side of the axis; positive values are shown to the right. Cells with negative values are shaded in red, and those with positive values are shaded in blue. Again, we can easily see which models had decreasing sales, but Data Bars also provide us with a visual representation of the magnitude of the change in sales. Many other Conditional Formatting options are available in Excel.
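The "less than 0%" highlighting rule can also be sketched outside Excel. The Python snippet below is only an illustration, not part of the Excel workflow: it uses six of the 20 models from the worksheet, and the names `sales`, `percent_change`, and `declining` are chosen here for the example.

```python
# Sales for six of the 20 models, taken from the worksheet columns
# "Sales (March 2011)" and "Sales (March 2010)".
sales = {
    "Honda Accord": (33616, 29120),
    "Toyota Camry": (31464, 36251),
    "Toyota Corolla/Matrix": (30234, 29623),
    "Ford Focus": (17178, 19500),
    "Chevrolet Malibu": (15551, 17750),
    "Nissan Versa": (11075, 13811),
}


def percent_change(new, old):
    """Percent change from old to new, rounded to one decimal place."""
    return round(100 * (new - old) / old, 1)


changes = {
    model: percent_change(s2011, s2010)
    for model, (s2011, s2010) in sales.items()
}

# Equivalent of the "Less Than 0%" rule: keep models with a negative change.
declining = [model for model, change in changes.items() if change < 0]

for model, change in changes.items():
    flag = "declining" if change < 0 else ""
    print(model.ljust(24), f"{change:6.1f}%", flag)
```

For this subset, the flagged models are the same four that the conditional formatting rule highlights.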

The Quick Analysis button in Excel appears just outside the bottom-right corner of a group of selected cells whenever you select multiple cells. Clicking the Quick Analysis button gives you shortcuts for Conditional Formatting, adding Data Bars, and other operations. Clicking on this button gives you the options shown in Figure 2.9 for Formatting. Note that there are also tabs for Charts, Totals, Tables, and Sparklines.

Bar charts and other graphical presentations will be covered in detail in Chapter 3. We will see other uses for Conditional Formatting in Excel in Chapter 3.

The Quick Analysis button is not available in Excel versions prior to Excel 2013.


2.4 Creating Distributions from Data

Distributions help summarize many characteristics of a data set by describing how often certain values for a variable appear in that data set. Distributions can be created for both categorical and quantitative data, and they assist the analyst in gauging variation.

Frequency Distributions for Categorical Data

It is often useful to create a frequency distribution for a data set. A frequency distribution is a summary of data that shows the number (frequency) of observations in each of several nonoverlapping classes, typically referred to as bins. Consider the data in Table 2.3, taken from a sample of 50 soft drink purchases. Each purchase is for one of five popular soft drinks, which define the five bins: Coca-Cola, Diet Coke, Dr. Pepper, Pepsi, and Sprite.

Bins for categorical data are also referred to as classes.

FIGURE 2.9 Excel Quick Analysis Button Formatting Options

[Quick Analysis menu with tabs for Formatting, Charts, Totals, Tables, and Sparklines. The Formatting tab offers Data Bars, Color..., Icon Set, Greater..., Text..., and Clear..., with the tip "Conditional Formatting uses rules to highlight interesting data."]

TABLE 2.3 Data from a Sample of 50 Soft Drink Purchases

Coca-Cola     Sprite        Pepsi
Diet Coke     Coca-Cola     Coca-Cola
Pepsi         Diet Coke     Coca-Cola
Diet Coke     Coca-Cola     Coca-Cola
Coca-Cola     Diet Coke     Pepsi
Coca-Cola     Coca-Cola     Dr. Pepper
Dr. Pepper    Sprite        Coca-Cola
Diet Coke     Pepsi         Diet Coke
Pepsi         Coca-Cola     Pepsi
Coca-Cola     Coca-Cola     Pepsi
Dr. Pepper    Pepsi         Pepsi
Sprite        Coca-Cola     Coca-Cola
Coca-Cola     Sprite        Dr. Pepper
Diet Coke     Dr. Pepper    Pepsi
Coca-Cola     Pepsi         Sprite
Coca-Cola     Diet Coke

SoftDrinks

To develop a frequency distribution for these data, we count the number of times each soft drink appears in Table 2.3. Coca-Cola appears 19 times, Diet Coke appears 8 times, Dr. Pepper appears 5 times, Pepsi appears 13 times, and Sprite appears 5 times. These counts are summarized in the frequency distribution in Table 2.4. This frequency distribution provides a summary of how the 50 soft drink purchases are distributed across the 5 soft drinks. This summary offers more insight than the original data shown in Table 2.3. The frequency distribution shows that Coca-Cola is the leader, Pepsi is second, Diet Coke is third, and Sprite and Dr. Pepper are tied for fourth. The frequency distribution thus summarizes information about the popularity of the five soft drinks.

We can use Excel to calculate the frequency of categorical observations occurring in a data set using the COUNTIF function. Figure 2.10 shows the sample of 50 soft drink purchases in an Excel spreadsheet. Column D contains the five different soft drink categories as the bins. In cell E2, we enter the formula =COUNTIF($A$2:$B$26, D2), where A2:B26 is the range for the sample data, and D2 is the bin (Coca-Cola) that we are trying to match. The COUNTIF function in Excel counts the number of times a certain value appears in the indicated range. In this case we want to count the number of times Coca-Cola appears in the sample data. The result is a value of 19 in cell E2, indicating that Coca-Cola appears 19 times in the sample data. We can copy the formula from cell E2 to cells E3 to E6 to get frequency counts for Diet Coke, Pepsi, Dr. Pepper, and Sprite. By using the absolute reference $A$2:$B$26 in our formula, Excel always searches the same sample data for the values we want when we copy the formula.
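For readers who want to check the counts outside Excel, the same tally can be sketched in Python with `collections.Counter`. This is an illustration only: the list `purchases` is rebuilt from the counts reported in Table 2.4 rather than copied cell by cell from the worksheet, so the order of individual purchases is an assumption.

```python
from collections import Counter

# Illustrative sample rebuilt from the counts in Table 2.4; the order of the
# individual purchases in the original worksheet is not reproduced here.
purchases = (
    ["Coca-Cola"] * 19
    + ["Diet Coke"] * 8
    + ["Dr. Pepper"] * 5
    + ["Pepsi"] * 13
    + ["Sprite"] * 5
)

bins = ["Coca-Cola", "Diet Coke", "Dr. Pepper", "Pepsi", "Sprite"]

# Counter tallies every value in one pass; looking up each bin afterward
# plays the role of one COUNTIF cell per bin.
counts = Counter(purchases)
frequency = {drink: counts[drink] for drink in bins}

for drink, count in frequency.items():
    print(drink.ljust(12), str(count).rjust(3))
print("Total".ljust(12), str(sum(frequency.values())).rjust(3))
```

Looking up a bin that never occurs in the data returns 0 from a `Counter`, which matches what COUNTIF reports for a value that is absent from the range.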

Relative Frequency and Percent Frequency Distributions

A frequency distribution shows the number (frequency) of items in each of several nonoverlapping bins. However, we are often interested in the proportion, or percentage, of items in each bin. The relative frequency of a bin equals the fraction or proportion of items belonging to a class. For a data set with n observations, the relative frequency of each bin can be determined as follows:

$$\text{Relative frequency of a bin} = \frac{\text{Frequency of the bin}}{n}$$

A relative frequency distribution is a tabular summary of data showing the relative frequency for each bin. A percent frequency distribution summarizes the percent frequency of the data for each bin. Table 2.5 shows a relative frequency distribution and a percent frequency distribution for the soft drink data. Using the data from Table 2.4, we see that the relative frequency for Coca-Cola is 19/50 = 0.38, the relative frequency for Diet Coke is 8/50 = 0.16, and so on. From the percent frequency distribution, we see that 38% of the purchases were Coca-Cola, 16% were Diet Coke, and so on. We can also note that 38% + 26% + 16% = 80% of the purchases were the top three soft drinks.
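As a quick check of these figures, the relative and percent frequencies can be computed from the counts in Table 2.4. The Python sketch below is illustrative; the variable names are chosen here and do not come from the text.

```python
# Frequencies from Table 2.4.
frequency = {
    "Coca-Cola": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi": 13,
    "Sprite": 5,
}

n = sum(frequency.values())  # total number of observations

# Relative frequency = frequency of the bin / n
relative = {drink: count / n for drink, count in frequency.items()}

# Percent frequency = relative frequency multiplied by 100
percent = {drink: 100 * count / n for drink, count in frequency.items()}

# Combined share of the three most frequently purchased soft drinks.
top_three_share = sum(sorted(percent.values(), reverse=True)[:3])

for drink in frequency:
    print(drink.ljust(12), f"{relative[drink]:.2f}", f"{percent[drink]:.0f}%")
print("Top three share:", f"{top_three_share:.0f}%")
```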

A percent frequency distribution can be used to provide estimates of the relative likelihoods of different values for a random variable. So, by constructing a percent frequency distribution from observations of a random variable, we can estimate the probability distribution that characterizes its variability. For example, the volume of soft drinks sold by a concession stand at an upcoming concert may not be known with certainty. However, if the data used to construct Table 2.5 are representative of the concession stand’s customer population, then the concession stand manager can use this information to determine the appropriate volume of each type of soft drink.

See Appendix A for more information on absolute versus relative references in Excel.

The percent frequency of a bin is the relative frequency multiplied by 100.

TABLE 2.4 Frequency Distribution of Soft Drink Purchases

Soft Drink    Frequency
Coca-Cola     19
Diet Coke     8
Dr. Pepper    5
Pepsi         13
Sprite        5
Total         50

FIGURE 2.10 Creating a Frequency Distribution for Soft Drinks Data in Excel

[Excel worksheet: the sample data in cells A2:B26, the five bins in column D, and the resulting frequencies (19, 8, 5, 13, and 5) in column E.]

TABLE 2.5 Relative Frequency and Percent Frequency Distributions of Soft Drink Purchases

Soft Drink    Relative Frequency    Percent Frequency (%)
Coca-Cola     0.38                  38
Diet Coke     0.16                  16
Dr. Pepper    0.10                  10
Pepsi         0.26                  26
Sprite        0.10                  10
Total         1.00                  100

Frequency Distributions for Quantitative Data

We can also create frequency distributions for quantitative data, but we must be more careful in defining the nonoverlapping bins to be used in the frequency distribution. For example, consider the quantitative data in Table 2.6. These data show the time in days required to complete year-end audits for a sample of 20 clients of Sanderson and Clifford, a small public accounting firm. The three steps necessary to define the classes for a frequency distribution with quantitative data are as follows:

1. Determine the number of nonoverlapping bins.
2. Determine the width of each bin.
3. Determine the bin limits.

Let us demonstrate these steps by developing a frequency distribution for the audit time data shown in Table 2.6.

Number of Bins

Bins are formed by specifying the ranges used to group the data. As a general guideline, we recommend using from 5 to 20 bins. For a small number of data items, as few as five or six bins may be used to summarize the data. For a larger number of data items, more bins are usually required. The goal is to use enough bins to show the variation in the data, but not so many that some contain only a few data items. Because the number of data items in Table 2.6 is relatively small (n = 20), we chose to develop a frequency distribution with five bins.

Width of the Bins

Second, choose a width for the bins. As a general guideline, we recommend that the width be the same for each bin. Thus the choices of the number of bins and the width of bins are not independent decisions. A larger number of bins means a smaller bin width and vice versa. To determine an approximate bin width, we begin by identifying the largest and smallest data values. Then, with the desired number of bins specified, we can use the following expression to determine the approximate bin width.

APPROXIMATE BIN WIDTH

$$\frac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of bins}} \tag{2.1}$$

The approximate bin width given by equation (2.1) can be rounded to a more convenient value based on the preference of the person developing the frequency distribution. For example, an approximate bin width of 9.28 might be rounded to 10 simply because 10 is a more convenient bin width to use in presenting a frequency distribution.

For the data involving the year-end audit times, the largest data value is 33, and the smallest data value is 12. Because we decided to summarize the data with five classes, using equation (2.1) provides an approximate bin width of (33 − 12)/5 = 4.2. We therefore decided to round up and use a bin width of five days in the frequency distribution.
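The arithmetic in equation (2.1) can be reproduced with a few lines of Python. This is a sketch for the audit time data only; rounding up to the next whole number with `math.ceil` is one convenient choice that happens to match the text's decision here, not a general rule.

```python
import math

# Year-end audit times (days) from Table 2.6.
audit_times = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
               22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

number_of_bins = 5

# Equation (2.1): (largest value - smallest value) / number of bins
approx_width = (max(audit_times) - min(audit_times)) / number_of_bins

# Round up to a more convenient whole number of days.
bin_width = math.ceil(approx_width)

print("Approximate bin width:", approx_width)
print("Chosen bin width:", bin_width)
```

Repeating the calculation with a different `number_of_bins` is an easy way to explore alternative bin widths.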

In practice, the number of bins and the appropriate class width are determined by trial and error. Once a possible number of bins is chosen, equation (2.1) is used to find the approximate class width. The process can be repeated for a different number of bins. Ultimately, the analyst judges the combination of the number of bins and bin width that provides the best frequency distribution for summarizing the data.

TABLE 2.6 Year-End Audit Times (Days)

12    14    19    18
15    15    18    17
20    27    22    23
22    21    33    28
14    18    16    13

AuditData

For the audit time data in Table 2.6, after deciding to use five bins, each with a width of five days, the next task is to specify the bin limits for each of the classes.

Bin Limits

Bin limits must be chosen so that each data item belongs to one and only one class. The lower bin limit identifies the smallest possible data value assigned to the bin. The upper bin limit identifies the largest possible data value assigned to the class. In developing frequency distributions for qualitative data, we did not need to specify bin limits because each data item naturally fell into a separate bin. But with quantitative data, such as the audit times in Table 2.6, bin limits are necessary to determine where each data value belongs.

Using the audit time data in Table 2.6, we selected 10 days as the lower bin limit and 14 days as the upper bin limit for the first class. This bin is denoted 10–14 in Table 2.7. The smallest data value, 12, is included in the 10–14 bin. We then selected 15 days as the lower bin limit and 19 days as the upper bin limit of the next class. We continued defining the lower and upper bin limits to obtain a total of five classes: 10–14, 15–19, 20–24, 25–29, and 30–34. The largest data value, 33, is included in the 30–34 bin. The difference between the upper bin limits of adjacent bins is the bin width. Using the first two upper bin limits of 14 and 19, we see that the bin width is 19 − 14 = 5.

With the number of bins, bin width, and bin limits determined, a frequency distribution can be obtained by counting the number of data values belonging to each bin. For example, the data in Table 2.6 show that four values—12, 14, 14, and 13—belong to the 10–14 bin. Thus, the frequency for the 10–14 bin is 4. Continuing this counting process for the 15–19, 20–24, 25–29, and 30–34 bins provides the frequency distribution shown in Table 2.7. Using this frequency distribution, we can observe the following:

• The most frequently occurring audit times are in the bin of 15–19 days. Eight of the 20 audit times are in this bin.

• Only one audit required 30 or more days.

Other conclusions are possible, depending on the interests of the person viewing the frequency distribution. The value of a frequency distribution is that it provides insights about the data that are not easily obtained by viewing the data in their original unorganized form. Table 2.7 also shows the relative frequency distribution and percent frequency distribution for the audit time data. Note that 0.40 of the audits, or 40%, required from 15 to 19 days. Only 0.05 of the audits, or 5%, required 30 or more days. Again, additional interpretations and insights can be obtained by using Table 2.7.
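The counting process described above can be sketched in Python to confirm the entries of Table 2.7. The snippet is illustrative; `bin_limits` and `rows` are names chosen here, and each bin is treated as including both its lower and upper limit, as in the text.

```python
# Year-end audit times (days) from Table 2.6.
audit_times = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
               22, 23, 22, 21, 33, 28, 14, 18, 16, 13]

# (lower limit, upper limit) for each of the five bins.
bin_limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34)]

n = len(audit_times)
rows = []
for lower, upper in bin_limits:
    # Count the values that fall inside this bin, limits included.
    frequency = sum(lower <= t <= upper for t in audit_times)
    rows.append((f"{lower}-{upper}", frequency, frequency / n, 100 * frequency / n))

# Every observation should belong to one and only one bin.
assert sum(row[1] for row in rows) == n

for label, frequency, relative, percent in rows:
    print(label.ljust(8), str(frequency).rjust(3), f"{relative:.2f}", f"{percent:.0f}")
```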

Frequency distributions for quantitative data can also be created using Excel. Figure 2.11 shows the data from Table 2.6 entered into an Excel Worksheet. The sample of 20 audit times is contained in cells A2:D6. The upper limits of the defined bins are in cells A10:A14.

Although an audit time of 12 days is actually the smallest observation in our data, we have chosen a lower bin limit of 10 simply for convenience. The lowest bin limit should include the smallest observation, and the highest bin limit should include the largest observation.

We define the relative frequency and percent frequency distributions for quantitative data in the same manner as for qualitative data.

TABLE 2.7 Frequency, Relative Frequency, and Percent Frequency Distributions for the Audit Time Data

Audit Times (days)    Frequency    Relative Frequency    Percent Frequency
10–14                 4            0.20                  20
15–19                 8            0.40                  40
20–24                 5            0.25                  25
25–29                 2            0.10                  10
30–34                 1            0.05                  5

AuditData

We can use the FREQUENCY function in Excel to count the number of observations in each bin.

Step 1. Select cells B10:B14
Step 2. Type the formula =FREQUENCY(A2:D6, A10:A14). The range A2:D6 defines the data set, and the range A10:A14 defines the bins.
Step 3. Press CTRL+SHIFT+ENTER after typing the formula in Step 2.

Because these were the cells selected in Step 1 above (see Figure 2.11), Excel will then fill in the values for the number of observations in each bin in cells B10 through B14.
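To see how counting against upper bin limits works, here is a small Python sketch that mimics the behavior described above. It is not Excel's implementation: the function name `frequency` is chosen here, the limits are assumed to be sorted in ascending order, and the extra final slot for values above the last limit is included as an assumption that mirrors the "More" row Excel's histogram tool reports.

```python
from bisect import bisect_left

# Year-end audit times (days) from Table 2.6 and the upper bin limits
# stored in cells A10:A14.
audit_times = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
               22, 23, 22, 21, 33, 28, 14, 18, 16, 13]
upper_limits = [14, 19, 24, 29, 34]


def frequency(data, limits):
    """Count values per bin given ascending upper limits.

    Bin i holds values greater than limits[i - 1] and less than or equal
    to limits[i]. One extra slot at the end collects values above the
    last limit.
    """
    counts = [0] * (len(limits) + 1)
    for value in data:
        # bisect_left finds the first limit that is >= value.
        counts[bisect_left(limits, value)] += 1
    return counts


print(frequency(audit_times, upper_limits))
```

The first five counts correspond to cells B10 through B14; the final count is zero because no audit took more than 34 days.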

FIGURE 2.11 Using Excel to Generate a Frequency Distribution for Audit Times Data

[Excel worksheet shown in formula view and in value view: the 20 year-end audit times in cells A2:D6, the upper bin limits 14, 19, 24, 29, and 34 in cells A10:A14, and the array formula =FREQUENCY(A2:D6,A10:A14) in cells B10:B14, which returns the frequencies 4, 8, 5, 2, and 1.]

Pressing CTRL+SHIFT+ENTER in Excel indicates that the function should return an array of values.

Histograms

A common graphical presentation of quantitative data is a histogram. This graphical summary can be prepared for data previously summarized in either a frequency, a relative frequency, or a percent frequency distribution. A histogram is constructed by placing the variable of interest on the horizontal axis and the selected frequency measure (absolute frequency, relative frequency, or percent frequency) on the vertical axis. The frequency measure of each class is shown by drawing a rectangle whose base is the class limits on the horizontal axis and whose height is the corresponding frequency measure.

Figure 2.12 is a histogram for the audit time data. Note that the class with the greatest frequency is shown by the rectangle appearing above the class of 15–19 days. The height of the rectangle shows that the frequency of this class is 8. A histogram for the relative or percent frequency distribution of these data would look the same as the histogram in Figure 2.12, with the exception that the vertical axis would be labeled with relative or percent frequency values.
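As a rough, text-only illustration of the idea, the frequencies from Table 2.7 can be drawn as bars of symbols in Python. This sketch rotates the chart (bins run down the page and bar length stands in for rectangle height), and the function name `text_histogram` is chosen here for the example.

```python
# Bins and frequencies from Table 2.7.
labels = ["10-14", "15-19", "20-24", "25-29", "30-34"]
frequencies = [4, 8, 5, 2, 1]


def text_histogram(labels, frequencies, symbol="#"):
    """Return one line per bin; bar length equals the bin's frequency."""
    lines = []
    for label, freq in zip(labels, frequencies):
        lines.append(f"{label} | {symbol * freq} {freq}")
    return "\n".join(lines)


print(text_histogram(labels, frequencies))
```

Passing relative or percent frequencies (scaled to whole numbers) would change the bar labels but not the overall shape, which is the point made in the paragraph above.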

Histograms can be created in Excel using the Data Analysis ToolPak. We will use the sample of 20 year-end audit times and the bins defined in Table 2.7 to create a histogram using the Data Analysis ToolPak. As before, we begin with an Excel Worksheet in which the sample of 20 audit times is contained in cells A2:D6, and the upper limits of the bins defined in Table 2.7 are in cells A10:A14 (see Figure 2.11).

Step 1. Click the Data tab in the Ribbon
Step 2. Click Data Analysis in the Analyze group
Step 3. When the Data Analysis dialog box opens, choose Histogram from the list of Analysis Tools, and click OK
        In the Input Range: box, enter A2:D6
        In the Bin Range: box, enter A10:A14
        Under Output Options:, select New Worksheet Ply:
        Select the check box for Chart Output (see Figure 2.13)
        Click OK

The histogram created by Excel for these data is shown in Figure 2.14. We have modified the bin ranges in column A by typing the values shown in Figure 2.14 into cells A2:A6 so that the chart created by Excel shows both the lower and upper limits for each bin. We have also removed the gaps between the columns in the histogram in Excel to match the traditional format of histograms. To remove the gaps between the columns in the histogram created by Excel, follow these steps:

Step 1. Right-click on one of the columns in the histogram
        Select Format Data Series…
Step 2. When the Format Data Series pane opens, click the Series Options button
        Set the Gap Width to 0%

The Data Analysis ToolPak can be found in the Analysis group in versions of Excel prior to Excel 2016.

The text “10-14” in cell A2 can be entered in Excel as '10-14. The single quote indicates to Excel that this should be treated as text rather than a numerical or date value.

FIGURE 2.12 Histogram for the Audit Time Data

[Histogram with Audit Time (days) on the horizontal axis, using the bins 10–14, 15–19, 20–24, 25–29, and 30–34, and Frequency on the vertical axis.]

FIGURE 2.13 Creating a Histogram for the Audit Time Data Using Data Analysis ToolPak in Excel

[Excel worksheet with the year-end audit times and the bins 14, 19, 24, 29, and 34, together with the Histogram dialog box. Input Range: $A$2:$D$6. Bin Range: $A$10:$A$14. The dialog also shows a Labels check box, the output options Output Range and New Worksheet Ply, and check boxes for Pareto (sorted histogram), Cumulative Percentage, and Chart Output.]

FIGURE 2.14 Completed Histogram for the Audit Time Data Using Data Analysis ToolPak in Excel

[Excel worksheet listing Bin and Frequency (10–14: 4, 15–19: 8, 20–24: 5, 25–29: 2, 30–34: 1, More: 0) next to the resulting histogram chart, with Bin on the horizontal axis and Frequency on the vertical axis.]

One of the most important uses of a histogram is to provide information about the shape, or form, of a distribution. Skewness, or the lack of symmetry, is an important characteristic of the shape of a distribution. Figure 2.15 contains four histograms constructed from relative frequency distributions that exhibit different patterns of skewness. Panel A shows the histogram for a set of data moderately skewed to the left. A histogram is said to be skewed to the left if its tail extends farther to the left than to the right. This histogram is typical for exam scores, with no scores above 100%, most of the scores above 70%, and only a few really low scores.

Panel B shows the histogram for a set of data moderately skewed to the right. A histogram is said to be skewed to the right if its tail extends farther to the right than to the left. An example of this type of histogram would be for data such as housing prices; a few expensive houses create the skewness in the right tail.

Panel C shows a symmetric histogram, in which the left tail mirrors the shape of the right tail. Histograms for data found in applications are never perfectly symmetric, but the histogram for many applications may be roughly symmetric. Data for SAT scores, the heights and weights of people, and so on lead to histograms that are roughly symmetric.

Panel D shows a histogram highly skewed to the right. This histogram was constructed from data on the amount of customer purchases in one day at a women’s apparel store. Data from applications in business and economics often lead to histograms that are skewed to the right. For instance, data on housing prices, salaries, purchase amounts, and so on often result in histograms skewed to the right.

FIGURE 2.15 Histograms Showing Distributions with Different Levels of Skewness

[Four relative frequency histograms. Panel A: Moderately Skewed Left. Panel B: Moderately Skewed Right. Panel C: Symmetric. Panel D: Highly Skewed Right.]

Cumulative Distributions

A variation of the frequency distribution that provides another tabular summary of quantitative data is the cumulative frequency distribution, which uses the number of classes, class widths, and class limits developed for the frequency distribution. However, rather than showing the frequency of each class, the cumulative frequency distribution shows the number of data items with values less than or equal to the upper class limit of each class. The first two columns of Table 2.8 provide the cumulative frequency distribution for the audit time data.

To understand how the cumulative frequencies are determined, consider the class with the description “Less than or equal to 24.” The cumulative frequency for this class is simply the sum of the frequencies for all classes with data values less than or equal to 24. For the frequency distribution in Table 2.7, the sum of the frequencies for classes 10–14, 15–19, and 20–24 indicates that 4 + 8 + 5 = 17 data values are less than or equal to 24. Hence, the cumulative frequency for this class is 17. In addition, the cumulative frequency distribution in Table 2.8 shows that four audits were completed in 14 days or less and that 19 audits were completed in 29 days or less.

As a final point, a cumulative relative frequency distribution shows the proportion of data items, and a cumulative percent frequency distribution shows the percentage of data items with values less than or equal to the upper limit of each class. The cumulative relative frequency distribution can be computed either by summing the relative frequencies in the relative frequency distribution or by dividing the cumulative frequencies by the total number of items. Using the latter approach, we found the cumulative relative frequencies in column 3 of Table 2.8 by dividing the cumulative frequencies in column 2 by the total number of items (n = 20). The cumulative percent frequencies were again computed by multiplying the cumulative relative frequencies by 100. The cumulative relative and percent frequency distributions show that 0.85 of the audits, or 85%, were completed in 24 days or less, 0.95 of the audits, or 95%, were completed in 29 days or less, and so on.
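The running totals described here can be sketched in Python with `itertools.accumulate`, starting from the frequencies in Table 2.7. The variable names are chosen for this illustration only.

```python
from itertools import accumulate

# Upper class limits and frequencies from Table 2.7.
upper_limits = [14, 19, 24, 29, 34]
frequencies = [4, 8, 5, 2, 1]

n = sum(frequencies)

# Running total of the frequencies gives the cumulative frequencies.
cumulative = list(accumulate(frequencies))

# Divide by n for cumulative relative frequencies, then multiply by 100
# for cumulative percent frequencies.
cumulative_relative = [c / n for c in cumulative]
cumulative_percent = [100 * c / n for c in cumulative]

for limit, c, rel, pct in zip(upper_limits, cumulative,
                              cumulative_relative, cumulative_percent):
    print(f"Less than or equal to {limit}", str(c).rjust(3), f"{rel:.2f}", f"{pct:.0f}")
```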

TABLE 2.8 Cumulative Frequency, Cumulative Relative Frequency, and Cumulative Percent Frequency Distributions for the Audit Time Data

Audit Time (days)           Cumulative Frequency    Cumulative Relative Frequency    Cumulative Percent Frequency
Less than or equal to 14    4                       0.20                             20
Less than or equal to 19    12                      0.60                             60
Less than or equal to 24    17                      0.85                             85
Less than or equal to 29    19                      0.95                             95
Less than or equal to 34    20                      1.00                             100

NOTES + COMMENTS

1. If Data Analysis does not appear in your Analyze group (or Analysis group in versions of Excel prior to Excel 2016), then you will have to include the Data Analysis ToolPak Add-In. To do so, click on the File tab and choose Options. When the Excel Options dialog box opens, click Add-Ins. At the bottom of the Excel Options dialog box, where it says Manage: Excel Add-ins, click Go.… Select the check box for Analysis ToolPak, and click OK.

2. Distributions are often used when discussing concepts related to probability and simulation because they are used to describe uncertainty. In Chapter 5 we will discuss probability distributions, and then in Chapter 11 we will revisit distributions when we introduce simulation models.

3. In Excel 2016, histograms can also be created using the new Histogram chart, which can be found by clicking on the Insert tab in the Ribbon, clicking Insert Statistic Chart in the Charts group, and selecting Histogram. Excel automatically chooses the number of bins and bin sizes. These values can be changed using Format Axis, but the functionality is more limited than the steps we provide in this section to create your own histogram.


2.5 Measures of Location

Mean (Arithmetic Mean)

The most commonly used measure of location is the mean (arithmetic mean), or average value, for a variable. The mean provides a measure of central location for the data. If the data are for a sample (typically the case), the mean is denoted by $\bar{x}$. The sample mean is a point estimate of the (typically unknown) population mean for the variable of interest. If the data for the entire population are available, the population mean is computed in the same manner, but denoted by the Greek letter $\mu$.

In statistical formulas, it is customary to denote the value of variable x for the first observation by $x_1$, the value of variable x for the second observation by $x_2$, and so on. In general, the value of variable x for the ith observation is denoted by $x_i$. For a sample with n observations, the formula for the sample mean is as follows.

SAMPLE MEAN

$$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \tag{2.2}$$

If the data set is not a sample, but is the entire population with N observations, the population mean is computed directly by: $\mu = \dfrac{\sum x_i}{N}$

To illustrate the computation of a sample mean, suppose a sample of home sales is taken for a suburb of Cincinnati, Ohio. Table 2.9 shows the collected data. The mean home selling price for the sample of 12 home sales is

$$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_{12}}{12} = \frac{138{,}000 + 254{,}000 + \cdots + 456{,}250}{12} = \frac{2{,}639{,}250}{12} = 219{,}937.50$$

The mean can be found in Excel using the AVERAGE function. Figure 2.16 shows the Home Sales data from Table 2.9 in an Excel spreadsheet. The value for the mean in cell E2 is calculated using the formula =AVERAGE(B2:B13).
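The same calculation can be checked in Python, both by applying equation (2.2) directly and with the standard library's `statistics.mean`. This is an illustrative sketch; the list name `selling_prices` is chosen here.

```python
from statistics import mean

# Selling prices ($) for the 12 home sales in Table 2.9.
selling_prices = [138000, 254000, 186000, 257500, 108000, 254000,
                  138000, 298000, 199500, 208000, 142000, 456250]

# Equation (2.2): sum of the observations divided by the sample size n.
total = sum(selling_prices)
n = len(selling_prices)
sample_mean = total / n

print("Sum of selling prices:", total)
print("Sample mean:", sample_mean)

# The library function gives the same value.
assert sample_mean == mean(selling_prices)
```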

TABLE 2.9 Data on Home Sales in a Cincinnati, Ohio, Suburb

Home Sale    Selling Price ($)
1            138,000
2            254,000
3            186,000
4            257,500
5            108,000
6            254,000
7            138,000
8            298,000
9            199,500
10           208,000
11           142,000
12           456,250

HomeSales

Median

The median, another measure of central location, is the value in the middle when the data are arranged in ascending order (smallest to largest value). With an odd number of observations, the median is the middle value. An even number of observations has no single middle value. In this case, we follow convention and define the median as the average of the values for the middle two observations.

Let us apply this definition to compute the median class size for a sample of five college classes. Arranging the data in ascending order provides the following list:

32 42 46 46 54

Because n = 5 is odd, the median is the middle value. Thus, the median class size is 46 students. Even though this data set contains two observations with values of 46, each observation is treated separately when we arrange the data in ascending order.

Suppose we also compute the median value for the 12 home sales in Table 2.9. We first arrange the data in ascending order.

108,000 138,000 138,000 142,000 186,000 199,500 208,000 254,000 254,000 257,500 298,000 456,250

(Middle two values: 199,500 and 208,000)

Because n = 12 is even, the median is the average of the middle two values: 199,500 and 208,000.

$$\text{Median} = \frac{199{,}500 + 208{,}000}{2} = 203{,}750$$

FIGURE 2.16 Calculating the Mean, Median, and Modes for the Home Sales Data Using Excel

[Excel worksheet shown in formula view and in value view: home sale numbers in column A and selling prices in cells B2:B13. Column E contains Mean: =AVERAGE(B2:B13), Median: =MEDIAN(B2:B13), and Mode 1 and Mode 2: =MODE.MULT(B2:B13), which return $219,937.50, $203,750.00, $138,000.00, and $254,000.00.]

The median of a data set can be found in Excel using the function MEDIAN. In Figure 2.16, the value for the median in cell E3 is found using the formula =MEDIAN(B2:B13).

Although the mean is the more commonly used measure of central location, in some situations the median is preferred. The mean is influenced by extremely small and large data values. Notice that the median is smaller than the mean in Figure 2.16. This is because the one large value of $456,250 in our data set inflates the mean but does not have the same effect on the median. Notice also that the median would remain unchanged if we replaced the $456,250 with a sales price of $1.5 million. In this case, the median selling price would remain $203,750, but the mean would increase to $306,916.67. If you were looking to buy a home in this suburb, the median gives a better indication of the central selling price of the homes there. We can generalize, saying that whenever a data set contains extreme values or is severely skewed, the median is often the preferred measure of central location.
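The effect of the extreme value can be reproduced with a short Python sketch using the standard library's `statistics` module. The replacement of $456,250 by $1.5 million follows the scenario in the paragraph above; the variable names are chosen here for illustration.

```python
from statistics import mean, median

# Selling prices ($) for the 12 home sales in Table 2.9.
selling_prices = [138000, 254000, 186000, 257500, 108000, 254000,
                  138000, 298000, 199500, 208000, 142000, 456250]

original_mean = mean(selling_prices)
original_median = median(selling_prices)  # average of the two middle values

# Replace the largest selling price with $1.5 million.
with_extreme = [1500000 if price == 456250 else price
                for price in selling_prices]

extreme_mean = mean(with_extreme)
extreme_median = median(with_extreme)

print("Original:     mean", round(original_mean, 2), "median", original_median)
print("With extreme: mean", round(extreme_mean, 2), "median", extreme_median)
```

The median is unchanged because only the largest observation moved, while the mean shifts by the change in the total divided by 12.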

Mode

A third measure of location, the mode, is the value that occurs most frequently in a data set. To illustrate the identification of the mode, consider the sample of five class sizes.

32 42 46 46 54

The only value that occurs more than once is 46. Because this value, occurring with a frequency of 2, has the greatest frequency, it is the mode. To find the mode for a data set with only one most often occurring value in Excel, we use the MODE.SNGL function.

Occasionally the greatest frequency occurs at two or more different values, in which case more than one mode exists. If data contain at least two modes, we say that they are multimodal. A special case of multimodal data occurs when the data contain exactly two modes; in such cases we say that the data are bimodal. In multimodal cases when there are more than two modes, the mode is almost never reported because listing three or more modes is not particularly helpful in describing a location for the data. Also, if no value in the data occurs more than once, we say the data have no mode.

The Excel MODE.SNGL function will return only a single most-often-occurring value. For multimodal distributions, we must use the MODE.MULT command in Excel to return more than one mode. For example, two selling prices occur twice in Table 2.9: $138,000 and $254,000. Hence, these data are bimodal. To find both of the modes in Excel, we take these steps:

Step 1. Select cells E4 and E5
Step 2. Type the formula =MODE.MULT(B2:B13)
Step 3. Press CTRL+SHIFT+ENTER after typing the formula in Step 2.

Excel enters the values for both modes of this data set in cells E4 and E5: $138,000 and $254,000.
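Outside Excel, the same modes can be found in one call. The sketch below assumes Python 3.8 or later, where `statistics.multimode` returns every value tied for the greatest frequency, in the order first encountered; it plays a role similar to MODE.MULT but is not the Excel function itself.

```python
from statistics import multimode

class_sizes = [32, 42, 46, 46, 54]
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

print(multimode(class_sizes))  # [46]
print(multimode(prices))       # [138000, 254000]
```

Note that when no value repeats, `multimode` returns every value, so the "no mode" case described above needs a separate check.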

We must press CTRL+SHIFT+ENTER because the MODE.MULT function returns an array of values.

Geometric Mean

The geometric mean is a measure of location that is calculated by finding the nth root of the product of n values. The general formula for the sample geometric mean, denoted \(\bar{x}_g\), follows.

SAMPLE GEOMETRIC MEAN

\[ \bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = [(x_1)(x_2)\cdots(x_n)]^{1/n} \qquad (2.3) \]

The geometric mean for a population is computed similarly but is defined as \(\mu_g\) to denote that it is computed using the entire population.


The geometric mean is often used in analyzing growth rates in financial data. In these types of situations, the arithmetic mean or average value will provide misleading results.

To illustrate the use of the geometric mean, consider Table 2.10, which shows the percentage annual returns, or growth rates, for a mutual fund over the past 10 years. Suppose we want to compute how much $100 invested in the fund at the beginning of year 1 would be worth at the end of year 10. We start by computing the balance in the fund at the end of year 1. Because the percentage annual return for year 1 was −22.1%, the balance in the fund at the end of year 1 would be:

\[ \$100 - 0.221(\$100) = \$100(1 - 0.221) = \$100(0.779) = \$77.90 \]

We refer to 0.779 as the growth factor for year 1 in Table 2.10. We can compute the balance at the end of year 1 by multiplying the value invested in the fund at the beginning of year 1 by the growth factor for year 1: \(\$100(0.779) = \$77.90\).

The balance in the fund at the end of year 1, $77.90, now becomes the beginning balance in year 2. So, with a percentage annual return for year 2 of 28.7%, the balance at the end of year 2 would be

\[ \$77.90 + 0.287(\$77.90) = \$77.90(1 + 0.287) = \$77.90(1.287) = \$100.26 \]

Note that 1.287 is the growth factor for year 2. By substituting $100(0.779) for $77.90, we see that the balance in the fund at the end of year 2 is

\[ \$100(0.779)(1.287) = \$100.26 \]

In other words, the balance at the end of year 2 is just the initial investment at the beginning of year 1 times the product of the first two growth factors. This result can be generalized to show that the balance at the end of year 10 is the initial investment times the product of all 10 growth factors.

\[ \$100[(0.779)(1.287)(1.109)(1.049)(1.158)(1.055)(0.630)(1.265)(1.151)(1.021)] = \$100(1.335) = \$133.45 \]

So a $100 investment in the fund at the beginning of year 1 would be worth $133.45 at the end of year 10. Note that the product of the 10 growth factors is 1.335. Thus, we can compute the balance at the end of year 10 for any amount of money invested at the beginning of year 1 by multiplying the value of the initial investment by 1.335. For instance, an initial investment of $2,500 at the beginning of year 1 would be worth $2,500(1.335), or approximately $3,337.50, at the end of year 10.

The growth factor for each year is 1 plus 0.01 times the percentage return. A growth factor less than 1 indicates negative growth, whereas a growth factor greater than 1 indicates positive growth. The growth factor cannot be less than zero.

TABLE 2.10 Percentage Annual Returns and Growth Factors for the Mutual Fund Data

Year    Return (%)    Growth Factor
1       −22.1         0.779
2        28.7         1.287
3        10.9         1.109
4         4.9         1.049
5        15.8         1.158
6         5.5         1.055
7       −37.0         0.630
8        26.5         1.265
9        15.1         1.151
10        2.1         1.021

MutualFundsReturns


What was the mean percentage annual return or mean rate of growth for this investment over the 10-year period? The geometric mean of the 10 growth factors can be used to answer this question. Because the product of the 10 growth factors is 1.335, the geometric mean is the 10th root of 1.335, or

\[ \bar{x}_g = \sqrt[10]{1.335} = 1.029 \]

The geometric mean tells us that annual returns grew at an average annual rate of (1.029 − 1)100, or 2.9%. In other words, with an average annual growth rate of 2.9%, a $100 investment in the fund at the beginning of year 1 would grow to \(\$100(1.029)^{10} = \$133.09\) at the end of 10 years. We can use Excel to calculate the geometric mean for the data in Table 2.10 by using the function GEOMEAN. In Figure 2.17, the value for the geometric mean in cell C13 is found using the formula =GEOMEAN(C2:C11).
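The growth-factor arithmetic above can be reproduced with a few lines of Python. This is a sketch that assumes Python 3.8 or later (for `math.prod` and `statistics.geometric_mean`); the names `returns_pct`, `growth_factors`, `total_growth`, and `x_g` are illustrative.

```python
from math import prod
from statistics import geometric_mean

# Percentage annual returns from Table 2.10
returns_pct = [-22.1, 28.7, 10.9, 4.9, 15.8, 5.5, -37.0, 26.5, 15.1, 2.1]

# Growth factor = 1 + 0.01 * (percentage return)
growth_factors = [1 + r / 100 for r in returns_pct]

total_growth = prod(growth_factors)    # product of the 10 growth factors
x_g = geometric_mean(growth_factors)   # 10th root of that product

print(round(100 * total_growth, 2))    # 133.45
print(round(x_g, 3))                   # 1.029
print(round((x_g - 1) * 100, 1))       # 2.9
```

Working from the unrounded geometric mean recovers the $133.45 balance; the $133.09 figure in the text comes from rounding the geometric mean to 1.029 before raising it to the 10th power.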

It is important to understand that the arithmetic mean of the percentage annual returns does not provide the mean annual growth rate for this investment. The sum of the 10 percentage annual returns in Table 2.10 is 50.4. Thus, the arithmetic mean of the 10 percentage returns is 50.4/10 = 5.04%. A salesperson might try to convince you to invest in this fund by stating that the mean annual percentage return was 5.04%. Such a statement is not only misleading, it is inaccurate. A mean annual percentage return of 5.04% corresponds to an average growth factor of 1.0504. So, if the average growth factor were really 1.0504, $100 invested in the fund at the beginning of year 1 would have grown to \(\$100(1.0504)^{10} = \$163.51\) at the end of 10 years. But, using the 10 annual percentage returns in Table 2.10, we showed that an initial $100 investment is worth $133.45 at the end of 10 years. The salesperson’s claim that the mean annual percentage return is 5.04% grossly overstates the true growth for this mutual fund. The problem is that the arithmetic mean is appropriate only for an additive process. For a multiplicative process, such as applications involving growth rates, the geometric mean is the appropriate measure of location.
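To see how far the two averages diverge, the sketch below compounds $100 for 10 years at the arithmetic mean return and at the geometric mean return. It uses only Python's standard library; the variable names are illustrative.

```python
from statistics import mean, geometric_mean

returns_pct = [-22.1, 28.7, 10.9, 4.9, 15.8, 5.5, -37.0, 26.5, 15.1, 2.1]

arithmetic_pct = mean(returns_pct)
geometric_pct = (geometric_mean([1 + r / 100 for r in returns_pct]) - 1) * 100

# Ending balance of $100 if every year grew at the stated average rate
balance_arithmetic = 100 * (1 + arithmetic_pct / 100) ** 10
balance_geometric = 100 * (1 + geometric_pct / 100) ** 10

print(round(arithmetic_pct, 2))      # 5.04
print(round(balance_arithmetic, 2))  # 163.51
print(round(balance_geometric, 2))   # 133.45
```

Only the geometric rate reproduces the balance obtained by applying the 10 actual returns in sequence.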

While the application of the geometric mean to problems in finance, investments, and banking is particularly common, the geometric mean should be applied any time you want to determine the mean rate of change over several successive periods. Other common applications include changes in the populations of species, crop yields, pollution levels, and birth and death rates. The geometric mean can also be applied to changes that occur over any number of successive periods of any length. In addition to annual changes, the geometric mean is often applied to find the mean rate of change over quarters, months, weeks, and even days.

FIGURE 2.17 Calculating the Geometric Mean for the Mutual Fund Data Using Excel

[Figure: Excel worksheet with Year in column A, Return (%) in column B, and Growth Factor in column C (cells C2:C11) for the 10 years in Table 2.10. The geometric mean, 1.029, appears in cell C13.]

2.6 Measures of Variability

In addition to measures of location, it is often desirable to consider measures of variability, or dispersion. For example, suppose that you are considering two financial funds. Both funds require a $1,000 annual investment. Table 2.11 shows the annual payouts for Fund A and Fund B for $1,000 investments over the past 20 years. Fund A has paid out exactly $1,100 each year for an initial $1,000 investment. Fund B has had many different payouts, but the mean payout over the previous 20 years is also $1,100. But would you consider the payouts of Fund A and Fund B to be equivalent? Clearly, the answer is no. The difference between the two funds is due to variability.

Figure 2.18 shows a histogram for the payouts received from Funds A and B. Although the mean payout is the same for the two funds, their histograms differ in that the payouts associated with Fund B have greater variability. Sometimes the payouts are considerably larger than the mean, and sometimes they are considerably smaller. In this section, we present several different ways to measure variability.

Range

The simplest measure of variability is the range. The range can be found by subtracting the smallest value from the largest value in a data set. Let us return to the home sales data set to demonstrate the calculation of range. Refer to the data from home sales prices in Table 2.9. The largest home sales price is $456,250, and the smallest is $108,000. The range is $456,250 − $108,000 = $348,250.

TABLE 2.11 Annual Payouts for Two Different Investment Funds

Year    Fund A ($)    Fund B ($)
1       1,100           700
2       1,100         2,500
3       1,100         1,200
4       1,100         1,550
5       1,100         1,300
6       1,100           800
7       1,100           300
8       1,100         1,600
9       1,100         1,500
10      1,100           350
11      1,100           460
12      1,100           890
13      1,100         1,050
14      1,100           800
15      1,100         1,150
16      1,100         1,200
17      1,100         1,800
18      1,100           100
19      1,100         1,750
20      1,100         1,000
Mean    1,100         1,100


Although the range is the easiest of the measures of variability to compute, it is seldom used as the only measure. The reason is that the range is based on only two of the observations and thus is highly influenced by extreme values. If, for example, we replace the selling price of $456,250 with $1.5 million, the range would be $1,500,000 − $108,000 = $1,392,000. This large value for the range would not be especially descriptive of the variability in the data because 11 of the 12 home selling prices are between $108,000 and $298,000.

The range can be calculated in Excel using the MAX and MIN functions. The value in cell E7 of Figure 2.19 is calculated using the formula =MAX(B2:B13)-MIN(B2:B13). This subtracts the smallest value in the range B2:B13 from the largest value in the range B2:B13.
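The same subtraction is easy to express outside Excel. A minimal Python sketch, with `prices` and `price_range` as illustrative names:

```python
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

# Range = largest value minus smallest value
price_range = max(prices) - min(prices)
print(price_range)  # 348250
```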

FIGURE 2.18 Histograms for Payouts of Past 20 Years from Fund A and Fund B

[Figure: two histograms with frequency on the vertical axis. Fund A Payouts ($): a single bar at 1,100. Fund B Payouts ($): bars for the bins 0–500, 501–1,000, 1,001–1,500, 1,501–2,000, and 2,001–2,500.]

Variance

The variance is a measure of variability that utilizes all the data. The variance is based on the deviation about the mean, which is the difference between the value of each observation \((x_i)\) and the mean. For a sample, a deviation of an observation about the mean is written \((x_i - \bar{x})\). In the computation of the variance, the deviations about the mean are squared.

In most statistical applications, the data being analyzed are for a sample. When we compute a sample variance, we are often interested in using it to estimate the population variance, \(\sigma^2\). Although a detailed explanation is beyond the scope of this text, for a random sample, it can be shown that, if the sum of the squared deviations about the sample mean is divided by n − 1, and not n, the resulting sample variance provides an unbiased estimate of the population variance.¹

For this reason, the sample variance, denoted by \(s^2\), is defined as follows:

SAMPLE VARIANCE

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \qquad (2.4) \]

If the data are for a population, the population variance, \(\sigma^2\), can be computed directly (rather than estimated by the sample variance). For a population of N observations and with \(\mu\) denoting the population mean, population variance is computed by

\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]

¹Unbiased means that if we take a large number of independent random samples of the same size from the population and calculate the sample variance for each sample, the average of these sample variances will tend to be equal to the population variance.


FIGURE 2.19 Calculating Variability Measures for the Home Sales Data in Excel

[Figure: Excel worksheet shown in formula view and in value view. Column A lists home sales 1 through 12, and cells B2:B13 hold the selling prices ($): 138,000; 254,000; 186,000; 257,500; 108,000; 254,000; 138,000; 298,000; 199,500; 208,000; 142,000; 456,250. Column E holds the formulas and results listed here.]

Measure                      Formula                         Value
Mean:                        =AVERAGE(B2:B13)                $219,937.50
Median:                      =MEDIAN(B2:B13)                 $203,750.00
Mode 1:                      =MODE.MULT(B2:B13)              $138,000.00
Mode 2:                      =MODE.MULT(B2:B13)              $254,000.00
Range:                       =MAX(B2:B13)-MIN(B2:B13)        $348,250.00
Variance:                    =VAR.S(B2:B13)                  9037501420
Standard Deviation:          =STDEV.S(B2:B13)                $95,065.77
Coefficient of Variation:    =E9/E2                          43.22%
85th Percentile:             =PERCENTILE.EXC(B2:B13,0.85)    $305,912.50
1st Quartile:                =QUARTILE.EXC(B2:B13,1)         $139,000.00
2nd Quartile:                =QUARTILE.EXC(B2:B13,2)         $203,750.00
3rd Quartile:                =QUARTILE.EXC(B2:B13,3)         $256,625.00
IQR:                         =E17-E15                        $117,625.00

To illustrate the computation of the sample variance, we will use the data on class size from page 40 for the sample of five college classes. A summary of the data, including the computation of the deviations about the mean and the squared deviations about the mean, is shown in Table 2.12. The sum of squared deviations about the mean is \(\sum (x_i - \bar{x})^2 = 256\). Hence, with \(n - 1 = 4\), the sample variance is

\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{256}{4} = 64 \]

Note that the units of variance are squared. For instance, the sample variance for our calculation is \(s^2 = 64\ (\text{students})^2\). In Excel, you can find the variance for sample data using the VAR.S function. Figure 2.19 shows the data for home sales examined in the previous section. The variance in cell E8 is calculated using the formula =VAR.S(B2:B13). Excel calculates the variance of the sample of 12 home sales to be 9,037,501,420.

Standard Deviation

The standard deviation is defined to be the positive square root of the variance. We use \(s\) to denote the sample standard deviation and \(\sigma\) to denote the population standard deviation. The sample standard deviation, \(s\), is a point estimate of the population standard deviation, \(\sigma\), and is derived from the sample variance in the following way:

SAMPLE STANDARD DEVIATION

\[ s = \sqrt{s^2} \qquad (2.5) \]


The sample variance for the sample of class sizes in five college classes is \(s^2 = 64\). Thus, the sample standard deviation is \(s = \sqrt{64} = 8\).

Recall that the units associated with the variance are squared and that it is difficult to interpret the meaning of squared units. Because the standard deviation is the square root of the variance, the units of the variance, (students)² in our example, are converted to students in the standard deviation. In other words, the standard deviation is measured in the same units as the original data. For this reason, the standard deviation is more easily compared to the mean and other statistics that are measured in the same units as the original data.

Figure 2.19 shows the Excel calculation for the sample standard deviation of the home sales data, which can be calculated using Excel’s STDEV.S function. The sample standard deviation in cell E9 is calculated using the formula =STDEV.S(B2:B13). Excel calculates the sample standard deviation for the home sales to be $95,065.77.
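As a cross-check on equations (2.4) and (2.5), the sketch below computes the sample variance and sample standard deviation with Python's `statistics` module, whose `variance` and `stdev` functions divide by n − 1 and so correspond to VAR.S and STDEV.S (`pvariance` and `pstdev` are the population versions). The variable names are illustrative.

```python
from statistics import variance, stdev

class_sizes = [46, 54, 42, 46, 32]
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

# Class size data: s^2 = 256 / 4 = 64 and s = 8
print(f"{variance(class_sizes):.1f} {stdev(class_sizes):.1f}")  # 64.0 8.0

# Home sales data, matching cells E8 and E9 of Figure 2.19
print(f"{variance(prices):,.0f} {stdev(prices):,.2f}")  # 9,037,501,420 95,065.77
```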

If the data are for a population, the population standard deviation \(\sigma\) is obtained by taking the positive square root of the population variance: \(\sigma = \sqrt{\sigma^2}\). To calculate the population variance and population standard deviation in Excel, we use the functions =VAR.P and =STDEV.P.

TABLE 2.12 Computation of Deviations and Squared Deviations About the Mean for the Class Size Data

Number of Students    Mean Class      Deviation About the       Squared Deviation About
in Class (x_i)        Size (x̄)        Mean (x_i − x̄)            the Mean (x_i − x̄)²
46                    44                2                          4
54                    44               10                        100
42                    44               −2                          4
46                    44                2                          4
32                    44              −12                        144
                                        0                        256
                                      Σ(x_i − x̄)                Σ(x_i − x̄)²

Coefficient of Variation

In some situations we may be interested in a descriptive statistic that indicates how large the standard deviation is relative to the mean. This measure is called the coefficient of variation and is usually expressed as a percentage.

COEFFICIENT OF VARIATION

\[ \left( \frac{\text{Standard deviation}}{\text{Mean}} \times 100 \right)\% \qquad (2.6) \]

For the class size data on page 40, we found a sample mean of 44 and a sample standard deviation of 8. The coefficient of variation is (8/44 × 100) = 18.2%. In words, the coefficient of variation tells us that the sample standard deviation is 18.2% of the value of the sample mean. The coefficient of variation for the home sales data is shown in Figure 2.19. It is calculated in cell E11 using the formula =E9/E2, which divides the standard deviation by the mean. The coefficient of variation for the home sales data is 43.22%. In general, the coefficient of variation is a useful statistic for comparing the relative variability of different variables, each with different standard deviations and different means.
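Equation (2.6) translates directly into a small helper. The function name `coefficient_of_variation` is hypothetical, chosen for this sketch; it uses the sample standard deviation, as the text does.

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the mean."""
    return stdev(data) / mean(data) * 100

class_sizes = [46, 54, 42, 46, 32]
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

print(round(coefficient_of_variation(class_sizes), 1))  # 18.2
print(round(coefficient_of_variation(prices), 2))       # 43.22
```

Because the result is unit-free, the two values can be compared even though one data set counts students and the other measures dollars.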

2.7 Analyzing Distributions

In Section 2.4 we demonstrated how to create frequency, relative, and cumulative distributions for data sets. Distributions are very useful for interpreting and analyzing data. A distribution describes the overall variability of the observed values of a variable. In this section we introduce additional ways of analyzing distributions.


Percentiles

A percentile is the value of a variable at which a specified (approximate) percentage of observations are below that value. The pth percentile tells us the point in the data where approximately p% of the observations have values less than the pth percentile; hence, approximately (100 − p)% of the observations have values greater than the pth percentile.

Colleges and universities frequently report admission test scores in terms of percentiles. For instance, suppose an applicant obtains a raw score of 54 on the verbal portion of an admission test. How this student performed in relation to other students taking the same test may not be readily apparent. However, if the raw score of 54 corresponds to the 70th percentile, we know that approximately 70% of the students scored lower than this individual, and approximately 30% of the students scored higher.

To calculate the pth percentile for a data set containing n observations we must first arrange the data in ascending order (smallest value to largest value). The smallest value is in position 1, the next smallest value is in position 2, and so on. The location of the pth percentile, denoted by \(L_p\), is computed using the following equation:

Location of the pth Percentile

\[ L_p = \frac{p}{100}(n + 1) \qquad (2.7) \]

Several procedures can be used to compute the location of the pth percentile using sample data. All provide similar values, especially for large data sets. The procedure we show here is the procedure used by Excel’s PERCENTILE.EXC function as well as several other statistical software packages.

Once we find the position of the value of the pth percentile, we have the information we need to calculate the pth percentile.

To illustrate the computation of the pth percentile, let us compute the 85th percentile for the home sales data in Table 2.9. We begin by arranging the sample of 12 home selling prices in ascending order.

108,000 138,000 138,000 142,000 186,000 199,500 208,000 254,000 254,000 257,500 298,000 456,250

Position 1 2 3 4 5 6 7 8 9 10 11 12

The position of each observation in the sorted data is shown directly below its value. For instance, the smallest value (108,000) is in position 1, the next smallest value (138,000) is in position 2, and so on. Using equation (2.7) with \(p = 85\) and \(n = 12\), the location of the 85th percentile is

\[ L_{85} = \frac{p}{100}(n + 1) = \left( \frac{85}{100} \right)(12 + 1) = 11.05 \]

The interpretation of \(L_{85} = 11.05\) is that the 85th percentile is 5% of the way between the value in position 11 and the value in position 12. In other words, the 85th percentile is the value in position 11 (298,000) plus 0.05 times the difference between the value in position 12 (456,250) and the value in position 11 (298,000). Thus, the 85th percentile is

\[ \text{85th percentile} = 298{,}000 + 0.05(456{,}250 - 298{,}000) = 298{,}000 + 0.05(158{,}250) = 305{,}912.50 \]

Therefore, $305,912.50 represents the 85th percentile of the home sales data. The pth percentile can also be calculated in Excel using the function PERCENTILE.EXC.

Figure 2.19 shows the Excel calculation for the 85th percentile of the home sales data. The value in cell E13 is calculated using the formula =PERCENTILE.EXC(B2:B13,0.85); B2:B13 defines the data set for which we are calculating a percentile, and 0.85 defines the percentile of interest.
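The location-and-interpolate procedure of equation (2.7) can be written out step by step. The function below is a sketch of that procedure, not Excel's implementation; the name `percentile_exc` and its error handling for locations outside positions 1 through n are assumptions made for this example.

```python
def percentile_exc(data, p):
    """pth percentile using the location L_p = (p / 100) * (n + 1)."""
    values = sorted(data)
    n = len(values)
    location = p / 100 * (n + 1)
    if location < 1 or location > n:
        raise ValueError("percentile location falls outside positions 1..n")
    position = int(location)         # 1-based position at or below L_p
    fraction = location - position   # how far to move toward the next value
    if position == n:
        return values[-1]
    lower, upper = values[position - 1], values[position]
    return lower + fraction * (upper - lower)

prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

print(round(percentile_exc(prices, 85), 2))  # 305912.5
```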

Quartiles

It is often desirable to divide data into four parts, with each part containing approximately one-fourth, or 25 percent, of the observations. These division points are referred to as the quartiles and are defined as follows:

\(Q_1\) = first quartile, or 25th percentile
\(Q_2\) = second quartile, or 50th percentile (also the median)
\(Q_3\) = third quartile, or 75th percentile

To demonstrate quartiles, the home sales data are again arranged in ascending order.

108,000 138,000 138,000 142,000 186,000 199,500 208,000 254,000 254,000 257,500 298,000 456,250

Position 1 2 3 4 5 6 7 8 9 10 11 12

Similar to percentiles, there are multiple methods for computing quartiles that all give similar results. Here we describe a commonly used method that is equivalent to Excel’s QUARTILE.EXC function.

We already identified \(Q_2\), the second quartile (median), as 203,750. To find \(Q_1\) and \(Q_3\) we must find the 25th and 75th percentiles.

For \(Q_1\),

\[ L_{25} = \frac{p}{100}(n + 1) = \left( \frac{25}{100} \right)(12 + 1) = 3.25 \]

\[ \text{25th percentile} = 138{,}000 + 0.25(142{,}000 - 138{,}000) = 138{,}000 + 0.25(4{,}000) = 139{,}000 \]

For \(Q_3\),

\[ L_{75} = \frac{p}{100}(n + 1) = \left( \frac{75}{100} \right)(12 + 1) = 9.75 \]

\[ \text{75th percentile} = 254{,}000 + 0.75(257{,}500 - 254{,}000) = 254{,}000 + 0.75(3{,}500) = 256{,}625 \]

Therefore, the 25th percentile for the home sales data is $139,000 and the 75th percentile is $256,625.

The quartiles divide the home sales data into four parts, with each part containing 25% of the observations.

108,000 138,000 138,000 | 142,000 186,000 199,500 | 208,000 254,000 254,000 | 257,500 298,000 456,250
                 Q1 = 139,000              Q2 = 203,750              Q3 = 256,625

The difference between the third and first quartiles is often referred to as the interquartile range, or IQR. For the home sales data, IQR = \(Q_3 - Q_1\) = 256,625 − 139,000 = 117,625. Because it excludes the smallest and largest 25% of values in the data, the IQR is a useful measure of variation for data that have extreme values or are highly skewed.

A quartile can be computed in Excel using the function QUARTILE.EXC. Figure 2.19 shows the calculations for first, second, and third quartiles for the home sales data. The formula used in cell E15 is =QUARTILE.EXC(B2:B13,1). The range B2:B13 defines the data set, and 1 indicates that we want to compute the first quartile. Cells E16 and E17 use similar formulas to compute the second and third quartiles.
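For readers working outside Excel, Python's `statistics.quantiles` (Python 3.8 or later) with `method="exclusive"` places the cut points at positions i(n + 1)/4 in the sorted data, which is the same location rule as equation (2.7). The sketch below uses it to reproduce the three quartiles and the IQR; the variable names are illustrative.

```python
from statistics import quantiles

prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

# n=4 asks for the three cut points that split the data into quarters
q1, q2, q3 = quantiles(prices, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q2, q3)  # 139000.0 203750.0 256625.0
print(iqr)         # 117625.0
```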

z-Scores

A z-score allows us to measure the relative location of a value in the data set. More specifically, a z-score helps us determine how far a particular value is from the mean relative to the data set’s standard deviation. Suppose we have a sample of n observations, with the values denoted by \(x_1, x_2, \ldots, x_n\). In addition, assume that the sample mean, \(\bar{x}\), and the sample standard deviation, \(s\), are already computed. Associated with each value, \(x_i\), is another value called its z-score. Equation (2.8) shows how the z-score is computed for each \(x_i\):

z-SCORE

\[ z_i = \frac{x_i - \bar{x}}{s} \qquad (2.8) \]

where

\(z_i\) = the z-score for \(x_i\)
\(\bar{x}\) = the sample mean
\(s\) = the sample standard deviation

The z-score is often called the standardized value. The z-score, \(z_i\), can be interpreted as the number of standard deviations \(x_i\) is from the mean. For example, \(z_1 = 1.2\) indicates that \(x_1\) is 1.2 standard deviations greater than the sample mean. Similarly, \(z_2 = -0.5\) indicates that \(x_2\) is 0.5, or 1/2, standard deviation less than the sample mean. A z-score greater than zero occurs for observations with a value greater than the mean, and a z-score less than zero occurs for observations with a value less than the mean. A z-score of zero indicates that the value of the observation is equal to the mean.

The z-scores for the class size data are computed in Table 2.13. Recall the previously computed sample mean, \(\bar{x} = 44\), and sample standard deviation, \(s = 8\). The z-score of −1.50 for the fifth observation shows that it is farthest from the mean; it is 1.50 standard deviations below the mean.

The z-score can be calculated in Excel using the function STANDARDIZE. Figure 2.20 demonstrates the use of the STANDARDIZE function to compute z-scores for the home sales data. To calculate the z-scores, we must provide the mean and standard deviation for the data set in the arguments of the STANDARDIZE function. For instance, the z-score in cell C2 is calculated with the formula =STANDARDIZE(B2, $B$15, $B$16), where cell B15 contains the mean of the home sales data and cell B16 contains the standard deviation of the home sales data. We can then copy and paste this formula into cells C3:C13.
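Equation (2.8) applied to a whole data set takes only a few lines. In this sketch the helper name `z_scores` is hypothetical; it standardizes each value with the sample mean and sample standard deviation, as STANDARDIZE does when given those two inputs.

```python
from statistics import mean, stdev

def z_scores(data):
    """Standardize each value: (x_i - sample mean) / sample standard deviation."""
    x_bar = mean(data)
    s = stdev(data)
    return [(x - x_bar) / s for x in data]

class_sizes = [46, 54, 42, 46, 32]
prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

print(z_scores(class_sizes))  # [0.25, 1.25, -0.25, 0.25, -1.5]

home_z = [round(z, 3) for z in z_scores(prices)]
print(home_z[0], home_z[-1])  # -0.862 2.486
```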

Empirical Rule

When the distribution of data exhibits a symmetric bell-shaped distribution, as shown in Figure 2.21, the empirical rule can be used to determine the percentage of data values that are within a specified number of standard deviations of the mean. Many, but not all, distributions of data found in practice exhibit a symmetric bell-shaped distribution.

TABLE 2.13 z-Scores for the Class Size Data

No. of Students    Deviation About the    z-Score
in Class (x_i)     Mean (x_i − x̄)         (x_i − x̄)/s
46                   2                      2/8 =  0.25
54                  10                     10/8 =  1.25
42                  −2                     −2/8 = −0.25
46                   2                      2/8 =  0.25
32                 −12                    −12/8 = −1.50

FIGURE 2.20 Calculating z-Scores for the Home Sales Data in Excel

[Figure: Excel worksheet shown in formula view and in value view. Cells B2:B13 hold the selling prices, and cells C2:C13 hold the z-scores, computed with =STANDARDIZE(B2,$B$15,$B$16) in cell C2 and the corresponding formula in each row below. Cell B15 holds the mean, =AVERAGE(B2:B13), $219,937.50, and cell B16 holds the standard deviation, =STDEV.S(B2:B13), $95,065.77.]

Home Sale    Selling Price ($)    z-Score
1            138,000              −0.862
2            254,000               0.358
3            186,000              −0.357
4            257,500               0.395
5            108,000              −1.177
6            254,000               0.358
7            138,000              −0.862
8            298,000               0.821
9            199,500              −0.215
10           208,000              −0.126
11           142,000              −0.820
12           456,250               2.486

FIGURE 2.21 A Symmetric Bell-Shaped Distribution

The height of adult males in the United States has a bell-shaped distribution similar to that shown in Figure 2.21, with a mean of approximately 69.5 inches and standard deviation of approximately 3 inches. Using the empirical rule, we can draw the following conclusions.

• Approximately 68% of adult males in the United States have heights between 69.5 − 3 = 66.5 and 69.5 + 3 = 72.5 inches.

• Approximately 95% of adult males in the United States have heights between 63.5 and 75.5 inches.

• Almost all adult males in the United States have heights between 60.5 and 78.5 inches.
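The three height intervals above follow from adding and subtracting 1, 2, and 3 standard deviations. The sketch below computes them and, as an added assumption, uses an exactly normal curve (`statistics.NormalDist`) as the idealized bell shape to show where the 68%, 95%, and "almost all" figures come from.

```python
from statistics import NormalDist

mean_height, sd_height = 69.5, 3.0   # inches, from the text
standard_normal = NormalDist()       # assumed idealized bell-shaped curve

intervals = []
for k in (1, 2, 3):
    low = mean_height - k * sd_height
    high = mean_height + k * sd_height
    # Share of a normal distribution within k standard deviations of the mean
    share = standard_normal.cdf(k) - standard_normal.cdf(-k)
    intervals.append((low, high, round(share, 4)))
    print(k, low, high, round(share, 4))
# 1 66.5 72.5 0.6827
# 2 63.5 75.5 0.9545
# 3 60.5 78.5 0.9973
```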

Identifying Outliers

Sometimes a data set will have one or more observations with unusually large or unusually small values. These extreme values are called outliers. Experienced statisticians take steps to identify outliers and then review each one carefully. An outlier may be a data value that has been incorrectly recorded; if so, it can be corrected before the data are analyzed further. An outlier may also be from an observation that doesn’t belong to the population we are studying and was incorrectly included in the data set; if so, it can be removed. Finally, an outlier may be an unusual data value that has been recorded correctly and is a member of the population we are studying. In such cases, the observation should remain.

Standardized values (z-scores) can be used to identify outliers. Recall that the empirical rule allows us to conclude that for data with a bell-shaped distribution, almost all the data values will be within 3 standard deviations of the mean. Hence, in using z-scores to identify outliers, we recommend treating any data value with a z-score less than −3 or greater than +3 as an outlier. Such data values can then be reviewed to determine their accuracy and whether they belong in the data set.
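The z-score rule can be applied mechanically. In this sketch the function name `z_score_outliers` and its `threshold` parameter are hypothetical; the default of 3 follows the recommendation above.

```python
from statistics import mean, stdev

def z_score_outliers(data, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    x_bar = mean(data)
    s = stdev(data)
    return [x for x in data if abs((x - x_bar) / s) > threshold]

prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

print(z_score_outliers(prices))  # []
```

For the home sales data the list is empty: the largest z-score, 2.486 for the $456,250 sale, falls short of 3, so this rule flags no outliers.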

EMPIRICAL RULE

For data having a bell-shaped distribution:

• Approximately 68% of the data values will be within 1 standard deviation of the mean.

• Approximately 95% of the data values will be within 2 standard deviations of the mean.

• Almost all of the data values will be within 3 standard deviations of the mean.

Box Plots

A box plot is a graphical summary of the distribution of data. A box plot is developed from the quartiles for a data set. Figure 2.22 is a box plot for the home sales data. Here are the steps used to construct the box plot:

1. A box is drawn with the ends of the box located at the first and third quartiles. For the home sales data, \(Q_1 = 139{,}000\) and \(Q_3 = 256{,}625\). This box contains the middle 50% of the data.

2. A vertical line is drawn in the box at the location of the median (203,750 for the home sales data).

3. By using the interquartile range, IQR = \(Q_3 - Q_1\), limits are located. The limits for the box plot are 1.5(IQR) below \(Q_1\) and 1.5(IQR) above \(Q_3\). For the home sales data, IQR = \(Q_3 - Q_1\) = 256,625 − 139,000 = 117,625. Thus, the limits are 139,000 − 1.5(117,625) = −37,437.5 and 256,625 + 1.5(117,625) = 433,062.5. Data outside these limits are considered outliers.

4. The dashed lines in Figure 2.22 are called whiskers. The whiskers are drawn from the ends of the box to the smallest and largest values inside the limits computed in Step 3. Thus, the whiskers end at home sales values of 108,000 and 298,000.

5. Finally, the location of each outlier is shown with an asterisk (*). In Figure 2.22, we see one outlier, 456,250.

Box plots are also known as box-and-whisker plots.
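The numbers behind the five steps can be computed without drawing anything. This sketch assumes Python 3.8 or later and uses `statistics.quantiles` with `method="exclusive"` for the quartiles; the variable names are illustrative.

```python
from statistics import quantiles

prices = [138000, 254000, 186000, 257500, 108000, 254000,
          138000, 298000, 199500, 208000, 142000, 456250]

# Steps 1 and 2: box ends at Q1 and Q3, line at the median
q1, med, q3 = quantiles(prices, n=4, method="exclusive")

# Step 3: limits located 1.5(IQR) below Q1 and above Q3
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Step 4: whiskers end at the smallest and largest values inside the limits
inside = [x for x in prices if lower_limit <= x <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)

# Step 5: values outside the limits are plotted as outliers
outliers = [x for x in prices if x < lower_limit or x > upper_limit]

print(lower_limit, upper_limit)   # -37437.5 433062.5
print(whisker_low, whisker_high)  # 108000 298000
print(outliers)                   # [456250]
```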

Box plots are also very useful for comparing different data sets. For instance, if we want to compare home sales from several different communities, we could create box plots for recent home sales in each community. An example of such box plots is shown in Figure 2.23.

What can we learn from these box plots? The most expensive houses appear to be in Shadyside and the cheapest houses in Hamilton. The median home sales price in Groton is about the same as the median home sales price in Irving. However, home sales prices in Irving have much greater variability. Homes appear to be selling in Irving for many different prices, from very low to very high. Home sales prices have the least variation in Groton and Hamilton. The only outlier that appears in these box plots is for home sales in Groton. However, note that most homes sell for very similar prices in Groton, so the selling price does not have to be too far from the median to be considered an outlier.

Clearly, we would not expect a home sales price less than 0, so we could also define the lower limit here to be $0.

Box plots can be drawn horizontally or vertically. Figure 2.22 shows a horizontal box plot, and Figure 2.23 shows vertical box plots.

FIGURE 2.22 Box Plot for the Home Sales Data (horizontal axis: Price ($), 100,000 to 500,000; labeled elements: Q1, Median, IQR, Whisker, Outlier)

FIGURE 2.23 Box Plots Comparing Home Sale Prices in Different Communities (vertical axis: Selling Price ($), 100,000 to 500,000; communities: Fairview, Shadyside, Groton, Irving, Hamilton)


Note that box plots use a different definition of an outlier than what we described for using z-scores because the distribution of the data in a box plot is not assumed to follow a bell-shaped curve. However, the interpretation is the same. The outliers in a box plot are extreme values that should be investigated to ensure data accuracy.

The step-by-step directions below illustrate how to create box plots in Excel for both a single variable and multiple variables. First we will create a box plot for a single variable using the HomeSales data file.

Step 1. Select cells B1:B13
Step 2. Click the Insert tab on the Ribbon
        Click the Insert Statistic Chart button in the Charts group
        Choose the Box and Whisker chart from the drop-down menu

The resulting box plot created in Excel is shown in Figure 2.24. Comparing this figure to Figure 2.22, we see that all the important elements of a box plot are generated here. Excel orients the box plot vertically, and by default it also includes a marker for the mean.

Next we will use the HomeSalesComparison data file to create box plots in Excel for multiple variables similar to what is shown in Figure 2.23.

Step 1. Select cells B1:F11
Step 2. Click the Insert tab on the Ribbon
        Click the Insert Statistic Chart button in the Charts group
        Choose the Box and Whisker chart from the drop-down menu

The box plot created in Excel is shown in Figure 2.25. Excel again orients the box plot vertically. The different selling locations are shown in the Legend at the top of the figure, and different colors are used for each box plot.

HomeSalesComparison

HomeSales

FIGURE 2.24 Box Plot Created in Excel for Home Sales Data (worksheet column B, Selling Price ($), rows 2 to 13: 138000, 254000, 186000, 257500, 108000, 254000, 138000, 298000, 199500, 208000, 142000, 456250; vertical axis: Selling Price ($), 50000 to 500000; labeled elements: Q3, Mean (marked X), Median, IQR, Whisker, Outlier)


2.8 Measures of Association Between Two Variables

Thus far, we have examined numerical methods used to summarize the data for one variable at a time. Often a manager or decision maker is interested in the relationship between two variables. In this section, we present covariance and correlation as descriptive measures of the relationship between two variables. To illustrate these concepts, we consider the case of the sales manager of Queensland Amusement Park, who is in charge of ordering bottled water to be purchased by park customers. The sales manager believes that daily bottled water sales in the summer are related to the outdoor temperature. Table 2.14 shows data for high temperatures and bottled water sales for 14 summer days. The data have been sorted by high temperature from lowest value to highest value.

Scatter Charts

A scatter chart is a useful graph for analyzing the relationship between two variables. Figure 2.26 shows a scatter chart for sales of bottled water versus the high temperature experienced on 14 consecutive days. The scatter chart in the figure suggests that higher

NOTES + COMMENTS

1. Versions of Excel prior to Excel 2010 use the functions PERCENTILE and QUARTILE to calculate a percentile and quartile, respectively. However, these Excel functions can produce odd results for small data sets. Although these functions are still accepted in later versions of Excel (they are equivalent to the Excel functions PERCENTILE.INC and QUARTILE.INC), we do not recommend their use; instead we suggest using PERCENTILE.EXC and QUARTILE.EXC.

2. The empirical rule applies only to distributions that have an approximately bell-shaped distribution because it is based on properties of the Normal probability distribution, which we will discuss in Chapter 5. For distributions that do not have a bell-shaped distribution, one can use Chebyshev’s theorem to make statements about the proportion of data values that must be within a specified number of standard deviations of the mean. Chebyshev’s theorem states that at least $\left(1 - \frac{1}{z^2}\right)$ of the data values must be within z standard deviations of the mean, where z is any value greater than 1.

3. The ability to create box plots in Excel is a new feature in Excel 2016. Unfortunately, there is no easy way to generate box plots in versions of Excel prior to Excel 2016. Box plots can generally be created in most dedicated statistical software packages; we explain how to generate a box plot with Analytic Solver in the chapter appendix found in MindTap.

4. Note that the box plot in Figure 2.24 has been formatted using Excel’s Chart Elements button. These options will be discussed in more detail in Chapter 3. We have also added the text descriptions of the different elements of the box plot.
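To make the Chebyshev bound in note 2 concrete, the short Python sketch below evaluates $1 - 1/z^2$ for a few values of z. The function name is our own and is used only for illustration.

```python
def chebyshev_lower_bound(z: float) -> float:
    """Minimum proportion of data values within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z greater than 1")
    return 1 - 1 / z**2

# The bound holds for any distribution, bell-shaped or not.
for z in (2, 3, 4):
    print(f"z = {z}: at least {chebyshev_lower_bound(z):.1%} of the data values")
```

For z = 2 the bound is 75%, noticeably weaker than the approximately 95% given by the empirical rule, which is the price paid for making no assumption about the shape of the distribution.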

FIGURE 2.25 Box Plots for Multiple Variables Created in Excel (worksheet data in cells B1:F11; vertical axis: Selling Price ($), 50,000 to 500,000; legend: Fairview, Shadyside, Groton, Irving, Hamilton)

Selling Prices ($)
Fairview    Shadyside   Groton      Irving      Hamilton
302,000     336,000     152,000     201,000     102,000
265,000     398,000     158,000     365,000     108,000
280,000     378,000     154,000     115,000     88,000
220,000     298,000     170,000     105,000     111,000
149,000     425,000     132,000     225,000     105,000
155,000     344,000     164,000     115,000     87,000
198,000     302,000     198,000     108,000     95,000
187,000     300,000     158,000     218,000     111,000
208,000     298,000     149,000     454,000     98,000
174,000     342,000     165,000     103,000     78,000


FIGURE 2.26 Chart Showing the Positive Linear Relation Between Sales and High Temperatures (horizontal axis: High Temperature (°F), 76 to 94; vertical axis: Sales (cases))

daily high temperatures are associated with higher bottled water sales. This is an example of a positive relationship, because when one variable (high temperature) increases, the other variable (sales of bottled water) generally also increases. The scatter chart also suggests that a straight line could be used as an approximation for the relationship between high temperature and sales of bottled water.

Scatter charts are covered in Chapter 3.

TABLE 2.14 Data for Bottled Water Sales at Queensland Amusement Park for a Sample of 14 Summer Days

High Temperature (°F)    Bottled Water Sales (cases)
78                       23
79                       22
80                       24
80                       22
82                       24
83                       26
85                       27
86                       25
87                       28
87                       26
88                       29
88                       30
90                       31
92                       31

BottledWater


Covariance

Covariance is a descriptive measure of the linear association between two variables. For a sample of size n with the observations $(x_1, y_1)$, $(x_2, y_2)$, and so on, the sample covariance is defined as follows:

Sample Covariance

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \qquad (2.9)$$

This formula pairs each $x_i$ with a $y_i$. We then sum the products obtained by multiplying the deviation of each $x_i$ from its sample mean $(x_i - \bar{x})$ by the deviation of the corresponding $y_i$ from its sample mean $(y_i - \bar{y})$; this sum is then divided by n − 1.

To measure the strength of the linear relationship between the high temperature x and the sales of bottled water y at Queensland, we use equation (2.9) to compute the sample covariance. The calculations in Table 2.15 show the computation $\sum (x_i - \bar{x})(y_i - \bar{y})$. Note that for our calculations, $\bar{x} = 84.6$ and $\bar{y} = 26.3$.

The covariance calculated in Table 2.15 is $s_{xy} = 12.8$. Because the covariance is greater than 0, it indicates a positive relationship between the high temperature and sales of bottled water. This verifies the relationship we saw in the scatter chart in Figure 2.26 that as the high temperature for a day increases, sales of bottled water generally increase.
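As a cross-check on the hand calculation, the following Python sketch applies equation (2.9) directly to the 14 observations in Table 2.14. The function and variable names are our own. Because the sketch does not round the sample means to one decimal place, its sum of cross-products differs slightly from the 166.42 shown in Table 2.15, but the covariance still rounds to 12.8.

```python
# High temperature (x) and bottled water sales (y) from Table 2.14.
temps = [78, 79, 80, 80, 82, 83, 85, 86, 87, 87, 88, 88, 90, 92]
sales = [23, 22, 24, 22, 24, 26, 27, 25, 28, 26, 29, 30, 31, 31]

def sample_covariance(x, y):
    """Equation (2.9): sum of cross-deviations divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cross_products = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return cross_products / (n - 1)

s_xy = sample_covariance(temps, sales)
print(f"Sample covariance: {s_xy:.2f}")
```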

The sample covariance can also be calculated in Excel using the COVARIANCE.S function. Figure 2.27 shows the data from Table 2.14 entered into an Excel Worksheet. The covariance is calculated in cell B17 using the formula =COVARIANCE.S(A2:A15, B2:B15).

If data consist of a population of N observations, the population covariance $\sigma_{xy}$ is computed by:

$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$

Note that this equation is similar to equation (2.9), but uses population parameters instead of sample estimates (and divides by N instead of n − 1 for technical reasons beyond the scope of this book).

TABLE 2.15 Sample Covariance Calculations for Daily High Temperature and Bottled Water Sales at Queensland Amusement Park

        $x_i$    $y_i$    $x_i - \bar{x}$    $y_i - \bar{y}$    $(x_i - \bar{x})(y_i - \bar{y})$
        78       23       −6.6               −3.3               21.78
        79       22       −5.6               −4.3               24.08
        80       24       −4.6               −2.3               10.58
        80       22       −4.6               −4.3               19.78
        82       24       −2.6               −2.3               5.98
        83       26       −1.6               −0.3               0.48
        85       27       0.4                0.7                0.28
        86       25       1.4                −1.3               −1.82
        87       28       2.4                1.7                4.08
        87       26       2.4                −0.3               −0.72
        88       29       3.4                2.7                9.18
        88       30       3.4                3.7                12.58
        90       31       5.4                4.7                25.38
        92       31       7.4                4.7                34.78
Totals  1,185    368      0.6                −0.2               166.42

$\bar{x} = 84.6$
$\bar{y} = 26.3$

$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{166.42}{14 - 1} = 12.8$$

BottledWater


A2:A15 defines the range for the x variable (high temperature), and B2:B15 defines the range for the y variable (sales of bottled water).

For the bottled water, the covariance is positive, indicating that higher temperatures (x) are associated with higher sales (y). If the covariance is near 0, then the x and y variables are not linearly related. If the covariance is less than 0, then the x and y variables are negatively related, which means that as x increases, y generally decreases. Figure 2.28 demonstrates several possible scatter charts and their associated covariance values.

One problem with using covariance is that the magnitude of the covariance value is difficult to interpret. Larger $s_{xy}$ values do not necessarily mean a stronger linear relationship because the units of covariance depend on the units of x and y. For example, suppose we are interested in the relationship between height x and weight y for individuals. Clearly the strength of the relationship should be the same whether we measure height in feet or inches. Measuring the height in inches, however, gives us much larger numerical values for $(x_i - \bar{x})$ than when we measure height in feet. Thus, with height measured in inches, we would obtain a larger value for the numerator $\sum (x_i - \bar{x})(y_i - \bar{y})$ in equation (2.9), and hence a larger covariance, when in fact the relationship does not change.
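The effect of units can be seen in a small Python sketch. The heights and weights below are hypothetical values invented for illustration, and the helper functions are our own. Converting height from feet to inches multiplies the covariance by 12, while the covariance divided by the product of the two standard deviations is unchanged.

```python
from math import isclose, sqrt

def sample_cov(x, y):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Covariance scaled by the product of the sample standard deviations."""
    return sample_cov(x, y) / sqrt(sample_cov(x, x) * sample_cov(y, y))

# Hypothetical data: five individuals' heights (feet) and weights (pounds).
heights_ft = [5.0, 5.5, 5.75, 6.0, 6.25]
weights_lb = [120, 150, 160, 180, 200]
heights_in = [12 * h for h in heights_ft]  # same heights expressed in inches

cov_ft = sample_cov(heights_ft, weights_lb)
cov_in = sample_cov(heights_in, weights_lb)

print(f"Covariance with height in feet:   {cov_ft:.2f}")
print(f"Covariance with height in inches: {cov_in:.2f}")
print(f"Scaled measure (feet):   {sample_corr(heights_ft, weights_lb):.4f}")
print(f"Scaled measure (inches): {sample_corr(heights_in, weights_lb):.4f}")
```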

FIGURE 2.27 Calculating Covariance and Correlation Coefficient for Bottled Water Sales Using Excel (column A: High Temperature (°F); column B: Bottled Water Sales (cases); data in rows 2 to 15; cell B17, Covariance: =COVARIANCE.S(A2:A15,B2:B15), result 12.80; cell B18, Correlation Coefficient: =CORREL(A2:A15,B2:B15), result 0.93)


FIGURE 2.28 Scatter Diagrams and Associated Covariance Values for Different Variable Relationships (three panels: $s_{xy}$ positive, x and y are positively linearly related; $s_{xy}$ approximately 0, x and y are not linearly related; $s_{xy}$ negative, x and y are negatively linearly related)


Correlation Coefficient

The correlation coefficient measures the relationship between two variables, and, unlike covariance, the relationship between two variables is not affected by the units of measurement for x and y. For sample data, the correlation coefficient is defined as follows:

If data are a population, the population correlation coefficient is computed by $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$. Note that this is similar to equation (2.10) but uses population parameters instead of sample estimates.

Sample Correlation Coefficient

$$r_{xy} = \frac{s_{xy}}{s_x s_y} \qquad (2.10)$$

where

$r_{xy}$ = sample correlation coefficient
$s_{xy}$ = sample covariance
$s_x$ = sample standard deviation of x
$s_y$ = sample standard deviation of y

The sample correlation coefficient is computed by dividing the sample covariance by the product of the sample standard deviation of x and the sample standard deviation of y. This scales the correlation coefficient so that it will always take values between −1 and +1.

Let us now compute the sample correlation coefficient for bottled water sales at Queensland Amusement Park. Recall that we calculated $s_{xy} = 12.8$ using equation (2.9). Using data in Table 2.14, we can compute sample standard deviations for x and y.

$$s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = 4.36$$

$$s_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}} = 3.15$$

The sample correlation coefficient is computed from equation (2.10) as follows:

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{12.8}{(4.36)(3.15)} = 0.93$$

The correlation coefficient can take only values between −1 and +1. Correlation coefficient values near 0 indicate no linear relationship between the x and y variables. Correlation coefficients greater than 0 indicate a positive linear relationship between the x and y variables. The closer the correlation coefficient is to +1, the closer the x and y values are to forming a straight line that trends upward to the right (positive slope). Correlation coefficients less than 0 indicate a negative linear relationship between the x and y variables. The closer the correlation coefficient is to −1, the closer the x and y values are to forming a straight line with negative slope. Because $r_{xy} = 0.93$ for the bottled water, we know that there is a very strong linear relationship between these two variables. As we can see in Figure 2.26, one could draw a straight line that would be very close to all of the data points in the scatter chart.
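The calculation of equation (2.10) for the bottled water data can be reproduced with the Python sketch below. It is a minimal illustration with helper names of our own; it works from the unrounded covariance and standard deviations, and the result still rounds to 0.93.

```python
from math import sqrt

# High temperature (x) and bottled water sales (y) from Table 2.14.
temps = [78, 79, 80, 80, 82, 83, 85, 86, 87, 87, 88, 88, 90, 92]
sales = [23, 22, 24, 22, 24, 26, 27, 25, 28, 26, 29, 30, 31, 31]

def sample_cov(x, y):
    """Equation (2.9): sample covariance."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def sample_std(x):
    """Sample standard deviation (the covariance of x with itself is its variance)."""
    return sqrt(sample_cov(x, x))

s_x = sample_std(temps)
s_y = sample_std(sales)
r_xy = sample_cov(temps, sales) / (s_x * s_y)  # equation (2.10)

print(f"s_x = {s_x:.2f}, s_y = {s_y:.2f}, r_xy = {r_xy:.2f}")
```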

Because the correlation coefficient defined here measures only the strength of the linear relationship between two quantitative variables, it is possible for the correlation coefficient to be near zero, suggesting no linear relationship, when the relationship between the two variables is nonlinear. For example, the scatter diagram in Figure 2.29 shows the relationship between the amount spent by a small retail store for environmental control (heating and cooling) and the daily high outside temperature for 100 consecutive days.

The sample correlation coefficient for these data is $r_{xy} = -0.007$ and indicates that there is no linear relationship between the two variables. However, Figure 2.29 provides strong visual evidence of a nonlinear relationship. That is, we can see that as the daily high


NOTES + COMMENTS

1. The correlation coefficient discussed in this chapter was developed by Karl Pearson and is sometimes referred to as the Pearson product moment correlation coefficient. It is appropriate for use only with two quantitative variables. A variety of alternatives, such as the Spearman rank-correlation coefficient, exist to measure the association of categorical variables. The Spearman rank-correlation coefficient is discussed in Chapter 11.

2. Correlation measures only the association between two variables. A large positive or large negative correlation coefficient does not indicate that a change in the value of one of the two variables causes a change in the value of the other variable.

FIGURE 2.29 Example of Nonlinear Relationship Producing a Correlation Coefficient Near Zero (horizontal axis: Outside Temperature (°F), 0 to 100; vertical axis: Dollars Spent on Environmental Control, $200 to $1,600)

outside temperature increases, the money spent on environmental control first decreases as less heating is required and then increases as greater cooling is required.
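A small Python sketch shows how a strong nonlinear relationship can produce a correlation coefficient near zero. The data below are hypothetical and are not the retail store data of Figure 2.29: spending is generated from an assumed U-shaped curve that is lowest at 50°F, and the helper function is our own.

```python
from math import sqrt

def sample_corr(x, y):
    """Sample correlation coefficient, equation (2.10)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    return s_xy / (s_x * s_y)

# Hypothetical temperatures from 0 to 100 degrees F in steps of 10.
temperature = list(range(0, 101, 10))

# Assumed U-shaped cost: heating dominates when cold, cooling when hot.
spending = [200 + 0.5 * (t - 50) ** 2 for t in temperature]

r_xy = sample_corr(temperature, spending)
print(f"r_xy = {r_xy:.3f}")
```

Spending is completely determined by temperature in this sketch, yet the decreasing and increasing halves of the curve cancel in the covariance, so the correlation coefficient is essentially zero.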

We can compute correlation coefficients using the Excel function CORREL. The correlation coefficient in Figure 2.27 is computed in cell B18 for the sales of bottled water using the formula =CORREL(A2:A15, B2:B15), where A2:A15 defines the range for the x variable and B2:B15 defines the range for the y variable.

2.9 Data Cleansing

The data in a data set are often said to be “dirty” and “raw” before they have been put into a form that is best suited for investigation, analysis, and modeling. Data preparation makes heavy use of the descriptive statistics and data-visualization methods to gain an understanding of the data. Common tasks in data preparation include treating missing data, identifying erroneous data and outliers, and defining the appropriate way to represent variables.

Missing Data

Data sets commonly include observations with missing values for one or more variables. In some cases missing data naturally occur; these are called legitimately missing data. For example, respondents to a survey may be asked if they belong to a fraternity or a sorority, and


then in the next question are asked how long they have belonged to a fraternity or a sorority. If a respondent does not belong to a fraternity or a sorority, she or he should skip the ensuing question about how long. Generally no remedial action is taken for legitimately missing data.

In other cases missing data occur for different reasons; these are called illegitimately missing data. These cases can result for a variety of reasons, such as a respondent electing not to answer a question that she or he is expected to answer, a respondent dropping out of a study before its completion, or sensors or other electronic data collection equipment failing during a study. Remedial action is considered for illegitimately missing data. The primary options for addressing such missing data are (1) to discard observations (rows) with any missing values, (2) to discard any variable (column) with missing values, (3) to fill in missing entries with estimated values, or (4) to apply a data-mining algorithm (such as classification and regression trees) that can handle missing values.

Deciding on a strategy for dealing with missing data requires some understanding of why the data are missing and the potential impact these missing values might have on an analysis. If the tendency for an observation to be missing the value for some variable is entirely random, then whether data are missing does not depend on either the value of the missing data or the value of any other variable in the data. In such cases the missing value is called missing completely at random (MCAR). For example, if the missing value for a question on a survey is completely unrelated to the value that is missing and is also completely unrelated to the value of any other question on the survey, the missing value is MCAR.

However, the occurrence of some missing values may not be completely at random. If the tendency for an observation to be missing a value for some variable is related to the value of some other variable(s) in the data, the missing value is called missing at random (MAR). For data that are MAR, the reason for the missing values may determine their importance. For example, if the responses to one survey question collected by a specific employee were lost due to a data entry error, then the treatment of the missing data may be less critical. However, in a health care study, suppose observations corresponding to patient visits are missing the results of diagnostic tests whenever the doctor deems the patient too sick to undergo the procedure. In this case, the absence of a variable measurement actually provides additional information about the patient’s condition, which may be helpful in understanding other relationships in the data.

A third category of missing data is missing not at random (MNAR). Data is MNAR if the tendency for the value of a variable to be missing is related to the value that is missing. For example, survey respondents with high incomes may be less inclined than respondents with lower incomes to respond to the question on annual income, and so these missing data for annual income are MNAR.

Understanding which of these three categories (MCAR, MAR, and MNAR) missing values fall into is critical in determining how to handle missing data. If a variable has observations for which the missing values are MCAR or MAR and only a relatively small number of observations are missing values, the observations that are missing values can be ignored. We will certainly lose information if the observations that are missing values for the variable are ignored, but the results of an analysis of the data will not be biased by the missing values.

If a variable has observations for which the missing values are MNAR, the observations with missing values cannot be ignored because any analysis that includes the variable with MNAR values will be biased. If the variable with MNAR values is thought to be redundant with another variable in the data for which there are few or no missing values, removing the MNAR variable from consideration may be an option. In particular, if the MNAR variable is highly correlated with another variable that is known for a majority of observations, the loss of information may be minimal.

Whether the missing values are MCAR, MAR, or MNAR, the first course of action when faced with missing values is to try to determine the actual value that is missing by examining the source of the data or logically determining the likely value that is missing. If the missing values cannot be determined and ignoring missing values or removing a variable with missing values from consideration is not an option, imputation (systematic replacement of missing values with values that seem reasonable) may be useful. Options for replacing the missing entries for a variable include replacing the missing value with


the variable’s mode, mean, or median. Imputing values in this manner is truly valid only if variable values are MCAR; otherwise, we may be introducing misleading information into the data. If missing values are particularly troublesome and MAR, it may be possible to build a model to predict a variable with missing values and then to use these predictions in place of the missing entries. How to deal with missing values is fairly subjective, and caution must be used to avoid inducing bias by replacing missing values.
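The simplest imputation options described above can be sketched in a few lines of Python. The values below are hypothetical, None stands for a missing entry, and the function name is our own. The sketch computes the mean, median, and mode of the observed values and then fills the gaps with one of them.

```python
from statistics import mean, median, mode

# Hypothetical variable with two missing entries (None).
values = [12.0, 15.0, None, 15.0, 18.0, None, 40.0]

# Candidate replacement values are computed from the observed entries only.
observed = [v for v in values if v is not None]
fill_values = {
    "mean": mean(observed),
    "median": median(observed),
    "mode": mode(observed),
}

def impute(data, fill):
    """Return a copy of data with each missing entry replaced by fill."""
    return [fill if v is None else v for v in data]

imputed_with_median = impute(values, fill_values["median"])

print(fill_values)
print(imputed_with_median)
```

Note how the single large value of 40.0 pulls the mean well above the median and mode in this sketch; the choice of replacement value is therefore itself a judgment call, consistent with the caution expressed above.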

Blakely Tires

Blakely Tires is a U.S. producer of automobile tires. In an attempt to learn about the conditions of its tires on automobiles in Texas, the company has obtained information for each of the four tires from 116 automobiles with Blakely brand tires that have been collected through recent state automobile inspection facilities in Texas. The data obtained by Blakely includes the position of the tire on the automobile (left front, left rear, right front, right rear), age of the tire, mileage on the tire, and depth of the remaining tread on the tire. Before Blakely management attempts to learn more about its tires on automobiles in Texas, it wants to assess the quality of these data.

The tread depth of a tire is a vertical measurement from the top of the tread rubber to the bottom of the tire’s deepest grooves, and is measured in 32nds of an inch in the United States. New Blakely brand tires have a tread depth of 10/32nds of an inch, and a tire’s tread depth is considered insufficient if it is 2/32nds of an inch or less. Shallow tread depth is dangerous as it results in poor traction and so makes steering the automobile more difficult. Blakely’s tires generally last for four to five years or 40,000 to 60,000 miles.

We begin assessing the quality of these data by determining which (if any) observations have missing values for any of the variables in the TreadWear data. We can do so using Excel’s COUNTBLANK function. After opening the file TreadWear:

Step 1. Enter the heading # of Missing Values in cell G2
Step 2. Enter the heading Life of Tire (Months) in cell H1
Step 3. Enter =COUNTBLANK(C2:C457) in cell H2

The result in cell H2 shows that none of the observations in these data is missing its value for Life of Tire.
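The same count of blank cells can be sketched outside Excel. The Python example below uses a handful of hypothetical rows laid out like the TreadWear file (the values are invented for illustration and are not rows from the file), with None standing for a blank cell.

```python
# Hypothetical rows in the layout of the TreadWear data; None marks a blank cell.
rows = [
    {"ID Number": 1001, "Position": "LF", "Life of Tire (Months)": 17.1, "Tread Depth": 8.5, "Miles": None},
    {"ID Number": 1001, "Position": "LR", "Life of Tire (Months)": 17.1, "Tread Depth": 8.4, "Miles": 21378},
    {"ID Number": 1002, "Position": "RF", "Life of Tire (Months)": 45.4, "Tread Depth": 4.1, "Miles": 57313},
    {"ID Number": 1002, "Position": "RR", "Life of Tire (Months)": 45.4, "Tread Depth": 4.1, "Miles": 57313},
]

quantitative_columns = ["Life of Tire (Months)", "Tread Depth", "Miles"]

# Analogue of COUNTBLANK applied to each quantitative column.
missing_counts = {
    column: sum(1 for row in rows if row[column] is None)
    for column in quantitative_columns
}

for column, count in missing_counts.items():
    print(f"{column}: {count} missing value(s)")
```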

By repeating this process for the remaining quantitative variables in the data (Tread Depth and Miles) in columns I and J, we determine that there are no missing values for Tread Depth and one missing value for Miles. The first few rows of the resulting Excel spreadsheet are provided in Figure 2.30.

Next we sort all of Blakely’s data on Miles from smallest to largest value to determine which observation is missing its value of this variable. Excel’s sort procedure will list all observations with missing values for the sort variable, Miles, as the last observations in the sorted data.

FIGURE 2.30 Portion of Excel Spreadsheet Showing Number of Missing Values for Variables in TreadWear Data (worksheet columns: ID Number, Position on Automobile, Life of Tire (Months), Tread Depth, Miles; row 2 of columns H to J reports the # of Missing Values: 0 for Life of Tire (Months), 0 for Tread Depth, 1 for Miles)

TreadWear


We can see in Figure 2.31 that the value of Miles is missing from the left front tire of the automobile with ID Number 3354942. Because only one of the 456 observations is missing its value for Miles, this is likely MCAR and so ignoring the observation would not likely bias any analysis we wish to undertake with these data. However, we may be able to salvage this observation by logically determining a reasonable value to substitute for this missing value. It is sensible to suspect that the value of Miles for the left front tire of the automobile with the ID Number 3354942 would be identical to the value of Miles for the other three tires on this automobile, so we sort all the data on ID Number and scroll through the data to find the four tires that belong to the automobile with the ID Number 3354942.

Figure 2.32 shows that the value of Miles for the other three tires on the automobile with the ID Number 3354942 is 33,254, so this may be a reasonable value for the Miles of the left front tire of the automobile with the ID Number 3354942. However, before substituting this value for the missing value of the left front tire of the automobile with ID Number 3354942, we should attempt to ascertain (if possible) that this value is valid, because there are legitimate reasons why a driver might replace a single tire. In this instance we will assume that the correct value of Miles for the left front tire on the automobile with the ID Number 3354942 is 33,254 and substitute that number in the appropriate cell of the spreadsheet.
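The reasoning used here, borrowing a value from the other tires on the same automobile, can be sketched in Python. The ID Number and Miles values follow the example in the text; the positions assigned to the three complete rows and all names in the code are assumptions made for illustration. The sketch proposes a substitute only when the other tires on the automobile agree on a single value, and the proposal should still be verified before it is used.

```python
from collections import defaultdict

# Illustrative records; None marks the blank Miles cell for the left front tire.
tires = [
    {"id": 3354942, "position": "LF", "miles": None},
    {"id": 3354942, "position": "LR", "miles": 33254},
    {"id": 3354942, "position": "RF", "miles": 33254},
    {"id": 3354942, "position": "RR", "miles": 33254},
]

# Collect the distinct known Miles values for each automobile.
known_miles = defaultdict(set)
for tire in tires:
    if tire["miles"] is not None:
        known_miles[tire["id"]].add(tire["miles"])

# Propose a substitute only when the remaining tires all share one value.
candidates = {}
for tire in tires:
    values = known_miles[tire["id"]]
    if tire["miles"] is None and len(values) == 1:
        candidates[(tire["id"], tire["position"])] = next(iter(values))

print(candidates)
```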

Note that we have hidden rows 5 through 454 of the Excel file in Figure 2.31.

Occasionally missing values in a data set are indicated with a unique value, such as 9999999. Be sure to check to see if a unique value is being used to indicate a missing value in the data.

FIGURE 2.31 Portion of Excel Spreadsheet Showing TreadWear Data Sorted on Miles from Lowest to Highest Value (rows 5 through 454 hidden; the smallest Miles value shown is 206 and the largest is 107237; the last row, for the LF tire of ID Number 3354942 with Life of Tire (Months) 17.1 and Tread Depth 8.5, has no value for Miles)

FIGURE 2.32 Portion of Excel Spreadsheet Showing TreadWear Data Sorted from Lowest to Highest by ID Number (rows 54 through 65; Miles is 21378 for the tires of ID Number 3121851, 33254 for the three tires of ID Number 3354942 with a recorded value, and 57313 for the tires of ID Number 3574739)


Identification of Erroneous Outliers and Other Erroneous Values

Examining the variables in the data set by use of summary statistics, frequency distributions, bar charts and histograms, z-scores, scatter plots, correlation coefficients, and other tools can uncover data-quality issues and outliers. For example, finding the minimum or maximum value for Tread Depth in the TreadWear data may reveal unrealistic values (perhaps even negative values) for Tread Depth, which would indicate a problem for the value of Tread Depth for any such observation.

It is important to note here that many software packages, including Excel, JMP Pro, and Analytic Solver, ignore missing values when calculating various summary statistics such as the mean, standard deviation, minimum, and maximum. However, if missing values in a data set are indicated with a unique value (such as 9999999), these values may be used by software when calculating various summary statistics such as the mean, standard deviation, minimum, and maximum. Both cases can result in misleading values for summary statistics, which is why many analysts prefer to deal with missing data issues prior to using summary statistics to attempt to identify erroneous outliers and other erroneous values in the data.
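The distortion caused by a missing-value code can be seen in a short Python sketch. The Miles values are illustrative, and the constant name MISSING_CODE is our own; the sketch compares the mean computed with the code 9999999 left in the data against the mean computed after the code is excluded.

```python
from statistics import mean

MISSING_CODE = 9999999  # unique value used to indicate a missing entry

# Illustrative Miles values; the last entry is a missing-value code, not a mileage.
miles = [21378, 33254, 57313, MISSING_CODE]

# Treating the code as data inflates the mean (and the maximum).
naive_mean = mean(miles)

# Excluding the code mirrors how software treats truly blank cells.
recorded = [m for m in miles if m != MISSING_CODE]
clean_mean = mean(recorded)

print(f"Mean with code included: {naive_mean:,.0f}")
print(f"Mean with code excluded: {clean_mean:,.0f}")
print(f"Maximum with code included: {max(miles):,}")
```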

We again consider the Blakely tire data. We calculate the mean and standard deviation of each variable (age of the tire, mileage on the tire, and depth of the remaining tread on the tire) to assess whether values of these variables are reasonable in general.

Return to the file TreadWear and complete the following steps:

Step 1. Enter the heading Mean in cell G3
Step 2. Enter the heading Standard Deviation in cell G4
Step 3. Enter =AVERAGE(C2:C457) in cell H3
Step 4. Enter =STDEV.S(C2:C457) in cell H4

The results in cells H3 and H4 show that the mean and standard deviation for life of tires are 23.8 months and 31.83 months, respectively. These values appear to be reasonable for the life of tires in months.

By repeating this process for the remaining variables in the data (Tread Depth and Miles) in columns I and J, we determine that the mean and standard deviation for tread depth are 7.62/32nds of an inch and 2.47/32nds of an inch, respectively, and the mean and standard deviation for miles are 25440.22 and 23600.21, respectively. These values appear to be reasonable for tread depth and miles. The results of this analysis are provided in Figure 2.33.

Summary statistics only provide an overall perspective on the data. We also need to attempt to determine if there are any erroneous individual values for our three variables. We start by finding the minimum and maximum values for each variable.

Return again to the file TreadWear and complete the following steps:

Step 1. Enter the heading Minimum in cell G5
Step 2. Enter the heading Maximum in cell G6
Step 3. Enter =MIN(C2:C457) in cell H5
Step 4. Enter =MAX(C2:C457) in cell H6

The results in cells H5 and H6 show that the minimum and maximum values for Life of Tire (Months) are 1.8 months and 601.0 months, respectively. The minimum value of life of tires in months appears to be reasonable, but the maximum (which is equal to slightly over 50 years) is not a reasonable value for Life of Tire (Months). In order to identify the automobile with this extreme value, we again sort the entire data set on Life of Tire (Months) and scroll to the last few rows of the data.

We see in Figure 2.34 that the observation with Life of Tire (Months) of 601 is the left rear tire from the automobile with ID Number 8696859. Also note that the left rear tire of the automobile with ID Number 2122934 has a suspiciously high value for Life of Tire (Months) of 111. Sorting the data by ID Number and scrolling until we find the four tires from the automobile with ID Number 8696859, we find the value for Life of Tire (Months) for the other three tires from this automobile is 60.1. This suggests that the decimal for Life of Tire (Months) for this automobile’s left rear tire value is in the wrong place. Scrolling to find the four tires from the automobile with ID Number 2122934, we find the value for Life of Tire (Months) for the other three tires from this automobile is 11.1, which suggests that the decimal for Life of Tire (Months) for this automobile’s left rear tire value is also misplaced. Both of these erroneous entries can now be corrected.

If you do not have good information on what are reasonable values for a variable, you can use z-scores to identify outliers to be investigated.
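The comparison of one tire against the other tires on the same automobile can also be sketched in Python. The ID Numbers and Life of Tire values below come from the discussion above; the position labels other than LR, the function name `flag_decimal_shifts`, and its `ratio` and `tolerance` parameters are assumptions made for this illustration.

```python
from collections import defaultdict
from statistics import median

# (ID Number, position, Life of Tire in months).
# Positions other than LR are illustrative.
records = [
    (8696859, "LF", 60.1), (8696859, "RF", 60.1),
    (8696859, "RR", 60.1), (8696859, "LR", 601.0),
    (2122934, "LF", 11.1), (2122934, "RF", 11.1),
    (2122934, "RR", 11.1), (2122934, "LR", 111.0),
    (3574739, "LF", 73.3), (3574739, "RF", 73.3),
    (3574739, "RR", 73.3), (3574739, "LR", 73.3),
]


def flag_decimal_shifts(rows, ratio=10.0, tolerance=0.05):
    """Flag tires whose value is about `ratio` times the median for their car."""
    by_car = defaultdict(list)
    for car_id, _position, months in rows:
        by_car[car_id].append(months)

    flagged = []
    for car_id, position, months in rows:
        typical = median(by_car[car_id])
        if typical > 0 and abs(months / typical - ratio) <= tolerance * ratio:
            # The last element is the value with the decimal moved one place left.
            flagged.append((car_id, position, months, round(months / ratio, 1)))
    return flagged


for car_id, position, months, suggested in flag_decimal_shifts(records):
    print(f"ID {car_id} {position}: {months} looks like {suggested}")
```

A flagged record is only a candidate correction; as in the text, the analyst should confirm the likely cause before changing the entry.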

By repeating this process for the remaining variables in the data (Tread Depth and Miles) in columns I and J, we determine that the minimum and maximum values for Tread Depth are 0.0/12ths of an inch and 16.7/12ths of an inch, respectively, and the minimum and maximum values for Miles are 206.0 and 107237.0, respectively. Neither the minimum nor the maximum value for Tread Depth is reasonable; a tire with no tread would not be drivable, and the maximum value for tread depth in the data actually exceeds the tread depth on new Blakely brand tires. The minimum value for Miles is reasonable, but the maximum value is not. A similar investigation should be made into these values to determine if they are in error and if so, what might be the correct value.

FIGURE 2.33 Portion of Excel Spreadsheet Showing the Mean and Standard Deviation for Each Variable in the TreadWear Data

FIGURE 2.34 Portion of Excel Spreadsheet Showing the TreadWear Data Sorted on Life of Tires (Months) from Lowest to Highest Value

Not all erroneous values in a data set are extreme; these erroneous values are much more difficult to find. However, if the variable with suspected erroneous values has a relatively strong relationship with another variable in the data, we can use this knowledge to look for erroneous values. Here we will consider the variables Tread Depth and Miles; because more miles driven should lead to less tread depth on an automobile tire, we expect these two variables to have a negative relationship. A scatter chart will enable us to see whether any of the tires in the data set have values for Tread Depth and Miles that are counter to this expectation.

The red ellipse in Figure 2.35 shows the region in which the points representing Tread Depth and Miles would generally be expected to lie on this scatter plot. The points that lie outside of this ellipse have values for at least one of these variables that is inconsistent with the negative relationship exhibited by the points inside the ellipse. If we position the cursor over one of the points outside the ellipse, Excel will generate a pop-up box that shows that the values of Tread Depth and Miles for this point are 1.0 and 1472.1, respectively. The tire represented by this point has very little tread and has been driven relatively few miles, which suggests that the value of one or both of these two variables for this tire may be inaccurate and should be investigated.
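The visual check can be approximated numerically. The sketch below is one possible approach, not the procedure used in the text: it fits a least-squares line of Miles on Tread Depth and flags points whose residual is more than two residual standard deviations from the line. All points are hypothetical except (1.0, 1472.1), which is the point read from the chart; the function name `flag_inconsistent` and the cutoff of 2 are assumptions.

```python
from math import sqrt

# Hypothetical (Tread Depth, Miles) pairs with a negative relationship,
# plus the point identified in the scatter chart: (1.0, 1472.1).
tires = [
    (10.0, 5000), (9.0, 15000), (8.0, 25000), (7.0, 35000), (6.0, 45000),
    (5.0, 55000), (4.0, 65000), (3.0, 75000), (2.0, 85000), (1.0, 1472.1),
]


def flag_inconsistent(points, z_cutoff=2.0):
    """Return points whose residual from the fitted line is unusually large."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    residuals = [y - (intercept + slope * x) for x, y in points]
    # Residual standard deviation with n - 2 degrees of freedom.
    resid_sd = sqrt(sum(r ** 2 for r in residuals) / (n - 2))
    return [p for p, r in zip(points, residuals) if abs(r) > z_cutoff * resid_sd]


print(flag_inconsistent(tires))
```

Because an erroneous point also pulls the fitted line toward itself, a check like this can miss problems that the scatter chart makes obvious, so it complements the chart rather than replacing it.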

Closer examination of outliers and potential erroneous values may reveal an error or a need for further investigation to determine whether the observation is relevant to the current analysis. A conservative approach is to create two data sets, one with and one without outliers and potentially erroneous values, and then construct a model on both data sets. If a model’s implications depend on the inclusion or exclusion of outliers and erroneous values, then you should spend additional time to track down the cause of the outliers.
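As a minimal sketch of this two-data-set comparison, the Python code below fits the same simple model (a least-squares line of Miles on Tread Depth) with and without a suspect observation. The data are hypothetical apart from the point (1.0, 1472.1), and the choice of a straight-line model is an assumption made only to keep the example short.

```python
def fit_line(points):
    """Least-squares slope and intercept of y on x."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = s_xy / s_xx
    return slope, y_bar - slope * x_bar


# Hypothetical (Tread Depth, Miles) pairs; the last one is the suspect point.
all_tires = [
    (10.0, 5000), (9.0, 15000), (8.0, 25000), (7.0, 35000), (6.0, 45000),
    (5.0, 55000), (4.0, 65000), (3.0, 75000), (2.0, 85000), (1.0, 1472.1),
]
suspect = {(1.0, 1472.1)}
cleaned = [p for p in all_tires if p not in suspect]

slope_all, _ = fit_line(all_tires)
slope_cleaned, _ = fit_line(cleaned)
print(f"Slope with suspect value:    {slope_all:,.1f} miles per unit of tread")
print(f"Slope without suspect value: {slope_cleaned:,.1f} miles per unit of tread")
```

In this made-up example a single observation changes the estimated slope by roughly half, which is the kind of sensitivity that justifies tracking down the cause of the outlier.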

FIGURE 2.35 Scatter Diagram of Tread Depth and Miles for the TreadWear Data (horizontal axis: Tread Depth; vertical axis: Miles; highlighted point: (1.0, 1472.1))

Variable Representation

In many data-mining applications, it may be prohibitive to analyze the data because of the number of variables recorded. In such cases, the analyst may have to first identify variables that can be safely omitted from further analysis before proceeding with a data-mining technique. Dimension reduction is the process of removing variables from the analysis without losing crucial information. One simple method for reducing the number of variables is to examine pairwise correlations to detect variables or groups of variables that may supply similar information. Such variables can be aggregated or removed to allow more parsimonious model development.
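A pairwise correlation screen can be sketched in a few lines of Python. The variables Kilometers and Tire Pressure, all of the values, and the cutoff of 0.9 are hypothetical and serve only to show the mechanics.

```python
from itertools import combinations
from math import sqrt

# Hypothetical data set: Kilometers is just Miles in different units.
data = {
    "Miles": [5000, 15000, 25000, 35000, 45000, 55000],
    "Kilometers": [8047, 24140, 40234, 56327, 72420, 88514],
    "Tire Pressure": [32, 35, 31, 34, 33, 32],
}


def correlation(x, y):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


def similar_pairs(columns, cutoff=0.9):
    """List variable pairs whose absolute correlation exceeds the cutoff."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = correlation(columns[a], columns[b])
        if abs(r) > cutoff:
            pairs.append((a, b, r))
    return pairs


for a, b, r in similar_pairs(data):
    print(f"{a} and {b}: r = {r:.3f}")
```

A pair flagged this way is a candidate for aggregation or removal; which variable to keep remains a judgment call for the analyst.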

A critical part of data mining is determining how to represent the measurements of the variables and which variables to consider. The treatment of categorical variables is particularly important. Typically, it is best to encode categorical variables with 0–1 dummy variables. Consider a data set that contains the variable Language to track the language preference of callers to a call center. The variable Language with the possible values of English, German, and Spanish would be replaced with three binary variables called English, German, and Spanish. An entry of German would be captured using a 0 for the English dummy variable, a 1 for the German dummy variable, and a 0 for the Spanish dummy variable. Using 0–1 dummy variables to encode categorical variables with many different categories results in a large number of variables. In these cases, the use of PivotTables is helpful in identifying categories that are similar and can possibly be combined to reduce the number of 0–1 dummy variables. For example, some categorical variables (zip code, product model number) may have many possible categories such that, for the purpose of model building, there is no substantive difference between multiple categories, and therefore the number of categories may be reduced by combining categories.
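The Language example can be expressed as a short Python sketch. The function name `encode_dummies` and the list of callers are hypothetical; the encoding itself follows the description above.

```python
def encode_dummies(values, categories):
    """Replace each categorical value with one 0-1 variable per category."""
    return [{c: int(v == c) for c in categories} for v in values]


languages = ["English", "German", "Spanish"]
callers = ["German", "English", "Spanish", "German"]  # hypothetical entries

encoded = encode_dummies(callers, languages)
for caller, row in zip(callers, encoded):
    print(caller, row)
```

The first caller, German, is encoded as 0 for English, 1 for German, and 0 for Spanish, matching the example in the text.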

Often data sets contain variables that, considered separately, are not particularly insightful but that, when appropriately combined, result in a new variable that reveals an important relationship. Financial data supplying information on stock price and company earnings may not be as useful as the derived variable representing the price/earnings (PE) ratio. A variable tabulating the dollars spent by a household on groceries may not be interesting because this value may depend on the size of the household. Instead, considering the proportion of total household spending on groceries may be more informative.
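Both derived variables are simple ratios, as the following Python sketch shows. The firms, households, and dollar amounts are made up, and the field name `earnings_per_share` reflects the assumption that earnings are expressed per share so that the ratio is comparable to the stock price.

```python
# Hypothetical financial data: stock price and earnings per share.
firms = [
    {"name": "Firm A", "price": 50.0, "earnings_per_share": 2.5},
    {"name": "Firm B", "price": 120.0, "earnings_per_share": 4.0},
]
for firm in firms:
    firm["pe_ratio"] = firm["price"] / firm["earnings_per_share"]

# Hypothetical household data: annual grocery and total spending in dollars.
households = [
    {"id": 1, "groceries": 6000.0, "total_spending": 40000.0},
    {"id": 2, "groceries": 9000.0, "total_spending": 90000.0},
]
for household in households:
    household["grocery_share"] = (
        household["groceries"] / household["total_spending"]
    )

for firm in firms:
    print(f"{firm['name']}: PE ratio = {firm['pe_ratio']:.1f}")
for household in households:
    print(f"Household {household['id']}: grocery share = "
          f"{household['grocery_share']:.0%}")
```

Household 2 spends more dollars on groceries than Household 1 but a smaller share of its total spending, which is the kind of relationship the raw dollar amount hides.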

N O T E S + C O M M E N T S

1. Many of the data visualization tools described in Chapter 3 can be used to aid in data cleansing.

2. In some cases, it may be desirable to transform a numerical variable into categories. For example, if we wish to analyze the circumstances in which a numerical outcome variable exceeds a certain value, it may be helpful to create a binary categorical variable that is 1 for observations with the variable value greater than the threshold and 0 otherwise. In another case, if a variable has a skewed distribution, it may be helpful to categorize the values into quantiles.

3. Most dedicated statistical software packages provide functionality to apply a more sophisticated dimension-reduction approach called principal components analysis. Both JMP Pro and Analytic Solver contain procedures for implementing principal components analysis. Principal components analysis creates a collection of metavariables (components) that are weighted sums of the original variables. These components are not correlated with each other, and often only a few of them are needed to convey the same information as the large set of original variables. In many cases, only one or two components are necessary to explain the majority of the variance in the original variables. Then the analyst can continue to build a data-mining model using just a few of the most explanatory components rather than the entire set of original variables. Although principal components analysis can reduce the number of variables in this manner, it may be harder to explain the results of the model because the interpretation of a component that is a linear combination of variables can be unintuitive.
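The two transformations described in Note 2 can be sketched in Python with the standard library. The Miles values are illustrative, the threshold of 25,000 is hypothetical, and the quartile cut points use the default method of `statistics.quantiles`, which may differ slightly from the percentile calculation used elsewhere in this chapter.

```python
from bisect import bisect_right
from statistics import quantiles

# Illustrative Miles values with a long right tail.
miles = [206.0, 1472.1, 2186, 2917, 5670, 21000, 26129, 37419, 57313, 107237.0]

# Binary categorical variable: 1 if the value exceeds the threshold, else 0.
THRESHOLD = 25000  # hypothetical cutoff
exceeds = [int(m > THRESHOLD) for m in miles]

# Quartile groups: 1 for the lowest quarter of values, 4 for the highest.
cut_points = quantiles(miles, n=4)
# A value equal to a cut point is assigned to the higher group.
quartile_group = [bisect_right(cut_points, m) + 1 for m in miles]

for m, flag, group in zip(miles, exceeds, quartile_group):
    print(f"{m:>10,.1f}  exceeds={flag}  quartile={group}")
```

Grouping by quartile places roughly the same number of observations in each category regardless of how skewed the original values are.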

S U M M A R Y

In this chapter we have provided an introduction to descriptive statistics that can be used to summarize data. We began by explaining the need for data collection, defining the types of data one may encounter, and providing a few commonly used sources for finding data. We presented several useful functions for modifying data in Excel, such as sorting and filtering to aid in data analysis.


We introduced the concept of a distribution and explained how to generate frequency, relative, percent, and cumulative distributions for data. We also demonstrated the use of histograms as a way to visualize the distribution of data. We then introduced measures of location for a distribution of data such as mean, median, mode, and geometric mean, as well as measures of variability such as range, variance, standard deviation, coefficient of variation, and interquartile range. We presented additional measures for analyzing a distribution of data including percentiles, quartiles, and z-scores. We showed that box plots are effective for visualizing a distribution.

We discussed measures of association between two variables. Scatter plots allow one to visualize the relationship between variables. Covariance and the correlation coefficient summarize the linear relationship between variables into a single number.

We also introduced methods for data cleansing. Analysts typically spend large amounts of their time trying to understand and cleanse raw data before applying analytics models. We discussed methods for identifying missing data and how to deal with missing data values and outliers.

G L O S S A R Y

Bins The nonoverlapping groupings of data used to create a frequency distribution. Bins for categorical data are also known as classes.
Box plot A graphical summary of data based on the quartiles of a distribution.
Categorical data Data for which categories of like items are identified by labels or names. Arithmetic operations cannot be performed on categorical data.
Coefficient of variation A measure of relative variability computed by dividing the standard deviation by the mean and multiplying by 100.
Correlation coefficient A standardized measure of linear association between two variables that takes on values between −1 and +1. Values near −1 indicate a strong negative linear relationship, values near +1 indicate a strong positive linear relationship, and values near zero indicate the lack of a linear relationship.
Covariance A measure of linear association between two variables. Positive values indicate a positive relationship; negative values indicate a negative relationship.
Cross-sectional data Data collected at the same or approximately the same point in time.
Cumulative frequency distribution A tabular summary of quantitative data showing the number of data values that are less than or equal to the upper class limit of each bin.
Data The facts and figures collected, analyzed, and summarized for presentation and interpretation.
Dimension reduction The process of removing variables from the analysis without losing crucial information.
Empirical rule A rule that can be used to compute the percentage of data values that must be within 1, 2, or 3 standard deviations of the mean for data that exhibit a bell-shaped distribution.
Frequency distribution A tabular summary of data showing the number (frequency) of data values in each of several nonoverlapping bins.
Geometric mean A measure of central location that is calculated by finding the nth root of the product of n values.
Growth factor The percentage increase of a value over a period of time is calculated using the formula (growth factor − 1). A growth factor less than 1 indicates negative growth, whereas a growth factor greater than 1 indicates positive growth. The growth factor cannot be less than zero.
Histogram A graphical presentation of a frequency distribution, relative frequency distribution, or percent frequency distribution of quantitative data constructed by placing the bin intervals on the horizontal axis and the frequencies, relative frequencies, or percent frequencies on the vertical axis.
Illegitimately missing data Missing data that do not occur naturally.
Imputation Systematic replacement of missing values with values that seem reasonable.
Interquartile range The difference between the third and first quartiles.
Legitimately missing data Missing data that occur naturally.
Mean (arithmetic mean) A measure of central location computed by summing the data values and dividing by the number of observations.
Median A measure of central location provided by the value in the middle when the data are arranged in ascending order.
Missing at random (MAR) The tendency for an observation to be missing a value of some variable is related to the value of some other variable(s) in the data.
Missing completely at random (MCAR) The tendency for an observation to be missing a value of some variable is entirely random.
Missing not at random (MNAR) The tendency for an observation to be missing a value of some variable is related to the missing value.
Mode A measure of central location defined as the value that occurs with greatest frequency.
Observation A set of values corresponding to a set of variables.
Outlier An unusually large or unusually small data value.
Percent frequency distribution A tabular summary of data showing the percentage of data values in each of several nonoverlapping bins.
Percentile A value such that approximately p% of the observations have values less than the pth percentile; hence, approximately (100 − p)% of the observations have values greater than the pth percentile. The 50th percentile is the median.
Population The set of all elements of interest in a particular study.
Quantitative data Data for which numerical values are used to indicate magnitude, such as how many or how much. Arithmetic operations such as addition, subtraction, and multiplication can be performed on quantitative data.
Quartiles The 25th, 50th, and 75th percentiles, referred to as the first quartile, second quartile (median), and third quartile, respectively. The quartiles can be used to divide a data set into four parts, with each part containing approximately 25% of the data.
Random sampling Collecting a sample that ensures that (1) each element selected comes from the same population and (2) each element is selected independently.
Random variable, or uncertain variable A quantity whose values are not known with certainty.
Range A measure of variability defined to be the largest value minus the smallest value.
Relative frequency distribution A tabular summary of data showing the fraction or proportion of data values in each of several nonoverlapping bins.
Sample A subset of the population.
Scatter chart A graphical presentation of the relationship between two quantitative variables. One variable is shown on the horizontal axis and the other on the vertical axis.
Skewness A measure of the lack of symmetry in a distribution.
Standard deviation A measure of variability computed by taking the positive square root of the variance.
Time series data Data that are collected over a period of time (minutes, hours, days, months, years, etc.).
Variable A characteristic or quantity of interest that can take on different values.
Variance A measure of variability based on the squared deviations of the data values about the mean.
Variation Differences in values of a variable over observations.
z-score A value computed by dividing the deviation about the mean $(x_i - \bar{x})$ by the standard deviation $s$. A z-score is referred to as a standardized value and denotes the number of standard deviations that $x_i$ is from the mean.


P R O B L E M S

1. A Wall Street Journal subscriber survey asked 46 questions about subscriber characteristics and interests. State whether each of the following questions provides categorical or quantitative data.

a. What is your age?

b. Are you male or female?

c. When did you first start reading the WSJ? High school, college, early career, midcareer, late career, or retirement?

d. How long have you been in your present job or position?

e. What type of vehicle are you considering for your next purchase? Nine response categories include sedan, sports car, SUV, minivan, and so on.

2. The following table contains a partial list of countries, the continents on which they are located, and their respective gross domestic products (GDP) in U.S. dollars. A list of 125 countries and their GDPs is contained in the file GDPlist.

Country Continent GDP (millions of US$)

Afghanistan Asia 18,181

Albania Europe 12,847

Algeria Africa 190,709

Angola Africa 100,948

Argentina South America 447,644

Australia Oceania 1,488,221

Austria Europe 419,243

Azerbaijan Europe 62,321

Bahrain Asia 26,108

Bangladesh Asia 113,032

Belarus Europe 55,483

Belgium Europe 513,396

Bolivia South America 24,604

Bosnia and Herzegovina Europe 17,965

Botswana Africa 17,570

GDPlist

a. Sort the countries in GDPlist from largest to smallest GDP. What are the top 10 countries according to GDP?

b. Filter the countries to display only the countries located in Africa. What are the top 5 countries located in Africa according to GDP?

c. What are the top 5 countries by GDP that are located in Europe?

3. Ohio Logistics manages the logistical activities for firms by matching companies that need products shipped with carriers that can provide the best rates and best service for the companies. Ohio Logistics is very concerned that its carriers deliver their customers’ material on time, so it carefully monitors the percentage of on-time deliveries. The following table contains a list of the carriers used by Ohio Logistics and the corresponding on-time percentages for the current and previous years.

Carrier Previous Year On-Time Deliveries (%) Current Year On-Time Deliveries (%)

Blue Box Shipping 88.4 94.8

Cheetah LLC 89.3 91.8

Granite State Carriers 81.8 87.6

Honsin Limited 74.2 80.1

Jones Brothers 68.9 82.8

Minuteman Company 91.0 84.2

Rapid Response 78.8 70.9

Smith Logistics 84.3 88.7

Super Freight 92.1 86.8

Carriers

a. Sort the carriers in descending order by their current year’s percentage of on-time deliveries. Which carrier is providing the best service in the current year? Which carrier is providing the worst service in the current year?

b. Calculate the change in percentage of on-time deliveries from the previous to the current year for each carrier. Use Excel’s conditional formatting to highlight the carriers whose on-time percentage decreased from the previous year to the current year.

c. Use Excel’s conditional formatting tool to create data bars for the change in percentage of on-time deliveries from the previous year to the current year for each carrier calculated in part b.

d. Which carriers should Ohio Logistics try to use in the future? Why?

4. A partial relative frequency distribution is given.

Class Relative Frequency

A 0.22

B 0.18

C 0.40

a. What is the relative frequency of class D?

b. The total sample size is 200. What is the frequency of class D?

c. Show the frequency distribution.

d. Show the percent frequency distribution.

5. In a recent report, the top five most-visited English-language web sites were google.com (GOOG), facebook.com (FB), youtube.com (YT), yahoo.com (YAH), and wikipedia.com (WIKI). The most-visited web sites for a sample of 50 Internet users are shown in the following table:

YAH WIKI YT WIKI GOOG

YT YAH GOOG GOOG GOOG

WIKI GOOG YAH YAH YAH

YAH YT GOOG YT YAH

GOOG FB FB WIKI GOOG

GOOG GOOG FB FB WIKI

FB YAH YT YAH YAH

YT GOOG YAH FB FB

WIKI GOOG YAH WIKI WIKI

YAH YT GOOG GOOG WIKI

Websites

a. Are these data categorical or quantitative?

b. Provide frequency and percent frequency distributions.

c. On the basis of the sample, which web site is most frequently the most-often-visited web site for Internet users? Which is second?

6. In a study of how chief executive officers (CEOs) spend their days, it was found that CEOs spend an average of about 18 hours per week in meetings, not including conference calls, business meals, and public events. Shown here are the times spent per week in meetings (hours) for a sample of 25 CEOs:

14 15 18 23 15 19 20 13 15 23 23 21 15 20 21 16 15 18 18 19 19 22 23 21 12

CEOtime

a. What is the least amount of time a CEO spent per week in meetings in this sample? The highest?

b. Use a class width of 2 hours to prepare a frequency distribution and a percent frequency distribution for the data.

c. Prepare a histogram and comment on the shape of the distribution.

7. Consumer complaints are frequently reported to the Better Business Bureau. Industries with the most complaints to the Better Business Bureau are often banks, cable and satellite television companies, collection agencies, cellular phone providers, and new car dealerships. The results for a sample of 200 complaints are in the file BBB.

a. Show the frequency and percent frequency of complaints by industry.

b. Which industry had the highest number of complaints?

c. Comment on the percentage frequency distribution for complaints.

BBB

8. Reports have found that many U.S. adults would rather live in a different type of community than the one in which they are living now. A national survey of 2,260 adults asked: “Where do you live now?” and “What do you consider to be the ideal community?” Response options were City (C), Suburb (S), Small Town (T), or Rural (R). A representative portion of this survey for a sample of 100 respondents is as follows:

Where do you live now?

S T R C R R T C S T C S C S T S S C S S T T C C S T C S T C T R S S T C S C T C T C T C R C C R T C S S T S C C C R S C S S C C S C R T T T C R T C R C T R R C T C C R T T R S R T T S S S S S C C R T

What do you consider to be the ideal community?

S C R R R S T S S T T S C S T C C R T R S T T S S C C T T S S R C S C C S C R C T S R R R C T S T T T R R S C C R R S S S T C T T C R T T T C T T R R C S R T C T C C T T T R C R T T C S S C S T S S R

Communities

Develop frequency and percent frequency distributions for the data above to answer the following questions.

a. Where are most adults living now?

b. Where do most adults consider the ideal community to be?

c. What changes in living areas would you expect to see if people moved from where they currently live to their ideal community?


9. Consider the following data:

14 24 18 22 19 18 16 22 24 17 15 16 19 23 24 16 16 26 21 16 20 22 16 12 24 23 19 25 20 25 21 19 21 25 23 24 22 19 20 20

Frequency

a. Develop a frequency distribution using classes of 12–14, 15–17, 18–20, 21–23, and 24–26.

b. Develop a relative frequency distribution and a percent frequency distribution using the classes in part a.

10. Consider the following frequency distribution.

Class Frequency

10–19 10

20–29 14

30–39 17

40–49 7

50–59 2

Construct a cumulative frequency distribution.

11. The owner of an automobile repair shop studied the waiting times for customers who arrive at the shop for an oil change. The following data with waiting times in minutes were collected over a one-month period.

2 5 10 12 4 4 5 17 11 8 9 8 12 21 6 8 7 13 18 3

RepairShop

Using classes of 0–4, 5–9, and so on, show the following:

a. The frequency distribution

b. The relative frequency distribution

c. The cumulative frequency distribution

d. The cumulative relative frequency distribution

e. The proportion of customers needing an oil change who wait 9 minutes or less

12. Approximately 1.65 million high school students take the Scholastic Aptitude Test (SAT) each year, and nearly 80% of the colleges and universities without open admissions policies use SAT scores in making admission decisions. The current version of the SAT includes three parts: reading comprehension, mathematics, and writing. A perfect combined score for all three parts is 2400. A sample of SAT scores for the combined three-part SAT is as follows:

1665 1525 1355 1645 1780 1275 2135 1280 1060 1585 1650 1560 1150 1485 1990 1590 1880 1420 1755 1375 1475 1680 1440 1260 1730 1490 1560 940 1390 1175

SAT

a. Show a frequency distribution and histogram. Begin with the first bin starting at 800, and use a bin width of 200.

b. Comment on the shape of the distribution.

c. What other observations can be made about the SAT scores based on the tabular and graphical summaries?

13. Consider a sample with data values of 10, 20, 12, 17, and 16.

a. Compute the mean and median.

b. Consider a sample with data values 10, 20, 12, 17, 16, and 12. How would you expect the mean and median for these sample data to compare to the mean and median for part a (higher, lower, or the same)? Compute the mean and median for the sample data 10, 20, 12, 17, 16, and 12.

14. Consider a sample with data values of 27, 25, 20, 15, 30, 34, 28, and 25. Compute the 20th, 25th, 65th, and 75th percentiles.

15. Consider a sample with data values of 53, 55, 70, 58, 64, 57, 53, 69, 57, 68, and 53. Compute the mean, median, and mode.

16. If an asset declines in value from $5,000 to $3,500 over nine years, what is the mean annual growth rate in the asset’s value over these nine years?

17. Suppose that you initially invested $10,000 in the Stivers mutual fund and $5,000 in the Trippi mutual fund. The value of each investment at the end of each subsequent year is provided in the table:

Year Stivers ($) Trippi ($)

1 11,000 5,600

2 12,000 6,300

3 13,000 6,900

4 14,000 7,600

5 15,000 8,500

6 16,000 9,200

7 17,000 9,900

8 18,000 10,600

Which of the two mutual funds performed better over this time period?

18. The average time that Americans commute to work is 27.7 minutes (Sterling’s Best Places, April 13, 2012). The average commute times in minutes for 48 cities are as follows:

Albuquerque 23.3 Atlanta 28.3 Austin 24.6 Baltimore 32.1 Boston 31.7 Charlotte 25.8 Chicago 38.1 Cincinnati 24.9 Cleveland 26.8 Columbus 23.4 Dallas 28.5 Denver 28.1 Detroit 29.3 El Paso 24.4 Fresno 23.0 Indianapolis 24.8

Jacksonville 26.2 Kansas City 23.4 Las Vegas 28.4 Little Rock 20.1 Los Angeles 32.2 Louisville 21.4 Memphis 23.8 Miami 30.7 Milwaukee 24.8 Minneapolis 23.6 Nashville 25.3 New Orleans 31.7 New York 43.8 Oklahoma City 22.0 Orlando 27.1 Philadelphia 34.2

Phoenix 28.3 Pittsburgh 25.0 Portland 26.4 Providence 23.6 Richmond 23.4 Sacramento 25.8 Salt Lake City 20.2 San Antonio 26.1 San Diego 24.8 San Francisco 32.6 San Jose 28.5 Seattle 27.3 St. Louis 26.8 Tucson 24.0 Tulsa 20.1 Washington, D.C. 32.8

CommuteTimes


a. What is the mean commute time for these 48 cities?

b. What is the median commute time for these 48 cities?

c. What is the mode for these 48 cities?

d. What are the variance and standard deviation of commute times for these 48 cities?

e. What is the third quartile of commute times for these 48 cities?

19. Suppose that the average waiting time for a patient at a physician’s office is just over 29 minutes. To address the issue of long patient wait times, some physicians’ offices are using wait-tracking systems to notify patients of expected wait times. Patients can adjust their arrival times based on this information and spend less time in waiting rooms. The following data show wait times (in minutes) for a sample of patients at offices that do not have a wait-tracking system and wait times for a sample of patients at offices with such systems.

Without Wait-Tracking System With Wait-Tracking System

24 31

67 11

17 14

20 18

31 12

44 37

12 9

23 13

16 12

37 15

PatientWaits

a. What are the mean and median patient wait times for offices with a wait-tracking system? What are the mean and median patient wait times for offices without a wait-tracking system?

b. What are the variance and standard deviation of patient wait times for offices with a wait-tracking system? What are the variance and standard deviation of patient wait times for visits to offices without a wait-tracking system?

c. Create a box plot for patient wait times for offices without a wait-tracking system.

d. Create a box plot for patient wait times for offices with a wait-tracking system.

e. Do offices with a wait-tracking system have shorter patient wait times than offices without a wait-tracking system? Explain.

20. According to the National Education Association (NEA), teachers generally spend more than 40 hours each week working on instructional duties. The following data show the number of hours worked per week for a sample of 13 high school science teachers and a sample of 11 high school English teachers.

High school science teachers 53 56 54 54 55 58 49 61 54 54 52 53 54

High school English teachers 52 47 50 46 47 48 49 46 55 44 47

Teachers

a. What is the median number of hours worked per week for the sample of 13 high school science teachers?

b. What is the median number of hours worked per week for the sample of 11 high school English teachers?

c. Create a box plot for the number of hours worked for high school science teachers.

d. Create a box plot for the number of hours worked for high school English teachers.

e. Comment on the differences between the box plots for science and English teachers.


21. Return to the waiting times given for the physician’s office in Problem 19.

a. Considering only offices without a wait-tracking system, what is the z-score for the 10th patient in the sample (wait time = 37 minutes)?

b. Considering only offices with a wait-tracking system, what is the z-score for the 6th patient in the sample (wait time = 37 minutes)? How does this z-score compare with the z-score you calculated for part a?

c. Based on z-scores, do the data for offices without a wait-tracking system contain any outliers? Based on z-scores, do the data for offices with a wait-tracking system contain any outliers?

22. The results of a national survey showed that on average, adults sleep 6.9 hours per night. Suppose that the standard deviation is 1.2 hours and that the number of hours of sleep follows a bell-shaped distribution.

a. Use the empirical rule to calculate the percentage of individuals who sleep between 4.5 and 9.3 hours per day.

b. What is the z-value for an adult who sleeps 8 hours per night?

c. What is the z-value for an adult who sleeps 6 hours per night?

23. Suppose that the national average for the math portion of the College Board’s SAT is 515. The College Board periodically rescales the test scores such that the standard deviation is approximately 100. Answer the following questions using a bell-shaped distribution and the empirical rule for the math test scores.

a. What percentage of students have an SAT math score greater than 615?

b. What percentage of students have an SAT math score greater than 715?

c. What percentage of students have an SAT math score between 415 and 515?

d. What is the z-score for a student with an SAT math score of 620?

e. What is the z-score for a student with an SAT math score of 405?

24. Five observations taken for two variables follow.

xi    4    6   11    3   16
yi   50   50   40   60   30

a. Develop a scatter diagram with x on the horizontal axis.
b. What does the scatter diagram developed in part a indicate about the relationship between the two variables?
c. Compute and interpret the sample covariance.
d. Compute and interpret the sample correlation coefficient.

25. The scatter chart in the following figure was created using sample data for profits and market capitalizations from a sample of firms in the Fortune 500.

PatientWaits

[Scatter chart: Profits (millions of $), 0 to 16,000, on the horizontal axis; Market Cap (millions of $), up to 200,000, on the vertical axis.]

Fortune500


a. Discuss what the scatter chart indicates about the relationship between profits and market capitalization.

b. The data used to produce this chart are contained in the file Fortune500. Calculate the covariance between profits and market capitalization. Discuss what the covariance indicates about the relationship between profits and market capitalization.

c. Calculate the correlation coefficient between profits and market capitalization. What does the correlation coefficient indicate about the relationship between profits and market capitalization?

26. The economic downturn in 2008–2009 resulted in the loss of jobs and an increase in delinquent loans for housing. In projecting where the real estate market was headed in the coming year, economists studied the relationship between the jobless rate and the percentage of delinquent loans. The expectation was that if the jobless rate continued to increase, there would also be an increase in the percentage of delinquent loans. The following data show the jobless rate and the delinquent loan percentage for 27 major real estate markets.

Metro Area        Jobless Rate (%)    Delinquent Loans (%)
Atlanta                        7.1                    7.02
Boston                         5.2                    5.31
Charlotte                      7.8                    5.38
Chicago                        7.8                    5.40
Dallas                         5.8                    5.00
Denver                         5.8                    4.07
Detroit                        9.3                    6.53
Houston                        5.7                    5.57
Jacksonville                   7.3                    6.99
Las Vegas                      7.6                   11.12
Los Angeles                    8.2                    7.56
Miami                          7.1                   12.11
Minneapolis                    6.3                    4.39
Nashville                      6.6                    4.78
New York                       6.2                    5.78
Orange County                  6.3                    6.08
Orlando                        7.0                   10.05
Philadelphia                   6.2                    4.75
Phoenix                        5.5                    7.22
Portland                       6.5                    3.79
Raleigh                        6.0                    3.62
Sacramento                     8.3                    9.24
St. Louis                      7.5                    4.40
San Diego                      7.1                    6.91
San Francisco                  6.8                    5.57
Seattle                        5.5                    3.87
Tampa                          7.5                    8.42

Source: The Wall Street Journal, January 27, 2009.

JoblessRate

a. Compute the correlation coefficient. Is there a positive correlation between the jobless rate and the percentage of delinquent housing loans? What is your interpretation?

b. Show a scatter diagram of the relationship between the jobless rate and the percentage of delinquent housing loans.

27. Huron Lakes Candies (HLC) has developed a new candy bar called Java Cup that is a milk chocolate cup with a coffee-cream center. In order to assess the market potential of Java Cup, HLC has developed a taste test and follow-up survey. Respondents were asked to taste Java Cup and then rate Java Cup’s taste, texture, creaminess of filling, sweetness, and depth of the chocolate flavor of the cup on a 100-point scale. The taste test and survey were administered to 217 randomly selected adult consumers. Data collected from each respondent are provided in the file JavaCup.

JavaCup

a. Are there any missing values in HLC’s survey data? If so, identify the respondents for which data are missing and which values are missing for each of these respondents.

b. Are there any values in HLC’s survey data that appear to be erroneous? If so, identify the respondents for which data appear to be erroneous and which values appear to be erroneous for each of these respondents.

28. Marilyn Marshall, a professor of sports economics, has obtained a data set of home attendance for each of the 30 Major League Baseball franchises for each season from 2010 through 2016. Dr. Marshall suspects the data, provided in the file AttendMLB, are in need of a thorough cleansing. You should also find a reliable source of Major League Baseball attendance for each franchise between 2010 and 2016 to use to help you identify appropriate imputation values for data missing in the AttendMLB file.

a. Are there any missing values in Dr. Marshall’s data? If so, identify the teams and seasons for which data are missing and which values are missing for each of these teams and seasons. Use the reliable source of Major League Baseball attendance for each franchise between 2010 and 2016 you have found to find the correct value in each instance.

b. Are there any values in Dr. Marshall’s data that appear to be erroneous? If so, identify the teams and seasons for which data appear to be erroneous and which values appear to be erroneous for each of these teams and seasons. Use the reliable source of Major League Baseball attendance for each franchise between 2010 and 2016 you have found to find the correct value in each instance.

Case Problem: Heavenly Chocolates Web Site Transactions

Heavenly Chocolates manufactures and sells quality chocolate products at its plant and retail store located in Saratoga Springs, New York. Two years ago, the company developed a web site and began selling its products over the Internet. Web site sales have exceeded the company’s expectations, and management is now considering strategies to increase sales even further. To learn more about the web site customers, a sample of 50 Heavenly Chocolates transactions was selected from the previous month’s sales. Data showing the day of the week each transaction was made, the type of browser the customer used, the time spent on the web site, the number of web pages viewed, and the amount spent by each of the 50 customers are contained in the file named HeavenlyChocolates. A portion of the data is shown in the table that follows:

Customer Day Browser Time (min) Pages Viewed Amount Spent ($)

1 Mon Chrome 12.0 4 54.52

2 Wed Other 19.5 6 94.90

3 Mon Chrome 8.5 4 26.68

4 Tue Firefox 11.4 2 44.73

5 Wed Chrome 11.3 4 66.27

6 Sat Firefox 10.5 6 67.80

7 Sun Chrome 11.4 2 36.04

. . . . . .

48 Fri Chrome 9.7 5 103.15

49 Mon Other 7.3 6 52.15

50 Fri Chrome 13.4 3 98.75

HeavenlyChocolates

AttendMLB


Heavenly Chocolates would like to use the sample data to determine whether online shoppers who spend more time and view more pages also spend more money during their visit to the web site. The company would also like to investigate the effect that the day of the week and the type of browser have on sales.

Managerial Report

Use the methods of descriptive statistics to learn about the customers who visit the Heavenly Chocolates web site. Include the following in your report.

1. Graphical and numerical summaries for the length of time the shopper spends on the web site, the number of pages viewed, and the mean amount spent per transaction. Discuss what you learn about Heavenly Chocolates’ online shoppers from these numerical summaries.

2. Summarize the frequency, the total dollars spent, and the mean amount spent per transaction for each day of the week. Discuss the observations you can make about Heavenly Chocolates’ business based on the day of the week.

3. Summarize the frequency, the total dollars spent, and the mean amount spent per transaction for each type of browser. Discuss the observations you can make about Heavenly Chocolates’ business based on the type of browser.

4. Develop a scatter diagram, and compute the sample correlation coefficient to explore the relationship between the time spent on the web site and the dollar amount spent. Use the horizontal axis for the time spent on the web site. Discuss your findings.

5. Develop a scatter diagram, and compute the sample correlation coefficient to explore the relationship between the number of web pages viewed and the amount spent. Use the horizontal axis for the number of web pages viewed. Discuss your findings.

6. Develop a scatter diagram, and compute the sample correlation coefficient to explore the relationship between the time spent on the web site and the number of pages viewed. Use the horizontal axis to represent the number of pages viewed. Discuss your findings.
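The grouped summaries requested in items 2 and 3 (frequency, total dollars spent, and mean amount spent per group) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the ten transactions printed in the table above rather than the full 50-row HeavenlyChocolates file, and the function name `summarize` is hypothetical.

```python
from collections import defaultdict

# The ten transactions printed in the text: (day, browser, amount spent in $).
# The full HeavenlyChocolates file has 50 rows; this subset is for illustration only.
transactions = [
    ("Mon", "Chrome", 54.52),
    ("Wed", "Other", 94.90),
    ("Mon", "Chrome", 26.68),
    ("Tue", "Firefox", 44.73),
    ("Wed", "Chrome", 66.27),
    ("Sat", "Firefox", 67.80),
    ("Sun", "Chrome", 36.04),
    ("Fri", "Chrome", 103.15),
    ("Mon", "Other", 52.15),
    ("Fri", "Chrome", 98.75),
]


def summarize(rows, key_index):
    """Return {group: (frequency, total spent, mean spent)} for one grouping column."""
    amounts = defaultdict(list)
    for row in rows:
        amounts[row[key_index]].append(row[2])
    return {
        group: (len(vals), sum(vals), sum(vals) / len(vals))
        for group, vals in amounts.items()
    }


by_day = summarize(transactions, 0)      # grouping column 0 is the day of the week
by_browser = summarize(transactions, 1)  # grouping column 1 is the browser

for day, (freq, total, mean) in by_day.items():
    print(f"{day:<4}{freq:>3}{total:>10.2f}{mean:>9.2f}")
```

The same function serves both items because only the grouping column changes; with the full data file, the list of tuples would be read from the file instead of typed in.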

Chapter 3 Data Visualization

Contents

Analytics in Action: Cincinnati Zoo & Botanical Garden

3.1 Overview of Data Visualization
Effective Design Techniques

3.2 Tables
Table Design Principles
Crosstabulation
PivotTables in Excel
Recommended PivotTables in Excel

3.3 Charts
Scatter Charts
Recommended Charts in Excel
Line Charts
Bar Charts and Column Charts
A Note on Pie Charts and Three-Dimensional Charts
Bubble Charts
Heat Maps
Additional Charts for Multiple Variables
PivotCharts in Excel

3.4 Advanced Data Visualization
Advanced Charts
Geographic Information Systems Charts

3.5 Data Dashboards
Principles of Effective Data Dashboards
Applications of Data Dashboards

Appendix 3.1 Creating a Scatter-Chart Matrix and a Parallel-Coordinates Plot with Analytic Solver (MindTap Reader)

Analytics in Action

Cincinnati Zoo & Botanical Garden¹

The Cincinnati Zoo & Botanical Garden, located in Cincinnati, Ohio, is one of the oldest zoos in the United States. To improve decision making by becoming more data-driven, management decided they needed to link the various facets of their business and provide nontechnical managers and executives with an intuitive way to better understand their data. A complicating factor is that when the zoo is busy, managers are expected to be on the grounds interacting with guests, checking on operations, and dealing with issues as they arise or anticipating them. Therefore, being able to monitor what is happening in real time was a key factor in deciding what to do. Zoo management concluded that a data-visualization strategy was needed to address the problem.

Because of its ease of use, real-time updating capability, and iPad compatibility, the Cincinnati Zoo decided to implement its data-visualization strategy using IBM’s Cognos advanced data-visualization software. Using this software, the Cincinnati Zoo developed the set of charts shown in Figure 3.1 (known as a data dashboard) to enable management to track the following key measures of performance:

● Item analysis (sales volumes and sales dollars by location within the zoo)

● Geoanalytics (using maps and displays of where the day’s visitors are spending their time at the zoo)

● Customer spending

● Cashier sales performance

● Sales and attendance data versus weather patterns

● Performance of the zoo’s loyalty rewards program

Figure 3.1 Data Dashboard for the Cincinnati Zoo

¹The authors are indebted to John Lucas of the Cincinnati Zoo & Botanical Garden for providing this application.

An iPad mobile application was also developed to enable the zoo’s managers to be out on the grounds and still see and anticipate occurrences in real time. The Cincinnati Zoo’s iPad application, shown in Figure 3.2, provides managers with access to the following information:

● Real-time attendance data, including what types of guests are coming to the zoo (members, nonmembers, school groups, and so on)

● Real-time analysis showing which locations are busiest and which items are selling the fastest inside the zoo

● Real-time geographical representation of where the zoo’s visitors live

Having access to the data shown in Figures 3.1 and 3.2 allows the zoo managers to make better decisions about staffing levels, which items to stock based on weather and other conditions, and how to better target advertising based on geodemographics.

The impact that data visualization has had on the zoo has been substantial. Within the first year of use, the system was directly responsible for revenue growth of over $500,000, increased visitation to the zoo, enhanced customer service, and reduced marketing costs.

Figure 3.2 The Cincinnati Zoo iPad Data Dashboard


The first step in trying to interpret data is often to visualize it in some way. Data visualization can be as simple as creating a summary table, or it could require generating charts to help interpret, analyze, and learn from the data. Data visualization is very helpful for identifying data errors and for reducing the size of your data set by highlighting important relationships and trends.

Data visualization is also important in conveying your analysis to others. Although business analytics is about making better decisions, in many cases, the ultimate decision maker is not the person who analyzes the data. Therefore, the person analyzing the data has to make the analysis simple for others to understand. Proper data-visualization techniques greatly improve the ability of the decision maker to interpret the analysis easily.

In this chapter we discuss some general concepts related to data visualization to help you analyze data and convey your analysis to others. We cover specifics dealing with how to design tables and charts, as well as the most commonly used charts, and present an overview of some more advanced charts. We also introduce the concept of data dashboards and geographic information systems (GISs). Our detailed examples use Excel to generate tables and charts, and we discuss several software packages that can be used for advanced data visualization.

3.1 Overview of Data Visualization

Decades of research studies in psychology and other fields show that the human mind can process visual images such as charts much faster than it can interpret rows of numbers. However, these same studies also show that the human mind has certain limitations in its ability to interpret visual images and that some images are better at conveying information than others. The goal of this chapter is to introduce some of the most common forms of visualizing data and demonstrate when each form is appropriate.

Microsoft Excel is a ubiquitous tool used in business for basic data visualization. Software tools such as Excel make it easy for anyone to create many standard examples of data visualization. However, as discussed in this chapter, the default settings for tables and charts created with Excel can be altered to increase clarity. New types of software that are dedicated to data visualization have appeared recently. We focus our techniques on Excel in this chapter, but we also mention some of these more advanced software packages for specific data-visualization uses.

Effective Design Techniques

One of the most helpful ideas for creating effective tables and charts for data visualization is the idea of the data-ink ratio, described by Edward R. Tufte in his book The Visual Display of Quantitative Information (first published in 1983; second edition, 2001). The data-ink ratio measures the proportion of what Tufte terms “data-ink” to the total amount of ink used in a table or chart. Data-ink is the ink used in a table or chart that is necessary to convey the meaning of the data to the audience. Non-data-ink is ink used in a table or chart that serves no useful purpose in conveying the data to the audience.

Let us consider the case of Gossamer Industries, a firm that produces fine silk clothing products. Gossamer is interested in tracking the sales of one of its most popular items, a particular style of women’s scarf. Table 3.1 and Figure 3.3 provide examples of a table and chart with low data-ink ratios used to display sales of this style of women’s scarf. The data used in this table and figure represent product sales by day. Both of these examples are similar to tables and charts generated with Excel using common default settings. In Table 3.1, most of the grid lines serve no useful purpose. Likewise, in Figure 3.3, the horizontal lines in the chart also add little additional information. In both cases, most of these lines can be deleted without reducing the information conveyed. However, an important piece of information is missing from Figure 3.3: labels for axes. Axes should always be labeled in a chart unless both the meaning and unit of measure are obvious.

The chapter appendix available in the MindTap Reader covers the use of Analytic Solver (an Excel Add-in) for data visualization.


Table 3.2 shows a modified table in which all grid lines have been deleted except for those around the title of the table. Deleting the grid lines in Table 3.1 increases the data-ink ratio because a larger proportion of the ink used in the table is used to convey the information (the actual numbers). Similarly, deleting the unnecessary horizontal lines in Figure 3.4 increases the data-ink ratio. Note that deleting these horizontal lines and removing (or reducing the size of) the markers at each data point can make it more difficult to determine the exact values plotted in the chart. However, as we discuss later, a simple chart is not the most effective way of presenting data when the audience needs to know exact values; in these cases, it is better to use a table.

In many cases, white space in a table or a chart can improve readability. This principle is similar to the idea of increasing the data-ink ratio. Consider Table 3.2 and Figure 3.4. Removing the unnecessary lines has increased the “white space,” making it easier to read both the table and the chart. The fundamental idea in creating effective tables and charts is to make them as simple as possible in conveying information to the reader.

Scarf Sales by Day

Day Sales Day Sales

1 150 11 170

2 170 12 160

3 140 13 290

4 150 14 200

5 180 15 210

6 180 16 110

7 210 17 90

8 230 18 140

9 140 19 150

10 200 20 230

Table 3.1 Example of a Low Data-Ink Ratio Table

Figure 3.3 Example of a Low Data-Ink Ratio Chart

[Line chart titled “Scarf Sales by Day” plotting the Sales series for days 1 through 20; vertical axis runs from 0 to 350; neither axis is labeled.]


Notes + Comments

1. Tables have been used to display data for more than a thousand years. However, charts are much more recent inventions. The famous 17th-century French mathematician, René Descartes, is credited with inventing the now familiar graph with horizontal and vertical axes. William Playfair invented bar charts, line charts, and pie charts in the late 18th century, all of which we will discuss in this chapter. More recently, individuals such as William Cleveland, Edward R. Tufte, and Stephen Few have introduced design techniques for both clarity and beauty in data visualization.

2. Many of the default settings in Excel are not ideal for displaying data using tables and charts that communicate effectively. Before presenting Excel-generated tables and charts to others, it is worth the effort to remove unnecessary lines and labels.

Scarf Sales by Day

Day Sales Day Sales

1 150 11 170

2 170 12 160

3 140 13 290

4 150 14 200

5 180 15 210

6 180 16 110

7 210 17 90

8 230 18 140

9 140 19 150

10 200 20 230

Table 3.2 Increasing the Data-Ink Ratio by Removing Unnecessary Gridlines

Figure 3.4 Increasing the Data-Ink Ratio by Adding Labels to Axes and Removing Unnecessary Lines and Labels

[Line chart titled “Scarf Sales by Day”; vertical axis “Sales (Units)” from 0 to 350; horizontal axis “Day” with tick labels at odd-numbered days 1 through 19.]


3.2 Tables

The first decision in displaying data is whether a table or a chart will be more effective. In general, charts can often convey information to readers more quickly and easily, but in some cases a table is more appropriate. Tables should be used when the

1. reader needs to refer to specific numerical values.
2. reader needs to make precise comparisons between different values and not just relative comparisons.
3. values being displayed have different units or very different magnitudes.

When the accounting department of Gossamer Industries is summarizing the company’s annual data for completion of its federal tax forms, the specific numbers corresponding to revenues and expenses are important and not just the relative values. Therefore, these data should be presented in a table similar to Table 3.3.

Similarly, if it is important to know by exactly how much revenues exceed expenses each month, then this would also be better presented as a table rather than as a line chart as seen in Figure 3.5. Notice that it is very difficult to determine the monthly revenues and costs in Figure 3.5. We could add these values using data labels, but they would clutter the figure. The preferred solution is to combine the chart with the table into a single figure, as in Figure 3.6, to allow the reader to easily see the monthly changes in revenues and costs while also being able to refer to the exact numerical values.

Now suppose that you wish to display data on revenues, costs, and head count for each month. Costs and revenues are measured in dollars, but head count is measured in number of employees. Although all these values can be displayed on a line chart using multiple vertical axes, this is generally not recommended. Because the values have widely different magnitudes (costs and revenues are in the tens of thousands, whereas head count is approximately 10 each month), it would be difficult to interpret changes on a single chart. Therefore, a table similar to Table 3.4 is recommended.

Table 3.3 Table Showing Exact Values for Costs and Revenues by Month for Gossamer Industries

                                      Month
                    1        2        3        4        5        6      Total
Costs ($)      48,123   56,458   64,125   52,158   54,718   50,985    326,567
Revenues ($)   64,124   66,128   67,125   48,178   51,785   55,687    353,027

Figure 3.5 Line Chart of Monthly Costs and Revenues at Gossamer Industries

[Line chart of Revenues ($) and Costs ($) by Month for months 1 through 6; vertical axis in dollars with tick marks up to 80,000.]

Table Design Principles

In designing an effective table, keep in mind the data-ink ratio and avoid the use of unnecessary ink in tables. In general, this means that we should avoid using vertical lines in a table unless they are necessary for clarity. Horizontal lines are generally necessary only for separating column titles from data values or when indicating that a calculation has taken place. Consider Figure 3.7, which compares several forms of a table displaying Gossamer’s costs and revenue data. Most people find Design D, with the fewest grid lines, easiest to read. In this table, grid lines are used only to separate the column headings from the data and to indicate that a calculation has occurred to generate the Profits row and the Total column.
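The Profits row and the Total column in Figure 3.7 are calculated rather than recorded values. A minimal Python sketch of that calculation, using the costs and revenues from Table 3.3, might look like the following. The helper names `fmt` and `row` are illustrative and not part of the text, and showing negative values in parentheses simply mimics the style of the figure.

```python
# Monthly costs and revenues for Gossamer Industries (Table 3.3), in dollars.
costs = [48123, 56458, 64125, 52158, 54718, 50985]
revenues = [64124, 66128, 67125, 48178, 51785, 55687]

# Profit for each month is revenue minus cost.
profits = [r - c for r, c in zip(revenues, costs)]


def fmt(value):
    """Format with thousands separators; negative values go in parentheses."""
    text = f"{abs(value):,}"
    return f"({text})" if value < 0 else text


def row(label, values):
    """Build one right-aligned table row, with the row total appended."""
    cells = values + [sum(values)]
    return f"{label:<13}" + "".join(f"{fmt(v):>10}" for v in cells)


for label, values in [("Costs ($)", costs), ("Revenues ($)", revenues), ("Profits ($)", profits)]:
    print(row(label, values))
```

Right-aligning every cell to the same width keeps the final digits in a column, which is the alignment the text recommends for numerical values.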

In large tables, vertical lines or light shading can be useful to help the reader differentiate the columns and rows. Table 3.5 breaks out the revenue data by location for nine cities and shows 12 months of revenue and cost data. In Table 3.5, every other column has been lightly shaded. This helps the reader quickly scan the table to see which values correspond with each month. The horizontal line between the revenue for Academy and the Total row helps the reader differentiate the revenue data for each location and indicates that a calculation has taken place to generate the totals by month. If one wanted to highlight the differences among locations, the shading could be done for every other row instead of every other column.

Figure 3.6 Combined Line Chart and Table for Monthly Costs and Revenues at Gossamer Industries

[Line chart of Revenues ($) and Costs ($) by Month for months 1 through 6, with a table beneath the chart listing the same monthly revenues and costs and their totals.]

Table 3.4 Table Displaying Head Count, Costs, and Revenues at Gossamer Industries

                                      Month
                    1        2        3        4        5        6      Total
Head count          8        9       10        9        9        9
Costs ($)      48,123   56,458   64,125   52,158   54,718   50,985    326,567
Revenues ($)   64,124   66,128   67,125   48,178   51,785   55,687    353,027

Notice also the alignment of the text and numbers in Table 3.5. Columns of numerical values in a table should be right-aligned; that is, the final digit of each number should be aligned in the column. This makes it easy to see differences in the magnitude of values. If you are showing digits to the right of the decimal point, all values should include the same number of digits to the right of the decimal. Also, use only the number of digits that are necessary to convey the meaning in comparing the values; there is no need to include additional digits if they are not meaningful for comparisons. In many business applications, we report financial values, in which case we often round to the nearest dollar or include two digits to the right of the decimal if such precision is necessary. Additional digits to the right of the decimal are usually unnecessary. For extremely large numbers, we may prefer to display data rounded to the nearest thousand, ten thousand, or even million. For instance, if we need to include, say, $3,457,982 and $10,124,390 in a table when exact dollar values are not necessary, we could write these as 3,458 and 10,124 and indicate that all values in the table are in units of $1,000.
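The formatting guidelines above (right-aligned values, a consistent number of decimal digits, and large amounts expressed in units of $1,000) can be illustrated with a short Python sketch. The two large dollar amounts come from the text; the three smaller amounts and all function names are illustrative assumptions.

```python
def in_thousands(dollars):
    """Express a dollar amount in units of $1,000, rounded to a whole number.

    Note: Python's round() sends exact halves to the nearest even integer.
    """
    return f"{round(dollars / 1000):,}"


def two_decimals(amount):
    """Show every value with the same two digits after the decimal point."""
    return f"{amount:,.2f}"


def right_align(cells):
    """Pad on the left so the final characters of all cells line up."""
    width = max(len(cell) for cell in cells)
    return [cell.rjust(width) for cell in cells]


# The two amounts used in the text, shown in units of $1,000.
large = right_align([in_thousands(v) for v in (3457982, 10124390)])

# Illustrative amounts where two decimal digits are meaningful.
small = right_align([two_decimals(v) for v in (54.5, 94.9, 103.15)])

print("\n".join(large))
print("\n".join(small))
```

Whatever unit is chosen, the table should state it once (for example, in a column heading) rather than repeating it in every cell.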

It is generally best to left-align text values within a column in a table, as in the Revenues by Location (the first) column of Table 3.5. In some cases, you may prefer to center text, but you should do this only if the text values are all approximately the same length. Otherwise, aligning the first letter of each data entry promotes readability. Column headings should either match the alignment of the data in the columns or be centered over the values, as in Table 3.5.

Crosstabulation

A useful type of table for describing data of two variables is a crosstabulation, which provides a tabular summary of data for two variables. To illustrate, consider the following application based on data from Zagat’s Restaurant Review. Data on the quality rating, meal price, and the usual wait time for a table during peak hours were collected for a sample of 300 Los Angeles area restaurants. Table 3.6 shows the data for the first 10 restaurants.

We depart from these guidelines in some figures and tables in this textbook to more closely match Excel’s output.

Types of data such as categorical and quantitative are discussed in Chapter 2.

Figure 3.7 Comparing Different Table Designs

[Figure: Designs A, B, C, and D each present the table below with a different grid-line treatment; Design D uses the fewest grid lines.]

                                      Month
                    1        2        3        4        5        6      Total
Costs ($)      48,123   56,458   64,125   52,158   54,718   50,985    326,567
Revenues ($)   64,124   66,128   67,125   48,178   51,785   55,687    353,027
Profits ($)    16,001    9,670    3,000  (3,980)  (2,933)    4,702     26,460

Table 3.5 Larger Table Showing Revenues by Location for 12 Months of Data

                                                              Month
Revenues by Location ($)
                       1       2       3       4       5       6       7       8       9      10      11      12     Total
Temple             8,987   8,595   8,958   6,718   8,066   8,574   8,701   9,490   9,610   9,262   9,875  11,058   107,89…
Killeen            8,212   9,143   8,714   6,869   8,150   8,891   8,766   9,193   9,603  10,374  10,456  10,982   109,35…
Waco              11,603  12,063  11,173   9,622   8,912   9,553  11,943  12,947  12,925  14,050  14,300  13,877   142,967
Belton             7,671   7,617   7,896   6,899   7,877   6,621   7,765   7,720   7,824   7,938   7,943   7,047    90,81…
Granger            7,642   7,744   7,836   5,833   6,002   6,728   7,848   7,717   7,646   7,620   7,728   8,013    88,35…
Harker Heights     5,257   5,326   4,998   4,304   4,106   4,980   5,084   5,061   5,186   5,179   4,955   5,326    59,76…
Gatesville         5,316   5,245   5,056   3,317   3,852   4,026   5,135   5,132   5,052   5,271   5,304   5,154    57,85…
Lampasas           5,266   5,129   5,022   3,022   3,088   4,289   5,110   5,073   4,978   5,343   4,984   5,315    56,62…
Academy            4,170   5,266   7,472   1,594   1,732   2,025   8,772   1,956   3,304   3,090   3,579   2,487    45,44…
Total             64,124  66,128  67,125  48,178  51,785  55,687  69,125  64,288  66,128  68,128  69,125  69,258   759,079

Costs ($)         48,123  56,458  64,125  52,158  54,718  50,985  57,898  62,050  65,215  61,819  67,828  69,558   710,935


Quality ratings are an example of categorical data, and meal prices are an example of quantitative data.

For now, we will limit our consideration to the quality-rating and meal-price variables. A crosstabulation of the data for quality rating and meal price is shown in Table 3.7. The left and top margin labels define the classes for the two variables. In the left margin, the row labels (Good, Very Good, and Excellent) correspond to the three classes of the quality-rating variable. In the top margin, the column labels ($10–19, $20–29, $30–39, and $40–49) correspond to the four classes (or bins) of the meal-price variable. Each restaurant in the sample provides a quality rating and a meal price. Thus, each restaurant in the sample is associated with a cell appearing in one of the rows and one of the columns of the crosstabulation. For example, restaurant 5 is identified as having a very good quality rating and a meal price of $33. This restaurant belongs to the cell in row 2 and column 3. In constructing a crosstabulation, we simply count the number of restaurants that belong to each of the cells in the crosstabulation.
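The counting procedure just described can be sketched in Python. This is an illustration only: it uses the 10 restaurants listed in Table 3.6 rather than all 300, and the names `price_bin` and `counts` are hypothetical.

```python
from collections import Counter

# (quality rating, meal price in $) for the first 10 restaurants in Table 3.6.
restaurants = [
    ("Good", 18), ("Very Good", 22), ("Good", 28), ("Excellent", 38), ("Very Good", 33),
    ("Good", 28), ("Very Good", 19), ("Very Good", 11), ("Very Good", 23), ("Good", 13),
]

ratings = ["Good", "Very Good", "Excellent"]          # row classes
bins = [(10, 19), (20, 29), (30, 39), (40, 49)]       # column classes (meal-price bins)


def price_bin(price):
    """Return the label of the meal-price bin that contains this price."""
    for low, high in bins:
        if low <= price <= high:
            return f"${low}–{high}"
    raise ValueError(f"price {price} falls outside the defined bins")


# Each restaurant adds one to the cell for its (rating, price bin) pair.
counts = Counter((rating, price_bin(price)) for rating, price in restaurants)

labels = [f"${low}–{high}" for low, high in bins]
print(f"{'Quality Rating':<16}" + "".join(f"{label:>9}" for label in labels))
for rating in ratings:
    print(f"{rating:<16}" + "".join(f"{counts[(rating, label)]:>9}" for label in labels))
```

Restaurant 5 (Very Good, $33) lands in the Very Good row and the $30–39 column, matching the row 2, column 3 cell named in the text. A `Counter` returns 0 for cells with no restaurants, so empty cells print as zeros.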

Table 3.7 shows that the greatest number of restaurants in the sample (64) have a very good rating and a meal price in the $20–29 range. Only two restaurants have an excellent rating and a meal price in the $10–19 range. Similar interpretations of the other frequencies can be made. In addition, note that the right and bottom margins of the crosstabulation give the frequencies of quality rating and meal price separately. From the right margin, we see that data on quality ratings show 84 good restaurants, 150 very good restaurants, and 66 excellent restaurants. Similarly, the bottom margin shows the counts for the meal price variable. The value of 300 in the bottom-right corner of the table indicates that 300 restaurants were included in this data set.
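The right and bottom margins are simply row and column sums of the cell counts. A small Python sketch, using the cell frequencies from Table 3.7 and illustrative variable names, shows how the margins and the grand total follow from the body of the crosstabulation.

```python
# Cell counts from Table 3.7; columns are the $10–19, $20–29, $30–39, and $40–49 bins.
cells = {
    "Good":      [42, 40, 2, 0],
    "Very Good": [34, 64, 46, 6],
    "Excellent": [2, 14, 28, 22],
}

# Right margin: frequency of each quality rating (sum across a row).
row_totals = {rating: sum(counts) for rating, counts in cells.items()}

# Bottom margin: frequency of each meal-price bin (sum down a column).
column_totals = [sum(column) for column in zip(*cells.values())]

# Bottom-right corner: the number of restaurants in the sample.
grand_total = sum(row_totals.values())

print(row_totals)
print(column_totals)
print(grand_total)
```

Summing either margin gives the same grand total, which is a quick consistency check on any crosstabulation.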

Restaurant Quality Rating Meal Price ($) Wait Time (min)

1 Good 18 5

2 Very Good 22 6

3 Good 28 1

4 Excellent 38 74

5 Very Good 33 6

6 Good 28 5

7 Very Good 19 11

8 Very Good 11 9

9 Very Good 23 13

10 Good 13 1

Table 3.6 Quality Rating and Meal Price for 300 Los Angeles Restaurants

Table 3.7 Crosstabulation of Quality Rating and Meal Price for 300 Los Angeles Restaurants

                            Meal Price
Quality Rating   $10–19   $20–29   $30–39   $40–49   Total
Good                 42       40        2        0      84
Very Good            34       64       46        6     150
Excellent             2       14       28       22      66
Total                78      118       76       28     300
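The counting behind a crosstabulation can be sketched in a few lines of Python. The sketch is illustrative only: it uses just the 10 restaurants listed in Table 3.6 rather than the full sample of 300, and the helper name price_bin and the bin labels are assumptions made for this example, not part of Excel or of the Restaurant file.

```python
from collections import Counter

# First 10 restaurants from Table 3.6: (quality rating, meal price in $)
restaurants = [
    ("Good", 18), ("Very Good", 22), ("Good", 28), ("Excellent", 38),
    ("Very Good", 33), ("Good", 28), ("Very Good", 19), ("Very Good", 11),
    ("Very Good", 23), ("Good", 13),
]

def price_bin(price, start=10, width=10):
    """Return the label of the meal-price bin (hypothetical helper)."""
    low = start + ((price - start) // width) * width
    return f"${low}-{low + width - 1}"

# Count the restaurants that fall in each (row, column) cell
cells = Counter((rating, price_bin(price)) for rating, price in restaurants)

# Margins: frequencies of each variable taken separately
row_totals = Counter(rating for rating, _ in restaurants)
col_totals = Counter(price_bin(price) for _, price in restaurants)

bins = ("$10-19", "$20-29", "$30-39", "$40-49")
for rating in ("Good", "Very Good", "Excellent"):
    print(rating, [cells[(rating, b)] for b in bins], row_totals[rating])
print("Total", [col_totals[b] for b in bins], sum(cells.values()))
```

In this sketch, restaurant 5 (Very Good, $33) is counted in the Very Good row and the $30-39 column, which matches the cell described in the text.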


PivotTables in Excel

A crosstabulation in Microsoft Excel is known as a PivotTable. We will first look at a simple example of how Excel’s PivotTable is used to create a crosstabulation of the Zagat’s restaurant data shown previously. Figure 3.8 illustrates a portion of the data contained in the file Restaurant; the data for the 300 restaurants in the sample have been entered into cells B2:D301.

To create a PivotTable in Excel, we follow these steps:

Step 1. Click the Insert tab on the Ribbon
Step 2. Click PivotTable in the Tables group
Step 3. When the Create PivotTable dialog box appears:
    Choose Select a Table or Range
    Enter A1:D301 in the Table/Range: box
    Select New Worksheet as the location for the PivotTable Report
    Click OK

The resulting initial PivotTable Field List and PivotTable Report are shown in Figure 3.9. Each of the four columns in Figure 3.8 [Restaurant, Quality Rating, Meal Price ($), and Wait Time (min)] is considered a field by Excel. Fields may be chosen to represent rows, columns, or values in the body of the PivotTable Report. The following steps show how to use Excel’s PivotTable Field List to assign the Quality Rating field to the rows, the Meal Price ($) field to the columns, and the Restaurant field to the body of the PivotTable report.

Figure 3.8 Excel Worksheet Containing Restaurant Data (columns A through D: Restaurant, Quality Rating, Meal Price ($), Wait Time (min))

Step 4. In the PivotTable Fields task pane, go to Drag fields between areas below:
    Drag the Quality Rating field to the ROWS area
    Drag the Meal Price ($) field to the COLUMNS area
    Drag the Restaurant field to the VALUES area
Step 5. Click on Sum of Restaurant in the VALUES area
Step 6. Select Value Field Settings from the list of options
Step 7. When the Value Field Settings dialog box appears:
    Under Summarize value field by, select Count
    Click OK

Figure 3.10 shows the completed PivotTable Field List and a portion of the PivotTable worksheet as it now appears.

To complete the PivotTable, we need to group the columns representing meal prices and place the row labels for quality rating in the proper order:

Step 8. Right-click in cell B4 or any cell containing a meal price column label
Step 9. Select Group from the list of options
Step 10. When the Grouping dialog box appears:
    Enter 10 in the Starting at: box
    Enter 49 in the Ending at: box
    Enter 10 in the By: box
    Click OK
Step 11. Right-click on “Excellent” in cell A5
Step 12. Select Move and click Move “Excellent” to End

The final PivotTable, shown in Figure 3.11, provides the same information as the crosstabulation in Table 3.7.

The values in Figure 3.11 can be interpreted as the frequencies of the data. For instance, row 8 provides the frequency distribution for the data over the quantitative variable of meal price. Seventy-eight restaurants have meal prices of $10 to $19. Column F provides the frequency distribution for the data over the categorical variable of quality.

Figure 3.9 Initial PivotTable Field List and PivotTable Field Report for the Restaurant Data

Figure 3.10 Completed PivotTable Field List and a Portion of the PivotTable Report for the Restaurant Data (Columns H:AK Are Hidden)

Figure 3.11 Final PivotTable Report for the Restaurant Data

A total of 150 restaurants have a quality rating of Very Good. We can also use a PivotTable to create percent frequency distributions, as shown in the following steps:

Step 1. To invoke the PivotTable Fields task pane, select any cell in the pivot table
Step 2. In the PivotTable Fields task pane, click the Count of Restaurant in the VALUES area
Step 3. Select Value Field Settings… from the list of options
Step 4. When the Value Field Settings dialog box appears, click the tab for Show Values As
Step 5. In the Show values as area, select % of Grand Total from the drop-down menu
    Click OK

Figure 3.12 displays the percent frequency distribution for the Restaurant data as a PivotTable. The figure indicates that 50% of the restaurants are in the Very Good quality category and that 26% have meal prices between $10 and $19.
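The % of Grand Total calculation can be reproduced outside Excel as a check. The following is a minimal sketch that starts from the counts in Table 3.7 and divides each cell and each margin by the grand total; the variable names are chosen for this example only.

```python
# Counts from Table 3.7 (rows: quality rating; columns: meal-price bins)
bins = ["$10-19", "$20-29", "$30-39", "$40-49"]
counts = {
    "Good":      [42, 40, 2, 0],
    "Very Good": [34, 64, 46, 6],
    "Excellent": [2, 14, 28, 22],
}

grand_total = sum(sum(row) for row in counts.values())

# Each cell, row total, and column total as a percentage of the grand total
percent = {
    rating: [100 * c / grand_total for c in row]
    for rating, row in counts.items()
}
row_percent = {
    rating: 100 * sum(row) / grand_total for rating, row in counts.items()
}
col_percent = [
    100 * sum(row[j] for row in counts.values()) / grand_total
    for j in range(len(bins))
]

for rating, row in percent.items():
    print(rating, [f"{p:.2f}%" for p in row], f"{row_percent[rating]:.2f}%")
print("Grand Total", [f"{p:.2f}%" for p in col_percent])
```

The row margin for Very Good (150 of 300) and the column margin for $10-19 (78 of 300) give the 50% and 26% figures quoted in the text.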

PivotTables in Excel are interactive, and they may be used to display statistics other than a simple count of items. As an illustration, we can easily modify the PivotTable in Figure 3.11 to display summary information on wait times instead of meal prices.

Step 1. To invoke the PivotTable Fields task pane, select any cell in the pivot table
Step 2. In the PivotTable Fields task pane, click the Count of Restaurant field in the VALUES area
    Select Remove Field
Step 3. Drag the Wait Time (min) to the VALUES area
Step 4. Click on Sum of Wait Time (min) in the VALUES area
Step 5. Select Value Field Settings… from the list of options
Step 6. When the Value Field Settings dialog box appears:
    Under Summarize value field by, select Average
    Click Number Format
    In the Category: area, select Number
    Enter 1 for Decimal places:
    Click OK
    When the Value Field Settings dialog box reappears, click OK

Figure 3.12 Percent Frequency Distribution as a PivotTable for the Restaurant Data

Count of Restaurant   Column Labels
Row Labels            10–19     20–29     30–39     40–49     Grand Total
Good                  14.00%    13.33%     0.67%     0.00%     28.00%
Very Good             11.33%    21.33%    15.33%     2.00%     50.00%
Excellent              0.67%     4.67%     9.33%     7.33%     22.00%
Grand Total           26.00%    39.33%    25.33%     9.33%    100.00%

The completed PivotTable appears in Figure 3.13. This PivotTable replaces the counts of restaurants with values for the average wait time for a table at a restaurant for each grouping of meal prices ($10–19, $20–29, $30–39, and $40–49). For instance, cell B7 indicates that the average wait time for a table at an Excellent restaurant with a meal price of $10–19 is 25.5 minutes. Column F displays the overall average wait times for tables in each quality rating category. We see that Excellent restaurants have the longest average waits of 32.1 minutes and that Good restaurants have average wait times of only 2.5 minutes. Finally, cell D7 shows us that the longest wait times can be expected at Excellent restaurants with meal prices in the $30–39 range (34 minutes).

We can also examine only a portion of the data in a PivotTable using the Filter option in Excel. To Filter data in a PivotTable, click on the Filter Arrow next to Row Labels or Column Labels and then uncheck the values that you want to remove from the PivotTable. For example, we could click on the arrow next to Row Labels and then uncheck the Good value to examine only Very Good and Excellent restaurants.

Recommended PivotTables in Excel

Excel also has the ability to recommend PivotTables for your data set. To illustrate Recommended PivotTables in Excel, we return to the restaurant data in Figure 3.8. To create a Recommended PivotTable, follow the steps below using the file Restaurant.

Step 1. Select any cell in the table of data (for example, cell A1)
Step 2. Click the Insert tab on the Ribbon
Step 3. Click Recommended PivotTables in the Tables group
Step 4. When the Recommended PivotTables dialog box appears:
    Select the Count of Restaurant, Sum of Wait Time (min), Sum of Meal Price ($) by Quality Rating option (see Figure 3.14)
    Click OK

The steps above will create the PivotTable shown in Figure 3.15 on a new Worksheet. The Recommended PivotTables tool in Excel is useful for quickly creating commonly used PivotTables for a data set, but note that it may not give you the option to create the

You can also filter data in a PivotTable by dragging the field that you want to filter to the FILTERS area in the PivotTable Fields.

Hovering your pointer over the different options will display the full name of each option, as shown in Figure 3.14.

Figure 3.13 PivotTable Report for the Restaurant Data with Average Wait Times Added

Average of Wait Time (min)   Column Labels
Row Labels     10–19   20–29   30–39   40–49   Grand Total
Good             2.6     2.5     0.5                   2.5
Very Good       12.6    12.6    12.0    10.0          12.3
Excellent       25.5    29.1    34.0    32.3          32.1
Grand Total      7.6    11.1    19.8    27.5          13.9

Figure 3.14 Recommended PivotTables Dialog Box in Excel (selected option: Count of Restaurant, Sum of Wait Time (min), Sum of Meal Price ($) by Quality Rating)

Figure 3.15 Default PivotTable Created for Restaurant Data Using Excel’s Recommended PivotTables Tool

Row Labels    Count of Restaurant   Sum of Wait Time (min)   Sum of Meal Price ($)
Good                           84                      207                    1657
Very Good                     150                     1848                    3845
Excellent                      66                     2120                    2267
Grand Total                   300                     4175                    7769

exact PivotTable that will be of the most use for your data analysis. Displaying the sum of wait times and the sum of meal prices within each quality-rating category, as shown in Figure 3.15, is not particularly useful here; the average wait times and average meal prices within each quality-rating category would be more useful to us. But we can easily modify the PivotTable in Figure 3.15 to show the average values by selecting any cell in the PivotTable to invoke the PivotTable Fields task pane, clicking on Sum of Wait Time (min) and then Sum of Meal Price ($), and using the Value Field Settings… to change the Summarize value field by option to Average. The finished PivotTable is shown in Figure 3.16.
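Switching from Sum to Average amounts to dividing each sum by the count of restaurants in the same category. The sketch below, which uses the counts and sums reported in Figure 3.15, is one way to check the averages in Figure 3.16 by hand; the dictionary keys and the averages helper are names invented for this example.

```python
# Counts and sums by quality rating, as reported in Figure 3.15
summary = {
    "Good":      {"count": 84,  "wait_sum": 207,  "price_sum": 1657},
    "Very Good": {"count": 150, "wait_sum": 1848, "price_sum": 3845},
    "Excellent": {"count": 66,  "wait_sum": 2120, "price_sum": 2267},
}

def averages(row):
    """Average wait time (min) and average meal price ($) for one category."""
    return row["wait_sum"] / row["count"], row["price_sum"] / row["count"]

for rating, row in summary.items():
    avg_wait, avg_price = averages(row)
    print(f"{rating:<10} {avg_wait:5.1f} {avg_price:6.2f}")

# The grand total pools all restaurants; it is not the average of the averages
total = {
    key: sum(row[key] for row in summary.values())
    for key in ("count", "wait_sum", "price_sum")
}
overall_wait, overall_price = averages(total)
print(f"{'Overall':<10} {overall_wait:5.1f} {overall_price:6.2f}")
```

Note that the overall averages are computed from the pooled sums and counts, so categories with more restaurants carry more weight.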

3.3 Charts

Charts (or graphs) are visual methods for displaying data. In this section, we introduce some of the most commonly used charts to display and analyze data including scatter charts, line charts, and bar charts. Excel is the most commonly used software package for creating simple charts. We explain how to use Excel to create scatter charts, line charts, sparklines, bar charts, bubble charts, and heat maps.

Scatter Charts

A scatter chart is a graphical presentation of the relationship between two quantitative variables. As an illustration, consider the advertising/sales relationship for an electronics store in San Francisco. On 10 occasions during the past three months, the store used weekend television commercials to promote sales at its stores. The managers want to investigate whether a relationship exists between the number of commercials shown and sales at the store the following week. Sample data for the 10 weeks, with sales in hundreds of dollars, are shown in Table 3.8.

Figure 3.16 Completed PivotTable for Restaurant Data Using Excel’s Recommended PivotTables Tool

Row Labels    Count of Restaurant   Average of Wait Time (min)   Average of Meal Price ($)
Good                           84                          2.5                       19.73
Very Good                     150                         12.3                       25.63
Excellent                      66                         32.1                       34.35
Grand Total                   300                         13.9                       25.90

The appendix for this chapter available in the MindTap Reader demonstrates the use of the Excel Add-in Analytic Solver to create a scatter-chart matrix and a parallel-coordinates plot.

We will use the data from Table 3.8 to create a scatter chart using Excel’s chart tools and the data in the file Electronics:

Step 1. Select cells B2:C11
Step 2. Click the Insert tab in the Ribbon
Step 3. Click the Insert Scatter (X,Y) or Bubble Chart button in the Charts group
Step 4. When the list of scatter chart subtypes appears, click the Scatter button
Step 5. Click the Design tab under the Chart Tools Ribbon
Step 6. Click Add Chart Element in the Chart Layouts group
    Select Chart Title, and click Above Chart
    Click on the text box above the chart, and replace the text with Scatter Chart for the San Francisco Electronics Store
Step 7. Click Add Chart Element in the Chart Layouts group
    Select Axis Title, and click Primary Horizontal
    Click on the text box under the horizontal axis, and replace “Axis Title” with Number of Commercials
Step 8. Click Add Chart Element in the Chart Layouts group
    Select Axis Title, and click Primary Vertical
    Click on the text box next to the vertical axis, and replace “Axis Title” with Sales ($100s)
Step 9. Right-click on one of the horizontal grid lines in the body of the chart, and click Delete
Step 10. Right-click on one of the vertical grid lines in the body of the chart, and click Delete

We can also use Excel to add a trendline to the scatter chart. A trendline is a line that provides an approximation of the relationship between the variables. To add a linear trendline using Excel, we use the following steps:

Step 1. Right-click on one of the data points in the scatter chart, and select Add Trendline…

Step 2. When the Format Trendline task pane appears, select Linear under Trendline Options

Figure 3.17 shows the scatter chart and linear trendline created with Excel for the data in Table 3.8. The number of commercials (x) is shown on the horizontal axis, and sales (y)

Hovering the pointer over the chart type buttons in Excel will display the names of the buttons and short descriptions of the types of chart.

Steps 9 and 10 are optional, but they improve the chart’s readability. We would want to retain the gridlines only if they helped the reader to determine more precisely where data points are located relative to certain values on the horizontal and/or vertical axes.

Table 3.8 Sample Data for the San Francisco Electronics Store

Week   No. of Commercials (x)   Sales ($100s) (y)
1      2                        50
2      5                        57
3      1                        41
4      3                        54
5      4                        54
6      1                        38
7      5                        63
8      3                        48
9      4                        59
10     2                        46


are shown on the vertical axis. For week 1, x = 2 and y = 50. A point is plotted on the scatter chart at those coordinates; similar points are plotted for the other nine weeks. Note that during two of the weeks, one commercial was shown, during two of the weeks, two commercials were shown, and so on.

The completed scatter chart in Figure 3.17 indicates a positive linear relationship (or positive correlation) between the number of commercials and sales: Higher sales are associated with a higher number of commercials. The linear relationship is not perfect because not all of the points are on a straight line. However, the general pattern of the points and the trendline suggest that the overall relationship is positive. This implies that the covariance between sales and commercials is positive and that the correlation coefficient between these two variables is between 0 and +1.
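These claims about the covariance, the correlation coefficient, and the trendline can be checked numerically with the data in Table 3.8. The following is a minimal sketch using only the Python standard library; it assumes that the linear trendline is the ordinary least-squares line, and all variable names are chosen for this example.

```python
from math import sqrt

# Table 3.8: number of commercials (x) and sales in $100s (y) for 10 weeks
x = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
y = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of cross-products and squared deviations from the means
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

covariance = s_xy / (n - 1)              # sample covariance
correlation = s_xy / sqrt(s_xx * s_yy)   # sample correlation coefficient

# Least-squares line: y = intercept + slope * x
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

print(f"covariance  = {covariance:.2f}")
print(f"correlation = {correlation:.3f}")
print(f"trendline   : y = {intercept:.2f} + {slope:.2f}x")
```

Under these assumptions the covariance is positive and the correlation lies strictly between 0 and +1, consistent with the pattern visible in the scatter chart.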

The Chart Buttons in Excel allow users to quickly modify and format charts. Three buttons appear next to a chart whenever you click on a chart to make it active. Clicking on the Chart Elements button brings up a list of check boxes to quickly add and remove axes, axis titles, chart titles, data labels, trendlines, and more. Clicking on the Chart Styles button allows the user to quickly choose from many preformatted styles to change the look of the chart. Clicking on the Chart Filter button allows the user to select the data to be included in the chart. The Chart Filter button is very useful for performing additional data analysis.

Recommended Charts in Excel

Similar to the ability to recommend PivotTables, Excel has the ability to recommend charts for a given data set. The steps below demonstrate the Recommended Charts tool in Excel for the Electronics data.

Step 1. Select cells B2:C11
Step 2. Click the Insert tab in the Ribbon
Step 3. Click the Recommended Charts button in the Charts group
Step 4. When the Insert Chart dialog box appears, select the Scatter option (see Figure 3.18)
    Click OK

Scatter charts are often referred to as scatter plots or scatter diagrams.

Chapter 2 introduces scatter charts and relates them to the concepts of covariance and correlation.

Figure 3.17 Scatter Chart for the San Francisco Electronics Store (horizontal axis: No. of Commercials; vertical axis: Sales ($100s))

These steps create the basic scatter chart that can then be formatted (using the Chart Buttons or Chart Tools Ribbon) to create the completed scatter chart shown in Figure 3.17. Note that the Recommended Charts tool gives several possible recommendations for the electronics data in Figure 3.18. These recommendations include scatter charts, line charts, and bar charts, which will be covered later in this chapter. Excel’s Recommended Charts tool generally does a good job of interpreting your data and providing recommended charts, but take care to ensure that the selected chart is meaningful and follows good design practice.

Line Charts

Line charts are similar to scatter charts, but a line connects the points in the chart. Line charts are very useful for time series data collected over a period of time (minutes, hours, days, years, etc.). As an example, Kirkland Industries sells air compressors to manufacturing companies. Table 3.9 contains total sales amounts (in $100s) for air compressors during

A line chart for time series data is often called a time series plot.

Figure 3.18 Insert Chart Dialog Box from Recommended Charts Tool in Excel (Scatter option selected)

each month in the most recent calendar year. Figure 3.19 displays a scatter chart and a line chart created in Excel for these sales data. The line chart connects the points of the scatter chart. The addition of lines between the points suggests continuity, and it is easier for the reader to interpret changes over time.

To create the line chart in Figure 3.19 in Excel, we follow these steps:

Step 1. Select cells A2:B13
Step 2. Click the Insert tab on the Ribbon
Step 3. Click the Insert Line Chart button in the Charts group
Step 4. When the list of line chart subtypes appears, click the Line with Markers button under 2-D Line
    This creates a line chart for sales with a basic layout and minimum formatting
Step 5. Select the line chart that was just created to reveal the Chart Buttons

Because the gridlines do not add any meaningful information here, we do not select the check box for Gridlines in Chart Elements; leaving the gridlines out increases the data-ink ratio.

In the line chart in Figure 3.19, we have kept the markers at each data point. This is a matter of personal taste, but removing the markers tends to suggest that the data are continuous when in fact we have only one data point per month.

Table 3.9 Monthly Sales Data of Air Compressors at Kirkland Industries

Month   Sales ($100s)
Jan     150
Feb     145
Mar     185
Apr     195
May     170
Jun     125
Jul     210
Aug     175
Sep     160
Oct     120
Nov     115
Dec     120

Figure 3.19 Scatter Chart and Line Chart for Monthly Sales Data at Kirkland Industries (two panels: Scatter Chart for Monthly Sales Data and Line Chart for Monthly Sales Data; horizontal axis: Month; vertical axis: Sales ($100s))

Step 6. Click the Chart Elements button
    Select the check boxes for Axes, Axis Titles, and Chart Title. Deselect the check box for Gridlines.
    Click on the text box next to the vertical axis, and replace “Axis Title” with Sales ($100s)
    Click on the text box next to the horizontal axis and replace “Axis Title” with Month
    Click on the text box above the chart, and replace “Sales ($100s)” with Line Chart for Monthly Sales Data

Figure 3.20 shows the line chart created in Excel along with the selected options for the Chart Elements button.

Line charts can also be used to graph multiple lines. Suppose we want to break out Kirkland’s sales data by region (North and South), as shown in Table 3.10. We can create a line chart in Excel that shows sales in both regions, as in Figure 3.21, by following similar steps but selecting cells A2:C14 in the file KirklandRegional before creating the line chart. Figure 3.21 shows an interesting pattern. Sales in both the North and the South regions seemed to follow the same increasing/decreasing pattern until October. Starting in October, sales in the North continued to decrease while sales in the South increased. We would probably want to investigate any changes that occurred in the North region around October.
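The pattern read from the line chart can also be checked directly from the data in Table 3.10. The sketch below compares the direction of the month-to-month change in each region and lists the months in which the two regions moved differently; the direction helper is a name invented for this example.

```python
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
north = [95, 100, 120, 115, 100, 85, 135, 110, 100, 50, 40, 40]
south = [40, 45, 55, 65, 60, 50, 75, 65, 60, 70, 75, 80]

def direction(change):
    """Return +1 for an increase, -1 for a decrease, 0 for no change."""
    return (change > 0) - (change < 0)

# Months in which the two regions moved in different directions
diverging = [
    months[i]
    for i in range(1, len(months))
    if direction(north[i] - north[i - 1]) != direction(south[i] - south[i - 1])
]
print(diverging)
```

On these data the check flags April as well as October through December, so the two regions track each other closely rather than exactly before October; the sustained divergence begins in October.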

A special type of line chart is a sparkline, which is a minimalist type of line chart that can be placed directly into a cell in Excel. Sparklines contain no axes; they display only the line for the data. Sparklines take up very little space, and they can be effectively used to provide information on overall trends for time series data. Figure 3.22 illustrates the use of sparklines in Excel for the regional sales data. To create a sparkline in Excel:

Step 1. Click the Insert tab on the Ribbon
Step 2. Click Line in the Sparklines group

Figure 3.20 Line Chart and Excel’s Chart Elements Button Options for Monthly Sales Data at Kirkland Industries (horizontal axis: Month; vertical axis: Sales ($100s))

Step 3. When the Create Sparklines dialog box opens:
    Enter B3:B14 in the Data Range: box
    Enter B15 in the Location Range: box
    Click OK
Step 4. Copy cell B15 to cell C15

The sparklines in cells B15 and C15 do not indicate the magnitude of sales in the North and the South regions, but they do show the overall trend for these data. Sales in the North appear to be decreasing and sales in the South increasing overall. Because sparklines are input directly into the cell in Excel, we can also type text directly into the same cell that will then be overlaid on the sparkline, or we can add shading to the cell, which will appear as the background. In Figure 3.22, we have shaded cells B15 and C15 to highlight the sparklines. As can be seen, sparklines provide an efficient and simple way to display basic information about a time series.
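The idea of a sparkline can be imitated in plain text as a rough illustration. The sketch below maps each monthly value in Table 3.10 to one of eight Unicode block characters; it is not how Excel draws sparklines, and the sparkline function and BLOCKS constant are names invented for this example. Each series is scaled to its own minimum and maximum, so, like the sparklines described above, the output shows the shape of the series but not its magnitude.

```python
BLOCKS = "▁▂▃▄▅▆▇█"  # eight bar heights, lowest to highest

def sparkline(values):
    """Map each value to a block character scaled between the series min and max."""
    low, high = min(values), max(values)
    if high == low:  # flat series: avoid division by zero
        return BLOCKS[0] * len(values)
    scale = (len(BLOCKS) - 1) / (high - low)
    return "".join(BLOCKS[round((v - low) * scale)] for v in values)

# Monthly sales ($100s) from Table 3.10, January through December
north = [95, 100, 120, 115, 100, 85, 135, 110, 100, 50, 40, 40]
south = [40, 45, 55, 65, 60, 50, 75, 65, 60, 70, 75, 80]

print("North", sparkline(north))
print("South", sparkline(south))
```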

In the line chart in Figure 3.21, we have replaced Excel’s default legend with text boxes labeling the lines corresponding to sales in the North and the South. This can often make the chart look cleaner and easier to interpret.

Figure 3.21 Line Chart of Regional Sales Data at Kirkland Industries (lines labeled North and South; horizontal axis: Month; vertical axis: Sales ($100s))

Table 3.10 Regional Sales Data by Month for Air Compressors at Kirkland Industries

          Sales ($100s)
Month    North    South
Jan         95       40
Feb        100       45
Mar        120       55
Apr        115       65
May        100       60
Jun         85       50
Jul        135       75
Aug        110       65
Sep        100       60
Oct         50       70
Nov         40       75
Dec         40       80


Bar Charts and Column Charts

Bar charts and column charts provide a graphical summary of categorical data. Bar charts use horizontal bars to display the magnitude of the quantitative variable. Column charts use vertical bars to display the magnitude of the quantitative variable. Bar and column charts are very helpful in making comparisons between categorical variables. Consider a regional supervisor who wants to examine the number of accounts being handled by each manager. Figure 3.23 shows a bar chart created in Excel displaying these data. To create this bar chart in Excel:

Step 1. Select cells A2:B9
Step 2. Click the Insert tab on the Ribbon
Step 3. Click the Insert Column or Bar Chart button in the Charts group
Step 4. When the list of bar chart subtypes appears:
    Click the Clustered Bar button in the 2-D Bar section
Step 5. Select the bar chart that was just created to reveal the Chart Buttons
Step 6. Click the Chart Elements button
    Select the check boxes for Axes, Axis Titles, and Chart Title. Deselect the check box for Gridlines.
    Click on the text box under the horizontal axis, and replace “Axis Title” with Accounts Managed
    Click on the text box next to the vertical axis, and replace “Axis Title” with Manager
    Click on the text box above the chart, and replace “Chart Title” with Bar Chart of Accounts Managed

From Figure 3.23 we can see that Gentry manages the greatest number of accounts and Williams the fewest. We can make this bar chart even easier to read by ordering the results by the number of accounts managed. We can do this with the following steps:

Step 1. Select cells A1:B9
Step 2. Right-click any of the cells A1:B9
    Select Sort
    Click Custom Sort

In versions of Excel prior to Excel 2016, Insert Bar Chart and Insert Column Chart each have separate buttons in the Charts group, but these are combined under the Insert Column or Bar Chart button in Excel 2016.

Figure 3.22 Sparklines for the Regional Sales Data at Kirkland Industries (worksheet with the North and South sales in columns B and C and a sparkline for each region in cells B15 and C15)

Step 3. When the Sort dialog box appears:
    Make sure that the check box for My data has headers is checked
    Select Accounts Managed in the Sort by box under Column
    Select Smallest to Largest under Order
    Click OK

In the completed bar chart in Excel, shown in Figure 3.24, we can easily compare the relative number of accounts managed for all managers. However, note that it is difficult to interpret from the bar chart exactly how many accounts are assigned to each manager. If this information is necessary, these data are better presented as a table or by adding data labels to the bar chart, as in Figure 3.25, which is created in Excel using the following steps:

Step 1. Select the chart to reveal the Chart Buttons
Step 2. Click the Chart Elements button
    Select the check box for Data Labels

This adds labels of the number of accounts managed to the end of each bar so that the reader can easily look up exact values displayed in the bar chart.
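The effect of sorting and of adding data labels can be illustrated with a text-only sketch. The code below uses the accounts-managed data from Figure 3.23, sorts the managers from smallest to largest, and prints one bar per manager with the exact value at the end of the bar; it is only an analogy for the Excel chart, and the variable names are chosen for this example.

```python
# Accounts managed by each manager (data shown in Figure 3.23)
accounts = {
    "Davis": 24, "Edwards": 11, "Francois": 28, "Gentry": 37,
    "Jones": 15, "Lopez": 29, "Smith": 21, "Williams": 6,
}

# Sort managers from smallest to largest number of accounts
ordered = sorted(accounts.items(), key=lambda item: item[1])

# Draw one text bar per manager, with the exact value as a data label
for manager, n_accounts in ordered:
    print(f"{manager:<9}{'#' * n_accounts} {n_accounts}")
```

With the bars in order, close values such as Francois (28) and Lopez (29) sit next to each other, and the labels remove any need to estimate bar lengths.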

A Note on Pie Charts and Three-Dimensional Charts

Pie charts are another common form of chart used to compare categorical data. However, many experts argue that pie charts are inferior to bar charts for comparing data. The pie chart in Figure 3.26 displays the data for the number of accounts managed in Figure 3.23. Visually, it is still relatively easy to see that Gentry has the greatest number of accounts and that Williams has the fewest. However, it is difficult to say whether Lopez or Francois has more accounts. Research has shown that people find it very difficult to perceive differences in area. Compare Figure 3.26 to Figure 3.24. Making visual comparisons is much easier in the bar chart than in the pie chart (particularly when using a limited number of colors for differentiation). Therefore, we recommend against using pie charts in most situations and suggest instead using bar charts for comparing categorical data.

Alternatively, you can add Data Labels by right-clicking on a bar in the chart and selecting Add Data Labels.

Figure 3.23 Bar Chart for Accounts Managed Data (vertical axis: Manager; horizontal axis: Accounts Managed)

Manager    Accounts Managed
Davis      24
Edwards    11
Francois   28
Gentry     37
Jones      15
Lopez      29
Smith      21
Williams    6

Figure 3.24 Sorted Bar Chart for Accounts Managed Data (managers ordered from fewest to most accounts: Williams 6, Edwards 11, Jones 15, Smith 21, Davis 24, Francois 28, Lopez 29, Gentry 37)

Figure 3.25 Bar Chart with Data Labels for Accounts Managed Data

Because of the difficulty in visually comparing area, many experts also recommend against the use of three-dimensional (3-D) charts in most settings. Excel makes it very easy to create 3-D bar, line, pie, and other types of charts. In most cases, however, the 3-D effect simply adds unnecessary detail that does not help explain the data. As an alternative, consider the use of multiple lines on a line chart (instead of adding a z-axis), employing multiple charts, or creating bubble charts in which the size of the bubble can represent the z-axis value. Never use a 3-D chart when a two-dimensional chart will suffice.

Bubble Charts

A bubble chart is a graphical means of visualizing three variables in a two-dimensional graph and is therefore sometimes a preferred alternative to a 3-D graph. Suppose that we want to compare the number of billionaires in various countries. Table 3.11 provides a sample of six countries, showing, for each country, the number of billionaires per 10 million residents, the per capita income, and the total number of billionaires. We can create a bubble chart using Excel to further examine these data:

Step 1. Select cells B2:D7
Step 2. Click the Insert tab on the Ribbon
Step 3. In the Charts group, click Insert Scatter (X,Y) or Bubble Chart
    In the Bubble subgroup, click Bubble
Step 4. Select the chart that was just created to reveal the Chart Buttons
Step 5. Click the Chart Elements button
    Select the check boxes for Axes, Axis Titles, Chart Title and Data Labels. Deselect the check box for Gridlines.
    Click on the text box under the horizontal axis, and replace “Axis Title” with Billionaires per 10 Million Residents
    Click on the text box next to the vertical axis, and replace “Axis Title” with Per Capita Income
    Click on the text box above the chart, and replace “Chart Title” with Billionaires by Country
Step 6. Double-click on one of the Data Labels in the chart (e.g., the “$54,600” next to the largest bubble in the chart) to reveal the Format Data Labels task pane
Step 7. In the Format Data Labels task pane, click the Label Options icon and open the Label Options area
    Under Label Contains, select Value from Cells and click the Select Range… button
    When the Data Label Range dialog box opens, select cells A2:A8 in the Worksheet
    Click OK
Step 8. In the Format Data Labels task pane, deselect Y Value under Label Contains, and select Right under Label Position

FIGURE 3.26 Pie Chart of Accounts Managed

[Figure: pie chart with one slice per manager: Davis, Edwards, Francois, Gentry, Jones, Lopez, Smith, Williams.]

TABLE 3.11 Sample Data on Billionaires per Country

Country         Billionaires per 10M Residents   Per Capita Income   No. of Billionaires
United States   54.7                             $54,600             1,764
China           1.5                              $12,880             213
Germany         12.5                             $45,888             103
India           0.7                              $5,855              90
Russia          6.2                              $24,850             88
Mexico          1.2                              $17,881             15

Billionaires

The completed bubble chart appears in Figure 3.27. The size of each bubble in Figure 3.27 is proportionate to the number of billionaires in that country. The per capita income and billionaires per 10 million residents are displayed on the vertical and horizontal axes, respectively. This chart shows us that the United States has the most billionaires and the highest number of billionaires per 10 million residents. We can also see that China has quite a few billionaires but with much lower per capita income and far fewer billionaires per 10 million residents (because of China’s much larger population). Germany, Russia, and India all appear to have similar numbers of billionaires, but the per capita income and billionaires per 10 million residents are very different for each country. Bubble charts can be very effective for comparing categorical variables on two different quantitative values.
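Outside Excel, the same idea can be sketched in a few lines of Python. The sketch below is illustrative only. It uses the Table 3.11 data and scales each bubble's area (rather than its radius) to the number of billionaires. The drawing step is kept in a separate function that assumes the third-party matplotlib library is installed. The constant MAX_AREA and all function names are our own choices and are not part of Excel or of this text.

```python
import math

# Data from Table 3.11: (country, billionaires per 10M residents,
# per capita income in dollars, number of billionaires).
BILLIONAIRES = [
    ("United States", 54.7, 54600, 1764),
    ("China", 1.5, 12880, 213),
    ("Germany", 12.5, 45888, 103),
    ("India", 0.7, 5855, 90),
    ("Russia", 6.2, 24850, 88),
    ("Mexico", 1.2, 17881, 15),
]

MAX_AREA = 3000.0  # assumed area (in points squared) of the largest bubble


def bubble_areas(rows, max_area=MAX_AREA):
    """Scale each bubble's AREA (not its radius) to the third variable."""
    largest = max(count for _, _, _, count in rows)
    return {name: max_area * count / largest for name, _, _, count in rows}


def bubble_radius(area):
    """Radius of a circle with the given area."""
    return math.sqrt(area / math.pi)


def plot_bubbles(rows):
    """Draw the chart; matplotlib is imported only when this is called."""
    import matplotlib.pyplot as plt

    areas = bubble_areas(rows)
    fig, ax = plt.subplots()
    ax.scatter(
        [row[1] for row in rows],            # x: billionaires per 10M residents
        [row[2] for row in rows],            # y: per capita income
        s=[areas[row[0]] for row in rows],   # marker size is an area
        alpha=0.5,
    )
    for name, x, y, _ in rows:
        ax.annotate(name, (x, y))            # label each bubble with its country
    ax.set_xlabel("Billionaires per 10 Million Residents")
    ax.set_ylabel("Per Capita Income ($)")
    ax.set_title("Billionaires by Country")
    return fig
```

Calling plot_bubbles(BILLIONAIRES) returns a figure that can be displayed or saved. Scaling the area rather than the radius is a deliberate choice here: if the radius were doubled for a country with twice as many billionaires, its bubble would cover four times the area and exaggerate the difference.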

Heat Maps

A heat map is a two-dimensional graphical representation of data that uses different shades of color to indicate magnitude. Figure 3.28 shows a heat map indicating the magnitude of changes for a metric called same-store sales, which is commonly used in the retail industry to measure trends in sales. The cells shaded red in Figure 3.28 indicate declining same-store sales for the month, and cells shaded blue indicate increasing same-store sales for the month. Column N in Figure 3.28 also contains sparklines for the same-store sales data.

Figure 3.28 can be created in Excel by following these steps:

Step 1. Select cells B2:M17
Step 2. Click the Home tab on the Ribbon
Step 3. Click Conditional Formatting in the Styles group
    Select Color Scales and click on Blue–White–Red Color Scale

To add the sparklines in column N, we use the following steps:

Step 4. Select cell N2
Step 5. Click the Insert tab on the Ribbon
Step 6. Click Line in the Sparklines group
Step 7. When the Create Sparklines dialog box opens:
    Enter B2:M2 in the Data Range: box
    Enter N2 in the Location Range: box and click OK
Step 8. Copy cell N2 to N3:N17
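A three-color scale such as the one applied in Step 3 can be thought of as an interpolation between three anchor colors. The following Python sketch is a simplified illustration and is not Excel's actual algorithm. It assumes that the white midpoint sits at a change of 0% and that the scale runs between hypothetical limits of -15% and +20%. The function names and limits are ours.

```python
RED = (255, 0, 0)
WHITE = (255, 255, 255)
BLUE = (0, 0, 255)


def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def diverging_color(value, low, high, mid=0.0):
    """Map a value to (R, G, B): low -> red, mid -> white, high -> blue.

    Assumes low < mid < high. Values outside [low, high] are clamped.
    """
    value = max(low, min(high, value))
    if value < mid:
        t = (value - low) / (mid - low)    # 0 at low, 1 at mid
        start, end = RED, WHITE
    else:
        t = (value - mid) / (high - mid)   # 0 at mid, 1 at high
        start, end = WHITE, BLUE
    return tuple(round(lerp(s, e, t)) for s, e in zip(start, end))


# Hypothetical limits for monthly same-store sales changes, in percent.
LOW, HIGH = -15, 20
decline = diverging_color(-15, LOW, HIGH)   # strongest decline
flat = diverging_color(0, LOW, HIGH)        # no change
growth = diverging_color(20, LOW, HIGH)     # strongest increase
```

Because the two halves of the scale are interpolated separately, a small decline and a small increase both appear nearly white, while only large changes receive saturated red or blue.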

SameStoreSales


FIGURE 3.27 Bubble Chart Comparing Billionaires by Country

[Figure: Excel worksheet containing the Table 3.11 data (Country, Billionaires per 10M Residents, Per Capita Income, No. of Billionaires) and a bubble chart titled “Billionaires by Country.” Horizontal axis: Billionaires per 10 Million Residents; vertical axis: Per Capita Income. The bubbles are labeled China, Germany, United States, India, Russia, and Mexico.]

The heat map in Figure 3.28 helps the reader to easily identify trends and patterns. We can see that Austin has had positive increases throughout the year, while Pittsburgh has had consistently negative same-store sales results. Same-store sales at Cincinnati started the year negative but then became increasingly positive after May. In addition, we can differentiate between strong positive increases in Austin and less substantial positive increases in Chicago by means of color shadings. A sales manager could use the heat map in Figure 3.28 to identify stores that may require intervention and stores that may be used as models. Heat maps can be used effectively to convey data over different areas, across time, or both, as seen here.

Because heat maps depend strongly on the use of color to convey information, one must be careful to make sure that the colors can be easily differentiated and that they do not become overwhelming. To avoid problems with interpreting differences in color, we can add sparklines as shown in column N of Figure 3.28. The sparklines clearly show the overall trend (increasing or decreasing) for each location. However, we cannot gauge differences in the magnitudes of increases and decreases among locations using sparklines. The combination of a heat map and sparklines here is a particularly effective way to show both trend and magnitude.

Both the heat map and the sparklines described here can also be created using the Quick Analysis button. To display this button, select cells B2:M17. The Quick Analysis button will appear at the bottom right of the selected cells. Click the button to display options for heat maps, sparklines, and other data-analysis tools.
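The reason sparklines show trend but not magnitude can be seen in a small Python sketch that draws a text sparkline. This is an illustration under our own assumptions, not a description of how Excel renders sparklines: each row is scaled to its own minimum and maximum, and the two rows of values are hypothetical.

```python
BLOCKS = "▁▂▃▄▅▆▇█"  # eight bar heights, lowest to highest


def sparkline(values):
    """Return one block character per value, scaled to the row's own range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return BLOCKS[0] * len(values)   # a flat row has no trend to show
    steps = len(BLOCKS) - 1
    return "".join(
        BLOCKS[round((v - lo) / (hi - lo) * steps)] for v in values
    )


# Two hypothetical locations with the same shape but different magnitudes.
small_changes = [1, 2, 4, 8]
large_changes = [10, 20, 40, 80]

same_picture = sparkline(small_changes) == sparkline(large_changes)
```

Both rows produce the same sparkline even though the second row's changes are ten times larger, which is why the color shading of the heat map is still needed to compare magnitudes across locations.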

Additional Charts for Multiple Variables

Figure 3.29 provides an alternative display for the regional sales data of air compressors for Kirkland Industries. The figure uses a stacked-column chart to display the North and the South regional sales data previously shown in a line chart in Figure 3.21. We could also use a stacked-bar chart to display the same data by using horizontal bars instead of vertical. To create the stacked-column chart shown in Figure 3.29, we use the following steps:

Step 1. Select cells A2:C14
Step 2. Click the Insert tab on the Ribbon
Step 3. In the Charts group, click the Insert Column or Bar Chart button
    Select Stacked Column under 2-D Column

FIGURE 3.28 Heat Map and Sparklines for Same-Store Sales Data

[Figure: Excel worksheet heat map of monthly same-store sales changes (in percent), with one row per location (St. Louis, Phoenix, Albany, Austin, Cincinnati, San Francisco, Seattle, Chicago, Atlanta, Miami, Minneapolis, Denver, Salt Lake City, Raleigh, Boston, Pittsburgh), one column per month (JAN through DEC), and a SPARKLINES column in column N.]

FIGURE 3.29 Stacked-Column Chart for Regional Sales Data for Kirkland Industries

[Figure: Excel worksheet with columns Month, North, and South giving Sales ($100s) for Jan through Dec (North: 95, 100, 120, 115, 100, 85, 135, 110, 100, 50, 40, 40; South: 40, 45, 55, 65, 60, 50, 75, 65, 60, 70, 75, 80) beside a stacked-column chart. Vertical axis: Sales ($100s); horizontal axis: month; legend: South, North.]

KirklandRegional

Stacked-column and stacked-bar charts allow the reader to compare the relative values of quantitative variables for the same category in a bar chart. However, these charts suffer from the same difficulties as pie charts because the human eye has difficulty perceiving small differences in areas. As a result, experts often recommend against the use of stacked-column and stacked-bar charts for more than a couple of quantitative variables in each category. An alternative chart for these same data is called a clustered-column (or clustered-bar) chart. It is created in Excel following the same steps but selecting Clustered Column under the 2-D Column in Step 3. Clustered-column and clustered-bar charts are often superior to stacked-column and stacked-bar charts for comparing quantitative variables, but they can become cluttered for more than a few quantitative variables per category.

An alternative that is often preferred to both stacked and clustered charts, particularly when many quantitative variables need to be displayed, is to use multiple charts. For the regional sales data, we would include two column charts: one for sales in the North and one for sales in the South. For additional regions, we would simply add additional column charts. To facilitate comparisons between the data displayed in each chart, it is important to maintain consistent axes from one chart to another. The categorical variables should be listed in the same order in each chart, and the axis for the quantitative variable should have the same range. For instance, the vertical axis for both North and South sales starts at 0 and ends at 140. This makes it easy to see that, in most months, the North region has greater sales. Figure 3.30 compares the approaches using stacked-, clustered-, and multiple-column charts for the regional sales data.

Figure 3.30 shows that the multiple-column charts require considerably more space than the stacked- and clustered-column charts. However, when comparing many quantitative variables, using multiple charts can often be superior even if each chart must be made smaller. Stacked-column and stacked-bar charts should be used only when comparing a few quantitative variables and when there are large differences in the relative values of the quantitative variables within the category.
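The advice to keep consistent axes across multiple charts can be made concrete with a short Python sketch. It is illustrative only. The monthly values are those that appear in the worksheet of Figure 3.29, and their assignment to the North and South regions is our reading of that figure. The 20-unit tick step and the function names are assumptions of ours.

```python
import math

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Sales in $100s, as read from the worksheet in Figure 3.29 (assumed assignment).
NORTH = [95, 100, 120, 115, 100, 85, 135, 110, 100, 50, 40, 40]
SOUTH = [40, 45, 55, 65, 60, 50, 75, 65, 60, 70, 75, 80]


def shared_axis_max(series_list, step=20):
    """Smallest multiple of `step` that is at least as large as every value."""
    peak = max(max(series) for series in series_list)
    return math.ceil(peak / step) * step


def months_where_first_is_larger(first, second):
    """Months in which the first region outsells the second."""
    return [m for m, a, b in zip(MONTHS, first, second) if a > b]


# One vertical-axis range shared by the separate North and South charts.
axis_max = shared_axis_max([NORTH, SOUTH])

# A stacked chart plots the monthly totals, so it needs a taller axis.
stacked_totals = [n + s for n, s in zip(NORTH, SOUTH)]
stacked_max = shared_axis_max([stacked_totals])

north_leads = months_where_first_is_larger(NORTH, SOUTH)
```

With these values, axis_max is 140, the upper limit mentioned above for both the North and South charts, and north_leads contains the nine months from Jan through Sep. Computing the range once from all series, rather than letting each chart choose its own, is what keeps the bar heights comparable from one chart to the next.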

An especially useful chart for displaying multiple variables is the scatter-chart matrix. Table 3.12 contains a partial listing of the data for each of New York City’s 55 subboroughs (a designation of a community within New York City) on monthly median rent, percentage of college graduates, poverty rate, and mean travel time to work. Suppose we want to examine the relationship between these different variables. Figure 3.31 displays a scatter-chart matrix (scatter-plot matrix) for data related to rentals in New York City.

A scatter-chart matrix allows the reader to easily see the relationships among multiple variables. Each scatter chart in the matrix is created in the same manner as for creating a single scatter chart. Each column and row in the scatter-chart matrix corresponds to one variable. For instance, row 1 and column 1 in Figure 3.31 correspond to the median monthly rent variable. Row 2 and column 2 correspond to the percentage of college graduates variable. Therefore, the scatter chart shown in row 1, column 2 shows the relationship between median monthly rent (on the y-axis) and the percentage of college graduates (on the x-axis) in New York City subboroughs. The scatter chart shown in row 2, column 3 shows the relationship between the percentage of college graduates (on the y-axis) and poverty rate (on the x-axis).

Figure 3.31 allows us to infer several interesting findings. Because the points in the scatter chart in row 1, column 2 generally get higher moving from left to right, we can see that subboroughs with higher percentages of college graduates appear to have higher median monthly rents. The scatter chart in row 1, column 3 indicates that subboroughs with higher poverty rates appear to have lower median monthly rents. The data in row 2, column 3 show that subboroughs with higher poverty rates tend to have lower percentages of college graduates. The scatter charts in column 4 show that the relationships between the mean travel time and the other variables are not as clear as relationships in other columns.

FIGURE 3.30 Comparing Stacked-, Clustered-, and Multiple-Column Charts for the Regional Sales Data for Kirkland Industries

[Figure: four panels plotting Sales ($100s) by month (Jan through Dec): a stacked-column chart (South, North), a clustered-column chart (South, North), and multiple-column charts shown as one chart for South and one chart for North.]

Note that here we have not included the additional steps for formatting the chart in Excel using the Chart Elements button, but the steps are similar to those used to create the previous charts.

Clustered-column (bar) charts are also referred to as side-by-side-column (bar) charts.

TABLE 3.12 Data for New York City Subboroughs

Area                           Median Monthly Rent ($)   Percentage College Graduates (%)   Poverty Rate (%)   Travel Time (min)
Astoria                        1,106                     36.8                               15.9               35.4
Bay Ridge                      1,082                     34.3                               15.6               41.9
Bayside/Little Neck            1,243                     41.3                               7.6                40.6
Bedford Stuyvesant             822                       21.0                               34.2               40.5
Bensonhurst                    876                       17.7                               14.4               44.0
Borough Park                   980                       26.0                               27.6               35.3
Brooklyn Heights/Fort Greene   1,086                     55.3                               17.4               34.5
Brownsville/Ocean Hill         714                       11.6                               36.0               40.3
Bushwick                       945                       13.3                               33.5               35.5
Central Harlem                 665                       30.6                               27.1               25.0
Chelsea/Clinton/Midtown        1,624                     66.1                               12.7               43.7
Coney Island                   786                       27.2                               20.0               46.3
…                              …                         …                                  …                  …

NYCityData

The scatter-chart matrix is very useful in analyzing relationships among variables. Unfortunately, it is not possible to generate a scatter-chart matrix using standard Excel functions.
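Although the matrix itself requires charting software, its layout is easy to sketch in Python. The example below is illustrative only. It uses just the 12 subboroughs listed in Table 3.12 (not the full 55), enumerates which variable falls on each axis of every panel, and attaches a Pearson correlation coefficient to each panel as a rough numerical summary. The correlation is our addition and is not part of Figure 3.31.

```python
import math

# The 12 rows shown in Table 3.12 (a partial listing of the 55 subboroughs).
DATA = {
    "MedianRent": [1106, 1082, 1243, 822, 876, 980,
                   1086, 714, 945, 665, 1624, 786],
    "CollegeGraduates": [36.8, 34.3, 41.3, 21.0, 17.7, 26.0,
                         55.3, 11.6, 13.3, 30.6, 66.1, 27.2],
    "PovertyRate": [15.9, 15.6, 7.6, 34.2, 14.4, 27.6,
                    17.4, 36.0, 33.5, 27.1, 12.7, 20.0],
    "CommuteTime": [35.4, 41.9, 40.6, 40.5, 44.0, 35.3,
                    34.5, 40.3, 35.5, 25.0, 43.7, 46.3],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def scatter_matrix_panels(data):
    """List (row, column, y_variable, x_variable, r) for every panel."""
    names = list(data)
    panels = []
    for row, y_name in enumerate(names, start=1):        # row sets the y-axis
        for col, x_name in enumerate(names, start=1):    # column sets the x-axis
            r = pearson(data[x_name], data[y_name])
            panels.append((row, col, y_name, x_name, r))
    return panels


panels = scatter_matrix_panels(DATA)
r_by_panel = {(row, col): r for row, col, _, _, r in panels}
```

For these 12 rows, the coefficient for row 1, column 2 is positive and those for row 1, column 3 and row 2, column 3 are negative, which agrees in direction with the findings described above. The diagonal panels pair a variable with itself, so their coefficient is 1. A coefficient summarizes only the linear part of a relationship, so it complements the scatter charts rather than replacing them.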

PivotCharts in Excel

To summarize and analyze data with both a crosstabulation and charting, Excel pairs PivotCharts with PivotTables. Using the restaurant data introduced in Table 3.7 and Figure 3.7, we can create a PivotChart by taking the following steps:

Step 1. Click the Insert tab on the Ribbon
Step 2. In the Charts group, select PivotChart
Step 3. When the Create PivotChart dialog box appears:
    Choose Select a Table or Range
    Enter A1:D301 in the Table/Range: box
    Select New Worksheet as the location for the PivotTable Report
    Click OK
Step 4. In the PivotChart Fields area, under Choose fields to add to report:
    Drag the Quality Rating field to the AXIS (CATEGORIES) area
    Drag the Meal Price ($) field to the LEGEND (SERIES) area
    Drag the Wait Time (min) field to the VALUES area
Step 5. Click on Sum of Wait Time (min) in the Values area
Step 6. Select Value Field Settings… from the list of options that appear
Step 7. When the Value Field Settings dialog box appears:
    Under Summarize value field by, select Average
    Click Number Format
    In the Category: box, select Number
    Enter 1 for Decimal places:
    Click OK
    When the Value Field Settings dialog box reappears, click OK
Step 8. Right-click in cell B2 or any cell containing a meal price column label
Step 9. Select Group from the list of options that appears
Step 10. When the Grouping dialog box appears:
    Enter 10 in the Starting at: box
    Enter 49 in the Ending at: box
    Enter 10 in the By: box
    Click OK
Step 11. Right-click on “Excellent” in cell A5
Step 12. Select Move and click Move “Excellent” to End

FIGURE 3.31 Scatter-Chart Matrix for New York City Rent Data

[Figure: 4 × 4 matrix of scatter charts (Rows 1 to 4, Columns 1 to 4) for the variables MedianRent, CollegeGraduates, PovertyRate, and CommuteTime.]

The scatter charts along the diagonal in a scatter-chart matrix (e.g., in row 1, column 1 and in row 2, column 2) display the relationship between a variable and itself. Therefore, the points in these scatter charts will always fall along a straight line at a 45-degree angle, as shown in Figure 3.31.

Restaurant

In the appendix available within the MindTap Reader, we demonstrate how to create a scatter-chart matrix similar to that shown in Figure 3.31 using the Excel Add-in Analytic Solver. Statistical software packages such as R, NCSS, JMP, and SAS can also be used to create these matrixes.

The completed PivotTable and PivotChart appear in Figure 3.32. The PivotChart is a clustered-column chart whose column heights correspond to the average wait times and are clustered into the categorical groupings of Good, Very Good, and Excellent. The columns are different colors to differentiate the wait times at restaurants in the various meal price ranges. Figure 3.32 shows that Excellent restaurants have longer wait times than Good and Very Good restaurants. We also see that Excellent restaurants in the price range of $30–$39 have the longest wait times. The PivotChart displays the same information as that of the PivotTable in Figure 3.13, but the column chart used here makes it easier to compare the restaurants based on quality rating and meal price.

Like PivotTables, PivotCharts are interactive. You can use the arrows on the axes and legend labels to change the categorical data being displayed. For example, you can click on the Quality Rating horizontal axis label (see Figure 3.32) and choose to look at only Very Good and Excellent restaurants, or you can click on the Meal Price ($) legend label and choose to view only certain meal price categories.

FIGURE 3.32 PivotTable and PivotChart for the Restaurant Data

Average of Wait Time (min), with Quality Rating as row labels and Meal Price ($) as column labels:

Row Labels    10–19   20–29   30–39   40–49   Grand Total
Good          2.6     2.5     0.5             2.5
Very Good     12.6    12.6    12.0    10.0    12.3
Excellent     25.5    29.1    34.0    32.3    32.1
Grand Total   7.6     11.1    19.8    27.5    13.9

[PivotChart: clustered-column chart of Average Wait Time (min), from 0.0 to 40.0 on the vertical axis, by Quality Rating (Good, Very Good, Excellent) on the horizontal axis, with one column color per Meal Price ($) group: 10–19, 20–29, 30–39, 40–49.]
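The grouping and averaging that the PivotTable performs can be sketched in Python. The example below is a minimal illustration. The seven records are hypothetical (they are not the 300 restaurants in the Restaurant file), meal prices are assumed to be whole dollars, and the function names are ours. The price groups mirror the Grouping dialog settings of Starting at 10 and By 10.

```python
from collections import defaultdict

# Hypothetical records for illustration only:
# (quality rating, meal price in whole dollars, wait time in minutes).
RECORDS = [
    ("Good", 18, 5),
    ("Good", 22, 6),
    ("Very Good", 33, 15),
    ("Very Good", 38, 11),
    ("Excellent", 41, 30),
    ("Excellent", 35, 38),
    ("Excellent", 36, 26),
]


def price_bin(price, start=10, width=10):
    """Label the price group for a whole-dollar price, e.g. 18 -> '10-19'."""
    lower = start + ((price - start) // width) * width
    return f"{lower}-{lower + width - 1}"


def average_wait(records):
    """Average wait time for each (quality rating, price group) cell."""
    cells = defaultdict(list)
    for quality, price, wait in records:
        cells[(quality, price_bin(price))].append(wait)
    return {
        key: round(sum(waits) / len(waits), 1)   # one decimal place, as in Step 7
        for key, waits in cells.items()
    }


pivot = average_wait(RECORDS)
```

Each key of pivot corresponds to one cell of the crosstabulation, and each value is the height of one column in the clustered-column chart. Cells with no records, such as Good restaurants in the 40-49 group here, are simply absent from the result, just as the corresponding PivotTable cell is blank.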

NOTES + COMMENTS

1. The steps for modifying and formatting charts were changed in Excel 2013. In versions of Excel prior to 2013, most chart-formatting options can be found in the Layout tab in the Chart Tools Ribbon. This is where you will find options for adding a Chart Title, Axis Titles, Data Labels, and so on in older versions of Excel.
2. Excel assumes that line charts will be used to graph only time series data. The Line Chart tool in Excel is the most intuitive for creating charts that include text entries for the horizontal axis (e.g., the month labels of Jan, Feb, Mar, etc. for the monthly sales data in Figure 3.19). When the horizontal axis represents numerical values (1, 2, 3, etc.), then it is easiest to go to the Charts group under the Insert tab in the Ribbon, click the Insert Scatter (X,Y) or Bubble Chart button, and then select the Scatter with Straight Lines and Markers button.
3. Color is frequently used to differentiate elements in a chart. However, be wary of the use of color to differentiate for several reasons: (1) Many people are color-blind and may not be able to differentiate colors. (2) Many charts are printed in black and white as handouts, which reduces or eliminates the impact of color. (3) The use of too many colors in a chart can make the chart appear too busy and distract or even confuse the reader. In many cases, it is preferable to differentiate chart elements with dashed lines, patterns, or labels.
4. Histograms and box plots (discussed in Chapter 2 in relation to analyzing distributions) are other effective data-visualization tools for summarizing the distribution of data.

3.4 Advanced Data Visualization

In this chapter, we have presented only some of the most basic ideas for using data visualization effectively both to analyze data and to communicate data analysis to others. The charts discussed so far are those most commonly used and will suffice for most data-visualization needs. However, many additional concepts, charts, and tools can be used to improve your data-visualization techniques. In this section we briefly mention some of them.

Advanced Charts

Although line charts, bar charts, scatter charts, and bubble charts suffice for most data-visualization applications, other charts can be very helpful in certain situations. One type of helpful chart for examining data with more than two variables is the parallel-coordinates plot, which includes a different vertical axis for each variable. Each observation in the data set is represented by drawing a line on the parallel-coordinates plot connecting each vertical axis. The height of the line on each vertical axis represents the value taken by that observation for the variable corresponding to the vertical axis. For instance, Figure 3.33 displays a parallel-coordinates plot for a sample of Major League Baseball players. The figure contains data for 10 players who play first base (1B) and 10 players who play second base (2B). For each player, the leftmost vertical axis plots his total number of home runs (HR). The center vertical axis plots the player’s total number of stolen bases (SB), and the rightmost vertical axis plots the player’s batting average. Various colors differentiate 1B players from 2B players (1B players are in blue and 2B players are in red).

We can make several interesting statements upon examining Figure 3.33. The 1B players in the sample tend to hit lots of HR but have very few SB. Conversely, the 2B players in the sample steal more bases but generally have fewer HR, although some 2B players have many HR and many SB. Finally, 1B players tend to have higher batting averages (AVG) than 2B players. We may infer from Figure 3.33 that the traits of 1B players may be different from those of 2B players. In general, this statement is true. Players at 1B tend to be offensive stars who hit for power and average, whereas players at 2B are often faster and more agile in order to handle the defensive responsibilities of the position (traits that are not common in strong HR hitters). Parallel-coordinates plots, in which you can differentiate categorical variable values using color as in Figure 3.33, can be very helpful in identifying common traits across multiple dimensions.
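The construction of a parallel-coordinates plot can be sketched in Python. Because each variable has its own vertical axis, each value is rescaled to a height between 0 and 1 on that axis, and an observation becomes a line through those heights. The four players below are hypothetical and are not the data behind Figure 3.33. The function names are ours.

```python
# Hypothetical players for illustration only:
# (position, home runs, stolen bases, batting average).
PLAYERS = [
    ("1B", 35, 2, 0.310),
    ("1B", 28, 1, 0.295),
    ("2B", 10, 30, 0.270),
    ("2B", 22, 18, 0.250),
]
AXES = ["HR", "SB", "AVG"]


def axis_ranges(players):
    """(minimum, maximum) of each variable, one pair per vertical axis."""
    columns = list(zip(*[player[1:] for player in players]))
    return [(min(column), max(column)) for column in columns]


def polyline(player, ranges):
    """Points (axis index, height in [0, 1]) joined to draw one observation."""
    points = []
    for index, (value, (lo, hi)) in enumerate(zip(player[1:], ranges)):
        height = 0.5 if hi == lo else (value - lo) / (hi - lo)
        points.append((index, height))
    return points


ranges = axis_ranges(PLAYERS)
lines = [(player[0], polyline(player, ranges)) for player in PLAYERS]
```

Each entry of lines pairs a position (used to choose the line color) with the heights at which that player's line crosses the HR, SB, and AVG axes. Rescaling each axis separately is what allows variables with very different units, such as home runs and batting average, to share one chart.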

A treemap is useful for visualizing hierarchical data along multiple dimensions. SmartMoney’s Map of the Market, shown in Figure 3.34, is a treemap for analyzing stock market performance. In the Map of the Market, each rectangle represents a particular company (Apple, Inc. is highlighted in Figure 3.34). The color of the rectangle represents the overall performance of the company’s stock over the previous 52 weeks. The Map of the Market is also divided into market sectors (Health Care, Financials, Oil & Gas, etc.). The size of each company’s rectangle provides information on the company’s market capitalization size relative to the market sector and the entire market. Figure 3.34 shows that Apple has a very large market capitalization relative to other firms in the Technology sector and that it has performed exceptionally well over the previous 52 weeks. An investor can use the treemap in Figure 3.34 to quickly get an idea of the performance of individual companies relative to other companies in their market sector as well as the performance of entire market sectors relative to other sectors.

Excel allows the user to create treemap charts. The step-by-step directions below explain how to create a treemap in Excel for the top-100 global companies based on 2014 market value using data in the file Global100. In this file we provide the continent where the company is headquartered in column A, the country headquarters in column B, the name of the company in column C, and the market value in column D. For the treemap to display properly in Excel, the data should be sorted by column A, “Continent,” which is the highest level of the hierarchy.

Step 1. Select cells A1:D101
Step 2. Click Insert on the Ribbon
    Click on the Insert Hierarchy Chart button in the Charts group
    Select Treemap from the drop-down menu
Step 3. When the treemap chart appears, right-click on the treemap portion of the chart
    Select Format Data Series… in the pop-up menu
    When the Format Data Series task pane opens, select Banner

Note that the treemap chart is not available in versions of Excel prior to Excel 2016.

The appendix for this chapter available in the MindTap Reader describes how to create a parallel-coordinates plot similar to Figure 3.33 using the Analytic Solver Excel Add-in.

FIGURE 3.33 Parallel-Coordinates Plot for Baseball Data

[Figure: parallel-coordinates plot with three vertical axes: HR (starting at 0), SB (starting at 1), and AVG (from 0.222 to 0.338).]

FIGURE 3.34 SmartMoney’s Map of the Market as an Example of a Treemap

[Figure: screenshot of the Map of the Market treemap, divided into the sectors Health Care, Consumer Goods, Technology, Consumer Services, Basic Materials, Financials, Oil & Gas, Industrials, Utilities, and Telecommunications, with a color key for percent change running from -55.6% to +55.6%. Apple Inc. is highlighted (+58.21%; AAPL last: $530.34, chg: +$195.10).]

The Map of the Market is based on work done by Professor Ben Shneiderman and students at the University of Maryland Human–Computer Interaction Lab.

FIGURE 3.35 Treemap Created in Excel for Top 100 Global Companies Data

[Figure: Excel worksheet with columns Continent, Country, Company, and Market Value (Billions US $), beside a treemap titled “Top 100 Global Companies by Market Value.” The treemap has one banner and color per continent (Asia, Australia, Europe, North America, South America) and one labeled rectangle per company.]

Global100

Figure 3.35 shows the completed treemap created with Excel. Selecting Banner in Step 3 places the name of each continent as a banner title within the treemap. Each continent is also assigned a different color within the treemap. From this figure we can see that North America has more top-100 companies than any other continent, followed by Europe and then Asia. The size of the rectangle for each company in the treemap represents its relative market value. We can see that Apple, ExxonMobil, Google, and Microsoft have the four highest market values. Australia has only four companies in the top 100 and South America has two. Africa and Antarctica have no companies in the top 100. Hovering your pointer over one of the companies in the treemap will display the market value for that company.
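The idea that each rectangle's size reflects its value can be illustrated with a simple two-level layout in Python. This sketch uses a basic slice-and-dice scheme of our own choosing and is not the layout algorithm that Excel uses. The continents, companies, and values are hypothetical placeholders rather than entries from the Global100 file.

```python
def slice_layout(items, x, y, width, height, horizontal=True):
    """Split a rectangle among (label, value) items in proportion to value.

    Returns a list of (label, x, y, width, height) tuples.
    """
    total = sum(value for _, value in items)
    rects = []
    offset = 0.0   # fraction of the rectangle already used
    for label, value in items:
        share = value / total
        if horizontal:
            rects.append((label, x + offset * width, y, share * width, height))
        else:
            rects.append((label, x, y + offset * height, width, share * height))
        offset += share
    return rects


def treemap(groups, width=100.0, height=60.0):
    """Two-level layout: groups side by side, members stacked inside each."""
    group_totals = [
        (name, sum(value for _, value in members))
        for name, members in groups.items()
    ]
    layout = {}
    for name, gx, gy, gw, gh in slice_layout(group_totals, 0.0, 0.0, width, height):
        for label, rx, ry, rw, rh in slice_layout(
            groups[name], gx, gy, gw, gh, horizontal=False
        ):
            layout[label] = (rx, ry, rw, rh)
    return layout


# Hypothetical hierarchy: continent -> [(company, market value)].
GROUPS = {
    "Continent A": [("Company 1", 300.0), ("Company 2", 100.0)],
    "Continent B": [("Company 3", 200.0)],
}
layout = treemap(GROUPS)
```

In the result, Company 1 holds half of the total value and therefore receives half of the total area (3,000 of the 6,000 square units), while all companies from the same continent stay inside that continent's block. Long, thin rectangles are a known weakness of this simple scheme, which is one reason charting tools typically use more elaborate layouts.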

Geographic Information Systems Charts

Consider the case of the Cincinnati Zoo & Botanical Garden, which derives much of its revenue from selling annual memberships to customers. The Cincinnati Zoo would like to better understand where its current members are located. Figure 3.36 displays a map of the Cincinnati, Ohio, metropolitan area showing the relative concentrations of Cincinnati Zoo members. The more darkly shaded areas represent areas with a greater number of members. Figure 3.36 is an example of the output from a geographic information system (GIS), which merges maps and statistics to present data collected over different geographic areas. Displaying geographic data on a map can often help in interpreting data and observing patterns.

The GIS chart in Figure 3.36 combines a heat map and a geographical map to help the reader analyze this data set. From the figure we can see a high concentration of zoo members in a band to the northeast of the zoo that includes the cities of Mason and Hamilton (circled). Also, a high concentration of zoo members lies to the southwest of the zoo around the city of Florence. These observations could prompt the zoo manager to identify the shared characteristics of the populations of Mason, Hamilton, and Florence to learn what is leading them to be zoo members. If these characteristics can be identified, the manager can then try to identify other nearby populations that share these characteristics as potential markets for increasing the number of zoo members.

A GIS chart such as that shown in Figure 3.36 is an example of geoanalytics, the use of data by geographical area or some other form of spatial referencing to generate insights.

FIGURE 3.36 GIS Chart for Cincinnati Zoo Member Data

[Figure: map of the Cincinnati metropolitan area, covering parts of Ohio, Kentucky, and Indiana, shaded according to the concentration of zoo members, with place names and ZIP codes labeled and the Cincinnati Zoo marked.]

Excel has a feature called 3D Maps that allows the user to create interactive GIS-type charts. This tool is quite powerful, and the full capabilities are beyond the scope of this text. The step-by-step directions below show an example using data from the World Bank on gross domestic product (GDP) for countries around the world.

Step 1. Select cells A4:C267

Step 2. Click the Insert tab on the Ribbon

Click the 3D Map button in the Tours group

Select Open 3D Maps. This will open a new Excel window that displays a world map (see Figure 3.37)

Step 3. Drag GDP 2014 (Billions US $) from the Field List to the Height box in the Data area of the Layer 1 task pane.

Click the Change the visualization to Region button in the Data area of the Layer 1 task pane.

Step 4. Click Layer Options in the Layer 1 task pane.

Change the Color to a dark red color to give the countries more differentiation on the world map.

3D Maps is called Power Map in Excel 2013, but it is not as fully integrated as it is in Excel 2016. This feature is not available in Excel versions prior to Excel 2013.

FIGURE 3.37 Initial Window Opened by Clicking on 3D Map Button in Excel for World GDP Data

WorldGDP


The completed GIS chart is shown in Figure 3.38. You can now click and drag the world map to view different parts of the world. Figure 3.38 shows much of Europe and Asia. The countries with the darker shading have higher GDPs. We can see that China has a very dark shading, indicating a very high GDP relative to other countries. Russia and Germany have slightly darker shadings than other countries shown, indicating higher GDPs than most countries but lower than China. If you hover over a country, it will display the Country Name and GDP 2014 (Billions US $) in a pop-up window. In Figure 3.38 we have hovered over China to display its GDP.
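The idea that darker shading corresponds to a higher value can be sketched outside Excel. The following Python fragment is only an illustration: it assumes a simple linear (min-max) scale, which may not be the scale 3D Maps actually uses, and the country names and GDP figures are hypothetical placeholders rather than values from the WorldGDP file.

```python
# Sketch: convert a quantitative value into a shading intensity between
# 0 (lightest) and 1 (darkest), as a region-shaded map does conceptually.
# Assumption: linear min-max scaling; Excel's own scaling may differ.

def shade_intensity(values):
    """Return {region: intensity in [0, 1]} using linear min-max scaling."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    if span == 0:
        # All regions have the same value, so no differentiation is possible.
        return {region: 0.0 for region in values}
    return {region: (v - lo) / span for region, v in values.items()}

# Hypothetical GDP figures in billions of US $ (placeholders, not real data).
gdp = {
    "Country A": 10000.0,
    "Country B": 4000.0,
    "Country C": 2000.0,
    "Country D": 500.0,
}

intensity = shade_intensity(gdp)
for country, t in sorted(intensity.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: shading intensity {t:.2f}")
```

With a linear scale, one very large value (Country A here) pushes all other regions toward the light end of the scale, which is consistent with the observation that only a few countries appear dark on the map.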

NOTES + COMMENTS

Spotfire, Tableau, QlikView, SAS Visual Analytics, R, and JMP are examples of software that include advanced data-visualization capabilities.

FIGURE 3.38 Completed 3D Map Created in Excel for World GDP Data

3.5 Data Dashboards

A data dashboard is a data-visualization tool that illustrates multiple metrics and automatically updates these metrics as new data become available. It is like an automobile’s dashboard instrumentation that provides information on the vehicle’s current speed, fuel level, and engine temperature so that a driver can assess current operating conditions and take effective action. Similarly, a data dashboard provides the important metrics that managers need to quickly assess the performance of their organization and react accordingly. In this section we provide guidelines for creating effective data dashboards and an example application.

Principles of Effective Data Dashboards

In an automobile dashboard, values such as current speed, fuel level, and oil pressure are displayed to give the driver a quick overview of current operating characteristics. In a business, the equivalent values are often indicative of the business’s current operating characteristics, such as its financial position, the inventory on hand, customer service metrics, and the like. These values are typically known as key performance indicators (KPIs). A data dashboard should provide timely summary information on KPIs that are important to the user, and it should do so in a manner that informs rather than overwhelms its user.

Ideally, a data dashboard should present all KPIs as a single screen that a user can quickly scan to understand the business’s current state of operations. Rather than requiring the user to scroll vertically and horizontally to see the entire dashboard, it is better to create multiple dashboards so that each dashboard can be viewed on a single screen.

The KPIs displayed in the data dashboard should convey meaning to its user and be related to the decisions the user makes. For example, the data dashboard for a marketing manager may have KPIs related to current sales measures and sales by region, while the data dashboard for a Chief Financial Officer should provide information on the current financial standing of the company, including cash on hand, current debt obligations, and so on.

A data dashboard should call attention to unusual measures that may require attention, but not in an overwhelming way. Color should be used to call attention to specific values to differentiate categorical variables, but the use of color should be restrained. Too many different or too bright colors make the presentation distracting and difficult to read.

Applications of Data Dashboards

To illustrate the use of a data dashboard in decision making, we discuss an application involving the Grogan Oil Company, which has offices located in three cities in Texas: Austin (its headquarters), Houston, and Dallas. Grogan’s Information Technology (IT) call center, located in Austin, handles calls from employees regarding computer-related problems involving software, Internet, and e-mail issues. For example, if a Grogan employee in Dallas has a computer software problem, the employee can call the IT call center for assistance.

The data dashboard shown in Figure 3.39, developed to monitor the performance of the call center, combines several displays to track the call center’s KPIs. The data presented are for the current shift, which started at 8:00 a.m. The line chart in the upper left-hand corner shows the call volume for each type of problem (Software, Internet, or E-mail) over time. This chart shows that call volume is heavier during the first few hours of the shift, that calls concerning e-mail issues appear to decrease over time, and that the volume of calls regarding software issues is highest at midmorning. A line chart is effective here because these are time series data and the line chart helps identify trends over time.

The column chart in the upper right-hand corner of the dashboard shows the percentage of time that call center employees spent on each type of problem or were idle (not working on a call). Both the line chart and the column chart are important displays in determining optimal staffing levels. For instance, knowing the call mix and how stressed the system is, as measured by percentage of idle time, can help the IT manager make sure that enough call center employees are available with the right level of expertise.
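The percentages behind a time-breakdown chart of this kind are simple shares of total staffed time. The sketch below is an illustration only: the minute totals are hypothetical values chosen so that the resulting shares match the percentages displayed in Figure 3.39 (E-mail 22%, Internet 18%, Software 46%, Idle 14%), and the function name is an assumption rather than anything defined in the text.

```python
# Sketch: compute the share of staffed time spent on each activity.
# The minute totals below are hypothetical; they were chosen so the
# shares reproduce the percentages displayed in Figure 3.39.

def time_breakdown(minutes_by_activity):
    """Return {activity: percent of total time} for a dict of minutes."""
    total = sum(minutes_by_activity.values())
    if total == 0:
        raise ValueError("no staffed time recorded")
    return {
        activity: 100.0 * minutes / total
        for activity, minutes in minutes_by_activity.items()
    }

minutes = {"E-mail": 220, "Internet": 180, "Software": 460, "Idle": 140}
shares = time_breakdown(minutes)

for activity, pct in shares.items():
    print(f"{activity:<10}{pct:5.1f}%")

# Idle time measures how much slack the system has; its complement is
# the share of time employees spent working on calls.
busy_pct = 100.0 - shares["Idle"]
print(f"Time spent on calls: {busy_pct:.1f}%")
```

A low idle percentage suggests a heavily loaded call center, so a manager might track this single number over successive shifts when deciding on staffing levels.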

The clustered-bar chart in the middle right of the dashboard shows the call volume by type of problem for each of Grogan’s offices. This allows the IT manager to quickly identify whether there is a particular type of problem by location. For example, the office in Austin seems to be reporting a relatively high number of issues with e-mail. If the source of the problem can be identified quickly, then the problem might be resolved quickly for many users all at once. Also, note that a relatively high number of software problems are coming from the Dallas office. In this case, the Dallas office is installing new software, resulting in more calls to the IT call center. Having been alerted to this by the Dallas office last week, the IT manager knew that calls coming from the Dallas office would spike, so the manager proactively increased staffing levels to handle the expected increase in calls.

For each unresolved case that was received more than 15 minutes ago, the bar chart shown in the middle left of the data dashboard displays the length of time for which each case has been unresolved. This chart enables Grogan to quickly monitor the key problem cases and decide whether additional resources may be needed to resolve them. The worst case, T57, has been unresolved for over 300 minutes and is actually left over from the previous shift. Finally, the chart in the bottom panel shows the length of time required for resolved cases during the current shift. This chart is an example of a frequency distribution for quantitative data.

Key performance indicators are sometimes referred to as key performance metrics (KPMs).
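The two displays described above rest on two small computations: filtering open cases by how long they have been unresolved, and grouping resolution times into classes. The following Python sketch illustrates both under stated assumptions. The case identifiers T57, W24, and W59 appear in Figure 3.39, but the minute values (other than T57 exceeding 300), the extra case W61, the resolution times, and the function names are all hypothetical.

```python
from collections import Counter

def overdue_cases(open_cases, threshold=15):
    """Return (case_id, minutes) pairs unresolved longer than threshold,
    sorted so that the longest-running case comes first."""
    overdue = [(cid, m) for cid, m in open_cases.items() if m > threshold]
    return sorted(overdue, key=lambda item: item[1], reverse=True)

def frequency_distribution(times, bin_width=1):
    """Count values falling in each class [k*w, (k+1)*w)."""
    counts = Counter(int(t // bin_width) for t in times)
    return {
        (k * bin_width, (k + 1) * bin_width): counts[k]
        for k in sorted(counts)
    }

# Hypothetical minutes unresolved for open cases (W61 is an invented case
# that falls below the 15-minute threshold and is therefore not shown).
open_cases = {"T57": 310, "W24": 120, "W59": 40, "W61": 9}
overdue = overdue_cases(open_cases)
for case_id, minutes in overdue:
    print(f"{case_id}: unresolved for {minutes} minutes")

# Hypothetical resolution times (minutes) for cases closed this shift.
resolved_minutes = [0.5, 1.2, 1.8, 2.4, 2.6, 3.1, 5.0, 7.7]
freq = frequency_distribution(resolved_minutes)
for (low, high), count in freq.items():
    print(f"{low}-{high} minutes: {count}")
```

Sorting the overdue cases in descending order mirrors the bar chart, where the worst case is the most prominent; the one-minute classes mirror the bins on the horizontal axis of the Time to Resolve a Case chart.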

Throughout the dashboard, a consistent color coding scheme is used for problem type (E-mail, Software, and Internet). Because the Time to Resolve a Case chart is not broken down by problem type, dark shading is used so as not to confuse these values with a particular problem type. Other dashboard designs are certainly possible, and improvements could certainly be made to the design shown in Figure 3.39. However, what is important is that information is clearly communicated so that managers can improve their decision making.

The Grogan Oil data dashboard presents data at the operational level, is updated in real time, and is used for operational decisions such as staffing levels. Data dashboards may also be used at the tactical and strategic levels of management. For example, a sales manager could monitor sales by salesperson, by region, by product, and by customer. This would alert the sales manager to changes in sales patterns. At the highest level, a more strategic dashboard would allow upper management to quickly assess the financial health of the company by monitoring more aggregate financial, service-level, and capacity-utilization information.

Chapter 2 discusses the construction of frequency distributions for quantitative and categorical data.

FIGURE 3.39 Data Dashboard for the Grogan Oil Information Technology Call Center

[Dashboard header: Grogan Oil IT Call Center, Shift 1, 22-Feb-12, 12:44:00 PM. Panels: Call Volume (number of calls by hour, 8:00 to 12:00, for E-mail, Software, and Internet); Time Breakdown This Shift (E-mail 22%, Internet 18%, Software 46%, Idle 14%); Call Volume by City (number of calls for Austin, Dallas, and Houston by problem type); Unresolved Cases Beyond 15 Minutes (minutes unresolved for cases T57, W24, and W59); Time to Resolve a Case (frequency by minutes).]

NOTES + COMMENTS

The creation of data dashboards in Excel generally requires the use of macros written using Visual Basic for Applications (VBA). The use of VBA is beyond the scope of this textbook, but VBA is a powerful programming tool that can greatly increase the capabilities of Excel for analytics, including data visualization.

SUMMARY

In this chapter we covered techniques and tools related to data visualization. We discussed several important techniques for enhancing visual presentation, such as improving the clarity of tables and charts by removing unnecessary lines and presenting numerical values only to the precision necessary for analysis. We explained that tables can be preferable to charts for data visualization when the user needs to know exact numerical values. We introduced crosstabulation as a form of a table for two variables and explained how to use Excel to create a PivotTable.

We presented many charts in detail for data visualization, including scatter charts, line charts, bar and column charts, bubble charts, and heat maps. We explained that pie charts and three-dimensional charts are almost never preferred tools for data visualization and that bar (or column) charts are usually much more effective than pie charts. We also discussed several advanced data-visualization charts, such as parallel-coordinates plots, treemaps, and GIS charts. We introduced data dashboards as a data-visualization tool that provides a summary of a firm’s operations in visual form to allow managers to quickly assess the current operating conditions and to aid decision making.

Many other types of charts can be used for specific forms of data visualization, but we have covered many of the most-popular and most-useful ones. Data visualization is very important for helping someone analyze data and identify important relations and patterns. The effective design of tables and charts is also necessary to communicate data analysis to others. Tables and charts should be only as complicated as necessary to help the user understand the patterns and relationships in the data.

GLOSSARY

Bar chart A graphical presentation that uses horizontal bars to display the magnitude of quantitative data. Each bar typically represents a class of a categorical variable.

Bubble chart A graphical presentation used to visualize three variables in a two-dimensional graph. The two axes represent two variables, and the magnitude of the third variable is given by the size of the bubble.

Chart A visual method for displaying data; also called a graph or a figure.

Clustered-column (or clustered-bar) chart A special type of column (bar) chart in which multiple bars are clustered in the same class to compare multiple variables; also known as a side-by-side-column (bar) chart.

Column chart A graphical presentation that uses vertical bars to display the magnitude of quantitative data. Each bar typically represents a class of a categorical variable.

Crosstabulation A tabular summary of data for two variables. The classes of one variable are represented by the rows; the classes for the other variable are represented by the columns.

Data dashboard A data-visualization tool that updates in real time and gives multiple outputs.

Data-ink ratio The ratio of the amount of ink used in a table or chart that is necessary to convey information to the total amount of ink used in the table and chart. Ink used that is not necessary to convey information reduces the data-ink ratio.

Geographic information system (GIS) A system that merges maps and statistics to present data collected over different geographies.

Heat map A two-dimensional graphical presentation of data in which color shadings indicate magnitudes.

Key performance indicator (KPI) A metric that is crucial for understanding the current performance of an organization; also known as a key performance metric (KPM).

Line chart A graphical presentation of time series data in which the data points are connected by a line.

Parallel-coordinates plot A graphical presentation used to examine more than two variables in which each variable is represented by a different vertical axis. Each observation in a data set is plotted in a parallel-coordinates plot by drawing a line between the values of each variable for the observation.

Pie chart A graphical presentation used to compare categorical data. Because of difficulties in comparing relative areas on a pie chart, these charts are not recommended. Bar or column charts are generally superior to pie charts for comparing categorical data.

PivotChart A graphical presentation created in Excel that functions similarly to a PivotTable.

PivotTable An interactive crosstabulation created in Excel.

Scatter chart A graphical presentation of the relationship between two quantitative variables. One variable is shown on the horizontal axis and the other on the vertical axis.

Scatter-chart matrix A graphical presentation that uses multiple scatter charts arranged as a matrix to illustrate the relationships among multiple variables.

Sparkline A special type of line chart that indicates the trend of data but not magnitude. A sparkline does not include axes or labels.

Stacked-column chart A special type of column (bar) chart in which multiple variables appear on the same bar.

Treemap A graphical presentation that is useful for visualizing hierarchical data along multiple dimensions. A treemap groups data according to the classes of a categorical variable and uses rectangles whose size relates to the magnitude of a quantitative variable.
Trendline A line that provides an approximation of the relationship between variables in a chart.

PROBLEMS

1. A sales manager is trying to determine appropriate sales performance bonuses for her team this year. The following table contains the data relevant to determining the bonuses, but it is not easy to read and interpret. Reformat the table to improve readability and to help the sales manager make her decisions about bonuses.

Salesperson       Total Sales ($)  Average Performance Bonus Previous Years ($)  Customer Accounts  Years with Company
Smith, Michael    325,000.78       12,499.3452                                   124                14
Yu, Joe           13,678.21        239.9434                                      9                  7
Reeves, Bill      452,359.19       21,987.2462                                   175                21
Hamilton, Joshua  87,423.91        7,642.9011                                    28                 3
Harper, Derek     87,654.21        1,250.1393                                    21                 4
Quinn, Dorothy    234,091.39       14,567.9833                                   48                 9
Graves, Lorrie    379,401.94       27,981.4432                                   121                12
Sun, Yi           31,733.59        672.9111                                      7                  1
Thompson, Nicole  127,845.22       13,322.9713                                   17                 3

SalesBonuses

2. The following table shows an example of gross domestic product values for five countries over six years in equivalent U.S. dollars ($).

Gross Domestic Product (in US $)

Country    Year 1           Year 2           Year 3           Year 4           Year 5           Year 6
Albania    7,385,937,423    8,105,580,293    9,650,128,750    11,592,303,225   10,781,921,975   10,569,204,154
Argentina  169,725,491,092  198,012,474,920  241,037,555,661  301,259,040,110  285,070,994,754  339,604,450,702
Australia  704,453,444,387  758,320,889,024  916,931,817,944  982,991,358,955  934,168,969,952  1,178,776,680,167
Austria    272,865,358,404  290,682,488,352  336,840,690,493  375,777,347,214  344,514,388,622  341,440,991,770
Belgium    335,571,307,765  355,372,712,266  408,482,592,257  451,663,134,614  421,433,351,959  416,534,140,346

a. How could you improve the readability of this table?

b. The file GDPyears contains sample data from the United Nations Statistics Division on 30 countries and their GDP values from Year 1 to Year 6 in US$. Create a table that provides all these data for a user. Format the table to make it as easy to read as possible.

Hint: It is generally not important for the user to know GDP to an exact dollar figure. It is typical to present GDP values in millions or billions of dollars.

GDPyears

3. The following table provides monthly revenue values for Tedstar, Inc., a company that sells valves to large industrial firms. The monthly revenue data have been graphed using a line chart in the following figure.

Month        Jan      Feb      Mar      Apr      May      Jun      Jul      Aug      Sep      Oct      Nov      Dec
Revenue ($)  145,869  123,576  143,298  178,505  186,850  192,850  134,500  145,286  154,285  148,523  139,600  148,235

[Line chart of Tedstar monthly revenue: vertical axis Revenue ($) with a tick label at every 10,000 up to 210,000; horizontal axis Months, Jan through Dec.]

a. What are the problems with the layout and display of this line chart?

b. Create a new line chart for the monthly revenue data at Tedstar, Inc. Format the chart to make it easy to read and interpret.

Tedstar

4. In the file MajorSalary, data have been collected from 111 College of Business graduates on their monthly starting salaries. The graduates include students majoring in management, finance, accounting, information systems, and marketing. Create a PivotTable in Excel to display the number of graduates in each major and the average monthly starting salary for students in each major.

a. Which major has the greatest number of graduates?

b. Which major has the highest average starting monthly salary?

c. Use the PivotTable to determine the major of the student with the highest overall starting monthly salary. What is the major of the student with the lowest overall starting monthly salary?

MajorSalary

5. Entrepreneur magazine ranks franchises. Among the factors that the magazine uses in its rankings are growth rate, number of locations, start-up costs, and financial stability. A recent ranking listed the top 20 U.S. franchises and the number of locations as follows:

Franchise                  Number of U.S. Locations  Franchise                      Number of U.S. Locations
Hampton Inns               1,864                     Jan-Pro Franchising Intl. Inc. 12,394
ampm                       3,183                     Hardee’s                       1,901
McDonald’s                 32,805                    Pizza Hut Inc.                 13,281
7-Eleven Inc.              37,496                    Kumon Math & Reading Centers   25,199
Supercuts                  2,130                     Dunkin’ Donuts                 9,947
Days Inn                   1,877                     KFC Corp.                      16,224
Vanguard Cleaning Systems  2,155                     Jazzercise Inc.                7,683
Servpro                    1,572                     Anytime Fitness                1,618
Subway                     34,871                    Matco Tools                    1,431
Denny’s Inc.               1,668                     Stratus Building Solutions     5,018

These data can be found in the file Franchises. Create a PivotTable to summarize these data using classes 0–9,999, 10,000–19,999, 20,000–29,999, and 30,000–39,999 to answer the following questions. (Hint: Use Number of U.S. Locations as the COLUMNS, and use Count of Number of U.S. Locations as the VALUES in the PivotTable.)

a. How many franchises have between 0 and 9,999 locations?

b. How many franchises have more than 30,000 locations?

Franchises

6. The file MutualFunds contains a data set with information for 45 mutual funds that are part of the Morningstar Funds 500. The data set includes the following five variables:

Fund Type: The type of fund, labeled DE (Domestic Equity), IE (International Equity), and FI (Fixed Income)

Net Asset Value ($): The closing price per share

Five-Year Average Return (%): The average annual return for the fund over the past five years

Expense Ratio (%): The percentage of assets deducted each fiscal year for fund expenses

Morningstar Rank: The risk adjusted star rating for each fund; Morningstar ranks go from a low of 1 Star to a high of 5 Stars.

a. Prepare a PivotTable that gives the frequency count of the data by Fund Type (rows) and the five-year average annual return (columns). Use classes of 0–9.99, 10–19.99, 20–29.99, 30–39.99, 40–49.99, and 50–59.99 for the Five-Year Average Return (%).

b. What conclusions can you draw about the fund type and the average return over the past five years?

Note that Excel may display the column headings as 0–10, 10–20, 20–30, etc., but they should be interpreted as 0–9.99, 10–19.99, 20–29.99, etc.

MutualFunds

7. The file TaxData contains information from federal tax returns filed in 2007 for all counties in the United States (3,142 counties in total). Create a PivotTable in Excel to answer the questions below. The PivotTable should have State Abbreviation as Row Labels. The Values in the PivotTable should be the sum of adjusted gross income for each state.

TaxData

a. Sort the PivotTable data to display the states with the smallest sum of adjusted gross income on top and the largest on the bottom. Which state had the smallest sum of adjusted gross income? What is the total adjusted gross income for federal tax returns filed in this state with the smallest total adjusted gross income? (Hint: To sort data in a PivotTable in Excel, right-click any cell in the PivotTable that contains the data you want to sort, and select Sort.)

b. Add the County Name to the Row Labels in the PivotTable. Sort the County Names by Sum of Adjusted Gross Income with the lowest values on the top and the highest values on the bottom. Filter the Row Labels so that only the state of Texas is displayed. Which county had the smallest sum of adjusted gross income in the state of Texas? Which county had the largest sum of adjusted gross income in the state of Texas?

c. Click on Sum of Adjusted Gross Income in the Values area of the PivotTable in Excel. Click Value Field Settings…. Click the tab for Show Values As. In the Show values as box, select % of Parent Row Total. Click OK. This displays the adjusted gross income reported by each county as a percentage of the total state adjusted gross income. Which county has the highest percentage adjusted gross income in the state of Texas? What is this percentage?

d. Remove the filter on the Row Labels to display data for all states. What percentage of total adjusted gross income in the United States was provided by the state of New York?

8. The file FDICBankFailures contains data on failures of federally insured banks between 2000 and 2012. Create a PivotTable in Excel to answer the following questions. The PivotTable should group the closing dates of the banks into yearly bins and display the counts of bank closures each year in columns of Excel. Row labels should include the bank locations and allow for grouping the locations into states or viewing by city. You should also sort the PivotTable so that the states with the greatest number of total bank failures between 2000 and 2012 appear at the top of the PivotTable.

a. Which state had the greatest number of federally insured bank closings between 2000 and 2012?

b. How many bank closings occurred in the state of Nevada (NV) in 2010? In what cities did these bank closings occur?

c. Use the PivotTable’s filter capability to view only bank closings in California (CA), Florida (FL), Texas (TX), and New York (NY) for the years 2009 through 2012. What is the total number of bank closings in these states between 2009 and 2012?

d. Using the filtered PivotTable from part c, what city in Florida had the greatest number of bank closings between 2009 and 2012? How many bank closings occurred in this city?

e. Create a PivotChart to display a column chart that shows the total number of bank closings in each year from 2000 through 2012 in the state of Florida. Adjust the formatting of this column chart so that it best conveys the data. What does this column chart suggest about bank closings between 2000 and 2012 in Florida? Discuss.

(Hint: You may have to switch the row and column labels in the PivotChart to get the best presentation for your PivotChart.)

9. The following 20 observations are for two quantitative variables, x and y.

Observation x y Observation x y 1 222 22 11 237 48

2 233 49 12 34 229

3 2 8 13 9 218

4 29 216 14 233 31

5 213 10 15 20 216

6 21 228 16 23 14

7 213 27 17 215 18

8 223 35 18 12 17

9 14 25 19 220 211

10 3 23 20 27 222

FDICBankFailures

Scatter

a. Create a scatter chart for these 20 observations.

b. Fit a linear trendline to the 20 observations. What can you say about the relationship between the two quantitative variables?

10. The file Fortune500 contains data for profits and market capitalizations from a recent sample of firms in the Fortune 500.

a. Prepare a scatter diagram to show the relationship between the variables Market Capitalization and Profit in which Market Capitalization is on the vertical axis and Profit is on the horizontal axis. Comment on any relationship between the variables.

b. Create a trendline for the relationship between Market Capitalization and Profit. What does the trendline indicate about this relationship?

Fortune500

11. The International Organization of Motor Vehicle Manufacturers (officially known as the Organisation Internationale des Constructeurs d’Automobiles, OICA) provides data on worldwide vehicle production by manufacturer. The following table shows vehicle production numbers for four different manufacturers for five recent years. Data are in millions of vehicles.

Production (millions of vehicles)

Manufacturer  Year 1  Year 2  Year 3  Year 4  Year 5
Toyota        8.04    8.53    9.24    7.23    8.56
GM            8.97    9.35    8.28    6.46    8.48
Volkswagen    5.68    6.27    6.44    6.07    7.34
Hyundai       2.51    2.62    2.78    4.65    5.76

AutoProduction

a. Construct a line chart for the time series data for years 1 through 5 showing the number of vehicles manufactured by each automotive company. Show the time series for all four manufacturers on the same graph.

b. What does the line chart indicate about vehicle production amounts from years 1 through 5? Discuss.

c. Construct a clustered-bar chart showing vehicles produced by automobile manufacturer using the year 1 through 5 data. Represent the years of production along the horizontal axis, and cluster the production amounts for the four manufacturers in each year. Which company is the leading manufacturer in each year?

12. The following table contains time series data for regular gasoline prices in the United States for 36 consecutive months:

Month  Price ($)  Month  Price ($)  Month  Price ($)
1      2.27       13     2.84       25     3.91
2      2.63       14     2.73       26     3.68
3      2.53       15     2.73       27     3.65
4      2.62       16     2.73       28     3.64
5      2.55       17     2.71       29     3.61
6      2.55       18     2.80       30     3.45
7      2.65       19     2.86       31     3.38
8      2.61       20     2.99       32     3.27
9      2.72       21     3.10       33     3.38
10     2.64       22     3.21       34     3.58
11     2.77       23     3.56       35     3.85
12     2.85       24     3.80       36     3.90

GasPrices

a. Create a line chart for these time series data. What interpretations can you make about the average price per gallon of conventional regular gasoline over these 36 months?

b. Fit a linear trendline to the data. What does the trendline indicate about the price of gasoline over these 36 months?

13. The following table contains sales totals for the top six term life insurance salespeople at American Insurance.

Salesperson Contracts Sold

Harish 24

David 41

Kristina 19

Steven 23

Tim 53

Mona 39

a. Create a column chart to display the information in the table above. Format the column chart to best display the data by adding axes labels, a chart title, etc.

b. Sort the values in Excel so that the column chart is ordered from most contracts sold to fewest.

c. Insert data labels to display the number of contracts sold for each salesperson above the columns in the column chart created in part a.

14. The total number of term life insurance contracts sold in Problem 13 is 199. The following pie chart shows the percentages of contracts sold by each salesperson.

[Pie chart of contracts sold by salesperson: Harish 12.1%, David 20.6%, Kristina 9.5%, Steven 11.6%, Tim 26.6%, Mona 19.6%.]

a. What are the problems with using a pie chart to display these data?

b. What type of chart would be preferred for displaying the data in this pie chart?

c. Use a different type of chart to display the percentage of contracts sold by each salesperson that conveys the data better than the pie chart. Format the chart and add data labels to improve the chart’s readability.
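As a check on the figures used in Problems 13 and 14, the percentages shown in the pie chart can be reproduced from the contract counts. The short Python sketch below is one way to do this; the variable names are illustrative, and the counts are those listed in the table for Problem 13.

```python
# Sketch: reproduce the pie chart percentages in Problem 14 from the
# contract counts given in Problem 13.
contracts = {
    "Harish": 24,
    "David": 41,
    "Kristina": 19,
    "Steven": 23,
    "Tim": 53,
    "Mona": 39,
}

total = sum(contracts.values())  # total contracts sold

# Percentage of all contracts sold by each salesperson, to one decimal.
shares = {name: round(100 * n / total, 1) for name, n in contracts.items()}

# Listing the salespeople from largest to smallest share makes the
# comparison that a pie chart obscures easy to read.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<10}{pct:5.1f}%")
```

The total comes to 199 contracts, matching the problem statement, and the rounded shares agree with the labels in the pie chart.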

15. An automotive company is considering the introduction of a new model of sports car that will be available in four-cylinder and six-cylinder engine types. A sample of customers who were interested in this new model were asked to indicate their preference for an engine type for the new model of automobile. The customers were also asked to indicate their preference for exterior color from four choices: red, black, green, and white. Consider the following data regarding the customer responses:

         Four Cylinders   Six Cylinders
Red      143              857
Black    200              800
Green    321              679
White    420              580

NewAuto

a. Construct a clustered-column chart with exterior color as the horizontal variable.

b. What can we infer from the clustered-bar chart in part a?

16. Consider the following survey results regarding smartphone ownership by age:

Age Category   Smartphone (%)   Other Cell Phone (%)   No Cell Phone (%)
18–24          49               46                     5
25–34          58               35                     7
35–44          44               45                     11
45–54          28               58                     14
55–64          22               59                     19
65+            11               45                     44

SmartPhone

a. Construct a stacked-column chart to display the survey data on type of cell-phone ownership. Use Age Category as the variable on the horizontal axis.

b. Construct a clustered column chart to display the survey data. Use Age Category as the variable on the horizontal axis.

c. What can you infer about the relationship between age and smartphone ownership from the column charts in parts a and b? Which column chart (stacked or clustered) is best for interpreting this relationship? Why?

17. The Northwest regional manager of Logan Outdoor Equipment Company has conducted a study to determine how her store managers are allocating their time. A study was undertaken over three weeks that collected the following data related to the percentage of time each store manager spent on the tasks of attending required meetings, preparing business reports, customer interaction, and being idle. The results of the data collection appear in the following table:

            Tasks
Locations   Attending Required Meetings (%)   Preparing Business Reports (%)   Customer Interaction (%)   Idle (%)
Seattle     32                                17                               37                         14
Portland    52                                11                               24                         13
Bend        18                                11                               52                         19
Missoula    21                                6                                43                         30
Boise       12                                14                               64                         10
Olympia     17                                12                               54                         17

Logan

a. Create a stacked-bar chart with locations along the vertical axis. Reformat the bar chart to best display these data by adding axis labels, a chart title, and so on.

b. Create a clustered-bar chart with locations along the vertical axis and clusters of tasks. Reformat the bar chart to best display these data by adding axis labels, a chart title, and the like.

c. Create multiple bar charts in which each location becomes a single bar chart showing the percentage of time spent on tasks. Reformat the bar charts to best display these data by adding axis labels, a chart title, and so forth.

d. Which form of bar chart (stacked, clustered, or multiple) is preferable for these data? Why?

e. What can we infer about the differences among how store managers are allocating their time at the different locations?

18. The Ajax Company uses a portfolio approach to manage their research and development (R&D) projects. Ajax wants to keep a mix of projects to balance the expected return and risk profiles of their R&D activities. Consider a situation in which Ajax has six R&D projects as characterized in the table. Each project is given an expected rate of return and a risk assessment, which is a value between 1 and 10, where 1 is the least risky and 10 is the most risky. Ajax would like to visualize their current R&D projects to keep track of the overall risk and return of their R&D portfolio.

Project   Expected Rate of Return (%)   Risk Estimate   Capital Invested (Millions $)
1         12.6                          6.8             6.4
2         14.8                          6.2             45.8
3         9.2                           4.2             9.2
4         6.1                           6.2             17.2
5         21.4                          8.2             34.2
6         7.5                           3.2             14.8

Ajax

a. Create a bubble chart in which the expected rate of return is along the horizontal axis, the risk estimate is on the vertical axis, and the size of the bubbles represents the amount of capital invested. Format this chart for best presentation by adding axis labels and labeling each bubble with the project number.

b. The efficient frontier of R&D projects represents the set of projects that have the highest expected rate of return for a given level of risk. In other words, any project that has a smaller expected rate of return for an equivalent, or higher, risk estimate cannot be on the efficient frontier. From the bubble chart in part a, which projects appear to be located on the efficient frontier?

19. Heat maps can be very useful for identifying missing data values in moderate to large data sets. The file SurveyResults contains the responses from a marketing survey: 108 individuals responded to the survey of 10 questions. Respondents provided answers of 1, 2, 3, 4, or 5 to each question, corresponding to the overall satisfaction on 10 different dimensions of quality. However, not all respondents answered every question.

SurveyResults

a. To find the missing data values, create a heat map in Excel that shades the empty cells a different color. Use Excel’s Conditional Formatting function to create this heat map. Hint: Click on Conditional Formatting in the Styles group in the Home tab. Select Highlight Cells Rules and click More Rules…. Then enter Blanks in the Format only cells with: box. Select a format for these blank cells that will make them obviously stand out.

b. For each question, which respondents did not provide answers? Which question has the highest nonresponse rate?

20. The following table shows monthly revenue for six different web development companies.

                        Revenue ($)
Company                 Jan      Feb      Mar      Apr      May      Jun
Blue Sky Media          8,995    9,285    11,555   9,530    11,230   13,600
Innovate Technologies   18,250   16,870   19,580   17,260   18,290   16,250
Timmler Company         8,480    7,650    7,023    6,540    5,700    4,930
Accelerate, Inc.        28,325   27,580   23,450   22,500   20,800   19,800
Allen and Davis, LLC    4,580    6,420    6,780    7,520    8,370    10,100
Smith Ventures          17,500   16,850   20,185   18,950   17,520   18,580

WebDevelop

a. Use Excel to create sparklines for sales at each company.

b. Which companies have generally decreasing revenues over the six months? Which company has exhibited the most consistent growth over the six months? Which companies have revenues that are both increasing and decreasing over the six months?

c. Use Excel to create a heat map for the revenue of the six companies. Do you find the heat map or the sparklines to be better at communicating the trend of revenues over the six months for each company? Why?


21. Below is a sample of the data in the file NFLAttendance, which contains the 32 teams in the National Football League, their conference affiliation, their division, and their average home attendance.

Conference   Division   Team                   Average Home Attendance
AFC          West       Oakland                54,584
AFC          West       Los Angeles Chargers   57,024
NFC          North      Chicago                60,368
AFC          North      Cincinnati             60,511
NFC          South      Tampa Bay              60,624
NFC          North      Detroit                60,792
AFC          South      Jacksonville           61,915

NFLAttendance

a. Create a treemap using these data that separates the teams into their conference affiliations (NFC and AFC) and uses size to represent each team’s average home attendance. Note that you will need to sort the data in Excel by Conference to properly create a treemap.

b. Create a sorted bar chart that compares the average home attendance for each team.

c. Comment on the advantages and disadvantages of each type of chart for these data. Which chart best displays these data and why?

22. For this problem we will use the data in the file Global100 that was referenced in Section 3.4 as an example for creating a treemap. Here we will use these data to create a GIS chart. A portion of the data contained in Global100 is shown below.

Continent   Country     Company                      Market Value (Billions US $)
Asia        China       Agricultural Bank of China   141.1
Asia        China       Bank of China                124.2
Asia        China       China Construction Bank      174.4
Asia        China       ICBC                         215.6
Asia        China       PetroChina                   202
Asia        China       Sinopec-China Petroleum      94.7
Asia        China       Tencent Holdings             135.4
Asia        Hong Kong   China Mobile                 184.6
Asia        Japan       Softbank                     91.2
Asia        Japan       Toyota Motor                 193.5

Global100

Use Excel to create a GIS chart that 1) displays the Market Value of companies in different countries as a heat map; 2) allows you to filter the results so that you can choose to add and remove specific continents in your GIS chart; and 3) uses text labels to display which companies are located in each country. To do this you will need to create a 3D Map in Excel. You will then need to click the Change the visualization to Region button, and then add Country to the Location box (and remove Continent from the Location box if it appears there), add Continent to the Filters box and add Market Value (Billions US $) to the Value box. Under Layer Options, you will also need to Customize the Data Card to include Company as a Field for the Custom Tooltip.

a. Display the results of the GIS chart for companies in Europe only. Which country in Europe has the highest total Market Value for Global 100 companies in that country? What is the total market value for Global 100 companies in that country?

b. Add North America in addition to Europe for continents to be displayed. How does the heat map for Europe change? Why does it change in this way?


23. Zeitler’s Department Stores sells its products online and through traditional brick-and-mortar stores. The following parallel-coordinates plot displays data from a sample of 20 customers who purchased clothing from Zeitler’s either online or in-store. The data include variables for the customer’s age, annual income, and the distance from the customer’s home to the nearest Zeitler’s store. According to the parallel-coordinates plot, how are online customers differentiated from in-store customers?

[Figure: parallel-coordinates plot with vertical axes Age, Annual Income ($000), and Distance from Nearest Store (miles); lines are colored by purchase type, In-store or Online.]

24. The file ZeitlersElectronics contains data on customers who purchased electronic equipment either online or in-store from Zeitler’s Department Stores.

a. Create a parallel-coordinates plot for these data. Include vertical axes for the customer’s age, annual income, and distance from nearest store. Color the lines by the type of purchase made by the customer (online or in-store).

b. How does this parallel-coordinates plot compare to the one shown in Problem 23 for clothing purchases? Does the division between online and in-store purchasing habits for customers buying electronics equipment appear to be the same as for customers buying clothing?

c. Parallel-coordinates plots are very useful for interacting with your data to perform analysis. Filter the parallel-coordinates plot so that only customers whose homes are more than 40 miles from the nearest store are displayed. What do you learn from the parallel-coordinates plot about these customers?

25. Aurora Radiological Services is a health care clinic that provides radiological imaging services (such as MRIs, X-rays, and CAT scans) to patients. It is part of Front Range Medical Systems that operates clinics throughout the state of Colorado.

a. What type of key performance indicators and other information would be appropriate to display on a data dashboard to assist the Aurora clinic’s manager in making daily staffing decisions for the clinic?

b. What type of key performance indicators and other information would be appropriate to display on a data dashboard for the CEO of Front Range Medical Systems who oversees the operation of multiple radiological imaging clinics?

26. Bravman Clothing sells high-end clothing products online and through phone orders. Bravman Clothing has taken a sample of 25 customers who placed orders by phone. The file Bravman contains data for each customer purchase, including the wait time the customer experienced when he or she called, the customer’s purchase amount, the customer’s age, and the customer’s credit score. Bravman Clothing would like to analyze these data to try to learn more about their phone customers.

a. Create a scatter-chart matrix for these data. Include the variables wait time, purchase amount, customer age, and credit score.

b. What can you infer about the relationships between these variables from the scatter-chart matrix?

ZeitlersElectronics

Bravman

Problem 24 requires the use of software such as Analytic Solver or JMP Pro.

Problem 26 requires the use of software such as Analytic Solver or JMP Pro.


Case Problem: All-Time Movie Box-Office Data

The motion picture industry is an extremely competitive business. Dozens of movie studios produce hundreds of movies each year, many of which cost hundreds of millions of dollars to produce and distribute. Some of these movies will go on to earn hundreds of millions of dollars in box office revenues, while others will earn much less than their production cost.

Data from 50 of the top box-office-receipt-generating movies are provided in the file Top50Movies. The following table shows the first 10 movies contained in this data set. The categorical variables included in the data set for each movie are the rating and genre. Quantitative variables for the movie’s release year, inflation- and noninflation-adjusted box-office receipts in the United States, budget, and the world box-office receipts are also included.

                                           Inflation Adjusted (Millions $)                      Non-Inflation Adjusted (Millions $)
Title                            Year      Budget  World Box  U.S. Box   Rating  Genre          Budget  World Box  U.S. Box
                                 Released          Office     Office                                    Office     Office
                                                   Receipts   Receipts                                  Receipts   Receipts
Gone With the Wind               1939      13      3,242      1,650      G       Drama          3       391        199
Star Wars                        1977      20      2,468      1,426      PG      SciFi/Fantasy  11      798        461
The Sound of Music               1965      —       1,145      1,145      G       Musical        —       163        163
E.T.                             1982      —       1,970      1,132      PG      SciFi/Fantasy  —       757        435
Titanic                          1997      100     3,636      1,096      PG-13   Drama          200     2,185      659
The Ten Commandments             1956      184     1,053      1,053      G       Drama          14      80         80
Jaws                             1975      26      1,865      1,029      PG      Action         12      471        260
Doctor Zhivago                   1965      96      973        973        PG-13   Drama          11      112        112
The Jungle Book                  1967      —       1,263      871        G       Animated       —       206        142
Snow White and the Seven Dwarfs  1937      5       854        854        G       Animated       1       185        185

Managerial Report

Use the data-visualization methods presented in this chapter to explore these data and discover relationships between the variables. Include the following in your report:

1. Create a scatter chart to examine the relationship between the year released and the inflation-adjusted U.S. box-office receipts. Include a trendline for this scatter chart. What does the scatter chart indicate about inflation-adjusted U.S. box-office receipts over time for these top 50 movies?

Top50Movies


2. Create a scatter chart to examine the relationship between the noninflation-adjusted budget and the noninflation-adjusted world box-office receipts. [Note: You may have to adjust the data in Excel to ignore the missing budget data values to create your scatter chart. You can do this by first sorting the data using Budget (Non-Inflation Adjusted Millions $) and then creating a scatter chart using only the movies that include data for Budget (Non-Inflation Adjusted Millions $).] What does this scatter chart indicate about the relationship between the movie’s budget and the world box-office receipts?

3. Create a scatter chart to examine the relationship between the inflation-adjusted budget and the inflation-adjusted world box-office receipts. What does this scatter chart indicate about the relationship between the movie’s inflation-adjusted budget and the inflation-adjusted world box-office receipts? Is this relationship different than what was shown for the noninflation-adjusted amounts? If so, why?

4. Create a frequency distribution, percent frequency distribution, and histogram for inflation-adjusted U.S. box-office receipts. Use bin sizes of $100 million. Interpret the results. Do any data points appear to be outliers in this distribution?

5. Create a PivotTable for these data. Use the PivotTable to generate a crosstabulation for movie genre and rating. Determine which combinations of genre and rating are most represented in the top 50 movie data. Now filter the data to consider only movies released in 1980 or later. What combinations of genre and rating are most represented for movies released in 1980 or later? What does this indicate about how the preferences of moviegoers may have changed over time?

6. Use the PivotTable to display the average inflation-adjusted U.S. box-office receipts for each genre–rating pair for all movies in the data set. Interpret the results.

Chapter 4 Descriptive Data Mining

CONTENTS

ANALYTICS IN ACTION: ADVICE FROM A MACHINE

4.1 CLUSTER ANALYSIS
Measuring Similarity Between Observations
Hierarchical Clustering
k-Means Clustering
Hierarchical Clustering versus k-Means Clustering

4.2 ASSOCIATION RULES
Evaluating Association Rules

4.3 TEXT MINING
Voice of the Customer at Triad Airline
Preprocessing Text Data for Analysis
Movie Reviews

AVAILABLE IN THE MINDTAP READER:

APPENDIX 4.1: HIERARCHICAL CLUSTERING WITH ANALYTIC SOLVER

APPENDIX 4.2: K-MEANS CLUSTERING WITH ANALYTIC SOLVER

APPENDIX 4.3: ASSOCIATION RULES WITH ANALYTIC SOLVER

APPENDIX 4.4: TEXT MINING WITH ANALYTIC SOLVER

APPENDIX 4.5: OPENING AND SAVING EXCEL FILES IN JMP PRO

APPENDIX 4.6: HIERARCHICAL CLUSTERING WITH JMP PRO

APPENDIX 4.7: K-MEANS CLUSTERING WITH JMP PRO

APPENDIX 4.8: ASSOCIATION RULES WITH JMP PRO

APPENDIX 4.9: TEXT MINING WITH JMP PRO

Over the past few decades, technological advances have led to a dramatic increase in the amount of recorded data. The use of smartphones, radio-frequency identification (RFID) tags, electronic sensors, credit cards, and the Internet has facilitated the collection of data from phone conversations, e-mails, business transactions, product and customer tracking, and web browsing. The increase in the use of data-mining techniques in business has been caused largely by three events: the explosion in the amount of data being produced and electronically tracked, the ability to electronically warehouse these data, and the affordability of computer power to analyze the data. In this chapter, we discuss the analysis of large quantities of data in order to gain insight on customers and to uncover patterns to improve business processes.

We define an observation, or record, as the set of recorded values of variables associated with a single entity. An observation is often displayed as a row of values in a spreadsheet or database in which the columns correspond to the variables. For example, in a university’s database of alumni, an observation may correspond to an alumnus’s age, gender, marital status, employer, position title, as well as size and frequency of donations to the university.

In this chapter, we focus on descriptive data-mining methods, also called unsupervised learning techniques. In an unsupervised learning application, there is no outcome variable to predict; rather, the goal is to use the variable values to identify relationships between observations. Unsupervised learning approaches can be thought of as high-dimensional descriptive analytics because they are designed to describe patterns and relationships in large data sets with many observations of many variables. Without an explicit outcome (or one that is objectively known), there is no definitive measure of accuracy. Instead, qualitative assessments, such as how well the results match expert judgment, are used to assess and compare the results from an unsupervised learning method.

Predictive data mining is discussed in Chapter 9.

Advice from a Machine¹

The proliferation of data and increase in computing power have sparked the development of automated recommender systems, which provide consumers with suggestions for movies, music, books, clothes, restaurants, dating, and whom to follow on Twitter. The sophisticated, proprietary algorithms guiding recommender systems measure the degree of similarity between users or items to identify recommendations of potential interest to a user.

Netflix, a company that provides media content via DVD-by-mail and Internet streaming, provides its users with recommendations for movies and television shows based on each user’s expressed interests and feedback on previously viewed content. As its business has shifted from renting DVDs by mail to streaming content online, Netflix has been able to track its customers’ viewing behavior more closely. This allows Netflix’s recommendations to account for differences in viewing behavior based on the day of the week, the time of day, the device used (computer, phone, television), and even the viewing location.

The use of recommender systems is prevalent in e-commerce. Using attributes detailed by the Music Genome Project, Pandora Internet Radio plays songs with properties similar to songs that a user “likes.” In the online dating world, web sites such as eHarmony, Match.com, and OKCupid use different “formulas” to take into account hundreds of different behavioral traits to propose date “matches.” Stitch Fix, a personal shopping service for women, combines recommendation algorithms and human input from its fashion experts to match its inventory of fashion items to its clients.

ANALYTICS IN ACTION

¹ “The Science Behind the Netflix Algorithms that Decide What You’ll Watch Next,” http://www.wired.com/2013/08/qq_netflix-algorithm. Retrieved on August 7, 2013; E. Colson, “Using Human and Machine Processing in Recommendation Systems,” First AAAI Conference on Human Computation and Crowdsourcing (2013); K. Zhao, X. Wang, M. Yu, and B. Gao, “User Recommendation in Reciprocal and Bipartite Social Networks—A Case Study of Online Dating,” IEEE Intelligent Systems 29, no. 2 (2014).


4.1 Cluster Analysis

The goal of clustering is to segment observations into similar groups based on the observed variables. Clustering can be employed during the data-preparation step to identify variables or observations that can be aggregated or removed from consideration. Cluster analysis is commonly used in marketing to divide consumers into different homogeneous groups, a process known as market segmentation. Identifying different clusters of consumers allows a firm to tailor marketing strategies for each segment. Cluster analysis can also be used to identify outliers, which in a manufacturing setting may represent quality-control problems and in financial transactions may represent fraudulent activity.

In this section, we consider the use of cluster analysis to assist a company called Know Thy Customer (KTC), a financial advising company that provides personalized financial advice to its clients. As a basis for developing this tailored advising, KTC would like to segment its customers into several groups (or clusters) so that the customers within a group are similar with respect to key characteristics and are dissimilar to customers that are not in the group. For each customer, KTC has an observation consisting of the following variables:

Age = age of the customer in whole years
Female = 1 if female, 0 if not
Income = annual income in dollars
Married = 1 if married, 0 if not
Children = number of children
Loan = 1 if customer has a car loan, 0 if not
Mortgage = 1 if customer has a mortgage, 0 if not

We present two clustering methods using a small sample of data from KTC. We first consider bottom-up hierarchical clustering that starts with each observation belonging to its own cluster and then sequentially merges the most similar clusters to create a series of nested clusters. The second method, k-means clustering, assigns each observation to one of k clusters in a manner such that the observations assigned to the same cluster are as similar as possible. Because both methods depend on how two observations are similar, we first discuss how to measure similarity between observations.

Measuring Similarity Between Observations

The goal of cluster analysis is to group observations into clusters such that observations within a cluster are similar and observations in different clusters are dissimilar. Therefore, to formalize this process, we need explicit measurements of similarity or, conversely, dissimilarity. Some metrics track similarity between observations, and a clustering method using such a metric would seek to maximize the similarity between observations. Other metrics measure dissimilarity, or distance, between observations, and a clustering method using one of these metrics would seek to minimize the distance between observations in a cluster.

When observations include numerical variables, Euclidean distance is the most common method to measure dissimilarity between observations. Let observations $u = (u_1, u_2, \ldots, u_q)$ and $v = (v_1, v_2, \ldots, v_q)$ each comprise measurements of $q$ variables. The Euclidean distance between observations $u$ and $v$ is

$$d_{uv} = \sqrt{(u_1 - v_1)^2 + (u_2 - v_2)^2 + \cdots + (u_q - v_q)^2}$$

Figure 4.1 depicts Euclidean distance for two observations consisting of two variable measurements. Euclidean distance becomes smaller as a pair of observations become more similar with respect to their variable values. Euclidean distance is highly influenced by the scale on which variables are measured. For example, consider the task of clustering customers on the basis of the variables Age and Income. Let observation u = (23, $20,375) correspond to a 23-year-old customer with an annual income of $20,375 and observation v = (36, $19,475) correspond to a 36-year-old with an annual income of $19,475. As measured by Euclidean distance, the dissimilarity between these two observations is

$$d_{uv} = \sqrt{(23 - 36)^2 + (20{,}375 - 19{,}475)^2} = \sqrt{169 + 811{,}441} = 901$$

DemoKTC

Thus, we see that when using the raw variable values, the amount of dissimilarity between observations is dominated by the Income variable because of the difference in the magnitude of the measurements. Therefore, it is common to standardize the units of each variable $j$ of each observation $u$. That is, $u_j$, the value of variable $j$ in observation $u$, is replaced with its z-score $z_j$. For the data in DemoKTC, the standardized (or normalized) values of observations u and v are (−1.76, −0.56) and (−0.76, −0.62), respectively. The dissimilarity between these two observations based on standardized values is

$$(\text{standardized})\; d_{uv} = \sqrt{(-1.76 - (-0.76))^2 + (-0.56 - (-0.62))^2} = \sqrt{0.994 + 0.004} = 0.998$$

Based on standardized variable values, we observe that observations u and v are actually much more different in age than in income.
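The scale effect described above can be checked with a few lines of Python. This is only a sketch: the function name `euclidean` is an illustrative choice, and the inputs are the rounded values quoted in the text, so the results (roughly 900 for the raw values and roughly 1.0 for the z-scores) differ slightly from the 901 and 0.998 reported from the underlying data file.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two observations with the same variables."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Raw (Age, Income) values for the two customers quoted in the text.
raw = euclidean((23, 20375), (36, 19475))

# Rounded z-scores quoted in the text for the same two customers.
standardized = euclidean((-1.76, -0.56), (-0.76, -0.62))

print(f"raw distance:          {raw:.2f}")
print(f"standardized distance: {standardized:.3f}")
```

In the raw calculation the income term contributes almost all of the squared distance, whereas in the standardized calculation the age term dominates.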

The conversion to z-scores also makes it easier to identify outlier measurements, which can distort the Euclidean distance between observations. After conversion to z-scores, unequal weighting of variables can also be considered by multiplying the variables of each observation by a selected set of weights. For instance, after standardizing the units on customer observations so that income and age are expressed as their respective z-scores (instead of expressed in dollars and years), we can multiply the income z-scores by 2 if we wish to treat income with twice the importance of age. In other words, standardizing removes bias due to the difference in measurement units, and variable weighting allows the analyst to introduce appropriate bias based on the business context.
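A minimal sketch of standardizing and then weighting variables follows. The five customers are hypothetical (they are not taken from DemoKTC), the helper names are illustrative, and the z-scores are assumed to use the sample standard deviation.

```python
import math
from statistics import mean, stdev

def zscores(values):
    """Convert a list of numbers to z-scores (sample standard deviation assumed)."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]

def weighted_distance(u, v, weights):
    """Euclidean distance after multiplying each variable by its weight."""
    return math.sqrt(sum((w * (a - b)) ** 2 for a, b, w in zip(u, v, weights)))

# Hypothetical customers: ages in years and annual incomes in dollars.
ages = [23, 36, 45, 52, 61]
incomes = [20375, 19475, 48000, 61000, 39000]

z_age, z_income = zscores(ages), zscores(incomes)
customers = list(zip(z_age, z_income))

# Equal weights versus treating income with twice the importance of age.
equal = weighted_distance(customers[0], customers[1], weights=(1, 1))
income_heavy = weighted_distance(customers[0], customers[1], weights=(1, 2))
print(f"equal weights: {equal:.3f}, income weighted x2: {income_heavy:.3f}")
```

Doubling the income weight can only keep or increase the distance between two customers, because it scales one of the squared terms by a factor of four.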

When clustering observations solely on the basis of categorical variables encoded as 0–1 (or dummy variables), a better measure of similarity between two observations can be achieved by counting the number of variables with matching values. The simplest overlap measure is called the matching coefficient and is computed as follows:

Refer to Chapter 2 for a discussion of z-scores.

FIGURE 4.1 Euclidean Distance [Plot of observations u = (u1, u2) and v = (v1, v2) against axes First Variable (horizontal) and Second Variable (vertical), with the distance duv marked.]

MATCHING COEFFICIENT

$$\frac{\text{number of variables with matching value for observations } u \text{ and } v}{\text{total number of variables}}$$


One weakness of the matching coefficient is that if two observations both have a 0 entry for a categorical variable, this is counted as a sign of similarity between the two observations. However, matching 0 entries do not necessarily imply similarity. For instance, if the categorical variable is Own A Minivan, then a 0 entry in two different observations does not mean that these two people own the same type of car; it means only that neither owns a minivan. To avoid misstating similarity due to the absence of a feature, a similarity measure called Jaccard’s coefficient does not count matching zero entries and is computed as follows:

JACCARD’S COEFFICIENT

$$\frac{\text{number of variables with matching nonzero value for observations } u \text{ and } v}{(\text{total number of variables}) - (\text{number of variables with matching zero values for observations } u \text{ and } v)}$$

For five customer observations from the file DemoKTC, Table 4.1 contains observations of the binary variables Female, Married, Loan, and Mortgage and the similarity matrixes corresponding to the matching coefficient and Jaccard’s coefficient, respectively. Based on the matching coefficient, Observation 1 and Observation 4 are more similar (0.75) than Observation 2 and Observation 3 (0.5) because 3 out of 4 variable values match between Observation 1 and Observation 4 versus just 2 matching values out of 4 for Observation 2 and Observation 3. However, based on Jaccard’s coefficient, Observation 1 and Observation 4 are just as similar (0.5) as Observation 2 and Observation 3 (0.5) because Jaccard’s coefficient discards the matching zero values for the Loan and Mortgage variables for Observation 1 and Observation 4. In the context of this example, the choice between the matching coefficient and Jaccard’s coefficient depends on whether KTC believes that matching 0 entries imply similarity. That is, KTC must gauge whether meaningful similarity is implied if a pair of observations are not female, not married, do not have a car loan, or do not have a mortgage.

TABLE 4.1 Comparison of Similarity Matrixes for Observations with Binary Variables

Observation   Female   Married   Loan   Mortgage
1             1        0         0      0
2             0        1         1      1
3             1        1         1      0
4             1        1         0      0
5             1        1         0      0

Similarity Matrix Based on Matching Coefficient

Observation   1       2       3       4       5
1             1
2             0       1
3             0.5     0.5     1
4             0.75    0.25    0.75    1
5             0.75    0.25    0.75    1       1

Similarity Matrix Based on Jaccard’s Coefficient

Observation   1       2       3       4       5
1             1
2             0       1
3             0.333   0.5     1
4             0.5     0.25    0.667   1
5             0.5     0.25    0.667   1       1
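The entries of Table 4.1 can be reproduced with a short Python sketch of the two coefficients. The function names are illustrative, and returning 0.0 when every variable is a matching zero is a convention chosen here, not something the text specifies.

```python
def matching_coefficient(u, v):
    """Share of variables on which two 0-1 observations agree."""
    return sum(1 for a, b in zip(u, v) if a == b) / len(u)

def jaccard_coefficient(u, v):
    """Matching nonzero entries divided by variables that are not matching zeros."""
    both_one = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    both_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    denominator = len(u) - both_zero
    # Assumption: define the coefficient as 0.0 when all entries are matching zeros.
    return both_one / denominator if denominator else 0.0

# (Female, Married, Loan, Mortgage) for the five observations in Table 4.1.
observations = {
    1: (1, 0, 0, 0),
    2: (0, 1, 1, 1),
    3: (1, 1, 1, 0),
    4: (1, 1, 0, 0),
    5: (1, 1, 0, 0),
}

# Print the lower triangle of both similarity matrixes.
for i in observations:
    for j in observations:
        if j < i:
            m = matching_coefficient(observations[i], observations[j])
            jc = jaccard_coefficient(observations[i], observations[j])
            print(f"obs {i} vs obs {j}: matching = {m:.2f}, Jaccard = {jc:.3f}")
```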


Hierarchical Clustering

We consider a bottom-up hierarchical clustering approach that starts with each observation in its own cluster and then iteratively combines the two clusters that are the most similar into a single cluster. Each iteration corresponds to an increased level of aggregation by decreasing the number of distinct clusters. Hierarchical clustering determines the similarity of two clusters by considering the similarity between the observations composing either cluster. Given a way to measure similarity between observations (Euclidean distance, matching coefficients, or Jaccard’s coefficients), there are several hierarchical clustering method alternatives for comparing observations in two clusters to obtain a cluster similarity measure. Using Euclidean distance to illustrate, Figure 4.2 provides a two-dimensional depiction of four methods we will discuss.

When using the single linkage clustering method, the similarity between two clusters is defined by the similarity of the pair of observations (one from each cluster) that are the most similar. Thus, single linkage will consider two clusters to be close if an observation in one of the clusters is close to at least one observation in the other cluster. However, a cluster formed by merging two clusters that are close with respect to single linkage may also consist of pairs of observations that are very different. The reason is that there is no consideration of how different an observation may be from other observations in a cluster as long as it is similar to at least one observation in that cluster. Thus, in two dimensions (variables), single linkage clustering can result in elongated clusters rather than compact, circular clusters.

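FIGURE 4.2 Measuring Similarity Between Clusters [Four panels showing two clusters, one containing observations 1, 2, and 3 and the other containing observations 4, 5, and 6: Single Linkage, d3,4; Complete Linkage, d1,6; Group Average Linkage, (d1,4 + d1,5 + d1,6 + d2,4 + d2,5 + d2,6 + d3,4 + d3,5 + d3,6) / 9; Centroid Linkage, dc1,c2.]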


The complete linkage clustering method defines the similarity between two clusters as the similarity of the pair of observations (one from each cluster) that are the most different. Thus, complete linkage will consider two clusters to be close if their most-different pair of observations are close. This method produces clusters such that all member observations of a cluster are relatively close to each other. The clusters produced by complete linkage have approximately equal diameters. However, clustering created with complete linkage can be distorted by outlier observations.

The single linkage and complete linkage methods define between-cluster similarity based on the single pair of observations in two different clusters that are most similar or least similar. In contrast, the group average linkage clustering method defines the similarity between two clusters to be the average similarity computed over all pairs of observations between the two clusters. If Cluster 1 consists of $n_1$ observations and Cluster 2 consists of $n_2$ observations, the similarity of these clusters would be the average of $n_1 \times n_2$ similarity measures. This method produces clusters that are less dominated by the similarity between single pairs of observations. The median linkage method is analogous to group average linkage except that it uses the median of the similarities computed between all pairs of observations between the two clusters. The use of the median reduces the effect of outliers.

Centroid linkage uses the averaging concept of cluster centroids to define between-cluster similarity. The centroid for cluster k, denoted as $c_k$, is found by calculating the average value for each variable across all observations in a cluster; that is, a centroid is the average observation of a cluster. The similarity between cluster k and cluster j is then defined as the similarity of the centroids $c_k$ and $c_j$.
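The linkage definitions above can be compared side by side on a small example. The following is a minimal sketch, not the textbook's software: it assumes Euclidean distance as the dissimilarity measure (so "most similar" means "smallest distance"), and the two clusters and all function names are hypothetical.

```python
from itertools import product
from math import dist
from statistics import mean, median


def pairwise_distances(c1, c2):
    """Euclidean distance for every pair with one observation from each cluster."""
    return [dist(a, b) for a, b in product(c1, c2)]


def centroid(cluster):
    """Average observation: the mean of each variable across the cluster."""
    return tuple(mean(values) for values in zip(*cluster))


def single_linkage(c1, c2):
    return min(pairwise_distances(c1, c2))       # most similar pair


def complete_linkage(c1, c2):
    return max(pairwise_distances(c1, c2))       # most different pair


def group_average_linkage(c1, c2):
    return mean(pairwise_distances(c1, c2))      # average over n1 * n2 pairs


def median_linkage(c1, c2):
    return median(pairwise_distances(c1, c2))    # median over n1 * n2 pairs


def centroid_linkage(c1, c2):
    return dist(centroid(c1), centroid(c2))      # distance between centroids


# Hypothetical two-variable observations (not taken from the text).
cluster_1 = [(0, 0), (0, 1), (1, 0)]
cluster_2 = [(3, 0), (4, 0), (4, 1)]

for measure in (single_linkage, complete_linkage, group_average_linkage,
                median_linkage, centroid_linkage):
    print(f"{measure.__name__:>22}: {measure(cluster_1, cluster_2):.3f}")
```

For any pair of clusters, the group average and median values fall between the single linkage and complete linkage values, which is one way to see why they are less sensitive to a single extreme pair of observations.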

Ward’s method merges two clusters such that the dissimilarity of the observations within the resulting single cluster increases as little as possible. It tends to produce clearly defined clusters of similar size. For a pair of clusters under consideration for aggregation, Ward’s method computes the centroid of the resulting merged cluster and then calculates the sum of squared dissimilarity between this centroid and each observation in the union of the two clusters. Representing observations within a cluster with the centroid can be viewed as a loss of information in the sense that the individual differences in these observations will not be captured by the cluster centroid. Hierarchical clustering using Ward’s method results in a sequence of aggregated clusters that minimizes this loss of information between the individual observation level and the cluster centroid level.
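To make the "loss of information" idea concrete, the sketch below computes the sum of squared distances to the centroid before and after a candidate merge. It assumes squared Euclidean distance as the dissimilarity; the clusters `a`, `b`, and `c` and the function names are hypothetical.

```python
def centroid(cluster):
    """Mean of each variable across the observations in the cluster."""
    return tuple(sum(values) / len(values) for values in zip(*cluster))


def sum_of_squares(cluster):
    """Sum of squared Euclidean distances from each observation to the centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(point, c)) for point in cluster)


def ward_increase(c1, c2):
    """Increase in within-cluster sum of squares caused by merging c1 and c2."""
    return sum_of_squares(c1 + c2) - sum_of_squares(c1) - sum_of_squares(c2)


# Hypothetical clusters: a and c are neighbors, b is farther away.
a = [(0, 0), (0, 2)]
b = [(4, 0), (4, 2)]
c = [(1, 0), (1, 2)]

for name, (left, right) in {"a+b": (a, b), "a+c": (a, c), "b+c": (b, c)}.items():
    print(name, ward_increase(left, right))
```

Under these assumptions, the merge with the smallest increase (here `a` with `c`) is the one Ward's method would choose at this step.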

When McQuitty’s method considers merging two clusters A and B, the dissimilarity of the resulting cluster AB to any other cluster C is calculated as ((dissimilarity between A and C) + (dissimilarity between B and C)) / 2. At each step, this method then merges the pair of clusters that results in the minimal increase in total dissimilarity between the newly merged cluster and all the other clusters.

Returning to our example, KTC is interested in developing customer segments based on gender, marital status, and whether the customer is repaying a car loan and a mortgage. Using data in the file DemoKTC, we base the clusters on a collection of 0–1 categorical variables (Female, Married, Loan, and Mortgage). We use the matching coefficient to measure similarity between observations and the group average linkage clustering method to measure similarity between clusters. The choice of the matching coefficient (over Jaccard’s coefficient) is reasonable because a pair of customers that both have an entry of zero for any of these four variables implies some degree of similarity. For example, two customers that both have zero entries for Mortgage means that neither has significant debt associated with a mortgage.
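The difference between the two coefficients is easiest to see on a pair of 0–1 records. The sketch below uses the standard definitions (matching counts 0–0 agreements as similarity, Jaccard ignores them); the two customer records are hypothetical, and returning 0.0 when both records are all zeros is an assumption made here, since Jaccard's coefficient is undefined in that case.

```python
def matching_coefficient(u, v):
    """Share of variables on which the two observations agree (1-1 or 0-0)."""
    return sum(a == b for a, b in zip(u, v)) / len(u)


def jaccard_coefficient(u, v):
    """Share of 1-1 matches among variables that are not 0 for both observations."""
    both_one = sum(a == 1 and b == 1 for a, b in zip(u, v))
    not_both_zero = sum(a == 1 or b == 1 for a, b in zip(u, v))
    # Assumption: treat two all-zero records as having no measurable similarity.
    return both_one / not_both_zero if not_both_zero else 0.0


# Hypothetical customers, coded as (Female, Married, Loan, Mortgage).
customer_a = (1, 1, 0, 0)   # married female, no car loan, no mortgage
customer_b = (1, 0, 0, 0)   # unmarried female, no car loan, no mortgage

print("matching:", matching_coefficient(customer_a, customer_b))
print("Jaccard: ", jaccard_coefficient(customer_a, customer_b))
print("matching dissimilarity:", 1 - matching_coefficient(customer_a, customer_b))
```

The shared zeros for Loan and Mortgage raise the matching coefficient but are ignored by Jaccard's coefficient, which is why the matching coefficient suits the KTC variables.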

Figure 4.3 depicts a dendrogram to visually summarize the output from a hierarchical clustering using the matching coefficient to measure similarity between observations and the group average linkage clustering method to measure similarity between clusters. A dendrogram is a chart that depicts the set of nested clusters resulting at each step of aggregation. The horizontal axis of the dendrogram lists the observation indexes. The vertical axis of the dendrogram represents the dissimilarity (distance) resulting from a merger of two different groups of observations. Each blue horizontal line in the dendrogram represents a merger of two (or more) clusters, where the observations composing the merged clusters are connected to the blue horizontal line with a blue vertical line.

FIGURE 4.3 Dendrogram for KTC Using Matching Coefficients and Group Average Linkage

For example, the blue horizontal line connecting observations 4, 5, 6, 11, 19, and 28 conveys that these six observations are grouped together and the resulting cluster has a dissimilarity measure of 0. A dissimilarity of 0 results from this merger because these six observations have identical values for the Female, Married, Loan, and Mortgage variables. In this case, each of these six observations corresponds to a married female with no car loan and no mortgage. Following the blue vertical line up from the cluster of {4, 5, 6, 11, 19, 28}, another blue horizontal line connects this cluster with the cluster consisting solely of Observation 1. Thus, the cluster {4, 5, 6, 11, 19, 28} and cluster {1} are merged, resulting in a dissimilarity of 0.25. The dissimilarity of 0.25 results from this merger because Observation 1 differs in one out of the four categorical variable values; Observation 1 is an unmarried female with no car loan and no mortgage.

To interpret a dendrogram at a specific level of aggregation, it is helpful to visualize a horizontal line such as one of the black dashed lines we have drawn across Figure 4.3. The bottom horizontal black dashed line intersects with the vertical branches in the dendrogram three times; each intersection corresponds to a cluster containing the observations connected by the vertical branch that is intersected. The composition of these three clusters is as follows:

Cluster 1: {4, 5, 6, 11, 19, 28, 1, 7, 21, 22, 23, 30, 13, 17, 18, 15, 27} = mix of males and females, 15 out of 17 married, no car loans, 5 out of 17 with mortgages

Cluster 2: {2, 26, 8, 10, 20, 25} = all males with car loans, 5 out of 6 married, 2 out of 6 with mortgages

Cluster 3: {3, 9, 14, 16, 12, 24, 29} = all females with car loans, 4 out of 7 married, 5 out of 7 with mortgages

These clusters segment KTC’s customers into three groups that could possibly indicate varying levels of responsibility, an important factor to consider when providing financial advice.


The nested construction of the hierarchical clusters allows KTC to identify different numbers of clusters and assess (often qualitatively) the implications. By sliding a horizontal line up or down the vertical axis of a dendrogram and observing the intersection of the horizontal line with the vertical dendrogram branches, an analyst can extract varying numbers of clusters. Note that sliding up to the position of the top horizontal black line in Figure 4.3 results in merging Cluster 2 with Cluster 3 into a single, more dissimilar, cluster. The vertical distance between the points of agglomeration is the “cost” of merging clusters in terms of decreased homogeneity within clusters. Thus, vertically elongated portions of the dendrogram represent mergers of more dissimilar clusters, and vertically compact portions of the dendrogram represent mergers of more similar clusters. A cluster’s durability (or strength) can be measured by the difference between the distance value at which a cluster is originally formed and the distance value at which it is merged with another cluster. Figure 4.3 shows that the cluster consisting of {12, 24, 29} (single females with car loans and mortgages) is a very durable cluster in this example because the vertical line for this cluster is very long before it is merged with another cluster.
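Cutting a dendrogram at a given level is equivalent to stopping the agglomeration once the desired number of clusters remains. The sketch below is a minimal, unoptimized illustration of that idea using matching dissimilarity and group average linkage; the six customer records are hypothetical (they are not the DemoKTC data), ties are broken by taking the first closest pair found, and the function names are assumptions.

```python
from itertools import combinations


def matching_dissimilarity(u, v):
    """Share of 0-1 variables on which two observations differ."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def group_average(c1, c2, data):
    """Average dissimilarity over all pairs with one observation from each cluster."""
    total = sum(matching_dissimilarity(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(data, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [(i,) for i in range(len(data))]   # start: every observation alone
    merges = []                                    # (cluster, cluster, merge height)
    while len(clusters) > k:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: group_average(pair[0], pair[1], data))
        merges.append((a, b, group_average(a, b, data)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return clusters, merges


# Hypothetical customers, coded as (Female, Married, Loan, Mortgage).
customers = [
    (1, 1, 0, 0), (1, 1, 0, 0), (1, 0, 0, 0),
    (0, 1, 1, 0), (0, 1, 1, 0), (0, 1, 1, 1),
]

clusters, merges = agglomerate(customers, k=2)
print("clusters:", clusters)
for a, b, height in merges:
    print(f"merged {a} and {b} at dissimilarity {height:.2f}")
```

The recorded merge heights play the role of the vertical axis of a dendrogram: a cluster's durability is the gap between the height at which it forms and the height at which it is next merged.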

k-Means Clustering

In k-means clustering, the analyst must specify the number of clusters, k. If the number of clusters, k, is not clearly established by the context of the business problem, the k-means clustering algorithm can be repeated for several values of k. Given a value of k, the k-means algorithm randomly assigns each observation to one of the k clusters. After all observations have been assigned to a cluster, the resulting cluster centroids are calculated (these cluster centroids are the “means” of k-means clustering). Using the updated cluster centroids, all observations are reassigned to the cluster with the closest centroid (where Euclidean distance is the standard metric). The algorithm repeats this process (calculate cluster centroid, assign each observation to the cluster with nearest centroid) until there is no change in the clusters or a specified maximum number of iterations is reached.
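The loop described above (assign, compute centroids, reassign, repeat) can be sketched in a few lines. This is an illustrative implementation under stated assumptions rather than the algorithm used by any particular software package: the optional `initial_labels` argument is added here only so the example is reproducible, an empty cluster is re-seeded with a randomly chosen observation, and the data points are hypothetical.

```python
import random
from math import dist


def centroid(points):
    """Mean of each variable across the given observations."""
    return tuple(sum(values) / len(values) for values in zip(*points))


def kmeans(points, k, initial_labels=None, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: assign each observation to one of the k clusters (randomly by default).
    labels = (list(initial_labels) if initial_labels is not None
              else [rng.randrange(k) for _ in points])
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid of each cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = centroid(members)
            elif centroids[j] is None:
                centroids[j] = rng.choice(points)   # assumption: re-seed empty cluster
        # Step 3: reassign every observation to the cluster with the nearest centroid.
        new_labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:                    # stop when nothing changes
            break
        labels = new_labels
    return labels, centroids


# Hypothetical standardized observations forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels, centroids = kmeans(points, k=2, initial_labels=[0, 1, 0, 1, 0, 1])
print(labels)
print(centroids)
```

Because the starting assignment is random in practice, different runs can converge to different clusterings, which is one reason the algorithm is often repeated several times.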

As an unsupervised learning technique, cluster analysis is not guided by any explicit measure of accuracy, and thus the notion of a “good” clustering is subjective and is dependent on what the analyst hopes the cluster analysis will uncover. Regardless, one can measure the strength of a cluster by comparing the average distance in a cluster to the distance between cluster centroids. One rule of thumb is that the ratio of between-cluster distance (as measured by the distance between cluster centroids) to average within-cluster distance should exceed 1.0 for useful clusters.

A wide disparity in cluster strength across a set of clusters may make it possible to find a better clustering of the data by removing all members of the strong clusters and then continuing the clustering process on the remaining observations.

FIGURE 4.4 Clustering Observations by Age and Income Using k-Means Clustering with k = 3

[Figure: scatter plot of Income ($5,000 to $65,000) versus Age (years, 30 to 70) showing Cluster 1, Cluster 2, and Cluster 3.]

Cluster centroids are depicted by circles in Figure 4.4. Although Figure 4.4 is plotted in the original scale of the variables, the clustering was based on the variables after standardizing (normalizing) their values.

To illustrate k-means clustering, we consider a 3-means clustering of a small sample of KTC’s customer data in the file DemoKTC. Figure 4.4 shows three clusters based on customer income and age. Cluster 1 is characterized by relatively younger, lower-income customers (Cluster 1’s centroid is at [33, $20,364]). Cluster 2 is characterized by relatively older, higher-income customers (Cluster 2’s centroid is at [58, $47,729]). Cluster 3 is characterized by relatively older, lower-income customers (Cluster 3’s centroid is at [53, $21,416]). As visually corroborated by Figure 4.4, Table 4.2 shows that Cluster 2 is the smallest but most heterogeneous cluster. We also observe that Cluster 1 is the largest cluster and Cluster 3 is the most homogeneous cluster. Table 4.3 displays the distance between each pair of cluster centroids to demonstrate how distinct the clusters are from each other. Cluster 1 and Cluster 2 are the most distinct from each other. To evaluate the strength of the clusters, we compare the average distance within each cluster (Table 4.2) to the distances between cluster centroids (Table 4.3). For example, although Cluster 2 is the most heterogeneous, with an average distance between observations of 0.739, comparing this to the distance between the Cluster 2 and Cluster 3 centroids (1.964) reveals that on average an observation in Cluster 2 is approximately 2.66 times closer to the Cluster 2 centroid than to the Cluster 3 centroid. In general, the larger the ratio of the distance between a pair of cluster centroids to the average within-cluster distance, the more distinct the clustering is for the observations in the two clusters in the pair. Although qualitative considerations should take priority in evaluating clusters, using the ratios of between-cluster distance to average within-cluster distance provides some guidance in determining k, the number of clusters.
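The ratio comparison can be carried out for every pair of clusters at once. The sketch below uses the values reported in Tables 4.2 and 4.3; the dictionary layout and the function name `separation_ratio` are assumptions made for illustration.

```python
# Average within-cluster distances (Table 4.2) and centroid distances (Table 4.3).
within = {1: 0.622, 2: 0.739, 3: 0.520}
between = {(1, 2): 2.784, (1, 3): 1.529, (2, 3): 1.964}


def separation_ratio(cluster, other):
    """Distance between the two centroids divided by the average
    within-cluster distance of `cluster`."""
    key = tuple(sorted((cluster, other)))
    return between[key] / within[cluster]


for cluster in within:
    for other in within:
        if cluster != other:
            ratio = separation_ratio(cluster, other)
            print(f"Cluster {cluster} vs. Cluster {other}: {ratio:.2f}")
```

Every ratio exceeds the rule-of-thumb threshold of 1.0, and the Cluster 2 versus Cluster 3 ratio matches the 2.66 cited in the text.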

Hierarchical Clustering versus k-Means Clustering

If you have a small data set (e.g., fewer than 500 observations) and want to easily examine solutions with increasing numbers of clusters, you may want to use hierarchical clustering. Hierarchical clusters are also convenient if you want to observe how clusters are nested. However, hierarchical clustering can be very sensitive to outliers, and clusters may change dramatically if observations are eliminated from (or added to) the data set. If you know how many clusters you want and you have a larger data set (e.g., more than 500 observations), you may choose to use k-means clustering. Recall that k-means clustering partitions the observations, which is appropriate if you are trying to summarize the data with k “average” observations that describe the data with the minimum amount of error. However, k-means clustering is generally not appropriate for binary or ordinal data, for which an “average” is not meaningful.

TABLE 4.2 Average Distances within Clusters

            No. of Observations   Average Distance Between Observations in Cluster
Cluster 1   12                    0.622
Cluster 2   8                     0.739
Cluster 3   10                    0.520

TABLE 4.3 Distances Between Cluster Centroids

            Cluster 1   Cluster 2   Cluster 3
Cluster 1   0           2.784       1.529
Cluster 2   2.784       0           1.964
Cluster 3   1.529       1.964       0

Tables 4.2 and 4.3 are expressed in terms of standardized coordinates in order to eliminate any distortion resulting from differences in the scale of the input variables.
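A short sketch shows why standardizing matters before computing distances. It assumes z-score standardization with the sample standard deviation (the text does not say which variant the software uses), and the ages and incomes are hypothetical values, not the DemoKTC data.

```python
from math import dist
from statistics import mean, stdev


def standardize(values):
    """Convert values to z-scores: subtract the mean, divide by the
    sample standard deviation (an assumption of this sketch)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical customers.
ages = [25, 35, 45, 55, 65]
incomes = [20000, 30000, 40000, 50000, 60000]

z_ages, z_incomes = standardize(ages), standardize(incomes)

# Distance between the first two customers, before and after standardizing.
raw = dist((ages[0], incomes[0]), (ages[1], incomes[1]))
scaled = dist((z_ages[0], z_incomes[0]), (z_ages[1], z_incomes[1]))
print(f"raw distance: {raw:.1f}  (dominated by income)")
print(f"standardized distance: {scaled:.3f}")
```

In the raw scale the 10-year age gap contributes almost nothing next to the $10,000 income gap; after standardizing, both variables contribute equally in this example.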


NOTES + COMMENTS

Clustering observations based on both numerical and categorical variables (mixed data) can be challenging. Dissimilarity between observations with numerical variables is commonly computed using Euclidean distance. However, Euclidean distance is not well defined for categorical variables as the magnitude of the Euclidean distance measure between two category values will depend on the numerical encoding of the categories. There are elaborate methods beyond the scope of this book to try to address the challenge of clustering mixed data.

Using the methods introduced in this section, there are two alternative approaches to clustering mixed data. The first approach is to decompose the clustering into two steps. The first step applies hierarchical clustering of the observations only on categorical variables using an appropriate measure (matching coefficients or Jaccard’s coefficients) to identify a set of “first-step” clusters. The second step is to apply k-means clustering (or hierarchical clustering again) separately to each of these “first-step” clusters using only the numerical variables. This decomposition approach is not fail-safe as it fixes clusters with respect to one variable type before clustering with respect to the other variable type, but it does allow the analyst to identify how the observations are similar or different with respect to the two variable types.

A second approach to clustering mixed data is to numerically encode the categorical values (e.g., binary coding, ordinal coding) and then to standardize both the categorical and numerical variable values. To reflect relative importance of the variables, the analyst may experiment with various weightings of the variables and apply hierarchical or k-means clustering. This approach is very experimental and the variable weights are subjective.

4.2 Association Rules

In marketing, analyzing consumer behavior can lead to insights regarding the placement and promotion of products. Specifically, marketers are interested in examining transaction data on customer purchases to identify the products commonly purchased together. Bar-code scanners facilitate the collection of retail transaction data, and membership in a customer’s loyalty program can further associate the transaction with a specific customer. In this section, we discuss the development of probabilistic if–then statements, called association rules, which convey the likelihood of certain items being purchased together. Although association rules are an important tool in market basket analysis, they are also applicable to disciplines other than marketing. For example, association rules can assist medical researchers in understanding which treatments have been commonly prescribed to certain patient symptoms (and the resulting effects).

Hy-Vee grocery store would like to gain insight into its customers’ purchase patterns to possibly improve its in-aisle product placement and cross-product promotions. Table 4.4 contains a small sample of data in which each transaction comprises the items purchased by a shopper in a single visit to a Hy-Vee. An example of an association rule from this data would be “if {bread, jelly}, then {peanut butter},” meaning that “if a transaction includes bread and jelly, then it also includes peanut butter.” The collection of items (or item set) corresponding to the if portion of the rule, {bread, jelly}, is called the antecedent. The item set corresponding to the then portion of the rule, {peanut butter}, is called the consequent.

Typically, only association rules for which the consequent consists of a single item are considered because these are more actionable. Although the number of possible association rules can be overwhelming, we typically investigate only association rules that involve antecedent and consequent item sets that occur together frequently. To formalize the notion of “frequent,” we define the support count of an item set as the number of transactions in the data that include that item set. In Table 4.4, the support count of {bread, jelly} is 4. The potential impact of an association rule is often governed by the number of transactions it may affect, which is measured by computing the support count of the item set consisting of the union of its antecedent and consequent. Investigating the rule “if {bread, jelly}, then {peanut butter}” from Table 4.4, we see the support count of {bread, jelly, peanut butter} is 2. By only considering rules involving item sets with a support above a minimum level, inexplicable rules capturing random noise in the data can generally be avoided. A rule of thumb is to consider only association rules with a support count of at least 20% of the total number of transactions. If an item set is particularly valuable and represents a lucrative opportunity, then the minimum support count used to filter the rules is often lowered.

Support is also sometimes expressed as the percentage of total transactions containing an item set.
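Support counts are simple to compute once each transaction is stored as a set of items. The sketch below uses the ten shopping carts from Table 4.4; the variable and function names are assumptions made for illustration.

```python
# The ten shopping carts from Table 4.4, one set of items per transaction.
transactions = [
    {"bread", "peanut butter", "milk", "fruit", "jelly"},
    {"bread", "jelly", "soda", "potato chips", "milk", "fruit",
     "vegetables", "peanut butter"},
    {"whipped cream", "fruit", "chocolate sauce", "beer"},
    {"steak", "jelly", "soda", "potato chips", "bread", "fruit"},
    {"jelly", "soda", "peanut butter", "milk", "fruit"},
    {"jelly", "soda", "potato chips", "milk", "bread", "fruit"},
    {"fruit", "soda", "potato chips", "milk"},
    {"fruit", "soda", "peanut butter", "milk"},
    {"fruit", "cheese", "yogurt"},
    {"yogurt", "vegetables", "beer"},
]


def support_count(item_set):
    """Number of transactions that contain every item in item_set."""
    return sum(set(item_set) <= cart for cart in transactions)


print(support_count({"bread", "jelly"}))                    # antecedent
print(support_count({"bread", "jelly", "peanut butter"}))   # antecedent and consequent
print(support_count({"bread", "jelly"}) / len(transactions))  # support as a fraction
```

The first two values correspond to the support counts of 4 and 2 discussed above.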

TABLE 4.4 Shopping-Cart Transactions

Transaction   Shopping Cart
1             bread, peanut butter, milk, fruit, jelly
2             bread, jelly, soda, potato chips, milk, fruit, vegetables, peanut butter
3             whipped cream, fruit, chocolate sauce, beer
4             steak, jelly, soda, potato chips, bread, fruit
5             jelly, soda, peanut butter, milk, fruit
6             jelly, soda, potato chips, milk, bread, fruit
7             fruit, soda, potato chips, milk
8             fruit, soda, peanut butter, milk
9             fruit, cheese, yogurt
10            yogurt, vegetables, beer

The data in Table 4.4 are in item list format; that is, each transaction row corresponds to a list of item names. Alternatively, the data can be represented in binary matrix format, in which each row is a transaction record and the columns correspond to each distinct item. A third approach is to store the data in stacked form in which each row is an ordered pair; the first entry is the transaction number and the second entry is the item.

To help identify reliable association rules, we define the measure of confidence of a rule, which is computed as

CONFIDENCE

$$\text{confidence} = \frac{\text{support of \{antecedent and consequent\}}}{\text{support of antecedent}}$$

This measure of confidence can be viewed as the conditional probability of the consequent item set occurring given that the antecedent item set occurs (conditional probability is discussed in more detail in Chapter 5). A high value of confidence suggests a rule in which the consequent is frequently true when the antecedent is true, but a high value of confidence can be misleading. For example, if the support of the consequent is high (that is, the item set corresponding to the then part is very frequent), then the confidence of the association rule could be high even if there is little or no association between the items. In Table 4.4, the rule “if {cheese}, then {fruit}” has a confidence of 1.0 (or 100%). This is misleading because {fruit} is a frequent item; almost any rule with {fruit} as the consequent will have high confidence. Therefore, to evaluate the efficiency of a rule, we compute the lift ratio of the rule by accounting for the frequency of the consequent:

LIFT RATIO

$$\text{lift ratio} = \frac{\text{confidence}}{\text{support of consequent} / \text{total number of transactions}}$$

Recall that confidence, the numerator of the lift ratio, can be thought of as the probability of the consequent item set given the antecedent item set occurs. The denominator of the lift ratio is the probability of a randomly selected transaction containing the consequent set. Thus, the lift ratio represents how effective an association rule is at identifying transactions in which the consequent item set occurs versus a randomly selected transaction. A lift ratio greater than one suggests that there is some usefulness to the rule and that it is better at identifying cases when the consequent occurs than having no rule at all. In other words, a lift ratio greater than one suggests that the level of association between the antecedent and consequent is higher than would be expected if these item sets were independent.

For the data in Table 4.4, the rule “if {bread, jelly}, then {peanut butter}” has confidence = 2/4 = 0.5 and lift ratio = 0.5/(4/10) = 1.25. In other words, identifying a customer who purchased both bread and jelly as one who also purchased peanut butter is 25% better than just guessing that a random customer purchased peanut butter.
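The two measures reduce to a pair of divisions, as the sketch below shows. The support counts passed in are the ones quoted in the text for Table 4.4; the function names are assumptions.

```python
def confidence(support_both, support_antecedent):
    """Support of {antecedent and consequent} divided by support of the antecedent."""
    return support_both / support_antecedent


def lift_ratio(conf, support_consequent, n_transactions):
    """Confidence divided by the share of all transactions containing the consequent."""
    return conf / (support_consequent / n_transactions)


# Rule: if {bread, jelly}, then {peanut butter}.
conf_pb = confidence(support_both=2, support_antecedent=4)
lift_pb = lift_ratio(conf_pb, support_consequent=4, n_transactions=10)
print(f"confidence = {conf_pb:.2f}, lift ratio = {lift_pb:.2f}")

# Rule: if {cheese}, then {fruit}. Cheese appears once and fruit appears 9 times.
conf_fruit = confidence(support_both=1, support_antecedent=1)
lift_fruit = lift_ratio(conf_fruit, support_consequent=9, n_transactions=10)
print(f"confidence = {conf_fruit:.2f}, lift ratio = {lift_fruit:.2f}")
```

The cheese rule has perfect confidence but a lift ratio only slightly above 1, which illustrates why confidence alone can be misleading when the consequent is very frequent.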

The utility of a rule depends on both its support and its lift ratio. Although a high lift ratio suggests that the rule is very efficient at finding when the consequent occurs, if it has a very low support, the rule may not be as useful as another rule that has a lower lift ratio but affects a large number of transactions (as demonstrated by a high support). However, an association rule with a high lift ratio and low support may still be useful if the consequent represents a very valuable opportunity.

Based on the data in Table 4.4, Table 4.5 shows the list of association rules that achieve a lift ratio of at least 1.39 while satisfying a minimum support of 4 transactions (out of 10) and a minimum confidence of 50%. The top rules in Table 4.5 suggest that bread, fruit, and jelly are commonly associated items. For example, the fourth rule listed in Table 4.5 states, “If Fruit and Jelly are purchased, then Bread is also purchased.” Perhaps Hy-Vee could consider a promotion and/or product placement to leverage this perceived relationship.
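A rule list of this kind can be generated by brute force on a data set this small. The sketch below enumerates every item set with a support count of at least 4, splits it into antecedent and consequent in every possible way, and keeps rules with confidence of at least 50%. It is an illustrative approach, not necessarily the procedure used by the textbook's software, and practical tools use more efficient algorithms; all names are assumptions.

```python
from itertools import combinations

# The ten shopping carts from Table 4.4.
transactions = [
    {"bread", "peanut butter", "milk", "fruit", "jelly"},
    {"bread", "jelly", "soda", "potato chips", "milk", "fruit",
     "vegetables", "peanut butter"},
    {"whipped cream", "fruit", "chocolate sauce", "beer"},
    {"steak", "jelly", "soda", "potato chips", "bread", "fruit"},
    {"jelly", "soda", "peanut butter", "milk", "fruit"},
    {"jelly", "soda", "potato chips", "milk", "bread", "fruit"},
    {"fruit", "soda", "potato chips", "milk"},
    {"fruit", "soda", "peanut butter", "milk"},
    {"fruit", "cheese", "yogurt"},
    {"yogurt", "vegetables", "beer"},
]
N = len(transactions)


def support_count(items):
    """Number of transactions containing every item in `items`."""
    return sum(set(items) <= cart for cart in transactions)


def find_rules(min_support=4, min_confidence=0.5):
    """Return (antecedent, consequent, support, confidence, lift), highest lift first."""
    items = sorted(set().union(*transactions))
    found = []
    for size in range(2, len(items) + 1):
        for item_set in combinations(items, size):
            both = support_count(item_set)
            if both < min_support:
                continue
            # Try every split of the item set into antecedent and consequent.
            for r in range(1, size):
                for antecedent in combinations(item_set, r):
                    consequent = tuple(i for i in item_set if i not in antecedent)
                    conf = both / support_count(antecedent)
                    if conf >= min_confidence:
                        lift = conf / (support_count(consequent) / N)
                        found.append((antecedent, consequent, both, conf, lift))
    return sorted(found, key=lambda rule: -rule[4])


rules = find_rules()
for antecedent, consequent, both, conf, lift in rules[:6]:
    print(f"{antecedent} -> {consequent}: support {both}, "
          f"confidence {conf:.1%}, lift {lift:.2f}")
```

Note that a displayed lift of 1.39 is a rounded value (5/6 divided by 0.6), so a lift threshold should be applied to rounded values or set slightly below 1.39.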

TABLE 4.5 Association Rules for Hy-Vee

Antecedent (A)         Consequent (C)         Support for A   Support for C   Support for A & C   Confidence (%)   Lift Ratio
Bread                  Fruit, Jelly           4               5               4                   100.0            2.00
Bread                  Jelly                  4               5               4                   100.0            2.00
Bread, Fruit           Jelly                  4               5               4                   100.0            2.00
Fruit, Jelly           Bread                  5               4               4                   80.0             2.00
Jelly                  Bread                  5               4               4                   80.0             2.00
Jelly                  Bread, Fruit           5               4               4                   80.0             2.00
Fruit, Potato Chips    Soda                   4               6               4                   100.0            1.67
Peanut Butter          Milk                   4               6               4                   100.0            1.67
Peanut Butter          Milk, Fruit            4               6               4                   100.0            1.67
Peanut Butter, Fruit   Milk                   4               6               4                   100.0            1.67
Potato Chips           Fruit, Soda            4               6               4                   100.0            1.67
Potato Chips           Soda                   4               6               4                   100.0            1.67
Fruit, Soda            Potato Chips           6               4               4                   66.7             1.67
Milk                   Peanut Butter          6               4               4                   66.7             1.67
Milk                   Peanut Butter, Fruit   6               4               4                   66.7             1.67
Milk, Fruit            Peanut Butter          6               4               4                   66.7             1.67
Soda                   Fruit, Potato Chips    6               4               4                   66.7             1.67
Soda                   Potato Chips           6               4               4                   66.7             1.67
Fruit, Soda            Milk                   6               6               5                   83.3             1.39
Milk                   Fruit, Soda            6               6               5                   83.3             1.39
Milk                   Soda                   6               6               5                   83.3             1.39
Milk, Fruit            Soda                   6               6               5                   83.3             1.39
Soda                   Milk                   6               6               5                   83.3             1.39
Soda                   Milk, Fruit            6               6               5                   83.3             1.39

Data files: HyVeeDemoBinary, HyVeeDemoStacked

Evaluating Association Rules

Although explicit measures such as support, confidence, and lift ratio can help filter association rules, an association rule is ultimately judged on how actionable it is and how well it explains the relationship between item sets. For example, suppose Walmart mined its transactional data to uncover strong evidence of the association rule, “If a customer purchases a Barbie doll, then a customer also purchases a candy bar.” Walmart could leverage this relationship in product placement decisions as well as in advertisements and promotions, perhaps by placing a high-margin candy-bar display near the Barbie dolls. However, we must be aware that association rule analysis often results in obvious relationships such as “If a customer purchases hamburger patties, then a customer also purchases hamburger buns,” which may be true but provide no new insight. Association rules with a weak support measure often are inexplicable. For an association rule to be useful, it must be well supported and explain an important previously unknown relationship. The support of an association rule can generally be improved by basing it on less specific antecedent and consequent item sets. Unfortunately, association rules based on less specific item sets tend to yield less insight. Adjusting the data by aggregating items into more general categories (or splitting items into more specific categories) so that items occur in roughly the same number of transactions often yields better association rules.

4.3 Text Mining

Every day, nearly 500 million tweets are published on the on-line social network service Twitter. Many of these tweets contain important clues about how Twitter users value a company’s products and services. Some tweets might sing the praises of a product; others might complain about low-quality service. Furthermore, Twitter users vary greatly in the number of followers (some have thousands of followers and others just a few) and therefore these users have varying degrees of influence. Data-savvy companies can use social media data to improve their products and services. On-line reviews on web sites such as Amazon and Yelp provide data on how customers feel about products and services.

However, the data in these examples are not numerical. The data are text: words, phrases, sentences, and paragraphs. Text, like numerical data, may contain information that can help solve problems and lead to better decisions. Text mining is the process of extracting useful information from text data. In this section, we discuss text mining, how it is different from data mining of numerical data, and how it can be useful for decision making.

Text data is often referred to as unstructured data because in its raw form, it cannot be stored in a traditional structured database (rows and columns). Audio and video data are also examples of unstructured data. Data mining with text data is more challenging than data mining with traditional numerical data, because it requires more preprocessing to convert the text to a format amenable to analysis. However, once the text data has been converted to numerical data, the analytical methods used for descriptive text mining are the same as those used for numerical data discussed earlier in this chapter. We begin with a small example that illustrates how text data can be converted to numerical data and then analyzed. Then we will provide more in-depth discussion of text-mining concepts and preprocessing procedures.

Voice of the Customer at Triad Airline

Triad Airlines is a regional commuter airline. Through its voice of the customer program, Triad solicits feedback from its customers through a follow-up e-mail the day after the customer has completed a flight. The e-mail survey asks the customer to rate various aspects of the flight and asks the respondent to type comments into a dialog box in the e-mail.

In addition to the quantitative feedback from the ratings, the comments entered by the respondents need to be analyzed so that Triad can better understand its customers’ specific concerns and respond in an appropriate manner. We will use a small training sample of these concerns to illustrate how descriptive text mining can be used in this business context. In general, a collection of text documents to be analyzed is called a corpus. In the Triad Airline example, our corpus consists of 10 documents, where each document contains concerns made by a customer. Triad’s management would like to categorize these customer concerns into groups whose members share similar characteristics so that a solution team can be assigned to each group of concerns.

To be analyzed, text data needs to be converted to structured data (rows and columns of numerical data) so that the tools of descriptive statistics, data visualization, and data mining can be applied. We can think of converting a group of documents into a matrix of rows and columns where the rows correspond to a document and the columns correspond to a particular word. In Triad’s case, a document is a single respondent’s comment. A presence/absence or binary term-document matrix is a matrix with the rows representing documents and the columns representing words, and the entries in the columns indicating either the presence or the absence of a particular word in a particular document (1 = present and 0 = not present).

Creating the list of terms to use in the presence/absence matrix can be a complicated matter. Too many terms result in a matrix with many columns, which may be difficult to manage and could yield meaningless results. Too few terms may miss important relationships. Often, term frequency along with the problem context are used as a guide. We discuss this in more detail in the next section. In Triad’s case, management used word frequency and the context of having a goal of satisfied customers to come up with the following list of terms they feel are relevant for categorizing the respondents’ comments: delayed, flight, horrible, recline, rude, seat, and service.

As shown in Table 4.7, these seven terms correspond to the columns of the presence/absence term-document matrix and the rows correspond to the 10 documents. Each matrix entry indicates whether or not a column’s term appears in the document corresponding to the row. For example, a one entry in the first row and third column means that the term “horrible” appears in Document 1. A zero entry in the third row and fourth column means that the term “recline” does not appear in Document 3.
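The conversion from comments to a presence/absence matrix can be sketched as follows, using the ten comments from Table 4.6 and the seven terms chosen by management. The prefix match (so that "reclined" counts as "recline") is a crude stand-in for stemming and is an assumption of this sketch, as are the function names.

```python
import re

terms = ["delayed", "flight", "horrible", "recline", "rude", "seat", "service"]

# The ten customer concerns from Table 4.6, in document order.
documents = [
    "The wi-fi service was horrible. It was slow and cut off several times.",
    "My seat was uncomfortable.",
    "My flight was delayed 2 hours for no apparent reason.",
    "My seat would not recline.",
    "The man at the ticket counter was rude. Service was horrible.",
    "The flight attendant was rude. Service was bad.",
    "My flight was delayed with no explanation.",
    "My drink spilled when the guy in front of me reclined his seat.",
    "My flight was canceled.",
    "The arm rest of my seat was nasty.",
]


def tokenize(text):
    """Lowercase the text and keep only runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def presence_absence_row(text):
    """1 if any token starts with the term (crude stemming), otherwise 0."""
    tokens = tokenize(text)
    return [int(any(token.startswith(term) for token in tokens)) for term in terms]


matrix = [presence_absence_row(doc) for doc in documents]
for number, row in enumerate(matrix, start=1):
    print(number, row)
```

With these assumptions, the rows printed for the ten documents agree with the entries of Table 4.7.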

Having converted the text to numerical data, we can apply clustering. In this case, because we have binary presence-absence data, we apply hierarchical clustering. Observing that the absence of a term in two different documents does not imply similarity between the documents, we select Jaccard’s coefficient as the similarity measure. To measure similarity between clusters, we use complete linkage. At the level of three clusters, hierarchical clustering results in the following groups of documents:

Cluster 1: {1, 5, 6} = documents discussing service issues

Cluster 2: {2, 4, 8, 10} = documents discussing seat issues

Cluster 3: {3, 7, 9} = documents discussing schedule issues

With these three clusters defined, management can assign an expert team to each of these clusters to directly address the concerns of its customers.

TABLE 4.6 Ten Respondents’ Concerns for Triad Airlines

Document   Concerns
1          The wi-fi service was horrible. It was slow and cut off several times.
2          My seat was uncomfortable.
3          My flight was delayed 2 hours for no apparent reason.
4          My seat would not recline.
5          The man at the ticket counter was rude. Service was horrible.
6          The flight attendant was rude. Service was bad.
7          My flight was delayed with no explanation.
8          My drink spilled when the guy in front of me reclined his seat.
9          My flight was canceled.
10         The arm rest of my seat was nasty.

Data file: Triad

TABLE 4.7 The Presence/Absence Term-Document Matrix for Triad Airlines

           Term
Document   Delayed   Flight   Horrible   Recline   Rude   Seat   Service
1          0         0        1          0         0      0      1
2          0         0        0          0         0      1      0
3          1         1        0          0         0      0      0
4          0         0        0          1         0      1      0
5          0         0        1          0         1      0      1
6          0         1        0          0         1      0      1
7          1         1        0          0         0      0      0
8          0         0        0          1         0      1      0
9          0         1        0          0         0      0      0
10         0         0        0          0         0      1      0

Preprocessing Text Data for Analysis

In general, the text-mining process converts unstructured text into numerical data and applies quantitative techniques. For the Triad example, we converted the text documents into a term-document matrix and then applied hierarchical clustering to gain insight on the different types of comments (and their frequencies). In this section, we present a more detailed discussion of terminology and methods used in preprocessing text data into numerical data for analysis.

Converting documents to a term-document matrix is not a simple task. Obviously, which terms become the headers of the columns of the term-document matrix can greatly impact the analysis. Tokenization is the process of dividing text into separate terms, referred to as tokens. The process of identifying tokens is not straightforward. First, symbols and punctuation must be removed from the document and all letters should be converted to lowercase. For example, “Awesome!”, “awesome,” and “#Awesome” should all be converted to “awesome.” Likewise, different forms of the same word, such as “stacking,” “stacked,” and “stack” probably should not be considered as distinct terms. Stemming, the process of converting a word to its stem or root word, would drop the “ing” and “ed” and place only “stack” in the list of words to be tracked.

The goal of preprocessing is to generate a list of the most relevant terms that is small enough to lend itself to analysis. In addition to stemming, frequency can be used to eliminate words from consideration as tokens. For example, if a term occurs very frequently in every document in the corpus, then it probably will not be very useful and can be eliminated from consideration; “the” is an example of a frequent, uninformative term. Similarly, low-frequency words probably will not be very useful as tokens. Another technique for reducing the consideration set for tokens is to consolidate a set of words that are synonyms. For example, “courteous,” “cordial,” and “polite” might be best represented as a single token, “polite.”
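As a rough sketch of text reduction by synonyms and frequency, the following Python fragment maps synonyms to a single token and then drops tokens that appear in too many or too few documents. The four tokenized documents, the function names, and the thresholds min_docs and max_docs are all invented for illustration; appropriate cutoffs depend on the corpus.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each token, the number of documents that contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def reduce_tokens(docs, synonyms, min_docs, max_docs):
    """Map synonyms to one representative, then keep tokens found in min_docs..max_docs documents."""
    mapped = [[synonyms.get(t, t) for t in tokens] for tokens in docs]
    df = document_frequency(mapped)
    keep = {t for t, n in df.items() if min_docs <= n <= max_docs}
    return [[t for t in tokens if t in keep] for tokens in mapped], sorted(keep)

# Hypothetical tokenized documents, invented for illustration only.
docs = [
    ["the", "crew", "was", "courteous"],
    ["the", "agent", "was", "cordial"],
    ["the", "pilot", "was", "polite"],
    ["the", "gate", "was", "crowded"],
]
synonyms = {"courteous": "polite", "cordial": "polite"}

reduced, vocabulary = reduce_tokens(docs, synonyms, min_docs=2, max_docs=3)
print(vocabulary)
print(reduced)
```

Here “the” and “was” occur in every document and are dropped as uninformative, the words that occur only once are dropped as too rare, and the three synonyms survive as the single token “polite.”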

In addition to automated stemming and text reduction via frequency and synonyms, most text-mining software gives the user the ability to manually specify terms to include or exclude as tokens. Also, the use of slang, humor, and sarcasm can cause interpretation problems and might require more sophisticated data cleansing and subjective intervention on the part of the analyst to avoid misinterpretation.

Data preprocessing parses the original text data down to the set of tokens deemed relevant for the topic being studied. Based on these tokens, a presence/absence term-document matrix as in Table 4.7 can be generated.
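To make the construction concrete, the sketch below rebuilds the rows of Table 4.7 for Documents 3 through 10 from the comment text. The prefixes used to recognize each term (for example, “reclin” for Recline) are hand-picked assumptions standing in for a stemmer; they are not the procedure used by any particular software package.

```python
import re

# Comments for Documents 3 to 10 of the Triad Airlines example.
comments = {
    3: "My flight was delayed 2 hours for no apparent reason.",
    4: "My seat would not recline.",
    5: "The man at the ticket counter was rude. Service was horrible.",
    6: "The flight attendant was rude. Service was bad.",
    7: "My flight was delayed with no explanation.",
    8: "My drink spilled when the guy in front of me reclined his seat.",
    9: "My flight was canceled.",
    10: "The arm rest of my seat was nasty.",
}

# Column term -> prefix that identifies it (assumed, in place of a real stemmer).
term_prefixes = {
    "Delayed": "delay", "Flight": "flight", "Horrible": "horribl", "Recline": "reclin",
    "Rude": "rude", "Seat": "seat", "Service": "servic",
}

def presence_row(text):
    """Return 1 for each term whose prefix starts some word of the comment, else 0."""
    words = re.findall(r"[a-z]+", text.lower())
    return [int(any(w.startswith(p) for w in words)) for p in term_prefixes.values()]

matrix = {doc: presence_row(text) for doc, text in comments.items()}
for doc, row in matrix.items():
    print(doc, row)
```

Each printed row can be compared with the corresponding row of Table 4.7.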

When the documents in a corpus contain many more words than the brief comments in the Triad Airlines example, and when the frequency of word occurrence is important to the context of the business problem, preprocessing can be used to develop a frequency term-document matrix. A frequency term-document matrix is a matrix whose rows represent documents and columns represent tokens, and the entries in the matrix are the frequency of occurrence of each token in each document. We illustrate this in the following example.

Movie Reviews

A new action film has been released and we now have a sample of 10 reviews from movie critics. Using preprocessing techniques, including text reduction by synonyms, we have reduced the number of tokens to only two: “great” and “terrible.” Table 4.8 displays the corresponding frequency term-document matrix. As Table 4.8 shows, the token “great” appears four times in Document 7. Reviewing the entire table, we observe that five is the maximum frequency of a token in a document and zero is the minimum frequency.
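A row of a frequency term-document matrix is simply a count of each token in a document. The sketch below shows one way such counts might be produced after text reduction by synonyms. The two review texts and the synonym map are invented for illustration; they are not the reviews behind Table 4.8.

```python
import re
from collections import Counter

# Hypothetical synonym map reducing several words to the two tokens used in the example.
synonyms = {
    "great": "great", "excellent": "great", "superb": "great",
    "terrible": "terrible", "awful": "terrible", "dreadful": "terrible",
}
tokens = ["great", "terrible"]

def frequency_row(text):
    """Count how often each token (after synonym mapping) occurs in one document."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(synonyms[w] for w in words if w in synonyms)
    return [counts[t] for t in tokens]

# Invented review texts, for illustration only.
reviews = [
    "Great stunts, excellent pacing, and a superb cast. The ending was awful.",
    "Terrible script. Dreadful acting. Awful.",
]
matrix = [frequency_row(r) for r in reviews]
print(matrix)
```

The first invented review yields the row (3, 1) and the second yields (0, 3), in the same (great, terrible) layout as Table 4.8.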

To demonstrate the analysis of a frequency term-document matrix with descriptive data mining, we apply k-means clustering with k = 2 to the frequency term-document matrix to obtain the two clusters in Figure 4.5. Cluster 1 contains reviews that tend to be negative and Cluster 2 contains reviews that tend to be positive. We note that the observation (3, 3) corresponds to the balanced review of Document 4; based on this small corpus, the balanced review is more similar to the positive reviews than the negative reviews, suggesting that the negative reviews may tend to be more extreme.

FIGURE 4.5  Two Clusters Using k-Means Clustering on Movie Reviews

[Scatter plot. Horizontal axis: Great; vertical axis: Terrible; both run from 0 to 5. The points are grouped into Cluster 1 and Cluster 2.]

TABLE 4.8  The Frequency Term-Document Matrix for Movie Reviews

          Term
Document  Great  Terrible
1         5      0
2         5      1
3         5      1
4         3      3
5         5      1
6         0      5
7         4      1
8         5      3
9         1      3
10        1      2
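The clustering of the movie reviews can be retraced with a bare-bones version of the k-means procedure (often called Lloyd's algorithm) applied to the counts in Table 4.8. This is a sketch under stated assumptions: the starting centroids (Documents 2 and 6) are an arbitrary choice, the function name k_means is hypothetical, and commercial software may initialize and iterate differently.

```python
from math import dist

# (great, terrible) counts for Documents 1-10, copied from Table 4.8.
reviews = [(5, 0), (5, 1), (5, 1), (3, 3), (5, 1), (0, 5), (4, 1), (5, 3), (1, 3), (1, 2)]

def k_means(points, centroids, max_iter=100):
    """Alternate nearest-centroid assignment and centroid update until labels stop changing."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Arbitrary starting centroids (an assumption): Document 2 and Document 6.
labels, centroids = k_means(reviews, [reviews[1], reviews[5]])
for doc, lab in enumerate(labels, start=1):
    print(f"Document {doc}: group {lab}")
print(centroids)
```

The group numbers 0 and 1 are arbitrary labels and need not match the cluster numbering in Figure 4.5. With these starting centroids, Document 4, the balanced review at (3, 3), lands in the same group as the strongly positive reviews, consistent with the discussion above.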

NOTES + COMMENTS

1. The term-document matrix is also sometimes referred to as a document-term matrix.

2. In addition to the binary term-document matrix and frequency term-document matrix, there are more complex types of term-document matrices that can be used to preprocess unstructured text data. These methods utilize frequency measures other than simple counts, and include logarithmic-scaled frequency, inverse document frequency, and term frequency-inverse document frequency (TF-IDF), which is the term frequency multiplied by the inverse of the document frequency.

3. The process of converting words to all lowercase is often referred to as term normalization.

4. The process of clustering/categorizing comments or reviews as positive, negative, or neutral is known as sentiment analysis.
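As an illustration of the TF-IDF weighting mentioned in the notes, the sketch below applies it to the counts in Table 4.8. The exact formula varies across software packages; the version here, term frequency times the natural logarithm of (number of documents / document frequency), is one common variant and is an assumption rather than a definition taken from this chapter.

```python
from math import log

# (great, terrible) counts for Documents 1-10, copied from Table 4.8.
counts = [(5, 0), (5, 1), (5, 1), (3, 3), (5, 1), (0, 5), (4, 1), (5, 3), (1, 3), (1, 2)]
tokens = ["great", "terrible"]

n_docs = len(counts)
# Document frequency: the number of documents in which each token appears at least once.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(tokens))]
# Assumed variant of inverse document frequency: ln(N / df).
idf = [log(n_docs / df) for df in doc_freq]
# TF-IDF: term frequency multiplied by the inverse document frequency weight.
tfidf = [[tf * w for tf, w in zip(row, idf)] for row in counts]

print(doc_freq)
for doc, row in enumerate(tfidf, start=1):
    print(doc, [round(x, 3) for x in row])
```

Because each token appears in 9 of the 10 reviews, both receive the same small weight here; the weighting matters more in a corpus where some tokens are common and others are rare.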

SUMMARY

We have introduced the descriptive data-mining methods and related concepts. After introducing how to measure the similarity of individual observations, we presented two different methods for grouping observations based on the similarity of their respective variable values: hierarchical clustering and k-means clustering. Hierarchical clustering begins with each observation in its own cluster and iteratively aggregates clusters using a specified linkage method. We described several of these hierarchical clustering methods and discussed their features. In k-means clustering, the analyst specifies k, the number of clusters, and then observations are placed into these clusters in an attempt to minimize the dissimilarity within the clusters. We concluded our discussion of clustering with a comparison of hierarchical clustering and k-means clustering.
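The agglomerative idea summarized above (start with every observation in its own cluster, then repeatedly merge the two closest clusters) can be sketched as follows. The example uses complete linkage and the movie-review counts from Table 4.8; the function names are hypothetical, and the brute-force search over cluster pairs is written for readability rather than efficiency.

```python
from math import dist

def complete_linkage(a, b, points):
    """Dissimilarity of two clusters: distance between their two most dissimilar members."""
    return max(dist(points[i], points[j]) for i in a for j in b)

def agglomerate(points, k, linkage=complete_linkage):
    """Start with singleton clusters and merge the closest pair until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        pairs = [(x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))]
        x, y = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]], points))
        merged = clusters.pop(y)  # y > x, so index x is unaffected by the pop
        clusters[x].extend(merged)
    return clusters

# (great, terrible) counts for Documents 1-10, copied from Table 4.8.
reviews = [(5, 0), (5, 1), (5, 1), (3, 3), (5, 1), (0, 5), (4, 1), (5, 3), (1, 3), (1, 2)]

clusters = agglomerate(reviews, k=2)
# Report document numbers (1-based) rather than list positions.
print([sorted(i + 1 for i in c) for c in clusters])
```

Swapping in a different linkage function (single, group average, and so on) changes which pair of clusters is considered closest at each step and can therefore change the resulting clusters.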

We introduced association rules and explained their use for identifying patterns across transactions, particularly in retail data. We defined the concepts of support count, confidence, and lift ratio, and described their utility in gleaning actionable insight from association rules.
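As a brief reminder of how these measures relate, the sketch below computes confidence and lift ratio from support counts. The counts are those reported for the rule “Cooking, Art → Biography” in the table accompanying Problem 5 (2,000 transactions in total); the function names are hypothetical.

```python
def confidence(support_ac, support_a):
    """Share of transactions containing the antecedent that also contain the consequent."""
    return support_ac / support_a

def lift_ratio(support_ac, support_a, support_c, n_transactions):
    """Confidence divided by the consequent's share of all transactions."""
    return confidence(support_ac, support_a) / (support_c / n_transactions)

# Support counts for {Cooking, Art} -> {Biography} from the Problem 5 table.
n_transactions = 2000
conf = confidence(support_ac=204, support_a=334)
lift = lift_ratio(support_ac=204, support_a=334, support_c=554, n_transactions=n_transactions)
print(f"confidence = {conf:.2%}, lift ratio = {lift:.2f}")
```

The results agree with the 61.08% confidence and 2.20 lift ratio listed for that rule.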

Finally, we discussed the text-mining process. Text is first preprocessed by deriving a smaller set of tokens from the larger set of words contained in a collection of documents. Then the tokenized text data is converted into a presence/absence term-document matrix or a frequency term-document matrix. We then demonstrated the application of hierarchical clustering on a binary term-document matrix and k-means clustering on a frequency term-document matrix to glean insight from the underlying text data.

GLOSSARY

Antecedent: The item set corresponding to the if portion of an if–then association rule.

Association rule: An if–then statement describing the relationship between item sets.

Binary term-document matrix: A matrix with the rows representing documents and the columns representing words, and the entries in the columns indicating either the presence or absence of a particular word in a particular document (1 = present and 0 = not present).

Centroid linkage: Method of calculating dissimilarity between clusters by considering the two centroids of the respective clusters.

Complete linkage: Measure of calculating dissimilarity between clusters by considering only the two most dissimilar observations between the two clusters.

Confidence: The conditional probability that the consequent of an association rule occurs given the antecedent occurs.

Consequent: The item set corresponding to the then portion of an if–then association rule.

Corpus: A collection of documents to be analyzed.

Dendrogram: A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering.

Euclidean distance: Geometric measure of dissimilarity between observations based on the Pythagorean theorem.

Frequency term-document matrix: A matrix whose rows represent documents and columns represent tokens (terms), and the entries in the matrix are the frequency of occurrence of each token (term) in each document.

Group average linkage: Measure of calculating dissimilarity between clusters by considering the distance between each pair of observations between two clusters.

Hierarchical clustering: Process of agglomerating observations into a series of nested groups based on a measure of similarity.

Jaccard’s coefficient: Measure of similarity between observations consisting solely of binary categorical variables that considers only matches of nonzero entries.

k-means clustering: Process of organizing observations into one of k groups based on a measure of similarity (typically Euclidean distance).

Lift ratio: The ratio of the performance of a data mining model measured against the performance of a random choice. In the context of association rules, the lift ratio is the ratio of the probability of the consequent occurring in a transaction that satisfies the antecedent versus the probability that the consequent occurs in a randomly selected transaction.

Market basket analysis: Analysis of items frequently co-occurring in transactions (such as purchases).

Market segmentation: The partitioning of customers into groups that share common characteristics so that a business may target customers within a group with a tailored marketing strategy.

Matching coefficient: Measure of similarity between observations based on the number of matching values of categorical variables.

McQuitty’s method: Measure that computes the dissimilarity introduced by merging clusters A and B by, for each other cluster C, averaging the distance between A and C and the distance between B and C and then summing these average distances.

Median linkage: Method that computes the similarity between two clusters as the median of the similarities between each pair of observations in the two clusters.

Observation (record): A set of observed values of variables associated with a single entity, often displayed as a row in a spreadsheet or database.

Presence/absence document-term matrix: A matrix with the rows representing documents and the columns representing words, and the entries in the columns indicating either the presence or the absence of a particular word in a particular document (1 = present and 0 = not present).

Single linkage: Measure of calculating dissimilarity between clusters by considering only the two most similar observations between the two clusters.

Stemming: The process of converting a word to its stem or root word.

Support count: The number of times that a collection of items occurs together in a transaction data set.

Text mining: The process of extracting useful information from text data.

Tokenization: The process of dividing text into separate terms, referred to as tokens.

Unsupervised learning: Category of data-mining techniques in which an algorithm explains relationships without an outcome variable to guide the process.

Unstructured data: Data, such as text, audio, or video, that cannot be stored in a traditional structured database.

Ward’s method: Procedure that partitions observations in a manner to obtain clusters with the least amount of information loss due to the aggregation.
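Several of the measures defined in this glossary can be compared side by side in a short sketch. The binary rows are those of Documents 3 and 9 in Table 4.7, and the two small groups of points are (great, terrible) counts taken from Table 4.8. The function names are hypothetical, and returning 0 for the Jaccard coefficient when neither observation has a nonzero entry is a convention assumed here.

```python
from math import dist

def matching_coefficient(u, v):
    """Share of variables on which two observations have the same value."""
    return sum(a == b for a, b in zip(u, v)) / len(u)

def jaccard_coefficient(u, v):
    """Like the matching coefficient, but matches of 0 with 0 are ignored."""
    both = sum(a == 1 and b == 1 for a, b in zip(u, v))
    either = sum(a == 1 or b == 1 for a, b in zip(u, v))
    return both / either if either else 0.0  # assumed convention when both rows are all zeros

def centroid(points):
    """Coordinate-wise average of a group of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def linkage_distances(a, b):
    """Dissimilarity between clusters a and b under four linkage definitions."""
    pairwise = [dist(p, q) for p in a for q in b]
    return {
        "single": min(pairwise),
        "complete": max(pairwise),
        "group average": sum(pairwise) / len(pairwise),
        "centroid": dist(centroid(a), centroid(b)),
    }

# Rows for Documents 3 and 9 of Table 4.7.
doc3 = [1, 1, 0, 0, 0, 0, 0]
doc9 = [0, 1, 0, 0, 0, 0, 0]
print(matching_coefficient(doc3, doc9), jaccard_coefficient(doc3, doc9))

# Two small groups from Table 4.8: Documents 9 and 10 versus Documents 4 and 8.
group_a = [(1, 3), (1, 2)]
group_b = [(3, 3), (5, 3)]
linkages = linkage_distances(group_a, group_b)
print(linkages)
```

The two documents agree on six of seven terms, so the matching coefficient is 6/7, but five of those agreements are shared absences; the Jaccard coefficient counts only the one shared term out of the two terms present in either document, giving 1/2.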

PROBLEMS

1. The regulation of electric and gas utilities is an important public policy question affecting consumers’ choice and cost of energy provider. To inform deliberation on public policy, data on eight numerical variables have been collected for a group of energy companies. To summarize the data, hierarchical clustering has been executed using Euclidean distance as the similarity measure and Ward’s method as the clustering method. Based on the following dendrogram, what is the most appropriate number of clusters to organize these utility companies?

[Dendrogram for Problem 1. Horizontal axis: Cluster, with leaves in the order 10, 13, 4, 20, 2, 21, 1, 18, 14, 19, 3, 9, 6, 8, 16, 11, 5, 7, 12, 15, 17. Vertical axis: Distance, from 0 to 11.0.]

2. In an effort to inform political leaders and economists discussing the deregulation of electric and gas utilities, data on eight numerical variables from utility companies have been grouped using hierarchical clustering based on Euclidean distance as the similarity measure and complete linkage as the clustering method.

a. Based on the following dendrogram, what is the most appropriate number of clusters to organize these utility companies?

[Dendrogram for Problem 2. Horizontal axis: Cluster, with leaves in the order 10, 13, 4, 20, 2, 21, 5, 1, 18, 14, 19, 6, 3, 9, 7, 12, 15, 17, 8, 16, 11. Vertical axis: Distance, from 0 to 6.5. One merge is annotated “Between-Cluster Distance 2.577.”]

b. Using the following data on Observations 10, 13, 4, and 20, confirm that the complete linkage distance between the cluster containing {10, 13} and the cluster containing {4, 20} is 2.577 units as displayed in the dendrogram.

                Observation
Variable        10      13      4       20
Income/Debt     0.032   0.195   -0.510  0.466
Return          0.741   0.875   0.207   0.474
Cost            0.700   0.748   -0.004  -0.490
Load            -0.892  -0.735  -0.219  0.655
Peak            -0.173  1.013   -0.943  0.083
Sales           -0.693  -0.489  -0.702  -0.458
PercentNuclear  1.620   2.275   1.328   1.733
TotalFuelCosts  -0.863  -1.035  -0.724  -0.721

3. Amanda Boleyn, an entrepreneur who recently sold her start-up for a multi-million-dollar sum, is looking for alternate investments for her newfound fortune. She is considering an investment in wine, similar to how some people invest in rare coins and fine art. To educate herself on the properties of fine wine, she has collected data on 13 different characteristics of 178 wines. Amanda has applied k-means clustering to this data for k = 2, 3, and 4 and provided the summaries for each set of resulting clusters. Which value of k is the most appropriate to categorize these wines? Justify your choice with calculations.

k = 2

Inter-Cluster Distances  Cluster 1  Cluster 2
Cluster 1                0          3.829
Cluster 2                3.829      0

Within-Cluster Summary  Size  Average Distance
Cluster 1               94    3.080
Cluster 2               84    2.746
Total                   178   2.922

k = 3

Inter-Cluster Distances  Cluster 1  Cluster 2  Cluster 3
Cluster 1                0          5.005      3.576
Cluster 2                5.005      0          3.951
Cluster 3                3.576      3.951      0

Within-Cluster Summary  Size  Average Distance
Cluster 1               63    2.357
Cluster 2               51    2.438
Cluster 3               64    2.765
Total                   178   2.527

k = 4

Inter-Cluster Distances  Cluster 1  Cluster 2  Cluster 3  Cluster 4
Cluster 1                0          2.991      2.576      4.785
Cluster 2                2.991      0          3.951      5.105
Cluster 3                2.576      3.951      0          3.808
Cluster 4                4.785      5.105      3.808      0

Within-Cluster Summary  Size  Average Distance
Cluster 1               21    2.738
Cluster 2               55    2.285
Cluster 3               51    2.559
Cluster 4               51    2.438
Total                   178   2.461

4. Jay Gatsby categorizes wines into one of three clusters. The centroids of these clusters, describing the average characteristics of a wine in each cluster, are listed in the following table.

Characteristic   Cluster 1  Cluster 2  Cluster 3
Alcohol          0.819      0.164      -0.937
MalicAcid        -0.329     0.869      -0.368
Ash              0.248      0.186      -0.393
Alcalinity       -0.677     0.523      0.249
Magnesium        0.643      -0.075     -0.573
Phenols          0.825      0.977      -0.034
Flavanoids       0.896      -1.212     0.083
Nonflavanoids    -0.595     0.724      0.009
Proanthocyanins  0.619      -0.778     0.010
ColorIntensity   0.135      0.939      -0.881
Hue              0.497      -1.162     0.437
Dilution         0.744      -1.289     0.295
Proline          1.117      -0.406     -0.776

Jay has recently discovered a new wine from the Piedmont region of Italy with the following characteristics. In which cluster of wines should he place this new wine? Justify your choice with appropriate calculations.

Characteristic   Value
Alcohol          -1.023
MalicAcid        -0.480
Ash              0.049
Alcalinity       0.600
Magnesium        -1.242
Phenols          1.094
Flavanoids       0.001
Nonflavanoids    0.548
Proanthocyanins  -0.229
ColorIntensity   -0.797
Hue              0.711
Dilution         -0.425
Proline          0.010

5. Leggere, an internet book retailer, is interested in better understanding the purchase decisions of its customers. For a set of 2,000 customer transactions, it has categorized the individual book purchases comprising those transactions into one or more of the following categories: Novels, Willa Bean series, Cooking Books, Bob Villa Do-It-Yourself, Youth Fantasy, Art Books, Biography, Cooking Books by Mossimo Bottura, Harry Potter series, Florence Art Books, and Titian Art Books. Leggere has conducted association rules analysis on this data set and would like to analyze the output. Based on a minimum support of 200 transactions and a minimum confidence of 50%, the table below shows the top 10 rules with respect to lift ratio.

a. Explain why the top rule, “If customer buys a Bottura cooking book, then they buy a cooking book,” is not helpful even though it has the largest lift and 100% confidence.

b. Explain how the confidence of 52.99% and lift ratio of 2.20 were computed for the rule “If a customer buys a cooking book and a biography book, then they buy an art book.” Interpret these quantities.

c. Based on these top 10 rules, what general insight can Leggere gain on the purchase habits of these customers?

d. What will be the effect on the rules generated if Leggere decreases the minimum support and reruns the association rules analysis?

e. What will be the effect on the rules generated if Leggere decreases the minimum confidence and reruns the association rules analysis?

Antecedent          Consequent       Support for A  Support for C  Support for A & C  Confidence (%)  Lift Ratio
BotturaCooking      Cooking          227            862            227                100.00          2.32
Cooking, BobVilla   Art              379            482            205                54.09           2.24
Cooking, Art        Biography        334            554            204                61.08           2.20
Cooking, Biography  Art              385            482            204                52.99           2.20
Youth Fantasy       Novels, Cooking  446            512            245                54.93           2.15
Cooking, Art        BobVilla         334            583            205                61.38           2.11
Cooking, BobVilla   Biography        379            554            218                57.52           2.08
Biography           Novels, Cooking  554            512            293                52.89           2.07
Novels, Cooking     Biography        512            554            293                57.23           2.07
Art                 Novels, Cooking  482            512            249                51.66           2.02

6. The Football Bowl Subdivision (FBS) level of the National Collegiate Athletic Association (NCAA) consists of over 100 schools. Most of these schools belong to one of several conferences, or collections of schools, that compete with each other on a regular basis in collegiate sports. Suppose the NCAA has commissioned a study that will propose the formation of conferences based on the similarities of the constituent schools. The file FBS contains data on schools that belong to the Football Bowl Subdivision. Each row in this file contains information on a school. The variables include football stadium capacity, latitude, longitude, athletic department revenue, endowment, and undergraduate enrollment.

a. Apply k-means clustering with k = 10 using football stadium capacity, latitude, longitude, endowment, and enrollment as variables. Normalize the input variables to adjust for the different magnitudes of the variables. Analyze the resultant clusters. What is the smallest cluster? What is the least dense cluster (as measured by the average distance in the cluster)? What makes the least dense cluster so diverse?

b. What problems do you see with the plan for defining the school membership of the 10 conferences directly with the 10 clusters?

c. Repeat part (a), but this time do not normalize the values of the input variables. Analyze the resultant clusters. How and why do they differ from those in part (a)? Identify the dominating factor(s) in the formation of these new clusters.

7. Refer to the clustering problem involving the file FBS described in Problem 6. Apply hierarchical clustering with 10 clusters using football stadium capacity, latitude, longitude, endowment, and enrollment as variables. Normalize the values of the input variables to adjust for the different magnitudes of the variables. Use Ward’s method as the clustering method.

a. Compute the cluster centers for the clusters created by the hierarchical clustering. (Hint: This can be done using a PivotTable in Excel to calculate the average for each variable for the schools in a cluster.)

b. Identify the cluster with the largest average football stadium capacity. Using all the variables, how would you characterize this cluster?

c. Examine the smallest cluster. What makes this cluster unique?

8. Refer to the clustering problem involving the file FBS described in Problem 6. Apply hierarchical clustering with 10 clusters using latitude and longitude as variables. Normalize the values of the input variables to adjust for the different magnitudes of the variables. Execute the clustering two times: once with single linkage as the clustering method and once with group average linkage as the clustering method. Compute the cluster sizes and the minimum/maximum latitude and longitude for observations in each cluster. (Hint: This can be done using a PivotTable in Excel to display the count of schools in each cluster as well as the minimum and maximum of the latitude and longitude within each cluster.) To visualize the clusters, create a scatter plot with longitude as the x-variable and latitude as the y-variable. Compare the results of the two approaches.

9. Refer to the clustering problem involving the file FBS described in Problem 6. Apply hierarchical clustering with 10 clusters using latitude and longitude as variables. Normalize the values of the input variables to adjust for the different magnitudes of the variables. Execute the clustering two times: once with Ward’s method as the clustering method and once with group average linkage as the clustering method. Compute the cluster sizes and the minimum/maximum latitude and longitude for observations in each cluster. (Hint: This can be done using a PivotTable in Excel to display the count of schools in each cluster as well as the minimum and maximum of the latitude and longitude within each cluster.) To visualize the clusters, create a scatter plot with longitude as the x-variable and latitude as the y-variable. Compare the results of the two approaches.

10. Refer to the clustering problem involving the file FBS described in Problem 6. Apply hierarchical clustering with 10 clusters using latitude and longitude as variables. Normalize the values of the input variables to adjust for the different magnitudes of the variables. Execute the clustering two times: once with complete linkage as the clustering method and once with Ward’s method as the clustering method. Compute the cluster sizes and the minimum/maximum latitude and longitude for observations in each cluster. (Hint: This can be done using a PivotTable in Excel to display the count of schools in each cluster as well as the minimum and maximum of the latitude and longitude within each cluster.) To visualize the clusters, create a scatter plot with longitude as the x-variable and latitude as the y-variable. Compare the results of the two approaches.

11. Refer to the clustering problem involving the file FBS described in Problem 6. Apply hierarchical clustering with 10 clusters using latitude and longitude as variables. Normalize the values of the input variables to adjust for the different magnitudes of the variables. Execute the clustering two times: once with centroid linkage as the clustering method and once with group average linkage as the clustering method. Compute the cluster sizes and the minimum/maximum latitude and longitude for observations in each cluster. (Hint: This can be done using a PivotTable in Excel to display the count of schools in each cluster as well as the minimum and maximum of the latitude and longitude within each cluster.) To visualize the clusters, create a scatter plot with longitude as the x-variable and latitude as the y-variable. Compare the results of the two approaches.

12. From 1946 to 1990, the Big Ten Conference consisted of the University of Illinois, Indiana University, University of Iowa, University of Michigan, Michigan State University, University of Minnesota, Northwestern University, Ohio State University, Purdue University, and University of Wisconsin. In 1990, the conference added Pennsylvania State University. In 2011, the conference added the University of Nebraska. In 2014, the University of Maryland and Rutgers University were added to the conference with speculation of more schools being added in the future. The file BigTen contains the same information as the file FBS (see Problem 6 description), except that each variable value for the original 10 schools in the Big Ten conference has been replaced with the respective variable average over these 10 schools.

Apply hierarchical clustering with complete linkage to yield 2 clusters using football stadium capacity, latitude, longitude, endowment, and enrollment as variables. Normalize the values of the input variables to adjust for the different magnitudes of the variables. Which schools does the clustering suggest would have been the most appropriate to be the eleventh school in the Big Ten? The twelfth and thirteenth schools? What is the problem with using this method to identify the fourteenth school to add to the Big Ten?

13. In this problem, we refer to the clustering problem described in Problem 6, but now we remove the observation for Hawai’i and only consider schools in the continental United States; this modified data is contained in the file ContinentalFBS. The NCAA has a preference for conferences consisting of similar schools with respect to their endowment, enrollment, and football stadium capacity, but these conferences must be in the same geographic region to reduce traveling costs. Use the following steps to address this desire. Apply k-means clustering using latitude and longitude as variables with k = 3. Normalize the values of the input variables to adjust for the different magnitudes of the variables. Using the cluster assignments, separate the original data in the Data worksheet into three separate data sets, one for each of the three “regional” clusters.

a. For the Region 1 data set, apply hierarchical clustering with Ward’s method to form three clusters using football stadium capacity, endowment, and enrollment as variables. Normalize the input variables. Report the characteristics of each cluster using a PivotTable that includes a count of the number of schools in each cluster, the average stadium capacity, the average endowment amount, and the average enrollment for schools in each cluster.

b. For the Region 2 data set, apply hierarchical clustering with Ward’s method to form four clusters using football stadium capacity, endowment, and enrollment as variables. Normalize the input variables. Report the characteristics of each cluster using a PivotTable that includes a count of the number of schools in each cluster, the average stadium capacity, the average endowment amount, and the average enrollment for schools in each cluster.

c. For the Region 3 data set, apply hierarchical clustering with Ward’s method to form two clusters using football stadium capacity, endowment, and enrollment as variables. Normalize the input variables. Report the characteristics of each cluster using a PivotTable that includes a count of the number of schools in each cluster, the average stadium capacity, the average endowment amount, and the average enrollment for schools in each cluster.

d. What problems do you see with the plan of defining the school membership of nine conferences directly with the nine total clusters formed from the regions? How could this approach be tweaked to solve this problem?

14. IBM employs a network of expert analytics consultants for various projects. To help it determine how to distribute its bonuses, IBM wants to form groups of employees with similar performance according to key performance metrics. Each observation (corresponding to an employee) in the file BigBlue consists of values for UsageRate, which corresponds to the proportion of time that the employee has been actively working on high-priority projects; Recognition, which is the number of projects for which the employee was specifically requested; and Leader, which is the number of projects on which the employee has served as project leader. Apply k-means clustering with values of k = 2 to 7. Normalize the values of the input variables to adjust for the different magnitudes of the variables. How many clusters do you recommend to categorize the employees? Why?

15. Apply hierarchical clustering to the data in DemoKTC using matching coefficients as the similarity measure and group average linkage as the clustering method to create three clusters based on the Female, Married, Loan, and Mortgage variables. Use a PivotTable to count the total number of customers in each cluster as well as the number of customers who are female, the number of customers who are married, the number of customers with a car loan, and the number of customers with a mortgage in each cluster. How would you characterize each cluster?

16. Apply k-means clustering with values of k = 2, 3, 4, and 5 to cluster the data in DemoKTC based on the Age, Income, and Children variables. Normalize the values of the input variables to adjust for the different magnitudes of the variables. How many clusters do you recommend? Why?

17. Attracted by the possible returns from a portfolio of movies, hedge funds have invested in the movie industry by financially backing individual films and/or studios. The hedge fund Star Ventures is currently conducting research on movies involving Adam Sandler, an American actor, screenwriter, and film producer. As a first step, Star Ventures would like to cluster Adam Sandler movies based on their gross box office returns and movie critic ratings. Using the data in the file Sandler, apply k-means clustering with k = 3 to characterize three different types of Adam Sandler movies. Base the clusters on the variables Rating and Box. Rating corresponds to movie ratings provided by critics (a higher score represents a movie receiving better reviews). Box represents the gross box office earnings in 2015 dollars. Normalize the values of the input variables to adjust for the different magnitudes of the variables. Report the characteristics of each cluster using a PivotTable that includes a count of movies, the average rating of movies, and the average box office earnings of movies in each cluster. How would you characterize the movies in each cluster?

18. Josephine Mater works for the supply-chain analytics division of Trader Joe’s, a national chain of specialty grocery stores. Trader Joe’s is considering a redesign of its supply chain. Josephine knows that Trader Joe’s uses frequent truck shipments from its distribution centers to its retail stores. To keep costs low, retail stores are typically located near a distribution center. The file TraderJoes contains data on the location of Trader Joe’s retail stores. Josephine would like to use k-means clustering with k = 8 to estimate the preferred locations if Trader Joe’s were to establish eight distribution centers to support its retail stores. Normalize the values of the input variables to adjust for the different magnitudes of the variables. If Trader Joe’s establishes eight distribution centers, how many retail stores are assigned to each distribution center? What are the drawbacks to using this solution approach to assign retail stores to distribution centers?

19. Apple Inc. tracks online transactions at its iStore and is interested in learning about the purchase patterns of its customers in order to provide recommendations as a customer browses its web site. A sample of the “shopping cart” data resides in the files AppleCartBinary and AppleCartStacked.

Use a minimum support of 10% of the total number of transactions and a minimum confidence of 50% to generate a list of association rules.

a. Interpret what the rule with the largest lift ratio is saying about the relationship between the antecedent item set and consequent item set.

b. Interpret the confidence of the rule with the largest lift ratio.

c. Interpret the lift ratio of the rule with the largest lift ratio.

d. Review the top 15 rules and summarize what the rules suggest.

20. Cookie Monster Inc. is a company that specializes in the development of software that tracks web browsing history of individuals. A sample of browser histories is provided in the files CookieMonsterBinary and CookieMonsterStacked that indicate which websites were visited by which customers.

Use a minimum support of 4% of the transactions (800 of the 20,000 total transactions) and a minimum confidence of 50% to generate a list of association rules. Review the top 14 rules. What information does this analysis provide Cookie Monster Inc. regarding the online behavior of individuals?

21. A grocery store introducing items from Italy is interested in analyzing buying trends of these new “international” items, namely prosciutto, Peroni, risotto, and gelato. The files GroceryStoreList and GroceryStoreStacked provide data on a collection of transactions in item-list format.
a. Use a minimum support of 100 transactions (10% of the 1,000 total transactions) and a minimum confidence of 50% to generate a list of association rules. How many rules satisfy this criterion?
b. Use a minimum support of 250 transactions (25% of the 1,000 total transactions) and a minimum confidence of 50% to generate a list of association rules. How many rules satisfy this criterion? Why may the grocery store want to increase the minimum support required for their analysis? What is the risk of increasing the minimum support required?
c. Using the list of rules from part (b), consider the rule with the largest lift ratio that also involves an Italian item. Interpret what this rule is saying about the relationship between the antecedent item set and consequent item set.
d. Interpret the confidence of the rule with the largest lift ratio that also involves an Italian item.
e. Interpret the lift ratio of the rule with the largest lift ratio that also involves an Italian item.
f. What insight can the grocery store obtain about its purchasers of the Italian fare?

22. Companies can learn a lot about customer experiences by monitoring the social media web site Twitter. The file AirlineTweets contains a sample of 36 tweets of an airline’s customers. Normalize the terms by using stemming and generate a binary term-document matrix.
a. What are the five most common terms occurring in these tweets? How often does each term appear?
b. Apply hierarchical clustering using complete linkage to yield three clusters on the binary term-document matrix using the tokens agent, attend, bag, damag, and rude as variables. How many documents are in each cluster? Give a description of each cluster.
c. How could management use the results obtained in part (b)?
Source: Kaggle website

23. The online review service Yelp helps millions of consumers find the goods and services they seek. To help consumers make more-informed choices, Yelp includes over 120 million reviews. The file YelpItalian contains a sample of 21 reviews for an Italian restaurant. Normalize the terms by using stemming and generate a binary term-document matrix.
a. What are the five most common terms in these reviews? How often does each term appear?
b. Apply hierarchical clustering using complete linkage to yield two clusters from the presence/absence term-document matrix using all five of the most common terms from the reviews. How many documents are in each cluster? Give a description of each cluster.

Case Problem: Know Thy Customer

Know Thy Customer (KTC) is a financial consulting company that provides personalized financial advice to its clients. As a basis for developing this tailored advising, KTC would like to segment its customers into several representative groups based on key characteristics. Peyton Blake, the director of KTC’s fledgling analytics division, plans to establish the set of representative customer profiles based on 600 customer records in the file KnowThyCustomer. Each customer record contains data on age, gender, annual income, marital status, number of children, whether the customer has a car loan, and whether the customer has a home mortgage. KTC’s market research staff has determined that these seven characteristics should form the basis of the customer clustering.

Peyton has invited a summer intern, Danny Riles, into her office so they can discuss how to proceed. As they review the data on the computer screen, Peyton’s brow furrows as she realizes that this task may not be trivial. The data contains both categorical variables (Female, Married, Car, and Mortgage) and numerical variables (Age, Income, and Children).

1. Using hierarchical clustering on all seven variables, experiment with using complete linkage and group average linkage as the clustering method. Normalize the values of the input variables. Recommend a set of customer profiles (clusters). Describe these clusters according to their “average” characteristics. Why might hierarchical clustering not be a good method to use for these seven variables?

2. Apply a two-step clustering method:
a. Use hierarchical clustering with matching coefficients as the similarity measure and group average linkage as the clustering method to produce four clusters using the variables Female, Married, Loan, and Mortgage.
b. Based on the clusters from part (a), split the original 600 observations into four separate data sets as suggested by the four clusters from part (a). For each of these four data sets, apply k-means clustering with \(k = 2\) using Age, Income, and Children as variables. Normalize the values of the input variables. This will generate a total of eight clusters. Describe these eight clusters according to their “average” characteristics. What benefit does this two-step clustering approach have over just using hierarchical clustering on all seven variables as in part (1) or just using k-means clustering on all seven variables? What weakness does it have?

Chapter 5

Probability: An Introduction to Modeling Uncertainty

CONTENTS

Analytics in Action: National Aeronautics and Space Administration

5.1 Events and Probabilities

5.2 Some Basic Relationships of Probability
Complement of an Event
Addition Law

5.3 Conditional Probability
Independent Events
Multiplication Law
Bayes’ Theorem

5.4 Random Variables
Discrete Random Variables
Continuous Random Variables

5.5 Discrete Probability Distributions
Custom Discrete Probability Distribution
Expected Values and Variance
Discrete Uniform Probability Distribution
Binomial Probability Distribution
Poisson Probability Distribution

5.6 Continuous Probability Distributions
Uniform Probability Distribution
Triangular Probability Distribution
Normal Probability Distribution
Exponential Probability Distribution

Uncertainty is an ever-present fact of life for decision makers, and much time and effort are spent trying to plan for, and respond to, uncertainty. Consider the CEO who has to make decisions about marketing budgets and production amounts using forecasted demands. Or consider the financial analyst who must determine how to build a client’s portfolio of stocks and bonds when the rates of return for these investments are not known with certainty. In many business scenarios, data are available to provide information on possible outcomes for some decisions, but the exact outcome from a given decision is almost never known with certainty because many factors are outside the control of the decision maker (e.g., actions taken by competitors, the weather, etc.).

Probability is the numerical measure of the likelihood that an event will occur.¹ Therefore, it can be used as a measure of the uncertainty associated with an event. This measure of uncertainty is often communicated through a probability distribution. Probability distributions are extremely helpful in providing additional information about an event, and as we will see in later chapters in this textbook, they can be used to help a decision maker evaluate possible actions and determine the best course of action.

Identifying uncertainty in data was introduced in Chapters 2 and 3 through descriptive statistics and data-visualization techniques, respectively. In this chapter, we expand on our discussion of modeling uncertainty by formalizing the concept of probability and introducing the concept of probability distributions.

ANALYTICS IN ACTION

National Aeronautics and Space Administration*

Washington, D.C.

The National Aeronautics and Space Administration (NASA) is the U.S. government agency that is responsible for the U.S. civilian space program and for aeronautics and aerospace research. NASA is best known for its manned space exploration; its mission statement is to “pioneer the future in space exploration, scientific discovery and aeronautics research.” With 18,800 employees, NASA is currently working on the design of a new Space Launch System that will take the astronauts farther into space than ever before and provide the cornerstone for future space exploration.

Although NASA’s primary mission is space exploration, its expertise has been called on in assisting countries and organizations throughout the world in nonspace endeavors. In one such situation, the San José copper and gold mine in Copiapó, Chile, caved in, trapping 33 men more than 2,000 feet underground. It was important to bring the men safely to the surface as quickly as possible, but it was also imperative that the rescue effort be carefully designed and implemented to save as many miners as possible. The Chilean government asked NASA to provide assistance in developing a rescue method. NASA sent a four-person team consisting of an engineer with expertise in vehicle design, two physicians, and a psychologist with knowledge about issues of long-term confinement.

The probability of success and the failure of various other rescue methods was prominent in the thoughts of everyone involved. Since no historical data were available to apply to this unique rescue situation, NASA scientists developed subjective probability estimates for the success and failure of various rescue methods based on similar circumstances experienced by astronauts returning from short- and long-term space missions. The probability estimates provided by NASA guided officials in the selection of a rescue method and provided insight as to how the miners would survive the ascent in a rescue cage. The rescue method designed by the Chilean officials in consultation with the NASA team resulted in the construction of a 13-foot-long, 924-pound steel rescue capsule that would be used to bring up the miners one at a time. All miners were rescued, with the last emerging 68 days after the cave-in occurred.

In this chapter, you will learn about probability as well as how to compute and interpret probabilities for a variety of situations. The basic relationships of probability, conditional probability, and Bayes’ theorem will be covered. We will also discuss the concepts of random variables and probability distributions and illustrate the use of some of the more common discrete and continuous probability distributions.

*The authors are indebted to Dr. Michael Duncan and Clinton Cragg at NASA for providing this Analytics in Action.

¹ Note that there are several different possible definitions of probability, depending on the method used to assign probabilities. These include the classical definition, the relative frequency definition, and the subjective definition of probability. In this text, we most often use the relative frequency definition of probability, which assumes that probabilities are based on empirical data. For a more thorough discussion of the different possible definitions of probability see Chapter 4 of Anderson, Sweeney, Williams, Camm, and Cochran, An Introduction to Statistics for Business and Economics, 13e Revised (2018).

5.1 Events and Probabilities

In discussing probabilities, we often start by defining a random experiment as a process that generates well-defined outcomes. Several examples of random experiments and their associated outcomes are shown in Table 5.1.

By specifying all possible outcomes, we identify the sample space for a random experiment. Consider the first random experiment in Table 5.1, a coin toss. The possible outcomes are head and tail. If we let S denote the sample space, we can use the following notation to describe the sample space.

\[S = \{\text{Head}, \text{Tail}\}\]

Suppose we consider the second random experiment in Table 5.1, rolling a die. The possible experimental outcomes, defined as the number of dots appearing on the upward face of the die, are the six points in the sample space for this random experiment.

\[S = \{1, 2, 3, 4, 5, 6\}\]

Outcomes and events form the foundation of the study of probability. Formally, an event is defined as a collection of outcomes. For example, consider the case of an expansion project being undertaken by California Power & Light Company (CP&L). CP&L is starting a project designed to increase the generating capacity of one of its plants in Southern California. An analysis of similar construction projects indicates that the possible completion times for the project are 8, 9, 10, 11, and 12 months. Each of these possible completion times represents a possible outcome for this project. Table 5.2 shows the number of past construction projects that required 8, 9, 10, 11, and 12 months.

Let us assume that the CP&L project manager is interested in completing the project in 10 months or less. Referring to Table 5.2, we see that three possible outcomes (8 months, 9 months, and 10 months) provide completion times of 10 months or less. Letting C denote the event that the project is completed in 10 months or less, we write:

\[C = \{8, 9, 10\}\]

Event C is said to occur if any one of these outcomes occurs.

A variety of additional events can be defined for the CP&L project:

\[L = \text{the event that the project is completed in less than 10 months} = \{8, 9\}\]

\[M = \text{the event that the project is completed in more than 10 months} = \{11, 12\}\]

In each case, the event must be identified as a collection of outcomes for the random experiment.

Table 5.1 Random Experiments and Experimental Outcomes

| Random Experiment | Experimental Outcomes |
|---|---|
| Toss a coin | Head, tail |
| Roll a die | 1, 2, 3, 4, 5, 6 |
| Conduct a sales call | Purchase, no purchase |
| Hold a particular share of stock for one year | Price of stock goes up, price of stock goes down, no change in stock price |
| Reduce price of product | Demand goes up, demand goes down, no change in demand |

The probability of an event is equal to the sum of the probabilities of outcomes for the event. Using this definition and given the probabilities of outcomes shown in Table 5.2, we can now calculate the probability of the event \(C = \{8, 9, 10\}\). The probability of event C, denoted P(C), is given by

\[P(C) = P(8) + P(9) + P(10) = 0.15 + 0.25 + 0.30 = 0.70\]

Similarly, because the event that the project is completed in less than 10 months is given by \(L = \{8, 9\}\), the probability of this event is given by

\[P(L) = P(8) + P(9) = 0.15 + 0.25 = 0.40\]

Finally, for the event that the project is completed in more than 10 months, we have \(M = \{11, 12\}\) and thus

\[P(M) = P(11) + P(12) = 0.15 + 0.15 = 0.30\]

Using these probability results, we can now tell CP&L management that there is a 0.70 probability that the project will be completed in 10 months or less, a 0.40 probability that it will be completed in less than 10 months, and a 0.30 probability that it will be completed in more than 10 months.
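The event probabilities above can also be reproduced programmatically. The following is a minimal Python sketch, not part of the original text; it assumes the project counts from Table 5.2, and the names `counts`, `outcome_prob`, and `event_probability` are illustrative only.

```python
# Past-project counts from Table 5.2: completion time (months) -> number of projects.
counts = {8: 6, 9: 10, 10: 12, 11: 6, 12: 6}
total = sum(counts.values())  # 40 past projects

# Relative-frequency probability of each outcome.
outcome_prob = {months: n / total for months, n in counts.items()}


def event_probability(event):
    """Sum the probabilities of the outcomes that make up the event."""
    return sum(outcome_prob[outcome] for outcome in event)


C = {8, 9, 10}  # completed in 10 months or less
L = {8, 9}      # completed in less than 10 months
M = {11, 12}    # completed in more than 10 months

for name, event in (("C", C), ("L", L), ("M", M)):
    print(f"P({name}) = {event_probability(event):.2f}")
```

Representing each event as a set of outcomes mirrors the definition of an event as a collection of outcomes, so any other event for the CP&L project can be evaluated by passing a different set.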

5.2 Some Basic Relationships of Probability

Complement of an Event

Given an event A, the complement of A is defined to be the event consisting of all outcomes that are not in A. The complement of A is denoted by \(A^C\). Figure 5.1 shows what is known as a Venn diagram, which illustrates the concept of a complement. The rectangular area represents the sample space for the random experiment and, as such, contains all possible outcomes. The circle represents event A and contains only the outcomes that belong to A. The shaded region of the rectangle contains all outcomes not in event A and is by definition the complement of A.

In any probability application, either event A or its complement \(A^C\) must occur. Therefore, we have

\[P(A) + P(A^C) = 1\]

Solving for P(A), we obtain the following result:

COMPUTING PROBABILITY USING THE COMPLEMENT

\[P(A) = 1 - P(A^C) \qquad (5.1)\]

The complement of event A is sometimes written as \(\bar{A}\) or \(A'\) in other textbooks.

Table 5.2 Completion Times for 40 CP&L Projects

| Completion Time (months) | No. of Past Projects Having This Completion Time | Probability of Outcome |
|---|---|---|
| 8 | 6 | 6/40 = 0.15 |
| 9 | 10 | 10/40 = 0.25 |
| 10 | 12 | 12/40 = 0.30 |
| 11 | 6 | 6/40 = 0.15 |
| 12 | 6 | 6/40 = 0.15 |
| Total | 40 | 1.00 |

Equation (5.1) shows that the probability of an event A can be computed easily if the probability of its complement, \(P(A^C)\), is known.

As an example, consider the case of a sales manager who, after reviewing sales reports, states that 80% of new customer contacts result in no sale. By allowing A to denote the event of a sale and \(A^C\) to denote the event of no sale, the manager is stating that \(P(A^C) = 0.80\). Using equation (5.1), we see that

\[P(A) = 1 - P(A^C) = 1 - 0.80 = 0.20\]

We can conclude that a new customer contact has a 0.20 probability of resulting in a sale.
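As a small illustration of equation (5.1), the sketch below applies the complement rule to the sales example. It is an assumed helper written for this illustration (the function name `complement_probability` is hypothetical), not code from the text.

```python
def complement_probability(p_event):
    """Return 1 - p_event, the probability of the complementary event."""
    if not 0.0 <= p_event <= 1.0:
        raise ValueError("a probability must lie between 0 and 1")
    return 1.0 - p_event


p_no_sale = 0.80                             # P(A^C), stated by the sales manager
p_sale = complement_probability(p_no_sale)   # P(A) = 1 - P(A^C)
print(f"P(A) = {p_sale:.2f}")
```

Because A is itself the complement of \(A^C\), the same function converts in either direction.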

Addition Law

The addition law is helpful when we are interested in knowing the probability that at least one of two events will occur. That is, with events A and B we are interested in knowing the probability that event A or event B occurs or both events occur.

Before we present the addition law, we need to discuss two concepts related to the combination of events: the union of events and the intersection of events. Given two events A and B, the union of A and B is defined as the event containing all outcomes belonging to A or B or both. The union of A and B is denoted by \(A \cup B\).

The Venn diagram in Figure 5.2 depicts the union of A and B. Note that one circle contains all the outcomes in A and the other all the outcomes in B. The fact that the circles overlap indicates that some outcomes are contained in both A and B.

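Figure 5.1 Venn Diagram for Event A [Venn diagram: a circle labeled Event A inside a rectangle labeled Sample Space S; the region outside the circle is labeled \(A^C\), Complement of Event A]

Figure 5.2 Venn Diagram for the Union of Events A and B [Venn diagram: overlapping circles labeled Event A and Event B inside a rectangle labeled Sample Space S]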

The definition of the intersection of A and B is the event containing the outcomes that belong to both A and B. The intersection of A and B is denoted by \(A \cap B\). The Venn diagram depicting the intersection of A and B is shown in Figure 5.3. The area in which the two circles overlap is the intersection; it contains outcomes that are in both A and B.

The addition law provides a way to compute the probability that event A or event B occurs or both events occur. In other words, the addition law is used to compute the probability of the union of two events. The addition law is written as follows:

ADDITION LAW

\[P(A \cup B) = P(A) + P(B) - P(A \cap B) \qquad (5.2)\]

To understand the addition law intuitively, note that the first two terms in the addition law, \(P(A) + P(B)\), account for all the sample points in \(A \cup B\). However, because the sample points in the intersection \(A \cap B\) are in both A and B, when we compute \(P(A) + P(B)\), we are in effect counting each of the sample points in \(A \cap B\) twice. We correct for this double counting by subtracting \(P(A \cap B)\).

As an example of the addition law, consider a study conducted by the human resources manager of a major computer software company. The study showed that 30% of the employees who left the firm within two years did so primarily because they were dissatisfied with their salary, 20% left because they were dissatisfied with their work assignments, and 12% of the former employees indicated dissatisfaction with both their salary and their work assignments. What is the probability that an employee who leaves within two years does so because of dissatisfaction with salary, dissatisfaction with the work assignment, or both?

Let

S = the event that the employee leaves because of salary
W = the event that the employee leaves because of work assignment

From the survey results, we have \(P(S) = 0.30\), \(P(W) = 0.20\), and \(P(S \cap W) = 0.12\). Using the addition law from equation (5.2), we have

\[P(S \cup W) = P(S) + P(W) - P(S \cap W) = 0.30 + 0.20 - 0.12 = 0.38\]

This calculation tells us that there is a 0.38 probability that an employee will leave for salary or work assignment reasons.
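The addition law calculation can be sketched in a few lines of Python. This is an illustrative helper rather than anything prescribed by the text; the function name `union_probability` is an assumption.

```python
def union_probability(p_a, p_b, p_a_and_b):
    """Addition law: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_a_and_b


p_s = 0.30        # P(S): left because of salary
p_w = 0.20        # P(W): left because of work assignment
p_s_and_w = 0.12  # P(S and W): dissatisfied with both

p_s_or_w = union_probability(p_s, p_w, p_s_and_w)
print(f"P(S or W) = {p_s_or_w:.2f}")
```

Subtracting the joint probability once removes the double counting of employees who cited both reasons.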

We can also think of this probability in the following manner: What proportion of employees either left because of salary or left because of work assignment?

Figure 5.3 Venn Diagram for the Intersection of Events A and B [Venn diagram: overlapping circles labeled Event A and Event B inside a rectangle labeled Sample Space S; the overlap is the intersection]

Before we conclude our discussion of the addition law, let us consider a special case that arises for mutually exclusive events. Events A and B are mutually exclusive if the occurrence of one event precludes the occurrence of the other. Thus, a requirement for A and B to be mutually exclusive is that their intersection must contain no sample points. The Venn diagram depicting two mutually exclusive events A and B is shown in Figure 5.4. In this case \(P(A \cap B) = 0\) and the addition law can be written as follows:

ADDITION LAW FOR MUTUALLY EXCLUSIVE EVENTS

\[P(A \cup B) = P(A) + P(B)\]

Figure 5.4 Venn Diagram for Mutually Exclusive Events [Venn diagram: non-overlapping circles labeled Event A and Event B inside a rectangle labeled Sample Space S]

NOTES + COMMENTS

The addition law can be extended beyond two events. For example, the addition law for three events A, B, and C is

\[P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)\]

Similar logic can be used to derive the expressions for the addition law for more than three events.
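The three-event addition law and the mutually exclusive special case can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the die-rolling sample space from Table 5.1, assumes a fair die so that all six outcomes are equally likely, and the events A, B, C, X, and Y are hypothetical choices made only for this example.

```python
from fractions import Fraction

# Sample space for rolling a die (Table 5.1); a fair die is assumed.
sample_space = {1, 2, 3, 4, 5, 6}


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


# Hypothetical events chosen for illustration.
A = {2, 4, 6}
B = {4, 5, 6}
C = {1, 2}

direct = prob(A | B | C)  # probability of the union, counted directly
by_law = (prob(A) + prob(B) + prob(C)
          - prob(A & B) - prob(A & C) - prob(B & C)
          + prob(A & B & C))
print(direct, by_law)

# Mutually exclusive events share no outcomes, so the intersection term vanishes.
X, Y = {1, 2}, {5, 6}
print(prob(X & Y), prob(X | Y) == prob(X) + prob(Y))
```

Exact fractions are used instead of floating-point numbers so that the two sides of each identity can be compared with `==`.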

More generally, two events are said to be mutually exclusive if the events have no outcomes in common.

5.3 Conditional Probability

Often, the probability of one event is dependent on whether some related event has already occurred. Suppose we have an event A with probability P(A). If we learn that a related event, denoted by B, has already occurred, we take advantage of this information by calculating a new probability for event A. This new probability of event A is called a conditional probability and is written \(P(A \mid B)\). The notation | indicates that we are considering the probability of event A given the condition that event B has occurred. Hence, the notation \(P(A \mid B)\) reads “the probability of A given B.”

To illustrate the idea of conditional probability, consider a bank that is interested in the mortgage default risk for its home mortgage customers. Table 5.3 shows the first 25 records of the 300 home mortgage customers at Lancaster Savings and Loan, a company that specializes in high-risk subprime lending. Some of these home mortgage customers have defaulted on their mortgages and others have continued to make on-time payments. These data include the age of the customer at the time of mortgage origination, the marital status of the customer (single or married), the annual income of the customer, the mortgage amount, the number of payments made by the customer per year on the mortgage, the total amount paid by the customer over the lifetime of the mortgage, and whether or not the customer defaulted on her or his mortgage.

Lancaster Savings and Loan is interested in whether the probability of a customer defaulting on a mortgage differs by marital status. Let

S = event that a customer is single
M = event that a customer is married
D = event that a customer defaulted on his or her mortgage
\(D^C\) = event that a customer did not default on his or her mortgage

Table 5.4 shows a crosstabulation for two events that can be derived from the Lancaster Savings and Loan mortgage data.

Note that we can easily create Table 5.4 in Excel using a PivotTable by using the following steps:

Step 1. In the Values worksheet of the MortgageDefaultData file, click the Insert tab on the Ribbon
Step 2. Click PivotTable in the Tables group
Step 3. When the Create PivotTable dialog box appears:
    Choose Select a Table or Range
    Enter A1:H301 in the Table/Range: box
    Select New Worksheet as the location for the PivotTable Report
    Click OK
Step 4. In the PivotTable Fields area go to Drag fields between areas below:
    Drag the Marital Status field to the ROWS area
    Drag the Default on Mortgage? field to the COLUMNS area
    Drag the Customer Number field to the VALUES area
Step 5. Click on Sum of Customer Number in the VALUES area and select Value Field Settings
Step 6. When the Value Field Settings dialog box appears:
    Under Summarize value field by, select Count

These steps produce the PivotTable shown in Figure 5.5.

Chapter 3 discusses PivotTables in more detail.

Table 5.3 Subset of Data from 300 Home Mortgages of Customers at Lancaster Savings and Loan

| Customer No. | Age | Marital Status | Annual Income | Mortgage Amount | Payments per Year | Total Amount Paid | Default on Mortgage? |
|---|---|---|---|---|---|---|---|
| 1 | 37 | Single | $172,125.70 | $473,402.96 | 24 | $581,885.13 | Yes |
| 2 | 31 | Single | $108,571.04 | $300,468.60 | 12 | $489,320.38 | No |
| 3 | 37 | Married | $124,136.41 | $330,664.24 | 24 | $493,541.93 | Yes |
| 4 | 24 | Married | $79,614.04 | $230,222.94 | 24 | $449,682.09 | Yes |
| 5 | 27 | Single | $68,087.33 | $282,203.53 | 12 | $520,581.82 | No |
| 6 | 30 | Married | $59,959.80 | $251,242.70 | 24 | $356,711.58 | Yes |
| 7 | 41 | Single | $99,394.05 | $282,737.29 | 12 | $524,053.46 | No |
| 8 | 29 | Single | $38,527.35 | $238,125.19 | 12 | $468,595.99 | No |
| 9 | 31 | Married | $112,078.62 | $297,133.24 | 24 | $399,617.40 | Yes |
| 10 | 36 | Single | $224,899.71 | $622,578.74 | 12 | $1,233,002.14 | No |
| 11 | 31 | Married | $27,945.36 | $215,440.31 | 24 | $285,900.10 | Yes |
| 12 | 40 | Single | $48,929.74 | $252,885.10 | 12 | $336,574.63 | No |
| 13 | 39 | Married | $82,810.92 | $183,045.16 | 12 | $262,537.23 | No |
| 14 | 31 | Single | $68,216.88 | $165,309.34 | 12 | $253,633.17 | No |
| 15 | 40 | Single | $59,141.13 | $220,176.18 | 12 | $424,749.80 | No |
| 16 | 45 | Married | $72,568.89 | $233,146.91 | 12 | $356,363.93 | No |
| 17 | 32 | Married | $101,140.43 | $245,360.02 | 24 | $388,429.41 | Yes |
| 18 | 37 | Married | $124,876.53 | $320,401.04 | 4 | $360,783.45 | Yes |
| 19 | 32 | Married | $133,093.15 | $494,395.63 | 12 | $861,874.67 | No |
| 20 | 32 | Single | $85,268.67 | $159,010.33 | 12 | $308,656.11 | No |
| 21 | 37 | Single | $92,314.96 | $249,547.14 | 24 | $342,339.27 | Yes |
| 22 | 29 | Married | $120,876.13 | $308,618.37 | 12 | $472,668.98 | No |
| 23 | 24 | Single | $86,294.13 | $258,321.78 | 24 | $380,347.56 | Yes |
| 24 | 32 | Married | $216,748.68 | $634,609.61 | 24 | $915,640.13 | Yes |
| 25 | 44 | Single | $46,389.75 | $194,770.91 | 12 | $385,288.86 | No |
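For readers working outside Excel, the counting that the PivotTable steps perform can be sketched in Python. This is only an illustration: it uses the marital status and default values of customers 1 through 10 from Table 5.3 rather than the full 300-record MortgageDefaultData file, so its counts will not match Table 5.4, and the variable names are assumptions.

```python
from collections import Counter

# (marital status, default on mortgage?) for customers 1-10 of Table 5.3.
records = [
    ("Single", "Yes"), ("Single", "No"), ("Married", "Yes"), ("Married", "Yes"),
    ("Single", "No"), ("Married", "Yes"), ("Single", "No"), ("Single", "No"),
    ("Married", "Yes"), ("Single", "No"),
]

# Count customers in each (row, column) cell, as the PivotTable does
# when the value field is summarized by Count.
crosstab = Counter(records)

print(f"{'Marital Status':<16}{'No Default':>11}{'Default':>9}{'Total':>7}")
for status in ("Married", "Single"):
    no_default = crosstab[(status, "No")]
    default = crosstab[(status, "Yes")]
    print(f"{status:<16}{no_default:>11}{default:>9}{no_default + default:>7}")
```

A `Counter` returns zero for a cell that never occurs, which is why the Married/No Default cell prints as 0 for this ten-record subset instead of raising an error.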

Table 5.4 Crosstabulation of Marital Status and if Customer Defaults on Mortgage

| Marital Status | No Default | Default | Total |
|---|---|---|---|
| Married | 64 | 79 | 143 |
| Single | 116 | 41 | 157 |
| Total | 180 | 120 | 300 |

Figure 5.5 PivotTable for Marital Status and Whether Customer Defaults on Mortgage [Excel screenshot]

From Table 5.4 or Figure 5.5, the probability that a customer defaults on his or her mortgage is \(120/300 = 0.4\). The probability that a customer does not default on his or her mortgage is \(1 - 0.4 = 0.6\) (or \(180/300 = 0.6\)). But is this probability different for married customers as compared with single customers? Conditional probability allows us to answer this question.

But first, let us answer a related question: What is the probability that a randomly selected customer defaults on his or her mortgage and the customer is married? The probability that a randomly selected customer is married and the customer defaults on his or her mortgage is written as \(P(M \cap D)\). This probability is calculated as \(P(M \cap D) = 79/300 = 0.2633\).

Similarly,

\(P(M \cap D^C) = 64/300 = 0.2133\) is the probability that a randomly selected customer is married and that the customer does not default on his or her mortgage.

\(P(S \cap D) = 41/300 = 0.1367\) is the probability that a randomly selected customer is single and that the customer defaults on his or her mortgage.

\(P(S \cap D^C) = 116/300 = 0.3867\) is the probability that a randomly selected customer is single and that the customer does not default on his or her mortgage.

Because each of these values gives the probability of the intersection of two events, the probabilities are called joint probabilities. Table 5.5, which provides a summary of the probability information for customer defaults on mortgages, is referred to as a joint probability table.

The values in the Total column and Total row (the margins) of Table 5.5 provide the probabilities of each event separately. That is, \(P(M) = 0.4766\), \(P(S) = 0.5234\), \(P(D^C) = 0.6000\), and \(P(D) = 0.4000\). These probabilities are referred to as marginal probabilities because of their location in the margins of the joint probability table. The marginal probabilities are found by summing the joint probabilities in the corresponding row or column of the joint probability table. From the marginal probabilities, we see that 60% of customers do not default on their mortgage, 40% of customers default on their mortgage, 47.66% of customers are married, and 52.34% of customers are single.

Let us begin the conditional probability analysis by computing the probability that a customer defaults on his or her mortgage given that the customer is married. In conditional probability notation, we are attempting to determine P(D | M), which is read as “the probability that the customer defaults on the mortgage given that the customer is married.” To calculate P(D | M), first we note that we are concerned only with the 143 customers who are married (M). Because 79 of the 143 married customers defaulted on their mortgages, the probability of a customer defaulting given that the customer is married is \(79/143 = 0.5524\). In other words, given that a customer is married, there is a 55.24% chance that he or she will default. Note also that the conditional probability P(D | M) can be computed as the ratio of the joint probability \(P(D \cap M)\) to the marginal probability P(M).

\[P(D \mid M) = \frac{P(D \cap M)}{P(M)} = \frac{0.2633}{0.4766} = 0.5524\]

We can also think of this joint probability in the following manner: What proportion of all customers are both married and defaulted on their loans?

We can use the PivotTable from Figure 5.5 to easily create the joint probability table in Excel. To do so, right-click on any of the numerical values in the PivotTable, select Show Values As, and choose % of Grand Total. The resulting values, which are percentages of the total, can then be divided by 100 to create the probabilities in the joint probability table.

Table 5.5 Joint Probability Table for Customer Mortgage Prepayments

| | No Default (\(D^C\)) | Default (D) | Total |
|---|---|---|---|
| Married (M) | 0.2133 | 0.2633 | 0.4766 |
| Single (S) | 0.3867 | 0.1367 | 0.5234 |
| Total | 0.6000 | 0.4000 | 1.0000 |

The four interior cells are the joint probabilities; the Total column and Total row contain the marginal probabilities.
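The joint probability table can also be derived directly from the counts in Table 5.4. The following Python sketch is an illustration only; the dictionary layout and variable names are assumptions, and the marginals are obtained by summing joint probabilities across each row or column as described in the text.

```python
# Counts from Table 5.4: (marital status, default outcome) -> customers.
counts = {
    ("Married", "No Default"): 64,
    ("Married", "Default"): 79,
    ("Single", "No Default"): 116,
    ("Single", "Default"): 41,
}
n = sum(counts.values())  # 300 customers

# Joint probabilities: each cell count divided by the grand total.
joint = {cell: count / n for cell, count in counts.items()}

# Marginal probabilities: sum the joint probabilities across a row or column.
marginal_status = {}
marginal_default = {}
for (status, outcome), p in joint.items():
    marginal_status[status] = marginal_status.get(status, 0.0) + p
    marginal_default[outcome] = marginal_default.get(outcome, 0.0) + p

for cell, p in joint.items():
    print(cell, f"{p:.4f}")
print(marginal_status)
print(marginal_default)
```

Note that this sketch sums unrounded joint probabilities, so it gives P(M) = 143/300, which rounds to 0.4767, and P(S) = 157/300, which rounds to 0.5233. The values 0.4766 and 0.5234 in Table 5.5 result from adding joint probabilities that were already rounded to four decimal places.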

The fact that conditional probabilities can be computed as the ratio of a joint probability to a marginal probability provides the following general formula for conditional probability calculations for two events A and B.

CONDITIONAL PROBABILITY

\[P(A \mid B) = \frac{P(A \cap B)}{P(B)} \qquad (5.3)\]

\[P(B \mid A) = \frac{P(A \cap B)}{P(A)} \qquad (5.4)\]

We have already determined the probability that a customer who is married will default is 0.5524. How does this compare to a customer who is single? In other words, we want to find P(D | S). From equation (5.3), we can compute P(D | S) as

\[P(D \mid S) = \frac{P(D \cap S)}{P(S)} = \frac{0.1367}{0.5234} = 0.2611\]

In other words, the chance that a customer will default if the customer is single is 26.11%. This is substantially less than the chance of default if the customer is married.
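Equation (5.3) translates directly into code. The sketch below is a minimal illustration, assuming the counts from Table 5.4; the function name `conditional_probability` is hypothetical, and the probabilities are computed from unrounded counts rather than from the rounded entries of Table 5.5.

```python
def conditional_probability(p_joint, p_condition):
    """Equation (5.3): P(A | B) = P(A and B) / P(B), defined only when P(B) > 0."""
    if p_condition <= 0:
        raise ValueError("the conditioning event must have positive probability")
    return p_joint / p_condition


n = 300                                # customers in Table 5.4
p_d_and_m, p_m = 79 / n, 143 / n       # P(D and M), P(M)
p_d_and_s, p_s = 41 / n, 157 / n       # P(D and S), P(S)

p_d_given_m = conditional_probability(p_d_and_m, p_m)
p_d_given_s = conditional_probability(p_d_and_s, p_s)
print(f"P(D | M) = {p_d_given_m:.4f}")
print(f"P(D | S) = {p_d_given_s:.4f}")
```

Dividing the joint probability by the marginal is equivalent to restricting attention to the conditioning group, which is why the results equal 79/143 and 41/157.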

Note that we could also answer this question using the Excel PivotTable in Figure 5.5. We can calculate these conditional probabilities by right-clicking on any numerical value in the body of the PivotTable and then selecting Show Values As and choosing % of Row Total. The modified Excel PivotTable is shown in Figure 5.6.

Figure 5.6 Using Excel PivotTable to Calculate Conditional Probabilities [Excel screenshot]

By calculating the % of Row Total, the Excel PivotTable in Figure 5.6 shows that 55.24% of married customers defaulted on mortgages, but only 26.11% of single customers defaulted.

Independent Events

Note that in our example, $P(D) = 0.4000$, $P(D \mid M) = 0.5524$, and $P(D \mid S) = 0.2611$. So the probability that a customer defaults is influenced by whether the customer is married or single. Because $P(D \mid M) \neq P(D)$, we say that events D and M are dependent. However, if the probability of event D is not changed by the existence of event M, that is, if $P(D \mid M) = P(D)$, then we would say that events D and M are independent events. This is summarized for two events A and B as follows:

INDEPENDENT EVENTS

Two events A and B are independent if

$$P(A \mid B) = P(A) \tag{5.5}$$

$$P(B \mid A) = P(B) \tag{5.6}$$

Otherwise, the events are dependent.

Multiplication Law

The multiplication law can be used to calculate the probability of the intersection of two events. The multiplication law is based on the definition of conditional probability. Solving equations (5.3) and (5.4) for $P(A \cap B)$, we obtain the multiplication law.

MULTIPLICATION LAW

$$P(A \cap B) = P(B)P(A \mid B) \tag{5.7}$$

$$P(A \cap B) = P(A)P(B \mid A) \tag{5.8}$$

To illustrate the use of the multiplication law, we will calculate the probability that a customer defaults on his or her mortgage and the customer is married, $P(D \cap M)$. From equation (5.7), this is calculated as $P(D \cap M) = P(M)P(D \mid M)$.

From Table 5.5 we know that $P(M) = 0.4766$, and from our previous calculations we know that the conditional probability $P(D \mid M) = 0.5524$. Therefore,

$$P(D \cap M) = P(M)P(D \mid M) = (0.4766)(0.5524) = 0.2633$$

This value matches the value shown for $P(D \cap M)$ in Table 5.5. The multiplication law is useful when we know conditional probabilities but do not know the joint probabilities.

Consider the special case in which events A and B are independent. From equations (5.5) and (5.6), $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$. Using these equations to simplify equations (5.7) and (5.8) for this special case, we obtain the following multiplication law for independent events.

MULTIPLICATION LAW FOR INDEPENDENT EVENTS

$$P(A \cap B) = P(A)P(B) \tag{5.9}$$

To compute the probability of the intersection of two independent events, we simply multiply the probabilities of each event.
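As a rough illustration of equations (5.7) and (5.9), the Python sketch below recomputes $P(D \cap M)$ with the multiplication law and then compares the joint probability from Table 5.5 with the product $P(D)P(M)$ that would hold under independence. The helper `are_independent` and its tolerance are assumptions made for this example; with rounded table values, an exact equality test would not be meaningful.

```python
import math

# Probabilities taken from Table 5.5 and the preceding calculations.
p_default = 0.4000                # P(D)
p_married = 0.4766                # P(M)
p_default_and_married = 0.2633    # P(D and M), joint probability
p_default_given_married = 0.5524  # P(D | M)

# Multiplication law, equation (5.7): P(D and M) = P(M) P(D | M).
via_multiplication_law = p_married * p_default_given_married

# If D and M were independent, equation (5.9) would give P(D) P(M).
if_independent = p_default * p_married


def are_independent(p_a, p_b, p_a_and_b, tol=1e-3):
    """Check P(A and B) = P(A) P(B) within a tolerance (tol is an assumption)."""
    return math.isclose(p_a_and_b, p_a * p_b, abs_tol=tol)


print("P(D and M) via (5.7):", round(via_multiplication_law, 4))
print("P(D) P(M):           ", round(if_independent, 4))
print("Independent?         ", are_independent(p_default, p_married, p_default_and_married))
```

The product $P(D)P(M)$ is well below the joint probability in Table 5.5, which agrees with the earlier conclusion that D and M are dependent.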


Bayes’ Theorem

Revising probabilities when new information is obtained is an important aspect of probability analysis. Often, we begin the analysis with initial or prior probability estimates for specific events of interest. Then, from sources such as a sample survey or a product test, we obtain additional information about the events. Given this new information, we update the prior probability values by calculating revised probabilities, referred to as posterior probabilities. Bayes’ theorem provides a means for making these probability calculations.

As an application of Bayes’ theorem, consider a manufacturing firm that receives shipments of parts from two different suppliers. Let $A_1$ denote the event that a part is from supplier 1 and let $A_2$ denote the event that a part is from supplier 2. Currently, 65% of the parts purchased by the company are from supplier 1 and the remaining 35% are from supplier 2. Hence, if a part is selected at random, we would assign the prior probabilities $P(A_1) = 0.65$ and $P(A_2) = 0.35$.

The quality of the purchased parts varies according to their source. Historical data suggest that the quality ratings of the two suppliers are as shown in Table 5.6.

If we let G be the event that a part is good and we let B be the event that a part is bad, the information in Table 5.6 enables us to calculate the following conditional probability values:

$$P(G \mid A_1) = 0.98 \qquad P(B \mid A_1) = 0.02$$

$$P(G \mid A_2) = 0.95 \qquad P(B \mid A_2) = 0.05$$

Figure 5.7 shows a diagram that depicts the process of the firm receiving a part from one of the two suppliers and then discovering that the part is good or bad as a two-step random experiment. We see that four outcomes are possible; two correspond to the part being good and two correspond to the part being bad.

Each of the outcomes is the intersection of two events, so we can use the multiplication law to compute the probabilities. For instance,

$$P(A_1, G) = P(A_1 \cap G) = P(A_1)P(G \mid A_1)$$

The process of computing these joint probabilities can be depicted in what is called a probability tree (see Figure 5.8). From left to right through the tree, the probabilities for each branch at step 1 are prior probabilities and the probabilities for each branch at step 2 are conditional probabilities. To find the probability of each experimental outcome, simply multiply the probabilities on the branches leading to the outcome. Each of these joint probabilities is shown in Figure 5.8 along with the known probabilities for each branch.
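The branch-by-branch multiplication in the probability tree can be sketched in a few lines of Python. This is only an illustration under the probabilities given above; the dictionary names `priors`, `quality_given_supplier`, and `joint` are our own and are not part of the text.

```python
# Step 1 branches: prior probabilities for each supplier.
priors = {"A1": 0.65, "A2": 0.35}

# Step 2 branches: conditional probabilities of part quality given the
# supplier, from Table 5.6 ("G" = good, "B" = bad).
quality_given_supplier = {
    "A1": {"G": 0.98, "B": 0.02},
    "A2": {"G": 0.95, "B": 0.05},
}

# Multiply the probabilities along each path of the tree to obtain the
# joint probability of each (supplier, quality) outcome.
joint = {
    (supplier, quality): priors[supplier] * p
    for supplier, branches in quality_given_supplier.items()
    for quality, p in branches.items()
}

for outcome, probability in joint.items():
    print(outcome, round(probability, 4))

# The four outcomes cover the whole sample space, so they should sum to 1.
print("Total:", round(sum(joint.values()), 4))
```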

Now suppose that the parts from the two suppliers are used in the firm’s manufacturing process and that a machine breaks down while attempting the process using a bad part. Given the information that the part is bad, what is the probability that it came from supplier 1 and what is the probability that it came from supplier 2? With the information in the probability tree (Figure 5.8), Bayes’ theorem can be used to answer these questions.

For the case in which there are only two events ($A_1$ and $A_2$), Bayes’ theorem can be written as follows:

BAYES’ THEOREM (TWO-EVENT CASE)

$$P(A_1 \mid B) = \frac{P(A_1)P(B \mid A_1)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2)} \tag{5.10}$$

$$P(A_2 \mid B) = \frac{P(A_2)P(B \mid A_2)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2)} \tag{5.11}$$

Bayes’ theorem is also discussed in Chapter 15 in the context of decision analysis.


Figure 5.7 Diagram for Two-Supplier Example

[Figure: tree diagram. Step 1 (Supplier) branches to $A_1$ or $A_2$; Step 2 (Condition) branches to G or B. The four outcomes are $(A_1, G)$, $(A_1, B)$, $(A_2, G)$, and $(A_2, B)$.]

Note: Step 1 shows that the part comes from one of two suppliers and Step 2 shows whether the part is good or bad.

Figure 5.8 Probability Tree for Two-Supplier Example

[Figure: probability tree. Step 1 (Supplier) branches: $P(A_1) = 0.65$ and $P(A_2) = 0.35$. Step 2 (Condition) branches: $P(G \mid A_1) = 0.98$, $P(B \mid A_1) = 0.02$, $P(G \mid A_2) = 0.95$, and $P(B \mid A_2) = 0.05$.]

Probability of Outcome:

$$P(A_1 \cap G) = P(A_1)P(G \mid A_1) = (0.65)(0.98) = 0.6370$$

$$P(A_1 \cap B) = P(A_1)P(B \mid A_1) = (0.65)(0.02) = 0.0130$$

$$P(A_2 \cap G) = P(A_2)P(G \mid A_2) = (0.35)(0.95) = 0.3325$$

$$P(A_2 \cap B) = P(A_2)P(B \mid A_2) = (0.35)(0.05) = 0.0175$$

Table 5.6 Historical Quality Levels for Two Suppliers

              % Good Parts   % Bad Parts
Supplier 1    98             2
Supplier 2    95             5


Using equation (5.10) and the probability values provided in Figure 5.8, we have

$$\begin{aligned}
P(A_1 \mid B) &= \frac{P(A_1)P(B \mid A_1)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2)} \\
&= \frac{(0.65)(0.02)}{(0.65)(0.02) + (0.35)(0.05)} = \frac{0.0130}{0.0130 + 0.0175} \\
&= \frac{0.0130}{0.0305} = 0.4262
\end{aligned}$$

Using equation (5.11), we find $P(A_2 \mid B)$ as

$$\begin{aligned}
P(A_2 \mid B) &= \frac{P(A_2)P(B \mid A_2)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2)} \\
&= \frac{(0.35)(0.05)}{(0.65)(0.02) + (0.35)(0.05)} = \frac{0.0175}{0.0130 + 0.0175} \\
&= \frac{0.0175}{0.0305} = 0.5738
\end{aligned}$$

Note that in this application we started with a probability of 0.65 that a part selected at random was from supplier 1. However, given information that the part is bad, the probability that the part is from supplier 1 drops to 0.4262. In fact, if the part is bad, the chance is better than 50–50 that it came from supplier 2; that is, $P(A_2 \mid B) = 0.5738$.

Bayes’ theorem is applicable when events for which we want to compute posterior probabilities are mutually exclusive and their union is the entire sample space. For the case of n mutually exclusive events $A_1, A_2, \ldots, A_n$, whose union is the entire sample space, Bayes’ theorem can be used to compute any posterior probability $P(A_i \mid B)$ as shown in equation (5.12).

BAYES’ THEOREM

$$P(A_i \mid B) = \frac{P(A_i)P(B \mid A_i)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2) + \cdots + P(A_n)P(B \mid A_n)} \tag{5.12}$$

If the union of events is the entire sample space, the events are said to be collectively exhaustive.

NOTES + COMMENTS

By applying basic algebra we can derive the multiplication law from the definition of conditional probability. For two events A and B, the probability of A given B is $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$. If we multiply both sides of this expression by P(B), the P(B) in the numerator and denominator on the right side of the expression will cancel and we are left with $P(A \mid B)P(B) = P(A \cap B)$, which is the multiplication law.
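Equation (5.12) translates directly into a short computation: multiply each prior by its conditional probability, then divide by the sum of those products. The Python sketch below is one possible way to write this, assuming the priors and conditional probabilities are supplied as equal-length lists; the function name `bayes_posteriors` is hypothetical. Applied to the two-supplier example, it should agree with the posterior probabilities 0.4262 and 0.5738 computed above.

```python
def bayes_posteriors(priors, likelihoods):
    """Return [P(A_i | B)] given [P(A_i)] and [P(B | A_i)], per equation (5.12).

    The events A_i are assumed to be mutually exclusive and collectively
    exhaustive, so the priors must sum to 1.
    """
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("prior probabilities must sum to 1")

    # Numerators of (5.12): the joint probabilities P(A_i and B).
    joints = [prior * like for prior, like in zip(priors, likelihoods)]

    # Denominator of (5.12): the total probability of B.
    p_b = sum(joints)
    if p_b == 0:
        raise ValueError("event B has probability zero under these inputs")

    return [j / p_b for j in joints]


# Two-supplier example: priors P(A1), P(A2) and P(B | A1), P(B | A2).
posteriors = bayes_posteriors([0.65, 0.35], [0.02, 0.05])
print([round(p, 4) for p in posteriors])
```

Because the denominator is the same for every event, the posterior probabilities always sum to 1, which provides a quick check on the calculation.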

5.4 Random Variables

In probability terms, a random variable is a numerical description of the outcome of a random experiment. Because the outcome of a random experiment is not known with certainty, a random variable can be thought of as a quantity whose value is not known with certainty. A random variable can be classified as being either discrete or continuous depending on the numerical values it can assume.

Discrete Random Variables

A random variable that can take on only specified discrete values is referred to as a discrete random variable. Table 5.7 provides examples of discrete random variables.

Returning to our example of Lancaster Savings and Loan, we can define a random variable x to indicate whether or not a customer defaults on his or her mortgage. As previously stated, the values of a random variable must be numerical, so we can define random variable x such that $x = 1$ if the customer defaults on his or her mortgage and $x = 0$ if the customer does not default on his or her mortgage. An additional random variable, y, could indicate whether the customer is married or single. For instance, we can define random variable y such that $y = 1$ if the customer is married and $y = 0$ if the customer is single. Yet another random variable, z, could be defined as the number of mortgage payments per year made by the customer. For instance, a customer who makes monthly payments would make $z = 12$ payments per year, and a customer who makes payments quarterly would make $z = 4$ payments per year.

Chapter 2 introduces the concept of random variables and the use of data to describe them.

Table 5.8 repeats the joint probability table for the Lancaster Savings and Loan data, but this time with the values labeled as random variables.

Continuous Random Variables

A random variable that may assume any numerical value in an interval or collection of intervals is called a continuous random variable. Technically, relatively few random variables are truly continuous; these include values related to time, weight, distance, and temperature. An example of a continuous random variable is x = the time between consecutive incoming calls to a call center. This random variable can take on any value $x > 0$, such as $x = 1.26$ minutes, $x = 2.571$ minutes, $x = 4.3333$ minutes, etc. Table 5.9 provides examples of continuous random variables.

As illustrated by the final example in Table 5.9, many discrete random variables have a large number of potential outcomes and so can be effectively modeled as continuous random variables. Consider our Lancaster Savings and Loan example. We can define a random variable x = total amount paid by customer over the lifetime of the mortgage. Because we typically measure financial values only to two decimal places, one could consider this a discrete random variable. However, because there are many possible values for this random variable in any practical interval, it is usually appropriate to model the amount as a continuous random variable.

Table 5.7 Examples of Discrete Random Variables

Random Experiment                             Random Variable (x)                       Possible Values for the Random Variable
Flip a coin                                   Face of coin showing                      1 if heads; 0 if tails
Roll a die                                    Number of dots showing on top of die      1, 2, 3, 4, 5, 6
Contact five customers                        Number of customers who place an order    0, 1, 2, 3, 4, 5
Operate a health care clinic for one day      Number of patients who arrive             0, 1, 2, 3, …
Offer a customer the choice of two products   Product chosen by customer                0 if none; 1 if choose product A; 2 if choose product B

Table 5.8 Joint Probability Table for Customer Mortgage Prepayments

                   No Default ($x = 0$)   Default ($x = 1$)   f(y)
Married ($y = 1$)  0.2133                 0.2633              0.4766
Single ($y = 0$)   0.3867                 0.1367              0.5234
f(x)               0.6000                 0.4000              1.0000


5.5 Discrete Probability Distributions

The probability distribution for a random variable describes the range and relative likelihood of possible values for a random variable. For a discrete random variable x, the probability distribution is defined by a probability mass function, denoted by f(x). The probability mass function provides the probability for each value of the random variable.

Returning to our example of mortgage defaults, consider the data shown in Table 5.3 for Lancaster Savings and Loan and the associated joint probability table in Table 5.8. From Table 5.8, we see that $f(0) = 0.6$ and $f(1) = 0.4$. Note that these values satisfy the required conditions of a discrete probability distribution that (1) $f(x) \ge 0$ and (2) $\sum f(x) = 1$.
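The two required conditions can be checked mechanically. The following Python sketch assumes a probability mass function is stored as a dictionary that maps each value of the random variable to its probability; the helper name `is_valid_pmf` and the tolerance used for the sum are assumptions for illustration. It is applied to the mortgage default distribution from Table 5.8, $f(0) = 0.6$ and $f(1) = 0.4$.

```python
import math


def is_valid_pmf(pmf, tol=1e-9):
    """Check the two conditions: f(x) >= 0 for all x, and the f(x) sum to 1."""
    nonnegative = all(p >= 0 for p in pmf.values())
    sums_to_one = math.isclose(sum(pmf.values()), 1.0, abs_tol=tol)
    return nonnegative and sums_to_one


# Mortgage default random variable: x = 0 (no default), x = 1 (default).
default_pmf = {0: 0.6, 1: 0.4}
print(is_valid_pmf(default_pmf))
```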

We can also present probability distributions graphically. In Figure 5.9, the values of the random variable x are shown on the horizontal axis and the probability associated with these values is shown on the vertical axis.

Custom Discrete Probability Distribution

A probability distribution that is generated from observations such as that shown in Figure 5.9 is called an empirical probability distribution. This particular empirical probability distribution is considered a custom discrete distribution because it is discrete and the possible values of the random variable have different probabilities.

Figure 5.9 Graphical Representation of the Probability Distribution for Whether a Customer Defaults on a Mortgage

[Figure: bar chart of f(x). Horizontal axis: Mortgage Default Random Variable (x = 0, 1). Vertical axis: Probability f(x).]

Table 5.9 Examples of Continuous Random Variables

Random Experiment                                    Random Variable (x)                                                                                               Possible Values for the Random Variable
Customer visits a web page                           Time customer spends on web page in minutes                                                                       $x \ge 0$
Fill a soft drink can (max capacity = 12.1 ounces)   Number of ounces                                                                                                  $0 \le x \le 12.1$
Test a new chemical process                          Temperature when the desired reaction takes place (min temperature = 150°F; max temperature = 212°F)            $150 \le x \le 212$
Invest $10,000 in the stock market                   Value of investment after one year                                                                                $x \ge 0$

NOTES + COMMENTS

1. In this section we again use the relative frequency method to assign probabilities for the Lancaster Savings and Loan example. Technically, the concept of random variables applies only to populations; probabilities that are found using sample data are only estimates of the true probabilities. However, larger samples generate more reliable estimated probabilities, so if we have a large enough data set (as we are assuming here for the Lancaster Savings and Loan data), then we can treat the data as if they are from a population and the relative frequency method is appropriate to assign probabilities to the outcomes.

2. Random variables can be used to represent uncertain future values. Chapter 11 explains how random variables can be used in simulation models to evaluate business decisions in the presence of uncertainty.

A custom discrete probability distribution is very useful for describing different possible scenarios that have different probabilities of occurring. The probabilities associated with each scenario can be generated using either the subjective method or the relative frequency method. Using a subjective method, probabilities are based on experience or intuition when little relevant data are available. If sufficient data exist, the relative frequency method can be used to determine probabilities. Consider the random variable describing the number of payments made per year by a randomly chosen customer. Table 5.10 presents a summary of the number of payments made per year by the 300 home mortgage customers. This table shows us that 45 customers made quarterly payments ($x = 4$), 180 customers made monthly payments ($x = 12$), and 75 customers made two payments each month ($x = 24$). We can then calculate $f(4) = 45/300 = 0.15$, $f(12) = 180/300 = 0.60$, and $f(24) = 75/300 = 0.25$. In other words, the probability that a randomly selected customer makes 4 payments per year is 0.15, the probability that a randomly selected customer makes 12 payments per year is 0.60, and the probability that a randomly selected customer makes 24 payments per year is 0.25.

Table 5.10 Summary Table of Number of Payments Made per Year

                          Number of Payments Made per Year
                          $x = 4$   $x = 12$   $x = 24$   Total
Number of observations    45        180        75         300
f(x)                      0.15      0.60       0.25

We can write this probability distribution as a function in the following manner:

$$f(x) = \begin{cases} 0.15 & \text{if } x = 4 \\ 0.60 & \text{if } x = 12 \\ 0.25 & \text{if } x = 24 \\ 0 & \text{otherwise} \end{cases}$$

This probability mass function tells us in a convenient way that $f(x) = 0.15$ when $x = 4$ (the probability that the random variable $x = 4$ is 0.15); $f(x) = 0.60$ when $x = 12$ (the probability that the random variable $x = 12$ is 0.60); $f(x) = 0.25$ when $x = 24$ (the probability that the random variable $x = 24$ is 0.25); and $f(x) = 0$ when x is any other value (there is zero probability that the random variable x is some value other than 4, 12, or 24).
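The same probability mass function can be written as a small function in code. The sketch below is one way to do this in Python; the function name `payments_pmf` is our own choice, and the lookup table simply restates the probabilities from Table 5.10.

```python
def payments_pmf(x):
    """Probability mass function f(x) for the number of payments per year.

    Returns the probability from Table 5.10 for x = 4, 12, or 24, and
    0 for any other value of x.
    """
    probabilities = {4: 0.15, 12: 0.60, 24: 0.25}
    return probabilities.get(x, 0.0)


for x in (4, 12, 24, 6):
    print(f"f({x}) = {payments_pmf(x)}")
```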

Note that we can also create Table 5.10 in Excel using a PivotTable as shown in Figure 5.10.


Figure 5.10 Excel PivotTable for Number of Payments Made per Year

Expected Value and Variance

The expected value, or mean, of a random variable is a measure of the central location for the random variable. It is the weighted average of the values of the random variable, where the weights are the probabilities. The formula for the expected value of a discrete random variable x follows:

EXPECTED VALUE OF A DISCRETE RANDOM VARIABLE

$$E(x) = \mu = \sum x f(x) \tag{5.13}$$

Chapter 2 discusses the computation of the mean of a random variable based on data.

Both the notations E(x) and $\mu$ are used to denote the expected value of a random variable. Equation (5.13) shows that to compute the expected value of a discrete random variable, we must multiply each value of the random variable by the corresponding probability f(x) and then add the resulting products. Table 5.11 calculates the expected value of the number of payments made by a mortgage customer in a year. The sum of the entries in the xf(x) column shows that the expected value is 13.8 payments per year. Therefore, if Lancaster Savings and Loan signs up a new mortgage customer, the expected number of payments per year made by this new customer is 13.8. Obviously, no customer will make exactly 13.8 payments per year, but this value represents our expectation for the number of payments per year made by a new customer absent any other information about the new customer. Some customers will make fewer payments (4 or 12 per year), some customers will make more payments (24 per year), but 13.8 represents the expected number of payments per year based on the probabilities calculated in Table 5.10.

The SUMPRODUCT function in Excel can easily be used to calculate the expected value for a discrete random variable. This is illustrated in Figure 5.11. We can also calculate the expected value of the random variable directly from the Lancaster Savings and Loan data using the Excel function AVERAGE, as shown in Figure 5.12. Column F contains the data on the number of payments made per year by each mortgage customer in the data set. Using the Excel formula =AVERAGE(F2:F301) gives us a value of 13.8 for the expected value, which is the same as the value we calculated in Table 5.11.

Note that we cannot simply use the AVERAGE function on the x values for a custom discrete random variable. If we did, this would give us a calculated value of $(4 + 12 + 24)/3 = 13.333$, which is not the correct expected value in this scenario. This is because using the AVERAGE function in this way assumes that each value of the random variable x is equally likely. But in this case, we know that $x = 12$ is much more likely than $x = 4$ or $x = 24$. Therefore, we must use equation (5.13) to calculate the expected value of a custom discrete random variable, or we can use the Excel function AVERAGE on the entire data set, as shown in Figure 5.12.
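The three calculations discussed above can be compared side by side in Python. This is a minimal sketch: the probability-weighted sum plays the role of SUMPRODUCT, and the list `observations` is rebuilt from the counts in Table 5.10 as a stand-in for the 300 rows of data, since the original worksheet is not reproduced here.

```python
values = [4, 12, 24]
probabilities = [0.15, 0.60, 0.25]

# Equation (5.13): weight each value by its probability and add the products.
expected_value = sum(x * p for x, p in zip(values, probabilities))

# Unweighted average of the x values, which wrongly treats them as equally likely.
naive_average = sum(values) / len(values)

# Rebuild the 300 observations from the counts in Table 5.10 and average them.
counts = {4: 45, 12: 180, 24: 75}
observations = [x for x, n in counts.items() for _ in range(n)]
data_mean = sum(observations) / len(observations)

print("Expected value (5.13):", round(expected_value, 3))
print("Naive average of x:   ", round(naive_average, 3))
print("Mean of the data set: ", round(data_mean, 3))
```

The weighted sum and the mean of the full data set agree because the relative frequencies in the data are exactly the probabilities f(x).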

Table 5.11 Calculation of the Expected Value for Number of Payments Made per Year by a Lancaster Savings and Loan Mortgage Customer

x     f(x)    xf(x)
4     0.15    (4)(0.15) = 0.6
12    0.60    (12)(0.60) = 7.2
24    0.25    (24)(0.25) = 6.0
              13.8

$E(x) = \mu = \sum x f(x) = 13.8$

Figure 5.11 Using Excel SUMPRODUCT Function to Calculate the Expected Value for Number of Payments Made per Year by a Lancaster Savings and Loan Mortgage Customer


Variance is a measure of variability in the values of a random variable. It is a weighted average of the squared deviations of a random variable from its mean where the weights are the probabilities. Below we define the formula for calculating the variance of a discrete random variable.

VARIANCE OF A DISCRETE RANDOM VARIABLE

$$\mathrm{Var}(x) = \sigma^2 = \sum (x - \mu)^2 f(x) \tag{5.14}$$

Chapter 2 discusses the computation of the variance of a random variable based on data.

Figure 5.12 Excel Calculation of the Expected Value for Number of Payments Made per Year by a Lancaster Savings and Loan Mortgage Customer

As equation (5.14) shows, an essential part of the variance formula is the deviation, $x - \mu$, which measures how far a particular value of the random variable is from the expected value, or mean, $\mu$. In computing the variance of a random variable, the deviations are squared and then weighted by the corresponding value of the probability mass function. The sum of these weighted squared deviations for all values of the random variable is referred to as the variance. The notations Var(x) and $\sigma^2$ are both used to denote the variance of a random variable.

The calculation of the variance of the number of payments made per year by a mortgage customer is summarized in Table 5.12. We see that the variance is 42.360. The standard deviation, $\sigma$, is defined as the positive square root of the variance. Thus, the standard deviation for the number of payments made per year by a mortgage customer is $\sqrt{42.360} = 6.508$.

The Excel function SUMPRODUCT can be used to easily calculate equation (5.14) for a custom discrete random variable. We illustrate the use of the SUMPRODUCT function to calculate variance in Figure 5.13.

We can also use Excel to find the variance directly from the data when the values in the data occur with relative frequencies that correspond to the probability distribution of the random variable. Cell F305 in Figure 5.12 shows that we use the Excel formula =VAR.P(F2:F301) to calculate the variance from the complete data. This formula gives us a value of 42.360, which is the same as that calculated in Table 5.12 and Figure 5.13. Similarly, we can use the formula =STDEV.P(F2:F301) to calculate the standard deviation of 6.508.

Chapter 2 discusses the computation of the standard deviation of a random variable based on data.

Table 5.12 Calculation of the Variance for Number of Payments Made per Year by a Lancaster Savings and Loan Mortgage Customer

x     $x - \mu$             f(x)    $(x - \mu)^2 f(x)$
4     4 − 13.8 = −9.8       0.15    $(-9.8)^2 (0.15) = 14.406$
12    12 − 13.8 = −1.8      0.60    $(-1.8)^2 (0.60) = 1.944$
24    24 − 13.8 = 10.2      0.25    $(10.2)^2 (0.25) = 26.010$
                                    42.360

$\sigma^2 = \sum (x - \mu)^2 f(x) = 42.360$

Figure 5.13 Excel Calculation of the Variance for Number of Payments Made per Year by a Lancaster Savings and Loan Mortgage Customer

As with the AVERAGE function and expected value, we cannot use the Excel functions VAR.P and STDEV.P directly on the x values to calculate the variance and standard devi- ation of a custom discrete random variable if the x values are not equally likely to occur. Instead we must either use the formula from equation (5.14) or use the Excel functions on the entire data set as shown in Figure 5.12.
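For readers working outside Excel, the following Python sketch mirrors these calculations under the same assumptions. The weighted sum implements equation (5.14); `statistics.pvariance` and `statistics.pstdev` from the Python standard library compute the population variance and standard deviation, which is the role VAR.P and STDEV.P play in the text. As before, the list `observations` is rebuilt from the counts in Table 5.10 and is only a stand-in for the original data set.

```python
import math
import statistics

values = [4, 12, 24]
probabilities = [0.15, 0.60, 0.25]

# Expected value, equation (5.13).
mean = sum(x * p for x, p in zip(values, probabilities))

# Variance, equation (5.14): probability-weighted squared deviations.
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probabilities))
std_dev = math.sqrt(variance)

# Population variance and standard deviation of the full data set,
# rebuilt from the counts in Table 5.10.
counts = {4: 45, 12: 180, 24: 75}
observations = [x for x, n in counts.items() for _ in range(n)]
data_variance = statistics.pvariance(observations)
data_std_dev = statistics.pstdev(observations)

# Applying the population variance to the three x values alone treats
# them as equally likely and gives a different (incorrect) answer.
naive_variance = statistics.pvariance(values)

print("Variance (5.14):      ", round(variance, 3))
print("Standard deviation:   ", round(std_dev, 3))
print("Variance of data set: ", round(data_variance, 3))
print("Naive variance of x:  ", round(naive_variance, 3))
```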

Note that here we are using the Excel functions VAR.P and STDEV.P rather than VAR.S and STDEV.S. This is because we are assuming that the sample of 300 Lancaster Savings and Loan mortgage customers is a perfect representation of the population.

Discrete Uniform Probability Distribution

When the possible values of the probability mass function, f(x), are all equal, then the probability distribution is a discrete uniform probability distribution. For instance, the values that result from rolling a single fair die are an example of a discrete uniform distribution because the possible outcomes $y = 1$, $y = 2$, $y = 3$, $y = 4$, $y = 5$, and $y = 6$ all have the same probability: $f(1) = f(2) = f(3) = f(4) = f(5) = f(6) = 1/6$. The general form of the probability mass function for a discrete uniform probability distribution is as follows:

DISCRETE UNIFORM PROBABILITY MASS FUNCTION

$$f(x) = 1/n \tag{5.15}$$

where n = the number of unique values that may be assumed by the random variable.
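Equation (5.15) can be illustrated with the fair die. The Python sketch below builds the discrete uniform probability mass function for any collection of values; the helper name `discrete_uniform_pmf` is an assumption, and exact fractions are used so that the probabilities sum to exactly 1.

```python
from fractions import Fraction


def discrete_uniform_pmf(values):
    """Assign probability 1/n to each of the n unique values, per equation (5.15)."""
    unique_values = sorted(set(values))
    n = len(unique_values)
    return {v: Fraction(1, n) for v in unique_values}


# A single fair die: the values 1 through 6 are equally likely.
die = discrete_uniform_pmf(range(1, 7))

# Expected value from equation (5.13) applied to the uniform distribution.
expected_roll = sum(v * p for v, p in die.items())

print(die[3])          # probability of rolling a 3
print(expected_roll)   # expected value of a single roll, as a fraction
```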

Binomial Probability Distribution

As an example of the use of the binomial probability distribution, consider an online specialty clothing company called Martin’s. Martin’s commonly sends out targeted e-mails to its best customers notifying them about special discounts that are available only to the recipients of the e-mail. The e-mail contains a link that takes the customer directly to a web page for the discounted item. The exact number of customers who will click on the link is obviously unknown, but from previous data, Martin’s estimates that the probability that a customer clicks on the link in the e-mail is 0.30. Martin’s is interested in knowing more about the probabilities associated with one, two, three, etc. customers clicking on the link in the targeted e-mail.

The probability distribution related to the number of customers who click on the targeted e-mail link can be described using a binomial probability distribution. A binomial probability distribution is a discrete probability distribution that can be used to describe many situations in which each of a fixed number (n) of repeated identical and independent trials has two, and only two, possible outcomes. In general terms, we refer to these two possible outcomes as either a success or a failure. A success occurs with probability p in each trial and a failure occurs with probability $1 - p$ in each trial. In the Martin’s example, the “trial” refers to a customer receiving the targeted e-mail. We will define a success as a customer clicking on the e-mail link ($p = 0.30$) and a failure as a customer not clicking on the link ($1 - p = 0.70$). The binomial probability distribution can then be used to calculate the probability of a given number of successes (customers who click on the e-mail link) out of a given number of independent trials (number of e-mails sent to customers). Other examples that can often be described by a binomial probability distribution include counting the number of heads resulting from flipping a coin 20 times, the number of customers who click on a particular advertisement link on a web site in a day, the number of days on which a particular financial stock increases in value over a month, and the number of nondefective parts produced in a batch.

Equation (5.16) provides the probability mass function for a binomial random variable that calculates the probability of x successes in n independent trials.

BINOMIAL PROBABILITY MASS FUNCTION

$$f(x) = \binom{n}{x} p^x (1 - p)^{(n - x)} \tag{5.16}$$

where

x = the number of successes
p = the probability of a success on one trial
n = the number of trials
f(x) = the probability of x successes in n trials

and

$$\binom{n}{x} = \frac{n!}{x!(n - x)!}$$

Whether or not a customer clicks on the link is an example of what is known as a Bernoulli trial: a trial in which (1) there are two possible outcomes, success or failure, and (2) the probability of success is the same every time the trial is executed. The probability distribution related to the number of successes in a set of n independent Bernoulli trials can be described by a binomial probability distribution.

n! is read as “n factorial,” and $n! = n \times (n - 1) \times (n - 2) \times \cdots \times 2 \times 1$. For example, $4! = 4 \times 3 \times 2 \times 1 = 24$. The Excel formula =FACT(n) can be used to calculate n factorial.


In the Martin’s example, we can use equation (5.16) to compute the probability that out of three customers who receive the e-mail: (1) no customer clicks on the link; (2) exactly one customer clicks on the link; (3) exactly two customers click on the link; and (4) all three customers click on the link. The calculations are summarized in Table 5.13, which gives the probability distribution of the number of customers who click on the targeted e-mail link. Figure 5.14 is a graph of this probability distribution. Table 5.13 and Figure 5.14 show that the highest probability is associated with exactly one customer clicking on the Martin’s targeted e-mail link and the lowest probability is associated with all three customers clicking on the link.

Because the outcomes in the Martin’s example are mutually exclusive, we can easily use these results to answer interesting questions about various events. For example, using the information in Table 5.13, the probability that no more than one customer clicks on the link is $P(x \le 1) = P(x = 0) + P(x = 1) = 0.343 + 0.441 = 0.784$.
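The calculations for the three-customer case can be reproduced with a direct implementation of equation (5.16). The Python sketch below uses `math.comb` from the standard library (available in Python 3.8 and later) for the binomial coefficient; the function name `binomial_pmf` is our own.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Equation (5.16): probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Martin's example: n = 3 customers, each clicks with probability p = 0.30.
n, p = 3, 0.30
distribution = {x: binomial_pmf(x, n, p) for x in range(n + 1)}

for x, probability in distribution.items():
    print(f"f({x}) = {probability:.3f}")

# Mutually exclusive outcomes can be added: P(x <= 1) = f(0) + f(1).
p_at_most_one = distribution[0] + distribution[1]
print(f"P(x <= 1) = {p_at_most_one:.3f}")
```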

Table 5.13 Probability Distribution for the Number of Customers Who Click on the Link in the Martin’s Targeted E-Mail

x     f(x)
0     $\frac{3!}{0!\,3!}(0.30)^0(0.70)^3 = 0.343$
1     $\frac{3!}{1!\,2!}(0.30)^1(0.70)^2 = 0.441$
2     $\frac{3!}{2!\,1!}(0.30)^2(0.70)^1 = 0.189$
3     $\frac{3!}{3!\,0!}(0.30)^3(0.70)^0 = 0.027$
      1.000

Figure 5.14 Graphical Representation of the Probability Distribution for the Number of Customers Who Click on the Link in the Martin’s Targeted E-Mail

[Figure: bar chart of f(x). Horizontal axis: Number of Customers Who Click on Link (x = 0, 1, 2, 3). Vertical axis: Probability f(x), from .00 to .50.]


If we consider a scenario in which 10 customers receive the targeted e-mail, the binomial probability mass function given by equation (5.16) is still applicable. If we want to find the probability that exactly 4 of the 10 customers click on the link and $p = 0.30$, then we calculate:

$$f(4) = \frac{10!}{4!\,6!}(0.30)^4(0.70)^6 = 0.2001$$

In Excel we can use the BINOM.DIST function to compute binomial probabilities. Figure 5.15 reproduces the Excel calculations from Table 5.13 for the Martin’s problem with three customers.

The BINOM.DIST function in Excel has four input values: the first is the value of x, the second is the value of n, the third is the value of p, and the fourth is FALSE or TRUE. We choose FALSE for the fourth input if a probability mass function value f(x) is desired, and TRUE if a cumulative probability is desired. The formula 5BINOM.DIST(A5,$D$1:$D$2,FALSE) has been entered into cell B5 to compute the probability of 0 successes in three trials, f(0). Figure 5.15 shows that this value is 0.343, the same as in Table 5.13.

Cells C5:C8 show the cumulative probability distribution values for this example. Note that these values are computed in Excel by entering TRUE as the fourth input in the BINOM.DIST function. The cumulative probability for x using a binomial distribution is the probability of x or fewer successes out of n trials. Cell C5 computes the cumulative probability for x = 0, which is the same as the probability for x = 0 because the probability of 0 successes is the same as the probability of 0 or fewer successes. Cell C7 computes the cumulative probability for x = 2 using the formula =BINOM.DIST(A7,$D$1,$D$2,TRUE). This value is 0.973, meaning that the probability that two or fewer customers click on the targeted e-mail link is 0.973. Note that the value 0.973 simply corresponds to f(0) + f(1) + f(2) = 0.343 + 0.441 + 0.189 = 0.973 because it is the probability of two or fewer customers clicking on the link, which could be zero customers, one customer, or two customers.
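As a cross-check outside Excel, the binomial calculations above can be sketched in Python. This is a minimal illustration, not part of the text's Excel workflow: the helper names binomial_pmf and binomial_cdf are assumptions made for this example, and only the standard library is used.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binomial_cdf(x, n, p):
    """Probability of x or fewer successes in n independent trials."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))

# Three customers, p = 0.30 (the values behind Table 5.13)
f0 = binomial_pmf(0, 3, 0.30)         # plays the role of BINOM.DIST(..., FALSE)
F2 = binomial_cdf(2, 3, 0.30)         # plays the role of BINOM.DIST(..., TRUE)

# Ten customers, exactly four clicks
f4_of_10 = binomial_pmf(4, 10, 0.30)

print(f"{f0:.3f} {F2:.3f} {f4_of_10:.4f}")
```

The three printed values should agree with 0.343, 0.973, and 0.2001 from the discussion above.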

FIGURE 5.15  Excel Worksheet for Computing Binomial Probabilities of the Number of Customers Who Make a Purchase at Martin's


Poisson Probability Distribution

In this section, we consider a discrete random variable that is often useful in estimating the number of occurrences of an event over a specified interval of time or space. For example, the random variable of interest might be the number of patients who arrive at a health care clinic in 1 hour, the number of computer-server failures in a month, the number of repairs needed in 10 miles of highway, or the number of leaks in 100 miles of pipeline. If the following two properties are satisfied, the number of occurrences is a random variable that is described by the Poisson probability distribution: (1) the probability of an occurrence is the same for any two intervals (of time or space) of equal length; and (2) the occurrence or nonoccurrence in any interval (of time or space) is independent of the occurrence or nonoccurrence in any other interval.

The Poisson probability mass function is defined by equation (5.17).

The number e is a mathematical constant that is the base of the natural logarithm. Although it is an irrational number, 2.71828 is a sufficient approximation for our purposes.

POISSON PROBABILITY MASS FUNCTION

$$f(x) = \frac{\mu^x e^{-\mu}}{x!} \qquad (5.17)$$

where

f(x) = the probability of x occurrences in an interval
μ = expected value or mean number of occurrences in an interval
e ≈ 2.71828

For the Poisson probability distribution, x is a discrete random variable that indicates the number of occurrences in the interval. Since there is no stated upper limit for the number of occurrences, the probability mass function f(x) is applicable for values x = 0, 1, 2, … without limit. In practical applications, x will eventually become large enough so that f(x) is approximately zero and the probability of any larger values of x becomes negligible.

Suppose that we are interested in the number of patients who arrive at the emergency room of a large hospital during a 15-minute period on weekday mornings. Obviously, we do not know exactly how many patients will arrive at the emergency room in any defined interval of time, so the value of this variable is uncertain. It is important for administrators at the hospital to understand the probabilities associated with the number of arriving patients, as this information will have an impact on staffing decisions such as how many nurses and doctors to hire. It will also provide insight into possible wait times for patients to be seen once they arrive at the emergency room. If we can assume that the probability of a patient arriving is the same for any two periods of equal length during this 15-minute period and that the arrival or nonarrival of a patient in any period is independent of the arrival or nonarrival in any other period during the 15-minute period, the Poisson probability mass function is applicable. Suppose these assumptions are satisfied and an analysis of historical data shows that the average number of patients arriving during a 15-minute period of time is 10; in this case, the following probability mass function applies:

$$f(x) = \frac{10^x e^{-10}}{x!}$$

The random variable here is x = number of patients arriving at the emergency room during any 15-minute period.

If the hospital's management team wants to know the probability of exactly five arrivals during 15 minutes, we would set x = 5 and obtain:

$$\text{Probability of exactly 5 arrivals in 15 minutes} = f(5) = \frac{10^5 e^{-10}}{5!} = 0.0378$$

In the preceding example, the mean of the Poisson distribution is μ = 10 arrivals per 15-minute period. A property of the Poisson distribution is that the mean of the distribution and the variance of the distribution are always equal. Thus, the variance for the number of arrivals during all 15-minute periods is σ² = 10, and so the standard deviation is σ = √10 = 3.16. Our illustration involves a 15-minute period, but other amounts of time can be used. Suppose we want to compute the probability of one arrival during a 3-minute period. Because 10 is the expected number of arrivals during a 15-minute period, we see that 10/15 = 2/3 is the expected number of arrivals during a 1-minute period and that (2/3)(3 minutes) = 2 is the expected number of arrivals during a 3-minute period. Thus, the probability of x arrivals during a 3-minute period with μ = 2 is given by the following Poisson probability mass function:

$$f(x) = \frac{2^x e^{-2}}{x!}$$

The probability of one arrival during a 3-minute period is calculated as follows:

$$\text{Probability of exactly 1 arrival in 3 minutes} = f(1) = \frac{2^1 e^{-2}}{1!} = 0.2707$$

One might expect that because (5 arrivals)/5 = 1 arrival and (15 minutes)/5 = 3 minutes, we would get the same probability for one arrival during a 3-minute period as we do for five arrivals during a 15-minute period. Earlier we computed the probability of five arrivals during a 15-minute period as 0.0378. However, note that the probability of one arrival during a 3-minute period is 0.2707, which is not the same. When computing a Poisson probability for a different time interval, we must first convert the mean arrival rate to the period of interest and then compute the probability.
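The conversion of the mean to the interval of interest can be sketched in Python. This is a minimal illustration under the emergency room assumptions above; the function name poisson_pmf and the variable names are assumptions made for this example, and only the standard library is used.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the interval mean is mu."""
    return mu**x * exp(-mu) / factorial(x)

mean_per_15_min = 10                    # mean arrivals per 15-minute period
rate_per_min = mean_per_15_min / 15     # 2/3 of an arrival per minute
mean_per_3_min = rate_per_min * 3       # mean of 2 arrivals per 3-minute period

p5_in_15 = poisson_pmf(5, mean_per_15_min)
p1_in_3 = poisson_pmf(1, mean_per_3_min)

print(f"{p5_in_15:.4f} {p1_in_3:.4f}")
```

The two printed probabilities should agree with 0.0378 and 0.2707. They differ because the mean is rescaled first and the probability is computed afterward; the probability itself is never rescaled.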

In Excel we can use the POISSON.DIST function to compute Poisson probabilities. Figure 5.16 shows how to calculate the probabilities of patient arrivals at the emergency room if patients arrive at a mean rate of 10 per 15-minute interval.

FIGURE 5.16  Excel Worksheet for Computing Poisson Probabilities of the Number of Patients Arriving at the Emergency Room

Cell D1 holds the Mean Number of Occurrences: 10. In each of rows 4 through 24, column B contains =POISSON.DIST(A4,$D$1,FALSE) and column C contains =POISSON.DIST(A4,$D$1,TRUE), with A4 replaced by that row's cell in column A.

Row   Number of Arrivals (A)   Probability, f(x) (B)   Cumulative Probability (C)
 4             0                     0.0000                    0.0000
 5             1                     0.0005                    0.0005
 6             2                     0.0023                    0.0028
 7             3                     0.0076                    0.0103
 8             4                     0.0189                    0.0293
 9             5                     0.0378                    0.0671
10             6                     0.0631                    0.1301
11             7                     0.0901                    0.2202
12             8                     0.1126                    0.3328
13             9                     0.1251                    0.4579
14            10                     0.1251                    0.5830
15            11                     0.1137                    0.6968
16            12                     0.0948                    0.7916
17            13                     0.0729                    0.8645
18            14                     0.0521                    0.9165
19            15                     0.0347                    0.9513
20            16                     0.0217                    0.9730
21            17                     0.0128                    0.9857
22            18                     0.0071                    0.9928
23            19                     0.0037                    0.9965
24            20                     0.0019                    0.9984

(The worksheet also contains a chart titled "Poisson Probabilities" that plots Probability, f(x), against Number of Arrivals from 0 to 20.)


The POISSON.DIST function in Excel has three input values: the first is the value of x, the second is the mean of the Poisson distribution, and the third is FALSE or TRUE. We choose FALSE for the third input if a probability mass function value f(x) is desired, and TRUE if a cumulative probability is desired. The formula =POISSON.DIST(A4,$D$1,FALSE) has been entered into cell B4 to compute the probability of 0 occurrences, f(0). Figure 5.16 shows that this value (to four decimal places) is 0.0000, which means that it is highly unlikely (probability near 0) that we will have 0 patient arrivals during a 15-minute interval. The value in cell B12 shows that the probability that there will be exactly eight arrivals during a 15-minute interval is 0.1126.

The cumulative probability for x using a Poisson distribution is the probability of x or fewer occurrences during the interval. Cell C4 computes the cumulative probability for x = 0, which is the same as the probability for x = 0 because the probability of 0 occurrences is the same as the probability of 0 or fewer occurrences. Cell C12 computes the cumulative probability for x = 8 using the formula =POISSON.DIST(A12,$D$1,TRUE). This value is 0.3328, meaning that the probability that eight or fewer patients arrive during a 15-minute interval is 0.3328. This value corresponds to

$$f(0) + f(1) + f(2) + \cdots + f(7) + f(8) = 0.0000 + 0.0005 + 0.0023 + \cdots + 0.0901 + 0.1126 = 0.3328$$

Let us illustrate an application not involving time intervals in which the Poisson distribution is useful. Suppose we want to determine the occurrence of major defects in a highway one month after it has been resurfaced. We assume that the probability of a defect is the same for any two highway intervals of equal length and that the occurrence or nonoccurrence of a defect in any one interval is independent of the occurrence or nonoccurrence of a defect in any other interval. Hence, the Poisson distribution can be applied.

Suppose we learn that major defects one month after resurfacing occur at the average rate of two per mile. Let us find the probability of no major defects in a particular 3-mile section of the highway. Because we are interested in an interval with a length of 3 miles, μ = (2 defects/mile)(3 miles) = 6 represents the expected number of major defects over the 3-mile section of highway. Using equation (5.17), the probability of no major defects is

$$f(0) = \frac{6^0 e^{-6}}{0!} = 0.0025$$

Thus, it is unlikely that no major defects will occur in the 3-mile section. In fact, this example indicates a 1 − 0.0025 = 0.9975 probability of at least one major defect in the 3-mile highway section.
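The cumulative and complement calculations in the emergency room and highway examples can be reproduced with a short Python sketch. The function names poisson_pmf and poisson_cdf are assumptions made for this illustration; only the standard library is used.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the interval mean is mu."""
    return mu**x * exp(-mu) / factorial(x)

def poisson_cdf(x, mu):
    """Probability of x or fewer occurrences: f(0) + f(1) + ... + f(x)."""
    return sum(poisson_pmf(k, mu) for k in range(x + 1))

# Emergency room: mean of 10 arrivals per 15-minute interval
p_at_most_8 = poisson_cdf(8, 10)      # plays the role of POISSON.DIST(..., TRUE)

# Highway: 2 defects per mile over a 3-mile section gives a mean of 6
mu_defects = 2 * 3
p_no_defects = poisson_pmf(0, mu_defects)
p_at_least_one = 1 - p_no_defects     # complement of "no major defects"

print(f"{p_at_most_8:.4f} {p_no_defects:.4f} {p_at_least_one:.4f}")
```

The printed values should agree with 0.3328, 0.0025, and 0.9975 from the text.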

NOTES + COMMENTS

1. If sample data are used to estimate the probabilities of a custom discrete distribution, equation (5.13) yields the sample mean x̄ rather than the population mean μ. However, as the sample size increases, the sample generally becomes more representative of the population and the sample mean x̄ converges to the population mean μ. In this chapter we have assumed that the sample of 300 Lancaster Savings and Loan mortgage customers is sufficiently large to be representative of the population of mortgage customers at Lancaster Savings and Loan.

2. We can use the Excel function AVERAGE only to compute the expected value of a custom discrete random variable when the values in the data occur with relative frequencies that correspond to the probability distribution of the random variable. If this assumption is not satisfied, then the estimate of the expected value with the AVERAGE function will be inaccurate. In practice, this assumption is satisfied with an increasing degree of accuracy as the size of the sample is increased. Otherwise, we must use equation (5.13) to calculate the expected value for a custom discrete random variable.

3. If sample data are used to estimate the probabilities for a custom discrete distribution, equation (5.14) yields the sample variance s² rather than the population variance σ². However, as the sample size increases, the sample generally becomes more representative of the population and the sample variance s² converges to the population variance σ².


5.6 Continuous Probability Distributions

In the preceding section we discussed discrete random variables and their probability distributions. In this section we consider continuous random variables. Specifically, we discuss some of the more useful continuous probability distributions for analytics models: the uniform, the triangular, the normal, and the exponential.

A fundamental difference separates discrete and continuous random variables in terms of how probabilities are computed. For a discrete random variable, the probability mass function f(x) provides the probability that the random variable assumes a particular value. With continuous random variables, the counterpart of the probability mass function is the probability density function, also denoted by f(x). The difference is that the probability density function does not directly provide probabilities. However, the area under the graph of f(x) corresponding to a given interval does provide the probability that the continuous random variable x assumes a value in that interval. So when we compute probabilities for continuous random variables, we are computing the probability that the random variable assumes any value in an interval. Because the area under the graph of f(x) at any particular point is zero, one of the implications of the definition of probability for continuous random variables is that the probability of any particular value of the random variable is zero.

Uniform Probability Distribution

Consider the random variable x representing the flight time of an airplane traveling from Chicago to New York. The exact flight time from Chicago to New York is uncertain because it can be affected by weather (headwinds or storms), flight traffic patterns, and other factors that cannot be known with certainty. It is important to characterize the uncertainty associated with the flight time because this can have an impact on connecting flights and how we construct our overall flight schedule. Suppose the flight time can be any value in the interval from 120 minutes to 140 minutes. Because the random variable x can assume any value in that interval, x is a continuous rather than a discrete random variable. Let us assume that sufficient actual flight data are available to conclude that the probability of a flight time within any interval of a given length is the same as the probability of a flight time within any other interval of the same length that is contained in the larger interval from 120 to 140 minutes. With every interval of a given length being equally likely, the random variable x is said to have a uniform probability distribution. The probability density function, which defines the uniform distribution for the flight-time random variable, is:

$$f(x) = \begin{cases} 1/20 & \text{for } 120 \le x \le 140 \\ 0 & \text{elsewhere} \end{cases}$$

Figure 5.17 shows a graph of this probability density function.

FIGURE 5.17  Uniform Probability Distribution for Flight Time (horizontal axis: Flight Time in Minutes, 120 to 140; f(x) has constant height 1/20 over that range)


In general, the uniform probability density function for a random variable x is defined by the following formula:

UNIFORM PROBABILITY DENSITY FUNCTION

$$f(x) = \begin{cases} \dfrac{1}{b - a} & \text{for } a \le x \le b \\[1ex] 0 & \text{elsewhere} \end{cases} \qquad (5.18)$$

For the flight-time random variable, a = 120 and b = 140.

For a continuous random variable, we consider probability only in terms of the likelihood that a random variable assumes a value within a specified interval. In the flight time example, an acceptable probability question is: What is the probability that the flight time is between 120 and 130 minutes? That is, what is P(120 ≤ x ≤ 130)?

To answer this question, consider the area under the graph of f(x) in the interval from 120 to 130 (see Figure 5.18). The area is rectangular, and the area of a rectangle is simply the width multiplied by the height. With the width of the interval equal to 130 − 120 = 10 and the height equal to the value of the probability density function f(x) = 1/20, we have area = width × height = 10(1/20) = 10/20 = 0.50.

The area under the graph of f(x) and probability are identical for all continuous random variables. Once a probability density function f(x) is identified, the probability that x takes a value between some lower value x₁ and some higher value x₂ can be found by computing the area under the graph of f(x) over the interval from x₁ to x₂.

Given the uniform distribution for flight time and using the interpretation of area as probability, we can answer any number of probability questions about flight times. For example:

• What is the probability of a flight time between 128 and 136 minutes? The width of the interval is 136 − 128 = 8. With the uniform height of f(x) = 1/20, we see that P(128 ≤ x ≤ 136) = 8(1/20) = 0.40.

• What is the probability of a flight time between 118 and 123 minutes? The width of the interval is 123 − 118 = 5, but the height is f(x) = 0 for 118 ≤ x < 120 and f(x) = 1/20 for 120 ≤ x ≤ 123, so we have that P(118 ≤ x ≤ 123) = P(118 ≤ x < 120) + P(120 ≤ x ≤ 123) = 2(0) + 3(1/20) = 0.15.
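The width-times-height reasoning in these flight-time questions can be sketched in Python. This is a minimal illustration; the function name uniform_interval_prob is an assumption made for this example. Clipping the requested interval to [a, b] handles cases such as 118 to 123 minutes, where part of the interval has zero density.

```python
def uniform_interval_prob(lo, hi, a, b):
    """P(lo <= x <= hi) for a uniform random variable on [a, b]."""
    # Only the part of [lo, hi] that overlaps [a, b] has nonzero height.
    width = max(0.0, min(hi, b) - max(lo, a))
    height = 1 / (b - a)
    return width * height

a, b = 120, 140  # minimum and maximum flight time in minutes

p_120_130 = uniform_interval_prob(120, 130, a, b)
p_128_136 = uniform_interval_prob(128, 136, a, b)
p_118_123 = uniform_interval_prob(118, 123, a, b)

print(f"{p_120_130:.2f} {p_128_136:.2f} {p_118_123:.2f}")
```

The printed values should agree with 0.50, 0.40, and 0.15.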

FIGURE 5.18  The Area Under the Graph Provides the Probability of a Flight Time Between 120 and 130 Minutes (horizontal axis: Flight Time in Minutes, 120 to 140; height of f(x) is 1/20; P(120 ≤ x ≤ 130) = Area = 1/20(10) = 10/20 = 0.50)


Note that P(120 ≤ x ≤ 140) = 20(1/20) = 1; that is, the total area under the graph of f(x) is equal to 1. This property holds for all continuous probability distributions and is the analog of the condition that the sum of the probabilities must equal 1 for a discrete probability mass function.

Note also that because the height of the graph of f(x) for a uniform distribution is 1/(b − a) for a ≤ x ≤ b, the area under the graph of f(x) evaluated from a to a point x₀, where a ≤ x₀ ≤ b, is width × height = (x₀ − a) × 1/(b − a). This value provides the cumulative probability of obtaining a value for a uniform random variable of less than or equal to some specific value denoted by x₀, and the formula is given in equation (5.19).

UNIFORM DISTRIBUTION: CUMULATIVE PROBABILITIES

$$P(x \le x_0) = \frac{x_0 - a}{b - a} \quad \text{for } a \le x_0 \le b \qquad (5.19)$$

The calculation of the expected value and variance for a continuous random variable is analogous to that for a discrete random variable. However, because the computational procedure involves integral calculus, we do not show the formulas here.

For the uniform continuous probability distribution introduced in this section, the formulas for the expected value and variance are as follows:

$$E(x) = \frac{a + b}{2}$$

$$\mathrm{Var}(x) = \frac{(b - a)^2}{12}$$

In these formulas, a is the minimum value and b is the maximum value that the random variable may assume.

Applying these formulas to the uniform distribution for flight times from Chicago to New York, we obtain

$$E(x) = \frac{120 + 140}{2} = 130$$

$$\mathrm{Var}(x) = \frac{(140 - 120)^2}{12} = 33.33$$

The standard deviation of flight times can be found by taking the square root of the variance. Thus, for flight times from Chicago to New York, σ = √33.33 = 5.77 minutes.
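The expected value, variance, standard deviation, and cumulative probability for the uniform distribution can be collected in a short Python sketch. The function names uniform_summary and uniform_cdf are assumptions made for this illustration; the formulas are the ones stated above, including equation (5.19).

```python
from math import sqrt

def uniform_summary(a, b):
    """Return (mean, variance, standard deviation) for a uniform on [a, b]."""
    mean = (a + b) / 2
    variance = (b - a) ** 2 / 12
    return mean, variance, sqrt(variance)

def uniform_cdf(x0, a, b):
    """P(x <= x0) from equation (5.19); valid only for a <= x0 <= b."""
    if not a <= x0 <= b:
        raise ValueError("x0 must lie between a and b")
    return (x0 - a) / (b - a)

mean, variance, std_dev = uniform_summary(120, 140)
p_at_most_130 = uniform_cdf(130, 120, 140)

print(f"{mean:.0f} {variance:.2f} {std_dev:.2f} {p_at_most_130:.2f}")
```

For the flight-time example this should print 130, 33.33, 5.77, and 0.50.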

Triangular Probability Distribution

The triangular probability distribution is useful when only subjective probability estimates are available. There are many situations for which we do not have sufficient data and only subjective estimates of possible values are available. In the triangular probability distribution, we need only to specify the minimum possible value a, the maximum possible value b, and the most likely value (or mode) of the distribution m. If these values can be knowledgeably estimated for a continuous random variable by a subject-matter expert, then as an approximation of the actual probability density function, we can assume that the triangular distribution applies.

Consider a situation in which a project manager is attempting to estimate the time that will be required to complete an initial assessment of the capital project of constructing a new corporate headquarters. The assessment process includes completing environmental-impact studies, procuring the required permits, and lining up all the contractors and subcontractors needed to complete the project. There is considerable uncertainty regarding the duration of these tasks, and generally little or no historical data are available to help estimate the probability distribution for the time required for this assessment process.

Suppose that we are able to discuss this project with several subject-matter experts who have worked on similar projects. From these expert opinions and our own experience, we estimate that the minimum required time for the initial assessment phase is six months and that the worst-case estimate is that this phase could require 24 months if we are delayed in the permit process or if the results from the environmental-impact studies require additional action. While a time of six months represents a best case and 24 months a worst case, the consensus is that the most likely amount of time required for the initial assessment phase of the project is 12 months. From these estimates, we can use a triangular distribution as an approximation for the probability density function for the time required for the initial assessment phase of constructing a new corporate headquarters.

Figure 5.19 shows the probability density function for this triangular distribution. Note that the probability density function is a triangular shape.

The general form of the triangular probability density function is as follows:

TRIANGULAR PROBABILITY DENSITY FUNCTION

$$f(x) = \begin{cases} \dfrac{2(x - a)}{(b - a)(m - a)} & \text{for } a \le x \le m \\[2ex] \dfrac{2(b - x)}{(b - a)(b - m)} & \text{for } m < x \le b \end{cases} \qquad (5.20)$$

where

a = minimum value
b = maximum value
m = mode

In the example of the time required to complete the initial assessment phase of constructing a new corporate headquarters, the minimum value a is six months, the maximum value b is 24 months, and the mode m is 12 months. As with the explanation given for the uniform distribution above, we can calculate probabilities by using the area under the graph of f(x). We can calculate the probability that the time required is less than 12 months by finding the area under the graph of f(x) from x = 6 to x = 12 as shown in Figure 5.19.

FIGURE 5.19  Triangular Probability Distribution for Time Required for Initial Assessment of Corporate Headquarters Construction (horizontal axis x marked at a = 6, m = 12, b = 24; peak height of f(x) is 1/9; the area labeled P(6 ≤ x ≤ 12) lies between x = 6 and x = 12)
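The area under the triangular density between x = 6 and x = 12 can also be approximated numerically. The Python sketch below is a minimal illustration, not the text's method: it evaluates equation (5.20) directly and sums thin rectangles (a midpoint rule). The function names triangular_pdf and area_under_pdf, and the choice of 10,000 rectangles, are assumptions made for this example.

```python
def triangular_pdf(x, a, m, b):
    """Triangular density from equation (5.20); zero outside [a, b]."""
    if a <= x <= m:
        return 2 * (x - a) / ((b - a) * (m - a))
    if m < x <= b:
        return 2 * (b - x) / ((b - a) * (b - m))
    return 0.0

def area_under_pdf(lo, hi, a, m, b, steps=10_000):
    """Midpoint-rule approximation of the area under f(x) on [lo, hi]."""
    width = (hi - lo) / steps
    return sum(triangular_pdf(lo + (i + 0.5) * width, a, m, b) * width
               for i in range(steps))

a, m, b = 6, 12, 24  # minimum, mode, and maximum in months

peak_height = triangular_pdf(m, a, m, b)          # 2/(b - a) = 1/9
p_less_than_12 = area_under_pdf(6, 12, a, m, b)
total_area = area_under_pdf(6, 24, a, m, b)

print(f"{peak_height:.4f} {p_less_than_12:.4f} {total_area:.4f}")
```

The peak height should print as 0.1111, the area from 6 to 12 as 0.3333, and the total area as 1.0000, consistent with the requirement that the area under any probability density function equals 1.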


The geometry required to find this area for any given value is slightly more complex than that required to find the area for a uniform distribution, but the resulting formula for a triangular distribution is relatively simple:

TRIANGULAR DISTRIBUTION: CUMULATIVE PROBABILITIES

$$P(x \le x_0) = \begin{cases} \dfrac{(x_0 - a)^2}{(b - a)(m - a)} & \text{for } a \le x_0 \le m \\[2ex] 1 - \dfrac{(b - x_0)^2}{(b - a)(b - m)} & \text{for } m < x_0 \le b \end{cases} \qquad (5.21)$$

Equation (5.21) provides the cumulative probability of obtaining a value for a triangular random variable of less than or equal to some specific value denoted by x₀.

To calculate P(x ≤ 12) we use equation (5.21) with a = 6, b = 24, m = 12, and x₀ = 12.

$$P(x \le 12) = \frac{(12 - 6)^2}{(24 - 6)(12 - 6)} = 0.3333$$

Thus, the probability that the assessment phase of the project requires less than 12 months is 0.3333. We can also calculate the probability that the project requires more than 10 months but less than or equal to 18 months by subtracting P(x ≤ 10) from P(x ≤ 18). This is shown graphically in Figure 5.20. Because x₀ = 18 lies above the mode, P(x ≤ 18) uses the second case of equation (5.21); because x₀ = 10 lies below the mode, P(x ≤ 10) uses the first case, whose denominator contains m − a = 12 − 6. The calculations are as follows:

$$P(x \le 18) - P(x \le 10) = \left[1 - \frac{(24 - 18)^2}{(24 - 6)(24 - 12)}\right] - \left[\frac{(10 - 6)^2}{(24 - 6)(12 - 6)}\right] = 0.8333 - 0.1481 = 0.6852$$

Thus, the probability that the assessment phase of the project requires more than 10 months but no more than 18 months is 0.6852.

FIGURE 5.20  Triangular Distribution to Determine P(10 ≤ x ≤ 18) = P(x ≤ 18) − P(x ≤ 10) (horizontal axis x marked at a = 6, 10, m = 12, 18, b = 24; peak height of f(x) is 1/9)

Normal Probability Distribution

One of the most useful probability distributions for describing a continuous random variable is the normal probability distribution. The normal distribution has been used in a wide variety of practical applications in which the random variables are heights and weights of people, test scores, scientific measurements, amounts of rainfall, and other similar values. It is also widely used in business applications to describe uncertain quantities such as demand for products, the rate of return for stocks and bonds, and the time it takes to manufacture a part or complete many types of service-oriented activities such as medical surgeries and consulting engagements.


The form, or shape, of the normal distribution is illustrated by the bell-shaped normal curve in Figure 5.21.

The probability density function that defines the bell-shaped curve of the normal distribution follows.

Although π and e are irrational numbers, 3.14159 and 2.71828, respectively, are sufficient approximations for our purposes.

NORMAL PROBABILITY DENSITY FUNCTION

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2 / (2\sigma^2)} \qquad (5.22)$$

where

μ = mean
σ = standard deviation
π ≈ 3.14159
e ≈ 2.71828

We make several observations about the characteristics of the normal distribution.

1. The entire family of normal distributions is differentiated by two parameters: the mean μ and the standard deviation σ. The mean and standard deviation are often referred to as the location and shape parameters of the normal distribution, respectively.

2. The highest point on the normal curve is at the mean, which is also the median and mode of the distribution.

3. The mean of the distribution can be any numerical value: negative, zero, or positive. Three normal distributions with the same standard deviation but three different means (−10, 0, and 20) are shown in Figure 5.22.

FIGURE 5.21  Bell-Shaped Curve for the Normal Distribution (the curve is centered at the mean μ on the x axis; its spread is labeled with the standard deviation σ)

FIGURE 5.22  Three Normal Distributions with the Same Standard Deviation but Different Means (μ = −10, μ = 0, μ = 20)


4. The normal distribution is symmetric, with the shape of the normal curve to the left of the mean a mirror image of the shape of the normal curve to the right of the mean.

5. The tails of the normal curve extend to infinity in both directions and theoretically never touch the horizontal axis. Because it is symmetric, the normal distribution is not skewed; its skewness measure is zero.

6. The standard deviation determines how flat and wide the normal curve is. Larger values of the standard deviation result in wider, flatter curves, showing more variability in the data. More variability corresponds to greater uncertainty. Two normal distributions with the same mean but with different standard deviations are shown in Figure 5.23.

7. Probabilities for the normal random variable are given by areas under the normal curve. The total area under the curve for the normal distribution is 1. Because the distribution is symmetric, the area under the curve to the left of the mean is 0.50 and the area under the curve to the right of the mean is 0.50.

8. The percentages of values in some commonly used intervals are as follows:
   a. 68.3% of the values of a normal random variable are within plus or minus one standard deviation of its mean.
   b. 95.4% of the values of a normal random variable are within plus or minus two standard deviations of its mean.
   c. 99.7% of the values of a normal random variable are within plus or minus three standard deviations of its mean.
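The three percentages in property 8 can be checked numerically. The Python sketch below is a minimal illustration that relies on a standard identity not stated in the text: for any normal random variable, the probability of falling within k standard deviations of the mean equals erf(k/√2), where erf is the error function from Python's math module. The function name prob_within_k_sd is an assumption made for this example.

```python
from math import erf, sqrt

def prob_within_k_sd(k):
    """P(mu - k*sigma <= x <= mu + k*sigma) for any normal random variable."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, f"{100 * prob_within_k_sd(k):.1f}%")
```

The loop should print 68.3%, 95.4%, and 99.7% for k = 1, 2, and 3, regardless of the particular mean and standard deviation.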

Figure 5.24 shows properties (a), (b), and (c) graphically.

We turn now to an application of the normal probability distribution. Suppose Grear Aircraft Engines sells aircraft engines to commercial airlines. Grear is offering a new performance-based sales contract in which Grear will guarantee that its engines will provide a certain amount of lifetime flight hours subject to the airline purchasing a preventive-maintenance service plan that is also provided by Grear. Grear believes that this performance-based contract will lead to additional sales as well as additional income from providing the associated preventive maintenance and servicing.

From extensive flight testing and computer simulations, Grear's engineering group has estimated that if their engines receive proper parts replacement and preventive maintenance, the lifetime flight hours achieved are normally distributed with a mean μ = 36,500 hours and standard deviation σ = 5,000 hours. Grear would like to know what percentage of its aircraft engines will be expected to last more than 40,000 hours. In other words, what is the probability that the aircraft lifetime flight hours x will exceed 40,000? This question can be answered by finding the area of the darkly shaded region in Figure 5.25.

These percentages are the basis for the empirical rule discussed in Section 2.7.

FIGURE 5.23  Two Normal Distributions with the Same Mean but Different Standard Deviations (σ = 5, σ = 10)


The Excel function NORM.DIST can be used to compute the area under the curve for a normal probability distribution. The NORM.DIST function has four input values. The first is the value of interest corresponding to the probability you want to calculate, the second is the mean of the normal distribution, the third is the standard deviation of the normal distribution, and the fourth is TRUE or FALSE. We enter TRUE for the fourth input if we want the cumulative distribution function and FALSE if we want the probability density function.

Figure 5.26 shows how we can answer the question of interest for Grear using Excel: in cell B5, we use the formula =NORM.DIST(40000, $B$1, $B$2, TRUE). Cell B1 contains the mean of the normal distribution and cell B2 contains the standard deviation. Because we want to know the area under the curve, we want the cumulative distribution function, so we use TRUE as the fourth input value in the formula. This formula provides a value of 0.7580 in cell B5. But note that this corresponds to P(x ≤ 40,000) = 0.7580. In other words, this gives us the area under the curve to the left of x = 40,000 in Figure 5.25, and we are interested in the area under the curve to the right of x = 40,000. To find this value, we simply use 1 − 0.7580 = 0.2420 (cell B6). Thus, 0.2420 is the probability that x will exceed 40,000 hours. We can conclude that about 24.2% of aircraft engines will exceed 40,000 lifetime flight hours.
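For readers working outside Excel, the same calculation can be sketched in Python with the standard library's statistics.NormalDist class, whose cdf method plays the role of NORM.DIST with TRUE as the fourth input. The variable names are assumptions made for this illustration.

```python
from statistics import NormalDist

# Lifetime flight hours: mean 36,500 and standard deviation 5,000
lifetime = NormalDist(mu=36_500, sigma=5_000)

p_at_most_40k = lifetime.cdf(40_000)   # area to the left of 40,000
p_exceeds_40k = 1 - p_at_most_40k      # area to the right of 40,000

print(f"{p_at_most_40k:.4f} {p_exceeds_40k:.4f}")
```

The printed values should agree with 0.7580 and 0.2420.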

FIGURE 5.24  Areas Under the Curve for Any Normal Distribution (68.3% of the area lies between μ − 1σ and μ + 1σ, 95.4% between μ − 2σ and μ + 2σ, and 99.7% between μ − 3σ and μ + 3σ)

FIGURE 5.25  Grear Aircraft Engines Lifetime Flight Hours Distribution (μ = 36,500, σ = 5,000; the area under the curve is split at x = 40,000 into P(x < 40,000) and P(x ≥ 40,000) = ?)


Let us now assume that Grear is considering a guarantee that will provide a discount on a replacement aircraft engine if the original engine does not meet the lifetime-flight-hour guarantee. How many lifetime flight hours should Grear guarantee if Grear wants no more than 10% of aircraft engines to be eligible for the discount guarantee? This question is interpreted graphically in Figure 5.27.

According to Figure 5.27, the area under the curve to the left of the unknown guarantee on lifetime flight hours must be 0.10. To find the appropriate value using Excel, we use the function NORM.INV. The NORM.INV function has three input values. The first is the probability of interest, the second is mean of the normal distribution, and the third is the standard deviation of the normal distribution. Figure 5.26 shows how we can use Excel to answer the Grear’s question about a guarantee on lifetime flight hours. In cell B8 we use

FIGURE 5.26 Excel Calculations for Grear Aircraft Engines Example

FIGURE 5.27 Grear’s Discount Guarantee
[Figure: normal curve with μ = 36,500 and σ = 5,000; the left tail, containing 10% of engines eligible for the discount guarantee, ends at the guaranteed lifetime flight hours = ?]

the formula =NORM.INV(0.10, $B$1, $B$2), where the mean of the normal distribution is contained in cell B1 and the standard deviation in cell B2. This provides a value of 30,092.24. Thus, a guarantee of 30,092 hours will meet the requirement that approximately 10% of the aircraft engines will be eligible for the guarantee. This information could be used by Grear’s analytics team to suggest a lifetime flight hours guarantee of 30,000 hours.
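The inverse calculation can be sketched the same way in Python. This assumes `statistics.NormalDist.inv_cdf` as an analogue of NORM.INV; the names `guarantee` and `share_at_30000` are illustrative only.

```python
from statistics import NormalDist

# Same normal distribution of lifetime flight hours as in the text
lifetime = NormalDist(mu=36_500, sigma=5_000)

# Hours below which 10% of engines fall, analogous to NORM.INV(0.10, B1, B2)
guarantee = lifetime.inv_cdf(0.10)

# Share of engines eligible if the guarantee is rounded down to 30,000 hours
share_at_30000 = lifetime.cdf(30_000)

print(f"Guarantee covering 10% of engines: {guarantee:,.2f} hours")
print(f"Share eligible at 30,000 hours:    {share_at_30000:.4f}")
```

Rounding the guarantee down to 30,000 hours lowers the eligible share slightly below 10%, which is why the rounded figure still satisfies the "no more than 10%" requirement.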

Perhaps Grear is also interested in knowing the probability that an engine will have a lifetime of flight hours greater than 30,000 hours but less than 40,000 hours. How do we calculate this probability? First, we can restate this question as follows: What is P(30,000 ≤ x ≤ 40,000)? Figure 5.28 shows the area under the curve needed to answer this question. The area that corresponds to P(30,000 ≤ x ≤ 40,000) can be found by subtracting the area corresponding to P(x ≤ 30,000) from the area corresponding to P(x ≤ 40,000). In other words, P(30,000 ≤ x ≤ 40,000) = P(x ≤ 40,000) − P(x ≤ 30,000). Figure 5.29 shows how we can find the value for P(30,000 ≤ x ≤ 40,000) using Excel. We calculate P(x ≤ 40,000) in cell B5 and P(x ≤ 30,000) in cell B6 using the NORM.DIST function. We then calculate P(30,000 ≤ x ≤ 40,000) in cell B8 by subtracting the value in cell B6 from the value in cell B5. This tells us that P(30,000 ≤ x ≤ 40,000) = 0.7580 − 0.0968 = 0.6612. In other words, the probability that the lifetime flight hours for an aircraft engine will be between 30,000 hours and 40,000 hours is 0.6612.
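The subtraction of two cumulative probabilities generalizes to any interval. A minimal Python sketch, assuming a hypothetical helper named `normal_interval_probability` (not an Excel or textbook function):

```python
from statistics import NormalDist

def normal_interval_probability(low, high, mean, sd):
    """Return P(low <= x <= high) for a normal random variable x."""
    dist = NormalDist(mu=mean, sigma=sd)
    # Area up to `high` minus area up to `low` leaves the area in between
    return dist.cdf(high) - dist.cdf(low)

# Grear example: lifetime between 30,000 and 40,000 flight hours
p_between = normal_interval_probability(30_000, 40_000, 36_500, 5_000)
print(f"P(30,000 <= x <= 40,000) = {p_between:.4f}")
```

Because the normal distribution is continuous, including or excluding the endpoints does not change the result.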

Exponential Probability Distribution

The exponential probability distribution may be used for random variables such as the time between patient arrivals at an emergency room, the distance between major defects in a highway, and the time until default in certain credit-risk models. The exponential probability density function is given in equation (5.23).

With the guarantee set at 30,000 hours, the actual percentage eligible for the guarantee will be =NORM.DIST(30000, 36500, 5000, TRUE) = 0.0968, or 9.68%.

Note that we can calculate P(30,000 ≤ x ≤ 40,000) in a single cell using the formula =NORM.DIST(40000, $B$1, $B$2, TRUE) − NORM.DIST(30000, $B$1, $B$2, TRUE).

FIGURE 5.28 Graph Showing the Area Under the Curve Corresponding to P(30,000 ≤ x ≤ 40,000) in the Grear Aircraft Engines Example
[Figure: normal curve with μ = 36,500 and σ = 5,000, marked at x = 30,000 and x = 40,000; labeled areas are P(x ≤ 30,000), P(30,000 ≤ x ≤ 40,000), and P(x ≤ 40,000).]


As an example, suppose that x represents the time between business loan defaults for a particular lending agency. If the mean, or average, time between loan defaults is 15 months (μ = 15), the appropriate density function for x is

\[
f(x) = \frac{1}{15} e^{-x/15}
\]

Figure 5.30 is the graph of this probability density function.

As with any continuous probability distribution, the area under the curve corresponding to an interval provides the probability that the random variable assumes a value in that interval. In the time between loan defaults example, the probability that the time between defaults is six months or less, P(x ≤ 6), is defined to be the area under the curve in Figure 5.30 from x = 0 to x = 6. Similarly, the probability that the time between defaults will be 18 months or less, P(x ≤ 18), is the area under the curve from x = 0 to x = 18. Note also that the probability that the time between defaults will be between 6 months and 18 months, P(6 ≤ x ≤ 18), is given by the area under the curve from x = 6 to x = 18.

EXPONENTIAL PROBABILITY DENSITY FUNCTION

\[
f(x) = \frac{1}{\mu} e^{-x/\mu} \quad \text{for } x \ge 0 \tag{5.23}
\]

where

μ = expected value or mean
e = 2.71828

FIGURE 5.29 Using Excel to Find P(30,000 ≤ x ≤ 40,000) in the Grear Aircraft Engines Example
[Figure: worksheet with the mean (36500) in cell B1 and the standard deviation (5000) in cell B2. Formula view: P(x ≤ 40,000) is =NORM.DIST(40000, $B$1, $B$2, TRUE); P(x ≤ 30,000) is =NORM.DIST(30000, $B$1, $B$2, TRUE); P(30,000 ≤ x ≤ 40,000) = P(x ≤ 40,000) − P(x ≤ 30,000) is =B5-B6. Value view: 0.7580, 0.0968, and 0.6612, respectively.]


To compute exponential probabilities such as those just described, we use the following formula, which provides the cumulative probability of obtaining a value for the exponential random variable of less than or equal to some specific value denoted by x₀.

FIGURE 5.30 Exponential Distribution for the Time Between Business Loan Defaults Example
[Figure: exponential density f(x) plotted against the time between defaults, x (horizontal axis marked at 0, 6, 12, 18, and 24); shaded areas show P(x ≤ 6) and P(6 ≤ x ≤ 18).]

EXPONENTIAL DISTRIBUTION: CUMULATIVE PROBABILITIES

\[
P(x \le x_0) = 1 - e^{-x_0/\mu} \tag{5.24}
\]
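Equation (5.24) follows from integrating the density in equation (5.23) from 0 to x₀. The text does not spell out this step; a brief sketch of it is:

```latex
\[
P(x \le x_0)
  = \int_0^{x_0} \frac{1}{\mu} e^{-x/\mu} \, dx
  = \Bigl[ -e^{-x/\mu} \Bigr]_0^{x_0}
  = 1 - e^{-x_0/\mu}
\]
```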

For the time between defaults example, x = time between business loan defaults in months and μ = 15 months. Using equation (5.24),

\[
P(x \le x_0) = 1 - e^{-x_0/15}
\]

Hence, the probability that the time between defaults is six months or less is

\[
P(x \le 6) = 1 - e^{-6/15} = 0.3297
\]

Using equation (5.24), we calculate the probability that the time between defaults is 18 months or less:

\[
P(x \le 18) = 1 - e^{-18/15} = 0.6988
\]

Thus, the probability that the time between business loan defaults is between 6 months and 18 months is equal to 0.6988 − 0.3297 = 0.3691. Probabilities for any other interval can be computed similarly.

Figure 5.31 shows how we can calculate these values for an exponential distribution in Excel using the function EXPON.DIST. The EXPON.DIST function has three inputs: the first input is x, the second input is 1/μ, and the third input is TRUE or FALSE. An input of TRUE for the third input provides the cumulative distribution function value and FALSE provides the probability density function value. Cell B3 calculates P(x ≤ 18) using the formula =EXPON.DIST(18, 1/$B$1, TRUE), where cell B1 contains the mean of the exponential distribution. Cell B4 calculates the value for P(x ≤ 6), and cell B5 calculates the value for P(6 ≤ x ≤ 18) = P(x ≤ 18) − P(x ≤ 6) by subtracting the value in cell B4 from the value in cell B3.
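To make the three inputs concrete, the following Python sketch imitates the behavior described above using equations (5.23) and (5.24). The function `expon_dist` is a hypothetical stand-in written for illustration, not Excel's implementation; returning 0 for negative x is a choice made for this sketch.

```python
import math

def expon_dist(x, rate, cumulative):
    """Exponential density or cumulative probability at x, where rate = 1/mean."""
    if x < 0:
        # The exponential distribution has no probability below zero (sketch choice)
        return 0.0
    if cumulative:
        # Equation (5.24): P(x <= x0) = 1 - e^(-x0/mean)
        return 1 - math.exp(-rate * x)
    # Equation (5.23): f(x) = (1/mean) * e^(-x/mean)
    return rate * math.exp(-rate * x)

mean = 15  # mean time between loan defaults, in months
p_18 = expon_dist(18, 1 / mean, True)
p_6 = expon_dist(6, 1 / mean, True)
p_6_to_18 = p_18 - p_6

print(f"P(x <= 18)      = {p_18:.4f}")
print(f"P(x <= 6)       = {p_6:.4f}")
print(f"P(6 <= x <= 18) = {p_6_to_18:.4f}")
```

Passing `False` as the third argument returns the height of the density curve instead, which is not itself a probability.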

We can calculate P(6 ≤ x ≤ 18) in a single cell using the formula =EXPON.DIST(18, 1/$B$1, TRUE) − EXPON.DIST(6, 1/$B$1, TRUE).


NOTES + COMMENTS

1. The way we describe probabilities is different for a discrete random variable than it is for a continuous random variable. For discrete random variables, we can talk about the probability of the random variable assuming a particular value. For continuous random variables, we can only talk about the probability of the random variable assuming a value within a given interval.

2. To see more clearly why the height of a probability density function is not a probability, think about a random variable with the following uniform probability distribution:

\[
f(x) = \begin{cases} 2 & \text{for } 0 \le x \le 0.5 \\ 0 & \text{elsewhere} \end{cases}
\]

The height of the probability density function, f(x), is 2 for values of x between 0 and 0.5. However, we know that probabilities can never be greater than 1. Thus, we see that f(x) cannot be interpreted as the probability of x.

3. The standard normal distribution is the special case of the normal distribution for which the mean is 0 and the standard deviation is 1. This is useful because probabilities for all normal distributions can be computed using the standard normal distribution. We can convert any normal random variable x with mean μ and standard deviation σ to the standard normal random variable z by using the formula z = (x − μ)/σ. We interpret z as the number of standard deviations that the normal random variable x is from its mean μ. Then we can use a standard normal probability table to find the area under the curve for z. Excel contains special functions for the standard normal distribution: NORM.S.DIST and NORM.S.INV. The function NORM.S.DIST is similar to the function NORM.DIST, but it requires only two input values: the value of interest for calculating the probability and TRUE or FALSE, depending on whether you are interested in finding the probability density or the cumulative distribution function. NORM.S.INV is similar to the NORM.INV function, but it requires only the single input of the probability of interest. Neither NORM.S.DIST nor NORM.S.INV needs the additional parameters because they assume a mean of 0 and standard deviation of 1 for the standard normal distribution.

4. A property of the exponential distribution is that the mean and the standard deviation are equal to each other.

5. The continuous exponential distribution is related to the discrete Poisson distribution. If the Poisson distribution provides an appropriate description of the number of occurrences per interval, the exponential distribution provides a description of the length of the interval between occurrences. This relationship often arises in queueing applications in which, if arrivals follow a Poisson distribution, the time between arrivals must follow an exponential distribution.

6. Chapter 11 explains how values for discrete and continuous random variables can be generated in Excel for use in simulation models. It also discusses how to use Analytic Solver to assess which probability distribution(s) best describe sample values of a random variable.
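The standardization described in note 3 can be checked numerically. The sketch below reuses the Grear figures from earlier in the section and assumes Python's `statistics.NormalDist`, whose default parameters (mean 0, standard deviation 1) give the standard normal distribution; the variable names are illustrative.

```python
from statistics import NormalDist

mean, sd = 36_500, 5_000   # Grear lifetime flight hours
x = 40_000

# Number of standard deviations that x lies from the mean
z = (x - mean) / sd

# Cumulative probability from the standard normal distribution at z
p_via_z = NormalDist().cdf(z)

# Cumulative probability computed directly from the original distribution
p_direct = NormalDist(mu=mean, sigma=sd).cdf(x)

print(f"z = {z:.2f}")
print(f"P(z <= {z:.2f}) = {p_via_z:.4f}")
print(f"P(x <= 40,000) = {p_direct:.4f}")
```

Both routes give the same cumulative probability, which is the point of converting x to z before using a standard normal table.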

FIGURE 5.31 Using Excel to Calculate P(6 ≤ x ≤ 18) for the Time Between Business Loan Defaults Example
[Figure: worksheet with the mean, μ = 15, in cell B1. Formula view: P(x ≤ 18) is =EXPON.DIST(18, 1/$B$1, TRUE); P(x ≤ 6) is =EXPON.DIST(6, 1/$B$1, TRUE); P(6 ≤ x ≤ 18) = P(x ≤ 18) − P(x ≤ 6) is =B3-B4. Value view: 0.6988, 0.3297, and 0.3691, respectively.]


SUMMARY

In this chapter we introduced the concept of probability as a means of understanding and measuring uncertainty. Uncertainty is a factor in virtually all business decisions; thus, an understanding of probability is essential to modeling such decisions and improving the decision-making process.

We introduced some basic relationships in probability including the concepts of outcomes, events, and calculations of related probabilities. We introduced the concept of conditional probability and discussed how to calculate posterior probabilities from prior probabilities using Bayes’ theorem. We then discussed both discrete and continuous random variables as well as some of the more common probability distributions related to these types of random variables. These probability distributions included the custom discrete, discrete uniform, binomial, and Poisson probability distributions for discrete random variables, as well as the uniform, triangular, normal, and exponential probability distributions for continuous random variables. We also discussed the concepts of the expected value (mean) and variance of a random variable.

Probability is used in many chapters that follow in this textbook. The normal distribution is essential to many of the predictive modeling techniques that we introduce in later chapters. Random variables and probability distributions will be seen again in Chapter 6 when we discuss the use of statistical inference to draw conclusions about a population from sample data, Chapter 7 when we discuss regression analysis as a way of estimating relationships between variables, and Chapter 11 when we discuss simulation as a means of modeling uncertainty. Conditional probability and Bayes’ theorem will be discussed in Chapter 15 in the context of decision analysis. It is very important to have a basic understanding of probability, such as is provided in this chapter, as you continue to improve your skills in business analytics.

GLOSSARY

Addition law: A probability law used to compute the probability of the union of events. For two events A and B, the addition law is P(A ∪ B) = P(A) + P(B) − P(A ∩ B). For two mutually exclusive events, P(A ∩ B) = 0, so P(A ∪ B) = P(A) + P(B).
Bayes’ theorem: A method used to compute posterior probabilities.
Binomial probability distribution: A probability distribution for a discrete random variable showing the probability of x successes in n trials.
Complement of A: The event consisting of all outcomes that are not in A.
Conditional probability: The probability of an event given that another event has already occurred. The conditional probability of A given B is P(A | B) = P(A ∩ B)/P(B).
Continuous random variable: A random variable that may assume any numerical value in an interval or collection of intervals. An interval can include negative and positive infinity.
Custom discrete probability distribution: A probability distribution for a discrete random variable for which each value x_i that the random variable assumes is associated with a defined probability f(x_i).
Discrete random variable: A random variable that can take on only specified discrete values.
Discrete uniform probability distribution: A probability distribution in which each possible value of the discrete random variable has the same probability.
Empirical probability distribution: A probability distribution for which the relative frequency method is used to assign probabilities.
Event: A collection of outcomes.
Expected value: A measure of the central location, or mean, of a random variable.


Exponential probability distribution: A continuous probability distribution that is useful in computing probabilities for the time it takes to complete a task or the time between arrivals. The mean and standard deviation for an exponential probability distribution are equal to each other.
Independent events: Two events A and B are independent if P(A | B) = P(A) or P(B | A) = P(B); the events do not influence each other.
Intersection of A and B: The event containing the outcomes belonging to both A and B. The intersection of A and B is denoted A ∩ B.
Joint probabilities: The probability of two events both occurring; in other words, the probability of the intersection of two events.
Marginal probabilities: The values in the margins of a joint probability table that provide the probabilities of each event separately.
Multiplication law: A law used to compute the probability of the intersection of events. For two events A and B, the multiplication law is P(A ∩ B) = P(B)P(A | B) or P(A ∩ B) = P(A)P(B | A). For two independent events, it reduces to P(A ∩ B) = P(A)P(B).
Mutually exclusive events: Events that have no outcomes in common; A ∩ B is empty and P(A ∩ B) = 0.
Normal probability distribution: A continuous probability distribution in which the probability density function is bell shaped and determined by its mean μ and standard deviation σ.
Poisson probability distribution: A probability distribution for a discrete random variable showing the probability of x occurrences of an event over a specified interval of time or space.
Posterior probabilities: Revised probabilities of events based on additional information.
Prior probability: Initial estimate of the probabilities of events.
Probability: A numerical measure of the likelihood that an event will occur.
Probability density function: A function used to compute probabilities for a continuous random variable. The area under the graph of a probability density function over an interval represents probability.
Probability distribution: A description of how probabilities are distributed over the values of a random variable.
Probability mass function: A function, denoted by f(x), that provides the probability that x assumes a particular value for a discrete random variable.
Probability of an event: Equal to the sum of the probabilities of outcomes for the event.
Random experiment: A process that generates well-defined experimental outcomes. On any single repetition or trial, the outcome that occurs is determined by chance.
Random variable: A numerical description of the outcome of an experiment.
Sample space: The set of all outcomes.
Standard deviation: Positive square root of the variance.
Triangular probability distribution: A continuous probability distribution in which the probability density function is shaped like a triangle defined by the minimum possible value a, the maximum possible value b, and the most likely value m. A triangular probability distribution is often used when only subjective estimates are available for the minimum, maximum, and most likely values.
Uniform probability distribution: A continuous probability distribution for which the probability that the random variable will assume a value in any interval is the same for each interval of equal length.
Union of A and B: The event containing the outcomes belonging to A or B or both. The union of A and B is denoted by A ∪ B.
Variance: A measure of the variability, or dispersion, of a random variable.
Venn diagram: A graphical representation of the sample space and operations involving events, in which the sample space is represented by a rectangle and events are represented as circles within the sample space.


PROBLEMS

1. On-time arrivals, lost baggage, and customer complaints are three measures that are typically used to measure the quality of service being offered by airlines. Suppose that the following values represent the on-time arrival percentage, amount of lost baggage, and customer complaints for 10 U.S. airlines.

Airline               On-Time Arrivals (%)   Mishandled Baggage per 1,000 Passengers   Customer Complaints per 1,000 Passengers
Virgin America        83.5                   0.87                                      1.50
JetBlue               79.1                   1.88                                      0.79
AirTran Airways       87.1                   1.58                                      0.91
Delta Air Lines       86.5                   2.10                                      0.73
Alaska Airlines       87.5                   2.93                                      0.51
Frontier Airlines     77.9                   2.22                                      1.05
Southwest Airlines    83.1                   3.08                                      0.25
US Airways            85.9                   2.14                                      1.74
American Airlines     76.9                   2.92                                      1.80
United Airlines       77.4                   3.87                                      4.24

a. Based on the data above, if you randomly choose a Delta Air Lines flight, what is the probability that this individual flight will have an on-time arrival?

b. If you randomly choose 1 of the 10 airlines for a follow-up study on airline quality ratings, what is the probability that you will choose an airline with less than two mishandled baggage reports per 1,000 passengers?

c. If you randomly choose 1 of the 10 airlines for a follow-up study on airline quality ratings, what is the probability that you will choose an airline with more than one customer complaint per 1,000 passengers?

d. What is the probability that a randomly selected AirTran Airways flight will not arrive on time?

2. Consider the random experiment of rolling a pair of dice. Suppose that we are interested in the sum of the face values showing on the dice.

a. How many outcomes are possible?

b. List the outcomes.

c. What is the probability of obtaining a value of 7?

d. What is the probability of obtaining a value of 9 or greater?

3. Suppose that for a recent admissions class, an Ivy League college received 2,851 applications for early admission. Of this group, it admitted 1,033 students early, rejected 854 outright, and deferred 964 to the regular admission pool for further consideration. In the past, this school has admitted 18% of the deferred early admission applicants during the regular admission process. Counting the students admitted early and the students admitted during the regular admission process, the total class size was 2,375. Let E, R, and D represent the events that a student who applies for early admission is admitted early, rejected outright, or deferred to the regular admissions pool.

a. Use the data to estimate P(E), P(R), and P(D).

b. Are events E and D mutually exclusive? Find P(E ∩ D).

c. For the 2,375 students who were admitted, what is the probability that a randomly selected student was accepted during early admission?

d. Suppose a student applies for early admission. What is the probability that the student will be admitted for early admission or be deferred and later admitted during the regular admission process?


4. Suppose that we have two events, A and B, with P(A) = 0.50, P(B) = 0.60, and P(A ∩ B) = 0.40.

a. Find P(A | B).

b. Find P(B | A).

c. Are A and B independent? Why or why not?

5. Students taking the Graduate Management Admissions Test (GMAT) were asked about their undergraduate major and intent to pursue their MBA as a full-time or part-time student. A summary of their responses is as follows:

                              Undergraduate Major
Intended Enrollment Status    Business   Engineering   Other   Totals
Full-Time                     352        197           251     800
Part-Time                     150        161           194     505
Totals                        502        358           445     1,305

a. Develop a joint probability table for these data.

b. Use the marginal probabilities of undergraduate major (business, engineering, or other) to comment on which undergraduate major produces the most potential MBA students.

c. If a student intends to attend classes full time in pursuit of an MBA degree, what is the probability that the student was an undergraduate engineering major?

d. If a student was an undergraduate business major, what is the probability that the student intends to attend classes full time in pursuit of an MBA degree?

e. Let F denote the event that the student intends to attend classes full time in pursuit of an MBA degree, and let B denote the event that the student was an undergraduate business major. Are events F and B independent? Justify your answer.

6. More than 40 million Americans are estimated to have at least one outstanding student loan to help pay college expenses (“40 Million Americans Now Have Student Loan Debt,” CNNMoney, September 2014). Not all of these graduates pay back their debt in satisfactory fashion. Suppose that the following joint probability table shows the probabilities of student loan status and whether or not the student had received a college degree.

                  College Degree
Loan Status       Yes     No
Satisfactory      0.26    0.24    0.50
Delinquent        0.16    0.34    0.50
                  0.42    0.58

a. What is the probability that a student with a student loan had received a college degree?

b. What is the probability that a student with a student loan had not received a college degree?

c. Given that the student has received a college degree, what is the probability that the student has a delinquent loan?

d. Given that the student has not received a college degree, what is the probability that the student has a delinquent loan?

e. What is the impact of dropping out of college without a degree for students who have a student loan?


7. The Human Resources Manager for Optilytics LLC is evaluating applications for the position of Senior Data Scientist. The file OptilyticsLLC presents summary data of the applicants for the position.

a. Use a PivotTable in Excel to create a joint probability table showing the probabilities associated with a randomly selected applicant’s sex and highest degree achieved. Use this joint probability table to answer the questions below.

b. What are the marginal probabilities? What do they tell you about the probabilities associated with the sex of applicants and highest degree completed by applicants?

c. If the applicant is female, what is the probability that the highest degree completed by the applicant is a PhD?

d. If the highest degree completed by the applicant is a bachelor’s degree, what is the probability that the applicant is male?

e. What is the probability that a randomly selected applicant will be a male whose highest completed degree is a PhD?

8. The U.S. Census Bureau is a leading source of quantitative data related to the people and economy of the United States. The crosstabulation below represents the number of households (thousands) and the household income by the highest level of education for the head of household (U.S. Census Bureau web site, 2013). Use this crosstabulation to answer the following questions.

                              Household Income
Highest Level of Education    Under $25,000   $25,000 to $49,999   $50,000 to $99,999   $100,000 and Over   Total
High school graduate          9,880           9,970                9,441                3,482               32,773
Bachelor’s degree             2,484           4,164                7,666                7,817               22,131
Master’s degree               685             1,205                3,019                4,094               9,003
Doctoral degree               79              160                  422                  1,076               1,737
Total                         13,128          15,499               20,548               16,469              65,644

a. Develop a joint probability table.

b. What is the probability the head of one of these households has a master’s degree or higher education?

c. What is the probability a household is headed by someone with a high school diploma earning $100,000 or more?

d. What is the probability one of these households has an income below $25,000?

e. What is the probability a household is headed by someone with a bachelor’s degree earning less than $25,000?

f. Are household income and educational level independent?

9. Cooper Realty is a small real estate company located in Albany, New York, that specializes primarily in residential listings. The company recently became interested in determining the likelihood of one of its listings being sold within a certain number of days. An analysis of company sales of 800 homes in previous years produced the following data.

                        Days Listed until Sold
Initial Asking Price    Under 30   31–90   Over 90   Total
Under $150,000          50         40      10        100
$150,000–$199,999       20         150     80        250
$200,000–$250,000       20         280     100       400
Over $250,000           10         30      10        50
Total                   100        500     200       800

a. If A is defined as the event that a home is listed for more than 90 days before being sold, estimate the probability of A.

b. If B is defined as the event that the initial asking price is under $150,000, estimate the probability of B.

c. What is the probability of A ∩ B?

d. Assuming that a contract was just signed to list a home with an initial asking price of less than $150,000, what is the probability that the home will take Cooper Realty more than 90 days to sell?

e. Are events A and B independent?

10. The prior probabilities for events A1 and A2 are P(A1) = 0.40 and P(A2) = 0.60. It is also known that P(A1 ∩ A2) = 0. Suppose P(B | A1) = 0.20 and P(B | A2) = 0.05.

a. Are A1 and A2 mutually exclusive? Explain.

b. Compute P(A1 ∩ B) and P(A2 ∩ B).

c. Compute P(B).

d. Apply Bayes’ theorem to compute P(A1 | B) and P(A2 | B).

11. A local bank reviewed its credit-card policy with the intention of recalling some of its credit cards. In the past, approximately 5% of cardholders defaulted, leaving the bank unable to collect the outstanding balance. Hence, management established a prior probability of 0.05 that any particular cardholder will default. The bank also found that the probability of missing a monthly payment is 0.20 for customers who do not default. Of course, the probability of missing a monthly payment for those who default is 1.

a. Given that a customer missed a monthly payment, compute the posterior probability that the customer will default.

b. The bank would like to recall its credit card if the probability that a customer will default is greater than 0.20. Should the bank recall its credit card if the customer misses a monthly payment? Why or why not?

12. RunningWithTheDevil.com created a web site to market running shoes and other running apparel. Management would like a special pop-up offer to appear for female web site visitors and a different special pop-up offer to appear for male web site visitors. From a sample of past web site visitors, RunningWithTheDevil’s management learns that 60% of the visitors are male and 40% are female.

a. What is the probability that a current visitor to the web site is female?

b. Suppose that 30% of RunningWithTheDevil’s female visitors previously visited LetsRun.com and 10% of male customers previously visited LetsRun.com. If the current visitor to RunningWithTheDevil’s web site previously visited LetsRun.com, what is the revised probability that the current visitor is female? Should the RunningWithTheDevil’s web site display the special offer that appeals to female visitors or the special offer that appeals to male visitors?

13. An oil company purchased an option on land in Alaska. Preliminary geologic studies assigned the following prior probabilities.

P(high-quality oil) = 0.50
P(medium-quality oil) = 0.20
P(no oil) = 0.30

a. What is the probability of finding oil?

b. After 200 feet of drilling on the first well, a soil test is taken. The probabilities of finding the particular type of soil identified by the test are as follows.

P(soil | high-quality oil) = 0.20
P(soil | medium-quality oil) = 0.80
P(soil | no oil) = 0.20

c. How should the firm interpret the soil test? What are the revised probabilities, and what is the new probability of finding oil?


14. Suppose the following data represent the number of persons unemployed for a given number of months in Killeen, Texas. The values in the first column show the number of months unemployed and the values in the second column show the corresponding number of unemployed persons.

Months Unemployed Number Unemployed

1 1,029

2 1,686

3 2,269

4 2,675

5 3,487

6 4,652

7 4,145

8 3,587

9 2,325

10 1,120

Let x be a random variable indicating the number of months a randomly selected person is unemployed.

a. Use the data to develop an empirical discrete probability distribution for x.

b. Show that your probability distribution satisfies the conditions for a valid discrete probability distribution.

c. What is the probability that a person is unemployed for two months or less? Unemployed for more than two months?

d. What is the probability that a person is unemployed for more than six months?

15. The percent frequency distributions of job satisfaction scores for a sample of information systems (IS) senior executives and middle managers are as follows. The scores range from a low of 1 (very dissatisfied) to a high of 5 (very satisfied).

Job Satisfaction Score

IS Senior Executives (%)

IS Middle Managers (%)

1 5 4

2 9 10

3 3 12

4 42 46

5 41 28

a. Develop a probability distribution for the job satisfaction score of a randomly selected senior executive.

b. Develop a probability distribution for the job satisfaction score of a randomly selected middle manager.

c. What is the probability that a randomly selected senior executive will report a job satisfaction score of 4 or 5?

d. What is the probability that a randomly selected middle manager is very satisfied?

e. Compare the overall job satisfaction of senior executives and middle managers.


16. The following table provides a probability distribution for the random variable y.

y    f(y)
2    0.20
4    0.30
7    0.40
8    0.10

a. Compute E(y).

b. Compute Var(y) and σ.

17. The probability distribution for damage claims paid by the Newton Automobile Insurance Company on collision insurance is as follows.

Payment ($)    Probability
0              0.85
500            0.04
1,000          0.04
3,000          0.03
5,000          0.02
8,000          0.01
10,000         0.01

a. Use the expected collision payment to determine the collision insurance premium that would enable the company to break even.

b. The insurance company charges an annual rate of $520 for the collision coverage. What is the expected value of the collision policy for a policyholder? (Hint: It is the expected payments from the company minus the cost of coverage.) Why does the policyholder purchase a collision policy with this expected value?

18. The J.R. Ryland Computer Company is considering a plant expansion to enable the company to begin production of a new computer product. The company’s president must determine whether to make the expansion a medium- or large-scale project. Demand for the new product is uncertain, which for planning purposes may be low demand, medium demand, or high demand. The probability estimates for demand are 0.20, 0.50, and 0.30, respectively. Letting x and y indicate the annual profit in thousands of dollars, the firm’s planners developed the following profit forecasts for the medium- and large-scale expansion projects.

           Medium-Scale Expansion Profit    Large-Scale Expansion Profit
Demand     x      f(x)                      y      f(y)
Low        50     0.20                      0      0.20
Medium     150    0.50                      100    0.50
High       200    0.30                      300    0.30

a. Compute the expected value for the profit associated with the two expansion alternatives. Which decision is preferred for the objective of maximizing the expected profit?

b. Compute the variance for the profit associated with the two expansion alternatives. Which decision is preferred for the objective of minimizing the risk or uncertainty?


19. Consider a binomial experiment with \(n = 10\) and \(p = 0.10\). a. Compute f(0). b. Compute f(2). c. Compute \(P(x \le 2)\). d. Compute \(P(x \ge 1)\). e. Compute E(x). f. Compute Var(x) and \(\sigma\).

20. Many companies use a quality control technique called acceptance sampling to monitor incoming shipments of parts, raw materials, and so on. In the electronics industry, component parts are commonly shipped from suppliers in large lots. Inspection of a sample of n components can be viewed as the n trials of a binomial experiment. The outcome for each component tested (trial) will be that the component is classified as good or defective. Reynolds Electronics accepts a lot from a particular supplier if the defective components in the lot do not exceed 1%. Suppose a random sample of five items from a recent shipment is tested. a. Assume that 1% of the shipment is defective. Compute the probability that no items in the sample are defective. b. Assume that 1% of the shipment is defective. Compute the probability that exactly one item in the sample is defective. c. What is the probability of observing one or more defective items in the sample if 1% of the shipment is defective? d. Would you feel comfortable accepting the shipment if one item was found to be defective? Why or why not?

21. A university found that 20% of its students withdraw without completing the introductory statistics course. Assume that 20 students registered for the course. a. Compute the probability that two or fewer will withdraw. b. Compute the probability that exactly four will withdraw. c. Compute the probability that more than three will withdraw. d. Compute the expected number of withdrawals.

22. Consider a Poisson distribution with \(\mu = 3\). a. Write the appropriate Poisson probability mass function. b. Compute f(2). c. Compute f(1). d. Compute \(P(x \ge 2)\).

23. Emergency 911 calls to a small municipality in Idaho come in at the rate of one every 2 minutes. Assume that the number of 911 calls is a random variable that can be described by the Poisson distribution. a. What is the expected number of 911 calls in 1 hour? b. What is the probability of three 911 calls in 5 minutes? c. What is the probability of no 911 calls during a 5-minute period?

24. A regional director responsible for business development in the state of Pennsylvania is concerned about the number of small business failures. If the mean number of small business failures per month is 10, what is the probability that exactly 4 small businesses will fail during a given month? Assume that the probability of a failure is the same for any two months and that the occurrence or nonoccurrence of a failure in any month is independent of failures in any other month.

25. The random variable x is known to be uniformly distributed between 10 and 20. a. Show the graph of the probability density function. b. Compute \(P(x < 15)\). c. Compute \(P(12 \le x \le 18)\). d. Compute E(x). e. Compute Var(x).


26. Most computer languages include a function that can be used to generate random numbers. In Excel, the RAND function can be used to generate random numbers between 0 and 1. If we let x denote a random number generated using RAND, then x is a continuous random variable with the following probability density function:

\[
f(x) =
\begin{cases}
1 & \text{for } 0 \le x \le 1 \\
0 & \text{elsewhere}
\end{cases}
\]

a. Graph the probability density function. b. What is the probability of generating a random number between 0.25 and 0.75? c. What is the probability of generating a random number with a value less than or equal to 0.30? d. What is the probability of generating a random number with a value greater than 0.60? e. Generate 50 random numbers by entering =RAND() into 50 cells of an Excel worksheet. f. Compute the mean and standard deviation for the random numbers in part (e).

27. Suppose we are interested in bidding on a piece of land and we know one other bidder is interested. The seller announced that the highest bid in excess of $10,000 will be accepted. Assume that the competitor’s bid x is a random variable that is uniformly distributed between $10,000 and $15,000. a. Suppose you bid $12,000. What is the probability that your bid will be accepted? b. Suppose you bid $14,000. What is the probability that your bid will be accepted? c. What amount should you bid to maximize the probability that you get the property? d. Suppose you know someone who is willing to pay you $16,000 for the property. Would you consider bidding less than the amount in part (c)? Why or why not?

28. A random variable has a triangular probability density function with \(a = 50\), \(b = 375\), and \(m = 250\). a. Sketch the probability distribution function for this random variable. Label the points \(a = 50\), \(b = 375\), and \(m = 250\) on the x-axis. b. What is the probability that the random variable will assume a value between 50 and 250? c. What is the probability that the random variable will assume a value greater than 300?

29. The Siler Construction Company is about to bid on a new industrial construction project. To formulate their bid, the company needs to estimate the time required for the project. Based on past experience, management expects that the project will require at least 24 months, and could take as long as 48 months if there are complications. The most likely scenario is that the project will require 30 months. a. Assume that the actual time for the project can be approximated using a triangular probability distribution. What is the probability that the project will take less than 30 months? b. What is the probability that the project will take between 28 and 32 months? c. To submit a competitive bid, the company believes that if the project takes more than 36 months, then the company will lose money on the project. Management does not want to bid on the project if there is greater than a 25% chance that they will lose money on this project. Should the company bid on this project?

30. Suppose that the return for a particular large-cap stock fund is normally distributed with a mean of 14.4% and standard deviation of 4.4%. a. What is the probability that the large-cap stock fund has a return of at least 20%? b. What is the probability that the large-cap stock fund has a return of 10% or less?


31. A person must score in the upper 2% of the population on an IQ test to qualify for membership in Mensa, the international high IQ society. If IQ scores are normally distributed with a mean of 100 and a standard deviation of 15, what score must a person have to qualify for Mensa?

32. Assume that the traffic to the web site of Smiley’s People, Inc., which sells customized T-shirts, follows a normal distribution, with a mean of 4.5 million visitors per day and a standard deviation of 820,000 visitors per day. a. What is the probability that the web site has fewer than 5 million visitors in a single day? b. What is the probability that the web site has 3 million or more visitors in a single day? c. What is the probability that the web site has between 3 million and 4 million visitors in a single day? d. Assume that 85% of the time, the Smiley’s People web servers can handle the daily web traffic volume without purchasing additional server capacity. What is the amount of web traffic that will require Smiley’s People to purchase additional server capacity?

33. Suppose that Motorola uses the normal distribution to determine the probability of defects and the number of defects in a particular production process. Assume that the production process manufactures items with a mean weight of 10 ounces. Calculate the probability of a defect and the expected number of defects for a 1,000-unit production run in the following situations. a. The process standard deviation is 0.15, and the process control is set at plus or minus one standard deviation. Units with weights less than 9.85 or greater than 10.15 ounces will be classified as defects. b. Through process design improvements, the process standard deviation can be reduced to 0.05. Assume that the process control remains the same, with weights less than 9.85 or greater than 10.15 ounces being classified as defects. c. What is the advantage of reducing process variation, thereby causing process control limits to be at a greater number of standard deviations from the mean?

34. Consider the following exponential probability density function:

\[
f(x) = \frac{1}{3} e^{-x/3} \quad \text{for } x \ge 0
\]

a. Write the formula for \(P(x \le x_0)\). b. Find \(P(x \le 2)\). c. Find \(P(x \ge 3)\). d. Find \(P(x \le 5)\). e. Find \(P(2 \le x \le 5)\).

35. The time between arrivals of vehicles at a particular intersection follows an exponential probability distribution with a mean of 12 seconds. a. Sketch this exponential probability distribution. b. What is the probability that the arrival time between vehicles is 12 seconds or less? c. What is the probability that the arrival time between vehicles is 6 seconds or less? d. What is the probability of 30 or more seconds between vehicle arrivals?

36. Suppose that the time spent by players in a single session on the World of Warcraft multiplayer online role-playing game follows an exponential distribution with a mean of 38.3 minutes. a. Write the exponential probability distribution function for the time spent by players on a single session of World of Warcraft. b. What is the probability that a player will spend between 20 and 40 minutes on a single session of World of Warcraft? c. What is the probability that a player will spend more than 1 hour on a single session of World of Warcraft?


Case Problem: Hamilton County Judges

Hamilton County judges try thousands of cases per year. In an overwhelming majority of the cases disposed, the verdict stands as rendered. However, some cases are appealed, and of those appealed, some of the cases are reversed. Kristen DelGuzzi of the Cincinnati Enquirer newspaper conducted a study of cases handled by Hamilton County judges over a three-year period. Shown in the table below are the results for 182,908 cases handled (disposed) by 38 judges in Common Pleas Court, Domestic Relations Court, and Municipal Court. Two of the judges (Dinkelacker and Hogan) did not serve in the same court for the entire three-year period.

The purpose of the newspaper’s study was to evaluate the performance of the judges. Appeals are often the result of mistakes made by judges, and the newspaper wanted to know which judges were doing a good job and which were making too many mistakes. You are called in to assist in the data analysis. Use your knowledge of probability and conditional probability to help with the ranking of the judges. You also may be able to analyze the likelihood of appeal and reversal for cases handled by different courts.

Total Cases Disposed, Appealed, and Reversed in Hamilton County Courts

Common Pleas Court

Judge Total Cases Disposed Appealed Cases Reversed Cases

Fred Cartolano 3,037 137 12

Thomas Crush 3,372 119 10

Patrick Dinkelacker 1,258 44 8

Timothy Hogan 1,954 60 7

Robert Kraft 3,138 127 7

William Mathews 2,264 91 18

William Morrissey 3,032 121 22

Norbert Nadel 2,959 131 20

Arthur Ney, Jr. 3,219 125 14

Richard Niehaus 3,353 137 16

Thomas Nurre 3,000 121 6

John O’Connor 2,969 129 12

Robert Ruehlman 3,205 145 18

J. Howard Sundermann 955 60 10

Ann Marie Tracey 3,141 127 13

Ralph Winkler 3,089 88 6

Total 43,945 1,762 199

Domestic Relations Court

Judge Total Cases Disposed Appealed Cases Reversed Cases

Penelope Cunningham 2,729 7 1

Patrick Dinkelacker 6,001 19 4

Deborah Gaines 8,799 48 9

Ronald Panioto 12,970 32 3

Total 30,499 106 17


Managerial Report

Prepare a report with your rankings of the judges. Also, include an analysis of the likelihood of appeal and case reversal in the three courts. At a minimum, your report should include the following:

1. The probability of cases being appealed and reversed in the three different courts.
2. The probability of a case being appealed for each judge.
3. The probability of a case being reversed for each judge.
4. The probability of reversal given an appeal for each judge.
5. Rank the judges within each court. State the criteria you used and provide a rationale for your choice.

Municipal Court

Judge Total Cases Disposed Appealed Cases Reversed Cases

Mike Allen 6,149 43 4

Nadine Allen 7,812 34 6

Timothy Black 7,954 41 6

David Davis 7,736 43 5

Leslie Isaiah Gaines 5,282 35 13

Karla Grady 5,253 6 0

Deidra Hair 2,532 5 0

Dennis Helmick 7,900 29 5

Timothy Hogan 2,308 13 2

James Patrick Kenney 2,798 6 1

Joseph Luebbers 4,698 25 8

William Mallory 8,277 38 9

Melba Marsh 8,219 34 7

Beth Mattingly 2,971 13 1

Albert Mestemaker 4,975 28 9

Mark Painter 2,239 7 3

Jack Rosen 7,790 41 13

Mark Schweikert 5,403 33 6

David Stockdale 5,371 22 4

John A. West 2,797 4 2

Total 108,464 500 104

Chapter 6

Statistical Inference

Contents

Analytics in Action: John Morrell & Company

6.1 Selecting a Sample
    Sampling from a Finite Population
    Sampling from an Infinite Population

6.2 Point Estimation
    Practical Advice

6.3 Sampling Distributions
    Sampling Distribution of \(\bar{x}\)
    Sampling Distribution of \(\bar{p}\)

6.4 Interval Estimation
    Interval Estimation of the Population Mean
    Interval Estimation of the Population Proportion

6.5 Hypothesis Tests
    Developing Null and Alternative Hypotheses
    Type I and Type II Errors
    Hypothesis Test of the Population Mean
    Hypothesis Test of the Population Proportion

6.6 Big Data, Statistical Inference, and Practical Significance
    Sampling Error
    Nonsampling Error
    Big Data
    Understanding What Big Data Is
    Big Data and Sampling Error
    Big Data and the Precision of Confidence Intervals
    Implications of Big Data for Confidence Intervals
    Big Data, Hypothesis Testing, and p Values
    Implications of Big Data in Hypothesis Testing


When collecting data, we usually want to learn about some characteristic(s) of the population, the collection of all the elements of interest, from which we are collecting that data. In order to know about some characteristic of a population with certainty, we must collect data from every element in the population of interest; such an effort is referred to as a census. However, there are many potential difficulties associated with taking a census:

• A census may be expensive; if resources are limited, it may not be feasible to take a census.

• A census may be time consuming; if the data need to be collected quickly, a census may not be suitable.

• A census may be misleading; if the population is changing quickly, by the time a census is completed the data may be obsolete.

Refer to Chapter 2 for a fundamental overview of data and descriptive statistics.

Analytics in Action: John Morrell & Company*

Cincinnati, Ohio

John Morrell & Company, which was established in England in 1827, is considered the oldest continuously operating meat manufacturer in the United States. It is a wholly owned and independently managed subsidiary of Smithfield Foods, Smithfield, Virginia. John Morrell & Company offers an extensive product line of processed meats and fresh pork to consumers under 13 regional brands, including John Morrell, E-Z-Cut, Tobin’s First Prize, Dinner Bell, Hunter, Kretschmar, Rath, Rodeo, Shenson, Farmers Hickory Brand, Iowa Quality, and Peyton’s. Each regional brand enjoys high brand recognition and loyalty among consumers.

Market research at Morrell provides management with up-to-date information on the company’s various products and how the products compare with competing brands of similar products. In order to compare a beef pot roast made by Morrell to similar beef products from two major competitors, Morrell asked a random sample of consumers to indicate how the products rated in terms of taste, appearance, aroma, and overall preference.

In Morrell’s independent taste-test study, a sample of 224 consumers in Cincinnati, Milwaukee, and Los Angeles was chosen. Of these 224 consumers, 150 preferred the beef pot roast made by Morrell. Based on these results, Morrell estimates that the population proportion that prefers Morrell’s beef pot roast is \(\bar{p} = 150/224 = 0.6696\). Recognizing that this estimate is subject to sampling error, Morrell calculates the 95% confidence interval for the population proportion that prefers Morrell’s beef pot roast to be 0.6080 to 0.7312.
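The interval quoted above can be reproduced with a short calculation. The sketch below assumes the usual large-sample interval for a proportion, the sample proportion plus or minus 1.96 estimated standard errors; the function name `proportion_interval` is ours and does not come from the study.

```python
import math

def proportion_interval(successes, n, z=1.96):
    """Return (p_bar, lower, upper) for a large-sample interval estimate.

    z = 1.96 is the standard normal value for 95% confidence.
    """
    p_bar = successes / n                            # sample proportion
    std_error = math.sqrt(p_bar * (1 - p_bar) / n)   # estimated standard error
    margin = z * std_error                           # margin of error
    return p_bar, p_bar - margin, p_bar + margin

# Morrell taste test: 150 of 224 consumers preferred Morrell's pot roast.
p_bar, lower, upper = proportion_interval(150, 224)
print(f"p_bar = {p_bar:.4f}")
print(f"95% interval: {lower:.4f} to {upper:.4f}")
```

Rounded to four decimal places, the endpoints agree with the 0.6080 to 0.7312 reported in the study.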

Morrell then turned its attention to whether these sample data support the conclusion that Morrell’s beef pot roast is the preferred choice of more than 50% of the consumer population. Letting p indicate the proportion of the population that prefers Morrell’s product, the hypothesis test for the research question is as follows:

\[
\begin{aligned}
H_0 &: p \le 0.50 \\
H_a &: p > 0.50
\end{aligned}
\]

The null hypothesis \(H_0\) indicates the preference for Morrell’s product is less than or equal to 50%. If the sample data support rejecting \(H_0\) in favor of the alternative hypothesis \(H_a\), Morrell will draw the research conclusion that in a three-product comparison, its beef pot roast is preferred by more than 50% of the consumer population. Using statistical hypothesis testing procedures, the null hypothesis \(H_0\) was rejected. The study provided statistical evidence supporting \(H_a\) and the conclusion that the Morrell product is preferred by more than 50% of the consumer population.
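As a rough check on this conclusion, the sketch below carries out an upper-tail z test for a proportion using the study's counts (150 of 224). Two things are assumptions on our part: the study does not report which test statistic or level of significance was used, so the large-sample z statistic and \(\alpha = 0.05\) are chosen here purely for illustration.

```python
import math
from statistics import NormalDist

n = 224            # consumers sampled
preferred = 150    # consumers who preferred Morrell's pot roast
p0 = 0.50          # hypothesized proportion at the boundary of H0
alpha = 0.05       # assumed level of significance (not stated in the study)

p_bar = preferred / n
# Under H0 the standard error uses the hypothesized value p0, not p_bar.
std_error = math.sqrt(p0 * (1 - p0) / n)
z = (p_bar - p0) / std_error

# Upper-tail p-value because Ha is p > 0.50.
p_value = 1 - NormalDist().cdf(z)
reject_h0 = p_value <= alpha

print(f"z = {z:.2f}, p-value = {p_value:.2e}, reject H0: {reject_h0}")
```

The test statistic is about 5 standard errors above 0.50, so under these assumptions the data reject \(H_0\), which is consistent with the conclusion reported above.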

In this chapter, you will learn about simple random sampling and the sample selection process. In addition, you will learn how statistics such as the sample mean and sample proportion are used to estimate parameters such as the population mean and population proportion. The concept of a sampling distribution will be introduced and used to compute the margins of error associated with sample estimates. You will then learn how to use this information to construct and interpret interval estimates of a population mean and a population proportion. We then discuss how to formulate hypotheses and how to conduct tests such as the one used by Morrell. You will learn how to use sample data to determine whether or not a hypothesis should be rejected.

*The authors are indebted to Marty Butler, Vice President of Marketing, John Morrell, for providing this Analytics in Action.


• A census may be unnecessary; if perfect information about the characteristic(s) of the population of interest is not required, a census may be excessive.

• A census may be impractical; if observations are destructive, taking a census would destroy the population of interest.

In order to overcome the potential difficulties associated with taking a census, we may decide to take a sample (a subset of the population) and subsequently use the sample data we collect to make inferences and answer research questions about the population of interest. Therefore, the objective of sampling is to gather data from a subset of the population that is as similar as possible to the entire population so that what we learn from the sample data accurately reflects what we want to understand about the entire population. When we use the sample data we have collected to make estimates of or draw conclusions about one or more characteristics of a population (the value of one or more parameters), we are using the process of statistical inference.

Sampling is done in a wide variety of research settings. Let us begin our discussion of statistical inference by citing two examples in which sampling was used to answer a research question about a population.

1. Members of a political party in Texas are considering giving their support to a particular candidate for election to the U.S. Senate, and party leaders want to estimate the proportion of registered voters in the state that favor the candidate. A sample of 400 registered voters in Texas is selected, and 160 of those voters indicate a preference for the candidate. Thus, an estimate of the proportion of the population of registered voters who favor the candidate is 160/400 = 0.40.

2. A tire manufacturer is considering production of a new tire designed to provide an increase in lifetime mileage over the firm’s current line of tires. To estimate the mean useful life of the new tires, the manufacturer produced a sample of 120 tires for testing. The test results provided a sample mean of 36,500 miles. Hence, an estimate of the mean useful life for the population of new tires is 36,500 miles.

It is important to realize that sample results provide only estimates of the values of the corresponding population characteristics. We do not expect exactly 0.40, or 40%, of the population of registered voters to favor the candidate, nor do we expect the sample mean of 36,500 miles to exactly equal the mean lifetime mileage for the population of all new tires produced. The reason is simply that the sample contains only a portion of the population and cannot be expected to perfectly replicate the population. Some error, or deviation of the sample from the population, is to be expected. With proper sampling methods, the sample results will provide “good” estimates of the population parameters. But how good can we expect the sample results to be? Fortunately, statistical procedures are available for answering this question.
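A small simulation can make this deviation of the sample from the population concrete. The sketch below is illustrative only: it assumes, hypothetically, that the true proportion of voters favoring the candidate is exactly 0.40 (the text does not say this), treats each sampled voter as an independent draw, and repeats the 400-voter poll five times.

```python
import random

def sample_proportion(true_p, n, rng):
    """Poll n voters and return the fraction who favor the candidate."""
    favor = sum(1 for _ in range(n) if rng.random() < true_p)
    return favor / n

rng = random.Random(1)   # fixed seed so the run is repeatable
true_p = 0.40            # hypothetical population proportion (an assumption)

# Five independent polls of 400 voters each.
estimates = [sample_proportion(true_p, 400, rng) for _ in range(5)]
for i, estimate in enumerate(estimates, start=1):
    print(f"poll {i}: estimate = {estimate:.4f}, error = {estimate - true_p:+.4f}")
```

Each poll generally lands near 0.40 but rarely exactly on it; the spread of these estimates is what the sampling distributions introduced later in the chapter describe.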

Let us define some of the terms used in sampling. The sampled population is the population from which the sample is drawn, and a frame is a list of the elements from which the sample will be selected. In the first example, the sampled population is all registered voters in Texas, and the frame is a list of all the registered voters. Because the number of registered voters in Texas is a finite number, the first example is an illustration of sampling from a finite population.

The sampled population for the tire mileage example is more difficult to define because the sample of 120 tires was obtained from a production process at a particular point in time. We can think of the sampled population as the conceptual population of all the tires that could have been made by the production process at that particular point in time. In this sense the sampled population is considered infinite, making it impossible to construct a frame from which to draw the sample.

In this chapter, we show how simple random sampling can be used to select a sample from a finite population and we describe how a random sample can be taken from an infinite population that is generated by an ongoing process. We then discuss how data obtained from a sample can be used to compute estimates of a population mean, a population standard deviation, and a population proportion. In addition, we introduce the important concept of a sampling distribution. As we will show, knowledge of the appropriate sampling distribution enables us to make statements about how close the sample estimates are to the corresponding population parameters, to compute the margins of error associated with these sample estimates, and to construct and interpret interval estimates. We then discuss how to formulate hypotheses and how to use sample data to conduct tests of a population mean and a population proportion.

A sample that is similar to the population from which it has been drawn is said to be representative of the population.

A sample mean provides an estimate of a population mean, and a sample proportion provides an estimate of a population proportion. With estimates such as these, some estimation error can be expected. This chapter provides the basis for determining how large that error might be.

6.1 Selecting a Sample

The director of personnel for Electronics Associates, Inc. (EAI) has been assigned the task of developing a profile of the company’s 2,500 employees. The characteristics to be identified include the mean annual salary for the employees and the proportion of employees having completed the company’s management training program.

Using the 2,500 employees as the population for this study, we can find the annual salary and the training program status for each individual by referring to the firm’s personnel records. The data set containing this information for all 2,500 employees in the population is in the file EAI.

A measurable factor that defines a characteristic of a population, process, or system is called a parameter. For EAI, the population mean annual salary \(\mu\), the population standard deviation of annual salaries \(\sigma\), and the population proportion \(p\) of employees who completed the training program are of interest to us. Using the EAI data, we compute the population mean and the population standard deviation for the annual salary data.

Population mean: \(\mu = \$51{,}800\)

Population standard deviation: \(\sigma = \$4{,}000\)

The data for the training program status show that 1,500 of the 2,500 employees completed the training program. Letting \(p\) denote the proportion of the population that completed the training program, we see that \(p = 1{,}500/2{,}500 = 0.60\). The population mean annual salary (\(\mu = \$51{,}800\)), the population standard deviation of annual salary (\(\sigma = \$4{,}000\)), and the population proportion that completed the training program (\(p = 0.60\)) are parameters of the population of EAI employees.
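For readers working outside Excel, the following sketch shows how the three parameters might be computed when data for every element of the population are available. The five-employee population is hypothetical and stands in for the EAI file, which is not reproduced here; the values were chosen only so that the mean and proportion happen to match the EAI figures.

```python
from statistics import fmean, pstdev

# Hypothetical five-employee population (not the EAI data).
salaries = [48_000, 52_500, 55_000, 49_800, 53_700]
completed_training = [True, False, True, True, False]

mu = fmean(salaries)      # population mean
sigma = pstdev(salaries)  # population standard deviation (divides by N, not N - 1)
p = sum(completed_training) / len(completed_training)  # population proportion

print(f"mu = {mu:,.0f}, sigma = {sigma:,.2f}, p = {p:.2f}")
```

Because every element of the population is included, these values are parameters rather than estimates; `pstdev` is used instead of `stdev` for that reason.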

Now suppose that the necessary information on all the EAI employees was not readily available in the company’s database. The question we must consider is how the firm’s director of personnel can obtain estimates of the population parameters by using a sample of employees rather than all 2,500 employees in the population. Suppose that a sample of 30 employees will be used. Clearly, the time and the cost of developing a profile would be substantially less for 30 employees than for the entire population. If the personnel director could be assured that a sample of 30 employees would provide adequate information about the population of 2,500 employees, working with a sample would be preferable to working with the entire population. Let us explore the possibility of using a sample for the EAI study by first considering how we can identify a sample of 30 employees.

Sampling from a Finite Population

Statisticians recommend selecting a probability sample when sampling from a finite population because a probability sample allows you to make valid statistical inferences about the population. The simplest type of probability sample is one in which each sample of size n has the same probability of being selected. It is called a simple random sample. A simple random sample of size n from a finite population of size N is defined as follows.

Chapter 2 discusses the computation of the mean and standard deviation of a population.

Often the cost of collecting information from a sample is substantially less than the cost of taking a census, especially when personal interviews must be conducted to collect the information.

Simple Random Sample (Finite Population)

A simple random sample of size n from a finite population of size N is a sample selected such that each possible sample of size n has the same probability of being selected.
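To see what "each possible sample of size n" means in practice, the sketch below lists every sample of size 2 from a hypothetical five-element population and then counts, without listing them, the possible samples of size 30 from a population of 2,500. The element labels are placeholders of our own.

```python
from itertools import combinations
from math import comb

population = ["A", "B", "C", "D", "E"]   # hypothetical population, N = 5
n = 2

# Every possible sample of size n (order within a sample does not matter).
samples = list(combinations(population, n))
prob_each = 1 / len(samples)   # equal probability under simple random sampling

print(len(samples), prob_each)   # 10 0.1
print(samples)

# For EAI (N = 2,500, n = 30) the samples are far too many to list,
# but they can still be counted.
eai_sample_count = comb(2500, 30)
print(f"EAI: the number of possible samples has {len(str(eai_sample_count))} digits")
```

Under simple random sampling each of these samples is equally likely, so the probability of any particular one is 1 divided by the count.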


Procedures used to select a simple random sample from a finite population are based on the use of random numbers. We can use Excel’s RAND function to generate a random number between 0 and 1 by entering the formula =RAND() into any cell in a worksheet. The number generated is called a random number because the mathematical procedure used by the RAND function guarantees that every number between 0 and 1 has the same probability of being selected. Let us see how these random numbers can be used to select a simple random sample.

The random numbers generated using Excel’s RAND function follow a uniform probability distribution between 0 and 1.

Our procedure for selecting a simple random sample of size n from a population of size N involves two steps.

Step 1. Assign a random number to each element of the population.
Step 2. Select the n elements corresponding to the n smallest random numbers.

Because each set of n elements in the population has the same probability of being assigned the n smallest random numbers, each set of n elements has the same probability of being selected for the sample. If we select the sample using this two-step procedure, every sample of size n has the same probability of being selected; thus, the sample selected satisfies the definition of a simple random sample.
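The two-step procedure translates directly into code. The following is a minimal sketch, not the book's Excel implementation: `simple_random_sample` is a name of our own, Python's `random.random()` plays the role of RAND, and the employee identifiers 1 through 2,500 are placeholders for the EAI records.

```python
import random

def simple_random_sample(population, n, rng=None):
    """Select n elements using the two-step random-number procedure."""
    rng = rng or random.Random()
    # Step 1: assign a random number between 0 and 1 to each element.
    keyed = [(rng.random(), element) for element in population]
    # Step 2: keep the n elements with the n smallest random numbers.
    keyed.sort(key=lambda pair: pair[0])
    return [element for _, element in keyed[:n]]

employees = list(range(1, 2501))   # placeholder IDs for 2,500 employees
sample = simple_random_sample(employees, 30, random.Random(42))
print(sample)
```

In everyday work, the standard library's `random.sample(population, n)` gives a sample without replacement in a single call; the version above is written out only to mirror the two steps in the text.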

Let us consider the process of selecting a simple random sample of 30 EAI employees from the population of 2,500. We begin by generating 2,500 random numbers, one for each employee in the population. Then we select 30 employees corresponding to the 30 smallest random numbers as our sample. Refer to Figure 6.1 as we describe the steps involved.

Step 1. In cell D1, enter the text Random Numbers
Step 2. In cells D2:D2501, enter the formula =RAND()
Step 3. Select the cell range D2:D2501
Step 4. In the Home tab in the Ribbon:
    Click Copy in the Clipboard group
    Click the arrow below Paste in the Clipboard group. When the Paste window appears, click Values in the Paste Values area
    Press the Esc key
Step 5. Select cells A1:D2501
Step 6. In the Data tab on the Ribbon, click Sort in the Sort & Filter group
Step 7. When the Sort dialog box appears:
    Select the check box for My data has headers
    In the first Sort by dropdown menu, select Random Numbers
    Click OK

After completing these steps we obtain a worksheet like the one shown on the right in Figure 6.1. The employees listed in rows 2–31 are the ones corresponding to the smallest 30 random numbers that were generated. Hence, this group of 30 employees is a simple random sample. Note that the random numbers shown on the right in Figure 6.1 are in ascending order, and that the employees are not in their original order. For instance, employee 812 in the population is associated with the smallest random number and is the first element in the sample, and employee 13 in the population (see row 14 of the worksheet on the left) has been included as the 22nd observation in the sample (row 23 of the worksheet on the right).

Sampling from an Infinite Population

Sometimes we want to select a sample from a population, but the population is infinitely large or the elements of the population are being generated by an ongoing process for which there is no limit on the number of elements that can be generated. Thus, it is not possible to develop a list of all the elements in the population. This is considered the infinite population case. With an infinite population, we cannot select a simple random sample because we cannot construct a frame consisting of all the elements. In the infinite population case, statisticians recommend selecting what is called a random sample.

Excel’s Sort procedure is especially useful for identifying the n elements assigned the n smallest random numbers.

The random numbers generated by executing these steps will vary; therefore, results will not match Figure 6.1.

RANDOM SAMPLE (INFINITE POPULATION)

A random sample of size n from an infinite population is a sample selected such that the following conditions are satisfied.

1. Each element selected comes from the same population.
2. Each element is selected independently.


Care and judgment must be exercised in implementing the selection process for obtaining a random sample from an infinite population. Each case may require a different selection procedure. Let us consider two examples to see what we mean by the conditions: (1) Each element selected comes from the same population, and (2) each element is selected independently.

A common quality-control application involves a production process for which there is no limit on the number of elements that can be produced. The conceptual population from which we are sampling is all the elements that could be produced (not just the ones that are produced) by the ongoing production process. Because we cannot develop a list of all the elements that could be produced, the population is considered infinite. To be more specific, let us consider a production line designed to fill boxes with breakfast cereal to a mean weight of 24 ounces per box. Samples of 12 boxes filled by this process are periodically selected by a quality-control inspector to determine if the process is operating properly or whether, perhaps, a machine malfunction has caused the process to begin underfilling or overfilling the boxes.

With a production operation such as this, the biggest concern in selecting a random sample is to make sure that condition 1, the sampled elements are selected from the same population, is satisfied. To ensure that this condition is satisfied, the boxes must be selected at approximately the same point in time. This way the inspector avoids the possibility of selecting some boxes when the process is operating properly and other boxes when the process is not operating properly and is underfilling or overfilling the boxes. With a production process such as this, the second condition, each element is selected independently, is satisfied by designing the production process so that each box of cereal is filled independently. With this assumption, the quality-control inspector need only worry about satisfying the same population condition.

FIGURE 6.1  Using Excel to Select a Simple Random Sample

Left worksheet (before sorting). The formula in cells D2:D2501 is =RAND().

Employee | Annual Salary | Training Program | Random Numbers
1 | 55769.50 | No | 0.613872
2 | 50823.00 | Yes | 0.473204
3 | 48408.20 | No | 0.549011
4 | 49787.50 | No | 0.047482
5 | 52801.60 | Yes | 0.531085
6 | 51767.70 | No | 0.994296
7 | 58346.60 | Yes | 0.189065
8 | 46670.20 | No | 0.020714
9 | 50246.80 | Yes | 0.647318
10 | 51255.00 | No | 0.524341
11 | 52546.60 | No | 0.764998
12 | 49512.50 | Yes | 0.255244
13 | 51753.00 | Yes | 0.010923
14 | 53547.10 | No | 0.238003
15 | 48052.20 | No | 0.635675
16 | 44652.50 | Yes | 0.177294
17 | 51764.90 | Yes | 0.415097
18 | 45187.80 | Yes | 0.883440
19 | 49867.50 | Yes | 0.476824
20 | 53706.30 | Yes | 0.101065
21 | 52039.50 | Yes | 0.775323
22 | 52973.60 | No | 0.011729
23 | 53372.50 | No | 0.762026
24 | 54592.00 | Yes | 0.066344
25 | 55738.10 | Yes | 0.776766
26 | 52975.10 | Yes | 0.828493
27 | 52386.20 | Yes | 0.841532
28 | 51051.60 | Yes | 0.899427
29 | 52095.60 | Yes | 0.486284
30 | 44956.50 | No | 0.264628

Right worksheet (after sorting by Random Numbers).

Employee | Annual Salary | Training Program | Random Numbers
812 | 49094.30 | Yes | 0.000193
1411 | 53263.90 | Yes | 0.000484
1795 | 49643.50 | Yes | 0.002641
2095 | 49894.90 | Yes | 0.002763
1235 | 47621.60 | No | 0.002940
744 | 55924.00 | Yes | 0.002977
470 | 49092.30 | Yes | 0.003182
1606 | 51404.40 | Yes | 0.003448
1744 | 50957.70 | Yes | 0.004203
179 | 55109.70 | Yes | 0.005293
1387 | 45922.60 | Yes | 0.005709
1782 | 57268.40 | No | 0.005729
1006 | 55688.80 | Yes | 0.005796
278 | 51564.70 | No | 0.005966
1850 | 56188.20 | No | 0.006250
844 | 51766.00 | Yes | 0.006708
2028 | 52541.30 | No | 0.007767
1654 | 44980.00 | Yes | 0.008095
444 | 51932.60 | Yes | 0.009686
556 | 52973.00 | Yes | 0.009711
2449 | 45120.90 | Yes | 0.010595
13 | 51753.00 | Yes | 0.010923
2187 | 54391.80 | No | 0.011364
1633 | 50164.20 | No | 0.011603
22 | 52973.60 | No | 0.011729
1530 | 50241.30 | No | 0.013570
820 | 52793.90 | No | 0.013669
1258 | 50979.40 | Yes | 0.014042
2349 | 55860.90 | Yes | 0.014532
1698 | 57309.10 | No | 0.014539

Note: Rows 32–2501 are not shown.

As another example of selecting a random sample from an infinite population, consider the population of customers arriving at a fast-food restaurant. Suppose an employee is asked to select and interview a sample of customers in order to develop a profile of customers who visit the restaurant. The customer-arrival process is ongoing, and there is no way to obtain a list of all customers in the population. So, for practical purposes, the population for this ongoing process is considered infinite. As long as a sampling procedure is designed so that all the elements in the sample are customers of the restaurant and they are selected independently, a random sample will be obtained. In this case, the employee collecting the sample needs to select the sample from people who come into the restaurant and make a purchase to ensure that the same population condition is satisfied. If, for instance, the person selected for the sample is someone who came into the restaurant just to use the restroom, that person would not be a customer and the same population condition would be violated. So, as long as the interviewer selects the sample from people making a purchase at the restaurant, condition 1 is satisfied. Ensuring that the customers are selected independently can be more difficult.

The purpose of the second condition of the random sample selection procedure (each element is selected independently) is to prevent selection bias. In this case, selection bias would occur if the interviewer were free to select customers for the sample arbitrarily. The interviewer might feel more comfortable selecting customers in a particular age group and might avoid customers in other age groups. Selection bias would also occur if the interviewer selected a group of five customers who entered the restaurant together and asked all of them to participate in the sample. Such a group of customers would be likely to exhibit similar characteristics, which might provide misleading information about the population of customers. Selection bias such as this can be avoided by ensuring that the selection of a particular customer does not influence the selection of any other customer. In other words, the elements (customers) are selected independently.

McDonald’s, a fast-food restaurant chain, implemented a random sampling procedure for this situation. The sampling procedure was based on the fact that some customers presented discount coupons. Whenever a customer presented a discount coupon, the next customer served was asked to complete a customer profile questionnaire. Because arriving customers presented discount coupons randomly and independently of other customers, this sampling procedure ensured that customers were selected independently. As a result, the sample satisfied the requirements of a random sample from an infinite population.
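The coupon-triggered rule described above can be written as a short selection routine. This is only a sketch under stated assumptions: the function name coupon_triggered_sample, the 10% coupon rate, and the simulated stream of 200 customers are all hypothetical and are not taken from the McDonald's study.

```python
import random

def coupon_triggered_sample(presented_coupon):
    """Return the positions of customers asked to complete the questionnaire.

    presented_coupon[i] is True when customer i presented a discount coupon.
    The customer served immediately after a coupon holder is selected.
    """
    selected = []
    for i in range(1, len(presented_coupon)):
        if presented_coupon[i - 1]:
            selected.append(i)
    return selected

rng = random.Random(7)
# Hypothetical arrival stream: each customer independently presents a coupon
# with probability 0.10 (an assumed rate, for illustration only).
arrivals = [rng.random() < 0.10 for _ in range(200)]
chosen = coupon_triggered_sample(arrivals)
print(len(chosen), "customers selected:", chosen)
```

Because the interviewer has no say in who follows a coupon holder, the rule removes the discretion that leads to selection bias.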

Situations involving sampling from an infinite population are usually associated with a process that operates over time. Examples include parts being manufactured on a production line, repeated experimental trials in a laboratory, transactions occurring at a bank, telephone calls arriving at a technical support center, and customers entering a retail store. In each case, the situation may be viewed as a process that generates elements from an infinite population. As long as the sampled elements are selected from the same population and are selected independently, the sample is considered a random sample from an infinite population.

NOTES + COMMENTS

1. In this section we have been careful to define two types of samples: a simple random sample from a finite population and a random sample from an infinite population. In the remainder of the text, we will generally refer to both of these as either a random sample or simply a sample. We will not make a distinction of the sample being a “simple” random sample unless it is necessary for the exercise or discussion.

2. Statisticians who specialize in sample surveys from finite populations use sampling methods that provide probability samples. With a probability sample, each possible sample has a known probability of selection and a random process is used to select the elements for the sample. Simple random sampling is one of these methods. We use the term simple in simple random sampling to clarify that this is the probability sampling method that ensures that each sample of size n has the same probability of being selected.

3. The number of different simple random samples of size n that can be selected from a finite population of size N is:

\[ \frac{N!}{n!\,(N-n)!} \]

In this formula, N! and n! are the factorial formulas. For the EAI problem with \(N = 2{,}500\) and \(n = 30\), this expression can be used to show that approximately \(2.75 \times 10^{69}\) different simple random samples of 30 EAI employees can be obtained.

4. In addition to simple random sampling, other probability sampling methods include the following:

• Stratified random sampling: a method in which the population is first divided into homogeneous subgroups or strata and then a simple random sample is taken from each stratum.
• Cluster sampling: a method in which the population is first divided into heterogeneous subgroups or clusters and then simple random samples are taken from some or all of the clusters.
• Systematic sampling: a method in which we sort the population based on an important characteristic, randomly select one of the first k elements of the population, and then select every kth element from the population thereafter.

Calculation of sample statistics such as the sample mean \(\bar{x}\), the sample standard deviation \(s\), the sample proportion \(\bar{p}\), and so on differs depending on which method of probability sampling is used. See specialized books on sampling such as Elementary Survey Sampling (2011) by Scheaffer, Mendenhall, and Ott for more information.

5. Nonprobability sampling methods include the following:

• Convenience sampling: a method in which sample elements are selected on the basis of accessibility.
• Judgment sampling: a method in which sample elements are selected based on the opinion of the person doing the study.

Although nonprobability samples have the advantages of relatively easy sample selection and data collection, no statistically justified procedure allows a probability analysis or inference about the quality of nonprobability sample results. Statistical methods designed for probability samples should not be applied to a nonprobability sample, and we should be cautious in interpreting the results when a nonprobability sample is used to make inferences about a population.

6.2 Point Estimation

Now that we have described how to select a simple random sample, let us return to the EAI problem. A simple random sample of 30 employees and the corresponding data on annual salary and management training program participation are as shown in Table 6.1. The notation \(x_1\), \(x_2\), and so on is used to denote the annual salary of the first employee in the sample, the annual salary of the second employee in the sample, and so on. Participation in the management training program is indicated by Yes in the management training program column.

To estimate the value of a population parameter, we compute a corresponding characteristic of the sample, referred to as a sample statistic. For example, to estimate the population mean \(\mu\) and the population standard deviation \(\sigma\) for the annual salary of EAI employees, we use the data in Table 6.1 to calculate the corresponding sample statistics: the sample mean \(\bar{x}\) and the sample standard deviation \(s\). The sample mean is

\[ \bar{x} = \frac{\sum x_i}{n} = \frac{1{,}554{,}420}{30} = \$51{,}814 \]

and the sample standard deviation is

\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{325{,}009{,}260}{29}} = \$3{,}348 \]

To estimate \(p\), the proportion of employees in the population who completed the management training program, we use the corresponding sample proportion \(\bar{p}\). Let \(x\) denote the number of employees in the sample who completed the management training program. The data in Table 6.1 show that \(x = 19\). Thus, with a sample size of \(n = 30\), the sample proportion is

\[ \bar{p} = \frac{x}{n} = \frac{19}{30} = 0.63 \]

Chapter 2 discusses the computation of the mean and standard deviation of a sample.


By making the preceding computations, we perform the statistical procedure called point estimation. We refer to the sample mean \(\bar{x}\) as the point estimator of the population mean \(\mu\), the sample standard deviation \(s\) as the point estimator of the population standard deviation \(\sigma\), and the sample proportion \(\bar{p}\) as the point estimator of the population proportion \(p\). The numerical value obtained for \(\bar{x}\), \(s\), or \(\bar{p}\) is called the point estimate. Thus, for the simple random sample of 30 EAI employees shown in Table 6.1, $51,814 is the point estimate of \(\mu\), $3,348 is the point estimate of \(\sigma\), and 0.63 is the point estimate of \(p\). Table 6.2 summarizes the sample results and compares the point estimates to the actual values of the population parameters.
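The three point estimates can be reproduced from the 30 sample observations. The sketch below uses Python's standard statistics module rather than the Excel tools the text relies on. The variable names are illustrative, and the third salary is entered as 49,643.50, the value shown in Figure 6.1 and the one that reproduces the sum of 1,554,420 used in the text.

```python
import statistics

# Annual salaries of the 30 sampled EAI employees, in sample order.
salaries = [
    49094.30, 53263.90, 49643.50, 49894.90, 47621.60, 55924.00,
    49092.30, 51404.40, 50957.70, 55109.70, 45922.60, 57268.40,
    55688.80, 51564.70, 56188.20, 51766.00, 52541.30, 44980.00,
    51932.60, 52973.00, 45120.90, 51753.00, 54391.80, 50164.20,
    52973.60, 50241.30, 52793.90, 50979.40, 55860.90, 57309.10,
]
# Training program status in the same order (Y = Yes, N = No).
training = "YYYYNYYYYYYNYNNYNYYYYYNNNNNYYN"

x_bar = statistics.mean(salaries)             # point estimate of the population mean
s = statistics.stdev(salaries)                # sample standard deviation (n - 1 divisor)
p_bar = training.count("Y") / len(training)   # point estimate of the population proportion

print(f"x_bar = {x_bar:,.0f}, s = {s:,.0f}, p_bar = {p_bar:.2f}")
```

statistics.stdev divides by n - 1, which matches the sample standard deviation formula used in this section; statistics.pstdev would divide by n and give a slightly smaller value.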

As is evident from Table 6.2, the point estimates differ somewhat from the values of corresponding population parameters. This difference is to be expected because a sample, and not a census of the entire population, is being used to develop the point estimates. In Chapter 7, we will show how to construct an interval estimate in order to provide information about how close the point estimate is to the population parameter.

TABLE 6.1  Annual Salary and Training Program Status for a Simple Random Sample of 30 EAI Employees

Annual Salary ($) | Management Training Program | Annual Salary ($) | Management Training Program
x_1 = 49,094.30 | Yes | x_16 = 51,766.00 | Yes
x_2 = 53,263.90 | Yes | x_17 = 52,541.30 | No
x_3 = 49,643.50 | Yes | x_18 = 44,980.00 | Yes
x_4 = 49,894.90 | Yes | x_19 = 51,932.60 | Yes
x_5 = 47,621.60 | No | x_20 = 52,973.00 | Yes
x_6 = 55,924.00 | Yes | x_21 = 45,120.90 | Yes
x_7 = 49,092.30 | Yes | x_22 = 51,753.00 | Yes
x_8 = 51,404.40 | Yes | x_23 = 54,391.80 | No
x_9 = 50,957.70 | Yes | x_24 = 50,164.20 | No
x_10 = 55,109.70 | Yes | x_25 = 52,973.60 | No
x_11 = 45,922.60 | Yes | x_26 = 50,241.30 | No
x_12 = 57,268.40 | No | x_27 = 52,793.90 | No
x_13 = 55,688.80 | Yes | x_28 = 50,979.40 | Yes
x_14 = 51,564.70 | No | x_29 = 55,860.90 | Yes
x_15 = 56,188.20 | No | x_30 = 57,309.10 | No

TABLE 6.2  Summary of Point Estimates Obtained from a Simple Random Sample of 30 EAI Employees

Population Parameter | Parameter Value | Point Estimator | Point Estimate
\(\mu\) = Population mean annual salary | $51,800 | \(\bar{x}\) = Sample mean annual salary | $51,814
\(\sigma\) = Population standard deviation for annual salary | $4,000 | \(s\) = Sample standard deviation for annual salary | $3,348
\(p\) = Population proportion completing the management training program | 0.60 | \(\bar{p}\) = Sample proportion having completed the management training program | 0.63

Practical Advice

The subject matter of most of the rest of the book is concerned with statistical inference, of which point estimation is a form. We use a sample statistic to make an inference about a population parameter. When making inferences about a population based on a sample, it is important to have a close correspondence between the sampled population and the target population. The target population is the population about which we want to make inferences, while the sampled population is the population from which the sample is actually taken. In this section, we have described the process of drawing a simple random sample from the population of EAI employees and making point estimates of characteristics of that same population. So the sampled population and the target population are identical, which is the desired situation. But in other cases, it is not as easy to obtain a close correspondence between the sampled and target populations.

Consider the case of an amusement park selecting a sample of its customers to learn about characteristics such as age and time spent at the park. Suppose all the sample elements were selected on a day when park attendance was restricted to employees of a large company. Then the sampled population would be composed of employees of that company and members of their families. If the target population we wanted to make inferences about were typical park customers over a typical summer, then we might encounter a significant difference between the sampled population and the target population. In such a case, we would question the validity of the point estimates being made. Park management would be in the best position to know whether a sample taken on a particular day was likely to be representative of the target population.

In summary, whenever a sample is used to make inferences about a population, we should make sure that the study is designed so that the sampled population and the target population are in close agreement. Good judgment is a necessary ingredient of sound statistical practice.

6.3 Sampling Distributions

In the preceding section we said that the sample mean \(\bar{x}\) is the point estimator of the population mean \(\mu\), and the sample proportion \(\bar{p}\) is the point estimator of the population proportion \(p\). For the simple random sample of 30 EAI employees shown in Table 6.1, the point estimate of \(\mu\) is \(\bar{x} = \$51{,}814\) and the point estimate of \(p\) is \(\bar{p} = 0.63\). Suppose we select another simple random sample of 30 EAI employees and obtain the following point estimates:

Sample mean: $52,670

Sample proportion: 0.70

Note that different values of \(\bar{x}\) and \(\bar{p}\) were obtained. Indeed, a second simple random sample of 30 EAI employees cannot be expected to provide the same point estimates as the first sample.

Now, suppose we repeat the process of selecting a simple random sample of 30 EAI employees over and over again, each time computing the values of \(\bar{x}\) and \(\bar{p}\). Table 6.3 contains a portion of the results obtained for 500 simple random samples, and Table 6.4 shows the frequency and relative frequency distributions for the 500 values of \(\bar{x}\). Figure 6.2 shows the relative frequency histogram for the \(\bar{x}\) values.

A random variable is a quantity whose values are not known with certainty. Because the sample mean \(\bar{x}\) is a quantity whose values are not known with certainty, the sample mean \(\bar{x}\) is a random variable. As a result, just like other random variables, \(\bar{x}\) has a mean or expected value, a standard deviation, and a probability distribution. Because the various possible values of \(\bar{x}\) are the result of different simple random samples, the probability distribution of \(\bar{x}\) is called the sampling distribution of \(\bar{x}\). Knowledge of this sampling distribution and its properties will enable us to make probability statements about how close the sample mean \(\bar{x}\) is to the population mean \(\mu\).

TABLE 6.3  Values of \(\bar{x}\) and \(\bar{p}\) from 500 Simple Random Samples of 30 EAI Employees

Sample Number | Sample Mean (\(\bar{x}\)) | Sample Proportion (\(\bar{p}\))
1 | 51,814 | 0.63
2 | 52,670 | 0.70
3 | 51,780 | 0.67
4 | 51,588 | 0.53
· | · | ·
500 | 51,752 | 0.50

TABLE 6.4  Frequency and Relative Frequency Distributions of \(\bar{x}\) from 500 Simple Random Samples of 30 EAI Employees

Mean Annual Salary ($) | Frequency | Relative Frequency
49,500.00–49,999.99 | 2 | 0.004
50,000.00–50,499.99 | 16 | 0.032
50,500.00–50,999.99 | 52 | 0.104
51,000.00–51,499.99 | 101 | 0.202
51,500.00–51,999.99 | 133 | 0.266
52,000.00–52,499.99 | 110 | 0.220
52,500.00–52,999.99 | 54 | 0.108
53,000.00–53,499.99 | 26 | 0.052
53,500.00–53,999.99 | 6 | 0.012
Totals: | 500 | 1.000

Let us return to Figure 6.2. We would need to enumerate every possible sample of 30 employees and compute each sample mean to completely determine the sampling distribution of \(\bar{x}\). However, the histogram of 500 values of \(\bar{x}\) gives an approximation of this sampling distribution. From the approximation we observe the bell-shaped appearance of the distribution. We note that the largest concentration of the \(\bar{x}\) values and the mean of the 500 values of \(\bar{x}\) are near the population mean \(\mu = \$51{,}800\). We will describe the properties of the sampling distribution of \(\bar{x}\) more fully in the next section.

The 500 values of the sample proportion \(\bar{p}\) are summarized by the relative frequency histogram in Figure 6.3. As in the case of \(\bar{x}\), \(\bar{p}\) is a random variable. If every possible sample of size 30 were selected from the population and if a value of \(\bar{p}\) were computed for each sample, the resulting probability distribution would be the sampling distribution of \(\bar{p}\). The relative frequency histogram of the 500 sample values in Figure 6.3 provides a general idea of the appearance of the sampling distribution of \(\bar{p}\).

Chapter 2 introduces the concept of a random variable and Chapter 5 discusses properties of random variables and their relationship to probability concepts.

The ability to understand the material in subsequent sections of this chapter depends heavily on the ability to understand and use the sampling distributions presented in this section.

FIGURE 6.2  Relative Frequency Histogram of \(\bar{x}\) Values from 500 Simple Random Samples of Size 30 Each (horizontal axis: values of \(\bar{x}\), 50,000 to 54,000; vertical axis: relative frequency, 0.05 to 0.30)

FIGURE 6.3  Relative Frequency Histogram of \(\bar{p}\) Values from 500 Simple Random Samples of Size 30 Each (horizontal axis: values of \(\bar{p}\), 0.32 to 0.88; vertical axis: relative frequency, 0.05 to 0.40)

In practice, we select only one simple random sample from the population. We repeated the sampling process 500 times in this section simply to illustrate that many different samples are possible and that the different samples generate a variety of values for the sample statistics \(\bar{x}\) and \(\bar{p}\). The probability distribution of any particular sample statistic is called the sampling distribution of the statistic. Next we discuss the characteristics of the sampling distributions of \(\bar{x}\) and \(\bar{p}\).
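The repeated-sampling experiment can be imitated with a short simulation. The EAI population data are not reproduced in this section, so the sketch below makes a loud assumption: it builds a synthetic population of 2,500 salaries drawn from a normal distribution with mean 51,800 and standard deviation 4,000. The seed and variable names are also illustrative, so the output will not match Tables 6.3 and 6.4.

```python
import random
import statistics

rng = random.Random(42)

# ASSUMPTION: synthetic stand-in for the 2,500 EAI salaries.
population = [rng.gauss(51_800, 4_000) for _ in range(2_500)]
mu = statistics.fmean(population)

# Draw 500 simple random samples of size 30 (without replacement)
# and record the sample mean of each.
sample_means = []
for _ in range(500):
    sample = rng.sample(population, 30)
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

print(f"population mean      = {mu:,.0f}")
print(f"mean of sample means = {mean_of_means:,.0f}")
print(f"std. dev. of means   = {sd_of_means:,.0f}")
```

The 500 sample means cluster around the population mean, and their spread is far smaller than the spread of individual salaries, which is the behavior the histograms in Figures 6.2 and 6.3 illustrate.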

Sampling Distribution of \(\bar{x}\)

In the previous section we said that the sample mean \(\bar{x}\) is a random variable and that its probability distribution is called the sampling distribution of \(\bar{x}\).

SAMPLING DISTRIBUTION OF \(\bar{x}\)

The sampling distribution of \(\bar{x}\) is the probability distribution of all possible values of the sample mean \(\bar{x}\).

This section describes the properties of the sampling distribution of \(\bar{x}\). Just as with other probability distributions we studied, the sampling distribution of \(\bar{x}\) has an expected value or mean, a standard deviation, and a characteristic shape or form. Let us begin by considering the mean of all possible \(\bar{x}\) values, which is referred to as the expected value of \(\bar{x}\).

Expected Value of \(\bar{x}\). In the EAI sampling problem we saw that different simple random samples result in a variety of values for the sample mean \(\bar{x}\). Because many different values of the random variable \(\bar{x}\) are possible, we are often interested in the mean of all possible values of \(\bar{x}\) that can be generated by the various simple random samples. The mean of the \(\bar{x}\) random variable is the expected value of \(\bar{x}\). Let \(E(\bar{x})\) represent the expected value of \(\bar{x}\) and \(\mu\) represent the mean of the population from which we are selecting a simple random sample. It can be shown that with simple random sampling, \(E(\bar{x})\) and \(\mu\) are equal.

The expected value of \(\bar{x}\) equals the mean of the population from which the sample is selected.

EXPECTED VALUE OF \(\bar{x}\)

\[ E(\bar{x}) = \mu \tag{6.1} \]

where

\(E(\bar{x})\) = the expected value of \(\bar{x}\)
\(\mu\) = the population mean

This result states that with simple random sampling, the expected value or mean of the sampling distribution of \(\bar{x}\) is equal to the mean of the population. In Section 6.1 we saw that the mean annual salary for the population of EAI employees is \(\mu = \$51{,}800\). Thus, according to equation (6.1), if we considered all possible samples of size n from the population of EAI employees, the mean of all the corresponding sample means for the EAI study would be equal to $51,800, the population mean.

When the expected value of a point estimator equals the population parameter, we say the point estimator is unbiased. Thus, equation (6.1) states that \(\bar{x}\) is an unbiased estimator of the population mean \(\mu\).

Standard Deviation of \(\bar{x}\). Let us define the standard deviation of the sampling distribution of \(\bar{x}\). We will use the following notation:

\(\sigma_{\bar{x}}\) = the standard deviation of \(\bar{x}\), or the standard error of the mean
\(\sigma\) = the standard deviation of the population
\(n\) = the sample size
\(N\) = the population size

The term standard error is used in statistical inference to refer to the standard deviation of a point estimator.

It can be shown that the formula for the standard deviation of \(\bar{x}\) depends on whether the population is finite or infinite. The two formulas for the standard deviation of \(\bar{x}\) follow.

STANDARD DEVIATION OF \(\bar{x}\)

\[
\text{Finite Population: } \sigma_{\bar{x}} = \sqrt{\frac{N-n}{N-1}} \left( \frac{\sigma}{\sqrt{n}} \right)
\qquad
\text{Infinite Population: } \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
\tag{6.2}
\]

In comparing the two formulas in equation (6.2), we see that the factor \(\sqrt{(N-n)/(N-1)}\) is required for the finite population case but not for the infinite population case. This factor is commonly referred to as the finite population correction factor. In many practical sampling situations, we find that the population involved, although finite, is large relative to the sample size. In such cases the finite population correction factor \(\sqrt{(N-n)/(N-1)}\) is close to 1. As a result, the difference between the values of the standard deviation of \(\bar{x}\) for the finite and infinite populations becomes negligible. Then, \(\sigma_{\bar{x}} = \sigma/\sqrt{n}\) becomes a good approximation to the standard deviation of \(\bar{x}\) even though the population is finite. In cases where \(n/N > 0.05\), the finite population version of equation (6.2) should be used in the computation of \(\sigma_{\bar{x}}\). Unless otherwise noted, throughout the text we will assume that the population size is large relative to the sample size, i.e., \(n/N \le 0.05\).

Observe from equation (6.2) that we need to know \(\sigma\), the standard deviation of the population, in order to compute \(\sigma_{\bar{x}}\). That is, the sample-to-sample variability in the point estimator \(\bar{x}\), as measured by the standard error \(\sigma_{\bar{x}}\), depends on the standard deviation of the population from which the sample is drawn. However, when we are sampling to estimate the population mean with \(\bar{x}\), usually the population standard deviation is also unknown. Therefore, we need to estimate the standard deviation of \(\bar{x}\) with \(s_{\bar{x}}\) using the sample standard deviation \(s\), as shown in equation (6.3).

ESTIMATED STANDARD DEVIATION OF \(\bar{x}\)

\[
\text{Finite Population: } s_{\bar{x}} = \sqrt{\frac{N-n}{N-1}} \left( \frac{s}{\sqrt{n}} \right)
\qquad
\text{Infinite Population: } s_{\bar{x}} = \frac{s}{\sqrt{n}}
\tag{6.3}
\]

Let us now return to the EAI example and compute the estimated standard error (standard deviation) of the mean associated with simple random samples of 30 EAI employees. Recall from Table 6.2 that the standard deviation of the sample of 30 EAI employees is \(s = 3{,}348\). In this case, the population is finite (\(N = 2{,}500\)), but because \(n/N = 30/2{,}500 = 0.012 < 0.05\), we can ignore the finite population correction factor and compute the estimated standard error as

\[ s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{3{,}348}{\sqrt{30}} = 611.3 \]

In this case, we happen to know that the standard deviation of the population is actually \(\sigma = 4{,}000\), so the true standard error is

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4{,}000}{\sqrt{30}} = 730.3 \]

The difference between \(s_{\bar{x}}\) and \(\sigma_{\bar{x}}\) is due to sampling error, or the error that results from observing a sample of 30 rather than the entire population of 2,500.
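The standard error calculations, with and without the finite population correction factor, can be checked with a few lines of Python. This is a sketch only: the function name standard_error and its optional population_size argument are assumptions introduced here. The inputs (3,348, 4,000, n = 30, N = 2,500) are the EAI values from the text.

```python
import math

def standard_error(std_dev, n, population_size=None):
    """Standard error of the mean: std_dev / sqrt(n).

    If population_size is given, the finite population correction
    factor sqrt((N - n) / (N - 1)) is applied.
    """
    se = std_dev / math.sqrt(n)
    if population_size is not None:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se

estimated_se = standard_error(3348, 30)                        # uses the sample s
true_se = standard_error(4000, 30)                             # uses the population sigma
true_se_fpc = standard_error(4000, 30, population_size=2500)   # with the correction factor

print(f"estimated standard error   = {estimated_se:.1f}")
print(f"true standard error        = {true_se:.1f}")
print(f"true standard error (fpc)  = {true_se_fpc:.1f}")
```

With n/N = 0.012 the correction factor is about 0.994, so applying it changes the true standard error by only a few dollars, which is why the text ignores it here.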

Form of the Sampling Distribution of \(\bar{x}\). The preceding results concerning the expected value and standard deviation for the sampling distribution of \(\bar{x}\) are applicable for any population. The final step in identifying the characteristics of the sampling distribution of \(\bar{x}\) is to determine the form or shape of the sampling distribution. We will consider two cases: (1) The population has a normal distribution; and (2) the population does not have a normal distribution.

Population Has a Normal Distribution. In many situations it is reasonable to assume that the population from which we are selecting a random sample has a normal, or nearly normal, distribution. When the population has a normal distribution, the sampling distribution of \(\bar{x}\) is normally distributed for any sample size.

Population Does Not Have a Normal Distribution. When the population from which we are selecting a random sample does not have a normal distribution, the central limit theorem is helpful in identifying the shape of the sampling distribution of \(\bar{x}\). A statement of the central limit theorem as it applies to the sampling distribution of \(\bar{x}\) follows.

CENTRAL LIMIT THEOREM

In selecting random samples of size n from a population, the sampling distribution of the sample mean \(\bar{x}\) can be approximated by a normal distribution as the sample size becomes large.

Figure 6.4 shows how the central limit theorem works for three different populations; each column refers to one of the populations. The top panel of the figure shows that none of the populations are normally distributed. Population I follows a uniform distribution. Population II is often called the rabbit-eared distribution. It is symmetric, but the more likely values fall in the tails of the distribution. Population III is shaped like the exponential distribution; it is skewed to the right.

The bottom three panels of Figure 6.4 show the shape of the sampling distribution for samples of size \(n = 2\), \(n = 5\), and \(n = 30\). When the sample size is 2, we see that the shape of each sampling distribution is different from the shape of the corresponding population distribution. For samples of size 5, we see that the shapes of the sampling distributions for populations I and II begin to look similar to the shape of a normal distribution. Even though the shape of the sampling distribution for population III begins to look similar to the shape of a normal distribution, some skewness to the right is still present. Finally, for a sample size of 30, the shape of each of the three sampling distributions is approximately normal.

From a practitioner’s standpoint, we often want to know how large the sample size needs to be before the central limit theorem applies and we can assume that the shape of the sampling distribution is approximately normal. Statistical researchers have investigated this question by studying the sampling distribution of \(\bar{x}\) for a variety of populations and a variety of sample sizes. General statistical practice is to assume that, for most applications, the sampling distribution of \(\bar{x}\) can be approximated by a normal distribution whenever the sample size is 30 or more. In cases in which the population is highly skewed or outliers are present, sample sizes of 50 may be needed.
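A small simulation shows the effect the central limit theorem describes for a right-skewed population like Population III. The sketch below assumes an exponential population with mean 1 and measures the skewness of 5,000 simulated sample means for each sample size. The skewness helper, the seed, and the number of repetitions are illustrative choices, not values from the text.

```python
import random
import statistics

def skewness(values):
    """Moment-based skewness: average cubed standardized deviation."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)

def simulated_sample_means(n, reps, rng):
    """Means of `reps` random samples of size n from an exponential population."""
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

rng = random.Random(2024)
skew_by_n = {}
for n in (2, 5, 30):
    means = simulated_sample_means(n, 5_000, rng)
    skew_by_n[n] = skewness(means)
    print(f"n = {n:2d}: skewness of sample means = {skew_by_n[n]:.2f}")
```

A normal distribution has skewness 0, so the drop in skewness as n grows from 2 to 30 reflects the sampling distribution moving toward a normal shape, with some right skew still visible at the smaller sample sizes.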

Sampling Distribution of \(\bar{x}\) for the EAI Problem. Let us return to the EAI problem where we previously showed that \(E(\bar{x}) = \$51{,}800\) and \(\sigma_{\bar{x}} = 730.3\). At this point, we do not have any information about the population distribution; it may or may not be normally distributed. If the population has a normal distribution, the sampling distribution of \(\bar{x}\) is normally distributed. If the population does not have a normal distribution, the simple random sample of 30 employees and the central limit theorem enable us to conclude that the sampling distribution of \(\bar{x}\) can be approximated by a normal distribution. In either case, we are comfortable proceeding with the conclusion that the sampling distribution of \(\bar{x}\) can be described by the normal distribution shown in Figure 6.5. In other words, Figure 6.5 illustrates the distribution of the sample means corresponding to all possible samples of size 30 for the EAI study.
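Treating the sampling distribution as normal with mean 51,800 and standard error 730.3 makes it possible to attach probabilities to statements about the sample mean. The sketch below uses Python's statistics.NormalDist. The $500 tolerance is a hypothetical value chosen here for illustration and does not come from this section.

```python
from statistics import NormalDist

# Normal approximation to the sampling distribution of the sample mean
# for samples of 30 EAI employees (values from the text).
sampling_dist = NormalDist(mu=51_800, sigma=730.3)

tolerance = 500  # hypothetical: how close we want the sample mean to be
prob_within = (
    sampling_dist.cdf(51_800 + tolerance) - sampling_dist.cdf(51_800 - tolerance)
)

print(f"P(sample mean within ${tolerance} of the population mean) = {prob_within:.3f}")
```

Under these assumptions, roughly half of all possible samples of size 30 would give a sample mean within $500 of the population mean.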


Relationship between the Sample Size and the Sampling Distribution of \(\bar{x}\)

Suppose that in the EAI sampling problem we select a simple random sample of 100 EAI employees instead of the 30 originally considered. Intuitively, it would seem that because the larger sample size provides more data, the sample mean based on \(n = 100\) would provide a better estimate of the population mean than the sample mean based on \(n = 30\). To see how much better, let us consider the relationship between the sample size and the sampling distribution of \(\bar{x}\).

FIGURE 6.4 Illustration of the Central Limit Theorem for Three Populations
[Figure: for each of Populations I, II, and III, the population distribution (values of \(x\)) is shown above the sampling distributions of \(\bar{x}\) for \(n = 2\), \(n = 5\), and \(n = 30\) (values of \(\bar{x}\)).]

First, note that \(E(\bar{x}) = \mu\) regardless of the sample size. Thus, the mean of all possible values of \(\bar{x}\) is equal to the population mean \(\mu\) regardless of the sample size \(n\). However, note that the standard error of the mean, \(\sigma_{\bar{x}} = \sigma/\sqrt{n}\), is related to the square root of the sample size. Whenever the sample size is increased, the standard error of the mean \(\sigma_{\bar{x}}\) decreases. With \(n = 30\), the standard error of the mean for the EAI problem is 730.3. However, with the increase in the sample size to \(n = 100\), the standard error of the mean is decreased to

\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{4{,}000}{\sqrt{100}} = 400 \]

The sampling distributions of \(\bar{x}\) with \(n = 30\) and \(n = 100\) are shown in Figure 6.6. Because the sampling distribution with \(n = 100\) has a smaller standard error, the values of \(\bar{x}\) with \(n = 100\) have less variation and tend to be closer to the population mean than the values of \(\bar{x}\) with \(n = 30\).
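The two standard errors can be checked with a few lines of Python. This is a minimal sketch using the EAI values from the text (\(\sigma = 4{,}000\), \(\mu = 51{,}800\)); the function name `standard_error_of_mean` and the distance of 500 used in the probability comparison are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def standard_error_of_mean(sigma, n):
    """Standard error of the mean for a large (effectively infinite) population."""
    return sigma / math.sqrt(n)

SIGMA = 4000      # population standard deviation used in the EAI example
DISTANCE = 500    # illustrative distance from the population mean (assumption)

within = {}
for n in (30, 100):
    se = standard_error_of_mean(SIGMA, n)
    # Probability that the sample mean falls within DISTANCE of the population
    # mean, assuming the sampling distribution is normal with std. dev. se.
    dist = NormalDist(mu=0, sigma=se)
    within[n] = dist.cdf(DISTANCE) - dist.cdf(-DISTANCE)
    print(f"n={n:3d}  standard error={se:.1f}  "
          f"P(within {DISTANCE} of the mean)={within[n]:.4f}")
```

The larger sample gives the smaller standard error and therefore the higher probability that the sample mean lands within the chosen distance of the population mean.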

The sampling distribution in Figure 6.5 is a theoretical construct, as typically the population mean and the population standard deviation are not known. Instead, we must estimate these parameters with the sample mean and the sample standard deviation, respectively.

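FIGURE 6.5 Sampling Distribution of \(\bar{x}\) for the Mean Annual Salary of a Simple Random Sample of 30 EAI Employees
[Figure: normal curve for the sampling distribution of \(\bar{x}\), centered at \(E(\bar{x}) = 51{,}800\), with \(\sigma_{\bar{x}} = \sigma/\sqrt{n} = 4{,}000/\sqrt{30} = 730.3\).]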

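FIGURE 6.6 A Comparison of the Sampling Distributions of \(\bar{x}\) for Simple Random Samples of \(n = 30\) and \(n = 100\) EAI Employees
[Figure: two normal curves centered at 51,800; with \(n = 30\), \(\sigma_{\bar{x}} = 730.3\); with \(n = 100\), \(\sigma_{\bar{x}} = 400\).]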


The important point in this discussion is that as the sample size increases, the standard error of the mean decreases. As a result, a larger sample size will provide a higher probability that the sample mean falls within a specified distance of the population mean. The practical reason we are interested in the sampling distribution of \(\bar{x}\) is that it can be used to provide information about how close the sample mean is to the population mean. The concepts of interval estimation and hypothesis testing discussed in Sections 6.4 and 6.5 rely on the properties of sampling distributions.

Sampling Distribution of \(\bar{p}\)

The sample proportion \(\bar{p}\) is the point estimator of the population proportion \(p\). The formula for computing the sample proportion is

\[ \bar{p} = \frac{x}{n} \]

where

x = the number of elements in the sample that possess the characteristic of interest
n = sample size

As previously noted in this section, the sample proportion \(\bar{p}\) is a random variable and its probability distribution is called the sampling distribution of \(\bar{p}\).

SAMPLING DISTRIBUTION OF \(\bar{p}\)
The sampling distribution of \(\bar{p}\) is the probability distribution of all possible values of the sample proportion \(\bar{p}\).

To determine how close the sample proportion \(\bar{p}\) is to the population proportion \(p\), we need to understand the properties of the sampling distribution of \(\bar{p}\): the expected value of \(\bar{p}\), the standard deviation of \(\bar{p}\), and the shape or form of the sampling distribution of \(\bar{p}\).

Expected Value of \(\bar{p}\)

The expected value of \(\bar{p}\), the mean of all possible values of \(\bar{p}\), is equal to the population proportion \(p\).

EXPECTED VALUE OF \(\bar{p}\)

\[ E(\bar{p}) = p \tag{6.4} \]

where

E(\(\bar{p}\)) = the expected value of \(\bar{p}\)
p = the population proportion

Because \(E(\bar{p}) = p\), \(\bar{p}\) is an unbiased estimator of \(p\). In Section 6.1, we noted that \(p = 0.60\) for the EAI population, where \(p\) is the proportion of the population of employees who participated in the company’s management training program. Thus, the expected value of \(\bar{p}\) for the EAI sampling problem is 0.60. That is, if we considered the sample proportions corresponding to all possible samples of size \(n\) for the EAI study, the mean of these sample proportions would be 0.60.

Standard Deviation of \(\bar{p}\)

Just as we found for the standard deviation of \(\bar{x}\), the standard deviation of \(\bar{p}\) depends on whether the population is finite or infinite. The two formulas for computing the standard deviation of \(\bar{p}\) follow.

STANDARD DEVIATION OF \(\bar{p}\)

Finite Population:

\[ \sigma_{\bar{p}} = \sqrt{\frac{N-n}{N-1}} \sqrt{\frac{p(1-p)}{n}} \]

Infinite Population:

\[ \sigma_{\bar{p}} = \sqrt{\frac{p(1-p)}{n}} \tag{6.5} \]


Comparing the two formulas in equation (6.5), we see that the only difference is the use of the finite population correction factor \(\sqrt{(N-n)/(N-1)}\).

As was the case with the sample mean \(\bar{x}\), the difference between the expressions for the finite population and the infinite population becomes negligible if the size of the finite population is large in comparison to the sample size. We follow the same rule of thumb that we recommended for the sample mean. That is, if the population is finite with \(n/N \le 0.05\), we will use \(\sigma_{\bar{p}} = \sqrt{p(1-p)/n}\). However, if the population is finite with \(n/N > 0.05\), the finite population correction factor should be used. Again, unless specifically noted, throughout the text we will assume that the population size is large in relation to the sample size and thus the finite population correction factor is unnecessary.

Earlier in this section, we used the term standard error of the mean to refer to the standard deviation of \(\bar{x}\). We stated that in general the term standard error refers to the standard deviation of a point estimator. Thus, for proportions we use standard error of the proportion to refer to the standard deviation of \(\bar{p}\). From equation (6.5), we observe that the sample-to-sample variability in the point estimator \(\bar{p}\), as measured by the standard error \(\sigma_{\bar{p}}\), depends on the population proportion \(p\). However, when we are sampling to compute \(\bar{p}\), typically the population proportion is unknown. Therefore, we need to estimate the standard deviation of \(\bar{p}\) with \(s_{\bar{p}}\) using the sample proportion as shown in equation (6.6).

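ESTIMATED STANDARD DEVIATION OF \(\bar{p}\)

Finite Population:

\[ s_{\bar{p}} = \sqrt{\frac{N-n}{N-1}} \sqrt{\frac{\bar{p}(1-\bar{p})}{n}} \]

Infinite Population:

\[ s_{\bar{p}} = \sqrt{\frac{\bar{p}(1-\bar{p})}{n}} \tag{6.6} \]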

Let us now return to the EAI example and compute the estimated standard error (standard deviation) of the proportion associated with simple random samples of 30 EAI employees. Recall from Table 6.2 that the sample proportion of EAI employees who completed the management training program is \(\bar{p} = 0.63\). Because \(n/N = 30/2{,}500 = 0.012 < 0.05\), we can ignore the finite population correction factor and compute the estimated standard error as

\[ s_{\bar{p}} = \sqrt{\frac{\bar{p}(1-\bar{p})}{n}} = \sqrt{\frac{0.63(1-0.63)}{30}} = 0.0881 \]

In the EAI example, we actually know that the population proportion is \(p = 0.6\), so we know that the true standard error is

\[ \sigma_{\bar{p}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.6(1-0.6)}{30}} = 0.0894 \]

The difference between \(s_{\bar{p}}\) and \(\sigma_{\bar{p}}\) is due to sampling error.
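The two calculations above, together with the \(n/N\) rule of thumb, can be sketched in Python as follows. The function name `proportion_standard_error` and its optional `population_size` argument are assumptions made for this illustration; the numerical inputs are the EAI values from the text.

```python
import math

def proportion_standard_error(p, n, population_size=None):
    """Standard error of a proportion, following equations (6.5) and (6.6).

    Pass the population proportion p for the true standard error, or the
    sample proportion for the estimated one. The finite population
    correction factor is applied only when n/N exceeds 0.05.
    """
    se = math.sqrt(p * (1 - p) / n)
    if population_size is not None and n / population_size > 0.05:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se

estimated = proportion_standard_error(0.63, 30, population_size=2500)
true_value = proportion_standard_error(0.60, 30, population_size=2500)
print(f"estimated standard error: {estimated:.4f}")
print(f"true standard error:      {true_value:.4f}")
```

Because \(30/2{,}500 = 0.012\), the correction factor is skipped in both calls, matching the hand calculation.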

Form of the Sampling Distribution of \(\bar{p}\)

Now that we know the mean and standard deviation of the sampling distribution of \(\bar{p}\), the final step is to determine the form or shape of the sampling distribution. The sample proportion is \(\bar{p} = x/n\). For a simple random sample from a large population, \(x\) is a binomial random variable indicating the number of elements in the sample with the characteristic of interest. Because \(n\) is a constant, the probability of \(x/n\) is the same as the binomial probability of \(x\), which means that the sampling distribution of \(\bar{p}\) is also a discrete probability distribution and that the probability for each value of \(x/n\) is the same as the binomial probability of the corresponding value of \(x\).

Statisticians have shown that a binomial distribution can be approximated by a normal distribution whenever the sample size is large enough to satisfy the following two conditions:

\[ np \ge 5 \quad \text{and} \quad n(1-p) \ge 5 \]


Assuming that these two conditions are satisfied, the probability distribution of \(x\) in the sample proportion, \(\bar{p} = x/n\), can be approximated by a normal distribution. And because \(n\) is a constant, the sampling distribution of \(\bar{p}\) can also be approximated by a normal distribution. This approximation is stated as follows:

The sampling distribution of \(\bar{p}\) can be approximated by a normal distribution whenever \(np \ge 5\) and \(n(1-p) \ge 5\).

Because the population proportion \(p\) is typically unknown in a study, the test to see whether the sampling distribution of \(\bar{p}\) can be approximated by a normal distribution is often based on the sample proportion, \(n\bar{p} \ge 5\) and \(n(1-\bar{p}) \ge 5\).

In practical applications, when an estimate of a population proportion is desired, we find that sample sizes are almost always large enough to permit the use of a normal approximation for the sampling distribution of \(\bar{p}\).

Recall that for the EAI sampling problem we know that the sample proportion of employees who participated in the training program is \(\bar{p} = 0.63\). With a simple random sample of size 30, we have \(n\bar{p} = 30(0.63) = 18.9\) and \(n(1-\bar{p}) = 30(0.37) = 11.1\). Thus, the sampling distribution of \(\bar{p}\) can be approximated by a normal distribution shown in Figure 6.7.
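The two-condition check is simple enough to express as a short function. The sketch below is only an illustration; the name `normal_approximation_ok` and the second, deliberately small-proportion call are assumptions added for contrast.

```python
def normal_approximation_ok(p, n, threshold=5):
    """True when both n*p and n*(1 - p) are at least `threshold`."""
    return n * p >= threshold and n * (1 - p) >= threshold

# EAI example from the text: sample proportion 0.63 with n = 30.
print(normal_approximation_ok(0.63, 30))   # n*p = 18.9 and n*(1 - p) = 11.1
# A hypothetical rare characteristic: n*p = 1.5 is too small.
print(normal_approximation_ok(0.05, 30))
```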

Relationship between Sample Size and the Sampling Distribution of \(\bar{p}\)

Suppose that in the EAI sampling problem we select a simple random sample of 100 EAI employees instead of the 30 originally considered. Intuitively, it would seem that because the larger sample size provides more data, the sample proportion based on \(n = 100\) would provide a better estimate of the population proportion than the sample proportion based on \(n = 30\). To see how much better, recall that the standard error of the proportion is 0.0894 when the sample size is \(n = 30\). If we increase the sample size to \(n = 100\), the standard error of the proportion becomes

\[ \sigma_{\bar{p}} = \sqrt{\frac{0.60(1-0.60)}{100}} = 0.0490 \]

As we observed with the standard deviation of the sampling distribution of \(\bar{x}\), increasing the sample size decreases the sample-to-sample variability of the sample proportion. As a result, a larger sample size will provide a higher probability that the sample proportion falls within a specified distance of the population proportion. The practical reason we are interested in the sampling distribution of \(\bar{p}\) is that it can be used to provide information about how close the sample proportion is to the population proportion. The concepts of interval estimation and hypothesis testing discussed in Sections 6.4 and 6.5 rely on the properties of sampling distributions.

The sampling distribution in Figure 6.7 is a theoretical construct, as typically the population proportion is not known. Instead, we must estimate it with the sample proportion.

FIGURE 6.7 Sampling Distribution of \(\bar{p}\) for the Proportion of EAI Employees Who Participated in the Management Training Program
[Figure: normal curve for the sampling distribution of \(\bar{p}\), centered at \(E(\bar{p}) = 0.60\), with \(\sigma_{\bar{p}} = 0.0894\).]

6.4 Interval Estimation

In Section 6.2, we stated that a point estimator is a sample statistic used to estimate a population parameter. For instance, the sample mean \(\bar{x}\) is a point estimator of the population mean \(\mu\) and the sample proportion \(\bar{p}\) is a point estimator of the population proportion \(p\). Because a point estimator cannot be expected to provide the exact value of the population parameter, interval estimation is frequently used to generate an estimate of the value of a population parameter. An interval estimate is often computed by adding and subtracting a value, called the margin of error, to the point estimate:

\[ \text{Point estimate} \pm \text{Margin of error} \]

The purpose of an interval estimate is to provide information about how close the point estimate, provided by the sample, is to the value of the population parameter. In this section we show how to compute interval estimates of a population mean \(\mu\) and a population proportion \(p\).

Interval Estimation of the Population Mean

The general form of an interval estimate of a population mean is

\[ \bar{x} \pm \text{Margin of error} \]

The sampling distribution of \(\bar{x}\) plays a key role in computing this interval estimate. In Section 6.3 we showed that the sampling distribution of \(\bar{x}\) has a mean equal to the population mean \((E(\bar{x}) = \mu)\) and a standard deviation equal to the population standard deviation divided by the square root of the sample size \((\sigma_{\bar{x}} = \sigma/\sqrt{n})\). We also showed that for a sufficiently large sample or for a sample taken from a normally distributed population, the sampling distribution of \(\bar{x}\) follows a normal distribution. These results for samples of 30 EAI employees are illustrated in Figure 6.5. Because the sampling distribution of \(\bar{x}\) shows how values of \(\bar{x}\) are distributed around the population mean \(\mu\), the sampling distribution of \(\bar{x}\) provides information about the possible differences between \(\bar{x}\) and \(\mu\).

For any normally distributed random variable, 90% of the values lie within 1.645 standard deviations of the mean, 95% of the values lie within 1.960 standard deviations of the mean, and 99% of the values lie within 2.576 standard deviations of the mean. Thus, when the sampling distribution of \(\bar{x}\) is normal, 90% of all values of \(\bar{x}\) must be within \(\pm 1.645\sigma_{\bar{x}}\) of the mean \(\mu\), 95% of all values of \(\bar{x}\) must be within \(\pm 1.96\sigma_{\bar{x}}\) of the mean \(\mu\), and 99% of all values of \(\bar{x}\) must be within \(\pm 2.576\sigma_{\bar{x}}\) of the mean \(\mu\).
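These three multipliers can be recovered from the standard normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `z_for_central_area` is an assumption introduced for the example.

```python
from statistics import NormalDist

def z_for_central_area(confidence):
    """z value such that the central `confidence` share of a normal
    distribution lies within z standard deviations of the mean."""
    upper_tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - upper_tail)

for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%} of values lie within "
          f"{z_for_central_area(confidence):.3f} standard deviations")
```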

Figure 6.8 shows what we would expect for values of sample means for 10 independent random samples when the sampling distribution of \(\bar{x}\) is normal. Because 90% of all values of \(\bar{x}\) are within \(\pm 1.645\sigma_{\bar{x}}\) of the mean \(\mu\), we expect 9 of the values of \(\bar{x}\) for these 10 samples to be within \(\pm 1.645\sigma_{\bar{x}}\) of the mean \(\mu\). If we repeat this process of collecting 10 samples, our results may not include 9 sample means with values that are within \(\pm 1.645\sigma_{\bar{x}}\) of the mean \(\mu\), but on average, the values of \(\bar{x}\) will be within \(\pm 1.645\sigma_{\bar{x}}\) of the mean \(\mu\) for 9 of every 10 samples.

We now want to use what we know about the sampling distribution of \(\bar{x}\) to develop an interval estimate of the population mean \(\mu\). However, when developing an interval estimate of a population mean \(\mu\), we generally do not know the population standard deviation \(\sigma\), and therefore, we do not know the standard error of \(\bar{x}\), \(\sigma_{\bar{x}} = \sigma/\sqrt{n}\). In this case, we must use the same sample data to estimate both \(\mu\) and \(\sigma\), so we use \(s_{\bar{x}} = s/\sqrt{n}\) to estimate the standard error of \(\bar{x}\). When we estimate \(\sigma_{\bar{x}}\) with \(s_{\bar{x}}\), we introduce an additional source of uncertainty about the distribution of values of \(\bar{x}\). If the sampling distribution of \(\bar{x}\) follows a normal distribution, we address this additional source of uncertainty by using a probability distribution known as the t distribution.

The t distribution is a family of similar probability distributions; the shape of each specific t distribution depends on a parameter referred to as the degrees of freedom. The t distribution with 1 degree of freedom is unique, as is the t distribution with 2 degrees of freedom, the t distribution with 3 degrees of freedom, and so on. These t distributions are similar in shape to the standard normal distribution but are wider; this reflects the additional uncertainty that results from using \(s_{\bar{x}}\) to estimate \(\sigma_{\bar{x}}\). As the degrees of freedom increase, the difference between \(s_{\bar{x}}\) and \(\sigma_{\bar{x}}\) decreases and the t distribution narrows. Furthermore, because the area under any distribution curve is fixed at 1.0, a narrower t distribution will have a higher peak. Thus, as the degrees of freedom increase, the t distribution narrows, its peak becomes higher, and it becomes more similar to the standard normal distribution. We can see this in Figure 6.9, which shows t distributions with 10 and 20 degrees of freedom as well as the standard normal probability distribution. Note that as with the standard normal distribution, the mean of the t distribution is zero.

To use the t distribution to compute the margin of error for the EAI example, we consider the t distribution with \(n - 1 = 30 - 1 = 29\) degrees of freedom. Figure 6.10 shows that for a t-distributed random variable with 29 degrees of freedom, 90% of the values are within \(\pm 1.699\) standard deviations of the mean and 10% of the values are more than \(\pm 1.699\) standard deviations away from the mean. Thus, 5% of the values are more than 1.699 standard deviations below the mean and 5% of the values are more than 1.699 standard deviations above the mean. This leads us to use \(t_{0.05}\) to denote the value of \(t\) for which the area in the upper tail of a t distribution is 0.05. For a t distribution with 29 degrees of freedom, \(t_{0.05} = 1.699\).

The standard normal distribution is a normal distribution with a mean of zero and a standard deviation of one. Chapter 5 contains a discussion of the normal distribution and its special case of the standard normal distribution.

FIGURE 6.8 Sampling Distribution of the Sample Mean
[Figure: normal curve for the sampling distribution of \(\bar{x}\) with \(\sigma_{\bar{x}} = \sigma/\sqrt{n}\); the sample means \(\bar{x}_1\) through \(\bar{x}_{10}\) are plotted against the range \(\mu - 1.645\sigma_{\bar{x}}\) to \(\mu + 1.645\sigma_{\bar{x}}\).]

We can use Excel’s T.INV.2T function to find the value from a t distribution such that a given percentage of the distribution is included in the interval \(\pm t\) for any degrees of freedom. For example, suppose again that we want to find the value of \(t\) from the t distribution with 29 degrees of freedom such that 90% of the t distribution is included in the interval \(-t\) to \(+t\). Excel’s T.INV.2T function has two inputs: (1) 1 − the proportion of the t distribution that will fall between \(-t\) and \(+t\), and (2) the degrees of freedom (which in this case is equal to the sample size − 1). For our example, we would enter the formula =T.INV.2T(1 − 0.90, 30 − 1), which computes the value of 1.699. This confirms the data shown in Figure 6.10; for the t distribution with 29 degrees of freedom, \(t_{0.05} = 1.699\) and 90% of all values for the t distribution with 29 degrees of freedom will lie between −1.699 and 1.699.
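For readers working outside Excel, the same two-tailed value can be approximated with only Python's standard library. The sketch below is not how Excel computes T.INV.2T; it is one possible approach that integrates the t density numerically (Simpson's rule) and then searches for the cutoff by bisection. The function names `t_pdf`, `t_central_area`, and `t_inv_2t` are assumptions chosen to echo the Excel function.

```python
import math

def t_pdf(t, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def t_central_area(t, df, steps=1000):
    """Area under the t density between -t and +t (composite Simpson's rule)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3  # doubled because the density is symmetric

def t_inv_2t(probability, df):
    """Value t such that the two tails beyond -t and +t hold `probability` in total."""
    target = 1 - probability
    lo, hi = 0.0, 100.0
    for _ in range(50):  # bisection on the central area
        mid = (lo + hi) / 2
        if t_central_area(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_inv_2t(1 - 0.90, 30 - 1), 3))   # EAI example: 29 degrees of freedom
for df in (10, 29, 100, 1000):
    print(df, round(t_inv_2t(0.10, df), 3))   # t_0.05 shrinks toward 1.645
```

The loop at the end illustrates how \(t_{0.05}\) moves toward the standard normal value of 1.645 as the degrees of freedom grow.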

Although the mathematical development of the t distribution is based on the assumption that the population from which we are sampling is normally distributed, research shows that the t distribution can be successfully applied in many situations in which the population deviates substantially from a normal distribution.

To see how the difference between the t distribution and the standard normal distribution decreases as the degrees of freedom increase, use Excel’s T.INV.2T function to compute \(t_{0.05}\) for increasingly larger degrees of freedom \((n - 1)\) and watch the value of \(t_{0.05}\) approach 1.645.

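FIGURE 6.9 Comparison of the Standard Normal Distribution with t Distributions with 10 and 20 Degrees of Freedom
[Figure: three curves plotted against \(z, t\): the standard normal distribution, the t distribution with 20 degrees of freedom, and the t distribution with 10 degrees of freedom.]

FIGURE 6.10 t Distribution with 29 Degrees of Freedom
[Figure: t distribution centered at 0 with 90% of the area between −1.699 and \(t_{0.05} = 1.699\), and 5% in each tail.]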


At the beginning of this section, we stated that the general form of an interval estimate of the population mean \(\mu\) is \(\bar{x} \pm\) margin of error. To provide an interpretation for this interval estimate, let us consider the values of \(\bar{x}\) that might be obtained if we took 10 independent simple random samples of 30 EAI employees. The first sample might have the mean \(\bar{x}_1\) and standard deviation \(s_1\). Figure 6.11 shows that the interval formed by subtracting \(1.699\,s_1/\sqrt{30}\) from \(\bar{x}_1\) and adding \(1.699\,s_1/\sqrt{30}\) to \(\bar{x}_1\) includes the population mean \(\mu\). Now consider what happens if the second sample has the mean \(\bar{x}_2\) and standard deviation \(s_2\). Although this sample mean differs from the first sample mean, we see in Figure 6.11 that the interval formed by subtracting \(1.699\,s_2/\sqrt{30}\) from \(\bar{x}_2\) and adding \(1.699\,s_2/\sqrt{30}\) to \(\bar{x}_2\) also includes the population mean \(\mu\). However, consider the third sample, which has the mean \(\bar{x}_3\) and standard deviation \(s_3\). As we see in Figure 6.11, the interval formed by subtracting \(1.699\,s_3/\sqrt{30}\) from \(\bar{x}_3\) and adding \(1.699\,s_3/\sqrt{30}\) to \(\bar{x}_3\) does not include the population mean \(\mu\). Because we are using \(t_{0.05} = 1.699\) to form this interval, we expect that 90% of the intervals for our samples will include the population mean \(\mu\), and we see in Figure 6.11 that the results for our 10 samples of 30 EAI employees are what we would expect; the intervals for 9 of the 10 samples of \(n = 30\) observations in this example include the mean \(\mu\). However, it is important to note that if we repeat this process of collecting 10 samples of \(n = 30\) EAI employees, we may find that fewer than 9 of the resulting intervals \(\bar{x} \pm 1.699\,s_{\bar{x}}\) include the mean \(\mu\) or all 10 of the resulting intervals \(\bar{x} \pm 1.699\,s_{\bar{x}}\) include the mean \(\mu\). However, on average, the resulting intervals \(\bar{x} \pm 1.699\,s_{\bar{x}}\) for 9 of 10 samples of \(n = 30\) observations will include the mean \(\mu\).

FIGURE 6.11 Intervals Formed Around Sample Means from 10 Independent Random Samples
[Figure: sampling distribution of \(\bar{x}\) centered at \(\mu\); below it, the intervals \(\bar{x}_i - 1.699\,s_i/\sqrt{30}\) to \(\bar{x}_i + 1.699\,s_i/\sqrt{30}\) for samples \(i = 1, \ldots, 10\). Only the interval for the third sample does not include \(\mu\).]

Now recall that the sample of \(n = 30\) EAI employees from Section 6.2 had a sample mean of salary of \(\bar{x} = \$51{,}814\) and sample standard deviation of \(s = \$3{,}340\). Using \(\bar{x} \pm 1.699(3{,}340/\sqrt{30})\) to construct the interval estimate, we obtain \(51{,}814 \pm 1{,}036\). Thus, the specific interval estimate of \(\mu\) based on this specific sample is $50,778 to $52,850. Because approximately 90% of all the intervals constructed using \(\bar{x} \pm 1.699(s/\sqrt{30})\) will contain the population mean, we say that we are approximately 90% confident that the interval $50,778 to $52,850 includes the population mean \(\mu\). We also say that this interval has been established at the 90% confidence level. The value of 0.90 is referred to as the confidence coefficient, and the interval $50,778 to $52,850 is called the 90% confidence interval.
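The arithmetic behind this 90% interval is short enough to script. The following is a minimal sketch using the values quoted in the text (\(\bar{x} = 51{,}814\), \(s = 3{,}340\), \(n = 30\), \(t_{0.05} = 1.699\)); the function name `mean_interval` is an assumption made for the example.

```python
import math

def mean_interval(x_bar, s, n, t_value):
    """Interval estimate of a population mean: x_bar +/- t * s / sqrt(n).

    Returns (lower limit, upper limit, margin of error).
    """
    margin = t_value * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin, margin

# EAI sample: t_value is t_0.05 for 29 degrees of freedom, as given in the text.
lower, upper, margin = mean_interval(x_bar=51814, s=3340, n=30, t_value=1.699)
print(f"margin of error: {margin:,.0f}")
print(f"90% confidence interval: {lower:,.0f} to {upper:,.0f}")
```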

Another term sometimes associated with an interval estimate is the level of significance. The level of significance associated with an interval estimate is denoted by the Greek letter \(\alpha\). The level of significance and the confidence coefficient are related as follows:

\[ \alpha = \text{level of significance} = 1 - \text{confidence coefficient} \]

The level of significance is the probability that the interval estimation procedure will generate an interval that does not contain \(\mu\) (such as the third sample in Figure 6.11). For example, the level of significance corresponding to a 0.90 confidence coefficient is \(\alpha = 1 - 0.90 = 0.10\).

In general, we use the notation \(t_{\alpha/2}\) to represent the value such that there is an area of \(\alpha/2\) in the upper tail of the t distribution (see Figure 6.12). If the sampling distribution of \(\bar{x}\) is normal, the margin of error for an interval estimate of a population mean \(\mu\) is

\[ t_{\alpha/2}\, s_{\bar{x}} = t_{\alpha/2} \frac{s}{\sqrt{n}} \]

So if the sampling distribution of \(\bar{x}\) is normal, we find the interval estimate of the mean \(\mu\) by subtracting this margin of error from the sample mean \(\bar{x}\) and adding this margin of error to the sample mean \(\bar{x}\). Using the notation we have developed, equation (6.7) can be used to find the confidence interval or interval estimate of the population mean \(\mu\).

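FIGURE 6.12 t Distribution with \(\alpha/2\) Area or Probability in the Upper Tail
[Figure: t distribution centered at 0 with an area of \(\alpha/2\) in the upper tail to the right of \(t_{\alpha/2}\).]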


If we want to find a 95% confidence interval for the mean \(\mu\) in the EAI example, we again recognize that the degrees of freedom are \(30 - 1 = 29\) and then use Excel’s T.INV.2T function to find \(t_{0.025} = 2.045\). We have seen that \(s_{\bar{x}} = 611.3\) in the EAI example, so the margin of error at the 95% level of confidence is \(t_{0.025}\, s_{\bar{x}} = 2.045(611.3) = 1{,}250\). We also know that \(\bar{x} = 51{,}814\) for the EAI example, so the 95% confidence interval is \(51{,}814 \pm 1{,}250\), or $50,564 to $53,064.

It is important to note that a 95% confidence interval does not have a 95% probability of containing the population mean \(\mu\). Once constructed, a confidence interval will either contain the population parameter (\(\mu\) in this EAI example) or not contain the population parameter. If we take several independent samples of the same size from our population and construct a 95% confidence interval for each of these samples, we would expect 95% of these confidence intervals to contain the mean \(\mu\). Our 95% confidence interval for the EAI example, $50,564 to $53,064, does indeed contain the population mean $51,800; however, if we took many independent samples of 30 EAI employees and developed a 95% confidence interval for each, we would expect that 5% of these confidence intervals would not include the population mean $51,800.

To further illustrate the interval estimation procedure, we will consider a study designed to estimate the mean credit card debt for the population of U.S. households. A sample of \(n = 70\) households provided the credit card balances shown in Table 6.5. For this situation, no previous estimate of the population standard deviation \(\sigma\) is available. Thus, the sample data must be used to estimate both the population mean and the population standard deviation. Using the data in Table 6.5, we compute the sample mean \(\bar{x} = \$9{,}312\) and the sample standard deviation \(s = \$4{,}007\).

We can use Excel’s T.INV.2T function to compute the value of \(t_{\alpha/2}\) to use in finding this confidence interval. With a 95% confidence level and \(n - 1 = 69\) degrees of freedom, we have that T.INV.2T(1 − 0.95, 69) = 1.995, so \(t_{\alpha/2} = t_{(1-0.95)/2} = t_{0.025} = 1.995\) for this confidence interval.

We use equation (6.7) to compute an interval estimate of the population mean credit card balance.

\[ 9{,}312 \pm 1.995 \frac{4{,}007}{\sqrt{70}} = 9{,}312 \pm 955 \]

The point estimate of the population mean is $9,312, the margin of error is $955, and the 95% confidence interval is \(9{,}312 - 955 = \$8{,}357\) to \(9{,}312 + 955 = \$10{,}267\). Thus, we are 95% confident that the mean credit card balance for the population of all households is between $8,357 and $10,267.

Using Excel

We will use the credit card balances in Table 6.5 to illustrate how Excel can be used to construct an interval estimate of the population mean. We start by summarizing the data using Excel’s Descriptive Statistics tool. Refer to Figure 6.13 as we describe the tasks involved. The formula worksheet is on the left; the value worksheet is on the right.

Step 1. Click the Data tab on the Ribbon
Step 2. In the Analysis group, click Data Analysis

Observe that the margin of error, \(t_{\alpha/2}(s/\sqrt{n})\), varies from sample to sample. This variation occurs because the sample standard deviation \(s\) varies depending on the sample selected. A large value for \(s\) results in a larger margin of error, while a small value for \(s\) results in a smaller margin of error.

INTERVAL ESTIMATE OF A POPULATION MEAN

\[ \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} \tag{6.7} \]

where \(s\) is the sample standard deviation, \(\alpha\) is the level of significance, and \(t_{\alpha/2}\) is the t value providing an area of \(\alpha/2\) in the upper tail of the t distribution with \(n - 1\) degrees of freedom.


TABLE 6.5 Credit Card Balances for a Sample of 70 Households

 9,430   14,661    7,159    9,071    9,691   11,032
 7,535   12,195    8,137    3,603   11,448    6,525
 4,078   10,544    9,467   16,804    8,279    5,239
 5,604   13,659   12,595   13,479    5,649    6,195
 5,179    7,061    7,917   14,044   11,298   12,584
 4,416    6,245   11,346    6,817    4,353   15,415
10,676   13,021   12,806    6,845    3,467   15,917
 1,627    9,719    4,972   10,493    6,191   12,591
10,112    2,200   11,356      615   12,851    9,743
 6,567   10,746    7,117   13,627    5,337   10,324
13,627   12,744    9,465   12,557    8,372
18,719    5,742   19,263    6,232    7,445

FIGURE 6.13 95% Confidence Interval for Credit Card Balances
[Figure: Excel formula worksheet (left) and value worksheet (right). Column A holds the 70 balances under the label NewBalance in rows 1 to 71; the Descriptive Statistics output for NewBalance appears in columns C and D, with the point estimate and interval limits in cells D18:D20. The values shown are:]

Mean                        9312
Standard Error              478.9281
Median                      9466
Mode                        13627
Standard Deviation          4007
Sample Variance             16056048
Kurtosis                    −0.2960
Skewness                    0.1879
Range                       18648
Minimum                     615
Maximum                     19263
Sum                         651840
Count                       70
Confidence Level(95.0%)     955   (margin of error)

Point Estimate              9312
Lower Limit                 8357
Upper Limit                 10267

Step 3. When the Data Analysis dialog box appears, choose Descriptive Statistics from the list of Analysis Tools
Step 4. When the Descriptive Statistics dialog box appears:
        Enter A1:A71 in the Input Range box
        Select Grouped By Columns
        Select Labels in First Row
        Select Output Range:
        Enter C1 in the Output Range box
        Select Summary Statistics
        Select Confidence Level for Mean
        Enter 95 in the Confidence Level for Mean box
        Click OK

If you can’t find Data Analysis on the Data tab, you may need to install the Analysis Toolpak add-in (which is included with Excel).


As Figure 6.13 illustrates, the sample mean \((\bar{x})\) is in cell D3. The margin of error, labeled “Confidence Level(95%),” appears in cell D16. The value worksheet shows \(\bar{x} = 9{,}312\) and a margin of error equal to 955.

Cells D18:D20 provide the point estimate and the lower and upper limits for the confidence interval. Because the point estimate is just the sample mean, the formula =D3 is entered into cell D18. To compute the lower limit of the 95% confidence interval, \(\bar{x}\) − (margin of error), we enter the formula =D18-D16 into cell D19. To compute the upper limit of the 95% confidence interval, \(\bar{x}\) + (margin of error), we enter the formula =D18+D16 into cell D20. The value worksheet shows a lower limit of 8,357 and an upper limit of 10,267. In other words, the 95% confidence interval for the population mean is from 8,357 to 10,267.
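The worksheet calculation can be mirrored outside Excel. The sketch below is a rough Python counterpart, not a reproduction of the Descriptive Statistics tool: it recomputes the mean, standard deviation, and standard error from the Table 6.5 balances and takes \(t_{0.025} = 1.995\) from the text rather than computing it. The variable names are assumptions made for the example.

```python
import math
import statistics

# Credit card balances from Table 6.5 (n = 70), read row by row.
balances = [
    9430, 14661, 7159, 9071, 9691, 11032,
    7535, 12195, 8137, 3603, 11448, 6525,
    4078, 10544, 9467, 16804, 8279, 5239,
    5604, 13659, 12595, 13479, 5649, 6195,
    5179, 7061, 7917, 14044, 11298, 12584,
    4416, 6245, 11346, 6817, 4353, 15415,
    10676, 13021, 12806, 6845, 3467, 15917,
    1627, 9719, 4972, 10493, 6191, 12591,
    10112, 2200, 11356, 615, 12851, 9743,
    6567, 10746, 7117, 13627, 5337, 10324,
    13627, 12744, 9465, 12557, 8372,
    18719, 5742, 19263, 6232, 7445,
]

T_VALUE = 1.995  # t_0.025 with 69 degrees of freedom, as given in the text

n = len(balances)
mean = statistics.mean(balances)        # point estimate (cell D3 / D18)
std_dev = statistics.stdev(balances)    # sample standard deviation
standard_error = std_dev / math.sqrt(n)
margin = T_VALUE * standard_error       # Excel's "Confidence Level(95.0%)"
lower, upper = mean - margin, mean + margin

print(f"n = {n}, mean = {mean:,.0f}, s = {std_dev:,.0f}")
print(f"standard error = {standard_error:.4f}, margin of error = {margin:,.0f}")
print(f"95% confidence interval: {lower:,.0f} to {upper:,.0f}")
```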

Interval Estimation of the Population Proportion

The general form of an interval estimate of a population proportion \(p\) is

\[ \bar{p} \pm \text{Margin of error} \]

The sampling distribution of \(\bar{p}\) plays a key role in computing the margin of error for this interval estimate.

In Section 6.3 we said that the sampling distribution of \(\bar{p}\) can be approximated by a normal distribution whenever \(np \ge 5\) and \(n(1-p) \ge 5\). Figure 6.14 shows the normal approximation of the sampling distribution of \(\bar{p}\). The mean of the sampling distribution of \(\bar{p}\) is the population proportion \(p\), and the standard error of \(\bar{p}\) is

\[ \sigma_{\bar{p}} = \sqrt{\frac{p(1-p)}{n}} \tag{6.8} \]

Because the sampling distribution of \(\bar{p}\) is approximately normal, if we choose \(z_{\alpha/2}\,\sigma_{\bar{p}}\) as the margin of error in an interval estimate of a population proportion, we know that \(100(1-\alpha)\%\) of the intervals generated will contain the true population proportion. But \(\sigma_{\bar{p}}\) cannot be used directly in the computation of the margin of error because \(p\) will not be known; \(p\) is what we are trying to estimate. So we estimate \(\sigma_{\bar{p}}\) with \(s_{\bar{p}}\) and then the margin of error for an interval estimate of a population proportion is given by

\[ \text{Margin of error} = z_{\alpha/2}\, s_{\bar{p}} = z_{\alpha/2} \sqrt{\frac{\bar{p}(1-\bar{p})}{n}} \tag{6.9} \]

The margin of error using the t distribution can also be computed with the Excel function CONFIDENCE.T(alpha, s, n), where alpha is the level of significance, s is the sample standard deviation, and n is the sample size.

The notation $z_{\alpha/2}$ represents the value such that there is an area of $\alpha/2$ in the upper tail of the standard normal distribution (a normal distribution with a mean of zero and standard deviation of one).

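FIGURE 6.14 Normal Approximation of the Sampling Distribution of $\bar{p}$

[Figure: sampling distribution of $\bar{p}$, a normal curve centered at $p$ with $\sigma_{\bar{p}} = \sqrt{p(1 - p)/n}$; the distances $z_{\alpha/2}\sigma_{\bar{p}}$ on either side of $p$ and the tail areas $\alpha/2$ are marked.]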


With this margin of error, the general expression for an interval estimate of a population proportion is as follows.

INTERVAL ESTIMATE OF A POPULATION PROPORTION

$$\bar{p} \pm z_{\alpha/2}\sqrt{\frac{\bar{p}(1 - \bar{p})}{n}} \tag{6.10}$$

where $\alpha$ is the level of significance and $z_{\alpha/2}$ is the $z$ value providing an area of $\alpha/2$ in the upper tail of the standard normal distribution.

The following example illustrates the computation of the margin of error and interval estimate for a population proportion. A national survey of 900 women golfers was conducted to learn how women golfers view their treatment at golf courses in the United States. The survey found that 396 of the women golfers were satisfied with the availability of tee times. Thus, the point estimate of the proportion of the population of women golfers who are satisfied with the availability of tee times is $\bar{p} = 396/900 = 0.44$. Using equation (6.10) and a 95% confidence level:

$$\bar{p} \pm z_{\alpha/2}\sqrt{\frac{\bar{p}(1 - \bar{p})}{n}} = 0.44 \pm 1.96\sqrt{\frac{0.44(1 - 0.44)}{900}} = 0.44 \pm 0.0324$$

Thus, the margin of error is 0.0324 and the 95% confidence interval estimate of the population proportion is 0.4076 to 0.4724. Using percentages, the survey results enable us to state with 95% confidence that between 40.76% and 47.24% of all women golfers are satisfied with the availability of tee times.

Using Excel. Excel can be used to construct an interval estimate of the population proportion of women golfers who are satisfied with the availability of tee times. The responses in the survey were recorded as a Yes or No in the file named TeeTimes for each woman surveyed. Refer to Figure 6.15 as we describe the tasks involved in constructing a 95% confidence interval. The formula worksheet is on the left; the value worksheet appears on the right.

The descriptive statistics we need and the response of interest are provided in cells D3:D6. Because Excel’s COUNT function works only with numerical data, we used the COUNTA function in cell D3 to compute the sample size. The response for which we want to develop an interval estimate, Yes or No, is entered into cell D4. Figure 6.15 shows that Yes has been entered into cell D4, indicating that we want to develop an interval estimate of the population proportion of women golfers who are satisfied with the availability of tee times. If we had wanted to develop an interval estimate of the population proportion of women golfers who are not satisfied with the availability of tee times, we would have entered No in cell D4. With Yes entered in cell D4, the COUNTIF function in cell D5 counts the number of Yes responses in the sample. The sample proportion is then computed in cell D6 by dividing the number of Yes responses in cell D5 by the sample size in cell D3.

Cells D8:D10 are used to compute the appropriate $z$ value. The confidence coefficient (0.95) is entered into cell D8 and the level of significance ($\alpha$) is computed in cell D9 by entering the formula =1-D8. The $z$ value corresponding to an upper-tail area of $\alpha/2$ is computed by entering the formula =NORM.S.INV(1-D9/2) into cell D10. The value worksheet shows that $z_{0.025} = 1.96$.

Cells D12:D13 provide the estimate of the standard error and the margin of error. In cell D12, we entered the formula =SQRT(D6*(1-D6)/D3) to compute the standard error using the sample proportion and the sample size as inputs. The formula =D10*D12 is entered into cell D13 to compute the margin of error corresponding to equation (6.9).

The Excel formula =NORM.S.INV(1 - $\alpha$/2) computes the value of $z_{\alpha/2}$. For example, for $\alpha = 0.05$, $z_{0.025}$ = NORM.S.INV(1 - 0.05/2) = 1.96.

The file TeeTimes displayed in Figure 6.15 can be used as a template for developing confidence intervals about a population proportion $p$ by entering new problem data in column A and appropriately adjusting the formulas in column D.

Cells D15:D17 provide the point estimate and the lower and upper limits for a confidence interval. The point estimate in cell D15 is the sample proportion. The lower and upper limits in cells D16 and D17 are obtained by subtracting and adding the margin of error to the point estimate. We note that the 95% confidence interval for the proportion of women golfers who are satisfied with the availability of tee times is 0.4076 to 0.4724.
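For readers working outside Excel, the worksheet steps above can be sketched in Python. This is a minimal illustration, not part of the TeeTimes file: the function name `proportion_interval` and the reconstructed list of responses (396 Yes, 504 No) are assumptions made here, and `statistics.NormalDist().inv_cdf` plays the role of NORM.S.INV.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(responses, response_of_interest="Yes", confidence=0.95):
    """Interval estimate of a population proportion, mirroring cells D3:D17."""
    n = len(responses)                                   # D3: sample size (COUNTA)
    count = sum(1 for r in responses if r == response_of_interest)  # D5: COUNTIF
    p_bar = count / n                                    # D6: sample proportion
    alpha = 1 - confidence                               # D9: level of significance
    z = NormalDist().inv_cdf(1 - alpha / 2)              # D10: NORM.S.INV(1 - alpha/2)
    std_error = sqrt(p_bar * (1 - p_bar) / n)            # D12: standard error
    margin = z * std_error                               # D13: margin of error, eq. (6.9)
    return p_bar, margin, p_bar - margin, p_bar + margin  # D15:D17


# Hypothetical stand-in for column A of TeeTimes: 396 Yes and 504 No responses.
responses = ["Yes"] * 396 + ["No"] * 504
p_bar, margin, lower, upper = proportion_interval(responses)
print(f"point estimate {p_bar:.2f}, margin of error {margin:.4f}")
print(f"95% interval: {lower:.4f} to {upper:.4f}")
```

Passing "No" as the response of interest corresponds to entering No in cell D4.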

FIGURE 6.15 95% Confidence Interval for Survey of Women Golfers

[Worksheet titled Interval Estimate of a Population Proportion. Column A holds the heading Response in row 1 and the 900 Yes/No survey responses in rows 2 through 901. The formula worksheet and the value worksheet are summarized below.]

Cell   Label (column C)                 Formula worksheet (column D)   Value worksheet (column D)
D3     Sample Size                      =COUNTA(A2:A901)               900
D4     Response of Interest             Yes                            Yes
D5     Count for Response               =COUNTIF(A2:A901,D4)           396
D6     Sample Proportion                =D5/D3                         0.44
D8     Confidence Coefficient           0.95                           0.95
D9     Level of Significance (alpha)    =1-D8                          0.05
D10    z Value                          =NORM.S.INV(1-D9/2)            1.96
D12    Standard Error                   =SQRT(D6*(1-D6)/D3)            0.0165
D13    Margin of Error                  =D10*D12                       0.0324
D15    Point Estimate                   =D6                            0.44
D16    Lower Limit                      =D15-D13                       0.4076
D17    Upper Limit                      =D15+D13                       0.4724

Annotation on cell D4: Enter Yes as the Response of Interest.

NOTES + COMMENTS

1. The reason the number of degrees of freedom associated with the $t$ value in equation (6.7) is $n - 1$ concerns the use of $s$ as an estimate of the population standard deviation $\sigma$. The expression for the sample standard deviation is

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}.$$

Degrees of freedom refer to the number of independent pieces of information that go into the computation of $\sum (x_i - \bar{x})^2$. The $n$ pieces of information involved in computing $\sum (x_i - \bar{x})^2$ are as follows: $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$. Note that $\sum (x_i - \bar{x}) = 0$ for any data set. Thus, only $n - 1$ of the $x_i - \bar{x}$ values are independent; that is, if we know $n - 1$ of the values, the remaining value can be determined exactly by using the condition that the sum of the $x_i - \bar{x}$ values must be 0. Thus, $n - 1$ is the number of degrees of freedom associated with $\sum (x_i - \bar{x})^2$ and hence the number of degrees of freedom for the $t$ distribution in equation (6.7).

2. In most applications, a sample size of $n \geq 30$ is adequate when using equation (6.7) to develop an interval estimate of a population mean. However, if the population distribution is highly skewed or contains outliers, most statisticians would recommend increasing the sample size to 50 or more. If the population is not normally distributed but is roughly symmetric, sample sizes as small as 15 can be expected to provide good approximate confidence intervals. With smaller sample sizes, equation (6.7) should be used if the analyst believes, or is willing to assume, that the population distribution is at least approximately normal.

3. What happens to confidence interval estimates of $\bar{x}$ when the population is skewed? Consider a population that is skewed to the right, with large data values stretching the distribution to the right. When such skewness exists, the sample mean $\bar{x}$ and the sample standard deviation $s$ are positively correlated. Larger values of $s$ tend to be associated with larger values of $\bar{x}$. Thus, when $\bar{x}$ is larger than the population mean, $s$ tends to be larger than $\sigma$. This skewness causes the margin of error, $t_{\alpha/2}(s/\sqrt{n})$, to be larger than it would be with $\sigma$ known. The confidence interval with the larger margin of error tends to include the population mean more often than it would if the true value of $\sigma$ were used. But when $\bar{x}$ is smaller than the population mean, the correlation between $\bar{x}$ and $s$ causes the margin of error to be small. In this case, the confidence interval with the smaller margin of error tends to miss the population mean more than it would if we knew $\sigma$ and used it. For this reason, we recommend using larger sample sizes with highly skewed population distributions.

4. We can find the sample size necessary to provide the desired margin of error at the chosen confidence level. Let $E$ = the desired margin of error. Then
• the sample size for an interval estimate of a population mean is $n = \frac{(z_{\alpha/2})^2 \sigma^2}{E^2}$, where $E$ is the margin of error that the user is willing to accept, and the value of $z_{\alpha/2}$ follows directly from the confidence level to be used in developing the interval estimate.
• the sample size for an interval estimate of a population proportion is $n = \frac{(z_{\alpha/2})^2 p^*(1 - p^*)}{E^2}$, where the planning value $p^*$ can be chosen by use of (i) the sample proportion from a previous sample of the same or similar units, (ii) a pilot study to select a preliminary sample, (iii) judgment or a “best guess” for the value of $p^*$, or (iv) if none of the preceding alternatives apply, use of the planning value of $p^* = 0.50$.

5. The desired margin of error for estimating a population proportion is almost always 0.10 or less. In national public opinion polls conducted by organizations such as Gallup and Harris, a 0.03 or 0.04 margin of error is common. With such margins of error, the sample found with $n = \frac{(z_{\alpha/2})^2 p^*(1 - p^*)}{E^2}$ will almost always provide a size that is sufficient to satisfy the requirements of $np \geq 5$ and $n(1 - p) \geq 5$ for using a normal distribution as an approximation for the sampling distribution of $\bar{p}$.

6.5 Hypothesis Tests

Throughout this chapter we have shown how a sample could be used to develop point and interval estimates of population parameters such as the mean $\mu$ and the proportion $p$. In this section we continue the discussion of statistical inference by showing how hypothesis testing can be used to determine whether a statement about the value of a population parameter should or should not be rejected.

In hypothesis testing we begin by making a tentative conjecture about a population parameter. This tentative conjecture is called the null hypothesis and is denoted by $H_0$. We then define another hypothesis, called the alternative hypothesis, which is the opposite of what is stated in the null hypothesis. The alternative hypothesis is denoted by $H_a$. The hypothesis testing procedure uses data from a sample to test the validity of the two competing statements about a population that are indicated by $H_0$ and $H_a$.

This section shows how hypothesis tests can be conducted about a population mean and a population proportion. We begin by providing examples that illustrate approaches to developing null and alternative hypotheses.

Developing Null and Alternative Hypotheses

It is not always obvious how the null and alternative hypotheses should be formulated. Care must be taken to structure the hypotheses appropriately so that the hypothesis testing conclusion provides the information the researcher or decision maker wants. The context of the situation is very important in determining how the hypotheses should be stated. All hypothesis testing applications involve collecting a random sample and using the sample results to provide evidence for drawing a conclusion. Good questions to consider when formulating the null and alternative hypotheses are, What is the purpose of collecting the sample? What conclusions are we hoping to make?

In the introduction to this section, we stated that the null hypothesis $H_0$ is a tentative conjecture about a population parameter such as a population mean or a population proportion. The alternative hypothesis $H_a$ is a statement that is the opposite of what is stated in the null hypothesis. In some situations it is easier to identify the alternative hypothesis first and then develop the null hypothesis. In other situations it is easier to identify the null hypothesis first and then develop the alternative hypothesis. We will illustrate these situations in the following examples.

Learning to formulate hypotheses correctly will take some practice. Expect some initial confusion about the proper choice of the null and alternative hypotheses. The examples in this section are intended to provide guidelines.

The Alternative Hypothesis as a Research Hypothesis. Many applications of hypothesis testing involve an attempt to gather evidence in support of a research hypothesis. In these situations, it is often best to begin with the alternative hypothesis and make it the conclusion that the researcher hopes to support. Consider a particular automobile that currently attains a fuel efficiency of 24 miles per gallon for city driving. A product research group has developed a new fuel injection system designed to increase the miles-per-gallon rating. The group will run controlled tests with the new fuel injection system looking for statistical support for the conclusion that the new fuel injection system provides more miles per gallon than the current system.

Several new fuel injection units will be manufactured, installed in test automobiles, and subjected to research-controlled driving conditions. The sample mean miles per gallon for these automobiles will be computed and used in a hypothesis test to determine whether it can be concluded that the new system provides more than 24 miles per gallon. In terms of the population mean miles per gallon $\mu$, the research hypothesis $\mu > 24$ becomes the alternative hypothesis. Since the current system provides an average or mean of 24 miles per gallon, we will make the tentative conjecture that the new system is no better than the current system and choose $\mu \leq 24$ as the null hypothesis. The null and alternative hypotheses are as follows:

$$H_0: \mu \leq 24$$
$$H_a: \mu > 24$$

If the sample results lead to the conclusion to reject $H_0$, the inference can be made that $H_a: \mu > 24$ is true. The researchers have the statistical support to state that the new fuel injection system increases the mean number of miles per gallon. The production of automobiles with the new fuel injection system should be considered. However, if the sample results lead to the conclusion that $H_0$ cannot be rejected, the researchers cannot conclude that the new fuel injection system is better than the current system. Production of automobiles with the new fuel injection system on the basis of better gas mileage cannot be justified. Perhaps more research and further testing can be conducted.

Successful companies stay competitive by developing new products, new methods, and new services that are better than what is currently available. Before adopting something new, it is desirable to conduct research to determine whether there is statistical support for the conclusion that the new approach is indeed better. In such cases, the research hypothesis is stated as the alternative hypothesis. For example, a new teaching method is developed that is believed to be better than the current method. The alternative hypothesis is that the new method is better; the null hypothesis is that the new method is no better than the old method. A new sales force bonus plan is developed in an attempt to increase sales. The alternative hypothesis is that the new bonus plan increases sales; the null hypothesis is that the new bonus plan does not increase sales. A new drug is developed with the goal of lowering blood pressure more than an existing drug. The alternative hypothesis is that the new drug lowers blood pressure more than the existing drug; the null hypothesis is that the new drug does not provide lower blood pressure than the existing drug. In each case, rejection of the null hypothesis $H_0$ provides statistical support for the research hypothesis. We will see many examples of hypothesis tests in research situations such as these throughout this chapter and in the remainder of the text.

The Null Hypothesis as a Conjecture to Be Challenged. Of course, not all hypothesis tests involve research hypotheses. In the following discussion we consider applications of hypothesis testing where we begin with a belief or a conjecture that a statement about the value of a population parameter is true. We will then use a hypothesis test to challenge the conjecture and determine whether there is statistical evidence to conclude that the conjecture is incorrect. In these situations, it is helpful to develop the null hypothesis first. The null hypothesis $H_0$ expresses the belief or conjecture about the value of the population parameter. The alternative hypothesis $H_a$ is that the belief or conjecture is incorrect.

The conclusion that the research hypothesis is true is made if the sample data provide sufficient evidence to show that the null hypothesis can be rejected.

As an example, consider the situation of a manufacturer of soft drink products. The label on a soft drink bottle states that it contains 67.6 fluid ounces. We consider the label correct provided the population mean filling weight for the bottles is at least 67.6 fluid ounces. With no reason to believe otherwise, we would give the manufacturer the benefit of the doubt and assume that the statement provided on the label is correct. Thus, in a hypothesis test about the population mean fluid weight per bottle, we would begin with the conjecture that the label is correct and state the null hypothesis as $\mu \geq 67.6$. The challenge to this conjecture would imply that the label is incorrect and the bottles are being underfilled. This challenge would be stated as the alternative hypothesis $\mu < 67.6$. Thus, the null and alternative hypotheses are as follows:

$$H_0: \mu \geq 67.6$$
$$H_a: \mu < 67.6$$

A government agency with the responsibility for validating manufacturing labels could select a sample of soft drink bottles, compute the sample mean filling weight, and use the sample results to test the preceding hypotheses. If the sample results lead to the conclusion to reject $H_0$, the inference that $H_a: \mu < 67.6$ is true can be made. With this statistical support, the agency is justified in concluding that the label is incorrect and that the bottles are being underfilled. Appropriate action to force the manufacturer to comply with labeling standards would be considered. However, if the sample results indicate $H_0$ cannot be rejected, the conjecture that the manufacturer’s labeling is correct cannot be rejected. With this conclusion, no action would be taken.

Let us now consider a variation of the soft drink bottle-filling example by viewing the same situation from the manufacturer’s point of view. The bottle-filling operation has been designed to fill soft drink bottles with 67.6 fluid ounces as stated on the label. The company does not want to underfill the containers because that could result in complaints from customers or, perhaps, a government agency. However, the company does not want to overfill containers either because putting more soft drink than necessary into the containers would be an unnecessary cost. The company’s goal would be to adjust the bottle-filling operation so that the population mean filling weight per bottle is 67.6 fluid ounces as specified on the label.

Although this is the company’s goal, from time to time any production process can get out of adjustment. If this occurs in our example, underfilling or overfilling of the soft drink bottles will occur. In either case, the company would like to know about it in order to correct the situation by readjusting the bottle-filling operation to result in the designated 67.6 fluid ounces. In this hypothesis testing application, we would begin with the conjecture that the production process is operating correctly and state the null hypothesis as $\mu = 67.6$ fluid ounces. The alternative hypothesis that challenges this conjecture is that $\mu \neq 67.6$, which indicates that either overfilling or underfilling is occurring. The null and alternative hypotheses for the manufacturer’s hypothesis test are as follows:

$$H_0: \mu = 67.6$$
$$H_a: \mu \neq 67.6$$

Suppose that the soft drink manufacturer uses a quality-control procedure to periodically select a sample of bottles from the filling operation and computes the sample mean filling weight per bottle. If the sample results lead to the conclusion to reject $H_0$, the inference is made that $H_a: \mu \neq 67.6$ is true. We conclude that the bottles are not being filled properly and the production process should be adjusted to restore the population mean to 67.6 fluid ounces per bottle. However, if the sample results indicate $H_0$ cannot be rejected, the conjecture that the manufacturer’s bottle-filling operation is functioning properly cannot be rejected. In this case, no further action would be taken and the production operation would continue to run.

The two preceding forms of the soft drink manufacturing hypothesis test show that the null and alternative hypotheses may vary depending on the point of view of the researcher or decision maker. To formulate hypotheses correctly, it is important to understand the context of the situation and to structure the hypotheses to provide the information the researcher or decision maker wants.

A manufacturer’s product information is usually assumed to be true and stated as the null hypothesis. The conclusion that the information is incorrect can be made if the null hypothesis is rejected.

Summary of Forms for Null and Alternative Hypotheses. The hypothesis tests in this chapter involve two population parameters: the population mean and the population proportion. Depending on the situation, hypothesis tests about a population parameter may take one of three forms: two use inequalities in the null hypothesis; the third uses an equality in the null hypothesis. For hypothesis tests involving a population mean, we let $\mu_0$ denote the hypothesized value of the population mean and we must choose one of the following three forms for the hypothesis test:

$$H_0: \mu \geq \mu_0 \qquad H_0: \mu \leq \mu_0 \qquad H_0: \mu = \mu_0$$
$$H_a: \mu < \mu_0 \qquad H_a: \mu > \mu_0 \qquad H_a: \mu \neq \mu_0$$

For reasons that will be clear later, the first two forms are called one-tailed tests. The third form is called a two-tailed test.

In many situations, the choice of $H_0$ and $H_a$ is not obvious and judgment is necessary to select the proper form. However, as the preceding forms show, the equality part of the expression (either $\geq$, $\leq$, or $=$) always appears in the null hypothesis. In selecting the proper form of $H_0$ and $H_a$, keep in mind that the alternative hypothesis is often what the test is attempting to establish. Hence, asking whether the user is looking for evidence to support $\mu < \mu_0$, $\mu > \mu_0$, or $\mu \neq \mu_0$ will help determine $H_a$.
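The rule of thumb above (decide what the test should establish, then place the equality in the null hypothesis) can be written as a small lookup. The sketch below is only an illustration; the function name `hypotheses` and the operator strings are hypothetical choices made here, not notation from the text.

```python
def hypotheses(evidence_sought, mu0):
    """Return (H0, Ha) as strings, given the relation the test hopes to support.

    evidence_sought is one of "<", ">", "!=" and describes the alternative
    hypothesis; the null hypothesis always carries the equality part.
    """
    forms = {
        "<": (">=", "<"),    # lower-tail test
        ">": ("<=", ">"),    # upper-tail test
        "!=": ("=", "!="),   # two-tailed test
    }
    null_op, alt_op = forms[evidence_sought]
    return f"H0: mu {null_op} {mu0}", f"Ha: mu {alt_op} {mu0}"


# Fuel injection example: the researchers hope to show mu > 24.
print(hypotheses(">", 24))
# Manufacturer's bottle-filling example: any departure from 67.6 matters.
print(hypotheses("!=", 67.6))
```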

Type I and Type II Errors

The null and alternative hypotheses are competing statements about the population. Either the null hypothesis $H_0$ is true or the alternative hypothesis $H_a$ is true, but not both. Ideally the hypothesis testing procedure should lead to the acceptance of $H_0$ when $H_0$ is true and the rejection of $H_0$ when $H_a$ is true. Unfortunately, the correct conclusions are not always possible. Because hypothesis tests are based on sample information, we must allow for the possibility of errors. Table 6.6 illustrates the two kinds of errors that can be made in hypothesis testing.

The first row of Table 6.6 shows what can happen if the conclusion is to accept $H_0$. If $H_0$ is true, this conclusion is correct. However, if $H_a$ is true, we made a Type II error; that is, we accepted $H_0$ when it is false. The second row of Table 6.6 shows what can happen if the conclusion is to reject $H_0$. If $H_0$ is true, we made a Type I error; that is, we rejected $H_0$ when it is true. However, if $H_a$ is true, rejecting $H_0$ is correct.

Recall the hypothesis testing illustration in which an automobile product research group developed a new fuel injection system designed to increase the miles-per-gallon rating of a particular automobile. With the current model obtaining an average of 24 miles per gallon, the hypothesis test was formulated as follows:

$$H_0: \mu \leq 24$$
$$H_a: \mu > 24$$

The alternative hypothesis, $H_a: \mu > 24$, indicates that the researchers are looking for sample evidence to support the conclusion that the population mean miles per gallon with the new fuel injection system is greater than 24.

In this application, the Type I error of rejecting $H_0$ when it is true corresponds to the researchers claiming that the new system improves the miles-per-gallon rating ($\mu > 24$) when in fact the new system is no better than the current system. In contrast, the Type II error of accepting $H_0$ when it is false corresponds to the researchers concluding that the new system is no better than the current system ($\mu \leq 24$) when in fact the new system improves miles-per-gallon performance.

The three possible forms of hypotheses $H_0$ and $H_a$ are shown here. Note that the equality always appears in the null hypothesis $H_0$.

TABLE 6.6 Errors and Correct Conclusions in Hypothesis Testing

                          Population Condition
Conclusion                $H_0$ True            $H_a$ True
Do Not Reject $H_0$       Correct conclusion    Type II error
Reject $H_0$              Type I error          Correct conclusion


For the miles-per-gallon rating hypothesis test, the null hypothesis is $H_0: \mu \leq 24$. Suppose the null hypothesis is true as an equality; that is, $\mu = 24$. The probability of making a Type I error when the null hypothesis is true as an equality is called the level of significance. Thus, for the miles-per-gallon rating hypothesis test, the level of significance is the probability of rejecting $H_0: \mu \leq 24$ when $\mu = 24$. Because of the importance of this concept, we now restate the definition of level of significance.

The Greek symbol $\alpha$ (alpha) is used to denote the level of significance, and common choices for $\alpha$ are 0.05 and 0.01.

LEVEL OF SIGNIFICANCE

The level of significance is the probability of making a Type I error when the null hypothesis is true as an equality.

In practice, the person responsible for the hypothesis test specifies the level of significance. By selecting $\alpha$, that person is controlling the probability of making a Type I error. If the cost of making a Type I error is high, small values of $\alpha$ are preferred. If the cost of making a Type I error is not too high, larger values of $\alpha$ are typically used. Applications of hypothesis testing that only control the Type I error are called significance tests. Many applications of hypothesis testing are of this type.

Although most applications of hypothesis testing control the probability of making a Type I error, they do not always control the probability of making a Type II error. Hence, if we decide to accept $H_0$, we cannot determine how confident we can be with that decision. Because of the uncertainty associated with making a Type II error when conducting significance tests, statisticians usually recommend that we use the statement “do not reject $H_0$” instead of “accept $H_0$.” Using the statement “do not reject $H_0$” carries the recommendation to withhold both judgment and action. In effect, by not directly accepting $H_0$, the statistician avoids the risk of making a Type II error. Whenever the probability of making a Type II error has not been determined and controlled, we will not make the statement “accept $H_0$.” In such cases, only two conclusions are possible: do not reject $H_0$ or reject $H_0$.

If the sample data are consistent with the null hypothesis $H_0$, we will follow the practice of concluding “do not reject $H_0$.” This conclusion is preferred over “accept $H_0$,” because the conclusion to accept $H_0$ puts us at risk of making a Type II error.

Although controlling for a Type II error in hypothesis testing is not common, it can be done. Specialized texts describe procedures for determining and controlling the probability of making a Type II error.¹ If proper controls have been established for this error, action based on the “accept $H_0$” conclusion can be appropriate.

¹See, for example, D. R. Anderson, D. J. Sweeney, T. A. Williams, J. D. Camm, and J. J. Cochran, Statistics for Business and Economics, 13th edition (Mason, OH: Cengage Learning, 2018).

Hypothesis Test of the Population Mean

In this section we describe how to conduct hypothesis tests about a population mean for the practical situation in which the sample must be used to develop estimates of both $\mu$ and $\sigma$. Thus, to conduct a hypothesis test about a population mean, the sample mean $\bar{x}$ is used as an estimate of $\mu$ and the sample standard deviation $s$ is used as an estimate of $\sigma$.

One-Tailed Test. One-tailed tests about a population mean take one of the following two forms:

Lower-Tail Test              Upper-Tail Test
$H_0: \mu \geq \mu_0$        $H_0: \mu \leq \mu_0$
$H_a: \mu < \mu_0$           $H_a: \mu > \mu_0$
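The meaning of the level of significance can be made concrete with a small simulation of the upper-tail fuel injection test, $H_0: \mu \leq 24$ versus $H_a: \mu > 24$. The sketch below is an illustration under loudly stated assumptions that are not in the text: the population is normal with $\mu = 24$ exactly, the standard deviation is treated as known ($\sigma = 2$, a made-up value), the sample size is 36, and a $z$ test is used in place of the $t$ test so that only the Python standard library is needed. The function name `type_i_error_rate` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def type_i_error_rate(mu0=24.0, sigma=2.0, n=36, alpha=0.05,
                      trials=20000, seed=1):
    """Estimate how often H0: mu <= mu0 is rejected when mu = mu0 exactly.

    Assumes a normal population with KNOWN sigma (an illustrative
    simplification), so the test statistic is z rather than t.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)   # upper-tail critical value
    std_error = sigma / sqrt(n)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        z = (x_bar - mu0) / std_error
        if z > z_crit:                          # reject H0: a Type I error here
            rejections += 1
    return rejections / trials


rate = type_i_error_rate()
print(f"estimated Type I error rate: {rate:.3f} (alpha = 0.05)")
```

Because the null hypothesis is true as an equality in every simulated sample, each rejection is a Type I error, and the long-run rejection rate should settle near the chosen $\alpha$.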

Let us consider an example involving a lower-tail test. The Federal Trade Commission (FTC) periodically conducts statistical studies designed to test the claims that manufacturers make about their products. For example, the label on a large can of Hilltop Coffee states that the can contains 3 pounds of coffee. The FTC knows that Hilltop’s production process cannot place exactly 3 pounds of coffee in each can, even if the mean filling weight for the population of all cans filled is 3 pounds per can. However, as long as the population mean filling weight is at least 3 pounds per can, the rights of consumers will be protected. Thus, the FTC interprets the label information on a large can of coffee as a claim by Hilltop that the population mean filling weight is at least 3 pounds per can. We will show how the FTC can check Hilltop’s claim by conducting a lower-tail hypothesis test.

The first step is to develop the null and alternative hypotheses for the test. If the population mean filling weight is at least 3 pounds per can, Hilltop’s claim is correct. This establishes the null hypothesis for the test. However, if the population mean weight is less than 3 pounds per can, Hilltop’s claim is incorrect. This establishes the alternative hypothesis. With m denoting the population mean filling weight, the null and alternative hypotheses are as follows:

$$H_0: \mu \geq 3$$
$$H_a: \mu < 3$$

Note that the hypothesized value of the population mean is $\mu_0 = 3$.

If the sample data indicate that $H_0$ cannot be rejected, the statistical evidence does not support the conclusion that a label violation has occurred. Hence, no action should be taken against Hilltop. However, if the sample data indicate that $H_0$ can be rejected, we will conclude that the alternative hypothesis, $H_a: \mu < 3$, is true. In this case a conclusion of underfilling and a charge of a label violation against Hilltop would be justified.

Suppose a sample of 36 cans of coffee is selected and the sample mean $\bar{x}$ is computed as an estimate of the population mean $\mu$. If the value of the sample mean $\bar{x}$ is less than 3 pounds, the sample results will cast doubt on the null hypothesis. What we want to know is how much less than 3 pounds must $\bar{x}$ be before we would be willing to declare the difference significant and risk making a Type I error by falsely accusing Hilltop of a label violation. A key factor in addressing this issue is the value the decision maker selects for the level of significance.

As noted in the preceding section, the level of significance, denoted by $\alpha$, is the probability of making a Type I error by rejecting $H_0$ when the null hypothesis is true as an equality. The decision maker must specify the level of significance. If the cost of making a Type I error is high, a small value should be chosen for the level of significance. If the cost is not high, a larger value is more appropriate. In the Hilltop Coffee study, the director of the FTC’s testing program made the following statement: “If the company is meeting its weight specifications at $\mu = 3$, I do not want to take action against them. But I am willing to risk a 1% chance of making such an error.” From the director’s statement, we set the level of significance for the hypothesis test at $\alpha = 0.01$. Thus, we must design the hypothesis test so that the probability of making a Type I error when $\mu = 3$ is 0.01.

For the Hilltop Coffee study, by developing the null and alternative hypotheses and specifying the level of significance for the test, we carry out the first two steps required in conducting every hypothesis test. We are now ready to perform the third step of hypothesis testing: collect the sample data and compute the value of what is called a test statistic.

Test Statistic. From the study of sampling distributions in Section 6.3 we know that as the sample size increases, the sampling distribution of $\bar{x}$ will become normally distributed. Figure 6.16 shows the sampling distribution of $\bar{x}$ when the null hypothesis is true as an equality, that is, when $\mu = \mu_0 = 3$ (see footnote 2). Note that $\sigma_{\bar{x}}$, the standard error of $\bar{x}$, is estimated by $s_{\bar{x}} = s/\sqrt{n} = 0.17/\sqrt{36} = 0.028$. Recall that in Section 6.4, we showed that an interval estimate of a population mean is based on a probability distribution known as the t distribution. The t distribution is similar to the standard normal distribution, but accounts for the additional variability introduced when using a sample to estimate both the population mean and population standard deviation. Hypothesis tests about a population mean are also based on the t distribution. Specifically, if $\bar{x}$ is normally distributed, the sampling distribution of

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{\bar{x} - 3}{0.028}$$

The standard error of $\bar{x}$ is the standard deviation of the sampling distribution of $\bar{x}$.

Coffee

2. In constructing sampling distributions for hypothesis tests, it is assumed that $H_0$ is satisfied as an equality.


is a t distribution with $n - 1$ degrees of freedom. The value of t represents how much the sample mean is above or below the hypothesized value of the population mean as measured in units of the standard error of the sample mean. A value of $t = -1$ means that the value of $\bar{x}$ is 1 standard error below the hypothesized value of the mean, a value of $t = -2$ means that the value of $\bar{x}$ is 2 standard errors below the hypothesized value of the mean, and so on. For this lower-tail hypothesis test, we can use Excel to find the lower-tail probability corresponding to any t value (as we show later in this section). For example, Figure 6.17 illustrates that the lower-tail area at $t = -3.00$ is 0.0025. Hence, the probability of obtaining a value of t that is three or more standard errors below the mean is 0.0025. As a result, if the null hypothesis is true (i.e., if the population mean is 3), the probability of obtaining a value of $\bar{x}$ that is 3 or more standard errors below the hypothesized population mean $\mu_0 = 3$ is also 0.0025. Because such a result is unlikely if the null hypothesis is true, this leads us to doubt our null hypothesis.

We use the t-distributed random variable t as a test statistic to determine whether $\bar{x}$ deviates from the hypothesized value of $\mu$ enough to justify rejecting the null hypothesis. With $s_{\bar{x}} = s/\sqrt{n}$, the test statistic is as follows:

Although the t distribution is based on the assumption that the population from which we are sampling is normally distributed, research shows that when the sample size is large enough this assumption can be relaxed considerably.

FIGURE 6.16 Sampling Distribution of $\bar{x}$ for the Hilltop Coffee Study When the Null Hypothesis Is True as an Equality ($\mu = 3$)

[Figure: sampling distribution of $\bar{x}$, centered at $\mu = 3$.]

TEST STATISTIC FOR HYPOTHESIS TESTS ABOUT A POPULATION MEAN

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \tag{6.11}$$
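As a minimal sketch of equation (6.11), the following Python function computes the estimated standard error and the test statistic from summary values. The function and variable names are illustrative assumptions, not part of the text; the inputs are the Hilltop Coffee values ($s = 0.17$, $n = 36$, $\mu_0 = 3$, and the sample mean of 2.92 reported later in this section).

```python
import math


def t_statistic(sample_mean, hypothesized_mean, sample_sd, n):
    """Return (standard_error, t) for a test about a population mean.

    Implements t = (x_bar - mu_0) / (s / sqrt(n)), equation (6.11).
    """
    standard_error = sample_sd / math.sqrt(n)
    t = (sample_mean - hypothesized_mean) / standard_error
    return standard_error, t


# Hilltop Coffee summary values taken from the text
se, t = t_statistic(sample_mean=2.92, hypothesized_mean=3.0, sample_sd=0.17, n=36)
print(f"standard error = {se:.3f}, t = {t:.3f}")
```

A negative value of t indicates that the sample mean lies below the hypothesized mean, which is the direction of interest in a lower-tail test.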

The key question for a lower-tail test is, How small must the test statistic t be before we choose to reject the null hypothesis? We will draw our conclusion by using the value of the test statistic t to compute a probability called a p value.

A small p value indicates that the value of the test statistic is unusual given the assumption that $H_0$ is true.

p VALUE

A p value is the probability, assuming that $H_0$ is true, of obtaining a random sample of size n that results in a test statistic at least as extreme as the one observed in the current sample.

The p value measures the strength of the evidence provided by the sample against the null hypothesis. Smaller p values indicate more evidence against $H_0$, as they suggest that it is increasingly unlikely that the sample could occur if $H_0$ is true.

Let us see how the p value is computed and used. The value of the test statistic is used to compute the p value. The method used depends on whether the test is a lower-tail, an upper-tail, or a two-tailed test. For a lower-tail test, the p value is the probability of obtaining a value for the test statistic as small as or smaller than that provided by the sample. Thus, to compute the p value for the lower-tail test, we must use the t distribution to find


FIGURE 6.17 Lower-Tail Probability for $t = -3$ from a t Distribution with 35 Degrees of Freedom

[Figure: t distribution centered at 0; the area to the left of $t = -3$ is 0.0025.]

the probability that t is less than or equal to the value of the test statistic. After computing the p value, we must then decide whether it is small enough to reject the null hypothesis; as we will show, this decision involves comparing the p value to the level of significance.

Using Excel. Excel can be used to conduct one-tailed and two-tailed hypothesis tests about a population mean. The sample data and the test statistic (t) are used to compute three p values: p value (lower tail), p value (upper tail), and p value (two tail). The user can then choose $\alpha$ and draw a conclusion using whichever p value is appropriate for the type of hypothesis test being conducted.

Let’s start by showing how to use Excel’s T.DIST function to compute a lower-tail p value. The T.DIST function has three inputs; its general form is as follows:

T.DIST(test statistic, degrees of freedom, cumulative).

For the first input, we enter the value of the test statistic; for the second input we enter the degrees of freedom for the associated t distribution; for the third input, we enter TRUE to compute the cumulative probability corresponding to a lower-tail p value.

Once the lower-tail p value has been computed, it is easy to compute the upper-tail and the two-tailed p values. The upper-tail p value is 1 minus the lower-tail p value, and the two-tailed p value is two times the smaller of the lower- and upper-tail p values.
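The relationships among the three p values can also be sketched outside Excel. The Python code below is an illustration under stated assumptions, not Excel's algorithm: it approximates the cumulative t probability by integrating the t density numerically with Simpson's rule (a stand-in for T.DIST with cumulative set to TRUE), and the function names are hypothetical. It is applied to $t = -3$ with 35 degrees of freedom, the case shown in Figure 6.17.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(t, df, steps=2000):
    """Approximate P(T <= t) using Simpson's rule on [0, |t|] and symmetry."""
    a = abs(t)
    if a == 0.0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 <= T <= |t|)
    return 0.5 + area if t > 0 else 0.5 - area


def p_values(t, df):
    """Return (lower-tail, upper-tail, two-tailed) p values."""
    lower = t_cdf(t, df)           # counterpart of T.DIST(t, df, TRUE)
    upper = 1 - lower              # 1 minus the lower-tail p value
    two_tail = 2 * min(lower, upper)
    return lower, upper, two_tail


lower, upper, two_tail = p_values(-3.0, 35)
print(f"lower = {lower:.4f}, upper = {upper:.4f}, two tail = {two_tail:.4f}")
```

Only the lower-tail probability requires the t distribution; the other two p values follow from it by arithmetic, which is why the worksheet needs a single T.DIST call.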

Let us now compute the p value for the Hilltop Coffee lower-tail test. Refer to Figure 6.18 as we describe the tasks involved. The formula sheet is in the background and the value worksheet is in the foreground.

The descriptive statistics needed are provided in cells D4:D6. Excel’s COUNT, AVERAGE, and STDEV.S functions compute the sample size, the sample mean, and the sample standard deviation, respectively. The hypothesized value of the population mean (3) is entered into cell D8. Using the sample standard deviation as an estimate of the population standard deviation, an estimate of the standard error is obtained in cell D10 by dividing the sample standard deviation in cell D6 by the square root of the sample size in cell D4. The formula =(D5-D8)/D10 entered into cell D11 computes the value of the test statistic t corresponding to the calculation:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{2.92 - 3}{0.17/\sqrt{36}} = -2.824$$

The degrees of freedom are computed in cell D12 as the sample size in cell D4 minus 1. To compute the p value for a lower-tail test, we enter the following formula into cell D14.

=T.DIST(D11,D12,TRUE)

The p value for an upper-tail test is then computed in cell D15 as 1 minus the p value for the lower-tail test. Finally, the p value for a two-tailed test is computed in cell D16 as two times the minimum of the two one-tailed p values. The value worksheet shows that

CoffeeTest


the three p values are p value (lower tail) = 0.0039, p value (upper tail) = 0.9961, and p value (two tail) = 0.0078.

The development of the worksheet is now complete. Is $\bar{x} = 2.92$ small enough to lead us to reject $H_0$? Because this is a lower-tail test, the p value is the area under the t-distribution curve for values of $t \le -2.824$ (the value of the test statistic). Figure 6.19 depicts the p value for the Hilltop Coffee lower-tail test. This p value indicates a small probability of obtaining a sample mean of $\bar{x} = 2.92$ (and a test statistic of $-2.824$) or smaller when sampling from a population with $\mu = 3$. This p value does not provide much support for the null hypothesis, but is it small enough to cause us to reject $H_0$? The answer depends on the level of significance ($\alpha$) the decision maker has selected for the test.

Note that the p value can be considered a measure of the strength of the evidence against the null hypothesis that is contained in the sample data. The greater the inconsistency between the sample data and the null hypothesis, the smaller the p value will be; thus, a smaller p value indicates that it is less plausible that the sample could have been collected from a population for which the null hypothesis is true. That is, a smaller p value indicates that the sample provides stronger evidence against the null hypothesis.

As noted previously, the director of the FTC’s testing program selected a value of 0.01 for the level of significance. The selection of $\alpha = 0.01$ means that the director is willing to tolerate a probability of 0.01 of rejecting the null hypothesis when it is true as an equality ($\mu_0 = 3$). The sample of 36 coffee cans in the Hilltop Coffee study resulted in a p value of 0.0039, which means that the probability of obtaining a value of $\bar{x} = 2.92$ or less when

FIGURE 6.18 Hypothesis Test about a Population Mean

Cell   Label                        Formula                   Value
D4     Sample Size                  =COUNT(A2:A37)            36
D5     Sample Mean                  =AVERAGE(A2:A37)          2.92
D6     Sample Standard Deviation    =STDEV.S(A2:A37)          0.170
D8     Hypothesized Value           3                         3
D10    Standard Error               =D6/SQRT(D4)              0.028
D11    Test Statistic t             =(D5-D8)/D10              -2.824
D12    Degrees of Freedom           =D4-1                     35
D14    p value (Lower Tail)         =T.DIST(D11,D12,TRUE)     0.0039
D15    p value (Upper Tail)         =1-D14                    0.9961
D16    p value (Two Tail)           =2*MIN(D14,D15)           0.0078

Column A (Weight), cells A2:A37: 3.15, 2.76, 3.18, 2.77, 2.86, 2.66, 2.86, 2.54, 3.02, 3.13, 2.94, 2.74, 2.84, 2.60, 2.94, 2.93, 3.18, 2.95, 2.86, 2.91, 2.96, 3.14, 2.65, 2.77, 2.96, 3.10, 2.82, 3.05, 2.94, 2.82, 3.21, 3.11, 2.90, 3.05, 2.93, 2.89


the null hypothesis is true is 0.0039. Because 0.0039 is less than or equal to $\alpha = 0.01$, we reject $H_0$. Therefore, we find sufficient statistical evidence to reject the null hypothesis at the 0.01 level of significance.

The level of significance $\alpha$ indicates the strength of evidence that is needed in the sample data before we will reject the null hypothesis. If the p value is smaller than the selected level of significance $\alpha$, the evidence against the null hypothesis that is contained in the sample data is sufficiently strong for us to reject the null hypothesis; that is, we believe that it is implausible that the sample data were collected from a population for which $H_0\colon \mu \ge 3$ is true. Conversely, if the p value is larger than the selected level of significance $\alpha$, the evidence against the null hypothesis that is contained in the sample data is not sufficiently strong for us to reject the null hypothesis; that is, we believe that it is plausible that the sample data were collected from a population for which the null hypothesis is true.

We can now state the general rule for determining whether the null hypothesis can be rejected when using the p value approach. For a level of significance $\alpha$, the rejection rule using the p value approach is as follows.

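FIGURE 6.19 p Value for the Hilltop Coffee Study When $\bar{x} = 2.92$ and $s = 0.17$

[Figure: the sampling distribution of $\bar{x}$, centered at $\mu_0 = 3$ with $s_{\bar{x}} = s/\sqrt{n} = 0.17/\sqrt{36} = 0.028$ and $\bar{x} = 2.92$ marked, shown above the sampling distribution of $t = (\bar{x} - 3)/0.028$, centered at 0. The p value of 0.0039 is the area to the left of $t = -2.824$.]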

REJECTION RULE

Reject $H_0$ if p value $\le \alpha$

In the Hilltop Coffee test, the p value of 0.0039 resulted in the rejection of the null hypothesis. Although the basis for making the rejection decision involves a comparison of the p value to the level of significance specified by the FTC director, the observed p value of 0.0039 means that we would reject $H_0$ for any value of $\alpha \ge 0.0039$. For this reason, the p value is also called the observed level of significance.

Different decision makers may express different opinions concerning the cost of making a Type I error and may choose a different level of significance. By providing the p value as part of the hypothesis testing results, another decision maker can compare the reported p value to his or her own level of significance and possibly make a different decision with respect to rejecting $H_0$.
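To illustrate how a single reported p value supports different decisions, the short Python sketch below applies the rejection rule to the Hilltop Coffee p value of 0.0039 at several levels of significance. Only 0.01 comes from the text; the other levels and the function name are assumptions chosen for illustration.

```python
def reject_null(p_value, alpha):
    """Rejection rule: reject H0 when the p value is less than or equal to alpha."""
    return p_value <= alpha


p_value = 0.0039  # Hilltop Coffee lower-tail p value reported in the text

# 0.01 is the FTC director's choice; the other levels are hypothetical
decisions = {alpha: reject_null(p_value, alpha) for alpha in (0.10, 0.05, 0.01, 0.001)}

for alpha, reject in decisions.items():
    print(f"alpha = {alpha}: {'reject H0' if reject else 'do not reject H0'}")
```

A decision maker who insists on a level of significance below 0.0039 would not reject the null hypothesis with these data, whereas one using 0.01 or larger would.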


We used the Hilltop Coffee study to illustrate how to conduct a lower-tail test. We can use the same general approach to conduct an upper-tail test. The test statistic t is still computed using equation (6.11). But, for an upper-tail test, the p value is the probability of obtaining a value for the test statistic as large as or larger than that provided by the sample. Thus, to compute the p value for the upper-tail test, we must use the t distribution to compute the probability that t is greater than or equal to the value of the test statistic. Then, according to the rejection rule, we will reject the null hypothesis if the p value is less than or equal to the level of significance $\alpha$.

Let us summarize the steps involved in computing p values for one-tailed hypothesis tests.

COMPUTATION OF p VALUES FOR ONE-TAILED TESTS

1. Compute the value of the test statistic using equation (6.11).
2. Lower-tail test: Using the t distribution, compute the probability that t is less than or equal to the value of the test statistic (area in the lower tail).
3. Upper-tail test: Using the t distribution, compute the probability that t is greater than or equal to the value of the test statistic (area in the upper tail).

Two-Tailed Test. In hypothesis testing, the general form for a two-tailed test about a population mean is as follows:

$$
\begin{aligned}
H_0 &\colon \mu = \mu_0 \\
H_a &\colon \mu \ne \mu_0
\end{aligned}
$$

In this subsection we show how to conduct a two-tailed test about a population mean. As an illustration, we consider the hypothesis testing situation facing Holiday Toys.

Holiday Toys manufactures and distributes its products through more than 1,000 retail outlets. In planning production levels for the coming winter season, Holiday must decide how many units of each product to produce before the actual demand at the retail level is known. For this year’s most important new toy, Holiday’s marketing director is expecting demand to average 40 units per retail outlet. Prior to making the final production decision based on this estimate, Holiday decided to survey a sample of 25 retailers to gather more information about demand for the new product. Each retailer was provided with information about the features of the new toy along with the cost and the suggested selling price. Then each retailer was asked to specify an anticipated order quantity.

With $\mu$ denoting the population mean order quantity per retail outlet, the sample data will be used to conduct the following two-tailed hypothesis test:

$$
\begin{aligned}
H_0 &\colon \mu = 40 \\
H_a &\colon \mu \ne 40
\end{aligned}
$$

If $H_0$ cannot be rejected, Holiday will continue its production planning based on the marketing director’s estimate that the population mean order quantity per retail outlet will be $\mu = 40$ units. However, if $H_0$ is rejected, Holiday will immediately reevaluate its production plan for the product. A two-tailed hypothesis test is used because Holiday wants to reevaluate the production plan regardless of whether the population mean quantity per retail outlet is less than anticipated or is greater than anticipated. Because it’s a new

Orders

At the beginning of this section, we said that one-tailed tests about a population mean take one of the following two forms:

Lower-Tail Test: $H_0\colon \mu \ge \mu_0$, $H_a\colon \mu < \mu_0$

Upper-Tail Test: $H_0\colon \mu \le \mu_0$, $H_a\colon \mu > \mu_0$


product and, therefore, no historical data are available, the population mean $\mu$ and the population standard deviation must both be estimated using $\bar{x}$ and $s$ from the sample data.

The sample of 25 retailers provided a mean of $\bar{x} = 37.4$ and a standard deviation of $s = 11.79$ units. Before going ahead with the use of the t distribution, the analyst constructed a histogram of the sample data in order to check on the form of the population distribution. The histogram of the sample data showed no evidence of skewness or any extreme outliers, so the analyst concluded that the use of the t distribution with $n - 1 = 24$ degrees of freedom was appropriate. Using equation (6.11) with $\bar{x} = 37.4$, $\mu_0 = 40$, $s = 11.79$, and $n = 25$, the value of the test statistic is

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{37.4 - 40}{11.79/\sqrt{25}} = -1.10$$

The sample mean $\bar{x} = 37.4$ is less than 40 and so provides some support for the conclusion that the population mean quantity per retail outlet is less than 40 units, but this could possibly be due to sampling error. We must address whether the difference between this sample mean and our hypothesized mean is sufficient for us to reject $H_0$ at the 0.05 level of significance. We will again reach our conclusion by calculating a p value.

Recall that the p value is a probability used to determine whether the null hypothesis should be rejected. For a two-tailed test, values of the test statistic in either tail provide evidence against the null hypothesis. For a two-tailed test the p value is the probability of obtaining a value for the test statistic at least as unlikely as the value of the test statistic calculated with the sample given that the null hypothesis is true. Let us see how the p value is computed for the two-tailed Holiday Toys hypothesis test.

To compute the p value for this problem, we must find the probability of obtaining a value for the test statistic at least as unlikely as $t = -1.10$ if the population mean is actually 40. Clearly, values of $t \le -1.10$ are at least as unlikely. But because this is a two-tailed test, all values that are more than 1.10 standard errors from the hypothesized value $\mu_0$ in either direction provide evidence against the null hypothesis that is at least as strong as the evidence against the null hypothesis contained in the sample data. As shown in Figure 6.20, the two-tailed p value in this case is given by $P(t \le -1.10) + P(t \ge 1.10)$.

To compute the tail probabilities, we apply the Excel template introduced in the Hilltop Coffee example to the Holiday Toys data. Figure 6.21 displays the formula worksheet in the background and the value worksheet in the foreground.

FIGURE 6.20 p Value for the Holiday Toys Two-Tailed Hypothesis Test

[Figure: t distribution centered at 0 with both tails shaded; $P(t \le -1.10) = 0.1406$, $P(t \ge 1.10) = 0.1406$, and p value = 2(0.1406) = 0.2812.]


To complete the two-tailed Holiday Toys hypothesis test, we compare the two-tailed p value to the level of significance to see whether the null hypothesis should be rejected. With a level of significance of $\alpha = 0.05$, we do not reject $H_0$ because the two-tailed p value $= 0.2811 > 0.05$. This result indicates that Holiday should continue its production planning for the coming season based on the expectation that $\mu = 40$.

The OrdersTest worksheet in Figure 6.21 can be used as a template for any hypothesis tests about a population mean. To facilitate the use of this worksheet, the formulas in cells D4:D6 reference the entire column A as follows:

Cell D4: =COUNT(A:A)

Cell D5: =AVERAGE(A:A)

Cell D6: =STDEV.S(A:A)

With the A:A method of specifying data ranges, Excel’s COUNT function will count the number of numeric values in column A, Excel’s AVERAGE function will compute the average of the numeric values in column A, and Excel’s STDEV.S function will compute the sample standard deviation of the numeric values in column A. Thus, to solve a new problem it is necessary only to enter the new data in column A and enter the hypothesized value of the population mean in cell D8. Then, the standard error, the test statistic, degrees of freedom, and the three p values will be updated by the Excel formulas.
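The template idea can be sketched in Python as a single function that takes a column of data and a hypothesized mean and returns every quantity the worksheet computes. This is an illustration under stated assumptions, not a reproduction of Excel: the function names are hypothetical, the t probabilities are approximated by numerical integration rather than by T.DIST, and the two data lists are transcribed from column A of Figures 6.18 and 6.21, so they should be checked against the Coffee and Orders files.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(t, df, steps=2000):
    """Approximate P(T <= t) using Simpson's rule on [0, |t|] and symmetry."""
    a = abs(t)
    if a == 0.0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area


def mean_test(data, hypothesized_mean):
    """Mirror worksheet cells D4:D16 for a test about a population mean."""
    n = len(data)                                   # D4
    mean = statistics.mean(data)                    # D5
    sd = statistics.stdev(data)                     # D6 (sample standard deviation)
    se = sd / math.sqrt(n)                          # D10
    t = (mean - hypothesized_mean) / se             # D11
    df = n - 1                                      # D12
    lower = t_cdf(t, df)                            # D14
    upper = 1 - lower                               # D15
    return {"n": n, "mean": mean, "sd": sd, "se": se, "t": t, "df": df,
            "p_lower": lower, "p_upper": upper,
            "p_two_tail": 2 * min(lower, upper)}    # D16


# Data transcribed from the figures (an assumption; verify against the data files)
orders = [26, 23, 32, 47, 45, 31, 47, 59, 21, 52, 45, 53, 34,
          45, 39, 52, 52, 22, 22, 33, 21, 34, 42, 30, 28]
weights = [3.15, 2.76, 3.18, 2.77, 2.86, 2.66, 2.86, 2.54, 3.02, 3.13, 2.94, 2.74,
           2.84, 2.60, 2.94, 2.93, 3.18, 2.95, 2.86, 2.91, 2.96, 3.14, 2.65, 2.77,
           2.96, 3.10, 2.82, 3.05, 2.94, 2.82, 3.21, 3.11, 2.90, 3.05, 2.93, 2.89]

holiday = mean_test(orders, hypothesized_mean=40)
hilltop = mean_test(weights, hypothesized_mean=3)

for name, result in (("Holiday Toys", holiday), ("Hilltop Coffee", hilltop)):
    print(name, {key: round(value, 4) for key, value in result.items()})
```

As with the worksheet, solving a new problem requires only a new data list and a new hypothesized mean; which of the three p values to read depends on the form of the test.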

FIGURE 6.21 Two-Tailed Hypothesis Test for Holiday Toys

Cell   Label                        Formula                   Value
D4     Sample Size                  =COUNT(A:A)               25
D5     Sample Mean                  =AVERAGE(A:A)             37.4
D6     Sample Standard Deviation    =STDEV.S(A:A)             11.79
D8     Hypothesized Value           40                        40
D10    Standard Error               =D6/SQRT(D4)              2.358
D11    Test Statistic t             =(D5-D8)/D10              -1.103
D12    Degrees of Freedom           =D4-1                     24
D14    p value (Lower Tail)         =T.DIST(D11,D12,TRUE)     0.1406
D15    p value (Upper Tail)         =1-D14                    0.8594
D16    p value (Two Tail)           =2*MIN(D14,D15)           0.2811

Column A (Units), cells A2:A26: 26, 23, 32, 47, 45, 31, 47, 59, 21, 52, 45, 53, 34, 45, 39, 52, 52, 22, 22, 33, 21, 34, 42, 30, 28

Note: Rows 18–24 are hidden in the value worksheet.

OrdersTest


Let us summarize the steps involved in computing p values for two-tailed hypothesis tests.

COMPUTATION OF p VALUES FOR TWO-TAILED TESTS

1. Compute the value of the test statistic using equation (6.11).
2. If the value of the test statistic is in the upper tail, compute the probability that t is greater than or equal to the value of the test statistic (the upper-tail area). If the value of the test statistic is in the lower tail, compute the probability that t is less than or equal to the value of the test statistic (the lower-tail area).
3. Double the probability (or tail area) from step 2 to obtain the p value.

Summary and Practical Advice. We presented examples of a lower-tail test and a two-tailed test about a population mean. Based on these examples, we can now summarize the hypothesis testing procedures about a population mean in Table 6.7. Note that $\mu_0$ is the hypothesized value of the population mean.

The hypothesis testing steps followed in the two examples presented in this section are common to every hypothesis test.

STEPS OF HYPOTHESIS TESTING

Step 1. Develop the null and alternative hypotheses.
Step 2. Specify the level of significance.
Step 3. Collect the sample data and compute the value of the test statistic.
Step 4. Use the value of the test statistic to compute the p value.
Step 5. Reject $H_0$ if the p value $\le \alpha$.
Step 6. Interpret the statistical conclusion in the context of the application.

Practical advice about the sample size for hypothesis tests is similar to the advice we provided about the sample size for interval estimation in Section 6.4. In most applications, a sample size of $n \ge 30$ is adequate when using the hypothesis testing procedure described in this section. In cases in which the sample size is less than 30, the distribution of the population from which we are sampling becomes an important consideration. When the population is normally distributed, the hypothesis tests described in this section provide exact results for any sample size. When the population is not normally distributed, these procedures provide approximations. Nonetheless, we find that sample sizes of 30 or more will provide good results in most cases. If the population is approximately normal, small sample sizes (e.g., $n = 15$) can provide acceptable results. If the population is highly skewed or contains outliers, sample sizes approaching 50 are recommended.

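TABLE 6.7 Summary of Hypothesis Tests about a Population Mean

Lower-Tail Test
  Hypotheses: $H_0\colon \mu \ge \mu_0$, $H_a\colon \mu < \mu_0$
  Test Statistic: $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$
  p Value: =T.DIST(t,n-1,TRUE)

Upper-Tail Test
  Hypotheses: $H_0\colon \mu \le \mu_0$, $H_a\colon \mu > \mu_0$
  Test Statistic: $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$
  p Value: =1-T.DIST(t,n-1,TRUE)

Two-Tailed Test
  Hypotheses: $H_0\colon \mu = \mu_0$, $H_a\colon \mu \ne \mu_0$
  Test Statistic: $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$
  p Value: =2*MIN(T.DIST(t,n-1,TRUE),1-T.DIST(t,n-1,TRUE))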


Relationship between Interval Estimation and Hypothesis Testing

In Section 6.4 we showed how to develop a confidence interval estimate of a population mean. The $100(1 - \alpha)\%$ confidence interval estimate of a population mean is given by

$$\bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}}$$

In this chapter we showed that a two-tailed hypothesis test about a population mean takes the following form:

$$
\begin{aligned}
H_0 &\colon \mu = \mu_0 \\
H_a &\colon \mu \ne \mu_0
\end{aligned}
$$

where $\mu_0$ is the hypothesized value for the population mean.

Suppose that we follow the procedure described in Section 6.4 for constructing a $100(1 - \alpha)\%$ confidence interval for the population mean. We know that $100(1 - \alpha)\%$ of the confidence intervals generated will contain the population mean and $100\alpha\%$ of the confidence intervals generated will not contain the population mean. Thus, if we reject $H_0$ whenever the confidence interval does not contain $\mu_0$, we will be rejecting the null hypothesis when it is true ($\mu = \mu_0$) with probability $\alpha$. Recall that the level of significance is the probability of rejecting the null hypothesis when it is true. So constructing a $100(1 - \alpha)\%$ confidence interval and rejecting $H_0$ whenever the interval does not contain $\mu_0$ is equivalent to conducting a two-tailed hypothesis test with $\alpha$ as the level of significance. The procedure for using a confidence interval to conduct a two-tailed hypothesis test can now be summarized.

3. To be consistent with the rule for rejecting $H_0$ when the p value $\le \alpha$, we would also reject $H_0$ using the confidence interval approach if $\mu_0$ happens to be equal to one of the endpoints of the $100(1 - \alpha)\%$ confidence interval.

A CONFIDENCE INTERVAL APPROACH TO TESTING A HYPOTHESIS OF THE FORM

$$
\begin{aligned}
H_0 &\colon \mu = \mu_0 \\
H_a &\colon \mu \ne \mu_0
\end{aligned}
$$

1. Select a simple random sample from the population and use the value of the sample mean $\bar{x}$ to develop the confidence interval for the population mean $\mu$.

$$\bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}}$$

2. If the confidence interval contains the hypothesized value $\mu_0$, do not reject $H_0$. Otherwise, reject $H_0$ (see footnote 3).

For a two-tailed hypothesis test, the null hypothesis can be rejected if the confidence interval does not include $\mu_0$.

Let us illustrate by conducting the Holiday Toys hypothesis test using the confidence interval approach. The Holiday Toys hypothesis test takes the following form:

$$
\begin{aligned}
H_0 &\colon \mu = 40 \\
H_a &\colon \mu \ne 40
\end{aligned}
$$

To test these hypotheses with a level of significance of $\alpha = 0.05$, we sampled 25 retailers and found a sample mean of $\bar{x} = 37.4$ units and a sample standard deviation of $s = 11.79$ units. Using these results with $t_{0.025}$ = T.INV(1 - (0.05/2), 25 - 1) = 2.064, we find that the 95% confidence interval estimate of the population mean is

$$\bar{x} \pm t_{0.025} \frac{s}{\sqrt{n}} = 37.4 \pm 2.064 \, \frac{11.79}{\sqrt{25}} = 37.4 \pm 4.9$$

or 32.5 to 42.3.

This finding enables Holiday’s marketing director to conclude with 95% confidence that the mean number of units per retail outlet is between 32.5 and 42.3. Because the


hypothesized value for the population mean, $\mu_0 = 40$, is in this interval, the hypothesis testing conclusion is that the null hypothesis, $H_0\colon \mu = 40$, cannot be rejected.
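The confidence interval approach can be checked with a few lines of Python. This is a sketch under stated assumptions: the critical value 2.064 is taken from the T.INV result quoted in the text rather than computed, and the variable names are illustrative. With these inputs, $2.064 \times 11.79/\sqrt{25}$ evaluates to about 4.87.

```python
import math

# Holiday Toys summary values from the text
x_bar, s, n = 37.4, 11.79, 25
mu_0 = 40
t_crit = 2.064  # t_{0.025} with 24 degrees of freedom, as given by T.INV in the text

margin = t_crit * s / math.sqrt(n)
lower, upper = x_bar - margin, x_bar + margin

# Do not reject H0 only when mu_0 lies strictly inside the interval;
# per footnote 3, a value equal to an endpoint leads to rejection.
reject = not (lower < mu_0 < upper)

print(f"margin of error = {margin:.2f}")
print(f"95% confidence interval: {lower:.1f} to {upper:.1f}")
print("reject H0" if reject else "do not reject H0")
```

The conclusion matches the p value approach: the two-tailed p value of 0.2811 exceeds 0.05, and the hypothesized value of 40 lies inside the 95% confidence interval.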

Note that this discussion and example pertain to two-tailed hypothesis tests about a population mean. However, the same confidence interval and two-tailed hypothesis testing relationship exists for other population parameters. The relationship can also be extended to one-tailed tests about population parameters. Doing so, however, requires the development of one-sided confidence intervals, which are rarely used in practice.

Hypothesis Test of the Population Proportion

In this section we show how to conduct a hypothesis test about a population proportion p. Using $p_0$ to denote the hypothesized value for the population proportion, the three forms for a hypothesis test about a population proportion are as follows:

$$
\begin{array}{lll}
H_0\colon p \ge p_0 & \quad H_0\colon p \le p_0 & \quad H_0\colon p = p_0 \\
H_a\colon p < p_0 & \quad H_a\colon p > p_0 & \quad H_a\colon p \ne p_0
\end{array}
$$

The first form is called a lower-tail test, the second an upper-tail test, and the third form a two-tailed test.

Hypothesis tests about a population proportion are based on the difference between the sample proportion $\bar{p}$ and the hypothesized population proportion $p_0$. The methods used to conduct the hypothesis test are similar to those used for hypothesis tests about a population mean. The only difference is that we use the sample proportion and its standard error to compute the test statistic. The p value is then used to determine whether the null hypothesis should be rejected.

Let us consider an example involving a situation faced by Pine Creek golf course. Over the past year, 20% of the players at Pine Creek were women. In an effort to increase the proportion of women players, Pine Creek implemented a special promotion designed to attract women golfers. One month after the promotion was implemented, the course manager requested a statistical study to determine whether the proportion of women players at Pine Creek had increased. Because the objective of the study is to determine whether the proportion of women golfers increased, an upper-tail test with $H_a\colon p > 0.20$ is appropriate. The null and alternative hypotheses for the Pine Creek hypothesis test are as follows:

$$
\begin{aligned}
H_0 &\colon p \le 0.20 \\
H_a &\colon p > 0.20
\end{aligned}
$$

If $H_0$ can be rejected, the test results will give statistical support for the conclusion that the proportion of women golfers increased and the promotion was beneficial. The course manager specified that a level of significance of $\alpha = 0.05$ be used in carrying out this hypothesis test.

The next step of the hypothesis testing procedure is to select a sample and compute the value of an appropriate test statistic. To show how this step is done for the Pine Creek upper-tail test, we begin with a general discussion of how to compute the value of the test statistic for any form of a hypothesis test about a population proportion. The sampling distribution of $\bar{p}$, the point estimator of the population parameter p, is the basis for developing the test statistic.

When the null hypothesis is true as an equality, the expected value of $\bar{p}$ equals the hypothesized value $p_0$; that is, $E(\bar{p}) = p_0$. The standard error of $\bar{p}$ is given by

$$\sigma_{\bar{p}} = \sqrt{\frac{p_0(1 - p_0)}{n}}$$

In Section 6.3 we said that if $np \ge 5$ and $n(1 - p) \ge 5$, the sampling distribution of $\bar{p}$ can be approximated by a normal distribution (see footnote 4). Under these conditions, which usually apply in practice, the quantity

$$z = \frac{\bar{p} - p_0}{\sigma_{\bar{p}}} \tag{6.12}$$

4. In most applications involving hypothesis tests of a population proportion, sample sizes are large enough to use the normal approximation. The exact sampling distribution of $\bar{p}$ is discrete, with the probability for each value of $\bar{p}$ given by the binomial distribution. So hypothesis testing is a bit more complicated for small samples when the normal approximation cannot be used.


has a standard normal probability distribution. With $\sigma_{\bar{p}} = \sqrt{p_0(1 - p_0)/n}$, the standard normal random variable z is the test statistic used to conduct hypothesis tests about a population proportion.

We can now compute the test statistic for the Pine Creek hypothesis test. Suppose a random sample of 400 players was selected, and that 100 of the players were women. The proportion of women golfers in the sample is

$$\bar{p} = \frac{100}{400} = 0.25$$

Using equation (6.13), the value of the test statistic is

$$z = \frac{\bar{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}} = \frac{0.25 - 0.20}{\sqrt{\dfrac{0.20(1 - 0.20)}{400}}} = \frac{0.05}{0.02} = 2.50$$

Because the Pine Creek hypothesis test is an upper-tail test, the p value is the probability of obtaining a value for the test statistic that is greater than or equal to $z = 2.50$; that is, it is the upper-tail area corresponding to $z \ge 2.50$ as displayed in Figure 6.22. The Excel formula =1-NORM.S.DIST(2.5, TRUE) computes this upper-tail area of 0.0062.

Recall that the course manager specified a level of significance of $\alpha = 0.05$. A p value $= 0.0062 < 0.05$ gives sufficient statistical evidence to reject $H_0$ at the 0.05 level of significance. Thus, the test provides statistical support for the conclusion that the special promotion increased the proportion of women players at the Pine Creek golf course.
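The Pine Creek calculation can be sketched in Python using the standard library's normal distribution. The function name is hypothetical, and evaluating the conditions $np \ge 5$ and $n(1 - p) \ge 5$ at the hypothesized proportion $p_0$ is an assumption made here for the purpose of the test.

```python
import math
from statistics import NormalDist


def proportion_upper_tail_test(successes, n, p0):
    """Upper-tail z test about a population proportion, following equation (6.13)."""
    # Normal-approximation check, evaluated at p0 (an assumption of this sketch)
    if n * p0 < 5 or n * (1 - p0) < 5:
        raise ValueError("sample too small for the normal approximation")
    p_bar = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_bar - p0) / standard_error
    p_value = 1 - NormalDist().cdf(z)  # counterpart of =1-NORM.S.DIST(z, TRUE)
    return p_bar, z, p_value


# Pine Creek: 100 women in a sample of 400 players, hypothesized proportion 0.20
p_bar, z, p_value = proportion_upper_tail_test(successes=100, n=400, p0=0.20)
alpha = 0.05
print(f"p_bar = {p_bar:.2f}, z = {z:.2f}, p value = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "do not reject H0")
```

Note that the standard error uses the hypothesized proportion $p_0$, not the sample proportion, because the sampling distribution is constructed under the assumption that the null hypothesis is true as an equality.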

Using Excel

Excel can be used to conduct one-tailed and two-tailed hypothesis tests about a population proportion using the p value approach. The procedure is similar to the approach used with Excel in conducting hypothesis tests about a population mean. The primary difference is that the test statistic is based on the sampling distribution of $\bar{x}$ for hypothesis tests about a population mean and on the sampling distribution of $\bar{p}$ for

The Excel formula =NORM.S.DIST(z, TRUE) computes the area under the standard normal distribution curve that is less than or equal to the value z.

TEST STATISTIC FOR HYPOTHESIS TESTS ABOUT A POPULATION PROPORTION

$$z = \frac{\bar{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}} \qquad (6.13)$$

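FIGURE 6.22 Calculation of the p Value for the Pine Creek Hypothesis Test

[Figure: standard normal curve with an area of 0.9938 to the left of z = 2.5 and the upper-tail area, p value = P(z ≥ 2.50) = 0.0062, to the right.]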

WomenGolf


WomenGolfTest

hypothesis tests about a population proportion. Thus, although different formulas are used to compute the test statistic and the p value needed to make the hypothesis testing decision, the logical process is identical.

We will illustrate the procedure by showing how Excel can be used to conduct the upper-tail hypothesis test for the Pine Creek golf course study. Refer to Figure 6.23 as we describe the tasks involved. The formula worksheet is on the left; the value worksheet is on the right.

The descriptive statistics needed are provided in cells D3, D5, and D6. Because the data are not numeric, Excel’s COUNTA function, not the COUNT function, is used in cell D3 to determine the sample size. We entered Female in cell D4 to identify the response for which we wish to compute a proportion. The COUNTIF function is then used in cell D5 to determine the number of responses of the type identified in cell D4. The sample proportion is then computed in cell D6 by dividing the response count by the sample size.

The hypothesized value of the population proportion (0.20) is entered into cell D8. The standard error is obtained in cell D10 by entering the formula =SQRT(D8*(1-D8)/D3). The formula =(D6-D8)/D10 entered into cell D11 computes the test statistic z according to equation (6.13). To compute the p value for a lower-tail test, we enter the formula =NORM.S.DIST(D11, TRUE) into cell D13. The p value for an upper-tail test is then computed in cell D14 as 1 minus the p value for the lower-tail test. Finally, the p value for a two-tailed test is computed in cell D15 as two times the minimum of the two one-tailed p values. The value worksheet shows that the three p values are as follows: p value (lower tail) = 0.9938, p value (upper tail) = 0.0062, and p value (two tail) = 0.0124.

The development of the worksheet is now complete. For the Pine Creek upper-tail hypothesis test, we reject the null hypothesis that the population proportion is 0.20 or less because the upper-tail p value of 0.0062 is less than $\alpha = 0.05$. Indeed, with this p value we would reject the null hypothesis for any level of significance of 0.0062 or greater.

The procedure used to conduct a hypothesis test about a population proportion is similar to the procedure used to conduct a hypothesis test about a population mean. Although we illustrated how to conduct a hypothesis test about a population proportion only for an upper-tail test, similar procedures can be used for lower-tail and two-tailed tests. Table 6.8 provides a summary of the hypothesis tests about a population proportion for the case that $np \geq 5$ and $n(1 - p) \geq 5$ (and thus the normal probability distribution can be used to approximate the sampling distribution of $\bar{p}$).

The worksheet in Figure 6.23 can be used as a template for hypothesis tests about a population proportion whenever $np \geq 5$ and $n(1 - p) \geq 5$. Just enter the appropriate data in column A, adjust the ranges for the formulas in cells D3 and D5, enter the appropriate response in cell D4, and enter the hypothesized value in cell D8. The standard error, the test statistic, and the three p values will then appear. Depending on the form of the hypothesis test (lower-tail, upper-tail, or two-tailed), we can then choose the appropriate p value to make the rejection decision.
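The same template can be sketched in Python for readers who keep their responses in a list rather than in column A. This is an illustrative sketch only: the function name `proportion_test_p_values` and the result keys are hypothetical, `statistics.NormalDist().cdf` plays the role of NORM.S.DIST(z, TRUE), and the sample data are generated to match the Pine Creek counts rather than read from the WomenGolf file.

```python
import math
from statistics import NormalDist


def proportion_test_p_values(responses, response_of_interest, p0):
    """Mirror cells D3:D15 of the worksheet for a list of categorical responses."""
    n = len(responses)  # D3: sample size (COUNTA also skips empty cells)
    # D5: count for response (unlike COUNTIF, this comparison is case-sensitive)
    count = sum(1 for r in responses if r == response_of_interest)
    p_bar = count / n                            # D6: sample proportion
    std_error = math.sqrt(p0 * (1 - p0) / n)     # D10: standard error
    z = (p_bar - p0) / std_error                 # D11: test statistic
    p_lower = NormalDist().cdf(z)                # D13: lower-tail p value
    p_upper = 1 - p_lower                        # D14: upper-tail p value
    p_two = 2 * min(p_lower, p_upper)            # D15: two-tailed p value
    return {"z": z, "p_lower": p_lower, "p_upper": p_upper, "p_two_tail": p_two}


# Hypothetical data with the same counts as the Pine Creek sample.
golfers = ["Female"] * 100 + ["Male"] * 300
results = proportion_test_p_values(golfers, "Female", 0.20)
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

As with the worksheet, all three p values are returned and the analyst picks the one that matches the form of the test.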

FIGURE 6.23 Hypothesis Test for Pine Creek Golf Course

[Figure: formula worksheet (left) and value worksheet (right). Column A holds the Golfer responses (Female or Male) in cells A2:A401; columns C and D hold the hypothesis test about a population proportion, summarized below.]

| Cell | Label | Formula | Value |
|---|---|---|---|
| D3 | Sample Size | =COUNTA(A2:A401) | 400 |
| D4 | Response of Interest | Female | Female |
| D5 | Count for Response | =COUNTIF(A2:A401,D4) | 100 |
| D6 | Sample Proportion | =D5/D3 | 0.25 |
| D8 | Hypothesized Value | 0.2 | 0.20 |
| D10 | Standard Error | =SQRT(D8*(1-D8)/D3) | 0.02 |
| D11 | Test Statistic z | =(D6-D8)/D10 | 2.5000 |
| D13 | p value (Lower Tail) | =NORM.S.DIST(D11,TRUE) | 0.9938 |
| D14 | p value (Upper Tail) | =1-D13 | 0.0062 |
| D15 | p value (Two Tail) | =2*MIN(D13,D14) | 0.0124 |


6.6 Big Data, Statistical Inference, and Practical Significance

As stated earlier in this chapter, the purpose of statistical inference is to use sample data to quickly and inexpensively gain insight into some characteristic of a population. Therefore, it is important that we can expect the sample to look like, or be representative of, the population that is being investigated. In practice, individual samples always, to varying degrees, fail to be perfectly representative of the populations from which they have been taken. There are two general reasons a sample may fail to be representative of the population of interest: sampling error and nonsampling error.

Sampling Error

One reason a sample may fail to represent the population from which it has been taken is sampling error, or deviation of the sample from the population that results from random sampling. If repeated independent random samples of the same size are collected from the population of interest using a probability sampling technique, on average the samples will be representative of the population. This is the justification for collecting sample data randomly. However, the random collection of sample data does not ensure that any single

TABLE 6.8 Summary of Hypothesis Tests about a Population Proportion

| | Lower-Tail Test | Upper-Tail Test | Two-Tailed Test |
|---|---|---|---|
| Hypotheses | $H_0: p \geq p_0$, $H_a: p < p_0$ | $H_0: p \leq p_0$, $H_a: p > p_0$ | $H_0: p = p_0$, $H_a: p \neq p_0$ |
| Test Statistic | $z = \dfrac{\bar{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$ | $z = \dfrac{\bar{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$ | $z = \dfrac{\bar{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$ |
| p Value | =NORM.S.DIST(z, TRUE) | =1-NORM.S.DIST(z, TRUE) | =2*MIN(NORM.S.DIST(z, TRUE), 1-NORM.S.DIST(z, TRUE)) |

NOTES + COMMENTS

1. We have shown how to use p values. The smaller the p value, the stronger the evidence in the sample data against $H_0$ and the stronger the evidence in favor of $H_a$. Here are guidelines that some statisticians suggest for interpreting small p values:

• Less than 0.01: Overwhelming evidence to conclude that $H_a$ is true
• Between 0.01 and 0.05: Strong evidence to conclude that $H_a$ is true
• Between 0.05 and 0.10: Weak evidence to conclude that $H_a$ is true
• Greater than 0.10: Insufficient evidence to conclude that $H_a$ is true

2. The procedures for testing hypotheses about the mean that are discussed in this chapter are reliable unless the sample size is small and the population is highly skewed or contains outliers. In these cases, a nonparametric approach such as the sign test can be used. Under these conditions the results of nonparametric tests are more reliable than the hypothesis testing procedures discussed in this chapter. However, this increased reliability comes with a cost; if the sample is large or the population is relatively normally distributed, a nonparametric approach will also reject false null hypotheses less frequently.

3. We have discussed only procedures for testing hypotheses about the mean or proportion of a single population. There are many statistical procedures for testing hypotheses about multiple means or proportions. There are also many statistical procedures for testing hypotheses about parameters other than the population mean or the population proportion.


sample will be perfectly representative of the population of interest; when collecting a sample randomly, the data in the sample cannot be expected to be perfectly representative of the population from which it has been taken. Sampling error is unavoidable when collecting a random sample; this is a risk we must accept when we choose to collect a random sample rather than incur the costs associated with taking a census of the population.

As expressed by equations (6.2) and (6.5), the standard errors of the sampling distributions of the sample mean $\bar{x}$ and the sample proportion $\bar{p}$ reflect the potential for sampling error when using sample data to estimate the population mean $\mu$ and the population proportion $p$, respectively. As the sample size n increases, the potential impact of extreme values on the statistic decreases, so there is less variation in the potential values of the statistic produced by the sample and the standard errors of these sampling distributions decrease. Because these standard errors reflect the potential for sampling error when using sample data to estimate the population mean $\mu$ and the population proportion $p$, we see that for an extremely large sample there may be little potential for sampling error.

Nonsampling Error

Although the standard error of a sampling distribution decreases as the sample size n increases, this does not mean that we can conclude that an extremely large sample will always provide reliable information about the population of interest; this is because sampling error is not the sole reason a sample may fail to represent the target population. Deviations of the sample from the population that occur for reasons other than random sampling are referred to as nonsampling error. Nonsampling error can occur for a variety of reasons.

Consider the online news service PenningtonDailyTimes.com (PDT). Because PDT’s primary source of revenue is the sale of advertising, the news service is intent on collecting sample data on the behavior of visitors to its web site in order to support its advertising sales. Prospective advertisers are willing to pay a premium to advertise on websites that have long visit times, so PDT’s management is keenly interested in the amount of time customers spend during their visits to PDT’s web site. Advertisers are also concerned with how frequently visitors to a web site click on any of the ads featured on the web site, so PDT is also interested in whether visitors to its web site clicked on any of the ads featured on PenningtonDailyTimes.com.

From whom should PDT collect its data? Should it collect data on current visits to PenningtonDailyTimes.com? Should it attempt to attract new visitors and collect data on these visits? If so, should it measure the time spent at its web site by visitors it has attracted from competitors’ websites or visitors who do not routinely visit online news sites? The answers to these questions depend on PDT’s research objectives. Is the company attempting to evaluate its current market, assess the potential of customers it can attract from competitors, or explore the potential of an entirely new market such as individuals who do not routinely obtain their news from online news services? If the research objective and the population from which the sample is to be drawn are not aligned, the data that PDT collects will not help the company accomplish its research objective. This type of error is referred to as a coverage error.

Even when the sample is taken from the appropriate population, nonsampling error can occur when segments of the target population are systematically underrepresented or overrepresented in the sample. This may occur because the study design is flawed or because some segments of the population are either more likely or less likely to respond. Suppose PDT implements a pop-up questionnaire that opens when a visitor leaves PenningtonDailyTimes.com. Visitors to PenningtonDailyTimes.com who have installed pop-up blockers will likely be underrepresented, and visitors to PenningtonDailyTimes.com who have not installed pop-up blockers will likely be overrepresented. If the behavior of PenningtonDailyTimes.com visitors who have installed pop-up blockers differs from the behavior of PenningtonDailyTimes.com visitors who have not installed pop-up blockers,

Nonsampling error can occur in a sample or a census.


attempting to draw conclusions from this sample about how all visitors to the PDT web site behave may be misleading. This type of error is referred to as a nonresponse error.

Another potential source of nonsampling error is incorrect measurement of the characteristic of interest. If PDT asks questions that are ambiguous or difficult for respondents to understand, the responses may not accurately reflect how the respondents intended to respond. For example, respondents may be unsure how to respond if PDT asks “Are the news stories on PenningtonDailyTimes.com compelling and accurate?” How should a visitor respond if she or he feels the news stories on PenningtonDailyTimes.com are compelling but erroneous? What response is appropriate if the respondent feels the news stories on PenningtonDailyTimes.com are accurate but dull? A similar issue can arise if a question is asked in a biased or leading way. If PDT asks “Many readers find the news stories on PenningtonDailyTimes.com to be compelling and accurate. Do you find the news stories on PenningtonDailyTimes.com to be compelling and accurate?”, the qualifying statement PDT makes prior to the actual question will likely result in a bias toward positive responses. Incorrect measurement of the characteristic of interest can also occur when respondents provide incorrect answers; this may be due to a respondent’s poor recall or unwillingness to respond honestly. This type of error is referred to as a measurement error.

Nonsampling error can introduce bias into the estimates produced using the sample, and this bias can mislead decision makers who use the sample data in their decision-making processes. No matter how small or large the sample, we must contend with this limitation of sampling whenever we use sample data to gain insight into a population of interest. Although sampling error decreases as the size of the sample increases, an extremely large sample can still suffer from nonsampling error and fail to be representative of the population of interest. When sampling, care must be taken to ensure that we minimize the introduction of nonsampling error into the data collection process. This can be done by carrying out the following steps:

• Carefully define the target population before collecting sample data, and subsequently design the data collection procedure so that a probability sample is drawn from this target population.

• Carefully design the data collection process and train the data collectors.

• Pretest the data collection procedure to identify and correct for potential sources of nonsampling error prior to final data collection.

• Use stratified random sampling when population-level information about an important qualitative variable is available to ensure that the sample is representative of the population with respect to that qualitative characteristic.

• Use cluster sampling when the population can be divided into heterogeneous subgroups or clusters.

• Use systematic sampling when population-level information about an important quantitative variable is available to ensure that the sample is representative of the population with respect to that quantitative characteristic.

Finally, recognize that every random sample (even an extremely large random sample) will suffer from some degree of sampling error, and eliminating all potential sources of nonsampling error may be impractical. Understanding these limitations of sampling will enable us to be more realistic and pragmatic when interpreting sample data and using sample data to draw conclusions about the target population.
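The difference between the two kinds of error can be illustrated with a small simulation. The sketch below is purely hypothetical: the share of visitors with pop-up blockers (40%), the two group means (100 and 70 seconds), and the standard deviation (20 seconds) are invented for illustration and are not PDT data. It compares a small probability sample with a very large sample that, like the pop-up questionnaire described earlier, never reaches visitors who use pop-up blockers.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical population of visit times (seconds): visitors with pop-up
# blockers are assumed to stay longer than visitors without them.
BLOCKER_SHARE = 0.4
MEAN_WITH_BLOCKER = 100.0
MEAN_WITHOUT_BLOCKER = 70.0
SIGMA = 20.0
TRUE_MEAN = (BLOCKER_SHARE * MEAN_WITH_BLOCKER
             + (1 - BLOCKER_SHARE) * MEAN_WITHOUT_BLOCKER)


def draw_visit_time(uses_blocker):
    """Simulate one visit time for a visitor with or without a pop-up blocker."""
    mean = MEAN_WITH_BLOCKER if uses_blocker else MEAN_WITHOUT_BLOCKER
    return random.gauss(mean, SIGMA)


# Small probability sample: every visitor has the same chance of selection.
small_random_sample = [
    draw_visit_time(random.random() < BLOCKER_SHARE) for _ in range(100)
]

# Very large sample from a pop-up questionnaire: visitors with blockers
# never see it, so they are missing from the sample entirely.
large_biased_sample = [draw_visit_time(False) for _ in range(100_000)]

random_error = abs(statistics.fmean(small_random_sample) - TRUE_MEAN)
biased_error = abs(statistics.fmean(large_biased_sample) - TRUE_MEAN)
print(f"true mean: {TRUE_MEAN:.1f}")
print(f"error of random sample (n = 100): {random_error:.2f}")
print(f"error of biased sample (n = 100,000): {biased_error:.2f}")
```

Under these assumptions the large sample has a tiny standard error yet stays roughly 12 seconds away from the true mean, because increasing the sample size reduces sampling error but does nothing about the nonsampling error built into the design.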

Big Data

Recent estimates state that approximately 2.5 quintillion bytes of data are created worldwide each day. This represents a dramatic increase from the estimated 100 gigabytes (GB) of data generated worldwide per day in 1992, the 100 GB of data generated worldwide per hour in 1997, and the 100 GB of data generated worldwide per second in 2002. Every

Errors that are introduced by interviewers or during the recording and preparation of the data are other types of nonsampling error. These types of error are referred to as interviewer errors and processing errors, respectively.


minute, there is an average of 216,000 Instagram posts, 204,000,000 e-mails sent, 12 hours of footage uploaded to YouTube, and 277,000 tweets posted on Twitter. Without question, the amount of data that is now generated is overwhelming, and this trend is certainly expected to continue.

In each of these cases the data sets that are generated are so large or complex that current data processing capacity and/or analytic methods are not adequate for analyzing the data. Thus, each is an example of big data. There are myriad other sources of big data. Sensors and mobile devices transmit enormous amounts of data. Internet activities, digital processes, and social media interactions also produce vast quantities of data.

The amount of data has increased so rapidly that our vocabulary for describing a data set by its size must expand. A few years ago, a petabyte of data seemed almost unimaginably large, but we now routinely describe data in terms of yottabytes. Table 6.9 summarizes terminology for describing the size of data sets.

Understanding What Big Data Is

The processes that generate big data can be described by four attributes or dimensions that are referred to as the four V’s:

• Volume: the amount of data generated
• Variety: the diversity in types and structures of data generated
• Veracity: the reliability of the data generated
• Velocity: the speed at which the data are generated

A high degree of any of these attributes individually is sufficient to generate big data, and when they occur at high levels simultaneously the resulting amount of data can be overwhelmingly large. Technological advances and improvements in electronic (and often automated) data collection make it easy to collect millions, or even billions, of observations in a relatively short time. Businesses are collecting greater volumes of an increasing variety of data at a higher velocity than ever.

To understand the challenges presented by big data, we consider its structural dimensions. Big data can be tall data: a data set that has so many observations that traditional statistical inference has little meaning. For example, producers of consumer goods collect information on the sentiment expressed in millions of social media posts each day to better understand consumer perceptions of their products. Such data consist of the sentiment expressed (the variable) in millions (or over time, even billions) of social media posts (the observations). Big data can also be wide data: a data set that has so many variables that simultaneous consideration of all variables is infeasible. For example, a high-resolution image can comprise millions or billions of pixels. Facial recognition algorithms consider each pixel in an image when comparing an image to other images in an attempt to find a match. Thus, these algorithms make use of the characteristics of millions

TABLE 6.9 Terminology for Describing the Size of Data Sets

| Number of Bytes | Metric | Name |
|---|---|---|
| $1000^1$ | kB | kilobyte |
| $1000^2$ | MB | megabyte |
| $1000^3$ | GB | gigabyte |
| $1000^4$ | TB | terabyte |
| $1000^5$ | PB | petabyte |
| $1000^6$ | EB | exabyte |
| $1000^7$ | ZB | zettabyte |
| $1000^8$ | YB | yottabyte |
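To relate the units in Table 6.9 to the figures quoted earlier, the following short Python sketch converts a raw byte count into the largest metric unit that keeps the value at or above 1. The helper name `describe_bytes` is hypothetical and the conversion simply divides by 1000 repeatedly, following the powers of 1000 in the table.

```python
UNITS = ["kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def describe_bytes(num_bytes):
    """Express a byte count in the largest unit from Table 6.9 that keeps the value >= 1."""
    value = float(num_bytes)
    unit = "bytes"
    for name in UNITS:
        if value < 1000:
            break
        value /= 1000   # each step up the table is a factor of 1000
        unit = name
    return f"{value:g} {unit}"


print(describe_bytes(2.5e18))   # 2.5 quintillion bytes created per day
print(describe_bytes(100e9))    # the 100 GB benchmark quoted for 1992, 1997, and 2002
```

On this scale, 2.5 quintillion bytes per day corresponds to 2.5 exabytes per day.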


or billions of pixels (the variables) for relatively few high-resolution images (the observations). Of course, big data can be both tall and wide, and the resulting data set can again be overwhelmingly large.

Statistics are useful tools for understanding the information embedded in a big data set, but we must be careful when using statistics to analyze big data. It is important that we understand the limitations of statistics when applied to big data and we temper our interpretations accordingly. Because tall data are the most common form of big data used in business, we focus on this structure in the discussions throughout the remainder of this section.

Big Data and Sampling Error

Let’s revisit the data collection problem of online news service PenningtonDailyTimes.com (PDT). Because PDT’s primary source of revenue is the sale of advertising, PDT’s management is interested in the amount of time customers spend during their visits to PDT’s web site. From historical data, PDT has estimated that the standard deviation of the time spent by individual customers when they visit PDT’s web site is $\sigma = 20$ seconds. Table 6.10 shows how the standard error of the sampling distribution of the sample mean time spent by individual customers when they visit PDT’s web site decreases as the sample size increases.

PDT also wants to collect information from its sample respondents on whether a visitor to its web site clicked on any of the ads featured on the web site. From its historical data, PDT knows that 51% of past visitors to its web site clicked on an ad featured on the web site, so it will use this value as $p$ to estimate the standard error. Table 6.11 shows how the standard error of the sampling distribution of the proportion of the sample that clicked on any of the ads featured on PenningtonDailyTimes.com decreases as the sample size increases.

The PDT example illustrates the general relationship between standard errors and the sample size. We see in Table 6.10 that the standard error of the sample mean decreases as the sample size increases. For a sample of $n = 10$, the standard error of the sample mean is 6.32456; when we increase the sample size to $n = 100{,}000$, the standard error of the sample mean decreases to 0.06325; and at a sample size of $n = 1{,}000{,}000{,}000$, the standard error of the sample mean decreases to only 0.00063. In Table 6.11 we see that the standard error of the sample proportion also decreases as the sample size increases. For a sample of $n = 10$, the standard error of the sample proportion is 0.15808; when we increase the sample size to $n = 100{,}000$, the standard error of the sample proportion decreases to 0.00158; and at a sample size of $n = 1{,}000{,}000{,}000$, the standard error of the sample proportion decreases to only 0.00002. In both Table 6.10 and Table 6.11, the standard error when $n = 1{,}000{,}000{,}000$ is one ten-thousandth of the standard error when $n = 10$.
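The standard errors in Tables 6.10 and 6.11 follow directly from the two standard error formulas, and a few lines of Python can regenerate them. This is a sketch under the values stated in the text ($\sigma = 20$ seconds and $p = 0.51$); the function names are hypothetical.

```python
import math

SIGMA = 20.0   # standard deviation of visit time in seconds (from the text)
P = 0.51       # historical proportion of visitors who clicked on an ad


def standard_error_mean(sigma, n):
    """Standard error of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def standard_error_proportion(p, n):
    """Standard error of the sample proportion: sqrt(p(1 - p) / n)."""
    return math.sqrt(p * (1 - p) / n)


sample_sizes = [10 ** k for k in range(1, 10)]   # 10 through 1,000,000,000
print(f"{'n':>13}  {'SE mean':>8}  {'SE prop':>8}")
for n in sample_sizes:
    se_mean = standard_error_mean(SIGMA, n)
    se_prop = standard_error_proportion(P, n)
    print(f"{n:>13,}  {se_mean:>8.5f}  {se_prop:>8.5f}")
```

Because both formulas divide by $\sqrt{n}$, multiplying the sample size by 100 cuts the standard error by a factor of 10, which is why going from $n = 10$ to $n = 1{,}000{,}000{,}000$ shrinks it by a factor of 10,000.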

A sample of one million or more visitors might seem unrealistic, but keep in mind that amazon.com had over 91 million visitors in March of 2016 (quantcast.com, May 13, 2016).

TABLE 6.10 Standard Error of the Sample Mean $\bar{x}$ When $\sigma = 20$ at Various Sample Sizes n

| Sample Size n | Standard Error $\sigma_{\bar{x}} = \sigma/\sqrt{n}$ |
|---|---|
| 10 | 6.32456 |
| 100 | 2.00000 |
| 1,000 | 0.63246 |
| 10,000 | 0.20000 |
| 100,000 | 0.06325 |
| 1,000,000 | 0.02000 |
| 10,000,000 | 0.00632 |
| 100,000,000 | 0.00200 |
| 1,000,000,000 | 0.00063 |


Big Data and the Precision of Confidence Intervals

We have seen that confidence intervals are powerful tools for making inferences about population parameters, but the validity of any interval estimate depends on the quality of the data used to develop the interval estimate. No matter how large the sample is, if the sample is not representative of the population of interest, the confidence interval cannot provide useful information about the population parameter of interest. In these circumstances, statistical inference can be misleading.

A review of equations (6.7) and (6.10) shows that confidence intervals for the population mean $\mu$ and population proportion $p$ become narrower as the size of the sample increases. Therefore, the potential sampling error also decreases as the sample size increases. To illustrate the rate at which interval estimates narrow for a given confidence level, we again consider the PenningtonDailyTimes.com (PDT) example.

Recall that PDT’s primary source of revenue is the sale of advertising, and prospective advertisers are willing to pay a premium to advertise on websites that have long visit times. Suppose PDT’s management wants to develop a 95% confidence interval estimate of the mean amount of time customers spend during their visits to PDT’s web site. Table 6.12 shows how the margin of error at the 95% confidence level decreases as the sample size increases when $s = 20$.

Suppose that in addition to estimating the population mean amount of time customers spend during their visits to PDT’s web site, PDT would like to develop a 95% confidence interval estimate of the proportion of its web site visitors that click on an ad. Table 6.13 shows how the margin of error for a 95% confidence interval estimate of the population proportion decreases as the sample size increases when the sample proportion is $\bar{p} = 0.51$.

The PDT example illustrates the relationship between the precision of interval estimates and the sample size. We see in Tables 6.12 and 6.13 that at a given confidence level, the margins of error decrease as the sample sizes increase. As a result, if the sample mean time spent by customers when they visit PDT’s web site is 84.1 seconds, the 95% confidence interval estimate of the population mean time spent by customers when they visit PDT’s web site narrows from (69.79286, 98.40714) for a sample of $n = 10$ to (83.97604, 84.22396) for a sample of $n = 100{,}000$ to (84.09876, 84.10124) for a sample of $n = 1{,}000{,}000{,}000$. Similarly, if the sample proportion of its web site visitors who clicked on an ad is 0.51, the 95% confidence interval estimate of the population proportion of its web site visitors who clicked on an ad narrows from (0.20016, 0.81984) for a sample of $n = 10$ to (0.50690, 0.51310) for a sample of $n = 100{,}000$ to (0.50997, 0.51003) for a

TABLE 6.11 Standard Error of the Sample Proportion $\bar{p}$ When $p = 0.51$ at Various Sample Sizes n

| Sample Size n | Standard Error $\sigma_{\bar{p}} = \sqrt{p(1 - p)/n}$ |
|---|---|
| 10 | 0.15808 |
| 100 | 0.04999 |
| 1,000 | 0.01581 |
| 10,000 | 0.00500 |
| 100,000 | 0.00158 |
| 1,000,000 | 0.00050 |
| 10,000,000 | 0.00016 |
| 100,000,000 | 0.00005 |
| 1,000,000,000 | 0.00002 |


sample of $n = 1{,}000{,}000{,}000$. In both instances, as the sample size becomes extremely large, the margin of error becomes extremely small and the resulting confidence intervals become extremely narrow.
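The narrowing of the proportion intervals can be checked with a short Python sketch. It covers only the proportion case, where the margin of error is $z_{\alpha/2}\sigma_{\bar{p}}$; the function name `proportion_interval` is hypothetical, and `statistics.NormalDist().inv_cdf` from the standard library supplies $z_{\alpha/2}$.

```python
import math
from statistics import NormalDist


def proportion_interval(p_bar, n, confidence=0.95):
    """Return (margin_of_error, lower, upper) for a proportion interval estimate."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    margin = z * math.sqrt(p_bar * (1 - p_bar) / n)
    return margin, p_bar - margin, p_bar + margin


for n in (10, 100_000, 1_000_000_000):
    margin, lower, upper = proportion_interval(0.51, n)
    print(f"n = {n:>13,}: margin = {margin:.5f}, "
          f"interval = ({lower:.5f}, {upper:.5f})")
```

The intervals for the mean in Table 6.12 are built the same way but use $t_{\alpha/2}$ with $n - 1$ degrees of freedom, which the standard library does not provide, so they are left out of this sketch.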

Implications of Big Data for Confidence Intervals

Last year the mean time spent by all visitors to PenningtonDailyTimes.com was 84 seconds. Suppose that PDT wants to assess whether the population mean time has changed since last year. PDT now collects a new sample of 1,000,000 visitors to its web site and calculates the sample mean time spent by these visitors to the PDT web site to be $\bar{x} = 84.1$ seconds. The estimated population standard deviation is $s = 20$ seconds, so the standard error is $s_{\bar{x}} = s/\sqrt{n} = 0.02000$. Furthermore, the sample is sufficiently large to ensure that the sampling distribution of the sample mean will be normally distributed. Thus, the 95% confidence interval estimate of the population mean is

$$\bar{x} \pm t_{\alpha/2}\, s_{\bar{x}} = 84.1 \pm 0.0392 = (84.06080, 84.13920)$$

What could PDT conclude from these results? There are three possible reasons that PDT’s sample mean of 84.1 seconds differs from last year’s population mean of 84 seconds: (1) sampling error, (2) nonsampling error, or (3) the population mean has changed

TABLE 6.12 Margin of Error for Interval Estimates of the Population Mean at the 95% Confidence Level for Various Sample Sizes n

| Sample Size n | Margin of Error $t_{\alpha/2}\, s_{\bar{x}}$ |
|---|---|
| 10 | 14.30714 |
| 100 | 3.96843 |
| 1,000 | 1.24109 |
| 10,000 | 0.39204 |
| 100,000 | 0.12396 |
| 1,000,000 | 0.03920 |
| 10,000,000 | 0.01240 |
| 100,000,000 | 0.00392 |
| 1,000,000,000 | 0.00124 |

TABLE 6.13 Margin of Error for Interval Estimates of the Population Proportion at the 95% Confidence Level for Various Sample Sizes n

| Sample Size n | Margin of Error $z_{\alpha/2}\, \sigma_{\bar{p}}$ |
|---|---|
| 10 | 0.30984 |
| 100 | 0.09798 |
| 1,000 | 0.03098 |
| 10,000 | 0.00980 |
| 100,000 | 0.00310 |
| 1,000,000 | 0.00098 |
| 10,000,000 | 0.00031 |
| 100,000,000 | 0.00010 |
| 1,000,000,000 | 0.00003 |


since last year. The 95% confidence interval estimate of the population mean does not include the value for the mean time spent by all visitors to the PDT web site for last year (84 seconds), suggesting that the difference between PDT’s sample mean for the new sample (84.1 seconds) and the mean from last year (84 seconds) is not likely to be exclusively a consequence of sampling error. Nonsampling error is a possible explanation and should be investigated, as the results of statistical inference become less reliable as nonsampling error is introduced into the sample data. If PDT determines that it introduced little or no nonsampling error into its sample data, the only remaining plausible explanation for a difference of this magnitude is that the population mean has changed since last year.
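The interval for the mean and the check against last year's value can be sketched in Python as follows. One assumption is made explicit here: with 999,999 degrees of freedom, the critical value $t_{0.025}$ is replaced by the standard normal value $z_{0.025}$, which agrees with it to the precision reported in the text. The variable names are illustrative.

```python
import math
from statistics import NormalDist

x_bar = 84.1          # sample mean (seconds)
s = 20.0              # sample standard deviation (seconds)
n = 1_000_000         # sample size
mu_last_year = 84.0   # last year's population mean (seconds)

std_error = s / math.sqrt(n)
# Assumption: z_{0.025} stands in for t_{0.025} with n - 1 degrees of freedom.
critical = NormalDist().inv_cdf(0.975)
margin = critical * std_error
lower, upper = x_bar - margin, x_bar + margin
contains_last_year = lower <= mu_last_year <= upper

print(f"standard error = {std_error:.5f}")
print(f"95% interval = ({lower:.5f}, {upper:.5f})")
print(f"interval contains 84: {contains_last_year}")
```

The interval excludes 84 seconds even though the sample mean differs from it by only 0.1 second, which is the situation the surrounding discussion asks management to interpret with care.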

If PDT concludes that the sample has provided reliable evidence and the population mean has changed since last year, management must still consider the potential impact of the difference between the sample mean and the mean from last year. If a 0.1 second difference in the time spent by visitors to PenningtonDailyTimes.com has a consequential effect on what PDT can charge for advertising on its site, this result could have practical business implications for PDT. Otherwise, there may be no practical significance of the 0.1 second difference in the time spent by visitors to PenningtonDailyTimes.com.

Confidence intervals are extremely useful, but as with any other statistical tool, they are only effective when properly applied. Because interval estimates become increasingly precise as the sample size increases, extremely large samples will yield extremely precise estimates. However, no interval estimate, no matter how precise, will accurately reflect the parameter being estimated unless the sample is relatively free of nonsampling error. Therefore, when using interval estimation, it is always important to carefully consider whether a random sample of the population of interest has been taken.

Big Data, Hypothesis Testing, and p Values

We have seen that interval estimates of the population mean $\mu$ and the population proportion $p$ narrow as the sample size increases. This occurs because the standard errors of the associated sampling distributions decrease as the sample size increases. Now consider the relationship between interval estimation and hypothesis testing that we discussed earlier in this chapter. If we construct a $100(1 - \alpha)\%$ interval estimate for the population mean, we reject $H_0: \mu = \mu_0$ if the $100(1 - \alpha)\%$ interval estimate does not contain $\mu_0$. Thus, for a given level of confidence, as the sample size increases we will reject $H_0: \mu = \mu_0$ for increasingly smaller differences between the sample mean $\bar{x}$ and the hypothesized population mean $\mu_0$. We can see that when the sample size n is very large, almost any difference between the sample mean $\bar{x}$ and the hypothesized population mean $\mu_0$ results in rejection of the null hypothesis.
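The link between the interval estimate and the rejection rule can be written out as a short worked derivation. As an illustration only, it assumes a two-tailed test at $\alpha = 0.05$ and $s = 20$ seconds as in the PDT example, and it takes the margins of error for $n = 100$ and $n = 1{,}000{,}000$ from Table 6.12.

```latex
\[
\text{Reject } H_0\colon \mu = \mu_0
\iff \mu_0 \notin \left(\bar{x} - t_{\alpha/2}\,\frac{s}{\sqrt{n}},\;
                        \bar{x} + t_{\alpha/2}\,\frac{s}{\sqrt{n}}\right)
\iff |\bar{x} - \mu_0| > t_{\alpha/2}\,\frac{s}{\sqrt{n}}
\]
\[
n = 100:\quad t_{\alpha/2}\,\frac{20}{\sqrt{100}} = 3.96843
\qquad\qquad
n = 1{,}000{,}000:\quad t_{\alpha/2}\,\frac{20}{\sqrt{1{,}000{,}000}} = 0.03920
\]
```

Under these assumptions, a difference of about 4 seconds is needed to reject the null hypothesis with 100 observations, whereas a difference of about 0.04 second is enough with one million observations.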

In this section, we will elaborate on how big data affects hypothesis testing and the magnitude of p values. Specifically, we will examine how rapidly the p value associated with a given difference between a point estimate and a hypothesized value of a parameter decreases as the sample size increases.

Let us again consider the online news service PenningtonDailyTimes.com (PDT). Recall that PDT’s primary source of revenue is the sale of advertising, and prospective advertisers are willing to pay a premium to advertise on websites that have long visit times. To promote its news service, PDT’s management wants to promise potential advertisers that the mean time spent by customers when they visit PenningtonDailyTimes.com is greater than last year, that is, more than 84 seconds. PDT therefore decides to collect a sample tracking the amount of time spent by individual customers when they visit PDT’s web site in order to test its null hypothesis $H_0: \mu \le 84$.

For a sample mean of 84.1 seconds and a sample standard deviation of $s = 20$ seconds, Table 6.14 provides the values of the test statistic t and the p values for the test of the null hypothesis $H_0: \mu \le 84$. The p value for this hypothesis test is essentially 0 for all samples in Table 6.14 with at least $n = 1{,}000{,}000$.

PDT’s management also wants to promise potential advertisers that the proportion of its web site visitors who click on an ad this year exceeds the proportion of its web site visitors who clicked on an ad last year, which was 0.50. PDT collects information from its sample on whether the visitor to its web site clicked on any of the ads featured on the web site, and it wants to use these data to test its null hypothesis $H_0: p \le 0.50$.

For a sample proportion of 0.51, Table 6.15 provides the values of the test statistic z and the p values for the test of the null hypothesis $H_0: p \le 0.50$. The p value for this hypothesis test is essentially 0 for all samples in Table 6.15 with at least $n = 100{,}000$.

We see in Tables 6.14 and 6.15 that the p value associated with a given difference between a point estimate and a hypothesized value of a parameter decreases as the sample size increases. As a result, if the sample mean time spent by customers when they visit PDT’s web site is 84.1 seconds, PDT’s null hypothesis $H_0: \mu \le 84$ is not rejected at $\alpha = 0.01$ for samples with $n \le 100{,}000$, and is rejected at $\alpha = 0.01$ for samples with $n \ge 1{,}000{,}000$. Similarly, if the sample proportion of visitors to its web site who clicked on an ad featured on the web site is 0.51, PDT’s null hypothesis $H_0: p \le 0.50$ is not rejected at $\alpha = 0.01$ for samples with $n \le 10{,}000$, and is rejected at $\alpha = 0.01$ for samples with $n \ge 100{,}000$. In both instances, as the sample size becomes extremely large the p value associated with the given difference between a point estimate and the hypothesized value of the parameter becomes extremely small.

TABLE 6.14  Values of the Test Statistic t and the p Values for the Test of the Null Hypothesis $H_0: \mu \le 84$ and Sample Mean $\bar{x} = 84.1$ Seconds for Various Sample Sizes n

Sample Size n      t            p Value
10                 0.01581      0.49386
100                0.05000      0.48011
1,000              0.15811      0.43720
10,000             0.50000      0.30854
100,000            1.58114      0.05692
1,000,000          5.00000      2.87E-07
10,000,000         15.81139     1.30E-56
100,000,000        50.00000     0.00E+00
1,000,000,000      158.11388    0.00E+00

TABLE 6.15  Values of the Test Statistic z and the p Values for the Test of the Null Hypothesis $H_0: p \le 0.50$ and Sample Proportion $\bar{p} = 0.51$ for Various Sample Sizes n

Sample Size n      z            p Value
10                 0.06325      0.47479
100                0.20000      0.42074
1,000              0.63246      0.26354
10,000             2.00000      0.02275
100,000            6.32456      1.27E-10
1,000,000          20.00000     0.00E+00
10,000,000         63.24555     0.00E+00
100,000,000        200.00000    0.00E+00
1,000,000,000      632.45553    0.00E+00
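The test statistics in Tables 6.14 and 6.15 can be recomputed with a short sketch that uses only the Python standard library. It is an illustration, not the procedure used to build the tables: the function names are hypothetical, and only the normal upper-tail probability is computed. The p values in Table 6.14 come from the t distribution with n − 1 degrees of freedom, which the standard library does not provide, so the sketch prints only the t statistic for that table.

```python
from math import erfc, sqrt


def t_statistic(xbar, mu0, s, n):
    """Test statistic for H0: mu <= mu0 using the sample standard deviation s."""
    return (xbar - mu0) / (s / sqrt(n))


def z_statistic(p_bar, p0, n):
    """Test statistic for H0: p <= p0."""
    return (p_bar - p0) / sqrt(p0 * (1 - p0) / n)


def upper_tail_normal(z):
    """P(Z >= z) for a standard normal random variable."""
    return 0.5 * erfc(z / sqrt(2))


# Sample sizes 10, 100, ..., 1,000,000,000 as in the tables.
for k in range(1, 10):
    n = 10 ** k
    t = t_statistic(84.1, 84, 20, n)
    z = z_statistic(0.51, 0.50, n)
    print(f"{n:>13,}  t={t:10.5f}  z={z:10.5f}  p(z)={upper_tail_normal(z):.2E}")
```

Because the difference between the point estimate and the hypothesized value is held fixed, each statistic grows in proportion to the square root of n, which is why the p values collapse toward zero so quickly.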


Implications of Big Data in Hypothesis Testing

Suppose PDT collects a sample of 1,000,000 visitors to its web site and uses these data to test its null hypotheses $H_0: \mu \le 84$ and $H_0: p \le 0.50$ at the 0.05 level of significance. The sample mean is 84.1 and the sample proportion is 0.51, so the null hypothesis is rejected in both tests, as Tables 6.14 and 6.15 show. As a result, PDT can promise potential advertisers that the mean time spent by individual customers who visit PDT’s web site exceeds 84 seconds and the proportion of individual visitors to its web site who click on an ad exceeds 0.50. These results suggest that for each of these hypothesis tests, the difference between the point estimate and the hypothesized value of the parameter being tested is not likely solely a consequence of sampling error. However, the results of any hypothesis test, no matter the sample size, are only reliable if the sample is relatively free of nonsampling error. If nonsampling error is introduced in the data collection process, the likelihood of making a Type I or Type II error may be higher than if the sample data are free of nonsampling error. Therefore, when testing a hypothesis, it is always important to think carefully about whether a random sample of the population of interest has been taken.

If PDT determines that it has introduced little or no nonsampling error into its sample data, the only remaining plausible explanation for these results is that these null hypotheses are false. At this point, PDT and the companies that advertise on PenningtonDailyTimes.com should also consider whether these statistically significant differences between the point estimates and the hypothesized values of the parameters being tested are of practical significance. Although a 0.1 second increase in the mean time spent by customers when they visit PDT’s web site is statistically significant, it may not be meaningful to companies that might advertise on PenningtonDailyTimes.com. Similarly, although an increase of 0.01 in the proportion of visitors to its web site that click on an ad is statistically significant, it may not be meaningful to companies that might advertise on PenningtonDailyTimes.com. PDT and its advertisers must determine whether these statistically significant differences have meaningful implications for their ensuing business decisions.

Ultimately, no business decision should be based solely on statistical inference. Practical significance should always be considered in conjunction with statistical significance. This is particularly important when the hypothesis test is based on an extremely large sample because even an extremely small difference between the point estimate and the hypothesized value of the parameter being tested will be statistically significant. When done properly, statistical inference provides evidence that should be considered in combination with information collected from other sources to make the most informed decision possible.

NOTES + COMMENTS

1. Nonsampling error can occur when either a probability sampling technique or a nonprobability sampling technique is used. However, nonprobability sampling techniques such as convenience sampling and judgment sampling often introduce nonsampling error into sample data because of the manner in which sample data are collected. Therefore, probability sampling techniques are preferred over nonprobability sampling techniques.

2. When taking an extremely large sample, it is conceivable that the sample size is at least 5% of the population size; that is, $n/N \ge 0.05$. Under these conditions, it is necessary to use the finite population correction factor when calculating the standard error of the sampling distribution to be used in confidence intervals and hypothesis testing.
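The second note can be illustrated with a small sketch. It assumes the usual form of the finite population correction factor, $\sqrt{(N - n)/(N - 1)}$, applied to the standard error of the sample mean. The function name and the numbers in the usage lines are hypothetical and are not taken from the PDT example.

```python
from math import sqrt


def standard_error_mean(sigma, n, N=None):
    """Standard error of the sample mean.

    N=None treats the population as infinite. When a finite population
    size N is given and n/N >= 0.05, the finite population correction
    factor sqrt((N - n) / (N - 1)) is applied.
    """
    se = sigma / sqrt(n)
    if N is not None and n / N >= 0.05:
        se *= sqrt((N - n) / (N - 1))
    return se


# Hypothetical values: a sample of 1,000,000 from a population of 5,000,000.
print(standard_error_mean(20, 1_000_000))             # no correction
print(standard_error_mean(20, 1_000_000, 5_000_000))  # correction applied
```

With the correction applied the standard error is smaller, so ignoring it when n/N is large would overstate the uncertainty in the estimate.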


Summary

In this chapter we presented the concepts of sampling and sampling distributions. We demonstrated how a simple random sample can be selected from a finite population and how a random sample can be selected from an infinite population. The data collected from such samples can be used to develop point estimates of population parameters. Different samples provide different values for the point estimators; therefore, point estimators such as $\bar{x}$ and $\bar{p}$ are random variables. The probability distribution of such a random variable is called a sampling distribution. In particular, we described in detail the sampling distributions of the sample mean $\bar{x}$ and the sample proportion $\bar{p}$. In considering the characteristics of the sampling distributions of $\bar{x}$ and $\bar{p}$, we stated that $E(\bar{x}) = \mu$ and $E(\bar{p}) = p$. After developing the standard deviation or standard error formulas for these estimators, we described the conditions necessary for the sampling distributions of $\bar{x}$ and $\bar{p}$ to follow a normal distribution.

In Section 6.4, we presented methods for developing interval estimates of a population mean and a population proportion. A point estimator may or may not provide a good estimate of a population parameter. The use of an interval estimate provides a measure of the precision of an estimate. Both the interval estimate of the population mean and the population proportion take the form: point estimate ± margin of error.

We presented the interval estimation procedure for a population mean for the practical case in which the population standard deviation is unknown. The interval estimation procedure uses the sample standard deviation s and the t distribution. The quality of the interval estimate obtained depends on the distribution of the population and the sample size. In a normally distributed population, the interval estimates will be exact, even for small sample sizes. If the population is not normally distributed, the interval estimates obtained will be approximate. Larger sample sizes provide better approximations, but the more highly skewed the population is, the larger the sample size needs to be to obtain a good approximation.

The general form of the interval estimate for a population proportion is $\bar{p}$ ± margin of error. In practice, the sample sizes used for interval estimates of a population proportion are generally large. Thus, the interval estimation procedure for a population proportion is based on the standard normal distribution.

In Section 6.5, we presented methods for hypothesis testing, a statistical procedure that uses sample data to determine whether or not a statement about the value of a population parameter should be rejected. The hypotheses are two competing statements about a population parameter. One statement is called the null hypothesis ($H_0$), and the other is called the alternative hypothesis ($H_a$). We provided guidelines for developing hypotheses for situations frequently encountered in practice.

In the hypothesis-testing procedure for the population mean, the sample standard deviation s is used to estimate $\sigma$ and the hypothesis test is based on the t distribution. The quality of results depends on both the form of the population distribution and the sample size; if the population is not normally distributed, larger sample sizes are needed. General guidelines about the sample size were provided in Section 6.5. In the case of hypothesis tests about a population proportion, the hypothesis-testing procedure uses a test statistic based on the standard normal distribution.

We also reviewed how the value of the test statistic can be used to compute a p value, a probability that is used to determine whether the null hypothesis should be rejected. If the p value is less than or equal to the level of significance $\alpha$, the null hypothesis can be rejected.

In Section 6.6 we discussed the concept of big data and its implications for statistical inference. We considered sampling and nonsampling error; the implications of big data on standard errors, confidence intervals, and hypothesis testing for the mean and the proportion; and the importance of considering both statistical significance and practical significance.


Glossary

Alternative hypothesis: The hypothesis concluded to be true if the null hypothesis is rejected.
Big data: Any set of data that is too large or too complex to be handled by standard data processing techniques and typical desktop software.
Census: Collection of data from every element in the population of interest.
Central limit theorem: A theorem stating that when enough independent random variables are added, the resulting sum is approximately a normally distributed random variable. This result allows one to use the normal probability distribution to approximate the sampling distributions of the sample mean and sample proportion for sufficiently large sample sizes.
Confidence coefficient: The confidence level expressed as a decimal value. For example, 0.95 is the confidence coefficient for a 95% confidence level.
Confidence interval: Another name for an interval estimate.
Confidence level: The confidence associated with an interval estimate. For example, if an interval estimation procedure provides intervals such that 95% of the intervals formed using the procedure will include the population parameter, the interval estimate is said to be constructed at the 95% confidence level.
Coverage error: Nonsampling error that results when the research objective and the population from which the sample is to be drawn are not aligned.
Degrees of freedom: A parameter of the t distribution. When the t distribution is used in the computation of an interval estimate of a population mean, the appropriate t distribution has $n - 1$ degrees of freedom, where n is the size of the sample.
Finite population correction factor: The term $\sqrt{(N - n)/(N - 1)}$ that is used in the formulas for computing the (estimated) standard error for the sample mean and sample proportion whenever a finite population, rather than an infinite population, is being sampled. The generally accepted rule of thumb is to ignore the finite population correction factor whenever $n/N \le 0.05$.
Frame: A listing of the elements from which the sample will be selected.
Hypothesis testing: The process of making a conjecture about the value of a population parameter, collecting sample data that can be used to assess this conjecture, measuring the strength of the evidence against the conjecture that is provided by the sample, and using these results to draw a conclusion about the conjecture.
Interval estimate: An estimate of a population parameter that provides an interval believed to contain the value of the parameter. For the interval estimates in this chapter, it has the form: point estimate ± margin of error.
Interval estimation: The process of using sample data to calculate a range of values that is believed to include the unknown value of a population parameter.
Level of significance: The probability that the interval estimation procedure will generate an interval that does not contain the value of the parameter being estimated; also the probability of making a Type I error when the null hypothesis is true as an equality.
Margin of error: The ± value added to and subtracted from a point estimate in order to develop an interval estimate of a population parameter.
Measurement error: Nonsampling error that results from the incorrect measurement of the population characteristic of interest.
Nonresponse error: Nonsampling error that results when some segments of the population are more likely or less likely to respond to the survey mechanism.
Nonsampling error: Any difference between the value of a sample statistic (such as the sample mean, sample standard deviation, or sample proportion) and the value of the corresponding population parameter (population mean, population standard deviation, or population proportion) that is not the result of sampling error. These include but are not limited to coverage error, nonresponse error, measurement error, interviewer error, and processing error.


Null hypothesis: The hypothesis tentatively assumed to be true in the hypothesis testing procedure.
One-tailed test: A hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its sampling distribution.
p value: The probability, assuming that $H_0$ is true, of obtaining a random sample of size n that results in a test statistic at least as extreme as the one observed in the current sample. For a lower-tail test, the p value is the probability of obtaining a value for the test statistic as small as or smaller than that provided by the sample. For an upper-tail test, the p value is the probability of obtaining a value for the test statistic as large as or larger than that provided by the sample. For a two-tailed test, the p value is the probability of obtaining a value for the test statistic at least as unlikely as or more unlikely than that provided by the sample.
Parameter: A measurable factor that defines a characteristic of a population, process, or system, such as a population mean $\mu$, a population standard deviation $\sigma$, a population proportion $p$, and so on.
Point estimate: The value of a point estimator used in a particular instance as an estimate of a population parameter.
Point estimator: The sample statistic, such as $\bar{x}$, $s$, or $\bar{p}$, that provides the point estimate of the population parameter.
Practical significance: The real-world impact the result of statistical inference will have on business decisions.
Random sample: A random sample from an infinite population is a sample selected such that the following conditions are satisfied: (1) each element selected comes from the same population and (2) each element is selected independently.
Random variable: A quantity whose values are not known with certainty.
Sample statistic: A characteristic of sample data, such as a sample mean $\bar{x}$, a sample standard deviation $s$, a sample proportion $\bar{p}$, and so on. The value of the sample statistic is used to estimate the value of the corresponding population parameter.
Sampled population: The population from which the sample is drawn.
Sampling distribution: A probability distribution consisting of all possible values of a sample statistic.
Sampling error: The difference between the value of a sample statistic (such as the sample mean, sample standard deviation, or sample proportion) and the value of the corresponding population parameter (population mean, population standard deviation, or population proportion) that occurs because a random sample is used to estimate the population parameter.
Simple random sample: A simple random sample of size n from a finite population of size N is a sample selected such that each possible sample of size n has the same probability of being selected.
Standard error: The standard deviation of a point estimator.
Standard normal distribution: A normal distribution with a mean of zero and standard deviation of one.
Statistical inference: The process of making estimates and drawing conclusions about one or more characteristics of a population (the value of one or more parameters) through the analysis of sample data drawn from the population.
t distribution: A family of probability distributions that can be used to develop an interval estimate of a population mean whenever the population standard deviation $\sigma$ is unknown and is estimated by the sample standard deviation $s$.
Tall data: A data set that has so many observations that traditional statistical inference has little meaning.
Target population: The population for which statistical inferences such as point estimates are made. It is important for the target population to correspond as closely as possible to the sampled population.


Test statistic: A statistic whose value helps determine whether a null hypothesis should be rejected.
Two-tailed test: A hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in either tail of its sampling distribution.
Type I error: The error of rejecting $H_0$ when it is true.
Type II error: The error of accepting $H_0$ when it is false.
Unbiased: A property of a point estimator that is present when the expected value of the point estimator is equal to the population parameter it estimates.
Variety: The diversity in types and structures of data generated.
Velocity: The speed at which the data are generated.
Veracity: The reliability of the data generated.
Volume: The amount of data generated.
Wide data: A data set that has so many variables that simultaneous consideration of all variables is infeasible.

Problems

1. The American League consists of 15 baseball teams. Suppose a sample of 5 teams is to be selected to conduct player interviews. The following table lists the 15 teams and the random numbers assigned by Excel’s RAND function. Use these random numbers to select a sample of size 5.

Team           Random Number
New York       0.178624
Baltimore      0.578370
Toronto        0.965807
Chicago        0.562178
Detroit        0.253574
Oakland        0.288287
Texas          0.500879
Houston        0.713682
Boston         0.290197
Tampa Bay      0.867778
Minnesota      0.811810
Cleveland      0.960271
Kansas City    0.326836
Los Angeles    0.895267
Seattle        0.839071
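One way to carry out the selection in Problem 1 is to sort the teams by their assigned random numbers and keep the first five. The sketch below assumes that convention (the five smallest random numbers are selected). The function name is hypothetical.

```python
# Random numbers as listed in the table for Problem 1.
random_numbers = {
    "New York": 0.178624, "Baltimore": 0.578370, "Toronto": 0.965807,
    "Chicago": 0.562178, "Detroit": 0.253574, "Oakland": 0.288287,
    "Texas": 0.500879, "Houston": 0.713682, "Boston": 0.290197,
    "Tampa Bay": 0.867778, "Minnesota": 0.811810, "Cleveland": 0.960271,
    "Kansas City": 0.326836, "Los Angeles": 0.895267, "Seattle": 0.839071,
}


def select_sample(numbers, k):
    """Return the k elements with the smallest assigned random numbers."""
    # Assumption: "smallest random numbers first" is the selection rule.
    ranked = sorted(numbers, key=numbers.get)
    return ranked[:k]


sample = select_sample(random_numbers, 5)
print(sample)
```

Any rule that depends only on the random numbers (for example, taking the five largest instead) would also yield a simple random sample; what matters is that the rule is fixed before the numbers are examined.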

2. The U.S. Golf Association is considering a ban on long and belly putters. This has caused a great deal of controversy among both amateur golfers and members of the Professional Golf Association (PGA). Shown below are the names of the top 10 finishers in the recent PGA Tour McGladrey Classic golf tournament.

1. Tommy Gainey
2. David Toms
3. Jim Furyk
4. Brendon de Jonge
5. D. J. Trahan
6. Davis Love III
7. Chad Campbell
8. Greg Owens
9. Charles Howell III
10. Arjun Atwal

Select a simple random sample of 3 of these players to assess their opinions on the use of long and belly putters.

3. A simple random sample of 5 months of sales data provided the following information:

Month: 1 2 3 4 5

Units Sold: 94 100 85 94 92

a. Develop a point estimate of the population mean number of units sold per month.

b. Develop a point estimate of the population standard deviation.
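The point estimates asked for in Problem 3 are the sample mean and the sample standard deviation. A minimal sketch using Python's standard library follows; the variable names are illustrative.

```python
from statistics import mean, stdev

# Units sold in the five sampled months (Problem 3).
units_sold = [94, 100, 85, 94, 92]

x_bar = mean(units_sold)  # point estimate of the population mean
s = stdev(units_sold)     # sample standard deviation (divides by n - 1)

print(x_bar, round(s, 2))
```

Note that `stdev` uses the n − 1 divisor, which is the form appropriate for estimating a population standard deviation from a sample; `pstdev` would divide by n instead.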


4. Morningstar publishes ratings data on 1,208 company stocks. A sample of 40 of these stocks is contained in the file named Morningstar. Use the Morningstar data set to answer the following questions.

a. Develop a point estimate of the proportion of the stocks that receive Morningstar’s highest rating of 5 Stars.

b. Develop a point estimate of the proportion of the Morningstar stocks that are rated Above Average with respect to business risk.

c. Develop a point estimate of the proportion of the Morningstar stocks that are rated 2 Stars or less.

5. One of the questions in the Pew Internet & American Life Project asked adults if they used the Internet at least occasionally. The results showed that 454 out of 478 adults aged 18–29 answered Yes; 741 out of 833 adults aged 30–49 answered Yes; and 1,058 out of 1,644 adults aged 50 and over answered Yes.

a. Develop a point estimate of the proportion of adults aged 18–29 who use the Internet.

b. Develop a point estimate of the proportion of adults aged 30–49 who use the Internet.

c. Develop a point estimate of the proportion of adults aged 50 and over who use the Internet.

d. Comment on any apparent relationship between age and Internet use.

e. Suppose your target population of interest is that of all adults (18 years of age and over). Develop an estimate of the proportion of that population who use the Internet.

6. In this chapter we showed how a simple random sample of 30 EAI employees can be used to develop point estimates of the population mean annual salary, the population standard deviation for annual salary, and the population proportion having completed the management training program.

a. Use Excel to select a simple random sample of 50 EAI employees.

b. Develop a point estimate of the mean annual salary.

c. Develop a point estimate of the population standard deviation for annual salary.

d. Develop a point estimate of the population proportion having completed the management training program.

7. The College Board reported the following mean scores for the three parts of the SAT:

Critical Reading    502
Mathematics         515
Writing             494

Assume that the population standard deviation on each part of the test is $\sigma = 100$.

a. For a random sample of 30 test takers, what is the sampling distribution of $\bar{x}$ for scores on the Critical Reading part of the test?

b. For a random sample of 60 test takers, what is the sampling distribution of $\bar{x}$ for scores on the Mathematics part of the test?

c. For a random sample of 90 test takers, what is the sampling distribution of $\bar{x}$ for scores on the Writing part of the test?

8. For the year 2010, 33% of taxpayers with adjusted gross incomes between $30,000 and $60,000 itemized deductions on their federal income tax return. The mean amount of deductions for this population of taxpayers was $16,642. Assume that the standard deviation is $\sigma = \$2{,}400$.

a. What are the sampling distributions of $\bar{x}$ for itemized deductions for this population of taxpayers for each of the following sample sizes: 30, 50, 100, and 400?

b. What is the advantage of a larger sample size when attempting to estimate the population mean?
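For problems such as 7 through 10, the sampling distribution of the sample mean is described by its expected value (the population mean) and its standard error, σ/√n. The sketch below applies this to the figures in Problem 8; the function name is hypothetical, and the population is treated as infinite (no finite population correction).

```python
from math import sqrt


def standard_error_of_mean(sigma, n):
    """Standard error of the sample mean for an infinite population."""
    return sigma / sqrt(n)


mu, sigma = 16_642, 2_400  # population mean and standard deviation (Problem 8)

for n in (30, 50, 100, 400):
    se = standard_error_of_mean(sigma, n)
    print(f"n={n:>3}: E(x-bar)={mu}, standard error={se:.2f}")
```

The expected value stays at the population mean for every sample size, while the standard error shrinks as n grows, which is the advantage part (b) asks about.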


9. The Economic Policy Institute periodically issues reports on wages of entry-level workers. The institute reported that entry-level wages for male college graduates were $21.68 per hour and for female college graduates were $18.80 per hour in 2011. Assume that the standard deviation for male graduates is $2.30 and for female graduates it is $2.05.

a. What is the sampling distribution of $\bar{x}$ for a random sample of 50 male college graduates?

b. What is the sampling distribution of $\bar{x}$ for a random sample of 50 female college graduates?

c. In which of the preceding two cases, part (a) or part (b), is the standard error of $\bar{x}$ smaller? Why?

10. The state of California has a mean annual rainfall of 22 inches, whereas the state of New York has a mean annual rainfall of 42 inches. Assume that the standard deviation for both states is 4 inches. A sample of 30 years of rainfall for California and a sample of 45 years of rainfall for New York have been taken.

a. Show the sampling distribution of the sample mean annual rainfall for California.

b. Show the sampling distribution of the sample mean annual rainfall for New York.

c. In which of the preceding two cases, part (a) or part (b), is the standard error of $\bar{x}$ smaller? Why?

11. The president of Doerman Distributors, Inc. believes that 30% of the firm’s orders come from first-time customers. A random sample of 100 orders will be used to estimate the proportion of first-time customers. Assume that the president is correct and $p = 0.30$. What is the sampling distribution of $\bar{p}$ for this study?
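For problems such as 11 through 14, the sampling distribution of the sample proportion is summarized by its expected value $p$ and its standard error $\sqrt{p(1 - p)/n}$. The sketch below uses the figures in Problem 11. The function name is hypothetical, and the check that both $np$ and $n(1 - p)$ are at least 5 is a common rule of thumb for the normal approximation that is assumed here rather than quoted from this section.

```python
from math import sqrt


def sampling_distribution_of_proportion(p, n):
    """Return (expected value, standard error, normal approximation ok?)."""
    expected_value = p
    standard_error = sqrt(p * (1 - p) / n)
    # Assumed rule of thumb for using the normal approximation.
    normal_ok = n * p >= 5 and n * (1 - p) >= 5
    return expected_value, standard_error, normal_ok


ev, se, ok = sampling_distribution_of_proportion(0.30, 100)
print(ev, round(se, 4), ok)
```

Because $p(1 - p)$ is unchanged when $p$ is replaced by $1 - p$, complementary proportions with the same sample size have the same standard error.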

12. The Wall Street Journal reported that the age at first startup for 55% of entrepreneurs was 29 years of age or less and the age at first startup for 45% of entrepreneurs was 30 years of age or more.

a. Suppose a sample of 200 entrepreneurs will be taken to learn about the most important qualities of entrepreneurs. Show the sampling distribution of $\bar{p}$ where $\bar{p}$ is the sample proportion of entrepreneurs whose first startup was at 29 years of age or less.

b. Suppose a sample of 200 entrepreneurs will be taken to learn about the most important qualities of entrepreneurs. Show the sampling distribution of $\bar{p}$ where $\bar{p}$ is now the sample proportion of entrepreneurs whose first startup was at 30 years of age or more.

c. Are the standard errors of the sampling distributions different in parts (a) and (b)?

13. People end up tossing 12% of what they buy at the grocery store. Assume this is the true population proportion and that you plan to take a sample survey of 540 grocery shoppers to further investigate their behavior. Show the sampling distribution of $\bar{p}$, the proportion of groceries thrown out by your sample respondents.

14. Forty-two percent of primary care doctors think their patients receive unnecessary medical care.

a. Suppose a sample of 300 primary care doctors was taken. Show the distribution of the sample proportion of doctors who think their patients receive unnecessary medical care.

b. Suppose a sample of 500 primary care doctors was taken. Show the distribution of the sample proportion of doctors who think their patients receive unnecessary medical care.

c. Suppose a sample of 1,000 primary care doctors was taken. Show the distribution of the sample proportion of doctors who think their patients receive unnecessary medical care.

d. In which of the preceding three cases, part (a) or part (b) or part (c), is the standard error of $\bar{p}$ smallest? Why?

15. The International Air Transport Association surveys business travelers to develop quality ratings for transatlantic gateway airports. The maximum possible rating is 10. Suppose a simple random sample of 50 business travelers is selected and each traveler is asked to provide a rating for the Miami International Airport. The ratings obtained from the sample of 50 business travelers follow.

6 4 6 8 7 7 6 3 3 8 10 4 8
7 8 7 5 9 5 8 4 3 8 5 5 4
4 4 8 4 5 6 2 5 9 9 8 4 8
9 9 5 9 7 8 3 10 8 9 6

Develop a 95% confidence interval estimate of the population mean rating for Miami.

16. A sample containing years to maturity and yield for 40 corporate bonds is contained in the file named CorporateBonds.

a. What is the sample mean years to maturity for corporate bonds and what is the sample standard deviation?

b. Develop a 95% confidence interval for the population mean years to maturity.

c. What is the sample mean yield on corporate bonds and what is the sample standard deviation?

d. Develop a 95% confidence interval for the population mean yield on corporate bonds.

17. Health insurers are beginning to offer telemedicine services online that replace the common office visit. WellPoint provides a video service that allows subscribers to connect with a physician online and receive prescribed treatments. WellPoint claims that users of its LiveHealth Online service saved a significant amount of money on a typical visit. The data shown below ($), for a sample of 20 online doctor visits, are consistent with the savings per visit reported by WellPoint.

92 34 40
105 83 55
56 49 40
76 48 96
93 74 73
78 93 100
53 82

Assuming that the population is roughly symmetric, construct a 95% confidence interval for the mean savings for a televisit to the doctor as opposed to an office visit.

18. The average annual premium for automobile insurance in the United States is $1,503. The following annual premiums ($) are representative of the web site’s findings for the state of Michigan.

1,905 3,112 2,312
2,725 2,545 2,981
2,677 2,525 2,627
2,600 2,370 2,857
2,962 2,545 2,675
2,184 2,529 2,115
2,332 2,442

Assume the population is approximately normal.

a. Provide a point estimate of the mean annual automobile insurance premium in Michigan.

b. Develop a 95% confidence interval for the mean annual automobile insurance premium in Michigan.

c. Does the 95% confidence interval for the annual automobile insurance premium in Michigan include the national average for the United States? What is your interpretation of the relationship between auto insurance premiums in Michigan and the national average?
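Interval estimates such as those requested in Problems 15 through 18 take the form $\bar{x} \pm t_{\alpha/2}\, s/\sqrt{n}$. The sketch below applies this to the 20 savings values given for Problem 17 (the TeleHealth data). Python's standard library has no t distribution, so the critical value 2.093 (95% confidence, 19 degrees of freedom) is hard-coded from a standard t table; that constant and the function name are assumptions of this sketch.

```python
from math import sqrt
from statistics import mean, stdev

# Savings per visit ($) for the 20 online doctor visits in Problem 17.
savings = [92, 34, 40, 105, 83, 55, 56, 49, 40, 76,
           48, 96, 93, 74, 73, 78, 93, 100, 53, 82]

# Assumption: t value for 95% confidence with n - 1 = 19 degrees of
# freedom, taken from a standard t table.
T_CRITICAL_95_DF19 = 2.093


def t_interval(data, t_critical):
    """Interval estimate x-bar +/- t * s / sqrt(n)."""
    n = len(data)
    x_bar = mean(data)
    margin = t_critical * stdev(data) / sqrt(n)
    return x_bar - margin, x_bar + margin


lower, upper = t_interval(savings, T_CRITICAL_95_DF19)
print(round(lower, 2), round(upper, 2))
```

A different sample size or confidence level requires a different critical value; with a statistics package available, it would normally be looked up from the t distribution rather than typed in.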

19. One of the questions on a survey of 1,000 adults asked if today’s children will be better off than their parents. Representative data are shown in the file named ChildOutlook. A response of Yes indicates that the adult surveyed did think today’s children will be better off than their parents. A response of No indicates that the adult surveyed did not think today’s children will be better off than their parents. A response of Not Sure was given by 23% of the adults surveyed.

a. What is the point estimate of the proportion of the population of adults who do think that today’s children will be better off than their parents?

b. At 95% confidence, what is the margin of error?

c. What is the 95% confidence interval for the proportion of adults who do think that today’s children will be better off than their parents?

d. What is the 95% confidence interval for the proportion of adults who do not think that today’s children will be better off than their parents?

e. Which of the confidence intervals in parts (c) and (d) has the smaller margin of error? Why?

20. According to Thomson Financial, last year the majority of companies reporting profits had beaten estimates. A sample of 162 companies showed that 104 beat estimates, 29 matched estimates, and 29 fell short.

a. What is the point estimate of the proportion that fell short of estimates?

b. Determine the margin of error and provide a 95% confidence interval for the proportion that beat estimates.

c. How large a sample is needed if the desired margin of error is 0.05?
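The interval estimate and sample-size questions in Problem 20 rest on the large-sample formulas for a proportion: margin of error = z(α/2) · sqrt(p̄(1 − p̄)/n), and n = z(α/2)² · p*(1 − p*)/E² for a desired margin E. The following is a minimal Python sketch of that arithmetic, not a worked solution from the text. It assumes the normal approximation is adequate and uses the sample proportion as the planning value p*; the function names are illustrative.

```python
import math
from statistics import NormalDist


def proportion_interval(successes, n, confidence=0.95):
    """Large-sample interval estimate of a population proportion."""
    p_bar = successes / n                               # point estimate
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # z value for alpha/2
    margin = z * math.sqrt(p_bar * (1 - p_bar) / n)     # margin of error
    return p_bar, margin, (p_bar - margin, p_bar + margin)


def sample_size_for_margin(planning_value, desired_margin, confidence=0.95):
    """Sample size giving the desired margin of error, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * planning_value * (1 - planning_value) / desired_margin ** 2
    return math.ceil(n)


# Problem 20: 104 of the 162 sampled companies beat estimates.
p_bar, margin, interval = proportion_interval(104, 162)
# Assumption: the sample proportion serves as the planning value.
n_needed = sample_size_for_margin(p_bar, 0.05)

print(f"point estimate = {p_bar:.4f}, margin of error = {margin:.4f}")
print(f"95% interval = ({interval[0]:.4f}, {interval[1]:.4f})")
print(f"sample size for a 0.05 margin = {n_needed}")
```

If no reasonable planning value were available, p* = 0.50 could be passed instead, which gives the largest (most conservative) sample size.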

21. The Pew Research Center Internet Project conducted a survey of 857 Internet users. This survey provided a variety of statistics on them.

a. The sample survey showed that 90% of respondents said the Internet has been a good thing for them personally. Develop a 95% confidence interval for the proportion of respondents who say the Internet has been a good thing for them personally.

b. The sample survey showed that 67% of Internet users said the Internet has generally strengthened their relationship with family and friends. Develop a 95% confidence interval for the proportion of respondents who say the Internet has strengthened their relationship with family and friends.

c. Fifty-six percent of Internet users have seen an online group come together to help a person or community solve a problem, whereas only 25% have left an online group because of unpleasant interaction. Develop a 95% confidence interval for the proportion of Internet users who say online groups have helped solve a problem.

d. Compare the margin of error for the interval estimates in parts (a), (b), and (c). How is the margin of error related to the sample proportion?
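Part (d) of Problem 21 turns on the term p̄(1 − p̄) in the margin of error, which is largest at p̄ = 0.50. A short Python sketch, assuming the large-sample normal approximation and the sample size of 857 given in the problem, makes the comparison concrete (the helper name is illustrative):

```python
import math
from statistics import NormalDist

N = 857                          # sample size from Problem 21
Z = NormalDist().inv_cdf(0.975)  # z value for 95% confidence


def margin_of_error(p_bar, n=N, z=Z):
    """Margin of error for a sample proportion under the normal approximation."""
    return z * math.sqrt(p_bar * (1 - p_bar) / n)


# Sample proportions from parts (a), (b), and (c).
margins = {p_bar: margin_of_error(p_bar) for p_bar in (0.90, 0.67, 0.56)}

for p_bar, margin in margins.items():
    print(f"p_bar = {p_bar:.2f}: p_bar(1 - p_bar) = {p_bar * (1 - p_bar):.4f}, "
          f"margin of error = {margin:.4f}")
```

With n held fixed, the margin of error grows as the sample proportion moves toward 0.50.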

22. For many years businesses have struggled with the rising cost of health care. But recently, the increases have slowed due to less inflation in health care prices and employees paying for a larger portion of health care benefits. A recent Mercer survey showed that 52% of U.S. employers were likely to require higher employee contributions for health care coverage. Suppose the survey was based on a sample of 800 companies. Compute the margin of error and a 95% confidence interval for the proportion of companies likely to require higher employee contributions for health care coverage.

23. The manager of the Danvers-Hilton Resort Hotel stated that the mean guest bill for a weekend is $600 or less. A member of the hotel’s accounting staff noticed that the total charges for guest bills have been increasing in recent months. The accountant will use a sample of future weekend guest bills to test the manager’s claim.


a. Which form of the hypotheses should be used to test the manager’s claim? Explain.

\[
\begin{array}{lll}
H_0: \mu \geq 600 & H_0: \mu \leq 600 & H_0: \mu = 600 \\
H_a: \mu < 600 & H_a: \mu > 600 & H_a: \mu \neq 600
\end{array}
\]

b. What conclusion is appropriate when H0 cannot be rejected?

c. What conclusion is appropriate when H0 can be rejected?

24. The manager of an automobile dealership is considering a new bonus plan designed to increase sales volume. Currently, the mean sales volume is 14 automobiles per month. The manager wants to conduct a research study to see whether the new bonus plan increases sales volume. To collect data on the plan, a sample of sales personnel will be allowed to sell under the new bonus plan for a one-month period.

a. Develop the null and alternative hypotheses most appropriate for this situation.

b. Comment on the conclusion when H0 cannot be rejected.

c. Comment on the conclusion when H0 can be rejected.

25. A production line operation is designed to fill cartons with laundry detergent to a mean weight of 32 ounces. A sample of cartons is periodically selected and weighed to determine whether underfilling or overfilling is occurring. If the sample data lead to a conclusion of underfilling or overfilling, the production line will be shut down and adjusted to obtain proper filling.

a. Formulate the null and alternative hypotheses that will help in deciding whether to shut down and adjust the production line.

b. Comment on the conclusion and the decision when H0 cannot be rejected.

c. Comment on the conclusion and the decision when H0 can be rejected.

26. Because of high production-changeover time and costs, a director of manufacturing must convince management that a proposed manufacturing method reduces costs before the new method can be implemented. The current production method operates with a mean cost of $220 per hour. A research study will measure the cost of the new method over a sample production period.

a. Develop the null and alternative hypotheses most appropriate for this study.

b. Comment on the conclusion when H0 cannot be rejected.

c. Comment on the conclusion when H0 can be rejected.

27. Duke Energy reported that the cost of electricity for an efficient home in a particular neighborhood of Cincinnati, Ohio, was $104 per month. A researcher believes that the cost of electricity for a comparable neighborhood in Chicago, Illinois, is higher. A sample of homes in this Chicago neighborhood will be taken and the sample mean monthly cost of electricity will be used to test the following null and alternative hypotheses.

\[
\begin{array}{l}
H_0: \mu \leq 104 \\
H_a: \mu > 104
\end{array}
\]

a. Assume that the sample data lead to rejection of the null hypothesis. What would be your conclusion about the cost of electricity in the Chicago neighborhood?

b. What is the Type I error in this situation? What are the consequences of making this error?

c. What is the Type II error in this situation? What are the consequences of making this error?

28. The label on a 3-quart container of orange juice states that the orange juice contains an average of 1 gram of fat or less. Answer the following questions for a hypothesis test that could be used to test the claim on the label.

a. Develop the appropriate null and alternative hypotheses.

b. What is the Type I error in this situation? What are the consequences of making this error?

c. What is the Type II error in this situation? What are the consequences of making this error?


29. Carpetland salespersons average $8,000 per week in sales. Steve Contois, the firm’s vice president, proposes a compensation plan with new selling incentives. Steve hopes that the results of a trial selling period will enable him to conclude that the compensation plan increases the average sales per salesperson.

a. Develop the appropriate null and alternative hypotheses.

b. What is the Type I error in this situation? What are the consequences of making this error?

c. What is the Type II error in this situation? What are the consequences of making this error?

30. Suppose a new production method will be implemented if a hypothesis test supports the conclusion that the new method reduces the mean operating cost per hour.

a. State the appropriate null and alternative hypotheses if the mean cost for the current production method is $220 per hour.

b. What is the Type I error in this situation? What are the consequences of making this error?

c. What is the Type II error in this situation? What are the consequences of making this error?

31. Which is cheaper: eating out or dining in? The mean cost of a flank steak, broccoli, and rice bought at the grocery store is $13.04. A sample of 100 neighborhood restaurants showed a mean price of $12.75 and a standard deviation of $2 for a comparable restaurant meal.

a. Develop appropriate hypotheses for a test to determine whether the sample data support the conclusion that the mean cost of a restaurant meal is less than fixing a comparable meal at home.

b. Using the sample from the 100 restaurants, what is the p value?

c. At α = 0.05, what is your conclusion?

32. A shareholders’ group, in lodging a protest, claimed that the mean tenure for a chief executive officer (CEO) was at least nine years. A survey of companies reported in the Wall Street Journal found a sample mean tenure of x̄ = 7.27 years for CEOs with a standard deviation of s = 6.38 years.

a. Formulate hypotheses that can be used to challenge the validity of the claim made by the shareholders’ group.

b. Assume that 85 companies were included in the sample. What is the p value for your hypothesis test?

c. At α = 0.01, what is your conclusion?

33. The national mean annual salary for a school administrator is $90,000 a year. A school official took a sample of 25 school administrators in the state of Ohio to learn about salaries in that state to see if they differed from the national average.

a. Formulate hypotheses that can be used to determine whether the population mean annual administrator salary in Ohio differs from the national mean of $90,000.

b. The sample data for 25 Ohio administrators is contained in the file named Administrator. What is the p value for your hypothesis test in part (a)?

c. At α = 0.05, can your null hypothesis be rejected? What is your conclusion?

34. The time married men with children spend on child care averages 6.4 hours per week. You belong to a professional group on family practices that would like to do its own study to determine if the time married men in your area spend on child care per week differs from the reported mean of 6.4 hours per week. A sample of 40 married couples will be used with the data collected showing the hours per week the husband spends on child care. The sample data are contained in the file named ChildCare.

a. What are the hypotheses if your group would like to determine if the population mean number of hours married men are spending on child care differs from the mean reported by Time in your area?

b. What is the sample mean and the p value?

c. Select your own level of significance. What is your conclusion?


35. The Coca-Cola Company reported that the mean per capita annual sales of its beverages in the United States was 423 eight-ounce servings. Suppose you are curious whether the consumption of Coca-Cola beverages is higher in Atlanta, Georgia, the location of Coca-Cola’s corporate headquarters. A sample of 36 individuals from the Atlanta area showed a sample mean annual consumption of 460.4 eight-ounce servings with a standard deviation of s = 101.9 ounces. Using α = 0.05, do the sample results support the conclusion that mean annual consumption of Coca-Cola beverage products is higher in Atlanta?

36. According to the National Automobile Dealers Association, the mean price for used cars is $10,192. A manager of a Kansas City used car dealership reviewed a sample of 50 recent used car sales at the dealership in an attempt to determine whether the population mean price for used cars at this particular dealership differed from the national mean. The prices for the sample of 50 cars are shown in the file named UsedCars.

a. Formulate the hypotheses that can be used to determine whether a difference exists in the mean price for used cars at the dealership.

b. What is the p value?

c. At α = 0.05, what is your conclusion?

37. What percentage of the population live in their state of birth? According to the U.S. Census Bureau’s American Community Survey, the figure ranges from 25% in Nevada to 78.7% in Louisiana. The average percentage across all states and the District of Columbia is 57.7%. The data in the file HomeState are consistent with the findings in the American Community Survey. The data are for a random sample of 120 Arkansas residents and for a random sample of 180 Virginia residents.

a. Formulate hypotheses that can be used to determine whether the percentage of stay-at-home residents in the two states differs from the overall average of 57.7%.

b. Estimate the proportion of stay-at-home residents in Arkansas. Does this proportion differ significantly from the mean proportion for all states? Use α = 0.05.

c. Estimate the proportion of stay-at-home residents in Virginia. Does this proportion differ significantly from the mean proportion for all states? Use α = 0.05.

d. Would you expect the proportion of stay-at-home residents to be higher in Virginia than in Arkansas? Support your conclusion with the results obtained in parts (b) and (c).

38. Last year, 46% of business owners gave a holiday gift to their employees. A survey of business owners indicated that 35% plan to provide a holiday gift to their employees. Suppose the survey results are based on a sample of 60 business owners.

a. How many business owners in the survey plan to provide a holiday gift to their employees?

b. Suppose the business owners in the sample do as they plan. Compute the p value for a hypothesis test that can be used to determine if the proportion of business owners providing holiday gifts has decreased from last year.

c. Using a 0.05 level of significance, would you conclude that the proportion of business owners providing gifts has decreased? What is the smallest level of significance for which you could draw such a conclusion?
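Problem 38 is a lower-tail test of a population proportion, with test statistic z = (p̄ − p0)/sqrt(p0(1 − p0)/n). The sketch below shows one way to carry out that arithmetic in Python. It assumes the hypotheses H0: p ≥ 0.46 and Ha: p < 0.46 and the large-sample normal approximation; the function name and its `tail` argument are illustrative, not taken from the text.

```python
import math
from statistics import NormalDist


def proportion_z_test(p_bar, p0, n, tail="lower"):
    """z statistic and p value for a one-sample test of a proportion."""
    standard_error = math.sqrt(p0 * (1 - p0) / n)  # computed under H0
    z = (p_bar - p0) / standard_error
    cdf = NormalDist().cdf(z)
    if tail == "lower":
        p_value = cdf
    elif tail == "upper":
        p_value = 1 - cdf
    else:  # two-tailed
        p_value = 2 * min(cdf, 1 - cdf)
    return z, p_value


# Problem 38: 35% of 60 surveyed owners plan a gift; last year's figure was 46%.
owners_planning_gift = round(0.35 * 60)
z, p_value = proportion_z_test(0.35, 0.46, 60, tail="lower")

print(f"owners planning a gift = {owners_planning_gift}")
print(f"z = {z:.4f}, p value = {p_value:.4f}")
```

The p value is itself the smallest level of significance at which H0 can be rejected, which is what the last question in part (c) asks for.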

39. Ten years ago 53% of American families owned stocks or stock funds. Sample data collected by the Investment Company Institute indicate that the percentage is now 46%.

a. Develop appropriate hypotheses such that rejection of H0 will support the conclusion that a smaller proportion of American families own stocks or stock funds this year than 10 years ago.

b. Assume that the Investment Company Institute sampled 300 American families to estimate that the percent owning stocks or stock funds is 46% this year. What is the p value for your hypothesis test?

c. At α = 0.01, what is your conclusion?


40. According to the University of Nevada Center for Logistics Management, 6% of all merchandise sold in the United States gets returned. A Houston department store sampled 80 items sold in January and found that 12 of the items were returned.

a. Construct a point estimate of the proportion of items returned for the population of sales transactions at the Houston store.

b. Construct a 95% confidence interval for the proportion of returns at the Houston store.

c. Is the proportion of returns at the Houston store significantly different from the returns for the nation as a whole? Provide statistical support for your answer.

41. Eagle Outfitters is a chain of stores specializing in outdoor apparel and camping gear. It is considering a promotion that involves mailing discount coupons to all its credit card customers. This promotion will be considered a success if more than 10% of those receiving the coupons use them. Before going national with the promotion, coupons were sent to a sample of 100 credit card customers.

a. Develop hypotheses that can be used to test whether the population proportion of those who will use the coupons is sufficient to go national.

b. The file named Eagle contains the sample data. Develop a point estimate of the population proportion.

c. Use α = 0.05 to conduct your hypothesis test. Should Eagle go national with the promotion?

42. One of the reasons health care costs have been rising rapidly in recent years is the increasing cost of malpractice insurance for physicians. Also, fear of being sued causes doctors to run more precautionary tests (possibly unnecessary) just to make sure they are not guilty of missing something. These precautionary tests also add to health care costs. Data in the file named LawSuit are consistent with findings in a Reader’s Digest article and can be used to estimate the proportion of physicians over the age of 55 who have been sued at least once.

a. Formulate hypotheses that can be used to see if these data can support a finding that more than half of physicians over the age of 55 have been sued at least once.

b. Use Excel and the file named LawSuit to compute the sample proportion of physicians over the age of 55 who have been sued at least once. What is the p value for your hypothesis test?

c. At α = 0.01, what is your conclusion?

43. The Port Authority sells a wide variety of cables and adapters for electronic equipment online. Last year the mean value of orders placed with the Port Authority was $47.28, and management wants to assess whether the mean value of orders placed to date this year is the same as last year. The values of a sample of 49,896 orders placed this year are collected and recorded in the file PortAuthority.

a. Formulate hypotheses that can be used to test whether the mean value of orders placed this year differs from the mean value of orders placed last year.

b. Use the data in the file PortAuthority to conduct your hypothesis test. What is the p value for your hypothesis test? At α = 0.01, what is your conclusion?

44. The Port Authority also wants to determine if the gender profile of its customers has changed since last year, when 59.4% of its orders were placed by males. The genders for a sample of 49,896 orders placed this year are collected and recorded in the file PortAuthority.

a. Formulate hypotheses that can be used to test whether the proportion of orders placed by male customers this year differs from the proportion of orders placed by male customers last year.

b. Use the data in the file PortAuthority to conduct your hypothesis test. What is the p value for your hypothesis test? At α = 0.05, what is your conclusion?

45. Suppose a sample of 10,001 erroneous Federal income tax returns from last year has been taken and is provided in the file FedTaxErrors. A positive value indicates the taxpayer underpaid and a negative value indicates that the taxpayer overpaid.


a. What is the sample mean error made on erroneous Federal income tax returns last year?

b. Using 95% confidence, what is the margin of error?

c. Using the results from parts (a) and (b), develop the 95% confidence interval estimate of the mean error made on erroneous Federal income tax returns last year.

46. According to the Census Bureau, 2,475,780 people are employed by the federal government in the United States. Suppose that a random sample of 3,500 of these federal employees was selected and the number of sick hours each of these employees took last year was collected from an electronic personnel database. The data collected in this survey are provided in the file FedSickHours.

a. What is the sample mean number of sick hours taken by federal employees last year?

b. Using 99% confidence, what is the margin of error?

c. Using the results from parts (a) and (b), develop the 99% confidence interval estimate of the mean number of sick hours taken by federal employees last year.

d. If the mean sick hours federal employees took two years ago was 62.2, what would the confidence interval in part (c) lead you to conclude about last year?

47. Internet users were recently asked online to rate their satisfaction with the web browser they use most frequently. Of 102,519 respondents, 65,120 indicated they were very satisfied with the web browser they use most frequently.

a. What is the sample proportion of Internet users who are very satisfied with the web browser they use most frequently?

b. Using 95% confidence, what is the margin of error?

c. Using the results from parts (a) and (b), develop the 95% confidence interval estimate of the proportion of Internet users who are very satisfied with the web browser they use most frequently.

48. ABC News reports that 58% of U.S. drivers admit to speeding. Suppose that a new satellite technology can instantly measure the speed of any vehicle on a U.S. road and determine whether the vehicle is speeding, and this satellite technology was used to take a sample of 20,000 vehicles at 6:00 p.m. EST on a recent Tuesday afternoon. Of these 20,000 vehicles, 9,252 were speeding.

a. What is the sample proportion of vehicles on U.S. roads that speed?

b. Using 99% confidence, what is the margin of error?

c. Using the results from parts (a) and (b), develop the 99% confidence interval estimate of the proportion of vehicles on U.S. roads that speed.

d. What does the confidence interval in part (c) lead you to conclude about the ABC News report?

49. The Federal Government wants to determine if the mean number of business e-mails sent and received per business day by its employees differs from the mean number of e-mails sent and received per day by corporate employees, which is 101.5. Suppose the department electronically collects information on the number of business e-mails sent and received on a randomly selected business day over the past year from each of 10,163 randomly selected Federal employees. The results are provided in the file FedEmail. Test the Federal Government’s hypothesis at α = 0.01. Discuss the practical significance of the results.

50. CEOs who belong to a popular business-oriented social networking service have an average of 930 connections. Do other members have fewer connections than CEOs? The number of connections for a random sample of 7,515 members who are not CEOs is provided in the file SocialNetwork. Using this sample, test the hypothesis that other members have fewer connections than CEOs at α = 0.01. Discuss the practical significance of the results.

51. The American Potato Growers Association (APGA) would like to test the claim that the proportion of fast-food orders this year that include French fries exceeds the proportion of fast-food orders that included French fries last year. Suppose that a random sample of 49,581 electronic receipts for fast-food orders placed this year shows that 31,038 included French fries. Assuming that the proportion of fast-food orders that included French fries last year is 0.62, use this information to test APGA’s claim at α = 0.05. Discuss the practical significance of the results.


52. According to CNN, 55% of all U.S. smartphone users have used their GPS capability to get directions. Suppose a major provider of wireless telephone service in Canada wants to know how GPS usage by its customers compares with U.S. smartphone users. The company collects usage records for this year for a random sample of 547,192 of its Canadian customers and determines that 302,050 of these customers have used their telephone’s GPS capability this year. Use these data to test whether Canadian smartphone users’ GPS usage differs from U.S. smartphone users’ GPS usage at α = 0.01. Discuss the practical significance of the results.
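With a sample as large as the one in Problem 52, the standard error is so small that even a tiny difference from the hypothesized proportion can be statistically significant. The Python sketch below, which assumes a two-tailed test of H0: p = 0.55 under the normal approximation, computes the test statistic alongside the observed difference and a 99% interval estimate so that the two kinds of significance can be judged side by side. The variable names are illustrative.

```python
import math
from statistics import NormalDist

n = 547_192        # Canadian customers sampled
gps_users = 302_050
p0 = 0.55          # reported proportion for U.S. smartphone users
alpha = 0.01

p_bar = gps_users / n
difference = p_bar - p0  # the quantity that matters for practical significance

# Two-tailed z test; the standard error is computed under H0.
z = (p_bar - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# 99% interval estimate of the Canadian proportion.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
margin = z_crit * math.sqrt(p_bar * (1 - p_bar) / n)
interval = (p_bar - margin, p_bar + margin)

print(f"p_bar = {p_bar:.4f}, difference from p0 = {difference:.4f}")
print(f"z = {z:.3f}, p value = {p_value:.4f}, reject H0: {p_value <= alpha}")
print(f"99% interval = ({interval[0]:.4f}, {interval[1]:.4f})")
```

Whether a difference of this size matters to the wireless provider is a business judgment that the p value alone cannot settle.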

53. A well-respected polling agency has conducted a poll for an upcoming Presidential election. The polling agency has taken measures so that its random sample consists of 50,000 people and is representative of the voting population. The file Pedro contains survey data for 50,000 respondents in both a pre-election survey and a post-election poll.

a. Based on the data in the “Support Pedro in Pre-Election Poll” column, compute the 99% confidence interval on the population proportion of voters who support Pedro Ringer in the upcoming election. If Pedro needs at least 50% of the vote to win in the two-party election, should he be optimistic about winning the election?

b. Now suppose the election occurs and Pedro wins 55% of the vote. Explain how this result could occur given the sample information in part (a).

c. In an attempt to explain the election results (Pedro winning 55% of the vote), the polling agency has followed up with each of the respondents in their pre-election survey. The data in the “Voted for Pedro?” column corresponds to whether or not the respondent actually voted for Pedro in the election. Compute the 99% confidence interval on the population proportion of voters who voted for Pedro Ringer. Is this result consistent with the election results?

d. Use a PivotTable to determine the percentage of survey respondents who voted for Pedro that did not admit to supporting him in a pre-election poll. Use this result to explain the discrepancy between the pre-election poll and the actual election results. What type of error is occurring here?

Case Problem 1: Young Professional Magazine

Young Professional magazine was developed for a target audience of recent college graduates who are in their first 10 years in a business/professional career. In its two years of publication, the magazine has been fairly successful. Now the publisher is interested in expanding the magazine’s advertising base. Potential advertisers continually ask about the demographics and interests of subscribers to Young Professional. To collect this information, the magazine commissioned a survey to develop a profile of its subscribers. The survey results will be used to help the magazine choose articles of interest and provide advertisers with a profile of subscribers. As a new employee of the magazine, you have been asked to help analyze the survey results, a portion of which are shown in the following table.

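| Age | Sex | Real Estate Purchases | Value of Investments ($) | Number of Transactions | Broadband Access | Household Income ($) | Children |
|---|---|---|---|---|---|---|---|
| 38 | Female | No | 12,200 | 4 | Yes | 75,200 | Yes |
| 30 | Male | No | 12,400 | 4 | Yes | 70,300 | Yes |
| 41 | Female | No | 26,800 | 5 | Yes | 48,200 | No |
| 28 | Female | Yes | 19,600 | 6 | No | 95,300 | No |
| 31 | Female | Yes | 15,100 | 5 | No | 73,300 | Yes |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |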


Some of the survey questions are as follows:

1. What is your age?

2. Are you: Male Female

3. Do you plan to make any real estate purchases in the next two years? Yes No

4. What is the approximate total value of financial investments, exclusive of your home, owned by you or members of your household?

5. How many stock/bond/mutual fund transactions have you made in the past year?

6. Do you have broadband access to the Internet at home? Yes No

7. Please indicate your total household income last year.

8. Do you have children? Yes No

The file Professional contains the responses to these questions. The table shows the portion of the file pertaining to the first five survey respondents.

Managerial Report

Prepare a managerial report summarizing the results of the survey. In addition to statistical summaries, discuss how the magazine might use these results to attract advertisers. You might also comment on how the survey results could be used by the magazine’s editors to identify topics that would be of interest to readers. Your report should address the following issues, but do not limit your analysis to just these areas.

1. Develop appropriate descriptive statistics to summarize the data.

2. Develop 95% confidence intervals for the mean age and household income of subscribers.

3. Develop 95% confidence intervals for the proportion of subscribers who have broadband access at home and the proportion of subscribers who have children.

4. Would Young Professional be a good advertising outlet for online brokers? Justify your conclusion with statistical data.

5. Would this magazine be a good place to advertise for companies selling educational software and computer games for young children?

6. Comment on the types of articles you believe would be of interest to readers of Young Professional.

Case Problem 2: Quality Associates, Inc.

Quality Associates, Inc., a consulting firm, advises its clients about sampling and statistical procedures that can be used to control their manufacturing processes. In one particular application, a client gave Quality Associates a sample of 800 observations taken while that client’s process was operating satisfactorily. The sample standard deviation for these data was 0.21; hence, with so much data, the population standard deviation was assumed to be 0.21. Quality Associates then suggested that random samples of size 30 be taken periodically to monitor the process on an ongoing basis. By analyzing the new samples, the client could quickly learn whether the process was operating satisfactorily. When the process was not operating satisfactorily, corrective action could be taken to eliminate the problem. The design specification indicated that the mean for the process should be 12. The hypothesis test suggested by Quality Associates is as follows:

\[
\begin{array}{l}
H_0: \mu = 12 \\
H_a: \mu \neq 12
\end{array}
\]

Corrective action will be taken any time H0 is rejected. The samples listed in the following table were collected at hourly intervals during the first day of operation of the new statistical process control procedure. These data are available in the file Quality.


Managerial Report

1. Conduct a hypothesis test for each sample at the 0.01 level of significance and determine what action, if any, should be taken. Provide the test statistic and p value for each test.

2. Compute the standard deviation for each of the four samples. Does the conjecture of 0.21 for the population standard deviation appear reasonable?

3. Compute limits for the sample mean x̄ around μ = 12 such that, as long as a new sample mean is within those limits, the process will be considered to be operating satisfactorily. If x̄ exceeds the upper limit or if x̄ is below the lower limit, corrective action will be taken. These limits are referred to as upper and lower control limits for quality-control purposes.

4. Discuss the implications of changing the level of significance to a larger value. What mistake or error could increase if the level of significance is increased?
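Because the case treats the population standard deviation as known (σ = 0.21), each sample can be tested with a z statistic, and the control limits in item 3 are the values of x̄ at which that two-tailed test would just reject H0. The Python sketch below illustrates the calculation for Sample 1 only. It takes the 0.01 level of significance from item 1 and the Sample 1 values from the case data table; the variable names are illustrative, and the sketch is not a complete managerial report.

```python
import math
from statistics import NormalDist, mean

MU_0 = 12.0    # design specification for the process mean
SIGMA = 0.21   # population standard deviation assumed in the case
N = 30         # size of each periodic sample
ALPHA = 0.01   # level of significance from item 1

# Sample 1 from the case data table (file Quality).
sample_1 = [
    11.55, 11.62, 11.52, 11.75, 11.90, 11.64, 11.80, 12.03, 11.94, 11.92,
    12.13, 12.09, 11.93, 12.21, 12.32, 11.93, 11.85, 11.76, 12.16, 11.77,
    12.00, 12.04, 11.98, 12.30, 12.18, 11.97, 12.17, 11.85, 12.30, 12.15,
]

standard_error = SIGMA / math.sqrt(N)
z_crit = NormalDist().inv_cdf(1 - ALPHA / 2)

# Control limits: sample means outside this range lead to rejecting H0.
lower_limit = MU_0 - z_crit * standard_error
upper_limit = MU_0 + z_crit * standard_error

# Two-tailed z test for Sample 1.
x_bar = mean(sample_1)
z = (x_bar - MU_0) / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
take_corrective_action = p_value <= ALPHA

print(f"control limits: ({lower_limit:.4f}, {upper_limit:.4f})")
print(f"x_bar = {x_bar:.4f}, z = {z:.3f}, p value = {p_value:.4f}")
print(f"corrective action needed: {take_corrective_action}")
```

The same steps apply to Samples 2 through 4. Rerunning the sketch with a larger ALPHA narrows the control limits, which bears on the question raised in item 4.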

| Sample 1 | Sample 2 | Sample 3 | Sample 4 |
|---|---|---|---|
| 11.55 | 11.62 | 11.91 | 12.02 |
| 11.62 | 11.69 | 11.36 | 12.02 |
| 11.52 | 11.59 | 11.75 | 12.05 |
| 11.75 | 11.82 | 11.95 | 12.18 |
| 11.90 | 11.97 | 12.14 | 12.11 |
| 11.64 | 11.71 | 11.72 | 12.07 |
| 11.80 | 11.87 | 11.61 | 12.05 |
| 12.03 | 12.10 | 11.85 | 11.64 |
| 11.94 | 12.01 | 12.16 | 12.39 |
| 11.92 | 11.99 | 11.91 | 11.65 |
| 12.13 | 12.20 | 12.12 | 12.11 |
| 12.09 | 12.16 | 11.61 | 11.90 |
| 11.93 | 12.00 | 12.21 | 12.22 |
| 12.21 | 12.28 | 11.56 | 11.88 |
| 12.32 | 12.39 | 11.95 | 12.03 |
| 11.93 | 12.00 | 12.01 | 12.35 |
| 11.85 | 11.92 | 12.06 | 12.09 |
| 11.76 | 11.83 | 11.76 | 11.77 |
| 12.16 | 12.23 | 11.82 | 12.20 |
| 11.77 | 11.84 | 12.12 | 11.79 |
| 12.00 | 12.07 | 11.60 | 12.30 |
| 12.04 | 12.11 | 11.95 | 12.27 |
| 11.98 | 12.05 | 11.96 | 12.29 |
| 12.30 | 12.37 | 12.22 | 12.47 |
| 12.18 | 12.25 | 11.75 | 12.03 |
| 11.97 | 12.04 | 11.96 | 12.17 |
| 12.17 | 12.24 | 11.95 | 11.94 |
| 11.85 | 11.92 | 11.89 | 11.97 |
| 12.30 | 12.37 | 11.88 | 12.23 |
| 12.15 | 12.22 | 11.93 | 12.25 |


Chapter 7
Linear Regression

Contents

Analytics in Action: Alliance Data Systems

7.1 The Simple Linear Regression Model
    Regression Model
    Estimated Regression Equation

7.2 Least Squares Method
    Least Squares Estimates of the Regression Parameters
    Using Excel’s Chart Tools to Compute the Estimated Regression Equation

7.3 Assessing the Fit of the Simple Linear Regression Model
    The Sums of Squares
    The Coefficient of Determination
    Using Excel’s Chart Tools to Compute the Coefficient of Determination

7.4 The Multiple Regression Model
    Regression Model
    Estimated Multiple Regression Equation
    Least Squares Method and Multiple Regression
    Butler Trucking Company and Multiple Regression
    Using Excel’s Regression Tool to Develop the Estimated Multiple Regression Equation

7.5 Inference and Regression
    Conditions Necessary for Valid Inference in the Least Squares Regression Model
    Testing Individual Regression Parameters
    Addressing Nonsignificant Independent Variables
    Multicollinearity

7.6 Categorical Independent Variables
    Butler Trucking Company and Rush Hour
    Interpreting the Parameters
    More Complex Categorical Variables

7.7 Modeling Nonlinear Relationships
    Quadratic Regression Models
    Piecewise Linear Regression Models
    Interaction Between Independent Variables

7.8 Model Fitting
    Variable Selection Procedures
    Overfitting

7.9 Big Data and Regression
    Inference and Very Large Samples
    Model Selection

7.10 Prediction with Regression

Appendix 7.1: Regression with Analytic Solver (MindTap Reader)


Managerial decisions are often based on the relationship between two or more variables. For example, after considering the relationship between advertising expenditures and sales, a marketing manager might attempt to predict sales for a given level of advertising expenditures. In another case, a public utility might use the relationship between the daily high temperature and the demand for electricity to predict electricity usage on the basis of next month’s anticipated daily high temperatures. Sometimes a manager will rely on intuition to judge how two variables are related. However, if data can be obtained, a statistical procedure called regression analysis can be used to develop an equation showing how the variables are related.
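As a preview of what such an equation looks like, the Python sketch below fits a straight line to a handful of advertising and sales figures using the least squares method, the approach listed as Section 7.2 in the chapter contents. The data are hypothetical and chosen only to keep the arithmetic simple; the function name is illustrative.

```python
def least_squares_line(x, y):
    """Return the intercept b0 and slope b1 of the least squares line."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x deviations.
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical data: advertising expenditures and sales, both in $1,000s.
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(advertising, sales)
predicted_sales = b0 + b1 * 6  # predicted sales for advertising of 6

print(f"estimated equation: sales = {b0:.2f} + {b1:.2f} * advertising")
print(f"predicted sales at advertising = 6: {predicted_sales:.2f}")
```

Here advertising plays the role of the variable used for prediction and sales the variable being predicted.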

In regression terminology, the variable being predicted is called the dependent variable, or response, and the variables being used to predict the value of the dependent variable are called the independent variables, or predictor variables. For example, in

ANALYTICS IN ACTION

Alliance Data Systems*

DALLAS, TEXAS

Alliance Data Systems (ADS) provides transaction processing, credit services, and marketing services for clients in the rapidly growing customer relationship management (CRM) industry. ADS clients are concentrated in four industries: retail, petroleum/convenience stores, utilities, and transportation. In 1983, Alliance began offering end-to-end credit-processing services to the retail, petroleum, and casual dining industries; today the company employs more than 6,500 employees who provide services to clients around the world. Operating more than 140,000 point-of-sale terminals in the United States alone, ADS processes in excess of 2.5 billion transactions annually. The company ranks second in the United States in private-label credit services by representing 49 private-label programs with nearly 72 million cardholders. In 2001, ADS made an initial public offering and is now listed on the New York Stock Exchange.

As one of its marketing services, ADS designs direct mail campaigns and promotions. With its database containing information on the spending habits of more than 100 million consumers, ADS can target consumers who are the most likely to benefit from a direct mail promotion. The Analytical Development Group uses regression analysis to build models that measure and predict the responsiveness of consumers to direct market campaigns. Some regression models predict the probability of purchase for individuals receiving a promotion, and others predict the amount spent by consumers who make purchases.

For one campaign, a retail store chain wanted to attract new customers. To predict the effect of the campaign, ADS analysts selected a sample from the consumer database, sent the sampled individuals promotional materials, and then collected transaction data on the consumers’ responses. Sample data were collected on the amount of purchases made by the consumers responding to the campaign, as well as on a variety of consumer-specific variables thought to be useful in predicting sales. The consumer-specific variable that contributed most to predicting the amount purchased was the total amount of credit purchases at related stores over the past 39 months. ADS analysts developed an estimated regression equation relating the amount of purchase to the amount spent at related stores:

$$\hat{y} = 26.7 + 0.00205x$$

where

$\hat{y}$ = predicted amount of purchase
$x$ = amount spent at related stores

Using this equation, we could predict that someone spending $10,000 over the past 39 months at related stores would spend $47.20 when responding to the direct mail promotion. In this chapter, you will learn how to develop this type of estimated regression equation. The final model developed by ADS analysts also included several other variables that increased the predictive power of the preceding equation. Among these variables were the absence or presence of a bank credit card, estimated income, and the average amount spent per trip at a selected store. In this chapter, we will also learn how such additional variables can be incorporated into a multiple regression model.

*The authors are indebted to Philip Clemance, Director of Analytical Development at Alliance Data Systems, for providing this Analytics in Action.
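The $47.20 prediction quoted in the ADS example can be checked with a few lines of Python. This is only a sketch: the function name `predicted_purchase` is a hypothetical label, and the coefficients are the two values quoted in the example.

```python
def predicted_purchase(amount_spent_related_stores):
    """Estimated regression equation from the ADS example: y-hat = 26.7 + 0.00205 x."""
    return 26.7 + 0.00205 * amount_spent_related_stores


# Someone who spent $10,000 at related stores over the past 39 months.
print(f"${predicted_purchase(10_000):.2f}")  # $47.20
```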


296 chapter 7 Linear Regression

analyzing the effect of advertising expenditures on sales, a marketing manager’s desire to predict sales would suggest making sales the dependent variable. Advertising expenditure would be the independent variable used to help predict sales.

In this chapter, we begin by considering simple linear regression, in which the relationship between one dependent variable (denoted by $y$) and one independent variable (denoted by $x$) is approximated by a straight line. We then extend this concept to higher dimensions by introducing multiple linear regression to model the relationship between a dependent variable ($y$) and two or more independent variables ($x_1, x_2, \ldots, x_q$).

7.1 Simple Linear Regression Model

Butler Trucking Company is an independent trucking company in Southern California. A major portion of Butler’s business involves deliveries throughout its local area. To develop better work schedules, the managers want to estimate the total daily travel times for their drivers. The managers believe that the total daily travel times (denoted by $y$) are closely related to the number of miles traveled in making the daily deliveries (denoted by $x$). Using regression analysis, we can develop an equation showing how the dependent variable $y$ is related to the independent variable $x$.

Regression Model

In the Butler Trucking Company example, a simple linear regression model hypothesizes that the travel time of a driving assignment ($y$) is linearly related to the number of miles traveled ($x$) as follows:

SIMPLE LINEAR REGRESSION MODEL

$$y = \beta_0 + \beta_1 x + \varepsilon \tag{7.1}$$

In equation (7.1), $\beta_0$ and $\beta_1$ are population parameters that describe the $y$-intercept and slope of the line relating $y$ and $x$. The error term $\varepsilon$ (Greek letter epsilon) accounts for the variability in $y$ that cannot be explained by the linear relationship between $x$ and $y$. The simple linear regression model assumes that the error term is a normally distributed random variable with a mean of zero and constant variance for all observations.

The statistical methods used in studying the relationship between two variables were first employed by Sir Francis Galton (1822–1911). Galton found that the heights of the sons of unusually tall or unusually short fathers tend to move, or “regress,” toward the average height of the male population. Karl Pearson (1857–1936), a disciple of Galton, later confirmed this finding in a sample of 1,078 pairs of fathers and sons.

Estimated Regression Equation

In practice, the values of the population parameters $\beta_0$ and $\beta_1$ are not known and must be estimated using sample data. Sample statistics (denoted $b_0$ and $b_1$) are computed as estimates of the population parameters $\beta_0$ and $\beta_1$. Substituting the values of the sample statistics $b_0$ and $b_1$ for $\beta_0$ and $\beta_1$ in equation (7.1) and dropping the error term (because its expected value is zero), we obtain the estimated regression equation for simple linear regression:

ESTIMATED SIMPLE LINEAR REGRESSION EQUATION

$$\hat{y} = b_0 + b_1 x \tag{7.2}$$

Figure 7.1 provides a summary of the estimation process for simple linear regression. Using equation (7.2), ŷ provides an estimate for the mean value of y corresponding to a given value of x.

The graph of the estimated simple linear regression equation is called the estimated regression line; $b_0$ is the estimated $y$-intercept, and $b_1$ is the estimated slope. In the next section, we show how the least squares method can be used to compute the values of $b_0$ and $b_1$ in the estimated regression equation.

Examples of possible regression lines are shown in Figure 7.2. The regression line in Panel A shows that the estimated mean value of $y$ is related positively to $x$, with larger values of $\hat{y}$ associated with larger values of $x$. In Panel B, the estimated mean value of $y$ is related negatively to $x$, with smaller values of $\hat{y}$ associated with larger values of $x$. In Panel C, the estimated mean value of $y$ is not related to $x$; that is, $\hat{y}$ is the same for every value of $x$.

FIGURE 7.1 The Estimation Process in Simple Linear Regression

FIGURE 7.2 Possible Regression Lines in Simple Linear Regression (each panel plots $\hat{y}$ against $x$ and marks the intercept $b_0$; Panel A: Positive Linear Relationship, slope $b_1$ is positive; Panel B: Negative Linear Relationship, slope $b_1$ is negative; Panel C: No Relationship, slope $b_1$ is 0)

In general, $\hat{y}$ is the point estimator of $E(y|x)$, the mean value of $y$ for a given value of $x$. Thus, to estimate the mean or expected value of travel time for a driving assignment of 75 miles, Butler Trucking would substitute the value of 75 for $x$ in equation (7.2). In some cases, however, Butler Trucking may be more interested in predicting travel time for an upcoming driving assignment of a particular length. For example, suppose Butler Trucking would like to predict travel time for a new 75-mile driving assignment the company is considering. It turns out that to predict travel time for a new 75-mile driving assignment, Butler Trucking would also substitute the value of 75 for $x$ in equation (7.2). The value of $\hat{y}$ provides both a point estimate of $E(y|x)$ for a given value of $x$ and a prediction of an individual value of $y$ for a given value of $x$. In most cases, we will refer to $\hat{y}$ simply as the predicted value of $y$.

The estimation of $b_0$ and $b_1$ is a statistical process much like the estimation of the population mean, $\mu$, discussed in Chapter 6. $\beta_0$ and $\beta_1$ are the unknown parameters of interest, and $b_0$ and $b_1$ are the sample statistics used to estimate the parameters.

A point estimator is a single value used as an estimate of the corresponding population parameter.

7.2 Least Squares Method

The least squares method is a procedure for using sample data to find the estimated regression equation. To illustrate the least squares method, suppose data were collected from a sample of 10 Butler Trucking Company driving assignments. For the $i$th observation or driving assignment in the sample, $x_i$ is the miles traveled and $y_i$ is the travel time (in hours). The values of $x_i$ and $y_i$ for the 10 driving assignments in the sample are summarized in Table 7.1. We see that driving assignment 1, with $x_1 = 100$ and $y_1 = 9.3$, is a driving assignment of 100 miles and a travel time of 9.3 hours. Driving assignment 2, with $x_2 = 50$ and $y_2 = 4.8$, is a driving assignment of 50 miles and a travel time of 4.8 hours. The shortest travel time is for driving assignment 5, which requires 50 miles with a travel time of 4.2 hours.

Figure 7.3 is a scatter chart of the data in Table 7.1. Miles traveled is shown on the horizontal axis, and travel time (in hours) is shown on the vertical axis. Scatter charts for regression analysis are constructed with the independent variable x on the horizontal axis and the dependent variable y on the vertical axis. The scatter chart enables us to observe the data graphically and to draw preliminary conclusions about the possible relationship between the variables.

What preliminary conclusions can be drawn from Figure 7.3? Longer travel times appear to coincide with more miles traveled. In addition, for these data, the relationship between the travel time and miles traveled appears to be approximated by a straight line; indeed, a positive linear relationship is indicated between $x$ and $y$. We therefore choose the simple linear regression model to represent this relationship. Given that choice, our next task is to use the sample data in Table 7.1 to determine the values of $b_0$ and $b_1$ in the estimated simple linear regression equation. For the $i$th driving assignment, the estimated regression equation provides:

$$\hat{y}_i = b_0 + b_1 x_i \tag{7.3}$$

where

$\hat{y}_i$ = predicted travel time (in hours) for the $i$th driving assignment
$b_0$ = the $y$-intercept of the estimated regression line
$b_1$ = the slope of the estimated regression line
$x_i$ = miles traveled for the $i$th driving assignment

TABLE 7.1 Miles Traveled and Travel Time for 10 Butler Trucking Company Driving Assignments

| Driving Assignment i | x = Miles Traveled | y = Travel Time (hours) |
|---|---|---|
| 1 | 100 | 9.3 |
| 2 | 50 | 4.8 |
| 3 | 100 | 8.9 |
| 4 | 100 | 6.5 |
| 5 | 50 | 4.2 |
| 6 | 80 | 6.2 |
| 7 | 75 | 7.4 |
| 8 | 65 | 6.0 |
| 9 | 90 | 7.6 |
| 10 | 90 | 6.1 |

With $y_i$ denoting the observed (actual) travel time for driving assignment $i$ and $\hat{y}_i$ in equation (7.3) representing the predicted travel time for driving assignment $i$, every driving assignment in the sample will have an observed travel time $y_i$ and a predicted travel time $\hat{y}_i$. For the estimated regression line to provide a good fit to the data, the differences between the observed travel times $y_i$ and the predicted travel times $\hat{y}_i$ should be small.

The least squares method uses the sample data to provide the values of $b_0$ and $b_1$ that minimize the sum of the squares of the deviations between the observed values of the dependent variable $y_i$ and the predicted values of the dependent variable $\hat{y}_i$. The criterion for the least squares method is given by equation (7.4).

FIGURE 7.3 Scatter Chart of Miles Traveled and Travel Time for Sample of 10 Butler Trucking Company Driving Assignments (horizontal axis: Miles Traveled, x; vertical axis: Travel Time (hours), y)

LEAST SQUARES EQUATION

$$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \min \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \tag{7.4}$$

where

$y_i$ = observed value of the dependent variable for the $i$th observation
$\hat{y}_i$ = predicted value of the dependent variable for the $i$th observation
$n$ = total number of observations

The error we make using the regression model to estimate the mean value of the dependent variable for the $i$th observation is often written as $e_i = y_i - \hat{y}_i$ and is referred to as the $i$th residual. Using this notation, equation (7.4) can be rewritten as

$$\min \sum_{i=1}^{n} e_i^2$$

and we say that we are estimating the regression equation that minimizes the sum of squared errors.
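To make the criterion in equation (7.4) concrete, the following sketch evaluates the sum of squared errors for a few candidate lines on the Butler Trucking data. The function name and the two alternative slopes are hypothetical choices made only for comparison; 1.2739 and 0.0678 are the rounded least squares estimates reported in this section.

```python
# Butler Trucking sample (assignment 3 is 100 miles, as in Table 7.2 and Figure 7.6).
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]


def sum_squared_errors(b0, b1, x, y):
    """Sum over i of (y_i - b0 - b1 * x_i)^2, the quantity minimized in equation (7.4)."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


least_squares = sum_squared_errors(1.2739, 0.0678, miles, hours)
steeper = sum_squared_errors(1.2739, 0.0800, miles, hours)  # hypothetical alternative slope
flatter = sum_squared_errors(1.2739, 0.0500, miles, hours)  # hypothetical alternative slope

print(f"least squares line: {least_squares:.3f}")
print(f"steeper line:       {steeper:.3f}")
print(f"flatter line:       {flatter:.3f}")
```

Both alternative lines produce a larger sum of squared errors than the least squares line, which is what the criterion requires.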


Least Squares Estimates of the Regression Parameters

Although the values of $b_0$ and $b_1$ that minimize equation (7.4) can be calculated manually with equations (see note at end of this section), computer software such as Excel is generally used to calculate $b_1$ and $b_0$. For the Butler Trucking Company data in Table 7.1, an estimated slope of $b_1 = 0.0678$ and a $y$-intercept of $b_0 = 1.2739$ minimize the sum of squared errors (in the next section we show how to use Excel to obtain these values). Thus, our estimated simple linear regression equation is $\hat{y} = 1.2739 + 0.0678x$.

We interpret $b_1$ and $b_0$ as we would the slope and $y$-intercept of any straight line. The slope $b_1$ is the estimated change in the mean of the dependent variable $y$ that is associated with a one-unit increase in the independent variable $x$. For the Butler Trucking Company model, we therefore estimate that, if the length of a driving assignment were 1 mile longer, the mean travel time for that driving assignment would be 0.0678 hour (or approximately 4 minutes) longer. The $y$-intercept $b_0$ is the estimated value of the dependent variable $y$ when the independent variable $x$ is equal to 0. For the Butler Trucking Company model, we estimate that if the driving distance for a driving assignment were 0 units (0 miles), the mean travel time would be 1.2739 units (1.2739 hours, or approximately 76 minutes). Can we find a plausible explanation for this? Perhaps the 76 minutes represent the time needed to prepare, load, and unload the vehicle, which is required for all trips regardless of distance and which therefore does not depend on the distance traveled. However, we cautiously note that to estimate the travel time for a driving distance of 0 miles, we have to extend the relationship we have found with simple linear regression well beyond the range of values for driving distance in our sample. Those sample values range from 50 to 100 miles, and this range represents the only values of driving distance for which we have empirical evidence of the relationship between driving distance and our estimated travel time.

It is important to note that the regression model is valid only over the experimental region, which is the range of values of the independent variables in the data used to estimate the model. Prediction of the value of the dependent variable outside the experimental region is called extrapolation. Because we have no empirical evidence that the relationship between $y$ and $x$ holds true for $x$ values outside the range of $x$ values in the data used to estimate the relationship, extrapolation is risky and should be avoided if possible. For Butler Trucking, this means that any prediction of the travel time for a driving distance less than 50 miles or greater than 100 miles is not a reliable estimate. Thus, any interpretation of $b_0$ based on the Butler Trucking Company data is unreliable and likely meaningless. However, if the experimental region for a regression analysis includes zero, the $y$-intercept will have a meaningful interpretation.
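One simple way to guard against accidental extrapolation is to check a requested value of x against the range observed in the sample before predicting. The helper below is a hypothetical sketch, not part of Excel or any library; it treats the experimental region as the interval from the smallest to the largest sampled value of miles traveled.

```python
# Miles traveled in the Butler Trucking sample (50 to 100 miles).
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]


def in_experimental_region(x_new, x_sample):
    """Return True when x_new lies within the range of x values used to fit the model."""
    return min(x_sample) <= x_new <= max(x_sample)


for distance in (0, 75, 120):
    status = "within the experimental region" if in_experimental_region(distance, miles) else "extrapolation"
    print(f"{distance} miles: {status}")
```

A 75-mile assignment falls inside the experimental region, while 0 miles (the y-intercept) and 120 miles both require extrapolation.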

We can use the estimated regression equation and our known values for miles traveled for a driving assignment ($x$) to estimate mean travel time in hours. For example, the first driving assignment in Table 7.1 has a value for miles traveled of $x = 100$. We estimate the mean travel time in hours for this driving assignment to be

$$\hat{y}_1 = 1.2739 + 0.0678(100) = 8.0539$$

Since the travel time for this driving assignment was 9.3 hours, this regression estimate would have resulted in a residual of

$$e_1 = y_1 - \hat{y}_1 = 9.3 - 8.0539 = 1.2461$$

The simple linear regression model underestimated travel time for this driving assignment by 1.2461 hours (approximately 74 minutes). Table 7.2 shows the predicted mean travel times, the residuals, and the squared residuals for all 10 driving assignments in the sample data. The entries in Table 7.2 are computed from unrounded values of $b_0$ and $b_1$, which is why the table reports 8.0565 and 1.2435 for the first driving assignment. Note the following in Table 7.2:

• The sum of predicted values $\hat{y}_i$ is equal to the sum of the values of the dependent variable $y$.
• The sum of the residuals $e_i$ is 0.
• The sum of the squared residuals $e_i^2$ has been minimized.

These three points will always be true for a simple linear regression that is determined by equation (7.5). Figure 7.4 shows the simple linear regression line $\hat{y}_i = 1.2739 + 0.0678x_i$ superimposed on the scatter chart for the Butler Trucking Company data in Table 7.1. Figure 7.4 highlights the residuals for driving assignment 3 and driving assignment 5.

The estimated value of the $y$-intercept often results from extrapolation.

The point estimate $\hat{y}$ provided by the regression equation does not give us any information about the precision associated with the prediction. For that we must develop an interval estimate around the point estimate. In the last section of this chapter, we discuss the construction of interval estimates around the point predictions provided by a regression equation.

TABLE 7.2 Predicted Travel Time and Residuals for 10 Butler Trucking Company Driving Assignments

| Driving Assignment i | x = Miles Traveled | y = Travel Time (hours) | $\hat{y}_i = b_0 + b_1 x_i$ | $e_i = y_i - \hat{y}_i$ | $e_i^2$ |
|---|---|---|---|---|---|
| 1 | 100 | 9.3 | 8.0565 | 1.2435 | 1.5463 |
| 2 | 50 | 4.8 | 4.6652 | 0.1348 | 0.0182 |
| 3 | 100 | 8.9 | 8.0565 | 0.8435 | 0.7115 |
| 4 | 100 | 6.5 | 8.0565 | -1.5565 | 2.4227 |
| 5 | 50 | 4.2 | 4.6652 | -0.4652 | 0.2164 |
| 6 | 80 | 6.2 | 6.7000 | -0.5000 | 0.2500 |
| 7 | 75 | 7.4 | 6.3609 | 1.0391 | 1.0797 |
| 8 | 65 | 6.0 | 5.6826 | 0.3174 | 0.1007 |
| 9 | 90 | 7.6 | 7.3783 | 0.2217 | 0.0492 |
| 10 | 90 | 6.1 | 7.3783 | -1.2783 | 1.6341 |
| Totals | | 67.0 | 67.0000 | 0.0000 | 8.0288 |

FIGURE 7.4 Scatter Chart of Miles Traveled and Travel Time for Butler Trucking Company Driving Assignments with Regression Line Superimposed (horizontal axis: Miles Traveled, x; vertical axis: Travel Time (hours), y; the line $\hat{y}_i = 1.2739 + 0.0678x_i$ is drawn and the residuals for driving assignments 3 and 5 are marked)

FIGURE 7.5 A Geometric Interpretation of the Least Squares Method (same axes and regression line $\hat{y}_i = 1.2739 + 0.0678x_i$ as Figure 7.4)

The regression model underpredicts travel time for some driving assignments ($e_3 > 0$) and overpredicts travel time for others ($e_5 < 0$), but in general appears to fit the data relatively well.

In Figure 7.5, a vertical line is drawn from each point in the scatter chart to the linear regression line. Each of these vertical lines represents the difference between the actual driving time and the driving time we predict using linear regression for one of the assignments in our data. The length of each vertical line is equal to the absolute value of the residual for one of the driving assignments. When we square a residual, the resulting value is equal to the area of the square with the length of each side equal to the absolute value of the residual. In other words, the square of the residual for driving assignment 4, $e_4^2 = (-1.5565)^2 = 2.4227$, is the area of a square for which the length of each side is 1.5565. Thus, when we find the linear regression model that minimizes the sum of squared errors for the Butler Trucking example, we are positioning the regression line in the manner that minimizes the sum of the areas of the 10 squares in Figure 7.5.

Using Excel’s Chart Tools to Compute the Estimated Regression Equation

We can use Excel’s chart tools to compute the estimated regression equation on a scatter chart of the Butler Trucking Company data in Table 7.1. After constructing a scatter chart (as shown in Figure 7.3) with Excel’s chart tools, the following steps describe how to compute the estimated regression equation using the data in the worksheet:

Step 1. Right-click on any data point in the scatter chart and select Add Trendline
Step 2. When the Format Trendline task pane appears:
    Select Linear in the Trendline Options area
    Select Display Equation on chart in the Trendline Options area

The worksheet displayed in Figure 7.6 shows the original data, scatter chart, estimated regression line, and estimated regression equation.

Note that Excel uses y instead of ŷ to denote the predicted value of the dependent variable and puts the regression equation into slope-intercept form, whereas we use the intercept-slope form that is standard in statistics.

FIGURE 7.6 Scatter Chart and Estimated Regression Line for Butler Trucking Company (Excel worksheet listing Assignment, Miles, and Time for the 10 driving assignments, with a scatter chart of Travel Time (hours), y, against Miles Traveled, x, and the trendline equation y = 0.0678x + 1.2739)

NOTES + COMMENTS

1. Differential calculus can be used to show that the values of $b_0$ and $b_1$ that minimize expression (7.5) are given by:

SLOPE EQUATION

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

y-INTERCEPT EQUATION

$$b_0 = \bar{y} - b_1 \bar{x}$$

where

$x_i$ = value of the independent variable for the $i$th observation
$y_i$ = value of the dependent variable for the $i$th observation
$\bar{x}$ = mean value for the independent variable
$\bar{y}$ = mean value for the dependent variable
$n$ = total number of observations

2. Equation 7.4 minimizes the sum of the squared deviations between the observed values of the dependent variable $y_i$ and the predicted values of the dependent variable $\hat{y}_i$. One alternative is to simply minimize the sum of the deviations between the observed values of the dependent variable $y_i$ and the predicted values of the dependent variable $\hat{y}_i$. This is not a viable option because then negative deviations (observations for which the regression forecast exceeds the actual value) and positive deviations (observations for which the regression forecast is less than the actual value) offset each other. Another alternative is to minimize the sum of the absolute value of the deviations between the observed values of the dependent variable $y_i$ and the predicted values of the dependent variable $\hat{y}_i$. It is possible to compute estimated regression parameters that minimize this sum of the absolute value of the deviations, but this approach is more difficult than the least squares approach.
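The slope and y-intercept equations in note 1 can be applied directly to the Butler Trucking data. This is a minimal sketch in plain Python; the names `s_xy` and `s_xx` are illustrative labels for the numerator and denominator of the slope equation.

```python
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

n = len(miles)
x_bar = sum(miles) / n  # mean miles traveled
y_bar = sum(hours) / n  # mean travel time

# Numerator and denominator of the slope equation.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, hours))
s_xx = sum((x - x_bar) ** 2 for x in miles)

b1 = s_xy / s_xx          # slope equation
b0 = y_bar - b1 * x_bar   # y-intercept equation

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")  # b1 = 0.0678, b0 = 1.2739
```

For these data the numerator is 234 and the denominator is 3,450, so the slope is 234/3,450, which rounds to 0.0678.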


7.3 Assessing the Fit of the Simple Linear Regression Model

For the Butler Trucking Company example, we developed the estimated regression equation $\hat{y}_i = 1.2739 + 0.0678x_i$ to approximate the linear relationship between the miles traveled ($x$) and travel time in hours ($y$). We now wish to assess how well the estimated regression equation fits the sample data. We begin by developing the intermediate calculations, referred to as the sums of squares.

The Sums of Squares

Recall that we found our estimated regression equation for the Butler Trucking Company example by minimizing the sum of squares of the residuals. This quantity, also known as the sum of squares due to error, is denoted by SSE.

SUM OF SQUARES DUE TO ERROR

$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{7.5}$$

The value of SSE is a measure of the error (in squared units of the dependent variable) that results from using the estimated regression equation to predict the values of the dependent variable in the sample.

We have already shown the calculations required to compute the sum of squares due to error for the Butler Trucking Company example in Table 7.2. The squared residual or error for each observation in the data is shown in the last column of that table. After computing and squaring the residuals for each driving assignment in the sample, we sum them to obtain SSE = 8.0288 hours². Thus, SSE = 8.0288 measures the error in using the estimated regression equation $\hat{y}_i = 1.2739 + 0.0678x_i$ to predict travel time for the driving assignments in the sample.

Now suppose we are asked to predict travel time in hours without knowing the miles traveled for a driving assignment. Without knowledge of any related variables, we would use the sample mean $\bar{y}$ as a predictor of travel time for any given driving assignment. To find $\bar{y}$, we divide the sum of the actual driving times $y_i$ from Table 7.2 (67) by the number of observations $n$ in the data (10); this yields $\bar{y} = 6.7$.

Figure 7.7 provides insight on how well we would predict the values of $y_i$ in the Butler Trucking Company example using $\bar{y} = 6.7$. From this figure, which again highlights the residuals for driving assignments 3 and 5, we can see that $\bar{y}$ tends to overpredict travel times for driving assignments that have relatively small values for miles traveled (such as driving assignment 5) and tends to underpredict travel times for driving assignments that have relatively large values for miles traveled (such as driving assignment 3).

In Table 7.3 we show the sum of squared deviations obtained by using the sample mean $\bar{y} = 6.7$ to predict the value of travel time in hours for each driving assignment in the sample. For the $i$th driving assignment in the sample, the difference $y_i - \bar{y}$ provides a measure of the error involved in using $\bar{y}$ to predict travel time for the $i$th driving assignment. The corresponding sum of squares, called the total sum of squares, is denoted by SST.

TOTAL SUM OF SQUARES, SST

$$\text{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \tag{7.6}$$

The sum at the bottom of the last column in Table 7.3 is the total sum of squares for Butler Trucking Company: SST = 23.9 hours².
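The SST calculation in equation (7.6) needs only the observed travel times, since the predictor is the sample mean. A short sketch, with illustrative variable names:

```python
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

y_bar = sum(hours) / len(hours)            # sample mean travel time
deviations = [y - y_bar for y in hours]    # y_i - y_bar for each assignment
sst = sum(d ** 2 for d in deviations)      # equation (7.6)

print(f"y_bar = {y_bar:.1f}, SST = {sst:.1f}")  # y_bar = 6.7, SST = 23.9
```

The deviations themselves sum to zero, which is why they are squared before being added.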

Now we put it all together. In Figure 7.8 we show the estimated regression line $\hat{y}_i = 1.2739 + 0.0678x_i$ and the line corresponding to $\bar{y} = 6.7$. Note that the points cluster more closely around the estimated regression line $\hat{y}_i = 1.2739 + 0.0678x_i$ than they do about the horizontal line $\bar{y} = 6.7$. For example, for the third driving assignment in the sample, we see that the error is much larger when $\bar{y} = 6.7$ is used to predict $y_3$ than when $\hat{y}_3 = 1.2739 + 0.0678(100) = 8.0539$ is used. We can think of SST as a measure of how well the observations cluster about the $\bar{y}$ line and SSE as a measure of how well the observations cluster about the $\hat{y}$ line.

To measure how much the $\hat{y}$ values on the estimated regression line deviate from $\bar{y}$, another sum of squares is computed. This sum of squares, called the sum of squares due to regression, is denoted by SSR.

FIGURE 7.7 The Sample Mean $\bar{y}$ as a Predictor of Travel Time for Butler Trucking Company (horizontal axis: Miles Traveled, x; vertical axis: Travel Time (hours), y; the deviations $y_3 - \bar{y}$ and $y_5 - \bar{y}$ are marked)

TABLE 7.3 Calculations for the Sum of Squares Total for the Butler Trucking Simple Linear Regression

| Driving Assignment i | x = Miles Traveled | y = Travel Time (hours) | $y_i - \bar{y}$ | $(y_i - \bar{y})^2$ |
|---|---|---|---|---|
| 1 | 100 | 9.3 | 2.6 | 6.76 |
| 2 | 50 | 4.8 | -1.9 | 3.61 |
| 3 | 100 | 8.9 | 2.2 | 4.84 |
| 4 | 100 | 6.5 | -0.2 | 0.04 |
| 5 | 50 | 4.2 | -2.5 | 6.25 |
| 6 | 80 | 6.2 | -0.5 | 0.25 |
| 7 | 75 | 7.4 | 0.7 | 0.49 |
| 8 | 65 | 6.0 | -0.7 | 0.49 |
| 9 | 90 | 7.6 | 0.9 | 0.81 |
| 10 | 90 | 6.1 | -0.6 | 0.36 |
| Totals | | 67.0 | 0 | 23.9 |


SUM OF SQUARES DUE TO REGRESSION, SSR

$$\text{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \tag{7.7}$$

From the preceding discussion, we should expect that SST, SSR, and SSE are related. Indeed, the relationship among these three sums of squares is

$$\text{SST} = \text{SSR} + \text{SSE} \tag{7.8}$$

where

SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
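The relationship in equation (7.8) can be verified numerically for the Butler Trucking data. The sketch below computes the least squares coefficients from the closed-form slope and intercept formulas and then evaluates equations (7.5), (7.6), and (7.7); variable names are illustrative. Since it works with unrounded values, SSR and SSE may differ in the fourth decimal place from figures obtained by summing rounded table entries.

```python
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

n = len(miles)
x_bar = sum(miles) / n
y_bar = sum(hours) / n

# Least squares slope and intercept (closed-form formulas).
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, hours)) / sum(
    (x - x_bar) ** 2 for x in miles
)
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in miles]

sse = sum((y - y_hat) ** 2 for y, y_hat in zip(hours, predicted))  # equation (7.5)
sst = sum((y - y_bar) ** 2 for y in hours)                          # equation (7.6)
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)              # equation (7.7)

print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}")
print("SST equals SSR + SSE:", abs(sst - (ssr + sse)) < 1e-9)
```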

The Coefficient of Determination

Now let us see how the three sums of squares, SST, SSR, and SSE, can be used to provide a measure of the goodness of fit for the estimated regression equation. The estimated regression equation would provide a perfect fit if every value of the dependent variable $y_i$ happened to lie on the estimated regression line. In this case, $y_i - \hat{y}_i$ would be zero for each observation, resulting in SSE = 0. Because SST = SSR + SSE, we see that for a perfect fit SSR must equal SST, and the ratio (SSR/SST) must equal one. Poorer fits will result in larger values for SSE. Solving for SSE in equation (7.8), we see that SSE = SST − SSR. Hence, the largest value for SSE (and hence the poorest fit) occurs when SSR = 0 and SSE = SST. The ratio SSR/SST, which will take values between zero and one, is used to evaluate the goodness of fit for the estimated regression equation. This ratio is called the coefficient of determination and is denoted by $r^2$.

In simple regression, $r^2$ is often referred to as the simple coefficient of determination.

COEFFICIENT OF DETERMINATION

$$r^2 = \frac{\text{SSR}}{\text{SST}} \tag{7.9}$$

FIGURE 7.8 Deviations About the Estimated Regression Line and the Line $y = \bar{y}$ for the Third Butler Trucking Company Driving Assignment (horizontal axis: Miles Traveled, x; vertical axis: Travel Time (hours), y; the line y = 0.0678x + 1.2739 is drawn and the deviations $y_3 - \bar{y}$, $y_3 - \hat{y}_3$, and $\hat{y}_3 - \bar{y}$ are marked)


For the Butler Trucking Company example, the value of the coefficient of determination is

$$r^2 = \frac{\text{SSR}}{\text{SST}} = \frac{15.8712}{23.9} = 0.6641$$

When we express the coefficient of determination as a percentage, $r^2$ can be interpreted as the percentage of the total sum of squares that can be explained by using the estimated regression equation. For Butler Trucking Company, we can conclude that 66.41% of the total sum of squares can be explained by using the estimated regression equation $\hat{y}_i = 1.2739 + 0.0678x_i$ to predict travel time. In other words, 66.41% of the variability in the values of travel time in our sample can be explained by the linear relationship between the miles traveled and travel time.
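The arithmetic behind the 66.41% figure can be retraced from the sums of squares reported in this section (SST = 23.9 and SSE = 8.0288). The sketch below also shows the equivalent form 1 − SSE/SST, which follows from equation (7.8); variable names are illustrative.

```python
sst = 23.9      # total sum of squares (Table 7.3)
sse = 8.0288    # sum of squares due to error (Table 7.2)
ssr = sst - sse # sum of squares due to regression, from SST = SSR + SSE

r_squared = ssr / sst
r_squared_alt = 1 - sse / sst  # same quantity written in terms of SSE

print(f"SSR = {ssr:.4f}")        # SSR = 15.8712
print(f"r^2 = {r_squared:.4f}")  # r^2 = 0.6641
print(f"{r_squared:.2%} of the variability in travel time is explained")  # 66.41%
```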

Using Excel’s Chart Tools to Compute the Coefficient of Determination

In Section 7.2 we used Excel’s chart tools to construct a scatter chart and compute the estimated regression equation for the Butler Trucking Company data. We will now describe how to compute the coefficient of determination using the scatter chart in Figure 7.3.

Step 1. Right-click on any data point in the scatter chart and select Add Trendline...
Step 2. When the Format Trendline task pane appears:
    Select Display R-squared value on chart in the Trendline Options area

Figure 7.9 displays the scatter chart, the estimated regression equation, the graph of the estimated regression equation, and the coefficient of determination for the Butler Trucking Company data. We see that $r^2 = 0.6641$.

The coefficient of determination $r^2$ is the square of the correlation between $y_i$ and $\hat{y}_i$, and $0 \le r^2 \le 1$.

Note that Excel notates the coefficient of determination as $R^2$.

FIGURE 7.9 Scatter Chart and Estimated Regression Line with Coefficient of Determination $r^2$ for Butler Trucking Company (Excel worksheet listing Assignment, Miles, and Time for the 10 driving assignments, with a scatter chart of Travel Time (hours), y, against Miles Traveled, x, and the trendline labels y = 0.0678x + 1.2739 and R² = 0.6641)


In the multiple regression model, $\beta_0, \beta_1, \beta_2, \ldots, \beta_q$ are the parameters and the error term $\varepsilon$ is a normally distributed random variable with a mean of zero and a constant variance across all observations. A close examination of this model reveals that $y$ is a linear function of $x_1, x_2, \ldots, x_q$ plus the error term $\varepsilon$. As in simple regression, the error term accounts for the variability in $y$ that cannot be explained by the linear effect of the $q$ independent variables.

The interpretation of the y-intercept $\beta_0$ in multiple regression is similar to the interpretation in simple regression; in a multiple regression model, $\beta_0$ is the mean of the dependent variable $y$ when all of the independent variables $x_1, x_2, \ldots, x_q$ are equal to zero. On the other hand, the interpretation of the slope coefficients $\beta_1, \beta_2, \ldots, \beta_q$ in a multiple regression model differs in a subtle but important way from the interpretation of the slope $\beta_1$ in a simple regression model. In a multiple regression model the slope coefficient $\beta_j$ represents the change in the mean value of the dependent variable $y$ that corresponds to a one-unit increase in the independent variable $x_j$, holding the values of all other independent variables in the model constant. Thus, in a multiple regression model, the slope coefficient $\beta_1$ represents the change in the mean value of the dependent variable $y$ that corresponds to a one-unit increase in the independent variable $x_1$, holding the values of $x_2, x_3, \ldots, x_q$ constant. Similarly, the slope coefficient $\beta_2$ represents the change in the mean value of the dependent variable $y$ that corresponds to a one-unit increase in the independent variable $x_2$, holding the values of $x_1, x_3, \ldots, x_q$ constant.
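To see why a slope is read with the other variables held constant, it can help to compare the mean of $y$ at two settings that differ only in one independent variable. The following short derivation, shown for the first independent variable, uses only the form of the multiple regression model in equation (7.10) and the assumption that the error term has a mean of zero.

```latex
\begin{aligned}
E(y \mid x_1 + 1, x_2, \ldots, x_q) &= \beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2 + \cdots + \beta_q x_q \\
E(y \mid x_1, x_2, \ldots, x_q) &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q \\
E(y \mid x_1 + 1, x_2, \ldots, x_q) - E(y \mid x_1, x_2, \ldots, x_q) &= \beta_1
\end{aligned}
```

Every term except the one involving $\beta_1$ cancels in the difference, which is exactly the "holding all other independent variables constant" qualification.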

Estimated Multiple Regression Equation

In practice, the values of the population parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_q$ are not known and so must be estimated from sample data. A simple random sample is used to compute sample statistics $b_0, b_1, b_2, \ldots, b_q$ that are then used as the point estimators of the parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_q$. These sample statistics provide the following estimated multiple regression equation.

NOTES + COMMENTS

As a practical matter, for typical data in the social and behavioral sciences, values of $r^2$ as low as 0.25 are often considered useful. For data in the physical and life sciences, $r^2$ values of 0.60 or greater are often found; in fact, in some cases, $r^2$ values greater than 0.90 can be found. In business applications, $r^2$ values vary greatly, depending on the unique characteristics of each application.

7.4 The Multiple Regression Model

We now extend our discussion to the study of how a dependent variable $y$ is related to two or more independent variables.

Regression Model

The concepts of a regression model and a regression equation introduced in the preceding sections are applicable in the multiple regression case. We will use $q$ to denote the number of independent variables in the regression model. The equation that describes how the dependent variable $y$ is related to the independent variables $x_1, x_2, \ldots, x_q$ and an error term is called the multiple regression model. We begin with the assumption that the multiple regression model takes the following form:

MULTIPLE REGRESSION MODEL

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q + \varepsilon \qquad (7.10)$$


ESTIMATED MULTIPLE REGRESSION EQUATION

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_q x_q \qquad (7.11)$$

where

$b_0, b_1, b_2, \ldots, b_q$ = the point estimates of $\beta_0, \beta_1, \beta_2, \ldots, \beta_q$

$\hat{y}$ = estimated mean value of $y$ given values for $x_1, x_2, \ldots, x_q$

Least Squares Method and Multiple Regression

As with simple linear regression, in multiple regression we wish to find a model that results in small errors over the sample data. We continue to use the least squares method to develop the estimated multiple regression equation; that is, we find $b_0, b_1, b_2, \ldots, b_q$ that minimize the sum of squared residuals (the squared deviations between the observed values of the dependent variable $y_i$ and the estimated values of the dependent variable $\hat{y}_i$):

$$\min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \min \sum_{i=1}^{n} (y_i - b_0 - b_1 x_1 - \cdots - b_q x_q)^2 = \min \sum_{i=1}^{n} e_i^2 \qquad (7.12)$$

The estimation process for multiple regression is shown in Figure 7.10. The estimated values of the dependent variable $y$ are computed by substituting values of the independent variables $x_1, x_2, \ldots, x_q$ into the estimated multiple regression equation (7.11).

As in simple regression, it is possible to derive formulas that determine the values of the regression coefficients that minimize equation (7.12). However, these formulas involve the use of matrix algebra and are outside the scope of this text. Therefore, in presenting multiple regression, we focus on how computer software packages can be used to obtain the estimated regression equation and other information. The emphasis will be on how to construct and interpret a regression model.
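For readers curious about what such software does internally, the sketch below solves the least squares problem by forming and solving the so-called normal equations. This is one standard approach, not necessarily the method Excel uses, and the helper names and the small data set are hypothetical. The data are generated without noise from $y = 1 + 2x_1 + 3x_2$, so the fitted coefficients should recover approximately 1, 2, and 3.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] so row operations touch both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def least_squares(rows, y):
    """Return [b0, b1, ..., bq] minimizing the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 is the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical, noise-free data generated from y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coefficients = least_squares(rows, y)
print([round(c, 6) for c in coefficients])
```

With real data the observations do not fall exactly on a plane, and the same procedure returns the coefficients that make the sum of squared residuals as small as possible.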

FIGURE 7.10 The Estimation Process for Multiple Regression

[Flowchart: The multiple regression model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q + \varepsilon$ has unknown parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_q$. Sample data on $x_1, x_2, \ldots, x_q$ and $y$ are used to compute the estimated multiple regression equation $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_q x_q$, where $b_0, b_1, b_2, \ldots, b_q$ are sample statistics. These sample statistics provide the estimates of $\beta_0, \beta_1, \beta_2, \ldots, \beta_q$.]


Butler Trucking Company and Multiple Regression

As an illustration of multiple regression analysis, recall that a major portion of Butler Trucking Company’s business involves deliveries throughout its local area and that the managers want to estimate the total daily travel time for their drivers in order to develop better work schedules for the company’s drivers.

Initially, the managers believed that the total daily travel time would be closely related to the number of miles traveled in making the daily deliveries. Based on a simple random sample of 10 driving assignments, we explored the simple linear regression model $y = \beta_0 + \beta_1 x + \varepsilon$ to describe the relationship between travel time ($y$) and number of miles ($x$). As Figure 7.9 shows, we found that the estimated simple linear regression equation for our sample data is $\hat{y}_i = 1.2739 + 0.0678x_i$. With a coefficient of determination $r^2 = 0.6641$, the linear effect of the number of miles traveled explains 66.41% of the variability in travel time in the sample data, and so 33.59% of the variability in sample travel times remains unexplained. This result suggests to Butler’s managers that other factors may contribute to the travel times for driving assignments. The managers might want to consider adding one or more independent variables to the model to explain some of the remaining variability in the dependent variable.

In considering other independent variables for their model, the managers felt that the number of deliveries made on a driving assignment also contributes to the total travel time. To support the development of a multiple regression model that includes both the number of miles traveled and the number of deliveries, they augment their original data with information on the number of deliveries for the 10 driving assignments in the original data and they collect new observations over several ensuing weeks. The new data, which consist of 300 observations, are provided in the file ButlerWithDeliveries. Note that we now refer to the independent variable miles traveled as $x_1$ and the independent variable number of deliveries as $x_2$.

Our multiple linear regression with two independent variables will take the form $\hat{y} = b_0 + b_1 x_1 + b_2 x_2$. The SSE, SST, and SSR are again calculated using equations (7.5), (7.6), and (7.7), respectively. Thus, the coefficient of determination, which in multiple regression is denoted by $R^2$, is again calculated using equation (7.9). We will now use Excel’s Regression tool to calculate the values of the estimates $b_0$, $b_1$, $b_2$, and $R^2$.

Using Excel’s Regression Tool to Develop the Estimated Multiple Regression Equation

The following steps describe how to use Excel’s Regression tool to compute the estimated regression equation using the data in the worksheet.

Step 1. Click the Data tab in the Ribbon

Step 2. Click Data Analysis in the Analysis group

Step 3. Select Regression from the list of Analysis Tools in the Data Analysis tools box (shown in Figure 7.11) and click OK

Step 4. When the Regression dialog box appears (as shown in Figure 7.12):

Enter D1:D301 in the Input Y Range: box

Enter B1:C301 in the Input X Range: box

Select Labels

Select Confidence Level:

Enter 99 in the Confidence Level: box

Select New Worksheet Ply:

Click OK

In the Excel output shown in Figure 7.13, the label for the independent variable $x_1$ is Miles (see cell A18), and the label for the independent variable $x_2$ is Deliveries (see cell A19). The estimated regression equation is

$$\hat{y} = 0.1273 + 0.0672x_1 + 0.6900x_2 \qquad (7.13)$$

In multiple regression, $R^2$ is often referred to as the multiple coefficient of determination.

When using Excel’s Regression tool, the data for the independent variables must be in adjacent columns or rows. Thus, you may have to rearrange the data in order to use Excel to run a particular multiple regression.

Selecting New Worksheet Ply: tells Excel to place the output of the regression analysis in a new worksheet. In the adjacent box, you can specify the name of the worksheet where the output is to be placed, or you can leave this blank and allow Excel to create a new worksheet to use as the destination for the results of this regression analysis (as we are doing here).

If Data Analysis does not appear in the Analysis group in the Data tab, you will have to load the Analysis ToolPak add-in into Excel. To do so, click the File tab in the Ribbon, and click Options. When the Excel Options dialog box appears, click Add-Ins from the menu. Next to Manage:, select Excel Add-ins, and click Go... at the bottom of the dialog box. When the Add-Ins dialog box appears, check the box next to Analysis ToolPak and click OK.


FIGURE 7.11 Data Analysis Tools Box

FIGURE 7.12 Regression Dialog Box

We interpret this model in the following manner:

• For a fixed number of deliveries, we estimate that the mean travel time will increase by 0.0672 hour when the distance traveled increases by 1 mile.

• For a fixed distance traveled, we estimate that the mean travel time will increase by 0.69 hour when the number of deliveries increases by 1 delivery.

The interpretation of the estimated y-intercept for this model (the expected mean travel time for a driving assignment with a distance traveled of 0 miles and no deliveries) is not meaningful because it is the result of extrapolation.
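The two interpretations above can be checked numerically. The following sketch evaluates equation (7.13) for a hypothetical driving assignment of 80 miles and 2 deliveries (values chosen only for illustration) and then changes one independent variable at a time; the function name is an assumption of this example.

```python
def estimated_travel_time(miles, deliveries):
    """Estimated mean travel time (hours) from equation (7.13)."""
    return 0.1273 + 0.0672 * miles + 0.6900 * deliveries


# Hypothetical assignment used as the starting point.
base = estimated_travel_time(80, 2)

# Change one independent variable while holding the other fixed.
one_more_mile = estimated_travel_time(81, 2) - base
one_more_delivery = estimated_travel_time(80, 3) - base

print(f"estimated time at 80 miles, 2 deliveries: {base:.4f} hours")
print(f"effect of one additional mile:            {one_more_mile:.4f} hours")
print(f"effect of one additional delivery:        {one_more_delivery:.4f} hours")
```

Because the estimated equation is linear, the two differences equal the slope estimates regardless of the starting assignment chosen.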

This model has a multiple coefficient of determination of $R^2 = 0.8173$. By adding the number of deliveries as an independent variable to our original simple linear regression, we now explain 81.73% of the variability in our sample values of the dependent variable, travel time. Because the simple linear regression with miles traveled as the sole independent variable explained 66.41% of the variability in our sample values of travel time, we can see that adding number of deliveries as an independent variable to our regression model resulted in explaining an additional 15.32% of the variability in our sample values of travel time. The addition of the number of deliveries to the model appears to have been worthwhile.

The sum of squares due to error, SSE, cannot become larger (and generally will become smaller) when independent variables are added to a regression model. Therefore, because SSR = SST − SSE, the SSR cannot become smaller (and generally becomes larger) when an independent variable is added to a regression model. Thus, $R^2 = \text{SSR}/\text{SST}$ can never decrease as independent variables are added to the regression model.
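As a quick check on these figures, the sketch below recomputes the multiple coefficient of determination from the sums of squares in the ANOVA section of the Excel output in Figure 7.13 and compares it with the simple linear regression value. The variable names are illustrative.

```python
# Sums of squares from the ANOVA section of the Excel output (Figure 7.13).
ssr = 915.5160626   # sum of squares due to regression
sst = 1120.1032     # total sum of squares

sse = sst - ssr        # SSR = SST - SSE, rearranged
r_squared = ssr / sst  # multiple coefficient of determination

# Gain over the simple linear regression, whose r^2 was reported as 0.6641.
gain = round(r_squared, 4) - 0.6641
print(f"SSE = {sse:.4f}, R^2 = {r_squared:.4f}, gain = {gain:.4f}")
```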

FIGURE 7.13 Excel Regression Output for the Butler Trucking Company with Miles and Deliveries as Independent Variables

SUMMARY OUTPUT (worksheet row 1)

Regression Statistics
Row  Statistic          Value
4    Multiple R         0.90407397
5    R Square           0.817349743
6    Adjusted R Square  0.816119775
7    Standard Error     0.829967216
8    Observations       300

ANOVA
Row  Source      df   SS           MS           F            Significance F
12   Regression  2    915.5160626  457.7580313  664.5292419  2.2419E-110
13   Residual    297  204.5871374  0.68884558
14   Total       299  1120.1032

Row  A           B             C               D            E            F             G            H             I
16               Coefficients  Standard Error  t Stat       P-value      Lower 95%     Upper 95%    Lower 99.0%   Upper 99.0%
17   Intercept   0.127337137   0.20520348      0.620540826  0.53537766   -0.276499931  0.531174204  -0.404649592  0.659323866
18   Miles       0.067181742   0.002454979     27.36551071  3.5398E-83   0.062350385   0.072013099  0.06081725    0.073546235
19   Deliveries  0.68999828    0.029521057     23.37308852  2.84826E-69  0.631901326   0.748095234  0.613465414   0.766531147

FIGURE 7.14 Graph of the Regression Equation for Multiple Regression Analysis with Two Independent Variables

[Three-dimensional graph of the estimated regression plane with axes Miles, Number of Deliveries, and Delivery Time, marking the observed value $y_7$, the estimated value $\hat{y}_7$, and the residual $e_7$ for the seventh driving assignment.]

Using this multiple regression model, we now generate an estimated mean value of $y$ for every combination of values of $x_1$ and $x_2$. Thus, instead of a regression line, we now have created a regression plane in three-dimensional space. Figure 7.14 provides the graph of the estimated regression plane for the Butler Trucking Company example and shows the seventh driving assignment in the data. Observe that the plane slopes upward to larger values of estimated mean travel time ($\hat{y}$) as either the number of miles traveled ($x_1$) or the number of deliveries ($x_2$) increases. Further, observe that the residual for a driving assignment when $x_1 = 75$ and $x_2 = 3$ is the difference between the observed $y$ value and the estimated mean value of $y$ given $x_1 = 75$ and $x_2 = 3$. Note that in Figure 7.14, the observed value lies above the regression plane, indicating that the regression model underestimates the expected driving time for the seventh driving assignment.
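The residual for the seventh driving assignment can be worked out directly. The sketch below uses the coefficient estimates from the Excel output in Figure 7.13 and the observed travel time of 7.4 hours recorded for the seventh assignment (75 miles) in the original Butler Trucking sample; the variable names are illustrative.

```python
# Coefficient estimates from the Excel output in Figure 7.13.
b0, b1, b2 = 0.127337137, 0.067181742, 0.68999828

# Seventh driving assignment: 75 miles, 3 deliveries, observed travel time 7.4 hours.
miles, deliveries, observed_time = 75, 3, 7.4

predicted_time = b0 + b1 * miles + b2 * deliveries
residual = observed_time - predicted_time
print(f"predicted = {predicted_time:.4f}, residual = {residual:.4f}")
```

A positive residual corresponds to an observed value above the regression plane, which is the situation described for this assignment.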

NOTES + COMMENTS

Although we use regression analysis to estimate relationships between independent variables and the dependent variable, it does not provide information on whether these are cause-and-effect relationships. The analyst can conclude that a cause-and-effect relationship exists between an independent variable and a dependent variable only if there is a theoretical justification that the relationship is in fact causal. In the Butler Trucking Company multiple regression, through regression analysis we have found evidence of a relationship between distance traveled and travel time and evidence of a relationship between number of deliveries and travel time. Nonetheless, we cannot conclude from the regression model that changes in distance traveled $x_1$ cause changes in travel time $y$, and we cannot conclude that changes in number of deliveries $x_2$ cause changes in travel time $y$. The appropriateness of such cause-and-effect conclusions is left to supporting practical justification and to good judgment on the part of the analyst. Based on their practical experience, Butler Trucking’s managers felt that increases in distance traveled and number of deliveries were likely causes of increased travel time. However, it is important to realize that the regression model itself provides no information about cause-and-effect relationships.

7.5 Inference and Regression

The statistics $b_0, b_1, b_2, \ldots, b_q$ are point estimators of the population parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_q$; that is, each of these $q + 1$ estimates is a single value used as an estimate of the corresponding population parameter. Similarly, we use $\hat{y}$ as a point estimator of $E(y \mid x_1, x_2, \ldots, x_q)$, the conditional mean of $y$ given values of $x_1, x_2, \ldots, x_q$.

However, we must recognize that samples do not replicate the population exactly. Different samples taken from the same population will result in different values of the point estimators $b_0, b_1, b_2, \ldots, b_q$; that is, the point estimators are random variables. If the values of a point estimator such as $b_0, b_1, b_2, \ldots, b_q$ change relatively little from sample to sample, the point estimator has low variability, and so the value of the point estimator that we calculate based on a random sample will likely be a reliable estimate of the population parameter. On the other hand, if the values of a point estimator change dramatically from sample to sample, the point estimator has high variability, and so the value of the point estimator that we calculate based on a random sample will likely be a less reliable estimate. How confident can we be in the estimates $b_0$, $b_1$, and $b_2$ that we developed for the Butler Trucking multiple regression model? Do these estimates have little variation and so are relatively reliable, or do they have so much variation that they have little meaning? We address the variability in potential values of the estimators through use of statistical inference.
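The idea that a point estimator is a random variable can be illustrated by simulation. The sketch below repeatedly draws samples from a hypothetical population in which $y = 1 + 2x + \varepsilon$ (this population, the sample size, and the noise levels are assumptions made only for illustration and have nothing to do with the Butler Trucking data) and records the least squares slope from each sample.

```python
import random
import statistics


def fit_slope(xs, ys):
    """Least squares slope b1 for a simple linear regression of ys on xs."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


def simulate_slopes(num_samples, sample_size, noise_sd, seed=0):
    """Draw repeated samples from a hypothetical population y = 1 + 2x + error."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(num_samples):
        xs = [rng.uniform(0, 10) for _ in range(sample_size)]
        ys = [1 + 2 * x + rng.gauss(0, noise_sd) for x in xs]
        slopes.append(fit_slope(xs, ys))
    return slopes


low_noise = simulate_slopes(num_samples=500, sample_size=30, noise_sd=1.0)
high_noise = simulate_slopes(num_samples=500, sample_size=30, noise_sd=5.0)

print(f"low noise : mean b1 = {statistics.fmean(low_noise):.3f}, "
      f"sd = {statistics.stdev(low_noise):.3f}")
print(f"high noise: mean b1 = {statistics.fmean(high_noise):.3f}, "
      f"sd = {statistics.stdev(high_noise):.3f}")
```

In both settings the slope estimates center near the population value of 2, but they vary far more from sample to sample when the error term is noisier, which is the sense in which an estimate from any single sample is less reliable.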

Statistical inference is the process of making estimates and drawing conclusions about one or more characteristics of a population (the value of one or more parameters) through the analysis of sample data drawn from the population. In regression, we commonly use inference to estimate and draw conclusions about the following:

• The regression parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_q$.

• The mean value and/or the predicted value of the dependent variable $y$ for specific values of the independent variables $x_1, x_2, \ldots, x_q$.

In our discussion of inference and regression, we will consider both hypothesis testing and interval estimation.

See Chapter 6 for a more thorough treatment of hypothesis testing and confidence intervals.


Conditions Necessary for Valid Inference in the Least Squares Regression Model

In conducting a regression analysis, we begin by making an assumption about the appropriate model for the relationship between the dependent and independent variable(s). For the case of linear regression, the assumed multiple regression model is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q + \varepsilon$$

The least squares method is used to develop values for $b_0, b_1, b_2, \ldots, b_q$, the estimates of the model parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_q$, respectively. The resulting estimated multiple regression equation is

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_q x_q$$

Although inference can provide greater understanding of the nature of relationships estimated through regression analysis, our inferences are valid only if the error term $\varepsilon$ behaves in a certain way. Specifically, the validity of inferences in regression analysis depends on how well the following two conditions about the error term $\varepsilon$ are met:

1. For any given combination of values of the independent variables $x_1, x_2, \ldots, x_q$, the population of potential error terms $\varepsilon$ is normally distributed with a mean of 0 and a constant variance.

2. The values of $\varepsilon$ are statistically independent.

The practical implication of normally distributed errors with a mean of zero and a constant variance for any given combination of values of $x_1, x_2, \ldots, x_q$ is that the regression estimates are unbiased (i.e., they do not tend to over- or underpredict), possess consistent accuracy, and tend to err in small amounts rather than in large amounts. This first condition must be met for statistical inference in regression to be valid. The second condition is generally a concern when we collect data from a single entity over several periods of time and must also be met for statistical inference in regression to be valid in these instances. However, inferences in regression are generally reliable unless there are marked violations of these conditions.

Figure 7.15 illustrates these model conditions and their implications for a simple linear regression; note that in this graphical interpretation, the value of $E(y \mid x)$ changes linearly according to the specific value of $x$ considered, and so the mean error is zero at each value of $x$. However, regardless of the $x$ value, the error term $\varepsilon$ and hence the dependent variable $y$ are normally distributed, each with the same variance.
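Stated compactly for the simple linear regression case, the two conditions and their consequence for $y$ can be sketched as follows. The symbol $\sigma^2$ for the constant error variance is an assumption of this sketch and is not notation used in the text.

```latex
\varepsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2)
\quad\Longrightarrow\quad
y \mid x \sim N(\beta_0 + \beta_1 x,\ \sigma^2)
```

The mean of $y$ moves along the line as $x$ changes, while the shape and spread of its distribution stay the same at every value of $x$.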

To evaluate whether the error of an estimated regression equation reasonably meets the two conditions, the sample residuals ($e_i = y_i - \hat{y}_i$ for observations $i = 1, \ldots, n$) need to be analyzed. There are many sophisticated diagnostic procedures for detecting whether the sample errors violate these conditions, but simple scatter charts of the residuals versus the predicted values of the dependent variable and the residuals versus the independent variables are an extremely effective method for assessing whether these conditions are violated. We should review the scatter chart for patterns in the residuals indicating that one or more of the conditions have been violated. As Figure 7.16 illustrates, at any given value of the horizontal-axis variable in these residual scatter plots, the center of the residuals should be approximately zero, the spread in the errors should be similar to the spread in error for other values of the horizontal-axis variable, and the errors should be symmetrically distributed with values near zero occurring more frequently than values that differ greatly from zero. A pattern in the residuals such as this gives us little reason to doubt the validity of inferences made on the regression that generated the residuals.
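The pairs needed for such a residual scatter chart are straightforward to compute. As a small illustration, the sketch below uses the original 10-assignment Butler Trucking sample and its simple linear regression on miles traveled to list each predicted value with its residual; the variable names are illustrative.

```python
# Original Butler Trucking sample: miles traveled and travel time in hours.
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

n = len(miles)
x_bar = sum(miles) / n
y_bar = sum(hours) / n

# Least squares fit of the simple linear regression.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, hours))
      / sum((x - x_bar) ** 2 for x in miles))
b0 = y_bar - b1 * x_bar

# Residual e_i = y_i - y_hat_i for each observation.
predicted = [b0 + b1 * x for x in miles]
residuals = [y - y_hat for y, y_hat in zip(hours, predicted)]

# Listing the pairs in order of predicted value mirrors reading the chart left to right.
for y_hat, e in sorted(zip(predicted, residuals)):
    print(f"predicted = {y_hat:6.3f}   residual = {e:+.3f}")

mean_residual = sum(residuals) / n
print(f"mean residual = {mean_residual:.6f}")
```

When the model includes an intercept, least squares residuals always average to zero over the whole sample, so the informative question in a residual chart is whether they also center on zero, with similar spread, within each region of the horizontal axis.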

FIGURE 7.15 Illustration of the Conditions for Valid Inference in Regression

[Graph of the line $\hat{y} = b_0 + b_1 x$ with the distribution of $y$ drawn at $x = 0$, $x = 10$, $x = 20$, and $x = 30$, each centered on the value of $\hat{y}$ at that $x$. Note: The distribution of $y$ has the same shape at each $x$ value.]

FIGURE 7.16 Example of a Random Error Pattern in a Scatter Chart of Residuals and Predicted Values of the Dependent Variable

[Scatter chart of residuals plotted against the predicted values $\hat{y}$, showing no discernible pattern.]

While the residuals in Figure 7.16 show no discernible pattern, the four panels of Figure 7.17, which plot residuals from four different regressions, show examples of distinct patterns, each of which suggests a violation of at least one of the regression model conditions. In panel (a), the variation in the residuals ($e$) increases as the value of the independent variable increases, suggesting that the residuals do not have a constant variance. In panel (b), the residuals are positive for small and large values of the independent variable but are negative for moderate values of the independent variable. This pattern suggests that the linear regression model underpredicts the value of the dependent variable for small and large values of the independent variable and overpredicts the value of the dependent variable for intermediate values of the independent variable. In this case, the regression model does not adequately capture the relationship between the independent variable $x$ and the dependent variable $y$. The residuals in panel (c) are not symmetrically distributed around 0; many of the negative residuals are relatively close to zero, while the relatively few positive residuals tend to be far from zero. This skewness suggests that the residuals are not normally distributed.

Finally, the residuals in panel (d) are plotted over time $t$, which generally serves as an independent variable; that is, an observation is made at each of several (usually equally spaced) points in time. In this case, connected consecutive residuals allow us to see a distinct pattern across every set of four residuals; the second residual is consistently larger than the first and smaller than the third, whereas the fourth residual is consistently the smallest. This pattern, which occurs consistently over each set of four consecutive residuals in the chart in panel (d), suggests that the residuals generated by this model are not independent. A residual pattern such as this generally occurs when we have collected quarterly data and have not captured seasonal effects in the model. In each of these four instances, any inferences based on our regression will likely not be reliable.
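A visual check for the fanning pattern of panel (a) can be backed up with a rough numerical one. The sketch below compares the standard deviation of the residuals in the upper half of the independent variable's range with that in the lower half. This is an informal heuristic invented for illustration, not a formal statistical test, and both residual series are hypothetical.

```python
import statistics


def residual_spread_ratio(x_values, residuals):
    """Ratio of residual standard deviation in the upper half of x to the lower half."""
    paired = sorted(zip(x_values, residuals))
    half = len(paired) // 2
    lower = [e for _, e in paired[:half]]
    upper = [e for _, e in paired[-half:]]
    return statistics.stdev(upper) / statistics.stdev(lower)


x_values = [1, 2, 3, 4, 5, 6, 7, 8]

# Hypothetical residuals whose spread grows with x, as in panel (a).
fanning = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]

# Hypothetical residuals with similar spread at every x.
steady = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]

fanning_ratio = residual_spread_ratio(x_values, fanning)
steady_ratio = residual_spread_ratio(x_values, steady)
print(f"fanning residuals: spread ratio = {fanning_ratio:.2f}")
print(f"steady residuals:  spread ratio = {steady_ratio:.2f}")
```

A ratio far from 1 is a prompt to look more closely at the residual chart; a ratio near 1 is consistent with, but does not prove, constant variance.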

Frequently, the residuals do not meet these conditions either because an important independent variable has been omitted from the model or because the functional form of the model is inadequate to explain the relationships between the independent variables and the dependent variable. It is important to note that calculating the values of the estimates $b_0, b_1, b_2, \ldots, b_q$ does not require the errors to satisfy these conditions. However, the errors must satisfy these conditions in order for inferences (interval estimates for predicted values of the dependent variable and confidence intervals and hypothesis tests of the regression parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_q$) to be reliable.

FIGURE 7.17 Examples of Diagnostic Scatter Charts of Residuals from Four Regressions

[Four panels, (a) through (d), each a scatter chart of residuals $e$; the pattern in each panel is described in the text.]

You can generate scatter charts of the residuals against each independent variable in the model when using Excel’s Regression tool; to do so, select the Residual Plots option in the Residuals area of the Regression dialog box. Figure 7.18 shows residual plots produced by Excel for the Butler Trucking Company example for which the independent variables are miles ($x_1$) and deliveries ($x_2$).

The residuals at each value of miles appear to have a mean of zero, to have similar variances, and to be concentrated around zero. The residuals at each value of deliveries also appear to have a mean of zero, to have similar variances, and to be concentrated around zero. Although there appears to be a slight pattern in the residuals across values of deliveries, it is negligible and could conceivably be the result of random variation. Thus, this evidence provides little reason for concern over the validity of inferences about the regression model that we may perform.

Recall that in the Excel output shown in Figure 7.13, the label for the independent variable $x_1$ is Miles and the label for the independent variable $x_2$ is Deliveries.

FIGURE 7.18 Excel Residual Plots for the Butler Trucking Company Multiple Regression

[Two scatter charts: the Miles Residual Plot (Residuals against Miles) and the Deliveries Residual Plot (Residuals against Deliveries), each with a vertical axis running from -2.0 to 2.5.]

A scatter chart of the residuals $e$ against the predicted values of the dependent variable is also commonly used to assess whether the residuals of the regression model satisfy the conditions necessary for valid inference. To obtain the data to construct a scatter chart of the residuals against the predicted values of the dependent variable using Excel’s Regression tool, select the Residuals option in the Residuals area of the Regression dialog box (shown in Figure 7.12). This generates a table of predicted values of the dependent variable and residuals for the observations in the data; a partial list for the Butler Trucking multiple regression example is shown in Figure 7.19.

We can then use the Excel chart tool to create a scatter chart of these predicted values and residuals similar to the chart in Figure 7.20. The figure shows that the residuals at each predicted value of the dependent variable appear to have a mean of zero, to have similar variances, and to be concentrated around zero. Thus, the residuals provide little evidence that our regression model violates the conditions necessary for reliable inference. We can trust the inferences that we may wish to perform on our regression model.

Testing Individual Regression Parameters

Once we ascertain that our regression model satisfies the conditions necessary for reliable inference reasonably well, we can begin testing hypotheses and building confidence intervals. Specifically, we may then wish to determine whether statistically significant relationships exist between the dependent variable $y$ and each of the independent variables $x_1, x_2, \ldots, x_q$ individually. Note that if a $\beta_j$ is zero, then the dependent variable $y$ does not change when the independent variable $x_j$ changes, and there is no linear relationship between $y$ and $x_j$. Alternatively, if a $\beta_j$ is not zero, there is a linear relationship between the dependent variable $y$ and the independent variable $x_j$.

We use a $t$ test to test the hypothesis that a regression parameter $\beta_j$ is zero. The corresponding null and alternative hypotheses are as follows:

$$H_0: \beta_j = 0$$

$$H_a: \beta_j \neq 0$$

See Chapter 6 for a more in-depth discussion of hypothesis testing.

FIGURE 7.19 Table of the First Several Predicted Values $\hat{y}$ and Residuals $e$ Generated by the Excel Regression Tool

RESIDUAL OUTPUT (worksheet rows 23 through 41)

Observation  Predicted Time  Residuals
1            9.605504464     -0.305504464
2            5.556419081     -0.756419081
3            9.605504464     -0.705504464
4            8.225507903     -1.725507903
5            4.8664208       -0.6664208
6            6.881873062     -0.681873062
7            7.235932632     0.164037368
8            7.254143492     -1.254143492
9            8.243688763     -0.643688763
10           7.553690482     -1.453690482
11           6.936415641     0.063584359
12           7.290505212     -0.290505212
13           9.287776613     0.312223387
14           5.874146931     0.625853069
15           6.954596501     0.245403499
16           5.556419081     0.443580919

FIGURE 7.20 Scatter Chart of Predicted Values $\hat{y}$ and Residuals $e$

[Scatter chart of Residuals (vertical axis, from -2.0 to 2.5) against Predicted Values of the Dependent Variable (horizontal axis, from 0 to 12).]

The test statistic for this $t$ test is

$$t = \frac{b_j}{s_{b_j}} \qquad (7.14)$$

where $b_j$ is the point estimate of the regression parameter $\beta_j$ and $s_{b_j}$ is the estimated standard deviation of $b_j$.

As the value of $b_j$, the point estimate of $\beta_j$, deviates from zero in either direction, the evidence from our sample that the corresponding regression parameter $\beta_j$ is not zero increases. Thus, as the magnitude of $t$ increases (as $t$ deviates from zero in either direction), we are more likely to reject the hypothesis that the regression parameter $\beta_j$ is zero and so conclude that a relationship exists between the dependent variable $y$ and the independent variable $x_j$.

Statistical software will generally report a $p$ value for this test statistic; for a given value of $t$, this $p$ value represents the probability of collecting a sample of the same size from the same population that yields a $t$ statistic of larger magnitude (larger in absolute value) given that the value of $\beta_j$ is actually zero. Thus, smaller $p$ values indicate stronger evidence against the hypothesis that the value of $\beta_j$ is zero (i.e., stronger evidence of a relationship between $x_j$ and $y$). The hypothesis is rejected when the corresponding $p$ value is smaller than some predetermined level of significance (usually 0.05 or 0.01).
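To make the definition concrete, the sketch below approximates a two-sided $p$ value by numerically integrating the density of the $t$ distribution with Simpson's rule. It is applied to the intercept of the Butler Trucking multiple regression, whose $t$ statistic (0.620540826) and residual degrees of freedom (297) appear in the Excel output in Figure 7.13. The function names and the choice of numerical method are assumptions of this example; statistical software typically uses more refined routines.

```python
import math


def t_density(t, df):
    """Probability density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| > |t_stat|), with the central area found by Simpson's rule."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    central_area = 2 * (h / 3) * total  # area between -|t| and +|t|
    return max(0.0, 1.0 - central_area)


# Intercept row of Figure 7.13: t Stat = 0.620540826 with 297 residual degrees of freedom.
p_intercept = two_sided_p_value(0.620540826, 297)
print(f"p value for the intercept: {p_intercept:.4f}")
```

The result should agree closely with the P-value Excel reports for the intercept. For very large $t$ statistics, subtracting the central area from 1 loses precision, so this sketch returns a value indistinguishable from zero rather than the extremely small $p$ values that Excel reports for such cases.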

The output of Excel’s Regression tool provides the results of the t tests for each regression parameter. Refer again to Figure 7.13, which shows the multiple linear regression results for Butler Trucking with independent variables $x_1$ (labeled Miles) and $x_2$ (labeled Deliveries). The values of the parameter estimates $b_0$, $b_1$, and $b_2$ are located in cells B17, B18, and B19, respectively; the standard deviations $s_{b_0}$, $s_{b_1}$, and $s_{b_2}$ are contained in cells C17, C18, and C19, respectively; the values of the t statistics for the hypothesis tests are in cells D17, D18, and D19, respectively; and the corresponding p values are in cells E17, E18, and E19, respectively.

The standard deviation of $b_j$ is often referred to as the standard error of $b_j$. Thus, $s_{b_j}$ provides an estimate of the standard error of $b_j$.


Let’s use these results to test the hypothesis that $\beta_1$ is zero. If we do not reject this hypothesis, we conclude that the mean value of y does not change when the value of $x_1$ changes, and so there is no relationship between driving time and miles traveled. We see in the Excel output in Figure 7.13 that the t statistic for this test is 27.3655 and that the associated p value is 3.5398E-83. This p value tells us that if the value of $\beta_1$ is actually zero, the probability we could collect a random sample of 300 observations from the population of Butler Trucking driving assignments that yields a t statistic with an absolute value greater than 27.3655 is practically zero. Such a small probability represents a highly unlikely scenario; thus, the small p value allows us to reject the hypothesis that $\beta_1 = 0$ for the Butler Trucking multiple regression example at a 0.01 level of significance or even at a far smaller level of significance. Thus, these data suggest that a relationship may exist between driving time and miles traveled.

Similarly, we can test the hypothesis that $\beta_2$ is zero. If we do not reject this hypothesis, we conclude that the mean value of y does not change when the value of $x_2$ changes, and so there is no relationship between driving time and number of deliveries. We see in the Excel output in Figure 7.13 that the t statistic for this test is 23.3731 and that the associated p value is 2.84826E-69. This p value tells us that if the value of $\beta_2$ is actually zero, the probability we could collect a random sample of 300 observations from the population of Butler Trucking driving assignments that yields a t statistic with an absolute value greater than 23.3731 is practically zero. This is highly unlikely, and so the p value is sufficiently small to reject the hypothesis that $\beta_2 = 0$ for the Butler Trucking multiple regression example at a 0.01 level of significance or even at a far smaller level of significance. Thus, these data suggest that a relationship may exist between driving time and number of deliveries.

Finally, we can test the hypothesis that $\beta_0$ is zero in a similar fashion. If we do not reject this hypothesis, we conclude that the mean value of y is zero when the values of $x_1$ and $x_2$ are both zero, and so there is no driving time when a driving assignment is 0 miles and has 0 deliveries. We see in the Excel output that the t statistic for this test is 0.6205 and the associated p value is 0.5354. This p value tells us that if the value of $\beta_0$ is actually zero, the probability we could collect a random sample of 300 observations from the population of Butler Trucking driving assignments that yields a t statistic with an absolute value greater than 0.6205 is 0.5354. Thus, we do not reject the hypothesis that mean driving time is zero when a driving assignment is 0 miles and has 0 deliveries.
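The t tests above can also be reproduced outside Excel. The following is a minimal Python sketch, not the method Excel uses internally: it computes the t statistic of equation (7.14) and approximates the two-sided p value by numerically integrating the t density with Simpson's rule. The function names `t_pdf`, `two_sided_p_value`, and `t_test` are illustrative assumptions, and the inputs are the Gasoline Consumption estimate and standard error reported in Figure 7.21.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_value(t_stat, df, steps=2000):
    """Approximate P(|T| > |t_stat|) with Simpson's rule (steps must be even)."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3  # area between -|t| and +|t|
    return max(0.0, 1.0 - central_area)


def t_test(b_j, s_bj, n, q):
    """Return (t, p) for the hypothesis beta_j = 0, with n - q - 1 degrees of freedom."""
    t_stat = b_j / s_bj
    return t_stat, two_sided_p_value(t_stat, n - q - 1)


# Gasoline Consumption row of Figure 7.21 (n = 300 observations, q = 2 variables)
t_stat, p = t_test(-0.067506102, 0.152707928, n=300, q=2)
print(round(t_stat, 4), round(p, 4))
```

Because the p value is obtained as one minus an integral, this sketch cannot resolve p values smaller than roughly floating-point precision (such as 3.5398E-83); it simply reports a value near zero in those cases.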

We can also execute each of these hypothesis tests through confidence intervals. A confidence interval for a regression parameter $\beta_i$ is an estimated interval believed to contain the true value of $\beta_i$ at some level of confidence. The level of confidence, or confidence level, indicates how frequently interval estimates based on similar-sized samples from the same population using identical sampling techniques will contain the true value of $\beta_i$. Thus, when building a 95% confidence interval, we can expect that if we took similar-sized samples from the same population using identical sampling techniques, the corresponding interval estimates would contain the true value of $\beta_i$ for 95% of the samples.

Although the confidence intervals for $\beta_0, \beta_1, \beta_2, \ldots, \beta_q$ convey information about the variation in the estimates $b_1, b_2, \ldots, b_q$ that can be expected across repeated samples, they can also be used to test whether each of the regression parameters $\beta_0, \beta_1, \beta_2, \ldots, \beta_q$ is equal to zero in the following manner. To test that $\beta_j$ is zero (i.e., there is no linear relationship between $x_j$ and y) at some predetermined level of significance (say 0.05), first build a confidence interval at the $(1 - 0.05)100\%$ confidence level. If the resulting confidence interval does not contain zero, we conclude that $\beta_j$ differs from zero at the predetermined level of significance.

The form of a confidence interval for $\beta_j$ is as follows:

$$b_j \pm t_{\alpha/2}\, s_{b_j}$$

where $b_j$ is the point estimate of the regression parameter $\beta_j$, $s_{b_j}$ is the estimated standard deviation of $b_j$, and $t_{\alpha/2}$ is a multiplier term based on the sample size and specified $100(1 - \alpha)\%$ confidence level of the interval. More specifically, $t_{\alpha/2}$ is the t value that provides an area of $\alpha/2$ in the upper tail of a t distribution with $n - q - 1$ degrees of freedom.

See Chapter 6 for a more in-depth discussion of confidence intervals.

Most software that is capable of regression analysis can also produce these confidence intervals. For example, the output of Excel’s Regression tool for Butler Trucking, given in Figure 7.13, provides confidence intervals for $\beta_1$ (the slope coefficient associated with the independent variable $x_1$, labeled Miles) and $\beta_2$ (the slope coefficient associated with the independent variable $x_2$, labeled Deliveries), as well as the y-intercept $\beta_0$. The 95% confidence intervals for $\beta_0$, $\beta_1$, and $\beta_2$ are shown in cells F17:G17, F18:G18, and F19:G19, respectively. Neither of the 95% confidence intervals for $\beta_1$ and $\beta_2$ includes zero, so we can conclude that $\beta_1$ and $\beta_2$ each differ from zero at the 0.05 level of significance. On the other hand, the 95% confidence interval for $\beta_0$ does include zero, so we conclude that $\beta_0$ does not differ from zero at the 0.05 level of significance.

The Regression tool dialog box offers the user the opportunity to generate confidence intervals for $\beta_0$, $\beta_1$, and $\beta_2$ at a confidence level other than 95%. In this example, we chose to create 99% confidence intervals for $\beta_0$, $\beta_1$, and $\beta_2$, which in Figure 7.13 are given in cells H17:I17, H18:I18, and H19:I19, respectively. Neither of the 99% confidence intervals for $\beta_1$ and $\beta_2$ includes zero, so we can conclude that $\beta_1$ and $\beta_2$ each differ from zero at the 0.01 level of significance. On the other hand, the 99% confidence interval for $\beta_0$ does include zero, so we conclude that $\beta_0$ does not differ from zero at the 0.01 level of significance.
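As a rough illustration of the interval-based test, the Python sketch below builds a confidence interval from a point estimate and its standard error and then checks whether the interval contains zero. It is an assumption-laden sketch rather than Excel's procedure: the critical t value is found by bisection on a numerically integrated t density, and the names `upper_tail_area`, `t_critical`, and `confidence_interval` are hypothetical. The inputs are the Miles and Gasoline Consumption rows of Figure 7.21.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def upper_tail_area(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]."""
    if t == 0:
        return 0.5
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def t_critical(alpha, df):
    """t value leaving an area of alpha/2 in the upper tail, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail_area(mid, df) > alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def confidence_interval(b_j, s_bj, n, q, level=0.95):
    """Interval b_j +/- t_{alpha/2} * s_bj with n - q - 1 degrees of freedom."""
    margin = t_critical(1 - level, n - q - 1) * s_bj
    return b_j - margin, b_j + margin


# Rows of Figure 7.21 (n = 300, q = 2)
miles_ci = confidence_interval(0.074701825, 0.014274552, n=300, q=2)
gas_ci = confidence_interval(-0.067506102, 0.152707928, n=300, q=2)
for name, (low, high) in [("Miles", miles_ci), ("Gasoline", gas_ci)]:
    contains_zero = low <= 0 <= high
    print(name, round(low, 6), round(high, 6), "contains zero:", contains_zero)
```

An interval that excludes zero corresponds to rejecting the hypothesis that the parameter is zero at significance level 1 minus the confidence level.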

Addressing Nonsignificant Independent Variables

If we do not reject the hypothesis that $\beta_j$ is zero, we conclude that there is no linear relationship between y and $x_j$. This leads to the question of how to handle the corresponding independent variable. Do we use the model as originally formulated with the nonsignificant independent variable, or do we rerun the regression without the nonsignificant independent variable and use the new result? The approach to be taken depends on a number of factors, but ultimately whatever model we use should have a theoretical basis. If practical experience dictates that the nonsignificant independent variable has a relationship with the dependent variable, the independent variable should be left in the model. On the other hand, if the model sufficiently explains the dependent variable without the nonsignificant independent variable, then we should consider rerunning the regression without the nonsignificant independent variable. Note that it is possible that the estimates of the other regression coefficients and their p values may change considerably when we remove the nonsignificant independent variable from the model.

The appropriate treatment of the inclusion or exclusion of the y-intercept when $b_0$ is not statistically significant may require special consideration. For example, in the Butler Trucking multiple regression model, recall that the p value for $b_0$ is 0.5354, suggesting that this estimate of $\beta_0$ is not statistically significant. Should we remove the y-intercept from this model because it is not statistically significant? Excel provides functionality to remove the y-intercept from the model by selecting Constant is zero in Excel’s Regression tool. This will force the regression through the origin (when the independent variables $x_1, x_2, \ldots, x_q$ all equal zero, the estimated value of the dependent variable will be zero). However, doing this can substantially alter the estimated slopes in the regression model and result in a less effective regression that yields less accurate predicted values of the dependent variable. The primary purpose of the regression model is to explain or predict values of the dependent variable corresponding to values of the independent variables within the experimental region. Therefore, it is generally advised that regression through the origin should not be forced. In a situation for which there are strong a priori reasons for believing that the dependent variable is equal to zero when the values of all independent variables in the model are equal to zero, it is better to collect data for which the values of the independent variables are at or near zero in order to allow the regression to empirically validate this belief and avoid extrapolation. If data for which the values of the independent variables are at or near zero are not obtainable, and the regression model is intended to be used to explain or predict values of the dependent variable at or near the y-intercept, then forcing the y-intercept to be zero may be a necessary action, although it results in extrapolation. A common business example of regression through the origin is a model for which output in a labor-intensive production process is the dependent variable and hours of labor is the independent variable; because the production process is labor intensive, we would expect no output when the value of labor hours is zero.
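To see how forcing the regression through the origin can alter an estimated slope, consider the following small Python sketch for the single-variable case. The data are made up for illustration (they are not from any Butler Trucking file), and the function names `fit_with_intercept` and `fit_through_origin` are hypothetical; the second function plays the role of selecting Constant is zero.

```python
def fit_with_intercept(xs, ys):
    """Least squares estimates (b0, b1) for the model y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def fit_through_origin(xs, ys):
    """Least squares slope for the model y = b1 * x (intercept forced to zero)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Made-up data that follow y = 2 + 0.5x exactly; no observations lie near x = 0.
x_values = [10, 20, 30, 40]
y_values = [7, 12, 17, 22]

b0, b1 = fit_with_intercept(x_values, y_values)
b1_origin = fit_through_origin(x_values, y_values)
print("with intercept:", b0, b1)
print("through origin:", round(b1_origin, 4))
```

In this made-up case the slope changes from 0.5 to roughly 0.57 once the intercept is forced to zero, so predictions within the observed range of x become less accurate.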

Multicollinearity

We use the term independent variable in regression analysis to refer to any variable used to predict or explain the value of the dependent variable. The term does not mean, however, that the independent variables themselves are independent from each other in any statistical sense. On the contrary, most independent variables in a multiple regression problem are correlated with one another to some degree. For example, in the Butler Trucking example involving the two independent variables $x_1$ (miles traveled) and $x_2$ (number of deliveries), we could compute the sample correlation coefficient $r_{x_1, x_2}$ to determine the extent to which these two variables are related. Doing so yields $r_{x_1, x_2} = 0.16$. Thus, we find some degree of linear association between the two independent variables. In multiple regression analysis, multicollinearity refers to the correlation among the independent variables.

To gain a better perspective of the potential problems of multicollinearity, let us consider a modification of the Butler Trucking example. Instead of $x_2$ being the number of deliveries, let $x_2$ denote the number of gallons of gasoline consumed. Clearly, $x_1$ (the miles traveled) and $x_2$ are now related; that is, we know that the number of gallons of gasoline used depends to a large extent on the number of miles traveled. Hence, we would conclude logically that $x_1$ and $x_2$ are highly correlated independent variables and that multicollinearity is present in the model. The data for this example are provided in the file ButlerWithGasConsumption.

Using Excel’s Regression tool, we obtain the results shown in Figure 7.21 for our multiple regression. When we conduct a t test to determine whether $\beta_1$ is equal to zero, we find a p value of 3.1544E-07, and so we reject this hypothesis and conclude that travel time is related to miles traveled. On the other hand, when we conduct a t test to determine whether $\beta_2$ is equal to zero, we find a p value of 0.6588, and so we do not reject this hypothesis. Does this mean that travel time is not related to gasoline consumption? Not necessarily.

What it probably means in this instance is that, with $x_1$ already in the model, $x_2$ does not make a significant marginal contribution to predicting the value of y. This interpretation makes sense within the context of the Butler Trucking example; if we know the miles traveled, we do not gain much new information that would be useful in predicting driving time by also knowing the amount of gasoline consumed. We can see this in the scatter chart in Figure 7.22; miles traveled and gasoline consumed are strongly related.

Even though we rejected the hypothesis that $\beta_1$ is equal to zero in the model corresponding to Figure 7.21, a comparison to Figure 7.13 shows the value of the t statistic is much smaller and the p value substantially larger than in the multiple regression model that includes miles driven and number of deliveries as the independent variables. The evidence against the hypothesis that $\beta_1$ is equal to zero is weaker in the multiple regression that includes miles driven and gasoline consumed as the independent variables because of the high correlation between these two independent variables.

To summarize, in t tests for the significance of individual parameters, the difficulty caused by multicollinearity is that it is possible to conclude that a parameter associated with one of the multicollinear independent variables is not significantly different from zero when the independent variable actually has a strong relationship with the dependent variable. This problem is avoided when there is little correlation among the independent variables.

If any estimated regression parameters $b_1, b_2, \ldots, b_q$ or associated p values change dramatically when a new independent variable is added to the model (or an existing independent variable is removed from the model), multicollinearity is likely present. Looking for changes such as these is sometimes used as a way to detect multicollinearity.

FIGURE 7.21 Excel Regression Output for the Butler Trucking Company with Miles and Gasoline Consumption as Independent Variables

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.69406354
R Square             0.481724198
Adjusted R Square    0.478234125
Standard Error       1.398077545
Observations         300

ANOVA
             df     SS             MS             F             Significance F
Regression   2      539.5808158    269.7904079    138.0269794   4.09542E-43
Residual     297    580.5223842    1.954620822
Total        299    1120.1032

                       Coefficients    Standard Error   t Stat         P-value       Lower 95%      Upper 95%     Lower 99.0%    Upper 99.0%
Intercept              2.493095385     0.33669895       7.404523781    1.36703E-12   1.830477398    3.155713373   1.620208758    3.365982013
Miles                  0.074701825     0.014274552      5.233216928    3.15444E-07   0.046609743    0.102793908   0.037695279    0.111708371
Gasoline Consumption   -0.067506102    0.152707928      -0.442060235   0.658767336   -0.368032789   0.233020584   -0.463398955   0.328386751

FIGURE 7.22 Scatter Chart of Miles and Gasoline Consumed for Butler Trucking Company

[Scatter chart: horizontal axis, Miles (0 to 120); vertical axis, Gasoline Consumption]

Statisticians have developed several tests for determining whether multicollinearity is strong enough to cause problems. In addition to the initial understanding of the nature of the relationships between the various pairs of variables that we can gain through scatter charts such as the chart shown in Figure 7.22, correlations between pairs of independent variables can be used to identify potential problems. According to a common rule-of-thumb test, multicollinearity is a potential problem if the absolute value of the sample correlation coefficient exceeds 0.7 for any two of the independent variables. We can use the Excel function

=CORREL(B2:B301, C2:C301)

to find that the correlation between Miles (in column B) and Gasoline Consumed (in column C) in the file ButlerWithGasConsumption is $r_{\text{Miles, Gasoline Consumed}} = 0.9572$, which supports the conclusion that Miles and Gasoline Consumed are multicollinear. Similarly, we can use the Excel function

=CORREL(B2:B301, D2:D301)

to show that the correlation between Miles (in column B) and Deliveries (in column D) for the sample data is $r_{\text{Miles, Deliveries}} = 0.0258$. This supports the conclusion that Miles and Deliveries are not multicollinear. Other tests for multicollinearity are more advanced and beyond the scope of this text.
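The same rule-of-thumb check can be sketched in Python. The function below computes the sample correlation coefficient (the quantity Excel's CORREL returns) and flags a pair of independent variables when its absolute value exceeds 0.7. The data values are made up for illustration and are not taken from the ButlerWithGasConsumption file; the names `correlation` and `is_potential_problem` are likewise hypothetical.

```python
import math


def correlation(xs, ys):
    """Sample (Pearson) correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def is_potential_problem(r, threshold=0.7):
    """Rule of thumb from the text: |r| above 0.7 signals possible multicollinearity."""
    return abs(r) > threshold


# Made-up illustrative values for six driving assignments
miles = [60, 75, 80, 95, 100, 55]
gasoline = [6.1, 7.4, 8.3, 9.2, 10.1, 5.4]
deliveries = [4, 2, 5, 1, 3, 3]

for name, other in [("gasoline", gasoline), ("deliveries", deliveries)]:
    r = correlation(miles, other)
    print(f"miles vs {name}: r = {r:.4f}, flagged = {is_potential_problem(r)}")
```

As note 3 in Notes + Comments points out, pairwise correlations of this kind do not always uncover multicollinearity that involves a combination of several independent variables.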

The primary consequence of multicollinearity is that it increases the standard deviation of $b_0, b_1, \ldots, b_q$ and $\hat{y}$, and so inference based on these estimates is less precise than it should be. This means that confidence intervals for $\beta_0, \beta_1, \beta_2, \ldots, \beta_q$ and predicted values of the dependent variable are wider than they should be. Thus, we are less likely to reject the hypothesis that an individual parameter $\beta_j$ is equal to zero than we otherwise would be, and multicollinearity can lead us to conclude that an independent variable $x_j$ is not related to the dependent variable y when they in fact are related. In addition, multicollinearity can result in confusing or misleading estimated regression parameters $b_1, b_2, \ldots, b_q$. Therefore, if a primary objective of the regression analysis is inference, to explain the relationship between a dependent variable y and a set of independent variables $x_1, \ldots, x_q$, you should, if possible, avoid including independent variables that are highly correlated in the regression model. For example, when a pair of independent variables is highly correlated, it is common to simply include only one of these independent variables in the regression model. When decision makers have reason to believe that substantial multicollinearity is present and they choose to retain the highly correlated independent variables in the model, they must realize that separating the relationships between each of the individual independent variables and the dependent variable is difficult (and maybe impossible). On the other hand, multicollinearity does not affect the predictive capability of a regression model, so if the primary objective is prediction or forecasting, then multicollinearity is not a concern.

See Chapter 2 for a more in-depth discussion of correlation and how to compute it with Excel.

NOTES + COMMENTS

1. In multiple regression we can test the null hypothesis that the regression parameters $\beta_1, \beta_2, \ldots, \beta_q$ are all equal to zero ($H_0\colon \beta_1 = \beta_2 = \cdots = \beta_q = 0$, $H_a\colon$ at least one $\beta_j \neq 0$ for $j = 1, \ldots, q$) with an F test based on the F probability distribution. The test statistic generated by the sample data for this test is

$$F = \frac{\text{SSR}/q}{\text{SSE}/(n - q - 1)}$$

where SSR and SSE are as defined by equations (7.5) and (7.7), q is the number of independent variables in the regression model, and n is the number of observations in the sample. If the p value corresponding to the F statistic is smaller than some predetermined level of significance (usually 0.05 or 0.01), this leads us to reject the hypothesis that the values of $\beta_1, \beta_2, \ldots, \beta_q$ are all zero, and we would conclude that there is an overall regression relationship; otherwise, we conclude that there is no overall regression relationship.

The output of Excel’s Regression tool provides the results of the F test; in Figure 7.13, which shows the multiple linear regression results for Butler Trucking with independent variables $x_1$ (labeled Miles) and $x_2$ (labeled Deliveries), the value of the F statistic and the corresponding p value are in cells E24 and F24, respectively. From the Excel output in Figure 7.13 we see that the p value for the F test is essentially 0. Thus, the p value is sufficiently small to allow us to reject the hypothesis that no overall regression relationship exists at the 0.01 level of significance.

2. Finding a significant relationship between an independent variable $x_j$ and a dependent variable y in a linear regression does not enable us to conclude that the relationship is linear. We can state only that $x_j$ and y are related and that a linear relationship explains a statistically significant portion of the variability in y over the range of values for $x_j$ observed in the sample.

3. Note that a review of the correlations of pairs of independent variables is not always sufficient to entirely uncover multicollinearity. The problem is that sometimes one independent variable is highly correlated with some combination of several other independent variables. If you suspect that one independent variable is highly correlated with a combination of several other independent variables, you can use multiple regression to assess whether the sample data support your suspicion. Suppose that your original regression model includes the independent variables $x_1, x_2, \ldots, x_q$ and that you suspect that $x_1$ is highly correlated with a subset of the other independent variables $x_2, \ldots, x_q$. Then construct the multiple linear regression for which $x_1$ is the dependent variable to be explained by the subset of the independent variables $x_2, \ldots, x_q$ that you suspect are highly correlated with $x_1$. The coefficient of determination $R^2$ for this regression provides an estimate of the strength of the relationship between $x_1$ and the subset of the other independent variables $x_2, \ldots, x_q$ that you suspect are highly correlated with $x_1$. As a rule of thumb, if the coefficient of determination $R^2$ for this regression exceeds 0.50, multicollinearity between $x_1$ and the subset of the other independent variables $x_2, \ldots, x_q$ is a concern.

4. When working with a small number of observations, assessing the conditions necessary for inference to be valid in regression can be extremely difficult. Similarly, when working with a small number of observations, assessing multicollinearity can also be difficult.

5. In some instances, the values of the independent variables to be used to estimate the value of the dependent variable are not known. For example, a company may include its competitor’s price as an independent variable in a regression model to be used to estimate demand for one of its products in some future period. It is unlikely that the competitor’s price in some future period will be known by this company, and so the company may estimate what the competitor’s price will be and substitute this estimated value into the regression equation. In such instances, estimated values of the independent variables are sometimes substituted into the regression equation to produce an estimated value of the dependent variable. The result can be useful, but one must proceed with caution as an inaccurate estimate of the value of any independent variable can create an inaccurate estimate of the dependent variable.

7.6 Categorical Independent Variables

Thus far, the examples we have considered have involved quantitative independent variables such as the miles traveled and the number of deliveries. In many situations, however, we must work with categorical independent variables such as marital status (married, single), method of payment (cash, credit card, check), and so on. The purpose of this section is to show how categorical variables are handled in regression analysis. To illustrate the use and interpretation of a categorical independent variable, we will again consider the Butler Trucking Company example.

Butler Trucking Company and Rush Hour

Several of Butler Trucking’s driving assignments require the driver to travel on a congested segment of a highway during the afternoon rush hour. Management believes that this factor may also contribute substantially to variability in the travel times across driving assignments. How do we incorporate information on which driving assignments include travel on a congested segment of a highway during the afternoon rush hour into a regression model?

The previous independent variables we have considered (such as the miles traveled and the number of deliveries) have been quantitative, but this new variable is categorical and will require us to define a new type of variable called a dummy variable. (Dummy variables are sometimes referred to as indicator variables.) To incorporate a variable that indicates whether a driving assignment included travel on this congested segment of a highway during the afternoon rush hour into a model that currently includes the miles traveled ($x_1$) and the number of deliveries ($x_2$), we define the following variable:

$$x_3 = \begin{cases} 0 & \text{if an assignment did not include travel on the congested segment of highway during afternoon rush hour} \\ 1 & \text{if an assignment included travel on the congested segment of highway during afternoon rush hour} \end{cases}$$

Will this dummy variable add valuable information to the current Butler Trucking regression model? A review of the residuals produced by the current model may help us make an initial assessment. Using Excel chart tools, we can create a frequency distribution and a histogram of the residuals for driving assignments that included travel on a congested segment of a highway during the afternoon rush hour period. We then create a frequency distribution and a histogram of the residuals for driving assignments that did not include travel on a congested segment of a highway during the afternoon rush hour period. The two histograms are shown in Figure 7.23.

Recall that the residual for the $i$th observation is $e_i = y_i - \hat{y}_i$, which is the difference between the observed and the predicted values of the dependent variable. The histograms in Figure 7.23 show that driving assignments that included travel on a congested segment of a highway during the afternoon rush hour period tend to have positive residuals, which means we are generally underpredicting the travel times for those driving assignments. Conversely, driving assignments that did not include travel on a congested segment of a highway during the afternoon rush hour period tend to have negative residuals, which means we are generally overpredicting the travel times for those driving assignments. These results suggest that the dummy variable could potentially explain a substantial proportion of the variance in travel time that is unexplained by the current model, and so we proceed by adding the dummy variable $x_3$ to the current Butler Trucking multiple regression model. Using Excel’s Regression tool to develop the estimated regression equation on the data in the file ButlerHighway, we obtain the Excel output in Figure 7.24. The estimated regression equation is

$$\hat{y} = -0.3302 + 0.0672x_1 + 0.6735x_2 + 0.9980x_3 \qquad (7.15)$$

See Chapter 2 for step-by- step descriptions of how to construct charts in Excel.


FIGURE 7.23 Histograms of the Residuals for Driving Assignments That Included Travel on a Congested Segment of a Highway During the Afternoon Rush Hour and Residuals for Driving Assignments That Did Not

[Two histograms, titled "Included Highway-Rush Hour Driving" and "Did Not Include Highway-Rush Hour Driving": horizontal axis, Residuals (bins from < -1.5 to > 1.5 in steps of 0.5); vertical axis, Frequency]

Interpreting the Parameters

After checking to make sure this regression satisfies the conditions for inference and the model does not suffer from serious multicollinearity, we can consider inference on our results. The p values for the t tests of miles traveled (p value = 4.7852E-105), number of deliveries (p value = 6.7480E-87), and the rush hour driving dummy variable (p value = 6.4982E-31) are all extremely small, indicating that each of these independent variables has a statistical relationship with travel time. The model estimates that the mean travel time of a driving assignment increases by:

• 0.0672 hour for every increase of 1 mile traveled, holding constant the number of deliveries and whether the driving assignment route requires the driver to travel on the congested segment of a highway during the afternoon rush hour.

• 0.6735 hour for every delivery, holding constant the number of miles traveled and whether the driving assignment route requires the driver to travel on the congested segment of a highway during the afternoon rush hour.

• 0.9980 hour if the driving assignment route requires the driver to travel on the con- gested segment of a highway during the afternoon rush hour, holding constant the number of miles traveled and the number of deliveries.

In addition, $R^2 = 0.8838$ indicates that the regression model explains approximately 88.4% of the variability in travel time for the driving assignments in the sample. Thus, equation (7.15) should prove helpful in estimating the travel time necessary for the various driving assignments.

FIGURE 7.24 Excel Data and Output for Butler Trucking with Miles Traveled ($x_1$), Number of Deliveries ($x_2$), and the Highway Rush Hour Dummy Variable ($x_3$) as the Independent Variables

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.940107228
R Square             0.8838016
Adjusted R Square    0.882623914
Standard Error       0.663106426
Observations         300

ANOVA
             df     SS             MS             F             Significance F
Regression   3      989.9490008    329.9830003    750.455757    5.7766E-138
Residual     296    130.1541992    0.439710132
Total        299    1120.1032

             Coefficients    Standard Error   t Stat         P-value       Lower 95%      Upper 95%      Lower 99.0%    Upper 99.0%
Intercept    -0.330229304    0.167677925      -1.969426232   0.04983651    -0.66022126    -0.000237349   -0.764941128   0.104482519
Miles        0.067220302     0.00196142       34.27125147    4.7852E-105   0.063360208    0.071080397    0.062135243    0.072305362
Deliveries   0.67351584      0.023619993      28.51465081    6.74797E-87   0.627031441    0.720000239    0.612280051    0.734751629
Highway      0.9980033       0.076706582      13.0106605     6.49817E-31   0.847043924    1.148962677    0.799138374    1.196868226

To understand how to interpret the regression when a categorical variable is present, let’s compare the regression model for the case when $x_3 = 0$ (the driving assignment does not include travel on congested highways) and when $x_3 = 1$ (the driving assignment does include travel on congested highways). In the case that $x_3 = 0$, we have

$$\hat{y} = -0.3302 + 0.0672x_1 + 0.6735x_2 + 0.9980(0) = -0.3302 + 0.0672x_1 + 0.6735x_2 \qquad (7.16)$$

In the case that $x_3 = 1$, we have

$$\hat{y} = -0.3302 + 0.0672x_1 + 0.6735x_2 + 0.9980(1) = 0.6678 + 0.0672x_1 + 0.6735x_2 \qquad (7.17)$$

Comparing equations (7.16) and (7.17), we see that the mean travel time has the same linear relationship with $x_1$ and $x_2$ for both driving assignments that include travel on the congested segment of highway during the afternoon rush hour period and driving assignments that do not. However, the y-intercept is $-0.3302$ in equation (7.16) and 0.6678 in equation (7.17). That is, 0.9980 is the difference between the mean travel time for driving assignments that include travel on the congested segment of highway during the afternoon rush hour and the mean travel time for driving assignments that do not.

In effect, the use of a dummy variable provides two estimated regression equations that can be used to predict the travel time: One that corresponds to driving assignments that include travel on the congested segment of highway during the afternoon rush hour period, and one that corresponds to driving assignments that do not include such travel.
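As a minimal sketch of how one fitted model yields two equations, the Python snippet below evaluates the estimated Butler Trucking regression with the rounded coefficients quoted in the text. The function names and the sample assignment (100 miles, 4 deliveries) are illustrative assumptions, not part of the original example.

```python
# Rounded coefficients from the estimated regression equation in the text.
B0, B_MILES, B_DELIVERIES, B_HIGHWAY = -0.3302, 0.0672, 0.6735, 0.9980


def predict_travel_time(miles, deliveries, highway):
    """Estimated travel time; `highway` is the 0/1 dummy variable x3."""
    if highway not in (0, 1):
        raise ValueError("highway must be coded 0 or 1")
    return B0 + B_MILES * miles + B_DELIVERIES * deliveries + B_HIGHWAY * highway


def intercept_for(highway):
    """Intercept of the equation implied by a given value of the dummy."""
    return B0 + B_HIGHWAY * highway


# Hypothetical driving assignment used only for illustration.
without_highway = predict_travel_time(100, 4, 0)
with_highway = predict_travel_time(100, 4, 1)
print(round(intercept_for(0), 4), round(intercept_for(1), 4))
print(round(with_highway - without_highway, 4))
```

The slopes on miles and deliveries are shared; only the intercept shifts, so the gap between the two predictions equals the dummy coefficient whatever values of miles and deliveries are used.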

More Complex Categorical Variables

The categorical variable for the Butler Trucking Company example had two levels: (1) driving assignments that include travel on the congested segment of highway during the afternoon rush hour and (2) driving assignments that do not. As a result, defining a dummy variable with a value of zero indicating a driving assignment that does not include travel on the congested segment of highway during the afternoon rush hour and a value of one indicating a driving assignment that includes such travel was sufficient. However, when a categorical variable has more than two levels, care must be taken in both defining and interpreting the dummy variables. As we will show, if a categorical variable has $k$ levels, $k - 1$ dummy variables are required, with each dummy variable corresponding to one of the levels of the categorical variable and coded as 0 or 1.

For example, suppose a manufacturer of vending machines organized the sales territories for a particular state into three regions: A, B, and C. The managers want to use regression analysis to help predict the number of vending machines sold per week. With the number of units sold as the dependent variable, they are considering several independent variables (the number of sales personnel, advertising expenditures, etc.). Suppose the managers believe that sales region is also an important factor in predicting the number of units sold. Because sales region is a categorical variable with three levels (A, B, and C), we will need $3 - 1 = 2$ dummy variables to represent the sales region. Selecting Region A to be the “reference” region, each dummy variable can be coded 0 or 1 as follows:

$$x_1 = \begin{cases} 1 & \text{if sales Region B} \\ 0 & \text{otherwise} \end{cases} \qquad x_2 = \begin{cases} 1 & \text{if sales Region C} \\ 0 & \text{otherwise} \end{cases}$$

With this definition, we have the following values of $x_1$ and $x_2$:

Region   $x_1$   $x_2$
A        0       0
B        1       0
C        0       1


The regression equation relating the estimated mean number of units sold to the dummy variables is written as

$$\hat{y} = b_0 + b_1x_1 + b_2x_2$$

Observations corresponding to Region A are coded $x_1 = 0$, $x_2 = 0$, so the estimated mean number of units sold in Region A is

$$\hat{y} = b_0 + b_1(0) + b_2(0) = b_0$$

Observations corresponding to Region B are coded $x_1 = 1$, $x_2 = 0$, so the estimated mean number of units sold in Region B is

$$\hat{y} = b_0 + b_1(1) + b_2(0) = b_0 + b_1$$

Observations corresponding to Region C are coded $x_1 = 0$, $x_2 = 1$, so the estimated mean number of units sold in Region C is

$$\hat{y} = b_0 + b_1(0) + b_2(1) = b_0 + b_2$$

Thus, $b_0$ is the estimated mean sales for Region A, $b_1$ is the estimated difference between the mean number of units sold in Region B and the mean number of units sold in Region A, and $b_2$ is the estimated difference between the mean number of units sold in Region C and the mean number of units sold in Region A.

Two dummy variables were required because sales region is a categorical variable with three levels. But the assignment of $x_1 = 0$ and $x_2 = 0$ to indicate Region A, $x_1 = 1$ and $x_2 = 0$ to indicate Region B, and $x_1 = 0$ and $x_2 = 1$ to indicate Region C was arbitrary. For example, we could have chosen to let $x_1 = 1$ and $x_2 = 0$ indicate Region A, $x_1 = 0$ and $x_2 = 0$ indicate Region B, and $x_1 = 0$ and $x_2 = 1$ indicate Region C. In this case, $b_0$ is the mean or expected value of sales for Region B, $b_1$ is the difference between the mean number of units sold in Region A and the mean number of units sold in Region B, and $b_2$ is the difference between the mean number of units sold in Region C and the mean number of units sold in Region B.

The important point to remember is that when a categorical variable has $k$ levels, $k - 1$ dummy variables are required in the multiple regression analysis. Thus, if the sales region example had a fourth region, labeled D, three dummy variables would be necessary. For example, these three dummy variables could then be coded as follows:

$$x_1 = \begin{cases} 1 & \text{if sales Region B} \\ 0 & \text{otherwise} \end{cases} \qquad x_2 = \begin{cases} 1 & \text{if sales Region C} \\ 0 & \text{otherwise} \end{cases} \qquad x_3 = \begin{cases} 1 & \text{if sales Region D} \\ 0 & \text{otherwise} \end{cases}$$

Dummy variables are often used to model seasonal effects in sales data. If the data are collected quarterly and we use winter as the reference season, we may use three dummy variables defined in the following manner:

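$$x_1 = \begin{cases} 1 & \text{if spring} \\ 0 & \text{otherwise} \end{cases} \qquad x_2 = \begin{cases} 1 & \text{if summer} \\ 0 & \text{otherwise} \end{cases} \qquad x_3 = \begin{cases} 1 & \text{if fall} \\ 0 & \text{otherwise} \end{cases}$$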
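The $k - 1$ coding rule can be sketched in a few lines of Python. The helper `dummy_code` below is a hypothetical name introduced only for illustration; it returns one 0/1 value for every level except the chosen reference level, in the order the levels are listed.

```python
def dummy_code(value, levels, reference):
    """Return the k - 1 dummy values for `value`, omitting the reference level."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    if reference not in levels:
        raise ValueError(f"unknown reference level: {reference!r}")
    # One dummy per non-reference level; the reference level is all zeros.
    return [1 if value == level else 0 for level in levels if level != reference]


regions = ["A", "B", "C"]
for region in regions:
    print(region, dummy_code(region, regions, reference="A"))

seasons = ["winter", "spring", "summer", "fall"]
for season in seasons:
    print(season, dummy_code(season, seasons, reference="winter"))
```

Changing `reference` from "A" to "B" reproduces the alternative coding discussed above, in which Region B is the all-zeros case and the coefficients measure differences from Region B.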

NOTES + COMMENTS

Detecting multicollinearity when a categorical variable is involved is difficult. The correlation coefficient that we used in Section 7.5 is appropriate only when assessing the relationship between two quantitative variables. However, recall that if any estimated regression parameters $b_1, b_2, \ldots, b_q$ or associated p values change dramatically when a new independent variable is added to the model (or an existing independent variable is removed from the model), multicollinearity is likely present. We can use our understanding of these ramifications of multicollinearity to assess whether there is multicollinearity that involves a dummy variable. We estimate the regression model twice; once with the dummy variable included as an independent variable and once with the dummy variable omitted from the regression model. If we see relatively little change in the estimated regression parameters $b_1, b_2, \ldots, b_q$ or associated p values for the independent variables that have been included in both regression models, we can be confident that there is not strong multicollinearity involving the dummy variable.


7.7 Modeling Nonlinear Relationships

Regression may be used to model more complex types of relationships. To illustrate, let us consider the problem facing Reynolds, Inc., a manufacturer of industrial scales and laboratory equipment. Managers at Reynolds want to investigate the relationship between length of employment of their salespeople and the number of electronic laboratory scales sold. The file Reynolds gives the number of scales sold by 15 randomly selected salespeople for the most recent sales period and the number of months each salesperson has been employed by the firm. Figure 7.25, the scatter chart for these data, indicates a possible curvilinear relationship between the length of time employed and the number of units sold.

Before considering how to develop a curvilinear relationship for Reynolds, let us consider the Excel output in Figure 7.26 for a simple linear regression; the estimated regression is

$$\text{Sales} = 113.7453 + 2.3675\,\text{Months Employed}$$

The computer output shows that the relationship is significant (p value = 9.3954E-06 in cell E18 of Figure 7.26 for the t test that $\beta_1 = 0$) and that a linear relationship explains a high percentage of the variability in sales ($r^2 = 0.7901$ in cell B5). However, Figure 7.27 reveals a pattern in the scatter chart of residuals against the predicted values of the dependent variable that suggests that a curvilinear relationship may provide a better fit to the data.
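For readers working outside Excel, the simple linear fit can be checked with a short least squares calculation. The sketch below uses only the Python standard library; the two data lists are transcribed from the Reynolds worksheet shown in Figure 7.29, and the function name is an illustrative assumption.

```python
# Reynolds data transcribed from Figure 7.29 (15 salespeople).
months = [41, 106, 76, 100, 22, 12, 85, 111, 40, 51, 0, 12, 6, 56, 19]
scales = [275, 296, 317, 376, 162, 150, 367, 308, 189, 235, 83, 112, 67, 325, 189]


def simple_linear_regression(x, y):
    """Return (intercept, slope, r_squared) for the least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


b0, b1, r2 = simple_linear_regression(months, scales)
print(f"Sales = {b0:.4f} + {b1:.4f} Months Employed, r^2 = {r2:.4f}")
```

A high $r^2$ alone does not confirm that a straight line is the right form, which is why the residual chart is examined next.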

If we have a practical reason to suspect a curvilinear relationship between the number of electronic laboratory scales sold by a salesperson and the number of months the salesperson has been employed, we may wish to consider an alternative to simple linear regression. For example, we may believe that a recently hired salesperson faces a learning curve but becomes increasingly more effective over time and that a salesperson who has been in a sales position with Reynolds for a long time eventually becomes burned out and becomes increasingly less effective. If our regression model supports this theory, Reynolds management can use the model to identify the approximate point in employment when its salespeople begin to lose their effectiveness, and management can plan strategies to counteract salesperson burnout.

The scatter chart of residuals against the independent variable Months Employed would also suggest that a curvilinear relationship may provide a better fit to the data.

FIGURE 7.25  Scatter Chart for the Reynolds Example

[Scatter chart of Scales Sold (vertical axis) against Months Employed (horizontal axis); data in the file Reynolds.]


Quadratic Regression Models

To account for the curvilinear relationship between months employed and scales sold that is suggested by the scatter chart of residuals against the predicted values of the dependent variable, we could include the square of the number of months the salesperson has been employed as a second independent variable in the estimated regression equation:

$$\hat{y} = b_0 + b_1x_1 + b_2x_1^2 \tag{7.18}$$

Equation (7.18) corresponds to a quadratic regression model. As Figure 7.28 illustrates, quadratic regression models are flexible and are capable of representing a wide variety of nonlinear relationships between an independent variable and the dependent variable.

To estimate the values of $b_0$, $b_1$, and $b_2$ in equation (7.18) with Excel, we need to add to the original data the square of the number of months the salesperson has been employed with the firm. Figure 7.29 shows the Excel spreadsheet that includes the square of the number of months the employee has been with the firm. To create the variable, which we will call MonthsSq, we create a new column and set each cell in that column equal to the square of the associated value of the variable Months. These values are shown in Column B of Figure 7.29.

The regression output for equation (7.18) is shown in Figure 7.30. The estimated regression equation is

$$\text{Sales} = 61.4299 + 5.8198\,\text{Months Employed} - 0.0310\,\text{MonthsSq}$$

where MonthsSq is the square of the number of months the salesperson has been employed. Because the value of $b_1$ (5.8198) is positive, and the value of $b_2$ ($-0.0310$) is negative, $\hat{y}$ will initially increase as the number of months the salesperson has been employed increases. As the value of the independent variable Months Employed increases, its squared value increases more rapidly, and eventually $\hat{y}$ will decrease as the number of months the salesperson has been employed increases.
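The same quadratic fit can be sketched without Excel by adding the squared column and solving the least squares normal equations. The Python code below is one possible approach using only the standard library; the helper names are assumptions, the data are transcribed from Figure 7.29, and a numerical library would normally replace the small hand-written solver.

```python
# Reynolds data transcribed from Figure 7.29 (15 salespeople).
months = [41, 106, 76, 100, 22, 12, 85, 111, 40, 51, 0, 12, 6, 56, 19]
scales = [275, 296, 317, 376, 162, 150, 367, 308, 189, 235, 83, 112, 67, 325, 189]


def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def fit_quadratic(x, y):
    """Least squares estimates (b0, b1, b2) for y = b0 + b1*x + b2*x^2."""
    rows = [[1.0, xi, xi ** 2] for xi in x]  # the x^2 column plays the role of MonthsSq
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)


b0, b1, b2 = fit_quadratic(months, scales)
print(f"Sales = {b0:.4f} + {b1:.4f} Months Employed + ({b2:.4f}) MonthsSq")
```

The model is still linear in its coefficients, which is why ordinary least squares applies once the squared column has been created.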

FIGURE 7.26  Excel Regression Output for the Reynolds Example

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.888897515
R Square            0.790138792
Adjusted R Square   0.773995622
Standard Error      48.49087146
Observations        15

ANOVA
             df   SS            MS            F             Significance F
Regression   1    115089.1933   115089.1933   48.94570268   9.39543E-06
Residual     13   30567.74      2351.364615
Total        14   145656.9333

                  Coefficients   Standard Error   t Stat        P-value       Lower 95%     Upper 95%     Lower 95.0%   Upper 95.0%
Intercept         113.7452874    20.81345608      5.464987985   0.000108415   68.78054927   158.7100256   68.78054927   158.7100256
Months Employed   2.367463621    0.338396631      6.996120545   9.39543E-06   1.636402146   3.098525095   1.636402146   3.098525095

FIGURE 7.27  Scatter Chart of the Residuals and Predicted Values of the Dependent Variable for the Reynolds Simple Linear Regression

[Scatter chart of Residuals (vertical axis) against Predicted Values (horizontal axis).]

FIGURE 7.28  Relationships That Can Be Fit with a Quadratic Regression Model

[Panels of quadratic curves plotted against x, including (a) $\beta_1 > 0$, $\beta_2 > 0$ and (b) $\beta_1 < 0$, $\beta_2 > 0$.]

If $b_2 > 0$, the function is convex (bowl-shaped relative to the x-axis); if $b_2 < 0$, the function is concave (mound-shaped relative to the x-axis).

The $R^2$ of 0.9013 indicates that this regression model explains approximately 90.1% of the variation in Scales Sold for our sample data. The lack of a distinct pattern in the scatter chart of residuals against the predicted values of the dependent variable (Figure 7.31) suggests that the quadratic model fits the data better than the simple linear regression in the Reynolds example. While not shown here, the scatter chart of residuals against the independent variable Months Employed also lacks any distinct pattern.

Although it is difficult to assess from a sample as small as this whether the regression model satisfies the conditions necessary for reliable inference, we see no marked violations of these conditions, so we will proceed with hypothesis tests of the regression parameters $\beta_0$, $\beta_1$, and $\beta_2$ for our quadratic regression model. From the Excel output provided in Figure 7.30, we see that the p values corresponding to the t statistics for Months Employed (6.2050E-05) and MonthsSq (0.0032) are both substantially less than 0.05, and hence we can conclude that the variables Months Employed and MonthsSq are significant. There is a nonlinear relationship between months employed and sales.

Note that if the estimated regression parameters $b_1$ and $b_2$ corresponding to the linear term $x$ and the squared term $x^2$ are of the same sign, the estimated value of the dependent variable is either increasing over the experimental range of $x$ (when $b_1 > 0$ and $b_2 > 0$) or decreasing over the experimental range of $x$ (when $b_1 < 0$ and $b_2 < 0$). If the estimated regression parameters $b_1$ and $b_2$ corresponding to the linear term $x$ and the squared term $x^2$ have different signs, the estimated value of the dependent variable has a maximum over the experimental range of $x$ (when $b_1 > 0$ and $b_2 < 0$) or a minimum over the experimental range of $x$ (when $b_1 < 0$ and $b_2 > 0$). In these instances, we can find the estimated maximum or minimum over the experimental range of $x$ by finding the value of $x$ at which the estimated value of the dependent variable stops increasing and begins decreasing (when a maximum exists) or stops decreasing and begins increasing (when a minimum exists). For example, we estimate that when months employed increases by 1 from some value $x$ to $x + 1$, sales changes by

$$\begin{aligned}
5.8198[(x + 1) - x] - 0.0310[(x + 1)^2 - x^2] &= 5.8198(x - x + 1) - 0.0310(x^2 + 2x + 1 - x^2) \\
&= 5.8198 - 0.0310(2x + 1) \\
&= 5.7888 - 0.0620x
\end{aligned}$$

FIGURE 7.29  Excel Data for the Reynolds Quadratic Regression Model

Months Employed   MonthsSq   Scales Sold
41                1,681      275
106               11,236     296
76                5,776      317
100               10,000     376
22                484        162
12                144        150
85                7,225      367
111               12,321     308
40                1,600      189
51                2,601      235
0                 0          83
12                144        112
6                 36         67
56                3,136      325
19                361        189


FIGURE 7.30  Excel Output for the Reynolds Quadratic Regression Model

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.949361402
R Square            0.901287072
Adjusted R Square   0.884834917
Standard Error      34.61481184
Observations        15

ANOVA
             df   SS            MS            F             Significance F
Regression   2    131278.711    65639.35548   54.78231208   9.25218E-07
Residual     12   14378.22238   1198.185199
Total        14   145656.9333

                  Coefficients   Standard Error   t Stat         P-value       Lower 95%      Upper 95%      Lower 99.0%    Upper 99.0%
Intercept         61.42993467    20.57433536      2.985755485    0.011363561   16.60230882    106.2575605    -1.415187222   124.2750566
Months Employed   5.819796648    0.969766536      6.001234761    6.20497E-05   3.706856877    7.93273642     2.857606371    8.781986926
MonthsSq          -0.031009589   0.008436087      -3.675826286   0.003172962   -0.049390243   -0.012628935   -0.05677795    -0.005241228

FIGURE 7.31  Scatter Chart of the Residuals and Predicted Values of the Dependent Variable for the Reynolds Quadratic Regression Model

[Scatter chart of Residuals (vertical axis) against Predicted Values (horizontal axis).]


That is, estimated Sales initially increases as Months Employed increases and then eventually decreases as Months Employed increases. Solving this result for $x$:

$$\begin{aligned}
5.7888 - 0.0620x &= 0 \\
-0.0620x &= -5.7888 \\
x &= \frac{-5.7888}{-0.0620} = 93.3387
\end{aligned}$$

tells us that estimated maximum sales occurs at approximately 93 months (in about seven years and nine months). We can then find the estimated maximum value of the dependent variable Sales by substituting this value of $x$ into the estimated regression equation:

$$\text{Sales} = 61.4299 + 5.8198(93.3387) - 0.0310(93.3387^2) = 334.4909$$

At approximately 93 months, the maximum estimated sales of approximately 334 scales occurs.
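The calculation above can be sketched in Python. The snippet uses the unrounded coefficients reported in Figure 7.30, which is why its result matches 93.3387 more closely than dividing the rounded values 5.7888 and 0.0620 would. The function names are illustrative, and the comparison with the vertex of the parabola is an addition for context rather than part of the original discussion.

```python
# Unrounded coefficients from the Excel output in Figure 7.30.
B0, B1, B2 = 61.42993467, 5.819796648, -0.031009589


def predicted_sales(months):
    """Estimated scales sold from the quadratic regression equation."""
    return B0 + B1 * months + B2 * months ** 2


def one_month_change(months):
    """Change in estimated sales when months increases from x to x + 1."""
    return B1 + B2 * (2 * months + 1)


def months_at_zero_change():
    """Value of x at which the one-month change equals zero."""
    return -(B1 + B2) / (2 * B2)


def vertex():
    """Turning point of the fitted parabola, x = -b1 / (2 * b2)."""
    return -B1 / (2 * B2)


x_star = months_at_zero_change()
print(round(x_star, 4), round(predicted_sales(x_star), 2))
print(round(vertex(), 4), round(predicted_sales(vertex()), 2))
```

The one-month-difference approach always lands exactly half a month before the vertex of the fitted parabola; for this model both round to roughly 93 to 94 months and about 334 scales.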

Piecewise Linear Regression Models

As an alternative to a quadratic regression model, we can recognize that below some value of Months Employed, the relationship between Months Employed and Sales appears to be positive and linear, whereas the relationship between Months Employed and Sales appears to be negative and linear for the remaining observations. A piecewise linear regression model will allow us to fit these relationships as two linear regressions that are joined at the value of Months at which the relationship between Months Employed and Sales changes.

Our first step in fitting a piecewise linear regression model is to identify the value of the independent variable Months Employed at which the relationship between Months Employed and Sales changes; this point is called the knot, or breakpoint. Although theory should determine this value, analysts often use the sample data to aid in the identification of this point. Figure 7.32 provides the scatter chart for the Reynolds data with an indication of the possible location of the knot, which we have denoted $x^{(k)}$. From this scatter chart, it appears that the knot is at approximately 90 months.

In business analytics applications, polynomial regression models of higher than second or third order are rarely used.

A piecewise linear regression model is sometimes referred to as a segment regression or a spline model.

FIGURE 7.32  Possible Position of Knot $x^{(k)}$

[Scatter chart of Scales Sold (vertical axis) against Months Employed (horizontal axis), with the knot $x^{(k)}$ marked between 80 and 100 months.]

Once we have decided on the location of the knot, we define a dummy variable that is equal to zero for any observation for which the value of Months Employed is less than or equal to the value of the knot, and equal to one for any observation for which the value of Months Employed is greater than the value of the knot:

$$x_k = \begin{cases} 0 & \text{if } x_1 \le x^{(k)} \\ 1 & \text{if } x_1 > x^{(k)} \end{cases} \tag{7.19}$$

where

$x_1$ = Months
$x^{(k)}$ = the value of the knot (90 months for the Reynolds example)
$x_k$ = the knot dummy variable

We then fit the following estimated regression equation:

$$\hat{y} = b_0 + b_1x_1 + b_2\left(x_1 - x^{(k)}\right)x_k \tag{7.20}$$

The data and Excel output for the Reynolds piecewise linear regression model are provided in Figure 7.33. Because we placed the knot at $x^{(k)} = 90$, the estimated regression equation is

$$\hat{y} = 87.2172 + 3.4094x_1 - 7.8726(x_1 - 90)x_k$$

The output shows that the p value corresponding to the t statistic for the knot term ($p = 0.0014$) is less than 0.05, and hence we can conclude that adding the knot to the model with Months Employed as the independent variable is significant.

But what does this model mean? For any value of Months Employed less than or equal to 90, the knot term $-7.8726(x_1 - 90)x_k$ is zero because the knot dummy variable $x_k = 0$, so the regression equation is

$$\hat{y} = 87.2172 + 3.4094x_1$$

For any value of Months Employed greater than 90, the knot term is $-7.8726(x_1 - 90)$ because the knot dummy variable $x_k = 1$, so the regression equation is

$$\begin{aligned}
\hat{y} &= 87.2172 + 3.4094x_1 - 7.8726(x_1 - 90) \\
&= 87.2172 + 7.8726(90) + (3.4094 - 7.8726)x_1 \\
&= 795.7512 - 4.4632x_1
\end{aligned}$$

Note that if Months Employed is equal to 90, both regressions yield the same value of $\hat{y}$:

$$\hat{y} = 87.2172 + 3.4094(90) = 795.7512 - 4.4632(90) = 394.06$$

So the two regression segments are joined at the knot.

The interpretation of this model is similar to the interpretation of the quadratic regression model. A salesperson’s sales are expected to increase by 3.4094 electronic laboratory scales for each month of employment until the salesperson has been employed for 90 months. At that point the salesperson’s sales are expected to decrease by 4.4632 electronic laboratory scales for each additional month of employment.
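A short Python sketch may help make the knot term concrete. It uses the rounded coefficients from the estimated equation and the knot at 90 months; the function names are illustrative assumptions rather than anything defined in the text.

```python
KNOT = 90  # knot location chosen for the Reynolds example
B0, B1, B2 = 87.2172, 3.4094, -7.8726  # rounded estimates from the text


def knot_dummy(months):
    """Equation (7.19): 0 at or below the knot, 1 above it."""
    return 1 if months > KNOT else 0


def knot_term(months):
    """The variable Knot Dummy*Months: dummy times (months - knot)."""
    return knot_dummy(months) * (months - KNOT)


def predicted_sales(months):
    """Equation (7.20) evaluated with the estimated coefficients."""
    return B0 + B1 * months + B2 * knot_term(months)


for m in (60, 90, 100, 111):
    print(m, knot_dummy(m), knot_term(m), round(predicted_sales(m), 4))
```

Because the knot term is zero at exactly 90 months, the two segments meet there, and the slope changes from 3.4094 to 3.4094 - 7.8726 = -4.4632 beyond it.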

Should we use the quadratic regression model or the piecewise linear regression model? These models fit the data equally well, and both have reasonable interpretations, so we cannot differentiate between the models on either of these criteria. Thus, we must consider whether the abrupt change in the relationship between Sales and Months Employed that is suggested by the piecewise linear regression model captures the real relationship between Sales and Months Employed better than the smooth change in the relationship between Sales and Months Employed suggested by the quadratic model.

Multiple knots can be used to fit complex piecewise linear regressions.


FIGURE 7.33  Data and Excel Output for the Reynolds Piecewise Linear Regression Model

[Worksheet columns A through D hold Knot Dummy, Months Employed, Knot Dummy*Months, and Scales Sold for the 15 salespeople; the regression output follows.]

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.955796127
R Square            0.913546237
Adjusted R Square   0.899137276
Standard Error      32.3941739
Observations        15

ANOVA
             df   SS            MS            F            Significance F
Regression   2    133064.3433   66532.17165   63.4012588   4.17545E-07
Residual     12   12592.59003   1049.382502
Total        14   145656.9333

                      Coefficients   Standard Error   t Stat         P-value      Lower 95%      Upper 95%      Lower 99.0%    Upper 99.0%
Intercept             87.21724231    15.31062519      5.696517369    9.9677E-05   53.85825572    120.5762289    40.45033153    133.9841531
Months Employed       3.409431979    0.338360666      10.07632484    3.2987E-07   2.67220742     4.146656538    2.375895931    4.442968028
Knot Dummy*Months     -7.872553259   1.902156543      -4.138751508   0.00137388   -12.01699634   -3.728110179   -13.68276572   -2.062340794

Interaction Between Independent Variables

Often the relationship between the dependent variable and one independent variable is different at various values of a second independent variable. When this occurs, it is called an interaction. If the original data set consists of observations for $y$ and two independent variables $x_1$ and $x_2$, we can incorporate an $x_1x_2$ interaction into the estimated multiple linear regression equation in the following manner:

$$\hat{y} = b_0 + b_1x_1 + b_2x_2 + b_3x_1x_2 \tag{7.21}$$

The variable Knot Dummy*Months is the product of the corresponding values of Knot Dummy and the difference between Months Employed and the knot value, that is, C2 = A2*(B2 - 90) in this Excel spreadsheet.


To provide an illustration of an interaction and what it means, let us consider the regression study conducted by Tyler Personal Care for one of its new shampoo products. The two factors believed to have the most influence on sales are unit selling price and advertising expenditure. To investigate the effects of these two variables on sales, prices of $2.00, $2.50, and $3.00 were paired with advertising expenditures of $50,000 and $100,000 in 24 test markets.

The data collected by Tyler are provided in the file Tyler. Figure 7.34 shows the sample mean sales for the six price and advertising expenditure combinations. Note that the sample mean sales corresponding to a price of $2.00 and an advertising expenditure of $50,000 is 461,000 units and that the sample mean sales corresponding to a price of $2.00 and an advertising expenditure of $100,000 is 808,000 units. Hence, with price held constant at $2.00, the difference in mean sales between advertising expenditures of $50,000 and $100,000 is 808,000 - 461,000 = 347,000 units. When the price of the product is $2.50, the difference in mean sales between advertising expenditures of $50,000 and $100,000 is 646,000 - 364,000 = 282,000 units. Finally, when the price is $3.00, the difference in mean sales between advertising expenditures of $50,000 and $100,000 is 375,000 - 332,000 = 43,000 units. Clearly, the difference between mean sales for advertising expenditures of $50,000 and mean sales for advertising expenditures of $100,000 depends on the price of the product. In other words, at higher selling prices, the effect of increased advertising expenditure diminishes. These observations provide evidence of interaction between the price and advertising expenditure variables.

FIGURE 7.34  Mean Unit Sales (1,000s) as a Function of Selling Price and Advertising Expenditures

[Line chart of Mean Unit Sales (1,000s) against Selling Price ($2.00, $2.50, $3.00), with one line for advertising expenditures of $50,000 and one for $100,000. Annotated differences: 808 - 461 = 347 at $2.00; 646 - 364 = 282 at $2.50; 375 - 332 = 43 at $3.00.]

When interaction between two variables is present, we cannot study the relationship between one independent variable and the dependent variable y independently of the other variable. In other words, meaningful conclusions can be developed only if we consider the joint relationship that both independent variables have with the dependent variable. To account for the interaction, we use the regression equation in equation (7.21), where

$$\begin{aligned}
y &= \text{Unit Sales (1,000s)} \\
x_1 &= \text{Price (\$)} \\
x_2 &= \text{Advertising Expenditure (\$1,000s)}
\end{aligned}$$

Note that the regression equation in equation (7.21) reflects Tyler’s belief that the number of units sold is related to selling price and advertising expenditure (accounted for by the $b_1x_1$ and $b_2x_2$ terms) and an interaction between the two variables (accounted for by the $b_3x_1x_2$ term).

The Excel output corresponding to the interaction model for the Tyler Personal Care example is provided in Figure 7.35. The resulting estimated regression equation is

$$\text{Sales} = -275.8333 + 175\,\text{Price} + 19.68\,\text{Advertising} - 6.08\,\text{Price*Advertising}$$

Because the p value corresponding to the t test for Price*Advertising is 8.6772E-10, we conclude that interaction is significant. Thus, the regression results show that the relationship between advertising expenditure and sales depends on the price (and the relationship between price and sales depends on advertising expenditure).

In the file Tyler, the data for the independent variable Price is in column A, the independent variable Advertising Expenditures is in column B, and the dependent variable Sales is in column D. We created the interaction variable Price*Advertising in column C by entering the function A2*B2 in cell C2, and then copying cell C2 into cells C3 through C25.

FIGURE 7.35  Excel Output for the Tyler Personal Care Linear Regression Model with Interaction

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.988993815
R Square            0.978108766
Adjusted R Square   0.974825081
Standard Error      28.17386496
Observations        24

ANOVA
             df   SS            MS            F          Significance F
Regression   3    709316        236438.6667   297.8692   9.25881E-17
Residual     20   15875.3333    793.7666667
Total        23   725191.3333

                                    Coefficients    Standard Error   t Stat         P-value       Lower 95%      Upper 95%      Lower 99.0%    Upper 99.0%
Intercept                           -275.8333333    112.8421033      -2.444418575   0.023898351   -511.2178361   -40.44883053   -596.9074508   45.24078413
Price                               175             44.54679188      3.928453489    0.0008316     82.07702045    267.9229796    48.24924412    301.7507559
Advertising Expenditure ($1,000s)   19.68           1.42735225       13.78776683    1.1263E-11    16.70259538    22.65740462    15.61869796    23.74130204
Price*Advertising                   -6.08           0.563477299      -10.79014187   8.67721E-10   -7.255393049   -4.904606951   -7.683284335   -4.476715665


Our initial review of these results may alarm us: How can price have a positive estimated regression coefficient? With the exception of luxury goods, we expect sales to decrease as price increases. Although this result appears counterintuitive, we can make sense of this model if we work through the interpretation of the interaction. In other words, the relationship between the independent variable Price and the dependent variable Sales is different at various values of Advertising (and the relationship between the independent variable Advertising and the dependent variable Sales is different at various values of Price).

It becomes easier to see how the predicted value of Sales depends on Price by using the estimated regression equation to consider the effect when Price increases by $1:

$$\text{Sales After \$1 Price Increase} = -275.8333 + 175(\text{Price} + 1) + 19.68\,\text{Advertising} - 6.08(\text{Price} + 1) * \text{Advertising}$$

Thus,

$$\text{Sales After \$1 Price Increase} - \text{Sales Before \$1 Price Increase} = 175 - 6.08 * \text{Advertising Expenditure}$$

So the change in the predicted value of sales when the independent variable Price increases by $1 depends on how much was spent on advertising.

Consider a concrete example. If Advertising Expenditures is $50,000 when price is $2.00, we estimate sales to be

$$\text{Sales} = -275.8333 + 175(2) + 19.68(50) - 6.08(2)(50) = 450.1667, \text{ or 450,167 units}$$

At the same level of Advertising Expenditures ($50,000) when price is $3.00, we estimate sales to be

$$\text{Sales} = -275.8333 + 175(3) + 19.68(50) - 6.08(3)(50) = 321.1667, \text{ or 321,167 units}$$

So when Advertising Expenditures is $50,000, a change in price from $2.00 to $3.00 results in a 450,167 - 321,167 = 129,000 unit decrease in estimated sales. However, if Advertising Expenditures is $100,000 when price is $2.00, we estimate sales to be

$$\text{Sales} = -275.8333 + 175(2) + 19.68(100) - 6.08(2)(100) = 826.1667, \text{ or 826,167 units}$$

At the same level of Advertising Expenditures ($100,000) when price is $3.00, we estimate sales to be

$$\text{Sales} = -275.8333 + 175(3) + 19.68(100) - 6.08(3)(100) = 393.1667, \text{ or 393,167 units}$$

So when Advertising Expenditures is $100,000, a change in price from $2.00 to $3.00 results in a 826,167 - 393,167 = 433,000 unit decrease in estimated sales. When Tyler spends more on advertising, its sales are more sensitive to changes in price. Perhaps at larger Advertising Expenditures, Tyler attracts new customers who have been buying the product from another company and so are more aware of the prices charged for the product by Tyler’s competitors.
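The four estimates above can be reproduced with a small Python sketch of the interaction model. It uses the rounded coefficients from the estimated equation, with sales in 1,000s of units and advertising in $1,000s; the function names are illustrative assumptions.

```python
# Rounded coefficients from the estimated Tyler regression equation.
B0, B_PRICE, B_ADV, B_INTERACTION = -275.8333, 175.0, 19.68, -6.08


def predicted_sales(price, advertising):
    """Estimated unit sales (1,000s); advertising is in $1,000s."""
    return (B0 + B_PRICE * price + B_ADV * advertising
            + B_INTERACTION * price * advertising)


def price_effect(advertising):
    """Change in estimated sales (1,000s) for a $1 price increase."""
    return B_PRICE + B_INTERACTION * advertising


for adv in (50, 100):
    low_price = predicted_sales(2.00, adv)
    high_price = predicted_sales(3.00, adv)
    print(adv, round(low_price, 4), round(high_price, 4), round(price_effect(adv), 4))
```

Although the Price coefficient alone is positive, the combined price effect 175 - 6.08 * Advertising is negative at both advertising levels used in the study, which resolves the apparent contradiction.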

There is a second and equally valid interpretation of the interaction; it tells us that the relationship between the independent variable Advertising Expenditure and the dependent variable Sales is different at various values of Price. Using the estimated regression equation to consider the effect when Advertising Expenditure increases by $1,000:

$$\text{Sales After \$1K Advertising Increase} = -275.8333 + 175\,\text{Price} + 19.68(\text{Advertising} + 1) - 6.08\,\text{Price} * (\text{Advertising} + 1)$$

Thus,

$$\text{Sales After \$1K Advertising Increase} - \text{Sales Before \$1K Advertising Increase} = 19.68 - 6.08\,\text{Price}$$

So the change in the predicted value of the dependent variable that occurs when the independent variable Advertising Expenditure increases by $1,000 depends on the price.

Thus, if Price is $2.00 when Advertising Expenditures is $50,000, we estimate sales to be

$$\text{Sales} = -275.8333 + 175(2) + 19.68(50) - 6.08(2)(50) = 450.1667, \text{ or 450,167 units}$$

At the same level of Price ($2.00) when Advertising Expenditures is $100,000, we estimate sales to be

$$\text{Sales} = -275.8333 + 175(2) + 19.68(100) - 6.08(2)(100) = 826.1667, \text{ or 826,167 units}$$

So when Price is $2.00, a change in Advertising Expenditures from $50,000 to $100,000 results in a 826,167 - 450,167 = 376,000 unit increase in estimated sales. However, if Price is $3.00 when Advertising Expenditures is $50,000, we estimate sales to be

$$\text{Sales} = -275.8333 + 175(3) + 19.68(50) - 6.08(3)(50) = 321.1667, \text{ or 321,167 units}$$

At the same level of Price ($3.00) when Advertising Expenditures is $100,000, we estimate sales to be

$$\text{Sales} = -275.8333 + 175(3) + 19.68(100) - 6.08(3)(100) = 393.1667, \text{ or 393,167 units}$$

So when Price is $3.00, a change in Advertising Expenditure from $50,000 to $100,000 results in a 393,167 - 321,167 = 72,000 unit increase in estimated sales. When the price of Tyler’s product is high, its sales are less sensitive to changes in advertising expenditure. Perhaps as Tyler increases its price, it must advertise more to convince potential customers that its product is a good value.
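As a compact check on the two interpretations, the per-unit effects implied by equation (7.21) can be written side by side. The symbols $\Delta_{\text{Price}}$ and $\Delta_{\text{Adv}}$ are introduced here only for illustration, with sales in 1,000s of units, $x_1$ the price in dollars, and $x_2$ the advertising expenditure in $1,000s:

```latex
\begin{aligned}
\Delta_{\text{Price}} &= b_1 + b_3 x_2 = 175 - 6.08\,x_2
  &&\Rightarrow\ 175 - 6.08(50) = -129, \quad 175 - 6.08(100) = -433 \\
\Delta_{\text{Adv}} &= b_2 + b_3 x_1 = 19.68 - 6.08\,x_1
  &&\Rightarrow\ 19.68 - 6.08(2) = 7.52, \quad 19.68 - 6.08(3) = 1.44 \\
50\,\Delta_{\text{Adv}} &= 50(7.52) = 376 \ \text{at } x_1 = 2,
  &&\phantom{\Rightarrow}\ 50(1.44) = 72 \ \text{at } x_1 = 3
\end{aligned}
```

The last line scales the per-$1,000 advertising effect by the 50-unit change from $50,000 to $100,000, which recovers the 376,000 and 72,000 unit increases computed above.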

Notes + Comments

1. Just as a dummy variable can be used to allow for different y-intercepts for the two groups represented by the dummy, we can use an interaction between a dummy variable and a quantitative independent variable to allow for different relationships between independent and dependent variables for the two groups represented by the dummy. Consider the Butler Trucking example: Travel time is the dependent variable \(y\), miles traveled and number of deliveries are the quantitative independent variables \(x_1\) and \(x_2\), and the dummy variable \(x_3\) differentiates between driving assignments that included travel on a congested segment of a highway and driving assignments that did not. If we believe that the relationship between miles traveled and travel time differs for driving assignments that included travel on a congested segment of a highway and those that did not, we could create a new variable that is the interaction between miles traveled and the dummy variable (\(x_4 = x_1 \times x_3\)) and estimate the following model:

\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_4 \]

If a driving assignment does not include travel on a congested segment of a highway, \(x_4 = x_1 \times x_3 = x_1 \times 0 = 0\) and the regression model is

\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 \]

If a driving assignment does include travel on a congested segment of a highway, \(x_4 = x_1 \times x_3 = x_1 \times 1 = x_1\) and the regression model is

\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 (1) = b_0 + (b_1 + b_3) x_1 + b_2 x_2 \]

So in this regression model \(b_1\) is the estimate of the relationship between miles traveled and travel time for driving assignments that do not include travel on a congested segment of a highway, and \(b_1 + b_3\) is the estimate of the relationship between miles traveled and travel time for driving assignments that do include travel on a congested segment of a highway.

2. Multicollinearity can be divided into two types. Data-based multicollinearity occurs when separate independent variables that are related are included in the model, whereas structural multicollinearity occurs when a new independent variable is created by taking a function of one or more existing independent variables. If we use ratings that consumers give on bread’s aroma and taste as independent variables in a model for which the dependent variable is the overall rating of the bread, the multicollinearity that would exist between the aroma and taste ratings is an example of data-based multicollinearity. If we build a quadratic model for which the independent variables are ratings that consumers give on bread’s aroma and the square of the ratings that consumers give on bread’s aroma, the multicollinearity that would exist is an example of structural multicollinearity.

3. Structural multicollinearity occurs naturally in polynomial regression models and regression models with interactions. You can greatly reduce the structural multicollinearity in a polynomial regression by centering the independent variable \(x\) (using \(x - \bar{x}\) in place of \(x\)). In a regression model with interaction, you can greatly reduce the structural multicollinearity by centering both independent variables that interact. However, quadratic regression models and regression models with interactions are frequently used only for prediction; in these instances centering independent variables is not necessary because we are not concerned with inference.

4. Note that we can combine a quadratic effect with interaction to produce a second-order polynomial model with interaction between the two independent variables. The resulting estimated regression equation is

\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1^2 + b_4 x_2^2 + b_5 x_1 x_2 \]

This model provides a great deal of flexibility in capturing nonlinear effects.

7.8 Model Fitting

Finding an effective regression model can be challenging. Although we rely on theory to guide us, often we are faced with a large number of potential independent variables from which to choose. In this section, we discuss common methods for building a regression model and the potential hazards of these approaches.

Variable Selection Procedures

When there are many independent variables to consider, special procedures are sometimes employed to select the independent variables to include in the regression model. These variable selection procedures include backward elimination, forward selection, stepwise selection, and the best subsets procedure. Given a data set with several possible independent variables, we can use these procedures to identify which independent variables provide a model that best satisfies some criterion. The first three procedures are iterative; at each step of the procedure a single independent variable is added or removed and the new model is evaluated. The process continues until a stopping criterion indicates that the procedure cannot find a superior model. The best subsets procedure is not a one-variable-at-a-time procedure; it evaluates regression models involving different subsets of the independent variables.

The backward elimination procedure begins with the regression model that includes all of the independent variables under consideration. At each step of the procedure, backward elimination considers the removal of an independent variable according to some criterion. One such criterion is to check if any independent variables currently in the model are not significant at a specified level of significance, and if so, then remove the least significant of these independent variables from the model. The regression model is then refit with the remaining independent variables and statistical significance is reexamined. The backward elimination procedure stops when all independent variables in the model are significant at a specified level of significance.

The forward selection procedure begins with none of the independent variables under consideration included in the regression model. At each step of the procedure, forward selection considers the addition of an independent variable according to some criterion. One such criterion is to check if any independent variables currently not in the model would be significant at a specified level of significance if included, and if so, then add the most significant of these independent variables to the model. The regression model is then refit with the additional independent variable and statistical significance is reexamined. The forward selection procedure stops when all of the independent variables not in the model would not be significant at a specified level of significance if included in the model.


Similar to the forward selection procedure, the stepwise procedure begins with none of the independent variables under consideration included in the regression model. The analyst establishes both a criterion for allowing independent variables to enter the model and a criterion for allowing independent variables to remain in the model. One such criterion adds the most significant variable and removes the least significant variable at each iteration. To initiate the procedure, the most significant independent variable is added to the empty model if its level of significance satisfies the entering threshold. Each subsequent step involves two intermediate steps. First, the remaining independent variables not in the current model are evaluated, and the most significant one is added to the model if its significance satisfies the threshold to enter the model. Then the independent variables in the resulting model are evaluated, and the least significant variable is removed if its level of significance fails to satisfy the threshold to remain in the model. The procedure stops when no independent variable not currently in the model has a level of significance that satisfies the entering threshold, and no independent variable currently in the model has a level of significance that fails to satisfy the threshold to remain in the model.

In the best subsets procedure, simple linear regressions for each of the independent variables under consideration are generated, and then the multiple regressions with all combinations of two independent variables under consideration are generated, and so on. Once a regression model has been generated for every possible subset of the independent variables under consideration, the entire collection of regression models can be compared and evaluated by the analyst.

Although these algorithms are potentially useful when dealing with a large number of potential independent variables, they do not necessarily provide useful models. Once the procedure terminates, you should deliberate whether the combination of independent variables included in the final regression model makes sense from a practical standpoint and consider whether you can create a more useful regression model with more meaningful interpretation through the addition or removal of independent variables. Use your own judgment and intuition about your data to refine the results of these algorithms.
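To make the best subsets idea concrete, the following Python sketch fits a least squares model to every nonempty subset of three candidate variables and ranks the results. It is an illustration under stated assumptions, not the procedure used by any particular software package: the data are made up, the function names `ols_sse` and `best_subsets` are hypothetical, and the ranking criterion is adjusted R-squared rather than the significance-based criteria described above.

```python
from itertools import combinations


def ols_sse(columns, y):
    """Fit least squares with an intercept; return the sum of squared errors."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gaussian elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(k + 1, p):
            f = A[r][k] / A[k][k]
            for c in range(k, p + 1):
                A[r][c] -= f * A[k][c]
    # Back substitution for the coefficient estimates.
    b = [0.0] * p
    for k in reversed(range(p)):
        b[k] = (A[k][p] - sum(A[k][c] * b[c] for c in range(k + 1, p))) / A[k][k]
    return sum((y[i] - sum(b[j] * X[i][j] for j in range(p))) ** 2 for i in range(n))


def best_subsets(data, y):
    """Evaluate every nonempty subset of variables; sort by adjusted R-squared."""
    n = len(y)
    y_bar = sum(y) / n
    sst = sum((v - y_bar) ** 2 for v in y)
    results = []
    names = sorted(data)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            sse = ols_sse([data[name] for name in subset], y)
            adj_r2 = 1 - (sse / (n - size - 1)) / (sst / (n - 1))
            results.append((subset, sse, adj_r2))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Hypothetical data: y depends on x1 and x2 only; x3 is unrelated.
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, 0, 1, 0, 1, 0, 1, 0],
    "x3": [3, 1, 4, 1, 5, 9, 2, 6],
}
noise = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
y = [10 + 2 * a + 5 * b + e for a, b, e in zip(data["x1"], data["x2"], noise)]

for subset, sse, adj_r2 in best_subsets(data, y):
    print(subset, round(sse, 3), round(adj_r2, 4))
```

With q candidate variables there are 2^q - 1 models to fit, so exhaustive enumeration is practical only for a modest number of candidates. The ranked list is a starting point for the analyst's judgment, not a final answer.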

Overfitting

The objective in building a regression model (or any other type of mathematical model) is to provide the simplest accurate representation of the population. A model that is relatively simple will be easy to understand, interpret, and use, and a model that accurately represents the population will yield meaningful results.

When we base a model on sample data, we must be wary. Sample data generally do not perfectly represent the population from which they are drawn; if we attempt to fit a model too closely to the sample data, we risk capturing behavior that is idiosyncratic to the sample data rather than representative of the population. When the model is too closely fit to sample data and as a result does not accurately reflect the population, the model is said to have been overfit.

Overfitting generally results from creating an overly complex model to explain idiosyncrasies in the sample data. In regression analysis, this often results from the use of complex functional forms or independent variables that do not have meaningful relationships with the dependent variable. If a model is overfit to the sample data, it will perform better on the sample data used to fit the model than it will on other data from the population. Thus, an overfit model can be misleading with regard to its predictive capability and its interpretation.

The stepwise procedure requires that the criterion for an independent variable to enter the regression model is more difficult to satisfy than the criterion for an independent variable to be removed from the regression model. This requirement prevents the same independent variable from exiting and then reentering the regression model in the same step.

The principle of using the simplest meaningful model possible without sacrificing accuracy is referred to as Ockham’s razor, the law of parsimony, or the law of economy.

Overfitting is a difficult problem to detect and avoid, but there are strategies that can help mitigate this problem. Use only independent variables that you expect to have real and meaningful relationships with the dependent variable. Use complex models, such as quadratic models and piecewise linear regression models, only when you have a reasonable expectation that such complexity provides a more accurate depiction of what you are modeling. Do not let software dictate your model. Use iterative modeling procedures, such as the stepwise and best-subsets procedures, only for guidance and not to generate your final model. Use your own judgment and intuition about your data and what you are modeling to refine your model. If you have access to a sufficient quantity of data, assess your model on data other than the sample data that were used to generate the model (this is referred to as cross-validation). The following list contains three possible ways to execute cross-validation.

Holdout method: The sample data are randomly divided into mutually exclusive and collectively exhaustive training and validation sets. The training set is the data set used to build the candidate models that appear to make practical sense. The validation set is the set of data used to compare model performances and ultimately select a model for predicting values of the dependent variable. For example, we might randomly select half of the data for use in developing regression models. We could use these data as our training set to estimate a model or a collection of models that appear to perform well. Then we use the remaining half of the data as a validation set to assess and compare the models’ performances and ultimately select the model that minimizes some measure of overall error when applied to the validation set. The advantages of the holdout method are that it is simple and quick. However, results of a holdout sample can vary greatly depending on which observations are randomly selected for the training set, the number of observations in the sample, and the number of observations that are randomly selected for the training and validation sets.

k-fold cross-validation: The sample data are randomly divided into k equal-sized, mutually exclusive, and collectively exhaustive subsets called folds, and k iterations are executed. For each iteration, a different subset is designated as the validation set and the remaining k – 1 subsets are combined and designated as the training set. The model is estimated using the respective training set data and evaluated using the respective validation set. The results of the k iterations are then combined and evaluated. A common choice for the number of folds is k = 10. The k-fold cross-validation method is more complex and time consuming than the holdout method, but the results of the k-fold cross-validation method are less sensitive to how the observations are randomly assigned to the training and validation sets.

Leave-one-out cross-validation: For a sample of n observations, an iteration consists of estimating the model on n – 1 observations and evaluating the model on the single observation that was omitted from the training data. This procedure is repeated for n total iterations so that the model is trained on each possible combination of n – 1 observations and evaluated on the single remaining observation in each case.
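The k-fold procedure, and leave-one-out cross-validation as the special case k = n, can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the data are hypothetical, the model is a simple linear regression with one independent variable, and the helper names `fit_line` and `k_fold_sse` are assumptions introduced here.

```python
import random


def fit_line(xs, ys):
    """Least squares intercept and slope for simple linear regression."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def k_fold_sse(xs, ys, k, seed=0):
    """Return the total out-of-sample squared error over k folds, plus the folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    # Every index lands in exactly one fold: mutually exclusive, collectively exhaustive.
    folds = [indices[i::k] for i in range(k)]
    total = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in fold)
    return total, folds


# Hypothetical sample of 20 observations.
xs = list(range(1, 21))
ys = [2 + 0.5 * x + (0.3 if x % 2 == 0 else -0.3) for x in xs]

five_fold_error, _ = k_fold_sse(xs, ys, k=5)
leave_one_out_error, _ = k_fold_sse(xs, ys, k=len(xs))
print(five_fold_error, leave_one_out_error)
```

Candidate models would each be passed through the same folds, and the model with the smallest combined out-of-sample error would be preferred.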

7.9 Big Data and Regression

Inference and Very Large Samples

Consider the example of a credit card company that has a very large database of information provided by its customers when they apply for credit cards. These customer records include information on the customer’s annual household income, number of years of post-high school education, and number of members of the customer’s household. In a second database, the company has records of the credit card charges accrued by each customer over the past year. Because the company is interested in using annual household income, the number of years of post-high school education, and the number of members of the household reported by new applicants to predict the credit card charges that will be accrued by these applicants, a data analyst links these two databases to create one data set containing all relevant information for a sample of 5,000 customers. The file LargeCredit contains these data, split into a training set of 3,000 observations and a validation set of 2,000 observations.

The company has decided to apply multiple regression to these data to develop a model for predicting annual credit card charges for its new applicants. The dependent variable in the model is credit card charges accrued by a customer in the data set over the past year (\(y\)); the independent variables are the customer’s annual household income (\(x_1\)), number of members of the household (\(x_2\)), and number of years of post-high school education (\(x_3\)). Figure 7.36 provides Excel output for the multiple regression model based on the 3,000 observations in the training set.

The model has a coefficient of determination of 0.3632 (see cell B5 in Figure 7.36), indicating that this model explains approximately 36% of the variation in credit card charges accrued by the customers in the sample over the past year. The p value for each test of the individual regression parameters is also very small (see cells E18 through E20), indicating that for each independent variable we can reject the hypothesis of no relationship with the dependent variable. The estimated slopes associated with the independent variables are all highly significant. The model estimates the following:

• For a fixed number of household members and number of years of post-high school education, accrued credit card charges increase by $121.34 when a customer’s annual household income increases by $1,000. This is shown in cell B18 of Figure 7.36.

• For a fixed annual household income and number of years of post-high school education, accrued credit card charges increase by $528.10 when a customer’s household increases by one member. This is shown in cell B19 of Figure 7.36.

• For a fixed annual household income and number of household members, accrued credit card charges decrease by $535.36 when a customer’s number of years of post-high school education increases by one year. This is shown in cell B20 of Figure 7.36.

Because the y-intercept is an obvious result of extrapolation (no customer in the data has values of zero for annual household income, number of household members, and number of years of post-high school education), the estimated regression parameter \(b_0\) is meaningless.

Figure 7.36 Excel Regression Output for Credit Card Company Example

The small p values associated with a model that is fit on an extremely large sample do not imply that an extremely large sample solves all problems. Virtually all relationships between independent variables and the dependent variable will be statistically significant if the sample size is sufficiently large. That is, if the sample size is very large, there will be little difference in the \(b_j\) values generated by different random samples. Because we address the variability in potential values of our estimators through the use of statistical inference, and variability of our estimates \(b_j\) essentially disappears as the sample size grows very large, inference is of little use for estimates generated from very large samples. Thus, we generally are not concerned with the conditions a regression model must satisfy in order for inference to be reliable when we use a very large sample. Multicollinearity, on the other hand, can result in confusing or misleading regression parameters \(b_1, b_2, \ldots, b_q\) and so is still a concern when we use a large data set to estimate a regression model that is to be used for explanatory purposes.

How much does sample size matter? Table 7.4 provides the regression parameter estimates and the corresponding p values for multiple regression models estimated on the first 50 observations, the second 50 observations, and so on for the LargeCredit data. Note that, even though the means of the parameter estimates for the regressions based on 50 observations are similar to the parameter estimates based on the full sample of 5,000 observations, the individual values of the estimated regression parameters in the regressions based on 50 observations show a great deal of variation. In these 10 regressions, the estimated values of \(b_0\) range from -2,191.590 to 8,994.040, the estimated values of \(b_1\) range from 73.207 to 155.187, the estimated values of \(b_2\) range from -489.932 to 1,267.041, and the estimated values of \(b_3\) range from -974.791 to 207.828. This is reflected in the p values corresponding to the parameter estimates in the regressions based on 50 observations, which are substantially larger than the corresponding p values in the regression based on 3,000 observations. These results underscore the impact that a very large sample size can have on inference.
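The spread of the small-sample estimates can be summarized directly from the Table 7.4 values. The short Python sketch below is only an illustration; the dictionary layout and variable names are assumptions, and negative signs are entered as implied by the ranges and column means quoted in the text.

```python
from statistics import mean, stdev

# Parameter estimates from the 10 regressions in Table 7.4 (50 observations each).
estimates = {
    "b0": [-805.152, 894.407, -2191.590, 2294.023, 8994.040,
           7265.471, 2147.906, -504.532, 1587.067, -315.945],
    "b1": [154.488, 125.343, 155.187, 114.734, 103.378,
           73.207, 117.500, 118.926, 81.532, 148.860],
    "b2": [234.664, 822.675, 674.961, 297.011, -489.932,
           -77.874, 390.447, 798.499, 1267.041, 1000.243],
    "b3": [207.828, -355.585, -25.309, -537.063, -375.601,
           -405.195, -374.799, 45.259, -891.118, -974.791],
}

for name, values in estimates.items():
    print(f"{name}: mean {mean(values):.3f}, min {min(values):.3f}, "
          f"max {max(values):.3f}, std dev {stdev(values):.3f}")
```

The standard deviation across the 10 subsamples gives a rough sense of how much an estimate based on only 50 observations can move from one sample to the next.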

The phenomenon by which the value of an estimate generally becomes closer to the value of the parameter being estimated as the sample size grows is called the Law of Large Numbers.

Table 7.4 Regression Parameter Estimates and the Corresponding p Values for 10 Multiple Regression Models, Each Estimated on 50 Observations from the LargeCredit Data

Observations   b0           p value   b1        p value    b2          p value   b3         p value
1–50           -805.152     0.7814    154.488   1.45E-06   234.664     0.5489    207.828    0.6721
51–100         894.407      0.6796    125.343   2.23E-07   822.675     0.0070    -355.585   0.3553
101–150        -2,191.590   0.4869    155.187   3.56E-07   674.961     0.0501    -25.309    0.9560
151–200        2,294.023    0.3445    114.734   1.26E-04   297.011     0.3700    -537.063   0.2205
201–250        8,994.040    0.0289    103.378   6.89E-04   -489.932    0.2270    -375.601   0.5261
251–300        7,265.471    0.0234    73.207    1.02E-02   -77.874     0.8409    -405.195   0.4060
301–350        2,147.906    0.5236    117.500   1.88E-04   390.447     0.3053    -374.799   0.4696
351–400        -504.532     0.8380    118.926   8.54E-07   798.499     0.0112    45.259     0.9209
401–450        1,587.067    0.5123    81.532    5.06E-04   1,267.041   0.0004    -891.118   0.0359
451–500        -315.945     0.9048    148.860   1.07E-05   1,000.243   0.0053    -974.791   0.0420
Mean           1,936.567              119.316              491.773               -368.637

For another example, suppose the credit card company also has a separate database of information on shopping and lifestyle characteristics that it has collected from its customers during a recent Internet survey. The data analyst notes in the results in Figure 7.36 that the original regression model fails to explain almost 65% of the variation in credit card charges accrued by the customers in the data set. In an attempt to increase the variation in the dependent variable explained by the model, the data analyst decides to augment the original regression with a new independent variable, number of hours per week spent watching television (which we will designate as \(x_4\)). The analyst runs the new multiple regression and achieves the results shown in Figure 7.37.

The new model has a coefficient of determination of 0.3645 (see cell B5 in Figure 7.37), indicating the addition of number of hours per week spent watching television increased the explained variation in sample values of accrued credit card charges by less than 1%. The estimated regression parameters and associated p values for annual household income, number of household members, and number of years of post-high school education changed little after introducing into the model the number of hours per week spent watching television.

The estimated regression parameter for number of hours per week spent watching television is 12.55 (see cell B21 in Figure 7.37), suggesting that a 1-hour increase coincides with an increase of $12.55 in credit card charges accrued by each customer over the past year. The p value associated with this estimate is 0.014 (see cell E21 in Figure 7.37), so, at a 5% level of significance, we can reject the hypothesis that there is no relationship between the number of hours per week spent watching television and credit card charges accrued. However, when the model is based on a very large sample, almost all relationships will be significant whether they are real or not, and statistical significance does not necessarily imply that a relationship is meaningful or useful.

Is it reasonable to expect that the credit card charges accrued by a customer are related to the number of hours per week the consumer watches television? If not, the model that includes number of hours per week the consumer watches television as an independent variable may provide inaccurate or unreliable predictions of the credit card charges that will be accrued by new customers, even though we have found a significant relationship between these two variables. If the model is to be used to predict future amounts of credit charges, then the usefulness of including the number of hours per week the consumer watches television is best evaluated by measuring the accuracy of predictions for observations not included in the sample data used to construct the model. We demonstrate this procedure in the next subsection.

The use of out-of-sample data is common in data mining applications and is covered in detail in Chapter 9.

Figure 7.37 Excel Regression Output for Credit Card Company Example after Adding Number of Hours per Week Spent Watching Television

Model Selection

As we discussed in Section 7.8, various methods for identifying which independent variables to include in a regression model consider the p values of these variables in iterative procedures that sequentially add and/or remove variables. However, when dealing with a sufficiently large sample, the p value of virtually every independent variable will be small, and so variable selection procedures may suggest models with most or all the variables. Therefore, when dealing with large samples, it is often more difficult to discern the most appropriate model.

If developing a regression model for explanatory purposes, the practical significance of the estimated regression coefficients should be considered when interpreting the model and considering which variables to keep in the model. If developing a regression model to make future predictions, the selection of the independent variables to include in the regression model should be based on the predictive accuracy on observations that have not been used to train the model.

Let us revisit the example of a credit card company with a data set of customer records containing information on the customer’s annual household income, number of years of post-high school education, number of members of the customer’s household, and the credit card charges accrued. The file LargeCredit contains these data, split into a training set of 3,000 observations and a validation set of 2,000 observations.

To predict annual credit card charges for its new applicants, the company is considering two models:

• Model A: The dependent variable is credit card charges accrued by a customer in the data set over the past year (\(y\)), and the independent variables are the customer’s annual household income (\(x_1\)), number of household members (\(x_2\)), and number of years of post-high school education (\(x_3\)). Figure 7.36 summarizes Model A estimated using the 3,000 observations of the training set.

• Model B: The dependent variable is credit card charges accrued by a customer in the data set over the past year (\(y\)), and the independent variables are the customer’s annual household income (\(x_1\)), number of household members (\(x_2\)), number of years of post-high school education (\(x_3\)), and number of hours per week spent watching television (\(x_4\)). Figure 7.37 summarizes Model B estimated using the 3,000 observations of the training set.

Now, we would like to compare these models based on their predictive accuracy on the 2,000 observations in the validation set. For the first observation in the validation set (account number 18572870), Model A predicts annual charges of

\[ \hat{y}_1^{A} = 2119.60 + 121.33(50.2) + 528.10(5) - 535.36(1) = \$10{,}315.93 \]

Alternatively, Model B predicts annual charges of

\[ \hat{y}_1^{B} = 1712.55 + 121.61(50.2) + 531.21(5) - 539.89(1) + 12.55(4) = \$9{,}983.92 \]

Account number 18572870 has actual annual charges of $5,472.51, so Model A’s prediction of the first observation has a squared error of \((5{,}472.51 - 10{,}315.93)^2 = 23{,}458{,}721\) and Model B’s prediction of the first observation has a squared error of \((5{,}472.51 - 9{,}983.92)^2 = 20{,}352{,}797\). Repeating these predictions and error calculations for each of the 2,000 observations in the validation set, Figure 7.38 shows that the sum of squared errors for Model A is 47,392,009,111 and the sum of squared errors for Model B is 47,409,404,281. Therefore, Model A’s predictions are slightly more accurate than Model B’s predictions on the validation set, as measured by squared error. Although the p value of Hours Per Week Watching Television in Model B is relatively small, these results suggest that it does not improve the accuracy of predictions.
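The validation comparison amounts to applying each fitted equation to held-out observations and summing the squared errors. The following Python sketch shows the calculation for the first validation observation. The function names are illustrative assumptions, the coefficients are the rounded values reported in the text (with the education coefficient taken as 535.36, as in the interpretation of Figure 7.36), and the results therefore match the printed predictions only up to rounding.

```python
def model_a(income, household, education):
    """Model A: income in $1,000s, household members, years of post-high school education."""
    return 2119.60 + 121.33 * income + 528.10 * household - 535.36 * education


def model_b(income, household, education, tv_hours):
    """Model B adds hours per week spent watching television."""
    return (1712.55 + 121.61 * income + 531.21 * household
            - 539.89 * education + 12.55 * tv_hours)


def sum_squared_errors(actuals, predictions):
    """Sum of squared prediction errors over a set of validation observations."""
    return sum((a - p) ** 2 for a, p in zip(actuals, predictions))


# First observation in the validation set (account number 18572870).
actual = 5472.51
pred_a = model_a(50.2, 5, 1)
pred_b = model_b(50.2, 5, 1, 4)

print(round(pred_a, 2), round(sum_squared_errors([actual], [pred_a])))
print(round(pred_b, 2), round(sum_squared_errors([actual], [pred_b])))
```

For this single account Model B happens to be closer, which illustrates why the comparison should rest on the total over all 2,000 validation observations rather than on any one prediction.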


7.10 Prediction with Regression

To illustrate how a regression model can be used to make predictions about new observations and support decision making, let us again consider the Butler Trucking Company. Recall from Section 7.4 that the multiple regression equation based on the 300 past routes using Miles (\(x_1\)) and Deliveries (\(x_2\)) as the independent variables to estimate travel time (\(y\)) for a driving assignment is

\[ \hat{y} = 0.1273 + 0.0672 x_1 + 0.6900 x_2 \tag{7.22} \]

As described by the first three columns of Table 7.5, Butler has 10 new observations corresponding to upcoming routes for which they have estimated the miles to be driven and number of deliveries. The point estimates for the travel time for each of these 10 upcoming routes can be obtained by substituting the miles driven and number of deliveries into equation (7.22). For example, the predicted travel time for Assignment 301 is

\[ \hat{y}_{301} = 0.1273 + 0.0672(105) + 0.6900(3) = 9.25 \]

In addition to the point estimate, there are two types of interval estimates associated with the regression equation. A confidence interval is an interval estimate of the mean y value given values of the independent variables. A prediction interval is an interval estimate of an individual y value given values of the independent variables.

The general form for the confidence interval on the mean y value given values of \(x_1, x_2, \ldots, x_q\) is

\[ \hat{y} \pm t_{\alpha/2}\, s_{\hat{y}} \tag{7.23} \]

Figure 7.38 Predictive Accuracy on LargeCredit Validation Set

where \(\hat{y}\) is the point estimate of the mean y value provided by the regression equation, \(s_{\hat{y}}\) is the estimated standard deviation of \(\hat{y}\), and \(t_{\alpha/2}\) is a multiplier term based on the sample size and specified \(100(1 - \alpha)\%\) confidence level of the interval. More specifically, \(t_{\alpha/2}\) is the t value that provides an area of \(\alpha/2\) in the upper tail of a t distribution with \(n - q - 1\) degrees of freedom. In general, the calculation of the confidence interval in equation (7.23) uses matrix algebra and requires the use of specialized statistical software.

The prediction interval on the individual y value given values of \(x_1, x_2, \ldots, x_q\) is

\[ \hat{y} \pm t_{\alpha/2} \sqrt{ s_{\hat{y}}^{2} + \frac{SSE}{n - q - 1} } \tag{7.24} \]

where \(\hat{y}\) is the point estimate of the individual y value provided by the regression equation, \(s_{\hat{y}}^{2}\) is the estimated variance of \(\hat{y}\), and \(t_{\alpha/2}\) is the t value that provides an area of \(\alpha/2\) in the upper tail of a t distribution with \(n - q - 1\) degrees of freedom. In the term \(SSE/(n - q - 1)\), n is the number of observations in the sample, q is the number of independent variables in the regression model, and SSE is the sum of squares due to error as defined by equation (7.5). In general, the calculation of the prediction interval in equation (7.24) uses matrix algebra and requires the use of specialized statistical software.

In the Butler Trucking problem, the 95% confidence interval is an interval estimate of the mean travel time for a route assignment with the given values of Miles and Deliveries. This is the appropriate interval estimate if we are interested in estimating the mean travel time for all route assignments with specified mileage and number of deliveries. This confi- dence interval estimates the variability in the mean travel time.

In the Butler Trucking problem, the 95% prediction interval is an interval estimate on the prediction of travel time for an individual route assignment with the given values of Miles and Deliveries. This is the appropriate interval estimate if we are interested in pre- dicting the travel time for an individual route assignment with the specified mileage and number of deliveries. This prediction interval estimates the variability inherent in a single route's travel time.

To illustrate, consider the first observation (Assignment 301) in Table 7.5 with 105 miles and 3 deliveries. For all 105-mile routes with 3 deliveries, a 95% confidence interval on the mean travel time is 9.25 ± 0.193. That is, we are 95% confident that the true population mean travel time for 105-mile routes with 3 deliveries is between 9.06 and 9.44 hours.

Now suppose Butler Trucking is interested in predicting the travel time for a specific upcoming route assignment covering 105 miles and 3 deliveries. The best prediction for this route's travel time is still 9.25 hours, as provided by the regression equation. However, a 95% prediction interval for this travel time prediction is 9.25 ± 1.645. That is, we are 95% confident that the travel time for a single 105-mile route with 3 deliveries will be between 7.61 and 10.90 hours.

Note that the 95% prediction interval for the travel time of a single route assignment with 105 miles and 3 deliveries is wider than the 95% confidence interval for the mean travel time of all route assignments with 105 miles and 3 deliveries. The difference reflects the fact that we are able to estimate the mean \(y\) value of a group of observations with the same specified values of the independent variables more precisely than we can predict an individual \(y\) value of a single observation with specified values of the independent variables. Comparing equation (7.23) to equation (7.24), we observe the reason for the difference in the width of the confidence interval and the prediction interval. Like the confidence interval calculation, the prediction interval calculation includes an \(s_{\hat{y}}\) term to account for the variability in estimating the mean value of \(y\), but it also includes an additional term, \(SSE/(n - q - 1)\), which accounts for the variability of individual values of \(y\) about their mean value.
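The comparison can be checked numerically. If the confidence half-width is \(t_{\alpha/2}\, s_{\hat{y}}\) and the prediction half-width is \(t_{\alpha/2}\sqrt{s_{\hat{y}}^{2} + SSE/(n - q - 1)}\), then the difference of their squares equals \(t_{\alpha/2}^{2}\, SSE/(n - q - 1)\), which does not depend on the values of the independent variables. The following Python sketch computes that difference for each route using the rounded half-widths reported in Table 7.5. The variable and function names are illustrative only, and because the table values are rounded, the results can be expected to agree only approximately.

```python
# Half-widths copied from Table 7.5:
# assignment -> (95% CI half-width, 95% PI half-width)
half_widths = {
    301: (0.193, 1.645),
    302: (0.112, 1.637),
    303: (0.173, 1.642),
    304: (0.225, 1.649),
    305: (0.177, 1.643),
    306: (0.108, 1.637),
    307: (0.103, 1.637),
    308: (0.124, 1.638),
    309: (0.175, 1.643),
    310: (0.154, 1.641),
}


def squared_gap(ci_half_width, pi_half_width):
    """Return PI half-width squared minus CI half-width squared.

    Under the interval formulas described in the text, this quantity
    equals t^2 * SSE / (n - q - 1) for every observation.
    """
    return pi_half_width ** 2 - ci_half_width ** 2


gaps = {a: squared_gap(ci, pi) for a, (ci, pi) in half_widths.items()}

for assignment, gap in gaps.items():
    print(assignment, round(gap, 3))
```

Every route gives a value of roughly 2.67, even though the individual half-widths differ from route to route. This is consistent with the extra term in the prediction interval being the same for all observations.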

Finally, we point out that the widths of the prediction (and confidence) intervals for the regression point estimate are not the same for each observation. Instead, the width of the interval depends on the corresponding values of the independent variables. Confidence intervals and prediction intervals are the narrowest when the values of the independent variables, \(x_1, x_2, \ldots, x_q\), are closest to their respective means, \(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_q\). For the 300 observations on which the regression equation model is based, the mean miles driven for a route assignment is 70.7 miles and the mean number of deliveries for a route assignment is 3.5. Assignment 307 has the mileage (65) and number of deliveries (4) that are closest to these means and correspondingly has the narrowest confidence and prediction intervals. Conversely, Assignment 304 has the widest confidence and prediction intervals because it has the mileage (100) and number of deliveries (1) that are the farthest from the mean mileage and mean number of deliveries in the data.

Software such as Analytic Solver, JMP, and R can be used to compute the confidence intervals and prediction intervals on regression output.
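As a quick check of the observation about interval widths, the following Python sketch looks up which routes in Table 7.5 have the narrowest and widest 95% confidence intervals. The dictionary layout and variable names are assumptions made for illustration; the numbers are copied from Table 7.5.

```python
# assignment -> (miles, deliveries, 95% CI half-width), copied from Table 7.5
routes = {
    301: (105, 3, 0.193),
    302: (60, 4, 0.112),
    303: (95, 5, 0.173),
    304: (100, 1, 0.225),
    305: (40, 3, 0.177),
    306: (80, 3, 0.108),
    307: (65, 4, 0.103),
    308: (55, 3, 0.124),
    309: (95, 2, 0.175),
    310: (95, 3, 0.154),
}

# Compare routes by their confidence interval half-width (third field).
narrowest = min(routes, key=lambda a: routes[a][2])
widest = max(routes, key=lambda a: routes[a][2])

print("narrowest:", narrowest, routes[narrowest])
print("widest:", widest, routes[widest])
```

The lookup selects Assignment 307 as the narrowest and Assignment 304 as the widest, matching the discussion in the text.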

Summary

In this chapter we showed how regression analysis can be used to determine how a dependent variable \(y\) is related to an independent variable \(x\). In simple linear regression, the regression model is \(y = \beta_0 + \beta_1 x_1 + \varepsilon\). We use sample data and the least squares method to develop the estimated regression equation \(\hat{y} = b_0 + b_1 x_1\). In effect, \(b_0\) and \(b_1\) are the sample statistics used to estimate the unknown model parameters.
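To make the least squares step concrete, here is a minimal Python sketch for simple linear regression. It uses the standard closed-form expressions \(b_1 = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2\) and \(b_0 = \bar{y} - b_1 \bar{x}\). The function name and the small data set are hypothetical, invented only for illustration, and are not taken from the chapter's examples.

```python
def least_squares_line(x, y):
    """Return (b0, b1) for the estimated equation y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-products and sum of squares about the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical toy data, invented only for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = least_squares_line(x, y)
print(f"y-hat = {b0:.2f} + {b1:.2f} x")  # prints: y-hat = 2.20 + 0.60 x
```

In practice, Excel or a statistical package performs this calculation; the sketch is only meant to show what the software computes in the one-variable case.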

The coefficient of determination \(r^2\) was presented as a measure of the goodness of fit for the estimated regression equation; it can be interpreted as the proportion of the variation in the sample values of the dependent variable \(y\) that can be explained by the estimated regression equation. We then extended our discussion to include multiple independent variables and reviewed how to use Excel to find the estimated multiple regression equation \(\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_q x_q\), and we considered the interpretations of the parameter estimates in multiple regression and the ramifications of multicollinearity.
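As a small illustration of the coefficient of determination, the following Python sketch computes \(r^2 = 1 - SSE/SST\) from observed and fitted values. The data are hypothetical and invented only for illustration; the fitted values are those of the least squares line \(\hat{y} = 2.2 + 0.6x\) for \(x = 1, 2, \ldots, 5\). The function name is likewise an assumption.

```python
def r_squared(y, y_hat):
    """Return 1 - SSE/SST for observed values y and fitted values y_hat."""
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    return 1 - sse / sst


# Hypothetical observed values and their least squares fitted values
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(y, y_hat)
print(round(r2, 3))  # prints: 0.6
```

Here 60% of the variation in the hypothetical sample values of \(y\) is explained by the fitted line.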

The assumptions related to the regression model and its associated error term \(\varepsilon\) were discussed. We reviewed the \(t\) test for determining whether there is a statistically significant relationship between the dependent variable and an individual independent variable given the other independent variables in the regression model. We also showed how to use Excel to develop confidence interval estimates of the regression parameters \(\beta_0, \beta_1, \ldots, \beta_q\).

We showed how to incorporate categorical independent variables into a regression model through the use of dummy variables, and we discussed a variety of ways to use multiple regression to fit nonlinear relationships between independent variables and the dependent variable. We discussed various automated procedures for selecting independent variables to include in a regression model and the problem of overfitting a regression model.
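The dummy variable idea can be sketched in a few lines of Python. The example below encodes a two-level categorical variable, repair type (mechanical or electrical, as in Problem 13), as a 0/1 variable. The records, the function name, and the choice of "mechanical" as the category coded 0 are all assumptions made for illustration.

```python
def dummy(value, reference):
    """Return 0 for the reference category and 1 for the other category."""
    return 0 if value == reference else 1


# Hypothetical service calls: (months since last service, repair type)
calls = [(2, "electrical"), (6, "mechanical"), (8, "electrical")]

# Replace the category label with a 0/1 dummy variable (mechanical = 0)
rows = [(months, dummy(kind, "mechanical")) for months, kind in calls]
print(rows)  # prints: [(2, 1), (6, 0), (8, 1)]
```

The resulting 0/1 column can then be used as an ordinary independent variable in the regression model.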

We discussed the implications of big data on regression analysis. Specifically, we considered the impact of very large samples on regression inference and demonstrated the use of holdout data to evaluate candidate regression models. We concluded by presenting the concepts of confidence intervals and prediction intervals related to point estimates produced by the regression model.

Table 7.5 Predicted Values and 95% Confidence Intervals and Prediction Intervals for 10 New Butler Trucking Routes

| Assignment | Miles | Deliveries | Predicted Value | 95% CI Half-Width (±) | 95% PI Half-Width (±) |
|---|---|---|---|---|---|
| 301 | 105 | 3 | 9.25 | 0.193 | 1.645 |
| 302 | 60 | 4 | 6.92 | 0.112 | 1.637 |
| 303 | 95 | 5 | 9.96 | 0.173 | 1.642 |
| 304 | 100 | 1 | 7.54 | 0.225 | 1.649 |
| 305 | 40 | 3 | 4.88 | 0.177 | 1.643 |
| 306 | 80 | 3 | 7.57 | 0.108 | 1.637 |
| 307 | 65 | 4 | 7.25 | 0.103 | 1.637 |
| 308 | 55 | 3 | 5.89 | 0.124 | 1.638 |
| 309 | 95 | 2 | 7.89 | 0.175 | 1.643 |
| 310 | 95 | 3 | 8.58 | 0.154 | 1.641 |

Glossary

Backward elimination: An iterative variable selection procedure that starts with a model with all independent variables and considers removing an independent variable at each step.

Best subsets: A variable selection procedure that constructs and compares all possible models with up to a specified number of independent variables.

Coefficient of determination: A measure of the goodness of fit of the estimated regression equation. It can be interpreted as the proportion of the variability in the dependent variable \(y\) that is explained by the estimated regression equation.

Confidence interval: An estimate of a population parameter that provides an interval believed to contain the value of the parameter at some level of confidence.

Confidence level: An indication of how frequently interval estimates based on samples of the same size taken from the same population using identical sampling techniques will contain the true value of the parameter we are estimating.

Cross-validation: Assessment of the performance of a model on data other than the data that were used to generate the model.

Dependent variable: The variable that is being predicted or explained. It is denoted by \(y\) and is often referred to as the response.

Dummy variable: A variable used to model the effect of categorical independent variables in a regression model; generally takes only the value zero or one.

Estimated regression: The estimate of the regression equation developed from sample data by using the least squares method. The estimated multiple linear regression equation is \(\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_q x_q\).

Experimental region: The range of values for the independent variables \(x_1, x_2, \ldots, x_q\) for the data that are used to estimate the regression model.

Extrapolation: Prediction of the mean value of the dependent variable \(y\) for values of the independent variables \(x_1, x_2, \ldots, x_q\) that are outside the experimental range.

Forward selection: An iterative variable selection procedure that starts with a model with no variables and considers adding an independent variable at each step.

Holdout method: Method of cross-validation in which sample data are randomly divided into mutually exclusive and collectively exhaustive sets, then one set is used to build the candidate models and the other set is used to compare model performances and ultimately select a model.

Hypothesis testing: The process of making a conjecture about the value of a population parameter, collecting sample data that can be used to assess this conjecture, measuring the strength of the evidence against the conjecture that is provided by the sample, and using these results to draw a conclusion about the conjecture.

Independent variable(s): The variable(s) used for predicting or explaining values of the dependent variable. It is denoted by \(x\) and is often referred to as the predictor variable.

Interaction: Regression modeling technique used when the relationship between the dependent variable and one independent variable is different at different values of a second independent variable.

Interval estimation: The use of sample data to calculate a range of values that is believed to include the unknown value of a population parameter.

k-fold cross-validation: Method of cross-validation in which the sample data are randomly divided into \(k\) equal-sized, mutually exclusive and collectively exhaustive subsets. In each of \(k\) iterations, one of the \(k\) subsets is used to evaluate a candidate model that was constructed on the data from the other \(k - 1\) subsets.

Knot: The prespecified value of the independent variable at which its relationship with the dependent variable changes in a piecewise linear regression model; also called the breakpoint or the joint.

Least squares method: A procedure for using sample data to find the estimated regression equation.

Leave-one-out cross-validation: Method of cross-validation in which candidate models are repeatedly fit using \(n - 1\) observations and evaluated with the remaining observation.

Linear regression: Regression analysis in which relationships between the independent variables and the dependent variable are approximated by a straight line.

Multicollinearity: The degree of correlation among independent variables in a regression model.

Multiple linear regression: Regression analysis involving one dependent variable and more than one independent variable.

Overfitting: Fitting a model too closely to sample data, resulting in a model that does not accurately reflect the population.

p value: The probability that a random sample of the same size collected from the same population using the same procedure will yield stronger evidence against a hypothesis than the evidence in the sample data given that the hypothesis is actually true.

Parameter: A measurable factor that defines a characteristic of a population, process, or system.

Piecewise linear regression model: Regression model in which one linear relationship between the independent and dependent variables is fit for values of the independent variable below a prespecified value of the independent variable, a different linear relationship between the independent and dependent variables is fit for values of the independent variable above the prespecified value of the independent variable, and the two regressions have the same estimated value of the dependent variable (i.e., are joined) at the prespecified value of the independent variable.

Prediction interval: An interval estimate of the prediction of an individual \(y\) value given values of the independent variables.

Point estimator: A single value used as an estimate of the corresponding population parameter.

Quadratic regression model: Regression model in which a nonlinear relationship between the independent and dependent variables is fit by including the independent variable and the square of the independent variable in the model: \(\hat{y} = b_0 + b_1 x_1 + b_2 x_1^2\); also referred to as a second-order polynomial model.

Random variable: A quantity whose values are not known with certainty.

Regression analysis: A statistical procedure used to develop an equation showing how the variables are related.

Regression model: The equation that describes how the dependent variable \(y\) is related to an independent variable \(x\) and an error term; the multiple linear regression model is \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q + \varepsilon\).

Residual: The difference between the observed value of the dependent variable and the value predicted using the estimated regression equation; for the \(i\)th observation, the \(i\)th residual is \(y_i - \hat{y}_i\).

Simple linear regression: Regression analysis involving one dependent variable and one independent variable.

Statistical inference: The process of making estimates and drawing conclusions about one or more characteristics of a population (the value of one or more parameters) through analysis of sample data drawn from the population.

Stepwise selection: An iterative variable selection procedure that considers adding an independent variable and removing an independent variable at each step.

t test: Statistical test based on the Student’s \(t\) probability distribution that can be used to test the hypothesis that a regression parameter \(\beta_j\) is zero; if this hypothesis is rejected, we conclude that there is a regression relationship between the \(j\)th independent variable and the dependent variable.

Training set: The data set used to build the candidate models.

Validation set: The data set used to compare model forecasts and ultimately pick a model for predicting values of the dependent variable.
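The k-fold cross-validation entry describes a splitting procedure that can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name, the fixed random seed, and the use of observation indices rather than data rows are all choices made here. When the number of observations is not a multiple of \(k\), the subsets in this sketch differ in size by at most one.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # random assignment to folds
    folds = [indices[i::k] for i in range(k)]  # k mutually exclusive subsets
    for i in range(k):
        validation = folds[i]
        # Training data are the observations in the other k - 1 subsets
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(n=10, k=5))
for training, validation in splits:
    print(sorted(validation), "held out;", len(training), "used for fitting")
```

Setting `k` equal to `n` produces one held-out observation per iteration, which corresponds to leave-one-out cross-validation.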

Problems

1. Bicycling World, a magazine devoted to cycling, reviews hundreds of bicycles throughout the year. Its Road-Race category contains reviews of bicycles used by riders primarily interested in racing. One of the most important factors in selecting a bicycle for racing is its weight. The following data show the weight (pounds) and price ($) for 10 racing bicycles reviewed by the magazine:

| Model | Weight (lb) | Price ($) |
|---|---|---|
| Fierro 7B | 17.9 | 2,200 |
| HX 5000 | 16.2 | 6,350 |
| Durbin Ultralight | 15.0 | 8,470 |
| Schmidt | 16.0 | 6,300 |
| WSilton Advanced | 17.3 | 4,100 |
| bicyclette vélo | 13.2 | 8,700 |
| Supremo Team | 16.3 | 6,100 |
| XTC Racer | 17.2 | 2,680 |
| D’Onofrio Pro | 17.7 | 3,500 |
| Americana #6 | 14.2 | 8,100 |

Data file: BicyclingWorld

a. Develop a scatter chart with weight as the independent variable. What does the scatter chart indicate about the relationship between the weight and price of these bicycles?

b. Use the data to develop an estimated regression equation that could be used to estimate the price for a bicycle, given its weight. What is the estimated regression model?

c. Test whether each of the regression parameters \(\beta_0\) and \(\beta_1\) is equal to zero at a 0.05 level of significance. What are the correct interpretations of the estimated regression parameters? Are these interpretations reasonable?

d. How much of the variation in the prices of the bicycles in the sample does the regression model you estimated in part (b) explain?

e. The manufacturers of the D’Onofrio Pro plan to introduce the 15-lb D’Onofrio Elite bicycle later this year. Use the regression model you estimated in part (b) to predict the price of the D’Onofrio Elite.

f. The owner of Michele's Bikes of Nesika Beach, Oregon, is trying to decide in advance whether to make room for the D’Onofrio Elite bicycle in its inventory. She is convinced that she will not be able to sell the D’Onofrio Elite for more than $7,000, and so she will not make room in her inventory for the bicycle unless its estimated price is less than $7,000. Under this condition and using the regression model you estimated in part (b), what decision should the owner of Michele's Bikes make?

2. In a manufacturing process the assembly line speed (feet per minute) was thought to affect the number of defective parts found during the inspection process. To test this theory, managers devised a situation in which the same batch of parts was inspected visually at a variety of line speeds. They collected the following data:

| Line Speed (ft/min) | No. of Defective Parts Found |
|---|---|
| 20 | 21 |
| 20 | 19 |
| 40 | 15 |
| 30 | 16 |
| 60 | 14 |
| 40 | 17 |

Data file: LineSpeed

a. Develop a scatter chart with line speed as the independent variable. What does the scatter chart indicate about the relationship between line speed and the number of defective parts found?

b. Use the data to develop an estimated regression equation that could be used to predict the number of defective parts found, given the line speed. What is the estimated regression model?

c. Test whether each of the regression parameters \(\beta_0\) and \(\beta_1\) is equal to zero at a 0.01 level of significance. What are the correct interpretations of the estimated regression parameters? Are these interpretations reasonable?

d. How much of the variation in the number of defective parts found for the sample data does the model you estimated in part (b) explain?

3. Jensen Tire & Auto is deciding whether to purchase a maintenance contract for its new computer wheel alignment and balancing machine. Managers feel that maintenance expense should be related to usage, and they collected the following information on weekly usage (hours) and annual maintenance expense (in hundreds of dollars).

| Weekly Usage (hours) | Annual Maintenance Expense ($100s) |
|---|---|
| 13 | 17.0 |
| 10 | 22.0 |
| 20 | 30.0 |
| 28 | 37.0 |
| 32 | 47.0 |
| 17 | 30.5 |
| 24 | 32.5 |
| 31 | 39.0 |
| 40 | 51.5 |
| 38 | 40.0 |

Data file: Jensen

a. Develop a scatter chart with weekly usage hours as the independent variable. What does the scatter chart indicate about the relationship between weekly usage and annual maintenance expense?

b. Use the data to develop an estimated regression equation that could be used to predict the annual maintenance expense for a given number of hours of weekly usage. What is the estimated regression model?

d. How much of the variation in the sample values of annual maintenance expense does the model you estimated in part (b) explain?

e. If the maintenance contract costs $3,000 per year, would you recommend purchasing it? Why or why not?

4. A sociologist was hired by a large city hospital to investigate the relationship between the number of unauthorized days that employees are absent per year and the distance (miles) between home and work for the employees. A sample of 10 employees was chosen, and the following data were collected.

| Distance to Work (miles) | No. of Days Absent |
|---|---|
| 1 | 8 |
| 3 | 5 |
| 4 | 8 |
| 6 | 7 |
| 8 | 6 |
| 10 | 3 |
| 12 | 5 |
| 14 | 2 |
| 14 | 4 |
| 18 | 2 |

Data file: Absent

a. Develop a scatter chart for these data. Does a linear relationship appear reasonable? Explain.

b. Use the data to develop an estimated regression equation that could be used to predict the number of days absent given the distance to work. What is the estimated regression model?

c. What is the 99% confidence interval for the regression parameter \(\beta_1\)? Based on this interval, what conclusion can you make about the hypothesis that the regression parameter \(\beta_1\) is equal to zero?

d. What is the 99% confidence interval for the regression parameter \(\beta_0\)? Based on this interval, what conclusion can you make about the hypothesis that the regression parameter \(\beta_0\) is equal to zero?

e. How much of the variation in the sample values of number of days absent does the model you estimated in part (b) explain?

5. The regional transit authority for a major metropolitan area wants to determine whether there is a relationship between the age of a bus and the annual maintenance cost. A sample of 10 buses resulted in the following data:

| Age of Bus (years) | Annual Maintenance Cost ($) |
|---|---|
| 1 | 350 |
| 2 | 370 |
| 2 | 480 |
| 2 | 520 |
| 2 | 590 |
| 3 | 550 |
| 4 | 750 |
| 4 | 800 |
| 5 | 790 |
| 5 | 950 |

Data file: AgeCost

a. Develop a scatter chart for these data. What does the scatter chart indicate about the relationship between age of a bus and the annual maintenance cost?

b. Use the data to develop an estimated regression equation that could be used to predict the annual maintenance cost given the age of the bus. What is the estimated regression model?

d. How much of the variation in the sample values of annual maintenance cost does the model you estimated in part (b) explain?

e. What do you predict the annual maintenance cost to be for a 3.5-year-old bus?

6. A marketing professor at Givens College is interested in the relationship between hours spent studying and total points earned in a course. Data collected on 156 students who took the course last semester are provided in the file MktHrsPts.

a. Develop a scatter chart for these data. What does the scatter chart indicate about the relationship between total points earned and hours spent studying?

b. Develop an estimated regression equation showing how total points earned is related to hours spent studying. What is the estimated regression model?

c. Test whether each of the regression parameters \(\beta_0\) and \(\beta_1\) is equal to zero at a 0.01 level of significance. What are the correct interpretations of the estimated regression parameters? Are these interpretations reasonable?

d. How much of the variation in the sample values of total points earned does the model you estimated in part (b) explain?

e. Mark Sweeney spent 95 hours studying. Use the regression model you estimated in part (b) to predict the total points Mark earned.

f. Mark Sweeney wants to receive a letter grade of A for this course, and he needs to earn at least 90 points to do so. Based on the regression equation developed in part (b), how many estimated hours should Mark study to receive a letter grade of A for this course?

7. The Dow Jones Industrial Average (DJIA) and the Standard & Poor’s 500 (S&P 500) indexes are used as measures of overall movement in the stock market. The DJIA is based on the price movements of 30 large companies; the S&P 500 is an index composed of 500 stocks. Some say the S&P 500 is a better measure of stock market performance because it is broader based. The closing price for the DJIA and the S&P 500 for 15 weeks, beginning with January 6, 2012, follow (Barron’s web site, April 17, 2012).

| Date | DJIA | S&P |
|---|---|---|
| January 6 | 12,360 | 1,278 |
| January 13 | 12,422 | 1,289 |
| January 20 | 12,720 | 1,315 |
| January 27 | 12,660 | 1,316 |
| February 3 | 12,862 | 1,345 |
| February 10 | 12,801 | 1,343 |
| February 17 | 12,950 | 1,362 |
| February 24 | 12,983 | 1,366 |
| March 2 | 12,978 | 1,370 |
| March 9 | 12,922 | 1,371 |
| March 16 | 13,233 | 1,404 |
| March 23 | 13,081 | 1,397 |
| March 30 | 13,212 | 1,408 |
| April 5 | 13,060 | 1,398 |
| April 13 | 12,850 | 1,370 |

Data file: DJIAS&P500

a. Develop a scatter chart for these data with DJIA as the independent variable. What does the scatter chart indicate about the relationship between DJIA and S&P 500?

b. Develop an estimated regression equation showing how S&P 500 is related to DJIA. What is the estimated regression model?

c. What is the 95% confidence interval for the regression parameter \(\beta_1\)? Based on this interval, what conclusion can you make about the hypothesis that the regression parameter \(\beta_1\) is equal to zero?

d. What is the 95% confidence interval for the regression parameter \(\beta_0\)? Based on this interval, what conclusion can you make about the hypothesis that the regression parameter \(\beta_0\) is equal to zero?

e. How much of the variation in the sample values of S&P 500 does the model estimated in part (b) explain?

f. Suppose that the closing price for the DJIA is 13,500. Estimate the closing price for the S&P 500.

g. Should we be concerned that the DJIA value of 13,500 used to predict the S&P 500 value in part (f) is beyond the range of the DJIA used to develop the estimated regression equation?

8. The Toyota Camry is one of the best-selling cars in North America. The cost of a previously owned Camry depends on many factors, including the model year, mileage, and condition. To investigate the relationship between the car’s mileage and the sales price for Camrys, the following data show the mileage and sale price for 19 sales (PriceHub web site, February 24, 2012).

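| Miles (1,000s) | Price ($1,000s) |
|---|---|
| 22 | 16.2 |
| 29 | 16.0 |
| 36 | 13.8 |
| 47 | 11.5 |
| 63 | 12.5 |
| 77 | 12.9 |
| 73 | 11.2 |
| 87 | 13.0 |
| 92 | 11.8 |
| 101 | 10.8 |
| 110 | 8.3 |
| 28 | 12.5 |
| 59 | 11.1 |
| 68 | 15.0 |
| 68 | 12.2 |
| 91 | 13.0 |
| 42 | 15.6 |
| 65 | 12.7 |
| 110 | 8.3 |

Data file: Camry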

a. Develop a scatter chart for these data with miles as the independent variable. What does the scatter chart indicate about the relationship between price and miles?

b. Develop an estimated regression equation showing how price is related to miles. What is the estimated regression model?

d. How much of the variation in the sample values of price does the model estimated in part (b) explain?

e. For the model estimated in part (b), calculate the predicted price and residual for each automobile in the data. Identify the two automobiles that were the biggest bargains.

f. Suppose that you are considering purchasing a previously owned Camry that has been driven 60,000 miles. Use the estimated regression equation developed in part (b) to predict the price for this car. Is this the price you would offer the seller?

9. Dixie Showtime Movie Theaters, Inc. owns and operates a chain of cinemas in several markets in the southern United States. The owners would like to estimate weekly gross revenue as a function of advertising expenditures. Data for a sample of eight markets for a recent week follow:

| Market | Weekly Gross Revenue ($100s) | Television Advertising ($100s) | Newspaper Advertising ($100s) |
|---|---|---|---|
| Mobile | 101.3 | 5.0 | 1.5 |
| Shreveport | 51.9 | 3.0 | 3.0 |
| Jackson | 74.8 | 4.0 | 1.5 |
| Birmingham | 126.2 | 4.3 | 4.3 |
| Little Rock | 137.8 | 3.6 | 4.0 |
| Biloxi | 101.4 | 3.5 | 2.3 |
| New Orleans | 237.8 | 5.0 | 8.4 |
| Baton Rouge | 219.6 | 6.9 | 5.8 |

Data file: DixieShowtime

a. Develop an estimated regression equation with the amount of television advertising as the independent variable. Test for a significant relationship between television advertising and weekly gross revenue at the 0.05 level of significance. What is the interpretation of this relationship?

b. How much of the variation in the sample values of weekly gross revenue does the model in part (a) explain?

c. Develop an estimated regression equation with both television advertising and newspaper advertising as the independent variables. Test whether each of the regression parameters \(\beta_0\), \(\beta_1\), and \(\beta_2\) is equal to zero at a 0.05 level of significance. What are the correct interpretations of the estimated regression parameters? Are these interpretations reasonable?

d. How much of the variation in the sample values of weekly gross revenue does the model in part (c) explain?

e. Given the results in parts (a) and (c), what should your next step be? Explain.

f. What are the managerial implications of these results?

10. Resorts & Spas, a magazine devoted to upscale vacations and accommodations, published its Reader’s Choice List of the top 20 independent beachfront boutique hotels in the world. The data shown are the scores received by these hotels based on the results from Resorts & Spas’ annual Readers’ Choice Survey. Each score represents the percentage of respondents who rated a hotel as excellent or very good on one of three criteria (comfort, amenities, and in-house dining). An overall score was also reported and used to rank the hotels. The highest ranked hotel, the Muri Beach Odyssey, has an overall score of 94.3, the highest component of which is 97.7 for in-house dining.

| Hotel | Overall | Comfort | Amenities | In-House Dining |
|---|---|---|---|---|
| Muri Beach Odyssey | 94.3 | 94.5 | 90.8 | 97.7 |
| Pattaya Resort | 92.9 | 96.6 | 84.1 | 96.6 |
| Sojourner’s Respite | 92.8 | 99.9 | 100.0 | 88.4 |
| Spa Carribe | 91.2 | 88.5 | 94.7 | 97.0 |
| Penang Resort and Spa | 90.4 | 95.0 | 87.8 | 91.1 |
| Mokihana Ho-kele | 90.2 | 92.4 | 82.0 | 98.7 |
| Theo’s of Cape Town | 90.1 | 95.9 | 86.2 | 91.9 |
| Cap d’Agde Resort | 89.8 | 92.5 | 92.5 | 88.8 |
| Spirit of Mykonos | 89.3 | 94.6 | 85.8 | 90.7 |
| Turismo del Mar | 89.1 | 90.5 | 83.2 | 90.4 |
| Hotel Iguana | 89.1 | 90.8 | 81.9 | 88.5 |
| Sidi Abdel Rahman Palace | 89.0 | 93.0 | 93.0 | 89.6 |
| Sainte-Maxime Quarters | 88.6 | 92.5 | 78.2 | 91.2 |
| Rotorua Inn | 87.1 | 93.0 | 91.6 | 73.5 |
| Club Lapu-Lapu | 87.1 | 90.9 | 74.9 | 89.6 |
| Terracina Retreat | 86.5 | 94.3 | 78.0 | 91.5 |
| Hacienda Punta Barco | 86.1 | 95.4 | 77.3 | 90.8 |
| Rendezvous Kolocep | 86.0 | 94.8 | 76.4 | 91.4 |
| Cabo de Gata Vista | 86.0 | 92.0 | 72.2 | 89.2 |
| Sanya Deluxe | 85.1 | 93.4 | 77.3 | 91.8 |

Data file: BeachFrontHotels

a. Determine the estimated multiple linear regression equation that can be used to predict the overall score given the scores for comfort, amenities, and in-house dining.

b. Use the t test to determine the significance of each independent variable. What is the conclusion for each test at the 0.01 level of significance?

c. Remove all independent variables that are not significant at the 0.01 level of significance from the estimated regression equation. What is your recommended estimated regression equation?

d. Suppose Resorts & Spas has decided to recommend each of the independent beachfront boutiques in its data that achieves an estimated overall score over 90. Use the regression equation developed in part (c) to determine which of the independent beachfront boutiques will receive a recommendation from Resorts & Spas.

11. The American Association of Individual Investors (AAII) On-Line Discount Broker Survey polls members on their experiences with electronic trades handled by discount brokers. As part of the survey, members were asked to rate their satisfaction with the trade price and the speed of execution, as well as provide an overall satisfaction rating. Possible responses (scores) were no opinion (0), unsatisfied (1), somewhat satisfied (2), satisfied (3), and very satisfied (4). For each broker, summary scores were computed as a weighted average of the scores provided by each respondent. A portion of the survey results follows (AAII web site, February 7, 2012).

| Brokerage | Satisfaction with Trade Price | Satisfaction with Speed of Execution | Overall Satisfaction with Electronic Trades |
|---|---|---|---|
| Scottrade, Inc. | 3.4 | 3.4 | 3.5 |
| Charles Schwab | 3.2 | 3.3 | 3.4 |
| Fidelity Brokerage Services | 3.1 | 3.4 | 3.9 |
| TD Ameritrade | 2.9 | 3.6 | 3.7 |
| E*Trade Financial | 2.9 | 3.2 | 2.9 |
| (Not listed) | 2.5 | 3.2 | 2.7 |
| Vanguard Brokerage Services | 2.6 | 3.8 | 2.8 |
| USAA Brokerage Services | 2.4 | 3.8 | 3.6 |
| Thinkorswim | 2.6 | 2.6 | 2.6 |
| Wells Fargo Investments | 2.3 | 2.7 | 2.3 |
| Interactive Brokers | 3.7 | 4.0 | 4.0 |
| Zecco.com | 2.5 | 2.5 | 2.5 |
| Firstrade Securities | 3.0 | 3.0 | 4.0 |
| Banc of America Investment Services | 4.0 | 1.0 | 2.0 |

Data file: Broker

a. Develop an estimated regression equation using trade price and speed of execution to predict overall satisfaction with the broker. Interpret the coefficient of determination.

b. Use the t test to determine the significance of each independent variable. What are your conclusions at the 0.05 level of significance?

c. Interpret the estimated regression parameters. Are the relationships indicated by these estimates what you would expect?

d. Finger Lakes Investments has developed a new electronic trading system and would like to predict overall customer satisfaction assuming they can provide satisfactory service levels (3) for both trade price and speed of execution. Use the estimated regression equation developed in part (a) to predict overall satisfaction level for Finger Lakes Investments if they can achieve these performance levels.

e. What concerns (if any) do you have with regard to the possible responses the respondents could select on the survey?

12. The National Football League (NFL) records a variety of performance data for individuals and teams. To investigate the importance of passing on the percentage of games won by a team, the following data show the conference (Conf), average number of passing yards per attempt (Yds/Att), the number of interceptions thrown per attempt (Int/Att), and the percentage of games won (Win%) for a random sample of 16 NFL teams for the 2011 season (NFL web site, February 12, 2012).

Team Conf Yds/Att Int/Att Win%

Arizona Cardinals NFC 6.5 0.042 50.0

Atlanta Falcons NFC 7.1 0.022 62.5

Carolina Panthers NFC 74.1 0.033 37.5

Cincinnati Bengals AFC 6.2 0.026 56.3

Detroit Lions NFC 7.2 0.024 62.5

Green Bay Packers NFC 8.9 0.014 93.8

Houston Texans AFC 7.5 0.019 62.5

Indianapolis Colts AFC 5.6 0.026 12.5

Jacksonville Jaguars AFC 4.6 0.032 31.3

Minnesota Vikings NFC 5.8 0.033 18.8

New England Patriots AFC 8.3 0.020 81.3

New Orleans Saints NFC 8.1 0.021 81.3

Oakland Raiders AFC 7.6 0.044 50.0

San Francisco 49ers NFC 6.5 0.011 81.3

Tennessee Titans AFC 6.7 0.024 56.3

Washington Redskins NFC 6.4 0.041 31.3

NFLPassing

a. Develop the estimated regression equation that could be used to predict the percentage of games won, given the average number of passing yards per attempt. What proportion of variation in the sample values of proportion of games won does this model explain?

b. Develop the estimated regression equation that could be used to predict the percentage of games won, given the number of interceptions thrown per attempt. What proportion of variation in the sample values of proportion of games won does this model explain?

c. Develop the estimated regression equation that could be used to predict the percentage of games won, given the average number of passing yards per attempt and the number of interceptions thrown per attempt. What proportion of variation in the sample values of proportion of games won does this model explain?

d. The average number of passing yards per attempt for the Kansas City Chiefs during the 2011 season was 6.2, and the team’s number of interceptions thrown per attempt was 0.036. Use the estimated regression equation developed in part (c) to predict the percentage of games won by the Kansas City Chiefs during the 2011 season. Compare your prediction to the actual percentage of games won by the Kansas City Chiefs. (Note: For the 2011 season, the Kansas City Chiefs’ record was 7 wins and 9 losses.)

e. Did the estimated regression equation that uses only the average number of passing yards per attempt as the independent variable to predict the percentage of games won provide a good fit?
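
For the comparison requested in part (d), the actual percentage of games won follows directly from the record given in the note. The arithmetic below assumes that the 7 wins and 9 losses account for all 16 games of the season, with no ties:

```latex
\[
\text{Win\%} = \frac{7}{7 + 9} \times 100 = 43.75
\]
```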

13. Johnson Filtration, Inc. provides maintenance service for water filtration systems throughout Southern Florida. Customers contact Johnson with requests for maintenance service on their water filtration systems. To estimate the service time and the service cost, Johnson’s managers want to predict the repair time necessary for each maintenance request. Hence, repair time in hours is the dependent variable. Repair time is believed to be related to three factors: the number of months since the last maintenance service, the type of repair problem (mechanical or electrical), and the repairperson who performs the repair (Donna Newton or Bob Jones). Data for a sample of 10 service calls are reported in the following table:

Repair Time in Hours, Months Since Last Service, Type of Repair, Repairperson

2.9 2 Electrical Donna Newton

3.0 6 Mechanical Donna Newton

4.8 8 Electrical Bob Jones

1.8 3 Mechanical Donna Newton

2.9 2 Electrical Donna Newton

4.9 7 Electrical Bob Jones

4.2 9 Mechanical Bob Jones

4.8 8 Mechanical Bob Jones

4.4 4 Electrical Bob Jones

4.5 6 Electrical Donna Newton

a. Develop the simple linear regression equation to predict repair time given the number of months since the last maintenance service, and use the results to test the hypothesis that no relationship exists between repair time and the number of months since the last maintenance service at the 0.05 level of significance. What is the interpretation of this relationship? What does the coefficient of determination tell you about this model?

b. Using the simple linear regression model developed in part (a), calculate the predicted repair time and residual for each of the 10 repairs in the data. Sort the data in ascending order by value of the residual. Do you see any pattern in the residuals for the two types of repair? Do you see any pattern in the residuals for the two repairpersons? Do these results suggest any potential modifications to your simple linear regression model? Now create a scatter chart with months since last service on the x-axis and repair time in hours on the y-axis for which the points representing electrical and mechanical repairs are shown in different shapes and/or colors. Create a similar scatter chart of months since last service and repair time in hours for which the points representing repairs by Bob Jones and Donna Newton are shown in different shapes and/or colors. Do these charts and the results of your residual analysis suggest the same potential modifications to your simple linear regression model?

c. Create a new dummy variable that is equal to zero if the type of repair is mechanical and one if the type of repair is electrical. Develop the multiple regression equation to predict repair time, given the number of months since the last maintenance service and the type of repair. What are the interpretations of the estimated regression parameters? What does the coefficient of determination tell you about this model?

Repair

d. Create a new dummy variable that is equal to zero if the repairperson is Bob Jones and one if the repairperson is Donna Newton. Develop the multiple regression equation to predict repair time, given the number of months since the last maintenance service and the repairperson. What are the interpretations of the estimated regression parameters? What does the coefficient of determination tell you about this model?

e. Develop the multiple regression equation to predict repair time, given the number of months since the last maintenance service, the type of repair, and the repairperson. What are the interpretations of the estimated regression parameters? What does the coefficient of determination tell you about this model?

f. Which of these models would you use? Why?
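
The dummy-variable coding described in part (c) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the approach the text prescribes (the chapter works in Excel): the helper names `ols` and `r_squared` are hypothetical, and the coefficients are obtained by solving the normal equations with the standard library alone. The data are the 10 service calls from the table above.

```python
# Repair data from the table above:
# (repair time in hours, months since last service, type of repair)
calls = [
    (2.9, 2, "Electrical"), (3.0, 6, "Mechanical"), (4.8, 8, "Electrical"),
    (1.8, 3, "Mechanical"), (2.9, 2, "Electrical"), (4.9, 7, "Electrical"),
    (4.2, 9, "Mechanical"), (4.8, 8, "Mechanical"), (4.4, 4, "Electrical"),
    (4.5, 6, "Electrical"),
]


def ols(X, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def r_squared(X, y, b):
    """Coefficient of determination, 1 - SSE/SST."""
    mean_y = sum(y) / len(y)
    fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


y = [hours for hours, _, _ in calls]

# Part (a): intercept and months since last service
X_simple = [[1.0, months] for _, months, _ in calls]

# Part (c): add the dummy variable, 0 = mechanical, 1 = electrical
X_type = [[1.0, months, 1.0 if kind == "Electrical" else 0.0]
          for _, months, kind in calls]

b_simple = ols(X_simple, y)
b_type = ols(X_type, y)
r2_simple = r_squared(X_simple, y, b_simple)
r2_type = r_squared(X_type, y, b_type)

print("simple model coefficients:", b_simple, "R^2:", r2_simple)
print("model with repair-type dummy:", b_type, "R^2:", r2_type)
```

The same pattern extends to parts (d) and (e) by adding a second 0/1 column for the repairperson. Because each larger model contains the smaller one, its coefficient of determination can never be lower, which is one reason part (f) asks for a judgment rather than a simple comparison of fit.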

14. A study investigated the relationship between audit delay (the length of time from a company’s fiscal year-end to the date of the auditor’s report) and variables that describe the client and the auditor. Some of the independent variables that were included in this study follow:

Industry A dummy variable coded 1 if the firm was an industrial company or 0 if the firm was a bank, savings and loan, or insurance company.

Public A dummy variable coded 1 if the company was traded on an organized exchange or over the counter; otherwise coded 0.

Quality A measure of overall quality of internal controls, as judged by the auditor, on a 5-point scale ranging from “virtually none” (1) to “excellent” (5).

Finished A measure ranging from 1 to 4, as judged by the auditor, where 1 indicates “all work performed subsequent to year-end” and 4 indicates “most work performed prior to year-end.”

A sample of 40 companies provided the following data:

Delay (Days) Industry Public Quality Finished

62 0 0 3 1

45 0 1 3 3

54 0 0 2 2

71 0 1 1 2

91 0 0 1 1

62 0 0 4 4

61 0 0 3 2

69 0 1 5 2

80 0 0 1 1

52 0 0 5 3

47 0 0 3 2

65 0 1 2 3

60 0 0 1 3

81 1 0 1 2

73 1 0 2 2

89 1 0 2 1

71 1 0 5 4

76 1 0 2 2

68 1 0 1 2

68 1 0 5 2

86 1 0 2 2

76 1 1 3 1

67 1 0 2 3

57 1 0 4 2

55 1 1 3 2

54 1 0 5 2

69 1 0 3 3

82 1 0 5 1

94 1 0 1 1

74 1 1 5 2

75 1 1 4 3

69 1 0 2 2

71 1 0 4 4

79 1 0 5 2

80 1 0 1 4

91 1 0 4 1

92 1 0 1 4

46 1 1 4 3

72 1 0 5 2

85 1 0 5 1

Audit

a. Develop the estimated regression equation using all of the independent variables included in the data.

b. How much of the variation in the sample values of delay does this estimated regression equation explain? What other independent variables could you include in this regression model to improve the fit?

c. Test the relationship between each independent variable and the dependent variable at the 0.05 level of significance, and interpret the relationship between each of the independent variables and the dependent variable.

d. On the basis of your observations about the relationships between the dependent variable Delay and the independent variables Quality and Finished, suggest an alternative model for the regression equation developed in part (a) to explain as much of the variability in Delay as possible.

15. The U.S. Department of Energy’s Fuel Economy Guide provides fuel efficiency data for cars and trucks. A portion of the data for 311 compact, midsized, and large cars follows. The Class column identifies the size of the car: Compact, Midsize, or Large. The Displacement column shows the engine’s displacement in liters. The FuelType column shows whether the car uses premium (P) or regular (R) fuel, and the HwyMPG column shows the fuel efficiency rating for highway driving in terms of miles per gallon. The complete data set is contained in the file FuelData:

Car Class Displacement FuelType HwyMPG

1 Compact 3.1 P 25

2 Compact 3.1 P 25

3 Compact 3.0 P 25

. . .

161 Midsize 2.4 R 30

162 Midsize 2.0 P 29

. . .

310 Large 3.0 R 25

FuelData

a. Develop an estimated regression equation that can be used to predict the fuel efficiency for highway driving given the engine’s displacement. Test for significance using the 0.05 level of significance. How much of the variation in the sample values of HwyMPG does this estimated regression equation explain?

b. Create a scatter chart with HwyMPG on the y-axis and displacement on the x-axis for which the points representing compact, midsize, and large automobiles are shown in different shapes and/or colors. What does this chart suggest about the relationship between the class of automobile (compact, midsize, and large) and HwyMPG?

c. Now consider the addition of the dummy variables ClassMidsize and ClassLarge to the simple linear regression model in part (a). The value of ClassMidsize is 1 if the car is a midsize car and 0 otherwise; the value of ClassLarge is 1 if the car is a large car and 0 otherwise. Thus, for a compact car, the value of ClassMidsize and the value of ClassLarge are both 0. Develop the estimated regression equation that can be used to predict the fuel efficiency for highway driving, given the engine’s displacement and the dummy variables ClassMidsize and ClassLarge. How much of the variation in the sample values of HwyMPG is explained by this estimated regression equation?

d. Use a significance level of 0.05 to determine whether the dummy variables added to the model in part (c) are significant.

e. Consider the addition of the dummy variable FuelPremium, where the value of FuelPremium is 1 if the car uses premium fuel and 0 if the car uses regular fuel. Develop the estimated regression equation that can be used to predict the fuel efficiency for highway driving given the engine’s displacement, the dummy variables ClassMidsize and ClassLarge, and the dummy variable FuelPremium. How much of the variation in the sample values of HwyMPG does this estimated regression equation explain?

f. For the estimated regression equation developed in part (e), test for the significance of the relationship between each of the independent variables and the dependent variable using the 0.05 level of significance for each test.

g. An automobile manufacturer is designing a new compact model with a displacement of 2.9 liters with the objective of creating a model that will achieve at least 25 estimated highway MPG. The manufacturer must now decide if the car can be designed to use premium fuel and still achieve the objective of 25 MPG on the highway. Use the model developed in part (c) to recommend a decision to this manufacturer.
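
The dummy-variable definitions in parts (c) and (e) translate directly into code. The sketch below is one possible encoding, with a hypothetical helper name `encode_car`; it assumes the Class and FuelType values appear exactly as in the table excerpt ("Compact", "Midsize", "Large", and "P" or "R").

```python
def encode_car(car_class: str, fuel_type: str) -> dict:
    """Return the 0/1 dummy variables defined in parts (c) and (e).

    Compact is the baseline class, so a compact car has
    ClassMidsize = 0 and ClassLarge = 0.
    """
    if car_class not in {"Compact", "Midsize", "Large"}:
        raise ValueError(f"unexpected class: {car_class!r}")
    if fuel_type not in {"P", "R"}:
        raise ValueError(f"unexpected fuel type: {fuel_type!r}")
    return {
        "ClassMidsize": int(car_class == "Midsize"),
        "ClassLarge": int(car_class == "Large"),
        "FuelPremium": int(fuel_type == "P"),
    }


# Three rows taken from the table excerpt: (car, class, displacement, fuel, HwyMPG)
sample_rows = [
    (1, "Compact", 3.1, "P", 25),
    (161, "Midsize", 2.4, "R", 30),
    (310, "Large", 3.0, "R", 25),
]

for car, car_class, displacement, fuel, mpg in sample_rows:
    print(car, displacement, encode_car(car_class, fuel), mpg)
```

Only two dummy variables are needed for three classes. Adding a third column for Compact would make the columns sum to the intercept column, and the regression coefficients could not be estimated uniquely.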

16. A highway department is studying the relationship between traffic flow and speed during rush hour on Highway 193. The data in the file TrafficFlow were collected on Highway 193 during 100 recent rush hours.

TrafficFlow

a. Develop a scatter chart for these data. What does the scatter chart indicate about the relationship between vehicle speed and traffic flow?

b. Develop an estimated simple linear regression equation for the data. How much variation in the sample values of traffic flow is explained by this regression model? Use a 0.05 level of significance to test the relationship between vehicle speed and traffic flow. What is the interpretation of this relationship?

c. Develop an estimated quadratic regression equation for the data. How much variation in the sample values of traffic flow is explained by this regression model? Test the relationship between each of the independent variables and the dependent variable at a 0.05 level of significance. How would you interpret this model? Is this model superior to the model you developed in part (b)?

d. As an alternative to fitting a second-order model, fit a model using a piecewise linear regression with a single knot. What value of vehicle speed appears to be a good point for the placement of the knot? Does the estimated piecewise linear regression provide a better fit than the estimated quadratic regression developed in part (c)? Explain.

e. Separate the data into two sets such that one data set contains the observations of vehicle speed less than the value of the knot from part (d) and the other data set contains the observations of vehicle speed greater than or equal to the value of the knot from part (d). Then fit a simple linear regression equation to each data set. How does this pair of regression equations compare to the single piecewise linear regression with the single knot from part (d)? In particular, compare predicted values of traffic flow for values of the speed slightly above and slightly below the knot value from part (d).

f. What other independent variables could you include in your regression model to explain more variation in traffic flow?
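
The piecewise linear regression with a single knot in part (d) can be set up by adding one extra column, max(0, speed - knot), to the simple linear model. The sketch below illustrates the construction only. The TrafficFlow file is not reproduced here, so the speeds, flows, and knot value are hypothetical, noise-free numbers chosen so that the fitted coefficients can be checked by eye; `ols` and `knot_term` are hypothetical helper names.

```python
def ols(X, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def knot_term(x, knot):
    """Zero below the knot; grows one-for-one with x above it."""
    return max(0.0, x - knot)


# Hypothetical data (NOT the TrafficFlow file): flow rises with speed
# up to a knot at 40 and falls after it.
KNOT = 40.0
speeds = [float(s) for s in range(20, 71, 5)]
flows = [200.0 + 30.0 * s - 45.0 * knot_term(s, KNOT) for s in speeds]

# Design matrix: intercept, speed, and the knot term
X = [[1.0, s, knot_term(s, KNOT)] for s in speeds]
b0, b1, b2 = ols(X, flows)

print("intercept:", b0)
print("slope below the knot:", b1)
print("slope above the knot:", b1 + b2)
```

Because the knot term is zero at the knot itself, the two line segments meet there. Two separately fitted regressions, as in part (e), carry no such constraint, so their predictions can jump at the knot.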

17. A sample containing years to maturity and (percent) yield for 40 corporate bonds is contained in the file named CorporateBonds (Barron’s, April 2, 2012).

a. Develop a scatter chart of the data using years to maturity as the independent variable. Does a simple linear regression model appear to be appropriate?

b. Develop an estimated quadratic regression equation with years to maturity and squared values of years to maturity as the independent variables. How much variation in the sample values of yield is explained by this regression model? Test the relationship between each of the independent variables and the dependent variable at a 0.05 level of significance. How would you interpret this model?

c. Create a plot of the linear and quadratic regression lines overlaid on the scatter chart of years to maturity and yield. Does this help you better understand the difference in how the quadratic regression model and a simple linear regression model fit the sample data? Which model does this chart suggest provides a superior fit to the sample data?

d. What other independent variables could you include in your regression model to explain more variation in yield?

CorporateBonds

18. In 2011, home prices and mortgage rates fell so far that in a number of cities the monthly cost of owning a home was less expensive than renting. The following data show the average asking rent for 10 markets and the monthly mortgage on the median-priced home (including taxes and insurance) for 10 cities where the average monthly mortgage payment was less than the average asking rent (The Wall Street Journal, November 26–27, 2011).

City Rent ($) Mortgage ($)

Atlanta 840 539

Chicago 1,062 1,002

Detroit 823 626

Jacksonville 779 711

Las Vegas 796 655

Miami 1,071 977

Minneapolis 953 776

Orlando 851 695

Phoenix 762 651

St. Louis 723 654

RentMortgage

a. Develop a scatter chart for these data, treating the average asking rent as the independent variable. Does a simple linear regression model appear to be appropriate?

b. Use a simple linear regression model to develop an estimated regression equation to predict the monthly mortgage on the median-priced home given the average asking rent. Construct a plot of the residuals against the independent variable rent. Based on this residual plot, does a simple linear regression model appear to be appropriate?

c. Using a quadratic regression model, develop an estimated regression equation to predict the monthly mortgage on the median-priced home, given the average asking rent.

d. Do you prefer the estimated regression equation developed in part (b) or part (c)? Create a plot of the linear and quadratic regression lines overlaid on the scatter chart of the monthly mortgage on the median-priced home and the average asking rent to help you assess the two regression equations. Explain your conclusions.
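
A quadratic regression is an ordinary multiple regression in which the squared value of the independent variable is added as a second column. The sketch below shows that setup for the rent and mortgage data in the table above. It is an illustration under assumptions rather than the method the text uses: the helper names are hypothetical, and rent is centered at its mean and expressed in hundreds of dollars purely to keep the arithmetic well conditioned. That rescaling changes the individual coefficients but not the fitted values or the coefficient of determination.

```python
# Data from the table above, in dollars
rent = [840, 1062, 823, 779, 796, 1071, 953, 851, 762, 723]
mortgage = [539, 1002, 626, 711, 655, 977, 776, 695, 651, 654]


def ols(X, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]


def r_squared(X, y, b):
    """Coefficient of determination, 1 - SSE/SST."""
    mean_y = sum(y) / len(y)
    fitted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst


# Center rent at its mean and rescale to hundreds of dollars
mean_rent = sum(rent) / len(rent)
z = [(r - mean_rent) / 100 for r in rent]

X_lin = [[1.0, zi] for zi in z]             # part (b): linear in rent
X_quad = [[1.0, zi, zi ** 2] for zi in z]   # part (c): adds the squared term

b_lin = ols(X_lin, mortgage)
b_quad = ols(X_quad, mortgage)
r2_lin = r_squared(X_lin, mortgage, b_lin)
r2_quad = r_squared(X_quad, mortgage, b_quad)

print("linear R^2:", r2_lin)
print("quadratic R^2:", r2_quad)
```

The quadratic model always explains at least as much sample variation as the linear one, so the comparison in part (d) should also weigh the residual plot and the overlaid curves, not the coefficient of determination alone.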

19. A recent 10-year study by a research team at the Great Falls Medical School was conducted to assess how age, systolic blood pressure, and smoking relate to the risk of strokes. Assume that the following data are from a portion of this study. Risk is interpreted as the probability (times 100) that the patient will have a stroke over the next 10-year period. For the smoking variable, define a dummy variable with 1 indicating a smoker and 0 indicating a nonsmoker.

Risk Age Systolic Blood Pressure Smoker

12 57 152 NO

24 67 163 NO

13 58 155 NO

56 86 177 YES

28 59 196 NO

51 76 189 YES

18 56 155 YES

31 78 120 NO

37 80 135 YES

15 78 98 NO

22 71 152 NO

36 70 173 YES

15 67 135 YES

48 77 209 YES

15 60 199 NO

36 82 119 YES

8 66 166 NO

34 80 125 YES

3 62 117 NO

37 59 207 YES

Stroke

RugglesCollege

a. Develop an estimated multiple regression equation that relates risk of a stroke to the person’s age, systolic blood pressure, and whether the person is a smoker.

b. Is smoking a significant factor in the risk of a stroke? Explain. Use a 0.05 level of significance.

c. What is the probability of a stroke over the next 10 years for Art Speen, a 68-year-old smoker who has a systolic blood pressure of 175? What action might the physician recommend for this patient?

d. An insurance company will only sell its Select policy to people for whom the probability of a stroke in the next 10 years is less than 0.01. If a smoker with a systolic blood pressure of 230 applies for a Select policy, under what condition will the company sell him the policy if it adheres to this standard?

e. What other factors could be included in the model as independent variables?

20. The Scholastic Aptitude Test (or SAT) is a standardized college entrance test that is used by colleges and universities as a means for making admission decisions. The critical reading and mathematics components of the SAT are reported on a scale from 200 to 800. Several universities believe these scores are strong predictors of an incoming student’s potential success, and they use these scores as important inputs when making admission decisions on potential freshmen. The file RugglesCollege contains freshman year GPA and the critical reading and mathematics SAT scores for a random sample of 200 students who recently completed their freshman year at Ruggles College.

a. Develop an estimated multiple regression equation that includes critical reading and mathematics SAT scores as independent variables. How much variation in freshman GPA is explained by this model? Test whether each of the regression parameters $\beta_0$, $\beta_1$, and $\beta_2$ is equal to zero at a 0.05 level of significance. What are the correct interpretations of the estimated regression parameters? Are these interpretations reasonable?

b. Using the multiple linear regression model you developed in part (a), what is the predicted freshman GPA of Bobby Engle, a student who has been admitted to Ruggles College with a 660 SAT score on critical reading and a 630 SAT score on mathematics?

c. The Ruggles College Director of Admissions believes that the relationship between a student’s scores on the critical reading component of the SAT and the student’s freshman GPA varies with the student’s score on the mathematics component of the SAT. Develop an estimated multiple regression equation that includes critical reading and mathematics SAT scores and their interaction as independent variables. How much variation in freshman GPA is explained by this model? Test whether each of the regression parameters $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$ is equal to zero at a 0.05 level of significance. What are the correct interpretations of the estimated regression parameters? Do these results support the conjecture made by the Ruggles College Director of Admissions?

d. Do you prefer the estimated regression model developed in part (a) or part (c)? Explain.

e. What other factors could be included in the model as independent variables?
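
The interaction model requested in part (c) can be written out as follows. The labeling is an assumption made for illustration: $x_1$ denotes the critical reading score, $x_2$ the mathematics score, and $y$ the freshman GPA; the problem itself does not fix which score is $x_1$.

```latex
\[
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon
\]
\[
\frac{\partial E(y)}{\partial x_1} = \beta_1 + \beta_3 x_2
\]
```

Under this labeling, the second line shows why $\beta_3$ bears on the Director of Admissions' conjecture: the expected change in GPA per additional point of critical reading depends on the mathematics score only when $\beta_3 \neq 0$.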

21. Consider again the example introduced in Section 7.5 of a credit card company that has a database of information provided by its customers when they apply for credit cards. An analyst has created a multiple regression model for which the dependent variable in the model is credit card charges accrued by a customer in the data set over the past year (y), and the independent variables are the customer’s annual household income ($x_1$), number of members of the household ($x_2$), and number of years of post-high school education ($x_3$). Figure 7.23 provides Excel output for a multiple regression model estimated using a data set the company created.

a. Estimate the corresponding simple linear regression with the customer’s annual household income as the independent variable and credit card charges accrued by a customer over the past year as the dependent variable. Interpret the estimated relationship between the customer’s annual household income and credit card charges accrued over the past year. How much variation in credit card charges accrued by a customer over the past year is explained by this simple linear regression model?

b. Estimate the corresponding simple linear regression with the number of members in the customer’s household as the independent variable and credit card charges accrued by a customer over the past year as the dependent variable. Interpret the estimated relationship between the number of members in the customer’s household and credit card charges accrued over the past year. How much variation in credit card charges accrued by a customer over the past year is explained by this simple linear regression model?

c. Estimate the corresponding simple linear regression with the customer’s number of years of post-high school education as the independent variable and credit card charges accrued by a customer over the past year as the dependent variable. Interpret the estimated relationship between the customer’s number of years of post-high school education and credit card charges accrued over the past year. How much variation in credit card charges accrued by a customer over the past year is explained by this simple linear regression model?

d. Recall the multiple regression in Figure 7.23 with credit card charges accrued by a customer over the past year as the dependent variable and customer’s annual household income ($x_1$), number of members of the household ($x_2$), and number of years of post-high school education ($x_3$) as the independent variables. Do the estimated slopes differ substantially from the corresponding slopes that were estimated using simple linear regression in parts (a), (b), and (c)? What does this tell you about multicollinearity in the multiple regression model in Figure 7.23?

ExtendedLargeCredit

e. Add the coefficients of determination for the simple linear regression in parts (a), (b), and (c), and compare the result to the coefficient of determination for the multiple regression model in Figure 7.23. What does this tell you about multicollinearity in the multiple regression model in Figure 7.23?

f. Add age, a dummy variable for sex, and a dummy variable for whether a customer has exceeded his or her credit limit in the past 12 months as independent variables to the multiple regression model in Figure 7.23. Code the dummy variable for sex as 1 if the customer is female and 0 if male, and code the dummy variable for whether a customer has exceeded his or her credit limit in the past 12 months as 1 if the customer has exceeded his or her credit limit in the past 12 months and 0 otherwise. Do these variables substantially improve the fit of your model?

Case Problem: Alumni Giving

Alumni donations are an important source of revenue for colleges and universities. If administrators could determine the factors that could lead to increases in the percentage of alumni who make a donation, they might be able to implement policies that could lead to increased revenues. Research shows that students who are more satisfied with their contact with teachers are more likely to graduate. As a result, one might suspect that smaller class sizes and lower student/faculty ratios might lead to a higher percentage of satisfied graduates, which in turn might lead to increases in the percentage of alumni who make a donation. The following table shows data for 48 national universities. The Graduation Rate column is the percentage of students who initially enrolled at the university and graduated. The % of Classes Under 20 column shows the percentages of classes with fewer than 20 students that are offered. The Student/Faculty Ratio column is the number of students enrolled divided by the total number of faculty. Finally, the Alumni Giving Rate column is the percentage of alumni who made a donation to the university.

University, State, Graduation Rate, % of Classes Under 20, Student/Faculty Ratio, Alumni Giving Rate

Boston College MA 85 39 13 25

Brandeis University MA 79 68 8 33

Brown University RI 93 60 8 40

California Institute of Technology CA 85 65 3 46

Carnegie Mellon University PA 75 67 10 28

Case Western Reserve Univ. OH 72 52 8 31

College of William and Mary VA 89 45 12 27

Columbia University NY 90 69 7 31

Cornell University NY 91 72 13 35

Dartmouth College NH 94 61 10 53

Duke University NC 92 68 8 45

Emory University GA 84 65 7 37

Georgetown University DC 91 54 10 29

Harvard University MA 97 73 8 46

Johns Hopkins University MD 89 64 9 27

Lehigh University PA 81 55 11 40

AlumniGiving

Massachusetts Institute of Technology MA 92 65 6 44

New York University NY 72 63 13 13

Northwestern University IL 90 66 8 30

Pennsylvania State Univ. PA 80 32 19 21

Princeton University NJ 95 68 5 67

Rice University TX 92 62 8 40

Stanford University CA 92 69 7 34

Tufts University MA 87 67 9 29

Tulane University LA 72 56 12 17

University of California–Berkeley CA 83 58 17 18

University of California–Davis CA 74 32 19 7

University of California–Irvine CA 74 42 20 9

University of California–Los Angeles CA 78 41 18 13

University of California–San Diego CA 80 48 19 8

University of California–Santa Barbara CA 70 45 20 12

University of Chicago IL 84 65 4 36

University of Florida FL 67 31 23 19

University of Illinois–Urbana Champaign IL 77 29 15 23

University of Michigan–Ann Arbor MI 83 51 15 13

University of North Carolina–Chapel Hill NC 82 40 16 26

University of Notre Dame IN 94 53 13 49

University of Pennsylvania PA 90 65 7 41

University of Rochester NY 76 63 10 23

University of Southern California CA 70 53 13 22

University of Texas–Austin TX 66 39 21 13

University of Virginia VA 92 44 13 28

University of Washington WA 70 37 12 12

University of Wisconsin–Madison WI 73 37 13 13

Vanderbilt University TN 82 68 9 31

Wake Forest University NC 82 59 11 38

Washington University–St. Louis MO 86 73 7 33

Yale University CT 94 77 7 50

Managerial Report

1. Use methods of descriptive statistics to summarize the data.

2. Develop an estimated simple linear regression model that can be used to predict the alumni giving rate, given the graduation rate. Discuss your findings.

3. Develop an estimated multiple linear regression model that could be used to predict the alumni giving rate using Graduation Rate, % of Classes Under 20, and Student/Faculty Ratio as independent variables. Discuss your findings.

4. Based on the results in parts (2) and (3), do you believe another regression model may be more appropriate? Estimate this model, and discuss your results.

5. What conclusions and recommendations can you derive from your analysis? What universities are achieving a substantially higher alumni giving rate than would be expected, given their Graduation Rate, % of Classes Under 20, and Student/Faculty Ratio? What universities are achieving a substantially lower alumni giving rate than would be expected, given their Graduation Rate, % of Classes Under 20, and Student/Faculty Ratio? What other independent variables could be included in the model?

Chapter 8

Time Series Analysis and Forecasting

CONTENTS

Analytics in Action: ACCO Brands

8.1 Time Series Patterns
Horizontal Pattern
Trend Pattern
Seasonal Pattern
Trend and Seasonal Pattern
Cyclical Pattern
Identifying Time Series Patterns

8.2 Forecast Accuracy

8.3 Moving Averages and Exponential Smoothing
Moving Averages
Exponential Smoothing

8.4 Using Regression Analysis for Forecasting
Linear Trend Projection
Seasonality Without Trend
Seasonality with Trend
Using Regression Analysis as a Causal Forecasting Method
Combining Causal Variables with Trend and Seasonality Effects
Considerations in Using Regression in Forecasting

8.5 Determining the Best Forecasting Model to Use

Appendix 8.1: Using the Excel Forecast Sheet

Appendix 8.2: Forecasting with Analytic Solver (MindTap Reader)

The purpose of this chapter is to provide an introduction to time series analysis and forecasting. Suppose we are asked to provide quarterly forecasts of sales for one of our company’s products over the upcoming one-year period. Production schedules, raw materials purchasing, inventory policies, marketing plans, and cash flows will all be affected by the quarterly forecasts we provide. Consequently, poor forecasts may result in poor planning and increased costs for the company. How should we go about providing the quarterly sales forecasts? Good judgment, intuition, and an awareness of the state of the economy may give us a rough idea, or feeling, of what is likely to happen in the future, but converting that feeling into a number that can be used as next year’s sales forecast is challenging.

ACCO Brands*

ACCO Brands Corporation is one of the world’s largest suppliers of branded office and consumer products and print finishing solutions. The company’s brands include AT-A-GLANCE®, Day-Timer®, Five Star®, GBC®, Hilroy®, Kensington®, Marbig®, Mead®, NOBO, Quartet®, Rexel, Swingline®, Tilibra®, Wilson Jones®, and many others.

Because it produces and markets a wide array of products with myriad demand characteristics, ACCO Brands relies heavily on sales forecasts in planning its manufacturing, distribution, and marketing activities. By viewing its relationship in terms of a supply chain, ACCO Brands and its customers (which are generally retail chains) establish close collaborative relationships and consider each other to be valued partners. As a result, ACCO Brands’ customers share valuable information and data that serve as inputs into ACCO Brands’ forecasting process.

In her role as a forecasting manager for ACCO Brands, Vanessa Baker appreciates the importance of this additional information. “We do separate forecasts of demand for each major customer,” said Baker, “and we generally use twenty-four to thirty-six months of history to generate monthly forecasts twelve to eighteen months into the future. While trends are important, several of our major product lines, including school, planning and organizing, and decorative calendars, are heavily seasonal, and seasonal sales make up the bulk of our annual volume.”

Daniel Marks, one of several account-level strategic forecast managers for ACCO Brands, adds,

The supply chain process includes the total lead time from identifying opportunities to making or procuring the product to getting the product on the shelves to align with the forecasted demand; this can potentially take several months, so the accuracy of our forecasts is critical throughout each step of the supply chain. Adding to this challenge is the risk of obsolescence. We sell many dated items, such as planners and calendars, which have a natural, built-in obsolescence. In addition, many of our products feature designs that are fashion-conscious or contain pop culture images, and these products can also become obsolete very quickly as tastes and popularity change. An overly optimistic forecast for these products can be very costly, but an overly pessimistic forecast can result in lost sales potential and give our competitors an opportunity to take market share from us.

*The authors are indebted to Vanessa Baker and Daniel Marks of ACCO Brands for providing input for this Analytics in Action.

In addition to trends, seasonal components, and cyclical patterns, there are several other factors that Baker and Marks must consider. Baker notes, “We have to adjust our forecasts for upcoming promotions by our customers.” Marks agrees and adds:

We also have to go beyond just forecasting consumer demand; we must consider the retailer’s specific needs in our order forecasts, such as what type of display will be used and how many units of a product must be on display to satisfy their presentation requirements. Current inventory is another factor: if a customer is carrying either too much or too little inventory, that will affect their future orders, and we need to reflect that in our forecasts. Will the product have a short life because it is tied to a cultural fad? What are the retailer’s marketing and markdown strategies? Our knowledge of the environments in which our supply chain partners are competing helps us to forecast demand more accurately, and that reduces waste and makes our customers, as well as ACCO Brands, far more profitable.

ANALYTICS IN ACTION

374 chapter 8 Time Series Analysis and Forecasting

Forecasting methods can be classified as qualitative or quantitative. Qualitative methods generally involve the use of expert judgment to develop forecasts. Such methods are appropriate when historical data on the variable being forecast are either unavailable or not applicable. Quantitative forecasting methods can be used when (1) past information about the variable being forecast is available, (2) the information can be quantified, and (3) it is reasonable to assume that past is prologue (i.e., that the pattern of the past will continue into the future). We will focus exclusively on quantitative forecasting methods in this chapter.

If the historical data are restricted to past values of the variable to be forecast, the forecasting procedure is called a time series method and the historical data are referred to as time series. The objective of time series analysis is to uncover a pattern in the time series and then extrapolate the pattern to forecast the future; the forecast is based solely on past values of the variable and/or on past forecast errors.

Causal or exploratory forecasting methods are based on the assumption that the variable we are forecasting has a cause-and-effect relationship with one or more other variables. These methods help explain how the value of one variable impacts the value of another. For instance, the sales volume for many products is influenced by advertising expenditures, so regression analysis may be used to develop an equation showing how these two variables are related. Then, once the advertising budget is set for the next period, we could substitute this value into the equation to develop a prediction or forecast of the sales volume for that period. Note that if a time series method were used to develop the forecast, advertising expenditures would not be considered; that is, a time series method would base the forecast solely on past sales.

Modern data-collection technologies have enabled individuals, businesses, and government agencies to collect vast amounts of data that may be used for causal forecasting. For example, supermarket scanners allow retailers to collect point-of-sale data that can then be used to aid in planning sales, coupon targeting, and other marketing and planning efforts. These data can help answer important questions like, “Which products tend to be purchased together?” One of the techniques used to answer questions using such data is regression analysis. In this chapter we discuss the use of regression analysis as a causal forecasting method.

In Section 8.1 we discuss the various kinds of time series that a forecaster might be faced with in practice. These include a constant or horizontal pattern, a trend, a seasonal pattern, both a trend and a seasonal pattern, and a cyclical pattern. To build a quantitative forecasting model it is also necessary to have a measurement of forecast accuracy. Different measurements of forecast accuracy, as well as their respective advantages and disadvantages, are discussed in Section 8.2. In Section 8.3 we consider the simplest case, which is a horizontal or constant pattern. For this pattern, we develop the classical moving average, weighted moving average, and exponential smoothing models. Many time series have a trend, and taking this trend into account is important; in Section 8.4 we provide regression models for finding the best model parameters when a linear trend is present, when the data show a seasonal pattern, or when the variable to be predicted has a causal relationship with other variables. Finally, in Section 8.5 we discuss considerations to be made when determining the best forecasting model to use.

A forecast is simply a prediction of what will happen in the future. Managers must accept that regardless of the technique used, they will not be able to develop perfect forecasts.

NOTES + COMMENTS

Virtually all large companies today rely on enterprise resource planning (ERP) software to aid in their planning and operations. These software systems help the business run smoothly by collecting and efficiently storing company data, enabling it to be shared company-wide for planning at all levels: strategically, tactically, and operationally. Most ERP systems include a forecasting module to help plan for the future. SAP, one of the most widely used ERP systems, includes a forecasting component. This module allows the user to select from a number of forecasting techniques and/or have the system find a “best” model. The various forecasting methods and ways to measure the quality of a forecasting model discussed in this chapter are routinely available in software that supports forecasting.

8.1 Time Series Patterns 375

Gasoline

8.1 Time Series Patterns

A time series is a sequence of observations on a variable measured at successive points in time or over successive periods of time. The measurements may be taken every hour, day, week, month, year, or at any other regular interval. The pattern of the data is an important factor in understanding how the time series has behaved in the past. If such behavior can be expected to continue in the future, we can use it to guide us in selecting an appropriate forecasting method.

To identify the underlying pattern in the data, a useful first step is to construct a time series plot, which is a graphical presentation of the relationship between time and the time series variable; time is represented on the horizontal axis and values of the time series variable are shown on the vertical axis. Let us first review some of the common types of data patterns that can be identified in a time series plot.

Horizontal Pattern

A horizontal pattern exists when the data fluctuate randomly around a constant mean over time. To illustrate a time series with a horizontal pattern, consider the 12 weeks of data in Table 8.1. These data show the number of gallons of gasoline (in 1,000s) sold by a gasoline distributor in Bennington, Vermont, over the past 12 weeks. The average value, or mean, for this time series is 19.25, or 19,250 gallons per week. Figure 8.1 shows a time series plot for these data. Note how the data fluctuate around the sample mean of 19,250 gallons. Although random variability is present, we would say that these data follow a horizontal pattern.

The term stationary time series is used to denote a time series whose statistical properties are independent of time. In particular this means that:

1. The process generating the data has a constant mean.
2. The variability of the time series is constant over time.

A time series plot for a stationary time series will always exhibit a horizontal pattern with random fluctuations. However, simply observing a horizontal pattern is not sufficient evidence to conclude that the time series is stationary. More advanced texts on forecasting discuss procedures for determining whether a time series is stationary and provide methods for transforming a nonstationary time series into a stationary series.
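As a quick check on the figures quoted for the gasoline data, the following sketch recomputes the sample mean from the 12 weekly values. The split-half comparison is our own informal illustration and is not a formal test for stationarity; the variable names are hypothetical.

```python
# Weekly gasoline sales (1,000s of gallons), weeks 1-12, from Table 8.1
sales = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]

# Sample mean of the whole series
mean_sales = sum(sales) / len(sales)

# Informal illustration only: compare the means of the two halves.
# Similar values are consistent with a horizontal pattern but do not prove stationarity.
half = len(sales) // 2
first_half_mean = sum(sales[:half]) / half
second_half_mean = sum(sales[half:]) / (len(sales) - half)

print(f"Overall mean: {mean_sales:.2f} thousand gallons per week")
print(f"First-half mean: {first_half_mean:.2f}, second-half mean: {second_half_mean:.2f}")
```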

We limit our discussion to time series for which the values of the series are recorded at equal intervals. Cases in which the observations are made at unequal intervals are beyond the scope of this text.

In Chapter 2 we discussed line charts, which are often used to graph time series.

For a formal definition of stationarity, see K. Ord and R. Fildes, Principles of Business Forecasting (Mason, OH: Cengage Learning, 2012), p. 155.

TABLE 8.1 Gasoline Sales Time Series

Week    Sales (1,000s of gallons)
1       17
2       21
3       19
4       23
5       18
6       16
7       20
8       18
9       22
10      20
11      15
12      22


FIGURE 8.1 Gasoline Sales Time Series Plot [figure not reproduced; horizontal axis: Week, vertical axis: Sales (1,000s of gallons)]

Changes in business conditions often result in a time series with a horizontal pattern that shifts to a new level at some point in time. For instance, suppose the gasoline distributor signs a contract with the Vermont State Police to provide gasoline for state police cars located in southern Vermont beginning in week 13. With this new contract, the distributor naturally expects to see a substantial increase in weekly sales starting in week 13. Table 8.2 shows the number of gallons of gasoline sold for the original time series and for the 10 weeks after signing the new contract. Figure 8.2 shows the corresponding time series plot. Note the increased level of the time series beginning in week 13. This change in the level of the time series makes it more difficult to choose an appropriate forecasting method. Selecting a forecasting method that adapts well to changes in the level of a time series is an important consideration in many practical applications.

TABLE 8.2 Gasoline Sales Time Series after Obtaining the Contract with the Vermont State Police (data file: GasolineRevised)

Week    Sales (1,000s of gallons)    Week    Sales (1,000s of gallons)
1       17                           12      22
2       21                           13      31
3       19                           14      34
4       23                           15      31
5       18                           16      33
6       16                           17      28
7       20                           18      32
8       18                           19      30
9       22                           20      29
10      20                           21      34
11      15                           22      33
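To make the size of the level shift concrete, the sketch below compares the average weekly sales before and after the contract takes effect, using the values in Table 8.2. The variable names are ours and purely illustrative.

```python
# Table 8.2: weeks 1-12 precede the contract, weeks 13-22 follow it
before = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]
after = [31, 34, 31, 33, 28, 32, 30, 29, 34, 33]

mean_before = sum(before) / len(before)
mean_after = sum(after) / len(after)

# Difference between the two levels, in 1,000s of gallons per week
shift = mean_after - mean_before

print(f"Mean before contract: {mean_before:.2f}")
print(f"Mean after contract:  {mean_after:.2f}")
print(f"Shift in level:       {shift:.2f}")
```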

Trend Pattern

Although time series data generally exhibit random fluctuations, a time series may also show gradual shifts or movements to relatively higher or lower values over a longer period of time. If a time series plot exhibits this type of behavior, we say that a trend pattern exists. A trend is usually the result of long-term factors such as population increases or decreases, shifting demographic characteristics of the population, improving technology, changes in the competitive landscape, and/or changes in consumer preferences.

To illustrate a time series with a linear trend pattern, consider the time series of bicycle sales for a particular manufacturer over the past 10 years, as shown in Table 8.3 and Figure 8.3. Note that a total of 21,600 bicycles were sold in year 1, a total of 22,900 in year 2, and so on. In year 10, the most recent year, 31,400 bicycles were sold. Visual inspection of the time series plot shows some up-and-down movement over the past 10 years, but the time series seems also to have a systematically increasing, or upward, trend.

The trend for the bicycle sales time series appears to be linear and increasing over time, but sometimes a trend can be described better by other types of patterns. For instance, the data in Table 8.4 and the corresponding time series plot in Figure 8.4 show the sales revenue for a cholesterol drug since the company won FDA approval for the drug 10 years ago. The time series increases in a nonlinear fashion; that is, revenue does not increase by a constant amount from one year to the next. In fact, the revenue appears to be growing in an exponential fashion. Exponential relationships such as this are appropriate when the percentage change from one period to the next is relatively constant.
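The contrast between a linear and an exponential trend can be explored numerically. The sketch below is an illustration under our own assumptions: it fits an ordinary least-squares line to the bicycle data of Table 8.3 (a technique the chapter develops later, in Section 8.4) and computes year-over-year percentage changes and an average growth factor for the cholesterol drug data of Table 8.4. The helper `fit_linear_trend` is a hypothetical name.

```python
def fit_linear_trend(values):
    """Least-squares line y = b0 + b1 * t for periods t = 1, 2, ..., len(values)."""
    n = len(values)
    periods = range(1, n + 1)
    t_bar = sum(periods) / n
    y_bar = sum(values) / n
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(periods, values))
    sxx = sum((t - t_bar) ** 2 for t in periods)
    b1 = sxy / sxx            # estimated change per period
    b0 = y_bar - b1 * t_bar   # estimated level at t = 0
    return b0, b1

# Table 8.3: bicycle sales (1,000s), years 1-10
bicycle = [21.6, 22.9, 25.5, 21.9, 23.9, 27.5, 31.5, 29.7, 28.6, 31.4]
# Table 8.4: cholesterol drug revenue ($ millions), years 1-10
cholesterol = [23.1, 21.3, 27.4, 34.6, 33.8, 43.2, 59.5, 64.4, 74.2, 99.3]

b0, b1 = fit_linear_trend(bicycle)
print(f"Bicycle trend line: sales = {b0:.2f} + {b1:.2f} * year")

# Year-over-year percentage changes in revenue
pct_changes = [100 * (cur - prev) / prev
               for prev, cur in zip(cholesterol, cholesterol[1:])]
# Average growth factor per year over the nine year-to-year steps
avg_growth = (cholesterol[-1] / cholesterol[0]) ** (1 / (len(cholesterol) - 1))

print("Revenue % changes:", [round(p, 1) for p in pct_changes])
print(f"Average growth factor per year: {avg_growth:.3f}")
```

The individual percentage changes vary from year to year, so the reader can judge how closely the data follow a constant-percentage growth pattern.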

FIGURE 8.2 Gasoline Sales Time Series Plot after Obtaining the Contract with the Vermont State Police [figure not reproduced; horizontal axis: Week, vertical axis: Sales (1,000s of gallons)]

TABLE 8.3 Bicycle Sales Time Series (data file: Bicycle)

Year    Sales (1,000s)
1       21.6
2       22.9
3       25.5
4       21.9
5       23.9
6       27.5
7       31.5
8       29.7
9       28.6
10      31.4

FIGURE 8.3 Bicycle Sales Time Series Plot [figure not reproduced; horizontal axis: Year, vertical axis: Sales (1,000s)]

Seasonal Pattern

The trend of a time series can be identified by analyzing movements in historical data over multiple time periods. Seasonal patterns are recognized by observing recurring patterns over successive periods of time. For example, a retailer who sells bathing suits expects low sales activity in the fall and winter months, with peak sales in the spring and summer months to occur every year. Retailers who sell snow removal equipment and heavy clothing, however, expect the opposite yearly pattern. Not surprisingly, the pattern for a time series plot that exhibits a recurring pattern over a one-year period due to seasonal influences is called a seasonal pattern. Although we generally think of seasonal movement in a time series as occurring over one year, time series data can also exhibit seasonal patterns of less than one year in duration. For example, daily traffic volume shows within-the-day “seasonal” behavior, with peak levels occurring during rush hours, moderate flow during the rest of the day and early evening, and light flow from midnight to early morning. Another example of an industry with sales that exhibit easily discernible seasonal patterns within a day is the restaurant industry.

As an example of a seasonal pattern, consider the number of umbrellas sold at a clothing store over the past five years. Table 8.5 shows the time series and Figure 8.5 shows the corresponding time series plot. The time series plot does not indicate a long-term trend in sales. In fact, unless you look carefully at the data, you might conclude that the data follow a horizontal pattern with random fluctuation. However, closer inspection of the fluctuations in the time series plot reveals a systematic pattern in the data that occurs within each year. Specifically, the first and third quarters have moderate sales, the second quarter has the highest sales, and the fourth quarter has the lowest sales volume. Thus, we would conclude that a quarterly seasonal pattern is present.
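One simple way to see the within-year pattern described above is to average the umbrella sales by quarter across the five years of Table 8.5. This is only an exploratory sketch with illustrative variable names, not a formal seasonal model.

```python
# Table 8.5: umbrella sales, one row per year, columns are quarters 1-4
umbrella = [
    [125, 153, 106, 88],   # year 1
    [118, 161, 133, 102],  # year 2
    [138, 144, 113, 80],   # year 3
    [109, 137, 125, 109],  # year 4
    [130, 165, 128, 96],   # year 5
]

# Average sales for each quarter across all years
quarter_means = [sum(year[q] for year in umbrella) / len(umbrella)
                 for q in range(4)]

for q, m in enumerate(quarter_means, start=1):
    print(f"Quarter {q}: average sales = {m:.1f}")
```

The quarterly averages make the ordering explicit: quarter 2 is highest, quarter 4 is lowest, and quarters 1 and 3 fall in between.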

TABLE 8.4 Cholesterol Drug Revenue Time Series (data file: Cholesterol)

Year    Revenue ($ millions)
1       23.1
2       21.3
3       27.4
4       34.6
5       33.8
6       43.2
7       59.5
8       64.4
9       74.2
10      99.3

FIGURE 8.4 Cholesterol Drug Revenue Time Series Plot ($ Millions) [figure not reproduced; horizontal axis: Year, vertical axis: Revenue]

TABLE 8.5 Umbrella Sales Time Series (data file: Umbrella)

Year    Quarter    Sales
1       1          125
        2          153
        3          106
        4          88
2       1          118
        2          161
        3          133
        4          102
3       1          138
        2          144
        3          113
        4          80
4       1          109
        2          137
        3          125
        4          109
5       1          130
        2          165
        3          128
        4          96

FIGURE 8.5 Umbrella Sales Time Series Plot [figure not reproduced; horizontal axis: Time Period, vertical axis: Sales]

Trend and Seasonal Pattern

Some time series include both a trend and a seasonal pattern. For instance, the data in Table 8.6 and the corresponding time series plot in Figure 8.6 show quarterly smartphone sales for a particular manufacturer over the past four years. Clearly an increasing trend is present. However, Figure 8.6 also indicates that sales are lowest in the second quarter of each year and highest in quarters 3 and 4. Thus, we conclude that a seasonal pattern also exists for smartphone sales. In such cases we need to use a forecasting method that is capable of dealing with both trend and seasonality.
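Both effects can be seen with simple averages of the smartphone data in Table 8.6: yearly averages reveal the trend, and quarterly averages reveal the seasonal pattern. The sketch below is exploratory only, and its variable names are illustrative.

```python
# Table 8.6: quarterly smartphone sales ($1,000s), one row per year
smartphone = [
    [4.8, 4.1, 6.0, 6.5],  # year 1
    [5.8, 5.2, 6.8, 7.4],  # year 2
    [6.0, 5.6, 7.5, 7.8],  # year 3
    [6.3, 5.9, 8.0, 8.4],  # year 4
]

# Trend: average sales per quarter within each year
year_means = [sum(year) / len(year) for year in smartphone]

# Seasonality: average sales for each quarter across years
quarter_means = [sum(year[q] for year in smartphone) / len(smartphone)
                 for q in range(4)]

print("Yearly averages:   ", [round(m, 3) for m in year_means])
print("Quarterly averages:", [round(m, 3) for m in quarter_means])
```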

TABLE 8.6 Quarterly Smartphone Sales Time Series (data file: SmartPhoneSales)

Year    Quarter    Sales ($1,000s)
1       1          4.8
        2          4.1
        3          6.0
        4          6.5
2       1          5.8
        2          5.2
        3          6.8
        4          7.4
3       1          6.0
        2          5.6
        3          7.5
        4          7.8
4       1          6.3
        2          5.9
        3          8.0
        4          8.4

FIGURE 8.6 Quarterly Smartphone Sales Time Series Plot [figure not reproduced; horizontal axis: Period, vertical axis: Quarterly Smartphone Sales ($1,000s)]

Cyclical Pattern

A cyclical pattern exists if the time series plot shows an alternating sequence of points below and above the trendline that lasts for more than one year. Many economic time series exhibit cyclical behavior with regular runs of observations below and above the trendline. Often the cyclical component of a time series is due to multiyear business cycles. For example, periods of moderate inflation followed by periods of rapid inflation can lead to a time series that alternates below and above a generally increasing trendline (e.g., a time series of housing costs). Business cycles are extremely difficult, if not impossible, to forecast. As a result, cyclical effects are often combined with long-term trend effects and referred to as trend-cycle effects. In this chapter we do not deal with cyclical effects that may be present in the time series.

Identifying Time Series Patterns

The underlying pattern in the time series is an important factor in selecting a forecasting method. Thus, a time series plot should be one of the first analytic tools employed when trying to determine which forecasting method to use. If we see a horizontal pattern, then we need to select a method appropriate for this type of pattern. Similarly, if we observe a trend in the data, then we need to use a forecasting method that is capable of handling the conjectured type of trend effectively. In the next section we discuss methods for assessing forecast accuracy.

8.2 Forecast Accuracy

In this section we begin by developing forecasts for the gasoline time series shown in Table 8.1 using the simplest of all the forecasting methods. We use the most recent week’s sales volume as the forecast for the next week. For instance, the distributor sold 17 thousand gallons of gasoline in week 1; this value is used as the forecast for week 2. Next, we use 21, the actual value of sales in week 2, as the forecast for week 3, and so on. The forecasts obtained for the historical data using this method are shown in Table 8.7 in the Forecast column. Because of its simplicity, this method is often referred to as a naïve forecasting method.

TABLE 8.7 Computing Forecasts and Measures of Forecast Accuracy Using the Most Recent Value as the Forecast for the Next Period

Week     Time Series   Forecast   Forecast   Absolute Value of   Squared Forecast   Percentage   Absolute Value of
         Value                    Error      Forecast Error      Error              Error        Percentage Error
1        17
2        21            17          4         4                   16                  19.05       19.05
3        19            21         −2         2                    4                 −10.53       10.53
4        23            19          4         4                   16                  17.39       17.39
5        18            23         −5         5                   25                 −27.78       27.78
6        16            18         −2         2                    4                 −12.50       12.50
7        20            16          4         4                   16                  20.00       20.00
8        18            20         −2         2                    4                 −11.11       11.11
9        22            18          4         4                   16                  18.18       18.18
10       20            22         −2         2                    4                 −10.00       10.00
11       15            20         −5         5                   25                 −33.33       33.33
12       22            15          7         7                   49                  31.82       31.82
Totals                             5        41                  179                   1.19      211.69

8.2 Forecast Accuracy 383

How accurate are the forecasts obtained using this naïve forecasting method? To answer this question, we will introduce several measures of forecast accuracy. These measures are used to determine how well a particular forecasting method is able to reproduce the time series data that are already available. By selecting the method that is most accurate for the data already observed, we hope to increase the likelihood that we will obtain more accurate forecasts for future time periods. The key concept associated with measuring forecast accuracy is forecast error. If we denote $y_t$ and $\hat{y}_t$ as the actual and forecasted values of the time series for period $t$, respectively, the forecasting error for period $t$ is as follows:

FORECAST ERROR

$e_t = y_t - \hat{y}_t$ (8.1)

That is, the forecast error for time period t is the difference between the actual and the forecasted values for period t.

For instance, because the distributor actually sold 21 thousand gallons of gasoline in week 2, and the forecast, using the sales volume in week 1, was 17 thousand gallons, the forecast error in week 2 is

$e_2 = y_2 - \hat{y}_2 = 21 - 17 = 4$

A positive error such as this indicates that the forecasting method underestimated the actual value of sales for the associated period. Next we use 21, the actual value of sales in week 2, as the forecast for week 3. Since the actual value of sales in week 3 is 19, the forecast error for week 3 is $e_3 = 19 - 21 = -2$. In this case, the negative forecast error indicates that the forecast overestimated the actual value for week 3. Thus, the forecast error may be positive or negative, depending on whether the forecast is too low or too high. A complete summary of the forecast errors for this naïve forecasting method is shown in Table 8.7 in the Forecast Error column. It is important to note that because we are using a past value of the time series to produce a forecast for period $t$, we do not have sufficient data to produce a naïve forecast for the first week of this time series.

A simple measure of forecast accuracy is the mean or average of the forecast errors. If we have n periods in our time series and k is the number of periods at the beginning of the time series for which we cannot produce a naïve forecast, the mean forecast error (MFE) is as follows:

MEAN FORECAST ERROR (MFE)

$\text{MFE} = \dfrac{\sum_{t=k+1}^{n} e_t}{n - k}$ (8.2)

Table 8.7 shows that the sum of the forecast errors for the gasoline sales time series is 5; thus, the mean, or average, error is $5/11 = 0.455$. Because we do not have sufficient data to produce a naïve forecast for the first week of this time series, we must adjust our calculations in both the numerator and denominator accordingly. This is common in forecasting; we often use $k$ past periods from the time series to produce forecasts, and so we frequently cannot produce forecasts for the first $k$ periods. In those instances the summation in the numerator starts at the first value of $t$ for which we have produced a forecast (so we begin the summation at $t = k + 1$), and the denominator (which is the number of periods in our time series for which we are able to produce a forecast) will also reflect these circumstances. In the gasoline example, although the time series consists of $n = 12$ values, to compute the mean error we divided the sum of the forecast errors by 11 because there are only 11 forecast errors (we cannot generate forecast sales for the first week using this naïve forecasting method).
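The naïve forecasts, the forecast errors of equation (8.1), and the mean forecast error of equation (8.2) can be reproduced with a few lines of code. The following is a minimal sketch using the Table 8.1 data; the variable names are illustrative.

```python
# Weekly gasoline sales (1,000s of gallons), weeks 1-12, from Table 8.1
sales = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]

k = 1                   # the naive method cannot forecast the first k = 1 periods
n = len(sales)          # n = 12 observations

# Naive method: the forecast for week t + 1 is the actual value in week t
forecasts = sales[:-1]  # forecasts for weeks 2..12
actuals = sales[k:]     # actual values for weeks 2..12

# Forecast error e_t = y_t - y_hat_t for t = k + 1, ..., n  (equation 8.1)
errors = [y - y_hat for y, y_hat in zip(actuals, forecasts)]

# Mean forecast error (equation 8.2): sum of errors divided by n - k
mfe = sum(errors) / (n - k)

print("Forecast errors:", errors)
print(f"MFE = {sum(errors)}/{n - k} = {mfe:.3f}")
```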


Also note that in the gasoline time series, the mean forecast error is positive, implying that the method is generally under-forecasting; in other words, the observed values tend to be greater than the forecasted values. Because positive and negative forecast errors tend to offset each other, the mean error is likely to be small; thus, the mean error is not a very useful measure of forecast accuracy.

The mean absolute error (MAE) is a measure of forecast accuracy that avoids the problem of positive and negative forecast errors offsetting each other. As you might expect given its name, MAE is the average of the absolute values of the forecast errors:

MEAN ABSOLUTE ERROR (MAE)

$\text{MAE} = \dfrac{\sum_{t=k+1}^{n} |e_t|}{n - k}$ (8.3)

The MAE is also referred to as the mean absolute deviation (MAD).

Table 8.7 shows that the sum of the absolute values of the forecast errors is 41; thus:

MAE = average of the absolute value of the forecast errors = $\dfrac{41}{11} = 3.73$

Another measure that avoids the problem of positive and negative errors offsetting each other is obtained by computing the average of the squared forecast errors. This measure of forecast accuracy is referred to as the mean squared error (MSE):

MEAN SQUARED ERROR (MSE)

$\text{MSE} = \dfrac{\sum_{t=k+1}^{n} e_t^2}{n - k}$ (8.4)

From Table 8.7, the sum of the squared errors is 179; hence

MSE = average of the square of the forecast errors = $\dfrac{179}{11} = 16.27$

The size of the MAE or MSE depends on the scale of the data. As a result, it is difficult to make comparisons for different time intervals (such as comparing a method of forecasting monthly gasoline sales to a method of forecasting weekly sales) or to make comparisons across different time series (such as monthly sales of gasoline and monthly sales of oil filters). To make comparisons such as these we need to work with relative or percentage error measures. The mean absolute percentage error (MAPE) is such a measure. To calculate MAPE we use the following formula:

MEAN ABSOLUTE PERCENTAGE ERROR (MAPE)

$\text{MAPE} = \dfrac{\sum_{t=k+1}^{n} \left| \left( \dfrac{e_t}{y_t} \right) 100 \right|}{n - k}$ (8.5)

Table 8.7 shows that the sum of the absolute values of the percentage errors is

$\sum_{t=k+1}^{n} \left| \left( \dfrac{e_t}{y_t} \right) 100 \right| = 211.69$

Thus, the MAPE, which is the average of the absolute value of percentage forecast errors, is

$\dfrac{211.69}{11} = 19.24\%$

These measures of forecast accuracy simply measure how well the forecasting method is able to forecast historical values of the time series. Now, suppose we want to forecast sales for a future time period, such as week 13. The forecast for week 13 is 22, the actual value of the time series in week 12. Is this an accurate estimate of sales for week 13? Unfortunately, there is no way to address the issue of accuracy associated with forecasts for future time periods. However, if we select a forecasting method that works well for the historical data, and we have reason to believe the historical pattern will continue into the future, we should obtain forecasts that will ultimately be shown to be accurate.
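The three accuracy measures for the naïve method can be checked against Table 8.7 with a short computation. In the sketch below, `accuracy_measures` is a hypothetical helper name; it applies equations (8.3) through (8.5) to the periods for which a forecast exists.

```python
def accuracy_measures(actuals, forecasts):
    """Return MAE, MSE, and MAPE for paired actual and forecast values.

    Both lists cover only the periods t = k + 1, ..., n that have a forecast,
    so their common length plays the role of n - k in equations (8.3)-(8.5).
    """
    errors = [y - y_hat for y, y_hat in zip(actuals, forecasts)]
    m = len(errors)
    mae = sum(abs(e) for e in errors) / m
    mse = sum(e ** 2 for e in errors) / m
    mape = sum(abs(100 * e / y) for e, y in zip(errors, actuals)) / m
    return {"MAE": mae, "MSE": mse, "MAPE": mape}

# Weekly gasoline sales (1,000s of gallons), weeks 1-12, from Table 8.1
sales = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]

# Naive method: actuals for weeks 2..12 paired with the prior week's value
naive = accuracy_measures(sales[1:], sales[:-1])

for name, value in naive.items():
    print(f"{name}: {value:.2f}")
```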

Before concluding this section, let us consider another method for forecasting the gasoline sales time series in Table 8.1. Suppose we use the average of all the historical data available as the forecast for the next period. We begin by developing a forecast for week 2. Because there is only one historical value available prior to week 2, the forecast for week 2 is just the time series value in week 1; thus, the forecast for week 2 is 17 thousand gallons of gasoline. To compute the forecast for week 3, we take the average of the sales values in weeks 1 and 2. Thus

$\hat{y}_3 = \dfrac{17 + 21}{2} = 19$

Similarly, the forecast for week 4 is

$\hat{y}_4 = \dfrac{17 + 21 + 19}{3} = 19$

TABLE 8.8 Computing Forecasts and Measures of Forecast Accuracy Using the Average of All the Historical Data as the Forecast for the Next Period

Week     Time Series   Forecast   Forecast   Absolute Value of   Squared Forecast   Percentage   Absolute Value of
         Value                    Error      Forecast Error      Error              Error        Percentage Error
1        17
2        21            17.00       4.00      4.00                16.00               19.05       19.05
3        19            19.00       0.00      0.00                 0.00                0.00        0.00
4        23            19.00       4.00      4.00                16.00               17.39       17.39
5        18            20.00      −2.00      2.00                 4.00              −11.11       11.11
6        16            19.60      −3.60      3.60                12.96              −22.50       22.50
7        20            19.00       1.00      1.00                 1.00                5.00        5.00
8        18            19.14      −1.14      1.14                 1.31               −6.35        6.35
9        22            19.00       3.00      3.00                 9.00               13.64       13.64
10       20            19.33       0.67      0.67                 0.44                3.33        3.33
11       15            19.40      −4.40      4.40                19.36              −29.33       29.33
12       22            19.00       3.00      3.00                 9.00               13.64       13.64
Totals                             4.52     26.81                89.07                2.75      141.34

The forecasts obtained using this method for the gasoline time series are shown in Table 8.8 in the Forecast column. Using the results shown in Table 8.8, we obtain the following values of MAE, MSE, and MAPE:

$\text{MAE} = \dfrac{26.81}{11} = 2.44$

$\text{MSE} = \dfrac{89.07}{11} = 8.10$

$\text{MAPE} = \dfrac{141.34}{11} = 12.85\%$

We can now compare the accuracy of the two forecasting methods we have considered in this section by comparing the values of MAE, MSE, and MAPE for each method.

        Naïve Method    Average of All Past Values
MAE     3.73            2.44
MSE     16.27           8.10
MAPE    19.24%          12.85%

As measured by MAE, MSE, and MAPE, the average of all past weekly gasoline sales provides more accurate forecasts for the next week than using the most recent week’s gasoline sales.
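The values reported for the average-of-all-past-values method can be reproduced as follows. This is a minimal sketch with illustrative variable names; the forecast for week t + 1 is the mean of weeks 1 through t.

```python
# Weekly gasoline sales (1,000s of gallons), weeks 1-12, from Table 8.1
sales = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]

# Forecast for week t + 1 is the average of the first t observations (t = 1..11)
forecasts = [sum(sales[:t]) / t for t in range(1, len(sales))]
actuals = sales[1:]  # actual values for weeks 2..12

errors = [y - y_hat for y, y_hat in zip(actuals, forecasts)]
m = len(errors)      # number of periods with a forecast (n - k = 11)

mae = sum(abs(e) for e in errors) / m
mse = sum(e ** 2 for e in errors) / m
mape = sum(abs(100 * e / y) for e, y in zip(errors, actuals)) / m

print(f"MAE  = {mae:.2f}")
print(f"MSE  = {mse:.2f}")
print(f"MAPE = {mape:.2f}%")
```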

Evaluating different forecasts based on historical accuracy is helpful only if historical patterns continue into the future. As we noted in Section 8.1, the 12 observations of Table 8.1 comprise a stationary time series. In Section 8.1, we also mentioned that changes in business conditions often result in a time series that is not stationary. We discussed a situation in which the gasoline distributor signed a contract with the Vermont State Police to provide gasoline for state police cars located in southern Vermont. Table 8.2 shows the number of gallons of gasoline sold for the original time series and for the 10 weeks after signing the new contract, and Figure 8.2 shows the corresponding time series plot. Note the change in level in week 13 for the resulting time series. When a shift to a new level such as this occurs, it takes several periods for the forecasting method that uses the average of all the historical data to adjust to the new level of the time series. However, in this case the simple naïve method adjusts very rapidly to the change in level because it uses only the most recent observation as the forecast.
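The difference in how quickly the two methods adjust can be quantified by computing the MAE of each method over weeks 13 through 22 of the extended series in Table 8.2. Restricting the comparison to the post-contract weeks is our own choice for illustration; the function and variable names are hypothetical.

```python
# Table 8.2: gasoline sales (1,000s of gallons), weeks 1-22
series = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22,
          31, 34, 31, 33, 28, 32, 30, 29, 34, 33]

weeks = range(2, len(series) + 1)  # weeks that have a forecast (1-indexed)

# Naive method: forecast for week t is the actual value in week t - 1
naive = {t: series[t - 2] for t in weeks}
# Average method: forecast for week t is the mean of weeks 1 .. t - 1
average = {t: sum(series[:t - 1]) / (t - 1) for t in weeks}

def mae(forecasts, first_week, last_week):
    """Mean absolute error over weeks first_week..last_week (inclusive)."""
    errs = [abs(series[t - 1] - forecasts[t])
            for t in range(first_week, last_week + 1)]
    return sum(errs) / len(errs)

mae_naive_after = mae(naive, 13, 22)
mae_average_after = mae(average, 13, 22)

print(f"MAE, weeks 13-22, naive method:   {mae_naive_after:.2f}")
print(f"MAE, weeks 13-22, average method: {mae_average_after:.2f}")
```

Over these ten weeks the ranking of the two methods reverses relative to the original 12 weeks, which illustrates why historical accuracy alone is not a sufficient basis for choosing a method.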

Measures of forecast accuracy are important factors in comparing different forecasting methods, but we have to be careful not to rely too heavily on them. Good judgment and knowledge about business conditions that might affect the value of the variable to be forecast also have to be considered carefully when selecting a method. Historical forecast accuracy is not the sole consideration, especially if the pattern exhibited by the time series is likely to change in the future.

In the next section, we will introduce more sophisticated methods for developing forecasts for a time series that exhibits a horizontal pattern. Using the measures of forecast accuracy developed here, we will be able to assess whether such methods provide more accurate forecasts than we obtained using the simple approaches illustrated in this section. The methods that we will introduce also have the advantage that they adapt well to situations in which the time series changes to a new level. The ability of a forecasting method to adapt quickly to changes in level is an important consideration, especially in short-term forecasting situations.

8.3 Moving Averages and Exponential Smoothing

In this section we discuss two forecasting methods that are appropriate for a time series with a horizontal pattern: moving averages and exponential smoothing. These methods are capable of adapting well to changes in the level of a horizontal pattern such as the one we saw with the extended gasoline sales time series (Table 8.2 and Figure 8.2). However, without modification they are not appropriate when considerable trend, cyclical, or seasonal effects are present. Because the objective of each of these methods is to smooth out random fluctuations in the time series, they are referred to as smoothing methods. These methods are easy to use and generally provide a high level of accuracy for short-range forecasts, such as a forecast for the next time period.

Moving Averages

The moving average method uses the average of the most recent k data values in the time series as the forecast for the next period. Mathematically, a moving average forecast of order k is:

MOVING AVERAGE FORECAST

$$\hat{y}_{t+1} = \frac{\sum (\text{most recent } k \text{ data values})}{k} = \frac{\sum_{i=t-k+1}^{t} y_i}{k} = \frac{y_{t-k+1} + \cdots + y_{t-1} + y_t}{k} \tag{8.6}$$

where

$\hat{y}_{t+1}$ = forecast of the time series for period $t + 1$

$y_t$ = actual value of the time series in period $t$

$k$ = number of periods of time series data used to generate the forecast

The term moving is used because every time a new observation becomes available for the time series, it replaces the oldest observation in the equation and a new average is computed. Thus, the periods over which the average is calculated change, or move, with each ensuing period.

To illustrate the moving averages method, let us return to the original 12 weeks of gasoline sales data in Table 8.1. The time series plot in Figure 8.1 indicates that the gasoline sales time series has a horizontal pattern. Thus, the smoothing methods of this section are applicable.

To use moving averages to forecast a time series, we must first select the order k, or the number of time series values to be included in the moving average. If only the most recent values of the time series are considered relevant, a small value of k is preferred. If a greater number of past values are considered relevant, then we generally opt for a larger value of k. As previously mentioned, a time series with a horizontal pattern can shift to a new level over time. A moving average will adapt to the new level of the series and continue to provide good forecasts in k periods. Thus a smaller value of k will track shifts in a time series more quickly (the naïve approach discussed earlier is actually a moving average for $k = 1$). On the other hand, larger values of k will be more effective in smoothing out random fluctuations. Thus, managerial judgment based on an understanding of the behavior of a time series is helpful in choosing an appropriate value of k.

To illustrate how moving averages can be used to forecast gasoline sales, we will use a three-week moving average ($k = 3$). We begin by computing the forecast of sales in week 4 using the average of the time series values in weeks 1 to 3:

$$\hat{y}_4 = \text{average for weeks 1 to 3} = \frac{17 + 21 + 19}{3} = 19$$

Thus, the moving average forecast of sales in week 4 is 19, or 19,000 gallons of gasoline. Because the actual value observed in week 4 is 23, the forecast error in week 4 is $e_4 = 23 - 19 = 4$.

We next compute the forecast of sales in week 5 by averaging the time series values in weeks 2 to 4:

$$\hat{y}_5 = \text{average for weeks 2 to 4} = \frac{21 + 19 + 23}{3} = 21$$

388 chapter 8 Time Series Analysis and Forecasting

Hence, the forecast of sales in week 5 is 21 and the error associated with this forecast is $e_5 = 18 - 21 = -3$. A complete summary of the three-week moving average forecasts for the gasoline sales time series is provided in Table 8.9. Figure 8.7 shows the original time series plot and the three-week moving average forecasts. Note how the graph of the moving average forecasts has tended to smooth out the random fluctuations in the time series.

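TABLE 8.9 Summary of Three-Week Moving Average Calculations

| Week | Time Series Value | Forecast | Forecast Error | Absolute Value of Forecast Error | Squared Forecast Error | Percentage Error | Absolute Value of Percentage Error |
|---|---|---|---|---|---|---|---|
| 1 | 17 | | | | | | |
| 2 | 21 | | | | | | |
| 3 | 19 | | | | | | |
| 4 | 23 | 19 | 4 | 4 | 16 | 17.39 | 17.39 |
| 5 | 18 | 21 | −3 | 3 | 9 | −16.67 | 16.67 |
| 6 | 16 | 20 | −4 | 4 | 16 | −25.00 | 25.00 |
| 7 | 20 | 19 | 1 | 1 | 1 | 5.00 | 5.00 |
| 8 | 18 | 18 | 0 | 0 | 0 | 0.00 | 0.00 |
| 9 | 22 | 18 | 4 | 4 | 16 | 18.18 | 18.18 |
| 10 | 20 | 20 | 0 | 0 | 0 | 0.00 | 0.00 |
| 11 | 15 | 20 | −5 | 5 | 25 | −33.33 | 33.33 |
| 12 | 22 | 19 | 3 | 3 | 9 | 13.64 | 13.64 |
| Totals | | | 0 | 24 | 92 | −20.79 | 129.21 |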
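The forecast column of Table 8.9 can be reproduced outside Excel. The following is a minimal Python sketch of equation (8.6), not the text's own implementation; the function name `moving_average_forecasts` is an illustrative choice, and the data are the 12 weekly values listed in Table 8.9.

```python
# Gasoline sales (1,000s of gallons) for weeks 1-12, as listed in Table 8.9.
sales = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]


def moving_average_forecasts(y, k):
    """Apply equation (8.6): forecast each period with the mean of the k prior values.

    Element j of the result is the forecast for period k + 1 + j (1-based),
    so the last element is the forecast for the first period beyond the data.
    """
    if not 1 <= k <= len(y):
        raise ValueError("k must be between 1 and len(y)")
    return [sum(y[t - k:t]) / k for t in range(k, len(y) + 1)]


# Three-week moving average: forecasts for weeks 4 through 13.
forecasts = moving_average_forecasts(sales, k=3)
for week, forecast in enumerate(forecasts, start=4):
    print(f"week {week}: forecast = {forecast:.2f}")
```

The final element of `forecasts` is the forecast for week 13, the first period for which no actual value is available yet.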

FIGURE 8.7 Gasoline Sales Time Series Plot and Three-Week Moving Average Forecasts (vertical axis: Sales (1,000s of gallons); horizontal axis: Week; plotted series include the three-week moving average forecasts)

To forecast sales in week 13, the next time period in the future, we simply compute the average of the time series values in weeks 10, 11, and 12:

$$\hat{y}_{13} = \text{average for weeks 10 to 12} = \frac{20 + 15 + 22}{3} = 19$$

Thus, the forecast for week 13 is 19, or 19,000 gallons of gasoline.

To show how Excel can be used to develop forecasts using the moving averages method, we develop a forecast for the gasoline sales time series in Table 8.1 and in the file Gasoline as displayed in Columns A and B of Figure 8.10.

The following steps can be used to produce a three-week moving average:

Step 1. Click the Data tab in the Ribbon

Step 2. Click Data Analysis in the Analyze group

Step 3. When the Data Analysis dialog box appears (Figure 8.8), select Moving Average and click OK

Step 4. When the Moving Average dialog box appears (Figure 8.9):

    Enter B2:B13 in the Input Range: box

    Enter 3 in the Interval: box

    Enter C3 in the Output Range: box

    Click OK

If Data Analysis does not appear in your Analyze group in the Data tab, you will have to load the Analysis Toolpak add-in into Excel. To do so, click the File tab in the Ribbon and click Options. When the Excel Options dialog box appears, click Add-Ins from the menu. Next to Manage, select Excel Add-ins and click Go... at the bottom of the dialog box. When the Add-Ins dialog box appears, select Analysis Toolpak and click OK.

FIGURE 8.8 Data Analysis Dialog Box

FIGURE 8.9 Moving Average Dialog Box

Once you have completed this step, the three-week moving average forecasts will appear in column C of the worksheet as shown in Figure 8.10. Note that forecasts for periods of other lengths can be computed easily by entering a different value in the Interval: box.

In Section 8.2 we discussed three measures of forecast accuracy: mean absolute error (MAE), mean squared error (MSE), and mean absolute percentage error (MAPE). Using the three-week moving average calculations in Table 8.9, the values for these three measures of forecast accuracy are as follows:

$$\text{MAE} = \frac{\sum_{t=4}^{12} |e_t|}{12 - 3} = \frac{24}{9} = 2.67$$

$$\text{MSE} = \frac{\sum_{t=4}^{12} e_t^2}{12 - 3} = \frac{92}{9} = 10.22$$

$$\text{MAPE} = \frac{\sum_{t=4}^{12} \left| \left( \frac{e_t}{y_t} \right) 100 \right|}{12 - 3} = \frac{129.21}{9} = 14.36\%$$

In Section 8.2 we showed that using the most recent observation as the forecast for the next week (a moving average of order $k = 1$) resulted in values of MAE = 3.73, MSE = 16.27, and MAPE = 19.24%. Thus, according to each of these three measures, the three-week moving average approach has provided more accurate forecasts than simply using the most recent observation as the forecast. Also note how we have revised the formulas for the MAE, MSE, and MAPE to reflect that our use of a three-week moving average leaves us with insufficient data to generate forecasts for the first three weeks of our time series.
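As a rough check on these figures, the three accuracy measures can be computed directly from the weekly data. This is a sketch under the same conventions as the text (errors are averaged only over the periods that have a forecast); the function name `accuracy_measures` is hypothetical.

```python
# Gasoline sales (1,000s of gallons) for weeks 1-12, as listed in Table 8.9.
sales = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]


def accuracy_measures(y, k):
    """Return (MAE, MSE, MAPE) for a moving average of order k.

    Errors exist only for periods k+1 .. n, because the first k periods
    have too little history to produce a forecast.
    """
    errors, pct_errors = [], []
    for t in range(k, len(y)):            # t = 0-based index of the period being forecast
        forecast = sum(y[t - k:t]) / k    # mean of the k preceding values
        error = y[t] - forecast
        errors.append(error)
        pct_errors.append(100 * error / y[t])
    m = len(errors)
    mae = sum(abs(e) for e in errors) / m
    mse = sum(e * e for e in errors) / m
    mape = sum(abs(p) for p in pct_errors) / m
    return mae, mse, mape


mae3, mse3, mape3 = accuracy_measures(sales, k=3)   # three-week moving average
mae1, mse1, mape1 = accuracy_measures(sales, k=1)   # naive forecast (k = 1)
print(f"k=3: MAE={mae3:.2f}  MSE={mse3:.2f}  MAPE={mape3:.2f}%")
print(f"k=1: MAE={mae1:.2f}  MSE={mse1:.2f}  MAPE={mape1:.2f}%")
```

Dividing by the number of available errors rather than by the number of observations mirrors the revised formulas above, where the denominator is 12 − 3 = 9 for the three-week moving average.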

FIGURE 8.10 Excel Output for Moving Average Forecast for Gasoline Data

| Row | A | B | C |
|---|---|---|---|
| 1 | Week | Sales (1,000s of gallons) | |
| 2 | 1 | 17 | |
| 3 | 2 | 21 | #N/A |
| 4 | 3 | 19 | #N/A |
| 5 | 4 | 23 | 19 |
| 6 | 5 | 18 | 21 |
| 7 | 6 | 16 | 20 |
| 8 | 7 | 20 | 19 |
| 9 | 8 | 18 | 18 |
| 10 | 9 | 22 | 18 |
| 11 | 10 | 20 | 20 |
| 12 | 11 | 15 | 20 |
| 13 | 12 | 22 | 19 |
| 14 | | | 19 |

To determine whether a moving average with a different order k can provide more accurate forecasts, we recommend using trial and error to determine the value of k that minimizes the MSE. For the gasoline sales time series, it can be shown that the minimum value of MSE corresponds to a moving average of order $k = 6$ with MSE = 6.79. If we are willing to assume that the order of the moving average that is best for the historical data will also be best for future values of the time series, the most accurate moving average forecasts of gasoline sales can be obtained using a moving average of order $k = 6$.
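The trial-and-error search over k is easy to automate. The sketch below evaluates every order from 1 to 11 on the 12 gasoline observations; the names `moving_average_mse` and `mse_by_k` are illustrative, and note that larger orders are scored on fewer forecast errors, exactly as in the hand calculation.

```python
# Gasoline sales (1,000s of gallons) for weeks 1-12, as listed in Table 8.9.
sales = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]


def moving_average_mse(y, k):
    """Mean squared forecast error for a moving average of order k."""
    squared_errors = [
        (y[t] - sum(y[t - k:t]) / k) ** 2   # (actual - forecast)^2 for period t+1
        for t in range(k, len(y))
    ]
    return sum(squared_errors) / len(squared_errors)


# Try every order that leaves at least one period to forecast.
mse_by_k = {k: moving_average_mse(sales, k) for k in range(1, len(sales))}
best_k = min(mse_by_k, key=mse_by_k.get)

for k, mse in mse_by_k.items():
    print(f"k={k:2d}  MSE={mse:.2f}")
print("order with the smallest MSE:", best_k)
```

Because orders near the length of the series are judged on only one or two errors, their MSE values are unstable; this is one reason to prefer a separate validation set when enough data are available.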

If a large amount of data are available to build the forecast models, we suggest dividing the data into training and validation sets, and then determining the best value of k as the value that minimizes the MSE for the validation set. We discuss the use of training and validation sets in more detail in Section 8.5.

Exponential Smoothing

Exponential smoothing uses a weighted average of past time series values as a forecast. The exponential smoothing model is as follows:

EXPONENTIAL SMOOTHING FORECAST

$$\hat{y}_{t+1} = \alpha y_t + (1 - \alpha)\hat{y}_t \tag{8.7}$$

where

$\hat{y}_{t+1}$ = forecast of the time series for period $t + 1$

$y_t$ = actual value of the time series in period $t$

$\hat{y}_t$ = forecast of the time series for period $t$

$\alpha$ = smoothing constant ($0 \le \alpha \le 1$)

Equation (8.7) shows that the forecast for period $t + 1$ is a weighted average of the actual value in period $t$ and the forecast for period $t$. The weight given to the actual value in period $t$ is the smoothing constant $\alpha$, and the weight given to the forecast in period $t$ is $1 - \alpha$. It turns out that the exponential smoothing forecast for any period is actually a weighted average of all the previous actual values of the time series. Let us illustrate by working with a time series involving only three periods of data: $y_1$, $y_2$, and $y_3$.

To initiate the calculations, we let $\hat{y}_1$ equal the actual value of the time series in period 1; that is, $\hat{y}_1 = y_1$. Hence, the forecast for period 2 is

$$\hat{y}_2 = \alpha y_1 + (1 - \alpha)\hat{y}_1 = \alpha y_1 + (1 - \alpha) y_1 = y_1$$

We see that the exponential smoothing forecast for period 2 is equal to the actual value of the time series in period 1.

The forecast for period 3 is

$$\hat{y}_3 = \alpha y_2 + (1 - \alpha)\hat{y}_2 = \alpha y_2 + (1 - \alpha) y_1$$

Finally, substituting this expression for $\hat{y}_3$ into the expression for $\hat{y}_4$, we obtain

$$\hat{y}_4 = \alpha y_3 + (1 - \alpha)\hat{y}_3 = \alpha y_3 + (1 - \alpha)\left(\alpha y_2 + (1 - \alpha) y_1\right) = \alpha y_3 + \alpha(1 - \alpha) y_2 + (1 - \alpha)^2 y_1$$

We now see that $\hat{y}_4$ is a weighted average of the first three time series values. The sum of the coefficients, or weights, for $y_1$, $y_2$, and $y_3$ equals 1. A similar argument can be made to show that, in general, any forecast $\hat{y}_{t+1}$ is a weighted average of all the $t$ previous time series values.
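The weighted-average interpretation can be checked numerically. The sketch below generalizes the three-period expansion, assuming the same initialization $\hat{y}_1 = y_1$: the weight on $y_1$ is $(1 - \alpha)^{t-1}$ and the weight on $y_j$ for $j \ge 2$ is $\alpha(1 - \alpha)^{t-j}$. The function names are hypothetical, and the three data values are the first three gasoline observations.

```python
def smoothing_weights(alpha, t):
    """Weights placed on y_1, ..., y_t by the forecast for period t + 1.

    Assumes the initialization y-hat_1 = y_1 (so y-hat_2 = y_1), as in the text.
    """
    weights = [(1 - alpha) ** (t - 1)]                                     # weight on y_1
    weights += [alpha * (1 - alpha) ** (t - j) for j in range(2, t + 1)]   # y_2 .. y_t
    return weights


def recursive_forecast(y, alpha):
    """Forecast for the period after the last value, using equation (8.7) step by step."""
    forecast = y[0]                      # y-hat_2 = y_1
    for actual in y[1:]:
        forecast = alpha * actual + (1 - alpha) * forecast
    return forecast


y = [17, 21, 19]                         # first three weeks of gasoline sales
alpha = 0.2
weights = smoothing_weights(alpha, len(y))
weighted = sum(w * v for w, v in zip(weights, y))
recursive = recursive_forecast(y, alpha)

print("weights:", weights, "sum =", sum(weights))
print("weighted average:", weighted, "recursive forecast:", recursive)
```

Both routes give the same period-4 forecast, which illustrates why only the latest actual value and the latest forecast need to be stored.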

Despite the fact that exponential smoothing provides a forecast that is a weighted average of all past observations, all past data do not need to be retained to compute the forecast for the next period. In fact, equation (8.7) shows that once the value for the smoothing constant $\alpha$ is selected, only two pieces of information are needed to compute the forecast for period $t + 1$: $y_t$, the actual value of the time series in period $t$; and $\hat{y}_t$, the forecast for period $t$.

To illustrate the exponential smoothing approach to forecasting, let us again consider the gasoline sales time series in Table 8.1. As indicated previously, to initialize the calculations we set the exponential smoothing forecast for period 2 equal to the actual value of the time series in period 1. Thus, with $y_1 = 17$, we set $\hat{y}_2 = 17$ to initiate the computations. Referring to the time series data in Table 8.1, we find an actual time series value in period 2 of $y_2 = 21$. Thus, in period 2 we have a forecast error of $e_2 = 21 - 17 = 4$.

Continuing with the exponential smoothing computations using a smoothing constant of $\alpha = 0.2$, we obtain the following forecast for period 3:

$$\hat{y}_3 = 0.2 y_2 + 0.8 \hat{y}_2 = 0.2(21) + 0.8(17) = 17.8$$

Once the actual time series value in period 3, $y_3 = 19$, is known, we can generate a forecast for period 4 as follows:

$$\hat{y}_4 = 0.2 y_3 + 0.8 \hat{y}_3 = 0.2(19) + 0.8(17.8) = 18.04$$

Continuing the exponential smoothing calculations, we obtain the weekly forecast values shown in Table 8.10. Note that we have not shown an exponential smoothing forecast or a forecast error for week 1 because no forecast was made (we used actual sales for week 1 as the forecasted sales for week 2 to initialize the exponential smoothing process). For week 12, we have $y_{12} = 22$ and $\hat{y}_{12} = 18.48$. We can use this information to generate a forecast for week 13:

$$\hat{y}_{13} = 0.2 y_{12} + 0.8 \hat{y}_{12} = 0.2(22) + 0.8(18.48) = 19.18$$

Thus, the exponential smoothing forecast of the amount sold in week 13 is 19.18, or 19,180 gallons of gasoline. With this forecast, the firm can make plans and decisions accordingly.
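The week-by-week recursion above is short enough to express in a few lines of Python. This is a minimal sketch of equation (8.7) with the text's initialization (forecast for week 2 equals actual sales in week 1); the function name `exponential_smoothing` is an illustrative choice.

```python
# Gasoline sales (1,000s of gallons) for weeks 1-12.
sales = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]


def exponential_smoothing(y, alpha):
    """Return the forecasts for periods 2 .. n+1 using equation (8.7).

    The forecast for period 2 is initialized to the actual value in period 1.
    """
    forecasts = [y[0]]
    for actual in y[1:]:
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts


forecasts = exponential_smoothing(sales, alpha=0.2)
for week, forecast in enumerate(forecasts, start=2):
    print(f"week {week}: forecast = {forecast:.2f}")
```

Each new forecast is built only from the latest actual value and the previous forecast, so the loop never needs to look further back in the series.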

Figure 8.11 shows the time series plot of the actual and forecasted time series values. Note in particular how the forecasts smooth out the irregular or random fluctuations in the time series.

TABLE 8.10 Summary of the Exponential Smoothing Forecasts and Forecast Errors for the Gasoline Sales Time Series with Smoothing Constant $\alpha = 0.2$

| Week | Time Series Value | Forecast | Forecast Error | Squared Forecast Error |
|---|---|---|---|---|
| 1 | 17 | | | |
| 2 | 21 | 17.00 | 4.00 | 16.00 |
| 3 | 19 | 17.80 | 1.20 | 1.44 |
| 4 | 23 | 18.04 | 4.96 | 24.60 |
| 5 | 18 | 19.03 | −1.03 | 1.06 |
| 6 | 16 | 18.83 | −2.83 | 8.01 |
| 7 | 20 | 18.26 | 1.74 | 3.03 |
| 8 | 18 | 18.61 | −0.61 | 0.37 |
| 9 | 22 | 18.49 | 3.51 | 12.32 |
| 10 | 20 | 19.19 | 0.81 | 0.66 |
| 11 | 15 | 19.35 | −4.35 | 18.92 |
| 12 | 22 | 18.48 | 3.52 | 12.39 |
| Totals | | | 10.92 | 98.80 |

To show how Excel can be used for exponential smoothing, we again develop a forecast for the gasoline sales time series in Table 8.1. We use the file Gasoline, which has the week in rows 2 through 13 of column A and the sales data for the 12 weeks in rows 2 through 13 of column B. We use $\alpha = 0.2$. The following steps can be used to produce a forecast.

Step 1. Click the Data tab in the Ribbon

Step 2. Click Data Analysis in the Analyze group

Step 3. When the Data Analysis dialog box appears (Figure 8.12), select Exponential Smoothing and click OK

Step 4. When the Exponential Smoothing dialog box appears (Figure 8.13):

    Enter B2:B13 in the Input Range: box

    Enter 0.8 in the Damping factor: box

    Enter C2 in the Output Range: box

    Click OK

Once you have completed this step, the exponential smoothing forecasts will appear in column C of the worksheet as shown in Figure 8.14. Note that the value we entered in the Damping factor: box is $1 - \alpha$; forecasts for other smoothing constants can be computed easily by entering a different value for $1 - \alpha$ in the Damping factor: box.

FIGURE 8.11 Actual and Forecast Gasoline Time Series with Smoothing Constant $\alpha = 0.2$ (vertical axis: Sales (1,000s of gallons); horizontal axis: Week; plotted series: actual time series and forecast time series with $\alpha = 0.2$)

FIGURE 8.12 Data Analysis Dialog Box

FIGURE 8.13 Exponential Smoothing Dialog Box

FIGURE 8.14 Excel Output for Exponential Smoothing Forecast for Gasoline Data

| Row | A | B | C |
|---|---|---|---|
| 1 | Week | Sales (1,000s of gallons) | |
| 2 | 1 | 17 | #N/A |
| 3 | 2 | 21 | 17 |
| 4 | 3 | 19 | 17.8 |
| 5 | 4 | 23 | 18.04 |
| 6 | 5 | 18 | 19.032 |
| 7 | 6 | 16 | 18.8256 |
| 8 | 7 | 20 | 18.2605 |
| 9 | 8 | 18 | 18.6084 |
| 10 | 9 | 22 | 18.4867 |
| 11 | 10 | 20 | 19.1894 |
| 12 | 11 | 15 | 19.3515 |
| 13 | 12 | 22 | 18.4812 |

In the preceding exponential smoothing calculations, we used a smoothing constant of $\alpha = 0.2$. Although any value of $\alpha$ between 0 and 1 is acceptable, some values will yield more accurate forecasts than others. Insight into choosing a good value for $\alpha$ can be obtained by rewriting the basic exponential smoothing model as follows:

$$\hat{y}_{t+1} = \alpha y_t + (1 - \alpha)\hat{y}_t$$

$$\hat{y}_{t+1} = \hat{y}_t + \alpha(y_t - \hat{y}_t) = \hat{y}_t + \alpha e_t$$

Thus, the new forecast $\hat{y}_{t+1}$ is equal to the previous forecast $\hat{y}_t$ plus an adjustment, which is the smoothing constant $\alpha$ times the most recent forecast error, $e_t = y_t - \hat{y}_t$. In other words, the forecast in period $t + 1$ is obtained by adjusting the forecast in period $t$ by a fraction of the forecast error from period $t$. If the time series contains substantial random variability, a small value of the smoothing constant is preferred. The reason for this choice is that if much of the forecast error is due to random variability, we do not want to overreact and adjust the forecasts too quickly. For a time series with relatively little random variability, a forecast error is more likely to represent a real change in the level of the series. Thus, larger values of the smoothing constant provide the advantage of quickly adjusting the forecasts to changes in the time series, thereby allowing the forecasts to react more quickly to changing conditions.

The criterion we will use to determine a desirable value for the smoothing constant $\alpha$ is the same as that proposed for determining the order or number of periods of data to include in the moving averages calculation; that is, we choose the value of $\alpha$ that minimizes the MSE. A summary of the MSE calculations for the exponential smoothing forecast of gasoline sales with $\alpha = 0.2$ is shown in Table 8.10. Note that there is one less squared error term than the number of time periods; this is because we had no past values with which to make a forecast for period 1. The value of the sum of squared forecast errors is 98.80; hence, MSE = 98.80/11 = 8.98. Would a different value of $\alpha$ provide better results in terms of a lower MSE value? Trial and error is often used to determine whether a different smoothing constant $\alpha$ can provide more accurate forecasts.
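One simple way to carry out this trial and error is a grid search. The sketch below scores smoothing constants from 0.01 to 0.99 in steps of 0.01 on the gasoline data; the step size, the grid endpoints, and the names `smoothing_mse` and `best_alpha` are assumptions made for illustration, not choices taken from the text.

```python
# Gasoline sales (1,000s of gallons) for weeks 1-12.
sales = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]


def smoothing_mse(y, alpha):
    """MSE of exponential smoothing forecasts for periods 2 .. n."""
    forecast = y[0]                       # forecast for period 2 = actual value in period 1
    squared_errors = []
    for actual in y[1:]:
        squared_errors.append((actual - forecast) ** 2)
        # Update the forecast for the next period with equation (8.7).
        forecast = alpha * actual + (1 - alpha) * forecast
    return sum(squared_errors) / len(squared_errors)


# Hypothetical search grid: 0.01, 0.02, ..., 0.99.
candidates = [i / 100 for i in range(1, 100)]
best_alpha = min(candidates, key=lambda a: smoothing_mse(sales, a))

print(f"MSE at alpha = 0.2: {smoothing_mse(sales, 0.2):.2f}")
print(f"best alpha on the grid: {best_alpha:.2f}, MSE = {smoothing_mse(sales, best_alpha):.2f}")
```

A grid search only approximates the minimizing value; a finer grid or a nonlinear optimizer would refine it, and with more data the comparison would be better made on a validation set.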

Similar to our note related to moving averages, if enough data are available, then $\alpha$ should be chosen to minimize the MSE of the validation set.

Nonlinear optimization can be used to identify the value of $\alpha$ that minimizes the MSE.

NOTES + COMMENTS

1. Spreadsheet packages are effective tools for implementing exponential smoothing. With the time series data and the forecasting formulas in a spreadsheet such as the one shown in Table 8.10, you can use the MAE, MSE, and MAPE to evaluate different values of the smoothing constant $\alpha$.

2. Moving averages and exponential smoothing provide the foundation for much of time series analysis, and many more sophisticated refinements of these methods have been developed. These include but are not limited to weighted moving averages, double moving averages, Brown’s method for double exponential smoothing, and Holt-Winters exponential smoothing. Appendix 8.1 explains how to implement the Holt-Winters method using Excel Forecast Sheet.

8.4 Using Regression Analysis for Forecasting

Regression analysis is a statistical technique that can be used to develop a mathematical equation showing how variables are related. In regression terminology, the variable that is being predicted is called the dependent (or response) variable, and the variable or variables being used to predict the value of the dependent variable are called the independent (or predictor) variables. In this section we will show how to use regression analysis to develop forecasts for a time series that has a trend, a seasonal pattern, or both a trend and a seasonal pattern. We will also show how to use regression analysis to develop forecast models that include causal variables.

Linear Trend Projection

We now consider forecasting methods that are appropriate for time series that exhibit trend patterns and show how regression analysis can be used to forecast a time series with a linear trend. In Section 8.1 we used the bicycle sales time series in Table 8.3 to illustrate a time series with a trend pattern. Let us now use this time series to illustrate how regression analysis can be used to forecast a time series with a linear trend. Although the time series plot in Figure 8.3 shows some up-and-down movement over the past 10 years, we might agree that a linear trendline provides a reasonable approximation of the long-run movement in the series. We can use regression analysis to develop such a linear trendline for the bicycle sales time series.

In Chapter 7, we discuss linear regression models in more detail.

Because simple linear regression analysis yields the linear relationship between the independent variable and the dependent variable that minimizes the MSE, we can use this approach to find a best-fitting line to a set of data that exhibits a linear trend. In finding a linear trend, the variable to be forecasted ($y_t$, the actual value of the time series in period $t$) is the dependent variable and the trend variable (time period $t$) is the independent variable. We will use the following notation for our linear trendline:

$$\hat{y}_t = b_0 + b_1 t \tag{8.8}$$

where

$\hat{y}_t$ = forecast of sales in period $t$

$t$ = time period

$b_0$ = the $y$-intercept of the linear trendline

$b_1$ = the slope of the linear trendline

In equation (8.8), the time variable begins at $t = 1$, corresponding to the first time series observation (year 1 for the bicycle sales time series). The time variable then continues until $t = n$, corresponding to the most recent time series observation (year 10 for the bicycle sales time series). Thus, for the bicycle sales time series, $t = 1$ corresponds to the oldest time series value, and $t = 10$ corresponds to the most recent year.

Excel can be used to compute the estimated intercept $b_0$ and slope $b_1$. The Excel output for a regression analysis of the bicycle sales data is provided in Figure 8.15.

We see in this output that the estimated intercept $b_0$ is 20.4 (shown in cell B17) and the estimated slope $b_1$ is 1.1 (shown in cell B18). Thus,

$$\hat{y}_t = 20.4 + 1.1t \tag{8.9}$$

is the regression equation for the linear trend component for the bicycle sales time series. The slope of 1.1 in this trend equation indicates that over the past 10 years the firm has experienced an average growth in sales of about 1,100 units per year. If we assume that the past 10-year trend in sales is a good indicator for the future, we can use equation (8.9) to project the trend component of the time series. For example, substituting $t = 11$ into equation (8.9) yields next year’s trend projection, $\hat{y}_{11}$:

$$\hat{y}_{11} = 20.4 + 1.1(11) = 32.5$$

Thus, the linear trend model yields a sales forecast of 32,500 bicycles for the next year.

We can also use the trendline to forecast sales farther into the future. Using equation (8.9), we develop annual forecasts of bicycle sales for two and three years into the future as follows:

$$\hat{y}_{12} = 20.4 + 1.1(12) = 33.6$$

$$\hat{y}_{13} = 20.4 + 1.1(13) = 34.7$$

The forecasted value increases by 1,100 bicycles in each year.

Note that in this example we are not using past values of the time series to produce forecasts, so we can produce a forecast for each period of the time series; that is, $k = 0$ in equations (8.3)–(8.5) to calculate the MAE, MSE, and MAPE.
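For readers working outside Excel, the trendline coefficients can be obtained from the usual closed-form least-squares formulas. The sketch below is illustrative only: the bicycle sales data of Table 8.3 are not reproduced in this section, so the fit is demonstrated on a hypothetical noise-free series generated from equation (8.9), and the function names are assumptions.

```python
def fit_linear_trend(y):
    """Least-squares intercept b0 and slope b1 for y-hat_t = b0 + b1 * t, t = 1 .. n."""
    n = len(y)
    periods = range(1, n + 1)
    t_bar = sum(periods) / n
    y_bar = sum(y) / n
    s_ty = sum((t - t_bar) * (value - y_bar) for t, value in zip(periods, y))
    s_tt = sum((t - t_bar) ** 2 for t in periods)
    b1 = s_ty / s_tt
    b0 = y_bar - b1 * t_bar
    return b0, b1


def trend_forecast(b0, b1, t):
    """Trend projection for time period t, as in equation (8.8)."""
    return b0 + b1 * t


# Hypothetical series lying exactly on the line of equation (8.9);
# with real data, pass the observed values in time order instead.
example = [20.4 + 1.1 * t for t in range(1, 11)]
b0, b1 = fit_linear_trend(example)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")

# Projections for years 11 to 13 using the coefficients reported in the text.
for t in (11, 12, 13):
    print(f"t = {t}: forecast = {trend_forecast(20.4, 1.1, t):.1f}")
```

Because the example series has no noise, the fit simply recovers the coefficients used to generate it; applied to actual sales data, the same function returns the intercept and slope that Excel reports.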

We can also use more complex regression models to fit nonlinear trends. For example, to generate a forecast of a time series with a curvilinear trend, we could include $t^2$ and $t^3$ as independent variables in our model, and the estimated regression equation would become

$$\hat{y}_t = b_0 + b_1 t + b_2 t^2 + b_3 t^3$$

Another type of regression-based forecasting model occurs whenever all the independent variables are previous values of the same time series. For example, if the time series values are denoted by $y_1, y_2, \ldots, y_n$, we might try to find an estimated regression equation relating $y_t$ to the most recent time series values, $y_{t-1}$, $y_{t-2}$, and so on. If we use the actual values of the time series for the three most recent periods as independent variables, the estimated regression equation would be

$$\hat{y}_t = b_0 + b_1 y_{t-1} + b_2 y_{t-2} + b_3 y_{t-3}$$

Regression models such as this in which the independent variables are previous values of the time series are referred to as autoregressive models.
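Setting up an autoregressive model is mostly a matter of arranging the series into lagged columns. The sketch below builds the independent-variable rows $(y_{t-1}, y_{t-2}, y_{t-3})$ and the matching dependent values $y_t$; it stops short of estimating coefficients, the function name `lagged_regression_data` is hypothetical, and the gasoline series is used only as convenient sample input.

```python
def lagged_regression_data(y, p):
    """Build rows (y_{t-1}, ..., y_{t-p}) and targets y_t for t = p+1 .. n."""
    rows, targets = [], []
    for t in range(p, len(y)):                       # 0-based index of the target period
        rows.append([y[t - lag] for lag in range(1, p + 1)])
        targets.append(y[t])
    return rows, targets


# Sample input: gasoline sales (1,000s of gallons) for weeks 1-12.
sales = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]
rows, targets = lagged_regression_data(sales, p=3)

for row, target in zip(rows, targets):
    print(f"lags {row} -> y_t = {target}")
```

The first $p$ periods drop out because they lack a full set of lagged values, so a series of 12 observations yields 9 usable rows when three lags are used.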

Because autoregressive models typically violate the conditions necessary for inference in least squares regression, you must be careful when testing hypotheses or estimating confidence intervals in autoregressive models. There are special methods for constructing autoregressive models, but they are beyond the scope of this book.

FIGURE 8.15 Excel Simple Linear Regression Output for Trendline Model for Bicycle Sales Data

SUMMARY OUTPUT

| Regression Statistics | |
|---|---|
| Multiple R | 0.874526167 |
| R Square | 0.764796016 |
| Adjusted R Square | 0.735395518 |
| Standard Error | 1.958953802 |
| Observations | 10 |

| ANOVA | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 1 | 99.825 | 99.825 | 26.01302932 | 0.000929509 |
| Residual | 8 | 30.7 | 3.8375 | | |
| Total | 9 | 130.525 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 99.0% | Upper 99.0% |
|---|---|---|---|---|---|---|---|---|
| Intercept | 20.4 | 1.338220211 | 15.24412786 | 3.39989E-07 | 17.31405866 | 23.48594134 | 15.90975286 | 24.89024714 |
| Year | 1.1 | 0.215673715 | 5.100296983 | 0.000929509 | 0.60265552 | 1.59734448 | 0.376331148 | 1.823668852 |

Seasonality Without Trend

To the extent that seasonality exists, we need to incorporate it into our forecasting models to ensure accurate forecasts. We begin by considering a seasonal time series with no trend and then, in the next section, we discuss how to model seasonality with a linear trend. Let us consider again the data from Table 8.5, the number of umbrellas sold at a clothing store over the past five years. As we see in the time series plot provided in Figure 8.5, the data do not suggest any long-term trend in sales. In fact, unless you look carefully at the data, you might conclude that the data follow a horizontal pattern with random fluctuation and that single exponential smoothing could be used to forecast sales. However, closer inspection of the time series plot reveals a pattern in the fluctuations. The first and third quarters have moderate sales, the second quarter the highest sales, and the fourth quarter the lowest sales. Thus, we conclude that a quarterly seasonal pattern is present.

We can model a time series with a seasonal pattern by treating the season as a dummy variable. As indicated in Chapter 7, categorical variables are data used to categorize observations of data, and $k - 1$ dummy variables are required to model a categorical variable that has $k$ levels. Thus, we need three dummy variables to model four seasons. For instance, in the umbrella sales time series, the quarter to which each observation corresponds is treated as a season; it is a categorical variable with four levels: quarter 1, quarter 2, quarter 3, and quarter 4. Thus, to model the seasonal effects in the umbrella time series we need $4 - 1 = 3$ dummy variables. The three dummy variables can be coded as follows:

$$\text{Qtr1}_t = \begin{cases} 1 & \text{if period } t \text{ is quarter 1} \\ 0 & \text{otherwise} \end{cases}$$

$$\text{Qtr2}_t = \begin{cases} 1 & \text{if period } t \text{ is quarter 2} \\ 0 & \text{otherwise} \end{cases}$$

$$\text{Qtr3}_t = \begin{cases} 1 & \text{if period } t \text{ is quarter 3} \\ 0 & \text{otherwise} \end{cases}$$

Using $\hat{y}_t$ to denote the forecasted value of sales for period $t$, the general form of the equation relating the number of umbrellas sold to the quarter the sales take place is as follows:

$$\hat{y}_t = b_0 + b_1 \text{Qtr1}_t + b_2 \text{Qtr2}_t + b_3 \text{Qtr3}_t \tag{8.10}$$

Note that the fourth quarter will be denoted by setting all three dummy variables to 0.

Table 8.11 shows the umbrella sales time series with the coded values of the dummy variables shown. We can use a multiple linear regression model to find the values of $b_0$, $b_1$, $b_2$, and $b_3$ that minimize the sum of squared errors. For this regression model, $y_t$ is the dependent variable, and the quarterly dummy variables $\text{Qtr1}_t$, $\text{Qtr2}_t$, and $\text{Qtr3}_t$ are the independent variables.

Using the data in Table 8.11 and regression analysis, we obtain the following equation:

$$\hat{y}_t = 95.0 + 29.0\,\text{Qtr1}_t + 57.0\,\text{Qtr2}_t + 26.0\,\text{Qtr3}_t \tag{8.11}$$

We can use equation (8.11) to forecast sales of every quarter for the next year:

Quarter 1: Sales = 95.0 + 29.0(1) + 57.0(0) + 26.0(0) = 124

Quarter 2: Sales = 95.0 + 29.0(0) + 57.0(1) + 26.0(0) = 152

Quarter 3: Sales = 95.0 + 29.0(0) + 57.0(0) + 26.0(1) = 121

Quarter 4: Sales = 95.0 + 29.0(0) + 57.0(0) + 26.0(0) = 95

It is interesting to note that we could have obtained the quarterly forecasts for the next year by simply computing the average number of umbrellas sold in each quarter. Nonetheless, for more complex problem situations, such as dealing with a time series that has both trend and seasonal effects, this simple averaging approach will not work.
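The equivalence with quarterly averages can be verified directly. In a model whose only independent variables are the three quarterly dummies, the least-squares intercept equals the mean of the baseline quarter (quarter 4) and each dummy coefficient equals that quarter's mean minus the baseline mean. The sketch below uses this shortcut instead of running a regression; the function name `quarter_means` is illustrative, and the data are the Sales column of Table 8.11.

```python
# Umbrella sales for periods 1-20 (five years of quarterly data, Table 8.11).
umbrella_sales = [
    125, 153, 106, 88,     # year 1
    118, 161, 133, 102,    # year 2
    138, 144, 113, 80,     # year 3
    109, 137, 125, 109,    # year 4
    130, 165, 128, 96,     # year 5
]


def quarter_means(y):
    """Average sales for quarters 1-4, assuming y starts in quarter 1."""
    return [sum(y[q::4]) / len(y[q::4]) for q in range(4)]


means = quarter_means(umbrella_sales)          # [Q1, Q2, Q3, Q4]

# Coefficients of equation (8.10) with quarter 4 as the baseline.
b0 = means[3]
b1, b2, b3 = (means[q] - b0 for q in range(3))

print("quarterly means:", means)
print(f"b0 = {b0}, b1 = {b1}, b2 = {b2}, b3 = {b3}")
```

This shortcut works only because the model contains nothing but the seasonal dummies; once a trend term is added, the coefficients must come from a full least-squares fit.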

Seasonality with Trend

We now consider situations for which the time series contains both seasonal effects and a linear trend by showing how to forecast the quarterly sales of smartphones introduced in Section 8.1. The data for the smartphone time series are shown in Table 8.6. The time series plot in Figure 8.6 indicates that sales are lowest in the second quarter of each year and increase in quarters 3 and 4. Thus, we conclude that a seasonal pattern exists for smartphone sales. However, the time series also has an upward linear trend that will need to be accounted for in order to develop accurate forecasts of quarterly sales. This is easily done by combining the dummy variable approach for handling seasonality with the approach for handling a linear trend discussed earlier in this section.

The general form of the regression equation for modeling both the quarterly seasonal effects and the linear trend in the smartphone time series is

$$\hat{y}_t = b_0 + b_1 \text{Qtr1}_t + b_2 \text{Qtr2}_t + b_3 \text{Qtr3}_t + b_4 t \tag{8.12}$$

where

$\hat{y}_t$ = forecast of sales in period $t$

$\text{Qtr1}_t$ = 1 if time period $t$ corresponds to the first quarter of the year; 0 otherwise

$\text{Qtr2}_t$ = 1 if time period $t$ corresponds to the second quarter of the year; 0 otherwise

$\text{Qtr3}_t$ = 1 if time period $t$ corresponds to the third quarter of the year; 0 otherwise

$t$ = time period (quarter)

For this regression model $y_t$ is the dependent variable and the quarterly dummy variables $\text{Qtr1}_t$, $\text{Qtr2}_t$, and $\text{Qtr3}_t$ and the time period $t$ are the independent variables.

Table 8.12 shows the revised smartphone sales time series that includes the coded values of the dummy variables and the time period $t$. Using the data in Table 8.12 with the regression model that includes both the seasonal and trend components, we obtain the following equation that minimizes our sum of squared errors:

$$\hat{y}_t = 6.07 - 1.36\,\text{Qtr1}_t - 2.03\,\text{Qtr2}_t - 0.304\,\text{Qtr3}_t + 0.146t \tag{8.13}$$

We can now use equation (8.13) to forecast quarterly sales for the next year. Next year is year 5 for the smartphone sales time series, that is, time periods 17, 18, 19, and 20.

Forecast for time period 17 (quarter 1 in year 5):

$$\hat{y}_{17} = 6.07 - 1.36(1) - 2.03(0) - 0.304(0) + 0.146(17) = 7.19$$

Forecast for time period 18 (quarter 2 in year 5):

$$\hat{y}_{18} = 6.07 - 1.36(0) - 2.03(1) - 0.304(0) + 0.146(18) = 6.67$$

Forecast for time period 19 (quarter 3 in year 5):

$$\hat{y}_{19} = 6.07 - 1.36(0) - 2.03(0) - 0.304(1) + 0.146(19) = 8.54$$

Forecast for time period 20 (quarter 4 in year 5):

$$\hat{y}_{20} = 6.07 - 1.36(0) - 2.03(0) - 0.304(0) + 0.146(20) = 8.99$$

Thus, accounting for the seasonal effects and the linear trend in smartphone sales, the estimates of quarterly sales in year 5 are 7,190; 6,670; 8,540; and 8,990.

TABLE 8.11 Umbrella Sales Time Series with Dummy Variables

| Period | Year | Quarter | Qtr1 | Qtr2 | Qtr3 | Sales |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 0 | 0 | 125 |
| 2 | | 2 | 0 | 1 | 0 | 153 |
| 3 | | 3 | 0 | 0 | 1 | 106 |
| 4 | | 4 | 0 | 0 | 0 | 88 |
| 5 | 2 | 1 | 1 | 0 | 0 | 118 |
| 6 | | 2 | 0 | 1 | 0 | 161 |
| 7 | | 3 | 0 | 0 | 1 | 133 |
| 8 | | 4 | 0 | 0 | 0 | 102 |
| 9 | 3 | 1 | 1 | 0 | 0 | 138 |
| 10 | | 2 | 0 | 1 | 0 | 144 |
| 11 | | 3 | 0 | 0 | 1 | 113 |
| 12 | | 4 | 0 | 0 | 0 | 80 |
| 13 | 4 | 1 | 1 | 0 | 0 | 109 |
| 14 | | 2 | 0 | 1 | 0 | 137 |
| 15 | | 3 | 0 | 0 | 1 | 125 |
| 16 | | 4 | 0 | 0 | 0 | 109 |
| 17 | 5 | 1 | 1 | 0 | 0 | 130 |
| 18 | | 2 | 0 | 1 | 0 | 165 |
| 19 | | 3 | 0 | 0 | 1 | 128 |
| 20 | | 4 | 0 | 0 | 0 | 96 |

The dummy variables in the equation actually provide four equations, one for each quarter. For instance, if time period t corresponds to quarter 1, the estimate of quarterly sales is

Quarter 1: Sales = 6.07 − 1.36(1) − 2.03(0) − 0.304(0) + 0.146t = 4.71 + 0.146t

Similarly, if time period t corresponds to quarters 2, 3, and 4, the estimates of quarterly sales are as follows:

Quarter 2: Sales = 6.07 − 1.36(0) − 2.03(1) − 0.304(0) + 0.146t = 4.04 + 0.146t

Quarter 3: Sales = 6.07 − 1.36(0) − 2.03(0) − 0.304(1) + 0.146t = 5.77 + 0.146t

Quarter 4: Sales = 6.07 − 1.36(0) − 2.03(0) − 0.304(0) + 0.146t = 6.07 + 0.146t

The slope of the trendline for each quarterly forecast equation is 0.146, indicating a consistent growth in sales of about 146 phones per quarter. The only difference in the four equations is that they have different intercepts.
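The coefficients of equation (8.13) can be reproduced with a small least-squares routine. The sketch below is one possible implementation under stated assumptions, not the procedure used in the text: it solves the normal equations with plain Gaussian elimination in standard-library Python, the helper names (`solve_linear_system`, `least_squares`, `design_row`) are hypothetical, and the data are the Sales column of Table 8.12.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]           # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                           # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def least_squares(rows, y):
    """Coefficients minimizing the sum of squared errors (normal equations)."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * value for r, value in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def design_row(t):
    """Independent variables for period t: [1, Qtr1, Qtr2, Qtr3, t]."""
    quarter = (t - 1) % 4 + 1
    return [1, int(quarter == 1), int(quarter == 2), int(quarter == 3), t]


# Smartphone sales (1,000s) for periods 1-16, from Table 8.12.
sales = [4.8, 4.1, 6.0, 6.5, 5.8, 5.2, 6.8, 7.4,
         6.0, 5.6, 7.5, 7.8, 6.3, 5.9, 8.0, 8.4]

rows = [design_row(t) for t in range(1, len(sales) + 1)]
b = least_squares(rows, sales)                               # [b0, b1, b2, b3, b4]
forecasts = [sum(c * x for c, x in zip(b, design_row(t))) for t in range(17, 21)]

print("coefficients:", [round(c, 3) for c in b])
print("year 5 forecasts:", [round(f, 2) for f in forecasts])
```

Forecasts computed from the unrounded coefficients can differ from the values in the text by about 0.01, because the text substitutes the rounded coefficients of equation (8.13).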

TABLE 8.12  Smartphone Sales Time Series with Dummy Variables and Time Period

Period  Year  Quarter  Qtr1  Qtr2  Qtr3  Sales (1,000s)
1       1     1        1     0     0     4.8
2       1     2        0     1     0     4.1
3       1     3        0     0     1     6.0
4       1     4        0     0     0     6.5
5       2     1        1     0     0     5.8
6       2     2        0     1     0     5.2
7       2     3        0     0     1     6.8
8       2     4        0     0     0     7.4
9       3     1        1     0     0     6.0
10      3     2        0     1     0     5.6
11      3     3        0     0     1     7.5
12      3     4        0     0     0     7.8
13      4     1        1     0     0     6.3
14      4     2        0     1     0     5.9
15      4     3        0     0     1     8.0
16      4     4        0     0     0     8.4

In the smartphone sales example, we showed how dummy variables can be used to account for the quarterly seasonal effects in the time series. Because there were four levels of seasonality, three dummy variables were required. However, many businesses use monthly rather than quarterly forecasts. For monthly data, season is a categorical variable with 12 levels, and thus 12 − 1 = 11 dummy variables are required to capture monthly seasonal effects. For example, the 11 dummy variables could be coded as follows:

Month1 = 1 if period is January, 0 otherwise

Month2 = 1 if period is February, 0 otherwise

...

Month11 = 1 if period is November, 0 otherwise

Other than this change, the approach for handling seasonality remains the same. Time series data collected at other intervals can be handled in a similar manner.
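The monthly coding can be sketched in code as follows. This is an illustrative example only: the function name `month_dummies` is an assumption, and December is treated as the baseline month because it is the one month without its own dummy variable in the coding above.

```python
def month_dummies(month: int) -> list[int]:
    """Return [Month1, ..., Month11] for a month number from 1 (January) to 12 (December).

    December is the baseline: all 11 dummy variables equal 0.
    """
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return [1 if month == k else 0 for k in range(1, 12)]


print(month_dummies(2))   # February: only Month2 equals 1
print(month_dummies(12))  # December: every dummy variable equals 0
```

The same pattern applies to any seasonal variable with k levels: k − 1 dummy variables are created, and the omitted level serves as the baseline.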

Using Regression Analysis as a Causal Forecasting Method

The methods discussed for estimating linear trends and seasonal effects make use of patterns in historical values of the variable to be forecast; these methods are classified as time series methods because they rely on past values of the variable to be forecast when developing the model. However, the relationship of the variable to be forecast with other variables may also be used to develop a forecasting model. Generally such models include only variables that are believed to cause changes in the variable to be forecast, such as the following:

• Advertising expenditures when sales are to be forecast.
• The mortgage rate when new housing construction is to be forecast.
• Grade point average when starting salaries for recent college graduates are to be forecast.
• The price of a product when the demand for the product is to be forecast.
• The value of the Dow Jones Industrial Average when the value of an individual stock is to be forecast.
• Daily high temperature when electricity usage is to be forecast.

Because these variables are used as independent variables when we believe they cause changes in the value of the dependent variable, forecasting models that include such variables as independent variables are referred to as causal models. It is important to note here that the forecasting model provides evidence only of association between an independent variable and the variable to be forecast. The model does not provide evidence of a causal relationship between an independent variable and the variable to be forecast; the conclusion that a causal relationship exists must be based on practical experience.

To illustrate how regression analysis is used as a causal forecasting method, we consider the sales forecasting problem faced by Armand’s Pizza Parlors, a chain of Italian restaurants doing business in a five-state area. Historically, the most successful locations have been near college campuses. The managers believe that quarterly sales for these restaurants (denoted by y) are related positively to the size of the student population (denoted by x); that is, restaurants near campuses with a large population tend to generate more sales than those located near campuses with a small population.

Using regression analysis we can develop an equation showing how the dependent variable y is related to the independent variable x. This equation can then be used to forecast quarterly sales for restaurants located near college campuses given the size of the student population. This is particularly helpful for forecasting sales for new restaurant locations. For instance, suppose that management wants to forecast sales for a new restaurant that it is considering opening near a college campus. Because no historical data are available on sales for a new restaurant, Armand’s cannot use time series data to develop the forecast. However, as we will now illustrate, regression analysis can still be used to forecast quarterly sales for this new location.


To develop the equation relating quarterly sales to the size of the student population, Armand’s collected data from a sample of 10 of its restaurants located near college campuses. These data are summarized in Table 8.13. For example, restaurant 1, with $y = 58$ and $x = 2$, had $58,000 in quarterly sales and is located near a campus with 2,000 students. Figure 8.16 shows a scatter chart of the data presented in Table 8.13, with the size of the student population shown on the horizontal axis and quarterly sales shown on the vertical axis.

What preliminary conclusions can we draw from Figure 8.16? Sales appear to be higher at locations near campuses with larger student populations. Also, it appears that the relationship between the two variables can be approximated by a straight line. In Figure 8.17, we can draw a straight line through the data that appears to provide a good linear approximation of the relationship between the variables. Observe that the relationship is not perfect. Indeed, few, if any, of the data fall exactly on the line. However, if we can develop the mathematical expression for this line, we may be able to use it to forecast the value of y corresponding to each possible value of x. The resulting equation of the line is called the estimated regression equation.

Using the least squares method of estimation, the estimated regression equation is

$\hat{y}_i = b_0 + b_1 x_i$   (8.14)

where

$\hat{y}_i$ = estimated value of the dependent variable (quarterly sales) for the $i$th observation
$b_0$ = intercept of the estimated regression equation
$b_1$ = slope of the estimated regression equation
$x_i$ = value of the independent variable (student population) for the $i$th observation

The Excel output for a simple linear regression analysis of the Armand’s Pizza data is provided in Figure 8.18.

We see in this output that the estimated intercept $b_0$ is 60 and the estimated slope $b_1$ is 5. Thus, the estimated regression equation is

$\hat{y}_i = 60 + 5x_i$
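For readers who want to see where these two values come from, they can be reproduced by hand with the standard least squares formulas for simple linear regression. The following is a worked sketch using the 10 observations in Table 8.13; the sample means $\bar{x}$ and $\bar{y}$ are notation introduced here for illustration.

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

With $\bar{x} = 140/10 = 14$, $\bar{y} = 1300/10 = 130$, $\sum (x_i - \bar{x})(y_i - \bar{y}) = 2840$, and $\sum (x_i - \bar{x})^2 = 568$:

$$b_1 = \frac{2840}{568} = 5, \qquad b_0 = 130 - 5(14) = 60$$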

The slope of the estimated regression equation ($b_1 = 5$) is positive, implying that, as student population increases, quarterly sales increase. In fact, we can conclude (because sales are measured in thousands of dollars and student population in thousands) that an increase in the student population of 1,000 is associated with an increase of $5,000 in expected quarterly sales; that is, quarterly sales are expected to increase by $5 per student. The estimated y-intercept $b_0$ tells us that if the student population for the location of an Armand’s pizza parlor was 0 students, we would expect sales of $60,000.

TABLE 8.13  Student Population and Quarterly Sales Data for 10 Armand’s Pizza Parlors (data file: Armand’s)

Restaurant  Student Population (1,000s)  Quarterly Sales ($1,000s)
1           2                            58
2           6                            105
3           8                            88
4           8                            118
5           12                           117
6           16                           137
7           20                           157
8           20                           169
9           22                           149
10          26                           202

FIGURE 8.16  Scatter Chart of Student Population and Quarterly Sales for Armand’s Pizza Parlors (horizontal axis: Student Population (1,000s), x; vertical axis: Quarterly Sales ($1,000s), y)

FIGURE 8.17  Graph of the Estimated Regression Equation for Armand’s Pizza Parlors: $\hat{y} = 60 + 5x$ (horizontal axis: Student Population (1,000s), x; vertical axis: Quarterly Sales ($1,000s), y; annotations: y-intercept $b_0 = 60$, slope $b_1 = 5$)

If we believe that the least squares estimated regression equation adequately describes the relationship between x and y, using the estimated regression equation to forecast the value of y for a given value of x seems reasonable. For example, if we wanted to forecast quarterly sales for a new restaurant to be located near a campus with 16,000 students, we would compute as follows:

$\hat{y} = 60 + 5(16) = 140$

Hence, we would forecast quarterly sales of $140,000.
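The fit and the forecast can also be checked in code. The following is a minimal sketch that applies the standard least squares formulas to the Table 8.13 data using only the Python standard library; the function name `fit_simple_regression` is illustrative, and this is not a substitute for the Excel Regression tool used in the text.

```python
# Data from Table 8.13: student population (1,000s) and quarterly sales ($1,000s).
population = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
sales = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]


def fit_simple_regression(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_simple_regression(population, sales)

# Forecast for a new location near a campus with 16,000 students (x = 16).
forecast = b0 + b1 * 16

# R Square: share of the variation in sales explained by the fitted line.
y_bar = sum(sales) / len(sales)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(population, sales))
sst = sum((y - y_bar) ** 2 for y in sales)
r_square = 1 - sse / sst

print(b0, b1, forecast, round(r_square, 4))
```

The intercept, slope, and R Square computed this way agree with the values reported in the Excel output in Figure 8.18, and the forecast matches the $140,000 obtained above.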

Combining Causal Variables with Trend and Seasonality Effects

Regression models are very flexible and can incorporate both causal variables and time series effects. Suppose we had a time series of several years of quarterly sales data and advertising expenditures for a single Armand’s restaurant. If we suspected that sales were related to advertising expenditures and that sales showed trend and seasonal effects, we could incorporate each into a single model by combining the approaches we have outlined. If we believe that the effect of advertising is not immediate, we might also try to find a relationship between sales in period t and advertising in the previous period, t − 1.
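One way to picture such a combined model is to look at the data rows that would be passed to a regression tool. The sketch below builds those rows for a time trend, three quarterly dummy variables, and advertising lagged by one period. All figures are hypothetical and invented for illustration, and the function and column names (`build_design_rows`, `AdvLag1`) are assumptions rather than anything defined in the text.

```python
def build_design_rows(sales, advertising):
    """Build regression rows with trend, quarterly dummies, and lagged advertising.

    Both lists are ordered by period, with period 1 falling in quarter 1.
    Period 1 is dropped because it has no previous-period advertising value.
    """
    rows = []
    for t in range(2, len(sales) + 1):
        quarter = (t - 1) % 4 + 1
        rows.append({
            "t": t,                           # linear trend variable
            "Qtr1": int(quarter == 1),        # seasonal dummies; quarter 4 is the baseline
            "Qtr2": int(quarter == 2),
            "Qtr3": int(quarter == 3),
            "AdvLag1": advertising[t - 2],    # advertising in period t - 1
            "Sales": sales[t - 1],            # dependent variable: sales in period t
        })
    return rows


# Hypothetical quarterly figures for one restaurant (two years of data).
sales = [40, 52, 47, 61, 44, 57, 50, 66]
advertising = [3, 5, 4, 6, 3, 5, 4, 7]

rows = build_design_rows(sales, advertising)
for row in rows[:3]:
    print(row)
```

Note that lagging a variable costs one observation: eight periods of data yield only seven usable rows, because the first period has no prior advertising value.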

The value of an independent variable from the prior period is referred to as a lagged variable.

Note that the values of the independent variable range from 2,000 to 26,000; thus, as discussed in Chapter 7, the y-intercept in such cases is an extrapolation of the regression line and must be interpreted with caution.

Multiple regression analysis also can be applied in these situations if additional data for other independent variables are available. For example, suppose that the management of Armand’s Pizza Parlors also believes that the number of competitors near the college campus is related to quarterly sales. Intuitively, management believes that restaurants located near campuses with fewer competitors generate more sales revenue than those located near campuses with more competitors. With additional data, multiple regression analysis could be used to develop an equation relating quarterly sales to the size of the student population and the number of competitors.

FIGURE 8.18  Excel Simple Linear Regression Output for Armand’s Pizza Parlors

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.950122955
R Square            0.90273363
Adjusted R Square   0.890575334
Standard Error      13.82931669
Observations        10

ANOVA
            df  SS     MS      F            Significance F
Regression  1   14200  14200   74.24836601  2.54887E-05
Residual    8   1530   191.25
Total       9   15730

                             Coefficients  Standard Error  t Stat       P-value      Lower 95%    Upper 95%    Lower 99.0%  Upper 99.0%
Intercept                    60            9.22603481      6.503335532  0.000187444  38.72472558  81.27527442  29.04307968  90.95692032
Student Population (1,000s)  5             0.580265238     8.616749156  2.54887E-05  3.661905962  6.338094038  3.052985371  6.947014629

Considerations in Using Regression in Forecasting

Although regression analysis allows for the estimation of complex forecasting models, we must be cautious about using such models and guard against the potential for overfitting our model to the sample data. Spyros Makridakis, a noted forecasting expert, conducted research showing that simple techniques usually outperform more complex procedures for short-term forecasting. Using a more sophisticated and expensive procedure will not guarantee better forecasts. However, many research studies, including those done by Makridakis, have also shown that quantitative forecasting models such as those presented in this chapter commonly outperform qualitative forecasts made by “experts.” Thus, there is good reason to use quantitative forecasting methods whenever data are available.

Whether a regression approach provides a good forecast depends largely on how well we are able to identify and obtain data for independent variables that are closely related to the time series. Generally, during the development of an estimated regression equation, we will want to consider many possible sets of independent variables. Thus, part of the regression analysis procedure should focus on the selection of the set of independent variables that provides the best forecasting model.

NOTES + COMMENTS

Many different software packages can be used to estimate regression models. Section 7.4 in this textbook explains how Excel’s Regression tool can be used to perform regression analysis. Appendix 7.1 demonstrates the use of Analytic Solver to estimate regression models.

8.5 Determining the Best Forecasting Model to Use

Given the variety of forecasting models and approaches, the obvious question is, “For a given forecasting study, how does one choose an appropriate model?” As discussed throughout this text, it is always a good idea to get descriptive statistics on the data and graph the data so that they can be visually inspected. In the case of time series data, a visual inspection can indicate whether seasonality appears to be a factor and whether a linear or nonlinear trend seems to exist. For causal modeling, scatter charts can indicate whether strong linear or nonlinear relationships exist between the independent and dependent variables. If certain relationships appear totally random, this may lead you to exclude these variables from the model.

As in regression analysis, you may be working with large data sets when generating a forecasting model. In such cases, it is recommended to divide your data into training and validation sets. For example, you might have five years of monthly data available to produce a time series forecast. You could use the first three years of data as a training set to estimate a model or a collection of models that appear to provide good forecasts. You might develop exponential smoothing models and regression models for the training set. You could then use the last two years as a validation set to assess and compare the models’ performances. Based on the errors produced by the different models for the validation set, you could ultimately pick the model that minimizes some forecast error measure, such as MAE, MSE, or MAPE. However, you must exercise caution in using the older portion of a time series for the training set and the more recent portion of the time series as the validation set; if the behavior of the time series has changed recently, the older portion of the time series may no longer show patterns similar to the more recent values of the time series, and a forecasting model based on such data will not perform well.
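The training and validation workflow described above can be sketched as follows. This is an illustrative example under several assumptions: the time series is hypothetical, the split point and the grid of candidate smoothing constants are arbitrary choices, and the exponential smoothing forecast for the second period is set equal to the first observed value. The function names are not from the text.

```python
def error_measures(actual, forecast):
    """Return MAE, MSE, and MAPE for paired actual and forecast values."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": sum(e * e for e in errors) / n,
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
    }


def smoothing_forecasts(series, alpha):
    """One-step-ahead exponential smoothing forecasts.

    Element j of the result is the forecast for series[j + 1]; the first
    forecast is set equal to the first observed value.
    """
    forecasts = [series[0]]
    for value in series[1:-1]:
        forecasts.append(alpha * value + (1 - alpha) * forecasts[-1])
    return forecasts


# Hypothetical time series, invented for illustration only.
series = [52, 55, 51, 58, 60, 57, 62, 64, 61, 66, 68, 65]
split = 8  # the first 8 periods form the training set, the rest the validation set

# Training step: pick the smoothing constant with the smallest training-set MSE.
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def training_mse(alpha):
    train = series[:split]
    return error_measures(train[1:], smoothing_forecasts(train, alpha))["MSE"]


best_alpha = min(candidates, key=training_mse)

# Validation step: compare one-step-ahead forecasts on the held-out periods.
validation = series[split:]
smoothing_val = smoothing_forecasts(series, best_alpha)[split - 1:]
naive_val = series[split - 1:-1]  # naive forecast: the previous period's value

results = {
    "exponential smoothing": error_measures(validation, smoothing_val),
    "naive": error_measures(validation, naive_val),
}
chosen = min(results, key=lambda name: results[name]["MSE"])
print(best_alpha, results, chosen)
```

Here MSE on the validation set decides between the two methods; MAE or MAPE could be used instead, and the measures need not agree on which method is best.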


Some software packages try many different forecasting models on time series data (those included in this chapter and more) and report back optimal model parameters and error measures for each model tested. Although some of these software packages will even automatically select the best model to use, ultimately the user should decide which model to use going forward based on a combination of the software output and the user’s managerial knowledge.

SUMMARY

This chapter provided an introduction to the basic methods of time series analysis and forecasting. First, we showed that to explain the behavior of a time series, it is often helpful to graph the time series and identify whether trend, seasonal, and/or cyclical components are present in the time series. The methods we have discussed are based on assumptions about which of these components are present in the time series.

We discussed how smoothing methods can be used to forecast a time series that exhibits no significant trend, seasonal, or cyclical effect. The moving average approach consists of computing an average of past data values and then using that average as the forecast for the next period. In the exponential smoothing method, a weighted average of past time series values is used to compute a forecast.

For time series that have only a long-term trend, we showed how regression analysis could be used to make trend projections. For time series with seasonal influences, we showed how to incorporate the seasonality for more accurate forecasts. We described how regression analysis can be used to develop causal forecasting models that relate values of the variable to be forecast (the dependent variable) to other independent variables that are believed to explain (cause) the behavior of the dependent variable. Finally, we have provided guidance on how to select an appropriate model from the models discussed in this chapter.

GLOSSARY

Autoregressive model: A regression model in which a regression relationship based on past time series values is used to predict the future time series values.
Causal models: Forecasting methods that relate a time series to other variables that are believed to explain or cause its behavior.
Cyclical pattern: The component of the time series that results in periodic above-trend and below-trend behavior of the time series lasting more than one year.
Exponential smoothing: A forecasting technique that uses a weighted average of past time series values as the forecast.
Forecast error: The amount by which the forecasted value $\hat{y}_t$ differs from the observed value $y_t$, denoted by $e_t = y_t - \hat{y}_t$.
Forecast: A prediction of future values of a time series.
Mean absolute error (MAE): A measure of forecasting accuracy; the average of the absolute values of the forecast errors. Also referred to as mean absolute deviation (MAD).
Mean absolute percentage error (MAPE): A measure of the accuracy of a forecasting method; the average of the absolute values of the errors as a percentage of the corresponding observed values.
Mean squared error (MSE): A measure of the accuracy of a forecasting method; the average of the squared differences between the forecast values and the actual time series values.
Moving average method: A method of forecasting or smoothing a time series that uses the average of the most recent n data values in the time series as the forecast for the next period.
Naïve forecasting method: A forecasting technique that uses the value of the time series from the most recent period as the forecast for the current period.
Seasonal pattern: The component of the time series that shows a periodic pattern over one year or less.
Smoothing constant: A parameter of the exponential smoothing model that provides the weight given to the most recent time series value in the calculation of the forecast value.
Stationary time series: A time series whose statistical properties are independent of time.
Time series: A set of observations on a variable measured at successive points in time or over successive periods of time.
Trend: The long-run shift or movement in the time series observable over several periods of time.

PROBLEMS

1. Consider the following time series data:

Week 1 2 3 4 5 6

Value 18 13 16 11 17 14

Using the naïve method (most recent value) as the forecast for the next week, compute the following measures of forecast accuracy:
a. Mean absolute error
b. Mean squared error
c. Mean absolute percentage error
d. What is the forecast for week 7?

2. Refer to the time series data in Problem 1. Using the average of all the historical data as a forecast for the next period, compute the following measures of forecast accuracy:
a. Mean absolute error
b. Mean squared error
c. Mean absolute percentage error
d. What is the forecast for week 7?

3. Problems 1 and 2 used different forecasting methods. Which method appears to provide the more accurate forecasts for the historical data? Explain.

4. Consider the following time series data:

Month 1 2 3 4 5 6 7

Value 24 13 20 12 19 23 15

a. Compute MSE using the most recent value as the forecast for the next period. What is the forecast for month 8?
b. Compute MSE using the average of all the data available as the forecast for the next period. What is the forecast for month 8?
c. Which method appears to provide the better forecast?

5. Consider the following time series data:

Week 1 2 3 4 5 6

Value 18 13 16 11 17 14

a. Construct a time series plot. What type of pattern exists in the data?
b. Develop a three-week moving average for this time series. Compute MSE and a forecast for week 7.
c. Use $\alpha = 0.2$ to compute the exponential smoothing values for the time series. Compute MSE and a forecast for week 7.
d. Compare the three-week moving average forecast with the exponential smoothing forecast using $\alpha = 0.2$. Which appears to provide the better forecast based on MSE? Explain.
e. Use trial and error to find a value of the exponential smoothing coefficient $\alpha$ that results in a smaller MSE than what you calculated for $\alpha = 0.2$.

6. Consider the following time series data:

Month 1 2 3 4 5 6 7

Value 24 13 20 12 19 23 15

a. Construct a time series plot. What type of pattern exists in the data?
b. Develop a three-week moving average for this time series. Compute MSE and a forecast for week 8.
c. Use $\alpha = 0.2$ to compute the exponential smoothing values for the time series. Compute MSE and a forecast for week 8.
d. Compare the three-week moving average forecast with the exponential smoothing forecast using $\alpha = 0.2$. Which appears to provide the better forecast based on MSE?
e. Use trial and error to find a value of the exponential smoothing coefficient $\alpha$ that results in a smaller MSE than what you calculated for $\alpha = 0.2$.

7. Refer to the gasoline sales time series data in Table 8.1 (data file: Gasoline).
a. Compute four-week and five-week moving averages for the time series.
b. Compute the MSE for the four-week and five-week moving average forecasts.
c. What appears to be the best number of weeks of past data (three, four, or five) to use in the moving average computation? Recall that the MSE for the three-week moving average is 10.22.

8. With the gasoline time series data from Table 8.1, show the exponential smoothing forecasts using $\alpha = 0.1$.
a. Applying the MSE measure of forecast accuracy, would you prefer a smoothing constant of $\alpha = 0.1$ or $\alpha = 0.2$ for the gasoline sales time series?
b. Are the results the same if you apply MAE as the measure of accuracy?
c. What are the results if MAPE is used?

9. With a smoothing constant of $\alpha = 0.2$, equation (8.7) shows that the forecast for week 13 of the gasoline sales data from Table 8.1 is given by $\hat{y}_{13} = 0.2y_{12} + 0.8\hat{y}_{12}$. However, the forecast for week 12 is given by $\hat{y}_{12} = 0.2y_{11} + 0.8\hat{y}_{11}$. Thus, we could combine these two results to show that the forecast for week 13 can be written as

$\hat{y}_{13} = 0.2y_{12} + 0.8(0.2y_{11} + 0.8\hat{y}_{11}) = 0.2y_{12} + 0.16y_{11} + 0.64\hat{y}_{11}$

a. Making use of the fact that $\hat{y}_{11} = 0.2y_{10} + 0.8\hat{y}_{10}$ (and similarly for $\hat{y}_{10}$ and $\hat{y}_9$), continue to expand the expression for $\hat{y}_{13}$ until it is written in terms of the past data values $y_{12}$, $y_{11}$, $y_{10}$, $y_9$, $y_8$, and the forecast for period 8, $\hat{y}_8$.
b. Refer to the coefficients or weights for the past values $y_{12}$, $y_{11}$, $y_{10}$, $y_9$, and $y_8$. What observation can you make about how exponential smoothing weights past data values in arriving at new forecasts? Compare this weighting pattern with the weighting pattern of the moving averages method.

10. United Dairies, Inc. supplies milk to several independent grocers throughout Dade County, Florida. Managers at United Dairies want to develop a forecast of the number of half gallons of milk sold per week. Sales data for the past 12 weeks are as follows (data file: UnitedDairies):

Week  Sales    Week  Sales
1     2,750    7     3,300
2     3,100    8     3,100
3     3,250    9     2,950
4     2,800    10    3,000
5     2,900    11    3,200
6     3,050    12    3,150

a. Construct a time series plot. What type of pattern exists in the data?
b. Use exponential smoothing with $\alpha = 0.4$ to develop a forecast of demand for week 13. What is the resulting MSE?

11. For the Hawkins Company, the monthly percentages of all shipments received on time over the past 12 months are 80, 82, 84, 83, 83, 84, 85, 84, 82, 83, 84, and 83 (data file: Hawkins).
a. Construct a time series plot. What type of pattern exists in the data?
b. Compare a three-month moving average forecast with an exponential smoothing forecast for $\alpha = 0.2$. Which provides the better forecasts using MSE as the measure of model accuracy?
c. What is the forecast for the next month?

12. Corporate triple A bond interest rates for 12 consecutive months are as follows (data file: TripleABond):

9.5 9.3 9.4 9.6 9.8 9.7 9.8 10.5 9.9 9.7 9.6 9.6

a. Construct a time series plot. What type of pattern exists in the data?
b. Develop three-month and four-month moving averages for this time series. Does the three-month or the four-month moving average provide the better forecasts based on MSE? Explain.
c. What is the moving average forecast for the next month?

13. The values of Alabama building contracts (in millions of dollars) for a 12-month period are as follows (data file: Alabama):

240 350 230 260 280 320 220 310 240 310 240 230

a. Construct a time series plot. What type of pattern exists in the data?
b. Compare a three-month moving average forecast with an exponential smoothing forecast. Use $\alpha = 0.2$. Which provides the better forecasts based on MSE?
c. What is the forecast for the next month using exponential smoothing with $\alpha = 0.2$?

14. The following time series shows the sales of a particular product over the past 12 months (data file: MonthlySales).

Month  Sales    Month  Sales
1      105      7      145
2      135      8      140
3      120      9      100
4      105      10     80
5      90       11     100
6      120      12     110

a. Construct a time series plot. What type of pattern exists in the data?
b. Use $\alpha = 0.3$ to compute the exponential smoothing values for the time series.
c. Use trial and error to find a value of the exponential smoothing coefficient $\alpha$ that results in a relatively small MSE.

15. Ten weeks of data on the Commodity Futures Index are as follows (data file: CommodityFutures):

7.35 7.40 7.55 7.56 7.60 7.52 7.52 7.70 7.62 7.55

a. Construct a time series plot. What type of pattern exists in the data?
b. Use trial and error to find a value of the exponential smoothing coefficient $\alpha$ that results in a relatively small MSE.

16. The following table reports the percentage of stocks in a portfolio for nine quarters (data file: Portfolio):

Quarter            Stock (%)
Year 1, Quarter 1  29.8
Year 1, Quarter 2  31.0
Year 1, Quarter 3  29.9
Year 1, Quarter 4  30.1
Year 2, Quarter 1  32.2
Year 2, Quarter 2  31.5
Year 2, Quarter 3  32.0
Year 2, Quarter 4  31.9
Year 3, Quarter 1  30.0

a. Construct a time series plot. What type of pattern exists in the data?
b. Use trial and error to find a value of the exponential smoothing coefficient $\alpha$ that results in a relatively small MSE.
c. Using the exponential smoothing model you developed in part (b), what is the forecast of the percentage of stocks in a typical portfolio for the second quarter of year 3?

17. Consider the following time series:

t    1  2   3  4   5
y_t  6  11  9  14  15

a. Construct a time series plot. What type of pattern exists in the data?
b. Use simple linear regression analysis to find the parameters for the line that minimizes MSE for this time series.
c. What is the forecast for $t = 6$?

18. Consider the following time series:

t    1    2    3    4   5   6   7
y_t  120  110  100  96  94  92  88

a. Construct a time series plot. What type of pattern exists in the data?
b. Use simple linear regression analysis to find the parameters for the line that minimizes MSE for this time series.
c. What is the forecast for $t = 8$?

19. Because of high tuition costs at state and private universities, enrollments at community colleges have increased dramatically in recent years. The following data show the enrollment for Jefferson Community College for the nine most recent years (data file: Jefferson):

Year  Period (t)  Enrollment (1,000s)
2001  1           6.5
2002  2           8.1
2003  3           8.4
2004  4           10.2
2005  5           12.5
2006  6           13.3
2007  7           13.7
2008  8           17.2
2009  9           18.1

a. Construct a time series plot. What type of pattern exists in the data?
b. Use simple linear regression analysis to find the parameters for the line that minimizes MSE for this time series.
c. What is the forecast for year 10?

20. The Seneca Children’s Fund (SCF) is a local charity that runs a summer camp for disadvantaged children. The fund’s board of directors has been working very hard over recent years to decrease the amount of overhead expenses, a major factor in how charities are rated by independent agencies. The following data show the percentage of the money SCF has raised that was spent on administrative and fund-raising expenses over the past seven years (data file: Seneca):

Period (t)  Expense (%)
1           13.9
2           12.2
3           10.5
4           10.4
5           11.5
6           10.0
7           8.5

a. Construct a time series plot. What type of pattern exists in the data?
b. Use simple linear regression analysis to find the parameters for the line that minimizes MSE for this time series.
c. Forecast the percentage of administrative expenses for year 8.
d. If SCF can maintain its current trend in reducing administrative expenses, how long will it take for SCF to achieve a level of 5% or less?

21. The president of a small manufacturing firm is concerned about the continual increase in manufacturing costs over the past several years. The following figures provide a time series of the cost per unit for the firm’s leading product over the past eight years (data file: ManufacturingCosts):

Year  Cost/Unit ($)    Year  Cost/Unit ($)
1     20.00            5     26.60
2     24.50            6     30.00
3     28.20            7     31.00
4     27.50            8     36.00

a. Construct a time series plot. What type of pattern exists in the data?
b. Use simple linear regression analysis to find the parameters for the line that minimizes MSE for this time series.
c. What is the average cost increase that the firm has been realizing per year?
d. Compute an estimate of the cost/unit for the next year.

22. Consider the following time series:

Quarter  Year 1  Year 2  Year 3
1        71      68      62
2        49      41      51
3        58      60      53
4        78      81      72

a. Construct a time series plot. What type of pattern exists in the data? Is there an indication of a seasonal pattern?
b. Use a multiple linear regression model with dummy variables as follows to develop an equation to account for seasonal effects in the data: Qtr1 = 1 if quarter 1, 0 otherwise; Qtr2 = 1 if quarter 2, 0 otherwise; Qtr3 = 1 if quarter 3, 0 otherwise.
c. Compute the quarterly forecasts for the next year.

23. Consider the following time series data:

Quarter  Year 1  Year 2  Year 3
1        4       6       7
2        2       3       6
3        3       5       6
4        5       7       8

a. Construct a time series plot. What type of pattern exists in the data?
b. Use a multiple regression model with dummy variables as follows to develop an equation to account for seasonal effects in the data: Qtr1 = 1 if quarter 1, 0 otherwise; Qtr2 = 1 if quarter 2, 0 otherwise; Qtr3 = 1 if quarter 3, 0 otherwise.
c. Compute the quarterly forecasts for the next year based on the model you developed in part (b).
d. Use a multiple regression model to develop an equation to account for trend and seasonal effects in the data. Use the dummy variables you developed in part (b) to capture seasonal effects and create a variable t such that $t = 1$ for quarter 1 in year 1, $t = 2$ for quarter 2 in year 1, . . . , $t = 12$ for quarter 4 in year 3.
e. Compute the quarterly forecasts for the next year based on the model you developed in part (d).
f. Is the model you developed in part (b) or the model you developed in part (d) more effective? Justify your answer.

24. The quarterly sales data (number of copies sold) for a college textbook over the past three years are as follows:

Year      1      1      1      1      2      2      2      2      3      3      3      3
Quarter   1      2      3      4      1      2      3      4      1      2      3      4
Sales     1,690  940    2,625  2,500  1,800  900    2,900  2,360  1,850  1,100  2,930  2,615

TextbookSales

a. Construct a time series plot. What type of pattern exists in the data?

b. Use a regression model with dummy variables as follows to develop an equation to account for seasonal effects in the data: Qtr1 = 1 if quarter 1, 0 otherwise; Qtr2 = 1 if quarter 2, 0 otherwise; Qtr3 = 1 if quarter 3, 0 otherwise.

c. Based on the model you developed in part (b), compute the quarterly forecasts for the next year.

d. Let t = 1 refer to the observation in quarter 1 of year 1; t = 2 refer to the observation in quarter 2 of year 1; . . . ; and t = 12 refer to the observation in quarter 4 of year 3. Using the dummy variables defined in part (b) and t, develop an equation to account for seasonal effects and any linear trend in the time series.

e. Based upon the seasonal effects in the data and linear trend, compute the quarterly forecasts for the next year.

f. Is the model you developed in part (b) or the model you developed in part (d) more effective? Justify your answer.

25. Air pollution control specialists in Southern California monitor the amount of ozone, carbon dioxide, and nitrogen dioxide in the air on an hourly basis. The hourly time series data exhibit seasonality, with the levels of pollutants showing patterns that vary over the hours in the day. On July 15, 16, and 17, the following levels of nitrogen dioxide were observed for the 12 hours from 6:00 a.m. to 6:00 p.m.:

July 15   25  28  35  50  60  60  40  35  30  25  25  20
July 16   28  30  35  48  60  65  50  40  35  25  20  20
July 17   35  42  45  70  72  75  60  45  40  25  25  25

Pollution

a. Construct a time series plot. What type of pattern exists in the data?

b. Use a multiple linear regression model with dummy variables as follows to develop an equation to account for seasonal effects in the data:

Hour1 = 1 if the reading was made between 6:00 a.m. and 7:00 a.m., 0 otherwise
Hour2 = 1 if the reading was made between 7:00 a.m. and 8:00 a.m., 0 otherwise
. . .
Hour11 = 1 if the reading was made between 4:00 p.m. and 5:00 p.m., 0 otherwise

Note that when the values of the 11 dummy variables are equal to 0, the observation corresponds to the 5:00 p.m. to 6:00 p.m. hour.

c. Using the equation developed in part (b), compute estimates of the levels of nitrogen dioxide for July 18.

d. Let t = 1 refer to the observation in hour 1 on July 15; t = 2 refer to the observation in hour 2 of July 15; . . . ; and t = 36 refer to the observation in hour 12 of July 17. Using the dummy variables defined in part (b) and t, develop an equation to account for seasonal effects and any linear trend in the time series.

e. Based on the seasonal effects in the data and linear trend estimated in part (d), compute estimates of the levels of nitrogen dioxide for July 18.

f. Is the model you developed in part (b) or the model you developed in part (d) more effective? Justify your answer.

26. South Shore Construction builds permanent docks and seawalls along the southern shore of Long Island, New York. Although the firm has been in business only five years, revenue has increased from $308,000 in the first year of operation to $1,084,000 in the most recent year. The following data show the quarterly sales revenue in thousands of dollars:

Quarter   Year 1   Year 2   Year 3   Year 4   Year 5
1         20       37       75       92       176
2         100      136      155      202      282
3         175      245      326      384      445
4         13       26       48       82       181

SouthShore

a. Construct a time series plot. What type of pattern exists in the data?

b. Use a multiple regression model with dummy variables as follows to develop an equation to account for seasonal effects in the data: Qtr1 = 1 if quarter 1, 0 otherwise; Qtr2 = 1 if quarter 2, 0 otherwise; Qtr3 = 1 if quarter 3, 0 otherwise.

c. Based on the model you developed in part (b), compute estimates of quarterly sales for year 6.

d. Let Period = 1 refer to the observation in quarter 1 of year 1; Period = 2 refer to the observation in quarter 2 of year 1; . . . ; and Period = 20 refer to the observation in quarter 4 of year 5. Using the dummy variables defined in part (b) and the variable Period, develop an equation to account for seasonal effects and any linear trend in the time series.

e. Based on the seasonal effects in the data and linear trend estimated in part (d), compute estimates of quarterly sales for year 6.

f. Is the model you developed in part (b) or the model you developed in part (d) more effective? Justify your answer.

27. Hogs & Dawgs is an ice cream parlor on the border of north-central Louisiana and southern Arkansas that serves 43 flavors of ice creams, sherbets, frozen yogurts, and sorbets. During the summer Hogs & Dawgs is open from 1:00 p.m. to 10:00 p.m. on Monday through Saturday, and the owner believes that sales change systematically from hour to hour throughout the day. She also believes that her sales increase as the outdoor temperature increases. Hourly sales and the outside temperature at the start of each hour for the last week are provided in the file IceCreamSales.

a. Construct a time series plot of hourly sales and a scatter plot of outdoor temperature and hourly sales. What types of relationships exist in the data?

b. Use a simple regression model with outside temperature as the causal variable to develop an equation to account for the relationship between outside temperature and hourly sales in the data. Based on this model, compute an estimate of hourly sales for today from 2:00 p.m. to 3:00 p.m. if the temperature at 2:00 p.m. is 93°F.

c. Use a multiple linear regression model with the causal variable outside temperature and dummy variables as follows to develop an equation to account for both seasonal effects and the relationship between outside temperature and hourly sales in the data:

Hour1 = 1 if the sales were recorded between 1:00 p.m. and 2:00 p.m., 0 otherwise
Hour2 = 1 if the sales were recorded between 2:00 p.m. and 3:00 p.m., 0 otherwise
. . .
Hour8 = 1 if the sales were recorded between 8:00 p.m. and 9:00 p.m., 0 otherwise

Note that when the values of the eight dummy variables are equal to 0, the observation corresponds to the 9:00 p.m. to 10:00 p.m. hour.

Based on this model, compute an estimate of hourly sales for today from 2:00 p.m. to 3:00 p.m. if the temperature at 2:00 p.m. is 93°F.

d. Is the model you developed in part (b) or the model you developed in part (c) more effective? Justify your answer.

28. Donna Nickles manages a gasoline station on the corner of Bristol Avenue and Harpst Street in Arcata, California. Her station is a franchise, and the parent company calls her station every day at midnight to give her the prices for various grades of gasoline for the upcoming day. Over the past eight weeks Donna has recorded the price and sales (in gallons) of regular-grade gasoline at her station as well as the price of regular-grade gasoline charged by her competitor across the street. She is curious about the sensitivity of her sales to the price of regular gasoline she charges and the price of regular gasoline charged by her competitor across the street. She also wonders whether her sales differ systematically by day of the week and whether her station has experienced a trend in sales over the past eight weeks. The data collected by Donna for each day of the past eight weeks are provided in the file GasStation.

a. Construct a time series plot of daily sales, a scatter plot of the price Donna charges for a gallon of regular gasoline and daily sales at Donna’s station, and a scatter plot of the price Donna’s competitor charges for a gallon of regular gasoline and daily sales at Donna’s station. What types of relationships exist in the data?

b. Use a multiple regression model with the price Donna charges for a gallon of regular gasoline and the price Donna’s competitor charges for a gallon of regular gasoline as causal variables to develop an equation to account for the relationships between these prices and Donna’s daily sales in the data. Based on this model, compute an estimate of sales for a day on which Donna is charging $3.50 for a gallon of regular gasoline and her competitor is charging $3.45 for a gallon of regular gasoline.

c. Use a multiple linear regression model with the trend and dummy variables as follows to develop an equation to account for both trend and seasonal effects in the data:

Monday = 1 if the sales were recorded on a Monday, 0 otherwise
Tuesday = 1 if the sales were recorded on a Tuesday, 0 otherwise
. . .
Saturday = 1 if the sales were recorded on a Saturday, 0 otherwise

Note that when the values of the six dummy variables are equal to 0, the observation corresponds to Sunday.

Based on this model, compute an estimate of sales for Tuesday of the first week after Donna collected her data.

d. Use a multiple regression model with the price Donna charges for a gallon of regular gasoline and the price Donna’s competitor charges for a gallon of regular gasoline as causal variables and the trend and dummy variables from part (c) to create an equation to account for the relationships between these prices and daily sales as well as the trend and seasonal effects in the data. Based on this model, compute an estimate of sales for Tuesday of the first week after Donna collected her data if Donna is charging $3.50 for a gallon of regular gasoline and her competitor is charging $3.45 for a gallon of regular gasoline.

e. Which of the three models you developed in parts (b), (c), and (d) is most effective? Justify your answer.

GasStation

Case Problem: Forecasting Food and Beverage Sales

The Vintage Restaurant, on Captiva Island near Fort Myers, Florida, is owned and operated by Karen Payne. The restaurant just completed its third year of operation. During those three years, Karen sought to establish a reputation for the restaurant as a high-quality dining establishment that specializes in fresh seafood. Through the efforts of Karen and her staff, her restaurant has become one of the best and fastest-growing restaurants on the Island.

To better plan for future growth of the restaurant, Karen needs to develop a system that will enable her to forecast food and beverage sales by month for up to one year in advance. The following table shows the value of food and beverage sales ($1,000s) for the first three years of operation:

Month        First Year   Second Year   Third Year
January      242          263           282
February     235          238           255
March        232          247           265
April        178          193           205
May          184          193           210
June         140          149           160
July         145          157           166
August       152          161           174
September    110          122           126
October      130          130           148
November     152          167           173
December     206          230           235

Managerial Report

Perform an analysis of the sales data for the Vintage Restaurant. Prepare a report for Karen that summarizes your findings, forecasts, and recommendations. Include the following:

1. A time series plot. Comment on the underlying pattern in the time series.

2. Using the dummy variable approach, forecast sales for January through December of the fourth year. How would you explain this model to Karen?

Assume that January sales for the fourth year turn out to be $295,000. What was your forecast error? If this error is large, Karen may be puzzled about the difference between your forecast and the actual sales value. What can you do to resolve her uncertainty about the forecasting procedure?

Vintage


Appendix 8.1 Using the Excel Forecast Sheet

Excel features a tool called Forecast Sheet which can automatically produce forecasts using the Holt–Winters additive seasonal smoothing model. The Holt–Winters model is an exponential smoothing approach to estimating additive linear trend and seasonal effects. It also generates a variety of other outputs that are useful in assessing the accuracy of the forecast model it produces.
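To make the idea of additive Holt–Winters smoothing concrete, the following Python sketch implements the standard textbook recursions for level, trend, and seasonal terms. It is not Excel's implementation: the initialization shown here and the fixed smoothing constants passed in by the caller are assumptions made for illustration, whereas Forecast Sheet chooses its own parameter values. The function name `holt_winters_additive` and the sample series are hypothetical.

```python
def holt_winters_additive(y, m, alpha, beta, gamma, horizon):
    """Additive Holt-Winters forecasts for `horizon` periods beyond the data.

    y: observed values; m: length of the seasonal pattern (4 for quarterly).
    alpha, beta, gamma: smoothing constants for level, trend, and seasonality.
    """
    if len(y) < 2 * m:
        raise ValueError("need at least two full seasons of data")
    # Initialization (an assumption of this sketch): estimate the trend from
    # the change in seasonal averages, then detrend the first season.
    first = sum(y[:m]) / m
    second = sum(y[m:2 * m]) / m
    trend = (second - first) / m
    mid = (m - 1) / 2
    season = [y[i] - (first + trend * (i - mid)) for i in range(m)]
    level = first + trend * mid  # level at the end of the first season
    for t in range(m, len(y)):
        s_old = season[t % m]
        prev_level = level
        level = alpha * (y[t] - s_old) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[t % m] = gamma * (y[t] - level) + (1 - gamma) * s_old
    n = len(y)
    return [level + h * trend + season[(n + h - 1) % m]
            for h in range(1, horizon + 1)]


# Hypothetical noise-free series: linear trend plus a repeating 4-period pattern.
pattern = [1, -2, 3, -2]
series = [10 + 2 * t + pattern[t % 4] for t in range(12)]
forecasts = holt_winters_additive(series, 4, 0.5, 0.1, 0.1, 4)
```

On a noise-free series like this one the recursions reproduce the trend and seasonal pattern regardless of the smoothing constants; with real data the constants determine how quickly the estimates react to new observations.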

We will demonstrate Forecast Sheet on the four years of quarterly smartphone sales that are provided in Table 8.6. A review of the time series plot of these data in Figure 8.6 provides clear evidence of an increasing linear trend and a seasonal pattern (sales are consistently lowest in the second quarter of each year and highest in quarters 3 and 4). We concluded in Section 8.4 that we need to use a forecasting method that is capable of dealing with both trend and seasonality when developing a forecasting model for this time series, and so it is appropriate to use Forecast Sheet to produce forecasts for these data.

We begin by putting the data into the format required by Forecast Sheet. The time series data must be collected on a consistent interval (i.e., annually, quarterly, monthly, and so on), and the spreadsheet must include two data series in contiguous columns or rows that include

• a series with the dates or periods in the time series
• a series with corresponding time series values

First, open the file SmartPhoneSales, then insert a column between column B (Quarter) and Column C (Sales (1000s)). Enter Period into cell C1; this will be the heading for the column of values that will represent the periods in our data. Next enter 1 in cell C2, 2 in cell C3, 3 in cell C4, and so on, ending with 16 in Cell C17, as shown in Figure 8.19.

Now that the data are properly formatted for Forecast Sheet, the following steps can be used to produce forecasts for the next four quarters (periods 17 through 20) with Forecast Sheet:

Step 1. Highlight cells C1:D17 (the data in column C of this highlighted section is what Forecast Sheet refers to as the Timeline Range and the data in column D is the Values Range)

Step 2. Click the Data tab in the Ribbon

Step 3. Click Forecast Sheet in the Forecast group

Step 4. When the Create Forecast Worksheet dialog box appears (Figure 8.20):
    Select 20 for Forecast End
    Click Options to expand the Create Forecast Worksheet dialog box and show the options (Figure 8.20)
    Select 16 for Forecast Start
    Select 95% for Confidence Interval
    Under Seasonality, click on Set Manually and select 4
    Select the checkbox for Include forecast statistics
    Click Create

FIGURE 8.19 Smartphone Data Reformatted for Forecast Sheet

Year   Quarter   Period   Sales (1000s)
1      1         1        4.8
1      2         2        4.1
1      3         3        6.0
1      4         4        6.5
2      1         5        5.8
2      2         6        5.2
2      3         7        6.8
2      4         8        7.4
3      1         9        6.0
3      2         10       5.6
3      3         11       7.5
3      4         12       7.8
4      1         13       6.3
4      2         14       5.9
4      3         15       8.0
4      4         16       8.4

Forecast Sheet was introduced in Excel 2016; it is not available in prior versions of Excel.

Excel refers to the forecasting approach used by Forecast Sheet as the AAA exponential smoothing (ETS) algorithm, where AAA stands for additive error, additive trend, and additive seasonality.

Forecast Sheet requires that the period selected for Forecast Start is one of the periods of the original time series.

The results of Forecast Sheet will be output to a new worksheet as shown in Figure 8.21. The output of Forecast Sheet includes the following:

• The period for each of the 16 time series observations and the forecasted time periods in column A
• The actual time series data for periods 1 to 16 in column B
• The forecasts for periods 16 to 20 in column C
• The lower confidence bounds for the forecasts for periods 16 to 20 in column D
• The upper confidence bounds for the forecasts for periods 16 to 20 in column E
• A line graph of the time series, forecast values, and forecast interval
• The values of the three parameters (alpha, beta, and gamma) used in the Holt–Winters additive seasonal smoothing model in cells H2:H4 (these values are determined by an algorithm in Forecast Sheet)
• Measures of forecast accuracy in cells H5:H8, including

    • the MASE, or mean absolute scaled error, in cell H5. MASE is defined as:

      $$\text{MASE} = \frac{\frac{1}{n}\sum_{t=1}^{n} |e_t|}{\frac{1}{n-1}\sum_{t=2}^{n} |y_t - y_{t-1}|}$$

      MASE compares the forecast error, $e_t$, to a naïve forecast error given by $|y_t - y_{t-1}|$. If MASE > 1, then the forecast is considered inferior to a naïve forecast; if MASE < 1, the forecast is considered superior to a naïve forecast.

    • the SMAPE, or symmetric mean absolute percentage error, in cell H6. SMAPE is defined as:

      $$\text{SMAPE} = \frac{1}{n}\sum_{t=1}^{n} \frac{|e_t|}{\left(|y_t| + |\hat{y}_t|\right)/2}$$

      SMAPE is similar to mean absolute percentage error (MAPE), discussed in Section 8.2; both SMAPE and MAPE measure forecast error relative to actual values.

    • the MAE, or mean absolute error (as defined in equation (8.3)), in cell H7

    • the RMSE, or root mean squared error (which is the square root of the MSE, defined in equation (8.4)), in cell H8
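The four accuracy measures reported by Forecast Sheet can be computed directly from paired actual and forecast values. The Python sketch below follows the usual definitions of MASE, SMAPE, MAE, and RMSE; the function name `forecast_accuracy` and the sample values are hypothetical, and the sketch makes no claim about which fitted values Excel uses internally.

```python
import math


def forecast_accuracy(actual, forecast):
    """Return MASE, SMAPE, MAE, and RMSE for paired actual/forecast values."""
    n = len(actual)
    errors = [y - f for y, f in zip(actual, forecast)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # Naive benchmark: forecast each period with the previous period's value.
    naive_mae = sum(abs(actual[t] - actual[t - 1]) for t in range(1, n)) / (n - 1)
    mase = mae / naive_mae
    # Each absolute error is scaled by the average of |actual| and |forecast|.
    smape = sum(abs(e) / ((abs(y) + abs(f)) / 2)
                for e, y, f in zip(errors, actual, forecast)) / n
    return {"MASE": mase, "SMAPE": smape, "MAE": mae, "RMSE": rmse}


# Hypothetical values, not the smartphone forecasts from the text.
actual = [10.0, 12.0, 11.0, 13.0]
forecast = [9.0, 12.0, 12.0, 12.0]
stats = forecast_accuracy(actual, forecast)
```

A MASE below 1 in this example means the hypothetical forecasts have a smaller mean absolute error than the naïve "previous value" forecast on the same data.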

Figures 8.22 and 8.23 display the formula view of portions of the worksheet that Forecast Sheet generated based on the smartphone quarterly sales data. For example, in cell C18, the forecast value generated for Period 17 smartphone sales is determined by the formula:

=FORECAST.ETS(A18, B2:B17, A2:A17, 4, 1)

The first argument in this function specifies the period to be forecasted. The second argument specifies the time series data upon which the forecast is based. The third argument lists the timeline associated with the time series values. The fourth (optional) argument addresses seasonality, and the value of 4 indicates the length of the seasonal pattern. The fifth (optional) argument addresses missing data, and a value of 1 means that any missing observations will be approximated as the average of the neighboring observations; these data had no missing observations, so the value of this argument does not matter.

A value of 0 for the fifth argument of the FORECAST.ETS function means that Excel will treat any missing observations as zeros.

There is a sixth (optional) argument of the FORECAST.ETS function that addresses how to aggregate multiple observations for the same time period. Choices include AVERAGE, SUM, COUNT, COUNTA, MIN, MAX, and MEDIAN.

FIGURE 8.20 Create Forecast Worksheet Dialog Box with Options Open for Quarterly Smartphone Sales

[Dialog box settings shown: Forecast End 20; Forecast Start 16; Confidence Interval 95%; Seasonality (Detect Automatically or Set Manually); Timeline Range Data!$C$2:$C$17; Values Range Data!$D$2:$D$17; Fill Missing Points Using: Interpolation; Aggregate Duplicates Using: Average; Include forecast statistics. The preview chart plots Values, Forecast, Lower Confidence Bound, and Upper Confidence Bound for periods 1 through 20.]

A value of 1 for the fourth argument of the FORECAST.ETS function means that Excel detects the seasonality in the data automatically. A value of 0 means that there is no seasonality in the data.


Cell D18 contains the lower confidence bound for the forecast of the Period 17 smartphone sales. This lower confidence bound is determined by the formula:

=C18-FORECAST.ETS.CONFINT(A18, B2:B17, A2:A17, 0.95, 4, 1)

Similarly, cell E18 contains the upper confidence bound for the forecast of the Period 17 smartphone sales. This upper confidence bound is determined by the formula:

=C18+FORECAST.ETS.CONFINT(A18, B2:B17, A2:A17, 0.95, 4, 1)

FIGURE 8.21 Forecast Sheet Results for Quarterly Smartphone Sales

[Worksheet output: columns A through E hold Period, Sales (1000s), Forecast(Sales (1000s)), Lower Confidence Bound(Sales (1000s)), and Upper Confidence Bound(Sales (1000s)); the Statistic and Value columns list the model parameters and accuracy measures. A line chart plots the four series for periods 1 through 20.]

Period   Forecast   Lower Confidence Bound   Upper Confidence Bound
16       8.4        8.4                      8.4
17       7.3        6.9                      7.8
18       6.6        6.2                      7.1
19       8.4        7.9                      9.0
20       9.0        8.4                      9.6

Statistic   Value
Alpha       0.50
Beta        0.00
Gamma       0.00
MASE        0.22
SMAPE       0.03
MAE         0.20
RMSE        0.27

FIGURE 8.22 Formula View of Forecast Sheet Results for Quarterly Smartphone Sales


Many of the arguments for the FORECAST.ETS.CONFINT function are the same as the arguments for the FORECAST.ETS function. The first argument in the FORECAST.ETS.CONFINT function specifies the period to be forecasted. The second argument specifies the time series data upon which the forecast is based. The third argument lists the timeline associated with the time series values. The fourth (optional) argument specifies the confidence level associated with the calculated confidence interval. The fifth (optional) argument addresses seasonality, and the value of 4 indicates the length of the seasonal pattern. The sixth (optional) argument addresses missing data, and a value of 1 means that any missing observations will be approximated as the average of the neighboring observations; these data had no missing observations, so the value of this argument does not matter.

Cells H2:H8 of Figure 8.23 list the Excel formulas used to compute the respective sta- tistics for the smartphone sales forecasts. These formulas are:

• Alpha: =FORECAST.ETS.STAT(B2:B17, A2:A17, 1, 4, 1)
• Beta: =FORECAST.ETS.STAT(B2:B17, A2:A17, 2, 4, 1)
• Gamma: =FORECAST.ETS.STAT(B2:B17, A2:A17, 3, 4, 1)
• MASE: =FORECAST.ETS.STAT(B2:B17, A2:A17, 4, 4, 1)
• SMAPE: =FORECAST.ETS.STAT(B2:B17, A2:A17, 5, 4, 1)
• MAE: =FORECAST.ETS.STAT(B2:B17, A2:A17, 6, 4, 1)
• RMSE: =FORECAST.ETS.STAT(B2:B17, A2:A17, 7, 4, 1)

Many of the arguments for the FORECAST.ETS.STAT function are the same as the arguments for the FORECAST.ETS function. The first argument in the FORECAST.ETS.STAT function specifies the time series data upon which the forecast is based. The second argument lists the timeline associated with the time series values. The third argument specifies the statistic or parameter type; for example, a value of 4 corresponds to the MASE statistic. The fourth (optional) argument addresses seasonality, and the value of 4 indicates the length of the seasonal pattern. The fifth (optional) argument addresses missing data, and a value of 1 means that any missing observations will be approximated as the average of the neighboring observations; these data had no missing observations, so the value of this argument does not matter.

FIGURE 8.23 Excel Formulas for Smartphone Forecast Statistics
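The pattern in these formulas is that only the third argument changes from one statistic to the next. As a small illustration, the Python sketch below records the seven statistic codes listed in this appendix and rebuilds the corresponding formula text; the names `STAT_CODES` and `ets_stat_formula` are hypothetical helpers, and the cell ranges are simply the ones used in the smartphone example.

```python
# Statistic codes for the third argument of FORECAST.ETS.STAT,
# as listed in this appendix.
STAT_CODES = {
    "Alpha": 1,
    "Beta": 2,
    "Gamma": 3,
    "MASE": 4,
    "SMAPE": 5,
    "MAE": 6,
    "RMSE": 7,
}


def ets_stat_formula(statistic, values="B2:B17", timeline="A2:A17",
                     seasonality=4, data_completion=1):
    """Build the Excel formula text for one forecast statistic."""
    code = STAT_CODES[statistic]
    return (f"=FORECAST.ETS.STAT({values}, {timeline}, "
            f"{code}, {seasonality}, {data_completion})")


formulas = {name: ets_stat_formula(name) for name in STAT_CODES}
```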

We conclude this appendix with a few comments on the functionality of Forecast Sheet. Forecast Sheet features an algorithm for automatically finding the number of time periods over which the seasonal pattern recurs. To use this algorithm, select the option for Detect Automatically under Seasonality in the Create Forecast Worksheet dialog box before clicking Create. We suggest using this feature only to confirm a suspected seasonal pattern, because using it to search for a seasonal effect may lead to identification of a spurious pattern that does not actually reflect seasonality. This would result in a model that is overfit on the observed time series data and would likely produce very inaccurate forecasts. A forecast model with seasonality should only be fit when the modeler has reason to suspect a specific seasonal pattern.

The Forecast Start parameter in the Create Forecast Worksheet dialog box controls both the first period to be forecasted and the last period to be used to generate the forecast model. If we had selected 15 for Forecast Start, we would have generated a forecast model for the smartphone quarterly sales data based on only the first 15 periods of data in the original time series.

Forecast Sheet can accommodate multiple observations for a single period of the time series. The Aggregate Duplicates Using option in the Create Forecast Worksheet dialog box allows the user to select from several ways to deal with this issue.

Forecast Sheet allows for up to 30% of the values for the time series variable to be missing. In the smartphone quarterly sales data, the value of sales for up to 30% of the 16 periods (or 4 periods) could be missing and Forecast Sheet will still produce forecasts. The Fill Missing Points Using option in the Create Forecast Worksheet dialog box allows the user to select whether the missing values will be replaced with zero or with the result of linearly interpolating existing values in the time series.
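The two ways of filling missing values described above, along with the 30% limit, can be sketched in Python as follows. This is an illustration of the idea rather than Excel's code: the function name `fill_missing` is hypothetical, missing values are represented by `None`, and refusing to interpolate a gap at either end of the series is a choice made for this sketch.

```python
def fill_missing(values, method="interpolate", max_missing_share=0.30):
    """Fill None entries by linear interpolation or with zeros."""
    n = len(values)
    missing = [i for i, v in enumerate(values) if v is None]
    if len(missing) > max_missing_share * n:
        raise ValueError("too many missing values")
    if method == "zeros":
        return [0.0 if v is None else v for v in values]
    known = [i for i, v in enumerate(values) if v is not None]
    filled = list(values)
    for i in missing:
        # Nearest known observations on each side of the gap.
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None or right is None:
            raise ValueError("cannot interpolate a missing value at either end")
        weight = (i - left) / (right - left)
        filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


# Hypothetical series with one missing observation out of four (25%).
interpolated = fill_missing([10.0, None, 14.0, 15.0])
zero_filled = fill_missing([10.0, None, 14.0, 15.0], method="zeros")
```

With 16 periods, the check allows at most 4 missing values, since 5 missing values would exceed 30% of 16 (4.8).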

Predictive Data Mining

CONTENTS

ANALYTICS IN ACTION: ORBITZ

9.1 DATA SAMPLING, PREPARATION, AND PARTITIONING

9.2 PERFORMANCE MEASURES
Evaluating the Classification of Categorical Outcomes
Evaluating the Estimation of Continuous Outcomes

9.3 LOGISTIC REGRESSION

9.4 k-NEAREST NEIGHBORS
Classifying Categorical Outcomes with k-Nearest Neighbors
Estimating Continuous Outcomes with k-Nearest Neighbors

9.5 CLASSIFICATION AND REGRESSION TREES
Classifying Categorical Outcomes with a Classification Tree
Estimating Continuous Outcomes with a Regression Tree
Ensemble Methods

AVAILABLE IN THE MINDTAP READER:

APPENDIX 9.1 DATA PARTITIONING WITH ANALYTIC SOLVER

APPENDIX 9.2 LOGISTIC REGRESSION CLASSIFICATION WITH ANALYTIC SOLVER

APPENDIX 9.3 K-NEAREST NEIGHBOR CLASSIFICATION AND ESTIMATION WITH ANALYTIC SOLVER

APPENDIX 9.4 SINGLE CLASSIFICATION AND REGRESSION TREES WITH ANALYTIC SOLVER

APPENDIX 9.5 RANDOM FORESTS OF CLASSIFICATION OR REGRESSION TREES WITH ANALYTIC SOLVER

APPENDIX 9.6: DATA PARTITIONING WITH JMP PRO

APPENDIX 9.7: LOGISTIC REGRESSION CLASSIFICATION WITH JMP PRO

APPENDIX 9.8: K-NEAREST NEIGHBOR CLASSIFICATION AND ESTIMATION WITH JMP PRO

APPENDIX 9.9: SINGLE CLASSIFICATION AND REGRESSION TREES WITH JMP PRO

APPENDIX 9.10: RANDOM FORESTS OF CLASSIFICATION AND REGRESSION TREES WITH JMP PRO

Chapter 9

Organizations are collecting an increasing amount of data, and one of the most pressing tasks is converting this data into actionable insights. A common challenge is to analyze these data to extract information on patterns and trends that can be used to assist decision makers in predicting future events. In this chapter, we discuss predictive methods that can be applied to leverage data to gain customer insights and to establish new business rules to guide managers.

We define an observation, or record, as the set of recorded values of variables associated with a single entity. An observation is often displayed as a row of values in a spreadsheet or database in which the columns correspond to the variables. For example, in direct-marketing data, an observation may correspond to a customer and contain information regarding her/his response to an e-mail advertisement as well as information regarding her/his demographic characteristics.

In this chapter, we focus on data mining methods for predicting an outcome based on a set of input variables, or features. These methods are also referred to as supervised learning. Linear regression is a well-known supervised learning approach from classical statistics in which observations of a quantitative outcome (the dependent y variable) and one or more corresponding features (the independent variables $x_1, x_2, \ldots, x_q$) are used to create an equation for estimating y values. That is, in supervised learning the outcome variable “supervises” or guides the process of “learning” how to predict future outcomes. In this chapter, we focus on supervised learning methods for the estimation of a continuous outcome (e.g., sales revenue) and for the classification of a binary categorical outcome (e.g., whether or not a customer defaults on a loan).

The data mining process comprises the following steps:

1. Data sampling. Extract a sample of data that is relevant to the business problem under consideration.

2. Data preparation. Manipulate the data to put it in a form suitable for formal modeling. This step includes addressing missing and erroneous data, reducing the number of variables, and defining new variables. Data exploration is an important part of this step and may involve the use of descriptive statistics, data visualization, and clustering to better understand the relationships supported by the data.

3. Data partitioning. Divide the sample data into three sets for the training, validation, and testing of the data mining algorithm performance.

4. Model construction. Apply the appropriate data mining technique (e.g., k-nearest neighbors, regression trees) to the training data set to accomplish the desired data mining task (classification or estimation).

5. Model assessment. Evaluate models by comparing performance on the training and validation data sets. Apply the selected model to the test data as a final appraisal of the model’s performance.

In Chapter 4, we describe descriptive data mining methods, such as clustering and association rules, that explore relationships between observations and/or variables.

See Chapter 7 for a discussion of linear regression.

Estimation methods are also referred to as regression methods or prediction methods.

Chapter 4 discusses the data-preparation process as well as clustering techniques often used to redefine variables. Chapters 2 and 3 discuss descriptive statistics and data-visualization techniques.

ANALYTICS IN ACTION

Orbitz*

Although they might not see their customers face to face, online retailers are getting to know their patrons to tailor the offerings on their virtual shelves. By mining web-browsing data collected in “cookies” (files that web sites use to track people’s web-browsing behavior), online retailers identify trends that can potentially be used to improve customer satisfaction and boost online sales.

For example, consider Orbitz, an online travel agency that books flights, hotels, car rentals, cruises, and other travel activities for its customers. Tracking its patrons’ online activities, Orbitz discovered that people who use Mac computers spend as much as 30% more per night on hotels. Orbitz’s analytics team has uncovered other factors that affect purchase behavior, including how the shopper arrived at the Orbitz site (Did the user visit Orbitz directly or was he or she referred from another site?), previous booking history on Orbitz, and the shopper’s geographic location. Orbitz can act on this and other information gleaned from the vast amount of web data to differentiate the recommendations for hotels, car rentals, flight bookings, etc.

*“On Orbitz, Mac Users Steered to Pricier Hotels,” Wall Street Journal (2012, June 26).

9.1 Data Sampling, Preparation, and Partitioning

Upon identifying a business problem, data on relevant variables must be obtained for analysis. Although access to large amounts of data offers the potential to unlock insight and improve decision making, it comes with the risk of drowning in a sea of data. Data repositories with millions of observations over hundreds of measured variables are now common. If the volume of relevant data is extremely large (thousands of observations or more), it is unnecessary (and computationally difficult) to use all the data in order to perform a detailed analysis. When dealing with large volumes of data (with hundreds of thousands or millions of observations), best practice is to extract a representative sample (with thousands or tens of thousands of observations) for analysis. A sample is representative if the analyst can draw the same conclusions from it as from the entire population of data.

There are no definite rules to determine the size of a sample. The sample of data must be large enough to contain significant information, yet small enough to manipulate quickly. If the sample is too small, relationships in the data may be missed or spurious relationships may be suggested. Perhaps the best advice is to use enough data to eliminate any doubt about whether the sample size is sufficient; data mining algorithms typically are more effective given more data. If we are investigating a rare event (e.g., click-through on an advertisement posted on a web site), the sample should be large enough to ensure several hundred to thousands of observations that correspond to click-throughs. That is, if the click-through rate is only 1%, then a representative sample would need to be approximately 50,000 observations in order to have about 500 observations corresponding to situations in which a person clicked on an ad.
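The arithmetic behind the click-through example is simply the target number of rare-event observations divided by the event rate. A minimal Python sketch, in which the function name `sample_size_for_rare_event` is hypothetical and the rate is converted to an exact fraction to avoid floating-point rounding:

```python
import math
from fractions import Fraction


def sample_size_for_rare_event(target_events, event_rate):
    """Smallest sample size whose expected number of events reaches the target."""
    if not 0 < event_rate <= 1:
        raise ValueError("event_rate must be in (0, 1]")
    rate = Fraction(str(event_rate))  # exact value of the decimal rate
    return math.ceil(Fraction(target_events) / rate)


# About 500 click-throughs at a 1% click-through rate.
needed = sample_size_for_rare_event(500, 0.01)
```

This gives the expected count only; the actual number of click-throughs in any particular sample of that size will vary around 500.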

When obtaining a representative sample, it is also important not to carelessly discard variables. It is generally best to include as many variables as possible in the sample. In the data preparation step, the analyst can use descriptive statistics and data visualization to identify any clearly irrelevant variables that should be eliminated. Descriptive statistics and data visualization also play a role in addressing missing and erroneous data. At this stage of data preparation, clustering may be useful to define new variables based on clusters of similar observations.

Once a representative data sample has been prepared for analysis, it must be partitioned into two or three data sets to appropriately evaluate the performance of predictive data mining models. To understand the need for data partitioning, we consider a situation in which an analyst has relatively few data points from which to build a multiple regression model. To maintain the sample size necessary to obtain reliable estimates of slope coefficients, an analyst may have no choice but to use the entire data set to build a model. Even if measures such as $R^2$ and the standard error of the estimate suggest that the resulting linear regression model may fit the data set well, these measures only explain how well the model fits data it has “seen,” and the analyst has little idea how well this model will fit other “unobserved” data points.

Classical statistics deals with a scarcity of data by determining the minimum sample size needed to draw legitimate inferences about the population. In contrast, data mining applications deal with an abundance of data that simplifies the process of assessing the performance of data-based estimates of variable effects. However, the wealth of data can tempt the analyst to overfit the model. Model overfitting occurs when the analyst builds a model that does a great job of explaining the sample of data on which it is based, but fails to accurately predict outside the sample data. We can use the abundance of data to guard against the potential for overfitting by decomposing the data set into three partitions: the training set, the validation set, and the test set.

Multiple regression models are discussed in Chapter 7.

9.2 Performance Measures 425

The training set consists of the data used to build the candidate models. For example, a training set may be used to estimate the slope coefficients in a multiple regression model. We use measures of performance of these models on the training set to identify a promising initial subset of models. However, since the training set consists of the data used to build the models, it cannot be used to clearly identify the best model for prediction when applied to new data (data outside the training set). Therefore, the promising subset of models is then applied to the validation set to identify which model may be the most accurate at predicting observations that were not used to build the model.

If the validation set is used to identify a “best” model through either comparison with other models or the tuning of model parameters, then the estimates of model performance are also biased (we tend to overestimate performance). Thus, the final model must be applied to the test set in order to conservatively estimate this model’s effectiveness when applied to data that have not been used to build or select the model.

For example, suppose we have identified four models that fit the training set reasonably well. To evaluate how these models will handle predictions when applied to new data, we apply these four models to the validation set. After identifying the best of the four models, we apply this “best” model to the test set in order to obtain an unbiased estimate of this model’s performance on future applications.
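A three-way partition of this kind can be sketched as follows. The 50/30/20 split, the fixed seed, and the function name `partition` are assumptions made for illustration; the text does not prescribe partition sizes.

```python
import random

def partition(rows, train_frac=0.5, valid_frac=0.3, seed=0):
    """Randomly split rows into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)       # reproducible shuffle
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]         # whatever remains
    return train, valid, test

# Hypothetical data set of 1,000 observation IDs.
train, valid, test = partition(range(1000))
print(len(train), len(valid), len(test))
```

Candidate models would be fit on `train`, compared on `valid`, and the single chosen model scored once on `test`.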

There are no definite rules for the size of the three partitions, but the training set is typically the largest. For estimation tasks, a rule of thumb is to have at least 10 times as many observations as variables. For classification tasks, a rule of thumb is to have at least $6 \times m \times q$ observations, where m is the number of outcome categories and q is the number of variables. When we are interested in predicting a rare event, such as a click-through on an advertisement posted on a web site or a fraudulent credit card transaction, it is recommended that the training set oversample the number of observations corresponding to the rare events to provide the data mining algorithm sufficient data to “learn” about the rare events. For example, if only one out of every 10,000 users clicks on an advertisement posted on a web site, we would not have sufficient information to distinguish between users who do not click-through and those who do if we constructed a representative training set consisting of one observation corresponding to a click-through and 9,999 observations with no click-through. In these cases, the training set should contain equal or nearly equal numbers of observations corresponding to the different values of the outcome variable. Note that we do not oversample the validation set and test sets; these samples should be representative of the overall population so that performance measures evaluated on these data sets appropriately reflect future performance of the data mining model.
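The two rules of thumb and a balanced training set can be sketched as below. The helper names and the toy data (1 event per 100 rows) are hypothetical. Keeping every rare-event row and drawing an equal number of other rows is only one possible way to obtain "equal or nearly equal" class counts.

```python
import random

def min_obs_estimation(q):
    """Rule of thumb: at least 10 observations per variable."""
    return 10 * q

def min_obs_classification(m, q):
    """Rule of thumb: at least 6 * m * q observations."""
    return 6 * m * q

def balanced_training_set(rows, seed=0):
    """rows: list of (features, y) with y in {0, 1}.
    Keep every row of the rarer class and sample an equal number from the other class."""
    ones = [r for r in rows if r[1] == 1]
    zeros = [r for r in rows if r[1] == 0]
    rare, common = (ones, zeros) if len(ones) <= len(zeros) else (zeros, ones)
    return rare + random.Random(seed).sample(common, len(rare))

# Hypothetical population: 10,000 rows, 100 of them rare events.
rows = [((i,), 1 if i % 100 == 0 else 0) for i in range(10_000)]
train = balanced_training_set(rows)
print(len(train), sum(y for _, y in train))
```

The validation and test sets would still be drawn without this rebalancing, as the paragraph above notes.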

9.2 Performance Measures

There are different performance measures for methods classifying categorical outcomes than for methods estimating continuous outcomes. We describe each of these in the context of an example from the financial services industry. Optiva Credit Union wants to better understand its personal lending process and its loan customers. The file Optiva contains over 40,000 customer observations with information on whether the customer defaulted on a loan, customer age, average checking account balance, whether the customer had a mortgage, the customer’s job status, the customer’s marital status, and the customer’s level of education. We will use these data to demonstrate the use of supervised learning methods to classify customers who are likely to default and to estimate the average balance in a customer’s bank accounts.

Evaluating the Classification of Categorical Outcomes

In our treatment of classification problems, we restrict our attention to problems for which we want to classify observations into one of two possible classes (e.g., loan default or no default), but the concepts generally extend to cases with more than two classes. A natural way to evaluate the performance of a classification method, or classifier, is to count the


426 Chapter 9 Predictive Data Mining

number of times that an observation is predicted to be in the wrong class. By counting the classification errors on a sufficiently large validation set and/or test set that is representative of the population, we will generate an accurate measure of classification performance of our model.

Classification error is commonly displayed in a confusion matrix, which displays a model’s correct and incorrect classifications. Table 9.1 illustrates a confusion matrix resulting from an attempt to classify the customer observations in a subset of data from the file Optiva. In this table, Class 1 = loan default and Class 0 = no default. The confusion matrix is a cross-tabulation of the actual class of each observation and the predicted class of each observation. From the first row of the matrix in Table 9.1, we see that 7,479 observations corresponding to nondefaults were correctly identified and 5,244 actual nondefault observations were incorrectly classified as loan defaults. From the second row of Table 9.1, we observe that 89 actual loan defaults were incorrectly classified as nondefaults and 146 observations corresponding to loan defaults were correctly identified.

Many measures of classification performance are based on the confusion matrix. The percentage of misclassified observations is expressed as the overall error rate and is computed as

$$\text{Overall error rate} = \frac{n_{10} + n_{01}}{n_{11} + n_{10} + n_{01} + n_{00}}$$

The overall error rate of the classification in Table 9.1 is (89 + 5,244)/(146 + 89 + 5,244 + 7,479) = 41.2%. One minus the overall error rate is often referred to as the accuracy of the model. The model accuracy based on Table 9.1 is 58.8%.

While overall error rate conveys an aggregate measure of misclassification, it counts misclassifying an actual Class 0 observation as a Class 1 observation (a false positive) the same as misclassifying an actual Class 1 observation as a Class 0 observation (a false negative). In many situations, the cost of making these two types of errors is not equivalent. For example, suppose we are classifying patient observations into two categories: Class 1 is cancer and Class 0 is healthy. The cost of incorrectly classifying a healthy patient observation as “cancer” will likely be limited to the expense (and stress) of additional testing. The cost of incorrectly classifying a cancer patient observation as “healthy” may result in an indefinite delay in treatment of the cancer and premature death of the patient.

To account for the asymmetric costs in misclassification, we define the error rate with respect to the individual classes:

$$\text{Class 1 error rate} = \frac{n_{10}}{n_{11} + n_{10}}$$

$$\text{Class 0 error rate} = \frac{n_{01}}{n_{01} + n_{00}}$$

The Class 1 error rate of the classification in Table 9.1 is 89/(146 + 89) = 37.9%. The Class 0 error rate of the classification in Table 9.1 is 5,244/(5,244 + 7,479) = 41.2%. That is, the model that produced the classifications in Table 9.1 is slightly better at predicting Class 1 observations than Class 0 observations.
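The error rates quoted for Table 9.1 can be reproduced directly from the four cell counts. The short Python sketch below uses the counts from the table; only the variable names are introduced here.

```python
# Cell counts from Table 9.1 (n_ij = actual class i, predicted class j).
n00, n01, n10, n11 = 7479, 5244, 89, 146

overall_error = (n10 + n01) / (n11 + n10 + n01 + n00)
accuracy = 1 - overall_error
class1_error = n10 / (n11 + n10)   # share of actual defaults that were missed
class0_error = n01 / (n01 + n00)   # share of actual nondefaults flagged as defaults

print(f"overall error {overall_error:.1%}, accuracy {accuracy:.1%}")
print(f"Class 1 error {class1_error:.1%}, Class 0 error {class0_error:.1%}")
```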

In Table 9.1, n01 is the number of false positives and n10 is the number of false negatives.

TABLE 9.1 Confusion Matrix

| Actual Class | Predicted Class 0 | Predicted Class 1 |
|---|---|---|
| 0 | $n_{00}$ = 7,479 | $n_{01}$ = 5,244 |
| 1 | $n_{10}$ = 89 | $n_{11}$ = 146 |


To understand the tradeoff between Class 1 error rate and Class 0 error rate, we must be aware of the criteria generally used by classification algorithms to classify observations. Most classification algorithms first estimate an observation’s probability of Class 1 membership and then classify the observation into Class 1 if this probability meets or exceeds a specified cutoff value (default cutoff value, 0.5). The choice of cutoff value affects the type of classification error. As we decrease the cutoff value, more observations will be classified as Class 1, thereby increasing the likelihood that a Class 1 observation will be correctly classified as Class 1; that is, Class 1 error will decrease. However, as a side effect, more Class 0 observations will be incorrectly classified as Class 1; that is, Class 0 error will rise.
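The effect of the cutoff can be seen on a small example. The eight observations below are hypothetical (they are not from the Optiva file), and `confusion_counts` is an illustrative helper, not a function from any statistical package.

```python
def confusion_counts(actual, prob, cutoff=0.5):
    """Return (n00, n01, n10, n11), predicting Class 1 when prob >= cutoff."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, p in zip(actual, prob):
        counts[(a, int(p >= cutoff))] += 1
    return counts[(0, 0)], counts[(0, 1)], counts[(1, 0)], counts[(1, 1)]

# Hypothetical actual classes and estimated Class 1 probabilities.
actual = [1, 0, 1, 0, 1, 0, 0, 0]
prob = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

for cutoff in (0.75, 0.50, 0.25):
    n00, n01, n10, n11 = confusion_counts(actual, prob, cutoff)
    class1_error = n10 / (n10 + n11)
    class0_error = n01 / (n00 + n01)
    print(f"cutoff={cutoff:.2f}  Class 1 error={class1_error:.2f}  "
          f"Class 0 error={class0_error:.2f}")
```

As the cutoff falls from 0.75 to 0.25, the Class 1 error in this toy data drops while the Class 0 error climbs, which is the tradeoff described above.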

To demonstrate how the choice of cutoff value affects classification error, Table 9.2 shows a list of 50 observations (11 of which are actual Class 1 members) and an estimated probability of Class 1 membership produced by the classification algorithm. Table 9.3 shows the confusion matrices and corresponding Class 1 error rates, Class 0 error rates, and overall error rates for cutoff values of 0.75, 0.5, and 0.25, respectively. As we decrease the cutoff value, more observations will be classified as Class 1, thereby increasing the likelihood that a Class 1 observation will be correctly classified as Class 1 (decreasing the Class 1 error rate). However, as a side effect, more Class 0 observations will be incorrectly classified as Class 1 (increasing the Class 0 error rate). That is, we can accurately identify more of the actual Class 1 observations by lowering the cutoff value, but we do so at a cost of misclassifying more actual Class 0 observations as Class 1 observations.

TABLE 9.2 Classification Probabilities

| Actual Class | Probability of Class 1 | Actual Class | Probability of Class 1 |
|---|---|---|---|
| 1 | 1.00 | 0 | 0.66 |
| 1 | 1.00 | 0 | 0.65 |
| 0 | 1.00 | 1 | 0.64 |
| 1 | 1.00 | 0 | 0.62 |
| 0 | 1.00 | 0 | 0.60 |
| 0 | 0.90 | 0 | 0.51 |
| 1 | 0.90 | 0 | 0.49 |
| 0 | 0.88 | 0 | 0.49 |
| 0 | 0.88 | 1 | 0.46 |
| 1 | 0.88 | 0 | 0.46 |
| 0 | 0.87 | 1 | 0.45 |
| 0 | 0.87 | 0 | 0.45 |
| 0 | 0.86 | 0 | 0.44 |
| 1 | 0.86 | 0 | 0.44 |
| 0 | 0.86 | 0 | 0.30 |
| 0 | 0.86 | 0 | 0.28 |
| 0 | 0.85 | 0 | 0.26 |
| 0 | 0.84 | 1 | 0.24 |
| 0 | 0.84 | 0 | 0.22 |
| 0 | 0.83 | 0 | 0.21 |
| 0 | 0.68 | 0 | 0.04 |
| 0 | 0.67 | 0 | 0.04 |
| 0 | 0.67 | 0 | 0.01 |
| 0 | 0.67 | 0 | 0.00 |

Figure 9.1 shows the Class 1 and Class 0 error rates for cutoff values ranging from 0 to 1. One common approach to handling the tradeoff between Class 1 and Class 0 error is to set the cutoff value to minimize the Class 1 error rate subject to a threshold on the maximum Class 0 error rate. Specifically, Figure 9.1 illustrates that for a maximum allowed Class 0 error rate of 70%, a cutoff value of 0.45 (depicted by the vertical dashed line) achieves a Class 1 error rate of 20%.
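Choosing a cutoff that minimizes the Class 1 error rate subject to a cap on the Class 0 error rate can be sketched as a simple search over candidate cutoffs. The data, the candidate grid, and the helper names below are hypothetical and serve only to illustrate the idea.

```python
def error_rates(actual, prob, cutoff):
    """(Class 1 error rate, Class 0 error rate), predicting Class 1 when prob >= cutoff."""
    n10 = sum(1 for a, p in zip(actual, prob) if a == 1 and p < cutoff)
    n01 = sum(1 for a, p in zip(actual, prob) if a == 0 and p >= cutoff)
    n1 = sum(actual)
    n0 = len(actual) - n1
    return n10 / n1, n01 / n0

def best_cutoff(actual, prob, max_class0_error, candidates):
    """Cutoff with the lowest Class 1 error among those whose Class 0 error is within the cap."""
    feasible = []
    for c in candidates:
        e1, e0 = error_rates(actual, prob, c)
        if e0 <= max_class0_error:
            feasible.append((e1, e0, c))   # ties on e1 resolved by lower e0
    return min(feasible)[2] if feasible else None

# Hypothetical actual classes and estimated Class 1 probabilities.
actual = [1, 0, 1, 0, 1, 0, 0, 0]
prob = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

chosen = best_cutoff(actual, prob, max_class0_error=0.4, candidates=[0.25, 0.50, 0.75])
print(chosen)
```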

TABLE 9.3 Confusion Matrices for Various Cutoff Values

Cutoff Value = 0.75

| Actual Class | Predicted Class 0 | Predicted Class 1 |
|---|---|---|
| 0 | $n_{00}$ = 24 | $n_{01}$ = 15 |
| 1 | $n_{10}$ = 5 | $n_{11}$ = 6 |

| Actual Class | No. of Cases | No. of Errors | Error Rate (%) |
|---|---|---|---|
| 0 | $n_{00} + n_{01}$ = 39 | $n_{01}$ = 15 | 38.46 |
| 1 | $n_{10} + n_{11}$ = 11 | $n_{10}$ = 5 | 45.45 |
| Overall | $n_{00} + n_{01} + n_{10} + n_{11}$ = 50 | $n_{01} + n_{10}$ = 20 | 40.00 |

Cutoff Value = 0.50

| Actual Class | Predicted Class 0 | Predicted Class 1 |
|---|---|---|
| 0 | $n_{00}$ = 15 | $n_{01}$ = 24 |
| 1 | $n_{10}$ = 4 | $n_{11}$ = 7 |

| Actual Class | No. of Cases | No. of Errors | Error Rate (%) |
|---|---|---|---|
| 0 | 39 | 24 | 61.54 |
| 1 | 11 | 4 | 36.36 |
| Overall | 50 | 28 | 56.00 |

Cutoff Value = 0.25

| Actual Class | Predicted Class 0 | Predicted Class 1 |
|---|---|---|
| 0 | $n_{00}$ = 6 | $n_{01}$ = 33 |
| 1 | $n_{10}$ = 1 | $n_{11}$ = 10 |

| Actual Class | No. of Cases | No. of Errors | Error Rate (%) |
|---|---|---|---|
| 0 | 39 | 33 | 84.62 |
| 1 | 11 | 1 | 9.09 |
| Overall | 50 | 34 | 68.00 |

As we have mentioned, identifying Class 1 members is often more important than identifying Class 0 members. One way to evaluate a classifier’s value is to compare its effectiveness in identifying Class 1 observations with that of random classification. To gauge a classifier’s added value, a cumulative lift chart compares the number of actual Class 1 observations identified when observations are considered in decreasing order of their estimated probability of being in Class 1 with the number of actual Class 1 observations identified when observations are randomly selected. The left panel of Figure 9.2 illustrates a cumulative lift chart. The point (10, 5) on the blue curve means that if the 10 observations with the largest estimated probabilities of being in Class 1 were selected from Table 9.2, 5 of these observations correspond to actual Class 1 members. In contrast, the point (10, 2.2) on the red curve means that if 10 observations were randomly selected, only (11/50) × 10 = 2.2 of these observations would be Class 1 members. Thus, the better the classifier is at identifying responders, the larger the vertical gap between points on the red and blue curves.

FIGURE 9.1 Classification Error Rates vs. Cutoff Value
[Line chart. Horizontal axis: Cutoff Value (0 to 1). Vertical axis: Error Rate (0% to 100%). Series: Class 1 Error Rate, Class 0 Error Rate.]

Figure 9.1 was created using a data table that varied the cutoff value and tracked the Class 1 error rate and Class 0 error rate. For instructions on how to construct data tables in Excel, see Chapter 10.

FIGURE 9.2 Cumulative and Decile-Wise Lift Charts
[Left panel: Lift Chart (Validation Data Set). Horizontal axis: No. of Cases. Vertical axis: Cumulative. Series: Cumulative Class 1 records when sorted using predicted values; Cumulative Class 1 records using average. Labeled points: (10, 5) and (10, 2.2). Right panel: Decile-Wise Lift Chart (Validation Data Set). Horizontal axis: Deciles (1 to 10). Vertical axis: Decile Mean/Global Mean.]

A decile is one of nine values that divide ordered data into ten equal parts. The deciles determine the values for 10%, 20%, 30%, . . . , 90% of the data.

Another way to view how much better a classifier is at identifying Class 1 observations than random classification is to construct a decile-wise lift chart. For a decile-wise lift chart, observations are ordered in decreasing probability of Class 1 membership and then considered in 10 equal-sized groups. For the data in Table 9.2, the first decile group corresponds to the 0.1 × 50 = 5 observations most likely to be in Class 1, the second decile group corresponds to the 6th through the 10th observations most likely to be in Class 1, and so on. For each of these deciles, the decile-wise lift chart compares the number of actual Class 1 observations to the number of Class 1 responders in a randomly selected group of 0.1 × 50 = 5 observations. In the first decile group from Table 9.2 (the top 10% of observations believed by the classifier to be most likely in Class 1), there are three Class 1 observations. A random sample of 5 observations would be expected to have 5 × (11/50) = 1.1 observations in Class 1. Thus, the first decile lift of this classification is 3/1.1 = 2.73, which corresponds to the height of the first bar in the chart in the right panel of Figure 9.2. The interpretation of this ratio is that in the first decile, the model correctly predicted three observations, whereas random sampling would, on average, correctly classify only 1.1. Visually, the taller the bar in a decile-wise lift chart, the better the classifier is at identifying responders in the respective decile group. The height of the bars for the 2nd through 10th deciles is computed and interpreted in a similar manner.
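The lift figures quoted above can be checked with a short calculation. The sketch below uses the actual classes of the ten observations with the largest estimated probabilities in Table 9.2 and the stated totals (50 observations, 11 in Class 1); the variable names are illustrative only.

```python
from itertools import accumulate

n_total, n_class1 = 50, 11                  # totals stated for Table 9.2
base_rate = n_class1 / n_total

# Actual classes of the 10 observations with the largest estimated probabilities.
top10_actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]
cumulative = list(accumulate(top10_actual))

# Cumulative lift chart: model curve versus random selection at 10 cases.
model_at_10 = cumulative[9]
random_at_10 = base_rate * 10

# Decile-wise lift: the first decile holds 0.1 * 50 = 5 observations.
decile_size = n_total // 10
first_decile_lift = cumulative[decile_size - 1] / (base_rate * decile_size)

print(model_at_10, round(random_at_10, 1), round(first_decile_lift, 2))
```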

Lift charts are prominently used in direct-marketing applications that seek to identify customers who are likely to respond to a direct-mail promotion. In these applications, it is common to have a fixed budget and, therefore, a fixed number of customers to target. Lift charts identify how much better a data mining model does at identifying responders than a mailing to a random set of customers.

In addition to the overall error rate, Class 1 error rate, and Class 0 error rate, there are other measures that gauge a classifier’s performance. The ability to correctly predict Class 1 (positive) observations is expressed by subtracting the Class 1 error rate from one. The resulting measure is referred to as the sensitivity, or recall, which is calculated as

$$\text{Sensitivity} = 1 - \text{Class 1 error rate} = \frac{n_{11}}{n_{11} + n_{10}}$$

Similarly, the ability to correctly predict Class 0 (negative) observations is expressed by subtracting the Class 0 error rate from one. The resulting measure is referred to as the specificity, which is calculated as

$$\text{Specificity} = 1 - \text{Class 0 error rate} = \frac{n_{00}}{n_{00} + n_{01}}$$

The sensitivity of the model that produced the classifications in Table 9.1 is 146/(146 + 89) = 62.1%. The specificity of the model that produced the classifications in Table 9.1 is 7,479/(5,244 + 7,479) = 58.8%.

Precision is a measure that corresponds to the proportion of observations predicted to be Class 1 by a classifier that are actually in Class 1:

$$\text{Precision} = \frac{n_{11}}{n_{11} + n_{01}}$$

The F1 Score combines precision and sensitivity into a single measure and is defined as

$$\text{F1 Score} = \frac{2n_{11}}{2n_{11} + n_{01} + n_{10}}$$
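As a check on these definitions, the four measures can be computed from the Table 9.1 counts. In the sketch below the variable names are illustrative, and the last line confirms that the count-based F1 expression agrees with the usual harmonic-mean form 2 × precision × sensitivity / (precision + sensitivity).

```python
# Cell counts from Table 9.1 (n_ij = actual class i, predicted class j).
n00, n01, n10, n11 = 7479, 5244, 89, 146

sensitivity = n11 / (n11 + n10)        # recall
specificity = n00 / (n00 + n01)
precision = n11 / (n11 + n01)
f1 = 2 * n11 / (2 * n11 + n01 + n10)

print(f"sensitivity {sensitivity:.1%}, specificity {specificity:.1%}")
print(f"precision {precision:.1%}, F1 {f1:.3f}")

# Equivalent harmonic-mean form of the F1 score.
f1_harmonic = 2 * precision * sensitivity / (precision + sensitivity)
```

The low precision reflects how many nondefaults are flagged as defaults in Table 9.1, even though sensitivity is moderate.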

As we illustrated in Figure 9.1, decreasing the cutoff value will decrease the number of actual Class 1 observations misclassified as Class 0, but at the cost of increasing the number of Class 0 observations that are misclassified as Class 1. The receiver operating characteristic (ROC) curve is an alternative graphical approach for displaying this tradeoff between a classifier’s ability to correctly identify Class 1 observations and its Class 0 error rate. In a ROC curve, the vertical axis is the sensitivity of the classifier, and the horizontal axis is the Class 0 error rate (which is equal to 1 – specificity).

In Figure 9.3, the blue curve depicts the ROC curve corresponding to the classification probabilities for the 50 observations in Table 9.2. The red diagonal line in Figure 9.3 represents the expected sensitivity and Class 0 error rate achieved by random classification of the 50 observations. The point (0, 0) on the blue curve occurs when the cutoff value is set so that all observations are classified as Class 0; for this set of 50 observations, a cutoff value greater than 1.0 will achieve this. That is, for a cutoff value greater than 1, for the observations in Table 9.2, sensitivity = 0/(0 + 11) = 0 and the Class 0 error rate = 0/(0 + 39) = 0. The point (1, 1) on the curve occurs when the cutoff value is set so that all observations are classified as Class 1; for this set of 50 observations, a cutoff value of zero will achieve this. That is, for a cutoff value of 0, sensitivity = 11/(11 + 0) = 1 and the Class 0 error rate = 39/(39 + 0) = 1. Repeating these calculations for varying cutoff values and recording the resulting sensitivity and Class 0 error rate values, we can construct the ROC curve in Figure 9.3.

In general, we can evaluate the quality of a classifier by computing the area under the ROC curve, often referred to as the AUC. The greater the area under the ROC curve, i.e., the larger the AUC, the better the classifier performs. To understand why, suppose there exists a cutoff value such that a classifier correctly identifies each observation’s actual class. Then, the ROC curve will pass through the point (0, 1), which represents the case in which the Class 0 error rate is zero and the sensitivity is equal to one (which means that the Class 1 error rate is zero). In this case, the area under the ROC curve would be equal to one as the curve would extend from (0, 0) to (0, 1) to (1, 1). In Figure 9.3, note that the area under the red diagonal line representing random classification results is 0.5. In Figure 9.3, we observe that the classifier is providing value over a random classification, as its AUC is greater than 0.5.
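The construction of an ROC curve and its AUC can be sketched as follows. The data are hypothetical, the helper names `roc_points` and `auc` are illustrative, and the area is computed with the trapezoidal rule, which is one common choice rather than a method stated in the text.

```python
def roc_points(actual, prob):
    """(Class 0 error rate, sensitivity) pairs as the cutoff sweeps from above 1 down to 0."""
    n1 = sum(actual)
    n0 = len(actual) - n1
    cutoffs = [float("inf")] + sorted(set(prob), reverse=True)
    points = []
    for c in cutoffs:
        tp = sum(1 for a, p in zip(actual, prob) if a == 1 and p >= c)
        fp = sum(1 for a, p in zip(actual, prob) if a == 0 and p >= c)
        points.append((fp / n0, tp / n1))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical actual classes and estimated Class 1 probabilities.
actual = [1, 0, 1, 0, 1, 0, 0, 0]
prob = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

points = roc_points(actual, prob)
print(round(auc(points), 3))

# A classifier that separates the classes perfectly reaches the point (0, 1).
perfect = roc_points([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
print(auc(perfect))
```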

Evaluating the Estimation of Continuous Outcomes

There are several ways to measure performance when estimating a continuous outcome variable, but each of these measures is some function of the error $e_i = y_i - \hat{y}_i$, where $y_i$ is the actual outcome for observation i and $\hat{y}_i$ is the predicted outcome for observation i. Two common measures are the average error $= \sum_{i=1}^{n} e_i / n$ and the root mean squared error (RMSE) $= \sqrt{\sum_{i=1}^{n} e_i^2 / n}$. The average error estimates the bias in a model’s predictions. If the average error is negative, then the model tends to overestimate the value of the outcome variable; if the average error is positive, the model tends to underestimate. The RMSE is similar to the standard error of the estimate for a regression model; it has the same units as the outcome variable predicted and provides a measure of how much the predicted value varies from the actual value.
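Both measures are easy to compute once the errors are available. The sketch below applies the two definitions to the ten errors (actual minus estimated average balance) listed in Table 9.4; the variable names are illustrative.

```python
import math

# Errors e_i = actual - estimated average balance, from Table 9.4.
errors = [9, 340, -481, 894, 801, -264, -254, -972, -1608, 734]

average_error = sum(errors) / len(errors)
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

print(f"average error = {average_error:.1f}, RMSE = {rmse:.0f}")
```

A negative average error indicates that, on these ten observations, the estimates tend to exceed the actual balances.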

Applying these measures (or others) to the model’s predictions on the training set estimates the retrodictive performance or goodness-of-fit of the model, not the predictive performance. In estimating future performance, we are most interested in applying the performance measures to the model’s predictions on the validation and test sets.

In Chapter 8, we discuss additional measures, such as mean absolute error, mean absolute percentage error, and mean squared error, that also can be used to evaluate the predictions of a continuous outcome.

FIGURE 9.3 Receiver Operating Characteristic (ROC) Curve
[Chart titled ROC Curve. Horizontal axis: Class 0 Error Rate = 1 − Specificity. Vertical axis: Sensitivity (0.00 to 1.00).]


NOTES + COMMENTS

Lift charts analogous to those constructed for classification methods can also be applied to the continuous outcomes when using estimation methods. A lift chart for a continuous outcome variable is relevant for evaluating a model’s effectiveness in identifying observations with the largest values of the outcome variable. This is similar to the way a lift chart for a categorical outcome variable helps evaluate a model’s effectiveness in identifying observations that are most likely to be Class 1 members.

TABLE 9.4 Computing Error in Estimates of Average Balance for 10 Customers

| Actual Average Balance | Estimated Average Balance | Error ($e_i$) | Squared Error ($e_i^2$) |
|---|---|---|---|
| 3,793 | 3,784 | 9 | 81 |
| 1,800 | 1,460 | 340 | 115,600 |
| 900 | 1,381 | −481 | 231,361 |
| 1,460 | 566 | 894 | 799,236 |
| 6,288 | 5,487 | 801 | 641,601 |
| 341 | 605 | −264 | 69,696 |
| 506 | 760 | −254 | 64,516 |
| 621 | 1,593 | −972 | 944,784 |
| 1,442 | 3,050 | −1,608 | 2,585,664 |
| 944 | 210 | 734 | 538,756 |

To demonstrate the computation and interpretation of average error and RMSE, we consider the challenge of predicting the average balance of Optiva Credit Union customers based on their features. Table 9.4 shows the error and squared error resulting from the predictions of the average balance for 10 observations. Using Table 9.4, we compute average error = −80.1 and the RMSE = 774. Because the average error is negative, we observe that the model overestimates the actual balance of these 10 customers. Furthermore, if the performance of the model on these 10 observations is indicative of the performance on a larger set of observations, we should investigate improvements to the estimation model, as the RMSE of 774 is 43% of the average actual balance. As a rule of thumb, a good estimation model should have an RMSE less than 10% of the average value of the variable being predicted.

9.3 Logistic Regression

Similar to how multiple linear regression predicts a continuous outcome variable, y, with a collection of explanatory variables, $x_1, x_2, \ldots, x_q$, via the linear equation $\hat{y} = b_0 + b_1 x_1 + \cdots + b_q x_q$, logistic regression attempts to classify a binary categorical outcome (y = 0 or 1) as a linear function of explanatory variables. However, directly trying to explain a binary outcome via a linear function of the explanatory variables is not effective. To understand this, consider the task of predicting whether a movie wins the Academy Award for Best Picture using information on the total number of other Oscar nominations that a movie has received. Figure 9.4 shows a scatter chart of a sample of movie data found in the file OscarsDemo; each data point corresponds to the total number of Oscar nominations that a movie received and whether the movie won the best picture award (1 = movie won, 0 = movie lost). The diagonal line in Figure 9.4 corresponds to the simple linear regression fit. This linear function can be thought of as predicting the probability p of a movie winning the Academy Award for Best Picture via the equation $\hat{p} = -0.4054 + (0.0836 \times \text{total number of Oscar nominations})$. As Figure 9.4 shows, a linear regression model fails to appropriately explain a binary outcome variable. This model predicts that a movie with fewer than 5 total Oscar nominations has a negative probability of winning the best picture award. For a movie with more than 17 total Oscar nominations, this model predicts a probability greater than 1.0 of winning the best picture award. Furthermore, the residual plot in Figure 9.5 shows an unmistakable pattern of systematic misprediction, suggesting that the simple linear regression model is not appropriate.
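The out-of-range predictions are easy to confirm numerically. The sketch below evaluates the fitted line reported in Figure 9.4 (y = 0.0836x − 0.4054) at a few nomination counts; the function name `linear_fit` is illustrative.

```python
def linear_fit(nominations):
    """Simple linear regression fit reported in Figure 9.4."""
    return -0.4054 + 0.0836 * nominations

for x in (4, 5, 17, 18):
    print(x, round(linear_fit(x), 3))

p4, p18 = linear_fit(4), linear_fit(18)
```

The value at 4 nominations is below zero and the value at 18 nominations is above one, so neither can be read as a probability.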

Estimating the probability p with the linear function $\hat{p} = b_0 + b_1 x_1 + \cdots + b_q x_q$ does not fit well because, although p is a continuous measure, it is restricted to the range [0, 1]; that is, a probability cannot be less than zero or larger than one. Figure 9.6 shows an S-shaped curve that appears to better explain the relationship between the probability p of winning the best picture award and the total number of Oscar nominations. Instead of extending off to positive and negative infinity, the S-shaped curve flattens and never goes above one or below zero. We can achieve this S-shaped curve by estimating an appropriate function of the probability p of winning the best picture award with a linear function rather than directly estimating p with a linear function.

As a first step, we note that there is a measure related to probability known as odds that is very prominent in gambling and epidemiology. If an estimate of the probability of an event is $\hat{p}$, then the equivalent odds measure is $\hat{p}/(1 - \hat{p})$. For example, if the probability of an event is $\hat{p} = 2/3$, then the odds measure would be (2/3)/(1/3) = 2, meaning that the odds are 2 to 1 that the event will occur. The odds metric ranges between zero and positive infinity, so by considering the odds measure rather than the probability $\hat{p}$, we eliminate the linear fit problem resulting from the upper bound of one on the probability $\hat{p}$. To eliminate the fit problem resulting from the remaining lower bound of zero on $\hat{p}/(1 - \hat{p})$, we observe that the natural log of the odds for an event, also known as “log odds” or logit, $\ln(\hat{p}/(1 - \hat{p}))$, ranges from negative infinity to positive infinity. Estimating the logit with a linear function results in a logistic regression model:

$$\ln\left(\frac{\hat{p}}{1 - \hat{p}}\right) = b_0 + b_1 x_1 + \cdots + b_q x_q \qquad (9.1)$$
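The relationship between probability, odds, and log odds can be illustrated with two small helper functions. The names `odds` and `logit` are chosen here for illustration.

```python
import math

def odds(p):
    """Odds equivalent to probability p (0 <= p < 1)."""
    return p / (1 - p)

def logit(p):
    """Natural log of the odds; ranges over all real numbers for 0 < p < 1."""
    return math.log(odds(p))

print(round(odds(2 / 3), 6))                  # odds of 2 to 1
for p in (0.01, 0.5, 0.99):
    print(p, round(odds(p), 4), round(logit(p), 4))
```

Probabilities near zero give large negative log odds and probabilities near one give large positive log odds, which is why a linear function is a reasonable fit for the logit but not for the probability itself.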

As discussed in Chapter 7, if a linear regression model is appropriate, the residuals should appear randomly dispersed with no discernible pattern.

FIGURE 9.4 Scatter Chart and Simple Linear Regression Fit for Oscars Example
[Horizontal axis: Oscar Nominations. Vertical axis: Winner of Best Picture. Fitted line: y = 0.0836x − 0.4054, R² = 0.2708.]


Given a training set of observations consisting of values for a set of explanatory variables, $x_1, x_2, \ldots, x_q$, and whether or not an event of interest occurred (y = 0 or 1), the logistic regression model fits values of $b_0, b_1, \ldots, b_q$ that best estimate the log odds of the event occurring. Using statistical software to fit the logistic regression model to the data in the file OscarsDemo results in estimates of $b_0 = -6.214$ and $b_1 = 0.596$; that is, the log odds of a movie winning the best picture award is given by

$$\ln\left(\frac{\hat{p}}{1 - \hat{p}}\right) = -6.214 + 0.596 \times \text{total number of Oscar nominations} \qquad (9.2)$$

FIGURE 9.5 Residuals for Simple Linear Regression on Oscars Data
[Horizontal axis: Oscar Nominations. Vertical axis: Residuals.]

Unlike the coefficients in a multiple linear regression, the coefficients in a logistic regression do not have an intuitive interpretation. For example, $b_1 = 0.596$ means that for every additional Oscar nomination that a movie receives, its log odds of winning the best picture award increase by 0.596. In other words, the total number of Oscar nominations is linearly related to the log odds of a movie winning the best picture award. Unfortunately, a change in the log odds of an event is not as easy to interpret as a change in the probability of an event. Algebraically solving equation (9.1) for p, we can express the relationship between the estimated probability of an event and the explanatory variables with an equation known as the logistic function:

LOGISTIC FUNCTION

$$\hat{p} = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \cdots + b_q x_q)}} \qquad (9.3)$$

For the OscarsDemo data, equation (9.3) is

$$\hat{p} = \frac{1}{1 + e^{-(-6.214 + 0.596 \times \text{total number of Oscar nominations})}} \qquad (9.4)$$

Plotting equation (9.4), we obtain the S-shaped curve of Figure 9.6. Clearly, the logistic regression fit implies a nonlinear relationship between the probability of winning the best picture award and the total number of Oscar nominations. The effect of increasing the total number of Oscar nominations on the probability of winning the best picture award depends on the original number of Oscar nominations. For instance, if the total number of Oscar nominations is four, an additional Oscar nomination increases the estimated probability of winning the best picture award from $\hat{p} = \frac{1}{1 + e^{-(-6.214 + 0.596 \times 4)}} = 0.021$ to $\hat{p} = \frac{1}{1 + e^{-(-6.214 + 0.596 \times 5)}} = 0.038$, an increase of 0.017. But if the total number of Oscar nominations is eight, an additional Oscar nomination increases the estimated probability of winning the best picture award from $\hat{p} = \frac{1}{1 + e^{-(-6.214 + 0.596 \times 8)}} = 0.191$ to $\hat{p} = \frac{1}{1 + e^{-(-6.214 + 0.596 \times 9)}} = 0.299$, an increase of 0.108.
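These probabilities can be reproduced by coding equation (9.4) directly. In the sketch below the coefficients come from the text, while the names `B0`, `B1`, and `win_probability` are illustrative.

```python
import math

B0, B1 = -6.214, 0.596   # fitted coefficients for the OscarsDemo data

def win_probability(nominations):
    """Equation (9.4): estimated probability of winning the best picture award."""
    return 1 / (1 + math.exp(-(B0 + B1 * nominations)))

for x in (4, 5, 8, 9):
    print(x, round(win_probability(x), 3))
```

One more nomination adds far less probability at four nominations than at eight, which is the nonlinearity described above.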

As with other classification methods, logistic regression classifies an observation by using equation (9.3) to compute the probability of an observation belonging to Class 1 and then comparing this probability to a cutoff value. If the probability exceeds the cutoff value (a typical value is 0.5), the observation is classified as Class 1 and otherwise it is classified as Class 0. Table 9.5 shows a subsample of the predicted probabilities computed using equation (9.3) and the subsequent classification.
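The classification step can be sketched by applying a cutoff to the probabilities from equation (9.4). The nomination counts below are the four shown in Table 9.5; the function names and the "Winner"/"Loser" labels as Python strings are illustrative.

```python
import math

B0, B1 = -6.214, 0.596   # fitted coefficients for the OscarsDemo data

def win_probability(nominations):
    """Estimated probability of winning, from the logistic function."""
    return 1 / (1 + math.exp(-(B0 + B1 * nominations)))

def classify(nominations, cutoff=0.5):
    """Class 1 (Winner) if the estimated probability exceeds the cutoff."""
    return "Winner" if win_probability(nominations) > cutoff else "Loser"

for x in (14, 11, 10, 6):
    print(x, round(win_probability(x), 2), classify(x))
```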

The selection of variables to consider for a logistic regression model is similar to the approach in multiple linear regression. Especially when dealing with many variables,

FIGURE 9.6 Logistic S-Curve for Oscars Example (horizontal axis: Total No. of Oscar Nominations; vertical axis: Winner of Best Picture, probability from 0 to 1.0)

TABLE 9.5 Predicted Probabilities by Logistic Regression for Oscars Example

| Total No. of Oscar Nominations | Predicted Probability of Winning | Predicted Class | Actual Class |
|---|---|---|---|
| 14 | 0.89 | Winner | Winner |
| 11 | 0.58 | Winner | Loser |
| 10 | 0.44 | Loser | Loser |
| 6 | 0.07 | Loser | Winner |


See Chapter 7 for an in-depth discussion of variable selection in multiple regression models.

9.4 k-Nearest Neighbors

The k-nearest neighbor (k-NN) method can be used either to classify a categorical outcome or to estimate a continuous outcome. In a k-NN approach, the predicted outcome for an observation is based on the k most similar observations from the training set, where similarity is measured with respect to the set of input variables (features). Statistical software commonly employs Euclidean distance in the k-NN method to measure the similarity between observations, which is most appropriate when all features are continuous.

A critical aspect of effectively applying the k-NN method is the selection of the appropriate features on which to base similarity. When similarity is computed with respect to too many features, Euclidean distance becomes a less discriminating measure because all observations become nearly equidistant from each other. While no automated feature selection exists within the k-NN method, preliminary data exploration paired with experimentation can help identify promising features to include.

Classifying Categorical Outcomes with k-Nearest Neighbors

Unlike logistic regression, which uses a training set to generalize relationships in the data via the logistic equation and then applies this parametric model to estimate the class probabilities of observations in the validation and test sets, a nearest-neighbor classifier is a “lazy learner.” That is, k-NN instead directly uses the entire training set to classify observations in the validation and test sets. When k-NN is used as a classification method, a new observation is classified as Class 1 if the proportion of Class 1 observations in its k nearest neighbors from the training set is greater than or equal to a specified cutoff value (a typical value is 0.5).

The value of k can plausibly range from 1 to n, the number of observations in the training set. If k = 1, then the classification of a new observation is set to be equal to the class of the single most similar observation from the training set. At the other extreme, if k = n, then the new observation’s class is naïvely assigned to the most common class in the training set. Smaller values of k are more susceptible to noise in the training set, while larger values of k may fail to capture the relationship between the features and output class. Values of k from 1 to n/2 are typically considered. The best value of k can be determined by building models for a range of k values and then selecting the value k* that results in the smallest classification error on the validation set. Note that the use of the validation set to identify k* in this manner implies that the method should be applied to a test set with this value of k* to accurately estimate the classification error on future data.

To illustrate, suppose that a training set consists of the 10 observations listed in Table 9.6. For this example, we will refer to an observation with Loan Default = 1 as a Class 1

NOTES + COMMENTS

As with multiple linear regression, strong collinearity between the independent variables $x_1, x_2, \ldots, x_q$ in a logistic regression model can distort the estimation of the coefficients $b_1, b_2, \ldots, b_q$ in equation (9.1). If we are constructing a logistic regression model to explain and quantify a relationship between the set of independent variables and the log odds of an event occurring, then it is recommended to avoid models that include independent variables that are highly correlated. However, if the purpose of a logistic regression model is to classify observations, multicollinearity does not affect predictive capability, so correlated independent variables are not a concern and the model should be evaluated based on its classification performance on validation and test sets.

thorough data exploration via descriptive statistics and data visualization is essential in narrowing down viable candidates for explanatory variables. While a logistic regression model used for prediction should ultimately be judged based on its classification performance on validation and test sets, Mallows' $C_p$ statistic is a measure commonly computed by statistical software that can be used to identify models with promising sets of variables. Models that achieve a small value of Mallows' $C_p$ statistic tend to have smaller mean squared error, and models with a value of Mallows' $C_p$ statistic approximately equal to the number of coefficients in the model tend to have less bias (the tendency to systematically over- or under-predict).


observation and an observation with Loan Default = 0 as a Class 0 observation. Our task is to classify a new observation with Average Balance = 900 and Age = 28 based on its similarity to the values of Average Balance and Age of the 10 observations in the training set.

Before computing the similarity between a new observation and the observations in the training set, it is common practice to normalize the values of all variables. By replacing the original values of each variable with the corresponding z-score, we avoid the computation of Euclidean distance being disproportionately affected by the scale of the variables. For example, the average value of the Average Balance variable in the training set is 1,285 and the standard deviation is 2,029. The average and standard deviation of the Age variable are 38.2 and 10.2, respectively. Thus, Observation 1’s normalized value of Average Balance is (49 − 1,285)/2,029 = −0.61 and its normalized value of Age is (38 − 38.2)/10.2 = −0.02.
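The normalization step can be sketched in a few lines. This example uses the Average Balance and Age values from Table 9.6 and assumes the sample standard deviation (divisor n − 1), which reproduces the 2,029 and 10.2 quoted in the text; the helper name `z_scores` is illustrative.

```python
from statistics import mean, stdev

# Values from Table 9.6, in observation order 1 through 10.
average_balance = [49, 671, 772, 136, 123, 36, 192, 6574, 2200, 2100]
age = [38, 26, 47, 48, 40, 29, 31, 35, 58, 30]

def z_scores(values):
    """Replace each value with (value - mean) / sample standard deviation."""
    m, s = mean(values), stdev(values)  # stdev uses the n - 1 divisor
    return [(v - m) / s for v in values]

balance_z = z_scores(average_balance)
age_z = z_scores(age)

# Observation 1's normalized values.
print(round(balance_z[0], 2), round(age_z[0], 2))
```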

Figure 9.7 displays the 10 training-set observations and the new observation to be classified plotted according to their normalized variable values. To classify the new observation,

In Chapter 2, we discuss z-scores.

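TABLE 9.6 Training Set Observations for k-NN Classifier

| Observation | Average Balance | Age | Loan Default |
|---|---|---|---|
| 1 | 49 | 38 | 1 |
| 2 | 671 | 26 | 1 |
| 3 | 772 | 47 | 1 |
| 4 | 136 | 48 | 1 |
| 5 | 123 | 40 | 1 |
| 6 | 36 | 29 | 0 |
| 7 | 192 | 31 | 0 |
| 8 | 6,574 | 35 | 0 |
| 9 | 2,200 | 58 | 0 |
| 10 | 2,100 | 30 | 0 |
| Average: | 1,285 | 38.2 | |
| Standard Deviation: | 2,029 | 10.2 | |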

FIGURE 9.7 Scatter Chart for k-NN Classification (horizontal axis: Average Balance z-Score; vertical axis: Age z-Score; legend: Loan Default, No Default, Observation to Classify)


we will use a cutoff value of 0.5. For k = 1, this observation is classified as a Loan Default (Class 1) because its nearest neighbor (Observation 2) is in Class 1. For k = 2, we see that the two nearest neighbors are Observation 2 (Class 1) and Observation 6 (Class 0). Because at least 0.5 of the k = 2 neighbors are Class 1, the new observation is classified as Class 1. For k = 3, the three nearest neighbors are Observation 2 (Class 1), Observation 6 (Class 0), and Observation 7 (Class 0). Because only 1/3 of the neighbors are Class 1, the new observation is classified as Class 0 (0.33 is less than the 0.5 cutoff value). Table 9.7 summarizes the classification of the new observation for values of k ranging from 1 to 10.
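The classification procedure described here can be sketched as follows. The example standardizes both features using the Table 9.6 data (sample standard deviation assumed), ranks the training observations by Euclidean distance to the new observation, and applies the 0.5 cutoff. All function and variable names are illustrative, and `math.dist` requires Python 3.8 or later.

```python
from math import dist
from statistics import mean, stdev

# Table 9.6 rows: (average balance, age, loan default).
training = [
    (49, 38, 1), (671, 26, 1), (772, 47, 1), (136, 48, 1), (123, 40, 1),
    (36, 29, 0), (192, 31, 0), (6574, 35, 0), (2200, 58, 0), (2100, 30, 0),
]
new_obs = (900, 28)  # Average Balance = 900, Age = 28

def standardizer(values):
    """Build a z-score function from the training values of one feature."""
    m, s = mean(values), stdev(values)
    return lambda v: (v - m) / s

z_balance = standardizer([row[0] for row in training])
z_age = standardizer([row[1] for row in training])

new_point = (z_balance(new_obs[0]), z_age(new_obs[1]))

# Training rows ordered from most similar to least similar.
neighbors = sorted(
    training,
    key=lambda row: dist((z_balance(row[0]), z_age(row[1])), new_point),
)

def knn_classify(k, cutoff=0.5):
    """Return (proportion of Class 1 among k nearest neighbors, predicted class)."""
    share = sum(row[2] for row in neighbors[:k]) / k
    return share, int(share >= cutoff)

for k in range(1, 11):
    share, predicted = knn_classify(k)
    print(k, f"{share:.2f}", predicted)
```

Sorting once and slicing the first k rows keeps the sketch short; it is only sensible for a training set this small.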

Estimating Continuous Outcomes with k-Nearest Neighbors

When k-NN is used to estimate a continuous outcome, a new observation’s outcome value is predicted to be the average of the outcome values of its k nearest neighbors in the training set. The value of k can plausibly range from 1 to n, the number of observations in the training set. If k = 1, then the estimation of a new observation’s outcome value is set equal to the outcome value of the single most similar observation from the training set. At the other extreme, if k = n, then the new observation’s outcome value is estimated by the average outcome value over the entire training set. Too small a value of k results in predictions that are overfit to the noise in the training set, while too large a value of k results in underfitting and fails to capture the relationships between the features and the outcome variable. The best value of k can be determined by building models over a typical range (k = 1, . . . , n/2) and then selecting the value k* that results in the smallest estimation error on the validation set. Note that the use of the validation set to identify k* in this manner implies that the method should be applied to a test set with this value of k* to accurately estimate the estimation error on future data.

To illustrate, we again consider the training set of 10 observations listed in Table 9.6. In this case, we are interested in estimating the value of Average Balance for a new observation based on its similarity with respect to Age to the 10 observations in the training set. Figure 9.8 displays the 10 training-set observations and a new observation

TABLE 9.7 Classification of Observation with Average Balance = 900 and Age = 28 for Different Values of k

| k | Proportion of Class 1 Neighbors | Classification |
|---|---|---|
| 1 | 1.00 | 1 |
| 2 | 0.50 | 1 |
| 3 | 0.33 | 0 |
| 4 | 0.25 | 0 |
| 5 | 0.40 | 0 |
| 6 | 0.50 | 1 |
| 7 | 0.57 | 1 |
| 8 | 0.63 | 1 |
| 9 | 0.56 | 1 |
| 10 | 0.50 | 1 |

FIGURE 9.8 Scatter Chart for k-NN Estimation (horizontal axis: Age, 25 to 60; points labeled with their Average Balance values)


with Age = 28 for which we want to estimate the value of Average Balance. For k = 1, the new observation’s average balance is estimated to be $36, which is the value of Average Balance for the nearest neighbor (Observation 6 in Table 9.6). For k = 2, we see that there is a tie between Observation 2 (Age = 26) and Observation 10 (Age = 30) for the second-closest observation to the new observation (Age = 28). While tie-breaking rules vary between statistical software packages, in this example we simply include all three observations to estimate the average balance of the new observation as (36 + 671 + 2,100)/3 = $936. Table 9.8 summarizes the estimation of the new observation’s average balance for values of k ranging from 1 to 10.
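The estimation rule, including the tie handling used in this example, can be sketched as below. The sketch uses the Age and Average Balance values from Table 9.6 and includes every observation that is at least as close as the k-th nearest one, so a tie at the boundary enlarges the neighbor set. The name `knn_estimate` is illustrative, and other software may break ties differently.

```python
# Table 9.6 rows as (age, average balance).
training = [
    (38, 49), (26, 671), (47, 772), (48, 136), (40, 123),
    (29, 36), (31, 192), (35, 6574), (58, 2200), (30, 2100),
]

def knn_estimate(new_age, k):
    """Average the balances of all observations within the k-th smallest distance."""
    distances = sorted(abs(a - new_age) for a, _ in training)
    radius = distances[k - 1]  # distance to the k-th nearest neighbor
    # Ties at the boundary are all included, as in the text's example.
    balances = [bal for a, bal in training if abs(a - new_age) <= radius]
    return sum(balances) / len(balances)

for k in range(1, 11):
    print(k, round(knn_estimate(28, k)))
```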

9.5 Classification and Regression Trees

Classification and regression trees (CART) successively partition a data set of observations into increasingly smaller and more homogeneous subsets. At each iteration of the CART method, a subset of observations is split into two new subsets based on the values of a single variable. The CART method can be thought of as a series of questions that successively partition observations into smaller and smaller groups of decreasing impurity, which is the measure of the heterogeneity in a group of observations’ outcome classes or outcome values. The implementations of classification and regression trees in various statistical software packages vary with respect to the metrics they employ and how they grow the tree. In this section, we present a general description of CART logic.

Classifying Categorical Outcomes with a Classification Tree

For classification trees, the impurity of a group of observations is based on the proportion of observations belonging to the same class (there is zero impurity if all observations in a group are in the same class). After a final tree is constructed, the classification of a new observation is then based on the final partition into which the new observation falls (based on the variable-splitting rules).

To demonstrate the classification tree method, we consider an example involving Hawaiian Ham Inc. (HHI), a company that specializes in the development of software that filters out unwanted e-mail messages (often referred to as “spam”). The file DemoHHI contains a sample of data that HHI has collected. For 4,601 e-mail messages, HHI has collected whether the message was “spam” (Class 1) or “not spam” (Class 0), as well as the frequency of the “!” character and the “$” character (expressed as a percentage of characters in the message).

TABLE 9.8 Estimation of Average Balance for Observation with Age = 28 for Different Values of k

| k | Average Balance Estimate |
|---|---|
| 1 | $36 |
| 2 | $936 |
| 3 | $936 |
| 4 | $750 |
| 5 | $1,915 |
| 6 | $1,604 |
| 7 | $1,392 |
| 8 | $1,315 |
| 9 | $1,184 |
| 10 | $1,285 |


To explain how a classification tree categorizes observations, we consider a small training set from DemoHHI consisting of 46 observations. In this training set, we note that the variables Dollar and Exclamation correspond to the percentage of the “$” character and the percentage of the “!” character, respectively. The results of a classification tree analysis can be graphically displayed in a tree that explains the process of classifying a new observation. The tree outlines the values of the variables that result in an observation falling into a particular partition.

Let us consider the classification tree in Figure 9.9. At each step, the CART method identifies the split of the variable that results in the least impurity in the two resulting categories. In Figure 9.9, the number within the circle (or node) represents the value on which the variable (whose name is listed below the node) is split. The first partition is formed by splitting observations into two groups, observations with Dollar ≤ 0.0555 and observations with Dollar > 0.0555. The numbers on the left and right arcs emanating from the node denote the number of observations in the Dollar ≤ 0.0555 and Dollar > 0.0555 partitions, respectively. There are 28 e-mails in which the “$” character accounts for no more than 5.55% of the characters and 18 e-mails in which it accounts for more than 5.55%. The split on the variable Dollar at the value 0.0555 is selected because it results in the two subsets of the original 46 observations with the least impurity. The splitting process is then repeated on these two newly created groups of observations in a manner that again results in an additional subset with the least impurity. In this tree, the second split is applied to the group of 28 observations with Dollar ≤ 0.0555 using the variable Exclamation;

FIGURE 9.9 Construction Sequence of Branches in a Classification Tree


21 of the 28 observations in this subset have Exclamation ≤ 0.0735, while 7 have Exclamation > 0.0735. After this second variable splitting, there are three total partitions of the original 46 observations. There are 21 observations with values of Dollar ≤ 0.0555 and Exclamation ≤ 0.0735, 7 observations with values of Dollar ≤ 0.0555 and Exclamation > 0.0735, and 18 observations with values of Dollar > 0.0555. No further partitioning of the 21-observation group with values of Dollar ≤ 0.0555 and Exclamation ≤ 0.0735 is necessary since this group consists entirely of Class 0 (nonspam) observations (i.e., this group has zero impurity). The 7-observation group with values of Dollar ≤ 0.0555 and Exclamation > 0.0735 and the 18-observation group with values of Dollar > 0.0555 are successively partitioned in the order denoted by the boxed numbers in Figure 9.9 until subsets with zero impurity are obtained.

For example, the group of 18 observations with Dollar > 0.0555 is further split into two groups using the variable Exclamation; 4 of the 18 observations in this subset have Exclamation ≤ 0.0615, while the other 14 observations have Exclamation > 0.0615. That is, 4 observations have Dollar > 0.0555 and Exclamation ≤ 0.0615. This subset of 4 observations is further decomposed into 1 observation with Dollar ≤ 0.1665 and 3 observations with Dollar > 0.1665. At this point, there is no further branching in this portion of the tree since the corresponding subsets have zero impurity. That is, the subset of 1 observation with Dollar > 0.0555, Exclamation ≤ 0.0615, and Dollar ≤ 0.1665 is a Class 0 observation (nonspam), and the 3 observations in the subset with Dollar > 0.0555, Exclamation ≤ 0.0615, and Dollar > 0.1665 are all Class 1 observations. The recursive partitioning for the other branches in Figure 9.9 follows similar logic. The scatter chart in Figure 9.10 illustrates the final partitioning resulting from the sequence of variable splits. The rules defining a partition divide the variable space into eight rectangles, each corresponding to one of the eight leaf nodes in the tree in Figure 9.9.

As Figure 9.10 suggests in this case, with enough variable splitting, it is possible to obtain partitions on the training set such that each partition contains either Class 1 observations or Class 0 observations, but not both. In other words, enough decomposition of this data results in a set of partitions with zero impurity, and there are no misclassifications of the training set by this full tree. In general, unless there exist observations that have identical values of all the input variables but different outcome classes, the leaf nodes of the full classification tree will have zero impurity. However, applying the entire set of partitioning rules from the full classification tree to observations in the validation set will typically result in a relatively large classification error. The degree of partitioning in the full classification tree is an example of extreme overfitting; although the full classification tree perfectly characterizes the training set, it is unlikely to classify new observations well.

To understand how to construct a classification tree that performs well on new observations, we first examine how classification error is computed. The second column of Table 9.9 lists the classification error for each stage of constructing the classification tree in Figure 9.9. The training set on which this tree is based consists of 26 Class 0 observations and 20 Class 1 observations. Therefore, with no decision rules, we can achieve a classification error of 43.5% (20/46) on the training set by simply classifying all 46 observations as Class 0. Adding the first decision node separates the observations into two groups, one group of 28 and another of 18. The group of 28 observations has values of Dollar ≤ 0.0555; 25 of these observations are Class 0 and 3 are Class 1; therefore, by the majority rule, this group would be classified as Class 0, resulting in three misclassified observations. The group of 18 observations has values of Dollar > 0.0555; 1 of these observations is Class 0, and 17 are Class 1; therefore, by the majority rule, this group would be classified as Class 1, resulting in one misclassified observation. Thus, for one decision node, the classification tree has a classification error of (3 + 1)/46 = 0.087.

When the second decision node is added, the 28 observations with values of Dollar ≤ 0.0555 are further decomposed into a group of 21 observations and a group of 7 observations. The classification tree with two decision nodes has three groups: a group of 18 observations with Dollar > 0.0555, a group of 21 observations with Dollar ≤ 0.0555 and Exclamation ≤ 0.0735, and a group of 7 observations with Dollar ≤ 0.0555 and Exclamation > 0.0735. As before, the group of 18 observations would be classified as Class 1 and misclassify a single observation that is actually Class 0. In the group of 21 observations, all of these observations are Class 0, so there is no misclassification error for this group. In the group of 7 observations, 4 are Class 0 and 3 are Class 1. Therefore, by the majority rule, this group would be classified as Class 0, resulting in three misclassified observations. Thus, for the classification tree with two decision nodes (and three partitions), the classification error is (1 + 0 + 3)/46 = 0.087. Proceeding in a similar fashion, we can compute the classification error on the training set for classification trees with varying numbers of decision nodes to complete the second column of Table 9.9. Table 9.9 shows that the classification error on the training set decreases as we add more decision nodes and split the observations into smaller partitions.
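The majority-rule error calculation can be sketched compactly. Each partition is represented here as a pair of counts (Class 0, Class 1) taken from the text; under the majority rule, the smaller count in each partition is the number misclassified. The function name `classification_error` is illustrative.

```python
def classification_error(partitions):
    """partitions: list of (class0_count, class1_count), one pair per leaf."""
    total = sum(c0 + c1 for c0, c1 in partitions)
    # The majority class is predicted, so the minority count is misclassified.
    misclassified = sum(min(c0, c1) for c0, c1 in partitions)
    return misclassified / total

no_rules = [(26, 20)]                    # all 46 training observations
one_node = [(25, 3), (1, 17)]            # split on Dollar at 0.0555
two_nodes = [(21, 0), (4, 3), (1, 17)]   # plus split on Exclamation at 0.0735

for name, parts in [("0 nodes", no_rules), ("1 node", one_node), ("2 nodes", two_nodes)]:
    print(name, round(classification_error(parts), 3))
```

The second split leaves the error unchanged because it isolates a pure group of 21 observations without changing how any observation is classified.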

To evaluate how well the decision rules of the classification tree in Figure 9.9 established from the training set extend to other data, we apply it to a validation set from DemoHHI of 4,555 observations consisting of 2,762 Class 0 observations and 1,793 Class 1 observations. Without any decision rules, we can achieve a classification error of 39.4% (1,793/4,555) on the validation set by simply classifying all 4,555 observations as Class 0. Applying the first decision node separates the observations into a group of 3,452 observations with Dollar ≤ 0.0555 and a group of 1,103 with Dollar > 0.0555. In the group of 3,452 observations, 2,631 are Class 0 and 821 are Class 1; therefore, by the majority rule, this group

FIGURE 9.10 Geometric Illustration of Full Classification Tree Partitions (horizontal axis: Exclamation; vertical axis: Dollar; legend: Spam, Not Spam)

SpamDemoData

Figure 9.10 is based on all 46 observations, but only 28 of these observations are distinct. Of the 46 observations, 18 are not spam and have coordinates (0, 0). Another two observations are spam and have coordinates (0, 0.210).


would be classified as Class 0, resulting in 821 misclassified observations. In the group of 1,103 observations, 131 are Class 0 and 972 are Class 1; therefore, by the majority rule, this group would be classified as Class 1, resulting in 131 misclassified observations. Thus, for one decision node, the classification tree has a classification error of (821 + 131)/4,555 = 0.209 on the validation set. Proceeding in a similar fashion, we can apply the classification tree for varying numbers of decision nodes to compute the classification error on the validation set displayed in the third column of Table 9.9. Note that the classification error on the validation set does not necessarily decrease as more decision nodes split the observations into smaller partitions.

To identify a classification tree with good performance on new data, we “prune” the full classification tree by removing decision nodes in the reverse order in which they were added. In this manner, we seek to eliminate the decision nodes corresponding to weaker rules. Figure 9.11 illustrates the tree resulting from pruning the last variable-splitting rule (Exclamation ≤ 0.5605 or Exclamation > 0.5605) from Figure 9.9. By pruning this rule, we obtain a partition defined by Dollar ≤ 0.0555, Exclamation > 0.0735, and Exclamation > 0.2665 that contains three observations. Two of these observations are Class 1 (spam) and one is Class 0 (nonspam), so this pruned tree classifies observations in this partition as Class 1 observations, since the proportion of Class 1 observations in this partition (two-thirds) exceeds the default cutoff value of 0.5. Therefore, the classification error of this pruned tree on the training set is 1/46 = 0.022, an increase over the zero classification error of the full tree on the training set. However, Table 9.9 shows that applying the six decision rules of this pruned tree to the validation set achieves a classification error of 0.213, which is less than the classification error of 0.216 of the full tree on the validation set. Compared to the full tree with seven decision rules, the pruned tree with six decision rules is less likely to be overfit to the training set.

Sequentially removing decision nodes, we can obtain six pruned trees. These pruned trees have one to six variable splits (decision nodes). While adding decision nodes at first decreases the classification error on the validation set, too many decision nodes overfit the classification tree to the training data and result in increased error on the validation set. For each of these pruned trees, each observation belongs to a single partition defined by a sequence of decision rules and is classified as Class 1 if the proportion of Class 1 observations in the partition exceeds the cutoff value and Class 0 otherwise.

One common approach for identifying the best-pruned tree is to begin with the full classification tree and prune decision rules until the classification error on the validation set increases. Following this procedure, Table 9.9 suggests that a classification tree

TABLE 9.9 Classification Error Rates on Sequence of Pruned Trees

| No. of Decision Nodes | % Classification Error on Training Set | % Classification Error on Validation Set |
|---|---|---|
| 0 | 43.5 | 39.4 |
| 1 | 8.7 | 20.9 |
| 2 | 8.7 | 20.9 |
| 3 | 8.7 | 20.9 |
| 4 | 6.5 | 20.9 |
| 5 | 4.3 | 21.3 |
| 6 | 2.2 | 21.3 |
| 7 | 0 | 21.6 |


partitioning observations into two subsets with a single decision node (Dollar ≤ 0.0555 or Dollar > 0.0555) is just as reliable at classifying the validation data as any other tree. As Figure 9.12 shows, if the “$” character accounts for no more than 5.55% of the characters, this best-pruned tree classifies an e-mail as nonspam; otherwise, it classifies the e-mail as spam. This best-pruned classification tree results in a classification error of 20.9% on the validation set.
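The pruning procedure can be sketched using the validation errors from Table 9.9. Starting from the full tree, the loop below removes one decision node at a time as long as doing so does not increase the validation error, and stops just before the first increase. The function name `best_pruned` is illustrative, and treating an unchanged error as a reason to keep pruning is an assumption consistent with preferring the simpler tree.

```python
# Validation-set classification error (%) by number of decision nodes (Table 9.9).
validation_error = {0: 39.4, 1: 20.9, 2: 20.9, 3: 20.9, 4: 20.9, 5: 21.3, 6: 21.3, 7: 21.6}

def best_pruned(errors):
    """Prune from the full tree until removing another node would raise the error."""
    nodes = max(errors)  # start with the full tree
    while nodes > 0 and errors[nodes - 1] <= errors[nodes]:
        nodes -= 1       # pruning this node does not hurt validation error
    return nodes

print(best_pruned(validation_error))
```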

FIGURE 9.11 Classification Tree with One Pruned Branch

FIGURE 9.12 Best-Pruned Classification Tree (single split on Dollar at 0.0555; 3,452 observations on the left branch and 1,103 on the right branch)


Estimating Continuous Outcomes with a Regression Tree

To estimate a continuous outcome, a regression tree successively partitions observations of the training set into smaller and smaller groups in a similar fashion as a classification tree. The only differences are: (1) how impurity of the partitions is measured, and (2) how a partition is used to estimate the outcome value of an observation lying in that partition. Recall that in a classification tree, the impurity of a partition is based on the proportion of incorrectly classified observations. In a regression tree, the impurity of a partition is based on the variance of the outcome value for the observations in the group. A regression tree is constructed by sequentially identifying the variable-splitting rule that results in partitions with the smallest within-group variance of the outcome value. After a final tree is constructed, the estimated outcome value of an observation is based on the mean outcome value of the partition in which the new observation belongs.

To illustrate a regression tree, we consider the task of estimating the average balance of a bank customer using the customer’s age and whether he or she has ever defaulted on a loan. We construct the regression tree based on the 10 observations in Table 9.6. Figure 9.13 displays the first six variable-splitting rules of the regression tree on the variable space. The blue lines correspond to the variable-splitting rules and the numbers within the circles denote the order in which the rules were introduced. The first rule splits the 10 observations into 5 observations with Loan Default ≤ 0.5 and 5

FIGURE 9.13 Geometric Illustration of First Six Rules of a Regression Tree (horizontal axis: Age; vertical axis: Loan Default (1 = yes, 0 = no); partition predictions shown: 36, 1146, 6574, 2200, 671, 86, 454)


with Loan Default > 0.5. This rule results in two groups of observations such that the variance in Average Balance within the groups is as small as possible. The second rule further splits the 5 observations with Loan Default ≤ 0.5 into a partition with 3 with Age ≤ 33 and 2 with Age > 33. Again, this rule results in the largest reduction in variance within any partition. Four more rules further split the observations into partitions with smaller Average Balance variance as illustrated by Figure 9.13. This six-rule regression tree would then set its prediction estimate of each partition to be the average of the Average Balance variable (depicted in the red boxes in Figure 9.13).
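A variance-based split search of this kind can be sketched on the five non-default observations from Table 9.6. As an assumption for this example, impurity is measured as the total squared deviation from the group mean, summed over the two groups (equivalent to size-weighted variance), and candidate thresholds are midpoints between consecutive ages; actual software may use a different metric or candidate set. The names `sse` and `best_age_split` are illustrative.

```python
from statistics import mean

# Non-default observations from Table 9.6 as (age, average balance).
non_default = [(29, 36), (31, 192), (35, 6574), (58, 2200), (30, 2100)]

def sse(values):
    """Sum of squared deviations from the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_age_split(rows):
    """Return the midpoint threshold on age that minimizes within-group SSE."""
    ages = sorted({a for a, _ in rows})
    candidates = [(lo + hi) / 2 for lo, hi in zip(ages, ages[1:])]

    def cost(threshold):
        left = [bal for a, bal in rows if a <= threshold]
        right = [bal for a, bal in rows if a > threshold]
        return sse(left) + sse(right)

    return min(candidates, key=cost)

print(best_age_split(non_default))
```

Under these assumptions the search selects the same Age threshold of 33 that the text reports for the second rule.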

Note that the full regression tree would continue to partition the variable space into smaller rectangles until the variance of the value of Average Balance within each partition is as small as possible. That is, the leaf nodes of the full regression tree will achieve zero impurity unless there exist observations that have identical values of all the input variables but different values of the outcome variable (Average Balance). Then, similar to the classification tree, rules are pruned from this full regression tree in order to obtain the simplest tree that achieves the least amount of prediction error on the validation set.

Ensemble Methods

Up to this point, we have demonstrated the prediction of a new observation (either classification in the case of a categorical outcome or estimation in the case of a continuous outcome) based on the decision rules of a single constructed tree. In this section, we discuss the notion of ensemble methods. In an ensemble method, predictions are made based on the combination of a collection of models. For example, instead of basing the classification of a new observation on a single classification tree, an ensemble method generates a collection of different classification trees and then predicts the class of a new observation based on the collective voting of this collection.

To gain an intuitive grasp of why an ensemble of prediction models may outperform, on average, any single prediction model, let’s consider the task of predicting the value of the S&P 500 Index one year in the future. Suppose there are 100 financial analysts independently developing their own forecast based on a variety of information. One year from now, there certainly will be one analyst (or more in the case of a tie) whose forecast will prove to be the most accurate. However, identifying beforehand which of the 100 analysts will be the most accurate may be virtually impossible. Therefore, instead of trying to pick one of the analysts and depending solely on their forecast, an ensemble approach would combine their forecasts (e.g., taking an average of the 100 forecast values) and use this as the predicted value of the S&P 500 Index. The two necessary conditions for an ensemble to perform better than a single model are as follows: (1) The individual base models are constructed independently of each other (analysts don’t base their forecasts on the forecasts of other analysts), and (2) the individual models perform better than just randomly guessing.

There are two primary steps to an ensemble approach: (1) the development of a committee of individual base models, and (2) the combination of the individual base models’ predictions to form a composite prediction. While an ensemble can be composed of any type of individual classification or estimation model, the ensemble approach works better with an unstable prediction method. A classification or estimation method is unstable if relatively small changes in the training set cause its predictions to fluctuate substantially. In this section, we discuss ensemble methods using classification or regression trees, which are known to be unstable. Specifically, we discuss three different ways to construct an ensemble of classification or regression trees: bagging, boosting, and random forests.

In the bagging approach, the committee of individual base models is generated by first constructing multiple (say, m) training sets by repeated random sampling of the n observations in the original data with replacement. Because the sampling is done with replacement, some observations may appear multiple times in a single training set, while other observations will not appear at all. If each generated training set consists of n observations, then the probability of an observation from the original data not being selected for a specific training set is $((n-1)/n)^n$. Therefore, the average proportion of a training set of size n that are unique observations from the original data is $1 - ((n-1)/n)^n$. The bagging approach then trains a predictive model on each of the m training sets and generates the ensemble prediction based on the average of the m individual predictions.
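As a quick check of the sampling formula, the following sketch compares the expected proportion of unique observations, $1 - ((n-1)/n)^n$, with a simulation that resamples the 10 ages of Table 9.10. The helper names, the random seed, and the number of trials are illustrative assumptions.

```python
import random

def prob_not_selected(n):
    """Probability that a given observation is absent from a bootstrap sample of size n."""
    return ((n - 1) / n) ** n

def bootstrap_sample(data, rng):
    """Draw len(data) observations from data with replacement."""
    return [rng.choice(data) for _ in data]

ages = [29, 31, 35, 38, 47, 48, 53, 54, 58, 70]  # ages from Table 9.10 (all distinct)
n = len(ages)
expected_unique = 1 - prob_not_selected(n)

# Average share of distinct original observations over many bootstrap samples.
rng = random.Random(1)
trials = 10_000
avg_unique = sum(len(set(bootstrap_sample(ages, rng))) / n for _ in range(trials)) / trials

print(f"formula:    {expected_unique:.4f}")
print(f"simulation: {avg_unique:.4f}")
```

For n = 10 the formula gives about 0.651. As n grows, the proportion approaches $1 - 1/e \approx 0.632$, so roughly a third of the original observations are left out of any one training set.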

To demonstrate bagging, we consider the task of classifying customers as defaulting or not defaulting on their loan, using only their age. Table 9.10 contains the 10 observations in the original training data. Table 9.11 shows the results of generating 10 new training sets by randomly sampling from the original data with replacement. For each of these training sets, we construct a one-rule classification tree that minimizes the impurity of the resulting partition. The decision rule that splits each training set into two partitions is listed with that training set.

Table 9.12 shows the results of applying this ensemble of 10 classification trees to a validation set consisting of 10 observations. The ensemble method bases its classification on the average of the 10 individual classification trees; if at least half of the individual trees classify an observation as Class 1, so does the ensemble. Note from Table 9.12 that the 20% classification error rate of the ensemble is lower than that of any of the individual trees, illustrating the potential advantage of using ensemble methods.
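The vote in Table 9.12 can be reproduced with a short sketch. It encodes each one-rule tree of Table 9.11 as a threshold plus the class predicted on each side, reading every rule as "Age ≤ threshold"; the tuple layout and the function names are illustrative choices, not part of the original example.

```python
# Each one-rule tree from Table 9.11:
# (age threshold, prediction if age <= threshold, prediction otherwise)
stumps = [
    (36.5, 0, 1),  # Tree 1
    (50.5, 0, 0),  # Tree 2
    (36.5, 0, 1),  # Tree 3
    (34.5, 0, 1),  # Tree 4
    (39.0, 0, 1),  # Tree 5
    (53.5, 1, 0),  # Tree 6
    (53.5, 1, 0),  # Tree 7
    (53.5, 1, 0),  # Tree 8
    (53.5, 1, 0),  # Tree 9
    (14.5, 0, 0),  # Tree 10
]

# Validation set from Table 9.12.
validation_ages = [26, 29, 30, 32, 34, 37, 42, 47, 48, 54]
validation_default = [1, 0, 0, 0, 0, 1, 0, 1, 1, 0]

def predict_stump(stump, age):
    """Apply a one-rule tree to a single age."""
    threshold, left, right = stump
    return left if age <= threshold else right

def error_rate(predicted, actual):
    """Share of observations whose predicted class differs from the actual class."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)

tree_predictions = [[predict_stump(s, age) for age in validation_ages] for s in stumps]
tree_errors = [error_rate(pred, validation_default) for pred in tree_predictions]

# Average vote per observation; Class 1 if at least half of the trees vote for it.
average_vote = [sum(votes) / len(stumps) for votes in zip(*tree_predictions)]
ensemble = [1 if vote >= 0.5 else 0 for vote in average_vote]
ensemble_error = error_rate(ensemble, validation_default)

print("tree error rates:", tree_errors)
print("ensemble:", ensemble, "error rate:", ensemble_error)
```

The ensemble misclassifies only the observations with ages 26 and 42, whereas every individual tree misclassifies at least three observations.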

Similar to bagging, the boosting method generates its committee of individual base models by sampling multiple training sets. However, boosting differs from bagging in how it samples the multiple training sets and how it weights the resulting classification or estimation models to compute the ensemble’s prediction. Boosting iteratively adapts how it samples the original data when constructing a new training set based on the prediction error of the models constructed on the previous training sets. To generate the first training set, each of the n observations in the original data is initially given an equal weight of being selected. That is, each observation i has weight $w_i = 1/n$. A classification or estimation model is then trained on this training set and is used to predict the outcome of the n observations in the original data. The weight of each observation i is then adjusted based on the degree of its prediction error. For example, in a classification problem, if an observation i is misclassified by a classifier, then its weight $w_i$ is increased, but if it is correctly classified, then its weight $w_i$ is decreased. The next training set is then generated by sampling the observations according to the updated weights. In this manner, the next training set is more likely to contain observations that have been mispredicted in earlier iterations.

To combine the predictions of the m individual models from the m training sets, boosting weights the vote of each individual model based on its overall prediction error. For example, suppose that the classifier associated with the jth training set has a large prediction error and the classifier associated with the kth training set has a small prediction error. Then the classification votes of the jth classifier will be weighted less than the classification votes of the kth classifier when they are combined. Note that this method differs from bagging, in which each of the individual classifiers has an equally weighted vote.
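The text does not give formulas for the weight adjustment or for the model votes. One widely used concrete choice is the AdaBoost update, and the sketch below uses it purely as an assumption to show a single boosting round. The function name `boosting_round` and the example one-rule tree (predict default if Age > 36.5) are hypothetical.

```python
import math

def boosting_round(weights, actual, predicted):
    """One AdaBoost-style round: return (updated observation weights, model vote weight)."""
    # Weighted error of the current model on the original data.
    error = sum(w for w, a, p in zip(weights, actual, predicted) if a != p)
    if not 0.0 < error < 1.0:
        raise ValueError("weighted error must be strictly between 0 and 1")
    # Models with smaller error receive a larger vote.
    vote_weight = 0.5 * math.log((1.0 - error) / error)
    # Increase weights of misclassified observations, decrease the others.
    raw = [
        w * math.exp(vote_weight if a != p else -vote_weight)
        for w, a, p in zip(weights, actual, predicted)
    ]
    total = sum(raw)
    return [r / total for r in raw], vote_weight

actual = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]     # loan defaults from Table 9.10
predicted = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]  # hypothetical tree: default if Age > 36.5
n = len(actual)
weights = [1.0 / n] * n                      # every observation starts with weight 1/n

new_weights, vote_weight = boosting_round(weights, actual, predicted)
print("vote weight:", round(vote_weight, 4))
print("new weights:", [round(w, 4) for w in new_weights])
```

After this round the three misclassified observations (ages 54, 58, and 70) each carry weight 1/6 and the seven correctly classified observations each carry weight 1/14, so the next training set is more likely to contain the misclassified ones.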

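TABLE 9.10 Original 10-Observation Training Data

| Age | 29 | 31 | 35 | 38 | 47 | 48 | 53 | 54 | 58 | 70 |
|---|---|---|---|---|---|---|---|---|---|---|
| Loan default | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |

TABLE 9.11 Bagging: Generation of 10 New Training Sets and Corresponding Classification Trees

Iteration 1 (Age ≤ 36.5)

| Age | 29 | 31 | 31 | 35 | 38 | 38 | 47 | 48 | 58 | 58 |
|---|---|---|---|---|---|---|---|---|---|---|
| Loan default | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |
| Prediction | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |

Iteration 2 (Age ≤ 50.5)

| Age | 29 | 31 | 35 | 38 | 47 | 54 | 58 | 70 | 70 | 70 |
|---|---|---|---|---|---|---|---|---|---|---|
| Loan default | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Prediction | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Iteration 3 (Age ≤ 36.5)

| Age | 29 | 31 | 35 | 38 | 38 | 47 | 53 | 53 | 54 | 58 |
|---|---|---|---|---|---|---|---|---|---|---|
| Loan default | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| Prediction | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

Iteration 4 (Age ≤ 34.5)

| Age | 29 | 29 | 31 | 38 | 38 | 47 | 47 | 53 | 54 | 58 |
|---|---|---|---|---|---|---|---|---|---|---|
| Loan default | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| Prediction | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

Iteration 5 (Age ≤ 39)

| Age | 29 | 29 | 31 | 47 | 48 | 48 | 48 | 70 | 70 | 70 |
|---|---|---|---|---|---|---|---|---|---|---|
| Loan default | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Prediction | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

Iteration 6 (Age ≤ 53.5)

| Age | 31 | 38 | 47 | 48 | 53 | 53 | 53 | 54 | 58 | 70 |
|---|---|---|---|---|---|---|---|---|---|---|
| Loan default | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Prediction | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |

Iteration 7 (Age ≤ 53.5)

| Age | 29 | 38 | 38 | 48 | 53 | 54 | 58 | 58 | 58 | 70 |
|---|---|---|---|---|---|---|---|---|---|---|
| Loan default | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| Prediction | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |

Iteration 8 (Age ≤ 53.5)

| Age | 29 | 31 | 47 | 47 | 47 | 53 | 53 | 54 | 58 | 70 |
|---|---|---|---|---|---|---|---|---|---|---|
| Loan default | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Prediction | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |

Iteration 9 (Age ≤ 53.5)

| Age | 29 | 35 | 38 | 38 | 48 | 53 | 53 | 54 | 70 | 70 |
|---|---|---|---|---|---|---|---|---|---|---|
| Loan default | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Prediction | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |

Iteration 10 (Age ≤ 14.5)

| Age | 29 | 29 | 29 | 29 | 35 | 35 | 54 | 54 | 58 | 58 |
|---|---|---|---|---|---|---|---|---|---|---|
| Loan default | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Prediction | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

TABLE 9.12 Classification of 10 Observations from Validation Set with Bagging Ensemble

| Age | 26 | 29 | 30 | 32 | 34 | 37 | 42 | 47 | 48 | 54 | Overall Error Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Loan default | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | |
| Tree 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 30% |
| Tree 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40% |
| Tree 3 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 30% |
| Tree 4 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 30% |
| Tree 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 40% |
| Tree 6 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 50% |
| Tree 7 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 50% |
| Tree 8 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 50% |
| Tree 9 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 50% |
| Tree 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 40% |
| Average Vote | 0.4 | 0.4 | 0.4 | 0.4 | 0.4 | 0.7 | 0.8 | 0.8 | 0.8 | 0.4 | |
| Bagging Ensemble | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 20% |

Random forests can be viewed as a variation of bagging specifically tailored for use with classification or regression trees. As in bagging, the random forests approach generates multiple training sets by randomly sampling (with replacement) the n observations in the original data. However, when constructing a tree model for each separate training set, each tree is restricted to using only a fixed number of randomly selected input variables. For example, suppose we are attempting to classify a tax return as fraudulent or not and there are q input variables. For each of the m generated training sets, an individual classification tree is constructed using splitting rules based on f randomly selected input variables, where f is much smaller than q. The individual classification trees are referred to as “weak learners” because they are only allowed to consider a small subset of input variables. We note that these “weak learner” individual trees do not need to be pruned on a validation set, as incorporating them into an ensemble reduces the likelihood of overfitting. While the best number of individual trees in the random forest depends on the data, it is not unusual for a random forest to consist of hundreds or even thousands of individual trees.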
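The two sources of randomness in a random forest, the resampled observations and the restricted set of input variables, can be sketched as follows. This is only an outline of the sampling step as described above (one subset of f variables per tree); the function name `draw_forest_inputs` and the values of n, q, f, and m are hypothetical, and no trees are actually fitted.

```python
import random

def draw_forest_inputs(n, q, f, m, seed=0):
    """For each of m trees, draw a bootstrap sample of row indices
    and a random subset of f of the q input variables."""
    rng = random.Random(seed)
    plans = []
    for _ in range(m):
        rows = [rng.randrange(n) for _ in range(n)]   # n rows sampled with replacement
        features = sorted(rng.sample(range(q), f))    # f distinct variables out of q
        plans.append((rows, features))
    return plans

# Hypothetical sizes: 10 observations, 8 input variables, 3 variables per tree, 5 trees.
plans = draw_forest_inputs(n=10, q=8, f=3, m=5)
for tree_number, (rows, features) in enumerate(plans, start=1):
    print(f"tree {tree_number}: rows {rows}, input variables {features}")
```

Each tree would then be trained only on its own rows and input variables, and the ensemble prediction would combine the trees' votes as in bagging. Many software implementations instead draw a fresh subset of variables at every split rather than once per tree.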

For most problems, the predictive performance of boosting ensembles exceeds the predictive performance of bagging ensembles. Boosting achieves its performance advantage because: (1) it evolves its committee of models by focusing on observations that are mispredicted, and (2) the member models’ votes are weighted by their accuracy. However, boosting is more computationally expensive than bagging. Because there is no adaptive feedback in a bagging approach, all m training sets and corresponding models can be constructed simultaneously. However, in boosting, the first training set and predictive model guide the construction of the second training set and predictive model, and so on. The random forests approach has performance similar to boosting, but maintains the computational simplicity of bagging.

SUMMARY

In this chapter, we introduced the concepts and techniques in predictive data mining. Predictive data mining methods, also called supervised learning, classify a categorical outcome or estimate a continuous outcome. We described how to partition data into training, validation, and test sets in order to construct and evaluate predictive data mining models. We discussed various performance measures for classification and estimation methods. We presented three common data mining methods: logistic regression, k-nearest neighbors, and classification/regression trees. We explained how logistic regression is analogous to multiple linear regression for the case when the outcome variable is binary. We demonstrated how to use logistic regression, as well as k-nearest neighbors and classification trees, to classify a binary categorical outcome. We also discussed the use of k-nearest neighbors and regression trees to estimate a continuous outcome. In our discussion of ensemble methods, we presented the concept of generating multiple prediction models and combining their predictions. We illustrated the use of ensemble methods within the context of classification trees and noted that ensemble methods based on large committees of “weak” prediction models generally outperform a single “strong” prediction model. Table 9.13 provides a comparative summary of common supervised learning approaches. We provide brief descriptions of support vector machines, the naïve Bayes method, and neural networks in the following Notes + Comments section.

TABLE 9.13 Overview of Common Supervised Learning Methods

| Method | Strengths | Weaknesses |
|---|---|---|
| k-NN | Simple | Requires large amounts of data relative to number of variables |
| Classification and Regression Trees | Provides easy-to-interpret business rules; can handle data sets with missing data | May miss interactions between variables since splits occur one at a time; sensitive to changes in data entries |
| Multiple Linear Regression | Provides easy-to-interpret relationship between dependent and independent variables | Assumes linear relationship between outcome and variables |
| Logistic Regression | Provides interpretable effects of each variable on the log odds of an outcome | Assumes linear relationship between log odds of an outcome and variables |
| Support Vector Machines | Can incorporate nonlinear effects and are robust against overfitting | May be difficult to directly apply on data sets with a large number of observations and variables |
| Naive Bayes | Simple and effective at classifying | Requires a large amount of data; restricted to categorical variables |
| Neural Networks | Flexible and often effective | Many difficult decisions to make when building the model; results cannot be easily explained, i.e., “black box” |

NOTES + COMMENTS

1. A support vector machine separates observations using a hyperplane to define a boundary. When the boundary is restricted to be linear, a support vector machine is similar to the logistic equation resulting from logistic regression. However, a support vector machine can separate observations using nonlinear boundaries and capture more sophisticated relationships between variables.

2. The idea behind the naive Bayes method is to express the likelihood that an observation belongs to Class 1 as a conditional probability that is then decomposed using Bayes’ theorem. The naive aspect comes from the assumption that, given the class, each feature is conditionally independent of every other feature.

3. Neural networks are based on the biological model of brain activity. Well-structured neural networks have been shown to possess accurate classification and estimation performance in many application domains. However, the use of neural networks is a “black box” method that provides little interpretable explanation to accompany the predictions. Adjusting the parameters to tune the neural network performance is largely trial-and-error guided by rules of thumb and user experience. Neural networks form the basis of deep learning, an emerging area in machine learning with applications in image and speech recognition, among others.

GLOSSARY

Accuracy: Measure of classification success defined as 1 minus the overall error rate.

Area under the ROC curve (AUC): A measure of a classification method’s performance; an AUC of 0.5 implies that a method is no better than random classification while a perfect classifier has an AUC of 1.0.

Average error: The average difference between the actual values and the predicted values of observations in a data set; used to detect prediction bias.

Bagging: An ensemble method that generates a committee of models based on different random samples and makes predictions based on the average prediction of the set of models.

Bias: The tendency of a predictive model to overestimate or underestimate the value of a continuous outcome.

Boosting: An ensemble method that iteratively samples from the original training data to generate individual models that target observations that were mispredicted in previously generated models, and then bases the ensemble predictions on the weighted average of the predictions of the individual models, where the weights are proportional to the individual models’ accuracy.

Class 0 error rate: The percentage of actual Class 0 observations misclassified by a model in a data set.

Class 1 error rate: The percentage of actual Class 1 observations misclassified by a model in a data set.

Classification: A predictive data mining task requiring the prediction of an observation’s outcome class or category.

Classification tree: A tree that classifies a categorical outcome variable by splitting observations into groups via a sequence of hierarchical rules on the input variables.

Confusion matrix: A matrix showing the counts of actual versus predicted class values.

Cumulative lift chart: A chart used to present how well a model performs in identifying observations most likely to be in Class 1 as compared with random classification.

Cutoff value: The smallest value that the predicted probability of an observation can be for the observation to be classified as Class 1.

Decile-wise lift chart: A chart used to present how well a model performs at identifying observations for each of the top k deciles most likely to be in Class 1 versus a random classification.

Ensemble method: A predictive data mining approach in which a committee of individual classification or estimation models is generated and a prediction is made by combining these individual predictions.

Estimation: A predictive data mining task requiring the prediction of an observation’s continuous outcome value.

F1 Score: A measure combining precision and sensitivity into a single metric.

False negative: The misclassification of a Class 1 observation as Class 0.

False positive: The misclassification of a Class 0 observation as Class 1.

Features: A set of input variables used to predict an observation’s outcome class or continuous outcome value.

Impurity: Measure of the heterogeneity of observations in a classification or regression tree.

k-nearest neighbors: A data mining method that predicts (classifies or estimates) an observation i’s outcome value based on the k observations most similar to observation i with respect to the input variables.

Logistic regression: A generalization of linear regression that predicts a categorical outcome variable by computing the log odds of the outcome as a linear function of the input variables.

Mallow’s $C_p$ statistic: A measure in which small values approximately equal to the number of coefficients suggest promising logistic regression models.

Model overfitting: A situation in which a model explains random patterns in the data on which it is trained rather than just the generalizable relationships, resulting in a model with training-set performance that greatly exceeds its performance on new data.

Observation (record): A set of observed values of variables associated with a single entity, often displayed as a row in a spreadsheet or database.

Overall error rate: The percentage of observations misclassified by a model in a data set.

Precision: The percentage of observations predicted to be Class 1 that actually are Class 1.

Random forests: A variant of the bagging ensemble method that generates a committee of classification or regression trees based on different random samples but restricts each individual tree to a limited number of randomly selected features (variables).

Receiver operating characteristic (ROC) curve: A chart used to illustrate the tradeoff between a model’s ability to identify Class 1 observations and its Class 0 error rate.

Regression tree: A tree that predicts values of a continuous outcome variable by splitting observations into groups via a sequence of hierarchical rules on the input variables.

Root mean squared error: A performance measure of an estimation method defined as the square root of the average of the squared deviations between the actual values and predicted values of observations.

Sensitivity (recall): The percentage of actual Class 1 observations correctly identified.

Specificity: The percentage of actual Class 0 observations correctly identified.

Supervised learning: Category of data mining techniques in which an algorithm learns how to classify or estimate an outcome variable of interest.

Test set: Data set used to compute an unbiased estimate of the final predictive model’s performance.

Training set: Data used to build candidate predictive models.

Unstable: When small changes in the training set cause a model’s predictions to fluctuate substantially.

Validation set: Data used to evaluate candidate predictive models.

Variable (feature): A characteristic or quantity of interest that can take on different values.

PROBLEMS

1. The dating web site Oollama.com requires its users to create profiles based on a survey in which they rate their interest (on a scale from 0 to 3) in five categories: physical fitness, music, spirituality, education, and alcohol consumption. A new Oollama customer, Erin O’Shaughnessy, has reviewed the profiles of 40 prospective dates and classified whether she is interested in learning more about them.

Based on Erin’s classification of these 40 profiles, Oollama has applied a logistic regression to predict Erin’s interest in other profiles that she has not yet viewed. The resulting logistic regression model is as follows:

$$\text{Log odds of Interested} = -0.920 + 0.325 \times \text{Fitness} - 3.611 \times \text{Music} + 5.535 \times \text{Education} - 2.927 \times \text{Alcohol}$$

For the 40 profiles (observations) on which Erin classified her interest, this logistic regression model generates the following probabilities of Interested.

| Observation | Interested | Probability of Interested | Observation | Interested | Probability of Interested |
|---|---|---|---|---|---|
| 35 | 1 | 1.000 | 13 | 0 | 0.412 |
| 21 | 1 | 0.999 | 2 | 0 | 0.285 |
| 29 | 1 | 0.999 | 3 | 0 | 0.219 |
| 25 | 1 | 0.999 | 7 | 0 | 0.168 |
| 39 | 1 | 0.999 | 9 | 0 | 0.168 |
| 26 | 1 | 0.990 | 12 | 0 | 0.168 |
| 23 | 1 | 0.981 | 18 | 0 | 0.168 |
| 33 | 1 | 0.974 | 22 | 1 | 0.168 |
| 1 | 0 | 0.882 | 31 | 1 | 0.168 |
| 24 | 1 | 0.882 | 6 | 0 | 0.128 |
| 28 | 1 | 0.882 | 20 | 0 | 0.128 |
| 36 | 1 | 0.882 | 15 | 0 | 0.029 |
| 16 | 0 | 0.791 | 5 | 0 | 0.020 |
| 27 | 1 | 0.791 | 14 | 0 | 0.015 |
| 30 | 1 | 0.791 | 19 | 0 | 0.011 |
| 32 | 1 | 0.791 | 8 | 0 | 0.008 |
| 34 | 1 | 0.791 | 10 | 0 | 0.001 |
| 37 | 1 | 0.791 | 17 | 0 | 0.001 |
| 40 | 1 | 0.791 | 4 | 0 | 0.001 |
| 38 | 1 | 0.732 | 11 | 0 | 0.000 |

a. Using a cutoff value of 0.5 to classify a profile observation as Interested or not, construct the confusion matrix for this 40-observation training set. Compute sensitivity, specificity, and precision measures and interpret them within the context of Erin’s dating prospects.

b. Oollama understands that its clients have a limited amount of time for dating and therefore uses decile-wise lift charts to evaluate its classification models. For the training data, what is the first decile lift resulting from the logistic regression model? Interpret this value.

c. A recently posted profile has values of Fitness = 3, Music = 1, Education = 3, and Alcohol = 1. Use the estimated logistic regression equation to compute the probability of Erin’s interest in this profile.

d. Now that Oollama has trained a logistic regression model based on Erin’s initial evaluations of 40 profiles, what should its next steps be in the modeling process?

2. Fleur-de-Lis is a boutique bakery specializing in cupcakes. The bakers at Fleur-de-Lis like to experiment with different combinations of four major ingredients in its cupcakes and collect customer feedback; it has data on 150 combinations of ingredients with the corresponding customer reception for each combination classified as “thumbs up” (Class 1) or “thumbs down” (Class 0). To better anticipate the customer feedback of new recipes, Fleur-de-Lis has determined that a k-nearest neighbors classifier with k = 10 seems to perform well.

Using a cutoff value of 0.5 and a validation set of 45 observations, Fleur-de-Lis constructs the following confusion matrix and ROC curve for the k-nearest neighbors classifier with k = 10:

| Actual Feedback | Predicted Thumbs Up | Predicted Thumbs Down |
|---|---|---|
| Thumbs Up | 13 | 1 |
| Thumbs Down | 1 | 30 |

[Figure: ROC curve plotting sensitivity (vertical axis, 0.0 to 1.0) against 1 − specificity (horizontal axis, 0.0 to 1.0) for the Random Classifier, Optimum Classifier, and Fitted Classifier.]

As the confusion matrix shows, there is one observation that actually received a thumbs down, but the k-nearest neighbors classifier predicts a thumbs up. Also, there is one observation that actually received a thumbs up, but the k-nearest neighbors classifier predicts a thumbs down. Specifically:

| Observation ID | Actual Class | Probability of Thumbs Up | Predicted Class |
|---|---|---|---|
| A | Thumbs Down | 0.5 | Thumbs Up |
| B | Thumbs Up | 0.2 | Thumbs Down |

a. Explain how the probability of Thumbs Up was computed for Observation A and Observation B. Why was Observation A classified as Thumbs Up and Observation B was classified as Thumbs Down?

b. Compute the values of sensitivity and specificity corresponding to the confusion matrix created using the cutoff value of 0.5. Locate the point corresponding to these values of sensitivity and specificity on the ROC curve shown above.

c. Based on what we know about Observation B, if the cutoff value is lowered to 0.2, what happens to the values of sensitivity and specificity? Explain. Use the ROC curve to estimate the values of sensitivity and specificity for a cutoff value of 0.2.

3. Casey Deesel is a sports agent negotiating a contract for Titus Johnston, an athlete in the National Football League (NFL). An important aspect of any NFL contract is the amount of guaranteed money over the life of the contract. Casey has gathered data on 506 NFL athletes who have recently signed new contracts. Each observation (NFL athlete) includes values for percentage of his team’s plays that the athlete is on the field (SnapPercent), the number of awards an athlete has received recognizing on-field performance (Awards), the number of games the athlete has missed due to injury (GamesMissed), and millions of dollars of guaranteed money in the athlete’s most recent contract (Money, dependent variable).

Casey has trained a full regression tree on 304 observations and then used the validation set to prune the tree to obtain a best-pruned tree. The best-pruned tree (as applied to the 202 observations in the validation set) is:

[Figure: best-pruned regression tree with decision nodes on SnapPercent, Awards, and GamesMissed; the split values, leaf estimates, and observation counts are not recoverable from the extracted text.]

a. Titus Johnston’s variable values are SnapPercent = 96, Awards = 7, and GamesMissed = 3. How much guaranteed money does the regression tree predict that a player with Titus Johnston’s profile should earn in his contract?

b. Casey feels that Titus was denied an additional award in the past season due to some questionable voting by some sports media. If Titus had won this additional award, how much additional guaranteed money would the regression tree predict for Titus versus the prediction in part (a)?

c. As Casey reviews the best-pruned tree, he is confused by the leaf node corresponding to the sequence of decision rules of “SnapPercent > 90.28, SnapPercent < 95.37, Awards < 6.75, GamesMissed < 1.5.” This sequence of decision rules results in an estimate of $50 million of guaranteed money, but the tree states that zero observations occur in the corresponding partition. If zero observations occur in this partition, how can the regression tree provide an estimate of $50 million? Explain this part of the regression tree to Casey by referring to how the best-pruned tree is obtained.

4. Sommelier4U is a company that ships its customers bottles of different types of wine and then has them rate the wines as “Like” or “Dislike.” For each customer, Sommelier4U trains a classification tree based on the characteristics and customer ratings of wines that the customer has tasted. Then, Sommelier4U uses the classification tree to identify new wines that the customer may Like. Sommelier4U recommends the wines that have a greater than 50% probability of being liked. Neal Jones, a loyal customer, has provided feedback on hundreds of different wines that he has tasted. Based on this feedback, Sommelier4U trained and validated the following classification tree:

[Figure: classification tree with decision nodes on Proline, Ash, Flavanoids, and Magnesium; the split values, leaf classes, and observation counts are not recoverable from the extracted text.]

a. For these 178 wines, the tree only misclassifies two wines. These wines have the following characteristics:

Wine 1: Proline = 735, Ash = 2.88, Flavanoids = 2.69, Magnesium = 118

Wine 2: Proline = 680, Ash = 2.29, Flavanoids = 2.63, Magnesium = 103

Based on this information, construct the confusion matrix for the 178 wines. In order to better learn Neal’s preferences, what types of wines could Sommelier4U recommend to him?

b. Consider the wine with the following characteristics: Proline = 820, Ash = 2.16, Flavanoids = 3.1, and Magnesium = 87. Does Sommelier4U believe that Neal will Like this wine?

5. A university is applying classification methods in order to identify alumni who may be interested in donating money. The university has a database of 58,205 alumni profiles containing numerous variables. Of these 58,205 alumni, only 576 have donated in the past. The university has oversampled the data and trained a random forest of 100 classification trees. For a cutoff value of 0.5, the following confusion matrix summarizes the performance of the random forest on a validation set:

| Actual | Predicted Donation | Predicted No Donation |
|---|---|---|
| Donation | 268 | 20 |
| No Donation | 5,375 | 23,439 |

The following table lists some information on individual observations from the validation set:

| Observation ID | Actual Class | Probability of Donation | Predicted Class |
|---|---|---|---|
| A | Donation | 0.8 | Donation |
| B | No Donation | 0.1 | No Donation |
| C | No Donation | 0.6 | Donation |

a. Explain how the probability of Donation was computed for the three observations. Why were Observations A and C classified as Donation and Observation B was classified as No Donation?

b. Compute the values of accuracy, sensitivity, specificity, and precision. Explain why accuracy is a misleading measure to consider in this case. Evaluate the performance of the random forest, particularly commenting on the precision measure.

6. Salmons Stores operates a national chain of women’s apparel stores. Five thousand copies of an expensive four-color sales catalog have been printed, and each catalog includes a coupon that provides a $50 discount on purchases of $200 or more. Salmons would like to send the catalogs only to customers who have the highest probability of using the coupon. The file Salmons contains data from an earlier promotional campaign. For each of 1,000 Salmons customers, three variables are tracked: last year’s total spending at Salmons (Spending), whether they have a Salmons store credit card (Card), and whether they used the promotional coupon they were sent (Coupon). Apply logistic regression to classify observations as a promotion-responder or not by using Spending and Card as input variables and Coupon as the output variable.

a. Evaluate candidate logistic regression models based on their classification error. Recommend a final model and express the model as a mathematical equation relat- ing the output variable to the input variables.

b. For the model selected in part (a), interpret the meaning of the first decile lift in the decile-wise lift chart on the test set.

c. What is the area under the ROC curve on the test set? To achieve a sensitivity of at least 0.80, how much Class 0 error rate must be tolerated?

7. Over the past few years the percentage of students who leave Dana College at the end of their first year has increased. Last year, Dana started voluntary one-credit hour-long seminars with faculty to help first-year students establish an on-campus connection. If Dana is able to show that the seminars have a positive effect on retention, college administrators will be convinced to continue funding this initiative. Dana’s administration also suspects that first-year students with lower high school GPAs have a higher probability of leaving Dana at the end of the first year. The file Dana contains data on the 500 first-year students from last year. Each observation consists of a first-year student’s high school GPA, whether they enrolled in a seminar, and whether they dropped out and did not return to Dana. Apply logistic regression to classify observations as dropped out or not dropped out by using GPA and Seminar as input variables and Dropped as the output variable.

a. Evaluate the candidate logistic regression models based on their predictive performance on the validation set. Recommend a final model and express the model as a mathematical equation relating the output variable to the input variables. What is the implication on the effectiveness of the first-year seminars on retention?

b. The data analyst team realized that they jumped directly into building a predictive model without exploring the data. Using descriptive statistics and charts, investigate any relationships in the data that may explain the unsatisfactory result in part (a). For next year’s first-year class, what could Dana’s administration do regarding the enrollment of the seminars to better determine whether they have an effect on retention?

8. Sandhills Bank would like to increase the number of customers who use payroll direct deposit as part of the rollout of its new e-banking platform. Management has proposed offering an increased interest rate on a savings account if customers sign up for direct deposit into a checking account. To determine whether this proposal is a good idea, management would like to estimate how many of the 200 current customers who do not use direct deposit would accept the offer. The IT company that handles Sandhills Bank’s e-banking has provided anonymized data for 1,000 customers from one of its other client banks that made a similar promotion to increase direct deposit participation. For these 1,000 customers, each observation consists of the average monthly checking account balance and whether the customer signed up for direct deposit. In the file Sandhills, these data are split so that 600 observations are in the training set and 400 observations are in the validation set. Sandhills has designated the data corresponding to its 200 current customers as the test set. As Sandhills has not yet launched its promotion to any of these 200 customers, it has entered an artificial value of zero (i.e., “No”) for whether they have signed up for direct deposit. As some of these 200 customers will be the target of the direct-deposit promotion, Sandhills would like to estimate the likelihood of these customers signing up for direct deposit based on their average monthly balance. Classify the data using k-nearest neighbors for values of k = 1, . . . , 10. Use Balance as the input variable and Direct as the output variable.

a. For the cutoff probability value of 0.5, what value of k minimizes the overall error rate on the validation data?

b. Using the cutoff value of 0.5, how many of Sandhills Bank’s 200 customers does k-nearest neighbors classify as enrolling in direct deposit?

9. Campaign organizers for both the Republican and Democratic parties are interested in identifying individual undecided voters who would consider voting for their party in an upcoming election. The file BlueOrRed contains data on a sample of voters with tracked variables, including whether or not they are undecided regarding their candidate preference, age, whether they own a home, gender, marital status, household size, income, years of education, and whether they attend church. Classify the data using k-nearest neighbors with k = 1, . . . , 10. Use Age, HomeOwner, Female, HouseholdSize, Income, Education, and Church as input variables and Undecided as the output variable. Standardize the input variables to adjust for the different magnitudes of the variables.

a. For a cutoff probability value of 0.5, what value of k minimizes the overall error rate on the validation data?

b. Compare the overall error rates on the validation and test sets for the value of k from part (a). Explain the role of the test set and the implication of these particular results.

10. Refer to the scenario in Problem 9 using the file BlueOrRed. Use logistic regression to classify observations as undecided (or decided) using Age, HomeOwner, Female, Married, HouseholdSize, Income, Education, and Church as input variables and Undecided as the output variable.

a. Use Mallows’ Cp statistic to identify a couple of candidate models. Then evaluate these candidate models based on their classification error and decile-wise lift on the validation set. Recommend a final model and express the model as a mathematical equation relating the output variable to the input variables.

b. For the final model from part (a), increases in which variables increase the chance of a voter being undecided? Increases in which variables decrease the chance of a voter being undecided?

c. Using the cutoff value of 0.5 for your logistic regression model, what is the overall error rate on the test set for the final model from part (a)?

11. Refer to the scenario in Problem 9 using the file BlueOrRed. Fit a single classification tree using Age, HomeOwner, Female, Married, HouseholdSize, Income, Education, and Church as input variables and Undecided as the output variable.

a. For the cutoff value of 0.5, what are the overall error rate, Class 1 error rate, and Class 0 error rate of the best-pruned tree on the test set?

b. Consider a 50-year-old man who attends church, has 15 years of education, owns a home, is married, lives in a household of four people, and has an annual income of $150,000. Does the best-pruned tree classify this observation as Undecided?

c. For the best-pruned tree, what is the lift on the top 30% of the test set deemed most likely to be Undecided?

12. Refer to the scenario in Problem 9 using the file BlueOrRed. Apply a random forest ensemble of 10 classification trees using Age, HomeOwner, Female, Married, HouseholdSize, Income, Education, and Church as input variables and Undecided as the output variable.

a. What is the most important variable in terms of reducing the classification error of the ensemble?

b. For the cutoff value of 0.5, compare the overall error rate, Class 1 error rate, and Class 0 error rate of the random forest on the test set to the corresponding measures of the single best-pruned tree from Problem 11.

13. Telecommunications companies providing cell-phone service are interested in customer retention. In particular, identifying customers who are about to churn (cancel their service) is potentially worth millions of dollars if the company can proactively address the reason that customer is considering cancellation and retain the customer. The file Cellphone contains customer data to be used to classify a customer as a churner or not. Classify the data using k-nearest neighbors with k = 1, . . . , 10. Use Churn as the output variable and all the other variables as input variables. Standardize the input variables to adjust for the different magnitudes of the variables.

a. What is the proportion of churners in the training set? What is the proportion of churners in the validation and test sets? Explain why this discrepancy is appropriate.

b. For the cutoff probability value of 0.5, what value of k minimizes the overall error rate on the validation data?

c. What are the overall error rate, the Class 1 error rate, and the Class 0 error rate on the test set?

d. Compute and interpret the sensitivity and specificity for the test set.

e. How many false positives and false negatives did the model commit on the test set? What percentage of predicted churners were false positives? What percentage of predicted nonchurners were false negatives?

14. Refer to the scenario in Problem 13 using the file Cellphone. Fit a single classification tree using Churn as the output variable and all the other variables as input variables.

a. For the cutoff value of 0.5, what are the overall error rate, the Class 1 error rate, and the Class 0 error rate of the best-pruned tree on the test set?

b. List and interpret the set of rules that characterize churners in the best-pruned tree.

c. Examine the decile-wise lift chart for the best-pruned tree on the test set. What is the first decile lift? Interpret this value.

15. Refer to the scenario in Problem 13 using the file Cellphone. Apply a random forest ensemble of 10 classification trees using Churn as the output variable and all the other variables as input variables.

a. What is the most important variable in terms of reducing the classification error of the ensemble?

16. Refer to the scenario in Problem 13 using the file Cellphone. Apply logistic regression using Churn as the output variable and all the other variables as input variables.

a. Evaluate several candidate models based on their classification error and decile-wise lift on the validation set. Recommend a final model and express the model as a mathematical equation relating the output variable to the input variables. Do the relationships suggested by the model make sense? Try to explain them.

b. Using the cutoff value of 0.5 for your logistic regression model, what is the overall error rate on the test set?

17. A consumer advocacy agency, Equitable Ernest, is interested in providing a service that allows an individual to estimate his or her own credit score (a continuous measure used by banks, insurance companies, and other businesses when granting loans, quoting premiums, and issuing credit). The file CreditScore contains data on an individual’s credit score and other variables. Predict the individual’s credit scores using a single regression tree. Use CreditScore as the output variable and all the other variables as input variables. Set the minimum number of records in a terminal node to be 244.

a. What is the RMSE of the best-pruned tree on the validation data and on the test set? Discuss the implication of these calculations.

b. Consider an individual who has 5 credit bureau inquiries, has used 10% of her available credit, has $14,500 of total available credit, has no collection reports or missed payments, is a homeowner, has an average credit age of 6.5 years, and has worked continuously for the past 5 years. What is the best-pruned tree’s predicted credit score for this individual?


c. Repeat the construction of a single regression tree, but now set the minimum number of records in a terminal node to be 1. How does the RMSE of the best-pruned tree on the test set compare to the analogous measure from part (a)? In terms of number of decision nodes, how does the size of the best-pruned tree compare to the size of the best-pruned tree from part (a)?

18. Refer to the scenario in Problem 17 using the file CreditScore. Apply a random forest ensemble of 10 regression trees using CreditScore as the output variable and all the other variables as input variables. Set the minimum number of records in a terminal node to be 244. Compare the RMSE of the random forest on the test set to the RMSE of the single best-pruned tree from part (a) of Problem 17.

19. Refer to the scenario in Problem 17 using the file CreditScore. Predict the individuals’ credit scores using k-nearest neighbors with k = 1, . . . , 10. Use CreditScore as the output variable and all the other variables as input variables. Standardize the input variables to adjust for the different magnitudes of the variables.

a. What value of k minimizes the RMSE on the validation data?

b. How does the RMSE on the test set compare to the RMSE on the validation set?

20. Each year, the American Academy of Motion Picture Arts and Sciences recognizes excellence in the film industry by honoring directors, actors, and writers with awards (called “Oscars”) in different categories. The most notable of these awards is the Oscar for Best Picture. The Data worksheet in the file Oscars contains data on a sample of movies nominated for the Best Picture Oscar. The variables include total number of Oscar nominations across all award categories, number of Golden Globe awards won (the Golden Globe award show precedes the Academy Awards), whether or not the movie is a comedy, and whether or not the movie won the Best Picture Oscar award. Apply logistic regression to classify winners of the Best Picture Oscar. Use Winner as the output variable and OscarNominations, GoldenGlobeWins, and Comedy as input variables.

a. Evaluate several candidate models based on their classification error on the validation set. Recommend a final model and express the model as a mathematical equation relating the output variable to the input variables. Do the relationships suggested by the model make sense? Try to explain them.

b. Using the cutoff value of 0.5, what is the sensitivity of the logistic regression model on the validation set? Why is this a good metric to use for this problem?

c. Note that each year there is only one winner of the Best Picture Oscar. Knowing this, what is wrong with classifying a movie based on a cutoff value? (Hint: Investigate the predicted results on an annual basis.)

d. What is the best way to use the model to predict the annual winner? For the validation set, how often is the actual winner deemed “most likely” to win out of each year’s nominees?

21. As an intern with the local home builder’s association, you have been asked to analyze the state of the local housing market, which has suffered during a recent economic crisis. You have been provided two data sets in the file HousingBubble. The Pre-Crisis worksheet contains information on 1,978 single-family homes sold during the one-year period before the burst of the “housing bubble.” These 1,978 observations have been split into a training set (1,186 observations) and a validation set (792 observations). The Post-Crisis worksheet contains information on 1,657 single-family homes sold during the one-year period after the burst of the housing bubble. These 1,657 observations have been split into a training set (994 observations) and a validation set (663 observations). The data in both the Pre-Crisis and Post-Crisis worksheets have been appended with the same set of 2,000 observations designated as a test set. This test set corresponds to homes currently for sale, and because they have not yet been sold, each of these 2,000 observations has an artificial value of zero for the sale price.

a. Consider the Pre-Crisis worksheet data. Construct a model to predict the sale price using k-nearest neighbors with k = 1, . . . , 10. Use Price as the output variable and all the other variables as input variables. Standardize the input variables to adjust for the different magnitudes of the variables.

i. What value of k minimizes the RMSE on the validation set, and what is the value of this RMSE?

ii. Use the k-nearest neighbors with the value of k that minimizes RMSE on the validation set to predict sale prices of houses in the test set.

b. Repeat part (a) with the Post-Crisis worksheet data.

c. For each of the 2,000 houses in the test set, compare the predictions from part (a-ii) based on the pre-crisis data to those from part (b-ii) based on the post-crisis data. Specifically, compute the percentage difference in predicted price between the pre-crisis and post-crisis models, where percentage difference = (post-crisis predicted price − pre-crisis predicted price) / pre-crisis predicted price. What is the average percentage change in predicted price between the pre-crisis and post-crisis models?

22. Refer to the scenario in Problem 21 using the file HousingBubble.

a. Consider the Pre-Crisis worksheet data. Predict the sale price using a single regression tree. Use Price as the output variable and all the other variables as input variables.

i. What is the RMSE of the best-pruned tree on the validation set?

ii. Use the best-pruned tree to predict sale prices of houses in the test set.

b. Repeat part (a) with the Post-Crisis worksheet data.

c. For each of the 2,000 houses in the NewDataToPredict worksheet, compare the predictions from part (a-ii) based on the pre-crisis data to those from part (b-ii) based on the post-crisis data. Specifically, compute the percentage difference in predicted price between the pre-crisis and post-crisis models, where percentage difference = (post-crisis predicted price − pre-crisis predicted price) / pre-crisis predicted price.

23. Refer to the scenario in Problem 21 using the file HousingBubble.

a. Consider the Pre-Crisis worksheet data. Apply a random forest ensemble of 10 regression trees using Price as the output variable and all the other variables as input variables.

i. What is the RMSE of the random forest on the validation set?

ii. Use the random forest to predict sale prices of houses in the test set.

b. Repeat part (a) with the Post-Crisis worksheet data.

c. For each of the 2,000 houses in the test set, compare the predictions from part (a-ii) based on the pre-crisis data to those from part (b-ii) based on the post-crisis data. Specifically, compute the percentage difference in predicted price between the pre-crisis and post-crisis models, where percentage difference = (post-crisis predicted price − pre-crisis predicted price) / pre-crisis predicted price. What is the average percentage change in predicted price between the pre-crisis and post-crisis models? What does this suggest about the impact of the bursting of the housing bubble?


Case Problem: Grey Code Corporation

Grey Code Corporation (GCC) is a media and marketing company involved in magazine and book publishing and in television broadcasting. GCC’s portfolio of home and family magazines has been a long-running strength, but it has expanded to become a provider of a spectrum of services (market research, communications planning, web site advertising, etc.) that can enhance its clients’ brands.

GCC’s relational database contains over a terabyte of data encompassing 75 million customers. GCC uses the data in its database to develop campaigns for new customer acquisition, customer reactivation, and identification of cross-selling opportunities for products. For example, GCC will generate separate versions of a monthly issue of a magazine that will differ only by the advertisements they contain. It will mail a subscribing customer the version with the print ads identified by its database as being of most interest to that customer.

One particular problem facing GCC is how to boost the customer response rate to renewal offers that it mails to its magazine subscribers. The industry response rate is about 2%, but GCC has historically performed better than that. However, GCC must update its model to correspond to recent changes. GCC’s director of database marketing, Chris Grey, wants to make sure that GCC maintains its place as one of the top achievers in targeted marketing. The file GCC contains 38 variables (columns) and 45,000 rows (distinct customers).

Play the role of Chris Grey and construct a classification model to identify customers who are likely to respond to a mailing. Write a report that documents the following steps:

1. Explore the data. This includes addressing any missing data as well as treatment of variables. Variables may need to be transformed. Also, because of the large number of variables, you must identify appropriate means to reduce the dimension of the data. In particular, it may be helpful to filter out unnecessary and redundant variables.

2. Appropriately partition the data set into training, validation, and test sets. Experiment with various classification methods and propose a final model for identifying customers who will respond to the targeted marketing.

3. Your report should include appropriate charts (ROC curves, lift charts, etc.) and include a recommendation on how to apply the results of your proposed model. For example, if GCC sends the targeted marketing to the model’s top decile, what is the expected response rate? How does that compare to the industry’s average response rate?


Chapter 10

Spreadsheet Models

Contents

Analytics in Action: Procter & Gamble

10.1 Building Good Spreadsheet Models
Influence Diagrams
Building a Mathematical Model
Spreadsheet Design and Implementing the Model in a Spreadsheet

10.2 What-If Analysis
Data Tables
Goal Seek
Scenario Manager

10.3 Some Useful Excel Functions for Modeling
SUM and SUMPRODUCT
IF and COUNTIF
VLOOKUP

10.4 Auditing Spreadsheet Models
Trace Precedents and Dependents
Show Formulas
Evaluate Formulas
Error Checking
Watch Window

10.5 Predictive and Prescriptive Spreadsheet Models


Numerous specialized software packages are available for descriptive, predictive, and prescriptive business analytics. Because these software packages are specialized, they usually provide the user with numerous options and the capability to perform detailed analyses. However, they tend to be considerably more expensive than a spreadsheet package such as Excel. Also, specialized packages often require substantial user training. Because spreadsheets are less expensive, often come preloaded on computers, and are fairly easy to use, they are without question the most-used business analytics tool. Every day, millions of people around the world use spreadsheet decision models to perform risk analysis, inventory tracking and control, investment planning, breakeven analysis, and many other essential business planning and decision tasks. A well-designed, well-documented, and accurate spreadsheet model can be a very valuable tool in decision making.

Spreadsheet models are mathematical and logic-based models. Their strength is that they provide easy-to-use, sophisticated mathematical and logical functions, allowing for easy instantaneous recalculation for a change in model inputs. This is why spreadsheet models are often referred to as what-if models. What-if models allow you to answer questions such as, “If the per unit cost is $4, what is the impact on profit?” Changing data in a given cell has an impact not only on that cell but also on any other cells containing a formula or function that uses that cell.

In this chapter we discuss principles for building reliable spreadsheet models. We begin with a discussion of how to build a conceptual model of a decision problem, how to convert the conceptual model to a mathematical model, and how to implement the model in a spreadsheet. We introduce three analysis tools available in Excel: Data Tables, Goal Seek, and Scenario Manager. We discuss some Excel functions that are useful for building spreadsheet models for decision making. Finally, we present how to audit a spreadsheet model to ensure its reliability.

If you have never used a spreadsheet or have not done so recently, we suggest you first familiarize yourself with the material in Appendix A. It provides basic information that is fundamental to using Excel.

Analytics in Action

Procter & Gamble*

Procter & Gamble (P&G) is a Fortune 500 consumer goods company headquartered in Cincinnati, Ohio. P&G produces well-known brands such as Tide detergent, Gillette razors, Swiffer cleaning products, and many other consumer goods. P&G is a global company and has been recognized for its excellence in business analytics, including supply chain analytics and market research.

With operations around the world, P&G must do its best to maintain inventory at levels that meet its high customer service requirements. A lack of on-hand inventory can result in a stockout of a product and an inability to meet customer demand. This not only results in lost revenue for an immediate sale but can also cause customers to switch permanently to a competing brand. On the other hand, excessive inventory forces P&G to invest cash in inventory when that money could be invested in other opportunities, such as research and development.

To ensure that the inventory of its products around the world is set at appropriate levels, P&G analytics personnel developed and deployed a series of spreadsheet inventory models. These spreadsheets implement mathematical inventory models to tell business units when and how much to order to keep inventory levels where they need to be in order to maintain service and keep investment as low as possible.

The spreadsheet models were carefully designed to be easily understood by the users and easy to use and interpret. Their users can also customize the spreadsheets to their individual situations.

Over 70% of the P&G business units use these models, with a conservative estimate of a 10% reduction in inventory around the world. This equates to a cash savings of nearly $350 million.

*I. Farasyn, K. Perkoz, and W. Van de Velde, “Spreadsheet Model for Inventory Target Setting at Procter & Gamble,” Interfaces 38, no. 4 (July–August 2008): 241–250.


10.1 Building Good Spreadsheet Models

Let us begin our discussion of spreadsheet models by considering the cost of producing a single product. The total cost of manufacturing a product can usually be defined as the sum of two costs: fixed cost and variable cost. Fixed cost is the portion of the total cost that does not depend on the production quantity; this cost remains the same no matter how much is produced. Variable cost, on the other hand, is the portion of the total cost that is dependent on and varies with the production quantity. To illustrate how cost models can be developed, we will consider a manufacturing problem faced by Nowlin Plastics.

Nowlin Plastics produces a line of cell phone covers. Nowlin’s best-selling cover is its Viper model, a slim but very durable black and gray plastic cover. The annual fixed cost for the Viper cover is $234,000. This fixed cost includes management time and other costs that are incurred regardless of the number of units eventually produced. In addition, the total variable cost, including labor and material costs, is $2 for each unit produced.

Nowlin is considering outsourcing the production of some products for next year, including the Viper. Nowlin has a bid from an outside firm to produce the Viper for $3.50 per unit. Although it is more expensive per unit to outsource the Viper ($3.50 versus $2.00), the fixed cost can be avoided if Nowlin purchases rather than manufactures the product. Next year’s exact demand for Viper is not yet known. Nowlin would like to compare the costs of manufacturing the Viper in-house to those of outsourcing its production to another firm, and management would like to do that for various production quantities. Many manufacturers face this type of decision, which is known as a make-versus-buy decision.

Influence Diagrams

It is often useful to begin the modeling process with a conceptual model that shows the relationships between the various parts of the problem being modeled. The conceptual model helps in organizing the data requirements and provides a road map for eventually constructing a mathematical model. A conceptual model also provides a clear way to communicate the model to others. An influence diagram is a visual representation of which entities influence others in a model. Parts of the model are represented by circular or oval symbols called nodes, and arrows connecting the nodes show influence.

Figure 10.1 shows an influence diagram for Nowlin’s total cost of production for the Viper. Total manufacturing cost depends on fixed cost and variable cost, which in turn depends on the variable cost per unit and the quantity required.

An expanded influence diagram that includes an outsourcing option is shown in Figure 10.2. Note that the influence diagram in Figure 10.1 is a subset of the influence diagram in Figure 10.2. Our method here, namely, to build an influence diagram for a portion of the problem and then expand it until the total problem is conceptually modeled, is usually a good way to proceed. This modular approach simplifies the process and reduces the likelihood of error. This is true not just for influence diagrams but for the construction of the mathematical and spreadsheet models as well. Next we turn our attention to using the influence diagram in Figure 10.2 to guide us in the construction of the mathematical model.
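As an illustration outside the spreadsheet setting, the influence diagram in Figure 10.2 can be recorded as a small dependency map. The following Python sketch is not part of the Nowlin model; the names `influences` and `input_nodes` are hypothetical, and the arrows are our reading of the diagram (in particular, we assume that total outsource cost is influenced by purchase cost per unit and quantity required). It lists the nodes that have no inward-pointing arrows.

```python
# Influence diagram of Figure 10.2 as a mapping:
# node -> list of nodes whose arrows point into it (assumed arrows, see lead-in).
influences = {
    "Difference in Cost of Manufacturing and Outsourcing": [
        "Total Manufacturing Cost",
        "Total Outsource Cost",
    ],
    "Total Manufacturing Cost": ["Fixed Cost", "Variable Cost"],
    "Variable Cost": ["Variable Cost per Unit", "Quantity Required"],
    "Total Outsource Cost": ["Purchase Cost per Unit", "Quantity Required"],
}


def input_nodes(diagram):
    """Return the nodes that influence others but have no inward-pointing arrows."""
    all_sources = {src for sources in diagram.values() for src in sources}
    # A node that appears as a key has at least one arrow pointing into it.
    return sorted(all_sources - set(diagram))


print(input_nodes(influences))
```

Under these assumptions, the nodes without inward-pointing arrows are Fixed Cost, Purchase Cost per Unit, Quantity Required, and Variable Cost per Unit.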

Building a Mathematical Model

The task now is to use the influence diagram to build a mathematical model. Let us first consider the cost of manufacturing the required units of the Viper. As the influence diagram shows, this cost is a function of the fixed cost, the variable cost per unit, and the quantity required. In general, it is best to define notation for every node in the influence diagram. Let us define the following:

q = quantity (number of units) required
FC = the fixed cost of manufacturing
VC = the per-unit variable cost of manufacturing
TMC(q) = total cost to manufacture q units

Figure 10.1 An Influence Diagram for Nowlin’s Manufacturing Cost
[Nodes: Fixed Cost; Variable Cost per Unit; Quantity Required; Variable Cost; Total Manufacturing Cost]

Figure 10.2 An Influence Diagram for Comparing Manufacturing Versus Outsourcing Cost for Nowlin Plastics
[Nodes: Fixed Cost; Variable Cost per Unit; Quantity Required; Purchase Cost per Unit; Variable Cost; Total Manufacturing Cost; Total Outsource Cost; Difference in Cost of Manufacturing and Outsourcing]

The cost-volume model for producing q units of the Viper can then be written as follows:

\[ TMC(q) = FC + (VC \times q) \tag{10.1} \]

For the Viper, FC = $234,000 and VC = $2, so that equation (10.1) becomes

\[ TMC(q) = \$234{,}000 + \$2q \]

Once a quantity required (q) is established, equation (10.1), now populated with the data for the Viper, can be used to compute the total manufacturing cost. For example, the decision to produce q = 10,000 units would result in a total cost of TMC(10,000) = $234,000 + $2(10,000) = $254,000.

Similarly, a mathematical model for purchasing q units is shown in equation (10.2).

Let P = the per-unit purchase cost and TPC(q) = the total cost to outsource or purchase q units:

\[ TPC(q) = Pq \tag{10.2} \]

For the Viper, since P = $3.50, equation (10.2) becomes

\[ TPC(q) = \$3.5q \]

Thus, the total cost to outsource 10,000 units of the Viper is TPC(10,000) = $3.5(10,000) = $35,000.

We can now state mathematically the savings associated with outsourcing. Let S(q) = the savings due to outsourcing, that is, the difference between the total cost of manufacturing q units and the total cost of buying q units:

\[ S(q) = TMC(q) - TPC(q) \tag{10.3} \]

In summary, Nowlin’s decision problem is whether to manufacture or outsource the demand for its Viper product next year. Because management does not yet know the required demand, the key question is, “For what quantities is it more cost-effective to outsource rather than produce the Viper?” Mathematically, this question is, “For what values of q is S(q) > 0?” Next we discuss a spreadsheet implementation of our conceptual and mathematical models that will help us answer this question.
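Equations (10.1) to (10.3) can also be checked outside a spreadsheet. The short Python sketch below is an illustration only; the function names `tmc`, `tpc`, and `savings` are our own, and the break-even quantity FC/(P − VC) is obtained by setting S(q) = 0, a step that the discussion above does not carry out.

```python
FC = 234_000.0  # fixed cost of manufacturing ($)
VC = 2.0        # per-unit variable cost of manufacturing ($)
P = 3.5         # per-unit purchase (outsourcing) cost ($)


def tmc(q):
    """Total manufacturing cost, equation (10.1)."""
    return FC + VC * q


def tpc(q):
    """Total purchase (outsourcing) cost, equation (10.2)."""
    return P * q


def savings(q):
    """Savings due to outsourcing, equation (10.3)."""
    return tmc(q) - tpc(q)


# Added step (not in the text): S(q) = 0 gives FC + VC*q = P*q, so q = FC / (P - VC).
breakeven_q = FC / (P - VC)

print(tmc(10_000), tpc(10_000), savings(10_000))  # 254000.0 35000.0 219000.0
print(breakeven_q)                                # 156000.0
```

With the Viper data, S(q) = $234,000 − $1.50q, so the savings from outsourcing are positive only when q is below 156,000 units.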

Spreadsheet Design and Implementing the Model in a Spreadsheet

There are several guiding principles for how to build a spreadsheet so that it is easily used by others and the risk of error is mitigated. In this section, we discuss some of those principles and illustrate the design and construction of a spreadsheet model using the Nowlin Plastics make-versus-buy decision.

In the construction of a spreadsheet model, it is helpful to categorize its components. For the Nowlin Plastics problem, we have defined the following components (corresponding to the nodes of the influence diagram in Figure 10.2):

q = number of units required
FC = the fixed cost of manufacturing
VC = the per-unit variable cost of manufacturing
TMC(q) = total cost to manufacture q units
P = the per-unit purchase cost
TPC(q) = the total cost to purchase q units
S(q) = the savings from outsourcing q units

Several points are in order. Some of these components are a function of other components (TMC, TPC, and S), and some are not (q, FC, VC, and P). TMC, TPC, and S will be formulas involving other cells in the spreadsheet model, whereas q, FC, VC, and P will just be entries in the spreadsheet. Furthermore, the value we can control or choose is q. In our analysis, we seek the value of q such that S(q) > 0; that is, the savings associated with outsourcing is positive. The number of Vipers to make or buy for next year is Nowlin’s decision. So we will treat q somewhat differently than FC, VC, and P in the spreadsheet model, and we refer to the quantity q as a decision variable. FC, VC, and P are measurable factors that define characteristics of the process we are modeling and so are uncontrollable inputs to the model, which we refer to as parameters of the model.

Note that q, FC, VC, and P are each the beginning of a path in the influence diagram in Figure 10.2. In other words, they have no inward-pointing arrows.

Figure 10.3 shows a spreadsheet model for the Nowlin Plastics make-versus-buy decision.

Column A is reserved for labels, including cell A1, where we have named the model “Nowlin Plastics.” The input parameters (FC, VC, and P) are placed in cells B4, B5, and B7, respectively. We offset P from FC and VC because it is for outsourcing. We have created a parameters section in the upper part of the sheet. Below the parameters section, we have created the Model section. The first entry in the Model section is the quantity q (the number of units of Viper produced or purchased) in cell B11, which we have shaded to signify that this is a decision variable. We have placed the formulas corresponding to equations (10.1) to (10.3) in cells B13, B15, and B17. Cell B13 corresponds to equation (10.1), cell B15 to (10.2), and cell B17 to (10.3).

Figure 10.3 Nowlin Plastics Make-Versus-Buy Spreadsheet Model (file: Nowlin)

Cell   Label (column A)                        Formula view    Value view
A1     Nowlin Plastics
       Parameters
B4     Manufacturing Fixed Cost                234000          $234,000.00
B5     Manufacturing Variable Cost per Unit    2               $2.00
B7     Outsourcing Cost per Unit               3.5             $3.50
       Model
B11    Quantity                                10000           10,000
B13    Total Cost to Produce                   =B4+B11*B5      $254,000.00
B15    Total Cost to Outsource                 =B7*B11         $35,000.00
B17    Savings due to Outsourcing              =B13-B15        $219,000.00

In cell B11 of Figure 10.3, we have set the value of q to 10,000 units. The model shows that the cost to manufacture 10,000 units is $254,000, the cost to purchase the 10,000 units is $35,000, and the savings from outsourcing is $219,000. At a quantity of 10,000 units, we see that it is better to incur the higher variable cost ($3.50 versus $2) than to manufacture and have to incur the additional fixed cost of $234,000. It will take a value of q larger than 10,000 units to make up the fixed cost incurred when Nowlin manufactures the product. At this point, we could increase the value of q by placing a value higher than 10,000 in cell B11 and see how much the savings in cell B17 decreases, doing this until the savings are close to zero. This is called a trial-and-error approach. Fortunately, Excel has what-if analysis tools that will help us use our model to further analyze the problem. We will discuss these what-if analysis tools in Section 10.2. Before doing so, let us first review what we have learned in constructing the Nowlin spreadsheet model.
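The trial-and-error approach described above can be mimicked with a simple loop. This Python sketch is only an illustration of the idea, not a substitute for the Excel tools of Section 10.2; the step size of 10,000 units is an arbitrary assumption, and the names `savings`, `step`, and `trials` are hypothetical.

```python
def savings(q, fc=234_000.0, vc=2.0, p=3.5):
    """Savings due to outsourcing q units: (fc + vc*q) - p*q, as in cell B17."""
    return fc + vc * q - p * q


q = 10_000      # starting quantity, as in cell B11
step = 10_000   # assumed increment for each new trial
trials = []     # (quantity, savings) pairs for which outsourcing still saves money

# Keep raising q, as one would by retyping cell B11, until the savings are no longer positive.
while savings(q) > 0:
    trials.append((q, savings(q)))
    q += step

print(trials[-1])       # last trial quantity with positive savings
print(q, savings(q))    # first trial quantity at which savings are zero or negative
```

With this step size, the savings are still positive at 150,000 units ($9,000) and turn negative at 160,000 units (−$6,000), so the quantity at which the savings reach zero lies between these two trial values. A smaller step would narrow the interval at the cost of more trials.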

The general principles of spreadsheet model design and construction are as follows:

• Separate the parameters from the model.
• Document the model, and use proper formatting and color as needed.
• Use simple formulas.

Let us discuss the general merits of each of these points.

Separate the Parameters from the Model

Separating the parameters from the model enables the user to update the model parameters without the risk of mistakenly creating an error in a formula. For this reason, it is good practice to have a parameters section at the top of the spreadsheet. A separate model section should contain all calculations. For a what-if model or an optimization model, some cells in the model section might also correspond to controllable inputs or decision variables (values that are not parameters or calculations but are the values we choose). The Nowlin model in Figure 10.3 is an example of this. The parameters section is in the upper part of the spreadsheet, followed by the model section, which contains the calculations and a decision cell (B11 for q in our model). Cell B11 is shaded to signify that it is a decision cell.

Document the Model and Use Proper Formatting and Color as Needed

A good spreadsheet model is well documented. Clear labels and proper formatting and alignment facilitate navigation and understanding. For example, if the values in a worksheet are costs, currency formatting should be used. Also, no cell with content should be unlabeled. A new user should be able to easily understand the model and its calculations. If color makes a model easier to understand and navigate, use it for cells and labels.

Use Simple Formulas

Clear, simple formulas can reduce errors and make it easier to maintain the spreadsheet. Long and complex calculations should be divided into several cells. This makes the formula easier to understand and easier to edit. Avoid using numbers in a formula (separate the data from the model). Instead, put the number in a cell in the parameters section of your worksheet and refer to the cell location in the formula. Building the formula in this manner avoids having to edit the formula for a simple data change. For example, equation (10.3), the savings due to outsourcing, can be calculated as follows:

\[ S(q) = TMC(q) - TPC(q) = (FC + VC\,q) - P\,q = FC + (VC - P)\,q \]

Since \( VC - P = 2 - 3.50 = -1.50 \), we could have just entered the following formula in a single cell: =234000-1.50*B11. This is a very bad idea because if any of the input data change, the formula must be edited. Furthermore, the user would not know the values of VC and P, only that, for the current values, the difference is 1.50. The approach in Figure 10.3 is more transparent, is simpler, lends itself better to analysis of changes in the parameters, and is less likely to contain errors.

As described in Appendix A, Excel formulas always begin with an equal sign.

NOTES + COMMENTS

1. Some users of influence diagrams recommend using different symbols for the various types of model entities. For example, circles might denote known inputs, ovals might denote uncertain inputs, rectangles might denote decisions or controllable inputs, triangles might denote calculations, and so forth.
2. The use of color in a spreadsheet model is an effective way to draw attention to a cell or set of cells. For example, we shaded cell B11 in Figure 10.3 to draw attention to the fact that q is a controllable input. However, avoid using too much color. Overdoing it may overwhelm users and actually have a negative impact on their ability to understand the model.
3. Holding down the Ctrl key and pressing the ~ key (usually located above the Tab key) in Excel will toggle between displaying the formulas in a spreadsheet and the values.

10.2 What-If Analysis

Excel offers a number of tools to facilitate what-if analysis. In this section we introduce three such tools: Data Tables, Goal Seek, and Scenario Manager. All of these tools are designed to rid the user of the tedious manual trial-and-error approach to analysis. Let us see how each of these tools can be used to aid with what-if analysis.

Data Tables

An Excel Data Table quantifies the impact of changing the value of a specific input on an output of interest. Excel can generate either a one-way data table, which summarizes a single input’s impact on the output, or a two-way data table, which summarizes two inputs’ impact on the output.

Let us consider how savings due to outsourcing changes as the quantity of Vipers changes. This should help us answer the question, “For which values of q is outsourcing more cost-effective?” A one-way data table changing the value of quantity and reporting savings due to outsourcing would be very useful. We will use the previously developed Nowlin spreadsheet for this analysis.

The first step in creating a one-way data table is to construct a sorted list of the values you would like to consider for the input. Let us investigate the quantity q over a range from 0 to 300,000 in increments of 25,000 units. Figure 10.4 shows the data entered in cells D5 through D17, with a column label in D4. This column of data is the set of values that Excel will use as inputs for q. Since the output of interest is savings due to outsourcing (located in cell B17), we have entered the formula =B17 in cell E4. In general, set the cell to the right of the label to the cell location of the output variable of interest. Once the basic structure is in place, we invoke the Data Table tool using the following steps:

Step 1. Select cells D4:E17
Step 2. Click the Data tab in the Ribbon
Step 3. Click What-If Analysis in the Forecast group, and select Data Table
Step 4. When the Data Table dialog box appears, enter B11 in the Column input cell: box
        Click OK

As shown in Figure 10.5, the table will be populated with the value of savings due to outsourcing for each value of quantity of Vipers in the table. For example, when \( q = 25{,}000 \) we see that \( S(25{,}000) = \$196{,}500 \), and when \( q = 250{,}000 \), \( S(250{,}000) = -\$141{,}000 \). A negative value for savings due to outsourcing means that manufacturing is cheaper than outsourcing for that quantity.
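A one-way data table simply re-evaluates the model once for each value in the input column. The Python sketch below illustrates that idea under the assumption that savings is S(q) = (FC + VC q) - P q with the Nowlin parameters; the names `savings` and `one_way_table` are ours and do not correspond to any Excel feature.

```python
def savings(q, fixed_cost=234_000.0, variable_cost=2.0, outsource_cost=3.5):
    """Savings due to outsourcing q units (cell B17 in the worksheet)."""
    return (fixed_cost + variable_cost * q) - outsource_cost * q


# Input column (cells D5:D17): 0 to 300,000 in steps of 25,000.
quantities = list(range(0, 300_001, 25_000))

# Output column (cells E5:E17): the model re-evaluated at each quantity.
one_way_table = {q: savings(q) for q in quantities}

for q, s in one_way_table.items():
    print(f"{q:>9,}  {s:>12,.0f}")
```

The printed rows can be compared with the values reported in Figure 10.5.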

In versions of Excel prior to Excel 2016, the What-If Analysis tool can be found in the Data Tools group.

Entering B11 in the Column input cell: box indicates that the column of data corresponds to different values of the input located in cell B11.

FIGURE 10.4 The Input for Constructing a One-Way Data Table for Nowlin Plastics

[Worksheet: the Nowlin Plastics model of Figure 10.3 in columns A and B (quantity 10,000 in cell B11; savings due to outsourcing $219,000.00 in cell B17). The column label Quantity is in cell D4, the quantities 0 through 300,000 in increments of 25,000 are in cells D5:D17, and cell E4 contains =B17, which displays $219,000.00.]

FIGURE 10.5 Results of One-Way Data Table for Nowlin Plastics

[Worksheet: the Nowlin Plastics model of Figure 10.3 in columns A and B, with the completed data table in cells D4:E17.]

Row   D (Quantity)   E (Savings due to Outsourcing)
4     Quantity       $219,000.00
5     0              $234,000
6     25,000         $196,500
7     50,000         $159,000
8     75,000         $121,500
9     100,000        $84,000
10    125,000        $46,500
11    150,000        $9,000
12    175,000        –$28,500
13    200,000        –$66,000
14    225,000        –$103,500
15    250,000        –$141,000
16    275,000        –$178,500
17    300,000        –$216,000

We have learned something very valuable from this table. Not only have we quantified the savings due to outsourcing for a number of quantities, we know too that for quantities of 150,000 units or less, outsourcing is cheaper than manufacturing and for quantities of 175,000 units or more, manufacturing is cheaper than outsourcing. Depending on Nowlin’s confidence in their demand forecast for the Viper product for next year, we have likely satisfactorily answered the make-versus-buy question. If, for example, management is highly confident that demand will be at least 200,000 units of Viper, then clearly they should manufacture the Viper rather than outsource. If management believes that Viper demand next year will be close to 150,000 units, they might still decide to manufacture rather than outsource. At 150,000 units, the savings due to outsourcing is only $9,000. That might not justify outsourcing if, for example, the quality assurance standards at the outsource firm are not at an acceptable level. We have provided management with valuable information that they may use to decide whether to make or buy. Next we illustrate how to construct a two-way data table.

Suppose that Nowlin has now received five different bids on the per-unit cost for outsourcing the production of the Viper. Clearly, the lowest bid provides the greatest savings. However, the selection of the outsource firm, if Nowlin decides to outsource, will depend on many factors, including reliability, quality, and on-time delivery. So it would be instructive to quantify the differences in savings for various quantities and bids. The five current bids are $2.89, $3.13, $3.50, $3.54, and $3.59. We may use the Excel Data Table to construct a two-way data table with quantity as a column and the five bids as a row, as shown in Figure 10.6.

In Figure 10.6, we have entered various quantities in cells D5 through D17, as in the one-way table. These correspond to cell B11 in our model. In cells E4 through I4, we have entered the bids. These correspond to B7, the outsourcing cost per unit. In cell D4, above the column input values and to the left of the row input values, we have entered the formula =B17, the location of the output of interest, in this case, savings due to outsourcing. Once the table inputs have been entered into the spreadsheet, we perform the following steps to construct the two-way data table.

Step 1. Select cells D4:I17
Step 2. Click the Data tab in the Ribbon
Step 3. Click What-If Analysis in the Forecast group, and select Data Table
Step 4. When the Data Table dialog box appears:
        Enter B7 in the Row input cell: box
        Enter B11 in the Column input cell: box
        Click OK

Figure 10.6 shows the selected cells and the Data Table dialog box. The results are shown in Figure 10.7.

We now have a table that shows the savings due to outsourcing for each combination of quantity and bid price. For example, for 75,000 Vipers at a cost of $3.13 per unit, the savings from buying versus manufacturing the units is $149,250. We can also see the range for the quantity for each bid price that results in a negative savings. For these quantities and bid combinations, it is better to manufacture than to outsource.
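A two-way data table evaluates the model over every combination of a column input and a row input. The following Python sketch mirrors that structure with a nested loop; it assumes the same savings formula as the Nowlin worksheet, and the names `savings` and `two_way_table` are our own.

```python
def savings(q, outsource_cost, fixed_cost=234_000.0, variable_cost=2.0):
    """Savings due to outsourcing q units at a given bid price per unit."""
    return (fixed_cost + variable_cost * q) - outsource_cost * q


quantities = list(range(0, 300_001, 25_000))  # column input (cell B11)
bids = [2.89, 3.13, 3.50, 3.54, 3.59]         # row input (cell B7)

# One entry per (quantity, bid) combination, like cells E5:I17.
two_way_table = {q: {p: savings(q, p) for p in bids} for q in quantities}

print("Quantity " + "".join(f"{p:>12.2f}" for p in bids))
for q in quantities:
    row = "".join(f"{two_way_table[q][p]:>12,.0f}" for p in bids)
    print(f"{q:>8,} " + row)
```

Because the bid prices are not exactly representable in binary floating point, comparisons against table values should allow a small tolerance.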

Using the Data Table allows us to quantify the savings due to outsourcing for the quantities and bid prices specified. However, the table does not tell us the exact number at which the transition occurs from outsourcing being cheaper to manufacturing being cheaper. For example, although it is clear from the table that for a bid price of $3.50 the savings due to outsourcing goes from positive to negative at some quantity between 150,000 units and 175,000 units, we know only that this transition occurs somewhere in that range. As we illustrate next, the what-if analysis tool Goal Seek can tell us the precise number at which this transition occurs.

FIGURE 10.6 The Input for Constructing a Two-Way Data Table for Nowlin Plastics

[Worksheet: the Nowlin Plastics model of Figure 10.3 in columns A and B. Cell D4 contains =B17, which displays $219,000.00; the quantities 0 through 300,000 in increments of 25,000 are in cells D5:D17; and the bids $2.89, $3.13, $3.50, $3.54, and $3.59 are in cells E4:I4. Cells D4:I17 are selected, and the Data Table dialog box is shown.]

FIGURE 10.7 Results of a Two-Way Data Table for Nowlin Plastics

[Worksheet: the Nowlin Plastics model of Figure 10.3 in columns A and B, with the completed data table in cells D4:I17. Cell D4 displays $219,000.00.]

Quantity    $2.89       $3.13       $3.50       $3.54       $3.59
0           $234,000    $234,000    $234,000    $234,000    $234,000
25,000      $211,750    $205,750    $196,500    $195,500    $194,250
50,000      $189,500    $177,500    $159,000    $157,000    $154,500
75,000      $167,250    $149,250    $121,500    $118,500    $114,750
100,000     $145,000    $121,000    $84,000     $80,000     $75,000
125,000     $122,750    $92,750     $46,500     $41,500     $35,250
150,000     $100,500    $64,500     $9,000      $3,000      –$4,500
175,000     $78,250     $36,250     –$28,500    –$35,500    –$44,250
200,000     $56,000     $8,000      –$66,000    –$74,000    –$84,000
225,000     $33,750     –$20,250    –$103,500   –$112,500   –$123,750
250,000     $11,500     –$48,500    –$141,000   –$151,000   –$163,500
275,000     –$10,750    –$76,750    –$178,500   –$189,500   –$203,250
300,000     –$33,000    –$105,000   –$216,000   –$228,000   –$243,000

Goal Seek

Excel’s Goal Seek tool allows the user to determine the value of an input cell that will cause the value of a related output cell to equal some specified value (the goal). In the case of Nowlin Plastics, suppose we want to know the value of the quantity of Vipers at which it becomes more cost-effective to manufacture rather than outsource. For example, we see from the table in Figure 10.7 that for a bid price of $3.50 and some quantity between 150,000 units and 175,000 units, savings due to outsourcing goes from positive to negative. Somewhere in this range of quantity, the savings due to outsourcing is zero, and that is the point at which Nowlin would be indifferent between manufacturing and outsourcing. We may use Goal Seek to find the quantity of Vipers that satisfies the goal of zero savings due to outsourcing for a bid price of $3.50. The following steps describe how to use Goal Seek to find this point.

Step 1. Click the Data tab in the Ribbon
Step 2. Click What-If Analysis in the Forecast group, and select Goal Seek
Step 3. When the Goal Seek dialog box appears (Figure 10.8):
        Enter B17 in the Set cell: box
        Enter 0 in the To value: box
        Enter B11 in the By changing cell: box
        Click OK
Step 4. When the Goal Seek Status dialog box appears, click OK

The completed Goal Seek dialog box is shown in Figure 10.8. The results from Goal Seek are shown in Figure 10.9. The savings due to outsourcing in cell B17 is zero, and the quantity in cell B11 has been set by Goal Seek to 156,000. When the annual quantity required is 156,000, it costs $546,000 either to manufacture the product or to purchase it. We have already seen that lower values of the quantity required favor outsourcing. Beyond the value of 156,000 units it becomes cheaper to manufacture the product.
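The idea behind Goal Seek, searching for the input value that drives an output to a target, can be sketched in Python with a simple bisection search. This is an illustration only: bisection is our stand-in and is not necessarily the method Excel uses, and the function name `goal_seek` and its arguments are hypothetical. The bracket of 150,000 to 175,000 units comes from the data table, where savings changes sign.

```python
def savings(q, fixed_cost=234_000.0, variable_cost=2.0, outsource_cost=3.5):
    """Savings due to outsourcing q units (cell B17 in the worksheet)."""
    return (fixed_cost + variable_cost * q) - outsource_cost * q


def goal_seek(f, target, lo, hi, tol=1e-6, max_iter=200):
    """Find x in [lo, hi] with f(x) close to target, by bisection.

    Assumes f(lo) - target and f(hi) - target have opposite signs.
    """
    g_lo = f(lo) - target
    g_hi = f(hi) - target
    if g_lo == 0:
        return lo
    if g_hi == 0:
        return hi
    if g_lo * g_hi > 0:
        raise ValueError("target is not bracketed by lo and hi")
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        g_mid = f(mid) - target
        if abs(g_mid) < tol or (hi - lo) / 2 < tol:
            return mid
        if g_lo * g_mid < 0:
            hi = mid               # sign change lies in the lower half
        else:
            lo, g_lo = mid, g_mid  # sign change lies in the upper half
    return (lo + hi) / 2


breakeven = goal_seek(savings, target=0.0, lo=150_000, hi=175_000)
print(round(breakeven))  # 156000
```

Because savings is linear in q for this model, the same point also follows directly from setting S(q) = 0, which gives q = FC / (P - VC) = 234,000 / 1.50 = 156,000. A numerical search is useful when the output is not such a simple function of the input.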

Scenario Manager

As we have seen, data tables are useful for exploring the impact of changing one or two model inputs on a model output of interest. Scenario Manager is an Excel tool that quantifies the impact of changing multiple inputs (a setting of these multiple inputs is called a scenario) on one or more outputs of interest. That is, Scenario Manager extends the data table concept to cases when you are interested in changing more than two inputs and want to quantify the changes these inputs have on one or more outputs of interest.

FIGURE 10.8 Goal Seek Dialog Box for Nowlin Plastics

[Worksheet: the Nowlin Plastics model of Figure 10.3 with quantity 10,000 in cell B11, shown with the Goal Seek dialog box (Set cell: B17; To value: 0; By changing cell: B11).]

To illustrate the use of Scenario Manager, let us consider the case of the Middletown Amusement Park. John Miller, the manager at Middletown, has developed a simple spreadsheet model of the park’s daily profit. His model is shown in Figure 10.10.

On any given day, there are two types of customers in the park, those who own season passes and those who do not. Season-pass owners pay an annual membership fee during the offseason, but then pay nothing at the gate to enter the park. Those who are not season-pass holders pay $35 per person to enter the park for the day. John refers to these non-season-pass holders as “admissions.” On average, a season-pass holder spends $15 per person in the park on food, drinks, and novelties and an admission spends on average $45. The average daily cost of operations (including fixed costs) is $33,000 per day and the cost of goods is 50% of the price of the good. These data are reflected in John’s spreadsheet model, which calculates a daily profit. As shown in Figure 10.10, for the data just described, John’s model calculates the profit to be $81,500.

As you might expect, the profit generated on any given day is very dependent on the weather. As shown in Table 10.1, John has developed three weather-based scenarios: Partly Cloudy, Rain, and Sunny. The weather has a direct impact on four input parameters: the number of season-pass holders who enter the park, the number of non-season-pass holders (admissions) who enter the park, the amount each of these groups spends on average and the cost of operations. The Scenario Manager allows us to generate a report that gives an output variable or set of output variables of interest for each scenario. In this case, Scenario Manager will provide a report that gives the profit for each scenario.

The following steps describe how to use Scenario Manager to generate a scenario summary report.

Step 1. Click the Data tab on the Ribbon.
Step 2. Click What-If Analysis in the Forecast group and select Scenario Manager...

FIGURE 10.9 Results from Goal Seek for Nowlin Plastics

[Worksheet: the Nowlin Plastics model after Goal Seek. Quantity (B11): 156,000; Total Cost to Produce: $546,000.00; Total Cost to Outsource: $546,000.00; Savings due to Outsourcing (B17): $0.00.]

FIGURE 10.10 Middletown Amusement Park Daily Profit Model

Cell   Label                                        Formula or Input   Value
       Parameters
B5     Admission Price                              35                 $35
B6     Number of Season-Pass Holders Admitted       3000               3000
B7     Admissions                                   1600               1600
B8     Average Expenditure - Season Pass Holders    15                 $15
B9     Average Expenditure - Admissions             45                 $45
B11    Cost of Operations                           33000              $33,000
B12    Cost of Goods %                              0.5                50%
       Model
B16    Admissions Revenue                           =B5*B7             $56,000
B17    Season Pass Holder Expenditures Revenue      =B6*B8             $45,000
B18    Admissions Expenditures Revenue              =B7*B9             $72,000
B19    Total Revenue                                =B16+B17+B18       $173,000
B21    Cost of Operations                           =B11               $33,000
B22    Cost of Goods                                =B12*(B17+B18)     $58,500
B23    Total Cost                                   =B21+B22           $91,500
B25    Profit                                       =B19-B23           $81,500

Step 3. When the Scenario Manager dialog box appears (Figure 10.11), click the Add... button
Step 4. When the Add Scenario dialog box appears (Figure 10.12):
        Enter Partly Cloudy in the Scenario name: box
        Enter $B$6:$B$9,$B$11 in the Changing cells: box
        Click OK

TABLE 10.1 Weather Scenarios for Middletown Amusement Park

                                              Scenarios
                                              Partly Cloudy   Rain      Sunny
Season-Pass Holders                           3000            1200      8000
Admissions                                    1600            250       2400
Average Expenditure - Season-Pass Holders     $15             $10       $18
Average Expenditure - Admissions              $45             $20       $57
Cost of Operations                            $33,000         $27,000   $37,000

FIGURE 10.11 Scenario Manager Dialog Box

Step 5. When the Scenario Values dialog box appears (Figure 10.13):
        Enter 3000 in the $B$6 box
        Enter 1600 in the $B$7 box
        Enter 15 in the $B$8 box
        Enter 45 in the $B$9 box
        Enter 33000 in the $B$11 box
        Click OK
Step 6. When the Scenario Manager dialog box appears, repeat steps 3–5 for each scenario shown in Table 10.1 (Rain and Sunny).
Step 7. When all scenarios have been entered and the Scenario Manager dialog box appears, click Summary...
Step 8. When the Scenario Summary dialog box appears (Figure 10.14):
        Select Scenario summary
        Enter B25 in the Result Cells box
        Click OK

FIGURE 10.12 Add Scenario Dialog Box

FIGURE 10.13 Scenario Values Dialog Box

FIGURE 10.14 Scenario Summary Dialog Box

NOTES + COMMENTS

1. We emphasize the location of the reference to the desired output in a one-way versus a two-way data table. For a one-way table, the reference to the output cell location is placed in the cell above and to the right of the column of input data so that it is in the cell just to the right of the label of the column of input data. For a two-way table, the reference to the output cell location is placed above the column of input data and to the left of the row input data.
2. Notice that in Figures 10.5 and 10.7, the tables are formatted as currency. This must be done manually after the table is constructed using the options in the Number group under the Home tab in the Ribbon. It is also a good idea to label the rows and the columns of the table.
3. For very complex functions, Goal Seek might not converge to a stable solution. Trying several different initial values (the actual value in the cell referenced in the By changing cell: box) when invoking Goal Seek may help.
4. In Figure 10.14, we chose Scenario summary to generate the summary in Figure 10.15. Choosing Scenario PivotTable report will generate a pivot table with the relevant inputs and outputs.
5. Once all scenarios have been added to the Scenario Manager dialog box (Figure 10.11), there are several alternatives to choose. Scenarios can be edited via the Edit… button. The Show button allows you to look at the selected scenario settings by displaying the Scenario Values box for that scenario. The Delete button allows you to delete a scenario and the Merge… button allows you to merge scenarios from another worksheet with those of the current worksheet.

The Scenario Summary report appears on a separate worksheet as shown in Figure 10.15. The summary includes the values currently in the spreadsheet, along with the specified scenarios. We see that the profit ranges from a low of -$6,625 on a rainy day to a high of $187,400 on a sunny day.
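A scenario summary amounts to evaluating the profit model once per named set of inputs. The Python sketch below does this for the Middletown model; the function and dictionary names are our own. The inputs are those displayed in the Scenario Summary of Figure 10.15, which shows an average admissions expenditure of $45 under Rain. Table 10.1 lists $20 for that entry, and with $20 the same model would give a rainy-day profit of -$9,750 rather than the -$6,625 reported.

```python
ADMISSION_PRICE = 35.0    # cell B5
COST_OF_GOODS_PCT = 0.5   # cell B12


def daily_profit(season_pass_holders, admissions,
                 spend_season_pass, spend_admissions, cost_of_operations):
    """Daily profit (cell B25) for one setting of the changing cells."""
    admissions_revenue = ADMISSION_PRICE * admissions            # B16
    season_pass_spend = season_pass_holders * spend_season_pass  # B17
    admissions_spend = admissions * spend_admissions             # B18
    total_revenue = admissions_revenue + season_pass_spend + admissions_spend
    cost_of_goods = COST_OF_GOODS_PCT * (season_pass_spend + admissions_spend)
    total_cost = cost_of_operations + cost_of_goods
    return total_revenue - total_cost


# Changing cells B6, B7, B8, B9, B11 per scenario, as shown in Figure 10.15.
scenarios = {
    "Partly Cloudy": (3000, 1600, 15, 45, 33_000),
    "Rain":          (1200, 250, 10, 45, 27_000),
    "Sunny":         (8000, 2400, 18, 57, 37_000),
}

summary = {name: daily_profit(*inputs) for name, inputs in scenarios.items()}
for name, profit in summary.items():
    print(f"{name:<14}{profit:>12,.0f}")
```

Adding a scenario is then a matter of adding one more entry to `scenarios`.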

FIGURE 10.15 Scenario Summary for Middletown Amusement Park

Scenario Summary
                   Current Values   Partly Cloudy   Rain       Sunny
Changing Cells:
$B$6               3000             3000            1200       8000
$B$7               1600             1600            250        2400
$B$8               $15              $15             $10        $18
$B$9               $45              $45             $45        $57
$B$11              $33,000          $33,000         $27,000    $37,000
Result Cells:
$B$25              $81,500          $81,500         –$6,625    $187,400

Notes: Current Values column represents values of changing cells at time Scenario Summary Report was created. Changing cells for each scenario are highlighted in gray.

10.3 Some Useful Excel Functions for Modeling

In this section we use several examples to introduce additional Excel functions that are useful in modeling decision problems. Many of these functions will be used in the chapters on simulation, optimization, and decision analysis.

SUM and SUMPRODUCT

Two very useful functions are SUM and SUMPRODUCT. The SUM function adds up all of the numbers in a range of cells. The SUMPRODUCT function returns the sum of the products of elements in a set of arrays. As we shall see in Chapter 12, SUMPRODUCT is very useful for linear optimization models.

Let us illustrate the use of SUM and SUMPRODUCT by considering a transportation problem faced by Foster Generators. This problem involves the transportation of a product from three plants to four distribution centers. Foster Generators operates plants in Cleveland, Ohio; Bedford, Indiana; and York, Pennsylvania. Production capacities for the three plants over the next three-month planning period are known.

The firm distributes its generators through four regional distribution centers located in Boston, Massachusetts; Chicago, Illinois; St. Louis, Missouri; and Lexington, Kentucky. Foster has forecasted demand for the three-month period for each of the distribution centers. The per-unit shipping cost from each plant to each distribution center is also known. Management would like to determine how much of its products should be shipped from each plant to each distribution center.

A transportation analyst developed a what-if spreadsheet model to help Foster develop a plan for how to ship its generators from the plants to the distribution centers to minimize cost. Of course, capacity at the plants must not be exceeded, and forecasted demand must be satisfied at each of the four distribution centers. The what-if model is shown in Figure 10.16.

The parameters section is rows 2 through 10. Cells B5 through E7 contain the per-unit shipping cost from each origin (plant) to each destination (distribution center). For example, it costs $2.00 to ship one generator from Bedford to St. Louis. The plant capacities are given in cells F5 through F7, and the distribution center demands appear in cells B8 through E8.

The model is in rows 11 through 20. Trial values of shipment amounts from each plant to each distribution center appear in the shaded cells, B17 through E19. The total cost of shipping for this proposed plan is calculated in cell B13 using the SUMPRODUCT function. The general form of the SUMPRODUCT function is

=SUMPRODUCT(array1, array2)

The function pairs each element of the first array with its counterpart in the second array, multiplies the elements of the pairs together, and adds the results. In cell B13, =SUMPRODUCT(B5:E7,B17:E19) pairs the per-unit cost of shipping for each origin-destination pair with the proposed shipping amount for that pair and adds their products:

B5*B17 + C5*C17 + D5*D17 + E5*E17 + B6*B18 + · · · + E7*E19

In cells F17 through F19, the SUM function is used to add up the amounts shipped for each plant. The general form of the SUM function is

=SUM(range)

where range is a range of cells. For example, the function in cell F17 is =SUM(B17:E17), which adds the values in B17, C17, D17, and E17: 5000 + 0 + 0 + 0 = 5000. The SUM function in cells B20 through E20 does the same for the amounts shipped to each distribution center.

By comparing the amounts shipped from each plant to the capacity for that plant, we see that no plant violates its capacity. Likewise, by comparing the amounts shipped to each distribution center to the demand at that center, we see that all demands are met. The total shipping cost for the proposed plan is $54,500. Is this the lowest-cost plan? It is not clear. We will revisit the Foster Generators problem in Chapter 12, where we discuss linear optimization models.
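The SUMPRODUCT and SUM calculations can be sketched in Python as follows. The helper name `sumproduct` is our own. The cost matrix and shipping plan below are our reading of Figure 10.16 and should be treated as assumptions; they are consistent with the facts stated in the text (a $2.00 cost from Bedford to St. Louis, 5,000 units shipped from Cleveland to Boston, and a total cost of $54,500).

```python
# Per-unit shipping costs (cells B5:E7). Rows: Cleveland, Bedford, York.
# Columns: Boston, Chicago, St. Louis, Lexington.
# ASSUMPTION: values as read from Figure 10.16.
costs = [
    [3.0, 2.0, 7.0, 6.0],
    [6.0, 5.0, 2.0, 3.0],
    [2.0, 5.0, 4.0, 5.0],
]

# Proposed shipping plan (cells B17:E19), same layout. ASSUMPTION as above.
shipments = [
    [5000, 0, 0, 0],
    [1000, 4000, 1000, 0],
    [0, 0, 1000, 1500],
]


def sumproduct(a, b):
    """Sum of element-wise products of two equally sized 2-D arrays."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("arrays must have the same dimensions")
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


total_cost = sumproduct(costs, shipments)           # cell B13
shipped_from = [sum(row) for row in shipments]      # cells F17:F19
shipped_to = [sum(col) for col in zip(*shipments)]  # cells B20:E20

print(total_cost)    # 54500.0
print(shipped_from)  # [5000, 6000, 2500]
print(shipped_to)    # [6000, 4000, 2000, 1500]
```

The dimension check in `sumproduct` reflects the requirement that both arrays passed to SUMPRODUCT have the same number of rows and columns.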

The arrays used as arguments in the SUMPRODUCT function must be of the same dimension. For example, in the Foster Generator model, B5:E7 is an array of three rows and four columns. B17:E19 is an array of the same dimensions.

FIGURE 10.16 What-If Model for Foster Generators

[Worksheet. Parameters (rows 2 through 10): per-unit shipping costs in cells B5:E7 for the origins Cleveland, Bedford, and York and the destinations Boston, Chicago, St. Louis, and Lexington; supply in cells F5:F7 (5000, 6000, 2500); demand in cells B8:E8 (6000, 4000, 2000, 1500). Model (rows 11 through 20): total cost in cell B13, =SUMPRODUCT(B5:E7,B17:E19), which displays $54,500.00; shipment amounts in cells B17:E19; row totals in cells F17:F19, =SUM(B17:E17), =SUM(B18:E18), and =SUM(B19:E19); column totals in cells B20:E20, =SUM(B17:B19), =SUM(C17:C19), =SUM(D17:D19), and =SUM(E17:E19).]

IF and COUNTIF

Gambrell Manufacturing produces car stereos. Stereos are composed of a variety of components that the company must carry in inventory to keep production running smoothly. However, because inventory can be a costly investment, Gambrell generally likes to keep its components inventory to a minimum. To help monitor and control its inventory, Gambrell uses an inventory policy known as an order-up-to policy.

The order-up-to policy is as follows. Whenever the inventory on hand drops below a certain level, enough units are ordered to return the inventory to that predetermined level. If the current number of units in inventory, denoted by H, drops below M units, enough inventory is ordered to get the level back up to M units. M is called the order-up-to point. Stated mathematically, if Q is the amount we order, then

\[ Q = M - H \]

An inventory model for Gambrell Manufacturing appears in Figure 10.17. In the upper half of the worksheet, the component ID number, inventory on hand (H), order-up-to point (M), and cost per unit are given for each of four components. Also given in this sheet is the fixed cost per order. The fixed cost is interpreted as follows: Each time a component is ordered, it costs Gambrell $120 to process the order. The fixed cost of $120 is incurred whenever an order is placed, regardless of how many units are ordered.

The model portion of the worksheet calculates the order quantity for each component. For example, for component 570, \( M = 100 \) and \( H = 5 \), so \( Q = M - H = 100 - 5 = 95 \). For component 741, \( M = 70 \) and \( H = 70 \) and no units are ordered because the on-hand inventory of 70 units is equal to the order-up-to point of 70. The calculations are similar for the other two components.

Depending on the number of units ordered, Gambrell receives a discount on the cost per unit. If 50 or more units are ordered, there is a quantity discount of 10% on every unit purchased. For example, for component 570, the cost per unit is $4.50, and 95 units are ordered. Because 95 exceeds the 50-unit requirement, there is a 10% discount, and the cost per unit is reduced to \( \$4.50 - 0.1(\$4.50) = \$4.50 - \$0.45 = \$4.05 \). Not including the fixed cost, the cost of goods purchased is then \( \$4.05(95) = \$384.75 \).

The Excel functions used to perform these calculations are shown in Figure 10.17 (for clarity, we show formulas for only the first three columns). The IF function is used to calculate the purchase cost of goods for each component in row 17. The general form of the IF function is

=IF(condition, result if condition is true, result if condition is false)

For example, in cell B17 we have 5IF(B16 .5 $B$10, $B$11*B6, B6)*B16. This statement says that, if the order quantity (cell B16) is greater than or equal to minimum amount required for a discount (cell B10), then the cost per unit is B11*B6 (there is a 10% discount, so the cost is 90% of the original cost); otherwise, there is no discount, and the cost per unit is the amount given in cell B6. The cost per unit computed by the IF function is then multiplied by the order quantity (B16) to obtain the total purchase cost of component 570. The purchase cost of goods for the other components are computed in a like manner.

The total cost in cell B23 is the sum of the total fixed ordering costs (B21) and the total cost of goods (B22). Because we place three orders (one each for components 570, 578, and 755), the fixed cost of the orders is 3 × $120 = $360.

The COUNTIF function in cell B19 is used to count how many times we order. In particular, it counts the number of components having a positive order quantity. The general form of the COUNTIF function (which was discussed in Chapter 2 for creating frequency distributions) is

=COUNTIF(range, condition)

The range is the range to search for the condition. The condition is the test to be counted when satisfied. In the Gambrell model in Figure 10.17, cell B19 counts the number of cells that are greater than zero in the range of cells B16:E16 via the syntax =COUNTIF(B16:E16, ">0"). Note that quotes are required for the condition with the COUNTIF function. In the model, because only cells B16, C16, and E16 are greater than zero, the COUNTIF function in cell B19 returns 3.
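For readers who want to check the worksheet logic outside Excel, the following is a minimal Python sketch of the same calculations. It is not part of the Gambrell workbook; the variable and function names are our own, and the parameter values are those given for Figure 10.17.

```python
# Parameters from the Gambrell Manufacturing model (Figure 10.17).
components = {
    570: {"on_hand": 5, "order_up_to": 100, "unit_cost": 4.50},
    578: {"on_hand": 30, "order_up_to": 55, "unit_cost": 12.50},
    741: {"on_hand": 70, "order_up_to": 70, "unit_cost": 3.26},
    755: {"on_hand": 17, "order_up_to": 45, "unit_cost": 4.15},
}
FIXED_COST_PER_ORDER = 120.0   # cell B8
MIN_ORDER_FOR_DISCOUNT = 50    # cell B10
DISCOUNTED_TO = 0.90           # cell B11


def cost_of_goods(quantity, unit_cost):
    """Mirror of =IF(B16 >= $B$10, $B$11*B6, B6)*B16."""
    if quantity >= MIN_ORDER_FOR_DISCOUNT:
        price = DISCOUNTED_TO * unit_cost
    else:
        price = unit_cost
    return price * quantity


# Row 16: order quantity Q = M - H for each component.
order_qty = {cid: c["order_up_to"] - c["on_hand"] for cid, c in components.items()}

# Row 17: cost of goods for each component.
goods = {cid: cost_of_goods(order_qty[cid], c["unit_cost"]) for cid, c in components.items()}

# Cell B19: mirror of =COUNTIF(B16:E16, ">0").
num_orders = sum(1 for q in order_qty.values() if q > 0)

total_fixed = num_orders * FIXED_COST_PER_ORDER   # cell B21
total_goods = sum(goods.values())                 # cell B22
total_cost = total_fixed + total_goods            # cell B23

print(order_qty, num_orders, round(total_cost, 2))
```

Keeping the three parameters at the top as named constants plays the same role as the absolute references $B$10 and $B$11 in the worksheet: the rule is written once and applied to every component.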

FIGURE 10.17 Gambrell Manufacturing Component Ordering Model

Parameters
Component ID                          570       578       741       755
Inventory On-Hand                       5        30        70        17
Order-up-to Point                     100        55        70        45
Cost per Unit                       $4.50    $12.50     $3.26     $4.15
Fixed Cost per Order                 $120
Minimum Order Size for Discount        50
Discounted to                         90%

Model
Component ID                          570       578       741       755
Order Quantity                         95        25         0        28
Cost of Goods                     $384.75   $312.50     $0.00   $116.20
Total Number of Orders                  3
Total Fixed Costs                 $360.00
Total Cost of Goods               $813.45
Total Cost                      $1,173.45

Formulas in column B of the model (the corresponding cells in columns C through E follow the same pattern):
B15  =B3
B16  =B5-B4
B17  =IF(B16 >= $B$10, $B$11*B6, B6)*B16
B19  =COUNTIF(B16:E16, ">0")
B21  =B19*B8
B22  =SUM(B17:E17)
B23  =SUM(B21:B22)

Notice the use of absolute references to B10 and B11 in row 17. As discussed in Appendix A, this facilitates copying cell B17 to cells C17, D17, and E17.

Gambrell

As we have seen, IF and COUNTIF are powerful functions that allow us to make calculations based on a condition holding (or not). There are other such conditional functions available in Excel. In a problem at the end of this chapter, we ask you to investigate one such function, the SUMIF function. Another conditional function that is extremely useful in modeling is the VLOOKUP function, which is illustrated with an example in the next section.

VLOOKUP

The director of sales at Granite Insurance needs to award bonuses to her sales force based on performance. There are 15 salespeople, each with his or her own territory. Based on the size and population of the territory, each salesperson has a sales target for the year.

The measure of performance for awarding bonuses is the percentage achieved above the sales target. Based on this metric, a salesperson is placed into one of five bonus bands and awarded bonus points. After all salespeople are placed in a band and awarded points, each is awarded a percentage of the bonus pool, based on the percentage of the total points awarded. The sales director has created a spreadsheet model to calculate the bonuses to be awarded. The spreadsheet model is shown in Figure 10.18 (note that we have hidden rows 19–28).

As shown in cell E3 in Figure 10.18, the bonus pool is $250,000 for this year. The bonus bands are in cells A7:C11. In this table, column A gives the lower limit of the bonus band, column B the upper limit, and column C the bonus points awarded to anyone in that bonus band. For example, salespeople who achieve 56% above their sales target would be awarded 15 bonus points.

As shown in Figure 10.18, the name and percentage above the target achieved for each salesperson appear below the bonus-band table in columns A and B. In column C, the VLOOKUP function is used to look in the bonus band table and automatically assign the number of bonus points to each salesperson.

The VLOOKUP function allows the user to pull a subset of data from a larger table of data based on some criterion. The general form of the VLOOKUP function is

=VLOOKUP(value, table, index, range)

where
value = the value to search for in the first column of the table
table = the cell range containing the table
index = the column in the table containing the value to be returned
range = TRUE if looking for the first approximate match of value and FALSE if looking for an exact match of value (We will explain the difference between approximate and exact matches in a moment.)

VLOOKUP assumes that the first column of the table is sorted in ascending order. The VLOOKUP function for salesperson Choi in cell C18 is as follows:

=VLOOKUP(B18, $A$7:$C$11, 3, TRUE)

This function uses the percentage above target sales from cell B18 and searches the first column of the table defined by A7:C11. Because the range is set to TRUE, indicating a search for the first approximate match, Excel searches in the first column of the table from the top until it finds a number strictly greater than the value of B18. B18 is 44%, and the first value in the table in column A larger than 44% is in cell A9 (51%). It then backs up one row (to row 8). In other words, it finds the last value in the first column less than or equal to 44%. Because a 3 is in the third argument of the VLOOKUP function, it takes the element in row 8 of the third column of the table, which is 10 bonus points. In summary, the VLOOKUP with range set to TRUE takes the first argument and searches the first column of the table for the last row that is less than or equal to the first argument. It then selects from that row the element in the column number of the third argument.

If the range in the VLOOKUP function is FALSE, the only change is that Excel searches for an exact match of the first argument in the first column of the data.

Once all salespeople are awarded bonus points based on VLOOKUP and the bonus-band table, the total number of bonus points awarded is given in cell C30 using the SUM function. Each person’s bonus points as a percentage of the total awarded is calculated in column D, and in column E each person is awarded that percentage of the bonus pool. As a check, cells D30 and E30 give the total percentages and dollar amounts awarded.
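The approximate-match rule described above ("the last row that is less than or equal to the first argument") can be sketched in a few lines of Python. This is an illustration only, not how Excel is implemented. The function names are our own. The bonus points of 10 and 15 come from the text; the lower limit of 0% for the first band and the points for the first, fourth, and fifth bands (0, 25, and 40) are assumptions inferred from the percentages shown in Figure 10.18.

```python
from bisect import bisect_right

# Bonus-band table (cells A7:C11). Lower limits are sorted in ascending order,
# as an approximate-match lookup requires.
lower_limits = [0.00, 0.11, 0.51, 0.80, 1.00]   # first entry assumed to be 0%
bonus_points = [0, 10, 15, 25, 40]              # 10 and 15 are from the text; others inferred

BONUS_POOL = 250_000      # cell E3
TOTAL_POINTS = 295        # cell C30, as shown in Figure 10.18


def approx_lookup(value, keys, results):
    """Return the result on the last row whose key is <= value."""
    i = bisect_right(keys, value) - 1
    if i < 0:
        # Excel reports #N/A when the value is below the first key.
        raise ValueError("value is smaller than every key in the table")
    return results[i]


def bonus_amount(points):
    """Share of the pool: points / total points * pool (columns D and E)."""
    return points / TOTAL_POINTS * BONUS_POOL


choi_points = approx_lookup(0.44, lower_limits, bonus_points)
print(choi_points, round(bonus_amount(choi_points), 2))
```

Note that only the lower limits are used by the lookup; the upper limits in column B of the table serve the reader, not the formula.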

Numerous mathematical, logical, and financial functions are available in Excel. In addition to those discussed here, we will introduce you to other functions, as needed, in examples and end-of-chapter problems. Having already discussed principles for building good spreadsheet models and after having seen a variety of spreadsheet models, we turn now to how to audit Excel models to ensure model integrity.

FIGURE 10.18 Granite Insurance Bonus Model

Granite Insurance Bonus Awards

Parameters
Bonus Pool   $250,000

Bonus bands to be awarded for percentage above target sales (cells A7:C11; n/a marks values that did not survive in this copy of the figure)
Lower Limit   Upper Limit   Bonus Points
n/a           10%           n/a
11%           50%           10
51%           79%           15
80%           99%           n/a
100%          10000%        n/a

Model (rows 19–28 are hidden)
Last Name   % Above Target Sales   % of Pool   Bonus Amount
Barth       83%                    8.5%        $21,186.44
Benson      n/a                    0.0%        $0.00
Capel       118%                   13.6%       $33,898.31
Choi        44%                    3.4%        $8,474.58
Ruebush     85%                    8.5%        $21,186.44
Total                              100%        $250,000.00

Total bonus points (cell C30): 295

Formulas in row 15 (rows 16 through 29 follow the same pattern) and in the total row:
C15  =VLOOKUP(B15,$A$7:$C$11,3,TRUE)
D15  =C15/$C$30
E15  =D15*$E$3
C30  =SUM(C15:C29)
D30  =SUM(D15:D29)
E30  =SUM(E15:E29)

Granite

10.4 Auditing Spreadsheet Models

Excel contains a variety of tools to assist you in the development and debugging of spreadsheet models. These tools are found in the Formula Auditing group of the Formulas tab, as shown in Figure 10.19. Let us review each of the tools available in this group.

Trace Precedents and Dependents

After selecting cells, the Trace Precedents button creates arrows pointing to the selected cell from cells that are part of the formula in that cell. The Trace Dependents button, on the other hand, shows arrows pointing from the selected cell to cells that depend on the selected cell. Both of the tools are excellent for quickly ascertaining how parts of a model are linked.

An example of Trace Precedents is shown in Figure 10.20. Here we have opened the Foster Generators Excel file, selected cell B13, and clicked the Trace Precedents button in the Formula Auditing group. Recall that the cost in cell B13 is calculated as the SUMPRODUCT of the per-unit shipping cost and units shipped. In Figure 10.20, to show this relationship, arrows are drawn from these areas of the spreadsheet to cell B13. These arrows may be removed by clicking on the Remove Arrows button in the Formula Auditing group.

An example of Trace Dependents is shown in Figure 10.21. We have selected cell E18, the units shipped from Bedford to Lexington, and clicked on the Trace Dependents button in the Formula Auditing group. As shown in Figure 10.21, the units shipped from Bedford to Lexington affect the cost function in cell B13, the total units shipped from Bedford given in cell F18, and the total units shipped to Lexington in cell E20. These arrows may be removed by clicking on the Remove Arrows button in the Formula Auditing group.

Trace Precedents and Trace Dependents can highlight errors in copying and formula construction by showing that incorrect sections of the worksheet are referenced.
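To see what these two tools compute, consider the following rough Python sketch. It is not Excel's implementation, and the parsing is deliberately simplistic (it assumes single-letter column names and references on one worksheet). Because the Foster Generators formulas are not reproduced here, the sketch uses the column B formulas of the Gambrell model from Figure 10.17; the function names are hypothetical.

```python
import re

# Column B formulas from the Gambrell Manufacturing model (Figure 10.17).
formulas = {
    "B16": "=B5-B4",
    "B17": "=IF(B16 >= $B$10, $B$11*B6, B6)*B16",
    "B19": '=COUNTIF(B16:E16, ">0")',
    "B21": "=B19*B8",
    "B22": "=SUM(B17:E17)",
    "B23": "=SUM(B21:B22)",
}

# Simplified patterns: one-letter columns, optional $ for absolute references.
RANGE = re.compile(r"\$?([A-Z])\$?(\d+):\$?([A-Z])\$?(\d+)")
CELL = re.compile(r"\$?([A-Z])\$?(\d+)")


def precedents(formula):
    """Cells that a formula refers to (what Trace Precedents points from)."""
    cells = set()

    def expand(match):
        c1, r1, c2, r2 = match.group(1), int(match.group(2)), match.group(3), int(match.group(4))
        for col in range(ord(c1), ord(c2) + 1):
            for row in range(r1, r2 + 1):
                cells.add(f"{chr(col)}{row}")
        return ""  # remove the range so its endpoints are not counted twice

    remainder = RANGE.sub(expand, formula)
    cells.update(col + row for col, row in CELL.findall(remainder))
    return cells


def dependents(cell, all_formulas):
    """Cells whose formulas refer to `cell` (what Trace Dependents points to)."""
    return {target for target, f in all_formulas.items() if cell in precedents(f)}


print(sorted(precedents(formulas["B17"])))
print(sorted(dependents("B16", formulas)))
```

Precedents are read directly from a formula, whereas dependents require scanning every formula in the model, which is why the tool is most valuable in large worksheets.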

Show Formulas

The Show Formulas button does exactly that. To see the formulas in a worksheet, simply click on any cell in the worksheet and then click on Show Formulas. You will see the formulas residing in that worksheet. To revert to hiding the formulas, click again on the Show Formulas button. As we have already seen in our examples in this chapter, the use of Show Formulas allows you to inspect each formula in detail in its cell location.

FIGURE 10.19 The Formula Auditing Group
[Screenshot of the Formula Auditing group on the Formulas tab, showing the Trace Precedents, Trace Dependents, Remove Arrows, Show Formulas, Error Checking, Evaluate Formula, and Watch Window buttons.]

FIGURE 10.20 Trace Precedents for Foster Generators
[Screenshot of the Foster Generators worksheet with arrows drawn from the shipping cost per unit and units shipped ranges to the total cost in cell B13 ($54,500.00).]

FIGURE 10.21 Trace Dependents for the Foster Generators Model
[Screenshot of the Foster Generators worksheet with arrows drawn from cell E18 to cells B13, F18, and E20.]

Evaluate Formulas

The Evaluate Formula button allows you to investigate the calculations of a cell in great detail. As an example, let us investigate cell B17 of the Gambrell Manufacturing model (Figure 10.17). Recall that we are calculating cost of goods based on whether there is a quantity discount. We follow these steps:

Step 1. Select cell B17
Step 2. Click the Formulas tab in the Ribbon
Step 3. Click the Evaluate Formula button in the Formula Auditing group
Step 4. When the Evaluate Formula dialog box appears (Figure 10.22), click the Evaluate button
Step 5. Repeat Step 4 until the formula has been completely evaluated
Step 6. Click Close

Figure 10.23 shows the Evaluate Formula dialog box for cell B17 in the Gambrell Manufacturing spreadsheet model after four clicks of the Evaluate button.

The Evaluate Formula tool provides an excellent means of identifying the exact location of an error in a formula.

FIGURE 10.22 The Evaluate Formula Dialog Box for Gambrell Manufacturing
[Screenshot of the Evaluate Formula dialog box displayed over the Gambrell Manufacturing worksheet.]

FIGURE 10.23 The Evaluate Formula Dialog Box for Gambrell Manufacturing Cell B17 after Four Clicks of the Evaluate Button

Error Checking

The Error Checking button provides an automatic means of checking for mathematical errors within formulas of a worksheet. Clicking on the Error Checking button causes Excel to check every formula in the sheet for calculation errors. If an error is found, the Error Checking dialog box appears. An example for a hypothetical division by zero error is shown in Figure 10.24. From this box, the formula can be edited, the calculation steps can be observed (as in the previous section on Evaluate Formulas), or help can be obtained through the Excel help function. The Error Checking procedure is particularly helpful for large models where not all cells of the model are visible.

Watch Window

The Watch Window, located in the Formula Auditing group, allows the user to observe the values of cells included in the Watch Window box list. This is useful for large models when not all of the model is observable on the screen or when multiple worksheets are used. The user can monitor how the listed cells change with a change in the model without searching through the worksheet or changing from one worksheet to another.

A Watch Window for the Gambrell Manufacturing model is shown in Figure 10.25. The following steps were used to add cell B17 to the watch list:

Step 1. Click the Formulas tab in the Ribbon
Step 2. Click Watch Window in the Formula Auditing group to display the Watch Window
Step 3. Click Add Watch…
Step 4. Select the cell you would like to add to the watch list (in this case B17)

As shown in Figure 10.25, the list gives the workbook name, worksheet name, cell name (if used), cell location, cell value, and cell formula. To delete a cell from the watch list, click on the entry from the list, and then click on the Delete Watch button that appears in the upper part of the Watch Window.

The Watch Window, as shown in Figure 10.25, allows us to monitor the value of B17 as we make changes elsewhere in the worksheet. Furthermore, if we had other worksheets in this workbook, we could monitor changes to B17 of the worksheet even from these other worksheets. The Watch Window is observable regardless of where we are in any worksheet of a workbook.

FIGURE 10.24 The Error Checking Dialog Box for a Division by Zero Error

FIGURE 10.25 The Watch Window for Cell B17 of the Gambrell Manufacturing Model

10.5 Predictive and Prescriptive Spreadsheet Models

Two key phenomena that make decision making difficult are uncertainty and an overwhelming number of choices. Spreadsheet what-if models, as we have discussed thus far in this chapter, are descriptive models. Given formulas and data to populate the formulas, calculations are made based on the formulas. However, basic what-if spreadsheet models can be extended to help deal with uncertainty or the many alternatives a decision maker may face.

As we have seen in previous chapters, predictive models can be estimated from data in spreadsheets using tools provided in Excel. For example, the Excel Regression tool and other Data Analysis tools such as Exponential Smoothing and Moving Average allow us to develop predictive models based on data in the spreadsheet. These predictive models can help us deal with uncertainty by giving estimates for unknown events/quantities that serve as inputs to the decision-making process. Another important extension of what-if models that helps us deal with uncertainty is simulation.

Monte Carlo simulation essentially automates manual what-if. This automation allows for very rapid and high-volume what-if to imitate the uncertainty the decision maker faces. By quantifying uncertainty in inputs, simulation allows us to quantify uncertainty in the outputs we care about and therefore assess the riskiness of a decision. Excel has built-in probability functions that allow us to simulate uncertainty.
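The idea of "automated what-if" can be illustrated with a short Python sketch. Everything in it is hypothetical and chosen only for illustration: the profit model, the price, the costs, and the assumption that demand is uniformly distributed between 0 and 1,000 units. It is not one of the models in this chapter.

```python
import random


def profit(demand, price=50.0, unit_cost=30.0, fixed_cost=10_000.0):
    """A hypothetical what-if model: profit for a given demand."""
    return (price - unit_cost) * demand - fixed_cost


def simulate(trials=10_000, seed=42):
    """Repeat the what-if calculation for many randomly drawn demands."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [profit(rng.uniform(0, 1000)) for _ in range(trials)]


results = simulate()
mean_profit = sum(results) / len(results)
prob_loss = sum(r < 0 for r in results) / len(results)
print(f"mean profit: {mean_profit:.2f}, estimated probability of a loss: {prob_loss:.3f}")
```

Each trial is one manual what-if calculation; the summary statistics across trials are what let us describe the riskiness of the decision rather than a single outcome.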

Monte Carlo simulation is discussed in detail in Chapter 11.

To deal with the other complicating factor of decision making, namely an overwhelming number of alternatives, optimization models can be used to help make smart decisions. Optimization models are prescriptive models, characterized by an objective to be maximized or minimized and, usually, constraints that limit the options available to the decision maker. Because they yield a course of action to follow, optimization models are one type of prescriptive analytics. Excel includes a special tool called Solver that solves optimization models. Excel Solver is used to extend a what-if model to find an optimal (or best) course of action that maximizes or minimizes an objective while satisfying the constraints of the decision problem.

Chapters 12, 13, and 14 discuss the use of optimization models for decision making and how to use Excel Solver.

In this chapter, we discussed how to extend the Nowlin Plastics descriptive model to find the breakeven point by applying the Goal Seek tool to that descriptive model. Like Goal Seek, these other extensions of basic descriptive spreadsheet models to simulation and optimization models allow us to perform more advanced analytics.

SUMMARY

In this chapter we discussed the principles of building good spreadsheet models, several what-if analysis tools, some useful Excel functions, and how to audit spreadsheet models. What-if spreadsheet models are important and popular analysis tools in and of themselves, but as we shall see in later chapters, they also serve as the basis for optimization and simulation models.

We discussed how to use influence diagrams to structure a problem. Influence diagrams can serve as a guide to developing a mathematical model and implementing the model in a spreadsheet. We discussed the importance of separating the parameters from the model because it leads to simpler analysis and minimizes the risk of creating an error in a formula. In most cases, cell formulas should use cell references in their arguments rather than being “hardwired” with values. We also discussed the use of proper formatting and color to enhance the ease of use and understanding of a spreadsheet model.

We used examples to illustrate how Excel What-If Analysis tools Data Tables, Goal Seek, and Scenario Manager can be used to perform detailed and efficient what-if analysis. We also discussed a number of Excel functions that are useful for business analytics. We discussed Excel Formula Auditing tools that may be used to debug and monitor spreadsheet models to ensure that they are error-free and accurate. We ended the chapter with a brief discussion of predictive and prescriptive spreadsheet models.

GLOSSARY

Data Table An Excel tool that quantifies the impact of changing the value of a specific input on an output of interest.
Decision variable A model input the decision maker can control.
Goal Seek An Excel tool that allows the user to determine the value for an input cell that will cause the value of a related output cell to equal some specified value, called the goal.
Influence diagram A visual representation that shows which entities influence others in a model.
Make-versus-buy decision A decision often faced by companies that have to decide whether they should manufacture a product or outsource its production to another firm.
One-way data table An Excel Data Table that summarizes a single input’s impact on the output of interest.
Parameters In a what-if model, the uncontrollable model input.
Scenario manager An Excel tool that quantifies the impact of changing multiple inputs on one or more outputs of interest.
Two-way data table An Excel Data Table that summarizes two inputs’ impact on the output of interest.
What-if model A model designed to study the impact of changes in model inputs on model outputs.

PROBLEMS

1. Cox Electric makes electronic components and has estimated the following for a new design of one of its products:

Fixed cost                 $10,000
Material cost per unit     $0.15
Labor cost per unit        $0.10
Revenue per unit           $0.65

These data are given in the file CoxElectric. Note that fixed cost is incurred regardless of the amount produced. Per-unit material and labor cost together make up the variable cost per unit. Assuming that Cox Electric sells all that it produces, profit is calculated by subtracting the fixed cost and total variable cost from total revenue.
a. Build an influence diagram that illustrates how to calculate profit.
b. Using mathematical notation similar to that used for Nowlin Plastics, give a mathematical model for calculating profit.
c. Implement your model from part (b) in Excel using the principles of good spreadsheet design.
d. If Cox Electric makes 12,000 units of the new product, what is the resulting profit?

2. Use the spreadsheet model constructed to answer Problem 1 to answer this problem.
a. Construct a one-way data table with production volume as the column input and profit as the output. Breakeven occurs when profit goes from a negative to a positive value; that is, breakeven is when total revenue = total cost, yielding a profit of zero. Vary production volume from 0 to 100,000 in increments of 10,000. In which interval of production volume does breakeven occur?
b. Use Goal Seek to find the exact breakeven point. Assign Set cell: equal to the location of profit, To value: = 0, and By changing cell: equal to the location of the production volume in your model.

3. Eastman Publishing Company is considering publishing an electronic textbook about spreadsheet applications for business. The fixed cost of manuscript preparation, textbook design, and web site construction is estimated to be $160,000. Variable processing costs are estimated to be $6 per book. The publisher plans to sell single-user access to the book for $46.
a. Build a spreadsheet model to calculate the profit/loss for a given demand. What profit can be anticipated with a demand of 3,500 copies?
b. Use a data table to vary demand from 1,000 to 6,000 in increments of 200 to assess the sensitivity of profit to demand.
c. Use Goal Seek to determine the access price per copy that the publisher must charge to break even with a demand of 3,500 copies.
d. Consider the following scenarios:

Scenario      Variable Cost/Book   Access Price   Demand
Scenario 1    $6                   $46            2,500
Scenario 2    $8                   $50            1,000
Scenario 3    $12                  $40            6,000
Scenario 4    $10                  $50            5,000
Scenario 5    $11                  $60            2,000

For each of these scenarios, the fixed cost remains $160,000. Use Scenario Manager to generate a summary report that gives the profit for each of these scenarios. Which scenario yields the highest profit? Which scenario yields the lowest profit?

4. The University of Cincinnati Center for Business Analytics is an outreach center that collaborates with industry partners on applied research and continuing education in business analytics. One of the programs offered by the center is a quarterly Business Intelligence Symposium. Each symposium features three speakers on the real-world use of analytics. Each corporate member of the center (there are currently 10) receives five free seats to each symposium. Nonmembers wishing to attend must pay $75 per person. Each attendee receives breakfast, lunch, and free parking. The following are the costs incurred for putting on this event:

Rental cost for the auditorium   $150
Registration processing          $8.50 per person
Speaker costs                    3 @ $800 = $2,400
Continental breakfast            $4.00 per person
Lunch                            $7.00 per person
Parking                          $5.00 per person

a. Build a spreadsheet model that calculates a profit or loss based on the number of nonmember registrants.
b. Use Goal Seek to find the number of nonmember registrants that will make the event break even.

5. Consider again the scenario described in Problem 4.
a. The Center for Business Analytics is considering a refund policy for no-shows. No refund would be given for members who do not attend, but nonmembers who do not attend will be refunded 50% of the price. Extend the model you developed in Problem 4 for the Business Intelligence Symposium to account for the fact that, historically, 25% of members who registered do not show and 10% of registered nonmembers do not attend. The center pays the caterer for breakfast and lunch based on the number of registrants (not the number of attendees). However, the center pays for parking only for those who attend. What is the profit if each corporate member registers their full allotment of tickets and 127 nonmembers register?
b. Use a two-way data table to show how profit changes as a function of number of registered nonmembers and the no-show percentage of nonmembers. Vary the number of nonmember registrants from 80 to 160 in increments of 5 and the percentage of nonmember no-shows from 10 to 30% in increments of 2%.
c. Consider three scenarios:

                                    Base Case   Worst Case   Best Case
% of Members who do not show        25.0%       50%          15%
% of Nonmembers who do not show     10.0%       30%          5%
Number of Nonmember Registrants     130         100          150

All other inputs are the same as in part a. Use Scenario Manager to generate a summary report that gives the profit for each of these three scenarios. What is the highest profit? What is the lowest profit?

6. Consider again Problem 3. Through a series of web-based experiments, Eastman has created a predictive model that estimates demand as a function of price. The predictive model is demand = 4,000 − 6p, where p is the price of the e-book.
a. Update your spreadsheet model constructed for Problem 3 to take into account this demand function.
b. Use Goal Seek to calculate the price that results in breakeven.
c. Use a data table that varies price from $50 to $400 in increments of $25 to find the price that maximizes profit.

7. Lindsay is 25 years old and has a new job in web development. She wants to make sure that she is financially sound in 30 years, so she plans to invest the same amount into a retirement account at the end of every year for the next 30 years. Construct a data table that will show Lindsay the balance of her retirement account for various levels of annual investment and return. Develop the two-way table for annual investment amounts of $5,000 to $20,000 in increments of $1,000 and for returns of 0 to 12% in increments of 1%. Note that because Lindsay invests at the end of the year, there is no interest earned on the contribution for the year in which she contributes.

8. Consider again Lindsay’s investment in Problem 7. The real value of Lindsay’s account after 30 years of investing will depend on inflation over that period. In the Excel function =NPV(rate, value 1, value 2, . . .), rate is called the discount rate, and value 1, value 2, etc. are incomes (positive) or expenditures (negative) over equal periods of time. Update your model from Problem 7 using the NPV function to get the net present value of Lindsay’s retirement fund. Construct a data table that shows the net present value of Lindsay’s retirement fund for various levels of return and inflation (discount rate). Use a data table to vary the return from 0 to 12% in increments of 1% and the discount rate from 0 to 4% in increments of 1% to show the impact on the net present value. (Hint: Calculate the total amount added to the account each year, and discount that stream of payments using the NPV function.)
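As background for the NPV function used in Problem 8, the following minimal Python sketch shows the calculation it performs: each value is assumed to occur at the end of its period, so the first value is discounted by one period. The function name and the sample numbers are ours and are unrelated to Lindsay's data.

```python
def npv(rate, values):
    """Net present value of end-of-period cash flows, as in =NPV(rate, value 1, value 2, ...)."""
    return sum(v / (1 + rate) ** t for t, v in enumerate(values, start=1))


# Hypothetical check: 110 received one period from now, discounted at 10%.
print(round(npv(0.10, [110]), 2))
```

With a discount rate of zero, the function simply returns the sum of the values.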

9. Goal Kick Sports (GKS) is a retail chain that sells youth and adult soccer equipment. The GKS financial planning group has developed a spreadsheet model to calculate the net discounted cash flow of the first five years of operations for a new store. This model is used to assess new locations under consideration for expansion.
a. Use Excel’s Formula Auditing tools to audit this model and correct any errors found.
b. Once you are comfortable that the model is correct, use Scenario Manager to generate a Scenario Summary report that gives Total Discounted Cash Flow for the following scenarios:

Scenario                  1     2     3     4
Tax Rate                  33%   25%   33%   25%
Inflation Rate            1%    2%    4%    3%
Annual Growth of Sales    20%   15%   10%   12%

What is the range of values for the Total Discounted Cash Flow for these scenarios?

10. Newton Manufacturing produces scientific calculators. The models are N350, N450, and the N900. Newton has planned its distribution of these products around eight customer zones: Brazil, China, France, Malaysia, U.S. Northeast, U.S. Southeast, U.S. Midwest, and U.S. West. Data for the current quarter (volume to be shipped in thousands of units) for each product and each customer zone are given in the file Newton. Newton would like to know the total number of units going to each customer zone and also the total units of each product shipped. There are several ways to get this information from the data set. One way is to use the SUMIF function.

The SUMIF function extends the SUM function by allowing the user to add the values of cells meeting a logical condition. The general form of the function is

=SUMIF(test range, condition, range to be summed)

The test range is an area to search to test the condition, and the range to be summed is the position of the data to be summed. So, for example, using the file Newton, we use the following function to get the total units sent to Malaysia:

=SUMIF(A3:A26, A3, C3:C26)

Cell A3 contains the text “Malaysia”; A3:A26 is the range of customer zones; and C3:C26 are the volumes for each product for these customer zones. The SUMIF looks for matches of “Malaysia” in column A and, if a match is found, adds the volume to the total. Use the SUMIF function to get total volume by each zone and total volume by each product.
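The matching-and-adding behavior of SUMIF can be sketched in Python as follows. The data are hypothetical (they are not the contents of the file Newton), the function name is ours, and the sketch handles only the simple equality condition used above; Excel's SUMIF also accepts comparison conditions such as ">0" and ignores case when matching text.

```python
def sumif(test_range, condition, sum_range):
    """Add the entries of sum_range whose matching test_range entry equals condition."""
    return sum(v for t, v in zip(test_range, sum_range) if t == condition)


# Hypothetical customer zones and shipping volumes (thousands of units).
zones = ["Malaysia", "Brazil", "Malaysia", "China"]
volumes = [12, 7, 5, 9]

print(sumif(zones, "Malaysia", volumes))
```

As in the worksheet formula, the two ranges must line up row by row: the condition is tested in one column and the addition is carried out in the other.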

11. Consider the transportation model in the file Williamson, which is very similar to the Foster Generators model discussed in this chapter. Williamson produces a single product and has plants in Atlanta, Lexington, Chicago, and Salt Lake City and warehouses in Portland, St. Paul, Las Vegas, Tucson, and Cleveland. Each plant has a capacity, and each warehouse has a demand. Williamson would like to find a low-cost shipping plan. Mr. Williamson has reviewed the results and notices right away that the total cost is way out of line. Use the Formula Auditing tool under the Formulas tab in Excel to find any errors in this model. Correct the errors. (Hint: The model contains two errors. Be sure to check every formula.)

GoalKick

Newton

Williamson

496 chapter 10 Spreadsheet Models

12. Professor Rao would like to accurately calculate the grades for the 58 students in his Operations Planning and Scheduling class (OM 455). He has thus far constructed a spreadsheet, part of which follows:

OM 455
Section 001
Course Grading Scale Based on Course Average:

Lower Limit    Upper Limit    Course Grade
0              59             F
60             69             D
70             79             C
80             89             B
90             100            A

Last Name    Midterm Score    Final Score    Course Average    Course Grade
Alt          70               56             63.0
Amini        95               91             93.0
Amoako       82               80             81.0
Apland       45               78             61.5
Bachman      68               45             56.5
Corder       91               98             94.5
Desi         87               74             80.5
Dransman     60               80             70.0
Duffuor      80               93             86.5
Finkel       97               98             97.5
Foster       90               91             90.5

a. The Course Average is calculated by weighting the Midterm Score and Final Score 50% each. Use the VLOOKUP function with the table shown to generate the Course Grade for each student in cells E14 through E24.

b. Use the COUNTIF function to determine the number of students receiving each letter grade.

13. Richardson Ski Racing (RSR) sells equipment needed for downhill ski racing. One of RSR’s products is fencing used on downhill courses. The fence product comes in 150-foot rolls and sells for $215 per roll. However, RSR offers quantity discounts. The following table shows the price per roll depending on order size:

Quantity Ordered
From    To        Price per Roll
1       50        $215
51      100       $195
101     200       $175
201     and up    $155

The file RSR contains 172 orders that have arrived for the coming six weeks.

a. Use the VLOOKUP function with the preceding pricing table to determine the total revenue from these orders.

b. Use the COUNTIF function to determine the number of orders in each price bin.
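The range lookup that VLOOKUP performs on a sorted table, and the tally that COUNTIF produces, can be sketched in Python as follows. This is an illustrative sketch only: the helper name `price_per_roll` and the five order quantities are hypothetical and are not taken from the file RSR.

```python
from bisect import bisect_right
from collections import Counter

# Lower limit of each quantity bin and the matching price per roll (from the pricing table).
LOWER_LIMITS = [1, 51, 101, 201]
PRICES = [215, 195, 175, 155]

def price_per_roll(quantity):
    """Return the price of the last bin whose lower limit does not exceed quantity."""
    if quantity < LOWER_LIMITS[0]:
        raise ValueError("quantity must be at least 1")
    return PRICES[bisect_right(LOWER_LIMITS, quantity) - 1]

# Hypothetical order quantities, for illustration only.
orders = [10, 50, 51, 150, 250]

revenue = sum(q * price_per_roll(q) for q in orders)
orders_per_price = Counter(price_per_roll(q) for q in orders)
print(revenue, dict(orders_per_price))
```

The same pattern applies to the grading scale in Problem 12, with the lower limits 0, 60, 70, 80, and 90 mapped to the letter grades.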

OM455

RSR

14. A put option in finance allows you to sell a share of stock at a given price in the future. There are different types of put options. A European put option allows you to sell a share of stock at a given price, called the exercise price, at a particular point in time after the purchase of the option. For example, suppose you purchase a six-month European put option for a share of stock with an exercise price of $26. If six months later, the stock price per share is $26 or more, the option has no value. If in six months the stock price is lower than $26 per share, then you can purchase the stock and immediately sell it at the higher exercise price of $26. If the price per share in six months is $22.50, you can purchase a share of the stock for $22.50 and then use the put option to immediately sell the share for $26. Your profit would be the difference, $26 − $22.50 = $3.50 per share, less the cost of the option. If you paid $1.00 per put option, then your profit would be $3.50 − $1.00 = $2.50 per share.

a. Build a model to calculate the profit of this European put option.

b. Construct a data table that shows the profit per share for a share price in six months between $10 and $30 per share in increments of $1.00.
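The profit logic described in Problem 14 can be written compactly. The sketch below is one possible reading, with the hypothetical function name `put_profit`; it assumes the option is exercised only when the share price is below the exercise price.

```python
def put_profit(price_in_six_months, exercise_price=26.0, option_cost=1.0):
    """Profit per share: gain from exercising (never negative) less the option cost."""
    return max(exercise_price - price_in_six_months, 0.0) - option_cost

# One-way data table: share prices from $10 to $30 in $1 increments.
profit_table = {price: put_profit(price) for price in range(10, 31)}

for price, profit in profit_table.items():
    print(f"${price:>2}  {profit:6.2f}")
```

For share prices at or above $26 the option expires worthless, so the profit is simply the negative of the option cost.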

15. Consider again Problem 14. The point of purchasing a European option is to limit the risk of a decrease in the per-share price of the stock. Suppose you purchased 200 shares of the stock at $28 per share and 75 six-month European put options with an exercise price of $26. Each put option costs $1.

a. Using data tables, construct a model that shows the value of the portfolio with options and without options for a share price in six months between $15 and $35 per share in increments of $1.00.

b. Discuss the value of the portfolio with and without the European put options.

16. The Camera Shop sells two popular models of digital SLR cameras. The sales of these products are not independent; if the price of one increases, the sales of the other increases. In economics, these two camera models are called substitutable products. The store wishes to establish a pricing policy to maximize revenue from these products. A study of price and sales data shows the following relationships between the quantity sold (N) and price (P) of each model:

\[ N_A = 195 - 0.6P_A + 0.25P_B \]

\[ N_B = 301 + 0.08P_A - 0.5P_B \]

a. Construct a model for the total revenue and implement it on a spreadsheet.

b. Develop a two-way data table to estimate the optimal prices for each product in order to maximize the total revenue. Vary each price from $250 to $500 in increments of $10.
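A two-way data table is simply the revenue evaluated at every combination of the two prices. The following Python sketch mirrors that idea under the demand relationships given in Problem 16; the names `total_revenue`, `revenue_table`, and `best_prices` are illustrative choices, not part of the problem.

```python
def total_revenue(price_a, price_b):
    """Revenue = price times quantity for each model, using the stated demand relationships."""
    units_a = 195 - 0.6 * price_a + 0.25 * price_b
    units_b = 301 + 0.08 * price_a - 0.5 * price_b
    return price_a * units_a + price_b * units_b

# Two-way table: every pair of prices from $250 to $500 in $10 increments.
prices = range(250, 501, 10)
revenue_table = {(pa, pb): total_revenue(pa, pb) for pa in prices for pb in prices}

# The grid point with the largest revenue estimates the optimal prices.
best_prices = max(revenue_table, key=revenue_table.get)
print(best_prices, round(revenue_table[best_prices], 2))
```

Because the grid moves in $10 steps, the result is an estimate; a finer grid around the best cell would refine it.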

17. A few years back, Dave and Jana bought a new home. They borrowed $230,415 at an annual fixed rate of 5.49% (15-year term) with monthly payments of $1,881.46. They just made their 25th payment, and the current balance on the loan is $208,555.87.

Interest rates are at an all-time low, and Dave and Jana are thinking of refinancing to a new 15-year fixed loan. Their bank has made the following offer: 15-year term, 3.0%, plus out-of-pocket costs of $2,937. The out-of-pocket costs must be paid in full at the time of refinancing.

Build a spreadsheet model to evaluate this offer. The Excel function

=PMT(rate, nper, pv, fv, type)

calculates the payment for a loan based on constant payments and a constant interest rate. The arguments of this function are as follows:

rate = the interest rate for the loan
nper = the total number of payments
pv = present value (the amount borrowed)
fv = future value [the desired cash balance after the last payment (usually 0)]
type = payment type (0 = end of period, 1 = beginning of the period)

For example, for Dave and Jana’s original loan, there will be 180 payments (12 × 15 = 180), so we would use =PMT(0.0549/12, 180, 230415, 0, 0) = $1,881.46.

Note that because payments are made monthly, the annual interest rate must be expressed as a monthly rate. Also, for payment calculations, we assume that the payment is made at the end of the month.

The savings from refinancing occur over time, and therefore need to be discounted back to current dollars. The formula for converting K dollars saved t months from now to current dollars is

\[ \frac{K}{(1 + r)^{t-1}} \]

where r is the monthly inflation rate. Assume that r = 0.002 and that Dave and Jana make their payment at the end of each month.

Use your model to calculate the savings in current dollars associated with the refinanced loan versus staying with the original loan.
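The two building blocks of Problem 17, the level payment and the conversion to current dollars, can be sketched in Python. This is a hedged illustration: `pmt` and `current_dollars` are hypothetical helper names, `pmt` covers only the end-of-period case with a future value of 0, it returns the payment as a positive amount to match the $1,881.46 quoted above, and the discount exponent is read as t − 1.

```python
def pmt(rate, nper, pv):
    """Level end-of-period payment that repays pv over nper periods at the given periodic rate."""
    return pv * rate / (1 - (1 + rate) ** -nper)

def current_dollars(amount, month, monthly_inflation=0.002):
    """Convert an amount saved `month` months from now to current dollars."""
    return amount / (1 + monthly_inflation) ** (month - 1)

# Check against the original loan: 5.49% annual rate, 180 monthly payments, $230,415 borrowed.
original_payment = pmt(0.0549 / 12, 180, 230_415)
print(round(original_payment, 2))
```

The same `pmt` call with the refinance terms gives the new payment, and the monthly differences can then be discounted one month at a time with `current_dollars`.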

18. Consider again the mortgage refinance problem in Problem 17. Assume that Dave and Jana have accepted the refinance offer of a 15-year loan at 3% interest rate with out-of-pocket expenses of $2,937. Recall that they are borrowing $208,555.87. Assume that there is no prepayment penalty, so that any amount over the required payment is applied to the principal. Construct a model so that you can use Goal Seek to determine the monthly payment that will allow Dave and Jana to pay off the loan in 12 years. Do the same for 10 and 11 years. Which option for prepayment, if any, would you choose and why? (Hint: Break each monthly payment up into interest and principal [the amount that is deducted from the balance owed]. Recall that the monthly interest that is charged is the monthly loan rate multiplied by the remaining loan balance.)
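The hint in Problem 18 amounts to an amortization schedule, and Goal Seek amounts to searching for the payment that drives the ending balance to zero. The sketch below illustrates both steps in Python; it uses bisection as a stand-in for Goal Seek (Excel's own search procedure may differ), and the names `balance_after` and `goal_seek_payment` are hypothetical.

```python
def balance_after(balance, annual_rate, payment, months):
    """Remaining balance after `months` payments; each payment covers interest, then principal."""
    monthly_rate = annual_rate / 12
    for _ in range(months):
        interest = balance * monthly_rate
        balance -= payment - interest
    return balance

def goal_seek_payment(balance, annual_rate, months, tol=1e-6):
    """Bisection search for the payment that leaves a zero balance after `months` payments."""
    low = 0.0
    high = balance * (1 + annual_rate / 12)  # this payment clears the loan in one month
    while high - low > tol:
        mid = (low + high) / 2
        if balance_after(balance, annual_rate, mid, months) > 0:
            low = mid   # payment too small, balance remains
        else:
            high = mid  # payment large enough
    return high

# Payment needed to retire the refinanced loan in 12 years (144 months).
payment_12_years = goal_seek_payment(208_555.87, 0.03, 144)
print(round(payment_12_years, 2))
```

Changing the last argument to 120 or 132 gives the payments for the 10-year and 11-year payoff options.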

19. Floyd’s Bumpers has distribution centers in Lafayette, Indiana; Charlotte, North Carolina; Los Angeles, California; Dallas, Texas; and Pittsburgh, Pennsylvania. Each distribution center carries all products sold. Floyd’s customers are auto repair shops and larger auto parts retail stores. You are asked to perform an analysis of the customer assignments to determine which of Floyd’s customers should be assigned to each distribution center. The rule for assigning customers to distribution centers is simple: A customer should be assigned to the closest center. The file Floyds contains the distance from each of Floyd’s 1,029 customers to each of the five distribution centers. Your task is to build a list that tells which distribution center should serve each customer. The following functions will be helpful:

=MIN(array)

The MIN function returns the smallest value in a set of numbers. For example, if the range A1:A3 contains the values 6, 25, and 38, then the formula =MIN(A1:A3) returns the number 6, because it is the smallest of the three numbers.

=MATCH(lookup_value, lookup_array, match_type)

The MATCH function searches for a specified item in a range of cells and returns the relative position of that item in the range. The lookup_value is the value to match, the lookup_array is the range of search, and match_type indicates the type of match (use 0 for an exact match).

For example, if the range A1:A3 contains the values 6, 25, and 38, then the formula =MATCH(25,A1:A3,0) returns the number 2, because 25 is the second item in the range.

=INDEX(array, column_num)

The INDEX function returns the value of an element in a position of an array. For example, if the range A1:A3 contains the values 6, 25, and 38, then the formula =INDEX(A1:A3, 2) = 25, because 25 is the value in the second position of the array A1:A3. (Hint: Create three new columns. In the first column, use the MIN function to calculate the minimum distance for the customer in that row. In the second column use the MATCH function to find the position of the minimum distance. In the third column, use the position in the previous column with the INDEX function referencing the row of distribution center names to find the name of the distribution center that should service that customer.)
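The three-column approach in the hint can be traced in a few lines of Python. This sketch defines hypothetical `match` and `index` helpers that imitate the 1-based positions Excel uses; the distances shown are invented for illustration and do not come from the file Floyds.

```python
def match(lookup_value, lookup_array):
    """1-based position of the first exact match, like MATCH with match_type 0."""
    return lookup_array.index(lookup_value) + 1

def index(array, position):
    """Value at a 1-based position, like INDEX on a single row or column."""
    return array[position - 1]

centers = ["Lafayette", "Charlotte", "Los Angeles", "Dallas", "Pittsburgh"]
# Hypothetical distances (miles) from one customer to each center, in the same order.
distances = [612.0, 148.5, 2210.3, 980.7, 402.2]

minimum_distance = min(distances)               # column 1: MIN
position = match(minimum_distance, distances)   # column 2: MATCH
nearest_center = index(centers, position)       # column 3: INDEX
print(minimum_distance, position, nearest_center)
```

Repeating the three steps for each customer row yields the assignment list.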

Floyds

case Problem: Retirement Plan 499

20. Refer to Problem 19. Floyd’s Bumpers pays a transportation company to ship its product in full truckloads to its customers. Therefore, the cost for shipping is a function of the distance traveled and a fuel surcharge (also on a per-mile basis). The cost per mile is $2.42, and the fuel surcharge is $0.56 per mile. The file FloydsMay contains data for shipments for the month of May (each record is simply the customer zip code for a given truckload shipment) as well as the distance table from the distribution centers to each customer. Use the MATCH and INDEX functions to retrieve the distance traveled for each shipment, and calculate the charge for each shipment. What is the total amount that Floyd’s Bumpers spends on these May shipments? (Hint: The INDEX function may be used with a two-dimensional array: =INDEX(array, row_num, column_num), where array is a matrix, row_num is the row number, and column_num is the column position of the desired element of the matrix.)

21. An auto dealership is advertising that a new car with a sticker price of $35,208 is on sale for $25,995 if payment is made in full, or it can be financed at 0% interest for 72 months with a monthly payment of $489. Note that 72 payments × $489 per payment = $35,208, which is the sticker price of the car. By allowing you to pay for the car in a series of payments (starting one month from now) rather than $25,995 now, the dealer is effectively loaning you $25,995. If you choose the 0% financing option, what is the effective interest rate that the auto dealership is earning on your loan? (Hint: Discount the payments back to current dollars [see Problem 17 for a discussion of discounting], and use Goal Seek to find the discount rate that makes the net present value of the payments = $25,995.)
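The search described in the hint can be sketched in Python. The sketch makes two assumptions that should be checked against your own model: each payment made t months from today is discounted by (1 + r)^t, and bisection is used in place of Goal Seek. The function names are hypothetical.

```python
def present_value(monthly_rate, payment=489.0, months=72):
    """Present value of level payments made at the end of months 1 through `months`."""
    return sum(payment / (1 + monthly_rate) ** t for t in range(1, months + 1))

def goal_seek_rate(target=25_995.0, low=0.0, high=0.05, tol=1e-10):
    """Bisection search for the monthly rate at which the present value equals target."""
    while high - low > tol:
        mid = (low + high) / 2
        if present_value(mid) > target:
            low = mid   # present value too high, so the rate must be larger
        else:
            high = mid
    return (low + high) / 2

monthly_rate = goal_seek_rate()
nominal_annual_rate = 12 * monthly_rate
print(f"{monthly_rate:.6f} per month, {nominal_annual_rate:.4%} nominal annual")
```

At a rate of zero the present value is the full $35,208, which is why the search starts with zero as its lower bound.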

Case Problem: Retirement Plan

Tim is 37 years old and would like to establish a retirement plan. Develop a spreadsheet model that could be used to assist Tim with retirement planning. Your model should include the following input parameters:

Tim’s current age = 37 years
Tim’s current total retirement savings = $259,000
Annual rate of return on retirement savings = 4%
Tim’s current annual salary = $145,000
Tim’s expected annual percentage increase in salary = 2%
Tim’s percentage of annual salary contributed to retirement = 6%
Tim’s expected age of retirement = 65
Tim’s expected annual expenses after retirement (current dollars) = $90,000
Rate of return on retirement savings after retirement = 3%
Income tax rate postretirement = 15%

Assume that Tim’s employer contributes 6% of Tim’s salary to his retirement fund. Tim can make an additional annual contribution to his retirement fund before taxes (tax free) up to a contribution of $16,000. Assume that he contributes $6,000 per year. Also, assume an inflation rate of 2%.

Managerial Report

Your spreadsheet model should provide the accumulated savings at the onset of retirement as well as the age at which funds will be depleted (given assumptions on the input parameters).

As a feature of your spreadsheet model, build a data table to demonstrate the sensitivity of the age at which funds will be depleted to the retirement age and additional pre-tax contributions. Similarly, consider other factors you think might be important.

Develop a report for Tim outlining the factors that will have the greatest impact on his retirement.

FloydsMay

Chapter 11

Monte Carlo Simulation

CONTENTS

ANALYTICS IN ACTION: POLIO ERADICATION

11.1 RISK ANALYSIS FOR SANOTRONICS LLC
Base-Case Scenario
Worst-Case Scenario
Best-Case Scenario
Sanotronics Spreadsheet Model
Use of Probability Distributions to Represent Random Variables
Generating Values for Random Variables with Excel
Executing Simulation Trials with Excel
Measuring and Analyzing Simulation Output

11.2 SIMULATION MODELING FOR LAND SHARK INC.
Spreadsheet Model for Land Shark
Generating Values for Land Shark’s Random Variables
Executing Simulation Trials and Analyzing Output
Generating Bid Amounts with Fitted Distributions

11.3 SIMULATION WITH DEPENDENT RANDOM VARIABLES
Spreadsheet Model for Press Teag Worldwide

11.4 SIMULATION CONSIDERATIONS
Verification and Validation
Advantages and Disadvantages of Using Simulation

APPENDIX 11.1: COMMON PROBABILITY DISTRIBUTIONS FOR SIMULATION

AVAILABLE IN THE MINDTAP READER:

APPENDIX 11.2: LAND SHARK INC. SIMULATION WITH ANALYTIC SOLVER

APPENDIX 11.3: DISTRIBUTION FITTING WITH ANALYTIC SOLVER

APPENDIX 11.4: CORRELATING RANDOM VARIABLES WITH ANALYTIC SOLVER

APPENDIX 11.5: SIMULATION OPTIMIZATION WITH ANALYTIC SOLVER

Analytics in Action 501

Uncertainty pervades decision making in business, government, and our personal lives. This chapter introduces the use of Monte Carlo simulation to evaluate the impact of uncertainty on a decision. Simulation models have been successfully used in a variety of disciplines. Financial applications include investment planning, project selection, and option pricing. Marketing applications include new product development and the timing of market entry for a product. Management applications include project management, inventory ordering (especially important for seasonal products), capacity planning, and revenue management (prominent in the airline, hotel, and car rental industries). In each of these applications, uncertain quantities complicate the decision process.

Monte Carlo simulation originated during World War II as part of the Manhattan Project to develop nuclear weapons. “Monte Carlo” was selected as the code name for the classified method in reference to the famous Monte Carlo casino in Monaco and the uncertainties inherent in gambling.

ANALYTICS IN ACTION

Polio Eradication*

Polio is an infectious disease that has existed for thousands of years. The disease causes muscle weakness and paralysis and can lead to death. The disease is preventable through use of polio vaccines developed by Dr. Jonas Salk at the University of Pittsburgh and Dr. Albert Sabin at the University of Cincinnati in the 1950s and 1960s. These vaccines led to the effective eradication of polio in much of the developed world including the United States. The success of eradicating polio in the developed world led to the Global Polio Eradication Initiative (GPEI) in 1988 with the goal of ending all cases of polio from all sources. The United States Centers for Disease Control and Prevention (CDC) is one of the leaders of the GPEI and contributes more than $100 million annually to polio eradication. In 2001, the CDC initiated a collaboration with Kid Risk, Inc. to better understand the implications of decisions related to polio control and eradication. These efforts have helped raise billions of dollars to support global polio eradication, and have led to faster responses to polio outbreaks and better decisions on vaccination strategies.

A group of researchers from the CDC and Kid Risk have applied a variety of analytics tools to evaluate polio control and eradication decisions. One of these tools, Monte Carlo simulation, is used to evaluate the implications of potential polio-outbreak risks over time. Specifically, the researchers evaluated policy decisions after an initial eradication of a wild poliovirus transmission. Even after initial eradication, polio can reappear if sufficient vaccination policies and containment methods are not followed, or if the virus is accidentally reintroduced. Accidental reintroduction from what are known as vaccine-derived polioviruses can occur because vaccinated individuals, who are protected from the debilitating effects of polio, can still be a source of infection to others. As wild polioviruses are eradicated, vaccine-derived polioviruses can become the major source of transmission.

Monte Carlo simulation allows decision makers to evaluate different vaccination strategies to be used for 20 years after initial eradication of wild polioviruses. The simulation models allow for uncertainty in the transmission rates of vaccine-derived polioviruses and in the probability of outbreaks. The simulation models track the polio outbreak risks over time for different income groups and under different policies for vaccination. These simulation models then help the researchers to recommend specific policies as well as to inform decision makers about the inherent risks of outcomes to different income groups.

The CDC and other GPEI partners continue to use analytics models to evaluate policy decisions. It is estimated that the net benefits of this analysis will be $40–$50 billion for the GPEI between 1988 and 2035 compared to a policy that simply uses routine immunizations.

*K. Thompson, R. Duintjer Tebbens, M. Pallansch, S. Wassilak, S. Cochi, “Polio Eradicators Use Integrated Analytics Models to Make Better Decisions,” Interfaces 45, no. 1 (January–February 2015): 5–25.

As we will demonstrate, a spreadsheet simulation analysis requires a model foundation of logical formulas that correctly express the relationships between parameters and decisions to generate outputs of interest. For example, a simple spreadsheet model may compute a clothing retailer’s profit, given values for the number of ski jackets ordered from the manufacturer and the number of ski jackets demanded by customers. A simulation analysis extends this model by replacing the single value used for ski jacket demand with a probability distribution of possible values of ski jacket demand. A probability distribution of ski jacket demand represents not only the range of possible values but also the relative likelihood of various levels of demand.

To evaluate a decision with a Monte Carlo simulation, an analyst identifies parameters that are not known with a high degree of certainty and treats these parameters as random, or uncertain, variables. The values for the random variables are randomly generated from the specified probability distributions. The simulation model uses the randomly generated values of the random variables and the relationships between parameters and decisions to compute the corresponding values of an output. Specifically, a simulation experiment produces a distribution of output values that correspond to the randomly generated values of the uncertain input variables. This probability distribution of the output values describes the range of possible outcomes, as well as the relative likelihood of each outcome. After reviewing the simulation results, the analyst is often able to make decision recommendations for the controllable inputs that address not only the average output but also the variability of the output.

In this chapter, we construct spreadsheet simulation models using only native Excel functionality. As we will show, practical simulation models for real-world problems can be executed in native Excel. However, there are many simulation software products that provide sophisticated simulation modeling features and automate the generation of outputs such as charts and summary statistics. Some of these software packages can be installed as Excel add-ins, including @RISK, Crystal Ball, and Analytic Solver.

11.1 Risk Analysis for Sanotronics LLC

When making a decision in the presence of uncertainty, the decision maker should be interested not only in the average, or expected, outcome, but also in information regarding the range of possible outcomes. In particular, decision makers are interested in risk analysis, that is, quantifying the likelihood and magnitude of an undesirable outcome. In this section, we show how to perform a risk analysis study for a medical device company called Sanotronics.

Sanotronics LLC is a start-up company that manufactures medical devices for use in hospital clinics. Inspired by experiences with family members who have battled cancer, Sanotronics’s founders have developed a prototype for a new device that limits health care workers’ exposure to chemotherapy treatments while they are preparing, administering, and disposing of these hazardous medications. The new device features an innovative design and has the potential to capture a substantial share of the market.

Sanotronics would like an analysis of the first-year profit potential for the device. Because of Sanotronics’s tight cash flow situation, management is particularly concerned about the potential for a loss. Sanotronics has identified the key parameters in determining first-year profit: selling price per unit (\(p\)), first-year administrative and advertising costs (\(c_a\)), direct labor cost per unit (\(c_\ell\)), parts cost per unit (\(c_p\)), and first-year demand (\(d\)). After conducting market research and a financial analysis, Sanotronics estimates with a high level of certainty that the device’s selling price will be $249 per unit and that the first-year administrative and advertising costs will total $1,000,000.

Sanotronics is not certain about the values for the cost of direct labor, the cost of parts, and the first-year demand. At this stage of the planning process, Sanotronics’s base estimates of these inputs are $45 per unit for the direct labor cost, $90 per unit for the parts cost, and 15,000 units for the first-year demand. We begin our risk analysis by considering a small set of what-if scenarios.

Base-Case Scenario

Sanotronics’s first-year profit is computed as follows:

\[ \text{Profit} = (p - c_\ell - c_p) \times d - c_a \tag{11.1} \]

In appendices to this chapter available in MindTap, we demonstrate the features of the Excel add-in Analytic Solver to construct spreadsheet simulation models.


Recall that Sanotronics is certain of a selling price of $249 per unit, and administrative and advertising costs total $1,000,000. Substituting these values into equation (11.1) yields

\[ \text{Profit} = (249 - c_\ell - c_p) \times d - 1{,}000{,}000 \tag{11.2} \]

Sanotronics’s base-case estimates of the direct labor cost per unit, the parts cost per unit, and first-year demand are $45, $90, and 15,000 units, respectively. These values constitute the base-case scenario for Sanotronics. Substituting these values into equation (11.2) yields the following profit projection:

\[ \text{Profit} = (249 - 45 - 90)(15{,}000) - 1{,}000{,}000 = 710{,}000 \]

Thus, the base-case scenario leads to an anticipated profit of $710,000.

Although the base-case scenario looks appealing, Sanotronics is aware that the values of direct labor cost per unit, parts cost per unit, and first-year demand are uncertain, so the base-case scenario may not occur. To help Sanotronics gauge the impact of the uncertainty, the company may consider performing a what-if analysis. A what-if analysis involves considering alternative values for the random variables (direct labor cost, parts cost, and first-year demand) and computing the resulting value for the output (profit).

Sanotronics is interested in what happens if the estimates of the direct labor cost per unit, parts cost per unit, and first-year demand do not turn out to be as expected under the base-case scenario. For instance, suppose that Sanotronics believes that direct labor costs could range from $43 to $47 per unit, parts cost could range from $80 to $100 per unit, and first-year demand could range from 0 to 30,000 units. Using these ranges, what-if analysis can be used to evaluate a worst-case scenario and a best-case scenario.

Worst-Case Scenario

The worst-case scenario for the direct labor cost is $47 (the highest value), the worst-case scenario for the parts cost is $100 (the highest value), and the worst-case scenario for demand is 0 units (the lowest value). Substituting these values into equation (11.2) leads to the following profit projection:

\[ \text{Profit} = (249 - 47 - 100)(0) - 1{,}000{,}000 = -1{,}000{,}000 \]

So, the worst-case scenario leads to a projected loss of $1,000,000.

Best-Case Scenario

The best-case value for the direct labor cost is $43 (the lowest value), for the parts cost it is $80 (the lowest value), and for demand it is 30,000 units (the highest value). Substituting these values into equation (11.2) leads to the following profit projection:

\[ \text{Profit} = (249 - 43 - 80)(30{,}000) - 1{,}000{,}000 = 2{,}780{,}000 \]

So the best-case scenario leads to a projected profit of $2,780,000.

At this point, the what-if analysis provides the conclusion that profits may range from a loss of $1,000,000 to a profit of $2,780,000 with a base-case profit of $710,000. Although the base-case profit of $710,000 is possible, the what-if analysis indicates that either a substantial loss or a substantial profit is also possible. Sanotronics can repeat this what-if analysis for other scenarios. However, simple what-if analyses do not indicate the likelihood of the various profit or loss values. In particular, we do not know anything about the probability of a loss. To conduct a more thorough evaluation of risk by obtaining insight on the potential magnitude and probability of undesirable outcomes, we now turn to developing a spreadsheet simulation model.
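The three what-if scenarios can be reproduced with a short Python sketch of equation (11.1). The function name `profit` and the dictionary layout are illustrative choices; the input values are the ones stated in the text.

```python
def profit(labor_cost, parts_cost, demand, price=249.0, admin_cost=1_000_000.0):
    """First-year profit: unit margin times demand, less administrative and advertising cost."""
    return (price - labor_cost - parts_cost) * demand - admin_cost

# (direct labor cost per unit, parts cost per unit, first-year demand)
scenarios = {
    "base": (45, 90, 15_000),
    "worst": (47, 100, 0),
    "best": (43, 80, 30_000),
}

results = {name: profit(*inputs) for name, inputs in scenarios.items()}
for name, value in results.items():
    print(f"{name:>5}: {value:>14,.0f}")
```

Like the manual what-if analysis, this evaluates only three points; it says nothing about how likely each outcome is.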

Sanotronics Spreadsheet Model

The first step in constructing a spreadsheet simulation model is to express the relationship between the inputs and the outputs with appropriate formula logic. Figure 11.1 provides the formula and value views for the Sanotronics spreadsheet. Data on selling price per unit, administrative and advertising cost, direct labor cost per unit, parts cost per unit, and demand are in cells B4 to B8. The profit calculation, corresponding to equation (11.1), is expressed in cell B11 using appropriate cell references and formula logic. For the values shown in Figure 11.1, the spreadsheet model computes profit for the base-case scenario. By changing one or more values for the input parameters, the spreadsheet model can be used to conduct a manual what-if analysis (e.g., the best-case and worst-case scenarios).

In Chapter 10, we discuss the use of Data Tables, Goal Seek and Scenario Manager in Excel for what-if analysis. However, these methods do not indicate the relative likelihood of the occurrence of different scenarios.

Use of Probability Distributions to Represent Random Variables

Using the what-if approach to risk analysis, we manually select values for the random variables (direct labor cost per unit, parts cost per unit, and first-year demand), and then compute the resulting profit. Instead of manually selecting the values for the random variables, a Monte Carlo simulation randomly generates values for the random variables so that the values used reflect what we might observe in practice. A probability distribution describes the possible values of a random variable and the relative likelihood of the random variable taking on these values. The analyst can use historical data and knowledge of the random variable (range, mean, mode, and standard deviation) to specify the probability distribution for a random variable. As we describe in the following paragraphs, Sanotronics researched the direct labor cost per unit, the parts cost per unit, and first-year demand to identify the respective probability distributions for these three random variables.

Based on recent wage rates and estimated processing requirements of the device, Sanotronics believes that the direct labor cost will range from $43 to $47 per unit and is described by the discrete probability distribution shown in Figure 11.2. We see that there is a 0.1 probability that the direct labor cost will be $43 per unit, a 0.2 probability that the direct labor cost will be $44 per unit, and so on. The highest probability, 0.4, is associated with a direct labor cost of $45 per unit. Because we have assumed that the direct labor cost per unit is best described by a discrete probability distribution, the direct labor cost per unit can take on only the values of $43, $44, $45, $46, or $47.

Probability distributions are covered in more detail in Chapter 5.

FIGURE 11.1 Excel Worksheet for Sanotronics

Formula view:

Cell    A                                    B
A1      Sanotronics
        Parameters
4       Selling Price per Unit               249
5       Administrative & Advertising Cost    1000000
6       Direct Labor Cost Per Unit           45
7       Parts Cost Per Unit                  90
8       Demand                               15000
        Model
11      Profit                               =((B4-B6-B7)*B8)-B5

Value view:

Cell    A                                    B
A1      Sanotronics
        Parameters
4       Selling Price per Unit               $249.00
5       Administrative & Advertising Cost    $1,000,000
6       Direct Labor Cost Per Unit           $45.00
7       Parts Cost Per Unit                  $90.00
8       Demand                               15,000
        Model
11      Profit                               $710,000.00

Sanotronics

Sanotronics is relatively unsure of the parts cost because it depends on many factors, including the general economy, the overall demand for parts, and the pricing policy of Sanotronics’s parts suppliers. Sanotronics is confident that the parts cost will be between $80 and $100 per unit but is unsure as to whether any particular values between $80 and $100 are more likely than others. Therefore, Sanotronics decides to describe the uncertainty in parts cost with a uniform probability distribution, as shown in Figure 11.3. Costs per unit between $80 and $100 are equally likely. A uniform probability distribution is an example of a continuous probability distribution, which means that the parts cost can take on any value between $80 and $100.

Based on sales of comparable medical devices, Sanotronics believes that first-year demand is described by the normal probability distribution shown in Figure 11.4. The mean μ of first-year demand is 15,000 units. The standard deviation σ of 4,500 units describes the variability in the first-year demand. The normal probability distribution is a continuous probability distribution in which any value is possible, but values much larger or much smaller than the mean are increasingly unlikely.

One advantage of simulation is that the analyst can adjust the probability distributions of the random variables to determine the impact of the assumptions about the shape of the uncertainty on the output measures.

FIGURE 11.2 Probability Distribution for Direct Labor Cost per Unit (bar chart; horizontal axis: Direct Labor Cost per Unit, $43 to $47; vertical axis: Probability, 0 to 0.45)

FIGURE 11.3 Uniform Probability Distribution for Parts Cost per Unit (horizontal axis: Parts Cost per Unit, $80 to $100; constant height of 1/20)

Generating Values for Random Variables with Excel

To simulate the Sanotronics problem, we must generate values for the three random variables and compute the resulting profit. A set of values for the random variables is called a trial. Then we generate another trial, compute a second value for profit, and so on. We continue this process until we are satisfied that enough trials have been conducted to describe the probability distribution for profit. Put simply, simulation is the process of generating values of random variables and computing the corresponding output measures.

In the Sanotronics model, representative values must be generated for the random variables corresponding to direct labor cost per unit, the parts cost per unit, and the first-year demand. To illustrate how to generate these values, we need to introduce the concept of computer-generated random numbers.

Computer-generated random numbers¹ are randomly selected numbers from 0 up to, but not including, 1; this interval is denoted by [0, 1). All values of the computer-generated random numbers are equally likely and so the values are uniformly distributed over the interval from 0 to 1. Computer-generated random numbers can be obtained using built-in functions available in computer simulation packages and spreadsheets. For example, placing the formula =RAND() in a cell of an Excel worksheet will result in a random number between 0 and 1 being placed into that cell.

Let us show how random numbers can be used to generate values corresponding to the probability distributions for the random variables in the Sanotronics example. We begin by showing how to generate a value for the direct labor cost per unit. The approach described is applicable for generating values from any discrete probability distribution.

Table 11.1 illustrates the process of partitioning the interval from 0 to 1 into subintervals so that the probability of generating a random number in a subinterval is equal to the probability of the corresponding direct labor cost. The interval of random numbers from 0 up

¹ Computer-generated random numbers are formally called pseudorandom numbers because they are generated through the use of mathematical formulas and are therefore not technically random. The difference between random numbers and pseudorandom numbers is primarily philosophical, and we use the term random numbers even when they are generated by a computer.

FIGURE 11.4 Normal Probability Distribution for First-Year Demand
[Normal curve. Horizontal axis: Number of Units Sold. Mean μ = 15,000 units; standard deviation σ = 4,500 units.]

11.1 Risk Analysis for Sanotronics LLC 507

to but not including 0.1, [0, 0.1), is associated with a direct labor cost of $43; the interval of random numbers from 0.1 up to but not including 0.3, [0.1, 0.3), is associated with a direct labor cost of $44, and so on. With this assignment of random number intervals to the possible values of the direct labor cost, the probability of generating a random number in any interval is equal to the probability of obtaining the corresponding value for the direct labor cost. Thus, to select a value for the direct labor cost, we generate a random number between 0 and 1 using the RAND function in Excel. If the random number is at least 0.0 but less than 0.1, we set the direct labor cost equal to $43. If the random number is at least 0.1 but less than 0.3, we set the direct labor cost equal to $44, and so on.

Each trial of the simulation requires a value for the direct labor cost. Suppose that on the first trial the random number is 0.9109. From Table 11.1, because 0.9109 is in the interval [0.9, 1.0), the corresponding simulated value for the direct labor cost would be $47 per unit. Suppose that on the second trial the random number is 0.2841. From Table 11.1, the simulated value for the direct labor cost would be $44 per unit.
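The interval lookup described above can be sketched in Python. This is only an illustration, not part of the Excel model: the function name `direct_labor_cost` and the list names are assumptions introduced here, and the interval boundaries are the ones given in Table 11.1.

```python
from bisect import bisect_right

# Values taken from Table 11.1.
COSTS = [43, 44, 45, 46, 47]
# Upper end of each random-number interval: [0, 0.1), [0.1, 0.3), ...
UPPER_ENDS = [0.1, 0.3, 0.7, 0.9, 1.0]


def direct_labor_cost(r):
    """Map a random number r in [0, 1) to a direct labor cost per unit."""
    # bisect_right counts the upper ends that are <= r, which is the index
    # of the half-open interval [lower, upper) that contains r.
    return COSTS[bisect_right(UPPER_ENDS, r)]


# The two trials discussed in the text.
print(direct_labor_cost(0.9109))  # 47
print(direct_labor_cost(0.2841))  # 44
```

Because each interval is closed on the left, a random number of exactly 0.1 maps to $44 rather than $43, matching the "at least 0.1 but less than 0.3" rule.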

Each trial in the simulation also requires a value of the parts cost and first-year demand. Let us now turn to the issue of generating values for the parts cost. The probability distribution for the parts cost per unit is the uniform distribution shown in Figure 11.3. Because this random variable has a different probability distribution than direct labor cost, we use random numbers in a slightly different way to generate simulated values for parts cost. To generate a value for a random variable characterized by a continuous uniform distribution, the following Excel formula is used:

Value of uniform random variable = lower bound + (upper bound − lower bound) × RAND()    (11.3)

For Sanotronics, the parts cost per unit is a uniformly distributed random variable with a lower bound of $80 and an upper bound of $100. Applying equation (11.3) leads to the following formula for generating the parts cost:

Parts cost = 80 + 20 × RAND()    (11.4)

By closely examining equation (11.4), we can understand how it uses random numbers to generate uniformly distributed values for parts cost. The first term of equation (11.4) is 80 because Sanotronics is assuming that the parts cost will never drop below $80 per unit. Because RAND is between 0 and 1, the second term, 20 × RAND(), corresponds to how much more than the lower bound the simulated value of parts cost is. Because RAND is equally likely to be any value between 0 and 1, the simulated value for the parts cost is equally likely to be between the lower bound (80 + 0 = 80) and the upper bound (80 + 20 = 100). For example, suppose that a random number of 0.4576 is generated by the RAND function. As illustrated by Figure 11.5, the value for the parts cost would be

Parts cost = 80 + 20 × 0.4576 = 80 + 9.15 = 89.15 per unit

TABLE 11.1 Random Number Intervals for Generating Value of Direct Labor Cost per Unit

Direct Labor Cost per Unit    Probability    Interval of Random Numbers
$43                           0.1            [0.0, 0.1)
$44                           0.2            [0.1, 0.3)
$45                           0.4            [0.3, 0.7)
$46                           0.2            [0.7, 0.9)
$47                           0.1            [0.9, 1.0)


Suppose that a random number of 0.5842 is generated on the next trial. The value for the parts cost would be

Parts cost = 80 + 20 × 0.5842 = 80 + 11.68 = 91.68 per unit

With appropriate choices of the lower and upper bounds, equation (11.3) can be used to generate values for any continuous uniform probability distribution.
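As a rough Python counterpart to equation (11.3), the sketch below applies the same arithmetic to a supplied random number. The function name `uniform_value` is an assumption made for illustration; in a simulation, `r` would come from a generator such as `random.random()`.

```python
def uniform_value(lower_bound, upper_bound, r):
    """Equation (11.3): scale a random number r in [0, 1) to [lower, upper)."""
    return lower_bound + (upper_bound - lower_bound) * r


# The two parts-cost examples from the text.
print(round(uniform_value(80, 100, 0.4576), 2))  # 89.15
print(round(uniform_value(80, 100, 0.5842), 2))  # 91.68
```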

Lastly, we need a procedure for generating the first-year demand from computer-generated random numbers. Because first-year demand is normally distributed with a mean of 15,000 units and a standard deviation of 4,500 units (see Figure 11.4), we need a procedure for generating random values from this normal probability distribution.

Once again we will use random numbers between 0 and 1 to simulate values for first-year demand. To generate a value for a random variable characterized by a normal distribution with a specified mean and standard deviation, the following Excel formula is used:

Value of normal random variable = NORM.INV(RAND(), mean, standard deviation)    (11.5)

For Sanotronics, first-year demand is a normally distributed random variable with a mean of 15,000 and a standard deviation of 4,500. Applying equation (11.5) leads to the following formula for generating the first-year demand:

Demand = NORM.INV(RAND(), 15000, 4500)    (11.6)

Suppose that the random number of 0.6026 is produced by the RAND function; applying equation (11.6) then results in Demand = NORM.INV(0.6026, 15000, 4500) = 16,170 units. To understand how equation (11.6) uses random numbers to generate normally distributed values for first-year demand, observe from Figure 11.6 that 60.26 percent of the area under the normal curve with a mean of 15,000 and a standard deviation of 4,500 lies to the left of the value of 16,170 generated by the Excel formula =NORM.INV(0.6026, 15000, 4500). Thus, the RAND() function generates a percentage of the area under the normal curve, and then the NORM.INV function generates the corresponding value such that the RAND() percentage lies to the left of this value.

Now suppose that the random number produced by the RAND function is 0.3551. Applying equation (11.6) then results in Demand = NORM.INV(0.3551, 15000, 4500) = 13,328 units. This matches intuition because half of this normal distribution lies below the mean of 15,000 and half lies above it, and so RAND values less than 0.5 result in values of first-year demand below the average of 15,000 units, and RAND values above 0.5 correspond to values of first-year demand above the average of 15,000 units.
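The same inverse-normal lookup can be sketched outside Excel. The example below uses `NormalDist.inv_cdf` from Python's standard `statistics` module as a stand-in for NORM.INV; the function name `normal_value` is an assumption introduced for illustration.

```python
from statistics import NormalDist


def normal_value(mean, standard_deviation, r):
    """Return the value with a fraction r of the normal curve to its left.

    r must be strictly between 0 and 1; inv_cdf raises an error otherwise.
    """
    return NormalDist(mu=mean, sigma=standard_deviation).inv_cdf(r)


# The two first-year demand examples from the text.
print(round(normal_value(15_000, 4_500, 0.6026)))
print(round(normal_value(15_000, 4_500, 0.3551)))
```

A random number of exactly 0.5 returns the mean, which is the symmetry argument made in the preceding paragraph.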

Now that we know how to randomly generate values for the random variables (direct labor cost, parts cost, first-year demand) from their respective probability distributions, we modify the spreadsheet by adding this information. The static values in Figure 11.1 for

Equation (11.5) can be used to generate values for any normal probability distribution by changing the values specified for the mean and standard deviation, respectively.

FIGURE 11.5 Generation of Value for Parts Cost per Unit Corresponding to Random Number 0.4576
[Uniform density of height 1/20 over Parts Cost per Unit from $80 to $100; an area of 0.4576 lies to the left of $89.15.]


FIGURE 11.6 Generation of Value for First-Year Demand Corresponding to Random Number 0.6026
[Normal curve with μ = 15,000 and σ = 4,500 units. Horizontal axis: Number of Units Sold. An area of 0.6026 lies to the left of 16,170.]

FIGURE 11.7 Formula Worksheet for Sanotronics
[Spreadsheet formula view, columns A to D, rows 1 to 23. Recoverable contents:]

Parameters
  Selling Price per Unit (B4): 249
  Administrative & Advertising Cost (B5): 1000000
  Direct Labor Cost Per Unit (B6): =VLOOKUP(RAND(),A15:C19,3,TRUE)
  Parts Cost Per Unit (B7): =B22+(B23-B22)*RAND()
  Demand (B8): =NORM.INV(RAND(),D22,D23)

Model
  Profit: =((B4-B6-B7)*B8)-B5

Direct Labor Cost (rows 15 to 19)
  Lower End of Interval (A15:A19): 0, =B15, =B16, =B17, =B18
  Upper End of Interval (B15:B19): =D15+A15, =D16+A16, =D17+A17, =D18+A18, 1
  Cost per Unit (C15:C19): 43, 44, 45, 46, 47
  Probability (D15:D19): 0.1, 0.2, 0.4, 0.2, 0.1

Parts Cost (Uniform): Lower Bound (B22) 80; Upper Bound (B23) 100
Demand (Normal): Mean (D22) 15000; Standard Deviation (D23) 4500

these parameters in cells B6, B7, and B8 are replaced with cell formulas that will randomly generate values whenever the spreadsheet is recalculated (as shown in Figure 11.7). Cell B6 uses a random number generated by the RAND function and looks up the corresponding direct labor cost per unit by applying the VLOOKUP function to the table of intervals contained in cells A15:C19 (which corresponds to Table 11.1). Cell B7 executes equation (11.4) using references to the lower bound and upper bound of the uniform

For further description of the VLOOKUP function, refer to Chapter 10.


Cell H26 =COUNT(E26:E1025)
Cell H27 =MIN(E26:E1025)
Cell H28 =MAX(E26:E1025)
Cell H29 =AVERAGE(E26:E1025)
Cell H30 =STDEV.S(E26:E1025)
Cell H32 =COUNTIF(E26:E1025,"<0")/COUNT(E26:E1025)
Cell H33 =SQRT(H32*(1-H32)/H26)

distribution of the parts cost in cells B22 and B23, respectively.² Cell B8 executes equation (11.6) using references to the mean and standard deviation of the normal distribution of the first-year demand in cells D22 and D23, respectively.³

Executing Simulation Trials with Excel

Each trial in the simulation involves randomly generating values for the random variables (direct labor cost, parts cost, and first-year demand) and computing profit. To facilitate the execution of multiple simulation trials, we use Excel’s Data Table functionality in an unorthodox, but effective, manner. To set up the spreadsheet for the execution of 1,000 simulation trials, we structure a table as shown in cells A25 through E1025 in Figure 11.8. As Figure 11.8 shows, A26:A1025 numbers the 1,000 simulation trials (rows 47 through 1,024 are hidden). Cells B26:E26 contain references to the cells corresponding to Direct Labor Cost, Parts Cost per Unit, Demand, and Profit. To populate the table of simulation trials in cells A26 through E1025, we execute the following steps:

Step 1. Select cell range A26:E1025
Step 2. Click the Data tab in the Ribbon
Step 3. Click What-If Analysis in the Forecast group and select Data Table…
Step 4. When the Data Table dialog box appears, leave the Row input cell: box blank and enter any empty cell in the spreadsheet (e.g., D1) into the Column input cell: box
Step 5. Click OK

Figure 11.9 shows the results of a set of 1,000 simulation trials. After executing the simulation with the data table, each row in this table corresponds to a distinct simulation trial consisting of different values of the random variables. In Trial 1 (row 26 in the spreadsheet), we see that the direct labor cost is $45 per unit, the parts cost is $85.56 per unit, and first-year demand is 8,675 units, resulting in profit of $27,434. In Trial 2 (row 27 in the spreadsheet), we observe random variables of $47 for the direct labor cost, $86.52 for the parts cost, and 12,372 for first-year demand. These values result in a simulated profit of $428,703 on the second simulation trial. We note that every time the spreadsheet recalculates (by pressing the F9 key), new random values are generated by the RAND() functions resulting in a new set of simulation trials.
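For readers who want to see the trial loop outside a spreadsheet, the following is a minimal Python sketch of the same 1,000-trial experiment. It is not the Data Table mechanism itself; all function and constant names are assumptions chosen here, and the seed is arbitrary, so the individual trial values will differ from those in Figure 11.9.

```python
import random
from bisect import bisect_right
from statistics import NormalDist, mean

SELLING_PRICE = 249
ADMIN_AND_ADVERTISING_COST = 1_000_000
LABOR_COSTS = [43, 44, 45, 46, 47]
LABOR_UPPER_ENDS = [0.1, 0.3, 0.7, 0.9, 1.0]  # from Table 11.1
DEMAND = NormalDist(mu=15_000, sigma=4_500)


def open_unit_random(rng):
    """Return a random number strictly inside (0, 1), as inv_cdf requires."""
    r = rng.random()
    while r == 0.0:
        r = rng.random()
    return r


def simulate_trial(rng):
    """Generate one trial: (labor cost, parts cost, demand, profit)."""
    labor = LABOR_COSTS[bisect_right(LABOR_UPPER_ENDS, rng.random())]
    parts = 80 + 20 * rng.random()                   # equation (11.4)
    demand = DEMAND.inv_cdf(open_unit_random(rng))   # equation (11.6)
    profit = (SELLING_PRICE - labor - parts) * demand - ADMIN_AND_ADVERTISING_COST
    return labor, parts, demand, profit


def run_simulation(number_of_trials=1000, seed=None):
    """Run independent trials; a fixed seed makes the run repeatable."""
    rng = random.Random(seed)
    return [simulate_trial(rng) for _ in range(number_of_trials)]


trials = run_simulation(1000, seed=1)
profits = [trial[3] for trial in trials]
print(f"Mean profit over {len(profits)} trials: {mean(profits):,.0f}")
```

Calling `run_simulation` again with a different seed plays the role of pressing F9: every random variable is redrawn and a new set of trials results.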

Measuring and Analyzing Simulation Output

The analysis of the output observed over a set of simulation trials is a critical part of a simulation process. For the collection of simulation trials, it is helpful to compute descriptive statistics such as sample count, minimum sample value, maximum sample value, sample mean, sample standard deviation, sample proportion, and sample standard error of the proportion. To compute these statistics for the Sanotronics example, we use the following Excel functions:

² Technically, random variables modeled with continuous probability distributions should be appropriately rounded to avoid modeling error. For example, the simulated values of parts cost per unit should be rounded to the nearest penny. To simplify exposition, we do not worry about the small amount of error that occurs in this case. To model these random variables more accurately, the formula in cell B7 should be =ROUND(B22+(B23-B22)*RAND(),2).

³ In addition to being a continuous distribution that technically requires rounding when applied to discrete phenomena (like units of medical device demand), the normal distribution also allows negative values. The probability of a negative value is quite small in the case of first-year demand, and we simply ignore the small amount of modeling error for the sake of simplicity. To model first-year demand more accurately, the formula in cell B8 should be =MAX(ROUND(NORM.INV(RAND(),D22,D23),0),0).

These steps iteratively select the simulation trial number from the range A26 through A1025 and substitute it into the blank cell selected in Step 4 (D1). This substitution has no bearing on the spreadsheet, but it forces Excel to recalculate the spreadsheet each time, thereby generating new random numbers with the RAND functions in cells B6, B7, and B8.

SanotronicsModel

Sanotronics


FIGURE 11.8 Setting Up Sanotronics Spreadsheet for 1,000 Simulation Trials

Cell H32 computes the ratio of the number of trials whose profit is less than zero over the total number of trials. By changing the value of the second argument in the COUNTIF function, the probability that the profit is less than any specified value can be computed in cell H32. Cell H33 computes the sample standard error of the proportion using the formula $\sqrt{p(1-p)/n}$, where p is the sample proportion of observations satisfying a criterion (profit less than $0 in this case) and n is the sample size (1,000 in this case). The sample


standard error of the proportion provides a measure of how much the sample proportion P(Profit < $0) varies across different samples of 1,000 simulation trials.

As shown in Figure 11.9, the 1,000 profit observations range from −$1,011,895 to $2,302,801. The sample mean profit is $712,014 and the sample standard deviation is $524,726. There is a sample proportion of 0.087 of the observations with negative profit and the sample standard error of this estimate is 0.009.
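The descriptive statistics described for cells H26 through H33 can be sketched in Python as follows. The function names and the dictionary keys are assumptions made for illustration; `statistics.stdev` computes the sample standard deviation, which is the quantity STDEV.S reports.

```python
from math import sqrt
from statistics import mean, stdev


def standard_error_of_proportion(p, n):
    """Sample standard error of a proportion: sqrt(p(1 - p)/n)."""
    return sqrt(p * (1 - p) / n)


def summarize(profits, threshold=0.0):
    """Descriptive statistics for a list of simulated profit values."""
    n = len(profits)
    proportion = sum(1 for value in profits if value < threshold) / n
    return {
        "count": n,
        "min": min(profits),
        "max": max(profits),
        "mean": mean(profits),
        "stdev": stdev(profits),            # sample standard deviation
        "proportion_below": proportion,     # estimate of P(Profit < threshold)
        "standard_error": standard_error_of_proportion(proportion, n),
    }


# Check against the reported values: p = 0.087 and n = 1,000.
print(round(standard_error_of_proportion(0.087, 1000), 3))  # 0.009
```

Changing `threshold` corresponds to changing the second argument of COUNTIF, so the same function estimates the probability that profit falls below any chosen value.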

To visualize the distribution of profit on which these descriptive statistics are based, we create a histogram using the FREQUENCY function and a column chart. In Figure 11.9, the cell range J27:J44 contains the upper limits of the bins into which we wish to group the 1,000 simulated observations of profit listed in cells E26:E1025.

Step 1. Select cells K27:K46
Step 2. In the Formula Bar, enter the formula =FREQUENCY(E26:E1025, J27:J45)
Step 3. Press CTRL+SHIFT+ENTER after entering the formula in Step 2

Pressing CTRL+SHIFT+ENTER in Excel indicates that the function should return an array of values to fill the cell range K27:K46. For example, K27 contains the number of

Simulation studies enable an objective estimate of the probability of a loss, which is an important aspect of risk analysis.

For a detailed description of the FREQUENCY function and creating charts in Excel, see Chapters 2 and 3.

FIGURE 11.9 Output from Sanotronics Simulation


profit observations less than −$1,500,000, cell K28 contains the number of profit observations greater than or equal to −$1,500,000 and less than −$1,250,000, cell K29 contains the number of profit observations greater than or equal to −$1,250,000 and less than −$1,000,000, and so on.
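The binning performed by the FREQUENCY function can be approximated with a short Python sketch. The function name `frequency` is an assumption; the sketch follows what we understand to be Excel's convention, in which a value exactly equal to a bin's upper limit is counted in that bin, and one extra count is returned for values above the last limit. For continuous profit values, ties with a bin limit are rare, so this differs little from the description above.

```python
from bisect import bisect_left


def frequency(data, bin_upper_limits):
    """Count the values falling in each bin defined by sorted upper limits.

    Returns len(bin_upper_limits) + 1 counts; the final count holds the
    values larger than the last upper limit.
    """
    counts = [0] * (len(bin_upper_limits) + 1)
    for value in data:
        # Index of the first upper limit that is >= value.
        counts[bisect_left(bin_upper_limits, value)] += 1
    return counts


# Small illustration with made-up profit values and three bin limits.
sample_profits = [-300_000, -50_000, 0, 120_000, 600_000]
print(frequency(sample_profits, [-250_000, 0, 250_000]))  # [1, 2, 1, 1]
```

The extra trailing count is why the spreadsheet selects 20 output cells (K27:K46) for 19 bin limits (J27:J45).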

To construct the column chart based on this frequency data:

Step 1. Select cells K27:K46
Step 2. Click the Insert tab on the Ribbon
Step 3. Click the Insert Column or Bar Chart button in the Charts group
Step 4. When the list of bar chart subtypes appears, click the Clustered Column button in the 2-D Column section
Step 5. Select the column chart that was just created and then click the Chart Tools tab on the Ribbon
Step 6. Click the Select Data button in the Data group
Step 7. In the Select Data Source dialog box:
    In the Horizontal (Category) Axis Labels area, click Edit
    When the Axis Labels dialog box appears, select the cell range J27:J46 and click OK
    Click OK
Step 8. Click on the text box above the chart, and replace “Chart Title” with Profit Distribution

Figure 11.9 shows that the distribution of profit values is fairly symmetric, with a large number of values between $0 and $1,500,000. Only 10 trials out of 1,000 resulted in a loss of more than $500,000, and only 3 trials resulted in a profit greater than $2,000,000. The bin with the largest number of values has profit ranging between $500,000 and $750,000; 91 trials resulted in a profit between $500,000 and $750,000.

In comparing the simulation approach to the manual what-if approach, we observe that much more information is obtained using simulation. Recall that the what-if analysis in Section 11.1 projected a profit of $710,000 in the base-case scenario, a loss of $1,000,000 in the worst-case scenario, and a profit of $2,591,000 in the best-case scenario. From the 1,000 trials of the simulation that have been run, we see that extremes such as the worst- and best-case scenarios, although possible, are unlikely. Indeed, the advantage of simulation for risk analysis is the information it provides on the likelihood of output values. For the assumed distributions of the direct labor cost, parts cost, and demand, we now have estimates of the probability of a loss, how the profit values are distributed over their range, and what profit values are most likely.

When pressing the F9 key to generate a new set of 1,000 simulation trials, we observe that the summary statistics vary. In particular, the sample mean profit and the estimated probability of a negative profit fluctuate for each new set of simulation trials. To account for this sampling error, we can construct confidence intervals on the mean profit and proportion of observations with negative profit. Recall that the general formula for a confidence interval is point estimate ± margin of error. To compute the confidence intervals for the Sanotronics example, we use the following Excel functions:

For more background on confidence intervals, see Chapter 6.

Cell H36 =H29 - CONFIDENCE.T(0.05, H30, H26)
Cell H37 =H29 + CONFIDENCE.T(0.05, H30, H26)
Cell H39 =H32 - (NORM.S.INV(0.975)*H33)
Cell H40 =H32 + (NORM.S.INV(0.975)*H33)

Cells H36 and H37 compute the lower and upper limits of a 95% confidence interval of the mean profit. To compute the margin of error for this interval estimate, the Excel CONFIDENCE.T function requires three arguments: the significance level (1 − confidence level), the sample standard deviation, and the sample size.


Cells H39 and H40 compute the lower and upper limits of a 95% confidence interval of the proportion of observations with a negative profit. To compute the margin of error for this interval estimate, the sample standard error of the proportion (in cell H33) is multiplied by the z-value corresponding to a 95% confidence level (as calculated by =NORM.S.INV(0.975)).
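The interval calculations can be sketched in Python under one stated simplification: the standard library has no t-distribution quantile, so this sketch uses the normal z-value for both intervals instead of CONFIDENCE.T. With 1,000 trials the two multipliers are nearly identical, so the limits for the mean land within a few tens of dollars of the t-based limits. The function name `z_interval` is an assumption, and the inputs are the summary statistics reported for Figure 11.9.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(point_estimate, standard_error, confidence=0.95):
    """Return (lower, upper) = point estimate -/+ z * standard error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin_of_error = z * standard_error
    return point_estimate - margin_of_error, point_estimate + margin_of_error


n = 1000

# Mean profit: sample mean 712,014 and sample standard deviation 524,726.
mean_low, mean_high = z_interval(712_014, 524_726 / sqrt(n))

# Proportion of trials with negative profit: p = 0.087.
p = 0.087
prop_low, prop_high = z_interval(p, sqrt(p * (1 - p) / n))

print(f"Mean profit: {mean_low:,.0f} to {mean_high:,.0f}")
print(f"P(Profit < 0): {prop_low:.3f} to {prop_high:.3f}")
```

Passing a larger `n` with the same standard deviation narrows both intervals, since each standard error shrinks in proportion to the square root of the number of trials.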

Figure 11.9 shows a 95% confidence interval on the mean profit ranging from $679,452 to $744,575 and a 95% confidence interval on the probability of a negative profit ranging from 0.070 to 0.104. A common misinterpretation is to relate the 95% confidence interval on the mean profit to the profit distribution of the 1,000 simulated profit values displayed in Figure 11.9. Looking at the profit distribution, it should be clear that 95% of the values do not lie in the range [$679,452 to $744,575] suggested by the 95% confidence interval. The 95% confidence interval relates only to the confidence we have in the estimation of the mean profit, not the likelihood of an individual profit observation. If we desire an interval that contains 95% of the profit observations, we can construct this by using the Excel PERCENTILE.EXC function. For the Sanotronics example, PERCENTILE.EXC(E26:E1025,0.025) = −$322,562 and PERCENTILE.EXC(E26:E1025,0.975) = $1,736,965 provide the lower and upper limits of an interval estimating the range that is 95% likely to contain the profit outcome.

The simulation results help Sanotronics’s management better understand the profit/loss potential of the new medical device. An estimated 0.070 to 0.104 probability of a loss with an estimated mean profit between $679,452 and $744,575 may be acceptable to management. On the other hand, Sanotronics might want to conduct further market research before deciding whether to introduce the product. In any case, the simulation results should be helpful in reaching an appropriate decision.

Recall that =NORM.S.INV(0.975) computes the value such that 2.5% of the area under the standard normal distribution lies in the upper tail defined by this value.

NOTES + COMMENTS

1. In the preceding section, we showed how to generate values for random variables from a generic discrete distribution, a uniform distribution, and a normal distribution. Generating values for a normally distributed random variable required the use of the NORM.INV and RAND functions. When using the Excel formula =NORM.INV(RAND(), μ, σ), the RAND() function generates a random number r between 0 and 1 and then the NORM.INV function identifies the smallest value k such that $P(X \le k) \ge r$, where X is a normal random variable with mean μ and standard deviation σ. Similarly, the RAND function can be used with the Excel functions BETA.INV, BINOM.INV, GAMMA.INV, and LOGNORM.INV to generate values for a random variable with a beta distribution, binomial distribution, gamma distribution, or lognormal distribution, respectively. Using a different probability distribution for a random variable simply changes the relative likelihood of the random variable realizing certain values. The choice of probability distribution to use for a random variable should be based on historical data and knowledge of the analyst. In Appendix 11.1, we discuss several probability distributions and how to generate them with native Excel functions.

2. We can reduce the width of the confidence intervals associated with the sample mean and the sample proportion computed from a set of simulation trials by increasing the number of trials beyond 1,000. However, increasing the number of trials can begin to tax the computational capabilities of Excel. When more than 1,000 trials are necessary to reduce the sampling error, the analyst may want to restrict Excel to only update values upon a specific command rather than updating anytime the Enter key is pressed in Excel. This can be accomplished by choosing File from the Ribbon, clicking Options, choosing Formulas, and then changing the Calculation options to Manual. When this change is made, Excel will update values only when the F9 key is pressed.

11.2 Simulation Modeling for Land Shark Inc.

Land Shark Inc., a real estate company, purchases properties that it develops and then resells. In the past, Land Shark has successfully acquired properties via first-price sealed-bid auctions involving commercial and residential properties. In such auctions, each bidder


submits a single concealed bid. The submitted bids are then compared, and the party with the highest bid wins the property and pays the bid amount. In case of a tie (a rare occurrence), a coin flip decides the winner.

Land Shark has been reviewing upcoming property auctions and has identified a commercial property of interest. Land Shark estimates the value of this property to be $1,389,000. Using bidding data disclosed to the public, Land Shark has maintained a file summarizing 56 previous auctions that it believes are similar to the upcoming property auction. Table 11.2 displays a portion of Land Shark’s bid data. The data for all 56 auctions are in the Auctions worksheet of the file LandShark. Because the property value up for sale varies between auctions, Land Shark expresses the submitted bid amounts as fractions of the respective property’s value to make the bids in different auctions comparable. These bid fractions can be converted into a bid amount (in dollars) by multiplying the bid fraction by the estimated value of the property under auction. Land Shark is considering a bid of $1,229,000 and would like to evaluate its chances of winning the upcoming auction with this bid.

Spreadsheet Model for Land Shark

To evaluate Land Shark’s chances of winning the auction, we develop a simulation model for the auction. Our first step in modeling the upcoming property auction is to identify the input parameters and output measures. The next step is to develop a spreadsheet model that correctly computes the values of the output measures given static values of the input parameters. Then we prepare the spreadsheet model for simulation analysis by replacing the static values of the input parameters that Land Shark does not know with certainty with probability distributions of possible values.

The relevant input parameters for the upcoming auction are the estimated value of the property, the number of bidders competing against Land Shark, the bid amounts submitted by the competitors, and Land Shark’s bid amount. Land Shark is certain about its estimate that the property is worth $1,389,000. Furthermore, Land Shark controls its bid amount and it would like to evaluate a bid amount of $1,229,000. However,

LandShark

TABLE 11.2 Bid Data on Commercial Property Auctions

Bid Amount (as a Fraction of Estimated Property Value)

Property No.   Bid 1   Bid 2   Bid 3   Bid 4   Bid 5   Bid 6   Bid 7   Bid 8
1              0.830   0.797   0.833   0.878   0.839   0.843
2              0.835   0.823   0.781   0.892   0.767   0.787
3              0.763   0.862   0.814   0.895
4              0.771   0.859   0.867   0.850   0.833
5              0.836   0.898   0.831   0.897   0.831   0.657   0.846
6              0.850   0.863   0.825   0.910   0.848
7              0.890   0.820   0.874   0.877   0.818
8              0.804   0.881   0.786   0.884   0.773   0.819   0.824
9              0.819   0.851   0.786   0.896   0.784   0.792
10             0.860   0.756   0.876   0.887   0.866
11             0.880   0.834   0.831   0.871   0.857   0.759
12             0.810   0.870
13             0.887   0.716   0.817   0.900   0.869   0.885   0.856   0.761


Land Shark is uncertain about the number of competing bidders and the bid amounts submitted by these competitors.

The output measures in which we are interested are whether Land Shark wins the simulated auction given its specified amount and Land Shark’s net return. If Land Shark wins the auction, its return is computed as the difference between the estimated value of the property and its bid amount. If Land Shark does not win the auction, its return is $0.

To understand how to construct the logic for determining whether Land Shark wins an auction and its return from the auction, let’s first consider static values for the input parameters. Based on Land Shark’s data on the past 56 auctions, the number of competitor bids ranges from two to eight. Therefore, there may be as many as eight different bid amounts submitted by competitors. Suppose those eight competitor bid amounts (as a fraction of the property’s estimated value) are 0.887, 0.716, 0.817, 0.900, 0.869, 0.885, 0.856, and 0.761. However, it is possible that not all eight of these bid amounts will be submitted for an auction. Suppose only four competitors decide to submit bids in the auction. Then we only want to consider four of the eight bid amounts. If the bid amounts are listed in a random order (which they are in this case), we can just select the first four bid amounts and ignore the last four. In this case, the four competing bid amounts (expressed in dollars) are: (0.887)($1,389,000) = $1,232,043; (0.716)($1,389,000) = $994,524; (0.817)($1,389,000) = $1,134,813; and (0.900)($1,389,000) = $1,250,100. The largest competing bid amount is then the maximum of these four bid amounts, or $1,250,100. We compare Land Shark’s bid ($1,229,000) to the largest competing bid ($1,250,100) and observe that in this scenario, Land Shark does not win the auction, so its return is $0.

In the example in the previous paragraph, we determined the largest bid from four competitors by considering only the first four competitor bids and ignoring the last four. In general, the number of competitor bids is uncertain and varies from two to eight. Therefore, we need to devise a spreadsheet model that will correctly compute the largest competing bid amount from among a varying number of bids. Figure 11.10 shows the formula view and value view of the spreadsheet implementing one way to model the problem. Cell B4 contains the estimated value of the property (Land Shark is certain of this value) and cell B5 contains a value for the number of bidders (Land Shark is uncertain of this value). Cell range B8:B15 contains the values of eight possible competing bids expressed as fractions of the property’s estimated value (Land Shark is uncertain of these values). Cells C8 through C15 express the respective bid fractions in cells B8 through B15 as dollar amounts using the IF function to determine if the bid should be considered or effectively eliminated. If a bid index (from the range A8:A15) exceeds the realized number of bidders in cell B5, the corresponding bid amount in the cell range C8:C15 is set to $0; otherwise the bid amount is computed. For example, consider the formula in cell C8, =IF(A8>$B$5, 0, B8*$B$4). This formula compares the bid index in cell A8 to the number of bidders in cell B5, and if the bid index exceeds the number of bidders, a bid amount of $0 is calculated so that the bid is not considered. Otherwise, the bid amount is calculated by multiplying the bid fraction by the estimated value of the property.

Cell B18 contains Land Shark’s bid amount. Cell B19 computes the largest competing bid by taking the maximum value over the range C8:C15. Land Shark tracks two output measures: whether it wins the auction and the return from the auction. By comparing Land Shark’s bid amount in cell B18 to the largest competitor bid in cell B19, the logic =IF(B18>B19,1,0) in cell B20 returns a value of 1 if Land Shark wins the auction and a value of 0 if Land Shark loses the auction. The value of 1 or 0 in cell B20 to denote a Land Shark win or loss allows the simulation model to easily count the number of times Land Shark wins the auction over a set of simulation trials. The formula in cell B21, =B20*(B4-B18), computes the return from the auction; if Land Shark wins the auction, the return is equal to the estimated value minus the bid amount; otherwise the return is zero because the value of cell B20 will be zero.
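The spreadsheet logic for a single static scenario can be sketched in Python. This mirrors the formulas described for cells C8:C15 and B19:B21 only; the function name `auction_outcome` and its parameter names are assumptions introduced here, and the bid fractions are the eight static values used in the text.

```python
def auction_outcome(estimated_value, land_shark_bid, bid_fractions, number_of_bidders):
    """Return (largest competitor bid, win indicator, Land Shark return)."""
    # Bids whose index exceeds the number of bidders are set to 0,
    # as in =IF(A8>$B$5, 0, B8*$B$4).
    bid_amounts = [
        0 if index > number_of_bidders else fraction * estimated_value
        for index, fraction in enumerate(bid_fractions, start=1)
    ]
    largest_competitor_bid = max(bid_amounts)                 # =MAX(C8:C15)
    win = 1 if land_shark_bid > largest_competitor_bid else 0  # =IF(B18>B19,1,0)
    land_shark_return = win * (estimated_value - land_shark_bid)  # =B20*(B4-B18)
    return largest_competitor_bid, win, land_shark_return


fractions = [0.887, 0.716, 0.817, 0.900, 0.869, 0.885, 0.856, 0.761]
largest, win, land_shark_return = auction_outcome(1_389_000, 1_229_000, fractions, 4)
print(f"Largest competitor bid: {largest:,.0f}; win: {win}; return: {land_shark_return:,.0f}")
```

Representing the win as 1 or 0, as the spreadsheet does, means that summing the indicator over many simulated auctions gives the number of wins directly.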

For a detailed discussion of the IF function in Excel, see Chapter 10; for a discussion of relative and absolute cell references, see Appendix A.

11.2 Simulation Modeling for Land Shark Inc. 517

Generating Values for Land Shark’s Random Variables

In the Land Shark simulation model constructed in Figure 11.10, the uncertain quantities are the number of competing bidders and how much the competitors will bid (as a fraction of the property’s estimated value). In this section, we discuss how to specify probability distributions for these uncertain quantities, or random variables.

First, consider the number of bidders. Figure 11.11 contains the frequency distribution of the number of bidders for the 56 previous auctions that Land Shark has tracked in the Auctions worksheet of the file LandShark. The number of bidders has ranged from two to eight over the past 56 auctions. Unless Land Shark has reason to believe that there may be fewer than two bids on an upcoming auction, it is probably safe to assume that there will

Chapter 2 discusses frequency distributions in more detail.

FIGURE 11.10 Base Spreadsheet Model for Land Shark

Formula view
Parameters
B4 Estimated Value: 1389000
B5 Number of Bidders: 4
A8:A15 Bid Index: 1 through 8
B8:B15 Bid Fraction: eight values between 0.716 and 0.900 (0.887 in cell B8)
C8:C15 Bid Amount: =IF(A8>$B$5,0,B8*$B$4) through =IF(A15>$B$5,0,B15*$B$4)
Model
B18 Land Shark Bid Amount: 1229000
B19 Largest Competitor Bid: =MAX(C8:C15)
B20 Land Shark Win Auction?: =IF(B18>B19,1,0)
B21 Land Shark Return: =B20*(B4-B18)

Value view
Parameters
B4 Estimated Value: $1,389,000
B5 Number of Bidders: 4
C8:C15 Bid Amount: $1,232,043, $994,524, $1,134,813, $1,250,100, $0, $0, $0, $0
Model
B18 Land Shark Bid Amount: $1,229,000
B19 Largest Competitor Bid: $1,250,100
B20 Land Shark Win Auction?: 0
B21 Land Shark Return: $0

518 Chapter 11 Monte Carlo Simulation

be a minimum of two competing bids. There has not been an auction with more than eight bidders, so eight is a reasonable assumption for the maximum number of competing bids unless Land Shark’s experience with the local real estate market suggests that more than eight competing bids are possible.

Figure 11.11 suggests that the relative likelihood of different values for the number of bidders appears to be equal. Thus, Land Shark decides to model the number of bidders to be 2, 3, 4, 5, 6, 7, or 8 with equal probability. In this case, the integer uniform distribution is the appropriate choice, as it is characterized by a series of equally likely consecutive integers over a specified range.

To generate a value for a random variable characterized by an integer uniform distribution, the following Excel formula is used:

Value of integer uniform random variable = RANDBETWEEN(lower integer value, upper integer value) (11.7)

For Land Shark, the lower integer value is 2 and the upper integer value is 8. Applying equation (11.7), we enter the formula =RANDBETWEEN(2, 8) into cell B5.
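As a rough counterpart to equation (11.7) outside Excel, the sketch below uses Python’s standard `random.randint`, which, like RANDBETWEEN, includes both endpoints. The function name and the fixed seed are assumptions made only so the illustration is repeatable.

```python
import random


def number_of_bidders(lower=2, upper=8):
    # random.randint returns an integer N with lower <= N <= upper,
    # each value equally likely, as with RANDBETWEEN(lower, upper).
    return random.randint(lower, upper)


random.seed(1)  # seed chosen arbitrarily so the sketch is repeatable
draws = [number_of_bidders() for _ in range(7000)]
counts = {k: draws.count(k) for k in range(2, 9)}
print(counts)
```

Over many draws, each of the seven values from 2 to 8 should appear with roughly equal frequency.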

Each competitor’s bid fraction is also a random variable. From the past 56 auctions, there has been a total of 280 observations of how competitors have bid (as a fraction of the respective property’s estimated value). These 280 bid amounts from the Auctions worksheet have been relisted in the BidList worksheet in the file LandShark. Figure 11.12 contains a histogram of the bid amount data grouped into 13 bins. We see that the bid amount distribution is negatively skewed, and that bid amounts most commonly occur in the range (0.875, 0.90).

There are several ways we could use the 280 bid amount observations as a basis for simulating bid amount values in our spreadsheet model. One way would be to use Figure 11.12 as the basis for choosing a discrete probability distribution to represent this uncertain value (in the same manner we generated values for direct labor cost per unit in the Sanotronics problem). However, such a discrete probability distribution would result in a loss of information, as only bid percentages of, say, 0.65, 0.675, 0.70, 0.725, 0.75, 0.775, 0.80, 0.825, 0.85, 0.875, 0.90, 0.925, and 0.95 would be possible. From the 280 observations, we see that bid percentages take on many values between the minimum of 0.645 and the maximum

The integer uniform distribution is a special case of the discrete uniform distribution discussed in Chapter 5. In both distributions, all values are equally likely. However, in the integer uniform distribution, the possible values are consecutive integers over the defined range. In a general discrete uniform distribution, the possible values do not have to be consecutive integers (or even integers), but rather just a set of distinct, discrete values.

FIGURE 11.11 Frequency Distribution of Number of Bidders in 56 Previous Auctions

[Histogram: horizontal axis Number of Bidders; vertical axis Frequency]


of 0.947. Therefore, assuming a discrete probability distribution may not be preferred for generating bid percentage values.

Two other primary alternatives are to either directly sample from the 280 observations to generate values for simulation trials, or to fit a continuous probability distribution based on the 280 observations. We will describe the approach of directly sampling from the data and discuss distribution fitting later in this section.

Directly sampling from data is a good modeling choice if Land Shark believes that these 280 bid fraction values are an accurate representation of the distribution of future bids. We will simulate the bids for the upcoming auction by randomly selecting a value from one of these 280 bid fraction values. To sample a value for a bid fraction from the set of 280 possible values, we use the Excel formula:

=VLOOKUP(RANDBETWEEN(1, 280), BidList!$A$2:$B$281, 2, FALSE)

When sampling values directly from sample data, we note that only values that exist in the data will be possible values for a simulation trial. Resampling empirical data is a good approach only when the data adequately represent the range of possible values and the distribution of values across this range. If the sample data do not adequately describe the set of possible values for a random variable, it may be more appropriate to identify a probability distribution that closely fits the data and sample from the fitted probability distribution rather than just sampling directly from the data.
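The VLOOKUP and RANDBETWEEN combination picks a random row and returns the bid fraction stored there. A minimal Python sketch of the same idea follows; the short list of observations is a hypothetical stand-in for the 280 values in the BidList worksheet, and the function name is an assumption.

```python
import random

# Hypothetical stand-in for the 280 observed bid fractions (not the actual data).
observed_bid_fractions = [0.645, 0.716, 0.761, 0.818, 0.856,
                          0.869, 0.885, 0.887, 0.900, 0.947]


def sample_bid_fraction(observations):
    # Choose a row number at random, then look up the value in that row,
    # which is the role played by RANDBETWEEN and VLOOKUP in the spreadsheet.
    row = random.randint(1, len(observations))
    return observations[row - 1]


random.seed(3)  # arbitrary seed for repeatability
sampled = [sample_bid_fraction(observed_bid_fractions) for _ in range(1000)]
print(sorted(set(sampled)))
```

As the paragraph above notes, every sampled value is one of the observed values; nothing between or beyond them can occur.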

Executing Simulation Trials and Analyzing Output

Each trial in the simulation of the auction involves randomly generating values for the number of bidders and the eight possible bid fractions and then computing whether Land Shark wins the auction and its return from the auction. To prepare the spreadsheet for the execution of 1,000 simulation trials, we structure the spreadsheet as in Figure 11.13. The cell range from A24 through L1024 has been prepared to hold the set of 1,000 simulation trials. Cell range A25:A1024 numbers the rows that will correspond to the 1,000 simulation trials (rows 43 through 1023 are hidden). The first row of the table (cells B25 through L25) contains Excel formulas referencing the random variables (number of bidders and the eight possible bid amounts) as well as the two output measures (whether Land Shark wins the auction and its return from the auction).

Only the output measures are strictly necessary to include in the table of 1,000 simulation trials, but we include the uncertain inputs as well for exposition.

FIGURE 11.12 Frequency Distribution of 280 Bid Fractions in 56 Previous Auctions

[Histogram: horizontal axis Bid Amount (Fraction of Property Value); vertical axis Frequency]


To populate the table of simulation trials in the Model worksheet, we execute the following steps:

Step 1. Select cell range A25:L1024
Step 2. Click the Data tab in the Ribbon
Step 3. Click What-If Analysis in the Forecast group and select Data Table…
Step 4. When the Data Table dialog box appears, leave the Row input cell: box blank and enter any empty cell in the spreadsheet (e.g., D1) into the Column input cell: box
Step 5. Click OK
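The Data Table steps above recalculate the model once per row. A loop plays the same role in a programming language. The Python sketch below is an illustration under stated assumptions: the list of bid fractions is a hypothetical stand-in for the 280 observations, the function names are invented for this example, and only as many bids as there are bidders are generated (equivalent to setting the unused bids to $0).

```python
import random

ESTIMATED_VALUE = 1_389_000
LAND_SHARK_BID = 1_229_000

# Hypothetical stand-in for the 280 observed bid fractions (not the actual data).
OBSERVED_FRACTIONS = [0.716, 0.761, 0.818, 0.856, 0.869, 0.885, 0.887, 0.900]


def run_trial(rng):
    """One simulation trial: returns (win, return) for Land Shark."""
    num_bidders = rng.randint(2, 8)
    competing_bids = [
        rng.choice(OBSERVED_FRACTIONS) * ESTIMATED_VALUE
        for _ in range(num_bidders)
    ]
    win = 1 if LAND_SHARK_BID > max(competing_bids) else 0
    return win, win * (ESTIMATED_VALUE - LAND_SHARK_BID)


def run_simulation(trials=1000, seed=None):
    rng = random.Random(seed)  # a seed makes the set of trials repeatable
    return [run_trial(rng) for _ in range(trials)]


results = run_simulation(trials=1000, seed=42)
wins = sum(win for win, _ in results)
print(f"Wins in {len(results)} trials: {wins}")
```

Because the stand-in data differ from the BidList worksheet, the win count from this sketch should not be compared with the results reported for the spreadsheet model.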

Figure 11.14 shows the results of a set of 1,000 simulation trials. After executing the simulation with the Data Table, each row of this table corresponds to a distinct simulation trial consisting of different values of the random variables. We see that Land Shark does not win the simulated auction corresponding to Trial 1 because one of the three competing bids (Bid 1 = $1,258,434) is larger than its bid of $1,229,000. In Trial 4, we observe that Land Shark wins the auction because its bid of $1,229,000 is larger than the two competing bids of $1,091,754 and $1,132,035.

FIGURE 11.13 Setting up Land Shark Spreadsheet for 1,000 Simulation Trials


Similar to the Sanotronics problem in Section 11.1, we compute sample statistics and 95% confidence intervals on the mean and the proportion based on the 1,000 simulation trials. Referring to Figure 11.14,

Cell O25 =COUNT(L25:L1024)
Cell O26 =MIN(L25:L1024)
Cell O27 =MAX(L25:L1024)
Cell O28 =AVERAGE(L25:L1024)
Cell O29 =STDEV.S(L25:L1024)
Cell O31 =AVERAGE(K25:K1024)
Cell O32 =SQRT(O31*(1-O31)/O25)
Cell O34 =O28 - CONFIDENCE.T(0.05,O29,O25)
Cell O35 =O28 + CONFIDENCE.T(0.05,O29,O25)
Cell O37 =O31 - NORM.S.INV(0.975)*O32
Cell O38 =O31 + NORM.S.INV(0.975)*O32
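The same summary statistics can be sketched in Python with the standard library. This is an illustration rather than a replacement for the spreadsheet: the function name is an assumption, and the interval on the mean uses a normal critical value in place of the t value behind CONFIDENCE.T, which for 1,000 trials changes the limits only slightly. The input below is constructed to have 223 wins in 1,000 trials, matching the outcome counts reported for Figure 11.14.

```python
import math
import statistics


def summarize(returns, wins, confidence=0.95):
    """Summary statistics and confidence intervals for a set of trials."""
    n = len(returns)                             # =COUNT(L25:L1024)
    mean_return = statistics.mean(returns)       # =AVERAGE(L25:L1024)
    sd_return = statistics.stdev(returns)        # =STDEV.S(L25:L1024)
    p_win = statistics.mean(wins)                # =AVERAGE(K25:K1024)
    se_p = math.sqrt(p_win * (1 - p_win) / n)    # =SQRT(O31*(1-O31)/O25)
    # Standard normal critical value, as in NORM.S.INV(0.975).
    z = statistics.NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Normal approximation to the half-width given by CONFIDENCE.T.
    half_width = z * sd_return / math.sqrt(n)
    return {
        "count": n,
        "mean_return": mean_return,
        "sd_return": sd_return,
        "p_win": p_win,
        "se_p": se_p,
        "mean_ci": (mean_return - half_width, mean_return + half_width),
        "p_win_ci": (p_win - z * se_p, p_win + z * se_p),
    }


# 223 wins of $160,000 and 777 losses of $0.
wins = [1] * 223 + [0] * 777
returns = [160_000 * w for w in wins]
summary = summarize(returns, wins)
for name, value in summary.items():
    print(name, value)
```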

Again similarly to the Sanotronics problem, we compute the frequency distribution of the returns generated from the set of 1,000 trials in cells Q26:R42. Cells Q26:Q42 contain the upper limits of the bins for the frequency distribution and the cell range R26:R42 is populated by the FREQUENCY function.

Figure 11.14 shows that based on this set of 1,000 simulation trials, Land Shark’s estimated mean return is $35,680 and the estimated probability that it wins the auction is 0.223. In this simulation experiment, when Land Shark bids $1,229,000, there are only two outcomes: either it wins the auction and earns a return of $160,000 or it loses the auction and earns a return of $0. Out of the 1,000 simulated auctions, the frequency table shows that Land Shark does not win the auction ($0 return) in 777 auctions and wins the auction (earns $160,000) in 223 auctions.

FIGURE 11.14 Output from Land Shark Simulation

[Spreadsheet output: the Parameters and Model sections (estimated value $1,389,000; Land Shark bid amount $1,229,000), the table of 1,000 simulation trials (Simulation Trial, Number of Bidders, Bid 1 through Bid 8, Win?, Return), the summary statistics, and a Return Distribution chart with return bins from $10,000 to $160,000 on the horizontal axis and frequency on the vertical axis]

Summary Statistics
Count: 1000
Minimum Return: $0
Maximum Return: $160,000
Mean Return: $35,680
Standard Deviation of Return: $66,635
P(Win Auction): 0.223
Standard Error of Proportion: 0.013
95% C.I. on Mean Return: $31,545 to $39,815
95% C.I. on P(Win Auction): 0.197 to 0.249

LandSharkResample


We note that a different set of 1,000 simulation trials can be generated by pressing the F9 key, and this may result in varying values of the summary statistics because these will now be based on a different sample. By pressing the F9 key and observing how much the output statistics vary, the analyst can gauge how much sampling error exists in the output statistics. Furthermore, the 95% confidence interval on the mean return and the 95% confidence interval on the probability of winning the auction reflect the degree of the sampling error. Wider confidence intervals reflect more uncertainty in the accuracy of the sample mean and sample proportion. If we were to press the F9 key 100 times to create 100 different samples of 1,000 trials, we would expect 95 of the corresponding 100 confidence intervals on the mean return to contain Land Shark’s true mean return from the auction. Similarly, we would expect 95 of the 100 confidence intervals on the proportion of auction trials that Land Shark wins to contain the true probability of Land Shark winning the auction.

In general, increasing the number of trials in a simulation experiment will decrease the variability in the summary statistics from one sample of simulation trials to the next. Therefore, if we wish to decrease the sampling error in the output statistics, we should increase the number of simulation trials and re-execute the simulation experiment.
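To make the effect of the number of trials concrete, the short Python sketch below evaluates the standard error of a proportion for several trial counts. The proportion is held fixed at 0.223 purely for illustration (in practice it would be re-estimated from each simulation), and the function name is an assumption.

```python
import math


def standard_error_of_proportion(p, n):
    # Same form as the spreadsheet formula =SQRT(O31*(1-O31)/O25).
    return math.sqrt(p * (1 - p) / n)


for n in (1_000, 4_000, 16_000):
    se = standard_error_of_proportion(0.223, n)
    # 1.96 approximates NORM.S.INV(0.975); the product is the CI half-width.
    print(n, round(se, 4), round(1.96 * se, 4))
```

Because the standard error is proportional to one over the square root of the number of trials, quadrupling the number of trials halves the width of the confidence interval.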

Generating Bid Amounts with Fitted Distributions

In the Land Shark model represented in Figure 11.14, we generated the competing bid fractions by directly sampling from the 280 bids submitted in 56 previous auctions. The advantage of this approach is that it is relatively easy to execute, but if the 280 observations do not adequately represent the possible bid fractions for the upcoming auction, then our model may not accurately represent the future auction and Land Shark’s assessment of its bid amount.

In this section, we examine another approach for using the 280 bid observations to generate bid fraction values in a simulation model. Specifically, we will use the 280 bid observations to fit a continuous probability distribution to a histogram based on the data. The advantage of fitting a distribution is that it will allow us to generate values that may not exist in the list of the original 280 observations, but still share characteristics with these data. The disadvantage of fitting a distribution is that the process is a bit more involved and requires more familiarity with probability distributions.

Our goal is to identify a continuous probability distribution that fits the histogram of the bid fraction data shown in Figure 11.12. Appendix 11.1 contains a description of several continuous and discrete probability distributions. For the bid fraction data, we seek a continuous probability distribution due to the large number of possible values for a submitted bid fraction. Furthermore, we know that the range of bid fractions has a lower bound of zero and upper bound of one; a competitor cannot bid a negative fraction and a competitor will never bid more than the property’s estimated value. There are many possible continuous probability distributions that have both lower and upper bounds, but some of the most common are the uniform, triangular, and beta distributions. We will consider each of these.

The uniform distribution assumes each value between a specified minimum value and maximum value is equally likely, which does not appear to be the case for bid fractions as illustrated by Figure 11.12. So, the uniform distribution does not appear to be a good choice to generate bid fraction values. Nonetheless, if we wanted to use a uniform distribution to generate bid fractions in our simulation model, we would only need to determine the minimum and maximum values. For these data, the minimum is 0.645 and the maximum is 0.947, but, theoretically, bid fractions could extend from 0.000 to 1.000. Setting the minimum and maximum of the distribution is a modeling choice that will affect how low and high our competitors will bid in the simulated auctions. If Land Shark believes that the observed values of 0.645 and 0.947 are likely to be the lowest and highest bid amounts placed by competitors, then 0.645 and 0.947 should be used as the lower and upper limits of a uniform distribution. To generate a value from a continuous uniform distribution in Excel, we can use equation (11.3) as we did in the Sanotronics problem.

The triangular distribution is a unimodal distribution characterized by three input parameters: minimum (a), mode (m), and maximum (b). While the shape of the bid fraction distribution does not appear exactly triangular, it could be a worthwhile option to explore.

When you run your LandShark simulation, the values you see will be different. This is to be expected with simulation models. Each time the simulation is executed, the values may vary because different random numbers are being used. If a set of static values is desired, you can replace the dynamic data table with a static set of trial values using the Excel functionality to Copy and Paste Values.

Specialized simulation software such as @RISK, Crystal Ball, and Analytic Solver provides automated distribution fitting functionality.

Chapter 5 discusses probability distributions in more detail.


To determine the mode (most likely) value of the triangular distribution, we note that computing the mode of the effectively continuous bid fraction data is a bit dubious as no single value occurs frequently. Therefore, we base the mode on the histogram in Figure 11.12. We observe the most frequent bin is [0.875, 0.90) and use the midpoint of this bin, 0.8875, as the mode of the triangular distribution. Figure 11.15 provides a visualization of the triangular distribution’s fit to the bid fraction data. The triangle-shaped curve represents the theoretical continuous distribution from which values from the triangular distribution are generated. The blue columns correspond to one possible sample of 280 values generated from the triangular distribution. Comparing the blue curve (and blue columns) to the red columns representing the observed bid fractions, we observe that this triangular distribution appears to generate more bid fractions in the 0.645 to 0.80 range and fewer bid fractions in the 0.925 to 0.95 range. This is something to keep in mind in our simulation experiments with this distribution in the Land Shark model.

To generate a value for a random variable characterized by a triangular distribution, the following Excel formula is used:

value of triangular random variable = IF(random < (m - a)/(b - a), a + SQRT((b - a)*(m - a)*random), b - SQRT((b - a)*(b - m)*(1 - random))) (11.8)

In equation (11.8), random refers to a single, separate cell containing the Excel function =RAND(); a single, separate cell is necessary to make sure the same random value is used everywhere random appears in equation (11.8). Applying equation (11.8) for the triangular distribution fit to the 280 bid observations yields:

bid fraction = IF(random < (0.8875 - 0.645)/(0.947 - 0.645), 0.645 + SQRT((0.947 - 0.645)*(0.8875 - 0.645)*random), 0.947 - SQRT((0.947 - 0.645)*(0.947 - 0.8875)*(1 - random))) (11.9)
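Equation (11.8) can be read as an inverse-transform calculation: a single uniform random value is mapped to a point on the triangular distribution. The Python sketch below mirrors that formula; the function name and the seed are assumptions made for this illustration. Passing one value `u` into the function serves the same purpose as the single, separate random cell in the spreadsheet.

```python
import math
import random


def triangular_value(u, a, m, b):
    """Map one uniform(0, 1) value u to a triangular(a, m, b) value."""
    if u < (m - a) / (b - a):
        return a + math.sqrt((b - a) * (m - a) * u)
    return b - math.sqrt((b - a) * (b - m) * (1 - u))


A, M, B = 0.645, 0.8875, 0.947  # minimum, mode, and maximum from the text

random.seed(7)  # arbitrary seed for repeatability
sample = [triangular_value(random.random(), A, M, B) for _ in range(10_000)]
print(min(sample), max(sample), sum(sample) / len(sample))
```

Python’s standard library also offers `random.triangular(low, high, mode)`, which draws from the same distribution directly; the explicit version is shown here only because it parallels the Excel formula.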

Figure 11.16 displays the formula view of the Land Shark simulation model implementing equation (11.9) to generate bid fraction values. From Figure 11.17, we see that modeling bid fraction values with a triangular distribution results in a 95% confidence interval of $49,224 to $58,616 on the mean return and a 95% confidence interval of 0.308 to 0.366 on the probability of winning the auction. These results are significantly more optimistic for Land Shark than the results from Figure 11.14 based on generating bid fraction values by directly sampling the 280 bid observations. This can be explained by the difference in the fitted triangular distribution and observed bid fraction data. Compared to the observed bid fraction data, Figure 11.15 shows that this triangular distribution appears more likely to

FIGURE 11.15 Fit of Triangular Distribution to Bid Fraction Data

[Histogram with fitted curve: horizontal axis Bid Amount (Fraction of Property Value); vertical axis Frequency; series: Sample from Triangular Distribution, Observed Bid Fractions]

LandSharkTriangular


FIGURE 11.16 Land Shark Formula Worksheet for Bid Fraction Value Generated from Triangular Distribution

Parameters
B4 Estimated Value: 1389000
B5 Number of Bidders: =RANDBETWEEN(2, 8)
A8:A15 Bid Index: 1 through 8
B8:B15 Bid Fraction: =IF(H8<$F$13/$F$12,$F$8+SQRT($F$12*$F$13*H8),$F$9-SQRT($F$12*$F$14*(1-H8))) through the same formula referencing H15
C8:C15 Bid Amount: =IF(A8>$B$5,0,B8*$B$4) through =IF(A15>$B$5,0,B15*$B$4)
H8:H15 Random #: =RAND() in each cell

Bid % Parameters (Triangular)
F8 Minimum: 0.645
F9 Maximum: 0.947
F10 Mode: =(0.9+0.875)/2
F12 Max - Min: =F9-F8
F13 Mode - Min: =F10-F8
F14 Max - Mode: =F9-F10

Model
B18 Land Shark Bid Amount: 1229000
B19 Largest Competitor Bid: =MAX(C8:C15)
B20 Land Shark Win Auction?: =IF(B18>B19,1,0)
B21 Land Shark Return: =B20*(B4-B18)

FIGURE 11.17 Output from Land Shark Simulation Using Triangular Distribution to Generate Bid Fraction Values

generate smaller competing bid fractions than directly sampling from the 280 observed bid fractions.

The final alternative for modeling the bid fraction values would be to fit a beta distribution to the 280 bid observations. The beta distribution is a very flexible distribution characterized by four input parameters: alpha (α), beta (β), minimum (A), and maximum (B). A common method for estimating the α and β values in a beta distribution uses the sample


mean (x̄) and sample standard deviation (s) as shown in equations (11.10) and (11.11) below.⁴

\[
\alpha = \left(\frac{\bar{x} - A}{B - A}\right)\left[\frac{\left(\dfrac{\bar{x} - A}{B - A}\right)\left(1 - \dfrac{\bar{x} - A}{B - A}\right)}{\dfrac{s^{2}}{(B - A)^{2}}} - 1\right] \tag{11.10}
\]

\[
\beta = \alpha \times \frac{1 - \dfrac{\bar{x} - A}{B - A}}{\dfrac{\bar{x} - A}{B - A}} \tag{11.11}
\]

For the 280 bid observations, the sample mean is x̄ = 0.851, the sample standard deviation is s = 0.056, the minimum value is 0.645, and the maximum value is 0.947. Substituting these values first into equation (11.10) and then into equation (11.11) provides:

\[
\alpha = \left(\frac{0.851 - 0.645}{0.947 - 0.645}\right)\left[\frac{\left(\dfrac{0.851 - 0.645}{0.947 - 0.645}\right)\left(1 - \dfrac{0.851 - 0.645}{0.947 - 0.645}\right)}{\dfrac{0.056^{2}}{(0.947 - 0.645)^{2}}} - 1\right] = 3.546
\]

\[
\beta = 3.546 \times \frac{1 - \dfrac{0.851 - 0.645}{0.947 - 0.645}}{\dfrac{0.851 - 0.645}{0.947 - 0.645}} = 1.655
\]

To generate a value for a random variable characterized by a beta distribution, the following Excel formula is used:

value of beta random variable = BETA.INV(RAND(), α, β, A, B) (11.12)

For the Land Shark problem, substituting the values of the parameters results in:

bid fraction = BETA.INV(RAND(), 3.546, 1.655, 0.645, 0.947) (11.13)
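The parameter estimates in equations (11.10) and (11.11) and the sampling step in equation (11.12) can be sketched in Python as follows. The function names are assumptions for this illustration. Note two caveats: `random.betavariate` draws directly from a beta distribution on [0, 1] rather than inverting a uniform value as BETA.INV(RAND(), …) does, although both produce values from the same distribution once rescaled; and evaluating the estimates with the rounded statistics 0.851 and 0.056 gives values close to, but not identical to, the 3.546 and 1.655 reported in the text, which may reflect unrounded sample statistics.

```python
import random


def beta_parameters(mean, sd, lower, upper):
    """Method-of-moments estimates following equations (11.10) and (11.11)."""
    scaled_mean = (mean - lower) / (upper - lower)
    scaled_variance = (sd / (upper - lower)) ** 2
    alpha = scaled_mean * (scaled_mean * (1 - scaled_mean) / scaled_variance - 1)
    beta = alpha * (1 - scaled_mean) / scaled_mean
    return alpha, beta


def beta_bid_fraction(alpha, beta, lower, upper):
    # Draw on [0, 1], then rescale to [lower, upper].
    return lower + (upper - lower) * random.betavariate(alpha, beta)


alpha_hat, beta_hat = beta_parameters(0.851, 0.056, 0.645, 0.947)
print(round(alpha_hat, 3), round(beta_hat, 3))

random.seed(11)  # arbitrary seed for repeatability
# Sample using the parameter values reported in equation (11.13).
bids = [beta_bid_fraction(3.546, 1.655, 0.645, 0.947) for _ in range(10_000)]
print(min(bids), max(bids), sum(bids) / len(bids))
```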

Figure 11.18 provides a visualization of the beta distribution’s fit to the bid fraction data. The blue curve represents the theoretical continuous distribution from which values from the beta distribution are generated. The blue columns correspond to one possible sample of 280 values generated from the beta distribution. Comparing the blue curve (and blue columns) to the red columns representing the observed bid fractions, we observe that this beta distribution appears to reasonably fit the observed bid fractions.

Figure 11.19 displays the formula view of the Land Shark simulation model implementing equation (11.13) to generate bid fraction values. From Figure 11.20, we see that modeling bid fraction values with a beta distribution results in a 95% confidence interval of $26,199 to $33,961 on the mean return and a 95% confidence interval of 0.164 to 0.212 on the probability of winning the auction. These results are less optimistic than the results from Figure 11.14 based on generating bid fraction values by directly sampling the 280 bid observations.

While it is impossible to discern what is the “best” way to model the uncertain bid fraction values, the exercise of testing different distributions generates insight. One benefit of using a good-fitting theoretical distribution (such as the beta distribution in this case) to generate bid fraction values is that it generates thousands of unique bid fractions.

⁴ Estimating the parameters using equations (11.10) and (11.11) is based on a statistical method known as the “method of moments.” The specifics of this method are beyond the scope of this textbook.


FIGURE 11.18 Fit of Beta Distribution to Bid Fraction Data

[Histogram with fitted curve: horizontal axis Bid Amount (Fraction of Property Value); vertical axis Frequency; series: Sample from Beta Distribution, Observed Bid Fractions]

FIGURE 11.19 Land Shark Formula Worksheet for Bid Fraction Value Generated from Beta Distribution

Parameters
B4 Estimated Value: 1389000
B5 Number of Bidders: =RANDBETWEEN(2,8)
A8:A15 Bid Index: 1 through 8
B8:B15 Bid Fraction: =BETA.INV(RAND(),$F$10,$F$11,$F$8,$F$9) in each cell
C8:C15 Bid Amount: =IF(A8>$B$5,0,B8*$B$4) through =IF(A15>$B$5,0,B15*$B$4)

Bid % Parameters (Beta)
F8 Minimum: 0.645
F9 Maximum: 0.947
F10 Alpha: 3.54618101391835
F11 Beta: 1.6546630559593

Model
B18 Land Shark Bid Amount: 1229000
B19 Largest Competitor Bid: =MAX(C8:C15)
B20 Land Shark Win Auction?: =IF(B18>B19,1,0)
B21 Land Shark Return: =B20*(B4-B18)

Conversely, sampling directly from the observed data means that the 280 values get re-used multiple times.

In general, the appropriate way to generate values for the random variables in a Monte Carlo simulation may be difficult to determine. For a well-defined situation, like rolling a fair die, it may be clear how to generate the value of the random variable (the outcome of a die roll). In other situations, we may not know exactly how to model the uncertainty. In these situations, it is recommended that we examine any sample data available to us. The sample data can then be used by sampling directly or we can compare the sample data to common probability distributions (such as uniform, normal, triangular, and beta distributions) to determine if we can approximate the distribution of the data with an existing probability distribution.

11.3 Simulation with Dependent Random Variables 527

FIGURE 11.20 Output from Land Shark Simulation Using Beta Distribution for Bid Fraction Values

In all cases, it is important to test the implications of different modeling approaches and to understand that a simulation model is not a crystal ball that allows us to perfectly see the future, but rather it helps us to understand the impact of uncertainty on our decisions.

11.3 Simulation with Dependent Random Variables

In the examples of Sections 11.1 and 11.2, we generated values of each uncertain quantity independently of each other. In other words, we treated each uncertain quantity as an independent random variable. In this section, we consider an example in which the values of some of the uncertain quantities are dependent.

Press Teag Worldwide (PTW) manufactures all of its products in the United States, but it sells the items in three different overseas markets: the United Kingdom, New Zealand, and Japan. Each of these overseas markets generates revenue in a different currency: pound sterling in the United Kingdom, New Zealand dollars in New Zealand, and yen in Japan. At the end of each 13-week quarter, PTW converts the revenue from these three overseas markets back into U.S. dollars in order to pay its expenses in the United States, exposing PTW to exchange rate risk.

Spreadsheet Model for Press Teag Worldwide

To assess the degree of PTW’s exposure to quarterly fluctuations in exchange rates, we develop a simulation model. The first step is to identify the input parameters and output measures. The next step is to develop a spreadsheet model that computes the values of the output measures given values of the input parameters. Then we prepare the spreadsheet model for simulation analysis by replacing the static values of the input parameters that are uncertain with probability distributions of possible values.

The relevant input parameters are: (i) the quarterly revenue generated in each of the three foreign currencies, and (ii) the end-of-quarter exchange rates between these foreign

Specialized simulation software such as @RISK, Crystal Ball, and Analytic Solver provides automated procedures to incorporate dependency between random variables.


currencies and the U.S. dollar. The output measure of interest is the total end-of-quarter revenues converted into U.S. dollars.

To model the fluctuation in the exchange rate between the pound sterling and the U.S. dollar over the next quarter, PTW expresses the number of pounds sterling (£) per U.S. dollar ($) by

(end-of-quarter £/$ rate) = (start-of-quarter £/$ rate) × (1 + % change in £/$ rate) (11.14)

That is, equation (11.14) computes the end-of-quarter exchange rate based on the start-of-quarter exchange rate and the percent change in the exchange rate over the quarter. Analogously, the equations computing the end-of-quarter exchange rates between New Zealand dollars (NZD) per U.S. dollar and Japanese yen (¥) per U.S. dollar are as follows:

(end-of-quarter NZD/$ rate) = (start-of-quarter NZD/$ rate) × (1 + % change in NZD/$ rate) (11.15)

(end-of-quarter ¥/$ rate) = (start-of-quarter ¥/$ rate) × (1 + % change in ¥/$ rate) (11.16)

To see how one would use equations (11.14), (11.15), and (11.16), suppose that the start-of-quarter exchange rates are £0.615 per U.S. dollar, NZD 1.200 per U.S. dollar, and ¥87.10 per U.S. dollar. Further, assume that there is a 4.61 percent increase in the £ per $ exchange rate, a 0.27 percent decrease in the NZD per $ exchange rate, and an 11.23 percent increase in the ¥ per $ exchange rate. Then, we would have the following:

$$\text{(end-of-quarter £/\$ rate)} = 0.615 \times (1 + 0.0461) = \text{£0.6436 per \$}$$

$$\text{(end-of-quarter NZD/\$ rate)} = 1.200 \times (1 + (-0.0027)) = \text{NZD 1.1968 per \$}$$

$$\text{(end-of-quarter ¥/\$ rate)} = 87.10 \times (1 + 0.1123) = \text{¥96.8813 per \$}$$

The spreadsheet in Figure 11.21 stores the start-of-quarter £ per $ rate as 0.6152 and displays it as £0.615; the result £0.6436 reflects the stored value.

Once the end-of-quarter exchange rates are known, the quarterly revenue in pounds sterling, New Zealand dollars, and Japanese yen can be converted into U.S. dollars as follows:

$$\text{(end-of-quarter \$ from £)} = \text{(quarterly revenue in £)} \div \text{(end-of-quarter £/\$ rate)} \tag{11.17}$$

$$\text{(end-of-quarter \$ from NZD)} = \text{(quarterly revenue in NZD)} \div \text{(end-of-quarter NZD/\$ rate)} \tag{11.18}$$

$$\text{(end-of-quarter \$ from ¥)} = \text{(quarterly revenue in ¥)} \div \text{(end-of-quarter ¥/\$ rate)} \tag{11.19}$$

As an illustration of these calculations, suppose the quarterly revenues generated in pounds sterling, New Zealand dollars, and Japanese yen are £100,000, NZD 250,000, and ¥10,000,000, respectively. Then, applying equations (11.17), (11.18), and (11.19), we compute the following:

$$\text{(end-of-quarter \$ from £)} = \text{£100,000} \div \text{£0.6436 per \$} = \text{\$155,385}$$

$$\text{(end-of-quarter \$ from NZD)} = \text{NZD 250,000} \div \text{NZD 1.1968 per \$} = \text{\$208,897}$$

$$\text{(end-of-quarter \$ from ¥)} = \text{¥10,000,000} \div \text{¥96.8813 per \$} = \text{\$103,219}$$

The total revenue in U.S. dollars is then $155,385 + $208,897 + $103,219 = $467,502. Figure 11.21 shows the formula view and value view of the PTW spreadsheet model for the base scenario just presented.
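The base-scenario arithmetic can also be checked outside a spreadsheet. The following Python sketch applies equations (11.14) through (11.19) to the base scenario; the function and variable names are illustrative assumptions, and the start-of-quarter £ per $ rate is entered as 0.6152, the unrounded cell value shown in Figure 11.21.

```python
def end_of_quarter_rate(start_rate, pct_change):
    """Equations (11.14)-(11.16): apply the quarterly percent change."""
    return start_rate * (1 + pct_change)


def to_usd(revenue, end_rate):
    """Equations (11.17)-(11.19): divide revenue by the per-$ exchange rate."""
    return revenue / end_rate


# Base-scenario inputs (keys are illustrative currency labels).
start_rates = {"GBP": 0.6152, "NZD": 1.2, "JPY": 87.1}
pct_changes = {"GBP": 0.0461, "NZD": -0.0027, "JPY": 0.1123}
revenues = {"GBP": 100_000, "NZD": 250_000, "JPY": 10_000_000}

usd = {
    currency: to_usd(
        revenues[currency],
        end_of_quarter_rate(start_rates[currency], pct_changes[currency]),
    )
    for currency in start_rates
}
total_usd = sum(usd.values())

for currency, amount in usd.items():
    print(f"{currency}: ${amount:,.0f}")
print(f"Total: ${total_usd:,.0f}")
```

Because every step divides by an end-of-quarter rate, an increase in the foreign currency per $ rate lowers the converted revenue, which is the source of PTW's exchange rate risk.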

The percent change in the exchange rate between pairs of currencies from the start to the end of a quarter is uncertain. Therefore, PTW would like to use random variables to model the percent change in the £ per $ rate, the percent change in the NZD per $ rate, and the percent change in the ¥ per $ rate.

However, PTW realizes that there are dependencies between the exchange rate fluctuations. For example, if the U.S. dollar weakens against the pound sterling, it may be more likely to also weaken against the New Zealand dollar. Therefore, the percent changes in the exchange rates should not be generated independently, but instead these values should be generated jointly (as a related collection of values). To account for these dependencies, PTW constructed a data set on the joint percent changes between the three exchange rates for 2,000 quarter-scenarios in the Data worksheet of the file QuarterlyExchange. These data are based on historical observations as well as scenarios based on expert judgment. Figure 11.22 displays these data as three scatter plots showing the pairwise relationships between the exchange rates.

Figure 11.22 indicates that the percent changes in exchange rates are correlated. Positive percentage fluctuations of £ per $ often occur with positive percentage fluctuations of NZD per $, while negative fluctuations of £ per $ often occur with negative fluctuations of NZD per $. If these values were independent, we would expect to see no pattern in this scatter plot. However, there is a clear pattern in this scatter plot suggesting positive correlation. Therefore, we conclude that the fluctuations of £ per $ and NZD per $ are not independent, but are correlated. Similarly, percent changes in £ per $ appear to be correlated with percent changes in ¥ per $. Also, percent changes in NZD per $ appear to be correlated with percent changes in ¥ per $.
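The visual impression from a scatter plot can be backed up with a sample correlation coefficient. The sketch below computes the Pearson correlation for two lists of percent changes; the five paired values are hypothetical placeholders, not rows from the QuarterlyExchange file, and the function name is an assumption made for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(sq_dev_x * sq_dev_y)


# Hypothetical quarterly percent changes (NOT taken from the Data worksheet).
gbp_changes = [0.05, -0.02, 0.10, -0.08, 0.01]
nzd_changes = [0.04, -0.01, 0.07, -0.09, 0.00]

r = pearson_correlation(gbp_changes, nzd_changes)
print(f"Sample correlation: {r:.3f}")
```

A value near +1 corresponds to the upward-sloping pattern described above, a value near 0 to no linear pattern, and a value near -1 to a downward-sloping pattern.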

To directly sample one of the 2,000 scenarios and obtain the corresponding percent change in £ per $ rate, NZD per $ rate, and ¥ per $ rate, we use the respective Excel formulas:

=VLOOKUP(E7, Data!$A$3:$D$2002, 2, FALSE) (11.20)

=VLOOKUP(E7, Data!$A$3:$D$2002, 3, FALSE) (11.21)

=VLOOKUP(E7, Data!$A$3:$D$2002, 4, FALSE) (11.22)

As Figure 11.23 illustrates, in equations (11.20), (11.21), and (11.22), cell E7 contains the Excel function =RANDBETWEEN(1, 2000), which randomly generates the index of one of the 2,000 quarter scenarios. The VLOOKUP function then looks up this index in the table of quarter scenarios and returns the percent change in £ per $ rate (cell B7), the percent change in NZD per $ rate (cell C7), or the percent change in ¥ per $ rate (cell D7). Note that the third argument in the VLOOKUP function corresponds to the column in the range Data!$A$3:$D$2002 that contains the quarterly percent change to be returned. So, =VLOOKUP(E7, Data!$A$3:$D$2002, 2, FALSE) returns the value from the second column (Column B) in the Data worksheet. The fourth argument of the VLOOKUP function specifies that an exact match of the quarter index is required. Because the exchange rate fluctuations are sampled from the same quarter scenario, this captures their interdependency; that is, the individual exchange rate changes are not generated independently, but rather as a collection.

Chapter 2 also discusses the concept of correlation.

FIGURE 11.21 Base Spreadsheet Model for Press Teag Worldwide

Formula view (columns B, C, and D hold the £, NZD, and ¥ values, respectively):

Parameters
Row 4, Start-of-Quarter Exchange Rate (per $): B4 = 0.6152; C4 = 1.2; D4 = 87.1
Row 5, Quarterly % Change in Exchange Rate: B5 = 0.0461; C5 = –0.0027; D5 = 0.1123
Row 6, End-of-Quarter Exchange Rate (per $): B6 = B4*(1+B5); C6 = C4*(1+C5); D6 = D4*(1+D5)
Row 7, Quarterly Revenue: B7 = 100000; C7 = 250000; D7 = 10000000
Model
Row 10, End-of-Quarter Revenue in $: B10 = B7/B6; C10 = C7/C6; D10 = D7/D6; Total = SUM(B10:D10)

Value view:

Start-of-Quarter Exchange Rate (per $): £0.615; NZD 1.200; ¥87.10
Quarterly % Change in Exchange Rate: 4.61%; –0.27%; 11.23%
End-of-Quarter Exchange Rate (per $): £0.6436; NZD 1.1968; ¥96.8813
Quarterly Revenue: £100,000; NZD 250,000; ¥10,000,000
End-of-Quarter Revenue in $: $155,385; $208,897; $103,219; Total $467,502

FIGURE 11.22 Pairwise Relationships Between PTW Exchange Rate Data

[Three scatter plots of quarterly percent changes: NZD per $ versus £ per $, ¥ per $ versus £ per $, and ¥ per $ versus NZD per $. Each axis runs from –30% to 30%.]

As in the Sanotronics and Land Shark problems, we now can use a Data Table to execute simulation trials and gather sample statistics. Figure 11.24 shows the results of 1,000 simulation trials. PTW can use this simulation model to assess its exposure to currency exchange rates and consider actions to hedge against this risk.
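The same sampling logic can be sketched in Python. In the sketch below, drawing one random index and reading all three percent changes from that row plays the role of the RANDBETWEEN and VLOOKUP formulas, so the three changes are always taken from the same quarter scenario. The three-row scenario list is a hypothetical stand-in for the 2,000-row Data worksheet, the start-of-quarter £ per $ rate uses the stored cell value 0.6152, and all function names are assumptions made for illustration.

```python
import random
import statistics

START_RATES = (0.6152, 1.2, 87.1)          # £, NZD, ¥ per $ at start of quarter
REVENUES = (100_000, 250_000, 10_000_000)  # quarterly revenue in £, NZD, ¥


def total_usd_revenue(pct_changes):
    """Total end-of-quarter revenue in $ for one joint scenario."""
    return sum(
        revenue / (rate * (1 + change))
        for revenue, rate, change in zip(REVENUES, START_RATES, pct_changes)
    )


def simulate(scenarios, n_trials=1000, seed=None):
    """Sample whole scenario rows so the three changes stay dependent."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_trials):
        index = rng.randint(1, len(scenarios))  # analogous to RANDBETWEEN(1, 2000)
        totals.append(total_usd_revenue(scenarios[index - 1]))
    return totals


# Hypothetical stand-in for the Data worksheet: (£, NZD, ¥) percent changes.
scenarios = [
    (0.0461, -0.0027, 0.1123),
    (0.0112, -0.0553, -0.1307),
    (-0.0300, -0.0200, -0.0100),
]

totals = simulate(scenarios, n_trials=1000, seed=42)
print(f"Average revenue:    ${statistics.mean(totals):,.0f}")
print(f"Standard deviation: ${statistics.stdev(totals):,.0f}")
print(f"Minimum / maximum:  ${min(totals):,.0f} / ${max(totals):,.0f}")
```

Sampling each currency's change with its own independent random index would break the dependency and could understate or overstate the spread of total revenue.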

Model file: QuarterlyExchangeModel

FIGURE 11.23 Formula Worksheet for PTW

Parameters
Row 6, Start-of-Quarter Exchange Rate (per $): B6 = 0.6152; C6 = 1.2; D6 = 87.1
Row 7, Quarterly % Change in Exchange Rate: B7 = VLOOKUP($E$7,Data!$A$3:$D$2002,2,FALSE); C7 = VLOOKUP($E$7,Data!$A$3:$D$2002,3,FALSE); D7 = VLOOKUP($E$7,Data!$A$3:$D$2002,4,FALSE)
Row 7, Random Scenario: E7 = RANDBETWEEN(1,2000)
Row 8, End-of-Quarter Exchange Rate (per $): B8 = B6*(1+B7); C8 = C6*(1+C7); D8 = D6*(1+D7)
Row 9, Quarterly Revenue: B9 = 100000; C9 = 250000; D9 = 10000000
Model
Row 12, End-of-Quarter Revenue in $: B12 = B9/B8; C12 = C9/C8; D12 = D9/D8; Total = SUM(B12:D12)

FIGURE 11.24 Output from PTW Simulation

[Spreadsheet output with a histogram of total $ revenue over 1,000 simulation trials, using $10,000-wide bins from < $420K to > $580K.]

Displayed trial (Random Scenario 1173):
Start-of-Quarter Exchange Rate (per $): £0.615; NZD 1.200; ¥87.10
Quarterly % Change in Exchange Rate: 1.12%; -5.53%; -13.07%
End-of-Quarter Exchange Rate (per $): £0.622; NZD 1.134; ¥75.72
Quarterly Revenue: £100,000; NZD 250,000; ¥10,000,000
End-of-Quarter Revenue in $: $160,748; $220,529; $132,072; Total $513,349

Summary Statistics
Count: 1000
Minimum Revenue: $389,918
Maximum Revenue: $596,365
Average Revenue: $487,087
Standard Deviation of Revenue: $22,670


11.4 Simulation Considerations

Verification and Validation

An important aspect of any simulation study involves confirming that the simulation model accurately describes the real system. Inaccurate simulation models cannot be expected to provide worthwhile information. Thus, before using simulation results to draw conclusions about a real system, one must take steps to verify and validate the simulation model.

Verification is the process of determining that the computer procedure that performs the simulation calculations is logically correct. Verification is largely a debugging task to make sure that there are no errors in the computer procedure that implements the simulation. In some cases, an analyst may compare computer results for a limited number of events with independent hand calculations. In other cases, tests may be performed to verify that the random variables are being generated correctly and that the output from the simulation model seems reasonable. The verification step is not complete until the user develops a high degree of confidence that the computer procedure is error free.

Validation is the process of ensuring that the simulation model provides an accurate representation of a real system. Validation requires an agreement among analysts and managers that the logic and the assumptions used in the design of the simulation model accurately reflect how the real system operates. The first phase of the validation process is done prior to, or in conjunction with, the development of the computer procedure for the simulation process. Validation continues after the computer program has been developed, with the analyst reviewing the simulation output to see whether the simulation results closely approximate the performance of the real system. If possible, the output of the simulation model is compared to the output of an existing real system. If this form of validation is not possible, an analyst can experiment with the simulation model and have one or more individuals experienced with the operation of the real system review the simulation output to determine whether it is a reasonable approximation of what would be obtained with the real system under similar conditions.

Verification and validation are not tasks to be taken lightly. They are key steps in any simulation study and are necessary to ensure that decisions and conclusions based on the simulation results are appropriate for the real system.

Advantages and Disadvantages of Using Simulation

The primary advantages of simulation are that it is conceptually easy to understand and that the methods can be used to model and learn about the behavior of complex systems that would be difficult, if not impossible, to deal with analytically. Simulation models are flexible; they can be used to describe systems without requiring the assumptions that are often required by other mathematical models. In general, the larger the number of random variables a system has, the more likely it is that a simulation model will provide the best approach for studying the system. Another advantage is that a simulation model provides a convenient experimental laboratory for the real system. Changing assumptions or operating policies in the simulation model and rerunning it can provide results that help predict how such changes will affect the operation of the real system. Experimenting directly with a real system is often not feasible. Simulation models frequently warn against poor decision strategies by projecting disastrous outcomes such as system failures, large financial losses, and so on.

Simulation is not without disadvantages. For complex systems, the process of developing, verifying, and validating a simulation model can be time-consuming and expensive. However, the process of developing the model generally leads to a better understanding of the system, which is an important benefit. As with all mathematical models, the analyst must be conscious of the assumptions of the model in order to understand its limitations. In addition, each simulation run provides only a sample of output data. As such, the summary of the simulation data provides only estimates or approximations about the real system. Nonetheless, the danger of obtaining poor solutions is greatly mitigated if the analyst exercises good judgment in developing the simulation model and follows proper verification and validation steps. Furthermore, if a sufficiently large set of simulation trials is run under a wide variety of conditions, the analyst will likely have sufficient data to predict how the real system will operate.

SUMMARY

Simulation is a method for learning about a real system by experimenting with a model that represents the system. Some of the reasons simulation is frequently used are as follows:

1. It can be used for a wide variety of practical problems.

2. The simulation approach is relatively easy to explain and understand. As a result, management confidence is increased and the results are more easily accepted.

3. Spreadsheet software such as Excel and specialized software packages have made it easier to develop and implement simulation models for increasingly complex problems.

In this chapter, we showed how native Excel functions can be used to execute simulation models on several examples. For the Sanotronics problem, we used simulation to evaluate the risk involving the development of a new product. Then we developed a simulation model to help Land Shark Inc. estimate how varying its bid amount affects the likelihood of winning a property auction. We then demonstrated how to build a simulation model with dependent random variables using the Press Teag Worldwide example that included correlated fluctuations for currency exchange rates. With the steps below, we summarize the procedure for developing a simulation model involving controllable inputs, uncertain inputs represented by random variables, and output measures.

Summary of Steps for Conducting a Simulation Analysis

1. Construct a spreadsheet model that computes output measures for given values of inputs. The foundation of a good simulation model is logic that correctly relates input values to outputs. Audit the spreadsheet to ensure that the cell formulas correctly evaluate the outputs over the entire range of possible input values.

2. Identify inputs that are uncertain, and specify probability distributions for these cells (rather than just static numbers). Note that not all inputs have a degree of uncertainty sufficient to require modeling with a probability distribution. Other inputs may actually be decision variables, which are not random and should not be modeled with probability distributions; rather, these are values that the decision maker can control.

3. Select one or more outputs to record over the simulation trials. Typical information recorded for an output includes a histogram of output values over all simulation trials and summary statistics such as the mean, standard deviation, maximum, minimum, and percentile values.

4. Execute the simulation for a specified number of trials. In this chapter, we have used 1,000 trials for our simulation models. The amount of sampling error can be monitored by observing how much simulation output measures fluctuate across multiple simulation runs. If the confidence intervals on the output measures are unacceptably wide, the number of trials can be increased to reduce the amount of sampling error.

5. Analyze the outputs and interpret the implications on the decision-making process. In addition to estimates of the mean output, simulation allows us to construct a distribution of possible output values. Analyzing the simulation results allows the decision maker to draw conclusions about the operation of the real system.
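The five steps above can be sketched in Python as a generic trial loop. This is a minimal illustration rather than the chapter's spreadsheet procedure: the profit model, its price, cost range, and demand range are hypothetical, and the 95% confidence interval uses the normal approximation (mean ± 1.96 standard errors), which is reasonable for 1,000 trials.

```python
import math
import random
import statistics


def run_simulation(model, n_trials=1000, seed=None):
    """Steps 3-5: run independent trials, record the output, summarize it."""
    rng = random.Random(seed)
    outputs = [model(rng) for _ in range(n_trials)]
    mean = statistics.mean(outputs)
    std_dev = statistics.stdev(outputs)
    half_width = 1.96 * std_dev / math.sqrt(n_trials)  # normal approximation
    return {
        "n_trials": n_trials,
        "mean": mean,
        "std_dev": std_dev,
        "minimum": min(outputs),
        "maximum": max(outputs),
        "ci_95": (mean - half_width, mean + half_width),
    }


def example_profit_model(rng):
    """Steps 1-2: hypothetical model relating inputs to one output."""
    price = 50.0                         # controllable input (decision variable)
    unit_cost = rng.uniform(20.0, 30.0)  # uncertain input
    demand = rng.randint(800, 1200)      # uncertain input
    return (price - unit_cost) * demand


summary = run_simulation(example_profit_model, n_trials=1000, seed=1)
low, high = summary["ci_95"]
print(f"Mean profit: {summary['mean']:,.0f} (95% CI {low:,.0f} to {high:,.0f})")
```

Because the interval half-width shrinks in proportion to the square root of the number of trials, quadrupling the trials roughly halves the width of the confidence interval.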

In this chapter, we have focused on Monte Carlo simulation consisting of independent trials in which the results for one trial do not affect what happens in subsequent trials. Another style of simulation, called discrete-event simulation, involves trials that represent how a system evolves over time. One common application of discrete-event simulation is the analysis of waiting lines. In a waiting-line simulation, the random variables are the interarrival times of the customers and the service times of the servers, which together determine the waiting and completion times for the customers. Although it is possible to conduct small discrete-event simulations with native Excel functionality, discrete-event simulation modeling is best conducted with special-purpose software such as Arena®, ProModel®, and Simio®. These packages have built-in simulation clocks, simplified methods for generating random variables, and procedures for collecting and summarizing the simulation output.

Problems 34, 35, and 36 involve small waiting-line simulation models.

GLOSSARY

Base-case scenario: Output resulting from the most likely values for the random variables of a model.
Best-case scenario: Output resulting from the best values that can be expected for the random variables of a model.
Continuous probability distribution: A probability distribution for which the possible values for a random variable can take any value in an interval or collection of intervals. An interval can include negative and positive infinity.
Controllable input: Input to a simulation model that is selected by the decision maker.
Discrete probability distribution: A probability distribution for which the possible values for a random variable can take on only specified discrete values.
Discrete-event simulation: A simulation method that describes how a system evolves over time by using events that occur at discrete points in time.
Monte Carlo simulation: A simulation method that uses repeated random sampling to represent uncertainty in a model representing a real system and that computes the values of model outputs.
Probability distribution: A description of the range and relative likelihood of possible values of a random variable (uncertain quantity).
Random variable (uncertain variable): Input to a simulation model whose value is uncertain and described by a probability distribution.
Risk analysis: The process of evaluating a decision in the face of uncertainty by quantifying the likelihood and magnitude of an undesirable outcome.
Validation: The process of determining that a simulation model provides an accurate representation of a real system.
Verification: The process of determining that a computer program implements a simulation model as it is intended.
What-if analysis: A trial-and-error approach to learning about the range of possible outputs for a model. Trial values are chosen for the model inputs (these are the what-ifs) and the value of the output(s) is computed.
Worst-case scenario: Output resulting from the worst values that can be expected for the random variables of a model.

PROBLEMS

1. Galaxy Co. distributes wireless routers to Internet service providers. Galaxy procures each router for $75 from its supplier and sells each router for $125. Monthly demand for the router is a normal random variable with a mean of 100 units and a standard deviation of 20 units. At the beginning of each month, Galaxy orders enough routers from its supplier to bring the inventory level up to 100 routers. If the monthly demand is less than 100, Galaxy pays $15 per router that remains in inventory at the end of the month. If the monthly demand exceeds 100, Galaxy sells only the 100 routers in stock. Galaxy assigns a shortage cost of $30 for each unit of demand that is unsatisfied to represent a loss-of-goodwill among its customers. Management would like to use a simulation model to analyze this situation.

a. What is the average monthly profit resulting from its policy of stocking 100 routers at the beginning of each month?

b. What is the proportion of months in which demand is completely satisfied?

c. Use the simulation model to compare the profitability of monthly replenishment levels of 100, 120, and 140 routers. Use the corresponding 95% confidence intervals on the average profit to make your comparison.

2. Construct a spreadsheet simulation model to simulate 1,000 rolls of a die with the six sides numbered 1, 2, 3, 4, 5, and 6.

a. Construct a histogram of the 1,000 observed dice rolls.

b. For each roll of two dice, record the sum of the dice. Construct a histogram of the 1,000 observations of the sum of two dice.

c. For each roll of three dice, record the sum of the dice. Construct a histogram of the 1,000 observations of the sum of three dice.

d. For each roll of four dice, record the sum of the dice. Construct a histogram of the 1,000 observations of the sum of four dice.

e. Compare the histograms in parts (a), (b), (c), and (d). What statistical phenomenon does this sequence of charts illustrate?

3. The management of Madeira Computing is considering the introduction of a wearable electronic device with the functionality of a laptop computer and phone. The fixed cost to launch this new product is $300,000. The variable cost for the product is expected to be between $160 and $240, with a most likely value of $200 per unit. The product will sell for $300 per unit. Demand for the product is expected to range from 0 to approximately 20,000 units, with 4,000 units the most likely.

a. Develop a what-if spreadsheet model computing profit for this product in the base-case, worst-case, and best-case scenarios.

b. Model the variable cost as a uniform random variable with a minimum of $160 and a maximum of $240. Model product demand as 1,000 times the value of a gamma random variable with an alpha parameter of 3 and a beta parameter of 2. Construct a simulation model to estimate the average profit and the probability that the project will result in a loss.

c. What is your recommendation regarding whether to launch the product?

4. The management of Brinkley Corporation is interested in using simulation to estimate the profit per unit for a new product. The selling price for the product will be $45 per unit. Probability distributions for the purchase cost, the labor cost, and the transportation cost are estimated as follows:

Procurement Cost ($)   Probability     Labor Cost ($)   Probability     Transportation Cost ($)   Probability
10                     0.25            20               0.10            3                         0.75
11                     0.45            22               0.25            5                         0.25
12                     0.30            24               0.35
                                       25               0.30

a. Construct a simulation model to estimate the average profit per unit. What is a 95% confidence interval around this average?

b. Management believes that the project may not be sustainable if the profit per unit is less than $5. Use simulation to estimate the probability that the profit per unit will be less than $5. What is a 95% confidence interval around this proportion?

5. Statewide Auto Insurance believes that for every trip longer than 10 minutes that a teenager drives, there is a 1 in 1,000 chance that the drive will result in an auto accident. Assume that the cost of an accident can be modeled with a beta distribution with an alpha parameter of 1.5, a beta parameter of 3, a minimum value of $500, and a maximum value of $20,000. Construct a simulation model to answer the following questions. (Hint: Review Appendix 11.1 for descriptions of various types of probability distributions to identify the appropriate way to model the number of accidents in 500 trips.)

a. If a teenager drives 500 trips longer than 10 minutes, what is the average cost resulting from accidents? Provide a 95% confidence interval on this mean.

b. If a teenager drives 500 trips longer than 10 minutes, what is the probability that the total cost from accidents will exceed $8,000? Provide a 95% confidence interval on this proportion.

6. State Farm Insurance has developed the following table to describe the distribution of automobile collision claims paid during the past year.

Payment ($)   Probability
0             0.83
500           0.06
1,000         0.05
2,000         0.02
5,000         0.02
8,000         0.01
10,000        0.01

a. Set up a table of intervals of random numbers that can be used with the Excel VLOOKUP function to generate values for automobile collision claim payments.

b. Construct a simulation model to estimate the average claim payment amount and the standard deviation in the claim payment amounts.

c. Let X be the discrete random variable representing the dollar value of an automobile collision claim payment. Let x₁, x₂, . . . , xₙ represent possible values of X. Then, the mean (μ) and standard deviation (σ) of X can be computed as

$$\mu = x_1 \times P(X = x_1) + \cdots + x_n \times P(X = x_n)$$

and

$$\sigma = \sqrt{(x_1 - \mu)^2 \times P(X = x_1) + \cdots + (x_n - \mu)^2 \times P(X = x_n)}$$

Compare the values of sample mean and sample standard deviation in part (b) to the analytical calculation of the mean and standard deviation. How can we improve the accuracy of the sample estimates from the simulation?

Chapter 5 describes the analytical calculation of the mean and standard deviation of a random variable.

7. The Dallas Mavericks and the Golden State Warriors are two teams in the National Basketball Association. Dallas and Golden State will play multiple times over the course of an NBA season. Assume that the Dallas Mavericks have a 25% probability of winning each game against the Golden State Warriors.

a. Construct a simulation model that uses the negative binomial distribution to simulate the number of games Dallas would lose before winning four games against the Golden State Warriors.

b. Now suppose that the Dallas Mavericks face the Golden State Warriors in a best-of-seven playoff series in which the first team to win four games out of seven wins the series. Using the simulation model from part (a), estimate the probability that the Dallas Mavericks would win a best-of-seven series against the Golden State Warriors.

8. Grear Tire Company has produced a new tire with an estimated mean lifetime mileage of 36,500 miles. Management also believes that the standard deviation is 5,000 miles and that tire mileage is normally distributed. To promote the new tire, Grear has offered to refund some money if the tire fails to reach 30,000 miles before the tire needs to be replaced. Specifically, for tires with a lifetime below 30,000 miles, Grear will refund a customer $1 per 100 miles short of 30,000.

a. For each tire sold, what is the average cost of the promotion?

b. What is the probability that Grear will refund more than $25 for a tire?

9. To generate leads for new business, Gustin Investment Services offers free financial planning seminars at major hotels in Southwest Florida. Gustin conducts seminars for groups of 25 individuals. Each seminar costs Gustin $3,500, and the commission for each new account opened is $5,000. Gustin estimates that for each individual attending the seminar, there is a 0.01 probability that he/she will open a new account.

a. Construct a spreadsheet model that correctly computes Gustin’s profit per seminar, given static values of the relevant parameters.

b. What type of random variable is the number of new accounts opened? (Hint: Review Appendix 11.1 for descriptions of various types of probability distributions.)

c. Construct a simulation model to analyze the profitability of Gustin’s seminars. Would you recommend that Gustin continue running the seminars?

d. How many attendees (in a multiple of five, i.e., 25, 30, 35, . . .) does Gustin need before a seminar’s average profit is greater than zero?

10. Using the file LandSharkBeta, evaluate bid amounts from $1,229,000 to $1,329,000 in increments of $20,000 by building a table listing 95% confidence intervals around the average return and probability of winning the auction. Which of these bid amounts do you recommend?

11. The Iowa Energy are scheduled to play against the Maine Red Claws in an upcoming game in the National Basketball Association (NBA) G League. Because a player in the NBA G League is still developing his skills, the number of points he scores in a game can vary substantially. Assume that each player’s point production can be represented as an integer uniform random variable with the ranges provided in the following table:

Player   Iowa Energy   Maine Red Claws
1        [5,20]        [7,12]
2        [7,20]        [15,20]
3        [5,10]        [10,20]
4        [10,40]       [15,30]
5        [6,20]        [5,10]
6        [3,10]        [1,20]
7        [2,5]         [1,4]
8        [2,4]         [2,4]

a. Develop a spreadsheet model that simulates the points scored by each team and the difference in their point totals.

b. What are the average and standard deviation of points scored by the Iowa Energy? What is the shape of the distribution of points scored by the Iowa Energy?

c. What are the average and standard deviation of points scored by the Maine Red Claws? What is the shape of the distribution of points scored by the Maine Red Claws?

d. Let Point Differential = Iowa Energy points − Maine Red Claws points. What is the average Point Differential between the Iowa Energy and Maine Red Claws? What is the standard deviation of the Point Differential? What is the shape of the Point Differential distribution?

e. What is the probability that the Iowa Energy scores more points than the Maine Red Claws?

f. The coach of the Iowa Energy feels that they are the underdog and is considering a riskier game strategy. The effect of this strategy is that the range of each Energy player’s point production increases symmetrically so that the new range is [0, original upper bound + original lower bound]. For example, Energy player 1’s range with the risky strategy is [0, 25]. How does the new strategy affect the average and standard deviation of the Energy point total? How does that affect the probability of the Iowa Energy scoring more points than the Maine Red Claws?

12. Suppose that the price of a share of a particular stock listed on the New York Stock Exchange is currently $39. The following probability distribution shows how the price per share is expected to change over a three-month period:

Stock Price Change ($)   Probability
−2                       0.05
−1                       0.10
0                        0.25
+1                       0.20
+2                       0.20
+3                       0.10
+4                       0.10

a. Construct a spreadsheet simulation model that computes the value of the stock price in 3 months, 6 months, 9 months, and 12 months under the assumption that the change in stock price over any three-month period is independent of the change in stock price over any other three-month period. For a current price of $39 per share, what is the average stock price per share 12 months from now? What is the standard deviation of the stock price 12 months from now?

b. Based on the model assumptions, what are the lowest and highest possible prices for this stock in 12 months? Based on your knowledge of the stock market, how valid do you think this is? Propose an alternative to modeling how stock prices evolve over three-month periods.
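The independent-quarters model described in part (a) can also be sketched outside a spreadsheet. The Python sketch below is illustrative only and is not the textbook's spreadsheet method. It assumes the price changes in the table run from −2 to +4 in $1 steps, and the names CHANGES, PROBS, simulate_price_path, and simulate_final_prices are hypothetical.

```python
import random
import statistics

# Quarterly price changes ($) and their probabilities.
# Assumption: the table values run from -2 to +4 in $1 steps.
CHANGES = [-2, -1, 0, 1, 2, 3, 4]
PROBS = [0.05, 0.10, 0.25, 0.20, 0.20, 0.10, 0.10]


def simulate_price_path(start_price, n_quarters, rng):
    """Return the price after each quarter, drawing each change independently."""
    steps = rng.choices(CHANGES, weights=PROBS, k=n_quarters)
    path = []
    price = start_price
    for step in steps:
        price += step
        path.append(price)
    return path


def simulate_final_prices(start_price=39, n_quarters=4, n_trials=10_000, seed=1):
    """Run n_trials independent paths and collect the final price of each."""
    rng = random.Random(seed)
    return [simulate_price_path(start_price, n_quarters, rng)[-1]
            for _ in range(n_trials)]


if __name__ == "__main__":
    finals = simulate_final_prices()
    print("average 12-month price:", statistics.mean(finals))
    print("standard deviation:", statistics.stdev(finals))
    print("lowest and highest simulated price:", min(finals), max(finals))
```

Because each change is drawn from the same bounded table, the simulated prices can never leave a fixed band around $39, which is the modeling limitation that part (b) asks about.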

13. Allegiant Airlines is considering an overbooking policy for one of its flights. The airplane has 50 seats, but Allegiant is considering accepting more reservations than seats because sometimes passengers do not show up for their flights, resulting in empty seats. The PassengerAppearance worksheet in the file Overbooking contains data on 1,000 passengers showing whether or not they showed up for their respective flights.

In addition, Allegiant has conducted a field experiment to gauge the demand for reservations for the current flight. During this experiment, they did not limit the number of reservations for the flight to observe the uncensored demand. The following table summarizes the result of the field experiment.

No. of Reservations Demanded   Probability
48                             0.05
49                             0.05
50                             0.15
51                             0.30
52                             0.25
53                             0.10
54                             0.10

Allegiant receives a marginal profit of $100 for each passenger who books a reservation (regardless of whether they show up). Allegiant incurs a rebooking cost of $300 for each passenger who books a reservation, but is denied seating due to a full airplane; this cost results from rescheduling the passenger and any loss of goodwill.

To control its rebooking costs, Allegiant wants to set a limit on the number of reservations it will accept. Evaluate Allegiant’s average net profit for reservation limits of 50, 52, and 54, respectively. Based on the 95% confidence intervals for average net profit, which reservation limit do you recommend?

14. A project has four activities (A, B, C, and D) that must be performed sequentially. The probability distributions for the time required to complete each of the activities are as follows:

Activity   Activity Time (weeks)   Probability
A          5                       0.25
           6                       0.35
           7                       0.25
           8                       0.15
B          3                       0.20
           5                       0.55
           7                       0.25
C          10                      0.10
           12                      0.25
           14                      0.40
           16                      0.20
           18                      0.05
D          8                       0.60
           10                      0.40

Problems 539

a. Construct a spreadsheet simulation model to estimate the average length of the project and the standard deviation of the project length.

b. What is the estimated probability that the project will be completed in 35 weeks or less?
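Since the four activities in Problem 14 run one after another, the project length in each trial is the sum of four independent draws. The following Python sketch shows one way this could be set up. It is an illustration rather than the spreadsheet model the problem asks for, and ACTIVITIES, sample_project_length, and simulate are hypothetical names.

```python
import random
import statistics

# Activity-time distributions (weeks -> probability) from the Problem 14 table.
ACTIVITIES = {
    "A": {5: 0.25, 6: 0.35, 7: 0.25, 8: 0.15},
    "B": {3: 0.20, 5: 0.55, 7: 0.25},
    "C": {10: 0.10, 12: 0.25, 14: 0.40, 16: 0.20, 18: 0.05},
    "D": {8: 0.60, 10: 0.40},
}


def sample_project_length(rng):
    """Draw one time per activity and add them (activities are sequential)."""
    total = 0
    for dist in ACTIVITIES.values():
        times = list(dist.keys())
        weights = list(dist.values())
        total += rng.choices(times, weights=weights, k=1)[0]
    return total


def simulate(n_trials=10_000, seed=1):
    """Return a list of simulated project lengths in weeks."""
    rng = random.Random(seed)
    return [sample_project_length(rng) for _ in range(n_trials)]


if __name__ == "__main__":
    lengths = simulate()
    print("average length:", statistics.mean(lengths))
    print("standard deviation:", statistics.stdev(lengths))
    share = sum(1 for x in lengths if x <= 35) / len(lengths)
    print("share of trials finished within 35 weeks:", share)
```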

15. In preparing for the upcoming holiday season, Fresh Toy Company (FTC) designed a new doll called The Dougie that teaches children how to dance. The fixed cost to produce the doll is $100,000. The variable cost, which includes material, labor, and shipping costs, is $34 per doll. During the holiday selling season, FTC will sell the dolls for $42 each. If FTC overproduces the dolls, the excess dolls will be sold in January through a distributor who has agreed to pay FTC $10 per doll. Demand for new toys during the holiday selling season is uncertain. The normal probability distribution with an average of 60,000 dolls and a standard deviation of 15,000 is assumed to be a good description of the demand. FTC has tentatively decided to produce 60,000 units (the same as average demand), but it wants to conduct an analysis regarding this production quantity before finalizing the decision.

a. Create a what-if spreadsheet model using formulas that relate the values of production quantity, demand, sales, revenue from sales, amount of surplus, revenue from sales of surplus, total cost, and net profit. What is the profit when demand is equal to its average (60,000 units)?

b. Modeling demand as a normal random variable with a mean of 60,000 and a standard deviation of 15,000, simulate the sales of The Dougie doll using a production quantity of 60,000 units. What is the estimate of the average profit associated with the production quantity of 60,000 dolls? How does this compare to the profit corresponding to the average demand (as computed in part (a))?

c. Before making a final decision on the production quantity, management wants an analysis of a more aggressive 70,000-unit production quantity and a more conservative 50,000-unit production quantity. Run your simulation with these two production quantities. What is the average profit associated with each?

d. Besides average profit, what other factors should FTC consider in determining a production quantity? Compare the four production quantities (40,000; 50,000; 60,000; and 70,000) using all these factors. What trade-offs occur? What is your recommendation?
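The relationships listed in part (a) of Problem 15 can be written as a small function, which a simulation then evaluates at many random demand values. The Python sketch below is one possible arrangement, not the textbook's spreadsheet. The names net_profit and simulate_profits are hypothetical, and clipping negative demand draws to zero is an assumption made here for safety.

```python
import random
import statistics

FIXED_COST = 100_000    # $ per season
VARIABLE_COST = 34      # $ per doll produced
PRICE = 42              # $ per doll sold during the holiday season
SURPLUS_PRICE = 10      # $ per surplus doll sold to the distributor


def net_profit(production, demand):
    """What-if model: profit for one production quantity and one demand value."""
    sales = min(production, demand)
    surplus = production - sales
    revenue = PRICE * sales + SURPLUS_PRICE * surplus
    total_cost = FIXED_COST + VARIABLE_COST * production
    return revenue - total_cost


def simulate_profits(production, n_trials=10_000, seed=1):
    """Evaluate net_profit at normally distributed demand values."""
    rng = random.Random(seed)
    profits = []
    for _ in range(n_trials):
        # Assumption: a negative draw (very rare here) is treated as zero demand.
        demand = max(0.0, rng.gauss(60_000, 15_000))
        profits.append(net_profit(production, demand))
    return profits


if __name__ == "__main__":
    for quantity in (50_000, 60_000, 70_000):
        profits = simulate_profits(quantity)
        print(quantity, "average profit:", round(statistics.mean(profits)),
              "std dev:", round(statistics.stdev(profits)))
```

Keeping the profit logic in its own function makes it easy to check single scenarios by hand before trusting the simulated averages.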

16. Jonah Arkfeld, a building contractor, is preparing a bid on a new construction project. Two other contractors will be submitting bids for the same project. Jonah has analyzed past bidding practices and the requirements of the project to determine the probability distributions of the two competing contractors. The bid from Contractor A can be described with a triangular distribution with a minimum value of $600,000, a maximum value of $800,000, and a most likely value of $725,000. The bid from Contractor B can be described with a normal distribution with a mean of $700,000 and a standard deviation of $50,000.

a. If Jonah submits a bid of $750,000, what is the probability that he will win the bid for the project?

b. What is the probability that Contractor A and Contractor B will win the bid, respectively?
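One way to approach Problem 16 is to draw both competing bids in each trial and record who submitted the lowest one. The Python sketch below illustrates this under the assumption, not stated in the problem, that the lowest bid wins the project. The function name simulate_bids is hypothetical.

```python
import random


def simulate_bids(jonah_bid=750_000, n_trials=100_000, seed=1):
    """Estimate each contractor's probability of submitting the lowest bid."""
    rng = random.Random(seed)
    wins = {"Jonah": 0, "A": 0, "B": 0}
    for _ in range(n_trials):
        bids = {
            "Jonah": jonah_bid,
            # random.triangular takes its arguments as (low, high, mode).
            "A": rng.triangular(600_000, 800_000, 725_000),
            "B": rng.gauss(700_000, 50_000),
        }
        # Assumption: the contract goes to the lowest bidder.
        winner = min(bids, key=bids.get)
        wins[winner] += 1
    return {name: count / n_trials for name, count in wins.items()}


if __name__ == "__main__":
    for name, share in simulate_bids().items():
        print(name, "wins with estimated probability", share)
```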

17. You are considering the purchase of a new car and are weighing the choice between a Ford Fusion Hybrid sedan (which assists a gasoline engine with an electric motor powered via regenerative braking) and the Ford Fusion Non-Hybrid sedan (just a standard gasoline engine). The non-hybrid version costs $23,240 with fuel economy of 21 miles per gallon in city driving and 32 miles per gallon in highway driving. The hybrid version of the car costs $25,990 with fuel economy of 43 miles per gallon in city driving and 41 miles per gallon in highway driving.

You plan to keep the car for 10 years. Your annual mileage is uncertain; you only know that each year you will drive between 9,000 and 13,000 miles. Based on your past driving patterns, 60% of your miles are city driving and 40% of your miles are highway driving. The current gasoline price is $2.19 per gallon, but you know that gasoline prices vary unpredictably over time.

Hybrid


Compute the net present value (NPV) of the costs of each vehicle (purchase cost + gasoline cost) using a discount rate of 3%. Assume that you pay the entire purchase price of the vehicle immediately (Year 0) and the annual gasoline costs are incurred at the end of each year.

a. On average, what are the cost savings of the hybrid vehicle over the non-hybrid?

b. Because of your concern about the maintenance needs of the hybrid vehicle, you would need to be assured of significant savings to convince you to purchase the hybrid. What is the probability that the hybrid will result in more than $2,000 in savings over the non-hybrid?

18. Orange Tech (OT) is a software company that provides a suite of programs that are essential to everyday business computing. OT has just enhanced its software and released a new version of its programs. For financial planning purposes, OT needs to forecast its revenue over the next few years. To begin this analysis, OT is considering one of its largest customers. Over the planning horizon, assume that this customer will upgrade at most once to the newest software version, but the number of years that pass before the customer purchases an upgrade varies. Up to the year that the customer actually upgrades, assume there is a 0.50 probability that the customer upgrades in any particular year. In other words, the upgrade year of the customer is a random variable. For guidance on an appropriate way to model upgrade year, refer to Appendix 11.1. Furthermore, the revenue that OT earns from the customer’s upgrade also varies (depending on the number of programs the customer decides to upgrade). Assume that the revenue from an upgrade obeys a normal distribution with a mean of $100,000 and a standard deviation of $25,000. Using the template in the file OrangeTech, complete a simulation model that analyzes the net present value of the revenue from the customer upgrade. Use an annual discount rate of 10%.

a. What is the average net present value that OT earns from this customer? (Hint: Excel’s NPV function computes the net present value for a sequence of cash flows that occur at the end of each period. To correctly use this function for cash flows that occur at the beginning of each period, use the formula =NPV(discount rate, flow range) + initial amount, where discount rate is the annual discount rate, flow range is the cell range containing cash flows for years 1 through n, and initial amount is the cash flow in the initial period (year 0)).

b. What is the standard deviation of net present value? How does this compare to the standard deviation of the revenue? Explain.
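The hint in part (a) of Problem 18 turns on how Excel's NPV function discounts: the first value in the range is treated as arriving one period from now. The short Python sketch below mirrors that convention to show why the year-0 amount is added separately. The function names excel_style_npv and npv_with_initial are hypothetical, and the cash flows in the example are made-up illustration values, not data from the OrangeTech file.

```python
def excel_style_npv(rate, flows):
    """Discount flows[0] by one period, flows[1] by two, and so on."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(flows, start=1))


def npv_with_initial(rate, initial_amount, flows):
    """NPV when the first cash flow occurs immediately (year 0, undiscounted)."""
    return excel_style_npv(rate, flows) + initial_amount


if __name__ == "__main__":
    # Illustration values only: 50 now, 110 after one year, 121 after two years.
    print(npv_with_initial(0.10, 50, [110, 121]))
```

Passing the year-0 amount inside the flow list instead would discount it by one period and understate the result.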

19. OuRx, a retail pharmacy chain, is faced with the decision of how much flu vaccine to order for the next flu season. OuRx has to place a single order for the flu vaccine several months before the beginning of the season because it takes four to five months for the supplier to create the vaccine. OuRx wants to more closely examine the ordering decision because, over the past few years, the company has ordered too much vaccine or too little. OuRx pays a wholesale price of $12 per dose to obtain the flu vaccine from the supplier and then sells the flu shot to their customers at a retail price of $20.

Because OuRx earns a profit on flu shots that it sells and it can’t sell more than its supply, the appropriate profit computation depends on whether demand exceeds the order quantity or vice versa. Similarly, the number of lost sales and excess doses depends on whether demand exceeds the order quantity or vice versa. Demand for the flu vaccine is uncertain. The VaccineDemand worksheet in the file OuRx contains data produced by epidemiologists to help OuRx gain insight on demand for flu vaccine at their retail pharmacies.

a. Construct a base spreadsheet model that correctly computes net profit for a given level of demand and specified order quantity. Test your spreadsheet using an order quantity of 500,000 doses and demand of 400,000 doses and 600,000 doses.

b. To help determine how to model flu vaccine demand, construct a histogram of the data provided in the VaccineDemand worksheet in the file OuRx. In column B, compute the natural logarithm (using the Excel function LN) of each observation and construct a histogram of these logged demand observations. Based on the histograms of the non-logged demand and logged demand, respectively, what seems to be a good choice of probability distribution for (non-logged) vaccine demand? (Hint: Review Appendix 11.1 for descriptions of various types of probability distributions.)

c. Representing flu vaccine demand with the type of random variable you identified in part (b), complete the simulation model and determine the average net profit resulting from an order quantity of 500,000 doses. What is the 95% confidence interval on the average profit? What is the probability of running out of the flu vaccine?

20. At a local university, the Student Commission on Programming and Entertainment (SCOPE) is preparing to host its first music concert of the school year. To successfully produce this music concert, SCOPE has to complete several activities. The following table lists information regarding each activity. An activity’s immediate predecessors are the activities that must be completed before the considered activity can begin. The table also lists duration estimates (in days) for each activity.

Activity                                        Immediate Predecessors   Minimum Time   Likely Time   Maximum Time
A: Negotiate contract with selected musicians   None                     5              6             9
B: Reserve site                                 None                     8              12            15
C: Logistical arrangements for music group      A                        5              6             7
D: Screen and hire security personnel           B                        3              3             3
E: Advertising and ticketing                    B, C                     1              5             9
F: Hire parking staff                           D                        4              7             10
G: Arrange concession sales                     E                        3              8             10

The following network illustrates the precedence relationships in the SCOPE project. The project begins with activities A and B, which can start immediately (time 0) because they have no predecessors. On the other hand, activity E cannot be started until activities B and C are both completed. The project is not complete until all activities are completed.

[Figure: precedence network for the SCOPE project, running from Start through the activity nodes to Finish]

a. Using the triangular distribution to represent the duration of each activity, construct a simulation model to estimate the average amount of time to complete the concert preparations.

b. What is the likelihood that the project will be complete in 23 days or less?
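The precedence rules in Problem 20 mean that each activity starts when the last of its immediate predecessors finishes, and the project ends when the last activity finishes. The Python sketch below illustrates that logic with triangular durations. It is an assumption-laden illustration rather than the intended spreadsheet model, and the names DURATIONS, PREDECESSORS, project_duration, sample_times, and simulate are hypothetical.

```python
import random
import statistics

# (minimum, likely, maximum) duration in days, from the activity table.
DURATIONS = {
    "A": (5, 6, 9), "B": (8, 12, 15), "C": (5, 6, 7), "D": (3, 3, 3),
    "E": (1, 5, 9), "F": (4, 7, 10), "G": (3, 8, 10),
}
PREDECESSORS = {
    "A": [], "B": [], "C": ["A"], "D": ["B"],
    "E": ["B", "C"], "F": ["D"], "G": ["E"],
}


def project_duration(times):
    """Finish time of the last activity, given one duration per activity."""
    finish = {}
    for act in "ABCDEFG":  # ordered so every predecessor is handled first
        start = max((finish[p] for p in PREDECESSORS[act]), default=0)
        finish[act] = start + times[act]
    return max(finish.values())


def sample_times(rng):
    """Draw one triangular duration per activity."""
    times = {}
    for act, (low, likely, high) in DURATIONS.items():
        # Activity D has a single possible value, so it is treated as constant.
        times[act] = low if low == high else rng.triangular(low, high, likely)
    return times


def simulate(n_trials=10_000, seed=1):
    rng = random.Random(seed)
    return [project_duration(sample_times(rng)) for _ in range(n_trials)]


if __name__ == "__main__":
    days = simulate()
    print("average completion time:", statistics.mean(days))
    print("share finished within 23 days:",
          sum(1 for d in days if d <= 23) / len(days))
```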

21. Steve Austin is the fleet manager for SharePlane, a company that sells fractional ownership of private jets. SharePlane must carefully maintain their jets at all times. If a jet breaks down, it must be repaired immediately. Even if a jet functions well, it must be maintained at regularly scheduled intervals. Currently, Steve is managing two jets, Jet A and Jet B, for a collection of clients and is interested in estimating their availability in between trips to the repair shop as having both jets out-of-service due to repair or maintenance at the same time can affect its customer service. Jet A and Jet B have just completed preventive maintenance. The next maintenance is scheduled for both Jet A and Jet B in four months. It is also possible that one or both will break down before this scheduled maintenance and require repair. The amount of time to a plane’s first failure is uncertain. Historical data recording the time to a plane’s first failure (measured in months) is provided in the TimeToFailData worksheet of the file TwoJets. Determine an appropriate probability distribution for these data. Furthermore, once a plane enters repair (either due to a failure or as scheduled maintenance), the amount of time the plane will be in maintenance is also uncertain. Historical data recording the repair time (measured in months) is provided in the RepairTimeData worksheet of the file TwoJets. Examine the appropriateness of fitting a log-normal distribution to these data. Steve wants to develop a simulation model to estimate the length of time that Jet A and Jet B are both out-of-service over the next few months. For simplicity, you can assume that these planes will enter repair or maintenance just once over the next few months.

a. What is the average amount of time that the planes are both out-of-service?

b. What is the probability that the planes are both out-of-service for longer than 1.5 months?

22. Blackjack, or 21, is a popular casino game that begins with each player and the dealer being dealt two cards. The value of each hand is determined by the point total of the cards in the hand. Face cards and 10s count 10 points, aces can be counted as either 1 or 11 points, and all other cards count at their face value. For instance, the value of a hand consisting of a jack and an 8 is 18; the value of a hand consisting of an ace and a two is either 3 or 13, depending on whether the player counts the ace as 1 or 11 points. The goal is to obtain a hand with a value as close as possible to 21 without exceeding 21. After the initial deal, each player and the dealer may draw additional cards (called “taking a hit”) in order to improve her or his hand. If a player or the dealer takes a hit and the value of the hand exceeds 21, that person “goes broke” and loses. The dealer’s advantage is that each player must decide whether to take a hit before the dealer decides whether to take a hit. If a player takes a hit and goes over 21, the player loses even if the dealer later takes a hit and goes over 21. For this reason, players will often decide not to take a hit when the value of their hand is 12 or greater.

The dealer’s hand is dealt with one card up (face showing) and one card down (face hidden). The player then decides whether to take a hit based on knowledge of the dealer’s up card. Suppose that you are playing blackjack and the dealer’s up card is a 6 and your hand has a value of 16 for the two cards initially dealt.

With a hand of a value of 16, if you decide to take a hit, the following cards will improve your hand: ace, 2, 3, 4, or 5. Any card with a point count greater than 5 will result in you going broke. Assume that if you have a hand with a value of 16, the following probabilities describe the ending value of your hand:

Value of Hand   17       18       19       20       21       Broke
Probability     0.0769   0.0769   0.0769   0.0769   0.0769   0.6155

A gambling professional determined that when the dealer’s up card is a 6, the following probabilities describe the ending value of the dealer’s hand:

Value of Hand   17       18       19       20       21       Broke
Probability     0.1654   0.1063   0.1063   0.1017   0.0972   0.4231

a. Construct a simulation model to simulate the result of 1,000 blackjack hands when the dealer has a 6 up and you take a hit with a hand that has a value of 16. What is the probability of the dealer winning, a push (a tie), and you winning, respectively?

b. If you have a hand with a value of 16 and don’t take a hit, the only way that you can win is if the dealer goes broke. If you don’t take a hit, what is the probability of the dealer winning, a push (a tie), and you winning, respectively?

c. Based on the results from parts (a) and (b), should you take a hit or not if you have a hand of value 16 and the dealer has a 6 up?

23. To boost holiday sales, Ginsberg jewelry store is advertising the following promotion: “If more than five inches of snow fall in the first three days of the year (January 1 through January 3), all purchases made between Thanksgiving and Christmas are free!” Based on historical sales records as well as experience with past promotions, the store manager believes that the total holiday sales between Thanksgiving and Christmas could range anywhere between $200,000 and $400,000 but is unsure of anything more specific. Ginsberg has collected data on snowfall from December 16 to January 18 for the past several winters in the file Ginsberg.


a. Construct a simulation model to assess potential refund amounts so that Ginsberg can evaluate the option of purchasing an insurance policy to cover potential losses.

b. What is the probability that Ginsberg will have to refund sales?

c. What is the average refund? Why is this a poor measure to use to assess risk?

d. In the cases when snowfall exceeds 5 inches, what is the average refund?

24. A creative entrepreneur has created a novelty soap called Jackpot. Inside each bar of Jackpot soap is a rolled-up bill of U.S. currency. There are 1,000 bars of soap in the initial offering of the soap. Although the denomination of the bill inside a bar of soap is unknown, the distribution of bills in these first 1,000 bars is given in the following table:

Bill Denomination Number of Bills

$1 520

$5 260

$10 130

$20 60

$50 29

$100 1

Total 1,000

If a customer buys 40 bars of soap, the number of bars that contain a $50 or $100 bill is uncertain. On average, how many of these bars contain a $50 or $100 bill? What is the probability that at least one of the 40 bars contains a $50 or $100 bill? (Hint: Review Appendix 11.1 for descriptions of various types of probability distributions to identify the random variable that describes the number of bars that contain a $50 or $100 bill.)
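One possible reading of Problem 24 is that the 40 purchased bars are drawn without replacement from the fixed pool of 1,000, so the count of bars with a large bill can be checked both by direct counting and by simulation. The Python sketch below follows that reading, which is an assumption made here and not a statement of the intended answer. The names exact_mean, exact_prob_at_least_one, and simulate are hypothetical.

```python
import math
import random

TOTAL_BARS = 1_000
BIG_BILL_BARS = 29 + 1   # bars holding a $50 bill or the $100 bill
PURCHASED = 40


def exact_mean():
    """Expected number of big-bill bars among the purchased bars."""
    return PURCHASED * BIG_BILL_BARS / TOTAL_BARS


def exact_prob_at_least_one():
    """1 minus the chance that all purchased bars come from the other bars."""
    none = (math.comb(TOTAL_BARS - BIG_BILL_BARS, PURCHASED)
            / math.comb(TOTAL_BARS, PURCHASED))
    return 1 - none


def simulate(n_trials=20_000, seed=1):
    """Sample purchases without replacement; return (mean count, P(count >= 1))."""
    rng = random.Random(seed)
    bars = [1] * BIG_BILL_BARS + [0] * (TOTAL_BARS - BIG_BILL_BARS)
    counts = [sum(rng.sample(bars, PURCHASED)) for _ in range(n_trials)]
    mean_count = sum(counts) / n_trials
    share_with_any = sum(1 for c in counts if c >= 1) / n_trials
    return mean_count, share_with_any


if __name__ == "__main__":
    print("exact mean:", exact_mean())
    print("exact P(at least one):", exact_prob_at_least_one())
    print("simulated (mean, P(at least one)):", simulate())
```

Comparing the simulated values with the counting result is a quick check that the sampling scheme matches the scenario being modeled.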

25. Refer to the Jackpot soap scenario in Problem 24. After the sale of the original 1,000 bars of soap, Jackpot soap went viral, and the soap has become wildly popular. Production of the soap has been ramped up so that now millions of bars have been produced. However, the distribution of the bills in the soap obeys the same distribution as outlined in Problem 24. On average, how many bars of soap will a customer have to buy before purchasing three bars of soap each containing a bill of at least $20 value? (Hint: Review Appendix 11.1 for descriptions of various types of probability distributions.)

26. Major League Baseball’s World Series is a maximum of seven games, with the winner being the first team to win four games. Assume that the Atlanta Braves are playing the Minnesota Twins in the World Series and that the first two games are to be played in Atlanta, the next three games at the Twins’ ballpark, and the last two games, if necessary, back in Atlanta. Taking into account the projected starting pitchers for each game and the home field advantage, the probabilities of Atlanta winning each game are as follows:

Game 1 2 3 4 5 6 7

Probability of Win 0.60 0.55 0.48 0.45 0.48 0.55 0.50

a. Set up a spreadsheet simulation model in which the outcome of each game (whether Atlanta or Minnesota wins) is a random variable.

b. What is the average number of games played regardless of the winner?

c. What is the probability that the Atlanta Braves win the World Series?
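The series in Problem 26 stops as soon as either team reaches four wins, so a simulation has to track wins game by game rather than always playing seven games. The Python sketch below shows one way to express that stopping rule. It is an illustration, not the spreadsheet model requested in part (a), and the names play_series and simulate are hypothetical.

```python
import random

# Probability that Atlanta wins games 1 through 7, from the table above.
ATLANTA_WIN_PROBS = [0.60, 0.55, 0.48, 0.45, 0.48, 0.55, 0.50]


def play_series(win_probs, rng):
    """Play until one team has four wins; return (atlanta_won, games_played)."""
    atlanta = minnesota = 0
    for games_played, p in enumerate(win_probs, start=1):
        if rng.random() < p:
            atlanta += 1
        else:
            minnesota += 1
        if atlanta == 4 or minnesota == 4:
            return atlanta == 4, games_played
    raise ValueError("a best-of-seven series needs seven game probabilities")


def simulate(n_trials=10_000, seed=1):
    """Return (share of series won by Atlanta, average number of games)."""
    rng = random.Random(seed)
    results = [play_series(ATLANTA_WIN_PROBS, rng) for _ in range(n_trials)]
    atlanta_share = sum(1 for won, _ in results if won) / n_trials
    average_games = sum(games for _, games in results) / n_trials
    return atlanta_share, average_games


if __name__ == "__main__":
    share, games = simulate()
    print("estimated P(Atlanta wins the series):", share)
    print("estimated average number of games:", games)
```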

27. Young entrepreneur Fan Bingbing has launched a business venture in which she uses stories submitted by university students as the basis for comics in a monthly anime-style magazine. Based on market research, Fan estimates that average monthly demand will be 500 copies. She has decided to model monthly demand as a normal random variable with a mean of 500 and a standard deviation of 300.

Fan must pay a publishing company $3.75 for each copy of the comic printed. She then sells the magazine for $5 each. Rather than having a store-front, Fan sells the magazines through a group of student vendors who sell the comics out of their backpacks while on campus. Fan pays a student vendor $0.35 for each magazine he/she sells. As Fan distributes a new issue each month, she only sells each issue for a month. However, the publishing company has agreed to buy back from Fan any unsold copies at the end of each month for $2.25.

a. As Fan validates the simulation model you have constructed, she observes something troublesome regarding the use of the normal distribution with a mean of 500 and a standard deviation of 300 to model monthly demand. What is it? How can you modify the simulation model to address this issue?

b. Based on the simulation model that incorporates a remedy for the validation issue observed in part (a), what is the estimate of the average profit if Fan sets the order quantity to 1,200? What is the 95% confidence interval on this estimate of average profit?

c. For an order quantity of 1,200 copies, what is the profit value such that 2.5% of the profit outcomes are smaller than this value? What is the profit value such that 2.5% of the profit outcomes are larger than this value? (Hint: you can use the Excel function PERCENTILE.EXC to help you determine these values.) These two values define a range which 95% of the profit outcomes lie between. Why doesn’t this range correspond to the 95% confidence interval in part (b)?
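Parts (b) and (c) of Problem 27 contrast two different intervals: a confidence interval on the mean profit and a range covering the middle 95% of individual profit outcomes. The Python sketch below computes both from the same simulated profits so the difference can be seen directly. It is only a sketch. Clipping negative demand at zero is one possible remedy assumed here, the use of a normal-approximation interval with z = 1.96 is also an assumption, and the function names are hypothetical.

```python
import math
import random
import statistics


def monthly_profit(order_qty, demand):
    """Profit for one month given the order quantity and realized demand."""
    sold = min(order_qty, demand)
    unsold = order_qty - sold
    return 5.00 * sold - 0.35 * sold - 3.75 * order_qty + 2.25 * unsold


def simulate_profits(order_qty=1_200, n_trials=10_000, seed=1):
    rng = random.Random(seed)
    # Assumed remedy: a negative demand draw is replaced by zero.
    return [monthly_profit(order_qty, max(0.0, rng.gauss(500, 300)))
            for _ in range(n_trials)]


def mean_confidence_interval(values, z=1.96):
    """Approximate 95% confidence interval on the mean."""
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    return mean - half_width, mean + half_width


def central_95_range(values):
    """2.5th and 97.5th percentiles of the individual outcomes."""
    cuts = statistics.quantiles(values, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    return cuts[0], cuts[-1]


if __name__ == "__main__":
    profits = simulate_profits()
    print("95% CI on mean profit:", mean_confidence_interval(profits))
    print("middle 95% of outcomes:", central_95_range(profits))
```

The first interval narrows as the number of trials grows, while the second settles toward fixed percentiles of the profit distribution.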

28. Bianca Peterson is a marketing engineer for Hexagon Composites, a company which sells carbon composite storage tanks. In an effort to gain product adoptions from customers, Bianca goes on sales trips (often to foreign countries). For each of 120 previous sales trips, the file SalesTrips lists (1) whether the trip resulted in the visited customer adopting the product, and (2) the revenue generated by the adoption.

a. Bianca has six sales trips planned over the next couple of months. What is the average revenue that Bianca expects to generate from these six trips? What is the probability that she generates $200,000 or less from these six trips?

b. Bianca receives a sales bonus if she gains three more product adoptions before the end of the year. The number of sales trips that Bianca will need to make to earn her bonus is uncertain. What is its distribution? If Bianca only has time to make 10 more sales trips before the end of the year, what is the likelihood that she earns her bonus?

29. Gorditos sells a variety of Mexican-inspired cuisine for which tortillas are often the main ingredient. Assume that each customer places an order requiring one tortilla with a 75% probability independent of other customers’ orders. The other 25% of customers place orders that do not require a tortilla. Assume that the number of customers who arrive per hour has a Poisson distribution with the average number of customers in an hour time slot given in the following table:


Time of Day Average Number of Customers

11am–noon 200

Noon–1pm 200

1pm–2pm 200

2pm–3pm 50

3pm–4pm 50

4pm–5pm 50

5pm–6pm 150

6pm–7pm 150

7pm–8pm 150

8pm–9pm 50

9pm–10pm 50

Gorditos currently prepares dough for 750 tortillas at the beginning of each day. Due to uncertain customer demand, Gorditos may run out of tortillas, which affects profit as well as customer relations. Every tortilla-based customer order generates $2.35 in profit. Every customer who places an order requiring a tortilla but is denied (due to a tortilla stock-out) leaves Gorditos without buying anything with probability 0.13, and purchases a non-tortilla menu item (generating profit of $1.50) with probability 0.87. Create a simulation model to generate the distribution of daily lost profit due to tortilla stock-outs.

a. What is the average daily lost profit? What is the 95% confidence interval on this mean?

b. On average, which hour of the work day does Gorditos run out of tortillas? What is the 95% confidence interval on the mean?

30. As admissions director for an exclusive executive MBA program which takes place on Necker Island in the Caribbean, Richard Branson must decide which applicants should receive admission offers. This is a difficult decision-making problem, as an applicant may or may not accept an admission offer. Currently, Richard is considering 30 applicants, each of whom has a different probability of accepting an admission offer. Based on their academic qualifications and experience, Richard has rated each of these applicants using a value score from 1 to 10 (higher value scores represent better applicants). The file Admissions contains data on the 30 applicants.

Based on the capacity of their facilities, Richard would like a class of 12 students. Fewer students than 12 results in under-utilized resources (empty classroom seats), but more than 12 students results in increased marginal costs. Specifically, each attending student beyond 12 incurs a cost of 20 value points. Note that an applicant will be an attending student only if he or she is admitted and he or she accepts the admission offer.

Construct a spreadsheet model that computes the net value of offering admission to the top 20 students as ranked by value score. Compute net value as the sum of the value of attending students minus the costs of students beyond the capacity of 12 seats. What is the average net value obtained when offering admission to the top 20 students?

31. The wedding date for a couple is quickly approaching, and the wedding planner must provide the caterer an estimate of how many people will attend the reception so that the appropriate quantity of food is prepared for the buffet. The following table contains information on the number of RSVPs for the 145 invitations. Unfortunately, the number of guests who actually attend does not always correspond to the number of RSVPs.

RSVPs         No. of Invitations
0             50
1             25
2             60
No response   10

Based on her experience, the wedding planner knows that it is extremely rare for guests to attend a wedding if they affirmed that they will not be attending. Therefore, the wedding planner will assume that no one from these 50 invitations will attend. The wedding planner estimates that each of the 25 guests planning to come alone has a 75% chance of attending alone, a 20% chance of not attending, and a 5% chance of bringing a companion. For each of the 60 RSVPs who plan to bring a companion, there is a 90% chance that she or he will attend with a companion, a 5% chance of attending alone, and a 5% chance of not attending at all. For the 10 people who have not responded, the wedding planner assumes that there is an 80% chance that each will not attend, a 15% chance that they will attend alone, and a 5% chance that they will attend with a companion.

To solve Problem 31, you will need to use native Excel functionality rather than Analytic Solver Basic because the problem exceeds the number of random variables allowed by Analytic Solver Basic.

a. Assist the wedding planner by constructing a spreadsheet simulation model to estimate the average number of guests who will attend the reception.

b. To be accommodating hosts, the couple has instructed the wedding planner to use the simulation model to determine X, the minimum number of guests for which the caterer should prepare the meal, so that there is at least a 90% chance that the actual attendance is less than or equal to X. What is the best estimate for the value of X?

32. A European put option on a currency allows you to sell a unit of that currency at the specified strike price (exchange rate) at a particular point in time after the purchase of the option. For example, suppose Press Teag Worldwide (from Section 11.3) purchases a three-month European put option for a British pound with a strike price of £0.630 per U.S. dollar. Then, if the exchange rate in three months is such that it takes more than £0.630 to buy a U.S. dollar, for example, £0.650 per U.S. dollar, Press Teag will exercise the put option and sell its pound sterling at the strike price of £0.630 per U.S. dollar. However, if the exchange rate in three months is such that it takes less than £0.630 to buy a U.S. dollar, for example, £0.620 per U.S. dollar, Press Teag will not exercise its put option and will instead sell its pound sterling at the market rate of £0.620 per U.S. dollar.

The following table lists information on the three-month European put options on pound sterling, New Zealand dollars, and Japanese yen.

Currency Purchase Price per Option Strike Price

Pound Sterling $0.01 per £ £0.630 per $

New Zealand Dollar $0.01 per NZD NZD 1.230 per $

Japanese Yen $0.00005 per ¥ ¥90.00 per $

To solve Problem 32, you will need to use native Excel functionality rather than Analytic Solver Basic because the problem exceeds the number of random variables allowed by Analytic Solver Basic.


Modify the simulation model in the file QuarterlyExchangeModel to compare the strategy of hedging half of the revenue in each of three foreign currencies using European put options versus the strategy of not using put options to hedge at all. What is the average difference in revenue (hedged revenue – unhedged revenue)? What is the probability that the hedged revenue is less than the unhedged revenue?

33. Over the past year, a financial analyst has tracked the daily change in the price per share of common stock for a major oil company. The financial analyst wants to develop a simulation model to analyze the stock price at the end of the next quarter. Assume 63 trading days and a current price per share of $51.60.

a. Based on the data in the DataToFit worksheet of the file DailyStock, use the Excel formula =CORREL(B3:B313, B4:B314) to compute the correlation between the percent change in stock price on consecutive days. What do you conclude about the dependency of the percent change in stock price from day to day?

b. Based on the data in the DataToFit worksheet of the file DailyStock, compute sample statistics and construct a histogram to visualize the distribution of the data. Select a distribution that appears to fit this data.

c. Using the distribution that you selected in part (b) to represent the daily percent change in stock price, construct a simulation model to estimate the price per share at the end of the quarter. What is the probability that the stock price will be below $26.55?

d. The WhatReallyHappened worksheet of the file DailyStock contains the 63 values of the daily percent change in stock price that actually occurred during the quarter. What does this reveal about the limitations of simulation modeling? What could the financial analyst do to address this limitation?

34. Burger Dome is a fast-food restaurant currently evaluating its customer service. In its current operation, an employee takes a customer’s order, tabulates the cost, receives payment from the customer, and then fills the order. Once the customer’s order is filled, the employee takes the order of the next customer waiting for service. Assume that time between each customer’s arrival is an exponential random variable with a mean of 1.35 minutes. Assume that the time for the employee to complete the customer’s service is an exponential random variable with a mean of 1 minute. Use the file BurgerDome to complete a simulation model for the waiting line at Burger Dome for a 14-hour workday. Using the summary statistics gathered at the bottom of the spreadsheet model, answer the following questions.

a. What is the average wait time experienced by a customer?

b. What is the longest wait time experienced by a customer?

c. What is the probability that a customer waits more than 2 minutes?

d. Create a histogram depicting the wait time distribution.

e. By pressing the F9 key to generate a new set of simulation trials, you can observe the variability in the summary statistics from simulation to simulation. Typically, this variability can be reduced by increasing the number of trials. Why is this approach not appropriate for this problem?

To solve Problem 33, you will need to use native Excel functionality rather than Analytic Solver Basic because the problem exceeds the number of random variables allowed by Analytic Solver Basic.

To solve Problem 34, you will need to use native Excel functionality rather than Analytic Solver Basic because the problem exceeds the number of random variables allowed by Analytic Solver Basic.


35. One advantage of simulation is that a simulation model can be altered easily to reflect a change in the assumptions. Refer to the Burger Dome analysis in Problem 34. Assume that the service time is more accurately described by a normal distribution with a mean of 1 minute and a standard deviation of 0.2 minute. This distribution has less variability than the exponential distribution originally used. What is the impact of this change on the output measures?

36. Refer to the Burger Dome analysis in Problem 34. Burger Dome wants to consider the effect of hiring a second employee to serve customers (in parallel with the first employee). Use the file BurgerDomeTwoServers to complete a simulation model that accounts for the second employee. (Hint: The time that a customer begins service will depend on the availability of employees.) What is the impact of this change on the output measures?

Case Problem: Four Corners

What will your investment portfolio be worth in 10 years? In 20 years? When you stop working? The Human Resources Department at Four Corners Corporation was asked to develop a financial planning model that would help employees address these questions. Tom Gifford was asked to lead this effort and decided to begin by developing a financial plan for himself. Tom has a degree in business and, at the age of 40, is making $85,000 per year. Through contributions to his company’s retirement program and the receipt of a small inheritance, Tom has accumulated a portfolio valued at $50,000. Tom plans to work 20 more years and hopes to accumulate a portfolio valued at $1,000,000. Can he do it?

Tom began with a few assumptions about his future salary, his new investment contributions, and his portfolio growth rate. He assumed a 5% annual salary growth rate and plans to make new investment contributions at 6% of his salary. After some research on historical stock market performance, Tom decided that a 10% annual portfolio growth rate was reasonable. Using these assumptions, Tom developed the following Excel worksheet:

To solve Problems 35 and 36, you will need to use native Excel functionality rather than Analytic Solver Basic because the problem exceeds the number of random variables allowed by Analytic Solver Basic.

BurgerDome2Servers

Four Corners

Age                      40
Current Salary           $85,000
Current Portfolio        $50,000
Annual Investment Rate   6%
Salary Growth Rate       5%
Portfolio Growth Rate    10%

Year   Beginning Balance   Salary     New Investment   Earnings   Ending Balance   Age
1      $50,000             $85,000    $5,100           $5,255     $60,355          41
2      $60,355             $89,250    $5,355           $6,303     $72,013          42
3      $72,013             $93,713    $5,623           $7,482     $85,118          43
4      $85,118             $98,398    $5,904           $8,807     $99,829          44
5      $99,829             $103,318   $6,199           $10,293    $116,321         45

The worksheet provides a financial projection for the next five years. In computing the portfolio earnings for a given year, Tom assumed that his new investment contribution would occur evenly throughout the year, and thus half of the new investment could be included in the computation of the portfolio earnings for the year. From the worksheet, we see that, at age 45, Tom is projected to have a portfolio valued at $116,321.
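The worksheet logic described above can be sketched in a few lines of Python. This is only an illustration of the stated assumptions (6% investment rate, 5% salary growth, 10% portfolio growth, half of each year's contribution earning a return); the function name and argument names are hypothetical and do not come from the FourCorners file.

```python
def project_portfolio(years, salary=85_000.0, balance=50_000.0,
                      invest_rate=0.06, salary_growth=0.05,
                      portfolio_growth=0.10):
    """Return the ending balance after `years` years of the worksheet logic."""
    for _ in range(years):
        new_investment = invest_rate * salary
        # Half of the year's contribution is assumed to earn a return.
        earnings = portfolio_growth * (balance + new_investment / 2)
        balance = balance + new_investment + earnings
        salary *= 1 + salary_growth  # next year's salary
    return balance


print(round(project_portfolio(5)))   # 116321, the age-45 value in the worksheet
print(round(project_portfolio(20)))  # 772722, the 20-year value quoted in the case
```

Each year's ending balance reduces to 1.1 × (beginning balance) + 1.05 × (new investment), which is why the projection is so sensitive to the assumed portfolio growth rate.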

Tom’s plan was to use this worksheet as a template to develop financial plans for the company’s employees. The data in the spreadsheet would be tailored for each employee, and rows would be added to it to reflect the employee’s planning horizon. After adding another 15 rows to the worksheet, Tom found that he could expect to have a portfolio of $772,722 after 20 years. Tom then took his results to show his boss, Kate Krystkowiak.

FourCorners

Although Kate was pleased with Tom’s progress, she voiced several criticisms. One of the criticisms was the assumption of a constant annual salary growth rate. She noted that most employees experience some variation in the annual salary growth rate from year to year. In addition, she pointed out that the constant annual portfolio growth rate was unrealistic and that the actual growth rate would vary considerably from year to year. She further suggested that a simulation model for the portfolio projection might allow Tom to account for the random variability in the salary growth rate and the portfolio growth rate.

After some research, Tom and Kate decided to assume that the annual salary growth rate would vary from 0% to 5% and that a uniform probability distribution would provide a realistic approximation. Four Corners’ accountants suggested that the annual portfolio growth rate could be approximated by a normal probability distribution with a mean of 10% and a standard deviation of 5%. With this information, Tom set off to redesign his spreadsheet so that it could be used by the company’s employees for financial planning.
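One way to picture the redesigned model is the Python sketch below, which draws a new salary growth rate (uniform on 0% to 5%) and a new portfolio growth rate (normal with mean 10% and standard deviation 5%) every year. It is a rough outline under stated assumptions, not the case solution: the function names are hypothetical, and the 6% investment rate is only a placeholder for whatever rate Goal Seek produces.

```python
import random


def simulate_portfolio(years=20, salary=85_000.0, balance=50_000.0,
                       invest_rate=0.06, rng=random):
    """Simulate one possible ending balance under random growth rates."""
    for _ in range(years):
        salary_growth = rng.uniform(0.00, 0.05)    # uniform on 0% to 5%
        portfolio_growth = rng.gauss(0.10, 0.05)   # normal, mean 10%, sd 5%
        new_investment = invest_rate * salary
        # Same earnings rule as the worksheet: half the contribution earns a return.
        earnings = portfolio_growth * (balance + new_investment / 2)
        balance += new_investment + earnings
        salary *= 1 + salary_growth
    return balance


def estimate_goal_probability(goal=1_000_000.0, trials=10_000, seed=1,
                              invest_rate=0.06):
    """Estimate P(ending balance >= goal) as a fraction of simulation trials."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    hits = sum(simulate_portfolio(invest_rate=invest_rate, rng=rng) >= goal
               for _ in range(trials))
    return hits / trials
```

Calling `estimate_goal_probability(invest_rate=...)` with different rates gives a quick feel for how uncertain the $1,000,000 goal is; the estimate itself varies from seed to seed.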

Managerial Report

Play the role of Tom Gifford, and develop a simulation model for financial planning. Write a report for Tom’s boss and, at a minimum, include the following:

1. Without considering the random variability, extend the current worksheet to 20 years. Confirm that by using the constant annual salary growth rate and the constant annual portfolio growth rate, Tom can expect to have a 20-year portfolio of $772,722. What would Tom’s annual investment rate have to increase to in order for his portfolio to reach a 20-year, $1,000,000 goal? (Hint: Use Goal Seek.)

2. Redesign the spreadsheet model to incorporate the random variability of the annual salary growth rate and the annual portfolio growth rate into a simulation model. Assume that Tom is willing to use the annual investment rate that predicted a 20-year, $1,000,000 portfolio in part 1. Show how to simulate Tom’s 20-year financial plan. Use results from the simulation model to comment on the uncertainty associated with Tom reaching the 20-year, $1,000,000 goal.

3. What recommendations do you have for employees with a current profile similar to Tom’s after seeing the impact of the uncertainty in the annual salary growth rate and the annual portfolio growth rate?

4. Assume that Tom is willing to consider working 25 more years instead of 20 years. What is your assessment of this strategy if Tom’s goal is to have a portfolio worth $1,000,000?

5. Discuss how the financial planning model developed for Tom Gifford can be used as a template to develop a financial plan for any of the company’s employees.

For a review of Goal Seek, refer to Chapter 10.


Appendix 11.1 Common Probability Distributions for Simulation

Selecting the appropriate probability distribution to characterize a random variable in a simulation model can be a critical modeling decision. In this appendix, we review several probability distributions commonly used in simulation models. We describe the native Excel functionality used to generate random values from the corresponding probability distribution.

Continuous Probability Distributions

Random variables that can take on many possible values (even if the values are discrete) are often modeled with a continuous probability distribution. For common continuous random variables, we provide several pieces of information. First, we list the parameters that specify the probability distribution. We then delineate the minimum and maximum values defining the range that can be realized by a random variable that follows the given distribution. We also provide a short description of the overall shape of the distribution paired with an illustration. Then, we supply an example of the application of the random variable. We conclude with the native Excel functionality for generating random values from the probability distribution.

Normal Distribution

Parameters: mean (μ), standard deviation (σ)

Range: −∞ to +∞

Description: The normal distribution is a bell-shaped, symmetric distribution centered at its mean μ. The normal distribution is often a good way to characterize a quantity that is the sum of many independent random variables.

Example: In human resource management, employee performance is often well represented by a normal distribution. Typically, the performance of 68% of employees is within one standard deviation of the average performance, and the performance of 95% of the employees is within two standard deviations. Employees with exceptionally low or high performance are rare. For example, the performance of a pharmaceutical company’s sales force may be well described by a normal distribution with a mean of 200 customer adoptions and a standard deviation of 40 customer adoptions.

Native Excel: NORM.INV(RAND(), μ, σ)
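The Excel formula above feeds a uniform random number into the inverse of the normal cumulative distribution function. The same idea can be sketched in Python with the standard library; the helper name `normal_draw` is hypothetical and is used here only for illustration.

```python
import random
from statistics import NormalDist


def normal_draw(mean, stdev, u=None):
    """Map a uniform value u in (0, 1) to a normal value (inverse transform)."""
    if u is None:
        u = random.random()
        while u == 0.0:  # inv_cdf requires 0 < u < 1
            u = random.random()
    return NormalDist(mean, stdev).inv_cdf(u)


# Sales force example: mean of 200 adoptions, standard deviation of 40.
sample = [normal_draw(200, 40) for _ in range(5)]
```

Passing `u` explicitly makes the mapping easy to inspect: `u = 0.5` returns the mean, and values of `u` near 0 or 1 return values far out in the tails.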

Beta Distribution

Parameters: alpha (α), beta (β), minimum (A), maximum (B)

Range: A to B

Description: Over the range specified by values A and B, the beta distribution has a very flexible shape that can be manipulated by adjusting α and β. The beta distribution is useful in modeling an uncertain quantity that has a known minimum and maximum value. To estimate the values of the alpha and beta parameters given sample data, we use the following equations, where x̄ is the sample mean and s is the sample standard deviation:

\[
\alpha = \left(\frac{\bar{x}-A}{B-A}\right)\left[\frac{\left(\frac{\bar{x}-A}{B-A}\right)\left(1-\frac{\bar{x}-A}{B-A}\right)}{\frac{s^{2}}{(B-A)^{2}}}-1\right]
\]

\[
\beta = \alpha \times \frac{1-\frac{\bar{x}-A}{B-A}}{\frac{\bar{x}-A}{B-A}}
\]
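As a rough illustration of estimating the beta parameters from sample data, the sketch below uses the standard method-of-moments estimates on the sample mean and standard deviation rescaled to the interval from A to B. It assumes that this is the intended estimation approach; the function and argument names are hypothetical.

```python
def beta_parameters(sample_mean, sample_stdev, minimum, maximum):
    """Method-of-moments estimates of alpha and beta on [minimum, maximum]."""
    width = maximum - minimum
    scaled_mean = (sample_mean - minimum) / width   # mean rescaled to [0, 1]
    scaled_var = (sample_stdev / width) ** 2        # variance rescaled to [0, 1]
    alpha = scaled_mean * (scaled_mean * (1 - scaled_mean) / scaled_var - 1)
    beta = alpha * (1 - scaled_mean) / scaled_mean
    return alpha, beta


# A beta distribution with alpha = 2, beta = 5 on [0, 70] has mean 20 and
# variance 125, so feeding those moments back in should recover (2, 5).
alpha, beta = beta_parameters(20, 125 ** 0.5, 0, 70)
```

The estimates are only meaningful when the sample variance is smaller than the largest variance possible for that mean on the interval; otherwise the computed α and β are not positive.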

Simulation software such as Analytic Solver, Crystal Ball, or @RISK automates the generation of random values from an even wider selection of probability distributions than is available in native Excel.

Chapter 11 Appendix

[Margin figure: normal distribution density, horizontal axis from 50 to 350]

[Margin figure: beta distribution with α = 0.45, β = 0.45, A = 0, B = 70]

[Margin figure: beta distribution with α = 2, β = 5, A = 0, B = 70]

Example: The boom-or-bust nature of the revenue generated by a movie from a polarizing director may be described by a beta distribution. The relevant values (in millions of dollars) are A = 0, B = 70, α = 0.45, and β = 0.45. This particular distribution is U-shaped and extreme values are more likely than moderate values. The figures in the left margin illustrate beta distributions with different values of α and β, demonstrating its flexibility. The first figure depicts a U-shaped beta distribution. The second figure depicts a unimodal beta distribution with a positive skew. The third figure depicts a unimodal beta distribution with a negative skew.

Native Excel: BETA.INV(RAND(), α, β, A, B)

Gamma Distribution

Parameters: alpha (α), beta (β)

Range: 0 to +∞

Description: The gamma distribution has a very flexible shape controlled by the values of α and β. The gamma distribution is useful in modeling an uncertain quantity that can be as small as zero but can also realize large values. To estimate the values of the alpha and beta parameters given sample data, we use the following equations, where x̄ is the sample mean and s is the sample standard deviation:

\[
\alpha = \left(\frac{\bar{x}}{s}\right)^{2}, \qquad \beta = \frac{s^{2}}{\bar{x}}
\]

Example: The aggregate amount (in $100,000s) of insurance claims in a region may be described by a gamma distribution with α = 2 and β = 0.5.

Native Excel: GAMMA.INV(RAND(), α, β)

Exponential Distribution

Parameters: mean (m)

Range: 0 to +∞

Description: The exponential distribution is characterized by a mean value equal to its standard deviation and a long right tail stretching from a mode value of 0.

Example: The time between events, such as customer arrivals or customer defaults on bill payment, is commonly modeled with an exponential distribution. An exponential random variable possesses the “memoryless” property: the probability of a customer arrival occurring in the next x minutes does not depend on how long it’s been since the last arrival. For example, suppose the average time between customer arrivals is 10 minutes. Then, the probability that there will be 25 or more minutes between customer arrivals if 10 minutes have passed since the last customer arrival is the same as the probability that there will be more than 15 minutes until the next arrival if a customer just arrived.

Native Excel: LN(RAND())*(-m)
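The following Python sketch mirrors the inverse-transform formula above and checks the memoryless example numerically. The function names are hypothetical; the survival probability P(T > t) = exp(−t/m) is the standard result for an exponential random variable with mean m.

```python
import math
import random


def exponential_draw(mean, u=None):
    """Inverse-transform draw from an exponential distribution."""
    if u is None:
        u = 1.0 - random.random()  # lies in (0, 1], so log never sees 0
    return -mean * math.log(u)


def survival(t, mean):
    """P(T > t) for an exponential random variable with the given mean."""
    return math.exp(-t / mean)


# Memoryless check with a mean of 10 minutes between arrivals:
# P(T > 25 | T > 10) should equal P(T > 15).
conditional = survival(25, 10) / survival(10, 10)
unconditional = survival(15, 10)
```

Both quantities equal exp(−1.5), about 0.223, which is the equality described in the example.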

Triangular Distribution

Parameters: minimum (a), most likely (m), maximum (b)

Range: a to b

Description: The triangular distribution is often used to subjectively assess uncertainty when little is known about a random variable besides its range, but it is thought to have a single mode. The distribution is shaped like a triangle with vertices at a, m, and b.

Example: In corporate finance, a triangular distribution may be used to model a project’s annual revenue growth in a net present value analysis if the analyst can reliably provide minimum, most likely, and maximum estimates of growth. For example, a project may have worst-case annual revenue growth of 0%, a most-likely annual revenue growth of 5%, and best-case annual revenue growth of 25%. These values would then serve as the parameters for a triangular distribution.

Native Excel: IF(random < (m - a)/(b - a), a + SQRT((b - a)*(m - a)*random), b - SQRT((b - a)*(b - m)*(1 - random))), where random refers to a single, separate cell containing =RAND()
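The two-branch structure of the triangular formula may be easier to follow in code. The Python sketch below is a direct transcription of it, with a hypothetical function name; the uniform value `u` plays the role of the separate cell containing =RAND().

```python
import math
import random


def triangular_draw(a, m, b, u=None):
    """Inverse-transform draw from a triangular distribution on [a, b] with mode m."""
    if u is None:
        u = random.random()
    # (m - a)/(b - a) is the probability of falling below the most likely value.
    if u < (m - a) / (b - a):
        return a + math.sqrt((b - a) * (m - a) * u)
    return b - math.sqrt((b - a) * (b - m) * (1 - u))


# Revenue growth example: minimum 0%, most likely 5%, maximum 25%.
growth = triangular_draw(0.00, 0.05, 0.25)
```

A single uniform value must drive both the test and the chosen branch, which is why the Excel version refers to one separate cell rather than calling RAND() three times.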

[Margin figure: gamma distribution density]

[Margin figure: exponential distribution density, horizontal axis from 0 to 80]

Because the exponential distribution with mean m is equivalent to the gamma distribution with parameters α = 1 and β = m (Excel’s GAMMA.INV treats β as a scale parameter, so the mean of the gamma distribution is αβ), an exponential random variable can also be generated with GAMMA.INV(RAND(), 1, m).

[Margin figure: beta distribution with α = 7, β = 2, A = 0, B = 70]

[Margin figure: triangular distribution density, horizontal axis from 0 to 0.25]

Uniform Distribution

Parameters: minimum (a), maximum (b)

Range: a to b

Description: The uniform distribution is appropriate when a random variable is equally likely to be any value between a and b. When little is known about a phenomenon other than its minimum and maximum possible values, the uniform distribution may be a conservative choice to model an uncertain quantity.

Example: A service technician making a house call may quote a 4-hour time window in which he will arrive. If the technician is equally likely to arrive any time during this time window, then the arrival time of the technician in this time window may be described with a uniform distribution.

Native Excel: a + (b - a)*RAND()

Log-Normal Distribution

Parameters: log_mean, log_stdev

Range: 0 to +∞

Description: The log-normal distribution is a unimodal distribution (like the normal distribution) that has a minimum value of 0 and a long right tail (unlike the normal distribution). The log-normal distribution is often a good way to characterize a quantity that is the product of many independent, positive random variables. The natural logarithm of a log-normally distributed random variable is normally distributed.

Example: The income distribution of the lower 99% of a population is often well described using a log-normal distribution. For example, for a population in which the natural logarithm of the income observations is normally distributed with a mean of 3.5 and a standard deviation of 0.5, the income observations are distributed log-normally.

Native Excel: LOGNORM.INV(RAND(), log_mean, log_stdev), where log_mean and log_stdev are the mean and standard deviation of the normally distributed random variable obtained when taking the logarithm of the log-normally distributed random variable.

Discrete Probability Distributions

Random variables that can be only a relatively small number of discrete values are often best modeled with a discrete distribution. The appropriate choice of discrete distribution relies on the specific situation. For common discrete random variables, we provide several pieces of information. First, we list the parameters required to specify the distribution. Then, we outline possible values that can be realized by a random variable that follows the given distribution. We also provide a short description of the distribution paired with an illustration. Then, we supply an example of the application of the random variable. We conclude with the native Excel functionality for generating random values from the probability distribution.

Integer Uniform Distribution

Parameters: lower (l), upper (u)

Possible values: l, l + 1, l + 2, . . . , u − 2, u − 1, u

Description: An integer uniform random variable assumes that the integer values between l and u are equally likely.

Example: The number of philanthropy volunteers from a class of 10 students may be an integer uniform variable with values 0, 1, 2, . . . , 10.

Native Excel: RANDBETWEEN(l, u)

[Margin figure: uniform distribution density, horizontal axis from −0.5 to 4.5]

[Margin figure: log-normal distribution density, horizontal axis from −20 to 180]

[Margin figure: integer uniform distribution, horizontal axis from −1 to 11]

Discrete Uniform Distribution

Parameters: set of values {v1, v2, v3, . . . , vk}

Possible values: v1, v2, v3, . . . , vk

Description: A discrete uniform random variable is equally likely to be any of the specified set of values {v1, v2, v3, . . . , vk}.

Example: Consider a game show that awards a contestant a cash prize from an envelope randomly selected from six possible envelopes. If the envelopes contain $1, $5, $10, $20, $50, and $100, respectively, then the prize is a discrete uniform random variable with values {1, 5, 10, 20, 50, 100}.

Native Excel: CHOOSE(RANDBETWEEN(1, k), v1, v2, . . . , vk)

Custom Discrete Distribution

Parameters: set of values {v1, v2, v3, . . . , vk} and corresponding weights {w1, w2, w3, . . . , wk} such that w1 + w2 + · · · + wk = 1

Possible values: v1, v2, v3, . . . , vk

Description: A custom discrete distribution can be used to create a tailored distribution to model a discrete, uncertain quantity. The value of a custom discrete random variable is equal to the value vi with probability wi.

Example: Analysis of daily sales for the past 50 days at a car dealership shows that on 7 days no cars were sold, on 24 days one car was sold, on 9 days two cars were sold, on 5 days three cars were sold, on 3 days four cars were sold, and on 2 days five cars were sold. We can estimate the probability distribution of daily sales using the relative frequencies. An estimate of the probability that no cars are sold on a given day is 7/50 = 0.14, an estimate of the probability that one car is sold is 24/50 = 0.48, and so on. Daily sales may then be described by a custom discrete distribution with values of {0, 1, 2, 3, 4, 5} with respective weights of {0.14, 0.48, 0.18, 0.10, 0.06, 0.04}.

Native Excel: Use the RAND() function in conjunction with the VLOOKUP function referencing a table in which each row lists a possible value and a segment of the interval [0, 1) representing the likelihood of the corresponding value. Figure 11.25 illustrates the implementation for the car sales example.
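The interval-lookup idea behind the RAND() and VLOOKUP combination can be sketched in Python with a sorted list of interval endpoints and a binary search. The names below are hypothetical, and the sketch works with the raw day counts out of 50 rather than the decimal weights so that the interval boundaries are exact integers.

```python
import random
from bisect import bisect_right
from itertools import accumulate

values = [0, 1, 2, 3, 4, 5]                  # cars sold in a day
day_counts = [7, 24, 9, 5, 3, 2]             # days observed, out of 50
upper_ends = list(accumulate(day_counts))    # [7, 31, 40, 45, 48, 50]
total_days = upper_ends[-1]


def cars_sold(u=None):
    """Return the value whose interval contains the uniform number u in [0, 1)."""
    if u is None:
        u = random.random()
    index = bisect_right(upper_ends, u * total_days)
    return values[min(index, len(values) - 1)]  # guard against u == 1.0


simulated_week = [cars_sold() for _ in range(7)]
```

For instance, `cars_sold(0.5)` returns 1 because 0.5 falls in the segment from 0.14 to 0.62, the same row an approximate-match lookup would select.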

[Margin figure: custom discrete distribution for the car sales example, horizontal axis from −0.5 to 5.5]

[Margin figure: binomial distribution, horizontal axis from 4 to 20]

Binomial Distribution

Parameters: trials (n), probability of a success (p)

Possible values: 0, 1, 2, . . . , n

Description: A binomial random variable corresponds to the number of times an event successfully occurs in n trials, and the probability of a success at each trial is p and independent of whether a success occurs on other trials. When n = 1, the binomial is also known as the Bernoulli distribution.

Example: In a portfolio of 20 similar stocks, each of which has the same probability of increasing in value of p = 0.6, the total number of stocks that increase in value can be described by a binomial distribution with parameters n = 20 and p = 0.6.

Native Excel: BINOM.INV(n, p, RAND())
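BINOM.INV returns the smallest number of successes whose cumulative binomial probability is at least the supplied probability. A minimal Python sketch of that search is shown below; the function name `binom_inv` is hypothetical and simply echoes the Excel function for readability.

```python
import math
import random


def binom_inv(n, p, u=None):
    """Smallest k whose cumulative binomial probability is at least u."""
    if u is None:
        u = random.random()
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += math.comb(n, k) * p**k * (1 - p) ** (n - k)
        if cumulative >= u:
            return k
    return n  # guards against floating-point round-off in the final sum


# Portfolio example: 20 stocks, each with probability 0.6 of increasing.
stocks_up = binom_inv(20, 0.6)
```

With n = 1 the function behaves like a Bernoulli draw: for p = 0.6 it returns 0 when u is at most 0.4 and 1 otherwise.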

FIGURE 11.25 Native Excel Implementation of Custom Discrete Distribution

     A                       B                                  C           D
1    Cars Sold               =VLOOKUP(RAND(), A4:C9, 3, TRUE)
2
3    Lower End of Interval   Upper End of Interval              Cars Sold   Probability
4    0.00                    0.14                               0           0.14
5    0.14                    0.62                               1           0.48
6    0.62                    0.80                               2           0.18
7    0.80                    0.90                               3           0.10
8    0.90                    0.96                               4           0.06
9    0.96                    1.00                               5           0.04

[Margin figure: discrete uniform distribution for the game show example, horizontal axis from −10 to 110]

Hypergeometric Distribution

Parameters: trials (n), population size (N), successful elements in population (s)

Possible values: max{0, n + s − N}, . . . , min{n, s}

Description: A hypergeometric random variable corresponds to the number of times an element labeled a success is selected out of n trials in the situation where there are N total elements, s of which are labeled a success and, once selected, cannot be selected again. Note that this is similar to the binomial distribution except that now the trials are dependent because removing the selected element changes the probabilities of selecting an element labeled a success on subsequent trials.

Example: A certain company produces circuit boards to sell to computer manufacturers. Because of a quality defect in the manufacturing process, it is known that only 70 circuit boards out of a lot of 100 have been produced correctly and the other 30 are faulty. If a company orders 40 circuit boards from this lot of 100, the number of functioning circuit boards that the company will receive in its order is a hypergeometric random variable with n = 40, s = 70, and N = 100. Note that, in this case, between 10 (= 40 + 70 − 100) and 40 (= min{40, 70}) of the 40 ordered circuit boards will be functioning. At least 10 of the 40 circuit boards will be functioning because at most 30 (= 100 − 70) are faulty.

Native Excel: Insert the file Hypergeometric into your Excel workbook, modify the parameters in the cell range B2:B4, and then reference cell B6 in your simulation model to obtain a value from a hypergeometric distribution. This file uses the RAND() function in conjunction with the VLOOKUP function referencing a table in which each row lists a possible value and a segment of the interval [0, 1) representing the likelihood of the corresponding value; the probability of each value is computed with the HYPGEOM.DIST function. Figure 11.26 illustrates the Hypergeometric file with the parameter values for the circuit board example.
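The table-lookup approach described above can be approximated in Python by computing the hypergeometric probabilities directly and accumulating them into interval endpoints. This is a sketch with hypothetical function names, not a description of how the Hypergeometric file is built internally; it restricts the table to the possible values max{0, n + s − N} through min{n, s}.

```python
import math
import random
from bisect import bisect_right
from itertools import accumulate


def hypergeom_pmf(k, n, s, N):
    """P(k successes in n draws without replacement, s successes among N)."""
    return math.comb(s, k) * math.comb(N - s, n - k) / math.comb(N, n)


def hypergeom_table(n, s, N):
    """Possible values and the upper end of each value's probability interval."""
    support = list(range(max(0, n + s - N), min(n, s) + 1))
    upper_ends = list(accumulate(hypergeom_pmf(k, n, s, N) for k in support))
    return support, upper_ends


def hypergeom_draw(n, s, N, u=None):
    """Look up the value whose interval contains the uniform number u."""
    if u is None:
        u = random.random()
    support, upper_ends = hypergeom_table(n, s, N)
    index = min(bisect_right(upper_ends, u), len(support) - 1)
    return support[index]


# Circuit board example: order 40 boards from a lot of 100 with 70 good boards.
good_boards = hypergeom_draw(40, 70, 100)
```

For the circuit board parameters the possible values run from 10 to 40, matching the bounds worked out in the example.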

[Margin figure: hypergeometric distribution for the circuit board example, horizontal axis from 20 to 36]

FIGURE 11.26 Excel Template to Generate Values from a Hypergeometric Distribution

Parameters and formulas:

      A                                         B
1     Hypergeometric Distribution Parameters
2     Trials (n)                                40
3     Population (N)                            100
4     Successful Elements in Population (s)     70
6     Randomly Generated Hypergeometric Value   =VLOOKUP(RAND(),$C$9:$E$109,3,TRUE)

Lookup table (formula view; row 8 holds the column headings, and rows 10 onward repeat the pattern):

Column A   Number of Successes in n Trials   0, 1, 2, . . .
Column B   Probability Mass                  =HYPGEOM.DIST($A9,$B$2,$B$4,$B$3,FALSE)
Column C   Lower End of Interval             0 in row 9, then =D9, =D10, . . .
Column D   Upper End of Interval             =B9+C9
Column E   Number of Successes in n Trials   0, 1, 2, . . .

Lookup table (value view; rows 9 through 29, for 0 through 20 successes, display 0.000 in columns B through D):

Row   Number of Successes   Probability Mass   Lower End of Interval   Upper End of Interval
30    21                    0.002              0.000                   0.002
31    22                    0.005              0.002                   0.007
32    23                    0.016              0.007                   0.023
33    24                    0.037              0.023                   0.060
34    25                    0.073              0.060                   0.133
35    26                    0.118              0.133                   0.251
36    27                    0.159              0.251                   0.410
37    28                    0.176              0.410                   0.586
38    29                    0.161              0.586                   0.747
39    30                    0.121              0.747                   0.868
40    31                    0.074              0.868                   0.942
41    32                    0.037              0.942                   0.979
42    33                    0.015              0.979                   0.994
43    34                    0.005              0.994                   0.999
44    35                    0.001              0.999                   1.000

Hypergeometric

Negative Binomial Distribution

Parameters: required number of successes (s), probability of success (p)

Possible values: 0, 1, 2, . . .

Description: A negative binomial random variable corresponds to the number of times that an event fails to occur until an event successfully occurs s times, given that the probability of an event successfully occurring at each trial is p. When s = 1, the negative binomial is also known as the geometric distribution.

Example: Consider the research and development (R&D) division of a large company. An R&D division may invest in several projects that fail before investing in 5 projects that succeed. If each project has a probability of success of 0.50, the number of projects that fail before 5 successful projects occur is a negative binomial random variable with parameters s = 5 and p = 0.50.

Native Excel: Insert the file NegativeBinomial into your Excel workbook, modify the parameters in the cell range B2:B3, and then reference cell B5 in your simulation model to obtain a value from a negative binomial distribution. This file uses the RAND() function in conjunction with the VLOOKUP function referencing a table in which each row lists a possible value and a segment of the interval [0, 1) representing the likelihood of the corresponding value; the probability of each value is computed with the NEGBINOM.DIST function. Figure 11.27 illustrates the implementation for the R&D project example.

[Margin figure: negative binomial distribution for the R&D project example, horizontal axis from −2 to 22]

FIGURE 11.27 Excel Template to Generate Values from a Negative Binomial Distribution

Parameters and formulas:

      A                                             B
1     Negative Binomial Distribution Parameters
2     Required Number of Successes (s)              5
3     Probability of Success (p)                    0.50
5     Randomly Generated Negative Binomial Value    =VLOOKUP(RAND(),$C$8:$E$108,3,TRUE)

Lookup table (formula view; row 7 holds the column headings, and rows 9 onward repeat the pattern):

Column A   Number of Failures Before s Successes   0, 1, 2, . . .
Column B   Probability Mass                        =NEGBINOM.DIST($A8,$B$2,$B$3,FALSE)
Column C   Lower End of Interval                   0 in row 8, then =D8, =D9, . . .
Column D   Upper End of Interval                   =C8+B8
Column E   Number of Failures Before s Successes   0, 1, 2, . . .

Lookup table (value view):

Row   Number of Failures   Probability Mass   Lower End of Interval   Upper End of Interval
8     0                    0.031              0.000                   0.031
9     1                    0.078              0.031                   0.109
10    2                    0.117              0.109                   0.227
11    3                    0.137              0.227                   0.363
12    4                    0.137              0.363                   0.500
13    5                    0.123              0.500                   0.623
14    6                    0.103              0.623                   0.726
15    7                    0.081              0.726                   0.806
16    8                    0.060              0.806                   0.867
17    9                    0.044              0.867                   0.910
18    10                   0.031              0.910                   0.941
19    11                   0.021              0.941                   0.962
20    12                   0.014              0.962                   0.975
21    13                   0.009              0.975                   0.985
22    14                   0.006              0.985                   0.990
23    15                   0.004              0.990                   0.994
24    16                   0.002              0.994                   0.996
25    17                   0.001              0.996                   0.998
26    18                   0.001              0.998                   0.999
27    19                   0.001              0.999                   0.999
28    20                   0.000              0.999                   1.000

NegativeBinomial

[Margin figure: Poisson distribution for the health care clinic example, horizontal axis from 0 to 14]

Poisson Distribution

Parameters: mean (m)

Possible values: 0, 1, 2, . . .

Description: A Poisson random variable corresponds to the number of times that an event occurs within a specified period of time given that m is the average number of events within the specified period of time.

Example: The number of patients arriving at a health care clinic in an hour can be modeled with a Poisson random variable with m = 5 if, on average, 5 patients arrive at the clinic in an hour.

Native Excel: Insert the file Poisson into your Excel workbook, modify the parameter in cell B2, and then reference cell B4 in your simulation model to obtain a value from a Poisson distribution. This file uses the RAND() function in conjunction with the VLOOKUP function referencing a table in which each row lists a possible value and a segment of the interval [0, 1) representing the likelihood of the corresponding value; the probability of each value is computed with the POISSON.DIST function. Figure 11.28 illustrates the implementation for the health care clinic example.
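A comparable lookup can be sketched in Python by accumulating Poisson probabilities for the values 0 through 100 (a 101-row table, like the one the VLOOKUP range in the template spans) and searching for the interval that contains a uniform random number. The function names and the `max_value` argument are hypothetical.

```python
import math
import random
from bisect import bisect_right
from itertools import accumulate


def poisson_pmf(k, mean):
    """P(exactly k events) for a Poisson random variable with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)


def poisson_draw(mean, u=None, max_value=100):
    """Table lookup over the values 0..max_value using a uniform number u."""
    if u is None:
        u = random.random()
    upper_ends = list(accumulate(poisson_pmf(k, mean)
                                 for k in range(max_value + 1)))
    # The index of the first interval whose upper end exceeds u is the value.
    return min(bisect_right(upper_ends, u), max_value)


# Clinic example: an average of 5 patient arrivals per hour.
arrivals = poisson_draw(5)
```

Truncating the table at 100 is harmless for a mean of 5 because the probability of more than 100 arrivals is negligible; a much larger mean would need a longer table.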

FIGURE 11.28 Excel Template to Generate Values from a Poisson Distribution

[Figure: worksheet from the file Poisson, shown in formula view and in value view. Cell B2 holds the mean (m); the randomly generated Poisson value is =VLOOKUP(RAND(),$C$7:$E$107,3,TRUE). Starting in row 7, column A lists the number of event occurrences, column B the probability mass (=POISSON.DIST($A7,$B$2,FALSE)), column C the lower end of the interval (=D7 in row 8, and so on down the column), column D the upper end of the interval (=C7+B7), and column E the number of event occurrences again. The values shown for m = 5 are as follows.]

Number of Event    Probability    Lower End      Upper End
Occurrences        Mass           of Interval    of Interval
 0                 0.007          0.000          0.007
 1                 0.034          0.007          0.040
 2                 0.084          0.040          0.125
 3                 0.140          0.125          0.265
 4                 0.175          0.265          0.440
 5                 0.175          0.440          0.616
 6                 0.146          0.616          0.762
 7                 0.104          0.762          0.867
 8                 0.065          0.867          0.932
 9                 0.036          0.932          0.968
10                 0.018          0.968          0.986
11                 0.008          0.986          0.995
12                 0.003          0.995          0.998
13                 0.001          0.998          0.999
14                 0.000          0.999          1.000
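The RAND() and VLOOKUP lookup described above can also be sketched outside Excel. The following Python sketch is only an illustration of the same idea, not part of the Poisson file: the function names `build_lower_ends` and `poisson_value` are hypothetical, and the table is truncated at 100 occurrences on the assumption that this mirrors the 101 rows referenced by $C$7:$E$107.

```python
import bisect
import math
import random


def build_lower_ends(mean, max_value=100):
    """Return the lower end of each value's interval, i.e. P(X < k) for k = 0..max_value."""
    lower_ends = []
    cumulative = 0.0
    pmf = math.exp(-mean)  # P(X = 0)
    for k in range(max_value + 1):
        lower_ends.append(cumulative)
        cumulative += pmf
        pmf *= mean / (k + 1)  # P(X = k + 1) from P(X = k)
    return lower_ends


def poisson_value(u, lower_ends):
    """Approximate-match lookup: the largest k whose lower end is <= u."""
    return bisect.bisect_right(lower_ends, u) - 1


# Health care clinic example: on average 5 arrivals per hour.
lower_ends = build_lower_ends(mean=5)
rng = random.Random(12345)  # seeded only so that repeated runs give the same draws
arrivals = [poisson_value(rng.random(), lower_ends) for _ in range(10)]
print(arrivals)
```

As in the spreadsheet, a uniform draw between 0.440 and 0.616 maps to 5 arrivals, because that interval has width equal to the probability of 5 arrivals.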

Chapter 12
Linear Optimization Models

CONTENTS

ANALYTICS IN ACTION: GENERAL ELECTRIC

12.1 A SIMPLE MAXIMIZATION PROBLEM
    Problem Formulation
    Mathematical Model for the Par, Inc. Problem

12.2 SOLVING THE PAR, INC. PROBLEM
    The Geometry of the Par, Inc. Problem
    Solving Linear Programs with Excel Solver

12.3 A SIMPLE MINIMIZATION PROBLEM
    Problem Formulation
    Solution for the M&D Chemicals Problem

12.4 SPECIAL CASES OF LINEAR PROGRAM OUTCOMES
    Alternative Optimal Solutions
    Infeasibility
    Unbounded

12.5 SENSITIVITY ANALYSIS
    Interpreting Excel Solver Sensitivity Report

12.6 GENERAL LINEAR PROGRAMMING NOTATION AND MORE EXAMPLES
    Investment Portfolio Selection
    Transportation Planning
    Advertising Campaign Planning

12.7 GENERATING AN ALTERNATIVE OPTIMAL SOLUTION FOR A LINEAR PROGRAM

APPENDIX 12.1 SOLVING LINEAR OPTIMIZATION MODELS USING ANALYTIC SOLVER (MINDTAP READER)


This chapter begins our discussion of prescriptive analytics and how optimization models can be used to support and improve managerial decision making. Optimization problems maximize or minimize some function, called the objective function, and usually have a set of restrictions known as constraints. Consider the following typical applications of optimization:

1. A manufacturer wants to develop a production schedule and an inventory policy that will satisfy demand in future periods. Ideally, the schedule and policy will enable the company to satisfy demand and at the same time minimize the total production and inventory costs.

2. A financial analyst must select an investment portfolio from a variety of stock and bond investment alternatives. The analyst would like to establish the portfolio that maximizes the return on investment.

3. A marketing manager wants to determine how best to allocate a fixed advertising budget among alternative advertising media such as web, radio, television, newspaper, and magazine. The manager would like to determine the media mix that maximizes advertising effectiveness.

4. A company has warehouses in a number of locations. Given specific customer demands, the company would like to determine how much each warehouse should ship to each customer so that total transportation costs are minimized.

Each of these examples has a clear objective. In example 1, the manufacturer wants to minimize costs; in example 2, the financial analyst wants to maximize return on investment; in example 3, the marketing manager wants to maximize advertising effectiveness; and in example 4, the company wants to minimize total transportation costs.

ANALYTICS IN ACTION

General Electric*

With growing concerns about the environment and our ability to continue to utilize limited nonrenewable sources for energy, companies have begun to place much more emphasis on renewable forms of energy. Water, wind, and solar energy are renewable forms of energy that have become the focus of considerable investment by companies.

General Electric (GE) has products in a variety of areas within the energy sector. One such area of interest to GE is solar energy. Solar energy is a relatively new concept with rapidly changing technologies; for example, solar cells and solar power systems. Solar cells can convert sunlight directly into electricity. Concentrating solar power systems focus a larger area of sunlight into a small beam that can be used as a heat source for conventional power generation. Solar cells can be placed on rooftops and hence can be used by both commercial and residential customers, whereas solar power systems are mostly used in commercial settings. In recent years, GE has invested in several solar cell technologies.

Determining the appropriate amount of production capacity in which to invest is a difficult problem due to the uncertainties in technology development, costs, and solar energy demand. GE uses a set of analytics tools to solve this problem. A detailed descriptive analytical model is used to estimate the cost of newly developed or proposed solar cells. Statistical models developed for new product introductions are used to estimate annual solar demand 10 to 15 years into the future. Finally, the cost and demand estimates are used in a multiperiod linear optimization model to determine the best production capacity investment plan.

The linear program finds an optimal expansion plan by taking into account inventory, capacity, production, and budget constraints. Because of the high level of uncertainty, the linear program is solved over multiple future scenarios. A solution to each individual scenario is found and evaluated in the other scenarios to assess the risk associated with that plan. GE planning analysts have used these tools to support management’s strategic investment decisions in the solar energy sector.

*Based on B. G. Thomas and S. Bollapragada, “General Electric Uses an Integrated Framework for Product Costing, Demand Forecasting and Capacity Planning for New Photovoltaic Technology Products,” Interfaces, 40, no. 5 (September/October 2010): 353–367.


Likewise, each problem has constraints that limit the degree to which the objective can be pursued. In example 1, the manufacturer is restricted by the constraints requiring product demand to be satisfied and limiting production capacity. The financial analyst’s portfolio problem is constrained by the total amount of investment funds available and the maximum amounts that can be invested in each stock or bond. The marketing manager’s media selection decision is constrained by a fixed advertising budget and the availability of the various media. In the transportation problem, the minimum-cost shipping schedule is constrained by the supply of product available at each warehouse.

Optimization models can be linear or nonlinear. We begin with linear optimization models, also known as linear programs. Linear programming is a problem-solving approach developed to help managers make better decisions. Numerous applications of linear programming can be found in today’s competitive business environment. For instance, GE Capital uses linear programming to help determine optimal lease structuring, and Marathon Oil Company uses linear programming for gasoline blending and to evaluate the economics of a new terminal or pipeline.

12.1 A Simple Maximization Problem

Par, Inc. is a small manufacturer of golf equipment and supplies whose management has decided to move into the market for medium- and high-priced golf bags. Par’s distributor is enthusiastic about the new product line and has agreed to buy all the golf bags Par produces over the next three months.

After a thorough investigation of the steps involved in manufacturing a golf bag, management determined that each golf bag produced will require the following operations:

1. Cutting and dyeing the material
2. Sewing
3. Finishing (inserting umbrella holder, club separators, etc.)
4. Inspection and packaging

The director of manufacturing analyzed each of the operations and concluded that if the company produces a medium-priced standard model, each bag will require 7/10 hour in the cutting and dyeing department, 1/2 hour in the sewing department, 1 hour in the finishing department, and 1/10 hour in the inspection and packaging department. The more expensive deluxe model will require 1 hour for cutting and dyeing, 5/6 hour for sewing, 2/3 hour for finishing, and 1/4 hour for inspection and packaging. This production information is summarized in Table 12.1.

Par’s production is constrained by a limited number of hours available in each department. After studying departmental workload projections, the director of manufacturing estimates that 630 hours for cutting and dyeing, 600 hours for sewing, 708 hours for finishing, and 135 hours for inspection and packaging will be available for the production of golf bags during the next three months.

Linear programming was initially referred to as “programming in a linear structure.” In 1948, Tjalling Koopmans suggested to George Dantzig that the name was much too long; Koopmans’s suggestion was to shorten it to linear programming. George Dantzig agreed, and the field we now know as linear programming was named.

TABLE 12.1 Production Requirements per Golf Bag

                              Production Time (hours)
Department                    Standard Bag    Deluxe Bag
Cutting and Dyeing            7/10            1
Sewing                        1/2             5/6
Finishing                     1               2/3
Inspection and Packaging      1/10            1/4


The accounting department analyzed the production data, assigned all relevant variable costs, and arrived at prices for both bags that will result in a profit contribution¹ of $10 for every standard bag and $9 for every deluxe bag produced. Let us now develop a mathematical model of the Par, Inc. problem that can be used to determine the number of standard bags and the number of deluxe bags to produce in order to maximize total profit contribution.

Problem Formulation

Problem formulation, or modeling, is the process of translating the verbal statement of a problem into a mathematical statement. Formulating models is an art that can be mastered only with practice and experience. Even though every problem has some unique features, most problems also have common features. As a result, some general guidelines for optimization model formulation can be helpful, especially for beginners. We will illustrate these general guidelines by developing a mathematical model for Par, Inc.

Understand the problem thoroughly. We selected the Par, Inc. problem to introduce linear programming because it is easy to understand. However, more complex problems will require much more effort to identify the items that need to be included in the model. In such cases, read the problem description to get a feel for what is involved. Taking notes will help you focus on the key issues and facts.

Describe the objective. The objective is to maximize the total contribution to profit.

Describe each constraint. Four constraints relate to the number of hours of manufacturing time available; they restrict the number of standard bags and the number of deluxe bags that can be produced.

• Constraint 1: The number of hours of cutting and dyeing time used must be less than or equal to the number of hours of cutting and dyeing time available.

• Constraint 2: The number of hours of sewing time used must be less than or equal to the number of hours of sewing time available.

• Constraint 3: The number of hours of finishing time used must be less than or equal to the number of hours of finishing time available.

• Constraint 4: The number of hours of inspection and packaging time used must be less than or equal to the number of hours of inspection and packaging time available.

Define the decision variables. The controllable inputs for Par, Inc. are (1) the number of standard bags produced and (2) the number of deluxe bags produced. Let:

S = number of standard bags

D = number of deluxe bags

In optimization terminology, S and D are referred to as the decision variables.

Write the objective in terms of the decision variables. Par’s profit contribution comes from two sources: (1) the profit contribution made by producing S standard bags and (2) the profit contribution made by producing D deluxe bags. If Par makes $10 for every standard bag, the company will make $10S if S standard bags are produced. Also, if Par makes $9 for every deluxe bag, the company will make $9D if D deluxe bags are produced. Thus, we have

$$\text{Total profit contribution} = 10S + 9D$$

Because the objective (maximize total profit contribution) is a function of the decision variables S and D, we refer to 10S + 9D as the objective function. Using Max as an abbreviation for maximize, we write Par’s objective as follows:

$$\text{Max} \quad 10S + 9D$$

It is important to understand that we are maximizing profit contribution, not profit. Overhead and other shared costs must be deducted before arriving at a profit figure.

¹ From an accounting perspective, profit contribution is more correctly described as the contribution margin per bag since overhead and other shared costs are not allocated.


Write the constraints in terms of the decision variables.

Constraint 1:

$$\left(\text{Hours of cutting and dyeing time used}\right) \le \left(\text{Hours of cutting and dyeing time available}\right)$$

Every standard bag Par produces will use 7/10 hour of cutting and dyeing time; therefore, the total number of hours of cutting and dyeing time used in the manufacture of S standard bags is $\tfrac{7}{10}S$. In addition, because every deluxe bag produced uses 1 hour of cutting and dyeing time, the production of D deluxe bags will use 1D hours of cutting and dyeing time. Thus, the total cutting and dyeing time required for the production of S standard bags and D deluxe bags is given by

$$\text{Total hours of cutting and dyeing time used} = \tfrac{7}{10}S + 1D$$

The director of manufacturing stated that Par has at most 630 hours of cutting and dyeing time available. Therefore, the production combination we select must satisfy the requirement:

$$\tfrac{7}{10}S + 1D \le 630 \tag{12.1}$$

Constraint 2:

$$\left(\text{Hours of sewing time used}\right) \le \left(\text{Hours of sewing time available}\right)$$

From Table 12.1 we see that every standard bag manufactured will require 1/2 hour for sewing, and every deluxe bag will require 5/6 hour for sewing. Because 600 hours of sewing time are available, it follows that

$$\tfrac{1}{2}S + \tfrac{5}{6}D \le 600 \tag{12.2}$$

Constraint 3:

$$\left(\text{Hours of finishing time used}\right) \le \left(\text{Hours of finishing time available}\right)$$

Every standard bag manufactured will require 1 hour for finishing, and every deluxe bag will require 2/3 hour for finishing. With 708 hours of finishing time available, it follows that

$$1S + \tfrac{2}{3}D \le 708 \tag{12.3}$$

Constraint 4:

$$\left(\text{Hours of inspection and packaging time used}\right) \le \left(\text{Hours of inspection and packaging time available}\right)$$

Every standard bag manufactured will require 1/10 hour for inspection and packaging, and every deluxe bag will require 1/4 hour for inspection and packaging. Because 135 hours of inspection and packaging time are available, it follows that

$$\tfrac{1}{10}S + \tfrac{1}{4}D \le 135 \tag{12.4}$$

We have now specified the mathematical relationships for the constraints associated with the four departments. Have we forgotten any other constraints? Can Par produce a negative number of standard or deluxe bags? Clearly, the answer is no. Thus, to prevent the decision variables S and D from having negative values, two constraints must be added:

$$S \ge 0 \quad \text{and} \quad D \ge 0 \tag{12.5}$$

These constraints ensure that the solution to the problem will contain only nonnegative values for the decision variables and are thus referred to as the nonnegativity constraints. Nonnegativity constraints are a general feature of many linear programming problems and may be written in the abbreviated form:

$$S, D \ge 0$$
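To see how constraints (12.1) through (12.5) act on a candidate production plan, the following Python sketch reports which constraints a given pair (S, D) violates. It is only an illustration under stated assumptions: the name `violated_constraints` and the dictionary layout are hypothetical, and exact fractions are used so that plans lying exactly on a constraint line are not misclassified by rounding.

```python
from fractions import Fraction as F

# (hours per standard bag, hours per deluxe bag, hours available), from Table 12.1
CONSTRAINTS = {
    "Cutting and dyeing": (F(7, 10), F(1), 630),
    "Sewing": (F(1, 2), F(5, 6), 600),
    "Finishing": (F(1), F(2, 3), 708),
    "Inspection and packaging": (F(1, 10), F(1, 4), 135),
}


def violated_constraints(s, d):
    """Return the names of the constraints that the plan (s, d) violates."""
    violated = [
        name
        for name, (per_standard, per_deluxe, available) in CONSTRAINTS.items()
        if per_standard * s + per_deluxe * d > available
    ]
    if s < 0 or d < 0:
        violated.append("Nonnegativity")
    return violated


print(violated_constraints(540, 252))  # a plan that satisfies every constraint
print(violated_constraints(600, 300))  # a plan that uses too many hours
```

A plan for which the function returns an empty list satisfies all four departmental constraints and the nonnegativity constraints.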

The units of measurement on the left-hand side of the constraint must match the units of measurement on the right-hand side.


Mathematical Model for the Par, Inc. Problem

The mathematical statement, or mathematical formulation, of the Par, Inc. problem is now complete. We succeeded in translating the objective and constraints of the problem into a set of mathematical relationships, referred to as a mathematical model. The complete mathematical model for the Par, Inc. problem is as follows:

$$
\begin{aligned}
\text{Max} \quad & 10S + 9D \\
\text{subject to (s.t.)} \quad & \\
& \tfrac{7}{10}S + 1D \le 630 && \text{Cutting and dyeing} \\
& \tfrac{1}{2}S + \tfrac{5}{6}D \le 600 && \text{Sewing} \\
& 1S + \tfrac{2}{3}D \le 708 && \text{Finishing} \\
& \tfrac{1}{10}S + \tfrac{1}{4}D \le 135 && \text{Inspection and packaging} \\
& S, D \ge 0
\end{aligned}
$$

Our job now is to find the product mix (i.e., the combination of values for S and D) that satisfies all the constraints and at the same time yields a value for the objective function that is greater than or equal to the value given by any other feasible solution. Once these values are calculated, we will have found the optimal solution to the problem.

This mathematical model of the Par, Inc. problem is a linear programming model, or linear program, because the objective function and all constraint functions (the left-hand sides of the constraint inequalities) are linear functions of the decision variables.

Mathematical functions in which each variable appears in a separate term and is raised to the first power are called linear functions. The objective function (10S + 9D) is linear because each decision variable appears in a separate term and has an exponent of 1. The amount of production time required in the cutting and dyeing department ($\tfrac{7}{10}S + 1D$) is also a linear function of the decision variables for the same reason. Similarly, the functions on the left-hand side of all the constraint inequalities (the constraint functions) are linear functions. Thus, the mathematical formulation of this problem is referred to as a linear program.

Linear programming has nothing to do with computer programming. The use of the word programming means “choosing a course of action.” Linear programming involves choosing a course of action when the mathematical model of the problem contains only linear functions.

NOTES + COMMENTS

The three assumptions necessary for a linear programming model to be appropriate are proportionality, additivity, and divisibility. Proportionality means that the contribution to the objective function and the amount of resources used in each constraint are proportional to the value of each decision variable. Additivity means that the value of the objective function and the total resources used can be found by summing the objective function contribution and the resources used for all decision variables. Divisibility means that the decision variables are continuous. The divisibility assumption plus the nonnegativity constraints mean that decision variables can take on any value greater than or equal to zero.
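The three assumptions can be checked numerically for the Par, Inc. objective function. The short Python sketch below is an illustration only; the function name `profit` is hypothetical, and the same checks could be repeated for any of the four constraint functions.

```python
def profit(s, d):
    """Par, Inc. objective function: total profit contribution in dollars."""
    return 10 * s + 9 * d


# Proportionality: doubling the number of standard bags doubles their contribution.
print(profit(2 * 540, 0) == 2 * profit(540, 0))

# Additivity: the total is the sum of the contributions of the two products.
print(profit(540, 252) == profit(540, 0) + profit(0, 252))

# Divisibility: fractional production quantities are allowed in the model.
print(profit(100.5, 20.25))
```

A term such as S times D, or a quantity discount on profit per bag, would break additivity or proportionality, and the model would no longer be a linear program.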

12.2 Solving the Par, Inc. Problem

Now that we have modeled the Par, Inc. problem as a linear program, let us discuss how we might find the optimal solution. The optimal solution must be a feasible solution. A feasible solution is a setting of the decision variables that satisfies all of the constraints of the problem. The optimal solution also must have an objective function value as good as any other feasible solution. For a maximization problem like Par, Inc., this means that the solution must be feasible and achieve the highest objective function value of any feasible solution. To solve a linear program then, we must search over the feasible region, which is the set of all feasible solutions, and find the solution that gives the best objective function value.

Because the Par, Inc. model has two decision variables, we are able to graph the feasible region. Discussing the geometry of the feasible region of the model will help us better understand linear programming and how we are able to solve much larger problems on the computer.


The Geometry of the Par, Inc. Problem

Recall that the feasible region is the set of points that satisfies all of the constraints of the problem. When we have only two decision variables and the functions of these variables are linear, they form lines in two-dimensional space. Each inequality constraint cuts the space in two, with the line and the area on one side of the line being the space that satisfies that constraint. These subregions are called half spaces. The intersection of these half spaces makes up the feasible region.

The feasible region for the Par, Inc. problem is shown in Figure 12.1. Notice that the horizontal axis corresponds to the value of S and the vertical axis to the value of D. The nonnegativity constraints restrict the feasible region to the area bounded by the horizontal and vertical axes. Each of the four constraints is graphed as an equality (a line), and arrows show the direction of the half space that satisfies the inequality constraint. The intersection of the four half spaces in the area bounded by the axes is the shaded region; this is the feasible region for the Par, Inc. problem. Any point in the shaded region satisfies all four constraints of the problem and nonnegativity.

To solve the Par, Inc. problem, we must find the point in the feasible region that results in the highest possible objective function value. A contour line is a set of points on a map, all of which have the same elevation. Similar to the way contour lines are used in geography, we may define an objective function contour to be a set of points (in this case a line) that yield a fixed value of the objective function. By choosing a fixed value of the objective function, we may plot contour lines of the objective function over the feasible region (Figure 12.2). In this case, as we move away from the origin we see higher values of the objective function, and the highest such contour is 10S + 9D = 7,668, after which we leave the feasible region. The highest value contour intersects the feasible region at a single point, point 3.
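The way a contour line slides out of the feasible region can be sketched numerically. For a fixed profit level z, every point on the contour satisfies D = (z − 10S)/9, so each constraint becomes a bound on S alone. The Python sketch below, with the hypothetical name `feasible_segment` and an integer profit level assumed, returns the range of S values for which the contour lies inside the feasible region.

```python
from fractions import Fraction as F

# (coefficient of S, coefficient of D, hours available) for the four departments
CONSTRAINTS = [
    (F(7, 10), F(1), 630),    # cutting and dyeing
    (F(1, 2), F(5, 6), 600),  # sewing
    (F(1), F(2, 3), 708),     # finishing
    (F(1, 10), F(1, 4), 135), # inspection and packaging
]


def feasible_segment(z):
    """Return (lowest S, highest S) on the contour 10S + 9D = z, or None if no point is feasible."""
    low, high = F(0), F(z, 10)  # S >= 0, and D >= 0 requires S <= z/10
    for a, b, hours in CONSTRAINTS:
        # Substitute D = (z - 10S)/9 into a*S + b*D <= hours.
        slope = a - b * F(10, 9)
        rhs = hours - b * F(z, 9)
        if slope > 0:
            high = min(high, rhs / slope)
        elif slope < 0:
            low = max(low, rhs / slope)
        elif rhs < 0:
            return None
    return (low, high) if low <= high else None


for z in (1800, 3600, 5400, 7668, 7669):
    print(z, feasible_segment(z))
```

At z = 7,668 the segment collapses to the single value S = 540, and for any higher profit level the function returns None, which matches the geometric argument.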

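FIGURE 12.1 Feasible Region for the Par, Inc. Problem

[Figure: graph with No. of Standard Bags on the horizontal axis and No. of Deluxe Bags on the vertical axis. The constraint lines labeled C & D, Sewing, Finishing, and I & P, together with the axes, bound the shaded Feasible Region.]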


Of course, this geometric approach to solving a linear program is limited to problems with only two variables. What have we learned that can help us solve larger linear optimization problems?

Based on the geometry of Figure 12.2, to solve a linear optimization problem we only have to search over the extreme points of the feasible region to find an optimal solution. The extreme points are found where constraints intersect on the boundary of the feasible region. In Figure 12.2, points 1, 2, 3, 4, and 5 are the extreme points of the feasible region.

Because each extreme point lies at the intersection of two constraint lines, we may obtain the values of S and D by solving the pair of constraints that form the given point simultaneously as equalities. The values of S and D and the objective function value at points 1 through 5 are as follows:

FIGURE 12.2 The Optimal Solution to the Par, Inc. Problem

[Figure: the feasible region with No. of Standard Bags on the horizontal axis and No. of Deluxe Bags on the vertical axis, showing the objective function contours 10S + 9D = 1,800, 10S + 9D = 3,600, and 10S + 9D = 5,400, and the Maximum Profit Line 10S + 9D = 7,668. The optimal solution is point 3: S = 540, D = 252.]

Point    S      D      Profit = 10S + 9D
1        0      0      10(0) + 9(0) = 0
2        708    0      10(708) + 9(0) = 7,080
3        540    252    10(540) + 9(252) = 7,668
4        300    420    10(300) + 9(420) = 6,780
5        0      540    10(0) + 9(540) = 4,860
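The extreme-point search can be sketched in a few lines of Python. This is a brute-force illustration for a two-variable problem, not the simplex algorithm: it intersects every pair of constraint lines, keeps the intersections that are feasible, and evaluates the objective at each. The names `intersection`, `is_feasible`, and `extreme_points` are hypothetical.

```python
from fractions import Fraction as F
from itertools import combinations

# Each row (a, b, c) means a*S + b*D <= c.
# Nonnegativity is written as -S <= 0 and -D <= 0.
LINES = [
    (F(7, 10), F(1), F(630)),     # cutting and dyeing
    (F(1, 2), F(5, 6), F(600)),   # sewing
    (F(1), F(2, 3), F(708)),      # finishing
    (F(1, 10), F(1, 4), F(135)),  # inspection and packaging
    (F(-1), F(0), F(0)),          # S >= 0
    (F(0), F(-1), F(0)),          # D >= 0
]


def intersection(line1, line2):
    """Solve the two constraints as equalities (Cramer's rule); None if the lines are parallel."""
    a1, b1, c1 = line1
    a2, b2, c2 = line2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)


def is_feasible(s, d):
    return all(a * s + b * d <= c for a, b, c in LINES)


def extreme_points():
    points = set()
    for line1, line2 in combinations(LINES, 2):
        point = intersection(line1, line2)
        if point is not None and is_feasible(*point):
            points.add(point)
    return points


best = max(extreme_points(), key=lambda p: 10 * p[0] + 9 * p[1])
print(best, 10 * best[0] + 9 * best[1])
```

Checking every pair of constraints is practical only for small problems, because the number of pairs and of variables grows quickly; the enumeration is shown here only because it mirrors the table above.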

The highest profit is achieved at point 3. Therefore, the optimal plan is to produce 540 standard bags and 252 deluxe bags, as shown in Figure 12.2.

It turns out that this approach of investigating only extreme points works well and generalizes for larger problems. The simplex algorithm, developed by George Dantzig, is quite effective at investigating extreme points in an intelligent way to find the optimal solution to even very large linear programs.


Excel Solver is software that utilizes Dantzig’s simplex algorithm to solve linear programs by systematically finding which set of constraints form the optimal extreme point of the feasible region. Once it finds an optimal solution, Solver then reports the optimal values of the decision variables and the optimal objective function value. Let us illustrate now how to use Excel Solver to find the optimal solution to the Par, Inc. problem.

Solving Linear Programs with Excel Solver

The first step in solving a linear optimization model in Excel is to construct the relevant what-if model. Using the principles for developing good spreadsheet models discussed in Chapter 10, a what-if model for optimization allows the user to try different values of the decision variables and see easily (a) whether that trial solution is feasible, and (b) the value of the objective function for that trial solution.

Figure 12.3 shows a spreadsheet model for the Par, Inc. problem with a trial solution of one standard bag and one deluxe bag. Rows 1 through 10 contain the parameters for the problem. Row 14 contains the decision variable cells: Cells B14 and C14 are the locations for the number of standard and deluxe bags to produce. Cell B16 calculates the objective function value for the trial solution by using the SUMPRODUCT function. The SUMPRODUCT function is very useful for linear problems. Recall how the SUMPRODUCT function works:

=SUMPRODUCT(B9:C9, $B$14:$C$14) = B9*B14 + C9*C14 = 10(1) + 9(1) = 19

We likewise use the SUMPRODUCT function in cells B19:B22 to calculate the number of hours used in each of the four departments. The hours available are immediately to the right for each department. Hence, we see that the current solution is feasible, since Hours Used do not exceed Hours Available in any department.
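For readers who want to check the what-if calculations outside Excel, the SUMPRODUCT logic can be sketched in Python. The helper name `sumproduct` and the dictionary layout are hypothetical; the data are the parameters from Table 12.1 and the trial solution of one standard bag and one deluxe bag.

```python
from fractions import Fraction as F


def sumproduct(xs, ys):
    """Multiply matching entries of two equal-length ranges and add the products."""
    return sum(x * y for x, y in zip(xs, ys))


profit_per_bag = [10, 9]  # cells B9:C9
production_time = {       # cells B5:C8, hours per standard and deluxe bag
    "Cutting and Dyeing": [F(7, 10), F(1)],
    "Sewing": [F(1, 2), F(5, 6)],
    "Finishing": [F(1), F(2, 3)],
    "Inspection and Packaging": [F(1, 10), F(1, 4)],
}
hours_available = {       # cells D5:D8
    "Cutting and Dyeing": 630,
    "Sewing": 600,
    "Finishing": 708,
    "Inspection and Packaging": 135,
}
bags_produced = [1, 1]    # cells B14:C14, the trial solution

total_profit = sumproduct(profit_per_bag, bags_produced)  # cell B16
hours_used = {            # cells B19:B22
    operation: sumproduct(times, bags_produced)
    for operation, times in production_time.items()
}
trial_is_feasible = all(
    hours_used[operation] <= hours_available[operation] for operation in hours_used
)

print(total_profit, trial_is_feasible)
for operation, used in hours_used.items():
    print(operation, float(used))
```

Changing `bags_produced` plays the same role as typing new values into cells B14 and C14 of the spreadsheet.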

Once the what-if model is built, we need a way to convey to Excel Solver the structure of the linear optimization model. This is accomplished through the Excel Solver dialog box as follows:

Step 1. Click the Data tab in the Ribbon
Step 2. Click Solver in the Analyze group
Step 3. When the Solver Parameters dialog box appears (Figure 12.4):
    Enter B16 in the Set Objective: box
    Select Max for the To: option
    Enter B14:C14 in the By Changing Variable Cells: box
Step 4. Click the Add button
    When the Add Constraint dialog box appears:
        Enter B19:B22 in the left-hand box under Cell Reference:
        Select <= from the drop-down button
        Enter C19:C22 in the Constraint: box
        Click OK
Step 5. Select the checkbox for Make Unconstrained Variables Non-Negative
Step 6. From the drop-down menu for Select a Solving Method:, choose Simplex LP
Step 7. Click Solve
Step 8. When the Solver Results dialog box appears:
    Select Keep Solver Solution
    In the Reports section, select Answer Report
    Click OK

The completed Solver dialog box and solution for the Par, Inc. problem are shown in Figure 12.4. The optimal solution is to make 540 standard bags and 252 deluxe bags (see cells B14 and C14) for a profit of $7,668 (see cell B16). This corresponds to point 3 in Figure 12.2. Also note that, from cells B19:B22 compared to C19:C22, we use all cutting and dyeing time as well as all finishing time. This is, of course, consistent with what we have seen in Figures 12.1 and 12.2: The cutting and dyeing constraint and the finishing constraint intersect to form point 3 in the graph.

Note the use of absolute referencing in the SUMPRODUCT function here. This facilitates copying this function from cell B19 to cells B20:B22 in Figure 12.3.

In versions of Excel prior to Excel 2016, Solver can be found in the Analysis group.

Variable cells that are required to be integer will be discussed in Chapter 13.


FIGURE 12.3 What-If Spreadsheet Model for Par, Inc.

[Figure: the Par, Inc. worksheet (file Par), shown in formula view and in value view. The recoverable contents are summarized below.]

Parameters (rows 5 through 9)
                              Production Time (Hours)      Time Available
Operation                     Standard (B)   Deluxe (C)    Hours (D)
Cutting and Dyeing            =7/10          1             630
Sewing                        =5/10          =5/6          600
Finishing                     1              =2/3          708
Inspection and Packaging      =1/10          =1/4          135
Profit Per Bag                10             9

Model (rows 14 through 22)
Bags Produced (B14:C14): 1 standard, 1 deluxe
Total Profit (B16): =SUMPRODUCT(B9:C9,$B$14:$C$14), value $19.00
Hours Used (B19:B22): =SUMPRODUCT(B5:C5,$B$14:$C$14) through =SUMPRODUCT(B8:C8,$B$14:$C$14), values 1.7, 1.33333, 1.66667, 0.35
Hours Available (C19:C22): =D5 through =D8, values 630, 600, 708, 135


The Excel Solver Answer Report appears in Figure 12.5. The Answer Report contains three sections: Objective Cell, Variable Cells, and Constraints. In addition to some other information, each section gives the cell location, name, and value of the cell(s). The Objective Cell section indicates that the optimal value (Final Value) of Total Profit is $7,668.00. In the Variable Cells section, the two far-right columns indicate the optimal values of the decision cells and whether or not the variables are required to be integer (here they are labeled “Contin” for continuous). Note that Solver generates a Name for a cell by concatenating the text to the left and above that cell. Hence, the name of cell $B$14 is created by combining the labels “Bags Produced” and “Standard” to produce the name “Bags Produced Standard.”

The Constraints section gives the left-hand side value for each constraint (in this case the hours used), the formula showing the constraint relationship, the status (Binding or Not Binding), and the Slack value. A binding constraint is one that holds as an equality at the optimal solution. Geometrically, binding constraints intersect to form the optimal point.

FIGURE 12.4 Solver Dialog Box and Solution to the Par, Inc. Problem

[Figure: the completed Solver Parameters dialog box beside the Par, Inc. worksheet at the optimal solution. Bags Produced: 540.00 standard and 252.00 deluxe. Total Profit: $7,668.00. Hours Used: 630, 480, 708, and 117 against Hours Available of 630, 600, 708, and 135 for Cutting and Dyeing, Sewing, Finishing, and Inspection and Packaging, respectively.]


FIGURE 12.5 The Solver Answer Report for the Par, Inc. Problem

Objective Cell (Max)
Cell     Name            Original Value    Final Value
$B$16    Total Profit    $19.00            $7,668.00

Variable Cells
Cell     Name                      Original Value    Final Value    Integer
$B$14    Bags Produced Standard    1.000             540.000        Contin
$C$14    Bags Produced Deluxe      1.000             252.000        Contin

Constraints
Cell     Name                                   Cell Value    Formula         Status         Slack
$B$19    Cutting and Dyeing Hours Used          630           $B$19<=$C$19    Binding        0
$B$20    Sewing Hours Used                      480           $B$20<=$C$20    Not Binding    120
$B$21    Finishing Hours Used                   708           $B$21<=$C$21    Binding        0
$B$22    Inspection and Packaging Hours Used    117           $B$22<=$C$22    Not Binding    18

We see in Figure 12.5 that the cutting and dyeing and finishing constraints are designated as binding, consistent with our geometric study of this problem.

The slack value for each less-than-or-equal-to constraint indicates the difference between the left-hand and right-hand values for a constraint. Of course, by definition, binding constraints have zero slack. Consider, for example, the sewing department constraint. By adding a nonnegative slack variable, we can write the constraint as an equality:

$$
\begin{aligned}
\tfrac{1}{2}S + \tfrac{5}{6}D &\le 600 \\
\tfrac{1}{2}S + \tfrac{5}{6}D + \text{slack}_{\text{sewing}} &= 600 \\
\text{slack}_{\text{sewing}} &= 600 - \tfrac{1}{2}S - \tfrac{5}{6}D = 600 - \tfrac{1}{2}(540) - \tfrac{5}{6}(252) = 600 - 270 - 210 = 120
\end{aligned}
$$

The slack value for the inspection and packaging constraint is calculated in a similar way. For resource constraints like departmental hours available, the slack value gives the amount of unused resource, in this case, time measured in hours.
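The slack calculation can be repeated for all four departments at once. The Python sketch below is an illustration only (the name `slack_report` is hypothetical); it computes the hours used, the slack, and the binding status at the optimal solution S = 540, D = 252, using exact fractions so that a binding constraint shows a slack of exactly zero.

```python
from fractions import Fraction as F

# (hours per standard bag, hours per deluxe bag, hours available)
DEPARTMENTS = {
    "Cutting and Dyeing": (F(7, 10), F(1), 630),
    "Sewing": (F(1, 2), F(5, 6), 600),
    "Finishing": (F(1), F(2, 3), 708),
    "Inspection and Packaging": (F(1, 10), F(1, 4), 135),
}


def slack_report(s, d):
    """Return {department: (hours used, slack, status)} for the plan (s, d)."""
    report = {}
    for name, (per_standard, per_deluxe, available) in DEPARTMENTS.items():
        used = per_standard * s + per_deluxe * d
        slack = available - used
        status = "Binding" if slack == 0 else "Not Binding"
        report[name] = (used, slack, status)
    return report


for name, (used, slack, status) in slack_report(540, 252).items():
    print(f"{name}: used {used}, slack {slack}, {status}")
```

The sewing row reproduces the slack of 120 hours derived above, and the two departments with zero slack are the binding constraints that form point 3.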

NOTES + COMMENTS

1. Notice in the data section for the Par, Inc. spreadsheet, shown in Figure 12.3, that we have entered fractions in cells C6: =5/6 and C7: =2/3. We do this to make sure we maintain accuracy because rounding these values could have an impact on our solution.

2. By selecting Make Unconstrained Variables Non-Negative in the Solver Parameters dialog box, all decision variables are declared to be nonnegative.

3. Although we have shown the Answer Report and how to interpret it, we will usually show the solution to an optimization problem directly in the spreadsheet. A well-designed spreadsheet that follows the principles discussed in Chapter 10 should make it easy for the user to interpret the optimal solution directly from the spreadsheet.

4. In addition to the Answer Report, Solver also allows you to generate two other reports. The Sensitivity Report will be discussed in Section 12.5. The Limits Report gives information on the objective function value when variables are set to their limits.


12.3 A Simple Minimization Problem

M&D Chemicals produces two products that are sold as raw materials to companies that manufacture bath soaps and laundry detergents. Based on an analysis of current inventory levels and potential demand for the coming month, M&D’s management specified that the combined production for products A and B must total at least 350 gallons. Separately, a major customer’s order for 125 gallons of product A must also be satisfied. Product A requires 2 hours of processing time per gallon, and product B requires 1 hour of processing time per gallon. For the coming month, 600 hours of processing time are available. M&D’s objective is to satisfy these requirements at a minimum total production cost. Production costs are $2 per gallon for product A and $3 per gallon for product B.

Problem Formulation

To find the minimum-cost production schedule, we will formulate the M&D Chemicals problem as a linear program. Following a procedure similar to the one used for the Par, Inc. problem, we first define the decision variables and the objective function for the problem. Let

A = number of gallons of product A to produce

B = number of gallons of product B to produce

With production costs at $2 per gallon for product A and $3 per gallon for product B, the objective function that corresponds to the minimization of the total production cost can be written as

\[
\text{Min}\quad 2A + 3B
\]

Next consider the constraints placed on the M&D Chemicals problem. To satisfy the major customer’s demand for 125 gallons of product A, we know A must be at least 125. Thus, we write the constraint

\[
1A \geq 125
\]

For the combined production for both products, which must total at least 350 gallons, we can write the constraint

\[
1A + 1B \geq 350
\]

Finally, for the limitation of 600 hours on available processing time, we add the constraint

\[
2A + 1B \leq 600
\]

After adding the nonnegativity constraints (A, B ≥ 0), we arrive at the following linear program for the M&D Chemicals problem:

\[
\begin{aligned}
\text{Min}\quad & 2A + 3B \\
\text{s.t.}\quad & 1A \geq 125 && \text{Demand for product A} \\
& 1A + 1B \geq 350 && \text{Total production} \\
& 2A + 1B \leq 600 && \text{Processing time} \\
& A, B \geq 0
\end{aligned}
\]
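Because this model has only two decision variables, it can be checked outside of Excel by listing the extreme points of the feasible region and evaluating the objective function at each one. The sketch below does this in Python; it is an illustration of the graphical idea, not the method Solver uses, and the helper names (`feasible`, `extreme_points`, `minimize`) are hypothetical. It assumes the feasible region is nonempty and bounded, which holds for this problem.

```python
from fractions import Fraction as F
from itertools import combinations

# Each constraint is (a, b, sense, rhs), meaning a*A + b*B <sense> rhs.
constraints = [
    (1, 0, ">=", 125),   # demand for product A
    (1, 1, ">=", 350),   # total production
    (2, 1, "<=", 600),   # processing time
    (1, 0, ">=", 0),     # A >= 0
    (0, 1, ">=", 0),     # B >= 0
]
cost = (2, 3)            # Min 2A + 3B

def feasible(point, cons):
    """True if the point satisfies every constraint."""
    A, B = point
    for a, b, sense, rhs in cons:
        lhs = a * A + b * B
        if (sense == ">=" and lhs < rhs) or (sense == "<=" and lhs > rhs):
            return False
    return True

def extreme_points(cons):
    """Intersect every pair of constraint lines; keep the feasible intersections."""
    points = set()
    for (a1, b1, _, r1), (a2, b2, _, r2) in combinations(cons, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:          # parallel lines never intersect
            continue
        A = F(r1 * b2 - r2 * b1, det)   # Cramer's rule
        B = F(a1 * r2 - a2 * r1, det)
        if feasible((A, B), cons):
            points.add((A, B))
    return points

def minimize(cost, cons):
    """Return the lowest-cost extreme point and its objective value."""
    best = min(extreme_points(cons), key=lambda p: cost[0] * p[0] + cost[1] * p[1])
    return best, cost[0] * best[0] + cost[1] * best[1]

(A_opt, B_opt), total_cost = minimize(cost, constraints)
print(A_opt, B_opt, total_cost)
```

Enumerating intersections is practical only for very small models; for larger problems a linear programming solver such as Excel Solver is the appropriate tool.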

Solution for the M&D Chemicals Problem

A spreadsheet model for the M&D Chemicals problem along with the Solver dialog box are shown in Figure 12.6. The complete linear programming model for the M&D Chemicals problem in Excel Solver is contained in the file M&DModel. We use the SUMPRODUCT function to calculate total cost in cell B16 and also to calculate total processing hours used in cell B23. The optimal solution, which is shown in the spreadsheet and in the Answer Report in Figure 12.7, is to make 250 gallons of product A and 100 gallons of product B, for a total cost of $800. Both the total production constraint and the processing time constraint are binding (350 gallons are provided, the same as required, and all 600 processing hours are used). The requirement that at least 125 gallons of Product A be produced is not binding.

FIGURE 12.6 Solver Dialog Box and Solution to the M&D Chemicals Problem
[Spreadsheet figure, file M&DModel. Parameters: production cost $2.00 (Product A) and $3.00 (Product B); minimum total production 350; Product A minimum 125; time available 600 hours. Model: gallons produced 250 (Product A) and 100 (Product B); minimized total cost $800.00; Product A provided 250, required 125; total production provided 350, required 350; processing hours used 600, available 600, unused 0.]

For greater-than-or-equal-to constraints, we can define a nonnegative variable called a surplus variable. A surplus variable tells how much over the right-hand side the left-hand side of a greater-than-or-equal-to constraint is for a solution. A surplus variable is subtracted from the left-hand side of the constraint. For example,

\[
1A \geq 125
\quad\Longrightarrow\quad
1A - \text{surplus}_A = 125
\]

\[
\text{surplus}_A = 1A - 125 = 250 - 125 = 125
\]

As was the case with less-than-or-equal-to constraints and slack variables, a positive value for a surplus variable indicates that the constraint is not binding.
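As a quick check of this rule, the short sketch below computes the slack or surplus for each M&D Chemicals constraint at the optimal solution. The function name `slack_or_surplus` is introduced here for illustration only.

```python
# Optimal plan from the Answer Report (Figure 12.7).
A, B = 250, 100

# (name, left-hand side at the optimum, sense, right-hand side)
rows = [
    ("Product A demand", 1 * A,         ">=", 125),
    ("Total production", 1 * A + 1 * B, ">=", 350),
    ("Processing time",  2 * A + 1 * B, "<=", 600),
]

def slack_or_surplus(lhs, sense, rhs):
    """Surplus (LHS - RHS) for >= rows, slack (RHS - LHS) for <= rows."""
    return lhs - rhs if sense == ">=" else rhs - lhs

gaps = {name: slack_or_surplus(lhs, sense, rhs) for name, lhs, sense, rhs in rows}
for name, gap in gaps.items():
    print(name, gap, "Binding" if gap == 0 else "Not Binding")
```

Only the Product A demand row has a positive value (a surplus of 125), so it is the only nonbinding constraint.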

NOTES + COMMENTS

1. In the spreadsheet and Solver model for the M&D Chemicals problem, we separated the greater-than-or-equal-to constraints and the less-than-or-equal-to constraints. This allows for easier entry of the constraints into the Add Constraint dialog box.

2. In the Excel Answer Report, both slack and surplus variables are labeled “Slack.”

FIGURE 12.7 The Solver Answer Report for the M&D Chemicals Problem

Objective Cell (Min)

| Cell | Name | Original Value | Final Value |
|---|---|---|---|
| $B$16 | Minimize Total Cost | $0.00 | $800.00 |

Variable Cells

| Cell | Name | Original Value | Final Value | Integer |
|---|---|---|---|---|
| $B$14 | Gallons Produced Product A | 0 | 250 | Contin |
| $C$14 | Gallons Produced Product B | 0 | 100 | Contin |

Constraints

| Cell | Name | Cell Value | Formula | Status | Slack |
|---|---|---|---|---|---|
| $B$19 | Product A Provided | 250 | $B$19>=$C$19 | Not Binding | 125 |
| $B$20 | Total Production Provided | 350 | $B$20>=$C$20 | Binding | 0 |
| $B$23 | Processing Time Hours Used | 600 | $B$23<=$C$23 | Binding | 0 |

12.4 Special Cases of Linear Program Outcomes

In this section we discuss three special situations that can arise when we attempt to solve linear programming problems.

Alternative Optimal Solutions

From the discussion of the graphical solution procedure, we know that optimal solutions can be found at the extreme points of the feasible region. Now let us consider the special case in which the optimal objective function contour line coincides with one of the binding constraint lines on the boundary of the feasible region. We will see that this situation can lead to the case of alternative optimal solutions; in such cases, more than one solution provides the optimal value for the objective function.

To illustrate the case of alternative optimal solutions, we return to the Par, Inc. problem. However, let us assume that the profit for the standard golf bag (S) has been decreased to $6.30. The revised objective function becomes 6.3S + 9D. The graphical solution of this problem is shown in Figure 12.8. Note that the optimal solution still occurs at an extreme point. In fact, it occurs at two extreme points: extreme point 4 (S = 300, D = 420) and extreme point 3 (S = 540, D = 252).

The objective function values at these two extreme points are identical; that is,

\[
6.3S + 9D = 6.3(300) + 9(420) = 5{,}670
\]

and

\[
6.3S + 9D = 6.3(540) + 9(252) = 5{,}670
\]

Furthermore, any point on the line connecting the two optimal extreme points also provides an optimal solution. For example, the solution point (S = 420, D = 336), which is halfway between the two extreme points, also provides the optimal objective function value of

\[
6.3S + 9D = 6.3(420) + 9(336) = 5{,}670
\]

A linear programming problem with alternative optimal solutions is generally a good situation for the manager or decision maker. It means that several combinations of the decision variables are optimal and that the manager can select the most desirable optimal solution. Unfortunately, determining whether a problem has alternative optimal solutions is not a simple matter. In Section 12.7, we discuss an approach for finding alternative optima.

FIGURE 12.8 Par, Inc. Problem with an Objective Function of 6.3S + 9D (Alternative Optimal Solutions)
[Graph. Horizontal axis: No. of Standard Bags; vertical axis: No. of Deluxe Bags. Objective function lines 6.3S + 9D = 3,780 and 6.3S + 9D = 5,670; the optimal extreme points are (300, 420) and (540, 252).]
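The claim that every point on the segment between the two optimal extreme points is also optimal can be checked numerically. The following is a small sketch, assuming the revised objective 6.3S + 9D; the helper names `profit` and `on_segment` are introduced here for illustration.

```python
from fractions import Fraction as F

def profit(S, D):
    """Revised Par, Inc. objective 6.3S + 9D, kept exact with fractions."""
    return F(63, 10) * S + 9 * D

p4 = (F(300), F(420))   # extreme point 4
p3 = (F(540), F(252))   # extreme point 3

def on_segment(t):
    """Point a fraction t of the way from extreme point 4 to extreme point 3."""
    return tuple(a + t * (b - a) for a, b in zip(p4, p3))

values = {}
for t in (F(0), F(1, 4), F(1, 2), F(3, 4), F(1)):
    S, D = on_segment(t)
    values[t] = profit(S, D)
    print(f"t={float(t):.2f}  S={float(S):6.1f}  D={float(D):6.1f}  profit={float(values[t]):.2f}")
```

Moving from extreme point 4 toward extreme point 3 adds 240 standard bags and removes 168 deluxe bags; the profit gained, 6.3(240) = 1,512, exactly offsets the profit lost, 9(168) = 1,512, which is why the objective value does not change along the segment.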

Infeasibility

Infeasibility means that no solution to the linear programming problem satisfies all the constraints, including the nonnegativity conditions. Graphically, infeasibility means that a feasible region does not exist; that is, no points satisfy all the constraints and the nonnegativity conditions simultaneously. To illustrate this situation, let us look again at the problem faced by Par, Inc.

Suppose that management specified that at least 500 of the standard bags and at least 360 of the deluxe bags must be manufactured. The graph of the solution region may now be constructed to reflect these new requirements (see Figure 12.9). The shaded area in the lower left-hand portion of the graph depicts the points that satisfy the departmental constraints on the availability of time. The shaded area in the upper right-hand portion depicts the points that satisfy the minimum production requirements of 500 standard and 360 deluxe bags. But no points satisfy both sets of constraints. Thus, we see that if management imposes these minimum production requirements, no feasible region exists for the problem.

How should we interpret infeasibility in terms of this current problem? First, we should tell management that, given the resources available (i.e., production time for cutting and dyeing, sewing, finishing, and inspection and packaging), it is not possible to make 500 standard bags and 360 deluxe bags. Moreover, we can tell management exactly how much of each resource must be expended to make it possible to manufacture these numbers of bags. Table 12.2 shows the minimum amounts of resources that must be available, the amounts currently available, and additional amounts that would be required to accomplish this level of production. Thus, we need 80 more hours for cutting and dyeing, 32 more hours for finishing, and 5 more hours for inspection and packaging to meet management’s minimum production requirements.

Problems with no feasible solution do arise in practice, most often because management’s expectations are too high or because too many restrictions have been placed on the problem.

FIGURE 12.9 No Feasible Region for the Par, Inc. Problem with Minimum Production Requirements of 500 Standard and 360 Deluxe Bags
[Graph. Horizontal axis: No. of Standard Bags; vertical axis: No. of Deluxe Bags. One shaded region shows the points satisfying the departmental constraints; a separate shaded region, bounded by the Minimum S and Minimum D lines, shows the points satisfying the minimum production requirements.]

If after reviewing this information, management still wants to manufacture 500 standard and 360 deluxe bags, additional resources must be provided. Perhaps the resource requirements can be met by hiring another person to work in the cutting and dyeing department, transferring a person from elsewhere in the plant to work part time in the finishing department, or having the sewing people help out periodically with the inspection and packaging. As you can see, once we discover the lack of a feasible solution, many possibilities are available for corrective management action. The important thing to realize is that linear programming analysis can help determine whether management’s plans are feasible. By analyzing the problem using linear programming, we are often able to point out infeasible conditions and initiate corrective action.

Whenever you attempt to solve a problem that is infeasible, Excel Solver will return a message in the Solver Results dialog box, indicating that no feasible solution exists. In this case you know that no solution to the linear programming problem will satisfy all constraints, including the nonnegativity conditions. Careful inspection of your formulation is necessary to try to identify why the problem is infeasible. In some situations, the only reasonable approach is to drop one or more constraints and re-solve the problem. If you are able to find an optimal solution for this revised problem, you will know that the constraint(s) that were omitted, in conjunction with the others, are causing the problem to be infeasible.
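The resource shortfalls that make the 500-standard, 360-deluxe plan infeasible can be computed directly from the departmental constraints. The sketch below is one way to do so in Python; `departments` and `shortfalls` are illustrative names, and the coefficients are those of the Par, Inc. model.

```python
from fractions import Fraction as F

# Hours per bag (standard, deluxe) and hours available in each department.
departments = {
    "Cutting and Dyeing":       (F(7, 10), F(1),    630),
    "Sewing":                   (F(1, 2),  F(5, 6), 600),
    "Finishing":                (F(1),     F(2, 3), 708),
    "Inspection and Packaging": (F(1, 10), F(1, 4), 135),
}

def shortfalls(s, d):
    """Extra hours each department would need to make s standard and d deluxe bags."""
    result = {}
    for name, (per_standard, per_deluxe, available) in departments.items():
        needed = per_standard * s + per_deluxe * d
        result[name] = max(needed - available, 0)
    return result

extra = shortfalls(500, 360)
for name, hours in extra.items():
    print(f"{name:26s} additional hours needed: {float(hours):.0f}")
```

Any department with a positive shortfall identifies a constraint that the required production plan violates; here only the sewing department has enough time.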

Unbounded

The solution to a maximization linear programming problem is unbounded if the value of the solution may be made infinitely large without violating any of the constraints; for a minimization problem, the solution is unbounded if the value may be made infinitely small.

As an illustration, consider the following linear program with two decision variables, X and Y:

\[
\begin{aligned}
\text{Max}\quad & 20X + 10Y \\
\text{s.t.}\quad & 1X \geq 2 \\
& 1Y \leq 5 \\
& X, Y \geq 0
\end{aligned}
\]

TABLE 12.2 Resources Needed to Manufacture 500 Standard Bags and 360 Deluxe Bags

| Operation | Minimum Required Resources (hours) | Available Resources (hours) | Additional Resources Needed (hours) |
|---|---|---|---|
| Cutting and Dyeing | 7/10(500) + 1(360) = 710 | 630 | 80 |
| Sewing | 1/2(500) + 5/6(360) = 550 | 600 | None |
| Finishing | 1(500) + 2/3(360) = 740 | 708 | 32 |
| Inspection and Packaging | 1/10(500) + 1/4(360) = 140 | 135 | 5 |

In Figure 12.10 we graph the feasible region associated with this problem. Note that we can indicate only part of the feasible region because the feasible region extends indefinitely in the direction of the X-axis. Looking at the objective function lines in Figure 12.10, we see that the solution to this problem may be made as large as we desire. In other words, no matter which solution we pick, we will always be able to reach some feasible solution with a larger value. Thus, we say that the solution to this linear program is unbounded.

FIGURE 12.10 Example of an Unbounded Problem
[Graph of the feasible region with objective function lines 20X + 10Y = 80, 20X + 10Y = 160, and 20X + 10Y = 240. The objective function increases without limit.]

Whenever you attempt to solve an unbounded problem using Excel Solver, you will receive a message in the Solver Results dialog box telling you that the “Objective Cell values do not converge.” In linear programming models of real problems, the occurrence of an unbounded solution means that the problem has been improperly formulated. We know it is not possible to increase profits indefinitely. Therefore, we must conclude that if a profit maximization problem results in an unbounded solution, the mathematical model does not sufficiently represent the real-world problem. In many cases, this error is the result of inadvertently omitting a constraint during problem formulation.
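To see what "unbounded" means for the example above, the sketch below walks along the feasible ray Y = 5 with X growing by a factor of 10 at each step. It assumes the constraints X ≥ 2, Y ≤ 5, and X, Y ≥ 0; the function names are introduced for illustration.

```python
def feasible(X, Y):
    """Constraints of the example: X >= 2, Y <= 5, and X, Y >= 0."""
    return X >= 2 and Y <= 5 and X >= 0 and Y >= 0

def objective(X, Y):
    """Objective function 20X + 10Y."""
    return 20 * X + 10 * Y

# Walk along the ray Y = 5 with X = 2, 20, 200, ...; every point stays
# feasible while the objective value keeps growing.
trail = []
X = 2
for _ in range(5):
    assert feasible(X, 5)
    trail.append(objective(X, 5))
    X *= 10
print(trail)
```

No constraint limits X from above, so the loop could be continued indefinitely; adding an upper limit on X (for example, a resource constraint that was omitted from the formulation) would make the problem bounded.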

The parameters for optimization models are often less than certain. In the next section, we discuss the sensitivity of the optimal solution to uncertainty in the model parameters. In addition to the optimal solution, Excel Solver can provide some useful information on the sensitivity of that solution to changes in the model parameters.

12.5 Sensitivity Analysis

Sensitivity analysis is the study of how the changes in the input parameters of an optimization model affect the optimal solution. Using sensitivity analysis, we can answer questions such as the following:

1. How will a change in a coefficient of the objective function affect the optimal solution?

2. How will a change in the right-hand-side value for a constraint affect the optimal solution?

Because sensitivity analysis is concerned with how these changes affect the optimal solution, the analysis does not begin until the optimal solution to the original linear programming problem has been obtained. For that reason, sensitivity analysis is often referred to as postoptimality analysis. Let us return to the M&D Chemicals problem as an example of how to interpret the sensitivity report provided by Excel Solver.

Interpreting Excel Solver Sensitivity Report

Recall the M&D Chemicals problem discussed in Section 12.3. We had defined the following decision variables and model:

A = number of gallons of product A

B = number of gallons of product B

\[
\begin{aligned}
\text{Min}\quad & 2A + 3B \\
\text{s.t.}\quad & 1A \geq 125 && \text{Demand for product A} \\
& 1A + 1B \geq 350 && \text{Total production} \\
& 2A + 1B \leq 600 && \text{Processing time} \\
& A, B \geq 0
\end{aligned}
\]

We found that the optimal solution is A = 250 and B = 100 with objective function value 2(250) + 3(100) = $800. The first constraint is not binding, but the second and third constraints are binding because 1(250) + 1(100) = 350 and 2(250) + 100 = 600. After running Excel Solver, we may generate the Sensitivity Report by selecting Sensitivity from the Reports section of the Solver Results dialog box and then selecting OK. The Sensitivity Report for the M&D Chemicals problem appears in Figure 12.11. There are two sections in this report: one for decision variables (Variable Cells) and one for Constraints.

NOTES + COMMENTS

1. Infeasibility is independent of the objective function. It exists because the constraints are so restrictive that no feasible region for the linear programming model is possible. Thus, when you encounter infeasibility, making changes in the coefficients of the objective function will not help; the problem will remain infeasible.

2. The occurrence of an unbounded solution is often the result of a missing constraint. However, a change in the objective function may cause a previously unbounded problem to become bounded with an optimal solution. For example, the graph in Figure 12.10 shows an unbounded solution for the objective function Max 20X + 10Y. However, changing the objective function to Max −20X − 10Y will provide the optimal solution X = 2 and Y = 0 even though no changes have been made in the constraints.

Let us begin by interpreting the Constraints section. The cell location of the left-hand side of the constraint, the constraint name, and the value of the left-hand side of the constraint at optimality are given in the first three columns. The fourth column gives the shadow price for each constraint. The shadow price for a constraint is the change in the optimal objective function value if the right-hand side of that constraint is increased by one. Let us interpret each shadow price given in the report in Figure 12.11.

The first constraint is 1A ≥ 125. This is a nonbinding constraint because 250 > 125. If we change the constraint to 1A ≥ 126, there will be no change in the objective function value. The reason for this is that the constraint will remain nonbinding at the optimal solution, because 1A = 250 > 126. Hence, the shadow price is zero. In fact, nonbinding constraints will always have a shadow price of zero.

The second constraint is binding and its shadow price is 4. The interpretation of the shadow price is as follows. If we change the constraint from 1A + 1B ≥ 350 to 1A + 1B ≥ 351, the optimal objective function value will increase by $4; that is, the new optimal solution will have an objective function value equal to $800 + $4 = $804.

The third constraint is also binding and has a shadow price of −1. The interpretation of the shadow price is as follows. If we change the constraint from 2A + 1B ≤ 600 to 2A + 1B ≤ 601, the objective function value will decrease by $1; that is, the new optimal solution will have an objective function value equal to $800 − $1 = $799.

Note that the shadow price for the second constraint is positive, but for the third it is negative. Why is this? The sign of the shadow price depends on whether the problem is a maximization or a minimization and the type of constraint under consideration. The M&D Chemicals problem is a cost minimization problem. The second constraint is a greater-than-or-equal-to constraint. By increasing the right-hand side, we make the constraint even more restrictive. This results in an increase in cost. Contrast this with the third constraint. The third constraint is a less-than-or-equal-to constraint. By increasing the right-hand side, we make more hours available. We have made the constraint less restrictive. Because we have made the constraint less restrictive, there are more feasible solutions from which to choose. Therefore, cost drops by $1.

When observing shadow prices, the following general principle holds: Making a binding constraint more restrictive degrades or leaves unchanged the optimal objective function value, and making a binding constraint less restrictive improves or leaves unchanged the optimal objective function. We shall see several more examples of this later in this chapter. Also, shadow prices are symmetric; so the negative of the shadow price is the change in the objective function for a decrease of one in the right-hand side.
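The three shadow prices can be reproduced without the Sensitivity Report by re-solving the model with each right-hand side increased by one and comparing the optimal cost with the original $800. The sketch below does this by enumerating extreme points, which works only because the model has two variables; `min_cost` is a hypothetical helper written for this illustration and is not part of Excel Solver.

```python
from fractions import Fraction as F
from itertools import combinations

def min_cost(rhs_a=125, rhs_total=350, rhs_time=600, cost=(2, 3)):
    """Solve the M&D model by checking every extreme point (two variables only).

    Returns (optimal cost, A, B), or None if no feasible extreme point exists.
    """
    cons = [
        (1, 0, ">=", rhs_a),      # A >= rhs_a
        (1, 1, ">=", rhs_total),  # A + B >= rhs_total
        (2, 1, "<=", rhs_time),   # 2A + B <= rhs_time
        (1, 0, ">=", 0),          # A >= 0
        (0, 1, ">=", 0),          # B >= 0
    ]

    def ok(A, B):
        return all((a * A + b * B >= r) if s == ">=" else (a * A + b * B <= r)
                   for a, b, s, r in cons)

    best = None
    for (a1, b1, _, r1), (a2, b2, _, r2) in combinations(cons, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:              # parallel constraint lines
            continue
        A, B = F(r1 * b2 - r2 * b1, det), F(a1 * r2 - a2 * r1, det)
        if ok(A, B):
            value = cost[0] * A + cost[1] * B
            if best is None or value < best[0]:
                best = (value, A, B)
    return best

base = min_cost()[0]
shadow_prices = {
    "Product A demand": min_cost(rhs_a=126)[0] - base,
    "Total production": min_cost(rhs_total=351)[0] - base,
    "Processing time":  min_cost(rhs_time=601)[0] - base,
}
print(shadow_prices)
```

Tightening the total production requirement raises cost by $4, relaxing the processing time limit lowers cost by $1, and changing the nonbinding Product A requirement has no effect, in agreement with the general principle stated above.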

Making a constraint more restrictive is often referred to as tightening the constraint. Making a constraint less restrictive is often referred to as relaxing, or loosening, the constraint.

FIGURE 12.11 Solver Sensitivity Report for the M&D Chemicals Problem

Variable Cells

| Cell | Name | Final Value | Reduced Cost | Objective Coefficient | Allowable Increase | Allowable Decrease |
|---|---|---|---|---|---|---|
| $B$14 | Gallons Produced Product A | 250 | 0 | 2 | 1 | 1E+30 |
| $C$14 | Gallons Produced Product B | 100 | 0 | 3 | 1E+30 | 1 |

Constraints

| Cell | Name | Final Value | Shadow Price | Constraint R.H. Side | Allowable Increase | Allowable Decrease |
|---|---|---|---|---|---|---|
| $B$19 | Product A Provided | 250 | 0 | 125 | 125 | 1E+30 |
| $B$20 | Total Production Provided | 350 | 4 | 350 | 125 | 50 |
| $B$23 | Processing Time Hours Used | 600 | −1 | 600 | 100 | 125 |


In Figure 12.11, the Allowable Increase and the Allowable Decrease are the allowable changes in the right-hand side for which the current shadow price remains valid. For example, because the allowable increase in the processing time is 100, if we increase the processing time hours by 50 to 600 + 50 = 650, we can say with certainty that the optimal objective function value will change by (−1)50 = −50. Hence, we know that the optimal objective function value will be $800 − $50 = $750. If we increase the right-hand side of the processing time beyond the allowable increase of 100, we cannot predict what will happen. Likewise, if we decrease the right-hand side of the processing time constraint by 50, we know that the optimal objective function value will change by the negative of the shadow price: −(−1)50 = 50. Cost will increase by $50. If we change the right-hand side by more than the allowable increase or decrease, the shadow price is no longer valid.
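For this small model, the range over which the shadow price of −1 holds can be derived by hand. The derivation below is a sketch that assumes the total production and processing time constraints both stay binding; the symbol h, standing for the processing time hours available, is introduced here for illustration.

\[
\begin{aligned}
A + B &= 350, \qquad 2A + B = h \\
A &= h - 350, \qquad B = 700 - h \\
\text{Cost} &= 2A + 3B = 2(h - 350) + 3(700 - h) = 1400 - h
\end{aligned}
\]

Each additional hour therefore lowers cost by $1, which is the shadow price of −1. The solution stays feasible only while \(A \geq 125\) and \(B \geq 0\), that is, while

\[
475 \leq h \leq 700 ,
\]

which corresponds to the allowable decrease of 125 and the allowable increase of 100 shown in Figure 12.11. At \(h = 650\) the cost is \(1400 - 650 = 750\), in agreement with the calculation above.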

Let us now turn to the Variable Cells section of the Sensitivity Report. As in the constraint section, the cell location, variable name, and final (optimal) value for each variable are given. The fourth column is Reduced Cost. The reduced cost for a decision variable is the shadow price of the nonnegativity constraint for that variable. In other words, the reduced cost indicates the change in the optimal objective function value that results from changing the right-hand side of the nonnegativity constraint from 0 to 1.

In the fifth column of the report, the objective function coefficient for the variable is given. The Allowable Increase and Allowable Decrease indicate the change in the objective function coefficient for which the current optimal solution will remain optimal. The value 1E+30 in the report is essentially infinity. So long as the cost of product A is greater than or equal to negative infinity and less than or equal to 2 + 1 = 3, the current solution remains optimal. For example, if the cost of product A is really $2.50 per gallon, we do not need to re-solve the model. Because the increase in cost of $0.50 is less than the allowable increase of $1.00, the current solution of 250 gallons of product A and 100 gallons of product B remains optimal.
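The allowable increase for the cost of product A can be illustrated by evaluating the objective function at the extreme points of the feasible region for several values of that cost. In the sketch below, the list `extreme_points` is an assumption obtained by intersecting the constraint lines by hand (it is not taken from the Solver report), and `best_plan` is an illustrative helper.

```python
from fractions import Fraction as F

# Extreme points (A, B) of the M&D feasible region, found by intersecting
# pairs of constraint lines: total production with processing time,
# Product A minimum with total production, and Product A minimum with
# processing time. Listed here as an assumption for this illustration.
extreme_points = [(250, 100), (125, 225), (125, 350)]

def best_plan(cost_a, cost_b=3):
    """Lowest-cost extreme point for the given per-gallon costs."""
    return min(extreme_points, key=lambda p: cost_a * p[0] + cost_b * p[1])

plans = {}
for cost_a in (F(2), F(5, 2), F(3), F(7, 2)):
    plans[cost_a] = best_plan(cost_a)
    print(f"cost of A = ${float(cost_a):.2f}: best plan = {plans[cost_a]}")
```

At a cost of exactly $3.00 the plans (250, 100) and (125, 225) tie (the code reports the first one listed), and above $3.00 the optimal plan changes to 125 gallons of product A and 225 gallons of product B.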

As we have seen, the Excel Solver Sensitivity Report can provide useful information about the sensitivity of the optimal solution to changes in the model input data. However, this type of classical sensitivity analysis is somewhat limited. Classical sensitivity analysis is based on the assumption that only one piece of input data has changed; it is assumed that all other parameters remain as stated in the original problem. In many cases, however, we are interested in what would happen if two or more pieces of input data are changed simultaneously. The easiest way to examine the effect of simultaneous changes is to make the changes and rerun the model.

NOTES + COMMENTS

We defined the reduced cost as the shadow price of the nonnegativity constraint for that variable. When there is a binding simple upper-bound constraint for a variable, the reduced cost reported by Excel Solver is the shadow price of that upper-bound constraint. Likewise, if there is a binding nonzero lower bound for a variable, the reduced cost is the shadow price for that lower-bound constraint. So to be more general, the reduced cost for a decision variable is the shadow price of the binding simple lower- or upper-bound constraint for that variable.

12.6 General Linear Programming Notation and More Examples

Earlier in this chapter we showed how to formulate linear programming models for the Par, Inc. and M&D Chemicals problems. To formulate a linear programming model of the Par, Inc. problem, we began by defining two decision variables: S = number of standard bags and D = number of deluxe bags. In the M&D Chemicals problem, the two decision variables were defined as A = number of gallons of product A and B = number of gallons of product B. We selected decision-variable names of S and D in the Par, Inc. problem and A and B in the M&D Chemicals problem to make it easier to recall what these decision variables represented in the problem. Although this approach works well for linear programs involving a small number of decision variables, it can become difficult when dealing with problems involving a large number of decision variables.

A more general notation that is often used for linear programs uses the letter x with a subscript. For instance, in the Par, Inc. problem, we could have defined the decision variables as follows:

\(x_1\) = number of standard bags

\(x_2\) = number of deluxe bags

In the M&D Chemicals problem, the same variable names would be used, but their definitions would change:

\(x_1\) = number of gallons of product A

\(x_2\) = number of gallons of product B

A disadvantage of using general notation for decision variables is that we are no longer able to easily identify what the decision variables actually represent in the mathematical model. However, the advantage of general notation is that formulating a mathematical model for a problem that involves a large number of decision variables is much easier. For instance, for a linear programming model with three decision variables, we would use variable names of \(x_1\), \(x_2\), and \(x_3\); for a problem with four decision variables, we would use variable names of \(x_1\), \(x_2\), \(x_3\), and \(x_4\); and so on. Clearly, if a problem involved 1,000 decision variables, trying to identify 1,000 unique names would be difficult. However, using the general linear programming notation, the decision variables would be defined as \(x_1, x_2, x_3, \ldots, x_{1000}\).

Using this new general notation, the Par, Inc. model would be written as follows:

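\[
\begin{aligned}
\text{Max}\quad & 10x_1 + 9x_2 \\
\text{s.t.}\quad & \tfrac{7}{10}x_1 + 1x_2 \leq 630 && \text{Cutting and dyeing} \\
& \tfrac{1}{2}x_1 + \tfrac{5}{6}x_2 \leq 600 && \text{Sewing} \\
& 1x_1 + \tfrac{2}{3}x_2 \leq 708 && \text{Finishing} \\
& \tfrac{1}{10}x_1 + \tfrac{1}{4}x_2 \leq 135 && \text{Inspection and packaging} \\
& x_1, x_2 \geq 0
\end{aligned}
\]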

In some of the examples that follow in this section and in Chapters 13 and 14, we will use this type of subscripted notation.

Investment Portfolio Selection

In finance, linear programming can be applied in problem situations involving capital budgeting, make-or-buy decisions, asset allocation, portfolio selection, financial planning, and many more. Next, we describe a portfolio selection problem.

Portfolio selection problems involve situations in which a financial manager must select specific investments (for example, stocks and bonds) from a variety of investment alternatives. Managers of mutual funds, credit unions, insurance companies, and banks frequently encounter this type of problem. The objective function for portfolio selection problems usually is maximization of expected return or minimization of risk. The constraints usually take the form of restrictions on the type of permissible investments, state laws, company policy, maximum permissible risk, and so on. Problems of this type have been formulated and solved using a variety of optimization techniques. In this section we formulate and solve a portfolio selection problem as a linear program.

Consider the case of Welte Mutual Funds, Inc. located in New York City. Welte just obtained $100,000 by converting industrial bonds to cash and is now looking for other investment opportunities for these funds. Based on Welte’s current investments, the firm’s top financial analyst recommends that all new investments be made in the oil industry, steel industry, or government bonds. Specifically, the analyst identified five investment opportunities and projected their annual rates of return. The investments and rates of return are shown in Table 12.3.

The management at Welte imposed the following investment guidelines:

1. Neither industry (oil or steel) should receive more than $50,000.

2. The amount invested in government bonds should be at least 25% of the steel industry investments.

3. The investment in Pacific Oil, the high-return but high-risk investment, cannot be more than 60% of the total oil industry investment.

What portfolio recommendations (investments and amounts) should be made for the available $100,000? Given the objective of maximizing projected return subject to the budgetary and managerially imposed constraints, we can answer this question by formulating and solving a linear programming model of the problem. The solution will provide investment recommendations for the management of Welte Mutual Funds.

Let us define the following decision variables:

\(X_1\) = dollars invested in Atlantic Oil

\(X_2\) = dollars invested in Pacific Oil

\(X_3\) = dollars invested in Midwest Steel

\(X_4\) = dollars invested in Huber Steel

\(X_5\) = dollars invested in government bonds

Using the projected rates of return shown in Table 12.3, we write the objective function for maximizing the total return for the portfolio as

\[
\text{Max}\quad 0.073X_1 + 0.103X_2 + 0.064X_3 + 0.075X_4 + 0.045X_5
\]

The constraint specifying investment of the available $100,000 is

\[
X_1 + X_2 + X_3 + X_4 + X_5 = 100{,}000
\]

The requirements that neither the oil nor steel industry should receive more than $50,000 are as follows:

\[
\begin{aligned}
X_1 + X_2 &\leq 50{,}000 \\
X_3 + X_4 &\leq 50{,}000
\end{aligned}
\]

The requirement that the amount invested in government bonds be at least 25% of the steel industry investment is expressed as

\[
X_5 \geq 0.25(X_3 + X_4)
\]

Finally, the constraint that Pacific Oil cannot be more than 60% of the total oil industry investment is

\[
X_2 \leq 0.60(X_1 + X_2)
\]

TABLE 12.3 Investment Opportunities for Welte Mutual Funds

| Investment | Projected Rate of Return (%) |
|---|---|
| Atlantic Oil | 7.3 |
| Pacific Oil | 10.3 |
| Midwest Steel | 6.4 |
| Huber Steel | 7.5 |
| Government bonds | 4.5 |


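By adding the nonnegativity restrictions, we obtain the complete linear programming model for the Welte Mutual Funds investment problem:

\[
\begin{aligned}
\text{Max}\quad & 0.073X_1 + 0.103X_2 + 0.064X_3 + 0.075X_4 + 0.045X_5 \\
\text{s.t.}\quad & X_1 + X_2 + X_3 + X_4 + X_5 = 100{,}000 && \text{Available funds} \\
& X_1 + X_2 \leq 50{,}000 && \text{Oil industry maximum} \\
& X_3 + X_4 \leq 50{,}000 && \text{Steel industry maximum} \\
& X_5 \geq 0.25(X_3 + X_4) && \text{Government bonds minimum} \\
& X_2 \leq 0.60(X_1 + X_2) && \text{Pacific Oil restriction} \\
& X_1, X_2, X_3, X_4, X_5 \geq 0
\end{aligned}
\]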
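The Welte model is small enough to be solved by brute force as a check on a spreadsheet solution. The sketch below, written in plain Python, enumerates candidate extreme points by pairing the budget equality with every choice of four inequality constraints treated as equalities, solves each 5 × 5 system exactly, and keeps the feasible point with the highest return. This is an illustrative approach only and not the algorithm Excel Solver uses; it relies on the feasible region being bounded, and the helper names `solve` and `dot` are introduced here.

```python
from fractions import Fraction as F
from itertools import combinations

# Projected rates of return from Table 12.3 (X1..X5).
returns = [F(73, 1000), F(103, 1000), F(64, 1000), F(75, 1000), F(45, 1000)]

budget_row = ([1, 1, 1, 1, 1], 100_000)            # X1 + ... + X5 = 100,000
inequalities = [                                   # each row means a . x <= b
    ([1, 1, 0, 0, 0], 50_000),                     # oil industry maximum
    ([0, 0, 1, 1, 0], 50_000),                     # steel industry maximum
    ([0, 0, F(1, 4), F(1, 4), -1], 0),             # X5 >= 0.25(X3 + X4)
    ([F(-3, 5), F(2, 5), 0, 0, 0], 0),             # X2 <= 0.60(X1 + X2)
] + [([-1 if j == i else 0 for j in range(5)], 0) for i in range(5)]  # Xi >= 0

def solve(rows):
    """Gauss-Jordan elimination on exact fractions; returns None if singular."""
    n = len(rows)
    m = [[F(v) for v in a] + [F(b)] for a, b in rows]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[-1] for row in m]

def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))

best_value, best_x = None, None
for tight in combinations(inequalities, 4):
    x = solve([budget_row, *tight])
    if x is None or any(dot(a, x) > b for a, b in inequalities):
        continue                                   # singular system or infeasible point
    value = dot(returns, x)
    if best_value is None or value > best_value:
        best_value, best_x = value, x

print([int(v) for v in best_x], float(best_value))
```

The enumeration examines 126 candidate systems, which is trivial here but grows very quickly with problem size; that growth is the reason practical models are solved with a dedicated linear programming solver.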

NOTES + COMMENTS

1. The optimal solution to the Welte Mutual Funds problem indicates that $20,000 should be spent on the Atlantic Oil stock. If Atlantic Oil sells for $75 per share, we would have to purchase exactly 266 2/3 shares in order to spend exactly $20,000. The difficulty of purchasing fractional shares can be handled by purchasing the largest possible integer number of shares with the allotted funds (e.g., 266 shares of Atlantic Oil). This approach guarantees that the budget constraint will not be violated. This approach, of course, introduces the possibility that the solution will no longer be optimal, but the danger is slight if a large number of securities are involved. In cases in which the analyst believes that the decision variables must have integer values, the problem must be formulated as an integer linear programming model (the topic of Chapter 13).

2. Financial portfolio theory stresses obtaining a proper balance between risk and return. In the Welte problem, we explicitly considered return in the objective function. Risk is controlled by choosing constraints that ensure diversity among oil and steel stocks and a balance between government bonds and the steel industry investment. In Chapter 14, we discuss investment portfolio models that control risk as measured by the variance of returns on investment.

Transportation Planning The transportation problem arises frequently in planning for the distribution of goods and services from several supply locations to several demand locations. Typically, the quantity of goods available at each supply location (origin) is limited, and the quantity of goods needed at each of several demand locations (destinations) is known. The usual objective in a transpor- tation problem is to minimize the cost of shipping goods from the origins to the destinations.

Let us revisit the transportation problem faced by Foster Generators, discussed in Chapter 10. This problem involves the transportation of a product from three plants to four distribution centers. Foster Generators operates plants in Cleveland, Ohio; Bedford, Indiana; and York, Pennsylvania. Production capacities over the next three-month planning period for one type of generator are as follows:

Origin Plant Three-Month Production Capacity (units)

1 Cleveland 5,000

2 Bedford 6,000

3 York 2,500

Total 13,500

Max 0.073 X 1

1 0.103 X 2

1 0.064 X 3

1 0.075 X 4

1 0.045 X 5

s.t.

X 1

1 X 2

1 X 3

1 X 4

1 X 5

5 100,000 Available funds

X 1

1 X 2

# 50,000 Oil industry maximum

X 3

1 X 4

# 50,000 Steel industry maximum

X 5

$ 0.25 (X 3 1 X

4 ) Government bonds minimum

X 2

# 0.60 (X 1 1 X

2 ) Pacific Oil restriction

X 1 , X

2 , X

3 , X

4 , X

5 $ 0

The optimal solution to this linear program is shown in Figure 12.12. Note that the optimal solution indicates that the portfolio should be diversified among all the invest- ment opportunities except Midwest Steel. The projected annual return for this portfolio is $8,000, which is an overall return of 8%. Except for the upper bound on the Steel invest- ment, all constraints are binding.


12.6 General Linear Programming Notation and More Examples 581

[Figure 12.12: The Solution for the Welte Mutual Funds Problem. Spreadsheet model shown in formula view and value view. Key formulas: Max Total Return =SUMPRODUCT(B5:B9, B14:B18); Total funds invested =SUM(B14:B18); Oil funds invested =SUM(B14:B15); Steel funds invested =SUM(B16:B17). Amounts invested in the solution: Atlantic Oil $20,000; Pacific Oil $30,000; Midwest Steel $0; Huber Steel $40,000; Gov’t Bonds $10,000. Max Total Return $8,000; Unused Funds $0.]

Destination   Distribution Center   Three-Month Demand Forecast (units)
1             Boston                 6,000
2             Chicago                4,000
3             St. Louis              2,000
4             Lexington              1,500
              Total                 13,500


Management would like to determine how much of its production should be shipped from each plant to each distribution center. Figure 12.13 shows graphically the 12 distribution routes Foster can use. Such a graph is called a network; the circles are referred to as nodes, and the lines connecting the nodes as arcs. Each origin and destination is represented by a node, and each possible shipping route is represented by an arc. The amount of the supply is written next to each origin node, and the amount of the demand is written next to each destination node. The goods shipped from the origins to the destinations represent the flow in the network. Note that the direction of flow (from origin to destination) is indicated by the arrows.

For Foster’s transportation problem, the objective is to determine the routes to be used and the quantity to be shipped via each route that will provide the minimum total transportation cost. The cost for each unit shipped on each route is given in Table 12.4 and is shown on each arc in Figure 12.13.

A linear programming model can be used to solve this transportation problem. We use double-subscripted decision variables, with $x_{11}$ denoting the number of units shipped from origin 1 (Cleveland) to destination 1 (Boston), $x_{12}$ denoting the number of units shipped from origin 1 (Cleveland) to destination 2 (Chicago), and so on. In general, the decision variables for a transportation problem having m origins and n destinations are written as follows:

$$x_{ij} = \text{number of units shipped from origin } i \text{ to destination } j, \quad \text{where } i = 1, 2, \ldots, m \text{ and } j = 1, 2, \ldots, n$$

Because the objective of the transportation problem is to minimize the total transportation cost, we can use the cost data in Table 12.4 or on the arcs in Figure 12.13 to develop the following cost expressions:

$$
\begin{aligned}
\text{Transportation costs for units shipped from Cleveland} &= 3x_{11} + 2x_{12} + 7x_{13} + 6x_{14} \\
\text{Transportation costs for units shipped from Bedford} &= 6x_{21} + 5x_{22} + 2x_{23} + 3x_{24} \\
\text{Transportation costs for units shipped from York} &= 2x_{31} + 5x_{32} + 4x_{33} + 5x_{34}
\end{aligned}
$$

The sum of these expressions provides the objective function showing the total transportation cost for Foster Generators.

Transportation problems need constraints because each origin has a limited supply and each destination has a demand requirement. We consider the supply constraints first. The capacity at the Cleveland plant is 5,000 units. With the total number of units shipped from the Cleveland plant expressed as $x_{11} + x_{12} + x_{13} + x_{14}$, the supply constraint for the Cleveland plant is

$$x_{11} + x_{12} + x_{13} + x_{14} \le 5{,}000 \quad \text{Cleveland supply}$$

With three origins (plants), the Foster transportation problem has three supply constraints. Given the capacity of 6,000 units at the Bedford plant and 2,500 units at the York plant, the two additional supply constraints are as follows:

$$
\begin{aligned}
x_{21} + x_{22} + x_{23} + x_{24} &\le 6{,}000 \quad \text{Bedford supply} \\
x_{31} + x_{32} + x_{33} + x_{34} &\le 2{,}500 \quad \text{York supply}
\end{aligned}
$$

With the four distribution centers as the destinations, four demand constraints are needed to ensure that destination demands will be satisfied:

$$
\begin{aligned}
x_{11} + x_{21} + x_{31} &= 6{,}000 \quad \text{Boston demand} \\
x_{12} + x_{22} + x_{32} &= 4{,}000 \quad \text{Chicago demand} \\
x_{13} + x_{23} + x_{33} &= 2{,}000 \quad \text{St. Louis demand} \\
x_{14} + x_{24} + x_{34} &= 1{,}500 \quad \text{Lexington demand}
\end{aligned}
$$


TABLE 12.4  Transportation Cost per Unit for the Foster Generators Transportation Problem ($)

                     Destination
Origin       Boston   Chicago   St. Louis   Lexington
Cleveland    3        2         7           6
Bedford      6        5         2           3
York         2        5         4           5

[Figure 12.13: The Network Representation of the Foster Generators Transportation Problem. Plants (origin nodes): 1 Cleveland, 2 Bedford, 3 York, with supplies of 5,000, 6,000, and 2,500 units. Distribution centers (destination nodes): 1 Boston, 2 Chicago, 3 St. Louis, 4 Lexington, with demands of 6,000, 4,000, 2,000, and 1,500 units. Each of the 12 distribution routes (arcs) is labeled with its transportation cost per unit, as given in Table 12.4.]


Combining the objective function and constraints into one model provides a 12-variable, 7-constraint linear programming formulation of the Foster Generators transportation problem:

$$
\begin{array}{ll}
\text{Min} & 3x_{11} + 2x_{12} + 7x_{13} + 6x_{14} + 6x_{21} + 5x_{22} + 2x_{23} + 3x_{24} + 2x_{31} + 5x_{32} + 4x_{33} + 5x_{34} \\
\text{s.t.} & \\
 & x_{11} + x_{12} + x_{13} + x_{14} \le 5{,}000 \\
 & x_{21} + x_{22} + x_{23} + x_{24} \le 6{,}000 \\
 & x_{31} + x_{32} + x_{33} + x_{34} \le 2{,}500 \\
 & x_{11} + x_{21} + x_{31} = 6{,}000 \\
 & x_{12} + x_{22} + x_{32} = 4{,}000 \\
 & x_{13} + x_{23} + x_{33} = 2{,}000 \\
 & x_{14} + x_{24} + x_{34} = 1{,}500 \\
 & x_{ij} \ge 0 \quad \text{for } i = 1, 2, 3 \text{ and } j = 1, 2, 3, 4
\end{array}
$$

Comparing the linear programming formulation to the network in Figure 12.13 leads to several observations. All the information needed for the linear programming formulation is on the network. Each node has one constraint, and each arc has one variable. The sum of the variables corresponding to arcs from an origin node must be less than or equal to the origin’s supply, and the sum of the variables corresponding to the arcs into a destination node must be equal to the destination’s demand.

A spreadsheet model and the solution to the Foster Generators problem (Figure 12.14) show that the minimum total transportation cost is $39,500. The values for the decision variables show the optimal amounts to ship over each route. For example, 1,000 units should be shipped from Cleveland to Boston, and 4,000 units should be shipped from Cleveland to Chicago. Other values of the decision variables indicate the remaining shipping quantities and routes.
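The node-and-arc structure of the model, and the reported cost, can be checked with a few lines of code. The sketch below is only a verification aid, not a solver: it stores the supply, demand, and cost data from the tables above, takes the shipping plan reported for Figure 12.14 (restated in Section 12.7) as given, and confirms that the plan respects every supply limit, meets every demand exactly, and costs $39,500. The dictionary layout and the helper name `shipped` are assumptions made for illustration.

```python
# Data from the capacity and demand tables and from Table 12.4.
supply = {"Cleveland": 5000, "Bedford": 6000, "York": 2500}
demand = {"Boston": 6000, "Chicago": 4000, "St. Louis": 2000, "Lexington": 1500}
cost = {
    "Cleveland": {"Boston": 3, "Chicago": 2, "St. Louis": 7, "Lexington": 6},
    "Bedford":   {"Boston": 6, "Chicago": 5, "St. Louis": 2, "Lexington": 3},
    "York":      {"Boston": 2, "Chicago": 5, "St. Louis": 4, "Lexington": 5},
}

# Reported shipping plan; routes that are not listed carry zero units.
plan = {
    ("Cleveland", "Boston"): 1000, ("Cleveland", "Chicago"): 4000,
    ("Bedford", "Boston"): 2500, ("Bedford", "St. Louis"): 2000,
    ("Bedford", "Lexington"): 1500, ("York", "Boston"): 2500,
}


def shipped(origin, dest):
    """Units shipped on one arc of the network (zero if the arc is unused)."""
    return plan.get((origin, dest), 0)


# One decision variable per arc and one constraint per node.
num_variables = len(supply) * len(demand)
num_constraints = len(supply) + len(demand)

total_cost = sum(cost[o][d] * shipped(o, d) for o in supply for d in demand)
supply_ok = all(sum(shipped(o, d) for d in demand) <= supply[o] for o in supply)
demand_ok = all(sum(shipped(o, d) for o in supply) == demand[d] for d in demand)

print(num_variables, num_constraints, total_cost, supply_ok, demand_ok)
```

Generating the counts from the data (3 x 4 arcs, 3 + 4 nodes) mirrors the observation that each arc contributes one variable and each node contributes one constraint.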

Advertising Campaign Planning

Applications of linear programming to marketing are numerous. Advertising campaign planning, marketing mix, and market research are just a few areas of application. In this section we consider an advertising campaign planning application.

Advertising campaign planning applications of linear programming are designed to help marketing managers allocate a fixed advertising budget to various advertising media. Potential media include newspapers, magazines, radio, television, and direct mail. In these applications, the objective is to maximize reach, frequency, and quality of exposure. Restrictions on the allowable allocation usually arise during consideration of company policy, contract requirements, and media availability. In the application that follows, we illustrate how a media selection problem might be formulated and solved using a linear programming model.

Relax-and-Enjoy Lake Development Corporation is developing a lakeside community at a privately owned lake. The primary market for the lakeside lots and homes includes all middle- and upper-income families within approximately 100 miles of the development. Relax-and-Enjoy employed the advertising firm of Boone, Phillips, and Jackson (BP&J) to design the promotional campaign.

After considering possible advertising media and the market to be covered, BP&J recommended that the first month’s advertising be restricted to five media. At the end of the month, BP&J will then reevaluate its strategy based on the month’s results. BP&J collected data on the number of potential customers reached, the cost per advertisement, the maximum number of times each medium is available, and the exposure quality rating for each of the five media. The quality rating is measured in terms of an exposure quality unit, a measure of the relative value of one advertisement in each of the media. This measure, based on BP&J’s experience in the advertising business, takes into account factors such as audience demographics (age, income, and education of the audience reached), image presented, and quality of the advertisement. The information collected is presented in Table 12.5.

[Figure 12.14: Spreadsheet Model and Solution for the Foster Generator Problem. Spreadsheet model shown in formula view and value view. Key formulas: Total Cost =SUMPRODUCT(B5:E7,B17:E19); units received at each destination =SUM(B17:B19) through =SUM(E17:E19); units shipped from each origin =SUM(B17:E17) through =SUM(B19:E19). Total Cost at the optimal solution: $39,500.]

Relax-and-Enjoy provided BP&J with an advertising budget of $30,000 for the first month’s campaign. In addition, Relax-and-Enjoy imposed the following restrictions on how BP&J may allocate these funds: At least 10 television commercials must be used, at least 50,000 potential customers must be reached, and no more than $18,000 may be spent on television advertisements. What advertising media selection plan should be recommended?


The decision to be made is how many times to use each medium. We begin by defining the decision variables:

$$
\begin{aligned}
DTV &= \text{number of times daytime TV is used} \\
ETV &= \text{number of times evening TV is used} \\
DN &= \text{number of times daily newspaper is used} \\
SN &= \text{number of times Sunday newspaper is used} \\
R &= \text{number of times radio is used}
\end{aligned}
$$

The data on quality of exposure in Table 12.5 show that each daytime TV (DTV) advertisement is rated at 65 exposure quality units. Thus, an advertising plan with DTV advertisements will provide a total of 65DTV exposure quality units. Continuing with the data in Table 12.5, we find evening TV (ETV) rated at 90 exposure quality units, daily newspaper (DN) rated at 40 exposure quality units, Sunday newspaper (SN) rated at 60 exposure quality units, and radio (R) rated at 20 exposure quality units. With the objective of maximizing the total exposure quality units for the overall media selection plan, the objective function becomes:

$$\text{Max } 65DTV + 90ETV + 40DN + 60SN + 20R \quad \text{Exposure quality}$$

We now formulate the constraints for the model from the information given.

Each medium has a maximum availability:

$$
\begin{aligned}
DTV &\le 15 \\
ETV &\le 10 \\
DN &\le 25 \\
SN &\le 4 \\
R &\le 30
\end{aligned}
$$

A total of $30,000 is available for the media campaign:

$$1500DTV + 3000ETV + 400DN + 1000SN + 100R \le 30{,}000$$

At least 10 television commercials must be used:

$$DTV + ETV \ge 10$$

TABLE 12.5  Advertising Media Alternatives for the Relax-and-Enjoy Lake Development Corporation

                                                                 No. of Potential    Cost ($) per     Maximum Times          Exposure
Advertising Media                                                Customers Reached   Advertisement    Available per Month*   Quality Units
1. Daytime TV (1 min), station WKLA                              1,000               1,500            15                     65
2. Evening TV (30 sec), station WKLA                             2,000               3,000            10                     90
3. Daily newspaper (full page), The Morning Journal              1,500                 400            25                     40
4. Sunday newspaper magazine (1/2 page color), The Sunday Press  2,500               1,000             4                     60
5. Radio, 8:00 a.m. or 5:00 p.m. news (30 sec), station KNOP       300                 100            30                     20

*The maximum number of times the medium is available is either the maximum number of times the advertising medium occurs (e.g., four Sundays per month) or the maximum number of times BP&J recommends that the medium is used.


At least 50,000 potential customers must be reached:

$$1000DTV + 2000ETV + 1500DN + 2500SN + 300R \ge 50{,}000$$

No more than $18,000 may be spent on television advertisements:

$$1500DTV + 3000ETV \le 18{,}000$$

By adding the nonnegativity restrictions, we obtain the complete linear programming model for the Relax-and-Enjoy advertising campaign planning problem:

$$
\begin{array}{lll}
\text{Max} & 65DTV + 90ETV + 40DN + 60SN + 20R & \text{Exposure quality} \\
\text{s.t.} & & \\
 & DTV \le 15 & \text{Availability of media} \\
 & ETV \le 10 & \\
 & DN \le 25 & \\
 & SN \le 4 & \\
 & R \le 30 & \\
 & 1500DTV + 3000ETV + 400DN + 1000SN + 100R \le 30{,}000 & \text{Budget} \\
 & DTV + ETV \ge 10 & \text{Television restrictions} \\
 & 1500DTV + 3000ETV \le 18{,}000 & \\
 & 1000DTV + 2000ETV + 1500DN + 2500SN + 300R \ge 50{,}000 & \text{Customers reached} \\
 & DTV, ETV, DN, SN, R \ge 0 &
\end{array}
$$

A spreadsheet model and the optimal solution to this linear programming model are shown in Figure 12.15.

The optimal solution calls for advertisements to be distributed among daytime TV, daily newspaper, Sunday newspaper, and radio. The maximum number of exposure quality units is 2,370, and the total number of customers reached is 61,500.
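The reported figures are easy to recompute from Table 12.5. The sketch below takes the plan shown in Figure 12.15 (10 daytime TV, 0 evening TV, 25 daily newspaper, 2 Sunday newspaper, and 30 radio ads) as given and evaluates the objective and each constraint; it checks the plan and does not search for it. The helper name `total` is an assumption made for illustration.

```python
media = ["DTV", "ETV", "DN", "SN", "R"]

# Columns of Table 12.5.
reach    = {"DTV": 1000, "ETV": 2000, "DN": 1500, "SN": 2500, "R": 300}
cost     = {"DTV": 1500, "ETV": 3000, "DN": 400,  "SN": 1000, "R": 100}
max_uses = {"DTV": 15,   "ETV": 10,   "DN": 25,   "SN": 4,    "R": 30}
quality  = {"DTV": 65,   "ETV": 90,   "DN": 40,   "SN": 60,   "R": 20}

# Number of ads placed in each medium, as reported in Figure 12.15.
plan = {"DTV": 10, "ETV": 0, "DN": 25, "SN": 2, "R": 30}


def total(coefficients, names=media):
    """Sum of coefficient times ads placed over the chosen media."""
    return sum(coefficients[m] * plan[m] for m in names)


exposure = total(quality)                 # objective function value
customers = total(reach)                  # customers reached
spend = total(cost)                       # total budget used
tv_spend = total(cost, ["DTV", "ETV"])    # budget used on television
tv_ads = plan["DTV"] + plan["ETV"]        # number of TV commercials

feasible = (
    all(0 <= plan[m] <= max_uses[m] for m in media)
    and spend <= 30_000
    and tv_ads >= 10
    and tv_spend <= 18_000
    and customers >= 50_000
)
print(exposure, customers, spend, tv_spend, tv_ads, feasible)
```

The recomputed exposure (2,370) and reach (61,500) agree with the values quoted above, and the plan uses the full $30,000 budget while spending $15,000 on television.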

Let us consider now the Sensitivity Report for the Relax-and-Enjoy advertising campaign planning problem shown in Figure 12.16. We begin by interpreting the Constraints section.

Note that the overall budget constraint has a shadow price of 0.060. Therefore, a $1.00 increase in the advertising budget will lead to an increase of 0.06 exposure quality unit. The shadow price of −25 for the number of TV ads indicates that increasing the number of television commercials required by 1 will decrease the exposure quality of the advertising plan by 25 units. Alternatively, decreasing the required number of television commercials by 1 will increase the exposure quality of the advertising plan by 25 units. Thus, Relax-and-Enjoy should consider reducing the requirement of having at least 10 television commercials.
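One informal way to see where these two shadow prices come from is to ask which medium absorbs a small change. This reasoning is an assumption-based check, not part of the Sensitivity Report: in the plan of Figure 12.15, daily newspaper and radio are at their upper limits and evening TV is at zero, so the Sunday newspaper (2 ads, between 0 and 4) is taken to be the medium that adjusts, and fractional ads are allowed as in any linear program. An extra budget dollar then buys 1/1000 of a Sunday newspaper ad, and one more required daytime TV ad costs $1,500 that must be freed by giving up 1.5 Sunday newspaper ads:

```latex
\begin{aligned}
\text{Budget shadow price} &= \frac{60 \text{ exposure units}}{\$1{,}000} = 0.06 \text{ per dollar} \\
\text{TV ads shadow price} &= 65 - \frac{1{,}500}{1{,}000} \times 60 = 65 - 90 = -25
\end{aligned}
```

Both values agree with the shadow prices quoted above, which hold only within the allowable increase and decrease listed in the report.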

Note that the availability-of-media constraints are not listed in the constraint section. These types of constraints, simple upper (or lower) bounds on a decision variable, are not listed in the report, just as nonnegativity constraints are not listed. There is information about these constraints in the variables section under reduced cost. Let us therefore turn our attention to the Variable Cells section of the report.

Let us interpret each of the three nonzero reduced costs in Figure 12.16. The variable ETV, the number of evening TV ads, is currently at its lower bound of zero. Therefore, the reduced cost of −65 is the shadow price of the nonnegativity constraint, which we interpret as follows: if we change the requirement $ETV \ge 0$ to $ETV \ge 1$, exposure will drop by 65. The other two variables with a nonzero reduced cost, DN and R (the number of daily newspaper ads and radio ads, respectively), are at their upper bounds of 25 and 30. In these cases, the reduced cost is the shadow price of the upper-bound constraint on the variable. For example, allowing 31 rather than only 30 radio ads will increase exposure by 14.
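The reduced costs can be related to the shadow prices with a standard linear programming identity: a variable's reduced cost equals its objective coefficient minus the shadow-price-weighted amount of each constrained resource that one unit of the variable uses. The sketch below applies that identity using only the two nonzero shadow prices quoted in the text (0.06 for the budget and −25 for the TV-ad requirement); the reach and TV-budget constraints are not binding, so their shadow prices are taken to be zero. The function name `reduced_cost` is hypothetical, and the value computed for DN is a result of this sketch to compare against Figure 12.16 rather than a figure quoted in the text.

```python
from fractions import Fraction as F

# Objective coefficients and cost per ad from Table 12.5.
quality = {"DTV": 65, "ETV": 90, "DN": 40, "SN": 60, "R": 20}
cost = {"DTV": 1500, "ETV": 3000, "DN": 400, "SN": 1000, "R": 100}
# Coefficient of each medium in the constraint DTV + ETV >= 10.
tv_ad = {"DTV": 1, "ETV": 1, "DN": 0, "SN": 0, "R": 0}

# Shadow prices quoted in the text for the two binding constraints.
budget_price = F(6, 100)   # 0.06 exposure units per budget dollar
tv_ads_price = F(-25)      # exposure units per required TV commercial


def reduced_cost(medium):
    """Objective coefficient minus the priced-out resource usage of one ad."""
    return (quality[medium]
            - budget_price * cost[medium]
            - tv_ads_price * tv_ad[medium])


for medium in quality:
    print(medium, float(reduced_cost(medium)))
```

Under these assumptions the identity reproduces the two values interpreted above, −65 for ETV and 14 for R, and gives zero for DTV and SN, the two media that are not at a bound.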


[Figure 12.15: A Spreadsheet Model and the Solution for the Relax-and-Enjoy Lake Development Corporation Problem. Spreadsheet model shown in formula view and value view. Key formulas: Max Exposure =SUMPRODUCT(B8:F8,B13:F13); Reach Achieved =SUMPRODUCT(B5:F5,B13:F13); Num TV Ads Achieved =B13+C13; TV Budget Used =SUMPRODUCT(B6:C6,B13:C13); Budget Used =SUMPRODUCT(B6:F6,B13:F13). Ads placed in the solution: DTV 10, ETV 0, DN 25, SN 2, R 30. Max Exposure 2,370; Reach Achieved 61,500 (minimum required 50,000); TV Budget Used $15,000 (limit $18,000); Budget Used $30,000 (limit $30,000).]

The allowable increase and decrease for the objective function coefficients are interpreted as discussed in Section 12.5. For example, as long as the number of exposures per ad for daytime TV does not increase by more than 25 or decrease by more than 65, the current plan shown in Figure 12.15 remains optimal.


[Figure 12.16: The Excel Sensitivity Report for the Relax-and-Enjoy Lake Development Corporation Problem. Variable Cells section (Final Value, Reduced Cost, Objective Coefficient, Allowable Increase, Allowable Decrease) for Ads Placed DTV, ETV, DN, SN, and R. Constraints section (Final Value, Shadow Price, Constraint R.H. Side, Allowable Increase, Allowable Decrease) for Reach Achieved, Num TV Ads Achieved, TV Budget Used, and Budget Used. Values referenced in the text: reduced cost of −65 for ETV; shadow price of 0.06 for Budget Used and −25 for Num TV Ads Achieved; final values of 61,500 for Reach Achieved, $15,000 for TV Budget Used, and $30,000 for Budget Used.]

NOTES + COMMENTS

1. The media selection model required subjective evaluations of the exposure quality for the media alternatives. Marketing managers may have substantial data concerning exposure quality, but the final coefficients used in the objective function may also include considerations based primarily on managerial judgment.

2. The media selection model presented in this section uses exposure quality as the objective function and places a constraint on the number of customers reached. An alternative formulation of this problem would be to use the number of customers reached as the objective function and to add a constraint indicating the minimum total exposure quality required for the media plan.

12.7 Generating an Alternative Optimal Solution for a Linear Program

The goal of business analytics is to provide information to management for improved decision making. If a linear program has more than one optimal solution, as discussed in Section 12.4, it would be good for management to know this. There might be factors external to the model that make one optimal solution preferable to another. For example, in a portfolio optimization problem, perhaps more than one strategy yields the maximum expected return. However, those strategies might be quite different in terms of their risk to the investor. Knowing the optimal alternatives and then assessing the risk of each, the investor could then pick the least risky alternative from the optimal solutions. In this section, we discuss how to generate an alternative optimal solution if one exists.


Let us reconsider the Foster Generators transportation problem from the previous section. If one exists, how might we generate an alternative optimal solution for this problem? From Figure 12.14 we know that the following is an optimal solution:

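$$
\begin{aligned}
x_{11} &= 1{,}000, & x_{12} &= 4{,}000, & x_{13} &= 0, & x_{14} &= 0 \\
x_{21} &= 2{,}500, & x_{22} &= 0, & x_{23} &= 2{,}000, & x_{24} &= 1{,}500 \\
x_{31} &= 2{,}500, & x_{32} &= 0, & x_{33} &= 0, & x_{34} &= 0
\end{aligned}
$$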

The optimal cost is $39,500. With this information, we may revise our previous model to try to find an alternative optimal solution. We know that any alternative solution must be feasible, so it must satisfy all of the constraints of the original model. Also, to be optimal, the solution must give a total cost of $39,500. We can enforce this by taking the objective function and making it a constraint equal to $39,500:

$$3x_{11} + 2x_{12} + 7x_{13} + 6x_{14} + 6x_{21} + 5x_{22} + 2x_{23} + 3x_{24} + 2x_{31} + 5x_{32} + 4x_{33} + 5x_{34} = 39{,}500$$

But what should our objective function be for the revised problem? In the solution we previously found:

$$x_{13} = x_{14} = x_{22} = x_{32} = x_{33} = x_{34} = 0$$

If we maximize the sum of these variables and if the optimal objective function value of this revised problem is positive, we have found a different feasible solution that is also optimal. The revised model is as follows:

$$
\begin{array}{ll}
\text{Max} & x_{13} + x_{14} + x_{22} + x_{32} + x_{33} + x_{34} \\
\text{s.t.} & \\
 & x_{11} + x_{12} + x_{13} + x_{14} \le 5{,}000 \\
 & x_{21} + x_{22} + x_{23} + x_{24} \le 6{,}000 \\
 & x_{31} + x_{32} + x_{33} + x_{34} \le 2{,}500 \\
 & x_{11} + x_{21} + x_{31} = 6{,}000 \\
 & x_{12} + x_{22} + x_{32} = 4{,}000 \\
 & x_{13} + x_{23} + x_{33} = 2{,}000 \\
 & x_{14} + x_{24} + x_{34} = 1{,}500 \\
 & 3x_{11} + 2x_{12} + 7x_{13} + 6x_{14} + 6x_{21} + 5x_{22} + 2x_{23} + 3x_{24} + 2x_{31} + 5x_{32} + 4x_{33} + 5x_{34} = 39{,}500 \\
 & x_{ij} \ge 0 \quad \text{for } i = 1, 2, 3 \text{ and } j = 1, 2, 3, 4
\end{array}
$$

The solution to this problem has an objective function value of 2,500, indicating that the variables that were zero in the previous solution now add up to 2,500. The new solution is shown in Table 12.6.

Comparing Figure 12.14 and Table 12.6, we see that in this new solution, Bedford ships 2,500 units to Chicago instead of to Boston.

What types of issues might make management prefer one of these solutions over the other? Notice that the original solution has the Boston distribution center sourced from all three plants, whereas each of the other distribution centers is sourced by one plant. This would imply that the manager in the Boston distribution center has to deal with three different plant managers, whereas each of the other distribution center managers has only one plant manager. The Boston manager might feel disadvantaged, having to spend too much time coordinating among the plants. The alternative solution provides a more balanced solution. Managers in Boston and Chicago each deal with two plants, and those in St. Louis and Lexington, which have lower total volumes, deal with only one plant. Because the alternative solution seems to be more equitable, it might be preferred. Recall that both solutions give a total cost of $39,500.


In summary, the general approach for trying to find an alternative optimal solution to a linear program is as follows:

Step 1. Solve the linear program.

Step 2. Make a new objective function to be maximized. It is the sum of those variables that were equal to zero in the solution from Step 1.

Step 3. Keep all the constraints from the original problem. Add a constraint that forces the original objective function to be equal to the optimal objective function value from Step 1.

Step 4. Solve the problem created in Steps 2 and 3. If the objective function value is positive, you have found an alternative optimal solution.

TABLE 12.6  An Alternative Optimal Solution to the Foster Generators Transportation Problem

Total Cost = $39,500

Amount Shipped
                      To:
From:        Boston   Chicago   St. Louis   Lexington   Total
Cleveland    3,500    1,500     0           0           5,000
Bedford      0        2,500     2,000       1,500       6,000
York         2,500    0         0           0           2,500
Total        6,000    4,000     2,000       1,500
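The two Foster Generators plans can be compared with a short script. The sketch below does not solve the revised linear program of Steps 2 and 3 (that still requires a solver such as Excel Solver); it takes the original plan and the plan in Table 12.6 as given, confirms that both are feasible and cost $39,500, and evaluates the Step 2 objective (the sum of the variables that were zero in the original plan) at the alternative plan. Arcs are keyed by (origin, destination) index pairs, and the names `total_cost` and `is_feasible` are assumptions made for illustration.

```python
# Cost per unit on each arc (i, j), from Table 12.4.
cost = {
    (1, 1): 3, (1, 2): 2, (1, 3): 7, (1, 4): 6,
    (2, 1): 6, (2, 2): 5, (2, 3): 2, (2, 4): 3,
    (3, 1): 2, (3, 2): 5, (3, 3): 4, (3, 4): 5,
}
supply = {1: 5000, 2: 6000, 3: 2500}             # Cleveland, Bedford, York
demand = {1: 6000, 2: 4000, 3: 2000, 4: 1500}    # Boston, Chicago, St. Louis, Lexington

# Plans list only the arcs with positive flow; all other arcs ship zero units.
original = {(1, 1): 1000, (1, 2): 4000, (2, 1): 2500,
            (2, 3): 2000, (2, 4): 1500, (3, 1): 2500}
alternative = {(1, 1): 3500, (1, 2): 1500, (2, 2): 2500,
               (2, 3): 2000, (2, 4): 1500, (3, 1): 2500}


def total_cost(plan):
    return sum(cost[arc] * units for arc, units in plan.items())


def is_feasible(plan):
    supply_ok = all(sum(plan.get((i, j), 0) for j in demand) <= supply[i]
                    for i in supply)
    demand_ok = all(sum(plan.get((i, j), 0) for i in supply) == demand[j]
                    for j in demand)
    return supply_ok and demand_ok and all(u >= 0 for u in plan.values())


# Step 2: the new objective sums the variables that were zero in the original plan.
zero_arcs = sorted(arc for arc in cost if original.get(arc, 0) == 0)
new_objective = sum(alternative.get(arc, 0) for arc in zero_arcs)

print(total_cost(original), total_cost(alternative))
print(is_feasible(original), is_feasible(alternative))
print(zero_arcs, new_objective)
```

A positive value of the Step 2 objective at a plan that is feasible and has the original optimal cost is exactly the condition in Step 4 for an alternative optimal solution.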

NOTES + COMMENTS

Steps 1–4 for finding an alternative optimal solution may be repeated to try to find more than one alternative optimal solution. However, the process is not guaranteed to find an alternative optimal solution when one exists. For example, alternative optimal solutions that are not an extreme point (see Figure 12.8) will not be found by this approach.

SUMMARY

We formulated linear programming models for the Par, Inc. maximization problem and the M&D Chemicals minimization problem. For the Par, Inc. problem, we showed how a graphical solution procedure could be used to solve a two-variable problem to help us better understand how the computer can solve large linear programs. We discussed how Excel Solver can be used to solve linear optimization problems. In formulating a linear programming model of the Par, Inc. and M&D problems, we developed a general definition of a linear program.

A linear program is a mathematical model with the following qualities:

1. A linear objective function that is to be maximized or minimized
2. A set of linear constraints
3. Variables restricted to nonnegative values

Slack variables may be used to write less-than-or-equal-to constraints in equality form, and surplus variables may be used to write greater-than-or-equal-to constraints in equality form. The value of a slack variable can usually be interpreted as the amount of unused resource, whereas the value of a surplus variable indicates the amount over and above some stated minimum requirement. Binding constraints have zero slack or surplus.
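As a small illustration of slack, surplus, and binding constraints, the sketch below evaluates the inequality constraints of the Welte Mutual Funds model at the solution reported in Figure 12.12. Each constraint is written with all variables on the left-hand side, which is a choice made here for convenience; the function name `slack_or_surplus` and the tolerance value are likewise assumptions. The budget equality is omitted because an equality constraint has neither slack nor surplus.

```python
# Welte Mutual Funds solution reported in Figure 12.12 (dollars invested).
x1, x2, x3, x4, x5 = 20_000, 30_000, 0, 40_000, 10_000

# (name, left-hand side, sense, right-hand side), variables moved to the left.
constraints = [
    ("Oil industry maximum",     x1 + x2,               "<=", 50_000),
    ("Steel industry maximum",   x3 + x4,               "<=", 50_000),
    ("Government bonds minimum", x5 - 0.25 * (x3 + x4), ">=", 0),
    ("Pacific Oil restriction",  x2 - 0.60 * (x1 + x2), "<=", 0),
]


def slack_or_surplus(lhs, sense, rhs):
    """Slack for a <= constraint, surplus for a >= constraint."""
    return rhs - lhs if sense == "<=" else lhs - rhs


TOLERANCE = 1e-6  # guards against floating-point rounding in 0.60 * 50,000

report = {name: slack_or_surplus(lhs, sense, rhs)
          for name, lhs, sense, rhs in constraints}
binding = [name for name, value in report.items() if abs(value) <= TOLERANCE]

for name, value in report.items():
    print(f"{name}: {value:,.2f}")
print("Binding:", binding)
```

Only the steel industry maximum has nonzero slack ($10,000 of unused allowance), which matches the earlier observation that every constraint except the upper bound on the steel investment is binding.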


If the solution to a linear program is infeasible or unbounded, no optimal solution to the problem can be found. In the case of infeasibility, no feasible solutions are possible. In the case of an unbounded solution, the objective function can be made infinitely large for a maximization problem and infinitely small for a minimization problem. In the case of alternative optimal solutions, two or more optimal extreme points exist.

We also discussed sensitivity analysis and the interpretation of the Sensitivity Report generated by Excel Solver and how the impact of changes in the objective function coefficients and right-hand side values of constraints can be assessed. We showed how to write a mathematical model using general linear programming notation and presented three additional examples of linear programming applications: portfolio selection, transportation planning, and media selection. Finally, we concluded the chapter with a procedure for finding an alternative optimal solution when one exists.

GLOSSARY

Alternative optimal solutions The case in which more than one solution provides the opti- mal value for the objective function. Binding constraint A constraint that holds as an equality at the optimal solution. Constraints Restrictions that limit the settings of the decision variables. Decision variable A controllable input for a linear programming model. Extreme point Graphically speaking, the feasible solution points occurring at the vertices, or “corners,” of the feasible region. With two-variable problems, extreme points are deter- mined by the intersection of the constraint lines. Feasible region The set of all feasible solutions. Feasible solution A solution that satisfies all the constraints simultaneously. Infeasibility The situation in which no solution to the linear programming problem satis- fies all the constraints. Linear function A mathematical function in which each variable appears in a separate term and is raised to the first power. Linear programming model (linear program) A mathematical model with a linear objective function, a set of linear constraints, and nonnegative variables. Mathematical model A representation of a problem in which the objective and all con- straint conditions are described by mathematical expressions. Nonnegativity constraints A set of constraints that requires all variables to be nonnegative. Objective function The expression that defines the quantity to be maximized or minimized in a linear programming model. Objective function coefficient allowable increase (decrease) The allowable increase/ decrease of an objective function coefficient is the amount the coefficient may increase (decrease) without causing any change in the values of the decision variables in the optimal solution. The allowable increase/decrease for the objective function coefficients can be used to calculate the range of optimality. 
Problem formulation (modeling) The process of translating a verbal statement of a prob- lem into a mathematical statement called the mathematical model. Reduced cost If a variable is at its lower bound of zero, the reduced cost is equal to the shadow price of the nonnegativity constraint for that variable. In general, if a variable is at its lower or upper bound, the reduced cost is the shadow price for that simple lower- or upper-bound constraint. Right-hand side allowable increase (decrease) The amount the right-hand side may increase (decrease) without causing any change in the shadow price for that constraint. The allowable increase and decrease for the right-hand side can be used to calculate the range of feasibility for that constraint. Sensitivity analysis The study of how changes in the input parameters of a linear program- ming problem affect the optimal solution.


Shadow price The change in the optimal objective function value per unit increase in the right-hand side of a constraint.

Slack The difference between the right-hand side and the left-hand side of a less-than-or-equal-to constraint.

Slack variable A variable added to the left-hand side of a less-than-or-equal-to constraint to convert the constraint into an equality. The value of this variable can usually be interpreted as the amount of unused resources.

Surplus variable A variable subtracted from the left-hand side of a greater-than-or-equal-to constraint to convert the constraint into an equality. The value of this variable can usually be interpreted as the amount over and above some required minimum level.

Unbounded The situation in which the value of the solution may be made infinitely large in a maximization linear programming problem or infinitely small in a minimization problem without violating any of the constraints.

PROBLEMS

1. Kelson Sporting Equipment, Inc. makes two types of baseball gloves: a regular model and a catcher’s model. The firm has 900 hours of production time available in its cutting and sewing department, 300 hours available in its finishing department, and 100 hours available in its packaging and shipping department. The production time requirements and the profit contribution per glove are given in the following table:

                    Production Time (hours)
Model             Cutting and Sewing   Finishing   Packaging and Shipping   Profit/Glove
Regular model     1                    1/2         1/8                      $5
Catcher’s model   3/2                  1/3         1/4                      $8

Assuming that the company is interested in maximizing the total profit contribution, answer the following:

a. What is the linear programming model for this problem?

b. Develop a spreadsheet model and find the optimal solution using Excel Solver. How many of each model should Kelson manufacture?

c. What is the total profit contribution Kelson can earn with the optimal production quantities?

d. How many hours of production time will be scheduled in each department?

e. What is the slack time in each department?
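A Solver result for a two-variable problem such as Problem 1 can be cross-checked by listing the extreme points of the feasible region and evaluating the objective at each one. The following Python sketch is one way to do that. It is not the Excel Solver procedure the problem asks for, and it assumes the production times per glove read 1, 1/2, 1/8 (regular) and 3/2, 1/3, 1/4 (catcher’s). The names `constraints`, `intersection`, and `feasible` are illustrative.

```python
from fractions import Fraction as F
from itertools import combinations

# Each constraint is (a, b, rhs), meaning a*R + b*C <= rhs,
# where R = regular gloves and C = catcher's gloves.
# Assumed hours per glove: regular 1, 1/2, 1/8; catcher's 3/2, 1/3, 1/4.
constraints = [
    (F(1), F(3, 2), F(900)),     # cutting and sewing
    (F(1, 2), F(1, 3), F(300)),  # finishing
    (F(1, 8), F(1, 4), F(100)),  # packaging and shipping
    (F(-1), F(0), F(0)),         # R >= 0
    (F(0), F(-1), F(0)),         # C >= 0
]
profit = (F(5), F(8))            # profit per regular, per catcher's glove


def intersection(c1, c2):
    """Point where two constraint lines cross, or None if they are parallel."""
    a1, b1, r1 = c1
    a2, b2, r2 = c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)


def feasible(point):
    """True if the point satisfies every constraint."""
    return all(a * point[0] + b * point[1] <= rhs for a, b, rhs in constraints)


# Extreme points are the feasible intersections of pairs of constraint lines.
vertices = []
for c1, c2 in combinations(constraints, 2):
    p = intersection(c1, c2)
    if p is not None and feasible(p):
        vertices.append(p)

best = max(vertices, key=lambda p: profit[0] * p[0] + profit[1] * p[1])
best_profit = profit[0] * best[0] + profit[1] * best[1]
# Slack = right-hand side minus hours used, for the three departments.
slack = [rhs - (a * best[0] + b * best[1]) for a, b, rhs in constraints[:3]]

print("quantities:", best, "profit:", best_profit, "slack:", slack)
```

Exact fractions are used so that a binding constraint shows a slack of exactly zero rather than a small rounding error. This enumeration is practical only for small problems with a bounded feasible region.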

2. The Sea Wharf Restaurant would like to determine the best way to allocate a monthly advertising budget of $1,000 between newspaper advertising and radio advertising. Management decided that at least 25% of the budget must be spent on each type of media and that the amount of money spent on local newspaper advertising must be at least twice the amount spent on radio advertising. A marketing consultant developed an index that measures audience exposure per dollar of advertising on a scale from 0 to 100, with higher values implying greater audience exposure. If the value of the index for local newspaper advertising is 50 and the value of the index for spot radio advertising is 80, how should the restaurant allocate its advertising budget to maximize the value of total audience exposure?

a. Formulate a linear programming model that can be used to determine how the restaurant should allocate its advertising budget in order to maximize the value of total audience exposure.

b. Develop a spreadsheet model and solve the problem using Excel Solver.

3. Blair & Rosen, Inc. (B&R) is a brokerage firm that specializes in investment portfolios designed to meet the specific risk tolerances of its clients. A client who contacted B&R this past week has a maximum of $50,000 to invest. B&R’s investment advisor decides to recommend a portfolio consisting of two investment funds: an Internet fund and a Blue Chip fund. The Internet fund has a projected annual return of 12%, and the Blue Chip fund has a projected annual return of 9%. The investment advisor requires that at most $35,000 of the client’s funds should be invested in the Internet fund. B&R services include a risk rating for each investment alternative. The Internet fund, which is the more risky of the two investment alternatives, has a risk rating of 6 per $1,000 invested. The Blue Chip fund has a risk rating of 4 per $1,000 invested. For example, if $10,000 is invested in each of the two investment funds, B&R’s risk rating for the portfolio would be 6(10) + 4(10) = 100. Finally, B&R developed a questionnaire to measure each client’s risk tolerance. Based on the responses, each client is classified as a conservative, moderate, or aggressive investor. Suppose that the questionnaire results classified the current client as a moderate investor. B&R recommends that a client who is a moderate investor limit his or her portfolio to a maximum risk rating of 240.

a. Formulate a linear programming model to find the best investment strategy for this client.

b. Build a spreadsheet model and solve the problem using Excel Solver. What is the recommended investment portfolio for this client? What is the annual return for the portfolio?

c. Suppose that a second client with $50,000 to invest has been classified as an aggressive investor. B&R recommends that the maximum portfolio risk rating for an aggressive investor is 320. What is the recommended investment portfolio for this aggressive investor?

d. Suppose that a third client with $50,000 to invest has been classified as a conservative investor. B&R recommends that the maximum portfolio risk rating for a conservative investor is 160. Develop the recommended investment portfolio for the conservative investor.
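The risk-rating arithmetic in Problem 3 is the left-hand side of the risk constraint, so it can help to code it before building the model. The sketch below reproduces the worked example from the problem statement (a rating of 100 when $10,000 is placed in each fund) and tests whether a candidate portfolio satisfies the stated limits. The function names are illustrative, and the sketch checks candidates only; it does not search for the best portfolio.

```python
def risk_rating(internet, blue_chip):
    """Portfolio risk rating: 6 per $1,000 in the Internet fund, 4 per $1,000 in Blue Chip."""
    return 6 * internet / 1000 + 4 * blue_chip / 1000


def annual_return(internet, blue_chip):
    """Projected annual return in dollars: 12% on Internet, 9% on Blue Chip."""
    return (12 * internet + 9 * blue_chip) / 100


def is_feasible(internet, blue_chip, max_risk, budget=50_000, internet_cap=35_000):
    """True if the dollar amounts respect the budget, the Internet cap, and the risk limit."""
    return (internet >= 0 and blue_chip >= 0
            and internet + blue_chip <= budget
            and internet <= internet_cap
            and risk_rating(internet, blue_chip) <= max_risk)


# Worked example from the problem statement: $10,000 in each fund.
print(risk_rating(10_000, 10_000))
print(is_feasible(10_000, 10_000, max_risk=240))
```

Changing `max_risk` to 320 or 160 gives the corresponding check for the aggressive and conservative investors in parts (c) and (d).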

4. Adirondack Savings Bank (ASB) has $1 million in new funds that must be allocated to home loans, personal loans, and automobile loans. The annual rates of return for the three types of loans are 7% for home loans, 12% for personal loans, and 9% for automobile loans. The bank’s planning committee has decided that at least 40% of the new funds must be allocated to home loans. In addition, the planning committee has specified that the amount allocated to personal loans cannot exceed 60% of the amount allocated to automobile loans.

a. Formulate a linear programming model that can be used to determine the amount of funds ASB should allocate to each type of loan to maximize the total annual return for the new funds.

b. How much should be allocated to each type of loan? What is the total annual return? What is the annual percentage return?

c. If the interest rate on home loans increases to 9%, would the amount allocated to each type of loan change? Explain.

d. Suppose the total amount of new funds available is increased by $10,000. What effect would this have on the total annual return? Explain.

e. Assume that ASB has the original $1 million in new funds available and that the planning committee has agreed to relax the requirement that at least 40% of the new funds must be allocated to home loans by 1%. How much would the annual return change? How much would the annual percentage return change?

5. Round Tree Manor is a hotel that provides two types of rooms with three rental classes: Super Saver, Deluxe, and Business. The profit per night for each type of room and rental class is as follows:

                           Rental Class
Room                     Super Saver   Deluxe   Business
Type I (Mountain View)   $30           $35      —
Type II (Street View)    $20           $30      $40


Round Tree’s management makes a forecast of the demand by rental class for each night in the future. A linear programming model developed to maximize profit is used to determine how many reservations to accept for each rental class. The demand forecast for a particular night is 130 rentals in the Super Saver class, 60 in the Deluxe class, and 50 in the Business class. Since these are the forecasted demands, Round Tree will take no more than these amounts of each reservation for each rental class. Round Tree has a limited number of each type of room. There are 100 Type I rooms and 120 Type II rooms.

a. Formulate and solve a linear program to determine how many reservations to accept in each rental class and how the reservations should be allocated to room types.

b. For the solution in part (a), how many reservations can be accommodated in each rental class? Is the demand for any rental class not satisfied?

c. With a little work, an unused office area could be converted to a rental room. If the conversion cost is the same for both types of rooms, would you recommend converting the office to a Type I or a Type II room? Why?

d. Could the linear programming model be modified to plan for the allocation of rental demand for the next night? What information would be needed and how would the model change?

6. Industrial Designs has been awarded a contract to design a label for a new wine produced by Lake View Winery. The company estimates that 150 hours will be required to complete the project. The firm’s three graphic designers available for assignment to this project are Lisa, a senior designer and team leader; David, a senior designer; and Sarah, a junior designer. Because Lisa has worked on several projects for Lake View Winery, management specified that Lisa must be assigned at least 40% of the total number of hours assigned to the two senior designers. To provide label designing experience for Sarah, the junior designer must be assigned at least 15% of the total project time. However, the number of hours assigned to Sarah must not exceed 25% of the total number of hours assigned to the two senior designers. Due to other project commitments, Lisa has a maximum of 50 hours available to work on this project. Hourly wage rates are $30 for Lisa, $25 for David, and $18 for Sarah.

a. Formulate a linear program that can be used to determine the number of hours each graphic designer should be assigned to the project to minimize total cost.

b. How many hours should be assigned to each graphic designer? What is the total cost?

c. Suppose Lisa could be assigned more than 50 hours. What effect would this have on the optimal solution? Explain.

d. If Sarah were not required to work a minimum number of hours on this project, would the optimal solution change? Explain.

7. Vollmer Manufacturing makes three components for sale to refrigeration companies. The components are processed on two machines: a shaper and a grinder. The times (in minutes) required on each machine are as follows:

              Machine
Component   Shaper   Grinder
1           6        4
2           4        5
3           4        2

The shaper is available for 120 hours, and the grinder for 110 hours. No more than 200 units of component 3 can be sold, but up to 1,000 units of each of the other components can be sold. In fact, the company already has orders for 600 units of component 1 that must be satisfied. The profit contributions for components 1, 2, and 3 are $8, $6, and $9, respectively.

a. Formulate and solve for the recommended production quantities.

b. What are the objective coefficient ranges for the three components? Interpret these ranges for company management.

c. What are the right-hand-side ranges? Interpret these ranges for company management.

d. If more time could be made available on the grinder, how much would it be worth?

e. If more units of component 3 can be sold by reducing the sales price by $4, should the company reduce the price?

8. Photon Technologies, Inc., a manufacturer of batteries for mobile phones, signed a contract with a large electronics manufacturer to produce three models of lithium-ion battery packs for a new line of phones. The contract calls for the following:

Battery Pack   Production Quantity
PT-100         200,000
PT-200         100,000
PT-300         150,000

Photon Technologies can manufacture the battery packs at manufacturing plants located in the Philippines and Mexico. The unit cost of the battery packs differs at the two plants because of differences in production equipment and wage rates. The unit costs for each battery pack at each manufacturing plant are as follows:

            Plant
Product   Philippines   Mexico
PT-100    $0.95         $0.98
PT-200    $0.98         $1.06
PT-300    $1.34         $1.15

The PT-100 and PT-200 battery packs are produced using similar production equipment available at both plants. However, each plant has a limited capacity for the total number of PT-100 and PT-200 battery packs produced. The combined PT-100 and PT-200 production capacities are 175,000 units at the Philippines plant and 160,000 units at the Mexico plant. The PT-300 production capacities are 75,000 units at the Philippines plant and 100,000 units at the Mexico plant. The cost of shipping from the Philippines plant is $0.18 per unit, and the cost of shipping from the Mexico plant is $0.10 per unit.

a. Develop a linear program that Photon Technologies can use to determine how many units of each battery pack to produce at each plant to minimize the total production and shipping cost associated with the new contract.

b. Solve the linear program developed in part (a), to determine the optimal production plan.

c. Use sensitivity analysis to determine how much the production and/or shipping cost per unit would have to change to produce additional units of the PT-100 in the Philippines plant.

d. Use sensitivity analysis to determine how much the production and/or shipping cost per unit would have to change to produce additional units of the PT-200 in the Mexico plant.

9. The Westchester Chamber of Commerce periodically sponsors public service seminars and programs. Currently, promotional plans are under way for this year’s program. Advertising alternatives include television, radio, and online. Audience estimates, costs, and maximum media usage limitations are as shown:

Constraint                   Television   Radio    Online
Audience per advertisement   100,000      18,000   40,000
Cost per advertisement       $2,000       $300     $600
Maximum media usage          10           20       10

To ensure a balanced use of advertising media, radio advertisements must not exceed 50% of the total number of advertisements authorized. In addition, television should account for at least 10% of the total number of advertisements authorized.

a. If the promotional budget is limited to $18,200, how many commercial messages should be run on each medium to maximize total audience contact? What is the allocation of the budget among the three media, and what is the total audience reached?

b. By how much would audience contact increase if an extra $100 were allocated to the promotional budget?

10. The management of Hartman Company is trying to determine the amount of each of two products to produce over the coming planning period. The following information concerns labor availability, labor utilization, and product profitability:

                           Labor-Hours Required (hours/unit)
Department                 Product 1   Product 2   Hours Available
A                          1.00        0.35        100
B                          0.30        0.20        36
C                          0.20        0.50        50
Profit contribution/unit   $30.00      $15.00

a. Develop a linear programming model of the Hartman Company problem. Solve the model to determine the optimal production quantities of products 1 and 2.

b. In computing the profit contribution per unit, management does not deduct labor costs because they are considered fixed for the upcoming planning period. However, suppose that overtime can be scheduled in some of the departments. Which departments would you recommend scheduling for overtime? How much would you be willing to pay per hour of overtime in each department?

c. Suppose that 10, 6, and 8 hours of overtime may be scheduled in departments A, B, and C, respectively. The cost per hour of overtime is $18 in department A, $22.50 in department B, and $12 in department C. Formulate a linear programming model that can be used to determine the optimal production quantities if overtime is made available. What are the optimal production quantities, and what is the revised total contribution to profit? How much overtime do you recommend using in each department? What is the increase in the total contribution to profit if overtime is used?

11. The employee credit union at State University is planning the allocation of funds for the coming year. The credit union makes four types of loans to its members. In addition, the credit union invests in risk-free securities to stabilize income. The various revenue-producing investments, together with annual rates of return, are as follows:

Type of Loan/Investment   Annual Rate of Return (%)
Automobile loans          8
Furniture loans           10
Other secured loans       11
Signature loans           12
Risk-free securities      9

The credit union will have $2 million available for investment during the coming year. State laws and credit union policies impose the following restrictions on the composition of the loans and investments:

• Risk-free securities may not exceed 30% of the total funds available for investment.

• Signature loans may not exceed 10% of the funds invested in all loans (automobile, furniture, other secured, and signature loans).

• Furniture loans plus other secured loans may not exceed the automobile loans.

• Other secured loans plus signature loans may not exceed the funds invested in risk-free securities.

How should the $2 million be allocated to each of the loan/investment alternatives to maximize total annual return? What is the projected total annual return?
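Before solving Problem 11, it can be useful to write each restriction as an explicit inequality and test a trial allocation against it. The Python sketch below does that for the four stated restrictions. It assumes the full $2 million is allocated, and the `candidate` allocation is an arbitrary illustration, not a claimed optimum. The names `violations` and `annual_return` are illustrative.

```python
# Annual rates of return, in percent, from the table in Problem 11.
RATES_PCT = {"auto": 8, "furniture": 10, "other_secured": 11,
             "signature": 12, "risk_free": 9}
TOTAL_FUNDS = 2_000_000


def violations(x):
    """Return the names of the restrictions that allocation x (in dollars) breaks."""
    loans = x["auto"] + x["furniture"] + x["other_secured"] + x["signature"]
    checks = {
        # Assumption: all available funds are invested.
        "all funds allocated": sum(x.values()) == TOTAL_FUNDS,
        "nonnegative amounts": all(v >= 0 for v in x.values()),
        "risk-free <= 30% of funds": 10 * x["risk_free"] <= 3 * TOTAL_FUNDS,
        "signature <= 10% of loans": 10 * x["signature"] <= loans,
        "furniture + other secured <= auto":
            x["furniture"] + x["other_secured"] <= x["auto"],
        "other secured + signature <= risk-free":
            x["other_secured"] + x["signature"] <= x["risk_free"],
    }
    return [name for name, ok in checks.items() if not ok]


def annual_return(x):
    """Projected annual return in dollars for allocation x."""
    return sum(RATES_PCT[k] * x[k] for k in x) / 100


# An arbitrary trial allocation (illustration only, not a claimed optimum).
candidate = {"auto": 800_000, "furniture": 200_000, "other_secured": 300_000,
             "signature": 100_000, "risk_free": 600_000}
print(violations(candidate), annual_return(candidate))
```

Each entry in `checks` corresponds to one constraint of the linear program, so the same inequalities can be carried into a spreadsheet model.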

12. The Atlantic Seafood Company (ASC) is a buyer and distributor of seafood products that are sold to restaurants and specialty seafood outlets throughout the Northeast. ASC has a frozen storage facility in New York City that serves as the primary distribution point for all products. One of the ASC products is frozen large black tiger shrimp, which are sized at 16 to 20 pieces per pound. Each Saturday, ASC can purchase more tiger shrimp or sell the tiger shrimp at the existing New York City warehouse market price. ASC’s goal is to buy tiger shrimp at a low weekly price and sell it later at a higher price. ASC currently has 20,000 pounds of tiger shrimp in storage. Space is available to store a maximum of 100,000 pounds of tiger shrimp each week. In addition, ASC developed the following estimates of tiger shrimp prices for the next four weeks:

Week Price/lb

1 $6.00

2 $6.20

3 $6.65

4 $5.55

ASC would like to determine the optimal buying/storing/selling strategy for the next four weeks. The cost to store a pound of shrimp for one week is $0.15, and to account for unforeseen changes in supply or demand, management also indicated that 25,000 pounds of tiger shrimp must be in storage at the end of week 4. Determine the optimal buying/storing/selling strategy for ASC. What is the projected four-week profit? (Hint: Define variables for buying, selling, and inventory held in each week. Then use a constraint to define the relationship between these: inventory from end of previous period + bought this period - sold this period = inventory at end of this period. This type of constraint is referred to as an inventory balance constraint.)
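The inventory balance constraint in the hint can be traced week by week for any trial plan. The Python sketch below applies the balance equation to the Problem 12 data and reports the resulting inventory levels and profit. It assumes that the $0.15 storage cost is charged on the inventory held at the end of each week and that shrimp are bought and sold at the same weekly price. The plan shown is deliberately naive and is not the optimal strategy. Amounts are kept in cents to avoid rounding error.

```python
PRICE_CENTS = [600, 620, 665, 555]   # estimated price per pound, weeks 1-4
STORAGE_CENTS = 15                   # storage cost per pound per week
START_LB = 20_000                    # pounds in storage now
CAPACITY_LB = 100_000                # maximum pounds in storage each week
FINAL_MIN_LB = 25_000                # pounds required at the end of week 4


def evaluate(buy, sell):
    """Apply the inventory balance constraint to a 4-week plan.

    Returns (end-of-week inventory levels, profit in cents).
    Raises ValueError if the plan violates a storage requirement.
    """
    inventory, levels, profit = START_LB, [], 0
    for week in range(4):
        # inventory at end of week = previous inventory + bought - sold
        inventory = inventory + buy[week] - sell[week]
        if not 0 <= inventory <= CAPACITY_LB:
            raise ValueError(f"week {week + 1}: inventory {inventory} out of range")
        profit += PRICE_CENTS[week] * (sell[week] - buy[week])
        profit -= STORAGE_CENTS * inventory   # assumed: charged on ending inventory
        levels.append(inventory)
    if levels[-1] < FINAL_MIN_LB:
        raise ValueError("ending inventory below the required 25,000 pounds")
    return levels, profit


# Illustrative plan only: sell the starting stock in week 3, restock in week 4.
levels, profit_cents = evaluate(buy=[0, 0, 0, 25_000], sell=[0, 0, 20_000, 0])
print(levels, profit_cents / 100)
```

In the linear program, the `buy`, `sell`, and `levels` entries become decision variables, and the balance equation becomes one equality constraint per week.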

13. The Silver Star Bicycle Company will manufacture both men’s and women’s models for its Easy-Pedal bicycles during the next two months. Management wants to develop a production schedule indicating how many bicycles of each model should be produced in each month. Current demand forecasts call for 150 men’s and 125 women’s models to be shipped during the first month and 200 men’s and 150 women’s models to be shipped during the second month. Additional data are as follows:

                               Labor Requirements (hours)
Model     Production Costs   Manufacturing   Assembly   Current Inventory
Men’s     $120               2.0             1.5        20
Women’s   $90                1.6             1.0        30


Last month, the company used a total of 1,000 hours of labor. The company’s labor relations policy will not allow the combined total hours of labor (manufacturing plus assembly) to increase or decrease by more than 100 hours from month to month. In addition, the company charges monthly inventory at the rate of 2% of the production cost based on the inventory levels at the end of the month. The company would like to have at least 25 units of each model in inventory at the end of the two months. (Hint: Define variables for production and inventory held in each period for each product. Then use a constraint to define the relationship between these: inventory from end of previous period + produced this period - demand this period = inventory at end of this period.)

a. Establish a production schedule that minimizes production and inventory costs and satisfies the labor-smoothing, demand, and inventory requirements. What inventories will be maintained and what are the monthly labor requirements?

b. If the company changed the constraints so that monthly labor increases and decreases could not exceed 50 hours, what would happen to the production schedule? How much will the cost increase? What would you recommend?

14. The Clark County Sheriff’s Department schedules police officers for 8-hour shifts. The beginning times for the shifts are 8:00 a.m., noon, 4:00 p.m., 8:00 p.m., midnight, and 4:00 a.m. An officer beginning a shift at one of these times works for the next 8 hours. During normal weekday operations, the number of officers needed varies depending on the time of day. The department staffing guidelines require the following minimum number of officers on duty:

Time of Day Minimum No. of Officers on Duty

8:00 a.m.–noon 5

Noon–4:00 p.m. 6

4:00 p.m.–8:00 p.m. 10

8:00 p.m.–midnight 7

Midnight–4:00 a.m. 4

4:00 a.m.–8:00 a.m. 6

Determine the number of police officers who should be scheduled to begin the 8-hour shifts at each of the six times to minimize the total number of officers required. (Hint: Let x1 = the number of officers beginning work at 8:00 a.m., x2 = the number of officers beginning work at noon, and so on.)
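Because each shift lasts 8 hours, an officer covers the 4-hour period in which the shift starts and the period that follows. The number on duty in any period is therefore the sum of two consecutive start variables, wrapping around from 4:00 a.m. back to 8:00 a.m. The Python sketch below computes that coverage for a trial schedule and compares it with the staffing minimums. The schedule shown is only an illustration of a feasible plan and is not claimed to be the minimum.

```python
# Minimum officers on duty for the six 4-hour periods, starting at 8:00 a.m.
MIN_OFFICERS = [5, 6, 10, 7, 4, 6]


def coverage(starts):
    """Officers on duty in each period, given the number starting in each period.

    starts[0] is x1 (8:00 a.m.), starts[1] is x2 (noon), and so on.
    Period i is covered by officers who started in period i or in period i - 1;
    starts[i - 1] wraps around to the 4:00 a.m. shift when i is 0.
    """
    return [starts[i] + starts[i - 1] for i in range(len(starts))]


def is_feasible(starts):
    """True if every period meets its staffing minimum."""
    return all(c >= m for c, m in zip(coverage(starts), MIN_OFFICERS))


# Illustrative schedule only; not claimed to be the minimum.
trial = [5, 1, 9, 0, 4, 2]
print(coverage(trial), is_feasible(trial), sum(trial))
```

Each entry of `coverage` is the left-hand side of one greater-than-or-equal-to constraint in the linear program, and `sum(starts)` is the objective to be minimized.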

15. Bay Oil produces two types of fuel (regular and super) by mixing three ingredients. The major distinguishing feature of the two products is the octane level required. Regular fuel must have a minimum octane level of 90, whereas super must have a level of at least 100. The cost per barrel, octane levels, and available amounts (in barrels) for the upcoming two-week period appear in the following table, along with the maximum demand for each end product and the revenue generated per barrel:

Ingredient Cost/Barrel Octane Available (barrels)

1 $16.50 100 110,000

2 $14.00 87 350,000

3 $17.50 110 300,000

Revenue/Barrel Max Demand (barrels)

Regular $18.50 350,000

Super $20.00 500,000

Develop and solve a linear programming model to maximize contribution to profit. What is the optimal contribution to profit?
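A minimum octane requirement is a ratio (the average octane of the blend), so it has to be rewritten as a linear inequality before it can go into a linear program. The Python sketch below shows both forms for the Problem 15 ingredients. It assumes that octane blends in proportion to volume, which is the usual modeling assumption for this kind of problem but is not stated in the problem text. The 50/50 blend is only an illustration.

```python
from fractions import Fraction

# Octane level of each ingredient, from the table in Problem 15.
OCTANE = {1: 100, 2: 87, 3: 110}


def blend_octane(barrels):
    """Volume-weighted average octane of a blend (assumes linear blending by volume)."""
    total = sum(barrels.values())
    return Fraction(sum(OCTANE[i] * q for i, q in barrels.items()), total)


def meets_octane(barrels, minimum):
    """Linear form of 'average octane >= minimum'.

    Multiplying the ratio through by total volume gives
    sum((octane_i - minimum) * barrels_i) >= 0, which is linear in the barrels.
    """
    return sum((OCTANE[i] - minimum) * q for i, q in barrels.items()) >= 0


# Illustrative blend: equal parts of ingredients 2 and 3.
blend = {2: 50, 3: 50}
print(float(blend_octane(blend)), meets_octane(blend, 90), meets_octane(blend, 100))
```

The inequality in `meets_octane` is the form to enter as a constraint, once for regular fuel with a minimum of 90 and once for super with a minimum of 100.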


16. Consider the following network representation of a transportation problem:

[Figure not reproduced: transportation network with columns labeled Supplies and Demands. Surviving node labels: Omaha, Kansas City, Des Moines, St. Louis, Jefferson City. Surviving values: 7, 8.]

The supplies, demands, and transportation costs per unit are shown on the network. What is the optimal (cost minimizing) distribution plan?

17. Refer to the transportation problem described in Problem 16. Use the procedure described in Section 12.7 to try to find an alternative optimal solution.

18. Aggie Power Generation supplies electrical power to residential customers for many U.S. cities. Its main power generation plants are located in Los Angeles, Tulsa, and Seattle. The following table shows Aggie Power Generation’s major residential markets, the annual demand in each market (in Megawatts or MWs), and the cost to supply electricity to each market from each power generation plant (prices are in $/MW).

                  Distribution Costs ($/MW)
City            Los Angeles   Tulsa     Seattle   Demand (MWs)
Seattle         $356.25       $593.75   $59.38    950.00
Portland        $356.25       $593.75   $178.13   831.25
San Francisco   $178.13       $475.00   $296.88   2,375.00
Boise           $356.25       $475.00   $296.88   593.75
Reno            $237.50       $475.00   $356.25   950.00
Bozeman         $415.63       $415.63   $296.88   593.75
Laramie         $356.25       $415.63   $356.25   1,187.50
Park City       $356.25       $356.25   $475.00   712.50
Flagstaff       $178.13       $475.00   $593.75   1,187.50
Durango         $356.25       $296.88   $593.75   1,543.75

(Data file: AggiePower)

a. If there are no restrictions on the amount of power that can be supplied by any of the power plants, what is the optimal solution to this problem? Which cities should be supplied by which power plants? What is the total annual power distribution cost for this solution?

b. If at most 4,000 MWs of power can be supplied by any one of the power plants, what is the optimal solution? What is the annual increase in power distribution cost that results from adding these constraints to the original formulation?

19. The Calhoun Textile Mill is in the process of deciding on a production schedule. It wishes to know how to weave the various fabrics it will produce during the coming quarter. The sales department has confirmed orders for each of the 15 fabrics produced by Calhoun. These demands are given in the following table. Also given in this table is the variable cost for each fabric. The mill operates continuously during the quarter: 13 weeks, 7 days a week, and 24 hours a day.

There are two types of looms: dobby and regular. Dobby looms can be used to make all fabrics and are the only looms that can weave certain fabrics, such as plaids. The rate of production for each fabric on each type of loom is also given in the table. Note that if the production rate is zero, the fabric cannot be woven on that type of loom. Also, if a fabric can be woven on each type of loom, then the production rates are equal. Calhoun has 90 regular looms and 15 dobby looms. For this problem, assume that the time requirement to change over a loom from one fabric to another is negligible. In addition to producing the fabric using dobby and regular looms, Calhoun has the option to buy some or all of each fabric on the market. The market cost per yard for each fabric is given in the table.

Management would like to know how to allocate the looms to the fabrics and which fabrics to buy on the market so as to minimize the cost of meeting demand.

Fabric   Demand (yd)   Dobby (yd/hr)   Regular (yd/hr)   Mill Cost ($/yd)   Market Cost ($/yd)

1 16,500 4.653 0.00 0.6573 0.80

2 52,000 4.653 0.00 0.5550 0.70

3 45,000 4.653 0.00 0.6550 0.85

4 22,000 4.653 0.00 0.5542 0.70

5 76,500 5.194 5.194 0.6097 0.75

6 110,000 3.809 3.809 0.6153 0.75

7 122,000 4.185 4.185 0.6477 0.80

8 62,000 5.232 5.232 0.4880 0.60

9 7,500 5.232 5.232 0.5029 0.70

10 69,000 5.232 5.232 0.4351 0.60

11 70,000 3.733 3.733 0.6417 0.80

12 82,000 4.185 4.185 0.5675 0.75

13 10,000 4.439 4.439 0.4952 0.65

14 380,000 5.232 5.232 0.3128 0.45

15 62,000 4.185 4.185 0.5029 0.70
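The loom capacities in Problem 19 are not given directly; they follow from the operating schedule (13 weeks, 7 days a week, 24 hours a day) and the number of looms. The Python sketch below works out the available loom-hours and, as an illustration, the loom-hours needed to weave the full demand for fabrics 1 through 4, which have a regular-loom rate of 0.00 and so can be woven only on dobby looms. The variable names are illustrative, and the sketch is capacity arithmetic only, not a solution to the allocation problem.

```python
# Hours each loom can run during the quarter: 13 weeks x 7 days x 24 hours.
HOURS_PER_LOOM = 13 * 7 * 24
DOBBY_LOOMS, REGULAR_LOOMS = 15, 90

dobby_hours = DOBBY_LOOMS * HOURS_PER_LOOM       # total dobby loom-hours available
regular_hours = REGULAR_LOOMS * HOURS_PER_LOOM   # total regular loom-hours available


def loom_hours_needed(yards, rate_yd_per_hr):
    """Loom-hours required to weave the given yardage at the given rate."""
    return yards / rate_yd_per_hr


# Fabrics 1-4 can be woven only on dobby looms (regular rate is 0.00 in the table).
dobby_only_demand = [16_500, 52_000, 45_000, 22_000]
dobby_only_hours = sum(loom_hours_needed(d, 4.653) for d in dobby_only_demand)

print(HOURS_PER_LOOM, dobby_hours, regular_hours, round(dobby_only_hours, 1))
```

In the linear program, the two capacity totals become the right-hand sides of the dobby and regular loom-hour constraints, and yards woven divided by the production rate gives the loom-hours used.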

20. Refer to the Calhoun Textile Mill production problem described in Problem 19. Use the procedure described in Section 12.7 to try to find an alternative optimal solution. If you are successful, discuss the differences in the solution you found versus that found in Problem 19.

(Data file: Calhoun)

21. Orion Fitness produces bracelets with an embedded chip that tracks its wearer’s activities. Orion has plants in Denver and Jacksonville. Bracelets produced at either plant may be shipped to either of the firm’s two regional warehouses, which are located in Davenport and Cincinnati. These regional warehouses subsequently supply retail outlets in Chicago, Orlando, Houston, and Little Rock. The file Orion contains data on each plant’s supply (number of bracelets), each retail outlet’s demand (number of bracelets), and the shipping costs ($ per bracelet) for the shipping channels.

               Shipping Costs ($ per bracelet)
               Warehouse
Plant          Davenport   Cincinnati   Supply (number of bracelets)
Denver         $2.00       $3.00        700
Jacksonville   $3.00       $1.00        400

                               Shipping Costs ($ per bracelet)
                               Retail Outlet
Warehouse                      Chicago   Orlando   Houston   Little Rock
Davenport                      $3.00     $7.00     $4.00     $7.00
Cincinnati                     $5.00     $5.00     $7.00     $6.00
Demand (number of bracelets)   200       150       350       300

a. Construct and solve a linear optimization model that defines the supply chain strategy that meets the demand of each retail outlet at minimum total shipping cost.

b. In a separate worksheet, reformulate the problem to determine if there is an alternative optimal solution. Clearly explain your result.

22. Brendamore Sports produces footballs and soccer balls and must plan on how many to produce each month for the next six months. The file Brendamore contains demand forecasts, as well as the production costs and inventory holding costs, for the next six months. Brendamore must meet the monthly demand of each product through either a combination of inventory or production during the month. Assume that demand occurs at the end of the month, so that any production during a month can meet that month’s demand.

Month   Football Demand Forecast   Soccer Ball Demand Forecast
1       15,000                     10,000
2       25,000                     15,000
3       20,000                     10,000
4       5,000                      5,000
5       2,500                      5,000
6       5,000                      7,500

Month   Production Cost ($ per football)   Holding Cost ($ per football)   Production Cost ($ per soccer ball)   Holding Cost ($ per soccer ball)
1       $13.80                             $0.69                           $10.85                                $0.54
2       $13.90                             $0.70                           $10.55                                $0.53
3       $12.95                             $0.65                           $10.50                                $0.53
4       $12.60                             $0.63                           $10.50                                $0.53
5       $12.55                             $0.63                           $10.55                                $0.53
6       $12.70                             $0.64                           $10.00                                $0.50

During each month, there is enough production capacity to produce up to 32,000 total balls (footballs + soccer balls), and there is enough storage capacity to store up to 20,000 total balls (footballs + soccer balls) at the end of the month. Brendamore wants to meet these demands on time and it currently has 7,000 footballs and 5,000 soccer balls in inventory. In anticipation of demand beyond the six-month planning horizon, Brendamore would like to have 3,000 footballs and 3,000 soccer balls in inventory at the end of the sixth month.

Brendamore wants to determine the six-month production schedule that minimizes the total production and holding cost.

23. An investor wishes to invest $10,000 for the coming year and anticipates that the market will be in one of four different states at the end of the year. These states affect her investments in each of three possible stocks and a bond as shown in the following table. Unfortunately, she is uncertain about which market state will occur. Because she is risk-averse, the investor would like to invest in a manner so that the return in the worst-case, no matter what market state occurs, is as good as possible. The following table provides the current price of each possible instrument as well as projected year-end prices of each instrument under each of the 4 possible states.

                    Possible Market States
          Price     State 1   State 2   State 3   State 4
Stock A   $4.94     $4.58     $3.95     $5.67     $5.39
Stock B   $5.88     $5.24     $7.28     $4.82     $6.22
Stock C   $6.48     $8.27     $5.65     $7.66     $5.78
Bond D    $2.68     $2.11     $2.53     $2.80     $2.09

These data are in the file MarketStates. Formulate her investment problem as a linear program and solve it using Excel Solver. How much should she invest in each security? Note: It is possible to purchase fractional shares.

24. A financial manager is managing a cash fund. His investment alternatives available are various certificates of deposit, also known as CDs, as listed in the following table:

Investment   Yield   Availability
1-month CD   0.5%    Beginning of each month
3-month CD   1.75%   Beginning of Months 1, 2, 3, 4
6-month CD   2.3%    Beginning of Month 1

However, he also must ensure that sufficient funds are available to pay company expenditures over the next six months. The following table lists the net expenditures (in thousands of dollars) that the manager is obligated to cover (cash amounts in parentheses indicate a net inflow of cash rather than outflow).

Month   Net Expenditures ($1,000s)
1       $45
2       ($11)
3       $25
4       ($22)
5       $43
6       ($15)

The cash on hand to invest at the start of month 1 is $200,000 and the minimum cash required to be available at the end of month 6 is $100,000. Develop and solve a linear program that will recommend how to invest to maximize the amount of interest income accrued over the next six months while satisfying all financial commitments.

(Data file: CashManagement)

(Hint: Investment time starts at the beginning of the month and returns at the end of the month. For example, money invested in a 1-month CD in month 1 will be invested at the beginning of month 1 and returned with interest at the end of month 1. Likewise, money invested in a 3-month CD at the start of month 1 will be returned with interest at the end of month 3.)
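The timing rule in the hint for Problem 24 determines which month each CD returns its cash, and that in turn determines where each investment variable appears in the monthly cash-balance constraints. The Python sketch below encodes that rule. It assumes that each listed yield is the total return over the CD’s full term rather than an annual rate; the problem does not say which, so treat that as a modeling assumption. The names `CDS` and `payout` are illustrative.

```python
from fractions import Fraction

# term (months), yield over the full term (assumed), months in which it can be bought
CDS = {
    "1-month": (1, Fraction(5, 1000), range(1, 7)),
    "3-month": (3, Fraction(175, 10000), range(1, 5)),
    "6-month": (6, Fraction(23, 1000), range(1, 2)),
}


def payout(cd, start_month, amount):
    """Return (month at whose end the CD pays out, amount returned with interest).

    Money goes in at the beginning of start_month and comes back at the end of
    month start_month + term - 1, as described in the hint.
    """
    term, rate, available = CDS[cd]
    if start_month not in available:
        raise ValueError(f"{cd} CD cannot be bought at the start of month {start_month}")
    return start_month + term - 1, amount * (1 + rate)


# $1,000 placed in a 3-month CD at the start of month 1.
month, value = payout("3-month", 1, 1000)
print(month, float(value))
```

A 3-month CD bought in month 4 matures at the end of month 6, which is consistent with month 4 being the last month in which that CD is available.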

25. A produce company supplies organically grown apples to four regional specialty stores. After the apples are collected at the company’s orchard, they are transported to any of three preparation centers where they are prepared for retail (by undergoing extensive cleaning and then packaging). After the apples are prepared, they are shipped to the specialty stores. Due to the fragility of the product, the cost of transporting this organic fruit is rather high.

The company has three preparation centers available for use. The Apple worksheet contains: (i) the unit preparation costs (in $/pound) to get the apples from the company’s orchard to the preparation centers and then prepare the apples for retail, (ii) the preparation centers’ monthly capacities, (iii) the demand at the specialty stores, and (iv) the unit costs of transporting the apples from the preparation centers to the specialty stores.

| Preparation Center | Transportation + Preparation Cost ($/pound) | Monthly Capacity (pounds) |
|---|---|---|
| 1 | $0.60 | 300 |
| 2 | $1.20 | 500 |
| 3 | $1.80 | 800 |

Shipping Cost ($ per pound)

| Preparation Center | Organic Orchard | Fresh & Local | Healthy Pantry | Season’s Harvest |
|---|---|---|---|---|
| 1 | $0.80 | $1.10 | $0.70 | $1.40 |
| 2 | $1.20 | $1.10 | $0.50 | $1.40 |
| 3 | $0.20 | $1.40 | $1.30 | $1.70 |
| Monthly Demand (pounds) | 300 | 500 | 400 | 200 |

a. Construct and solve a linear optimization model to determine the number of pounds of apples to prepare at each of the three preparation centers and how much of each specialty store’s demand to supply from each preparation center, to minimize the total cost of the operation.

b. In a separate worksheet, reformulate the problem to determine if there is an alternative optimal solution. Clearly explain your result.

Case Problem: Investment Strategy

J. D. Williams, Inc. is an investment advisory firm that manages more than $120 million in funds for its numerous clients. The company uses an asset allocation model that recommends the portion of each client’s portfolio to be invested in a growth stock fund, an income fund, and a money market fund. To maintain diversity in each client’s portfolio, the firm places limits on the percentage of each portfolio that may be invested in each of the three funds. General guidelines indicate that the amount invested in the growth fund must be between 20% and 40% of the total portfolio value. Similar percentages for the other two funds stipulate that between 20% and 50% of the total portfolio value must be in the income fund and that at least 30% of the total portfolio value must be in the money market fund.


In addition, the company attempts to assess the risk tolerance of each client and adjust the portfolio to meet the needs of the individual investor. For example, Williams just contracted with a new client who has $800,000 to invest. Based on an evaluation of the client’s risk tolerance, Williams assigned a maximum risk index of 0.05 for the client. The firm’s risk indicators show the risk of the growth fund at 0.10, the income fund at 0.07, and the money market fund at 0.01. An overall portfolio risk index is computed as a weighted average of the risk rating for the three funds, where the weights are the fraction of the client’s portfolio invested in each of the funds.

Additionally, Williams is currently forecasting annual yields of 18% for the growth fund, 12.5% for the income fund, and 7.5% for the money market fund. Based on the information provided, how should the new client be advised to allocate the $800,000 among the growth, income, and money market funds? Develop a linear programming model that will provide the maximum yield for the portfolio. Use your model to develop a managerial report.

Managerial Report

1. Recommend how much of the $800,000 should be invested in each of the three funds. What is the annual yield you anticipate for the investment recommendation?

2. Assume that the client’s risk index could be increased to 0.055. How much would the yield increase, and how would the investment recommendation change?

3. Refer again to the original situation, in which the client’s risk index was assessed to be 0.05. How would your investment recommendation change if the annual yield for the growth fund were revised downward to 16% or even to 14%?

4. Assume that the client expressed some concern about having too much money in the growth fund. How would the original recommendation change if the amount invested in the growth fund is not allowed to exceed the amount invested in the income fund?

5. The asset allocation model you developed may be useful in modifying the portfolios for all of the firm’s clients whenever the anticipated yields for the three funds are periodically revised. What is your recommendation as to whether use of this model is possible?

Chapter 13

Integer Linear Optimization Models

CONTENTS

Analytics in Action: Petrobras

13.1 Types of Integer Linear Optimization Models

13.2 Eastborne Realty, an Example of Integer Optimization
The Geometry of Linear All-Integer Optimization

13.3 Solving Integer Optimization Problems with Excel Solver
A Cautionary Note About Sensitivity Analysis

13.4 Applications Involving Binary Variables
Capital Budgeting
Fixed Cost
Bank Location
Product Design and Market Share Optimization

13.5 Modeling Flexibility Provided by Binary Variables
Multiple-Choice and Mutually Exclusive Constraints
k Out of n Alternatives Constraint
Conditional and Corequisite Constraints

13.6 Generating Alternatives in Binary Optimization

Appendix 13.1 Solving Integer Linear Optimization Problems Using Analytic Solver (MindTap Reader)


In this chapter we discuss a class of problems that are modeled as linear programs with the additional requirement that one or more variables must be an integer. Such problems are called integer linear programs.

The objective of this chapter is to provide an applications-oriented introduction to integer linear programming. First, in Section 13.1, we discuss the different types of integer linear programming models. In Section 13.2, we discuss an example, Eastborne Realty, and the geometry of all-integer linear programs, and in Section 13.3, we show how to use Excel Solver to solve integer optimization problems. In Section 13.4, we discuss four applications of integer linear programming that make use of binary variables: capital budgeting, fixed cost, bank location, and market share optimization problems. In Section 13.5, we provide additional illustrations of the modeling flexibility provided by binary variables. In Section 13.6, we discuss ways to generate useful alternative solutions in integer linear optimization.

13.1 Types of Integer Linear Optimization Models

The only difference between the problems in this chapter and the problems in Chapter 12 on linear programming is that one or more variables are required to be an integer. If all variables are required to be an integer, we have an all-integer linear program. The following is a two-variable, all-integer linear programming model:

\[
\begin{aligned}
\text{Max}\quad & 2x_1 + 3x_2 \\
\text{s.t.}\quad & 3x_1 + 3x_2 \le 12 \\
& \tfrac{2}{3}x_1 + 1x_2 \le 4 \\
& 1x_1 + 2x_2 \le 6 \\
& x_1, x_2 \ge 0 \text{ and integer}
\end{aligned}
\]

In the chapter appendix available in the MindTap Reader we discuss how to use Analytic Solver to solve integer linear optimization problems.

Analytics in Action

Petrobras*

Petrobras, the largest corporation in Brazil, operates approximately 80 offshore oil production and exploration platforms in the oil-rich Campos Basin. One of Petrobras’s biggest challenges is planning its logistics, including how to efficiently and safely transport nearly 1,900 employees per day from its four mainland bases to the offshore platforms and then back to the mainland. Every day, planners must route and schedule the helicopters used for this purpose. This routing and scheduling problem is challenging because there are over a billion possible combinations of schedules and routes.

Petrobras uses mixed integer linear optimization to solve its helicopter transport scheduling and routing problem. The objective function of the optimization model is a weighted function designed to ensure safety, minimize unmet demand, and minimize the cost of the transport of its crews. Because offshore landings are the riskiest part of the transport, the safety objective is met by minimizing the number of offshore landings required in the schedule. Numerous constraints must be met in planning these routes and schedules. These include limiting the number of departures from a platform at certain times; ensuring no time conflicts for a given helicopter and pilot; ensuring proper breaks for pilots; and limiting the number of flights per day for a given helicopter as well as routing restrictions. The decision variables include binary variables for assigning helicopters to flights and pilots to break times, as well as variables on the number of passengers per flight.

Compared to the previously used manual approach to this problem, the new approach using the integer optimization model transports the same number of passengers but with 18% fewer offshore landings, 8% less flight time, and a reduction in cost of 14%. The annual cost savings is estimated to be approximately $24 million.

*Based on F. Menezes et al., “Optimizing Helicopter Transport of Oil Rig Crews at Petrobras,” Interfaces 40, no. 5 (September–October 2010): 408–416.


If we drop the phrase “and integer” from the last line of this model, we have the familiar two-variable linear program. The linear program that results from dropping the integer requirements is called the linear programming relaxation, or LP Relaxation, of the integer linear program.

If some, but not necessarily all, variables are required to be integer, we have a mixed-integer linear program. The following is a two-variable, mixed-integer linear program:

\[
\begin{aligned}
\text{Max}\quad & 3x_1 + 4x_2 \\
\text{s.t.}\quad & -1x_1 + 2x_2 \le 8 \\
& 1x_1 + 2x_2 \le 12 \\
& 2x_1 + 1x_2 \le 16 \\
& x_1, x_2 \ge 0 \text{ and } x_2 \text{ integer}
\end{aligned}
\]

We obtain the LP Relaxation of this mixed-integer linear program by dropping the requirement that \(x_2\) be integer.

In some applications, the integer variables may take on only the values 0 or 1. Then we have a binary integer linear program. As we see later in the chapter, binary variables provide additional modeling capability.

13.2 Eastborne Realty, an Example of Integer Optimization

Eastborne Realty has $2 million available for the purchase of new rental property. After an initial screening, Eastborne reduced the investment alternatives to townhouses and apartment buildings. Each townhouse can be purchased for $282,000, and five are available. Each apartment building can be purchased for $400,000, and the developer will construct as many buildings as Eastborne wants to purchase.

Eastborne’s property manager can devote up to 140 hours per month to these new properties; each townhouse is expected to require 4 hours per month, and each apartment building is expected to require 40 hours per month. The annual cash flow, after deducting mortgage payments and operating expenses, is estimated to be $10,000 per townhouse and $15,000 per apartment building. Eastborne’s owner would like to determine the number of townhouses and the number of apartment buildings to purchase to maximize annual cash flow.

We begin by defining the decision variables:

T = number of townhouses
A = number of apartment buildings

The objective function for cash flow (in thousands of dollars) is

\[
\text{Max}\quad 10T + 15A
\]

Three constraints must be satisfied:

\[
\begin{aligned}
282T + 400A &\le 2{,}000 && \text{Funds available (\$1,000s)} \\
4T + 40A &\le 140 && \text{Manager's time (hours)} \\
T &\le 5 && \text{Townhouses available}
\end{aligned}
\]

The variables T and A must be nonnegative. In addition, the purchase of a fractional number of townhouses and/or a fractional number of apartment buildings is unacceptable. Thus, T and A must be integers. The model for the Eastborne Realty problem is the following all-integer linear program:

\[
\begin{aligned}
\text{Max}\quad & 10T + 15A \\
\text{s.t.}\quad & 282T + 400A \le 2{,}000 \\
& 4T + 40A \le 140 \\
& T \le 5 \\
& T, A \ge 0 \text{ and integer}
\end{aligned}
\]


The model for Eastborne Realty is a linear all-integer program. Next we discuss the geometry of this model.

The Geometry of Linear All-Integer Optimization

The geometry of the feasible region for the Eastborne Realty problem is shown in Figure 13.1. The lightly shaded region is the feasible region of the LP Relaxation. The optimal linear programming solution is point b, which is \(T = 2.479\) townhouses and \(A = 3.252\) apartment buildings. The optimal value of the objective function is 73.574, which indicates an annual cash flow of $73,574. Point b is formed by the intersection of the Manager’s Time constraint and the Available Funds constraint. Unfortunately, because Eastborne cannot purchase fractional numbers of townhouses and apartment buildings, further analysis is necessary.
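Because point b is described as the intersection of the Manager’s Time and Available Funds constraints, its coordinates can be checked by solving the two constraint equations as equalities. The following is a minimal sketch using Cramer’s rule; the function name `solve_2x2` is introduced here purely for illustration.

```python
def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("the two constraint lines are parallel")
    x = (c1 * b2 - c2 * b1) / det
    y = (a1 * c2 - a2 * c1) / det
    return x, y

# Funds:          282T + 400A = 2000   (in $000)
# Manager's time:   4T +  40A = 140    (hours)
T, A = solve_2x2(282, 400, 2000, 4, 40, 140)
cash_flow = 10 * T + 15 * A  # objective value in $000

print(round(T, 3), round(A, 3), round(cash_flow, 3))  # 2.479 3.252 73.574
```

The fractional values of T and A are what make the rounding discussion that follows necessary.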

In many cases, a noninteger solution can be rounded to obtain an acceptable integer solution. For instance, a linear programming solution to a production scheduling problem might call for the production of 15,132.4 cases of breakfast cereal. The rounded integer solution of 15,132 cases would probably have minimal impact on the value of the objective function and the feasibility of the solution. Rounding would be a sensible approach. Indeed, whenever rounding has a minimal impact on the objective function and constraints, most managers find it acceptable; a near-optimal solution is satisfactory.

However, rounding may not always be a good strategy. When the decision variables take on small values that have a major impact on the value of the objective function or feasibility, an optimal integer solution is needed. Let us return to the Eastborne Realty problem and examine the impact of rounding. The optimal solution to the LP Relaxation for Eastborne Realty resulted in \(T = 2.479\) townhouses and \(A = 3.252\) apartment buildings. Because each townhouse costs $282,000 and each apartment building costs $400,000, rounding to an integer solution can be expected to have a substantial economic impact on the problem.

Figure 13.1 The Geometry of the Eastborne Realty Problem
[Figure: horizontal axis is No. of Townhouses (T); vertical axis is No. of Apartment Buildings (A). The plot shows the Available Funds, Manager’s Time, and Townhouse Availability constraints, the feasible region, the objective function contour at 70, labeled corner points, and the optimal integer solution T = 4, A = 2. Note: Dots show the location of feasible integer solutions.]

Suppose that we round the solution to the LP Relaxation to obtain the integer solution \(T = 2\) and \(A = 3\), with an objective function value of \(10(2) + 15(3) = 65\). The annual cash flow of $65,000 is substantially less than the annual cash flow of $73,574 provided by the solution to the LP Relaxation. Do other rounding possibilities exist? Exploring other rounding alternatives shows that the integer solution \(T = 3\) and \(A = 3\) is infeasible because it requires $282,000(3) + $400,000(3) = $2,046,000, which is more than the $2 million that Eastborne has available. The rounded solution of \(T = 2\) and \(A = 4\) is also infeasible for the same reason. At this point, rounding has led to two townhouses and three apartment buildings with an annual cash flow of $65,000 as the best feasible integer solution to the problem. Unfortunately, we don’t know whether this solution is the best integer solution to the problem.
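The trial-and-error nature of rounding can be made concrete with a short sketch that tests every round-down and round-up combination of the LP Relaxation solution against the three constraints. The helper name `is_feasible` is an illustrative assumption; the data come from the Eastborne model.

```python
from itertools import product
from math import ceil, floor

def is_feasible(T, A):
    """Check the Eastborne constraints (funds in $000, time in hours)."""
    return (282 * T + 400 * A <= 2000   # funds available
            and 4 * T + 40 * A <= 140   # manager's time
            and 0 <= T <= 5             # townhouses available
            and A >= 0)

lp_T, lp_A = 2.479, 3.252  # LP Relaxation solution from the text

# All four ways of rounding each variable down or up.
candidates = list(product((floor(lp_T), ceil(lp_T)),
                          (floor(lp_A), ceil(lp_A))))

# Keep only the feasible roundings, with their cash flow in $000.
feasible = {(T, A): 10 * T + 15 * A
            for T, A in candidates if is_feasible(T, A)}

print(feasible)  # {(2, 3): 65}
```

Only one of the four roundings survives the feasibility check, and nothing in this procedure indicates whether a better integer solution exists elsewhere in the feasible region.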

Rounding to an integer solution is a trial-and-error approach. Each rounded solution must be evaluated for feasibility as well as for its impact on the value of the objective function. Even when a rounded solution is feasible, we have no guarantee that we have found the optimal integer solution. We will see shortly that the rounded solution (\(T = 2\) and \(A = 3\)) is not optimal for Eastborne Realty.

What is the true feasible region for the Eastborne Realty problem? As shown in Figure 13.1, the feasible region is the set of integer points that lie within the feasible region of the LP Relaxation. There are 20 such feasible solutions (designated by blue dots in the figure). The region bounded by the dashed lines is known as the convex hull of the set of feasible integer solutions. The convex hull of a set of points is the smallest intersection of linear inequalities that contains the set of points. Notice that the convex hull in Figure 13.1 has integer extreme points (points d, e, f, g, h, and i). If we knew the convex hull, we could use linear programming to find the optimal integer corner point. Unfortunately, identifying the convex hull can be very time consuming. This is somewhat counterintuitive because there are only 20 feasible solutions, but solving an integer optimization problem such as that for Eastborne Realty may require solving numerous linear programs to find the optimal integer solution. Therefore, an integer optimization problem can be much more time consuming to solve than solving a linear program of comparable size.

It is true that the optimal solution to the integer program will be an extreme point of the convex hull, so one or more of the extreme points d, e, f, g, h, and i are optimal. The objective function contour shown in Figure 13.1 with an objective function value equal to 70 shows that point h is the optimal solution. As a check, let us evaluate each of the corner points of the convex hull in Figure 13.1:

| Point | T | A | Annual Cash Flow ($000) |
|---|---|---|---|
| d | 5 | 0 | 10(5) + 15(0) = 50 |
| e | 0 | 0 | 10(0) + 15(0) = 0 |
| f | 0 | 3 | 10(0) + 15(3) = 45 |
| g | 2 | 3 | 10(2) + 15(3) = 65 |
| h | 4 | 2 | 10(4) + 15(2) = 70 |
| i | 5 | 1 | 10(5) + 15(1) = 65 |

This confirms that the optimal integer solution occurs at point h, where \(T = 4\) townhouses and \(A = 2\) apartment buildings. The objective function value is an annual cash flow of $70,000. This solution is substantially better than the best solution found by rounding, \(T = 2\), \(A = 3\), with an annual cash flow of $65,000. Thus, we see that rounding would not have been the best strategy for Eastborne Realty.
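For a problem this small, the feasible integer points, the extreme points of their convex hull, and the best corner can all be found by direct enumeration. The sketch below is an illustration only: it uses Andrew’s monotone chain method for the hull, which is a technique chosen here and not one described in the text, and the names `is_feasible`, `cross`, and `convex_hull` are hypothetical.

```python
def is_feasible(T, A):
    """Eastborne constraints: funds ($000), manager's time, townhouse limit."""
    return 282 * T + 400 * A <= 2000 and 4 * T + 40 * A <= 140 and T <= 5

# Every nonnegative integer point that satisfies all three constraints.
points = sorted((T, A) for T in range(6) for A in range(6) if is_feasible(T, A))

def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(pts):
    """Extreme points of sorted pts, counterclockwise (monotone chain)."""
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()  # drop points that are collinear or turn right
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

hull = convex_hull(points)
best = max(hull, key=lambda p: 10 * p[0] + 15 * p[1])

print(len(points))  # 20
print(hull)         # [(0, 0), (5, 0), (5, 1), (4, 2), (2, 3), (0, 3)]
print(best)         # (4, 2)
```

Enumeration like this does not scale: the number of integer points grows rapidly with the number of variables, which is why general-purpose solvers rely on the methods outlined in the Notes + Comments.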


13.3 Solving Integer Optimization Problems with Excel Solver

The worksheet formulation and solution for integer linear programs is similar to that for linear programming problems. Actually the worksheet formulation is exactly the same, but some additional information must be provided when setting up the Solver Parameters and Options dialog boxes. Constraints must be added in the Solver Parameters dialog box to identify the integer variables. In addition, the value for Tolerance in the Integer Options dialog box may need to be adjusted to obtain a solution.

Let us demonstrate the Excel solution of an integer linear program by showing how Excel Solver can be used to solve the Eastborne Realty problem. The worksheet with the optimal solution is shown in Figure 13.2. We will describe the key elements of the worksheet and how to obtain the solution, and then we will interpret the solution.

The parameters and descriptive labels appear in cells A1:G7 of the worksheet in Figure 13.2. The cells in the lower portion of the worksheet contain the information required by the Excel Solver (decision variables, objective function, constraint left-hand sides, and constraint right-hand sides).

Notes + Comments

1. An important observation can be made from the analysis of the Eastborne Realty problem. It has to do with the relationship between the value of the optimal integer solution and the value of the optimal solution to the LP Relaxation. For integer linear programs involving maximization, the value of the optimal solution to the LP Relaxation provides an upper bound on the value of the optimal integer solution. This observation is valid for the Eastborne Realty problem. The value of the optimal integer solution is $70,000, and the value of the optimal solution to the LP Relaxation is $73,574. Thus, we know from the LP Relaxation solution that the upper bound for the value of the objective function is $73,574. For integer linear programs involving minimization, the value of the optimal solution to the LP Relaxation provides a lower bound on the value of the optimal integer solution.

2. The two popular approaches to solving integer linear optimization problems are branch-and-bound and cutting planes. Both solve a series of LP relaxations to arrive at an optimal integer solution. The branch-and-bound approach breaks the feasible region of the LP Relaxation into subregions until the subregions have integer solutions or it is determined that the solution cannot be in the subregion. Cutting plane approaches try to identify the convex hull by adding a series of new constraints that do not exclude any feasible integer points. Indeed, most software for integer optimization, including Excel Solver, employs a combination of these two approaches.
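The branch-and-bound idea can be illustrated on the Eastborne Realty model with a small teaching sketch. It is not a description of how Excel Solver is implemented. As a simplifying assumption, each LP Relaxation is solved by checking every intersection of two constraint lines, which is workable only because the model has two variables; the names `solve_lp` and `branch_and_bound` are hypothetical.

```python
from itertools import combinations
from math import ceil, floor

OBJ = (10, 15)  # cash flow per townhouse and per apartment building ($000)
BASE = [        # each row (a, b, c) means a*T + b*A <= c
    (282, 400, 2000),  # funds available ($000)
    (4, 40, 140),      # manager's time (hours)
    (1, 0, 5),         # townhouses available
    (-1, 0, 0),        # T >= 0
    (0, -1, 0),        # A >= 0
]

def solve_lp(cons):
    """Best vertex (value, T, A) of a bounded 2-variable LP, or None."""
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(cons, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < 1e-9:
            continue  # parallel lines have no unique intersection
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        if all(a * x + b * y <= c + 1e-7 for a, b, c in cons):
            value = OBJ[0] * x + OBJ[1] * y
            if best is None or value > best[0]:
                best = (value, x, y)
    return best

def branch_and_bound():
    incumbent = None   # best integer solution found so far
    stack = [BASE]     # subregions still to be explored
    while stack:
        cons = stack.pop()
        lp = solve_lp(cons)
        if lp is None:
            continue   # subregion is empty
        value, x, y = lp
        if incumbent is not None and value <= incumbent[0] + 1e-7:
            continue   # LP bound shows this subregion cannot do better
        frac = [(i, v) for i, v in enumerate((x, y))
                if abs(v - round(v)) > 1e-6]
        if not frac:   # LP solution is integer: new incumbent
            T, A = round(x), round(y)
            incumbent = (OBJ[0] * T + OBJ[1] * A, T, A)
            continue
        i, v = frac[0]  # branch on the first fractional variable
        unit = (1, 0) if i == 0 else (0, 1)
        stack.append(cons + [(unit[0], unit[1], floor(v))])    # var <= floor
        stack.append(cons + [(-unit[0], -unit[1], -ceil(v))])  # var >= ceil
    return incumbent

print(branch_and_bound())  # (70, 4, 2)
```

Each subregion is discarded either because it is empty or because its LP Relaxation value, an upper bound for that subregion, cannot beat the best integer solution already in hand.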

Decision variables: Cells B14:C14 are reserved for the decision variables.

Objective function: The formula =SUMPRODUCT(B7:C7,B14:C14) has been placed into cell B17 to reflect the annual cash flow associated with the solution.

Left-hand sides: The left-hand sides for the three constraints are placed into cells F15:F17.
    Cell F15 =SUMPRODUCT(B4:C4,$B$14:$C$14) (Copy to cell F16)
    Cell F17 =B14

Right-hand sides: The right-hand sides for the three constraints are placed into cells G15:G17.
    Cell G15 =G4 (Copy to cells G16:G17)
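For readers who want to see what the worksheet formulas compute, the following sketch mirrors them outside Excel. The row assignments in the comments are inferred from the cell references in the formulas above and should be treated as assumptions; `sumproduct` is a hypothetical stand-in for Excel’s SUMPRODUCT.

```python
def sumproduct(xs, ys):
    """Sum of pairwise products, like Excel's SUMPRODUCT on two ranges."""
    return sum(x * y for x, y in zip(xs, ys))

price = [282, 400]     # B4:C4, purchase price ($000)
mgr_time = [4, 40]     # B5:C5, manager hours per month
cash_flow = [10, 15]   # B7:C7, annual cash flow ($000)
plan = [4, 2]          # B14:C14, townhouses and apartment buildings

objective = sumproduct(cash_flow, plan)      # cell B17
lhs = [sumproduct(price, plan),              # cell F15
       sumproduct(mgr_time, plan),           # cell F16
       plan[0]]                              # cell F17 (=B14)
rhs = [2000, 140, 5]                         # cells G15:G17

print(objective, lhs)  # 70 [1928, 96, 4]
```

Comparing `lhs` with `rhs` element by element is the same check that Solver performs on cells F15:F17 against G15:G17.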

To solve the Eastborne Realty problem, we follow these steps:

Step 1. Click the Data tab in the Ribbon
Step 2. In the Analyze group, click Solver
Step 3. When the Solver Parameters dialog box appears (Figure 13.3):
    Enter B17 in the Set Objective: box
    Select Max for the To: option
    Enter B14:C14 in the By Changing Variable Cells: box
Step 4. Click the Add button
    When the Add Constraint dialog box appears:
    Enter B14:C14 in the Cell Reference: box
    Select int from the drop-down menu
    When int is selected, the term “integer” automatically appears in the Constraint: box. This constraint tells Solver that the decision variables in cells B14 and C14 must be integers.
Step 5. Click the Add button
    When the Add Constraint dialog box appears:
    Enter F15:F17 in the Cell Reference: box
    Select ≤ from the drop-down menu
    Enter G15:G17 in the Constraint: area
    Click OK
Step 6. Select the Make Unconstrained Variables Non-Negative option
    Select Simplex LP from the Select a Solving Method: drop-down menu
Step 7. Click the Options button
    Select the All Methods tab, and set the Integer Optimality (%): to 0, as shown in Figure 13.4. This ensures that we find the optimal integer solution.
    Click OK to close the Options dialog box
Step 8. When the Solver Parameters dialog box reappears, click Solve
Step 9. When the Solver Results dialog box appears, select Answer in the Reports area and click OK

Binary variables are identified with the bin designation in the Solver Parameters dialog box.

Figure 13.2 Eastborne Realty Spreadsheet Model
[Figure: the worksheet in formula view and in value view. Parameters include prices ($000) of 282 per townhouse and 400 per apartment building, annual cash flows ($000) of 10 and 15, funds available ($000) of 2,000, and manager time available of 140 hours. The model section shows a purchase plan of 4 townhouses and 2 apartment buildings, a max cash flow ($000) of 70, and funds used ($000) of 1,928. Data file: Eastborne.]

The completed linear integer optimization model for the Eastborne Realty problem is contained in the file EastborneModel.

Figure 13.5 shows the Eastborne Realty Answer Report. The structure of the Answer Report from Excel Solver for integer programs is the same as that described in Chapter 12 for linear programs. The first section gives information regarding the objective function. It shows that the objective function is located in cell B17 and that the optimal value (Final Value) of the objective function is $70,000. The Variable Cells section gives the location, name, and original and optimal values (Final Value) of the decision variables, as well as an indication that the decision variables have been designated as integers. For the Eastborne problem, in Figure 13.5, we see that the optimal solution is to purchase four townhouses and two apartment buildings. Finally, the Constraints section gives us detail on the status of each constraint at optimality. We see that none of the three constraints is binding, and from the slack column, we see that we have $72,000 unused from the budget and 44 unused hours and that we are under the limit of 5 townhouses by 1.

[Figure 13.3: Solver Parameters Dialog Box for Eastborne Realty]

[Figure 13.4: Solver Options Dialog Box]

As this example illustrates, and as we have seen in Figure 13.1, unlike in a linear program, the solution to an integer program can be such that none of the constraints is binding at the optimal point.
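As a quick check, the slack values reported in the Answer Report follow directly from the constraint data with \(T = 4\) and \(A = 2\):

\[
\begin{aligned}
\text{Funds (\$000):}\quad & 2{,}000 - \bigl(282(4) + 400(2)\bigr) = 2{,}000 - 1{,}928 = 72 \\
\text{Manager's time (hours):}\quad & 140 - \bigl(4(4) + 40(2)\bigr) = 140 - 96 = 44 \\
\text{Townhouses:}\quad & 5 - 4 = 1
\end{aligned}
\]

All three slacks are strictly positive, so no constraint holds with equality at the optimal integer point.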

A Cautionary Note About Sensitivity Analysis

The classical sensitivity analysis discussed in Chapter 12 for linear programs is not available for integer programs. Because of the discrete nature of integer optimization, it is not possible to easily calculate objective function coefficient ranges, shadow prices, and right-hand-side ranges. However, this does not mean that the sensitivity analysis is not important for integer programs. Sensitivity analysis is often more crucial for integer linear programming problems than for linear programming problems. A small change in one of the coefficients in the constraints can cause a relatively large change in the value of the optimal solution. To understand why, consider the following integer programming model of a simple capital budgeting problem involving four projects and a budgetary constraint for a single time period:

\[
\begin{aligned}
\text{Max}\quad & 40x_1 + 60x_2 + 70x_3 + 160x_4 \\
\text{s.t.}\quad & 16x_1 + 35x_2 + 45x_3 + 85x_4 \le 100 \\
& x_1, x_2, x_3, x_4 = 0, 1
\end{aligned}
\]

The optimal solution to this problem is \(x_1 = 1\), \(x_2 = 1\), \(x_3 = 1\), and \(x_4 = 0\), with an objective function value of $170. However, note that if the available budget is increased by $1 (from $100 to $101), the optimal solution changes to \(x_1 = 1\), \(x_2 = 0\), \(x_3 = 0\), and \(x_4 = 1\), with an objective function value of $200. In other words, one additional dollar in the budget would lead to a $30 increase in the return. Surely management, when faced with such a situation, would increase the budget by $1. Because of the extreme sensitivity of the value of the optimal solution to the constraint coefficients, practitioners usually recommend re-solving the integer linear program several times with variations in the coefficients before attempting to choose the best solution for implementation.
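The recommendation to re-solve with varied inputs can be sketched for this four-project model by enumerating all 16 binary selections at each budget level. Enumeration is used here only because the model is tiny; `best_selection` is a hypothetical helper name.

```python
from itertools import product

RETURNS = (40, 60, 70, 160)  # objective coefficients for x1..x4
COSTS = (16, 35, 45, 85)     # budget constraint coefficients for x1..x4

def best_selection(budget):
    """Return (best value, selection) over all 0/1 choices within budget."""
    best_value, best_x = 0, (0, 0, 0, 0)
    for x in product((0, 1), repeat=4):
        cost = sum(c * xi for c, xi in zip(COSTS, x))
        value = sum(r * xi for r, xi in zip(RETURNS, x))
        if cost <= budget and value > best_value:
            best_value, best_x = value, x
    return best_value, best_x

for budget in (100, 101):
    print(budget, best_selection(budget))
# 100 (170, (1, 1, 1, 0))
# 101 (200, (1, 0, 0, 1))
```

The jump from 170 to 200 for a one-unit budget change is exactly the kind of discontinuity that a shadow price from a linear program would not reveal.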

Sensitivity reports are not available for integer optimization problems. To determine the sensitivity of the solution to changes in model inputs, you must change the data and re-solve the problem.

Notes + Comments

The time required to obtain an optimal solution can be highly variable for integer linear programs. If an optimal solution cannot be found within a reasonable amount of time, the Integer Optimality (%) can be reset to 5% or some higher value so that the search procedure may stop when a near-optimal solution (within the tolerance of being optimal) has been found. This can shorten the solution time because, if the Integer Optimality (%) is set to 5%, Solver can stop when it knows it is within 5% of optimal rather than having to complete the search. In general, unless you are experiencing excessive run times, we recommend you set the Integer Optimality (%) to 0.
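To give a feel for what an optimality tolerance means, the sketch below computes a relative gap between an integer solution and the LP Relaxation bound for the Eastborne Realty problem. The gap formula is one common convention and is an assumption here; the exact calculation Excel Solver uses, and the bound it holds at any moment during the search, may differ.

```python
def relative_gap(bound, incumbent):
    """Gap between the best known bound and an integer solution (max problem)."""
    return (bound - incumbent) / bound

lp_bound = 73.574  # LP Relaxation value for Eastborne Realty ($000)

gap_rounded = relative_gap(lp_bound, 65)  # rounded solution T = 2, A = 3
gap_optimal = relative_gap(lp_bound, 70)  # optimal solution T = 4, A = 2

print(f"{gap_rounded:.1%}")  # 11.7%
print(f"{gap_optimal:.1%}")  # 4.9%
```

Under this convention, a 5% tolerance would not accept the rounded solution on the strength of the root bound alone, but it could accept the solution worth 70 without any further tightening of the bound. With the tolerance at 0, the search must continue until the bound and the best integer solution coincide.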

Figure 13.5 Excel Solver Answer Report for the Eastborne Realty Problem
[Figure: Answer Report. Objective Cell (Max): $B$17 Max Cash Flow ($000), original value $0, final value $70. Variable Cells: $B$14 Purchase Plan Townhouses, original value 0, final value 4, Integer; $C$14 Purchase Plan Apt. Bldgs., original value 0, final value 2, Integer. Constraints: $F$15 Funds ($000) Total Used, cell value $1,928, formula $F$15<=$G$15, Not Binding; $F$16 Time (Hours) Total Used, formula $F$16<=$G$16, Not Binding; $F$17 Townhouses Total Used, cell value 4, formula $F$17<=$G$17; $B$14:$C$14=Integer.]


13.4 Applications Involving Binary Variables

Much of the modeling flexibility provided by integer linear programming is due to the use of binary variables. In many applications, binary variables provide selections or choices with the value of the variable equal to one if a corresponding activity is undertaken and equal to zero if the corresponding activity is not undertaken. The capital budgeting, fixed cost, bank location, and product design and market share optimization applications presented in this section make use of binary variables.

Capital Budgeting

The Ice-Cold Refrigerator Company is considering investing in several projects that have varying capital requirements over the next four years. Faced with limited capital each year, management would like to select the most profitable projects that it can afford. The estimated net present value for each project, the capital requirements, and the available capital over the four-year period are shown in Table 13.1.

Let us define four binary decision variables:

P = 1 if the plant expansion project is accepted; 0 if rejected
W = 1 if the warehouse expansion project is accepted; 0 if rejected
M = 1 if the new machinery project is accepted; 0 if rejected
R = 1 if the new product research project is accepted; 0 if rejected

In a capital budgeting problem, the company’s objective function is to maximize the net present value of the capital budgeting projects. This problem has four constraints: one for the funds available in each of the next four years.

A binary integer linear programming model with dollars in thousands is as follows:

\[
\begin{aligned}
\text{Max}\quad & 90P + 40W + 10M + 37R \\
\text{s.t.}\quad & 15P + 10W + 10M + 15R \le 40 && \text{(Year 1 capital available)} \\
& 20P + 15W + 10R \le 50 && \text{(Year 2 capital available)} \\
& 20P + 20W + 10R \le 40 && \text{(Year 3 capital available)} \\
& 15P + 5W + 4M + 10R \le 35 && \text{(Year 4 capital available)} \\
& P, W, M, R = 0, 1
\end{aligned}
\]

The Ice-Cold spreadsheet model and Solver dialog box are shown in Figure 13.6. The SUMPRODUCT function is used to calculate the amount of capital used in each year as well as the net present value.

The Excel Solver Answer Report is shown in Figure 13.7. The optimal solution is 1P 5 , 1W 5 , 1M 5 , 0R 5 , with a total estimated net present value of $140,000. Thus,

The estimated net present value is the net cash flow discounted back to the beginning of year 1.

Project

Plant Expansion ($)

Warehouse Expansion ($)

New Machinery ($)

New Product Research ($)

Total Capital Available ($)

Present Value 90,000 40,000 10,000 37,000

Year 1 Cap Rqmt 15,000 10,000 10,000 15,000 40,000

Year 2 Cap Rqmt 20,000 15,000 10,000 50,000

Year 3 Cap Rqmt 20,000 20,000 10,000 40,000

Year 4 Cap Rqmt 15,000 5,000 4,000 10,000 35,000

Project net Present Value, capital Requirements, and Available capital for the Ice-cold Refrigerator company

TABlE 13.1

13.4 Applications Involving Binary Variables 617

the company should fund the plant expansion, warehouse expansion, and new machinery projects. The new product research project should be put on hold unless additional capital funds become available. The values of the slack variables (Figure 13.7) show that the company will have $5,000 remaining in year 1, $15,000 remaining in year 2, and $11,000 remaining in year 4. Checking the capital requirements for the new product research project, we see that enough funds are available for this project in years 2 and 4. However, the company would have to find additional capital funds of $10,000 in year 1 and $10,000 in year 3 to fund the new product research project.
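Because the Ice-Cold model has only four binary variables, its 16 possible investment plans can be checked by direct enumeration. The following Python sketch is an illustration only (the text solves the model with Excel Solver); the function and variable names are assumptions introduced here, and the data come from Table 13.1 in thousands of dollars, with the blank New Machinery entries for years 2 and 3 taken as 0.

```python
from itertools import product

# Data from Table 13.1 in $1000s. Project order: P, W, M, R.
npv = [90, 40, 10, 37]
capital_required = [
    [15, 10, 10, 15],  # year 1
    [20, 15, 0, 10],   # year 2
    [20, 20, 0, 10],   # year 3
    [15, 5, 4, 10],    # year 4
]
capital_available = [40, 50, 40, 35]


def capital_used(plan):
    """Capital spent in each year for a 0/1 plan (P, W, M, R)."""
    return [sum(c * x for c, x in zip(row, plan)) for row in capital_required]


def is_feasible(plan):
    """True if the plan stays within the available capital in every year."""
    return all(u <= a for u, a in zip(capital_used(plan), capital_available))


def total_npv(plan):
    """Objective function: total net present value of the accepted projects."""
    return sum(v * x for v, x in zip(npv, plan))


# Enumerate all 2^4 plans, keep the feasible ones, and pick the best.
feasible_plans = [p for p in product((0, 1), repeat=4) if is_feasible(p)]
best_plan = max(feasible_plans, key=total_npv)
best_npv = total_npv(best_plan)
slack = [a - u for u, a in zip(capital_used(best_plan), capital_available)]

print("Best plan (P, W, M, R):", best_plan)
print("Net present value ($1000s):", best_npv)
print("Unused capital by year ($1000s):", slack)
```

Enumeration is practical only for very small models, since the number of plans doubles with each additional binary variable.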

[FIGURE 13.6 Ice-Cold Spreadsheet Model and Solver Dialog Box. Model file: IceCold.]

618 chapter 13 Integer Linear Optimization Models

Fixed Cost

In many applications, the cost of production has two components: a fixed setup cost and a variable cost directly related to the production quantity. The use of binary variables makes including the setup cost possible in a model for a production application.

As an example of a fixed-cost problem, consider the production problem faced by RMC Inc. Three raw materials are used to produce three products: a fuel additive, a solvent base, and a carpet cleaning fluid. The following decision variables are used:

$$
\begin{aligned}
F &= \text{tons of fuel additive produced} \\
S &= \text{tons of solvent base produced} \\
C &= \text{tons of carpet cleaning fluid produced}
\end{aligned}
$$

The profit contributions are $40 per ton for the fuel additive, $30 per ton for the solvent base, and $50 per ton for the carpet cleaning fluid. Each ton of fuel additive is a blend of 0.4 ton of material 1 and 0.6 ton of material 3. Each ton of solvent base requires 0.5 ton of material 1, 0.2 ton of material 2, and 0.3 ton of material 3. Each ton of carpet cleaning fluid is a blend of 0.6 ton of material 1, 0.1 ton of material 2, and 0.3 ton of material 3. RMC has 20 tons of material 1, 5 tons of material 2, and 21 tons of material 3, and management is interested in determining the optimal production quantities for the upcoming planning period.

A linear programming model of the RMC problem is as follows:

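$$
\begin{aligned}
\text{Max}\quad & 40F + 30S + 50C \\
\text{s.t.}\quad & 0.4F + 0.5S + 0.6C \le 20 && \text{Material 1} \\
& 0.2S + 0.1C \le 5 && \text{Material 2} \\
& 0.6F + 0.3S + 0.3C \le 21 && \text{Material 3} \\
& F, S, C \ge 0
\end{aligned}
$$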

[FIGURE 13.7 Answer Report for Ice-Cold Refrigerator]

Using Excel Solver, we obtain an optimal solution consisting of 27.5 tons of fuel additive, 0 tons of solvent base, and 15 tons of carpet cleaning fluid, with a value of $1,850.

This linear programming formulation of the RMC problem does not include a fixed cost for production setup of the products. Suppose that the following data are available concerning the setup cost and the maximum production quantity for each of the three products:

| Product | Setup Cost ($) | Maximum Production (tons) |
|---|---|---|
| Fuel additive | 200 | 50 |
| Solvent base | 50 | 25 |
| Carpet cleaning fluid | 400 | 40 |

The modeling flexibility provided by binary variables can now be used to incorporate the fixed setup costs into the production model. The binary variables are defined as follows:

$$
\begin{aligned}
\mathit{SF} &= 1 \text{ if the fuel additive is produced; 0 if not} \\
\mathit{SS} &= 1 \text{ if the solvent base is produced; 0 if not} \\
\mathit{SC} &= 1 \text{ if the carpet cleaning fluid is produced; 0 if not}
\end{aligned}
$$

Using these setup variables, the total setup cost is

$$
200\,\mathit{SF} + 50\,\mathit{SS} + 400\,\mathit{SC}
$$

We can now rewrite the objective function to include the setup cost. Thus, the net profit objective function becomes

$$
\text{Max}\quad 40F + 30S + 50C - 200\,\mathit{SF} - 50\,\mathit{SS} - 400\,\mathit{SC}
$$

Next, we must write production capacity constraints so that, if a setup variable equals 0, production of the corresponding product is not permitted, and if a setup variable equals 1, production is permitted up to the maximum quantity. For the fuel additive, we do so by adding the following constraint:

$$
F \le 50\,\mathit{SF}
$$

Note that, with this constraint present, production of the fuel additive is not permitted when SF = 0. When SF = 1, production of up to 50 tons of fuel additive is permitted. We can think of the setup variable as a switch. When it is off (SF = 0), production is not permitted; when it is on (SF = 1), production is permitted.

Similar production capacity constraints, using binary variables, are added for the solvent base and carpet cleaning products:

$$
\begin{aligned}
S &\le 25\,\mathit{SS} \\
C &\le 40\,\mathit{SC}
\end{aligned}
$$

In summary, we have the following fixed-cost model for the RMC problem with setups:

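$$
\begin{aligned}
\text{Max}\quad & 40F + 30S + 50C - 200\,\mathit{SF} - 50\,\mathit{SS} - 400\,\mathit{SC} \\
\text{s.t.}\quad & 0.4F + 0.5S + 0.6C \le 20 && \text{Material 1} \\
& 0.2S + 0.1C \le 5 && \text{Material 2} \\
& 0.6F + 0.3S + 0.3C \le 21 && \text{Material 3} \\
& F \le 50\,\mathit{SF} && \text{Maximum Fuel Additive} \\
& S \le 25\,\mathit{SS} && \text{Maximum Solvent Base} \\
& C \le 40\,\mathit{SC} && \text{Maximum Carpet Cleaning} \\
& F, S, C \ge 0; \quad \mathit{SF}, \mathit{SS}, \mathit{SC} = 0 \text{ or } 1
\end{aligned}
$$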

A spreadsheet model and Solver dialog box for the RMC problem are shown in Figure 13.8. The SUMPRODUCT function is used to calculate the material used, and cells


D31, D32, and D33 contain the capacity multiplied by the appropriate binary variable (=B11*B22 in cell D31, =C11*C22 in cell D32, and =D11*D22 in cell D33).

The Excel Answer Report is shown in Figure 13.9. The optimal solution requires 25 tons of fuel additive and 20 tons of solvent base. The value of the objective function after deducting the setup cost is $1,350. The setup cost for the fuel additive and the solvent base is $200 + $50 = $250. The optimal solution includes SC = 0, which indicates that the more expensive $400 setup cost for the carpet cleaning fluid should be avoided. Thus, the carpet cleaning fluid is not produced.

The key to developing a fixed-cost model is the introduction of a binary variable for each fixed cost and the specification of an upper bound for the corresponding production variable. For a production quantity x, a constraint of the form x ≤ My can then be used to allow production when the setup variable y = 1 and not to allow production when the setup variable y = 0. The value of the maximum production quantity M should be large enough to allow for all reasonable levels of production, but choosing excessively large values of M will slow the solution procedure.
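To see how the switch constraints and setup costs interact, the following Python sketch evaluates two production plans against the RMC fixed-cost model: the plan reported for the model with setups, and the earlier linear programming plan once its setups are charged. It is only a checker, not an optimizer; the dictionary layout, the function names, and the small floating-point tolerance `TOL` are assumptions made for this illustration.

```python
TOL = 1e-9  # tolerance for floating-point comparisons (an assumption of this sketch)

profit = {"F": 40, "S": 30, "C": 50}        # $ per ton
setup_cost = {"F": 200, "S": 50, "C": 400}  # $ per setup
max_tons = {"F": 50, "S": 25, "C": 40}      # the bound M in x <= M*y
usage = {                                   # tons of material per ton of product
    "Material 1": {"F": 0.4, "S": 0.5, "C": 0.6},
    "Material 2": {"F": 0.0, "S": 0.2, "C": 0.1},
    "Material 3": {"F": 0.6, "S": 0.3, "C": 0.3},
}
available = {"Material 1": 20, "Material 2": 5, "Material 3": 21}


def is_feasible(tons, setup):
    """Check the material constraints and the switch constraints x <= M*y."""
    materials_ok = all(
        sum(usage[m][p] * tons[p] for p in tons) <= available[m] + TOL
        for m in usage
    )
    switches_ok = all(tons[p] <= max_tons[p] * setup[p] + TOL for p in tons)
    return materials_ok and switches_ok


def net_profit(tons, setup):
    """Profit contribution minus the setup cost of every product switched on."""
    return sum(profit[p] * tons[p] - setup_cost[p] * setup[p] for p in tons)


# Plan reported for the fixed-cost model.
mip_tons = {"F": 25.0, "S": 20.0, "C": 0.0}
mip_setup = {"F": 1, "S": 1, "C": 0}

# Plan from the LP without setups; producing F and C requires both setups.
lp_tons = {"F": 27.5, "S": 0.0, "C": 15.0}
lp_setup = {"F": 1, "S": 0, "C": 1}

print(is_feasible(mip_tons, mip_setup), net_profit(mip_tons, mip_setup))
print(is_feasible(lp_tons, lp_setup), net_profit(lp_tons, lp_setup))

# With the fuel additive switch off, producing 25 tons violates F <= 50*SF.
print(is_feasible(mip_tons, {"F": 0, "S": 1, "C": 0}))
```

Once the $600 in setups is charged, the linear programming plan is worth less than the plan that avoids the carpet cleaning fluid setup, which is why the two models recommend different production quantities.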

[FIGURE 13.8 RMC with Setups Spreadsheet Model and Solver Dialog Box. Model file: RMCSetup.]

Bank Location

The long-range planning department for the Ohio Trust Company is considering expanding its operation into a 20-county region in northeastern Ohio (Figure 13.10). Currently, Ohio Trust does not have a principal place of business in any of the 20 counties. According to the banking laws in Ohio, if a bank establishes a principal place of business (PPB) in any county, branch banks can be established in that county and in any of the adjacent counties. However, to establish a new principal place of business, Ohio Trust must either obtain approval for a new bank from the state’s superintendent of banks or purchase an existing bank.

Table 13.2 lists the 20 counties in the region and adjacent counties. For example, Ashtabula County is adjacent to Lake, Geauga, and Trumbull counties; Lake County is adjacent to Ashtabula, Cuyahoga, and Geauga counties; and so on.

As an initial step in its planning, Ohio Trust would like to determine the minimum number of PPBs necessary to do business throughout the 20-county region. A binary integer programming model can be used to solve this location problem for Ohio Trust. We define the variables as

$$
x_i = 1 \text{ if a PPB is established in county } i; \text{ 0 otherwise}
$$

To minimize the number of PPBs needed, we write the objective function as

$$
\text{Min}\quad x_1 + x_2 + \cdots + x_{20}
$$

[FIGURE 13.9 Answer Report for RMC Production Problem]

The bank may locate branches in a county if the county contains a PPB or is adjacent to another county with a PPB. Thus, the binary linear program will need one constraint for each county. For example, the constraint for Ashtabula County is

$$
x_1 + x_2 + x_{12} + x_{16} \ge 1 \qquad \text{Ashtabula}
$$

Note that satisfaction of this constraint ensures that a PPB will be placed in Ashtabula County or in one or more of the adjacent counties. This constraint thus guarantees that Ohio Trust will be able to place branch banks in Ashtabula County.

The complete statement of the bank location problem is as follows:

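$$
\begin{aligned}
\text{Min}\quad & x_1 + x_2 + \cdots + x_{20} \\
\text{s.t.}\quad & x_1 + x_2 + x_{12} + x_{16} \ge 1 && \text{Ashtabula} \\
& x_1 + x_2 + x_3 + x_{12} \ge 1 && \text{Lake} \\
& \qquad \vdots \\
& x_{11} + x_{14} + x_{19} + x_{20} \ge 1 && \text{Carroll} \\
& x_i = 0, 1 \qquad i = 1, 2, \ldots, 20
\end{aligned}
$$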

We use Excel Solver to solve this 20-variable, 20-constraint problem formulation. In Figure 13.11, we show the optimal solution. The optimal solution calls for principal places of business in Ashland, Stark, and Geauga counties. With PPBs in these three counties, Ohio Trust can place branch banks in all 20 counties. Clearly the integer programming model could be enlarged to allow for expansion into a larger area or throughout the entire state.
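The coverage constraints can also be checked outside a spreadsheet. The sketch below is an illustrative brute-force search in Python, not the Solver approach used in the text: it tries sets of counties of increasing size and stops at the first size for which some set gives every county a PPB either in the county itself or in an adjacent county. The adjacency data are transcribed from Table 13.2; the function names are assumptions of this example.

```python
from itertools import combinations

# Adjacency from Table 13.2 (county number -> adjacent county numbers).
adjacent = {
    1: [2, 12, 16],
    2: [1, 3, 12],
    3: [2, 4, 9, 10, 12, 13],
    4: [3, 5, 7, 9],
    5: [4, 6, 7],
    6: [5, 7, 17],
    7: [4, 5, 6, 8, 9, 17, 18],
    8: [7, 9, 10, 11, 18],
    9: [3, 4, 7, 8, 10],
    10: [3, 8, 9, 11, 12, 13],
    11: [8, 10, 13, 14, 15, 18, 19, 20],
    12: [1, 2, 3, 10, 13, 16],
    13: [3, 10, 11, 12, 15, 16],
    14: [11, 15, 20],
    15: [11, 13, 14, 16],
    16: [1, 12, 13, 15],
    17: [6, 7, 18],
    18: [7, 8, 11, 17, 19],
    19: [11, 18, 20],
    20: [11, 14, 19],
}


def covers_region(ppb_counties):
    """True if every county holds a PPB or is adjacent to a county that does."""
    chosen = set(ppb_counties)
    return all(
        county in chosen or bool(chosen.intersection(adjacent[county]))
        for county in adjacent
    )


def minimum_covers():
    """Return every smallest set of PPB counties that covers the region."""
    for size in range(1, len(adjacent) + 1):
        covers = [
            s for s in combinations(sorted(adjacent), size) if covers_region(s)
        ]
        if covers:
            return covers
    return []


optimal_covers = minimum_covers()
print(optimal_covers)
```

Because the search returns every cover of the smallest size, it also shows whether the minimum-size solution is unique for this data.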

In Problem 10, we ask you to solve this problem for the entire state of Ohio.

[FIGURE 13.10 Ohio Trust County Map. The map shows the 20 numbered counties listed in Table 13.2. Model file: OhioTrust.]

Product Design and Market Share Optimization

Conjoint analysis is a market research technique that can be used to learn how prospective buyers of a product value the product’s attributes. In this section, we will show how the results of conjoint analysis can be used in an integer programming model of a product design and market share optimization problem. We illustrate the approach by considering a problem facing Salem Foods, a major producer of frozen foods.

Salem Foods is planning to enter the frozen pizza market. Currently, two existing brands, Antonio’s and King’s, have the major share of the market. In trying to develop a sausage pizza that will capture a substantial share of the market, Salem determined that the four most important attributes when consumers purchase a frozen sausage pizza are crust, cheese, sauce, and sausage flavor. The crust attribute has two levels (thin and thick); the cheese attribute has two levels (mozzarella and blend); the sauce attribute has two levels (smooth and chunky); and the sausage flavor attribute has three levels (mild, medium, and hot).

In a typical conjoint analysis, a sample of consumers is asked to express their preference for a product with chosen levels for the attributes. Then regression analysis is used to determine the part-worth for each of the attribute levels. In essence, the part-worth is the utility value that a consumer attaches to each level of each attribute. Given the part-worths from regression analysis, we will show how they can be used to determine the overall value a consumer attaches to a particular product.

Table 13.3 shows the part-worths for each level of each attribute provided by a sample of eight potential Salem customers who are currently buying either King’s or Antonio’s pizza. For consumer 1, the part-worths for the crust attribute are 11 for thin crust and 2 for thick crust, indicating a preference for thin crust. For the cheese attribute, the part-worths

TABLE 13.2 Counties in the Ohio Trust Expansion Region

| Counties Under Consideration | Adjacent Counties (by Number) |
|---|---|
| 1. Ashtabula | 2, 12, 16 |
| 2. Lake | 1, 3, 12 |
| 3. Cuyahoga | 2, 4, 9, 10, 12, 13 |
| 4. Lorain | 3, 5, 7, 9 |
| 5. Huron | 4, 6, 7 |
| 6. Richland | 5, 7, 17 |
| 7. Ashland | 4, 5, 6, 8, 9, 17, 18 |
| 8. Wayne | 7, 9, 10, 11, 18 |
| 9. Medina | 3, 4, 7, 8, 10 |
| 10. Summit | 3, 8, 9, 11, 12, 13 |
| 11. Stark | 8, 10, 13, 14, 15, 18, 19, 20 |
| 12. Geauga | 1, 2, 3, 10, 13, 16 |
| 13. Portage | 3, 10, 11, 12, 15, 16 |
| 14. Columbiana | 11, 15, 20 |
| 15. Mahoning | 11, 13, 14, 16 |
| 16. Trumbull | 1, 12, 13, 15 |
| 17. Knox | 6, 7, 18 |
| 18. Holmes | 7, 8, 11, 17, 19 |
| 19. Tuscarawas | 11, 18, 20 |
| 20. Carroll | 11, 14, 19 |

[FIGURE 13.11 Optimal Solution to the Ohio Trust Location Problem. The map highlights the counties in which a principal place of business should be located.]

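TABLE 13.3 Part-Worths for the Salem Foods Problem

| Consumer | Crust: Thin | Crust: Thick | Cheese: Mozzarella | Cheese: Blend | Sauce: Smooth | Sauce: Chunky | Sausage Flavor: Mild | Sausage Flavor: Medium | Sausage Flavor: Hot |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 11 | 2 | 6 | 7 | 3 | 17 | 26 | 27 | 8 |
| 2 | 11 | 7 | 15 | 17 | 16 | 26 | 14 | 1 | 10 |
| 3 | 7 | 5 | 8 | 14 | 16 | 7 | 29 | 16 | 19 |
| 4 | 13 | 20 | 20 | 17 | 17 | 14 | 25 | 29 | 10 |
| 5 | 2 | 8 | 6 | 11 | 30 | 20 | 15 | 5 | 12 |
| 6 | 12 | 17 | 11 | 9 | 2 | 30 | 22 | 12 | 20 |
| 7 | 9 | 19 | 12 | 16 | 16 | 25 | 30 | 23 | 19 |
| 8 | 5 | 9 | 4 | 14 | 23 | 16 | 16 | 30 | 3 |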

are 6 for the mozzarella cheese and 7 for the cheese blend; thus, consumer 1 has a slight preference for the cheese blend. From the other part-worths, we see that consumer 1 shows a strong preference for the chunky sauce over the smooth sauce (17 to 3) and has a slight preference for the medium-flavored sausage. Note that consumer 2 shows a preference for the thin crust, the cheese blend, the chunky sauce, and mild-flavored sausage. The part-worths for the other consumers are interpreted similarly.

The part-worths can be used to determine the overall value (utility) that each consumer attaches to a particular type of pizza. For instance, consumer 1’s current favorite pizza is the Antonio’s brand, which has a thick crust, mozzarella cheese, chunky sauce, and medium-flavored sausage. We can determine consumer 1’s utility for this particular type of pizza using the part-worths in Table 13.3. For consumer 1, the part-worths are 2 for thick crust, 6 for mozzarella cheese, 17 for chunky sauce, and 27 for medium-flavored sausage. Thus, consumer 1’s utility for the Antonio’s brand pizza is 2 + 6 + 17 + 27 = 52. We can compute consumer 1’s utility for a King’s brand pizza similarly. The King’s brand pizza has a thin crust, a cheese blend, smooth sauce, and mild-flavored sausage. Because the part-worths for consumer 1 are 11 for thin crust, 7 for cheese blend, 3 for smooth sauce, and 26 for mild-flavored sausage, consumer 1’s utility for the King’s brand pizza is 11 + 7 + 3 + 26 = 47. In general, each consumer’s utility for a particular type of pizza is the sum of the part-worths for the attributes of that type of pizza.
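The utility calculation is a simple lookup and sum, as the following Python sketch illustrates for consumer 1. The dictionary layout and the `utility` function are assumptions introduced for this example; the part-worths are those given for consumer 1 in Table 13.3.

```python
# Consumer 1's part-worths from Table 13.3.
part_worths = {
    "crust": {"thin": 11, "thick": 2},
    "cheese": {"mozzarella": 6, "blend": 7},
    "sauce": {"smooth": 3, "chunky": 17},
    "sausage": {"mild": 26, "medium": 27, "hot": 8},
}

# Attribute levels of the two existing brands, as described in the text.
antonios = {"crust": "thick", "cheese": "mozzarella",
            "sauce": "chunky", "sausage": "medium"}
kings = {"crust": "thin", "cheese": "blend",
         "sauce": "smooth", "sausage": "mild"}


def utility(design):
    """Sum the part-worths of the level chosen for each attribute."""
    return sum(part_worths[attribute][level] for attribute, level in design.items())


print("Antonio's:", utility(antonios))
print("King's:", utility(kings))
```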

To be successful with its brand, Salem Foods realizes that it must entice consumers in the marketplace to switch from their current favorite brand of pizza to the Salem product. In other words, Salem must design a pizza (choose the type of crust, cheese, sauce, and sausage flavor) that will have the highest utility for enough people to ensure sufficient sales to justify making the product. Assuming the sample of eight consumers in the current study is representative of the marketplace for frozen sausage pizza, we can formulate and solve an integer programming model that can help Salem come up with such a design. In marketing literature, the problem being solved is called the share of choice problem.

The decision variables are defined as follows:

$$
\begin{aligned}
l_{ij} &= 1 \text{ if Salem chooses level } i \text{ for attribute } j; \text{ 0 otherwise} \\
y_k &= 1 \text{ if consumer } k \text{ chooses the Salem brand; 0 otherwise}
\end{aligned}
$$

The objective is to choose the levels of each attribute that will maximize the number of consumers who prefer the Salem brand pizza. Because the number of consumers who prefer the Salem brand pizza is just the sum of the $y_k$ variables, the objective function is

$$
\text{Max}\quad y_1 + y_2 + \cdots + y_8
$$

One constraint is needed for each consumer in the sample. To illustrate how the constraints are formulated, let us consider the constraint corresponding to consumer 1. For consumer 1, the utility of a particular type of pizza can be expressed as the sum of the part-worths:

$$
\text{Utility for consumer 1} = 11l_{11} + 2l_{21} + 6l_{12} + 7l_{22} + 3l_{13} + 17l_{23} + 26l_{14} + 27l_{24} + 8l_{34}
$$

For consumer 1 to prefer the Salem pizza, the utility for the Salem pizza must be greater than the utility for consumer 1’s current favorite. Recall that consumer 1’s current favorite brand of pizza is Antonio’s, with a utility of 52. Thus, consumer 1 will purchase the Salem brand only if the levels of the attributes for the Salem brand are chosen such that

$$
11l_{11} + 2l_{21} + 6l_{12} + 7l_{22} + 3l_{13} + 17l_{23} + 26l_{14} + 27l_{24} + 8l_{34} > 52
$$

Given the definitions of the $y_k$ decision variables, we want $y_1 = 1$ when the consumer prefers the Salem brand and $y_1 = 0$ when the consumer does not prefer the Salem brand. Thus, we write the constraint for consumer 1 as follows:

$$
11l_{11} + 2l_{21} + 6l_{12} + 7l_{22} + 3l_{13} + 17l_{23} + 26l_{14} + 27l_{24} + 8l_{34} \ge 1 + 52y_1
$$

With this constraint, $y_1$ cannot equal 1 unless the utility for the Salem design (the left-hand side of the constraint) exceeds the utility for consumer 1’s current favorite by at least 1. Because the objective function is to maximize the sum of the $y_k$ variables, the optimization will seek a product design that will allow as many $y_k$ variables as possible to equal 1.

A similar constraint is written for each consumer in the sample. The coefficients for the $l_{ij}$ variables in the utility functions are taken from Table 13.3, and the coefficients for the $y_k$

Utility values are discussed in more detail in Chapter 15.


variables are obtained by computing the overall utility of the consumer’s current favorite brand of pizza. The following constraints correspond to the eight consumers in the study:

$$
\begin{aligned}
11l_{11} + 2l_{21} + 6l_{12} + 7l_{22} + 3l_{13} + 17l_{23} + 26l_{14} + 27l_{24} + 8l_{34} &\ge 1 + 52y_1 \\
11l_{11} + 7l_{21} + 15l_{12} + 17l_{22} + 16l_{13} + 26l_{23} + 14l_{14} + 1l_{24} + 10l_{34} &\ge 1 + 58y_2 \\
7l_{11} + 5l_{21} + 8l_{12} + 14l_{22} + 16l_{13} + 7l_{23} + 29l_{14} + 16l_{24} + 19l_{34} &\ge 1 + 66y_3 \\
13l_{11} + 20l_{21} + 20l_{12} + 17l_{22} + 17l_{13} + 14l_{23} + 25l_{14} + 29l_{24} + 10l_{34} &\ge 1 + 83y_4 \\
2l_{11} + 8l_{21} + 6l_{12} + 11l_{22} + 30l_{13} + 20l_{23} + 15l_{14} + 5l_{24} + 12l_{34} &\ge 1 + 58y_5 \\
12l_{11} + 17l_{21} + 11l_{12} + 9l_{22} + 2l_{13} + 30l_{23} + 22l_{14} + 12l_{24} + 20l_{34} &\ge 1 + 70y_6 \\
9l_{11} + 19l_{21} + 12l_{12} + 16l_{22} + 16l_{13} + 25l_{23} + 30l_{14} + 23l_{24} + 19l_{34} &\ge 1 + 79y_7 \\
5l_{11} + 9l_{21} + 4l_{12} + 14l_{22} + 23l_{13} + 16l_{23} + 16l_{14} + 30l_{24} + 3l_{34} &\ge 1 + 59y_8
\end{aligned}
$$

Four more constraints must be added, one for each attribute. These constraints are necessary to ensure that one and only one level is selected for each attribute. For attribute 1 (crust), we must add the constraint:

$$
l_{11} + l_{21} = 1
$$

Because $l_{11}$ and $l_{21}$ are both binary variables, this constraint requires that one of the two variables equals one, and the other equals zero. The following three constraints ensure that one and only one level is selected for each of the other three attributes:

$$
\begin{aligned}
l_{12} + l_{22} &= 1 \\
l_{13} + l_{23} &= 1 \\
l_{14} + l_{24} + l_{34} &= 1
\end{aligned}
$$

The data, model, and solution for the Salem pizza problem may be found in the file Salem. The optimal solution to this 17-variable, 12-constraint integer linear program is $l_{11} = l_{22} = l_{23} = l_{14} = 1$ and $y_1 = y_2 = y_6 = y_7 = 1$. The value of the optimal solution is 4, indicating that if Salem makes this type of pizza, it will be preferable to the current favorite for four of the eight consumers. With $l_{11} = l_{22} = l_{23} = l_{14} = 1$, the pizza design that obtains the largest market share for Salem has a thin crust, a cheese blend, a chunky sauce, and mild-flavored sausage. Substituting this design into the eight consumer constraints with the part-worths in Table 13.3 shows that the constraints for consumers 1, 2, 6, and 7 can hold with $y_k = 1$, so these four consumers will prefer the Salem pizza. This information may lead Salem to choose to market this type of pizza.
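With only 2 × 2 × 2 × 3 = 24 possible pizza designs, the share of choice model can be checked by evaluating every design against the eight consumer constraints. The Python sketch below is an illustration under that brute-force assumption, not the integer programming solution method used in the text; the names `TABLE`, `FAVORITE`, `utility`, and `switchers` are introduced here for the example. A consumer is counted as switching when the design's utility is at least 1 more than the utility of that consumer's current favorite, which mirrors the constraint form used above.

```python
from itertools import product

# Column order of Table 13.3.
COLUMNS = ["thin", "thick", "mozzarella", "blend", "smooth", "chunky",
           "mild", "medium", "hot"]
# Available levels for crust, cheese, sauce, and sausage flavor.
LEVELS = [("thin", "thick"), ("mozzarella", "blend"),
          ("smooth", "chunky"), ("mild", "medium", "hot")]

# Part-worths from Table 13.3 (consumer -> row of values).
TABLE = {
    1: [11, 2, 6, 7, 3, 17, 26, 27, 8],
    2: [11, 7, 15, 17, 16, 26, 14, 1, 10],
    3: [7, 5, 8, 14, 16, 7, 29, 16, 19],
    4: [13, 20, 20, 17, 17, 14, 25, 29, 10],
    5: [2, 8, 6, 11, 30, 20, 15, 5, 12],
    6: [12, 17, 11, 9, 2, 30, 22, 12, 20],
    7: [9, 19, 12, 16, 16, 25, 30, 23, 19],
    8: [5, 9, 4, 14, 23, 16, 16, 30, 3],
}

ANTONIOS = ("thick", "mozzarella", "chunky", "medium")
KINGS = ("thin", "blend", "smooth", "mild")
# Current favorite brand of each consumer, as stated in the text.
FAVORITE = {1: ANTONIOS, 2: KINGS, 3: KINGS, 4: ANTONIOS,
            5: KINGS, 6: ANTONIOS, 7: ANTONIOS, 8: ANTONIOS}


def utility(consumer, design):
    """Sum of the consumer's part-worths for the levels in the design."""
    row = TABLE[consumer]
    return sum(row[COLUMNS.index(level)] for level in design)


def switchers(design):
    """Consumers whose utility for the design beats their favorite by at least 1."""
    return [k for k in TABLE
            if utility(k, design) >= 1 + utility(k, FAVORITE[k])]


# Evaluate all 24 designs and keep the one preferred by the most consumers.
best_design = max(product(*LEVELS), key=lambda d: len(switchers(d)))
print(best_design, switchers(best_design))
```

The same functions can be used to test any single design, for example `switchers(("thick", "blend", "smooth", "mild"))`.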

13.5 Modeling Flexibility Provided by Binary Variables

In Section 13.4, we presented four applications involving binary integer variables. In this section, we continue the discussion of the use of binary integer variables in modeling. First, we show how binary integer variables can be used to model multiple-choice and mutually exclusive constraints. Then we show how binary integer variables can be used to model situations in which k projects out of a set of n projects must be selected, as well as situations in which the acceptance of one project is conditional on the acceptance of another project.

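Multiple-Choice and Mutually Exclusive Constraints

Recall the Ice-Cold Refrigerator capital budgeting problem introduced in Section 13.4. The decision variables were defined as follows:

$$
\begin{aligned}
P &= 1 \text{ if the plant expansion project is accepted; 0 if rejected} \\
W &= 1 \text{ if the warehouse expansion project is accepted; 0 if rejected} \\
M &= 1 \text{ if the new machinery project is accepted; 0 if rejected} \\
R &= 1 \text{ if the new product research project is accepted; 0 if rejected}
\end{aligned}
$$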

Suppose that, instead of one warehouse expansion project, the Ice-Cold Refrigerator Company actually has three warehouse expansion projects under consideration. One of the warehouses must be expanded because of increasing product demand, but new demand is not sufficient to make expansion of more than one warehouse necessary. The following

Antonio’s brand is the current favorite pizza for consumers 1, 4, 6, 7, and 8. King’s brand is the current favorite pizza for consumers 2, 3, and 5.


variable definitions and multiple-choice constraint could be incorporated into the previous binary integer linear programming model to reflect this situation. Let:

$$
\begin{aligned}
W_1 &= 1 \text{ if the original warehouse expansion project is accepted; 0 if rejected} \\
W_2 &= 1 \text{ if the second warehouse expansion project is accepted; 0 if rejected} \\
W_3 &= 1 \text{ if the third warehouse expansion project is accepted; 0 if rejected}
\end{aligned}
$$

The multiple-choice constraint reflecting the requirement that exactly one of these projects must be selected is

$$
W_1 + W_2 + W_3 = 1
$$

If $W_1$, $W_2$, and $W_3$ are allowed to assume only the values 0 or 1, then one and only one of these projects will be selected from among the three choices.

If the requirement that one warehouse must be expanded did not exist, the multiple-choice constraint could be modified as follows:

$$
W_1 + W_2 + W_3 \le 1
$$

This modification allows for the case of no warehouse expansion ($W_1 = W_2 = W_3 = 0$) but does not permit more than one warehouse to be expanded. This type of constraint is often called a mutually exclusive constraint.

k Out of n Alternatives Constraint

An extension of the notion of a multiple-choice constraint can be used to model situations in which k out of a set of n projects must be selected; this is a k out of n alternatives constraint. Suppose that $W_1$, $W_2$, $W_3$, $W_4$, and $W_5$ represent five potential warehouse expansion projects and that two of the five projects must be accepted. The constraint that satisfies this new requirement is

$$
W_1 + W_2 + W_3 + W_4 + W_5 = 2
$$

If no more than two of the projects are to be selected, we would use the following less-than-or-equal-to constraint:

$$
W_1 + W_2 + W_3 + W_4 + W_5 \le 2
$$

Again, each of these variables must be restricted to binary values.
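One way to see what these constraints allow is to count the 0/1 assignments that satisfy them. The short Python sketch below does this by enumeration; the function name `count_selections` and its `exactly` flag are assumptions introduced for the illustration.

```python
from itertools import product


def count_selections(n, k, exactly=True):
    """Count 0/1 settings of n project variables whose sum is == k (or <= k)."""
    total = 0
    for settings in product((0, 1), repeat=n):
        selected = sum(settings)
        satisfied = (selected == k) if exactly else (selected <= k)
        if satisfied:
            total += 1
    return total


# Multiple-choice constraint W1 + W2 + W3 = 1
print(count_selections(3, 1))
# Mutually exclusive constraint W1 + W2 + W3 <= 1
print(count_selections(3, 1, exactly=False))
# k out of n constraint W1 + ... + W5 = 2
print(count_selections(5, 2))
# At most two of five projects: W1 + ... + W5 <= 2
print(count_selections(5, 2, exactly=False))
```

The mutually exclusive form admits exactly one more setting than the multiple-choice form: the one in which no project is selected.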

Conditional and Corequisite Constraints

Sometimes the acceptance of one project is conditional on the acceptance of another. For example, suppose that for the Ice-Cold Refrigerator Company, the warehouse expansion project was conditional on the plant expansion project. In other words, suppose management will not consider expanding the warehouse unless the plant is expanded. With P a binary variable representing plant expansion (1 = expand, 0 = do not expand) and W a binary variable representing warehouse expansion (1 = expand, 0 = do not expand), a conditional constraint needs to be developed to enforce the requirement that the warehouse cannot be expanded unless the plant has been expanded.

When faced with this type of conditional constraint, it is often helpful to construct a feasibility table. A feasibility table is a table that lists all possible settings of the relevant binary variables and indicates which settings of these variables are feasible and which are not feasible. In the Ice-Cold Refrigerator case, we have the following feasibility table:

| W | P | Feasible? | Rationale |
|---|---|---|---|
| 0 | 0 | Yes | We can choose to expand neither. |
| 1 | 0 | No | We cannot expand the warehouse if the plant is not expanded. |
| 0 | 1 | Yes | We can choose not to expand the warehouse, even if we expand the plant. |
| 1 | 1 | Yes | We can expand both the warehouse and the plant. |

Notice that W is less than or equal to P for the feasible cases and W is greater than P in the infeasible case. Hence, the conditional constraint that enforces the restriction is

$$
W \le P
$$

Let us consider another situation where the warehouse and plant expansions are dependent on each other. If the warehouse expansion project had to be accepted whenever the plant expansion project was accepted, and vice versa, we would say that we have a corequisite constraint. So, if we choose to expand either, the other must be expanded. In this situation, we have the following feasibility table:

| W | P | Feasible? | Rationale |
|---|---|---|---|
| 0 | 0 | Yes | We can choose to expand neither. |
| 1 | 0 | No | We cannot expand the warehouse if the plant is not expanded. |
| 0 | 1 | No | If the warehouse is not expanded, we cannot expand the plant. |
| 1 | 1 | Yes | We can expand both the warehouse and the plant. |

NOTES + COMMENTS

1. As in the Ice-Cold Refrigerator examples with conditional and corequisite constraints, many restrictions will involve only two binary variables. Because the feasibility table lists all possible cases (settings of the binary variables), there are $2^2 = 4$ cases when there are two variables. Some conditional and corequisite constraints might involve three or more variables. For three variables, there are $2^3 = 8$ cases; for four variables, there are $2^4 = 16$ cases. In general, for n variables, there will be $2^n$ cases. Therefore, for situations involving more than three variables, feasibility tables can become cumbersome.

2. A somewhat natural way to try to model a conditional or corequisite constraint in Excel is to use an IF function. However, since the IF function is a discontinuous function (i.e., a function with a break or jump in the function value), using an IF statement will preclude the use of the LP Simplex option as discussed in Section 13.3. While the nonlinear option in Excel Solver (discussed in Chapter 14) can sometimes find good results even with the use of an IF function, optimality cannot always be guaranteed. Therefore, we recommend you model conditional constraints in a linear way as discussed in Section 13.5.

In this feasibility table, we see that when W and P are set to the same value, the result is feasible, but different settings of W and P are infeasible. Hence, in the corequisite situation, the following constraint enforces the restriction:

$$
W = P
$$

The constraint forces P and W to take on the same value.
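A feasibility table can be generated mechanically by testing a candidate linear constraint on every setting of the binary variables. The Python sketch below does this for the conditional and corequisite cases; the helper name `feasibility_table` is an assumption of this example.

```python
from itertools import product


def feasibility_table(constraint):
    """Map every (W, P) setting to True/False according to the constraint."""
    return {(w, p): constraint(w, p) for w, p in product((0, 1), repeat=2)}


# Conditional constraint: the warehouse can be expanded only if the plant is.
conditional = feasibility_table(lambda w, p: w <= p)

# Corequisite constraint: expand both or neither.
corequisite = feasibility_table(lambda w, p: w == p)

for (w, p), feasible in conditional.items():
    print("W =", w, "P =", p, "conditional feasible:", feasible)
for (w, p), feasible in corequisite.items():
    print("W =", w, "P =", p, "corequisite feasible:", feasible)
```

Comparing the generated table with the one written from the business rule is a quick check that a proposed constraint enforces exactly the intended restriction.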

13.6 Generating Alternatives in Binary Optimization

If alternative optimal solutions exist, it would be good for management to know this because some factors that make one alternative preferred over another might not be included in the model. Also, if the solution is a unique optimal solution, it would be good to know how much worse the second-best solution is than the unique optimal solution. If the second-best solution is very close to optimal, it might be preferred over the true optimal solution because of factors outside the model.

As an example, let us reconsider the Ohio Trust location problem presented in Section 13.4. The minimum number of principal places of business (PPBs) is three. As shown in Figure 13.11, the solution is to place PPBs in county 7 (Ashland), county 11 (Stark), and county 12 (Geauga). However, suppose that when Ohio Trust tries to implement this solution, it is not possible to find a suitable location for a PPB in one of these three counties. Are there alternative solutions using three counties, or is this the unique optimal solution? By adding a special constraint based on the current solution and then re-solving the model, we can answer this question.

The current solution for Ohio Trust can be broken into two sets of variables: those that are set to one and those that are set to zero. Let the set O denote the set of variables set to one and the set Z those that are set to zero. For the Ohio Trust solution, these sets are as follows:

$$
\begin{aligned}
\text{Set } O &: \; x_7, x_{11}, x_{12} \\
\text{Set } Z &: \; x_1, x_2, x_3, x_4, x_5, x_6, x_8, x_9, x_{10}, x_{13}, x_{14}, x_{15}, x_{16}, x_{17}, x_{18}, x_{19}, x_{20}
\end{aligned}
$$

We may add the following constraint:

$$
(\text{Sum of variables in the set } O) - (\text{sum of variables in the set } Z) \le (\text{number of variables in the set } O) - 1,
$$

which for our current solution is

$$
\begin{aligned}
x_7 + x_{11} + x_{12} &- x_1 - x_2 - x_3 - x_4 - x_5 - x_6 - x_8 - x_9 - x_{10} - x_{13} \\
&- x_{14} - x_{15} - x_{16} - x_{17} - x_{18} - x_{19} - x_{20} \le 3 - 1 = 2
\end{aligned}
$$

This constraint has the very special property that it makes the current solution infeasible, but keeps feasible all other solutions that are feasible to the original problem. This constraint will force (at least) one of the variables in set O to change from one to zero or will force (at least) one of the variables in set Z to change from zero to one.

When we append this new constraint to the original model, we obtain the solution displayed in Figure 13.12. Notice that the optimal objective function value has increased to

FIGURE 13.12 A Second-Best Solution to the Ohio Trust Location Problem

[Map of the 20-county region. Map labels: Lake Erie, Ohio, Pennsylvania, West Virginia. Legend: A principal place of business should be located in these counties.]

Counties: 1. Ashtabula; 2. Lake; 3. Cuyahoga; 4. Lorain; 5. Huron; 6. Richland; 7. Ashland; 8. Wayne; 9. Medina; 10. Summit; 11. Stark; 12. Geauga; 13. Portage; 14. Columbiana; 15. Mahoning; 16. Trumbull; 17. Knox; 18. Holmes; 19. Tuscarawas; 20. Carroll

630 chapter 13 Integer Linear Optimization Models

four. This tells us that the solution we found in Section 13.4, with objective function value equal to 3, is the unique optimal solution. Any other feasible solution will require four or more PPBs to cover the entire 20-county region. So, if we cannot find a suitable location for a PPB in any one of the three counties in the original solution, the next-best solution will require PPBs in four counties, and the solution in Figure 13.12 is a second-best alternative. Note that if the optimal objective function value of the new problem with the constraint added had been 3, we would have found an alternative optimal solution.

We can summarize the procedure for finding an alternative solution as follows:

Step 1. Solve the original problem.

Step 2. Create two sets:

    O = the set of variables equal to one in Step 1
    Z = the set of variables equal to zero in Step 1

Step 3. Add the following constraint to the original problem, and solve:

    (Sum of variables in the set O) − (sum of variables in the set Z) ≤ (number of variables in the set O) − 1    (13.1)

If the objective function value in Step 3 is equal to the objective function value of Step 1, we have found an alternative optimal solution. If the objective function value of Step 3 is inferior to that of Step 1, we have found a next-best solution.
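The three steps can be sketched in code. The following is a minimal illustration rather than the approach used in the text: it assumes a small hypothetical covering instance (four candidate sites, four regions) and replaces the optimization solver with brute-force enumeration, which is practical only for a handful of binary variables. The names `exclusion_cut`, `satisfies`, and `solve` are illustrative.

```python
from itertools import product

def exclusion_cut(solution):
    """Build constraint (13.1) from a binary solution; returns (O, Z, right-hand side)."""
    ones = [j for j, v in enumerate(solution) if v == 1]   # set O
    zeros = [j for j, v in enumerate(solution) if v == 0]  # set Z
    return ones, zeros, len(ones) - 1

def satisfies(x, cut):
    """Check (sum over O) - (sum over Z) <= |O| - 1 for binary vector x."""
    ones, zeros, rhs = cut
    return sum(x[j] for j in ones) - sum(x[j] for j in zeros) <= rhs

def solve(covers, n_regions, cuts=()):
    """Toy solver: minimize the number of chosen sites by full enumeration."""
    best = None
    for x in product((0, 1), repeat=len(covers)):
        covered = set()
        for j, chosen in enumerate(x):
            if chosen:
                covered |= covers[j]
        if len(covered) < n_regions:
            continue  # some region is not covered
        if not all(satisfies(x, cut) for cut in cuts):
            continue  # excluded by an earlier constraint of type (13.1)
        if best is None or sum(x) < sum(best):
            best = x
    return best

# Hypothetical data: covers[j] is the set of regions covered by site j.
covers = [{0, 1}, {2, 3}, {0, 2}, {1, 3}]

cuts = []
first = solve(covers, 4, cuts)     # Step 1
cuts.append(exclusion_cut(first))  # Steps 2 and 3
second = solve(covers, 4, cuts)
cuts.append(exclusion_cut(second)) # apply the procedure again, keeping the earlier cut
third = solve(covers, 4, cuts)
print(first, second, third)
```

In this toy instance the second solve returns a different solution with the same objective value as the first (two sites), so it is an alternative optimum; the third solve requires three sites, so it is a next-best solution.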

NOTES + COMMENTS

1. The procedure just described can be applied iteratively. In other words, we can take the second-best solution found and create the equation (13.1) based on that solution to find the next-best solution. Note that we leave all previous constraints in the problem, including the first constraint based on equation (13.1). The resulting solution could be a third-best solution or an alternative second-best solution. It turns out that there are numerous second-best solutions to the Ohio Trust problem using four PPBs.

2. Applying equation (13.1) iteratively and finding that the objective function value does not deteriorate generates an alternative optimal solution. In fact, applying equation (13.1) iteratively until the objective function value changes ensures you have found all alternative optima.

SUMMARY

In this chapter we introduced the important extension of linear programming referred to as integer linear programming. The only difference between the integer linear programming problems discussed in this chapter and the linear programming problems studied in the previous chapter is that one or more of the variables must be an integer. If all variables must be integer, we have an all-integer linear program. If some, but not necessarily all, variables must be an integer, we have a mixed-integer linear program. Most integer programming applications involve binary variables.

Studying integer linear programming is important for two major reasons. First, integer linear programming may be helpful when fractional values for the variables are not permitted. Rounding a linear programming solution may not provide an optimal integer solution; methods for finding optimal integer solutions are needed when the economic consequences of rounding are substantial. A second reason for studying integer linear programming is the increased modeling flexibility provided through the use of binary variables. We showed how binary variables could be used to model important managerial considerations in capital budgeting, fixed cost, facility location, and product design/market share applications. We showed how to generate second-best solutions or alternative optima if they exist by adding a constraint based on those solutions. This is important for providing alternatives for management.

The number of applications of integer linear programming continues to grow rapidly, partly because of the availability of good integer linear programming software packages. As researchers develop solution procedures capable of solving larger integer linear programs and as computer speed increases, a continuation of the growth of integer programming applications is expected.

GLOSSARY

All-integer linear program: An integer linear program in which all variables are required to be integers.

Binary integer linear program: An all-integer or mixed-integer linear program in which the integer variables are permitted to assume only the values 0 or 1. Also called binary integer program.

Capital budgeting problem: A binary integer programming problem that involves choosing which possible projects or activities provide the best investment return.

Conditional constraint: A constraint involving binary variables that does not allow certain variables to equal one unless certain other variables are equal to one.

Conjoint analysis: A market research technique that can be used to learn how prospective buyers of a product value the product's attributes.

Convex hull: The smallest intersection of linear inequalities that contain a certain set of points.

Corequisite constraint: A constraint requiring that two binary variables be equal and that they are both either in or out of the solution.

Feasibility table: A table that is useful in modeling conditional and corequisite constraints with binary variables. The table lists all possible settings of the relevant binary variables and indicates which settings of these variables are feasible and which are not feasible.

Fixed-cost problem: A binary mixed-integer programming problem in which the binary variables represent whether an activity, such as a production run, is undertaken (variable = 1) or not (variable = 0).

Integer linear program: A linear program with the additional requirement that one or more of the variables must be an integer.

k out of n alternatives constraint: An extension of the multiple-choice constraint that requires that the sum of n binary variables equals k.

Location problem: A binary integer programming problem in which the objective is to select the best locations to meet a stated objective. Variations of this problem (see the bank location problem in Section 13.4) are known as covering problems.

LP Relaxation: The linear program that results from dropping the integer requirements for the variables in an integer linear program.

Mixed-integer linear program: An integer linear program in which some, but not necessarily all, variables are required to be integers.

Multiple-choice constraint: A constraint requiring that the sum of two or more binary variables equals one. Thus, any feasible solution makes a choice of which variable to set equal to one.

Mutually exclusive constraint: A constraint requiring that the sum of two or more binary variables be less than or equal to one. Thus, if one of the variables equals one, the others must equal zero. However, all variables could equal zero.

Part-worth: The utility value that a consumer attaches to each level of each attribute in a conjoint analysis model.

Product design and market share optimization problem: Sometimes called the share of choice problem, the choice of a product design that maximizes the number of consumers that prefer it.


PROBLEMS

1. King City Inc. manufactures machine tools. The production planner who oversees the production of two of King City's machines needs to determine how many of each to produce this month. The two machines, TopLathe and BigPress, each require a certain common component. Each TopLathe requires 10 of these components and each BigPress requires 7. Only 49 components are available this month. The sales department requires that the total number of machines produced in a month must be at least 5 (the number of TopLathes plus the number of BigPresses must be at least 5). The profit is $50,000 for a TopLathe and $34,000 for a BigPress.

a. Assuming that adequate labor and all other resources are available, formulate an integer programming model to determine how many of each product King City should produce to maximize profit.

b. Solve the model formulated in part a without integer requirements. What is the optimal profit? What are the optimal values for TopLathe and BigPress?

c. Round the TopLathe and BigPress values found in part b. Is the solution feasible? Why?

d. Truncate the TopLathe and BigPress values found in part b (drop the fractional part of each value). Is the solution feasible? Why?

e. Add integer requirements to the model you constructed in part b. What is the optimal profit and what are the optimal numbers of TopLathes and BigPresses?

2. Hospital administrators must schedule nurses so that the hospital’s patients are provided adequate care. At the same time, careful attention must be paid to keeping costs down. From historical records, administrators can project the minimum number of nurses required to be on hand for various times of day and days of the week. The objective is to find the minimum total number of nurses required to provide adequate care.

Nurses start work at the beginning of one of the four-hour shifts given below (except for shift 6) and work for 8 consecutive hours. Hence, possible start times are the start of shifts 1 through 5. Also, assume that the projected required number of nurses factors in time for each nurse to have a meal break.

Formulate and solve the nurse scheduling problem as an integer program for one day for the data given below.

Shift   Time                     Minimum Number of Nurses Needed
1       12:00 a.m.–4:00 a.m.     10
2       4:00 a.m.–8:00 a.m.      24
3       8:00 a.m.–12:00 p.m.     18
4       12:00 p.m.–4:00 p.m.     10
5       4:00 p.m.–8:00 p.m.      23
6       8:00 p.m.–12:00 a.m.     17

Hint: Note that exceeding the minimum number of needed nurses in each shift is acceptable so long as the total number of nurses over all shifts is minimized.

Data file: NurseSchedule

3. STAR Co. provides paper to smaller companies with volumes that are not large enough to warrant dealing directly with the paper mill. STAR receives 100-feet-wide paper rolls from the mill and cuts the rolls into smaller rolls of widths 12, 15, and 30 feet. The demands for these widths vary from week to week. The following cutting patterns have been established:

Pattern Number   12-ft   15-ft   30-ft   Trim Loss (ft)
1                0       6       0       10
2                0       0       3       10
3                8       0       0       4
4                2       1       2       1
5                7       1       0       1

Trim loss is the leftover paper from a pattern (e.g., for pattern 4, 2(12) + 1(15) + 2(30) = 99 feet used results in 100 − 99 = 1 foot of trim loss). Orders in hand for the coming week are 5,670 12-foot rolls, 1,680 15-foot rolls, and 3,350 30-foot rolls. Any of the three types of rolls produced in excess of the orders in hand will be sold on the open market at the selling price. No inventory is held.

a. Formulate an integer programming model that will determine how many 100-foot rolls to cut into each of the five patterns in order to minimize trim loss.

b. Solve the model formulated in part a. What is the minimal amount of trim loss? How many of each pattern should be used and how many of each type of roll will be sold on the open market?
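The trim-loss arithmetic in Problem 3 can be reproduced for every pattern with a few lines of code. This is a minimal sketch that only recomputes the trim-loss column of the cutting-pattern table from the roll counts given there; the names `PATTERNS` and `trim_loss` are illustrative, and the sketch does not formulate or solve the optimization model.

```python
ROLL_WIDTH = 100       # width in feet of a roll received from the mill
WIDTHS = (12, 15, 30)  # widths in feet of the smaller rolls

# Number of 12-ft, 15-ft, and 30-ft rolls cut under each pattern (from the table).
PATTERNS = {
    1: (0, 6, 0),
    2: (0, 0, 3),
    3: (8, 0, 0),
    4: (2, 1, 2),
    5: (7, 1, 0),
}

def trim_loss(counts):
    """Leftover width after cutting the given numbers of 12-, 15-, and 30-ft rolls."""
    used = sum(n * w for n, w in zip(counts, WIDTHS))
    return ROLL_WIDTH - used

losses = {pattern: trim_loss(counts) for pattern, counts in PATTERNS.items()}
print(losses)  # {1: 10, 2: 10, 3: 4, 4: 1, 5: 1}
```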

4. Brooks Development Corporation (BDC) faces the following capital budgeting decision. Six real estate projects are available for investment. The net present value and expenditures required for each project (in millions of dollars) are as follows:

Project                             1     2     3     4     5      6
Net Present Value ($ Millions)      $15   $5    $13   $14   $20    $9
Expenditure Required ($ Millions)   $90   $34   $81   $70   $114   $50

There are conditions that limit the investment alternatives:

• At least two of projects 1, 3, 5, and 6 must be undertaken.
• If either project 3 or 5 is undertaken, they must both be undertaken.
• Project 4 cannot be undertaken unless both projects 1 and 3 also are undertaken.

The budget for this investment period is $220 million.

a. Formulate a binary integer program that will enable BDC to find the projects to invest in to maximize net present value, while satisfying all project restrictions and not exceeding the budget.

b. Solve the model formulated in part a. What is the optimal net present value? Which projects will be undertaken? How much of the budget is unused?

Data file: Brooks

5. Spencer Enterprises is attempting to choose among a series of new investment alternatives. The potential investment alternatives, the net present value of the future stream of returns, the capital requirements, and the available capital funds over the next three years are summarized as follows:

                                                          Capital Requirements ($)
Alternative                     Net Present Value ($)     Year 1    Year 2    Year 3
Limited warehouse expansion     4,000                     3,000     1,000     4,000
Extensive warehouse expansion   6,000                     2,500     3,500     3,500
Test market new product         10,500                    6,000     4,000     5,000
Advertising campaign            4,000                     2,000     1,500     1,800
Basic research                  8,000                     5,000     1,000     4,000
Purchase new equipment          3,000                     1,000     500       900
Capital funds available                                   10,500    7,000     8,750

a. Develop and solve an integer programming model for maximizing the net present value.

b. Assume that only one of the warehouse expansion projects can be implemented. Modify your model from part (a).

c. Suppose that if test marketing of the new product is carried out, the advertising campaign also must be conducted. Modify your formulation from part (b) to reflect this new situation.


6. Morgan Inc. is planning the purchase of one of the component parts it needs for its finished product. The anticipated demands for the component for the next 12 periods are shown in the following table. The cost to order the component (labor, shipping, and paperwork) is $150. The cost to hold these components in inventory is $1 per component per period. The price of the component is expected to remain stable at $12 per unit for the next 12 periods, and no quantity discounts are available. The maximum order size is 1,000 units.

Period   1    2    3    4    5     6     7     8     9     10   11   12
Demand   20   20   30   40   140   360   500   540   460   80   0    20

a. Formulate a model to minimize the total cost of satisfying Morgan Inc.’s demand for this component.

b. Solve the model formulated in part a. What is the optimal cost? How many orders are placed?

7. Grave City is considering the relocation of several police substations to obtain better enforcement in high-crime areas. The locations under consideration together with the areas that can be covered from these locations are given in the following table:

Potential Locations for Substations Areas Covered

A 1, 5, 7

B 1, 2, 5, 7

C 1, 3, 5

D 2, 4, 5

E 3, 4, 6

F 4, 5, 6

G 1, 5, 6, 7

a. Formulate an integer programming model that could be used to find the minimum number of locations necessary to provide coverage to all areas.

b. Solve the problem in part (a).

8. Hart Manufacturing makes three products. Each product requires manufacturing operations in three departments: A, B, and C. The labor-hour requirements, by department, are as follows:

Department Product 1 Product 2 Product 3

A 1.50 3.00 2.00

B 2.00 1.00 2.50

C 0.25 0.25 0.25

During the next production period the labor-hours available are 450 in department A, 350 in department B, and 50 in department C. The profit contributions per unit are $25 for product 1, $28 for product 2, and $30 for product 3.

a. Formulate a linear programming model for maximizing total profit contribution.

b. Solve the linear program formulated in part (a). How much of each product should be produced, and what is the projected total profit contribution?

c. After evaluating the solution obtained in part (b), one of the production supervisors noted that production setup costs had not been taken into account. She noted that setup costs are $400 for product 1, $550 for product 2, and $600 for product 3. If the solution developed in part (b) is to be used, what is the total profit contribution after taking into account the setup costs?

Data file for Problem 6: Morgan


d. Management realized that the optimal product mix, taking setup costs into account, might be different from the one recommended in part (b). Formulate a mixed-integer linear program that takes setup costs provided in part (c) into account. Management also stated that we should not consider making more than 175 units of product 1, 150 units of product 2, or 140 units of product 3.

e. Solve the mixed-integer linear program formulated in part (d). How much of each product should be produced and what is the projected total profit contribution? Compare this profit contribution to that obtained in part (c).

9. Offhaus Manufacturing produces office supplies but outsources the delivery of its products to third-party carriers. Offhaus ships to 20 cities from its Dayton, Ohio, manufacturing facility and has asked a variety of carriers to bid on its business. Seven carriers have responded with bids. The resulting bids (in dollars per truckload) are shown in the table. For example, the table shows that carrier 1 bid on the business to cities 11 to 20. The right side of the table provides the number of truckloads scheduled for each destination in the next quarter.

Bid Carrier Carrier Carrier Carrier Carrier Carrier Carrier Demand

$/Truckload 1 2 3 4 5 6 7 Destination (truckloads)

City 1 $2,188 $1,666 $1,790 City 1 30

City 2 $1,453 $2,602 $1,767 City 2 10

City 3 $1,534 $2,283 $1,857 $1,870 City 3 20

City 4 $1,687 $2,617 $1,738 City 4 40

City 5 $1,523 $2,239 $1,771 $1,855 City 5 10

City 6 $1,521 $1,571 $1,545 City 6 10

City 7 $2,100 $1,922 $1,938 $2,050 City 7 12

City 8 $1,800 $1,432 $1,416 $1,739 City 8 25

City 9 $1,134 $1,233 $1,181 $1,150 City 9 25

City 10 $672 $610 $669 $678 City 10 33

City 11 $724 $723 $627 $657 $706 City 11 11

City 12 $766 $766 $721 $682 $733 City 12 29

City 13 $741 $745 $682 $733 City 13 12

City 14 $815 $800 $828 $745 $832 City 14 24

City 15 $904 $880 $891 $914 City 15 10

City 16 $958 $933 $891 $914 City 16 10

City 17 $925 $929 $937 $984 City 17 23

City 18 $892 $869 $822 $829 $864 City 18 25

City 19 $927 $969 $967 $1,008 City 19 12

City 20 $963 $938 $955 $995 City 20 10

No. of Bids 10 10 10 7 20 5 18

Because dealing with too many carriers can be cumbersome, Offhaus would like to limit the number of carriers it uses to three. Also, for customer relationship reasons Offhaus wants each city to be assigned to only one carrier (i.e., no splitting of the demand to a given city across carriers).

a. Develop a model that will yield the three selected carriers and the city-carrier assignments that minimize the cost of shipping. Solve the model and report the solution.

b. Offhaus is not sure whether three is the correct number of carriers to select. Run the model you developed in part (a) for allowable carriers varying from one to seven. Based on results, how many carriers would you recommend and why?

Data file for Problem 9: Offhaus


10. The Martin-Beck Company operates a plant in St. Louis with an annual capacity of 30,000 units. Product is shipped to regional distribution centers located in Boston, Atlanta, and Houston. Because of an anticipated increase in demand, Martin-Beck plans to increase capacity by constructing a new plant in one or more of the following cities: Detroit, Toledo, Denver, or Kansas City. The estimated annual fixed cost and the annual capacity for the four proposed plants are as follows:

Proposed Plant   Annual Fixed Cost   Annual Capacity
Detroit          $175,000            10,000
Toledo           $300,000            20,000
Denver           $375,000            30,000
Kansas City      $500,000            40,000

The company's long-range planning group developed forecasts of the anticipated annual demand at the distribution centers as follows:

Distribution Center   Annual Demand
Boston                30,000
Atlanta               20,000
Houston               20,000

The shipping cost per unit from each plant to each distribution center is as follows:

                 Distribution Centers
Plant Site       Boston   Atlanta   Houston
Detroit          5        2         3
Toledo           4        3         4
Denver           9        7         5
Kansas City      10       4         2
St. Louis        8        4         3

a. Formulate a mixed-integer programming model that could be used to help Martin-Beck determine which new plant or plants to open in order to satisfy anticipated demand.

b. Solve the model you formulated in part (a). What is the optimal cost? What is the optimal set of plants to open?

c. Using equation (13.1), find a second-best solution. What is the increase in cost versus the best solution from part (b)?

11. Galaxy Cloud Services operates several data centers across the United States containing servers that store and process the data on the Internet. Suppose that Galaxy Cloud Services currently has five outdated data centers: one each in Michigan, Ohio, and California and two in New York. Management is considering increasing the capacity of these data centers to keep up with increasing demand. Each data center contains servers that are dedicated to Secure data and to Super Secure data. The cost to update each data center and the resulting increase in server capacity for each type of server are as follows:

Data Center   Cost ($ millions)   Secure Servers   Super Secure Servers
Michigan      2.5                 50               30
New York 1    3.5                 80               40
New York 2    3.5                 40               80
Ohio          4.0                 90               60
California    2.0                 20               30

The projected needs are for a total increase in capacity of 90 Secure servers and 90 Super Secure servers. Management wants to determine which data centers to update to meet projected needs and, at the same time, minimize the total cost of the added capacity.

a. Formulate a binary integer programming model that could be used to determine the optimal solution to the capacity increase question facing management.

b. Solve the model formulated in part (a) to provide a recommendation for management.

12. CHB, Inc., a bank holding company, is evaluating the potential for expanding into the State of Ohio. State law permits establishing branches in any county that is adjacent to a county in which a PPB (principal place of business) is located. The following map shows the State of Ohio. The file CHB contains an adjacency matrix with a one in the ith row and jth column indicating that the counties represented by the ith row and the jth column share a border. A zero indicates that the two counties do not share a border.

[Map of the State of Ohio with numbered counties]

Formulate and solve a linear binary model that will tell CHB the minimum number of PPBs required and their location in order to allow CHB to put a branch in every county in Ohio.

Data file: CHB


13. For Problem 12, use equation (13.1) to determine whether your solution to Problem 12 is unique. If your solution is not unique, use equation (13.1) iteratively to find all alternative optimal solutions. How many are there?

14. Consider again the CHB, Inc. problem described in Problem 12. Suppose only a limited number of PPBs can be placed. CHB would like to place this limited number of PPBs in counties so that the allowable branches can reach the maximum possible population. The file CHBPop contains the county adjacency matrix described in Problem 12 as well as the population of each county.

a. Assume that only a fixed number of PPBs, denoted by k, can be established. Formulate a linear binary integer program that will tell CHB, Inc. where to locate the fixed number of PPBs in order to maximize the population reached. (Hint: Review the Ohio Trust formulation in Section 13.4. Introduce a binary variable $y_i$ such that $y_i = 1$ if county i can be reached by a PPB (because there is a PPB in county i or in a county adjacent to county i), and $y_i = 0$ otherwise.)

b. Suppose that two PPBs can be established. Where should they be located to maximize the population served?

c. Solve your model from part (a) for an allowable number of PPBs ranging from 1 to 10. In other words, solve the model 10 times, with k set to 1, 2, . . . , 10. Record the population reached for each value of k. Graph the results by plotting the population reached versus the number of PPBs allowed. Based on their cost calculations, CHB considers an additional PPB to be fiscally prudent only if it increases the population reached by at least 500,000 people. Based on this graph, how many PPBs do you recommend implementing?

15. The North Shore Bank is working to develop an efficient work schedule for full-time and part-time tellers. The schedule must provide for efficient operation of the bank, including adequate customer service, employee breaks, and so on. On Fridays, the bank is open from 9:00 a.m. to 7:00 p.m. The number of tellers necessary to provide adequate customer service during each hour of operation is summarized as follows:

Time                      No. of Tellers
9:00 a.m.–10:00 a.m.      6
10:00 a.m.–11:00 a.m.     4
11:00 a.m.–Noon           8
Noon–1:00 p.m.            10
1:00 p.m.–2:00 p.m.       9
2:00 p.m.–3:00 p.m.       6
3:00 p.m.–4:00 p.m.       4
4:00 p.m.–5:00 p.m.       7
5:00 p.m.–6:00 p.m.       6
6:00 p.m.–7:00 p.m.       6

Each full-time employee starts on the hour and works a 4-hour shift, followed by a 1-hour break and then a 3-hour shift. Part-time employees work one 4-hour shift beginning on the hour. Considering salary and fringe benefits, full-time employees cost the bank $15 per hour ($105 a day), and part-time employees cost the bank $8 per hour ($32 per day).

a. Formulate an integer programming model that can be used to develop a schedule that will satisfy customer service needs at a minimum employee cost. (Hint: Let $x_i$ = number of full-time employees coming on duty at the beginning of hour i and $y_i$ = number of part-time employees coming on duty at the beginning of hour i.)

b. Solve the LP Relaxation of your model in part (a).

c. Solve your model in part (a) for the optimal schedule of tellers. Comment on the solution.

d. After reviewing the solution to part (c), the bank manager realized that some additional requirements must be specified. Specifically, she wants to ensure that one full-time employee is on duty at all times and that there is a staff of at least five full-time employees. Revise your model to incorporate these additional requirements, and solve for the optimal solution.

Data file for Problem 14: CHBPop


16. Burnside Marketing Research conducted a study for Barker Foods on several formulations for a new dry cereal. Three attributes were found to be most influential in determining which cereal had the best taste: ratio of wheat to corn in the cereal flake, type of sweetener (sugar, honey, or artificial), and the presence or absence of flavor bits. Seven children participated in taste tests and provided the following part-worths for the attributes (see Section 13.4 for a discussion of part-worths):

         Wheat/Corn      Sweetener                     Flavor Bits
Child    Low    High     Sugar   Honey   Artificial    Present   Absent
1        15     35       30      40      25            15        9
2        30     20       40      35      35            8         11
3        40     25       20      40      10            7         14
4        35     30       25      20      30            15        18
5        25     40       40      20      35            18        14
6        20     25       20      35      30            9         16
7        30     15       25      40      40            20        11

a. Suppose that the overall utility (sum of part-worths) of the current favorite cereal is 75 for each child. What product design will maximize the number of children in the sample who prefer the new dry cereal? Note that a child will prefer the new dry cereal only if its overall utility is at least 1 part-worth larger than the utility of their current preferred cereal.

b. Assume that the overall utility of the current favorite cereal for children 1 to 4 is 70, and the overall utility of the current favorite cereal for children 5 to 7 is 80. What product design will maximize the number of children in the sample who prefer the new dry cereal? Note that a child will prefer the new dry cereal only if its overall utility is at least 1 part-worth larger than the utility of their current preferred cereal.

Data file: Burnside

17. The Bayside Art Gallery is considering installing a video camera security system to reduce its insurance premiums. A diagram of Bayside's eight exhibition rooms is shown in the accompanying figure; the openings between the rooms are numbered 1 to 13. A security firm proposed that two-way cameras be installed at some room openings. Each camera has the ability to monitor the two rooms between which the camera is located. For example, if a camera were located at opening number 4, rooms 1 and 4 would be covered; if a camera were located at opening 11, rooms 7 and 8 would be covered; and so on. Management decided not to locate a camera system at the entrance to the display rooms. The objective is to provide security coverage for all eight rooms using the minimum number of two-way cameras.

a. Formulate a binary integer linear programming model that will enable Bayside's management to determine the locations for the camera systems.

b. Solve the model formulated in part (a) to determine how many two-way cameras to purchase and where they should be located.

c. Suppose that management wants to provide additional security coverage for room 7. Specifically, management wants room 7 to be covered by two cameras. How would the model you formulated in part (a) have to change to accommodate this policy restriction?

d. With the policy restriction specified in part (c), determine how many two-way camera systems will need to be purchased and where they should be located.


18. Suppose that the order quantity for the component must be 0, 250, 500, 750, or 1,000. Modify your model to enforce this restriction. What is the optimal cost?

19. Roedel Electronics produces tablet computer accessories, including integrated key- board tablet stands that connect a keyboard to a tablet device and holds the device at a preferred angle for easy viewing and typing. Roedel produces two sizes of inte- grated keyboard tablet stands, small and large. Each size uses the same keyboard attachment, but the stand consists of two different pieces, a top flap and a vertical stand that differ by size. Thus, a completed integrated keyboard tablet stand consists of three subassemblies that are manufactured by Roedel: a keyboard, a top flap, and a vertical stand.

Roedel’s sales forecast indicates that 7,000 small integrated keyboard tablet stands and 5,000 large integrated keyboard tablet stands will be needed to satisfy demand during the upcoming Christmas season. Because only 500 hours of in-house manufactur- ing time are available, Roedel is considering purchasing some, or all, of the subassem- blies from outside suppliers. If Roedel manufactures a subassembly in-house, it incurs a fixed setup cost as well as a variable manufacturing cost. The following table shows the setup cost, the manufacturing time per subassembly, the manufacturing cost per subas- sembly, and the cost to purchase each of the subassemblies from an outside supplier:

Entrance

Room 1

Room 3

Room 7

Room 4

Room 8

Room 6

Room 5

Room 2

8 9

12 13

1110

Problems 641

a. Determine how many units of each subassembly Roedel should manufacture and how many units of each subassembly Roedel should purchase. What is the total manufacturing and purchase cost associated with your recommendation?

b. Suppose Roedel is considering purchasing new machinery to produce large top flaps. For the new machinery, the setup cost is $3,000; the manufacturing time is 2.5 minutes per unit, and the manufacturing cost is $2.60 per unit. Assuming that the new machinery is purchased, determine how many units of each subassembly Roedel should manufacture and how many units of each subassembly Roedel should purchase. What is the total manufacturing and purchase cost associated with your recommendation? Do you think the new machinery should be purchased? Explain.

20. John White is the program scheduling manager for the television channel CCFO. John would like to plan the schedule of television shows for next Wednesday evening.

The table below lists nine shows under consideration. John must select exactly five of these shows for the period from 8:00 p.m. to 10:30 p.m. next Wednesday evening. For each television show, the estimated advertising revenue (in $ million) is provided. Furthermore, each show has been categorized into one or more of the categories “Public Interest,” “Violent,” “Comedy,” and “Drama.” In the following table, a 1 indicates that the show is in the corresponding category and a 0 indicates it is not.

Data for Problem 19 (Roedel Electronics subassemblies):

Subassembly             Setup Cost ($)   Manufacturing Time per Unit (min)   Manufacturing Cost per Unit ($)   Purchase Cost per Unit ($)
Keyboard                1,000            0.9                                 0.40                              0.65
Small top flap          1,200            2.2                                 2.90                              3.45
Large top flap          1,900            3.0                                 3.15                              3.70
Small vertical stand    1,500            0.8                                 0.30                              0.50
Large vertical stand    1,500            1.0                                 0.55                              0.70

Data for Problem 20 (shows under consideration):

Show              Revenue ($ Millions)   Public Interest   Violent   Comedy   Drama
Sam’s Place       $6                     0                 0         1        1
Texas Oil         $10                    0                 1         0        1
Cincinnati Law    $9                     1                 0         0        1
Jarred            $4                     0                 1         0        1
Bob & Mary        $5                     0                 0         1        0
Chainsaw          $2                     0                 1         0        0
Loving Life       $6                     1                 0         0        1
Islanders         $7                     0                 0         1        0
Urban Sprawl      $8                     1                 0         0        0

John would like to determine a revenue-maximizing schedule of television shows for next Wednesday evening. However, he must be mindful of the following considerations:

• The schedule must include at least as many shows that are categorized as public interest as shows that are categorized as violent.

• If John schedules “Loving Life,” then he must also schedule either “Jarred” or “Cincinnati Law” (or both).

• John cannot schedule both “Loving Life” and “Urban Sprawl.”

• If John schedules more than one show in the “Violent” category, he will lose an estimated $4 million in advertising revenues from family-oriented sponsors.

a. Formulate a binary integer program that models the decisions John faces.

b. Solve the model formulated in part (a). What is the optimal revenue?

TVSchedule


21. East Coast Trucking provides service from Boston to Miami using regional offices located in Boston, New York, Philadelphia, Baltimore, Washington, Richmond, Raleigh, Florence, Savannah, Jacksonville, and Tampa. The number of miles between the regional offices is provided in the following table:

Data for Problem 22 (mutual fund alternatives):

Fund   Type               Expected Return (%)
1      Growth             6.70
2      Growth             7.65
3      Growth             7.55
4      Growth             7.45
5      Growth & Income    7.50
6      Growth & Income    6.45
7      Growth & Income    7.05
8      Stock & Bond       6.90
9      Bond               5.20
10     Bond               5.90

Data for Problem 21 (miles between regional offices):

               Boston   New York   Philadelphia   Baltimore   Washington   Richmond   Raleigh   Florence   Savannah   Jacksonville   Tampa   Miami
Boston         0        211        320            424         459          565        713       884        1056       1196           1399    1669
New York       211      0          109            213         248          354        502       673        845        985            1188    1458
Philadelphia   320      109        0              104         139          245        393       564        736        876            1079    1349
Baltimore      424      213        104            0           35           141        289       460        632        772            975     1245
Washington     459      248        139            35          0            106        254       425        597        737            940     1210
Richmond       565      354        245            141         106          0          148       319        491        631            834     1104
Raleigh        713      502        393            289         254          148        0         171        343        483            686     956
Florence       884      673        564            460         425          319        171       0          172        312            515     785
Savannah       1056     845        736            632         597          491        343       172        0          140            343     613
Jacksonville   1196     985        876            772         737          631        483       312        140        0              203     473
Tampa          1399     1188       1079           975         940          834        686       515        343        203            0       270
Miami          1669     1458       1349           1245        1210         1104       956       785        613        473            270     0

The company’s expansion plans involve constructing service facilities in some of the cities where regional offices are located. Each regional office must be within 400 miles of a service facility. For instance, if a service facility is constructed in Richmond, it can provide service to regional offices located in New York, Philadelphia, Baltimore, Washington, Richmond, Raleigh, and Florence. Management would like to determine the minimum number of service facilities needed and where they should be located.

a. Formulate an integer linear program that can be used to determine the minimum number of service facilities needed and their locations.

b. Solve the integer linear program formulated in part (a). How many service facilities are required, and where should they be located?

c. Suppose that each service facility can provide service only to regional offices within 300 miles. Re-solve the integer linear program with the 300-mile requirement. How many service facilities are required and where should they be located?
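The Richmond example in Problem 21 can be checked directly from the Richmond row of the mileage table. The sketch below is only an illustration of how the coverage sets behind the covering constraints might be derived; the function name `covered_offices` and the dictionary layout are assumptions made for this example, and the code does not formulate or solve the integer program itself.

```python
# Miles from Richmond to every regional office (Richmond row of the table).
RICHMOND_MILES = {
    "Boston": 565, "New York": 354, "Philadelphia": 245, "Baltimore": 141,
    "Washington": 106, "Richmond": 0, "Raleigh": 148, "Florence": 319,
    "Savannah": 491, "Jacksonville": 631, "Tampa": 834, "Miami": 1104,
}


def covered_offices(miles_from_site, max_miles):
    """Return the offices within max_miles of a candidate facility site."""
    return [city for city, miles in miles_from_site.items() if miles <= max_miles]


within_400 = covered_offices(RICHMOND_MILES, 400)
within_300 = covered_offices(RICHMOND_MILES, 300)
print("Within 400 miles of Richmond:", within_400)
print("Within 300 miles of Richmond:", within_300)
```

Repeating the same calculation for every candidate city would give, for each regional office, the list of sites able to serve it, which is the information a covering constraint needs.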

22. Dave has $100,000 to invest in 10 mutual fund alternatives with the following restrictions. For diversification, no more than $25,000 can be invested in any one fund. If a fund is chosen for investment, then at least $10,000 will be invested in it. No more than two of the funds can be pure growth funds, and at least one pure bond fund must be selected. The total amount invested in pure bond funds must be at least as much as the amount invested in pure growth funds. Using the following expected returns, formulate and solve a model that will determine the investment strategy that will maximize expected annual return. What assumptions have you made in your model? How often would you expect to run your model?

EastCoast

CASE PROBLEM: APPLECORE CHILDREN’S CLOTHING

Applecore Children’s Clothing is a retailer that sells high-end clothes for toddlers (ages 1 to 3), primarily in shopping malls. Applecore also has a successful Internet-based sales division. Recently Dave Walker, vice-president of the e-commerce division, has been given the directive to expand the company’s Internet sales. He commissioned a major study on the effectiveness of Internet ads placed on news web sites. The results were favorable: Current patrons who purchased via the Internet and saw the ads on news web sites spent more, on average, than did comparable Internet customers who did not see the ads.

With this new information on Internet ads, Walker continued to investigate how new Internet customers could most effectively be reached. One of these ideas involved strategically purchasing ads on news web sites prior to and during the holiday season. To determine which news sites might be the most effective for ads, Walker conducted a follow-up study. An e-mail questionnaire was administered to a sample of 1,200 current Internet customers to ascertain which of 30 news sites they regularly visit. The idea is that web sites with high proportions of current customer visits would be viable sources of future customers for Applecore products.

Walker would like to ascertain which news sites should be selected for ads. The problem is complicated because Walker does not want to count multiple exposures. So, if a respondent visits multiple sites with Applecore ads or visits a given site multiple times, that respondent should be counted as reached but not more than once. In other words, a customer is considered reached if he or she has visited at least one web site with an Applecore ad.
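The "reached at most once" rule can be made concrete with a small counting sketch. The visit matrix below is hypothetical (it is not the Applecore survey data), and the function name `customers_reached` is an assumption made for this illustration.

```python
# Hypothetical visit matrix: one row per respondent, one column per web site.
# A 1 means the respondent regularly visits that site.
visits = [
    [1, 0, 0],  # respondent visits site 0 only
    [1, 1, 0],  # respondent visits sites 0 and 1
    [0, 0, 0],  # respondent visits none of the sites
    [0, 0, 1],  # respondent visits site 2 only
]


def customers_reached(visit_rows, selected_sites):
    """Count respondents who regularly visit at least one selected site."""
    return sum(1 for row in visit_rows if any(row[j] for j in selected_sites))


reach_sites_0_and_1 = customers_reached(visits, [0, 1])
reach_all_sites = customers_reached(visits, [0, 1, 2])
print(reach_sites_0_and_1, reach_all_sites)
```

The second respondent visits two selected sites but contributes only one to the count, which is the behavior the case describes.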

Data from the customer e-mail survey have begun to trickle in. Walker wants to develop a prototype model based on the current survey results. So far, 53 surveys have been returned. To keep the prototype model manageable, Walker wants to proceed with model development using the data from the 53 returned surveys and using only the first 10 news sites in the questionnaire. The costs of ads per week for the 10 web sites are given in the following table, and the budget is $10,000 per week. For each of the 53 responses received, the 10 web sites visited regularly are shown below. For a given customer–web site pair, a one indicates that the customer regularly visits that web site, and a zero indicates that the customer does not regularly visit that site.

Managerial Report

1. Develop a model that will allow Applecore to maximize the number of customers reached for a budget of $10,000 for one week of promotion.

2. Solve the model. What is the maximum number of customers reached for the $10,000 budget?

3. Perform a sensitivity analysis on the budget for values from $5,000 to $35,000 in increments of $5,000. Construct a graph of percentage reach versus budget. Is the additional increase in percentage reach monotonically decreasing as the budget allocation increases? Why or why not? What is your recommended budget? Explain.


Web Site           1      2      3      4      5      6      7      8      9      10
Cost/Wk ($000)     $5.0   $8.0   $3.5   $5.5   $7.0   $4.5   $6.0   $5.0   $3.0   $2.2

Customer

2 3 4

1 1

35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53

1 1 0 0

1 1 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0

0 0 0 0

0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 0 1

0 0 0 0

0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0

1 0 0 1

0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0

0 0 1 1

1 0 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 1

0 0 1 0

1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0

0 0 0 0

0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0

0 0 0 0

0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

1 2 3 4 5 6 7 8 9 10

0 0 0 0

0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1

Data for Applecore Customer Visits to News Web Sites (respondents 5 to 33 hidden)

Applecore

Chapter 14

Nonlinear Optimization Models

CONTENTS

ANALYTICS IN ACTION: INTERCONTINENTAL HOTELS

14.1 A PRODUCTION APPLICATION: PAR, INC. REVISITED
An Unconstrained Problem
A Constrained Problem
Solving Nonlinear Optimization Models Using Excel Solver
Sensitivity Analysis and Shadow Prices in Nonlinear Models

14.2 LOCAL AND GLOBAL OPTIMA
Overcoming Local Optima with Excel Solver

14.3 A LOCATION PROBLEM

14.4 MARKOWITZ PORTFOLIO MODEL

14.5 FORECASTING ADOPTION OF A NEW PRODUCT

APPENDIX 14.1: SOLVING NONLINEAR OPTIMIZATION PROBLEMS WITH ANALYTIC SOLVER (MINDTAP READER)

ANALYTICS IN ACTION

InterContinental Hotels*

InterContinental Hotels Group (IHG) owns, leases, or franchises over 4,500 hotels in about 100 countries around the world. It offers over 700,000 guest rooms, more than any other hotel company. InterContinental Hotels, Crowne Plaza Hotels and Resorts, Holiday Inn Hotels and Resorts, and Holiday Inn Express are some of InterContinental’s brands.

Like airlines and rental car companies, hotels offer a perishable good; that is, hotels have a limited time window in which to sell the product, after which the value perishes. For example, an empty seat on an airline flight is of no value, as is a hotel room that goes empty overnight. In dealing with perishable goods, how to price them in such a way as to maximize revenue is a challenge. Price the hotel room too high, and it will sit empty overnight and generate zero revenue. Price the hotel room too low, and the hotel will be filled, but revenue likely will be lower than it could have been with higher pricing, even if fewer rooms were booked. Revenue management (RM) is a term used to describe analytical approaches to this pricing problem.

IHG developed a novel approach to the hotel room pricing problem that uses a nonlinear optimization model to determine prices to charge for its rooms. Each day, IHG searches the Internet to acquire competitors’ prices. The competitors’ prices are factored into IHG’s pricing optimization model, which is run daily. The model is nonlinear because the objective function is to maximize contribution (revenue − cost), but both demand and revenue are functions of the price variable. Over 2,000 IHG hotels have begun using this pricing model, and its use has led to increased revenue in excess of $145 million.

*Based on D. Koushik, J. A. Higbie, and C. Eister, “Retail Price Optimization at InterContinental Hotels Group,” Interfaces 42, no. 1 (January–February 2012): 45–57.

Many business processes behave in a nonlinear manner. For example, the price of a bond is a nonlinear function of interest rates, and the price of a stock option is a nonlinear function of the price of the underlying stock. The marginal cost of production often decreases with the quantity produced, and the quantity demanded for a product is usually a nonlinear function of the price. These and many other nonlinear relationships are present in many business applications.

A nonlinear optimization problem is any optimization problem in which at least one term in the objective function or a constraint is nonlinear. In Section 14.1, we examine a production problem in which the objective function is a nonlinear function of the decision variables, similar to the Analytics in Action: InterContinental Hotels. In Section 14.2, we discuss issues that make nonlinear optimization very different from linear optimization. Section 14.3 presents a nonlinear model for facility location. In Section 14.4, we present the Nobel Prize–winning Markowitz model for managing the trade-off between risk and return in the construction of an investment portfolio. In Section 14.5, we consider a well-known model that effectively forecasts sales or adoptions of a new product.

14.1 A Production Application: Par, Inc. Revisited

We introduce constrained and unconstrained nonlinear optimization problems by considering an extension of the Par, Inc. linear program introduced in Chapter 12. We first consider the case in which the relationship between price and quantity sold causes the objective function to be nonlinear. The resulting unconstrained nonlinear program is then solved. As we shall see, the unconstrained optimal solution does not satisfy the production constraints of the original problem. Adding the production constraints back into the problem allows us to show the formulation and solution of a constrained nonlinear optimization model.

The on-line chapter appendix describes how to solve nonlinear optimization models using the Analytic Solver.

An Unconstrained Problem

Let us consider a revision of the Par, Inc. problem discussed in Chapter 12. Recall that Par, Inc. decided to manufacture standard and deluxe golf bags. In formulating the linear programming model for the Par, Inc. problem, we assumed that the company could sell all of the standard and deluxe bags it could produce. However, depending on the price of the golf bags, this assumption may not hold. An inverse relationship usually exists between price and demand. As price increases, the quantity demanded decreases. Let \(P_S\) denote the price Par, Inc. charges for each standard bag and \(P_D\) denote the price for each deluxe bag. Assume that the demand for standard bags, \(S\), and the demand for deluxe bags, \(D\), are given by

\[ S = 2{,}250 - 15P_S \tag{14.1} \]

\[ D = 1{,}500 - 5P_D \tag{14.2} \]

The revenue generated from standard bags is the price of each standard bag, \(P_S\), times the number of standard bags sold, \(S\). If the cost to produce a standard bag is $70, then the cost to produce \(S\) standard bags is \(70S\). Thus, the profit contribution for producing and selling \(S\) standard bags (revenue − cost) is

\[ P_S S - 70S = (P_S - 70)S \tag{14.3} \]

We can solve equation (14.1) for \(P_S\) to show how the price of a standard bag is related to the number of standard bags sold: \(P_S = 150 - (1/15)S\). Substituting \(150 - (1/15)S\) for \(P_S\) in equation (14.3), the profit contribution for standard bags is

\[ (P_S - 70)S = [150 - (1/15)S - 70]S = 80S - (1/15)S^2 \tag{14.4} \]

Suppose that the cost to produce each deluxe golf bag is $150. Using the same logic we used to develop equation (14.4), the profit contribution for deluxe bags is

\[ (P_D - 150)D = [300 - (1/5)D - 150]D = 150D - (1/5)D^2 \]

Total profit contribution is the sum of the profit contribution for standard bags and the profit contribution for deluxe bags. Thus, total profit contribution is written as

\[ \text{Total profit contribution} = 80S - (1/15)S^2 + 150D - (1/5)D^2 \tag{14.5} \]

Note that the two linear demand functions, equations (14.1) and (14.2), give a nonlinear total profit contribution function, equation (14.5). This function is an example of a quadratic function because the nonlinear terms have an exponent of 2 (\(S^2\) and \(D^2\)).

Using Excel Solver, we find that the values of \(S\) and \(D\) that maximize the profit contribution function are \(S = 600\) and \(D = 375\). The corresponding prices are $110 for standard bags and $225 for deluxe bags, and the profit contribution is $52,125. If all production constraints are also satisfied, these values provide the optimal solution for Par, Inc.
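Because equation (14.5) is a quadratic with no cross-product term, the unconstrained optimum can also be checked by hand: setting each partial derivative to zero gives one linear equation per variable. The short sketch below does this in Python rather than Excel Solver; the function and variable names are assumptions made for the illustration.

```python
def total_profit(s, d):
    """Total profit contribution, equation (14.5)."""
    return 80 * s - s**2 / 15 + 150 * d - d**2 / 5


# First-order conditions of equation (14.5):
#   80  - (2/15) S = 0   ->  S = 80 * 15 / 2
#   150 - (2/5)  D = 0   ->  D = 150 * 5 / 2
s_star = 80 * 15 / 2
d_star = 150 * 5 / 2

# Prices implied by the demand equations (14.1) and (14.2) solved for price.
price_standard = 150 - s_star / 15
price_deluxe = 300 - d_star / 5

print(s_star, d_star, price_standard, price_deluxe, total_profit(s_star, d_star))
```

These values agree with the quantities, prices, and profit contribution reported for the unconstrained problem.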

A Constrained Problem

In calculating the unconstrained optimal solution, we have ignored the production constraints discussed in Chapter 12. Recall that Par, Inc. has limited amounts of time available in each of four departments (cutting and dyeing, sewing, finishing, and inspection and packaging). We must enforce constraints that ensure that the amount of time used does not exceed the amount of time available in each of these departments. The problem that Par, Inc. must solve is to maximize the total profit contribution subject to all of the departmental labor hour constraints given in Chapter 12. The complete mathematical model for the Par, Inc. constrained nonlinear maximization problem is as follows:

Details of how to use Excel Solver for nonlinear optimization are discussed in the next section.

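\[
\begin{aligned}
\text{Max} \quad & 80S - \tfrac{1}{15}S^2 + 150D - \tfrac{1}{5}D^2 \\
\text{s.t.} \quad & \tfrac{7}{10}S + 1D \le 630 && \text{Cutting and dyeing} \\
& \tfrac{1}{2}S + \tfrac{5}{6}D \le 600 && \text{Sewing} \\
& 1S + \tfrac{2}{3}D \le 708 && \text{Finishing} \\
& \tfrac{1}{10}S + \tfrac{1}{4}D \le 135 && \text{Inspection and packaging} \\
& S, D \ge 0
\end{aligned}
\]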


The feasible region for the original Par, Inc. problem, along with the unconstrained optimal solution point (600, 375), is shown in Figure 14.1. The unconstrained optimum of (600, 375) is obviously outside the feasible region.
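That the point (600, 375) lies outside the feasible region can be confirmed by evaluating the left-hand side of each departmental constraint. The following sketch is one way to do so; the dictionary `DEPARTMENTS` and the function `violated_constraints` are names assumed for this illustration.

```python
# Department -> (hours per standard bag, hours per deluxe bag, hours available)
DEPARTMENTS = {
    "Cutting and dyeing": (7 / 10, 1, 630),
    "Sewing": (1 / 2, 5 / 6, 600),
    "Finishing": (1, 2 / 3, 708),
    "Inspection and packaging": (1 / 10, 1 / 4, 135),
}


def violated_constraints(s, d, tol=1e-9):
    """Return {department: (hours used, hours available)} for violated constraints."""
    violated = {}
    for name, (per_standard, per_deluxe, available) in DEPARTMENTS.items():
        used = per_standard * s + per_deluxe * d
        if used > available + tol:
            violated[name] = (used, available)
    return violated


violations = violated_constraints(600, 375)
for name, (used, available) in violations.items():
    print(f"{name}: {used:.2f} hours needed, {available} available")
```

At the unconstrained optimum, every department would need more hours than it has available, so the production constraints must be added back to the model.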

This maximization problem is exactly the same as the Par, Inc. problem in Chapter 12 except for the nonlinear objective function. The solution to this new constrained nonlinear maximization problem is shown in Figure 14.2.

In Figure 14.2 we see three profit contribution contour lines. Each point on the same contour line is a point of equal profit. Here, the contour lines show profit contributions of $45,000, $49,920.55, and $51,500. In the original Par, Inc. problem described in Chapter 12, the objective function is linear, and thus the profit contours are straight lines. However, for the Par, Inc. problem with a quadratic objective function, the profit contours are ellipses.

Because part of the $45,000 profit contour line cuts through the feasible region, we know that an infinite number of combinations of standard and deluxe bags will yield a profit of $45,000. An infinite number of combinations of standard and deluxe bags also provide a profit of $51,500. However, none of the points on the $51,500 contour profit line is in the feasible region. As the contour lines move farther out from the unconstrained optimum of (600, 375) the profit contribution associated with each contour line decreases. The contour line representing a profit of $49,920.55 intersects the feasible region at a single point. Without showing all of the details in solving for this point, the point of intersection is 459.717 standard bags and 308.198 deluxe bags. This solution provides the maximum possible profit. No contour line that has a profit contribution greater than $49,920.55 will intersect the feasible region. Because the contour lines are nonlinear, the contour line with the highest profit can touch the boundary of the feasible region at any point, not just an extreme point. In the Par, Inc. case, the optimal solution is on the cutting and dyeing constraint line partway between two extreme points.
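The omitted details can be sketched as follows, taking as given (from the text) that the optimum lies on the cutting and dyeing line. On that line, D = 630 − 0.7S, so the profit function becomes a quadratic in S alone and its maximizer comes from a single linear equation. This is a hand derivation written in Python, not the method Excel Solver uses; all names are assumptions made for the illustration.

```python
def total_profit(s, d):
    """Total profit contribution, equation (14.5)."""
    return 80 * s - s**2 / 15 + 150 * d - d**2 / 5


# On the cutting and dyeing line, 0.7 S + D = 630, so D = 630 - 0.7 S.
# Substituting into the profit function and setting the derivative with
# respect to S to zero gives:
#   80 - (2/15) S - 105 + 0.28 (630 - 0.7 S) = 0
s_opt = (80 - 105 + 0.28 * 630) / (2 / 15 + 0.28 * 0.7)
d_opt = 630 - 0.7 * s_opt

# Confirm that the other three departments have slack at this point.
sewing = 0.5 * s_opt + (5 / 6) * d_opt
finishing = s_opt + (2 / 3) * d_opt
inspection = 0.1 * s_opt + 0.25 * d_opt
others_satisfied = sewing <= 600 and finishing <= 708 and inspection <= 135

print(round(s_opt, 3), round(d_opt, 3), round(total_profit(s_opt, d_opt), 2))
print("Other constraints satisfied:", others_satisfied)
```

Up to rounding, the result matches the quantities and the $49,920.55 profit contribution quoted above.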

FIGURE 14.1 The Par, Inc. Feasible Region and the Optimal Solution for the Unconstrained Optimization Problem (horizontal axis: No. of Standard Bags, S; vertical axis: No. of Deluxe Bags; the plot marks the feasible region and the unconstrained optimum (600, 375) with profit = $52,125)

FIGURE 14.2 The Par, Inc. Feasible Region with Objective Function Contour Lines (horizontal axis: No. of Standard Bags, S; vertical axis: No. of Deluxe Bags; the plot marks the $45,000, $49,920.55, and $51,500 contours, the optimal solution at $49,920.55, and the unconstrained optimum at $52,125)

It is also possible for the optimal solution to a nonlinear optimization problem to lie in the interior of the feasible region. For instance, if the right-hand sides of the constraints in the Par, Inc. problem were all increased by a sufficient amount, the feasible region would expand so that the optimal unconstrained solution point of (600, 375) with a profit contribution of $52,125 in Figure 14.2 would be in the interior of the feasible region.

Many linear optimization algorithms (e.g., the simplex method) optimize by examining only the extreme points and selecting the extreme point that gives the best solution value. As the solution to the constrained nonlinear problem for Par, Inc. illustrates, such a method will not work in the nonlinear case because the optimal solution is generally not an extreme-point solution. Hence, nonlinear optimization algorithms are more complex than linear optimization algorithms, and the details are beyond the scope of this text. Fortunately, we do not need to know how nonlinear algorithms work; we just need to know how to use them. Computer software such as Excel Solver and Analytic Solver is available to solve nonlinear optimization problems.

Next we discuss how to use Excel Solver to solve nonlinear optimization problems.

Solving Nonlinear Optimization Models Using Excel Solver

We use the constrained nonlinear problem for Par, Inc. to illustrate how to use Excel Solver to solve nonlinear optimization problems. The procedure for developing and entering the model in Excel is the same as for linear problems as discussed in Chapter 12, except that one or more of the functions is nonlinear.

Figure 14.3 shows the Excel model and Solver dialog box for the nonlinear Par, Inc. problem. The SUMPRODUCT function is used in cells B19 through B22 to calculate the number of hours required in each department. The price function for standard bags is entered in cell B25 as =150-(1/15)*$B$14 and similarly for deluxe bags in cell B26 as =300-(1/5)*$C$14.

ParNonlinear

The objective function in cell B16 contains the formula =(B25-B9)*B14+(B26-C9)*C14, which corresponds to (150 − (1/15)S − 70)S + (300 − (1/5)D − 150)D. As previously shown, this is mathematically equivalent to equation (14.5) because

\[ (150 - (1/15)S - 70)S + (300 - (1/5)D - 150)D = 80S - (1/15)S^2 + 150D - (1/5)D^2 \]

To invoke Solver, we follow these steps:

Step 1. Click the Data tab in the Ribbon

Step 2. Click Solver in the Analyze group

Step 3. When the Solver Parameters dialog box appears:
    Enter B16 into the Set Objective: box

Step 4. Enter B14:C14 into the By Changing Variable Cells: box

Step 5. Click the Add button
    Enter B19:B22 in the Cell Reference: box
    Select <= from the drop-down menu
    Enter C19:C22 in the Constraint: box
    Click OK

Step 6. Select the Make Unconstrained Variables Non-negative option

Step 7. For Select a Solving Method: select GRG Nonlinear from the drop-down menu

Step 8. Click Solve

Step 9. When the Solver Results dialog box appears, click OK

The complete model for the constrained nonlinear Par, Inc. problem is contained in the file ParNonlinearModel.

The Answer Report generated by Excel Solver has the same structure as that of linear programs. Rather than show the Answer Report here, we refer to the optimal values shown in the spreadsheet in Figure 14.3. The optimal value of the objective function is $49,920.55, and this is achieved by producing 459.717 standard bags and 308.198 deluxe bags. This is the optimal point shown geometrically in Figure 14.2. Also, comparing the hours used in cells B19 through B22 with the hours available in cells C19 through C22 shows that only the cutting and dyeing constraint is binding, which is consistent with Figure 14.2. The optimal prices, based on the optimal quantities, are shown in cells B25 and B26. The optimal price for a standard bag is $119.35 and the optimal price for a deluxe bag is $238.36.
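The spreadsheet quantities described here can be reproduced outside Excel. The sketch below starts from the rounded optimal quantities reported in the text and recomputes the two price cells and the hours used in each department; the variable names and the 0.01-hour tolerance used to flag a binding constraint are assumptions made for this illustration.

```python
s_opt, d_opt = 459.717, 308.198  # optimal quantities reported in the text

# Price functions, mirroring the formulas described for cells B25 and B26.
price_standard = 150 - (1 / 15) * s_opt
price_deluxe = 300 - (1 / 5) * d_opt

# Hours used in each department (the role SUMPRODUCT plays in the spreadsheet).
hours_used = {
    "Cutting and dyeing": 0.7 * s_opt + 1 * d_opt,
    "Sewing": 0.5 * s_opt + (5 / 6) * d_opt,
    "Finishing": 1 * s_opt + (2 / 3) * d_opt,
    "Inspection and packaging": 0.1 * s_opt + 0.25 * d_opt,
}
hours_available = {
    "Cutting and dyeing": 630,
    "Sewing": 600,
    "Finishing": 708,
    "Inspection and packaging": 135,
}

# A constraint is treated as binding when used and available hours agree
# to within 0.01 hour (an arbitrary tolerance chosen for rounded inputs).
binding = [
    name for name in hours_used
    if abs(hours_used[name] - hours_available[name]) < 0.01
]

print(round(price_standard, 2), round(price_deluxe, 2), binding)
```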

Sensitivity Analysis and Shadow Prices in Nonlinear Models

The Sensitivity Report for the nonlinear Par, Inc. problem is shown in Figure 14.4. As in the linear case, there are two sections: one for the variables and the other for constraints. The variables section gives the cell location, name, final (optimal) value, and reduced gradient for each variable. The reduced gradient is analogous to the reduced cost for linear models. It is essentially the shadow price of the nonnegativity constraint or, more generally, the shadow price of a binding simple lower or upper bound on the decision variable.

The constraint section gives the cell location, name, and final value for the left-hand side of each constraint. For the Par, Inc. problem, the final values are the amount of time in hours used in each of the four departments. The far right column gives the Lagrangian multiplier for each constraint. The Lagrangian multiplier is the shadow price for a constraint in a nonlinear problem. In other words, the Lagrangian multiplier is the rate of change of the objective function with respect to the right-hand side of a constraint. For the Par, Inc. example, as we increase the number of hours available in the cutting and dyeing department, we expect the profit to increase by $26.72 per hour. However, notice that no ranges are given for allowable changes to the right-hand side. This is because the allowable increase and decrease are essentially zero. Changing the right-hand side of a binding constraint by even a small amount will change the value of the Lagrangian multiplier. Nonetheless, the Lagrangian multiplier does give an estimate of the importance of relieving a binding constraint.
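The interpretation of the Lagrangian multiplier as a rate of change can be illustrated numerically. The sketch below assumes that the cutting and dyeing constraint is the only binding constraint (as the text reports), re-solves the problem on that line for slightly different right-hand sides, and estimates the slope of the optimal profit with a central difference. The function names are assumptions made for this illustration, and this is not how Excel Solver computes the multiplier.

```python
def best_profit_on_cutting_line(hours):
    """Optimal profit when 0.7 S + D = hours is the only binding constraint."""
    # Same first-order condition as before, with 630 replaced by `hours`.
    s = (80 - 105 + 0.28 * hours) / (2 / 15 + 0.28 * 0.7)
    d = hours - 0.7 * s
    return 80 * s - s**2 / 15 + 150 * d - d**2 / 5


def multiplier_estimate(hours, h=0.5):
    """Central-difference estimate of d(optimal profit) / d(hours available)."""
    upper = best_profit_on_cutting_line(hours + h)
    lower = best_profit_on_cutting_line(hours - h)
    return (upper - lower) / (2 * h)


multiplier_at_630 = multiplier_estimate(630)
multiplier_at_640 = multiplier_estimate(640)
print(round(multiplier_at_630, 2), round(multiplier_at_640, 2))
```

The estimate at 630 hours is about 26.72, in line with the Sensitivity Report, and the smaller estimate at 640 hours shows why no allowable range is reported: the multiplier itself changes as soon as the right-hand side changes.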

ParNonlinearModel


14.2 Local and Global Optima

A feasible solution is a local optimum if no other feasible solution with a better objective function value is found in the immediate neighborhood. For example, for the constrained Par, Inc. problem, the local optimum corresponds to a local maximum; a point is a local maximum if no other feasible solution with a larger objective function value is in the immediate neighborhood. Similarly, for a minimization problem, a point is a local minimum if no other feasible solution with a smaller objective function value is in the immediate neighborhood.

Nonlinear optimization problems can have multiple local optimal solutions, which means we are concerned with finding the best of the local optimal solutions. A feasible solution is a global optimum if no other feasible point with a better objective function value is found in the feasible region. In the case of a maximization problem, the global optimum corresponds to a global maximum. A point is a global maximum if no other point in the feasible region gives a strictly larger objective function value. For a minimization problem, a point is a global minimum if no other feasible point with a strictly smaller objective function value is in the feasible region. A global maximum is also a local maximum, and a global minimum is also a local minimum.

The neighborhood of a solution is a mathematical concept that refers to the set of points within a relatively close proximity of the solution. See Figure 14.7 for a graphical example of local minimums and local maximums.

All global optimal solutions are local optimal solutions, but not all local optimal solutions are global optimal solutions.

FIGURE 14.3 Spreadsheet Model and Solver Parameters Dialog Box for the Nonlinear Par, Inc. Problem

Parameters
Operation                   Production Time (Hours), Standard   Production Time (Hours), Deluxe   Time Available (Hours)
Cutting and Dyeing          0.7                                 1                                 630
Sewing                      0.5                                 0.833                             600
Finishing                   1                                   0.667                             708
Inspection and Packaging    0.1                                 0.25                              135
Marginal Cost               $70.00                              $150.00

Model
Bags Produced               Standard: 459.717                   Deluxe: 308.198
Total Profit                $49,920.55

Operation                   Hours Used   Hours Available
Cutting and Dyeing          630.000      630
Sewing                      486.690      600
Finishing                   665.182      708
Inspection and Packaging    123.021      135

Standard Bag Price Function   119.35
Deluxe Bag Price Function     238.36

ParNonlinear

Nonlinear problems with multiple local optima are difficult to solve. But in many nonlinear applications, a single local optimal solution is also the global optimal solution. For such problems, we need to find only a local optimal solution. We will now present some of the more common classes of nonlinear problems of this type.

Consider the function \( f(X, Y) = -X^2 - Y^2 \). The shape of this function is illustrated in Figure 14.5. A function that is bowl-shaped down is called a concave function. The maximum value for this particular function is 0, and the point (0, 0) gives the optimal value of 0. The point (0, 0) is a local maximum; but it is also a global maximum because no point gives a larger function value. In other words, no values of X and Y result in an objective function value greater than 0. Functions that are concave, such as \( f(X, Y) = -X^2 - Y^2 \), have a single local maximum that is also a global maximum. This type of nonlinear problem is relatively easy to maximize.

The objective function for the nonlinear Par, Inc. problem is an example of a concave function:

\[ 80S - \tfrac{1}{15}S^2 + 150D - \tfrac{1}{5}D^2 \]

In general, if all the squared terms in a quadratic function have a negative coefficient and there are no cross-product terms, such as xy (or for the Par, Inc. problem, SD), then the function is a concave quadratic function. Thus, for the Par, Inc. problem, we are assured that the local maximum identified by Excel Solver in Figure 14.3 is the global maximum.
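The rule just stated is a special case of a more general test: a quadratic function of two variables is concave when its matrix of second derivatives (the Hessian) is negative semidefinite. The sketch below applies that standard 2 × 2 test; the function name `is_concave_quadratic` and its argument layout are assumptions made for this illustration, and the test goes slightly beyond what the text covers because it also handles a cross-product term.

```python
def is_concave_quadratic(coef_x2, coef_y2, coef_xy=0.0):
    """Check concavity of coef_x2*x**2 + coef_y2*y**2 + coef_xy*x*y + linear terms.

    The Hessian is [[2*coef_x2, coef_xy], [coef_xy, 2*coef_y2]]. A symmetric
    2x2 matrix is negative semidefinite when both diagonal entries are
    nonpositive and the determinant is nonnegative.
    """
    h11, h22, h12 = 2 * coef_x2, 2 * coef_y2, coef_xy
    return h11 <= 0 and h22 <= 0 and h11 * h22 - h12**2 >= 0


par_objective_concave = is_concave_quadratic(-1 / 15, -1 / 5)   # Par, Inc. objective
bowl_up_concave = is_concave_quadratic(1, 1)                    # X^2 + Y^2
cross_term_concave = is_concave_quadratic(-1, -1, coef_xy=3)    # large cross term
print(par_objective_concave, bowl_up_concave, cross_term_concave)
```

With no cross-product term, the determinant condition holds automatically whenever both squared-term coefficients are negative, which is why the simpler rule in the text suffices for the Par, Inc. objective.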

Let us now consider another type of function with a single local optimum that is also a global optimum. Consider the function \( f(X, Y) = X^2 + Y^2 \). The shape of this function is illustrated in Figure 14.6. It is bowl-shaped up and called a convex function. The minimum value for this particular function is 0, and the point (0, 0) gives the minimum value of 0. The point (0, 0) is a local minimum and a global minimum because no values of X and Y give an objective function value less than 0. Convex functions, such as \( f(X, Y) = X^2 + Y^2 \), have a single local minimum and are relatively easy to minimize.

For a concave function, we can be assured that if our computer software finds a local maximum, it has found a global maximum. Similarly, for a convex function, we know that if our computer software finds a local minimum, it has found a global minimum. However, some nonlinear functions have multiple local optima. For example, Figure 14.7 shows the graph of the following function over the feasible region \( 0 \le X \le 1 \), \( 0 \le Y \le 1 \):

\[ f(X, Y) = X \sin(5\pi X) + Y \sin(5\pi Y) \]

where sin is the trigonometric sine function, and \( \pi \) is approximately 3.1416. The hills and valleys in this graph show that this function has a number of local maximums and local minimums.

FIGURE 14.4 Excel Solver Sensitivity Report for the Nonlinear Par, Inc. Problem

Variable Cells
Cell     Name                                   Final Value   Reduced Gradient
$B$14    Bags Produced Standard                 459.7166      0
$C$14    Bags Produced Deluxe                   308.19838     0

Constraints
Cell     Name                                   Final Value   Lagrange Multiplier
$B$19    Cutting and Dyeing Hours Used          630           26.720587
$B$20    Sewing Hours Used                      486.69028     0
$B$21    Finishing Hours Used                   665.18219     0
$B$22    Inspection and Packaging Hours Used    123.02126     0

From a technical standpoint, functions with multiple local optima pose a serious challenge for optimization software; most nonlinear optimization software methods can get stuck and terminate at a local optimum. Unfortunately, many applications can be nonlinear with multiple local optima, and the objective function value for a local optimum may be much worse than the objective function value for a global optimum. Developing algorithms capable of finding the global optimum is currently an active research area.

Next we discuss a very practical approach to dealing with local maximums and local minimums when using Excel Solver for nonlinear problems.

FIGURE 14.5 A Concave Function $f(X, Y) = -X^2 - Y^2$

FIGURE 14.6 A Convex Function $f(X, Y) = X^2 + Y^2$

FIGURE 14.7 A Function with Local Maxima and Minima (surface plot of $f(X, Y)$ over $0 \le X \le 1$, $0 \le Y \le 1$)

Overcoming Local Optima with Excel Solver

How do you know when multiple local optima exist? The mathematical ways to determine this are beyond the scope of this text. From a practical point of view, if the solution obtained by optimization software depends on the starting point, then there are multiple local optima. Thus, when using Excel Solver, if the solution returned from Solver is different when starting from different values in the decision variable cells, then there are local optima. The converse is not necessarily true; that is, if the same solution is returned when starting from a different set of starting points, this does not necessarily mean that you have found the global optimal solution.

Let us consider the problem shown in Figure 14.7:

$$\max\ f(X, Y) = X \sin(5\pi X) + Y \sin(5\pi Y)$$

$$\text{s.t.}\quad 0 \le X \le 1,\quad 0 \le Y \le 1$$

Table 14.1 shows the results returned from Excel Solver for different starting points (values in the decision variable cells when Solver is invoked). In each of the five cases in Table 14.1, Solver returns with the message, “Solver has converged to the current solution. All constraints are satisfied.”

Excel Solver does provide an option that allows you to increase the confidence that you have found a global optimal solution. Clicking Options on the Solver Parameters dialog box and then selecting the GRG Nonlinear tab results in the dialog box shown in Figure 14.8. Clicking the Use Multistart option in the Multistart section causes Solver to use multiple starting solutions and report the best solution found from all of the starting points. The Population Size is the number of starting points used. Solver selects starting points randomly using the Random Seed (an integer value) such that the points are within the bounds specified. Although providing simple lower and upper bounds is not required (unless the Require Bounds on the Variables option is selected), the procedure is much more effective when bounds are provided. We recommend selecting the Require Bounds on the Variables checkbox and providing bounds before you use the Multistart option.

LocalOptima

In Figure 14.8, randomly generated starting points will be used and simple bounds of 0 and 1 have been specified as constraints in the Solver dialog box. The result reported by Solver is $X = 0.90447$, $Y = 0.90447$, with an objective function value of 1.8045. The message provided by Solver is “Solver converged in probability to a global solution.”

If the solution to a problem appears to depend on the starting values for the decision variables, we recommend you use the Multistart option.

TABLE 14.1 Solutions from Excel Solver for a Problem with Multiple Local Optima

Starting Point      Solution Returned
X        Y          X        Y        Objective Function Value
0.000    0.000      0.129    0.129    0.231
1.000    0.000      0.905    0.000    0.902
0.000    1.000      0.000    0.905    0.902
0.500    0.500      0.508    0.508    1.008
1.000    1.000      0.905    0.905    1.805
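The dependence on the starting point can be reproduced outside of Excel. The sketch below is only an illustration: it uses a simple projected gradient ascent as a stand-in for Solver's GRG Nonlinear method, and a fixed grid of starting points as a stand-in for Solver's randomly generated Multistart points. The function names, step size, iteration count, and grid are assumptions made for this example, not Solver settings.

```python
import math


def f(x, y):
    """Objective from Figure 14.7: X sin(5*pi*X) + Y sin(5*pi*Y)."""
    return x * math.sin(5 * math.pi * x) + y * math.sin(5 * math.pi * y)


def partial(v):
    """Derivative of v*sin(5*pi*v) with respect to v."""
    return math.sin(5 * math.pi * v) + 5 * math.pi * v * math.cos(5 * math.pi * v)


def clip(v):
    """Keep a coordinate inside the bounds 0 <= v <= 1."""
    return min(1.0, max(0.0, v))


def local_search(x, y, step=0.005, iters=2000):
    """Projected gradient ascent (illustrative stand-in, not Solver's GRG)."""
    for _ in range(iters):
        x = clip(x + step * partial(x))
        y = clip(y + step * partial(y))
    return x, y, f(x, y)


def multistart(points_per_axis=10):
    """Run the local search from a grid of starting points; keep the best."""
    starts = [(i + 0.5) / points_per_axis for i in range(points_per_axis)]
    results = [local_search(x0, y0) for x0 in starts for y0 in starts]
    return max(results, key=lambda r: r[2])


if __name__ == "__main__":
    # Hypothetical starting points chosen for illustration.
    for start in [(0.05, 0.05), (0.5, 0.5), (1.0, 1.0)]:
        x, y, value = local_search(*start)
        print(f"start={start} -> X={x:.3f}, Y={y:.3f}, f={value:.3f}")
    print("best over grid:", multistart())
```

Each single run stops at whichever local maximum is nearest uphill from its starting point, while the multistart wrapper simply keeps the best of many such runs. As with Solver's Multistart option, this raises confidence in the answer but does not prove global optimality.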

The GRG Nonlinear Tab in Solver OptionsFIGURE 14.8

14.3 A Location Problem

Let us consider the case of LaRosa Machine Shop (LMS). LMS is studying where to locate its tool bin facility on the shop floor. The locations of the five production stations appear in Figure 14.9. In an attempt to be fair to the workers in each of the production stations, management has decided to try to find the position of the tool bin that would minimize the sum of the distances from the tool bin to the five production stations. We define the following decision variables:

X = horizontal location of the tool bin
Y = vertical location of the tool bin

NOTES + COMMENTS

1. The Multistart option works best with bounds specified on each decision variable. It is often easy to calculate effective upper and lower bounds for the decision variables. For example, if you have a linear less-than-or-equal-to constraint with positive coefficients, upper bounds can be calculated by simply dividing the right-hand side by the coefficient for each variable. Using the cutting and dyeing constraint from the Par, Inc. problem, $\tfrac{7}{10}S + 1D \le 630$, we can deduce the following upper bounds: $S \le 630/(7/10) = 900$ and $D \le 630/1 = 630$.

2. In addition to GRG Nonlinear, Excel Solver provides another solution method, Evolutionary Solver, to solve nonlinear problems with local optimal solutions. Evolutionary Solver is based on a method that searches for an optimal solution by iteratively adjusting a population of candidate solutions. In this text, we limit our discussion for nonlinear problems to GRG Nonlinear, which is based on more classical optimization techniques. However, Evolutionary Solver may be useful for more complex nonlinear models that involve Excel functions such as VLOOKUP and IF.

FIGURE 14.9 Data for the LMS Tool Bin Location Problem

Station          X      Y
Fabrication      1      4
Paint            1      2
Subassembly 1    2.5    2
Subassembly 2    3      5
Assembly         4      4


We may measure the distance from a station to the tool bin located at (X, Y) by using Euclidean (straight-line) distance. For example, the distance from fabrication located at the coordinates (1, 4) to the tool bin located at the coordinates (X, Y) is given by

$$\sqrt{(X - 1)^2 + (Y - 4)^2}$$

The unconstrained optimization problem is as follows:

$$\min\ \sqrt{(X - 1)^2 + (Y - 4)^2} + \sqrt{(X - 1)^2 + (Y - 2)^2} + \sqrt{(X - 2.5)^2 + (Y - 2)^2} + \sqrt{(X - 3)^2 + (Y - 5)^2} + \sqrt{(X - 4)^2 + (Y - 4)^2}$$

Note that we do not require that the variables X or Y be nonnegative. The optimal solution found by Excel Solver is $X = 2.230$, $Y = 3.349$. The solution is shown in Figure 14.10.
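As a cross-check on the Solver result, the same sum-of-distances objective can be minimized with a short script. The sketch below uses Weiszfeld's iteration, a classical fixed-point method for this kind of problem; it is a substitute technique chosen for illustration, not the GRG Nonlinear method that Excel Solver uses. The station coordinates come from Figure 14.9, while the function names, the starting point (the centroid), and the iteration count are assumptions.

```python
import math

# Station coordinates (X, Y) from Figure 14.9.
STATIONS = {
    "Fabrication": (1.0, 4.0),
    "Paint": (1.0, 2.0),
    "Subassembly 1": (2.5, 2.0),
    "Subassembly 2": (3.0, 5.0),
    "Assembly": (4.0, 4.0),
}


def total_distance(x, y, points):
    """Sum of Euclidean distances from (x, y) to every station."""
    return sum(math.hypot(x - px, y - py) for px, py in points)


def weiszfeld(points, iters=1000, eps=1e-12):
    """Weiszfeld's iteration: repeatedly re-average the stations,
    weighting each one by the inverse of its current distance."""
    # Start from the centroid (an arbitrary but reasonable choice).
    x = sum(px for px, _ in points) / len(points)
    y = sum(py for _, py in points) / len(points)
    for _ in range(iters):
        weight_sum = x_sum = y_sum = 0.0
        for px, py in points:
            # Guard against division by zero if we land on a station.
            dist = max(math.hypot(x - px, y - py), eps)
            weight_sum += 1.0 / dist
            x_sum += px / dist
            y_sum += py / dist
        x, y = x_sum / weight_sum, y_sum / weight_sum
    return x, y


if __name__ == "__main__":
    pts = list(STATIONS.values())
    x, y = weiszfeld(pts)
    print(f"Tool bin at X={x:.3f}, Y={y:.3f}, "
          f"total distance={total_distance(x, y, pts):.3f}")
```

Because the sum of Euclidean distances is a convex function of (X, Y), any local minimum found here is also the global minimum, so the starting point matters much less than it did for the function in Figure 14.7.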

Location models are used extensively for determining the optimal locations for everything from drilling holes in computer circuit boards to locating distribution centers and retail stores in supply chains. A variety of location models can be created by using different objective functions or by adding additional constraints on distances traveled.

14.4 Markowitz Portfolio Model

Harry Markowitz received the 1990 Nobel Prize for his ground-breaking work in portfolio optimization. The Markowitz mean-variance portfolio model is a classic application of nonlinear programming. In this section, we present the Markowitz mean-variance portfolio model. Money management firms throughout the world use numerous variations of this basic model.

FIGURE 14.10 Solution to the LMS Tool Bin Location Problem (plot of the five stations from Figure 14.9 together with the tool bin location)

The exercises at the end of this chapter provide practice in creating several different forms of location models.

LaRosa


A key trade-off in financial planning is that between risk and return. For a chance to earn greater returns, the investor must also accept greater risk. In most portfolio optimization models, the return used is the expected (or average) return of the possible outcomes, and the risk is some measure of variability in these possible outcomes. To illustrate the Markowitz portfolio model, let us consider the case of Hauck Investment Services.

Hauck Investment Services designs annuities, IRAs, 401(k) plans, and other investment vehicles for investors with a variety of risk tolerances. Hauck would like to develop a portfolio model that can be used to determine an optimal portfolio involving a mix of six mutual funds. Table 14.2 shows the annual return (%) for five 1-year periods for the six mutual funds. Year 1 represents a year in which all mutual funds yield good returns. Year 2 is also a good year for most of the mutual funds. But year 3 is a bad year for the small-cap value fund, year 4 is a bad year for the intermediate-term bond fund, and year 5 is a bad year for four of the six mutual funds.

It is not possible to predict the exact returns for any of the funds over the next 12 months, but the portfolio managers at Hauck Investment Services think that the returns for the five years shown in Table 14.2 are scenarios that can be used to represent the possibilities for the next year. For the purpose of building portfolios for their clients, Hauck’s portfolio managers will choose a mix of these six mutual funds and assume that one of the five possible scenarios will describe the return over the next 12 months.

The portfolio construction problem is to determine how much of the portfolio to invest in each investment alternative. To determine the proportion of the portfolio that will be invested in each of the mutual funds we use the following decision variables:

FS = proportion of portfolio invested in the foreign stock mutual fund
IB = proportion of portfolio invested in the intermediate-term bond fund
LG = proportion of portfolio invested in the large-cap growth fund
LV = proportion of portfolio invested in the large-cap value fund
SG = proportion of portfolio invested in the small-cap growth fund
SV = proportion of portfolio invested in the small-cap value fund

Because the sum of these proportions must equal one, we need the following constraint:

$$FS + IB + LG + LV + SG + SV = 1$$

The other constraints are concerned with the return that the portfolio will earn under each of the planning scenarios in Table 14.2.

TABLE 14.2 Mutual Fund Performances in Five Selected Years (Used as Planning Scenarios for the Next 12 Months)

                           Annual Return (%)
Mutual Fund                Year 1    Year 2    Year 3    Year 4    Year 5
Foreign Stock              10.06     13.12     13.47     45.42     −21.93
Intermediate-Term Bond     17.64      3.25      7.51     −1.33       7.36
Large-Cap Growth           32.41     18.71     33.28     41.46     −23.26
Large-Cap Value            32.36     20.61     12.93      7.06      −5.37
Small-Cap Growth           33.44     19.40      3.85     58.68      −9.02
Small-Cap Value            24.56     25.32     −6.70      5.43      17.31

The portfolio return over the next 12 months depends on which of the possible scenarios (years 1 through 5) in Table 14.2 occurs. Let $R_1$ denote the portfolio return if the scenario represented by year 1 occurs, $R_2$ denote the portfolio return if the scenario represented by year 2 occurs, and so on. The portfolio returns for the five planning years (scenarios) are as follows:

Scenario 1 return:

$$R_1 = 10.06FS + 17.64IB + 32.41LG + 32.36LV + 33.44SG + 24.56SV$$

Scenario 2 return:

$$R_2 = 13.12FS + 3.25IB + 18.71LG + 20.61LV + 19.40SG + 25.32SV$$

Scenario 3 return:

$$R_3 = 13.47FS + 7.51IB + 33.28LG + 12.93LV + 3.85SG - 6.70SV$$

Scenario 4 return:

$$R_4 = 45.42FS - 1.33IB + 41.46LG + 7.06LV + 58.68SG + 5.43SV$$

Scenario 5 return:

$$R_5 = -21.93FS + 7.36IB - 23.26LG - 5.37LV - 9.02SG + 17.31SV$$

If $p_s$ is the probability of scenario s, among n possible scenarios, then the expected return for the portfolio is $\bar{R}$, where

$$\bar{R} = \sum_{s=1}^{n} p_s R_s \qquad (14.6)$$

If we assume that the five planning scenarios in the Hauck Investment Services model are equally likely to occur, then

$$\bar{R} = \sum_{s=1}^{5} \frac{1}{5} R_s = \frac{1}{5} \sum_{s=1}^{5} R_s$$

Measuring risk is a bit more difficult. Entire books are devoted to the topic of risk measurement. The measure of risk most often associated with the Markowitz portfolio model is the variance of the portfolio’s return. If the expected return is defined by equation (14.6), the variance of the portfolio’s return is:

$$\text{Var} = \sum_{s=1}^{n} p_s (R_s - \bar{R})^2 \qquad (14.7)$$

For the Hauck Investment Services example, the five planning scenarios are equally likely, thus:

$$\text{Var} = \frac{1}{5} \sum_{s=1}^{5} (R_s - \bar{R})^2$$

The portfolio variance is the average of the squared deviations of the scenario returns from their mean value. The larger this number, the more widely dispersed the scenario returns are about the average value. If the portfolio variance were equal to zero, then every scenario return $R_s$ would be equal, and there would be no risk.

Two basic ways to formulate the Markowitz model are (1) to minimize the variance of the portfolio subject to a constraint on the expected return of the portfolio and (2) to maximize the expected return of the portfolio subject to a constraint on variance. Consider the first case. Assume that Hauck clients would like to construct a portfolio from the six mutual funds listed in Table 14.2 that will minimize their risk as measured by the portfolio variance. However, the clients also require the expected portfolio return to be at least 10%. In our notation, the objective function is

$$\min\ \frac{1}{5} \sum_{s=1}^{5} (R_s - \bar{R})^2$$

The constraint on expected portfolio return is $\bar{R} \ge 10$. The complete Markowitz model involves 12 variables and 8 constraints (excluding the nonnegativity constraints).

$$\min\ \frac{1}{5} \sum_{s=1}^{5} (R_s - \bar{R})^2 \qquad (14.8)$$

s.t.

$$10.06FS + 17.64IB + 32.41LG + 32.36LV + 33.44SG + 24.56SV = R_1 \qquad (14.9)$$

$$13.12FS + 3.25IB + 18.71LG + 20.61LV + 19.40SG + 25.32SV = R_2 \qquad (14.10)$$

$$13.47FS + 7.51IB + 33.28LG + 12.93LV + 3.85SG - 6.70SV = R_3 \qquad (14.11)$$

$$45.42FS - 1.33IB + 41.46LG + 7.06LV + 58.68SG + 5.43SV = R_4 \qquad (14.12)$$

$$-21.93FS + 7.36IB - 23.26LG - 5.37LV - 9.02SG + 17.31SV = R_5 \qquad (14.13)$$

$$FS + IB + LG + LV + SG + SV = 1 \qquad (14.14)$$

$$\frac{1}{5} \sum_{s=1}^{5} R_s = \bar{R} \qquad (14.15)$$

$$\bar{R} \ge 10 \qquad (14.16)$$

$$FS, IB, LG, LV, SG, SV \ge 0 \qquad (14.17)$$

The objective for the Markowitz model is to minimize portfolio variance. Equations (14.9) through (14.13) define the return for each scenario. Equation (14.14) requires all of the money to be invested in the mutual funds; this constraint is often called the unity constraint. Equation (14.15) defines $\bar{R}$, which is the expected return of the portfolio. Equation (14.16) requires the portfolio return to be at least 10%. Finally, equation (14.17) requires a nonnegative investment in each Hauck mutual fund. Note that $R_1$, $R_2$, $R_3$, $R_4$, and $R_5$, as well as $\bar{R}$, are not required to be nonnegative. It is possible that the return in a given scenario or the expected return of the portfolio is negative.

The solution for this model using a required return of at least 10% appears in Figure 14.11. The minimum value for the portfolio variance is 27.136. This solution implies that the clients will get an expected return of 10% ($\bar{R} \ge 10$) and minimize their risk as measured by portfolio variance by investing approximately 16% of the portfolio in the foreign stock fund ($FS = 0.158$), 53% in the intermediate bond fund ($IB = 0.525$), 4% in the large-cap growth fund ($LG = 0.042$), and 27% in the small-cap value fund ($SV = 0.274$).

FIGURE 14.11 Solution for the Hauck Minimum Variance Portfolio with a Required Return of At Least 10%

The Solver Parameters dialog box is also shown in Figure 14.11. Note that we have selected GRG Nonlinear as the method and we have not selected Make Unconstrained Variables Non-Negative. Instead we have entered as an explicit constraint set that B17 through B22 must be $\ge 0$.
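The reported solution can be checked by evaluating the scenario returns, the expected return, and the variance directly from Table 14.2. The sketch below does only that evaluation; it does not perform the optimization. The weights are the rounded values quoted in the text, with LV and SG taken to be zero because the text does not mention them (an inference), so the results match the Solver figures only approximately. All function and variable names are illustrative.

```python
# Annual returns (%) by scenario (years 1-5) from Table 14.2.
RETURNS = {
    "FS": [10.06, 13.12, 13.47, 45.42, -21.93],
    "IB": [17.64, 3.25, 7.51, -1.33, 7.36],
    "LG": [32.41, 18.71, 33.28, 41.46, -23.26],
    "LV": [32.36, 20.61, 12.93, 7.06, -5.37],
    "SG": [33.44, 19.40, 3.85, 58.68, -9.02],
    "SV": [24.56, 25.32, -6.70, 5.43, 17.31],
}

# Rounded proportions quoted in the text; LV and SG assumed to be 0.
WEIGHTS = {"FS": 0.158, "IB": 0.525, "LG": 0.042,
           "LV": 0.0, "SG": 0.0, "SV": 0.274}


def scenario_returns(weights):
    """Portfolio return R_s under each scenario (equations 14.9-14.13)."""
    n_scenarios = len(next(iter(RETURNS.values())))
    return [sum(weights[fund] * RETURNS[fund][s] for fund in RETURNS)
            for s in range(n_scenarios)]


def mean_and_variance(weights):
    """Expected return and variance for equally likely scenarios."""
    r = scenario_returns(weights)
    mean = sum(r) / len(r)
    variance = sum((rs - mean) ** 2 for rs in r) / len(r)
    return mean, variance


if __name__ == "__main__":
    mean, variance = mean_and_variance(WEIGHTS)
    print(f"expected return = {mean:.2f}%, variance = {variance:.2f}")
```

Small differences from the Solver values (an expected return of 10% and a variance of 27.136) are to be expected here, because the weights have been rounded to three decimal places.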

The Markowitz portfolio model provides a convenient way for an investor to trade off risk versus return. In practice, this model is typically solved iteratively for different values of return. Figure 14.12 is a graph of the minimum portfolio variances versus required expected returns as required expected return is varied from 8% to 12% in increments of 1%. In finance, this graph is called the efficient frontier. Each point on the efficient frontier is the minimum possible risk (measured by portfolio variance) for the given return. By looking at the graph of the efficient frontier, investors can select the mean-variance combination with which they are most comfortable.

FIGURE 14.12 An Efficient Frontier for the Markowitz Portfolio Model (portfolio variance plotted against required return, 8% to 12%)

NOTES + COMMENTS

1. Notice that the solution given in Figure 14.11 has more than 50% of the portfolio invested in the intermediate-term bond fund. It may be unwise to let one asset contribute so heavily to the portfolio. Upper and lower bounds on the amount of an asset type in the portfolio can be easily modeled. Hence, upper bounds are often placed on the percentage of the portfolio invested in a single asset. Likewise, it might be undesirable to include an extremely small quantity of an asset in the portfolio. Thus, there may be constraints that require nonzero amounts of an asset to be at least a minimum percentage of the portfolio.

2. In the Hauck example, 100% of the available portfolio was invested in mutual funds. However, risk-averse investors often prefer to have some of their money in a so-called risk-free asset, such as U.S. Treasury Bills. Thus, many portfolio optimization models allow funds to be invested in a risk-free asset.

3. In this section, portfolio variance was used to measure risk. However, variance, as it is defined, counts deviations both above and below the mean. Most investors are happy with returns above the mean but wish to avoid returns below the mean. Hence, numerous portfolio models allow for flexible risk measures. A problem at the end of this chapter illustrates the use of alternative risk measures.

4. In practice, both brokers and mutual fund companies adjust portfolios as new information becomes available. However, constantly adjusting a portfolio may lead to large transaction costs. The case problem at the end of this chapter requires you to develop a modification of the Markowitz portfolio selection problem to account for transaction costs.

HauckMarkowitz


14.5 Forecasting Adoption of a New Product

Forecasting new adoptions after a product introduction is an important marketing problem. In this section, we introduce a forecasting model developed by Frank Bass¹ that has proven to be particularly effective in forecasting the adoption of innovative and new technologies in the marketplace. Nonlinear optimization is used to estimate the parameters of the Bass forecasting model. The model has three parameters that must be estimated.

m = the number of people estimated to eventually adopt the new product

A company introducing a new product is obviously interested in the value of parameter m.

q = the coefficient of imitation

Parameter q measures the likelihood of adoption due to a potential adopter being influenced by someone who has already adopted the product. It measures the word-of-mouth or social media effect influencing purchases.

p = the coefficient of innovation

Parameter p measures the likelihood of adoption, assuming no influence from someone who has already purchased (adopted) the product. It is the likelihood of someone adopting the product because of her or his own interest in the innovation.

Using these parameters, let us now develop the forecasting model. Let $C_{t-1}$ denote the number of people who have adopted the product through time t − 1. Because m is the number of people estimated to eventually adopt the product, $m - C_{t-1}$ is the number of potential adopters remaining at time t − 1. We refer to the time interval between time t − 1 and time t as period t. During period t, some percentage of the remaining number of potential adopters, $m - C_{t-1}$, will adopt the product. This value depends on the likelihood of a new adoption.

Loosely speaking, the likelihood of a new adoption is the likelihood of adoption due to imitation plus the likelihood of adoption due to innovation. The likelihood of adoption due to imitation is a function of the number of people who have already adopted the product. The larger the current pool of adopters, the greater their influence through word of mouth. Because $C_{t-1}/m$ is the fraction of the number of people estimated to adopt the product by time t − 1, the likelihood of adoption due to imitation is computed by multiplying this fraction by q, the coefficient of imitation. Thus, the likelihood of adoption due to imitation is

$$q\,(C_{t-1}/m)$$

The likelihood of adoption due to innovation is simply p, the coefficient of innovation. Thus, the likelihood of adoption is

$$p + q\,(C_{t-1}/m)$$

Using the likelihood of adoption we can develop a forecast of the remaining number of potential customers who will adopt the product during time period t. Thus, $F_t$, the forecast of the number of new adopters during time period t, is

$$F_t = [\,p + q\,(C_{t-1}/m)\,](m - C_{t-1}) \qquad (14.18)$$

In developing a forecast of new adoptions in period t using the Bass model, the value of $C_{t-1}$ will be known from past sales data. But we also need to know the values of the parameters to use in the model. Let us now see how nonlinear optimization is used to estimate the parameter values m, p, and q.

Consider Figure 14.13. This figure shows the graph of box office revenues (in $ millions) for two different films, an independent studio film and a summer blockbuster action movie, over the first 12 weeks after release. Strictly speaking, box office revenues for time period t are not the same as the number of adopters during time period t. However, the number of repeat customers is usually small, and box office revenues are a multiple of the number of moviegoers. The Bass forecasting model seems appropriate here.

These two films illustrate drastically different adoption patterns. Note that revenues for the independent studio film grow until the revenues peak in week 4 and then decline. For this film, much of the revenue is obviously due to word-of-mouth influence. In terms of the Bass model, the imitation factor dominates the innovation factor, and we expect $q > p$. However, for the summer blockbuster, revenues peak in week 1 and drop sharply afterward. The innovation factor dominates the imitation factor, and we expect $q < p$.

The Bass forecasting model given in equation (14.18) can be rigorously derived from statistical principles. Rather than providing such a derivation, we have emphasized the intuitive aspects of the model.

¹See Frank M. Bass, “A New Product Growth Model for Consumer Durables,” Management Science 15 (1969).

The forecasting model given in equation (14.18) can be incorporated into a nonlinear optimization problem to find the values of p, q, and m that give the best forecasts for a set of data. Assume that N periods of data are available. Let us denote the actual number of adopters (or a multiple of that number, such as sales) in period t as $S_t$ for $t = 1, \ldots, N$. Then the forecast in each period and the corresponding forecast error $E_t$ are defined by

$$F_t = [\,p + q\,(C_{t-1}/m)\,](m - C_{t-1}) \quad \text{and} \quad E_t = F_t - S_t$$

Notice that the forecast error is the difference between the forecast value $F_t$ and the actual value $S_t$. It is common statistical practice to estimate the parameters p, q, and m by minimizing the sum of squared errors.

Doing so for the Bass forecasting model leads to the following nonlinear optimization problem:

$$\min\ \sum_{t=1}^{N} E_t^2 \qquad (14.19)$$

s.t.

$$F_t = [\,p + q\,(C_{t-1}/m)\,](m - C_{t-1}), \quad t = 1, 2, \ldots, N \qquad (14.20)$$

$$E_t = F_t - S_t, \quad t = 1, 2, \ldots, N \qquad (14.21)$$

Because equations (14.19) and (14.20) both contain nonlinear terms, this model is a nonlinear minimization problem.

The data in Table 14.3 provide the revenue and cumulative revenues for the independent studio film in weeks 1–12. Using these data, the nonlinear model to estimate the parameters of the Bass forecasting model for the independent studio film is as follows:

$$\min\ E_1^2 + E_2^2 + \cdots + E_{12}^2$$

s.t.

$$F_1 = (p)\,m$$

$$F_2 = [\,p + q\,(0.10/m)\,](m - 0.10)$$

$$F_3 = [\,p + q\,(3.10/m)\,](m - 3.10)$$

$$\vdots$$

$$F_{12} = [\,p + q\,(34.85/m)\,](m - 34.85)$$

$$E_1 = F_1 - 0.10$$

$$E_2 = F_2 - 3.00$$

$$\vdots$$

$$E_{12} = F_{12} - 0.60$$

Note that the parameters of the Bass forecasting model are the decision variables in this nonlinear optimization model.

FIGURE 14.13 Weekly Box Office Revenues for an Independent Studio Film and a Summer Blockbuster Movie (two panels, each plotting revenue in $ millions against weeks 1 to 12)

The solutions to this nonlinear model and to a similar nonlinear model for the summer blockbuster are given in Table 14.4.

The optimal forecasting parameter values given in Table 14.4 are intuitively appealing and consistent with Figure 14.13. For the independent studio film, which has the largest revenues in week 4, the value of the imitation parameter q is 0.49; this value is substantially larger than the innovation parameter $p = 0.074$. The film picks up momentum over time because of favorable word of mouth. After week 4, revenues decline as more and more of the potential market for the film has already seen it. Contrast these data with the summer blockbuster movie, which has a negative value of −0.018 for the imitation parameter q and an innovation parameter p of 0.49. The greatest number of adoptions is in week 1, and new adoptions decline afterward. Obviously the word-of-mouth influence is not favorable.

TABLE 14.3 Box Office Revenues and Cumulative Revenues in $ Millions for Independent Studio Film

Week    Revenues S_t    Cumulative Revenues C_t
1       0.10            0.10
2       3.00            3.10
3       5.20            8.30
4       7.00            15.30
5       5.25            20.55
6       4.90            25.45
7       3.00            28.45
8       2.40            30.85
9       1.90            32.75
10      1.30            34.05
11      0.80            34.85
12      0.60            35.45

TABLE 14.4 Optimal Forecast Parameters for Independent Studio Film and Summer Blockbuster Movie

Parameter    Independent Studio Film    Summer Blockbuster
p            0.074                      0.460
q            0.490                      −0.018
m            34.850                     149.540
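To make the model concrete, the sketch below evaluates the Bass forecast of equation (14.18) and the sum of squared errors for the independent studio film, using the weekly revenues from Table 14.3 and the parameter values reported in Table 14.4. It only evaluates the objective at given parameter values; it does not search for the best p, q, and m, which is the job of the nonlinear optimizer. The function names are illustrative.

```python
# Weekly revenues S_t ($ millions) for the independent studio film, Table 14.3.
REVENUES = [0.10, 3.00, 5.20, 7.00, 5.25, 4.90,
            3.00, 2.40, 1.90, 1.30, 0.80, 0.60]


def bass_forecasts(p, q, m, revenues):
    """Forecast F_t = [p + q*(C_{t-1}/m)] * (m - C_{t-1}) for each period,
    where C_{t-1} is the cumulative revenue through the previous period."""
    forecasts = []
    cumulative = 0.0  # C_0 = 0: nobody has adopted before period 1
    for actual in revenues:
        forecasts.append((p + q * cumulative / m) * (m - cumulative))
        cumulative += actual
    return forecasts


def sum_squared_errors(p, q, m, revenues):
    """Objective (14.19): sum over t of (F_t - S_t)^2."""
    forecasts = bass_forecasts(p, q, m, revenues)
    return sum((f - s) ** 2 for f, s in zip(forecasts, revenues))


if __name__ == "__main__":
    # Parameter values for the independent studio film from Table 14.4.
    p, q, m = 0.074, 0.490, 34.850
    for week, (f, s) in enumerate(zip(bass_forecasts(p, q, m, REVENUES),
                                      REVENUES), start=1):
        print(f"week {week:2d}: forecast={f:6.3f}  actual={s:5.2f}")
    print("sum of squared errors:", sum_squared_errors(p, q, m, REVENUES))
```

An optimizer would treat `sum_squared_errors` as the function to minimize over p, q, and m. Because this objective is neither convex nor concave, different starting values can lead to different local minima.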

Bass


In Figure 14.14, we show the forecast values based on the parameters in Table 14.4 and the observed values in the same graph. The Bass forecasting model does a good job of tracking revenue for the independent small-studio film. For the summer blockbuster, the Bass model does an outstanding job; it is virtually impossible to distinguish the forecast line from the actual adoption line.

You may wonder what good a forecasting model is if we must wait until after the adoption cycle is complete to estimate the parameters. One way to use the Bass forecasting model for a new product is to assume that sales of the new product will behave in a way that is similar to a previous product for which p and q values have been calculated and to subjectively estimate m, the potential market for the new product. For example, one might assume that box office receipts for movies next summer will behave similarly to box office receipts for movies last summer. Then the p and q values used for next summer’s movies would be the p and q values calculated from the actual box office receipts last summer.

A second approach is to wait until several periods of data for the new product are available. For example, if five periods of data are available, the sales data for these five periods could be used to forecast demand for period 6. Then, after six periods of sales are observed, a forecast for period 7 is made. This method is often called a rolling-horizon approach.

FIGURE 14.14 Forecast and Actual Weekly Box Office Revenues for Independent Studio Film and Summer Blockbuster (two panels, each plotting forecast values and observed values in $ millions against weeks 1 to 12)

NOTES + COMMENTS

The optimization model used to determine the parameter values for the Bass forecasting model is an example of a difficult nonlinear optimization problem. It is neither convex nor concave. For such models, local optima may give values that are much worse than the global optimum. We recommend using the Multistart option in Excel Solver when solving such problems.

SUMMARY

In this chapter we introduced nonlinear optimization models. A nonlinear optimization model is a model with at least one nonlinear term in either a constraint or the objective function. Because so many applications of business analytics involve nonlinear functions, allowing nonlinear terms greatly increases the number of important applications that can be modeled as an optimization problem. Numerous problems in portfolio optimization, option pricing, marketing, economics, facility location, forecasting, and scheduling lend themselves to nonlinear models.

Unfortunately, nonlinear optimization models are not as easy to solve as linear optimization models, or even integer linear optimization models. As a rule of thumb, if a problem can be modeled realistically as a linear or integer linear problem, then it is probably best to do so. Many nonlinear formulations have local optima that are not globally optimal. Because most nonlinear optimization software terminates with a local optimum, the solution returned by the software may not be the best solution available. However, as discussed in this chapter, numerous important classes of optimization problems, such as the Markowitz portfolio models, are convex optimization problems. For a convex optimization problem, a local optimum is also the global optimum. Additionally, the development of software that finds globally optimal solutions to (nonconvex) nonlinear optimization problems is proceeding at a rapid rate. When using Excel Solver for nonlinear optimization, we recommend using the Multistart option.

GLOSSARY

Concave function A function that is bowl-shaped down: For example, the functions $f(x) = -5x^2 - 5x$ and $f(x, y) = -x^2 - 11y^2$ are concave functions.

Convex function A function that is bowl-shaped up: For example, the functions $f(x) = x^2 - 5x$ and $f(x, y) = x^2 + 5y^2$ are convex functions.

Efficient frontier A set of points defining the minimum possible risk (measured by portfolio variance) for a set of return values.

Global maximum A feasible solution is a global maximum if there are no other feasible points with a larger objective function value in the entire feasible region. A global maximum is also a local maximum.

Global minimum A feasible solution is a global minimum if there are no other feasible points with a smaller objective function value in the entire feasible region. A global minimum is also a local minimum.

Global optimum A feasible solution is a global optimum if there are no other feasible points with a better objective function value in the entire feasible region. A global optimum may be either a global maximum or a global minimum.

Lagrangian multiplier The shadow price for a constraint in a nonlinear problem, that is, the rate of change of the objective function with respect to the right-hand side of a constraint.

Local maximum A feasible solution is a local maximum if there are no other feasible solutions with a larger objective function value in the immediate neighborhood.

Local minimum A feasible solution is a local minimum if there are no other feasible solutions with a smaller objective function value in the immediate neighborhood.

Local optimum A feasible solution is a local optimum if there are no other feasible solutions with a better objective function value in the immediate neighborhood. A local optimum may be either a local maximum or a local minimum.

Markowitz mean-variance portfolio model An optimization model used to construct a portfolio that minimizes risk subject to a constraint requiring a minimum level of return.

Nonlinear optimization problem An optimization problem that contains at least one nonlinear term in the objective function or a constraint.

Quadratic function A nonlinear function with terms to the power of two.

Reduced gradient Value associated with a variable in a nonlinear model that is analogous to the reduced cost in a linear model; the shadow price of a binding simple lower or upper bound on the decision variable.

Problems

1. GreenLawns provides a lawn fertilizing and weed control service. The company is adding a special aeration treatment as a low-cost extra service option that it hopes will help attract new customers. Management is planning to promote this new service in two media: radio and direct-mail advertising. A media budget of $3,000 is available for this promotional campaign. Based on past experience in promoting its other services, GreenLawns has obtained the following estimate of the relationship between sales and the amount spent on promotion in these two media:

\[ S = -2R^2 - 10M^2 - 8RM + 18R + 34M \]

where

S = total sales in thousands of dollars

R = thousands of dollars spent on radio advertising

M = thousands of dollars spent on direct-mail advertising

GreenLawns would like to develop a promotional strategy that will lead to maximum sales subject to the restriction provided by the media budget.

a. What is the value of sales if $2,000 is spent on radio advertising and $1,000 is spent on direct-mail advertising?

b. Formulate an optimization problem that can be solved to maximize sales subject to the media budget of spending no more than $3,000 on total advertising.

c. Determine the optimal amount to spend on radio and direct-mail advertising. How much in sales will be generated?
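The sales relationship in Problem 1 can be evaluated directly for any spending plan. The following is a minimal Python sketch, not part of the original text; the function names `sales` and `within_budget` and the sample plan of $1,000 per medium are illustrative assumptions.

```python
def sales(radio: float, mail: float) -> float:
    """Estimated sales in thousands of dollars.

    radio and mail are the amounts spent on each medium,
    also in thousands of dollars (R and M in the problem).
    """
    return (-2 * radio**2 - 10 * mail**2 - 8 * radio * mail
            + 18 * radio + 34 * mail)


def within_budget(radio: float, mail: float, budget: float = 3.0) -> bool:
    """Check the media budget constraint R + M <= 3 with nonnegative spending."""
    return radio >= 0 and mail >= 0 and radio + mail <= budget


# Hypothetical plan: $1,000 on radio and $1,000 on direct mail.
plan = (1.0, 1.0)
plan_sales = sales(*plan)          # -2 - 10 - 8 + 18 + 34 = 32
plan_feasible = within_budget(*plan)
```

A nonlinear solver would search over feasible `(radio, mail)` pairs for the largest value of `sales`; the sketch only evaluates one candidate.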

2. The Cobb-Douglas production function is a classic model from economics used to model output as a function of capital and labor. It has the form

\[ f(L, C) = c_0 L^{c_1} C^{c_2} \]

where \(c_0\), \(c_1\), and \(c_2\) are constants. The variable L represents the units of input of labor, and the variable C represents the units of input of capital.

a. In this example, assume \(c_0 = 5\), \(c_1 = 0.25\), and \(c_2 = 0.75\). Assume each unit of labor costs $25 and each unit of capital costs $75. With $75,000 available in the budget, develop an optimization model to determine how the budgeted amount should be allocated between capital and labor in order to maximize output.

b. Find the optimal solution to the model you formulated in part (a). (Hint: When using Excel Solver, use the Multistart option with bounds \(0 \le L \le 3{,}000\) and \(0 \le C \le 1{,}000\).)

3. Let S represent the amount of steel produced (in tons). Steel production is related to the amount of labor used (L) and the amount of capital used (C) by the following function:

\[ S = 20 L^{0.30} C^{0.70} \]

In this formula L represents the units of labor input and C the units of capital input. Each unit of labor costs $50, and each unit of capital costs $100.

a. Formulate an optimization problem that will determine how much labor and capital are needed to produce 50,000 tons of steel at minimum cost.

b. Solve the optimization problem you formulated in part (a). (Hint: When using Excel Solver, start with an initial \(L > 0\) and \(C > 0\).)

4. The profit function for two products is

\[ \text{Profit} = -3x_1^2 + 42x_1 - 3x_2^2 + 48x_2 + 700 \]

where \(x_1\) represents units of production of product 1 and \(x_2\) represents units of production of product 2. Producing one unit of product 1 requires 4 labor-hours and producing one unit of product 2 requires 6 labor-hours. Currently, 24 labor-hours are available. The cost of labor-hours is already factored into the profit function, but it is possible to schedule overtime at a premium of $5 per hour.

a. Formulate an optimization problem that can be used to find the optimal production quantity of products 1 and 2 and the optimal number of overtime hours to schedule.

b. Solve the optimization model you formulated in part (a). How much should be produced and how many overtime hours should be scheduled?


5. Jim’s Camera shop sells two high-end cameras, the Sky Eagle and Horizon. The demands for these two cameras are as follows, where \(D_S\) is the demand for the Sky Eagle, \(P_S\) is the selling price of the Sky Eagle, \(D_H\) is the demand for the Horizon, and \(P_H\) is the selling price of the Horizon:

\[ D_S = 222 - 0.60P_S + 0.35P_H \]

\[ D_H = 270 + 0.10P_S - 0.64P_H \]

The store wishes to determine the selling price that maximizes revenue for these two products. Develop the revenue function for these two models, and find the prices that maximize revenue.

6. Heller Manufacturing has two production facilities that manufacture baseball gloves. Production costs at the two facilities differ because of varying labor rates, local property taxes, type of equipment, capacity, and so on. The Dayton plant has weekly costs that can be expressed as a function of the number of gloves produced:

\[ TCD(X) = X^2 - X + 5 \]

where X is the weekly production volume in thousands of units, and TCD(X) is the cost in thousands of dollars. The Hamilton plant’s weekly production costs are given by

\[ TCH(Y) = Y^2 - 2Y + 3 \]

where Y is the weekly production volume in thousands of units, and TCH(Y) is the cost in thousands of dollars. Heller Manufacturing would like to produce 8,000 gloves per week at the lowest possible cost.

a. Formulate a mathematical model that can be used to determine the optimal number of gloves to produce each week at each facility.

b. Solve the optimization model to determine the optimal number of gloves to produce at each facility.

7. Many forecasting models use parameters that are estimated using nonlinear optimization. A good example is the Bass model introduced in this chapter. Another example is the exponential smoothing forecasting model, which is common in practice and is described in further detail in Chapter 8. For instance, the basic exponential smoothing model for forecasting sales is

\[ \hat{y}_{t+1} = \alpha y_t + (1 - \alpha)\hat{y}_t \]

where

\(\hat{y}_{t+1}\) = forecast of sales for period t + 1

\(y_t\) = actual sales for period t

\(\hat{y}_t\) = forecast of sales for period t

\(\alpha\) = smoothing constant, \(0 \le \alpha \le 1\)

This model is used recursively; the forecast for time period t + 1 is based on the forecast for period t, \(\hat{y}_t\), the observed value of sales in period t, \(y_t\), and the smoothing parameter \(\alpha\). The use of this model to forecast sales for 12 months is illustrated in the following table with the smoothing constant \(\alpha = 0.3\). The forecast errors, \(y_t - \hat{y}_t\), are calculated in the fourth column. The value of \(\alpha\) is often chosen by minimizing the sum of squared forecast errors. The last column of the table shows the square of the forecast error and the sum of squared forecast errors.

In using exponential smoothing models, one tries to choose the value of \(\alpha\) that provides the best forecasts.

a. The file ExpSmooth contains the observed data shown here. Construct this table using the formula above. Note that we set the forecast in period 1 to the observed value in period 1 to get started (\(\hat{y}_1 = y_1 = 17\)); then the formula above for \(\hat{y}_{t+1}\) is used starting in period 2. Make sure to have a single cell corresponding to \(\alpha\) in your spreadsheet model. After confirming the values in the table with \(\alpha = 0.3\), try different values of \(\alpha\) to see if you can get a smaller sum of squared forecast errors.


b. Use Excel Solver to find the value of \(\alpha\) that minimizes the sum of squared forecast errors.
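The recursion in Problem 7 can also be sketched outside a spreadsheet. The Python below is an illustration only, not the textbook's method: it recomputes the sum of squared forecast errors for the twelve observed values in the problem's table, and it replaces Excel Solver with a coarse grid search over \(\alpha\), which is an assumption made here for simplicity. The names `observed`, `sum_squared_errors`, and `best_alpha` are hypothetical.

```python
# Observed values y_1..y_12 from the table for Problem 7.
observed = [17, 21, 19, 23, 18, 16, 20, 18, 22, 20, 15, 22]


def sum_squared_errors(alpha: float, y: list) -> float:
    """Sum of squared forecast errors for basic exponential smoothing."""
    forecast = y[0]  # start-up rule: forecast for period 1 equals y_1
    sse = 0.0
    for actual in y:
        error = actual - forecast          # forecast error y_t - yhat_t
        sse += error ** 2
        # yhat_{t+1} = alpha * y_t + (1 - alpha) * yhat_t
        forecast = alpha * actual + (1 - alpha) * forecast
    return sse


sse_at_03 = sum_squared_errors(0.3, observed)

# Coarse grid search over alpha in [0, 1] in steps of 0.01
# (a stand-in for a nonlinear solver, not a replacement for it).
candidates = [k / 100 for k in range(101)]
best_alpha = min(candidates, key=lambda a: sum_squared_errors(a, observed))
```

With \(\alpha = 0.3\) the function reproduces the table's total of 102.86 after rounding. The grid search only locates the best value to two decimal places.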

8. Andalus Furniture Company has two manufacturing plants, one at Aynor and another at Spartanburg. The cost in dollars of producing a kitchen chair at each of the two plants is given here. The cost of producing \(Q_1\) chairs at Aynor is

\[ 75Q_1 + 5Q_1^2 + 100 \]

and the cost of producing \(Q_2\) kitchen chairs at Spartanburg is

\[ 25Q_2 + 2.5Q_2^2 + 150 \]

Andalus needs to manufacture a total of 40 kitchen chairs to meet an order just received. How many chairs should be made at Aynor, and how many should be made at Spartanburg in order to minimize total production cost?

9. The economic order quantity (EOQ) model is a classical model used for controlling inventory and satisfying demand. Costs included in the model are holding cost per unit, ordering cost, and the cost of goods ordered. The assumptions for that model are that only a single item is considered, that the entire quantity ordered arrives at one time, that the demand for the item is constant over time, and that no shortages are allowed.

Suppose we relax the first assumption and allow for multiple items that are independent except for a restriction on the amount of space available to store the products. The following model describes this situation: Let

\(D_j\) = annual demand for item j

\(C_j\) = unit cost of item j

\(S_j\) = cost per order placed for item j

\(w_j\) = space required for item j

\(W\) = the maximum amount of space available for all goods

\(i\) = inventory carrying charge as a percentage of the cost per unit

The decision variables are \(Q_j\), the amount of item j to order. The model is:

\[
\begin{aligned}
\text{Minimize} \quad & \sum_{j=1}^{N} \left[ C_j D_j + \frac{S_j D_j}{Q_j} + \frac{i C_j Q_j}{2} \right] \\
\text{s.t.} \quad & \sum_{j=1}^{N} w_j Q_j \le W \\
& Q_j \ge 0, \quad j = 1, 2, \ldots, N
\end{aligned}
\]

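Exponential smoothing forecasts for Problem 7 with \(\alpha = 0.3\):

| Week (t) | Observed Value (\(y_t\)) | Forecast (\(\hat{y}_t\)) | Forecast Error (\(y_t - \hat{y}_t\)) | Squared Forecast Error (\((y_t - \hat{y}_t)^2\)) |
|---|---|---|---|---|
| 1 | 17 | 17.00 | 0.00 | 0.00 |
| 2 | 21 | 17.00 | 4.00 | 16.00 |
| 3 | 19 | 18.20 | 0.80 | 0.64 |
| 4 | 23 | 18.44 | 4.56 | 20.79 |
| 5 | 18 | 19.81 | −1.81 | 3.27 |
| 6 | 16 | 19.27 | −3.27 | 10.66 |
| 7 | 20 | 18.29 | 1.71 | 2.94 |
| 8 | 18 | 18.80 | −0.80 | 0.64 |
| 9 | 22 | 18.56 | 3.44 | 11.83 |
| 10 | 20 | 19.59 | 0.41 | 0.17 |
| 11 | 15 | 19.71 | −4.71 | 22.23 |
| 12 | 22 | 18.30 | 3.70 | 13.69 |
| | | | SUM = | 102.86 |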


In the objective function, the first term is the annual cost of goods, the second is the annual ordering cost (\(D_j/Q_j\) is the number of orders), and the last term is the annual inventory holding cost (\(Q_j/2\) is the average amount of inventory).

Construct and solve a nonlinear optimization model for the following data:

| | Item 1 | Item 2 | Item 3 |
|---|---|---|---|
| Annual Demand | 2,000 | 2,000 | 1,000 |
| Item Cost ($) | 100 | 50 | 80 |
| Order Cost ($) | 150 | 135 | 125 |
| Space Required (sq. feet) | 50 | 25 | 40 |

W = 5,000

i = 0.20
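The objective and space constraint of the multi-item EOQ model in Problem 9 can be written as two small functions. This is a sketch under stated assumptions rather than a solution: the data come from the problem's table, while the names `ITEMS`, `total_cost`, `space_used` and the trial order quantities of 40 units per item are hypothetical choices made only to show the calculation.

```python
# Data from the table: (annual demand D, unit cost C, order cost S, space w).
ITEMS = [
    (2000, 100, 150, 50),   # Item 1
    (2000, 50, 135, 25),    # Item 2
    (1000, 80, 125, 40),    # Item 3
]
W = 5000        # maximum space available
CARRY = 0.20    # inventory carrying charge i


def total_cost(quantities):
    """Annual cost of goods + ordering cost + holding cost over all items."""
    cost = 0.0
    for (d, c, s, _w), q in zip(ITEMS, quantities):
        cost += c * d + s * d / q + CARRY * c * q / 2
    return cost


def space_used(quantities):
    """Left-hand side of the space constraint, sum of w_j * Q_j."""
    return sum(w * q for (_d, _c, _s, w), q in zip(ITEMS, quantities))


# Hypothetical trial point, not the optimal solution.
trial = [40, 40, 40]
trial_cost = total_cost(trial)
trial_feasible = space_used(trial) <= W
```

A solver would adjust the three quantities to lower `total_cost` while keeping `space_used` at or below `W`.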

LaRosaDemand

10. Phillips Inc. produces two distinct products, A and B. The products do not compete with each other in the marketplace; that is, neither cost, price, nor demand for one product will impact the demand for the other. Phillips’ analysts have collected data on the effects of advertising on profits. These data suggest that, although higher advertising correlates with higher profits, the marginal increase in profits diminishes at higher advertising levels, particularly for product B. Analysts have estimated the following functions:

\[ \text{Annual profit for product A} = 1.2712\,\mathrm{LN}(X_A) + 17.414 \]

\[ \text{Annual profit for product B} = 0.3970\,\mathrm{LN}(X_B) + 16.109 \]

where \(X_A\) and \(X_B\) are the advertising amount allocated to products A and B, respectively, in thousands of dollars, profit is in millions of dollars, and LN is the natural logarithm function. The advertising budget is $500,000, and management has dictated that at least $50,000 must be allocated to each of the two products. (Hint: To compute a natural logarithm for the value X in Excel, use the formula =LN(X). For Solver to find an answer, you also need to start with decision variable values greater than 0 in this problem.)

a. Build an optimization model that will prescribe how Phillips should allocate its marketing budget to maximize profit.

b. Solve the model you constructed in part (a) using Excel Solver.

11. Let us consider again the data from the LaRosa tool bin location problem discussed in Section 14.3.

a. Suppose we know the average number of daily trips made to the tool bin from each production station. The average number of trips per day are 12 for Fabrication, 24 for Paint, 13 for Subassembly 1, 7 for Subassembly 2, and 17 for Assembly. It seems as though we would want the tool bin closer to those stations with high average numbers of trips. Develop a new unconstrained model that minimizes the sum of the demand-weighted distance defined as the product of the demand (measured in number of trips) and the distance to the station.

b. Solve the model you developed in part (a). Comment on the differences between the unweighted distance solution given in Section 14.3 and the demand-weighted solution.

12. TN Communications provides cellular telephone services. The company is planning to expand into the Cincinnati area and is trying to determine the best location for its transmission tower. The tower transmits over a radius of 10 miles. The locations that must be reached by this tower are shown in the following figure.

TN Communications would like to find the tower location that reaches each of these cities and minimizes the sum of the distances to all locations from the new tower.

a. Formulate and solve a model to find the optimal location.

b. Formulate and solve a model that minimizes the maximum distance from the transmission tower location to the city locations.

13. The distance between two cities in the United States can be approximated by the following formula, where \(\text{lat}_1\) and \(\text{long}_1\) are the latitude and longitude of city 1 and \(\text{lat}_2\) and \(\text{long}_2\) are the latitude and longitude of city 2:

\[ 69\sqrt{(\text{lat}_1 - \text{lat}_2)^2 + (\text{long}_1 - \text{long}_2)^2} \]

Ted’s daughter is getting married, and he is inviting relatives from 15 different locations in the United States. The file Wedding gives the longitude, latitude, and number of relatives in each of the 15 locations. Ted would like to find a wedding location that minimizes the demand-weighted distance, where demand is the number of relatives at each location. Assuming that the wedding can occur anywhere, find the latitude and longitude of the optimal location. (Hint: Notice that all longitude values given for this problem are negative. Make sure that you do not check the option for Make Unconstrained Variables Non-Negative in Solver.)
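The distance approximation and the demand-weighted objective in Problem 13 can be sketched as follows. This Python is illustrative only; the contents of the Wedding file are not reproduced here, so the two sample locations and the candidate site are made-up values, and the function names are hypothetical.

```python
import math


def approx_distance_miles(lat1, long1, lat2, long2):
    """Approximate distance: 69 * sqrt((lat1 - lat2)^2 + (long1 - long2)^2)."""
    return 69 * math.sqrt((lat1 - lat2) ** 2 + (long1 - long2) ** 2)


def demand_weighted_distance(site, locations):
    """Sum over locations of (number of relatives) * (distance to the site).

    site is (lat, long); locations is a list of (lat, long, relatives).
    """
    site_lat, site_long = site
    return sum(
        relatives * approx_distance_miles(site_lat, site_long, lat, lon)
        for lat, lon, relatives in locations
    )


# Made-up sample data (NOT from the Wedding file): (lat, long, relatives).
sample_locations = [(40.0, -80.0, 3), (43.0, -84.0, 2)]
candidate_site = (40.0, -80.0)
sample_total = demand_weighted_distance(candidate_site, sample_locations)
```

In the sample, the candidate site coincides with the first location, so only the second location contributes: 2 relatives times 69 × 5 = 345 miles. Minimizing `demand_weighted_distance` over the site's latitude and longitude is the unconstrained problem the exercise describes.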

14. Consider the stock return scenarios for Apple Computer (APPL), Advanced Micro Devices (AMD), and Oracle Corporation (ORCL) shown in the following table:

| Scenario | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| APPL | −39.80 | 10.10 | 124.90 | 151.80 | −58.30 | 14.30 | −41.90 | 57.10 |
| AMD | −42.50 | 13.60 | 56.90 | 36.70 | −34.80 | −67.40 | 183.60 | 6.30 |
| ORCL | −10.20 | 137.90 | 170.60 | 16.60 | −40.70 | −30.30 | 15.20 | −0.60 |

a. Develop the Markowitz portfolio model for these data with a required expected return of 25%. Assume that the eight scenarios are equally likely to occur.

b. Solve the model developed in part (a).

c. Vary the required return in 1% increments from 25% to 30% and plot the efficient frontier.
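The quantities that the Markowitz model in Problem 14 works with, namely the portfolio return in each scenario, its mean, and its variance, can be computed for any trial allocation. The sketch below is a hedged illustration in Python, not a solver: it assumes equally likely scenarios as the problem states and the variance definition (1/n) Σ (R_s − R̄)², and the names `RETURNS`, `scenario_returns`, and `mean_and_variance` as well as the trial weights are hypothetical.

```python
# Scenario returns (%) from the table in Problem 14.
RETURNS = {
    "APPL": [-39.80, 10.10, 124.90, 151.80, -58.30, 14.30, -41.90, 57.10],
    "AMD": [-42.50, 13.60, 56.90, 36.70, -34.80, -67.40, 183.60, 6.30],
    "ORCL": [-10.20, 137.90, 170.60, 16.60, -40.70, -30.30, 15.20, -0.60],
}
N_SCENARIOS = 8


def scenario_returns(weights):
    """Portfolio return R_s in each scenario for the given stock weights."""
    return [
        sum(weights[stock] * RETURNS[stock][s] for stock in RETURNS)
        for s in range(N_SCENARIOS)
    ]


def mean_and_variance(weights):
    """Expected return and variance with equally likely scenarios."""
    r = scenario_returns(weights)
    mean = sum(r) / len(r)
    variance = sum((x - mean) ** 2 for x in r) / len(r)
    return mean, variance


# Hypothetical trial allocations; weights sum to 1.
all_appl_mean, all_appl_var = mean_and_variance(
    {"APPL": 1.0, "AMD": 0.0, "ORCL": 0.0})
equal_mean, equal_var = mean_and_variance(
    {"APPL": 1 / 3, "AMD": 1 / 3, "ORCL": 1 / 3})
```

The optimization model chooses the weights that minimize the variance while keeping the mean at or above the required return; repeating that for several required returns traces the efficient frontier.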

15. A second version of the Markowitz portfolio model maximizes expected return subject to a constraint that the variance of the portfolio must be less than or equal to some specified amount. Consider again the Hauck Financial Service data given in Section 14.4.

StockReturn1

Wedding

Figure for Problem 12: TN Locations (city positions plotted on x and y axes)

| City | x | y |
|---|---|---|
| Florence | 10 | 10 |
| Covington | 12 | 16 |
| Hyde Park | 16 | 18 |
| Evendale | 12 | 22 |

Annual Return (%)

Mutual Fund Year 1 Year 2 Year 3 Year 4 Year 5

Foreign Stock 10.06 13.12 13.47 45.42 −21.93

Intermediate-Term Bond 17.64 3.25 7.51 −1.33 7.36

Large-Cap Growth 32.41 18.71 33.28 41.46 −23.26

Large-Cap Value 32.36 20.61 12.93 7.06 −5.37

Small-Cap Growth 33.44 19.40 3.85 58.68 −9.02

Small-Cap Value 24.56 25.32 −6.70 5.43 17.31

a. Construct this version of the Markowitz model for a maximum variance of 30.

b. Solve the model developed in part (a).

16. Reconsider the data in Problem 15. Construct a model that maximizes the minimum return achieved over the five scenarios provided. Solve your model to find the optimal portfolio.

17. Consider the following stock return data:

| Scenario | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Stock 1 | 0.300 | 0.103 | 0.216 | −0.046 | −0.071 | 0.056 |
| Stock 2 | 0.225 | 0.290 | 0.216 | −0.272 | 0.144 | 0.107 |
| Stock 3 | 0.149 | 0.260 | 0.419 | −0.078 | 0.169 | −0.035 |

| Scenario | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|
| Stock 1 | 0.038 | 0.089 | 0.090 | 0.083 | 0.035 | 0.176 |
| Stock 2 | 0.321 | 0.305 | 0.195 | 0.390 | −0.072 | 0.715 |
| Stock 3 | 0.133 | 0.732 | 0.021 | 0.131 | 0.006 | 0.908 |

StockReturn2

a. Construct the Markowitz portfolio model using a required expected return of 15%. Assume that the 12 scenarios are equally likely to occur.

b. Solve the model using Excel Solver.

c. Solve the model for various values of required expected return and plot the efficient frontier.

18. Let us consider again the investment data from Hauck Financial Services used in Section 14.4 to illustrate the Markowitz portfolio model. The data are shown below, along with the return of the S&P 500 Index. Hauck would like to create a portfolio using the funds listed, so that the resulting portfolio matches the return of the S&P 500 index as closely as possible.

Hauck500

| Mutual Fund | Year 1 | Year 2 | Year 3 | Year 4 | Year 5 |
|---|---|---|---|---|---|
| Foreign Stock | 10.060 | 13.120 | 13.470 | 45.420 | −21.930 |
| Intermediate-Term Bond | 17.640 | 3.250 | 7.510 | −1.330 | 7.360 |
| Large-Cap Growth | 32.410 | 18.710 | 33.280 | 41.460 | −23.260 |
| Large-Cap Value | 32.360 | 20.610 | 12.930 | 7.060 | −5.370 |
| Small-Cap Growth | 33.440 | 19.400 | 3.850 | 58.680 | −9.020 |
| Small-Cap Value | 24.560 | 25.320 | −6.700 | 5.430 | 17.310 |
| S&P 500 Return | 25.000 | 20.000 | 8.000 | 30.000 | −10.000 |

a. Develop an optimization model that will give the fraction of the portfolio to invest in each of the funds so that the return of the resulting portfolio matches as closely as possible the return of the S&P 500 Index. (Hint: Minimize the sum of the squared deviations between the portfolio’s return and the S&P 500 Index return for each year in the data set.)

b. Solve the model developed in part (a).

HauckData


19. As discussed in Section 14.4, the Markowitz model uses the variance of the portfolio as the measure of risk. However, variance includes deviations both below and above the mean return. Semivariance includes only deviations below the mean and is considered by many to be a better measure of risk.

a. Develop a model that minimizes semivariance for the Hauck Financial data given in the file HauckData with a required return of 10%. (Hint: Modify model (14.8)–(14.17). Define a variable \(d_s\) for each scenario and let \(d_s \ge \bar{R} - R_s\) with \(d_s \ge 0\). Then make the objective function: Min \(\frac{1}{5}\sum_{s=1}^{5} d_s^2\).)

b. Solve the model you developed in part (a) with a required expected return of 10%.

20. Refer to Problem 15. Use the model developed there to construct an efficient frontier by varying the maximum allowable variance from 20 to 60 in increments of 5 and solving for the maximum return for each. Plot the efficient frontier and compare it to Figure 14.12.

21. The weekly box office revenues (in $ millions) for the summer blockbuster movie discussed in Section 14.5 follow. Use these data in the Bass forecasting model given by equations (14.19)–(14.21) to estimate the parameters p, q, and m.

| Week | Revenues |
|---|---|
| 1 | 72.39 |
| 2 | 37.93 |
| 3 | 17.58 |
| 4 | 9.57 |
| 5 | 5.39 |
| 6 | 3.13 |
| 7 | 1.62 |
| 8 | 0.87 |
| 9 | 0.61 |
| 10 | 0.26 |
| 11 | 0.19 |
| 12 | 0.35 |

Blockbuster

The Bass forecasting model is a good example of a difficult-to-solve nonlinear program, and the answer you get may be a local optimum that is not nearly as good as the result given in Table 14.4. Solve the model using Excel Solver with the Multistart option, and see whether you can duplicate the results in Table 14.4. Use a lower bound of −1 and an upper bound of 1 on both p and q. Use a lower bound of 100 and an upper bound of 1,000 on m.

22. A women’s clothing retail chain has collected data on pricing and sales over the last five years at its flagship store in Charleston, S.C. These data were used to estimate a regression equation that relates price to demand. The following estimated equation relates demand to the price for summer dresses:

\[ Y = 1{,}000 - 1.89p \]

where \(Y\) = the demand for summer dresses and \(p\) = the price per dress. Summer dresses cost $210. The data also show that when a summer dress is sold, on average one pair of shoes and one purse are sold with the dress. The profit on a pair of shoes is $18 and the profit on a purse is $26.

a. What is the profit-maximizing price for dresses, ignoring the profit associated with the accompanying shoe and purse?

b. What is the profit-maximizing price for dresses taking into account the accompanying shoe and purse purchases?

c. Discuss the difference in prices obtained in parts (a) and (b).


Case Problem: Portfolio Optimization with Transaction Costs

Hauck Financial Services has a number of passive, buy-and-hold clients. For these clients, Hauck offers an investment account whereby clients agree to put their money into a portfolio of mutual funds that is rebalanced once a year. When the rebalancing occurs, Hauck determines the mix of mutual funds in each investor’s portfolio by solving an extension of the Markowitz portfolio model that incorporates transaction costs. Investors are charged a small transaction cost for the annual rebalancing of their portfolio. For simplicity, assume the following:

• At the beginning of the time period (in this case one year), the portfolio is rebalanced by buying and selling Hauck mutual funds.

• The transaction costs associated with buying and selling mutual funds are paid at the beginning of the period when the portfolio is rebalanced, which, in effect, reduces the amount of money available to reinvest.

• No further transactions are made until the end of the time period, at which point the new value of the portfolio is observed.

• The transaction cost is a linear function of the dollar amount of mutual funds bought or sold.

Jean Delgado is one of Hauck’s buy-and-hold clients. We briefly describe the model as it is used by Hauck for rebalancing her portfolio. The mutual funds being considered for her portfolio are a foreign stock fund (FS), an intermediate-term bond fund (IB), a large-cap growth fund (LG), a large-cap value fund (LV), a small-cap growth fund (SG), and a small-cap value fund (SV). In the traditional Markowitz model, the variables are usually interpreted as the proportion of the portfolio invested in the asset represented by the variable. For example, FS is the proportion of the portfolio invested in the foreign stock fund. However, it is equally correct to interpret FS as the dollar amount invested in the foreign stock fund. Then FS = 25,000 implies that $25,000 is invested in the foreign stock fund. Based on these assumptions, the initial portfolio value must equal the amount of money spent on transaction costs plus the amount invested in all the assets after rebalancing; that is,

Initial portfolio value = amount invested in all assets after rebalancing + transaction costs

The extension of the Markowitz model that Hauck uses for rebalancing portfolios requires a balance constraint for each mutual fund. This balance constraint is

Amount invested in fund i = initial holding of fund i + amount of fund i purchased − amount of fund i sold

Using this balance constraint requires three additional variables for each fund: one for the amount invested prior to rebalancing, one for the amount sold, and one for the amount purchased. For instance, the balance constraint for the foreign stock fund is

FS = FS_START + FS_BUY − FS_SELL

Jean Delgado has $100,000 in her account prior to the annual rebalancing, and she has specified a minimum acceptable return of 10%. Hauck plans to use the following model to rebalance Ms. Delgado’s portfolio. The complete model with transaction costs is

\[ \text{Min} \quad \frac{1}{5}\sum_{s=1}^{5} (R_s - \bar{R})^2 \]

s.t.

0.1006FS + 0.1764IB + 0.3241LG + 0.3236LV + 0.3344SG + 0.2456SV = R_1

0.1312FS + 0.0325IB + 0.1871LG + 0.2061LV + 0.1940SG + 0.2532SV = R_2

0.1347FS + 0.0751IB + 0.3328LG + 0.1293LV + 0.0385SG - 0.0670SV = R_3

0.4542FS - 0.0133IB + 0.4146LG + 0.0706LV + 0.5868SG + 0.0543SV = R_4

-0.2193FS + 0.0736IB - 0.2326LG - 0.0537LV - 0.0902SG + 0.1731SV = R_5

\[ \frac{1}{5}\sum_{s=1}^{5} R_s = \bar{R} \]

\[ \bar{R} \ge 10{,}000 \]

FS + IB + LG + LV + SG + SV + TRANS_COST = 100,000

FS_START + FS_BUY - FS_SELL = FS

IB_START + IB_BUY - IB_SELL = IB

LG_START + LG_BUY - LG_SELL = LG

LV_START + LV_BUY - LV_SELL = LV

SG_START + SG_BUY - SG_SELL = SG

SV_START + SV_BUY - SV_SELL = SV

TRANS_FEE * (FS_BUY + FS_SELL + IB_BUY + IB_SELL + LG_BUY + LG_SELL + LV_BUY + LV_SELL + SG_BUY + SG_SELL + SV_BUY + SV_SELL) = TRANS_COST

FS_START = 10,000

IB_START = 10,000

LG_START = 10,000

LV_START = 40,000

SG_START = 10,000

SV_START = 20,000

TRANS_FEE = 0.01

FS, IB, LG, LV, SG, SV ≥ 0

Notice that the transaction fee is set at 1% in the model (the last constraint) and that the transaction cost for buying and selling shares of the mutual funds is a linear function of the amount bought and sold. With this model, the transaction costs are deducted from the client’s account at the time of rebalancing and thus reduce the amount of money invested. The solution for Ms. Delgado’s rebalancing problem is shown as part of the Managerial Report.

Managerial Report

Assume that you are a financial analytics specialist newly hired by Hauck Financial Services. One of your first tasks is to review the portfolio rebalancing model in order to resolve a dispute with Jean Delgado. Ms. Delgado has had one of the Hauck passively managed portfolios for the past five years and has complained that she is not getting the rate of return of 10% that she specified. After reviewing her annual statements for the past five years, she feels that she is actually getting less than 10% on average.

1. According to the following Model Solution, IB_BUY = $41,268.51. How much in transaction costs did Ms. Delgado pay for purchasing additional shares of the intermediate-term bond fund?

2. Based on the Model Solution, what is the total transaction cost associated with rebalancing Ms. Delgado’s portfolio?

3. After paying transactions costs, how much did Ms. Delgado have invested in mutual funds after her portfolio was rebalanced?

4. According to the Model Solution, IB = $51,268.51. How much can Ms. Delgado expect to have in the intermediate-term bond fund at the end of the year?

5. According to the Model Solution, the expected return of the portfolio is $10,000. What is the expected dollar amount in Ms. Delgado’s portfolio at the end of the year? Can she expect to earn 10% on the $100,000 she had at the beginning of the year?


6. It is now time to prepare a report to management to explain why Ms. Delgado did not earn 10% each year on her investment. Make a recommendation in terms of a revised portfolio model that can be used so that Jean Delgado can have an expected portfolio balance of $110,000 at the end of next year. Prepare a report that includes a modified optimization model that will give an expected return of 10% on the amount of money available at the beginning of the year before paying the transaction costs. Explain why the current model does not do this.

7. Solve the formulation in part (6) for Jean Delgado. How does the portfolio composition differ from that of the Model Solution?

MODEL SOLUTION

Optimal Objective Value

27219457.356

| Variable | Value | Variable | Value |
|---|---|---|---|
| \(R_1\) | 18953.280 | IB_START | 10000.000 |
| \(\bar{R}\) | 10000.000 | IB_BUY | 41268.510 |
| \(R_2\) | 11569.210 | IB_SELL | 0.000 |
| \(R_3\) | 5663.961 | LG_START | 10000.000 |
| \(R_4\) | 9693.921 | LG_BUY | 0.000 |
| \(R_5\) | 4119.631 | LG_SELL | 5060.688 |
| FS | 15026.860 | LV_START | 40000.000 |
| IB | 51268.510 | LV_BUY | 0.000 |
| LG | 4939.312 | LV_SELL | 40000.000 |
| LV | 0.000 | SG_START | 10000.000 |
| SG | 0.000 | SG_BUY | 0.000 |
| SV | 27675.000 | SG_SELL | 10000.000 |
| TRANS_COST | 1090.311 | SV_START | 20000.000 |
| FS_START | 10000.000 | SV_BUY | 7675.004 |
| FS_BUY | 5026.863 | SV_SELL | 0.000 |
| FS_SELL | 0.000 | TRANS_FEE | 0.010 |
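The Model Solution can be checked against the model's accounting constraints. The following Python sketch is an illustration rather than part of the case: it takes the reported values as given and confirms, within rounding, that the fee constraint, the fund balance constraints, and the $100,000 budget identity hold. The dictionary layout and variable names are assumptions made for this example.

```python
TRANS_FEE = 0.01
INITIAL_VALUE = 100_000

# fund: (START, BUY, SELL, amount held after rebalancing), from the Model Solution.
solution = {
    "FS": (10000.000, 5026.863, 0.000, 15026.860),
    "IB": (10000.000, 41268.510, 0.000, 51268.510),
    "LG": (10000.000, 0.000, 5060.688, 4939.312),
    "LV": (40000.000, 0.000, 40000.000, 0.000),
    "SG": (10000.000, 0.000, 10000.000, 0.000),
    "SV": (20000.000, 7675.004, 0.000, 27675.000),
}
reported_trans_cost = 1090.311

# Fee constraint: TRANS_FEE * (all buys + all sells) = TRANS_COST.
traded = sum(buy + sell for _start, buy, sell, _held in solution.values())
trans_cost = TRANS_FEE * traded

# Balance constraint per fund: START + BUY - SELL = amount held.
balance_gaps = {
    fund: abs(start + buy - sell - held)
    for fund, (start, buy, sell, held) in solution.items()
}

# Budget identity: amounts held + transaction cost = initial portfolio value.
invested = sum(held for _s, _b, _sell, held in solution.values())
budget_gap = abs(invested + trans_cost - INITIAL_VALUE)
```

Small gaps of a cent or less come from the three-decimal rounding in the printed solution.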

Chapter 15

Decision Analysis

Contents

ANALYTICS IN ACTION: PHYTOPHARM

15.1 PROBLEM FORMULATION
Payoff Tables
Decision Trees

15.2 DECISION ANALYSIS WITHOUT PROBABILITIES
Optimistic Approach
Conservative Approach
Minimax Regret Approach

15.3 DECISION ANALYSIS WITH PROBABILITIES
Expected Value Approach
Risk Analysis
Sensitivity Analysis

15.4 DECISION ANALYSIS WITH SAMPLE INFORMATION
Expected Value of Sample Information
Expected Value of Perfect Information

15.5 COMPUTING BRANCH PROBABILITIES WITH BAYES’ THEOREM

15.6 UTILITY THEORY
Utility and Decision Analysis
Utility Functions
Exponential Utility Function

APPENDIX 15.1 USING ANALYTIC SOLVER TO CREATE DECISION TREES (MINDTAP READER)

Analytics in Action

Phytopharm*

As a pharmaceutical development and functional food company, Phytopharm’s primary revenue streams come from licensing agreements with larger companies. After Phytopharm establishes proof of principle for a new product by successfully completing early clinical trials, it seeks to reduce its risk by licensing the product to a large pharmaceutical or nutrition company that will further develop and market it.

There is substantial uncertainty regarding the future sales potential of early stage products; only 1 in 10 of such products makes it to market, and only 30% of these yield a healthy return. Phytopharm and its licensing partners would often initially propose very different terms for the licensing agreement. Therefore, Phytopharm employed a team of researchers to develop a flexible method for appraising a product’s potential and subsequently supporting the negotiation of the lump-sum payments for development milestones and royalties on eventual sales that comprise the licensing agreement.

Using computer simulation, the resulting decision analysis model allows Phytopharm to perform sensitivity analysis on estimates of development cost, the probability of successful Food and Drug Administration clearance, launch date, market size, market share, and patent expiry. In particular, a decision tree model allows Phytopharm and its licensing partner to mutually agree on the number of development milestones. Depending on the status of the project at a milestone, the licensing partner can opt to abandon the project or continue development. Laying out these sequential decisions in a decision tree allows Phytopharm to negotiate milestone payments and royalties that equitably split the project’s value between Phytopharm and its potential licensee.

*Based on P. Crama, B. De Reyck, Z. Degraeve, and W. Chong, “Research and Development Project Valuation and Licensing Negotiations at Phytopharm plc,” Interfaces, 37, no. 5 (September–October 2007): 472–487.

Ultimately, business analytics is about making better decisions. The tools and techniques we have introduced previously are designed to aid a decision maker in analyzing existing data, predicting future behavior, and recommending decisions. This chapter introduces a field known as decision analysis that can be used to develop an optimal strategy when a decision maker is faced with several decision alternatives and an uncertain or risk-filled pattern of future events. For example, by evaluating the different naming options and understanding the potential sources of uncertainty, Procter & Gamble used decision analysis techniques to help choose the best brand name when they introduced Crest White Strips.

Decision analysis techniques are used widely in many different settings. The Analytics in Action, Phytopharm, discusses the use of decision analysis to manage Phytopharm’s pipeline of pharmaceutical products, which have long development times and relatively high levels of uncertainty. Federal agencies in the United States have used decision analysis to evaluate the potential risks from terrorist attacks and to recommend counterterrorism strategies. The State of North Carolina used decision analysis in evaluating whether to implement a medical screening test to detect metabolic disorders in newborns.

Even when a careful decision analysis has been conducted, uncertainty about future events means that the final outcome is not completely under the control of the decision maker. In some cases, the selected decision alternative may provide good or excellent results. In other cases, a relatively unlikely future event may occur, causing the selected decision alternative to provide only fair or even poor results. The risk associated with any decision alternative is a direct result of the uncertainty associated with the final outcome. A good decision analysis includes careful consideration of risk. Through risk analysis, the decision maker is provided with probability information about the favorable as well as the unfavorable outcomes that may occur.

680 Chapter 15 Decision Analysis

We begin the study of decision analysis by considering problems that involve reasonably few decision alternatives and reasonably few possible future events. Payoff tables and decision trees are introduced to provide a structure for the decision problem and to illustrate the fundamentals of decision analysis. Decision trees are used to analyze more complex problems and to identify an optimal sequence of decisions, referred to as an optimal decision strategy. Sensitivity analysis shows how changes in various aspects of the problem affect the recommended decision alternative. We return to the use of Bayes’ theorem for calculating the probabilities of future events and incorporating additional information about the decisions. We conclude this chapter with a discussion of utility and decision analysis that expands on different attitudes toward risk taken by decision makers.

15.1 Problem Formulation

The first step in the decision analysis process is problem formulation. We begin with a verbal statement of the problem. We then identify the decision alternatives; the uncertain future events, referred to as chance events; and the outcomes associated with each combination of decision alternative and chance event outcome. Let us begin by considering a construction project of the Pittsburgh Development Corporation.

Pittsburgh Development Corporation (PDC) purchased land that will be the site of a new luxury condominium complex. The location provides a spectacular view of downtown Pittsburgh and the Golden Triangle, where the Allegheny and Monongahela Rivers meet to form the Ohio River. PDC plans to price the individual condominium units between $300,000 and $1,400,000.

PDC commissioned preliminary architectural drawings for three different projects: one with 30 condominiums, one with 60 condominiums, and one with 90 condominiums. The financial success of the project depends on the size of the condominium complex and the chance event concerning the demand for the condominiums. The statement of the PDC decision problem is to select the size of the new luxury condominium project that will lead to the largest profit given the uncertainty concerning the demand for the condominiums.

Given the statement of the problem, it is clear that the decision is to select the best size for the condominium complex. PDC has the following three decision alternatives:

a small complex with 30 condominiums

a medium complex with 60 condominiums

a large complex with 90 condominiums

A factor in selecting the best decision alternative is the uncertainty associated with the chance event concerning the demand for the condominiums. When asked about the possible demand for the condominiums, PDC’s president acknowledged a wide range of possibilities but decided that it would be adequate to consider two possible chance event outcomes: a strong demand and a weak demand.

In decision analysis, the possible outcomes for a chance event are referred to as the states of nature. The states of nature are defined so that they are mutually exclusive (no more than one can occur) and collectively exhaustive (at least one must occur); thus, one and only one of the possible states of nature will occur. For the PDC problem, the chance event concerning the demand for the condominiums has two states of nature:

strong demand for the condominiums

weak demand for the condominiums

Management must first select a decision alternative (complex size); then a state of nature follows (demand for the condominiums), and finally an outcome will occur. In this case, the outcome is PDC’s profit.

Bayes’ Theorem was introduced in Chapter 5.


Payoff Tables

Given the three decision alternatives and the two states of nature, which complex size should PDC choose? To answer this question, PDC will need to know the outcome associated with each possible combination of decision alternative and state of nature. In decision analysis, we refer to the outcome resulting from a specific combination of a decision alternative and a state of nature as a payoff. A table showing payoffs for all combinations of decision alternatives and states of nature is a payoff table.

Because PDC wants to select the complex size that provides the largest profit, profit is used as the outcome. The payoff table with profits (in millions of dollars) is shown in Table 15.1. Note, for example, that if a medium complex is built and demand turns out to be strong, a profit of $14 million will be realized. We will use the notation Vij to denote the payoff associated with decision alternative i and state of nature j. Using Table 15.1, V31 = 20 indicates that a payoff of $20 million occurs if the decision is to build a large complex (d3) and the strong demand state of nature (s1) occurs. Similarly, V32 = −9 indicates a loss of $9 million if the decision is to build a large complex (d3) and the weak demand state of nature (s2) occurs.
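The Vij notation maps naturally onto a lookup keyed by decision alternative and state of nature. The following is a minimal sketch, not part of the original text: the names `PAYOFFS` and `payoff` are hypothetical, and the values are those of Table 15.1.

```python
# Payoffs from Table 15.1, in millions of dollars.
# Keys are (decision alternative i, state of nature j); the 1-based
# indices mirror the Vij notation used in the text.
PAYOFFS = {
    (1, 1): 8,  (1, 2): 7,    # d1: small complex
    (2, 1): 14, (2, 2): 5,    # d2: medium complex
    (3, 1): 20, (3, 2): -9,   # d3: large complex
}

def payoff(i: int, j: int) -> int:
    """Return Vij, the payoff for decision alternative di under state sj."""
    return PAYOFFS[(i, j)]

print(payoff(3, 1))  # V31: large complex, strong demand
print(payoff(3, 2))  # V32: large complex, weak demand
```

A negative entry, such as the one for V32, represents a loss rather than a profit.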

Decision Trees

A decision tree provides a graphical representation of the decision-making process. Figure 15.1 presents a decision tree for the PDC problem. Note that the decision tree shows the natural or logical progression that will occur over time. First, PDC must make a decision regarding the size of the condominium complex (d1, d2, or d3). Then, after the decision is implemented, either state of nature s1 or s2 will occur. The number at each end point of the tree indicates the payoff associated with a particular sequence. For example, the topmost payoff of 8 indicates that an $8 million profit is anticipated if PDC constructs a small condominium complex (d1) and demand turns out to be strong (s1). The next payoff of 7 indicates an anticipated profit of $7 million if PDC constructs a small condominium complex (d1) and demand turns out to be weak (s2). Thus, the decision tree provides a graphical depiction of the sequences of decision alternatives and states of nature that provide the six possible payoffs for PDC.

The decision tree in Figure 15.1 shows four nodes, numbered 1 to 4. Nodes are used to represent decisions and chance events. Squares are used to depict decision nodes, and circles are used to depict chance nodes. Thus, node 1 is a decision node, and nodes 2, 3, and 4 are chance nodes. The branches connect the nodes; those leaving the decision node correspond to the decision alternatives. The branches leaving each chance node correspond to the states of nature. The outcomes (payoffs) are shown at the end of the states-of-nature branches. We now turn to the question: How can the decision maker use the information in the payoff table or the decision tree to select the best decision alternative? Several approaches may be used and are covered in the remaining sections of this chapter.

Payoffs can be expressed in terms of profit, cost, time, distance, or any other measure appropriate for the decision problem being analyzed.

TABLE 15.1 Payoff Table for the PDC Condominium Project ($ Millions)

                           State of Nature
Decision Alternative       Strong Demand, s1    Weak Demand, s2
Small complex, d1          8                    7
Medium complex, d2         14                   5
Large complex, d3          20                   −9


15.2 Decision Analysis without Probabilities

In this section we consider approaches to decision analysis that do not require knowledge of the probabilities of the states of nature. These approaches are appropriate in situations in which a simple best-case and worst-case analysis is sufficient and in which the decision maker has little confidence in his or her ability to assess the probabilities. Because different approaches sometimes lead to different decision recommendations, the decision maker must understand the approaches available and then select the specific approach that, according to the judgment of the decision maker, is the most appropriate.

Optimistic Approach

The optimistic approach evaluates each decision alternative in terms of the best payoff that can occur. The decision alternative that is recommended is the one that provides the best possible payoff. For a problem in which maximum profit is desired, as in the PDC problem, the optimistic approach would lead the decision maker to choose the alternative corresponding to the largest profit. For problems involving minimization, this approach leads to choosing the alternative with the smallest payoff.

To illustrate the optimistic approach, we use it to develop a recommendation for the PDC problem. First, we determine the maximum payoff for each decision alternative; then we select the decision alternative that provides the overall maximum payoff. These steps systematically identify the decision alternative that provides the largest possible profit. Table 15.2 illustrates these steps.

For a maximization problem, the optimistic approach often is referred to as the maximax approach; for a minimization problem, the corresponding terminology is minimin.

NOTES + COMMENTS

1. The first step in solving a complex problem is to decompose the problem into a series of smaller subproblems. Decision trees provide a useful way to decompose a problem and illustrate the sequential nature of the decision process.

2. People often view the same problem from different perspectives. Thus, the discussion regarding the development of a decision tree may provide additional insight into the problem.

FIGURE 15.1 Decision Tree for the PDC Condominium Project ($ Millions)

[Figure: decision node 1 branches to Small (d1), Medium (d2), and Large (d3); each branch leads to a chance node with Strong (s1) and Weak (s2) branches ending in the payoffs of Table 15.1.]

Because 20, corresponding to d3, is the largest payoff, the decision to construct the large condominium complex is the recommended decision alternative using the optimistic approach.
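The two steps of the optimistic approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary layout and the function name `optimistic_choice` are hypothetical, and the payoffs are the Table 15.1 values for a maximization problem.

```python
# Payoffs from Table 15.1 ($ millions): [strong demand, weak demand].
payoffs = {
    "small (d1)": [8, 7],
    "medium (d2)": [14, 5],
    "large (d3)": [20, -9],
}

def optimistic_choice(payoffs):
    """Maximax: take each alternative's best payoff, then pick the largest."""
    best_case = {d: max(values) for d, values in payoffs.items()}
    choice = max(best_case, key=best_case.get)
    return choice, best_case

choice, best_case = optimistic_choice(payoffs)
print(best_case)
print("Optimistic recommendation:", choice)
```

For a minimization problem (for example, when payoffs are costs), both calls to `max` would become `min`, which corresponds to the minimin terminology.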

Conservative Approach

The conservative approach evaluates each decision alternative in terms of the worst payoff that can occur. The decision alternative recommended is the one that provides the best of the worst possible payoffs. For a problem in which the output measure is profit, as in the PDC problem, the conservative approach would lead the decision maker to choose the alternative that maximizes the minimum possible profit that could be obtained. For problems involving minimization (e.g., when the output measure is cost instead of profit), this approach identifies the alternative that will minimize the maximum payoff.

To illustrate the conservative approach, we use it to develop a recommendation for the PDC problem. First, we identify the minimum payoff for each of the decision alternatives; then we select the decision alternative that maximizes the minimum payoff. Table 15.3 illustrates these steps for the PDC problem.

Because 7, corresponding to d1, yields the maximum of the minimum payoffs, the decision alternative of a small condominium complex is recommended. This decision approach is considered conservative because it identifies the worst possible payoffs and then recommends the decision alternative that avoids the possibility of extremely “bad” payoffs. In the conservative approach, PDC is guaranteed a profit of at least $7 million. Although PDC may make more, it cannot make less than $7 million.
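The conservative approach differs from the optimistic one only in which payoff is extracted for each alternative. A minimal sketch, assuming the Table 15.1 payoffs and a maximization problem; the function name `conservative_choice` is hypothetical.

```python
# Payoffs from Table 15.1 ($ millions): [strong demand, weak demand].
payoffs = {
    "small (d1)": [8, 7],
    "medium (d2)": [14, 5],
    "large (d3)": [20, -9],
}

def conservative_choice(payoffs):
    """Maximin: take each alternative's worst payoff, then pick the largest."""
    worst_case = {d: min(values) for d, values in payoffs.items()}
    choice = max(worst_case, key=worst_case.get)
    return choice, worst_case

choice, worst_case = conservative_choice(payoffs)
print(worst_case)
print("Conservative recommendation:", choice)
print("Guaranteed minimum profit ($ millions):", worst_case[choice])
```

The value reported for the chosen alternative is the profit floor that the text describes as guaranteed under this approach.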

Minimax Regret Approach

In decision analysis, regret is the difference between the payoff associated with a particular decision alternative and the payoff associated with the decision that would yield the most desirable payoff for a given state of nature. Thus, regret represents how much potential payoff one would forgo by selecting a particular decision alternative, given that a specific state of nature will occur. This is why regret is often referred to as opportunity loss.

As its name implies, under the minimax regret approach to decision analysis, one would choose the decision alternative that minimizes the maximum state of regret that could occur over all possible states of nature. This approach is neither purely optimistic nor purely conservative. Let us illustrate the minimax regret approach by showing how it can be used to select a decision alternative for the PDC problem.

For a maximization problem, the conservative approach often is referred to as the maximin approach; for a minimization problem the corresponding terminology is minimax.

TABLE 15.2 Maximum Payoff for Each PDC Decision Alternative

Decision Alternative       Maximum Payoff ($ Millions)
Small complex, d1          8
Medium complex, d2         14
Large complex, d3          20    (maximum of the maximum payoff values)

TABLE 15.3 Minimum Payoff for Each PDC Decision Alternative

Decision Alternative       Minimum Payoff ($ Millions)
Small complex, d1          7     (maximum of the minimum payoff values)
Medium complex, d2         5
Large complex, d3          −9

Suppose that PDC constructs a small condominium complex (d1) and demand turns out to be strong (s1). Table 15.1 showed that the resulting profit for PDC would be $8 million. However, given that the strong demand state of nature (s1) has occurred, we realize that the decision to construct a large condominium complex (d3), yielding a profit of $20 million, would have been the best decision. The difference between the payoff for the best decision alternative ($20 million) and the payoff for the decision to construct a small condominium complex ($8 million) is the regret or opportunity loss associated with decision alternative d1 when state of nature s1 occurs; thus, for this case, the opportunity loss or regret is $20 million − $8 million = $12 million. Similarly, if PDC makes the decision to construct a medium condominium complex (d2) and the strong demand state of nature (s1) occurs, the opportunity loss, or regret, associated with d2 would be $20 million − $14 million = $6 million. Of course, if PDC chooses to construct a large complex (d3) and demand is strong, they would have no regret.

In general, the following expression represents the opportunity loss, or regret:

REGRET (OPPORTUNITY LOSS)

\[
R_{ij} = \lvert V_j^{*} - V_{ij} \rvert \tag{15.1}
\]

where

Rij = the regret associated with decision alternative di and state of nature sj

Vj* = the payoff value corresponding to the best decision for the state of nature sj

Vij = the payoff corresponding to decision alternative di and state of nature sj

Note the role of the absolute value in equation (15.1). For minimization problems, the best payoff, Vj*, is the smallest entry in column j. Because this value always is less than or equal to Vij, the absolute value of the difference between Vj* and Vij ensures that the regret is always the magnitude of the difference.

Using equation (15.1) and the payoffs in Table 15.1, we can compute the regret associated with each combination of decision alternative di and state of nature sj. Because the PDC problem is a maximization problem, Vj* will be the largest entry in column j of the payoff table. Thus, to compute the regret, we simply subtract each entry in a column from the largest entry in the column. Table 15.4 shows the opportunity loss, or regret, table for the PDC problem.

The next step in applying the minimax regret approach is to list the maximum regret for each decision alternative; Table 15.5 shows the results for the PDC problem. Selecting the decision alternative with the minimum of the maximum regret values—hence, the name minimax regret—yields the minimax regret decision. For the PDC problem, the alternative to construct the medium condominium complex, with a corresponding maximum regret of $6 million, is the recommended minimax regret decision.
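The regret calculation and the minimax selection can be sketched as follows. This is an illustrative sketch only: the function names `regret_table` and `minimax_regret_choice` are hypothetical, the payoffs are those of Table 15.1, and the best payoff in each column is taken as the column maximum because the PDC problem is a maximization problem.

```python
# Payoffs from Table 15.1 ($ millions): [strong demand, weak demand].
payoffs = {
    "small (d1)": [8, 7],
    "medium (d2)": [14, 5],
    "large (d3)": [20, -9],
}

def regret_table(payoffs):
    """Apply equation (15.1): regret is |best payoff in column j - Vij|."""
    n_states = len(next(iter(payoffs.values())))
    # Best payoff per state of nature (column maximum for maximization).
    best = [max(row[j] for row in payoffs.values()) for j in range(n_states)]
    return {d: [abs(best[j] - row[j]) for j in range(n_states)]
            for d, row in payoffs.items()}

def minimax_regret_choice(payoffs):
    """Pick the alternative whose maximum regret is smallest."""
    regrets = regret_table(payoffs)
    max_regret = {d: max(r) for d, r in regrets.items()}
    choice = min(max_regret, key=max_regret.get)
    return choice, max_regret

regrets = regret_table(payoffs)
choice, max_regret = minimax_regret_choice(payoffs)
print(regrets)
print(max_regret)
print("Minimax regret recommendation:", choice)
```

For a minimization problem, the column maximum in `regret_table` would be replaced by the column minimum; the absolute value keeps the regret nonnegative in either case.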

TABLE 15.4 Opportunity Loss, or Regret, Table for the PDC Condominium Project ($ Millions)

                           State of Nature
Decision Alternative       Strong Demand, s1    Weak Demand, s2
Small complex, d1          12                   0
Medium complex, d2         6                    2
Large complex, d3          0                    16


Note that the three approaches discussed in this section provide different recommendations, which in itself is not bad. It simply reflects the difference in decision-making philosophies that underlie the various approaches. Ultimately, the decision maker will have to choose the most appropriate approach and then make the final decision accordingly. The main criticism of the approaches discussed in this section is that they do not consider any information about the probabilities of the various states of nature. In the next section, we discuss an approach that utilizes probability information in selecting a decision alternative.

15.3 Decision Analysis with Probabilities

Expected Value Approach

In many decision-making situations, we can obtain probability assessments for the states of nature. When such probabilities are available, we can use the expected value approach to identify the best decision alternative. Let us first define the expected value of a decision alternative and then apply it to the PDC problem.

Let

N = the number of states of nature

P(sj) = the probability of state of nature sj

Because one and only one of the N states of nature can occur, the probabilities must satisfy two conditions:

\[
P(s_j) \ge 0 \quad \text{for all states of nature}
\]

\[
\sum_{j=1}^{N} P(s_j) = P(s_1) + P(s_2) + \cdots + P(s_N) = 1
\]

The expected value (EV) of decision alternative di is defined as follows:

TABLE 15.5 Maximum Regret for Each PDC Decision Alternative

Decision Alternative       Maximum Regret ($ Millions)
Small complex, d1          12
Medium complex, d2         6     (minimum of the maximum regret)
Large complex, d3          16

EXPECTED VALUE OF DECISION ALTERNATIVE di

\[
\mathrm{EV}(d_i) = \sum_{j=1}^{N} P(s_j)\, V_{ij} \tag{15.2}
\]

In words, the expected value of a decision alternative is the sum of weighted payoffs for the decision alternative. The weight for a payoff is the probability of the associated state of nature and therefore the probability that the payoff will occur. Let us return to the PDC problem to see how the expected value approach can be applied.

PDC is optimistic about the potential for the luxury high-rise condominium complex. Suppose that this optimism leads to an initial subjective probability assessment of 0.8 that demand will be strong (s1) and a corresponding probability of 0.2 that demand will be weak (s2). Thus, P(s1) = 0.8 and P(s2) = 0.2. Using the payoff values in Table 15.1 and equation (15.2), we compute the expected value for each of the three decision alternatives as follows:

EV(d1) = 0.8(8) + 0.2(7) = 7.8

EV(d2) = 0.8(14) + 0.2(5) = 12.2

EV(d3) = 0.8(20) + 0.2(−9) = 14.2


Thus, using the expected value approach, we find that the large condominium complex, with an expected value of $14.2 million, is the recommended decision.
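Equation (15.2) and the two probability conditions translate directly into code. The following is a minimal sketch rather than a prescribed implementation: the function name `expected_value` is hypothetical, and the inputs are the Table 15.1 payoffs with P(s1) = 0.8 and P(s2) = 0.2.

```python
import math

# Payoffs from Table 15.1 ($ millions): [strong demand, weak demand].
payoffs = {
    "small (d1)": [8, 7],
    "medium (d2)": [14, 5],
    "large (d3)": [20, -9],
}
probabilities = [0.8, 0.2]  # [P(s1), P(s2)]

def expected_value(row, probabilities):
    """Equation (15.2): sum of P(sj) * Vij over the states of nature."""
    # Enforce the two conditions on the state-of-nature probabilities.
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be nonnegative")
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return sum(p * v for p, v in zip(probabilities, row))

expected = {d: expected_value(row, probabilities) for d, row in payoffs.items()}
choice = max(expected, key=expected.get)

for d, ev in expected.items():
    print(f"EV({d}) = {ev:.1f}")
print("Expected value recommendation:", choice)
```

The results are formatted to one decimal place because binary floating-point arithmetic can introduce tiny rounding differences in the sums.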

The calculations required to identify the decision alternative with the best expected value can be conveniently carried out on a decision tree. Figure 15.2 shows the decision tree for the PDC problem with state-of-nature branch probabilities. Working backward through the decision tree, we first compute the expected value at each chance node. In other words, at each chance node, we weight each possible payoff by its probability of occurrence. By doing so, we obtain the expected values for nodes 2, 3, and 4, as shown in Figure 15.3.

Because the decision maker controls the branch leaving decision node 1 and because we are trying to maximize the expected profit, the best decision alternative at node 1 is d3. Thus, the decision tree analysis leads to a recommendation of d3, with an expected value of $14.2 million. Note that this recommendation is also obtained with the expected value approach in conjunction with the payoff table.

Other decision problems may be substantially more complex than the PDC problem, but if a reasonable number of decision alternatives and states of nature are present, you can use the decision tree approach outlined here. First, draw a decision tree consisting of decision nodes, chance nodes, and branches that describe the sequential nature of the problem. If you use the expected value approach, the next step is to determine the probabilities for each of the states of nature and compute the expected value at each chance node. Then select the decision branch leading to the chance node with the best expected value. The decision alternative associated with this branch is the recommended decision.

In practice, obtaining precise estimates of the probabilities for each state of nature is often impossible. In some cases where similar decisions have been made many times in the past, one may use historical data to estimate the probabilities for the different states of nature. However, often there are little, or no, historical data to guide the estimates of these probabilities. In these cases, we may have to rely on subjective estimates to determine the probabilities for the states of nature. When relying on subjective estimates, we often want to get more than one estimate because many studies have shown that even knowledgeable experts are often overly optimistic in their estimates. It is also particularly important when dealing with subjective probability estimates to perform risk analysis and sensitivity analysis, as we will explain.

Computer packages are available to help in constructing more complex decision trees. In the online chapter appendix, we discuss the use of Analytic Solver to create decision trees.

FIGURE 15.2 PDC Decision Tree with State-of-Nature Branch Probabilities

[Figure: the decision tree of Figure 15.1 with branch probabilities P(s1) = 0.8 on each Strong (s1) branch and P(s2) = 0.2 on each Weak (s2) branch, following the Small (d1), Medium (d2), and Large (d3) decision branches.]


Risk Analysis

Risk analysis helps the decision maker recognize the difference between the expected value of a decision alternative and the payoff that may actually occur. A decision alternative and a state of nature combine to generate the payoff associated with a decision. The risk profile for a decision alternative shows the possible payoffs along with their associated probabilities.

Let us demonstrate risk analysis and the construction of a risk profile by returning to the PDC condominium construction project. Using the expected value approach, we identified the large condominium complex (d3) as the best decision alternative. The expected value of $14.2 million for d3 is based on a 0.8 probability of obtaining a $20 million profit and a 0.2 probability of obtaining a $9 million loss. The 0.8 probability for the $20 million payoff and the 0.2 probability for the −$9 million payoff provide the risk profile for the large-complex decision alternative. This risk profile is shown graphically in Figure 15.4.
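A risk profile is simply the probability distribution of the payoffs for one decision alternative, so it can be tabulated directly from the payoff table. The sketch below is illustrative only: the function names `risk_profile` and `probability_of_loss` are hypothetical, and the inputs are the Table 15.1 payoffs with P(s1) = 0.8 and P(s2) = 0.2.

```python
from collections import defaultdict

# Payoffs from Table 15.1 ($ millions): [strong demand, weak demand].
payoffs = {
    "small (d1)": [8, 7],
    "medium (d2)": [14, 5],
    "large (d3)": [20, -9],
}
probabilities = [0.8, 0.2]  # [P(s1), P(s2)]

def risk_profile(row, probabilities):
    """Map each possible payoff to the total probability of receiving it."""
    profile = defaultdict(float)
    for value, p in zip(row, probabilities):
        profile[value] += p  # states with equal payoffs are combined
    return dict(profile)

def probability_of_loss(profile):
    """Total probability of a negative payoff."""
    return sum(p for value, p in profile.items() if value < 0)

large = risk_profile(payoffs["large (d3)"], probabilities)
medium = risk_profile(payoffs["medium (d2)"], probabilities)
print("Large complex:", large, "P(loss) =", probability_of_loss(large))
print("Medium complex:", medium, "P(loss) =", probability_of_loss(medium))
```

Comparing the probability of a loss across alternatives is one simple way to read a risk profile; the expected value alone does not reveal it.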

Sometimes a review of the risk profile associated with an optimal decision alternative may cause the decision maker to choose another decision alternative even though the expected value of the other decision alternative is not as good. For example, the risk profile for the medium-complex decision alternative (d2) shows a 0.8 probability for a $14 million payoff and a 0.2 probability for a $5 million payoff. Because no probability of a loss is associated with decision alternative d2, the medium-complex decision alternative would be judged less risky than the large-complex decision alternative. As a result, a decision maker might prefer the less risky medium-complex decision alternative even though it has an expected value of $2 million less than the large-complex decision alternative.

FIGURE 15.3 Applying the Expected Value Approach Using a Decision Tree for the PDC Condominium Project

[Figure: the decision tree of Figure 15.2 with the expected value shown at each chance node.]

Small (d1): EV(d1) = 0.8(8) + 0.2(7) = 7.8

Medium (d2): EV(d2) = 0.8(14) + 0.2(5) = 12.2

Large (d3): EV(d3) = 0.8(20) + 0.2(−9) = 14.2

FIGURE 15.4 Risk Profile for the Large-Complex Decision Alternative for the PDC Condominium Project

[Figure: probability (vertical axis, 0 to 1.0) plotted against profit in $ millions (horizontal axis, −10 to 20), with probability 0.2 at a profit of −9 and probability 0.8 at a profit of 20.]

Sensitivity Analysis

Sensitivity analysis can be used to determine how changes in the probabilities for the states of nature or changes in the payoffs affect the recommended decision alternative. In many cases, the probabilities for the states of nature and the payoffs are based on subjective assessments. Sensitivity analysis helps the decision maker understand which of these inputs are critical to the choice of the best decision alternative. If a small change in the value of one of the inputs causes a change in the recommended decision alternative, the solution to the decision analysis problem is sensitive to that particular input. Extra effort and care should be taken to make sure the input value is as accurate as possible. On the other hand, if a modest-to-large change in the value of one of the inputs does not cause a change in the recommended decision alternative, the solution to the decision analysis problem is not sensitive to that particular input. No extra time or effort would be needed to refine the estimated input value.

One approach to sensitivity analysis is to select different values for the probabilities of the states of nature and the payoffs and then re-solve the decision analysis problem. If the recommended decision alternative changes, we know that the solution is sensitive to the changes made. For example, suppose that in the PDC problem the probability for a strong demand is revised to 0.2 and the probability for a weak demand is revised to 0.8. Would the recommended decision alternative change? Using P(s1) = 0.2, P(s2) = 0.8, and equation (15.2), the revised expected values for the three decision alternatives are as follows:

EV(d1) = 0.2(8) + 0.8(7) = 7.2

EV(d2) = 0.2(14) + 0.8(5) = 6.8

EV(d3) = 0.2(20) + 0.8(−9) = −3.2

With these probability assessments, the recommended decision alternative is to construct a small condominium complex (d1), with an expected value of $7.2 million. The probability of strong demand is only 0.2, so constructing the large condominium complex (d3) is the least preferred alternative, with an expected value of −$3.2 million (a loss).

Thus, when the probability of strong demand is large, PDC should build the large complex; when the probability of strong demand is small, PDC should build the small complex. Obviously, we could continue to modify the probabilities of the states of nature and learn even more about how changes in the probabilities affect the recommended decision alternative. Sensitivity analysis calculations can also be made for the values of the payoffs. We can easily change the payoff values and re-solve the problem to see if the best decision changes.
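Re-solving the problem for many probability values is tedious by hand but easy to automate. The sketch below sweeps P(s1) from 0 to 1 and reports the alternative with the largest expected value at each step. It is an illustration only: the function name `best_alternative` and the step size of 0.1 are assumptions, and the payoffs are those of Table 15.1.

```python
# Payoffs from Table 15.1 ($ millions): [strong demand, weak demand].
payoffs = {
    "small (d1)": [8, 7],
    "medium (d2)": [14, 5],
    "large (d3)": [20, -9],
}

def best_alternative(p_strong):
    """Return (alternative, EV) with the largest EV when P(s1) = p_strong."""
    probabilities = [p_strong, 1 - p_strong]
    expected = {
        d: sum(p * v for p, v in zip(probabilities, row))
        for d, row in payoffs.items()
    }
    choice = max(expected, key=expected.get)
    return choice, expected[choice]

# Sweep P(s1) in steps of 0.1 and report the recommendation at each value.
for k in range(11):
    p = k / 10
    choice, ev = best_alternative(p)
    print(f"P(s1) = {p:.1f}: {choice:12s} EV = {ev:5.2f}")
```

Setting the expected-value expressions of Table 15.1 equal to each other gives the switch points exactly: EV(d1) = EV(d2) at P(s1) = 0.25 and EV(d2) = EV(d3) at P(s1) = 0.70. Between those values the medium complex has the largest expected value, a case the two scenarios worked in the text do not reveal. At a switch point the two alternatives tie, so the sweep may print either one there.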

NOTES + COMMENTS

1. The definition of expected value given in this chapter is consistent with that given in Chapter 5, but here we use the notation and terminology specific to decision analysis. In both cases, the expected value is defined as the weighted average of possible values.

2. The drawback to the sensitivity analysis approach described in this section is the numerous calculations required to evaluate the effect of several possible changes in the state-of-nature probabilities and/or payoff values. In the online chapter appendix we demonstrate how to use Analytic Solver in Excel to generate decision trees. The use of Excel makes it much easier to make changes to probabilities and/or payoff values in a decision tree for sensitivity analysis.


15.4 Decision Analysis with Sample Information

Frequently, decision makers have the ability to collect additional information about the states of nature. It is worthwhile for the decision maker to consider the potential value of this additional information and how it can affect the decision analysis process. Most often, additional information is obtained through experiments designed to provide sample information about the states of nature. Raw material sampling, product testing, and market research studies are examples of experiments (or studies) that may enable management to revise or update the state-of-nature probabilities.

To analyze the potential benefit of additional information, we must first introduce a few additional terms related to decision analysis. The preliminary or prior probability assessments for the states of nature are the best probability values available prior to obtaining additional information. The revised probabilities after obtaining additional information are called posterior probabilities.

Let us return to the PDC problem and assume that management is considering a six-month market research study designed to learn more about potential market acceptance of the PDC condominium project. Management anticipates that the market research study will provide one of the following two results:

1. Favorable report: A substantial number of the individuals contacted express interest in purchasing a PDC condominium.

2. Unfavorable report: Very few of the individuals contacted express interest in purchasing a PDC condominium.

Figure 15.5 shows the decision tree for the PDC problem with sample information, including the logical sequence for the decisions and the chance events. By introducing the possibility of conducting a market research study, the PDC problem becomes more complex. First, PDC’s management must decide whether the market research should be conducted. If it is conducted, PDC’s management must be prepared to make a decision about the size of the condominium project if the market research report is favorable and, possibly, a different decision about the size of the condominium project if the market research report is unfavorable.

In Figure 15.5, the squares are decision nodes and the circles are chance nodes. At each decision node, the branch of the tree that is taken is based on the decision made. At each chance node, the branch of the tree that is taken is based on probability or chance. For example, decision node 1 shows that PDC must first make the decision of whether to conduct the market research study. If the market research study is undertaken, chance node 2 indicates that neither the favorable report branch nor the unfavorable report branch is under PDC’s control; the branch taken will be determined by chance. Node 3 is a decision node, indicating that PDC must make the decision to construct the small, medium, or large complex if the market research report is favorable. Node 4 is a decision node showing that PDC must make the decision to construct the small, medium, or large complex if the market research report is unfavorable. Node 5 is a decision node indicating that PDC must make the decision to construct the small, medium, or large complex if the market research is not undertaken. Nodes 6 to 14 are chance nodes indicating that the strong demand or weak demand state-of-nature branches will be determined by chance.

Analysis of the decision tree and the choice of an optimal strategy require that we know the branch probabilities corresponding to all chance nodes. PDC has developed the following branch probabilities.

If the market research study is undertaken:

P(favorable report) = 0.77

P(unfavorable report) = 0.23

If the market research report is favorable, the posterior probabilities are as follows:

P(strong demand given a favorable report) = 0.94

P(weak demand given a favorable report) = 0.06

The branch probabilities for P(favorable report) and P(unfavorable report) are calculated using Bayes’ rule, first introduced in Chapter 5. We illustrate these calculations in Section 15.5.


If the market research report is unfavorable, the posterior probabilities are as follows:

P(strong demand given an unfavorable report) = 0.35

P(weak demand given an unfavorable report) = 0.65

If the market research study is not undertaken, the prior probabilities are applicable:

P(strong demand) = 0.80

P(weak demand) = 0.20

The branch probabilities are shown on the decision tree in Figure 15.6.

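FIGURE 15.5 The PDC Decision Tree Including the Market Research Study

[Figure: decision node 1 branches to Market Research Study and No Market Research Study. The study branch leads to chance node 2 with Favorable Report and Unfavorable Report branches. Each report branch, and the no-study branch, leads to a decision node with Small (d1), Medium (d2), and Large (d3) branches, each followed by a chance node with Strong (s1) and Weak (s2) branches.]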


A decision strategy is a sequence of decisions and chance outcomes in which the decisions chosen depend on the yet-to-be-determined outcomes of chance events. The approach used to determine the optimal decision strategy is based on a rollback of the expected values in the decision tree using the following steps:

1. At chance nodes, compute the expected value by multiplying the payoff at the end of each branch by the corresponding branch probability and summing these products.

2. At decision nodes, select the decision branch that leads to the best expected value. This expected value becomes the expected value at the decision node.

FIGURE 15.6 The PDC Decision Tree with Branch Probabilities


Starting the rollback calculations by computing the expected values at chance nodes 6 to 14 provides the following results:

\[
\begin{aligned}
\text{EV(Node 6)} &= 0.94(8) + 0.06(7) = 7.94 \\
\text{EV(Node 7)} &= 0.94(14) + 0.06(5) = 13.46 \\
\text{EV(Node 8)} &= 0.94(20) + 0.06(-9) = 18.26 \\
\text{EV(Node 9)} &= 0.35(8) + 0.65(7) = 7.35 \\
\text{EV(Node 10)} &= 0.35(14) + 0.65(5) = 8.15 \\
\text{EV(Node 11)} &= 0.35(20) + 0.65(-9) = 1.15 \\
\text{EV(Node 12)} &= 0.80(8) + 0.20(7) = 7.80 \\
\text{EV(Node 13)} &= 0.80(14) + 0.20(5) = 12.20 \\
\text{EV(Node 14)} &= 0.80(20) + 0.20(-9) = 14.20
\end{aligned}
\]

FIGURE 15.7 PDC Decision Tree after Computing Expected Values at Chance Nodes 6 to 14

Figure 15.7 shows the reduced decision tree after computing expected values at these chance nodes.
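The chance-node step of the rollback can be sketched in a few lines of Python. This is only an illustration: the names `PAYOFFS`, `BRANCH_PROBS`, and `chance_node_ev` are our own, and the node numbering assumes that each group of chance nodes is ordered small, medium, large, as in the calculations above.

```python
# Payoffs in $ millions as (strong demand, weak demand), from the PDC payoff table.
PAYOFFS = {"small": (8, 7), "medium": (14, 5), "large": (20, -9)}

# Branch probabilities (strong, weak) leaving each group of chance nodes.
BRANCH_PROBS = {
    "favorable report": (0.94, 0.06),    # nodes 6-8
    "unfavorable report": (0.35, 0.65),  # nodes 9-11
    "no study": (0.80, 0.20),            # nodes 12-14
}


def chance_node_ev(probs, outcomes):
    """Sum of probability-weighted payoffs over a chance node's branches."""
    return sum(p * x for p, x in zip(probs, outcomes))


chance_evs = {}
node = 6
for scenario, probs in BRANCH_PROBS.items():
    for size, outcomes in PAYOFFS.items():
        chance_evs[node] = round(chance_node_ev(probs, outcomes), 2)
        print(f"EV(Node {node}) [{scenario}, {size}] = {chance_evs[node]:.2f}")
        node += 1
```

Each printed value should agree with the corresponding expected value reported for nodes 6 to 14.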


Next, move to decision nodes 3, 4, and 5. For each of these nodes, we select the decision alternative branch that leads to the best expected value. For example, at node 3 we have the choice of the small complex branch with EV(Node 6) = 7.94, the medium complex branch with EV(Node 7) = 13.46, and the large complex branch with EV(Node 8) = 18.26. Thus, we select the large-complex decision alternative branch and the expected value at node 3 becomes EV(Node 3) = 18.26.

For node 4, we select the best expected value from nodes 9, 10, and 11. The best decision alternative is the medium complex branch that provides EV(Node 4) = 8.15. For node 5, we select the best expected value from nodes 12, 13, and 14. The best decision alternative is the large complex branch that provides EV(Node 5) = 14.20. Figure 15.8 shows the reduced decision tree after choosing the best decisions at nodes 3, 4, and 5 and rolling back the expected values to these nodes.

The expected value at chance node 2 can now be computed as follows:

\[
\begin{aligned}
\text{EV(Node 2)} &= 0.77\,\text{EV(Node 3)} + 0.23\,\text{EV(Node 4)} \\
&= 0.77(18.26) + 0.23(8.15) = 15.93
\end{aligned}
\]

This calculation reduces the decision tree to one involving only the two decision branches from node 1 (see Figure 15.9).

Finally, the decision can be made at decision node 1 by selecting the best expected values from nodes 2 and 5. This action leads to the decision alternative to conduct the market research study, which provides an overall expected value of 15.93.

FIGURE 15.8 PDC Decision Tree after Choosing Best Decisions at Nodes 3, 4, and 5

FIGURE 15.9 PDC Decision Tree Reduced to Two Decision Branches


The optimal decision for PDC is to conduct the market research study and then carry out the following decision strategy:

If the market research is favorable, construct the large condominium complex.

If the market research is unfavorable, construct the medium condominium complex.

The analysis of the PDC decision tree describes the methods that can be used to analyze more complex sequential decision problems. First, draw a decision tree consisting of decision and chance nodes and branches that describe the sequential nature of the problem. Determine the probabilities for all chance outcomes. Then, by working backward through the tree, compute expected values at all chance nodes and select the best decision branch at all decision nodes. The sequence of optimal decision branches determines the optimal decision strategy for the problem.
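The backward pass described above can be sketched as a short recursive function. The tree encoding used here (tuples tagged `"chance"` or `"decision"`) and the helper names `rollback` and `size_decision` are assumptions made for illustration; only the payoffs and probabilities come from the PDC problem.

```python
PAYOFFS = {"small": (8, 7), "medium": (14, 5), "large": (20, -9)}  # $ millions


def rollback(node):
    """Return (expected value, chosen branches) for a node of the tree.

    A node is a number (terminal payoff),
    ("chance", name, [(probability, child), ...]) or
    ("decision", name, [(label, child), ...]).
    """
    if isinstance(node, (int, float)):
        return float(node), {}
    kind, name, branches = node
    if kind == "chance":
        ev, plan = 0.0, {}
        for prob, child in branches:
            child_ev, child_plan = rollback(child)
            ev += prob * child_ev
            plan.update(child_plan)  # any outcome may occur, so keep every sub-plan
        return ev, plan
    # Decision node: keep only the branch with the best expected value.
    best_label, best_ev, best_plan = None, float("-inf"), {}
    for label, child in branches:
        child_ev, child_plan = rollback(child)
        if child_ev > best_ev:
            best_label, best_ev, best_plan = label, child_ev, child_plan
    return best_ev, {name: best_label, **best_plan}


def size_decision(name, p_strong):
    """Decision over complex sizes; each size leads to a demand chance node."""
    return ("decision", name, [
        (size, ("chance", f"{name}/{size}", [(p_strong, strong), (1 - p_strong, weak)]))
        for size, (strong, weak) in PAYOFFS.items()
    ])


study = ("chance", "node 2", [
    (0.77, size_decision("node 3 (favorable)", 0.94)),
    (0.23, size_decision("node 4 (unfavorable)", 0.35)),
])
tree = ("decision", "node 1", [
    ("market research study", study),
    ("no market research study", size_decision("node 5 (no study)", 0.80)),
])

ev, plan = rollback(tree)
print(round(ev, 2))
print(plan)
```

Under these inputs the sketch should select the market research study at node 1, the large complex after a favorable report, and the medium complex after an unfavorable report. Decision nodes that are not on the chosen path (node 5 here) are left out of the returned plan.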

Expected Value of Sample Information

In the PDC problem, the market research study is the sample information used to determine the optimal decision strategy. The expected value associated with the market research study is 15.93. Previously, we showed that the best expected value if the market research study is not undertaken is 14.20. Thus, we can conclude that the difference, 15.93 − 14.20 = 1.73, is the expected value of sample information (EVSI). In other words, conducting the market research study adds $1.73 million to the PDC expected value. In general, the expected value of sample information is as follows:

EXPECTED VALUE OF SAMPLE INFORMATION (EVSI)

\[
\text{EVSI} = \text{EVwSI} - \text{EVwoSI} \tag{15.3}
\]

where

EVSI = expected value of sample information

EVwSI = expected value with sample information about the states of nature

EVwoSI = expected value without sample information about the states of nature

The EVSI = $1.73 million suggests PDC should be willing to pay up to $1.73 million to conduct the market research study.

Expected Value of Perfect Information

A special case of gaining additional information related to a decision problem is when the sample information provides perfect information on the states of nature. In other words, consider a case in which the marketing study undertaken by PDC would determine exactly which state of nature will occur. Clearly, such a result is highly unlikely from a marketing study, but such an analysis provides a best-case analysis of the benefit provided by the marketing study. If the investment required for the additional information exceeds the expected value of perfect information, then we would not want to invest in procuring the additional information.

To illustrate the calculation of the expected value of perfect information, we return to the PDC decision. We assume for the moment that PDC could determine with certainty, prior to making a decision, which state of nature is going to occur. To make use of this perfect information, we will develop a decision strategy that PDC should follow once it knows which state of nature will occur.

To help determine the decision strategy for PDC, we reproduce PDC’s payoff table as Table 15.6. If PDC knew for sure that state of nature \(s_1\) would occur, the best decision alternative would be \(d_3\), with a payoff of $20 million. Similarly, if PDC knew for sure that state of nature \(s_2\) would occur, the best decision alternative would be \(d_1\), with a payoff of $7 million. Thus, we can state PDC’s optimal decision strategy when the perfect information becomes available as follows:

If \(s_1\), select \(d_3\) and receive a payoff of $20 million.

If \(s_2\), select \(d_1\) and receive a payoff of $7 million.

It would be worth $3.2 million for PDC to learn the level of market acceptance before selecting a decision alternative. This represents the maximum that PDC should invest in any market research to provide additional information on the states of nature because no market research study can be expected to provide perfect information.


TABLE 15.6 Payoff Table for the PDC Condominium Project ($ Millions)

| Decision Alternative \ State of Nature | Strong Demand, \(s_1\) | Weak Demand, \(s_2\) |
|---|---|---|
| Small complex, \(d_1\) | 8 | 7 |
| Medium complex, \(d_2\) | 14 | 5 |
| Large complex, \(d_3\) | 20 | −9 |

What is the expected value for this decision strategy? To compute the expected value with perfect information, we return to the original probabilities for the states of nature: \(P(s_1) = 0.8\) and \(P(s_2) = 0.2\). Thus, there is a 0.8 probability that the perfect information will indicate state of nature \(s_1\), and the resulting decision alternative \(d_3\) will provide a $20 million profit. Similarly, with a 0.2 probability for state of nature \(s_2\), the optimal decision alternative \(d_1\) will provide a $7 million profit. Thus, from equation (15.2) the expected value of the decision strategy that uses perfect information is 0.8(20) + 0.2(7) = 17.4.

We refer to the expected value of $17.4 million as the expected value with perfect information (EVwPI).

Earlier in this section we showed that the recommended decision using the expected value approach is decision alternative \(d_3\), with an expected value of $14.2 million. Because this decision recommendation and expected value computation were made without the benefit of perfect information, $14.2 million is referred to as the expected value without perfect information (EVwoPI).

The expected value with perfect information is $17.4 million, and the expected value without perfect information is $14.2 million; therefore, the expected value of perfect information (EVPI) is $17.4 − $14.2 = $3.2 million. In other words, $3.2 million represents the additional expected value that could be obtained if perfect information were available about the states of nature.

In general, the expected value of perfect information (EVPI) is computed as follows:

EXPECTED VALUE OF PERFECT INFORMATION (EVPI)

\[
\text{EVPI} = \text{EVwPI} - \text{EVwoPI} \tag{15.4}
\]

where

EVPI = expected value of perfect information

EVwPI = expected value with perfect information about the states of nature

EVwoPI = expected value without perfect information about the states of nature
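As a rough check of the EVPI arithmetic, the following Python sketch recomputes EVwPI, EVwoPI, and EVPI from the PDC payoff table and the prior probabilities. The variable names are illustrative choices, not part of the original analysis.

```python
payoffs = {"small": (8, 7), "medium": (14, 5), "large": (20, -9)}  # $ millions
priors = (0.8, 0.2)  # P(s1), P(s2)

# Without perfect information: commit to one alternative before the state is known.
ev_without = max(
    sum(p * x for p, x in zip(priors, row)) for row in payoffs.values()
)

# With perfect information: take the best payoff in each state,
# then weight those best payoffs by the prior probabilities.
best_per_state = [max(row[j] for row in payoffs.values()) for j in range(len(priors))]
ev_with = sum(p * best for p, best in zip(priors, best_per_state))

evpi = ev_with - ev_without
print(round(ev_with, 1), round(ev_without, 1), round(evpi, 1))
```

The three printed values should match the $17.4 million, $14.2 million, and $3.2 million figures derived above.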

15.5 Computing Branch Probabilities with Bayes’ Theorem

In Section 15.4 the branch probabilities for the PDC decision tree chance nodes were provided in the problem description. No computations were required to determine these probabilities. In this section, we show how Bayes’ theorem can be used to compute branch probabilities for decision trees. The branch probabilities are the posterior probabilities for demand that have been updated based on the sample information of whether the market research report is favorable or unfavorable.

We first introduced Bayes’ theorem in Chapter 5 as a means of calculating posterior probabilities as updates of prior probabilities once additional information is obtained.


The PDC decision tree is shown again in Figure 15.10. Let

\(F\) = favorable market research report

\(U\) = unfavorable market research report

\(s_1\) = strong demand (state of nature 1)

\(s_2\) = weak demand (state of nature 2)

At chance node 2, we need to know the branch probabilities \(P(F)\) and \(P(U)\). At chance nodes 6, 7, and 8, we need to know the branch probabilities \(P(s_1 \mid F)\), which is read as “the probability of state of nature 1 given a favorable market research report,” and \(P(s_2 \mid F)\), which is the probability of state of nature 2 given a favorable market research report. The notation | in \(P(s_1 \mid F)\) and \(P(s_2 \mid F)\) is read as “given” and indicates a conditional probability because we are interested in the probability of a particular state of nature “conditioned” on the fact that we receive a favorable market report. \(P(s_1 \mid F)\) and \(P(s_2 \mid F)\) are referred to as posterior probabilities because they are conditional probabilities based on the outcome of the sample information. At chance nodes 9, 10, and 11, we need to know the branch probabilities \(P(s_1 \mid U)\) and \(P(s_2 \mid U)\); note that these are also posterior probabilities, denoting the probabilities of the two states of nature given that the market research report is unfavorable. Finally, at chance nodes 12, 13, and 14, we need the probabilities for the states of nature, \(P(s_1)\) and \(P(s_2)\), if the market research study is not undertaken.

FIGURE 15.10 The PDC Decision Tree

In performing the probability computations, we need to know PDC’s assessment of the probabilities for the two states of nature, \(P(s_1)\) and \(P(s_2)\), which are the prior probabilities as discussed earlier. In addition, we must know the conditional probability of the market research outcomes (the sample information) given each state of nature. For example, we need to know the conditional probability of a favorable market research report given that the state of nature is strong demand for the PDC project. To carry out the probability calculations, we will need conditional probabilities for all sample outcomes given all states of nature, that is, \(P(F \mid s_1)\), \(P(F \mid s_2)\), \(P(U \mid s_1)\), and \(P(U \mid s_2)\). These conditional probabilities are assessments of the accuracy of the market research; they are often estimated using historical performance of previous market research reports. For example, \(P(F \mid s_1)\) may be estimated via the historical frequency of strong demand being associated with a market research report that was favorable. In the PDC problem we assume that the following assessments are available for these conditional probabilities:

| State of Nature | Market Research: Favorable, \(F\) | Market Research: Unfavorable, \(U\) |
|---|---|---|
| Strong demand, \(s_1\) | \(P(F \mid s_1) = 0.90\) | \(P(U \mid s_1) = 0.10\) |
| Weak demand, \(s_2\) | \(P(F \mid s_2) = 0.25\) | \(P(U \mid s_2) = 0.75\) |

Note that the preceding probability assessments provide a reasonable degree of confidence in the market research study. If the true state of nature is \(s_1\), the probability of a favorable market research report is 0.90, and the probability of an unfavorable market research report is 0.10. If the true state of nature is \(s_2\), the probability of a favorable market research report is 0.25, and the probability of an unfavorable market research report is 0.75. One reason for a 0.25 probability of a potentially misleading favorable market research report for state of nature \(s_2\) is that when some potential buyers first hear about the new condominium project, their enthusiasm may lead them to overstate their real interest in it. A potential buyer’s initial favorable response can change quickly to a “no-thank-you” when later faced with the reality of signing a purchase contract and making a down payment.

Equation (15.5) is known as Bayes’ theorem, and it is used to compute posterior probabilities.

BAYES’ THEOREM

\[
P(A_i \mid B) = \frac{P(A_i)\,P(B \mid A_i)}{P(A_1)\,P(B \mid A_1) + P(A_2)\,P(B \mid A_2) + \cdots + P(A_n)\,P(B \mid A_n)} \tag{15.5}
\]

To perform the Bayes’ theorem calculations for \(P(s_1 \mid U)\) using equation (15.5), we replace \(B\) with \(U\) (unfavorable report) and \(A_i\) with \(s_1\) so that we have

\[
P(s_1 \mid U) = \frac{P(U \mid s_1)\,P(s_1)}{P(U \mid s_1)\,P(s_1) + P(U \mid s_2)\,P(s_2)} = \frac{0.10 \times 0.80}{(0.10 \times 0.80) + (0.75 \times 0.20)} = 0.35
\]

which indicates that the probability of strong demand given an unfavorable market research report is 0.35. We can also calculate the probability of weak demand given an unfavorable market research report as shown below:

\[
P(s_2 \mid U) = \frac{P(U \mid s_2)\,P(s_2)}{P(U \mid s_1)\,P(s_1) + P(U \mid s_2)\,P(s_2)} = \frac{0.75 \times 0.20}{(0.10 \times 0.80) + (0.75 \times 0.20)} = 0.65
\]

Conditional probability was introduced in Chapter 5.

Equation (15.5) is a restatement of Bayes’ theorem introduced in Chapter 5.

Similarly, we can calculate the posterior probabilities for strong and weak demand given a favorable market research report using equation (15.5):

\[
P(s_1 \mid F) = \frac{P(F \mid s_1)\,P(s_1)}{P(F \mid s_1)\,P(s_1) + P(F \mid s_2)\,P(s_2)} = \frac{0.90 \times 0.80}{(0.90 \times 0.80) + (0.25 \times 0.20)} = 0.94
\]

and

\[
P(s_2 \mid F) = \frac{P(F \mid s_2)\,P(s_2)}{P(F \mid s_1)\,P(s_1) + P(F \mid s_2)\,P(s_2)} = \frac{0.25 \times 0.20}{(0.90 \times 0.80) + (0.25 \times 0.20)} = 0.06
\]

This indicates that a favorable research report leads to a posterior probability of 0.94 that the demand will be strong and a posterior probability of only 0.06 that demand will be weak.

The discussion in this section shows an underlying relationship between the probabilities on the various branches in a decision tree. It would be inappropriate to assume different prior probabilities, \(P(s_1)\) and \(P(s_2)\), without determining how these changes would alter \(P(F)\) and \(P(U)\), as well as the posterior probabilities \(P(s_1 \mid F)\), \(P(s_2 \mid F)\), \(P(s_1 \mid U)\), and \(P(s_2 \mid U)\).
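The dependence of every branch probability on the priors can be made concrete with a short Python sketch. It applies equation (15.5) to the PDC assessments; the denominator of that equation is the probability of the report itself, which gives \(P(F)\) and \(P(U)\). The function name `bayes` and the dictionary layout are assumptions made for illustration.

```python
priors = {"s1": 0.80, "s2": 0.20}
# likelihood[report][state] = P(report | state), from the market research assessments.
likelihood = {
    "F": {"s1": 0.90, "s2": 0.25},
    "U": {"s1": 0.10, "s2": 0.75},
}


def bayes(report):
    """Return (P(report), {state: P(state | report)}) using equation (15.5)."""
    joint = {s: priors[s] * likelihood[report][s] for s in priors}
    marginal = sum(joint.values())  # denominator of Bayes' theorem
    posterior = {s: joint[s] / marginal for s in joint}
    return marginal, posterior


marginals, posteriors = {}, {}
for report in ("F", "U"):
    marginals[report], posteriors[report] = bayes(report)
    rounded = {s: round(p, 2) for s, p in posteriors[report].items()}
    print(report, round(marginals[report], 2), rounded)
```

Changing the values in `priors` and rerunning the loop shows how \(P(F)\), \(P(U)\), and all four posterior probabilities move together.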

15.6 Utility Theory

The decision analysis situations presented so far in this chapter expressed outcomes (payoffs) in terms of monetary values. With probability information available about the outcomes of the chance events, we defined the optimal decision alternative as the one that provides the best expected value. However, in some situations the decision alternative with the best expected value may not be the preferred alternative. A decision maker may also wish to consider intangible factors such as risk, image, or other nonmonetary criteria in order to evaluate the decision alternatives. When monetary value does not necessarily lead to the most preferred decision, expressing the value (or worth) of a consequence in terms of its utility will permit the use of expected utility to identify the most desirable decision alternative. The discussion of utility and its application in decision analysis is presented in this section.

Utility is a measure of the total worth or relative desirability of a particular outcome; it reflects the decision maker’s attitude toward a collection of factors such as profit, loss, and risk. Researchers have found that as long as the monetary value of payoffs stays within a range that the decision maker considers reasonable, selecting the decision alternative with the best expected value usually leads to selection of the most preferred decision. However, when the payoffs are extreme, decision makers are often unsatisfied or uneasy with the decision that simply provides the best expected value.

As an example of a situation in which utility can help in selecting the best decision alternative, let us consider the problem faced by Swofford, Inc., a relatively small real estate investment firm located in Atlanta, Georgia. Swofford currently has two investment opportunities that require approximately the same cash outlay. The cash requirements necessary prohibit Swofford from making more than one investment at this time. Consequently, three possible decision alternatives may be considered.

The three decision alternatives, denoted by \(d_1\), \(d_2\), and \(d_3\), are as follows:

\(d_1\) = make investment A

\(d_2\) = make investment B

\(d_3\) = do not invest


The monetary payoffs associated with the investment opportunities depend on the investment decision and on the direction of the real estate market during the next six months (the chance event). Real estate prices will go up, remain stable, or go down. Thus, the states of nature, denoted by \(s_1\), \(s_2\), and \(s_3\), are as follows:

\(s_1\) = real estate prices go up

\(s_2\) = real estate prices remain stable

\(s_3\) = real estate prices go down

Using the best information available, Swofford has estimated the profits, or payoffs, associated with each decision alternative and state-of-nature combination. The resulting payoff table is shown in Table 15.7.

The best estimate of the probability that real estate prices will go up is 0.3; the best estimate of the probability that prices will remain stable is 0.5; and the best estimate of the probability that prices will go down is 0.2. Thus, the expected values for the three decision alternatives are as follows:

\[
\begin{aligned}
\text{EV}(d_1) &= 0.3(30{,}000) + 0.5(20{,}000) + 0.2(-50{,}000) = 9{,}000 \\
\text{EV}(d_2) &= 0.3(50{,}000) + 0.5(-20{,}000) + 0.2(-30{,}000) = -1{,}000 \\
\text{EV}(d_3) &= 0.3(0) + 0.5(0) + 0.2(0) = 0
\end{aligned}
\]

Using the expected value approach, the optimal decision is to select investment A with an expected value of $9,000. Is it really the best decision alternative? Let us consider some other relevant factors that relate to Swofford’s capability for absorbing the loss of $50,000 if investment A is made and prices actually go down.
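For readers who want to verify the expected values, a minimal Python sketch follows. It uses the Swofford payoffs and probabilities as given; the dictionary keys and variable names are illustrative only.

```python
probs = (0.3, 0.5, 0.2)  # P(prices up), P(prices stable), P(prices down)
payoffs = {
    "investment A": (30_000, 20_000, -50_000),
    "investment B": (50_000, -20_000, -30_000),
    "do not invest": (0, 0, 0),
}

expected_values = {
    alternative: sum(p * x for p, x in zip(probs, row))
    for alternative, row in payoffs.items()
}
best_by_ev = max(expected_values, key=expected_values.get)

for alternative, value in expected_values.items():
    print(f"EV({alternative}) = {value:,.0f}")
print("Best by expected value:", best_by_ev)
```

The sketch ranks the alternatives on expected monetary value alone, so it reflects none of the risk considerations discussed next.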

Actually, Swofford’s current financial position is weak. This condition is partly reflected in Swofford’s ability to make only one investment. More important, however, the firm’s president believes that, if the next investment results in a substantial loss, Swofford’s future will be in jeopardy. Although the expected value approach leads to a recommendation for \(d_1\), do you think the firm’s president would prefer this decision? We suspect that the president would select \(d_2\) or \(d_3\) to avoid the possibility of incurring a $50,000 loss. In fact, a reasonable conclusion is that, if a loss of even $30,000 could drive Swofford out of business, the president would select \(d_3\), believing that both investments A and B are too risky for Swofford’s current financial position.

The way we resolve Swofford’s dilemma is first to determine Swofford’s utility for the various outcomes. Recall that the utility of any outcome is the total worth of that outcome, taking into account all risks and consequences involved. If the utilities for the various consequences are assessed correctly, the decision alternative with the highest expected utility is the most preferred, or best, alternative. We next show how to determine the utility of the outcomes so that the alternative with the highest expected utility can be identified.

TABLE 15.7 Payoff Table for Swofford, Inc.

| Decision Alternative \ State of Nature | Prices Up, \(s_1\) | Prices Stable, \(s_2\) | Prices Down, \(s_3\) |
|---|---|---|---|
| Investment A, \(d_1\) | $30,000 | $20,000 | −$50,000 |
| Investment B, \(d_2\) | $50,000 | −$20,000 | −$30,000 |
| Do not invest, \(d_3\) | 0 | 0 | 0 |

Utility and Decision Analysis

The procedure we use to establish a utility for each of the payoffs in Swofford’s situation requires that we first assign a utility to the best and worst possible payoffs. Any values will work as long as the utility assigned to the best payoff is greater than the utility assigned to the worst payoff. In this case, $50,000 is the best payoff and −$50,000 is the worst. Suppose, then, that we arbitrarily make assignments to these two payoffs as follows:

\[
\begin{aligned}
\text{Utility of } -\$50{,}000 &= U(-50{,}000) = 0 \\
\text{Utility of } \$50{,}000 &= U(50{,}000) = 10
\end{aligned}
\]

Let us now determine the utility associated with every other payoff. Consider the process of establishing the utility of a payoff of $30,000. First we ask Swofford’s president to state a preference between a guaranteed $30,000 payoff and an opportunity to engage in the following lottery, or bet, for some probability of p that we select:

Lottery: Swofford obtains a payoff of $50,000 with probability p and a payoff of −$50,000 with probability (1 − p).

Obviously, if p is very close to 1, Swofford’s president would prefer the lottery to the guaranteed payoff of $30,000 because the firm would virtually ensure itself a payoff of $50,000. If p is very close to 0, Swofford’s president would clearly prefer the guarantee of $30,000. In any event, as p increases continuously from 0 to 1, the preference for the guaranteed payoff of $30,000 decreases and at some point is equal to the preference for the lottery. At this value of p, Swofford’s president would have equal preference for the guaranteed payoff of $30,000 and the lottery; at greater values of p, Swofford’s president would prefer the lottery to the guaranteed $30,000 payoff. For example, let us assume that when p = 0.95, Swofford’s president is indifferent between the guaranteed payoff of $30,000 and the lottery. For this value of p, we can compute the utility of a $30,000 payoff as follows:

\[
\begin{aligned}
U(30{,}000) &= pU(50{,}000) + (1 - p)U(-50{,}000) \\
&= 0.95(10) + (0.05)(0) \\
&= 9.5
\end{aligned}
\]

Obviously, if we had started with a different assignment of utilities for a payoff of $50,000 and −$50,000, the result would have been a different utility for $30,000. For example, if we had started with an assignment of 100 for $50,000 and 10 for −$50,000, the utility of a $30,000 payoff would be:

\[
\begin{aligned}
U(30{,}000) &= 0.95(100) + 0.05(10) \\
&= 95.0 + 0.5 = 95.5
\end{aligned}
\]

Hence, we must conclude that the utility assigned to each payoff is not unique but merely depends on the initial choice of utilities for the best and worst payoffs.

Before computing the utility for the other payoffs, let us consider the implication of Swofford’s president assigning a utility of 9.5 to a payoff of $30,000. Clearly, when p = 0.95, the expected value of the lottery is

\[
\begin{aligned}
\text{EV(lottery)} &= 0.95(\$50{,}000) + 0.05(-\$50{,}000) \\
&= \$47{,}500 - \$2{,}500 \\
&= \$45{,}000
\end{aligned}
\]

Although the expected value of the lottery when p = 0.95 is $45,000, Swofford’s president is indifferent between the lottery (and its associated risk) and a guaranteed payoff of $30,000. Thus, Swofford’s president is taking a conservative, or risk-avoiding, viewpoint. A decision maker who would choose a guaranteed payoff over a lottery with a superior expected payoff is a risk avoider (or is said to be risk-averse). The president would rather have $30,000 for certain than risk anything greater than a 5% chance of incurring a loss of $50,000. In other words, the difference between the EV of $45,000 and the guaranteed payoff of $30,000 is the risk premium that Swofford’s president would be willing to pay to avoid the 5% chance of losing $50,000.

Utility values of 0 and 1 could have been selected here; we selected 0 and 10 to avoid any possible confusion between the utility value for a payoff and the probability p.

p is often referred to as the indifference probability.

The difference between the expected value of the lottery and the guaranteed payoff can be viewed as the risk premium the decision maker is willing to pay.
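The relationship among the indifference probability, the assigned utility, and the risk premium can be sketched in Python as follows. The two helper functions are hypothetical names introduced for this illustration; the inputs are the values used in the $30,000 example.

```python
def utility_from_indifference(p, u_best, u_worst):
    """Utility of a sure payoff whose indifference probability is p."""
    return p * u_best + (1 - p) * u_worst


def lottery_expected_value(p, best_payoff, worst_payoff):
    """Expected monetary value of the reference lottery."""
    return p * best_payoff + (1 - p) * worst_payoff


p = 0.95             # indifference probability stated for the $30,000 payoff
guaranteed = 30_000

u_scale_0_10 = utility_from_indifference(p, 10, 0)      # utilities 10 and 0
u_scale_10_100 = utility_from_indifference(p, 100, 10)  # utilities 100 and 10
lottery_ev = lottery_expected_value(p, 50_000, -50_000)
risk_premium = lottery_ev - guaranteed

print(round(u_scale_0_10, 2), round(u_scale_10_100, 2))
print(round(lottery_ev), round(risk_premium))
```

The first line of output illustrates that the utility number depends on the scale chosen for the best and worst payoffs, while the indifference probability itself does not change. The second line gives the lottery's expected value and the implied risk premium, which is the $45,000 expected value less the $30,000 guaranteed payoff.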

To compute the utility associated with a payoff of −$20,000, we must ask Swofford’s president to state a preference between a guaranteed −$20,000 payoff and an opportunity to engage again in the following lottery:

Lottery: Swofford obtains a payoff of $50,000 with probability p and a payoff of −$50,000 with probability (1 − p).

Note that this lottery is exactly the same as the one we used to establish the utility of a payoff of $30,000 (in fact, we can use this lottery to establish the utility for any value in the Swofford payoff table). We need to determine the value of p that would make the president indifferent between a guaranteed payoff of −$20,000 and the lottery. For example, we might begin by asking the president to choose between a certain loss of $20,000 and the lottery with a payoff of $50,000 with probability p = 0.90 and a payoff of −$50,000 with probability (1 − p) = 0.10. What answer do you think we would get? Surely, with this high probability of obtaining a payoff of $50,000, the president would elect the lottery. Next, we might ask whether p = 0.85 would result in indifference between the loss of $20,000 for certain and the lottery. Again the president might prefer the lottery. Suppose that we continue until we get to p = 0.55, at which point the president is indifferent between the payoff of −$20,000 and the lottery. In other words, for any value of p less than 0.55, the president would take a loss of $20,000 for certain rather than risk the potential loss of $50,000 with the lottery; and for any value of p above 0.55, the president would choose the lottery. Thus, the utility assigned to a payoff of −$20,000 is

\[
\begin{aligned}
U(-\$20{,}000) &= pU(\$50{,}000) + (1 - p)U(-\$50{,}000) \\
&= 0.55(10) + 0.45(0) \\
&= 5.5
\end{aligned}
\]

Again let us assess the implication of this assignment by comparing it to the expected value approach. When p = 0.55, the expected value of the lottery is

\[
\begin{aligned}
\text{EV(lottery)} &= 0.55(\$50{,}000) + 0.45(-\$50{,}000) \\
&= \$27{,}500 - \$22{,}500 \\
&= \$5{,}000
\end{aligned}
\]

Thus, Swofford’s president would just as soon absorb a certain loss of $20,000 as take the lottery and its associated risk, even though the expected value of the lottery is $5,000. Once again this preference demonstrates the conservative, or risk-avoiding, point of view of Swofford’s president.

In these two examples, we computed the utility for the payoffs of $30,000 and −$20,000. We can determine the utility for any payoff M in a similar fashion. First, we must find the probability p for which the decision maker is indifferent between a guaranteed payoff of M and a lottery with a payoff of $50,000 with probability p and −$50,000 with probability (1 − p). The utility of M is then computed as follows:

\[
\begin{aligned}
U(M) &= pU(\$50{,}000) + (1 - p)U(-\$50{,}000) \\
&= p(10) + (1 - p)0
\end{aligned}
\]

Using this procedure we developed a utility for each of the remaining payoffs in the Swofford problem. The results are presented in Table 15.8.

Now that we have determined the utility of each of the possible monetary values, we can write the original payoff table in terms of utility. Table 15.9 shows the utility for the various outcomes in the Swofford problem. The notation we use for the entries in the utility table is \(U_{ij}\), which denotes the utility associated with decision alternative \(d_i\) and state of nature \(s_j\). Using this notation, we see that \(U_{23} = 4.0\).


TABLE 15.8 Utility of Monetary Payoffs for Swofford, Inc.

| Monetary Value | Indifference Value of p | Utility |
|---|---|---|
| $50,000 | Does not apply | 10.0 |
| 30,000 | 0.95 | 9.5 |
| 20,000 | 0.90 | 9.0 |
| 0 | 0.75 | 7.5 |
| −20,000 | 0.55 | 5.5 |
| −30,000 | 0.40 | 4.0 |
| −50,000 | Does not apply | 0 |

TABLE 15.9 Utility Table for Swofford, Inc.

| Decision Alternative \ State of Nature | Prices Up, \(s_1\) | Prices Stable, \(s_2\) | Prices Down, \(s_3\) |
|---|---|---|---|
| Investment A, \(d_1\) | 9.5 | 9.0 | 0 |
| Investment B, \(d_2\) | 10.0 | 5.5 | 4.0 |
| Do not invest, \(d_3\) | 7.5 | 7.5 | 7.5 |

We can now compute the expected utility (EU) of the utilities in Table 15.9 in a similar fashion as we computed expected value in Section 15.3. In other words, to identify an optimal decision alternative for Swofford, Inc., the expected utility approach requires the analyst to compute the expected utility for each decision alternative and then select the alternative yielding the highest expected utility. With N possible states of nature, the expected utility of a decision alternative \(d_i\) is given as follows:

EXPECTED UTILITY (EU)

\[
\text{EU}(d_i) = \sum_{j=1}^{N} P(s_j)\,U_{ij} \tag{15.6}
\]

The expected utility for each of the decision alternatives in the Swofford problem is as follows:

\[
\begin{aligned}
\text{EU}(d_1) &= 0.3(9.5) + 0.5(9.0) + 0.2(0) = 7.35 \\
\text{EU}(d_2) &= 0.3(10) + 0.5(5.5) + 0.2(4.0) = 6.55 \\
\text{EU}(d_3) &= 0.3(7.5) + 0.5(7.5) + 0.2(7.5) = 7.50
\end{aligned}
\]

Note that the optimal decision using the expected utility approach is \(d_3\), do not invest. The ranking of alternatives according to the president’s utility assignments and the associated monetary values are as follows:

| Ranking of Decision Alternatives | Expected Utility | Expected Value |
|---|---|---|
| Do not invest | 7.50 | 0 |
| Investment A | 7.35 | 9,000 |
| Investment B | 6.55 | −1,000 |


Note that, although investment A had the highest expected value of $9,000, the analysis indicates that Swofford should decline this investment. The rationale behind not selecting investment A is that the 0.20 probability of a $50,000 loss was considered by Swofford’s president to involve a serious risk. The seriousness of this risk and its associated impact on the company were not adequately reflected by the expected value of investment A. We assessed the utility for each payoff to assess this risk adequately.

The following steps state in general terms the procedure used to solve the Swofford, Inc. investment problem:

Step 1. Develop a payoff table using monetary values.

Step 2. Identify the best and worst payoff values in the table and assign each a utility, with U(best payoff) > U(worst payoff).

Step 3. For every other monetary value M in the original payoff table, do the following to determine its utility:

    a. Define the lottery such that there is a probability p of the best payoff and a probability (1 − p) of the worst payoff.

    b. Determine the value of p such that the decision maker is indifferent between a guaranteed payoff of M and the lottery defined in Step 3(a).

    c. Calculate the utility of M as follows:

\[
U(M) = p\,U(\text{best payoff}) + (1 - p)\,U(\text{worst payoff})
\]

Step 4. Convert each monetary value in the payoff table to a utility.

Step 5. Apply the expected utility approach to the utility table developed in Step 4 and select the decision alternative with the highest expected utility.
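The five steps above can be sketched in Python as follows. This is a minimal illustration, not a prescribed implementation: the helper names `build_utility_table` and `recommend` are hypothetical, and the indifference probabilities are the ones reported for Swofford's president in Table 15.8. The sketch also runs the procedure on a 0-to-1 utility scale to illustrate that, for this example, the recommendation does not depend on the endpoint utilities chosen in Step 2.

```python
def build_utility_table(payoff_table, indifference_p, u_best=10.0, u_worst=0.0):
    """Steps 2-4: convert monetary payoffs to utilities using the lottery."""
    values = [m for row in payoff_table.values() for m in row]
    best, worst = max(values), min(values)
    utility_of = {best: u_best, worst: u_worst}          # Step 2
    for m in values:
        if m not in utility_of:
            p = indifference_p[m]                        # Step 3(b): elicited value
            utility_of[m] = p * u_best + (1 - p) * u_worst  # Step 3(c)
    # Step 4: replace every payoff by its utility.
    return {d: [utility_of[m] for m in row] for d, row in payoff_table.items()}


def recommend(utility_table, probs):
    """Step 5: choose the alternative with the highest expected utility."""
    eu = {d: sum(p * u for p, u in zip(probs, row))
          for d, row in utility_table.items()}
    return max(eu, key=eu.get), eu


# Step 1: payoff table (dollars) for the Swofford, Inc. problem.
payoff_table = {
    "Investment A": [30_000, 20_000, -50_000],
    "Investment B": [50_000, -20_000, -30_000],
    "Do not invest": [0, 0, 0],
}
# Indifference probabilities from Table 15.8.
indifference_p = {30_000: 0.95, 20_000: 0.90, 0: 0.75, -20_000: 0.55, -30_000: 0.40}
probs = [0.3, 0.5, 0.2]

choice_10, eu_10 = recommend(build_utility_table(payoff_table, indifference_p), probs)
choice_01, eu_01 = recommend(
    build_utility_table(payoff_table, indifference_p, u_best=1.0, u_worst=0.0), probs
)
print(choice_10, eu_10)
print(choice_01, eu_01)
```

On the 0-to-1 scale each utility equals its indifference probability, so the expected utilities are simply one tenth of those on the 0-to-10 scale and the ranking is the same.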

The procedure we described for determining the utility of monetary consequences can also be used to develop a utility measure for nonmonetary consequences. Assign the best consequence a utility of 10 and the worst a utility of 0. Then create a lottery with a probability of p for the best consequence and (1 − p) for the worst consequence. For each of the other consequences, find the value of p that makes the decision maker indifferent between the lottery and the consequence. Then calculate the utility of the consequence in question as follows:

\[
U(\text{consequence}) = p\,U(\text{best consequence}) + (1 - p)\,U(\text{worst consequence})
\]

Utility Functions

Next we describe how different decision makers may approach risk in terms of their assessment of utility. The financial position of Swofford, Inc. was such that the firm’s president evaluated investment opportunities from a conservative, or risk-avoiding, point of view. However, if the firm had a surplus of cash and a stable future, Swofford’s president might have been looking for investment alternatives that, although perhaps risky, contained a potential for substantial profit. That type of behavior would demonstrate that the president is a risk taker with respect to this decision.

A risk taker is a decision maker who would choose a lottery over a guaranteed payoff when the expected value of the lottery is inferior to the guaranteed payoff. In this section, we analyze the decision problem faced by Swofford from the point of view of a decision maker who would be classified as a risk taker. We then compare the conservative point of view of Swofford’s president (a risk avoider) with the behavior of a decision maker who is a risk taker.

For the decision problem facing Swofford, Inc., using the general procedure for developing utilities as discussed previously, a risk taker might express the utility for the various payoffs shown in Table 15.10. As before, U(50,000) = 10 and U(−50,000) = 0. Note the difference in behavior reflected in Tables 15.10 and 15.8. In other words, in determining the value of p at which the decision maker is indifferent between a guaranteed payoff of M and a lottery in which $50,000 is obtained with probability p and −$50,000 with probability (1 − p), the risk taker is willing to accept a greater risk of incurring a loss of $50,000 in order to gain the opportunity to realize a profit of $50,000.

704 Chapter 15 Decision Analysis

To help develop the utility table for the risk taker, we have reproduced the Swofford, Inc. payoff table in Table 15.11. Using these payoffs and the risk taker’s utilities given in Table 15.10, we can write the risk taker’s utility table as shown in Table 15.12. Using the state-of-nature probabilities \(P(s_1) = 0.3\), \(P(s_2) = 0.5\), and \(P(s_3) = 0.2\), the expected utility for each decision alternative is as follows:

\[
\begin{aligned}
EU(d_1) &= 0.3(5.0) + 0.5(4.0) + 0.2(0) = 3.50 \\
EU(d_2) &= 0.3(10) + 0.5(1.5) + 0.2(1.0) = 3.95 \\
EU(d_3) &= 0.3(2.5) + 0.5(2.5) + 0.2(2.5) = 2.50
\end{aligned}
\]

What is the recommended decision? Perhaps somewhat to your surprise, the analysis recommends investment B, with the highest expected utility of 3.95. Recall that this investment has a −$1,000 expected value. Why is it now the recommended decision? Remember that the decision maker in this revised problem is a risk taker. Thus, although the expected value of investment B is negative, utility analysis has shown that this decision maker is enough of a risk taker to prefer investment B and its potential for the $50,000 profit.

TABLE 15.10  Revised Utilities for Swofford, Inc., Assuming a Risk Taker

Monetary Value    Indifference Value of p    Utility
$50,000           Does not apply             10.0
30,000            0.50                       5.0
20,000            0.40                       4.0
0                 0.25                       2.5
−20,000           0.15                       1.5
−30,000           0.10                       1.0
−50,000           Does not apply             0

TABLE 15.11  Payoff Table for Swofford, Inc.

                           State of Nature
Decision Alternative       Prices Up, \(s_1\)    Prices Stable, \(s_2\)    Prices Down, \(s_3\)
Investment A, \(d_1\)      $30,000               $20,000                   −$50,000
Investment B, \(d_2\)      $50,000               −$20,000                  −$30,000
Do not invest, \(d_3\)     0                     0                         0

TABLE 15.12  Utility Table of a Risk Taker for Swofford, Inc.

                           State of Nature
Decision Alternative       Prices Up, \(s_1\)    Prices Stable, \(s_2\)    Prices Down, \(s_3\)
Investment A, \(d_1\)      5.0                   4.0                       0
Investment B, \(d_2\)      10.0                  1.5                       1.0
Do not invest, \(d_3\)     2.5                   2.5                       2.5


Ranking by the expected utilities generates the following order of preference of the decision alternatives for the risk taker and the associated expected values:

Ranking of Decision Alternatives    Expected Utility    Expected Value
Investment B                        3.95                −$1,000
Investment A                        3.50                $9,000
Do not invest                       2.50                0

Comparing the utility analysis for a risk taker with the more conservative preferences of the president of Swofford, Inc., who is a risk avoider, we see that, even with the same decision problem, different attitudes toward risk can lead to different recommended decisions. The utilities established by Swofford’s president indicated that the firm should not invest at this time, whereas the utilities established by the risk taker showed a preference for investment B. Note that both of these decisions differ from the best expected value decision, which was investment A.
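For reference, the expected values quoted in both rankings can be reproduced from the payoffs in Table 15.11 and the state-of-nature probabilities 0.3, 0.5, and 0.2. The worked arithmetic below is our own restatement of the expected value approach from Section 15.3, with EV denoting expected value:

\[
\begin{aligned}
EV(d_1) &= 0.3(30{,}000) + 0.5(20{,}000) + 0.2(-50{,}000) = 9{,}000 + 10{,}000 - 10{,}000 = 9{,}000 \\
EV(d_2) &= 0.3(50{,}000) + 0.5(-20{,}000) + 0.2(-30{,}000) = 15{,}000 - 10{,}000 - 6{,}000 = -1{,}000 \\
EV(d_3) &= 0.3(0) + 0.5(0) + 0.2(0) = 0
\end{aligned}
\]

These values do not depend on the decision maker; only the utilities, and therefore the expected utilities, change with the attitude toward risk.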

We can obtain another perspective of the difference between behaviors of a risk avoider and a risk taker by developing a graph that depicts the relationship between monetary value and utility. We use the horizontal axis of the graph to represent monetary values and the vertical axis to represent the utility associated with each monetary value. Now, consider the data in Table 15.8, with a utility corresponding to each monetary value for the original Swofford, Inc. problem. These values can be plotted on a graph to produce the top curve in Figure 15.11. The resulting curve is the utility function for money for Swofford’s president. Recall that these points reflected the conservative, or risk-avoiding, nature of Swofford’s president. Hence, we refer to the top curve in Figure 15.11 as a utility function for a risk avoider. Using the data in Table 15.10 developed for a risk taker, we can plot these points to produce the bottom curve in Figure 15.11. The resulting curve depicts the utility function for a risk taker.

FIGURE 15.11  Utility Function for Money for Risk-Avoider, Risk-Taker, and Risk-Neutral Decision Makers (horizontal axis: monetary value in $1,000s, from −50 to 50; vertical axis: utility; curves labeled Risk Avoider, Risk Neutral, and Risk Taker)

By looking at the utility functions in Figure 15.11, we can begin to generalize about the utility functions for risk avoiders and risk takers. Although the exact shape of the utility function will vary from one decision maker to another, we can see the general shape of these two types of utility functions. The utility function for a risk avoider shows a diminishing marginal return for money. For example, the increase in utility going from a monetary value of −$30,000 to $0 is 7.5 − 4.0 = 3.5, whereas the increase in utility in going from $0 to $30,000 is only 9.5 − 7.5 = 2.0.

However, the utility function for a risk taker shows an increasing marginal return for money. For example, in Figure 15.11, the increase in utility for the risk taker in going from −$30,000 to $0 is 2.5 − 1.0 = 1.5, whereas the increase in utility in going from $0 to $30,000 for the risk taker is 5.0 − 2.5 = 2.5. Note also that in either case the utility function is always increasing; that is, more money leads to more utility. All utility functions possess this property.
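The marginal-return comparison can be extended to every pair of adjacent points in Tables 15.8 and 15.10. The sketch below is illustrative only: it computes the utility gained per $1,000 between consecutive monetary values and labels the curve by whether those slopes fall or rise. The function names `slopes` and `classify` and the tolerance `tol` are our own choices.

```python
# Utilities keyed by monetary value in $1,000s (Tables 15.8 and 15.10).
risk_avoider = {-50: 0.0, -30: 4.0, -20: 5.5, 0: 7.5, 20: 9.0, 30: 9.5, 50: 10.0}
risk_taker = {-50: 0.0, -30: 1.0, -20: 1.5, 0: 2.5, 20: 4.0, 30: 5.0, 50: 10.0}


def slopes(utility):
    """Utility gained per $1,000 between consecutive monetary values."""
    xs = sorted(utility)
    return [(utility[b] - utility[a]) / (b - a) for a, b in zip(xs, xs[1:])]


def classify(utility, tol=1e-9):
    """Label a utility table by how its marginal return for money behaves."""
    s = slopes(utility)
    pairs = list(zip(s, s[1:]))
    if all(nxt <= cur + tol for cur, nxt in pairs) and s[-1] < s[0] - tol:
        return "diminishing marginal return (risk avoider)"
    if all(nxt >= cur - tol for cur, nxt in pairs) and s[-1] > s[0] + tol:
        return "increasing marginal return (risk taker)"
    return "constant or mixed marginal return"


for name, table in [("Table 15.8", risk_avoider), ("Table 15.10", risk_taker)]:
    print(name, [round(v, 3) for v in slopes(table)], "->", classify(table))

# Both curves are increasing: every slope is positive.
always_increasing = all(v > 0 for t in (risk_avoider, risk_taker) for v in slopes(t))
print("More money always means more utility:", always_increasing)
```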

We concluded that the utility function for a risk avoider shows a diminishing marginal return for money and that the utility function for a risk taker shows an increasing marginal return. When the marginal return for money is neither decreasing nor increasing but remains constant, the corresponding utility function describes the behavior of a decision maker who is neutral to risk. The following characteristics are associated with a risk-neutral decision maker:

1. The utility function can be drawn as a straight line connecting the “best” and the “worst” points.

2. The expected utility approach and the expected value approach applied to monetary payoffs result in the same action.

The straight, diagonal line in Figure 15.11 depicts the utility function of a risk-neutral decision maker using the Swofford, Inc. problem data.
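The second characteristic can be illustrated with the Swofford data. In this sketch the straight-line utility runs from a utility of 0 at −$50,000 to a utility of 10 at $50,000, matching the endpoints used earlier; the function name `linear_utility` is our own, and the code is a demonstration for this one payoff table rather than a general proof.

```python
probs = [0.3, 0.5, 0.2]
payoffs = {
    "Investment A": [30_000, 20_000, -50_000],
    "Investment B": [50_000, -20_000, -30_000],
    "Do not invest": [0, 0, 0],
}
best, worst = 50_000, -50_000


def linear_utility(m, u_best=10.0, u_worst=0.0):
    """Straight line through (worst, u_worst) and (best, u_best)."""
    return u_worst + (u_best - u_worst) * (m - worst) / (best - worst)


# Expected value of each alternative, and expected utility under the line.
ev = {d: sum(p * m for p, m in zip(probs, row)) for d, row in payoffs.items()}
eu = {d: sum(p * linear_utility(m) for p, m in zip(probs, row))
      for d, row in payoffs.items()}

rank_by_ev = sorted(ev, key=ev.get, reverse=True)
rank_by_eu = sorted(eu, key=eu.get, reverse=True)
print("Ranking by expected value:  ", rank_by_ev)
print("Ranking by expected utility:", rank_by_eu)
```

Because the utility is a linear function of the payoff, each expected utility is the same linear function of the corresponding expected value, so the two rankings must agree.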

Generally, when the payoffs for a particular decision-making problem fall into a reasonable range (the best is not too good and the worst is not too bad), decision makers tend to express preferences in agreement with the expected value approach. Thus, we suggest asking the decision maker to consider the best and worst possible payoffs for a problem and assess their reasonableness. If the decision maker believes that they are in the reasonable range, the decision alternative with the best expected value can be used. However, if the payoffs appear unreasonably large or unreasonably small (e.g., a huge loss) and if the decision maker believes that monetary values do not adequately reflect her or his true preferences for the payoffs, a utility analysis of the problem should be considered.

Exponential Utility Function

Having a decision maker provide enough indifference values to create a utility function can be time consuming. An alternative is to assume that the decision maker’s utility is defined by an exponential function. Figure 15.12 shows examples of different exponential utility functions. Note that all the exponential utility functions indicate that the decision maker is risk averse. The form of the exponential utility function is as follows:

EXPONENTIAL UTILITY FUNCTION

\[
U(x) = 1 - e^{-x/R} \qquad (15.7)
\]

The R parameter in equation (15.7) represents the decision maker’s risk tolerance; it controls the shape of the exponential utility function. Larger R values create flatter exponential functions, indicating that the decision maker is less risk averse (closer to risk neutral). Smaller R values indicate that the decision maker has less risk tolerance (is more risk averse). A common method to determine an approximate risk tolerance is to ask the decision maker to consider a scenario in which he or she could win $R with probability 0.5 and lose $R/2 with probability 0.5. The R value to use in equation (15.7) is the largest $R for which the decision maker would accept this gamble. For instance, if the decision maker is comfortable accepting a gamble with a 50% chance of winning $2,000 and a 50% chance of losing $1,000, but not with a gamble with a 50% chance of winning $3,000 and a 50% chance of losing $1,500, then we would use R = $2,000 in equation (15.7). Determining the maximum gamble that a decision maker is willing to take and then using this value in the exponential utility function can be much less time consuming than generating a complete table of indifference probabilities. One should remember that using an exponential utility function assumes that the decision maker is risk averse; however, this is often true in practice for business decisions.

In equation (15.7), the number e ≈ 2.718282… is a mathematical constant corresponding to the base of the natural logarithm. In Excel, \(e^x\) can be evaluated for any power x using the function EXP(x).
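Outside Excel, equation (15.7) is equally short to evaluate. The Python sketch below is illustrative only: it uses the standard library's `math.exp`, the function name `exponential_utility` is our own, and the sample point x = 10 is an arbitrary choice. It also evaluates the 50/50 reference gamble (win R, lose R/2) for R = 2,000 to show why that gamble is used to gauge R.

```python
import math


def exponential_utility(x, R):
    """Equation (15.7): U(x) = 1 - e^(-x/R), with risk tolerance R > 0."""
    return 1 - math.exp(-x / R)


# Larger R gives a flatter curve: utility at x = 10 for the R values in Figure 15.12.
utility_at_10 = {R: exponential_utility(10, R) for R in (10, 20, 50, 100)}
for R, u in utility_at_10.items():
    print(f"R = {R:3d}: U(10) = {u:.4f}")

# Reference gamble for R = 2,000: win $2,000 or lose $1,000, each with probability 0.5.
R = 2_000
gamble_eu = 0.5 * exponential_utility(R, R) + 0.5 * exponential_utility(-R / 2, R)
status_quo = exponential_utility(0, R)  # utility of declining the gamble
print(f"EU(gamble) = {gamble_eu:.4f}, U(0) = {status_quo:.4f}")
```

The expected utility of the reference gamble comes out close to U(0) = 0 (slightly below it), which is consistent with the text's description of this elicitation method as giving an approximate risk tolerance.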

FIGURE 15.12  Exponential Utility Functions with Different Risk Tolerance (R) Values (horizontal axis: x; vertical axis: utility, U(x); curves shown for R = 10, R = 20, R = 50, and R = 100)

NOTES + COMMENTS

1. In the Swofford problem, we have been using a utility of 10 for the best payoff and 0 for the worst. We could have chosen any values as long as the utility associated with the best payoff exceeds the utility associated with the worst payoff. Alternatively, a utility of 1 can be associated with the best payoff and a utility of 0 can be associated with the worst payoff. Had we made this choice, the utility for any monetary value M would have been the value of p at which the decision maker was indifferent between a guaranteed payoff of M and a lottery in which the best payoff is obtained with probability p and the worst payoff is obtained with probability (1 − p). Thus, the utility for any monetary value would have been equal to the probability of earning the best payoff. Often this choice is made because of the ease in computation. We chose not to do so to emphasize the distinction between the utilities and the indifference probabilities for the lottery.

2. Circumstances often dictate whether one acts as a risk avoider or a risk taker when making a decision. For example, you may think of yourself as a risk avoider when faced with financial decisions, but if you have ever purchased a lottery ticket, you have actually acted as a risk taker. Suppose you purchase a $1 lottery ticket for a simple lottery in which the object is to pick the six numbers that will be drawn from 50 potential numbers. Also suppose that the winner (who correctly chooses all six numbers that are drawn) will receive $1,000,000. There are 15,890,700 possible winning combinations, so your probability of winning is 1/15,890,700 ≈ 0.000000062929889809763 (i.e., very low) and the expected value of your ticket is

\[
\frac{1}{15{,}890{,}700}(\$1{,}000{,}000 - \$1) + \left(1 - \frac{1}{15{,}890{,}700}\right)(-\$1) = -\$0.93707
\]

or about −$0.94.

If a lottery ticket has a negative expected value, why does anyone play? The answer is in utility; most people who play lotteries associate great utility with the possibility of winning the $1,000,000 prize and relatively little utility with the $1 cost for a ticket, and so the expected value of the utility of the lottery ticket is positive even though the expected value of the ticket is negative.
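The lottery figures in the second note can be verified directly. This short Python sketch is illustrative only; it relies on `math.comb` from the standard library (available in Python 3.8 and later), and the variable names are our own.

```python
import math

# Number of ways to choose 6 numbers from 50 when order does not matter.
combinations = math.comb(50, 6)

p_win = 1 / combinations
prize, ticket_cost = 1_000_000, 1

# Net outcomes: win the prize minus the ticket cost, or lose the ticket cost.
expected_value = p_win * (prize - ticket_cost) + (1 - p_win) * (-ticket_cost)

print(f"Combinations:   {combinations:,}")
print(f"P(win):         {p_win:.3e}")
print(f"Expected value: {expected_value:.5f} dollars")
```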


SUMMARY

Decision analysis can be used to determine a recommended decision alternative or an optimal decision strategy when a decision maker is faced with an uncertain and risk-filled pattern of future events. The goal of decision analysis is to identify the best decision alternative or the optimal decision strategy, given information about the uncertain events and the possible consequences or payoffs. The “best” decision should consider the risk preference of the decision maker in evaluating outcomes.

We showed how payoff tables and decision trees could be used to structure a decision problem and describe the relationships among the decisions, the chance events, and the consequences. We presented three approaches to decision making without probabilities: the optimistic approach, the conservative approach, and the minimax regret approach. When probability assessments are provided for the states of nature, the expected value approach can be used to identify the recommended decision alternative or decision strategy.
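As a compact recap of the three approaches that do not use probabilities, the sketch below applies them to the Swofford, Inc. payoff table (Table 15.11), treated as a maximization problem. Applying them to this particular table is our own illustration rather than an analysis carried out in the chapter, and the variable names are hypothetical.

```python
# Swofford, Inc. payoffs in $1,000s (Table 15.11); columns are states s1, s2, s3.
payoffs = {
    "Investment A": [30, 20, -50],
    "Investment B": [50, -20, -30],
    "Do not invest": [0, 0, 0],
}

# Optimistic approach: choose the alternative with the largest best-case payoff.
optimistic = max(payoffs, key=lambda d: max(payoffs[d]))

# Conservative approach: choose the alternative with the largest worst-case payoff.
conservative = max(payoffs, key=lambda d: min(payoffs[d]))

# Minimax regret: regret is the best payoff in each state minus the payoff received.
n_states = 3
best_per_state = [max(row[j] for row in payoffs.values()) for j in range(n_states)]
regret = {d: [best_per_state[j] - row[j] for j in range(n_states)]
          for d, row in payoffs.items()}
minimax_regret = min(regret, key=lambda d: max(regret[d]))

print("Optimistic:    ", optimistic)
print("Conservative:  ", conservative)
print("Regret table:  ", regret)
print("Minimax regret:", minimax_regret)
```

The three approaches need not agree, which is one reason the expected value and expected utility approaches are preferred when probability assessments are available.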

Even though the expected value approach can be used to obtain a recommended decision alternative or optimal decision strategy, the payoff that actually occurs will usually have a value different from the expected value. A risk profile provides a probability distribution for the possible payoffs and can assist the decision maker in assessing the risks associated with different decision alternatives. Sensitivity analysis can be conducted to determine the effect changes in the probabilities for the states of nature and changes in the values of the payoffs have on the recommended decision alternative.

In cases in which sample information about the chance events is available, a sequence of decisions has to be made. First we must decide whether to obtain the sample information. If the answer is yes, an optimal decision strategy based on the specific sample information must be developed. In this situation, decision trees and the expected value approach can be used to determine the optimal decision strategy.

Bayes’ theorem can be used to compute branch probabilities for decision trees. Bayes’ theorem updates a decision maker’s prior probabilities regarding the states of nature using sample information to compute revised posterior probabilities.

We showed how utility could be used in decision-making situations in which monetary value did not provide an adequate measure of the payoffs. Utility is a measure of the total worth of an outcome. As such, utility takes into account the decision maker’s assessment of all aspects of a consequence, including profit, loss, risk, and perhaps additional nonmonetary factors. The examples showed how the use of expected utility can lead to decision recommendations that differ from those based on expected value.

A decision maker’s judgment must be used to establish the utility for each consequence. We presented a step-by-step procedure to determine a decision maker’s utility for monetary payoffs. We also discussed how conservative, risk-avoiding decision makers assess utility differently from more aggressive, risk-taking decision makers.

GLOSSARY

Bayes’ theorem: A theorem that enables the use of sample information to revise prior probabilities.

Branch: Lines showing the alternatives from decision nodes and the outcomes from chance nodes.

Chance event: An uncertain future event affecting the consequence, or payoff, associated with a decision.

Chance nodes: Nodes indicating points at which an uncertain event will occur.

Conditional probabilities: The probability of one event, given the known outcome of a (possibly) related event.

Conservative approach: An approach to choosing a decision alternative without using probabilities. For a maximization problem, it leads to choosing the decision alternative that maximizes the minimum payoff; for a minimization problem, it leads to choosing the decision alternative that minimizes the maximum payoff.

Decision alternatives: Options available to the decision maker.

Decision nodes: Nodes indicating points at which a decision is made.

Decision strategy: A strategy involving a sequence of decisions and chance outcomes to provide the optimal solution to a decision problem.

Decision tree: A graphical representation of the decision problem that shows the sequential nature of the decision-making process.

Expected utility (EU): The weighted average of the utilities associated with a decision alternative. The weights are the state-of-nature probabilities.

Expected value (EV): For a chance node, the weighted average of the payoffs. The weights are the state-of-nature probabilities.

Expected value approach: An approach to choosing a decision alternative based on the expected value of each decision alternative. The recommended decision alternative is the one that provides the best expected value.

Expected value of perfect information (EVPI): The difference between the expected value of an optimal strategy based on perfect information and the “best” expected value without any sample information.

Expected value of sample information (EVSI): The difference between the expected value of an optimal strategy based on sample information and the “best” expected value without any sample information.

Minimax regret approach: An approach to choosing a decision alternative without using probabilities. For each alternative, the maximum regret is computed, which leads to choosing the decision alternative that minimizes the maximum regret.

Node: An intersection or junction point of a decision tree.

Optimistic approach: An approach to choosing a decision alternative without using probabilities. For a maximization problem, it leads to choosing the decision alternative corresponding to the largest payoff; for a minimization problem, it leads to choosing the decision alternative corresponding to the smallest payoff.

Outcome: The result obtained when a decision alternative is chosen and a chance event occurs.

Payoff: A measure of the outcome of a decision such as profit, cost, or time. Each combination of a decision alternative and a state of nature has an associated payoff.

Payoff table: A tabular representation of the payoffs for a decision problem.

Perfect information: A special case of sample information in which the information tells the decision maker exactly which state of nature is going to occur.

Posterior (revised) probabilities: The probabilities of the states of nature after revising the prior probabilities based on sample information.

Prior probabilities: The probabilities of the states of nature prior to obtaining sample information.

Regret (opportunity loss): The amount of loss (lower profit or higher cost) from not making the best decision for each state of nature.

Risk analysis: The study of the possible payoffs and probabilities associated with a decision alternative or a decision strategy in the face of uncertainty.

Risk avoider: A decision maker who would choose a guaranteed payoff over a lottery with a better expected payoff.

Risk-neutral: A decision maker who is neutral to risk. For this decision maker, the decision alternative with the best expected value is identical to the alternative with the highest expected utility.

Risk profile: The probability distribution of the possible payoffs associated with a decision alternative or decision strategy.

Risk taker: A decision maker who would choose a lottery over a better guaranteed payoff.

Sample information: New information obtained through research or experimentation that enables updating or revising the state-of-nature probabilities.

Sensitivity analysis: The study of how changes in the probability assessments for the states of nature or changes in the payoffs affect the recommended decision alternative.

States of nature: The possible outcomes for chance events that affect the payoff associated with a decision alternative.

Utility: A measure of the total worth of a consequence reflecting a decision maker’s attitude toward considerations such as profit, loss, and risk.

Utility function for money: A curve that depicts the relationship between monetary value and utility.

PROBLEMS

1. The following payoff table shows profit for a decision analysis problem with two decision alternatives and three states of nature:

                          State of Nature
Decision Alternative      \(s_1\)    \(s_2\)    \(s_3\)
\(d_1\)                   250        100        25
\(d_2\)                   100        100        75

a. Construct a decision tree for this problem.

b. If the decision maker knows nothing about the probabilities of the three states of nature, what is the recommended decision using the optimistic, conservative, and minimax regret approaches?

2. Southland Corporation’s decision to produce a new line of recreational products resulted in the need to construct either a small plant or a large plant. The best selection of plant size depends on how the marketplace reacts to the new product line. To conduct an analysis, marketing management has decided to view the possible long-run demand as low, medium, or high. The following payoff table shows the projected profit in millions of dollars:

                Long-Run Demand
Plant Size      Low     Medium    High
Small           150     200       200
Large           50      200       500

a. What is the decision to be made, and what is the chance event for Southland’s problem?

b. Construct a decision tree.

c. Recommend a decision based on the use of the optimistic, conservative, and minimax regret approaches.

3. Amy Lloyd is interested in leasing a new Honda and has contacted three automobile dealers for pricing information. Each dealer offered Amy a closed-end 36-month lease with no down payment due at the time of signing. Each lease includes a monthly charge and a mileage allowance. Additional miles receive a surcharge on a per-mile basis. The monthly lease cost, the mileage allowance, and the cost for additional miles are as follows:

Dealer                 Monthly Cost    Mileage Allowance    Cost per Additional Mile
Hepburn Honda          $299            36,000               $0.15
Midtown Motors         $310            45,000               $0.20
Hopkins Automotive     $325            54,000               $0.15

Amy decided to choose the lease option that will minimize her total 36-month cost. The difficulty is that Amy is not sure how many miles she will drive over the next three years. For purposes of this decision, she believes it is reasonable to assume that she will drive 12,000 miles per year, 15,000 miles per year, or 18,000 miles per year. With this assumption Amy estimated her total costs for the three lease options. For example, she figures that the Hepburn Honda lease will cost her 36($299) + $0.15(36,000 − 36,000) = $10,764 if she drives 12,000 miles per year, 36($299) + $0.15(45,000 − 36,000) = $12,114 if she drives 15,000 miles per year, or 36($299) + $0.15(54,000 − 36,000) = $13,464 if she drives 18,000 miles per year.

a. What is the decision, and what is the chance event?

b. Construct a payoff table for Amy’s problem.

c. If Amy has no idea which of the three mileage assumptions is most appropriate, what is the recommended decision (leasing option) using the optimistic, conservative, and minimax regret approaches?

d. Suppose that the probabilities that Amy drives 12,000, 15,000, and 18,000 miles per year are 0.5, 0.4, and 0.1, respectively. What option should Amy choose using the expected value approach?

e. Develop a risk profile for the decision selected in part (d). What is the most likely cost, and what is its probability?

f. Suppose that, after further consideration, Amy concludes that the probabilities that she will drive 12,000, 15,000, and 18,000 miles per year are 0.3, 0.4, and 0.3, respectively. What decision should Amy make using the expected value approach?

4. Investment advisors estimated the stock market returns for four market segments: computers, financial, manufacturing, and pharmaceuticals. Annual return projections vary depending on whether the general economic conditions are improving, stable, or declining. The anticipated annual return percentages for each market segment under each economic condition are as follows:

                      Economic Condition
Market Segment        Improving    Stable    Declining
Computers             10           2         −4
Financial             8            5         −3
Manufacturing         6            4         −2
Pharmaceuticals       6            5         −1

a. Assume that an individual investor wants to select one market segment for a new investment. A forecast shows improving to declining economic conditions with the following probabilities: improving (0.2), stable (0.5), and declining (0.3). What is the preferred market segment for the investor, and what is the expected return percentage?

b. At a later date, a revised forecast shows a potential for an improvement in economic conditions. New probabilities are as follows: improving (0.4), stable (0.4), and declining (0.2). What is the preferred market segment for the investor based on these new probabilities? What is the expected return percentage?

5. Hudson Corporation is considering three options for managing its data warehouse: continuing with its own staff, hiring an outside vendor to do the managing, or using a combination of its own staff and an outside vendor. The cost of the operation depends on future demand. The annual cost of each option (in thousands of dollars) depends on demand as follows:

                     Demand
Staffing Options     High    Medium    Low
Own staff            650     650       600
Outside vendor       900     600       300
Combination          800     650       500

a. If the demand probabilities are 0.2, 0.5, and 0.3, which decision alternative will minimize the expected cost of the data warehouse? What is the expected annual cost associated with that recommendation?

b. Construct a risk profile for the optimal decision in part (a). What is the probability of the cost exceeding $700,000?


6. The following payoff table shows the profit for a decision problem with two states of nature and two decision alternatives:

                          State of Nature
Decision Alternative      \(s_1\)    \(s_2\)
\(d_1\)                   10         1
\(d_2\)                   4          3

a. Suppose \(P(s_1) = 0.2\) and \(P(s_2) = 0.8\). What is the best decision using the expected value approach?

b. Perform sensitivity analysis on the payoffs for decision alternative \(d_1\). Assume that the probabilities are as given in part (a), and find the range of payoffs under states of nature \(s_1\) and \(s_2\) that will keep the solution found in part (a) optimal. Is the solution more sensitive to the payoff under state of nature \(s_1\) or \(s_2\)?

7. Myrtle Air Express decided to offer direct service from Cleveland to Myrtle Beach. Management must decide between a full-price service using the company’s new fleet of jet aircraft and a discount service using smaller-capacity commuter planes. It is clear that the best choice depends on the market reaction to the service Myrtle Air offers. Management developed estimates of the contribution to profit for each type of service based on two possible levels of demand for service to Myrtle Beach: strong and weak. The following table shows the estimated quarterly profits (in thousands of dollars):

               Demand for Service
Service        Strong    Weak
Full price     $960      −$490
Discount       $670      $320

a. What is the decision to be made, what is the chance event, and what is the consequence for this problem? How many decision alternatives are there? How many outcomes are there for the chance event?

b. If nothing is known about the probabilities of the chance outcomes, what is the recommended decision using the optimistic, conservative, and minimax regret approaches?

c. Suppose that management of Myrtle Air Express believes that the probability of strong demand is 0.7 and the probability of weak demand is 0.3. Use the expected value approach to determine an optimal decision.

d. Suppose that the probability of strong demand is 0.8 and the probability of weak demand is 0.2. What is the optimal decision using the expected value approach?

e. Use sensitivity analysis to determine the range of demand probabilities for which each of the decision alternatives has the largest expected value.

8. Video Tech is considering marketing one of two new video games for the coming holiday season: Battle Pacific or Space Pirates. Battle Pacific is a unique game and appears to have no competition. Estimated profits (in thousands of dollars) under high, medium, and low demand are as follows:

                  Demand
Battle Pacific    High      Medium    Low
Profit            $1,000    $700      $300
Probability       0.2       0.5       0.3

Video Tech is optimistic about its Space Pirates game. However, the concern is that profitability will be affected by a competitor’s introduction of a video game viewed as similar to Space Pirates. Estimated profits (in thousands of dollars) with and without competition are as follows:


Space Pirates               Demand
With Competition            High      Medium    Low
Profit                      $800      $400      $200
Probability                 0.3       0.4       0.3

Space Pirates               Demand
Without Competition         High      Medium    Low
Profit                      $1,600    $800      $400
Probability                 0.5       0.3       0.2

a. Develop a decision tree for the Video Tech problem.

b. For planning purposes, Video Tech believes there is a 0.6 probability that its competitor will produce a new game similar to Space Pirates. Given this probability of competition, the director of planning recommends marketing the Battle Pacific video game. Using expected value, what is your recommended decision?

c. Show a risk profile for your recommended decision.

d. Use sensitivity analysis to determine what the probability of competition for Space Pirates would have to be for you to change your recommended decision alternative.

9. Seneca Hill Winery recently purchased land for the purpose of establishing a new vineyard. Management is considering two varieties of white grapes for the new vineyard: Chardonnay and Riesling. The Chardonnay grapes would be used to produce a dry Chardonnay wine, and the Riesling grapes would be used to produce a semidry Riesling wine. It takes approximately four years from the time of planting before new grapes can be harvested. This length of time creates a great deal of uncertainty concerning future demand and makes the decision about the type of grapes to plant difficult. Three possibilities are being considered: Chardonnay grapes only; Riesling grapes only; and both Chardonnay and Riesling grapes. Seneca management decided that for planning purposes it would be adequate to consider only two demand possibilities for each type of wine: strong or weak. With two possibilities for each type of wine, it was necessary to assess four probabilities. With the help of some forecasts in industry publications, management made the following probability assessments:

                     Riesling Demand
Chardonnay Demand    Weak      Strong
Weak                 0.05      0.50
Strong               0.25      0.20

Revenue projections show an annual contribution to profit of $20,000 if Seneca Hill plants only Chardonnay grapes and demand is weak for Chardonnay wine, and $70,000 if Seneca plants only Chardonnay grapes and demand is strong for Chardonnay wine. If Seneca plants only Riesling grapes, the annual profit projection is $25,000 if demand is weak for Riesling grapes and $45,000 if demand is strong for Riesling grapes. If Seneca plants both types of grapes, the annual profit projections are as shown in the following table:

                     Riesling Demand
Chardonnay Demand    Weak       Strong
Weak                 $22,000    $40,000
Strong               $26,000    $60,000

a. What is the decision to be made, what is the chance event, and what is the consequence? Identify the alternatives for the decisions and the possible outcomes for the chance events.

714 Chapter 15 Decision Analysis

b. Develop a decision tree.

c. Use the expected value approach to recommend which alternative Seneca Hill Winery should follow in order to maximize expected annual profit.

d. Suppose management is concerned about the probability assessments when demand for Chardonnay wine is strong. Some believe it is likely for Riesling demand to also be strong in this case. Suppose that the probability of strong demand for Chardonnay and weak demand for Riesling is 0.05 and that the probability of strong demand for Chardonnay and strong demand for Riesling is 0.40. How does this change the recommended decision? Assume that the probabilities when Chardonnay demand is weak are still 0.05 and 0.50.

e. Other members of the management team expect the Chardonnay market to become saturated at some point in the future, causing a fall in prices. Suppose that the annual profit projections fall to $50,000 when demand for Chardonnay is strong and only Chardonnay grapes are planted. Using the original probability assessments, determine how this change would affect the optimal decision.

10. Hemmingway, Inc. is considering a $5 million research and development (R&D) project. Profit projections appear promising, but Hemmingway’s president is concerned because the probability that the R&D project will be successful is only 0.50. Furthermore, the president knows that even if the project is successful, it will require that the company build a new production facility at a cost of $20 million in order to manufacture the product. If the facility is built, uncertainty remains about the demand and thus uncertainty about the profit that will be realized. Another option is that if the R&D project is successful, the company could sell the rights to the product for an estimated $25 million. Under this option, the company would not build the $20 million production facility.

The decision tree follows. The profit projection for each outcome is shown at the end of the branches. For example, the revenue projection for the high demand outcome is $59 million. However, the cost of the R&D project ($5 million) and the cost of the production facility ($20 million) show the profit of this outcome to be \(\$59 - \$5 - \$20 = \$34\) million. Branch probabilities are also shown for the chance events.

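[Figure: decision tree for the Hemmingway R&D project. Profit ($ millions) is shown at the end of each branch.
Decision: Start R&D Project ($5 million) or Do Not Start the R&D Project.
Chance event after starting: Successful (0.5) or Not Successful (0.5).
Decision after success: Building Facility ($20 million) or Sell Rights.
Chance event after building: High Demand (0.5), Medium Demand (0.3), or Low Demand (0.2).]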

a. Analyze the decision tree to determine whether the company should undertake the R&D project. If it does, and if the R&D project is successful, what should the company do? What is the expected value of your strategy?

b. What must the selling price be for the company to consider selling the rights to the product?

c. Develop a risk profile for the optimal strategy.

11. Dante Development Corporation is considering bidding on a contract for a new office building complex. The following figure shows the decision tree prepared by one of Dante’s analysts. At node 1, the company must decide whether to bid on the contract. The cost of preparing the bid is $200,000. The upper branch from node 2 shows that the company has a 0.8 probability of winning the contract if it submits a bid. If the company wins the bid, it will have to pay $2 million to become a partner in the project. Node 3 shows that the company will then consider doing a market research study to forecast demand for the office units prior to beginning construction. The cost of this study is $150,000. Node 4 is a chance node showing the possible outcomes of the market research study.

Nodes 5, 6, and 7 are similar in that they are the decision nodes for Dante to either build the office complex or sell the rights in the project to another developer. The decision to build the complex will result in an income of $5 million if demand is high and $3 million if demand is moderate. If Dante chooses to sell its rights in the project to another developer, income from the sale is estimated to be $3.5 million. The probabilities shown at nodes 4, 8, and 9 are based on the projected outcomes of the market research study.

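[Figure: decision tree for Dante Development Corporation. Profit ($1,000s) is shown at the end of each branch.
Node 1 (decision): Bid or Do Not Bid.
Node 2 (chance, after Bid): Win Contract 0.8; Lose Contract 0.2, profit −200.
Node 3 (decision, after Win Contract): Market Research or No Market Research.
Node 4 (chance, after Market Research): Forecast High 0.6; Forecast Moderate 0.4.
Node 5 (decision, after Forecast High): Build Complex (High Demand 0.85, profit 2,650; Moderate Demand 0.15, profit 650) or Sell (profit 1,150).
Node 6 (decision, after Forecast Moderate): Build Complex (High Demand 0.225, profit 2,650; Moderate Demand 0.775, profit 650) or Sell (profit 1,150).
Node 7 (decision, after No Market Research): Build Complex (High Demand 0.6, profit 2,800; Moderate Demand 0.4, profit 800) or Sell (profit 1,300).]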

a. Verify Dante’s profit projections shown at the ending branches of the decision tree by calculating the payoffs of $2,650,000 and $650,000 for the first two outcomes.

b. What is the optimal decision strategy for Dante, and what is the expected profit for this project?

c. What would the cost of the market research study have to be before Dante would change its decision about the market research study?

d. Develop a risk profile for Dante.
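The payoff check in part (a) and the backward pass through the tree can be sketched as follows. This is a hedged illustration rather than the textbook's own procedure: the constant and function names are invented for the example, the payoff of 0 for not bidding is an assumption (no cost is incurred on that branch), and the assignment of nodes 5 and 6 to the high and moderate forecasts follows the usual top-to-bottom reading of the figure.

```python
# All amounts in thousands of dollars, taken from the Problem 11 description.
BID_COST = 200
PARTNER_COST = 2000
RESEARCH_COST = 150
INCOME = {"high": 5000, "moderate": 3000, "sell": 3500}


def payoff(outcome, research):
    """Net profit at an ending branch once the contract has been won."""
    costs = BID_COST + PARTNER_COST + (RESEARCH_COST if research else 0)
    return INCOME[outcome] - costs


def build_or_sell(p_high, research):
    """Value of a build-or-sell decision node: the better of the two branches."""
    build = (p_high * payoff("high", research)
             + (1 - p_high) * payoff("moderate", research))
    sell = payoff("sell", research)
    return max(build, sell)


node5 = build_or_sell(0.85, research=True)    # after Forecast High
node6 = build_or_sell(0.225, research=True)   # after Forecast Moderate
node7 = build_or_sell(0.60, research=False)   # no market research

node4 = 0.6 * node5 + 0.4 * node6             # chance: research forecast
node3 = max(node4, node7)                     # decision: research or not
node2 = 0.8 * node3 + 0.2 * (-BID_COST)       # chance: win or lose contract
node1 = max(node2, 0)                         # decision: bid or not (0 assumed)

print(payoff("high", True), payoff("moderate", True))
print(f"Expected profit at node 1: {node1:.1f}")
```

Working from the ending branches back to node 1 in this way is the same rollback used when analyzing a decision tree by hand.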

12. Embassy Publishing Company received a six-chapter manuscript for a new college textbook. The editor of the college division is familiar with the manuscript and estimated a 0.65 probability that the textbook will be successful. If successful, a profit of $750,000 will be realized. If the company decides to publish the textbook and it is unsuccessful, a loss of $250,000 will occur.

Before making the decision to accept or reject the manuscript, the editor is considering sending the manuscript out for review. A review process provides either a favorable (F) or unfavorable (U) evaluation of the manuscript. Past experience with the review process suggests that probabilities \(P(F) = 0.7\) and \(P(U) = 0.3\) apply. Let \(s_1\) = the textbook is successful and \(s_2\) = the textbook is unsuccessful. The editor’s initial probabilities of \(s_1\) and \(s_2\) will be revised based on whether the review is favorable or unfavorable. The revised probabilities are as follows:

\[ P(s_1 \mid F) = 0.75 \qquad P(s_1 \mid U) = 0.417 \]

\[ P(s_2 \mid F) = 0.25 \qquad P(s_2 \mid U) = 0.583 \]

a. Construct a decision tree assuming that the company will first make the decision as to whether to send the manuscript out for review and then make the decision to accept or reject the manuscript.

b. Analyze the decision tree to determine the optimal decision strategy for the publishing company.

c. If the manuscript review costs $5,000, what is your recommendation?

d. What is the expected value of perfect information? What does this EVPI suggest for the company?

13. The following profit payoff table was presented in Problem 1:

                        State of Nature
Decision Alternative    \(s_1\)    \(s_2\)    \(s_3\)
\(d_1\)                 250        100        25
\(d_2\)                 100        100        75

The probabilities for the states of nature are \(P(s_1) = 0.65\), \(P(s_2) = 0.15\), and \(P(s_3) = 0.20\).

a. What is the optimal decision strategy if perfect information were available?

b. What is the expected value for the decision strategy developed in part (a)?

c. Using the expected value approach, what is the recommended decision without perfect information? What is its expected value?

d. What is the expected value of perfect information?
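The three quantities asked for in Problem 13 fit together as EVPI = (expected value with perfect information) − (expected value without perfect information). A minimal Python sketch, assuming the payoff table and probabilities above and using illustrative names that are not from the text:

```python
# Payoff table from Problem 13: one row per decision, one column per state.
payoffs = {
    "d1": [250, 100, 25],
    "d2": [100, 100, 75],
}
probs = [0.65, 0.15, 0.20]  # P(s1), P(s2), P(s3)


def expected_value(row):
    """Expected payoff of one row of the table under the prior probabilities."""
    return sum(p * v for p, v in zip(probs, row))


# Without perfect information: commit to the single best decision in advance.
ev_without_pi = max(expected_value(row) for row in payoffs.values())

# With perfect information: pick the best decision separately for each state.
best_per_state = [max(column) for column in zip(*payoffs.values())]
ev_with_pi = expected_value(best_per_state)

evpi = ev_with_pi - ev_without_pi
print(f"EV without PI: {ev_without_pi:.1f}")
print(f"EV with PI:    {ev_with_pi:.1f}")
print(f"EVPI:          {evpi:.1f}")
```

For cost or time payoffs, the `max` calls would become `min` and the sign of the difference would flip.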

14. The Lake Placid Town Council decided to build a new community center to be used for conventions, concerts, and other public events, but considerable controversy surrounds the appropriate size. Many influential citizens want a large center that would be a showcase for the area. But the mayor feels that if demand does not support such a center, the community will lose a large amount of money. To provide structure for the decision process, the council narrowed the building alternatives to three sizes: small, medium, and large. Everybody agreed that the critical factor in choosing the best size is the number of people who will want to use the new facility. A regional planning consultant provided demand estimates under three scenarios: worst case, base case, and best case. The worst-case scenario corresponds to a situation in which tourism drops substantially; the base-case scenario corresponds to a situation in which Lake Placid continues to attract visitors at current levels; and the best-case scenario corresponds to a substantial increase in tourism. The consultant has provided probability assessments of 0.10, 0.60, and 0.30 for the worst-case, base-case, and best-case scenarios, respectively.

The town council suggested using net cash flow over a five-year planning horizon as the criterion for deciding on the best size. The following projections of net cash flow (in thousands of dollars) for a five-year planning horizon have been developed. All costs, including the consultant’s fee, have been included.

               Demand Scenario
Center Size    Worst Case    Base Case    Best Case
Small          400           500          660
Medium         −250          650          800
Large          −400          580          990

a. What decision should Lake Placid make using the expected value approach?

b. Construct risk profiles for the medium and large alternatives. Given the mayor’s concern over the possibility of losing money and the result of part (a), which alternative would you recommend?

c. Compute the expected value of perfect information. Do you think it would be worth trying to obtain additional information concerning which scenario is likely to occur?

d. Suppose the probability of the worst-case scenario increases to 0.2, the probability of the base-case scenario decreases to 0.5, and the probability of the best-case scenario remains at 0.3. What effect, if any, would these changes have on the decision recommendation?

e. The consultant has suggested that an expenditure of $150,000 on a promotional campaign over the planning horizon will effectively reduce the probability of the worst-case scenario to zero. If the campaign can be expected to also increase the probability of the best-case scenario to 0.4, is it a good investment?

15. A real estate investor has the opportunity to purchase land currently zoned as residential. If the county board approves a request to rezone the property as commercial within the next year, the investor will be able to lease the land to a large discount firm that wants to open a new store on the property. However, if the zoning change is not approved, the investor will have to sell the property at a loss. Profits (in thousands of dollars) are shown in the following payoff table:

                            State of Nature
Decision Alternative        Rezoning Approved \(s_1\)    Rezoning Not Approved \(s_2\)
Purchase, \(d_1\)           600                          −200
Do not purchase, \(d_2\)    0                            0

a. If the probability that the rezoning will be approved is 0.5, what decision is recommended? What is the expected profit?

b. The investor can purchase an option to buy the land. Under the option, the investor maintains the rights to purchase the land anytime during the next three months while learning more about possible resistance to the rezoning proposal from area residents. Probabilities are as follows:

Let H = high resistance to rezoning
    L = low resistance to rezoning

\[ P(H) = 0.55 \qquad P(s_1 \mid H) = 0.18 \qquad P(s_2 \mid H) = 0.82 \]

\[ P(L) = 0.45 \qquad P(s_1 \mid L) = 0.89 \qquad P(s_2 \mid L) = 0.11 \]

What is the optimal decision strategy if the investor uses the option period to learn more about the resistance from area residents before making the purchase decision?


c. If the option will cost the investor an additional $10,000, should the investor purchase the option? Why or why not? What is the maximum that the investor should be willing to pay for the option?

16. Suppose that you are given a decision situation with three possible states of nature: \(s_1\), \(s_2\), and \(s_3\). The prior probabilities are \(P(s_1) = 0.2\), \(P(s_2) = 0.5\), and \(P(s_3) = 0.3\). With sample information I, \(P(I \mid s_1) = 0.1\), \(P(I \mid s_2) = 0.05\), and \(P(I \mid s_3) = 0.2\). Compute the revised (or posterior) probabilities: \(P(s_1 \mid I)\), \(P(s_2 \mid I)\), and \(P(s_3 \mid I)\).
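The revision in Problem 16 is a direct application of Bayes' theorem: multiply each prior by its conditional probability to get the joint probabilities, sum them to get \(P(I)\), and divide each joint probability by that sum. A short Python sketch of that tabular procedure follows; the function name `revise` is hypothetical and chosen only for this example.

```python
def revise(priors, conditionals):
    """Bayes' theorem in tabular form.

    priors:       P(s_j) for each state of nature
    conditionals: P(I | s_j) for the observed sample information I
    Returns the posteriors P(s_j | I) and the marginal probability P(I).
    """
    joint = [p * c for p, c in zip(priors, conditionals)]   # P(I and s_j)
    p_info = sum(joint)                                     # P(I)
    posteriors = [j / p_info for j in joint]                # P(s_j | I)
    return posteriors, p_info


priors = [0.2, 0.5, 0.3]           # P(s1), P(s2), P(s3)
conditionals = [0.1, 0.05, 0.2]    # P(I|s1), P(I|s2), P(I|s3)

posteriors, p_info = revise(priors, conditionals)
print(f"P(I) = {p_info:.3f}")
for j, post in enumerate(posteriors, start=1):
    print(f"P(s{j} | I) = {post:.4f}")
```

The same function can be reused wherever a problem supplies priors and conditional probabilities for an indicator, by calling it once per indicator outcome.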

17. To save on expenses, Rona and Jerry agreed to form a carpool for traveling to and from work. Rona prefers to use the somewhat longer but more consistent Queen City Avenue. Although Jerry prefers the quicker expressway, he agreed with Rona that they should take Queen City Avenue if the expressway has a traffic jam. The following payoff table provides the one-way time estimate in minutes for traveling to or from work:

                             State of Nature
Decision Alternative         Expressway Open, \(s_1\)    Expressway Jammed, \(s_2\)
Queen City Avenue, \(d_1\)   30                          30
Expressway, \(d_2\)          25                          45

Based on their experience with traffic problems, Rona and Jerry agreed on a 0.15 probability that the expressway would be jammed.

In addition, they agreed that weather seemed to affect the traffic conditions on the expressway. Let

C = clear
O = overcast
R = rain

The following conditional probabilities apply:

\[ P(C \mid s_1) = 0.8 \qquad P(O \mid s_1) = 0.2 \qquad P(R \mid s_1) = 0.0 \]

\[ P(C \mid s_2) = 0.1 \qquad P(O \mid s_2) = 0.3 \qquad P(R \mid s_2) = 0.6 \]

a. Use Bayes’ theorem for probability revision to compute the probability of each weather condition and the conditional probability of the expressway being open, \(s_1\), or jammed, \(s_2\), given each weather condition.

b. Show the decision tree for this problem.

c. What is the optimal decision strategy, and what is the expected travel time?

18. The Gorman Manufacturing Company must decide whether to manufacture a component part at its Milan, Michigan, plant or purchase the component part from a supplier. The resulting profit is dependent on the demand for the product. The following payoff table shows the projected profit (in thousands of dollars):

                        State of Nature
Decision Alternative    Low Demand \(s_1\)    Medium Demand \(s_2\)    High Demand \(s_3\)
Manufacture, \(d_1\)    −20                   40                       100
Purchase, \(d_2\)       10                    45                       70

The state-of-nature probabilities are \(P(s_1) = 0.35\), \(P(s_2) = 0.35\), and \(P(s_3) = 0.30\).

a. Use a decision tree to recommend a decision.

b. Use EVPI to determine whether Gorman should attempt to obtain a better estimate of demand.

c. A test market study of the potential demand for the product is expected to report either a favorable (F) or an unfavorable (U) condition. The relevant conditional probabilities are as follows:

\[ P(F \mid s_1) = 0.10 \qquad P(U \mid s_1) = 0.90 \]

\[ P(F \mid s_2) = 0.40 \qquad P(U \mid s_2) = 0.60 \]

\[ P(F \mid s_3) = 0.60 \qquad P(U \mid s_3) = 0.40 \]

What is the probability that the market research report will be favorable? [Hint: We can find this value by summing the joint probability values as follows: \(P(F) = P(F \cap s_1) + P(F \cap s_2) + P(F \cap s_3) = P(s_1)P(F \mid s_1) + P(s_2)P(F \mid s_2) + P(s_3)P(F \mid s_3)\).]

d. What is Gorman’s optimal decision strategy?

e. What is the expected value of the market research information?

19. A firm has three investment alternatives. Payoffs are in thousands of dollars.

                        Economic Conditions
Decision Alternative    Up, \(s_1\)    Stable, \(s_2\)    Down, \(s_3\)
Investment A, \(d_1\)   100            25                 0
Investment B, \(d_2\)   75             50                 25
Investment C, \(d_3\)   50             50                 50
Probabilities           0.40           0.30               0.30

a. Using the expected value approach, which decision is preferred?

b. For the lottery having a payoff of $100,000 with probability p and $0 with probability (1 − p), two decision makers expressed the following indifference probabilities. Find the most preferred decision for each decision maker using the expected utility approach.

            Indifference Probability (p)
Profit      Decision Maker A    Decision Maker B
$75,000     0.80                0.60
$50,000     0.60                0.30
$25,000     0.30                0.15

c. Why don’t decision makers A and B select the same decision alternative?
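The expected utility approach in part (b) can be sketched as below. The sketch makes one simplifying assumption that should be stated plainly: utilities are scaled so that the best payoff ($100,000) has utility 1 and the worst ($0) has utility 0, which makes each utility equal to the indifference probability itself. Any other positive scaling of the two endpoints would rank the alternatives the same way. All names are illustrative.

```python
# Payoffs in thousands of dollars and state probabilities from Problem 19.
payoff_table = {
    "Investment A": [100, 25, 0],
    "Investment B": [75, 50, 25],
    "Investment C": [50, 50, 50],
}
probs = [0.40, 0.30, 0.30]

# Utility of each payoff = indifference probability p (assumed 0-to-1 scale:
# U(100) = 1 and U(0) = 0 are the lottery's best and worst outcomes).
utilities = {
    "Decision Maker A": {100: 1.0, 75: 0.80, 50: 0.60, 25: 0.30, 0: 0.0},
    "Decision Maker B": {100: 1.0, 75: 0.60, 50: 0.30, 25: 0.15, 0: 0.0},
}


def expected_utility(row, utility):
    """Probability-weighted utility of one decision alternative."""
    return sum(p * utility[x] for p, x in zip(probs, row))


def preferred(utility):
    """Alternative with the highest expected utility for one decision maker."""
    return max(payoff_table,
               key=lambda d: expected_utility(payoff_table[d], utility))


for name, utility in utilities.items():
    print(name, "prefers", preferred(utility))
```

Replacing each `utility[x]` with `x` itself recovers the plain expected value calculation of part (a).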

20. Alexander Industries is considering purchasing an insurance policy for its new office building in St. Louis, Missouri. The policy has an annual cost of $10,000. If Alexander Industries doesn’t purchase the insurance and minor fire damage occurs, a cost of $100,000 is anticipated; the cost if major or total destruction occurs is $200,000. The costs, including the state-of-nature probabilities, are as follows:

                                     Damage
Decision Alternative                 None, \(s_1\)    Minor, \(s_2\)    Major, \(s_3\)
Purchase insurance, \(d_1\)          10,000           10,000            10,000
Do not purchase insurance, \(d_2\)   0                100,000           200,000
Probabilities                        0.96             0.03              0.01

a. Using the expected value approach, what decision do you recommend?

b. What lottery would you use to assess utilities? (Note: Because the data are costs, the best payoff is $0.)

c. Assume that you found the following indifference probabilities for the lottery defined in part (b). What decision would you recommend?

Joint probabilities are discussed in Chapter 5.


Cost       Indifference Probability
10,000     p = 0.99
100,000    p = 0.60

d. Do you favor using expected value or expected utility for this decision problem? Why?

21. In a certain state lottery, a lottery ticket costs $2. In terms of the decision to purchase or not to purchase a lottery ticket, suppose that the following payoff table applies:

                                          State of Nature
Decision Alternatives                     Win, \(s_1\)    Lose, \(s_2\)
Purchase lottery ticket, \(d_1\)          300,000         −2
Do not purchase lottery ticket, \(d_2\)   0               0

a. A realistic estimate of the chances of winning is 1 in 250,000. Use the expected value approach to recommend a decision.

b. If a particular decision maker assigns an indifference probability of 0.000001 to the $0 payoff, would this individual purchase a lottery ticket? Use expected utility to justify your answer.

22. Three decision makers have assessed utilities for the following decision problem (payoff in dollars):

                        State of Nature
Decision Alternative    \(s_1\)    \(s_2\)    \(s_3\)
\(d_1\)                 20         50         −20
\(d_2\)                 80         100        −100

The indifference probabilities are as follows:

          Indifference Probability (p)
Payoff    Decision Maker A    Decision Maker B    Decision Maker C
100       1.00                1.00                1.00
80        0.95                0.70                0.90
50        0.90                0.60                0.75
20        0.70                0.45                0.60
−20       0.50                0.25                0.40
−100      0.00                0.00                0.00

a. Plot the utility function for money for each decision maker.

b. Classify each decision maker as a risk avoider, a risk taker, or risk-neutral.

c. For the payoff of 20, what is the premium that the risk avoider will pay to avoid risk? What is the premium that the risk taker will pay to have the opportunity of the high payoff?

23. In Problem 22, if \(P(s_1) = 0.25\), \(P(s_2) = 0.50\), and \(P(s_3) = 0.25\), find a recommended decision for each of the three decision makers. (Note: For the same decision problem, different utilities can lead to different decisions.)

24. Translate the following monetary payoffs into utilities for a decision maker whose utility function is described by an exponential function with \(R = 250\): −$200, −$100, $0, $100, $200, $300, $400, $500.
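A small Python sketch of the translation in Problem 24 is given below. It assumes the common form of the exponential utility function, \(U(x) = 1 - e^{-x/R}\), where \(R\) is the risk tolerance; the problem statement here gives only the value of \(R\), so the functional form is an assumption of this example, as are the names used.

```python
import math


def exponential_utility(x, risk_tolerance):
    """Exponential utility U(x) = 1 - exp(-x / R), assumed form for this sketch."""
    return 1 - math.exp(-x / risk_tolerance)


R = 250
payoffs = [-200, -100, 0, 100, 200, 300, 400, 500]
utilities = {x: exponential_utility(x, R) for x in payoffs}

for x, u in utilities.items():
    print(f"{x:>5}: {u:.4f}")
```

With this form a payoff of $0 has utility 0, losses have negative utility, and each additional dollar adds less utility than the one before, which is the concave shape associated with a risk avoider.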

25. Consider a decision maker who is comfortable with an investment decision that has a 50% chance of earning $25,000 and a 50% chance of losing $12,500, but not with any larger investments that have the same relative payoffs.

a. Write the equation for the exponential function that approximates this decision maker’s utility function.

b. Plot the exponential utility function for this decision maker for x values between −20,000 and 35,000. Is this decision maker risk-seeking, risk-neutral, or risk-averse?

c. Suppose the decision maker decides that she would actually be willing to make an investment that has a 50% chance of earning $30,000 and a 50% chance of losing $15,000. Plot the exponential function that approximates this utility function and compare it to the utility function from part (b). Is the decision maker becoming more risk-seeking or more risk-averse?

CASE PROBLEM: PROPERTY PURCHASE STRATEGY

Glenn Foreman, president of Oceanview Development Corporation, is considering submitting a bid to purchase property that will be sold by sealed-bid auction at a county tax foreclosure. Glenn’s initial judgment is to submit a bid of $5 million. Based on his experience, Glenn estimates that a bid of $5 million will have a 0.2 probability of being the highest bid and securing the property for Oceanview. The current date is June 1. Sealed bids for the property must be submitted by August 15. The winning bid will be announced on September 1.

If Oceanview submits the highest bid and obtains the property, the firm plans to build and sell a complex of luxury condominiums. However, a complicating factor is that the property is currently zoned for single-family residences only. Glenn believes that a referendum could be placed on the voting ballot in time for the November election. Passage of the referendum would change the zoning of the property and permit construction of the condominiums.

The sealed-bid procedure requires the bid to be submitted with a certified check for 10% of the amount bid. If the bid is rejected, the deposit is refunded. If the bid is accepted, the deposit is the down payment for the property. However, if the bid is accepted and the bidder does not follow through with the purchase and meet the remainder of the financial obligation within six months, the deposit will be forfeited. In this case, the county will offer the property to the next highest bidder.

To determine whether Oceanview should submit the $5 million bid, Glenn conducted some preliminary analysis. This preliminary work provided an assessment of 0.3 for the probability that the referendum for a zoning change will be approved and resulted in the following estimates of the costs and revenues that will be incurred if the condominiums are built:

Costs and Revenue Estimates
Revenue from condominium sales    $15,000,000
Costs
  Property                        $5,000,000
  Construction expenses           $8,000,000

If Oceanview obtains the property and the zoning change is rejected in November, Glenn believes that the best option would be for the firm not to complete the purchase of the property. In this case, Oceanview would forfeit the 10% deposit that accompanied the bid.

Because the likelihood that the zoning referendum will be approved is such an important factor in the decision process, Glenn suggested that the firm hire a market research service to conduct a survey of voters. The survey would provide a better estimate of the likelihood that the referendum for a zoning change would be approved. The market research firm that Oceanview Development has worked with in the past has agreed to do the study for $15,000. The results of the study will be available August 1, so that Oceanview will have this information before the August 15 bid deadline. The results of the survey will be a prediction either that the zoning change will be approved or that the zoning change will be rejected. After considering the record of the market research service in previous studies conducted for Oceanview, Glenn developed the following probability estimates concerning the accuracy of the market research information:

\[ P(A \mid s_1) = 0.9 \qquad P(N \mid s_1) = 0.1 \]

\[ P(A \mid s_2) = 0.2 \qquad P(N \mid s_2) = 0.8 \]

where

A = prediction of zoning change approval
N = prediction that zoning change will not be approved
\(s_1\) = the zoning change is approved by the voters
\(s_2\) = the zoning change is rejected by the voters

Managerial Report

Perform an analysis of the problem facing the Oceanview Development Corporation, and prepare a report that summarizes your findings and recommendations. Include the following items in your report:

1. A decision tree that shows the logical sequence of the decision problem
2. A recommendation regarding what Oceanview should do if the market research information is not available
3. A decision strategy that Oceanview should follow if the market research is conducted
4. A recommendation as to whether Oceanview should employ the market research firm, along with the value of the information provided by the market research firm

Include the details of your analysis as an appendix to your report.

CONTENTS

A.1 USING MICROSOFT EXCEL
    Basic Spreadsheet Workbook Operations
    Creating, Saving, and Opening Files in Excel

A.2 SPREADSHEET BASICS
    Cells, References, and Formulas in Excel
    Finding the Right Excel Function
    Colon Notation
    Inserting a Function into a Worksheet Cell
    Using Relative versus Absolute Cell References

A.1 Using Microsoft Excel

Excel stores data and calculations in a file called a workbook, which contains one or more worksheets. Figure A.1 shows the layout of a blank workbook created in Excel 2016. The workbook is named Book1 and by default contains a worksheet named Sheet1.

The wide bar located across the top of the workbook is referred to as the Ribbon. Tabs, located at the top of the Ribbon, contain groups of related commands. By default, eight tabs are included on the Ribbon in Excel: File, Home, Insert, Page Layout, Formulas, Data, Review, and View. Loading additional packages (such as Analytic Solver or Acrobat) may create additional tabs. Each tab contains several groups of related commands. The File tab is used to Open, Save, and Print files as well as to change the Options being used by Excel and to load Add-ins. Note that the Home tab is selected when a workbook is opened. Figure A.2 displays the seven groups located in the Home tab: Clipboard, Font, Alignment, Number, Styles, Cells, and Editing. Commands are arranged within each group.

Depending on the settings for your particular installation of Excel, you may see additional worksheets labeled Sheet2, Sheet3, and so on.


Appendix A–Basics of Excel

FIGURE A.1 Blank Workbook in Excel

FIGURE A.2 Groups on the Home Tab in the Ribbon of an Excel Workbook

For example, to change selected text to boldface, click the Home tab and click the Bold button in the Font group. The other tabs in the Ribbon are used to modify data in your spreadsheet or to perform analysis.

Figure A.3 illustrates the location of the File tab, the Quick Access toolbar, and the Formula bar. The Quick Access toolbar allows you to quickly access commonly used workbook functions.

Keyboard shortcut: pressing Ctrl-B will change the font of the text in the selected cell to bold. We include a full list of keyboard shortcuts for Excel at the end of this appendix.

FIGURE A.3 File Tab, Quick Access toolbar, and Formula bar of an Excel Workbook

For instance, the Quick Access toolbar shown in Figure A.3 includes a Save button that can be used to save files without having to first click the File tab. To add or remove features on the Quick Access toolbar, click the Customize Quick Access toolbar button on the Quick Access toolbar.

The Formula bar contains a Name box, the Insert Function button fx, and a Formula box. In Figure A.3, “A1” appears in the Name box because cell A1 is selected. You can select any other cell in the worksheet by using the mouse to move the cursor to another cell and clicking or by typing the new cell location in the Name box and pressing the Enter key. The Formula box is used to display the formula in the currently selected cell. For instance, if you had entered =A1+A2 into cell A3, whenever you select cell A3, the formula =A1+A2 will be shown in the Formula box. This feature makes it very easy to see and edit a formula in a cell. The Insert Function button allows you to quickly access all of the functions available in Excel. Later, we show how to find and use a particular function with the Insert Function button.

Basic Spreadsheet Workbook Operations

To change the name of the current worksheet, we take the following steps:

Step 1. Right-click on the worksheet tab named Sheet1
Step 2. Select the Rename option
Step 3. Enter Nowlin to rename the worksheet and press Enter

You can create a copy of the newly renamed Nowlin worksheet by following these steps:

Step 1. Right-click the worksheet tab named Nowlin
Step 2. Select the Move or Copy… option
Step 3. When the Move or Copy dialog box appears, select the checkbox for Create a Copy, and click OK

The name of the copied worksheet will appear as “Nowlin (2).” You can then rename it, if desired, by following the steps outlined previously. Worksheets can also be moved to other workbooks or to a different position in the current workbook by using the Move or Copy option.

To create additional worksheets follow these steps:

Step 1. Right-click on the tab of any existing worksheet
Step 2. Select Insert…
Step 3. When the Insert dialog box appears, select Worksheet from the General area, and click OK

Worksheets can be deleted by right-clicking the worksheet tab and choosing Delete. After clicking Delete, a window may appear, warning you that any data appearing in the worksheet will be lost. Click Delete to confirm that you do want to delete the worksheet.

Creating, Saving, and Opening Files in Excel

To illustrate manually entering, saving, and opening a file, we will use the Nowlin Plastics make-versus-buy model from Chapter 10. The objective is to determine whether Nowlin should manufacture or outsource production for its Viper product next year. Nowlin must pay a fixed cost of $234,000 and a variable cost per unit of $2 to manufacture the product. Nowlin can outsource production for $3.50 per unit.
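For readers who want to check the arithmetic behind the three Nowlin parameters before building the worksheet, the make-versus-buy comparison can be sketched outside Excel. The Python below is only an illustration under stated assumptions: the production quantity `q` is a hypothetical input not given in this section, and the function names are invented for the example.

```python
# Nowlin Plastics parameters from the text.
FIXED_COST = 234_000      # manufacturing fixed cost ($)
VARIABLE_COST = 2.00      # manufacturing variable cost per unit ($)
OUTSOURCE_COST = 3.50     # outsourcing cost per unit ($)


def manufacturing_cost(q):
    """Total cost of manufacturing q units in-house."""
    return FIXED_COST + VARIABLE_COST * q


def outsourcing_cost(q):
    """Total cost of purchasing q units from the outside supplier."""
    return OUTSOURCE_COST * q


# Quantity at which the two options cost the same:
# FIXED_COST + VARIABLE_COST * q = OUTSOURCE_COST * q
break_even = FIXED_COST / (OUTSOURCE_COST - VARIABLE_COST)

q = 10_000  # hypothetical production quantity, for illustration only
print(f"Manufacture {q} units: ${manufacturing_cost(q):,.2f}")
print(f"Outsource {q} units:   ${outsourcing_cost(q):,.2f}")
print(f"Break-even quantity:   {break_even:,.0f} units")
```

Below the break-even quantity outsourcing is cheaper; above it, the fixed cost is spread over enough units that manufacturing is cheaper.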

We begin by assuming that Excel is open and a blank worksheet is displayed. The Nowlin data can now be entered manually by simply typing the manufacturing fixed cost of $234,000, the variable cost of $2, and the outsourcing cost of $3.50 into the worksheet.

We will place the data for the Nowlin example in the top portion of Sheet1 of the new workbook. First, we enter the label Nowlin Plastics in cell A1 and click the Bold button in the Font group. Next, we enter the label Parameters and click on the Bold button in the Font group. To identify each of the three data values, we enter the label Manufacturing Fixed Cost in cell A4, the label Manufacturing Variable Cost per Unit in cell A5, and the label Outsourcing Cost per Unit in cell A7. Next, we enter the actual data into the corresponding cells in column B: the value of $234,000 in cell B4; the value of $2 in cell B5; and the value of $3.50 in cell B7. Figure A.4 shows a portion of the worksheet we have just developed.

New worksheets can also be created using the insert worksheet button at the bottom of the screen.

Before we begin the development of the model portion of the worksheet, we recommend that you first save the current file; this will prevent you from having to reenter the data in case something happens that causes Excel to close. To save the workbook using the filename Nowlin, we perform the following steps:

Step 1. Click the File tab on the Ribbon
Step 2. Click Save in the list of options
Step 3. Select This PC under Save As, and click Browse
Step 4. When the Save As dialog box appears:
    Select the location where you want to save the file
    Enter the filename Nowlin in the File name: box
    Click Save

Excel’s Save command is designed to save the file as an Excel workbook. As you work with and build models in Excel, you should follow the practice of periodically saving the file so that you will not lose any work. After you have saved your file for the first time, the Save command will overwrite the existing version of the file, and you will not have to perform Steps 3 and 4.

Sometimes you may want to create a copy of an existing file. For instance, suppose you change one or more of the data values and would like to save the modified file using the filename NowlinMod. The following steps show how to save the modified workbook using filename NowlinMod:

Step 1. Click the File tab in the Ribbon
Step 2. Click Save As in the list of options
Step 3. Select This PC under Save As, and click Browse
Step 4. When the Save As dialog box appears:
    Select the location where you want to save the file
    Type the filename NowlinMod in the File name: box
    Click Save

Once the NowlinMod workbook has been saved, you can continue to work with the file to perform whatever type of analysis is appropriate. When you are finished working with the file, simply click the close-window button located at the top right-hand corner of the Ribbon.

Keyboard shortcut: To save the file, press Ctrl-S.

FIGURE A.4 Nowlin Plastics Data

[Worksheet image: column A holds the labels Nowlin Plastics, Parameters, Manufacturing Fixed Cost, Manufacturing Variable Cost per Unit, and Outsourcing Cost per Unit; column B holds the values $234,000.00 (cell B4), $2.00 (cell B5), and $3.50 (cell B7).]


Later, you can easily access a previously saved file. For example, the following steps show how to open the previously saved Nowlin workbook:

Step 1. Click the File tab in the Ribbon
Step 2. Click Open in the list of options
Step 3. Select This PC under Open and click Browse
Step 4. When the Open dialog box appears:
    Find the location where you previously saved the Nowlin file
    Click on the filename Nowlin so that it appears in the File name: box
    Click Open

A.2 Spreadsheet Basics

Cells, References, and Formulas in Excel

We begin by assuming that the Nowlin workbook is open again and that we would like to develop a model that can be used to compute the manufacturing and outsourcing cost given a certain required volume. We develop the model based on the data in the worksheet shown in Figure A.4. The model will contain formulas that refer to the location of the data cells in the upper section of the worksheet. By putting the location of the data cells in the formula, we will build a model that can be easily updated with new data.

To provide a visual reminder that the bottom portion of this worksheet will contain the model, we enter the label Model into cell A10 and press the Bold button in the Font group. In cell A11, we enter the label Quantity. Next, we enter the labels Total Cost to Produce in cell A13, Total Cost to Outsource in cell A15, and Savings due to Outsourcing in cell A17.

In cell B11 we enter 10000 to represent the quantity produced or outsourced by Nowlin Plastics. We will now enter formulas in cells B13, B15, and B17 that use the quantity in cell B11 to compute the values for production cost, outsourcing cost, and savings from outsourcing. The total cost to produce is the sum of the manufacturing fixed cost (cell B4) and the manufacturing variable cost. The manufacturing variable cost is the product of the production volume (cell B11) and the variable cost per unit (cell B5). Thus, the formula for total variable cost is B11*B5; to compute the value of total cost, we enter the formula =B4+B11*B5 in cell B13. Next, total cost to outsource is the product of the outsourcing cost per unit (cell B7) and the quantity (cell B11); this is computed by entering the formula =B7*B11 in cell B15. Finally, the savings due to outsourcing is computed by subtracting the cost of outsourcing (cell B15) from the production cost (cell B13). Thus, in cell B17 we enter the formula =B13-B15. Figure A.5 shows the Excel worksheet values and formulas used for these calculations.

We can now compute the savings due to outsourcing by entering a value for the quantity to be manufactured or outsourced in cell B11. Figure A.5 shows the results after entering a value of 10,000 in cell B11. We see that a quantity of 10,000 units results in a production cost of $254,000 and outsourcing cost of $35,000. Thus, the savings due to outsourcing is $219,000.
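The same make-versus-buy arithmetic can be sketched outside Excel. The Python snippet below is only an illustration of the three worksheet formulas; the function and variable names are our own and do not appear in the Nowlin workbook.

```python
# Parameters from the Nowlin worksheet (cells B4, B5, and B7).
fixed_cost = 234_000.0             # manufacturing fixed cost (B4)
variable_cost_per_unit = 2.0       # manufacturing variable cost per unit (B5)
outsourcing_cost_per_unit = 3.5    # outsourcing cost per unit (B7)


def nowlin_model(quantity):
    """Return (cost to produce, cost to outsource, savings) for a quantity (B11)."""
    total_cost_to_produce = fixed_cost + quantity * variable_cost_per_unit  # =B4+B11*B5
    total_cost_to_outsource = outsourcing_cost_per_unit * quantity          # =B7*B11
    savings = total_cost_to_produce - total_cost_to_outsource               # =B13-B15
    return total_cost_to_produce, total_cost_to_outsource, savings


produce, outsource, savings = nowlin_model(10_000)
print(produce, outsource, savings)  # 254000.0 35000.0 219000.0
```

As in the worksheet, changing the quantity passed to the function updates all three results without retyping any formula.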

Finding the Right Excel Function

Excel provides a variety of built-in formulas or functions for developing mathematical models. If we know which function is needed and how to use it, we can simply enter the function into the appropriate worksheet cell. However, if we are not sure which functions are available to accomplish a task or are not sure how to use a particular function, Excel can provide assistance.

To identify the functions available in Excel, click the Insert Function button on the Formula bar; this opens the Insert Function dialog box shown in Figure A.6. The Search for a function: box at the top of the dialog box enables us to type a brief description of what we want to do. After entering a description and clicking Go, Excel will search for and display the functions that may accomplish our task in the Select a function: box. In many situations, however, we may want to browse through an entire category of functions to see what is available. For this task, the Or select a category: box is helpful. It contains a drop-down list of several categories of functions provided by Excel. Figure A.6 shows that we selected the Math & Trig category. As a result, Excel’s Math & Trig functions appear in alphabetical order in the Select a function: area. We see the ABS function listed first, followed by the ACOS function, and so on.

To display all formulas in the cells of a worksheet, hold down the Ctrl key and then press the ~ key (usually located above the Tab key).

Colon Notation

Although many functions, such as the ABS function, have a single argument, some Excel functions depend on arrays. Colon notation provides an efficient way to convey arrays and matrices of cells to functions. For example, the colon notation B1:B5 means cell B1 “through” cell B5, namely the array of values stored in the locations (B1,B2,B3,B4,B5).

The ABS function calculates the absolute value of a number. The ACOS function calculates the arccosine of a number.


FIGURE A.5 Nowlin Plastics Data and Model

[Worksheet image shown in formula view and in value view. Formulas: cell B13 contains =B4+B11*B5, cell B15 contains =B7*B11, and cell B17 contains =B13-B15. Values with 10,000 in cell B11: Total Cost to Produce $254,000.00, Total Cost to Outsource $35,000.00, and Savings due to Outsourcing $219,000.00. DATAfile: Nowlin]

Consider, for example, the function =SUM(B1:B5). The SUM function adds up the elements contained in the function’s argument. Hence, =SUM(B1:B5) evaluates the following formula:

=B1+B2+B3+B4+B5

To illustrate the use of colon notation, we will consider the financial data for Nowlin Plastics contained in the DATAfile NowlinFinancial and shown in Figure A.7. Column A contains the name of each month, column B the revenue for each month, and column C the cost data. In row 15, we compute the total revenues and costs for the year. To do this we first enter Total: in cell A15. Next, we enter the formula =SUM(B2:B13) in cell B15 and =SUM(C2:C13) in cell C15. This shows that the total revenues for the company are $39,319,000 and the total costs are $36,549,000.

Inserting a Function into a Worksheet Cell

Continuing with the Nowlin financial data, we will now show how to use the Insert Function and Function Arguments dialog boxes to select a function, develop its arguments, and insert the function into a worksheet cell. We wish to calculate the average monthly revenue and cost at Nowlin. To do so, we execute the following steps:

Step 1. Select cell B17 in the DATAfile NowlinFinancial
Step 2. Click the Insert Function button
    Select Statistical in the Or select a category: box
    Select AVERAGE from the Select a function: options
Step 3. When the Function Arguments dialog box appears:
    Enter B2:B13 in the Number1 box
    Click OK
Step 4. Repeat Steps 1 through 3 for the cost data in column C

Figure A.7 shows that the average monthly revenue is $3,276,583 and the average monthly cost is $3,045,750.
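The totals and averages reported for Figure A.7 can be checked with a short Python sketch. The two lists below are the monthly values from the figure, assumed to run from January through December; the variable names are our own.

```python
# Monthly revenue (B2:B13) and cost (C2:C13), January through December.
revenue = [2873000, 3459000, 3195000, 3436000, 2845000, 2925000,
           3682000, 3410000, 3782000, 3548000, 3028000, 3136000]
cost = [2640000, 3250000, 3021000, 3240000, 2803000, 3015000,
        3150000, 3185000, 3237000, 3196000, 2815000, 2997000]

total_revenue = sum(revenue)            # counterpart of =SUM(B2:B13)
total_cost = sum(cost)                  # counterpart of =SUM(C2:C13)
avg_revenue = sum(revenue) / len(revenue)  # counterpart of =AVERAGE(B2:B13)
avg_cost = sum(cost) / len(cost)           # counterpart of =AVERAGE(C2:C13)

print(total_revenue, total_cost)        # 39319000 36549000
print(round(avg_revenue), round(avg_cost))  # 3276583 3045750
```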

If you need additional guidance on the use of a particular function in Excel, the Function Arguments dialog box contains a link, Help on this function.

FIGURE A.6 Insert Function Dialog Box


Using Relative versus Absolute Cell References

One of the most powerful abilities of spreadsheet software such as Excel is the ability to use relative references in formulas. Use of a relative reference allows the user to enter a formula once into Excel and then copy and paste that formula to other places so that the formula will update with the correct data without having to retype the formula. We will demonstrate the use of relative references in Excel by calculating the monthly profit at Nowlin Plastics using the following steps:

Step 1. Enter the label Profit in cell D1 and press the Bold button in the Font group of the Home tab
Step 2. Enter the formula =B2-C2 in cell D2
Step 3. Copy the formula from cell D2 by selecting cell D2 and clicking Copy from the Clipboard group of the Home tab

After completing Step 2, a shortcut to copying the formula to the range D3:D13 is to place the pointer in the bottom-right corner of cell D2 and then double-click.

Keyboard shortcut: You can copy in Excel by pressing Ctrl-C. You can paste in Excel by pressing Ctrl-V.

FIGURE A.7 Nowlin Plastics Monthly Revenues and Costs (DATAfile: NowlinFinancial)

Row   A (Month)    B (Revenue)    C (Cost)
1     Month        Revenue        Cost
2     January      $2,873,000     $2,640,000
3     February     $3,459,000     $3,250,000
4     March        $3,195,000     $3,021,000
5     April        $3,436,000     $3,240,000
6     May          $2,845,000     $2,803,000
7     June         $2,925,000     $3,015,000
8     July         $3,682,000     $3,150,000
9     August       $3,410,000     $3,185,000
10    September    $3,782,000     $3,237,000
11    October      $3,548,000     $3,196,000
12    November     $3,028,000     $2,815,000
13    December     $3,136,000     $2,997,000
15    Total:       $39,319,000    $36,549,000
17    Average:     $3,276,583     $3,045,750

Formulas: B15 =SUM(B2:B13), C15 =SUM(C2:C13), B17 =AVERAGE(B2:B13), C17 =AVERAGE(C2:C13).


Step 4. Select cells D3:D13
Step 5. Paste the formula from cell D2 by clicking Paste from the Clipboard group of the Home tab

The result of these steps is shown in Figure A.8, where we have calculated the profit for each month. Note that even though the only formula we entered was =B2-C2 in cell D2, the formulas in cells D3 through D13 have been updated correctly to calculate the profit of each month using that month’s revenue and cost.
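A rough analogue of copying =B2-C2 down column D is to apply one rule to every row of data. The Python sketch below does this for the Figure A.7 values (assumed to be in January-to-December order); the variable names are our own.

```python
# Monthly revenue (column B) and cost (column C), January through December.
revenue = [2873000, 3459000, 3195000, 3436000, 2845000, 2925000,
           3682000, 3410000, 3782000, 3548000, 3028000, 3136000]
cost = [2640000, 3250000, 3021000, 3240000, 2803000, 3015000,
        3150000, 3185000, 3237000, 3196000, 2815000, 2997000]

# One rule, revenue minus cost, applied row by row, much as the relative
# references in =B2-C2 become =B3-C3, =B4-C4, and so on when copied down.
profit = [r - c for r, c in zip(revenue, cost)]

print(profit[0])   # January profit: 233000
print(profit[5])   # June profit (a loss): -90000
```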

In some situations, however, we do not want to use relative referencing in formulas. The alternative is to use an absolute reference, which we indicate to Excel by putting “$” before the row and/or column locations of the cell location. An absolute reference does not update to a new cell reference when the formula is copied to another location. We illustrate the use of an absolute reference by continuing to use the Nowlin financial data. Nowlin calculates an after-tax profit each month by multiplying its actual monthly profit by one minus its tax rate, which is currently estimated to be 30%. Cell B19 in Figure A.9 contains this tax rate. In column E, we calculate the after-tax profit for Nowlin in each month by using the following steps:

In some cases, you may want Excel to use relative referencing for either the column or row location and absolute referencing for the other. For instance, to force Excel to always refer to column A but use relative referencing for the row, you would enter =$A1 into, say, cell B1. If this formula is copied into cell C3, the updated formula would be =$A3 (whereas it would be updated to =B3 if relative referencing was used for both the column and the row location).

FIGURE A.8 Nowlin Plastics Profit Calculation

[Worksheet image shown in formula view and in value view. Column D holds the label Profit in cell D1 and the formulas =B2-C2 through =B13-C13 in cells D2:D13. The resulting monthly profits, January through December, are $233,000, $209,000, $174,000, $196,000, $42,000, $(90,000), $532,000, $225,000, $545,000, $352,000, $213,000, and $139,000.]


Step 1. Enter the label After-Tax Profit in cell E1 and press the Bold button in the Font group of the Home tab
Step 2. Enter the formula =D2*(1-$B$19) in cell E2
Step 3. Copy the formula from cell E2 by selecting cell E2 and clicking Copy from the Clipboard group of the Home tab
Step 4. Select cells E3:E13
Step 5. Paste the formula from cell E2 by clicking Paste from the Clipboard group of the Home tab

Figure A.9 shows the after-tax profit in each month. Using $B$19 in the formula in cell E2 forces Excel to always refer to cell $B$19, even if we copy and paste this formula somewhere else in our worksheet. Notice that D2 continues to be a relative reference and is updated to D3, D4, and so on when we copy this formula to cells E3, E4, and so on, respectively.
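To make the copying rule concrete, the following Python sketch mimics how references change when a formula is copied a given number of rows and columns. It is a simplified illustration, not Excel's actual implementation: the function names are our own, and the pattern only handles plain A1-style references (it would, for example, misread a function name such as LOG10 as a cell).

```python
import re

# Optional "$", column letters, optional "$", row number (for example $B$19 or D2).
CELL_REF = re.compile(r"(\$?)([A-Z]{1,3})(\$?)(\d+)")


def col_to_num(col):
    """Convert column letters to a number (A -> 1, B -> 2, ..., AA -> 27)."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def num_to_col(n):
    """Convert a column number back to letters (1 -> A, 27 -> AA)."""
    letters = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def copy_formula(formula, row_offset, col_offset):
    """Shift every relative row/column; parts prefixed with "$" stay fixed."""
    def shift(match):
        col_abs, col, row_abs, row = match.groups()
        if not col_abs:
            col = num_to_col(col_to_num(col) + col_offset)
        if not row_abs:
            row = str(int(row) + row_offset)
        return f"{col_abs}{col}{row_abs}{row}"
    return CELL_REF.sub(shift, formula)


print(copy_formula("=D2*(1-$B$19)", 1, 0))  # E2 copied to E3: =D3*(1-$B$19)
print(copy_formula("=$A1", 2, 1))           # B1 copied to C3: =$A3
print(copy_formula("=A1", 2, 1))            # B1 copied to C3: =B3
```

The offsets are the number of rows and columns between the source cell and the destination cell, so copying from E2 to E3 is a row offset of 1 and a column offset of 0.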

FIGURE A.9 Nowlin Plastics After-Tax Profit Calculation Illustrating Relative versus Absolute References (DATAfile: NowlinFinancialComplete)

Row   A (Month)    B (Revenue)    C (Cost)       D (Profit)    E (After-Tax Profit)
1     Month        Revenue        Cost           Profit        After-Tax Profit
2     January      $2,873,000     $2,640,000     $233,000      $163,100
3     February     $3,459,000     $3,250,000     $209,000      $146,300
4     March        $3,195,000     $3,021,000     $174,000      $121,800
5     April        $3,436,000     $3,240,000     $196,000      $137,200
6     May          $2,845,000     $2,803,000     $42,000       $29,400
7     June         $2,925,000     $3,015,000     $(90,000)     $(63,000)
8     July         $3,682,000     $3,150,000     $532,000      $372,400
9     August       $3,410,000     $3,185,000     $225,000      $157,500
10    September    $3,782,000     $3,237,000     $545,000      $381,500
11    October      $3,548,000     $3,196,000     $352,000      $246,400
12    November     $3,028,000     $2,815,000     $213,000      $149,100
13    December     $3,136,000     $2,997,000     $139,000      $97,300
15    Total:       $39,319,000    $36,549,000
17    Average:     $3,276,583     $3,045,750
19    Tax Rate:    30%

Formulas: D2:D13 contain =B2-C2 through =B13-C13; E2:E13 contain =D2*(1-$B$19) through =D13*(1-$B$19); B15 =SUM(B2:B13), C15 =SUM(C2:C13), B17 =AVERAGE(B2:B13), C17 =AVERAGE(C2:C13); cell B19 contains 0.3.

SUMMARY

In this appendix we have reviewed the basics of using Microsoft Excel. We have discussed the basic layout of Excel, file creation, saving, and editing as well as how to reference cells, use formulas, and use the copy and paste functions in an Excel worksheet. We have illustrated how to find and enter Excel functions and described the difference between relative and absolute cell references. In Chapter 10, we give a detailed treatment of how to create more advanced business analytics models in Excel. We conclude this appendix with Table A.1, which shows commonly used keyboard shortcut keys in Excel. Keyboard shortcut keys can save considerable time when entering data into Excel.

GLOSSARY

Absolute reference: A reference to a cell location in an Excel worksheet formula that does not update according to its relative position when the formula is copied.
Colon notation: Notation used in an Excel worksheet to denote “through.” For example, =SUM(B1:B4) implies sum cells B1 through B4, or equivalently, B1+B2+B3+B4.
Relative reference: A reference to a cell location in an Excel worksheet formula that updates according to its relative position when the formula is copied.
Workbook: An Excel file that contains a series of worksheets.
Worksheet: A single page in an Excel workbook containing a matrix of cells defined by their column and row locations.


TABLE A.1 Keyboard Shortcut Keys in Excel

Keyboard Shortcut Key                        Task Description
Ctrl-S                                       Save
Ctrl-C                                       Copy
Ctrl-V                                       Paste
Ctrl-F                                       Find (can be used to find text both within a cell and within a formula in Excel)
Ctrl-P                                       Print
Ctrl-A                                       Selects all cells in the current data region
Ctrl-B                                       Changes the selected text to/from bold font
Ctrl-I                                       Changes the selected text to/from italic font
Ctrl-~ (usually located above the Tab key)   Toggles between displaying values and formulas in the worksheet
Ctrl-↓ (down arrow key)                      Moves to the bottom-most cell of the current data region
Ctrl-↑ (up arrow key)                        Moves to the top-most cell of the current data region
Ctrl-→ (right arrow key)                     Moves to the right-most cell of the current data region
Ctrl-← (left arrow key)                      Moves to the left-most cell of the current data region
Ctrl-Home                                    Moves to the top-left-most cell of the current data region
Ctrl-End                                     Moves to the bottom-right-most cell of the current data region
Shift-↓                                      Selects the current cell and the cell below
Shift-↑                                      Selects the current cell and the cell above
Shift-→                                      Selects the current cell and the cell to the right
Shift-←                                      Selects the current cell and the cell to the left
Ctrl-Shift-↓                                 Selects all cells from the current cell to the bottom-most cell of the data region
Ctrl-Shift-↑                                 Selects all cells from the current cell to the top-most cell of the data region
Ctrl-Shift-→                                 Selects all cells from the current cell to the right-most cell in the data region
Ctrl-Shift-←                                 Selects all cells from the current cell to the left-most cell in the data region
Ctrl-Shift-Home                              Selects all cells from the current cell to the top-left-most cell in the data region
Ctrl-Shift-End                               Selects all cells from the current cell to the bottom-right-most cell in the data region
Ctrl-Spacebar                                Selects the entire current column
Shift-Spacebar                               Selects the entire current row

A data region refers to all adjacent cells that contain data in an Excel worksheet.

Holding down the Ctrl key and clicking on multiple cells allows you to select multiple nonadjacent cells. Holding down the Shift key and clicking on two nonadjacent cells selects all cells between the two cells.


Appendix B–Database Basics with Microsoft Access

CONTENTS

B.1 DATABASE BASICS
    Considerations When Designing a Database
    Creating a Database in Access
B.2 CREATING RELATIONSHIPS BETWEEN TABLES IN MICROSOFT ACCESS
B.3 SORTING AND FILTERING RECORDS
B.4 QUERIES
    Select Queries
    Action Queries
    Crosstab Queries
B.5 SAVING DATA TO EXTERNAL FILES

Data are the cornerstone of analytics; without accurate and timely data on relevant aspects of a business or organization, analytic techniques are useless, and the resulting analyses are meaningless (or worse yet, potentially misleading). The data used by organizations to make decisions are not static, but rather are dynamic and constantly changing, usually at a rapid pace. Every change or addition to a database represents a new opportunity to introduce errors into the data, so it is important to be capable of searching for duplicate entries or entries with errors. Furthermore, related data may be stored in different locations to simplify data entry or increase security. Because an analysis frequently requires information from several sets of data, an analyst must be able to efficiently combine information from multiple data sets in a logical manner. In this appendix, we will review tools in Microsoft Access® that can be used for these purposes.

B.1 Database Basics

A database is a collection of logically related data that can be retrieved, manipulated, and updated to meet a user’s or organization’s needs. By providing centralized access to data efficiently and consistently, a database serves as an electronic warehouse of information on some specific aspect of an organization. A database allows for the systematic accumulation, management, storage, retrieval, and analysis of the information it contains while reducing inaccuracies that routinely result from manual record keeping. Organizations of all sizes maintain databases that contain information about their customers, markets, suppliers, and employees. Before embarking on designing a database, it is important to consider the characteristics of a good database. Foremost, the information in a database should be correct and complete so that decisions based on reports retrieved from the database will be based on accurate information. Second, a database should avoid duplicate information as much as possible in order to minimize wasted space and reduce the likelihood of errors and inconsistencies. Thus, a good database design

• divides the organization's information into subject-based tables to reduce redundant data without loss of information.
• provides the organization's database software with the information required to join information in tables together as needed.
• supports, maintains, and ensures the integrity and accuracy of the organization's information.
• avoids tables that have large numbers of entries with empty attributes.
• protects the organization's information through database security.
• accommodates the organization's data processing and reporting needs.


Throughout this appendix, we will consider issues that arise in the creation and maintenance of a database for Stinson’s MicroBrew Distributor, a licensed regional independent distributor of beer and a member of the National Beer Wholesalers Association. Stinson’s provides refrigerated storage, transportation, and delivery of premium beers produced by several local microbreweries, so the company’s facilities include a state-of-the-art temperature-controlled warehouse and a fleet of temperature-controlled trucks. Stinson’s also employs sales, receiving, warehousing/inventory, and delivery personnel. When making a delivery, Stinson’s monitors the retailer’s shelves, taps, and keg lines to ensure the freshness and quality of the product. Because beer is perishable and because microbreweries often do not have the capacity to store, transport, and deliver large quantities of the products they produce, Stinson’s holds a critical position in this supply chain.

Stinson’s needs to develop a faster, more efficient, and more accurate means of recording, maintaining, and retrieving data related to various aspects of its business. The company’s management team has identified three broad key areas of data management: personnel (information on Stinson’s employees); supplier (information on purchases of beer made by Stinson’s from its suppliers); and retailer (information on sales to Stinson’s retail customers). We will use Microsoft Access 2016 in designing Stinson’s database. Access is a relational database management system (RDBMS), which is the most commonly used type of database system in business. Data in a relational database are stored in tables, which are the fundamental components of a database. A relational database allows the user to retrieve subsets of data from tables and retrieve and combine data that are stored in related tables.

In this section we will learn how to use Access to create a database and perform some basic database operations. In Access, a database is defined as a collection of related objects that are saved as a single file. An object in Access can be a:

• Table: Data arrayed in rows and columns (similar to a worksheet in an Excel spreadsheet) in which rows correspond to records (the individual units from which the data have been collected) and columns correspond to fields (the variables on which data have been collected from the records)
• Form: An object that is created from a table to simplify the process of entering data
• Query: A question posed by a user about the data in the database
• Report: Output from a table or a query that has been put into a specific prespecified format

We will focus on tables and queries in this appendix. You can refer to a wide variety of books on database design to learn about forms, reports, and other database objects.

Tables are the foundation of an Access database. Each field in a table has a data type. The most commonly used are as follows:

• Short Text: A field that contains words (such as the field Gender that may be used to record whether a Stinson’s employee is female or male); can contain no more than 255 alphanumeric characters
• Long Text: A larger field that contains words and is generally used for recording lengthy descriptive entries (such as the field Notes on Special Circumstances for a Transaction that may be used to record detailed notes about unique aspects of specific transactions between Stinson’s and its retail customers); can contain up to approximately 1 gigabyte, but controls to display a long text are limited to the first 64,000 characters
• Number: A field that contains numerical values. There are several sizes of Number fields, which include:
    • Byte: Stores whole numbers from 0 to 255
    • Decimal: Stores numbers from $-10^{28} + 1$ to $10^{28} - 1$
    • Integer: Stores nonfractional numbers from −32,768 to 32,767
    • Long Integer: Stores nonfractional numbers from −2,147,483,648 to 2,147,483,647
    • Single: Stores numbers from $-3.402823 \times 10^{38}$ to $3.402823 \times 10^{38}$
    • Double: Stores numbers from $-1.79769313 \times 10^{308}$ to $1.79769313 \times 10^{308}$
• Currency: A field that contains monetary values (such as the field Transaction Amount that may be used to record payments for goods that have been ordered by Stinson’s retail customers)
• Yes/No: A field that contains binary variables (such as the field Sunday Deliveries? that may be used to record whether Stinson’s retail customers accept deliveries on Sundays)
• Date/Time: A field that contains dates and times (such as the field Date of Order that may be used to record the date of an order placed by Stinson’s with one of its suppliers)

Microsoft Access 2016 is virtually identical to Microsoft Access 2013 and 2010, so the instructions provided in this appendix also apply to Access 2013 and Access 2010.

In versions of Access prior to Access 2013 the Long Text field type is referred to as the Memo field type.

Once you create a field and set its data type, you can set additional field properties. For example, for a numerical field you can define the data size to be Byte, Integer, Long Integer, Single, Double, Replication ID, or Decimal.
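The limits quoted for the whole-number field sizes follow from how many bits each size uses. As a small sketch, assuming the usual storage sizes of 1 byte for Byte (unsigned), 2 bytes for Integer, and 4 bytes for Long Integer (both signed), the ranges can be derived in Python; the helper names are our own.

```python
def unsigned_range(bits):
    """Smallest and largest whole numbers an unsigned field of this width can hold."""
    return 0, 2**bits - 1


def signed_range(bits):
    """Smallest and largest whole numbers a signed (two's complement) field can hold."""
    return -(2**(bits - 1)), 2**(bits - 1) - 1


print("Byte:", unsigned_range(8))         # (0, 255)
print("Integer:", signed_range(16))       # (-32768, 32767)
print("Long Integer:", signed_range(32))  # (-2147483648, 2147483647)
```

Choosing the smallest size that safely covers the values a field will hold keeps the table compact, while a value outside the range cannot be stored in that field.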

A database may consist of several tables that are maintained separately for a variety of reasons. We have already mentioned that Stinson’s maintains information on its personnel, its suppliers and orders, and its retail customers and sales. With regard to its retail customers, Stinson’s may maintain information on the company name, street address, city, state, zip code, telephone number, and e-mail address; the dates of orders placed and quantities ordered; and the dates of actual deliveries and quantities delivered. In this example, we may consider establishing a table on Stinson’s retail customers; in this table each record corresponds to a retail customer, and the fields include the retail customer’s company name, street address, city, state, zip code, telephone number, and e-mail address. Maintenance of this table is relatively simple; these data likely are not updated frequently for existing retail customers, and when Stinson’s begins selling to a new retail customer, it has to establish only a single new record containing the information for the new retail customer in each field.

Stinson’s may maintain other tables in this database. To track purchases made by its retail customers, the company may maintain a table of retail orders that includes the retail customer’s name and the dollar value, date, and number of kegs and cases of beer for each order received by Stinson’s. Because this table contains one record for each order placed with Stinson’s, this table must be updated much more frequently than the table of information on Stinson’s retail customers.

A user who submits a query is effectively asking a question about the information in one or more tables in a database. For example, suppose Stinson’s has determined that it has surplus kegs of Fine Pembrook Ale in inventory and is concerned about potential spoilage. As a result, the Marketing Department decides to identify all retail customers who have ordered kegs of Fine Pembrook Ale during the previous three months so that Stinson’s can call these retailers and offer them a discounted price on additional kegs of this beer. A query could be designed to search the Retail Orders table for retail customers who meet this criterion. When the query is run, the output of the query provides the answer.
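A query of this kind can be sketched in SQL. The example below uses Python's built-in sqlite3 module rather than Access, and everything in it is hypothetical: the table layout, the column names, the retailer names, the dates, and the cutoff date standing in for "three months ago" are made up purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE RetailOrders "
    "(CustomerName TEXT, Beer TEXT, Kegs INTEGER, OrderDate TEXT)"
)
# Hypothetical sample orders (not Stinson's actual data).
conn.executemany(
    "INSERT INTO RetailOrders VALUES (?, ?, ?, ?)",
    [
        ("Retailer A", "Fine Pembrook Ale", 4, "2024-05-10"),
        ("Retailer B", "Fine Pembrook Ale", 2, "2024-01-15"),
        ("Retailer C", "Another Beer", 6, "2024-05-20"),
        ("Retailer A", "Fine Pembrook Ale", 1, "2024-06-01"),
    ],
)

cutoff = "2024-03-15"  # assumed start of the three-month window
rows = conn.execute(
    "SELECT DISTINCT CustomerName FROM RetailOrders "
    "WHERE Beer = ? AND Kegs > 0 AND OrderDate >= ? "
    "ORDER BY CustomerName",
    ("Fine Pembrook Ale", cutoff),
).fetchall()
matching_customers = [row[0] for row in rows]
print(matching_customers)  # ['Retailer A']
```

Retailer B is excluded because its order falls before the cutoff, and Retailer C is excluded because it ordered a different beer.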

More complex queries may require data to be retrieved from multiple tables. For these queries, the tables must be connected by a join operation that links the records of the tables by their values in some common field. The common field serves as a bridge between the two tables, and the bridged tables are then treated by the query as a large single table comprising the fields of the original tables that have been joined. In designing a database for Stinson’s, we may include the customer ID as a field in both the table of retail customers and the table of retail orders; values in the field customer ID would then provide the basis for linking records in these two tables. Thus, even though the table of retail orders does not contain the information on each of Stinson’s retail customers that is contained in the table of Stinson’s retail customers, if the database is well designed, the information in these two tables can easily be combined whenever necessary.
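The join described above can be illustrated with a short SQL sketch, again run through Python's sqlite3 module rather than Access. The table names, column names, and rows are hypothetical; only the idea of linking the two tables on a shared customer ID comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE RetailCustomers (CustomerID INTEGER, CompanyName TEXT, City TEXT)"
)
conn.execute(
    "CREATE TABLE RetailOrders (OrderID INTEGER, CustomerID INTEGER, Kegs INTEGER)"
)
# Hypothetical sample rows for illustration only.
conn.executemany(
    "INSERT INTO RetailCustomers VALUES (?, ?, ?)",
    [(1, "Retailer A", "City X"), (2, "Retailer B", "City Y")],
)
conn.executemany(
    "INSERT INTO RetailOrders VALUES (?, ?, ?)",
    [(101, 1, 4), (102, 2, 2), (103, 1, 1)],
)

# CustomerID is the common field that bridges the two tables.
joined = conn.execute(
    "SELECT o.OrderID, c.CompanyName, c.City, o.Kegs "
    "FROM RetailOrders AS o "
    "INNER JOIN RetailCustomers AS c ON o.CustomerID = c.CustomerID "
    "ORDER BY o.OrderID"
).fetchall()
for row in joined:
    print(row)
```

Each order row is returned together with the company name and city of the customer who placed it, even though the orders table stores only the customer ID.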

A Replication ID field is used for storing a globally unique identifier to prevent duplication of an identifier (such as customer number) when multiple copies of the same database are in use in different locations.

In addition to answering a user’s questions about the data in one or more tables, a query can also be used to add a record to the end of a table, delete a record from a table, or change the values for one or more records in a table. These functions are accomplished through append, delete, and update queries. We discuss queries in more detail later in this appendix.
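In SQL terms, append, delete, and update queries correspond roughly to INSERT, DELETE, and UPDATE statements. The sketch below shows one of each using Python's sqlite3 module; the RetailOrders table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE RetailOrders (OrderNumber INTEGER PRIMARY KEY, CustomerID INTEGER, Kegs INTEGER)"
)
# Made-up starting records.
conn.executemany("INSERT INTO RetailOrders VALUES (?, ?, ?)", [(101, 1, 3), (102, 2, 1)])

# Append query: add a new record to the table.
conn.execute("INSERT INTO RetailOrders VALUES (?, ?, ?)", (103, 1, 2))

# Update query: change a value in an existing record.
conn.execute("UPDATE RetailOrders SET Kegs = ? WHERE OrderNumber = ?", (5, 101))

# Delete query: remove a record from the table.
conn.execute("DELETE FROM RetailOrders WHERE OrderNumber = ?", (102,))

remaining = conn.execute(
    "SELECT OrderNumber, CustomerID, Kegs FROM RetailOrders ORDER BY OrderNumber"
).fetchall()
print(remaining)
```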

Each table in a database generally contains a primary key field that has a unique value for each record in the table. A primary key field is used to identify how records from several tables in a database are logically related. In our previous example, Customer ID is the primary key field for the table of Stinson’s retail customers. To facilitate the linking of records in the table of Stinson’s retail customers with logically related records in the table of retail orders, the two tables must share a common field. Thus, a field for Customer ID may be included in the table of retail orders so that information in this table can be linked to information on each of Stinson’s retail customers; when a field is included in a table for the sole purpose of facilitating links with records from another table, the field is referred to as a foreign key field.

Considerations When Designing a Database

Before creating a new database, we should carefully consider the following issues:

• What is the purpose of this database?
• Who will use this database?
• What queries and reports do the users of this database need?
• What information or data (fields) will this database include?
• What tables must be created, and how will the fields be allocated to these tables?
• What are the relationships between these tables?
• What are the fields that will be used to link related tables?
• What forms does the organization need to create to support the use of this database?

The answers to these questions will enable us to efficiently create a more effective and useful database. Let us consider these issues within the context of designing Stinson’s database. Stinson’s has several reasons for developing and implementing a database. Quick access to reliable and current data will enable Stinson’s to monitor inventory and place orders with the microbreweries so that it can meet the demand of the retailers it supplies, while avoiding excess quantities and potential spoilage of inventory. These data can also be used to monitor the age of the product in inventory, which is a critical issue for a perishable product. Patterns in the orders of various beers placed by Stinson’s retail customers can be analyzed to determine forecasts of future demand. Employees’ salaries, federal and state tax withholding, vacation and sick days taken/remaining for the current year, and contributions to retirement funds can be tracked. Orders received from retail customers and Stinson’s deliveries can be better coordinated. In summary, Stinson’s can use a database to utilize information about its business in numerous ways that will potentially improve the efficiency and profitability of the company.

If we were to create a database for Stinson’s MicroBrew Distributor, who within the company might need to use information from the database? A quick review of Stinson’s reasons for developing and implementing a database provides the answer. Warehousing/inventory can use the database to control inventory. Delivery can create efficient delivery routes for the drivers on a daily basis and assess the on-time performance of the delivery system. Receiving can anticipate and prepare to receive daily deliveries of microbrews. Human resources can administer payroll, taxes, and benefits. Marketing can identify and exploit potential sales opportunities.

By considering the users and uses for the database, we can make a preliminary determination of the queries and reports the users of this database will need and the data (fields) this database must include. At this point we can consider the tables to be created, how the fields will be allocated to the tables, and the potential relationships between these tables. We can see that we will need to incorporate data on:

• each microbrewery for which Stinson’s distributes beer (Stinson’s suppliers).
• each order placed with and delivery received from the microbreweries (Stinson’s supplies).

For tables that do not include a primary key field, a unique identifier for each record in the table may be formed by combining two or more fields (if the combination of these two fields will yield a unique value for each record that may be included in the table); the result is called a compound primary key and is used in the same way a primary key is used.

B.1 Database Basics 739

740 Appendix B–Database Basics with Microsoft Access

• each retailer to which Stinson’s distributes beer (Stinson’s customers).
• each order received from and delivery made to Stinson’s retail customers (Stinson’s sales).
• each of Stinson’s employees (Stinson’s workforce).

As we design these tables and allocate fields to the tables we design, we must ensure that our database stores Stinson’s data in the correct formats and is capable of outputting the queries, forms, and reports that Stinson’s employees need.

With these considerations in mind, we decide to begin with the following 11 tables and associated fields in designing a database for Stinson’s MicroBrew Distributor:

• TblEmployees: EmployeeID, EmpFirstName, EmpLastName, Gender, DOB, Street Address, City, State, Zip Code, Phone Number
• TblJobTitle: Job ID, Job Title
• TblEmployHist: EmployeeID, Start Date, End Date, Job ID, Salary, Hourly Rate
• TblBrewers: BrewerID, Brewery Name, Street Address, City, State, Zip Code, Phone Number
• TblSOrders: SOrder Number, BrewerID, Date of SOrder, EmployeeID, Keg or Case?, SQuantity Ordered
• TblSDeliveries: SOrder Number, BrewerID, EmployeeID, Date of SDelivery, SQuantity Delivered
• TblPurchasePrices: BrewerID, KegPurchasePrice, CasePurchasePrice
• TblRetailers: CustID, Name, Class, Street Address, City, State, Zip Code, Phone Number
• TblROrders: ROrder Number, Name, CustID, BrewerID, Date of ROrder, Keg or Case?, RQuantity Ordered, Rush?
• TblRDeliveries: CustID, Name, ROrder Number, EmployeeID, Date of RDelivery, RQuantity Delivered
• TblSalesPrices: BrewerID, KegSalesPrice, CaseSalesPrice

Each table contains information about a particular aspect of Stinson’s business operations:

• TblEmployees: Information about each Stinson’s employee, primarily obtained when the employee is hired
• TblJobTitle: Information about each type of position held by Stinson’s employees
• TblEmployHist: Information about the employment history of each Stinson’s employee
• TblBrewers: Information about each microbrewery that supplies Stinson’s with beer
• TblSOrders: Information about each order that Stinson’s has placed with the microbreweries that supply Stinson’s with beer
• TblSDeliveries: Information about each delivery that Stinson’s has received from the microbreweries that supply Stinson’s with beer
• TblPurchasePrices: Information about the price charged by each microbrewery that supplies Stinson’s with beer
• TblRetailers: Information about each retailer that Stinson’s supplies with beer
• TblROrders: Information about each order that Stinson’s has received from the retailers that Stinson’s supplies with beer
• TblRDeliveries: Information about each delivery that Stinson’s has made to the retailers that Stinson’s supplies with beer
• TblSalesPrices: Information about the price charged to retailers by Stinson’s for each of the microbrews it distributes

The first three tables deal with personnel information, the next four with product supply/purchasing information, and the last four with demand/sales information. Figure B.1 shows how these tables are related.

The relationships among the tables define how they can be linked. For example, suppose Stinson’s Shipping Manager needs information on the orders placed by Stinson’s retail customers that are to be filled tomorrow so that she can solve an optimization model that provides the optimal routes for Stinson’s delivery trucks. The Shipping Manager needs to generate a report that includes the amount of various beers ordered and the address of each retail customer that has placed an order. To do so, she can use the common field CustID to link records from the TblRetailers table with related records from the TblROrders table. When the delivery has been made, the relevant information is input into the TblRDeliveries table. If the Shipping Manager needs to generate a report of deliveries made by each driver for the past week, she can use the common field EmployeeID to link records from the TblEmployees table with related records from the TblRDeliveries table.
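The weekly report of deliveries by driver can be sketched as a join on EmployeeID followed by a grouping step. The example below uses Python's sqlite3 module in place of Access. Field names are written without spaces (for example, DateOfRDelivery), only a subset of the fields is included, and the employees, deliveries, and dates are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE TblEmployees (
        EmployeeID   INTEGER PRIMARY KEY,
        EmpFirstName TEXT,
        EmpLastName  TEXT
    );
    CREATE TABLE TblRDeliveries (
        ROrderNumber       INTEGER PRIMARY KEY,
        CustID             INTEGER,
        EmployeeID         INTEGER REFERENCES TblEmployees(EmployeeID),
        DateOfRDelivery    TEXT,      -- ISO date text, an assumption of this sketch
        RQuantityDelivered INTEGER
    );
    -- Hypothetical employees and deliveries.
    INSERT INTO TblEmployees VALUES (201, 'Pat', 'Driver'), (202, 'Lee', 'Hauler');
    INSERT INTO TblRDeliveries VALUES
        (5001, 11, 201, '2012-11-05', 4),
        (5002, 12, 202, '2012-11-05', 2),
        (5003, 11, 201, '2012-11-07', 6),
        (5004, 13, 201, '2012-10-20', 1);
    """
)


def deliveries_by_driver(conn, start, end):
    """Number of deliveries and units delivered per employee between two dates."""
    return conn.execute(
        """SELECT e.EmployeeID, e.EmpLastName,
                  COUNT(*) AS Deliveries,
                  SUM(d.RQuantityDelivered) AS Units
           FROM TblEmployees AS e
           JOIN TblRDeliveries AS d ON d.EmployeeID = e.EmployeeID
           WHERE d.DateOfRDelivery BETWEEN ? AND ?
           GROUP BY e.EmployeeID, e.EmpLastName
           ORDER BY e.EmployeeID""",
        (start, end),
    ).fetchall()


weekly_report = deliveries_by_driver(conn, "2012-11-03", "2012-11-09")
print(weekly_report)
```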

Note that the name of each table begins with the three-letter designation Tbl; this is consistent with the Leszynski/Reddick guidelines, a common set of standards for naming database objects.

FIGURE B.1 Tables and Relationships for Stinson’s Microbrew Distributor Database (tables grouped as Personnel Information, Supplier Information, and Retailer Information)

Once Stinson’s is satisfied that the planned database will provide the organization with the capability to collect and manage its data, and Stinson’s is also confident that the database is capable of outputting the queries, forms, and reports that its employees need, we can proceed by using Access to create the new database. However, it is important to realize that it is unusual for a new database to meet all of the potential needs of its users. A well-designed database allows for augmentation and revision when needs that the current database does not meet are identified.

Creating a Database in Access

When you open Access, the left pane provides links to databases you have recently opened as well as a means for opening existing database documents. The available document templates are provided in the right pane; these preinstalled templates can be used to create new databases that utilize common formats. Because we are focusing on building a fairly generic database, we will use the Blank desktop database tool. We are now ready to create a new database by following these steps:

Step 1. Click the Blank desktop database icon (Figure B.2)
Step 2. When the Blank desktop database dialog box (Figure B.3) appears:
    Enter the name of the new database in the File Name box (we will call our new database Stinsons.accdb)
    Indicate the location where the new database will be saved by clicking the Browse button (we will save the database called Stinsons.accdb in the folder C:\Stinson Files)
Step 3. Click Create

This takes us to the Access Datasheet view. As shown in Figure B.4, the Datasheet view includes a Navigation Panel and Table Window. The Ribbon in the Datasheet view contains the File, Home, Create, External Data, Database Tools, Fields, and Table tabs.

FIGURE B.2 Blank Desktop Database Icon

FIGURE B.3 Blank Desktop Database Dialog Box

FIGURE B.4 Datasheet View and Table Tools Contextual Tab (showing the Navigation Panel and the Table Window)

Clicking the File tab in the Ribbon will allow the user to create new databases and access existing databases from the Datasheet view.

The Datasheet view provides the means for controlling the database. The groups and buttons of the Table Tools contextual tab are displayed across the top of this window. The Navigation Panel on the left side of the display lists all objects in the database. This provides a user with direct access to tables, reports, queries, forms, and so on that make up the currently open database. On the right side of the display is the Table Window; the tab in the upper left corner of the Table Window shows the name of the current table (Table1 in Figure B.4). In the Table Window, we can enter data directly into the table or modify data in an existing table.

The first step in creating a new database is to create one or more tables. Because tables store the information contained in a database, they are the foundation of a database and must be created prior to the creation of any other objects in the database. There are two options for manually creating a table: We can enter data directly in Datasheet view, or we can design a table in Design view. We will create our first table, TblBrewers, by entering data directly in Datasheet view. You can review an example database comprising all of the objects and relationships between the objects that we create throughout this appendix for the Stinson’s database in the file Stinsons.

In Datasheet view the data are entered by field, one record at a time. In Figure B.1 we see that the fields for TblBrewers are BrewerID, Brewery Name, Street Address, City, State, Zip Code, and Phone Number. From Stinson’s current filing system, we have been able to retrieve the information in Table B.1 on the breweries that supply Stinson’s.

We can enter these data into our new database in Datasheet view by following these steps:

Step 1. Enter the first record from Table B.1 into the first row of the Table Window in Access by entering a 3 in the top row next to (New), pressing the Tab key, entering Oak Creek Brewery in the next column, pressing the Tab key, entering 12 Appleton St in the next column, pressing the Tab key, entering Dayton in the next column, pressing the Tab key, and so on.

Step 2. Enter the second record from Table B.1 by repeating Step 1 for the Gonzo Microbrew data and entering these data into the second row of the Table Window in Access
    Continue entering data for the remaining microbreweries in this manner

The completed table in Access appears in Figure B.5.

Now that we have entered all of our information on the microbreweries that supply Stinson’s, we need to save this table as an object in the Stinson’s database. We click on the Save button in the Quick Access Toolbar above the Ribbon, type the table name TblBrewers in the Save As dialog box that appears (as shown in Figure B.6), and click OK. The name in the Table Name tab on the Table Window now reads “TblBrewers.”

You can click the Help button ? to find detailed instructions on creating a table or using any other Access functionality.

When we enter 3 in Step 1, this establishes a new field with the generic name “Field1” and generates a value for the ID column. Pressing the Tab key moves to the next field entry box for the same record.

TABLE B.1 Raw Data for Table TblBrewers

BrewerID  Brewery Name          Street Address      City        State  Zip Code  Phone Number
3         Oak Creek Brewery     12 Appleton St      Dayton      OH     45455     937-445-1212
6         Gonzo Microbrew       1515 Main St        Dayton      OH     45429     937-278-2651
4         McBride’s Pride       425 Mad River Rd    Miamisburg  OH     45459     937-439-0123
9         Fine Pembrook Ale     141 Dusselberg Ave  Trotwood    OH     45426     937-837-8752
7         Midwest Fiddler Crab  844 Far Hills Ave   Kettering   OH     45453     937-633-7183
2         Herman’s Killer Brew  912 Airline Dr      Fairborn    OH     45442     937-878-2651

FIGURE B.5 Records for Six Microbreweries Entered into an Access Table

FIGURE B.6 Save As Dialog Box

We can now use the Design view to provide meaningful names for our fields and specify each field’s properties. We switch to Design view by first clicking on the arrow below the View button in the Views group of the Ribbon. This will open a pull-down menu with options for various views (recall that we are currently in the Datasheet view). Clicking on the Design View option opens the Design view for the current table as shown in Figure B.7. From the Design view we can define or edit the table’s fields and field properties as well as rearrange the order of the fields if we wish. The name of the current table is again identified in the Name tab, and the Table Window is replaced with two sections: the Table Design Grid on top and the Field Properties Pane on the bottom of this window.

We can now replace the generic field names (Field1, Field2, etc.) in the Field Name column on the left side of the Table Design Grid of TblBrewers with the names we established in our original database design and then define the data type for each field. To change the data type for a field in Design view, we follow these steps:

Step 1. Click on the cell in the Data Type column (the middle column) in the Table Design Grid in the row of the field for which you want to change the data type

Step 2. Click on the drop-down arrow in the upper right-hand corner of the selected cell

Step 3. Define the data type for the field using the drop-down menu (Figure B.8)

Notice that when you use this menu to define the data type for a field, the Field Properties Pane changes to display the properties and restrictions of the selected data type. For example, the field Brewery Name is defined as Short Text; when any row of the Table Design Grid associated with this field is selected, the Field Properties Pane shows the characteristics associated with a field of data type Short Text, including a limit of 255 characters. If we thought we might eventually do business with a brewery that has a business name that exceeds 255 characters, we may decide to select the Long Text data type for this field (Figure B.8). However, selecting a data type that allows for greater capacity will increase the size of the database and should not be used unless necessary.

Note that Field Names used in Access cannot exceed 64 characters, cannot begin with a space, and can include any combination of letters, numbers, spaces, and special characters except for a period (.), an accent grave (`), an exclamation point (!), or square brackets ([ and ]).

FIGURE B.7 Design View for the Table TblBrewers (showing the Navigation Panel, the Table Design Grid, and the Field Properties Pane)

A field such as State is a good candidate for reducing the field size. If we use the official U.S. Postal Service abbreviations for the states (e.g., AL for Alabama, AK for Alaska, and so on), this field would always use two characters. Note that if we violate the restriction we place on a field, Access will respond with an error statement. The restriction on length can be very helpful in this instance. Because we know that a state abbreviation is always exactly two characters, an error statement regarding the length of the State field indicates that we made an incorrect entry for this field.
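The effect of a length restriction on the State field can be imitated outside Access. The sketch below uses Python's sqlite3 module and a CHECK constraint that requires exactly two characters; this is an analogue of, not the same mechanism as, the Access Field Size property, and the reduced set of columns is an assumption made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE TblBrewers (
           BrewerID    TEXT PRIMARY KEY,
           BreweryName TEXT,
           State       TEXT CHECK (length(State) = 2)  -- two-letter postal abbreviation
       )"""
)


def try_insert(conn, brewer_id, name, state):
    """Attempt to add a brewer; return False if the entry violates a restriction."""
    try:
        conn.execute("INSERT INTO TblBrewers VALUES (?, ?, ?)", (brewer_id, name, state))
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert(conn, "6", "Gonzo Microbrew", "OH")    # valid abbreviation
rejected = try_insert(conn, "4", "McBride's Pride", "Ohio")  # too long, so refused
print(accepted, rejected)
```

As in Access, the refusal signals that the entry for this field was made incorrectly, so the error is caught at data entry rather than discovered later.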

After defining the data type for each of the fields to be Short Text (although fields such as BrewerID, Zip Code, and Phone Number are made up of numbers, we would not consider doing arithmetic operations on these cells, so we define these fields as Short Text), we can use the column labeled Description on the right side of the Table Design Grid to document the contents of each field. Here we may include the following:

• Brief descriptions of the fields (especially if our field names are not particularly meaningful or descriptive)

• Instructions for entering data into the fields (e.g., we may indicate that a telephone number is entered in the format (XXX) XXX-XXXX)

• Indications of whether a field acts as a primary or a foreign key

To change the primary key from the default field ID to BrewerID, we use the following steps:

Step 1. Click on any cell in the BrewerID row
Step 2. Click the Design tab in the Ribbon
Step 3. Click the Primary Key icon in the Tools group

This changes the primary key from the ID field to the BrewerID field. We can now delete the ID field because it is no longer needed.

Step 4. Right-click any cell in the ID row and click Delete Rows (Figure B.9)
    Click Yes when the dialog box appears to confirm that you want to delete this row

FIGURE B.8 Changing the Data Type for the Brewery Name Field in the Table TblBrewers

FIGURE B.9 Drop-Down Menu for Deleting Fields in the Design View

FIGURE B.10 Design View of Table Design for TblBrewers

We have now created the table TblBrewers by entering the data in Datasheet view (Figure B.10) and (1) changed the name of each field, (2) identified the correct data type for each field, (3) revised properties for some fields, (4) added a description for each field, and (5) changed the primary key field to BrewerID in Design view.

Alternatively, we could create a table in Design view. We first enter the field names, data types, and descriptions in the Table Design Grid. After saving this table as TblSOrders, we then move to the Database Window, which now has defined fields, and enter the information in the appropriate cells. Suppose we take this approach to create the table TblSOrders, which contains information on orders Stinson’s places with the microbreweries. We have the following data for orders from the past week (Table B.2) that we will use to initially populate this table (new orders will be added to the table as they are placed).

TABLE B.2 Raw Data for Table TblSOrders

SOrderNumber  BrewerID  Date of SOrder  EmployeeID  Keg or Case?  SQuantity Ordered
17351         3         11/5/2012       135         Keg           3
17352         9         11/5/2012       94          Case          6
17353         7         11/5/2012       94          Keg           2
17354         3         11/6/2012       94          Keg           3
17355         2         11/6/2012       135         Keg           2
17356         6         11/6/2012       135         Case          5
17358         9         11/7/2012       94          Keg           3
17359         4         11/7/2012       135         Keg           2
17360         3         11/8/2012       94          Case          8
17361         2         11/8/2012       94          Keg           1
17362         7         11/8/2012       94          Keg           2
17363         9         11/8/2012       135         Keg           4
17364         6         11/8/2012       94          Keg           2
17365         2         11/9/2012       135         Case          5
17366         3         11/9/2012       135         Keg           4
17367         7         11/9/2012       94          Case          4
17368         9         11/9/2012       135         Keg           4
17369         4         11/9/2012       94          Keg           3

The fields represent Stinson’s internal number assigned to each order placed with a brewery (SOrderNumber), Stinson’s internal identification number assigned to the microbrewery with which the order has been placed (BrewerID), the date on which Stinson’s placed the order (Date of SOrder), the identification number of the Stinson’s employee who placed the order (EmployeeID), an indication of whether the order was for kegs or cases of beer (Keg or Case?), and the quantity (in units) ordered (SQuantity Ordered). As before, we enter the information into the Field Name, Data Type, and Description columns of the Table Design Grid, remove the ID field, change the primary key field (this time to the field SOrderNumber), and revise the properties of the fields as necessary in the Field Properties area as shown in Figure B.11.

FIGURE B.11 Design View of Table Design for TblSOrders


Now we return to the Database Window and manually input the data from Table B.2 into the table TblSOrders as shown in Figure B.12. Note that in both Datasheet view and Design view, we now have separate tabs with the table names TblBrewers and TblSOrders and that these two tables are listed in the Navigation Panel. We can use either Datasheet view or Design view to move between our tables.

We can also create a table by reading information from an external file. Access is capable of reading information from several types of external files. Here we demonstrate by reading data from an Excel file into a new table TblSDeliveries. The Excel file SDeliveries.xlsx contains the information on deliveries received by Stinson’s from various microbreweries during a recent week. The fields of this table, as shown in Figure B.12, will correspond to the column headings in the Excel worksheet displayed in Figure B.13.

The columns in Figure B.13 represent: Stinson’s internal number assigned to each order placed with a microbrewery (SOrderNumber), Stinson’s internal identification number assigned to the microbrewery with which the order has been placed (BrewerID), the identification number of the Stinson’s employee who received the delivery (EmployeeID), the date on which Stinson’s received the delivery (Date of SDelivery), and the quantity (in units) received in the delivery (SQuantity Delivered). To import these data directly into the table TblSDeliveries, we follow these steps:

Step 1. Click the External Data tab in the Ribbon
Step 2. Click the Excel icon in the Import & Link group (Figure B.14)
Step 3. When the Get External Data—Excel Spreadsheet dialog box appears (Figure B.15), click the Browse… button
    Navigate to the location of the Excel file to be imported into Access (in this case, SDeliveries.xlsx), and indicate the manner in which we want to import the information in this Excel file by selecting the appropriate radio button (we are importing these data to a new table, TblSDeliveries, in the current database)
Step 4. Click OK

If the Excel worksheet from which we are importing the data does not contain column headings, Access will assign dummy names to the fields that can later be changed in the Table Design Grid of Design view.

FIGURE B.12 Datasheet View for TblSOrders

FIGURE B.14 External Data Tab on the Access Ribbon

Step 5. When the Import Spreadsheet Wizard dialog box opens (Figure B.16), arrange the information as shown in Figure B.16
    Verify that the check box for First Row Contains Column Headings is selected because the worksheet from which we are importing the data contains column headings
    Click Next > to open the second screen of the Import Spreadsheet Wizard dialog box (Figure B.17)
Step 6. Indicate the format for the first field (in this case, SOrderNumber) and whether this field is the primary key field for the new table (it is in this case)
    Click Next >

We continue to work through the ensuing screens of the Import Spreadsheet Wizard dialog box, indicating the format for each field and identifying the primary key field (SOrderNumber) for the new table. When we have completed the final screen, we click Finish and add the table TblSDeliveries to our database. Note that in both Datasheet view (Figure B.18) and Design view, we now have separate tabs with the table names TblBrewers, TblSOrders, and TblSDeliveries, and that these three tables are listed in the Navigation Panel.

FIGURE B.13 Excel Spreadsheet SDeliveries.xlsx

FIGURE B.15 Get External Data—Excel Spreadsheet Dialog Box

FIGURE B.16 First Screen of Import Spreadsheet Wizard Dialog Box

FIGURE B.17 Second Screen of Import Spreadsheet Wizard Dialog Box

FIGURE B.18 Datasheet View for TblSDeliveries

If your Excel file contains multiple worksheets, you will be prompted by the Import Spreadsheet Wizard to select the worksheet from which you want to import data. After you have selected a worksheet and clicked on Next, you will automatically proceed to the screen in Figure B.16.

We have now created the table TblSDeliveries by reading the information from the Excel file SDeliveries.xlsx, and we have entered information in the fields and identified the primary key field in this process. This procedure for creating a table is more convenient (and more accurate) than manually inputting the information in Datasheet view, but it requires that the data be in a file that can be imported into Access.
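The same idea, loading a table from an external file whose first row contains the column headings, can be sketched outside Access. The example below assumes the worksheet has been exported as comma-separated text and uses only Python's standard csv and sqlite3 modules; the two data rows and the employee number are invented placeholders, not the contents of SDeliveries.xlsx.

```python
import csv
import io
import sqlite3

# Stand-in for a CSV export of the worksheet; the rows are made up.
csv_text = """SOrderNumber,BrewerID,EmployeeID,Date of SDelivery,SQuantity Delivered
17351,3,112,11/12/2012,3
17352,9,112,11/12/2012,6
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE TblSDeliveries (
           SOrderNumber          INTEGER PRIMARY KEY,
           BrewerID              TEXT,
           EmployeeID            TEXT,
           "Date of SDelivery"   TEXT,
           "SQuantity Delivered" INTEGER
       )"""
)

# DictReader treats the first row as column headings, much like selecting
# First Row Contains Column Headings in the import wizard.
reader = csv.DictReader(io.StringIO(csv_text))
rows = [
    (
        int(r["SOrderNumber"]),
        r["BrewerID"],
        r["EmployeeID"],
        r["Date of SDelivery"],
        int(r["SQuantity Delivered"]),
    )
    for r in reader
]
conn.executemany("INSERT INTO TblSDeliveries VALUES (?, ?, ?, ?, ?)", rows)

imported = conn.execute("SELECT COUNT(*) FROM TblSDeliveries").fetchone()[0]
print(imported)
```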

B.2 Creating Relationships between Tables in Microsoft Access

One of the advantages of a database over a spreadsheet is the economy of data storage and maintenance. Information that is associated with several records can be placed in a separate table. As an example, consider that the microbreweries that supply beer to Stinson’s are each associated with multiple orders for beer that have been placed by Stinson’s. In this case, the names and addresses of the microbreweries do not have to be included in records of Stinson’s orders, saving a great deal of time and effort in data entry and maintenance. However, the two tables, in this case the table with the information on Stinson’s orders for beer and the table with information on the microbreweries that supply Stinson’s with beer, must be joined (i.e., have a defined relationship) by a common field. To use the data from these two tables for a common purpose, a relationship between the two tables must be created to allow one table to share information with the other.

The first step in deciding how to join related tables is to decide what type of relationship you need to create between tables. Next we briefly summarize the three types of relationships that can exist between two tables.

One-to-Many This relationship occurs between two tables, which we will label as Table A and Table B, when the value in the common field for a record in Table A can match the value in the common field for multiple records in Table B, but a value in the common field for a record in Table B can match the value in the common field for at most a single record in Table A. Consider TblBrewers and TblSOrders with the common field BrewerID. In TblBrewers, each unique value of BrewerID is associated with a single record that contains contact information for a single brewer, while in TblSOrders each unique value of BrewerID may be associated with several records that contain information on various orders placed by Stinson’s with a specific brewer. When these tables are linked through the common field BrewerID, each record in TblBrewers can potentially be matched with multiple records of orders in TblSOrders, but each record in TblSOrders can be matched with only one record in TblBrewers. This makes sense, as a single brewer can be matched to several orders, but each order can be matched to only a single brewer.
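The one-to-many relationship between TblBrewers and TblSOrders can be checked on a few records. The sketch below uses Python's sqlite3 module with three brewers from Table B.1 and the first four orders from Table B.2; the field names are written without spaces and only a subset of the fields is kept, both of which are simplifications for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE TblBrewers (BrewerID TEXT PRIMARY KEY, BreweryName TEXT);
    CREATE TABLE TblSOrders (
        SOrderNumber     TEXT PRIMARY KEY,
        BrewerID         TEXT REFERENCES TblBrewers(BrewerID),
        SQuantityOrdered INTEGER
    );
    """
)
# Three brewers from Table B.1 and the first four orders from Table B.2.
conn.executemany(
    "INSERT INTO TblBrewers VALUES (?, ?)",
    [("3", "Oak Creek Brewery"), ("9", "Fine Pembrook Ale"), ("7", "Midwest Fiddler Crab")],
)
conn.executemany(
    "INSERT INTO TblSOrders VALUES (?, ?, ?)",
    [("17351", "3", 3), ("17352", "9", 6), ("17353", "7", 2), ("17354", "3", 3)],
)

# One brewer can match several orders (the "many" side) ...
orders_per_brewer = dict(
    conn.execute(
        """SELECT b.BreweryName, COUNT(o.SOrderNumber)
           FROM TblBrewers AS b
           LEFT JOIN TblSOrders AS o ON o.BrewerID = b.BrewerID
           GROUP BY b.BrewerID, b.BreweryName"""
    ).fetchall()
)

# ... but each order matches exactly one brewer (the "one" side).
brewers_per_order = dict(
    conn.execute(
        """SELECT o.SOrderNumber, COUNT(b.BrewerID)
           FROM TblSOrders AS o
           JOIN TblBrewers AS b ON b.BrewerID = o.BrewerID
           GROUP BY o.SOrderNumber"""
    ).fetchall()
)
print(orders_per_brewer)
print(brewers_per_order)
```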

One-to-One This relationship occurs when the value in the common field for a record in Table A can match the value in the common field for at most one record in Table B, and a value in the common field for a record in Table B can match the value in the common field for at most a single record in Table A. Here we consider TblBrewers and TblPurchasePrices, which also share the common field BrewerID. In TblBrewers, each unique value of BrewerID is associated with a single record that contains contact information for a single brewer, while in TblPurchasePrices each unique value of BrewerID is associated with a single record that contains information on prices charged to Stinson’s by a specific brewer for kegs and cases of beer. When these tables are linked through the common field BrewerID, each record in TblBrewers can be matched to at most a single record of prices in TblPurchasePrices, and each record in TblPurchasePrices can be matched with no more than one record in TblBrewers. This makes sense, as a single brewer can be matched only to the prices it charges, and a specific set of prices can be matched only to a single brewer.

One-to-Many relationships are the most common type of relationship between two tables in a relational database; these relationships are sometimes abbreviated as 1:∞.

One-to-One relationships are the least common form of relationship between two tables in a relational database because it is often possible to include these data in a single table; these relationships are sometimes abbreviated as 1:1.

Many-to-Many relationships are sometimes abbreviated as ∞:∞.

Many-to-Many This occurs when a value in the common field for a record in Table A can match the value in the common field for multiple records in Table B, and a value in the common field for a record in Table B can match the value in the common field for several records in Table A. Many-to-many relationships are not directly supported by Access but can be facilitated by creating a third table, called an associate table, that contains a primary key and a foreign key to each of the original tables. This ultimately results in one-to-many relationships between the associate table and the two original tables. Our design for Stinson’s database does not include any many-to-many relationships.
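The associate-table idea can be sketched outside Access. The following is a minimal illustration in Python using the built-in sqlite3 module rather than Access itself; the tables TblBeers, TblRetailers, and TblBeerRetailer and all of their rows are hypothetical and are not part of Stinson’s database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables and rows, invented for illustration only.
    CREATE TABLE TblBeers     (BeerID     INTEGER PRIMARY KEY, BeerName     TEXT);
    CREATE TABLE TblRetailers (RetailerID INTEGER PRIMARY KEY, RetailerName TEXT);

    -- Associate table: its own primary key plus a foreign key to each original table.
    CREATE TABLE TblBeerRetailer (
        LinkID     INTEGER PRIMARY KEY,
        BeerID     INTEGER NOT NULL REFERENCES TblBeers(BeerID),
        RetailerID INTEGER NOT NULL REFERENCES TblRetailers(RetailerID)
    );

    INSERT INTO TblBeers     VALUES (1, 'Pale Ale'), (2, 'Stout');
    INSERT INTO TblRetailers VALUES (10, 'Corner Market'), (20, 'Main Street Pub');
    INSERT INTO TblBeerRetailer (BeerID, RetailerID) VALUES (1, 10), (1, 20), (2, 20);
""")

# One beer can be linked to many retailers ...
retailers_per_beer = conn.execute(
    "SELECT BeerID, COUNT(*) FROM TblBeerRetailer GROUP BY BeerID ORDER BY BeerID"
).fetchall()

# ... and one retailer can be linked to many beers.
beers_per_retailer = conn.execute(
    "SELECT RetailerID, COUNT(*) FROM TblBeerRetailer GROUP BY RetailerID ORDER BY RetailerID"
).fetchall()

print(retailers_per_beer)
print(beers_per_retailer)
```

Each of the two original tables has a one-to-many relationship with TblBeerRetailer, and together these two relationships represent the many-to-many relationship.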

To create any of these three types of relationships between two tables, we must satisfy the rules of integrity. Recall that the primary key field for a table is a field that has (and will have throughout the life of the database) a unique value for each record. Defining a primary key field for a table ensures that the table will have entity integrity, which means that the table will have no duplicate records.
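As a rough illustration of entity integrity, the sketch below uses Python’s built-in sqlite3 module (not Access) to show a database engine rejecting a second record with the same primary key value. The field BrewerName and the second brewer’s name are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# BrewerID is declared as the primary key, so its values must be unique.
conn.execute("CREATE TABLE TblBrewers (BrewerID INTEGER PRIMARY KEY, BrewerName TEXT)")
conn.execute("INSERT INTO TblBrewers VALUES (3, 'Oak Creek Brewery')")

try:
    # A second record with BrewerID 3 would duplicate the key (hypothetical row).
    conn.execute("INSERT INTO TblBrewers VALUES (3, 'Duplicate Brewer')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM TblBrewers").fetchone()[0]
print(duplicate_rejected, row_count)
```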

Note that when the primary key field for one table is a foreign key field in another table, it is possible for a value of this field to occur several times in the table for which it is a foreign key field. For example, Job ID is the primary key field in the table TblJobTitle and will have a unique value for each record in this table. But Job ID is a foreign key field in the table TblEmployHist, so a value of Job ID can occur several times in TblEmployHist.

Referential integrity is the rule that establishes the relationship between two tables. For referential integrity to be established, when the foreign key field in one table (say, Table B) and the primary key field in the other table (say, Table A) are matched, each value that occurs in the foreign key field in Table B must also occur in the primary key field in Table A. For instance, to preserve referential integrity for the relationship between TblEmployHist and TblJobTitle, each employee record in TblEmployHist must have a value in the Job ID field that exactly matches a value of the Job ID field in TblJobTitle. If a record in TblEmployHist has a value for the foreign key field (Job ID) that does not occur in the primary key field (Job ID) of TblJobTitle, the record is said to be orphaned (in this case, this would occur if we had an employee who has been assigned a job that does not exist in our database). An orphaned record would be lost in any table that results from joining TblJobTitle and TblEmployHist. Enforcing referential integrity through Access prevents records from becoming orphaned and lost when tables are joined.

Violations of referential integrity lead to inconsistent data, which results in meaningless and potentially misleading analyses. Enforcement of referential integrity is critical not only for ensuring the quality of the information in the database but also for ensuring the validity of all conclusions based on these data.
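To make the idea of an orphaned record concrete, here is a minimal sketch in Python with the built-in sqlite3 module (an analogue, not Access). The field names JobID, JobTitle, and HistID and every row are assumptions for illustration; the third history record deliberately refers to a job that does not exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = OFF")  # start with no enforcement
conn.executescript("""
    CREATE TABLE TblJobTitle (JobID INTEGER PRIMARY KEY, JobTitle TEXT);
    CREATE TABLE TblEmployHist (
        HistID     INTEGER PRIMARY KEY,
        EmployeeID INTEGER,
        JobID      INTEGER REFERENCES TblJobTitle(JobID)
    );
    -- Invented rows; JobID 9 has no match in TblJobTitle.
    INSERT INTO TblJobTitle   VALUES (1, 'Driver'), (2, 'Sales Manager');
    INSERT INTO TblEmployHist VALUES (1, 135, 1), (2, 140, 2), (3, 150, 9);
""")

# Orphaned records: foreign key values with no matching primary key value.
orphans = conn.execute("""
    SELECT h.HistID
    FROM TblEmployHist AS h
    LEFT JOIN TblJobTitle AS j ON h.JobID = j.JobID
    WHERE j.JobID IS NULL
""").fetchall()

# The orphaned record is lost when the two tables are joined on JobID.
joined_count = conn.execute("""
    SELECT COUNT(*)
    FROM TblEmployHist AS h
    JOIN TblJobTitle AS j ON h.JobID = j.JobID
""").fetchone()[0]

# With enforcement switched on, a new orphan is rejected.
conn.commit()
conn.execute("PRAGMA foreign_keys = ON")
try:
    conn.execute("INSERT INTO TblEmployHist VALUES (4, 160, 9)")
    new_orphan_rejected = False
except sqlite3.IntegrityError:
    new_orphan_rejected = True

print(orphans, joined_count, new_orphan_rejected)
```

Of the three history records, only two survive the join, which is the kind of silent loss that enforcing referential integrity is meant to prevent.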

We are now ready to establish relationships between tables in our database. We will first establish a relationship between the tables TblBrewers and TblSOrders. To establish a relationship between these two tables, take the following steps:

Step 1. Click the Database Tools tab in the Ribbon (Figure B.19)
Step 2. From the Navigation Panel select one of the tables for which you want to establish a relationship (we will click on TblBrewers)
Step 3. Click the Relationships icon in the Relationships group

This will open the contextual tab Relationship Tools in the Ribbon and a new display with a tab labeled Relationships in the workspace, as shown in Figure B.20. A box listing all fields in the table you selected before clicking the Relationships icon will be provided.

FIGURE B.19 Database Tools Tab in the Access Ribbon


FIGURE B.20 Relationship Tools Contextual Tab in the Access Ribbon and Tab Labeled Relationships in the Workspace

Step 4. Click Show Table in the Relationships group
    When the Show Table dialog box opens (Figure B.21), select the second table for which you want to establish a relationship (in our example, this is TblSOrders)
    Click Add
    Click Close

You can select multiple tables in the Show Table dialog box by holding down the Ctrl key and selecting multiple tables.

FIGURE B.21 Show Table Dialog Box

FIGURE B.22 Upper Portion of the Relationships Workspace Showing the Relationship between TblBrewers and TblSOrders

FIGURE B.23 Edit Relationships Dialog Box

Once we have selected the two tables (TblBrewers and TblSOrders) for which we are establishing a relationship, boxes showing the fields for the two tables will appear in the workspace. If Access can identify a common field, it will also suggest a relationship between these two tables. In our example, Access has identified BrewerID as a common field between TblBrewers and TblSOrders and is showing a relationship between these two tables based on this field (Figure B.22).

In this instance, Access has correctly identified the relationship we want to establish between TblBrewers and TblSOrders. However, if Access does not correctly identify the relationship, we can modify the relationship between these tables. If we double-click on the line connecting TblBrewers to TblSOrders, we open the relationship’s Edit Relationships dialog box, as shown in Figure B.23.

If Access does not suggest a relationship between two tables, you can click Create New... in the Edit Relationships dialog box to open the Create New dialog box, which then will allow you to specify the tables to be related and the fields in these tables to be used to establish the relationship.


Note here that Access has correctly identified the relationship between TblBrewers and TblSOrders to be one-to-many and that we have several options from which to select. We can use the pull-down menu under the name of each table in the relationship to select different fields to use in the relationship between the two tables.

By selecting the Enforce Referential Integrity option in the Edit Relationships dialog box, we can indicate that we want Access to monitor this relationship to ensure that it satisfies referential integrity. This means that every value in the BrewerID field in TblSOrders also appears in the BrewerID field of TblBrewers. Access recognizes this as a one-to-many relationship between TblBrewers and TblSOrders and revises the display of the relationship, as shown in Figure B.22, to reflect this.

Finally, we can click Join Type… in the Edit Relationships dialog box to open the Join Properties dialog box (Figure B.24). This dialog box allows us to specify which records are retained when the two tables are joined.
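The effect of the join type on which records are retained can be sketched with Python’s built-in sqlite3 module. This is an analogue rather than the Access dialog itself: it assumes the choice is between what SQL calls an inner join and a left outer join, and the field BrewerName and the order rows are invented. In this toy data, brewer 7 has no orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TblBrewers (BrewerID INTEGER PRIMARY KEY, BrewerName TEXT);
    CREATE TABLE TblSOrders (SOrderNumber INTEGER PRIMARY KEY, BrewerID INTEGER);
    -- Invented rows for illustration.
    INSERT INTO TblBrewers VALUES (3, 'Oak Creek Brewery'), (7, 'Midwest Fiddler Crab');
    INSERT INTO TblSOrders VALUES (1001, 3), (1002, 3);
""")

# Inner join: keep only records whose BrewerID appears in both tables.
inner_rows = conn.execute("""
    SELECT b.BrewerID, o.SOrderNumber
    FROM TblBrewers AS b
    INNER JOIN TblSOrders AS o ON b.BrewerID = o.BrewerID
    ORDER BY o.SOrderNumber
""").fetchall()

# Left outer join: keep every brewer, with NULL (None) where no order matches.
left_rows = conn.execute("""
    SELECT b.BrewerID, o.SOrderNumber
    FROM TblBrewers AS b
    LEFT JOIN TblSOrders AS o ON b.BrewerID = o.BrewerID
    ORDER BY b.BrewerID, o.SOrderNumber
""").fetchall()

print(inner_rows)
print(left_rows)
```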

Once we have established a relationship between two tables, we can create new Access objects (tables, queries, reports, etc.) using information from both of the joined tables simultaneously. Suppose Stinson’s will need to combine information from TblBrewers, TblSOrders, and TblSDeliveries. Using the same steps, we can also establish relationships among the three tables TblBrewers, TblSOrders, and TblSDeliveries, as shown in Figure B.25. Note that for each relationship shown in this figure, we have used the Enforce Referential Integrity option in the Edit Relationships dialog box to indicate that we want Access to monitor these relationships to ensure that they satisfy referential integrity. Thus, each relationship is identified in this case as a one-to-many relationship.

This set of relationships will also allow us to combine information from all three tables and create new Access objects (tables, queries, reports, etc.) using information from the three joined tables simultaneously.

FIGURE B.24 Join Properties Dialog Box

B.3 Sorting and Filtering Records

As our tables inevitably grow or are joined to form larger tables, the number of records can become overwhelming. One of the strengths of relational database software such as Access is that it provides tools, such as sorting and filtering, for dealing with large quantities of data. Access provides several tools for sorting the records in a table into a desired sequence and filtering the records in a table to generate a subset of your data that meets specific criteria.

We begin by considering sorting the records in a table to improve the organization of the data and increase the value of information in the table by making it easier to find records with specific characteristics. Access allows for records to be sorted on values of one or more fields, called the sort fields, in either ascending or descending order. To sort on a single field, we click on the Filter Arrow in the field on which we wish to sort.

Suppose that Stinson’s Manager of Receiving wants to review a list of all deliveries received by Stinson’s, and she wants the list sorted by the Stinson’s employee who received the orders. To accomplish this, we first open the table TblSDeliveries in Datasheet view. We then click on the Filter Arrow for the field EmployeeID (the sort field), as shown in Figure B.26; to sort the data in ascending order by values in the EmployeeID field, we click on Sort Smallest to Largest (clicking on Sort Largest to Smallest will sort the data in descending order by values in the EmployeeID field). By using the Filter Arrows, we can sort the data in a table on values of any of the table’s fields.

We can also use this pull-down menu to filter our data to generate a subset of data in a table that satisfies specific conditions. If we want to create a display of only deliveries that were received by employee 135, we would click the Filter Arrow next to EmployeeID, select only the check box for 135, and click OK (Figure B.27).

Filtering through the Filter Arrows is convenient if you want to retain records associated with several different values in a field. For example, if we want to generate a display of the records in the table TblSDeliveries associated with breweries with BrewerIDs 3, 4, and 9, we would click on the Filter Arrow next to BrewerID, deselect the check boxes for 2, 6, and 7, and click OK.

The Sort & Filter group in the Home tab also provides tools for sorting and filtering records in a table. To quickly sort all records in a table on values for a field, open the table to be sorted in Datasheet view, and click on any cell in the field to be sorted. Then click on Ascending to sort records from smallest to largest values in the sort field or on Descending to sort records from largest to smallest in the sort field.
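For readers who think in terms of query languages, single-field filtering and sorting of this kind can be sketched as follows, using Python’s built-in sqlite3 module as a stand-in for Access. The delivery rows are invented for illustration and do not reproduce the data in the Stinsons file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TblSDeliveries "
    "(SOrderNumber INTEGER PRIMARY KEY, BrewerID INTEGER, EmployeeID INTEGER)"
)
# Invented rows: (SOrderNumber, BrewerID, EmployeeID).
conn.executemany(
    "INSERT INTO TblSDeliveries VALUES (?, ?, ?)",
    [(1, 3, 135), (2, 6, 120), (3, 9, 135), (4, 4, 128), (5, 7, 120)],
)

# Filter on one value of a field: deliveries received by employee 135.
received_by_135 = [
    row[0]
    for row in conn.execute(
        "SELECT SOrderNumber FROM TblSDeliveries "
        "WHERE EmployeeID = 135 ORDER BY SOrderNumber"
    )
]

# Keep several values of a field (BrewerIDs 3, 4, and 9), sorted ascending on EmployeeID.
selected_brewers = conn.execute(
    "SELECT SOrderNumber, BrewerID, EmployeeID FROM TblSDeliveries "
    "WHERE BrewerID IN (3, 4, 9) ORDER BY EmployeeID ASC, SOrderNumber ASC"
).fetchall()

print(received_by_135)
print(selected_brewers)
```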

Note that different data types have different sort options.

Note that different data types have different filter options.

Clicking Selection in the Sort & Filter group will also filter on values of a single field.

FIGURE B.25 Upper Portion of the Relationships Workspace Showing the Relationships among TblBrewers, TblSOrders, and TblSDeliveries


Access also allows for simultaneous sorting and filtering through the Advanced function in the Sort & Filter group of the Home tab; the Advanced Filter/Sort display for the table TblSDeliveries is shown in Figure B.28. Once we have opened the table to be filtered and sorted in Datasheet view, we click on Advanced in the Sort & Filter group of the Home tab and then select Advanced Filter/Sort…. From this display, we double-click on the first field in the field list on which we wish to filter. The field we have selected will appear in the heading of the first column in the tabular display at the bottom of the screen. We can then indicate in the appropriate portion of this display the sorting and filtering to be done on this field. We continue this process for every field for which we want to apply a filter and/or sort, remembering that the sorting will be nested (the table will be sorted on the first sort field, and then the sort for the second sort field will be executed within each unique value of the first sort field, and so on).

FIGURE B.26 Pull-Down Menu for Sorting and Filtering Records in a Table with the Filter Arrow

FIGURE B.27 Top Rows of the Tabular Display of Results of Filtering

Suppose we wish to create a new tabular display of all records for deliveries from breweries with BrewerIDs of 4 or 7 for which fewer than 7 units were delivered, and we want the records in this display sorted in ascending order first on values of the field BrewerID and then on values of the field SQuantity Delivered. To execute these criteria, we perform the following steps:

Step 1. Click the Home tab in the Ribbon
Step 2. Click Advanced in the Sort & Filter group, and select Advanced Filter/Sort…
Step 3. In the TblSDeliveries box, double-click BrewerID to add this field to the first column in the lower pane of the screen
    Select Ascending in the Sort: row of the BrewerID column in the lower pane
    Enter 4 in the Criteria: row of the BrewerID column in the lower pane
    Enter 7 in the or: row of the BrewerID column in the lower pane
Step 4. In the TblSDeliveries box, double-click SQuantity Delivered to add this to the second column in the lower pane of the screen
    Select Ascending in the Sort: row of the SQuantity Delivered column in the lower pane
    Enter <7 in the Criteria: row of the SQuantity Delivered column in the lower pane
Step 5. Click Advanced in the Sort & Filter group of the Home tab
    Click Apply Filter/Sort

These steps produce the tabular display shown in Figure B.30. Note that the data, after being filtered to show only records that have values of 4 or 7 in the BrewerID field and deliveries of fewer than 7 units, are sorted first in ascending order on the BrewerID field. Within each unique value in the BrewerID field, the records are sorted in ascending order on the SQuantity Delivered field.
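The intended result of this combined filter and nested sort (BrewerID of 4 or 7, fewer than 7 units delivered, sorted first on BrewerID and then on SQuantity Delivered) can be sketched with Python’s built-in sqlite3 module. This is an analogue of the stated goal rather than a transcript of what Access executes, and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE TblSDeliveries '
    '(SOrderNumber INTEGER PRIMARY KEY, BrewerID INTEGER, "SQuantity Delivered" INTEGER)'
)
# Invented rows: (SOrderNumber, BrewerID, SQuantity Delivered).
conn.executemany(
    "INSERT INTO TblSDeliveries VALUES (?, ?, ?)",
    [(1, 7, 5), (2, 4, 6), (3, 4, 2), (4, 7, 9), (5, 3, 1), (6, 4, 7)],
)

# Filter first, then sort on BrewerID and, within each BrewerID, on quantity.
filtered_sorted = conn.execute("""
    SELECT BrewerID, "SQuantity Delivered"
    FROM TblSDeliveries
    WHERE BrewerID IN (4, 7)
      AND "SQuantity Delivered" < 7
    ORDER BY BrewerID ASC, "SQuantity Delivered" ASC
""").fetchall()

print(filtered_sorted)
```

Note that the delivery of exactly 7 units from brewer 4 is excluded, because the criterion is strictly fewer than 7.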

We can toggle between a display of the filtered/sorted data and a display of the original table by clicking on Toggle Filter in the Sort & Filter group of the Home tab.

FIGURE B.28 Advanced Filter/Sort Display for TblSDeliveries


Figure B.29 displays the lower pane of the Advanced Filter/Sort after Steps 1 to 4 have been completed.


NOTES + COMMENTS

1. We can use wildcard symbols when filtering by substituting an asterisk symbol (*) for any portion of the value of a field you want to represent with a wildcard. For example, if we wanted to create a table of information on all Stinson’s employees whose last names started with the letter B, we would filter the field EmpLastName in the table TblEmployees by entering B* in the Criteria: row of the Advanced Filter/Sort. This filter will return all records that have the combination of the first letter “B” and any other following characters in the EmpLastName field.
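The behavior of the asterisk wildcard can be tried out in a few lines of Python. The sketch below uses fnmatchcase from the standard library, which happens to use the same asterisk convention; the last names are hypothetical and are not taken from TblEmployees.

```python
from fnmatch import fnmatchcase

# Hypothetical values for the EmpLastName field.
last_names = ["Baker", "Adams", "Brown", "Cobb"]

# "B*" matches a leading B followed by any (possibly empty) run of characters.
starts_with_b = [name for name in last_names if fnmatchcase(name, "B*")]

print(starts_with_b)
```

“Cobb” is not returned even though it contains the letter b, because the pattern requires the B to be the first character.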

FIGURE B.29 Tabular Display of Criteria for Simultaneous Filtering and Sorting Using Advanced Filter/Sort

FIGURE B.30 Tabular Display of Filtered and Sorted Data Using Advanced Filter/Sort

B.4 Queries

Queries are a way of searching for and compiling data that meet specific criteria from one or more tables. They enable you to extract particular fields from a table or create a new table that combines information from several related tables.

Although there are similarities between queries and simple searches or filters, queries are far more powerful because they can be used to extract information from multiple tables.

For example, although you could use a search in the table TblBrewers to find the name of a brewer that supplies beer to Stinson’s or a filter on the table TblSOrders to view only orders placed by Stinson’s for kegs of beer, neither of those approaches would let you simultaneously view both the names of brewers and the orders placed for kegs of beer. However, you could easily run a query to create a record of every order Stinson’s has placed for kegs of beer that includes the name of the brewer and the corresponding order that was placed. By taking advantage of the relationships among the tables of a database, a well-designed query can yield information that would be cumbersome or difficult to discern by examining the data in individual tables.
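A query of this kind, combining brewer names with keg orders, might look like the following sketch in Python’s built-in sqlite3 module (an analogue, not Access). The field BrewerName, the values 'Keg' and 'Case' in the Keg or Case? field, and all order rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TblBrewers (BrewerID INTEGER PRIMARY KEY, BrewerName TEXT);
    CREATE TABLE TblSOrders (
        SOrderNumber        INTEGER PRIMARY KEY,
        BrewerID            INTEGER REFERENCES TblBrewers(BrewerID),
        "Keg or Case?"      TEXT,
        "SQuantity Ordered" INTEGER
    );
    -- Invented rows for illustration.
    INSERT INTO TblBrewers VALUES (3, 'Oak Creek Brewery'), (7, 'Midwest Fiddler Crab');
    INSERT INTO TblSOrders VALUES (1001, 3, 'Keg', 4), (1002, 3, 'Case', 8), (1003, 7, 'Keg', 2);
""")

# Combine the two tables through the common field BrewerID, keeping keg orders only.
keg_orders = conn.execute("""
    SELECT b.BrewerName, o.SOrderNumber, o."SQuantity Ordered"
    FROM TblSOrders AS o
    JOIN TblBrewers AS b ON b.BrewerID = o.BrewerID
    WHERE o."Keg or Case?" = 'Keg'
    ORDER BY o.SOrderNumber
""").fetchall()

print(keg_orders)
```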

Access allows for several types of queries. The three most commonly used are as follows:

• Select queries: These are the simplest and most commonly used queries; they are used to extract the subset of data from a table that satisfy one or more criteria. For example, Stinson’s Manager of Receiving may want to review a list of all deliveries received by Stinson’s that includes the Stinson’s employee who received each order over some period of time. A select query could be applied to the table TblSDeliveries (shown in the original database design illustrated in Figure B.1) to create the subset of this table containing only the fields SOrderNumber and EmployeeID.

• Action queries: These queries are used to change data in existing tables. For example, the sales manager may want to increase the prices charged to retailers by Stinson’s for the kegs of microbrews that Stinson’s sells. The sales manager can quickly make this change through an action query applied to the table TblSalesPrices to quickly perform these calculations and modify these prices in the database. Action queries allow the user to modify many records quickly and efficiently. Access provides four types of action queries:
    • Update allows the values of one or more fields in the result set to be modified.
    • Make table creates a new table based on the results of the query.
    • Append is similar to a make table query, except that the results of the query are appended to an existing table.
    • Delete deletes all the records in the results of the query from the underlying table.

• Crosstab queries: These perform calculations on information in a table. Stinson’s Manager of Receiving may be interested in how many kegs and cases of beer have been delivered to Stinson’s and which Stinson’s employee received the shipment. The manager could find this information by applying a crosstab query to the table TblSDeliveries (shown in the original database design in Figure B.1) to create a table that shows number of kegs and cases delivered by the Stinson’s employee who received the shipment.

We next review how to execute each of these types of queries in Access.

Select Queries

We start by considering the needs of Stinson’s Manager of Receiving, who wants to review a list of all deliveries received by Stinson’s and the Stinson’s employee who received the orders during some recent week. This requires us to perform a select query on the table TblSDeliveries to create a subset of this table that includes only the fields SOrderNumber and EmployeeID for deliveries to Stinson’s during the past week (the only week for which we have data in our new database) and display this subset in Datasheet view. To execute this select query, we take the following steps:

Step 1. Click the Create tab in the Ribbon (Figure B.31)
Step 2. Click Query Wizard in the Queries group
Step 3. When the New Query dialog box appears (Figure B.32)
    Select Simple Query Wizard
    Click OK

Action queries are also known as Data Manipulation Language (DML) statements.


FIGURE B.31 Create Tab in the Access Ribbon

FIGURE B.32 New Query Dialog Box

Step 4. When the next Simple Query Wizard dialog box appears (see Figure B.33):
    Select Table: TblSDeliveries in the Tables/Queries box
    Select the fields SOrderNumber and EmployeeID from the Available Fields: box and move these to the Selected Fields: box using the button (Figure B.33)
    Click Next >
Step 5. When the next Simple Query Wizard dialog box appears (Figure B.34):
    Select Detail (shows every field of every record)
    Click Next >
Step 6. When the final Simple Query Wizard dialog box appears (Figure B.35):
    Name our query by entering TblSDeliveries Employee Query in the What title do you want for your query? box
    Select Open the query to view information
    Click Finish

FIGURE B.33 First Step of the Simple Query Wizard

A query can be saved and used repeatedly. A saved query can be modified to suit the needs of future users.

FIGURE B.34 Second Step of the Simple Query Wizard and the Summary Options Dialog Box

FIGURE B.35 Final Step of the Simple Query Wizard

FIGURE B.36 Display of Results of a Simple Query

The display of the query results is provided in Figure B.36. Although Step 5 offers us the option of using the Simple Query Wizard to generate a summary display of the fields we selected, we use the Detail option here because the Manager of Receiving wants to review a list of all deliveries received by Stinson’s and the Stinson’s employee who received the orders during some recent week. See Figure B.34 for displays of the dialog boxes for this step of the Simple Query Wizard and Summary Options.
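In query-language terms, this select query keeps two fields from TblSDeliveries, and saving it for reuse is loosely comparable to defining a view. The sketch below shows the idea with Python’s built-in sqlite3 module; it is an analogue rather than what the Simple Query Wizard generates, and the delivery rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TblSDeliveries (
        SOrderNumber          INTEGER PRIMARY KEY,
        BrewerID              INTEGER,
        EmployeeID            INTEGER,
        "SQuantity Delivered" INTEGER
    );
    -- Invented rows for illustration.
    INSERT INTO TblSDeliveries VALUES (1001, 3, 135, 4), (1002, 7, 120, 6), (1003, 4, 135, 2);

    -- A saved query that keeps only the two fields of interest.
    CREATE VIEW "TblSDeliveries Employee Query" AS
        SELECT SOrderNumber, EmployeeID FROM TblSDeliveries;
""")

cursor = conn.execute('SELECT * FROM "TblSDeliveries Employee Query" ORDER BY SOrderNumber')
column_names = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(rows)
```

Because every record is listed, this corresponds to the detail display rather than a summary.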

FIGURE B.37 Pull-Down Menu of Options in the Navigation Panel

Note that in both Datasheet view and Design view, we now have a new tab for the query TblSDeliveries Employee Query. We can also change the Navigation Panel so that it shows a list of all queries associated with this database by using the Navigation Panel’s pull-down menu of options, as shown in Figure B.37.

Action Queries

Suppose that in reviewing the database system we are designing, Stinson’s Sales Manager notices that we have made an error in the table TblSalesPrices. She shares with us that the price she charges for a keg of beer that has been produced by the Midwest Fiddler Crab microbrewery (value of 7 for BrewerID) should be $240, not the $230 that we have entered in this table. We can use an action query applied to the table TblSalesPrices to quickly make this change. Because we want to modify all values of a field that meet some criteria, this is an update query. The Datasheet view of the table TblSalesPrices is provided in Figure B.38.

To make this pricing change, we take the following steps:

Step 1. Click the Create tab in the Ribbon
Step 2. Click Query Design in the Queries group. This opens the Query Design window and the Query Tools contextual tab (Figure B.39)

FIGURE B.38 Datasheet View of TblSalesPrices


Once saved, a query can be modified and saved again to use later.

FIGURE B.39 Query Tools Contextual Tab

FIGURE B.40 Display of Information for the Update Query

Step 3. When the Show Table dialog box appears, select TblSalesPrices and click Add
    Click Close
Step 4. In the TblSalesPrices box, double-click on KegSalesPrice. This opens a column labeled KegSalesPrice in the Field: row at the bottom pane of the display
    Click Update in the Query Type group of the Design tab
    Enter 240 in the Update To: row of the KegSalesPrice column in the bottom pane of the display
Step 5. In the TblSalesPrices box, double-click on BrewerID to open a second column in the bottom pane of the display labeled BrewerID
    Enter 7 in the Criteria: row of the BrewerID column (Figure B.40)
Step 6. Click the Run button in the Results group of the Design tab
    When the dialog box alerting us that we are about to update one row of the table appears, click Yes

Once we click Yes in the dialog box, the price Stinson’s charges for a keg of beer produced by the Midwest Fiddler Crab microbrewery (BrewerID equal to 7) in the table TblSalesPrices is changed from $230.00 to $240.00.

Step 7. To save this query, click the Save icon in the Quick Access toolbar
    When the Save As dialog box opens (Figure B.41), enter the name Change Price per Keg Charged by a Microbrewery for Query Name:
    Click OK

Opening the table TblSalesPrices in Datasheet view (Figure B.42) shows that the price Stinson’s charges for a keg of beer produced by the Midwest Fiddler Crab microbrewery (BrewerID equal to 7) has been revised from $230 to $240.
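The update query above corresponds to a single UPDATE statement. The following minimal sketch uses Python’s built-in sqlite3 module rather than Access; the row for BrewerID 3 and its price are invented, while the $230 to $240 change for BrewerID 7 follows the example. Checking the number of affected rows before committing plays a role similar to the confirmation dialog box.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TblSalesPrices (BrewerID INTEGER PRIMARY KEY, KegSalesPrice REAL);
    -- The price for BrewerID 3 is invented; 230 for BrewerID 7 is the value to correct.
    INSERT INTO TblSalesPrices VALUES (3, 250.0), (7, 230.0);
""")

# Update query: set KegSalesPrice to 240 for records that meet the criterion BrewerID = 7.
cursor = conn.execute(
    "UPDATE TblSalesPrices SET KegSalesPrice = ? WHERE BrewerID = ?",
    (240.0, 7),
)
rows_changed = cursor.rowcount

# Keep the change only if exactly one row was affected; otherwise undo it.
if rows_changed == 1:
    conn.commit()
else:
    conn.rollback()

prices = dict(conn.execute("SELECT BrewerID, KegSalesPrice FROM TblSalesPrices"))
print(rows_changed, prices)
```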

Step 4 produces the TblSalesPrices box that contains a list of fields in this table.

Crosstab Queries

We use crosstab queries to summarize data in one field by values of one or more other fields. In our example, we will consider an issue faced by Stinson’s Inventory Manager, who wants to know how many kegs and cases of beer have been ordered by each Stinson’s employee from each microbrewery. To provide the manager with this information, we apply a crosstab query to the table TblSOrders (shown in the original database design illustrated in Figure B.1) to create a table that shows the number of kegs and cases ordered by each Stinson’s employee from each microbrewery. To create this crosstab query, we take the following steps:

Step 1. Click the Create tab in the Ribbon
Step 2. Click Query Design in the Queries group. This opens the Query Design window and the Query Tools contextual tab
Step 3. When the Show Table dialog box opens, select TblSOrders, click Add, then click Close
Step 4. In the TblSOrders box, double-click BrewerID, Keg or Case?, and SQuantity Ordered to add these fields to the columns in the lower pane of the window
Step 5. In the Query Type group of the Design tab, click Crosstab
Step 6. In the BrewerID column of the window’s lower pane,
    Select Row Heading in the Crosstab: row
    Select Ascending in the Sort: row
Step 7. In the Keg or Case? column of the window’s lower pane,
    Select Column Heading in the Crosstab: row
    Select Ascending in the Sort: row
Step 8. In the SQuantity Ordered column of the window’s lower pane,
    Select Sum in the Total: row
    Select Value in the Crosstab: row
Step 9. In the Results group of the Design tab, click the Run button to execute the crosstab query

Step 3 produces the TblSOrders box in Access that contains a list of fields in this table.

FIGURE B.41 Save As Dialog Box

FIGURE B.42 TblSalesPrices in Datasheet View After Running the Update Query

Figure B.43 displays the results of completing Steps 1 to 8 to create our crosstab query. In the first column, we have indicated that we want values of the field BrewerID to act as the row headings of our table (in ascending order), whereas in the second column we have indicated that we want values of the field Keg or Case? to act as the column headings of our table (again, in ascending order). In the third column, we have indicated that values of the field SQuantity Ordered will be summed for every combination of row (value of the field BrewerID) and column (value of the field Keg or Case?).

The results of the crosstab query appear in Figure B.44. From Figure B.44, we see that we have ordered 8 cases and 10 kegs of beer from the microbrewery with a value of 3 for the BrewerID field (the Oak Creek Brewery).

Step 10. To save the results of this query, click the Save icon in the Quick Access toolbar
    When the Save As dialog box opens, enter Brewer Orders Query for Query Name:
    Click OK
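The crosstab built in Steps 1 to 9 (BrewerID as row heading, Keg or Case? as column heading, and the sum of SQuantity Ordered as the value) can be approximated with conditional sums in Python’s built-in sqlite3 module. This is an analogue, not the query Access generates. The individual order rows are invented; those for BrewerID 3 were chosen only so that the totals agree with the 8 cases and 10 kegs reported for the Oak Creek Brewery, and the values 'Keg' and 'Case' are assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TblSOrders (
        SOrderNumber        INTEGER PRIMARY KEY,
        BrewerID            INTEGER,
        "Keg or Case?"      TEXT,
        "SQuantity Ordered" INTEGER
    );
    -- Invented rows for illustration.
    INSERT INTO TblSOrders VALUES
        (1001, 3, 'Keg', 6), (1002, 3, 'Keg', 4), (1003, 3, 'Case', 8),
        (1004, 7, 'Keg', 5), (1005, 7, 'Case', 3);
""")

# Rows: BrewerID. Columns: one conditional sum per value of "Keg or Case?".
crosstab = conn.execute("""
    SELECT BrewerID,
           SUM(CASE WHEN "Keg or Case?" = 'Case' THEN "SQuantity Ordered" ELSE 0 END) AS Cases,
           SUM(CASE WHEN "Keg or Case?" = 'Keg'  THEN "SQuantity Ordered" ELSE 0 END) AS Kegs
    FROM TblSOrders
    GROUP BY BrewerID
    ORDER BY BrewerID ASC
""").fetchall()

print(crosstab)
```

Unlike an action query, this summary reads the data without changing any record in TblSOrders.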

FIGURE B.43 Display of Design of the Crosstab Query

FIGURE B.44 Results of Crosstab Query

B.5 Saving Data to External Files

Access can export data to external files in formats that are compatible with a wide variety of software. To export the information from the table TblSOrders to an external Excel file, we take the following steps:

Step 1. Click the External Data tab in the Ribbon (Figure B.45)
Step 2. In the Navigation Panel, click TblSOrders
Step 3. In the Export group of the External Data tab, click the Excel icon
Step 4. When the Export—Excel Spreadsheet dialog box opens (Figure B.46), click the Browse… button
    Find the destination where you want to save your exported file and then click the Save button
    Verify that the correct path and filename are listed in the File Name: box (TblSOrders.xlsx in this example)
    Verify that the File format: is set to Excel Workbook (*.xlsx)
    Select the check boxes for Export data with formatting and layout. and Open the destination file after the export operation is complete.
    Click OK

The preceding steps export the table TblSOrders from Access into an Excel file named TblSOrders.xlsx. Exporting information from a relational database such as Access to Excel allows one to apply the tools and techniques covered throughout this textbook to a subset of a large data set. This can be much more efficient than using Excel to clean and filter large data sets.
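The same idea, writing a table out to a file that a spreadsheet program can open, can be sketched with Python’s standard library. Note the assumptions: the sketch writes a CSV file rather than an .xlsx workbook (the standard library has no Excel writer), it uses sqlite3 in place of Access, it writes to a temporary folder, and the order rows are invented.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TblSOrders (
        SOrderNumber        INTEGER PRIMARY KEY,
        BrewerID            INTEGER,
        "Keg or Case?"      TEXT,
        "SQuantity Ordered" INTEGER
    );
    -- Invented rows for illustration.
    INSERT INTO TblSOrders VALUES (1001, 3, 'Keg', 4), (1002, 7, 'Case', 8);
""")

with tempfile.TemporaryDirectory() as export_folder:
    export_path = Path(export_folder) / "TblSOrders.csv"

    # Write the field names as a header row, followed by one row per record.
    cursor = conn.execute("SELECT * FROM TblSOrders ORDER BY SOrderNumber")
    header = [description[0] for description in cursor.description]
    with export_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(cursor)

    # Read the file back to verify what was exported (CSV stores everything as text).
    with export_path.open(newline="", encoding="utf-8") as handle:
        exported = list(csv.reader(handle))

print(exported)
```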

You can open the file Stinsons and follow these steps to reproduce an external Excel file of the data in TblSOrders.

After we complete Step 4, another dialog box asks us if we want to save the steps we used to export the information in this table; this can be useful if we have to export similar data again.

NOTES + COMMENTS

1. Action queries permanently change the data in a database, so we suggest that you back up the database before performing an action query. After you have reviewed the results of the action query and are satisfied that the query worked as desired, you can then save the database with the results of the action query. Some cautious users save the original database under a different name so that they can revert to the original preaction query database if they later find that the action query has had an undesirable effect on the database.

2. Crosstab queries do not permanently change the data in a database.

3. The Make Table, Append, and Delete action queries work in manners similar to Update action queries and are also useful ways to modify tables to better suit the user’s needs.
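The other three action queries have rough counterparts in SQL. The sketch below, written with Python’s built-in sqlite3 module as an analogue of Access, runs a make table, an append, and a delete in turn; the table TblOrderArchive and all rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TblSOrders (
        SOrderNumber        INTEGER PRIMARY KEY,
        BrewerID            INTEGER,
        "SQuantity Ordered" INTEGER
    );
    -- Invented rows for illustration.
    INSERT INTO TblSOrders VALUES (1001, 3, 4), (1002, 7, 2), (1003, 3, 8);

    -- Make table: create a new table from the results of a query.
    CREATE TABLE TblOrderArchive AS
        SELECT * FROM TblSOrders WHERE BrewerID = 3;

    -- Append: add the results of a query to an existing table.
    INSERT INTO TblOrderArchive
        SELECT * FROM TblSOrders WHERE BrewerID = 7;

    -- Delete: remove the records selected by the criterion from the underlying table.
    DELETE FROM TblSOrders WHERE BrewerID = 7;
""")

archive_count = conn.execute("SELECT COUNT(*) FROM TblOrderArchive").fetchone()[0]
orders_count = conn.execute("SELECT COUNT(*) FROM TblSOrders").fetchone()[0]

print(archive_count, orders_count)
```

As with the update query, these statements change the stored data, so the advice in the first note about backing up the database applies to them as well.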

FIGURE B.45 External Data Tab in Access


FIGURE B.46 Export—Excel Spreadsheet Dialog Box

SUMMARY

The amount of data available for analyses is increasing at a rapid rate, and this trend will not change in the foreseeable future. Furthermore, the data used by organizations to make decisions are dynamic, and they change rapidly. Thus, it is critical that a data analyst understand how data are stored, revised, updated, retrieved, and manipulated. We have reviewed tools in Microsoft Access® that can be used for these purposes.

In this appendix we have reviewed the basic concepts of database creation and management that are important to consider when using data from a database in an analysis. We have discussed several ways to create a database in Microsoft Access®, and we have demonstrated Access tools for preparing data in an existing database for analysis. These include tools for reading data from external sources into tables, creating relationships between tables, sorting and filtering records, designing and executing queries, and saving data to external files.

GLOSSARY

Action queries Queries that are used to change data in existing tables. The four types of action queries available in Access are update, make table, append, and delete. Crosstab queries Queries that are used to summarize data in one field across the values of one or more other fields.

Glossary 773

Database A collection of logically related data that can be retrieved, manipulated, and updated to meet a user’s or organization’s needs.

Datasheet view A view used in Access to control a database; provides access to tables, reports, queries, forms, etc. in the database that is currently open. This view can also be used to create tables for a database.

Design view A view used in Access to define or edit a database table’s fields and field properties as well as to rearrange the order of the fields in the database that is currently open.

Entity integrity The rule that establishes that a table has no duplicate records. Entity integrity can be enforced by assigning a unique primary key to each record in a table.

Fields The variables or characteristics for which data have been collected from the records.

Foreign key field A field that is permitted to have multiple records with the same value.

Form An object that is created from a table to simplify the process of entering data.

Leszynski/Reddick guidelines A commonly used set of standards for naming database objects.

Many-to-many Sometimes abbreviated as ∞:∞, a relationship for which a value in the common field for a record in one table (say, Table A) can match the value in the common field for multiple records in another table (say, Table B), and a value in the common field for a record in Table B can match the value in the common field for several records in Table A.

One-to-many Sometimes abbreviated as 1:∞, a relationship between tables for which a value in the common field for a record in one table (say, Table A) can match the value in the common field for multiple records in another table (say, Table B), but a value in the common field for a record in Table B can match the value in the common field for at most a single record in Table A.

One-to-one Sometimes abbreviated as 1:1, a relationship between tables for which a value in the common field for a record in one table (say, Table A) can match the value in the common field for at most one record in another table (say, Table B), and a value in the common field for a record in Table B can match the value in the common field for at most a single record in Table A.

Orphaned A record in a table that has a value for the foreign key field of a table that does not match the value in the primary key field for any record of a related table. Enforcing referential integrity prevents the creation of orphaned records.

Primary key field A field that must have a unique value for each record in the table and is used to identify how records from several tables in a database are logically related.

Query A question posed by a user about the data in the database.

Records The individual units from which the data for a database have been collected.

Referential integrity The rule that establishes the proper relationship between two tables.

Report Output from a table or a query that has been put into a specific prespecified format.

Select queries Queries that are used to extract the subset of data that satisfy one or more criteria from a table.

Table Data arrayed in rows and columns (similar to a worksheet in an Excel spreadsheet) in which rows correspond to records and columns correspond to fields.
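The glossary entries for entity integrity, referential integrity, and orphaned records can be illustrated with a small sketch. It uses Python's built-in sqlite3 module rather than Access, and the table names (customers, orders), column names, and helper functions are hypothetical choices made only for this illustration. The two tables form a one-to-many relationship: one customer can match several orders.

```python
import sqlite3


def build_demo():
    """Create a hypothetical in-memory database with a one-to-many relationship."""
    conn = sqlite3.connect(":memory:")
    # SQLite checks foreign keys only when this setting is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    # Primary key field: each customer_id must be unique (entity integrity).
    conn.execute(
        "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    # Foreign key field: many orders may share the same customer_id.
    conn.execute(
        """CREATE TABLE orders (
               order_id INTEGER PRIMARY KEY,
               customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
               amount REAL)"""
    )
    conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
    conn.execute("INSERT INTO orders VALUES (10, 1, 25.0)")
    conn.execute("INSERT INTO orders VALUES (11, 1, 40.0)")
    return conn


def try_insert(conn, sql):
    """Return True if the insert succeeds, False if an integrity rule rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    conn = build_demo()
    # A duplicate primary key would violate entity integrity.
    print(try_insert(conn, "INSERT INTO customers VALUES (1, 'Bob')"))
    # An order for a nonexistent customer would be an orphaned record.
    print(try_insert(conn, "INSERT INTO orders VALUES (12, 99, 5.0)"))
    # A further order for an existing customer is allowed (one-to-many).
    print(try_insert(conn, "INSERT INTO orders VALUES (12, 1, 5.0)"))
```

In Access the same rules are set through the table design and the relationship settings instead of SQL statements; the sketch only shows the behavior the definitions describe.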

Index

A Absolute references, 484 Accuracy, 426 Addition law, 170–172 Additivity, linear optimization, 561 Advanced analytics, 11, 15 Advertising campaign planning, 584–589 Alliance Data Systems, 295 All-integer linear program, 607–608 All-time movie box office data, 136–137 Alternative hypotheses, 250–253 Alternative optimal solutions

in binary optimization, 628–630 linear optimization problems, 589–591 for linear programs, 571–572, 589–591

Alumni giving case study, linear regression, 369–371 Analytical methods and models, 6–7 Antecedent, 148 Arcs, 582 Arithmetic mean, 39–40 Association rules, 148–151

evaluation of, 150–151 measure of confidence of, 149

Auditing, spreadsheet models, 487–491 Autoregressive models, 397 Average error, prediction accuracy, 431 AVERAGE function, 185, 262

B Backward elimination procedure, 342 Bagging, 446–447 Bank location problem, 621–623 Bar charts, 106–108 Base-case scenario, 502–503 Bass forecasting model, 663–666 Bayes’ theorem, 178–180, 695–698 Bell-shaped distribution, 50–52, 199–203 Bernoulli distribution, 552 Best-case scenario, 503 Best subsets procedure, 343 Beta distribution, random variable, 524–527, 549–550 Bias, 431 Bid fraction values, 522–527 Big data

and confidence intervals, 273–275 defined, 7, 15 estimation, 270–271 hypothesis testing, 275–277 overview, 7–10 and p-values, 275–276 and sampling error, 272–273 statistical inference and, 268–277 uses of, 10 variety of, 9, 271

velocity of, 9, 271 veracity of, 9, 271 volume of, 9, 271

Bimodal data, 41 Binary integer linear program, 608 Binary term-document matrix, 152 Binary variables, 142 Binary variables, integer linear programming

applications, 616–626 bank location, 621–623 capital budgeting, 616–618 fixed cost, 618–621 modeling flexibility, 626–628 optimization alternatives, 628–630 product design and market share optimization, 623–626

Binding constraints, 566–567 BINOM.DIST function, 190 Binomial probability distribution, 188–190, 552 Bins, frequency distributions

limits, 33 number of, 32 width of, 32–33

Boosting method, 447 Box plots, distribution analysis, 52–55 Branch-and-bound algorithm, integer linear programming, 611 Branches, 681 Branch probabilities, computation, with Bayes’ theorem, 695–698 Breakpoint, nonlinear relationships, 335 Bubble charts, 109–111 Business analytics

decision making and, 679 defined, 5, 15 demand for, 10 methods and models, 6–7 in practice, 11–14 role of, 4 spectrum of, 11

Business cycles, 382

C Capital budgeting problem, 616–618 Categorical data

defined, 21 frequency distributions for, 29–30

Categorical independent variables, linear regression, 325–329 Categorical outcomes

classification of, 425–431 with classification trees, 439–444 with k-nearest neighbors, 436–438

Causal forecasting, 401–404 Causal variables, 404–405 Census, 19, 221–222 Central limit theorem, 234 Centroid linkage, 143, 144 Chance events, 680

Chance nodes, decision trees, 681 Charts

advanced, 117–120 bar, 106–108 bubble, 109–111 clustered column (clustered bar), 113 column, 106–107 defined, 99 dendrograms, 144–146 Excel, 302–303 geographic information systems, 120–122 line, 88, 89, 102–106 multiple-column, 113, 114 for multiple variables, 112–115 pie, 107, 109 PivotCharts, 115–117 scatter-chart matrix, 113–115 scatter charts, 99–101 stacked-bar, 112–113 stacked-column, 112–113 vs. tables, 88–89 three-dimensional, 107

Cincinnati Zoo & Botanical Gardens, 83–84, 120–121 Class 0 error rate, 426–431 Class 1 error rate, 426–431 Classification, 423

of categorical outcomes, 425–431 error rates vs. cutoff value, 429 performance, supervised learning, 425–432 probabilities, 427

Classification and regression trees (CART), 439–450 Cloud computing, 4 Cluster analysis, 140–148

hierarchical clustering, 140, 143–147 k-means clustering, 140, 146–147 measuring similarity between observations, 140–142 uses of, 140

Clustered-column (clustered-bar) chart, 112–113, 123 Coefficient of determination, 306–307 Coefficient of variation, 47 Column charts, 106–107 Complement of an event, 169–170 Complete linkage clustering method, 144 Component ordering, spreadsheet models, 484 Concave function, 653 Conditional constraint, 628 Conditional probability, 172–180

Bayes’ theorem, 178–180, 697 independent events, 177 multiplication law, 177

Confidence coefficient, 244–245 Confidence interval

and big data, 273–275 individual regression parameters, 318–321 statistical inference, 244–245, 264–265

Confidence level, 320–321 Confidence, of association rules, 149 Confusion matrix, 426, 428 Conjoint analysis, 623–626 Consequent, 148 Conservative approach, nonprobability decision

analysis, 683 Constrained problem, nonlinear optimization models,

648–650

Constraints, 557 binding, 566–567 conditional, 628 corequisite, 628 greater-than-or-equal-to, 576 k out of n alternatives, 627 linear optimization model, 557, 559–561 multiple-choice, 626–627 mutually exclusive, 626–627 shadow price for, 576

Continuous outcomes estimation of, 431–432

with k-nearest neighbors, 438–439 with regression tree, 445–446

Continuous probability distributions, 194–206, 505, 522, 549–551 exponential, 203–206 normal, 198–203 triangular, 196–198 uniform, 194–196

Continuous random variables, 181–182, 194 Controllable inputs, 502 Convex function, 653–654 Convex hull, 610 Corequisite constraint, 628 Corpus, 151 Correlation coefficient, 60–61 COUNTA function, 267 COUNTBLANK function, 63 COUNT function, 248, 262 Covariance, 57–58 Coverage error, 269 Cross-sectional data, 21 Crosstabulation, 90–92 Cumulative distributions, 37–38 Cumulative frequency distribution, 37–38 Cumulative lift chart, 428–429 Current Population Survey (CPS), 19 Custom discrete probability distributions, 182–184, 552 Cutoff value, 427, 428 Cutting plane, integer linear programming, 611 Cyclical pattern, time series analysis, 382

D Data

bimodal, 41 cross-sectional, 21 defined, 19 distributions from, 29–38

cumulative, 37–38 frequency distributions for categorical data, 29–30 frequency distributions for quantitative data, 31–34 histograms, 34–37 relative and percent frequency distributions, 30–31

Excel, 24–29 conditional formatting, 27–29 sorting and filtering data, 24–27

historical, 374 multimodal, 41 overview of using, 19–20 population and sample, 21 quantitative and categorical, 21 sources, 21–22

Data (continued) tall, 271 time series, 21 types of, 21–23 wide, 271

Data cleansing Blakely Tires, 63–64 identification of erroneous outliers and other erroneous values, 65–67 missing, 61–63 variable representation, 67–68

Data dashboards applications of, 123–124 defined, 6, 15, 122 principles of effective, 123

Data-driven decision making, 3–5 Data exploration, 423 Data-ink ratio, 85–87 Data mining

analytics case study, 139 cluster analysis, 140–148 data preparation, 423 defined, 6, 15 Grey Code Corporation case study, 462 Orbitz case study, 423 steps in, 423–424 supervised learning, 423–450

classification and regression trees, 439–449 data exploration, 423 data partitioning, 423–425 data preparation, 423 data sampling, 423–425 k-nearest neighbors, 436–439 logistic regression model, 432–436 model assessment, 424 model construction, 424 overview, 450 performance measures, 425–432

unsupervised learning, 139–165 association rules, 148–151 cluster analysis, 140–148 text mining, 151–154

Data partitioning, 423–425 Data preparation, data mining, 423 Data query, 6, 15 Data sampling, 423–425 Data scientists, 10, 15 Data security, 9–10, 15 Data set, 424–425 Data tables, 471–473 Data visualization, 82–137

advanced techniques, 117–122 case study, 136–137 charts, 99–117 Cincinnati Zoo & Botanical Gardens, 83–84, 120–121 data dashboards, 122–124 effective design techniques, 85–87 heat maps, 110–112 overview of, 85–87 tables, 88–89

Decile-wise lift chart, 429–430 Decision alternatives, 680 Decision analysis, 7, 678–722

branch probabilities with Bayes’ theorem, 695–698

defined, 15 Phytopharm example, 679–680 with probabilities, 685–688

expected value approach, 685–687 risk analysis, 687–688 sensitivity analysis, 688

problem formulation, 680–682 decision trees, 681–682 payoff tables, 681

property purchase strategy case study, 721–722 with sample information, 689–695

expected value of perfect information, 694–695 expected value of sample information, 694

uses of, 679 utility and, 699–703 utility theory, 698–707 without probabilities, 682–685

conservative approach, 683 minimax regret approach, 683–685 optimistic approach, 682–683

Decision making business analytics and, 679 data-driven, 3–5 defined, 4 managerial, 295–296 overview, 4–5 uncertainty in, 501, 502, 679

Decision nodes, 681 Decision strategy, 691 Decision trees, 681–682 Decision variables, 20, 469, 559–560, 577, 582 Dendrograms, 144–146 Dependent variable, 295, 321–322, 395 Descriptive analytics, 6, 20 Descriptive data mining, 139–165

association rules, 148–151 case study, 164–165 cluster analysis, 140–148 text mining, 151–154

Descriptive statistics, 18–80 case study, 79–80 cross-sectional and time series data, 21 data cleansing

Blakely Tires, 63–64 identification of erroneous outliers and other erroneous

values, 65–67 missing, 61–63 variable representation, 67–68

data definitions and goals, 19–20 data distribution creation, 29–38 data sources, 21–22 distribution analysis, 47–55 Excel data modification, 24–29 measures of association between two variables, 55–61 measures of location, 39–44 measures of variability, 44–47 population and sample data, 21 quantitative and categorical data, 21 U.S. Census Bureau, 19

Dimension reduction, 67 Discrete-event simulation, 533 Discrete probability distributions, 182–193, 551–555

custom, 182–184

expected value and variance, 184–187 risk analysis, 505 uniform, 187–188

Discrete random variables, 180–181, 184–187, 194 Discrete uniform distribution, 551–552 Distribution analysis, 47–55

box plots, 52–55 empirical rule, 50–52 outlier identification, 52 percentiles, 48 quartiles, 49 z-scores, 49–51

Divisibility, linear optimization, 561 Double-subscripted decision variables, 582 Dow Jones Industrial Average, 19 Dow Jones Industrial Index, 19, 20, 22

E Efficient frontier, 662 Element, 21 Empirical probability distribution, 182 Empirical rule, 50–52 Ensemble methods, data mining, 446–449 Erroneous outliers, identification of, 65–67 Error Checking, 489–491 Estimated multiple regression equation, 308–309 Estimated regression equation, 296–300

using Excel to compute, 302–303 Estimated regression line, 296 Euclidean distance, 140–141, 143 Evaluate Formulas, 489–490 Events

chance, 680 complement of an event, 169–170 defined, 168 independent events, 177 intersection of, 170, 171 mutually exclusive, 171–172 probabilities and, 168–169 union of, 170

Excel AVERAGE function, 185, 262 BINOM.DIST function, 190 charts, 99–117 chart tools, 302–303 coefficient of determination computation using, 307 CORREL function, 61 COUNTA function, 267 COUNTBLANK function, 63 COUNT function, 248, 262 COUNTIF function, 483–485 data modification in, 24–29

conditional formatting, 27–29 sorting and filtering data, 24–27

Data Tables, 471–473 estimated regression equation using, 302–303 EXPON.DIST function, 205 exponential smoothing with, 393–395 forecasting with, 389–390 Forecast Sheet, 416–421 frequency distributions for quantitative data, 33–34

GEOMEAN, 43 Goal Seek, 473, 475, 476 histograms, 34–37 hypothesis testing, 257–259, 266–268 IF function, 483–485 interval estimation, 245–249 MAX function, 45 MIN function, 45 MODE.MULT function, 41 MODE.SNGL function, 41 multiple regression using, 310–313 NORM.DIST function, 201 NORM.INV function, 202–203 PivotCharts, 115–117 PivotTables, 93–99, 173–175 POISSON.DIST function, 192–193 RAND function, 506–510 random variables generation, 506–510 regression analysis using, 404 Regression tool, 317–319, 322 simulation trials, 510, 511 sort procedure, 63 spreadsheet modeling functions, 464–499 spreadsheet models, 470 STANDARDIZE function, 50 STDEV function, 262 STDEV.S function, 47 SUM function, 481 SUMPRODUCT function, 184, 186, 481, 564–565 T.DIST function, 257 variance calculation, 186–187 VLOOKUP function, 485–486

Excel Solver integer optimization problems, 611–615 linear programs, 564–567 nonlinear optimization problems, 650–651 overcoming local optima, 655–656 Sensitivity Report, 575–577

Expected utility (EU), 702 Expected value (EV), 184–185, 685–687 Expected value approach, 685–687 Expected value of perfect information (EVPI), 694–695 Expected value of sample information (EVSI), 694 Experimental studies, 21–22 Experiments, random, 168–169 EXPON.DIST function, 205 Exponential distribution, 550 Exponential probability distribution, 203–206 Exponential smoothing, 391–395 Exponential utility functions, 706–707 Extreme points, 563–564

F False negative, 426 False positive, 426 Feasibility table, 627 Feasible regions

integer linear optimization models, 610 linear optimization models, 561, 562 nonlinear optimization models, constrained problem, 648–650

Feasible solution, 561

Features, 423 Financial analytics, 11 Finite population

defined, 222 sampling from, 223–224

Fitted distribution, bid fraction data, 522–527 Fixed-cost problem, 618–621 Forecast error, 383–385 Forecasting, 373–374. See also Time series analysis

ACCO Brands, 373 accuracy, 382–386

exponential smoothing forecasting, 393–395 moving averages forecasting, 390–391

Bass forecasting model, 663–666 causal or exploratory, 374 Excel Forecast Sheet, 416–421 exponential smoothing, 391–395 food and beverage sales case study, 415 model selection and criteria, 405–406 moving averages, 386–391 nonlinear optimization models, new product adoption, 663–666 qualitative methods, 374 quantitative methods, 374 regression analysis, 395–405

causal forecasting, 401–404 combining causal variables with trend and seasonality effects,

404–405 linear trend projection, 395–397

regression and limitations of, 405 seasonality, 397–401

Forecast Sheet (Excel), 416–421 Forward selection procedure, 342 Frame, 222 Frequency distributions

for categorical data, 29–30 cumulative, 37–38 percent, 30–31 for quantitative data, 31–34 relative, 30–31

Frequency term-document matrix, 154 F1 Score, 430

G Gamma distribution, random variable, 550 General Electric (GE), 10

case study, 557 Geographic information systems charts, 120–122 Geometric approach, to solving linear program, 562–565 Geometric mean, 41–44 Global maximum, 652 Global minimum, 653 Global optimum, nonlinear optimization problems, 652–657 Goal Seek (Excel), 473, 475, 476 Government, use of analytics by, 13 Greater-than-or-equal-to constraint, 576 Group average linkage clustering method, 143, 144 Growth factor, 42

H Hadoop, 9, 15 Half spaces, 562

Health care analytics, 12–13 Heat maps, 110–112 Hierarchical clustering, 140, 143–147 Histograms, 34–37 Historical data, 374 Holdout method, 344 Horizontal pattern, time series analysis, 375–377 Human resource (HR) analytics, 12 Hypergeometric distribution, 553 Hypothesis tests, 250–268

and big data, 275–277 individual regression parameters, 318–321 interval estimation and, 264–265 null and alternative hypotheses, 250–253 one-tailed tests, 254–255, 260 of population mean, 254–265 of population proportion, 265–268 steps of hypothesis testing, 263 summary and practice advice for, 263 two-tailed tests, 260–263 Type I and Type II errors, 253–254, 259 using Excel, 257–259, 266–268

I Illegitimately missing data, 62 Impurity, 439 Imputation, 62 Independent events, 177

multiplication law for, 177 Independent variables, 295–296, 395

categorical, 325–329 interaction between, 337–341 nonsignificant, 321–322 in regression analysis, 322 variable selection procedures, 342–343

Infeasibility, in linear programming problems, 572–573 Infinite population, sampling from, 224–226 Influence diagrams, 466–468 Integer linear optimization models, 606–644

Applecore Children’s Clothing case study, 643–644 binary variables, 616–630

bank location, 621–623 capital budgeting, 616–618 fixed cost, 618–621 modeling flexibility, 626–628 optimization alternatives, 628–630 product design and market share optimization, 623–626

Eastborne Realty example, 608–615 Excel Solver, 611–615 geometry of, 609–611 Petrobras case study, 607 sensitivity analysis and, 614–615 types of, 607–608

Integer linear programs, 607 Integer uniform distribution, 551 Interaction, between independent variables, 337–341 Internet of Things (IoT), 10, 15 Interval estimation, 240–250

defined, 240 hypothesis testing and, 264–265 of population mean, 240–247 of population proportion, 247–249 using Excel, 245–249

Investment portfolio selection, 578–580

J Jaccard’s coefficient, 142 John Morrell & Company, 221 Joint probabilities, 175, 178

K Key performance indicators (KPIs), 12 k-fold cross-validation, 344 k-means clustering, 140, 146–147 k-nearest neighbors, 436–439, 449

classifying categorical outcomes with, 436–438 estimating continuous outcomes with, 438–439

Knot, nonlinear relationships, 335 k out of n alternatives constraint, 627 Kroger, 10

L Lagrangian multiplier, 651 Least squares method, 298–303

estimates of regression parameters, 300–302 multiple regression and, 309

Least squares regression model, 314–318 Leave-one-out cross-validation, 344 Legitimately missing data, 61–62 Level of significance, 244, 254, 259, 264 Lift charts, 429–430 Lift ratio, 149–150 Linear functions, 561 Linear optimization models, 556–605. See also Integer linear

optimization models advertising campaign planning, 584–589 alternative optimal solutions, 571–572, 589–591 applications of, 557 decision variable, 559–560 Excel Solver, 564–567 General Electric case study, 557 infeasibility, 572–573 investment portfolio selection, 578–580 investment strategy case study, 604–605 linear programming notation and examples, 577–589 linear programming outcomes, 570–575 M&D Chemicals problem, 568–570, 578 sensitivity analysis, 575–577 simple maximization problem, 558–561

mathematical model, 561 problem formulation, 559–560

simple minimization problem, 568–575 solving Par, Inc. problem, 561–567 transportation planning, 580–584 unbounded solutions, 573–574

Linear programming model, 561 notation and examples, 577–589

Linear programs, 561 Excel Solver for, 564–567 geometric approach to solving, 562–565

Linear regression, 294–371 Alliance Data Systems, 295 case study, 369–370 categorical independent variables, 325–329 fit assessment, simple model, 304–308 inference and, 313–325

individual regression parameters, 318–321

least squares regression model, 314–318 multicollinearity, 322–324 nonsignificant independent variables, 321–322 very large samples, 344–347

least squares method, 298–303 model fitting, 342–344

variable selection procedures, 342–343 modeling nonlinear relationships, 330–342

interaction between independent variables, 337–341 piecewise linear regression models, 335–337 quadratic regression models, 331–335

multiple, 450 multiple regression model, 308–313 simple linear regression model, 296–298

Linear trend projection, 395–397 Line charts, 88, 89, 102–106 Local maximum, 652 Local minimum, 652 Local optimum, nonlinear optimization problems, 652–657 Location problem, 621

integer linear optimization models, 621–623, 628–630 Markowitz mean-variance portfolio model, 658–661 nonlinear optimization model, 657–658

Logistic function, 434 Logistic regression, 432–436, 450 Logistic S-curve, 434–435 Log-normal distribution, 551 Lower-tail tests, 260 LP Relaxation, 608

M MagicBand, 10 Make-versus-buy decision, 469

spreadsheet models, 466 Mallows’ Cp statistic, 436 Managerial decisions, 296 MapReduce, 9, 15 Maps

heat maps, 110–112 treemaps, 118–120

Marketing, 148 Marketing analytics, 12 Market segmentation, 140 Markowitz mean-variance portfolio model, 658–661 Matching coefficient, 141 Mathematical models, 466–468, 561 Maximization problem, 558–561

mathematical model, 561 problem formulation, 559–560

McQuitty’s method, 144 Mean

arithmetic, 39–40 deviation about the, 45 geometric, 41–44 population, 240–247, 254–265

Mean absolute error (MAE), 384, 386, 390 Mean absolute percentage error (MAPE), 384, 386, 390 Mean forecast error (MFE), 383, 384 Mean squared error (MSE), 384, 390 Measurement error, 270 Measures of association, intervariable, 55–61

correlation coefficient, 60–61 covariance, 57–58 scatter charts, 55–56, 59

Measures of location, 39–44 geometric mean, 41–44 mean (arithmetic mean), 39–40 median, 40–41 mode, 40, 41

Measures of variability, 44–47 coefficient of variation, 47 range, 44–45 standard deviation, 46–47 variance, 45–46

Median, 40–41 Median linkage method, 144 Minimax regret approach, to problem formulation, 683–685 Minimization problem, 568–570 Missing at random (MAR), 62 Missing completely at random (MCAR), 62 Missing data, 61–63 Missing not at random (MNAR), 62 Mixed-integer linear program, 608 Mode, 40, 41 Modeling, 559–561 Model overfitting, 342–344, 424 Money, utility function for, 705 Monte Carlo simulation, 500–548

advantages and disadvantages, 532–533 fitted distribution, 522–527 Four Corners case study, 547–548 Land Shark Inc. example, 514–527 output analysis, 519–522 Polio Eradication example, 501 random variables, 502

generating values for Land Shark, 517–519 generating with Excel, 506–510 probability distributions for, 549–555 probability distributions representing, 504–506

risk analysis, 502–514 base-case scenario, 502–503 best-case scenario, 503 generating values for random variables with

Excel, 506–510 probability distributions to represent random

variables, 504–506 simulation output, measurement and analysis, 510–514 simulation trials with Excel, 510 spreadsheet model, 503–504 worst-case scenario, 503

Sanotronics LLC example, 502–514 simulation modeling, 514–527 spreadsheet model, 515–517 steps for conducting simulation analysis, 533–534 verification and validation, 532

Moving averages, 386–391 Multicollinearity, 322–324 Multimodal data, 41 Multiple-choice constraint, 626–627 Multiple coefficient of determination, 310, 311 Multiple-column charts, 113, 114 Multiple regression, 450

Butler Trucking Company and, 310 estimated multiple regression equation, 308–313 estimation process for, 309 least squares method, 309 model, 308–313 using Excel, 310–313

Multiplication law, 177 Mutually exclusive constraints, 626–627 Mutually exclusive events, 171–172

N Naïve Bayes method, 450 Naïve forecasting method, 382, 386 National Aeronautics and Space Administration (NASA), 167 Negative binomial distribution, 554 Netflix, 139 Networks, 582 Neural networks, 450 New product adoption, forecasting, 663–666 Nodes, 466, 582, 681 Nonexperimental studies, 22 Nonlinear optimization models, 646–677

forecasting applications, new product adoption, 663–666 Intercontinental Hotels example, 647 local and global optima, 652–657 location problem, 657–658 portfolio optimization with transaction costs case study, 675–677 production application, 647–652

constrained problem, 648–650 Excel Solver, 650–651 sensitivity analysis and shadow prices, 651–652 unconstrained problem, 647–648

Nonlinear optimization problem, defined, 647 Nonlinear relationships

interaction between independent variables, 337–341 modeling, 330–342

piecewise linear regression models, 335–337 quadratic regression models, 331–335

Nonnegativity constraints, 560, 562 Nonprofit organizations, use of analytics by, 13 Nonresponse error, 269–270 Nonsampling error, 269–270 Normal probability distribution, 198–203, 506, 549 NORM.DIST function, 201 NORM.INV function, 202–203 Null hypotheses, 250–253

O Objective function, 557 Objective function coefficient allowable increase (decrease), 576 Objective function contour, 562 Observation, 19, 139 Observational studies, 22 Observations, measuring similarity between observations, 140–142 Observed level of significance, 259 One-tailed tests, 254–255, 260 One-way data tables, 471, 472 Operational decisions, 4, 15 Opportunity loss, 683–684 Optimistic approach, 682–683 Optimization models, 7, 15

applications of, 557 integer linear, 606–644 linear, 556–605 nonlinear, 646–677

Orbitz, 423 Outcomes, 168–169, 680


Outliers in box plots, 53–54 erroneous, identification of, 65 identifying, 52

Overall error rate, 426 Overfitting, 342–344

P Pandora, 139 Parallel-coordinate plots, 117–118 Parameters, 223, 470 Par, Inc. problem

Excel Solver for, 564–567 feasible regions for, 561 geometry of, 562–565 mathematical model for, 561 nonlinear optimization model, 647–652 solving, 561–567

Part-worth, 623–624 Payoff, 681 Payoff tables, 681 PenningtonDailyTimes.com (PDT), 272–276 Percent frequency distributions, 30–31 Percentiles, 48 Perfect information, 694–695 Petrobras case study, 607 Piecewise linear regression models, 335–337 Pie charts, 107, 109 PivotCharts, 115–117 PivotTables, 93–99, 173–175 Point estimation, 227–229, 240 Point estimator, 297, 313 POISSON.DIST function, 192–193 Poisson probability distribution, 191–193, 554–555 Polio eradication, 501 Population

characteristics of, 221 defined, 21 finite, 222–224 infinite, 224–226 with normal distribution, 234 point estimator of, 227–229 sampled, 222, 229 target, 229 without normal distribution, 234

Population mean in hypothesis testing, 252–253 hypothesis test of, 254–265 interval estimation of, 240–247

Population proportion in hypothesis testing, 252–253 hypothesis test of, 265–268 interval estimation of, 247–249

Portfolio models, 7 Posterior probabilities, 178–180, 689, 697 Postoptimality analysis, 575 Precision, 430 Predictive analytics, 6–7, 15 Predictive and prescriptive spreadsheet models, 491–492 Predictive data mining, 422–462

classification and regression trees, 439–449 data preparation, 424–425

data sampling, 424–425 k-nearest neighbors, 436–439 logistic regression, 432–436 performance measures, 425–432

Predictor variables, 295, 395 Prescriptive analytics, 7, 12, 15, 557 Presence/absence term-document matrix for Triad Airlines, 152–153 Press Teag Worldwide (PTW), 527–531 Prior probability, 689 Probability, 166–219

addition law, 170–172 basic relationships of, 169–172 branch probabilities with Bayes’ theorem, 695–698 case study, 218–219 classification probabilities, 427 conditional, 172–180, 697 continuous probability distributions, 194–206 defined, 167–168 discrete probability distributions, 182–193 events and probabilities, 168–169 joint probabilities, 175, 178 National Aeronautics and Space Administration, 167 posterior probabilities, 178–180 random variables, 180–182

Probability distributions, 502 binomial, 188–190 continuous, 194–206, 505, 549–551

uniform, 194–196 defined, 182 discrete, 182–193, 505, 551–555

custom, 182–184 uniform, 187–188

empirical, 182 exponential, 203–206 normal, 198–203, 506 Poisson, 191–193 for random variables, 549–555 triangular, 196–198 uniform, 505

Problem formulation, 559–561, 568, 680–682 Product design and market share optimization problem, 623–626 Proportionality, linear optimization, 561 p values

and big data, 275–276 hypothesis tests, 256, 260, 261, 263, 266, 267 independent regression parameters, 318–321 nonlinear relationships, 336 very large samples, 345–347

Q Quadratic function, 648 Quadratic regression models, 331–335 Quantitative data, 21

frequency distributions for, 31–34 histograms, 34–37

Quartiles, 49 Quick Analysis button (Excel), 28, 29

R RAND function, 506–510 Random experiments, 168–169


Random forests, 447, 449 Random sampling, 21, 223–226 Random variables, 20, 180–182, 229, 502

continuous, 181–182, 194 dependent, 527–531 discrete, 180–181, 184–187, 194 expected value and variance, 187–188 generating with Excel, 506–510 Monte Carlo simulation, 501–502, 504–506 probability distributions for, 549–555 probability distributions representing, 504–506

Range, 44–45 Recall, 430 Receiver operating characteristic (ROC) curve, 430–431 Recommended PivotTables, 97–99 Recommender systems, 139 Record, 139 Reduced cost, 577 Reduced gradient, 651 Regression analysis. See also Linear regression

autoregressive models, 397 forecasting applications, 395–405

causal forecasting, 401–404 combining causal variables with trend and seasonality effects, 404–405 limitations of, 405 linear trend projection, 395–397 seasonality, 397–401

logistic regression, 432–436, 450 Regression lines, 296–298 Regression parameters, estimates of, 300–302 Regression trees, 445–446 Regret, 683–684 Rejection rule, 259 Relative and percent frequency distributions, 30–31 Relative frequency distributions, 30–31 Research hypothesis, 250–251 Residual plots, logistic regression, 434 Response variable, 295, 395 Right-hand side allowable increase (decrease), 575 Risk analysis

base-case scenario, 502–503 best-case scenario, 503 in decision analysis, 687–688 defined, 502 generating values for random variables with Excel, 506–510 Monte Carlo simulation, 502–514 probability distributions to represent random variables, 504–506 simulation output, measurement and analysis, 510–514 simulation trials with Excel, 510, 511 spreadsheet model, 503–504 worst-case scenario, 503

Risk assessment, 3 Risk avoider, 700 Risk-neutral, 706 Risk profile, 687–688 Risk taker, 703 Root mean squared error (RMSE), 431–432 Rule-based model, 7, 15

S Sales forecasting. See Forecasting Sample

defined, 21

representative, 222 selection, 223–227 taking a, 221–222

Sampled population, 222, 229 Sample information, 689

decision analysis with, 689–695 expected value of, 694

Sample mean (x̄), 229–232 expected value of, 232 sampling distribution of, 232–237, 240–241 standard deviation of, 232–233

Sample proportion (p̄), 229–232 expected value of, 237 sampling distribution of, 237–240 standard deviation of, 237–238

Sample size, 235–237, 239–240 Sample statistic, 227 Sampling, 222–227

data, 423–425 distributions, 229–240

of p̄, 237–240 sample size and, 235–237, 239–240 of x̄, 232–237

from finite population, 223–224 from infinite population, 224–226 random, 223–226 very large samples, 344–347

Sampling error, 268–269 and big data, 272–273 defined, 268

Scatter-chart matrix, 113–115 Scatter charts, 55–56, 59, 67, 99–101

k-nearest neighbors, 437, 438 logistic regression, 433 of residuals and independent variables, 314–318

Scenario manager, 475–480 Seasonality

combining causal variables with trend and seasonality effects, 404–405

with trend, 398–401 without trend, 397–398

Seasonal pattern, time series analysis, 378–379, 381 Sensitivity, 430 Sensitivity analysis

cautionary note about, 614–615 in decision analysis, 688 defined, 575 Excel Solver sensitivity report interpretation, 575–577 nonlinear optimization problems, 651

Sensitivity Report (Excel Solver), 575–577 Shadow price, 576, 651 Show Formulas, 487 Significance tests, 254 Simple linear regression, 296–298

estimated regression function, 296–298 fit assessment, 304–308 least squares method, 298–303 regression model, 296

Simple random sample, sampling from, 223–226 Simulation modeling, 514–527 Simulation optimization, 7, 16 Simulations, 7, 16

trials with Excel, 510, 511, 519–522 Single linkage clustering method, 144 Slack value, 567 Slack variable, 567


Smoothing methods, 386–391 Specificity, 430 Sports analytics, 13–14 Spreadsheet models, 464–499

auditing, 487–491 design and implementation, 466–471 documentation, 471 Excel functions, 480–491

IF and COUNTIF, 483–485 SUM and SUMPRODUCT, 481 VLOOKUP, 485–486

formatting, 470 influence diagrams, 466–468 make-versus-buy, 466, 469 mathematical models, 466–468 overview, 465 parameters, 470 predictive and prescriptive, 491–492 for Press Teag Worldwide, 527–531 Procter & Gamble case study, 465 retirement plan case study, 499 risk analysis, 503–504 what-if analysis, 471–480

Stacked-bar charts, 112–113 Stacked-column charts, 112–113 Standard deviation, 46–47

of x̄, 232–233 Standard error, 238 Standard error of mean, 238 Standard error of the proportion, 238 Standardized value, 50 Standard normal distribution, 241–242 States of nature, 680 Statistical inference, 225–294

applications of, 222 big data and

attributes, 271 confidence intervals, 273–275 estimation, 270–271 hypothesis test, 275–277 sampling error, 272–273 tall data, 271, 272 wide data, 271

case studies, 291–293 defined, 222, 313 hypothesis tests, 250–268 interval estimation, 240–250 John Morrell & Company, 221 nonsampling error, 269–270 point estimation, 227–229 practical advice for, 229 practical significance of, 268–277 regression and, 313–325

individual regression parameters, 318–324 least squares regression model, 314–318 multicollinearity, 322–324 nonsignificant independent variables, 321–322 very large samples, 344–347

sample selection, 223–227 sampling distributions, 229–240 sampling error, 268–269

Statistical studies, 21–22 STDEV function, 262 Stemming, 153 Stepwise procedure, 343 Stitch Fix, 139

Strategic decisions, 4, 16 SUM function, 481 Sum of squares due to error (SSE), 311 Sum of squares due to regression (SSR), 306 SUMPRODUCT function, 184, 186, 481, 564–565 Sums of squares, 304–306 Sums of squares due to error (SSE), 304 Supervised learning, 423–450 Supply-chain analytics, 13 Supply-network design models, 7 Support count, 148 Support vector machines, 450 Surplus variable, 570

T Tables, 88–99

vs. charts, 88–89 crosstabulation, 90–92 data tables, 471–473 design principles, 89–90 payoff tables, 681 PivotTables, 93–99, 173–175

Tactical decisions, 4, 16 Target population, 229 T.DIST function, 257 t distributions, 241–242 Test set, 425 Test statistic, 255–260, 266 Text mining

definition, 151 movie reviews, 154 preprocessing text data for analysis, 153–154 unstructured data, 151 voice of customer at Triad Airlines, 151–153

Three-dimensional charts, 107 3D Maps, 121–122 Time series analysis

forecasting and, 373–374 patterns, 375–382

cyclical pattern, 382 horizontal pattern, 375–377 identification of, 382 seasonal pattern, 378–379 trend and seasonal pattern, 379, 381 trend pattern, 377–378

Time series, defined, 375 Tokenization, 153 Total sum of squares (SST), 304–305 Trace Dependents, 487–488 Trace Precedents, 487–488 Training set, 425 Transportation planning, 580–584 Treemaps, 118–120 Trend-cycle effects, 382 Trend pattern, time series analysis, 377–379, 381 Triad Airlines, 151–153 Trial-and-error approach, 470 Trials, 506 Triangular distribution, 522–524, 550 Triangular probability distribution, 196–198 Two-tailed tests, 260–263 Two-way data tables, 471, 472 Type I errors, 253–254, 259 Type II errors, 253–254


U Unbounded solutions, in linear programming problems, 573–574 Uncertainty, 167, 501, 502, 679 Uncertain variables, 20 Uniform distribution, 522, 551 Uniform probability density function, 195 Uniform probability distributions, 187–188, 194–196, 505 Unstable, 446 Unstructured data, 151 Unsupervised learning techniques, 139, 146. See also Descriptive data mining Upper-tail tests, 260 U.S. Census Bureau, 19 Utility

decision analysis and, 699–703 defined, 698

Utility function for money, 705 Utility functions, 703–706

exponential, 706–707 Utility theory, 7, 16, 698–707

exponential utility function, 706–707 utility and decision analysis, 699–703 utility functions, 703–706

V Validation, 532 Validation set, 425 Variables

binary, 142, 616–626 causal, 404–405 decision, 20, 469, 559–560, 577, 582 defined, 19 dependent, 295, 321–322, 395 dummy, 325 expected value and variance, 187–188

independent, 295–296, 321–322, 325–329, 395 interaction between, 337–341

measures of association between two variables, 55–61 random, 20, 180–182, 229, 501–502, 504–506 slack, 567 surplus, 570 uncertain, 20 variable representation, 67–68 variable selection procedures, 342–343

Variance, 45–46, 186–187 Variation, 20 Venn diagram, 169, 170 Verification, 532 VLOOKUP function, 485–486

W Walt Disney Company, 10 Ward’s method, 144 Watch Window, 490, 491 Watson, 3–4 Web analytics, 14 What-if analysis, 471–480

data tables, 471–473 Excel Solver, 564–567 Goal Seek, 473–476 risk analysis, 503 scenario manager, 475–480

Worst-case scenario, 503

Y y-intercept, 300, 321, 322

Z z-scores, 49–51, 141

Cover
Brief Contents
Contents
About the Authors
Preface
Chapter 1: Introduction

1.1 Decision Making
1.2 Business Analytics Defined
1.3 A Categorization of Analytical Methods and Models
1.4 Big Data
1.5 Business Analytics in Practice
Summary
Glossary

Chapter 2: Descriptive Statistics

2.1 Overview of Using Data: Definitions and Goals
2.2 Types of Data
2.3 Modifying Data in Excel
2.4 Creating Distributions from Data
2.5 Measures of Location
2.6 Measures of Variability
2.7 Analyzing Distributions
2.8 Measures of Association between Two Variables
2.9 Data Cleansing
Summary
Glossary
Problems
Case Problem: Heavenly Chocolates Web Site Transactions

Chapter 3: Data Visualization

3.1 Overview of Data Visualization
3.2 Tables
3.3 Charts
3.4 Advanced Data Visualization
3.5 Data Dashboards
Summary
Glossary
Problems
Case Problem: All-Time Movie Box-Office Data

Chapter 4: Descriptive Data Mining

4.1 Cluster Analysis
4.2 Association Rules
4.3 Text Mining
Summary
Glossary
Problems
Case Problem: Know Thy Customer

Chapter 5: Probability: An Introduction to Modeling Uncertainty

5.1 Events and Probabilities
5.2 Some Basic Relationships of Probability
5.3 Conditional Probability
5.4 Random Variables
5.5 Discrete Probability Distributions
5.6 Continuous Probability Distributions
Summary
Glossary
Problems
Case Problem: Hamilton County Judges

Chapter 6: Statistical Inference

6.1 Selecting a Sample
6.2 Point Estimation
6.3 Sampling Distributions
6.4 Interval Estimation
6.5 Hypothesis Tests
6.6 Big Data, Statistical Inference, and Practical Significance
Summary
Glossary
Problems
Case Problem 1: Young Professional Magazine
Case Problem 2: Quality Associates, Inc.

Chapter 7: Linear Regression

7.1 Simple Linear Regression Model
7.2 Least Squares Method
7.3 Assessing the Fit of the Simple Linear Regression Model
7.4 The Multiple Regression Model
7.5 Inference and Regression
7.6 Categorical Independent Variables
7.7 Modeling Nonlinear Relationships
7.8 Model Fitting
7.9 Big Data and Regression
7.10 Prediction with Regression
Summary
Glossary
Problems
Case Problem: Alumni Giving

Chapter 8: Time Series Analysis and Forecasting

8.1 Time Series Patterns
8.2 Forecast Accuracy
8.3 Moving Averages and Exponential Smoothing
8.4 Using Regression Analysis for Forecasting
8.5 Determining the Best Forecasting Model to Use
Summary
Glossary
Problems
Case Problem: Forecasting Food and Beverage Sales
Appendix 8.1 Using the Excel Forecast Sheet

Chapter 9: Predictive Data Mining

9.1 Data Sampling, Preparation, and Partitioning
9.2 Performance Measures
9.3 Logistic Regression
9.4 k-Nearest Neighbors
9.5 Classification and Regression Trees
Summary
Glossary
Problems
Case Problem: Grey Code Corporation

Chapter 10: Spreadsheet Models

10.1 Building Good Spreadsheet Models
10.2 What-If Analysis
10.3 Some Useful Excel Functions for Modeling
10.4 Auditing Spreadsheet Models
10.5 Predictive and Prescriptive Spreadsheet Models
Summary
Glossary
Problems
Case Problem: Retirement Plan

Chapter 11: Monte Carlo Simulation

11.1 Risk Analysis for Sanotronics LLC
11.2 Simulation Modeling for Land Shark Inc.
11.3 Simulation with Dependent Random Variables
11.4 Simulation Considerations
Summary
Glossary
Problems
Case Problem: Four Corners
Appendix 11.1 Common Probability Distributions for Simulation

Chapter 12: Linear Optimization Models

12.1 A Simple Maximization Problem
12.2 Solving the Par, Inc. Problem
12.3 A Simple Minimization Problem
12.4 Special Cases of Linear Program Outcomes
12.5 Sensitivity Analysis
12.6 General Linear Programming Notation and More Examples
12.7 Generating an Alternative Optimal Solution for a Linear Program
Summary
Glossary
Problems
Case Problem: Investment Strategy

Chapter 13: Integer Linear Optimization Models

13.1 Types of Integer Linear Optimization Models
13.2 Eastborne Realty, an Example of Integer Optimization
13.3 Solving Integer Optimization Problems with Excel Solver
13.4 Applications Involving Binary Variables
13.5 Modeling Flexibility Provided by Binary Variables
13.6 Generating Alternatives in Binary Optimization
Summary
Glossary
Problems
Case Problem: Applecore Children's Clothing

Chapter 14: Nonlinear Optimization Models

14.1 A Production Application: Par, Inc. Revisited
14.2 Local and Global Optima
14.3 A Location Problem
14.4 Markowitz Portfolio Model
14.5 Forecasting Adoption of a New Product
Summary
Glossary
Problems
Case Problem: Portfolio Optimization with Transaction Costs

Chapter 15: Decision Analysis

15.1 Problem Formulation
15.2 Decision Analysis without Probabilities
15.3 Decision Analysis with Probabilities
15.4 Decision Analysis with Sample Information
15.5 Computing Branch Probabilities with Bayes' Theorem
15.6 Utility Theory
Summary
Glossary
Problems
Case Problem: Property Purchase Strategy

Appendix A-Basics of Excel
Appendix B-Database Basics with Microsoft Access
References
Index
