
Data Analytics Basics for Managers

Don’t let a fear of numbers hold you back.

Understand the numbers

Make better decisions

Present and persuade

SMARTER THAN THE AVERAGE GUIDE

HBR Guide to


Stay informed. Join the discussion. Visit hbr.org Follow @HarvardBiz on Twitter Find us on Facebook and LinkedIn

US$19.95 Management

Today’s business environment brings with it an onslaught of data. Now more than ever, managers must know how to tease insight from data—to understand where the numbers come from, make sense of them, and use them to inform tough decisions. How do you get started?

Whether you’re working with data experts or running your own tests, you’ll find answers in the HBR Guide to Data Analytics Basics for Managers. This book describes three key steps in the data analysis process, so you can get the information you need, study the data, and communicate your findings to others.

You’ll learn how to:

• Identify the metrics you need to measure

• Run experiments and A/B tests

• Ask the right questions of your data experts

• Understand statistical terms and concepts

• Create effective charts and visualizations

• Avoid common mistakes

BONUS ARTICLE

Data Scientist: The Sexiest Job of the 21st Century

ISBN-13: 978-1-63369-428-6


HBR-Guide-DataAnalytics10185_Mechanical.indd 1 1/22/18 1:56 PM

HBR Guide to Data Analytics Basics for Managers

H7353_Guide-DataAnalytics_2ndREV.indb iH7353_Guide-DataAnalytics_2ndREV.indb i 1/17/18 10:47 AM1/17/18 10:47 AM

Harvard Business Review Guides

Arm yourself with the advice you need to succeed on the job, from the most trusted brand in business. Packed with how-to essentials from leading experts, the HBR Guides provide smart answers to your most pressing work challenges.

The titles include:

HBR Guide to Being More Productive

HBR Guide to Better Business Writing

HBR Guide to Building Your Business Case

HBR Guide to Buying a Small Business

HBR Guide to Coaching Employees

HBR Guide to Data Analytics Basics for Managers

HBR Guide to Delivering Effective Feedback

HBR Guide to Emotional Intelligence

HBR Guide to Finance Basics for Managers

HBR Guide to Getting the Right Work Done

HBR Guide to Leading Teams

HBR Guide to Making Every Meeting Matter

HBR Guide to Managing Stress at Work

HBR Guide to Managing Up and Across

HBR Guide to Negotiating

HBR Guide to Office Politics

HBR Guide to Performance Management

HBR Guide to Persuasive Presentations

HBR Guide to Project Management


HBR Guide to Data Analytics Basics for Managers

HARVARD BUSINESS REVIEW PRESS

Boston, Massachusetts


Copyright 2018 Harvard Business School Publishing Corporation

All rights reserved

No part of this publication may be reproduced, stored in or introduced into a retrieval system, or transmitted, in any form, or by any means (electronic, mechanical, photocopying, recording, or otherwise), without the prior permission of the publisher. Requests for permission should be directed to [email protected], or mailed to Permissions, Harvard Business School Publishing, 60 Harvard Way, Boston, Massachusetts 02163.

The web addresses referenced in this book were live and correct at the time of the book’s publication but may be subject to change.

Cataloging-in-Publication data is forthcoming

eISBN: 9781633694293

HBR Press Quantity Sales Discounts

Harvard Business Review Press titles are available at significant quantity discounts when purchased in bulk for client gifts, sales promotions, and premiums. Special editions, including books with corporate logos, customized covers, and letters from the company or CEO printed in the front matter, as well as excerpts of existing books, can also be created in large quantities for special needs.

For details and discount information for both print and ebook formats, contact [email protected], tel. 800-988-0886, or www.hbr.org/bulksales.


What You’ll Learn

The vast amounts of data that companies accumulate today can help you understand the past, make predictions about the future, and guide your decision making. But how do you use all this data effectively? How do you assess whether your findings are accurate or significant? How do you distinguish between causation and correlation? And how do you present your results in a way that will persuade others?

Understanding data analytics is an essential skill for every manager. It’s no longer enough to hand this responsibility off to data experts. To be able to rely on the evidence your analysts give you, you need to know where it comes from and how it was generated—and what it can and can’t teach you.

Using quantitative analysis as part of your decision making helps you uncover new information and provides you with more confidence in your choices—and you don’t need to be deeply proficient in statistics to do it. This guide gives you the basics so you can better understand how to use data and analytics as you make tough choices in your daily work. It walks you through three fundamental steps of data analysis: gathering the information you need, making sense of the numbers, and communicating those findings to get buy-in and spur others to action.

You’ll learn to:

• Ask the right questions to get the information you need

• Work more effectively with data scientists

• Run business experiments and A/B tests

• Choose the right metrics to evaluate predictions and performance

• Assess whether you can trust your data

• Understand the basics of regression analysis and statistical significance

• Distinguish between correlation and causation

• Sidestep cognitive biases when making decisions

• Identify when to invest in machine learning—and how to proceed

• Communicate and defend your findings to stakeholders

• Visualize your data clearly and powerfully


Contents

Introduction

Why you need to understand data analytics.

SECTION ONE

Getting Started

1. Keep Up with Your Quants

An innumerate’s guide to navigating big data.

BY THOMAS H. DAVENPORT

2. A Simple Exercise to Help You Think Like a Data Scientist

An easy way to learn the process of data analytics.

BY THOMAS C. REDMAN

SECTION TWO

Gather the Right Information

3. Do You Need All That Data?

Questions to ask for a focused search.

BY RON ASHKENAS


4. How to Ask Your Data Scientists for Data and Analytics

Factors to keep in mind to get the information you need.

BY MICHAEL LI, MADINA KASSENGALIYEVA, AND RAYMOND PERKINS

5. How to Design a Business Experiment

Seven tips for using the scientific method.

BY OLIVER HAUSER AND MICHAEL LUCA

6. Know the Difference Between Your Data and Your Metrics

Understand what you’re measuring.

BY JEFF BLADT AND BOB FILBIN

7. The Fundamentals of A/B Testing

How it works—and mistakes to avoid.

BY AMY GALLO

8. Can Your Data Be Trusted?

Gauge whether your data is safe to use.

BY THOMAS C. REDMAN

SECTION THREE

Analyze the Data

9. A Predictive Analytics Primer

Look to the future by looking at the past.

BY THOMAS H. DAVENPORT

10. Understanding Regression Analysis

Evaluate the relationship between variables.

BY AMY GALLO


11. When to Act On a Correlation, and When Not To

Assess your confidence in your findings and the risk of being wrong.

BY DAVID RITTER

12. Can Machine Learning Solve Your Business Problem?

Steps to take before investing in artificial intelligence.

BY ANASTASSIA FEDYK

13. A Refresher on Statistical Significance

Check if your results are real or just luck.

BY AMY GALLO

14. Linear Thinking in a Nonlinear World

A common mistake that leads to errors in judgment.

BY BART DE LANGHE, STEFANO PUNTONI, AND RICHARD LARRICK

15. Pitfalls of Data-Driven Decisions

The cognitive traps to avoid.

BY MEGAN MacGARVIE AND KRISTINA McELHERAN

16. Don’t Let Your Analytics Cheat the Truth

Pay close attention to the outliers.

BY MICHAEL SCHRAGE

SECTION FOUR

Communicate Your Findings

17. Data Is Worthless If You Don’t Communicate It

Tell people what it means.

BY THOMAS H. DAVENPORT


18. When Data Visualization Works—and When It Doesn’t

Not all data is worth the effort.

BY JIM STIKELEATHER

19. How to Make Charts That Pop and Persuade

Five questions to help give your numbers meaning.

BY NANCY DUARTE

20. Why It’s So Hard for Us to Communicate Uncertainty

Illustrating—and understanding—the likelihood of events.

AN INTERVIEW WITH SCOTT BERINATO

BY NICOLE TORRES

21. Responding to Someone Who Challenges Your Data

Ensure the data is thorough, then make them an ally.

BY JON M. JACHIMOWICZ

22. Decisions Don’t Start with Data

Influence others through story and emotion.

BY NICK MORGAN

Appendix: Data Scientist: The Sexiest Job of the 21st Century

BY THOMAS H. DAVENPORT AND D.J. PATIL

Index


Introduction

Data is coming into companies at remarkable speed and volume. From small, manageable data sets to big data that is recorded every time a consumer buys a product or likes a social media post, this information offers a range of opportunities to managers.

Data allows you to make better predictions about the future—whether a new retail location is likely to succeed, for example, or what a reasonable budget for the next fiscal year might look like. It helps you identify the causes of certain events—a failed advertising campaign, a bad quarter, or even poor employee performance—so you can adjust course if necessary. It allows you to isolate variables so that you can identify your customers’ wants or needs or assess the chances an initiative will succeed. Data gives you insight on factors affecting your industry or marketplace and can inform your decisions about anything from new product development to hiring choices.


But with so much information coming in, how do you sort through it all and make sense of everything? It’s tempting to hand that role off to your experts and analysts. But even if you have the brightest minds handling your data, it won’t make a difference if you don’t know what they’re doing or what it means. Unless you know how to use that data to inform your decisions, all you have is a set of numbers.

It’s quickly becoming a requirement that every decision maker have a basic understanding of data analytics. But if the thought of statistical analysis makes you sweat, have no fear. You don’t need to become a data scientist or statistician to understand what the numbers mean (even if data scientists have the “sexiest job of the 21st century”—see the bonus article we’ve included in the appendix). Instead, you as a manager need a clear understanding of how these experts reach their results and how to best use that information to guide your own decisions. You must know where their findings come from, ask the right questions of data sets, and translate the results to your colleagues and other stakeholders in a way that convinces and persuades.

This book is not for analytics experts—the data scientists, analysts, and other specialists who do this work day in, day out. Instead, it’s meant for managers who may not have a background in statistical analysis but still want to improve their decisions using data. This book will not give you a detailed course in statistics. Rather, it will help you better use data, so you can understand what the numbers are telling you, identify where the results of those calculations may be falling short, and make stronger choices about how to run your business.


What This Book Will Do

This guide walks you through three key areas of the data analytics process: gathering the information you need, analyzing it, and communicating your findings to others. These three steps form the core of managerial data analytics.

To fully understand these steps, you need to see the process of data analytics and your role within it at a high level. Section 1, “Getting Started,” provides two pieces to help you digest the process from start to finish. First, Thomas Davenport outlines your role in data analysis and describes how you can work more effectively with your data scientist and become a better consumer of analytics. Then, you’ll find an easy exercise you can do yourself to gather your own data, analyze it, and identify what to do next in light of what you’ve discovered.

Once you have this basic understanding of the process, you can move on to learn the specifics about each step, starting with the data search.

Gather the right information

For any analysis, you need data—that’s obvious. But what data you need and how to get it can be less clear and can vary, depending on the problem to be solved. Section 2 begins by providing a list of questions to ask for a targeted data search.

There are two ways to get the information you need: by asking others for existing data and analysis or by running your own experiment to gather new data. We explore both of these approaches in turn, covering how to request information from your data experts (taking into account their needs and concerns) and using the scientific method and A/B testing for well-thought-out tests.

But any data search won’t matter if you don’t measure useful things. Defining the right metrics ensures that your results align with your needs. Jeff Bladt, chief data officer at DoSomething.org, and Bob Filbin, chief data scientist at Crisis Text Line, use the example of their own social media campaign to explain how to identify and work toward metrics that matter.

We end this section with a helpful process by data expert and company adviser Thomas C. Redman. Before you can move forward with any analysis, you must know if the information you have can be trusted. By following his advice, you can assess the quality of your data, make corrections as necessary, and move forward accordingly, even if the data isn’t perfect.

Analyze the data

You have the numbers—now what? It’s usually at this point in the process that managers flash back to their college statistics courses and nervously leave the analysis to an expert or a computer algorithm. Certainly, the data scientists on your team are there to help. But you can learn the basics of analysis without needing to understand every mathematical calculation. By focusing on how data experts and companies use these equations (instead of how they run them), we help you ask the right questions and inform your decisions in real-world managerial situations.

We begin section 3 by describing some basic terms and processes. We define predictive analytics and how to use them, and explain statistical concepts like regression analysis, correlation versus causation, and statistical significance. You’ll also learn how to assess if machine learning can help solve your problem—and how to proceed if it does.

In this section, we also aim to help you avoid common traps as you study data and make decisions. You’ll discover how to look at numbers in nonlinear ways, so your predictions are more accurate. And you will find practical ways to avoid injecting subconscious bias into your choices.

Finally, recognize when the story you’re being told may be too good to be true. Even with the best data—and the best data analysts—the results may not be as clear as you think. As Michael Schrage, research fellow at MIT’s Sloan School Center for Digital Business, points out in the last piece in this section, an unmentioned outlier can throw an entire conclusion off base, which is a risk you can’t take with your decision making.

Communicate your findings

“Never make the mistake of assuming that the results will ‘speak for themselves,’” warns Thomas Davenport in the final section of this book. You must know how to communicate the results of your analysis and use that information to persuade others and drive your decision forward—the third step in the data analytics process.

Section 4 explains how to share data with others so that the numbers support your message, rather than distract from it. The next few chapters outline when visualizations will be helpful to your data—and when they won’t be—as well as the basics of making persuasive charts. You’ll learn how to depict and explain the uncertainty and the probability of events, as well as what to do if someone questions your findings.

Data alone will not elicit change, though; you must use this evidence in the right way to inform and change the mindset of the person who sees it. Data is merely supporting material, says presentations expert Nick Morgan in the final chapter. To truly persuade, you need a story with emotional power.

Set your organization up for success

While we hope that you’ll continue to learn and grow your own analytical skills, it’s likely that you’ll continue to work with data experts and quants throughout your data journey. Understanding the role of the data scientist will be crucial to ensuring your organization has the capabilities it needs to grow through data.

Data scientists bring with them intense curiosity and make new discoveries that managers and analysts may not see themselves. As an appendix at the end of this book, you’ll find Thomas H. Davenport and D.J. Patil’s popular article “Data Scientist: The Sexiest Job of the 21st Century.” Davenport and Patil’s piece aims to help you better understand this key player in an organization—someone they describe as a “hybrid of data hacker, analyst, communicator, and trusted adviser.” These individuals have rare qualities that, as a manager, you may not fully understand. By reading through this piece, you’ll have insight into how they think about and work with data. What’s more, you’ll learn how to find, attract, and develop data scientists to keep your company on the competing edge.

Moving Forward

Data-driven decisions won’t come easily. But by understanding the basics of data analytics, you’ll be able to ask the right questions of data to pull the most useful information out of the numbers. Before diving in to the chapters that follow, though, ask yourself how often you’re incorporating data into your daily work. The assessment “Are You Data Driven?” is a brief test that will help you target your efforts. With that knowledge in mind, move through the next sections with an open mind, ready to weave data into each of your decisions.

ARE YOU DATA DRIVEN?

by Thomas C. Redman

Look at the list below and give yourself a point for every behavior you demonstrate consistently and half a point for those you follow most—but not all—of the time. Be hard on yourself. If you can only cite an instance or two, don’t give yourself any credit.

□ I push decisions down to the lowest possible level.

□ I bring as much diverse data and as many diverse viewpoints to any situation as I possibly can.


□ I use data to develop a deeper understanding of the business context and the problem at hand.

□ I develop an appreciation for variation.

□ I deal reasonably well with uncertainty.

□ I integrate my understanding of the data and its implications with my intuition.

□ I recognize the importance of high-quality data and invest to make improvements.

□ I conduct experiments and research to supplement existing data and address new questions.

□ I recognize that decision criteria can vary with circumstances.

□ I realize that making a decision is only the first step, and I revise decisions as new data comes to light.

□ I work to learn new skills, and bring new data and data technologies into my organization.

□ I learn from my mistakes and help others to do so as well.

□ I strive to be a role model when it comes to data, and work with leaders, peers, and subordinates to help them become data driven.


Tally your points. If you score less than 7, it’s imperative that you start changing the way you work as soon as possible. Target those behaviors where you gave yourself partial credit first and fully embed those skills into your daily work. Then build on your success by targeting those behaviors that you were unable to give yourself any credit for. It may help to enlist a colleague’s aid—the two of you can improve together.

If you score a 7 or higher, you’re showing signs of being data driven. Still, strive for ongoing improvement. Set a goal of learning a new behavior or two every year. Take this test every six months to make sure that you’re on track.

Adapted from “Are You Data Driven? Take a Hard Look in the Mirror” on hbr.org, July 11, 2013 (product # H00AX2).

Thomas C. Redman, “the Data Doc,” is President of Data Quality Solutions. He helps companies and people, including startups, multinationals, executives, and leaders at all levels, chart their courses to data-driven futures. He places special emphasis on quality, analytics, and organizational capabilities.
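The scoring rule in the assessment above is simple arithmetic, and a short sketch can make it concrete. The Python below is only an illustration: the rating labels (`consistently`, `mostly`, `rarely`), the function names, and the sample answers are assumptions made for this example, while the point values and the threshold of 7 come from the assessment text.

```python
from typing import Iterable

# Threshold from the assessment: a score below 7 calls for immediate change.
DATA_DRIVEN_THRESHOLD = 7.0

# Points per self-rating. The labels are hypothetical; the values follow the
# instructions (full point, half point, no credit).
POINTS = {"consistently": 1.0, "mostly": 0.5, "rarely": 0.0}


def tally(answers: Iterable[str]) -> float:
    """Add up the points for a list of self-ratings."""
    return sum(POINTS[answer] for answer in answers)


def interpret(score: float) -> str:
    """Map a total score to the guidance given in the assessment."""
    if score < DATA_DRIVEN_THRESHOLD:
        return "start changing the way you work"
    return "showing signs of being data driven"


# Invented self-ratings for the 13 behaviors in the checklist.
example = ["consistently"] * 5 + ["mostly"] * 3 + ["rarely"] * 5
score = tally(example)
print(score, "->", interpret(score))
```

With 13 behaviors on the list, the maximum score is 13, so the threshold of 7 sits just above the halfway mark.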


SECTION ONE

Getting Started


CHAPTER 1

Keep Up with Your Quants

by Thomas H. Davenport

“I don’t know why we didn’t get the mortgages off our books,” a senior quantitative analyst at a large U.S. bank told me a few years ago. “I had a model strongly indicating that a lot of them wouldn’t be repaid, and I sent it to the head of our mortgage business.”

When I asked the leader of the mortgage business why he’d ignored the advice, he said, “If the analyst showed me a model, it wasn’t in terms I could make sense of. I didn’t even know his group was working on repayment probabilities.” The bank ended up losing billions in bad loans.

We live in an era of big data. Whether you work in financial services, consumer goods, travel and transportation, or industrial products, analytics are becoming a competitive necessity for your organization. But as the banking example shows, having big data—and even people who can manipulate it successfully—is not enough. Companies need general managers who can partner effectively with “quants” to ensure that their work yields better strategic and tactical decisions.

Reprinted from Harvard Business Review, July–August 2013 (product #R1307L).

For people fluent in analytics—such as Gary Loveman of Caesars Entertainment (with a PhD from MIT), Jeff Bezos of Amazon (an electrical engineering and computer science major from Princeton), or Sergey Brin and Larry Page of Google (computer science PhD dropouts from Stanford)—there’s no problem. But if you’re a typical executive, your math and statistics background probably amounts to a college class or two. You might be adept at using spreadsheets and know your way around a bar graph or a pie chart, but when it comes to analytics, you often feel quantitatively challenged.

So what does the shift toward data-driven decision making mean for you? How do you avoid the fate of the loss-making mortgage bank head and instead lead your company into the analytical revolution, or at least become a good foot soldier in it? This article—a primer for non-quants—is based on extensive interviews with executives, including some with whom I’ve worked as a teacher or a consultant.

You, the Consumer

Start by thinking of yourself as a consumer of analytics. The producers are the quants whose analyses and models you’ll integrate with your business experience and intuition as you make decisions. Producers are, of course, good at gathering the available data and making predictions about the future. But most lack sufficient knowledge to identify hypotheses and relevant variables and to know when the ground beneath an organization is shifting. Your job as a data consumer—to generate hypotheses and determine whether results and recommendations make sense in a changing business environment—is therefore critically important. That means accepting a few key responsibilities. Some require only changes in attitude and perspective; others demand a bit of study.

Learn a little about analytics

If you remember the content of your college-level statistics course, you may be fine. If not, bone up on the basics of regression analysis, statistical inference, and experimental design. You need to understand the process for making analytical decisions, including when you should step in as a consumer, and you must recognize that every analytical model is built on assumptions that producers ought to explain and defend. (See the sidebar “Analytics-Based Decision Making—in Six Key Steps.”) As the famous statistician George Box noted, “All models are wrong, but some are useful.” In other words, models intentionally simplify our complex world.

To become more data literate, enroll in an executive education program in statistics, take an online course, or learn from the quants in your organization by working closely with them on one or more projects.

ANALYTICS-BASED DECISION MAKING—IN SIX KEY STEPS

When using big data to make big decisions, non-quants should focus on the first and the last steps of the process. The numbers people typically handle the details in the middle, but wise non-quants ask lots of questions along the way.

1. Recognize the problem or question. Frame the decision or business problem, and identify possible alternatives to the framing.

2. Review previous findings. Identify people who have tried to solve this problem or similar ones—and the approaches they used.

3. Model the solution and select the variables. Formulate a detailed hypothesis about how particular variables affect the outcome.

4. Collect the data. Gather primary and secondary data on the hypothesized variables.

5. Analyze the data. Run a statistical model, assess its appropriateness for the data, and repeat the process until a good fit is found.

6. Present and act on the results. Use the data to tell a story to decision makers and stakeholders so that they will take action.

Jennifer Joy, the vice president of clinical operations at Cigna, took the third approach. Joy has a nursing degree and an MBA, but she wasn’t entirely comfortable with her analytical skills. She knew, however, that the voluminous reports she received about her call center operations weren’t telling her whether the coaching calls made to patients were actually helping to manage their diseases and to keep them out of the hospital.

So Joy reached out to Cigna’s analytics group, in particular to the experts on experimental design—the only analytical approach that can potentially demonstrate cause and effect. She learned, for example, that she could conduct pilot studies to discover which segments of her targeted population benefit the most (and which the least) from her call center’s services. Specifically, she uses analytics to “prematch” pairs of patients and then to randomly assign one member of the pair to receive those services, while the other gets an alternative such as a mail-order or an online-support intervention. Each pilot lasts just a couple of months, and multiple studies are run simultaneously—so Joy now gets information about the effectiveness of her programs on a rolling basis.

In the end, Joy and her quant partners learned that the coaching worked for people with certain diseases but not for other patients, and some call center staff members were redeployed as a result. Now her group regularly conducts 20 to 30 such tests a year to find out what really makes a difference for patients. She may not understand all the methodological details, but as Michael Cousins, the vice president of U.S. research and analytics at Cigna, attests, she’s learned to be “very analytically oriented.”
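For readers who want to see the mechanics, the matched-pair design described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Cigna’s actual method: the patient IDs, the hospital-day figures, and both function names are invented for the example, and the matching step itself (pairing similar patients) is assumed to have been done already.

```python
import random
from statistics import mean


def assign_matched_pairs(pairs, seed=None):
    """For each prematched pair, randomly choose which member gets coaching.

    `pairs` is a list of (patient_a, patient_b) tuples that an analyst has
    already matched on relevant traits; the matching is out of scope here.
    """
    rng = random.Random(seed)
    assignments = []
    for a, b in pairs:
        # A fair coin flip decides who receives the coaching calls.
        coached, alternative = (a, b) if rng.random() < 0.5 else (b, a)
        assignments.append({"coaching": coached, "alternative": alternative})
    return assignments


def mean_paired_difference(assignments, outcome):
    """Average within-pair gap: alternative outcome minus coaching outcome."""
    return mean(
        outcome[pair["alternative"]] - outcome[pair["coaching"]]
        for pair in assignments
    )


# Hypothetical patient IDs and hospital days, invented for illustration only.
pairs = [("p1", "p2"), ("p3", "p4"), ("p5", "p6")]
hospital_days = {"p1": 4, "p2": 6, "p3": 2, "p4": 2, "p5": 7, "p6": 5}

assignments = assign_matched_pairs(pairs, seed=42)
print(assignments)
print(mean_paired_difference(assignments, hospital_days))
```

Because chance alone decides who gets the coaching within each pair, a consistent gap in outcomes across many pairs is harder to blame on preexisting differences between patients. A real pilot would need far more than three pairs and a significance test before anyone acted on the result.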


Align yourself with the right kind of quant

Karl Kempf, a leader in Intel’s decision-engineering group, is known at the company as the “überquant” or “chief mathematician.” He often says that effective quantitative decisions “are not about the math; they’re about the relationships.” What he means is that quants and the consumers of their data get much better results if they form deep, trusting ties that allow them to exchange information and ideas freely.

Of course, highly analytical people are not always known for their social skills, so this can be hard work. As one wag jokingly advised, “Look for the quants who stare at your shoes, instead of their own, when you engage them in conversation.” But it’s possible to find people who communicate well and have a passion for solving business—rather than mathematical—problems and, after you’ve established a relationship, to encourage frank dialogue and data-driven dissent between the two of you.

Katy Knox, at Bank of America, has learned how to

align with data producers. As the head of retail strat-

egy and distribution for the bank’s consumer division,

she oversees 5,400-plus branches serving more than

50 million consumers and small businesses. For several

years she’s been pushing her direct reports to use analyt-

ics to make better decisions—for example, about which

branches to open or close, how to reduce customer wait

times, what incentives lead to multichannel interactions,

and why some salespeople are more productive than

others.

Bank of America has hundreds of quants, but most of

them were pooled in a group that managers could not

easily access. Knox insisted on having her own analyt-

ics team, and she established a strong working relation-

ship with its members through frequent meetings and

project-reporting sessions. She worked especially closely

with two team leaders, Justin Addis and Michael Hyzy,

who have backgrounds in retail banking and Six Sigma,

so they’re able to understand her unit’s business prob-

lems and communicate them to the hard-core quants

they manage. After Knox set the precedent, Bank of

America created a matrix structure for its analysts in the

consumer bank, and most now report to both a business

line and a centralized analytical group.

Focus on the beginning and the end

Framing a problem—identifying it and understanding

how others might have solved it in the past—is the most

important stage of the analytical process for a consumer

of big data. It’s where your business experience and in-

tuition matter most. After all, a hypothesis is simply a

hunch about how the world works. The difference with

analytical thinking, of course, is that you use rigorous

methods to test the hypothesis.

For example, executives at the two corporate parent

organizations of Transitions Optical believed that the

photochromic lens company might not be investing in

marketing at optimal levels, but no empirical data confirmed

or refuted that idea. Grady Lenski, who headed

the marketing division at the time, decided to hire

analytics consultants to measure the effectiveness of

different sales campaigns—a constructive framing that

expanded on the simple binary question of whether or

not costs were too high.

If you’re a non-quant, you should also focus on the

final step in the process—presenting and communicating

results to other executives—because it’s one that

many quants discount or overlook and that you’ll prob-

ably have to take on yourself at some point. If analytics

is largely about “telling a story with data,” what type of

story would you favor? What kind of language and tone

would you use? Should the story be told in narrative or

visual terms? What types of graphics do you like? No

matter how sophisticated their analyses, quants should

be encouraged to explain their results in a straightfor-

ward way so that everyone can understand—or you

should do it for them. A statistical methods story (“first

we ran a chi-square test, and then we converted the cat-

egorical data to ordinal, next we ran a logistic regres-

sion, and then we lagged the economic data by a year”)

is rarely acceptable.

Many businesspeople settle on an ROI story: How

will the new decision-making model increase conversions,

revenue, or profitability? For example, a Merck

executive with responsibility for a global business unit

has worked closely with the pharmaceutical company’s

commercial analytics group for many years to answer a

variety of questions, including what the ROIs of direct-

to-consumer promotions are. Before an ROI analysis, he

and the group discuss what actions they will take when

they find out whether promotions are highly, marginally,

or not successful—to make clear that the effort isn’t

merely an academic exercise. After the analysis, the ex-

ecutive sits the analysts down at a table with his man-

agement team to present and debate the results.

Ask lots of questions along the way

Former U.S. Treasury Secretary Larry Summers, who

once served as an adviser to a quantitative hedge fund,

told me that his primary responsibility in that job was

to “look over shoulders”—that is, to ask the smart quants

in the firm equally smart questions about their models

and assumptions. Many of them hadn’t been pressed

like that before; they needed an intelligent consumer of

data to help them think through and improve their work.

No matter how much you trust your quants, don’t

stop asking them tough questions. Here are a few that

almost always lead to more-rigorous, defensible analyses.

(If you don’t understand a reply, ask for one that

uses simpler language.)

1. What was the source of your data?

2. How well do the sample data represent the

population?

3. Does your data distribution include outliers?

How did they affect the results?

4. What assumptions are behind your analysis?

Might certain conditions render your assump-

tions and your model invalid?

5. Why did you decide on that particular analytical

approach? What alternatives did you consider?

6. How likely is it that the independent variables

are actually causing the changes in the dependent

variable? Might other analyses establish causality

more clearly?
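Question 3 in the list above is easy to check for yourself. The sketch below is a rough illustration, not a prescribed method: the order values are hypothetical, and the 1.5 x IQR fence is just one common convention for flagging extreme points. It shows how a single outlier can move the average a great deal while barely touching the median.

```python
import statistics

def flag_outliers(values):
    """Return values outside the common 1.5 x IQR fences (one convention of several)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]

# Hypothetical order values in dollars; the 250 is a deliberate extreme point.
orders = [12, 14, 15, 15, 16, 17, 18, 19, 20, 250]
outliers = flag_outliers(orders)
trimmed = [v for v in orders if v not in outliers]

print("outliers:", outliers)
print("mean with / without:", statistics.mean(orders), round(statistics.mean(trimmed), 2))
print("median with / without:", statistics.median(orders), statistics.median(trimmed))
```

If a result changes sharply when a handful of points are set aside, that is a good prompt to ask the analyst how those points were handled and why.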

Frank Friedman, the chief financial officer and managing

partner for finance and administration of Deloitte’s

U.S. business, is an inveterate questioner. He has assem-

bled a group of data scientists and quantitative analysts

to help him with several initiatives, including optimizing

the pricing of services, developing models that predict

employee performance, and identifying factors that drive

receivables. “People who work with me know I question

a lot—everything—always,” Friedman says. “After the

questioning, they know they will have to go back and

redo some of their analyses.” He also believes it’s vital to

admit when you don’t understand something: “I know I

am not the smartest person in the room in my meetings

with these people. I’m always pushing for greater clarity

[because] if I can’t articulate it, I can’t defend it to others.”

Establish a culture of inquiry, not advocacy

We all know how easily “figures lie and liars figure.” Analytics consumers should never pressure their producers with comments like “See if you can find some evidence in the data to support my idea.” Instead, your explicit goal should be to find the truth. As the head of Merck’s commercial analytics group says, “Our management team wants us to be like Switzerland. We work only for the shareholders.”

In fact, some senior executives push their analysts to

play devil’s advocate. This sets the right cultural tone

and helps to refine the models. “All organizations seek

to please the leader,” explains Gary Loveman, of Caesars,

“so it’s critical to cultivate an environment that views

ideas as separate from people and insists on rigorous evi-

dence to distinguish among those ideas.”

Loveman encourages his subordinates to put forth

data and analysis, rather than opinions, and reveals

his own faulty hypotheses, conclusions, and decisions.

That way managers and quants alike understand that

his sometimes “lame and ill-considered views,” as he

describes them, need as much objective, unbiased test-

ing as anyone else’s. For example, he often says that his

greatest mistake as a new CEO was choosing not to fire

property managers who didn’t share his analytical ori-

entation. He thought their experience would be enough.

Loveman uses the example to show both that he’s fallible

and that he insists on being a consumer of analytics.

When It All Adds Up

Warren Buffett once said, “Beware of geeks . . . bearing

formulas.” But in today’s data-driven world, you can’t af-

ford to do that. Instead you need to combine the science

of analytics with the art of intuition. Be a manager who

knows the geeks, understands their formulas, helps im-

prove their analytic processes, effectively interprets and

communicates the findings to others, and makes better

decisions as a result.

Contrast the bank mentioned at the beginning of this

article with Toronto-Dominion Bank. TD’s CEO, Ed

Clark, is quantitatively literate (with a PhD in econom-

ics), and he also insists that his managers understand the

math behind any financial product the company depends

on. As a result, TD knew to avoid the riskiest-structured

products and get out of others before incurring major

losses during the 2008–2009 financial crisis.

TD’s emphasis on data and analytics affects other ar-

eas of the business as well. Compensation is closely tied

to performance-management measures, for example.

And TD’s branches stay open longer than most other

banks’ because Tim Hockey, the former head of retail

banking, insisted on systematically testing the effect of

extended retail hours (with control groups) and found

that they brought in more deposits. If anyone at a man-

agement meeting suggests a new direction, he or she is

pressed for data and analysis to support it. TD is not per-

fect, Clark acknowledges, but “nobody ever accuses us of

not running the numbers.”

Your organization may not be as analytical as TD, and

your CEO may not be like Ed Clark. But that doesn’t

mean you can’t become a great consumer of analytics

on your own—and set an example for the rest of your

company.

Thomas H. Davenport is the President’s Distinguished

Professor in Management and Information Technology

at Babson College, a research fellow at the MIT Initiative

on the Digital Economy, and a senior adviser at Deloitte

Analytics. Author of over a dozen management books,

his latest is Only Humans Need Apply: Winners and

Losers in the Age of Smart Machines.

CHAPTER 2

A Simple Exercise to Help You Think Like a Data Scientist

by Thomas C. Redman

For 20 years, I’ve used a simple exercise to help those

with an open mind (and a pencil, paper, and calculator)

get started with data. One activity won’t make you data

savvy, but it will help you become data literate, open your

eyes to the millions of small data opportunities, and en-

able you to work a bit more effectively with data scien-

tists, analytics, and all things quantitative.

Adapted from “How to Start Thinking Like a Data Scientist” on hbr.org,

November 29, 2013.

While the exercise is very much a how-to, each step

also illustrates an important concept in analytics—from

understanding variation to visualization.

First, start with something that interests, even both-

ers, you at work, like consistently late-starting meetings.

Form it up as a question and write it down: “Meetings

always seem to start late. Is that really true?”

Next, think through the data that can help answer

your question and develop a plan for creating it. Write

down all the relevant definitions and your protocol for

collecting the data. For this particular example, you have

to define when the meeting actually begins. Is it the time

someone says, “OK, let’s begin”? Or the time the real

business of the meeting starts? Does kibitzing count?

Now collect the data. It is critical that you trust the

data. And, as you go, you’re almost certain to find gaps

in data collection. You may find that even though a meeting

has started, it starts anew when a more senior person

joins in. Modify your definition and protocol as you

go along.

Sooner than you think, you’ll be ready to start draw-

ing some pictures. Good pictures make it easier for you

to both understand the data and communicate main

points to others. There are plenty of good tools to help,

but I like to draw my first picture by hand. My go-to plot

is a time-series plot, where the horizontal axis has the

date and time and the vertical axis has the variable of interest.

Thus, a point on the graph in figure 2-1 is the date

and time of a meeting versus the number of minutes late.

FIGURE 2-1: How late are meetings?

Now return to the question that you started with

and develop summary statistics. Have you discovered an

answer? In this case, “Over a two-week period, 10% of

the meetings I attended started on time. And on average,

they started 12 minutes late.”

But don’t stop there. Ask yourself, “So what?” In this

case, “If those two weeks are typical, I waste an hour a

day. And that costs the company x dollars a year.”
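The summary statistics and the “so what?” can be worked out with pencil and calculator, or with a short script like the one below. The 20 delays are a hypothetical log, chosen only so that they reproduce the 10% and 12-minute figures quoted above. The five meetings a day, the $100 hourly cost, and the 250 workdays are likewise assumptions to be replaced with your own numbers.

```python
# Hypothetical log of minutes late for 20 meetings (0 means on time).
minutes_late = [0, 0, 7, 8, 8, 9, 10, 10, 11, 12,
                12, 13, 14, 14, 15, 16, 18, 19, 20, 24]

def summarize(delays):
    """Share of meetings that started on time and the average delay in minutes."""
    on_time_share = sum(1 for d in delays if d == 0) / len(delays)
    average_delay = sum(delays) / len(delays)
    return on_time_share, average_delay

def annual_cost(average_delay, meetings_per_day, hourly_cost, workdays_per_year):
    """Turn the average delay into a yearly dollar figure (the "so what?")."""
    hours_lost_per_day = average_delay * meetings_per_day / 60
    return hours_lost_per_day * hourly_cost * workdays_per_year

on_time_share, average_delay = summarize(minutes_late)
# Assumed inputs: 5 meetings a day, $100 per hour, 250 workdays a year.
cost = annual_cost(average_delay, meetings_per_day=5,
                   hourly_cost=100, workdays_per_year=250)
print(f"{on_time_share:.0%} on time, {average_delay:.0f} minutes late on average")
print(f"Estimated cost: ${cost:,.0f} per year")
```

With these assumed inputs, five meetings a day at 12 minutes each adds up to the hour a day mentioned above.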

Many analyses end because there is no “so what?”

Certainly if 80% of meetings start within a few minutes

of their scheduled start times, the answer to the original

question is, “No, meetings start pretty much on time,”

and there is no need to go further.

But this case demands more, as some analyses do. Get

a feel for variation. Understanding variation leads to a

better feel for the overall problem, deeper insights, and

novel ideas for improvement. Note on the graph that

8–20 minutes late is typical. A few meetings start right

on time, others nearly a full 30 minutes late. It would

be great if you could conclude, “I can get to meetings

10 minutes late, just in time for them to start,” but the

variation is too great.
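A quick way to get a feel for variation is to look past the average at the spread. The sketch below reuses the same hypothetical 20-meeting log (invented data, not the figures behind the chart) and asks how often you would miss the start if you made a habit of arriving 10 minutes late.

```python
import statistics

# Hypothetical minutes-late log for 20 meetings (0 means on time).
minutes_late = [0, 0, 7, 8, 8, 9, 10, 10, 11, 12,
                12, 13, 14, 14, 15, 16, 18, 19, 20, 24]

def missed_start_share(delays, arrival_delay):
    """Share of meetings already under way if you arrive `arrival_delay` minutes late."""
    return sum(1 for d in delays if d < arrival_delay) / len(delays)

print("range:", min(minutes_late), "to", max(minutes_late), "minutes")
print("standard deviation:", round(statistics.pstdev(minutes_late), 1))
print("missed starts if arriving 10 minutes late:",
      f"{missed_start_share(minutes_late, 10):.0%}")
```

An average of 12 minutes hides the fact that a sizable share of these meetings begin well before the 10-minute mark, which is why the shortcut does not work.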

Now ask, “What else does the data reveal?” It strikes

me that six meetings began exactly on time, while every

other meeting began at least seven minutes late. In this

case, bringing meeting notes to bear reveals that all six

on-time meetings were called by the vice president of

finance. Evidently, she starts all her meetings on time.
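Bringing a second piece of information to bear, such as who called each meeting, amounts to grouping the data. The following sketch assumes meeting notes recorded as (organizer, minutes late) pairs. The organizers and delays are invented for illustration.

```python
from collections import defaultdict

# Hypothetical meeting notes: (organizer, minutes late). Names are invented.
meetings = [
    ("VP Finance", 0), ("Marketing lead", 12), ("VP Finance", 0),
    ("Project manager", 9), ("Marketing lead", 18), ("VP Finance", 0),
    ("Project manager", 7), ("Marketing lead", 15),
]

def always_on_time(records):
    """Return organizers whose meetings all started with zero delay."""
    delays_by_organizer = defaultdict(list)
    for organizer, delay in records:
        delays_by_organizer[organizer].append(delay)
    return sorted(name for name, delays in delays_by_organizer.items()
                  if all(d == 0 for d in delays))

print(always_on_time(meetings))
```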

So where do you go from here? Are there important

next steps? This example illustrates a common dichot-

omy. On a personal level, results pass both the “inter-

esting” and “important” test. Most of us would give al-

most anything to get back an hour a day. And you may

not be able to make all meetings start on time, but if the

VP can, you can certainly start the meetings you control

promptly.

On the company level, results so far pass only the in-

teresting test. You don’t know whether your results are

typical, nor whether others can be as hard-nosed as the

VP when it comes to starting meetings. But a deeper look

is surely in order: Are your results consistent with others’

experiences in the company? Are some days worse than

others? Which starts later: conference calls or face-to-

face meetings? Is there a relationship between meeting

start time and most senior attendee? Return to step one,

pose the next group of questions, and repeat the process.

Keep the focus narrow—two or three questions at most.

I hope you’ll have fun with this exercise. Many find

joy in teasing insights from data. But whether you ex-

perience that joy or not, don’t take this exercise lightly.

There are fewer and fewer places for the “data illiterate”

and, in my humble opinion, no more excuses.

Thomas C. Redman, “the Data Doc,” is President of Data

Quality Solutions. He helps companies and people, in-

cluding startups, multinationals, executives, and leaders

at all levels, chart their courses to data-driven futures.

He places special emphasis on quality, analytics, and or-

ganizational capabilities.

SECTION TWO

Gather the Right Information

CHAPTER 3

Do You Need All That Data?

by Ron Ashkenas

Organizations love data: numbers, reports, trend lines,

graphs, spreadsheets—the more the better. And, as a re-

sult, many organizations have a substantial internal fac-

tory that churns out data on a regular basis, as well as

external resources on call that produce data for onetime

studies and questions. But what’s the evidence (or dare I

say “the data”) that all this data leads to better business

decisions? Is some amount of data collection unneces-

sary, and perhaps even damaging by creating complexity

and confusion?

Adapted from content posted on hbr.org, March 1, 2010 (product #H004FC).

Let’s look at a quick case study: For many years the

CEO of a premier consumer products company insisted

on a monthly business review process that was highly

data-intensive. At its core was a “book” that contained

cost and sales data for every product sold by the com-

pany, broken down by business unit, channel, geography,

and consumer segment. This book (available electroni-

cally but always printed by the executive team) was sev-

eral inches thick. It was produced each month by many

hundreds of finance, product management, and information

technology people who spent thousands of hours

collecting, assessing, analyzing, reconciling, and sorting

the data.

Since this was the CEO’s way of running the business,

no one really questioned whether all of this activity was

worth it, although many complained about the time re-

quired. When a new CEO came on the scene a couple of

years ago, however, he decided that the business would

do just fine with quarterly reviews and exception-only

reporting. Suddenly the entire data-production indus-

try of the company was reduced substantially—and the

company didn’t miss a beat.

Obviously, different CEOs have different needs for

data. Some want their decisions to be based on as much

hard data as possible; others want just enough to either

reinforce or challenge their intuition; and still others

may prefer a combination of hard, analytical data with

anecdotal and qualitative input. In all cases, though,

managers would do well to ask themselves four ques-

tions about their data process as a way of improving the

return on what is often a substantial (but not always vis-

ible) investment:

1. Are we asking the right questions? Many compa-

nies collect the data that is available, rather than

the information that is needed to help make deci-

sions and run the business. So the starting point

is to be clear about a limited number of key ques-

tions that you want the data to help you answer—

and then focus the data collection around those

rather than everything else that is possible.

2. Does our data tell a story? Most data comes in

fragments. To be useful, these individual bits of

information need to be put together into a coher-

ent explanation of the business situation, which

means integrating data into a “story.” While

enterprise data systems have been useful in driving

consistent data definitions so that points can

be added and compared, they don’t automatically

create the story. Instead, managers should con-

sider in advance what data is needed to convey

the story that they will be required to tell.

3. Does our data help us look ahead rather than

behind? Most of the data that is collected in

companies tells managers how they performed in

a past period—but is less effective in predicting

future performance. Therefore, it is important to

ask what data, in what time frames, will help us

get ahead of the curve instead of just reacting.

4. Do we have a good mix of quantitative and quali-

tative data? Neither quantitative nor qualitative

data tells the whole story. For example, to make

good product and pricing decisions, we need to

know not only what is being sold to whom, but

also why some products are selling more than

others.

Clearly, business data and its analysis are critical for

organizations to succeed, which is underscored by the

fact that companies like IBM are investing billions of

dollars in acquisitions in the business intelligence and

analytics space. But even the best automated tools won’t

be effective unless managers are clear about these four

questions.

Ron Ashkenas is an Emeritus Partner with Schaffer

Consulting, a frequent contributor to Harvard Busi-

ness Review, and the author or coauthor of four books

on organizational transformation. He has worked with

hundreds of managers over the years to help them trans-

late strategy into results and simplify the way things get

done. He also is the coauthor (with Brook Manville) of

The Harvard Business Review Leader’s Handbook (Har-

vard Business Review Press; forthcoming in 2018). He

can be reached at [email protected].

CHAPTER 4

How to Ask Your Data Scientists for Data and Analytics

by Michael Li, Madina Kassengaliyeva, and Raymond Perkins

The intersection of big data and business is growing

daily. Although enterprises have been studying analyt-

ics for decades, data science is a relatively new capability.

And interacting in a new data-driven culture can be difficult,

particularly for those who aren’t data experts.

One particular challenge that many of these individ-

uals face is how to request new data or analytics from

data scientists. They don’t know the right questions to

ask, the correct terms to use, or the range of factors to

consider to get the information they need. In the end,

analysts are left uncertain about how to proceed, and

managers are frustrated when the information they get

isn’t what they intended.

At The Data Incubator, we work with hundreds of

companies looking to hire data scientists and data engineers

or enroll their employees in our corporate training

programs. We often field questions from our hiring

and training clients about how to interact with their data

experts. While it’s impossible to give an exhaustive ac-

count, here are some important factors to think about

when communicating with data scientists, particularly

as you begin a data search.

What Question Should We Ask?

As you begin working with your data analysts, be clear

about what you hope to achieve. Think about the business

impact you want the data to have and the company’s

ability to act on that information. By hearing what you

hope to gain from their assistance, the data scientist can

collaborate with you to define the right set of questions

to answer and better understand exactly what informa-

tion to seek.

Even the subtlest ambiguity can have major implications. For example, advertising managers may ask analysts, “What is the most efficient way to use ads to increase sales?” Though this seems reasonable, it may not be the right question since the ultimate objective of most firms isn’t to increase sales, but to maximize profit. Research from the Institute of Practitioners in Advertising shows that using ads to reduce price sensitivity is typically twice as profitable as trying to increase sales.1 The value of the insight obtained will depend heavily on the question asked. Be as specific and actionable as possible.

What Data Do We Need?

As you define the right question and objectives for

analysis, you and your data scientist should assess the

availability of the data. Ask if someone has already col-

lected the relevant data and performed analysis. The

ever-growing breadth of public data often provides eas-

ily accessible answers to common questions. Cerner, a

supplier of health care IT solutions, uses data sets from

the U.S. Department of Health and Human Services to

supplement their own data. iMedicare uses information

from the Centers for Medicare and Medicaid Services to

select policies. Consider whether public data could be

used toward your problem as well. You can also work

with other analysts in the organization to determine if

the data has previously been examined for similar rea-

sons by others internally.

Then, assess whether the available data is sufficient.

Data may not contain all the relevant information

needed to answer your questions. It may also be influenced

by latent factors that can be difficult to recognize.

Consider the vintage effect in private lending data: Even

seemingly identical loans typically perform very dif-

ferently based on the time of issuance, despite the fact

they may have had identical data at that time. The effect

comes from fluctuations in the underlying underwriting

standards at issuance, information that is not typically

represented in loan data.

You should also inquire if the data is unbiased, since

sample size alone is not sufficient to guarantee its validity.

Finally, ask if the data scientist has enough data to

answer the question. By identifying what information is

needed, you can help data scientists plan better analyses

going forward.

How Do We Obtain the Data?

If more information is needed, data scientists must

decide between using data compiled by the company

through the normal course of business, such as through

observational studies, and collecting new data through

experiments. As part of your conversation with analysts,

ask about the costs and benefits of these options. Observational

studies may be easier and less expensive to arrange

since they do not require direct interaction with

subjects, for example, but they are typically far less reli-

able than experiments because they are only able to es-

tablish correlation, not causation.
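A small worked example may help show why observational data can mislead about cause and effect. Everything in the sketch below is invented: the two customer segments, the counts, and the spending figures. It assumes loyal customers both spend more and are more likely to have seen a promotion, so a naive comparison credits the promotion with a difference that really comes from loyalty.

```python
# Hypothetical customer groups: (segment, saw_promotion, customers, average_spend).
# Loyal customers both spend more and are more likely to have seen the promotion.
groups = [
    ("loyal", True, 80, 105.0), ("loyal", False, 20, 100.0),
    ("new", True, 20, 55.0), ("new", False, 80, 50.0),
]

def weighted_mean(rows):
    """Average spend across rows, weighted by the number of customers."""
    total_customers = sum(count for _, _, count, _ in rows)
    return sum(count * spend for _, _, count, spend in rows) / total_customers

def naive_lift(rows):
    """Observational comparison: everyone exposed versus everyone not exposed."""
    exposed = [r for r in rows if r[1]]
    unexposed = [r for r in rows if not r[1]]
    return weighted_mean(exposed) - weighted_mean(unexposed)

def lift_within(rows, segment):
    """The same comparison restricted to one segment, holding loyalty fixed."""
    return naive_lift([r for r in rows if r[0] == segment])

print("naive lift:", naive_lift(groups))
print("lift among loyal customers:", lift_within(groups, "loyal"))
print("lift among new customers:", lift_within(groups, "new"))
```

In this made-up data the naive comparison suggests a $35 lift, while within each segment the lift is only $5. Random assignment in an experiment would spread loyal and new customers evenly across both arms, so the overall comparison would not be distorted in this way.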

Experiments allow substantially more control and

provide more reliable information about causality, but

they are often expensive and difficult to perform. Even

seemingly harmless experiments may carry ethical or social

implications with real financial consequences. Facebook,

for example, faced public fury over its manipulation

of its own newsfeed to test how emotions spread on

social media. Though the experiments were completely

legal, many users resented being unwitting participants

in Facebook’s tests. Managers must think beyond the

data and consider the greater brand repercussions of

data collection and work with data scientists to under-

stand these consequences. (See the sidebar, “Under-

standing the Cost of Data.”)

Before investing resources in new analysis, validate that the company can use the insights derived from it in a productive and meaningful way. This may entail integration with existing technology projects, providing new data to automated systems, and establishing new processes.

UNDERSTANDING THE COST OF DATA

Though effective data analysis has been shown to generate substantial financial gains, there can be many different costs and complexities associated with it. Obtaining good data may not only be difficult, but very expensive. For example, in the health care and pharmaceutical industry, data collection is often associated with medical experimentation and patient observations. These randomized control trials can easily cost millions. Data storage can cost millions annually as well. When interacting with data scientists, managers should ask about the specific risks and costs associated with obtaining and analyzing the data before moving forward with a project.

But not all costs associated with data collection are financial. Violations of user privacy can have enormous legal and reputational repercussions. Privacy is one of the most significant concerns regarding consumer data. Managers must consider and weigh the legal and ethical implications of their data collection and analysis methods. Even seemingly anonymized data can be used to identify individuals. Safely anonymized data can be de-anonymized when combined with other data sets. In a famous case, Carnegie Mellon University researchers were able to identify anonymized health care records of a former Massachusetts governor using only his ZIP code, birthday, and gender.2 The Gartner Data Center predicted that through 2016, over 25% of firms using consumer data would incur reputation damage due to privacy violation issues.3 Managers must ask data scientists about these risks when working with the company’s potentially sensitive consumer data.
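To see how data with the names removed can still point to individuals, consider the following sketch. It is a toy illustration with entirely fictional records and made-up ZIP codes, not a reconstruction of the case described above. It links an “anonymized” file to a public list using only ZIP code, birth date, and gender.

```python
from collections import Counter

# Entirely fictional records. The "anonymized" file has no names,
# but it keeps three quasi-identifiers: ZIP code, birth date, and gender.
health_records = [
    {"zip": "00001", "birth_date": "1950-03-14", "gender": "M", "diagnosis": "asthma"},
    {"zip": "00002", "birth_date": "1982-11-02", "gender": "F", "diagnosis": "diabetes"},
]
public_roll = [
    {"name": "A. Example", "zip": "00001", "birth_date": "1950-03-14", "gender": "M"},
    {"name": "B. Sample", "zip": "00002", "birth_date": "1975-06-30", "gender": "F"},
    {"name": "C. Placeholder", "zip": "00002", "birth_date": "1982-11-02", "gender": "F"},
]

def quasi_key(record):
    """The combination of fields that can act as a fingerprint."""
    return (record["zip"], record["birth_date"], record["gender"])

def reidentify(anonymized, named):
    """Attach a name to each anonymized record whose key matches exactly one person."""
    key_counts = Counter(quasi_key(person) for person in named)
    names_by_key = {quasi_key(person): person["name"] for person in named}
    return {
        names_by_key[quasi_key(rec)]: rec["diagnosis"]
        for rec in anonymized
        if key_counts[quasi_key(rec)] == 1
    }

print(reidentify(health_records, public_roll))
```

A useful question for your data scientists is how many people in a data set share each combination of such fields. Combinations held by only one person are the ones at risk.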

Is the Data Clean and Easy to Analyze?

In general, data comes in two forms: structured and unstructured. Structured data is, as its name implies, organized in a predefined format and easy to add to a database. Most analysts find it easier and faster to manipulate. Unstructured data is often free form and cannot be as easily stored in the types of relational databases most commonly used in enterprises. While unstructured data is estimated to make up 95% of the world’s data, according to a report by professors Amir Gandomi and Murtaza Haider of Ryerson University, for many large companies, storing and manipulating unstructured data may require a significant investment of resources to extract necessary information.4 Working with your data scientists, evaluate the additional costs of using unstructured data when defining your initial objectives.

Even if the data is structured, it still may need to be cleaned or checked for incompleteness and inaccuracies. When possible, encourage analysts to use clean data first. Otherwise, they will have to spend valuable time and resources identifying and correcting inaccurate records. A 2014 survey conducted by Ascend2, a marketing research company, found that nearly 54% of respondents complained that a “lack of data quality/completeness” was their most prominent impediment. By searching for clean data, you can avoid significant problems and loss of time.
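A quick audit can show how much cleaning a data set needs before anyone commits to using it. The Python sketch below is one possible approach under simple assumptions: the customer records, the field names, and the validity rules are all hypothetical, and `audit_records` is an illustrative helper rather than a standard tool.

```python
def audit_records(records, required_fields, validators):
    """Count incomplete, invalid, and clean records in a data set.

    required_fields: field names that must be present and non-empty.
    validators: mapping of field name -> function returning True if the value is plausible.
    """
    report = {"total": len(records), "incomplete": 0, "invalid": 0, "clean": 0}
    for record in records:
        missing = any(record.get(field) in (None, "") for field in required_fields)
        invalid = any(
            record.get(field) not in (None, "") and not check(record[field])
            for field, check in validators.items()
        )
        if missing:
            report["incomplete"] += 1
        if invalid:
            report["invalid"] += 1
        if not missing and not invalid:
            report["clean"] += 1
    return report


# Hypothetical customer records with typical problems.
customers = [
    {"id": 1, "email": "a@example.com", "age": 34},   # clean
    {"id": 2, "email": "", "age": 51},                # missing email
    {"id": 3, "email": "c@example.com", "age": -7},   # impossible age
    {"id": 4, "email": "not-an-email", "age": None},  # bad email, missing age
]

quality = audit_records(
    customers,
    required_fields=["id", "email", "age"],
    validators={"email": lambda v: "@" in v, "age": lambda v: 0 <= v <= 120},
)
print(quality)
```

A record can be both incomplete and invalid, so the two counts may overlap. The share of clean records gives a rough sense of the cleanup cost to weigh against the value of the analysis.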

Is the Model Too Complicated?

Statistical techniques and open-source tools to analyze data abound, but simplicity is often the best choice. More complex and flexible tools are more prone to overfitting and can take more time to develop (read more about overfitting in chapter 15, “Pitfalls of Data-Driven Decisions”). Work with your data scientists to identify the simpler techniques and tools and move to more complex models only if the simpler ones prove insufficient. It is important to observe the KISS rule: “Keep It Simple, Stupid!”

It may not be possible to avoid all of the expenses and issues related to data collection and analysis. But you can take steps to mitigate these costs and risks. By asking the right questions of your analysts, you can ensure proper collaboration and get the information you need to move forward confidently.


Michael Li is the founder and executive director of The Data Incubator, a big data company that trains and places data scientists. A data scientist himself, he has worked at Google, Foursquare, and Andreessen Horowitz. He is a regular contributor to VentureBeat, The Next Web, and Harvard Business Review. Madina Kassengaliyeva is a client services director with Think Big, a Teradata company. She helps clients realize high-impact business opportunities through effective implementation of big data and analytics solutions. Madina has managed accounts in the financial services and insurance industries and led successful strategy, solution development, and analytics engagements. Raymond Perkins is a researcher at Princeton University working at the intersection of statistics, data, and finance and is the executive director of the Princeton Quant Trading Conference. He has also conducted research at Hong Kong University of Science and Technology, the Mathematical Sciences Research Institute (MSRI), and Michigan State University.

NOTES

1. P. F. Mouncey, “Marketing in the Era of Accountability,” Journal of Direct, Data and Digital Marketing Practice 9, no. 2 (December 2007): 225–228.

2. N. Anderson, “‘Anonymized’ Data Really Isn’t—and Here’s Why Not,” Ars Technica, September 8, 2009, https://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/.

3. D. Laney, “Information Innovation Key Initiative Overview,” Gartner Research, April 22, 2014, https://www.gartner.com/doc/2715317/information-innovation-key-initiative-overview.

4. A. Gandomi and M. Haider, “Beyond the Hype: Big Data Concepts, Methods, and Analytics,” International Journal of Information Management 35, no. 2 (April 2015): 137–144.


CHAPTER 5

How to Design a Business Experiment

by Oliver Hauser and Michael Luca

The rise of experimental evaluations within organizations, or what economists refer to as field experiments, has the potential to transform organizational decision making, providing fresh insight into areas ranging from product design to human resources to public policy. Companies that invest in randomized evaluations can gain a game-changing advantage.

Adapted from “How to Design (and Analyze) a Business Experiment” on hbr.org, October 29, 2015 (product #H02FSL).

Yet while there has been a rapid growth in experiments, especially within tech companies, we’ve seen too many run incorrectly. Even when they’re set up properly, avoidable mistakes often happen during implementation. As a result, many organizations fail to receive the real benefits of the scientific method.

This chapter lays out seven steps to ensure that your experiment delivers the data and insight you need. These principles draw on the academic research on field experiments as well as our work with a variety of organizations ranging from Yelp to the UK government.

1. Identify a Narrow Question

It is tempting to run an experiment around a question such as “Is advertising worth the cost?” or “Should we reduce (or increase) our annual bonuses?” Indeed, beginning with a question that is central to your broader goals is a good start. But it’s misguided to think that a single experiment will do the trick. The reason is simple: Multiple factors go into answering these types of big questions.

Take the issue of whether advertising is worth the cost. What form of advertising are we talking about, and for which products, in which media, over which time periods? Your question should be testable, which means it must be narrow and clearly defined. A better question might be, “How much does advertising our brand name on Google AdWords increase monthly sales?” This is an empirical question that an experiment can answer, and one that feeds into the question you ultimately hope to resolve. In fact, through just such an experiment, researchers at eBay discovered that a long-standing brand-advertising strategy on Google had no effect on the rate at which paying customers visited eBay.


2. Use a Big Hammer

Companies experiment when they don’t know what will work best. Faced with this uncertainty, it may sound appealing to start small in order to avoid disrupting things. But your goal should be to see whether some version of your intervention (your new change) will make a difference to your customers. This requires a large-enough intervention.

For example, suppose a grocery store is considering adding labels to items to show consumers that it sources mainly from local farms. How big should the labels be and where should they be attached? We would suggest starting with large labels on the front of the packages, because if the labels were small or on the backs of the packages, and there were no effect (a common outcome for subtle interventions), the store managers would be left to wonder whether consumers simply didn’t notice the tags (the treatment wasn’t large enough) or truly didn’t care (there was no treatment effect). By starting with a big hammer, the store would learn whether customers care about local sourcing. If there’s no effect from large labels on the package fronts, then the store should give up on the idea. If there is an effect, the experimenters can later refine the labels to the desired characteristics.

3. Perform a Data Audit

Once you know what your intervention is, you need to choose what data to look at. Make a list of all the internal data related to the outcome you would like to influence and when you will need to do the measurements. Include data both about things you hope will change and things you hope won’t change as a result of the intervention, because you’ll need to be alert for unintended consequences. Think, too, about sources of external data that might add perspective.

Say you’re launching a new cosmetics product and you want to know which type of packaging leads to the highest customer loyalty and satisfaction. You decide to run a randomized controlled trial across geographical areas. In addition to measuring recurring orders and customer service feedback (internal data), you can track user reviews on Amazon and look for differences among customers in different states (external data).

4. Select a Study Population

Choose a subgroup among your customers that matches the customer profile you are hoping to understand. It might be tempting to look for the easiest avenue to get a subgroup, such as online users, but beware: If your subgroup is not a good representation of your target customers, the findings of your experiment may not be applicable. For example, younger online customers who shop exclusively on your e-commerce platform may behave very differently than older in-store customers. You could use the former to generalize to your online platform strategy, but you may be misguided if you try to draw inferences from that group for your physical stores.

5. Randomize

Randomly assign some people to a treatment group and others to a control group. The treatment group receives the change you want to test, while the control group receives what you previously had on offer. Make sure there are no differences other than what you are testing. The first rule of randomization is to not let participants decide which group they are in, or the results will be meaningless. The second is to make sure there really are no differences between treatment and control.

It’s not always easy to follow the second rule. For example, we’ve seen companies experiment by offering a different coupon on Sunday than on Monday. The problem is that Sunday shoppers may be systematically different from Monday shoppers, even if you control for the volume of shoppers on each day.
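In practice, random assignment can be as simple as shuffling the list of participants and splitting it in half. The Python sketch below shows one way to do that; the participant IDs are hypothetical, and the fixed seed is an assumption added only so the assignment can be reproduced and audited later.

```python
import random


def assign_groups(participant_ids, seed=None):
    """Shuffle participants and split them evenly into treatment and control.

    Participants never choose their own group; the shuffle decides.
    """
    rng = random.Random(seed)  # a recorded seed makes the assignment reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return {"treatment": shuffled[:midpoint], "control": shuffled[midpoint:]}


# Hypothetical customer IDs 1 through 100.
groups = assign_groups(range(1, 101), seed=42)
print(len(groups["treatment"]), len(groups["control"]))
```

Because the split ignores day of the week, store, and every other customer trait, those traits should be balanced across the two groups on average, which is what the Sunday versus Monday coupon design fails to achieve.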

6. Commit to a Plan, and Stick to It

Before you run an experiment, lay out your plans in detail. How many observations will you collect? How long will you let the experiment run? What variables will be collected and analyzed? Record these details. This can be as simple as creating a Google spreadsheet or as official as using a public trial registry. Not only will this level of transparency make sure that everyone is on the same page, it will also help you avoid well-known pitfalls in the implementation of experiments.

Once your experiment is running, leave it alone! If you get a result you expected, great; if not, that’s fine too. The one thing that’s not OK: running your experiment until your results look as though they fit your hypothesis, rather than until the study has run its planned course. This type of practice has led to a “replication crisis” in psychology research; it can seriously bias your results and reduce the insight you receive. Stick to the plan, to the extent possible.

7. Let the Data Speak

To give a complete picture of your results, report multiple outcomes. Sure, some might be unchanged, unimpressive, or downright inexplicable. But better to be transparent about them than to ignore them. Once you’ve surveyed the main results, ask yourself whether you’ve really discovered the underlying mechanism behind your results (the factor that is driving them). If you’re not sure, refine your experiment and run another trial to learn more.

Experiments are already a central part of the social sciences; they are quickly becoming central to organizations as well. If your experiments are well designed, they will tell you something valuable. The most successful will puncture your assumptions, change your practices, and put you ahead of competitors. Experimentation is a long-term, richly informative process, with each trial forming the starting point for the next.

Oliver Hauser is a research fellow at Harvard Business School and Harvard Kennedy School. He conducts research and runs experiments with organizations and governments around the world. Michael Luca is the Lee J. Styslinger III Associate Professor of Business Administration at Harvard Business School and works with a variety of organizations to design experiments.


CHAPTER 6

Know the Difference Between Your Data and Your Metrics

by Jeff Bladt and Bob Filbin

How many views make a YouTube video a success? How about 1.5 million? That’s how many views a video posted in 2011 by our organization, DoSomething.org, received. It featured some well-known YouTube celebrities, who asked young people to donate their used sports equipment to youth in need. It was twice as popular as any video DoSomething.org had posted to date. Success! Then came the data report: only eight viewers had signed up to donate equipment, and no one actually donated.

Adapted from content posted on hbr.org, March 4, 2013.


Zero donations from 1.5 million views. Suddenly, it was clear that for DoSomething.org, views did not equal success. In terms of donations, the video was a complete failure.

What happened? We were concerned with the wrong metric. A metric contains a single type of data (video views or equipment donations). A successful organization can only measure so many things well, and what it measures ties to its definition of success. For DoSomething.org, that’s social change. In the case above, success meant donations, not video views. As we learned, there is a difference between numbers and numbers that matter. This is what separates data from metrics.

You Can’t Pick Your Data, but You Must Pick Your Metrics

Take baseball. Every team has the same definition of success: winning the World Series. This requires one main asset: good players. But what makes a player good? In baseball, teams used to answer this question with a handful of simple metrics like batting average and runs batted in (RBIs). Then came the statisticians (remember Moneyball?). New metrics provided teams with the ability to slice their data in new ways, find better ways of defining good players, and thus win more games.

Keep in mind that all metrics are proxies for what ultimately matters (in the case of baseball, a combination of championships and profitability), but some are better than others. The data of the game has never changed: there are still RBIs and batting averages. What has changed is how we look at the data. And those teams that slice the data in smarter ways are able to find good players who have been traditionally undervalued.

Organizations Become Their Metrics

Metrics are what you measure. And what you measure is what you manage to. In baseball, a critical question is, how effective is a player when he steps up to the plate? One measure is hits. A better measure turns out to be the sabermetric “OPS,” a combination of on-base percentage (which includes hits and walks) and total bases (slugging). Teams that look only at batting average suffer. Players on these teams walk less, with no offsetting gains in hits. In short, players play to the metrics their management values, even at the cost of the team.
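To see how the choice of metric changes the ranking, the Python sketch below computes batting average and OPS for two players. The formulas are the conventional ones (on-base percentage plus slugging percentage); the two players and their season totals are invented for illustration.

```python
def batting_average(p):
    """Hits per at-bat. Walks are ignored entirely."""
    return p["hits"] / p["at_bats"]


def on_base_percentage(p):
    """How often the player reaches base, counting walks and hit-by-pitch."""
    reached = p["hits"] + p["walks"] + p["hit_by_pitch"]
    chances = p["at_bats"] + p["walks"] + p["hit_by_pitch"] + p["sacrifice_flies"]
    return reached / chances


def slugging_percentage(p):
    """Total bases per at-bat, so extra-base hits count for more."""
    singles = p["hits"] - p["doubles"] - p["triples"] - p["home_runs"]
    total_bases = singles + 2 * p["doubles"] + 3 * p["triples"] + 4 * p["home_runs"]
    return total_bases / p["at_bats"]


def ops(p):
    """On-base plus slugging."""
    return on_base_percentage(p) + slugging_percentage(p)


# Hypothetical season totals for two invented players.
contact_hitter = {"at_bats": 500, "hits": 150, "doubles": 25, "triples": 5,
                  "home_runs": 10, "walks": 20, "hit_by_pitch": 0, "sacrifice_flies": 0}
patient_hitter = {"at_bats": 500, "hits": 135, "doubles": 30, "triples": 0,
                  "home_runs": 25, "walks": 90, "hit_by_pitch": 5, "sacrifice_flies": 5}

for name, player in [("contact", contact_hitter), ("patient", patient_hitter)]:
    print(f"{name}: AVG {batting_average(player):.3f}, OPS {ops(player):.3f}")
```

The first player wins on batting average, while the second, who walks far more often and hits for more power, wins on OPS. A team that managed only to batting average would undervalue the second player.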

The same happens in workplaces. Measure YouTube views? Your employees will strive for more and more views. Measure downloads of a product? You’ll get more of that. But if your actual goal is to boost sales or acquire members, better measures might be return-on-investment (ROI), on-site conversion, or retention. Do people who download the product keep using it or share it with others? If not, all the downloads in the world won’t help your business. (See the sidebar, “Picking Statistics,” to learn how to choose metrics that align with a specific performance objective.)

In the business world, we talk about the difference between vanity metrics and meaningful metrics. Vanity metrics are like dandelions: they might look pretty, but to most of us, they’re weeds, using up resources and doing nothing for your property value. Vanity metrics for your organization might include website visitors per month, Twitter followers, Facebook fans, and media impressions. Here’s the thing: If these numbers go up, they might drive up sales of your product. But can you prove it? If yes, great. Measure away. But if you can’t, they aren’t valuable.

PICKING STATISTICS

by Michael Mauboussin

The following is a process for choosing metrics that allow you to understand, track, and manage the cause-and-effect relationships that determine your company’s performance. I will illustrate the process in a simplified way using a retail bank, based on an analysis of 115 banks by Venky Nagar of the University of Michigan and Madhav Rajan of Stanford. Leave aside, for the moment, which metrics you currently use or which ones Wall Street analysts or bankers say you should. Start with a blank slate and work through these four steps in sequence.

1. Define Your Governing Objective

A clear objective is essential to business success because it guides the allocation of capital. Creating economic value is a logical governing objective for a company that operates in a free market system. Companies may choose a different objective, such as maximizing the firm’s longevity. We will assume that the retail bank seeks to create economic value.

2. Develop a Theory of Cause and Effect to Assess Presumed Drivers of the Objective

The three commonly cited financial drivers of value creation are sales, costs, and investments. More-specific financial drivers vary among companies and can include earnings growth, cash flow growth, and return on invested capital.

Naturally, financial metrics can’t capture all value-creating activities. You also need to assess nonfinancial measures such as customer loyalty, customer satisfaction, and product quality, and determine if they can be directly linked to the financial measures that ultimately deliver value. As we’ve discussed, the link between value creation and financial and nonfinancial measures like these is variable and must be evaluated on a case-by-case basis.

In our example, the bank starts with the theory that customer satisfaction drives the use of bank services and that usage is the main driver of value. This theory links a nonfinancial and a financial driver. The bank then measures the correlations statistically to see if the theory is correct and determines that satisfied customers indeed use more services, allowing the bank to generate cash earnings growth and attractive returns on assets, both indicators of value creation. Having determined that customer satisfaction is persistently and predictively linked to returns on assets, the bank must now figure out which employee activities drive satisfaction.

3. Identify the Specific Activities That Employees Can Do to Help Achieve the Governing Objective

The goal is to make the link between your objective and the measures that employees can control through the application of skill. The relationship between these activities and the objective must also be persistent and predictive.

In the previous step, the bank determined that customer satisfaction drives value (it is predictive). The bank now has to find reliable drivers of customer satisfaction. Statistical analysis shows that the rates consumers receive on their loans, the speed of loan processing, and low teller turnover all affect customer satisfaction. Because these are within the control of employees and management, they are persistent. The bank can use this information to, for example, make sure that its process for reviewing and approving loans is quick and efficient.


4. Evaluate Your Statistics

Finally, you must regularly reevaluate the measures you are using to link employee activities with the governing objective. The drivers of value change over time, and so must your statistics. For example, the demographics of the retail bank’s customer base are changing, so the bank needs to review the drivers of customer satisfaction. As the customer base becomes younger and more digitally savvy, teller turnover becomes less relevant and the bank’s online interface and customer service become more so. Companies have access to a growing torrent of statistics that could improve their performance, but executives still cling to old-fashioned and often flawed methods for choosing metrics. In the past, companies could get away with going on gut and ignoring the right statistics because that’s what everyone else was doing. Today, using them is necessary to compete. More to the point, identifying and exploiting them before rivals do will be the key to seizing advantage.

Excerpted from “The True Measures of Success” in Harvard Business Review, October 2012 (product #R1210B).

Michael Mauboussin is an investment strategist and an adjunct professor at Columbia Business School. His latest book is The Success Equation (Harvard Business Review Press, 2012).


Metrics Are Only Valuable if You Can Manage to Them

Good metrics have three key attributes: Their data is consistent, cheap, and quick to collect. A simple rule of thumb: If you can’t measure results within a week for free (and if you can’t replicate the process), then you’re prioritizing the wrong ones. There are exceptions, but they are rare. In baseball, the metrics an organization uses to measure a successful plate appearance will affect player strategy in the short term (do they draw more walks, prioritize home runs, etc.?) and personnel strategy in the mid- and long terms. The data to make these decisions is readily available and continuously updated.

Organizations can’t control their data, but they do control what they care about. If our metric on the YouTube video had been views, we would have called it a huge success. In fact, we wrote it off as a massive failure. Does that mean no more videos? Not necessarily, but for now, we’ll be spending our resources elsewhere, collecting data on metrics that matter.

Jeff Bladt is chief data officer at DoSomething.org, America’s largest organization for young people and social change. Bob Filbin is chief data scientist at Crisis Text Line, the first large-scale 24/7 national crisis line for teens on the medium they use most: texting.


CHAPTER 7

The Fundamentals of A/B Testing

by Amy Gallo

As we learned in chapter 5, running an experiment is a straightforward way to collect new data about a specific question or problem. One of the most common methods of experimentation, particularly in online settings, is A/B testing.

To better understand what A/B testing is, where it

originated, and how to use it, I spoke with Kaiser Fung,

who founded the applied analytics program at Columbia

University and is author of Junk Charts, a blog devoted

to the critical examination of data and graphics in the

mass media. His latest book is Numbersense: How to Use

Big Data to Your Advantage.

Adapted from “A Refresher on A/B Testing” on hbr.org, June 28, 2017

(product #H03R3D).


What Is A/B Testing?

A/B testing is a way to compare two versions of something to figure out which performs better. While it’s most often associated with websites and apps, Fung says the method is almost 100 years old.

In the 1920s, statistician and biologist Ronald Fisher discovered the most important principles behind A/B testing and randomized controlled experiments in general. “He wasn’t the first to run an experiment like this, but he was the first to figure out the basic principles and mathematics and make them a science,” Fung says.

Fisher ran agricultural experiments, asking questions such as, “What happens if I put more fertilizer on this land?” The principles persisted, and in the early 1950s scientists started running clinical trials in medicine. In the 1960s and 1970s, the concept was adapted by marketers to evaluate direct-response campaigns (for example, “Would a postcard or a letter sent to target customers result in more sales?”).

A/B testing in its current form came into existence in the 1990s. Fung says that throughout the past century, the math behind the tests hasn’t changed: “It’s the same core concepts, but now you’re doing it online, in a real-time environment, and on a different scale in terms of number of participants and number of experiments.”

How Does A/B Testing Work?

You start an A/B test by deciding what it is you want to test. Fung gives a simple example: the size of the “Subscribe” button on your website. Then you need to know how you want to evaluate its performance. In this case, let’s say your metric is the number of visitors who click on the button. To run the test, you show two sets of users (assigned at random when they visit the site) the different versions (where the only thing different is the size of the button) and determine which influenced your success metric the most: in this case, which button size caused more visitors to click.

There are a lot of things that influence whether someone clicks. For example, it may be that those using a mobile device are more likely to click a button of a certain size, while those on desktop are drawn to a different size. This is where randomization is critical. By randomizing which users are in which group, you minimize the chances that other factors, like mobile versus desktop, will drive your results on average.

“The A/B test can be considered the most basic kind of randomized controlled experiment,” Fung says. “In its simplest form, there are two treatments and one acts as the control for the other.” As with all randomized controlled experiments, you must estimate the sample size you need to achieve statistical significance, which will help you make sure the result you’re seeing “isn’t just because of background noise,” Fung says.
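As a rough guide to what “estimating the sample size” involves, the Python sketch below applies a common textbook approximation for comparing two conversion rates. The specific inputs are assumptions chosen for illustration: a 15% baseline click rate, an 18% rate you hope to detect, a 5% significance level, and 80% power. Your analysts may prefer a different formula or tool.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH group to detect the difference
    between two conversion rates with a two-sided test.

    Uses the standard normal-approximation formula:
        n = (z_alpha/2 + z_power)^2 * [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance = baseline_rate * (1 - baseline_rate) + expected_rate * (1 - expected_rate)
    effect = expected_rate - baseline_rate
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Assumed rates for illustration only.
visitors_needed = sample_size_per_group(0.15, 0.18)
print(visitors_needed)
```

Under these assumptions the answer is on the order of 2,400 visitors per version. The smaller the difference you want to detect, the more visitors you need: halving the expected improvement roughly quadruples the required sample.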

Sometimes you know that certain variables, usually those that are not easily manipulated, have a strong effect on the success metric. For example, maybe mobile users of your website tend to click less in general, compared with desktop users. Randomization may result in set A containing slightly more mobile users than set B, which may cause set A to have a lower click rate regardless of the button size they’re seeing. To level the playing field, the test analyst should first divide the users by mobile and desktop and then randomly assign them to each version. This is called blocking.
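Blocking can be sketched in a few lines of Python: group users by device first, then randomize within each group. The user list and the `device` field below are hypothetical, and `blocked_assignment` is an illustrative helper, not a function from any particular testing tool.

```python
import random
from collections import defaultdict


def blocked_assignment(users, block_key, seed=None):
    """Randomly assign users to version A or B separately within each block."""
    rng = random.Random(seed)

    # Step 1: divide users into blocks (for example, mobile and desktop).
    blocks = defaultdict(list)
    for user in users:
        blocks[user[block_key]].append(user)

    # Step 2: shuffle within each block, then alternate A and B so each
    # block contributes (almost) equally to both versions.
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)
        for position, user in enumerate(members):
            assignment[user["id"]] = "A" if position % 2 == 0 else "B"
    return assignment


# Hypothetical visitors: 40 on mobile, 20 on desktop.
users = [{"id": i, "device": "desktop" if i % 3 == 0 else "mobile"} for i in range(1, 61)]
assignment = blocked_assignment(users, block_key="device", seed=7)

mobile_in_a = sum(1 for u in users if u["device"] == "mobile" and assignment[u["id"]] == "A")
mobile_in_b = sum(1 for u in users if u["device"] == "mobile" and assignment[u["id"]] == "B")
print(mobile_in_a, mobile_in_b)
```

With simple randomization the two versions would have similar numbers of mobile users only on average; with blocking they match by construction, so a device effect cannot masquerade as a button effect.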

The size of the “Subscribe” button is a very basic example, Fung says. In actuality, you might not be testing just size but also color, text, typeface, and font size. Lots of managers run sequential tests (testing size first, large versus small; then color, blue versus red; then typeface, Times versus Arial; and so on) because they believe they shouldn’t vary two or more factors at the same time. But according to Fung, that view has been debunked by statisticians. Sequential tests are in fact suboptimal, because you’re not measuring what happens when factors interact. For example, it may be that users prefer blue on average but prefer red when it’s combined with an Arial font. This kind of result is regularly missed in sequential A/B testing because the typeface test is run on blue buttons that have “won” the previous test.

Instead, Fung says, you should run more-complex tests. This can be hard for some managers, since the appeal of A/B tests is how straightforward and simple they are to run (and many people designing these experiments, Fung points out, don’t have a statistics background). “With A/B testing, we tend to want to run a large number of simultaneous, independent tests,” he says, in large part because the mind reels at the number of possible combinations that can be tested. But using mathematics, you can “smartly pick and run only certain subsets of those treatments; then you can infer the rest from the data.” This is called multivariate testing in the A/B testing world, and it means you often end up doing an A/B/C test or even an A/B/C/D test. In the colors and size example, it might include showing different groups a large red button, a small red button, a large blue button, and a small blue button. If you wanted to test fonts too, you would need even more test groups.
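The growth in test groups is easy to see by listing every combination. The Python sketch below enumerates a full set of combinations for the button example; it shows only the counting, not the statistical technique Fung alludes to for running a chosen subset of treatments, and the helper name `full_factorial` is an assumption for illustration.

```python
from itertools import product


def full_factorial(factors):
    """List every combination of factor levels, one dict per test group."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*(factors[n] for n in names))]


# Two factors with two levels each: four test groups.
color_and_size = full_factorial({
    "color": ["red", "blue"],
    "size": ["large", "small"],
})

# Adding typeface doubles the number of groups to eight.
with_typeface = full_factorial({
    "color": ["red", "blue"],
    "size": ["large", "small"],
    "typeface": ["Times", "Arial"],
})

for group in color_and_size:
    print(group)
print(len(color_and_size), len(with_typeface))
```

Each added two-level factor doubles the number of groups, and every group needs enough visitors to produce a reliable estimate. That cost is why running only a carefully chosen subset of combinations can be attractive.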

How Do You Interpret the Results of an A/B Test?

Chances are that your company will use software that handles the calculations, and it may even employ a statistician who can interpret those results for you. But it’s helpful to have a basic understanding of how to make sense of the output and decide whether to move forward with the test variation (the new button, in the example Fung describes).

Fung says that most software programs report two

conversion rates for A/B testing: one for users who saw

the control version, and the other for users who saw the

test version. “The conversion rate may measure clicks or

other actions taken by users,” he says. The report might

look like this: “Control: 15% (+/– 2.1%); Variation 18%

(+/– 2.3%).” This means that 18% of your users clicked

through on the new variation (perhaps the larger blue

button) with a margin of error of 2.3%. You might be

tempted to interpret this as the actual conversion rate

falling between 15.7% and 20.3%, but that wouldn’t be

technically correct. “The real interpretation is that if

you ran your A/B test multiple times, 95% of the ranges

will capture the true conversion rate—in other words,

the conversion rate falls outside the margin of error 5%

of the time (or whatever level of statistical significance

you’ve set),” Fung explains.
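Fung's interpretation can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the true conversion rate (18%), the number of visitors per test (1,000), and the number of repeated tests are all hypothetical, and the interval uses the common normal approximation, which may differ from what your software does.

```python
import math
import random

def confidence_interval(conversions, visitors, z=1.96):
    """Normal-approximation 95% interval for a conversion rate."""
    rate = conversions / visitors
    margin = z * math.sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin

def coverage(true_rate=0.18, visitors=1000, repeats=2000, seed=0):
    """Fraction of simulated tests whose interval contains the true rate."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        # Each visitor converts independently with probability true_rate.
        conversions = sum(rng.random() < true_rate for _ in range(visitors))
        low, high = confidence_interval(conversions, visitors)
        if low <= true_rate <= high:
            hits += 1
    return hits / repeats

print(f"Share of intervals capturing the true rate: {coverage():.3f}")
```

The printed share should land close to 0.95: most of the ranges capture the true rate, and a small fraction miss it.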

This can be a difficult concept to wrap your head

around. But what’s important to know is that the 18%

conversion rate isn’t a guarantee. This is where your

judgment comes in. An 18% conversion rate looks better than a 15% one, but note that the two ranges overlap once you allow for the margin of error (12.9% to 17.1% versus 15.7% to 20.3%), so the evidence is encouraging rather than conclusive. You might hear people talk about this as a “3% lift” (lift is the difference in conversion rate between your control version and a successful test treatment; strictly speaking, this is a lift of 3 percentage points, or 20% relative to the control). In this

case, it’s most likely a good decision to switch to your

new version, but that will depend on the costs of imple-

menting it. If they’re low, you might try out the switch

and see what happens in actuality (versus in tests). One

of the big advantages to testing in the online world is

that you can usually revert back to your original pretty

easily.
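The numbers in a report like this can be reproduced with a few lines of arithmetic. The following Python sketch is illustrative only: the sample size of 1,100 visitors per version is an assumption (the report above does not state it), and the formulas are the standard normal-approximation ones, not necessarily what any particular testing tool uses.

```python
import math

def margin_of_error(rate, visitors, z=1.96):
    """Half-width of a normal-approximation 95% interval."""
    return z * math.sqrt(rate * (1 - rate) / visitors)

def lifts(control_rate, variation_rate):
    """Return (lift in percentage points, lift relative to control in percent)."""
    absolute = (variation_rate - control_rate) * 100
    relative = (variation_rate - control_rate) / control_rate * 100
    return absolute, relative

def z_score(control_rate, n_control, variation_rate, n_variation):
    """Two-proportion z statistic using the pooled conversion rate."""
    pooled = (control_rate * n_control + variation_rate * n_variation) / (
        n_control + n_variation
    )
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variation))
    return (variation_rate - control_rate) / std_err

# Hypothetical sample size: about 1,100 visitors saw each version.
n = 1100
print("Margins of error:", round(margin_of_error(0.15, n), 3), round(margin_of_error(0.18, n), 3))
print("Lift (points, relative %):", lifts(0.15, 0.18))
print("z statistic:", round(z_score(0.15, n, 0.18, n), 2))
```

Under these assumed sample sizes, the margins of error come out close to the ones in the report, and the z statistic lands a little below 1.96, the usual cutoff for 95% confidence. That is a reminder that a result like this still calls for judgment.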

How Do Companies Use A/B Testing?

Fung says that the popularity of the methodology has

risen as companies have realized that the online en-

vironment is well suited to help managers, especially

marketers, answer questions like, “What is most likely

to make people click? Or buy our product? Or register

with our site?” A/B testing is now used to evaluate every-

thing from website design to online offers to headlines

to product descriptions. (See the sidebar “A/B Testing in

Action” to see an example from the creative marketplace

Shutterstock.)

Most of these experiments run without the subjects

even knowing. As users, Fung says, “we’re part of these

tests all the time and don’t know it.”

And it’s not just websites. You can test marketing

emails or ads as well. For example, you might send two

versions of an email to your customer list (randomizing the list first, of course) and figure out which one

generates more sales. Then you can just send out the

winning version next time. Or you might test two ver-

sions of ad copy and see which one converts visitors

more often. Then you know to spend more getting the

most successful one out there.
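The randomizing step for an email test can be sketched in a few lines of Python. The customer addresses and the function name are hypothetical; the point is only that the list is shuffled before it is split, so neither group is systematically different from the other.

```python
import random

def split_for_ab_test(customers, seed=None):
    """Shuffle a copy of the list and split it into two halves."""
    rng = random.Random(seed)
    shuffled = list(customers)  # copy, so the original order is untouched
    rng.shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]

# Hypothetical customer list.
customers = [f"customer{i}@example.com" for i in range(10)]
group_a, group_b = split_for_ab_test(customers, seed=42)
print(len(group_a), "customers get version A;", len(group_b), "get version B")
```

Passing a fixed seed makes the split repeatable, which helps if you later need to check who received which version.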

A/B TESTING IN ACTION

by Wyatt Jenkins

At Shutterstock, we test everything: copy and link col-

ors, relevance algorithms that rank our search results,

language-detection functions, usability in download-

ing, pricing, video-playback design, and anything else

you can see on our site (plus a lot you can’t).

Shutterstock is the world’s largest creative market-

place, serving photography, illustrations, and video to

more than 750,000 customers. And those customers

have heavy image needs; we serve over three down-

loads per second. That’s a ton of data.

This means that we know more about our custom-

ers, statistically, than anyone else in our market. It

also means that we can run more experiments with

statistical significance faster than businesses with less

user data. It’s one of our most important competitive

advantages.

Search results are among the highest-trafficked

pages on our site. A few years back, we started experi-

menting with a mosaic-display search-results page in

our Labs area—an experimentation platform we use to

try things quickly and get user feedback. In qualitative

testing, customers really liked the design of the mosaic

search grid, so we A/B tested it within the core Shut-

terstock experience.

Here are some of the details of the experiment, and

what we learned:

• Image sizes: We tested different image sizes to get just the right number of pixels on the screen.

• New customers: We watched to see if new customers to our site would increase conversion. New customers act differently than existing ones, so you need to account for that. Sometimes existing customers suffer from change aversion.

• Viewport size: We tracked the viewport size (the size of the screen customers used) to understand how they were viewing the page.

• Watermarks: We tested including an image watermark versus no watermark. Was including the watermark distracting?

• Hover: We experimented with the behavior of a hover feature when a user paused on a particular image.

Before the test, we were convinced that removing

the watermark on our images would increase con-

version because there would be less visual clutter on

the page. But in testing we learned that removing the

watermark created the opposite effect, disproving our

gut instinct.

We ran enough tests to find two different designs

that increased conversion, so we iterated on those de-

signs and re-tested them before deciding on one. And

we continue to test this search grid and make improve-

ments for our customers on a regular basis.

Adapted from “A/B Testing and the Benefits of an Experimentation Culture” posted on hbr.org, February 5, 2014 (product #H00NTO).

Wyatt Jenkins is a product executive with a focus on marketplaces, personalization, optimization, and international growth. He has acted as SVP of Product at Hired.com and Optimizely, and was VP of Product at Shutterstock for five years. Wyatt was an early partner in Beatport from 2003 to 2009, and he served on the board until 2013.

What Mistakes Do People Make When Doing A/B Tests?

Fung identified three common mistakes he sees companies make when performing A/B tests.

First, too many managers don’t let the tests run their

course. Because most of the software for running these

tests lets you watch results in real time, managers want

to make decisions too quickly. This mistake, Fung says,

“evolves out of impatience,” and many software vendors

have played into this overeagerness by offering a type of

A/B testing called real-time optimization, in which you

can use algorithms to make adjustments as results come

in. The problem is that, because of randomization, it’s

possible that if you let the test run to its natural end, you

might get a different result.

The second mistake is looking at too many metrics.

“I cringe every time I see software that tries to please

everyone by giving you a panel of hundreds of metrics,”

he says. The problem is that if you’re looking at such a

large number of metrics at the same time, you’re at risk

of making what statisticians call spurious correlations (a

topic discussed in more detail in chapter 10). In proper

test design, “you should decide on the metrics you’re

going to look at before you execute an experiment and

select a few. The more you’re measuring, the more likely

that you’re going to see random fluctuations.” With too many metrics, instead of asking yourself, “What’s happening with this variable?” you’re asking, “What interesting (and potentially insignificant) changes am I seeing?”

Lastly, Fung says, few companies do enough retest-

ing. “We tend to test it once and then we believe it. But

even with a statistically significant result, there’s a quite

large probability of false positive error. Unless you retest

once in a while, you don’t rule out the possibility of be-

ing wrong.” False positives can occur for several reasons.

For example, even though there may be little chance that

any given A/B result is driven by random chance, if you

do lots of A/B tests, the chance that at least one of your

results is wrong grows rapidly.
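A short calculation shows how fast that risk grows. The sketch below assumes the tests are independent, that each uses a 5% significance level, and that none of the variations truly differs from its control; these are simplifying assumptions for illustration, and the function name is made up.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of several independent tests
    gives a false positive when no variation truly differs."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 5, 20, 50):
    print(f"{k:>2} tests: {chance_of_false_positive(k):.0%} chance of at least one false positive")
```

With a single test the risk is the familiar 5%, but it rises steeply as tests accumulate, which is the arithmetic behind Fung's advice to retest.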

Retesting can be particularly difficult to do because it is likely that managers would end up with contradictory results, and no one wants to discover that they’ve undermined previous findings, especially in the online world,

where managers want to make changes—and capture

value—quickly. But this focus on value can be misguided.

Fung says, “People are not very vigilant about the practical value of the findings. They want to believe that every

little amount of improvement is valuable even when the

test results are not fully reliable. In fact, the smaller the

improvement, the less reliable the results.”

It’s clear that A/B testing is not a panacea for all your data-testing needs. There are more complex kinds of experiments that are more efficient and will give you more reliable data, Fung says. But A/B testing is a great way to gain quick information about a specific question you have, particularly in an online setting. And, as Fung says, “the good news about the A/B testing world is that everything happens so quickly, so if you run it and it doesn’t work, you can try something else. You can always flip back to the old tactic.”

Amy Gallo is a contributing editor at Harvard Business

Review and the author of the HBR Guide to Dealing with

Conflict. Follow her on Twitter @amyegallo.

CHAPTER 8

Can Your Data Be Trusted?

by Thomas C. Redman

You’ve just learned of some new data that, when com-

bined with existing data, could offer potentially game-

changing insights. But there isn’t a clear indication

whether this new information can be trusted. How

should you proceed?

There is, of course, no simple answer. While many

managers are skeptical of new data and others embrace

it wholeheartedly, the more thoughtful managers take a

nuanced approach. They know that some data (maybe

even most of it) is bad and can’t be used, and some is

good and should be trusted implicitly. But they also realize that some data is flawed but usable with caution. They find this data intriguing and are eager to push the data to its limits, as they know game-changing insights may reside there.

Adapted from content posted on hbr.org, October 29, 2015 (product #H02G61).

Fortunately, you can work with your data scientists to

assess whether the data you’re considering is safe to use

and just how far you can go with flawed data. Indeed, following some basic steps can help you proceed with greater confidence, or caution, as the quality of the

data dictates.

Evaluate Where It Came From

You can trust data when it is created in accordance with a first-rate data quality program. Such programs feature clear accountabilities for managers to create data correctly, input controls, and find and eliminate the root causes of error. You won’t have to opine whether the data is good; data quality statistics will tell you. You’ll find an expert

who will be happy to explain what you may expect and

answer your questions. If the data quality stats look good

and the conversation goes well, trust the data. This is the

“gold standard” against which the other steps should be

calibrated.

Assess Data Quality Independently

Much, perhaps most, data will not meet the gold standard, so adopt a cautious attitude by doing your own

assessment of data quality. Make sure you know where

the data was created and how it is defined, not just how

your data scientist accessed it. It is easy to be misled by

a casual, “We took it from our cloud-based data warehouse, which employs the latest technology,” and completely miss the fact that the data was created in a dubious public forum. Figure out which organization created

the data. Then dig deeper: What do colleagues advise

about this organization and data? Does it have a good or

poor reputation for quality? What do others say on social

media? Do some research both inside and outside your

company.

At the same time, develop your own data quality

statistics, using what I call the “Friday afternoon measurement,” tailor-made for this situation. Briefly, you,

the data scientist providing the analysis, or both of you,

should lay out 10 or 15 important data elements for

100 data records on a spreadsheet. If the new data in-

volves customer purchases, such data elements may in-

clude “customer name,” “purchased item,” and “price.”

Then work record by record, taking a hard look at each

data element. The obvious errors will jump out at you—

customer names will be misspelled, the purchased item

will be a product you don’t sell, or the price may be miss-

ing. Mark these obvious errors with a red pen or high-

light them in a bright color. Then count the number of

records with no errors. (See figure 8-1 for an example.) In

many cases you’ll see a lot of red—don’t trust this data!

If you see only a little red, say, less than 5% of records

with an obvious error, you can use this data with caution.

Look, too, at patterns of the errors. If, for instance,

there are 25 total errors, 24 of which occur in the price,

eliminate that data element going forward. But if the

rest of the data looks pretty good, use it with caution.
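The tallying part of the Friday afternoon measurement can be sketched in Python once the manual review is done. Everything in this example is hypothetical: the four sample records, the element names, and the function name. Each value is simply True where the reviewer marked an obvious error (a “red” cell).

```python
from collections import Counter

# Hypothetical review results: True marks a cell the reviewer flagged in red.
records = [
    {"customer name": False, "purchased item": False, "price": False},
    {"customer name": True, "purchased item": False, "price": False},
    {"customer name": False, "purchased item": False, "price": True},
    {"customer name": False, "purchased item": False, "price": True},
]

def friday_afternoon_measurement(records):
    """Count error-free records and tally errors by data element."""
    perfect = sum(1 for record in records if not any(record.values()))
    errors_by_element = Counter(
        element
        for record in records
        for element, is_error in record.items()
        if is_error
    )
    return perfect, errors_by_element

perfect, errors_by_element = friday_afternoon_measurement(records)
share_with_errors = 1 - perfect / len(records)
print("Perfect records:", perfect)
print("Errors by data element:", dict(errors_by_element))
print(f"Share of records with an error: {share_with_errors:.0%}")
```

The per-element tally is what reveals a pattern such as most errors sitting in a single data element, in which case that element can be dropped while the rest is used with caution.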

FIGURE 8-1

Example: Friday afternoon measurement spreadsheet

Record   Attribute 1 (Name)   Attribute 2 (Size)   Attribute 3 (Amount)   ...   Attribute 15   Perfect record?
1        Jane Doe             Null                 $472.13                                     No
2        John Smith           Medium               $126.93                                     Yes
3        Stuart Madnick       XXXL                 Null                                        No
4        James Olsen          24 Lockwood Road     $76.24                                      No
...
100      Thoams Jones                                                                          No

Number of perfect records = 67

Source: Thomas C. Redman, “Assess Whether You Have a Data Quality Problem” on hbr.org, July 28, 2016 (product #H030SQ).

Clean the Data

I think of data cleaning in three levels: rinse, wash, and

scrub. “Rinse” replaces obvious errors with “missing

value” or corrects them if doing so is very easy; “scrub”

involves deep study, even making corrections one at a

time, by hand, if necessary; and “wash” occupies a mid-

dle ground.

Even if time is short, scrub a small random sample

(say, 1,000 records), making them as pristine as you pos-

sibly can. Your goal is to arrive at a sample of data you

know you can trust. Employ all possible means of scrub-

bing and be ruthless! Eliminate erroneous data records

and data elements that you cannot correct, and mark

data as “uncertain” when applicable.

When you are done, take a hard look. When the

scrubbing has gone really well (and you’ll know if it has),

you’ve created a data set that rates high on the trust-

worthy scale. It’s OK to move forward using this data.

Sometimes the scrubbing is less satisfying. If you’ve

done the best you can, but still feel uncertain, put this

data in the “use with caution” category. If the scrubbing

goes poorly—for example, too many prices just look

wrong and you can’t make corrections—you must rate

this data, and all like it, as untrustworthy. The sample

strongly suggests none of the data should be used to in-

form your decision.

After the initial scrub, move on to the second clean-

ing exercise: washing the remaining data that was not in

the scrubbing sample. This step should be performed by

a truly competent data scientist. Since scrubbing can be

a time-consuming, manual process, the wash allows you

to make corrections using more automatic processes. For

example, one wash technique involves “imputing” miss-

ing values using statistical means. Or your data scientist

may have discovered algorithms during scrubbing. If the

washing goes well, put this data into the “use with cau-

tion” category.
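As a rough illustration of the lighter cleaning levels, the Python sketch below rinses a list of prices and then washes it by imputing the missing values. The prices are hypothetical, treating a non-positive price as an “obvious error” is an assumption made for this example, and median imputation is just one simple technique a data scientist might choose.

```python
from statistics import median

# Hypothetical prices; None stands for the "missing value" marker.
raw_prices = [19.99, -5.00, 24.50, None, 21.00, 0.0]

def rinse(prices):
    """Replace obvious errors (here, non-positive prices) with None."""
    return [p if p is not None and p > 0 else None for p in prices]

def wash(prices):
    """Impute missing prices with the median of the known values."""
    known = [p for p in prices if p is not None]
    fill = median(known)
    return [fill if p is None else p for p in prices]

rinsed = rinse(raw_prices)
washed = wash(rinsed)
print("After rinse:", rinsed)
print("After wash: ", washed)
```

Here half of the values end up imputed, which is a good reason to keep such data in the “use with caution” category rather than treating it as trusted.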

The flow chart in figure 8-2 will help you see this process in action. Once you’ve identified a set of data that

you can trust or use with caution, move on to the next

step of integration.

Ensure High-Quality Data Integration

Align the data you can trust (or the data that you’re moving forward with cautiously) with your existing

data. There is a lot of technical work here, so probe your

data scientist to ensure three things are done well:

• Identification: Verify that the Courtney Smith in one data set is the same Courtney Smith in others.

• Alignment of units of measure and data definitions: Make sure Courtney’s purchases and prices paid, expressed in “pallets” and “dollars” in one set, are aligned with “units” and “euros” in another.

• De-duplication: Check that the Courtney Smith record does not appear multiple times in different ways (say as C. Smith or Courtney E. Smith).
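The three checks above can be sketched in Python. This is a deliberately naive illustration, not a recommended implementation: the conversion factors, the name-to-ID lookup table, the record layout, and the function names are all hypothetical, and real identity matching is much harder than a lookup.

```python
# Hypothetical conversion factors; real values depend on your business and the date.
UNITS_PER_PALLET = 40
EUROS_PER_DOLLAR = 0.9

# Hypothetical lookup that maps name variants to one customer ID (identification).
CUSTOMER_IDS = {
    "courtney smith": "C001",
    "c. smith": "C001",
    "courtney e. smith": "C001",
}

def standardize(record):
    """Resolve the customer, then express quantity in units and price in euros."""
    quantity, price = record["quantity"], record["price"]
    if record["quantity_unit"] == "pallets":
        quantity *= UNITS_PER_PALLET
    if record["currency"] == "dollars":
        price *= EUROS_PER_DOLLAR
    return {
        "customer_id": CUSTOMER_IDS[record["name"].strip().lower()],
        "order_id": record["order_id"],
        "units": quantity,
        "euros": round(price, 2),
    }

def deduplicate(records):
    """Keep the first record seen for each (customer, order) pair."""
    unique = {}
    for record in records:
        unique.setdefault((record["customer_id"], record["order_id"]), record)
    return list(unique.values())

raw = [
    {"name": "Courtney Smith", "order_id": 1, "quantity": 2,
     "quantity_unit": "pallets", "price": 100.0, "currency": "dollars"},
    {"name": "C. Smith", "order_id": 1, "quantity": 80,
     "quantity_unit": "units", "price": 90.0, "currency": "euros"},
]
merged = deduplicate([standardize(r) for r in raw])
print(merged)
```

The two raw rows describe the same purchase in different units and under different name variants; after standardizing, they collapse into a single record.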

FIGURE 8-2

Should you trust your data? A simple process to help you decide

[Flow chart. Boxes: Raw data; Was the data created in accordance with a first-rate data quality program?; Can you identify data of high quality through your own research?; “Scrub” a small sample by correcting or eliminating data; Did the “scrubbing” go well? (yes, somewhat, or no); The data could not be scrubbed. There were too many errors that couldn’t be fixed; “Wash” the remaining data using automated techniques with the help of a data scientist; Did the “washing” go well? Outcomes: Trust this data; Use this data with caution; Do not trust this data.]

At this point in the process, you’re ready to perform whatever analytics (from simple summaries to more complex analyses) you need to guide your decision. Pay particular attention when you get different results based on “use with caution” and “trusted” data. Both great insights and great traps lie here. When a result looks intriguing, isolate the data and repeat the steps above, making more detailed measurements, scrubbing the data, and improving wash routines. As you do so, develop a feel for how deeply you should trust this data.

Data doesn’t have to be perfect to yield new insights,

but you must exercise caution by understanding where

the flaws lie, working around errors, cleaning them up,

and backing off when the data simply isn’t good enough.

Thomas C. Redman, “the Data Doc,” is President of Data

Quality Solutions. He helps companies and people, in-

cluding startups, multinationals, executives, and leaders

at all levels, chart their courses to data-driven futures.

He places special emphasis on quality, analytics, and or-

ganizational capabilities.

SECTION THREE

Analyze the Data

CHAPTER 9

A Predictive Analytics Primer

by Thomas H. Davenport

No one has the ability to capture and analyze data from

the future. However, there is a way to predict the future

using data from the past. It’s called predictive analytics,

and organizations do it every day.

Has your company, for example, developed a customer

lifetime value (CLTV) measure? That’s using predictive

analytics to determine how much a customer will buy

from the company over time. Do you have a “next best

offer” or product recommendation capability? That’s an

analytical prediction of the product or service that your

customer is most likely to buy next. Have you made a forecast of next quarter’s sales? Used digital marketing models to determine what ad to place on what publisher’s site? All of these are forms of predictive analytics.

Adapted from content posted on hbr.org, September 2, 2014 (product #H00YO1).

Predictive analytics are gaining in popularity, but

what do you really need to know in order to interpret

results and make better decisions? By understanding a

few basics, you will feel more comfortable working with

and communicating with others in your organization

about the results and recommendations from predictive

analytics. The quantitative analysis isn’t magic—but it is

normally done with a lot of past data, a little statistical

wizardry, and some important assumptions.

The Data

Lack of good data is the most common barrier to organizations seeking to employ predictive analytics. To

make predictions about what customers will buy in

the future, for example, you need to have good data on

what they are buying (which may require a loyalty pro-

gram, or at least a lot of analysis of their credit cards),

what they have bought in the past, the attributes of those

products (attribute-based predictions are often more ac-

curate than the “people who buy this also buy this” type

of model), and perhaps some demographic attributes of

the customer (age, gender, residential location, socioeco-

nomic status, etc.). If you have multiple channels or cus-

tomer touchpoints, you need to make sure that they cap-

ture data on customer purchases in the same way your

previous channels did.

All in all, it’s a fairly tough job to create a single

customer data warehouse with unique customer IDs

on everyone, and all past purchases customers have

made through all channels. If you’ve already done that,

you’ve got an incredible asset for predictive customer

analytics.

The Statistics

Regression analysis in its various forms is the primary

tool that organizations use for predictive analytics. It

works like this, in general: An analyst hypothesizes that

a set of independent variables (say, gender, income, vis-

its to a website) are statistically correlated with the pur-

chase of a product for a sample of customers. The analyst

performs a regression analysis to see just how correlated

each variable is; this usually requires some iteration

to find the right combination of variables and the best model. Let’s say that the analyst succeeds and finds that

each variable in the model is important in explaining the

product purchase, and together the variables explain a

lot of variation in the product’s sales. Using that regres-

sion equation, the analyst can then use the regression

coefficients (the degree to which each variable affects the purchase behavior) to create a score predicting the

likelihood of the purchase.

Voilà! You have created a predictive model for other

customers who weren’t in the sample. All you have to do

is compute their score and offer them the product if their

score exceeds a certain level. It’s quite likely that the

high-scoring customers will want to buy the product—

assuming the analyst did the statistical work well and

that the data was of good quality. (For more on regres-

sion analysis, read on to the next chapter.)
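The scoring step can be illustrated with a small Python sketch. It assumes one particular form of regression, logistic regression, which suits a yes-or-no outcome such as a purchase; the chapter does not say which form the analyst uses. The intercept, the coefficients, the variable names, and the 0.5 cutoff are all hypothetical and not taken from any real model.

```python
import math

# Hypothetical values, as if estimated by a logistic regression on a customer sample.
INTERCEPT = -4.0
COEFFICIENTS = {"income_thousands": 0.03, "site_visits": 0.25}

def purchase_score(customer):
    """Turn a customer's attributes into a purchase likelihood between 0 and 1."""
    linear = INTERCEPT + sum(
        weight * customer[name] for name, weight in COEFFICIENTS.items()
    )
    return 1 / (1 + math.exp(-linear))

def should_offer(customer, threshold=0.5):
    """Offer the product when the score exceeds the chosen level."""
    return purchase_score(customer) > threshold

# Two hypothetical customers who were not in the original sample.
frequent_visitor = {"income_thousands": 80, "site_visits": 10}
rare_visitor = {"income_thousands": 40, "site_visits": 1}

for label, customer in (("frequent", frequent_visitor), ("rare", rare_visitor)):
    print(label, round(purchase_score(customer), 2), should_offer(customer))
```

Where to set the threshold is a business decision: a lower cutoff reaches more potential buyers but also sends more offers to people who are unlikely to respond.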

The Assumptions

Another key factor in any predictive model is the assumptions that underlie it. Every model has them,

and it’s important to know what they are and monitor

whether they are still true. The big assumption in pre-

dictive analytics is that the future will continue to be like

the past. As Charles Duhigg describes in his book The

Power of Habit, people establish strong patterns of be-

havior that they usually keep up over time. Sometimes,

however, they change those behaviors, and the models

that were used to predict them may no longer be valid.

What makes assumptions invalid? The most com-

mon reason is time. If your model was created several

years ago, it may no longer accurately predict current

behavior. The greater the elapsed time, the more likely

it is that customer behavior has changed. Some Netflix predictive models, for example, that were created on early internet users had to be retired because later internet users were substantially different. The pioneers

were more technically focused and relatively young;

later users were essentially everyone.

Another reason a predictive model’s assumptions may

no longer be valid is if the analyst didn’t include a key

variable in the model, and that variable has changed

substantially over time. The great (and scary) example here is the financial crisis of 2008–2009, caused largely

by invalid models predicting how likely mortgage cus-

tomers were to repay their loans. The models didn’t in-

clude the possibility that housing prices might stop ris-

ing, and that they even might fall. When they did start

falling, it turned out that the models were poor predic-

tors of mortgage repayment. In essence, the belief that

housing prices would always rise was a hidden assump-

tion in the models.

Since faulty or obsolete assumptions can clearly bring

down whole banks and even (nearly!) whole economies,

it’s pretty important that they be carefully examined.

Managers should always ask analysts what the key as-

sumptions are, and what would have to happen for them

to no longer be valid. And both managers and analysts

should continually monitor the world to see if key factors

involved in assumptions have changed over time.

With these fundamentals in mind, here are a few good

questions to ask your analysts:

• Can you tell me something about the source of the

data you used in your analysis?

• Are you sure the sample data is representative of

the population?

• Are there any outliers in your data distribution?

How did they affect the results?

• What assumptions are behind your analysis?

• Are there any conditions that would make your

assumptions invalid?

Even with those cautions, it’s still pretty amazing that

we can use analytics to predict the future. All we have to

do is gather the right data, do the right type of statisti-

cal model, and be careful of our assumptions. Analytical

predictions may be harder to generate than those by the late-night television soothsayer Carnac the Magnificent,

but they are usually considerably more accurate.

Thomas H. Davenport is the President’s Distinguished

Professor in Management and Information Technology

at Babson College, a research fellow at the MIT Initiative

on the Digital Economy, and a senior adviser at Deloitte

Analytics. Author of over a dozen management books,

his latest is Only Humans Need Apply: Winners and

Losers in the Age of Smart Machines.

CHAPTER 10

Understanding Regression Analysis

by Amy Gallo

One of the most important types of data analysis is re-

gression. It is a common approach used to draw conclu-

sions from and make predictions based on data, but for

those without a statistical or analytical background, it

can also be complex and confusing.

To better understand this method and how compa-

nies use it, I talked with Thomas Redman, author of

Data Driven: Profiting from Your Most Important Business Asset. He also advises organizations on their data

and data quality programs.

Adapted from “A Refresher on Regression Analysis” on hbr.org, No-

vember 4, 2015 (product #H02GBP).

What Is Regression Analysis?

Redman offers this example scenario: Suppose you’re a

sales manager trying to predict next month’s numbers.

You know that dozens, perhaps even hundreds, of fac-

tors from the weather to a competitor’s promotion to

the rumor of a new and improved model can impact

the number. Perhaps people in your organization even

have a theory about what will have the biggest effect on

sales. “Trust me. The more rain we have, the more we

sell.” “Six weeks after the competitor’s promotion, sales

jump.”

Regression analysis is a way of mathematically sorting

out which of those variables do indeed have an impact.

It answers the questions: Which factors matter most?

Which can we ignore? How do those factors interact

with one another? And, perhaps most importantly, how

certain are we about all of these factors?

In regression analysis, those factors are called vari-

ables. You have your dependent variable—the main fac-

tor that you’re trying to understand or predict. In Red-

man’s example above, the dependent variable is monthly

sales. And then you have your independent variables—

the factors you suspect have an impact on your depen-

dent variable.

How Does It Work?

In order to conduct a regression analysis, you gather data

on the variables in question. You take all of your monthly

sales numbers for, say, the past three years and any data

on the independent variables you’re interested in. So, in

this case, let’s say you find out the average monthly rainfall for the past three years as well. Then you plot all of that information on a chart that looks like figure 10-1.

The y-axis is the amount of sales (the dependent vari-

able, the thing you’re interested in, is always on the y-

axis) and the x-axis is the total rainfall. Each dot repre-

sents one month’s data—how much it rained that month

and how many sales you made that same month.

Glancing at this data, you probably notice that sales are higher in months when it rains a lot. That’s interesting to know, but by how much? If it rains three inches, do you know how much you’ll sell? What about if it rains four inches?

FIGURE 10-1

Is there a relationship between these two variables?

Plotting your data is the first step to figuring that out.

Now imagine drawing a line through the chart, one that runs roughly through the middle of all the data points, as shown in figure 10-2. This line will help you answer, with some degree of certainty, how much you typically sell when it rains a certain amount.

This is called the regression line and it’s drawn (using a statistics program like SPSS or STATA or even Excel) to show the line that best fits the data. In other words, explains Redman, “The line is the best explanation of the relationship between the independent variable and dependent variable.”

FIGURE 10-2

Building a regression model

The line summarizes the relationship between x and y.

In addition to drawing the line, your statistics program also outputs a formula that explains the slope of the line and looks something like this:

y = 200 + 5x + error term

Ignore the error term for now. It refers to the fact

that regression isn’t perfectly precise. Just focus on the

model:

y = 200 + 5x

What this formula is telling you is that if there is no x then y = 200. So, historically, when it didn’t rain at all, you made an average of 200 sales and you can expect to do the same going forward assuming other variables stay the same. And in the past, for every additional inch of rain, you made an average of five more sales. “For every increment that x goes up one, y goes up by five,” says Redman.
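To make the formula concrete, here is a minimal Python sketch of how a statistics program might arrive at those two numbers using ordinary least squares. The six months of rainfall and sales figures are invented for illustration (they are not Redman's data) and were chosen so that the fitted line comes out to y = 200 + 5x; the function name fit_line is likewise hypothetical.

```python
def fit_line(x_values, y_values):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(x_values)
    mean_x = sum(x_values) / n
    mean_y = sum(y_values) / n
    # Slope = co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_values, y_values))
    sxx = sum((x - mean_x) ** 2 for x in x_values)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: inches of rain and units sold in six different months.
rainfall = [0, 1, 2, 3, 4, 5]
sales = [201, 203, 211, 216, 218, 226]

intercept, slope = fit_line(rainfall, sales)
print(f"y = {intercept:.0f} + {slope:.0f}x")

# Use the fitted line to estimate sales for a month with 3 or 4 inches of rain.
for inches in (3, 4):
    print(inches, "inches ->", round(intercept + slope * inches), "sales")
```

With these made-up numbers the line predicts 215 sales for three inches of rain and 220 for four, which is the kind of answer the regression line is meant to give.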

Now let’s return to the error term. You might be tempted to say that rain has a big impact on sales if for every inch you get five more sales, but whether this variable is worth your attention will depend on the error term. A regression line always has an error term because, in real life, independent variables are never perfect predictors of the dependent variables. Rather, the line is an estimate based on the available data. So the error term tells you how certain you can be about the formula. The larger it is, the less certain the regression line.
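One common way to put a number on the size of the error term is to look at the residuals, the gaps between what the line predicts and what actually happened. The sketch below is an illustration under assumptions rather than Redman's own procedure: it takes the model y = 200 + 5x, applies it to two invented data sets, and reports the residual standard error (a typical miss, in units of sales) for each. The helper name and all figures are hypothetical.

```python
import math


def residual_standard_error(x_values, y_values, intercept, slope):
    """Typical size of the gap between actual values and the line's predictions."""
    residuals = [y - (intercept + slope * x) for x, y in zip(x_values, y_values)]
    # Divide by n - 2 because two numbers (intercept and slope) were estimated.
    return math.sqrt(sum(r ** 2 for r in residuals) / (len(residuals) - 2))


rainfall = [0, 1, 2, 3, 4, 5]
tight_sales = [201, 203, 211, 216, 218, 226]   # hypothetical: points hug the line
noisy_sales = [210, 185, 220, 225, 200, 235]   # hypothetical: points scatter widely

tight_error = residual_standard_error(rainfall, tight_sales, 200, 5)
noisy_error = residual_standard_error(rainfall, noisy_sales, 200, 5)
print(f"tight data: typical miss of about {tight_error:.1f} sales")
print(f"noisy data: typical miss of about {noisy_error:.1f} sales")
```

Both invented data sets yield the same best-fit line, yet a forecast built on the second would deserve far less confidence because its typical miss is ten times larger.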

This example uses only one variable to predict the factor of interest—in this case, rain to predict sales. Typically, you start a regression analysis wanting to understand the impact of several independent variables. So you might include not just rain but also data about a competitor’s promotion. “You keep doing this until the error term is very small,” says Redman. “You’re trying to get the line that fits best with your data.” While there can be dangers in trying to include too many variables in a regression analysis, skilled analysts can minimize those risks. And considering the impact of multiple variables at once is one of the biggest advantages of regression.
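As a rough sketch of what adding a second independent variable looks like, the Python below fits sales against both rainfall and a flag for whether a competitor ran a promotion that month. It solves the standard least-squares normal equations directly; a real analysis would more likely rely on a statistics package. The data, the size of the promotion effect, and the function name fit_linear_model are all made up for illustration.

```python
def fit_linear_model(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    design = [[1.0] + list(row) for row in rows]  # prepend an intercept column
    k = len(design[0])
    # Build the augmented matrix [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(design, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(k):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Hypothetical months: (inches of rain, 1 if a competitor promotion ran else 0).
months = [(0, 0), (1, 1), (2, 0), (3, 0), (4, 1), (5, 1)]
sales = [201, 185, 208, 216, 199, 206]

intercept, per_inch, promo_effect = fit_linear_model(months, sales)
print(f"sales = {intercept:.0f} + {per_inch:.0f} * rain + ({promo_effect:.0f}) * promotion")
```

In this invented example each inch of rain still adds about five sales, while a competitor promotion is associated with about 20 fewer, an effect a rain-only model would have lumped into the error term.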

How Do Companies Use It?

Regression analysis is the “go-to method in analytics,” says Redman. And smart companies use it to make decisions about all sorts of business issues. “As managers, we want to figure out how we can impact sales or employee retention or recruiting the best people. It helps us figure out what we can do.”

Most companies use regression analysis to explain

a phenomenon they want to understand (why did cus-

tomer service calls drop last month?); to predict things

about the future (what will sales look like over the next

six months?); or to decide what to do (should we go with

this promotion or a different one?).

Does Correlation Imply Causation?

Whenever you work with regression analysis or any other analysis that tries to explain the impact of one factor on another, you need to remember the important adage: Correlation is not causation. This is critical and here’s why: It’s easy to say that there is a correlation between rain and monthly sales. The regression shows that they are indeed related. But it’s an entirely different thing to say that rain caused the sales. Unless you’re selling umbrellas, it might be difficult to prove that there is cause and effect.

Sometimes factors are correlated that are obviously not connected by cause and effect, but more often in business it’s not so obvious (see the sidebar, “Beware Spurious Correlations,” at the end of this chapter). When you see a correlation from a regression analysis, you can’t make assumptions, says Redman. Instead, “You have to go out and see what’s happening in the real world. What’s the physical mechanism that’s causing the relationship?” Go out and observe consumers buying your product in the rain, talk to them, and find out what is actually causing them to make the purchase. “A lot of people skip this step and I think it’s because they’re lazy. The goal is not to figure out what is going on in the data but to figure out what is going on in the world. You have to go out and pound the pavement,” he says.

Redman once ran his own experiment and analysis in

order to better understand the connection between his

travel and weight gain. He noticed that when he trav-

eled, he ate more and exercised less. Was his weight gain

caused by travel? Not necessarily. “It was nice to quan-

tify what was happening but travel isn’t the cause. It may

be related,” he says, but it’s not like his being on the road

put those extra pounds on. He had to understand more

about what was happening during his trips. “I’m often

in new environments so maybe I’m eating more because

I’m nervous.” He needed to look more closely at the cor-

relation. And this is his advice to managers. Use the data

to guide more experiments, not to make conclusions

about cause and effect.

What Mistakes Do People Make When Working with Regression Analysis?

As a consumer of regression analysis, you need to keep several things in mind.

First, don’t tell your data analyst to go figure out what is affecting sales. “The way most analyses go haywire is the manager hasn’t narrowed the focus on what he or she is looking for,” says Redman. It’s your job to identify the factors that you suspect are having an impact and ask your analyst to look at those. “If you tell a data scientist to go on a fishing expedition, or to tell you something you don’t know, then you deserve what you get, which is bad analysis,” he says. In other words, don’t ask your analysts to look at every variable they can possibly get their hands on all at once. If you do, you’re likely to find relationships that don’t really exist. It’s the same principle as flipping a coin: Do it enough times, you’ll eventually think you see something interesting, like a bunch of heads all in a row. (For more on how to communicate your data needs to experts, see chapter 4.)
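The coin-flipping point can be checked with a small simulation. In the hedged sketch below, both the "sales" series and 500 candidate explanatory variables are pure random noise, so any relationship that turns up is an accident. The 0.33 cutoff is an assumption, picked because it is roughly what a conventional 5% significance test would flag with 36 monthly data points; the seed, counts, and names are arbitrary.

```python
import random


def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(42)  # fixed seed so the run is repeatable
months = 36

# Simulated monthly sales: pure noise around 200, driven by nothing at all.
sales = [random.gauss(200, 20) for _ in range(months)]

# 500 candidate "explanatory" variables that are also pure noise.
candidates = [[random.gauss(0, 1) for _ in range(months)] for _ in range(500)]

# Count how many look meaningfully correlated with sales purely by chance.
lucky = [c for c in candidates if abs(correlation(c, sales)) > 0.33]
print(f"{len(lucky)} of {len(candidates)} random variables look related to sales")
```

An analyst sent on a fishing expedition through hundreds of variables would be handed several of these accidental "findings," which is why narrowing the question first matters.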

Also keep in mind whether or not you can do anything

about the independent variable you’re considering. You

can’t change how much it rains, so how important is it to

understand that? “We can’t do anything about weather

or our competitor’s promotion but we can affect our own

promotions or add features, for example,” says Redman.

Always ask yourself what you will do with the data. What

actions will you take? What decisions will you make?

Second, “analyses are very sensitive to bad data” so

be careful about the data you collect and how you col-

lect it, and know whether you can trust it (as we learned

in chapter 8). “All the data doesn’t have to be correct or

perfect,” explains Redman, but consider what you will be

doing with the analysis. If the decisions you’ll make as a

result don’t have a huge impact on your business, then

it’s OK if the data is “kind of leaky.” But, “if you’re try-

ing to decide whether to build 8 or 10 of something and

each one costs $1 million to build, then it’s a bigger deal,”

he says.

Redman also says that some managers who are new to understanding regression analysis make the mistake of ignoring the error term. This is dangerous because they’re treating the relationship between two variables as more certain than it is. “Oftentimes the results spit out of a computer and managers think, ‘That’s great, let’s use this going forward.’” But remember that the results are always uncertain. As Redman points out, “If the regression explains 90% of the relationship, that’s great. But if it explains 10%, and you act like it’s 90%, that’s not good.” The point of the analysis is to quantify the certainty that something will happen. “It’s not telling you how rain will influence your sales, but it’s telling you the probability that rain may influence your sales.”
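Redman's 90%-versus-10% comparison is often read as a statement about R-squared, the share of the variation in the dependent variable that the regression line accounts for. That reading is an assumption here, and the two data sets below are invented so that the same line, y = 200 + 5x, explains about 90% of the variation in one and about 10% in the other.

```python
def r_squared(x_values, y_values, intercept, slope):
    """Share of the variation in y accounted for by the line."""
    mean_y = sum(y_values) / len(y_values)
    total = sum((y - mean_y) ** 2 for y in y_values)
    unexplained = sum(
        (y - (intercept + slope * x)) ** 2 for x, y in zip(x_values, y_values)
    )
    return 1 - unexplained / total


rainfall = [0, 1, 2, 3, 4, 5]
steady_sales = [202, 201, 212, 217, 216, 227]    # hypothetical
erratic_sales = [218, 169, 228, 233, 184, 243]   # hypothetical

for label, sales in (("steady", steady_sales), ("erratic", erratic_sales)):
    share = r_squared(rainfall, sales, 200, 5)
    print(f"{label}: the line explains about {share:.0%} of the variation in sales")
```

The formula handed to the manager is identical in both cases, so only a figure like this one reveals whether it deserves to be acted on.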

The last mistake that Redman warns against is letting data replace your intuition. “You always have to lay your intuition on top of the data,” he explains. Ask yourself whether the results fit with your understanding of the situation. And if you see something that doesn’t make sense, ask whether the data was right or whether there is indeed a large error term. Redman suggests you look to more experienced managers or other analyses if you’re getting something that doesn’t make sense. And, he says, never forget to look beyond the numbers to what’s happening outside your office: “You need to pair any analysis with study of the real world. The best scientists—and managers—look at both.”

Amy Gallo is a contributing editor at Harvard Business Review and the author of the HBR Guide to Dealing with Conflict. Follow her on Twitter @amyegallo.

BEWARE SPURIOUS CORRELATIONS

We all know the truism “Correlation doesn’t imply cau-

sation,” but when we see lines sloping together, bars

rising together, or points on a scatterplot clustering,

the data practically begs us to assign a reason. We

want to believe one exists.

Statistically we can’t make that leap, however.

Charts that show a close correlation are often relying

on a visual parlor trick to imply a relationship. Tyler Vi-

gen, a JD student at Harvard Law School and the au-

thor of Spurious Correlations, has made sport of this

on his website, which charts farcical correlations—for

example, between U.S. per capita margarine con-

sumption and the divorce rate in Maine.
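The arithmetic behind such charts is easy to reproduce. In the sketch below, two series that merely both drift upward over four years earn a correlation coefficient close to 1. The numbers are invented stand-ins, not figures from Vigen's site, and the function and variable names are hypothetical.

```python
def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical yearly values for two unrelated series that both trend upward.
phone_sales_millions = [1, 12, 21, 40]
stair_fall_deaths = [1920, 1935, 1960, 1990]

r = correlation(phone_sales_millions, stair_fall_deaths)
print(f"correlation = {r:.2f}")
```

A coefficient that high says only that both lines rose over the same period; with four data points and a shared upward trend, almost any two growing quantities would score similarly.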

Vigen has programmed his site so that anyone can

find and chart absurd correlations in large data sets.

We tried a few of our own and came up with these

gems:

[Chart: “More iPhones means more people die from falling down stairs.” iPhone sales plotted against deaths caused by falls down stairs (U.S.), 2007 to 2010. Source: Tylervigen.com]

[Chart: “Let’s cheer on the team, and we’ll lose weight.” Per capita consumption of high-fructose corn syrup (U.S.) plotted against spending on admission to spectator sports (U.S.), 2000 to 2009. Source: Tylervigen.com]

[Chart: “To increase auto sales, market trips to Universal Orlando.” Visitors to Universal Orlando’s “Islands of Adventure” plotted against sales of new cars (U.S.), 2007 to 2009. Source: Tylervigen.com]

Although it’s easy to spot and explain away absurd examples, like these, you’re likely to encounter rigged but plausible charts in your daily work. Here are three types to watch out for:

Apples and Oranges: Comparing Dissimilar Variables

Y axis scales that measure different values may show

similar curves that shouldn’t be paired. This becomes

pernicious when the values appear to be related but

aren’t.

[Chart: total “Black Friday” online revenue and eBay total gross merchandise volume, 2008 to 2013, plotted together on two different y axes.]

It’s best to chart them separately.

[Charts: eBay total gross merchandise volume, 2008 to 2013, and total “Black Friday” online revenue, 2008 to 2013, each plotted on its own.]

Skewed Scales: Manipulating Ranges to Align Data

Even when y axes measure the same category, changing the scales can alter the lines to suggest a correlation. These y axes for RetailCo’s monthly revenue differ in range and proportional increase.

Eliminating the second axis shows how skewed this

chart is.
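One way to check whether a dual-axis chart is flattering the data is to put both series on a common footing before comparing them, for example by indexing each to its own starting value. The sketch below does that for two invented revenue series; RetailCo's actual figures are not given in the text, so every number here is a placeholder and the function name is hypothetical.

```python
def index_to_first(values):
    """Rescale a series so its first value equals 100."""
    return [100 * v / values[0] for v in values]


# Hypothetical revenue for four consecutive months, in thousands of dollars.
over_40 = [200, 250, 300, 400]
under_40 = [20, 22, 25, 30]

print("over 40: ", index_to_first(over_40))
print("under 40:", index_to_first(under_40))
```

On a dual-axis chart these two made-up series could be drawn to look nearly identical, but once indexed, one has doubled while the other has grown by half.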

[Charts: RetailCo’s monthly revenue, January through December, for customers over 40 and customers under 40. The first chart uses two y axes with different ranges ($0 to $500K and $10K to $35K); the second plots both series on a single axis ($0 to $500K).]

Ifs and Thens: Implying Cause and Effect

Plotting unrelated data sets together can make it seem

that changes in one variable are causing changes in the

other.

We try to create a narrative—if Pandora loses less money, then more music is copyrighted—from what is probably a coincidence.

[Charts: Pandora net losses and musical works copyrighted (U.S.), 2006 to 2009, first plotted together on two y axes, then each plotted separately.]

Adapted from “Beware Spurious Correlations,” Harvard Business Review, June 2015 (product #F1506Z).

CHAPTER 11

When to Act On a Correlation, and When Not To

by David Ritter

“Petabytes allow us to say: ‘Correlation is enough.’”

—Chris Anderson, Wired, June 23, 2008

The sentiment expressed by Chris Anderson in 2008 is a popular meme in the big data community. “Causality is dead,” say the priests of analytics and machine learning. They argue that given enough statistical evidence, it’s no longer necessary to understand why things happen—we need only know what things happen together.

Adapted from content posted on hbr.org, March 19, 2014 (product

#H00Q1X).

But inquiring whether correlation is enough is asking the wrong question. For consumers of big data, the key question is, “Can I take action on the basis of a correlation finding?” The answer to that question is, “It depends”—primarily on two factors:

• Confidence that the correlation will reliably recur in the future. The higher that confidence level, the more reasonable it is to take action in response.

• The trade-off between the risk and reward of acting. If the risk of acting and being wrong is extremely high, for example, acting on even a strong correlation may be a mistake.

The first factor—the confidence that the correlation will recur—is in turn a function of two things: the frequency with which the correlation has historically occurred (the more often events occur together in real life, the more likely it is that they are connected) and an understanding of what is causing that statistical finding. This second element—what we call “clarity of causality”—stems from the fact that the fewer possible explanations there are for a correlation, the higher the likelihood that the two events are linked. Considering frequency and clarity together yields a more reliable gauge of the overall confidence in the finding than evaluating only one or the other in isolation.

Understanding the interplay between the confidence level and the risk/reward trade-off enables sound decisions on what action—if any—makes sense in light of a particular statistical finding. The bottom line: Causality can matter tremendously. And efforts to gain better insight into the cause of a correlation can drive up the confidence level of taking action.

These concepts allowed The Boston Consulting Group (BCG) to develop a prism through which any potential action can be evaluated. If the value of acting is high, and the cost of acting when wrong is low, it can make sense to act based on even a weak correlation. We choose to look both ways before crossing the street because the cost of looking is low and the potential loss from not looking is high (what is known in statistical jargon as an “asymmetric loss function”). Alternatively, if the confidence in the finding is low because you don’t have a handle on why two events are linked, you should be less willing to take actions that have significant potential downside, as illustrated in figure 11-1.
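The framework describes a judgment call, not a formula, but a simple expected-value rule captures much of the same intuition and may help fix ideas. The sketch below is a simplification offered for illustration, not BCG's method: it treats confidence as a probability between 0 and 1 and recommends acting when the expected benefit of being right exceeds the expected cost of being wrong. All names and numbers are hypothetical.

```python
def should_act(confidence, benefit_if_right, cost_if_wrong):
    """Act when the expected gain outweighs the expected loss."""
    expected_gain = confidence * benefit_if_right
    expected_loss = (1 - confidence) * cost_if_wrong
    return expected_gain > expected_loss


# Weak correlation, cheap mistake (like looking both ways): act.
print(should_act(confidence=0.3, benefit_if_right=100, cost_if_wrong=5))
# Same weak correlation, expensive mistake: hold off.
print(should_act(confidence=0.3, benefit_if_right=100, cost_if_wrong=400))
# A clearer causal story raises confidence enough to justify the riskier action.
print(should_act(confidence=0.9, benefit_if_right=100, cost_if_wrong=400))
```

The three calls print True, False, and True, showing how either a cheaper downside or a better understanding of the cause can tip the same finding from "don't act" to "act."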

FIGURE 11-1

When to act on a correlation in your data

How confident are you in the relationship? And do the benefits of action outweigh the risk?

[Chart: The vertical axis shows confidence in the relationship, rising from “Infrequent, unstable correlation” through “Frequent correlation; but many causal hypotheses” to “Frequent correlation; clear causal hypothesis.” The horizontal axis shows benefits of action relative to cost of being wrong, running from “Risks outweigh benefits” to “Benefits outweigh risk.” The chart is divided into an “Act” region and a “Don’t act” region.]

Source: David Ritter, BCG

Consider the case of New York City’s sewer sensors. These sensors detect the amount of grease flowing into the sewer system at various locations throughout the city. If the data collected shows a concentration of grease at an unexpected location—perhaps due to an unlicensed restaurant—officials will send a car out to determine the source. The confidence in the meaning of the data from the sensors is on the low side—there may be many other explanations for the excessive influx of grease. But there’s little cost if the inspection finds nothing amiss.

Recent decisions around routine PSA screening tests for prostate cancer involved a very different risk/reward trade-off. Confidence that PSA blood tests are a good predictor of cancer is low because the correlation itself is weak—elevated PSA levels are found often in men without prostate cancer. There is also no clear causal explanation for how PSA is related to the development of cancer. In addition, preventative surgery prompted by the test did not increase long-term survival rates. And the risk associated with screening was high, with false positives leading to unnecessary, debilitating treatment. The result: The American Medical Association reversed its previous recommendation that men over 50 have routine PSA blood tests.

Of course, there is usually not just one, but a range of possible actions in response to a statistical finding. This came into play recently in a partnership between an Australian supermarket and an auto insurance company. Combining data from the supermarket’s loyalty card program with auto claims information revealed interesting correlations. The data showed that people who buy red meat and milk are good car insurance risks while people who buy pasta and spirits and who fuel their cars at night are poor risks. Though this statistical relationship could be an indicator of risky behaviors (driving under the influence of spirits, for example), there are a number of other possible reasons for the finding.

Potential responses to the finding included:

• Targeting insurance marketing to loyalty card

holders in the low-risk group

• Pricing car insurance based on these buying

patterns

The latter approach, however, could lead to brand-damaging backlash should the practice be exposed. Looking at the two options via our framework in figure 11-2 makes clear that without additional confidence in the finding, the former approach is preferable.

FIGURE 11-2

If supermarket purchases correlate with auto insurance claims, what should an insurer do?

With the cause of the relationship unclear, low-risk actions are advisable.

[Chart: The same framework as figure 11-1, with “Targeting insurance marketing based on buying patterns” placed in the “Act” region and “Pricing car insurance based on buying patterns” placed in the “Don’t act” region.]

Source: David Ritter, BCG

However, if we are able to find a clear causal explanation for this correlation, we may be able to increase confidence sufficiently to take the riskier, higher-value action of increasing rates. For example, the buying patterns associated with higher risks could be leading indicators of an impending life transition such as loss of employment or a divorce. This possible explanation could be tested by adding additional data to the analysis.

In this case causality is critical. New factors can potentially be identified that create a better understanding of the dynamics at work. The goal is to rule out some possible causes and shed light on what is really driving that correlation. That understanding will increase the overall level of confidence that the correlation will continue in the future—essentially shifting possible actions into the upper portion of the framework. The result may be that previously ruled-out responses are now appropriate. In addition, insight on the cause of a correlation can allow you to look for changes that cause the linkage to weaken or disappear. And that knowledge makes it possible to monitor and respond to events that might make a previously sound response outdated.

There is no shortage of examples where the selection of the right response hinges on this “clarity of cause.” The U.S. Army, for example, has developed image processing software that uses flashes of light to locate the possible position of a sniper. But similar flashes also come from a camera. With two potential reasons for the imaging pattern, the confidence in the finding is lower than it would be if there were just one. And that, of course, will determine how to respond—and what level of risk is acceptable.

When working with big data, sometimes correlation

is enough. But other times understanding the cause is vi-

tal. The key is to know when correlation is enough—and

what to do when it is not.

David Ritter is a director in the Technology Advantage

practice of The Boston Consulting Group (BCG), where

he advises clients on the use of technology for competi-

tive advantage, open innovation, and other topics.

CHAPTER 12

Can Machine Learning Solve Your Business Problem?

by Anastassia Fedyk

As you consider ways to analyze large swaths of data, you

may ask yourself how the latest technological tools and

automation can help. AI, big data, and machine learn-

ing are all trending buzzwords, but how can you know

which problems in your business are amenable to ma-

chine learning?

Adapted from “How to Tell If Machine Learning Can Solve Your Busi-

ness Problem” on hbr.org, November 25, 2016 (product #H03A8R).

To decide, you need to think about the problem to be

solved and the available data, and ask questions about

feasibility, intuition, and expectations.

Assess Whether Your Problem Requires Learning

Machine learning can help automate your processes, but not all automation problems require learning.

Automation without learning is appropriate when the problem is relatively straightforward—the kinds of tasks where you have a clear, predefined sequence of steps that is currently being executed by a human, but could conceivably be transitioned to a machine. This sort of automation has been happening in businesses for decades. Screening incoming data from an outside data provider for well-defined potential errors is an example of a problem ready for automation. (For example, hedge funds automatically filter out bad data in the form of a negative value for trading volume, which can’t be negative.) On the other hand, encoding human language into a structured data set is something that is just a tad too ambitious for a straightforward set of rules.
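The trading-volume check is a good illustration of automation that needs no learning at all: the rule is known in advance and fits in a line of code. The following sketch assumes a hypothetical feed in which each record is a dictionary with "ticker" and "volume" fields; real provider formats will differ.

```python
def drop_invalid_volumes(records):
    """Keep only records whose trading volume is not negative."""
    return [record for record in records if record["volume"] >= 0]


# Hypothetical feed from an outside data provider.
feed = [
    {"ticker": "AAA", "volume": 12000},
    {"ticker": "BBB", "volume": -500},   # impossible value, should be dropped
    {"ticker": "CCC", "volume": 0},
]

clean = drop_invalid_volumes(feed)
print([record["ticker"] for record in clean])
```

Nothing here is inferred from past data; the program simply applies a rule a person already knew, which is what separates it from the machine learning problems discussed next.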

For the second type of problem, standard automation is not enough. Such complex problems require learning from data—and now we venture into the arena of machine learning. Machine learning, at its core, is a set of statistical methods meant to find patterns of predictability in data sets. These methods are great at determining how certain features of the data are related to the outcomes you are interested in. What these methods cannot do is access any knowledge outside of the data you provide. For example, researchers at the University of Pittsburgh in the late 1990s evaluated machine-learning algorithms for predicting mortality rates from pneumonia.1 The algorithms recommended that hospitals send home pneumonia patients who were also asthma sufferers, estimating their risk of death from pneumonia to be lower. It turned out that the data set fed into the algorithms did not account for the fact that asthma sufferers had been immediately sent to intensive care, and had fared better only because of the additional attention.2

So what are good business problems for machine

learning methods? Essentially, any problems that meet

the following two criteria:

1. They require prediction rather than causal inference.

2. They are sufficiently self-contained or relatively insulated from outside influences.

The first means that you are interested in understanding how, on average, certain aspects of the data relate to each other, and not in the causal channels of their relationship. (Keep in mind that the statistical methods do not bring to the table the intuition, theory, or domain knowledge of human analysts.) The second means that you are relatively certain that the data you feed to your learning algorithm includes more or less all there is to the problem. If, in the future, the thing you’re trying to predict changes unexpectedly and no longer matches prior patterns in the data, the algorithm will not know what to make of it.

Examples of good machine learning problems include predicting the likelihood that a certain type of user will click on a certain kind of ad, or evaluating the extent to which a piece of text is similar to previous texts you have seen. (To see an example of how an artificial intelligence algorithm learned from existing customer data and test marketing campaigns to find new sales leads, see the sidebar “Artificial Intelligence at Harley-Davidson.”)

Bad examples include predicting profits from the introduction of a completely new and revolutionary product line, or extrapolating next year’s sales from past data when an important new competitor just entered the market.

ARTIFICIAL INTELLIGENCE AT HARLEY-DAVIDSON

by Brad Power

It was winter in New York City, and Asaf Jacobi’s Harley-Davidson dealership was selling one or two motorcycles a week. It wasn’t enough.

Jacobi went for a long walk in Riverside Park and happened to bump into Or Shani, CEO of an AI firm, Adgorithms. After discussing Jacobi’s sales woes, Shani suggested he try out Albert, Adgorithms’ AI-driven marketing platform. It works across digital channels, like Facebook and Google, to measure and then autonomously optimize the outcomes of marketing campaigns. Jacobi decided he’d give Albert a one-weekend audition.

That weekend, Jacobi sold 15 motorcycles—almost

twice his all-time summer weekend sales record of

eight.

Naturally, Jacobi kept using Albert. His dealership went from getting one qualified lead per day to 40. In the first month, 15% of those new leads were lookalikes, meaning that the people calling the dealership to set up a visit resembled previous high-value customers and therefore were more likely to make a purchase. By the third month, the dealership’s leads had increased 2,930%, 50% of them lookalikes, leaving Jacobi scrambling to set up a new call center with six new employees to handle all the new business.

While Jacobi had estimated that only 2% of New York City’s population were potential buyers, Albert revealed that his target market was larger—much larger—and began finding customers Jacobi didn’t even know existed.

How did it do that?

Albert drove in-store traffic by generating leads, defined as customers who express interest in speaking to a salesperson by filling out a form on the dealership’s website. Armed with creative content (headlines and visuals) provided by Harley-Davidson and key performance targets, Albert began by analyzing existing customer data from Jacobi’s customer relationship management system to isolate defining characteristics and behaviors of high-value past customers: those who either had completed a purchase, added an item to an online cart, viewed website content, or were among the top 25% in terms of time spent on the website.

Using this information, Albert identified lookalikes who resembled these past customers and created microsegments—small sample groups with whom it could run test campaigns before extending its efforts more widely. Albert used the data gathered through these tests to predict which possible headlines and visual combinations, and thousands of other campaign variables, would most likely convert different audience segments through various digital channels (social media, search, display, and email or SMS).

Once it determined what was working and what

wasn’t, Albert scaled the campaigns, autonomously

allocating resources from channel to channel, making

content recommendations, and so on.

For example, when it discovered that ads with the word call—such as, “Don’t miss out on a pre-owned Harley with a great price! Call now!”—performed 447% better than ads containing the word buy, such as, “Buy a pre-owned Harley from our store now!”, Albert immediately changed buy to call in all ads across all relevant channels. The results spoke for themselves.

For Harley-Davidson, AI evaluated what was working across digital channels and what wasn’t, and used what it learned to create more opportunities for conversion. In other words, the system allocated resources only to what had been proven to work, thereby increasing digital marketing ROI. Using AI, Harley-Davidson was able to eliminate guesswork, gather and analyze enormous volumes of data, and optimally leverage the resulting insights.

Adapted from “How Harley-Davidson Used Artificial Intelligence to Increase New York Sales Leads by 2,930%” on hbr.org, May 30, 2017 (product #H03NFD).

Brad Power is a consultant who helps organizations that must make faster changes to their products, services, and systems to compete with startups and leading software companies.

Find the Appropriate Data

Once you verify that your problem is suitable for machine learning, the next step is to evaluate whether you have the right data to solve it. The data might come from you or from an external provider. In the latter case, ask enough questions to get a good feel for the data’s scope and whether it is likely to be a good fit for your problem.

Ask Questions and Look for Mistakes

Once you’ve determined that your problem is a classic machine learning problem and you have the data to fit it, check your intuition. Machine learning methods, however proprietary and seemingly magical, are statistics. And statistics can be explained in intuitive terms. Instead of trusting that the brilliant proposed method will seamlessly work, ask lots of questions.

Get yourself comfortable with how the method works. Does the intuition of the method roughly make sense? Does it fit conceptually into the framework of the particular setting or problem you are dealing with? What makes this method especially well-suited to your problem? If you are encoding a set of steps, perhaps sequential models or decision trees are a good choice. If you need to separate two classes of outcome, perhaps a binary support vector machine would be best aligned with your needs.

With understanding come more realistic expectations.

Once you ask enough questions and receive enough an-

swers to have an intuitive understanding of how the

methodology works, you will see that it is far from magi-

cal. Every human makes mistakes, and every algorithm

is error prone too. For all but the simplest of problems,

there will be times when things go wrong. The machine

learning prediction engine will get things right on aver-

age but will reliably make mistakes. And these errors will

happen most often in ways that you cannot anticipate.

Decide How to Move Forward

The last step is to evaluate the extent to which you can allow for exceptions or statistical errors in your process. Is your problem the kind where getting things right 80% of the time is enough? Can you deal with a 10% error rate? 5%? 1%? Are there certain kinds of errors that should never be allowed? Be clear and upfront about your needs and expectations, both with yourself and with your solution provider. And once both of you are comfortably on the same page, go ahead. Armed with knowledge, understanding, and reasonable expectations, you are set to reap the benefits of machine learning. Just please be patient.

Anastassia Fedyk is a PhD candidate in business economics at Harvard Business School. Her research focuses on finance and behavioral economics.

NOTES

1. G. F. Cooper et al., “An Evaluation of Machine-Learning Methods for Predicting Pneumonia Mortality,” Artificial Intelligence in Medicine 9 (1997): 107–138.

2. A. M. Bornstein, “Is Artificial Intelligence Permanently Inscrutable?” Nautilus, September 1, 2016, http://nautil.us/issue/40/learning/is-artificial-intelligence-permanently-inscrutable.


CHAPTER 13

A Refresher on Statistical Significance

by Amy Gallo

When you run an experiment or analyze data, you want to know if your findings are significant. But business relevance (that is, practical significance) isn’t always the same thing as confidence that a result isn’t due purely to chance (that is, statistical significance). This is an important distinction; unfortunately, statistical significance is often misunderstood and misused in organizations today. And because more and more companies are relying on data to make critical business decisions, it’s an essential concept for managers to understand.

Adapted from content posted on hbr.org, February 16, 2016 (product

#H02NMS).


To better understand what statistical significance really means, I talked with Thomas Redman, author of Data Driven: Profiting from Your Most Important Business Asset, and adviser to organizations on their data and data quality programs.

What Is Statistical Significance?

“Statistical significance helps quantify whether a result is likely due to chance or to some factor of interest,” says Redman. When a finding is significant, it simply means you can feel confident that it’s real, not that you just got lucky (or unlucky) in choosing the sample.

When you run an experiment, conduct a survey, take

a poll, or analyze a data set, you’re taking a sample of

some population of interest, not looking at every single

data point that you possibly can. Consider the example

of a marketing campaign. You’ve come up with a new

concept and you want to see if it works better than your

current one. You can’t show it to every single target cus-

tomer, of course, so you choose a sample group.

When you run the results, you find that those who saw the new campaign spent $10.17 on average, more than the $8.41 spent by those who saw the old campaign. This $1.76 might seem like a big—and perhaps important—difference. But in reality you may have been unlucky, drawing a sample of people who do not represent the larger population; in fact, maybe there was no difference between the two campaigns and their influence on consumers’ purchasing behaviors. This is called a sampling error, something you must contend with in any test that does not include the entire population of interest.


Redman notes that there are two main contributors to sampling error: the size of the sample and the variation in the underlying population. Sample size may be intuitive enough. Think about flipping a coin 5 times versus flipping it 500 times. The more times you flip, the less likely you’ll end up with a great majority of heads. The same is true of statistical significance: With bigger sample sizes, you’re less likely to get results that reflect randomness. All else being equal, you’ll feel more comfortable in the accuracy of the campaigns’ $1.76 difference if you showed the new one to 1,000 people rather than just 25. Of course, showing the campaign to more people costs more money, so you have to balance the need for a larger sample size with your budget.
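To put a number on the coin-flip intuition, here is a minimal Python sketch. It assumes a fair coin and reads "a great majority" as at least 80% heads; that threshold and the function name are illustrative choices, not something Redman specifies.

```python
from math import comb

def prob_at_least(n_flips: int, min_heads: int) -> float:
    """Exact chance of getting at least min_heads heads in n_flips fair flips."""
    favorable = sum(comb(n_flips, k) for k in range(min_heads, n_flips + 1))
    return favorable / 2 ** n_flips

# Assumption: "a great majority" means 80% heads or more.
p_small = prob_at_least(5, 4)      # 4 or 5 heads out of 5 flips
p_large = prob_at_least(500, 400)  # 400 or more heads out of 500 flips

print(f"5 flips:   {p_small:.4f}")
print(f"500 flips: {p_large:.3e}")
```

With 5 flips a lopsided result is fairly common; with 500 it is vanishingly rare. The same logic is why a difference measured on 1,000 customers deserves more trust than one measured on 25.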

Variation is a little trickier to understand, but Redman insists that developing a sense for it is critical for all managers who use data. Consider the images in figure 13-1. Each expresses a different possible distribution of customer purchases under campaign A. In the chart on the left (with less variation), most people spend roughly the same amount. Some people spend a few dollars more or less, but if you pick a customer at random, chances are pretty good that they’ll be close to the average. So it’s less likely that you’ll select a sample that looks vastly different from the total population, which means you can be relatively confident in your results.

Compare that with the chart on the right (with more variation). Here, people vary more widely in how much they spend. The average is still the same, but quite a few people spend more or less. If you pick a customer at random, chances are higher that they are pretty far from the average. So if you select a sample from a more varied population, you can’t be as confident in your results.

To summarize, the important thing to understand is

that the greater the variation in the underlying popula-

tion, the larger the sampling error.

Redman advises that you should plot your data and make pictures like these when you analyze the numbers. The graphs will help you get a feel for variation, the sampling error, and in turn, the statistical significance.

No matter what you’re studying, the process for evaluating significance is the same. You start by stating a null hypothesis. In the experiment about the marketing campaign, the null hypothesis might be, “On average, customers don’t prefer our new campaign to the old one.” Before you begin, you should also state an alternative hypothesis, such as, “On average, customers prefer the new one,” and a target significance level. The significance level is an expression of how rare your results are, under the assumption that the null hypothesis is true. It is usually expressed as a p-value, and the lower the p-value, the less likely the results are due purely to chance.

FIGURE 13-1

Population variation

[Two charts of number of customers (vertical axis) against spend amount (horizontal axis): lesser variation on the left, greater variation on the right.]

Source: Thomas C. Redman

Setting a target and interpreting p-values can be

dauntingly complex. Redman says it depends a lot on

what you are analyzing. “If you’re searching for the Higgs

boson, you probably want an extremely low p-value,

maybe 0.00001,” he says. “But if you’re testing for

whether your new marketing concept is better or the new

drill bits your engineer designed work faster than your

existing bits, then you’re probably willing to take a higher

value, maybe even as high as 0.25.”

Note that in many business experiments, managers skip these two initial steps and don’t worry about significance until after the results are in. However, it’s good scientific practice to do these two things ahead of time.

Then you collect your data, plot the results, and calculate statistics, including the p-value, which incorporates variation and the sample size. If you get a p-value lower than your target, then you reject the null hypothesis in favor of the alternative. Again, this means the probability is small that your results were due solely to chance.
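One way to see how a p-value "incorporates variation and the sample size" is a permutation test, sketched below in Python. The article does not say which test an analyst would use, so this is one simple option, not the method behind Redman's numbers. The spend data is synthetic: the two averages come from the campaign example, while the group size (200 per campaign), the spread (a standard deviation of $4), and the 0.05 target are assumptions made for illustration.

```python
import random

def permutation_p_value(new, old, n_shuffles=2000, seed=0):
    """One-sided p-value: the share of random relabelings whose mean gap
    (new minus old) is at least as large as the gap actually observed."""
    rng = random.Random(seed)
    observed = sum(new) / len(new) - sum(old) / len(old)
    pooled = list(new) + list(old)
    n_new = len(new)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the campaign labels are meaningless
        gap = (sum(pooled[:n_new]) / n_new
               - sum(pooled[n_new:]) / (len(pooled) - n_new))
        if gap >= observed:
            hits += 1
    return hits / n_shuffles

# Synthetic data (assumed): 200 customers per campaign, $4 standard deviation.
data_rng = random.Random(42)
old_spend = [data_rng.gauss(8.41, 4.0) for _ in range(200)]
new_spend = [data_rng.gauss(10.17, 4.0) for _ in range(200)]

TARGET = 0.05  # assumed target significance level, set before looking at results
p_value = permutation_p_value(new_spend, old_spend)
verdict = "reject" if p_value < TARGET else "do not reject"
print(f"p-value = {p_value:.4f}; {verdict} the null hypothesis")
```

Shrinking the groups or widening the spread in the synthetic data will tend to push the p-value up, which is the sample-size and variation effect described earlier in the chapter.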

How Is It Calculated?

As a manager, chances are you won’t ever calculate statistical significance yourself. “Most good statistical packages will report the significance along with the results,” says Redman. Microsoft Excel also has a formula for it, and a number of online tools will calculate it for you.

Still, it’s helpful to know the process in order to understand and interpret the results. As Redman advises, “Managers should not trust a model they don’t understand.”


How Do Companies Use It?

Companies use statistical significance to understand how strongly the results of an experiment, survey, or poll they’ve conducted should influence the decisions they make. For example, if a manager runs a pricing study to understand how best to price a new product, they will calculate the statistical significance (with the help of an analyst, most likely) so that they know whether the findings should affect the final price.

Remember the new marketing campaign that produced a $1.76 boost (more than 20%) in the average sale? It’s surely of practical significance. If the p-value comes in at 0.03, the result is also statistically significant, and you should adopt the new campaign. If the p-value comes in at 0.2, the result is not statistically significant, but since the boost is so large you’ll probably still proceed, though perhaps with a bit more caution.

But what if the difference were only a few cents? If the p-value comes in at 0.2, you’ll stick with your current campaign or explore other options. But even if the p-value comes in at 0.03, the result, though probably real, is quite small. In this case, your decision probably will be based on other factors, such as the cost of implementing the new campaign.

Closely related to the idea of a significance level is the notion of a confidence interval. Let’s take the example of a political poll. Say there are two candidates: A and B. The pollsters conduct an experiment with 1,000 “likely voters.” From the sample, 49% say they’ll vote for A and 51% say they’ll vote for B. The pollsters also report a margin of error of +/- 3%.


“Technically, 49% plus or minus 3% is a 95% confidence interval for the true proportion of A voters in the population,” says Redman. Unfortunately, he adds, most people interpret this as “there’s a 95% chance that A’s true percentage lies between 46% and 52%,” but that isn’t correct. Instead, it says that if the pollsters were to repeat the poll many times, 95% of intervals constructed this way would contain the true proportion.

If your head is spinning at that last sentence, you’re not alone. As Redman says, this interpretation is “maddeningly subtle, too subtle for most managers and even many researchers with advanced degrees.” He says the more practical interpretation of this would be “Don’t get too excited that B has a lock on the election” or “B appears to have a lead, but it’s not a statistically significant one.” Of course, the practical interpretation would be very different if 70% of the likely voters said they’d vote for B and the margin of error was 3%.
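For readers who want to see where a figure like "plus or minus 3%" can come from, the Python sketch below uses the standard normal-approximation formula for a proportion. It assumes a simple random sample and a z-value of 1.96 for 95% confidence; real pollsters may adjust for weighting and survey design. The second function simulates many repeated polls to illustrate the "95% of intervals" reading, with the true share for A fixed at a hypothetical 49%.

```python
import random
from math import sqrt

Z_95 = 1.96  # assumed: normal-approximation multiplier for 95% confidence

def margin_of_error(share: float, n: int, z: float = Z_95) -> float:
    """Half-width of the interval around a sample proportion."""
    return z * sqrt(share * (1 - share) / n)

def coverage(true_share: float, n: int, n_polls: int = 400, seed: int = 1) -> float:
    """Fraction of simulated polls whose interval contains the true share."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(n_polls):
        votes = sum(rng.random() < true_share for _ in range(n))
        share = votes / n
        moe = margin_of_error(share, n)
        if share - moe <= true_share <= share + moe:
            covered += 1
    return covered / n_polls

moe = margin_of_error(0.49, 1000)
low, high = 0.49 - moe, 0.49 + moe
print(f"Margin of error: +/- {moe:.1%}")
print(f"Interval for A: {low:.1%} to {high:.1%}")

coverage_rate = coverage(true_share=0.49, n=1000)  # hypothetical true share
print(f"Share of simulated polls whose interval held the truth: {coverage_rate:.0%}")
```

The interval for A straddles 50%, which is the arithmetic behind Redman's practical reading that B's lead is not statistically significant.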

The reason managers bother with statistical significance is they want to know what findings say about what they should do in the real world. But “confidence intervals and hypothesis tests were designed to support ‘science,’ where the idea is to learn something that will stand the test of time,” says Redman. Even if a finding isn’t statistically significant, it may have utility to you and your company. On the other hand, when you’re working with large data sets, it’s possible to obtain results that are statistically significant but practically meaningless, for example, that a group of customers is 0.000001% more likely to click on campaign A over campaign B. So rather than obsessing about whether your findings are precisely right, think about the implication of each finding for the decision you’re hoping to make. What would you do differently if the finding were different?

What Mistakes Do People Make When Working with Statistical Significance?

“Statistical significance is a slippery concept and is often misunderstood,” warns Redman. “I don’t run into very many situations where managers need to understand it deeply, but they need to know how to not misuse it.”

Of course, data scientists don’t have a monopoly on the word “significant,” and often in businesses it’s used to mean whether a finding is strategically important. It’s good practice to use language that’s as clear as possible when talking about data findings. If you want to discuss whether the finding has implications for your strategy or decisions, it’s fine to use the word “significant,” but if you want to know whether something is statistically significant, be precise in your language. Next time you look at results of a survey or experiment, ask about the statistical significance if the analyst hasn’t reported it.

Remember that statistical significance tests help you account for potential sampling errors, but Redman says what is often more worrisome is the non-sampling error: “Non-sampling error involves things where the experimental and/or measurement protocols didn’t happen according to plan, such as people lying on the survey, data getting lost, or mistakes being made in the analysis.” This is where Redman sees more troubling results. “There is so much that can happen from the time you plan the survey or experiment to the time you get the results. I’m more worried about whether the raw data is trustworthy than how many people they talked to,” he says. Clean data and careful analysis are more important than statistical significance.

Always keep in mind the practical application of the finding. And don’t get too hung up on setting a strict confidence level. Redman says there’s a bias in scientific literature that “a result wasn’t publishable unless it hit a p = 0.05 (or less).” But for many decisions—like which marketing approach to use—you can accept a much lower confidence level. In business, says Redman, there are often more important criteria than statistical significance. The important question is, “Does the result stand up in the market, if only for a brief period of time?”

As Redman says, the results only give you so much in-

formation: “I’m all for using statistics, but always wed it

with good judgment.”

Amy Gallo is a contributing editor at Harvard Business Review and the author of the HBR Guide to Dealing with Conflict. Follow her on Twitter @amyegallo.


Reprinted from Harvard Business Review, May-June 2017 (product

#R1703K).

CHAPTER 14

Linear Thinking in a Nonlinear World

by Bart de Langhe, Stefano Puntoni, and Richard Larrick

Test yourself with this word problem: Imagine you’re responsible for your company’s car fleet. You manage two models, an SUV that gets 10 miles to the gallon and a sedan that gets 20. The fleet has equal numbers of each, and all the cars travel 10,000 miles a year. You have enough capital to replace one model with more-fuel-efficient vehicles to lower operational costs and help meet sustainability goals.

Which upgrade is better?

A. Replacing the 10 MPG vehicles with 20 MPG

vehicles


B. Replacing the 20 MPG vehicles with 50 MPG

vehicles

Intuitively, option B seems more impressive—an in-

crease of 30 MPG is a lot larger than a 10 MPG one.

And the percentage increase is greater, too. But B is not

the better deal. In fact, it’s not even close. Let’s compare.

Gallons used per 10,000 miles

      Current             After upgrade       Savings
A.    1,000 (@10 MPG)     500 (@20 MPG)       500
B.    500 (@20 MPG)       200 (@50 MPG)       300
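The figures in the table follow from a single division: gallons used equals miles driven divided by miles per gallon. The short Python sketch below reproduces them; the function and variable names are illustrative, and the third case (20 to 100 MPG) is included to check the comparison the authors make elsewhere in this chapter.

```python
def gallons_used(miles: float, mpg: float) -> float:
    """Fuel burned over a distance: distance divided by fuel efficiency."""
    return miles / mpg

ANNUAL_MILES = 10_000  # per vehicle, from the word problem

# (current MPG, MPG after upgrade) for each scenario
upgrades = {"A": (10, 20), "B": (20, 50), "20 to 100 MPG": (20, 100)}

savings = {
    name: gallons_used(ANNUAL_MILES, before) - gallons_used(ANNUAL_MILES, after)
    for name, (before, after) in upgrades.items()
}

for name, saved in savings.items():
    print(f"{name}: saves {saved:,.0f} gallons per vehicle per year")
```

Because MPG sits in the denominator, each extra mile per gallon saves less fuel than the one before it, which is why the smaller-looking upgrade wins.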

Is this surprising? For many of us, it is. That’s because

in our minds the relationship between MPG and fuel

consumption is simpler than it really is. We tend to think

it’s linear and looks like this:

But that graph is incorrect. Gas consumption is not a

linear function of MPG. When you do the math, the rela-

tionship actually looks like this:


And when you dissect the curve to show each upgrade

scenario, it becomes clear how much more effective it is

to replace the 10 MPG cars.


But choosing the lower-mileage upgrade remains

counterintuitive, even in the face of the visual evidence.

It just doesn’t feel right.

Shockingly, upgrading fuel efficiency from 20 to 100 MPG still wouldn’t save as much gas as upgrading from 10 to 20 MPG.

If you’re still having trouble grasping this, it’s not your fault. Decades of research in cognitive psychology show that the human mind struggles to understand nonlinear relationships. Our brain wants to make simple straight lines. In many situations, that kind of thinking serves us well: If you can store 50 books on a shelf, you can store 100 books if you add another shelf, and 150 books if you add yet another. Similarly, if the price of coffee is $2, you can buy five coffees with $10, 10 coffees with $20, and 15 coffees with $30.

But in business there are many highly nonlinear relationships, and we need to recognize when they’re in play. This is true for generalists and specialists alike, because even experts who are aware of nonlinearity in their fields can fail to take it into account and default instead to relying on their gut. But when people do that, they often end up making poor decisions.

Linear Bias in Practice

We’ve seen consumers and companies fall victim to linear bias in numerous real-world scenarios. A common one concerns an important business objective: profits.

Three main factors affect profits: costs, volume, and price. A change in one often requires action on the others to maintain profits. For example, rising costs must be offset by an increase in either price or volume. And if you cut price, lower costs or higher volumes are needed to prevent profits from dipping.

Unfortunately, managers’ intuitions about the relationships between these profit levers aren’t always good. For years experts have advised companies that changes in price affect profits more than changes in volume or costs. Nevertheless, executives often focus too much on volume and costs instead of getting the price right.

Why? Because the large volume increases they see after reducing prices are very exciting. What people don’t realize is just how large those increases need to be to maintain profits, especially when margins are low.


Imagine you manage a brand of paper towels. They

sell for 50 cents a roll, and the marginal cost of produc-

ing a roll is 15 cents. You recently did two price promo-

tions. Here’s how they compare:

              Normal     Promo A: 20% off     Promo B: 40% off
Price/Roll    50¢        40¢                  30¢
Sales         1,000      1,200 (+20%)         1,800 (+80%)

Intuitively, B looks more impressive—an 80% increase in volume for a 40% decrease in price seems a lot more profitable than a 20% increase in volume for a 20% cut in price. But you may have guessed by now that B is not the most profitable strategy.

In fact, both promotions decrease profits, but B’s negative impact is much bigger than A’s. Here are the profits in each scenario:

               Normal     Promo A: 20% off     Promo B: 40% off
Price/Roll     50¢        40¢                  30¢
Sales          1,000      1,200 (+20%)         1,800 (+80%)
Profit/Roll    35¢        25¢                  15¢
Profit         $350       $300                 $270
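The arithmetic behind the table, and behind the break-even volume for the 40%-off promotion, can be sketched in a few lines of Python. Amounts are kept in whole cents to avoid rounding surprises, and the function names are illustrative.

```python
UNIT_COST = 15  # marginal cost per roll, in cents

# price per roll (cents) and rolls sold, taken from the table
scenarios = {
    "Normal": (50, 1_000),
    "Promo A": (40, 1_200),
    "Promo B": (30, 1_800),
}

def profit_cents(price: int, units: int, unit_cost: int = UNIT_COST) -> int:
    """Total profit in cents: per-roll margin times rolls sold."""
    return (price - unit_cost) * units

def break_even_units(target_profit: int, price: int, unit_cost: int = UNIT_COST) -> int:
    """Smallest whole number of rolls that earns at least target_profit."""
    margin = price - unit_cost
    return -(-target_profit // margin)  # ceiling division on integers

profits = {name: profit_cents(p, u) for name, (p, u) in scenarios.items()}
needed_at_40_off = break_even_units(profits["Normal"], price=30)

for name, cents in profits.items():
    print(f"{name}: ${cents / 100:,.0f}")
print(f"Rolls needed at 30 cents to match normal profit: {needed_at_40_off:,}")
```

Cutting the price from 50 to 30 cents shrinks the per-roll margin from 35 to 15 cents, so volume has to grow by the inverse of that ratio just to stand still.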

Although promotion B nearly doubled sales, profits sank about 23%. To maintain the usual $350 profit during the 40%-off sale, you would have to sell more than 2,300 units, an increase of 133%. The curve looks like this:


The nonlinear phenomenon also extends to intangibles, like consumer attitudes. Take consumers and sustainability. We frequently hear executives complain that while people say they care about the environment, they are not willing to pay extra for ecofriendly products. Quantitative analyses bear this out. A survey by the National Geographic Society and GlobeScan finds that, across 18 countries, concerns about environmental problems have increased markedly over time, but consumer behavior has changed much more slowly. While nearly all consumers surveyed agree that food production and consumption should be more sustainable, few of them alter their habits to support that goal.

What’s going on? It turns out that the relationship between what consumers say they care about and their actions is often highly nonlinear. But managers often believe that classic quantitative tools, like surveys using 1-to-5 scales of importance, will predict behavior in a linear fashion. In reality, research shows little or no behavioral difference between consumers who, on a five-point scale, give their environmental concern the lowest rating, 1, and consumers who rate it a 4. But the difference between 4s and 5s is huge. Behavior maps to attitudes on a curve, not a straight line.

Companies typically fail to account for this pattern—in part because they focus on averages. Averages mask nonlinearity and lead to prediction errors. For example, suppose a firm did a sustainability survey among two of its target segments. All consumers in one segment rate their concern about the environment a 4, while 50% of consumers in the other segment rate it a 3 and 50% rate it a 5. The average level of concern is the same for the two segments, but people in the second segment are overall much more likely to buy green products. That’s because a customer scoring 5 is much more likely to make environmental choices than a customer scoring 4, whereas a customer scoring 4 is not more likely to than a customer scoring 3.

The nonlinear relationship between attitudes and behavior shows up repeatedly in important domains, including consumers’ privacy concerns. A large-scale survey in the Netherlands, for example, revealed little difference in the number of loyalty-program cards carried by consumers who said they were quite concerned versus only weakly concerned about privacy. How is it possible that people said they were worried about privacy but then agreed to sign up for loyalty programs that require the disclosure of sensitive personal information? Again, because only people who say they are extremely concerned about privacy take significant steps to protect it, while most others, regardless of their concern rating, don’t adjust their behavior.

Awareness of nonlinear relationships is also important when choosing performance metrics. For instance, to assess the effectiveness of their inventory management, some firms track days of supply, or the number of days that products are held in inventory, while other firms track the number of times their inventory turns over annually. Most managers don’t know why their firm uses one metric and not the other. But the choice may have unintended consequences—for instance, on employee motivation. Assume a firm was able to reduce days of supply from 12 to six and that with additional research, it could further reduce days of supply to four. This is the same as saying that the inventory turn rate could increase from 30 times a year to 60 times a year and that it could be raised again to 90 times a year. But employees are much more motivated to achieve improvements if the firm tracks turnover instead of days of supply, research by the University of Cologne’s Tobias Stangl and Ulrich Thonemann shows. That’s because they appear to get decreasing returns on their efforts when they improve the days-of-supply metric—but constant returns when they improve the turnover metric.
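The two metrics describe the same improvement through a reciprocal relationship, which the Python sketch below makes explicit. It assumes a 360-day year, since that is what the figures in the example imply (12 days of supply corresponding to 30 turns); the names are illustrative.

```python
DAYS_PER_YEAR = 360  # assumption: implied by 12 days of supply = 30 turns a year

def turns_per_year(days_of_supply: float) -> float:
    """Inventory turns are the reciprocal of days of supply, scaled to a year."""
    return DAYS_PER_YEAR / days_of_supply

days_steps = [12, 6, 4]  # the two successive improvements from the example
turn_steps = [turns_per_year(d) for d in days_steps]

# Size of each improvement, as seen through each metric
days_gains = [days_steps[i] - days_steps[i + 1] for i in range(len(days_steps) - 1)]
turn_gains = [turn_steps[i + 1] - turn_steps[i] for i in range(len(turn_steps) - 1)]

print("Days of supply:", days_steps, "-> gains", days_gains)
print("Turns per year:", turn_steps, "-> gains", turn_gains)
```

Seen as days of supply, the second improvement looks a third the size of the first; seen as turns, the two steps look identical, which matches the motivation effect the authors describe.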

Other areas where companies can choose different

metrics include warehousing (picking time versus pick-

ing rate), production (production time versus produc-

tion rate), and quality control (time between failures ver-

sus failure rate).

Nonlinearity is all around us. Let’s now explore the

forms it takes.

The Four Types of Nonlinear Relationships

The best way to understand nonlinear patterns is to see them. There are four types.


Increasing gradually, then rising more steeply

Say a company has two customer segments that both have annual contribution margins of $100. Segment A has a retention rate of 20% while segment B has one of 60%. Most managers believe that it makes little difference to the bottom line which segment’s retention they increase. If anything, most people find doubling the weaker retention rate more appealing than increasing the stronger one by, say, a third.

But customer lifetime value (CLV) is a nonlinear function of retention rate, as you’ll see when you apply the formula for calculating CLV:

\[
\text{CLV} = \frac{\text{Margin} \times \text{Retention Rate}}{1 + \text{Discount Rate} - \text{Retention Rate}}
\]

When the retention rate rises from 20% to 40%, CLV goes up about $35 (assuming a discount rate of 10% to adjust future profits to their current worth), but when retention rates rise from 60% to 80%, CLV goes up about $147. As retention rates rise, customer lifetime value increases gradually at first and then suddenly shoots up.
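The two CLV gains quoted above can be reproduced directly from the formula. The sketch below assumes the $100 margin and 10% discount rate stated in the text; the function name `clv` is only illustrative.

```python
def clv(margin: float, retention: float, discount: float = 0.10) -> float:
    """Customer lifetime value: margin * retention / (1 + discount - retention)."""
    return margin * retention / (1 + discount - retention)


gain_low = clv(100, 0.40) - clv(100, 0.20)   # segment A: 20% -> 40%
gain_high = clv(100, 0.80) - clv(100, 0.60)  # segment B: 60% -> 80%

print(f"20% -> 40%: CLV rises by about ${gain_low:.0f}")
print(f"60% -> 80%: CLV rises by about ${gain_high:.0f}")
```

The same 20-point improvement is worth roughly four times as much when it starts from the higher retention rate.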

Most companies focus on identifying customers who are most likely to defect and then target them with marketing programs. However, it’s usually more profitable to focus on customers who are more likely to stay. Linear thinking leads managers to underestimate the benefits of small increases to high retention rates.


Decreasing gradually, then dropping quickly

The classic example of this can be seen in mortgages. Property owners are often surprised by how slowly they chip away at their debt during the early years of their loan terms. But in a mortgage with a fixed interest rate and fixed term, less of each payment goes toward the principal at the beginning. The principal doesn’t decrease linearly. On a 30-year $165,000 loan at a 4.5% interest rate, the balance decreases by only about $15,000 over the first five years. By year 25 the balance will have dropped below $45,000. So the owner will pay off less than 10% of the principal in the first 16% of the loan’s term but more than a quarter of it in the last 16%.
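The mortgage figures follow from the standard fixed-rate amortization formula. The sketch below assumes monthly payments and monthly compounding, which the text does not state explicitly; `monthly_payment` and `balance_after` are hypothetical helper names.

```python
def monthly_payment(principal: float, annual_rate: float, years: int) -> float:
    """Level monthly payment on a fixed-rate loan (assumes monthly compounding)."""
    i = annual_rate / 12
    n = years * 12
    return principal * i / (1 - (1 + i) ** -n)


def balance_after(principal: float, annual_rate: float, years: int, months_paid: int) -> float:
    """Outstanding balance after a given number of monthly payments."""
    i = annual_rate / 12
    payment = monthly_payment(principal, annual_rate, years)
    growth = (1 + i) ** months_paid
    return principal * growth - payment * (growth - 1) / i


loan = (165_000, 0.045, 30)
paid_first_5 = loan[0] - balance_after(*loan, months_paid=60)
left_at_25 = balance_after(*loan, months_paid=300)

print(f"principal repaid in years 1-5:  ${paid_first_5:,.0f}")
print(f"balance remaining at year 25:   ${left_at_25:,.0f}")
```

Under these assumptions, the first five years retire a little under 9% of the principal, while the last five retire about 27%.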

Because they’re misled by their linear thinking in this context, mortgage payers are often surprised when they sell a property after a few years (and pay brokerage costs) and have only small net gains to show for it.

Climbing quickly, then tapering off

Selling more of a product allows companies to achieve economies of scale and boost per unit profit, a metric often used to gauge a firm’s efficiency. Executives use this formula to calculate per unit profit:

\[
\text{Per Unit Profit} = \frac{(\text{Volume} \times \text{Unit Price}) - \text{Fixed Costs} - (\text{Volume} \times \text{Unit Variable Costs})}{\text{Volume}}
\]

Say a firm sells 100,000 widgets each year at $2 a widget, and producing those widgets costs $100,000 ($50,000 in fixed costs and 50 cents in unit variable costs). The per unit profit is $1. The firm can increase per unit profit by producing and selling more widgets, because it will spread fixed costs over more units. If it doubles the number of widgets sold to 200,000, profit per unit will rise to $1.25 (assuming that unit variable costs remain the same). That attractive increase might tempt you into thinking per unit profit will skyrocket if you increase sales from 100,000 to 800,000 units. Not so.

If the firm doubles widget sales from 400,000 to 800,000 (which is much harder to do than going from 100,000 to 200,000), the per unit profit increases only by about 6 cents.
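A short calculation makes the tapering visible. This sketch plugs the widget example’s numbers into the per unit profit formula; the function name and default arguments are illustrative only.

```python
def per_unit_profit(volume: int, unit_price: float = 2.0,
                    fixed_costs: float = 50_000, unit_variable_cost: float = 0.50) -> float:
    """Per unit profit for the widget example (defaults taken from the text)."""
    return (volume * unit_price - fixed_costs - volume * unit_variable_cost) / volume


# Each step doubles volume; watch the gain from doubling shrink.
for volume in (100_000, 200_000, 400_000, 800_000):
    print(f"{volume:>7,} units: ${per_unit_profit(volume):.4f} per unit")

gain_last_doubling = per_unit_profit(800_000) - per_unit_profit(400_000)
```

Per unit profit can never exceed the $1.50 gap between price and unit variable cost, so each doubling of volume closes only half of the remaining distance to that ceiling.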

Managers focus a great deal on the benefits of economies of scale and growth. However, linear thinking may lead them to overestimate volume as a driver of profit and thus underestimate other more impactful drivers, like price.


Falling sharply, then gradually

Firms often base evaluations of investments on the payback period, the amount of time required to recover the costs. Obviously, shorter paybacks are more favorable. Say you have two projects slated for funding. Project A has a payback period of two years, and project B has one of four years. Both teams believe they can cut their payback period in half. Many managers may find B more attractive because they’ll save two years, double the time they’ll save with A.

Company leadership, however, may ultimately care more about return on investment than time to break-even. A one-year payback has an annual rate of return (ARR) of 100%. A two-year payback yields one of 50%, a 50-point difference. A four-year payback yields one of 25%, a 25-point difference. So as the payback period increases, ARR drops steeply at first and then more slowly. If your focus is achieving a higher ARR, halving the payback period of project A is a better choice.

Managers comparing portfolios of similar-sized projects may also be surprised to learn that the return on investment is higher on one containing a project with a one-year payback and another with a four-year payback than on a portfolio containing two projects expected to pay back in two years. They should be careful not to underestimate the effect that decreases in relatively short payback periods will have on ARR.
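The payback comparison can be sketched in code. The relationship ARR = 1 / payback period is the simplification implied by the figures above (one year gives 100%, two years 50%, four years 25%), not a general accounting definition; the equal weighting of the two projects in each portfolio is also an assumption.

```python
def arr(payback_years: float) -> float:
    """Annual rate of return implied by a payback period (simplified: 1 / years)."""
    return 1 / payback_years


# Halving each project's payback period.
gain_a = arr(1) - arr(2)  # project A: two years -> one year
gain_b = arr(2) - arr(4)  # project B: four years -> two years

# Two equally sized portfolios with the same average payback period of 2 to 2.5 years.
mixed = (arr(1) + arr(4)) / 2    # one-year and four-year projects
uniform = (arr(2) + arr(2)) / 2  # two two-year projects

print(f"gain from halving A: {gain_a:.0%}, from halving B: {gain_b:.0%}")
print(f"portfolio ARR, mixed: {mixed:.1%}, uniform: {uniform:.1%}")
```

Halving project A’s payback adds twice as many points of ARR as halving project B’s, even though B saves twice as many years.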

How to Limit the Pitfalls of Linear Bias

As long as people are employed as managers, biases that are hardwired into the human brain will affect the quality of business decisions. Nevertheless, it is possible to minimize the pitfalls of linear thinking.

Step 1: Increase awareness of linear bias

MBA programs should explicitly warn future managers about this phenomenon and teach them ways to deal with it. Companies can also undertake initiatives to educate employees by, for instance, presenting them with puzzles that involve nonlinear relationships. In our experience, people find such exercises engaging and eye-opening.

Broader educational efforts are already under way in several fields. One is Ocean Tipping Points, an initiative that aims to make people more sensitive to nonlinear relationships in marine ecosystems. Scientists and managers often assume that the relationship between a stressor (such as fishing) and an ecological response (a decline in fish population) is linear. However, a small change in a stressor sometimes does disproportionately large damage: A fish stock can collapse following a small increase in fishing. The project’s goal is to identify relevant tipping points in ocean ecology to help improve the management of natural resources.

Step 2: Focus on outcomes, not indicators

One of senior management’s most important tasks is to set the organization’s direction and incentives. But frequently, desired outcomes are far removed from everyday business decisions, so firms identify relevant intermediate metrics and create incentives to maximize them. To lift sales, for instance, many companies try to improve their websites’ positioning in organic search results.

The problem is, these intermediate metrics can become the end rather than the means, a phenomenon academics call “medium maximization.” That bodes trouble if a metric and the outcome don’t have a linear relationship, as is the case with organic search position and sales. When a search rank drops, sales decrease quickly at first and then more gradually: The impact on sales is much greater when a site drops from the first to the second position in search results than when it drops from the 20th to the 25th position.

Other times, a single indicator can be used to predict multiple outcomes, and that may confuse people and lead them astray. Take annual rates of return, which a manager who wants to maximize the future value of an investment may consider. If you map the relationship between investment products’ ARR and their total accumulated returns, you’ll see that as ARR rises, total returns increase gradually and then suddenly shoot up.

Another manager may wish to minimize the time it takes to achieve a particular investment goal. The relationship here is the reverse: As ARR rises, the time it takes to reach a goal drops steeply at first and then declines gradually.

Because ARR is related to multiple outcomes in different nonlinear ways, people often under- or overestimate its effect. A manager who wants to maximize overall returns may care a great deal about a change in the rate from 0.30% to 0.70% but be insensitive to a change from 6.4% to 6.6%. In fact, increasing a low return rate has a much smaller effect on accumulated future returns than increasing a high rate does. In contrast, a manager focused on minimizing the time it takes to reach an investment goal may decide to take on additional risk to increase returns from 6.3% to 6.7% but be insensitive to a change from 0.40% to 0.60%. In this case the effect of increasing a high interest rate on time to completing a savings goal is much smaller than the effect of increasing a low interest rate.
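Both effects can be illustrated with ordinary compound-interest arithmetic. The sketch below is built on assumptions the text does not specify: annual compounding, a 30-year horizon for accumulated returns, and doubling the investment as the savings goal.

```python
import math


def future_value(rate: float, years: int = 30) -> float:
    """Value of 1 unit after `years` of annual compounding (horizon is an assumption)."""
    return (1 + rate) ** years


def years_to_double(rate: float) -> float:
    """Years needed to double an investment (doubling is an assumed goal)."""
    return math.log(2) / math.log(1 + rate)


# Outcome 1: accumulated returns. The small change at a high rate matters more.
gain_low = future_value(0.007) - future_value(0.003)   # 0.30% -> 0.70%
gain_high = future_value(0.066) - future_value(0.064)  # 6.4% -> 6.6%

# Outcome 2: time to goal. The small change at a low rate matters more.
saved_low = years_to_double(0.004) - years_to_double(0.006)   # 0.40% -> 0.60%
saved_high = years_to_double(0.063) - years_to_double(0.067)  # 6.3% -> 6.7%

print(f"extra value per unit invested: low rates {gain_low:.3f}, high rates {gain_high:.3f}")
print(f"years saved reaching the goal: low rates {saved_low:.1f}, high rates {saved_high:.1f}")
```

The same indicator therefore points in opposite directions depending on which outcome the manager cares about.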

Step 3: Discover the type of nonlinearity you’re dealing with

As Thomas Jones and W. Earl Sasser Jr. pointed out in HBR back in 1995 (see “Why Satisfied Customers Defect”), the relationship between customer satisfaction ratings and customer retention is often nonlinear, but in ways that vary according to the industry. In highly competitive industries, such as automobiles, retention rises gradually and then climbs up steeply as satisfaction ratings increase. For noncompetitive industries retention shoots up quickly and then levels off.

In both situations linear thinking will lead to errors. If the industry is competitive, managers will overestimate the benefit of increasing the satisfaction of completely dissatisfied customers. If the industry is not competitive, managers will overestimate the benefit of increasing the satisfaction of already satisfied customers.

The point is that managers should avoid making generalizations about nonlinear relationships across contexts and work to understand the cause and effect in their specific situation.

Field experiments are an increasingly popular way to do this. When designing them, managers should be sure to account for nonlinearity. For instance, many people try to measure the impact of price on sales by offering a product at a low price (condition A in the chart on the next page) and a high price (condition B) and then measuring differences in sales. But testing two prices won’t reveal nonlinear relationships. You need to use at least three price levels (low, medium, and high, with the medium price as condition C) to get a sense of them.
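Why three price levels are needed can be shown with a simple check: two points always fit a straight line, so only a third point can reveal a departure from it. The prices and sales figures below are invented purely for illustration, and `gap_from_line` is a hypothetical helper.

```python
def gap_from_line(low, mid, high):
    """Observed sales at the middle price minus the straight-line prediction
    drawn between the low-price and high-price conditions.

    Each argument is a (price, units_sold) pair.
    """
    (p_low, s_low), (p_mid, s_mid), (p_high, s_high) = low, mid, high
    predicted = s_low + (s_high - s_low) * (p_mid - p_low) / (p_high - p_low)
    return s_mid - predicted


# Invented test results: condition A (low), condition C (medium), condition B (high).
condition_a = (10, 1000)
condition_c = (15, 900)
condition_b = (20, 400)

gap = gap_from_line(condition_a, condition_c, condition_b)
print(f"sales at the medium price differ from the linear prediction by {gap:+.0f} units")
```

In this made-up case a linear reading of conditions A and B would predict 700 units at the medium price, so the 200-unit gap signals that most of the drop in sales happens between the medium and high prices.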

Step 4: Map nonlinearity whenever you can

In addition to providing the right training, companies can build support systems that warn managers when they might be making bad decisions because of the inclination to think linearly.

Ideally, algorithms and artificial intelligence could identify situations in which that bias is likely to strike and then offer information to counteract it. Of course, while advances in AI make this possible in formal settings, it can’t account for decisions that take place offline and in conversations. And building such systems could eat up a lot of time and money.

A low-tech but highly effective technique for fighting linear bias is data visualization. As you’ve noticed in this article, whenever we wanted you to understand some linear bias, we showed you the nonlinear relationships. They’re much easier to grasp when plotted out in a chart than when described in a list of statistics. A visual representation also helps you see threshold points where outcomes change dramatically and gives you a good sense of the degree of nonlinearity in play.

Putting charts of nonlinear relationships in dashboards and even mapping them out in “what if” scenarios will make managers more familiar with nonlinearity and thus more likely to check for it before making decisions.

Visualization is also a good tool for companies interested in helping customers make good decisions. For example, to make drivers aware of how little time they save by accelerating when they’re already traveling at high speed, you could add a visual cue for time savings to car dashboards. One way to do this is with what Eyal Pe’er and Eyal Gamliel call a “paceometer,” which shows how many minutes it takes to drive 10 miles. It will surprise most drivers that going from 40 to 65 will save you about six minutes per 10 miles, but going from 65 to 90 saves only about two and a half minutes, even though you’re increasing your speed 25 miles per hour in both instances.
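The paceometer reading is a one-line conversion from speed to time. This sketch assumes constant speed over the full 10 miles; the function name is illustrative.

```python
def minutes_per_10_miles(mph: float) -> float:
    """Paceometer reading: minutes needed to cover 10 miles at a constant speed."""
    return 10 / mph * 60


saved_40_to_65 = minutes_per_10_miles(40) - minutes_per_10_miles(65)
saved_65_to_90 = minutes_per_10_miles(65) - minutes_per_10_miles(90)

for mph in (40, 65, 90):
    print(f"{mph} mph: {minutes_per_10_miles(mph):.1f} minutes per 10 miles")
print(f"time saved: {saved_40_to_65:.1f} min (40 to 65) vs. {saved_65_to_90:.1f} min (65 to 90)")
```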


The Implications for Marketers

A cornerstone of modern marketing is the idea that by focusing more on consumer benefits than on product attributes, you can sell more. Apple, for instance, realized that people would perceive an MP3 player that provided “1,000 songs in your pocket” to be more attractive than one with an “internal storage capacity of 5GB.”

Our framework, however, highlights the fact that in many situations companies actually profit from promoting attributes rather than benefits. They’re taking advantage of consumers’ tendency to assume that the relationship between attributes and benefits is linear. And that is not always the case.

We can list any number of instances where showing customers the actual benefits would reveal where they may be overspending and probably change their buying behavior: printer pages per minute, points in loyalty programs, and sun protection factor, to name just a few. Bandwidth upgrades are another good example. Our research shows that internet connections are priced linearly: Consumers pay the same for increases in speed from a low base and from a high base. But the relationship between download speed and download time is nonlinear. As download speed increases, download time drops rapidly at first and then gradually. Upgrading from five to 25 megabits per second will lead to time savings of 21 minutes per gigabyte, while the increase from 25 to 100 Mbps buys only four minutes. When consumers see the actual gains from raising their speed to 100 Mbps, they may prefer a cheaper, slower internet connection.
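The bandwidth figures can be reproduced by converting speed into download time. The sketch assumes a decimal gigabyte (1 GB = 8,000 megabits) and a connection that sustains its rated speed; both are simplifications.

```python
MEGABITS_PER_GIGABYTE = 8_000  # assumption: decimal gigabyte, 8 bits per byte


def minutes_per_gigabyte(mbps: float) -> float:
    """Minutes needed to download one gigabyte at a sustained speed in Mbps."""
    return MEGABITS_PER_GIGABYTE / mbps / 60


saved_5_to_25 = minutes_per_gigabyte(5) - minutes_per_gigabyte(25)
saved_25_to_100 = minutes_per_gigabyte(25) - minutes_per_gigabyte(100)

print(f"5 -> 25 Mbps saves about {saved_5_to_25:.0f} minutes per gigabyte")
print(f"25 -> 100 Mbps saves about {saved_25_to_100:.0f} minutes per gigabyte")
```

The second upgrade adds almost four times as many megabits per second as the first but returns less than a fifth of the time savings.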


Of course, willfully exploiting consumers’ flawed perceptions of attribute-benefit relationships is a questionable marketing strategy. It’s widely regarded as unethical for companies to take advantage of customers’ ignorance.

In recent years a number of professions, including ecologists, physiologists, and physicians, have begun to routinely factor nonlinear relationships into their decision making. But nonlinearity is just as prevalent in the business world as anywhere else. It’s time that management professionals joined these other disciplines in developing greater awareness of the pitfalls of linear thinking in a nonlinear world. This will increase their ability to choose wisely and to help the people around them make good decisions too.

Bart de Langhe is an associate professor of marketing

at Esade Business School, Ramon Llull University, and

an assistant professor of marketing at the Leeds School

of Business, University of Colorado–Boulder. Stefano

Puntoni is a professor of marketing at the Rotterdam

School of Management, Erasmus University. Richard

Larrick is the Hanes Corporation Foundation Professor

at Duke University’s Fuqua School of Business.


CHAPTER 15

Pitfalls of Data-Driven Decisions

by Megan MacGarvie and Kristina McElheran

Even with impressively large data sets, the best analytics tools, and careful statistical methods, managers can still be vulnerable to a range of pitfalls when using data to back up their toughest choices, especially when information overload leads us to take shortcuts in reasoning. In some instances, data and analytics actually make matters worse.

Psychologists, behavioral economists, and other scholars of human behavior have identified several common decision-making traps. Many of these traps stem from the fact that people don’t carefully process every piece of information in every decision. Instead, we often rely on heuristics: simplified procedures that allow us to make decisions in the face of uncertainty or when extensive analysis is too costly or time-consuming. These heuristics lead us to believe we are making sound decisions when we are actually making systematic mistakes. What’s more, human brains are wired for certain biases that creep in and distort our thinking, typically without our awareness.

There are three main cognitive traps that regularly skew decision making, even when informed by the best data. Here is each of these three pitfalls in detail, along with a number of suggestions for how to escape them.

The Confirmation Trap

When we pay more attention to findings that align with our prior beliefs, and ignore other facts and patterns in the data, we fall into the confirmation trap. With a huge data set and numerous correlations between variables, analyzing all possible correlations is often both costly and counterproductive. Even with smaller data sets, it can be easy to inadvertently focus on correlations that confirm our expectations of how the world should work, and dismiss counterintuitive or inconclusive patterns in the data when they don’t align.

Consider the following example: In the late 1960s and early 1970s, researchers conducted one of the most well-designed studies on how different types of fats affect heart health and mortality. But the results of this study, known as the Minnesota Coronary Experiment, were not published at the time. A recent New York Times article suggests that these results stayed unpublished for so long because they contradicted the beliefs of both the researchers and the medical establishment.1 In fact, it wasn’t until 2016 that the medical journal BMJ published a piece referencing this data, when growing skepticism about the relationship between saturated fat consumption and heart disease led researchers to analyze data from the original experiment, more than 40 years later.2 These and similar findings cast doubt on decades of unchallenged medical advice to avoid saturated fats. While it’s unclear whether one experiment would have changed standard dietary and health recommendations, this example demonstrates that even with the best possible data, when we look at numbers we may ignore important facts when they contradict the dominant paradigm or don’t confirm our beliefs, with potentially troublesome results.

This is a sobering prospect for decision makers in companies. And confirmation bias becomes that much harder to avoid when individuals face pressure from bosses and peers. Organizations frequently reward employees who can provide empirical support for existing managerial preferences. Those who decide what parts of the data to examine and present to senior managers may feel compelled to choose only the evidence that reinforces what their supervisors want to see or that confirms a prevalent attitude within the firm.

To get a fair assessment of what the data has to say, don’t avoid information that counters your (or your boss’s) beliefs. Instead, embrace it by doing the following:


• Specify in advance the data and analytical approaches on which you’ll base your decision, to reduce the temptation to cherry-pick findings that agree with your prejudices.

• Actively look for findings that disprove your beliefs. Ask yourself, “If my expectations are wrong, what pattern would I likely see in the data?” Enlist a skeptic to help you. Seek people who like to play devil’s advocate or assign contrary positions for active debate.

• Don’t automatically dismiss findings that fall below your threshold for statistical or practical significance. Both noisy relationships (those with large standard errors) and small, precisely measured relationships can point to flaws in your beliefs and presumptions. Ask yourself, what would it take for this to appear important? Make sure your key takeaway is not sensitive to reasonable changes in your model or sample size.

• Assign multiple independent teams to analyze the data separately. Do they come to similar conclusions? If not, isolate and study the points of divergence to determine whether the differences are due to error, inconsistent methods, or bias.

• Treat your findings like predictions, and test them. If you uncover a correlation from which you think your organization can profit, use an experiment to validate that correlation.


The Overconfidence Trap

In their book Judgment in Managerial Decision Making, behavioral researchers Max Bazerman and Don Moore refer to overconfidence as “The Mother of All Biases.” Time and time again, psychologists have found that decision makers are too sure of themselves. We tend to assume that the accuracy of our judgments or the probability of success in our endeavors is more favorable than the data would suggest. When there are risks, we alter our read of the odds to assume we’ll come out on the winning side. Senior decision makers who have been promoted based on past successes are especially susceptible to this bias, since they have received positive signals about their decision-making abilities throughout their careers.

Overconfidence also reinforces many other pitfalls of data interpretation, be they psychological or procedural. It prevents us from questioning our methods, motivation, and the way we communicate our findings to others. It makes it easy to underinvest in data and analysis; when we feel too confident in our understanding, we don’t spend enough time or money acquiring more information or running further analyses. To make matters worse, more information can increase overconfidence without improving accuracy. More data in and of itself is not a guaranteed solution.

Going from data to insight requires quality inputs, skill, and sound processes. Because it can be so difficult to recognize our own biases, good processes are essential for avoiding overconfidence when analyzing data. Here are a few procedural tips to avoid the overconfidence trap:

• Describe your perfect experiment: the type of information you would use to answer your question if you had limitless resources for data collection and the ability to measure any variable. Compare this ideal with your actual data to understand where it might fall short. Identify places where you might be able to close the gap with more data collection or analytical techniques.

• Make it a formal part of your process to be your own devil’s advocate. In Thinking, Fast and Slow, Nobel laureate Daniel Kahneman suggests asking yourself why your analysis might be wrong, and recommends that you do this for every analysis you perform. Taking this contrarian view can help you see the flaws in your own arguments and reduce mistakes across the board.

• Before making a decision or launching a project, perform a “pre-mortem,” an approach suggested by psychologist Gary Klein. Ask others with knowledge about the project to imagine its failure a year into the future and to write stories of that failure. In doing so, you’ll benefit from the wisdom of multiple perspectives, while also providing an opportunity to bring to the surface potential flaws in the analysis that you may otherwise overlook.

• Keep track of your predictions and systematically compare them with what actually happens. Which of your predictions turned out to be true and which ones fell short? Persistent biases can creep back into our decision making; revisit these reports on a regular basis so you can prevent mistakes in the future.
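The last tip, tracking predictions against outcomes, can be made concrete with a simple scoring rule. The sketch below uses the Brier score, which is one common way to grade probability forecasts; it is offered as an illustration rather than the method the authors have in mind, and the forecast log is entirely hypothetical.

```python
def brier_score(forecasts):
    """Mean squared gap between the stated probability and the outcome
    (1 = it happened, 0 = it did not). Lower is better; 0 is perfect."""
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)


# Hypothetical log: (probability assigned to success, what actually happened)
log = [(0.9, 1), (0.8, 0), (0.6, 1), (0.9, 0)]

score = brier_score(log)
avg_confidence = sum(p for p, _ in log) / len(log)
hit_rate = sum(outcome for _, outcome in log) / len(log)

print(f"Brier score: {score:.3f}")
print(f"average stated confidence: {avg_confidence:.0%}, actual success rate: {hit_rate:.0%}")
```

In this invented log the forecaster expected success about 80% of the time but was right only half the time, which is the kind of gap a regular review is meant to surface.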

The Overfitting Trap

When your model yields surprising or counterintuitive predictions, you may have made an exciting new discovery, or it may be the result of overfitting. In The Signal and the Noise, Nate Silver famously dubbed this “the most important scientific problem you’ve never heard of.” This trap occurs when a statistical model describes random noise, rather than the underlying relationship we need to capture. Overfit models generally do a suspiciously good job of explaining many nuances of what happened in the past, but they have great difficulty predicting the future.

For instance, when Google’s Flu Trends application was introduced in 2008, it was heralded as an innovative way to predict flu outbreaks by tracking search terms associated with early flu symptoms. But early versions of the algorithm looked for correlations between flu outbreaks and millions of search terms. With such a large number of terms, some correlations appeared significant when they were really due to chance (searches for “high school basketball,” for example, were highly correlated with the flu). The application was ultimately scrapped only a few years later due to failures of prediction.
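A small simulation shows how chance alone produces impressive-looking correlations when enough candidates are screened. Everything here is synthetic: the series are pure random noise, and the sizes (20 weeks, 2,000 candidate terms) are hypothetical and far smaller than the millions of terms mentioned above. It is not a reconstruction of how Flu Trends worked.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


rng = random.Random(42)
WEEKS, CANDIDATES = 20, 2_000  # hypothetical sizes

# A stand-in for weekly flu activity: pure noise.
target = [rng.gauss(0, 1) for _ in range(WEEKS)]

# Screen many unrelated noise series and keep the strongest correlation found.
best = max(
    abs(pearson(target, [rng.gauss(0, 1) for _ in range(WEEKS)]))
    for _ in range(CANDIDATES)
)
print(f"strongest correlation among {CANDIDATES} pure-noise series: {best:.2f}")
```

None of the candidate series has any real relationship to the target, yet the best of them will usually look like a strong predictor.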


In order to overcome this bias, you need to discern between the data that matters and the noise around it. Here’s how you can guard against the overfitting trap:

• Randomly divide the data into two sets: a “training set,” with which you’ll estimate the model, and a “validation set,” with which you’ll test the accuracy of the model’s predictions. An overfit model might be great at making predictions within the training set, but raise warning flags by performing poorly in the validation set.

• Much like you would for the confirmation trap, specify the relationships you want to test and how you plan to test them before analyzing the data, to avoid cherry-picking.

• Keep your analysis simple. Look for relationships that measure important effects related to clear and logical hypotheses before digging into nuances. Be on guard against spurious correlations (the ones that occur only by chance) that you can rule out based on experience or common sense (see the sidebar, “Beware Spurious Correlations,” in chapter 10). Remember that data can never truly “speak for itself” and must rely on human interpreters to make sense of it.

• Construct alternative narratives. Is there another story you could tell with the same data? If so, you cannot be confident that the relationship you have uncovered is the right, or only, one.

• Beware of the all-too-human tendency to see patterns in random data. For example, consider a baseball player with a .325 batting average who has no hits in a championship series game. His coach may see a cold streak and want to replace him, but he’s only looking at a handful of games. Statistically, it would be better to keep him in the lineup than substitute the .200 hitter who had four hits in the previous game.
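The training and validation split described in the first tip above can be sketched as follows. The data is synthetic (a straight line plus noise), and the “memorizer” is a deliberately overfit model invented for this illustration: it simply repeats the outcome of the closest training example.

```python
import random


def fit_line(points):
    """Ordinary least-squares straight line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(points):
    """Deliberately overfit model: return the y of the nearest training x."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def mse(model, points):
    """Mean squared prediction error of a model on a set of points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


# Synthetic data: a true linear relationship plus random noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
data = [(x, 2.0 * x + rng.gauss(0, 1.0)) for x in xs]

# Randomly divide the data into a training set and a validation set.
rng.shuffle(data)
train, validation = data[:150], data[150:]

line = fit_line(train)
memorizer = fit_memorizer(train)
for name, model in [("straight line", line), ("memorizer", memorizer)]:
    print(f"{name}: training MSE {mse(model, train):.2f}, "
          f"validation MSE {mse(model, validation):.2f}")
```

The memorizer reproduces the training set perfectly, noise included, so its training error is zero; its error appears only on the validation set, where it will typically be worse than the simple line’s. That gap between training and validation performance is the warning flag.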

From Bias to Better Decisions

Data analytics can be an effective tool to promote consistent decisions and shared understanding. It can highlight blind spots in our individual or collective awareness and can offer evidence of risks and benefits for particular paths of action. But it can also make us complacent.

Managers need to be aware of these common decision-making pitfalls and employ sound processes and cognitive strategies to prevent them. It can be difficult to recognize the flaws in your own reasoning, but proactively tackling these biases with the right mindset and procedures can lead to better analysis of data and better decisions overall.

Megan MacGarvie is an associate professor in the markets, public policy, and law group at Boston University’s Questrom School of Business, where she teaches data-driven decision making and business analytics. She is also a research associate of the National Bureau of Economic Research. Kristina McElheran is an assistant professor of strategic management at the University of Toronto and a digital fellow at the MIT Initiative on the Digital Economy. Her ongoing work on data-driven decision making with Erik Brynjolfsson has been featured on HBR online and in the American Economic Review.

NOTES

1. A. E. Carroll, “A Study on Fats That Doesn’t Fit the Story Line,” New York Times, April 15, 2016.

2. C. E. Ramsden et al., “Re-evaluation of the Traditional Diet-Heart Hypothesis: Analysis of Recovered Data from Minnesota Coronary Experiment (1968-73),” BMJ (April 2016), 353:i1246, doi: 10.1136.


CHAPTER 16

Don’t Let Your Analytics Cheat the Truth

by Michael Schrage

Everyone’s heard the truism that there are lies, damned lies, and statistics. But sitting through a welter of analytics-driven, top-management presentations provokes me into proposing a cynical revision: There are liars, damned liars, and statisticians.

The rise of analytics-informed insight and decision making is welcome. The disingenuous and deceptive manner in which many of these statistics are presented is not. I’m simultaneously stunned and disappointed by how egregiously manipulative these analytics have become at the very highest levels of enterprise oversight. The only thing more surprising—and more disappointing—is how unwilling or unable so many senior executives are to ask simple questions about the analytics they see.

Adapted from “Do Your Analytics Cheat the Truth?” on hbr.org, October 10, 2011.

At one financial services firm, for example, call center analytics showed spike after spike of negative customer satisfaction numbers. Hold times and problem resolution times had noticeably increased. The presenting executive clearly sought greater funding and training for her group. The implied threat was that the firm’s reputation for swift and responsive service was at risk.

Three simple but pointed questions later, her analytic gamesmanship became clear. What had been presented as a disturbing customer service trend was in large part due to a policy change affecting about 20% of the firm’s newly retired customers. Between their age, possible tax implications, and an approval process requiring coordination with another department, these calls frequently stretched beyond 35 to 45 minutes.

What made the situation worse (and what might ex-

plain why the presenter chose not to break out the data)

was a management decision not to route those calls to a

specially trained team but instead to allow any customer

representative to process the query. The additional de-

lays undermined the entire function’s performance.

Every single one of the presenter’s numbers was technically accurate. But they were aggregated in a manner that made it look as if the function was underresourced. The analytics deliberately concealed the outlier statistically responsible for making the numbers dramatically worse.

More damning was a simple queuing theory simula-

tion demonstrating that if the call center had made even

marginal changes in how it chose to manage that excep-

tional 20%, the aggregate call center performance num-

bers would have been virtually unaffected. Poor manage-

ment, not systems underinvestment, was the real root

cause problem.
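To see how aggregation alone can produce this kind of picture, consider a minimal sketch. Every number in it is hypothetical: the handle times, the mix of calls, and the variable names are assumptions chosen for illustration, not figures from the firm described above, and it is a plain average comparison rather than a queuing simulation.

```python
from statistics import mean

# Hypothetical handle times in minutes; none of these figures come from the article.
routine_calls = [6, 7, 8, 8, 9, 10, 7, 8]   # ordinary service calls
policy_calls = [38, 42]                      # calls affected by the policy change

all_calls = routine_calls + policy_calls

aggregate_avg = mean(all_calls)      # what an aggregated slide would report
routine_avg = mean(routine_calls)    # the same function with the segment broken out
policy_avg = mean(policy_calls)

print(f"Aggregate average:   {aggregate_avg:.1f} min")
print(f"Routine calls only:  {routine_avg:.1f} min")
print(f"Policy-change calls: {policy_avg:.1f} min")
```

Each of the three averages is accurate. Only the broken-out view shows that a small group of unusual calls, not the function as a whole, is driving the headline number.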

Increasingly, I observe statistical sophisticates indulging in analytic advocacy—that is, the numbers are deployed to influence and win arguments rather than identify underlying dynamics and generate insight. This is particularly disturbing because while the analytics—in the strictest technical sense—accurately portray a situation, they do so in a way that discourages useful inquiry.

I always insist that analytics presentations and presenters explicitly identify the outliers, how they were defined and dealt with, and—most importantly—what the analytics would look like if they didn’t exist. It’s astonishing what you find when you make the outliers as important as the aggregates and averages in understanding the analytics. (To guide your discussion, consider the questions in the sidebar “Investigating Outliers.”)

My favorite example of this comes, naturally enough,

from Harvard. Few people realize that, in fact, the aver-

age net worth of Harvard dropouts vastly exceeds the av-

erage net worth of Harvard graduates.

INVESTIGATING OUTLIERS

by Janice H. Hammond

When you notice an outlier in data, you must investigate why the anomaly exists. Consider asking some of the following questions:

• Is it just an unusual, but valid, value?

• Could it be a data entry error?

• Was it collected in a different way than the rest of the data? At a different time?

After making an effort to understand where an outlier comes from, you should have a deeper understanding of the situation the data represent. Then think about how to handle the outlier in your analysis. Typically, you can do one of three things: leave it alone, or—very rarely—remove it or change it to a corrected value.

Excluding or changing data is not something we do often—and it should be done only after examining the underlying situation in great detail. We should never do it to help the data “fit” a conclusion we want to draw. Changes to a data set should be made on a case-by-case basis only after careful investigation of the situation.

Adapted from “Quantitative Methods Online Course,” Harvard Business Publishing, October 24, 2004, revised January 24, 2017 (product #504702).

Janice H. Hammond is the Jesse Philips Professor of Manufacturing at Harvard Business School. She serves as program chair for the HBS Executive Education International Women’s Foundation and Women’s Leadership Programs, and created the online Business Analytics course for HBX CORe.

The reason for that is simple. There are many, many more Harvard graduates than there are Harvard dropouts. But the ranks of Harvard dropouts include Bill Gates, Mark Zuckerberg, and Polaroid’s Edwin Land, whose combined, inflation-adjusted net worth probably tops $100 billion. That megarich numerator divided by the smaller “dropout” denominator creates the statistically accurate illusion that the average Harvard dropout is much, much wealthier than the Harvard student who actually got their degree.

This is, of course, ridiculous. Unfortunately, it is no more ridiculous than what one finds, on average, in a statistically significant number of analytics-driven boardroom presentations. The misdirection—and mismanagement—associated with outliers is the most disturbingly common pathology I experience, even in stats-savvy organizations.

Always ask for the outliers. Always make the analysts display what their data looks like with the outliers removed. There are other equally important ways to wring greater utility from aggregated analytics, but start from the outliers in. Because analytics that mishandle outliers are “outliars.”
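A minimal sketch of that request is to report each summary twice, once with the outliers and once without. The net-worth figures below are invented for illustration, the helper name `split_outliers` is hypothetical, and the cutoff used to flag outliers (1.5 interquartile ranges beyond the quartiles) is one common convention rather than a rule taken from this chapter.

```python
from statistics import mean, median, quantiles

def split_outliers(values, k=1.5):
    """Separate values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    typical = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return typical, outliers

# Hypothetical net worths, in $ millions, for a small group of dropouts.
dropouts = [0.4, 0.8, 1.1, 1.5, 2.0, 2.6, 3.2, 90_000]

typical, outliers = split_outliers(dropouts)
print("Outliers:", outliers)
print(f"Mean with outliers:    {mean(dropouts):,.1f}")
print(f"Mean without outliers: {mean(typical):,.1f}")
print(f"Median (all values):   {median(dropouts):,.2f}")
```

The median barely moves when the extreme value is included, so showing it next to the mean is a quick way to spot an average that one or two cases are carrying.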

Michael Schrage, a research fellow at MIT Sloan School’s Center for Digital Business, is the author of the books Serious Play, Who Do You Want Your Customers to Become? and The Innovator’s Hypothesis.


SECTION FOUR

Communicate Your Findings


CHAPTER 17

Data Is Worthless If You Don’t Communicate It

by Thomas H. Davenport

Too many managers are, with the help of their analyst

colleagues, simply compiling vast databases of infor-

mation that never see the light of day, or that only get

disseminated in autogenerated business intelligence

reports. As a manager, it’s not your job to crunch the

numbers, but it is your job to communicate them. Never

make the mistake of assuming that the results will speak

for themselves.

Consider the cautionary tale of Gregor Mendel. Although he discovered the concept of genetic inheritance, his ideas were not adopted during his lifetime because he only published his findings in an obscure Moravian scientific journal, a few reprints of which he mailed to leading scientists. It’s said that Darwin, to whom Mendel sent a reprint of his findings, never even cut the pages to read the geneticist’s work. Although Mendel carried out his groundbreaking experiments between 1856 and 1863—eight years of painstaking research—their significance was not recognized until the turn of the 20th century, long after his death. The lesson: If you’re going to spend the better part of a decade on a research project, also put some time and effort into disseminating your results.

Adapted from content posted on hbr.org, June 18, 2013 (product #H00ASW).

One person who has done this very well is Dr. John Gottman, the well-known marriage scientist at the University of Washington. Gottman, working with a statistical colleague, developed a marriage equation predicting how likely a marriage is to last over the long term. The equation is based on a couple’s ratio of positive to negative interactions during a 15-minute conversation on a difficult topic such as money or in-laws. Pairs who showed affection, humor, or happiness while talking about contentious topics were given a maximum number of points, while those who displayed belligerence or contempt received the minimum. Observing several hundred couples, Gottman and his team were able to score couples’ interactions and identify the patterns that predict divorce or a happy marriage.

This was great work in itself, but Gottman didn’t stop there. He and his wife, Julie, founded a nonprofit research institute and a for-profit organization to apply the results through books, DVDs, workshops, and therapist training. They’ve influenced exponentially more marriages through these outlets than they could possibly ever have done in their own clinic—or if they’d just issued a press release with their findings.

Similarly, during his tenure at Intuit, George Roumeliotis was head of a data science group that analyzed and created product features based on the vast amount of online data that Intuit collected. For his projects, he recommended a simple framework for communicating about each analysis:

1. My understanding of the business problem

2. How I will measure the business impact

3. What data is available

4. The initial solution hypothesis

5. The solution

6. The business impact of the solution

Note what’s not here: details on statistical methods used, regression coefficients, or logarithmic transformations. Most audiences neither understand nor appreciate those details; they care about results and implications. It may be useful to make such information available in an appendix to a report or presentation, but don’t let it get in the way of telling a good story with your data—starting with what your audience really needs to know.


Thomas H. Davenport is the President’s Distinguished

Professor in Management and Information Technology

at Babson College, a research fellow at the MIT Initiative

on the Digital Economy, and a senior adviser at Deloitte

Analytics. Author of over a dozen management books,

his latest is Only Humans Need Apply: Winners and

Losers in the Age of Smart Machines.


CHAPTER 18

When Data Visualization Works—and When It Doesn’t

by Jim Stikeleather

I am uncomfortable with the growing emphasis on big data and its stylist, visualization. Don’t get me wrong—I love infographic representations of large data sets. The value of representing information concisely and effectively dates back to Florence Nightingale, when she developed a new type of pie chart to clearly show that more soldiers were dying from preventable illnesses than from their wounds. On the other hand, I see beautiful exercises in special effects that show off statistical and technical skills, but do not clearly serve an informing purpose. That’s what makes me squirm.

Adapted from content posted on hbr.org, March 27, 2013 (product #H00ADJ).

Ultimately, data visualization is about communicating an idea that will drive action. Understanding the criteria for information to provide valuable insights and the reasoning behind constructing data visualizations will help you do that with efficiency and impact.

For information to provide valuable insights, it must be

interpretable, relevant, and novel. With so much unstruc-

tured data today, it is critical that the data being analyzed

generates interpretable information. Collecting lots of

data without the associated metadata—such as what is it,

where was it collected, when, how, and by whom—reduces

the opportunity to play with, interpret, and draw conclu-

sions from the data. It must also be relevant to the people

who are looking to gain insights, and to the purpose for

which the information is being examined (see the sidebar

“Understand Your Audience”). Finally, it must be original,

or shed new light on an area. If the information fails any

one of these criteria, then no visualization can make it

valuable. That means that only a tiny slice of the data we

can bring to life visually will actually be worth the effort.

Once we’ve narrowed the universe of data down to that which satisfies these three requirements, we must also understand the legitimate reasons to construct data visualizations, and recognize what factors affect the quality of data visualizations. There are three broad reasons for visualizing data:


• Confirmation: If we already have a set of assumptions about how the system we are interested in operates—for example, a market, customers, or competitors—visualizations can help us check those assumptions. They can also enable us to observe whether the underlying system has deviated from the model we had and assess the risk of the actions we are about to undertake based on those assumptions. You see this approach in some enterprise dashboards.

• Education: There are two forms of education that visualization offers. One is simply reporting: here is how we measure the underlying system of interest, and here are the values of those measures in some comparative form—for instance, over time, or against other systems or models. The other is to develop intuition and new insights on the behavior of a known system as it evolves and changes over time, so that humans can get an experiential feel of the system in an extremely compressed time frame. You often see this model in the “gamification” of training and development.

• Exploration: When we have large sets of data about a system we are interested in and the goal is to provide optimal human-machine interactions (HMI) to that data to tease out relationships, processes, models, etc., we can use visualization to help build a model to allow us to predict and better manage the system. The practice of using visual discovery in lieu of statistics is called exploratory data analysis (EDA), and too few businesses make use of it.

UNDERSTAND YOUR AUDIENCE

Before you throw up (pun intended) data in a visualization, start with the goal, which is to convey great quantities of information in a format that is easily assimilated by the consumers of this information—decision makers. A successful visualization is based on the designer understanding whom the visualization is targeting, and executing on three key points:

• Who is the audience, and how will it read and interpret the information? Can you assume these individuals have knowledge of the terminology and concepts you’ll use, or do you need to guide them with clues in the visualization (for example, good is indicated with a green arrow going up)? An audience of experts will have different expectations than a general audience.

• What are viewers’ expectations, and what type of information is most useful to them?

• What is the visualization’s functional role, and how can viewers take action from it? An exploratory visualization should leave viewers with questions to pursue; educational or confirmational graphics should not.

Adapted from “The Three Elements of Successful Data Visualizations” on hbr.org by Jim Stikeleather, April 19, 2013.

Assuming the visualization creator has gotten it all right—a well-defined purpose, the necessary and sufficient amount of data and metadata to make the visualization interpretable, enabling relevant and original insights for the business—what gives us confidence that these findings are now worthy of action? Our ability to understand and to a degree control three areas of risk can define the visualization’s resulting value to the business:

• Data quality: The quality of the underlying data is

crucial to the value of visualization. How complete

and reliable is it? As with all analytical processes,

putting garbage in means getting garbage out.

• Context: The point of visualization is to make

large amounts of data approachable so we can

apply our evolutionarily honed pattern detection

computer—our brain—to draw insights from it.

To do so, we need to access all of the potential

relationships of the data elements. This context is

the source of insight. To leave out any contextual

information or metadata (or more appropriately,

“metacontent”) is to risk hampering our under-

standing.

• Biases: The creator of the visualization may influence the visualization’s semantics and the syntax of the elements through color choices, positioning, and visual tricks (such as unnecessary 3D, or 2D when 3D is more informative)—any of which can challenge the interpretation of the data. This also creates the risk of pre-specifying discoverable features and results via the embedded algorithms used by the creator (something EDA is intended to overcome). These in turn can significantly influence how viewers understand the visualization, and what insight they will gather from it.

Ignoring these requirements and risks can under-

mine the visualization’s purpose and confuse rather than

enlighten.

Jim Stikeleather, DBA, is a serial entrepreneur and was formerly Chief Innovation Officer at Dell. He teaches innovation, business models, strategy, governance, and change management at the graduate level at the University of South Florida and The Innovation Academy at Trinity College Dublin. He is also a senior executive coach.


CHAPTER 19

How to Make Charts That Pop and Persuade

by Nancy Duarte

Displaying data can be a tricky proposition, because different rules apply in different contexts. A sales director presenting financial projections to a group of field representatives wouldn’t visualize her data the same way that a design consultant would in a written proposal to a potential client.

So how do you make the right choices for your situation? Before displaying your data, ask yourself these five questions:

Adapted from “The Quick and Dirty on Data Visualization” on hbr.org,

April 16, 2014 (product #H00RKA).


1. Am I Presenting or Circulating My Data?

Context plays a huge role in how best to render data. When delivering a presentation, show the conclusions you’ve drawn, not all the details that led you to those conclusions. Because your slides will be up for only a few seconds, your audience will need to process them quickly. People won’t have time to chew on a lot of complex information, and they’re not likely to run up to the wall for a closer look at the numbers. Think in broad strokes when you’re putting your charts together: What’s the overall trend you’re highlighting? What’s the most striking comparison you’re making? These are the sorts of questions to answer with projected data.

Scales, grid lines, tick marks, and such should pro-

vide context, but without competing with the data. Use

a light neutral color, such as gray, for these elements so

they’ll recede to the background, and plot your data in

a slightly stronger neutral color, such as blue or green.

Then use a bright color to emphasize the point you’re

making.

It’s fine to display more detail in documents or in decks that you email rather than present. Readers can study them at their own pace—examine the axes, the legends, the layers—and draw their own conclusions from your body of work. Still, you don’t want to overwhelm them, especially since they won’t have you there in person to explain what your main points are. Use white space, section heads, and a clear hierarchy of visual elements to help your readers navigate dense content and guide them to key pieces of data.

2. Am I Using the Right Kind of Chart or Table?

When you choose how to visualize your data, you’re deciding what type of relationship you want to emphasize. Take a look at figure 19-1, which shows the breakdown of an investment portfolio.

In the pie, it’s clear that this person holds a number

of investments in different areas—but that’s about all

you see.

Figure 19-2 shows the same data in a bar chart. In this form it’s much easier to discern how much is invested in each category. If your focus is on comparing categories, the bar chart is the better choice. A pie chart would be more useful if you were trying to make the point that a single investment made up a significant portion of the portfolio.

FIGURE 19-1

Investment portfolio breakdown (pie chart; slices: International stocks, Large-cap U.S. stock, Bonds, Real estate, Mid-cap U.S. stock, Small-cap U.S. stock, Commodities)

3. What Message Am I Trying to Convey?

Whether you’re presenting or circulating your charts, you need to highlight the most important items to ensure that your audience can follow your train of thought and focus on the right elements. For example, figure 19-3 is difficult to interpret because all the information is displayed with equal visual value.

Are we comparing regions? Quarters? Positive versus negative numbers? It’s difficult to determine what matters most. By adding color or shading, you can draw the eye to specific areas, as shown in figure 19-4.

We now know that we should be focusing on when

and in which regions revenue dropped.

FIGURE 19-2

Investment portfolio breakdown (bar chart; categories: International stocks, Large-cap U.S. stock, Bonds, Real estate, Mid-cap U.S. stock, Small-cap U.S. stock, Commodities; axis: 0 to 20%)

FIGURE 19-3

Revenue trends (table of percentages; rows: Americas, Australia, China, Europe, India; columns: Q1, Q2, Q3, Q4, Total)

FIGURE 19-4

Revenue trends (the same table as figure 19-3, with shading added to draw the eye to specific areas)

4. Do My Visuals Accurately Reflect the Numbers?

Using a lot of crazy colors, extra labels, and fancy effects won’t captivate an audience. That kind of visual clutter dilutes the information and can even misrepresent it. Consider the chart in figure 19-5.

FIGURE 19-5

Yearly revenue per region (3D bar chart; regions: North, South, East, West; years: Year one, Year two; axis: 0 to 50)

Can you figure out the northern territory’s revenue for year one? Is it 17? Or maybe 19? The way some programs create 3D charts would lead any rational person to think that the bar in question is well below 20. However, the data behind the chart actually says that bar represents 20.4 units. You can see that if you look at the chart in a very specific way, but it’s difficult to tell which way that should be—even with plenty of time to scrutinize it.

It’s much clearer if you simply flatten the chart, as in figure 19-6.

5. Is My Data Memorable?

Even if you’ve rendered your data clearly and accurately, it’s another challenge altogether to make the information stick. Consider using a meaningful visual metaphor to illustrate the scale of your numbers and cement the data in the minds of your audience members. A metaphor can also tie your insights to something that your audience already knows and cares about.

Author and activist Michael Pollan showed how much crude oil goes into making a McDonald’s Double Quarter Pounder with Cheese through a striking visual demonstration: He placed glasses on a table and filled them with oil to represent the amount of oil consumed during each stage of the production process. At the end, he took a taste of the oil to drive home his point. (To add an element of humor, he later revealed that his prop “oil” was actually chocolate syrup.)

Pollan could have shown a chart, but this was more

effective because he gave the audience a tangible visual—

one that triggered a visceral response.

FIGURE 19-6

Yearly revenue per region (flat bar chart; regions: North, South, East, West; years: Year 1, Year 2; axis: 0 to 50)

By answering these five questions as you’re laying out your data, you’ll visualize it in a way that helps people understand and engage with each point in your presentation, document, or deck. As a result, your audience will be more likely to adopt your overall message.

Nancy Duarte has published her latest book, Illuminate,

with coauthor Patti Sanchez. Duarte is also the author

of the HBR Guide to Persuasive Presentations, as well

as two award-winning books on the art of presenting,

Slide:ology and Resonate. Her team at Duarte Inc. has

created more than a quarter million presentations for its

clients and teaches public and corporate workshops on

presenting. Find Duarte on LinkedIn or follow her on

Twitter @nancyduarte.


CHAPTER 20

Why It’s So Hard for Us to Communicate Uncertainty

An interview with Scott Berinato by Nicole Torres

We use data to make predictions. But predictions are just educated guesses—they’re uncertain. And when they’re being communicated, they’re incredibly difficult to explain or clearly illustrate.

A case in point: The 2016 U.S. presidential election did not unfold the way so many predicted it would. We now know some of the reasons why—polling failed—but watching the real-time results on the night of Tuesday, November 8, wasn’t just surprising, it was confusing. Predictions swung back and forth, and it was hard to process the information that was coming in. Not only did the data seem wrong, the way we were presenting that data seemed wrong too.

Adapted from “Why It’s So Hard for Us to Visualize Uncertainty” on hbr.org, November 11, 2016 (product #H039NV).

I asked my colleague Scott Berinato, Harvard Busi-

ness Review editor and author of Good Charts: The HBR

Guide to Making Smarter, More Persuasive Data Visu-

alizations, if he would help explain this uncertainty—

how we dealt with it, why it was so hard to grasp, and

what’s so challenging about communicating and visual-

izing it.

Torres: What did you notice about how election

predictions were being shown election night?

Berinato: A lot of people were looking at the New

York Times’ live presidential forecast, where you’d see

a series of gauges (half-circle gauges, like a gas gauge

on your car) that updated frequently.1 The needle

moved left if data showed that Hillary Clinton had a

higher chance of winning, and right if Donald Trump

did. But the needle also jittered back and forth, mak-

ing it look like the statistical likelihood of winning

was changing rapidly. This caused a lot of anxiety.

People were confused. They were trying to interpret

what was going on in the election and why the data

was changing so drastically in real time, and it was

really hard to understand what was going on.

The thing was, the needle wasn’t swinging to represent statistical likelihood; it was a hard-coded effect meant to represent uncertainty in the statistical forecast. So trying to show real-time changes in the race, while accounting for uncertainty, was a good engagement effort, but the execution fell short because it confused and unnerved people. The jitter wasn’t the best visual approach.

What do we mean by “uncertainty”?

When thinking about showing uncertainty, we think

mostly about two types. One is statistical uncer-

tainty, which applies if I said something like, “Here

are my values, and statistically my confi dence in

them is 95%.” Think about margin of error built into

polls. Statisticians use things like box-and-whisker

plots to represent this, where a box shows the upper

and lower ranges of the fi rst and third quartiles in a

data set, a line in the box marks the median, and thin

bar “whiskers” reaching above and below the box to

indicate the range of the data. Dots can also be used

beyond the whiskers to show outliers. There are lots

of variations of these, and they work reasonably well,

though academics try other approaches sometimes

and the lay audience isn’t used to these visualizations,

for the most part.
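As a rough illustration of the box-and-whisker description above, the sketch below computes the numbers such a plot would draw. It is only a sketch under stated assumptions: the sample values are made up, the function name `box_plot_summary` is hypothetical, and the 1.5 x IQR rule for whiskers and outliers is one common convention (Tukey's), not something the interview specifies.

```python
import statistics


def box_plot_summary(values):
    """Return the numbers a box-and-whisker plot would draw for `values`."""
    data = sorted(values)
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: whiskers stop at the most extreme points that lie
    # within 1.5 * IQR of the box; anything beyond is drawn as an outlier dot.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Made-up sample data, for illustration only.
summary = box_plot_summary([2, 4, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

For this sample the box runs from 4 to 8 with the median at 6, the whiskers reach 2 and 9, and the value 30 is drawn as an outlier dot.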

The other kind of uncertainty is data uncertainty.

This applies when we’re not sure where within a

range our data falls. Instead of having a value and a

confidence in that value, we have a range of possible

values. A friend recently gave me a data set with two

values. One was “the estimate ranges from 1 in 2,000

to 1 in 4,500” and the other was “an estimate ranging

from 1 in 5,500 to 1 in 8,000.” There’s not an accepted

or right way to visualize something like this.

Finding ways to accurately and effectively repre-

sent uncertainty is one of the most important chal-

lenges in data visualization today. And it’s important

to know that visualizing uncertainty in general is

extremely difficult to do.

Why?

When you think about it, visualizations make some-

thing abstract—numbers, statistics—concrete. You

are representing an idea like 20% with a thing like a

bar or dot. A dot on a line that represents 20% looks

pretty certain. How do you then express the idea

that “five times out of a hundred this isn’t the right

answer, and it could be all these other answers”?

So are there good ways of visualizing uncertainty

like this?

A lot of the time people just don’t represent their

uncertainty, because it’s hard. We don’t want to do

that. Uncertainty is an important thing to be able

to communicate. For example, consider health care,

where outcomes of care may be uncertain but you

want people to understand their decisions. How

do you show them the possible range of outcomes,

instead of only what is the most likely or least likely

to happen? Or say there’s an outbreak of a disease

like Ebola and we want to model the worst case, the

most likely, and the best-case scenarios. How do we

represent those different outcomes? Weather fore-

casts, hurricane models are the same thing. Risk

analysts and probability experts think about how to

solve these problems all the time. It’s not easy.

There are a number of other approaches, though.

Some people use bars to represent the range of

uncertainty. Some use solid lines to show an aver-

age value and dotted lines above and below to show

the upper and lower boundaries. Using color satu-

ration or gradients to show that values are becom-

ing less and less likely—but still in the realm of

possibility— is another way.

On top of uncertainty, we’re also dealing with

probability.

Yes, it’s really hard for our brains to perceive prob-

ability. When we say something has an 80%

chance of happening, it’s not the simplest thing to

understand. You can’t really feel what 80% likelihood

really means. I mean, it seems like it will probably

happen. But the important thing to remember is

that if it doesn’t happen, that doesn’t mean you were

wrong. It just means the 20% likelihood happened

instead.

Statistics are weird. Even if we felt like we under-

stood what a “20% chance” was, we don’t think of it

as the same as “1 in 5.” We tend to think that “1 in 5”

is more likely to happen than “20%.” It’s less abstract.

If you say 1 in 5 people commits a crime, you actually

picture that one person. We “image the numerator.”

But “20%” doesn’t commit a crime. It’s not a thing

that acts. It’s a statistic.

What do we do when the 20% or 10% chance thing

happens?

How do you tell someone who has had the very rare

thing happen to them that, based on the probability

we gave you, it was the right advice, even though it

didn’t work out for you? That’s really difficult, and

security executives and risk experts think about this

all the time. When you think about it, businesses

need to learn this because it’s easy in hindsight to

say “Our models were wrong—the unlikely bad thing

happened.” Not true! All along, we were communicating

that there was some small chance that the bad thing

could happen. Still, as humans, that’s hard for us to

grasp.

Is it because we try to hang on to the hope of a more

favorable outcome?

It’s because likely things happen more of the time.

When unlikely things happen, we want to make sense

of it. We weren’t expecting it. We shouldn’t have been

expecting it because it was unlikely. But it’s still pos-

sible, however unlikely. Already just hearing myself

say this, you see how elliptical it sounds. When a

natural disaster strikes, you often hear people after-

ward say “It was a 100-year storm, no one could have

seen this coming.” Not true! Risk experts always see it

coming. It was always a statistical possibility. It’s just

not likely.
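The “100-year storm” point can be made concrete with a little arithmetic. The sketch below assumes that a 100-year storm means a 1% chance in any given year and that years are independent of one another; both are simplifying assumptions, and the function name `chance_of_at_least_one` is hypothetical.

```python
def chance_of_at_least_one(annual_probability, years):
    """Probability that an event occurs at least once within `years` years."""
    # The chance of no event in one year is (1 - p). Assuming independent
    # years, the chance of no event across all of them is (1 - p) ** years.
    return 1 - (1 - annual_probability) ** years


# Assumption: a "100-year storm" has a 1% chance of occurring each year.
for horizon in (1, 10, 30, 100):
    p = chance_of_at_least_one(0.01, horizon)
    print(f"{horizon:>3} years: {p:.1%}")
```

Under these assumptions the chance is about 26% over 30 years and about 63% over 100 years, so an event that is unlikely in any single year is far from negligible over a planning horizon.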

I get probability, but I still can’t help but feel misled by

the presidential election predictions. What am I missing?

Three things are going on with the election mod-

els. (1) Even if a candidate had a 10% chance of

winning 10 days ago and they end up winning, it

doesn’t mean the model was wrong. It means the

unlikely happened. (2) This whole notion of using

probability to determine who will win an election

(based on whether they have an 80% chance, etc.)

is hard for the audience to grasp, because we tend

to think about elections in more binary terms—this

person will win versus that person will win. (3) We

revisit the probabilities every day and update them.

And when one candidate says something stupid,

their probability of winning goes down and the other

candidates’ probabilities go up. This makes us feel like these winning

probabilities are reactive, not speculative. So we,

the lay audience, end up thinking we’re looking at

data that tells us something about how the candi-

dates are behaving, not how likely it is they’ll win.

It starts to feel more like an approval rating than a

forecast.

That first point must come up in business all the time.

The election brings the subject of visualizing un-

certainty into focus but it’s an increasingly com-

mon challenge in businesses building out their data

science operations. As data science becomes more

and more important for companies, managers are

starting to deal with types of data that show multiple

possible outcomes, where there is statistical uncertainty

and data uncertainty that they have to communicate

to their bosses. If managers don’t help their bosses

understand the uncertainty, the bosses will look at the

charts and say that’s the answer when it’s only the

likelihood. It’s okay to focus on what is most likely,

but you don’t want to forgo showing the range of pos-

sible outcomes.

For example, if you’re looking at a way to model

customer adoption and you’re using statistical mod-

els, you want to make sure you demonstrate what you

think is most likely to happen, but also how this out-

come is one of a range of potential outcomes based

on your models. You need to be able to communicate

that visually, or your boss or client will misinterpret

what you’re saying. If the data scientists say we have a

90% chance of succeeding if we adopt this model, but

then it doesn’t happen, the boss should know that you

weren’t wrong—you really just fell into the 10%. You

rolled snake eyes. It happens. This is a really hard

thing for our brains to deal with and communicate,

and it’s an important challenge for companies invest-

ing in a data-driven approach to their businesses.

Scott Berinato is a senior editor at Harvard Business Re-

view and the author of Good Charts: The HBR Guide to

Making Smarter, More Persuasive Data Visualizations

(Harvard Business Review Press, 2016). Nicole Torres is

an associate editor at Harvard Business Review.

NOTE

1. “Live Presidential Forecast,” New York Times, November 9, 2016, https://www.nytimes.com/elections/forecast/president.

CHAPTER 21

Responding to Someone Who Challenges Your Data

by Jon M. Jachimowicz

Adapted from “What to Do When Someone Angrily Challenges Your

Data” on hbr.org, April 5, 2017 (product #H03L2M).

I recently conducted a study with a large, multinational

company to figure out how to increase employee engagement.

After the data collection was complete, I ran the

data analysis and found some intriguing findings that I

was excited to share with the firm. But a troubling result

became apparent in my analysis: The organization

had rampant discrimination against women, especially

ambitious, passionate, talented women. Although this

result was based on initial data and was not particularly

rigorous, I was convinced that managers at the collab-

orating organization would like to hear it so that they

could address the issue.

I couldn’t have been more wrong. In a meeting with

the company’s head of HR and a few members of his

team, I first presented my overall findings about employee

engagement. In my last few slides, I turned the

presentation toward the results of the gender discrimi-

nation analysis that I had conducted. I was expecting an

animated conversation, and perhaps even some internal

questioning into why the discrimination was occurring

and how they could rectify it.

Instead, the head of HR got very angry. He accused

me of misrepresenting the facts, and countered by citing

data from his own records that showed men and women

were equally likely to be promoted. In addition, he had

never heard from anyone within the organization that

gender discrimination was a problem. He strongly believed

that the diversity practices his team had championed

were industry leading, and that they were sufficient

to ward off gender discrimination. Clearly, this topic was

important to him, and my findings had touched a nerve.

After his fury (and my shock) had cooled, I reminded

him that the data I presented was just initial pilot data

and should be treated as such. Perhaps if we were to do a

more thorough assessment, I argued, we would find that

the initial data was inaccurate. In addition, I proposed

that a follow-on study that focused on gender discrimi-

nation could pinpoint which aspects of the diversity poli-

cies were working particularly well, and that he could

use these insights to further advocate for his agenda.

We landed on a compromise: I would design and run an

additional study with a focus on gender discrimination,

connecting survey responses to important outcomes

such as promotions and turnover.

A few months later, the data came in. My data analysis

showed that my initial findings were correct: Gender

discrimination was happening in the company. But

the head of HR’s major claim wasn’t wrong: Men and

women were equally likely to be promoted.

The improved data set allowed us to see how both

facts could be true at the same time. We now had de-

tailed insights into which employees were—and, more

important, were not—being promoted. Although am-

bitious, passionate, and talented men were advancing

in the company, their female counterparts were being

passed over for promotion, time and again—effectively

being pushed out of the organization. That is, the best

men were moving up, but not the best women. Those

women who were being promoted were given these op-

portunities out of tokenism: They weren’t particularly

high performing, and often reached a “natural” ceiling

early on in their careers due to their limited abilities.
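To see how both facts can hold at once, consider the following sketch. The counts are invented purely for illustration and are not the company's data; the segment labels and the helper name `promotion_rate` are likewise hypothetical.

```python
# Hypothetical counts, invented for illustration: (promoted, total employees).
counts = {
    ("men", "high performers"): (30, 50),
    ("men", "other employees"): (10, 150),
    ("women", "high performers"): (10, 50),
    ("women", "other employees"): (30, 150),
}


def promotion_rate(rows):
    """Share of employees promoted across a list of (promoted, total) pairs."""
    promoted = sum(p for p, _ in rows)
    total = sum(t for _, t in rows)
    return promoted / total


# Overall rate per group: the only number the existing records tracked.
overall = {
    group: promotion_rate([v for (g, _), v in counts.items() if g == group])
    for group in ("men", "women")
}

# Rate within each segment: the view the follow-on study made possible.
by_segment = {key: p / t for key, (p, t) in counts.items()}

print(overall)
for key, rate in by_segment.items():
    print(key, f"{rate:.0%}")
```

In this made-up example both groups are promoted at the same overall rate of 20%, yet high-performing men are promoted at 60% and high-performing women at 20%. The aggregate figure alone cannot reveal that gap, which is why the choice of what to measure matters.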

We also now had data on the specific kind of advancement

opportunities male and female employees received

to learn new skills, make new connections, and increase

their visibility in the organization. Compared with their

male counterparts, passionate women were less likely to

get these kinds of chances.

Armed with this new data, I was invited to present to

the head of HR again. Remembering our last meeting,

I expected him to be upset. But we had a very different

conversation this time. Instead of being met with anger,

the data I presented prompted concern. I could place the

fact of men and women being equally likely to be pro-

moted in a fuller context, complete with rigorous data

from the organization. We had a lively debate about why

this asymmetry between men and women existed. Most

important, we concluded that the data he measured to

track gender discrimination was unable to provide him

with the necessary insight to understand whether gen-

der discrimination was a problem.

He has since appointed a task force to tackle the prob-

lem of gender discrimination head-on, something he

wouldn’t have done if we hadn’t collected the data that

we did. This is the power of collecting thorough data in

your own organization: Instead of making assumptions

on what may or may not be occurring, a thoughtful de-

sign of data-collection practices allows you to gather the

right information to come to better conclusions.

So it’s not just about the data you have. Existing data

blinds us, and it is important to shift the focus away from

readily available information. Crucially, not having the

right data is no excuse. In the case of the head of HR, not

hearing about gender discrimination from anyone in the

organization allowed him to conclude that women did

not face this issue. Think about what data is not being

collected that may help embed existing data in a richer

context.

Next time someone angrily challenges your data,

there are a few steps you can take:

First, take their perspective. Understand why your

counterpart is responding so forcefully. In many

cases, it may simply be that they really care about the

outcome. Your goals may even be aligned, and fram-

ing your data in a way where their goals are achieved

may help you circumvent their anger.

Second, collect more data that specifically takes their

criticism to heart. Every comment is a useful comment.

Just as a fiction author can’t be upset when

readers don’t get the point of what they are trying to

say, a researcher must understand how their findings

are being understood. What is the upset recipient

of your analysis responding to, and how can further

data collection help you address their concerns?

Last, view your challenger not as an opponent, but

as an ally. Find a way to collaborate, because once

you have their buy-in, they are invested in the joint

investigation. As a result, they will be more likely to

view you as being part of the team. And then you can

channel the energy that prompted their fury for good.

Defending your data analysis can be stressful—especially

if your findings cause conflict. But by following

these steps, you can defuse any tension and attack the

problem in a productive way.

Jon M. Jachimowicz is a PhD candidate at Columbia

Business School. In his research, he investigates the an-

tecedents, perceptions, and consequences of passion for

work. His website can be found at jonmjachimowicz

.com.

CHAPTER 22

Decisions Don’t Start with Data

by Nick Morgan

I recently worked with an executive keen to persuade his

colleagues that their company should drop a longtime

vendor in favor of a new one. He knew that members of

the executive team opposed the idea (in part because of

their well-established relationships with the vendor) but

he didn’t want to confront them directly, so he put to-

gether a PowerPoint presentation full of stats and charts

showing the cost savings that might be achieved by the

change.

He hoped the data would speak for itself.

But it didn’t.

Adapted from content posted on hbr.org, May 14, 2014 (product

#H00T3S).

The team stopped listening about a third of the way

through the presentation. Why? It was good data. The

executive was right. But, even in business meetings,

numbers don’t ever speak for themselves, no matter how

visually appealing the presentation may be.

To influence human decision making, you have to

get to the place where decisions are really made—in the

unconscious mind, where emotions rule, and data is

mostly absent. Yes, even the most savvy executives begin

to make choices this way. They get an intent, a desire,

or a want in their unconscious minds, and then decide

to pursue it and act on that decision. Only after that

do they become consciously aware of the choice they’ve

made and start to justify it with rational argument. In

fact, research from Carnegie Mellon University indicates

that our unconscious minds actually make better deci-

sions when left alone to deal with complex issues.

Data is helpful as supporting material, of course. But,

because it spurs thinking in the conscious mind, it must

be used with care. Effective persuasion starts not with

numbers, but with stories that have emotional power

because that’s the best way to tap into unconscious de-

cision making. We decide to invest in a new company

or business line not because the financial model shows

it will succeed but because we’re drawn to the story told

by the people pitching it. We buy goods and services

because we believe the stories marketers build around

them: “A diamond is forever” (De Beers), “Real beauty”

(Dove), “Think different” (Apple), “Just do it” (Nike). We

take jobs not only for the pay and benefits but also for

the self-advancement story we’re told, and tell ourselves,

about working at the new place.

Sometimes we describe this as having a good “gut

feeling.” What that really means is that we’ve already un-

consciously decided to go forward, based on desire, and

our conscious mind is seeking some rationale for that

otherwise invisible decision.

I advised the executive to scrap his PowerPoint and

tell a story about the opportunities for future growth

with the new vendor, reframing and trumping the loy-

alty story the opposition camp was going to tell. And so,

in his next attempt, rather than just presenting data, he

told his colleagues that they should all be striving toward

a new vision for the company, no longer held back by a

tether to the past. He began with an alluring description

of the future state—improved margins, a cooler, higher-

tech product line, and excited customers—then asked his

audience to move forward with him to reach that goal. It

was a quest story, and it worked.

Data can provide new insight and evidence to inform

your toughest decisions. But numbers alone won’t con-

vince others. Good stories—with a few key facts woven

in—are what attach emotions to your argument, prompt

people into unconscious decision making, and ultimately

move them to action.

Nick Morgan is a speaker, coach, and the president and

founder of Public Words, a communications consulting

firm. He is the author of Power Cues: The Subtle Science

of Leading Groups, Persuading Others, and Maximiz-

ing Your Personal Impact (Harvard Business Review

Press, 2014).

APPENDIX

Data Scientist: The Sexiest Job of the 21st Century

by Thomas H. Davenport and D.J. Patil

Reprinted from Harvard Business Review, October 2012 (product

#R1210D).

When Jonathan Goldman arrived for work in June 2006

at LinkedIn, the business networking site, the place still

felt like a startup. The company had just under 8 million

accounts, and the number was growing quickly as existing

members invited their friends and colleagues to join.

But users weren’t seeking out connections with the people

who were already on the site at the rate executives

had expected. Something was apparently missing in the

social experience. As one LinkedIn manager put it, “It

was like arriving at a conference reception and realizing

you don’t know anyone. So you just stand in the corner

sipping your drink—and you probably leave early.”

Goldman, a PhD in physics from Stanford, was in-

trigued by the linking he did see going on and by the

richness of the user profiles. It all made for messy data

and unwieldy analysis, but as he began exploring people’s

connections, he started to see possibilities. He began

forming theories, testing hunches, and finding patterns

that allowed him to predict whose networks a given profile

would land in. He could imagine that new features

capitalizing on the heuristics he was developing might

provide value to users. But LinkedIn’s engineering team,

caught up in the challenges of scaling up the site, seemed

uninterested. Some colleagues were openly dismissive of

Goldman’s ideas. Why would users need LinkedIn to figure

out their networks for them? The site already had an

address book importer that could pull in all a member’s

connections.

Luckily, Reid Hoffman, LinkedIn’s cofounder and CEO

at the time (now its executive chairman), had faith in the

power of analytics because of his experiences at PayPal,

and he had granted Goldman a high degree of autonomy.

For one thing, he had given Goldman a way to circum-

vent the traditional product release cycle by publishing

small modules in the form of ads on the site’s most popu-

lar pages.

Through one such module, Goldman started to test

what would happen if you presented users with names of

people they hadn’t yet connected with but seemed likely

to know—for example, people who had shared their

tenures at schools and workplaces. He did this by gin-

ning up a custom ad that displayed the three best new

matches for each user based on the background entered

in his or her LinkedIn profile. Within days it was obvious

that something remarkable was taking place. The

click-through rate on those ads was the highest ever

seen. Goldman continued to refine how the suggestions

were generated, incorporating networking ideas such as

“triangle closing”—the notion that if you know Larry and

Sue, there’s a good chance that Larry and Sue know each

other. Goldman and his team also got the action required

to respond to a suggestion down to one click.
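The “triangle closing” idea can be sketched in a few lines of code. This is not LinkedIn's actual method, which the article does not describe in detail; it is a minimal illustration that ranks people a member is not yet connected to by how many connections they share. The function name `suggest_connections`, the toy network, and every name other than Larry and Sue are hypothetical.

```python
from collections import Counter


def suggest_connections(network, member, top_n=3):
    """Rank people `member` is not connected to by number of shared connections."""
    already_connected = network.get(member, set())
    shared = Counter()
    for friend in already_connected:
        for candidate in network.get(friend, set()):
            # Skip the member themselves and anyone already connected.
            if candidate != member and candidate not in already_connected:
                shared[candidate] += 1
    # Most shared connections first; ties are broken alphabetically.
    ranked = sorted(shared.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


# Toy network (hypothetical): each person maps to the set of people they know.
network = {
    "You": {"Larry", "Sue", "Ana"},
    "Larry": {"You", "Ana"},
    "Sue": {"You", "Ana"},
    "Ana": {"You", "Larry", "Sue", "Raj"},
    "Raj": {"Ana"},
}

print(suggest_connections(network, "Larry"))
```

Here Larry and Sue are not connected but share two connections, so Sue is the top suggestion for Larry, ahead of Raj, who shares only one.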

It didn’t take long for LinkedIn’s top managers to rec-

ognize a good idea and make it a standard feature. That’s

when things really took off. “People You May Know” ads

achieved a click-through rate 30% higher than the rate

obtained by other prompts to visit more pages on the

site. They generated millions of new page views. Thanks

to this one feature, LinkedIn’s growth trajectory shifted

significantly upward.

A New Breed

Goldman is a good example of a new key player in organizations:

the “data scientist.” It’s a high-ranking professional

with the training and curiosity to make discoveries

in the world of big data. The title has been around

for only a few years. (It was coined in 2008 by one of us,

D.J. Patil, and Jeff Hammerbacher, then the respective

leads of data and analytics efforts at LinkedIn and Face-

book.) But thousands of data scientists are already work-

ing at both startups and well-established companies.

Their sudden appearance on the business scene reflects

the fact that companies are now wrestling with informa-

tion that comes in varieties and volumes never encoun-

tered before. If your organization stores multiple peta-

bytes of data, if the information most critical to your

business resides in forms other than rows and columns

of numbers, or if answering your biggest question would

involve a “mashup” of several analytical efforts, you’ve

got a big data opportunity.

Much of the current enthusiasm for big data focuses

on technologies that make taming it possible, including

Hadoop (the most widely used framework for distributed

file system processing) and related open-source tools,

cloud computing, and data visualization. While those are

important breakthroughs, at least as important are the

people with the skill set (and the mindset) to put them to

good use. On this front, demand has raced ahead of sup-

ply. Indeed, the shortage of data scientists is becoming a

serious constraint in some sectors. Greylock Partners, an

early-stage venture firm that has backed companies such

as Facebook, LinkedIn, Palo Alto Networks, and Work-

day, is worried enough about the tight labor pool that

it has built its own specialized recruiting team to chan-

nel talent to businesses in its portfolio. “Once they have

data,” says Dan Portillo, who leads that team, “they really

need people who can manage it and find insights in it.”

Who Are These People?

If capitalizing on big data depends on hiring scarce data

scientists, then the challenge for managers is to learn

how to identify that talent, attract it to an enterprise, and

make it productive. None of those tasks is as straightfor-

ward as it is with other, established organizational roles.

Start with the fact that there are no university programs

offering degrees in data science. There is also little consensus

on where the role fits in an organization, how

data scientists can add the most value, and how their

performance should be measured.

The first step in filling the need for data scientists,

therefore, is to understand what they do in businesses.

Then ask, What skills do they need? And what fields are

those skills most readily found in?

More than anything, what data scientists do is make

discoveries while swimming in data. It’s their preferred

method of navigating the world around them. At ease in

the digital realm, they are able to bring structure to large

quantities of formless data and make analysis possible.

They identify rich data sources, join them with other,

potentially incomplete data sources, and clean the re-

sulting set. In a competitive landscape where challenges

keep changing and data never stop flowing, data scientists

help decision makers shift from ad hoc analysis to

an ongoing conversation with data.

Data scientists realize that they face technical limita-

tions, but they don’t allow that to bog down their search

for novel solutions. As they make discoveries, they com-

municate what they’ve learned and suggest its implica-

tions for new business directions. Often they are creative

in displaying information visually and making the patterns

they find clear and compelling. They advise executives

and product managers on the implications of the

data for products, processes, and decisions.

Given the nascent state of their trade, it often falls to

data scientists to fashion their own tools and even conduct

academic-style research. Yahoo, one of the firms

that employed a group of data scientists early on, was

instrumental in developing Hadoop. Facebook’s data

team created the language Hive for programming Ha-

doop projects. Many other data scientists, especially at

data-driven companies such as Google, Amazon, Micro-

soft, Walmart, eBay, LinkedIn, and Twitter, have added

to and refined the tool kit.

What kind of person does all this? What abilities

make a data scientist successful? Think of him or her

as a hybrid of data hacker, analyst, communicator, and

trusted adviser. The combination is extremely power-

ful—and rare.

Data scientists’ most basic, universal skill is the ability

to write code. This may be less true in five years’ time,

when many more people will have the title “data scien-

tist” on their business cards. More enduring will be the

need for data scientists to communicate in language

that all their stakeholders understand—and to demon-

strate the special skills involved in storytelling with data,

whether verbally, visually, or—ideally—both.

But we would say the dominant trait among data sci-

entists is an intense curiosity—a desire to go beneath the

surface of a problem, find the questions at its heart, and

distill them into a very clear set of hypotheses that can

be tested. This often entails the associative thinking that

characterizes the most creative scientists in any field. For

example, we know of a data scientist studying a fraud

problem who realized that it was analogous to a type of

DNA sequencing problem. By bringing together those

disparate worlds, he and his team were able to craft a so-

lution that dramatically reduced fraud losses.

Perhaps it’s becoming clear why the word “scientist”

fits this emerging role. Experimental physicists, for example,

also have to design equipment, gather data, conduct

multiple experiments, and communicate their results.

Thus, companies looking for people who can work

with complex data have had good luck recruiting among

those with educational and work backgrounds in the

physical or social sciences. Some of the best and brightest

data scientists are PhDs in esoteric fields like ecology

and systems biology. George Roumeliotis, the head

of a data science team at Intuit in Silicon Valley, holds a

doctorate in astrophysics. A little less surprisingly, many

of the data scientists working in business today were for-

mally trained in computer science, math, or economics.

They can emerge from any fi eld that has a strong data

and computational focus.

It’s important to keep that image of the scientist in mind—because the word “data” might easily send a search for talent down the wrong path. As Portillo told us, “The traditional backgrounds of people you saw 10 to 15 years ago just don’t cut it these days.” A quantitative analyst can be great at analyzing data but not at subduing a mass of unstructured data and getting it into a form in which it can be analyzed. A data management expert might be great at generating and organizing data in structured form but not at turning unstructured data into structured data—and also not at actually analyzing the data. And while people without strong social skills might thrive in traditional data professions, data scientists must have such skills to be effective.

Roumeliotis was clear with us that he doesn’t hire on the basis of statistical or analytical capabilities. He begins his search for data scientists by asking candidates if they can develop prototypes in a mainstream programming language such as Java. Roumeliotis seeks both a skill set—a solid foundation in math, statistics, probability, and computer science—and certain habits of mind. He wants people with a feel for business issues and empathy for customers. Then, he says, he builds on all that with on-the-job training and an occasional course in a particular technology.

Several universities are planning to launch data science programs, and existing programs in analytics, such as the Master of Science in Analytics program at North Carolina State, are busy adding big data exercises and coursework. Some companies are also trying to develop their own data scientists. After acquiring the big data firm Greenplum, EMC decided that the availability of data scientists would be a gating factor in its own—and customers’—exploitation of big data. So its Education Services division launched a data science and big data analytics training and certification program. EMC makes the program available to both employees and customers, and some of its graduates are already working on internal big data initiatives.

As educational offerings proliferate, the pipeline of talent should expand. Vendors of big data technologies are also working to make them easier to use. In the meantime one data scientist has come up with a creative approach to closing the gap. The Insight Data Science Fellows Program, a postdoctoral fellowship designed by Jake Klamka (a high-energy physicist by training), takes scientists from academia and in six weeks prepares them to succeed as data scientists. The program combines mentoring by data experts from local companies (such as Facebook, Twitter, Google, and LinkedIn) with exposure to actual big data challenges. Originally aiming for 10 fellows, Klamka wound up accepting 30, from an applicant pool numbering more than 200. More organizations are now lining up to participate. “The demand from companies has been phenomenal,” Klamka told us. “They just can’t get this kind of high-quality talent.”

HOW TO FIND THE DATA SCIENTISTS YOU NEED

1. Focus recruiting at the “usual suspect” universities (Stanford, MIT, Berkeley, Harvard, Carnegie Mellon) and also at a few others with proven strengths: North Carolina State, UC Santa Cruz, the University of Maryland, the University of Washington, and UT Austin.

2. Scan the membership rolls of user groups devoted to data science tools. The R User Groups (for an open-source statistical tool favored by data scientists) and Python Interest Groups (for PIGgies) are good places to start.

3. Search for data scientists on LinkedIn—they’re almost all on there, and you can see if they have the skills you want.

4. Hang out with data scientists at the Strata, Structure:Data, and Hadoop World conferences and similar gatherings (there is almost one a week now) or at informal data scientist “meetups” in the Bay Area; Boston; New York; Washington, DC; London; Singapore; and Sydney.

5. Make friends with a local venture capitalist, who is likely to have gotten a variety of big data proposals over the past year.

6. Host a competition on Kaggle or TopCoder, the analytics and coding competition sites. Follow up with the most-creative entrants.

7. Don’t bother with any candidate who can’t code. Coding skills don’t have to be at a world-class level but should be good enough to get by. Look for evidence, too, that candidates learn rapidly about new technologies and methods.

8. Make sure a candidate can find a story in a data set and provide a coherent narrative about a key data insight. Test whether he or she can communicate with numbers, visually and verbally.

9. Be wary of candidates who are too detached from the business world. When you ask how their work might apply to your management challenges, are they stuck for answers?

10. Ask candidates about their favorite analysis or insight and how they are keeping their skills sharp. Have they gotten a certificate in the advanced track of Stanford’s online Machine Learning course, contributed to open-source projects, or built an online repository of code to share (for example, on GitHub)?

Why Would a Data Scientist Want to Work Here?

Even as the ranks of data scientists swell, competition for top talent will remain fierce. Expect candidates to size up employment opportunities on the basis of how interesting the big data challenges are. As one of them commented, “If we wanted to work with structured data, we’d be on Wall Street.” Given that today’s most qualified prospects come from nonbusiness backgrounds, hiring managers may need to figure out how to paint an exciting picture of the potential for breakthroughs that their problems offer.

Pay will of course be a factor. A good data scientist will have many doors open to him or her, and salaries will be bid upward. Several data scientists working at startups commented that they’d demanded and got large stock option packages. Even for someone accepting a position for other reasons, compensation signals a level of respect and the value the role is expected to add to the business. But our informal survey of the priorities of data scientists revealed something more fundamentally important. They want to be “on the bridge.” The reference is to the 1960s television show Star Trek, in which the starship captain James Kirk relies heavily on data supplied by Mr. Spock. Data scientists want to be in the thick of a developing situation, with real-time awareness of the evolving set of choices it presents.

Considering the difficulty of finding and keeping data scientists, one would think that a good strategy would involve hiring them as consultants. Most consulting firms have yet to assemble many of them. Even the largest firms, such as Accenture, Deloitte, and IBM Global Services, are in the early stages of leading big data projects for their clients. The skills of the data scientists they do have on staff are mainly being applied to more-conventional quantitative analysis problems. Offshore analytics services firms, such as Mu Sigma, might be the ones to make the first major inroads with data scientists.

But the data scientists we’ve spoken with say they want to build things, not just give advice to a decision maker. One described being a consultant as “the dead zone—all you get to do is tell someone else what the analyses say they should do.” By creating solutions that work, they can have more impact and leave their marks as pioneers of their profession.

Care and Feeding

Data scientists don’t do well on a short leash. They should have the freedom to experiment and explore possibilities. That said, they need close relationships with the rest of the business. The most important ties for them to forge are with executives in charge of products and services rather than with people overseeing business functions. As the story of Jonathan Goldman illustrates, their greatest opportunity to add value is not in creating reports or presentations for senior executives but in innovating with customer-facing products and processes.

LinkedIn isn’t the only company to use data scientists to generate ideas for products, features, and value-adding services. At Intuit data scientists are asked to develop insights for small-business customers and consumers and report to a new senior vice president of big data, social design, and marketing. GE is already using data science to optimize the service contracts and maintenance intervals for industrial products. Google, of course, uses data scientists to refine its core search and ad-serving algorithms. Zynga uses data scientists to optimize the game experience for both long-term engagement and revenue. Netflix created the well-known Netflix Prize, given to the data science team that developed the best way to improve the company’s movie recommendation system. The test-preparation firm Kaplan uses its data scientists to uncover effective learning strategies.

There is, however, a potential downside to having people with sophisticated skills in a fast-evolving field spend their time among general management colleagues. They’ll have less interaction with similar specialists, which they need to keep their skills sharp and their tool kit state-of-the-art. Data scientists have to connect with communities of practice, either within large firms or externally. New conferences and informal associations are springing up to support collaboration and technology sharing, and companies should encourage scientists to become involved in them with the understanding that “more water in the harbor floats all boats.”

Data scientists tend to be more motivated, too, when more is expected of them. The challenges of accessing and structuring big data sometimes leave little time or energy for sophisticated analytics involving prediction or optimization. Yet if executives make it clear that simple reports are not enough, data scientists will devote more effort to advanced analytics. Big data shouldn’t equal “small math.”

The Hot Job of the Decade

Hal Varian, the chief economist at Google, is known to have said, “The sexy job in the next 10 years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s?”

If “sexy” means having rare qualities that are much in demand, data scientists are already there. They are difficult and expensive to hire and, given the very competitive market for their services, difficult to retain. There simply aren’t a lot of people with their combination of scientific background and computational and analytical skills.

Data scientists today are akin to Wall Street “quants” of the 1980s and 1990s. In those days people with backgrounds in physics and math streamed to investment banks and hedge funds, where they could devise entirely new algorithms and data strategies. Then a variety of universities developed master’s programs in financial engineering, which churned out a second generation of talent that was more accessible to mainstream firms. The pattern was repeated later in the 1990s with search engineers, whose rarefied skills soon came to be taught in computer science programs.

One question raised by this is whether some firms would be wise to wait until that second generation of data scientists emerges, and the candidates are more numerous, less expensive, and easier to vet and assimilate in a business setting. Why not leave the trouble of hunting down and domesticating exotic talent to the big data startups and to firms like GE and Walmart, whose aggressive strategies require them to be at the forefront?

The problem with that reasoning is that the advance of big data shows no signs of slowing. If companies sit out this trend’s early days for lack of talent, they risk falling behind as competitors and channel partners gain nearly unassailable advantages. Think of big data as an epic wave gathering now, starting to crest. If you want to catch it, you need people who can surf.

Thomas H. Davenport is the President’s Distinguished Professor in Management and Information Technology at Babson College, a research fellow at the MIT Initiative on the Digital Economy, and a senior adviser at Deloitte Analytics. Author of over a dozen management books, his latest is Only Humans Need Apply: Winners and Losers in the Age of Smart Machines. D.J. Patil was appointed as the first U.S. chief data scientist and has led product development at LinkedIn, eBay, and PayPal. He is the author of Data Jujitsu: The Art of Turning Data into Product.

Index

A/B testing, 59–70

blocking in, 62

defined, 60

example of, 65–67

interpretation of results,

63–64

mistakes in, 68–69

multivariate, 62–63

origin of, 60

overview of, 60–63

real-time optimization, 68

retesting, 69

sequential, 62

uses of, 64–65

Albert (artificial intelligence

algorithm), 114–117

analytical models, 15, 22–23, 43,

84–85, 161, 198. See also

data models

analytics-based decision making.

See data-driven decisions

Anderson, Chris, 103

annual rate of return (ARR), 145,

147–148

artificial intelligence (AI),

114–117, 149–150. See also

machine learning

assumptions

confirming, through visualizations, 179

in predictive analytics, 84–86

asymmetrical loss function, 105

attributes versus benefits, in

marketing, 152–153

audience

data presentations and, 175,

178, 180, 184, 186, 188–189

understanding your, 180

automation, 112

averages, 138–139

Bank of America, 18–19

bar charts, 185, 186–188

behavior patterns, 84

bias

cognitive, 156–163

confirmation, 156–158

linear, 135–140, 145–151

overconfidence, 159–161

overfitting, 161–163

visualizations and, 181–182

big data, 13–14, 104, 212, 222,

223

blocking, in A/B tests, 62

Boston Consulting Group (BCG),

105

Box, George, 15

box-and-whisker plots, 193

Buffett, Warren, 23

business experiments. See

experiments

business relevance, of data findings, 121, 127–128

causation, 40, 92–93, 96, 103,

104–109

charts, 26, 150–151, 183–190. See

also data visualization

Cigna, 15, 17

clarity of causality, 104–109

Clark, Ed, 23–24

cognitive psychology, 134. See

also cognitive traps

cognitive traps, 156–163

color, in visualizations, 184–185,

186

communication

accuracy in, 165–167, 186–188

of data findings, 5–6, 20,

165–167, 173–176

of data needs, 37–40

with data scientists, 21–22,

38–43

responding to challenges,

199–203

of uncertainty, 191–198

visualizations (see data

visualization)

confidence interval, 126–127, 129

confirmation bias, 156–158

context, in rendering data, 181,

184

control groups, 48–49

correlation, 40, 68, 92–93,

96–101, 158

acting on, 103–109

confidence in recurrence of,

104–105

spurious, 68, 93, 96–101, 162

culture, of inquiry, 22–23

customer lifetime value, 81, 141

data

anonymized, 41–42

big, 13–14, 104, 212, 222, 223

challenges to, 199–203

cherry-picking, 158, 162

clean, 42–43, 75–76

collection, 26, 34–36, 39–42

communication of (see

communication)

context for, 181, 184

costs of, 40, 41–42

errors in, 72–73

evaluating, 43, 72–76

external, 39, 48

forward-looking, 35

integration of, 76–78

internal, 39, 48

for machine learning, 117

metadata, 178, 181

versus metrics, 51–58

needs, 33–36, 39–40

outliers, 5, 21, 85, 166–169

for predictive analytics,

82–83

presenting, 184–190

qualitative, 35–36

quality of, 71–78, 94–95, 181

quantitative, 35–36

for regression analysis, 94–95

scrubbing, 75–76

story, 20, 35, 207

structured, 42–43

trusting the, 71–78, 94–95

unstructured, 42–43

uses of, 1

visualizing (see data

visualization)

data analysts. See data scientists

data analytics

deceptive, 165–169

predictive, 81–86

process, 3–6

data audits, 47–48

data collection, 26, 34–36,

39–42

data-driven behaviors, 7–9

data-driven decisions, 7–9, 14,

16, 18, 23–24, 155–164

data experts. See data scientists

data models

overfitting, 43, 161–163

simplicity of, 43

See also analytical models

data scientists

asking questions of, 21–22, 85

compensation for, 219

education programs for,

218–219

finding and attracting,

212–220

job of, 209–223

requesting data and analytics

from, 37–44

retention of, 219–222

role of, 6–7, 213–215

skills of, 214–215

working with, 18–19, 220–222

data uncertainty, 193, 197

data visualization, 177–190

audience for, 180, 184

accuracy of, 186–188

charts, 26, 150–151, 183–190

color and shading in, 184–185,

186

to combat linear bias, 150–151

context in, 184

reasons for, 178–179, 181

risks, 181–182

of uncertainty, 193–195,

197–198

deception, 165–169

decision making

cognitive traps in, 155–163

data-driven, 7–9, 14, 16, 18,

23–24, 155–164

factors in, 205–207

heuristics, 156

intuition and, 95–96

linear thinking and, 131–154

statistical significance and,

121–129

unconscious, 206–207

de-duplication, 76

dependent variables, 88, 90, 91

devil’s advocate, playing, 22–23,

160

DoSomething.org, 51–52, 58

Duhigg, Charles, 84

economies of scale, 143–144

education

programs for data scientists,

218–219

to learn analytics, 15, 17

in visualizations, 179

enterprise data systems, 35

errors

in data, 73, 75–76, 168

non-sampling, 128–129

prediction, 138–139

sampling, 122–123, 128

error term, 91, 95

experiments, 40

A/B tests, 59–69

design of, 17, 45–50

field, 148–149

narrow questions for, 46

plan for, 49–50

randomization in, 48–49,

61–62

randomized controlled, 41, 60

results from, 50

statistical significance and, 125

study populations for, 48

exploration, in visualizations, 179

exploratory data analysis (EDA),

179, 181, 182

external data, 39, 48

Facebook, 40, 214

favorable outcomes, 196–197

field experiments, 148–149

financial drivers, 55–56

findings, communication of, 5–6,

20, 165–167, 173–176

Fisher, Ronald, 60

Friedman, Frank, 22

Fung, Kaiser, 59–64, 68–69

future predictions. See

predictions

General Electric (GE), 221

Goldman, Jonathan, 209–211,

220

Google, 46, 161, 220

Gottman, John, 174–175

governing objectives, 54–57

Greylock Partners, 212

gut feelings, 206–207. See also

intuition

Hadoop, 212, 214

Harley-Davidson, 114–117

heuristics, 156

Hoffman, Reid, 210

hypothesis, 19

null, 124–125

independent variables, 88, 90,

91–92

indicators, 146–148

information overload, 155

inquiry, culture of, 22–23

internal data, 39, 48

Intuit, 221

intuition, 95–96, 113, 117–118

Kahneman, Daniel, 160

Kaplan (test preparation firm),

221

Kempf, Karl, 18

Klein, Gary, 160

lift, 64

linear thinking, 131–154

awareness of, 146

limiting pitfalls of, 145–151

in practice, 135–140

LinkedIn, 209–211, 217, 220

Loveman, Gary, 23

machine learning, 111–119

assessment of problem for,

112–114

data for, 117

defined, 112–113

example, 114–117

intuition and, 117–118

moving forward with, 118–119

marketing, 152–153

medium maximization, 147

Mendel, Gregor, 173–174

Merck, 20–21, 22

metadata, 178, 181

metrics

choosing, 54–57, 139–140

versus data, 51–58

intermediate, 146–147

managing, 58

performance, 139–140

using too many, 68

vanity, 53–54

Minnesota Coronary Experiment, 156–157

mistakes

in A/B testing, 68–69

in machine learning, 117–118

in regression analysis, 94–96

in statistical significance,

128–129

mortgage crisis, 84–85

multivariate testing, 62–63

Netflix, 220

Nightingale, Florence, 177–178

noise, 61, 161–162

nonfinancial measures, 55–56

nonlinear relationships, 131–154

linear bias and, 135–140

mapping, 149–151

marketing and, 152–153

performance metrics and,

139–140

types of, 140–145, 148–149

non-sampling error, 128–129

null hypothesis, 124–125

objectives, aligning metrics with,

54–57

observational studies, 40

outcomes, 146–148, 196–197

outliers, 5, 21, 85, 166–169, 193

overconfidence, 159–161

overfitting, 43, 161–163

patterns, random, 163

performance metrics, 139–140

persuasion, 206–207

per unit profit, 143–144

Pollan, Michael, 189

polls, 126–127, 191–193

population variation, 123–124

practical significance, 121,

126–128

prediction errors, 138–139

predictions, 81–86, 161, 191–193

predictive analytics, 81–86

assumptions, 84–86

data for, 82–83

questions to ask about, 85

regression analysis, 83

statistics of, 83

uses for, 81–82

pre-mortems, 160

presentations, 184–190

presidential election (2016),

191–193, 196–197

privacy, 41–42

probability, 195–197

problems, framing, 19–21

p-value, 125, 126

qualitative data, 35–36

quality, of data, 71–78, 94–95,

181

quantitative analysts. See data

scientists

quantitative data, 35–36. See also

data

quants. See data scientists

questions

to ask data experts, 21–22,

38–43, 85

asking the right, 35, 38

for experiments, 46

for focused data search, 34–36

for investigating outliers, 168

for understanding audience,

180

machine learning and, 117–118

randomization, 48–49, 61–62, 68

randomized controlled experiments, 41, 60

random noise, 161–162

random patterns, 163

real-time optimization, 68

regression analysis, 83, 87–101

correlation and causation,

92–93, 96–101

data for, 94–95

defined, 88

dependent variables in, 88,

90, 91

error term, 91, 95

independent variables in, 88,

90, 91–92

mistakes in, 94–96

process of, 88–92

use of, 92

regression coefficients, 83

regression line, 90–91

replication crisis, 49–50

results, communication of, 5–6,

20, 165–167, 173–176

return on investment (ROI),

20–21, 145

Roumeliotis, George, 175, 215,

216, 218

sample size, 123

sampling error, 122–123, 128

Shutterstock, 65–67

Silver, Nate, 161

spurious correlations, 68, 93,

96–101, 162

statistical methods story, 20

statistical models. See analytical

models

statistical significance, 121–129,

158

calculation of, 125

confidence interval and,

126–127, 129

definition, 122

mistakes when working with,

128–129

null hypothesis and, 124–125

overview of, 122–125

sampling error and, 122–123

use of, 126–128

variation and, 123–124

statistical uncertainty, 193, 197

statistics

deceptive, 165–169

evaluating, 57

machine learning and, 117–118

picking, 54–57

in predictive analytics, 83

summary, 26, 28

storytelling, with data, 20, 35, 207

structured data, 42–43

study populations, 48

summary statistics, 26, 28

Summers, Larry, 21

surveys, 128–129, 138

tables, 185

time-series plots, 26, 27

Toronto-Dominion Bank (TD),

23–24

treatment groups, in experimentation, 48–49

uncertainty, 191–198

data, 193, 197

favorable outcomes and,

196–197

probability and, 195–197

statistical, 193, 197

unconscious decision making,

206–207

unstructured data, 42–43

user privacy, 41–42

vanity metrics, 53–54

variables

comparing dissimilar, 99

dependent, 88, 90, 91

independent, 88, 90, 91–92

Varian, Hal, 222

variation, 28, 123–124

Vigen, Tyler, 96–98

visualization. See data

visualization

Yahoo, 214

Zynga, 221

  • Copyright
  • What You'll Learn
  • Contents
  • Introduction
  • Section 1: Getting Started
  • Chapter 1: Keep Up with Your Quants
  • Ch 2: A Simple Exercise to Help You Think Like a Data Scientist
  • Section 2: Gather the Right Information
  • Ch 3: Do You Need All That Data?
  • Ch 4: How to Ask Your Data Scientists for Data and Analytics
  • Ch 5: How to Design a Business Experiment
  • Ch 6: Know the Difference Between Your Data and Your Metrics
  • Ch 7: The Fundamentals of A/B Testing
  • Ch 8: Can Your Data Be Trusted?
  • Section 3: Analyze the Data
  • Chapter 9: A Predictive Analytics Primer
  • Ch 10: Understanding Regression Analysis
  • Ch 11: When to Act On a Correlation, and When Not To
  • Ch 12: Can Machine Learning Solve Your Business Problem?
  • Ch 13: A Refresher on Statistical Significance
  • Ch 14: Linear Thinking in a Nonlinear World
  • Ch 15: Pitfalls of Data-Driven Decisions
  • Ch 16: Don't Let Your Analytics Cheat the Truth
  • Section 4: Communicate Your Findings
  • Ch 17: Data Is Worthless If You Don't Communicate It
  • Ch 18: When Data Visualization Works--and When It Doesn't
  • Ch 19: How to Make Charts That Pop and Persuade
  • Ch 20: Why It's So Hard for Us to Communicate Uncertainty
  • Ch 21: Responding to Someone Who Challenges Your Data
  • Ch 22: Decisions Don't Start with Data
  • Appendix: Data Scientist: The Sexiest Job of the 21st Century
  • Index