Data Analytics Basics for Managers
Don’t let a fear of numbers hold you back.
Understand the numbers
Make better decisions
Present and persuade
SMARTER THAN THE AVERAGE GUIDE
HBR Guide to
US$19.95 Management
Today’s business environment brings with it an onslaught of data. Now more than ever, managers must know how to tease insight from data—to understand where the numbers come from, make sense of them, and use them to inform tough decisions. How do you get started?
Whether you’re working with data experts or running your own tests, you’ll find answers in the HBR Guide to Data Analytics Basics for Managers. This book describes three key steps in the data analysis process, so you can get the information you need, study the data, and communicate your findings to others.
You’ll learn how to:
• Identify the metrics you need to measure
• Run experiments and A/B tests
• Ask the right questions of your data experts
• Understand statistical terms and concepts
• Create effective charts and visualizations
• Avoid common mistakes
BONUS ARTICLE
Data Scientist: The Sexiest Job
of the 21st Century
ISBN-13: 978-1-63369-428-6
HBR Guide to Data Analytics Basics for Managers
Harvard Business Review Guides
Arm yourself with the advice you need to succeed on the
job, from the most trusted brand in business. Packed
with how-to essentials from leading experts, the HBR
Guides provide smart answers to your most pressing
work challenges.
The titles include:
HBR Guide to Being More Productive
HBR Guide to Better Business Writing
HBR Guide to Building Your Business Case
HBR Guide to Buying a Small Business
HBR Guide to Coaching Employees
HBR Guide to Data Analytics Basics for Managers
HBR Guide to Delivering Effective Feedback
HBR Guide to Emotional Intelligence
HBR Guide to Finance Basics for Managers
HBR Guide to Getting the Right Work Done
HBR Guide to Leading Teams
HBR Guide to Making Every Meeting Matter
HBR Guide to Managing Stress at Work
HBR Guide to Managing Up and Across
HBR Guide to Negotiating
HBR Guide to Office Politics
HBR Guide to Performance Management
HBR Guide to Persuasive Presentations
HBR Guide to Project Management
HBR Guide to Data Analytics Basics for Managers
HARVARD BUSINESS REVIEW PRESS
Boston, Massachusetts
Copyright 2018 Harvard Business School Publishing Corporation
All rights reserved
No part of this publication may be reproduced, stored in or introduced
into a retrieval system, or transmitted, in any form, or by any means
(electronic, mechanical, photocopying, recording, or otherwise),
without the prior permission of the publisher. Requests for permission
should be directed to [email protected], or mailed to
Permissions, Harvard Business School Publishing, 60 Harvard Way,
Boston, Massachusetts 02163.
The web addresses referenced in this book were live and correct at the
time of the book’s publication but may be subject to change.
Cataloging-in-Publication data is forthcoming
eISBN: 9781633694293
HBR Press Quantity Sales Discounts
Harvard Business Review Press titles are available at significant
quantity discounts when purchased in bulk for client gifts, sales
promotions, and premiums. Special editions, including books with
corporate logos, customized covers, and letters from the company
or CEO printed in the front matter, as well as excerpts of existing
books, can also be created in large quantities for special needs.
For details and discount information for both print and ebook for-
mats, contact [email protected], tel. 800-988-0886,
or www.hbr.org/bulksales.
What You’ll Learn
The vast amounts of data that companies accumulate to-
day can help you understand the past, make predictions
about the future, and guide your decision making. But
how do you use all this data effectively? How do you assess
whether your findings are accurate or significant?
How do you distinguish between causation and correla-
tion? And how do you present your results in a way that
will persuade others?
Understanding data analytics is an essential skill for
every manager. It’s no longer enough to hand this re-
sponsibility off to data experts. To be able to rely on the
evidence your analysts give you, you need to know where
it comes from and how it was generated—and what it
can and can’t teach you.
Using quantitative analysis as part of your decision
making helps you uncover new information and provides
you with more confidence in your choices—and
you don’t need to be deeply proficient in statistics to
do it. This guide gives you the basics so you can better
understand how to use data and analytics as you make
tough choices in your daily work. It walks you through
three fundamental steps of data analysis: gathering the
information you need, making sense of the numbers,
and communicating those findings to get buy-in and
spur others to action.
You’ll learn to:
• Ask the right questions to get the information
you need
• Work more effectively with data scientists
• Run business experiments and A/B tests
• Choose the right metrics to evaluate predictions
and performance
• Assess whether you can trust your data
• Understand the basics of regression analysis and
statistical significance
• Distinguish between correlation and causation
• Sidestep cognitive biases when making decisions
• Identify when to invest in machine learning—and
how to proceed
• Communicate and defend your findings to
stakeholders
• Visualize your data clearly and powerfully
Contents
Introduction
Why you need to understand data analytics.
SECTION ONE
Getting Started
1. Keep Up with Your Quants
An innumerate’s guide to navigating big data.
BY THOMAS H. DAVENPORT
2. A Simple Exercise to Help You Think Like a Data Scientist
An easy way to learn the process of data analytics.
BY THOMAS C. REDMAN
SECTION TWO
Gather the Right Information
3. Do You Need All That Data?
Questions to ask for a focused search.
BY RON ASHKENAS
4. How to Ask Your Data Scientists for Data and Analytics
Factors to keep in mind to get the information you need.
BY MICHAEL LI, MADINA KASSENGALIYEVA, AND RAYMOND PERKINS
5. How to Design a Business Experiment
Seven tips for using the scientific method.
BY OLIVER HAUSER AND MICHAEL LUCA
6. Know the Difference Between Your Data and Your Metrics
Understand what you’re measuring.
BY JEFF BLADT AND BOB FILBIN
7. The Fundamentals of A/B Testing
How it works—and mistakes to avoid.
BY AMY GALLO
8. Can Your Data Be Trusted?
Gauge whether your data is safe to use.
BY THOMAS C. REDMAN
SECTION THREE
Analyze the Data
9. A Predictive Analytics Primer
Look to the future by looking at the past.
BY THOMAS H. DAVENPORT
10. Understanding Regression Analysis
Evaluate the relationship between variables.
BY AMY GALLO
11. When to Act On a Correlation, and When Not To
Assess your confidence in your findings and the risk of being wrong.
BY DAVID RITTER
12. Can Machine Learning Solve Your Business Problem?
Steps to take before investing in artificial intelligence.
BY ANASTASSIA FEDYK
13. A Refresher on Statistical Significance
Check if your results are real or just luck.
BY AMY GALLO
14. Linear Thinking in a Nonlinear World
A common mistake that leads to errors in judgment.
BY BART DE LANGHE, STEFANO PUNTONI, AND RICHARD LARRICK
15. Pitfalls of Data-Driven Decisions
The cognitive traps to avoid.
BY MEGAN MacGARVIE AND KRISTINA McELHERAN
16. Don’t Let Your Analytics Cheat the Truth
Pay close attention to the outliers.
BY MICHAEL SCHRAGE
SECTION FOUR
Communicate Your Findings
17. Data Is Worthless If You Don’t Communicate It
Tell people what it means.
BY THOMAS H. DAVENPORT
18. When Data Visualization Works—and When It Doesn’t
Not all data is worth the effort.
BY JIM STIKELEATHER
19. How to Make Charts That Pop and Persuade
Five questions to help give your numbers meaning.
BY NANCY DUARTE
20. Why It’s So Hard for Us to Communicate Uncertainty
Illustrating—and understanding—the likelihood of events.
AN INTERVIEW WITH SCOTT BERINATO
BY NICOLE TORRES
21. Responding to Someone Who Challenges Your Data
Ensure the data is thorough, then make them an ally.
BY JON M. JACHIMOWICZ
22. Decisions Don’t Start with Data
Influence others through story and emotion.
BY NICK MORGAN
Appendix: Data Scientist: The Sexiest Job of the 21st Century
BY THOMAS H. DAVENPORT AND D.J. PATIL
Index
Introduction
Data is coming into companies at remarkable speed and
volume. From small, manageable data sets to big data
that is recorded every time a consumer buys a product or
likes a social media post, this information offers a range
of opportunities to managers.
Data allows you to make better predictions about the
future—whether a new retail location is likely to suc-
ceed, for example, or what a reasonable budget for the
next fiscal year might look like. It helps you identify the
causes of certain events—a failed advertising campaign,
a bad quarter, or even poor employee performance—so
you can adjust course if necessary. It allows you to iso-
late variables so that you can identify your customers’
wants or needs or assess the chances an initiative will
succeed. Data gives you insight on factors affecting your
industry or marketplace and can inform your decisions
about anything from new product development to hiring
choices.
But with so much information coming in, how do you
sort through it all and make sense of everything? It’s
tempting to hand that role off to your experts and ana-
lysts. But even if you have the brightest minds handling
your data, it won’t make a difference if you don’t know
what they’re doing or what it means. Unless you know
how to use that data to inform your decisions, all you
have is a set of numbers.
It’s quickly becoming a requirement that every deci-
sion maker have a basic understanding of data analyt-
ics. But if the thought of statistical analysis makes you
sweat, have no fear. You don’t need to become a data
scientist or statistician to understand what the numbers
mean (even if data scientists have the “sexiest job of the
21st century”—see the bonus article we’ve included in the
appendix). Instead, you as a manager need a clear un-
derstanding of how these experts reach their results and
how to best use that information to guide your own decisions.
You must know where their findings come from,
ask the right questions of data sets, and translate the re-
sults to your colleagues and other stakeholders in a way
that convinces and persuades.
This book is not for analytics experts—the data sci-
entists, analysts, and other specialists who do this work
day in, day out. Instead, it’s meant for managers who
may not have a background in statistical analysis but still
want to improve their decisions using data. This book
will not give you a detailed course in statistics. Rather,
it will help you better use data, so you can understand
what the numbers are telling you, identify where the re-
sults of those calculations may be falling short, and make
stronger choices about how to run your business.
What This Book Will Do
This guide walks you through three key areas of the data
analytics process: gathering the information you need,
analyzing it, and communicating your findings to others.
These three steps form the core of managerial data
analytics.
To fully understand these steps, you need to see the
process of data analytics and your role within it at a high
level. Section 1, “Getting Started,” provides two pieces
to help you digest the process from start to finish. First,
Thomas Davenport outlines your role in data analysis
and describes how you can work more effectively with
your data scientist and become a better consumer of
analytics. Then, you’ll find an easy exercise you can do
yourself to gather your own data, analyze it, and identify
what to do next in light of what you’ve discovered.
Once you have this basic understanding of the process,
you can move on to learn the specifics about each
step, starting with the data search.
Gather the right information
For any analysis, you need data—that’s obvious. But
what data you need and how to get it can be less clear
and can vary, depending on the problem to be solved.
Section 2 begins by providing a list of questions to ask
for a targeted data search.
There are two ways to get the information you need:
by asking others for existing data and analysis or by run-
ning your own experiment to gather new data. We ex-
plore both of these approaches in turn, covering how to
request information from your data experts (taking into
account their needs and concerns) and using the scientific
method and A/B testing for well-thought-out tests.
But any data search won’t matter if you don’t measure
useful things. Defining the right metrics ensures that
your results align with your needs. Jeff Bladt, chief data
officer at DoSomething.org, and Bob Filbin, chief data
scientist at Crisis Text Line, use the example of their own
social media campaign to explain how to identify and
work toward metrics that matter.
We end this section with a helpful process by data ex-
pert and company adviser Thomas C. Redman. Before
you can move forward with any analysis, you must know
if the information you have can be trusted. By following
his advice, you can assess the quality of your data, make
corrections as necessary, and move forward accordingly,
even if the data isn’t perfect.
Analyze the data
You have the numbers—now what? It’s usually at this
point in the process that managers flash back to their
college statistics courses and nervously leave the analy-
sis to an expert or a computer algorithm. Certainly, the
data scientists on your team are there to help. But you
can learn the basics of analysis without needing to un-
derstand every mathematical calculation. By focusing on
how data experts and companies use these equations (in-
stead of how they run them), we help you ask the right
questions and inform your decisions in real-world man-
agerial situations.
We begin section 3 by describing some basic terms
and processes. We define predictive analytics and how to
use them, and explain statistical concepts like regression
analysis, correlation versus causation, and statistical
significance. You’ll also learn how to assess if machine
learning can help solve your problem—and how to pro-
ceed if it does.
In this section, we also aim to help you avoid common
traps as you study data and make decisions. You’ll dis-
cover how to look at numbers in nonlinear ways, so your
predictions are more accurate. And you will find practical
ways to avoid injecting subconscious bias into your
choices.
Finally, recognize when the story you’re being told
may be too good to be true. Even with the best data—and
the best data analysts—the results may not be as clear as
you think. As Michael Schrage, research fellow at MIT’s
Sloan School Center for Digital Business, points out in
the last piece in this section, an unmentioned outlier can
throw an entire conclusion off base, which is a risk you
can’t take with your decision making.
Communicate your findings
“Never make the mistake of assuming that the results
will ‘speak for themselves,’” warns Thomas Davenport
in the final section of this book. You must know how to
communicate the results of your analysis and use that
information to persuade others and drive your decision
forward—the third step in the data analytics process.
Section 4 explains how to share data with others so
that the numbers support your message, rather than
distract from it. The next few chapters outline when
visualizations will be helpful to your data—and when
they won’t be—as well as the basics of making persuasive
charts. You’ll learn how to depict and explain the uncer-
tainty and the probability of events, as well as what to do
if someone questions your findings.
Data alone will not elicit change, though; you must
use this evidence in the right way to inform and change
the mindset of the person who sees it. Data is merely
supporting material, says presentations expert Nick
Morgan in the final chapter. To truly persuade, you need
a story with emotional power.
Set your organization up for success
While we hope that you’ll continue to learn and grow
your own analytical skills, it’s likely that you’ll continue
to work with data experts and quants throughout your
data journey. Understanding the role of the data scien-
tist will be crucial to ensuring your organization has the
capabilities it needs to grow through data.
Data scientists bring with them intense curiosity and
make new discoveries that managers and analysts may
not see themselves. As an appendix at the end of this
book, you’ll find Thomas H. Davenport and D.J. Patil’s
popular article “Data Scientist: The Sexiest Job of the
21st Century.” Davenport and Patil’s piece aims to help
you better understand this key player in an organiza-
tion—someone they describe as a “hybrid of data hacker,
analyst, communicator, and trusted adviser.” These indi-
viduals have rare qualities that, as a manager, you may
not fully understand. By reading through this piece,
you’ll have insight into how they think about and work
with data. What’s more, you’ll learn how to find, attract,
and develop data scientists to keep your company on the
competing edge.
Moving Forward
Data-driven decisions won’t come easily. But by understanding
the basics of data analytics, you’ll be able to ask
the right questions of data to pull the most useful informa-
tion out of the numbers. Before diving in to the chapters
that follow, though, ask yourself how often you’re incor-
porating data into your daily work. The assessment “Are
You Data Driven?” is a brief test that will help you target
your efforts. With that knowledge in mind, move through
the next sections with an open mind, ready to weave data
into each of your decisions.
ARE YOU DATA DRIVEN?
by Thomas C. Redman
Look at the list below and give yourself a point for ev-
ery behavior you demonstrate consistently and half
a point for those you follow most—but not all—of the
time. Be hard on yourself. If you can only cite an in-
stance or two, don’t give yourself any credit.
□ I push decisions down to the lowest possible level.
□ I bring as much diverse data and as many diverse viewpoints to any situation as I possibly can.
□ I use data to develop a deeper understanding of the business context and the problem at hand.
□ I develop an appreciation for variation.
□ I deal reasonably well with uncertainty.
□ I integrate my understanding of the data and its implications with my intuition.
□ I recognize the importance of high-quality data and invest to make improvements.
□ I conduct experiments and research to supplement existing data and address new questions.
□ I recognize that decision criteria can vary with circumstances.
□ I realize that making a decision is only the first step, and I revise decisions as new data comes
to light.
□ I work to learn new skills, and bring new data and data technologies into my organization.
□ I learn from my mistakes and help others to do so as well.
□ I strive to be a role model when it comes to data, and work with leaders, peers, and subordinates
to help them become data driven.
Tally your points. If you score less than 7, it’s im-
perative that you start changing the way you work as
soon as possible. Target those behaviors where you
gave yourself partial credit first and fully embed those
skills into your daily work. Then build on your success
by targeting those behaviors that you were unable to
give yourself any credit for. It may help to enlist a col-
league’s aid—the two of you can improve together.
If you score a 7 or higher, you’re showing signs of
being data driven. Still, strive for ongoing improve-
ment. Set a goal of learning a new behavior or two ev-
ery year. Take this test every six months to make sure
that you’re on track.
Adapted from “Are You Data Driven? Take a Hard Look in the Mirror” on hbr.org, July 11, 2013 (product # H00AX2).
Thomas C. Redman, “the Data Doc,” is President of Data Quality Solutions. He helps companies and people, including startups, multinationals, executives, and leaders at all levels, chart their courses to data-driven futures. He places special emphasis on quality, analytics, and organizational capabilities.
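The scoring rules of the "Are You Data Driven?" assessment can be sketched in a few lines of Python. This is only an illustration: the answer labels ("consistently", "mostly", "rarely") and the function names are assumptions made for this sketch, not part of the assessment. The point values, the cutoff of 7, and the advice to work on partial-credit behaviors first come from the text above.

```python
# Hypothetical answer labels; the assessment only describes the behaviors.
POINTS = {"consistently": 1.0, "mostly": 0.5, "rarely": 0.0}

# A score of 7 or higher is described as "showing signs of being data driven".
THRESHOLD = 7.0


def tally(answers: dict[str, str]) -> float:
    """Add up the points for a mapping of behavior -> answer label."""
    return sum(POINTS[label] for label in answers.values())


def is_data_driven(total: float) -> bool:
    """Apply the cutoff from the assessment."""
    return total >= THRESHOLD


def improvement_order(answers: dict[str, str]) -> list[str]:
    """Behaviors to work on: partial credit first, then no credit."""
    partial = [b for b, label in answers.items() if label == "mostly"]
    none = [b for b, label in answers.items() if label == "rarely"]
    return partial + none


if __name__ == "__main__":
    # Shortened, made-up answers for illustration only.
    example = {
        "use data to understand the problem": "consistently",
        "deal with uncertainty": "mostly",
        "conduct experiments": "rarely",
    }
    print(tally(example), is_data_driven(tally(example)))
    print(improvement_order(example))
```

Keeping the answers in a mapping makes it easy to retake the test every six months and compare which behaviors have moved from partial to full credit.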
SECTION ONE
Getting Started
CHAPTER 1
Keep Up with Your Quants
by Thomas H. Davenport
“I don’t know why we didn’t get the mortgages off our
books,” a senior quantitative analyst at a large U.S. bank
told me a few years ago. “I had a model strongly indicat-
ing that a lot of them wouldn’t be repaid, and I sent it to
the head of our mortgage business.”
When I asked the leader of the mortgage business why
he’d ignored the advice, he said, “If the analyst showed me
a model, it wasn’t in terms I could make sense of. I didn’t
even know his group was working on repayment prob-
abilities.” The bank ended up losing billions in bad loans.
We live in an era of big data. Whether you work in
financial services, consumer goods, travel and transportation,
or industrial products, analytics are becoming a
competitive necessity for your organization. But as the
banking example shows, having big data—and even people
who can manipulate it successfully—is not enough.
Companies need general managers who can partner effectively
with “quants” to ensure that their work yields
better strategic and tactical decisions.
Reprinted from Harvard Business Review, July–August 2013 (product #R1307L).
For people fluent in analytics—such as Gary Loveman
of Caesars Entertainment (with a PhD from MIT), Jeff
Bezos of Amazon (an electrical engineering and com-
puter science major from Princeton), or Sergey Brin and
Larry Page of Google (computer science PhD dropouts
from Stanford)—there’s no problem. But if you’re a typi-
cal executive, your math and statistics background prob-
ably amounts to a college class or two. You might be ad-
ept at using spreadsheets and know your way around a
bar graph or a pie chart, but when it comes to analytics,
you often feel quantitatively challenged.
So what does the shift toward data-driven decision
making mean for you? How do you avoid the fate of
the loss-making mortgage bank head and instead lead
your company into the analytical revolution, or at least
become a good foot soldier in it? This article—a primer
for non-quants—is based on extensive interviews with
executives, including some with whom I’ve worked as a
teacher or a consultant.
You, the Consumer
Start by thinking of yourself as a consumer of analytics.
The producers are the quants whose analyses and mod-
els you’ll integrate with your business experience and in-
tuition as you make decisions. Producers are, of course,
good at gathering the available data and making predictions
about the future. But most lack sufficient knowledge
to identify hypotheses and relevant variables and to
know when the ground beneath an organization is shift-
ing. Your job as a data consumer—to generate hypoth-
eses and determine whether results and recommenda-
tions make sense in a changing business environment—is
therefore critically important. That means accepting a
few key responsibilities. Some require only changes in
attitude and perspective; others demand a bit of study.
Learn a little about analytics
If you remember the content of your college-level statistics
course, you may be fine. If not, bone up on the basics
of regression analysis, statistical inference, and experi-
mental design. You need to understand the process for
making analytical decisions, including when you should
step in as a consumer, and you must recognize that every
analytical model is built on assumptions that producers
ought to explain and defend. (See the sidebar “Analytics-Based
Decision Making—in Six Steps.”) As the famous
statistician George Box noted, “All models are wrong,
but some are useful.” In other words, models intention-
ally simplify our complex world.
To become more data literate, enroll in an executive
education program in statistics, take an online course, or
learn from the quants in your organization by working
closely with them on one or more projects.
ANALYTICS-BASED DECISION MAKING—IN SIX KEY STEPS
When using big data to make big decisions, non-quants
should focus on the first and the last steps of
the process. The numbers people typically handle the
details in the middle, but wise non-quants ask lots of
questions along the way.
1. Recognize the problem or question. Frame the
decision or business problem, and identify possible
alternatives to the framing.
2. Review previous findings. Identify people who
have tried to solve this problem or similar ones—
and the approaches they used.
3. Model the solution and select the variables.
Formulate a detailed hypothesis about how particular
variables affect the outcome.
4. Collect the data. Gather primary and secondary
data on the hypothesized variables.
5. Analyze the data. Run a statistical model, assess
its appropriateness for the data, and repeat
the process until a good fit is found.
6. Present and act on the results. Use the data to
tell a story to decision makers and stakeholders
so that they will take action.
Jennifer Joy, the vice president of clinical operations
at Cigna, took the third approach. Joy has a nursing
degree and an MBA, but she wasn’t entirely comfortable
with her analytical skills. She knew, however, that the
voluminous reports she received about her call center
operations weren’t telling her whether the coaching calls
made to patients were actually helping to manage their
diseases and to keep them out of the hospital.
So Joy reached out to Cigna’s analytics group, in par-
ticular to the experts on experimental design—the only
analytical approach that can potentially demonstrate
cause and effect. She learned, for example, that she
could conduct pilot studies to discover which segments
of her targeted population benefit the most (and which
the least) from her call center’s services. Specifically, she
uses analytics to “prematch” pairs of patients and then to
randomly assign one member of the pair to receive those
services, while the other gets an alternative such as a
mail-order or an online-support intervention. Each pilot
lasts just a couple of months, and multiple studies are
run simultaneously—so Joy now gets information about
the effectiveness of her programs on a rolling basis.
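The pilot design described above, matching patients into pairs and then randomly giving one member of each pair the call center service, can be sketched as follows. This is a minimal illustration, not Cigna's method: the article does not say how patients are matched, so pairing neighbors on a single `risk_score` field is an assumption, as are the field names, function names, and arm labels.

```python
import random


def prematch(patients, score_key="risk_score"):
    """Pair patients with similar scores (assumed matching rule).

    Patients are sorted by the score and adjacent ones are paired.
    With an odd number of patients, the last one is left unpaired.
    """
    ranked = sorted(patients, key=lambda p: p[score_key])
    return [(ranked[i], ranked[i + 1]) for i in range(0, len(ranked) - 1, 2)]


def assign(pairs, rng):
    """Randomly give one member of each pair the coaching call.

    Returns a mapping of patient id -> arm label (hypothetical labels).
    """
    assignment = {}
    for first, second in pairs:
        # Coin flip decides which member of the pair gets the service.
        if rng.random() < 0.5:
            first, second = second, first
        assignment[first["id"]] = "coaching_call"
        assignment[second["id"]] = "alternative"
    return assignment


if __name__ == "__main__":
    # Made-up patients for illustration only.
    patients = [{"id": i, "risk_score": s}
                for i, s in enumerate([0.9, 0.1, 0.5, 0.2, 0.8, 0.45])]
    arms = assign(prematch(patients), random.Random(2018))
    print(arms)
```

Because chance alone decides who in each matched pair gets the calls, a later difference in outcomes between the two arms is harder to attribute to how the patients were selected.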
In the end, Joy and her quant partners learned that
the coaching worked for people with certain diseases
but not for other patients, and some call center staff
members were redeployed as a result. Now her group
regularly conducts 20 to 30 such tests a year to find out
what really makes a difference for patients. She may
not understand all the methodological details, but as
Michael Cousins, the vice president of U.S. research and
analytics at Cigna, attests, she’s learned to be “very ana-
lytically oriented.”
Align yourself with the right kind of quant
Karl Kempf, a leader in Intel’s decision-engineering
group, is known at the company as the “überquant” or
“chief mathematician.” He often says that effective quan-
titative decisions “are not about the math; they’re about
the relationships.” What he means is that quants and the
consumers of their data get much better results if they
form deep, trusting ties that allow them to exchange in-
formation and ideas freely.
Of course, highly analytical people are not always
known for their social skills, so this can be hard work.
As one wag jokingly advised, “Look for the quants who
stare at your shoes, instead of their own, when you engage
them in conversation.” But it’s possible to find
people who communicate well and have a passion for
solving business—rather than mathematical—problems
and, after you’ve established a relationship, to encourage
frank dialogue and data-driven dissent between the two
of you.
Katy Knox, at Bank of America, has learned how to
align with data producers. As the head of retail strat-
egy and distribution for the bank’s consumer division,
she oversees 5,400-plus branches serving more than
50 million consumers and small businesses. For several
years she’s been pushing her direct reports to use analyt-
ics to make better decisions—for example, about which
branches to open or close, how to reduce customer wait
times, what incentives lead to multichannel interactions,
and why some salespeople are more productive than
others.
Bank of America has hundreds of quants, but most of
them were pooled in a group that managers could not
easily access. Knox insisted on having her own analyt-
ics team, and she established a strong working relation-
ship with its members through frequent meetings and
project-reporting sessions. She worked especially closely
with two team leaders, Justin Addis and Michael Hyzy,
who have backgrounds in retail banking and Six Sigma,
so they’re able to understand her unit’s business prob-
lems and communicate them to the hard-core quants
they manage. After Knox set the precedent, Bank of
America created a matrix structure for its analysts in the
consumer bank, and most now report to both a business
line and a centralized analytical group.
Focus on the beginning and the end
Framing a problem—identifying it and understanding
how others might have solved it in the past—is the most
important stage of the analytical process for a consumer
of big data. It’s where your business experience and in-
tuition matter most. After all, a hypothesis is simply a
hunch about how the world works. The difference with
analytical thinking, of course, is that you use rigorous
methods to test the hypothesis.
For example, executives at the two corporate parent
organizations of Transitions Optical believed that the
photochromic lens company might not be investing in
marketing at optimal levels, but no empirical data confirmed
or refuted that idea. Grady Lenski, who headed
the marketing division at the time, decided to hire
analytics consultants to measure the effectiveness of
different sales campaigns—a constructive framing that
expanded on the simple binary question of whether or
not costs were too high.
If you’re a non-quant, you should also focus on the
final step in the process—presenting and communicating
results to other executives—because it’s one that
many quants discount or overlook and that you’ll prob-
ably have to take on yourself at some point. If analytics
is largely about “telling a story with data,” what type of
story would you favor? What kind of language and tone
would you use? Should the story be told in narrative or
visual terms? What types of graphics do you like? No
matter how sophisticated their analyses, quants should
be encouraged to explain their results in a straightfor-
ward way so that everyone can understand—or you
should do it for them. A statistical methods story (“first
we ran a chi-square test, and then we converted the cat-
egorical data to ordinal, next we ran a logistic regres-
sion, and then we lagged the economic data by a year”)
is rarely acceptable.
Many businesspeople settle on an ROI story: How
will the new decision-making model increase conversions,
revenue, or profitability? For example, a Merck
executive with responsibility for a global business unit
has worked closely with the pharmaceutical company’s
commercial analytics group for many years to answer a
variety of questions, including what the ROIs of direct-
to-consumer promotions are. Before an ROI analysis, he
and the group discuss what actions they will take when
they find out whether promotions are highly, marginally,
or not successful—to make clear that the effort isn’t
merely an academic exercise. After the analysis, the ex-
ecutive sits the analysts down at a table with his man-
agement team to present and debate the results.
Ask lots of questions along the way
Former U.S. Treasury Secretary Larry Summers, who
once served as an adviser to a quantitative hedge fund,
told me that his primary responsibility in that job was
to “look over shoulders”—that is, to ask the smart quants
in the firm equally smart questions about their models
and assumptions. Many of them hadn’t been pressed
like that before; they needed an intelligent consumer of
data to help them think through and improve their work.
No matter how much you trust your quants, don’t
stop asking them tough questions. Here are a few that
almost always lead to more-rigorous, defensible analyses.
(If you don’t understand a reply, ask for one that
uses simpler language.)
1. What was the source of your data?
2. How well do the sample data represent the
population?
3. Does your data distribution include outliers?
How did they affect the results?
4. What assumptions are behind your analysis?
Might certain conditions render your assump-
tions and your model invalid?
5. Why did you decide on that particular analytical
approach? What alternatives did you consider?
6. How likely is it that the independent variables
are actually causing the changes in the dependent
variable? Might other analyses establish causality
more clearly?
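Question 3, about outliers, lends itself to a quick check you can ask your quants to run. The sketch below is illustrative only: the order values are invented, and the 1.5 × IQR fence is one common rule of thumb for flagging outliers, not a method this article prescribes. It compares the mean and the median with and without the flagged values.

```python
from statistics import mean, median, quantiles

# Hypothetical order values in dollars; the 500 is a deliberate outlier.
order_values = [20, 22, 19, 21, 23, 20, 500]

def flag_outliers(values):
    """Return values outside the 1.5 * IQR fences (a common rule of thumb)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

outliers = flag_outliers(order_values)
trimmed = [v for v in order_values if v not in outliers]

print("outliers:", outliers)
print("mean with / without:", round(mean(order_values), 2), round(mean(trimmed), 2))
print("median with / without:", median(order_values), median(trimmed))
```

In this made-up sample a single extreme value pulls the mean from about 21 to about 89, while the median barely moves. That gap is a good prompt for the follow-up question: how did the outliers affect the results?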
Frank Friedman, the chief financial officer and managing
partner for finance and administration of Deloitte’s
U.S. business, is an inveterate questioner. He has assem-
bled a group of data scientists and quantitative analysts
to help him with several initiatives, including optimizing
the pricing of services, developing models that predict
employee performance, and identifying factors that drive
receivables. “People who work with me know I question
a lot—everything—always,” Friedman says. “After the
questioning, they know they will have to go back and
redo some of their analyses.” He also believes it’s vital to
admit when you don’t understand something: “I know I
am not the smartest person in the room in my meetings
with these people. I’m always pushing for greater clarity
[because] if I can’t articulate it, I can’t defend it to others.”
Establish a culture of inquiry, not advocacy
We all know how easily “figures lie and liars figure.” Analytics
consumers should never pressure their producers
with comments like “See if you can find some evidence in
the data to support my idea.” Instead, your explicit goal
should be to find the truth. As the head of Merck’s commercial
analytics group says, “Our management team
wants us to be like Switzerland. We work only for the
shareholders.”
In fact, some senior executives push their analysts to
play devil’s advocate. This sets the right cultural tone
and helps to refine the models. “All organizations seek
to please the leader,” explains Gary Loveman, of Caesars,
“so it’s critical to cultivate an environment that views
ideas as separate from people and insists on rigorous evi-
dence to distinguish among those ideas.”
Loveman encourages his subordinates to put forth
data and analysis, rather than opinions, and reveals
his own faulty hypotheses, conclusions, and decisions.
That way managers and quants alike understand that
his sometimes “lame and ill-considered views,” as he
describes them, need as much objective, unbiased test-
ing as anyone else’s. For example, he often says that his
greatest mistake as a new CEO was choosing not to fire
property managers who didn’t share his analytical ori-
entation. He thought their experience would be enough.
Loveman uses the example to show both that he’s fallible
and that he insists on being a consumer of analytics.
When It All Adds Up
Warren Buffett once said, “Beware of geeks . . . bearing
formulas.” But in today’s data-driven world, you can’t af-
ford to do that. Instead you need to combine the science
of analytics with the art of intuition. Be a manager who
knows the geeks, understands their formulas, helps im-
prove their analytic processes, effectively interprets and
communicates the findings to others, and makes better
decisions as a result.
Contrast the bank mentioned at the beginning of this
article with Toronto-Dominion Bank. TD’s CEO, Ed
Clark, is quantitatively literate (with a PhD in econom-
ics), and he also insists that his managers understand the
math behind any financial product the company depends
on. As a result, TD knew to avoid the riskiest-structured
products and get out of others before incurring major
losses during the 2008–2009 financial crisis.
TD’s emphasis on data and analytics affects other ar-
eas of the business as well. Compensation is closely tied
to performance-management measures, for example.
And TD’s branches stay open longer than most other
banks’ because Tim Hockey, the former head of retail
banking, insisted on systematically testing the effect of
extended retail hours (with control groups) and found
that they brought in more deposits. If anyone at a man-
agement meeting suggests a new direction, he or she is
pressed for data and analysis to support it. TD is not per-
fect, Clark acknowledges, but “nobody ever accuses us of
not running the numbers.”
Your organization may not be as analytical as TD, and
your CEO may not be like Ed Clark. But that doesn’t
mean you can’t become a great consumer of analytics
on your own—and set an example for the rest of your
company.
Thomas H. Davenport is the President’s Distinguished
Professor in Management and Information Technology
at Babson College, a research fellow at the MIT Initiative
on the Digital Economy, and a senior adviser at Deloitte
Analytics. Author of over a dozen management books,
his latest is Only Humans Need Apply: Winners and
Losers in the Age of Smart Machines.
CHAPTER 2
A Simple Exercise to Help You Think Like a Data Scientist
by Thomas C. Redman
For 20 years, I’ve used a simple exercise to help those
with an open mind (and a pencil, paper, and calculator)
get started with data. One activity won’t make you data
savvy, but it will help you become data literate, open your
eyes to the millions of small data opportunities, and en-
able you to work a bit more effectively with data scien-
tists, analytics, and all things quantitative.
Adapted from “How to Start Thinking Like a Data Scientist” on hbr.org,
November 29, 2013.
While the exercise is very much a how-to, each step
also illustrates an important concept in analytics—from
understanding variation to visualization.
First, start with something that interests, even both-
ers, you at work, like consistently late-starting meetings.
Form it up as a question and write it down: “Meetings
always seem to start late. Is that really true?”
Next, think through the data that can help answer
your question and develop a plan for creating it. Write
down all the relevant definitions and your protocol for
collecting the data. For this particular example, you have
to define when the meeting actually begins. Is it the time
someone says, “OK, let’s begin”? Or the time the real
business of the meeting starts? Does kibitzing count?
Now collect the data. It is critical that you trust the
data. And, as you go, you’re almost certain to find gaps
in data collection. You may find that even though a meeting
has started, it starts anew when a more senior person
joins in. Modify your definition and protocol as you
go along.
Sooner than you think, you’ll be ready to start draw-
ing some pictures. Good pictures make it easier for you
to both understand the data and communicate main
points to others. There are plenty of good tools to help,
but I like to draw my first picture by hand. My go-to plot
is a time-series plot, where the horizontal axis has the
date and time and the vertical axis has the variable of interest.
Thus, a point on the graph in figure 2-1 is the date
and time of a meeting versus the number of minutes late.
FIGURE 2-1
How late are meetings?
Now return to the question that you started with
and develop summary statistics. Have you discovered an
answer? In this case, “Over a two-week period, 10% of
the meetings I attended started on time. And on average,
they started 12 minutes late.”
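If you would rather let a computer do the arithmetic than a calculator, the two summary statistics quoted here take only a few lines. The sketch below is a minimal illustration: the ten minutes-late values are invented, chosen so that they reproduce the headline numbers in the text, and the variable names are assumptions rather than part of the exercise.

```python
from statistics import mean

# Hypothetical minutes-late values for ten meetings (0 means on time).
minutes_late = [0, 8, 10, 12, 14, 15, 16, 12, 13, 20]

# Share of meetings that started exactly on time.
on_time_share = sum(1 for m in minutes_late if m == 0) / len(minutes_late)

# Average delay across all meetings, on-time ones included.
average_late = mean(minutes_late)

print(f"{on_time_share:.0%} of meetings started on time")
print(f"Average delay: {average_late} minutes")
```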
But don’t stop there. Ask yourself, “So what?” In this
case, “If those two weeks are typical, I waste an hour a
day. And that costs the company x dollars a year.”
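The “so what?” step is simple arithmetic, and writing it down makes the assumptions visible. In the sketch below only the 12-minute average comes from the text. The five meetings a day, the 230 working days, and the $75 hourly cost are placeholder assumptions standing in for your own figures, and the function names are hypothetical.

```python
def wasted_hours_per_day(meetings_per_day, avg_minutes_late):
    """Hours lost per day waiting for meetings to start."""
    return meetings_per_day * avg_minutes_late / 60

def annual_cost(hours_per_day, workdays_per_year, hourly_cost):
    """Yearly cost of that waiting time for one person."""
    return hours_per_day * workdays_per_year * hourly_cost

# 12 minutes is from the text; the other inputs are assumptions.
hours = wasted_hours_per_day(meetings_per_day=5, avg_minutes_late=12)
cost = annual_cost(hours, workdays_per_year=230, hourly_cost=75.0)

print(f"{hours} hour(s) a day, ${cost:,.0f} a year")
```

Changing any one input shows how sensitive the “x dollars a year” figure is to what you assume about a typical week.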
Many analyses end because there is no “so what?”
Certainly if 80% of meetings start within a few minutes
of their scheduled start times, the answer to the original
question is, “No, meetings start pretty much on time,”
and there is no need to go further.
But this case demands more, as some analyses do. Get
a feel for variation. Understanding variation leads to a
better feel for the overall problem, deeper insights, and
novel ideas for improvement. Note on the graph that
8–20 minutes late is typical. A few meetings start right
on time, others nearly a full 30 minutes late. It would
be great if you could conclude, “I can get to meetings
10 minutes late, just in time for them to start,” but the
variation is too great.
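A rough way to put numbers on variation is sketched below. The twelve values are invented to resemble the pattern described (a couple of on-time starts, most meetings 8 to 20 minutes late, one nearly 30), and the 10-minute arrival rule is the one floated in the text. Nothing here comes from the actual data behind figure 2-1.

```python
from statistics import pstdev

# Hypothetical minutes-late values for twelve meetings.
minutes_late = [0, 0, 8, 9, 11, 12, 14, 15, 17, 18, 20, 28]

# Range: the gap between the most and least punctual meeting.
spread = max(minutes_late) - min(minutes_late)

# Share of meetings that fall in the "typical" 8-to-20-minute band.
typical_share = sum(1 for m in minutes_late if 8 <= m <= 20) / len(minutes_late)

# If you arrived 10 minutes late, how many meetings would already have begun?
already_started = sum(1 for m in minutes_late if m < 10) / len(minutes_late)

print("range (minutes):", spread)
print("standard deviation:", round(pstdev(minutes_late), 1))
print(f"typical band: {typical_share:.0%}, missed starts: {already_started:.0%}")
```

With these made-up values, showing up 10 minutes late would mean walking into one meeting in three after it had begun, which is the practical meaning of “the variation is too great.”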
Now ask, “What else does the data reveal?” It strikes
me that six meetings began exactly on time, while every
other meeting began at least seven minutes late. In this
case, bringing meeting notes to bear reveals that all six
on-time meetings were called by the vice president of
finance. Evidently, she starts all her meetings on time.
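Bringing meeting notes to bear amounts to adding a second column to the data and grouping by it. The sketch below assumes you recorded who called each meeting. The organizers other than the vice president of finance and all of the values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (organizer, minutes late) pairs taken from meeting notes.
meetings = [
    ("VP of finance", 0),
    ("VP of finance", 0),
    ("Marketing lead", 12),
    ("Marketing lead", 9),
    ("Project manager", 18),
    ("Project manager", 7),
]

# Group the delays by the person who called the meeting.
by_organizer = defaultdict(list)
for organizer, late in meetings:
    by_organizer[organizer].append(late)

average_by_organizer = {name: mean(values) for name, values in by_organizer.items()}
always_on_time = [name for name, values in by_organizer.items() if max(values) == 0]

print(average_by_organizer)
print("always on time:", always_on_time)
```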
So where do you go from here? Are there important
next steps? This example illustrates a common dichot-
omy. On a personal level, results pass both the “inter-
esting” and “important” test. Most of us would give al-
most anything to get back an hour a day. And you may
not be able to make all meetings start on time, but if the
VP can, you can certainly start the meetings you control
promptly.
On the company level, results so far pass only the in-
teresting test. You don’t know whether your results are
typical, nor whether others can be as hard-nosed as the
VP when it comes to starting meetings. But a deeper look
is surely in order: Are your results consistent with others’
experiences in the company? Are some days worse than
others? Which starts later: conference calls or face-to-
face meetings? Is there a relationship between meeting
start time and most senior attendee? Return to step one,
pose the next group of questions, and repeat the process.
Keep the focus narrow—two or three questions at most.
I hope you’ll have fun with this exercise. Many find
joy in teasing insights from data. But whether you ex-
perience that joy or not, don’t take this exercise lightly.
There are fewer and fewer places for the “data illiterate”
and, in my humble opinion, no more excuses.
Thomas C. Redman, “the Data Doc,” is President of Data
Quality Solutions. He helps companies and people, in-
cluding startups, multinationals, executives, and leaders
at all levels, chart their courses to data-driven futures.
He places special emphasis on quality, analytics, and or-
ganizational capabilities.
SECTION TWO
Gather the Right Information
CHAPTER 3
Do You Need All That Data?
by Ron Ashkenas
Organizations love data: numbers, reports, trend lines,
graphs, spreadsheets—the more the better. And, as a re-
sult, many organizations have a substantial internal fac-
tory that churns out data on a regular basis, as well as
external resources on call that produce data for onetime
studies and questions. But what’s the evidence (or dare I
say “the data”) that all this data leads to better business
decisions? Is some amount of data collection unneces-
sary, and perhaps even damaging by creating complexity
and confusion?
Let’s look at a quick case study: For many years the
CEO of a premier consumer products company insisted
Adapted from content posted on hbr.org, March 1, 2010 (product
#H004FC).
on a monthly business review process that was highly
data-intensive. At its core was a “book” that contained
cost and sales data for every product sold by the com-
pany, broken down by business unit, channel, geography,
and consumer segment. This book (available electroni-
cally but always printed by the executive team) was sev-
eral inches thick. It was produced each month by many
hundreds of finance, product management, and information
technology people who spent thousands of hours
collecting, assessing, analyzing, reconciling, and sorting
the data.
Since this was the CEO’s way of running the business,
no one really questioned whether all of this activity was
worth it, although many complained about the time re-
quired. When a new CEO came on the scene a couple of
years ago, however, he decided that the business would
do just fine with quarterly reviews and exception-only
reporting. Suddenly the entire data-production indus-
try of the company was reduced substantially—and the
company didn’t miss a beat.
Obviously, different CEOs have different needs for
data. Some want their decisions to be based on as much
hard data as possible; others want just enough to either
reinforce or challenge their intuition; and still others
may prefer a combination of hard, analytical data with
anecdotal and qualitative input. In all cases, though,
managers would do well to ask themselves four ques-
tions about their data process as a way of improving the
return on what is often a substantial (but not always vis-
ible) investment:
1. Are we asking the right questions? Many compa-
nies collect the data that is available, rather than
the information that is needed to help make deci-
sions and run the business. So the starting point
is to be clear about a limited number of key ques-
tions that you want the data to help you answer—
and then focus the data collection around those
rather than everything else that is possible.
2. Does our data tell a story? Most data comes in
fragments. To be useful, these individual bits of
information need to be put together into a coher-
ent explanation of the business situation, which
means integrating data into a “story.” While
enterprise data systems have been useful in driving
consistent data definitions so that points can
be added and compared, they don’t automatically
create the story. Instead, managers should con-
sider in advance what data is needed to convey
the story that they will be required to tell.
3. Does our data help us look ahead rather than
behind? Most of the data that is collected in
companies tells managers how they performed in
a past period—but is less effective in predicting
future performance. Therefore, it is important to
ask what data, in what time frames, will help us
get ahead of the curve instead of just reacting.
4. Do we have a good mix of quantitative and quali-
tative data? Neither quantitative nor qualitative
data tells the whole story. For example, to make
good product and pricing decisions, we need to
know not only what is being sold to whom, but
also why some products are selling more than
others.
Clearly, business data and its analysis are critical for
organizations to succeed, which is underscored by the
fact that companies like IBM are investing billions of
dollars in acquisitions in the business intelligence and
analytics space. But even the best automated tools won’t
be effective unless managers are clear about these four
questions.
Ron Ashkenas is an Emeritus Partner with Schaffer
Consulting, a frequent contributor to Harvard Busi-
ness Review, and the author or coauthor of four books
on organizational transformation. He has worked with
hundreds of managers over the years to help them trans-
late strategy into results and simplify the way things get
done. He also is the coauthor (with Brook Manville) of
The Harvard Business Review Leader’s Handbook (Har-
vard Business Review Press; forthcoming in 2018). He
can be reached at [email protected].
CHAPTER 4
How to Ask Your Data Scientists for Data and Analytics
by Michael Li, Madina Kassengaliyeva, and Raymond Perkins
The intersection of big data and business is growing
daily. Although enterprises have been studying analyt-
ics for decades, data science is a relatively new capability.
And interacting in a new data-driven culture can be difficult,
particularly for those who aren’t data experts.
One particular challenge that many of these individ-
uals face is how to request new data or analytics from
data scientists. They don’t know the right questions to
ask, the correct terms to use, or the range of factors to
consider to get the information they need. In the end,
analysts are left uncertain about how to proceed, and
managers are frustrated when the information they get
isn’t what they intended.
At The Data Incubator, we work with hundreds of
companies looking to hire data scientists and data engineers
or enroll their employees in our corporate training
programs. We often field questions from our hiring
and training clients about how to interact with their data
experts. While it’s impossible to give an exhaustive ac-
count, here are some important factors to think about
when communicating with data scientists, particularly
as you begin a data search.
What Question Should We Ask?
As you begin working with your data analysts, be clear
about what you hope to achieve. Think about the business
impact you want the data to have and the company’s
ability to act on that information. By hearing what you
hope to gain from their assistance, the data scientist can
collaborate with you to define the right set of questions
to answer and better understand exactly what informa-
tion to seek.
Even the subtlest ambiguity can have major implications.
For example, advertising managers may ask analysts,
“What is the most efficient way to use ads to increase
sales?” Though this seems reasonable, it may not
be the right question since the ultimate objective of most
firms isn’t to increase sales, but to maximize profit. Research
from the Institute of Practitioners in Advertising
shows that using ads to reduce price sensitivity is typically
twice as profitable as trying to increase sales.1 The
value of the insight obtained will depend heavily on the
question asked. Be as specific and actionable as possible.
What Data Do We Need?
As you define the right question and objectives for
analysis, you and your data scientist should assess the
availability of the data. Ask if someone has already col-
lected the relevant data and performed analysis. The
ever-growing breadth of public data often provides eas-
ily accessible answers to common questions. Cerner, a
supplier of health care IT solutions, uses data sets from
the U.S. Department of Health and Human Services to
supplement their own data. iMedicare uses information
from the Centers for Medicare and Medicaid Services to
select policies. Consider whether public data could be
used toward your problem as well. You can also work
with other analysts in the organization to determine if
the data has previously been examined for similar rea-
sons by others internally.
Then, assess whether the available data is sufficient.
Data may not contain all the relevant information
needed to answer your questions. It may also be influenced
by latent factors that can be difficult to recognize.
Consider the vintage effect in private lending data: Even
seemingly identical loans typically perform very differently
based on the time of issuance, despite the fact
they may have had identical data at that time. The effect
comes from fluctuations in the underlying underwriting
standards at issuance, information that is not typically
represented in loan data.
You should also inquire if the data is unbiased, since
sample size alone is not sufficient to guarantee its validity.
Finally, ask if the data scientist has enough data to
answer the question. By identifying what information is
needed, you can help data scientists plan better analyses
going forward.
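The point that sample size alone does not guarantee validity can be shown with a small simulation. Everything in the sketch below is invented: the population of satisfaction scores, the assumption that only the happiest half of customers respond, and the sample sizes. It illustrates the idea and is not a description of any real survey.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical population: satisfaction scores for 100,000 customers.
population = [random.gauss(50, 10) for _ in range(100_000)]
true_mean = mean(population)

# Large but biased sample: only the happiest half of customers respond.
biased_sample = sorted(population)[50_000:]

# Small but unbiased sample: 200 customers chosen at random.
random_sample = random.sample(population, 200)

biased_error = abs(mean(biased_sample) - true_mean)
random_error = abs(mean(random_sample) - true_mean)

print(f"biased sample of {len(biased_sample):,}: off by {biased_error:.1f} points")
print(f"random sample of {len(random_sample):,}: off by {random_error:.1f} points")
```

The 50,000-response sample misses the true average by far more than the 200-person random sample does, which is why it pays to ask how the data was gathered before asking how much of it there is.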
How Do We Obtain the Data?
If more information is needed, data scientists must
decide between using data compiled by the company
through the normal course of business, such as through
observational studies, and collecting new data through
experiments. As part of your conversation with analysts,
ask about the costs and benefits of these options. Observational
studies may be easier and less expensive to arrange
since they do not require direct interaction with
subjects, for example, but they are typically far less reli-
able than experiments because they are only able to es-
tablish correlation, not causation.
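A short simulation can show why observational data may mislead about cause and effect. The setup below is entirely hypothetical: a promotion with a known true effect of 2 units, and a hidden “engagement” trait that makes customers both more likely to opt in and more likely to spend. The function names are assumptions made for this sketch.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the illustration is repeatable
TRUE_EFFECT = 2.0  # assumed lift from the promotion

def outcome(engagement, treated):
    """Spending depends on hidden engagement plus the promotion effect."""
    return 10 * engagement + (TRUE_EFFECT if treated else 0.0) + random.gauss(0, 1)

def simulate(assign, customers):
    """Assign each customer to the promotion or not, then record the outcome."""
    rows = []
    for engagement in customers:
        treated = assign(engagement)
        rows.append((treated, outcome(engagement, treated)))
    return rows

def difference_in_means(rows):
    treated = [y for t, y in rows if t]
    control = [y for t, y in rows if not t]
    return mean(treated) - mean(control)

customers = [random.random() for _ in range(20_000)]  # hidden engagement, 0 to 1

# Observational data: engaged customers are more likely to opt in.
observational_estimate = difference_in_means(
    simulate(lambda e: random.random() < e, customers))

# Experiment: a coin flip decides who gets the promotion.
experiment_estimate = difference_in_means(
    simulate(lambda e: random.random() < 0.5, customers))

print(f"observational estimate: {observational_estimate:.2f}")
print(f"experimental estimate:  {experiment_estimate:.2f} (true effect {TRUE_EFFECT})")
```

In this setup the observational comparison overstates the promotion’s effect because engaged customers select themselves into it, while random assignment recovers a value close to the true effect.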
Experiments allow substantially more control and
provide more reliable information about causality, but
they are often expensive and diffi cult to perform. Even
seemingly harmless experiments may carry ethical or so-
cial implications with real fi nancial consequences. Face-
book, for example, faced public fury over its manipula-
tion of its own newsfeed to test how emotions spread on
social media. Though the experiments were completely
legal, many users resented being unwitting participants
in Facebook’s tests. Managers must think beyond the
data and consider the greater brand repercussions of
data collection and work with data scientists to under-
stand these consequences. (See the sidebar, “Under-
standing the Cost of Data.”)
Before investing resources in new analysis, validate
that the company can use the insights derived from it
in a productive and meaningful way. This may entail
integration with existing technology projects, providing
new data to automated systems, and establishing new
processes.
UNDERSTANDING THE COST OF DATA
Though effective data analysis has been shown to generate
substantial financial gains, there can be many
different costs and complexities associated with it.
Obtaining good data may not only be difficult, but very
expensive. For example, in the health care and pharmaceutical
industry, data collection is often associated
with medical experimentation and patient observations.
These randomized control trials can easily
cost millions. Data storage can cost millions annually
as well. When interacting with data scientists, managers
should ask about the specific risks and costs associated
with obtaining and analyzing the data before
moving forward with a project.
But not all costs associated with data collection are
financial. Violations of user privacy can have enormous
legal and reputational repercussions. Privacy is one
of the most significant concerns regarding consumer
data. Managers must consider and weigh the legal
and ethical implications of their data collection and
analysis methods. Even seemingly anonymized data
can be used to identify individuals. Safely anonymized
data can be de-anonymized when combined with other
data sets. In a famous case, Carnegie Mellon University
researchers were able to identify anonymized health
care records of a former Massachusetts governor using
only his ZIP code, birthday, and gender.2 The Gartner
Data Center predicted that through 2016, over 25%
of firms using consumer data would incur reputation
damage due to privacy violation issues.3 Managers
must ask data scientists about these risks when working
with the company’s potentially sensitive consumer
data.
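To see how de-anonymization by combining data sets works in principle, consider the minimal sketch below. Every record is fictional, including the names, ZIP codes, and diagnoses. The matching rule, an exact match on ZIP code, birth date, and sex, is an assumption chosen to mirror the three fields mentioned in the sidebar, not a reconstruction of the researchers’ actual method.

```python
from collections import defaultdict

QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

# Fictional "anonymized" health records: names removed, other fields kept.
health_records = [
    {"zip": "11111", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "condition A"},
    {"zip": "22222", "birth_date": "1982-06-02", "sex": "F", "diagnosis": "condition B"},
]

# Fictional public list that includes names alongside the same three fields.
public_roll = [
    {"name": "Pat Example", "zip": "11111", "birth_date": "1950-01-15", "sex": "M"},
    {"name": "Lee Sample", "zip": "22222", "birth_date": "1990-03-30", "sex": "F"},
    {"name": "Kim Placeholder", "zip": "11111", "birth_date": "1971-11-08", "sex": "F"},
]

def key(row):
    """Combine the quasi-identifying fields into one lookup key."""
    return tuple(row[field] for field in QUASI_IDENTIFIERS)

def reidentify(anonymized, public):
    """Link records whose quasi-identifiers match exactly one named person."""
    names_by_key = defaultdict(list)
    for person in public:
        names_by_key[key(person)].append(person["name"])
    matches = []
    for record in anonymized:
        names = names_by_key.get(key(record), [])
        if len(names) == 1:  # a unique match reveals the identity
            matches.append((names[0], record["diagnosis"]))
    return matches

print(reidentify(health_records, public_roll))
```

Neither data set is sensitive on its own in this toy example. The risk appears only when the two are joined, which is the question to put to your data scientists before sharing or acquiring data.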
Is the Data Clean and Easy to Analyze?
In general, data comes in two forms: structured and unstructured.
Structured data is organized into predefined fields, as its name
implies, and is easy to add to a database. Most analysts find
it easier and faster to manipulate. Unstructured data
is often free form and cannot be as easily stored in the
types of relational databases most commonly used in en-
terprises. While unstructured data is estimated to make
up 95% of the world’s data, according to a report by pro-
fessors Amir Gandomi and Murtaza Haider of Ryerson
University, for many large companies, storing and manipulating
unstructured data may require a significant
investment of resources to extract necessary information.4
Working with your data scientists, evaluate the additional
costs of using unstructured data when defining
your initial objectives.
Even if the data is structured it still may need to be
cleaned or checked for incompleteness and inaccuracies.
When possible, encourage analysts to use clean data fi rst.
Otherwise, they will have to waste valuable time and re-
sources identifying and correcting inaccurate records.
A 2014 survey conducted by Ascend2, a marketing re-
search company, found that nearly 54% of respondents
complained that a “lack of data quality/completeness”
was their most prominent impediment. By searching for
clean data, you can avoid significant problems and loss
of time.
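Checking structured data for incompleteness and inaccuracies can start with a very simple audit. The sketch below is an assumption-laden illustration: the customer records, the required fields, and the 0-to-120 age range are all made up, and `audit` is a hypothetical helper rather than a function from any library.

```python
# Fictional customer records with a few deliberate problems.
records = [
    {"customer_id": "C001", "email": "ana@example.com", "age": 34},
    {"customer_id": "C002", "email": "", "age": 41},
    {"customer_id": "C003", "email": "li@example.com", "age": -5},
    {"customer_id": "C004", "email": "sam@example.com", "age": None},
]

def audit(rows, required=("customer_id", "email", "age")):
    """List (row index, field, problem) for missing or implausible values."""
    issues = []
    for i, row in enumerate(rows):
        for field in required:
            if row.get(field) in (None, ""):
                issues.append((i, field, "missing"))
        age = row.get("age")
        if isinstance(age, (int, float)) and not 0 <= age <= 120:
            issues.append((i, "age", "out of range"))
    return issues

issues = audit(records)
clean_share = 1 - len({i for i, _, _ in issues}) / len(records)

for issue in issues:
    print(issue)
print(f"{clean_share:.0%} of records passed every check")
```

Asking analysts for a summary like this before the analysis begins gives you an early read on how much cleaning time to budget.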
Is the Model Too Complicated?
Statistical techniques and open-source tools to analyze
data abound, but simplicity is often the best choice. More
complex and flexible tools are more prone to overfitting
and can take more time to develop (read more about
overfitting in chapter 15, “Pitfalls of Data-Driven Decisions”).
Work with your data scientists to identify the
simpler techniques and tools and move to more complex
models only if the simpler ones prove insufficient. It is
important to observe the KISS rule: “Keep It Simple,
Stupid!”
It may not be possible to avoid all of the expenses and
issues related to data collection and analysis. But you
can take steps to mitigate these costs and risks. By ask-
ing the right questions of your analysts, you can ensure
proper collaboration and get the information you need
to move forward confidently.
Michael Li is the founder and executive director of The
Data Incubator, a big data company that trains and places
data scientists. A data scientist himself, he has worked at
Google, Foursquare, and Andreessen Horowitz. He is a
regular contributor to VentureBeat, The Next Web, and
Harvard Business Review. Madina Kassengaliyeva is a
client services director with Think Big, a Teradata com-
pany. She helps clients realize high-impact business op-
portunities through effective implementation of big data
and analytics solutions. Madina has managed accounts
in the financial services and insurance industries and led
successful strategy, solution development, and analytics
engagements. Raymond Perkins is a researcher at Prince-
ton University working at the intersection of statistics,
data, and finance and is the executive director of the
Princeton Quant Trading Conference. He has also con-
ducted research at Hong Kong University of Science and
Technology, the Mathematical Sciences Research Insti-
tute (MSRI), and Michigan State University.
NOTES
1. P. F. Mouncey, “Marketing in the Era of Accountability,” Journal of Direct, Data and Digital Marketing Practice 9, no. 2 (December 2007): 225–228.
2. N. Anderson, “‘Anonymized’ Data Really Isn’t—and Here’s Why Not,” Ars Technica, September 8, 2009, https://arstechnica.com/tech-policy/2009/09/your-secrets-live-online-in-databases-of-ruin/.
3. D. Laney, “Information Innovation Key Initiative Overview,” Gartner Research, April 22, 2014, https://www.gartner.com/doc/2715317/information-innovation-key-initiative-overview.
4. A. Gandomi and M. Haider, “Beyond the Hype: Big Data Concepts, Methods, and Analytics,” International Journal of Information Management 35, no. 2 (April 2015): 137–144.
CHAPTER 5
How to Design a Business Experiment
by Oliver Hauser and Michael Luca
The rise of experimental evaluations within organizations, or
what economists refer to as field experiments, has the potential
to transform organizational decision making, providing fresh
insight into areas ranging from product design to human resources
to public policy. Companies that invest in randomized evaluations
can gain a game-changing advantage.
Yet while there has been a rapid growth in experiments, especially
within tech companies, we’ve seen too many run incorrectly. Even
when they’re set up properly, avoidable mistakes often happen
during implementation. As a result, many organizations fail to
receive the real benefits of the scientific method.
Adapted from “How to Design (and Analyze) a Business Experiment” on
hbr.org, October 29, 2015 (product #H02FSL).
This chapter lays out seven steps to ensure that your
experiment delivers the data and insight you need. These
principles draw on the academic research on field experiments
as well as our work with a variety of organizations ranging
from Yelp to the UK government.
1. Identify a Narrow Question
It is tempting to run an experiment around a question
such as “Is advertising worth the cost?” or “Should we re-
duce (or increase) our annual bonuses?” Indeed, begin-
ning with a question that is central to your broader goals
is a good start. But it’s misguided to think that a single
experiment will do the trick. The reason is simple: Multi-
ple factors go into answering these types of big questions.
Take the issue of whether advertising is worth the
cost. What form of advertising are we talking about, and
for which products, in which media, over which time pe-
riods? Your question should be testable, which means it
must be narrow and clearly defined. A better question
might be, “How much does advertising our brand name
on Google AdWords increase monthly sales?” This is an
empirical question that an experiment can answer—
and that feeds into the question you ultimately hope to
resolve. In fact, through just such an experiment, re-
searchers at eBay discovered that a long-standing brand-
advertising strategy on Google had no effect on the rate
at which paying customers visited eBay.
2. Use a Big Hammer
Companies experiment when they don’t know what will
work best. Faced with this uncertainty, it may sound ap-
pealing to start small in order to avoid disrupting things.
But your goal should be to see whether some version of
your intervention—your new change—will make a dif-
ference to your customers. This requires a large-enough
intervention.
For example, suppose a grocery store is considering
adding labels to items to show consumers that it sources
mainly from local farms. How big should the labels be
and where should they be attached? We would suggest
starting with large labels on the front of the packages, be-
cause if the labels were small or on the backs of the pack-
ages, and there were no effect (a common outcome for
subtle interventions), the store managers would be left
to wonder whether consumers simply didn’t notice the
tags (the treatment wasn’t large enough) or truly didn’t
care (there was no treatment effect). By starting with a
big hammer, the store would learn whether customers
care about local sourcing. If there’s no effect from large
labels on the package fronts, then the store should give
up on the idea. If there is an effect, the experimenters
can later refine the labels to the desired characteristics.
3. Perform a Data Audit
Once you know what your intervention is, you need to
choose what data to look at. Make a list of all the internal
data related to the outcome you would like to influence, and note
when you will need to take the measurements.
Include data both about things you hope will change and
things you hope won’t change as a result of the inter-
vention, because you’ll need to be alert for unintended
consequences. Think, too, about sources of external data
that might add perspective.
Say you’re launching a new cosmetics product and
you want to know which type of packaging leads to the
highest customer loyalty and satisfaction. You decide
to run a randomized controlled trial across geographi-
cal areas. In addition to measuring recurring orders and
customer service feedback (internal data), you can track
user reviews on Amazon and look for differences among
customers in different states (external data).
4. Select a Study Population
Choose a subgroup among your customers that matches
the customer profile you are hoping to understand. It
might be tempting to look for the easiest avenue to get
a subgroup, such as online users, but beware: If your
subgroup is not a good representation of your target
customers, the findings of your experiment may not be
applicable. For example, younger online customers who
shop exclusively on your e-commerce platform may be-
have very differently than older in-store customers. You
could use the former to generalize to your online plat-
form strategy, but you may be misguided if you try to
draw inferences from that group for your physical stores.
5. Randomize
Randomly assign some people to a treatment group and
others to a control group. The treatment group receives
the change you want to test, while the control group receives
what you previously had on offer. Make sure there are no
differences between the groups other than what you are testing.
The first rule of randomization is to not let participants
decide which group they are in, or the results will be
meaningless. The second is to make sure there really are
no differences between treatment and control.
It’s not always easy to follow the second rule. For ex-
ample, we’ve seen companies experiment by offering a
different coupon on Sunday than on Monday. The prob-
lem is that Sunday shoppers may be systematically dif-
ferent from Monday shoppers, even if you control for the
volume of shoppers on each day.
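As a minimal sketch of the first rule, the assignment can be left to a random number generator rather than to participants or staff. The customer IDs, the 50/50 split, and the seed below are assumptions made for illustration.

```python
import random


def assign_groups(customer_ids, seed=42):
    """Randomly split customers into treatment and control halves."""
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    shuffled = list(customer_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}


# Hypothetical customer IDs 1 through 100.
groups = assign_groups(range(1, 101))
print(len(groups["treatment"]), len(groups["control"]))
```

Assigning everyone at once from a single list, on the same day and through the same channel, also helps with the second rule: the coupon example above fails because the day of the week differs between the two groups.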
6. Commit to a Plan, and Stick to It
Before you run an experiment, lay out your plans in detail.
How many observations will you collect? How long
will you let the experiment run? What variables will be
collected and analyzed? Record these details. This can
be as simple as creating a Google spreadsheet or as official
as using a public trial registry. Not only will this level
of transparency make sure that everyone is on the same
page, it will also help you avoid well-known pitfalls in
the implementation of experiments.
Once your experiment is running, leave it alone! If
you get a result you expected, great; if not, that’s fine too.
The one thing that’s not OK: running your experiment
until your results look as though they fit your hypothesis,
rather than until the study has run its planned course.
This type of practice has contributed to a “replication crisis” in
psychology research; it can seriously bias your results
and reduce the insight you receive. Stick to the plan, to
the extent possible.
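Why stopping early is a problem can be seen in a small simulation. The sketch below runs A/A tests, in which both groups have the same true conversion rate, and compares judging the result once at the planned end with declaring a winner the first time any interim look crosses the usual 95% threshold. The 10% conversion rate, the ten looks of 100 visitors each, and the choice of a z-test are illustrative assumptions, not a description of any particular tool.

```python
import math
import random


def z_stat(conv_a, conv_b, n):
    """Two-proportion z statistic for two groups of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se


def false_positive_rates(sims=300, looks=10, batch=100, rate=0.10, seed=1):
    """Simulate A/A tests (no true effect) under two stopping rules."""
    rng = random.Random(seed)
    fixed_hits = peeking_hits = 0
    for _ in range(sims):
        conv_a = conv_b = 0
        significant_at_any_look = False
        for look in range(1, looks + 1):
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            significant = abs(z_stat(conv_a, conv_b, look * batch)) > 1.96
            significant_at_any_look = significant_at_any_look or significant
        fixed_hits += significant  # judged only at the planned end
        peeking_hits += significant_at_any_look  # stopped at the first "win"
    return fixed_hits / sims, peeking_hits / sims


fixed_rate, peeking_rate = false_positive_rates()
print(f"planned stop: {fixed_rate:.1%}; stop when it looks good: {peeking_rate:.1%}")
```

Because the final look is one of the interim looks, the second rate can never be lower than the first. In runs like this it is typically several times higher, even though no real difference exists between the groups.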
7. Let the Data Speak
To give a complete picture of your results, report multiple
outcomes. Sure, some might be unchanged, unimpressive,
or downright inexplicable. But better to
be transparent about them than to ignore them. Once
you’ve surveyed the main results, ask yourself whether
you’ve really discovered the underlying mechanism be-
hind your results—the factor that is driving them. If
you’re not sure, refine your experiment and run another
trial to learn more.
Experiments are already a central part of the social
sciences; they are quickly becoming central to organiza-
tions as well. If your experiments are well designed, they
will tell you something valuable. The most successful
will puncture your assumptions, change your practices,
and put you ahead of competitors. Experimentation is
a long-term, richly informative process, with each trial
forming the starting point for the next.
Oliver Hauser is a research fellow at Harvard Business
School and Harvard Kennedy School. He conducts re-
search and runs experiments with organizations and
governments around the world. Michael Luca is the
Lee J. Styslinger III Associate Professor of Business
Administration at Harvard Business School and works
with a variety of organizations to design experiments.
CHAPTER 6
Know the Difference Between Your Data and Your Metrics
by Jeff Bladt and Bob Filbin
How many views make a YouTube video a success? How
about 1.5 million? That’s how many views a video posted
in 2011 by our organization, DoSomething.org, received.
It featured some well-known YouTube celebrities, who
asked young people to donate their used sports equip-
ment to youth in need. It was twice as popular as any
video DoSomething.org had posted to date. Success!
Then came the data report: only eight viewers had signed
up to donate equipment, and no one actually donated.
Adapted from content posted on hbr.org, March 4, 2013.
Zero donations from 1.5 million views. Suddenly, it
was clear that for DoSomething.org, views did not equal
success. In terms of donations, the video was a complete
failure.
What happened? We were concerned with the wrong
metric. A metric contains a single type of data—video
views or equipment donations. A successful organization
can measure only so many things well, and what it
measures ties to its definition of success. For DoSomething.org,
that’s social change. In the case above, success
meant donations, not video views. As we learned, there
is a difference between numbers and numbers that mat-
ter. This is what separates data from metrics.
You Can’t Pick Your Data, but You Must Pick Your Metrics
Take baseball. Every team has the same definition of
success—winning the World Series. This requires one
main asset: good players. But what makes a player good?
In baseball, teams used to answer this question with a
handful of simple metrics like batting average and runs
batted in (RBIs). Then came the statisticians (remember
Moneyball?). New metrics provided teams with the ability
to slice their data in new ways, find better ways of defining
good players, and thus win more games.
Keep in mind that all metrics are proxies for what
ultimately matters (in the case of baseball, a combination
of championships and profitability), but some
are better than others. The data of the game has never
changed—there are still RBIs and batting averages.
What has changed is how we look at the data. And those
teams that slice the data in smarter ways are able to find
good players who have been traditionally undervalued.
Organizations Become Their Metrics
Metrics are what you measure. And what you measure
is what you manage to. In baseball, a critical question is,
how effective is a player when he steps up to the plate?
One measure is hits. A better measure turns out to be
the sabermetric “OPS,” the sum of on-base percentage (which
counts hits and walks) and slugging percentage (total bases per
at bat). Teams that look only at batting average suffer.
Players on these teams walk less, with no offsetting
gains in hits. In short, players play to the metrics their
management values, even at the cost of the team.
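For readers who want the arithmetic, OPS is on-base percentage plus slugging percentage. The sketch below uses the standard definitions of both; the two season lines are invented for illustration and show how two players with the same batting average can differ sharply on OPS.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """Times on base divided by plate appearances that count toward OBP."""
    reached = hits + walks + hit_by_pitch
    chances = at_bats + walks + hit_by_pitch + sacrifice_flies
    return reached / chances


def slugging(singles, doubles, triples, home_runs, at_bats):
    """Total bases per at bat."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


def ops(player):
    hits = player["1B"] + player["2B"] + player["3B"] + player["HR"]
    obp = on_base_percentage(hits, player["BB"], player["HBP"],
                             player["AB"], player["SF"])
    slg = slugging(player["1B"], player["2B"], player["3B"],
                   player["HR"], player["AB"])
    return obp + slg


def batting_average(player):
    hits = player["1B"] + player["2B"] + player["3B"] + player["HR"]
    return hits / player["AB"]


# Hypothetical season lines: both players have 150 hits in 500 at bats.
player_a = {"AB": 500, "1B": 110, "2B": 25, "3B": 5, "HR": 10,
            "BB": 30, "HBP": 0, "SF": 0}
player_b = {"AB": 500, "1B": 90, "2B": 35, "3B": 5, "HR": 20,
            "BB": 80, "HBP": 0, "SF": 0}

for name, p in (("A", player_a), ("B", player_b)):
    print(name, round(batting_average(p), 3), round(ops(p), 3))
```

A team that rewards only batting average would rate these two hitters as equals; one that tracks OPS would see that the second draws more walks and hits for more power.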
The same happens in workplaces. Measure YouTube
views? Your employees will strive for more and more
views. Measure downloads of a product? You’ll get more
of that. But if your actual goal is to boost sales or acquire
members, better measures might be return-on-invest-
ment (ROI), on-site conversion, or retention. Do people
who download the product keep using it or share it with
others? If not, all the downloads in the world won’t help
your business. (See the sidebar, “Picking Statistics,” to
learn how to choose metrics that align with a specific
performance objective.)
In the business world, we talk about the difference
between vanity metrics and meaningful metrics. Van-
ity metrics are like dandelions—they might look pretty,
but to most of us, they’re weeds, using up resources and
doing nothing for your property value. Vanity metrics
for your organization might include website visitors per
month, Twitter followers, Facebook fans, and media im-
pressions. Here’s the thing: If these numbers go up, they
might drive up sales of your product. But can you prove
it? If yes, great. Measure away. But if you can’t, they
aren’t valuable.
PICKING STATISTICS
by Michael Mauboussin
The following is a process for choosing metrics that allow
you to understand, track, and manage the cause-and-effect
relationships that determine your company’s performance. I will
illustrate the process in a simplified way using a retail bank
example that is based on an analysis of 115 banks by Venky Nagar
of the University of Michigan and Madhav Rajan of Stanford. Leave
aside, for the moment, which metrics you currently
use or which ones Wall Street analysts or bankers say
you should. Start with a blank slate and work through
these four steps in sequence.
1. Define Your Governing Objective
A clear objective is essential to business success be-
cause it guides the allocation of capital. Creating eco-
nomic value is a logical governing objective for a com-
pany that operates in a free market system. Companies
may choose a different objective, such as maximizing
the firm’s longevity. We will assume that the retail bank
seeks to create economic value.
2. Develop a Theory of Cause and Effect to Assess
Presumed Drivers of the Objective
The three commonly cited financial drivers of value creation
are sales, costs, and investments. More-specific financial
drivers vary among companies and can include earnings growth,
cash flow growth, and return on invested capital.
Naturally, financial metrics can’t capture all value-creating
activities. You also need to assess nonfinancial measures such as
customer loyalty, customer satisfaction, and product quality, and
determine if they can be directly linked to the financial measures
that ultimately deliver value. As we’ve discussed, the link between
value creation and financial and nonfinancial measures like these
is variable and must be evaluated on a case-by-case basis.
In our example, the bank starts with the theory that customer
satisfaction drives the use of bank services and that usage is the
main driver of value. This theory links a nonfinancial and a
financial driver. The bank then measures the correlations
statistically to see if the theory is correct and determines that
satisfied customers indeed use more services, allowing the bank to
generate cash earnings growth and attractive returns on assets,
both indicators of value creation. Having determined that customer
satisfaction is persistently and predictively linked to returns on
assets, the bank must now figure out which employee activities
drive satisfaction.
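Measuring a correlation of this kind can be sketched with the Pearson coefficient, one common choice; the sidebar does not say which statistic the bank study used. The branch-level satisfaction scores and services-per-customer figures below are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one entry per branch.
satisfaction = [6.1, 7.4, 5.2, 8.8, 7.9, 6.5, 9.1, 4.8]
services_per_customer = [2.0, 2.6, 1.7, 3.4, 2.9, 2.2, 3.3, 1.6]

r = pearson(satisfaction, services_per_customer)
print(f"correlation: {r:.2f}")
```

A coefficient near 1 supports the theory but does not prove that satisfaction causes usage, which is why the process also asks whether the link is persistent and predictive over time.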
3. Identify the Specific Activities That Employees
Can Do to Help Achieve the Governing Objective
The goal is to make the link between your objective
and the measures that employees can control through
the application of skill. The relationship between these
activities and the objective must also be persistent and
predictive.
In the previous step, the bank determined that
customer satisfaction drives value (it is predictive).
The bank now has to find reliable drivers of customer
satisfaction. Statistical analysis shows that the rates
consumers receive on their loans, the speed of loan
processing, and low teller turnover all affect customer
satisfaction. Because these are within the control of
employees and management, they are persistent. The
bank can use this information to, for example, make
sure that its process for reviewing and approving loans
is quick and efficient.
4. Evaluate Your Statistics
Finally, you must regularly reevaluate the measures you
are using to link employee activities with the governing
objective. The drivers of value change over time, and
so must your statistics. For example, the demograph-
ics of the retail bank’s customer base are changing,
so the bank needs to review the drivers of customer
satisfaction. As the customer base becomes younger
and more digitally savvy, teller turnover becomes less
relevant and the bank’s online interface and customer
service become more so. Companies have access
to a growing torrent of statistics that could improve
their performance, but executives still cling to old-fashioned
and often flawed methods for choosing metrics.
In the past, companies could get away with going
on gut and ignoring the right statistics because that’s
what everyone else was doing. Today, using them is
necessary to compete. More to the point, identifying
and exploiting them before rivals do will be the key to
seizing advantage.
Excerpted from “The True Measures of Success” in Harvard Business Review, October 2012 (product #R1210B).
Michael Mauboussin is an investment strategist and an adjunct professor at Columbia Business School. His latest book is The Success Equation (Harvard Business Review Press, 2012).
Metrics Are Only Valuable if You Can Manage to Them
Good metrics have three key attributes: Their data is
consistent, cheap, and quick to collect. A simple rule of
thumb: If you can’t measure results within a week for
free (and if you can’t replicate the process), then you’re
prioritizing the wrong ones. There are exceptions, but
they are rare. In baseball, the metrics an organization
uses to measure a successful plate appearance will af-
fect player strategy in the short term (do they draw more
walks, prioritize home runs, etc.?) and personnel strat-
egy in the mid- and long terms. The data to make these
decisions is readily available and continuously updated.
Organizations can’t control their data, but they do
control what they care about. If our metric on the You-
Tube video had been views, we would have called it a
huge success. In fact, we wrote it off as a massive failure.
Does that mean no more videos? Not necessarily, but for
now, we’ll be spending our resources elsewhere, collect-
ing data on metrics that matter.
Jeff Bladt is chief data officer at DoSomething.org,
America’s largest organization for young people and so-
cial change. Bob Filbin is chief data scientist at Crisis Text
Line, the first large-scale 24/7 national crisis line for
teens on the medium they use most: texting.
CHAPTER 7
The Fundamentals of A/B Testing
by Amy Gallo
As we learned in chapter 5, running an experiment is a
straightforward way to collect new data about a specific
question or problem. One of the most common meth-
ods of experimentation, particularly in online settings, is
A/B testing.
To better understand what A/B testing is, where it
originated, and how to use it, I spoke with Kaiser Fung,
who founded the applied analytics program at Columbia
University and is author of Junk Charts, a blog devoted
to the critical examination of data and graphics in the
mass media. His latest book is Numbersense: How to Use
Big Data to Your Advantage.
Adapted from “A Refresher on A/B Testing” on hbr.org, June 28, 2017
(product #H03R3D).
What Is A/B Testing?
A/B testing is a way to compare two versions of something
to figure out which performs better. While it’s most
often associated with websites and apps, Fung says the
method is almost 100 years old.
In the 1920s, statistician and biologist Ronald Fisher
discovered the most important principles behind A/B
testing and randomized controlled experiments in general.
“He wasn’t the first to run an experiment like this,
but he was the first to figure out the basic principles and
mathematics and make them a science,” Fung says.
Fisher ran agricultural experiments, asking questions
such as, “What happens if I put more fertilizer on this
land?” The principles persisted, and in the early 1950s
scientists started running clinical trials in medicine. In
the 1960s and 1970s, the concept was adapted by mar-
keters to evaluate direct-response campaigns (for exam-
ple, “Would a postcard or a letter sent to target custom-
ers result in more sales?”).
A/B testing in its current form came into existence in
the 1990s. Fung says that throughout the past century,
the math behind the tests hasn’t changed: “It’s the same
core concepts, but now you’re doing it online, in a real-
time environment, and on a different scale in terms of
number of participants and number of experiments.”
How Does A/B Testing Work?
You start an A/B test by deciding what it is you want to
test. Fung gives a simple example: the size of the “Sub-
scribe” button on your website. Then you need to know
how you want to evaluate its performance. In this case,
let’s say your metric is the number of visitors who click
on the button. To run the test, you show two sets of us-
ers (assigned at random when they visit the site) the
different versions (where the only thing different is the
size of the button) and determine which influenced your
success metric the most—in this case, which button size
caused more visitors to click.
There are a lot of things that influence whether someone
clicks. For example, it may be that those using a mobile
device are more likely to click a button of a certain
size, while those on desktop are drawn to a different
size. This is where randomization is critical. By random-
izing which users are in which group, you minimize the
chances that other factors, like mobile versus desktop,
will drive your results on average.
“The A/B test can be considered the most basic kind
of randomized controlled experiment,” Fung says. “In its
simplest form, there are two treatments and one acts as
the control for the other.” As with all randomized con-
trolled experiments, you must estimate the sample size
you need to achieve statistical significance, which will
help you make sure the result you’re seeing “isn’t just be-
cause of background noise,” Fung says.
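As a rough sketch of how a sample size might be estimated, the function below applies a standard normal-approximation formula for comparing two conversion rates. The 5% significance level and 80% power are conventional defaults assumed here, and the 15% baseline and 18% target rates are hypothetical; your testing software or statistician may use a different method.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_variation, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH group to detect the difference."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variation * (1 - p_variation)
    effect = p_variation - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical: 15% of visitors click today; you hope the change lifts it to 18%.
n_large_effect = sample_size_per_group(0.15, 0.18)
n_small_effect = sample_size_per_group(0.15, 0.16)
print(n_large_effect, n_small_effect)
```

The main lesson is in the comparison: the smaller the difference you want to detect, the more visitors each group needs, and the requirement grows quickly as the expected effect shrinks.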
Sometimes you know that certain variables, usually
those that are not easily manipulated, have a strong ef-
fect on the success metric. For example, maybe mobile
users of your website tend to click less in general, com-
pared with desktop users. Randomization may result
in set A containing slightly more mobile users than
set B, which may cause set A to have a lower click rate
regardless of the button size they’re seeing. To level the
playing field, the test analyst should first divide the users
by mobile and desktop and then randomly assign them
to each version. This is called blocking.
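Blocking can be sketched as randomizing separately inside each device group, so that both versions end up with the same mix of mobile and desktop users. The user records, the "device" field, and the seed below are hypothetical.

```python
import random
from collections import defaultdict


def blocked_assignment(users, block_key, seed=7):
    """Randomize users to version A or B separately within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for user in users:
        blocks[user[block_key]].append(user)

    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)  # random order within the block
        for position, user in enumerate(members):
            # Alternate versions down the shuffled list to balance the block.
            assignment[user["id"]] = "A" if position % 2 == 0 else "B"
    return assignment


# Hypothetical visitors: 40 on mobile, 20 on desktop.
users = [{"id": i, "device": "mobile" if i % 3 else "desktop"} for i in range(60)]
assignment = blocked_assignment(users, "device")

mobile_ids = [u["id"] for u in users if u["device"] == "mobile"]
mobile_in_a = sum(assignment[i] == "A" for i in mobile_ids)
print(f"mobile users in A: {mobile_in_a} of {len(mobile_ids)}")
```

With simple randomization the split of mobile users between A and B would vary by chance; here it is balanced by construction, so device type cannot tilt the comparison.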
The size of the “Subscribe” button is a very basic ex-
ample, Fung says. In actuality, you might not be testing
just size but also color, text, typeface, and font size. Lots
of managers run sequential tests, testing size first (large
versus small), then color (blue versus red), then typeface
(Times versus Arial), and so on, because they believe
they shouldn’t vary two or more factors at the same time.
But according to Fung, that view has been debunked by
statisticians. Sequential tests are in fact suboptimal, be-
cause you’re not measuring what happens when factors
interact. For example, it may be that users prefer blue on
average but prefer red when it’s combined with an Arial
font. This kind of result is regularly missed in sequential
A/B testing because the typeface test is run on blue but-
tons that have “won” the previous test.
Instead, Fung says, you should run more-complex
tests. This can be hard for some managers, since the
appeal of A/B tests is how straightforward and simple
they are to run (and many people designing these ex-
periments, Fung points out, don’t have a statistics back-
ground). “With A/B testing, we tend to want to run a
large number of simultaneous, independent tests,” he
says, in large part because the mind reels at the number
of possible combinations that can be tested. But using
mathematics, you can “smartly pick and run only certain
subsets of those treatments; then you can infer the rest
from the data.” This is called multivariate testing in the
A/B testing world, and it means you often end up doing
an A/B/C test or even an A/B/C/D test. In the colors and
size example, it might include showing different groups
a large red button, a small red button, a large blue but-
ton, and a small blue button. If you wanted to test fonts
too, you would need even more test groups.
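The growth in test groups is easy to see by listing every combination of factor levels. This sketch enumerates the full set of combinations (a full factorial design) for the factors named above; it does not implement the subset-selection methods Fung alludes to, which pick only some of these cells.

```python
from itertools import product


def treatment_cells(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


factors = {"size": ["large", "small"], "color": ["red", "blue"]}
cells = treatment_cells(factors)

# Adding a third two-level factor doubles the number of groups.
with_typeface = treatment_cells({**factors, "typeface": ["Times", "Arial"]})

for cell in cells:
    print(cell)
print(len(cells), "groups without typeface;", len(with_typeface), "with it")
```

Each added factor multiplies the number of groups, and every group needs enough visitors for a reliable estimate, which is the practical reason to test only a carefully chosen subset when the number of factors grows.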
How Do You Interpret the Results of an A/B Test?
Chances are that your company will use software that
handles the calculations, and it may even employ a stat-
istician who can interpret those results for you. But it’s
helpful to have a basic understanding of how to make
sense of the output and decide whether to move forward
with the test variation (the new button, in the example
Fung describes).
Fung says that most software programs report two
conversion rates for A/B testing: one for users who saw
the control version, and the other for users who saw the
test version. “The conversion rate may measure clicks or
other actions taken by users,” he says. The report might
look like this: “Control: 15% (+/– 2.1%); Variation 18%
(+/– 2.3%).” This means that 18% of your users clicked
through on the new variation (perhaps the larger blue
button) with a margin of error of 2.3%. You might be
tempted to interpret this as the actual conversion rate
falling between 15.7% and 20.3%, but that wouldn’t be
technically correct. “The real interpretation is that if
you ran your A/B test multiple times, 95% of the ranges
will capture the true conversion rate—in other words,
the conversion rate falls outside the margin of error 5%
of the time (or whatever level of statistical significance
you’ve set),” Fung explains.
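To see where figures such as “15% (+/– 2.1%)” can come from, here is a sketch using the simple normal-approximation formula for a proportion at the 95% level. The visitor and click counts are hypothetical, chosen so the output roughly matches the report above; real testing software may compute its margins differently.

```python
import math


def conversion_with_margin(conversions, visitors, z=1.96):
    """Return the conversion rate and its approximate 95% margin of error."""
    rate = conversions / visitors
    margin = z * math.sqrt(rate * (1 - rate) / visitors)
    return rate, margin


# Hypothetical counts: 1,100 visitors saw each version.
control_rate, control_margin = conversion_with_margin(165, 1100)
variation_rate, variation_margin = conversion_with_margin(198, 1100)

print(f"Control:   {control_rate:.1%} (+/- {control_margin:.1%})")
print(f"Variation: {variation_rate:.1%} (+/- {variation_margin:.1%})")
```

The margin shrinks as the number of visitors grows, which is why a test with more traffic produces tighter ranges around each conversion rate.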
This can be a difficult concept to wrap your head around. But
what’s important to know is that the 18% conversion rate isn’t a
guarantee. This is where your judgment comes in. An 18% conversion
rate looks better than a 15% one, but the two ranges overlap once
you allow for the margin of error (12.9% to 17.1% versus 15.7% to
20.3%), so the gap is not beyond doubt. You might hear people talk
about this as a “3% lift” (lift is the difference in conversion
rate between your control version and a successful test treatment;
here it is 3 percentage points, or 20% in relative terms). In this
case, it’s probably a good decision to switch to your new version,
but that will depend on the costs of implementing it. If they’re
low, you might try out the switch and see what happens in actuality
(versus in tests). One of the big advantages to testing in the
online world is that you can usually revert to your original
pretty easily.
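The two readings of “lift,” and whether a gap such as 15% versus 18% could plausibly be noise, can be checked with a short calculation. This is a sketch using a pooled two-proportion z-test, one common choice; the visitor counts (1,100 per version) are hypothetical, picked so that the margins of error roughly match the report quoted above.

```python
import math
from statistics import NormalDist


def compare(conv_control, n_control, conv_variation, n_variation):
    """Return absolute lift, relative lift, and a two-sided p-value."""
    p_c = conv_control / n_control
    p_v = conv_variation / n_variation
    absolute_lift = p_v - p_c            # in percentage points
    relative_lift = absolute_lift / p_c  # as a share of the control rate

    pooled = (conv_control + conv_variation) / (n_control + n_variation)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variation))
    z = absolute_lift / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return absolute_lift, relative_lift, p_value


absolute_lift, relative_lift, p_value = compare(165, 1100, 198, 1100)
print(f"absolute lift: {absolute_lift:.1%} points")
print(f"relative lift: {relative_lift:.0%}")
print(f"p-value: {p_value:.3f}")
```

With these made-up counts the p-value should come out a little above 0.05, a reminder that an apparent three-point gain can still fall short of the usual significance threshold, and that the cost of switching and the ease of reverting belong in the decision.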
How Do Companies Use A/B Testing?
Fung says that the popularity of the methodology has
risen as companies have realized that the online en-
vironment is well suited to help managers, especially
marketers, answer questions like, “What is most likely
to make people click? Or buy our product? Or register
with our site?” A/B testing is now used to evaluate every-
thing from website design to online offers to headlines
to product descriptions. (See the sidebar “A/B Testing in
Action” to see an example from the creative marketplace
Shutterstock.)
Most of these experiments run without the subjects
even knowing. As users, Fung says, “we’re part of these
tests all the time and don’t know it.”
And it’s not just websites. You can test marketing
emails or ads as well. For example, you might send two
versions of an email to your customer list (randomizing
the list first, of course) and figure out which one
generates more sales. Then you can just send out the
winning version next time. Or you might test two ver-
sions of ad copy and see which one converts visitors
more often. Then you know to spend more getting the
most successful one out there.
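Randomizing the list before splitting it might look like the sketch below. The function name, the seed, and the example addresses are hypothetical; the only point is that the shuffle happens before the split, so neither group is systematically different from the other.

```python
import random


def split_for_ab_test(customers, seed=None):
    """Shuffle a copy of the customer list and split it into two halves."""
    shuffled = list(customers)
    random.Random(seed).shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]


# Hypothetical customer list, for illustration only.
customers = [f"customer{i}@example.com" for i in range(10)]
group_a, group_b = split_for_ab_test(customers, seed=42)

print("Version A goes to:", group_a)
print("Version B goes to:", group_b)
```

Splitting without shuffling (say, the first half of an alphabetical or sign-up-date list) could mix the effect of the email with whatever the ordering reflects.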
A/B TESTING IN ACTION
by Wyatt Jenkins
At Shutterstock, we test everything: copy and link col-
ors, relevance algorithms that rank our search results,
language-detection functions, usability in download-
ing, pricing, video-playback design, and anything else
you can see on our site (plus a lot you can’t).
Shutterstock is the world’s largest creative market-
place, serving photography, illustrations, and video to
more than 750,000 customers. And those customers
have heavy image needs; we serve over three down-
loads per second. That’s a ton of data.
This means that we know more about our custom-
ers, statistically, than anyone else in our market. It
also means that we can run more experiments with
statistical significance faster than businesses with less
user data. It’s one of our most important competitive
advantages.
Search results are among the highest-trafficked
pages on our site. A few years back, we started experi-
menting with a mosaic-display search-results page in
our Labs area—an experimentation platform we use to
try things quickly and get user feedback. In qualitative
testing, customers really liked the design of the mosaic
search grid, so we A/B tested it within the core Shut-
terstock experience.
Here are some of the details of the experiment, and
what we learned:
• Image sizes: We tested different image sizes to
get just the right number of pixels on the screen.
• New customers: We watched to see if new
customers to our site would increase conversion.
New customers act differently than existing ones,
so you need to account for that. Sometimes existing
customers suffer from change aversion.
• Viewport size: We tracked the viewport size
(the size of the screen customers used) to
understand how they were viewing the page.
• Watermarks: We tested including an image
watermark versus no watermark. Was including
the watermark distracting?
• Hover: We experimented with the behavior of a
hover feature when a user paused on a particu-
lar image.
Before the test, we were convinced that removing
the watermark on our images would increase con-
version because there would be less visual clutter on
the page. But in testing we learned that removing the
watermark created the opposite effect, disproving our
gut instinct.
We ran enough tests to find two different designs
that increased conversion, so we iterated on those de-
signs and re-tested them before deciding on one. And
we continue to test this search grid and make improve-
ments for our customers on a regular basis.
Adapted from “A/B Testing and the Benefits of an Experimentation Culture” posted on hbr.org, February 5, 2014 (product #H00NTO).
Wyatt Jenkins is a product executive with a focus on marketplaces, personalization, optimization, and international growth. He has acted as SVP of Product at Hired.com and Optimizely, and was VP of Product at Shutterstock for five years. Wyatt was an early partner in Beatport from 2003 to 2009, and he served on the board until 2013.
What Mistakes Do People Make When Doing A/B Tests?
Fung identified three common mistakes he sees companies
make when performing A/B tests.
First, too many managers don’t let the tests run their
course. Because most of the software for running these
tests lets you watch results in real time, managers want
to make decisions too quickly. This mistake, Fung says,
“evolves out of impatience,” and many software vendors
have played into this overeagerness by offering a type of
A/B testing called real-time optimization, in which you
can use algorithms to make adjustments as results come
in. The problem is that, because of randomization, it’s
possible that if you let the test run to its natural end, you
might get a different result.
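A small simulation can illustrate why stopping early is risky. The sketch below is an assumption-laden toy, not a description of any vendor's software: both versions share the same true conversion rate (so any "winner" is a false alarm), results are checked at ten evenly spaced points with a standard two-proportion z-test, and the sample sizes, seed, and 1.96 cutoff are illustrative choices.

```python
import math
import random


def z_score(conv_a, conv_b, n):
    """Two-proportion z-score for two groups of n users each."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 0.0
    std_err = math.sqrt(2 * pooled * (1 - pooled) / n)
    return (conv_b / n - conv_a / n) / std_err


def simulate(trials=300, users=1000, looks=10, rate=0.15, seed=1):
    """Run tests with no real difference and count 'significant' results."""
    rng = random.Random(seed)
    step = users // looks
    peeking_hits = final_hits = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        flagged_at_any_look = False
        for n in range(1, users + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # A manager who peeks would stop at the first "significant" look.
            if n % step == 0 and abs(z_score(conv_a, conv_b, n)) > 1.96:
                flagged_at_any_look = True
        peeking_hits += flagged_at_any_look
        # A manager who waits only judges the result at the planned end.
        final_hits += abs(z_score(conv_a, conv_b, users)) > 1.96
    return peeking_hits / trials, final_hits / trials


peek_rate, final_rate = simulate()
print("False alarms when peeking:", peek_rate)
print("False alarms at the planned end:", final_rate)
```

Because the final check is one of the ten looks, the peeking rate can never be lower than the wait-until-the-end rate, and in runs like this it tends to be noticeably higher.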
The second mistake is looking at too many metrics.
“I cringe every time I see software that tries to please
everyone by giving you a panel of hundreds of metrics,”
he says. The problem is that if you’re looking at such a
large number of metrics at the same time, you’re at risk
of making what statisticians call spurious correlations (a
topic discussed in more detail in chapter 10). In proper
test design, “you should decide on the metrics you’re
going to look at before you execute an experiment and
select a few. The more you’re measuring, the more likely
that you’re going to see random fluctuations.” With too
many metrics, instead of asking yourself, “What’s happening
with this variable?” you’re asking, “What interesting
(and potentially insignificant) changes am I
seeing?”
Lastly, Fung says, few companies do enough retest-
ing. “We tend to test it once and then we believe it. But
even with a statistically significant result, there’s a quite
large probability of false positive error. Unless you retest
once in a while, you don’t rule out the possibility of be-
ing wrong.” False positives can occur for several reasons.
For example, even though there may be little chance that
any given A/B result is driven by random chance, if you
do lots of A/B tests, the chances that at least one of your
results is wrong grows rapidly.
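How quickly that chance grows can be worked out directly. The sketch assumes the tests are independent, that none of the changes has any real effect, and that each test uses a 5% significance level; the function name is illustrative. The same arithmetic applies to scanning many metrics within a single test.

```python
def chance_of_any_false_positive(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests
    shows a false positive, when each has false-positive rate alpha."""
    return 1 - (1 - alpha) ** num_tests


for num_tests in (1, 5, 20, 50):
    print(num_tests, round(chance_of_any_false_positive(num_tests), 2))
# prints: 1 0.05, 5 0.23, 20 0.64, 50 0.92 (one pair per line)
```

Under these assumptions, twenty tests already carry roughly a two-in-three chance of at least one misleading "win," which is the case for retesting results that matter.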
Retesting can be particularly difficult to do because it is
likely that managers would end up with contradictory
results, and no one wants to discover that they’ve undermined
previous findings, especially in the online world,
where managers want to make changes—and capture
value—quickly. But this focus on value can be misguided.
Fung says, “People are not very vigilant about the practical
value of the findings. They want to believe that every
little amount of improvement is valuable even when the
test results are not fully reliable. In fact, the smaller the
improvement, the less reliable the results.”
It’s clear that A/B testing is not a panacea for all your
data-testing needs. There are more complex kinds of experiments
that are more efficient and will give you more
reliable data, Fung says. But A/B testing is a great way
to gain quick information about a specific question you
have, particularly in an online setting. And, as Fung
says, “the good news about the A/B testing world is that
everything happens so quickly, so if you run it and it
doesn’t work, you can try something else. You can always
flip back to the old tactic.”
Amy Gallo is a contributing editor at Harvard Business
Review and the author of the HBR Guide to Dealing with
Conflict. Follow her on Twitter @amyegallo.
CHAPTER 8
Can Your Data Be Trusted?
by Thomas C. Redman
You’ve just learned of some new data that, when com-
bined with existing data, could offer potentially game-
changing insights. But there isn’t a clear indication
whether this new information can be trusted. How
should you proceed?
There is, of course, no simple answer. While many
managers are skeptical of new data and others embrace
it wholeheartedly, the more thoughtful managers take a
nuanced approach. They know that some data (maybe
even most of it) is bad and can’t be used, and some is
good and should be trusted implicitly. But they also realize
that some data is flawed but usable with caution.
Adapted from content posted on hbr.org, October 29, 2015 (product
#H02G61).
They find this data intriguing and are eager to push the
data to its limits, as they know game-changing insights
may reside there.
Fortunately, you can work with your data scientists to
assess whether the data you’re considering is safe to use
and just how far you can go with flawed data. Indeed,
following some basic steps can help you proceed with
greater confidence—or caution—as the quality of the
data dictates.
Evaluate Where It Came From
You can trust data when it is created in accordance with
a first-rate data quality program. Such programs feature
clear accountabilities for managers to create data
correctly, input controls, and find and eliminate the root
causes of error. You won’t have to opine whether the data is good—
data quality statistics will tell you. You’ll find an expert
who will be happy to explain what you may expect and
answer your questions. If the data quality stats look good
and the conversation goes well, trust the data. This is the
“gold standard” against which the other steps should be
calibrated.
Assess Data Quality Independently
Much, perhaps most, data will not meet the gold standard,
so adopt a cautious attitude by doing your own
assessment of data quality. Make sure you know where
the data was created and how it is defined, not just how
your data scientist accessed it. It is easy to be misled by
a casual, “We took it from our cloud-based data ware-
house, which employs the latest technology,” and com-
pletely miss the fact that the data was created in a dubi-
ous public forum. Figure out which organization created
the data. Then dig deeper: What do colleagues advise
about this organization and data? Does it have a good or
poor reputation for quality? What do others say on social
media? Do some research both inside and outside your
company.
At the same time, develop your own data quality
statistics, using what I call the “Friday afternoon measurement,”
tailor-made for this situation. Briefly, you,
the data scientist providing the analysis, or both of you,
should lay out 10 or 15 important data elements for
100 data records on a spreadsheet. If the new data in-
volves customer purchases, such data elements may in-
clude “customer name,” “purchased item,” and “price.”
Then work record by record, taking a hard look at each
data element. The obvious errors will jump out at you—
customer names will be misspelled, the purchased item
will be a product you don’t sell, or the price may be miss-
ing. Mark these obvious errors with a red pen or high-
light them in a bright color. Then count the number of
records with no errors. (See figure 8-1 for an example.) In
many cases you’ll see a lot of red—don’t trust this data!
If you see only a little red, say, less than 5% of records
with an obvious error, you can use this data with caution.
Look, too, at patterns of the errors. If, for instance,
there are 25 total errors, 24 of which occur in the price,
eliminate that data element going forward. But if the
rest of the data looks pretty good, use it with caution.
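Once the records have been marked up, the counting can be sketched in a few lines. The four records and the element names below are hypothetical, and the marking itself (deciding what counts as an obvious error) is still done by hand; the code only tallies perfect records and shows where the errors cluster.

```python
from collections import Counter

# Hypothetical markup of four records: True means an obvious error
# was marked (the "red pen") for that data element.
marked = [
    {"customer name": False, "purchased item": False, "price": False},
    {"customer name": True, "purchased item": False, "price": True},
    {"customer name": False, "purchased item": False, "price": True},
    {"customer name": False, "purchased item": False, "price": False},
]


def friday_afternoon_summary(records):
    """Count error-free records and tally marked errors per data element."""
    perfect = sum(1 for record in records if not any(record.values()))
    errors_by_element = Counter(
        element
        for record in records
        for element, has_error in record.items()
        if has_error
    )
    return perfect, errors_by_element


perfect, errors_by_element = friday_afternoon_summary(marked)
share_with_errors = 1 - perfect / len(marked)

print("Perfect records:", perfect, "of", len(marked))
print("Share of records with an obvious error:", share_with_errors)
print("Errors by data element:", dict(errors_by_element))
```

A per-element tally like this makes the pattern check easy: if nearly all errors sit in one element, that element is the candidate to drop.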
FIGURE 8-1
Example: Friday afternoon measurement spreadsheet

         Attribute 1      Attribute 2        Attribute 3   ...  Attribute 15
Record   Name             Size               Amount                            Perfect record?
1        Jane Doe         Null               $472.13                           No
2        John Smith       Medium             $126.93                           Yes
3        Stuart Madnick   XXXL               Null                              No
4        James Olsen      24 Lockwood Road   $76.24                            No
...
100      Thoams Jones                                                          No

Number of perfect records = 67

Source: Thomas C. Redman, “Assess Whether You Have a Data Quality Problem” on hbr.org, July 28, 2016 (product #H030SQ).
Clean the Data
I think of data cleaning in three levels: rinse, wash, and
scrub. “Rinse” replaces obvious errors with “missing
value” or corrects them if doing so is very easy; “scrub”
involves deep study, even making corrections one at a
time, by hand, if necessary; and “wash” occupies a mid-
dle ground.
Even if time is short, scrub a small random sample
(say, 1,000 records), making them as pristine as you pos-
sibly can. Your goal is to arrive at a sample of data you
know you can trust. Employ all possible means of scrub-
bing and be ruthless! Eliminate erroneous data records
and data elements that you cannot correct, and mark
data as “uncertain” when applicable.
When you are done, take a hard look. When the
scrubbing has gone really well (and you’ll know if it has),
you’ve created a data set that rates high on the trust-
worthy scale. It’s OK to move forward using this data.
Sometimes the scrubbing is less satisfying. If you’ve
done the best you can, but still feel uncertain, put this
data in the “use with caution” category. If the scrubbing
goes poorly—for example, too many prices just look
wrong and you can’t make corrections—you must rate
this data, and all like it, as untrustworthy. The sample
strongly suggests none of the data should be used to in-
form your decision.
After the initial scrub, move on to the second clean-
ing exercise: washing the remaining data that was not in
the scrubbing sample. This step should be performed by
a truly competent data scientist. Since scrubbing can be
a time-consuming, manual process, the wash allows you
to make corrections using more automatic processes. For
example, one wash technique involves “imputing” missing
values using statistical means. Or your data scientist
may have discovered correction algorithms during
scrubbing that can be applied more broadly. If the
washing goes well, put this data into the “use with cau-
tion” category.
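As a rough illustration of a rinse followed by a wash, the sketch below blanks out obviously bad prices and then imputes them. Both rules are assumptions made for the example: treating a missing or non-positive price as an obvious error, and filling gaps with the median of the known prices. A data scientist might well choose a different imputation method.

```python
from statistics import median


def rinse(prices):
    """Replace obviously wrong prices (missing or not positive) with None."""
    return [
        p if isinstance(p, (int, float)) and p > 0 else None
        for p in prices
    ]


def wash(prices):
    """Impute missing prices with the median of the known prices."""
    known = [p for p in prices if p is not None]
    fill = median(known)
    return [fill if p is None else p for p in prices]


# Hypothetical price column with one missing and one impossible value.
raw_prices = [472.13, None, 126.93, -5.0, 76.24]

rinsed = rinse(raw_prices)
washed = wash(rinsed)

print("Rinsed:", rinsed)
print("Washed:", washed)
```

Imputed values are estimates rather than observations, which fits the advice above to file washed data under “use with caution.”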
The flow chart in figure 8-2 will help you see this process
in action. Once you’ve identified a set of data that
you can trust or use with caution, move on to the next
step of integration.
Ensure High-Quality Data Integration
Align the data you can trust—or the data that you’re
moving forward with cautiously—with your existing
data. There is a lot of technical work here, so probe your
data scientist to ensure three things are done well:
• Identification: Verify that the Courtney Smith in
one data set is the same Courtney Smith in others.
• Alignment of units of measure and data definitions:
Make sure Courtney’s purchases and prices paid,
expressed in “pallets” and “dollars” in one set, are
aligned with “units” and “euros” in another.
• De-duplication: Check that the Courtney Smith
record does not appear multiple times in different
ways (say as C. Smith or Courtney E. Smith).
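The three checks above can be sketched on toy records. Everything specific here is a hypothetical assumption: the field names, the idea that both data sets already share a reliable `customer_id` (in practice, establishing that link is the hard part of identification), and the two conversion factors.

```python
UNITS_PER_PALLET = 48     # hypothetical conversion factor
DOLLARS_PER_EURO = 1.10   # hypothetical exchange rate


def to_common_units(purchase):
    """Return a copy of a purchase expressed in units and dollars."""
    aligned = dict(purchase)
    if aligned["quantity_unit"] == "pallets":
        aligned["quantity"] *= UNITS_PER_PALLET
        aligned["quantity_unit"] = "units"
    if aligned["currency"] == "euros":
        aligned["price"] = round(aligned["price"] * DOLLARS_PER_EURO, 2)
        aligned["currency"] = "dollars"
    return aligned


def deduplicate(customers):
    """Keep the first record seen for each customer ID."""
    seen = {}
    for customer in customers:
        seen.setdefault(customer["customer_id"], customer)
    return list(seen.values())


customers = [
    {"customer_id": "C-001", "name": "Courtney Smith"},
    {"customer_id": "C-001", "name": "C. Smith"},  # same person, other set
    {"customer_id": "C-002", "name": "Jane Doe"},
]
purchases = [
    {"customer_id": "C-001", "quantity": 2, "quantity_unit": "pallets",
     "price": 500.0, "currency": "dollars"},
    {"customer_id": "C-001", "quantity": 10, "quantity_unit": "units",
     "price": 100.0, "currency": "euros"},
]

unique_customers = deduplicate(customers)
aligned_purchases = [to_common_units(p) for p in purchases]

print(unique_customers)
print(aligned_purchases)
```

A useful question for your data scientist is which of these steps were rule-based like this and which relied on fuzzy name matching, since the latter can merge two different people.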
FIGURE 8-2
Should you trust your data? A simple process to help you decide
[Flow chart. Starting from raw data, it asks: Was the data created in accordance with a first-rate data quality program? Can you identify data of high quality through your own research? It then directs you to “scrub” a small sample by correcting or eliminating data and asks whether the “scrubbing” went well (yes, somewhat, or no), and to “wash” the remaining data using automated techniques with the help of a data scientist and asks whether the “washing” went well. One branch reads: “The data could not be scrubbed. There were too many errors that couldn’t be fixed.” The three outcomes are “Trust this data,” “Use this data with caution,” and “Do not trust this data.”]

At this point in the process, you’re ready to perform
whatever analytics (from simple summaries to more
complex analyses) you need to guide your decision. Pay
particular attention when you get different results based
on “use with caution” and “trusted” data. Both great in-
sights and great traps lie here. When a result looks in-
triguing, isolate the data and repeat the steps above,
making more detailed measurements, scrubbing the
data, and improving wash routines. As you do so, de-
velop a feel for how deeply you should trust this data.
Data doesn’t have to be perfect to yield new insights,
but you must exercise caution by understanding where
the flaws lie, working around errors, cleaning them up,
and backing off when the data simply isn’t good enough.
Thomas C. Redman, “the Data Doc,” is President of Data
Quality Solutions. He helps companies and people, in-
cluding startups, multinationals, executives, and leaders
at all levels, chart their courses to data-driven futures.
He places special emphasis on quality, analytics, and or-
ganizational capabilities.
SECTION THREE
Analyze the Data
CHAPTER 9
A Predictive Analytics Primer
by Thomas H. Davenport
No one has the ability to capture and analyze data from
the future. However, there is a way to predict the future
using data from the past. It’s called predictive analytics,
and organizations do it every day.
Has your company, for example, developed a customer
lifetime value (CLTV) measure? That’s using predictive
analytics to determine how much a customer will buy
from the company over time. Do you have a “next best
offer” or product recommendation capability? That’s an
analytical prediction of the product or service that your
customer is most likely to buy next. Have you made a
Adapted from content posted on hbr.org, September 2, 2014 (product
#H00YO1).
forecast of next quarter’s sales? Used digital marketing
models to determine what ad to place on what publish-
er’s site? All of these are forms of predictive analytics.
Predictive analytics are gaining in popularity, but
what do you really need to know in order to interpret
results and make better decisions? By understanding a
few basics, you will feel more comfortable working with
and communicating with others in your organization
about the results and recommendations from predictive
analytics. The quantitative analysis isn’t magic—but it is
normally done with a lot of past data, a little statistical
wizardry, and some important assumptions.
The Data
Lack of good data is the most common barrier to organizations
seeking to employ predictive analytics. To
make predictions about what customers will buy in
the future, for example, you need to have good data on
what they are buying (which may require a loyalty pro-
gram, or at least a lot of analysis of their credit cards),
what they have bought in the past, the attributes of those
products (attribute-based predictions are often more ac-
curate than the “people who buy this also buy this” type
of model), and perhaps some demographic attributes of
the customer (age, gender, residential location, socioeco-
nomic status, etc.). If you have multiple channels or cus-
tomer touchpoints, you need to make sure that they cap-
ture data on customer purchases in the same way your
previous channels did.
All in all, it’s a fairly tough job to create a single
customer data warehouse with unique customer IDs
on everyone, and all past purchases customers have
made through all channels. If you’ve already done that,
you’ve got an incredible asset for predictive customer
analytics.
The Statistics
Regression analysis in its various forms is the primary
tool that organizations use for predictive analytics. It
works like this, in general: An analyst hypothesizes that
a set of independent variables (say, gender, income, vis-
its to a website) are statistically correlated with the pur-
chase of a product for a sample of customers. The analyst
performs a regression analysis to see just how correlated
each variable is; this usually requires some iteration
to find the right combination of variables and the best
model. Let’s say that the analyst succeeds and finds that
each variable in the model is important in explaining the
product purchase, and together the variables explain a
lot of variation in the product’s sales. Using that regression
equation, the analyst can then use the regression
coefficients—the degree to which each variable affects
the purchase behavior—to create a score predicting the
likelihood of the purchase.
Voilà! You have created a predictive model for other
customers who weren’t in the sample. All you have to do
is compute their score and offer them the product if their
score exceeds a certain level. It’s quite likely that the
high-scoring customers will want to buy the product—
assuming the analyst did the statistical work well and
that the data was of good quality. (For more on regres-
sion analysis, read on to the next chapter.)
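The scoring step might look like the sketch below. It assumes a logistic regression, which is only one of the “various forms” an analyst could use, and the intercept, coefficients, variable names, and 0.5 cutoff are all invented for illustration rather than fitted to any data.

```python
import math

# Hypothetical coefficients, as if produced by a fitted logistic regression.
INTERCEPT = -3.0
COEFFICIENTS = {"income_thousands": 0.02, "site_visits": 0.25}


def purchase_score(customer):
    """Return a predicted purchase likelihood between 0 and 1."""
    linear = INTERCEPT + sum(
        COEFFICIENTS[name] * customer[name] for name in COEFFICIENTS
    )
    return 1 / (1 + math.exp(-linear))


def should_offer(customer, threshold=0.5):
    """Offer the product when the score exceeds the chosen level."""
    return purchase_score(customer) > threshold


# Two hypothetical customers who were not in the original sample.
frequent_visitor = {"income_thousands": 80, "site_visits": 8}
rare_visitor = {"income_thousands": 40, "site_visits": 2}

for customer in (frequent_visitor, rare_visitor):
    print(round(purchase_score(customer), 2), should_offer(customer))
```

Where to set the threshold is a business choice: a lower cutoff reaches more likely buyers at the cost of more wasted offers.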
The Assumptions
Another key factor in any predictive model is the assumptions
that underlie it. Every model has them,
and it’s important to know what they are and monitor
whether they are still true. The big assumption in pre-
dictive analytics is that the future will continue to be like
the past. As Charles Duhigg describes in his book The
Power of Habit, people establish strong patterns of be-
havior that they usually keep up over time. Sometimes,
however, they change those behaviors, and the models
that were used to predict them may no longer be valid.
What makes assumptions invalid? The most com-
mon reason is time. If your model was created several
years ago, it may no longer accurately predict current
behavior. The greater the elapsed time, the more likely
it is that customer behavior has changed. Some Netflix
predictive models, for example, that were created
on early internet users had to be retired because later
internet users were substantially different. The pioneers
were more technically focused and relatively young;
later users were essentially everyone.
Another reason a predictive model’s assumptions may
no longer be valid is if the analyst didn’t include a key
variable in the model, and that variable has changed
substantially over time. The great—and scary—example
here is the financial crisis of 2008–2009, caused largely
by invalid models predicting how likely mortgage cus-
tomers were to repay their loans. The models didn’t in-
clude the possibility that housing prices might stop ris-
ing, and that they even might fall. When they did start
falling, it turned out that the models were poor predic-
tors of mortgage repayment. In essence, the belief that
housing prices would always rise was a hidden assump-
tion in the models.
Since faulty or obsolete assumptions can clearly bring
down whole banks and even (nearly!) whole economies,
it’s pretty important that they be carefully examined.
Managers should always ask analysts what the key as-
sumptions are, and what would have to happen for them
to no longer be valid. And both managers and analysts
should continually monitor the world to see if key factors
involved in assumptions have changed over time.
With these fundamentals in mind, here are a few good
questions to ask your analysts:
• Can you tell me something about the source of the
data you used in your analysis?
• Are you sure the sample data is representative of
the population?
• Are there any outliers in your data distribution?
How did they affect the results?
• What assumptions are behind your analysis?
• Are there any conditions that would make your
assumptions invalid?
Even with those cautions, it’s still pretty amazing that
we can use analytics to predict the future. All we have to
do is gather the right data, do the right type of statisti-
cal model, and be careful of our assumptions. Analytical
predictions may be harder to generate than those by the
late-night television soothsayer Carnac the Magnificent,
but they are usually considerably more accurate.
Thomas H. Davenport is the President’s Distinguished
Professor in Management and Information Technology
at Babson College, a research fellow at the MIT Initiative
on the Digital Economy, and a senior adviser at Deloitte
Analytics. Author of over a dozen management books,
his latest is Only Humans Need Apply: Winners and
Losers in the Age of Smart Machines.
CHAPTER 10
Understanding Regression Analysis
by Amy Gallo
One of the most important types of data analysis is re-
gression. It is a common approach used to draw conclu-
sions from and make predictions based on data, but for
those without a statistical or analytical background, it
can also be complex and confusing.
To better understand this method and how compa-
nies use it, I talked with Thomas Redman, author of
Data Driven: Profiting from Your Most Important Business
Asset. He also advises organizations on their data
and data quality programs.
Adapted from “A Refresher on Regression Analysis” on hbr.org, No-
vember 4, 2015 (product #H02GBP).
What Is Regression Analysis?
Redman offers this example scenario: Suppose you’re a
sales manager trying to predict next month’s numbers.
You know that dozens, perhaps even hundreds, of fac-
tors from the weather to a competitor’s promotion to
the rumor of a new and improved model can impact
the number. Perhaps people in your organization even
have a theory about what will have the biggest effect on
sales. “Trust me. The more rain we have, the more we
sell.” “Six weeks after the competitor’s promotion, sales
jump.”
Regression analysis is a way of mathematically sorting
out which of those variables do indeed have an impact.
It answers the questions: Which factors matter most?
Which can we ignore? How do those factors interact
with one another? And, perhaps most importantly, how
certain are we about all of these factors?
In regression analysis, those factors are called vari-
ables. You have your dependent variable—the main fac-
tor that you’re trying to understand or predict. In Red-
man’s example above, the dependent variable is monthly
sales. And then you have your independent variables—
the factors you suspect have an impact on your depen-
dent variable.
How Does It Work?
In order to conduct a regression analysis, you gather data
on the variables in question. You take all of your monthly
sales numbers for, say, the past three years and any data
on the independent variables you’re interested in. So, in
this case, let’s say you find out the average monthly rainfall
for the past three years as well. Then you plot all of
that information on a chart that looks like figure 10-1.
The y-axis is the amount of sales (the dependent vari-
able, the thing you’re interested in, is always on the y-
axis) and the x-axis is the total rainfall. Each dot repre-
sents one month’s data—how much it rained that month
and how many sales you made that same month.
Glancing at this data, you probably notice that sales
are higher in months when it rains a lot. That’s interesting
to know, but by how much? If it rains three inches, do
you know how much you’ll sell? What about if it rains
four inches?
FIGURE 10-1
Is there a relationship between these two variables?
Plotting your data is the first step to figuring that out.
Now imagine drawing a line through the chart, one
that runs roughly through the middle of all the data
points, as shown in figure 10-2. This line will help you
answer, with some degree of certainty, how much you
typically sell when it rains a certain amount.
This is called the regression line and it’s drawn (using
a statistics program like SPSS or STATA or even Excel)
to show the line that best fits the data. In other words,
explains Redman, “The line is the best explanation of the
relationship between the independent variable and de-
pendent variable.”
FIGURE 10-2
Building a regression model
The line summarizes the relationship between x and y.
In addition to drawing the line, your statistics program
also outputs a formula that explains the slope of
the line and looks something like this:
y = 200 + 5x + error term
Ignore the error term for now. It refers to the fact
that regression isn’t perfectly precise. Just focus on the
model:
y = 200 + 5x
What this formula is telling you is that if there is no x
then y = 200. So, historically, when it didn’t rain at all,
you made an average of 200 sales and you can expect
to do the same going forward assuming other variables
stay the same. And in the past, for every additional inch
of rain, you made an average of five more sales. “For every
increment that x goes up one, y goes up by five,” says
Redman.
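To see where such a formula comes from, here is a minimal least-squares sketch. The six data points are made up and deliberately chosen so that the fitted line comes out at exactly y = 200 + 5x; real monthly data would not be this tidy, and a statistics program would do the fitting for you.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up monthly data: inches of rain and number of sales.
rainfall = [0, 1, 2, 3, 4, 5]
sales = [201, 203, 211, 215, 220, 225]

intercept, slope = fit_line(rainfall, sales)
print(f"y = {intercept:g} + {slope:g}x")  # prints: y = 200 + 5x

# Expected sales for three and four inches of rain.
for inches in (3, 4):
    print(inches, "inches ->", intercept + slope * inches, "sales")
```

With this line, three inches of rain points to about 215 sales and four inches to about 220, which answers the "by how much?" question a scatter plot alone cannot.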
Now let’s return to the error term. You might be
tempted to say that rain has a big impact on sales if for
every inch you get five more sales, but whether this variable
is worth your attention will depend on the error
term. A regression line always has an error term because,
in real life, independent variables are never perfect pre-
dictors of the dependent variables. Rather, the line is an
estimate based on the available data. So the error term
tells you how certain you can be about the formula. The
larger it is, the less certain the regression line.
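One common way to put a number on the size of the error, once a line has been fitted, is to look at how far the actual data points fall from it. The sketch below is an illustration under stated assumptions: both data sets are invented, both were constructed so that their best-fit line is y = 200 + 5x, and the summary measures used (the residual standard error and R-squared) are standard statistics that this chapter does not name explicitly.

```python
import math


def fit_quality(xs, ys, intercept, slope):
    """Return (residual standard error, R-squared) for a fitted line."""
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(r ** 2 for r in residuals)       # variation the line misses
    mean_y = sum(ys) / len(ys)
    sst = sum((y - mean_y) ** 2 for y in ys)   # total variation in y
    # Two values (intercept and slope) were estimated from the data.
    residual_std_error = math.sqrt(sse / (len(xs) - 2))
    r_squared = 1 - sse / sst
    return residual_std_error, r_squared


# Hypothetical data: same rain, same fitted line, different scatter.
rain = [0, 1, 2, 3, 4, 5]
tight_sales = [201, 203, 211, 216, 218, 226]
noisy_sales = [210, 185, 220, 225, 200, 235]

tight_error, tight_r2 = fit_quality(rain, tight_sales, 200, 5)
noisy_error, noisy_r2 = fit_quality(rain, noisy_sales, 200, 5)

print(f"tight data: typical miss {tight_error:.1f} sales, R-squared {tight_r2:.2f}")
print(f"noisy data: typical miss {noisy_error:.1f} sales, R-squared {noisy_r2:.2f}")
```

Both data sets produce the same formula, but in the first the line typically misses by fewer than 2 sales and accounts for about 97% of the variation, while in the second it misses by about 17 sales and accounts for about 27%. The formula alone does not tell you which situation you are in.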
This example uses only one variable to predict the fac-
tor of interest—in this case, rain to predict sales. Typi-
cally, you start a regression analysis wanting to under-
stand the impact of several independent variables. So
you might include not just rain but also data about a
competitor’s promotion. “You keep doing this until the
error term is very small,” says Redman. “You’re trying to
get the line that fits best with your data.” While there can
be dangers in trying to include too many variables in a
regression analysis, skilled analysts can minimize those
risks. And considering the impact of multiple variables
at once is one of the biggest advantages of regression.
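To make the idea of several independent variables concrete, here is a minimal sketch of a regression with two predictors: rainfall and whether a competitor ran a promotion. Everything in it is an assumption for illustration: the data is invented and built without noise from the rule sales = 200 + 5 × rain − 20 × promotion, and the helpers solve and fit_multiple are hypothetical names, not functions from SPSS, STATA, or Excel.

```python
def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_multiple(predictor_rows, ys):
    """Least-squares fit with an intercept, via the normal equations."""
    design = [[1.0] + list(row) for row in predictor_rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical months: (inches of rain, 1 if the competitor ran a promotion).
months = [(0, 0), (1, 1), (2, 0), (3, 0), (4, 1), (5, 1)]
# Sales built, without noise, as 200 + 5 * rain - 20 * promotion.
sales = [200, 185, 210, 215, 200, 205]

intercept, per_inch_of_rain, per_promotion = fit_multiple(months, sales)
print(round(intercept, 6), round(per_inch_of_rain, 6), round(per_promotion, 6))
```

Because the invented data follows the rule exactly, the fit recovers the three coefficients: a baseline of 200, five more sales per inch of rain, and 20 fewer sales in months when the competitor runs a promotion. Real data would include noise, and the coefficients would be estimates.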
How Do Companies Use It?
Regression analysis is the “go-to method in analytics,”
says Redman. And smart companies use it to make deci-
sions about all sorts of business issues. “As managers, we
want to figure out how we can impact sales or employee
retention or recruiting the best people. It helps us figure
out what we can do.”
Most companies use regression analysis to explain
a phenomenon they want to understand (why did cus-
tomer service calls drop last month?); to predict things
about the future (what will sales look like over the next
six months?); or to decide what to do (should we go with
this promotion or a different one?).
Does Correlation Imply Causation?
Whenever you work with regression analysis or any other
analysis that tries to explain the impact of one factor on
another, you need to remember the important adage:
Correlation is not causation. This is critical and here’s
why: It’s easy to say that there is a correlation between
rain and monthly sales. The regression shows that they
are indeed related. But it’s an entirely different thing to
say that rain caused the sales. Unless you’re selling umbrellas,
it might be difficult to prove that there is cause
and effect.
Sometimes factors are correlated that are obviously
not connected by cause and effect, but more often in
business it’s not so obvious (see the sidebar, “Beware
Spurious Correlations,” at the end of this chapter). When
you see a correlation from a regression analysis, you can’t
make assumptions, says Redman. Instead, “You have to
go out and see what’s happening in the real world. What’s
the physical mechanism that’s causing the relationship?”
Go out and observe consumers buying your product in
the rain, talk to them, and find out what is actually causing
them to make the purchase. “A lot of people skip this
step and I think it’s because they’re lazy. The goal is not
to figure out what is going on in the data but to figure
out what is going on in the world. You have to go out and
pound the pavement,” he says.
Redman once ran his own experiment and analysis in
order to better understand the connection between his
travel and weight gain. He noticed that when he trav-
eled, he ate more and exercised less. Was his weight gain
caused by travel? Not necessarily. “It was nice to quan-
tify what was happening but travel isn’t the cause. It may
be related,” he says, but it’s not like his being on the road
put those extra pounds on. He had to understand more
about what was happening during his trips. “I’m often
in new environments so maybe I’m eating more because
I’m nervous.” He needed to look more closely at the cor-
relation. And this is his advice to managers. Use the data
to guide more experiments, not to make conclusions
about cause and effect.
What Mistakes Do People Make When Working with Regression Analysis?
As a consumer of regression analysis, you need to keep
several things in mind.
First, don’t tell your data analyst to go figure out what
is affecting sales. “The way most analyses go haywire is
the manager hasn’t narrowed the focus on what he or
she is looking for,” says Redman. It’s your job to identify
the factors that you suspect are having an impact and
ask your analyst to look at those. “If you tell a data scientist
to go on a fishing expedition, or to tell you something
you don’t know, then you deserve what you get, which
is bad analysis,” he says. In other words, don’t ask your
analysts to look at every variable they can possibly get
their hands on all at once. If you do, you’re likely to find
relationships that don’t really exist. It’s the same principle
as flipping a coin: Do it enough times, you’ll eventually
think you see something interesting, like a bunch of
heads all in a row. (For more on how to communicate
your data needs to experts, see chapter 4.)
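The coin-flipping point can be simulated. The sketch below is an illustration with invented inputs: it generates 12 months of random “sales” and 200 candidate variables that are pure noise, then counts how many of them appear to be correlated with sales anyway. The seed, the number of candidates, and the 0.5 cutoff for a “strong-looking” correlation are all arbitrary assumptions.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(42)  # fixed seed so the run is repeatable
months = 12

# Monthly sales that are, by construction, unrelated to anything below.
sales = [random.gauss(200, 20) for _ in range(months)]

# 200 candidate "explanations", each of them pure random noise.
candidates = [[random.gauss(0, 1) for _ in range(months)] for _ in range(200)]

correlations = [pearson(candidate, sales) for candidate in candidates]
strongest = max(abs(r) for r in correlations)
looks_strong = sum(1 for r in correlations if abs(r) > 0.5)

print(f"strongest correlation found by chance: {strongest:.2f}")
print(f"candidates with |correlation| above 0.5: {looks_strong}")
```

None of the candidates has any real connection to sales, yet with this many of them and only 12 data points, some will look related. An analyst asked to search every available variable is running this experiment on real data.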
Also keep in mind whether or not you can do anything
about the independent variable you’re considering. You
can’t change how much it rains, so how important is it to
understand that? “We can’t do anything about weather
or our competitor’s promotion but we can affect our own
promotions or add features, for example,” says Redman.
Always ask yourself what you will do with the data. What
actions will you take? What decisions will you make?
Second, “analyses are very sensitive to bad data” so
be careful about the data you collect and how you collect
it, and know whether you can trust it (as we learned
in chapter 8). “All the data doesn’t have to be correct or
perfect,” explains Redman, but consider what you will be
doing with the analysis. If the decisions you’ll make as a
result don’t have a huge impact on your business, then
it’s OK if the data is “kind of leaky.” But, “if you’re try-
ing to decide whether to build 8 or 10 of something and
each one costs $1 million to build, then it’s a bigger deal,”
he says.
Redman also says that some managers who are new
to understanding regression analysis make the mistake
of ignoring the error term. This is dangerous because
they’re making the relationship between two variables
more certain than it is. “Oftentimes the results spit out
of a computer and managers think, ‘That’s great, let’s use
this going forward.’” But remember that the results are
always uncertain. As Redman points out, “If the regres-
sion explains 90% of the relationship, that’s great. But
if it explains 10%, and you act like it’s 90%, that’s not
good.” The point of the analysis is to quantify the cer-
tainty that something will happen. “It’s not telling you
how rain will influence your sales, but it’s telling you the
probability that rain may influence your sales.”
The last mistake that Redman warns against is letting
data replace your intuition. “You always have to lay your
intuition on top of the data,” he explains. Ask yourself
whether the results fit with your understanding of the
situation. And if you see something that doesn’t make
sense, ask whether the data was right or whether there
is indeed a large error term. Redman suggests you look
to more experienced managers or other analyses if you’re
getting something that doesn’t make sense. And, he says,
never forget to look beyond the numbers to what’s happening
outside your office: “You need to pair any analysis
with study of the real world. The best scientists—and
managers—look at both.”
Amy Gallo is a contributing editor at Harvard Business
Review and the author of the HBR Guide to Dealing with
Conflict. Follow her on Twitter @amyegallo.
BEWARE SPURIOUS CORRELATIONS
We all know the truism “Correlation doesn’t imply cau-
sation,” but when we see lines sloping together, bars
rising together, or points on a scatterplot clustering,
the data practically begs us to assign a reason. We
want to believe one exists.
Statistically we can’t make that leap, however.
Charts that show a close correlation are often relying
on a visual parlor trick to imply a relationship. Tyler Vi-
gen, a JD student at Harvard Law School and the au-
thor of Spurious Correlations, has made sport of this
on his website, which charts farcical correlations—for
example, between U.S. per capita margarine con-
sumption and the divorce rate in Maine.
Vigen has programmed his site so that anyone can
find and chart absurd correlations in large data sets.
We tried a few of our own and came up with these
gems:
[Chart: “More iPhones means more people die from falling down stairs.” Plotted series: iPhone sales and deaths caused by falls down stairs (U.S.), 2007 to 2010.]
[Chart: “Let’s cheer on the team, and we’ll lose weight.” Plotted series: per capita consumption of high-fructose corn syrup (U.S.) and spending on admission to spectator sports (U.S.), 2000 to 2009.]
[Chart: “To increase auto sales, market trips to Universal Orlando.” Plotted series: visitors to Universal Orlando’s “Islands of Adventure” and sales of new cars (U.S.), 2007 to 2009.]
Source: Tylervigen.com
Although it’s easy to spot and explain away absurd examples,
like these, you’re likely to encounter rigged
but plausible charts in your daily work. Here are three
types to watch out for:
Apples and Oranges: Comparing
Dissimilar Variables
Y axis scales that measure different values may show
similar curves that shouldn’t be paired. This becomes
pernicious when the values appear to be related but
aren’t.
[Chart: total “Black Friday” online revenue and eBay total gross merchandise volume, 2008 to 2013, plotted together on two different y axes.]
It’s best to chart them separately.
[Charts: eBay total gross merchandise volume, 2008 to 2013, and total “Black Friday” online revenue, 2008 to 2013, each on its own chart with a y axis starting at zero.]
Skewed Scales: Manipulating Ranges to Align Data
Even when y axes measure the same category, changing
the scales can alter the lines to suggest a correlation.
These y axes for RetailCo’s monthly revenue differ
in range and proportional increase.
[Chart: RetailCo monthly revenue from customers over 40 and customers under 40, January to December, plotted on two y axes with different ranges ($0 to $500K and $10K to $35K).]
Eliminating the second axis shows how skewed this
chart is.
[Chart: the same two series, customers over 40 and customers under 40, plotted on a single y axis from $0 to $500K.]
Ifs and Thens: Implying Cause and Effect
Plotting unrelated data sets together can make it seem
that changes in one variable are causing changes in the
other.
We try to create a narrative—if Pandora loses less
money, then more music is copyrighted—from what is
probably a coincidence.
[Charts: Pandora net losses and musical works copyrighted (U.S.), 2006 to 2009, first plotted together on two y axes and then each charted separately.]
Adapted from “Beware Spurious Correlations,” Harvard Business Review, June 2015 (product #F1506Z).
CHAPTER 11
When to Act On a Correlation, and When Not To
by David Ritter
“Petabytes allow us to say: ‘Correlation is enough.’”
—Chris Anderson, Wired, June 23, 2008
The sentiment expressed by Chris Anderson in 2008 is
a popular meme in the big data community. “Causality is
dead,” say the priests of analytics and machine learning.
They argue that given enough statistical evidence, it’s no
longer necessary to understand why things happen—we
need only know what things happen together.
Adapted from content posted on hbr.org, March 19, 2014 (product
#H00Q1X).
But inquiring whether correlation is enough is asking
the wrong question. For consumers of big data, the key
question is, “Can I take action on the basis of a correlation
finding?” The answer to that question is, “It depends”—primarily
on two factors:
• Confidence that the correlation will reliably recur
in the future. The higher that confidence level, the
more reasonable it is to take action in response.
• The trade-off between the risk and reward of
acting. If the risk of acting and being wrong is ex-
tremely high, for example, acting on even a strong
correlation may be a mistake.
The first factor—the confidence that the correlation
will recur—is in turn a function of two things: the fre-
quency with which the correlation has historically oc-
curred (the more often events occur together in real
life, the more likely it is that they are connected) and an
understanding of what is causing that statistical finding.
This second element—what we call “clarity of causal-
ity”—stems from the fact that the fewer possible explana-
tions there are for a correlation, the higher the likelihood
that the two events are linked. Considering frequency
and clarity together yields a more reliable gauge of the
overall confidence in the finding than evaluating only
one or the other in isolation.
Understanding the interplay between the confidence
level and the risk/reward trade-off enables sound deci-
sions on what action—if any—makes sense in light of a
particular statistical fi nding. The bottom line: Causal-
ity can matter tremendously. And efforts to gain better
insight into the cause of a correlation can drive up the
confidence level of taking action.
These concepts allowed The Boston Consulting Group
(BCG) to develop a prism through which any potential
action can be evaluated. If the value of acting is high, and
the cost of acting when wrong is low, it can make sense to
act based on even a weak correlation. We choose to look
both ways before crossing the street because the cost of
looking is low and the potential loss from not looking is
high (what is known in statistical jargon as an “asymmetric
loss function”). Alternatively, if the confidence in the
finding is low because you don’t have a handle on
why two events are linked, you should be less willing to
take actions that have significant potential downside,
as illustrated in figure 11-1.
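One simple way to make this trade-off concrete is to weigh the benefit of acting by your confidence in the finding and the cost of being wrong by your remaining doubt. The sketch below is an assumption of this kind, not BCG’s actual framework; the function names and all of the numbers are hypothetical.

```python
def expected_value_of_acting(confidence, benefit_if_right, cost_if_wrong):
    """Expected payoff of acting on a correlation.

    confidence is the estimated probability (0 to 1) that the
    correlation will hold when you act on it.
    """
    return confidence * benefit_if_right - (1 - confidence) * cost_if_wrong


def should_act(confidence, benefit_if_right, cost_if_wrong):
    """Act only when the expected payoff is positive."""
    return expected_value_of_acting(confidence, benefit_if_right, cost_if_wrong) > 0


# Weak correlation, but cheap to be wrong: worth acting on.
cheap_check = should_act(confidence=0.2, benefit_if_right=100, cost_if_wrong=5)

# Strong correlation, but very costly to be wrong: hold back.
risky_move = should_act(confidence=0.8, benefit_if_right=100, cost_if_wrong=1000)

print(cheap_check, risky_move)
```

In the first case the expected payoff is positive even at 20% confidence, because a mistake costs little. In the second, 80% confidence is not enough, because the rare mistake is ten times as expensive as the usual gain.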
FIGURE 11-1
When to act on a correlation in your data
How confident are you in the relationship? And do the benefits of action outweigh the risk?
[Chart: the vertical axis is confidence in the relationship, rising from “Infrequent, unstable correlation” through “Frequent correlation; but many causal hypotheses” to “Frequent correlation; clear causal hypothesis.” The horizontal axis is benefits of action relative to cost of being wrong, running from “Risks outweigh benefits” to “Benefits outweigh risk.” The chart is divided into an “Act” region and a “Don’t act” region.]
Source: David Ritter, BCG
Consider the case of New York City’s sewer sensors.
These sensors detect the amount of grease flowing into
the sewer system at various locations throughout the city.
If the data collected shows a concentration of grease at
an unexpected location—perhaps due to an unlicensed
restaurant—officials will send a car out to determine the
source. The confidence in the meaning of the data from
the sensors is on the low side—there may be many other
explanations for the excessive influx of grease. But there’s
little cost if the inspection finds nothing amiss.
Recent decisions around routine PSA screening tests
for prostate cancer involved a very different risk/reward
trade-off. Confidence that PSA blood tests are a good
predictor of cancer is low because the correlation itself is
weak—elevated PSA levels are found often in men without
prostate cancer. There is also no clear causal explanation
for how PSA is related to the development of cancer. In ad-
dition, preventative surgery prompted by the test did not
increase long-term survival rates. And the risk associated
with screening was high, with false positives leading to un-
necessary, debilitating treatment. The result: The Ameri-
can Medical Association reversed its previous recommen-
dation that men over 50 have routine PSA blood tests.
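The effect of frequent false positives can be shown with a short calculation of how often a positive result is actually correct. The sketch below uses invented numbers purely to show the arithmetic; they are not clinical figures for PSA testing or any other real test, and the function name is hypothetical.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Share of positive test results that are true positives."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical test: 3% of people screened have the condition, the test
# catches 80% of them, and it correctly clears 70% of healthy people.
ppv = positive_predictive_value(prevalence=0.03, sensitivity=0.80, specificity=0.70)
print(f"chance that a positive result is correct: {ppv:.1%}")
```

Under these assumed numbers, fewer than one positive result in ten reflects the condition being present. When the follow-up action is costly or harmful, that ratio matters more than the correlation itself.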
Of course, there is usually not just one, but a range
of possible actions in response to a statistical finding.
This came into play recently in a partnership between an
Australian supermarket and an auto insurance company.
Combining data from the supermarket’s loyalty card pro-
gram with auto claims information revealed interesting
correlations. The data showed that people who buy red
meat and milk are good car insurance risks while peo-
ple who buy pasta and spirits and who fuel their cars at
night are poor risks. Though this statistical relationship
could be an indicator of risky behaviors (driving under
the influence of spirits, for example), there are a number
of other possible reasons for the finding.
Potential responses to the finding included:
• Targeting insurance marketing to loyalty card
holders in the low-risk group
• Pricing car insurance based on these buying
patterns
The latter approach, however, could lead to brand-
damaging backlash should the practice be exposed.
Looking at the two options via our framework in figure 11-2
makes clear that without additional confidence
in the finding, the former approach is preferable.
FIGURE 11-2
If supermarket purchases correlate with auto insurance claims, what should an insurer do?
With the cause of the relationship unclear, low-risk actions are advisable.
[Chart: the vertical axis is confidence in the relationship, rising from “Infrequent, unstable correlation” through “Frequent correlation; but many causal hypotheses” to “Frequent correlation; clear causal hypothesis.” The horizontal axis is benefits of action relative to cost of being wrong, running from “Risks outweigh benefits” to “Benefits outweigh risk.” Targeting insurance marketing based on buying patterns falls in the “Act” region; pricing car insurance based on buying patterns falls in the “Don’t act” region.]
Source: David Ritter, BCG
However, if we are able to find a clear causal explanation
for this correlation, we may be able to increase
confidence sufficiently to take the riskier, higher-value
action of increasing rates. For example, the buying pat-
terns associated with higher risks could be leading indi-
cators of an impending life transition such as loss of em-
ployment or a divorce. This possible explanation could
be tested by adding additional data to the analysis.
In this case causality is critical. New factors can potentially
be identified that create a better understanding
of the dynamics at work. The goal is to rule out some pos-
sible causes and shed light on what is really driving that
correlation. That understanding will increase the overall
level of confidence that the correlation will continue in
the future—essentially shifting possible actions into the
upper portion of the framework. The result may be that
previously ruled-out responses are now appropriate. In
addition, insight on the cause of a correlation can allow
you to look for changes that cause the linkage to weaken
or disappear. And that knowledge makes it possible to
monitor and respond to events that might make a previ-
ously sound response outdated.
There is no shortage of examples where the selection
of the right response hinges on this “clarity of cause.” The
U.S. Army, for example, has developed image processing
software that uses flashes of light to locate the possible
position of a sniper. But similar flashes also come
from a camera. With two potential reasons for the imaging
pattern, the confidence in the finding is lower than
it would be if there were just one. And that, of course,
will determine how to respond—and what level of risk is
acceptable.
When working with big data, sometimes correlation
is enough. But other times understanding the cause is vi-
tal. The key is to know when correlation is enough—and
what to do when it is not.
David Ritter is a director in the Technology Advantage
practice of The Boston Consulting Group (BCG), where
he advises clients on the use of technology for competi-
tive advantage, open innovation, and other topics.
CHAPTER 12
Can Machine Learning Solve Your Business Problem?
by Anastassia Fedyk
As you consider ways to analyze large swaths of data, you
may ask yourself how the latest technological tools and
automation can help. AI, big data, and machine learn-
ing are all trending buzzwords, but how can you know
which problems in your business are amenable to ma-
chine learning?
Adapted from “How to Tell If Machine Learning Can Solve Your Busi-
ness Problem” on hbr.org, November 25, 2016 (product #H03A8R).
To decide, you need to think about the problem to be
solved and the available data, and ask questions about
feasibility, intuition, and expectations.
Assess Whether Your Problem Requires Learning
Machine learning can help automate your processes, but
not all automation problems require learning.
Automation without learning is appropriate when the
problem is relatively straightforward—the kinds of tasks
where you have a clear, predefined sequence of steps that
is currently being executed by a human, but could con-
ceivably be transitioned to a machine. This sort of au-
tomation has been happening in businesses for decades.
Screening incoming data from an outside data provider
for well-defined potential errors is an example of a problem
ready for automation. (For example, hedge funds
automatically filter out bad data in the form of a negative
value for trading volume, which can’t be negative.) On
the other hand, encoding human language into a struc-
tured data set is something that is just a tad too ambi-
tious for a straightforward set of rules.
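Automation without learning can be as simple as a fixed rule applied to every incoming record. The sketch below illustrates the trading-volume check described above; the record layout, the field names, and the function is_valid_volume are hypothetical and are not taken from any real data provider’s feed.

```python
def is_valid_volume(record):
    """Rule: trading volume must be a number and cannot be negative."""
    volume = record.get("volume")
    return isinstance(volume, (int, float)) and volume >= 0


# Hypothetical incoming records from an outside data provider.
feed = [
    {"ticker": "AAA", "volume": 12500},
    {"ticker": "BBB", "volume": -300},   # impossible value, rejected
    {"ticker": "CCC", "volume": 0},
    {"ticker": "DDD", "volume": None},   # missing value, rejected
]

clean = [record for record in feed if is_valid_volume(record)]
rejected = [record for record in feed if not is_valid_volume(record)]

print([record["ticker"] for record in clean])
print([record["ticker"] for record in rejected])
```

The rule is written once by a person and never changes in response to the data, which is what separates this kind of automation from machine learning.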
For the second type of problem, standard automation
is not enough. Such complex problems require learning
from data—and now we venture into the arena of ma-
chine learning. Machine learning, at its core, is a set of
statistical methods meant to find patterns of predictability
in data sets. These methods are great at determining
how certain features of the data are related to the out-
comes you are interested in. What these methods cannot
do is access any knowledge outside of the data you provide.
For example, researchers at the University of Pittsburgh
in the late 1990s evaluated machine-learning algorithms
for predicting mortality rates from pneumonia.1
The algorithms recommended that hospitals send home
pneumonia patients who were also asthma sufferers, es-
timating their risk of death from pneumonia to be lower.
It turned out that the data set fed into the algorithms did
not account for the fact that asthma sufferers had been
immediately sent to intensive care, and had fared better
only because of the additional attention.2
So what are good business problems for machine
learning methods? Essentially, any problems that meet
the following two criteria:
1. They require prediction rather than causal
inference.
2. They are sufficiently self-contained or relatively
insulated from outside influences.
The first means that you are interested in understanding
how, on average, certain aspects of the data relate to
each other, and not in the causal channels of their rela-
tionship. (Keep in mind that the statistical methods do
not bring to the table the intuition, theory, or domain
knowledge of human analysts.) The second means that
you are relatively certain that the data you feed to your
learning algorithm includes more or less all there is to
the problem. If, in the future, the thing you’re trying to
predict changes unexpectedly and no longer matches
prior patterns in the data, the algorithm will not know
what to make of it.
Examples of good machine learning problems include
predicting the likelihood that a certain type of user will
click on a certain kind of ad, or evaluating the extent to
which a piece of text is similar to previous texts you have
seen. (To see an example of how an artificial intelligence
algorithm learned from existing customer data and test
marketing campaigns to find new sales leads, see the
sidebar “Artificial Intelligence at Harley-Davidson.”)
Bad examples include predicting profits from the introduction
of a completely new and revolutionary product
line, or extrapolating next year’s sales from past data
when an important new competitor just entered the
market.
ARTIFICIAL INTELLIGENCE AT HARLEY-DAVIDSON
by Brad Power
It was winter in New York City, and Asaf Jacobi’s
Harley-Davidson dealership was selling one or two
motorcycles a week. It wasn’t enough.
Jacobi went for a long walk in Riverside Park and
happened to bump into Or Shani, CEO of an AI firm,
Adgorithms. After discussing Jacobi’s sales woes,
Shani suggested he try out Albert, Adgorithms’ AI-driven
marketing platform. It works across digital
channels, like Facebook and Google, to measure and
then autonomously optimize the outcomes of market-
ing campaigns. Jacobi decided he’d give Albert a one-
weekend audition.
That weekend, Jacobi sold 15 motorcycles—almost
twice his all-time summer weekend sales record of
eight.
Naturally, Jacobi kept using Albert. His dealership
went from getting one qualified lead per day to 40. In
the first month, 15% of those new leads were lookalikes,
meaning that the people calling the dealership
to set up a visit resembled previous high-value cus-
tomers and therefore were more likely to make a pur-
chase. By the third month, the dealership’s leads had
increased 2,930%, 50% of them lookalikes, leaving
Jacobi scrambling to set up a new call center with six
new employees to handle all the new business.
While Jacobi had estimated that only 2% of New
York City’s population were potential buyers, Albert
revealed that his target market was larger—much
larger—and began finding customers Jacobi didn’t
even know existed.
How did it do that?
Albert drove in-store traffic by generating leads, defined
as customers who express interest in speaking to
a salesperson by filling out a form on the dealership’s
website. Armed with creative content (headlines and
visuals) provided by Harley-Davidson and key performance
targets, Albert began by analyzing existing customer
data from Jacobi’s customer relationship management
system to isolate defining characteristics and
behaviors of high-value past customers: those who ei-
ther had completed a purchase, added an item to an
online cart, viewed website content, or were among
the top 25% in terms of time spent on the website.
Using this information, Albert identified lookalikes
who resembled these past customers and created
micro segments—small sample groups with whom it
could run test campaigns before extending its efforts
more widely. Albert used the data gathered through
these tests to predict which possible headlines and vi-
sual combinations, and thousands of other campaign
variables, would most likely convert different audience
segments through various digital channels (social me-
dia, search, display, and email or SMS).
Once it determined what was working and what
wasn’t, Albert scaled the campaigns, autonomously
allocating resources from channel to channel, making
content recommendations, and so on.
For example, when it discovered that ads with the
word call—such as, “Don’t miss out on a pre-owned
Harley with a great price! Call now!”—performed 447%
better than ads containing the word buy, such as, “Buy
a pre-owned Harley from our store now!” Albert imme-
diately changed buy to call in all ads across all relevant
channels. The results spoke for themselves.
For Harley-Davidson, AI evaluated what was work-
ing across digital channels and what wasn’t, and used
what it learned to create more opportunities for con-
version. In other words, the system allocated resources
only to what had been proven to work, thereby increasing
digital marketing ROI. Using AI, Harley-Davidson
was able to eliminate guesswork, gather and analyze
enormous volumes of data, and optimally leverage the
resulting insights.
Adapted from “How Harley-Davidson Used Artificial Intelligence to Increase New York Sales Leads by 2,930%” on hbr.org, May 30, 2017 (product #H03NFD).
Brad Power is a consultant who helps organizations that must make faster changes to their products, services, and systems to compete with startups and leading software companies.
Find the Appropriate Data
Once you verify that your problem is suitable for machine
learning, the next step is to evaluate whether you
have the right data to solve it. The data might come from
you or from an external provider. In the latter case, ask
enough questions to get a good feel for the data’s scope
and whether it is likely to be a good fit for your problem.
Ask Questions and Look for Mistakes
Once you’ve determined that your problem is a classic
machine learning problem and you have the data to
fit it, check your intuition. Machine learning methods,
however proprietary and seemingly magical, are statis-
tics. And statistics can be explained in intuitive terms.
Instead of trusting that the brilliant proposed method
will seamlessly work, ask lots of questions.
Get yourself comfortable with how the method works.
Does the intuition of the method roughly make sense?
Does it fit conceptually into the framework of the particular
setting or problem you are dealing with? What
makes this method especially well-suited to your prob-
lem? If you are encoding a set of steps, perhaps sequen-
tial models or decision trees are a good choice. If you
need to separate two classes of outcome, perhaps a bi-
nary support vector machine would be best aligned with
your needs.
With understanding come more realistic expectations.
Once you ask enough questions and receive enough an-
swers to have an intuitive understanding of how the
methodology works, you will see that it is far from magi-
cal. Every human makes mistakes, and every algorithm
is error prone too. For all but the simplest of problems,
there will be times when things go wrong. The machine
learning prediction engine will get things right on aver-
age but will reliably make mistakes. And these errors will
happen most often in ways that you cannot anticipate.
Decide How to Move Forward
The last step is to evaluate the extent to which you can
allow for exceptions or statistical errors in your pro-
cess. Is your problem the kind where getting things
right 80% of the time is enough? Can you deal with a
10% error rate? 5%? 1%? Are there certain kinds of er-
rors that should never be allowed? Be clear and upfront
about your needs and expectations, both with yourself
and with your solution provider. And once both of you
are comfortably on the same page, go ahead. Armed with
knowledge, understanding, and reasonable expectations,
you are set to reap the benefits of machine learning. Just
please be patient.
Anastassia Fedyk is a PhD candidate in business economics
at Harvard Business School. Her research focuses
on finance and behavioral economics.
NOTES
1. G. F. Cooper et al., “An Evaluation of Machine-Learning Methods for Predicting Pneumonia Mortality,” Artificial Intelligence in Medicine 9 (1997): 107–138.
2. A. M. Bornstein, “Is Artificial Intelligence Permanently Inscrutable?” Nautilus, September 1, 2016, http://nautil.us/issue/40/learning/is-artificial-intelligence-permanently-inscrutable.
CHAPTER 13
A Refresher on Statistical Significance
by Amy Gallo
When you run an experiment or analyze data, you want
to know if your findings are significant. But business relevance
(that is, practical significance) isn’t always the
same thing as confidence that a result isn’t due purely to
chance (that is, statistical significance). This is an important
distinction; unfortunately, statistical significance
is often misunderstood and misused in organizations today.
And because more and more companies are relying
on data to make critical business decisions, it’s an essential
concept for managers to understand.
Adapted from content posted on hbr.org, February 16, 2016 (product
#H02NMS).
To better understand what statistical significance really
means, I talked with Thomas Redman, author of
Data Driven: Profiting from Your Most Important Business
Asset, and adviser to organizations on their data and
data quality programs.
What Is Statistical Significance?
“Statistical significance helps quantify whether a result
is likely due to chance or to some factor of interest,” says
Redman. When a finding is significant, it simply means
you can feel confident that it’s real, not that you just got
lucky (or unlucky) in choosing the sample.
When you run an experiment, conduct a survey, take
a poll, or analyze a data set, you’re taking a sample of
some population of interest, not looking at every single
data point that you possibly can. Consider the example
of a marketing campaign. You’ve come up with a new
concept and you want to see if it works better than your
current one. You can’t show it to every single target cus-
tomer, of course, so you choose a sample group.
When you run the results, you find that those who saw
the new campaign spent $10.17 on average, more than
the $8.41 spent by those who saw the old campaign. This
$1.76 might seem like a big—and perhaps important—
difference. But in reality you may have been unlucky,
drawing a sample of people who do not represent the
larger population; in fact, maybe there was no difference
between the two campaigns and their influence on consumers’
purchasing behaviors. This is called a sampling
error, something you must contend with in any test that
does not include the entire population of interest.
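To see sampling error in action, here is a minimal Python sketch. It assumes a hypothetical population of customers (the spend distribution, its $9 mean, and its $4 spread are invented for illustration and do not come from the example) and repeatedly draws two samples of 25 from that same population. Any gap between the two sample averages is pure sampling error, because the simulated data contain no campaign effect at all.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical population: spend per customer, unaffected by any campaign.
# The mean (9.0) and spread (4.0) are assumptions made for this sketch.
population = [max(0.0, random.gauss(9.0, 4.0)) for _ in range(100_000)]

def sample_mean(n):
    """Average spend in a random sample of n customers."""
    return statistics.fmean(random.sample(population, n))

# Draw two samples from the SAME population many times and record the gap.
gaps = [abs(sample_mean(25) - sample_mean(25)) for _ in range(1_000)]

# How often does chance alone produce a gap as large as the $1.76 in the text?
share_big_gaps = sum(g >= 1.76 for g in gaps) / len(gaps)
print(f"Trials with a gap of $1.76 or more: {share_big_gaps:.1%}")
```

Under these assumed numbers, a noticeable share of trials shows a gap of $1.76 or more even though nothing differs between the groups, which is the trap the paragraph above describes.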
Redman notes that there are two main contributors to
sampling error: the size of the sample and the variation
in the underlying population. Sample size may be intuitive
enough. Think about flipping a coin 5 times versus
flipping it 500 times. The more times you flip, the less
likely you’ll end up with a great majority of heads. The
same is true of statistical significance: With bigger sample
sizes, you’re less likely to get results that reflect randomness.
All else being equal, you’ll feel more comfortable
in the accuracy of the campaigns’ $1.76 difference
if you showed the new one to 1,000 people rather than
just 25. Of course, showing the campaign to more people
costs more money, so you have to balance the need for a
larger sample size with your budget.
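The coin-flip intuition can be checked exactly. The sketch below is an illustration only: it takes “a great majority of heads” to mean 80% or more, which is an assumed threshold rather than one given in the text, and computes the probability for 5 flips and for 500 flips of a fair coin.

```python
from math import comb

def prob_at_least(heads, flips):
    """Exact probability of at least `heads` heads in `flips` fair coin flips."""
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favorable / 2 ** flips

# Assumed threshold for a "great majority": 80% heads or more.
p_small = prob_at_least(4, 5)      # 4 or 5 heads out of 5 -> 6/32 = 0.1875
p_large = prob_at_least(400, 500)  # 400 or more heads out of 500

print(f"5 flips:   {p_small:.4f}")
print(f"500 flips: {p_large:.3e}")
```

With 5 flips a lopsided result turns up almost one time in five; with 500 flips it is vanishingly rare, which is why larger samples are less likely to reflect randomness.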
Variation is a little trickier to understand, but Red-
man insists that developing a sense for it is critical for
all managers who use data. Consider the images in figure
13-1. Each expresses a different possible distribution
of customer purchases under campaign A. Looking at the
chart on the left (with less variation), most people spend
roughly the same amount. Some people spend a few dol-
lars more or less, but if you pick a customer at random,
chances are pretty good that they’ll be close to the aver-
age. So it’s less likely that you’ll select a sample that looks
vastly different from the total population, which means
you can be relatively confident in your results.
Compare that with the chart on the right (with more
variation). Here, people vary more widely in how much
they spend. The average is still the same, but quite a few
people spend more or less. If you pick a customer at ran-
dom, chances are higher that they are pretty far from
the average. So if you select a sample from a more varied
population, you can’t be as confident in your results.
To summarize, the important thing to understand is
that the greater the variation in the underlying popula-
tion, the larger the sampling error.
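A small simulation makes the link between variation and sampling error concrete. This is a sketch under stated assumptions: the two populations share a mean of $9 and differ only in their standard deviation ($1 versus $4), and both figures are invented for illustration. For samples of size n, statistical theory puts the spread of the sample average at the population standard deviation divided by the square root of n.

```python
import random
import statistics

random.seed(7)  # fixed seed so the illustration is repeatable

def spread_of_sample_means(pop_sd, n=25, trials=2_000, pop_mean=9.0):
    """Standard deviation of the sample average across repeated samples of size n."""
    means = [
        statistics.fmean(random.gauss(pop_mean, pop_sd) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

# Hypothetical populations: same average spend, different variation.
low_variation = spread_of_sample_means(pop_sd=1.0)   # theory: 1.0 / sqrt(25) = 0.2
high_variation = spread_of_sample_means(pop_sd=4.0)  # theory: 4.0 / sqrt(25) = 0.8

print(f"Lesser variation:  sample averages vary by about ${low_variation:.2f}")
print(f"Greater variation: sample averages vary by about ${high_variation:.2f}")
```

The sample size is identical in both runs, so the wider swing in the second case comes entirely from the greater variation in the underlying population.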
Redman advises that you should plot your data and
make pictures like these when you analyze the numbers.
The graphs will help you get a feel for variation, the sampling
error, and in turn, the statistical significance.
No matter what you’re studying, the process for evaluating
significance is the same. You start by stating a null
hypothesis. In the experiment about the marketing cam-
paign, the null hypothesis might be, “On average, cus-
tomers don’t prefer our new campaign to the old one.”
Before you begin, you should also state an alternative
hypothesis, such as, “On average, customers prefer the
new one,” and a target significance level. The significance
level is an expression of how rare your results are, under
the assumption that the null hypothesis is true. It is
usually expressed as a p-value, and the lower the p-value,
the less likely the results are due purely to chance.
[Figure 13-1. Population variation. Two panels, “Lesser variation” and “Greater variation”; x-axis: spend amount; y-axis: number of customers. Source: Thomas C. Redman]
Setting a target and interpreting p-values can be
dauntingly complex. Redman says it depends a lot on
what you are analyzing. “If you’re searching for the Higgs
boson, you probably want an extremely low p-value,
maybe 0.00001,” he says. “But if you’re testing for
whether your new marketing concept is better or the new
drill bits your engineer designed work faster than your
existing bits, then you’re probably willing to take a higher
value, maybe even as high as 0.25.”
Note that in many business experiments, managers
skip these two initial steps and don’t worry about significance
until after the results are in. However, it’s good
scientific practice to do these two things ahead of time.
Then you collect your data, plot the results, and calcu-
late statistics, including the p-value, which incorporates
variation and the sample size. If you get a p-value lower
than your target, then you reject the null hypothesis in
favor of the alternative. Again, this means the probabil-
ity is small that your results were due solely to chance.
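The steps above (state the hypotheses and a target level first, then collect data and compute a p-value) can be sketched with a permutation test. This is one common way to obtain a p-value, not necessarily the method a given statistical package uses. The spend figures are hypothetical, chosen only so that the averages match the $8.41 and $10.17 in the example, and the 0.05 target is likewise an assumption.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is repeatable

# Step 1 (before looking at data): hypotheses and target significance level.
# Null: customers do not spend more under the new campaign. Alternative: they do.
alpha = 0.05  # assumed target, chosen for illustration

# Step 2: hypothetical spend per customer in dollars (not data from the source).
old_campaign = [7.5, 9.0, 8.2, 6.9, 10.1, 8.8, 7.7, 9.4, 8.0, 8.5]   # mean 8.41
new_campaign = [9.8, 11.2, 10.5, 8.9, 10.1, 9.5, 10.9, 9.1, 11.4, 10.3]  # mean 10.17

observed = statistics.fmean(new_campaign) - statistics.fmean(old_campaign)

def permutation_p_value(a, b, observed_gap, trials=10_000):
    """One-sided p-value: how often random relabeling gives a gap this large."""
    pooled = a + b  # new list, so the inputs are not modified
    hits = 0
    for _ in range(trials):
        random.shuffle(pooled)
        gap = statistics.fmean(pooled[:len(a)]) - statistics.fmean(pooled[len(a):])
        if gap >= observed_gap:
            hits += 1
    return hits / trials

# Step 3: compute the p-value and compare it with the target.
p_value = permutation_p_value(new_campaign, old_campaign, observed)
reject_null = p_value < alpha
print(f"Observed gap: ${observed:.2f}, p-value: {p_value:.4f}, reject null: {reject_null}")
```

The p-value here reflects both ingredients discussed earlier: the variation within each group and the number of customers sampled.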
How Is It Calculated?
As a manager, chances are you won’t ever calculate statistical
significance yourself. “Most good statistical packages
will report the significance along with the results,” says
Redman. There is also a formula in Microsoft Excel and a
number of other online tools that will calculate it for you.
Still, it’s helpful to know the process in order to
understand and interpret the results. As Redman advises,
“Managers should not trust a model they don’t
understand.”
How Do Companies Use It?
Companies use statistical significance to understand
how strongly the results of an experiment, survey, or poll
they’ve conducted should influence the decisions they
make. For example, if a manager runs a pricing study to
understand how best to price a new product, they will
calculate the statistical significance (with the help of an
analyst, most likely) so that they know whether the findings
should affect the final price.
Remember the new marketing campaign that pro-
duced a $1.76 boost (more than 20%) in the average
sale? It’s surely of practical significance. If the p-value
comes in at 0.03 the result is also statistically significant,
and you should adopt the new campaign. If the p-value
comes in at 0.2 the result is not statistically significant,
but since the boost is so large you’ll probably still pro-
ceed, though perhaps with a bit more caution.
But what if the difference were only a few cents? If
the p-value comes in at 0.2, you’ll stick with your current
campaign or explore other options. But even if it had a
significance level of 0.03, the result is probably real,
though quite small. In this case, your decision probably
will be based on other factors, such as the cost of imple-
menting the new campaign.
Closely related to the idea of a significance level is the
notion of a confidence interval. Let’s take the example of
a political poll. Say there are two candidates: A and B.
The pollsters conduct an experiment with 1,000 “likely
voters.” From the sample, 49% say they’ll vote for A and
51% say they’ll vote for B. The pollsters also report a
margin of error of +/- 3%.
“Technically, 49% plus or minus 3% is a 95% confidence
interval for the true proportion of A voters in the
population,” says Redman. Unfortunately, he adds, most
people interpret this as “there’s a 95% chance that A’s
true percentage lies between 46% and 52%,” but that
isn’t correct. Instead, it says that if the pollsters were to
repeat the poll many times, 95% of intervals constructed
this way would contain the true proportion.
If your head is spinning at that last sentence, you’re
not alone. As Redman says, this interpretation is “mad-
deningly subtle, too subtle for most managers and even
many researchers with advanced degrees.” He says the
more practical interpretation of this would be “Don’t get
too excited that B has a lock on the election” or “B appears
to have a lead, but it’s not a statistically significant
one.” Of course, the practical interpretation would be
very different if 70% of the likely voters said they’d vote
for B and the margin of error was 3%.
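Both the poll’s margin of error and the repeated-sampling reading of a 95% interval can be checked with a short sketch. It uses the standard normal-approximation formula for a proportion, which is an assumption about how the pollsters computed their figure, and it assumes a hypothetical true share of 50% for candidate A in order to simulate many polls.

```python
import math
import random

random.seed(1)  # fixed seed so the illustration is repeatable

n = 1_000     # poll size from the example
p_hat = 0.49  # share of the sample backing candidate A
z = 1.96      # normal critical value for a 95% interval

# Normal-approximation margin of error for a proportion.
margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"Margin of error: +/- {margin:.1%}")  # about 3.1%

# Coverage check: assume a TRUE share (hypothetical) and repeat the poll.
true_share = 0.50
polls = 1_000
covered = 0
for _ in range(polls):
    votes = sum(random.random() < true_share for _ in range(n))
    estimate = votes / n
    moe = z * math.sqrt(estimate * (1 - estimate) / n)
    if estimate - moe <= true_share <= estimate + moe:
        covered += 1

coverage = covered / polls
print(f"Intervals that contain the true share: {coverage:.1%}")
```

The 95% describes the procedure: roughly 95 of every 100 intervals built this way capture the true share, while any single interval either contains it or does not.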
The reason managers bother with statistical significance
is they want to know what findings say about what
they should do in the real world. But “confidence intervals
and hypothesis tests were designed to support ‘science,’
where the idea is to learn something that will stand
the test of time,” says Redman. Even if a finding isn’t statistically
significant, it may have utility to you and your
company. On the other hand, when you’re working with
large data sets, it’s possible to obtain results that are
statistically significant but practically meaningless, for
example, that a group of customers is 0.000001% more
likely to click on campaign A over campaign B. So rather
than obsessing about whether your findings are precisely
right, think about the implication of each finding for the
decision you’re hoping to make. What would you do differently
if the finding were different?
What Mistakes Do People Make When Working with Statistical Significance?
“Statistical significance is a slippery concept and is often
misunderstood,” warns Redman. “I don’t run into very
many situations where managers need to understand it
deeply, but they need to know how to not misuse it.”
Of course, data scientists don’t have a monopoly on
the word “significant,” and often in businesses it’s used
to mean whether a finding is strategically important. It’s
good practice to use language that’s as clear as possible
when talking about data findings. If you want to discuss
whether the finding has implications for your strategy or
decisions, it’s fine to use the word “significant,” but if you
want to know whether something is statistically significant,
be precise in your language. Next time you look at
results of a survey or experiment, ask about the statistical
significance if the analyst hasn’t reported it.
Remember that statistical significance tests help you
account for potential sampling errors, but Redman says
what is often more worrisome is the non-sampling error:
“Non-sampling error involves things where the experi-
mental and/or measurement protocols didn’t happen ac-
cording to plan, such as people lying on the survey, data
getting lost, or mistakes being made in the analysis.” This
is where Redman sees more troubling results. “There
is so much that can happen from the time you plan the
survey or experiment to the time you get the results. I’m
more worried about whether the raw data is trustwor-
thy than how many people they talked to,” he says. Clean
data and careful analysis are more important than statistical
significance.
Always keep in mind the practical application of the
finding. And don’t get too hung up on setting a strict confidence
level. Redman says there’s a bias in scientific
literature that “a result wasn’t publishable unless it hit a
p = 0.05 (or less).” But for many decisions, like which
marketing approach to use, you’ll need a much lower
confidence level. In business, says Redman, there are often
more important criteria than statistical significance.
The important question is, “Does the result stand up in
the market, if only for a brief period of time?”
As Redman says, the results only give you so much in-
formation: “I’m all for using statistics, but always wed it
with good judgment.”
Amy Gallo is a contributing editor at Harvard Business
Review and the author of the HBR Guide to Dealing with
Conflict. Follow her on Twitter @amyegallo.
Reprinted from Harvard Business Review, May-June 2017 (product
#R1703K).
CHAPTER 14
Linear Thinking in a Nonlinear World
by Bart de Langhe, Stefano Puntoni, and Richard Larrick
Test yourself with this word problem: Imagine you’re responsible
for your company’s car fleet. You manage two
models, an SUV that gets 10 miles to the gallon and a
sedan that gets 20. The fleet has equal numbers of each,
and all the cars travel 10,000 miles a year. You have
enough capital to replace one model with more-fuel-efficient
vehicles to lower operational costs and help
meet sustainability goals.
Which upgrade is better?
A. Replacing the 10 MPG vehicles with 20 MPG
vehicles
B. Replacing the 20 MPG vehicles with 50 MPG
vehicles
Intuitively, option B seems more impressive—an in-
crease of 30 MPG is a lot larger than a 10 MPG one.
And the percentage increase is greater, too. But B is not
the better deal. In fact, it’s not even close. Let’s compare.
Gallons used per 10,000 miles
      Current            After upgrade     Savings
A.    1,000 (@10 MPG)    500 (@20 MPG)     500
B.    500 (@20 MPG)      200 (@50 MPG)     300
Is this surprising? For many of us, it is. That’s because
in our minds the relationship between MPG and fuel
consumption is simpler than it really is. We tend to think
it’s linear and looks like this:
But that graph is incorrect. Gas consumption is not a
linear function of MPG. When you do the math, the rela-
tionship actually looks like this:
And when you dissect the curve to show each upgrade
scenario, it becomes clear how much more effective it is
to replace the 10 MPG cars.
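The curve follows from one line of arithmetic: gallons used equals miles driven divided by miles per gallon. A minimal Python sketch (the function names are illustrative, not from the source) reproduces the savings for each upgrade and adds a more extreme 20-to-100 MPG upgrade for comparison.

```python
MILES_PER_YEAR = 10_000  # annual distance per car, from the word problem

def gallons_used(mpg, miles=MILES_PER_YEAR):
    """Fuel burned is distance divided by efficiency: a 1/x curve, not a line."""
    return miles / mpg

def savings(old_mpg, new_mpg):
    """Gallons saved per car per year by moving from old_mpg to new_mpg."""
    return gallons_used(old_mpg) - gallons_used(new_mpg)

option_a = savings(10, 20)   # 1,000 - 500 = 500 gallons
option_b = savings(20, 50)   # 500 - 200 = 300 gallons
extreme = savings(20, 100)   # 500 - 100 = 400 gallons

for label, value in [("10 -> 20 MPG", option_a),
                     ("20 -> 50 MPG", option_b),
                     ("20 -> 100 MPG", extreme)]:
    print(f"{label}: saves {value:,.0f} gallons per car per year")
```

Because MPG sits in the denominator, each additional mile per gallon saves less fuel than the one before it.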
Shockingly, upgrading fuel efficiency from 20 to 100
MPG still wouldn’t save as much gas as upgrading from
10 to 20 MPG.
But choosing the lower-mileage upgrade remains
counterintuitive, even in the face of the visual evidence.
It just doesn’t feel right.
If you’re still having trouble grasping this, it’s not your
fault. Decades of research in cognitive psychology show
that the human mind struggles to understand nonlinear
relationships. Our brain wants to make simple straight
lines. In many situations, that kind of thinking serves us
well: If you can store 50 books on a shelf, you can store
100 books if you add another shelf, and 150 books if you
add yet another. Similarly, if the price of coffee is $2, you
can buy five coffees with $10, 10 coffees with $20, and
15 coffees with $30.
But in business there are many highly nonlinear rela-
tionships, and we need to recognize when they’re in play.
This is true for generalists and specialists alike, because
even experts who are aware of nonlinearity in their fields
can fail to take it into account and default instead to re-
lying on their gut. But when people do that, they often
end up making poor decisions.
Linear Bias in Practice
We’ve seen consumers and companies fall victim to linear
bias in numerous real-world scenarios. A common
one concerns an important business objective: profits.
Three main factors affect profits: costs, volume, and
price. A change in one often requires action on the others
to maintain profits. For example, rising costs must be
offset by an increase in either price or volume. And if you
cut price, lower costs or higher volumes are needed to
prevent profits from dipping.
Unfortunately, managers’ intuitions about the relationships
between these profit levers aren’t always good.
For years experts have advised companies that changes
in price affect profits more than changes in volume or
costs. Nevertheless, executives often focus too much on
volume and costs instead of getting the price right.
Why? Because the large volume increases they see after
reducing prices are very exciting. What people don’t
realize is just how large those increases need to be to
maintain profits, especially when margins are low.
Imagine you manage a brand of paper towels. They
sell for 50 cents a roll, and the marginal cost of produc-
ing a roll is 15 cents. You recently did two price promo-
tions. Here’s how they compare:
              Normal    Promo A: 20% off    Promo B: 40% off
Price/Roll    50¢       40¢                 30¢
Sales         1,000     1,200 (+20%)        1,800 (+80%)
Intuitively, B looks more impressive: an 80% increase
in volume for a 40% decrease in price seems a
lot more profitable than a 20% increase in volume for a
20% cut in price. But you may have guessed by now that
B is not the most profitable strategy.
In fact, both promotions decrease profits, but B’s negative
impact is much bigger than A’s. Here are the profits
in each scenario:
               Normal    Promo A: 20% off    Promo B: 40% off
Price/Roll     50¢       40¢                 30¢
Sales          1,000     1,200 (+20%)        1,800 (+80%)
Profit/Roll    35¢       25¢                 15¢
Profit         $350      $300                $270
Although promotion B nearly doubled sales, profits
sank about 23%. To maintain the usual $350 profit
during the 40%-off sale, you would have to sell more
than 2,300 units, an increase of 133%. The curve looks
like this:
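The points on that curve can be recomputed from the figures in the example: a 50-cent price, a 15-cent marginal cost, and 1,000 rolls sold at the normal price. The sketch below is illustrative (the function names and the list of discounts are assumptions); it asks how many rolls must be sold at each discount to hold profit at $350.

```python
PRICE = 0.50        # normal price per roll, in dollars
UNIT_COST = 0.15    # marginal cost per roll, in dollars
BASE_SALES = 1_000  # rolls sold at the normal price

def unit_margin(discount):
    """Profit per roll after a fractional price cut (0.40 means 40% off)."""
    return PRICE * (1 - discount) - UNIT_COST

def profit(discount, units):
    """Total profit in dollars at the given discount and sales volume."""
    return unit_margin(discount) * units

def breakeven_units(discount):
    """Rolls that must be sold at the discounted price to match normal profit."""
    return profit(0.0, BASE_SALES) / unit_margin(discount)

# Discount levels are chosen for illustration; at 70% off the margin hits zero.
for discount in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6):
    units = breakeven_units(discount)
    print(f"{discount:>4.0%} off: sell {units:>6,.0f} rolls "
          f"({units / BASE_SALES - 1:+.0%} volume) to keep $350 profit")
```

The required volume climbs slowly for small discounts and then very steeply, because each price cut removes a larger fraction of an already thin per-roll margin.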
The nonlinear phenomenon also extends to intan-
gibles, like consumer attitudes. Take consumers and
sustainability. We frequently hear executives complain
that while people say they care about the environment,
they are not willing to pay extra for ecofriendly prod-
ucts. Quantitative analyses bear this out. A survey by the
National Geographic Society and GlobeScan finds that,
across 18 countries, concerns about environmental prob-
lems have increased markedly over time, but consumer
behavior has changed much more slowly. While nearly
all consumers surveyed agree that food production and
consumption should be more sustainable, few of them
alter their habits to support that goal.
What’s going on? It turns out that the relationship
between what consumers say they care about and their
actions is often highly nonlinear. But managers often
believe that classic quantitative tools, like surveys using
1-to-5 scales of importance, will predict behavior in a linear
fashion. In reality, research shows little or no behavioral
difference between consumers who, on a five-point
scale, give their environmental concern the lowest rat-
ing, 1, and consumers who rate it a 4. But the difference
between 4s and 5s is huge. Behavior maps to attitudes on
a curve, not a straight line.
Companies typically fail to account for this pattern—
in part because they focus on averages. Averages mask
nonlinearity and lead to prediction errors. For example,
suppose a firm did a sustainability survey among two of
its target segments. All consumers in one segment rate
their concern about the environment a 4, while 50% of
consumers in the other segment rate it a 3 and 50% rate
it a 5. The average level of concern is the same for the two
segments, but people in the second segment are overall
much more likely to buy green products. That’s because
a customer scoring 5 is much more likely to make envi-
ronmental choices than a customer scoring 4, whereas a
customer scoring 4 is not more likely to than a customer
scoring 3.
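A toy calculation shows how identical averages can hide different behavior. The purchase probabilities below are hypothetical, invented only to mirror the pattern described (flat from 1 through 4, then a jump at 5); they are not estimates from the research cited.

```python
# Hypothetical chance of buying green products, by stated concern (1-5 scale).
# Flat through 4, then a jump at 5. These values are assumptions for illustration.
BUY_PROBABILITY = {1: 0.05, 2: 0.05, 3: 0.05, 4: 0.05, 5: 0.40}

# Each segment maps a rating to the share of consumers who gave it.
segment_one = {4: 1.0}          # everyone rates concern a 4
segment_two = {3: 0.5, 5: 0.5}  # half rate it 3, half rate it 5

def average_rating(segment):
    """Mean stated concern across the segment."""
    return sum(rating * share for rating, share in segment.items())

def expected_buy_rate(segment):
    """Share of the segment expected to buy, given the nonlinear mapping."""
    return sum(BUY_PROBABILITY[rating] * share for rating, share in segment.items())

for name, segment in [("Segment one", segment_one), ("Segment two", segment_two)]:
    print(f"{name}: average concern {average_rating(segment):.1f}, "
          f"expected buy rate {expected_buy_rate(segment):.1%}")
```

Both segments average a 4, yet the second is expected to buy far more often under this mapping, because all of the behavioral lift sits at the top of the scale.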
The nonlinear relationship between attitudes and
behavior shows up repeatedly in important domains,
including consumers’ privacy concerns. A large-scale
survey in the Netherlands, for example, revealed little
difference in the number of loyalty-program cards car-
ried by consumers who said they were quite concerned
versus only weakly concerned about privacy. How is it
possible that people said they were worried about pri-
vacy but then agreed to sign up for loyalty programs that
require the disclosure of sensitive personal information?
Again, because only people who say they are extremely
concerned about privacy take significant steps to protect
it, while most others, regardless of their concern rating,
don’t adjust their behavior.
Awareness of nonlinear relationships is also impor-
tant when choosing performance metrics. For instance,
to assess the effectiveness of their inventory management,
some firms track days of supply, or the number
of days that products are held in inventory, while other
firms track the number of times their inventory turns
over annually. Most managers don’t know why their
firm uses one metric and not the other. But the choice
may have unintended consequences—for instance, on
employee motivation. Assume a firm was able to reduce
days of supply from 12 to six and that with additional
research, it could further reduce days of supply to four.
This is the same as saying that the inventory turn rate
could increase from 30 times a year to 60 times a year
and that it could be raised again to 90 times a year. But
employees are much more motivated to achieve improvements
if the firm tracks turnover instead of days
of supply, research by the University of Cologne’s To-
bias Stangl and Ulrich Thonemann shows. That’s be-
cause they appear to get decreasing returns on their
efforts when they improve the days-of-supply metric—
but constant returns when they improve the turnover
metric.
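The two metrics are reciprocals of each other, which is why the same improvements look so different. The sketch below assumes a 360-day year, the convention implied by the example (12 days of supply corresponding to 30 turns); the variable names are illustrative.

```python
DAYS_PER_YEAR = 360  # assumed convention: 12 days of supply equals 30 turns

def turns_per_year(days_of_supply):
    """Inventory turnover is the reciprocal of days of supply, scaled to a year."""
    return DAYS_PER_YEAR / days_of_supply

days = [12, 6, 4]                            # successive improvements in the example
turns = [turns_per_year(d) for d in days]    # 30, 60, 90

# Improvement at each step, measured in each metric.
day_gains = [before - after for before, after in zip(days, days[1:])]     # 6, then 2
turn_gains = [after - before for before, after in zip(turns, turns[1:])]  # 30, then 30

print("Days of supply:", days, "-> gains", day_gains)
print("Turns per year:", turns, "-> gains", turn_gains)
```

Measured in days, the second improvement looks a third the size of the first; measured in turns, the two are equal.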
Other areas where companies can choose different
metrics include warehousing (picking time versus pick-
ing rate), production (production time versus produc-
tion rate), and quality control (time between failures ver-
sus failure rate).
Nonlinearity is all around us. Let’s now explore the
forms it takes.
The Four Types of Nonlinear Relationships
The best way to understand nonlinear patterns is to see
them. There are four types.
Increasing gradually, then rising more steeply
Say a company has two customer segments that both
have annual contribution margins of $100. Segment A
has a retention rate of 20% while segment B has one of
60%. Most managers believe that it makes little differ-
ence to the bottom line which segment’s retention they
increase. If anything, most people find doubling the
weaker retention rate more appealing than increasing
the stronger one by, say, a third.
But customer lifetime value is a nonlinear function of
retention rate, as you’ll see when you apply the formula
for calculating CLV:
$$\text{CLV} = \frac{\text{Margin} \times \text{Retention Rate}}{1 + \text{Discount Rate} - \text{Retention Rate}}$$
When the retention rate rises from 20% to 40%, CLV
goes up about $35 (assuming a discount rate of 10% to
adjust future profits to their current worth), but when
retention rates rise from 60% to 80%, CLV goes up
about $147. As retention rates rise, customer lifetime
value increases gradually at first and then suddenly
shoots up.
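The CLV figures above can be reproduced directly from the formula. This short sketch uses the $100 margin and 10% discount rate from the example; the function name is illustrative.

```python
def clv(margin, retention, discount=0.10):
    """Customer lifetime value: margin x retention / (1 + discount - retention)."""
    return margin * retention / (1 + discount - retention)

MARGIN = 100  # annual contribution margin per customer, in dollars

gain_low = clv(MARGIN, 0.40) - clv(MARGIN, 0.20)   # segment A: 20% -> 40%
gain_high = clv(MARGIN, 0.80) - clv(MARGIN, 0.60)  # segment B: 60% -> 80%

for retention in (0.20, 0.40, 0.60, 0.80):
    print(f"Retention {retention:.0%}: CLV ${clv(MARGIN, retention):,.2f}")
print(f"Gain from 20% to 40%: ${gain_low:.0f}")   # about $35
print(f"Gain from 60% to 80%: ${gain_high:.0f}")  # about $147
```

The same 20-point rise is worth roughly four times as much at the high end, because retention appears in the denominator as well as the numerator.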
Most companies focus on identifying customers who
are most likely to defect and then target them with marketing
programs. However, it’s usually more profitable to
focus on customers who are more likely to stay. Linear
thinking leads managers to underestimate the benefits
of small increases to high retention rates.
Decreasing gradually, then dropping quickly
The classic example of this can be seen in mortgages.
Property owners are often surprised by how slowly they
chip away at their debt during the early years of their
loan terms. But in a mortgage with a fixed interest rate
and fixed term, less of each payment goes toward the
principal at the beginning. The principal doesn’t decrease
linearly. On a 30-year $165,000 loan at a 4.5% interest
rate, the balance decreases by only about $15,000
over the first five years. By year 25 the balance will have
dropped below $45,000. So the owner will pay off less
than 10% of the principal in the first 16% of the loan’s
term but more than a quarter of it in the last 16%.
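The mortgage figures can be checked with the standard amortization formulas. The sketch assumes monthly payments and monthly compounding, which the text does not state explicitly; the function names are illustrative.

```python
def monthly_payment(principal, annual_rate, years):
    """Level payment on a fixed-rate loan (assumes monthly compounding)."""
    r = annual_rate / 12
    n = years * 12
    return principal * r / (1 - (1 + r) ** -n)

def balance_after(principal, annual_rate, years, months_paid):
    """Remaining principal after a given number of monthly payments."""
    r = annual_rate / 12
    payment = monthly_payment(principal, annual_rate, years)
    growth = (1 + r) ** months_paid
    return principal * growth - payment * (growth - 1) / r

LOAN = 165_000
RATE = 0.045
TERM_YEARS = 30

paid_first_5 = LOAN - balance_after(LOAN, RATE, TERM_YEARS, 5 * 12)
balance_year_25 = balance_after(LOAN, RATE, TERM_YEARS, 25 * 12)

print(f"Principal repaid in years 1-5:  ${paid_first_5:,.0f} ({paid_first_5 / LOAN:.1%})")
print(f"Balance remaining at year 25:   ${balance_year_25:,.0f} ({balance_year_25 / LOAN:.1%})")
```

Under these assumptions the first five years retire a little under 9% of the principal, while the last five retire a little over 27%.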
Because they’re misled by their linear thinking in this
context, mortgage payers are often surprised when they
sell a property after a few years (and pay brokerage costs)
and have only small net gains to show for it.
Climbing quickly, then tapering off
Selling more of a product allows companies to achieve
economies of scale and boost per unit profit, a metric often
used to gauge a firm’s efficiency. Executives use this
formula to calculate per unit profit:
$$\text{Per unit profit} = \frac{(\text{Volume} \times \text{Unit Price}) - \text{Fixed Costs} - (\text{Volume} \times \text{Unit Variable Costs})}{\text{Volume}}$$
Say a firm sells 100,000 widgets each year at $2 a
widget, and producing those widgets costs $100,000:
$50,000 in fixed costs and 50 cents in unit variable costs.
The per unit profit is $1. The firm can increase per unit
profit by producing and selling more widgets, because it
will spread fixed costs over more units. If it doubles the
number of widgets sold to 200,000, profit per unit will
rise to $1.25 (assuming that unit variable costs remain
the same). That attractive increase might tempt you into
thinking per unit profit will skyrocket if you increase
sales from 100,000 to 800,000 units. Not so.
If the firm doubles widget sales from 400,000 to
800,000 (which is much harder to do than going from
100,000 to 200,000), the per unit profit increases only
by about 6 cents.
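The diminishing return from each doubling can be verified with the per unit profit formula and the widget figures from the example. The function name and default arguments below are illustrative.

```python
def per_unit_profit(volume, unit_price=2.00, fixed_costs=50_000,
                    unit_variable_cost=0.50):
    """(Volume x price - fixed costs - volume x variable cost) / volume."""
    return (volume * unit_price - fixed_costs
            - volume * unit_variable_cost) / volume

volumes = [100_000, 200_000, 400_000, 800_000]
profits = [per_unit_profit(v) for v in volumes]  # 1.00, 1.25, 1.375, 1.4375

# Gain in per unit profit from each successive doubling of volume.
gains = [after - before for before, after in zip(profits, profits[1:])]

for volume, value in zip(volumes, profits):
    print(f"{volume:>7,} widgets: ${value:.4f} per unit")
print("Gain from each doubling:", [f"${g:.4f}" for g in gains])
```

Each doubling halves the fixed cost per widget, so the gain itself halves every time: 25 cents, then 12.5 cents, then about 6 cents.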
Managers focus a great deal on the benefi ts of econo-
mies of scale and growth. However, linear thinking may
lead them to overestimate volume as a driver of profi t
and thus underestimate other more impactful drivers,
like price.
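The widget arithmetic can be checked with a few lines of Python. This is only a sketch: the function name and its default arguments are illustrative, and the defaults simply restate the figures in the example ($2 price, $50,000 fixed costs, 50 cents variable cost per unit).

```python
def per_unit_profit(volume, unit_price=2.00, fixed_costs=50_000, unit_variable_cost=0.50):
    """Per unit profit = (revenue - fixed costs - variable costs) / volume."""
    revenue = volume * unit_price
    variable_costs = volume * unit_variable_cost
    return (revenue - fixed_costs - variable_costs) / volume


# Each doubling of volume adds less to per unit profit than the one before.
for volume in (100_000, 200_000, 400_000, 800_000):
    print(f"{volume:>7,} widgets: ${per_unit_profit(volume):.4f} per unit")
```

The first doubling adds 25 cents per unit; the doubling from 400,000 to 800,000 adds 6.25 cents, because the fixed cost per unit has already shrunk to a small share of the price.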
Falling sharply, then gradually
Firms often base evaluations of investments on the payback
period, the amount of time required to recover the
costs. Obviously, shorter paybacks are more favorable.
Say you have two projects slated for funding. Project A
has a payback period of two years, and project B has one
of four years. Both teams believe they can cut their payback
period in half. Many managers may find B more attractive
because they’ll save two years, double the time
they’ll save with A.
Company leadership, however, may ultimately care
more about return on investment than time to break-even.
A one-year payback has an annual rate of return
(ARR) of 100%. A two-year payback yields one of
50%, a 50-point difference. A four-year payback yields
one of 25%, a further 25-point difference. So as the payback
period increases, ARR drops steeply at first and then more
slowly. If your focus is achieving a higher ARR, halving
the payback period of project A is a better choice.
Managers comparing portfolios of similar-sized projects
may also be surprised to learn that the return on
investment is higher on one containing a project with a
one-year payback and another with a four-year payback
than on a portfolio containing two projects expected to
pay back in two years. They should be careful not to underestimate
the effect that decreases in relatively short
payback periods will have on ARR.
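A short sketch makes the payback comparison concrete. It assumes, as the figures above imply, that ARR is simply 1 divided by the payback period in years, and that the projects in each portfolio are the same size so a plain average applies. The function and variable names are illustrative.

```python
def annual_rate_of_return(payback_years):
    """Simple ARR implied by recovering 100% of the cost over the payback period."""
    return 1.0 / payback_years


# Halving each project's payback period.
gain_project_a = annual_rate_of_return(1) - annual_rate_of_return(2)  # two years -> one year
gain_project_b = annual_rate_of_return(2) - annual_rate_of_return(4)  # four years -> two years

# Two equal-sized portfolios with the same average payback period of 2.5 vs. 2 years.
portfolio_mixed = (annual_rate_of_return(1) + annual_rate_of_return(4)) / 2
portfolio_even = (annual_rate_of_return(2) + annual_rate_of_return(2)) / 2

print(f"Project A gains {gain_project_a:.0%}, project B gains {gain_project_b:.0%}")
print(f"Mixed portfolio ARR {portfolio_mixed:.1%}, even portfolio ARR {portfolio_even:.1%}")
```

Under these assumptions project A gains 50 points against 25 for project B, and the mixed portfolio averages 62.5% against 50% for the even one.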
How to Limit the Pitfalls of Linear Bias
As long as people are employed as managers, biases
that are hardwired into the human brain will affect the
quality of business decisions. Nevertheless, it is possible
to minimize the pitfalls of linear thinking.
Step 1: Increase awareness of linear bias
MBA programs should explicitly warn future manag-
ers about this phenomenon and teach them ways to
deal with it. Companies can also undertake initiatives
to educate employees by, for instance, presenting them
with puzzles that involve nonlinear relationships. In
our experience, people find such exercises engaging and
eye-opening.
Broader educational efforts are already under way in
several fields. One is Ocean Tipping Points, an initiative
that aims to make people more sensitive to nonlinear relationships
in marine ecosystems. Scientists and managers
often assume that the relationship between a stressor
(such as fishing) and an ecological response (a decline in
fish population) is linear. However, a small change in a
stressor sometimes does disproportionately large damage:
A fish stock can collapse following a small increase
in fishing. The project’s goal is to identify relevant tipping
points in ocean ecology to help improve the management
of natural resources.
Step 2: Focus on outcomes, not indicators
One of senior management’s most important tasks is to
set the organization’s direction and incentives. But fre-
quently, desired outcomes are far removed from every-
day business decisions, so firms identify relevant intermediate
metrics and create incentives to maximize them.
To lift sales, for instance, many companies try to improve
their websites’ positioning in organic search results.
The problem is, these intermediate metrics can be-
come the end rather than the means, a phenomenon aca-
demics call “medium maximization.” That bodes trouble
if a metric and the outcome don’t have a linear relation-
ship—as is the case with organic search position and
sales. When a search rank drops, sales decrease quickly
at first and then more gradually: The impact on sales is
much greater when a site drops from the first to the second
position in search results than when it drops from
the 20th to the 25th position.
Other times, a single indicator can be used to predict
multiple outcomes, and that may confuse people and
lead them astray. Take annual rates of return, which a
manager who wants to maximize the future value of an
investment may consider. If you map the relationship
between investment products’ ARR and their total ac-
cumulated returns, you’ll see that as ARR rises, total re-
turns increase gradually and then suddenly shoot up.
Another manager may wish to minimize the time it
takes to achieve a particular investment goal. The relationship
here is the reverse: As ARR rises, the time it
takes to reach a goal drops steeply at first and then declines
gradually.
Because ARR is related to multiple outcomes in
different nonlinear ways, people often under- or over-
estimate its effect. A manager who wants to maximize
overall returns may care a great deal about a change
in the rate from 0.30% to 0.70% but be insensi-
tive to a change from 6.4% to 6.6%. In fact, increas-
ing a low return rate has a much smaller effect on ac-
cumulated future returns than increasing a high rate
does. In contrast, a manager focused on minimizing
the time it takes to reach an investment goal may de-
cide to take on additional risk to increase returns from
6.3% to 6.7% but be insensitive to a change from 0.40%
to 0.60%. In this case the effect of increasing a high in-
terest rate on time to completing a savings goal is much
smaller than the effect of increasing a low interest rate.
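Both effects can be illustrated with compound-growth arithmetic. The sketch below rests on assumptions that are not in the text: annual compounding, a 30-year horizon for accumulated returns, and doubling the investment as the savings goal. The function names are illustrative.

```python
import math


def growth_multiple(rate, years):
    """Value of 1 unit invested at a fixed annual rate, compounded yearly."""
    return (1 + rate) ** years


def years_to_multiple(rate, multiple=2.0):
    """Years needed for an investment to grow to the given multiple."""
    return math.log(multiple) / math.log(1 + rate)


YEARS = 30  # assumed horizon; the text does not specify one

# Outcome 1: accumulated returns respond more to a change at a high rate.
low_rate_gain = growth_multiple(0.007, YEARS) - growth_multiple(0.003, YEARS)
high_rate_gain = growth_multiple(0.066, YEARS) - growth_multiple(0.064, YEARS)

# Outcome 2: time to reach a goal responds more to a change at a low rate.
low_rate_saving = years_to_multiple(0.004) - years_to_multiple(0.006)
high_rate_saving = years_to_multiple(0.063) - years_to_multiple(0.067)

print(f"Extra growth multiple: {low_rate_gain:.3f} (0.30% to 0.70%) "
      f"vs. {high_rate_gain:.3f} (6.4% to 6.6%)")
print(f"Years saved to double: {low_rate_saving:.1f} (0.40% to 0.60%) "
      f"vs. {high_rate_saving:.1f} (6.3% to 6.7%)")
```

The comparison of accumulated returns depends on the horizon. With a much shorter horizon, such as 10 years, the ordering in this sketch reverses, which is one more reason to map the relationship for your own situation rather than rely on a rule of thumb.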
Step 3: Discover the type of nonlinearity you’re dealing with
As Thomas Jones and W. Earl Sasser Jr. pointed out in
HBR back in 1995 (see “Why Satisfied Customers Defect”),
the relationship between customer satisfaction
ratings and customer retention is often nonlinear—but
in ways that vary according to the industry. In highly
competitive industries, such as automobiles, retention
rises gradually and then climbs up steeply as satisfaction
ratings increase. For noncompetitive industries reten-
tion shoots up quickly and then levels off.
In both situations linear thinking will lead to errors. If
the industry is competitive, managers will overestimate
the benefit of increasing the satisfaction of completely
dissatisfied customers. If the industry is not competitive,
managers will overestimate the benefit of increasing the
satisfaction of already satisfied customers.
The point is that managers should avoid making gen-
eralizations about nonlinear relationships across con-
texts and work to understand the cause and effect in
their specific situation.
Field experiments are an increasingly popular way to
do this. When designing them, managers should be sure
to account for nonlinearity. For instance, many people
try to measure the impact of price on sales by offering
a product at a low price (condition A) and a high price
(condition B) and then measuring differences in sales.
But testing two prices won’t reveal nonlinear relationships,
because two data points can only ever be joined by a
straight line. You need to use at least three price levels,
low, medium (condition C), and high, to get a sense of them.
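A minimal sketch of why the third price level matters: with three conditions you can compare the slope of the lower segment with the slope of the upper segment, and a difference between them signals nonlinearity. The prices and sales figures below are hypothetical, as are the function names.

```python
def slope(p1, s1, p2, s2):
    """Change in units sold per unit change in price between two conditions."""
    return (s2 - s1) / (p2 - p1)


def curvature_check(low, mid, high):
    """Each argument is a (price, units_sold) pair.

    Returns the slope of each segment and the difference between them.
    A difference of zero is consistent with a straight line.
    """
    lower_slope = slope(*low, *mid)
    upper_slope = slope(*mid, *high)
    return lower_slope, upper_slope, upper_slope - lower_slope


# Hypothetical field-experiment results (price in dollars, units sold).
condition_a = (10.0, 1000)  # low price
condition_c = (15.0, 600)   # medium price
condition_b = (20.0, 500)   # high price

lower, upper, change = curvature_check(condition_a, condition_c, condition_b)
print(f"Low to medium: {lower:.0f} units per dollar")
print(f"Medium to high: {upper:.0f} units per dollar")
print(f"Change in slope: {change:.0f}")
```

With only conditions A and B, these data would suggest a steady loss of 50 units per dollar. The middle condition shows that most of the drop happens between the low and medium prices.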
Step 4: Map nonlinearity whenever you can
In addition to providing the right training, companies
can build support systems that warn managers when
they might be making bad decisions because of the incli-
nation to think linearly.
Ideally, algorithms and artificial intelligence could
identify situations in which that bias is likely to strike
and then offer information to counteract it. Of course,
while advances in AI make this possible in formal set-
tings, it can’t account for decisions that take place off-
line and in conversations. And building such systems
could eat up a lot of time and money.
A low-tech but highly effective technique for fighting
linear bias is data visualization. As you’ve noticed in this
article, whenever we wanted you to understand some
linear bias, we showed you the nonlinear relationships.
They’re much easier to grasp when plotted out in a chart
than when described in a list of statistics. A visual repre-
sentation also helps you see threshold points where out-
comes change dramatically and gives you a good sense of
the degree of nonlinearity in play.
Putting charts of nonlinear relationships in dashboards
and even mapping them out in “what if” scenarios
will make managers more familiar with nonlinearity
and thus more likely to check for it before making
decisions.
Visualization is also a good tool for companies inter-
ested in helping customers make good decisions. For ex-
ample, to make drivers aware of how little time they save
by accelerating when they’re already traveling at high
speed, you could add a visual cue for time savings to car
dashboards. One way to do this is with what Eyal Pe’er
and Eyal Gamliel call a “paceometer,” which shows how
many minutes it takes to drive 10 miles. It will surprise
most drivers that going from 40 to 65 miles per hour will save you
about six minutes per 10 miles, but going from 65 to 90
saves only about two and a half minutes, even though
you’re increasing your speed 25 miles per hour in both
instances.
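The paceometer figures follow directly from dividing distance by speed. A minimal sketch, with an illustrative function name:

```python
def minutes_per_10_miles(mph):
    """Minutes needed to cover 10 miles at a constant speed."""
    return 10 / mph * 60


saving_40_to_65 = minutes_per_10_miles(40) - minutes_per_10_miles(65)
saving_65_to_90 = minutes_per_10_miles(65) - minutes_per_10_miles(90)

print(f"40 to 65 mph saves {saving_40_to_65:.1f} minutes per 10 miles")
print(f"65 to 90 mph saves {saving_65_to_90:.1f} minutes per 10 miles")
```

The same 25 mph increase saves about 5.8 minutes at the lower speed and about 2.6 minutes at the higher one.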
The Implications for Marketers
A cornerstone of modern marketing is the idea that by
focusing more on consumer benefits than on product attributes,
you can sell more. Apple, for instance, realized
that people would perceive an MP3 player that provided
“1,000 songs in your pocket” to be more attractive than
one with an “internal storage capacity of 5GB.”
Our framework, however, highlights the fact that in
many situations companies actually profit from promoting
attributes rather than benefits. They’re taking advantage
of consumers’ tendency to assume that the relationship
between attributes and benefits is linear. And that is
not always the case.
We can list any number of instances where showing
customers the actual benefits would reveal where they
may be overspending and probably change their buying
behavior: printer pages per minute, points in loyalty
programs, and sun protection factor, to name just
a few. Bandwidth upgrades are another good example.
Our research shows that internet connections are priced
linearly: Consumers pay the same for increases in speed
from a low base and from a high base. But the relationship
between download speed and download time is
nonlinear. As download speed increases, download time
drops rapidly at first and then gradually. Upgrading from
five to 25 megabits per second will lead to time savings
of 21 minutes per gigabyte, while the increase from 25 to
100 Mbps buys only four minutes. When consumers see
the actual gains from raising their speed to 100 Mbps,
they may prefer a cheaper, slower internet connection.
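The bandwidth figures can be reproduced with a short calculation. The sketch assumes a gigabyte of 8,000 megabits (decimal units) and a connection that sustains its advertised speed. The function name is illustrative.

```python
def minutes_per_gigabyte(mbps, megabits_per_gigabyte=8_000):
    """Minutes needed to download one gigabyte at a sustained speed in Mbps."""
    return megabits_per_gigabyte / mbps / 60


saving_5_to_25 = minutes_per_gigabyte(5) - minutes_per_gigabyte(25)
saving_25_to_100 = minutes_per_gigabyte(25) - minutes_per_gigabyte(100)

print(f"5 to 25 Mbps saves {saving_5_to_25:.1f} minutes per gigabyte")
print(f"25 to 100 Mbps saves {saving_25_to_100:.1f} minutes per gigabyte")
```

The first upgrade adds 20 Mbps and saves about 21 minutes per gigabyte. The second adds 75 Mbps and saves about four.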
Of course, willfully exploiting consumers’ flawed
perceptions of attribute-benefit relationships is a questionable
marketing strategy. It’s widely regarded as unethical
for companies to take advantage of customers’
ignorance.
In recent years a number of professions, including ecolo-
gists, physiologists, and physicians, have begun to rou-
tinely factor nonlinear relationships into their decision
making. But nonlinearity is just as prevalent in the
business world as anywhere else. It’s time that man-
agement professionals joined these other disciplines
in developing greater awareness of the pitfalls of linear
thinking in a nonlinear world. This will increase their
ability to choose wisely—and to help the people around
them make good decisions too.
Bart de Langhe is an associate professor of marketing
at Esade Business School, Ramon Llull University, and
an assistant professor of marketing at the Leeds School
of Business, University of Colorado–Boulder. Stefano
Puntoni is a professor of marketing at the Rotterdam
School of Management, Erasmus University. Richard
Larrick is the Hanes Corporation Foundation Professor
at Duke University’s Fuqua School of Business.
CHAPTER 15
Pitfalls of Data-Driven Decisions
by Megan MacGarvie and Kristina McElheran
Even with impressively large data sets, the best analytics
tools, and careful statistical methods, managers can still
be vulnerable to a range of pitfalls when using data to
back up their toughest choices—especially when infor-
mation overload leads us to take shortcuts in reasoning.
In some instances, data and analytics actually make mat-
ters worse.
Psychologists, behavioral economists, and other scholars
of human behavior have identified several common
decision-making traps. Many of these traps stem from
the fact that people don’t carefully process every piece
of information in every decision. Instead, we often rely
on heuristics: simplified procedures that allow us to
make decisions in the face of uncertainty or when ex-
tensive analysis is too costly or time-consuming. These
heuristics lead us to believe we are making sound deci-
sions when we are actually making systematic mistakes.
What’s more, human brains are wired for certain biases
that creep in and distort our thinking, typically without
our awareness.
Three main cognitive traps regularly skew decision
making, even when it is informed by the best data. Each
of these pitfalls is described in detail below, along with
suggestions for how to escape it.
The Confirmation Trap
When we pay more attention to findings that align with
our prior beliefs, and ignore other facts and patterns in
the data, we fall into the confirmation trap. With a huge
data set and numerous correlations between variables,
analyzing all possible correlations is often both costly
and counterproductive. Even with smaller data sets, it
can be easy to inadvertently focus on correlations that
confirm our expectations of how the world should work,
and dismiss counterintuitive or inconclusive patterns in
the data when they don’t align.
Consider the following example: In the late 1960s
and early 1970s, researchers conducted one of the most
well-designed studies on how different types of fats af-
fect heart health and mortality. But the results of this
study, known as the Minnesota Coronary Experiment,
were not published at the time. A recent New York Times
article suggests that these results stayed unpublished
for so long because they contradicted the beliefs of both
the researchers and the medical establishment.1 In fact,
it wasn’t until 2016 that the medical journal BMJ pub-
lished a piece referencing this data, when growing skep-
ticism about the relationship between saturated fat con-
sumption and heart disease led researchers to analyze
data from the original experiment—more than 40 years
later.2 These and similar findings cast doubt on decades
of unchallenged medical advice to avoid saturated fats.
While it’s unclear whether one experiment would have
changed standard dietary and health recommendations,
this example demonstrates that even with the best pos-
sible data, when we look at numbers we may ignore
important facts when they contradict the dominant
paradigm or don’t confirm our beliefs, with potentially
troublesome results.
This is a sobering prospect for decision makers in
companies. And confirmation bias becomes that much
harder to avoid when individuals face pressure from
bosses and peers. Organizations frequently reward employees
who can provide empirical support for existing
managerial preferences. Those who decide what parts
of the data to examine and present to senior managers
may feel compelled to choose only the evidence that reinforces
what their supervisors want to see or that confirms
a prevalent attitude within the firm.
To get a fair assessment of what the data has to say,
don’t avoid information that counters your (or your
boss’s) beliefs. Instead, embrace it by doing the following:
• Specify in advance the data and analytical approaches
on which you’ll base your decision, to
reduce the temptation to cherry-pick findings that
agree with your prejudices.
• Actively look for findings that disprove your
beliefs. Ask yourself, “If my expectations are
wrong, what pattern would I likely see in the
data?” Enlist a skeptic to help you. Seek people
who like to play devil’s advocate or assign contrary
positions for active debate.
• Don’t automatically dismiss findings that fall
below your threshold for statistical or practical
significance. Both noisy relationships (those
with large standard errors) and small, precisely
measured relationships can point to flaws in your
beliefs and presumptions. Ask yourself, what
would it take for this to appear important? Make
sure your key takeaway is not sensitive to reasonable
changes in your model or sample size.
• Assign multiple independent teams to analyze
the data separately. Do they come to similar
conclusions? If not, isolate and study the points
of divergence to determine whether the differences
are due to error, inconsistent methods,
or bias.
• Treat your findings like predictions, and test them.
If you uncover a correlation from which you think
your organization can profit, use an experiment to
validate that correlation.
The Overconfidence Trap
In their book Judgment in Managerial Decision Making,
behavioral researchers Max Bazerman and Don
Moore refer to overconfidence as “The Mother of All
Biases.” Time and time again, psychologists have found
that decision makers are too sure of themselves. We tend
to assume that the accuracy of our judgments or the
probability of success in our endeavors is more favorable
than the data would suggest. When there are risks, we
alter our read of the odds to assume we’ll come out on
the winning side. Senior decision makers who have been
promoted based on past successes are especially suscep-
tible to this bias, since they have received positive signals
about their decision-making abilities throughout their
careers.
Overconfidence also reinforces many other pitfalls of
data interpretation, be they psychological or procedural. It
prevents us from questioning our methods, motivation,
and the way we communicate our findings to others. It
makes it easy to underinvest in data and analysis; when
we feel too confident in our understanding, we don’t
spend enough time or money acquiring more information
or running further analyses. To make matters worse,
more information can increase overconfidence without
improving accuracy. More data in and of itself is not a
guaranteed solution.
Going from data to insight requires quality inputs,
skill, and sound processes. Because it can be so difficult
to recognize our own biases, good processes are essential
for avoiding overconfidence when analyzing data. Here
are a few procedural tips to avoid the overconfidence
trap:
• Describe your perfect experiment—the type
of information you would use to answer your
question if you had limitless resources for
data collection and the ability to measure any
variable. Compare this ideal with your actual
data to understand where it might fall short.
Identify places where you might be able to close
the gap with more data collection or analytical
techniques.
• Make it a formal part of your process to be your
own devil’s advocate. In Thinking, Fast and Slow,
Nobel laureate Daniel Kahneman suggests asking
yourself why your analysis might be wrong, and
recommends that you do this for every analysis
you perform. Taking this contrarian view can
help you see the flaws in your own arguments and
reduce mistakes across the board.
• Before making a decision or launching a project,
perform a “pre-mortem,” an approach suggested
by psychologist Gary Klein. Ask others with
knowledge about the project to imagine its
failure a year into the future and to write stories
of that failure. In doing so, you’ll benefit from
the wisdom of multiple perspectives, while also
providing an opportunity to bring to the surface
potential flaws in the analysis that you may
otherwise overlook.
• Keep track of your predictions and systemati-
cally compare them with what actually happens.
Which of your predictions turned out to be true
and which ones fell short? Persistent biases can
creep back into our decision making; revisit these
reports on a regular basis so you can prevent mis-
takes in the future.
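One way to act on the last tip is to keep a simple log of predictions, each with the probability you assigned to it and whether it came true. The sketch below is one possible approach rather than a method the authors prescribe. It uses the Brier score, a standard measure of forecast accuracy, and a simple gap between average stated confidence and the actual hit rate. The log entries and function names are hypothetical.

```python
def brier_score(forecasts):
    """Mean squared gap between stated probability and outcome (0 is perfect)."""
    return sum((p - (1.0 if happened else 0.0)) ** 2
               for p, happened in forecasts) / len(forecasts)


def overconfidence_gap(forecasts):
    """Average stated probability minus the share of predictions that came true.

    A positive value suggests the forecaster was more confident than warranted.
    """
    average_confidence = sum(p for p, _ in forecasts) / len(forecasts)
    hit_rate = sum(1 for _, happened in forecasts if happened) / len(forecasts)
    return average_confidence - hit_rate


# Hypothetical prediction log: (stated probability of success, did it happen?)
prediction_log = [(0.9, True), (0.9, False), (0.8, True), (0.9, False), (0.7, True)]

print(f"Brier score: {brier_score(prediction_log):.3f}")
print(f"Overconfidence gap: {overconfidence_gap(prediction_log):+.2f}")
```

In this made-up log the forecaster expected success 84% of the time on average but was right 60% of the time, a gap of 24 points. Reviewing such a log regularly is a concrete way to revisit your predictions.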
The Overfitting Trap
When your model yields surprising or counterintuitive
predictions, you may have made an exciting new
discovery, or it may be the result of overfitting. In The
Signal and the Noise, Nate Silver famously dubbed this
“the most important scientific problem you’ve never
heard of.” This trap occurs when a statistical model describes
random noise, rather than the underlying relationship
we need to capture. Overfit models generally do
a suspiciously good job of explaining many nuances of
what happened in the past, but they have great difficulty
predicting the future.
For instance, when Google’s Flu Trends application
was introduced in 2008, it was heralded as an innovative
way to predict flu outbreaks by tracking search terms associated
with early flu symptoms. But early versions of
the algorithm looked for correlations between flu outbreaks
and millions of search terms. With such a large
number of terms, some correlations appeared significant
when they were really due to chance (searches for “high
school basketball,” for example, were highly correlated
with the flu). The application was ultimately scrapped
only a few years later due to failures of prediction.
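The effect of testing a very large number of candidate terms can be simulated. The sketch below does not reproduce the Flu Trends algorithm. It generates purely random series, with made-up sizes (1,000 terms, 20 weeks) and an arbitrary correlation threshold of 0.5, and counts how many appear strongly related to a random target by chance alone.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / (variance_x * variance_y) ** 0.5


def count_chance_correlations(n_terms=1000, n_weeks=20, threshold=0.5, seed=0):
    """Count random 'search terms' that correlate with a random 'outbreak' series."""
    rng = random.Random(seed)
    outbreaks = [rng.gauss(0, 1) for _ in range(n_weeks)]
    hits = 0
    for _ in range(n_terms):
        term = [rng.gauss(0, 1) for _ in range(n_weeks)]
        if abs(pearson(term, outbreaks)) >= threshold:
            hits += 1
    return hits


chance_hits = count_chance_correlations()
print(f"{chance_hits} of 1000 random terms cleared the threshold by chance")
```

None of these series has any real connection to the target, yet some clear the threshold. The more candidate terms you test, the more such accidental matches you should expect.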
In order to overcome this bias, you need to discern
between the data that matters and the noise around it.
Here’s how you can guard against the overfitting trap:
• Randomly divide the data into two sets: a training
set, with which you’ll estimate the model, and a
“validation set,” with which you’ll test the accuracy
of the model’s predictions. An overfit model might
be great at making predictions within the training
set, but raise warning flags by performing poorly
in the validation set.
• Much like you would for the confirmation trap,
specify the relationships you want to test and how
you plan to test them before analyzing the data, to
avoid cherry-picking.
• Keep your analysis simple. Look for relationships
that measure important effects related to clear and
logical hypotheses before digging into nuances. Be
on guard against spurious correlations (the ones
that occur only by chance) that you can rule out
based on experience or common sense (see the
sidebar, “Beware Spurious Correlations,” in chapter
10). Remember that data can never truly “speak
for itself” and must rely on human interpreters to
make sense of it.
• Construct alternative narratives. Is there another
story you could tell with the same data? If so, you
cannot be confident that the relationship you have
uncovered is the right one, or the only one.
• Beware of the all-too-human tendency to see
patterns in random data. For example, consider
a baseball player with a .325 batting average who
has no hits in a championship series game. His
coach may see a cold streak and want to replace
him, but he’s only looking at a handful of games.
Statistically, it would be better to keep him in the
lineup than substitute the .200 hitter who had
four hits in the previous game.
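The first tip in the list above, splitting the data into a training set and a validation set, can be sketched with simulated data. Everything here is an assumption made for illustration: the data follow a straight line plus random noise, the simple model is a least-squares line, and the overfit model is a deliberately naive "memorizer" that repeats the outcome of the closest training example.

```python
import random


def fit_line(points):
    """Least-squares straight line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(points):
    """Overfit model: returns the y of the nearest training x, noise included."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def mean_squared_error(model, points):
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


# Simulated data: a true linear relationship plus random noise.
rng = random.Random(42)
data = [(i / 10, 2.0 * (i / 10) + rng.gauss(0, 1.0)) for i in range(400)]

# Randomly divide the data into a training set and a validation set.
rng.shuffle(data)
train, validation = data[:200], data[200:]

line = fit_line(train)
memorizer = fit_memorizer(train)

for name, model in (("line", line), ("memorizer", memorizer)):
    print(f"{name:>9}: training error {mean_squared_error(model, train):.2f}, "
          f"validation error {mean_squared_error(model, validation):.2f}")
```

The memorizer explains the training data perfectly because it has learned the noise, and it is the validation set that exposes the problem. The gap between training and validation error is the warning flag the first tip describes.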
From Bias to Better Decisions
Data analytics can be an effective tool to promote consistent
decisions and shared understanding. It can highlight
blind spots in our individual or collective awareness
and can offer evidence of risks and benefits for particular
paths of action. But it can also make us complacent.
Managers need to be aware of these common
decision-making pitfalls and employ sound processes
and cognitive strategies to prevent them. It can be
difficult to recognize the flaws in your own reasoning,
but proactively tackling these biases with the right
mindset and procedures can lead to better analysis of
data and better decisions overall.
Megan MacGarvie is an associate professor in the mar-
kets, public policy, and law group at Boston Univer-
sity’s Questrom School of Business, where she teaches
data-driven decision making and business analytics.
She is also a research associate of the National Bureau
of Economic Research. Kristina McElheran is an assis-
tant professor of strategic management at the University
of Toronto and a digital fellow at the MIT Initiative on
the Digital Economy. Her ongoing work on data-driven
decision making with Erik Brynjolfsson has been fea-
tured on HBR online and in the American Economic
Review.
NOTES
1. A. E. Carroll, “A Study on Fats That Doesn’t Fit the Story Line,” New York Times, April 15, 2016.
2. C. E. Ramsden et al., “Re-evaluation of the Traditional Diet-Heart Hypothesis: Analysis of Recovered Data from Minnesota Coronary Experiment (1968-73),” BMJ (April 2016), 353:i1246, doi: 10.1136.
CHAPTER 16
Don’t Let Your Analytics Cheat the Truth
by Michael Schrage
Everyone’s heard the truism that there are lies, damned
lies, and statistics. But sitting through a welter of
analytics-driven, top-management presentations pro-
vokes me into proposing a cynical revision: There are
liars, damned liars, and statisticians.
The rise of analytics-informed insight and decision
making is welcome. The disingenuous and deceptive
manner in which many of these statistics are presented
is not. I’m simultaneously stunned and disappointed
Adapted from “Do Your Analytics Cheat the Truth?” on hbr.org, October 10, 2011.
by how egregiously manipulative these analytics have
become at the very highest levels of enterprise oversight.
The only thing more surprising, and more disappointing,
is how unwilling or unable so many senior executives
are to ask simple questions about the analytics
they see.
At one financial services firm, for example, call center
analytics showed spike after spike of negative customer
satisfaction numbers. Hold times and problem resolution
times had noticeably increased. The presenting executive
clearly sought greater funding and training for
her group. The implied threat was that the firm’s reputation
for swift and responsive service was at risk.
Three simple but pointed questions later, her analytic
gamesmanship became clear. What had been presented
as a disturbing customer service trend was in large part
due to a policy change affecting about 20% of the firm’s
newly retired customers. Between their age, possible tax
implications, and an approval process requiring coordi-
nation with another department, these calls frequently
stretched beyond 35 to 45 minutes.
What made the situation worse (and what might ex-
plain why the presenter chose not to break out the data)
was a management decision not to route those calls to a
specially trained team but instead to allow any customer
representative to process the query. The additional de-
lays undermined the entire function’s performance.
Every single one of the presenter’s numbers was tech-
nically accurate. But they were aggregated in a manner
that made it look as if the function was underresourced.
The analytics deliberately concealed the outlier statistically
responsible for making the numbers dramatically
worse.
More damning was a simple queuing theory simula-
tion demonstrating that if the call center had made even
marginal changes in how it chose to manage that excep-
tional 20%, the aggregate call center performance num-
bers would have been virtually unaffected. Poor manage-
ment, not systems underinvestment, was the real root
cause problem.
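Breaking out the data, which the presenter chose not to do, takes very little effort. The sketch below uses hypothetical call records: the segment labels and durations are invented for illustration and are not the firm’s data. It compares the aggregate average with the average once the exceptional segment is set aside.

```python
from statistics import mean

# Hypothetical call records: (segment, handling time in minutes).
calls = [
    ("standard", 6), ("standard", 7), ("standard", 5), ("standard", 6),
    ("standard", 8), ("standard", 6), ("standard", 7), ("standard", 5),
    ("policy_change", 38), ("policy_change", 44),
]


def mean_minutes(records, exclude_segment=None):
    """Average handling time, optionally leaving out one segment."""
    durations = [minutes for segment, minutes in records if segment != exclude_segment]
    return mean(durations)


overall = mean_minutes(calls)
without_exception = mean_minutes(calls, exclude_segment="policy_change")

print(f"All calls: {overall:.2f} minutes on average")
print(f"Excluding policy-change calls: {without_exception:.2f} minutes on average")
```

In this made-up sample, two calls out of ten more than double the average. Reporting both figures side by side turns an apparent resourcing problem into a question about how one group of calls is handled.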
Increasingly, I observe statistical sophisticates in-
dulging in analytic advocacy—that is, the numbers are
deployed to influence and win arguments rather than
identify underlying dynamics and generate insight. This
is particularly disturbing because while the analytics—in
the strictest technical sense—accurately portray a situa-
tion, they do so in a way that discourages useful inquiry.
I always insist that analytics presentations and presenters
explicitly identify the outliers, how they were defined
and dealt with, and, most importantly, what the
analytics would look like if they didn’t exist. It’s astonishing
what you find when you make the outliers as important
as the aggregates and averages in understanding the
analytics. (To guide your discussion, consider the questions
in the sidebar “Investigating Outliers.”)
My favorite example of this comes, naturally enough,
from Harvard. Few people realize that, in fact, the aver-
age net worth of Harvard dropouts vastly exceeds the av-
erage net worth of Harvard graduates.
The reason for that is simple. There are many, many
more Harvard graduates than there are Harvard drop-
outs. But the ranks of Harvard dropouts include Bill
INVESTIGATING OUTLIERS
by Janice H. Hammond
When you notice an outlier in data, you must investi-
gate why the anomaly exists. Consider asking some of
the following questions:
• Is it just an unusual, but valid, value?
• Could it be a data entry error?
• Was it collected in a different way than the rest
of the data? At a different time?
After making an effort to understand where an
outlier comes from, you should have a deeper understanding
of the situation the data represent. Then think
about how to handle the outlier in your analysis. Typically,
you can do one of three things: leave it alone or,
very rarely, remove it or change it to a corrected value.
Excluding or changing data is not something we
do often—and it should be done only after examin-
ing the underlying situation in great detail. We should
never do it to help the data “fi t” a conclusion we want
to draw. Changes to a data set should be made on a
case-by-case basis only after careful investigation of
the situation.
Adapted from “Quantitative Methods Online Course,” Harvard Busi- ness Publishing, October 24, 2004, revised January 24, 2017 (product #504702).
Janice H. Hammond is the Jesse Philips Professor of Manufacturing at Harvard Business School. She serves as program chair for the HBS Executive Education International Women’s Foundation and Women’s Leadership Programs, and created the online Business Analytics course for HBX CORe.
H7353_Guide-DataAnalytics_2ndREV.indb 168H7353_Guide-DataAnalytics_2ndREV.indb 168 1/17/18 10:47 AM1/17/18 10:47 AM
Don’t Let Your Analytics Cheat the Truth
169
Gates, Mark Zuckerberg, and Polaroid’s Edwin Land,
whose combined, infl ation-adjusted net worth probably
tops $100 billion. That megarich numerator divided by
the smaller “dropout” denominator creates the statisti-
cally accurate illusion that the average Harvard dropout
is much, much wealthier than the Harvard student who
actually got their degree.
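The arithmetic behind the Harvard dropout illusion can be checked with a few lines of code. Every figure below is invented purely for illustration (group sizes, typical net worths, and the three large fortunes are assumptions, not real data); the point is only how three values dominate a small group's mean while leaving its median untouched.

```python
from statistics import mean, median

# Hypothetical net worths in dollars, chosen only to illustrate the arithmetic.
graduates = [1_000_000] * 1000
dropouts = [200_000] * 97 + [50_000_000_000, 30_000_000_000, 20_000_000_000]

dropout_mean = mean(dropouts)      # pulled far upward by three people
dropout_median = median(dropouts)  # what the typical dropout in this sample has
graduate_mean = mean(graduates)
```

In this made-up sample the dropout mean is about $1 billion, a thousand times the graduate mean, even though the typical dropout holds a fifth of what the typical graduate does. Reporting the median next to the mean, or the mean with the three outliers removed, exposes the illusion immediately.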
This is, of course, ridiculous. Unfortunately, it is no
more ridiculous than what one finds, on average, in a
statistically significant number of analytics-driven
boardroom presentations. The misdirection—and
mismanagement—associated with outliers is the most
disturbingly common pathology I experience, even in
stats-savvy organizations.
Always ask for the outliers. Always make the analysts
display what their data looks like with the outliers
removed. There are other equally important ways to wring
greater utility from aggregated analytics, but start from
the outliers in. Because analytics that mishandle outliers
are “outliars.”
Michael Schrage, a research fellow at MIT Sloan School’s
Center for Digital Business, is the author of the books
Serious Play, Who Do You Want Your Customers to Be-
come? and The Innovator’s Hypothesis.
SECTION FOUR
Communicate Your Findings
CHAPTER 17
Data Is Worthless If You Don’t Communicate It
by Thomas H. Davenport
Too many managers are, with the help of their analyst
colleagues, simply compiling vast databases of infor-
mation that never see the light of day, or that only get
disseminated in autogenerated business intelligence
reports. As a manager, it’s not your job to crunch the
numbers, but it is your job to communicate them. Never
make the mistake of assuming that the results will speak
for themselves.
Consider the cautionary tale of Gregor Mendel. Although
he discovered the concept of genetic inheritance,
his ideas were not adopted during his lifetime because
he only published his findings in an obscure Moravian
scientific journal, a few reprints of which he mailed to
leading scientists. It’s said that Darwin, to whom Mendel
sent a reprint of his findings, never even cut the pages
to read the geneticist’s work. Although Mendel carried
out his groundbreaking experiments between 1856 and
1863—eight years of painstaking research—their
significance was not recognized until the turn of the 20th
century, long after his death. The lesson: If you’re going
to spend the better part of a decade on a research project,
also put some time and effort into disseminating your
results.
Adapted from content posted on hbr.org, June 18, 2013 (product
#H00ASW).
One person who has done this very well is Dr. John
Gottman, the well-known marriage scientist at the Uni-
versity of Washington. Gottman, working with a statis-
tical colleague, developed a marriage equation predict-
ing how likely a marriage is to last over the long term.
The equation is based on a couple’s ratio of positive to
negative interactions during a 15-minute conversation
on a difficult topic such as money or in-laws. Pairs who
showed affection, humor, or happiness while talking
about contentious topics were given a maximum num-
ber of points, while those who displayed belligerence
or contempt received the minimum. Observing several
hundred couples, Gottman and his team were able to
score couples’ interactions and identify the patterns that
predict divorce or a happy marriage.
This was great work in itself, but Gottman didn’t
stop there. He and his wife, Julie, founded a nonprofit
research institute and a for-profit organization to apply
the results through books, DVDs, workshops, and
therapist training. They’ve influenced exponentially more
marriages through these outlets than they could possibly
ever have done in their own clinic—or if they’d just
issued a press release with their findings.
Similarly, during his tenure at Intuit, George
Roumeliotis was head of a data science group that analyzed
and created product features based on the vast amount
of online data that Intuit collected. For his projects, he
recommended a simple framework for communicating
about each analysis:
1. My understanding of the business problem
2. How I will measure the business impact
3. What data is available
4. The initial solution hypothesis
5. The solution
6. The business impact of the solution
Note what’s not here: details on statistical methods
used, regression coefficients, or logarithmic
transformations. Most audiences neither understand nor appreciate
those details; they care about results and implications.
It may be useful to make such information available in
an appendix to a report or presentation, but don’t let it
get in the way of telling a good story with your data—
starting with what your audience really needs to know.
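For teams that share analyses in a repeatable format, the six-part framework above can be captured as a simple template. The class below is a hypothetical sketch, not something Roumeliotis or Intuit published; the field names simply mirror the six headings, and the example content is invented.

```python
from dataclasses import dataclass, fields

@dataclass
class AnalysisSummary:
    """One field per step of the six-part communication framework."""
    business_problem: str
    impact_measure: str
    available_data: str
    initial_hypothesis: str
    solution: str
    business_impact: str

    def outline(self) -> str:
        """Render the six parts, in order, as a numbered outline."""
        lines = []
        for number, field in enumerate(fields(self), start=1):
            label = field.name.replace("_", " ").capitalize()
            lines.append(f"{number}. {label}: {getattr(self, field.name)}")
        return "\n".join(lines)

# Invented example content, for illustration only.
summary = AnalysisSummary(
    business_problem="Trial users drop off before the first invoice",
    impact_measure="Share of trials that send an invoice within 7 days",
    available_data="Clickstream events and trial sign-up records",
    initial_hypothesis="A shorter setup flow raises first-invoice rates",
    solution="Prefill company details from the sign-up form",
    business_impact="To be measured in an A/B test",
)
```

Note that the template has no slot for methods or coefficients. That omission is deliberate and matches the advice above: technical detail belongs in an appendix.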
Thomas H. Davenport is the President’s Distinguished
Professor in Management and Information Technology
at Babson College, a research fellow at the MIT Initiative
on the Digital Economy, and a senior adviser at Deloitte
Analytics. Author of over a dozen management books,
his latest is Only Humans Need Apply: Winners and
Losers in the Age of Smart Machines.
CHAPTER 18
When Data Visualization Works—and When It Doesn’t
by Jim Stikeleather
I am uncomfortable with the growing emphasis on big
data and its stylist, visualization. Don’t get me wrong—
I love infographic representations of large data sets.
The value of representing information concisely and
effectively dates back to Florence Nightingale, when she
developed a new type of pie chart to clearly show that
more soldiers were dying from preventable illnesses than
from their wounds. On the other hand, I see beautiful
exercises in special effects that show off statistical and
technical skills, but do not clearly serve an informing
purpose. That’s what makes me squirm.
Adapted from content posted on hbr.org, March 27, 2013 (product
#H00ADJ).
Ultimately, data visualization is about communicating
an idea that will drive action. Understanding the
criteria for information to provide valuable insights and the
reasoning behind constructing data visualizations will
help you do that with efficiency and impact.
For information to provide valuable insights, it must be
interpretable, relevant, and novel. With so much unstruc-
tured data today, it is critical that the data being analyzed
generates interpretable information. Collecting lots of
data without the associated metadata—such as what it is,
where it was collected, when, how, and by whom—reduces
the opportunity to play with, interpret, and draw conclu-
sions from the data. It must also be relevant to the people
who are looking to gain insights, and to the purpose for
which the information is being examined (see the sidebar
“Understand Your Audience”). Finally, it must be original,
or shed new light on an area. If the information fails any
one of these criteria, then no visualization can make it
valuable. That means that only a tiny slice of the data we
can bring to life visually will actually be worth the effort.
Once we’ve narrowed the universe of data down to
that which satisfies these three requirements, we must
also understand the legitimate reasons to construct data
visualizations, and recognize what factors affect the
quality of data visualizations. There are three broad rea-
sons for visualizing data:
• Confirmation: If we already have a set of
assumptions about how the system we are interested in
operates—for example, a market, customers, or
competitors—visualizations can help us check
those assumptions. They can also enable us to
observe whether the underlying system has devi-
ated from the model we had and assess the risk of
the actions we are about to undertake based on
those assumptions. You see this approach in some
enterprise dashboards.
• Education: There are two forms of education that
visualization offers. One is simply reporting: here
is how we measure the underlying system of inter-
est, and here are the values of those measures in
some comparative form—for instance, over time,
or against other systems or models. The other is to
develop intuition and new insights on the behavior
of a known system as it evolves and changes over
time, so that humans can get an experiential feel
of the system in an extremely compressed time
frame. You often see this model in the
“gamification” of training and development.
• Exploration: When we have large sets of data
about a system we are interested in and the goal is
to provide optimal human-machine interactions
(HMI) to that data to tease out relationships,
processes, models, etc., we can use visualization to
help build a model to allow us to predict and better
manage the system. The practice of using visual
discovery in lieu of statistics is called exploratory
data analysis (EDA), and too few businesses make
use of it.
UNDERSTAND YOUR AUDIENCE
Before you throw up (pun intended) data in a
visualization, start with the goal, which is to convey great
quantities of information in a format that is easily
assimilated by the consumers of this information—decision
makers. A successful visualization is based on the
designer understanding whom the visualization is
targeting, and executing on three key points:
• Who is the audience, and how will it read and
interpret the information? Can you assume these
individuals have knowledge of the terminology
and concepts you’ll use, or do you need to guide
them with clues in the visualization (for example,
good is indicated with a green arrow going up)?
An audience of experts will have different
expectations than a general audience.
• What are viewers’ expectations, and what type
of information is most useful to them?
• What is the visualization’s functional role, and
how can viewers take action from it? An exploratory
visualization should leave viewers with
questions to pursue; educational or confirmational
graphics should not.
Adapted from “The Three Elements of Successful Data Visualizations” on hbr.org by Jim Stikeleather, April 19, 2013.
Assuming the visualization creator has gotten it all
right—a well-defined purpose, the necessary and
sufficient amount of data and metadata to make the
visualization interpretable, enabling relevant and original
insights for the business—what gives us confidence that
these findings are now worthy of action? Our ability to
understand and to a degree control three areas of risk can
define the visualization’s resulting value to the business:
• Data quality: The quality of the underlying data is
crucial to the value of visualization. How complete
and reliable is it? As with all analytical processes,
putting garbage in means getting garbage out.
• Context: The point of visualization is to make
large amounts of data approachable so we can
apply our evolutionarily honed pattern detection
computer—our brain—to draw insights from it.
To do so, we need to access all of the potential
relationships of the data elements. This context is
the source of insight. To leave out any contextual
information or metadata (or more appropriately,
“metacontent”) is to risk hampering our under-
standing.
• Biases: The creator of the visualization may
influence the visualization’s semantics and the syntax
of the elements through color choices, positioning,
and visual tricks (such as unnecessary 3D, or 2D
when 3D is more informative)—any of which can
challenge the interpretation of the data. This also
creates the risk of pre-specifying discoverable
features and results via the embedded algorithms
used by the creator (something EDA is intended to
overcome). These in turn can significantly
influence how viewers understand the visualization,
and what insight they will gather from it.
Ignoring these requirements and risks can under-
mine the visualization’s purpose and confuse rather than
enlighten.
Jim Stikeleather, DBA, is a serial entrepreneur and was
formerly Chief Innovation Officer at Dell. He teaches
innovation, business models, strategy, governance, and
change management at the graduate level at the Uni-
versity of South Florida and The Innovation Academy
at Trinity College Dublin. He is also a senior executive
coach.
CHAPTER 19
How to Make Charts That Pop and Persuade
by Nancy Duarte
Displaying data can be a tricky proposition, because dif-
ferent rules apply in different contexts. A sales director
presenting financial projections to a group of field
representatives wouldn’t visualize her data the same way that
a design consultant would in a written proposal to a po-
tential client.
So how do you make the right choices for your
situation? Before displaying your data, ask yourself these
five questions:
Adapted from “The Quick and Dirty on Data Visualization” on hbr.org,
April 16, 2014 (product #H00RKA).
1. Am I Presenting or Circulating My Data?
Context plays a huge role in how best to render data.
When delivering a presentation, show the conclusions
you’ve drawn, not all the details that led you to those
conclusions. Because your slides will be up for only a
few seconds, your audience will need to process them
quickly. People won’t have time to chew on a lot of com-
plex information, and they’re not likely to run up to the
wall for a closer look at the numbers. Think in broad
strokes when you’re putting your charts together: What’s
the overall trend you’re highlighting? What’s the most
striking comparison you’re making? These are the sorts
of questions to answer with projected data.
Scales, grid lines, tick marks, and such should pro-
vide context, but without competing with the data. Use
a light neutral color, such as gray, for these elements so
they’ll recede to the background, and plot your data in
a slightly stronger neutral color, such as blue or green.
Then use a bright color to emphasize the point you’re
making.
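The color advice above translates directly into chart code. The sketch below assumes the third-party matplotlib library is installed; the helper names, the specific color names, and any data passed in are illustrative choices, not part of the original guidance.

```python
def bar_colors(labels, highlight, base="steelblue", accent="crimson"):
    """Give every bar the same quiet color except the one being emphasized."""
    return [accent if label == highlight else base for label in labels]

def draw_chart(labels, values, highlight, path="chart.png"):
    """Draw a bar chart with light gray scaffolding and one emphasized bar."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display is needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(labels, values, color=bar_colors(labels, highlight))
    # Grid lines, spines, and tick labels recede in light gray.
    ax.grid(axis="y", color="lightgray", linewidth=0.8)
    ax.set_axisbelow(True)  # keep grid lines behind the bars
    for spine in ax.spines.values():
        spine.set_color("lightgray")
    ax.tick_params(colors="gray")
    fig.savefig(path)
    plt.close(fig)
```

Calling `draw_chart(["North", "South", "East", "West"], [20, 31, 27, 18], highlight="South")` with made-up revenue figures would produce a chart in which only the South bar competes for attention.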
It’s fine to display more detail in documents or in
decks that you email rather than present. Readers can
study them at their own pace—examine the axes, the
legends, the layers—and draw their own conclusions from
your body of work. Still, you don’t want to overwhelm
them, especially since they won’t have you there in
person to explain what your main points are. Use white
space, section heads, and a clear hierarchy of visual
elements to help your readers navigate dense content and
guide them to key pieces of data.
2. Am I Using the Right Kind of Chart or Table?
When you choose how to visualize your data, you’re
deciding what type of relationship you want to emphasize.
Take a look at figure 19-1, which shows the breakdown of
an investment portfolio.
In the pie, it’s clear that this person holds a number
of investments in different areas—but that’s about all
you see.
Figure 19-2 shows the same data in a bar chart. In this
form it’s much easier to discern how much is invested in
each category. If your focus is on comparing categories,
the bar chart is the better choice. A pie chart would be
more useful if you were trying to make the point that a
single investment made up a significant portion of the
portfolio.
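A quick way to see why bars make category comparisons easier is to sort the categories and draw each one to a common scale. The sketch below does this in plain text. The category names come from the figures, but the percentages are hypothetical, since the text does not state the underlying values.

```python
def text_bar_chart(allocations, width=40):
    """Return one line per category, largest first, with bars scaled to the maximum."""
    largest = max(allocations.values())
    ordered = sorted(allocations.items(), key=lambda item: item[1], reverse=True)
    label_width = max(len(name) for name in allocations)
    lines = []
    for name, share in ordered:
        bar = "#" * round(share / largest * width)
        lines.append(f"{name:<{label_width}}  {bar} {share}%")
    return lines

# Hypothetical allocations in percent, for illustration only.
portfolio = {
    "International stocks": 20,
    "Large-cap U.S. stock": 18,
    "Bonds": 17,
    "Real estate": 15,
    "Mid-cap U.S. stock": 12,
    "Small-cap U.S. stock": 10,
    "Commodities": 8,
}
chart_lines = text_bar_chart(portfolio)
```

Printing `chart_lines` one per line gives a ranked view in which a one- or two-point difference between neighboring categories is visible at a glance, something that is hard to judge from adjacent pie slices.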
FIGURE 19-1
Investment portfolio breakdown
[Pie chart with seven slices: International stocks, Large-cap U.S. stock, Bonds, Real estate, Mid-cap U.S. stock, Small-cap U.S. stock, Commodities.]
3. What Message Am I Trying to Convey?
Whether you’re presenting or circulating your charts,
you need to highlight the most important items to
ensure that your audience can follow your train of thought
and focus on the right elements. For example, figure
19-3 is difficult to interpret because all the information
is displayed with equal visual value.
Are we comparing regions? Quarters? Positive versus
negative numbers? It’s difficult to determine what
matters most. By adding color or shading, you can draw the
eye to specific areas, as shown in figure 19-4.
We now know that we should be focusing on when
and in which regions revenue dropped.
FIGURE 19-2
Investment portfolio breakdown
[Bar chart of the same seven categories (International stocks, Large-cap U.S. stock, Bonds, Real estate, Mid-cap U.S. stock, Small-cap U.S. stock, Commodities) against an axis running from 0 to 20%.]
FIGURE 19-3
Revenue trends
[Table of percentage revenue changes for five regions (Americas, Australia, China, Europe, India) across Q1, Q2, Q3, Q4, and Total, with every cell given equal visual weight.]
FIGURE 19-4
Revenue trends
[The same table, with shading that draws the eye to specific cells.]
4. Do My Visuals Accurately Reflect the Numbers?
Using a lot of crazy colors, extra labels, and fancy effects
won’t captivate an audience. That kind of visual clutter
dilutes the information and can even misrepresent it.
Consider the chart in figure 19-5.
FIGURE 19-5
Yearly revenue per region
[3D bar chart of revenue for the North, South, East, and West regions in year one and year two; vertical axis from 0 to 50.]
Can you figure out the northern territory’s revenue for
year one? Is it 17? Or maybe 19? The way some programs
create 3D charts would lead any rational person to think
that the bar in question is well below 20. However, the
data behind the chart actually says that bar represents
20.4 units. You can see that if you look at the chart in a
very specific way, but it’s difficult to tell which way that
should be—even with plenty of time to scrutinize it.
It’s much clearer if you simply flatten the chart, as in
figure 19-6.
5. Is My Data Memorable?
Even if you’ve rendered your data clearly and accurately,
it’s another challenge altogether to make the information
stick. Consider using a meaningful visual metaphor to
illustrate the scale of your numbers and cement the data
in the minds of your audience members. A metaphor can
also tie your insights to something that your audience
already knows and cares about.
Author and activist Michael Pollan showed how much
crude oil goes into making a McDonald’s Double Quarter
Pounder with Cheese through a striking visual
demonstration: He placed glasses on a table and filled them
with oil to represent the amount of oil consumed during
each stage of the production process. At the end, he took
a taste of the oil to drive home his point. (To add an
element of humor, he later revealed that his prop “oil” was
actually chocolate syrup.)
Pollan could have shown a chart, but this was more
effective because he gave the audience a tangible visual—
one that triggered a visceral response.
FIGURE 19-6
Yearly revenue per region
[Flat bar chart of the same data: North, South, East, and West bars grouped under Year 1 and Year 2; vertical axis from 0 to 50.]
By answering these five questions as you’re laying out
your data, you’ll visualize it in a way that helps people
understand and engage with each point in your presen-
tation, document, or deck. As a result, your audience will
be more likely to adopt your overall message.
Nancy Duarte has published her latest book, Illuminate,
with coauthor Patti Sanchez. Duarte is also the author
of the HBR Guide to Persuasive Presentations, as well
as two award-winning books on the art of presenting,
Slide:ology and Resonate. Her team at Duarte Inc. has
created more than a quarter million presentations for its
clients and teaches public and corporate workshops on
presenting. Find Duarte on LinkedIn or follow her on
Twitter @nancyduarte.
CHAPTER 20
Why It’s So Hard for Us to Communicate Uncertainty
An interview with Scott Berinato by Nicole Torres
We use data to make predictions. But predictions are
just educated guesses—they’re uncertain. And when
they’re being communicated, they’re incredibly difficult
to explain or clearly illustrate.
A case in point: The 2016 U.S. presidential election
did not unfold the way so many predicted it would. We
now know some of the reasons why—polling failed—but
watching the real-time results on the night of Tuesday,
November 8, wasn’t just surprising, it was confusing.
Predictions swung back and forth, and it was hard to
process the information that was coming in. Not only
did the data seem wrong, the way we were presenting
that data seemed wrong too.
Adapted from “Why It’s So Hard for Us to Visualize Uncertainty” on
hbr.org, November 11, 2016 (product #H039NV).
I asked my colleague Scott Berinato, Harvard Busi-
ness Review editor and author of Good Charts: The HBR
Guide to Making Smarter, More Persuasive Data Visu-
alizations, if he would help explain this uncertainty—
how we dealt with it, why it was so hard to grasp, and
what’s so challenging about communicating and visual-
izing it.
Torres: What did you notice about how election
predictions were being shown election night?
Berinato: A lot of people were looking at the New
York Times’ live presidential forecast, where you’d see
a series of gauges (half-circle gauges, like a gas gauge
on your car) that updated frequently.1 The needle
moved left if data showed that Hillary Clinton had a
higher chance of winning, and right if Donald Trump
did. But the needle also jittered back and forth, mak-
ing it look like the statistical likelihood of winning
was changing rapidly. This caused a lot of anxiety.
People were confused. They were trying to interpret
what was going on in the election and why the data
was changing so drastically in real time, and it was
really hard to understand what was going on.
The thing was, the needle wasn’t swinging to
represent statistical likelihood; it was a hard-coded
effect meant to represent uncertainty in the statisti-
cal forecast. So trying to show real-time changes
in the race, while accounting for uncertainty, was a
good engagement effort, but the execution fell short
because it confused and unnerved people. The jitter
wasn’t the best visual approach.
What do we mean by “uncertainty”?
When thinking about showing uncertainty, we think
mostly about two types. One is statistical uncertainty,
which applies if I said something like, “Here
are my values, and statistically my confidence in
them is 95%.” Think about the margin of error built into
polls. Statisticians use things like box-and-whisker
plots to represent this, where a box spans the first
to third quartiles of a data set, a line in the box marks
the median, and thin bar “whiskers” reach above and
below the box to indicate the range of the data. Dots
can also be used beyond the whiskers to show outliers.
There are lots of variations of these, and they work
reasonably well, though academics try other approaches
sometimes and the lay audience isn’t used to these
visualizations, for the most part.
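The numbers behind a box-and-whisker plot are simple to compute, which helps when explaining the chart to a lay audience. The sketch below uses one common convention, assumed here rather than taken from the interview: whiskers extend to the most extreme points within 1.5 times the interquartile range of the box, and anything beyond is drawn as an outlier dot. The function name and sample data are hypothetical.

```python
from statistics import quantiles

def box_plot_summary(values, whisker=1.5):
    """Compute the box, median line, whisker ends, and outlier dots for a data set."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,                      # bottom of the box
        "median": med,                 # line inside the box
        "q3": q3,                      # top of the box
        "whisker_low": inside[0],      # lowest point still within the fence
        "whisker_high": inside[-1],    # highest point still within the fence
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical measurements with one unusually large value.
box = box_plot_summary([2, 4, 4, 5, 6, 7, 8, 9, 30])
```

For this sample the box runs from 4 to 8 with the median at 6, the whiskers reach 2 and 9, and the value 30 appears as a separate dot. Other variations draw the whiskers to the full minimum and maximum instead.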
The other kind of uncertainty is data uncertainty.
This applies when we’re not sure where within a
range our data falls. Instead of having a value and a
confidence in that value, we have a range of possible
values. A friend recently gave me a data set with two
values. One was “the estimate ranges from 1 in 2,000
to 1 in 4,500” and the other was “an estimate ranging
from 1 in 5,500 to 1 in 8,000.” There’s not an accepted
or right way to visualize something like this.
Finding ways to accurately and effectively repre-
sent uncertainty is one of the most important chal-
lenges in data visualization today. And it’s important
to know that visualizing uncertainty in general is
extremely difficult to do.
Why?
When you think about it, visualizations make some-
thing abstract—numbers, statistics—concrete. You
are representing an idea like 20% with a thing like a
bar or dot. A dot on a line that represents 20% looks
pretty certain. How do you then express the idea
that “five times out of a hundred this isn’t the right
answer, and it could be all these other answers”?
So are there good ways of visualizing uncertainty
like this?
A lot of the time people just don’t represent their
uncertainty, because it’s hard. We don’t want to do
that. Uncertainty is an important thing to be able
to communicate. For example, consider health care,
where outcomes of care may be uncertain but you
want people to understand their decisions. How
do you show them the possible range of outcomes,
instead of only what is the most likely or least likely
to happen? Or say there’s an outbreak of a disease
like Ebola and we want to model the worst case, the
most likely, and the best-case scenarios. How do we
represent those different outcomes? Weather fore-
casts, hurricane models are the same thing. Risk
analysts and probability experts think about how to
solve these problems all the time. It’s not easy.
There are a number of other approaches, though.
Some people use bars to represent the range of
uncertainty. Some use solid lines to show an aver-
age value and dotted lines above and below to show
the upper and lower boundaries. Using color satu-
ration or gradients to show that values are becom-
ing less and less likely—but still in the realm of
possibility—is another way.
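The solid-line-plus-dotted-lines approach needs three numbers for every point in time: a lower boundary, an average, and an upper boundary. A minimal sketch of that calculation follows, assuming the uncertainty is expressed as a set of alternative scenario paths. The function name, the choice of the 5th and 95th percentiles as boundaries, and the scenario values are all assumptions made for illustration.

```python
from statistics import mean, quantiles

def uncertainty_band(scenarios, lower_pct=5, upper_pct=95):
    """For each time step, return (lower bound, average, upper bound) across scenarios."""
    band = []
    for step_values in zip(*scenarios):
        # quantiles(..., n=100) returns the 99 percentile cut points.
        cuts = quantiles(step_values, n=100, method="inclusive")
        band.append((cuts[lower_pct - 1], mean(step_values), cuts[upper_pct - 1]))
    return band

# Five hypothetical forecast paths over three periods, all starting at 100.
scenarios = [
    [100, 100, 100],
    [100, 105, 112],
    [100, 110, 125],
    [100, 95, 90],
    [100, 102, 104],
]
band = uncertainty_band(scenarios)
```

Plotting the middle value of each triple as a solid line and the outer values as dotted lines gives a band that widens as the forecast moves further out, which is often the single most important thing for the audience to see.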
On top of uncertainty, we’re also dealing with
probability.
Yes, it’s really hard for our brains to perceive prob-
ability. When we say something has an 80%
chance of happening, it’s not the simplest thing to
understand. You can’t really feel what 80% likelihood
really means. I mean, it seems like it will probably
happen. But the important thing to remember is
that if it doesn’t happen, that doesn’t mean you were
wrong. It just means the 20% likelihood happened
instead.
Statistics are weird. Even if we felt like we under-
stood what a “20% chance” was, we don’t think of it
as the same as “1 in 5.” We tend to think that “1 in 5”
is more likely to happen than “20%.” It’s less abstract.
If you say 1 in 5 people commits a crime, you actually
picture that one person. We “image the numerator.”
But “20%” doesn’t commit a crime. It’s not a thing
that acts. It’s a statistic.
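Because “20%” and “1 in 5” land differently with an audience, it can help to be able to move between the two forms and choose deliberately. The helpers below are a small hypothetical sketch of that conversion; the function names are invented for this example.

```python
from fractions import Fraction

def as_one_in_n(probability):
    """Express a probability as '1 in N', rounding N to the nearest whole number."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be greater than 0 and at most 1")
    return f"1 in {round(1 / probability)}"

def as_percent(numerator, denominator):
    """Express 'numerator in denominator' as a percentage."""
    return float(Fraction(numerator, denominator) * 100)

one_in_five = as_one_in_n(0.20)       # the concrete, picture-a-person form
twenty_percent = as_percent(1, 5)     # the abstract, statistical form
rare_estimate = as_percent(1, 2000)   # the low end of the range quoted earlier
```

The two forms carry identical information. The choice between them is a communication decision about how vivid the number should feel.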
What do we do when the 20% or 10% chance thing
happens?
How do you tell someone who has had the very rare
thing happen to them that, based on the probability
we gave you, it was the right advice, even though it
didn’t work out for you? That’s really difficult, and
security executives and risk experts think about this
all the time. When you think about it, businesses
need to learn this because it’s easy in hindsight to
say “Our models were wrong—the unlikely bad thing
happened.” Not true! We all along were communicat-
ing there was some small chance that the bad thing
could happen. Still, as humans, that’s hard for us to
grasp.
Is it because we try to hang on to the hope of a more
favorable outcome?
It’s because likely things happen more of the time.
When unlikely things happen, we want to make sense
of it. We weren’t expecting it. We shouldn’t have been
expecting it because it was unlikely. But it’s still pos-
sible, however unlikely. Already just hearing myself
say this, you see how elliptical it sounds. When a
natural disaster strikes, you often hear people after-
ward say “It was a 100-year storm, no one could have
seen this coming.” Not true! Risk experts always see it
coming. It was always a statistical possibility. It’s just
not likely.
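A short calculation shows why risk experts are not surprised by a “100-year storm.” The term conventionally means an event with about a 1% chance of occurring in any given year. The sketch below assumes each year is independent, which is a simplification, and the function name is invented for this example.

```python
def chance_of_at_least_one(annual_probability, years):
    """Probability that an event occurs at least once, assuming independent years."""
    return 1 - (1 - annual_probability) ** years

# A 1%-per-year event over the life of a 30-year mortgage and over a century.
thirty_year_risk = chance_of_at_least_one(0.01, 30)
century_risk = chance_of_at_least_one(0.01, 100)
```

Under these assumptions the chance of seeing at least one such storm is roughly 26% over 30 years and roughly 63% over 100 years. Unlikely in any single year is very different from unlikely ever.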
I get probability, but I still can’t help but feel misled by
the presidential election predictions. What am I missing?
Three things are going on with the election mod-
els. (1) Even if a candidate had a 10% chance of
winning 10 days ago and they end up winning, it
doesn’t mean the model was wrong. It means the
unlikely happened. (2) This whole notion of using
probability to determine who will win an election
(based on whether they have an 80% chance, etc.)
is hard for the audience to grasp, because we tend
to think about elections in more binary terms—this
person will win versus that person will win. (3) We
revisit the probabilities every day and update them.
And when one candidate says something stupid,
their probability of winning goes down and the
others go up. This makes us feel like these winning
probabilities are reactive, not speculative. So we,
the lay audience, end up thinking we’re looking at
data that tells us something about how the candi-
dates are behaving, not how likely it is they’ll win.
It starts to feel more like an approval rating than a
forecast.
That first point must come up in business all the time.
The election brings the subject of visualizing un-
certainty into focus but it’s an increasingly com-
mon challenge in businesses building out their data
science operations. As data science becomes more
and more important for companies, managers are
starting to deal with types of data that show multiple
possible outcomes, where there is statistical uncer-
tainty and data uncertainty that they have to commu-
nicate to their bosses. If they don’t help their bosses
understand the uncertainty, they will look at their
charts and say that’s the answer when it’s only the
likelihood. It’s okay to focus on what is most likely,
but you don’t want to forgo showing the range of pos-
sible outcomes.
For example, if you’re looking at a way to model
customer adoption and you’re using statistical mod-
els, you want to make sure you demonstrate what you
think is most likely to happen, but also how this out-
come is one of a range of potential outcomes based
on your models. You need to be able to communicate
that visually, or your boss or client will misinterpret
what you’re saying. If the data scientists say we have a
90% chance of succeeding if we adopt this model, but
then it doesn’t happen, the boss should know that you
weren’t wrong—you really just fell into the 10%. You
rolled snake eyes. It happens. This is a really hard
thing for our brains to deal with and communicate,
and it’s an important challenge for companies invest-
ing in a data-driven approach to their businesses.
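One way to show a range of outcomes rather than a single number is to simulate the model many times and report the spread. The Python sketch below is a minimal illustration under stated assumptions: the 1,000 prospects, the 12% adoption probability, and the function name simulate_adoption are all invented for this example and do not come from any real model.

```python
import random

def simulate_adoption(prospects, adopt_prob, runs=2000, seed=42):
    """Return (low, median, high) adopter counts across simulated runs.

    low and high are roughly the 5th and 95th percentiles, so about
    90% of simulated outcomes fall between them.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    totals = sorted(
        sum(rng.random() < adopt_prob for _ in range(prospects))
        for _ in range(runs)
    )
    low = totals[int(0.05 * runs)]
    median = totals[runs // 2]
    high = totals[int(0.95 * runs)]
    return low, median, high

# Hypothetical inputs, chosen only for illustration.
low, median, high = simulate_adoption(prospects=1000, adopt_prob=0.12)
print(f"Central estimate: about {median} adopters")
print(f"Plausible range (about 90% of runs): {low} to {high}")
```

Presenting the low and high figures next to the central estimate makes it harder for a reader to mistake the most likely outcome for a guarantee.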
Scott Berinato is a senior editor at Harvard Business Re-
view and the author of Good Charts: The HBR Guide to
Making Smarter, More Persuasive Data Visualizations
(Harvard Business Review Press, 2016). Nicole Torres is
an associate editor at Harvard Business Review.
NOTE
“Live Presidential Forecast,” New York Times, November 9, 2016, https://www.nytimes.com/elections/forecast/president.
CHAPTER 21
Responding to Someone Who Challenges Your Data by Jon M. Jachimowicz
I recently conducted a study with a large, multinational
company to figure out how to increase employee engagement.
After the data collection was complete, I ran the
data analysis and found some intriguing findings that I
was excited to share with the firm. But a troubling result
became apparent in my analysis: The organization
had rampant discrimination against women, especially
Adapted from “What to Do When Someone Angrily Challenges Your
Data” on hbr.org, April 5, 2017 (product #H03L2M).
ambitious, passionate, talented women. Although this
result was based on initial data and was not particularly
rigorous, I was convinced that managers at the collab-
orating organization would like to hear it so that they
could address the issue.
I couldn’t have been more wrong. In a meeting with
the company’s head of HR and a few members of his
team, I first presented my overall findings about employee
engagement. In my last few slides, I turned the
presentation toward the results of the gender discrimi-
nation analysis that I had conducted. I was expecting an
animated conversation, and perhaps even some internal
questioning into why the discrimination was occurring
and how they could rectify it.
Instead, the head of HR got very angry. He accused
me of misrepresenting the facts, and countered by citing
data from his own records that showed men and women
were equally likely to be promoted. In addition, he had
never heard from anyone within the organization that
gender discrimination was a problem. He strongly believed
that the diversity practices his team had championed
were industry leading, and that they were sufficient
to ward off gender discrimination. Clearly, this topic was
important to him, and my findings had touched a nerve.
After his fury (and my shock) had cooled, I reminded
him that the data I presented was just initial pilot data
and should be treated as such. Perhaps if we were to do a
more thorough assessment, I argued, we would find that
the initial data was inaccurate. In addition, I proposed
that a follow-on study that focused on gender discrimi-
nation could pinpoint which aspects of the diversity poli-
cies were working particularly well, and that he could
use these insights to further advocate for his agenda.
We landed on a compromise: I would design and run an
additional study with a focus on gender discrimination,
connecting survey responses to important outcomes
such as promotions and turnover.
A few months later, the data came in. My data analysis
showed that my initial findings were correct: Gender
discrimination was happening in the company. But
the head of HR’s major claim wasn’t wrong: Men and
women were equally likely to be promoted.
The improved data set allowed us to see how both
facts could be true at the same time. We now had de-
tailed insights into which employees were—and, more
important, were not—being promoted. Although am-
bitious, passionate, and talented men were advancing
in the company, their female counterparts were being
passed over for promotion, time and again—effectively
being pushed out of the organization. That is, the best
men were moving up, but not the best women. Those
women who were being promoted were given these op-
portunities out of tokenism: They weren’t particularly
high performing, and often reached a “natural” ceiling
early on in their careers due to their limited abilities.
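The way equal overall promotion rates can coexist with unequal treatment is easier to see with numbers. The Python sketch below uses counts that are entirely hypothetical, chosen only to reproduce the pattern described above; they are not the study's data, and the group labels are assumptions made for illustration.

```python
# Hypothetical counts: (employees, promoted) per gender and performance group.
# These numbers are invented for illustration, not taken from the study.
counts = {
    "men":   {"high performers": (40, 20), "others": (60, 10)},
    "women": {"high performers": (40, 8),  "others": (60, 22)},
}

def promotion_rate(groups):
    """Promoted employees divided by all employees across the given groups."""
    groups = list(groups)
    employees = sum(n for n, _ in groups)
    promoted = sum(p for _, p in groups)
    return promoted / employees

for gender, by_group in counts.items():
    overall = promotion_rate(by_group.values())
    print(f"{gender}: overall promotion rate {overall:.0%}")
    for group, pair in by_group.items():
        # Breaking the rate out by group exposes what the total hides.
        print(f"  {group}: {promotion_rate([pair]):.0%}")
```

In this made-up data, men and women are promoted at the same overall rate of 30%, yet high-performing men are promoted at 50% and high-performing women at 20%. A single aggregate metric would report no difference at all.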
We also now had data on the specific kind of advancement
opportunities male and female employees received
to learn new skills, make new connections, and increase
their visibility in the organization. Compared with their
male counterparts, passionate women were less likely to
get these kinds of chances.
Armed with this new data, I was invited to present to
the head of HR again. Remembering our last meeting,
I expected him to be upset. But we had a very different
conversation this time. Instead of being met with anger,
the data I presented prompted concern. I could place the
fact of men and women being equally likely to be pro-
moted in a fuller context, complete with rigorous data
from the organization. We had a lively debate about why
this asymmetry between men and women existed. Most
important, we concluded that the data he measured to
track gender discrimination was unable to provide him
with the necessary insight to understand whether gen-
der discrimination was a problem.
He has since appointed a task force to tackle the prob-
lem of gender discrimination head-on, something he
wouldn’t have done if we hadn’t collected the data that
we did. This is the power of collecting thorough data in
your own organization: Instead of making assumptions
on what may or may not be occurring, a thoughtful de-
sign of data-collection practices allows you to gather the
right information to come to better conclusions.
So it’s not just about the data you have. Existing data
blinds us, and it is important to shift the focus away from
readily available information. Crucially, not having the
right data is no excuse. In the case of the head of HR, not
hearing about gender discrimination from anyone in the
organization allowed him to conclude that women did
not face this issue. Think about what data is not being
collected that may help embed existing data in a richer
context.
Next time someone angrily challenges your data,
there are a few steps you can take:
First, take their perspective. Understand why your
counterpart is responding so forcefully. In many
cases, it may simply be that they really care about the
outcome. Your goals may even be aligned, and fram-
ing your data in a way where their goals are achieved
may help you circumvent their anger.
Second, collect more data that specifically takes their
criticism to heart. Every comment is a useful comment.
Just as a fiction author can’t be upset when
readers don’t get the point of what they are trying to
say, a researcher must understand how their findings
are being understood. What is the upset recipient
of your analysis responding to, and how can further
data collection help you address their concerns?
Last, view your challenger not as an opponent, but
as an ally. Find a way to collaborate, because once
you have their buy-in, they are invested in the joint
investigation. As a result, they will be more likely to
view you as being part of the team. And then you can
channel the energy that prompted their fury for good.
Defending your data analysis can be stressful, especially
if your findings cause conflict. But by following
these steps, you can defuse any tension and attack the
problem in a productive way.
Jon M. Jachimowicz is a PhD candidate at Columbia
Business School. In his research, he investigates the an-
tecedents, perceptions, and consequences of passion for
work. His website can be found at jonmjachimowicz.com.
CHAPTER 22
Decisions Don’t Start with Data by Nick Morgan
I recently worked with an executive keen to persuade his
colleagues that their company should drop a longtime
vendor in favor of a new one. He knew that members of
the executive team opposed the idea (in part because of
their well-established relationships with the vendor) but
he didn’t want to confront them directly, so he put to-
gether a PowerPoint presentation full of stats and charts
showing the cost savings that might be achieved by the
change.
He hoped the data would speak for itself.
But it didn’t.
Adapted from content posted on hbr.org, May 14, 2014 (product
#H00T3S).
The team stopped listening about a third of the way
through the presentation. Why? It was good data. The
executive was right. But, even in business meetings,
numbers don’t ever speak for themselves, no matter how
visually appealing the presentation may be.
To influence human decision making, you have to
get to the place where decisions are really made—in the
unconscious mind, where emotions rule, and data is
mostly absent. Yes, even the most savvy executives begin
to make choices this way. They get an intent, a desire,
or a want in their unconscious minds, and then decide
to pursue it and act on that decision. Only after that
do they become consciously aware of the choice they’ve
made and start to justify it with rational argument. In
fact, research from Carnegie Mellon University indicates
that our unconscious minds actually make better deci-
sions when left alone to deal with complex issues.
Data is helpful as supporting material, of course. But,
because it spurs thinking in the conscious mind, it must
be used with care. Effective persuasion starts not with
numbers, but with stories that have emotional power
because that’s the best way to tap into unconscious de-
cision making. We decide to invest in a new company
or business line not because the financial model shows
it will succeed but because we’re drawn to the story told
by the people pitching it. We buy goods and services
because we believe the stories marketers build around
them: “A diamond is forever” (De Beers), “Real beauty”
(Dove), “Think different” (Apple), “Just do it” (Nike). We
take jobs not only for the pay and benefits but also for
the self-advancement story we’re told, and tell ourselves,
about working at the new place.
Sometimes we describe this as having a good “gut
feeling.” What that really means is that we’ve already un-
consciously decided to go forward, based on desire, and
our conscious mind is seeking some rationale for that
otherwise invisible decision.
I advised the executive to scrap his PowerPoint and
tell a story about the opportunities for future growth
with the new vendor, reframing and trumping the loy-
alty story the opposition camp was going to tell. And so,
in his next attempt, rather than just presenting data, he
told his colleagues that they should all be striving toward
a new vision for the company, no longer held back by a
tether to the past. He began with an alluring description
of the future state—improved margins, a cooler, higher-
tech product line, and excited customers—then asked his
audience to move forward with him to reach that goal. It
was a quest story, and it worked.
Data can provide new insight and evidence to inform
your toughest decisions. But numbers alone won’t con-
vince others. Good stories—with a few key facts woven
in—are what attach emotions to your argument, prompt
people into unconscious decision making, and ultimately
move them to action.
Nick Morgan is a speaker, coach, and the president and
founder of Public Words, a communications consulting
firm. He is the author of Power Cues: The Subtle Science
of Leading Groups, Persuading Others, and Maximiz-
ing Your Personal Impact (Harvard Business Review
Press, 2014).
APPENDIX
Data Scientist: The Sexiest Job of the 21st Century by Thomas H. Davenport and D.J. Patil
When Jonathan Goldman arrived for work in June 2006
at LinkedIn, the business networking site, the place still
felt like a startup. The company had just under 8 million
accounts, and the number was growing quickly as exist-
ing members invited their friends and colleagues to join.
But users weren’t seeking out connections with the peo-
ple who were already on the site at the rate executives
had expected. Something was apparently missing in the
social experience. As one LinkedIn manager put it, “It
Reprinted from Harvard Business Review, October 2012 (product
#R1210D).
was like arriving at a conference reception and realizing
you don’t know anyone. So you just stand in the corner
sipping your drink—and you probably leave early.”
Goldman, a PhD in physics from Stanford, was in-
trigued by the linking he did see going on and by the
richness of the user profiles. It all made for messy data
and unwieldy analysis, but as he began exploring people’s
connections, he started to see possibilities. He began
forming theories, testing hunches, and finding patterns
that allowed him to predict whose networks a given profile
would land in. He could imagine that new features
capitalizing on the heuristics he was developing might
provide value to users. But LinkedIn’s engineering team,
caught up in the challenges of scaling up the site, seemed
uninterested. Some colleagues were openly dismissive of
Goldman’s ideas. Why would users need LinkedIn to figure
out their networks for them? The site already had an
address book importer that could pull in all a member’s
connections.
Luckily, Reid Hoffman, LinkedIn’s cofounder and CEO
at the time (now its executive chairman), had faith in the
power of analytics because of his experiences at PayPal,
and he had granted Goldman a high degree of autonomy.
For one thing, he had given Goldman a way to circum-
vent the traditional product release cycle by publishing
small modules in the form of ads on the site’s most popu-
lar pages.
Through one such module, Goldman started to test
what would happen if you presented users with names of
people they hadn’t yet connected with but seemed likely
to know—for example, people who had shared their
tenures at schools and workplaces. He did this by gin-
ning up a custom ad that displayed the three best new
matches for each user based on the background entered
in his or her LinkedIn profile. Within days it was obvious
that something remarkable was taking place. The
click-through rate on those ads was the highest ever
seen. Goldman continued to refine how the suggestions
were generated, incorporating networking ideas such as
“triangle closing”—the notion that if you know Larry and
Sue, there’s a good chance that Larry and Sue know each
other. Goldman and his team also got the action required
to respond to a suggestion down to one click.
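The "triangle closing" idea can be sketched in a few lines. The Python example below is a simplified illustration, not LinkedIn's actual algorithm: the tiny network, the names other than Larry and Sue, and the function suggest_connections are all assumptions made for this example. It ranks people a user is not yet connected to by how many connections they share, and returns the top three.

```python
from collections import Counter

def suggest_connections(network, user, top_n=3):
    """Suggest people who share the most mutual connections with user."""
    known = network.get(user, set())
    mutual = Counter()
    for friend in known:
        for candidate in network.get(friend, set()):
            # Skip the user and anyone the user already knows.
            if candidate != user and candidate not in known:
                mutual[candidate] += 1
    # Sort by mutual count (descending), then by name, for stable output.
    ranked = sorted(mutual.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]

# Hypothetical network: each person maps to the set of people they know.
network = {
    "You":   {"Larry", "Sue"},
    "Larry": {"You", "Ana"},
    "Sue":   {"You", "Ana"},
    "Ana":   {"Larry", "Sue"},
}

print(suggest_connections(network, "Larry"))  # Larry and Sue share two contacts
print(suggest_connections(network, "You"))    # Ana is known to both Larry and Sue
```

Here Larry and Sue both know "You" and Ana but not each other, so each is suggested to the other, which closes the triangle.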
It didn’t take long for LinkedIn’s top managers to rec-
ognize a good idea and make it a standard feature. That’s
when things really took off. “People You May Know” ads
achieved a click-through rate 30% higher than the rate
obtained by other prompts to visit more pages on the
site. They generated millions of new page views. Thanks
to this one feature, LinkedIn’s growth trajectory shifted
significantly upward.
A New Breed
Goldman is a good example of a new key player in organizations:
the “data scientist.” It’s a high-ranking professional
with the training and curiosity to make discoveries
in the world of big data. The title has been around
for only a few years. (It was coined in 2008 by one of us,
D.J. Patil, and Jeff Hammerbacher, then the respective
leads of data and analytics efforts at LinkedIn and Face-
book.) But thousands of data scientists are already work-
ing at both startups and well-established companies.
Their sudden appearance on the business scene reflects
the fact that companies are now wrestling with informa-
tion that comes in varieties and volumes never encoun-
tered before. If your organization stores multiple peta-
bytes of data, if the information most critical to your
business resides in forms other than rows and columns
of numbers, or if answering your biggest question would
involve a “mashup” of several analytical efforts, you’ve
got a big data opportunity.
Much of the current enthusiasm for big data focuses
on technologies that make taming it possible, including
Hadoop (the most widely used framework for distributed
file system processing) and related open-source tools,
cloud computing, and data visualization. While those are
important breakthroughs, at least as important are the
people with the skill set (and the mindset) to put them to
good use. On this front, demand has raced ahead of sup-
ply. Indeed, the shortage of data scientists is becoming a
serious constraint in some sectors. Greylock Partners, an
early-stage venture firm that has backed companies such
as Facebook, LinkedIn, Palo Alto Networks, and Work-
day, is worried enough about the tight labor pool that
it has built its own specialized recruiting team to chan-
nel talent to businesses in its portfolio. “Once they have
data,” says Dan Portillo, who leads that team, “they really
need people who can manage it and find insights in it.”
Who Are These People?
If capitalizing on big data depends on hiring scarce data
scientists, then the challenge for managers is to learn
how to identify that talent, attract it to an enterprise, and
make it productive. None of those tasks is as straightfor-
ward as it is with other, established organizational roles.
Start with the fact that there are no university programs
offering degrees in data science. There is also little consensus
on where the role fits in an organization, how
data scientists can add the most value, and how their
performance should be measured.
The first step in filling the need for data scientists,
therefore, is to understand what they do in businesses.
Then ask, What skills do they need? And what fields are
those skills most readily found in?
More than anything, what data scientists do is make
discoveries while swimming in data. It’s their preferred
method of navigating the world around them. At ease in
the digital realm, they are able to bring structure to large
quantities of formless data and make analysis possible.
They identify rich data sources, join them with other,
potentially incomplete data sources, and clean the re-
sulting set. In a competitive landscape where challenges
keep changing and data never stop flowing, data scientists
help decision makers shift from ad hoc analysis to
an ongoing conversation with data.
Data scientists realize that they face technical limita-
tions, but they don’t allow that to bog down their search
for novel solutions. As they make discoveries, they com-
municate what they’ve learned and suggest its implica-
tions for new business directions. Often they are creative
in displaying information visually and making the patterns
they find clear and compelling. They advise executives
and product managers on the implications of the
data for products, processes, and decisions.
Given the nascent state of their trade, it often falls to
data scientists to fashion their own tools and even conduct
academic-style research. Yahoo, one of the firms
that employed a group of data scientists early on, was
instrumental in developing Hadoop. Facebook’s data
team created the language Hive for programming Ha-
doop projects. Many other data scientists, especially at
data-driven companies such as Google, Amazon, Micro-
soft, Walmart, eBay, LinkedIn, and Twitter, have added
to and refined the tool kit.
What kind of person does all this? What abilities
make a data scientist successful? Think of him or her
as a hybrid of data hacker, analyst, communicator, and
trusted adviser. The combination is extremely power-
ful—and rare.
Data scientists’ most basic, universal skill is the ability
to write code. This may be less true in five years’ time,
when many more people will have the title “data scien-
tist” on their business cards. More enduring will be the
need for data scientists to communicate in language
that all their stakeholders understand—and to demon-
strate the special skills involved in storytelling with data,
whether verbally, visually, or—ideally—both.
But we would say the dominant trait among data sci-
entists is an intense curiosity—a desire to go beneath the
surface of a problem, find the questions at its heart, and
distill them into a very clear set of hypotheses that can
be tested. This often entails the associative thinking that
characterizes the most creative scientists in any field. For
example, we know of a data scientist studying a fraud
problem who realized that it was analogous to a type of
DNA sequencing problem. By bringing together those
disparate worlds, he and his team were able to craft a so-
lution that dramatically reduced fraud losses.
Perhaps it’s becoming clear why the word “scientist”
fits this emerging role. Experimental physicists, for example,
also have to design equipment, gather data, conduct
multiple experiments, and communicate their results.
Thus, companies looking for people who can work
with complex data have had good luck recruiting among
those with educational and work backgrounds in the
physical or social sciences. Some of the best and brightest
data scientists are PhDs in esoteric fields like ecology
and systems biology. George Roumeliotis, the head
of a data science team at Intuit in Silicon Valley, holds a
doctorate in astrophysics. A little less surprisingly, many
of the data scientists working in business today were for-
mally trained in computer science, math, or economics.
They can emerge from any field that has a strong data
and computational focus.
It’s important to keep that image of the scientist in
mind—because the word “data” might easily send a
search for talent down the wrong path. As Portillo told
us, “The traditional backgrounds of people you saw 10
to 15 years ago just don’t cut it these days.” A quantita-
tive analyst can be great at analyzing data but not at sub-
duing a mass of unstructured data and getting it into a
form in which it can be analyzed. A data management
expert might be great at generating and organizing data
in structured form but not at turning unstructured data
into structured data—and also not at actually analyzing
the data. And while people without strong social skills
might thrive in traditional data professions, data scien-
tists must have such skills to be effective.
Roumeliotis was clear with us that he doesn’t hire on
the basis of statistical or analytical capabilities. He be-
gins his search for data scientists by asking candidates if
they can develop prototypes in a mainstream program-
ming language such as Java. Roumeliotis seeks both a
skill set—a solid foundation in math, statistics, probabil-
ity, and computer science—and certain habits of mind.
He wants people with a feel for business issues and em-
pathy for customers. Then, he says, he builds on all that
with on-the-job training and an occasional course in a
particular technology.
Several universities are planning to launch data sci-
ence programs, and existing programs in analytics, such
as the Master of Science in Analytics program at North
Carolina State, are busy adding big data exercises and
coursework. Some companies are also trying to de-
velop their own data scientists. After acquiring the big
data firm Greenplum, EMC decided that the availability
of data scientists would be a gating factor in its own—
and customers’—exploitation of big data. So its Educa-
tion Services division launched a data science and big
data analytics training and certification program. EMC
makes the program available to both employees and cus-
tomers, and some of its graduates are already working
on internal big data initiatives.
As educational offerings proliferate, the pipeline of
talent should expand. Vendors of big data technolo-
gies are also working to make them easier to use. In the
meantime one data scientist has come up with a creative
approach to closing the gap. The Insight Data Science
Fellows Program, a postdoctoral fellowship designed by
HOW TO FIND THE DATA SCIENTISTS YOU NEED
1. Focus recruiting at the “usual suspect” universi-
ties (Stanford, MIT, Berkeley, Harvard, Carnegie
Mellon) and also at a few others with proven
strengths: North Carolina State, UC Santa Cruz,
the University of Maryland, the University of
Washington, and UT Austin.
2. Scan the membership rolls of user groups
devoted to data science tools. The R User
Groups (for an open-source statistical
tool favored by data scientists) and Python
Interest Groups (for PIGgies) are good places
to start.
3. Search for data scientists on LinkedIn—they’re
almost all on there, and you can see if they have
the skills you want.
4. Hang out with data scientists at the Strata,
Structure:Data, and Hadoop World conferences
and similar gatherings (there is almost one a
week now) or at informal data scientist “meet-
ups” in the Bay Area; Boston; New York; Wash-
ington, DC; London; Singapore; and Sydney.
5. Make friends with a local venture capitalist, who
is likely to have gotten a variety of big data pro-
posals over the past year.
6. Host a competition on Kaggle or TopCoder, the
analytics and coding competition sites. Follow
up with the most-creative entrants.
7. Don’t bother with any candidate who can’t code.
Coding skills don’t have to be at a world-class
level but should be good enough to get by. Look
for evidence, too, that candidates learn rapidly
about new technologies and methods.
8. Make sure a candidate can find a story in a data
set and provide a coherent narrative about a
key data insight. Test whether he or she can
communicate with numbers, visually and
verbally.
9. Be wary of candidates who are too detached
from the business world. When you ask how
their work might apply to your management
challenges, are they stuck for answers?
10. Ask candidates about their favorite analysis or
insight and how they are keeping their skills
sharp. Have they gotten a certificate in the
advanced track of Stanford’s online Machine
Learning course, contributed to open-source
projects, or built an online repository of code to
share (for example, on GitHub)?
Jake Klamka (a high-energy physicist by training), takes
scientists from academia and in six weeks prepares them
to succeed as data scientists. The program combines
mentoring by data experts from local companies (such
as Facebook, Twitter, Google, and LinkedIn) with ex-
posure to actual big data challenges. Originally aiming
for 10 fellows, Klamka wound up accepting 30, from an
applicant pool numbering more than 200. More organi-
zations are now lining up to participate. “The demand
from companies has been phenomenal,” Klamka told us.
“They just can’t get this kind of high-quality talent.”
Why Would a Data Scientist Want to Work Here?
Even as the ranks of data scientists swell, competition
for top talent will remain fierce. Expect candidates to
size up employment opportunities on the basis of how
interesting the big data challenges are. As one of them
commented, “If we wanted to work with structured data,
we’d be on Wall Street.” Given that today’s most qualified
prospects come from nonbusiness backgrounds, hiring
managers may need to figure out how to paint an exciting
picture of the potential for breakthroughs that their
problems offer.
Pay will of course be a factor. A good data scientist
will have many doors open to him or her, and salaries
will be bid upward. Several data scientists working at
startups commented that they’d demanded and got large
stock option packages. Even for someone accepting a po-
sition for other reasons, compensation signals a level of
respect and the value the role is expected to add to the
business. But our informal survey of the priorities of
data scientists revealed something more fundamentally
important. They want to be “on the bridge.” The refer-
ence is to the 1960s television show Star Trek, in which
the starship captain James Kirk relies heavily on data
supplied by Mr. Spock. Data scientists want to be in the
thick of a developing situation, with real-time awareness
of the evolving set of choices it presents.
Considering the difficulty of finding and keeping data
scientists, one would think that a good strategy would
involve hiring them as consultants. Most consulting
firms have yet to assemble many of them. Even the largest
firms, such as Accenture, Deloitte, and IBM Global
Services, are in the early stages of leading big data proj-
ects for their clients. The skills of the data scientists
they do have on staff are mainly being applied to more-
conventional quantitative analysis problems. Offshore
analytics services firms, such as Mu Sigma, might be the
ones to make the first major inroads with data scientists.
But the data scientists we’ve spoken with say they
want to build things, not just give advice to a decision
maker. One described being a consultant as “the dead
zone—all you get to do is tell someone else what the
analyses say they should do.” By creating solutions that
work, they can have more impact and leave their marks
as pioneers of their profession.
Care and Feeding
Data scientists don’t do well on a short leash. They
should have the freedom to experiment and explore pos-
sibilities. That said, they need close relationships with
the rest of the business. The most important ties for
them to forge are with executives in charge of products
and services rather than with people overseeing business
functions. As the story of Jonathan Goldman illustrates,
their greatest opportunity to add value is not in creating
reports or presentations for senior executives but in innovating
with customer-facing products and processes.
LinkedIn isn’t the only company to use data scientists
to generate ideas for products, features, and value-adding
services. At Intuit data scientists are asked to develop in-
sights for small-business customers and consumers and
report to a new senior vice president of big data, social
design, and marketing. GE is already using data science
to optimize the service contracts and maintenance inter-
vals for industrial products. Google, of course, uses data
scientists to refine its core search and ad-serving algorithms.
Zynga uses data scientists to optimize the game
experience for both long-term engagement and revenue.
Netflix created the well-known Netflix Prize, given to
the data science team that developed the best way to
improve the company’s movie recommendation system.
The test-preparation firm Kaplan uses its data scientists
to uncover effective learning strategies.
There is, however, a potential downside to having
people with sophisticated skills in a fast-evolving field
spend their time among general management colleagues.
They’ll have less interaction with similar specialists,
which they need to keep their skills sharp and their tool
kit state-of-the-art. Data scientists have to connect with
communities of practice, either within large firms or externally.
New conferences and informal associations are
springing up to support collaboration and technology
sharing, and companies should encourage scientists to
become involved in them with the understanding that
“more water in the harbor floats all boats.”
Data scientists tend to be more motivated, too, when
more is expected of them. The challenges of accessing
and structuring big data sometimes leave little time or
energy for sophisticated analytics involving prediction or
optimization. Yet if executives make it clear that simple
reports are not enough, data scientists will devote more
effort to advanced analytics. Big data shouldn’t equal
“small math.”
The Hot Job of the Decade
Hal Varian, the chief economist at Google, is known
to have said, “The sexy job in the next 10 years will be
statisticians. People think I’m joking, but who would’ve
guessed that computer engineers would’ve been the sexy
job of the 1990s?”
If “sexy” means having rare qualities that are much in
demand, data scientists are already there. They are difficult
and expensive to hire and, given the very competitive
market for their services, difficult to retain. There simply
aren’t a lot of people with their combination of scientific
background and computational and analytical skills.
Data scientists today are akin to Wall Street “quants”
of the 1980s and 1990s. In those days people with back-
grounds in physics and math streamed to investment
banks and hedge funds, where they could devise entirely
new algorithms and data strategies. Then a variety of
universities developed master’s programs in financial engineering,
which churned out a second generation of talent
that was more accessible to mainstream firms. The
pattern was repeated later in the 1990s with search engineers,
whose rarefied skills soon came to be taught in
computer science programs.
One question raised by this is whether some firms
would be wise to wait until that second generation of
data scientists emerges, and the candidates are more nu-
merous, less expensive, and easier to vet and assimilate
in a business setting. Why not leave the trouble of hunt-
ing down and domesticating exotic talent to the big data
startups and to firms like GE and Walmart, whose aggressive
strategies require them to be at the forefront?
The problem with that reasoning is that the advance
of big data shows no signs of slowing. If companies sit
out this trend’s early days for lack of talent, they risk fall-
ing behind as competitors and channel partners gain
nearly unassailable advantages. Think of big data as an
epic wave gathering now, starting to crest. If you want to
catch it, you need people who can surf.
Thomas H. Davenport is the President’s Distinguished
Professor in Management and Information Technology
at Babson College, a research fellow at the MIT Initiative
on the Digital Economy, and a senior adviser at Deloitte
Analytics. Author of over a dozen management books,
his latest is Only Humans Need Apply: Winners and
Losers in the Age of Smart Machines. D.J. Patil was appointed
as the first U.S. chief data scientist and has led
product development at LinkedIn, eBay, and PayPal. He
is the author of Data Jujitsu: The Art of Turning Data
into Product.
Index
A/B testing, 59–70
blocking in, 62
defined, 60
example of, 65–67
interpretation of results,
63–64
mistakes in, 68–69
multivariate, 62–63
origin of, 60
overview of, 60–63
real-time optimization, 68
retesting, 69
sequential, 62
uses of, 64–65
Albert (artificial intelligence
algorithm), 114–117
analytical models, 15, 22–23, 43,
84–85, 161, 198. See also
data models
analytics-based decision making.
See data-driven decisions
Anderson, Chris, 103
annual rate of return (ARR), 145,
147–148
artificial intelligence (AI),
114–117, 149–150. See also
machine learning
assumptions
confirming, through visualizations, 179
in predictive analytics, 84–86
asymmetrical loss function, 105
attributes versus benefits, in
marketing, 152–153
audience
data presentations and, 175,
178, 180, 184, 186, 188–189
understanding your, 180
automation, 112
averages, 138–139
Bank of America, 18–19
bar charts, 185, 186–188
behavior patterns, 84
bias
cognitive, 156–163
confirmation, 156–158
linear, 135–140, 145–151
overconfidence, 159–161
overfitting, 161–163
visualizations and, 181–182
big data, 13–14, 104, 212, 222,
223
blocking, in A/B tests, 62
Boston Consulting Group (BCG),
105
Box, George, 15
box-and-whisker plots, 193
Buffett, Warren, 23
business experiments. See
experiments
business relevance, of data findings,
121, 127–128
causation, 40, 92–93, 96, 103,
104–109
charts, 26, 150–151, 183–190. See
also data visualization
Cigna, 15, 17
clarity of causality, 104–109
Clark, Ed, 23–24
cognitive psychology, 134. See
also cognitive traps
cognitive traps, 156–163
color, in visualizations, 184–185,
186
communication
accuracy in, 165–167, 186–188
of data findings, 5–6, 20,
165–167, 173–176
of data needs, 37–40
with data scientists, 21–22,
38–43
responding to challenges,
199–203
of uncertainty, 191–198
visualizations (see data
visualization)
confidence interval, 126–127, 129
confirmation bias, 156–158
context, in rendering data, 181,
184
control groups, 48–49
correlation, 40, 68, 92–93,
96–101, 158
acting on, 103–109
confidence in recurrence of,
104–105
spurious, 68, 93, 96–101, 162
culture, of inquiry, 22–23
customer lifetime value, 81, 141
data
anonymized, 41–42
big, 13–14, 104, 212, 222, 223
challenges to, 199–203
cherry-picking, 158, 162
clean, 42–43, 75–76
collection, 26, 34–36, 39–42
communication of (see
communication)
context for, 181, 184
costs of, 40, 41–42
errors in, 72–73
evaluating, 43, 72–76
external, 39, 48
forward-looking, 35
integration of, 76–78
internal, 39, 48
for machine learning, 117
metadata, 178, 181
versus metrics, 51–58
needs, 33–36, 39–40
outliers, 5, 21, 85, 166–169
for predictive analytics,
82–83
presenting, 184–190
qualitative, 35–36
quality of, 71–78, 94–95, 181
quantitative, 35–36
for regression analysis, 94–95
scrubbing, 75–76
story, 20, 35, 207
structured, 42–43
trusting the, 71–78, 94–95
unstructured, 42–43
uses of, 1
visualizing (see data
visualization)
data analysts. See data scientists
data analytics
deceptive, 165–169
predictive, 81–86
process, 3–6
data audits, 47–48
data collection, 26, 34–36,
39–42
data-driven behaviors, 7–9
data-driven decisions, 7–9, 14,
16, 18, 23–24, 155–164
data experts. See data scientists
data models
overfitting, 43, 161–163
simplicity of, 43
See also analytical models
data scientists
asking questions of, 21–22, 85
compensation for, 219
education programs for,
218–219
finding and attracting,
212–220
job of, 209–223
requesting data and analytics
from, 37–44
retention of, 219–222
role of, 6–7, 213–215
skills of, 214–215
working with, 18–19, 220–222
data uncertainty, 193, 197
data visualization, 177–190
audience for, 180, 184
accuracy of, 186–188
charts, 26, 150–151, 183–190
color and shading in, 184–185,
186
to combat linear bias, 150–151
context in, 184
reasons for, 178–179, 181
risks, 181–182
of uncertainty, 193–195,
197–198
deception, 165–169
decision making
cognitive traps in, 155–163
data-driven, 7–9, 14, 16, 18,
23–24, 155–164
factors in, 205–207
heuristics, 156
intuition and, 95–96
linear thinking and, 131–154
statistical significance and,
121–129
unconscious, 206–207
de-duplication, 76
dependent variables, 88, 90, 91
devil’s advocate, playing, 22–23,
160
DoSomething.org, 51–52, 58
Duhigg, Charles, 84
economies of scale, 143–144
education
programs for data scientists,
218–219
to learn analytics, 15, 17
in visualizations, 179
enterprise data systems, 35
errors
in data, 73, 75–76, 168
non-sampling, 128–129
prediction, 138–139
sampling, 122–123, 128
error term, 91, 95
experiments, 40
A/B tests, 59–69
design of, 17, 45–50
field, 148–149
narrow questions for, 46
plan for, 49–50
randomization in, 48–49,
61–62
randomized controlled, 41, 60
results from, 50
statistical significance and, 125
study populations for, 48
exploration, in visualizations, 179
exploratory data analysis (EDA),
179, 181, 182
external data, 39, 48
Facebook, 40, 214
favorable outcomes, 196–197
field experiments, 148–149
financial drivers, 55–56
findings, communication of, 5–6,
20, 165–167, 173–176
Fisher, Ronald, 60
Friedman, Frank, 22
Fung, Kaiser, 59–64, 68–69
future predictions. See
predictions
General Electric (GE), 221
Goldman, Jonathan, 209–211,
220
Google, 46, 161, 220
Gottman, John, 174–175
governing objectives, 54–57
Greylock Partners, 212
gut feelings, 206–207. See also
intuition
Hadoop, 212, 214
Harley-Davidson, 114–117
heuristics, 156
Hoffman, Reid, 210
hypothesis, 19
null, 124–125
independent variables, 88, 90,
91–92
indicators, 146–148
information overload, 155
inquiry, culture of, 22–23
internal data, 39, 48
Intuit, 221
intuition, 95–96, 113, 117–118
Kahneman, Daniel, 160
Kaplan (test preparation firm),
221
Kempf, Karl, 18
Klein, Gary, 160
lift, 64
linear thinking, 131–154
awareness of, 146
limiting pitfalls of, 145–151
in practice, 135–140
LinkedIn, 209–211, 217, 220
Loveman, Gary, 23
machine learning, 111–119
assessment of problem for,
112–114
data for, 117
defined, 112–113
example, 114–117
intuition and, 117–118
moving forward with, 118–119
marketing, 152–153
medium maximization, 147
Mendel, Gregor, 173–174
Merck, 20–21, 22
metadata, 178, 181
metrics
choosing, 54–57, 139–140
versus data, 51–58
intermediate, 146–147
managing, 58
performance, 139–140
using too many, 68
vanity, 53–54
Minnesota Coronary Experi-
ment, 156–157
mistakes
in A/B testing, 68–69
in machine learning, 117–118
in regression analysis, 94–96
in statistical significance,
128–129
mortgage crisis, 84–85
multivariate testing, 62–63
Netflix, 220
Nightingale, Florence, 177–178
noise, 61, 161–162
nonfinancial measures, 55–56
nonlinear relationships, 131–154
linear bias and, 135–140
mapping, 149–151
marketing and, 152–153
performance metrics and,
139–140
types of, 140–145, 148–149
non-sampling error, 128–129
null hypothesis, 124–125
objectives, aligning metrics with,
54–57
observational studies, 40
outcomes, 146–148, 196–197
outliers, 5, 21, 85, 166–169, 193
overconfidence, 159–161
overfitting, 43, 161–163
patterns, random, 163
performance metrics, 139–140
persuasion, 206–207
per unit profit, 143–144
Pollan, Michael, 189
polls, 126–127, 191–193
population variation, 123–124
practical significance, 121,
126–128
prediction errors, 138–139
predictions, 81–86, 161, 191–193
predictive analytics, 81–86
assumptions, 84–86
data for, 82–83
questions to ask about, 85
regression analysis, 83
statistics of, 83
uses for, 81–82
pre-mortems, 160
presentations, 184–190
presidential election (2016),
191–193, 196–197
privacy, 41–42
probability, 195–197
problems, framing, 19–21
p-value, 125, 126
qualitative data, 35–36
quality, of data, 71–78, 94–95,
181
quantitative analysts. See data
scientists
quantitative data, 35–36. See also
data
quants. See data scientists
questions
to ask data experts, 21–22,
38–43, 85
asking the right, 35, 38
for experiments, 46
for focused data search, 34–36
for investigating outliers, 168
machine learning and, 117–118
for understanding audience,
180
randomization, 48–49, 61–62, 68
randomized controlled experi-
ments, 41, 60
random noise, 161–162
random patterns, 163
real-time optimization, 68
regression analysis, 83, 87–101
correlation and causation,
92–93, 96–101
data for, 94–95
defined, 88
dependent variables in, 88,
90, 91
error term, 91, 95
independent variables in, 88,
90, 91–92
mistakes in, 94–96
process of, 88–92
use of, 92
regression coefficients, 83
regression line, 90–91
replication crisis, 49–50
results, communication of, 5–6,
20, 165–167, 173–176
return on investment (ROI),
20–21, 145
Roumeliotis, George, 175, 215,
216, 218
sample size, 123
sampling error, 122–123, 128
Shutterstock, 65–67
Silver, Nate, 161
spurious correlations, 68, 93,
96–101, 162
statistical methods story, 20
statistical models. See analytical
models
statistical significance, 121–129,
158
calculation of, 125
confidence interval and,
126–127, 129
definition, 122
mistakes when working with,
128–129
null hypothesis and, 124–125
overview of, 122–125
sampling error and, 122–123
use of, 126–128
variation and, 123–124
statistical uncertainty, 193, 197
statistics
deceptive, 165–169
evaluating, 57
machine learning and, 117–118
picking, 54–57
in predictive analytics, 83
summary, 26, 28
storytelling, with data, 20, 35, 207
structured data, 42–43
study populations, 48
summary statistics, 26, 28
Summers, Larry, 21
surveys, 128–129, 138
tables, 185
time-series plots, 26, 27
Toronto-Dominion Bank (TD),
23–24
treatment groups, in experimen-
tation, 48–49
uncertainty, 191–198
data, 193, 197
favorable outcomes and,
196–197
probability and, 195–197
statistical, 193, 197
unconscious decision making,
206–207
unstructured data, 42–43
user privacy, 41–42
vanity metrics, 53–54
variables
comparing dissimilar, 99
dependent, 88, 90, 91
independent, 88, 90, 91–92
Varian, Hal, 222
variation, 28, 123–124
Vigen, Tyler, 96–98
visualization. See data
visualization
Yahoo, 214
Zynga, 221
- Copyright
- What You'll Learn
- Contents
- Introduction
- Section 1: Getting Started
- Ch 1: Keep Up with Your Quants
- Ch 2: A Simple Exercise to Help You Think Like a Data Scientist
- Section 2: Gather the Right Information
- Ch 3: Do You Need All That Data?
- Ch 4: How to Ask Your Data Scientists for Data and Analytics
- Ch 5: How to Design a Business Experiment
- Ch 6: Know the Difference Between Your Data and Your Metrics
- Ch 7: The Fundamentals of A/B Testing
- Ch 8: Can Your Data Be Trusted?
- Section 3: Analyze the Data
- Ch 9: A Predictive Analytics Primer
- Ch 10: Understanding Regression Analysis
- Ch 11: When to Act On a Correlation, and When Not To
- Ch 12: Can Machine Learning Solve Your Business Problem?
- Ch 13: A Refresher on Statistical Significance
- Ch 14: Linear Thinking in a Nonlinear World
- Ch 15: Pitfalls of Data-Driven Decisions
- Ch 16: Don't Let Your Analytics Cheat the Truth
- Section 4: Communicate Your Findings
- Ch 17: Data Is Worthless If You Don't Communicate It
- Ch 18: When Data Visualization Works--and When It Doesn't
- Ch 19: How to Make Charts That Pop and Persuade
- Ch 20: Why It's So Hard for Us to Communicate Uncertainty
- Ch 21: Responding to Someone Who Challenges Your Data
- Ch 22: Decisions Don't Start with Data
- Appendix: Data Scientist: The Sexiest Job of the 21st Century
- Index