Pandas Data Cleaning
MAS638 – Business Analytics Consulting
Section 2 – First case
Document type
Date
CONFIDENTIAL AND PROPRIETARY
Any use of this material without specific permission of McKinsey & Company is strictly prohibited
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
1
Effective problem solving is at the heart of successful consulting
... the problem is usually not well-defined
... the client has already decided they can’t do it alone
... you have very little time
... but you cannot afford to make big mistakes
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
1
Different types of problems
Application of a skill or a technique
2
‘Run a linear optimization on a truck schedule with known costs and known routes’
MAS 631/632
Data is given, math can be conceptually difficult
Open, ill-defined problem with possibly no or multiple great answers
‘How and how much can X save through improving the Supply Chain?’
This course
Data is wrong, non-existent, poor quality, not consistent
Often very difficult to figure out what data is useful and how to use it
Math usually quite easy
Absolute proof rarely exists
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Select, fun, real problems I worked on during my McKinsey time
3
We are a global direct-selling company with 6 million sales reps around the world and we are heading for bankruptcy due to online shopping. Help us!
How can our airline improve our on-time performance?
We accidentally sold lots of GM seed corn to US farmers that is not approved for human consumption. We lost track of it. If Japan finds it in a couple of ships of US corn, they will refuse it and the global corn market will grind to a halt. What should we do?
What is the fastest airline boarding process that also has fantastic customer satisfaction?
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
The basic problem solving process
4
Highly iterative, sometimes cycling daily
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Define the problem
Structure the problem
Prioritize
Plan & conduct analysis
Synthesize & develop recommendation
Dunder Mifflin has hired you!
5
You have been retained by Dunder Mifflin, a distribution company
They sell paper and many other products to other business customers (no retail)
They have many sales reps that call on these customers
They want to make more money!
Their question to you:
Do we have a Pricing opportunity?
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Interview with the sales manager
Why do you think you have a pricing opportunity?
Well, we know that we don’t price all our customers or products the same, and sales reps have a lot of freedom to set prices. And sometimes I hear our customers say that our prices are low compared to competitors, although frequently they also say they are too high. We also raised prices on several hundred customers on April 1st , although I don’t know how that worked out.
How does pricing work?
We have ‘official’ list prices, but we negotiate a standard discount off list price with every customer. Small customers obviously get a smaller discount. But then each sales rep over time adjusts the prices individually – a customer may say ‘this item is too expensive’ or the reps may raise the price on some. So it’s all over the place. Each branch in theory follows this same approach and we have 30+ branches.
How do sales reps work and how do they get paid?
Each rep is assigned a set of customers. They get paid a small percentage of the revenue of that customer, so it’s completely variable and we can’t lose. So they really have a lot of freedom.
What are the basic numbers of the business?
We have >100,000 SKUs, the Scranton branch has 13,000 customers, 50 sales reps and sells nearly $130 million per year. We’d consider Scranton pretty representative, so let’s look there first.
6
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
What is the problem definition?
7
Basic question or problem:
Criteria for success:
What is in scope, what is not:
Decision makers:
Other stakeholders and involvement:
Constraints:
Perspective & context:
As precise as possible without accidentally constraining
A clear definition of what defines success, e.g., ‘>$500m in 3 years’
Primary deciders and who can affect the outcome
Who else needs to be involved and how they are thinking about this
The overall context in which this problem occurs. E.g., trends in the market, competitor moves, etc.
Anything that limits the solution space, e.g., ‘people neutral’, ‘don’t upset customers’
E.g., ‘Focused on US business unit only’
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Dunder Mifflin Pricing Problem Definition
8
Basic question or problem:
Can we make a sustainably higher profit by raising prices? On what products and customers should we raise prices and by how much?
Criteria for success:
At least $3m in additional profit
What is in scope, what is not:
Scranton branch
Decision makers:
Branch manager & VP of sales
Other stakeholders and involvement:
Sales reps
Constraints:
none? Any pricing guidance from corporate overall?
Perspective & context:
Reps have significant pricing freedom, pricing likely all over the place
Raised prices on some customers in April
Some rumors that we are cheaper than competitors
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Structuring the problem – the issue tree
9
Main issue/
problem
Sub-issue 1
Sub-issue 2
Sub-issue 3
Sub-issue 1.1
Sub-issue 2.1
Sub-issue 1.2
Sub-issue 2.2
Sub-issue 2.3
How do we structure the overall question? What are the different ways to raise price?
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Define
Structure
Prioritize
Analyze
Synthesizem
A possible Dunder Mifflin starter issue tree
10
Pricing opportunity
Raise prices universally
Selectively raise prices
Raise prices on certain customers
Make specific sales reps price better
Other ideas
Some decent guesses, but this is stuff the client told us already. We need some facts and insights
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Define
Structure
Prioritize
Analyze
Synthesizem
What data would you ask the client for?
11
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Two fundamentally different approaches to data
12
Approach 1: directly ask for data on specific issues. E.g., if this was on logistics
Can you give me data on deliveries per customer?
Can you give me your route schedule?
What is the typical wait time per customer delivery?
...
Approach 2: ask for raw data that you can use for those and many others
The customer master – names, IDs, addresses
The delivery transactions – customer, date & time, truck #
If you don’t derive stuff from the master data:
You might get the wrong information and you can’t check
If you have more ideas you have to go back and the results may be inconsistent
If you find something interesting you cannot directly dig down
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
The Dunder Mifflin data
13
Activity data
Sales transactions for Scranton
SKU
Customer
Date
Quantity
Price per unit
List of customers we planned to increase price on April 1st
Customer
Price increase (as a factor – 1.05 is a 5% increase)
Master data
Sales rep assignments
Customer
Sales rep assigned
SKU master data
SKU
Vendor ID
Cost per unit
SKU type (ABC)
This is real data from a real distributor. Obviously SKUs, Customer and Vendor IDs are disguised
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
A couple days has passed, and this first stage of the analysis is complete We need to communicate to the client. What and how do we say it?
14
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Synthesize and develop recommendations – Hypothesis trees
15
Main point
Supporting argument 1
Supporting argument 2
Supporting argument 3
Evidence/supporting argument
Supporting argument 2.1
Evidence/supporting argument
Supporting argument 2.2
Supporting argument 2.3
Hypothesis trees are about structuring your logic & telling the story
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Define
Structure
Prioritize
Analyze
Synthesizem
Hypothesis tree - example
16
We should pay our sales force on margin instead of revenue, we can grow profits substantially
Sales reps sell products badly because we compensate them on revenue
Selling at higher prices would increase profitability significantly
Sales reps would accept switching to a new compensation system
Data analysis shows that many sales happen at near 0 margin, meaning no profit
Raising prices by 2% on average would be worth $25m
In anonymous interviews, sales reps admit to frequently lowering price to close deals quickly in order to get higher sales overall
The close rate of sales reps that charge 2-3% higher prices is not different from sales at lower prices
Customers tell us that our products are preferred over competitors
Interviews show that reps are very open to a switch
Good reps would make more money and only a handful of reps would make slightly less
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Define
Structure
Prioritize
Analyze
Synthesizem
Hypothesis tree – DM initial pricing analysis
17
Early indications are that DM has substantial pricing opportunity, however more work is required
Analyzing profitability by customer shows that many small customers are low margin, despite our intention to price them higher
Similarly, some reps show much higher margin than others
However, more analysis is required
The raw data contained a lot of non-sensical entries. We recommend to review these and adjust as needed
Some of the margin differences in customers/reps may be caused by other external factors that we have to exclude
Most customer buy each item no more than once per year, thus the odds that a customer even notices increases are low
Analysis of the April 1st price increase was inconclusive, but no big negative customer reaction was seen
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Executive summary
18
Early indications are that DM has substantial pricing opportunity, however more work is required
Analyzing profitability by customer shows that many small customers are low margin, despite our intention to price them higher
Similarly, some reps show much higher margin than others
Analysis of the April 1st price increase was inconclusive, but no big negative customer reaction was seen
Most customer buy each item no more than once per year, thus the odds that a customer even notices increases are low
However, more analysis is required
The raw data contained a lot of non-sensical entries. We recommend to review these and adjust as needed
Some of the margin differences in customers/reps may be caused by other external factors that we have to exclude
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Many small customers have low margin – analysis indicates a gap of >$2 million
19
Margin by customer
(Percent)
Customer revenue ($)
There is little correlation between customer margin and size
A large number of small customers do not meet our minimum target of 50% margin
If every customer under $1m revenue could be raised to at least 50%, we would be making $2.3m in additional profit
Source: Profitability analysis of DM transaction data (last 12 months)
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Many small customers have low margin – analysis indicates a gap of >$2 million
20
Margin by customer
(Percent)
Customer revenue ($)
There is little correlation between customer margin and size
A large number of small customers do not meet our minimum target of 50% GM
If every customer under $1m revenue could be raised to at least 50%, we would be making $2.3m in additional profit
Source: Profitability analysis of DM transaction data (last 12 months)
The message: what you want the readers to take away, most important part of the slide
Chart title: what is shown on the chart
Comments & explanations
Source: where the data and analysis comes from
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Similarly, at first sight, several sales reps could raise prices
21
Margin by sales rep
(Percent)
Sales rep ID
Most sales reps sell on average with our target margin, but several are substantially below 50%
Further work is required to understand whether this is due to customer mix or competitive intensity in their territory or other reasons
Source: Profitability analysis of DM transaction data (last 12 months)
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
0.38925573923841789 0.53541146916645854 0.53226311780277569 0.47258051601543694 0.48084285705819396 0.49389965102260985 0.41919048923985752 0.49194002104771206 0.52012371505335542 0.52472383988465976 0.49948868803583762 0.34987512286060729 0.53103595166885231 0.5322899287682481 0.51322157192258577 0.51838079858706521 0.54097352928093256 0.52562444940618158 0.51244758137044788 0.45456744449711667 0.45132925828032178 0.54291883773862282 0.54747409213343223 0.53608879330229431 0.54195694297535124 0.21853770092674035 0.51364256127278418 0.51123562428648262 0.52302879580181427 0.53103810257629203 0.53292984886133221 0.52597209399303002 0.52316875702584864 0.24455023322685115 0.35085789550844354 0.53965379115392376 0.51392826184344698 0.2974496824809193 0.49321965988600996 0.41991590971491538 0.41918009929947608 0.48103759135209456 0.54269930192495608 0.50700201552228497 0.47231067265842774 0.50126485918873187 0.4921864902682489 0.52984736909273167 0.45396295322599206 0.52006933804370159
Analysis of the April 1st price increase is inconclusive, but at least it did not look as if customers had a violent negative reaction
22
On April 1, DM increased prices on 788 customers by 5% on all items
We manually confirmed that these price increases were indeed implemented
Margins stayed flat while the control group grew 2 percent
Revenue increased by 52% vs 16% for Control Group
Analysis is inconclusive at this point – clearly customers selected were non-representative in some way
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Gross margin by customer group
Q1 2019
Increased customers Other customers 0.5 0.44 Q2 2019
Increased customers Other customers 0.5 0.46
Revenue by customer group (Indexed to Q1)
Q1 2019
Increased customers Other customers 1 1 Q2 2019
Increased customers Other customers 1.52 1.1599999999999999
Significantly more work is required
23
Issues and potential other factors
Data: The transaction data and cost data has a number of issues (>10% of entries) – missing cost information, frequent ‘sales at loss’, items with multiple prices, etc.
Other correlations: customer price differences may be due to product mix, geographic location, industry etc. Sales rep variations may be due to customers in different geographies or with different competitive intensities
Recommended next steps
Need to sit down with IT and Operations group to review specific instances and agree on approach to clean
Review select ‘interesting’ customers jointly with sales leader and identify any factors that need to be taken into consideration
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Recap, what have we learned?
24
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
What have we learned about Consulting with Analytics?
25
Consulting/problem-solving is an end-to-end process with many parts
The usually hard parts
What exactly is the problem?
How do I break it apart?
What analysis should I and can I do that will actually tell me something and be defensible against critics?
How do identify and filter out the crud in the data so I am basing the analysis on the right info
How do I use the insights to tell a convincing story?
The usually easy part
Doing the analysis once the data is clean
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
Define the problem
Structure the problem
Prioritize
Plan & conduct analysis
Synthesize & develop recommendation
What have we learned about Consulting with Analytics?
26
2. You need to UNDERSTAND the data, INSPECT it and CHECK it
What exactly does every field mean?
What do ‘weird’ entries mean in the real world? E.g., is this a return or an error? Is this really a normal SKU? Why would we sell something at a loss? Is there really one customer that we lose $millions on?
Do sample customers act in a way that is intuitive? E.g., do they buy a 1-5 times a week?
Do the aggregates in the data set jibe with the overall business numbers? Is the total revenue close to the financials? Is the number of customers approximately right? Is the gross margin roughly right? Are the top earning SKUs the ones we would expect? …
Macro-statistics (e.g., regressions, overall stats) are not a substitute for understanding
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
What have we learned about Consulting with Analytics?
27
3. Only BIG signals matter
Real data is random and will always show trends
Small variations are usually not a signal of opportunity
Small trends usually are random or at least don’t translate to value
4. Small BIG signals matter
Always look at the ‘units of action’, you might find one or two that drive substantial value
E.g.,
Two big customers may be extremely poorly priced
Two out of the 50 sales reps are really abusing the system
Regression & other macro-analysis usually does not find the opportunities
Never trust just the summary statistics of an analysis, always look at plots – you are looking for the ‘That’s weird’ moment
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
What have we learned about Consulting with Analytics?
28
5. Behold the power of 80/20
Very few businesses have customers or products that are all similar. The 80/20 rule
20% of customers make 80% of the profit
20% of SKUs are 80% of the volume
20% of transactions are 80% of the revenue
20% of customers have 80% of the wealth
When you remove the top 20%, the 80/20 rule still applies – Distributions follow power laws, not normal statistics
This means you can often radically short-cut analysis by focusing on the 20%
6. Beware the curse of 80/20
Random variations do not ‘null out’ in aggregate analysis
E.g.,
800 truly random customers out of 14000 are not necessarily representative
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
29
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure
30
TRACKER
Unit of measure
1 Footnote
SOURCE: Source
Title
Unit of measure