Pandas Data Cleaning

profileSUMMER11
MAS638Section2-afirstcase.pptx

MAS638 – Business Analytics Consulting

Section 2 – First case

Document type

Date

CONFIDENTIAL AND PROPRIETARY

Any use of this material without specific permission of McKinsey & Company is strictly prohibited

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

1

Effective problem solving is at the heart of successful consulting

... the problem is usually not well-defined

... the client has already decided they can’t do it alone

... you have very little time

... but you cannot afford to make big mistakes

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

1

Different types of problems

Application of a skill or a technique

2

‘Run a linear optimization on a truck schedule with known costs and known routes’

MAS 631/632

Data is given, math can be conceptually difficult

Open, ill-defined problem with possibly no or multiple great answers

‘How and how much can X save through improving the Supply Chain?’

This course

Data is wrong, non-existent, poor quality, not consistent

Often very difficult to figure out what data is useful and how to use it

Math usually quite easy

Absolute proof rarely exists

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Select, fun, real problems I worked on during my McKinsey time

3

We are a global direct-selling company with 6 million sales reps around the world and we are heading for bankruptcy due to online shopping. Help us!

How can our airline improve our on-time performance?

We accidentally sold lots of GM seed corn to US farmers that is not approved for human consumption. We lost track of it. If Japan finds it in a couple of ships of US corn, they will refuse it and the global corn market will grind to a halt. What should we do?

What is the fastest airline boarding process that also has fantastic customer satisfaction?

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

The basic problem solving process

4

Highly iterative, sometimes cycling daily

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Define the problem

Structure the problem

Prioritize

Plan & conduct analysis

Synthesize & develop recommendation

Dunder Mifflin has hired you!

5

You have been retained by Dunder Mifflin, a distribution company

They sell paper and many other products to other business customers (no retail)

They have many sales reps that call on these customers

They want to make more money!

Their question to you:

Do we have a Pricing opportunity?

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Interview with the sales manager

Why do you think you have a pricing opportunity?

Well, we know that we don’t price all our customers or products the same, and sales reps have a lot of freedom to set prices. And sometimes I hear our customers say that our prices are low compared to competitors, although frequently they also say they are too high. We also raised prices on several hundred customers on April 1st , although I don’t know how that worked out.

How does pricing work?

We have ‘official’ list prices, but we negotiate a standard discount off list price with every customer. Small customers obviously get a smaller discount. But then each sales rep over time adjusts the prices individually – a customer may say ‘this item is too expensive’ or the reps may raise the price on some. So it’s all over the place. Each branch in theory follows this same approach and we have 30+ branches.

How do sales reps work and how do they get paid?

Each rep is assigned a set of customers. They get paid a small percentage of the revenue of that customer, so it’s completely variable and we can’t lose. So they really have a lot of freedom.

What are the basic numbers of the business?

We have >100,000 SKUs, the Scranton branch has 13,000 customers, 50 sales reps and sells nearly $130 million per year. We’d consider Scranton pretty representative, so let’s look there first.

6

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

What is the problem definition?

7

Basic question or problem:

Criteria for success:

What is in scope, what is not:

Decision makers:

Other stakeholders and involvement:

Constraints:

Perspective & context:

As precise as possible without accidentally constraining

A clear definition of what defines success, e.g., ‘>$500m in 3 years’

Primary deciders and who can affect the outcome

Who else needs to be involved and how they are thinking about this

The overall context in which this problem occurs. E.g., trends in the market, competitor moves, etc.

Anything that limits the solution space, e.g., ‘people neutral’, ‘don’t upset customers’

E.g., ‘Focused on US business unit only’

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Dunder Mifflin Pricing Problem Definition

8

Basic question or problem:

Can we make a sustainably higher profit by raising prices? On what products and customers should we raise prices and by how much?

Criteria for success:

At least $3m in additional profit

What is in scope, what is not:

Scranton branch

Decision makers:

Branch manager & VP of sales

Other stakeholders and involvement:

Sales reps

Constraints:

none? Any pricing guidance from corporate overall?

Perspective & context:

Reps have significant pricing freedom, pricing likely all over the place

Raised prices on some customers in April

Some rumors that we are cheaper than competitors

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Structuring the problem – the issue tree

9

Main issue/

problem

Sub-issue 1

Sub-issue 2

Sub-issue 3

Sub-issue 1.1

Sub-issue 2.1

Sub-issue 1.2

Sub-issue 2.2

Sub-issue 2.3

How do we structure the overall question? What are the different ways to raise price?

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Define

Structure

Prioritize

Analyze

Synthesizem

A possible Dunder Mifflin starter issue tree

10

Pricing opportunity

Raise prices universally

Selectively raise prices

Raise prices on certain customers

Make specific sales reps price better

Other ideas

Some decent guesses, but this is stuff the client told us already. We need some facts and insights

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Define

Structure

Prioritize

Analyze

Synthesizem

What data would you ask the client for?

11

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Two fundamentally different approaches to data

12

Approach 1: directly ask for data on specific issues. E.g., if this was on logistics

Can you give me data on deliveries per customer?

Can you give me your route schedule?

What is the typical wait time per customer delivery?

...

Approach 2: ask for raw data that you can use for those and many others

The customer master – names, IDs, addresses

The delivery transactions – customer, date & time, truck #

If you don’t derive stuff from the master data:

You might get the wrong information and you can’t check

If you have more ideas you have to go back and the results may be inconsistent

If you find something interesting you cannot directly dig down

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

The Dunder Mifflin data

13

Activity data

Sales transactions for Scranton

SKU

Customer

Date

Quantity

Price per unit

List of customers we planned to increase price on April 1st

Customer

Price increase (as a factor – 1.05 is a 5% increase)

Master data

Sales rep assignments

Customer

Sales rep assigned

SKU master data

SKU

Vendor ID

Cost per unit

SKU type (ABC)

This is real data from a real distributor. Obviously SKUs, Customer and Vendor IDs are disguised

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

A couple days has passed, and this first stage of the analysis is complete We need to communicate to the client. What and how do we say it?

14

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Synthesize and develop recommendations – Hypothesis trees

15

Main point

Supporting argument 1

Supporting argument 2

Supporting argument 3

Evidence/supporting argument

Supporting argument 2.1

Evidence/supporting argument

Supporting argument 2.2

Supporting argument 2.3

Hypothesis trees are about structuring your logic & telling the story

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Define

Structure

Prioritize

Analyze

Synthesizem

Hypothesis tree - example

16

We should pay our sales force on margin instead of revenue, we can grow profits substantially

Sales reps sell products badly because we compensate them on revenue

Selling at higher prices would increase profitability significantly

Sales reps would accept switching to a new compensation system

Data analysis shows that many sales happen at near 0 margin, meaning no profit

Raising prices by 2% on average would be worth $25m

In anonymous interviews, sales reps admit to frequently lowering price to close deals quickly in order to get higher sales overall

The close rate of sales reps that charge 2-3% higher prices is not different from sales at lower prices

Customers tell us that our products are preferred over competitors

Interviews show that reps are very open to a switch

Good reps would make more money and only a handful of reps would make slightly less

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Define

Structure

Prioritize

Analyze

Synthesizem

Hypothesis tree – DM initial pricing analysis

17

Early indications are that DM has substantial pricing opportunity, however more work is required

Analyzing profitability by customer shows that many small customers are low margin, despite our intention to price them higher

Similarly, some reps show much higher margin than others

However, more analysis is required

The raw data contained a lot of non-sensical entries. We recommend to review these and adjust as needed

Some of the margin differences in customers/reps may be caused by other external factors that we have to exclude

Most customer buy each item no more than once per year, thus the odds that a customer even notices increases are low

Analysis of the April 1st price increase was inconclusive, but no big negative customer reaction was seen

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Executive summary

18

Early indications are that DM has substantial pricing opportunity, however more work is required

Analyzing profitability by customer shows that many small customers are low margin, despite our intention to price them higher

Similarly, some reps show much higher margin than others

Analysis of the April 1st price increase was inconclusive, but no big negative customer reaction was seen

Most customer buy each item no more than once per year, thus the odds that a customer even notices increases are low

However, more analysis is required

The raw data contained a lot of non-sensical entries. We recommend to review these and adjust as needed

Some of the margin differences in customers/reps may be caused by other external factors that we have to exclude

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Many small customers have low margin – analysis indicates a gap of >$2 million

19

Margin by customer

(Percent)

Customer revenue ($)

There is little correlation between customer margin and size

A large number of small customers do not meet our minimum target of 50% margin

If every customer under $1m revenue could be raised to at least 50%, we would be making $2.3m in additional profit

Source: Profitability analysis of DM transaction data (last 12 months)

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Many small customers have low margin – analysis indicates a gap of >$2 million

20

Margin by customer

(Percent)

Customer revenue ($)

There is little correlation between customer margin and size

A large number of small customers do not meet our minimum target of 50% GM

If every customer under $1m revenue could be raised to at least 50%, we would be making $2.3m in additional profit

Source: Profitability analysis of DM transaction data (last 12 months)

The message: what you want the readers to take away, most important part of the slide

Chart title: what is shown on the chart

Comments & explanations

Source: where the data and analysis comes from

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Similarly, at first sight, several sales reps could raise prices

21

Margin by sales rep

(Percent)

Sales rep ID

Most sales reps sell on average with our target margin, but several are substantially below 50%

Further work is required to understand whether this is due to customer mix or competitive intensity in their territory or other reasons

Source: Profitability analysis of DM transaction data (last 12 months)

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

0.38925573923841789 0.53541146916645854 0.53226311780277569 0.47258051601543694 0.48084285705819396 0.49389965102260985 0.41919048923985752 0.49194002104771206 0.52012371505335542 0.52472383988465976 0.49948868803583762 0.34987512286060729 0.53103595166885231 0.5322899287682481 0.51322157192258577 0.51838079858706521 0.54097352928093256 0.52562444940618158 0.51244758137044788 0.45456744449711667 0.45132925828032178 0.54291883773862282 0.54747409213343223 0.53608879330229431 0.54195694297535124 0.21853770092674035 0.51364256127278418 0.51123562428648262 0.52302879580181427 0.53103810257629203 0.53292984886133221 0.52597209399303002 0.52316875702584864 0.24455023322685115 0.35085789550844354 0.53965379115392376 0.51392826184344698 0.2974496824809193 0.49321965988600996 0.41991590971491538 0.41918009929947608 0.48103759135209456 0.54269930192495608 0.50700201552228497 0.47231067265842774 0.50126485918873187 0.4921864902682489 0.52984736909273167 0.45396295322599206 0.52006933804370159

Analysis of the April 1st price increase is inconclusive, but at least it did not look as if customers had a violent negative reaction

22

On April 1, DM increased prices on 788 customers by 5% on all items

We manually confirmed that these price increases were indeed implemented

Margins stayed flat while the control group grew 2 percent

Revenue increased by 52% vs 16% for Control Group

Analysis is inconclusive at this point – clearly customers selected were non-representative in some way

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Gross margin by customer group

Q1 2019

Increased customers Other customers 0.5 0.44 Q2 2019

Increased customers Other customers 0.5 0.46

Revenue by customer group (Indexed to Q1)

Q1 2019

Increased customers Other customers 1 1 Q2 2019

Increased customers Other customers 1.52 1.1599999999999999

Significantly more work is required

23

Issues and potential other factors

Data: The transaction data and cost data has a number of issues (>10% of entries) – missing cost information, frequent ‘sales at loss’, items with multiple prices, etc.

Other correlations: customer price differences may be due to product mix, geographic location, industry etc. Sales rep variations may be due to customers in different geographies or with different competitive intensities

Recommended next steps

Need to sit down with IT and Operations group to review specific instances and agree on approach to clean

Review select ‘interesting’ customers jointly with sales leader and identify any factors that need to be taken into consideration

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Recap, what have we learned?

24

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

What have we learned about Consulting with Analytics?

25

Consulting/problem-solving is an end-to-end process with many parts

The usually hard parts

What exactly is the problem?

How do I break it apart?

What analysis should I and can I do that will actually tell me something and be defensible against critics?

How do identify and filter out the crud in the data so I am basing the analysis on the right info

How do I use the insights to tell a convincing story?

The usually easy part

Doing the analysis once the data is clean

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

Define the problem

Structure the problem

Prioritize

Plan & conduct analysis

Synthesize & develop recommendation

What have we learned about Consulting with Analytics?

26

2. You need to UNDERSTAND the data, INSPECT it and CHECK it

What exactly does every field mean?

What do ‘weird’ entries mean in the real world? E.g., is this a return or an error? Is this really a normal SKU? Why would we sell something at a loss? Is there really one customer that we lose $millions on?

Do sample customers act in a way that is intuitive? E.g., do they buy a 1-5 times a week?

Do the aggregates in the data set jibe with the overall business numbers? Is the total revenue close to the financials? Is the number of customers approximately right? Is the gross margin roughly right? Are the top earning SKUs the ones we would expect? …

Macro-statistics (e.g., regressions, overall stats) are not a substitute for understanding

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

What have we learned about Consulting with Analytics?

27

3. Only BIG signals matter

Real data is random and will always show trends

Small variations are usually not a signal of opportunity

Small trends usually are random or at least don’t translate to value

4. Small BIG signals matter

Always look at the ‘units of action’, you might find one or two that drive substantial value

E.g.,

Two big customers may be extremely poorly priced

Two out of the 50 sales reps are really abusing the system

Regression & other macro-analysis usually does not find the opportunities

Never trust just the summary statistics of an analysis, always look at plots – you are looking for the ‘That’s weird’ moment

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

What have we learned about Consulting with Analytics?

28

5. Behold the power of 80/20

Very few businesses have customers or products that are all similar. The 80/20 rule

20% of customers make 80% of the profit

20% of SKUs are 80% of the volume

20% of transactions are 80% of the revenue

20% of customers have 80% of the wealth

When you remove the top 20%, the 80/20 rule still applies – Distributions follow power laws, not normal statistics

This means you can often radically short-cut analysis by focusing on the 20%

6. Beware the curse of 80/20

Random variations do not ‘null out’ in aggregate analysis

E.g.,

800 truly random customers out of 14000 are not necessarily representative

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

29

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure

30

TRACKER

Unit of measure

1 Footnote

SOURCE: Source

Title

Unit of measure