Marketing Assignments
Section 6
The Central Limit Theorem
Rhonda Knehans Drake
Associate Professor, New York University
Data Analytics, Interpretation and Reporting Copyright © 2013
• The assessment of sample means (averages and percentages) is the basis of many everyday business decisions.
• Therefore, understanding exactly how an average is distributed is KEY to properly assessing one versus another (Pre vs. Post, Control vs. Test, etc.).
• As luck would have it, averages will approximately follow a normal distribution as n gets large (a common rule of thumb is n > 30).
Introduction
• When you take a sample from a population and calculate the average dollars spent, for example, that average or mean has certain distributional properties.
• According to the Central Limit Theorem, regardless of how the population from which we sampled is distributed, the sample mean (or response rate, or click-through rate) for n of roughly 30 or more will be approximately normally distributed, with a mean equal to the mean of the population from which the sample came and a standard deviation equal to the standard deviation of that population divided by the square root of the sample size.
• This was proven a long time ago. And we can take advantage of it.
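In symbols, the statement above can be written as follows. The notation is an assumption of this sketch rather than the slides' own: $\mu$ and $\sigma$ stand for the population mean and standard deviation, $X_1, \dots, X_n$ for the sampled values, and $\bar{X}$ for the sample mean.

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i
\;\overset{\text{approx.}}{\sim}\;
N\!\left(\mu,\ \frac{\sigma^{2}}{n}\right),
\qquad
\operatorname{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}}
```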
The Central Limit Theorem I
• So what does this mean? Even if the distribution of dollars spent is highly skewed (not symmetric, and therefore not normal at all), the distribution of your statistics, such as average spend, will be approximately normal.
• Let’s take a look at what the Central Limit Theorem is saying.
The Central Limit Theorem I
[Diagram: the ACME Database (10,000,000 customer records) feeds 1,000 random samples, Sample 1 through Sample 1,000, each with n = 10,000 customers. Each sample i yields Xi = avg. income of sample i, giving X1, X2, X3, X4, ..., X1,000.]
The Central Limit Theorem II
• The database analyst at ACME Direct draws 1,000 random samples, of size 10,000 each, from the database and observes the data element “household income.”
• The analyst then calculates the mean “household income” for each of the 1,000 samples and creates a frequency distribution of the average incomes from the 1,000 samples drawn.
Income Range          Frequency
Less than $10,000            32
$10,000 - $20,000           119
$20,000 - $30,000           187
$30,000 - $40,000           326
$40,000 - $50,000           192
$50,000 - $60,000           116
$60,000+                     28
Total                     1,000
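The sampling experiment above can be imitated on a small scale. The following is a minimal sketch, not ACME's actual procedure: the population is a made-up, right-skewed set of 100,000 incomes (lognormal, with parameters chosen only for illustration), the sample size is scaled down to 200, and the function names `simulate_sample_means` and `frequency_table` are hypothetical.

```python
import random
import statistics


def simulate_sample_means(population, num_samples, sample_size, seed=0):
    """Draw repeated random samples (without replacement) and return each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]


def frequency_table(values, bin_width):
    """Map each bin's lower edge to the number of values falling in that bin."""
    counts = {}
    for value in values:
        lower_edge = int(value // bin_width) * bin_width
        counts[lower_edge] = counts.get(lower_edge, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical right-skewed "household income" population (illustrative only).
population_rng = random.Random(42)
population = [population_rng.lognormvariate(10.5, 0.8) for _ in range(100_000)]

# 1,000 samples, as on the slide, but with a smaller sample size to keep this fast.
sample_means = simulate_sample_means(population, num_samples=1_000, sample_size=200)

print("Range of sample mean      Frequency")
for lower_edge, count in frequency_table(sample_means, bin_width=2_500).items():
    print(f"${lower_edge:>7,} - ${lower_edge + 2_500:>7,}    {count:>6}")
```

Although the individual incomes in this made-up population are strongly skewed, the printed counts of the sample means should pile up around a central bin and taper off on both sides.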
The Central Limit Theorem III
• The analyst creates a histogram using the frequency table and notes that these 1,000 sample mean income values are approximately normally distributed (symmetric and bell-shaped).
[Figure: Distribution of 1,000 Sample Means. Histogram with Frequency on the vertical axis (0 to 350) and Income Ranges on the horizontal axis (Less than $10,000 through $60,000+).]
The Central Limit Theorem IV
• According to the Central Limit Theorem, the histogram of sample means will be approximately normally distributed (bell-shaped and symmetric) with
– a mean equal to the true mean “household income” level of all people on the database and
– a standard deviation equal to the true standard deviation for all people on the database divided by the square root of n.
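These two properties can be checked empirically. The sketch below is an illustration under stated assumptions, not part of the original material: it builds a hypothetical skewed population (exponential, with a mean of about $40,000), draws many samples, and compares the mean and standard deviation of the sample means with the values the theorem predicts. The function name `compare_to_clt` is made up for this example.

```python
import math
import random
import statistics


def compare_to_clt(population, num_samples, sample_size, seed=0):
    """Compare simulated sample means with the mean and SD the CLT predicts."""
    rng = random.Random(seed)
    mu = statistics.fmean(population)        # true population mean
    sigma = statistics.pstdev(population)    # true population standard deviation
    sample_means = [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]
    return {
        "population_mean": mu,
        "mean_of_sample_means": statistics.fmean(sample_means),
        # CLT prediction: sigma / sqrt(n). Sampling is without replacement, so a
        # small finite-population correction is ignored here (n is tiny relative to N).
        "clt_standard_deviation": sigma / math.sqrt(sample_size),
        "sd_of_sample_means": statistics.stdev(sample_means),
    }


# Hypothetical skewed population of 50,000 incomes (illustrative only).
population_rng = random.Random(7)
population = [population_rng.expovariate(1 / 40_000) for _ in range(50_000)]

result = compare_to_clt(population, num_samples=1_000, sample_size=100)
for name, value in result.items():
    print(f"{name:>24}: {value:12,.2f}")
```

In a run like this, the first pair of numbers and the second pair of numbers should each agree closely, though not exactly, because only a finite number of samples is drawn.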
The Central Limit Theorem V
The normal curve is so important that it appeared, alongside a portrait of Gauss, on the German 10 Deutsche Mark note until the euro replaced the mark in 2002.
Gauss and the Deutsche Mark
Suppose you work for American Express and ran a test on 1,000 new card members to stimulate spend.
• Based on your test, you observed an average spend of $175 with a standard deviation of $25.
• You know this is only an estimate of reality, because it comes from a test on a sample.
How can you estimate the spend level for rollout to all new card members?
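• Based on the CLT, we can say with roughly 95% confidence that the true average spend lies within plus or minus 2 standard deviations of the sample average, where the standard deviation of a sample average is the standard deviation of spend divided by the square root of n.
• So if $25 is the standard deviation of individual spend, the true average spend should lie between about $173.42 and $176.58 ($175 ± 2 × $25/√1,000); the range $125 to $225 applies only if $25 is already the standard deviation of the average itself.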
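The arithmetic behind this example can be sketched in a few lines of Python. The sketch assumes the $25 is the standard deviation of individual card members' spend (the example does not say which it is), in which case the spread of the average that the CLT describes is $25 divided by the square root of 1,000. The helper name `two_sigma_interval` is made up for illustration.

```python
import math


def two_sigma_interval(center, spread):
    """Return (low, high) = center minus/plus 2 * spread."""
    return center - 2 * spread, center + 2 * spread


sample_mean = 175.0  # average spend observed in the test, in dollars
sample_sd = 25.0     # ASSUMED to be the standard deviation of individual spend
n = 1_000            # number of new card members in the test

# Standard deviation of the sample average (the standard error) per the CLT.
standard_error = sample_sd / math.sqrt(n)

# Range covering roughly 95% of individual card members' spend.
individual_range = two_sigma_interval(sample_mean, sample_sd)

# Approximate 95% interval for the true average spend at rollout.
mean_interval = two_sigma_interval(sample_mean, standard_error)

print(f"Standard error of the average:  ${standard_error:.2f}")
print(f"Average +/- 2 std. deviations:  ${individual_range[0]:.2f} to ${individual_range[1]:.2f}")
print(f"Average +/- 2 standard errors:  ${mean_interval[0]:.2f} to ${mean_interval[1]:.2f}")
```

The first range describes where individual spend values tend to fall; the second, much narrower one describes the uncertainty in the average itself, which is the quantity the rollout question asks about.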
An Example of the CLT in Practice
6.1 You note the number of arrivals each day at Starbucks in Grand Central
Station between the hours of 5 pm and 6 pm for the months of April and
May. You do the same for the Starbucks at Penn Station. You calculate
the average for Grand Central and the average for Penn Station.
a) How is the variable number of arrivals between 5pm and 6pm
distributed?
b) How is the average number of arrivals for Grand Central and Penn
Station distributed?
6.2 You know income to be normally distributed on your customer file. You
sample 20 people on the file. How will the average income be distributed
for this small sample?
6.3 Income on your database is highly skewed. You sample 20 people on the
file. How will the average income be distributed for this small sample?
Section 6 Exercises I
6.4 Income on your database is highly skewed. You sample 1,000 people on
the file. How will the average income be distributed for this sample of size
1,000?
Section 6 Exercises II