Data Science and R Language
Copyright © 2014 EMC Corporation. All rights reserved.
Review of Basic Data Analytic Methods Using R
Module 3: Basic Data Analytic Methods Using R
During this lesson the following topics are covered:
• Statistics in the Analytic Lifecycle
• Hypothesis Testing
• Difference of means
• Significance, Power, Effect Size
• ANOVA
• Confidence Intervals
Statistics for Model Building and Evaluation
In this lesson, we’ll be concentrating on model building and evaluation, using the topics described.
Statistics in the Analytic Lifecycle
• Model Building and Planning
    Can I predict the outcome with the inputs that I have?
    Which inputs?
• Model Evaluation
    Is the model accurate?
    Does it perform better than "the obvious guess"?
    Does it perform better than another candidate model?
• Model Deployment
    Do my predictions make a difference?
    Are we preventing customer churn?
    Have we raised profits?
As data scientists, we use statistical techniques not only within our modeling algorithms but also during the early model building stages, when we evaluate our final models, and when we assess how our models improve the situation when deployed in the field. In this section we'll discuss techniques that help us answer questions such as those listed above. Visualization will help with the first question, at least as a first pass.
Hypothesis Testing
• Fundamental question: "Is there a difference between the populations, based on samples?"
    Examples: mean, variance
• Null hypothesis: there is no difference
• Alternate hypothesis: there is a difference
When we need to assess a difference, such as whether a model outperforms a benchmark or whether two populations of data differ, a common technique for judging whether the difference is significant is hypothesis testing.
The basic concept is to come up with ideas that can be proved or disproved with data. When performing these tests, the operating assumption is that there is no difference between two samples or populations. Statisticians refer to this as "the null hypothesis".
The “alternate hypothesis” is that there is a difference between two models, samples, or populations.
Null and Alternative Hypotheses: Examples
1. Null hypothesis: The best estimate of the outcome is the average observed value.
    • The mean is the "null model"
   Alternative hypothesis: The model predicts better than the null model.
    • The average prediction error from the model is smaller than that of the null model
2. Null hypothesis: This variable does not affect the outcome.
    • The coefficient value is zero
   Alternative hypothesis: The variable does affect the outcome.
    • The coefficient value is non-zero
3. Null hypothesis: The model predictions do not improve revenue.
    • Revenue is the same with or without intervention
   Alternative hypothesis: Interventions based on model predictions improve revenue.
    • A/B testing, ANOVA
Here are some examples of null and alternative hypotheses that we would be answering during the analytic lifecycle.
1. Once we have fit a model, does it predict better than always predicting the mean value of the training data? If we call the mean value of the training data "the null model", then the null hypothesis is that the average squared prediction error from the model is the same as the average squared prediction error from the null model. The alternative is that the model's squared prediction error is less than that of the null model. A variation of that is to determine whether your "new" model predicts better than some "old" model. In that case, your null model is the "old" model, and the null and alternative hypotheses are the same as described above.
2. When we are evaluating a model, we sometimes want to know whether or not a given input is actually contributing to the prediction. If we are doing a regression, for example, this is the same as asking if the regression coefficient for a variable is zero. The null hypothesis is that the coefficient is zero; the alternative is that the coefficient is non-zero.
3. Once we have settled on and deployed a model, we are now making decisions based on its predictions. For example, the model may help us make decisions that are supposed to improve revenue. We can test if the model is improving revenue by doing what are referred to as "A/B tests". Suppose the model tells us whether or not to make a customer a special offer. Over the next few days, every customer who comes to us is randomly put into the "A" group, or the "B" group. Customers in the A group get special offers (or not) depending on the output of the model. Customers in the B group get special offers "the old way": either they don't get them at all, or they get them by whatever algorithm we used before.
If the model and the intervention are successful, then group A should generate higher revenue than group B. If group A does not generate higher revenue than group B (that is, if we cannot reject the null hypothesis that A and B generate the same revenue), then we have to determine whether the problem is that the model makes incorrect predictions, or that our intervention is ineffective.
If we are testing more than one intervention at the same time (A, B, and C), then we can do an ANOVA analysis to see if there is a difference in revenue between the groups. We will talk about ANOVA in a bit.
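The first two hypotheses above can be sketched in R as follows. This is a minimal illustration on simulated data, not part of the original lesson: the variable names (x1, x2, y) and the simulated effect sizes are assumptions chosen for the example.

```r
# Simulated data (hypothetical): x1 truly affects y, x2 does not.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 + 2 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Hypothesis 1: does the model beat the null model (always predict the mean)?
mse_null  <- mean((y - mean(y))^2)    # squared error of the null model
mse_model <- mean(residuals(fit)^2)   # squared error of the fitted model
print(c(null = mse_null, model = mse_model))

# Hypothesis 2: is each coefficient different from zero?
# summary() reports a p-value for the null hypothesis "coefficient = 0".
coef_p <- summary(fit)$coefficients[, "Pr(>|t|)"]
print(coef_p)
```

The F-statistic printed by summary(fit) is the formal test of the fitted model against the intercept-only (null) model.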
Intuition: Difference of Means
[Figure: sampling distributions of the estimated means m1 and m2. If m1 = m2, the area of overlap is large.]
For examples 1 and 3 on the previous slide, we can think of verifying the null hypothesis as verifying whether the mean values of two different groups are the same. If they are not the same, then the alternative hypothesis is true: the introduction of new behavior did have an effect. Suppose both group1 and group2 are normally distributed, with the same standard deviation, sigma. We have n1 samples from group1 and n2 samples from group2. It happens to be true that the empirical estimates of the population means, m1 and m2, are also normally distributed, with standard deviations sigma/sqrt(n1) and sigma/sqrt(n2). In other words, the more samples we have, the better our estimate of the mean. If the means are really the same, then the distributions of m1 and m2 will overlap substantially.
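A quick simulation can make the shrinking spread of the estimated mean concrete. The sketch below assumes illustrative values (sigma = 2, n = 25, 5000 repetitions) that do not come from the lesson; it compares the observed spread of many sample means with sigma/sqrt(n).

```r
set.seed(42)
sigma <- 2    # assumed population standard deviation
n     <- 25   # assumed sample size

# Draw 5000 samples of size n and record the mean of each one.
means <- replicate(5000, mean(rnorm(n, mean = 0, sd = sigma)))

empirical_se   <- sd(means)          # spread of the estimated means
theoretical_se <- sigma / sqrt(n)    # sigma / sqrt(n)
print(c(empirical = empirical_se, theoretical = theoretical_se))
```

Increasing n in this sketch narrows the distribution of the estimated mean, which is why two estimated means that truly differ overlap less as more data is collected.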
Welch’s t-test
t-statistic (Welch's t-test): $t = \dfrac{m_1 - m_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$, where s1 and s2 are the observed standard deviations of the two samples
p-value: area under the tails of the appropriate Student's t distribution
If the p-value is small (say < 0.05), then reject the null hypothesis and assume that m1 <> m2: m1 and m2 are "significantly different".
> x = rnorm(10) # distribution centered at 0
> y = rnorm(10,2) # distribution centered at 2
> t.test(x,y)
Welch Two Sample t-test
data: x and y
t = -7.2643, df = 15.05, p-value = 2.713e-06
alternative hypothesis: true difference in means is not
equal to 0
95 percent confidence interval:
-2.364243 -1.291811
sample estimates:
mean of x mean of y
0.5449713 2.3729984
[Figure: Student's t distribution with -t, 0, and t marked on the horizontal axis; the p-value is the area under the tails.]
In practice, we don’t calculate the area directly. Instead we calculate the t-statistic, which is the difference in the observed means, divided by a quantity that is a function of the observed standard deviations, and the number of observations. If the null hypothesis is true (m1 = m2) then t should be "about zero". Specifically, t is distributed in a bell shaped curve around 0 called the Student's t distribution – the specific shape of the distribution is a function of the number of observations. For a very large number of observations, the Student's t distribution converges to the normal distribution.
How do we tell if the t-statistic that we observed is "about zero"? We calculate the probability of observing a t of that magnitude or larger under the null hypothesis. This probability is the area under the tails of the appropriate Student's t distribution.
If the alternative hypothesis is that m1 <> m2, then we look at the area under both tails. If the alternative hypothesis is that m1 > m2 (or m2 > m1), then we look at the area under one tail.
This area is called the "p-value". If p is small, then the probability of seeing a t at least as extreme as the one we observed is small under the null hypothesis, and we can go ahead and reject the null hypothesis in favor of the alternative.
[Note – Welch's t-test does not assume equal variance, and is a more robust variation of Student's t-test]
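To see what t.test() is doing, the Welch statistic and its two-sided p-value can be computed by hand. This is a sketch on freshly simulated data (the seed and therefore the numbers differ from the slide); the degrees of freedom use the Welch-Satterthwaite approximation, which is what R's t.test() applies when equal variances are not assumed.

```r
set.seed(7)
x <- rnorm(10)       # distribution centered at 0
y <- rnorm(10, 2)    # distribution centered at 2

# Squared standard errors of each sample mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t-statistic: difference in means over the combined standard error
t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value: area under both tails of the t distribution
p_value <- 2 * pt(-abs(t_stat), df)

res <- t.test(x, y)  # reference result from R
print(c(t = t_stat, df = df, p = p_value))
```

For a one-sided alternative, only one tail is used, which corresponds to passing alternative = "less" or alternative = "greater" to t.test().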
Wilcoxon Rank Sum Test
• t-test assumes that the populations are normally distributed
    Sometimes this is close to true, sometimes not
• Wilcoxon Rank Sum test
    Makes no assumption about the distributions of the populations
    More robust test for difference of means
    If the p-value is small: reject the null hypothesis (equal means)
> mean(x)
[1] 0.5449713
> mean(y)
[1] 2.372998
> wilcox.test(x, y)
Wilcoxon rank sum test
data: x and y
W = 2, p-value = 4.33e-05
alternative hypothesis: true location shift is not equal to 0
A t-test represents a parametric test. Student's t-test assumes that both populations are normally distributed with the same variance. Welch's t-test (the t.test() function in R is Welch's t-test by default) does not assume equal variance, but it does assume normality. Sometimes, this is approximately true (true enough to use a t-test), and sometimes, it isn't.
If we can't make the normality assumption, then we should use a nonparametric test. The Wilcoxon Rank Sum test will test for difference of means without making the normality assumption. Without getting into the details, Wilcoxon's test uses the fact that if two populations are centered in the same place, then if we merge the observations from each population, sort them, and rank them, then the observations of each population should "mix together". Specifically, if we sum the resulting ranks for each population, the sum should be "about the same".
Since Wilcoxon's test doesn't assume anything about the population distribution, it is less powerful than the t-test when it is applied to normally distributed data. Here, we show the results of wilcox.test() on the same (normally distributed) data from the previous slide. wilcox.test() does reject the null hypothesis, but the p-value is an order of magnitude larger than it is with the t-test. So if you know that you can assume the data is near normally distributed, then you should use the t-test.
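The rank-based nature of the test can be illustrated on skewed data. The sketch below uses log-normal samples (an assumption made for illustration; the sample sizes and parameters are not from the lesson) and shows that the Wilcoxon p-value is unchanged by a monotone transformation such as log(), because only the ranks matter, while the t-test result depends on the scale of the data.

```r
set.seed(3)
a <- rlnorm(30, meanlog = 0)   # skewed, non-normal sample
b <- rlnorm(30, meanlog = 1)   # same shape, shifted on the log scale

p_t <- t.test(a, b)$p.value          # assumes approximate normality
p_w <- wilcox.test(a, b)$p.value     # uses ranks only

# Taking logs changes the values but not their ordering,
# so the rank sum test gives the same answer.
p_w_log <- wilcox.test(log(a), log(b))$p.value
p_t_log <- t.test(log(a), log(b))$p.value

print(c(t = p_t, wilcoxon = p_w, t_on_log = p_t_log, wilcoxon_on_log = p_w_log))
```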
Hypothesis Testing: Summary
1. Calculate the test statistic
    Different hypothesis tests are appropriate in different situations
2. Calculate the p-value on the test statistic
3. If p-value is "small" then reject the null hypothesis
    "small" is often p < 0.05 by convention (95% confidence)
    Many data scientists prefer a smaller threshold
Every hypothesis test calculates a test statistic that is assumed to be distributed a certain way if the null hypothesis is true.
• Usually around 0 for difference, or around 1 for ratios
• Different hypothesis tests are appropriate in different situations: check the assumptions of the test, and whether they are valid (enough) for your situation.
The p-value is the probability of observing a value of the test statistic at least as extreme as the value that you saw, if the null hypothesis is true. The p-value depends on how the test statistic is assumed to be distributed.
If p-value is "small" then reject the null hypothesis
• "small" is often p < 0.05 by convention (95% confidence)
• Many data scientists prefer a smaller threshold, often 0.01, or 0.001
Of course, most statistical packages have functions that will do steps 1 and 2 automatically for you. Sometimes you have to find the appropriate distribution and do it by hand.
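As an illustration of doing steps 1 and 2 by hand for a ratio statistic (one that sits around 1 under the null hypothesis), the sketch below tests whether two samples have equal variance. The data is simulated and the choice of test is an assumption made for this example; the hand calculation is checked against R's var.test().

```r
set.seed(11)
x <- rnorm(20, sd = 1)
y <- rnorm(25, sd = 2)

# Step 1: the test statistic is the ratio of the sample variances.
f_stat <- var(x) / var(y)
df1 <- length(x) - 1
df2 <- length(y) - 1

# Step 2: the p-value comes from the F distribution with (df1, df2)
# degrees of freedom; for a two-sided test, double the smaller tail.
lower   <- pf(f_stat, df1, df2)
p_value <- 2 * min(lower, 1 - lower)

res <- var.test(x, y)   # the packaged version of the same steps
print(c(F = f_stat, p = p_value))
```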
Generating a Hypothesis: Type I and Type II Error
Example: Ham or Spam? H0: it's ham. HA: it's spam.

                     | It's really ham          | It's really spam
We say it's spam     | Type I: false positive   | OK: true positive
We say it's ham      | OK: true negative        | Type II: false negative

If H0 is X, and we ...                                    | Null hypothesis (H0) is true     | Null hypothesis (H0) is false
Reject the null hypothesis (we claim something happened)  | Type I error (false positive)    | Correct outcome (true positive)
Fail to reject the null hypothesis (we claim nothing happened) | Correct outcome (true negative) | Type II error (false negative)

• Goal: identify spam
• Which error is worse?
So, we have developed our null hypothesis and its alternate. Once we collect the data and begin our analysis, what kind of errors might we make? There are two kinds: type I errors and (oddly enough) type II errors, based on whether we wrongly reject the null hypothesis or wrongly fail to reject it.
A type I error is rejecting the null hypothesis when it is actually true. This is a "false positive": finding significance where none exists. A type II error is failing to reject the null hypothesis when it is actually false, thereby creating a "false negative". This means that we have failed to find significance when it does exist.
Let's use the example of spam filtering (spam refers to "unsolicited commercial email"). Here our H0 is that the email is legitimate (also known as "ham", that is, not spam); our alternate hypothesis is that it's not legitimate (it's "spam"). A false positive means that we treat legitimate email as spam; a false negative means that we treat spam messages as legitimate.
We could frame the following question: using this email filter, how often will we identify a valid email message (ham) as spam? We consider this to be a more serious error than labeling a spam email as valid, since spam messages can be filtered from the user's mailbox, whereas a message incorrectly labeled as spam may contain information critical to the recipient.
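The two error types can be counted directly from a table of predictions against the truth. The labels below are a small made-up set of eight messages, purely for illustration; they are not data from the lesson.

```r
# Hypothetical ground truth and filter output for eight messages
truth     <- c("ham", "ham",  "spam", "ham", "spam", "spam", "ham", "ham")
predicted <- c("ham", "spam", "spam", "ham", "ham",  "spam", "ham", "ham")

# Rows: what the filter said; columns: what the message really was
confusion <- table(predicted = predicted, truth = truth)
print(confusion)

# H0 is "it's ham", so calling a ham message spam is a Type I error,
# and calling a spam message ham is a Type II error.
type1 <- confusion["spam", "ham"]   # false positives
type2 <- confusion["ham", "spam"]   # false negatives
print(c(type_I = type1, type_II = type2))
```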
Significance, Power and Effect Size
• Significance: the probability of a false positive (α)
    The p-value of a result is compared against this level
• Power: the probability of a true positive (1 - β)
• Effect size: the size of the observed difference
    The actual difference in means, for example
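The significance level of a test is the probability of a false positive: rejecting the null hypothesis when it is actually true. The p-value of a result is compared against this level; it is the probability of seeing a result at least as extreme as the observed one if the null hypothesis is true.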
The threshold of p-values that you will accept depends on how much you are willing to tolerate a false positive. So a p-value threshold of 0.05 means that you are willing to have a false positive 5% of the time.
The power of a result is the probability of a true positive – correctly accepting the alternative hypothesis. The desired power is usually used to decide how big a sample to use.
Effect size is the actual magnitude of the result: the actual difference between the means, for example.
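Both ideas can be checked numerically. The sketch below first simulates many t-tests in which the null hypothesis is true, to show that a 0.05 threshold produces false positives about 5% of the time, and then uses power.t.test() from R's stats package to find the sample size needed for a desired power. The effect size (0.5 standard deviations), the power (0.8), and the simulation sizes are assumptions chosen for illustration.

```r
set.seed(5)
alpha <- 0.05

# Both samples come from the same distribution, so every rejection
# is a false positive (Type I error).
pvals <- replicate(2000, t.test(rnorm(20), rnorm(20))$p.value)
false_positive_rate <- mean(pvals < alpha)
print(false_positive_rate)

# Sample size per group needed to detect a difference of 0.5 standard
# deviations with 80% power at the 0.05 significance level.
pw <- power.t.test(delta = 0.5, sd = 1, sig.level = alpha, power = 0.8)
n_per_group <- ceiling(pw$n)
print(n_per_group)
```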
Always Keep Effect Size in Mind!
Both power and significance increase with larger sample sizes.
So you can observe an effect size that is statistically significant, but practically insignificant!
[Figure: two pairs of distributions separated by the same effect size, one for a moderate sample size and one for a larger sample size.]
For a fixed effect size (delta in the above diagrams), both power and significance increase with larger sample sizes. This is because, for a difference in means (assuming normal distributions), the estimate of the mean gets tighter as the sample size increases.
So even if the difference between the means stays the same, the normal distributions around each mean overlap less, and the t-statistic gets larger, which pushes it further out on the tail of the t-distribution.
Since there is no limit on how tight the normal distribution can get, you can make any effect size appear statistically significant, even if, for all practical purposes, the difference is "insignificant" (in English terms). So always take into consideration whether or not the effect size you observe truly means "a difference" in your domain.
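A short simulation shows the effect. The true difference in means below (0.02 standard deviations) and the two sample sizes are assumptions picked for illustration: the difference is tiny in practical terms, yet with a million observations per group the test reports it as highly significant, while a sample of 50 per group will usually not detect it.

```r
set.seed(9)
delta <- 0.02   # assumed true difference in means (very small)

# Same effect size, two very different sample sizes
small <- t.test(rnorm(50),  rnorm(50,  mean = delta))
large <- t.test(rnorm(1e6), rnorm(1e6, mean = delta))

# Estimated difference (mean of y minus mean of x) for the large sample
est_diff <- unname(large$estimate[2] - large$estimate[1])

print(c(p_small = small$p.value, p_large = large$p.value))
print(est_diff)
```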
Hypothesis Testing: ANOVA
ANOVA is a generalization of the difference of means
• One-way ANOVA
    k populations ("treatment groups")
    ni samples each, total N subjects
    Null hypothesis: ALL the population means are equal

Population        ni: # offers made    mi: avg purchase size
Offer 1           100                  $55
Offer 2           102                  $50
No intervention   99                   $25
ANOVA (Analysis of Variance) is a generalization of the difference of means. Here we have multiple populations, and we want to see if any of the population means are different from the others. That means that the null hypothesis is that ALL the population means are equal.
An example: suppose everyone who visits our retail website either gets one of two promotional offers, or no promotion at all. We want to see if making the promotional offers makes a difference. (The null hypothesis is that neither promotion makes a difference. If we want to check if offer 1 is better than offer 2, that's a different question).
We can do multi-way ANOVA as well. For instance, if we want to analyze offers and day of week simultaneously, that would be a two-way ANOVA. Multi-way ANOVA is usually done by doing a linear regression on the outcome, using each of the (categorical) treatments as an input variable. Here, we will only talk about one-way ANOVA.
ANOVA: Understanding the F statistic
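Test statistic: $F = s_B^2 / s_W^2$, where
$s_B^2 = \frac{1}{k-1} \sum_{i=1}^{k} n_i (m_i - m)^2$ and $s_W^2 = \frac{1}{N-k} \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - m_i)^2$
(m is the grand mean, m_i is the mean of the ith population, and x_ij is the jth observation in the ith population)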
The first thing to calculate is the test statistic. Here we sketch the intuition behind the test statistic for ANOVA. Essentially, we want to test whether or not the clusters formed by each population are more tightly grouped than the spread across all of the populations.
The between-groups mean sum of squares, $s_B^2$, is an estimate of the between-groups variance. It is a measure of how the population means vary with respect to the grand mean: the "spread across all of the populations".
The within-group mean sum of squares, $s_W^2$, is an estimate of the within-group variance. It is a measure of the "average population variance": the average "spread" of each cluster.
If the null hypothesis is true, then $s_B^2$ should be about equal to $s_W^2$. That is, the populations are about as wide as they are far apart, so they overlap. Their ratio, the test statistic F, will then be distributed as the F distribution with k-1, N-k degrees of freedom, which is right skewed and has its mode near 1. In the equations above, k is the number of populations, ni is the number of samples in the ith population, and N is the total number of samples.
If we observe that F < 1, then the population clusters are wider than the between-group spread, so we cannot reject the null hypothesis (no differences). Otherwise, we only need to consider the area under the right tail of the F distribution.
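The F statistic can be computed by hand from these two mean sums of squares and checked against R. The groups, group sizes, and means below are simulated assumptions for illustration only.

```r
set.seed(21)
# Three hypothetical treatment groups of unequal size
group <- factor(rep(c("a", "b", "c"), times = c(30, 35, 40)))
value <- rnorm(length(group),
               mean = c(a = 10, b = 10, c = 12)[as.character(group)])

k     <- nlevels(group)               # number of populations
N     <- length(value)                # total number of samples
n_i   <- tapply(value, group, length) # samples per population
m_i   <- tapply(value, group, mean)   # population means
grand <- mean(value)                  # grand mean

# Between-groups and within-group mean sums of squares
s2_between <- sum(n_i * (m_i - grand)^2) / (k - 1)
s2_within  <- sum((value - m_i[as.character(group)])^2) / (N - k)

f_stat  <- s2_between / s2_within
# Only the right tail of the F distribution is considered
p_value <- pf(f_stat, k - 1, N - k, lower.tail = FALSE)

ref <- anova(lm(value ~ group))       # reference calculation
print(c(F = f_stat, p = p_value))
```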
R Example: ANOVA
• 3 different offers, and their outcomes
• Use lm() to do the ANOVA
• Coefficients: offer1 - nooffer and offer2 - nooffer (differences from the no-offer mean)
• F-statistic: reject the null hypothesis
• No appreciable difference between offer1 and offer2
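> # Randomly assign each of 500 customers to one of three groups
> offers = sample(c("noffer", "offer1", "offer2"), size=500, replace=TRUE)
> # Log-normally distributed purchase sizes, with a median that depends on the offer
> purchasesize = ifelse(offers=="noffer", rlnorm(500, meanlog=log(25)),
+                ifelse(offers=="offer1", rlnorm(500, meanlog=log(50)),
+                       rlnorm(500, meanlog=log(55))))
> offertest = data.frame(offer=as.factor(offers), purchase_amt=purchasesize)
> # Note: the formula uses the vector offers from the workspace, not the column offer
> model = lm(log10(purchase_amt) ~ as.factor(offers), data=offertest)
> summary(model)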
Residuals:
Min 1Q Median 3Q Max
-1.1940 -0.2837 0.0135 0.2863 1.3374
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.49092 0.03240 46.011 < 2e-16 ***
as.factor(offers)offer1 0.20424 0.04706 4.340 1.73e-05 ***
as.factor(offers)offer2 0.22371 0.04596 4.867 1.52e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4262 on 497 degrees of freedom
Multiple R-squared: 0.05479, Adjusted R-squared: 0.05098
F-statistic: 14.4 on 2 and 497 DF, p-value: 8.304e-07
> TukeyHSD(aov(model))
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = model)
$offers
diff lwr upr p adj
offer1-noffer 0.20424099 0.09361976 0.3148622 0.0000512
offer2-noffer 0.22370761 0.11566775 0.3317475 0.0000045
offer2-offer1 0.01946663 -0.09146092 0.1303942 0.9104871
Tukey's test: all pair-wise tests for difference of means
95% confidence intervals for difference between means
Here is an example of how to do one-way ANOVA in R. We have a data frame with the outcomes under the three different offer scenarios you saw previously. We can use the linear regression function lm() to do the ANOVA calculations for us.
The F-statistic on the linear regression model tells us that we can reject the null hypothesis: at least one of the populations is different from the others. Since we used lm() to do the ANOVA, we have additional information. The intercept of the model is the mean outcome (here, the mean of log10(purchase_amt)) for nooffer, which is labeled noffer in the code. The coefficients for offer1 and offer2 are the difference of means of offer1 and offer2 respectively, from nooffer. The lm() function does a Wald test on each of the coefficients for the null hypothesis that the coefficient value is really zero. We can see from the p-values that the null hypothesis was rejected for both coefficients, with highly significant p-values. So, we can assume that both offer1 and offer2 are significantly different from nooffer.
However – we don't know whether or not offer1 is different from offer2. That requires additional tests. Tukey's test does all pair-wise tests for difference of means. We can see the 95% confidence interval for the difference of each pair of means, and the p-value for the test on the difference. A p-value of 0.9104871 for offer1 and offer2 suggests that we really can’t tell the difference between them.
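The same analysis can also be written with aov() and the data frame column directly. This is a sketch, not the slide's code: it uses a different seed (so the numbers will not match the output above), spells the baseline level "nooffer", and draws purchase sizes with a vectorized meanlog instead of nested ifelse() calls.

```r
set.seed(100)
# Assign each of 500 customers to one of three groups
offer <- factor(sample(c("nooffer", "offer1", "offer2"),
                       size = 500, replace = TRUE))

# Median purchase size depends on the offer (values as in the slide's code)
meanlog <- log(c(nooffer = 25, offer1 = 50, offer2 = 55))[as.character(offer)]
offertest <- data.frame(offer = offer,
                        purchase_amt = rlnorm(500, meanlog = meanlog))

# One-way ANOVA on the log of the purchase size
fit <- aov(log10(purchase_amt) ~ offer, data = offertest)
print(summary(fit))

# All pair-wise comparisons, with family-wise 95% confidence intervals
tukey <- TukeyHSD(fit)
print(tukey)
```

Because the model term is now the column offer, the Tukey results are stored under tukey$offer, with one row per pair of groups.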
A p-value that is only just below the threshold (for example, p = 0.049) illustrates the distinction between statistical and practical significance: with more data, the difference gets more statistically significant, but the effect size is still fairly small. Is the effect practically significant?
___________
More references (2-way anova, etc): Practical Regression and ANOVA using R, Julian Faraway (you can get a .pdf file of an old edition of the book online from <http://cran.r-project.org/>)
Confidence Intervals
If x is your estimate of some unknown value μ, the P% confidence interval is the interval around x that μ will fall in, with probability P.
Example:
• Normal data N(μ, σ)
• x is the estimate of μ, based on n samples
• μ falls in the interval x ± 2σ/√n with approx. 95% probability ("95% confidence")
The confidence interval for an estimate x of an unknown value mu is the interval that should contain the true value mu, to a desired probability. For example, if you are estimating the mean value (mu) of a normal distribution with std. dev sigma, and your estimate after n samples is x, then mu falls within x +/- 2*sigma/sqrt(n) with about 95% probability.
Of course, you probably don't know sigma, but you do know the empirical standard deviation of your n samples, s. So you would estimate the 95% confidence interval as x +/- 2*s/sqrt(n).
In practice, most people estimate the 95% confidence interval as the mean plus/minus twice the standard error (s/sqrt(n)). This is really only true if the data is normally distributed, but it is a helpful rule of thumb.
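The rule of thumb can be compared with the interval that R reports. The sketch below uses simulated data (sample size, mean, and standard deviation are assumptions for illustration) and computes the interval three ways: with the factor of 2, with the exact quantile of the t distribution, and with t.test().

```r
set.seed(8)
x <- rnorm(100, mean = 5, sd = 2)   # simulated sample

n    <- length(x)
xbar <- mean(x)
s    <- sd(x)                       # empirical standard deviation

# Rule of thumb: estimate plus/minus twice the standard error
rule_of_thumb <- xbar + c(-2, 2) * s / sqrt(n)

# Same interval using the exact 97.5% quantile of the t distribution
exact <- xbar + qt(c(0.025, 0.975), df = n - 1) * s / sqrt(n)

# The interval reported by t.test() for a single sample
ref <- t.test(x)$conf.int

print(rbind(rule_of_thumb, exact))
```

For 100 samples the exact multiplier is slightly below 2, so the rule-of-thumb interval is a little wider than the exact one.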
Example
The defect rate of a disk drive manufacturing process is within 0.9% - 1.7%, with 98% confidence. We inspect a sample of 1000 drives from one of our plants.
• We observe 13 defects in our sample.
• Should we inspect the plant for problems?
• What if we observe 25 defects in the sample?
Suppose we know that a properly functioning disk drive manufacturing process will produce between 9 and 17 defective disk drives per 1000 disk drives manufactured, 98% of the time. On one of our regularly scheduled inspections of a plant, we inspect 1000 randomly selected drives. If we find 13 defective drives, we can't reject the assumption that the plant is functioning properly, because 13 defects is "in bounds" for our process.
What if we find 25 defects? We know that this would happen less than 2% of the time in a properly functioning plant, so we should accept the alternate hypothesis that the plant is not functioning properly, and inspect it for problems.
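The decision rule in this example can be sketched as a simple bounds check. The second part goes beyond the lesson and is an assumption: if defects are treated as independent, a one-sided binomial test against the upper end of the range (a 1.7% defect rate) gives a rough sense of how surprising each count would be. The helper name in_bounds is hypothetical.

```r
# In-bounds defect counts per 1000 drives, from the stated 0.9% - 1.7% range
lower <- 9
upper <- 17

# Hypothetical helper: is an observed count consistent with a healthy plant?
in_bounds <- function(defects) defects >= lower & defects <= upper
print(in_bounds(c(13, 25)))

# Assumption: independent defects, tested against the upper bound rate.
# A small p-value suggests the true defect rate exceeds 1.7%.
p13 <- binom.test(13, 1000, p = 0.017, alternative = "greater")$p.value
p25 <- binom.test(25, 1000, p = 0.017, alternative = "greater")$p.value
print(c(p13 = p13, p25 = p25))
```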
Check Your Knowledge
• Refer back to the ANOVA example on an earlier slide. What do you think? Does the difference between offer1 and offer2 make a practical difference? Should we go ahead and implement one of them?
• If yes, and the costs were US $25 for each offer1 and US $10 for offer2, would you still make the same decision?
• In our manufacturing plant example, assuming you would check the plant for problems in the manufacturing process, how might you justify this decision financially?
Review of Basic Data Analytic Methods Using R
Summary
During this lesson the following topics were covered:
• The role of Statistics in the Analytic Lifecycle
• Developing a model and generating the null and the alternative hypothesis
• Difference between means
• Difference between significance, power and effect size, and how they relate to Type I and Type II errors
• Applying ANOVA and determining whether the results are significant
• Defining confidence intervals and applying them
These are the key points covered in this lesson.