statistics with R programming

profilejamesfiona1993
Week12outline.docx

Week 12 Outline (11-18-20):

Assignment 7 (MLR) coming soon.

Don’t forget about group presentation. I haven’t been approached by everyone yet about a topic.

Exam 1 is not graded.

Exam 2 is Thurs Dec 3 at 5:00, to Sat Dec 5 at 5:00

· Chapters 6, 7, 8 and 9 (SLR only, not MLR), so only assignments 5 and 6 are relevant

Questions

Cover Aho Chapter 10

· We finished two independent quantitative populations (both HT and CI), but what if we have more than two independent populations (and the response variable is still quantitative)?

· As always in our class, everything we do assumes every sample was taken randomly from its corresponding population

· The ideas are very similar, meaning we have one quantitative response variable, and one categorical explanatory variable, but it has 3 or more categories, rather than just 2 as before

· Now (for k populations) and says at least one population mean is different from the other

One-way ANOVA test (Analysis of variance) | Customer satisfaction example

· Example: k=3 independent populations

·

· at least one population mean is different

· can happen in many ways

· and and

· Only 3 is different (1 and 2 are same)

· and and

· Only 1 is different (2 and 3 are same)

· and and

· Only 2 is different (1 and 3 are same)

· and and

· All are different

· The procedure to test these hypotheses is called (one way) Analysis of Variance or ANOVA (as opposed to two independent sample t test)

· “one way” means there is one explanatory variable, and it is categorical

· The test statistic called F is based on how far apart , ,…., and are from each other

· , ,…., and are the sample means from the corresponding groups to

· F is always positive, and the further apart , ,…., and are from each other, the further F is from 0

· Calculating F for simple example

· Group 1: 10, 11, 15  ( = 12)

· Group 2: 6, 7, 7, 10, 10 ( = 8)

· Group 3: 8, 9, 12, 15 ( = 11)

· Sample grand mean (ignoring group) is = 10

· General notation from book:

· a is number of groups

· is sample size of group

· is the raw value of y from the group

· In our example, a=3, =3, =5 and =4

· Total sum of squares (ignoring group) is

SSTO =

· SSTO is just the numerator of the sample variance

· In our example, SSTO = (10-10)2+(11-10)2+(15-10)2+

(6-10)2+(7-10)2+(7-10)2+(10-10)2+(10-10)2+

(8-10)2+(9-10)2+(12-10)2+(15-10)2 = 94

· We want to partition SSTO into two parts, SSTR (sum of squares treatment) and SSE (sum of squares error), so that SSTO = SSTR + SSE

· SSTR can be further broken down into different components, but we have only one explanatory variable here, commonly called factor A, so for one way ANOVA, SSTR is equivalent to SSA, which is how much variability there is in the sample means for Factor A

· This is how we build the test statistic F, by measuring how far apart the sample means are, based on the benchmark value of the grand mean

· = 3(12-10)2 + 5(8-10)2 + 4(11-10)2 = 36

· SSE is still the sum of squared residuals (or errors), but measured from each individual group mean

· SSE = = (10-12)2+(11-12)2+(15-12)2+

(6-8)2+(7-8)2+(7-8)2+(10-8)2+(10-8)2+

(8-11)2+(9-11)2+(12-11)2+(15-11)2 = 58

· Note that 94 = 36 + 58

· The point is that the more spread out the , ,…., and are from each other, the more of SSTO will go to SSA, and less to SSE

· How to convert this to a test statistic? First we have to take into account how many groups there are through degrees of freedom (df)

· df for SSA will be a-1

· df for SSE will be n-a (n is total sample size, here n=12)

· df for SSTO is n-1

· each SS gets divided by its df to form mean squares (MS)

· then the test statistic F is the mean square for factor A divided by the mean square for error

· all of this gets summarized into a one way ANOVA table

Variation source

df

SS

MS

F

Factor A (among or between groups)

a-1 = 3-1 = 2

SSA = 36

MSA = SSA /(a-1) =

36/2 = 18

F = MSA/MSE =

18/6.444 = 2.79

Error (within groups)

n-a = 12-3 = 9

SSE = 58

MSE = SSE/(n-a) =

58/9 = 6.444

Total

n-1 = 12-1 = 11

SSTO = 94

· If is true, and if assumptions are reasonably met, then this test statistic F will follow an F distribution with numerator degrees of freedom a-1 and denominator degrees of freedom n-a

· F distribution is right skewed (because F statistic cannot be negative), and the p-value is always in the right tail (because values near 0 are not evidence against

· With calculator, Fcdf(lower, upper, num df, denom df) = Fcdf(2.79,1000000,2,9) = 0.11

· With R, pf(OTS,num df, denom df,lower.tail=F)

> pf(2.79,2,9,lower.tail=F)

[1] 0.1140729

· Using lm function to get whole ANOVA table

> group<-c(1,1,1,2,2,2,2,2,3,3,3,3)
> response<-c(10,11,15,6,7,7,10,10,8,9,12,15)
> example<-data.frame(group,response)
> example
   group response
1      1       10
2      1       11
3      1       15
4      2        6
5      2        7
6      2        7
7      2       10
8      2       10
9      3        8
10     3        9
11     3       12
12     3       15

> lm.example<-lm(response~factor(group),data=example)
> anova(lm.example)
Analysis of Variance Table
Response: response
              Df Sum Sq Mean Sq F value Pr(>F)
factor(group)  2     36 18.0000  2.7931 0.1139
Residuals      9     58  6.4444 

· Assumptions

· As before, we need all sample sizes to be reasonably large, or all populations to be normal

· If sample sizes are small, shape of samples can be used as an estimate of the shape of the populations (so you can make a side by side boxplot)

· We have an additional assumption that the population standard deviations are all the same ()

· If the sample standard deviations , , … are reasonably close to each other, then this is met

· Rule of thumb: compare largest sample sd to smallest sample sd. If the larger sample sd is not more than twice the smaller sample sd, then this is reasonably met

· Some researcher’s use “Levene’s Test of equal variances” to test this, where the null hypothesis is , so that you hope to not reject this particular null hypothesis

· Example with 8 steps: Using the cars data set, determine if there is evidence in the samples that the locations (1=US, 2=Germany, 3=Japan) are different in the populations, on average, in terms of mpg.

> head(cars)
  mpg cylinders displacement horsep weight acceleration year location                     model
1  18         8          307    130   3504         12.0   70        1 chevrolet chevelle malibu
2  15         8          350    165   3693         11.5   70        1         buick skylark 320
3  18         8          318    150   3436         11.0   70        1        plymouth satellite
4  16         8          304    150   3433         12.0   70        1             amc rebel sst
5  17         8          302    140   3449         10.5   70        1               ford torino
6  15         8          429    198   4341         10.0   70        1          ford galaxie 500

Step 0: From each individual (car), we observe mpg (quantitative response variable) and location (categorical explanatory variable). The goal is to compare population mean mpg across locations, so the type of problem is three or more independent quantitative populations.

Step 1:

(1) we need sample sizes to be reasonably large, or populations to be normal

> table(cars$location)
  1   2   3 
249  70  79 

> ggplot(data=cars,aes(x=factor(location),y=mpg))+geom_boxplot()

(2) we need the variability in the populations to be same (note that in the boxplots above, we don’t have extreme skewness or extreme outliers, so using standard deviation makes sense)

> cars %>% group_by(location) %>% summarise(sdmpg=sd(mpg))
# A tibble: 3 x 2
  location sdmpg
     <int> <dbl>
1        1  6.40
2        2  6.72
3        3  6.09

> cars %>% group_by(location) %>% summarise(iqrmpg=IQR(mpg))
# A tibble: 3 x 2
  location iqrmpg
     <int>  <dbl>
1        1   9   
2        2   6.65
3        3   8.35

Step 2: = population mean mpg for location 1

= population mean mpg for location 2

= population mean mpg for location 3

Step 3:

at least one population mean is different

=0.05

Step 4: F = 98.542

> lm.cars<-lm(mpg~factor(location),data=cars)
> anova(lm.cars)
Analysis of Variance Table
Response: mpg
                  Df  Sum Sq Mean Sq F value    Pr(>F)    
factor(location)   2  8072.8  4036.4  98.542 < 2.2e-16 ***
Residuals        395 16179.8    41.0                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Step 5: p-value (from R output above) < 2.2e-16

Step 6: p-value < 0.05, so reject

Step 7: There is sufficient evidence that there is at least one population mean mpg that is different.

· How to write one way anova in model form:

· This will be helpful when we get to two way anova

· is the overall population mean (ignoring group)

· is the true (population) effect of the level of factor A

· has nothing to do with the level of significance

· Put another way, plays the role of the mean of the population,

· Conceptually, is how far the population mean for group is from the grand (overall) population mean is, so

· In our original example (using sample versions of these quantities)

· 12 -10 = 2

· 8 -10 = -2

· 11 -10 = 1

· is the residual for

· Then we can write the hypotheses in terms of the ’s

at least one

· Analysis is still done the same way as we did

· Multiple testing (also known as post hoc testing, follow up testing, multiple comparisons)

· If we reject and conclude that is true, meaning that at least one population mean is different from the others, that doesn’t tell us which population means are different than the others, which is typically not a sufficient conclusion by itself

· This means we need to compare each population mean to all the others to determine which pairs are different, meaning we want to either test for all and , or, equivalently, calculate a confidence interval for , looking to see if 0 is inside the CI

· Note that there are C(k,2) number of comparisons

· C(3,2) = 3 comparisons

· C(4,2) = 6 comparisons

· C(5,2) = 10 comparisons

· C(6,2) = 15 comparisons

· These many pairwise comparisons cause a problem called multiple testing

· Remember what a Type 1 error was: rejecting when is true

· Level of significance = P(type 1 error)

· For example, for each pairwise comparison , if we were to set = 0.05, that means the comparison-wise type 1 error rate would be 5%, because 0.05 applies to each individual pairwise comparison

· However, we also have to worry about the family-wise (or experiment-wise) type 1 error rate

· Family-wise type 1 error rate is the probability of making at least one type 1 error across all the pairwise comparisons.

· Family-wise type 1 error rate will always be at least as big as the pairwise comparison type 1 error rate, and often much bigger,

· Think about an analogy of flipping a fair coin three times, and focusing on getting heads. For each flip, the probability of heads is 0.5. What is the probability of having at least one head in three flips?

· We have the same problem with multiple testing, but we won’t know what the family-wise error rate will be in a given problem, other than it will be bigger than 0.05, and in worst case scenario (no overlap at all), family-wise type 1 error rate could be as high as C(k,2)*(0.05)

· Statisticians have invented many procedures for how to tackle this problem, but the common theme is that all of the population means will be compared to each other, and pairwise significant differences will be determined

· All methods have advantages and disadvantages, and our focus will be on the three listed here

· These are sometimes shown as confidence intervals for the difference in two population means at a time (so we care if 0 is in each CI or not)

· Bonferroni

· Divide alpha used by pairwise comparisons by the number of comparisons to make, so that worst case is accounted for

· Most conservative, meaning Bonferroni correction will keep the family-wise type 1 error rate from going above 0.05, but in most cases it will be somewhat less than 0.05, so it will be harder to detect differences

· Popular because it is easily adaptable to a wide variety of contexts

# the function PostHocTest() requires you to install and load the DescTools package

> PostHocTest(aov(mpg~factor(location),data=cars),method="bonferroni")
  Posthoc multiple comparisons of means : Bonferroni 
    95% family-wise confidence level
$`factor(location)`
         diff     lwr.ci    upr.ci   pval    
2-1  7.807894 5.72624158  9.889547 <2e-16 ***
3-1 10.367099 8.38015599 12.354042 <2e-16 ***
3-2  2.559204 0.03344349  5.084965 0.0459 *  

· Tukey’s honest significant difference (hsd)

· This method is less conservative than Bonferroni, while still controlling the family-wise type 1 error rate

· One of the better choices when comparing more than a few groups

> PostHocTest(aov(mpg~factor(location),data=cars),method="hsd")
  Posthoc multiple comparisons of means : Tukey HSD 
    95% family-wise confidence level
$`factor(location)`
         diff     lwr.ci    upr.ci   pval    
2-1  7.807894 5.77095169  9.844837 <2e-16 ***
3-1 10.367099 8.42283190 12.311366 <2e-16 ***
3-2  2.559204 0.08769223  5.030716 0.0404 *  

· Dunnett’s test

· If the goal is just to compare all groups to a control group, then Dunnett’s test is the one to use

· You have to declare which group is the control group

# the DunnettTest() function requires that you install and load the DescTools package

# note that unlike PostHocTest(), you should not use the aov() function inside DunnettTest()

> DunnettTest(mpg~factor(location),data=cars,control="1")
  Dunnett's test for comparing several treatments with a control :  
    95% family-wise confidence level
$`1`
         diff   lwr.ci    upr.ci    pval    
2-1  7.807894 5.867970  9.747819 8.9e-16 ***
3-1 10.367099 8.515436 12.218761 < 2e-16 ***

· What if assumptions are not met for one way ANOVA?

· If we have small sample sizes and we don’t believe the populations are normal, then we can perform a different HT procedure called the Kruskal-Wallis (KW) test (also known as Mann-Whitney U test)

· As with the rank sum test, the hypotheses are about the population medians rather than means

· As with the rank sum test, the values are ranked as if they are all in one sample, and then the test statistic determines if at least one sample has a significantly different mean of its ranks

· As with the rank sum test, the Kruskal-Wallis test assumes that all of the populations have the same shape (symmetric, right skewed, left skewed) and same variability (typically measured by IQR)

· Shape of samples can be used as an estimate of the shape of the populations

· As with rank sum test, as long as the shapes of the samples are not drastically different, then the KW test does fine

· As with rank sum test, using ranks greatly reduces the affect of outliers, so we’re primarily interested in detecting when there are vastly different shapes, meaning one group has left skewness and another has right skewness

· If the sample sd’s or IQR’s are not close to each other, then transformations of the samples are often performed, like taking square root of the response variable, which often makes the variability more similar across samples

· For our class, if you encounter this issue, just point it out, but don’t worry about trying to remedy the situation

Step 0: From each individual (car), we observe mpg (quantitative response variable) and location (categorical explanatory variable). The goal is to compare population mean mpg across locations, so the type of problem is three or more independent quantitative populations.

Step 1: (1) we need sample sizes to be reasonably large, or populations to be normal

> table(cars$location)
  1   2   3 
249  70  79 

> ggplot(data=cars,aes(x=factor(location),y=mpg))+geom_boxplot()

For the purposes of this example, we’ll say that we don’t believe that the sample sizes are large enough, and we don’t believe that the shapes are normal enough (perhaps the outliers are disconcerting), so we should consider Kruskal-Wallis test instead of one way ANOVA. That requires we have the same shape and same variabilities in the populations. None of the groups has more than mild skewness, so the same shape requirement is not too concerning.

(2) we need the variability in the populations to be same, and if the outliers are an issue, we’d use IQR to compare

> cars %>% group_by(location) %>% summarise(sdmpg=sd(mpg))
# A tibble: 3 x 2
  location sdmpg
     <int> <dbl>
1        1  6.40
2        2  6.72
3        3  6.09

> cars %>% group_by(location) %>% summarise(iqrmpg=IQR(mpg))
# A tibble: 3 x 2
  location iqrmpg
     <int>  <dbl>
1        1   9   
2        2   6.65
3        3   8.35

Step 2: = population median mpg for location 1

= population median mpg for location 2

= population median mpg for location 3

Step 3:

at least one population median is different

=0.05

Step 4: chi-square = 134.46

> kruskal.test(mpg~factor(location),data=cars)
	Kruskal-Wallis rank sum test
data:  mpg by factor(location)
Kruskal-Wallis chi-squared = 134.46, df = 2, p-value < 2.2e-16

Step 5: p-value (from R output above) < 2.2e-16

Step 6: p-value < 0.05, so reject

Step 7: There is sufficient evidence that there is at least one population median mpg that is different.

· Because we are finding at least one population mean to be different, there would be follow up multiple testing as with ANOVA, and you interpret it the same way (though it would probably be based on the rank sum test and Bonferroni correction)

# note that this works differently than the pairwise comparisons we did with Bonferroni previously

# this function, rather than divide alpha for each pairwise comparison by the number of comparisons which is what must be done for the PostHocTest() function, now multiplies the p-value by the number of comparisons, so you should use 0.05 here, instead of 0.05/C(k,2) as earlier

> pairwise.wilcox.test(cars$mpg,cars$location,p.adjust.method="bonferroni")
	Pairwise comparisons using Wilcoxon rank sum test 
data:  cars$mpg and cars$location 
  1       2    
2 1.1e-14 -    
3 < 2e-16 0.015
P value adjustment method: bonferroni 

· Two way ANOVA (analysis of variance)

· We still require samples to be taken randomly

· We just finished one way ANOVA

· response variable is quantitative

· there is one explanatory variable, and it is categorical, which describes which of the independent populations each subject came from

· goal is to compare population means

· Sometimes when researchers design experiments, where they randomly assign subjects to treatment groups, there can be two explanatory variables, both categorical

· There may be the primary explanatory variable like treatment group, and also a secondary explanatory variable that researchers want to control for, such as gender, race, smoking status, presence of other condition, etc.

· Controlling for the secondary explanatory variable helps account for more unexplained variation in the response variable, which makes it easier to detect differences in the response variable across categories of the primary explanatory variable

· Or it could just be that there are exactly two explanatory variables of interest, and they are both categorical

· Goal is still to compare population means, but we have to be careful about how we do that for two categorical explanatory variables

· We can first do an overall F test to see if x1 and x2 generally explain anything about the mean level of y

· Assuming the initial test finds some contribution by x1 and x2 , then the first question to always answer in this setting is if there is an interaction between the two categorical explanatory variables

· Remember that interaction between x1 and x2 means the relationship between y and x1 depends on the level (or category) of x2

· Terminology: we usually refer to the separate (ignoring the other x) effects of x1 and x2 on y as the “main effects”, as opposed to the “interaction effect”

· Interaction plot

· A very simple way to explore if there is an interaction in the sample is to plot the sample means of y for each combination of x1 and x2

· Example: treatment (yes or no) and gender are the two explanatory variables (the values in the table are the sample means of the response variable for all individuals in that combination of dose and gender)

· So there may have been 10 women who received the treatment, and their mean of y was 9.9, and same for the other 3 combinations

Yes

No

Female

9.9

15.1

Male

15.4

10.2

Understanding 2-way Interactions - StatLab Articles

· The relationship between the response variable and treatment depends on which gender the person is, so that says there is an interaction between treatment and gender in the sample

· If the genders showed relatively “parallel” graphs, that would indicate no significant interaction between treatment and gender in the sample

· We always have to worry about if there is significant interaction first, because we’d like to compare means of y across treatment categories (and separately across genders), but that is not meaningful if there is a significant interaction. Why?

· Suppose there were in fact 10 subjects in each of the 4 combinations of dose and gender (there won’t always be the same number in general)

· Because there are the same number in each combination, we can find the mean levels of y for each treatment from the individual combination means (again, this is only possible because there are the same number of subjects in each combination):

· Treatment=yes: (9.9+15.4)/2 = 12.65

· Treatment=no: (15.1+10.2)/2 = 12.65

· So if we first just looked at the mean of y for the two treatment groups, we would see no difference at all, which completely hides the fact that the treatment does have an effect on the response variable

· Same for gender:

· Female: (9.9+15.1)/2 = 12.5

· Male: (15.4+10.2)/2 = 12.8

· These are not exactly the same, but it’s likely they would not be seen as significantly different, meaning gender was not important to the response variable

· When there is an interaction, we cannot just compare means of y for the levels of the two explanatory variables separately

· The best we can do is compare the means of the individual combinations, here 9.9, 15.1, 15.4 and 10.2

· The same ideas translate to doing statistical inference, meaning we look for evidence of significant interaction in the population using hypothesis testing prior to looking for significant main effects

· Test for interaction

· : no significant interaction between x1 and x2 in the population

· : there is significant interaction between x1 and x2 in the population

· You can look for the p-value for this test to determine whether or not to reject as before

· If you do reject : no sig interaction, the natural follow up question is which of the combinations of x1 and x2 have significantly different means levels of y

· This can be addressed the same way as we did for the followup to one way anova

· If you do not reject : no sig interaction, the natural follow up question is to test a different set of hypotheses, specifically

· : population means of y are same across levels of x1, and also across levels of x2 (in other words, there is no main effect of x1 and x2 on the mean level of y)

· If you reject this , then you would want to do follow up testing to see which population means are different, just as for the follow up to one way anova

General process for two way anova

Perform overall F-test:

Is the model useful, i.e., does at least one term (either main effect or interaction effect) affect the mean of y?

Done; neither factor x1 or x2 affects the response (either as main effects or interaction) so there is nothing more to do.

No

Yes

Determine which combinations of x1 and x2 have different mean levels of y

(We do not need to test main effects.)

Test interaction:

Is there interaction between x1 and x2?

Yes

No

If you find neither main effect is significant, then neither x1 or x2 explained much variation in y

If x1 is significant, then determine which categories of x1 have different mean levels of y

If x2 is significant, then determine which categories of x2 have different mean levels of y

If both A and B are significant, then the model is

Test only main effects of x1 and x2:

Perform the following two tests:

Test main effect of x1:

Does mean outcome of y differ for different levels of x1?

Test main effect of x2:

Does mean outcome of y differ for different levels of x2?

· Assumptions of two way anova (same as one way anova, but applied to the individual combinations of the two explanatory variables), including samples are taken randomly

· For each combination of x1 and x2, sample sizes are reasonably large OR corresponding populations normal

· If sample sizes are not reasonably large, then we should check boxplots or histograms for each combination of x1 and x2 to see if they are reasonably normal

· All combinations of x1 and x2 should have the same (or similar) variability

· We can compare the sample standard deviations or IQR’s of all combinations to see if this is reasonably met

· Examples with R code

· Overall test

· Test for interaction

· Interaction sig

· Post-hoc on combinations of x1 and x2

· Interaction not sig

· Follow up test for main effects, and possibly post-hoc on levels of x1 and x2

· What to do when assumptions are not met?

· This is a good question, but I’ve decided it is beyond the scope of our class, meaning it would take a significant amount of time to go into, and we don’t have a significant amount of time left

· The one exception is if the situation is such that subjects have the response variable measured repeatedly over time, often called repeated measures data, so that one of the explanatory variables is which measurement for each subject

· In this case, researchers will use Friedman’s test, but we won’t go into any details here, because we haven’t talked about how to analyze repeated measures data generally

Chapter 11 reading: Aho Chapter 11: Pages 503-511, 513 (section 11.4.1), 519-524 (skip section 11.6.2 and example for likelihood ratio test)