Statistics with R Programming
Assignment 6 KEY, STA 622, Fall 2020
1. In data collection, we often speak of random sampling (selection) and random assignment. Explain very briefly how the goals of these two uses of randomness differ.
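As a minimal sketch of the distinction: random sampling decides which individuals enter the study, so that results can be generalized to the population, while random assignment decides which treatment each studied individual receives, so that the groups are comparable and differences can be attributed to the treatment. The population of 1000 ID numbers, the sample size of 20, and the group labels below are hypothetical and are not part of the assignment.

```r
set.seed(622)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 1000 individuals, identified by ID number
population <- 1:1000

# Random sampling: choose WHICH individuals are studied
study.ids <- sample(population, size = 20)

# Random assignment: decide WHICH treatment each sampled individual receives
# (10 labels per group, placed in shuffled order)
group <- sample(rep(c("treatment", "control"), each = 10))

design <- data.frame(id = study.ids, group = group)
table(design$group)
```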
2. Use the data in #5 in Aho textbook on page 319 to complete the following. However, you may use R code to do all calculations, provided that you show code and output as always.
(a) Make an appropriate plot of the data to examine if using Pearson’s correlation is appropriate (for describing the sample, not for inference).
While there are no outliers, there is a clear non-linear average pattern, so Pearson’s correlation is not appropriate, because it requires a linear average pattern.
> ggplot(prob2,aes(x=time,y=pop))+geom_point()
(b) Calculate and interpret Pearson’s correlation.
Pearson’s correlation is 0.925, which is interpreted as a very strong, positive linear correlation.
> cor(prob2$time,prob2$pop,method="pearson")
[1] 0.9251503
(c) Calculate and interpret Spearman’s correlation (do not worry about Kendall’s tau).
Spearman’s correlation is 1, which means there is a perfect positive linear relationship between the ranks of time and the ranks of population. In other words, as you can see in the scatterplot, population strictly increases as time increases.
> cor(prob2$time,prob2$pop,method="spearman")
[1] 1
(d) Explain very briefly the conceptual difference between Pearson’s and Spearman’s correlations, and when we would use each one.
Pearson’s is used when the data do not show any non-linear average pattern, and do not show any extreme or influential outliers. Spearman’s calculates Pearson’s correlation on the ranks of X and ranks of Y, which can be used as long as the original scatterplot shows an average pattern that is monotonic (the average pattern is either increasing or decreasing), and Spearman’s is robust to outliers (because it uses ranks). Neither should be used when the average pattern is non-linear and non-monotonic (e.g. the average pattern looks like a parabola).
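The relationship between the two measures can be illustrated with a short sketch. The data below are hypothetical (a strictly increasing but curved pattern), not the textbook data. The sketch checks that Spearman's correlation equals Pearson's correlation computed on the ranks, and that it equals 1 for a monotonic pattern even though Pearson's correlation is below 1.

```r
# Hypothetical data: strictly increasing, but curved rather than linear
x <- 1:10
y <- exp(x / 2)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Spearman's correlation is Pearson's correlation applied to the ranks
spearman.by.hand <- cor(rank(x), rank(y), method = "pearson")

c(pearson = pearson, spearman = spearman, by.hand = spearman.by.hand)
```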
3. Use the data in #2 in Aho textbook on page 414 to complete the following. You may use R for any of this provided that you show code and output as always. Note the word “given” in the problem statement means that weight.gain is the response variable.
(a) Make a scatterplot of the data, and describe it by addressing the four components discussed in class.
The average pattern is linear, which can be seen by tracing the middle of the points.
The direction is positive, because as lysine.eaten increases, weight.gain also generally increases.
The strength is moderate, because there is a moderate amount of vertical variability from the average pattern.
There do not appear to be any extreme or influential outliers.
> ggplot(prob3,aes(x=lysine.eaten,y=weight.gain))+geom_point()
(b) Determine the equation of the best fitting line for the data, and interpret the slope in context.
The slope is 35.8, so for each additional gram of lysine eaten, we expect weight gain to increase by 35.8 grams, on average (in the sample). From the output below, the fitted line is weight.gain = 12.509 + 35.828(lysine.eaten).
> prob3lm<-lm(weight.gain~lysine.eaten,data=prob3)
> summary(prob3lm)
Call:
lm(formula = weight.gain ~ lysine.eaten, data = prob3)

Residuals:
    Min      1Q  Median      3Q     Max
-1.1662 -0.6741 -0.1367  0.5486  2.2590

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    12.509      1.192   10.50 1.02e-06 ***
lysine.eaten   35.828      6.957    5.15 0.000431 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.034 on 10 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.6988
F-statistic: 26.52 on 1 and 10 DF, p-value: 0.0004315
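As a rough check on where the coefficients come from, the least-squares slope can be written as b1 = r(sy/sx) and the intercept as b0 = ybar - b1(xbar). The sketch below compares these by-hand values with lm(). The x and y values are hypothetical stand-ins, not the textbook data.

```r
# Hypothetical data, NOT the textbook data
x <- c(0.09, 0.11, 0.14, 0.17, 0.20, 0.22)
y <- c(15.8, 16.1, 17.9, 18.2, 19.6, 20.7)

fit <- lm(y ~ x)

# Slope = correlation * (sd of y / sd of x); the line passes through (xbar, ybar)
b1 <- cor(x, y) * sd(y) / sd(x)
b0 <- mean(y) - b1 * mean(x)

rbind(lm = coef(fit), by.hand = c(b0, b1))
```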
(c) Determine the coefficient of determination (r²) and interpret it in context.
From output in 3b, r² = 0.726, so 72.6% of the variability in weight.gain in the sample is explained by lysine.eaten (or explained by the regression model with lysine.eaten).
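The "variability explained" interpretation follows from the definition r² = 1 - SSE/SST, which in simple linear regression also equals the squared Pearson correlation. A minimal sketch on hypothetical data (not the textbook data) compares the three ways of getting the same number:

```r
# Hypothetical data, NOT the textbook data
x <- c(0.09, 0.11, 0.14, 0.17, 0.20, 0.22)
y <- c(15.8, 16.1, 17.9, 18.2, 19.6, 20.7)

fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)    # variability left over around the line
sst <- sum((y - mean(y))^2)     # total variability in y

r2.by.hand <- 1 - sse / sst
r2.cor     <- cor(x, y)^2       # squared Pearson correlation
r2.lm      <- summary(fit)$r.squared

c(by.hand = r2.by.hand, cor.squared = r2.cor, lm = r2.lm)
```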
(d) Determine if there is evidence in this sample that weight gain and lysine eaten have a positive linear relationship in the population (you should know by now that this means you must address and LABEL the 8 steps of hypothesis testing as discussed in class).
Step 0 : From each individual (chicken), we observe weight.gain and lysine.eaten, and both are quantitative. Because the goal is to learn about the relationship between these two quantitative variables, the type of problem is simple linear regression.
Step 1 :
Linearity and outliers were checked in 3a.
In terms of normality of residuals, the boxplot shows moderate right skewness (but no outliers), so normality is not met. In reality, we should explore a remedy like a transformation, but this was not requested.
> ggplot(data=prob3,aes(y=prob3lm$residuals))+geom_boxplot()
> plot(prob3lm)
Step 2 : H0: β1 = 0 vs. H1: β1 > 0, where β1 = population slope for x = lysine.eaten and y = weight.gain
Step 3 : α = 0.05 (default value when none given)
Step 4 : Reproducing output from above, t = 5.150.
> prob3lm<-lm(weight.gain~lysine.eaten,data=prob3)
> summary(prob3lm)
Call:
lm(formula = weight.gain ~ lysine.eaten, data = prob3)

Residuals:
    Min      1Q  Median      3Q     Max
-1.1662 -0.6741 -0.1367  0.5486  2.2590

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)    12.509      1.192   10.50 1.02e-06 ***
lysine.eaten   35.828      6.957    5.15 0.000431 ***
Step 5 : From the above table, p-value = 0.000431/2 = 0.0002 (you need to divide by 2 because lm() gives two-tailed p-values, as evidenced by the absolute value symbol in Pr(>|t|), while H1 was > and the observed t is positive).
Step 6 : p-value = 0.0002 < 0.05, so we reject H0
Step 7 : There is sufficient evidence to support that there is a positive linear relationship between weight.gain and lysine.eaten in the population.
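The one-sided p-value in Step 5 can also be checked directly from the t distribution, as a sketch. The test statistic (t = 5.150) and the residual degrees of freedom (10) are taken from the summary() output; the variable names are only illustrative.

```r
t.stat <- 5.150   # test statistic from Step 4
df     <- 10      # residual degrees of freedom from the summary() output

# Two-sided p-value, as reported by lm() in the Pr(>|t|) column
p.two.sided <- 2 * pt(abs(t.stat), df = df, lower.tail = FALSE)

# One-sided p-value for H1: slope > 0 (upper tail only)
p.one.sided <- pt(t.stat, df = df, lower.tail = FALSE)

c(two.sided = p.two.sided, one.sided = p.one.sided)
```

Because the observed t is positive and H1 is in the > direction, the upper-tail area is exactly half of the two-sided p-value.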
(e) Calculate and interpret a 95% confidence interval for the mean value of y for when x=0.14.
We are 95% confident that the population mean level of weight.gain for all individuals with lysine.eaten=0.14 is between 16.7 grams and 18.3 grams.
> new.dat<-data.frame(lysine.eaten=0.14)
> predict(prob3lm, newdata = new.dat, interval = 'confidence')
       fit     lwr      upr
1 17.52444 16.7481 18.30079
(f) Calculate and interpret a 95% prediction interval for an individual value of y for an individual with x=0.14.
We are 95% confident that if another individual chicken came along and had lysine.eaten=0.14 grams, that individual’s value of weight.gain would be between 15.1 grams and 20.0 grams.
> predict(prob3lm, newdata = new.dat, interval = 'prediction')
       fit     lwr     upr
1 17.52444 15.0932 19.9556
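To see why the prediction interval in (f) is wider than the confidence interval in (e), both can be rebuilt by hand as fit ± t*(standard error). The standard error for an individual value has an extra "1 +" term under the square root, which accounts for individual variability around the mean. The sketch below uses hypothetical data (not the textbook data) and compares the by-hand intervals with predict().

```r
# Hypothetical data, NOT the textbook data
x <- c(0.09, 0.11, 0.14, 0.17, 0.20, 0.22)
y <- c(15.8, 16.1, 17.9, 18.2, 19.6, 20.7)

fit <- lm(y ~ x)
x0  <- 0.14
n   <- length(x)

s      <- summary(fit)$sigma              # residual standard error
t.star <- qt(0.975, df = n - 2)           # critical value for 95% intervals
fit0   <- sum(coef(fit) * c(1, x0))       # fitted value at x0
sxx    <- sum((x - mean(x))^2)

se.mean <- s * sqrt(1 / n + (x0 - mean(x))^2 / sxx)      # for the mean of y
se.pred <- s * sqrt(1 + 1 / n + (x0 - mean(x))^2 / sxx)  # for an individual y

ci.by.hand <- fit0 + c(-1, 1) * t.star * se.mean
pi.by.hand <- fit0 + c(-1, 1) * t.star * se.pred

new.dat <- data.frame(x = x0)
ci.predict <- predict(fit, newdata = new.dat, interval = "confidence")
pi.predict <- predict(fit, newdata = new.dat, interval = "prediction")

rbind(ci.by.hand, pi.by.hand)
```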