Statistics homework: multiple linear regression and leverage
STAT 6301 Homework #10 Due Thursday, 11/21/2019
Note: you may use SAS or R for this assignment.
Question 1 (50 points total, 10 points each) This is a modified version of Data Problem #30 from Chapter 10 in the textbook. Please read this problem for an introduction to the data. The data are given in “ex1030.csv”. However, instead of answering the questions in the text, you will do the following:
a) Fit a multiple linear regression model with weekly earnings as the outcome variable and with race, region, metropolitan status, age, education code, and the interaction between race and region as predictors.
b) Examine all diagnostic plots and comment on whether the model assumptions are met. If you feel the assumptions are not met, you may try transforming some of the variables (but do not remove any).
c) Examine the Cook’s D, leverage, and studentized residual plots. On the basis of these plots, are there any observations that are a cause for concern? If so, why? If you think it is appropriate to remove any points, do so and comment on how that changes the scope of the analysis.
d) Perform an extra sum of squares test that tests whether the interaction between race and region is statistically significant. Include a BYOA table and give a clear conclusion.
e) Given everything in parts a-d, is there any evidence that the distributions of weekly earnings differ in the populations of white and black workers after accounting for the other variables? Give a statistical justification, and support it with either numeric or graphical evidence.
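For reference, the quantities named in parts c) and d) have standard definitions, sketched below. The notation is an assumption made for illustration and may differ from the textbook's: X is the design matrix, p is the number of regression coefficients including the intercept, e_i is the i-th residual, \hat{\sigma} is the residual standard deviation estimate, SSR is a sum of squared residuals, and df is the residual degrees of freedom of the reduced (no interaction) or full (with interaction) model.

```latex
% Leverage: i-th diagonal element of the hat matrix
h_i = \left[ X (X^\top X)^{-1} X^\top \right]_{ii}

% (Internally) studentized residual
r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_i}}

% Cook's distance, written in terms of r_i and h_i
D_i = \frac{r_i^{2}}{p} \cdot \frac{h_i}{1 - h_i}

% Extra-sum-of-squares F statistic (reduced vs. full model)
F = \frac{\left( \mathrm{SSR}_{\text{reduced}} - \mathrm{SSR}_{\text{full}} \right)
          / \left( \mathrm{df}_{\text{reduced}} - \mathrm{df}_{\text{full}} \right)}
         {\mathrm{SSR}_{\text{full}} / \mathrm{df}_{\text{full}}}
```

Under the null hypothesis that the interaction coefficients are all zero, and if the model assumptions hold, F follows an F distribution with df_reduced - df_full numerator and df_full denominator degrees of freedom.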
Question 2 (50 points total) This is a modified version of Data Problem #23 from Chapter 11 in the textbook. Please read that problem for an introduction to the data. The data are given in “ex1123.csv”. However, instead of answering the questions in the text, you will do the following:
a) (5 points) Make a scatterplot matrix of mortality versus the predictor variables. Are there any predictor variables for which a transformation might be appropriate? If so, perform the transformation before proceeding.
b) (10 points) Fit a multiple linear regression model with mortality as the response variable and the other five as predictors. If you decided to transform one of the predictors in part a), use the transformed version in your model.
c) (10 points) For this problem, you may assume that your model meets the necessary assumptions. However, examine the Cook’s D, leverage, and studentized residual plots. On the basis of these plots, are there any observations that are a cause for concern? If so, which ones? If you think it is appropriate to remove any points, do so and comment on how that changes the scope of the analysis.
d) (5 points) Write down your final model, using appropriate notation.
e) (10 points) On the basis of your final model, is there any evidence that either of the pollution variables is associated with mortality after accounting for the climate and socioeconomic variables? Give statistical justification.
f) (10 points) Suppose we wish to predict the mortality rate for a city with a mean annual precipitation of 32 inches, a median number of years of school completed of 11.6, percent non-white of 4.4, an NOx of 40, and an SO2 of 15. Using your final model, make this prediction.
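As an illustration of the form the prediction in part f) takes, the sketch below assumes a final model that keeps all five predictors untransformed, in the order precipitation, education, percent non-white, NOx, SO2. That ordering and the coefficient labels are assumptions for illustration only; if a predictor was transformed in part a) (for example, by a logarithm), the same transformation must be applied to its value before substituting.

```latex
% Point prediction at the stated predictor values
\widehat{\text{mortality}} = \hat{\beta}_0 + \hat{\beta}_1 (32) + \hat{\beta}_2 (11.6)
    + \hat{\beta}_3 (4.4) + \hat{\beta}_4 (40) + \hat{\beta}_5 (15)

% 95% prediction interval for a single new city
\widehat{\text{mortality}} \;\pm\; t_{\mathrm{df}}(0.975)
    \sqrt{\hat{\sigma}^{2} + \mathrm{SE}\!\left(\widehat{\text{mortality}}\right)^{2}}
```

Here df is the residual degrees of freedom of the final model, \hat{\sigma} is its residual standard deviation estimate, and SE(·) is the standard error of the estimated mean at these predictor values.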