Need help with Q2 (requires RStudio)
PSTAT 126 - Homework 2
Due: 11:55 p.m. Friday, October 25
1. This problem uses the wblake data set in the alr4 package. This data set includes samples of small mouth bass collected in West Bearskin Lake, Minnesota, in 1991. Interest is in predicting length with age. Finish this problem without using lm().
(a) Compute the regression of length on age, and report the estimates, their standard errors, the value of the coefficient of determination, and the estimate of variance. Write a sentence or two that summarizes the results of these computations.
(b) Obtain a 99% confidence interval for β1 from the data. Interpret this interval in the context of the data.
(c) Obtain a prediction and a 99% prediction interval for a small mouth bass at age 1. Interpret this interval in the context of the data.
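The quantities asked for in Problem 1 can all be built from the usual sums of squares, so `lm()` is not needed. The following is a minimal sketch, not the course's official solution. The helper names `slr_by_hand`, `ci_slope`, and `pred_interval` are hypothetical, and the column names `Age` and `Length` are assumed to be those used by `wblake` in `alr4`.

```r
# Simple linear regression computed from sums of squares (no lm()).
# All helper names here are illustrative, not part of any package.
slr_by_hand <- function(x, y) {
  n    <- length(x)
  xbar <- mean(x)
  ybar <- mean(y)
  Sxx  <- sum((x - xbar)^2)
  Sxy  <- sum((x - xbar) * (y - ybar))
  Syy  <- sum((y - ybar)^2)
  b1   <- Sxy / Sxx                       # slope estimate
  b0   <- ybar - b1 * xbar                # intercept estimate
  rss  <- sum((y - b0 - b1 * x)^2)        # residual sum of squares
  sigma2 <- rss / (n - 2)                 # estimate of the error variance
  list(n = n, xbar = xbar, Sxx = Sxx, b0 = b0, b1 = b1,
       se_b0 = sqrt(sigma2 * (1 / n + xbar^2 / Sxx)),
       se_b1 = sqrt(sigma2 / Sxx),
       sigma2 = sigma2,
       r2 = 1 - rss / Syy)                # coefficient of determination
}

# t-based confidence interval for the slope beta1.
ci_slope <- function(fit, level = 0.99) {
  tcrit <- qt(1 - (1 - level) / 2, df = fit$n - 2)
  fit$b1 + c(-1, 1) * tcrit * fit$se_b1
}

# Point prediction and prediction interval for a new observation at x0.
pred_interval <- function(fit, x0, level = 0.99) {
  tcrit   <- qt(1 - (1 - level) / 2, df = fit$n - 2)
  yhat    <- fit$b0 + fit$b1 * x0
  se_pred <- sqrt(fit$sigma2 * (1 + 1 / fit$n + (x0 - fit$xbar)^2 / fit$Sxx))
  c(fit = yhat, lwr = yhat - tcrit * se_pred, upr = yhat + tcrit * se_pred)
}

# Usage on wblake, only if alr4 is installed (column names are assumed).
if (requireNamespace("alr4", quietly = TRUE)) {
  data("wblake", package = "alr4")
  fit <- slr_by_hand(wblake$Age, wblake$Length)
  print(unlist(fit[c("b0", "b1", "se_b0", "se_b1", "sigma2", "r2")]))
  print(ci_slope(fit, level = 0.99))
  print(pred_interval(fit, x0 = 1, level = 0.99))
}
```

The prediction interval uses the extra `1 +` term inside the square root because it covers a single new fish rather than the mean length at that age.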
2. This exercise will help you understand what a 95% confidence interval is. Suppose that the population regression line is known to be E(Y) = β0 + β1x, where β0 = 5 and β1 = 10, and x is uniformly generated from [0, 1]. The sample size used to estimate this line is n = 100.
Try to finish the following simulation code to construct 95% confidence intervals for β1 and check the coverage rate, i.e., the percentage of confidence intervals covering the true value of β1.
n = 100
x = seq(0, 1, length=100)
beta0 = 5
beta1 = 10
Ey = beta0 + beta1 * x
nsim = 100
#b1vec is a vector keeping track of the estimate b1 for each simulation
b1vec = 1:nsim*0
#msevec is a vector keeping track of MSE (mean square error) for each simulation
msevec = 1:nsim*0
coverage = 0
for(i in 1:nsim) {
  y = Ey + rnorm(n)
  #your code to regress y on x and find b1
  b1vec[i] = b1
  #your code to construct a 95% confidence interval for beta1
  #your code to compute mse
  msevec[i] = mse
  #the confidence interval is named as ci_b1
  if (ci_b1[1] <= beta1 && ci_b1[2] >= beta1) {
    coverage = coverage+1
  }
}
coverage.rate = coverage/nsim
(a) The total number of simulations is denoted by nsim. Use nsim = 20, 500, 1000, and check the coverage rates. Are they close to 95%?
(b) For each nsim value, make a histogram of b1’s. What is the mean of b1’s? Is the distribution symmetric around the mean?
(c) Compute Sxx, and compute σ̂² as the mean of the values kept in msevec, i.e., E(MSE). Then compute the estimated standard deviation of b1 from its formula for nsim = 20, 500, 1000. For the three simulation scenarios, rank the estimated standard deviations of b1 from highest to lowest.
(d) Also, rank the absolute value of the mean of b1’s - β1, i.e., |E(b1)− β1| from highest to lowest for the three simulation scenarios.
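One way to fill in the three "your code" placeholders in the Problem 2 template is sketched below. It is a possible completion, not the instructor's solution: the wrapper function `simulate_coverage` and its return fields are hypothetical names, the estimates are computed from Sxx and Sxy rather than with `lm()`, and the confidence interval is the usual t interval with n - 2 degrees of freedom.

```r
# Hypothetical wrapper around the homework template: runs nsim simulations
# and returns the coverage rate together with the stored b1 and MSE values.
simulate_coverage <- function(nsim, n = 100, beta0 = 5, beta1 = 10,
                              level = 0.95) {
  x     <- seq(0, 1, length.out = n)
  Ey    <- beta0 + beta1 * x
  xbar  <- mean(x)
  Sxx   <- sum((x - xbar)^2)
  tcrit <- qt(1 - (1 - level) / 2, df = n - 2)

  b1vec    <- numeric(nsim)   # estimate b1 from each simulation
  msevec   <- numeric(nsim)   # MSE from each simulation
  coverage <- 0

  for (i in 1:nsim) {
    y <- Ey + rnorm(n)
    # regress y on x and find b1 (and b0, needed for the residuals)
    b1 <- sum((x - xbar) * (y - mean(y))) / Sxx
    b0 <- mean(y) - b1 * xbar
    b1vec[i] <- b1
    # MSE = RSS / (n - 2)
    mse <- sum((y - b0 - b1 * x)^2) / (n - 2)
    msevec[i] <- mse
    # confidence interval for beta1: b1 +/- t * sqrt(MSE / Sxx)
    ci_b1 <- b1 + c(-1, 1) * tcrit * sqrt(mse / Sxx)
    if (ci_b1[1] <= beta1 && ci_b1[2] >= beta1) {
      coverage <- coverage + 1
    }
  }

  list(coverage.rate = coverage / nsim,
       b1vec = b1vec, msevec = msevec, Sxx = Sxx,
       sd_b1 = sqrt(mean(msevec) / Sxx))   # estimated sd of b1 for part (c)
}

# Parts (a), (c), (d): summaries for the three requested values of nsim.
set.seed(126)   # arbitrary seed, only for reproducibility
for (nsim in c(20, 500, 1000)) {
  res <- simulate_coverage(nsim)
  cat("nsim =", nsim,
      " coverage =", res$coverage.rate,
      " mean(b1) =", mean(res$b1vec),
      " sd_b1 =", res$sd_b1,
      " |mean(b1) - beta1| =", abs(mean(res$b1vec) - 10), "\n")
}
```

For part (b), calling `hist(res$b1vec)` on each result draws the histogram of the b1 values. Because the numbers depend on the random seed, the rankings in (c) and (d) may change from run to run.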
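3. The simple linear regression model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, \dots, n$, can also be written as
\[
\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}
=
\begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.
\]
Using matrix notation, the model is
\[
Y = X\beta + \varepsilon.
\]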
In this problem, we will show that the least squares estimate is given by
\[
b = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = (X'X)^{-1} X'Y.
\]
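(a) Using straightforward matrix multiplication, show that
\[
X'X = \begin{pmatrix} n & n\bar{x} \\ n\bar{x} & \sum_{i=1}^{n} x_i^2 \end{pmatrix}
    = n \begin{pmatrix} 1 & \bar{x} \\ \bar{x} & S_{xx}/n + \bar{x}^2 \end{pmatrix},
\qquad
X'Y = \begin{pmatrix} n\bar{Y} \\ \sum_{i=1}^{n} x_i Y_i \end{pmatrix}
    = \begin{pmatrix} n\bar{Y} \\ S_{xy} + n\bar{x}\bar{Y} \end{pmatrix}.
\]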
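(b) Using the identity
\[
\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1}
= \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}
\]
for a 2 × 2 matrix, show that
\[
(X'X)^{-1} = \frac{1}{S_{xx}} \begin{pmatrix} S_{xx}/n + \bar{x}^2 & -\bar{x} \\ -\bar{x} & 1 \end{pmatrix}.
\]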
(c) Combine your answers from (a) and (b) to show that
\[
b = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = (X'X)^{-1} X'Y,
\]
where $b_1 = S_{xy}/S_{xx}$ and $b_0 = \bar{Y} - b_1\bar{x}$ are the least squares estimates from simple linear regression.
(d) Simulate a data set with n = 100 observation units such that Yi = 1 + 2xi + εi, i = 1, . . . , n, where εi follows the standard normal distribution, i.e., a normal distribution with zero mean and unit variance. Use the result in (c) to compute b0 and b1. Show that they are the same as the estimates from lm(). Start by generating x as
n = 100
x = seq(0, 1, length = n)
(Hint: check the help page of rnorm() to see how to simulate normally distributed random variables. Use solve() to get an inverse matrix and t() to get a transpose matrix.)
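The identities in Problem 3 can be checked numerically before or after deriving them by hand. The sketch below follows the hint (`rnorm()`, `solve()`, `t()`) and compares the matrix formula, the closed-form Sxy/Sxx formulas, and `lm()`. The seed and all variable names other than `n` and `x` are arbitrary choices made for illustration.

```r
set.seed(2)                       # arbitrary seed, for reproducibility only
n <- 100
x <- seq(0, 1, length = n)
y <- 1 + 2 * x + rnorm(n)         # Y_i = 1 + 2 x_i + eps_i, eps_i ~ N(0, 1)

# Design matrix with a column of ones and a column of x values.
X   <- cbind(1, x)
XtX <- t(X) %*% X
XtY <- t(X) %*% y

# Matrix form of the least squares estimate: b = (X'X)^{-1} X'Y.
b_matrix <- drop(solve(XtX) %*% XtY)

# Closed-form estimates from part (c).
xbar <- mean(x)
ybar <- mean(y)
Sxx  <- sum((x - xbar)^2)
Sxy  <- sum((x - xbar) * (y - ybar))
b1   <- Sxy / Sxx
b0   <- ybar - b1 * xbar

# Inverse of X'X according to the formula in part (b).
XtX_inv_formula <- (1 / Sxx) * matrix(c(Sxx / n + xbar^2, -xbar,
                                        -xbar,            1), nrow = 2)

# Reference fit with lm().
b_lm <- coef(lm(y ~ x))

# Side-by-side comparison of the three sets of estimates.
print(cbind(matrix = b_matrix, formulas = c(b0, b1), lm = b_lm))
print(max(abs(XtX_inv_formula - solve(XtX))))   # should be numerically tiny
```

The three columns should agree up to floating-point error. In practice `solve(XtX, XtY)` avoids forming the inverse explicitly, but the explicit inverse matches the form the problem asks about.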
4. This problem uses the UBSprices data set in the alr4 package. The international bank UBS regularly produces a report (UBS, 2009) on prices and earnings in major cities throughout the world. Three of the measures they include are prices of basic commodities, namely 1 kg of rice, a 1 kg loaf of bread, and the price of a Big Mac hamburger at McDonalds.
An interesting feature of the prices they report is that prices are measured in the minutes of labor required for a “typical” worker in that location to earn enough money to purchase the commodity. Using minutes of labor corrects at least in part for currency fluctuations, prevailing wage rates, and local prices. The data file includes measurements for rice, bread, and Big Mac prices from the 2003 and the 2009 reports. The year 2003 was before the major recession hit much of the world around 2006, and the year 2009 may reflect changes in prices due to the recession. The graph below is the plot of Y = rice2009 versus x = rice2003, the price of rice in 2009 and 2003, respectively, with the cities corresponding to a few of the points marked.
[Figure: scatterplot of 2009 rice price (vertical axis) versus 2003 rice price (horizontal axis), with the points for Budapest, Vilnius, Seoul, Mumbai, and Nairobi labeled.]
(a) The line with equation Y = x is shown on this plot as the solid line. What is the key difference between points above this line and points below the line?
(b) Which city had the largest increase in rice price? Which had the largest decrease in rice price?
(c) Give at least one reason why fitting simple linear regression to the figure in this problem is not likely to be appropriate.
(d) The following graph represents Y and x using log scales. Explain why this graph and the previous graph suggest that using log scales is preferable if fitting simple linear regression is desired. The linear model is shown by the dashed line.
[Figure: scatterplot of log(2009 rice price) (vertical axis) versus log(2003 rice price) (horizontal axis).]
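The two plots in Problem 4 can be redrawn to look at the raw and log scales side by side. The sketch below is only a plotting aid and does not answer parts (a) to (d). The function name `compare_scales` is hypothetical, and it takes generic positive vectors `x` and `y` so that it does not depend on any package being installed.

```r
# Hypothetical helper: fit simple linear regression on the raw and on the
# log scale, and optionally draw both scatterplots side by side.
compare_scales <- function(x, y, plot = TRUE) {
  raw_fit <- lm(y ~ x)
  log_fit <- lm(log(y) ~ log(x))
  if (plot) {
    op <- par(mfrow = c(1, 2))
    on.exit(par(op))
    plot(x, y, xlab = "2003 price", ylab = "2009 price")
    abline(0, 1)                 # solid line Y = x
    abline(raw_fit, lty = 2)     # dashed line: fitted regression
    plot(log(x), log(y), xlab = "log(2003 price)", ylab = "log(2009 price)")
    abline(0, 1)                 # solid line log(Y) = log(x)
    abline(log_fit, lty = 2)     # dashed line: fitted regression
  }
  invisible(list(raw = raw_fit, log = log_fit))
}
```

Assuming `alr4` is installed and `UBSprices` has columns named `rice2003` and `rice2009`, a call such as `with(UBSprices, compare_scales(rice2003, rice2009))` after `library(alr4)` should produce plots similar to the two figures in this problem.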
5. Provide the names of those in your group for the final project. Give a brief description of your data set, including the most important variables of interest (columns of the data), the observational units (rows of the data), and some preliminary ideas of how regression can be used to help you analyze these data.