ECN 425: Introduction to Econometrics
Alvin Murphy Arizona State University: Fall 2018
Assignment #2
Due at the beginning of class on Tuesday, September 25th
PART I: MULTIPLE REGRESSION ANALYSIS & HYPOTHESIS TESTING
1) Suppose that the wage paid to manufacturing workers in dangerous occupations (e.g. mining,
slaughterhouses, chemical processing) depends primarily on two factors: on-the-job
experience and risk of death, both of which increase wages. Furthermore, suppose that a
worker’s risk of death generally declines with his experience on the job. Now suppose the
National Association of Manufacturing Workers wants to predict the impact of a proposed
safety regulation on wages. Their analyst predicts that the regulation will cause wages to
decline by $\hat{\beta}_1 \Delta ROD$, where $\Delta ROD < 0$ is the reduction in the risk of death on the job that
will result from the regulation and $\hat{\beta}_1 > 0$ is an estimate for the slope coefficient of the
following model:
$$wage = \beta_0 + \beta_1 ROD + u$$
Is $\hat{\beta}_1 \Delta ROD$ likely to be an unbiased estimator for the change in wages caused by the
regulation? If so, explain why. If not, explain the likely direction of the bias. Write down a
formal model for the bias to support your answer.
(10 points)
2) The following model is a simplified version of the multiple regression model used by Biddle
and Hamermesh (1990) to study the tradeoff between time spent sleeping and working and to
look at other factors affecting sleep.
$$sleep = \beta_0 + \beta_1\,totwrk + \beta_2\,educ + \beta_3\,age + u,$$
where sleep and totwrk (total work) are measured in minutes per week and educ and age are
measured in years. Biddle and Hamermesh estimated the following equation:
$$sleep = 3{,}638.25 - .148\,totwrk - 11.13\,educ + 2.20\,age + \hat{u}$$
(112.28) (.017) (5.88) (1.45)
N=706, SSR=123,455,057
(20 points: 5/5/5/5)
(i) Explain the appropriate interpretation of the coefficient on totwrk.
(ii) Suppose the average undergraduate program takes 4 years to complete. All else constant,
what do the regression results imply about the average impact of an undergraduate degree
on sleep per week?
(iii) Test whether the coefficient on totwrk is statistically different from one at the 1% level.
State the null hypothesis, the alternative hypothesis, and the critical value above which you
would reject the null hypothesis. Then use this information to perform the appropriate
test.
(iv) Dropping educ and age from the regression gives: $sleep = 3{,}586.38 - .151\,totwrk + \hat{u}$
(38.91) (.017)
N=706, SSR=124,858,119
Set up a test to determine whether educ and age are jointly significant in the original
equation at the 1% level. State the null hypothesis, the alternative hypothesis, and the
critical value above which you would reject the null hypothesis. Then use this
information to perform the appropriate test.
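For reference, a joint test of this kind is usually built from the restricted and unrestricted sums of squared residuals. The general textbook form is sketched below. The symbols $SSR_r$, $SSR_{ur}$, $q$, $n$, and $k$ are notation assumed here (restricted SSR, unrestricted SSR, number of restrictions, sample size, and number of regressors in the unrestricted model); they do not appear in the assignment itself.

```latex
% General F statistic for q exclusion restrictions (sketch; notation assumed)
F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)},
\qquad
F \sim F_{q,\; n-k-1} \ \text{under } H_0 \text{ and the classical linear model assumptions.}
```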
3) Consider the following model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$, which satisfies the classical
linear model assumptions. Suppose you would like to test the null hypothesis,
$H_0: \beta_1 - 2\beta_3 = 0$.
(15 points: 5/5/5)
(i) Write the t statistic for testing $H_0: \beta_1 - 2\beta_3 = 0$ as a function of $\hat{\beta}_1$ and $\hat{\beta}_3$.
(ii) Let $\hat{\beta}_1$ and $\hat{\beta}_3$ denote the OLS estimators of $\beta_1$ and $\beta_3$. Solve for
$\mathrm{Var}(\hat{\beta}_1 - 2\hat{\beta}_3)$ in terms of the variances of $\hat{\beta}_1$ and $\hat{\beta}_3$, and the
covariance between them. You may find it helpful to use the properties of variances in
Wooldridge appendix B.
(iii) Let $\theta_1 = \beta_1 - 2\beta_3$. Use this equation to rewrite the model in a way that allows you to
estimate $\theta_1$ and its standard error directly. Your equation should be a function of
$\theta_1$, $\beta_2$, and $\beta_3$.
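As a reminder of the kind of variance property the problem points to, the standard rule for a linear combination of two random variables is sketched below. Here $X$ and $Y$ are generic random variables and $a$ and $b$ are generic constants, all introduced only for illustration; applying the rule to the estimators in this problem is left to the reader.

```latex
% Variance of a linear combination of two random variables (generic a, b, X, Y)
\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)
```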
4) Consider the following model for how state employment depends on the tax revenue mix:
$$\log(employment) = \beta_0 + \beta_1\,share\_property + \beta_2\,share\_income + \beta_3\,share\_sales + u,$$
where share_property is the share of property taxes in total tax revenue; share_income is the
share of income taxes, and share_sales is the share of sales taxes. The share of tax revenue
from all other sources (share_other) is omitted. The shares have all been multiplied by 100
so that a value of 50 for share_property would indicate that 50% of total tax revenue comes
from property taxes, for example.
(15 points: 5/5/5)
(i) Explain why share_other is omitted.
(ii) Provide a careful interpretation of $\beta_2$.
(iii)Suppose share_income and share_sales are highly correlated. How would you expect
this correlation to affect the bias and variance of the OLS estimator for $\beta_2$? In principle,
how could you mitigate any potential concerns about the effect of this correlation on the
bias and variance?
PART III: EMPIRICAL ANALYSIS
(4) On HW#1 you assessed the potential impact of adverse selection in the market for health care. Specifically, you regressed a variety of health outcomes experienced by people with
health insurance on the share of a state’s population with health insurance. One potential
concern with the univariate model is that omitted variables could bias the estimator.
Fortunately, the BRFSS survey also collects data on a variety of demographic variables
including age, employment status, gender, and income. This problem asks you to use these
additional variables to reassess the impact of adverse selection. The more complete data are
posted as “BRFSS2.dta” on blackboard.
(20 points: 5/5/5/5)
(i) Fill in the following table of regression coefficients. Use asterisks after the coefficients to indicate the level of statistical significance, as follows: *** indicates
the coefficient is statistically different from zero at the 1% level. ** indicates the
coefficient is statistically different from zero at the 5% level. * indicates the
coefficient is statistically different from zero at the 10% level.
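Dependent variables (table columns):
(1) avg. days health not good
(2) avg. days health prevented regular activity
(3) disability (%)
(4) use equipment because of disability (%)
(5) exercise in past month (%)
(6) asthma (%)
(7) diabetes (%)

Independent variable | (1) | (2) | (3) | (4) | (5) | (6) | (7)
share_insured        |     |     |     |     |     |     |
age25to64            |     |     |     |     |     |     |
ageover65            |     |     |     |     |     |     |
selfemployed         |     |     |     |     |     |     |
unemployed_1yr       |     |     |     |     |     |     |
unemployed_lessyr    |     |     |     |     |     |     |
homemaker            |     |     |     |     |     |     |
student              |     |     |     |     |     |     |
retired              |     |     |     |     |     |     |
unable_work          |     |     |     |     |     |     |
income15to50         |     |     |     |     |     |     |
incomeover50         |     |     |     |     |     |     |
female               |     |     |     |     |     |     |
Intercept            |     |     |     |     |     |     |
N                    |     |     |     |     |     |     |
$R^2$                |     |     |     |     |     |     |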
(ii) Discuss how adding the additional explanatory variables affects your findings on adverse selection compared to homework #1.
(iii) Discuss the coefficients on ageover65 in the avg. days health not good and disability (%) regressions. What might explain these results?
(iv) In the univariate model on homework #1, the $R^2$ was very similar in the regressions for avg. days health prevented regular activity and asthma (%). This is no longer true in
the multivariate model. What can explain this change?
PART IV: CONSISTENCY OF OLS ESTIMATORS—A FAKE DATA EXPERIMENT
(5) In this problem, we will use the file “fake1.dta” to investigate the properties of OLS estimators as the sample size increases. Recall from homework #1 that these data describe a
population of 500 observations from the (true) regression equation: $y = \beta_0 + \beta_1 z + u$, such
that $E[u] = 0$, $E[u \mid z] = 0$, and $\mathrm{var}(u \mid z) = \sigma^2$.
(20 points: 4/4/4/4/4)
a) Write a program called “fake25” that takes one thousand random samples of 25 observations from the population (with replacement), and uses each one to regress y on z.
This process will yield 1000 different sets of values for $\hat{\beta}_0$ and $\hat{\beta}_1$. Use these results to
calculate the bias, recalling from the first homework that $\beta_0 = 10.95$ and $\beta_1 = 2.17$.
To draw a sample of 25 observations (with replacement) you should use the Stata
command bsample 25.
Manually doing this 1000 times will take an enormous amount of time. Writing a short
simulation program saves us the time it would take to enter the commands for each
replication manually. This allows us to run fake data experiments with more replications.
See the Stata Hint below for information on how to do this.
b) Your results from part (a) define distributions of values for $\hat{\beta}_0$ and $\hat{\beta}_1$. Report the mean,
standard deviation, min and max of this distribution for $\hat{\beta}_1$. You can also graph the
distribution by typing the command: kdensity b_z. Print out a picture of your result.
c) Now repeat parts (a) and (b) to write two more programs that take increasingly large sample sizes: “fake100” should take 1000 random samples of 100 observations, and
“fake500” should take 1000 random samples of 500 observations. Use your results to fill
in the following table for the distribution of $\hat{\beta}_1$, using at least four decimal places:
# observations per sample | Mean | Standard deviation | Min | Max | Bias($\hat{\beta}_1$)
25                        |      |                    |     |     |
100                       |      |                    |     |     |
500                       |      |                    |     |     |
d) Explain what happens to the probability distribution of $\hat{\beta}_1$ as the sample size increases.
Illustrate your answer by sketching a figure that shows probability distributions for two
of the different sample sizes.
e) If an estimator produces a probability distribution for $\hat{\beta}_1$ that becomes more and more
tightly distributed around $\beta_1$ as the sample size grows, the estimator is called
“consistent.” In HW#1 we saw that incorrectly replacing z with x biased the OLS
estimator for $\beta_1$. Explore whether this bias diminishes as the sample size grows by
taking 1000 random samples of 500 observations and running regressions of y on x.
Does the biased estimator appear to be consistent? Explain your answer. As part of your
explanation, draw a picture of your result similar to part (d).
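One way to read the bias calculation in parts (a) and (c) is as a Monte Carlo approximation: average the slope estimates over the replications and subtract the true slope. The sketch below uses notation assumed only for illustration: $R$ is the number of replications (1000 in this problem) and $\hat{\beta}_1^{(r)}$ is the estimate from replication $r$.

```latex
% Simulated bias of the slope estimator over R replications (notation assumed)
\bar{\hat{\beta}}_1 = \frac{1}{R} \sum_{r=1}^{R} \hat{\beta}_1^{(r)},
\qquad
\widehat{\mathrm{Bias}}(\hat{\beta}_1) = \bar{\hat{\beta}}_1 - \beta_1
```

Because $R$ is finite, this quantity will generally not be exactly zero even for an unbiased estimator.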
Stata Hint
In HW#1 we conducted a fake data experiment by starting with a population and then (manually)
choosing 20 random samples from that population. Rather than perform the sampling process
manually, we could have written a short program to do it automatically. For example, the
following 7 lines of code perform parts (b) and (c) of the fake data experiment from HW#1:
1. program define hw1fake
2. use "C:\Desktop\fake1.dta", clear
3. bsample round(0.05*_N)
4. reg y z
5. end
6. simulate "hw1fake" _b, reps(20)
7. sum
Lines 1 through 5 define a program that is stored in STATA’s memory. Line 1 tells STATA that
we are writing a new program and naming it “hw1fake”. Lines 2, 3, and 4 define the operations
that the program will perform. Line 5 tells STATA we are finished writing the new program.
When we ask STATA to run the program, it will perform lines 2 through 4. Line 2 tells
STATA to open the file “fake1.dta” which I have stored on the desktop of my computer. Line 3
tells STATA to take a 5% sample of the data. Line 4 tells STATA to run a regression on the 5%
sample. If I type hw1fake into STATA, it will perform lines 2 through 4. Try it for yourself.
Line 6 tells STATA to run the new program 20 times and to save the parameter estimates
from each replication. The results are stored in the data editor. You can view them. Finally,
line 7 reports summary statistics for our 20 estimates of $\beta_0$ and $\beta_1$.
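On more recent Stata releases, the same HW#1 experiment can also be written with the prefix form of simulate. The following is only a sketch under that assumption, not a replacement for the numbered listing in the hint. The seed value and the capture program drop line are additions made here for illustration, and the file path is the one from the hint, so it should be adjusted to wherever fake1.dta is stored.

```stata
* Sketch only: the HW#1 fake data experiment using the prefix form of simulate.
* Drop any earlier definition so the do-file can be re-run without an error.
capture program drop hw1fake

program define hw1fake
    * Path taken from the hint; change it to your own location of fake1.dta.
    use "C:\Desktop\fake1.dta", clear
    * Draw a 5% sample with replacement, then run the regression on it.
    bsample round(0.05*_N)
    regress y z
end

* Hypothetical seed, used only so that the random draws are reproducible.
set seed 425

* Run the program 20 times and keep the coefficient estimates from each run.
simulate _b, reps(20): hw1fake
summarize
```

With the prefix form, the saved estimates are typically named _b_z and _b_cons rather than b_z and b_cons, so a command such as kdensity would need the corresponding variable name.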