ECN 425: Introduction to Econometrics
Alvin Murphy, Arizona State University: Fall 2018
Assignment #1
Due at the beginning of class on Thursday, September 6th
PART I: DERIVING OLS ESTIMATORS
(You must show all work to receive full credit)
1) Suppose the population regression function can be written as: $y = \beta_0 + \beta_1 x + u$, where
$E(u) = 0$ and $E(u \mid x) = 0$. The sample equivalents to these two restrictions imply:
$\frac{1}{n}\sum_{i=1}^{n} \hat{u}_i = 0$ and $\frac{1}{n}\sum_{i=1}^{n} x_i \hat{u}_i = 0$. Parts (a)-(c) of this problem ask you to derive the OLS
estimators for $\beta_0$ and $\beta_1$. Please show all of your work.
(20 points: 5/5/10)
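(a) Use $\frac{1}{n}\sum_{i=1}^{n} \hat{u}_i = 0$ to demonstrate that the OLS estimator for
$\beta_0$ can be written as: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, where
$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.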
(b) Use $\frac{1}{n}\sum_{i=1}^{n} x_i \hat{u}_i = 0$ together with the result from (a) to demonstrate that the OLS
estimator for $\beta_1$ can be written as:
$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})}$.
(c) Use your result from (b) together with the definition of the variance and covariance to
demonstrate that $\hat{\beta}_1 = \frac{\mathrm{cov}(x_i, y_i)}{\mathrm{var}(x_i)}$.
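The three expressions in parts (a)-(c) can be checked numerically before or after deriving them. The following is a minimal sketch in Python (standard library only), not a derivation and not part of the required work. The data are made up, and the parameter values 1.0 and 2.0 are hypothetical choices for illustration.

```python
import random


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


random.seed(425)
n = 50
x = [random.gauss(0, 1) for _ in range(n)]
# Hypothetical population model: y = 1.0 + 2.0 * x + u (illustration only).
y = [1.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]

x_bar, y_bar = mean(x), mean(y)

# Ratio form from part (b): sum x_i (y_i - y_bar) / sum x_i (x_i - x_bar).
b1_ratio = (sum(xi * (yi - y_bar) for xi, yi in zip(x, y))
            / sum(xi * (xi - x_bar) for xi in x))

# Covariance/variance form from part (c), using sample moments.
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
b1_covvar = cov_xy / var_x

# Intercept form from part (a).
b0 = y_bar - b1_covvar * x_bar

# The fitted residuals should satisfy both sample moment conditions.
residuals = [yi - b0 - b1_covvar * xi for xi, yi in zip(x, y)]
mean_resid = mean(residuals)
mean_x_resid = mean([xi * ri for xi, ri in zip(x, residuals)])

print("slope, ratio form:   ", b1_ratio)
print("slope, cov/var form: ", b1_covvar)
print("mean residual:       ", mean_resid)
print("mean of x * residual:", mean_x_resid)
```

Up to floating-point rounding, the two slope values should agree and both residual averages should be zero, which is what the two sample restrictions require.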
2) Suppose the population regression function is $y_i = \beta_0 + \beta_1 z_i + u_i$, and you estimate the
following sample regression function: $y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{u}_i$, where $x \neq z$. (20 points: 10/10)
(a) Express your estimator, $\hat{\beta}_1$, in terms of the data and parameters of the population
regression function, $x_i$, $z_i$, $\beta_1$, and $u_i$.
(b) Use your result from (a) to demonstrate that $\hat{\beta}_1$ is generally a biased estimator for $\beta_1$.
PART II: USING A FAKE DATA EXPERIMENT TO INVESTIGATE OLS ESTIMATORS
A fake data experiment can be a useful way to investigate the properties of an estimator. This
process begins by specifying the “true” economic model (i.e. the population regression
function). The next step is to use this model to generate some data that represent a population.
Finally, by taking repeated samples from the population and using these samples to estimate the
sample regression function several times, you can evaluate how well your estimator performs
(e.g. bias and variance) under specific conditions.
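The three steps above can be sketched in a few lines. The following is a minimal illustration in Python (standard library only), not the Stata workflow used in this assignment. The parameter values, the distribution of z, and the number of repetitions are all hypothetical, and samples are drawn with replacement as a bootstrap-style draw.

```python
import random


def ols(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


random.seed(2018)

# Step 1: specify the "true" model. These parameter values are hypothetical.
BETA0, BETA1, SIGMA = 1.0, 0.5, 1.0

# Step 2: generate a population of 500 observations from that model.
z = [random.uniform(0, 10) for _ in range(500)]
y = [BETA0 + BETA1 * zi + random.gauss(0, SIGMA) for zi in z]
population = list(zip(z, y))

# Population parameters: OLS on the full population.
pop_b0, pop_b1 = ols(z, y)

# Step 3: take repeated 5% samples (25 of 500, with replacement) and
# re-estimate the sample regression function on each one.
N_REPS, SAMPLE_SIZE = 1000, 25
estimates = []
for _ in range(N_REPS):
    sample = random.choices(population, k=SAMPLE_SIZE)
    zs, ys = zip(*sample)
    estimates.append(ols(zs, ys))

# Estimated bias: average estimate minus the population parameter.
bias_b0 = sum(b0 for b0, _ in estimates) / N_REPS - pop_b0
bias_b1 = sum(b1 for _, b1 in estimates) / N_REPS - pop_b1

print("population parameters:", pop_b0, pop_b1)
print("estimated bias:       ", bias_b0, bias_b1)
```

With many repetitions the estimated bias of a correctly specified model should be close to zero. With only a handful of repetitions it can be noticeably farther from zero purely because of sampling variation.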
3) In this problem, you will use a fake data experiment to demonstrate the importance of correctly specifying the form of the sample regression function. More precisely, you will
compare the bias of the OLS estimator when the model is correctly specified, to the bias
when the model is incorrectly specified to use the wrong explanatory variable. In the file
“fake1.dta”, I have generated a population of 500 observations from the (true) regression
equation: $y = \beta_0 + \beta_1 z + u$, such that $E(u) = 0$, $E(u \mid z) = 0$, and $\mathrm{var}(u \mid z) = \sigma^2$.
(25 points: 5/5/5/5/5)
a) Use these data to calculate the population parameters $\beta_0$ and $\beta_1$. What are they? Please
use 2 decimal places.
b) Now, take a random 5% sample from the population and discard the remaining observations. This can be done using the command “bsample round(0.05*_N)”. Use
this random sample to calculate OLS estimates for $\beta_0$ and $\beta_1$. Report your results out to
3 decimal places.
c) Repeat part (b) 19 more times, saving the values for $\hat{\beta}_0$ and $\hat{\beta}_1$ on each iteration. Thus,
on each iteration you are reloading “fake1.dta”, taking a new randomly-chosen 5%
sample, and using that sample to generate estimates for $\beta_0$ and $\beta_1$. Save your results
from all 20 iterations in a table and use them to calculate $\mathrm{bias}(\hat{\beta}_0)$ and $\mathrm{bias}(\hat{\beta}_1)$. Of your
20 samples, what is the closest and the farthest that you come from recovering the true
values of $\beta_0$ and $\beta_1$ in any individual sample? Report the following statistics:
$\min|\hat{\beta}_0 - \beta_0|$, $\min|\hat{\beta}_1 - \beta_1|$, $\max|\hat{\beta}_0 - \beta_0|$, and $\max|\hat{\beta}_1 - \beta_1|$.¹
d) Repeat the exercise in parts (b) and (c), except this time you will incorrectly replace z
with x on each of the 20 iterations. Report $\mathrm{bias}(\hat{\beta}_1)$, $\min|\hat{\beta}_1 - \beta_1|$, and $\max|\hat{\beta}_1 - \beta_1|$.
e) Are your sample results from part (c) for $\mathrm{bias}(\hat{\beta}_0)$ and $\mathrm{bias}(\hat{\beta}_1)$ consistent with the
theoretical properties of correctly specified OLS estimators? Are your sample results
from part (d) consistent with what you learned from problem #2 about the theoretical
properties of an OLS estimator that is incorrectly specified to use the wrong explanatory
variable? Please explain your answers.
¹ Stata hint: After typing in the commands for the first iteration in part (b), you can use the review window to click
on those same commands 19 more times, rather than typing them again.
PART III: EMPIRICAL ANALYSIS²
4) Use airfare.dta to answer the following questions. (15 points: 5/5/5)
(i) Report the mean, standard deviation, minimum, and maximum airfare for: (a) one-way
flights of less than 500 miles, (b) one-way flights between 500 and 1000 miles, (c) one-way
flights between 1000 and 2000 miles, and (d) one-way flights over 2000 miles.
(ii) Estimate a regression model where a one mile increase in flight distance changes the
fare by a constant dollar amount. Use your result to predict the price of flying 250 miles.
(iii) Now estimate a regression model where a one percent increase in flight distance
leads to a constant percentage change in price. Use your result to report the elasticity of
airfare to flight distance.
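The difference between the two specifications in (ii) and (iii) can be seen on synthetic data. The following is a minimal sketch in Python (standard library only). It does not use airfare.dta, the names `dist` and `fare` are assumed for illustration, and the elasticity of 0.4 is a hypothetical value. In the level-level model the slope is dollars per mile. In the log-log model the slope is the elasticity.

```python
import math
import random


def ols(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


random.seed(4)

# Synthetic data with a hypothetical constant elasticity of fare w.r.t. distance.
TRUE_ELASTICITY = 0.4
dist = [random.uniform(100, 2500) for _ in range(400)]
fare = [20.0 * d ** TRUE_ELASTICITY * math.exp(random.gauss(0, 0.1)) for d in dist]

# Level-level model: fare = a0 + a1 * dist, so a1 is dollars per extra mile.
a0, a1 = ols(dist, fare)
predicted_fare_250 = a0 + a1 * 250

# Log-log model: log(fare) = g0 + g1 * log(dist), so g1 is the elasticity.
log_dist = [math.log(d) for d in dist]
log_fare = [math.log(f) for f in fare]
g0, g1 = ols(log_dist, log_fare)

print("dollars per mile (level-level):", a1)
print("predicted fare at 250 miles:   ", predicted_fare_250)
print("estimated elasticity (log-log):", g1)
```

Because the synthetic fares are generated with a constant elasticity, the log-log slope should land near 0.4, while the level-level slope depends on the units of both variables.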
5) Is there adverse selection in the market for health care? I have obtained state-level data on health outcomes for the share of the population with health insurance for 2004, 2005, 2008,
2009, and 2010. These data are from the Behavioral Risk Factor Surveillance System. This
question asks you to investigate the data, run some regressions, and interpret the results. The
file BRFSS.dta contains data on state population, state population with health insurance, and
health outcomes for the insured population.
(20 points: 5/5/5/5)
a) Generate a variable, share_insured, that measures the share of the state population with health insurance. Report summary statistics for share_insured for each year (mean, st.
dev, min, max). Did the share of people with health insurance in the average state
increase during the 2000s?
b) Use a simple linear regression model to estimate how the share of people with health insurance impacts health outcomes for the insured population. Report slope coefficients,
their standard errors, the number of observations, and the $R^2$ in the table below.
² Stata hint: You might find the “if” and “bysort” commands helpful on this part of the assignment.
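Dependent Variable (one column per regression):

|                   | avg. days health not good | avg. days health prevented regular activity | disability (%) | use equipment because of disability (%) | exercise in past month (%) | asthma (%) | diabetes (%) |
|-------------------|-------|-------|-------|-------|-------|-------|-------|
| Slope coefficient | ___   | ___   | ___   | ___   | ___   | ___   | ___   |
| (standard error)  | (___) | (___) | (___) | (___) | (___) | (___) | (___) |
| N                 | ___   | ___   | ___   | ___   | ___   | ___   | ___   |
| $R^2$             | ___   | ___   | ___   | ___   | ___   | ___   | ___   |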
c) Based on your results, how would increasing the share of the state population with health insurance by 1% affect the average days that insured consumers report their health is not
good? How would it affect the percentage of the insured consumers with diabetes? Are
these results consistent with the presence of adverse selection in the market for health
insurance? Explain your answer.
d) Does it seem reasonable to expect that the model we estimated in part (b) provides an unbiased estimator for the impact of health insurance on health outcomes in the insured
population? If so, justify your answer by explaining why you suspect SLR.1 through
SLR.4 are satisfied. If not, explain why you suspect one or more of the four SLR
assumptions are violated.