ECN 425: Introduction to Econometrics

Alvin Murphy Arizona State University: Fall 2018

Assignment #2

Due at the beginning of class on Tuesday, September 25th

PART I: MULTIPLE REGRESSION ANALYSIS & HYPOTHESIS TESTING

1) Suppose that the wage paid to manufacturing workers in dangerous occupations (e.g. mining, slaughterhouses, chemical processing) depends primarily on two factors: on-the-job experience and risk of death, both of which increase wages. Furthermore, suppose that a worker’s risk of death generally declines with his experience on the job. Now suppose the National Association of Manufacturing Workers wants to predict the impact of a proposed safety regulation on wages. Their analyst predicts that the regulation will cause wages to decline by $\hat{\beta}_1 \Delta ROD$, where $\Delta ROD < 0$ is the reduction in the risk of death on the job that will result from the regulation and $\hat{\beta}_1 > 0$ is an estimate for the slope coefficient of the following model:

$$\text{wage} = \beta_0 + \beta_1 ROD + u$$

Is $\hat{\beta}_1 \Delta ROD$ likely to be an unbiased estimator for the change in wages caused by the regulation? If so, explain why. If not, explain the likely direction of the bias. Write down a formal model for the bias to support your answer.

(10 points)
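
As a reminder of the general shape a bias model of this kind takes, here is the standard omitted-variable result in generic notation. This is a sketch, not specific to this problem: the symbols $x_1$, $x_2$, $\tilde{\beta}_1$, and $\tilde{\delta}_1$ are placeholders that do not appear in the problem above, and the expectation is conditional on the sample values of the regressors.

```latex
\begin{align*}
\text{True model:} \quad & y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u \\
\text{Estimated (short) model:} \quad & \tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1 \\
\text{Auxiliary regression:} \quad & \tilde{x}_2 = \tilde{\delta}_0 + \tilde{\delta}_1 x_1 \\
\text{Expected value:} \quad & E(\tilde{\beta}_1) = \beta_1 + \beta_2 \tilde{\delta}_1
\end{align*}
```

The sign of the bias term $\beta_2 \tilde{\delta}_1$ is the product of two signs: the effect of the omitted variable on $y$, and the sample relationship between the omitted variable and $x_1$.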

2) The following model is a simplified version of the multiple regression model used by Biddle and Hamermesh (1990) to study the tradeoff between time spent sleeping and working and to look at other factors affecting sleep.

$$\text{sleep} = \beta_0 + \beta_1 \text{totwrk} + \beta_2 \text{educ} + \beta_3 \text{age} + u,$$

where sleep and totwrk (total work) are measured in minutes per week and educ and age are measured in years. Biddle and Hamermesh estimated the following equation (standard errors in parentheses):

$$\text{sleep} = \underset{(112.28)}{3{,}638.25} - \underset{(.017)}{.148}\,\text{totwrk} - \underset{(5.88)}{11.13}\,\text{educ} + \underset{(1.45)}{2.20}\,\text{age} + \hat{u}$$

N=706, SSR=123,455,057

(20 points: 5/5/5/5)

(i) Explain the appropriate interpretation of the coefficient on totwrk.

(ii) Suppose the average undergraduate program takes 4 years to complete. All else constant,

what do the regression results imply about the average impact of an undergraduate degree

on sleep per week?

(iii) Test whether the coefficient on totwrk is statistically different from one at the 1% level. State the null hypothesis, the alternative hypothesis, and the critical value above which you would reject the null hypothesis. Then use this information to perform the appropriate test.

(iv) Dropping educ and age from the regression gives:

$$\text{sleep} = \underset{(38.91)}{3{,}586.38} - \underset{(.017)}{.151}\,\text{totwrk} + \hat{u}$$

N=706, SSR=124,858,119

Set up a test to determine whether educ and age are jointly significant in the original equation at the 1% level. State the null hypothesis, the alternative hypothesis, and the critical value above which you would reject the null hypothesis. Then use this information to perform the appropriate test.
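
The arithmetic behind a joint test like the one in (iv) can be sketched in Stata as follows. The sketch assumes the SSR form of the F statistic, F = [(SSR_r - SSR_ur)/q] / [SSR_ur/(n - k - 1)], and uses the SSR values reported above. The scalar names (ssr_r, ssr_ur, q, df_ur, f_stat) are illustrative choices, not part of the assignment. The functions invFtail() and invttail() return upper-tail critical values.

```stata
* Sketch: F statistic for q exclusion restrictions, computed from the
* restricted and unrestricted sums of squared residuals.
* SSR of the restricted model (educ and age dropped)
scalar ssr_r  = 124858119
* SSR of the unrestricted (original) model
scalar ssr_ur = 123455057
* Number of restrictions being tested
scalar q      = 2
* Degrees of freedom in the unrestricted model: n - k - 1
scalar df_ur  = 706 - 3 - 1

scalar f_stat = ((ssr_r - ssr_ur)/q) / (ssr_ur/df_ur)
display "F statistic: " scalar(f_stat)

* Upper-tail 1% critical value of the F(q, df_ur) distribution
display "1% F critical value: " invFtail(scalar(q), scalar(df_ur), 0.01)

* Two-sided 1% critical value of the t distribution (0.5% in each tail),
* as needed for a single-coefficient test such as the one in (iii)
display "1% two-sided t critical value: " invttail(scalar(df_ur), 0.005)
```

Comparing the computed statistic with the displayed critical value is then the test itself; the same pattern applies to any pair of nested regressions.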

3) Consider the following model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$, which satisfies the classical linear model assumptions. Suppose you would like to test the null hypothesis $H_0: \beta_1 - 2\beta_3 = 0$.

(15 points: 5/5/5)

(i) Write the t statistic for testing $H_0: \beta_1 - 2\beta_3 = 0$ as a function of $\hat{\beta}_1$ and $\hat{\beta}_3$.

(ii) Let $\hat{\beta}_1$ and $\hat{\beta}_3$ denote the OLS estimators of $\beta_1$ and $\beta_3$. Solve for $\mathrm{Var}(\hat{\beta}_1 - 2\hat{\beta}_3)$ in terms of the variances of $\hat{\beta}_1$ and $\hat{\beta}_3$, and the covariance between them. You may find it helpful to use the properties of variances in Wooldridge appendix B.

(iii) Let $\theta_1 = \beta_1 - 2\beta_3$. Use this equation to rewrite the model in a way that allows you to estimate $\theta_1$ and its standard error directly. Your equation should be a function of $\theta_1$, $\beta_2$, and $\beta_3$.
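
For reference, the variance property most relevant to part (ii) can be stated in generic notation. Here $a$ and $b$ are arbitrary constants and $X$ and $Y$ are any two random variables; these symbols are placeholders and are not the problem's own notation.

```latex
\[
\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)
\]
```

Note that the constants enter the covariance term with their signs, so a negative constant changes whether that term is added or subtracted.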

4) Consider the following model for how state employment depends on the tax revenue mix:

$$\log(\text{employment}) = \beta_0 + \beta_1\,\text{share\_property} + \beta_2\,\text{share\_income} + \beta_3\,\text{share\_sales} + u,$$

where share_property is the share of property taxes in total tax revenue, share_income is the share of income taxes, and share_sales is the share of sales taxes. The share of tax revenue from all other sources (share_other) is omitted. The shares have all been multiplied by 100 so that, for example, a value of 50 for share_property would indicate that 50% of total tax revenue comes from property taxes.

(15 points: 5/5/5)

(i) Explain why share_other is omitted.

(ii) Provide a careful interpretation of $\beta_2$.

(iii) Suppose share_income and share_sales are highly correlated. How would you expect this correlation to affect the bias and variance of the OLS estimator for $\beta_2$? In principle, how could you mitigate any potential concerns about the effect of this correlation on the bias and variance?

PART III: EMPIRICAL ANALYSIS

(4) On HW#1 you assessed the potential impact of adverse selection in the market for health care. Specifically, you regressed a variety of health outcomes experienced by people with

health insurance on the share of a state’s population with health insurance. One potential

concern with the univariate model is that omitted variables could bias the estimator.

Fortunately, the BRFSS survey also collects data on a variety of demographic variables

including age, employment status, gender, and income. This problem asks you to use these

additional variables to reassess the impact of adverse selection. The more complete data are

posted as “BRFSS2.dta” on blackboard.

(20 points: 5/5/5/5)

(i) Fill in the following table of regression coefficients. Use asterisks after the coefficients to indicate the level of statistical significance, as follows: *** indicates

the coefficient is statistically different from zero at the 1% level. ** indicates the

coefficient is statistically different from zero at the 5% level. * indicates the

coefficient is statistically different from zero at the 10% level.

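Dependent Variable (columns)

| Independent Variable | avg. days health not good | avg. days health prevented regular activity | disability (%) | use equipment because of disability (%) | exercise in past month (%) | asthma (%) | diabetes (%) |
|---|---|---|---|---|---|---|---|
| share_insured | | | | | | | |
| age25to64 | | | | | | | |
| ageover65 | | | | | | | |
| selfemployed | | | | | | | |
| unemployed_1yr | | | | | | | |
| unemployed_lessyr | | | | | | | |
| homemaker | | | | | | | |
| student | | | | | | | |
| retired | | | | | | | |
| unable_work | | | | | | | |
| income15to50 | | | | | | | |
| incomeover50 | | | | | | | |
| female | | | | | | | |
| Intercept | | | | | | | |
| N | | | | | | | |
| $R^2$ | | | | | | | |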

(ii) Discuss how adding the additional explanatory variables affects your findings on adverse selection compared to homework #1.

(iii) Discuss the coefficients on ageover65 in the avg health days not good and % with disability regressions. What might explain these results?

(iv) In the univariate model on homework #1, the $R^2$ was very similar in the regressions for avg days health prevented regular activity and asthma. This is no longer true in the multivariate model. What can explain this change?

PART IV: CONSISTENCY OF OLS ESTIMATORS—A FAKE DATA EXPERIMENT

(5) In this problem, we will use the file “fake1.dta” to investigate the properties of OLS estimators as the sample size increases. Recall from homework #1 that these data describe a population of 500 observations from the (true) regression equation: $y = \beta_0 + \beta_1 z + u$, such that $E(u) = 0$, $E(u \mid z) = 0$, and $\mathrm{Var}(u \mid z) = \sigma^2$.

(20 points: 4/4/4/4/4)

a) Write a program called “fake25” that takes one thousand random samples of 25 observations from the population (with replacement), and uses each one to regress y on z. This process will yield 1000 different sets of values for $\hat{\beta}_0$ and $\hat{\beta}_1$. Use these results to calculate the bias, recalling from the first homework that $\beta_0 = 10.95$ and $\beta_1 = 2.17$.

To draw a sample of 25 observations (with replacement) you should use the Stata command bsample 25.

Manually doing this 1000 times will take an enormous amount of time. Writing a short

simulation program saves us the time it would take to enter the commands for each

replication manually. This allows us to run fake data experiments with more replications.

See the Stata Hint below for information on how to do this.

b) Your results from part (a) define distributions of values for $\hat{\beta}_0$ and $\hat{\beta}_1$. Report the mean, standard deviation, min and max of this distribution for $\hat{\beta}_1$. You can also graph the distribution by typing the command: kdensity b_z. Print out a picture of your result.

c) Now repeat parts (a) and (b) to write two more programs that take increasingly large sample sizes: “fake100” should take 1000 random samples of 100 observations, and “fake500” should take 1000 random samples of 500 observations. Use your results to fill in the following table for the distribution of $\hat{\beta}_1$, using at least four decimal places:

| # observations per sample | Mean | Standard deviation | Min | Max | Bias($\hat{\beta}_1$) |
|---|---|---|---|---|---|
| 25 | | | | | |
| 100 | | | | | |
| 500 | | | | | |

d) Explain what happens to the probability distribution of $\hat{\beta}_1$ as the sample size increases. Illustrate your answer by sketching a figure that shows probability distributions for two of the different sample sizes.

e) If an estimator produces a probability distribution for $\hat{\beta}_1$ that becomes more and more tightly distributed around $\beta_1$ as the sample size grows, the estimator is called “consistent.” In HW#1 we saw that incorrectly replacing z with x biased the OLS estimator for $\beta_1$. Explore whether this bias diminishes as the sample size grows by taking 1000 random samples of 500 observations and running regressions of y on x. Does the biased estimator appear to be consistent? Explain your answer. As part of your explanation, draw a picture of your result similar to part (d).
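
The replication loop that parts (a) through (e) rely on can be sketched as below. This is a minimal illustration under stated assumptions, not the required solution. It builds a stand-in population rather than using fake1.dta, with standard normal z and u (an assumption; the distributions in the real file are not described here). The names fakepop_demo.dta and fake25_demo are hypothetical. It also uses the colon form of simulate with explicitly named results, so that the slope estimates are stored in a variable called b_z.

```stata
* Build a stand-in population of 500 observations.
* Assumption: z and u are standard normal; fake1.dta itself is not used.
clear
set seed 425
set obs 500
generate z = rnormal()
generate y = 10.95 + 2.17*z + rnormal()
save "fakepop_demo.dta", replace

* One replication: reload the population, draw 25 observations
* with replacement, and regress y on z.
capture program drop fake25_demo
program define fake25_demo
    use "fakepop_demo.dta", clear
    bsample 25
    regress y z
end

* Run 1000 replications, saving the slope as b_z and the intercept as b_cons.
simulate b_z = _b[z] b_cons = _b[_cons], reps(1000): fake25_demo

* Summarize the simulated slope estimates and compare their mean with 2.17.
summarize b_z
display "mean(b_z) - 2.17 = " r(mean) - 2.17
```

Changing the argument of bsample is the only edit needed to move to a different sample size, which is why wrapping the replication in a program pays off for the larger experiments.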

Stata Hint

In HW#1 we conducted a fake data experiment by starting with a population and then (manually)

choosing 20 random samples from that population. Rather than perform the sampling process

manually, we could have written a short program to do it automatically. For example, the

following 7 lines of code perform parts (b) and (c) of the fake data experiment from HW#1:

1. program define hw1fake

2. use "C:\Desktop\fake1.dta", clear

3. bsample round(0.05*_N)

4. reg y z

5. end

6. simulate "hw1fake" _b, reps(20)

7. sum

Lines 1 through 5 define a program that is stored in STATA’s memory. Line 1 tells STATA that we are writing a new program and naming it “hw1fake”. Lines 2, 3, and 4 define the operations that the program will perform. Line 5 tells STATA we are finished writing the new program.

When we ask STATA to run the program, it will perform lines 2 through 4. Line 2 tells STATA to open the file “fake1.dta”, which I have stored on the desktop of my computer. Line 3 tells STATA to take a 5% sample of the data (with replacement). Line 4 tells STATA to run a regression on the 5% sample. If I type hw1fake into STATA, it will perform lines 2 through 4. Try it for yourself.

Line 6 tells STATA to run the new program 20 times and to save the parameter estimates from each replication. The results are stored in the data editor, where you can view them. Finally, line 7 reports summary statistics for our 20 estimates of $\beta_0$ and $\beta_1$.