statatechniques
Econometrics Models I-II Supplementary Material
James Bland
September 27, 2017
Contents
1 Introduction 5 1.1 Summation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Types of random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Discrete random variables . . . . . . . . . . . . . . . . . . . . . . . . 7 1.2.2 Continuous random variables . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Describing one random variable . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.3.1 Cumulative density function . . . . . . . . . . . . . . . . . . . . . . . 9 1.3.2 Probability mass function . . . . . . . . . . . . . . . . . . . . . . . . 10 1.3.3 Probability density function . . . . . . . . . . . . . . . . . . . . . . . 11 1.3.4 Mean, variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
I Estimating one parameter 17
2 Estimators 18 2.1 Populations and samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2 Estimators and the sampling distribution . . . . . . . . . . . . . . . . . . . . 19 2.3 Small-sample properties of estimators . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.3.2 Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.3.3 Mean squared error . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 Inference 32 3.1 Hypothesis tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.1 One-sided hypothesis tests . . . . . . . . . . . . . . . . . . . . . . . . 36 3.2 p-values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3 Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.4 Test power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.5 The take-away . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4 Inference with asymptotic assumptions 44 4.1 Large-sample properties of sample means . . . . . . . . . . . . . . . . . . . . 44
4.1.1 The Weak Law of Large Numbers . . . . . . . . . . . . . . . . . . . . 45
1
4.1.2 A central limit theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.1.3 The continuous mapping theorem . . . . . . . . . . . . . . . . . . . . 46 4.1.4 The delta method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Large-sample properties of estimators . . . . . . . . . . . . . . . . . . . . . . 46 4.2.1 Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2.2 Asymptotic distribution . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Using large-sample properties to make hypothesis tests easier . . . . . . . . . 47 4.3.1 Hypothesis tests with asymptotic approximations . . . . . . . . . . . 47 4.3.2 Confidence intervals with asymptotic approximations . . . . . . . . . 47 4.3.3 p-values with asymptotic approximations . . . . . . . . . . . . . . . . 47
II Basics of programming and handling data in Stata 51
5 Getting started in Stata 52 5.1 Importing, saving, and exporting data . . . . . . . . . . . . . . . . . . . . . 52 5.2 Scripts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.3 The working directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
6 For loops 58
7 Types of data 60 7.1 How your computer thinks (or doesn’t think) about data . . . . . . . . . . . 60 7.2 Censored and truncated data . . . . . . . . . . . . . . . . . . . . . . . . . . 60 7.3 Categorical data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
7.3.1 Unordered . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 7.3.2 Ordered . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
8 Merging data, and wide & long formats 61 8.1 One-to-one merges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 8.2 Wide and long datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 8.3 Many-to-one and one-to-many merges . . . . . . . . . . . . . . . . . . . . . . 66
9 Marginal effects and interactions in non-linear models 69
10 Standard errors under different assumptions about � 70 10.1 Homoskedasticity: the *standard* standard errors . . . . . . . . . . . . . . . 71 10.2 Heteroskedasticity: reg y x, robust . . . . . . . . . . . . . . . . . . . . . . 73 10.3 Clustering: “I think you have 3 statistically independent observations” . . . 74 10.4 Autocorrelated errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
2
III Some common econometric techniques 78
11 Ordinary Least Squares (linear regression) 79 11.1 regress: Implementing OLS in Stata . . . . . . . . . . . . . . . . . . . . . . 79 11.2 esttab: Producing publication-quality regression outputs in Stata . . . . . . 81
12 Maximum Likelihood 82
13 Instrumental variables (2SLS) 84 13.1 Over-identification test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
14 Time series 89 14.1 Diagnostics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
14.1.1 Autocorrelation and partial autocorrelation functions . . . . . . . . . 89 14.2 Declaring time series datasets and dealing with lagged variables . . . . . . . 92
IV Advanced reg-monkeying 93
15 Looping over variables: one reg y x, robust, many regressions. 94
V Simulation techniques 98
16 An introduction to Monte Carlo techniques 99 16.1 Stata’s (pseudo) random number generators . . . . . . . . . . . . . . . . . . 99 16.2 Using random number generators . . . . . . . . . . . . . . . . . . . . . . . . 99 16.3 Stata’s simulate command . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
17 Simulations with OLS 106 17.1 Method 1: Load the variables you want to keep constant when you run the
program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 17.2 Method 2: Keep x in memory . . . . . . . . . . . . . . . . . . . . . . . . . . 107
18 Techniques for drawing random numbers 111 18.1 Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
18.1.1 What inversion is and how it works . . . . . . . . . . . . . . . . . . . 111 18.1.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
19 Using pseudo random numbers to calculate things 114 19.1 Monte Carlo Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
19.1.1 Expectations of random variables . . . . . . . . . . . . . . . . . . . . 114 19.1.2 Expectations of functions of random variables . . . . . . . . . . . . . 115 19.1.3 Expectations when you can’t draw directly from X . . . . . . . . . . 115 19.1.4 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
3
A Solutions to selected problems 121
4
Chapter 1
Introduction
Before getting into the more interesting material, it is important that we are all on the same page in terms of base knowledge. While I hope that there are parts of the course coming up that you can blast through with great understanding and intuition, the reality is that there are a few things that you will need to be able to do in your sleep very early on in the course. I will cover these in class, but please make sure you can do them as soon as possible.
1.1 Summation
[See Appendix A of Bailey (2016)] In Econometrics, we can’t avoid adding things up. For example, when we compute a
sample mean, we add up all the values in the sample, and divide by the sample size. In practice, we will get our computer to do the heavy lifting for us, but we need to understand what it’s doing, and have some notation to make this more compact. For one thing, id we are computing a sample mean, our hope is that we have lots of observations, and so we will need to add a lot of things up. It is cumbersome, for example, to write the sum of integers between 1 and 10 (inclusive), as:
1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 = 55 (1.1)
Alternatively, we can write:
10∑ k=1
k = 55 (1.2)
We can read (1.2) as follows:
• The summation symbol “ ∑
” (Greek capital sigma) tells us that we are adding things.
• The “k” to the right of the summation symbol is the thing we are adding.
• The “k = 1” underneath the summation symbol tells us that we are using k as an index, and we start with k = 1.
5
• The “10” above the summation symbol tells us that we stop summing when we get to 10 (inclusive).
We can, and will, get more sophisticated than this. For example:
4∑ k=1
2k2 = 12 + 22 + . . . + 422 (1.3)
12∑ k=1
k − 1 k
= 0
1 +
1
2 +
2
3 + . . . +
11
12 (1.4)
3∑ l=1
l∑ k=1
lk = 1 × 1 + 2 × 1 + 2 × 2 + 3 × 1 + 3 × 2 + 3 × 3 (1.5)
Note that in the last equation, this is a double summation. The index of the leftmost summation (l) appears in the rightmost summation as the stopping point for index k.
The above examples are instructive, but not particularly useful. We are usually interested in adding up a whole lot of things, so we need some notation for “a whole lot of numbers.” To do this, let’s start with some notation for an arbitrary, indexed number, xk. You can interpret this as the kth number in a set of numbers. We can denote “a whole lot of numbers”, formally a “set of numbers”, as:
{xk}Kk=1 = {x1,x2, . . . ,xK} (1.6)
That is, we have a set of K (a positive integer) numbers, which we index by k = 1, 2, . . . ,K. Here’s an example. My drive to work involves driving the following distances, in miles
(each distance is the drive distance between turns on Google Maps):
x1 = 0.3, x2 = 1.4, x3 = 0.1, x4 = 0.5, x5 = 0.0,
x6 = 1.8, x7 = 4.2, x8 = 0.3, x9 = 3.4, x10 = 1.8,
x11 = 0.2, x12 = 0.8, x13 = 0.2, x14 = 0.1, x15 = 0.1
We could denote this dataset as {xt}15t=1, and then calculate: 15∑ t=1
xt = 15.4 miles total distance
3∑ t=1
xt = 1.8 miles distance to 3rd turn
15∑ t=10
xt = 3.2 miles distance left after making 9 turns
15∑ t=1
x2t = 38.9 miles 2 total squared distance (because why not?)
6
Note the change in units in the las summation. There are some useful properties of sums. To begin with, multiplying every component
of a sum by a constant is the same as multiplying the final value by the constant. That is (see Bailey, Appendix A):
N∑ i=1
βXi = β N∑ i=1
Xi (1.7)
Suppose, for example, that we wanted to report the total distance above, but in a more widely-accepted unit of measurement. Knowing that 1 mile = 1.6 km (accurate to 1 decimal place), we could do this by computing:
15∑ t=1
xt miles × 1.6 km
mile = 0.3 × 1.6 + 1.4 × 1.6 + . . .
alternatively, we would be smarter and use this result:
15∑ t=1
xt miles × 1.6 km
mile = 1.6
km
mile
15∑ t=1
xt miles = 1.6 km
mile × 15.4miles = 24.6 km
i.e. β = 1.6 km mile
. Note here that I made sure the unit conversion was correct by writing down the units as well as the numbers. If it’s all done correctly, the units you are trying to get rid of should cancel.
1.2 Types of random variables
There are many ways to categorize random variables, and we’ll get in to a lot of them during this course. For the moment, we will begin by introducing two important types of random variables: discrete and continuous. These do not constitute an exhaustive set (i.e. I could show you some pathological examples that are neither discrete nor continuous), but cover pretty much everything we will be interested in. The distinction between discrete and continuous is important because it tells us how we will (or at least should) analyze our data (see for example Bailey, 2016, chapter 12).
We can tell these types apart by the random variable’s support. This, loosely, is the set of values that the random variable could possibly take on. That is, let S be the support of a random variable X. If, say, 3 ∈S, then it is possible that X could take on the value 3. If 3 /∈S, then X could never be equal to 3.
1.2.1 Discrete random variables
Discrete random variables have countable supports. Formally, this means that we can assign an integer to every value that the random variable could take on. In fact, discrete random numbers are often stored as integers, even if assigning them a number does not add any value to the problem. Here are some examples of discrete random variables:
7
• The outcome of a coin toss. We could record this as Heads = 1 and Tails = 0. Hence the support is {0, 1}
• The number of days with rain in Toledo, OH in 2018. As 2018 is not a leap year, the support is {0, 1, 2, . . . , 365}.
• The number of asteroids that hit Earth after Dec 31st, 2017 before cockroaches become extinct. Note here that, in principal at least, cockroaches could remain extant forever, and so there is no upper bound on the support, hence: {0, 1, 2, . . . ,∞}
• The number of coin tosses made until four heads have been observed. It would be impossible to toss the coin fewer than 4 before this event occurs, and a particularly unlucky individual could potentially end up doing this for ever, so the support is {4, 5, 6, . . . ,∞}.
• The name of the first player in the Collingwood Football Club to kick a goal in the last round of the Australian Football League 2017 season. Since at the time of writing, the 2017 AFL season was well underway, the support for this would be the list of players on the roster: {Travis Cloke, Dane Swan, Scott Pendlebury, . . .}. However one may find it practical to handle data by assigning integers to these names (basically, computers like numbers more than strings): {1 = Travis Cloke, 2 = Dane Swan, 3 = Scott Pendlebury, . . .}. Additionally, since it is possible that Collingwood will have a particularly terrible game, one should also assign “nobody” to this support.
1.2.2 Continuous random variables
Continuous random variables have interval supports (or collections of intervals). Note that all of the above examples of discrete random variables fail this test. Examples of continuous random numbers include:
• The total precipitation in Toledo, OH between Jan 1st 2018 and Dec 31st 2018, in millimeters.
• The time between now and when the next asteroid hits earth.
• Your bank balance (note that strictly speaking, this is discrete random variable, be- cause it is an integer multiple of $0.01. however at some point it is reasonable to claim that a variable is approximately continuous and hence can be treated as continuous).
1.3 Describing one random variable
If all we had in our toolbox was “discrete” and “continuous”, we would not be able to describe random variables very well. Fortunately, we can do much better than this. To begin with, there is the cumulative density function, which completely captures anything you may
8
want to know about a random variable. If we know that the variable is either continuous or discrete, we can use either a probability mass function or probability density function, which also characterize the variable completely (once we know that it is either continuous or discrete). Finally, we can summarize particular aspects of the random variable with quantities such as mean, variance, median, etc.. These quantities don’t fully characterize the distribution, but are sometimes the most important quantities for our analysis.
1.3.1 Cumulative density function
Suppose that a random variable X has support S, which is a subset of the real number line (formally: S ⊆ R). For any particular value of x ∈ R (i.e. pick any x on the real number line), it must be that either X ≤ x, or X > x. Hence, no matter whether X is discrete or continuous, we can report the probability that X is less than or equal to any particular value of x on the real number line. Hence, we define the cumulative density function (cdf) of X as follows:
FX(x) = Pr(X ≤ x) (1.8)
-0.2 0 0.2 0.4 0.6 0.8 1 1.2 x
0
0.2
0.4
0.6
0.8
1
F U (x
)
Figure 1.1: Cumulative density function of standard uniform random variable
Note that the cdf is a function of x, a par- ticular value, and not the random variable itself. Since Pr(X ≤ X) is something that we can compute for any x ∈ R, we must make sure to specify it for the whole real number line, and not just the support of X. For example, if U is a standard uniform ran- dom variable (i.e. U is equally likely to be drawn anywhere on the unit interval), then the support of X is the unit interval (0, 1). However we can still assign a probability to U being (say) less than zero, or less than three (which would be equal to 0 and 1 re- spectively). This cdf would therefore be:
FU (x) =
0 if x < 0
x if 0 ≤ x < 1 1 if x ≥ 1
(1.9)
This is shown graphically in Figure 1.1. For discrete random variables, the cdf is defined exactly the same, but we need to take
special care of the inequality. For example, consider a 6-sided fair die roll. The support of this random variable is {1, 2, 3, 4, 5, 6}, the probability of rolling any of these is 1
6 , but
the probability of getting anything other than these is zero. Therefore, for example, the probabilities of rolling a number less than or equal to 3.01, π, 3.6, and 3.99 are all the
9
(a) 6-sided fair die (b) Binomial(5, 0.6)
0 1 2 3 4 5 6 7 x
0
0.2
0.4
0.6
0.8
1
F X (x
)
0 1 2 3 4 5 6 x
0
0.2
0.4
0.6
0.8
1
F X (x
)
Figure 1.2: Cumulative density functions for some discrete random variables.
same (i.e. they are all equal to 1 2 ). Then, as the function gets to x = 4, it jumps up to 2
3 .
Therefore, at every x in the support of a discrete random variable, the cdf jumps up, and it is flat everywhere else. For example, Figure 1.2a shows the cdf of a fair, 6-sided die roll. Figure 1.2b shows the cdf of the Binomial(5, 0.6) distribution, which can be constructed by flipping five coins coins, each with a probability of 0.6 of coming up heads, and then counting the number of heads.
1.3.2 Probability mass function
We can describe discrete random variables using a probability mass function (pmf). These take a number on the real number line, and return the probability that the random variable is equal to it. Going back to our fair die and Binomial examples in Figure 1.2, the pmf of these are:
Fair die roll : p(x) =
{ 1 6
if x ∈{1, 2, 3, 4, 5, 6} 0 otherwise
(1.10)
Binomial(5, 0.6) : p(x) =
{ 5!
x!(5−x)! 0.6 x0.45−x if x ∈{1, 2, 3, 4, 5}
0 otherwise (1.11)
These are shown graphically in Figure 1.3. Note that we can find the height of the cdf at the values in the support by adding up all of the values of the pmf between −∞ and x.
Any pmf p(x) must only return non-negative numbers (because negative probability does not make sense), and must sum to 1 (because this is the probability of drawing an x inside the support).
10
(a) 6-sided fair die (b) Binomial(5, 0.6)
0 1 2 3 4 5 6 7 x
0
0.05
0.1
0.15
0.2
0.25
p X (x
)
0 1 2 3 4 5 6 x
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
p X (x
)
Figure 1.3: probability mass functions for some discrete random variables.
1.3.3 Probability density function
We cannot use a pmf to describe continuous random variables. To see this, note that for a continuous random variable X, the probability that X is equal to a particular value is zero. For example, the probability that we will get exactly half an inch of rain tomorrow is zero. Not because half an inch of rain is not in the support of rainfall that we could get tomorrow, but because rain does not fall in discrete chunks. Instead, we use a probability density function (pdf) to describe how likely drawing particular values are. If we integrate this thing over a region, we get the probability that the random variable is drawn within this region. For example, while the probability of exactly half an inch of rain is zero, the probability of getting between 1/4 and 3/4 inches of rain is not, and also quite a meaningful number (and useful, depending on how much you care about rainfall). The pdf fX(x) therefore has the following properties:
Pr[X ∈ (a,b)] = ∫ b a
fX(x)dx (1.12)
Pr[X ≤ x] = ∫ x −∞
fX(x̃)dx̃ = Fx(x) (1.13)
d
dx Fx(x) = fx(x) (1.14)
While the 2nd and 3rd lines of equations here are implied by the first, I feel that they are worth pointing out: know how to go between pdf and cdf, and know how they relate to the pmf of a discrete variable. Like pmfs, and pdf must never return negative numbers, and must integrate to 1.
11
1.3.4 Mean, variance
While a cdf, pmf, or pdf will completely characterize a distribution, they sometimes require a bit of work to find the economically relevant values associated with this distribution. For example, a risk-neutral person cares only about the expected value of a distribution over money, and hence what we would really want to know is:
Definition 1. The mean (alternatively expected value or expectation) of random vari- able X with support SX and cdf FX(x) is equal to:
E[X] ≡ ∫ SX
xdFX(x) (1.15)
If X is a continuous random variable with pdf fx(x) = F ′ X(x), then (1.15) can be expressed
as:1
E[X] =
∫ SX
xfx(x)dx (1.16)
If X is a discrete random variable with pmf px(x), then (1.15) can be expressed as:
E[X] = ∑ x∈SX
xpx(x) (1.17)
Hence, for the standard uniform random variable (see (1.9)):
E[U] =
∫ 1 0
x× 1dx = 1
2 x
∣∣∣∣1 0
= 1
2 − 0 =
1
2 (1.18)
and for a fair die roll (see (1.10)):
E[X] = 6∑
k=1
k 1
6 =
7 × 3 6
= 3.5 (1.19)
A useful property of means is that one can add them up. For example, if we wanted to determine the expected value of the sum of two fair die rolls, say X1 and X2, then we could use our answer in (1.19) as follows:
E[X1] = E[X2] = 3.5 =⇒ E[X1 + X2] = 3.5 + 3.5 = 7 (1.20)
This also means that if Y = cX for some constant c ∈ R, then E[Y ] = E[cX] = cE[X]. However we need to be careful about non-linear functions of random variables. If h(x) is a non-linear function, then in general E[h(X)] 6= h(E[X]).
The mean gives us an idea of what we might, quite literally, “expect” X to be. However it gives us no idea about how likely we are to be “close” to this value. For example, Figure 1.4 shows the pdf and cdf of three distributions, all have the same mean, but some are more spread out than others. One measure of this is:
1If you are struggling to see this step, note the following for a continuous random variable: dFX(x)
dx =
fX(x). Then, multiplying both sides by a fancy 1 = dx dx
yields dFX(x) = fX(x)dx.
12
(a) Probability density function (b) Cumulative density function
-10 -8 -6 -4 -2 0 2 4 6 8 10 x
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
f (x
)
<2 = 1 <2 = 3 <2 = 10
-10 -8 -6 -4 -2 0 2 4 6 8 10 x
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
F (x
)
Figure 1.4: Normal distributions with different variances
Definition 2. The variance of random variable X is equal to:
V [X] ≡ E [ (X −E[X])2
] (1.21)
In words, V [X] is X’s “expected squared distance” from its mean. For example, for the uniform distribution in (1.9), the variance of X is:
V [X] =
∫ SX
(x−E[X])2dFX(x) (1.22)
=
∫ 1 0
( x−
1
2
)2 × 1dx (1.23)
=
∫ 1 0
( x2 −x +
1
4
) dx (1.24)
= 1
3 x3 −
1
2 x2 +
1
4 x
∣∣∣∣1 0
(1.25)
= 1
3 −
1
2 +
1
4 =
4 − 6 + 3 12
= 1
12 (1.26)
One can use our knowledge of expectations to further simplify (1.21) as follows:
V [X] = E [ (X −E[X])2
] (1.27)
= E [ X2 − 2XE[X] + E[X]2
] (1.28)
= E[X2] −E [2XE[X]] + E [ E[X]2
] (1.29)
= E[X2] − 2E[X]2 + E[X]2 (1.30) = E[X2] −E[X]2 (1.31)
13
where the 2nd row expands the squared term, the third recognizes that this is the expectation of the sum of some random variables, and the fourth recognizes that 2 and E[X] are constants. Since we have to compute E[X] to get to V [X] anyway, it is sometimes easier to compute E[X2] first, rather than E[(X −E[X])2] directly. For example, with the fair die roll:
E[X2] = 6∑
k=1
k2 1
6 =
1 + 4 + 9 + 16 + 25 + 36
6 =
91
6 (1.32)
V [X] = 91
6 − (
21
6
)2 =
546 − 441 36
= 105
36 ≈ 2.92 (1.33)
Note that variance and expectation are indifferent units. For example, if X is the height of a human in meters, then E[x] has units of meters, and V [X] is in square meters, an area! To express spread in the same units and the mean, we therefore sometimes take the square root of this, which we call standard deviation.
Like means, we can add the variances of two random variables, but only if they are not correlated. To see this, we go back to our definition of variance:
V [X + Y ] = E [ (X + Y −E[X + Y ])2
] (1.34)
= E [ (X + Y −E[X] −E[Y ])2
] (1.35)
= E [ ((X −E[X]) + (Y −E[Y ]))2
] (1.36)
= E [ (X −E[X])2 + (Y −E[Y ])2 + 2(X −E[X])(Y −E[Y ])
] (1.37)
= E [ (X −E[X])2
] + E
[ (Y −E[Y ])2
] + 2E [(X −E[X])(Y −E[Y ])] (1.38)
= V [X] + V [Y ] + 2E [(X −E[X])(Y −E[Y ])]︸ ︷︷ ︸ cov(X,Y )
(1.39)
Where the last term E [(X −E[X])(Y −E[Y ])] = cov(X,Y ) is the covariance of X and Y . This is a measure of how much X and Y move together in a linear way. Loosely, if cov(X,Y ) > 0, then a particularly large X means that Y is also likely to be large; conversely, cov(X,Y ) < 0 tells us that a particularly large X means that Y is likely to be small. There are many cases in econometrics where we assume (perhaps to our own peril) that a covariance is zero. In fact, a lot of this course will be devoted to what goes wrong when cov(X,Y ) 6= 0. When working through a derivation, therefore, please think carefully about why this thing migh or might not be equal to zero. Ideally, have a good story to back up your decision!
Exercises
Exercise 1.1. Let X be the sum of two fair, four-sided die rolls. That is, each die has four faces, with numbers 1, 2, 3, 4.
1. What is the support of X?
14
2. Is X a discrete or continuous random variable? Explain
3. Based on your answer to question 2, construct the pdf or pmf of X.
4. Construct the cdf of X
5. Compute E[X] and V [X].
Exercise 1.2. The exponential distribution has probability density function:
fX(x) =
{ c exp(−λx) if x > 0 0 otherwise
(1.40)
where c is a positive constant, and λ is a scale parameter.
1. What is the support of X?
2. What must the positive constant c be equal to?
3. What is the cdf of X?
4. Determine E[X] and V [X]. To do this, you will need to use integration by parts:∫ b a
u′(x)v(x)dx = [u(x)v(x)] b a −
∫ b a
u(x)v′(x)dx (1.41)
where u(x) an v(x) are both differentiable functions. You won’t need to remember this, because you can always work it out from the product rule:
[u(x)v(x)]′ = u′(x)v(x) + v′(x)u(x) (1.42)
u′(x)v(x) = [u(x)v(x)]′ −v′(x)u(x) (1.43)
then just integrate both sides.
Exercise 1.3. The cumulative density function for random variable X is:
F(x) =
0 if x ≤ 0 xα if 0 < x < 1
1 if x ≥ 1
where α > 0 is a parameter of the distribution.
1. What is the support of X, and is X a discrete or continuous random variable?
2. Calculate the pdf, f(x)
15
3. Calculate E[X] and V [X].
Exercise 1.4. An unfair coin has a probability of 1
3 coming up heads, tails otherwise. You keep flipping it
until the first time it comes up heads. Let X be the number of times you have to flip the coin.
1. What is the support of X?
2. What is the probability that X = 10? Hint: Since each coin flip is an independent event you can multiply the probability of each coin flip that needs to occur together. For example, the probability that X = 4 is equal to:
Pr[1st flip is tails] × Pr[3nd flip is tails] × Pr[3rd flip is tails] × Pr[4th flip is heads]
= 2
3 ×
2
3 ×
2
3 ×
1
3 =
23
34
3. What is the probability that X ≤ 4?
4. What is the probability mass function for X?
5. Verify that your pmf sums to 1. You can use the results that for −1 < p < 1: ∞∑ n=0
pn = 1
1 −p , and
∞∑ n=1
pn = 1 + ∞∑ n=0
pn
16
Part I
Estimating one parameter
17
Chapter 2
Estimators
An estimator is a mathematical function that takes data and gives you an estimate of some- thing. While this course will mainly focus on numerical examples, I would like you to remember that we take in information, and use this information to make educated guesses about things all the time. For example, before I go grocery shopping, I have a peek into the fridge and decide how much food I will need to buy this week: an estimate. When I drive to work, I pay attention to the traffic conditions (i.e. data), in part so I have an estimate of how fast I should be driving. In what will follow, our analysis will look somewhat more formal than this, but be mindful that there are similarities: (i) we gather information and use it to make a prediction or guess about something, (ii) there are some ways of using the information that are more useful than others, (iii) some types of information are better than other types, (iv) if we know more about how we are gathering our information, we can sometimes use this to make a better guess, and (v) more information usually makes our guess more accurate.
2.1 Populations and samples
The Frequentist approach in econometrics1 starts with the premise that there is a population parameter (or collection of parameters, the distinction is not important at all), say θ, that determines how we observe data. We observe a sample of data, say {xi}Ni=1, and use this sample to produce an estimate of θ, the property of the population that we would like to know about. Our prime objective in econometrics is to estimate properties of the population, using a sample and an estimator.
As an example, suppose that you wish to estimate the probability that tossing a particular coin results in it landing heads up.2 You decide to model the data-generating process as
1Statistics and econometrics can be divided into two philosophies: Frequentist and Bayesian. This is a course in Frequentist econometrics. For a good reference on Bayesian econometrics, see Koop et al. (2007). Whenever econometrics is mentioned without clarifying whether it is Frequentist or Bayesian, it is usually safe to assume Frequentist.
2It is at this point that I feel some need to apologize for a seemingly endless string of examples involving coin flipping. There are many reasons for my choice of this type of example. Most importantly, coin-flip
18
follows:
Hi =
{ 1 if coin flip i lands heads up
0 otherwise (2.1)
Pr[Hi = 1] = θ, θ ∈ [0, 1] (2.2) Hi ∼ iidBernoulli(θ) (2.3)
Here (2.1) defines a random variable that can take on two values, 0 and 1 (we implicitly assume that the coin has exactly two sides, and so Hi 6= 1 ⇐⇒ Hi = 0 ⇐⇒ tails). (2.2) tells us that the probability of the coin flip coming up heads is equal to θ: this is the population parameter that we want to estimate. (2.3) formalizes this further, by (i) using the formal name for a coin-flip variable (Bernoulli), and (ii) formally stating the assumption that each coin flip is an independent draw from the same distribution.3 These are all statements about the population.
Now we collect a sample from this population. In this example, this could involve flipping the coin (say) 100 times, and recording the result of each coin flip. Let’s denote this sample as {hi}100i=1. This last bit of notation denotes the collection of coin flip outcomes for 100 coin flips. We could write this out long-hand as:
{hi}100i=1 = {h1,h2,h3, . . . ,h99,h100} (2.4)
but we have better things to do (or at least I do). At this point (and forever into the future) it is very important to be clear about when we
are talking about properties of the sample and properties of the population. We will be using the former to tell us something about the latter, but they are two different things. {hi}Ni=1, the sample, is the thing we will be importing into our statistical package, then calculating means, variances, etc. of. We know the sample mean because we can calculate it. We can never know the population mean: that’s why the sample is useful!
2.2 Estimators and the sampling distribution
We now take our sample, and stick it into our estimator. Out comes an estimate. Here’s the thing: our sample is random. If we put something random into a function, in general we should expect to get something random out. In the context of our coin-flipping example starting in Section 2.1, we have a sample of 100 coin flips, and wish to use this to estimate
random variables (formally: Bernoulli random variables), are very simple to understand. Because of this, they allow for introduction of simple concepts, without the need for you to get your head around anything else that could be complicated. Additionally, coin-flip variables show up everywhere: get used to it.
3The “independent” part of this means (among other things) that if I toss two coins, knowing the outcome of one tells me nothing about whether the other outcome is heads or tails. While for practical reasons we might not give a hoot about whether the His are independent, we usually need to make an assumption about this when we estimate things. It is best to state it (and any other assumption we make) formally. That way, it is more obvious when we are doing something stupid.
19
θ, the probability that our coin comes up heads. For a lot of reasons that we will get to later on, a good estimator to use in this situation is:
θ̂ = 1
N
N∑ i=1
hi (2.5)
which happens to be the sample mean.4 Although it is not stated explicitly on the left- hand side of (2.5), θ̂ is a function of the sample: θ̂ = f
( {hi}Ni=1
) = 1
N
∑N i=1 hi, but that
is overly cumbersome, so we will stick with the notation in (2.5). We have also gone a bit more general, and written this for an arbitrary sample size N, rather than our N = 100 observations in the above example.
h1 h2 h3 θ̂ Probability 0 0 0 0 (1 −θ)3 0 0 1 1/3 θ(1 −θ)2 0 1 0 1/3 θ(1 −θ)2 0 1 1 2/3 θ2(1 −θ) 1 0 0 1/3 θ(1 −θ)2 1 0 1 2/3 θ2(1 −θ) 1 1 0 2/3 θ2(1 −θ) 1 1 1 1 θ3
Table 2.1: Sampling distribution of θ̂ when N = 3
Before exploring the properties of θ̂ when N = 100, it is instructive to understand how θ̂ behaves with stupidly small samples. Each row of Table 2.1 shows a possible sample that we could have observed, if we only tossed the coin N = 3 times. The right- most column shows the probability of observing that sample, as a function of the population parameter θ. Note that we pay attention to the order of coin flips, and hence we don’t think of the sample {1, 0, 1} as be- ing the same as {1, 1, 0}, even though they both have 2 heads and 1 tail. This is an important distinction for later on, but at the moment just be aware that there are 23 = 8 possible samples that we could have observed, with varying probability of being observed. That said, the sample mean doesn’t give a hoot about which order in which the heads and tails came, so we can add up the cells in the “Proba- bility” column of this Table to get the probability mass function of the sample mean, as a function of θ:
Pr[θ̂ = x] =
(1 −θ)3 if x = 0 3θ(1 −θ)2 if x = 1/3 3θ2(1 −θ) if x = 2/3 θ3 if x = 1
0 otherwise
(2.6)
(2.6) characterizes the distribution of our estimator θ̂, as a function of the population pa- rameter θ. We call this the sampling distribution of θ̂.
If one can see the matrix when it comes to probability mass functions, one may realize
4 . . . and here is one of the first good reasons to use this: we know a lot about sample means, so we can use them to derive properties of this estimator. More on this later.
20
0 2
4 6
8 1
0
0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1 x
theta = 0.2 theta = 0.4 theta = 0.6
Figure 2.1: Sampling distribution of Bernoulli random variable, N = 100.
this that we can write (2.6) more compactly as:
Pr[θ̂ = x] =
{( 3 3x
) θ3x(1 −θ)3(1−x) if x ∈{0, 1/3, 2/3, 1}
0 otherwise (2.7)
which if you squint hard enough, looks almost like the Binomial distribution. In fact, if you substitute in k = 3x, this is exactly what you get: the number of heads for this sampling pro- cess is distributed Binomial(3,θ). For a sample size of N, this generalizes to Binomial(N,θ). To illustrate this, Figure 2.1 shows the sampling distribution for the same estimator for a much more reasonable sample of N = 100 coin flips. As the sample size gets larger, there are more values that θ̂ can take on. For example, when N = 4 we will get one of θ̂ =0, 1/4, 2/4, 3/4, 4/4, and as N gets really large, we struggle to see that the distribution is still discrete. However note that since we are always taking a ratio of two integers, the number of heads divided by the sample size, there are some values of θ̂ that we could never get: π
4 ,
for example. Unfortunately, in general, the sampling distribution of an estimator is not easy to work
out. Most of the time, however, we can determine a few properties of this distribution. In particular, we may be interested in knowing the expected value of θ̂ (i.e.: “on average, do I get right number?”), the variance (i.e.: “how precise is my estimator?”), and how these things change with sample size (i.e.: “if my sample size gets bigger, how much better is my estimator?”). We explore some of these properties in the next section.
21
2.3 Small-sample properties of estimators
While it is usually infeasible to determine the exact sampling distribution of our estimator, we can usually derive, or at least approximate (more on this later), some properties of its distribution. That is, we might not be able to write down the cdf of θ̂, but we may be able to work out a few things, like its mean and variance. This is especially easy when our estimator is a sample mean, because we know a lot about sample means, and is particularly useful when there is more than one estimator that could do the job for you. If there is more than one option, it usually pays to think at least a bit about which one will work best for you.
To illustrate this, suppose for example that a friend of yours was rolling a fair5 die, and calling out the numbers. Your problem is that you don’t know how many sides the die has, and you would like to estimate it. Let η be the number of sides on the die. Your data {ki}Ni=1 consists of the outcomes of the N die rolls that your friend has called out.6 Here are two estimators that you may want to consider:
1. Noting that for an η-sided die, the expected value of a roll is E[ki] = η+1 2
, you replace E[ki] with its sample analog, k̄ =
1 N
∑ i ki and solve for η:
η̂ = 2k̄ − 1, i.e.: k̄ = η̂ + 1
2 (2.8)
2. Noting that the highest possible value of ki that you could observe is ki = η, you use the maximum:
η̃ = max i {ki} (2.9)
Both of these take a property of the population (the mean and maximum respectively), and then use the sample analog of this. Unsurprisingly, this is often referred to as an analogy estimation strategy.7 We will be introduced to some important properties of estimators below, in the context of η̂ and η̃. Neither will come out as unambiguously better. Get used to it! If we (economists) assume people can make trade-offs, we’d better be able to make them ourselves. But before getting into this, suppose that you observed the following sample:
{1, 1, 1, 1, 1, 6} =⇒ θ̂ = 2 11
6 − 1 ≈ 2.7, η̃ = 6
One alarming property of θ̂ (that is not discussed below) is that our estimate of 2.7, even if we round it up to the nearest integer, could not possibly be believable, because we observe a 6 in our sample! We would never run into this problem for θ̃.
5That is: each number is equally likely to be the outcome. 6At this point, you may be telling me “But James, almost all dice are 6-sided, and there are a few 20-sided
dice out there, but it’s really hard to get your hands on a 42-sided die. Isn’t this some information that we shouldn’t be ignoring?” To which my response would be: “Yes, go and learn Bayesian econometrics.”
7Analogy estimators are reasonably easy to come up with, but there is no guarantee that they have any nice properties. Later on you will learn about maximum likelihood (ML) estimation, which is a systematic way to come up with an estimator that has some very nice properties. Another example of this is a Generalized Method of Moments (GMM) estimator.
22
2.3.1 Bias
While we have no guarantee that our estimator gives us the right number (i.e. θ̂ = θ) for sure, we can assess whether we get the right number on average. Specifically, we can compare E[θ̂] to θ. If they are equal, i.e. E[θ̂] = θ, then we say that our estimator is unbiased. On the other hand, if E[θ̂] 6= θ, then our estimator is biased, and we might want to worry.
Now let’s evaluate the properties of our estimators η̂ and η̃ described earlier. For the estimator based on the population mean:
E[η̂] = E [ 2k̄ − 1
] (2.10)
= 2E[k̄] − 1 (2.11)
= 2E
[ 1
N
∑ i
ki
] − 1 (2.12)
= 2
N
∑ i
E[ki] − 1 (2.13)
= 2
N NE[ki] − 1 (2.14)
= 2E[ki] − 1 (2.15)
= 2 η + 1
2 − 1 (2.16)
= η (2.17)
In short, E[η̂] = η ⇐⇒ η̂ is unbiased. Good! In expectation (loosely: “on average”) we get the right number.
Now let’s look at the estimator based on the maximum:
E[η̃] = E[max i ki] (2.18)
This opens up a bit more of a can of worms, because now we need to take an expectation over the maximum of the kis in our sample. Welcome to the wonderful world of order statistics! Specifically, the first order statistic (i.e. the maximum of a sample). We first need to derive the distribution of maxi ki. Let M be this maximum, to make notation easier. What is the probability that M is equal to a particular value m? That is, what is the pmf of M?
p(m) = Pr[N − 1 observations are less than or equal to m and at least one is equal to m] (2.19)
=
( N
1
)( m
η
)N−1 1
η = NmN−1
ηN (2.20)
23
if m = 1, 2, 3, . . . ,η, and zero otherwise. We can now take the expectation of M:
E[η̃] = E[M] =
η∑ m=1
[ m NmN−1
ηN
] (2.21)
= Nη
N + 1
η∑ m=1
[ (N + 1)mN+1−1
ηN+1
] (2.22)
= η N
N + 1 (2.23)
Where the last line follows by noting that the thingy that we are summing is the pmf of M, if we have one extra observation in our sample, and so it must sum to 1. Remember this monkey trick, it will come in handy! Inspection of (2.23) yields some sad news: η̃ is biased. On average we will under-estimate η by the fraction N
N+1 . This should not be too
surprising: For any sample size, there is a non-zero probability that the maximum is not equal to η, and so some of the terms in the above expectation calculation put positive weight on outcomes that are less than η. Further inspection of (2.23) and a bit of thinking (!), however, shows that all is not lost. Firstly, as our sample size gets large, N
N+1 → 1, and
so the bias disappears. This is a common property of many (but by no means all) biased estimators, and may be why we might prefer one to an unbiased estimator if we think our sample size is large enough. Secondly, since the bias is only a function of N, we can easily correct for this by multiplying the maximum by N+1
N , that is:
η̌ = N + 1
N max i ki =
N + 1
N η̃ (2.24)
E[η̌] = E
[ N + 1
N η̃
] = N + 1
N E[η̃] =
N + 1
N
N
N + 1 η = η (2.25)
By deriving the bias of this estimator we were not only able to say something about the direction of the bias (i.e. η̃ under-estimates the population parameter on average), but we also came up with another one that was unbiased! You should probably remember that.
2.3.2 Variance
If our estimator is unbiased, or at least if the bias is something that we can cope with, the next question we might ask is: how precise is our estimator? That is, through the sampling process, do our estimates all fall nice and close to their mean, or are they all over the place? We typically use variance to evaluate this. After checking bias, we found that while η̂ was the only unbiased estimator in consideration, we could easily correct the bias in η̃. Therefore
24
this should be the next thing to check in our die-rolling example. Firstly, for η̂ :
V [η̂] = V [ 2k̄ − 1
] (2.26)
= 4
N V [ki], (assumed independence here) (2.27)
= 4
N
η2 − 1 12
(2.28)
= η2 − 1
3N (2.29)
Inspection of (2.29) tells us a few things. Firstly, if η = 1, then V [η̂] = 0. This should not be surprising, but it is comforting: if we have a one-sided “die”, then we will always get the same number, and hence have zero variance. Perhaps of more use is that the variance is (i) increasing in η, and (ii) decreasing in N. (ii) is typical of almost anything you will end up using, and loosely can be interpreted as “bigger samples are better”.
Now let’s move to our second estimator, η̃. For reasons that should become obvious after you do this over and over again, we are going to use the relationship V [X] = E[X2]−E[X]2. In case they are not obvious now, these reasons are (i) we already know E[X], and (ii) it is easier to evaluate E[X2] on its own than try to do V [X] in one fell swoop.
E[η̃2] = E [ M2 ]
(2.30)
=
η∑ m=1
[ m2
NmN−1
ηN
] (2.31)
=
η∑ m=1
[ NmN+1
ηN
] (2.32)
= NηN+2
ηN (N + 2)
N∑ m=1
[ (N + 2)mN+1
ηN+2
] (2.33)
= η2 N
N + 2 (2.34)
V [η̃] = η2
[ N
N + 2 − (
N
N + 1
)2] (2.35)
= η2 N3 + 2N2 + N −N3 − 2N2
(N + 2)(N + 1)2 (2.36)
= Nη2
(N + 2)(N + 1)2 (2.37)
as with η̂, the variance of η̃ decreases as sample size increases. To see this, note that we have N in the numerator, and a cubic in the denominator, so the denominator grows much faster.
But which of η̂ and η̃ is better based on variance? It turns out that for almost all
25
reasonable values of η and N, V [η̃] < V [η̂], which can be shown as follows:
V [η̃]
V [η̂] =
Nη2
(N + 2)(N + 1)2 ×
3N
η2 − 1 (2.38)
= Nη2
(N + 2)(N2 + 2N + 1) ×
3N
η2 − 1 (2.39)
= Nη2
N3 + 2N2 + N + 2N2 + 4N + 2 ×
3N
η2 − 1 (2.40)
= Nη2
N3 + 4N2 + 5N + 2 ×
3N
η2 − 1 (2.41)
= η2
N + 4 + 5/N + 2/N2 ×
3
η2 − 1 (2.42)
which → 0 as N →∞.8 Furthermore, the denominator in the first fraction is at least 5,9 so we can say that:
V [η̃]
V [η̂] <
3η2
5(η2 − 1) (2.44)
and when is this fraction less than one?
3η2
5(η2 − 1) ≤ 1 (2.45)
3η2 ≤ 5η2 − 5 (2.46) 5 ≤ 2η2 (2.47) η ≥ 2 (2.48)
Note that the mathematical solution to (2.47) is η ∈ (−∞,− √
2.5] ∪ [ √
2.5,∞), however we can discard the negative part of this because our die can only take on positive numbers, and we can round up the lower bound of
√ 2.5 to 2 because our die has an integer number of
sides. Hence η ≥ 2 is the econometric solution to the problem. In short, η̃ has a smaller variance than η̂, as long as we don’t have a one-sided die.
2.3.3 Mean squared error
A small variance is a good thing, but only if the estimator’s distribution is centered (at least roughly) around the true value. That is, if the estimator is substantially biased, why should
8 Alternatively, note that:
V [η̃] = Nη2
(N + 2)(N + 1)2 <
Nη2
(N + 0)(N + 0)2 = Nη2
N3 =
η2
N2 (2.43)
Which is only a little bit more than 3/N × V [η̂] = η 2−1 N2
, so for even very small sample sizes (say N > 4), using the maximum is better.
9I.e.: ignore the 5/N and 2/N2 terms, and set N = 1.
26
be care that the variance is small? We shouldn’t! To see this, let’s construct a silly but illustrative example. Suppose that you can use estimators for population parameter θ with the following sampling distributions:10
Pr[θ̂ = x] =
{ 1 if x = 2
0 otherwise (2.49)
Pr[θ̃ = x] =
{ 1 5
if x ∈{θ − 2,θ − 1,θ,θ + 1,θ + 2} 0 otherwise
(2.50)
The first estimator θ̂ returns an estimate of 2 no matter what data we get. This should look like a silly choice, but θ̂ does have the following desirable property: V [θ̂] = 0. If we were to judge estimators based only on their variance, we could do no better than θ̂! The problem with this estimator is that it is biased: E[θ̂] = 2 6= θ (unless the true value θ is also equal to 2, but we can’t know that). θ̃, on the other hand, is unbiased, because:
E[θ̃] = 2∑
k=−2
θ + k
5 = θ (2.51)
but has a non-zero (hence realistic) variance of:
V [θ̃] = 2∑
k=−2
1
5 (θ + k −θ)2 =
1
5 (22 + 12 + 02 + 12 + 22) =
10
2 = 5 (2.52)
One useful measure to use in these cases is mean squared error (MSE). Unlike variance, which asks how far away (in terms of squared distance) on average is an estimator from its expected value, MSE asks how far away our estimator is from its true value. Let’s put these side-by-side to see the difference:
V [θ̂] = E
[( θ̂ −E[θ̂]
)2] (2.53)
MSE[θ̂] = E
[( θ̂ −θ
)2] (2.54)
Comparing (2.53) and (2.54), note the only difference is that for the MSE equation, we replace the expected value of the estimator, E[θ̂], with population parameter that we are trying to estimate, θ. Hence, these two things will be equal if and only if E[θ̂] = θ. To see
10Note here that I have abstracted away from the sampling process and just written down a probability distribution for each estimator. In the background, there may be a function taking data and returning an estimate, but this is unnecessary for the example. If you prefer, you can think about these as signals containing information about θ (which is pretty much what an estimator is, anyway).
27
the “only if” part of this, we can decompose (2.54) as follows:
MSE[θ̂] = E
[( θ̂ −θ + E[θ̂] −E[θ̂]
)2] (2.55)
= E
[(( θ̂ −E[θ̂]
) + ( E[θ̂] −θ
))2] (2.56)
= E
[( θ̂ −E[θ̂]
)2] + E
[( E[θ̂] −θ
)2] + 2E
[( θ̂ −E[θ̂]
)( E[θ̂] −θ
)] (2.57)
= V [θ̂] + Bias2[θ̂] + 0 (2.58)
The third term in (2.57) is equal to zero because E[θ̂] and θ are constants, and E [ θ̂ −E[θ̂]
] =
0. Hence the MSE of an estimator is equal to the estimator’s bias squared plus its variance. How well do our estimators η̂ and η̃ from the previous section stack up based on MSE? Using this formula, we already have the hard part done:
MSE[η̂] = η2 − 1
3N + 0 (2.59)
MSE[η̃] = Nη2
(N + 2)(N + 1)2 + η2
( 1
N + 1
)2 (2.60)
= η2 2N + 2
(N + 2)(N + 1)2 (2.61)
= η2 2
(N + 2)(N + 1) (2.62)
MSE[η̃]
MSE[η̂] = η2
2
(N + 2)(N + 1) ×
3N
η2 − 1 (2.63)
= η2
η2 − 1 ×
6
(N + 2)(1 + 1/N) (2.64)
This leads to some ambiguity, but none that can’t be dealt with with a bit of thinking. To begin with, the fraction 6
(N+2)(1+1/N) → 0 as N → ∞, so as long as our sample is large
enough we should probably use η̃. Additionally, the first term η 2
η2−1 is reasonably close to 1
for any integer greater than about η = 3,11 so we need not fret too much about it.
11“Close” is a judgment call on my part, but plug in some numbers an see for yourself.
28
Exercises
Exercise 2.1. Consider the distribution studied in Exercise 1.3. We derived the following properties:
f(x) =
{ αxα−1 if 0 < x < 1
0 otherwise
E[X] = α
α + 1
This motivates the following estimator:
α̂ = 1 N
∑N i=1 Xi
1 − 1 N
∑N i=1 Xi
which is the sample analog of:
α = E[X]
1 −E[X]
An alternative estimator for α̃ is:12
α̃ = − N∑N
i=1 log(Xi)
1. Download the dataset ExBetaSim 1.csv from Blackboard, which contains a simulated sample from this distribution. Use both estimators to estimate α.
2. (Simulation exercise) Fix α = 0.7. Simulate the properties of these estimators for a sample size of N = 30. Are they biased? Does one stand out as better than the other?
Hint: You can simulate the distribution of X by transforming uniform random num- bers. Specifically, if U ∼ U[0, 1], then:
X = U 1 α
will have the correct distribution.
Exercise 2.2 (Solutions provided). The exponential distribution can be characterized by the pdf:
fX(x) =
{ µ−1 exp(−x/µ) if x > 0 0 otherwise
(2.65)
12This the maximum likelihood estimator of α, which is not important here, but we may cover this later in the year.
29
where µ > 0 is a scale parameter. This distribution has the following properties:
E[Xk] = k!µk (2.66)
Xi ∼ iidExponential(µ) =⇒ min{X1,X2,X3, . . . ,XN}∼ Exponential(µ/N) (2.67)
We could plug in k = 1 to the first property, and use:
µ̂ = 1
N
N∑ i=1
Xi, analogy of: µ = E[X] (2.68)
Alternatively, we could use the second property to construct the estimator:
µ̃ = N min i {Xi}, analogy of:
µ
N = E
[ min i {Xi}
] (2.69)
1. Derive the bias, variance, and MSE of these two estimators. Which one would you prefer to use? Hint: use the first property extensively!
2. (Simulation exercise) Simulate the properties of both estimators when µ = 1 and N = 30. Your answer should include your approximation of the bias, variance, and MSE of the estimators, as well as a plot showing the pdfs of both estimators.
Hint: If U ∼ U[0, 1], then X = −µ log U ∼ Exponential(µ)
3. (Simulation exercise) In principle, you could have plugged any k into 2.66 to get an analogy estimator. plug in k = 1 and derive the analogy estimator. Don’t try to work out the bias, variance, and MSE of this analytically, just modify your code to also simulate the properties of this third estimator, say µ̌.
Exercise 2.3. We know the following about two random variables X and Y :
• X and Y are independent
• E[X] = E[Y ] = µ
• V [X] = σ21 > 0, V [Y ] = σ22 > 0.
• For some reason, we know the exact values of σ21 and σ22.
Suppose that we obtain a sample of one of each of these variables, i.e., x and y, and use it to construct an estimator for µ:
µ̂ = ax + by (2.70)
where a and b are constants chosen by the econometrician (you).
30
1. What is the expectation of µ̂?
2. For what values of a and b is µ̂ unbiased? Hint: We are looking for an expression of the form b = f(a), find out what function f is.
3. For the moment, ignore your answer to part (2). That is, consider the original expres- sion as a function of both a and b: µ̂ = ax + by. What is the variance of µ̂?
4. Substitute the condition for unbiasedness (your answer in part (2)) to your expression for V [µ̂] (your answer for part (3)). The right-hand side of this equation should contain only a, σ21, and σ
2 2.
5. What value of a minimizes the variance of µ̂? Make sure that your answer is a global minimum. (Start with your answer to part (4))
6. Write out your expressions for a and b. In what case is a = b? What about a > b? When and why would we not want to just take the simple average µ̂ = 1
2 (x+y)? Would
we ever want to set either a = 0 or b = 0?
31
Chapter 3
Inference
In Chapter 2, we introduced the concept of an estimator, and some important properties of it. In econometrics, you will be estimating stuff all the time, but equally importantly, you will be making statements about your confidence in your estimates. Formally, such a statement will take the form of a hypothesis test, a p-value, or a confidence interval.
In many textbooks, the material that follows is presented alongside asymptotic theory, which tells us how estimators behave when the sample size approaches infinity. This is where
your text will have a lot of p−→, d−→, and normal, F , and χ2 distributions. These are very useful
ideas, but I have found that they can be confusing when presented at the same time as the material I wish to teach you in this chapter. Therefore, what follows is a run-through of statistical inference in the absence of asymptotic theory, which we will get to in the next chapter. For the rest of this chapter, please note the absence of normal distribution tables, dividing mans by standard deviations, and the magic number 1.96.
To illustrate these concepts, let us go back to the coin-flipping example in Chapter 2. We have a data-generating process:
Hi ∼ iidBernoulli(θ) (3.1)
and wish to estimate θ, the probability that a coin flip will come up heads. Our research question is as follows: Is the coin a fair one? I hope that you never have to research a question as mundane as this, but once you’re done with this chapter, go and do Exercise 3.1. Hopefully by then you can see the point of it. This research question can be formalized as θ = 0.5. We collect a sample {Hi}Ni=1 of N flips of the same coin, which we assume to be independent draws from 3.1. We then use the sample mean h̄ as an estimator for θ:
θ̂ = 1
N
N∑ i=1
Hi (3.2)
On its own, this gives us a point estimate of θ, which is somewhat useful, but at this point we have no idea how close our sample is to one that would come from fair coin flip. The next sections of this chapter approach the research question (is θ equal to 0.5?) three different ways.
32
3.1 Hypothesis tests
Loosely, this first approach asks whether our sample looks close enough to one that would come from a coin-flipping process with θ = 0.5. To do this, we need a formal definition of “looks close enough to one that would come from a coin-flipping process with θ = 0.5”. Specifically, we ask whether our estimate of θ is close enough to the hypothesized value of 1
2 .
Therefore we state the null hypothesis:
H0 : θ = 0.5 (3.3)
and an alternative hypothesis:
HA : θ 6= 0.5 (3.4)
This is a two-sided alternative hypothesis, because HA permits θ to be either greater than or less than the value in the null. We will get to one-sided tests later.
-0.6 -0.4 -0.2 0 0.2 0.4
t = 3̂ ! 0:5
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1 cd
f
3 = 0:2 3 = 0:5 3 = 0:9
Figure 3.1: Cumulative distribution function of the test statistic for sample size N = 20. Dashed lines show the bounds for a two-sided, α = 5% test.
Next we need a test statistic. This is a function of our sample, and should tell us something about θ. For the purposes of this application, we can just use our estimator for θ itself: θ̂ = 1
N
∑N i=1 Hi. Using this, we
can start to build up our definition of “looks close enough to one that would come from a coin-flipping process with θ = 0.5”. A natural measure of how close our sample is to one that would come from a θ = 0.5 coin- flipping process is t = θ̂ − 0.5: if t is close to zero, then this seems like good support for the coin being a fair one. On the other hand, it is unlikely that t is close to zero if the sample was generated by some other θ 6= 0.5. This is illustrated in Figure 3.1. If H0 is true, then our test statistic will have the distribution shown with the black lines. Notice here that there is a lot of probability for events where t is close to zero. On the other hand, if θ 6= 0.5, then the distribution will look something like the red (θ = 0.2) or blue (θ = 0.9) lines. Note however, that we only get one realization of t, we next need to map this in a decision rule.
Reject H0? H0 true H0 not true N Good Type I error Y Type II error Good
Table 3.1: The four possible outcomes of a hypothesis test
t tells us “how close” the sample is, but how close is close enough for us to conclude that θ = 0.5? To answer this, we need to make a trade-off about how often we want to be wrong, and what type of wrong that will be. Since there are two possible truths
33
(either H0 is true or H0 is not true), and two possible decisions (either we reject H0 or do not reject H0), then there are 2 × 2 = 4 possible outcomes of the test, half of which have us making the wrong conclusion. Table 3.1 summarizes the two types of wrong that we could be: we should be worried about either failing to reject H0 when H0 is false, a type I error, or rejecting H0 when H0 is true, a type II error. Since we are basing our decisions on the test statistic, which is random, there will always be a trade-off between these two errors. For practical reasons, we focus on targeting an acceptable probability of making a type II error: incorrectly rejecting the null hypothesis. One good reason for this is that this probability is a function of the null distribution (i.e. Binomial(0.5,N)). In our case, we know this exactly, and in most cases, we can approximate it if N is large enough (see the next chapter to learn about this). Compare this to evaluating the probability of a type I error: if HA is true, then all we know is that θ 6= 0.5. How do we assess the distribution of t if we don’t know the actual value of θ? That’s a hard one, and one that we avoid entirely if we focus on targeting the probability of making a type II error.
To do this, we need to work out when we need to define a decision rule about rejecting H0 based on t. As large |t| is evidence against H0, we will therefore use:
Reject H0 if and only if |t| > tc (3.5)
where tc > 0 is a critical value. Note that as tc gets larger, the more evidence we require against H0 to reject it. When H0 is true, the probability of rejecting H0 based on this decision rule is the probability of making a type II error, and equal to:
Pr[reject H0 | θ = 0.5] = Pr[|t| ≥ tc | θ = 0.5] (3.6)
= Pr
[∣∣∣∣∣ 1N N∑ i=1
Hi − 0.5
∣∣∣∣∣ > tc ]
(3.7)
= Pr
[∣∣∣∣∣ N∑ i=1
Hi − 0.5N
∣∣∣∣∣ > Ntc ]
(3.8)
= Pr
[( N∑ i=1
Hi > N(tc + 0.5)
) ∪
( N∑ i=1
Hi < N(tc − 0.5)
)] (3.9)
= Pr
( N∑ i=1
Hi > N(tc + 0.5)
) +
( N∑ i=1
Hi < N(tc − 0.5)
) (3.10)
where the last line follows because ∑N
i=1 Hi > N(tc + 0.5) and ∑N
i=1 Hi < N(tc − 0.5) are mutually exclusive events. But we can simplify this further, because we know that∑N
i=1 Hi ∼ Binomial(0.5,N): we just need to add up all of the bits of its pmf that satisfy
34
∑N i=1 Hi > N(tc + 0.5) or
∑N i=1 Hi < N(tc − 0.5):
Pr[reject H0 | θ = 0.5] = ∑
k:k>N(tc+0.5)
p(k) + ∑
k:k<N(tc−0.5)
p(k) (3.11)
= ∑
k:k>N(tc+0.5)
N!
k!(N −k!) 0.5N +
∑ k:k<N(tc−0.5)
N!
k!(N −k!) 0.5N (3.12)
where the “k : k > N(tc+0.5)” bit means “sum over all ks satisfying k > N(tc+0.5)”. We are looking to set this probability equal to α = 5%. α, the probability of rejecting H0 when it is true, is referred to the test size. Actually, we can’t in general set this probability to exactly 5% for the binomial distribution, because we have no guarantee that there is a place in the pmf where we can stop adding and get 5%. Let’s pick smallest tc such that this thing is less than 5%.
X pX(x) FX(x) θ̂ t 0 0.0000 0.0000 0.00 -0.50 1 0.0000 0.0000 0.05 -0.45 2 0.0002 0.0002 0.10 -0.40 3 0.0011 0.0013 0.15 -0.35 4 0.0046 0.0059 0.20 -0.30 5 0.0148 0.0207 0.25 -0.25 6 0.0370 0.0577 0.30 -0.20 7 0.0739 0.1316 0.35 -0.15 8 0.1201 0.2517 0.40 -0.10 9 0.1602 0.4119 0.45 -0.05 10 0.1762 0.5881 0.50 0.00 11 0.1602 0.7483 0.55 0.05 12 0.1201 0.8684 0.60 0.10 13 0.0739 0.9423 0.65 0.15 14 0.0370 0.9793 0.70 0.20 15 0.0148 0.9941 0.75 0.25 16 0.0046 0.9987 0.80 0.30 17 0.0011 0.9998 0.85 0.35 18 0.0002 1.0000 0.90 0.40 19 0.0000 1.0000 0.95 0.45 20 0.0000 1.0000 1.00 0.50
Table 3.2: Pmf and cdf for X =∑ i Hi ∼ Binomial(0.5, 20).
Table 3.2 shows the relevant pdf and cdf. Since the Binomial distribution is symmetric (i.e. p(X) = p(N−X)), we can look for one cutoff kc at the left tail such that Pr[X < kc] is just less than 2.5%, and then the other cutoff will be at the corresponding point of the right tail: i.e. N − kc. Looking at Table 3.2, we see that 2.1% of the samples will have 5 or fewer heads, and 5.8% of samples will have 6 or fewer heads. Therefore we can choose kc = 5 and have a probability of rejecting H0 when H0 is true of 2 × 2.1% = 4.2%, which is reasonably close to the standard number of 5%. Hence, we will reject H0 if and only if we observe a sample with 5 or fewer heads, or 16 or more heads. Note that we found these cutoffs from Table 3.2 by finding the (approximate) solutions to FX(x) = 0.025 and FX(x) = 0.975.
We’re almost there. In fact, we could do the hy- pothesis test without going any further, but since we specified things in terms of our test statistic t instead of the sum of heads, for completeness we should work out tc. Graphically this is exactly the same problem as solving for the rejection rule in terms of the sum of heads. To see this, have a look at Figure 3.1. The dashed lines have horizontal coordinates of 0.025 and 0.975. What we need to do is look at the cdf of t when H0 is true (the black line), and read off these points. These points are when t = −0.2 and t = 0.2. So our rejection rule becomes:
Reject H0 if and only if: |t| > 0.2
35
Hence, we would reject H0 if we observed, say, 3 or 19 heads in our sample, but would not reject H0 if we observed 8 or 11 heads in our sample.
At this point it should be pretty obvious to you that you will need to compute a lot of probabilities associated with the Binomial distribution. To learn about how to do this in Stata, a good place to start would be by typing the following into the command line:
help binomial
This gives you a pretty minimal description, but the blue text is a link to some more statistical functions that you may want to use later.
3.1.1 One-sided hypothesis tests
The previous section presented a two-sided hypothesis test: it was done under the assumption that the coin could possibly be unfair because either θ was more or less than 0.5. Sometimes, we have reason to rule out a portion of the alternative hypothesis space, and usually this means that if H0 is not true, then we know which side of the hypothesized value (in our case 0.5) the true value of θ is. Suppose, for example, that instead of “are we are flipping a fair coin?”, our research question was “is the coin biased towards heads?”. Formally, we could state a null and alternative as:
H0 : θ = 0.5, HA : θ > 0.5 (3.13)
that is, our research question motivates a (dogmatic) belief that θ could never be less than 0.5. The procedure for such a test is exactly the same, but we just need to think a bit more about what realizations of θ̂ (or t) would provide us with support for HA in favor of H0. For example, observing 2 heads, and so calculating θ̂ = 0.1 and t = 0.4 would not be very convincing that θ > 0.5, however we would have rejected H0 for the two-sided test outlined in the previous section. Hence, we need a rejection rule that only rejects H0 when we observe a sufficiently large θ̂, or sufficiently positive t. Other than that, we approach the problem in exactly the same way.
We need to choose a critical value tc such that the rejection rule:
Reject H0 if and only if t > tc (3.14)
so that the probability of this event, if H0 was true, is equal to α = 0.05, or at least close to 0.05, since t is a discrete random variable. We need to solve for:
α = Pr [t > tc] (3.15)
= Pr
[ 1
N
N∑ i=1
Hi − 0.5 > tc
] (3.16)
= Pr
[ N∑ i=1
Hi > N(tc + 0.5)
] (3.17)
= Pr [Binomial(θ, N) > N(tc + 0.5)] (3.18)
= 1 −FX (N(tc + 0.5)) (3.19)
36
where FX(·) is the binomial cdf shown in Table 3.2. Therefore we are looking for a cell in the FX(x) column of this table corresponding to (roughly) FX(x) = 1 − α = 0.95. The probability of drawing 13 or fewer heads is 0.942, and the probability of drawing 14 or fewer heads is 0.979, so we can’t get exactly α = 0.05, but the following decision rule:
Reject H0 if and only if t > 0.15 (3.20)
gets reasonably close: α = 1 − 0.9423 = 0.0577.
3.2 p-values
0 2 4 6 8 10 12 14 16 18 20
sample: P
i Hi
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
p -v
a lu
e
2-sided H A
: 3 0.5
1-sided H A
: 3>0.5
Figure 3.2: p-values associated with the 21 possible samples.
Another popular way of reporting statistical significance is with a p-value. The p-value associated with a hypothesis test is defined as the probability of observing a test statis- tic at least as extreme as the one we actually observed, assuming that H0 is true. If this almost seems like α, the test size, then you have made an important connection! p is equal to the test size α that would put you on the margin between rejecting and not re- jecting H0 for your observed sample.
The benefit of reporting a p-value is that it allows your reader to test your hypothesis at their choice of α, rather than the one that you selected. For example, if you calculated p = 0.03, then you would reject H0 if you wanted to do an α = 0.05 test, but fail to reject H0 for an α = 0.01 test. A smaller p-value means that the data would pass a more stringent hypothesis test.
For example, suppose that you observed 3 heads in your sample. Since we’ve worked through Section 3.1, you know that you would reject the two-sided test that θ = 0.5. However what if a pesky audience member at a seminar is in the mood for a more conservative test. By reporting the p-value, this may avoid an annoying question. So let’s calculate it. Since we are doing a two sided test, there are eight samples that are at least as unlikely to occur when H0 is true. These the samples that include either 0, 1, 2, 3, 17, 18, 19, or 20 heads. Hence the p-value is the sum of the probabilities of these events occurring, assuming that H0 is true, i.e. θ = 0.5:
p = 3∑
k=0
20!
k!(20 −k)! 1
220 +
20∑ k=17
20!
k!(20 −k)! 1
220 ≈ 0.0015 (3.21)
37
so we would reject H0 at most reasonable levels of significance (including α = 0.05). The black circles in Figure 3.2 show the p-values associated with the 2-sided test for all 21 possible realizations of
∑ i Hi.
For the one-sided test in Section 3.1.1, note that observing 3 heads is terrible support for HA, so before we actually calculate this thing, note that it should be close to 1. We need to add up the probabilities of all the samples we could have observed, that would have supplied at least as much support for HA : θ > 0 as did our observed sample of 3 heads. These the samples that include either 3, 4, 5, . . . , or 20 heads:
p = 20∑ k=3
20!
k!(20 −k)! 1
220 ≈ 0.9987 (3.22)
The blue circles in Figure 3.2 show the p-values associated with this 1-sided test for all 21 possible realizations of
∑ i Hi.
For the other 1-sided test, with HA : θ < 0.5, we up all of the probabilities associated with getting at most 3 heads in our sample. In this case, observing 3 out of 20 heads is somewhat strong support for HA in favor of H0, because we shouldn’t expect this to happen too often assuming H0 is true. In this case:
p = 3∑
k=0
20!
k!(20 −k)! 1
220 ≈ 0.0013 (3.23)
so we would reject H0 at for any test with α > 0.0013. It is important to interpret p-values correctly. It is tempting to claim that p is the
probability that H0 is not true. However this is false. 1 Remember that we derived the p-
value assuming that H0 was true. Hence, you are permitted to interpret it in the following ways:
• p is the probability of calculating a test statistic at least as extreme as the one you actually calculated, assuming H0 is true.
• Assuming H0 is true (and all other distributional assumptions about the data-generating process are correct), p will be uniformly distributed (think about this when/if you learn about the method of inversion for generating random numbers).
• When H0 is true, rejecting H0 when p < α implements the same decision rule as testing H0 at the α level of significance.
It tells you something, just be aware of what it doesn’t tell you.
3.3 Confidence intervals
1Do Bayesian econometrics if you want to make claims like this.
38
0 2 4 6 8 10 12 14 16 18 20
sample: P
i Hi
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
C o n - d en
ce in
te rv
a l
HA : 3 6= 30 HA : 3 < 30 (lower bound) HA : 3 > 30 (upper bound)
Figure 3.3: Confidence intervals (α = 0.1) as- sociated with the 21 possible samples. Black lines show the intervals for the 2-sided alterna- tive. Red and blue crosses show the minimum and maximum values in the confidence inter- vals for the one-sided tests respectively.
The third way we might want to report the statistical significance of our results is a con- fidence interval. This reports all of the val- ues of θ0 for which we would fail to reject H0 in the test:
H0 : θ = θ0, HA : θ 6= θ0
(this also applies to one-sided hypothesis tests).
This is useful because it reports, hold- ing α constant, all of the null hypotheses that would not be rejected. Figure 3.3 shows the confidence intervals we would assign af- ter observing each of the 21 possible samples we could observe from flipping 20 coins. The black lines show the 2-sided confidence inter- vals. The red crosses show the lower bound of the one-sided confidence intervals with al- ternative HA : θ < θ0; the upper bound of all of these is θ = 0. and the blue crosses show the upper bound of the one-sided confidence intervals with alternative HA : θ > θ0; the lower bound of all of these is θ = 1.
3.4 Test power
Up to this point, all of our calculations were done assuming that H0 was true (remember this, it’s important). But what about HA? Often we will be testing something with the expectation that H0 is not true. For example, maybe some economic theory tells us that we are not flipping a fair coin (maybe more on this later). α tells us the probability that we are wrong when H0 is true, but what about being wrong when HA is true? That is, what is the probability of a Type I error? The reason it is (relatively) easy to work out things when H0 is true is that H0 completely pins down the distribution of our test statistic. In our case in this Chapter, we know that
∑ i Hi ∼ Binomial(N, 0.5). But for the alternative, all we know
is that θ falls in a range: the distribution of the test statistic when HA is true is not known! That being said, we can answer a simple question: what is the probability of rejecting the null when θ is equal to a particular value? If this seems similar to α, good! α is the answer to this question if we plug in θ equal to the value set in H0 (in this example, θ = 0.5). If we instead plugged in a value consistent with HA, we would be calculating the test’s power. That is, if H0 is not true, how good is our test at telling us this?
To get this, let’s go back to (3.6), but substitute in another value of θ, say 0.6 (i.e. the
39
coin comes up heads with probability 60%):
Pr[reject H0 | θ] = Pr
( N∑ i=1
Hi > N(tc + 0.5) | θ
) +
( N∑ i=1
Hi < N(tc − 0.5) | θ
) (3.24)
So we know that ∑
i Hi ∼ Binomial(N,θ), so this thing is equal to:
Pr[reject H0 | θ] = 1 −F (N(tc + 0.5); N,θ) + F (N(tc − 0.5) − 1; N,θ) (3.25)
where F(·; N,θ) is the cdf of the Binomial(N,θ) distribution. For our 5% test calculated earlier, we had tc = 0.2, so this reduces to calculating the probability of getting 6 or fewer heads, or 16 or more heads. We have already worked out that this is equal to α = 4.2% when H0 is true, but now we evaluate the same probability for a different θ. The probability of rejecting H0 when θ = 0.6 is:
Pr [reject H0 | θ = 0.6] = 6∑
k=0
20!
k!(20 −k)! 0.6k0.420−k +
20∑ k=15
20!
k!(20 −k)! 0.6k0.420−k (3.26)
≈ 13% (3.27)
10 20 30 40 50 60 70 80 90 100
Sample size N
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
te st
p o w er
3 = 0:5, H0 true 3 = 0:6 3 = 0:7 3 = 0:8
Figure 3.4: Power of the 2-sided Binomial test outlined in this chapter, targeting a test size of α = 0.05.
We we want this number to be as big as pos- sible because we want to be able to reject H0 whenever it is false. I hope 13% seems rea- sonably bad to you. This means that 87% of the time, we fail to reject H0. But this is what we accept when we pin down α = 0.05. As long as we want to do a 5% test, the only variable we can play with is the sample size, N. Unsurprisingly, the test power gets big- ger (which is better) as N increases. Figure 3.4 shows this relationship. The black line shows the test size, which we have set to 5%. This and the other lines are jagged be- cause the Binomial distribution is discrete: this wouldn’t happen with a continuous dis- tribution. The colored lines show the test power for 3 different values of θ. Looking at each line individually, it is comforting that they are upward-sloping (outside of the jagged shape due to the discrete distribution). Fur- thermore, as θ becomes further away from the H0 value, the power increases: it is easier to spot the difference between θ0.5 and θ = 0.8 than it is to spot the difference between θ0.5 and θ = 0.6.
40
3.5 The take-away
I write these notes assuming that my students have been introduced to hypothesis tests, p-values and confidence intervals at an introductory statistics/quantitative methods level. That is, you have probably seen a mechanical, plug-and-chug, explanation of what to do. I intend for this to be a chapter on what you’re doing, and why you’re doing it. So what should you take away from this?
Firstly, here’s what we did:
1. Stated formal null and alternative hypotheses
2. Defined a test statistic
3. Worked out the distribution of the test statistic assuming H0 is true
4. Used this distribution to work out a decision rule
5. Reported at least one of (i) the result and conclusion from a hypothesis test, (ii) reported and interpreted a p-value, and (iii) reported and interpreted a confidence interval.
6. Commented on our test’s power of identifying the alternative hypothesis when it is true
But be aware that we did none of the following
• Divide by the sample standard deviation
• Look up a normal, χ2, Student’s t, or F distribution
• 1.96
although we did look up a distribution table, namely Table 3.2. Don’t worry, what you were taught in the past (probably) wasn’t wrong. Just be aware that hypothesis tests and the like can be done without assuming that anything is normally distributed. The reason that we so often do make a normal approximation is that it makes step 3 a whole lot easier, and often doesn’t change things too much.
Exercises
Exercise 3.1. Load the Galton heights dataset. Just focus on parent height. We have a working hypothesis that within a randomly selected couple, the father is more likely to be taller than the mother.
1. Formally define a population parameter that speaks to this test, and state a formal null and alternative hypothesis about this parameter that addresses the working hypothesis above.
41
2. Define a test statistic for this hypothesis.
3. What is the distribution of the test statistic when the null hypothesis is true?
4. Calculate the p-value associated with your null and alternative hypotheses. State any additional assumptions you needed to make for your procedure to be valid (and don’t make any assumptions tht ou don’t need to). Produce a graph to illustrate what you’re doing. Hint: This is a computationally intensive exercise that will take you too long and waste too much paper if you do it by hand.
Exercise 3.2 (Assessing the performance of a “cookbook” hypothesis test). Consider the following procedure for a hypothesis test for a dataset of N iid coin flips {Xi}Ni=1:
H0 : Pr[Xi = 1] = 0.5, HA : Pr[Xi = 1] 6= 0.5
t = 2 √ N(x̄− 0.5), x̄ =
1
N
N∑ i=1
Xi
Reject H0 if and only if |t| > 1.96
(We will understand why this might be a good approach in some circumstances in the next chapter.) Evaluate the actual size (i.e. α) of this hypothesis test. How does the actual size change with sample size N?
You may want to break this problem up into answer the following steps:
1. What component of t is random due to the random sampling procedure (Hint: there are two possible answers: x̄ and N. Work out which is correct.)
2. What is the distribution of this random component when H0 is true?
3. What values of x̄ mean that you would reject the null? (draw a graph)
4. What is the probability that x̄ falls in to this range?
5. How does this probability relate to α?
Exercise 3.3 (Understanding some Stata code). Consider the following script:
clear all
set obs 50
generate X = 0
replace X = 1 if runiform () <=0.5
summarize X
generate t = 2*sqrt(r(N))*(r(mean) -0.5)
summarize t
generate reject = 0
replace reject =1 if abs(t) >1.96
summarize reject
Note especially the line that begins generate t = . In relation to Exercise 3.2:
42
1. Provide a one-sentence description of what each line of this script does (i.e. thoroughly comment the code). Also Provide a brief description of what the entire script does.
2. Based on your answer to Exercise 3.2, what is the probability distribution of the variable reject? Express your answer as a function of α. (Hint: What values could reject take on, and what are the probabilities of these events?)
3. Modify this code to generate 100,000 simulated draws from t. What is the actual test size? What number should you use if you wanted to do a 5% test? Hint: This will take a while. Until you want to get your final answer, run your script with a smaller simulation size (e.g. 100) to make sure it works,
Exercise 3.4. Figure 3.4 shows the power of a 2-sided Binomial test, and how it varies by sample size N. Produce a similar plot that shows the trade-off in a 1-sided test between test power and test size (α). Set N = 100 constant.
43
Chapter 4
Inference with asymptotic assumptions
At this point, I am hoping that you have a rather grim view of hypothesis tests in the context of Chapter 3: we need to know a lot about our random variable in order to derive the distribution of our test statistic when H0 is true, and even when we know all of this, the process is a difficult one. Fortunately, we have another tool that allows us to make the process much simpler: asymptotics. Even if we don’t know (or don’t care) enough about our random variable to derive exactly its sampling distribution, we can use this tool to work out what it would be as our sample size N approached infinity. Then we take the leap of faith that our sample size is large enough that this distribution is a good approximation for our actual sample.
To do this, we use two theorems about sample means. The Weak Law of Large Numbers tells that as our sample size N → ∞, the probability that we are arbitrarily close to the population mean (i.e. E[X]) approaches 1. Then, central limit theorems tell us how we can appropriately scale things so that they are (usually) normally distributed as N →∞. I introduce these concepts, then outline how we can use them to derive approximate properties of our estimator and/or test statistic if we can argue that our sample size is large enough. We then learn how these are useful for doing inference, and finish with a useful approximation of transformation of sample means.
4.1 Large-sample properties of sample means
Fortunately for us, (i) many of our estimators and test statistics are just fancy sample means, and (ii) a lot of work has gone into understanding sample means. I present four of the results of (ii) below, which will make our life a lot easier.
44
4.1.1 The Weak Law of Large Numbers
The weak law of large numbers tells us (loosely) that a sample mean (i.e. 1 N
∑ i Xi) will get
very close to the equivalent population mean (i.e. E[X]) as our sample size (i.e. N) becomes large. Formally:
Theorem 1 (Weak law of large numbers). Let Xi be an infinite set of iid Lebesgue integrable random numbers satisfying E[Xi] = µ for all i. Then:
lim N→∞
Pr
(∣∣∣∣∣ 1N N∑ i=1
Xi −µ
∣∣∣∣∣ > � )
= 0, for all � > 0 (4.1)
The above limit can also be written as:
1
N
N∑ i=1
Xi p−→ µ (4.2)
What does (4.1) mean in plain(er) English? Note that ∣∣∣ 1N ∑Ni=1 Xi −µ∣∣∣ is the distance
between our sample mean and the population mean. This is random because we have a random sample. Now we define � as some arbitrary criterion for closeness, and ask the question: how likely are we to get a sample mean at least � away from the population mean? (4.1) tells us that no matter how we define this criterion closeness, as N →∞ our sample will be close to the population mean with probability approaching 1. Basically, sample means converge to population means as N →∞.
4.1.2 A central limit theorem
So the Weak Law of Large Numbers is useful for point estimates: if we have a sample mean with a large sample size, we are likely to get very close to the population mean. However this is not helpful for inference. How do we put a confidence interval around an estimate if the distribution of the estimator collapses to a point? The answer is to use an appropriate scaling of the estimator that doesn’t collapse. To understand the problem, note that the variance of the sample mean for an iid sample is:
V [x̄N ] = V
[ 1
N
N∑ i=1
Xi
] = V [X]
N (4.3)
So for finite V [X], V [x̄N ] → 0 as N →∞. The solution to this can also be seen in Equation 4.3: we need to multiply x̄N by a fudge factor g(N) (actually a fudge function of N) that increases in such a way that V [g(N)x̄] is a constant. Since g(N) is not random, we can do the following:
V [g(N)x̄N ] = V [g(N)x̄N ] = [g(N)] 2V [x̄N ] = [g(N)]
2V [X]
N (4.4)
45
So if g(N) = √ N, then:
V [g(N)x̄N ] = V [ √ Nx̄N ] = N
V [X]
N = V [X] (4.5)
a constant! That is, the variance of √ Nx̄N does not depend on N. Furthermore, we can
scale this a little bit more so it has zero mean:
E [√
N (x̄N −E[X]) ]
= 0 (4.6)
V [√
N (x̄N −E[X]) ]
= V [X] (4.7)
4.1.3 The continuous mapping theorem
4.1.4 The delta method
4.2 Large-sample properties of estimators
In Chapter 2, we learned that bias, variance, and mean squared error are useful quantities to summarize the performance of an estimator. These are sometimes referred to small sample properties of estimators, meaning that they are things that you might need to worry about if your sample size is small, but if your estimator satisfies some conditions, will not be of too much concern for large N.
46
4.2.1 Consistency
4.2.2 Asymptotic distribution
4.3 Using large-sample properties to make hypothesis
tests easier
4.3.1 Hypothesis tests with asymptotic approximations
4.3.2 Confidence intervals with asymptotic approximations
4.3.3 p-values with asymptotic approximations
Exercises
Exercise 4.1. Consider the exponential distribution, which has the following properties:
FX(x) = 1 − exp(−λx)I(x > 0) (4.8) fX(x) = λ exp(−λx) (4.9)
E[Xk] = k!
λk (4.10)
It can be used to model the time until an event occurs. We will consider the following two estimators for λ:
λ̂ = 1
1 N
∑N i=1 Xi
, λ̃ =
√ 2
1 N
∑N i=1 X
2 i
1. Explain how these estimators can be motivated from Equation 4.10 above.
2. Load ExpData.csv, which is a dataset of exponential random numbers. Estimate λ using both estimators described above.
3. Are these consistent estimators for λ? Explain.
4. What are the asymptotic variances of these estimators? Which one would you prefer?
5. Suppose that you wanted to report the probability that the event had not occurred after 1 unit of time. Write down this expression as a function of λ.
From now on, let’s just focus on λ̂, although you could do all of this with λ̃ as well.
6. Replace λ with λ̂ in your answer to the previous part. This is an estimator of this probability. Is this estimator consistent?
47
7. What is the asymptotic distribution of the estimator of this probability?
8. Construct a 95% confidence interval for this probability.
9. Is λ̂ a biased estimator for λ? Explain. If it is biased, can you say in which direction?
Exercise 4.2 (Simulation exercise – What is so magical about N = 30?). In an undergraduate statistics course you may have been told that you need a sample size of at least 30 to justify looking up a normal distribution table. Plot the distribution of the test statistic t, and evaluate the test size for the hypothesis test:
H0 : E[X] = 0, HA : E[X] 6= 0
t =
√ NX̄√
1 N
∑ i(Xi − X̄)2
reject H0 if and only if |t| > 1.96
assuming that:
1. Xi ∼ iidN(0, 1)
2. Xi ∼ iidN(0, 4)
3. Xi ∼ iidBernoulli(0.5) − 0.5 (you can draw this by generating a fair coin flip variable then subtracting 0.5 from it).
4. Xi ∼ iidχ21 − 1. Note that if Z ∼ N(0, 1), then X2 ∼ χ21
Note that once you have generated X from the correct distribution, you can compute t as follows:
summarize X
display t = sqrt(_N)*r(mean)/sqrt(r(Var))
within a program, you will want to replace display with return scalar.
Exercise 4.3. Consider the distribution first introduced in Exercise 1.3. We will continue analyzing the properties of the following estimators for the parameter in this distribution, which were introduced in Exercise 2.1:
α̂ = 1 N
∑N i=1 Xi
1 − 1 N
∑N i=1 Xi
α̃ = − N∑N
i=1 log(Xi)
48
1. For each of these estimators, answer do the following questions. For simplicity, I refer to everything below as α̂, but do this for both α̂ and tα̃.
Hint: for α̃, you will need to do some integration by parts, then use L’Hôpital’s rule. Either that or use something like thishttps://www.wolframalpha.com
(a) Is α̂ a consistent estimator for α? Explain your answer.
(b) What is the delta method approximation of the variance of α̂?
(c) Suppose that you wish to test:
H0 : α = 2
against:
HA : α 6= 2
Under the null hypothesis, what is the asymptotic (large sample) distribution of√ N(α̂− 2)? That is, complete the right-hand side of:
√ N(α̂− 2) d−→ ?
(d) Use your answer in the previous part to propose a function of α̂ and N that at large enough samples is approximately distributed N(0, 1)
(e) Suppose that you collected N = 30 observations and estimated α̂ = 2.2. Use your answer in part 1d to test this hypothesis. Use a 5% level of significance.
(f) What is the p-value for this test?
(g) Construct a 2-sided 90% confidence interval around this point estimate.
2. Based on their asymptotic variances alone, which estimator would you prefer to use?
3. Propose an alternative method of testing α = 2 using the sample mean instead of our estimate of α̂ (just outline the steps). Make sure you state the distribution of the test statistic under the null, if you made a large-sample approximation to get there, and the rejection rule.
4. (Simulation exercise): In question 1e you constructed a rejection rule for H0 based on a large-sample approximation of the distribution of α̂. Use a simulation to construct a rejection rule that does not need this approximation. Compare it to your large-sample approximation rejection rule. Do you think the probability of a Type II error using the large-sample approximation is close enough to 5% to be a good approximation? Briefly discuss your answer.
Hint: To do this, you should:
49
• Simulate the distribution of α̂ when the null hypothesis is true • Calculate two critical values (i.e. reject if α̂ is not between these critical values)
from your simulated distribution. Think about how you calculated your critical values when you used the large sample approximation, and how they relate to the normal distribution.
Exercise 4.4 (A sort-of simulation exercise). When we simulate a draw from the distribution of an estimator, say µ̂, one thing we may want to ask is how accurate is our approximation of the bias? That is, how close is the simulated bias:
BiasS(µ̂) = 1
S
S∑ s=1
(µ̂s −µ)
to
Bias(µ̂) = E (µ̂−µ)
Assume that we have correctly simulated {µ̂s}Ss=1, such that each µ̂s is an iid draw from the sampling distribution of µ̂.
1. What is the variance of our simulated bias?
2. What is the approximate distribution of:
√ S ( BiasS(µ̂) − Bias(µ̂)
) for large S?
3. How can you use your previous answer to work out how accurate your simulation is?
4. Assume that V [µ̂] = 1. How large doe S be for your simulation to get the bias correct to the 2nd decimal place with probability 99%?
Express your answers as a function of the actual bias, the simulation size S, E[µ̂] and V [µ̂] (assuming these are all finite quantities).
√ S ( BiasS(µ̂) − Bias(µ̂)
) d−→ N (0,V [µ̂]) We can use this to construct a confidence interval around our simulated bias. Importantly, since we get to choose S, we can work out how big S has to be for us to get close to the true bias.
50
Part II
Basics of programming and handling data in Stata
51
Chapter 5
Getting started in Stata
5.1 Importing, saving, and exporting data
The most likely reason that you open up Stata is that you want to analyze some data. An important first step is therefore knowing how to import your file. Basically, you need to give Stata some instructions along the lines of “Go to this folder, and open this file.” On top of this, you will also need to tell it the file format. While you can look at most of the datasets that Stata can open using a text editor,1 you need to tell it the “language” it should be expecting. The bad news is that Stata will not be your friend if you don’t get this right, the good news is that if you do get it right, Stata is able to open quite a lot of stuff (with the right instructions).
You can tell Stata to import data through the File→Import drop-down menu. In addi- tion, Stata also has its own file format with the suffix .dta. You can import this file format using the use command. While this format usually preserves much more useful information than the other formats,2 it has its drawbacks, too. First, it is difficult to read in programs other than Stata; and second, there are sometimes compatibility issues between versions of Stata.3 The .dta format can be produced using the drop-down menus File→Save or File→Save As....
Note that all of these functions can be accessed through the command line. For example, if you have a file thingy.dta in your working directory, instead of File→Open... you could type use thingy, or if you already have something in memory: use thingy, clear. Exercise 5.2 is there to help you get more acquainted with how Stata stores data in different formats.
1MS Windows’ Notepad is a text editor, but it is not a particularly good one. I have found that Notepad++ works pretty well for almost everything.
2My favorites of these are value ad data labels, which make using someone else’s data much easier. 3At the time of writing, I had recently encountered a problem with opening files in Stata 13 that were
saved in Stata 14. This can be solved at the newer version end with the version command, but it is annoying, nonetheless.
52
5.2 Scripts
Almost everything that comes with Stata can be done using the drop-down menus. This seems like a great comfort at first, but I warn you against using Stata like this for (at least) four reasons:
1. The “almost” part: there are some things that you will not be able to access using the drop-down menus.
2. The “that comes with Stata” part: Stata has a lot of really good and free user-generated content.4 These typically are not friendly to those who like to point and click.
3. If you always use the drop-down menus, and you ever want to change what you do (you should expect to do this all the time), then you will have to re-trace all of your steps. Will you remember them? Will you have time to do them? Will you get the changes right on the first try? Probably not (no offense).
4. Pedagogically, I want you to learn something about programming in general, as well as in Stata specifically. Learning about some fundamental components of programming languages (such as scripts, for loops, and if statements) will make any programming language easier to learn in the future.
In addition to not using the drop-down menus, I also encourage you to not use the command line for anything you think you might want to keep. This leaves us with the following solution:
It is with a keen sense of irony that I invite you to use the drop-down menus in Stata as follows: Window → Do-file Editor → New Do-file Editor. Alternatively, ctrl+9 will also get you there. This opens up a new Do-file editor (duh). This is Stata’s in-built text editor for scripts, which are set of instructions that Stata follows from top to bottom. These things are extremely useful for many reasons. So much so, that from now on you should think about doing everything in Stata exclusively in scripts.
To demonstrate how to get started on a script, let’s have a look at Stata’s system dataset auto.dta.5 We can access this and see what’s in it by typing the following lines into the command line:
clear
sysuse auto
describe
clear removes any data that was in memory beforehand (make sure you save regularly!), sysuse auto loads this dataset, and describe gives us a description of all of the variables held in memory. One of the variables we have is mpg, which is each cars’ fuel efficiency, in
4E.g. esttab, which you will become familiar with shortly. 5Stata comes with a number of small, pre-loaded datasets that can be accessed with the sysuse command.
These are frequently used in the help files as examples to demonstrate particular programs. There are also other datasets on Stata’s website that can be accessed using the webuse command.
53
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
mpg | 74 21.2973 5.785503 12 41
Lper100km | 74 11.83116 3.016803 5.749129 19.64286
Table 5.1: Resultant summary of auto dataset after generating Lper100km variable.
miles per gallon. Suppose that, for whatever reason, we wish to do our analysis in liters per 100km instead. This is easily done by generating a variable:
generate Lper100km = 1/(mpg *1.61/3.795) *100
That is, there are 1.61km in a mile, 3.795L per US gallon, and this new unit is based on 100L of fuel:
3.795L/gallon
mile/gallon × 1.61km/mile = 2.35 × L/km ×
100
100 = 235L/100km (5.1)
So before we ever want to do analysis on the auto dataset using these units for fuel efficiency, without scripts we would have to type in all of these lines. This could become tedious if we have to do this over and over again, so alternatively, you could write the script:
clear // clear everything in memory
sysuse auto // load the system dataset called "auto"
describe // provide a description of the dataset
generate Lper100km = 1/(mpg *1.61/3.795) *100 // convert mpg into L per 100km
summarize mpg Lper100km // provide summary statistics for mpg and Lper100km
which works in exactly the same way as tying stuff into the command window, but you only have to type it once. This code also produces a table of summary statistics of the old and new variable, which is shown in Table 5.1.
5.3 The working directory
In many popular MS programs, whenever you save something for the first time you are asked where you would like the file to be stored. This is a problem if you are using scripts, for at least two reasons:
1. If your script is a long one, then if you are like me you probably want to go and do something else while it runs. Go get coffee, grade a paper, mindlessly check Facebook, etc. If your script has multiple lines telling Stata to write files, you don’t want it interrupting you whenever it needs to do this with the question “where do you want me to put this?”
2. If you are collaborating with others (which I strongly encourage), your script will need to run on many (at least two) machines, which all have different file systems. (This is by fare the more important reason)
54
The solution to these problems is to point Stata in the direction of a “working directory”. This is the default folder that, if prompted to save, export, import, use, or read or write anything else for that matter, it will do it here by default.
You can ask Stata about the working directory through the command line by typing dir. This gives you the file path, e.g. C:/users/AceRimmer/Metrix, as well as information about any files and folders that are in this directory. You can go up one level by typing cd .., which in this example would get you to C:/users/AceRimmer. You could go back to the original by typing either cd Metrix, or the whole file path: cd C:/users/AceRimmer/Metrix.6 In fact, this second command will get you there no matter what the current working directory is.
Furthermore, by default Stata assumes that any incomplete file path that you give it starts at the working directory. So suppose that you had a folder called “figures” inside your working directory (I almost aways do), then you could tell Stata to save a histogram of variable x as follows:
hist x
graph export figures/HistogramOfX.png
Exercises
Exercise 5.1.
1. Create a new folder somewhere, and copy “galton heights.csv” into it
2. Stata needs to know which folder you want it to look in by default. This is called the “working directory”. You can set this by clicking on File --> Change Working Directory and following the prompts.
3. Type ls into the command window. Stata will tell you the contents of this folder. If you’re ever unsure what your working directory is, type pwd (print working directory)
4. Stata needs to know a bit about the data file you want it to read. galton heights.csv is a comma separated variable file. Open it with NotePad or something similar to find out what that means.
5. Today we will import our file from the command line. It can be done through File --> ... as well, but some things cannot be done with the drop-down tabs. The command we will use is import delimited, but we need to know the information Stata needs. To do this, type “help import delimited” into the command line. A help file will pop up. Get to know how to read these help files. Once you do, they are actually quite helpful! Work out what you need to type to import your data (I will run through this with you once you’ve tried yourself)
6If there is a space in the file path, you will need to put double quotes around it, e.g.: cd "C:/this folder has spaces"
55
6. In the command line, try typing (one at a time) “describe”, “summarize”, and “summarize, detail”. What do these tell you? Have a look at the help files for describe and summarize Can you summarize only two of the five variables in the data set? (read the help files).7
7. What is wrong about the summary statistics for mothers and fathers? (think about how the data are arranged)
8. The height variables are recorded in inches (reasonable in the US), but for some reason 60 inches are subtracted. Create 3 new variables equal to the actual heights:
generate father height actual = father height + 60 If you want to use a sys- tem of measurements that the majority of the world uses:
generate father height meters
= father height actual / 39.37
9. Check your work with a scatter plot. If you created these variables correctly, what would the relationship between child height actual and child height be? With the drop-down menus, use: Graphics --> Twoway graph (scatter, line, etc.)
10. Look at the text that just appeared in the output. Copy and paste it into the command line. Same results? Good!
11. Try the following, what do they do?
twoway (scatter child height actual mother height actual)
twoway (scatter child height actual mother height actual if son==0)
by son, sort: twoway (scatter child height actual mother height actual)
12. Suppose that you wanted to get rid of some data:
I don’t need to use family id, then: drop family id
I only want to focus on daughters, then: drop if son==1
13. Once you’re done, you can save the data (but not the outputs) by typing:
save filename.dta
dta is a special file format for Stata that allows you to store some additional information (more on this later).
To load a dta file: use filename
If you already have data in the memory, you will need to clear it first:
7Note that Stata lets you take a few shortcuts: you could have types desc and sum respectively. I will try to not take the shortcuts, but if you use Stata enough you will probably always use them. I think this is a drawback of Stata. Having exactly one way to do things makes it easier to read others’ programs. If you use Python, have a think about what a tab means.
56
clear
use filename
Exercise 5.2. Load one of the system datasets (e.g. sysuse auto, clear), then do the following:
1. Type describe and summarize into the command line. What do these commands do?
2. Use the drop-down menus to export this dataset into the following formats:
(a) Text data in csv format
(b) Text data in fixed format
(c) Stata’s .dta format.
What commands does Stata write into the command line when you export these files?
3. In a good text editor (e.g. Notepad++). Open up the two files you created in the previous question. In each of these files, how does Stata know when one column ends, and another begins?
4. Import the .csv and .dta that you created, and describe the data. Has any infor- mation changed or been destroyed compared to the system dataset?
Exercise 5.3.
1. Find out where Stata’s default working directory is on your machine. I.e. what is ther working directory when you open Stata?
2. Create a folder somewhere. Write a script that, no matter where it is located, will export a comma-separated variable version of Stata’s system dataset auto to this folder.
3. What do you need to do to make this script run without errors twice?
57
Chapter 6
For loops
Sometimes we find ourselves copying blocks of code over and over again, and only making minor changes to each copy. In econometrics, this can often involve running the same analysis on several different datasets, systematically running every possible combination of things from two or more sets of things,1 or performing the same transformation of every variable in our dataset.2 While there is no reason why your carefully copied and pasted script that is hundreds of lines long could work, it pays to take a pause whenever you get the urge to do this, and think about writing a for loop. Firstly, if there is an error in the block of code that you are copying, then there will be hundreds of errors after you ctrl+v 99 times.
The idea of a for loop is as follows:
Hey , STATA! Do this thing in the curly brackets for x = 1, 2, 3, ..., K {
Here is the thing that I want you to do over and over again
I want each iteration to be a little bit different , so I can
include ‘x’ in here to distinguish between each step
} // here is the end of the thing I want you to do
Of course, Stata isn’t intelligent enough to understand this, but it can do something quite similar, if you give it the right instructions. For example:
1 forvalues ii = 1/5 { 2 display ‘ii’ 3 }
which displays the numbers 1 through 5. While this example is somewhat underwhelming, note that in order to achieve this without using forvalues, I would have to:
1 display 1 2 display 2 3 display 3 4 display 4 5 display 5
1E.g.: Running an analysis for men and women separately, slicing the data by level of education. 2E.g.: We might want to take the natural log of all of all of our variables.
58
which is 2 lines longer. More to the point, if I wanted the numbers 1 through 1,000 displayed, the for loop would still be 3 lines long (all you would have to do is change the 5 line 1 to 1000), and the second script would be 1,000 lines long.
Listing all of the integers between 1 and 1,000 may be a special moment for you in learning about programming, but it is not particularly useful. One application that is somewhat more econometrics-y, is reporting a sample mean. In my working directory at the moment, I have 10 files (unimaginatively) called data 1.dta, data 2.dta, . . . , data 10.dta. Each of these contains exactly one variable, called x. I want to report the sample mean of x in each dataset. Of course, I could always do this without a loop:
clear
use data_1.dta
quietly summarize x
display ‘r(mean)’
clear
use data_2.dta
quietly summarize x
display ‘r(mean)’
clear
use data_3.dta
quietly summarize x
display ‘r(mean)’
// and so on
But that is 4 lines of code per dataset, for 10 datasets. That’s 40 lines of very repetitive code! Instead, I could do the same thing as follows:
forvalues dd = 1/10 {
clear
use data_ ‘dd ’.dta
quietly summarize x
display ‘r(mean)’
}
which doesn’t get any bigger if I have 100 datasets or 1,000,000 datasets. How do you think I created these .dta files for this example? Hint: I’m lazy:
forvalues dd = 1/10 {
clear
set obs 1000
gen x = rnormal ()
save data_ ‘dd ’.dta , replace
drop x
}
59
Chapter 7
Types of data
7.1 How your computer thinks (or doesn’t think) about
data
7.2 Censored and truncated data
7.3 Categorical data
7.3.1 Unordered
7.3.2 Ordered
60
Chapter 8
Merging data, and wide & long formats
This chapter demonstrates some useful tools for merging data and handling panel data.
8.1 One-to-one merges
Frequently you have more than one data file that contains information you would like to study.
To begin with, we will look at two datasets:
• USCensusTab04.xls contains state abbreviations (i.e. AL, AK, etc), and state popu- lations between 2000 and 1990
• us state.xls contains state abbreviations and their full names
We aim to merge these two datasets so that we can have full state names and populations in the same dataset. This is a 1:1 merge because each row in the first file corresponds to exactly one row in the second file, and vice versa.
The following code imports the first dataset, appropriately labels some variables, and saves it in Stata’s .dta format.
import excel "USCensusTab04.xls", sheet("Table 4") cellrange(A7:C57)
rename A StateAbbrev
rename B pop2000
rename C pop1990
// Have a look at the Tab04 dataset
list in 1/10
save tab04.dta , replace
clear
The output from list in 1/10 is
61
+--------------------------------+
| StateA~v pop2000 pop1990 |
|--------------------------------|
1. | AL 4447100 4040587 |
2. | AK 626,932 550,043 |
3. | AZ 5130632 3665228 |
4. | AR 2673400 2350725 |
5. | CA 33871648 29760021 |
|--------------------------------|
6. | CO 4301261 3294394 |
7. | CT 3405565 3287116 |
8. | DE 783,600 666,168 |
9. | DC 572,059 606,900 |
10. | FL 15982378 12937926 |
+--------------------------------+
that is, each row of data contains a state abbreviation (string variable), and that state’s population in 2000 and 1990. We wish to import the actual name of the state into the dataset as well. We could, if we really wanted to waste our time, go ahead and manually code up:
generate State = .
replace State = "Alabama" if StateAbbrev == State = "AL"
// and so on ...
but we have better things to do. Fortunately, we also have the file us states.csv, which has this information. First we need to get it in to a useful format:
import delimited "us_states.csv"
rename v2 StateName
rename v3 StateAbbrev
drop v1
// Have a look at the us_states dataset
list in 1/10
which gives us the output:
+------------------------+
| StateName StateA~v |
|------------------------|
1. | Alabama AL |
2. | Alaska AK |
3. | Arizona AZ |
4. | Arkansas AR |
5. | California CA |
|------------------------|
6. | Colorado CO |
7. | Connecticut CT |
62
8. | Delaware DE |
9. | Florida FL |
10. | Georgia GA |
+------------------------+
So now we have a variable called StateAbbrev in each dataset, and we would like to have the column StateName alongside everything else in tab04.dta. To do this, all we need to do is:
merge 1:1 StateAbbrev using tab04.dta
This line gives us the output:
Result # of obs.
-----------------------------------------
not matched 1
from master 0 (_merge==1)
from using 1 (_merge==2)
matched 50 (_merge==3)
-----------------------------------------
which tells us that we successfully merged 50 rows of each dataset, and there was one left over from the “using” dataset, tab04.dta. We can find out about this row by:
list if _merge ~=3
+----------------------------------------------------------+
| StateN~e StateA~v pop2000 pop1990 _merge |
|----------------------------------------------------------|
51. | DC 572,059 606,900 using only (2) |
+----------------------------------------------------------+
It looks like DC did not appear in this file, but everything else worked well:
// Have a look at the merged dataset
list in 1/10
+------------------------------------------------------------+
| StateName StateA~v pop2000 pop1990 _merge |
|------------------------------------------------------------|
1. | Alaska AK 626,932 550,043 matched (3) |
2. | Alabama AL 4447100 4040587 matched (3) |
3. | Arkansas AR 2673400 2350725 matched (3) |
4. | Arizona AZ 5130632 3665228 matched (3) |
5. | California CA 33871648 29760021 matched (3) |
63
|------------------------------------------------------------|
6. | Colorado CO 4301261 3294394 matched (3) |
7. | Connecticut CT 3405565 3287116 matched (3) |
8. | Delaware DE 783,600 666,168 matched (3) |
9. | Florida FL 15982378 12937926 matched (3) |
10. | Georgia GA 8186453 6478216 matched (3) |
+------------------------------------------------------------
Good! We can fix the one row that merge could not help us by:
replace StateName = "District of Columbia" if StateAbbrev == "DC"
Annoying, but not as much as coding up 51 lines like this one. If we are going to do other merges, we may want to drop merge now, because it will
try to generate another merge variable for the next merge.
8.2 Wide and long datasets
Here we use emp-unemployment.xls, which can be found here: http://www.icip.iastate. edu/tables/employment/unemployment-states. I have modified this file slightly to make it easier to import into Stata. Specifically, I have appended the year columns with a “Y” so that the variable is preserved. Otherwise Stata just names these A, B, C, D, etc., because we aren’t allowed to have variable names that start with a number.
Once we import the data, we notice a problem:
clear
import excel "emp -unemployment.xls", sheet (" States ") cellrange(B7:AL59) firstrow
desc
------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------
Area str20 %20s Area
Y1980 double %10.0g Y1980
Y1981 double %10.0g Y1981
Y1982 double %10.0g Y1982
Y1983 double %10.0g Y1983
The trouble is that we have one column for each year’s unemployment rate (the rows cor- respond to states). This is a good way to organize things in a table, but it is terrible for organizing things in Stata. This file is in wide format: in terms of our panel data notation, each row corresponds to a different i subscript, and each column corresponds to a different t subscript. (e.g. one row will be for Ohio, and there will be one column for every year of data
64
we have). We would like to transform the data into long format, where each row corresponds to a different i-t pair, (e.g. one row will be Ohio in 2007). To see this:
list Area Y1980 Y1981 Y1982 Y1983 in 1/3
+-----------------------------------------------+
| Area Y1980 Y1981 Y1982 Y1983 |
|-----------------------------------------------|
1. | United States 7.1 7.6 9.7 9.6 |
2. | Alabama 8.9 10.6 14.1 13.8 |
3. | Alaska 9.6 9.4 9.9 9.9 |
+-----------------------------------------------+
We fix this using the reshape command. The syntax on the next line is as follows:
• reshape long tells Stata that we want to convert a wide dataset to long,
• Y tells Stata that all of the columns that correspond to a different t index start with the string “Y”
• i(Area) tells Stata that Area is the i-index of the data
• j(year) tells stata that the (new) t-index is to be called “year”, it will be equal to everything that follows the “Y” in the wide dataset. (e.g. the Y2006 column will be coded as year = 2006)
reshape long Y, i(Area) j(year)
// Problem solved!
list in 1/5
+-----------------------+
| Area year Y |
|-----------------------|
1. | Alabama 1980 8.9 |
2. | Alabama 1981 10.6 |
3. | Alabama 1982 14.1 |
4. | Alabama 1983 13.8 |
5. | Alabama 1984 11 |
+-----------------------+
Y is a terrible name for unemployment, so as a good practitioner we may want to think about doing something like:
rename Y unemp
Finally, Area is a string variable, which is not always easy to use in Stata. To fix this, we can use the encode command to assign an integer to every unique string in a variable
65
encode Area , generate(state)
A nice feature of this is that Stata remembers what these strings were, so you don’t have to use data labels to get the state names back, even if you drop the original variable. This means that the actual state name, rather than "state=3" will chow up in all of your graphs, regression outputs, etc. For example:
quietly regress unemp i.state
coefplot , drop(_cons)
produces Figure 8.1, which shows the coefficients on the state dummy variables. Of course, encode also allows you to do some naughty things that would get you stupid
results, such as regress state unemp, summarize state, hist state, kdensity state, and so on. Stata will not give you an error here, so be careful. Go and do a Google search of “log(NAICS)” if you want a good example of what not to do with encoded variables.
Figure 8.1:
we will use this dataset later, so before moving on, let’s:
save unemp_long.dta , replace
And if, for whatever reason, you ever want to go back to the wide format, you can do this:
reshape wide unemp , i(state) j(year)
8.3 Many-to-one and one-
to-many merges
Earlier, we had two datasets, each with the same number of rows. Our expecta- tion was that there was a on-to-one map- ping between rows of these two datasets. In many cases, however, we have to imple- ment a many-to-one or one-to-many match. This means that rows in one dataset corre- spond to many rows of the other. To illustrate this point, we continue with the (long format of) the dataset in Part 2, where we have yearly (t) data on US states (i). For whatever reason, we wish to include some characteristics that are in ”list-state-capitals- us-764j.xlsx”, which can be found here: http://www.downloadexcelfiles.com/us_en/ download-excel-file-list-state-capitals-united-states#.WH0AMBsrKUk.
Since it is easier to do the merge for a dataset that is already in Stata’s .dta format, let’s load this in first:
clear
66
import excel "list -state -capitals -us -764j.xlsx", sheet("List of State Capitals of US")
↪→ cellrange(A2:K52) firstrow generate Area = State
The dataset we want to merge this with recode ”State” as ”Area”. We need these to have the same name. We also need to save the dataset in Stata’s .dta format:
// Save the data for merging
save list -state -capitals -us -764j.dta , replace
// Go back to the long version of the unemployment data from Part 2
use unemp_long.dta
We have a variable called Area in both datasets. We need to tell Stata which file is the “many” set, and which is the “one”. For us, the data currently in memory is the “many”, because we are matching on State (named Area), and there are multiple years for each state in this file. The list-state-capitals-us-764j.dta file contains exactly one row per state, so this is the “one”.
merge m:1 Area using "list -state -capitals -us -764j.dta"
Result # of obs.
-----------------------------------------
not matched 72
from master 72 (_merge==1)
from using 0 (_merge==2)
matched 1,800 (_merge==3)
-----------------------------------------
There are some that did not match:
tab Area if _merge ~=3
Area | Freq. Percent Cum.
---------------------+-----------------------------------
District of Columbia | 36 50.00 50.00
United States | 36 50.00 100.00
---------------------+-----------------------------------
Total | 72 100.00
But that be expected, because DC, and the whole of the USA, does not appear in our dataset. Let’s check that it worked:
list State unemp Capital in 1/10
sort year
list State unemp Capital in 1/10
which generates the output:
67
+------------------------------+
| State unemp Capital |
|------------------------------|
1. | Alabama 8.9 Montgomery |
2. | Alabama 10.6 Montgomery |
3. | Alabama 14.1 Montgomery |
4. | Alabama 13.8 Montgomery |
5. | Alabama 11 Montgomery |
|------------------------------|
6. | Alabama 9.2 Montgomery |
7. | Alabama 9.7 Montgomery |
8. | Alabama 8.1 Montgomery |
9. | Alabama 7.2 Montgomery |
10. | Alabama 7 Montgomery |
+------------------------------+
. sort year
. list State unemp Capital in 1/10
+-----------------------------------------+
| State unemp Capital |
|-----------------------------------------|
1. | Connecticut 5.8 Hartford |
2. | Rhode Island 7.2 Providence |
3. | Mississippi 7.4 Jackson |
4. | Maryland 6.6 Annapolis |
5. | Delaware 7.6 Dover |
|-----------------------------------------|
6. | Michigan 12.3 Lansing |
7. | Arizona 6.6 Phoenix |
8. | Utah 6.2 Salt Lake City |
9. | Georgia 6.3 Atlanta |
10. | South Carolina 6.7 Columbia |
+-----------------------------------------+
Looks good! Note that we can also do this as a one-to-many merge if we happened to start with
list-state-capitals-us-764j.dta in the memory:
clear all
use list -state -capitals -us -764j.dta
merge 1:m Area using unemp_long.dta
68
Chapter 9
Marginal effects and interactions in non-linear models
Ai and Norton (2003) Bland and Cook (2017)
69
Chapter 10
Standard errors under different assumptions about �
For all of Section I of Bailey (2016) (and for a lot of the following sections), we assume that the error terms of our regressions, {�i}Ni=1, (among other things):
1. Have a constant variance. That is, no matter two rows of the data we are looking at, it must be that:
V [�i] = V [�j] = σ 2, for all i,j ∈{1, 2, . . .N} (10.1)
This is the assumption of homoskedasticity.
2. Are uncorrelated with each other. That is, for any two rows i and j of our dataset:
corr(�i, �j) = 0, for all i 6= j (10.2)
Note that if either of these are not true, we needn’t worry about all of the nice properties of OLS breaking down. Importantly, if these are the only problems we have, then our slope estimator is still unbiased. What we should worry about, however, is that our standard errors are not calculated correctly, and so without any correction for this, we report the results of hypothesis tests at our own peril. If the former assumption is violated, we refer to this as heteroskedasticity: the variance of the error term is not constant across observations. This is eminently fixable without having any additional insights into your data. On the other hand, if the latter is not true, then we need to know a bit more about our data to fix the problem. For a thorough run-through of these procedures, have a look at Cameron and Miller (2015). What follows is a simplification of that work to the realm of bivariate OLS. The extension to multivariate OLS, and some non-OLS techniques, is relatively straightforward with the right matrix algebra background.
To begin with, let’s see how far we can get with V [β̂1] without making any additional assumptions about the error term. The variance of β̂1 when you reg y x is:
V [β̂1] = V
[∑ i(Xi − X̄)�i∑ i(Xi − X̄)2
] (10.3)
70
Noting that we are treating the Xs as fixed, without loss of generality, we can write this as:
V [β̂1] = V [∑
i(Xi − X̄)�i ](∑
i(Xi − X̄)2 )2 (10.4)
The denominator of this is only a function of the data, so it is easily computable, and doesn’t depend on any assumptions about �. The numerator, however, simplifies differently depending on our understanding of �. Before we make any further assumptions about �, note that we can express the denominator of 10.4, without loss of generality, as follows:
V
[∑ i
(Xi − X̄)�i
] = E
(∑
i
(Xi − X̄)�i −E
[∑ j
(Xj − X̄)�j
])2 (10.5) = E
(∑
i
(Xi − X̄)�i
)2 (10.6) = E
[∑ i
∑ j
( (Xi − X̄)�i
)( (Xj − X̄)�j
)] (10.7)
= E
[∑ i
∑ j
(Xi − X̄)(Xj − X̄)�i�j
] (10.8)
= ∑ i
∑ j
(Xi − X̄)(Xj − X̄)E[�i�j] (10.9)
where (10.5) follows by the definition of variance, (10.6) follows because the expectation of any �i is zero, and (10.7) expands the squared term. What follows are further simplifications of (10.9), after making various assumptions about E[�i�j].
10.1 Homoskedasticity: the *standard* standard er-
rors
If you have been regging y x with free abandon up to this point, this is what you have been doing. Depending on how deep your understanding of OLS is, you would have been implicitly, or (I really hope) explicitly, been making the assumption that the error term has constant variance, and that any two randomly selected errors are uncorrelated with each other. More formally, this means that:
Assumption 1 (Homoskedasticity).
V [�i] = E[� 2 i ] = σ
2 for all i = 1, 2, . . .N
E[�i�j] = 0 for all i 6= j
71
Note that these two restrictions allow us to say something about all of the terms in (10.9). Specifically:
E[�i,j] =
{ σ2 if i = j
0 otherwise (10.10)
This means that we can simplify (10.9) as follows:∑ i
∑ j
(Xi − X̄)(Xj − X̄)E[�i�j] = ∑ i
(Xi − X̄)2E[�2i ] (10.11)
= ∑ i
(Xi − X̄)2σ2 (10.12)
= σ2 ∑ i
(Xi − X̄)2 (10.13)
Substituting this into (10.4) yields:
V [β̂1] = V [∑
i(Xi − X̄)�i ](∑
i(Xi − X̄)2 )2 (10.14)
= σ2 ∑
i(Xi − X̄) 2(∑
i(Xi − X̄)2 )2 (10.15)
= σ2∑
i(Xi − X̄)2 (10.16)
The denominator of this is eminently computable: it is N times the sample variance of X. σ2, however, is an unknown. Fortunately we can consistently and unbiasedly estimate it using the residuals from the regression as follows:
σ̂2 = 1
N −k ∑ i
�̂2i (10.17)
where k is the number of parameters in our model (for bivariate OLS, k = 2). And so, if we are happy with Assumption 1, we (or if we have something better than a pen and paper, our favorite statistical package) can compute our standard errors as follows:
V̂ [β̂1] = 1
N−k ∑
i �̂ 2 i∑
i(Xi − X̄)2 (10.18)
At this point, we should make an important distinction between (10.16) (10.18). (10.16) is the actual variance of β̂1. However since we do not know the true value of σ
2, we must estimate this variance. Therefore (10.18) is an estimator of (10.16). (10.18) is what Bailey (2016) reports in his equations 3.9 and 3.10.1
1Put simply: β̂1 is the OLS estimator for β1. (10.18) is the estimator of the variance of the OLS estimator for β1. :p
72
Since we usually like to report things in the same units, we typically take the square root of this thing and report the standard error, rather than the variance:
se[β̂1] =
√ 1
N−k ∑
i �̂ 2 i∑
i(Xi − X̄)2 (10.19)
10.2 Heteroskedasticity: reg y x, robust
While Assumption 1 may seem like a reasonable restriction, there are plenty of cases where we assume homoskedasticity at our own peril. The next step is to relax the “constant variance” part of Assumption 1, while maintaining the assumption that the errors are independent. That is, we drop the “identically” from the iid assumption:
Assumption 2 (Heteroskedasticity).
V [�i] = σ 2 i (σ
2 i is not necessarily equal to σ
2 j )
E[�i�j] = 0 for all i 6= j
Going back to 10.9, the E[�i�j] (i 6= j) part of this, as in the previous section, means that we can set all of the i 6= j components of the double summation equal to zero, leaving us just with the i = j terms:∑
i
∑ j
(Xi − X̄)(Xj − X̄)E[�i�j] = ∑ i
(Xi − X̄)2E[�2i ] (10.20)
However, unlike homoskedasticity, this is as far as we can get. Therefore we can simplify the expression for the variance to:
V [β̂1] =
∑ i(Xi − X̄)
2E[�2i ](∑ i(Xi − X̄)2
)2 (10.21) A quick glance of (10.21) suggests that we need an estimate for E[�2i ] for every i. While �̂
2 i
is a candidate for this, it is a terrible one because we only get one of those for each i, and so �̂2i does not plim to E[�
2 i ]. Fortunately, closer inspection of (10.21) reveals that we need
only estimate the numerator, specifically:
V [β̂1] =
∑ i(Xi − X̄)
2E[�2i ](∑ i(Xi − X̄)2
)2 (10.22) =
1 N
∑ i(Xi − X̄)
2E[�2i ] 1 N
(∑ i(Xi − X̄)2
)2 (10.23) and by some law of large numbers arguments:2
1
N
∑ i
(Xi − X̄)2�̂2i p−→
1
N
∑ i
(Xi − X̄)2E[�2i ] (10.24)
2I am being somewhat hand-wavy here.
73
So we can estimate the variance of β̂1, under Assumption 2, as follows:
V̂ [β̂1] =
∑ i(Xi − X̄)
2�̂2i(∑ i(Xi − X̄)2
)2 (10.25) se[β̂1] =
√ ∑ i(Xi − X̄)2�̂
2 i(∑
i(Xi − X̄)2 )2 (10.26)
Importantly, this formula requires no additional information about the data generating pro- cess to compute it (although it requires stronger assumptions than some of the techniques in later sections of this chapter). Contrast this to later sections of this chapter. If you can reg y x, you can always estimate standard errors that are robust to heteroskedasticity. In Stata, just reg y x, robust instead. These standard errors are often referred to as “heteroskedasticity-robust standard errors”, or simply “robust standard errors”. Try to use the former, they are not robust to everything (see, for example, the next section)!
10.3 Clustering: “I think you have 3 statistically inde-
pendent observations”
In Section 10.1, we explored the implications of assuming that our errors were independently and identically distributed. In Section 10.2 we relaxed the “identically” distributed part by allowing each �i to have a different variance. In this Section, we will work to relax the “independently” part of this. In relation to (10.9), this means that we can now allow for E[�i�j] 6= 0 for some i 6= j.
The “some” in the previous sentence is an important one: in particular, I was very deliberate in not using the word “all”. To understand this, and what is to come, it is important why we can’t do this for “all” i 6= j. Note that the sample analog of (10.9) is:∑
i
∑ j
(Xi − X̄)(Xj − X̄)�̂i�̂j (10.27)
That is, we have replaced E[�i�j] with �̂i�̂j. We can re-arrange this as follows:∑ i
∑ j
(Xi − X̄)(Xj − X̄)�̂i�̂j = ∑ i
[ (Xi − X̄)�̂i
]∑ j
[ (Xj − X̄)�̂j
] (10.28)
Each one of these summation terms is the solution to the sum-of-squares minimization prob- lem! In other words, when we do OLS, we are exactly setting these things equal to zero. Therefore, using (10.28) for the (sample equivalent of) the denominator of (10.4) means that we would compute standard errors of zero, and our t-statistics would shoot off to infinity. This is no good: we need to do better! Unlike heteroskedasticity, where we could say “we can construct standard errors that are robust to any kind of hereoskedasticity without knowing what that heterskedasticity looks like”, we can’t make a similar statement of the form “we
74
can construct standard errors that are robust to any kind of correlation between the error terms, without knowing what that correlation looks like.” But sometimes we can know a bit about the structure of this correlation, or at least have a good story about why the proposed structure is a believeable one.
One such instance of this is clustering. In this situation, we believ that the data are divided into distinct clusters. If two observations are not in the same cluster, then we have a good reason to believe that their errors ar uncorrelated. On the other hand, for two observations within the same cluster, then we cannot make the argument that they are uncorrelated.
An example Consider, for example, the task of estimating the mean height students on campus. The two following methods would achieve unbiased estimators of these quantities, both of which require the collection of 100 observations:
1. Randomly select N students on campus, and measure their heights {hi,1}Ni=1. Take the average of these heights. This is your estimate µ̂1 = 1
N
∑N i=1 hi,1.
2. Randomly select one student on campus. Measure his/her height on T days over the course of the academic year {h1,t}Tt=1. Take the average of these heights. This is your estimate µ̂2 = 1
T
∑T t=1 h1,t.
Suppose that each sample contains the same number of observations: N = T = 100. Both sampling procedures generate a point estimate using 100 observations. As (by assumption) any randomly selected student’s height will on average be equal to the population mean, both procedures produce unbiased estimates. But what is generating the variation in mea- surements in these two procedures? Suppose that we can model a measurement of student i’s height at time t as follows:
hi,t = µ + ηi + �i,t (10.29)
Where µ is the population mean height (the thing we are trying to estimate), ηi is student i’s deviation from the mean height (i.e. how much taller/shorter is i than the average height), and �i,t is an iid error in measurement for student i on day i. We assume without loss of generality that E[ηi] = E[�i] = 0. With some loss of generality, let’s also assume that V [ηi] < ∞ and V [�i,t] < ∞.
For sampling procedure 1, every row of our dataset belongs to a different student, so the variation in hi,t is driven by both ηi and �i,t, so we could alternatively write this as
hi,t = µ+ψi,t, where ψi,t is the combined error term ηi +�i,t. Hence µ̂ 1 p−→ µ, good! The more
observations we collect in sampling procedure 1, the more likely we are to be arbitrarily close
to µ. Additionally, by standard central limit arguments: √ N(µ̂1 −µ) d−→ N(0,V [ηi + �i,t]),
and so all of our inference can be done in the *usual* way. For sampling procedure 2, things become more complicated. To see this, note that since
we are repeatedly sampling the same student’s height, we always get the same ηi in our equation. Therefore, instead of (loosely) converging to µ, we get a really good estimate of
75
µ + η1, the single student’s height. By “really good” here, I don’t mean that we should be happy: we have a really good estimate of something we don’t want to know, and hence a really bad estimate of the population mean height. While in sampling procedure 1, increasing the sample size gets us closer (in the plim sense) to µ, increasing the sample size in sampling procedure 2 gets us closer to µ + ηi. This is in expectation equal to µ, but it does not have
the same nice convergence properties (both p−→ and d−→) as µ̂1. One way of looking at this
problem is that sampling procedure 2 does not collect statistically independent observations:
for t 6= s : cov(h1,t,h1,s) = E [(η1 + �1,t)(η1 + �1,s)] (10.30) = E
[ η21 + �1,sη1 + �1,tη1,t + ηi,tη1,s
] (10.31)
= E [ η21 ] 6= 0 (10.32)
OK, so it seems reasonable, even before reading the above section, that any econome- trician with half a brain should realize that procedure 2 is a terrible one for estimating µ. Why would we ever see such a procedure at all then? The answer is that we usually don’t, but we often see things that are a mix of procedures 1 and 2. In this context, this might be because it is cheaper to sample one person N times than sample N people once (perhaps the study requires getting consent from all of the participants, but only once per participant). Clearly we would never want to just sample 1 person, but maybe we settle for sampling a few people a few times. Therefore, it is reasonably common to see a sampling procedure like the following:
3. Randomly select N students on campus, and measure their heights on T days over the course of the academic year {hi,t}
i=N,t=T i=1,t=1 . Take the average of these heights. This is
your estimate µ̂3 = 1 NT
∑T t=1
∑N i=1 hi,t.
For the sake of simplicity, we have assumed that we have a balanced panel: each student is measured T times, hence we have NT observations. This assumption is unnecessary, and does not affect any of the discussion below.
Again, µ̂3 is an unbiased estimator of µ because everything that goes in to the average is on average equal to µ. Moreover, as N →∞ (i.e. as we sample more and more students), this thing will plim to µ, and will be asymptotically normal. However, we need to be careful about how we apply this second property when doing inference. Specifically, it is reckless to think, or apply a technique that assumes, that we have NT statistically independent observations. To see this, note the following for two arbitrary observations in our dataset:
cov(hi,t,hj,s) = E [(ηi + �i,t)(ηj + �j,s)] (10.33)
= E [ηiηj + ηi�j,s + ηj�i,t + �i,t�j,s] (10.34)
= E [ηiηj] + 0 + 0 + 0 (10.35)
=
{ E [η2i ] > 0 if i = j
0 if i 6= j (10.36)
76
What this is telling us is that observations that correspond to the same student are not statistically independent, but observations that correspond to different students are statis- tically independent. Actually, 10.36 tells us more than this: observations corresponding to the same student are correlated, and we know that this correlation must be positive. The implications of this are as follows:
• E[µ̂3] = µ (good)
• As N →∞, µ̂3 → µ (good)
• As N → ∞, neither the standard, nor the heteroskedasticity-robust, standard errors approach the asymptotic standard deviation of µ̂3.
The third point is really bad: we can get a good point estimate of µ quite easily, but unless you keep reading, you can’t do any hypothesis tests. Please keep reading!
Formally, we have a variable ci which identifies the cluster that observation i belongs to such that:
ci = cj ⇐⇒ i and j are in the same cluster (10.37) ci 6= cj ⇐⇒ i and j are not in the same cluster (10.38)
10.4 Autocorrelated errors
77
Part III
Some common econometric techniques
78
Chapter 11
Ordinary Least Squares (linear regression)
This chapter is a companion to Bailey (2016) chapters 3-7. As such, it assumes that you have read them and understand them. That said, there will be some overlap. This is deliberate.
11.1 regress: Implementing OLS in Stata
In Bailey (2016) Chapter 3, we are introduced to bivariate OLS, or bivariate linear regression. The “bivariate” part of this means that we have two variables. These are usually notated as:
• Yi is the dependent, or left-hand-side (LHS) variable, and
• Xi is the independent, explanatory, or right-hand-side (RHS) variable. I prefer not to use “independent”, for reasons that will become clear when we study endogeneity.
To see how these names fit in, note that the equation that we are trying to estimate is:
Yi = β0 + β1Xi + �i (11.1)
where β0 and β1 are parameters that we are trying to estimate, and �i is an error term. Thus Xi is on the RHS, Yi is on the LHS, Xi explains Yi, and Yi depends on Xi.
Without telling Stata anything else, when you ask it to regress, it will estimate a model that is valid if the following criteria are met:
1. E[�i]=0. That is, the error term has zero mean. This is more of a normalization than a criterion, in that we can always dink with β0 to make this true.
2. E[Xi�i] = 0. In words: there is no (linear) correlation between the explanatory variable Xi and the error term �i. This could be why Xi is often referred to as the “independent variable”: because it is assumed to be independent of �i. We will spend most of the remainder of this course worrying that Xi is not independent of �i (in specific cases), and how/if we can fix this problem.
79
3. V [�i] = σ 2 for all i. In words: the variance of �i is a constant. Specifically, we might
worry that the variance of �i depends on Xi. This assumption is called homoskedastic- ity.
4. cov(�i, �j) = 0 for all i 6= j. If this is not true, then one error will tell you something about another, and so the rows of our dataset are not independent observations.
When we regress Y X, we probably want to answer a question that looks like one of the following:
1. How can I predict Yi using Xi? or
2. What is the causal effect of Xi or Yi?
If our goal is prediction, then all we need is Assumption 1. This isn’t really an assumption, so we are good to go. Specifically, if we are trying to predict Yi without an Xi, we would just use ȳ, the sample mean of {yi}Ni=1. If we used Xi as well, then this must be at least as good as just using ȳ. On the other hand, if we want to get an unbiased estimate of the causal effect of Xi on Yi, that is E[β̂1] = β1, then we need Assumption 2 to be true.
What about the others? Well, if all we wanted was a point prediction, nobody would care. However it is <understatement> somewhat standard </understatement> to report mea- sures of precision of your estimates, or do inference. In that case, you also need Assumptions 3 and 4.
OK, so now that we understand what we’re doing, let’s actually do it. For this example, I am going to use the galton heights.csv dataset to estimate the equation:
child heighti = β0 + β1av parent heighti + �i (11.2)
where av parent heighti is the average of the heights of the child’s parents (in units of inches, minus sixty inches). All I have to do to estimate this is:
clear all
import delimited "galton_heights.csv"
desc
generate av_parent_height = (father_height+mother_height)/2
regress child_height av_parent_height
which gives me the output in Table 11.1. There’s a lot of information here, a lot of which most people will care about, but all of it could be useful to someone, depending on the application. Of most importance are our estimates for β0 and β1, the constant and slope term respectively. These are β̂0 = 2.34 and β̂1 = 0.66. Immediately to the right of these numbers are the standard errors, and everything to the right of that are functions of the estimates and their standard errors. Specifically, we get t-statistics for the test that each parameter is equal to zero, the p-value associated with that (2-sided) test, ans a 2-sided 95%
80
Source | SS df MS Number of obs = 934
-------------+---------------------------------- F(1, 932) = 108.20
Model | 1243.54014 1 1243.54014 Prob > F = 0.0000
Residual | 10711.1131 932 11.4926106 R-squared = 0.1040
-------------+---------------------------------- Adj R-squared = 0.1031
Total | 11954.6533 933 12.8131332 Root MSE = 3.3901
------------------------------------------------------------------------------
child_height | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
av_parent_~t | .6625512 .0636941 10.40 0.000 .5375509 .7875516
_cons | 2.342302 .4375628 5.35 0.000 1.483579 3.201025
------------------------------------------------------------------------------
Table 11.1: Estimation output for Equation11.2.
confidence interval for that parameter. That is (for large enough N):1
tk = β̂k/se(β̂k) (11.3)
pk = 2Φ(|tk|) (11.4) CIk = β̂k ± 1.96se(β̂) (11.5)
11.2 esttab: Producing publication-quality regression
outputs in Stata
1Stata sometimes uses a t-distribution to calculate these values rather than a normal. If the number of degrees of freedom in the t distribution is greater than 30, these two distributions practically indistinguish- able.
81
Chapter 12
Maximum Likelihood
Exercises
Exercise 12.1. In Game Theory, an indefinitely repeated game is one that is repeated until a random condition is met. One way to implement this in an economic experiment is to roll a die after every repetition: if a 6-sided die roll is (say) four or less, then the game is repeated for another round, otherwise there are no more repetitions. For this particular stopping rule, what we achieve is a stopping probability of δ = 1/3. That is, if we roll a 1, 2, 3, or 4, we continue, and if we roll a 5 or 6, we stop. One concern an experimenter might have is that the number of repetitions that two subsets of the sample had were very different. This might happen if one group had unusually long game lengths, and another had unusually short game lengths. This could be a problem because we want to attribute differences in participants’ behavior to something else, like the different payoffs in the game. Therefore, it is common in situations like this to report the results of a hypothesis test that the two groups experienced similar game lengths.
EndRound.csv is a stripped-down dataset from an experiment of mine and some co- authors [note to self: insert citation when we actually publish it]. Each row of this file contains one instance of a repeated game. The file contains two variables: EndRound is the number of rounds that this game was played for, and group identifies whether this row correpsonds to Group 1 or Group 2 in the experiment. The EndRound variable was generated almost exactly as described above: during the experiment at the end of every repetition, I rolled a 20-sided die, and we played another one if the number was sufficiently low.
Let Xi be the number of rounds that participants play game i. Given the description above, Xi must follow a Geometric distribution, which has probability mass function:
p(x) =
{ (1 − δ)x−1δ if x = 1, 2, 3, 4, . . . 0 otherwise
You can think of this as Xi is the number of times you have to flip an unfair coin that comes up heads with probability δ, until you have seen one head.
82
1. What is the likelihood of observing a sample {xi}Ni=1?
2. What is the log-likelihood function? Express your answer as a function of δ, N, and the sample mean only.
3. What is δ̂, the maximum likelihood estimator for δ?
4. Use the data to estimate δ for each group individually, and for both groups pooled.
5. Report the p-value for the test that the two groups have the same δ (do the Likelihood Ratio test).
83
Chapter 13
Instrumental variables (2SLS)
13.1 Over-identification test
In the case that we are lucky enough to have more than one instrument, we can do an over- identification test. Qualitatively, what we are doing in this test is estimating β̂1 for for each of our instruments separately, and determining if this changes our answer in any appreciable way. There are two possible outcomes of this test:
1. If all of the β̂1s are close to each other, then we conclude that either all instruments satisfy the exclusion condition, or all instruments do not satisfy the exclusion condition.
2. If at least one β̂1 is very different from the others, then we conclude that at least one instrument does not satisfy the exclusion condition, and at least one instrument satisfies the exclusion condition.
Importantly, note that we can never know that all of our instruments are valid. To demonstrate this, we need an example of using 2SLS with multiple instruments. Berry
et al. (1995) is one such example. In particular, we are interested in estimating the demand curve for cars, and how this shifts with three characteristics:
• air: a dummy variable for whether the car has air conditioning as standard
• weight: The weight of the car (units are unspecified in the dataset, but are somewhat irrelevant for this example), and
• hp: how powerful the car is.
A cut-down version of the specification in Berry et al. (1995) is as follows:
log(Qi) − log(Q0) = β0 + β1pi + β2airi + β3weighti + β4hpi + �i (13.1)
where Qi is the quantity of car i sold in the market, and Q0 is a measure of the potential size of the market (e.g. the population of consumers in the market). Econometrically, we worry that
84
(1) (2) (3) delta price delta
price -0.000361∗ -0.0000753∗∗∗
(-2.15) (-5.38) air 0.753 3118.0 -0.480
(0.76) (1.36) (-1.44) weight 0.00154∗ 2.709 0.00124∗∗∗
(2.26) (1.17) (3.84) hp 0.0494 169.6∗∗∗ -0.000510
(1.63) (6.21) (-0.12) Z air -80797.3
(-1.26) Z weight 104.0
(1.88) Z hp 111.0
(0.15) Constant -12.71∗∗∗ -294891.0 -9.854∗∗∗
(-5.74) (-1.87) (-13.64) Observations 131 131 131 F 46.73 14.98 method 2SLS 2SLS (1st stage) OLS p overid 0.255
t statistics in parentheses ∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001
Table 13.1: 2SLS estimation of equation 13.1 using data from Berry et al. (1995).
price is endogenous. It is somewhat common in this field to use the average characteristics of all cars not produced by the firm that produces car i are valid instruments for pi. I will ignore the justification of this here, but use theis as an example of an over-identification test. Specifically, we have one endogenous regressor (price), and three instruments (average air, weight, and hp of cars not made by manufacturer i). The results of this estimation are shown in Table 13.1. I also log price and run the estimations again in Table 13.2 so that the coefficient has an elasticity interpretation. Column 1 shows the 2SLS results, column 2 is the first stage regression, and column 3 is a näıve OLS specification that we should never take seriously. The F-statistic in column 2 provides strong support for the instruments satisfying the inclusion condition.
Now let’s focus on the exclusion condition. We can never directly test that the exclusion condition is satisfied, but we can do an over-identification test. To implement this in STATA, simply code up estat overid below your 2SLS command. By default this displays two tests. In my code I get STATA to report the p-value of the first, the Sargan (score) test, in the esttab tables. In both tables, the p-value is large, so we conclude that one of the following
85
(1) (2) (3) delta log(price) delta
log(price) -8.031∗∗ -2.192∗∗∗
(-3.02) (-5.78) air 1.625 0.257∗∗ -0.141
(1.68) (3.11) (-0.41) weight 0.00272∗∗∗ 0.000273∗∗ 0.00159∗∗∗
(3.68) (3.29) (4.86) hp 0.0389∗ 0.00626∗∗∗ 0.000674
(2.11) (6.39) (0.15) Z air -3.318
(-1.44) Z weight 0.00514∗
(2.60) Z hp 0.00304
(0.12) Constant 55.64∗∗ -5.996 8.574∗∗
(2.59) (-1.06) (2.73) Observations 131 131 131 F 94.11 16.33 method 2SLS 2SLS (1st stage) OLS p overid 0.330
t statistics in parentheses ∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001
Table 13.2: Table 13.1, but with logged price, so that we can interpret the coefficient as an elasticity.
86
(1) (2) (3) delta delta delta
log(price) -207.7 -110.0∗ -168.4 (-0.83) (-2.34) (-0.90)
air 5.629 2.602 4.414 (0.71) (1.59) (0.74)
weight 0.00571 0.00357∗∗ 0.00485 (1.00) (2.82) (1.13)
hp 0.122 0.0580 0.0961 (0.74) (1.82) (0.78)
Constant 426.7 221.7∗ 344.4 (0.81) (2.25) (0.87)
Observations 131 131 131 instrument air weight hp
t statistics in parentheses ∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001
Table 13.3: Table 13.2, using only one instrument.
is true:
• All of the instruments are valid,
• All of the instruments are equally bad,
but we don’t know which. For the sake of completeness, I also include Table 13.3, which shows the three possible
2SLS estimations using only one instrument. The coefficients on log(price) vary wildly, but not so much as to reject the null for the over-identification tests.
The code that generated these tables follows in the panel below:
clear all
import excel "cars_data.xls", sheet("PS_ cars_data ") cellrange(B10:H141) firstrow
desc
global M = 100*10^6
// generate share data
qui gen share = Q/$M
egen shareCAR = total(share)
gen share0 = 1-shareCAR
// delta (estimate of quality)
gen delta = log(share)-log(share0)
// RHS characteristics
local X "air weight hp"
// Instruments of average characteristics of cars not produced by firm i
qui sum firm
local nfirms = r(max)
87
foreach x of local X {
qui gen Z_‘x’ = .
forvalues ii= 1/‘nfirms ’ {
qui sum ‘x’ if firm ~=‘ii’
qui replace Z_‘x’ = r(mean) if firm == ‘ii’
}
}
// Tables using all instruments
forvalues kk = 1/2 {
eststo reg_ ‘kk’_1: ivregress 2sls delta air weight hp ( price = Z_*)
estat overid // Over -identification test. That ’s it! Just one line!
estadd scalar p_overid = r(p_score)
estadd local method "2SLS"
eststo reg_ ‘kk’_2: regress price air weight hp Z_*
estadd local method "2SLS (1st stage)"
eststo reg_ ‘kk’_3: regress delta air weight hp price
estadd local method "OLS"
replace price = log(price)
esttab reg_ ‘kk’_* using cars ‘kk ’.tex , label scalars(F method p_overid) compress nogaps
↪→ replace label variable price "log(price)"
}
// Table using each instrument separately
foreach x of local X {
eststo reg0_ ‘x’: ivregress 2sls delta air weight hp ( price = Z_‘x’)
estadd local instrument "‘x’"
}
esttab reg0_* using cars0.tex , label compress nogaps replace scalars(instrument)
88
Chapter 14
Time series
14.1 Diagnostics
Like most chapters, Bailey jumps right into a concept’s implication for OLS without going over some more fundamental concepts. I cover some that I think are important here. Specif- ically, we focus on a univariate time series {Yt}tt=1, and the implications of autocorrelation on the properties of a sample mean. In particular, we may be worried that at least one of the following are true:
1. ȳ is a biased estimator of E[Yt]
2. The method we use for calculating standard errors for ȳ assume that the time series is not serially correlated, and so we may be over- or under-stating significance.
14.1.1 Autocorrelation and partial autocorrelation functions
Autocorrelation and partial autocorrelation functions are useful ways to graphically represent the serial correlation in a time series. For stationary time series, the autocorrelation function (ACF) is defined as:
R(τ) = E[(Yt −µ)(Yt−τ −µ)]
σ2 (14.1)
In words, this is the correlation between our random variable, and its value τ periods in the past, or corr(Yt,Yt−τ ). If Yt is an MA(k) process, then R(τ) 6= 0 for τ = k, and zero for τ > k.
The partial autocorrelation function is defined recursively, and involves projection ma- trices. In upholding my promise to not go into matrix algebra, I will hold off on the formal definition, and provide this intuition instead: You estimate the model:
Yt = β0 + β1Yt−1 + β2Yt−2 + . . . + βτYt−τ + �t (14.2)
89
Figure 14.1: ACF and PACF of a simulated AR(1) process: Yt = 0.5Yt−1 + �t
The partial autocorrelation function is equal to:
α(τ) = plimβ̂τ (14.3)
That is, after controlling for all lags of lower order, how much additional explanatory power does Yt−τ provide for Yt. An AR(k) process will have α(τ) 6= 0 for τ = k, and zero for τ > k.
We can therefore use the sample analog of these to diagnose the presence of autocorrela- tion, and maybe even the type of autocorrelation present (if it is not too fancy). Fortunately, Stata calculates plots of these very simply:
ac Y // For autocorrelation function
pac Y // For partial autocorrelation function
Sometimes these are reasonably easy to spot. For example Figures 14.1 and 14.2 show these functions for an AR(1) and MA(1) process respectively. For the PACF in Figure 14.1, we only identify one significant lag: the first. This is consistent with an AR(1) process. Since the higher-order lags appear to be insignificant in the PACF, this tells us that Yt−1 adequately characterizes the serial correlation of Yt. For the ACF in Figure 14.2, again we only identify one significant lag: the first. This is consistent with an MA(1) process. Since the higher-order lags appear to be insignificant in the ACF, this tells us that �t−1 adequately characterizes the serial correlation of Yt. We may not be so lucky. For example, Figure 14.3 shows the ACF and PACF for an ARMA(1,1) process. It is not clear from the “eyeball” hypothesis test that this is indeed the case. That said, this figure is evidence for the presence of autocorrelation, it is just not very helpful in identifying the type of autocorrelation.
90
Figure 14.2: ACF and PACF of a simulated MA(1) process: Yt = �t + 0.5�t−1
Figure 14.3: ACF and PACF of a simulated ARMA(1,1) process: Yt = 0.5Yt−1 + �t + 0.5�t−1
91
14.2 Declaring time series datasets and dealing with
lagged variables
(watch this space!)
92
Part IV
Advanced reg-monkeying
93
Chapter 15
Looping over variables: one reg y x, robust, many regressions.
In the process of doing research, unless you have perfect foresight you will be constantly updating the way you analyze your data, and how you communicate this. This could include things like:
• changing the variables that you include on the RHS of your regressions
• changing the way your regression tables are labeled
• changing the hypothesis tests that you do
All of these changes could be motivated by, for example:
• recognizing a mistake
• recognizing a better way to analyze your data or display your results
• being told by a referee or conference participant that you are doing it wrong. Especially for a referee, you should pay attention to this! (even if you think they are nuts)
To demonstrate this problem, we will investigate the determinants of speeding fines. We will use a dataset (also used in Bailey, 2016), which is a cut-down version of the dataset used in Makowsky and Stratmann (2009).
The dataset includes information about people who were pulled over for speeding. In particular, we are interested in the effect of MPHover (how must faster than the speed limit a person was driving) on amount (the fine they received). Additionally, we might be worried if the other variables in the dataset had any bearing on speeding: if they do, this could be evidence for discrimination. In order to drive home our point (that there is discrimination), we run five regressions two different ways. The five regressions are:
1. reg amount MPHover //i.e. just bivariate OLS
2. reg amount MPHover age female
94
3. reg amount MPHover Black Hispanic
4. reg amount MPHover StatePol OutTown OutState
5. reg amount MPHover age female Black Hispanic StatePol OutTown OutState
That is, in 2-4 we control for some things that could be related, and regression 5 we put them all together. The “two different ways” are:
1. Using the entire sample
2. Only using observations of people who were actually fined (there are a lot of zeros)
We want to show 2 tables. The first for all regressions without the restriction, and the second with the restriction (only people who got fined). In addition to this, we would like to do a hypothesis test that all of the controls jointly do not affect fines (i.e. no discrimination), and include the p-value in our regression tables. By my reckoning we need about 3 lines of code per regression:
eststo reg_1: regress amount mphover controls ... if restriction
test controls
estadd scalar p=round(‘r(p) ’,0.0001)
then for each restriction:
esttab reg_*, scalars(p) ...
so we’re looking at 2 × 5 × 3 = 30 lines of code for the regressions and hypothesis tests, and another 2 to get the table outputs. More to the point, if we want to add another set of controls, we need to code up another 3 × 2 = 6 additional lines, as well as change the code for the last column on the table (which would now be the 6th column). This seems tedious, and a great way to make a mistake. Instead, we are going to do the following:
1. Define our three sets of controls
2. Define the two restrictions
3. Loop over the controls and restrictions
This way, if we want to add or remove some controls, or a referee/seminar attendant says we deed to do something differently, we simply change one line of code in steps 1 or 2, and STATA will take care of the rest:
clear all
set more off
import delimited "M08_speeding_tickets_text.csv"
desc
summarize
// how Stata deals with strings
95
local strA "stringA" //here we are generating two variables , strA and strB and ’adding ’ them
↪→ togehter to create a vairable called both. local strB "stringB"
local both "‘strA ’ ‘strB ’"
disp "‘both ’"
/*In the coding literature this called string concatenation. We are ’adding ’ strings
↪→ together so we can define our controls and then run multiple regressions adding in a ↪→ control at a time. Note that there are some missing values in "amount ". These are ↪→ recorded in the data editor as ".". Let ’s assume that in these cases no fine was ↪→ issued. We will create a variable called "nofine", and replace the missing values ↪→ with zeros.
*/
generate nofine = 0
replace nofine = 1 if amount ==.
replace amount = 0 if nofine ==1
summarize
// Define the controls here
local control_0 ""
local controllabel_0 "-"
local control_1 "black"
local controllabel_1 "black"
local control_2 "female"
local controllabel_2 "female"
local control_3 "outtown outstate statepol" // location controls
local controllabel_3 "location"
local control_4 "hispanic"
local controllabel_4 "hispanic"
local controllabel_5 "all"
//when we go through the outdie loop ‘ii’ for the first time , ii=-1, which will include all
↪→ the data since all the amounts are 0 or greater , //the second time we go through the outside loop , since we have a strict inequality we are
↪→ going to exclude all fines equal to 0, i.e. only run on peple who got fined.
forvalues ii = -1/0 {
local control_all = ""
forvalues cc = 0/4 {
// All regressions except the one with all controls are done here
quietly eststo reg_ ‘cc ’: regress amount mphover ‘control_ ‘cc’’ if
↪→ amount >‘ii ’ estadd local controls ‘controllabel_ ‘cc’’
local control_all = "‘control_all ’ ‘control_ ‘cc ’’"
disp "‘control_all ’"
}
// The final column with all controls
quietly eststo reg_all: regress amount mphover ‘control_all ’ if amount >‘ii’
estadd local controls ‘controllabel_5 ’
esttab reg_* using Looping_over_variables ‘ii ’.tex , se compress nogaps
↪→ replace keep(mphover) scalars(controls) drop *reg_*
}
The above code produces the following Tables 15.1 and 15.2
Exercise 15.1. You are unsure whether amount and/or mphover should be logged or not in Table 15.2.1
Write a script that produces an esttab table with four columns, corresponding to the 4
1Note that this doesn’t make much sense with Table 15.1 because we can’t (or at least shouldn’t) log the
96
(1) (2) (3) (4) (5) (6) amount amount amount amount amount amount
mphover 8.345∗∗∗ 8.340∗∗∗ 8.273∗∗∗ 8.031∗∗∗ 8.323∗∗∗ 7.971∗∗∗
(0.0437) (0.0437) (0.0436) (0.0402) (0.0437) (0.0401) N 68357 68357 68357 68357 68357 68357 controls - black female location hispanic all
Standard errors in parentheses ∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001
Table 15.1: All data used
(1) (2) (3) (4) (5) (6) amount amount amount amount amount amount
mphover 6.886∗∗∗ 6.889∗∗∗ 6.871∗∗∗ 6.899∗∗∗ 6.884∗∗∗ 6.887∗∗∗
(0.0385) (0.0385) (0.0385) (0.0382) (0.0385) (0.0382) N 31674 31674 31674 31674 31674 31674 controls - black female location hispanic all
Standard errors in parentheses ∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001
Table 15.2: Restricting to positive amounts.
possible combinations of logging or not logging these variables. You may use the regress command exactly once.
zeros. If we applied this same script to the whole dataset, the columns with logged amount on the LHS will be for regressions dropping all observations with no fine (i.e. the restriction in Table 15.2).
97
Part V
Simulation techniques
98
Chapter 16
An introduction to Monte Carlo techniques
16.1 Stata’s (pseudo) random number generators
16.2 Using random number generators
16.3 Stata’s simulate command
Stata allows you to break up the simulation process into two steps. This is helpful because you can concentrate on getting one thing done well at a time. These steps are:
1. Write a program that simulates one draw from the distribution you are trying to simulate.
2. Use Stata’s simulate command to run this program over and over again lots of times. It puts a “sample” from this simulated distribution in the data editor.
To begin with, let’s work through the example for the simulate function in Stata’s help file. You can access this by typing: help simulate. This example simulates draws from a lognormal distribution: if X ∼ N(µ,σ2), then Y = exp(X) ∼ lognormal(µ,σ2), i.e. log(Y ) = X N(µ,σ2), hence if you log Y , it has a normal distribution.
We would like to simulate the distributions of the sample mean and variance of Y ∼ lognormal(0, 1). To do this, we will need to:
0. Set the sample size to simulate
1. draw X N(0,1)
2. Generate Y=exp(X)
3. Summarize Y (to get the mean and variance of our simulated sample)
99
4. Store the mean and variance as data
5. Go back to step 1. Stop when we have done this enough that we have approximated the distribution well.
Steps 0-3 for a sample size of 20 on their own would be:
set obs 20 // (0)
generate X = rnormal (0,1) // (1)
generate Y = exp(X) // (2)
summarize Y // (3)
Note that we can access the stored results of summarize using ‘r(.)’, specifically:
display "‘r(mean)’"
display "‘r(Var)’"
To see what else we can access from summarize, type help summarize into the command line.
If we want to use this procedure for simulating data, it will take a long time. A more elegant way to set this problem up is to write a program that performs steps 0-3, then let Stata’s simulate program do steps 4 and 5. The first step is to write a program. This one is a cut and paste from the simulate help file, then I have commented above each line explaining what it does:
// Clear everything in the memory so that we know that we are starting fresh
clear all
/*Tell stata that we are writing a program. It knows that everything between
here and "end" is part of the program The program name is lnsim (i.e.
lognormal simulation) rclass lets us know that we can access variables generated
by the program through ‘r(.)’ (more on this later)
*/
program define lnsim , rclass
// Sometimes newer and older versions of Stata work slightly differently.
↪→ Make Stata behave as if it’s Stata 13 // This line is not always needed
version 13
/*
define the syntax of the program
Here we let Stata know the inputs to the program
These inputs are:
obs = number of observations. It must be an integer , and by default
↪→ is equal to 1 mu = parameter mu in the lognormal distribution. It must be a real
↪→ number , and by default it is equal to 0 sigma = sigma in the lognormal distribution. It must be a real
↪→ number , and by default it is equal to 1
If you don ’t specify these inputs when calling the function , Stata will use
↪→ the default values. */
syntax [, obs(integer 1) mu(real 0) sigma(real 1) ]
// Drop all variables from the memory
drop _all
// Set the number of observations in the dataset (i.e. step 0)
set obs ‘obs ’
// Define a temporary variable called z (it will be deleted when the program
↪→ finishes) tempvar z
100
// generate z but taking the exponential of a normal random variable with
↪→ mean mu and standard deviation sigma // This is our random sample
gen ‘z’ = exp(rnormal(‘mu ’,‘sigma ’))
// Summarize z. This calculates the mean and variance of the random sample
summarize ‘z’
// Tell Stata to store the mean calcualted in summarize as a scalar called
↪→ mean return scalar mean = r(mean)
// Tell Stata to store the Variance calcualted in summarize as a scalar
↪→ called Var return scalar Var = r(Var)
end
If you run this script as is, you probably won’t notice much. What is going on in the background is Stata adds this function nlsim to its memory, so you can now use it just like you would use summarize or tabulate. To see this, once you have run the above script, try typing the following into the command line (or pasting it below this script):
// Check that the program works by itself
lnsim , obs (20) mu(0) sigma (1)
display r(mean)
display r(Var)
This will display the sample mean and variance of your simulated sample of 20 observations. However we want a “sample” of the sample mean and variance, not just one observa-
tion. If you really had nothing better to do, you could click run 1,000 times (actually this is probably not enough repetitions) and copy and paste the numbers into a spreadsheet. Fortunately, Stata can do this for you. Here’s how:
simulate SampleMean = r(mean) SampleVariance = r(Var), reps (1000): lnsim , obs (20) mu(0)
↪→ sigma (1)
which will give you a dataset that looks something like this (just showing the first 10 draws):
. list in 1/10
+---------------------+
| Sample~n Sample~e |
|---------------------|
1. | 1.208445 .8967296 |
2. | 1.50637 1.604306 |
3. | 2.138609 6.13594 |
4. | 2.690951 6.772248 |
5. | 1.46579 1.116809 |
|---------------------|
6. | 1.088434 .5308753 |
7. | 1.803651 3.82006 |
8. | 2.250783 7.55613 |
9. | 1.89442 6.097122 |
10. | 1.069643 .364791 |
+---------------------+
101
So we have a “sample” of sample means and variances. The following script will do all of this for 10,000 repetitions, then outputs some histograms of
• The simulated sample mean and variance, See Figure 16.1, and
• The simulated t-statistics testing H0 : E[X] = exp(1/2) (this is the true population mean). Specifically:
t = x̄− exp(1/2)√
s2X/N
See Figure 16.2. Alarmingly, the last 3 lines of this script calculates that about 15% of these t-statistics are greater than 1.96 in absolute value, but 1.96 is the critical value for the 5% test. We would be rejecting H0 much too frequently!!
clear all
// run the script where I have defined the function lnsim
do ExampleLogNormal01
// check that the program works by itself
lnsim , obs (20) mu(0) sigma (1)
display r(mean)
display r(Var)
set more off
set seed 42
simulate SampleMean = r(mean) SampleVariance = r(Var), reps (10000): lnsim , obs (20) mu(0)
↪→ sigma (1) list in 1/10
// histograms of means and variances
histogram SampleMean
graph export ExampleLogNormalMean.pdf , replace
histogram SampleVariance
graph export ExampleLogNormalVariance.pdf , replace
// t-statistics
generate t = (SampleMean -exp (1/2))/sqrt(SampleVariance /20)
histogram t
graph export ExampleLogNormalT.pdf , replace
generate reject = 0
replace reject = 1 if abs(t) >1.96
summarize reject
102
(a) Sample mean (b) Sample variance 0
.2 .4
.6 .8
1 D
en si
ty
0 2 4 6 r(mean)
0 .0
5 .1
.1 5
D en
si ty
0 50 100 150 200 250 r(Var)
Figure 16.1: Simulated sample means and variances of 10,000 draws from log(Xi) ∼ iidN(0, 1).
0 .1
.2 .3
.4 D
en si
ty
-15 -10 -5 0 5 t
Figure 16.2: Simulated t-statistics for the hypothesis that E[X] = exp(1/2) (which is true in this case).
103
Exercises
Exercise 16.1 (Power calculations). In this exercise, we will investigate an application of the power calculation. This can be useful in (at least two) stages of research:
• Once your data are collected and analyzed, you can defend a null result by showing that an economically significant false null would be identified by your test a good fraction of the time.
• Before you collect your data, it may be able to help you collect a better data set.
We will focus on the second here. Suppose that we wish to test that the means of two subsets of the population are equal.
To fix ideas, consider a drug trial where we have a treated group and a control group. Since we have a finite budget, we can only collect 100 observations. How many people should be in the treatment group, and how many in the control?
Let random variable X be some measure of an individual patient health in the control group, and Y be the same measure of patient health in the treatment group. A reasonable hypothesis to test is:
H0 E[X] = E[Y ] HA : E[X] > E[Y ]
To test this, we assign NY test subjects to the treatment, and 100 −NY to the control. We therefore have samples:
{Xi}100−NYi=1 , {Yi} NY i=1
A simple test of the above hypothesis using data like this is the two-sample t-test, which is outlined here, and can be easily implemented using Stata’s ttest command. Please read about this test, it is a very useful one, but for now we take it that it is the right one for this example. For this exercise, we ask the question:
How many observations should we assign to the treatment group? This is going to be a function of the distributions of X and Y, and how economically
significant the difference in means has to be to be excited about the new drug. For the purpose of this simulation, we want to be able to maximize the power of a 5% test when E[Y ]−E[X] = 1. That is, if Y is (in expectation) one unit better that Y , then we want our test to be able to reject the null as frequently as possible. To further simplify things, fix the population DGP as:
Xi ∼ iidN(0, 1), Yi ∼ iidN(1, 2)
1. Write a program that simulates 100 draws (total) from X and Y , and outputs the p-value of the t-test assuming different variances. The program should take NY as an input.
104
2. Simulate the distribution of the p-values when the null is false. That is, when X and Y conform to the distributions above. Do this for a reasonable range of NY . E.g. 20, 40, 60, 80.
3. For each value of NY , calculate the fraction of times that you would reject the null on a 5% test (i.e. what fraction of p-values are less than 5%?)
4. Find the NY that maximizes this fraction.
Extensions: The intersection of experiment design, econometrics, and eco- nomics!
Your payoff from this trial is $1bn times the power of this test:
5. You have a budget of $100,000, Each control observation costs $100, and each treatment observation costs $200. How do you allocate your budget to maximize payoff? Is it optimal to spend the entire budget?
6. How much would you be willing to pay to reduce the variance of Y (e.g. by using better testing equipment)? How does this change your allocation of 100 test subjects between treatment and control? Express your answer as an elasticity of demand for
precision ( 1 σ2 Y
). That is, report: ∂σ−2 Y
∂P P σ−2
105
Chapter 17
Simulations with OLS
When running simulations for regressions, we often want our source of randomness to be only from the error term, and not from repeated draws of right-hand side variables. Therefore when running simulations for regressions, we need to be able to keep some variables in the constant for all simulation steps, while randomly drawing the component that we are interested in: usually the error term.
To help understand this, note that frequently we derive result for OLS estimators assum- ing that the RHS variable(s) are constant. That is, we think about the thought experiment where we collect the same X data, but get different draws of � every time.
Sadly, when we write our simulation program, we need to include a drop all command at the beginning so that Stata allows us to overwrite data in the memory. Fortunately, there are two easy fixes to this. One of them is better because it uses the processor and memory less. The following discussion is motivated from Example 2 in the Stata documentation on the simulate function, which can be found here.
17.1 Method 1: Load the variables you want to keep
constant when you run the program
0. Before running your program, generate the variables you want to keep constant. Them save them to the hard drive. For example, if you want to keep your RHS variable x constant. E.g.: if you want to have X distributed N(0,1), with the same draws every time:
clear _all
set obs 100
generate x = rnormal ()
save keepx.dta
1. Start the program, Clear any data in the memory
program define myprogram // etc , put the relevant things in here
clear _all
106
2. load x from your stored file keepx.dta
use keepx.dts
3. generate y, for example if we want the true model to be y = 1 + 2x + �, with the error term distributed N(0, 0.12):
generate y = 1+2*x + 0.1* rnormal ()
4. Run the regression, then end
regress y x
end
The program will automatically return anything regress stores in its results. Type help regress and look at what is in e(.).
This method is cumbersome because it requires Stata to load your x variable every time. It would be faster if it did not have to (reading and writing to the hard drive takes time). Fortunately, there is:
17.2 Method 2: Keep x in memory
This method takes advantage of Stata’s capture command. One useful feature of the capture command is that it will allow your .do file to proceed to the next line, even if it returns an error. Here’s why it is useful in our case. We need to run a script like this one (note that drop y is commented out here):
clear all
set seed 42
set obs 20
generate x = rnormal ()
program define myprogram , rclass
//drop y
generate y = x + rnormal ()*0.1
regress y x
end
simulate b = _b[x], reps (100): myprogram
So the first generate gets us our RHS variable x, which we want to keep constant. The program myprogram will work just fine on the first simulation step, but then Stata will kick up a fuss on the second step because when it comes to generate y = ..., this variable already exists. Alternatively, if we uncomment drop y, then on the first step Stata gives us an error because y does not exist. We need a line just before this one to tell Stata something like “if y exists, drop it, otherwise do nothing”. This is where capture come in. If we replaced the line //drop y with capture drop y, then this will work. Specifically, on the
107
first step drop y returns an error, but capture tells Stata to ignore it; then on subsequent steps it does not return an error, so y gets dropped.
Here is the step-by-step guide to this method:
0. Before running your program, generate the variables you want to keep constant. For example, if you want to keep your RHS variable x constant. E.g.: if you want to have x distributed N(0, 1), with the same draws every time:
clear _all
set obs 100
generate x = rnormal ()
1. Use Stata’s capture command to drop y if is in the memory, and do nothing otherwise. Specifically for us, if we have a variable y hanging round in the memory from a previous simulation step, we want to delete it, but we don’t want to drop y because Stata will kick up a fuss on the first run through because y doesn’t exist yet. So we can proceed with:
program define myprogram // etc , put the relevant things in here
2. Get rid of y if it is in the memory, otherwise proceed without doing anything:
capture drop y
3. generate y, for example if we want the true model to be y = 1 + 2x + �, with the error term distributed N(0, 0.12):
generate y = 1+2*x + 0.1* rnormal ()
4. As before, run the regression and end the program.
regress y x
end
Finally, no matter which method we used, we can:
simulate _b _se , reps (10000): myprogram
which simulates the distribution of the estimators for the slope and intercept (stored in e( b)) and the standard errors (stored in e( se)).
Exercises
Exercise 17.1 (Endogeneity). Investigate what happens when X and � are correlated (this is the problem we ran into with the flu shots and death example). To do this, you should compare the case where X and � are uncorrelated to a case where they are. Try this:
108
1. Run the simulation of y step as:
generate y = 1 + 2*x + rnormal ()*0.1
2. Run another simulation with:
generate y = 1 + 2*x + rnormal ()*0.1 + 0.2*x
That is, in case (1) X and � are uncorrelated, and in case (2) corr(Xi, �i) = 0.2. What is the bias in these cases?
Exercise 17.2 (How many observations do I need to get a good confidence interval?). Bailey (2016, ch 4) States that that the slope coefficient estimator is approximately normally distributed for large samples. But how large is large enough for this to be a good assumption? The answer to this question depends on the data-generating process. If the �is are normally distributed, then this is exactly true for any sample size. For other distributions of the error term, we can use simulation to inform us.
One consequence of the normal approximation being bad is that the probability that a 95% confidence interval (constructed using a large-sample result) contains the true value of the slope coefficient is not necessarily 95%. Neither can we be sure of whether this confidence interval will contain the true value with probability greater or less than the intended value. We will use simulation to explore this. Note that such a simulation could be used to tell us how many observations we should collect, or to tell us that, for a fixed sample size, whether we should think about constructing confidence intervals without using a large-sample approximation (more on how to do this in a couple of weeks).
We will investigate the relationship between sample size and large-sample confidence intervals when the error term is uniformly distributed. Specifically, consider the true data- generating model:
yi = β0 + β1xi + �i
�i ∼ iidU[−1, 1]
Since the error term is iid with mean zero we know that for large samples the slope coefficient estimator is approximately normal, but we have no guarantee that this is the case for small samples.
1. Choose 3-4 sample sizes to investigate. Remember that things approach normality on a √ N scale.
2. Choose intercept and slope coefficients (this won’t change your answer too much).
3. Write a program that simulates this data-generating process and the subsequent re- gression, holding X fixed.
4. Run a simulation for each of your chosen sample sizes, store the slope coefficients and their standard errors
109
5. Use these to construct 2-sided 95% confidence intervals around the slope coefficient
6. For each sample size, generate a variable that is equal to 1 if your true value is inside the confidence interval, and zero otherwise.
7. For each sample size, compare the nominal size of the confidence interval (i.e. 95%) to the actual size of your confidence interval.
Note that if A ∼ U[0, 1], Then B = 2A− 1 ∼ U[−1, 1].
110
Chapter 18
Techniques for drawing random numbers
This section of the Masters course is going to cover some common simulation techniques. Generally, we will exploit a mathematical theorem to generate random numbers in a partic- ular way. A good reference for this material (and the basis for a lot of my notes) is Chapter 8 of Judd (1998).
18.1 Inversion
18.1.1 What inversion is and how it works
Consider a uniform random variable U ∼ U[0, 1] and another continuous random variable X with cdf FX(·). Since X is continuous, its cdf can be inverted. Therefore we can do the following:
Pr ( F−1X (U) ≤ x
) = Pr (U ≤ FX(x)) (18.1) = FU (FX(x)) (18.2)
=
FX(x) if FX(x) ∈ (0, 1) 0 if FX(x) ≤ 0 1 if FX(x) ≥ 1
(18.3)
That is, the cdf of the transformed random variable F−1X (U) is identical to FX(·). Therefore is we can:
1. Invert the cdf of X, and
2. generate (pseudo) random uniforms
we can draw from the distribution of X as follows:
X = F−1X (U) (18.4)
111
18.1.2 Example
Consider an exponential random variable X with cdf:
FX(x) =
{ 0 if x ≤ 0 1 − exp(−x) if x > 0
(18.5)
For x > 0, we can invert the cdf as follows:
u = FX(x) (18.6)
= 1 − exp(−x) (18.7) exp(−x) = 1 −u (18.8)
x = − log(1 −u) (18.9)
Therefore X = − log(1 −U) has the desired distribution. To implement this in Stata: clear all
set seed 42
set obs 30 // number of random values to be generated
generate U = uniform () // draw uniforms
generate X = -log(1-U) // inversion
// plot the empirical cdf against the target
generate cdf = 1-exp(-X) // target
cumul X, generate(cX) // empirical cdf
label variable cX "simulated"
sort X
twoway (line cdf cX X)
graph export inversion.png , replace
which generates Figure 18.1. Here I have deliberately shown an example with a small sim- ulation size (30 observations). We should expect any random sample to exhibit deviations from the true cdf because it is . . . well, random!
Exercises
Exercise 18.1. Generate the equivalent of Figure 18.1 for the following:
1. The normal distribution – use Stata’s inverse normal function invnormal().
2. A special case of the Beta distribution:
FX(x) =
xα if 0 < x < 1
0 if x ≤ 0 1 if x ≥ 1
Do this for α = 0.5
112
Figure 18.1: Simulation of 30 random numbers with cdf FX(·) defined in Equation 18.5
3. The triangular distribution:
fX(x) =
0 if x ≤ 0 2x c
if 0 < x ≤ c 2(1−x)
c if c < x ≤ 1
0 if x > 1
Do this for c = 0.2 and c = 0.5. Note that you will have to integrate the cdf to get the pdf.
Exercise 18.2. For Exercise 18.1 question 2):
1. Use your simulation to approximate E[X] and V [X]
2. Provide an estimate of the accuracy of these approximations. Hint: Did you use a sample mean? What do we know about the asymptotic properties of sample means?
113
Chapter 19
Using pseudo random numbers to calculate things
While Chapter 18 introduces methods for drawing pseudo random numbers that conform to a particular distribution. This chapter is about using them to calculate things. Again, the theory in this chapter draws heavily from Chapter 8 of:
Judd, K. L. (1998). Numerical methods in economics. MIT press.
19.1 Monte Carlo Integration
19.1.1 Expectations of random variables
If we can draw pseudo random numbers from the distribution of X, it is straightforward to use these to calculate the expectation of X. In particular, given a sample {xs=1}S of S simulated draws, we can approximate E[X] as follows:
E[X] ≈ 1
S
S∑ s=1
xs (19.1)
that is, we simply compute the sample mean of our random numbers. If we can further establish that each xs is independent,
1, then we can use a Lindeberg-Levy Central Limit Theorem argument:
√ S
( 1
S
S∑ s=1
xs −E[X]
) d−−→ N(0,V [X]) (19.2)
to assign a degree of accuracy to our approximation. Therefore the precision of our approxi- mation is proportional to
√ S. This tells us that as we increase S, the simulation size, we get
1This is the case if, for example, we draw X using the method of inversion (see Section 18.1), and the uniform draws we use for this are independent. On the other hand, if we use Markov chain techniques, then the draws are typically not independent.
114
closer and close to the true value of E[X]. Unlike collecting real data, with simulation this is not so much of a thought experiment: increasing S is relatively cheap, and one is really only limited by the available RAM ({xs}Ss=1 must be held in your computer’s memory),2 and the processor speed.
19.1.2 Expectations of functions of random variables
Equation 19.2 is useful because it tells us how large of a simulation we need to achieve a desired level of accuracy. The trouble is that if we need a simulation to evaluate E[X], we probably also need a simulation to evaluate V [X]. This is not actually too much trouble, because variance is also an expectation. That is:
V [X] ≈ 1
S
S∑ s=1
(xs −E[X]) 2
(19.3)
where we can substitute in our approximation for E[X]. Alternatively, we can use:
V [X] = E[X2] −E[X]2 ≈ 1
S
S∑ s=1
x2s −
( 1
S
S∑ s=1
xs
)2 (19.4)
Approximating V (X) is a special case of approximating E[g(X)].3 If we can draw from X, then this is a simple extension:
E[g(X)] ≈ 1
S
S∑ s=1
g(xs) (19.5)
19.1.3 Expectations when you can’t draw directly from X
For whatever reason, it might be that we can’t find an appropriate way to draw pseudo random numbers from X directly. When X is a continuous random variable, this does not have to be a deal-breaker. To see this, note that for continuous X with pod f(·):
E[g(X)] =
∫ R g(x)f(x)dx (19.6)
=
∫ R
g(x)f(x)
h(x) h(x)dx (19.7)
Where (19.7) makes the much celebrated algebra monkey trick of multiplying by a fancy one. In this case
1fancy = h(x)
h(x)
2Even this is not so much of an issue: if a very large S is required, the simulation can be split up into blocks of smaller simulations. Form this, you will get a sample mean for each block, and it is simply a matter of taking the mean of the blocks’ sample means.
3Here g(x) = X2 −E[X]
115
Note here that we have implicitly assumed that h(x) 6= 0 for all x in the support of X. The implication of (19.7) is that if we can draw from any pdf h(·), whose support has the same support as X, then we can approximate E[g(X)] as follows:
1. Draw a simulated sample {ys}Ss=1 from the pdf h(·).
2. Generate the transformed sample according to zs = g(ys)f(ys)/h(ys)
3. Compute the sample mean of zs:
E[g(x)] ≈ 1
S
S∑ s=1
zs
19.1.4 Example
We wish to evaluate E[|X|], the expected absolute value of X, where X conforms to the logistic distribution with location and scale parameters both equal to one. That is, the pdf of X is:
f(x) = exp(−(x− 1))
(1 + exp(−(x− 1)))2 (19.8)
but we are unable to draw directly from this distribution (perhaps because (i) we have forgotten about inversion, and (ii) that you can read up on it in Section 18.1). To get around this, we use normal draws. In particular, we draw from Z ∼ N(1, 1). The reason for this is that the logistic distribution looks a lot like the normal distribution, as long as the means and scale are similar. By drawing from N(1, 1) instead of the standard normal, we make f(z)/φ(z) ≈ 1 for most draws of z. The procedure is therefore:
E[|X|] ∫ |x|f(x)dx =
∫ |x|f(x) φ(x)
φ(x)dx (19.9)
So the procedure is:
1. Generate {zs}Ss=1, a sample of independent normals with µ = σ2 = 1
2. Generate the transformed variable ys = |z|f(z) φ(z)
3. E[|X|] ≈ 1 S
∑S s=1 ys
The following code implements this in Stata:
clear all
set seed 42
set obs 100 // number of random values to be generated
// Step 1
generate Z = rnormal ()+1
116
// Step 2: do this in stages
generate absZ = abs(Z)
generate fZ = exp(-(Z-1))/(1+ exp(-(Z-1)))^2
generate phiZ = normalden(Z-1)
generate Y = absZ*fZ/phiZ
// Step 3
summarize Y
// and look at the mean
// Since we can use inversion , let ’s try this too
generate U = runiform ()
generate X = 1+log(U/(1-U)) // I looked this up on Wikipedia
generate absX = abs(X)
summarize Y absX
which generates the output:
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
Y | 100 1.601705 3.358623 .0113715 25.86216
absX | 100 1.593992 1.301472 .0280923 5.829037
where Y was computed using the above method, and absX was computed using inversion.
Exercises
Exercise 19.1 (Solution provided). Let X ∼ Beta(3, 7). Use Equation 19.7 and uniform random numbers to compute E[(X − 0.5)2].
Exercise 19.2 (Solution provided). You are an expected utility maximizer with Constant Relative Risk Aversion utility function over money:
u(x) = x1−r
1 − r
where r = −u′′(x)/u′(x) is your coefficient of relative risk aversion. You are offered a lotter that pays out a random amount $X, where X is drawn from the triangular pdf:
fX(x) =
4x if 0 ≤ x < 0.5 4(1 −x) if 0.5 ≤ x < 1 0 otherwise
Calculate the certainty equivalent of this lottery, as a function of r, for r ∈ (−0.5, 0.5). Summarize your answer in a plot. You have forgotten how to draw from X directly, so use draws from the uniform distribution to perform these calculations.
Your final plot should look something like this:
117
Note that I have also coded up a measure of the accuracy of this approximation. Hint: You need a rather large simulation size to get an accurate approximation of the
expected utility. One way to check how accurate you are is to check the accuracy of your certainty equivalent for r = 0. That is, if you are risk neutral, it is easy to work out what the certainty equivalent is. Once you are happy with how well your simulation approximates this, move on to coding up a for loop to generate the plot.
Exercise 19.3. The game of Yatzee is a die-rolling game with a similar scoring system to Poker. In each round, a player rolls five six-sided dice (at most) three times. After the first and second rolls, the player can choose to only roll a subset of the dice. Their score for the round is a function of the numbers facing upward after the third roll. The highest possible score, called a “Yatzee” occurs when all five dice have the same number (i.e. five ones, five twos, etc.).
Consider the following strategy:
1st roll: Roll all 5 dice (there really isn’t any decision to make)
2nd roll: Determine the modal outcome of the previous roll.
– If there is exacty one mode, roll only the dice that do not show this modal number.
– If there is more than one mode, pick the mode with the highest value, and roll only the dice that do not show this modal.
– If there is no mode, roll all 5 dice
3rd roll: Follow the same rule as for the 2nd roll.
Write a simulation to answer the following questions:
1. What is the probability of scoring a Yatzee by following this strategy?
118
2. What is the distribution of points induced by this decision rule? You can find the list of scoring rules here: https://en.wikipedia.org/wiki/Yahtzee#Rules. Assume that the player chooses the highest scoring option from the “upper section” only.4 Show your answer graphically, and also report the expected score with a measure of the accuracy of this simulated moment.
3. Suppose that instead of following the above decision rule with probability one, you accidentally roll a die that you shouldn’t with probability θ. That is, for every die that you shouldn’t roll, you accidentally roll it with probability θ, and these accidental rolls are independent of each other. How does this change your answer to question 2.
4. Ex-ante, it seems reasonable that rolling all the dice in the second roll might be optimal if the modal outcome is a small number (i.e. 1 or 2). Also, it seems pretty obvious that if there is no modal outcome after the first or second roll, one should hang on to a 6. Write down a modified decision rule to reflect this, and determine whether this tweak results in a better outcome.
5. How much would a risk-neutral5 player benefit if they were allowed to make 4, 5, or 6 rolls, instead of just 3?
4You could code it up the lower section as well, but that would be more menial work, for a very similar (albeit longer) looking program.
5I.e. they only care about the mean of the distribution of points.
119
Bibliography
Ai, C. and Norton, E. C. (2003). Interaction terms in logit and probit models. Economics Letters, 80(1):123 – 129.
Bailey, M. (2016). Real econometrics: The right tools to answer important questions.
Berry, S., Levinsohn, J., and Pakes, A. (1995). Automobile prices in market equilibrium. Econometrica, 63(4):841–890.
Bland, J. R. and Cook, A. C. (2017). Random effects probit and logit: The right marginal effects for the right econometric specification.
Cameron, A. C. and Miller, D. L. (2015). A practitioners guide to cluster-robust inference. Journal of Human Resources, 50(2):317–372.
Judd, K. L. (1998). Numerical methods in economics. MIT press.
Koop, G., Poirier, D. J., and Tobias, J. L. (2007). Bayesian econometric methods. Cambridge University Press.
Makowsky, M. D. and Stratmann, T. (2009). Political economy at any speed: what deter- mines traffic citations? The American Economic Review, 99(1):509–527.
120
Appendix A
Solutions to selected problems
Exercise 2.2 Bias:
E[µ̂] = E
[ 1
N
N∑ i=1
Xi
] (A.1)
= 1
N
N∑ i=1
E[Xi] (A.2)
= 1
N
N∑ i=1
µ, (iid) (A.3)
= µ =⇒ unbiased (A.4)
E[µ̃] = E [ N min
i {Xi}
] (A.5)
= NE [ min i {Xi}
] (A.6)
= N µ
N , (given) (A.7)
= µ =⇒ unbiased (A.8)
121
Variance:
V [µ̂] = V
[ 1
N
N∑ i=1
Xi
] (A.9)
= N 1
N2 V [Xi] =
1
N V [Xi], (iid) (A.10)
= 1
N
( E[X2i ] −E[Xi]
2 )
(A.11)
= 1
N
( 2µ2 −µ2
) (A.12)
= µ2
N (A.13)
V [µ̃] = V [ N min
i {Xi}
] (A.14)
= N2V [min i {Xi}] (A.15)
= N2V [Exponential(µ/N)] (A.16)
= N2 µ2
N2 (A.17)
= µ2 (A.18)
MSE: Since both are unbiased, MSE[µ̂] = V [µ̂] and MSE[µ̃] = V [µ̃]. By these measures, µ̂ is unambiguously better than µ̃: both are unbiased, but µ̂ has
smaller variance. Simulation:
Let’s start by working out µ̌. This is motivated from:
E[X2] = 2µ2 (A.19)
µ̌ =
√√√√ 1 2N
N∑ i=1
X2i (A.20)
So in terms of the code, once we have simulated {Xi}Ni=1, we need to generate X2i , take its mean, then divide by 2 and take te square root.
The code below produces the following output table:
. summarize
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
muHat | 10,000 .9983377 .1829214 .4800103 1.714862
muTilde | 10,000 .9992013 .9903682 .0001359 10.60011
muC | 10,000 .9791806 .1963733 .4536264 1.972492
122
The mean values are close enough to 1 to confirm that our calculations in part 1 are correct. µ̌ appears to be biased downward. For the standard deviation for the simulated sampling distribution, note that:
√ V [µ̂] =
1 √
30 ≈ 0.18 (A.21)√
V [µ̂] = 1 (A.22)
Good! The simulated sampling distributions are shown in the plot below. Again, we are much
better off using the sample mean than the minimum. The sampling distribution of estimator µ̌ is almost indistinguishable from that of µ̂, so it does better than µ̃, but we might want to explore its properties further if we want to use it (as a general rule, estimators that are sample means are hard to beat).
0 .5
1 1.
5 2
2. 5
0 2 4 6 8 10 x
kdensity muHat kdensity muTilde kdensity muC
// Clear everything from memory
clear
clear all
/* First we write a program that generates a sample conforming to N=30 and mu=1,
then applies the two estimators
*/
program define ExponentialSim , rclass
// Here our inputs are the number of observation , and mu, the parameter we
↪→ want to estimate syntax [, obs(integer 30) mu(real 1) ]
// Set the sample size
set obs ‘obs ’
// declare some temporary variables that we will use later
tempvar x x2
// Generate x according to the specified distribution
generate ‘x’ = -‘mu ’*log(runiform ())
123
// Summarize x. We need to calcualte its mean and the minimum
summarize ‘x’
// estimators
return scalar muHat = r(mean)
return scalar muTilde = ‘obs ’*r(min)
// muC , based on 2nd raw moment
generate ‘x2 ’ = ‘x’^2
summarize ‘x2 ’
return scalar muC = sqrt(r(mean)/2)
end
// check that the program is running properly
ExponentialSim , obs (30) mu(1)
disp "‘r(muHat) ’"
disp "‘r(muTilde)’"
// run the simulation
set seed 42
quietly simulate muHat=r(muHat) muTilde=r(muTilde) muC=r(muC), reps (10000): ExponentialSim ,
↪→ obs (30) mu(1)
summarize
twoway (kdensity muHat) (kdensity muTilde) (kdensity muC)
graph export ExExponentialSim.pdf , replace
Exercise 19.1
clear all
set seed 42
set obs 100
generate U = runiform ()
generate gU = (U-0.5) ^2
generate hU = 1
generate fU = betaden(3,7,U)
generate Y = gU*fU/hU
estpost summarize Y
esttab using MCIntegrationEx1.tex , replace cells("count mean sd min max") noobs
Which generates the following table:
(1)
count mean sd min max Y 100 .0547719 .0800443 3.93e-13 .2622285
So we conclude that E[(X − 0.5)2] ≈ 0.055
Exercise 19.2
clear all
set seed 42
set obs 1000000
generate U = runiform ()
124
generate R = .
generate EU = .
generate Y = .
generate Yp1sd = .
generate Ym1sd = .
forvalues rr = 1/100 {
local r = -0.5 + ‘rr ’/100*(0.5 -( -0.5))
generate uU = 1/(1-‘r’)*(U)^(1-‘r’) // function to evaluate
generate fU = 4*U // pdf of X for 0<x
↪→ <0.5 quietly replace fU = 4*(1-U) if U>0.5 // pdf of X for 0.5<x<1
generate hU = 1 // pdf of U
generate T = uU*fU/hU
quietly summarize T
local EU = r(mean)
local std = sqrt(r(Var)/_N)
/* Certainty equivalent calculation
we are trying to solve:
u(y) = EU
after substituting in u(.) on the LHS:
y^(1-r)/(1-r) = EU
y^(1-r) = (1-r)*EU
y = [(1-r)*EU]^(1/(1 -r))
*/
quietly replace EU = ‘EU ’ if ‘rr ’==_n
local y = ((1-‘r’)*‘EU ’)^(1/(1-‘r’))
display ‘r’ ‘y’
quietly replace R = ‘r’ if ‘rr ’==_n
quietly replace Y = ‘y’ if ‘rr ’==_n
quietly replace Yp1sd = ‘y’+‘std ’ if ‘rr ’==_n
quietly replace Ym1sd = ‘y’-‘std ’ if ‘rr ’==_n
drop uU fU hU T
}
label variable R "r: CRRA parameter"
label variable Y "certainty equivalent ($)"
label variable Yp1sd "CE+1sd"
label variable Ym1sd "CE -1sd"
twoway (line Y R) (line Yp1sd R, lpattern(dash)) (line Ym1sd R, lpattern(dash))
graph export MCIntergrationEx2.png , replace
125
- Introduction
- Summation
- Types of random variables
- Discrete random variables
- Continuous random variables
- Describing one random variable
- Cumulative density function
- Probability mass function
- Probability density function
- Mean, variance
- I Estimating one parameter
- Estimators
- Populations and samples
- Estimators and the sampling distribution
- Small-sample properties of estimators
- Bias
- Variance
- Mean squared error
- Inference
- Hypothesis tests
- One-sided hypothesis tests
- p-values
- Confidence intervals
- Test power
- The take-away
- Inference with asymptotic assumptions
- Large-sample properties of sample means
- The Weak Law of Large Numbers
- A central limit theorem
- The continuous mapping theorem
- The delta method
- Large-sample properties of estimators
- Consistency
- Asymptotic distribution
- Using large-sample properties to make hypothesis tests easier
- Hypothesis tests with asymptotic approximations
- Confidence intervals with asymptotic approximations
- p-values with asymptotic approximations
- II Basics of programming and handling data in Stata
- Getting started in Stata
- Importing, saving, and exporting data
- Scripts
- The working directory
- For loops
- Types of data
- How your computer thinks (or doesn't think) about data
- Censored and truncated data
- Categorical data
- Unordered
- Ordered
- Merging data, and wide & long formats
- One-to-one merges
- Wide and long datasets
- Many-to-one and one-to-many merges
- Marginal effects and interactions in non-linear models
- Standard errors under different assumptions about
- Homoskedasticity: the *standard* standard errors
- Heteroskedasticity: reg y x, robust
- Clustering: ``I think you have 3 statistically independent observations''
- Autocorrelated errors
- III Some common econometric techniques
- Ordinary Least Squares (linear regression)
- regress: Implementing OLS in Stata
- esttab: Producing publication-quality regression outputs in Stata
- Maximum Likelihood
- Instrumental variables (2SLS)
- Over-identification test
- Time series
- Diagnostics
- Autocorrelation and partial autocorrelation functions
- Declaring time series datasets and dealing with lagged variables
- IV Advanced reg-monkeying
- Looping over variables: one reg y x, robust, many regressions.
- V Simulation techniques
- An introduction to Monte Carlo techniques
- Stata's (pseudo) random number generators
- Using random number generators
- Stata's simulate command
- Simulations with OLS
- Method 1: Load the variables you want to keep constant when you run the program
- Method 2: Keep x in memory
- Techniques for drawing random numbers
- Inversion
- What inversion is and how it works
- Example
- Using pseudo random numbers to calculate things
- Monte Carlo Integration
- Expectations of random variables
- Expectations of functions of random variables
- Expectations when you can't draw directly from X
- Example
- Solutions to selected problems