elem econ homework
Multiple regression
Regression for dummies
Our whole purpose in introducing regression was to make all else equal when we didn’t have random assignment by controlling for the confounding factors. So if we don’t control for any other factors, how will regression differ from the simple average differences we did before?
To examine this, consider a dummy variable $Z$ representing a binary treatment (such as health insurance). $Z$ takes on two values: 1 for receiving treatment and 0 for not receiving treatment. Thus the CEF of $Y$ given $Z$ also takes on two values, $E[Y \mid Z=1]$ and $E[Y \mid Z=0]$.
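Then we can write the CEF as:
$$E[Y \mid Z] = E[Y \mid Z=0] + \left(E[Y \mid Z=1] - E[Y \mid Z=0]\right) Z$$
To see why this is the case, plug in $Z=1$ and $Z=0$. With $Z=0$ the second term drops out and we are left with $E[Y \mid Z=0]$. With $Z=1$ the two $E[Y \mid Z=0]$ terms cancel and we are left with $E[Y \mid Z=1]$.
Writing $\beta_0 = E[Y \mid Z=0]$ for the first term and $\beta_1$ for the term multiplying $Z$, we see that we have $E[Y \mid Z] = \beta_0 + \beta_1 Z$, which is exactly the formula for a simple regression.
So
$$\beta_1 = E[Y \mid Z=1] - E[Y \mid Z=0]$$
Then if we estimate the parameter $\beta_1$ by replacing the conditional expectations with conditional averages, we see that the OLS estimate will give us the same simple average difference we looked at before.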
So if we expect that the average difference is invalid due to selection bias, we aren’t doing any better with simple regression. We need to control for those factors which make ceteris not paribus.
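The equivalence between a simple regression on a dummy and the difference in group averages can be checked numerically. The sketch below uses simulated data; the sample size, the coefficient values, and all variable names are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
z = rng.integers(0, 2, size=n)            # dummy: 1 = treated, 0 = not treated
y = 2.0 + 1.5 * z + rng.normal(size=n)    # made-up outcome

# Simple regression of y on z (with an intercept) by least squares.
X = np.column_stack([np.ones(n), z])
b0_hat, b1_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# Simple average difference between the two groups.
diff_in_means = y[z == 1].mean() - y[z == 0].mean()

print("OLS slope:          ", b1_hat)
print("Difference in means:", diff_in_means)
```

The two printed numbers agree up to floating-point error, and the estimated intercept equals the average of y in the untreated group.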
The multiple regression model
We control for those other factors by simply including them as additional independent variables in our regression model.
Once we have more than one independent variable we have to work with a multiple regression model. The multiple regression model with k independent variables is given by:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + u$$
Note here I have made no distinction between a treatment variable and a control variable. Again, there will be no difference in how I estimate the various beta coefficients. Which one is the treatment of interest is simply a matter of what question we are trying to answer.
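The key identification assumption of the multiple regression model simply extends that of the simple regression model: the error term has zero mean conditional on all of the regressors,
$$E[u \mid X_1, X_2, \dots, X_k] = 0$$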
Obtaining estimates
The estimates of the k+1 regression parameters come from the solution of k+1 equations similar to what we saw before (but I’ll spare you). Accordingly, deriving a formula for the betas is a bit more complicated.
The estimate of a single coefficient $\hat\beta_j$ has the interpretation of how much $Y$ changes in response to a one unit increase in $X_j$ holding all other variables fixed. We have to figure out a way to remove the effect those other variables have on $Y$ through $X_j$.
In other words, we want to “partial out” or “net out” the effects of the other variables…
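For readers who do want to see the k+1 equations, here is a minimal sketch that solves them in matrix form, $(X'X)\hat\beta = X'Y$, and compares the result with NumPy's least-squares routine. The data are simulated and the coefficient values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X_raw = rng.normal(size=(n, k))                       # k made-up regressors
y = 1.0 + X_raw @ np.array([0.5, -2.0, 3.0]) + rng.normal(size=n)

# Add a column of ones for the intercept: X is n x (k+1).
X = np.column_stack([np.ones(n), X_raw])

# The k+1 first-order conditions of OLS, written as a linear system.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# The same estimates from a generic least-squares solver.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(beta_normal)
print(beta_lstsq)
```

Both approaches return the same k+1 numbers: one intercept and k slope estimates.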
Partialling out
That’s exactly what the OLS estimator of the multiple regression model does: it “nets out” the effects of the other regressors (“regressor” is another word for an independent variable in the regression equation).
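For example, in order to estimate $\beta_1$ we could do as follows:
Regress $X_1$ on $X_2, \dots, X_k$:
$$X_1 = \gamma_0 + \gamma_2 X_2 + \dots + \gamma_k X_k + r_1$$
Calculate the residuals from the above regression:
$$\hat r_1 = X_1 - \hat\gamma_0 - \hat\gamma_2 X_2 - \dots - \hat\gamma_k X_k$$
Regress $Y$ on $\hat r_1$ (simple regression):
$$Y = \alpha_0 + \alpha_1 \hat r_1 + e$$
$\hat\beta_1$ is the coefficient on $\hat r_1$, that is, $\hat\beta_1 = \hat\alpha_1$.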
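The steps above can be sketched in a few lines. This is an illustration on simulated data (three regressors, arbitrary coefficients, and a hypothetical helper `ols`), not a recommended way to run regressions in practice.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(2)
n = 500
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.6 * x2 - 0.3 * x3 + rng.normal(size=n)   # x1 is correlated with x2, x3
y = 1.0 + 2.0 * x1 + 0.5 * x2 - 1.0 * x3 + rng.normal(size=n)
ones = np.ones(n)

# Full multiple regression of y on x1, x2, x3.
beta_full = ols(np.column_stack([ones, x1, x2, x3]), y)

# Regress x1 on the other regressors and keep the residuals.
W = np.column_stack([ones, x2, x3])
r1 = x1 - W @ ols(W, x1)

# Simple regression of y on those residuals.
beta1_partial = ols(np.column_stack([ones, r1]), y)[1]

print("coefficient on x1, full regression:", beta_full[1])
print("coefficient on r1, partialled out: ", beta1_partial)
```

The two coefficients coincide up to floating-point error, which is the partialling out result in action.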
Recall that the residual of a regression is everything else not explained by the regressors, so by using the residuals from the regression of $X_1$ on $X_2, \dots, X_k$ we are leaving only the portion of $X_1$ that is uncorrelated with the other regressors. But what if there is no residual…
Perfect collinearity
The partialling out interpretation of multiple regression brings up another very important condition in order for us to be able to estimate the regression coefficients using OLS. There must be residuals in the first-stage regression or else we cannot estimate the regression coefficient using OLS.
If there is no residual from the regression, that would mean that $X_1$ is an exact linear combination of the other variables, so that after estimating a linear model there is nothing left over. We call this situation perfect collinearity.
If we have perfect collinearity, then we cannot estimate the multiple regression model using OLS!
Example: We may be interested in estimating a regression of wages on worker characteristics. Suppose we were to include a variable for age, a variable that measures labor force experience (calculated as age minus years of schooling), and number of years of schooling. These variables would be perfectly collinear, because experience is an exact linear combination of the other two.
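A small numerical sketch of this example, with made-up ages and schooling levels: the regressor matrix loses a rank, and the first-stage regression of experience on the other regressors leaves no residual.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
age = rng.integers(25, 60, size=n).astype(float)      # made-up ages
school = rng.integers(8, 20, size=n).astype(float)    # made-up years of schooling
exper = age - school                                   # exact linear combination

# Intercept plus the three regressors: 4 columns, but only 3 are independent.
X = np.column_stack([np.ones(n), age, exper, school])
rank = np.linalg.matrix_rank(X)

# First-stage regression of exper on the other regressors.
W = np.column_stack([np.ones(n), age, school])
resid = exper - W @ np.linalg.lstsq(W, exper, rcond=None)[0]
max_abs_resid = np.abs(resid).max()

print("columns:", X.shape[1], "rank:", rank)
print("largest first-stage residual:", max_abs_resid)
```

Because the residuals are zero up to rounding error, there is no leftover variation in experience with which to estimate its coefficient.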
Unbiasedness of OLS in multiple regression
Given the zero conditional mean assumption and no perfect collinearity, the OLS estimates of the intercept and coefficients are unbiased.
Note again that this assumes that the linear model is true in the population. If, on the other hand, $Y$ were a non-linear function of the parameters, then OLS would be a biased estimator. $Y$ can be non-linear in one or more of the regressors, as long as it is linear in the parameters. For example:
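Acceptable (non-linear in a regressor, linear in the parameters), for instance: $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + u$
Unacceptable (non-linear in the parameters), for instance: $Y = \beta_0 + \beta_1 X^{\beta_2} + u$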
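To illustrate the acceptable case, the sketch below fits a model that is quadratic in a regressor but linear in the parameters. The quadratic form, the coefficient values, and the noise level are assumptions chosen for illustration; the squared term is simply treated as one more column.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
x = rng.uniform(0.0, 5.0, size=n)
# Made-up population model: non-linear in x, linear in the betas.
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.1, size=n)

# x and x**2 enter as two separate regressors.
X = np.column_stack([np.ones(n), x, x**2])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

print(beta_hat)   # estimates of the intercept, the x term, and the x**2 term
```

OLS recovers values close to the coefficients used to generate the data, even though the relationship between y and x is a curve.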
Omitted variable bias
Exactly what happens if we fail to include some of the regressors in a multiple regression model? The short answer is that we end up with biased estimates of the regression parameters. This type of bias is known as omitted variable bias.
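To demonstrate, suppose that the true population regression model is a function of two regressors:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u$$
Instead we estimate:
$$Y = \beta_0^s + \beta_1^s X_1 + u^s$$
where s simply stands for “short” (short model).
Then the estimate of the coefficient in the short model is:
$$\hat\beta_1^s = \hat\beta_1 + \hat\beta_2 \hat\delta_1$$
Where $\hat\delta_1$ comes from regressing $X_2$ on $X_1$:
$$X_2 = \delta_0 + \delta_1 X_1 + v$$
The second term, $\hat\beta_2 \hat\delta_1$, is the OVB. It is the effect of $X_2$ on $Y$ times the effect of $X_1$ on $X_2$.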
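The omitted variable bias formula holds exactly in any sample, which makes it easy to check. The sketch below uses simulated data with arbitrary coefficients and a hypothetical helper `ols`; the names `b_long`, `b_short`, and `delta` are illustrative labels for the long regression, the short regression, and the auxiliary regression of the omitted variable on the included one.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(5)
n = 1_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)              # omitted variable, correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)
ones = np.ones(n)

b_long = ols(np.column_stack([ones, x1, x2]), y)   # long model: both regressors
b_short = ols(np.column_stack([ones, x1]), y)      # short model: x2 left out
delta = ols(np.column_stack([ones, x1]), x2)       # regression of x2 on x1

ovb = b_long[2] * delta[1]                         # second term of the formula

print("short coefficient on x1:", b_short[1])
print("long coefficient + OVB: ", b_long[1] + ovb)
```

With these made-up numbers the short regression overstates the coefficient on x1, because x2 raises y and is positively related to x1.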
Don’t be so sensitive…
We can never say with 100% certainty whether or not we have OVB. After all, some things are by nature unobservable (like innate ability), so we can never control for them.
So often the best we can do is test how “sensitive” our regression results are to the inclusion of additional controls.
That is, we start with a baseline model which has “core” control variables that we know we must control for.
Then we add other controls which we suspect could potentially cause OVB. We hope that our regression estimates change only a little. If they do, econometricians often say the results are “robust” to the inclusion of key control variables.
Example: We are often concerned with unobserved ability causing bias in any measure of the returns to schooling. A common strategy is to include observable measures of ability (e.g. test scores) and see how our regression estimates change. If they don’t change by much (we hope) then maybe, just maybe, we can be hopeful that those unobservable measures of ability aren’t mucking up our results by too much.
Including irrelevant variables
The worried researcher may throw a bunch of control variables into a regression to avoid OVB. Does this cause estimates to be biased?
The answer is, no it does not. Estimates of the regression parameters will still be unbiased. You can think of those irrelevant variables as having coefficient parameters of 0, so we shouldn’t expect that OLS will yield a biased estimator of those.
That does NOT mean that you should always throw every variable you can plus the kitchen sink into each regression you run. Other problems arise when we include irrelevant variables which we will discuss later.
Goodness-of-fit
The calculation of the $R^2$ (often called the coefficient of determination) is no different for multiple regression than it was for simple regression. It is still:
$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}$$
where SST is the total sum of squares of $Y$, SSE is the explained sum of squares, and SSR is the sum of squared residuals.
However, now that we are talking about adding extra variables, it bears mentioning what happens to the coefficient of determination when we add irrelevant variables. Remember that $R^2$ is a measure of how much variation in $Y$ we can explain with our model. If we add extra variables, can that make our model explain less variation in $Y$?
No, it cannot. In general it can only explain more. This means that $R^2$ will never go down when we include extra variables (irrelevant or not), and in fact usually goes up. This again gives us reason to take the $R^2$ as a measure of “goodness” of a model with a grain of salt: if we use it to measure the worth of a model then the best model would be the one that includes every possible variable under the sun.
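A quick sketch of this point on simulated data: adding five regressors that are pure noise (by construction unrelated to the outcome) still does not lower the coefficient of determination. The helper `r_squared` and all numbers are assumptions for illustration.

```python
import numpy as np

def r_squared(X, y):
    """1 - SSR/SST for a least-squares fit of y on the columns of X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    ssr = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    return 1.0 - ssr / sst

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)
junk = rng.normal(size=(n, 5))            # irrelevant regressors: pure noise

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, junk])

r2_small = r_squared(X_small, y)
r2_big = r_squared(X_big, y)

print("R^2 without junk regressors:", r2_small)
print("R^2 with junk regressors:   ", r2_big)
```

The second number is never smaller than the first, even though the extra regressors have nothing to do with y.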
Variance of coefficient estimates: Assumptions
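We first revise the homoskedasticity assumption to include the additional variables we are controlling for:
$$Var(u \mid X_1, \dots, X_k) = \sigma^2$$
We also add one other assumption that we previously took for granted:
$$Cov(u_i, u_l \mid X_1, \dots, X_k) = 0 \quad \text{for all } i \neq l$$
This assumption states that the error terms of different individuals are uncorrelated.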
If we have a purely random sample from the population of interest, then this assumption holds. However, this will not always be the case. Further, when it comes to time-series regression, this assumption will be heavily called into question.
Without this assumption, the formula for the variance of the regression coefficient is invalid, though we can still have unbiased estimates.
Variance of coefficient estimates
With those assumptions in place, in addition to the zero conditional mean and no perfect collinearity assumptions, it can be shown that the variance of a single regression coefficient estimate is given by:
$$Var(\hat\beta_j) = \frac{\sigma^2}{SST_j \left(1 - R_j^2\right)}$$
Where
$$SST_j = \sum_{i=1}^{n} \left(X_{ij} - \bar X_j\right)^2$$
And $R_j^2$ is the coefficient of determination from regressing $X_j$ on all of the other regressors.
Variation in $X_j$ is a good thing
Recall that the sum of squares total of a variable is effectively the sample variation in that variable. Examining the formula for the variance of the regression coefficient, we see that the higher is $SST_j$ (the total sum of squares of $X_j$), the lower is the variance of our estimate (and therefore the more precise is our estimate).
For example, imagine trying to fit a line through the points on the scatterplot to the right. It looks like there could be a slight positive relationship, but it isn’t clear that such is the case or to what extent.
Now if we add more variation in X, the relationship looks much clearer.
We should then expect that an estimate derived from these points would be much more precise than an estimate derived from just the black points.
Correlation with other regressors… not so much
The formula for the multiple regression coefficient estimate variance differs from that of simple regression by multiplication of the term $\frac{1}{1 - R_j^2}$. This term is often called the variance inflation factor.
Remember that $R_j^2$ is the coefficient of determination from the regression of $X_j$ on all other independent variables. This has two key implications for the variance of $\hat\beta_j$:
The more highly correlated $X_j$ is with other independent variables (a situation we call multicollinearity), the higher the $R_j^2$, the higher the variance inflation factor, and hence the higher the variance of the regression coefficient estimate.
The more control variables we include, the higher the $R_j^2$, the higher the variance inflation factor, and the higher the variance of the regression coefficient estimate (there’s one problem with throwing in irrelevant regressors).
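To get a feel for the magnitudes, the sketch below computes the variance inflation factor for a regressor that is weakly related to another regressor and for one that is strongly related to it. The helper `vif` and the simulated data are assumptions for illustration.

```python
import numpy as np

def vif(xj, others):
    """Variance inflation factor 1 / (1 - R_j^2) for regressor xj."""
    n = len(xj)
    W = np.column_stack([np.ones(n), others])
    resid = xj - W @ np.linalg.lstsq(W, xj, rcond=None)[0]
    r2_j = 1.0 - (resid @ resid) / ((xj - xj.mean()) ** 2).sum()
    return 1.0 / (1.0 - r2_j)

rng = np.random.default_rng(7)
n = 1_000
x2 = rng.normal(size=n)
x1_weak = 0.1 * x2 + rng.normal(size=n)     # barely related to x2
x1_strong = 3.0 * x2 + rng.normal(size=n)   # strongly related to x2

vif_weak = vif(x1_weak, x2)
vif_strong = vif(x1_strong, x2)

print("VIF, weak correlation:  ", vif_weak)
print("VIF, strong correlation:", vif_strong)
```

A factor near 1 means the other regressors cost almost nothing in precision, while a large factor means the variance of the coefficient estimate is multiplied by that amount relative to the uncorrelated case.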
Estimating the error variance
We must also revise our error variance estimate formula for multiple regression to account for the difference in degrees of freedom.
An unbiased estimate of the error variance for the multiple regression with k regressors (and an intercept) is given by:
$$\hat\sigma^2 = \frac{1}{n-k-1} \sum_{i=1}^{n} \hat u_i^2 = \frac{SSR}{n-k-1}$$
where $\hat u_i$ are the OLS residuals.
In this case the degrees of freedom is $(n-k-1)$. This is the more general case, which includes simple regression, in which $k=1$ so the degrees of freedom was $(n-2)$.
Remember: we subtract the number of parameters we are estimating from the sample size to get the degrees of freedom for a regression. In this case we are estimating k+1 parameters (k coefficients plus 1 intercept), so we have k+1 equations determining the values of those estimates that we must satisfy.
The standard error
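The standard error of a multiple regression coefficient estimate is given by:
$$se(\hat\beta_j) = \frac{\hat\sigma}{\sqrt{SST_j \left(1 - R_j^2\right)}}$$
Where
$$\hat\sigma = \sqrt{\hat\sigma^2}$$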
This estimate is only valid under homoskedasticity and no correlation among error terms!
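As a check on the formula, the sketch below computes the standard error of one coefficient in two ways: from the error variance estimate, the total sum of squares of the regressor, and its coefficient of determination on the other regressor, and from the usual matrix expression $\hat\sigma^2 (X'X)^{-1}$. The data are simulated and homoskedastic by construction; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 300, 2
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)   # homoskedastic errors

X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat

# Error variance estimate with n - k - 1 degrees of freedom.
sigma2_hat = resid @ resid / (n - k - 1)

# Ingredients of the formula for the coefficient on x1.
W = np.column_stack([np.ones(n), x2])
r1 = x1 - W @ np.linalg.lstsq(W, x1, rcond=None)[0]
sst_1 = ((x1 - x1.mean()) ** 2).sum()
r2_1 = 1.0 - (r1 @ r1) / sst_1

se_formula = np.sqrt(sigma2_hat / (sst_1 * (1.0 - r2_1)))

# Same quantity from the matrix expression sigma^2 * (X'X)^{-1}.
se_matrix = np.sqrt(sigma2_hat * np.linalg.inv(X.T @ X)[1, 1])

print(se_formula, se_matrix)
```

The two numbers agree, so the scalar formula is just the matrix expression written out for a single coefficient.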
Under the set of assumptions we have used so far we can say with confidence that $\hat\beta_j$ is BLUE (no, that does not mean it’s lonely).
BLUE stands for “Best Linear Unbiased Estimator”.
“Best” in this case refers to having the smallest variance within a particular class of estimators. Here the class is linear unbiased estimators.
An estimator $\tilde\beta_j$ of $\beta_j$ is “Linear” if it can be expressed as a linear function of the data on the dependent variable: $\tilde\beta_j = \sum_{i=1}^{n} w_{ij} Y_i$, where $w_{ij}$ is any function of sample values of the independent variables.
An estimator $\tilde\beta_j$ of $\beta_j$ is “Unbiased” if $E[\tilde\beta_j] = \beta_j$.
Gauss-Markov Theorem: assumptions
This famous result in statistics is known as the Gauss-Markov Theorem.
The assumptions necessary for the theorem to hold are:
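1. Linearity: The true population model is given by: $Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + u$
2. Zero conditional mean: $E[u \mid X_1, \dots, X_k] = 0$
3. No perfect collinearity
4. Homoskedasticity: $Var(u \mid X_1, \dots, X_k) = \sigma^2$
5. $\{(Y_i, X_{i1}, \dots, X_{ik}) : i = 1, \dots, n\}$ is a purely random sample from the population
   This buys us no correlation between individual error terms.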
Gauss-Markov Theorem
Under assumptions 1-5, the OLS estimates have the smallest variance among all linear, unbiased estimators.
This means that in addition to being accurate on average, the OLS estimates are the most precise among all estimators of the parameters that are both linear and unbiased.
Note that there could be other estimators that are nonlinear and attain smaller variances. Under certain additional conditions, we can argue that OLS is the best among all unbiased estimators (more on that later). In fact there can also be biased estimators that attain smaller variances (but we are generally not interested in accepting bias in exchange for smaller variance).
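A small Monte Carlo sketch can make the theorem concrete. It compares the OLS slope with another estimator that is also linear in the dependent variable and unbiased: the slope of the line through the first and last observations. The alternative estimator, the fixed regressor values, and the parameter values are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(9)
n, reps = 50, 5_000
x = np.linspace(0.0, 10.0, n)        # regressor values held fixed across samples
beta0, beta1 = 1.0, 2.0              # made-up population parameters

ols_draws = np.empty(reps)
alt_draws = np.empty(reps)
x_dev = x - x.mean()

for r in range(reps):
    # Homoskedastic, uncorrelated errors, as the theorem requires.
    y = beta0 + beta1 * x + rng.normal(size=n)
    # OLS slope: a weighted sum of the y values.
    ols_draws[r] = (x_dev * (y - y.mean())).sum() / (x_dev ** 2).sum()
    # Alternative linear unbiased estimator: slope through the two end points.
    alt_draws[r] = (y[-1] - y[0]) / (x[-1] - x[0])

print("mean of OLS slope:        ", ols_draws.mean())
print("mean of end-point slope:  ", alt_draws.mean())
print("variance of OLS slope:    ", ols_draws.var())
print("variance of end-point one:", alt_draws.var())
```

Both estimators are centered on the true slope, but the OLS draws are far less spread out, which is what “best” means here.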
Pitfalls of OLS: nonlinearity
As we previously discussed, if the PRF is not linear in the parameters, then we cannot use OLS to estimate the parameters.
We often end up having to resort to other estimation methods, such as nonlinear least squares (outside of the scope of this class, and a pain in the butt).
Still, we would like to be able to use OLS when possible, since it is so easy to work with…
In fact in certain situations we may still be able to use OLS, even when we have nonlinearities in the parameters, by transforming the model into one that is linear in the parameters.
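For example, suppose that the population model takes the following form:
$$Y = \exp\left(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + u\right)$$
We can transform this into a model that is linear in parameters using the natural logarithm:
$$\ln(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u$$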
Logs are important in economics. For example, one of the most commonly estimated regression equations in labor economics is a log-wage equation:
$$\ln(\text{wage}) = \beta_0 + \beta_1 \text{educ} + \beta_2 \text{exper} + \dots + u$$
Where educ is a measure of education (often years of schooling), exper is a measure of labor market experience, and … is any other interesting thing that may impact wages.
Interpretation: multiple regression
We would like to think a bit more about interpretation of OLS coefficient estimates in the context of multiple regression.
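To begin, let’s look at the fitted values from estimating the log-wage regression equation with education and experience as regressors:
$$\widehat{\ln(\text{wage})} = \hat\beta_0 + \hat\beta_1 \text{educ} + \hat\beta_2 \text{exper}$$
Now let’s consider changing each of the independent variables and seeing how the predicted value changes:
$$\Delta \widehat{\ln(\text{wage})} = \hat\beta_1 \Delta \text{educ} + \hat\beta_2 \Delta \text{exper}$$
Δ indicates “change in” a variable
Note that $\hat\beta_0$ is constant, so it does not come into play when looking at how log-wages change.
Now suppose that we have a one unit increase in education, and we hold experience constant ($\Delta \text{educ} = 1$, $\Delta \text{exper} = 0$):
$$\Delta \widehat{\ln(\text{wage})} = \hat\beta_1$$
So we see that we can interpret a single regression coefficient estimate, in this case the coefficient on education, as the predicted change in log wages from a one year increase in education holding all other factors fixed.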
Interpretation: logs
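The predicted change in log wages (or any logged variable) has a special interpretation. Let’s have a closer look. Write $\text{wage}_0$ for the wage before the change and $\text{wage}_1$ for the wage after, so that $\Delta \text{wage} = \text{wage}_1 - \text{wage}_0$:
$$\Delta \ln(\text{wage}) = \ln(\text{wage}_1) - \ln(\text{wage}_0) = \ln\left(\frac{\text{wage}_1}{\text{wage}_0}\right) = \ln\left(1 + \frac{\Delta \text{wage}}{\text{wage}_0}\right) \approx \frac{\Delta \text{wage}}{\text{wage}_0}$$
You can take that last step on faith, but a little calculus would tell us that those last two are pretty close when the percentage change is small.
So we can see that the regression coefficient on education has the interpretation of (approximately) the (predicted) proportional change in wages in response to a one year increase in schooling (ceteris paribus). Multiplied by 100, it is approximately the predicted percentage change.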
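How good the log approximation is can be seen from a few numbers. The sketch below compares $\ln(1 + g)$ with $g$ for several proportional changes $g$; the particular values of $g$ are arbitrary.

```python
import math

# Compare the change in logs, ln(1 + g), with the proportional change g itself.
for g in (0.01, 0.05, 0.10, 0.50):
    log_change = math.log(1.0 + g)
    print(f"g = {g:.2f}  ln(1+g) = {log_change:.4f}  gap = {g - log_change:.4f}")

gap_small = abs(math.log(1.05) - 0.05)   # a 5% change
gap_large = abs(math.log(1.50) - 0.50)   # a 50% change
```

The gap is negligible for changes of a few percent and grows as the change gets larger, so for large coefficients the exact proportional change, $e^{\hat\beta} - 1$, is the safer number to report.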
Dummy dependent variables
While we are on the subject of dependent variables with special interpretations in regression analysis, we will discuss one more: binary dependent variables.
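Suppose we have the following population model:
$$Y = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k + u$$
Where $Y$ is a dummy variable either equal to 1 or 0.
Then the PRF is given by (assuming zero conditional mean of the error):
$$E[Y \mid X_1, \dots, X_k] = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k$$
We know from statistics (or at least now you do) that when $Y$ is a dummy variable,
$$E[Y \mid X_1, \dots, X_k] = P(Y = 1 \mid X_1, \dots, X_k)$$
The expected value of a dummy variable is simply the probability that the variable attains a value of 1.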
We call a multiple regression model with a binary dependent variable a linear probability model because the response probability is linear in the parameters.
Accordingly, the regression coefficient estimates have the interpretation of the predicted change in probability that the dependent variable is 1 given a one unit increase in a regressor holding all other factors fixed.
For example, suppose we swap out log-wages in the previous regression model we estimated with a dummy variable, call it $\text{lfp}$, equal to 1 if an individual participates in the labor force:
$$\text{lfp} = \beta_0 + \beta_1 \text{educ} + \beta_2 \text{exper} + \dots + u$$
Then $\hat\beta_1$ has the interpretation in this case of the predicted change in the probability of labor force participation from a one year increase in schooling holding all other factors fixed.
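A minimal sketch of a linear probability model on simulated data follows. The variable names (`educ`, `exper`, `lfp`), the participation probabilities, and the coefficient values are all made up for illustration and are not estimates from any real data set.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 2_000
educ = rng.integers(8, 21, size=n).astype(float)     # made-up years of schooling
exper = rng.integers(0, 31, size=n).astype(float)    # made-up years of experience

# Assumed true participation probability, linear in the regressors
# and between 0.34 and 0.85 for these ranges.
p_true = 0.1 + 0.03 * educ + 0.005 * exper
lfp = (rng.uniform(size=n) < p_true).astype(float)   # 1 = participates, 0 = not

# Linear probability model: plain OLS with a 0/1 dependent variable.
X = np.column_stack([np.ones(n), educ, exper])
beta_hat = np.linalg.lstsq(X, lfp, rcond=None)[0]
p_hat = X @ beta_hat                                 # predicted probabilities

print("coefficient on educ:", beta_hat[1])
print("average predicted probability:", p_hat.mean())
print("sample participation rate:    ", lfp.mean())
```

The coefficient on `educ` is read as the predicted change in the participation probability per extra year of schooling, holding experience fixed. With an intercept in the model, the average predicted probability matches the sample participation rate.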