Goodness of fit and variance of the simple regression

Explaining Y

Even if our simple regression model accurately captures the relationship between Y and X (usually doubtful) we may be interested in seeing to what extent X explains Y.

If we increase X by one unit, can we reasonably expect Y to increase by $\hat{\beta}_1$ units (our estimated slope)?

Or is there a lot more variation from unobserved components that we are missing?

Another way to think about this is graphically. We are interested in seeing how well our predicted regression line actually fits the data points (a measure of “goodness-of-fit”).

Residuals and fitted values

First we must recall what residuals and fitted values are.

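The fitted value of the simple regression is defined as:

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$

The residual of the simple regression is:

$$\hat{e}_i = Y_i - \hat{Y}_i$$

From the formula for a residual we have:

$$Y_i = \hat{Y}_i + \hat{e}_i$$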
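To make the split of Y into a fitted value and a residual concrete, here is a minimal sketch in plain Python. The five data points are made up for illustration (they are not from the slides), and the variable names are just choices for this example.

```python
# Illustrative data, invented for this sketch.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS slope and intercept for the simple regression of y on x.
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Fitted value = estimated intercept + estimated slope * X.
fitted = [b0 + b1 * xi for xi in x]
# Residual = observed Y minus fitted value.
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Every observation is exactly its fitted value plus its residual.
for yi, fi, ri in zip(y, fitted, residuals):
    print(f"Y = {yi:.1f} = {fi:.2f} + ({ri:+.2f})")
```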

Sum of squares total

Next we introduce some summary measures of variation:

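The sum of squares total is defined by:

$$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

This should look somewhat familiar. If not, think about dividing SST by (n-1):

$$\frac{SST}{n-1} = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2$$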

This is actually just the formula for the sample variance of Y.

Thus SST is basically a measure of how much total sample variation there is in Y.
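A quick numerical check of the link between SST and the sample variance, using the standard library's `statistics.variance` (which divides by n-1). The Y values are made up for illustration.

```python
import statistics

# Illustrative Y values, invented for this sketch.
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(y)
y_bar = sum(y) / n

# Sum of squares total: squared deviations of Y from its sample average.
sst = sum((yi - y_bar) ** 2 for yi in y)

# statistics.variance computes the sample variance (divisor n - 1).
sample_variance = statistics.variance(y)

print("SST:", sst)
print("SST / (n - 1):", sst / (n - 1))
print("sample variance of Y:", sample_variance)
```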

Sum of squares explained

The sum of squares explained is defined as:

$$SSE = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$$

Similarly to how SST is a measure of the sample variation in Y, SSE is a measure of the sample variation in the fitted values of the regression.

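This may not be obvious given that we are using the total sample average, but the total sample average is actually the same as the average of the fitted values:

$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \frac{1}{n}\sum_{i=1}^{n} \hat{Y}_i + \frac{1}{n}\sum_{i=1}^{n} \hat{e}_i$$

Recall that the residuals $\hat{e}_i$ sum to zero, $\sum_{i=1}^{n} \hat{e}_i = 0$, which gives us the result that $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} \hat{Y}_i$.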
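The claim that the fitted values have the same average as Y (because the OLS residuals sum to zero) is easy to verify numerically. A minimal sketch with made-up data:

```python
# Illustrative data, invented for this sketch.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS estimates.
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

mean_fitted = sum(fitted) / n
sum_residuals = sum(residuals)

# Up to floating-point rounding, the residuals sum to zero and the
# two averages coincide.
print("sum of residuals:", sum_residuals)
print("average of Y:", y_bar)
print("average of fitted values:", mean_fitted)
```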

Sum of squared residuals

We have seen the sum of squared residuals before. This is what we minimize in order to estimate the OLS regression coefficients:

$$SSR = \sum_{i=1}^{n} \hat{e}_i^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$$

This too has an interpretation as a measure of sample variation. In this case we are measuring the sample variation in the residuals.

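Again, we are used to seeing squared deviations from the mean, but recall that we now know that the sample average of the residuals is zero, $\bar{\hat{e}} = 0$.

So we can write SSR as:

$$SSR = \sum_{i=1}^{n} (\hat{e}_i - \bar{\hat{e}})^2$$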

Putting it all together…

It can be shown (but much to your relief I will not) that SST=SSE + SSR

We already established that we can split up Y into two components: fitted values and residuals. Now we see that we can also split up the sample variation in Y into two components: variation in the fitted values and variation in the residuals.

Another way to think about this is we split the sample variation in Y into two parts: a part that we can explain with our regression model (SSE) and a part that we cannot explain with our regression model (SSR).
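The decomposition is not proved here, but we can at least check it on numbers. The sketch below computes all three sums of squares for a small made-up data set and confirms that the explained and residual parts add up to the total.

```python
# Illustrative data, invented for this sketch.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS estimates and fitted values.
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation in Y
sse = sum((fi - y_bar) ** 2 for fi in fitted)           # explained variation
ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual variation

print("SST:", sst)
print("SSE + SSR:", sse + ssr)
```

Note that this is a check on one data set, not a proof; the identity relies on the regression including an intercept and being estimated by OLS.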

Measuring goodness-of-fit

Now we are ready to establish a measure of goodness-of-fit. Specifically, we are looking to measure the percentage of variation in Y that we can explain with our model.

Start with our result from the previous slide:

$$SST = SSE + SSR$$

Dividing both sides by SST:

$$1 = \frac{SSE}{SST} + \frac{SSR}{SST}$$

Rearranging we arrive at our measure:

$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}$$

Recall that SST measures total variation and SSE measures explained variation, so this statistic gives us a measure of how much of the total variation we are explaining.

A few notes on $R^2$

Given SST=SSE+SSR, the $R^2$ statistic is a number between 0 and 1, which makes sense given its interpretation as the share of variation in Y that we explain.
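To see the 0-to-1 range in action, here is a small sketch that computes the goodness-of-fit measure (explained variation over total variation) for three made-up data sets: one where Y is an exact linear function of X, one with some noise, and one where the estimated slope is zero. The helper name `r_squared` is just a label chosen for this example.

```python
def r_squared(x, y):
    """Share of the sample variation in y explained by a simple OLS fit on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((fi - y_bar) ** 2 for fi in fitted)
    return sse / sst

# Illustrative data, invented for this sketch.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_exact = [3.0, 5.0, 7.0, 9.0, 11.0]   # exactly 1 + 2x, no noise
y_noisy = [2.0, 4.0, 5.0, 4.0, 5.0]    # upward trend plus noise
y_flat = [1.0, 2.0, 1.0, 2.0, 1.0]     # no linear relationship with x

r2_exact = r_squared(x, y_exact)
r2_noisy = r_squared(x, y_noisy)
r2_flat = r_squared(x, y_flat)

print("exact line:", round(r2_exact, 6))
print("noisy data:", round(r2_noisy, 6))
print("flat data: ", round(r2_flat, 6))
```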

Econometricians and statisticians often try to use $R^2$ as a means of measuring the quality of their models. However, $R^2$ does not necessarily serve as a reliable measure of how good a model is at capturing the relationship between X and Y.

A very low $R^2$ does not mean that an econometric model is useless. We may still have a good estimate of the relationship between X and Y, but there may be lots of noise in Y unrelated to X. Low $R^2$ values are actually somewhat common in econometric analysis, especially so in cross-sectional analysis. A low $R^2$ may signal the possibility that there are important unobserved components that can cause omitted variable bias (more on that later), but it does not guarantee it.

By the same token, a very high $R^2$ does not necessarily mean that an econometric model is 100% accurate. Here again omitted variable bias can be a problem.

The variability of $\hat{\beta}_1$

The variance of the regression coefficient is a very important statistic in econometrics.

It provides a metric for how “good” an estimator is. Examining the variance of an estimator of the regression parameter gives us a sense of how precisely we are estimating the parameter. Unbiasedness only tells us that our estimate equals the parameter on average. In other words, if we were to draw many random samples from our population of interest, estimate the regression coefficient in each one, and average those estimates, we should be close to the parameter. Of course, we generally only work with one sample, so we hope that our estimate isn’t likely to be too far away from the truth, and a small variance is what makes that likely.

The variance will play a key role in conducting hypothesis tests on our regression coefficients.

Homoskedasticity

Before we commence examining the variance of $\hat{\beta}_1$ we start with a key assumption:

$$\mathrm{Var}(e \mid X) = \sigma^2$$

We call this condition homoskedasticity. This just says that the variance of the error term (the unobserved components) conditional on X is a constant.

This condition is not necessary for us to have an unbiased estimator of the population regression coefficient, but it will be important in establishing what the “best” estimator for β is.

Note that with this assumption in addition to our other simple OLS assumptions we can show that

$$\mathrm{Var}(e) = E(e^2) = \sigma^2$$

Heteroskedasticity

The opposite of homoskedasticity is heteroskedasticity (say that 5 times fast).

Under heteroskedasticity, the conditional variance of the error term is a function of X. In other words, the residuals will be more or less spread out around the regression line depending on the value of X.

Under heteroskedasticity, calculating the variance of regression coefficients becomes a little bit more complicated, but it can still be done. We will talk about how we deal with this later in the class.
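A small simulation can make the contrast concrete. The sketch below draws error terms two ways: with a constant standard deviation (homoskedastic) and with a standard deviation proportional to X (heteroskedastic). All numbers, including the seed and the rule "standard deviation = 0.5 times X", are assumptions made up for illustration. We can only look at the errors directly here because we generated them ourselves; in real data they are unobserved.

```python
import random

rng = random.Random(2)  # fixed seed so the run is repeatable

n = 4000
x = [rng.uniform(1.0, 10.0) for _ in range(n)]

# Homoskedastic errors: same spread for every value of X.
e_homo = [rng.gauss(0.0, 2.0) for _ in x]
# Heteroskedastic errors: spread grows with X (hypothetical rule).
e_hetero = [rng.gauss(0.0, 0.5 * xi) for xi in x]

def spread_ratio(errors):
    """Average squared error for high X divided by that for low X."""
    low = [e ** 2 for xi, e in zip(x, errors) if xi < 5.5]
    high = [e ** 2 for xi, e in zip(x, errors) if xi >= 5.5]
    return (sum(high) / len(high)) / (sum(low) / len(low))

ratio_homo = spread_ratio(e_homo)
ratio_hetero = spread_ratio(e_hetero)

# A ratio near 1 means the spread does not depend on X.
print(f"homoskedastic errors, high-X vs low-X spread:   {ratio_homo:.2f}")
print(f"heteroskedastic errors, high-X vs low-X spread: {ratio_hetero:.2f}")
```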

Homoskedasticity vs. Heteroskedasticity

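The variance of $\hat{\beta}_1$

We saw previously that the formula for $\hat{\beta}_1$ could be written as

$$\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n} (X_i - \bar{X})\, e_i}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

Given this combined with the assumption of homoskedasticity we have:

$$\mathrm{Var}(\hat{\beta}_1 \mid X) = \frac{\sigma^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$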

Note that this formula is only valid under homoskedasticity.
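One way to build trust in the variance formula (error variance divided by the sum of squared deviations of X from its mean) is a small Monte Carlo experiment: hold X fixed, draw many samples of Y with homoskedastic errors, estimate the slope each time, and compare the variance of those estimates to the formula. The population values below (intercept 1, slope 0.5, error standard deviation 2) are made up for this sketch.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: Y = 1 + 0.5 X + e, with Var(e | X) = 4.
beta0, beta1, sigma = 1.0, 0.5, 2.0
x = [float(i) for i in range(1, 21)]  # X held fixed across samples
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

def slope_estimate(y):
    """OLS slope from regressing y on the fixed x values."""
    return sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sxx

slopes = []
for _ in range(5000):
    y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
    slopes.append(slope_estimate(y))

mean_slope = sum(slopes) / len(slopes)
simulated_var = sum((b - mean_slope) ** 2 for b in slopes) / (len(slopes) - 1)
formula_var = sigma ** 2 / sxx

print(f"average slope estimate:       {mean_slope:.4f}")
print(f"simulated variance of slope:  {simulated_var:.5f}")
print(f"variance from the formula:    {formula_var:.5f}")
```

The two variances should be close but not identical, since the simulated one is itself an estimate from a finite number of draws.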

How do we know $\sigma^2$?

Well… we generally don’t. After all, we think of the error term e as containing all unobserved components that affect our dependent variable. And given those components are unobserved it would be pretty hard to observe their variance…

Thus, in practice, we are going to have to estimate $\sigma^2$.

Recall that

$$\sigma^2 = E(e^2)$$

How can we estimate $\sigma^2$?

Estimating the error variance

You might think that we can simply replace the expectation with a sample average, which is close… but not quite right.

For one thing, we cannot observe any of the exact values of the error term e.

We do, on the other hand, know the values of the residuals:

$$\hat{e}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i$$

So we use residuals, not errors, in estimating the error variance.

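Also, we may be tempted to estimate $\sigma^2$ with $\frac{1}{n}\sum_{i=1}^{n} \hat{e}_i^2$, but this would be a biased estimate of the true variance.

The unbiased estimate of $\sigma^2$ is

$$\hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n} \hat{e}_i^2 = \frac{SSR}{n-2}$$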
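The bias from dividing by n instead of n-2 shows up clearly in a simulation with a small sample. The sketch below repeatedly draws samples from a made-up population with error variance 4, computes the sum of squared residuals each time, and averages the two candidate estimates across draws. All population values and the seed are assumptions chosen for illustration.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population: Y = 1 + 0.5 X + e, with error variance 4.
beta0, beta1, sigma = 1.0, 0.5, 2.0
x = [float(i) for i in range(1, 11)]  # small sample, n = 10
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

def sum_squared_residuals(y):
    """Fit simple OLS of y on x and return the sum of squared residuals."""
    y_bar = sum(y) / n
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

reps = 10000
total_ssr = 0.0
for _ in range(reps):
    y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
    total_ssr += sum_squared_residuals(y)

mean_ssr = total_ssr / reps
avg_divide_by_n = mean_ssr / n              # average of SSR / n
avg_divide_by_n_minus_2 = mean_ssr / (n - 2)  # average of SSR / (n - 2)

print(f"true error variance:        {sigma ** 2:.2f}")
print(f"average of SSR / n:         {avg_divide_by_n:.2f}")
print(f"average of SSR / (n - 2):   {avg_divide_by_n_minus_2:.2f}")
```

Dividing by n systematically undershoots the true error variance, while dividing by n-2 lands close to it on average.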

Wait, why (n-2)?

In this case n-2 is the degrees of freedom. Let’s unpack this a little bit for intuition.

Recall that $\hat{e}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i$. The key here is that within the residual we have two estimates of parameters, $\hat{\beta}_0$ and $\hat{\beta}_1$.

Let’s suppose we wanted to tinker with the values of the residuals while still satisfying the two equations that determine the regression estimates. We would be able to freely alter n-2 of them, but we would have to restrict the last 2 to satisfy those equations. Thus we have n-2 degrees of freedom.

This is the same logic underlying why we have (n-1) in the estimate of a sample variance. In the sample variance we use the sample average, which is an estimate of the population mean.
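Here is a sketch of the "tinkering" argument in code. It assumes the two equations in question are the usual OLS first-order conditions: the residuals sum to zero, and the residuals multiplied by X sum to zero. Given the first n-2 residuals, those two conditions pin down the last two. The data are made up for illustration.

```python
# Illustrative data, invented for this sketch.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS estimates and residuals.
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
residuals = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

# The two restrictions OLS places on the residuals.
sum_resid = sum(residuals)
sum_x_resid = sum(xi * ri for xi, ri in zip(x, residuals))

# Pretend we only know the first n - 2 residuals ...
known = residuals[: n - 2]
s0 = sum(known)
s1 = sum(xi * ri for xi, ri in zip(x, known))

# ... and solve the two restrictions for the remaining two:
#   e_a + e_b = -s0   and   x_a * e_a + x_b * e_b = -s1
x_a, x_b = x[n - 2], x[n - 1]
e_b = (x_a * s0 - s1) / (x_b - x_a)
e_a = -s0 - e_b

print("actual last two residuals:   ", residuals[n - 2], residuals[n - 1])
print("recovered from the other n-2:", e_a, e_b)
```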

The standard error

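We call the square root of the estimated error variance the standard error (not standard deviation) of the regression. We denote this by

$$\hat{\sigma} = \sqrt{\hat{\sigma}^2}$$

Also, for $\hat{\beta}_1$, we have the standard error of the regression coefficient given by:

$$se(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$$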

This is the primary measure of variation in the regression coefficient we typically use. It is standard in pretty much all statistical software packages to report the standard errors.
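Putting the pieces together, the sketch below computes the standard error of the regression and the standard error of the slope by hand for a small made-up data set. It uses the homoskedasticity-based formula, so it should line up with the default (non-robust) standard errors that statistical packages report.

```python
import math

# Illustrative data, invented for this sketch.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS estimates.
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Sum of squared residuals and the estimated error variance (divisor n - 2).
ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
sigma2_hat = ssr / (n - 2)

# Standard error of the regression and of the slope coefficient.
sigma_hat = math.sqrt(sigma2_hat)
se_b1 = sigma_hat / math.sqrt(sxx)

print(f"slope estimate:                   {b1:.4f}")
print(f"standard error of the regression: {sigma_hat:.4f}")
print(f"standard error of the slope:      {se_b1:.4f}")
```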

Standard errors will be very important when we consider hypothesis testing in regression.