(R)egression

Drew Barker

September 13, 2018

The lm() function

Calculating a regression equation in R is actually pretty easy with the lm() function. The lm() function takes two main inputs: a formula and the dataset we wish to use. To demonstrate, I will use a dataset that comes built into R called “mtcars”. This dataset contains data on cars from Motor Trend magazine. For the purpose of this demonstration, we wish to regress a car's miles per gallon on the weight of the car. First let’s load the dataset (no need to import, since it is already built in) and look at a summary of it.

data(mtcars) #This calls the dataset into the global environment
summary(mtcars) #summary of the dataset with sample averages, median, min, etc.

##       mpg             cyl             disp             hp
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
##       drat             wt             qsec             vs
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
##        am              gear            carb
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
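Since only two of these columns matter for what follows, it can help to look at them on their own. The lines below are a small sketch of one way to do that; the object name cars_sub is made up for illustration and is not used elsewhere in this document.

```r
data(mtcars)

# Keep only the two columns used in the regression (cars_sub is a hypothetical name)
cars_sub <- mtcars[, c("mpg", "wt")]

head(cars_sub)    # first six rows
nrow(cars_sub)    # number of cars in the sample
summary(cars_sub) # the same summary as above, restricted to mpg and wt
```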

The two variables we are interested in are named “mpg” and “wt” respectively. So let’s run the regression. Take very special note of how we input the formula for the regression. The correct syntax for a formula that specifies a regression of Y on X is Y~X. Pay special attention to the “~”. This is a “tilde” character, not a hyphen or dash.

lm(mpg~wt, data=mtcars) #this is how we use the lm function to run a regression of miles per gallon on car weight.

##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept)           wt
##      37.285       -5.344

We see that if we run lm (which stands for “linear model”, by the way) as it is, it calculates the regression and spits out the estimates. In this case we have an estimated intercept of 37.285 and an estimated coefficient on wt of -5.344. We can also save the linear model to a new object in the global environment exactly as we did before:

reg1<-lm(mpg~wt, data=mtcars) #"saving" the regression output
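As a sanity check on what lm() is estimating here, the one-regressor case can be reproduced by hand: the slope is the sample covariance of X and Y divided by the sample variance of X, and the intercept is the mean of Y minus the slope times the mean of X. This is the standard textbook formula, not a description of how lm() works internally, and the names slope_hand, intercept_hand and reg_check are made up for this sketch.

```r
data(mtcars)

# Textbook OLS formulas for a single regressor
slope_hand <- cov(mtcars$wt, mtcars$mpg) / var(mtcars$wt)
intercept_hand <- mean(mtcars$mpg) - slope_hand * mean(mtcars$wt)
c(intercept = intercept_hand, slope = slope_hand)

# Compare with the estimates from lm()
reg_check <- lm(mpg ~ wt, data = mtcars)
all.equal(unname(coef(reg_check)), c(intercept_hand, slope_hand))
```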

The linear model output contains much more than just the parameter estimates, such as residuals and fitted values. You will often be interested in revisiting the regression, so saving the output is handy (especially for regressions with much longer formulas). With the regression output saved, we can retrieve the coefficient estimates the same way we extract variables from a dataset, using the $ operator:

reg1$coefficients #extracts the coefficient estimates from the linear model

## (Intercept)          wt
##   37.285126   -5.344472
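The residuals and fitted values mentioned above can be pulled out of the saved object in a similar way. The sketch below uses the base R accessor functions coef(), fitted() and residuals(), and checks that each fitted value plus its residual gives back the observed mpg.

```r
data(mtcars)
reg1 <- lm(mpg ~ wt, data = mtcars)

names(reg1)           # components stored in the linear model object
coef(reg1)            # same values as reg1$coefficients
head(fitted(reg1))    # fitted values for the first six cars
head(residuals(reg1)) # residuals for the first six cars

# Fitted value + residual should reproduce the observed mpg
all.equal(unname(fitted(reg1) + residuals(reg1)), mtcars$mpg)
```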

Interpreting regression coefficients

Take care when interpreting regression coefficients. Remember that we only have an unbiased estimate of the true causal effect of a treatment variable if the true model is in fact linear (and the zero conditional mean assumption holds). With a coefficient estimate of -5.344 we are tempted to interpret this as: “for every additional 1,000 lbs of weight (since wt is measured in 1,000s of lbs), miles per gallon decreases by 5.344”. However, we should be careful to remember that this is only true if our assumptions hold (and even then this is only an estimate of the true causal effect).
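To see the "per 1,000 lbs" reading in numbers, one option is to predict mpg for two made-up cars that differ by exactly one unit of wt. The data frame new_cars and its weights (3,000 and 4,000 lbs) are hypothetical values chosen for illustration. The gap between the two predictions equals the wt coefficient, which describes the fitted line and says nothing by itself about causality.

```r
data(mtcars)
reg1 <- lm(mpg ~ wt, data = mtcars)

# Two hypothetical cars: 3,000 lbs and 4,000 lbs (wt is in 1,000s of lbs)
new_cars <- data.frame(wt = c(3, 4))

pred <- predict(reg1, newdata = new_cars) # predicted mpg for each car
pred
diff(pred)                                # difference in predicted mpg
coef(reg1)["wt"]                          # the same number as the difference
```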

R-squared and standard errors

Getting the R-squared statistic and standard errors for the regression is easily done with the summary function. In fact, the summary function returns much of what econometricians normally look for in a regression.

summary(reg1) #self-explanatory, summarizes the linear model output

##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5432 -2.3647 -0.1252  1.4096  6.8727
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

We see that this returns the coefficient estimates, the standard errors, the R-squared (listed under “Multiple R-squared”), and other things which we will discuss later in the class. Remember that these standard errors assume homoskedasticity. In the presence of heteroskedasticity, these standard errors will be invalid. We will discuss how to estimate heteroskedasticity-robust standard errors when we discuss hypothesis testing.
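The summary can also be saved to an object so that individual numbers can be pulled out rather than read off the printout. The sketch below assumes the object names s, ssr and sst, which are made up here. It extracts the R-squared and the (homoskedasticity-based) standard errors, then recomputes the R-squared by hand as one minus the ratio of the sum of squared residuals to the total sum of squares.

```r
data(mtcars)
reg1 <- lm(mpg ~ wt, data = mtcars)

s <- summary(reg1)       # save the summary instead of just printing it
s$r.squared              # R-squared
s$adj.r.squared          # adjusted R-squared
coef(s)                  # matrix of estimates, std. errors, t values, p-values
coef(s)[, "Std. Error"]  # just the standard errors

# R-squared by hand: 1 - SSR/SST
ssr <- sum(residuals(reg1)^2)                 # sum of squared residuals
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2) # total sum of squares
1 - ssr / sst
```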