I really need a help with R program
Ordinary Least Squares Introduction
This chapter introduces the most commonly used regression estimation technique, Ordinary Least Squares (OLS). OLS is a regression estimation technique that calculates the coefficients β̂ so as to minimize the sum of the squared residuals, that is OLS minimizes
∑n i=1 e
2 i or
∑ (Yi − Ŷi)2.
An estimator is a mathematical technique that is applied to a sample of data to produce real world numerical estimates of the true population regression coefficients. Thus, OLS is an estimator, and a β̂ produced by OLS is an estimate.
How does OLS work?
A Single-Independent-Variable Regression Model
Recall the theoretical equation: Yi = β0 + β1Xi + �i OLS selects those estimates of β0 and β1 that minimize the total squared residuals, where the sum is taken over all the sample data points. The formula of these coefficients are:
β̂1 = ∑n
i=1[(Xi − X̄)(Yi − Ȳ )]∑n i=1(Xi − X̄)2
β̂0 = Ȳ − β̂1X̄
where, X̄ = ∑ Xi/N and Ȳ =
∑ Yi/N
Manual Calculation of Estimated Regression Coefficients
In this section we will calculate estimated regression coefficients β̂0 and β̂1 manually using R. The first thing we will do is read a .csv file into R as a dataframe named mydata. The file, “FINAID2.csv” is available on Canvas from this course module. Save it to the same folder as your current working directory for ease of use. rm(list=ls()) setwd("C:/myfile/University of Utah/2019 Spring/Econometrics") mydata <- read.csv("FINAID2.csv", header = TRUE) head(mydata)
## OBS FINAID PARENT HSRANK MALE ## 1 1 19640 0 92 0 ## 2 2 8325 9147 44 1 ## 3 3 12950 7063 89 0 ## 4 4 700 33344 97 1 ## 5 5 7000 20497 95 1 ## 6 6 11325 10487 96 0
This last function shows the first 6 rows to verify the .csv was read in correctly, and to let us examine the variable names and discover whether they are numeric or character. We will focus on two of the variables for our initial regression: one dependent variable and one independent variable. The dependent variable, or left-hand-side (LHS) variable, is called Yi in our formula; the independent variable, or right-hand-side (RHS) variable is labelled Xi in our formula.
1
For our first example we will use FINAID as the dependent variable and HSRANK as the independent variable (our first model specification).
Next, we will attach the dataframe we just created. Then we will proceed to manually calculate estimated coefficients using OLS estimation, as done in the first section of this chapter of the text. attach(mydata)
For ease of use and understanding, for this example we will identify the dependent variable, FINAID, as Y in R and the independent variable, HSRANK, as X. Y <- FINAID X <- HSRANK
Next we need to tell R what to use for Ȳ and X̄, easily done with the mean command. YBAR <- mean(Y) XBAR <- mean(X)
Next we will create vectors for (Yi − Ȳ ) and (Xi − X̄), calling them YDIFF and XDIFF respectively. YDIFF <- Y - YBAR XDIFF <- X - XBAR
To get an idea for what we are creating here, lets combine the vectors we just created into a table called table, using the column bind, or cbind command. Then we will use the head command to view it. table <-cbind(Y, X, YDIFF, XDIFF) head(table)
## Y X YDIFF XDIFF ## [1,] 19640 92 7963.74 10.38 ## [2,] 8325 44 -3351.26 -37.62 ## [3,] 12950 89 1273.74 7.38 ## [4,] 700 97 -10976.26 15.38 ## [5,] 7000 95 -4676.26 13.38 ## [6,] 11325 96 -351.26 14.38
This is comparable to the first few columns of Table 1 on page 39 of the text. Continuing on, we can have R perform the remaining intermediate calculations to estimate our regression coefficients. We will calculate (Xi − X̄)2 and name it XDIFFSQ, as such: XDIFFSQ <- (XDIFF)^2
As well as (Xi − X̄)(Yi − Ȳ ), which we will call PRODUCT PRODUCT <- XDIFF * YDIFF
We will now calculate β̂1, using the sum command in R, and assigning the name BETAHAT1 to the output. BETAHAT1 <- sum(PRODUCT)/sum(XDIFFSQ) BETAHAT1
## [1] 64.68226
We can also now calculate β̂0 as BETAHAT0 <- YBAR - BETAHAT1 * XBAR BETAHAT0
## [1] 6396.894
Let’s check that our estimators look right. We are going to determine estimated regression coefficients with OLS automatically through R by using the lm function. Intuitively lm allows us to write out the equation
2
similar to how we do on paper, with LHS variable, a tilde instead of an equals sign (~), and RHS variable after. In the case (almost always) that we want to save our results, we store them in a new arbitrary variable we create. Here, we’ll name it lm1. lm1 <- lm(Y ~ X) lm1
## ## Call: ## lm(formula = Y ~ X) ## ## Coefficients: ## (Intercept) X ## 6396.89 64.68
We see from the output that the intercept, or β̂0 matches what we calculated, as does the coefficient corresponding to X.
Having computed the estimated regression coefficients using OLS manually, (and checked it with lm), we will now move on to a more typical procedure for estimating regression models with OLS in R.
OLS Estimation with R functions
We will start out by examining the data. Exploring the data using numerical and graphical techniques is an important step in creating a model (part of a larger process we will develop over this course). R has a very useful summary command for descriptive statistics; let’s run it on our data set, first on our dependent variable, then on our independent variable. We can explore the general shape of the data sample for this variable. summary(FINAID)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0 7869 11262 11676 15809 22148
Next let’s explore this variable visually using a graph called a histogram. hist(FINAID)
3
Histogram of FINAID
FINAID
F re
q u
e n
cy
0 5000 15000 25000
0 5
1 0
1 5
The next commands refine the histogram by adding a smooth curve called a kernel density curve that adds information to our exploration. You can think of this as an approximate actual probability density function (PDF). hist(FINAID, prob = TRUE) lines(density(FINAID))
Histogram of FINAID
FINAID
D e
n si
ty
0 5000 15000 25000
0 e
+ 0
0 4
e −
0 5
And now, do the same analyses for our independent variable, HSRANK. summary(HSRANK)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
4
## 20.00 80.00 89.00 81.62 96.75 99.00 hist(HSRANK, prob = TRUE) lines(density(HSRANK))
Histogram of HSRANK
HSRANK
D e
n si
ty
20 40 60 80 100
0 .0
0 0
.0 2
0 .0
4
Now we are prepared to run our regression. Once again we will utilize lm, which will run an OLS regression on the data, and a new function abline, which will use estimators from the regression to draw a line on the plot.
As before we will want to save our results, storing them in an arbitrarily named variable, mymodel. To see the results, we again use the summary function, this time with the newly created model variable as the argument to the function. mymodel <- lm(FINAID ~ HSRANK) summary(mymodel)
## ## Call: ## lm(formula = FINAID ~ HSRANK) ## ## Residuals: ## Min 1Q Median 3Q Max ## -11971.1 -3966.3 -838.7 4436.4 9412.2 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 6396.89 3281.34 1.949 0.0571 . ## HSRANK 64.68 39.15 1.652 0.1050 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 5273 on 48 degrees of freedom ## Multiple R-squared: 0.05381, Adjusted R-squared: 0.03409 ## F-statistic: 2.73 on 1 and 48 DF, p-value: 0.105
5
The resulting summary contains a great deal of information. For this module we will focus on reading the (Intercept) and HSRANK coefficients, specifically the estimates of them apparent in the column titled as such. To be clear, R is telling us β̂0 is the value in the Estimate column next to (Intercept) and β̂1 is the value in the Estimate column next to HSRANK. The other aspect we want to focus on is the Multiple R-Squared, R’s way of identifying what we normally call R2, and Adjusted R-squared. Adjusted R-squared will come in handy once we start adding additional independent variables. Note the particularly low value of R2. We will examine how these change as independent variables are added.
Now we insert the linear regression line into a plot of the data. Note that by convention, we put our independent variable on the X-axis and the dependent variable on the Y -axis; with plot this means naming the independent variable as the first argument and the dependent variable as the second. The third argument will be using the abline command to insert the regression established in the last step as a line in the plot. plot(HSRANK, FINAID, abline(mymodel))
20 40 60 80 100
0 1
0 0
0 0
2 0
0 0
0
HSRANK
F IN
A ID
Multivariate Regression Model
Recall the multivariate equation:
Yi = β0 + β1X1i + β2X2i + ... + β3XKi + �i
A multivariate regression coefficient indicates the change in the dependent variable associated with a one unit increase in the independent variable by holding the other independent variables in the equation constant. For example, β1 measures the impact on Y of a one unit increase in X1 holding constant X2, X3. . . and XK.
The OLS estimation of multivariate models is identical in general approach to the OLS estimation of models with just one independent variable.
To demonstrate multivariate regression, we will extend our initial model twice. The first extension will add PARENT as an independent variable; the second will add MALE, a special type of dummy variable as an independent variable.
As we are adding new variables, we should proceed with data exploration, and a plot to begin our understanding of the relationship between dependent and independent variables.
6
summary(PARENT)
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0 3382 8934 12284 17818 64305 hist(PARENT, prob = TRUE) lines(density(PARENT))
Histogram of PARENT
PARENT
D e
n si
ty
0 20000 40000 60000
0 e
+ 0
0 3
e −
0 5
plot(PARENT, FINAID)
0 20000 40000 60000
0 1
0 0
0 0
2 0
0 0
0
PARENT
F IN
A ID
Having done our initial data exploration, we will run and interpret a multivariate regression. In R this is
7
very simple . . . simply add the new independent variable to the RHS of the model specification: mymodel <- lm(FINAID ~ PARENT + HSRANK) summary(mymodel)
## ## Call: ## lm(formula = FINAID ~ PARENT + HSRANK) ## ## Residuals: ## Min 1Q Median 3Q Max ## -6353.1 -1905.9 280.9 1942.5 6675.5 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 8926.92907 1739.08253 5.133 5.36e-06 *** ## PARENT -0.35677 0.03169 -11.260 6.06e-15 *** ## HSRANK 87.37815 20.67413 4.226 0.000108 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2771 on 47 degrees of freedom ## Multiple R-squared: 0.7441, Adjusted R-squared: 0.7332 ## F-statistic: 68.33 on 2 and 47 DF, p-value: 1.229e-14
Notice in this model, the added coefficient for HSRANK, as well as the change in Multiple R-squared and Adjusted R-Squared. Multiple R-squared increased, as we expect it to when adding additional independent variables, but more importantly Adjusted R-squared has increased, which accounts for the change in degrees of freedom and is a more relevant measure to compare this model specification to the previous one.
When looking at the model, it is important to intuitively understand the signs of the individual estimated coefficients. Holding parents’ ability to contribute constant, does the financial aid increase of $87.40 for each percentage point increase in GPA rank make sense? And vice versa holding GPA rank constant does the parent financial contribution estimated coefficient seem logical? To pull up the coefficients on their own, let’s use the coef command in R and look at what it tells us. coef(mymodel)
## (Intercept) PARENT HSRANK ## 8926.9290669 -0.3567721 87.3781524
It will be helpful if we create some plots to visualize these relationships.
The plots will be similar to the plots on page 44 of the text which illustrate these points. For the impact of parents’ ability to contribute on financial aid, we can ask R the following: plot(PARENT, FINAID, abline(coef(mymodel)[1], coef(mymodel)[2]))
8
0 20000 40000 60000
0 1
0 0
0 0
2 0
0 0
0
PARENT
F IN
A ID
This command told R to plot PARENT and FINAID variables, using abline, (think of it as literally plotting points “a” and “b” to make a line) to grab the intercept coefficient ([1]), and the coefficient of PARENT, which happened to be the first independent variable we wrote in our model (R thinks of it as the coefficient after the intercept one, hence the [2]). So we can see the apparent negative relationship between parents’ ability to pay and financial aid.
Similarly we can call a graph for financial aid and high school GPA rank: plot(HSRANK, FINAID, abline(coef(mymodel)[1], coef(mymodel)[3]))
20 40 60 80 100
0 1
0 0
0 0
2 0
0 0
0
HSRANK
F IN
A ID
Finally, we will introduce the idea of dummy variables and include them in a regression. A dummy is simply a categorical variable, as in male or female, young or old, New York or San Francisco, and so forth. In the
9
data it is coded as a 1 or 0. Our dummy variable here is MALE. With categorical variables, the only data exploration that makes sense is to see the count of the category; a convenient way to do this in R is the table function: table(MALE)
## MALE ## 0 1 ## 27 23
Here the count of MALE is 23; the rest are presumably female.
And now we are ready to include that as an independent variable: mymodel <- lm(FINAID ~ PARENT + HSRANK + MALE) summary(mymodel)
## ## Call: ## lm(formula = FINAID ~ PARENT + HSRANK + MALE) ## ## Residuals: ## Min 1Q Median 3Q Max ## -5401.7 -2388.1 292.1 2069.4 5233.8 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 9.813e+03 1.743e+03 5.630 1.04e-06 *** ## PARENT -3.427e-01 3.151e-02 -10.879 2.61e-14 *** ## HSRANK 8.326e+01 2.015e+01 4.132 0.00015 *** ## MALE -1.570e+03 7.843e+02 -2.002 0.05120 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2687 on 46 degrees of freedom ## Multiple R-squared: 0.7646, Adjusted R-squared: 0.7493 ## F-statistic: 49.81 on 3 and 46 DF, p-value: 1.721e-14
Note again the change in Adjusted R-squared. Can you explain why that change is relevant? And why did Multiple R-squared increase and would we expect that as we add independent variables?
Good data hygiene requires that we detach our data set when we are finished with it. detach(mydata)
Total, Explained, and Residual Sums of Squares
The square difference of Y around its mean is used to measure how much of the variation of the dependent variable is explained by the estimated regression equation. It is called the total sum of squares (TSS):
TSS = n∑
i=1 (Yi − Ȳ )2
TSS has two components, variation that can be explained by the regression and variation that cannot:
10
∑ i
(Yi − Ȳ )2 = ∑
i
(Ŷi − Ȳ )2 + ∑
i
e2i
TSS = ESS + RSS
Explained Sum of Squares (ESS) measures the amount of the squared deviation of Yi from its means that is explained by the regression. Residual Sum of Squares (RSS) is the part of the TSS that is unexplained by the estimated regression. The smaller RSS is relative to the TSS, the better the estimated regression line fits the data.
To apply this, we return to the manual calculation of estimated coefficients with OLS in the section of Manual Calculation of Estimated Regression Coefficients. To calculate ESS and RSS, we need Ŷi. To calculate this, we have R determine Ŷi using the estimated equation of the model, like the one found on page 34 of the text, Ŷi = β̂0 + β̂1Xi YHAT <- BETAHAT0 + BETAHAT1 * X YHAT
## [1] 12347.662 9242.913 12153.615 12671.073 12541.709 12606.391 12735.755 ## [8] 10924.652 9566.325 11571.475 12218.297 11248.063 12282.980 11636.157 ## [15] 12735.755 11571.475 12218.297 11700.839 12735.755 9631.007 12735.755 ## [22] 8984.184 9566.325 12282.980 11700.839 12735.755 12800.438 12218.297 ## [29] 12671.073 10213.147 12671.073 11830.204 11830.204 12735.755 12541.709 ## [36] 12800.438 7690.539 12153.615 9048.867 12347.662 11959.568 12024.251 ## [43] 11830.204 12800.438 12153.615 10083.783 11830.204 11636.157 12800.438 ## [50] 10795.288
Now that we have our Ŷi data, we can calculate our residuals, ei, where ei = Yi − Ŷi. Just a reminder, these residual errors represent the difference between the estimated Ŷi values, and the actual sample values, Yi. RESID <- Y - YHAT RESID
## [1] 7292.33816 -917.91343 796.38493 -11971.07314 -5541.70862 ## [6] -1281.39088 6429.24460 -3924.65216 -1641.32473 -96.47474 ## [11] 6571.70267 -2358.06345 5307.02041 6128.84300 1364.24460 ## [16] 7393.52526 -7718.29733 -3750.83926 -5735.75540 -2356.00698 ## [21] -4735.75540 -4694.18440 -1391.32473 -932.97959 3624.16074 ## [26] 9412.24460 4619.56235 6771.70267 -1496.07314 3886.85269 ## [31] -5671.07314 -3980.20378 -11830.20378 -5735.75540 3558.29138 ## [36] -4800.43765 809.46077 -4578.61507 4701.13334 -5347.66184 ## [41] -759.56829 2425.74945 3434.79622 7669.56235 -2603.61507 ## [46] 5886.21721 359.79622 163.84300 8839.56235 -1595.28764
Now we can calculate the ESS as: ESS <- sum((YHAT - YBAR)^2) ESS
## [1] 75893113
We can also calculate the RSS with ease: RSS <- sum(RESID^2) RSS
## [1] 1334607437
Finally, the TSS will be:
11
TSS <- sum((Y - YBAR)^2) TSS
## [1] 1410500550
As stated above, the TSS = ESS + RSS, lets check that now. ESS + RSS
## [1] 1410500550
Looks good!
The Overall Fit of the Estimated Model
R2
R2 is the ratio of the explained sum of squares to the total sum of squares:
R2 = ESS
TSS = 1 −
RSS
TSS = 1 −
∑ e2i∑
(Yi − Ȳ )2
R2 measures the percentage of the variation of Y around Ȳ that is explained by the regression. The higher R2 is, the closer the estimated regression equation fits the sample data. R2 lies in the interval 0 ≤ R2 ≤ 1. The value of R2 close to one shows an excellent overall fit.
Let’s manually calculate R2 as ESS/TSS: R2 <- ESS/TSS R2
## [1] 0.0538058
Now we will grab the R2 value from the lm model we executed earlier, by telling the summary function to grab it for us specifically: summary(lm1)$r.squared
## [1] 0.0538058
The R2 value matches with ours, a good sign. However as you may have noticed it is rather low, a sign that the estimated regression line is not fitting the sample data very well.
Correlation Coefficient (r)
The simple correlation coefficient, r ∈ [−1, 1], is a measure of the strength and direction of the linear relationship between two variables. The sign of r indicates the direction of the correlation between the two variables.
If r = +1, the two variables are perfectly positively correlated. If r = −1,the two variables are perfectly negatively correlated. If r = 0, the two variables are totally uncorrelated.
Checking this in R, we use the cor command: cor(Y,X)
## [1] 0.2319608
revealing a relatively weak positive correlation.
12
The Adjusted R2 (R̄2)
R̄2 measures the percentage of the variation of Y around its mean that is explained by the regression equation, adjusted for degrees of freedom.
R̄2 = 1 − ∑ e2i /(N −K − 1)∑
(Yi − Ȳ )2/(N − 1)
N −K − 1 is the degree of freedom which is the excess of the number of observation N over the number of coefficients (including the intercept) estimated (K + 1). The value of R̄2 can be used to compare the fits of equations with the same dependent variable and different numbers of independent variables.
For one final exercise we will calculate the R̄2 for our last model specification. The first step is to extract the residuals from R. We can pull the residuals from mymodel by asking R to specifically return residuals: RESIDS <- mymodel$residuals
Now we can square them and sum them up: RSQ <- sum(RESIDS^2)
Notice that the TSS we calculated previously (we can use it as it only makes use of the sample Yi values and their mean, not any estimated data hence it is still relevant to this model’s R̄2). Knowing that we can calculate R̄2 as RBAR <- 1 - ((RSQ/46)/(TSS/49)) RBAR
## [1] 0.7492617
We can compare that to the calculated R̄2 from the mymodel summary as such: summary(mymodel)$adj.r.squared
## [1] 0.7492617
Good and we see our calculation was correct.
13
- Introduction
- How does OLS work?
- A Single-Independent-Variable Regression Model
- Manual Calculation of Estimated Regression Coefficients
- OLS Estimation with R functions
- Multivariate Regression Model
- Total, Explained, and Residual Sums of Squares
- The Overall Fit of the Estimated Model
- R^2
- Correlation Coefficient (r)
- The Adjusted R^2 (\bar{R^2})