Econometrics homework
Simple Linear Regression
Class 5 for Econometrics 1
Vincent Geloso
Recap
In this unit, we will expand on what we saw with correlations and try to understand regressions (bivariate – i.e. only two variables).
I know some of you are wondering what we will do with theme 3 (normal distribution, central limit theorem, t-stat, z-score). Hang tight! What we see in this unit will be combined with t-stats and elements from previous classes in the units on « hypothesis testing ».
After we are done with hypothesis testing, we will move to multiple regressions.
Concept of regression
With correlation, we simply described the relation between variables.
Regressions are different, as you must define which variable is affecting the other.
A given variable’s movements (Y) are explained by movements in another (X).
If so, we can « predict » how further changing the other variable (X) will affect the given variable (Y).
Thus, if I say Y is influenced by X, I am saying Y = f(X).
I am saying that Y is a dependent variable and X is an independent variable. The dependent variable is the explained variable and the independent variable is the explanatory variable.
This requires an a priori decision about which variable affects the other.
Textbook example: crime rates in Britain during the 19th century and wheat prices (Question: do you think crime rates affect wheat prices or the reverse?)
Textbook counter-example (to invite caution): did the government dole cause high unemployment during the British Great Depression, or did high unemployment cause the government dole? (p. 94)
Linear Regression Model
The relationship between X and Y is described by a linear function
Changes in Y are assumed to be influenced by changes in X
Linear regression population equation model
Important assumptions that you must know but which will be discussed more in econometrics 2: a) the relationship is linear; b) the errors are independent of the explanatory variable; c) the errors have a constant variance and a mean of zero (they cancel out); d) the errors are not correlated with each other
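To make the population model concrete, here is a minimal simulation sketch in Python. The parameter values (intercept 2, slope 0.5, error standard deviation 1), the range of X and the function name are all hypothetical, chosen only for illustration. The errors are drawn independently of X, with a mean of zero and a constant variance, in line with assumptions (b) to (d).

```python
import random


def simulate_population_model(n, beta0=2.0, beta1=0.5, sigma=1.0, seed=0):
    """Draw n observations from Y_i = beta0 + beta1 * X_i + eps_i.

    All parameter values are hypothetical illustrations, not estimates
    from any dataset in these slides.
    """
    rng = random.Random(seed)
    xs, ys, errors = [], [], []
    for _ in range(n):
        x = rng.uniform(0.0, 10.0)   # explanatory variable
        eps = rng.gauss(0.0, sigma)  # error: mean zero, constant variance, drawn independently of x
        xs.append(x)
        ys.append(beta0 + beta1 * x + eps)
        errors.append(eps)
    return xs, ys, errors


xs, ys, errors = simulate_population_model(10_000)
mean_error = sum(errors) / len(errors)
print(f"average error across draws: {mean_error:.4f}")  # close to zero in a large sample
```

In real data we never observe the errors or the true intercept and slope; we only see X and Y, which is why the coefficients have to be estimated.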
Simple Linear Regression Model
[Figure: the population regression line plotted on X and Y axes, with intercept β0 and slope β1. At a given xi, the observed value of Y differs from the predicted value of Y on the line by the random error εi.]
Remember the notation!
This is what we discuss today
Simple Linear Regression Equation
The simple linear regression equation provides an estimate of the population regression line
b0: estimate of the regression intercept
b1: estimate of the regression slope
$\hat{y}_i$: estimated (or predicted) y value for observation i
xi: value of x for observation i
The individual random error terms ei have a mean of zero
Notice something here: if this is an estimate of the population line, do you think we can later use the t-stat to see how well this estimate (from a sample) speaks to the population?
Least Squares Coefficient Estimators
b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals (errors), SSE:
Differential calculus is used to obtain the coefficient estimators b0 and b1 that minimize SSE (but we won't do that here; we will do it in econometrics II)
This image illustrates why we take the « squares » of the distances between the observed and the predicted values: their sum is what we want to minimize!
The slope coefficient estimator is
And the constant or y-intercept is
The regression line always goes through the point of means ($\bar{x}$, $\bar{y}$)
Simple Linear Regression Example
A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
A random sample of 10 houses is selected
Dependent variable (Y) = house price in $1000s
Independent variable (X) = square feet
Sample Data for House Price Model
| House Price in $1000s (Y) | Square Feet (X) |
| 245 | 1400 |
| 312 | 1600 |
| 279 | 1700 |
| 308 | 1875 |
| 199 | 1100 |
| 219 | 1550 |
| 405 | 2350 |
| 324 | 2450 |
| 319 | 1425 |
| 255 | 1700 |
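The least squares formulas can be applied by hand to these ten observations. The sketch below is one way to do it in Python (the function and variable names are my own): the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept then forces the line through the point of means.

```python
# House data from the table above: price in $1000s (y) and square feet (x).
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]


def least_squares(x, y):
    """Return (b0, b1), the intercept and slope that minimize the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of cross-deviations; denominator: sum of squared x-deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar  # the line passes through (x_bar, y_bar)
    return b0, b1


b0, b1 = least_squares(sqft, prices)
print(f"b0 = {b0:.5f}, b1 = {b1:.5f}")
```

The two numbers can be compared with the Intercept and Square Feet coefficients in the Excel output reported later in these slides.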
Graphical Presentation
House price model: scatter plot
Regression Using Excel
Excel will be used to generate the coefficients and measures of goodness of fit for regression
Data / Data Analysis / Regression
Provide the desired input in the Regression dialog box (the Y range and the X range).
Excel Output
We will deal with these later (theme 6 on hypothesis testing)
Regression Statistics
| Statistic | Value |
| Multiple R | 0.76211 |
| R Square | 0.58082 |
| Adjusted R Square | 0.52842 |
| Standard Error | 41.33032 |
| Observations | 10 |
ANOVA
| Source | df | SS | MS | F | Significance F |
| Regression | 1 | 18934.9348 | 18934.9348 | 11.0848 | 0.01039 |
| Residual | 8 | 13665.5652 | 1708.1957 | | |
| Total | 9 | 32600.5000 | | | |
Coefficients
| Term | Coefficient | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
| Intercept | 98.24833 | 58.03348 | 1.69296 | 0.12892 | -35.57720 | 232.07386 |
| Square Feet | 0.10977 | 0.03297 | 3.32938 | 0.01039 | 0.03374 | 0.18580 |
The regression equation is:
Graphical Presentation
House price model: scatter plot and regression line
Slope = 0.10977
Intercept = 98.248
Interpretation of the Intercept, b0
b0 is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of observed X values)
Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet
Interpretation of the Slope Coefficient, b1
b1 measures the estimated change in the average value of Y as a result of a one-unit change in X
Here, b1 = 0.10977 tells us that the price of a house increases, on average, by 0.10977 × $1000 = $109.77 for each additional square foot of size
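The interpretation of both coefficients can be seen by plugging a house size into the fitted line, price = 98.24833 + 0.10977 × (square feet), with the coefficients taken from the Excel output. The sketch below is only an illustration: the function name and the 2000 square foot house are hypothetical, and the size was picked inside the observed range (1100 to 2450 square feet), since the line should not be trusted far outside it.

```python
B0 = 98.24833  # intercept from the Excel output, in $1000s
B1 = 0.10977   # slope from the Excel output, in $1000s per square foot


def predict_price(square_feet):
    """Predicted house price in $1000s from the fitted line."""
    return B0 + B1 * square_feet


# Hypothetical house size, chosen inside the observed range of the sample.
size = 2000
print(f"predicted price: ${predict_price(size) * 1000:,.2f}")

# Moving from 2000 to 2001 square feet changes the prediction by exactly the slope.
step = predict_price(size + 1) - predict_price(size)
print(f"change per extra square foot: ${step * 1000:,.2f}")
```

The predicted price comes out at roughly $317,788, and the change per extra square foot is the $109.77 discussed above.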
Explanatory Power of a Linear Regression Equation
Total variation is made up of two parts:
Total Sum of Squares (SST) = Regression Sum of Squares (SSR) + Error (residual) Sum of Squares (SSE)
where:
$\bar{y}$ = Average value of the dependent variable
yi = Observed values of the dependent variable
$\hat{y}_i$ = Predicted value of y for the given xi value
Analysis of Variance
SST = total sum of squares
Measures the variation of the yi values around their mean, $\bar{y}$
SSR = regression sum of squares
Explained variation attributable to the linear relationship between x and y
SSE = error sum of squares
Variation attributable to factors other than the linear relationship between x and y
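The decomposition can be checked numerically on the house price data. The following is a minimal Python sketch (variable names are my own) that fits the line, computes the three sums of squares from their definitions, and confirms that the explained and unexplained parts add up to the total.

```python
# House data: price in $1000s (y) and square feet (x).
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(prices) / n

# Least squares slope and intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, prices))
      / sum((x - x_bar) ** 2 for x in sqft))
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in sqft]

# Total, explained and unexplained variation.
sst = sum((y - y_bar) ** 2 for y in prices)
ssr = sum((y_hat - y_bar) ** 2 for y_hat in predicted)
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(prices, predicted))

print(f"SST = {sst:.4f}")
print(f"SSR = {ssr:.4f}")
print(f"SSE = {sse:.4f}")
print(f"SSR + SSE = {ssr + sse:.4f}")
```

The three values should agree with the SS column of the ANOVA block in the Excel output (Total, Regression and Residual rows).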
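[Figure: a single observation (xi, yi) plotted against the regression line and the horizontal line at $\bar{y}$. The distance from yi to $\bar{y}$ is the total variation, $SST = \sum (y_i - \bar{y})^2$; the distance from $\hat{y}_i$ to $\bar{y}$ is the explained variation, $SSR = \sum (\hat{y}_i - \bar{y})^2$; the distance from yi to $\hat{y}_i$ is the unexplained variation, $SSE = \sum (y_i - \hat{y}_i)^2$.]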
Coefficient of Determination, R²
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
The coefficient of determination is also called R-squared and is denoted as R²
note: 0 ≤ R² ≤ 1
Examples of Approximate r² Values
r² = 1
Perfect linear relationship between X and Y:
100% of the variation in Y is explained by variation in X
[Figure: two panels of Y against X, each labelled r² = 1]
0 < r² < 1
Weaker linear relationships between X and Y:
Some but not all of the variation in Y is explained by variation in X
[Figure: two panels of Y against X]
r² = 0
No linear relationship between X and Y:
The value of Y does not depend on X. (None of the variation in Y is explained by variation in X)
[Figure: one panel of Y against X, labelled r² = 0]
Excel Output
R Square = 0.58082: 58.08% of the variation in house prices is explained by variation in square feet
Correlation and R²
The coefficient of determination, R², for a simple regression is equal to the simple correlation squared
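This equality can be verified on the house price data by building everything from the basic ingredients: means, standard deviations, covariance and correlation. The sketch below is a Python illustration (variable names are my own) that computes the correlation r, recovers the slope as r times the ratio of standard deviations, and compares r² with the R² obtained from the sums of squares.

```python
import math

# House data: price in $1000s (y) and square feet (x).
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(prices) / n

# Sample covariance and standard deviations (divide by n - 1).
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, prices)) / (n - 1)
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in sqft) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in prices) / (n - 1))

r = cov_xy / (s_x * s_y)  # simple correlation
b1 = r * s_y / s_x        # slope written as r times the ratio of standard deviations
b0 = y_bar - b1 * x_bar

# R-squared from the sums of squares: 1 - SSE/SST, which equals SSR/SST.
sst = sum((y - y_bar) ** 2 for y in prices)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(sqft, prices))
r_squared = 1 - sse / sst

print(f"r = {r:.5f}")
print(f"r squared = {r ** 2:.5f}")
print(f"R squared = {r_squared:.5f}")
```

The correlation should match the Multiple R line of the Excel output, and both squared values should match its R Square line.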
Correlation and R²
Take the “Relief” dataset in OWL and do the regression in two ways.
1st: do all the steps to calculate the estimator (b) of the relation between unemployment (y) and relief (x). That means getting your variance, your covariance, your means, etc. Also calculate the R-squared using all the tools one by one: the standard deviations, the correlation coefficient, etc.
2nd: once you are done with the first step, run the regression in Excel (if you want to use Stata, copy-paste the data and have it run the following line: reg unemp relief). Compare the results you got in the first step to those the software churns out for you.
Note: I know this sounds tedious, but it is incredibly useful for you to do so because you will understand the ingredients (and tools) of a regression.
Regression of a time series (fitting a time-trend)
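Population regression model:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
Simple linear regression equation (estimate of the population line):
$$\hat{y}_i = b_0 + b_1 x_i$$
Residual for observation i:
$$e_i = (y_i - \hat{y}_i) = y_i - (b_0 + b_1 x_i)$$
Least squares criterion:
$$\min SSE = \min \sum_{i=1}^{n} e_i^2 = \min \sum (y_i - \hat{y}_i)^2 = \min \sum [y_i - (b_0 + b_1 x_i)]^2$$
Slope coefficient estimator:
$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\mathrm{Cov}(x, y)}{s_x^2} = r \frac{s_y}{s_x}$$
Constant or y-intercept:
$$b_0 = \bar{y} - b_1 \bar{x}$$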
[Figure: Seigneurial rents (private taxes) per arpent and land quality in Lower Canada, 1831. Scatter plot with fitted values; x-axis: length of growing season (days), 170 to 210; y-axis: rents per arpent (livres), 0 to 0.6.]
[Figure: House price model. x-axis: Square Feet (0 to 3000); y-axis: House Price ($1000s) (0 to 450).]
Residual Output
| Observation | Predicted House Price ($1000s) | Residual ($1000s) |
| 1 | 251.9231625835 | -6.9231625835 |
| 2 | 273.8767101495 | 38.1232898505 |
| 3 | 284.8534839325 | -5.8534839325 |
| 4 | 304.0628380528 | 3.9371619472 |
| 5 | 218.9928412345 | -19.9928412345 |
| 6 | 268.388323258 | -49.388323258 |
| 7 | 356.2025135221 | 48.7974864779 |
| 8 | 367.1792873051 | -43.1792873051 |
| 9 | 254.6673560293 | 64.3326439707 |
| 10 | 284.8534839325 | -29.8534839325 |
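The predicted values and residuals in this table can be reproduced from the data. The short Python sketch below (variable names are my own) fits the line, computes each predicted price and each residual as observed minus predicted, and checks that the residuals sum to zero, which is why their mean is zero.

```python
# House data: price in $1000s (y) and square feet (x), in the order of the table.
prices = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(sqft)
x_bar = sum(sqft) / n
y_bar = sum(prices) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, prices))
      / sum((x - x_bar) ** 2 for x in sqft))
b0 = y_bar - b1 * x_bar

predicted = [b0 + b1 * x for x in sqft]
residuals = [y - y_hat for y, y_hat in zip(prices, predicted)]  # observed minus predicted

for i, (y_hat, e) in enumerate(zip(predicted, residuals), start=1):
    print(f"{i:2d}  {y_hat:10.4f}  {e:10.4f}")

# With an intercept in the model, least squares residuals sum to zero
# (up to floating-point rounding).
total = sum(residuals)
```

A positive residual means the house sold for more than the line predicts for its size; a negative one means it sold for less.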
Estimated regression equation for the house price model:
$$\text{house price} = 98.24833 + 0.10977 \, (\text{square feet})$$
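Decomposition of the total variation:
$$SST = SSR + SSE$$
$$SST = \sum (y_i - \bar{y})^2$$
$$SSE = \sum (y_i - \hat{y}_i)^2$$
$$SSR = \sum (\hat{y}_i - \bar{y})^2$$
Coefficient of determination for the house price model:
$$R^2 = \frac{SSR}{SST} = \frac{18934.9348}{32600.5000} = 0.58082$$
For a simple regression:
$$R^2 = r^2$$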
[Figure: Fitting a time trend in real wages in Canada, 1688 to 1790. Real wage and fitted values plotted against year; y-axis from 0.5 to 2.5.]
[Figure: Real wages in Canada, 1850 to 1900. Series and fitted values plotted against year; y-axis: dollars per day, 0.4 to 1.]
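Fitting a time trend is the same bivariate regression with time as the explanatory variable: the slope is then the average change in the series per period. The sketch below shows the idea in Python. The wage series in it is entirely hypothetical (it is NOT the data behind the real wage figures, which are not reproduced in these slides), and the function name is my own.

```python
def fit_time_trend(years, values):
    """Least squares fit of values on years: value = b0 + b1 * year."""
    n = len(years)
    t_bar = sum(years) / n
    v_bar = sum(values) / n
    b1 = (sum((t - t_bar) * (v - v_bar) for t, v in zip(years, values))
          / sum((t - t_bar) ** 2 for t in years))
    b0 = v_bar - b1 * t_bar
    return b0, b1


# Hypothetical series for illustration only (dollars per day).
years = [1850, 1860, 1870, 1880, 1890, 1900]
wages = [0.50, 0.55, 0.62, 0.66, 0.73, 0.80]

b0, b1 = fit_time_trend(years, wages)
fitted = [b0 + b1 * t for t in years]  # the "fitted values" line drawn on a trend chart
print(f"estimated trend: {b1:.4f} dollars per day, per year")
```

Plotting `fitted` against `years` on top of the raw series gives the kind of chart described above, with the fitted values forming a straight line through the data.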