Finance research report based on provided data
Data Analysis – 6 Issues
FINA305/405
Agenda
Introduction: assumptions for OLS
Bias and sources of bias
Irrelevant variables and multicollinearity
Heteroscedasticity
Practice in Excel
Introduction
Statistical significance, economic significance
What else should we worry about in our estimation?
Introduction
Assumptions (Gauss Markov Conditions)
1. Linear combination of parameters
2. Random sample of data from the population
3. Non-zero sample variance in X
4. No perfect collinearity for Xs
5. X and the unexplained part of Y (that is, e) are unrelated
6. Homoscedasticity
7. Normality
Assumptions 1-5 are required for unbiased estimation of the coefficients
Assumptions 1-7 are required for valid hypothesis testing
Bias (biasness, biasedness)
Intuitively, our estimated coefficient is not a reliable predictor for the true underlying parameter.
It is likely to be different from the true effect.
Furthermore, bias may result in misleading conclusions about statistical and economic significance of a variable. (why?)
Sources of bias
1 Omitted Variable Bias (most common)
IF we exclude explanatory variable(s) that should be present in the model
AND
these variable(s) are correlated with an included explanatory variable
THEN
the OLS estimate of the coefficient on the included explanatory variable will be “biased”: it won’t reflect the “pure” or “true” impact of that variable on Y
Example
So, the estimated coefficient is biased downward due to the omitted variable(s)
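The mechanism can be checked with a small simulation. This is a minimal sketch under stated assumptions: the true effects (beta1 = 2, beta2 = 3) and the link between the two regressors (delta = -0.5) are hypothetical values chosen so that the bias is downward, and `ols_slope` is a helper written for this illustration only.

```python
import random

def ols_slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

random.seed(0)  # fixed seed so the sketch is reproducible
n = 20000
beta1, beta2 = 2.0, 3.0   # hypothetical true effects of X1 and X2 on Y
delta = -0.5              # hypothetical link: X2 = delta * X1 + noise

x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [delta * a + random.gauss(0, 1) for a in x1]
y = [1.0 + beta1 * a + beta2 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Short regression: X2 is omitted, so part of its effect is loaded onto X1.
slope_short = ols_slope(x1, y)
# What the short-regression slope converges to: beta1 + beta2 * delta.
expected_short = beta1 + beta2 * delta

print(f"true beta1             = {beta1}")
print(f"short-regression slope = {slope_short:.3f}")
print(f"beta1 + beta2 * delta  = {expected_short:.3f}")
```

Because beta2 is positive and X2 moves against X1, the short regression understates the effect of X1. Flipping the sign of delta would produce an upward bias instead.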
Sources of bias
2 Sampling error
Measurement error with variables
Sample selection bias
Outliers (residual analysis, scatter plot)
Examples
Sources of bias
3 Reverse causality
We are interested in the pure effect of Education on Income.
But the real data may be driven by the fact that high-income individuals pursue more education. Thus, the data shows a more positive relationship (not causal) between education and income than the true effect of education would produce.
We get an upward-biased coefficient.
Question
Is it always good to include as many variables as we can, in the hope of alleviating omitted variable bias?
No!
Irrelevant variables
True regression specification
Y=α + β1X1 + ε (Equation 1)
Regression estimated
Y=α + β1X1 + β2X2 + ω (Equation 2)
Hence ω= ε - β2X2
Remember we assume that X2 is an irrelevant variable, which means β2 is zero. In this case, ω = ε. Hence the estimate of β1 in Equation 2 is unbiased when β2 = 0.
True value is zero
Irrelevant variables
However, it will increase the variance of the estimated coefficients, which will tend to decrease the magnitude of their t-stat. Why?
VAR(β̂1) = σ² / [(1 − r²) ∑(x1 − x̄1)²] for Equation 2
VAR(β̂1) = σ² / ∑(x1 − x̄1)² for Equation 1
Here r is the correlation coefficient between X1 and X2. It always lies in [-1, 1] and is practically never exactly zero, so (1 − r²) is less than 1. Hence VAR(β̂1) for Equation 2 is greater than VAR(β̂1) for Equation 1 whenever r ≠ 0.
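Holding σ² and ∑(x1 − x̄1)² fixed, adding X2 multiplies the variance of the estimated β1 by 1 / (1 − r²). The sketch below tabulates that factor. The values of r are illustrative only, and `variance_inflation` is a name introduced here for the example.

```python
def variance_inflation(r):
    """Factor by which VAR(b1) grows when X2 (correlation r with X1) is added."""
    return 1.0 / (1.0 - r ** 2)

for r in (0.0, 0.3, 0.8, 0.98):
    factor = variance_inflation(r)
    # The standard error grows by the square root of the variance factor,
    # so the t-stat shrinks by the same proportion.
    print(f"r = {r:4.2f}: variance x {factor:6.2f}, std. error x {factor ** 0.5:5.2f}")
```

The factor stays close to 1 for weak correlation and grows very quickly as |r| approaches 1, which is why an irrelevant but highly correlated variable is costly.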
So, …
Multicollinearity
Explanatory variables (Xs) may be HIGHLY correlated, so the model has trouble differentiating between their effects on Y.
High R2, large F-stat, but insignificant t stats for coefficient estimates (or wrong sign).
Model overall fits well, but cannot pin down marginal effects of individual variables
Diagnosing: Look at your correlation matrix for high levels of correlation (rule of thumb: >0.8 or <-0.8) between your explanatory variables.
Example
A cross-sectional model of the demand for petrol by state
Expected signs: UHKM (+), TAX (-), REG (+), POP (+)
Petroli = f (UHKMi, TAXi, REGi, POPi) + εi Equation 1
where: Petroli=petroleum consumption in the ith state
UHKMi=urban highway kilometres within the ith state
TAXi=the petroleum tax rate in the ith state
REGi=motor vehicle registration in the ith state
POPi=population of the ith state
Example
What is wrong with Equation 1?
Both the motor vehicle registration and population variables have insignificant coefficients with unexpected signs, yet it is hard to believe that these variables are irrelevant.
The simple correlation coefficient is 0.96 for pop and uhkm, 0.98 for reg and uhkm, 0.98 for reg and pop. Hence it is fair to say we have serious multicollinearity. All three variables (uhkm, pop and reg) measure the size of the state, so two of them are redundant.
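The rule of thumb (|r| > 0.8) can be applied mechanically to the correlations reported for this example. A minimal sketch: the numbers are copied from the correlation matrix in this deck, while the dictionary layout and the names `correlations`, `THRESHOLD` and `flagged` are illustrative choices.

```python
# Pairwise correlations as reported in the petrol example's correlation matrix.
correlations = {
    ("pop", "reg"): 0.9806,
    ("pop", "tax"): -0.2363,
    ("reg", "tax"): -0.2422,
    ("pop", "uhkm"): 0.9645,
    ("reg", "uhkm"): 0.9786,
    ("tax", "uhkm"): -0.2809,
}

THRESHOLD = 0.8  # rule of thumb from the slides: r > 0.8 or r < -0.8

# Keep only the pairs whose absolute correlation exceeds the threshold.
flagged = {pair: r for pair, r in correlations.items() if abs(r) > THRESHOLD}

for (a, b), r in sorted(flagged.items()):
    print(f"{a} and {b}: r = {r:+.4f} -> possible multicollinearity")
```

Only the three size-related pairs are flagged; tax is not highly correlated with any of the other regressors.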
Remedies for multicollinearity
Do nothing
If looking for overall prediction and not individual effects
If we are interested in the tax effect only (tax is not highly correlated with the other variables)
If theory suggests variables should be included
Combine or transform variables
e.g. we can simply have a new variable/index size=uhkm+pop+reg, and include it in the regression
Drop the redundant variables
How do you select explanatory variables?
Ideally, turn to theory, intuition, logic, and/or common sense for suggestions on what is appropriate to include.
Include (insofar as possible) all explanatory variables that you think might explain your dependent variable. This will reduce OVB.
Correlation matrix.
Plot of Residuals (again)
THAT IS A FUNNEL!!!
Our model may have heteroskedasticity.
Heteroskedasticity
Heteroskedasticity literally means “different variance”
The term applies when the errors in a regression model appear to be drawn from distributions which have different variances as we move along the X-axis (or, in multiple regression, the “predicted-Y” axis).
VAR (εi) =σ2 a constant variance—homoskedasticity
VAR (εi) =σi2 not a constant variance—heteroskedasticity
(Note the subscript i attached to σ2: instead of being constant over all observations, a heteroskedastic error term’s variance can change from one observation to the next.)
It often occurs in data sets in which there is a wide disparity between the largest and smallest observed values.
We’d expect that the error term distribution for very large observations (i.e. people with high wages) might have a large variance, but the error term distribution for small observations (people with low wages) might have a small variance.
It can also be caused by an incorrect specification, such as an omitted variable or an improper functional form (lecture 11).
Why does hypothesis testing fail?
The usual OLS standard errors are no longer valid, so the F- and t-statistics will not follow the F and t distributions.
Solutions?
The simplest solution to heteroskedasticity, which sometimes works, is to adjust the functional form.
Weighted Least Squares if the functional form for the variance of error term is known
Heteroskedasticity-corrected standard errors (straightforward in STATA)
Redefining the variables
e.g. GDP per capita instead of GDP
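To illustrate what heteroskedasticity-corrected standard errors do, here is a minimal sketch for a regression with one explanatory variable. It is written in plain Python rather than STATA or Excel, the simulated data (true slope 2, error standard deviation growing with x) are hypothetical, and the correction shown is White's basic (HC0) form, which is one of several variants.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is reproducible
n = 5000
x = [random.uniform(0, 10) for _ in range(n)]
# Hypothetical funnel: the error's standard deviation grows with x.
y = [1.0 + 2.0 * xi + random.gauss(0, 0.1 * xi ** 2) for xi in x]

# OLS estimates for y = intercept + slope * x.
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
intercept = mean_y - slope * mean_x
resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

# Conventional standard error: assumes one common error variance.
s2 = sum(u ** 2 for u in resid) / (n - 2)
se_conventional = math.sqrt(s2 / sxx)

# White (HC0) standard error: each observation keeps its own squared residual.
meat = sum(((xi - mean_x) ** 2) * (u ** 2) for xi, u in zip(x, resid))
se_robust = math.sqrt(meat / sxx ** 2)

print(f"slope           = {slope:.3f}")
print(f"conventional SE = {se_conventional:.4f}")
print(f"robust (HC0) SE = {se_robust:.4f}")
```

The slope estimate itself is unchanged by the correction. Only the standard error, and therefore the t-stat, differs. In this setup the conventional standard error understates the uncertainty because the noisiest observations are also those far from the mean of x.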
Practice in Excel
http://www.rbnz.govt.nz/statistics/key-graphs/key-graph-house-price-values
Regression (OLS)
Issues
http://www.real-statistics.com/free-download/
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  4,    45) =  137.31
       Model |  20808205.7     4  5202051.43           Prob > F      =  0.0000
    Residual |  1704794.77    45  37884.3282           R-squared     =  0.9243
-------------+------------------------------           Adj R-squared =  0.9175
       Total |  22513000.5    49   459448.99           Root MSE      =  194.64

------------------------------------------------------------------------------
      petrol |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        uhkm |   61.05368   10.44755     5.84   0.000     40.01124    82.09613
         tax |   -36.3691   13.29999    -2.73   0.009    -63.15664   -9.581548
         reg |  -.0524422   .0579811    -0.90   0.371    -.1692221    .0643378
         pop |  -.0066304   .0294278    -0.23   0.823     -.065901    .0526402
       _cons |   387.6308   146.2005     2.65   0.011     93.16797    682.0936
------------------------------------------------------------------------------
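The summary statistics in this output follow directly from its SS, df, coefficient and standard-error columns. The sketch below redoes that arithmetic as a reading aid. All numbers are copied from the output above; the variable names are illustrative.

```python
# Figures copied from the regression of petrol on uhkm, tax, reg and pop.
ss_model, df_model = 20808205.7, 4
ss_resid, df_resid = 1704794.77, 45
ss_total, df_total = 22513000.5, 49

r_squared = ss_model / ss_total
adj_r_squared = 1 - (ss_resid / df_resid) / (ss_total / df_total)
f_stat = (ss_model / df_model) / (ss_resid / df_resid)
root_mse = (ss_resid / df_resid) ** 0.5

coefs = {  # name: (coefficient, standard error)
    "uhkm": (61.05368, 10.44755),
    "tax": (-36.3691, 13.29999),
    "reg": (-0.0524422, 0.0579811),
    "pop": (-0.0066304, 0.0294278),
}
# Each t-stat is the coefficient divided by its standard error.
t_stats = {name: b / se for name, (b, se) in coefs.items()}

print(f"R-squared = {r_squared:.4f}, Adj R-squared = {adj_r_squared:.4f}")
print(f"F({df_model}, {df_resid}) = {f_stat:.2f}, Root MSE = {root_mse:.2f}")
for name, t in t_stats.items():
    print(f"t({name}) = {t:.2f}")
```

The pattern described for multicollinearity is visible here: the overall fit is high and the F-stat is large, yet the t-stats on reg and pop are small.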
             |      pop      reg      tax     uhkm
-------------+------------------------------------
         pop |   1.0000
         reg |   0.9806   1.0000
         tax |  -0.2363  -0.2422   1.0000
        uhkm |   0.9645   0.9786  -0.2809   1.0000
      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  2,    47) =  273.24
       Model |  20730118.2     2  10365059.1           Prob > F      =  0.0000
    Residual |   1782882.3    47   37933.666           R-squared     =  0.9208
-------------+------------------------------           Adj R-squared =  0.9174
       Total |  22513000.5    49   459448.99           Root MSE      =  194.77

------------------------------------------------------------------------------
      petrol |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        uhkm |   46.38639   2.167516    21.40   0.000     42.02591    50.74687
         tax |  -39.58929   13.11767    -3.02   0.004    -65.97863   -13.19994
       _cons |   410.0165    145.327     2.82   0.007     117.6562    702.3767
------------------------------------------------------------------------------
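Comparing this output with the four-variable regression shows what dropping the redundant variables buys. The sketch below uses only numbers copied from the two outputs. The joint F-test for dropping reg and pop is a conventional check added here for illustration; it is not part of the slides.

```python
# Standard error of the uhkm coefficient in each regression output.
se_uhkm_full = 10.44755       # petrol on uhkm, tax, reg, pop
se_uhkm_reduced = 2.167516    # petrol on uhkm, tax
se_ratio = se_uhkm_full / se_uhkm_reduced

# Residual sums of squares and degrees of freedom from the two outputs.
rss_full, df_full = 1704794.77, 45
rss_reduced = 1782882.3
dropped = 2  # reg and pop

# F-statistic for H0: the coefficients on reg and pop are both zero.
partial_f = ((rss_reduced - rss_full) / dropped) / (rss_full / df_full)

print(f"std. error of uhkm shrinks by a factor of {se_ratio:.2f}")
print(f"F({dropped}, {df_full}) for dropping reg and pop = {partial_f:.2f}")
```

The F-statistic is close to 1, well below the usual 5% critical value for F(2, 45) of roughly 3.2, so dropping reg and pop costs almost no explanatory power while the uhkm estimate becomes far more precise.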
RESIDUAL OUTPUT
[Figure: scatter plot of Standard Residuals against Predicted Wage]