Finance research report based on provided data

Lecture_10.pptx

Data Analysis – 6 Issues

FINA305/405

Agenda

Introduction: assumptions for OLS

Bias and sources of bias

Irrelevant variables and multicollinearity

Heteroscedasticity

Practice in Excel

Introduction

Statistical significance, economic significance

What else do we worry about our estimation?

Introduction

Assumptions (Gauss-Markov conditions)

1. Linear combination of parameters

2. Random sample of data from the population

3. Non-zero sample variance in X

4. No perfect collinearity for Xs

5. X and the unexplained part of Y (that is, e) are unrelated

6. Homoscedasticity

7. Normality

Assumptions 1-5 are required for unbiased estimation of the coefficients (β)

Assumptions 1-7 are required for valid hypothesis testing

Bias (biasness, biasedness)

Intuitively, our estimated coefficient is not a reliable predictor for the true underlying parameter.

It is likely to be different from the true effect.

Furthermore, bias may result in misleading conclusions about statistical and economic significance of a variable. (why?)

Sources of bias

1 Omitted Variables Bias (most popular)

We exclude explanatory variable(s) that should be present in the model

AND

These variable(s) are correlated with an included explanatory variable

THEN

The OLS estimate of the coefficient on the included explanatory variable will be “biased” – that is, it won’t reflect the “pure” or “true” impact of that variable on Y

Example

So, the estimated coefficient is downward biased due to omitted variable(s)
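The omitted-variable mechanism described above can be illustrated with a small simulation. This is only a sketch under assumed numbers: the true coefficients (2.0 on x1, -1.5 on x2), the 0.8 link between x1 and x2, and the function name simulate_ovb are all made up for illustration and do not come from the lecture's example. With these signs the bias is downward; other signs would give an upward bias.

```python
import numpy as np

def simulate_ovb(n=5000, seed=0):
    """Compare the OLS slope on x1 with and without the correlated variable x2."""
    rng = np.random.default_rng(seed)
    x2 = rng.normal(size=n)                # variable that will be omitted
    x1 = 0.8 * x2 + rng.normal(size=n)     # included variable, correlated with x2
    # Assumed true model: beta1 = 2.0, beta2 = -1.5
    y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

    ones = np.ones(n)
    full = np.column_stack([ones, x1, x2])   # correctly specified model
    short = np.column_stack([ones, x1])      # x2 omitted
    b_full = np.linalg.lstsq(full, y, rcond=None)[0]
    b_short = np.linalg.lstsq(short, y, rcond=None)[0]
    return b_full[1], b_short[1]

beta1_full, beta1_short = simulate_ovb()
print("slope on x1, x2 included:", round(beta1_full, 3))
print("slope on x1, x2 omitted: ", round(beta1_short, 3))
```

In this setup the short regression's slope settles well below the true value of 2.0, because x1 picks up part of the negative effect of the omitted x2.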

Sources of bias

2 Sampling error

Measurement error with variables

Sample selection bias

Outliers (residual analysis, scatter plot)

Examples

Sources of bias

3 Reverse causality

We are interested in the pure effect of Education on Income.

But the real data may be driven by the fact that high-income individuals pursue more education. Thus, the data show a more positive relationship (not causal) between education and income than the pure effect alone would produce.

We get an upward-biased coefficient.

Question

Is it always good to include as many variables as we can, in the hope of alleviating omitted variable bias?

No!

Irrelevant variables

True regression specification

Y=α + β1X1 + ε (Equation 1)

Regression estimated

Y=α + β1X1 + β2X2 + ω (Equation 2)

Hence ω= ε - β2X2

Remember we assume that X2 is an irrelevant variable, which means β2 is zero. In this case, ω = ε. Hence the estimate of β1 in Equation 2 is unbiased when β2 = 0.

True value is zero

Irrelevant variables

However, including the irrelevant variable X2 will increase the variance of the estimated coefficients, which will tend to decrease the magnitude of their t-stats. Why?

VAR(β̂1) = σ² / [∑(x1 − x̄1)² (1 − r²)] for Equation 2

VAR(β̂1) = σ² / ∑(x1 − x̄1)² for Equation 1

Since r is the correlation coefficient between X1 and X2, it always lies in [-1, 1] and in practice is almost never exactly zero, so 1 − r² is below 1. Hence VAR(β̂1) for Equation 2 is greater than VAR(β̂1) for Equation 1.

So, …
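To put numbers on the variance comparison above, here is a minimal sketch. It assumes the two-regressor case, in which the two variances differ only by the factor 1/(1 − r²). The function name variance_inflation is illustrative and not part of the lecture.

```python
def variance_inflation(r):
    """Factor by which VAR(beta1_hat) grows when a second regressor
    with correlation r to X1 is added (two-regressor case)."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 1.0 / (1.0 - r ** 2)

# Illustrative correlations: none, moderate, and very high
for r in (0.0, 0.5, 0.98):
    print(r, variance_inflation(r))
```

With r = 0 nothing changes. The factor grows slowly at first and then very quickly as r approaches 1 in absolute value, which is why highly correlated regressors are the main concern.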

Multicollinearity

Explanatory variables (Xs) may be HIGHLY correlated, so the model has trouble differentiating between their effects on Y.

High R2, large F-stat, but insignificant t stats for coefficient estimates (or wrong sign).

Model overall fits well, but cannot pin down marginal effects of individual variables

Diagnosing: Look at your correlation matrix for high levels of correlation (rule of thumb: >0.8 or <-0.8) between your explanatory variables.

Example

A cross-sectional model of the demand for petrol by state

Petrol_i = f(UHKM_i, TAX_i, REG_i, POP_i) + ε_i (Equation 1)

Expected signs: UHKM (+), TAX (−), REG (+), POP (+)

where: Petrol_i = petroleum consumption in the ith state

UHKM_i = urban highway kilometres within the ith state

TAX_i = the petroleum tax rate in the ith state

REG_i = motor vehicle registrations in the ith state

POP_i = population of the ith state

Example

What is wrong with Equation 1?

Both the motor vehicle registration and population variables have insignificant coefficients with unexpected signs, but it is hard to believe that these variables are irrelevant.

The simple correlation coefficient is 0.96 for pop and uhkm, 0.98 for reg and uhkm, 0.98 for reg and pop. Hence it is fair to say we have serious multicollinearity. All three variables (uhkm, pop and reg) measure the size of the state, so two of them are redundant.
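The rule of thumb for reading a correlation matrix can be applied mechanically. The sketch below uses the pairwise correlations reported for the petrol example; the dictionary layout and the function name flag_collinear_pairs are assumptions made for illustration.

```python
# Pairwise correlations between the explanatory variables in the petrol example
corr = {
    ("pop", "reg"): 0.9806,
    ("pop", "tax"): -0.2363,
    ("pop", "uhkm"): 0.9645,
    ("reg", "tax"): -0.2422,
    ("reg", "uhkm"): 0.9786,
    ("tax", "uhkm"): -0.2809,
}

def flag_collinear_pairs(corr, threshold=0.8):
    """Return the variable pairs whose absolute correlation exceeds the threshold."""
    return sorted(pair for pair, r in corr.items() if abs(r) > threshold)

for pair in flag_collinear_pairs(corr):
    print(pair, corr[pair])
```

Only the three pairs among pop, reg and uhkm are flagged. tax is not strongly correlated with any of the size variables.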

Remedies for multicollinearity

Do nothing

If looking for overall prediction and not individual effects

We are interested in tax effect only

If theory suggests variables should be included

Combine or transform variables

e.g. we can simply have a new variable/index size=uhkm+pop+reg, and include it in the regression

Drop the redundant variables

How do you select explanatory variables?

Ideally, turn to theory, intuition, logic, and/or common sense for suggestions on what is appropriate to include.

Include (insofar as possible) all explanatory variables that you think might explain your dependent variable. This will reduce OVB.

Correlation matrix.

Plot of Residuals (again)

THAT IS A FUNNEL!!!

Our model may have heteroskedasticity.

Heteroskedasticity

Heteroskedasticity literally means “different variance”

The term applies when the errors in a regression model appear to be drawn from distributions which have different variances as we move along the X-axis (or, in multiple regression, the “predicted-Y” axis).

VAR(ε_i) = σ² (constant variance: homoskedasticity)

VAR(ε_i) = σ_i² (non-constant variance: heteroskedasticity)

Note the subscript i attached to σ²: instead of being constant over all observations, a heteroskedastic error term's variance can change from one observation to the next.

It often occurs in data sets in which there is a wide disparity between the largest and smallest observed values.

We’d expect that the error term distribution for very large observations (i.e. people with high wages) might have a large variance, but the error term distribution for small observations (people with low wages) might have a small variance.

It can also be caused by an incorrect specification, such as an omitted variable or an improper functional form (Lecture 11).
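The funnel shape can be reproduced with simulated data. This is a sketch under assumed numbers: the error standard deviation is set to 0.5 times x purely for illustration, and the function name residual_variance_by_half is hypothetical.

```python
import numpy as np

def residual_variance_by_half(n=2000, seed=1):
    """Fit OLS on data whose error spread grows with x, then compare the
    residual variance for small-x and large-x observations."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(1.0, 10.0, size=n)
    errors = rng.normal(scale=0.5 * x)     # assumed: error s.d. proportional to x
    y = 2.0 + 1.5 * x + errors

    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta

    cutoff = np.median(x)
    return resid[x <= cutoff].var(), resid[x > cutoff].var()

low_var, high_var = residual_variance_by_half()
print("residual variance, small x:", round(low_var, 2))
print("residual variance, large x:", round(high_var, 2))
```

Under homoskedasticity the two variances would be roughly equal. Here the large-x half is several times more spread out, which is what the funnel in a residual plot shows.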

Why does hypothesis testing fail?

The usual OLS standard errors are no longer correct, so the F- and t-statistics do not follow the F and t distributions.

Solutions?

The simplest solution to heteroskedasticity, which sometimes works, is to adjust the functional form.

Weighted Least Squares, if the functional form for the variance of the error term is known

Heteroskedasticity-corrected standard errors (straightforward in STATA)

Redefining the variables

e.g. GDP per capita instead of GDP
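For readers without Stata, heteroskedasticity-corrected standard errors can be sketched by hand. The block below implements the basic White (HC0) "sandwich" formula with NumPy on simulated data. The data-generating numbers and the function name ols_with_robust_se are assumptions for illustration, and statistical packages may apply an additional small-sample adjustment, so their output can differ slightly.

```python
import numpy as np

def ols_with_robust_se(X, y):
    """OLS coefficients with classical and White (HC0) standard errors."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta

    # Classical standard errors: assume one common error variance
    sigma2 = resid @ resid / (n - k)
    se_classic = np.sqrt(np.diag(sigma2 * xtx_inv))

    # HC0: each observation contributes its own squared residual
    meat = X.T @ (X * resid[:, None] ** 2)
    se_robust = np.sqrt(np.diag(xtx_inv @ meat @ xtx_inv))
    return beta, se_classic, se_robust

# Simulated heteroskedastic data (assumed: error s.d. proportional to x)
rng = np.random.default_rng(42)
n = 2000
x = rng.uniform(1.0, 10.0, size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5 * x)
X = np.column_stack([np.ones(n), x])

beta, se_classic, se_robust = ols_with_robust_se(X, y)
print("coefficients:", beta)
print("classical s.e.:", se_classic)
print("robust s.e.:   ", se_robust)
```

The coefficient estimates are the same in both cases. Only the standard errors, and therefore the t-statistics, change.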

Practice in Excel

http://www.rbnz.govt.nz/statistics/key-graphs/key-graph-house-price-values

Regression (OLS)

Issues

http://www.real-statistics.com/free-download/

Stata output for Equation 1 (petrol regressed on uhkm, tax, reg and pop)

Number of obs = 50
F(4, 45) = 137.31
Prob > F = 0.0000
R-squared = 0.9243
Adj R-squared = 0.9175
Root MSE = 194.64

Source      SS            df    MS
Model       20808205.7     4    5202051.43
Residual    1704794.77    45    37884.3282
Total       22513000.5    49    459448.99

petrol    Coef.        Std. Err.    t        P>|t|    [95% Conf. Interval]
uhkm      61.05368     10.44755     5.84     0.000    40.01124     82.09613
tax       -36.3691     13.29999     -2.73    0.009    -63.15664    -9.581548
reg       -.0524422    .0579811     -0.90    0.371    -.1692221    .0643378
pop       -.0066304    .0294278     -0.23    0.823    -.065901     .0526402
_cons     387.6308     146.2005     2.65     0.011    93.16797     682.0936
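The summary statistics in the four-variable petrol regression output can be recomputed from the sums of squares, which is a useful check on how to read the table. The sketch below uses only figures from that output; the function name summary_stats is illustrative.

```python
import math

# Figures copied from the Stata output for the four-variable petrol regression
ss_model, df_model = 20808205.7, 4
ss_resid, df_resid = 1704794.77, 45
coef_uhkm, se_uhkm = 61.05368, 10.44755

def summary_stats(ss_model, df_model, ss_resid, df_resid):
    """Recompute R-squared, adjusted R-squared, F and root MSE from the ANOVA table."""
    ss_total = ss_model + ss_resid
    df_total = df_model + df_resid
    ms_resid = ss_resid / df_resid
    r2 = ss_model / ss_total
    adj_r2 = 1.0 - ms_resid / (ss_total / df_total)
    f_stat = (ss_model / df_model) / ms_resid
    root_mse = math.sqrt(ms_resid)
    return r2, adj_r2, f_stat, root_mse

r2, adj_r2, f_stat, root_mse = summary_stats(ss_model, df_model, ss_resid, df_resid)
t_uhkm = coef_uhkm / se_uhkm   # t-statistic = coefficient / standard error

print(round(r2, 4), round(adj_r2, 4), round(f_stat, 2), round(root_mse, 2))
print(round(t_uhkm, 2))
```

The recomputed values agree with the reported R-squared, adjusted R-squared, F-statistic, root MSE and the t-statistic on uhkm up to rounding.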

Correlation matrix of the explanatory variables

        pop        reg        tax        uhkm
pop     1.0000
reg     0.9806     1.0000
tax     -0.2363    -0.2422    1.0000
uhkm    0.9645     0.9786     -0.2809    1.0000

Stata output after dropping reg and pop (petrol regressed on uhkm and tax)

Number of obs = 50
F(2, 47) = 273.24
Prob > F = 0.0000
R-squared = 0.9208
Adj R-squared = 0.9174
Root MSE = 194.77

Source      SS            df    MS
Model       20730118.2     2    10365059.1
Residual    1782882.3     47    37933.666
Total       22513000.5    49    459448.99

petrol    Coef.        Std. Err.    t        P>|t|    [95% Conf. Interval]
uhkm      46.38639     2.167516     21.40    0.000    42.02591     50.74687
tax       -39.58929    13.11767     -3.02    0.004    -65.97863    -13.19994
_cons     410.0165     145.327      2.82     0.007    117.6562     702.3767
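The effect of dropping the redundant variables shows up most clearly in the standard error on uhkm. The short sketch below compares the two reported regressions using only figures from the Stata output; the variable and function names are illustrative.

```python
# uhkm coefficient and standard error from the two reported regressions
coef_full, se_full = 61.05368, 10.44755        # with reg and pop included
coef_reduced, se_reduced = 46.38639, 2.167516  # after dropping reg and pop

def t_stat(coef, se):
    """t-statistic for the null hypothesis that the coefficient is zero."""
    return coef / se

se_ratio = se_full / se_reduced

print("t on uhkm, full model:   ", round(t_stat(coef_full, se_full), 2))
print("t on uhkm, reduced model:", round(t_stat(coef_reduced, se_reduced), 2))
print("standard error ratio:    ", round(se_ratio, 2))
```

The standard error on uhkm is almost five times larger in the full model, while the adjusted R-squared barely changes between the two regressions (0.9175 versus 0.9174).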

[Figure: RESIDUAL OUTPUT. Standard residuals plotted against predicted wage.]