
Week 1

Review of Linear Regression and Correlation Analysis

ECON 122B

Applied Econometrics II

Summer Session-1st Half 2021


Confidence Intervals

• An interval with random endpoints which

contains the parameter of interest (for example,

μ) with a pre-specified probability, denoted by

1 - α.

• The confidence interval automatically provides

a margin of error to account for the sampling

variability of the sample statistic.
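The idea of an interval with random endpoints and a margin of error can be sketched in a few lines of Python. This is only an illustration under assumptions not stated in the slides: the function name `mean_confidence_interval`, the made-up sample, and the use of a normal (z) critical value, which is a large-sample approximation (a t critical value would normally be used for a sample this small).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for the population mean."""
    n = len(sample)
    x_bar = mean(sample)
    # Standard error of the sample mean, from the sample standard deviation.
    standard_error = stdev(sample) / sqrt(n)
    # Critical value z_{alpha/2} of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin_of_error = z * standard_error
    return x_bar - margin_of_error, x_bar + margin_of_error


sample = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 4.9]  # hypothetical data
low, high = mean_confidence_interval(sample, alpha=0.05)
```

The interval is centered on the sample mean, and lowering α (for example to 0.01) widens it.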


Testing Hypotheses

• The alternative hypothesis is the claim the test seeks evidence for

– Denoted: Ha or H1

• The null hypothesis is the claim the test seeks to reject

– Denoted: H0

Scatter Plots and Correlation

• A scatter plot (or scatter diagram) is used

to show the relationship between two

variables

• Correlation analysis is used to measure

strength of the association (linear

relationship) between two variables

– Only concerned with strength of the

relationship

– No causal effect is implied

Correlation Coefficient

• The population correlation coefficient ρ (rho)

measures the strength of the association between

the variables

• The sample correlation coefficient r is an estimate

of ρ and is used to measure the strength of the

linear relationship in the sample observations


Features of ρ and r

• Unit free

• Range between -1 and 1

• The closer to -1, the stronger the

negative linear relationship

• The closer to 1, the stronger the positive

linear relationship

• The closer to 0, the weaker the linear

relationship


Significance Test for Correlation

• Hypotheses

H0: ρ = 0 (no correlation)

HA: ρ ≠ 0 (correlation exists)

• Test statistic

$t = \dfrac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$

(with n – 2 degrees of freedom)

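As a rough illustration of the significance test for correlation, the sketch below computes the sample correlation r and the t statistic r / sqrt((1 − r²)/(n − 2)). The function names and the five data points are hypothetical. The resulting t would be compared with a critical value from a t table with n − 2 degrees of freedom.

```python
from math import sqrt


def sample_correlation(x, y):
    """Sample correlation coefficient r of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r / sqrt((1 - r ** 2) / (n - 2))


x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]
r = sample_correlation(x, y)
t = correlation_t_statistic(r, len(x))
```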

Correlation and Causation

• Must be very careful in interpreting correlation coefficients

• Just because two variables are highly correlated does not mean that one causes the other

– Ice cream sales and the number of shark attacks on swimmers are correlated

– The number of cavities in elementary school children and vocabulary size have a strong positive correlation.

• To establish causation, a designed experiment must be run

CORRELATION DOES NOT IMPLY CAUSATION

Regression Analysis


• Basic Idea: Fit a straight line that relates dependent variable (y) and

independent variable (x)

• Linearity Assumption: Slope of the equation does not change as x changes

• Assuming linearity we can write

$y = \beta_0 + \beta_1 x + \varepsilon$

which says that Y is made up of a predictable part (due to X) and an unpredictable part

• Coefficients are interpreted as the true, underlying intercept and slope

Population Linear Regression

The population regression model:

$y = \beta_0 + \beta_1 x + \varepsilon$

– y: dependent variable

– x: independent variable

– β0: population y intercept

– β1: population slope coefficient

– ε: random error term, or residual

– Linear component: β0 + β1x

– Random error component: ε

Regression Assumptions

We start by assuming that for each value of X, the corresponding value of Y is random, and has a normal distribution.

Linear Regression Assumptions

• Error values (ε) are statistically independent

• Error values are normally distributed for any given value of x

• The probability distribution of the errors is normal

• The probability distribution of the errors has constant variance

• The underlying relationship between the x variable and the y variable is linear

Estimated Regression Model

The sample regression line provides an estimate of the population regression line

$\hat{y}_i = b_0 + b_1 x$

– ŷi: estimated (or predicted) y value

– b0: estimate of the regression intercept

– b1: estimate of the regression slope

– x: independent variable

The individual random error terms ei have a mean of zero

Least Squares Criterion

• This method gives a best-fitting straight

line by minimizing the sum of the squares

of the vertical deviations about the line

• b0 and b1 are obtained by finding the

values of b0 and b1 that minimize the

sum of the squared residuals

$\sum e^2 = \sum (y - \hat{y})^2 = \sum \bigl(y - (b_0 + b_1 x)\bigr)^2$

The Least Squares Equation

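• The formulas for b1 and b0 are:

$b_1 = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$

algebraic equivalent:

$b_1 = \dfrac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$

and

$b_0 = \bar{y} - b_1 \bar{x}$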
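The two slope formulas and the intercept formula can be checked against each other with a short sketch. The function names and data below are made up for illustration. The first function uses the deviation form, the second the algebraically equivalent computational form.

```python
def least_squares(x, y):
    """Return (b0, b1) using the deviation form of the slope formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def slope_computational(x, y):
    """Slope b1 from the algebraically equivalent sums-of-products form."""
    n = len(x)
    numerator = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    denominator = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    return numerator / denominator


x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)          # fitted line: y-hat = 2.2 + 0.6 x
b1_alt = slope_computational(x, y)    # same slope from the second formula
```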

Interpretation of the Slope and the Intercept

• b0 is the estimated average value of y when the value of x is zero

• b1 is the estimated change in the average value of y as a result of a one-unit change in x

Finding the Least Squares Equation

• The coefficients b0 and b1 will usually be found using computer software, such as Stata

• Other regression measures will also be computed as part of computer-based regression analysis

Least Squares Regression Properties

• The sum of the residuals from the least squares regression line is 0: $\sum (y - \hat{y}) = 0$

• The sum of the squared residuals, $\sum (y - \hat{y})^2$, is a minimum

• The simple regression line always passes through the mean of the y variable and the mean of the x variable

• The least squares coefficients are unbiased estimates of β0 and β1
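The first three properties can be checked numerically on a small made-up data set. The variable names and the nudge size of 0.1 are arbitrary choices for illustration. Unbiasedness is a statement about repeated samples and cannot be verified from a single fit, so it is not checked here.

```python
x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares coefficients from the deviation formulas.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar


def sse(intercept, slope):
    """Sum of squared residuals for an arbitrary candidate line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Property 1: the residuals sum to zero (up to rounding error).
residual_sum = sum(yi - (b0 + b1 * xi) for xi, yi in zip(x, y))

# Property 2: nudging either coefficient can only increase the SSE.
best_sse = sse(b0, b1)
nudged_sse = min(sse(b0 + 0.1, b1), sse(b0 - 0.1, b1),
                 sse(b0, b1 + 0.1), sse(b0, b1 - 0.1))

# Property 3: the fitted line passes through (x_bar, y_bar).
fitted_at_x_bar = b0 + b1 * x_bar
```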

Explained and Unexplained Variation

• Total variation is made up of two parts:

$SST = SSR + SSE$

$SST = \sum (y - \bar{y})^2$

$SSE = \sum (y - \hat{y})^2$

$SSR = \sum (\hat{y} - \bar{y})^2$

where:

ȳ = Average value of the dependent variable

y = Observed values of the dependent variable

ŷ = Estimated value of y for the given x value

Coefficient of Determination, R²

• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable

• The coefficient of determination is also called R-squared and is denoted as R²

$R^2 = \dfrac{SSR}{SST}$ where $0 \le R^2 \le 1$

$R^2 = \dfrac{SSR}{SST} = \dfrac{\text{sum of squares explained by regression}}{\text{total sum of squares}}$

Note: In the single independent variable case, the coefficient of determination is

$R^2 = r^2$

where:

R² = Coefficient of determination

r = Simple correlation coefficient
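A small numeric check may help tie these pieces together. The sketch below fits a simple regression, computes SST, SSR and SSE, and confirms that SST = SSR + SSE and that R² equals the squared sample correlation. The function name `variation_decomposition` and the data are made up for illustration.

```python
from math import sqrt


def variation_decomposition(x, y):
    """Return (SST, SSR, SSE) for a simple least squares fit of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained
    return sst, ssr, sse


x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]
sst, ssr, sse = variation_decomposition(x, y)
r_squared = ssr / sst

# Sample correlation r, computed independently of the regression fit.
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
r = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sqrt(sum((xi - x_bar) ** 2 for xi in x)
            * sum((yi - y_bar) ** 2 for yi in y)))
```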

Standard Error of Estimate

• The standard deviation of the variation of observations around the regression line is estimated by

$s_\varepsilon = \sqrt{\dfrac{SSE}{n - k - 1}}$

where:

SSE = Sum of squares error

n = Sample size

k = number of independent variables in the model

The Standard Deviation of the Regression Slope

• The standard error of the regression slope coefficient (b1) is estimated by

$s_{b_1} = \dfrac{s_\varepsilon}{\sqrt{\sum (x - \bar{x})^2}} = \dfrac{s_\varepsilon}{\sqrt{\sum x^2 - \dfrac{(\sum x)^2}{n}}}$

where:

$s_{b_1}$ = Estimate of the standard error of the least squares slope

$s_\varepsilon = \sqrt{\dfrac{SSE}{n - 2}}$ = Sample standard error of the estimate

Inference about the Slope: t Test

• t test for a population slope

– Is there a linear relationship between x and y?

• Null and alternative hypotheses

– H0: β1 = 0 (no linear relationship)

– H1: β1 ≠ 0 (linear relationship does exist)

• Test statistic

$t = \dfrac{b_1 - \beta_1}{s_{b_1}}$, d.f. = n − 2

where:

b1 = Sample regression slope coefficient

β1 = Hypothesized slope

sb1 = Estimator of the standard error of the slope
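The slope test can be traced step by step in code. The following is a minimal sketch with a hypothetical function name and made-up data: it computes b1, the standard error of the estimate, the standard error of the slope, and the t statistic for a hypothesized slope (zero by default). The decision itself would come from comparing t with a t-table critical value at n − 2 degrees of freedom.

```python
from math import sqrt


def slope_t_test(x, y, beta1_null=0.0):
    """Return (b1, s_b1, t) for a simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_eps = sqrt(sse / (n - 2))      # standard error of the estimate
    s_b1 = s_eps / sqrt(sxx)         # standard error of the slope
    t = (b1 - beta1_null) / s_b1     # test statistic, d.f. = n - 2
    return b1, s_b1, t


x = [1, 2, 3, 4, 5]  # hypothetical data
y = [2, 4, 5, 4, 5]
b1, s_b1, t = slope_t_test(x, y)
```

For these data t equals the t statistic of the correlation test on the same x and y, as expected in simple regression.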

The Multiple Regression Model

Idea: Examine the linear relationship between 1 dependent (y) & 2 or more independent variables (xi)

Population model:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$

– β0: Y-intercept

– β1, …, βk: population slopes

– ε: random error

Estimated multiple regression model:

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$

– ŷ: estimated (or predicted) value of y

– b0: estimated intercept

– b1, …, bk: estimated slope coefficients

Multiple Regression Assumptions

Errors (residuals) from the regression model:

$e = (y - \hat{y})$

• The model errors are independent and random

• The errors are normally distributed

• The mean of the errors is zero

• Errors have a constant variance

Adjusted R²

• R² never decreases when a new x variable is added to the model

– This can be a disadvantage when comparing

models

• What is the net effect of adding a new variable?

– We lose a degree of freedom when a new x

variable is added

– Did the new x variable add enough

explanatory power to offset the loss of one

degree of freedom?

• Shows the proportion of variation in y explained by all x variables adjusted for the number of x variables used

$R_A^2 = 1 - (1 - R^2)\left(\dfrac{n - 1}{n - k - 1}\right)$

(where n = sample size, k = number of independent variables)

– Penalize excessive use of unimportant independent variables

– Smaller than R²

– Useful in comparing among models
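To see the degrees-of-freedom penalty at work, the sketch below evaluates the adjusted R² formula for two hypothetical models fitted to the same n = 20 observations. The R² values 0.60 and 0.61 are invented for illustration: adding a second variable raises R² slightly, yet adjusted R² falls.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for sample size n and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical comparison on the same sample of n = 20 observations.
adj_one_variable = adjusted_r_squared(0.60, n=20, k=1)   # about 0.578
adj_two_variables = adjusted_r_squared(0.61, n=20, k=2)  # about 0.564
```

In this made-up case the extra variable does not add enough explanatory power to offset the lost degree of freedom.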

Is the Model Significant?

• F-Test for Overall Significance of the Model

• Shows if there is a linear relationship between all

of the x variables considered together and y

• Use F test statistic

• Hypotheses:

– H0: β1 = β2 = … = βk = 0 (no linear relationship)

– HA: at least one βi ≠ 0 (at least one independent variable affects y)

F-Test for Overall Significance

• Test statistic:

$F = \dfrac{SSR / k}{SSE / (n - k - 1)} = \dfrac{MSR}{MSE}$

where F has (numerator) D1 = k and (denominator) D2 = (n – k – 1) degrees of freedom
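The F statistic is a ratio of two mean squares, which the following sketch computes directly. The function name and the inputs (SSR = 3.6, SSE = 2.4, n = 5, k = 1) are illustrative values, not figures from the course. The result would be compared with an F-table critical value at k and n − k − 1 degrees of freedom.

```python
def f_statistic(ssr, sse, n, k):
    """Overall F statistic with k and n - k - 1 degrees of freedom."""
    msr = ssr / k                # mean square regression
    mse = sse / (n - k - 1)      # mean square error
    return msr / mse


# Hypothetical sums of squares from a simple regression on 5 observations.
f_value = f_statistic(ssr=3.6, sse=2.4, n=5, k=1)  # approximately 4.5
```

With a single independent variable, F equals the square of the slope's t statistic, so the two tests agree.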

Are Individual Variables Significant?

• Use t-tests of individual variable slopes

• Shows if there is a linear relationship between the

variable xi and y

• Hypotheses:

– H0: βi = 0 (no linear relationship)

– HA: βi ≠ 0 (linear relationship does exist between xi and y)

• Test statistic:

$t = \dfrac{b_i - 0}{s_{b_i}}$

(df = n – k – 1)

Standard Deviation of the Regression Model

• The estimate of the standard deviation of the regression model is:

$s_\varepsilon = \sqrt{\dfrac{SSE}{n - k - 1}} = \sqrt{MSE}$

• Is this value large or small? Compare it to the mean size of y

Multicollinearity

• Multicollinearity: High correlation exists

between two independent variables

• This means the two variables contribute

redundant information to the multiple

regression model

Detect Collinearity (Variance Inflationary Factor)

VIFj is used to measure collinearity:

$VIF_j = \dfrac{1}{1 - R_j^2}$

R²j is the coefficient of determination when the j th independent variable is regressed against the remaining k – 1 independent variables

If VIFj ≥ 10, xj is highly correlated with the other explanatory variables
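The VIF formula is simple enough to evaluate by hand, but a small sketch makes the threshold concrete. The function name and the R²j values are hypothetical. In practice R²j would come from an auxiliary regression of xj on the other independent variables.

```python
def variance_inflation_factor(r_squared_j):
    """VIF for variable j, given R^2 from regressing x_j on the other x's."""
    if r_squared_j >= 1:
        # x_j is an exact linear combination of the others; VIF is undefined.
        raise ValueError("R^2_j must be below 1")
    return 1.0 / (1.0 - r_squared_j)


vif_moderate = variance_inflation_factor(0.50)  # 2.0
vif_high = variance_inflation_factor(0.95)      # roughly 20, above the cutoff
flagged = vif_high >= 10                        # rule of thumb from the slide
```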

Qualitative (Dummy) Variables

• Categorical explanatory variable (dummy variable) with two or more levels:

– yes or no, on or off, male or female

– coded as 0 or 1

• Regression intercepts are different if the variable is significant

• Assumes equal slopes for other variables

• The number of dummy variables needed is (number of levels – 1)
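The "number of levels – 1" rule can be illustrated with a small encoding sketch. The function name `dummy_columns`, the variable `region`, and its three levels are invented for this example. The omitted baseline level is represented by all dummies being 0.

```python
def dummy_columns(values, baseline):
    """Build 0/1 dummy columns for every level except the baseline."""
    levels = sorted(set(values) - {baseline})
    return {level: [1 if v == level else 0 for v in values]
            for level in levels}


# Hypothetical categorical variable with three levels.
region = ["north", "south", "west", "south", "north"]
dummies = dummy_columns(region, baseline="north")
# dummies == {"south": [0, 1, 0, 1, 0], "west": [0, 0, 1, 0, 0]}
```

Three levels produce two dummy variables. Including a third dummy for the baseline would make the dummies sum to the intercept column, which is an exact collinearity.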