Strata Microeconomics


Week 1

Review of Linear Regression and Correlation Analysis

ECON 122B

Applied Econometrics II

Summer Session-1st Half 2021

Confidence Intervals

• A confidence interval is an interval with random endpoints that contains the parameter of interest (for example, μ) with a pre-specified probability, denoted by 1 - α.

• The confidence interval automatically provides a margin of error to account for the sampling variability of the sample statistic.
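As a rough illustration of the margin of error, the sketch below builds an approximate interval for a mean. It is not from the slides: the function name `z_confidence_interval` and the data are invented, and it assumes a normal critical value, which is only reasonable for large samples (a small sample would call for a t critical value).

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_confidence_interval(sample, alpha=0.05):
    """Approximate (1 - alpha) confidence interval for the population mean.

    Assumption: a normal critical value is used, which suits large samples;
    small samples call for a t critical value instead.
    """
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)                        # sample standard deviation (n - 1 divisor)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    margin = z * s / sqrt(n)                 # margin of error
    return x_bar - margin, x_bar + margin


# Hypothetical measurements, invented for this example.
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.0, 12.1]
lower, upper = z_confidence_interval(sample)
print(f"Approximate 95% interval: ({lower:.3f}, {upper:.3f})")
```

Lowering α (for example to 0.01) widens the interval, since a higher coverage probability requires a larger margin of error.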


Testing Hypotheses

• The alternative hypothesis is what the test is trying to establish

– Denoted: Ha or H1

• The null hypothesis is what the test is trying to disprove

– Denoted: H0

Scatter Plots and Correlation

• A scatter plot (or scatter diagram) is used to show the relationship between two variables

• Correlation analysis is used to measure the strength of the association (linear relationship) between two variables

– Only concerned with strength of the relationship

– No causal effect is implied

Correlation Coefficient

• The population correlation coefficient ρ (rho) measures the strength of the association between the variables

• The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations


Features of ρ and r

• Unit free

• Range between -1 and 1

• The closer to -1, the stronger the negative linear relationship

• The closer to 1, the stronger the positive linear relationship

• The closer to 0, the weaker the linear relationship

Significance Test for Correlation

• Hypotheses

H0: ρ = 0 (no correlation)

HA: ρ ≠ 0 (correlation exists)

• Test statistic

$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$

(with n – 2 degrees of freedom)
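A minimal sketch of this test in Python, assuming the usual form of the statistic, t = r / sqrt((1 - r²)/(n - 2)). The function names and the five data points are made up for illustration. The resulting t would be compared with a critical value from a t table with n – 2 degrees of freedom.

```python
from math import sqrt


def sample_correlation(x, y):
    """Sample correlation coefficient r between two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def correlation_t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r / sqrt((1 - r ** 2) / (n - 2))


# Hypothetical data, invented for this example.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = sample_correlation(x, y)
t = correlation_t_statistic(r, len(x))
print(f"r = {r:.4f}, t = {t:.4f}, d.f. = {len(x) - 2}")
```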


Correlation and Causation

• Must be very careful in interpreting correlation coefficients

• Just because two variables are highly correlated does not mean that one causes the other

– Ice cream sales and the number of shark attacks on swimmers are correlated

– The number of cavities in elementary school children and vocabulary size have a strong positive correlation.

• To establish causation, a designed experiment must be run

CORRELATION DOES NOT IMPLY CAUSATION

Regression Analysis

• Basic Idea: Fit a straight line that relates the dependent variable (y) and the independent variable (x)

• Linearity Assumption: The slope of the equation does not change as x changes

• Assuming linearity, we can write

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

which says that y is made up of a predictable part (due to x) and an unpredictable part

• The coefficients β0 and β1 are interpreted as the true, underlying intercept and slope

Population Linear Regression

The population regression model:

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

where:

y = Dependent variable

β0 = Population y intercept

β1 = Population slope coefficient

x = Independent variable

ε = Random error term, or residual

The term β0 + β1x is the linear component, and ε is the random error component.


Regression Assumptions

We start by assuming that for each value of X, the corresponding value of Y is random and has a normal distribution.

Linear Regression Assumptions

• Error values (ε) are statistically independent

• Error values are normally distributed for any given value of x

• The probability distribution of the errors is normal

• The probability distribution of the errors has constant variance

• The underlying relationship between the x variable and the y variable is linear

Estimated Regression Model

The sample regression line provides an estimate of the population regression line:

$$ \hat{y}_i = b_0 + b_1 x $$

where:

ŷi = Estimated (or predicted) y value

b0 = Estimate of the regression intercept

b1 = Estimate of the regression slope

x = Independent variable

The individual random error terms ei have a mean of zero

Least Squares Criterion

• This method gives a best-fitting straight line by minimizing the sum of the squares of the vertical deviations about the line

• b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared residuals

$$ \sum e^2 = \sum (y - \hat{y})^2 = \sum \bigl(y - (b_0 + b_1 x)\bigr)^2 $$

The Least Squares Equation

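• The formulas for b1 and b0 are:

$$ b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} $$

algebraic equivalent:

$$ b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}} $$

and

$$ b_0 = \bar{y} - b_1 \bar{x} $$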
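The two slope formulas (the deviation form and its algebraic equivalent) can be checked against each other numerically. The following is a small sketch with hypothetical function names and invented data; in practice the slides expect software to do this work.

```python
def slope_deviation_form(x, y):
    """b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    denominator = sum((xi - x_bar) ** 2 for xi in x)
    return numerator / denominator


def slope_computational_form(x, y):
    """b1 = (sum(xy) - sum(x)sum(y)/n) / (sum(x^2) - (sum(x))^2/n)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    return (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)


def intercept(x, y, b1):
    """b0 = y_bar - b1 * x_bar."""
    n = len(x)
    return sum(y) / n - b1 * sum(x) / n


# Hypothetical data, invented for this example.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b1 = slope_deviation_form(x, y)
b1_alt = slope_computational_form(x, y)
b0 = intercept(x, y, b1)
print(f"b1 = {b1:.4f} (deviation form), {b1_alt:.4f} (computational form)")
print(f"b0 = {b0:.4f}")
```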

Interpretation of the Slope and the Intercept

• b0 is the estimated average value of y when the value of x is zero

• b1 is the estimated change in the average value of y as a result of a one-unit change in x

Finding the Least Squares Equation

• The coefficients b0 and b1 will usually be found using computer software, such as STATA

• Other regression measures will also be computed as part of computer-based regression analysis

Least Squares Regression Properties

• The sum of the residuals from the least squares regression line is 0:

$$ \sum (y - \hat{y}) = 0 $$

• The sum of the squared residuals is a minimum:

$$ \sum (y - \hat{y})^2 \text{ is minimized} $$

• The simple regression line always passes through the mean of the y variable and the mean of the x variable

• The least squares coefficients are unbiased estimates of β0 and β1
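The first three properties can be verified numerically on a small data set. This is only a sketch: `fit_line` and `sse` are hypothetical helper names, the data are invented, and the minimum is checked against a couple of nearby lines rather than proven. Unbiasedness is a statement about repeated sampling and is not something a single fit can show.

```python
def fit_line(x, y):
    """Least squares intercept b0 and slope b1 for simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, invented for this example.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def sse(c0, c1):
    """Sum of squared residuals for an arbitrary line y = c0 + c1 * x."""
    return sum((yi - (c0 + c1 * xi)) ** 2 for xi, yi in zip(x, y))


print("Sum of residuals:", round(sum(residuals), 10))
print("Line at x_bar:", b0 + b1 * x_bar, "vs y_bar:", y_bar)
print("SSE at fit:", sse(b0, b1), "vs a nearby line:", sse(b0 + 0.1, b1))
```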

Explained and Unexplained Variation

• Total variation is made up of two parts:

$$ SST = SSR + SSE $$

$$ SST = \sum (y - \bar{y})^2 \qquad SSE = \sum (y - \hat{y})^2 \qquad SSR = \sum (\hat{y} - \bar{y})^2 $$

where:

ȳ = Average value of the dependent variable

y = Observed values of the dependent variable

ŷ = Estimated value of y for the given x value

Coefficient of Determination, R²

• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable

• The coefficient of determination is also called R-squared and is denoted as R²

$$ R^2 = \frac{SSR}{SST} = \frac{\text{sum of squares explained by regression}}{\text{total sum of squares}}, \qquad \text{where } 0 \le R^2 \le 1 $$

Note: In the single independent variable case, the coefficient of determination is

$$ R^2 = r^2 $$

where:

R² = Coefficient of determination

r = Simple correlation coefficient
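To make the decomposition and the R² = r² relationship concrete, here is a short sketch on invented data. The variable names are illustrative only; the sums follow the definitions SST = Σ(y - ȳ)², SSE = Σ(y - ŷ)², and SSR = Σ(ŷ - ȳ)².

```python
from math import sqrt

# Hypothetical data, invented for this example.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

# Least squares fit and fitted values.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# Variation decomposition.
sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # unexplained variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # explained variation

r_squared = ssr / sst
r = sxy / sqrt(sxx * syy)   # simple correlation coefficient

print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}")
print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```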

Standard Error of Estimate

• The standard deviation of the variation of observations around the regression line is estimated by

$$ s_\varepsilon = \sqrt{\frac{SSE}{n - k - 1}} $$

where:

SSE = Sum of squares error

n = Sample size

k = Number of independent variables in the model

The Standard Deviation of the Regression Slope

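• The standard error of the regression slope coefficient (b1) is estimated by

$$ s_{b_1} = \frac{s_\varepsilon}{\sqrt{\sum (x - \bar{x})^2}} = \frac{s_\varepsilon}{\sqrt{\sum x^2 - \dfrac{(\sum x)^2}{n}}} $$

where:

$s_{b_1}$ = Estimate of the standard error of the least squares slope

$s_\varepsilon = \sqrt{\dfrac{SSE}{n - 2}}$ = Sample standard error of the estimate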

Inference about the Slope: t Test

• t test for a population slope

– Is there a linear relationship between x and y?

• Null and alternative hypotheses

– H0: β1 = 0 (no linear relationship)

– H1: β1 ≠ 0 (linear relationship does exist)

• Test statistic

$$ t = \frac{b_1 - \beta_1}{s_{b_1}}, \qquad \text{d.f.} = n - 2 $$

where:

b1 = Sample regression slope coefficient

β1 = Hypothesized slope

sb1 = Estimator of the standard error of the slope
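The pieces of the slope test (standard error of the estimate, standard error of the slope, and the t statistic) can be assembled as in the sketch below. The function name `slope_t_test` and the data are assumptions made for illustration, and the code covers only the simple regression case with n – 2 degrees of freedom.

```python
from math import sqrt


def slope_t_test(x, y, beta1_null=0.0):
    """Return (b1, s_e, s_b1, t) for a simple regression slope test.

    s_e  = sqrt(SSE / (n - 2))               standard error of the estimate
    s_b1 = s_e / sqrt(sum((x - x_bar)^2))    standard error of the slope
    t    = (b1 - beta1_null) / s_b1          with n - 2 degrees of freedom
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = sqrt(sse / (n - 2))
    s_b1 = s_e / sqrt(sxx)
    t = (b1 - beta1_null) / s_b1
    return b1, s_e, s_b1, t


# Hypothetical data, invented for this example.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b1, s_e, s_b1, t = slope_t_test(x, y)
print(f"b1 = {b1:.4f}, s_e = {s_e:.4f}, s_b1 = {s_b1:.4f}, t = {t:.4f}")
```

The t value is then compared with a critical value from a t table for the chosen α and n – 2 degrees of freedom.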

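The Multiple Regression Model

Idea: Examine the linear relationship between 1 dependent (y) and 2 or more independent variables (xi)

Population model:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon $$

where β0 is the y-intercept, β1, …, βk are the population slopes, and ε is the random error

Estimated multiple regression model:

$$ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k $$

where ŷ is the estimated (or predicted) value of y, b0 is the estimated intercept, and b1, …, bk are the estimated slope coefficients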

Multiple Regression Assumptions

Errors (residuals) from the regression model:

$$ e = (y - \hat{y}) $$

• The model errors are independent and random

• The errors are normally distributed

• The mean of the errors is zero

• Errors have a constant variance

Adjusted R²

• R² never decreases when a new x variable is added to the model

– This can be a disadvantage when comparing models

• What is the net effect of adding a new variable?

– We lose a degree of freedom when a new x variable is added

– Did the new x variable add enough explanatory power to offset the loss of one degree of freedom?

• Shows the proportion of variation in y explained by all x variables adjusted for the number of x variables used

$$ R_A^2 = 1 - (1 - R^2)\left(\frac{n - 1}{n - k - 1}\right) $$

(where n = sample size, k = number of independent variables)

– Penalize excessive use of unimportant independent variables

– Smaller than R²

– Useful in comparing among models
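A small sketch of the adjustment, assuming the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1). The function name and the input values are invented for illustration.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for sample size n and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 observations")
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical values: R^2 = 0.6 from n = 5 observations.
one_variable = adjusted_r_squared(0.6, n=5, k=1)
# Same R^2 after adding a second variable that explains nothing new.
two_variables = adjusted_r_squared(0.6, n=5, k=2)

print(f"k = 1: adjusted R^2 = {one_variable:.4f}")
print(f"k = 2: adjusted R^2 = {two_variables:.4f}")
```

With R² unchanged, the extra variable lowers the adjusted value, which is the penalty for using up a degree of freedom without adding explanatory power.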

Is the Model Significant?

• F-Test for Overall Significance of the Model

• Shows if there is a linear relationship between all of the x variables considered together and y

• Use F test statistic

• Hypotheses:

– H0: β1 = β2 = … = βk = 0 (no linear relationship)

– HA: at least one βi ≠ 0 (at least one independent variable affects y)

F-Test for Overall Significance

• Test statistic:

$$ F = \frac{MSR}{MSE} = \frac{SSR / k}{SSE / (n - k - 1)} $$

where F has (numerator) D1 = k and (denominator) D2 = (n – k – 1) degrees of freedom
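The F statistic is a ratio of two mean squares, which the sketch below computes from the sums of squares. The function name and the numbers are hypothetical. The result would be compared with a critical value from an F table with k and n – k – 1 degrees of freedom.

```python
def f_statistic(ssr, sse, n, k):
    """Overall F statistic: (SSR / k) / (SSE / (n - k - 1))."""
    msr = ssr / k              # mean square regression
    mse = sse / (n - k - 1)    # mean square error
    return msr / mse


# Hypothetical sums of squares from a fit with n = 5 and k = 1.
f_value = f_statistic(ssr=3.6, sse=2.4, n=5, k=1)
print(f"F = {f_value:.4f} with D1 = 1 and D2 = 3 degrees of freedom")
```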

Are Individual Variables Significant?

• Use t-tests of individual variable slopes

• Shows if there is a linear relationship between the variable xi and y

• Hypotheses:

– H0: βi = 0 (no linear relationship)

– HA: βi ≠ 0 (linear relationship does exist between xi and y)

• Test statistic:

$$ t = \frac{b_i - 0}{s_{b_i}} \qquad (\text{df} = n - k - 1) $$

Standard Deviation of the Regression Model

• The estimate of the standard deviation of the regression model is:

$$ s_\varepsilon = \sqrt{\frac{SSE}{n - k - 1}} = \sqrt{MSE} $$

• Is this value large or small? Judge it relative to the mean of y

Multicollinearity

• Multicollinearity: High correlation exists between two independent variables

• This means the two variables contribute redundant information to the multiple regression model

Detect Collinearity (Variance Inflationary Factor)

VIFj is used to measure collinearity:

$$ VIF_j = \frac{1}{1 - R_j^2} $$

R²j is the coefficient of determination when the jth independent variable is regressed against the remaining k – 1 independent variables

If VIFj ≥ 10, xj is highly correlated with the other explanatory variables
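The sketch below illustrates VIF = 1 / (1 - R²j) for the simplest case only. It assumes a model with exactly two independent variables, where regressing one on the other gives an R²j equal to their squared correlation. With more variables, R²j would have to come from a multiple regression, which is not shown here. The function names and data are invented.

```python
from math import sqrt


def vif_from_r_squared(r_squared_j):
    """VIF_j = 1 / (1 - R_j^2)."""
    return 1 / (1 - r_squared_j)


def squared_correlation(a, b):
    """Squared sample correlation between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return (sab / sqrt(saa * sbb)) ** 2


# Hypothetical predictors, invented for this example.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

# Two-variable shortcut: R_j^2 is the squared correlation of x1 and x2.
r2_j = squared_correlation(x1, x2)
vif = vif_from_r_squared(r2_j)
print(f"R_j^2 = {r2_j:.4f}, VIF = {vif:.4f}")
print("Highly correlated" if vif >= 10 else "Below the VIF >= 10 rule of thumb")
```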

Qualitative (Dummy) Variables

• Categorical explanatory variable (dummy variable) with two or more levels:

– yes or no, on or off, male or female

– coded as 0 or 1

• Regression intercepts are different if the variable is significant

• Assumes equal slopes for other variables

• The number of dummy variables needed is (number of levels – 1)
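As an illustration of the (number of levels – 1) rule, the following sketch codes a categorical variable into 0/1 columns and leaves one level out as the baseline. The function name `dummy_columns`, the choice of the alphabetically first level as the default baseline, and the data are all assumptions made for this example.

```python
def dummy_columns(values, baseline=None):
    """Code a categorical variable as 0/1 columns, omitting one baseline level.

    Returns a dict mapping each non-baseline level to its list of 0/1 codes.
    Assumption: the alphabetically first level is the baseline by default.
    """
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]
    return {
        level: [1 if value == level else 0 for value in values]
        for level in levels
        if level != baseline
    }


# Hypothetical categorical variable with three levels.
region = ["north", "south", "west", "south", "north"]
columns = dummy_columns(region)

for level, codes in columns.items():
    print(level, codes)
```

Three levels produce two dummy columns. An observation in the baseline level has 0 in every column, so its effect is absorbed into the regression intercept.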