Strata Microeconomics
Week 1
Review of Linear Regression and Correlation Analysis
ECON 122B
Applied Econometrics II
Summer Session-1st Half 2021
Confidence Intervals
• An interval with random endpoints which
contains the parameter of interest (for example,
μ) with a pre-specified probability, denoted by
1 - α.
• The confidence interval automatically provides
a margin of error to account for the sampling
variability of the sample statistic.
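As a rough illustration of the margin-of-error idea, the sketch below builds a (1 - α) interval for a mean. It assumes the normal critical value is adequate (a large-sample approximation, not something stated on the slide), and the function name and data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, alpha=0.05):
    """Return (lower, upper) of a (1 - alpha) confidence interval for the mean.

    Assumption: uses a normal critical value; for small samples a
    t critical value would give a somewhat wider interval.
    """
    n = len(sample)
    x_bar = mean(sample)
    std_err = stdev(sample) / math.sqrt(n)    # estimated sampling variability
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    margin = z * std_err                      # the margin of error
    return x_bar - margin, x_bar + margin

sample = [4.1, 5.0, 4.7, 5.3, 4.9, 5.6, 4.4, 5.1]  # hypothetical data
lower, upper = mean_confidence_interval(sample)
print(f"95% CI for the mean: ({lower:.3f}, {upper:.3f})")
```

A smaller α (for example 0.01) widens the interval, since a higher coverage probability requires a larger margin of error.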
Testing Hypotheses
• Alternative Hypothesis is what the test is
trying to prove
– Denoted: Ha or H1
• Null Hypothesis is what the test is
trying to disprove
– Denoted: H0
Scatter Plots and Correlation
• A scatter plot (or scatter diagram) is used
to show the relationship between two
variables
• Correlation analysis is used to measure
strength of the association (linear
relationship) between two variables
– Only concerned with strength of the
relationship
– No causal effect is implied
Correlation Coefficient
• The population correlation coefficient ρ (rho)
measures the strength of the association between
the variables
• The sample correlation coefficient r is an estimate
of ρ and is used to measure the strength of the
linear relationship in the sample observations
Features of ρ and r
• Unit free
• Range between -1 and 1
• The closer to -1, the stronger the
negative linear relationship
• The closer to 1, the stronger the positive
linear relationship
• The closer to 0, the weaker the linear
relationship
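The unit-free property can be checked numerically. The sketch below computes r from its usual definition and shows that rescaling x (say, meters to centimeters) leaves r unchanged; the function name and data are hypothetical.

```python
import math

def sample_correlation(x, y):
    """Sample correlation coefficient r between two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                # hypothetical data
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r = sample_correlation(x, y)
r_rescaled = sample_correlation([100 * xi for xi in x], y)  # change of units
print(f"r = {r:.4f}, r after rescaling x = {r_rescaled:.4f}")
```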
Significance Test for Correlation
• Hypotheses
H0: ρ = 0 (no correlation)
HA: ρ ≠ 0 (correlation exists)
• Test statistic
$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
(with n – 2 degrees of freedom)
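A minimal sketch of the test statistic for H0: ρ = 0, using hypothetical values r = 0.6 and n = 27. The function name is illustrative; the resulting t would be compared with a critical value from a t table with n – 2 degrees of freedom.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

t = correlation_t_statistic(r=0.6, n=27)   # hypothetical sample values
print(f"t = {t:.2f} with {27 - 2} degrees of freedom")  # t is about 3.75
```

With 25 degrees of freedom the two-sided 5% critical value from a t table is roughly 2.06, so in this hypothetical case H0 would be rejected.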
Correlation and Causation
• Must be very careful in interpreting correlation coefficients
• Just because two variables are highly correlated does not mean that one causes the other
– Ice cream sales and the number of shark attacks on swimmers are correlated
– The number of cavities in elementary school children and vocabulary size have a strong positive correlation.
• To establish causation, a designed experiment must be run
CORRELATION DOES NOT IMPLY CAUSATION
Regression Analysis
• Basic Idea: Fit a straight line that relates dependent variable (y) and
independent variable (x)
• Linearity Assumption: Slope of the equation does not change as x
changes
• Assuming linearity we can write
$$y = \beta_0 + \beta_1 x + \varepsilon$$
which says that y is made up of a predictable part (due
to x) and an unpredictable part
• Coefficients are interpreted as the true, underlying intercept and slope
Population Linear Regression
The population regression model:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
where:
y = Dependent variable
x = Independent variable
β0 = Population y intercept
β1 = Population slope coefficient
ε = Random error term, or residual
β0 + β1x is the linear component; ε is the random error component
Regression Assumptions
We start by assuming that for each value of X, the corresponding value of Y is random and has a normal distribution.
Linear Regression Assumptions
• Error values (ε) are statistically independent
• Error values are normally distributed for any given value of x
• The probability distribution of the errors has constant variance
• The underlying relationship between the x variable and the y variable is linear
Estimated Regression Model
The sample regression line provides an estimate of
the population regression line:
$$\hat{y}_i = b_0 + b_1 x$$
where:
ŷi = Estimated (or predicted) y value
b0 = Estimate of the regression intercept
b1 = Estimate of the regression slope
x = Independent variable
The individual random error terms ei have a mean of zero
Least Squares Criterion
• This method gives a best-fitting straight
line by minimizing the sum of the squares
of the vertical deviations about the line
• b0 and b1 are obtained by finding the
values of b0 and b1 that minimize the
sum of the squared residuals
$$\sum e^2 = \sum (y - \hat{y})^2 = \sum \left(y - (b_0 + b_1 x)\right)^2$$
The Least Squares Equation
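• The formulas for b1 and b0 are:
$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
algebraic equivalent:
$$b_1 = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$$
and
$$b_0 = \bar{y} - b_1 \bar{x}$$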
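The two slope formulas can be checked against each other on a small data set. This is only a sketch with hypothetical data and illustrative function names; in practice the coefficients come from statistical software.

```python
def least_squares(x, y):
    """Return (b0, b1) using the deviation-from-mean formula for the slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def slope_algebraic(x, y):
    """Slope from the algebraically equivalent formula based on raw sums."""
    n = len(x)
    numerator = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    denominator = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n
    return numerator / denominator

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")                 # b0 = 2.20, b1 = 0.60
print(f"algebraic slope = {slope_algebraic(x, y):.2f}")  # 0.60
```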
Interpretation of the Slope and the Intercept
• b0 is the estimated average value of
y when the value of x is zero
• b1 is the estimated change in the
average value of y as a result of a
one-unit change in x
Finding the Least Squares Equation
• The coefficients b0 and b1 will
usually be found using computer
software, such as STATA
• Other regression measures will also
be computed as part of computer-
based regression analysis
Least Squares Regression Properties
• The sum of the residuals from the least squares
regression line is 0: $\sum (y - \hat{y}) = 0$
• The sum of the squared residuals, $\sum (y - \hat{y})^2$, is a minimum
• The simple regression line always passes through
the mean of the y variable and the mean of the x
variable
• The least squares coefficients are unbiased
estimates of β0 and β1
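The first three properties can be verified numerically on a fitted line. The sketch below uses hypothetical data and illustrative names; it checks that the residuals sum to zero, that the line passes through (x̄, ȳ), and that nudging the intercept raises the sum of squared residuals.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

def sse(b0, b1, x, y):
    """Sum of squared residuals for the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3, 4, 5]   # hypothetical data
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

print("sum of residuals:", round(sum(residuals), 10))      # zero up to rounding
print("line at x_bar:", round(b0 + b1 * x_bar, 10), "y_bar:", y_bar)
print("SSE at fit:", round(sse(b0, b1, x, y), 4),
      "SSE with shifted intercept:", round(sse(b0 + 0.1, b1, x, y), 4))
```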
Explained and Unexplained Variation
• Total variation is made up of two parts:
$$SST = SSE + SSR$$
$$SST = \sum (y - \bar{y})^2 \qquad SSE = \sum (y - \hat{y})^2 \qquad SSR = \sum (\hat{y} - \bar{y})^2$$
where:
ȳ = Average value of the dependent variable
y = Observed values of the dependent variable
ŷ = Estimated value of y for the given x value
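A short sketch of the decomposition on hypothetical data, confirming that the explained and unexplained parts add up to the total variation. The function name is illustrative.

```python
def variation_decomposition(x, y):
    """Return (SST, SSR, SSE) for a simple least squares regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)                  # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)              # explained
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))     # unexplained
    return sst, ssr, sse

sst, ssr, sse = variation_decomposition([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"SST = {sst:.1f}, SSR = {ssr:.1f}, SSE = {sse:.1f}")  # 6.0, 3.6, 2.4
```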
Coefficient of Determination, R2
• The coefficient of determination is the
portion of the total variation in the
dependent variable that is explained by
variation in the independent variable
• The coefficient of determination is also
called R-squared and is denoted as R2
$$R^2 = \frac{SSR}{SST} \quad \text{where } 0 \le R^2 \le 1$$
Coefficient of Determination, R2 (continued)
$$R^2 = \frac{SSR}{SST} = \frac{\text{sum of squares explained by regression}}{\text{total sum of squares}}$$
Note: In the single independent variable case, the coefficient
of determination is
$$R^2 = r^2$$
where:
R2 = Coefficient of determination
r = Simple correlation coefficient
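The relationship between R2 and r in the single-variable case can be confirmed numerically. The sketch below computes R2 as SSR/SST and r from its definition, on hypothetical data with an illustrative function name.

```python
import math

def r_squared_and_r(x, y):
    """Return (R^2, r) for a simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)          # this is SST
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    ssr = sum((b0 + b1 * xi - y_bar) ** 2 for xi in x)
    return ssr / syy, sxy / math.sqrt(sxx * syy)

r_squared, r = r_squared_and_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"R^2 = {r_squared:.4f}, r = {r:.4f}, r^2 = {r ** 2:.4f}")
```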
Standard Error of Estimate
• The standard deviation of the variation of
observations around the regression line is
estimated by
$$s_\varepsilon = \sqrt{\frac{SSE}{n - k - 1}}$$
where:
SSE = Sum of squares error
n = Sample size
k = number of independent variables in the model
The Standard Deviation of the Regression Slope
• The standard error of the regression slope
coefficient (b1) is estimated by
$$s_{b_1} = \frac{s_\varepsilon}{\sqrt{\sum (x - \bar{x})^2}} = \frac{s_\varepsilon}{\sqrt{\sum x^2 - \dfrac{(\sum x)^2}{n}}}$$
where:
sb1 = Estimate of the standard error of the least squares slope
sε = Sample standard error of the estimate, $s_\varepsilon = \sqrt{\dfrac{SSE}{n - 2}}$
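Both standard errors can be computed directly from the formulas above. This is a sketch for the simple regression case (k = 1), with hypothetical data and an illustrative function name.

```python
import math

def slope_standard_error(x, y):
    """Return (s_e, s_b1) for a simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))   # k = 1, so n - k - 1 = n - 2
    s_b1 = s_e / math.sqrt(sxx)      # standard error of the slope
    return s_e, s_b1

s_e, s_b1 = slope_standard_error([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"s_e = {s_e:.4f}, s_b1 = {s_b1:.4f}")  # s_e = 0.8944, s_b1 = 0.2828
```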
Inference about the Slope: t Test
• t test for a population slope
– Is there a linear relationship between x and y?
• Null and alternative hypotheses
– H0: β1 = 0 (no linear relationship)
– H1: β1 ≠ 0 (linear relationship does exist)
• Test statistic
$$t = \frac{b_1 - \beta_1}{s_{b_1}}, \qquad \text{d.f.} = n - 2$$
where:
b1 = Sample regression slope coefficient
β1 = Hypothesized slope
sb1 = Estimator of the standard error of the slope
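The test statistic is a one-line calculation once b1 and its standard error are known. The values below are hypothetical and the function name is illustrative.

```python
def slope_t_statistic(b1, s_b1, beta1_hypothesized=0.0):
    """t statistic for H0: beta1 = beta1_hypothesized (d.f. = n - 2)."""
    return (b1 - beta1_hypothesized) / s_b1

# Hypothetical estimates: slope 0.6 with standard error 0.2828
t = slope_t_statistic(b1=0.6, s_b1=0.2828)
print(f"t = {t:.3f}")  # compare with a t-table critical value for n - 2 d.f.
```

The hypothesized slope defaults to zero because the usual question is whether any linear relationship exists; a different value can be passed to test other hypotheses.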
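The Multiple Regression Model
Idea: Examine the linear relationship between
1 dependent (y) & 2 or more independent variables (xi)
Population model:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$
where β0 is the y-intercept, β1, …, βk are the population slopes, and ε is the random error
Estimated multiple regression model:
$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$
where ŷ is the estimated (or predicted) value of y, b0 is the estimated intercept, and b1, …, bk are the estimated slope coefficients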
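To make the estimated model concrete, the sketch below fits b0, b1, …, bk by solving the normal equations in plain Python. This is one standard way to obtain least squares estimates, not necessarily what any particular software package does internally; the function names and the data (generated exactly from y = 1 + 2·x1 + 3·x2) are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_multiple_regression(rows, y):
    """rows: list of [x1, ..., xk] observations; returns [b0, b1, ..., bk]."""
    design = [[1.0] + list(r) for r in rows]          # prepend intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

rows = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]       # hypothetical (x1, x2)
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]          # no error term, for checking
coefficients = fit_multiple_regression(rows, y)
print([round(c, 6) for c in coefficients])            # close to [1, 2, 3]
```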
Multiple Regression Assumptions
Errors (residuals) from the regression model:
$$e = y - \hat{y}$$
• The model errors are independent and random
• The errors are normally distributed
• The mean of the errors is zero
• Errors have a constant variance
Adjusted R2
• R2 never decreases when a new x variable is
added to the model
– This can be a disadvantage when comparing
models
• What is the net effect of adding a new variable?
– We lose a degree of freedom when a new x
variable is added
– Did the new x variable add enough
explanatory power to offset the loss of one
degree of freedom?
Adjusted R2 (continued)
• Shows the proportion of variation in y explained by all x variables adjusted for the number of x variables used
$$R_A^2 = 1 - (1 - R^2)\left(\frac{n - 1}{n - k - 1}\right)$$
(where n = sample size, k = number of independent variables)
– Penalizes excessive use of unimportant independent variables
– Smaller than R2
– Useful in comparing among models
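The penalty for extra variables is easy to see with numbers. In the sketch below the R2 values, sample size, and function name are hypothetical: adding a fourth variable raises R2 slightly but lowers adjusted R2.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

three_vars = adjusted_r_squared(r2=0.800, n=30, k=3)   # hypothetical model A
four_vars = adjusted_r_squared(r2=0.805, n=30, k=4)    # hypothetical model B
print(f"k = 3: {three_vars:.4f}")   # 0.7769
print(f"k = 4: {four_vars:.4f}")    # 0.7738
```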
Is the Model Significant?
• F-Test for Overall Significance of the Model
• Shows if there is a linear relationship between all
of the x variables considered together and y
• Use F test statistic
• Hypotheses:
– H0: β1 = β2 = … = βk = 0 (no linear relationship)
– HA: at least one βi ≠ 0 (at least one independent variable affects y)
F-Test for Overall Significance (continued)
• Test statistic:
$$F = \frac{SSR / k}{SSE / (n - k - 1)} = \frac{MSR}{MSE}$$
where F has (numerator) D1 = k and
(denominator) D2 = (n – k – 1)
degrees of freedom
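A minimal sketch of the F statistic computed from the sums of squares. The inputs are hypothetical and the function name is illustrative; the result would be compared with an F-table critical value for (k, n – k – 1) degrees of freedom.

```python
def f_statistic(ssr, sse, n, k):
    """Overall F statistic with k and n - k - 1 degrees of freedom."""
    msr = ssr / k                # mean square regression
    mse = sse / (n - k - 1)      # mean square error
    return msr / mse

f = f_statistic(ssr=3.6, sse=2.4, n=5, k=1)   # hypothetical sums of squares
print(f"F = {f:.2f} with (1, 3) degrees of freedom")  # F = 4.50
```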
Are Individual Variables Significant?
• Use t-tests of individual variable slopes
• Shows if there is a linear relationship between the
variable xi and y
• Hypotheses:
– H0: βi = 0 (no linear relationship)
– HA: βi ≠ 0 (linear relationship does exist between xi and y)
Are Individual Variables Significant? (continued)
• Test statistic:
$$t = \frac{b_i - 0}{s_{b_i}} \qquad (\text{df} = n - k - 1)$$
Standard Deviation of the Regression Model
• The estimate of the standard deviation of the
regression model is:
$$s_\varepsilon = \sqrt{\frac{SSE}{n - k - 1}} = \sqrt{MSE}$$
• Is this value large or small? It must be compared to the
mean size of y
Multicollinearity
• Multicollinearity: High correlation exists
between two independent variables
• This means the two variables contribute
redundant information to the multiple
regression model
Detect Collinearity (Variance Inflationary Factor)
VIFj is used to measure collinearity:
$$VIF_j = \frac{1}{1 - R_j^2}$$
R2j is the coefficient of determination when the jth
independent variable is regressed against the
remaining k – 1 independent variables
If VIFj ≥ 10, xj is highly correlated with
the other explanatory variables
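The sketch below shows how quickly VIF grows as R2j approaches 1. The R2j values are hypothetical and the function name is illustrative; obtaining R2j itself requires regressing xj on the other independent variables.

```python
def variance_inflation_factor(r2_j):
    """VIF for variable j, given R^2 from regressing x_j on the other x's."""
    return 1.0 / (1.0 - r2_j)

for r2_j in (0.0, 0.5, 0.9, 0.95):   # hypothetical auxiliary R^2 values
    print(f"R2_j = {r2_j:.2f} -> VIF = {variance_inflation_factor(r2_j):.1f}")
```

An R2j of 0.9 corresponds to the VIF threshold of 10 used above.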
Qualitative (Dummy) Variables
• Categorical explanatory variable (dummy variable) with two or more levels:
– yes or no, on or off, male or female
– coded as 0 or 1
• Regression intercepts are different if the variable is significant
• Assumes equal slopes for other variables
• The number of dummy variables needed is (number of levels – 1)
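The (number of levels – 1) rule can be illustrated with a small coding routine. This is a sketch with a hypothetical three-level variable and illustrative names; the omitted baseline level is the one represented by all dummies equal to 0.

```python
def dummy_code(values, baseline):
    """Code a categorical variable as 0/1 dummies, omitting the baseline level."""
    levels = sorted(set(values) - {baseline})   # one dummy per non-baseline level
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]

regions = ["north", "south", "west", "south"]   # hypothetical data, 3 levels
coded = dummy_code(regions, baseline="north")
for region, row in zip(regions, coded):
    print(region, row)
```

Three levels yield two dummy variables; including a third would make the dummies perfectly collinear with the intercept.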