Simple Regression

Learning Objectives

• Conduct a descriptive statistics investigation prior to running a simple regression analysis

• Check simple regression usage assumptions. Address violations to assumptions

• Run a simple regression analysis. Interpret simple regression table output

• Generate resulting simple regression equation from regression table output

Simple Regression

• Examines relation between 2 variables (explanatory and response variable)

• Begin analysis with a descriptive statistic analysis (numerical and graphical) of each variable

• Ensure the simple regression usage assumptions are met prior to running analysis in Excel

• It is possible that a transformation to one or both variables may need to be applied prior to running the analysis

Simple Regression

• Examines relation between 2 variables (explanatory variable (x) and response variable (y))

  • Explanatory variable is examined to determine how well it serves as a predictor for values of the response variable

• Used to quantify the strength of the linear relation between the two variables

• The sample regression equation, generated from the analysis, is dependent on the sample used and is an estimate of the true population regression equation

The equation summarizes the overall dot pattern observed in a bivariate scatter plot of the two variables

Sample Simple Regression Line: $\hat{y} = b_0 + b_1 x$, where $b_0$ is the intercept ($\hat{B}_0$ is sometimes used) and $b_1$ is the slope ($\hat{B}_1$ is sometimes used)
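The sample regression line can also be estimated outside of Excel. The sketch below is a minimal least-squares fit using only the Python standard library. The data values and the function name `fit_simple_regression` are hypothetical and chosen purely for illustration.

```python
from statistics import mean

def fit_simple_regression(x, y):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line always passes through (x_bar, y_bar).
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical sample: x = explanatory values, y = response values.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_regression(x, y)
print(f"y_hat = {b0:.3f} + {b1:.3f} * x")
```

A different sample from the same population would give slightly different values of b0 and b1, which is why the sample equation is only an estimate of the population equation.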

Sample Simple Linear Regression Coefficients

Intercept b0

• It is the value where the regression line crosses the y-axis on the graph

• Value of the sample regression equation when the explanatory variable x = 0

• It is the expected mean value of the response variable when x=0

• If in practice the explanatory variable never has a value of zero, then the intercept is not interpreted

  • If x=0 is outside of the range of observed data used in the sample, do not interpret the intercept

Example interpretation of the regression intercept (sales are in thousands of dollars)

Sales dollars per month = 3.7 (number of employees) + 203.5

Interpretation : If there are no employees, the equation predicts that sales for the month will be on average $203,500.


Slope b1

• The regression slope coefficient is the mean amount of change in the response variable that the equation predicts as the explanatory variable is increased by 1 unit

• The sign of the regression slope coefficient indicates the direction (positive or negative) of the linear relation between the 2 variables

• If the regression slope coefficient is very near 0, then as one variable increases, the other remains fairly constant. Therefore, there is no meaningful predictive relationship present

Example interpretation of the regression slope (sales are in thousands of dollars)

Sales dollars per month = 3.7 (number of employees) + 203.5

Interpretation: If the number of employees increases by 1, the equation predicts the sales dollars per month on average will increase by approximately $3,700.
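Both example interpretations can be checked by evaluating the equation directly. The sketch below uses the coefficients 3.7 and 203.5 from the example. The function name and the employee counts 10 and 11 are arbitrary choices for illustration.

```python
def predicted_sales_thousands(num_employees):
    """Predicted monthly sales, in thousands of dollars, from the example equation."""
    slope = 3.7        # b1: change in predicted sales per additional employee
    intercept = 203.5  # b0: predicted sales when num_employees = 0
    return slope * num_employees + intercept

# Intercept: prediction when the explanatory variable is 0.
at_zero = predicted_sales_thousands(0)

# Slope: difference between predictions one employee apart
# (10 and 11 are arbitrary illustration values).
change = predicted_sales_thousands(11) - predicted_sales_thousands(10)

print(f"Predicted sales with no employees: ${at_zero * 1000:,.0f}")
print(f"Predicted change per added employee: ${change * 1000:,.0f}")
```

The change is the same for any pair of employee counts that differ by 1, because the slope of a straight line is constant.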


• Results can be read in the simple regression Excel output table

Simple Regression Usage Assumptions

• The relation between the explanatory and response variables is linear

  • Check the scatter plot. Should have a linear dot pattern

• The residuals (errors in prediction) are not related to the predicted values of the response variable

  • Check a scatterplot of residuals versus predicted values (residual plot). Should have a scattered dot pattern. Check the correlation between these values. Correlation should be very near zero

• Residuals are normally distributed

  • Check a histogram of the residuals. Check skewness numbers. Check normal probability plots

• Variance of the residuals should be statistically the same across all values of the explanatory variable

  • Check a scatterplot of residuals versus predicted values (residual plot). Should have a scattered dot pattern where the variance across values is consistent
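The numerical parts of these checks can be sketched in a few lines. The example below fits a line to a hypothetical sample and then computes the correlation between residuals and predicted values, the skewness of the residuals, and a rough comparison of residual spread in the lower and upper halves of x. All data values and function names are assumptions made for illustration. The skewness here is the simple population formula, so it may differ slightly from a spreadsheet function that applies a small-sample adjustment.

```python
from math import sqrt
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for y_hat = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def correlation(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    u_bar, v_bar = mean(u), mean(v)
    num = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    den = sqrt(sum((a - u_bar) ** 2 for a in u) * sum((b - v_bar) ** 2 for b in v))
    return num / den

def skewness(values):
    """Population skewness: near 0 for a symmetric distribution."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical sample, already sorted by x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.1, 4.8, 7.2, 8.9, 11.1, 12.8, 15.2, 16.9]

b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

resid_pred_corr = correlation(residuals, predicted)
resid_skew = skewness(residuals)

# Rough equal-variance check: mean squared residual in each half of the x range.
half = len(residuals) // 2
spread_low = mean([r ** 2 for r in residuals[:half]])
spread_high = mean([r ** 2 for r in residuals[half:]])

print(f"corr(residuals, predicted) = {resid_pred_corr:.6f}")
print(f"skewness of residuals      = {resid_skew:.3f}")
print(f"spread (low x, high x)     = {spread_low:.4f}, {spread_high:.4f}")
```

For a least-squares line with an intercept, the correlation between residuals and predicted values comes out as zero up to rounding, so the residual plot itself usually carries more information than this number.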

Residuals

The estimates for the response variable made through the regression line are called the predicted values

• To find residuals, subtract the predicted value (calculated with the regression equation) from the observed value

• Negative residual: the regression equation is overestimating

• Positive residual: the regression equation is underestimating

• The best-fitting line is the one with the smallest sum of squared residuals (least squares)
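The residual calculation can be illustrated with a short sketch. It fits a least-squares line to a hypothetical sample, labels each residual as an overestimate or underestimate, and compares the sum of squared residuals with that of a deliberately shifted line. The data, the function names, and the 0.5 shift are assumptions for illustration only.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for y_hat = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def sum_squared_residuals(x, y, b0, b1):
    """Sum of (observed - predicted)^2 for the line y_hat = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical sample.
x = [1, 2, 3, 4, 5]
y = [2.0, 4.5, 5.5, 8.5, 9.5]

b0, b1 = fit_line(x, y)

for xi, yi in zip(x, y):
    predicted = b0 + b1 * xi
    residual = yi - predicted  # observed minus predicted
    if residual > 0:
        label = "equation is underestimating"
    elif residual < 0:
        label = "equation is overestimating"
    else:
        label = "exact prediction"
    print(f"x={xi}: observed={yi:.1f}, predicted={predicted:.1f}, "
          f"residual={residual:+.1f} ({label})")

# Any other line has a larger sum of squared residuals than the least-squares line.
sse_best = sum_squared_residuals(x, y, b0, b1)
sse_shifted = sum_squared_residuals(x, y, b0 + 0.5, b1)  # intercept shifted by 0.5
print(f"SSE least-squares line: {sse_best:.2f}")
print(f"SSE shifted line:       {sse_shifted:.2f}")
```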

[Figure: Scatter plot check for linearity]

[Figure: Residual plot check. Panels: assumptions met (scattered dot pattern and equal variance); assumptions not met (two examples)]

[Figure: Normality plot check. Panels: normal, skewed left, skewed right, thick tailed, thin tailed]

Assumptions are met if the residual dot pattern closely follows the line

[Figure: Histogram normality check. Panels: near normal, right skewed, left skewed]

Options to Fix linearity problems

• Increase the sample size

• Check for outliers and remove them if warranted, or assign new values through imputation (beyond the scope of the course)

• Apply a transformation to the entire data set for one or both variables

  • Trying transformations may be an iterative process of 'try and check' in order to meet assumptions. Try fixing the linearity challenge first and then proceed to the other fixes as needed

• As transformations are applied, be sure to re-check all assumptions to ensure that they all hold after application of the transformation

Applying a transformation

Often trial and error. No one technique works all the time

• Transform the explanatory (x) values only

  • Try if linearity is the only assumption violation

• Transform the response (y) values only

  • Try when non-normality and/or unequal variances are the assumption violations

• Transform both sets of values

  • Try when nonlinearity, non-normality, and unequal variances are the assumption violations

Frequently used Transformations

• In simple regression, try looking at the scatter plot dot pattern for hints on which transformation to try

• Try LN transformation

• Try reciprocal (1/x) or exponential (e^(-x)) transformation
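As a rough illustration of how a transformation can straighten a curved dot pattern, the sketch below builds hypothetical data in which y rises quickly and then levels off, and compares the correlation before and after applying LN to the x values. The data are constructed so that LN(x) is exactly linear in y, which real data will not be. The function name and values are assumptions for illustration.

```python
import math
from statistics import mean

def correlation(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    u_bar, v_bar = mean(u), mean(v)
    num = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    den = math.sqrt(sum((a - u_bar) ** 2 for a in u) * sum((b - v_bar) ** 2 for b in v))
    return num / den

# Hypothetical curved data: y = 2 + 3 * LN(x), so the raw scatter plot bends.
x = [1, 2, 4, 8, 16, 32]
y = [2 + 3 * math.log(xi) for xi in x]

# Candidate transformations of the explanatory variable.
ln_x = [math.log(xi) for xi in x]
reciprocal_x = [1 / xi for xi in x]
exp_neg_x = [math.exp(-xi) for xi in x]

r_original = correlation(x, y)
r_ln = correlation(ln_x, y)

print(f"correlation using x:     {r_original:.4f}")
print(f"correlation using LN(x): {r_ln:.4f}")
```

The reciprocal and exponential lists are built the same way and can be passed to `correlation` to compare them. After any transformation, every assumption still needs to be re-checked on the transformed data.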

Other Frequently used Transformations

• Power transformations transform the response variable to some power (usually between -1 and 2)

  • Try if the residual variances are unequal and/or residuals are not normal

• A square root or reciprocal transformation can be applied to the response variable

  • Try if the residual variances are unequal

Which tried transformation should you use?

• Run the simple regression analysis, in Excel, to generate the regression output table for each transformation

• Examine the adjusted R squared value for each transformation and for the original data

• Select the data set with the largest adjusted R squared value (or the smallest standard error value)
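The selection step can be sketched as a comparison of adjusted R squared across candidate versions of the explanatory variable. The sketch assumes one explanatory variable, so adjusted R squared is computed as 1 - (1 - R^2)(n - 1)/(n - 2). The data and names are hypothetical.

```python
import math
from statistics import mean

def adjusted_r_squared(x, y):
    """Adjusted R^2 for a simple regression of y on x (one explanatory variable)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    r_squared = sxy ** 2 / (sxx * syy)
    return 1 - (1 - r_squared) * (n - 1) / (n - 2)

# Hypothetical sample with a curved dot pattern.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 3.1, 4.2, 5.1, 5.8, 6.3, 6.8, 7.2]

# Original data plus two transformed versions of the explanatory variable.
candidates = {
    "original x": x,
    "LN(x)": [math.log(v) for v in x],
    "1/x": [1 / v for v in x],
}

scores = {name: adjusted_r_squared(values, y) for name, values in candidates.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name:>10}: adjusted R^2 = {score:.4f}")
print(f"Largest adjusted R^2: {best}")
```

In this sketch only x is transformed, so every candidate predicts the same response values and the adjusted R squared values are directly comparable.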

What to check in the regression output table

• The overall significance of the regression equation (in the ANOVA table)

  • A significance F value (p-value) less than .05 suggests the model is a good fit to the data

• The significance of the regression coefficients (in the coefficient table)

  • A p-value less than .05 suggests the explanatory variable is a good predictor of the response variable

Approach we will use in this class

• If regression equation is not significant, the model is not a good fit to the data

• If the regression model is significant but the coefficient is not, the model provides improved fit over using the expected value of the response variable as the estimated prediction

• If the regression model is significant and the coefficient is significant, the model is a good fit for the data and the explanatory variable is contributing significantly towards the quality of prediction
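The three cases above amount to a simple decision rule. The sketch below encodes it with the .05 cutoff from these notes. The function name, the returned wording, and the input values are illustrative assumptions, not output from any real regression table.

```python
def interpret_regression(significance_f, coefficient_p_value, alpha=0.05):
    """Apply the three-case approach; alpha = 0.05 is the cutoff used in these notes."""
    if significance_f >= alpha:
        return "Model is not a good fit to the data."
    if coefficient_p_value >= alpha:
        return ("Model is significant but the coefficient is not: improved fit over "
                "using the expected value of the response variable.")
    return ("Model and coefficient are both significant: good fit, and the "
            "explanatory variable contributes significantly to prediction.")

# Illustrative inputs only.
case_a = interpret_regression(significance_f=0.20, coefficient_p_value=0.30)
case_b = interpret_regression(significance_f=0.0003, coefficient_p_value=0.09)
case_c = interpret_regression(significance_f=0.0003, coefficient_p_value=0.01)

for label, text in (("A", case_a), ("B", case_b), ("C", case_c)):
    print(f"Case {label}: {text}")
```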

What to check in the regression output table

Interpretation: Because the p-value .09 > .05, results suggest that the East variable in the regression equation is not contributing significantly to the prediction of the response variable.

Interpretation: Because the significance F value .0003 < .05, results suggest that the regression equation is a good fit to the data and provides meaningful prediction of the response variable given the explanatory variables in the model.

Other numerical summaries in the regression output table to check

• Correlation Coefficient: Strength of Linear Relation

• Coefficient of Determination (R^2): use Adjusted R^2

  • Fraction of the variation in the data accounted for by the regression equation. Values are between 0 and 1 in the table but are sometimes reported as percentages

• SEE (Standard Error of the Estimate)

  • Standard deviation of the residuals

  • How spread out the observations are from the regression line
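These summaries can be reproduced by hand for a small sample. The sketch below computes the correlation coefficient, R squared, adjusted R squared, and the standard error of the estimate for hypothetical data. It assumes the usual simple regression convention of dividing the sum of squared residuals by n - 2 when computing the standard error. The function name and data are illustrative.

```python
from math import sqrt
from statistics import mean

def regression_summaries(x, y):
    """Return r, R^2, adjusted R^2, and SEE for a simple regression of y on x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Sum of squared residuals (observed minus predicted).
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - sse / syy
    return {
        "r": sxy / sqrt(sxx * syy),
        "r_squared": r_squared,
        "adjusted_r_squared": 1 - (1 - r_squared) * (n - 1) / (n - 2),
        "see": sqrt(sse / (n - 2)),  # n - 2: two coefficients were estimated
    }

# Hypothetical sample.
x = [1, 2, 3, 4, 5]
y = [2.0, 4.5, 5.5, 8.5, 9.5]

summary = regression_summaries(x, y)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

In simple regression, R squared equals the square of the correlation coefficient, and adjusted R squared is never larger than R squared.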

Writing the Simple Regression Equation

Report the regression equation ONLY if all simple regression assumptions are met

Use the values in the coefficients column

Response Variable = −1.76 (Price of Roses) + 183475.43

Simple Regression Equation

Writing the Simple Regression Equation

Be sure to name the transformed variable, if needed, when writing the regression equation

Response Variable = −3.67 LN(Price of Roses) + 32.58

Simple Regression Equation with LN transformation on the explanatory variable
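When the equation includes a transformed variable, the same transformation has to be applied to any new value before predicting. The sketch below uses the coefficients −3.67 and 32.58 from the LN example. The function name and the price values 1.0, 2.0, and 4.0 are hypothetical, since the notes do not give the units of the variables.

```python
import math

def predicted_response(price_of_roses):
    """Prediction from the example equation with LN applied to the explanatory variable."""
    return -3.67 * math.log(price_of_roses) + 32.58

# Hypothetical price values for illustration.
at_one = predicted_response(1.0)  # LN(1) = 0, so the prediction equals the intercept
change_when_doubled = predicted_response(4.0) - predicted_response(2.0)

print(f"Prediction at price 1.0: {at_one:.2f}")
print(f"Change when price doubles from 2.0 to 4.0: {change_when_doubled:.3f}")
```

With an LN-transformed explanatory variable, the slope describes the change per one-unit increase in LN(price), so doubling the price changes the prediction by −3.67 × LN(2) regardless of the starting price.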