QUANT homework
Simple Regression
Learning Objectives
• Conduct a descriptive statistics investigation prior to running a simple regression analysis
• Check simple regression usage assumptions. Address violations to assumptions
• Run a simple regression analysis. Interpret simple regression table output
• Generate resulting simple regression equation from regression table output
Simple Regression
• Examines relation between 2 variables (explanatory and response variable)
• Begin analysis with a descriptive statistic analysis (numerical and graphical) of each variable
• Ensure the simple regression usage assumptions are met prior to running analysis in Excel
• It is possible that a transformation to one or both variables may need to be applied prior to running the analysis
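As a minimal sketch of the numerical part of that descriptive step, the Python below summarizes each variable before any regression is run. The data values and the names `employees`, `sales`, and `describe` are made up for illustration; the course itself works in Excel.

```python
import statistics

def describe(values):
    """Return basic descriptive statistics for one variable."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    # Adjusted sample skewness (the formula used by Excel's SKEW); needs n >= 3
    skew = n / ((n - 1) * (n - 2)) * sum(((v - mean) / sd) ** 3 for v in values)
    return {
        "n": n, "mean": mean, "median": statistics.median(values),
        "sd": sd, "min": min(values), "max": max(values), "skew": skew,
    }

# Hypothetical sample: number of employees (x) and monthly sales in $1k (y)
employees = [2, 4, 5, 7, 8, 10, 12]
sales = [211, 218, 222, 230, 232, 241, 248]

for name, data in (("employees", employees), ("sales", sales)):
    print(name, describe(data))
```

A skewness value far from 0 for either variable is one hint that a transformation may be worth trying later.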
Simple Regression
• Examines relation between 2 variables (explanatory variable (x) and response variable (y))
  • Explanatory variable is examined to determine how well it serves as a predictor for values of the response variable
• Used to quantify the strength of the linear relation between the two variables
• The sample regression equation, generated from the analysis, is dependent on the sample used and is an estimate of the true population regression equation
• The equation summarizes the overall dot pattern observed in a bivariate scatter plot of the two variables
Sample Simple Regression Line
ŷ = b0 + b1 x
(B0 is sometimes used in place of b0)
(B1 is sometimes used in place of b1)
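A minimal sketch of how the sample regression line y-hat = b0 + b1 x can be estimated by least squares, assuming the standard textbook formulas (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x). The function name `fit_line` and the data are hypothetical; Excel's regression tool reports the same two coefficients.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for the line y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared x deviations
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1

# Hypothetical data that lies exactly on y = 1 + 2x
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

b0, b1 = fit_line(x, y)
print(f"y-hat = {b0:.2f} + {b1:.2f} x")
```

Because a different sample would give different values of b0 and b1, the result is an estimate of the population equation, not the population equation itself.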
Sample Simple Linear Regression Coefficients
Intercept b0
• It is the value where the regression line crosses the y-axis on the graph
• Value of the sample regression equation when the explanatory variable x = 0
• It is the expected mean value of the response variable when x=0
• If in practice the explanatory variable never has a value of zero, then the intercept is not interpreted
  • If x=0 is outside of the range of observed data used in the sample, do not interpret the intercept
Example interpretation of the regression intercept (sales dollars are in units of $1k)
Sales dollars per month = 3.7 (number of employees) + 203.5
Interpretation: If there are no employees, the equation predicts that sales for the month will be on average $203,500.
Slope b1
• The regression slope coefficient is the mean amount of change in the response variable that the equation predicts as the explanatory variable is increased by 1 unit
• The sign of the regression slope coefficient indicates the direction (positive or negative) of the linear relation between the 2 variables
• If the regression slope coefficient is very near 0, then as one variable increases, the other remains fairly constant. Therefore, there is no meaningful predictive relationship present
Example interpretation of the regression slope (sales dollars are in units of $1k)
Sales dollars per month = 3.7 (number of employees) + 203.5
Interpretation: If the number of employees increases by 1, the equation predicts the sales dollars per month on average will increase by approximately $3,700.
• The intercept b0 and the slope b1 can be read in the coefficients column of the simple regression Excel output table
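The intercept and slope interpretations for the sales example can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function name `predicted_sales` is hypothetical, and sales are taken to be in units of $1k, as the interpretations in the example imply.

```python
B0 = 203.5  # intercept from the example equation, in $1k
B1 = 3.7    # slope from the example equation, in $1k per employee

def predicted_sales(employees):
    """Predicted mean monthly sales (in $1k) for a given number of employees."""
    return B1 * employees + B0

# Intercept: the prediction when the explanatory variable is 0
print(predicted_sales(0))  # 203.5, i.e. $203,500

# Slope: the change in the prediction when employees increase by 1
change = predicted_sales(11) - predicted_sales(10)
print(round(change, 2))  # 3.7, i.e. about $3,700
```

The change in the prediction is the same for any one-unit increase in employees (from 10 to 11, from 50 to 51, and so on), which is what makes the slope a single number.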
Simple Regression Usage Assumptions
• The relation between the explanatory and response variables is linear
  • Check the scatter plot. Should have a linear dot pattern
• The residuals (errors in prediction) do not have a relation with the predicted values of the response variable (they are independent)
  • Check a scatterplot of residuals versus predicted values (residual plot). Should have a scattered dot pattern. Check the correlation between these values. Correlation should be very near zero
• Residuals are normally distributed
  • Check a histogram of the residuals. Check skewness numbers. Check normal probability plots
• Variance of residuals should be statistically the same across all values of the explanatory variable
  • Check a scatterplot of residuals versus predicted values (residual plot). Should have a scattered dot pattern where the variance across values is consistent
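The plots remain the main check, but two of the assumptions also have rough numerical hints. The sketch below, on made-up data, computes the residuals (observed minus predicted), their skewness, and the ratio of residual spread between the upper and lower halves of the predicted values. The spread ratio is an informal heuristic chosen for this illustration, not a formal test and not a method named in these notes.

```python
import statistics

def fit_line(x, y):
    """Least squares intercept b0 and slope b1."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def skewness(values):
    """Adjusted sample skewness; values near 0 hint at a symmetric shape."""
    n = len(values)
    mean, sd = statistics.fmean(values), statistics.stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((v - mean) / sd) ** 3 for v in values)

# Hypothetical data
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Normality hint: skewness of the residuals
print("residual skewness:", round(skewness(residuals), 3))

# Equal variance hint: residual spread for low versus high predicted values
pairs = sorted(zip(predicted, residuals))
half = len(pairs) // 2
low = [r for _, r in pairs[:half]]
high = [r for _, r in pairs[half:]]
spread_ratio = statistics.stdev(high) / statistics.stdev(low)
print("residual spread ratio (high / low):", round(spread_ratio, 2))
```

A skewness far from 0 or a spread ratio far from 1 would be a reason to look more closely at the histogram and residual plot.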
Residuals
• The estimates for the response variable made through the regression line are called the predicted values
• To find residuals, subtract the predicted value (calculated with the regression equation) from the observed value
  • negative residual: the regression equation is overestimating
  • positive residual: the regression equation is underestimating
• Best fitting line means its sum of squared residuals is the smallest (least squares)
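The residual definitions above can be sketched in Python as follows. The data and the helper names `fit_line` and `sum_squared_residuals` are hypothetical; the comparison line (the fitted line with its intercept shifted by 0.5) is only there to show that moving away from the least squares line increases the sum of squared residuals.

```python
def fit_line(x, y):
    """Least squares intercept b0 and slope b1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def sum_squared_residuals(x, y, b0, b1):
    """Sum of (observed - predicted)^2 for the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
for xi, yi in zip(x, y):
    predicted = b0 + b1 * xi
    residual = yi - predicted  # observed minus predicted
    label = "underestimating" if residual > 0 else "overestimating"
    print(f"x={xi}: predicted={predicted:.1f}, residual={residual:+.1f} ({label})")

best = sum_squared_residuals(x, y, b0, b1)
shifted = sum_squared_residuals(x, y, b0 + 0.5, b1)
print(f"least squares line: {best:.2f}, shifted line: {shifted:.2f}")
```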
Scatter plot check for linearity
Residual Plot check
[Figure: residual plots. Panels: Assumptions met (scattered dot pattern and equal variance); Assumptions not met (two examples)]
Normality Plot check
[Figure: normal probability plots. Panels: Normal; Skewed Left; Skewed Right; Thick tailed; Thin tailed]
• Assumptions are met if the residual dot pattern closely follows the line
Histogram Normality check
[Figure: histograms of residuals. Panels: Near Normal; Right Skewed; Left Skewed]
Options to Fix linearity problems
• Increase the sample size
• Check for outliers and remove them if warranted, or assign new values through imputation (beyond the scope of the course)
• Apply a transformation to the entire data set for one or both variables
  • Trying transformations may be an iterative process of ‘try and check’ in order to meet assumptions. Try fixing the linearity challenge first and then proceed on to the other fixes as needed
• As transformations are applied, be sure to re-check all assumptions to ensure that they all hold after application of the transformation
Applying a transformation
Often trial and error. No one technique works all the time
• Transform the explanatory (x) values only.
  • Try if linearity is the only assumption violation
• Transform the response (y) values only.
  • Try when non-normality and/or unequal variances are the assumption violations
• Transform both sets of values.
  • Try when linearity, non-normality, and unequal variances are the assumption violations
Frequently used Transformations
• In simple regression, try looking at the scatter plot dot pattern for hints on which transformation to try
[Figure: example scatter plot dot patterns. Panels: Try LN transformation; Try reciprocal (1/x) or exponential e^(−x) transformation]
Other Frequently used Transformations
• Power transformations transform the response variable to some power (usually between -1 and 2)
  • Try if the residual variances are unequal and/or residuals are not normal
• A square root or reciprocal transformation can be applied to the response variable
  • Try if the residual variances are unequal
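A minimal sketch of applying a transformation to the entire data set for one variable and then re-checking linearity. The helper names `transform` and `correlation` and the data are hypothetical; the y values were constructed to be exactly linear in LN(x), so the effect is deliberately clean.

```python
import math

def transform(values, kind):
    """Apply one transformation to every value of a variable."""
    funcs = {
        "none": lambda v: v,
        "ln": math.log,                  # natural log, needs values > 0
        "reciprocal": lambda v: 1 / v,   # needs values != 0
        "sqrt": math.sqrt,               # needs values >= 0
    }
    return [funcs[kind](v) for v in values]

def correlation(x, y):
    """Pearson correlation coefficient between two variables."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with a curved dot pattern (y levels off as x grows)
x = [1, 2, 4, 8, 16, 32]
y = [3.0, 4.4, 5.8, 7.2, 8.6, 10.0]

r_original = correlation(x, y)
r_ln_x = correlation(transform(x, "ln"), y)  # transform x values only
print("original x:", round(r_original, 3))
print("LN(x):     ", round(r_ln_x, 3))
```

After any transformation, the remaining assumptions (residual pattern, normality, equal variance) still need to be re-checked on the transformed data.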
Which tried transformation should you use?
• Run the simple regression analysis, in Excel, to generate the regression output table for each transformation
• Examine the adjusted R squared value for each transformation and for the original data
• Select the data set with the largest adjusted R squared value (or the smallest standard error value)
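The selection step can be sketched as below, assuming the usual adjusted R squared formula for one explanatory variable, 1 - (1 - R^2)(n - 1)/(n - 2). The data, the candidate list, and the function name `adjusted_r_squared` are hypothetical, and only transformations of the explanatory variable are compared here.

```python
import math

def adjusted_r_squared(x, y):
    """Adjusted R squared for a simple regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    r2 = sxy ** 2 / (sxx * syy)  # in simple regression, R^2 is r squared
    return 1 - (1 - r2) * (n - 1) / (n - 2)

# Hypothetical data
x = [1, 2, 4, 8, 16, 32]
y = [3.1, 4.3, 5.9, 7.1, 8.7, 9.9]

# Original data plus each tried transformation of x
candidates = {
    "original x": x,
    "LN(x)": [math.log(v) for v in x],
    "1/x": [1 / v for v in x],
}

scores = {name: adjusted_r_squared(xs, y) for name, xs in candidates.items()}
for name, score in scores.items():
    print(f"{name}: adjusted R squared = {score:.3f}")

best = max(scores, key=scores.get)
print("selected:", best)
```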
What to check in the regression output table
• The overall significance of the regression equation (in the ANOVA table)
  • A significance and/or p-value less than .05 suggests the model is a good fit to the data
• The significance of the regression coefficients (in the coefficient table)
  • A significance and/or p-value less than .05 suggests the explanatory variable is a good predictor of the response variable
Approach we will use in this class
• If regression equation is not significant, the model is not a good fit to the data
• If the regression model is significant but the coefficient is not, the model provides improved fit over using the expected value of the response variable as the estimated prediction
• If the regression model is significant and the coefficient is significant, the model is a good fit for the data and the explanatory variable is contributing significantly towards the quality of prediction
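The three-way approach above can be written as a small decision function. This is a sketch only: the function name `assess_regression` is hypothetical, and its two inputs are the significance F value from the ANOVA table and the coefficient p-value from the coefficient table, both read from the regression output.

```python
ALPHA = 0.05  # significance cutoff used in these notes

def assess_regression(significance_f, coefficient_p_value, alpha=ALPHA):
    """Apply the class approach to the two values read from the output table."""
    if significance_f >= alpha:
        return "model is not a good fit to the data"
    if coefficient_p_value >= alpha:
        return ("model improves on using the expected value of the response, "
                "but the coefficient is not significant")
    return ("model is a good fit and the explanatory variable "
            "contributes significantly to prediction")

print(assess_regression(0.20, 0.01))
print(assess_regression(0.0003, 0.09))
print(assess_regression(0.001, 0.001))
```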
What to check in the regression output table
Interpretation: Results suggest that because the p-value .09 > .05, the presence of the East variable in the regression equation is not contributing significantly to the prediction of the response variable.
Interpretation: Results suggest that because the significance F value .0003 < .05, the regression equation is a good fit to the data and provides meaningful prediction of the response variable given the presence of the explanatory variables in the model.
Other numerical summaries in the regression output table to check
• Correlation Coefficient: Strength of Linear Relation
• Coefficient of Determination (R^2) - Use Adjusted R^2
  • fraction of the variation in the data accounted for by the regression equation. Values are between 0 and 1 in the table but are sometimes reported as percentages
• SEE (Standard Error of the Estimate)
  • standard deviation of the residuals
  • how spread out the observations are from the regression line
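A minimal sketch of how these summaries relate to the data, on a made-up sample. It assumes the usual simple regression formulas, including an n - 2 divisor for the standard error of the estimate; the function name `regression_summaries` is hypothetical.

```python
import math

def regression_summaries(x, y):
    """Correlation, R^2, adjusted R^2, and standard error of the estimate."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)

    r = sxy / math.sqrt(sxx * syy)   # strength of the linear relation
    r2 = r ** 2                      # fraction of variation accounted for
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)
    sse = syy - sxy ** 2 / sxx       # sum of squared residuals of the fitted line
    see = math.sqrt(sse / (n - 2))   # spread of observations around the line
    return {"r": r, "r2": r2, "adj_r2": adj_r2, "see": see}

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

summary = regression_summaries(x, y)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

The SEE is in the same units as the response variable, so it can be read as a typical size of a prediction error.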
Writing the Simple Regression Equation
Report the regression equation ONLY if all simple regression assumptions are met
Use the values in the coefficients column
Response Variable = −1.76 (Price of Roses) + 183475.43
Simple Regression Equation
Writing the Simple Regression Equation
Be sure to name the transformation variable, if needed, when writing the regression equation
Response Variable = −3.67 LN(Price of Roses) + 32.58
Simple Regression Equation with LN transformation on the explanatory variable
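When the equation was fitted on a transformed explanatory variable, a new x value has to be transformed the same way before it goes into the equation. A short sketch using the coefficients from the LN example above; the function name `predict_response` and the input prices are hypothetical, and the units of the response are not stated in the notes.

```python
import math

def predict_response(price_of_roses):
    """Predicted response from the LN-transformed regression equation."""
    # Apply the LN transformation to the explanatory variable first
    return -3.67 * math.log(price_of_roses) + 32.58

# LN(1) = 0, so the prediction equals the constant term
print(predict_response(1.0))  # 32.58

# For any other price, the slope multiplies LN(price), not the price itself
print(round(predict_response(20.0), 2))
```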