QUANT homework
Introduction
| Ideally the goal is to select explanatory variables to build the simplest regression equation possible that will produce a good estimate of the response variable (parsimony) | ||||
| this to simplify interpretation of the regression coefficients | ||||
| selecting good variables for use in a regression analysis usually involves initially engaging in some research | ||||
| To use linear simple linear regression analysis: There needs to be a linear relation between: | ||||
| 1) the response variable and the explanatory variable | ||||
| Investigate the linear relations by examining: | ||||
| 1) Correlation: the numerical measure of strength in the relation between the explanatory and response variables | ||||
| 2) scatter plot of the response variable against the independent variable | ||||
| the plot can show linear dot pattern , indicating likely need for the explanatory variable to be included in the regression equation | ||||
| the plot can show non-linear dot pattern suggesting a transformation of an explanatory variable may be needed | ||||
| the plot can show a horizontal dot pattern suggesting the variable should not be used in the regression equation | ||||
| 3) residual plots | ||||
| looking for a scattered, cloudy dot pattern with consistent variance at each level of the explanatory variable to help confirm linear relation likely exists | ||||
| if a pattern is observed, this suggests transformations many need to be applied |
correlation
| Response variables (variable of interest) : typically placed on y-axis | ||||||||||
| explanatory variables: typically placed on x-axis | ||||||||||
| Choosing the role of the variables depends on how you think the variables are related | ||||||||||
| Correlation coefficient (r): permits you to quantify the strength of the linear relation between the 2 variables | ||||||||||
| In Excel use =CORREL feature | ||||||||||
| properties of correlation coefficient | ||||||||||
| range from -1 to 1 | ||||||||||
| sign gives the direction of the association | ||||||||||
| unitless measure | ||||||||||
| not affected by changes to the center or scale of the variables | ||||||||||
| depends only on the z-scores | ||||||||||
| sensative to outliers | ||||||||||
| does not imply causation (could be lurking variables standing behind the association) | ||||||||||
| check the scatter plot for straightness even if the value of r is high |
scatter
| Scatterplots | ||||||||||||
| displays relation (association) between 2 quantitative variables | ||||||||||||
| investigor can see patterns, trends, and outliers | ||||||||||||
| 1. Direction of the association | ||||||||||||
| negative | positive | |||||||||||
| 2. Form of the association | ||||||||||||
| nonlinear vs linear | ||||||||||||
| 3. Strength of association | ||||||||||||
| How discernable is the data pattern or trend? | ||||||||||||
| weak, moderate, strong |
residual
| Question: What line can be used to model the linear data pattern displayed in the scatterplot? | |||||||||||||
| the equation of the best fit line is the regression line (model) | |||||||||||||
| the estimates for the response variable made through the regression line are called the predicted values | |||||||||||||
| residuals | |||||||||||||
| To find residuals, subtract the predicted value (calculated with the regression equation) from the observed value | |||||||||||||
| negative residual: the regression equation is overestimating | |||||||||||||
| postive residual: the regression equation is underestimating | |||||||||||||
| Best fitting line means the sum of its least squared residuals is smallest |
Example1A
| Question: Is there a linear relation between the response variable (salary) and housing price? | ||||||||||||||||||||
| 1) correlation (CORELL feature in Excel) | ||||||||||||||||||||
| 2) scatter plot | ||||||||||||||||||||
| 3) residual plot | ||||||||||||||||||||
| 1) Correlation | 0.9086846363 | 3) Residual Plot (Run the regression in Excel, using the Data Analysis Toolbar. Select Residual plot | ||||||||||||||||||
| A correlation of approximately .91 implies that salary and housing price have | ||||||||||||||||||||
| a strong, positive, linear relation where salary increase ad housing price increases in a systematically predictable way | ||||||||||||||||||||
| Salary($10,000) | Housing Price ($10,000) | |||||||||||||||||||
| 1 | 45 | 2) Scatter Plot (select variables and use the Insert .. Charts menu) | ||||||||||||||||||
| 2 | 47 | |||||||||||||||||||
| 3 | 57 | |||||||||||||||||||
| 4.5 | 59 | |||||||||||||||||||
| 5.2 | 60 | |||||||||||||||||||
| 8.4 | 65 | |||||||||||||||||||
| 11 | 66 | |||||||||||||||||||
| 12.4 | 67 | |||||||||||||||||||
| 12.5 | 69 | |||||||||||||||||||
| 13.1 | 70 | |||||||||||||||||||
| 15 | 70 | |||||||||||||||||||
| 17 | 72 | |||||||||||||||||||
| 22.2 | 73 | |||||||||||||||||||
| 22.6 | 74 | There is a pattern (dot pattern is not scattered) evident in the residual plot. This suggests that some other relation, beyond a linear relation | ||||||||||||||||||
| 25.8 | 76 | would likely better represent the relation between salary and housing price. Hence a transformation should be | ||||||||||||||||||
| performed | ||||||||||||||||||||
| From the scatter plot is can be seen that the relation appears to be positive and strong. However there also | ||||||||||||||||||||
| appears to be a slight bend in the dot pattern. Further investigation is needed in order to determine if the | ||||||||||||||||||||
| trend is linear | ||||||||||||||||||||
Salary
Housing Price ($10,000)
Transform1
| Goal: Make the dot pattern of a scatterplot more nearly linear | ||
| calculate transform for the explanatory and the response variables | ||
| look at scatterplots of combinations of transformed and/or non-transformed values | ||
| Can use Adjusted R-Squared to help chose between candidate linear equation models that result | ||
| Non-linear dot patterns in scatterplots and possible transformations to straighten | ||
| year | salary (in $10,000) | LN(salary) |
| 1980 | 1 | |
| 1982 | 2 | |
| 1990 | 3 | |
| 1990 | 4.5 | |
| 1991 | 5.2 | |
| 1996 | 8.4 | |
| 1997 | 11 | |
| 1998 | 12.4 | |
| 1999 | 12.5 | |
| 1999 | 13.1 | |
| 1999 | 15 | |
| 2001 | 17 | |
| 2001 | 22.2 | |
| 2004 | 22.6 | |
| 2005 | 25.8 |
(LN(x), Y) (x, y2) (LN(y), LN(x))
(LN(x), Y) (x, 1/y)
(x, SQRT(y)) (x, LN(y)) (LN(y), LN(x))
Significance
| What else to check in the Regression Output | |||||||||||
| The overall significance of the regression equation (in the ANOVA table) | |||||||||||
| The regression coefficent is the average change in y expected per unit change in the explanatory variable when all other explanatory variables are held constant | |||||||||||
| How significant are the regression equation and/or the coefficients? | |||||||||||
| less than .05 for the significance and/or p-values , 1 or more of the explanatory variables are good predictors of the response variable | |||||||||||
| also, check R-squared (use adjusted R-squared) | |||||||||||
| Standard Approach we will use (1 of the following will be the investigation conclusion) | |||||||||||
| If regression equation is not significant, the model is not a good fit to the data | |||||||||||
| If the regression model is significant but the coefficient is not, the model provides improved fit over using the expected value of the response variable as the estimated prediction | |||||||||||
| If the regression model is significant and the coefficient is significant, the model is a good fit for the data and the explanatory variable is contributing significantly towards the quality of prediction | |||||||||||
| Coefficient significance | Model significance |
Example2
| Is the regression model with the Square Root Transformed response variable salary the better model? Yes according to the analysis (scatter plots, residual plots, adjusted r-squared, and standard error comparisons) | ||||||||||||||||||||||||||
| year (actual) | year (Explanatory) | salary (in $10,000) (Response) | year (Explanatory) | SQRT(Salary) | Compare Scatter Plots : Which looks visually straighter ? | |||||||||||||||||||||
| 1980 | 80 | 1 | 80 | 1 | ||||||||||||||||||||||
| 1982 | 82 | 2 | 82 | 1.4142135624 | ||||||||||||||||||||||
| 1990 | 90 | 3 | 90 | 1.7320508076 | ||||||||||||||||||||||
| 1990 | 90 | 4.5 | 90 | 2.1213203436 | ||||||||||||||||||||||
| 1991 | 91 | 5.2 | 91 | 2.2803508502 | ||||||||||||||||||||||
| 1996 | 96 | 8.4 | 96 | 2.8982753492 | ||||||||||||||||||||||
| 1997 | 97 | 11 | 97 | 3.3166247904 | ||||||||||||||||||||||
| 1998 | 98 | 12.4 | 98 | 3.5213633723 | ||||||||||||||||||||||
| 1999 | 99 | 12.5 | 99 | 3.5355339059 | ||||||||||||||||||||||
| 1999 | 99 | 13.1 | 99 | 3.6193922142 | ||||||||||||||||||||||
| 1999 | 99 | 15 | 99 | 3.8729833462 | ||||||||||||||||||||||
| 2001 | 101 | 17 | 101 | 4.1231056256 | ||||||||||||||||||||||
| 2001 | 101 | 22.2 | 101 | 4.7116875958 | ||||||||||||||||||||||
| 2004 | 104 | 22.6 | 104 | 4.7539457296 | ||||||||||||||||||||||
| 2005 | 105 | 25.8 | 105 | 5.0793700397 | Compare Residual Plots : Which has a cloudy, scattered dot pattern ? | |||||||||||||||||||||
| Response: Salary | Response : Square Root of Salary | |||||||||||||||||||||||||
| Compare Adjusted R-squared values : Which is largest ? Compare Standard Error: Which is smallest? | ||||||||||||||||||||||||||
| Both models are significant | ||||||||||||||||||||||||||
| The regression model with the transformed response variable has the largest Adjusted R-squared value and smallest standard error | ||||||||||||||||||||||||||
| Response Variable : Salary | Response Variable : Square Root Of Salary | |||||||||||||||||||||||||
| Regression Statistics | Regression Statistics | |||||||||||||||||||||||||
| Multiple R | 0.9150264296 | Multiple R | 0.9668776273 | |||||||||||||||||||||||
| R Square | 0.8372733668 | R Square | 0.9348523462 | |||||||||||||||||||||||
| Adjusted R Square | 0.8247559335 | Adjusted R Square | 0.9298409882 | |||||||||||||||||||||||
| Standard Error | 3.28038926 | Standard Error | 0.3337444496 | |||||||||||||||||||||||
| Observations | 15 | Observations | 15 | |||||||||||||||||||||||
| ANOVA | ANOVA | |||||||||||||||||||||||||
| df | SS | MS | F | Significance F | df | SS | MS | F | Significance F | |||||||||||||||||
| Regression | 1 | 719.7849352684 | 719.7849352684 | 66.8885821383 | 0.0000017547 | Regression | 1 | 20.7785720521 | 20.7785720521 | 186.5467103404 | 0.0000000044 | |||||||||||||||
| Residual | 13 | 139.892398065 | 10.7609536973 | Residual | 13 | 1.4480096496 | 0.1113853577 | |||||||||||||||||||
| Total | 14 | 859.6773333333 | Total | 14 | 22.2265817017 | |||||||||||||||||||||
| Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% | |||||||||||
| Intercept | -80.4842432619 | 11.3048781078 | -7.1194260119 | 0.0000078229 | -104.906947591 | -56.0615389328 | -104.906947591 | -56.0615389328 | Intercept | -12.4661593835 | 1.1501501874 | -10.8387230813 | 0.0000000701 | -14.9509077987 | -9.9814109684 | -14.9509077987 | -9.9814109684 | |||||||||
| year (Explanatory) | 0.9657567381 | 0.1180841892 | 8.1785440109 | 0.0000017547 | 0.710651357 | 1.2208621192 | 0.710651357 | 1.2208621192 | year (Explanatory) | 0.164087017 | 0.0120138007 | 13.6582103637 | 0.0000000044 | 0.1381327785 | 0.1900412554 | 0.1381327785 | 0.1900412554 | |||||||||
salary (in $10,000) (Response) 80 82 90 90 91 96 97 98 99 99 99 101 101 104 105 1 2 3 4.5 5.2 8.4 11 12.4 12.5 13.1 15 17 22.2 22.6 25.8
SQRT(Salary) 80 82 90 90 91 96 97 98 99 99 99 101 101 104 105 1 1.4142135623730951 1.7320508075688772 2.1213203435596424 2.2803508501982761 2.8982753492378879 3.3166247903553998 3.5213633723318019 3.5355339059327378 3.6193922141707713 3.872983346207417 4.1231056256176606 4.7116875957558984 4.7539457296018854 5.0793700396801178
regression equation after trans
| To undo the LN transformation to determine a regression equation for the original salary units, take the exponential (e) of both sides of the equation |
To Undo the LN transformations (x, LN(Y)) LN(Y) = ax + b y = e(ax + b) y = eax (eb) (LN(x), y) y = aLN(x) + b (LN(x), LN(y)) LN(y) = aLN(x) + b LN(y) = aLN(x) + b LN(y) = LN(xa) + b y = xa + eb
numerical summaries
| Numerical Summaries: Excel Regression Output Table (color coded) | ||||||||||||||||||||||
| Regression Coefficients | Excel Regression Output Table | |||||||||||||||||||||
| y-intercept: average value of the response variable when all explanatory variables are 0 | Regression Statistics | |||||||||||||||||||||
| impact of explanatory variables are not considered | Multiple R | 0.9150264296 | ||||||||||||||||||||
| R Square | 0.8372733668 | |||||||||||||||||||||
| regression coefficients: the average predicted change in the response variable per unit change in the explanatory variable | Adjusted R Square | 0.8247559335 | ||||||||||||||||||||
| Standard Error | 3.28038926 | |||||||||||||||||||||
| Correlation Coefficient: Strength of Linear Relation | Observations | 15 | ||||||||||||||||||||
| Coefficient of Determination (r^2) - Use Adjusted R^2 | ANOVA | |||||||||||||||||||||
| fraction of the variation in the data accounted for by the regression equation | df | SS | MS | F | Significance F | |||||||||||||||||
| sometimes reported as a percentage | Regression | 1 | 719.7849352684 | 719.7849352684 | 66.8885821383 | 0.0000017547 | ||||||||||||||||
| between 0 and 1 | Residual | 13 | 139.892398065 | 10.7609536973 | ||||||||||||||||||
| Total | 14 | 859.6773333333 | ||||||||||||||||||||
| SEE (Standard Error of the Estimate) | ||||||||||||||||||||||
| standard deviation of the residuals | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% | ||||||||||||||
| how spread out the observations are from the regression line | Intercept | -80.4842432619 | 11.3048781078 | -7.1194260119 | 0.0000078229 | -104.906947591 | -56.0615389328 | -104.906947591 | -56.0615389328 | |||||||||||||
| year (Explanatory) | 0.9657567381 | 0.1180841892 | 8.1785440109 | 0.0000017547 | 0.710651357 | 1.2208621192 | 0.710651357 | 1.2208621192 | ||||||||||||||
| Confidence Intervals : Range, withing specified level of confidence, that the population parameter resides |
Influential Points
| Influential Points | |||||||||||||||||
| Does the regression equation change significantly when the isolated points is removed? | |||||||||||||||||
| Would a different result from the analysis be concluded? | |||||||||||||||||
| If yes, then the point is influential | |||||||||||||||||
| In the regression output look at: | |||||||||||||||||
| significance of model or coefficients | |||||||||||||||||
| numerical summaries | |||||||||||||||||
| Example | |||||||||||||||||
| with influencial point | without influential point |
recode
| Recode multiple category variables (3 or more categories) to 2 category variable | ||||||||
| Gender: male (1) female (0) | ||||||||
| Political Party (0) republican, (1) democrat, (2) libertarian, (3) green | ||||||||
| Student | IQ | Study hours | Gender | Political Party | Test score | |||
| 1 | 110 | 40 | 1 | 0 | 100 | |||
| 2 | 110 | 40 | 0 | 2 | 95 | |||
| 3 | 120 | 30 | 1 | 1 | 90 | |||
| 4 | 110 | 40 | 1 | 3 | 85 | |||
| 5 | 100 | 20 | 0 | 3 | 80 | |||
| 6 | 110 | 40 | 1 | 1 | 75 | |||
| 7 | 90 | 0 | 0 | 0 | 70 | |||
| 8 | 110 | 40 | 0 | 2 | 65 | |||
| 9 | 80 | 10 | 1 | 3 | 60 | |||
| 10 | 80 | 10 | 0 | 1 | 55 | |||
| The dummy variables are 1 when (zero otherwise): | ||||||||
| x1: republican | ||||||||
| x2: democrat | ||||||||
| x3: libertarian | ||||||||
| Student | IQ | Study hours | Gender | x1 | x2 | x3 | Test score | |
| 1 | 110 | 40 | 1 | 1 | 0 | 0 | 100 | |
| 2 | 110 | 40 | 0 | 0 | 0 | 1 | 95 | |
| 3 | 120 | 30 | 1 | 0 | 1 | 0 | 90 | |
| 4 | 110 | 40 | 1 | 0 | 0 | 0 | 85 | |
| 5 | 100 | 20 | 0 | 80 | ||||
| 6 | 110 | 40 | 1 | 75 | ||||
| 7 | 90 | 0 | 0 | 70 | ||||
| 8 | 110 | 40 | 0 | 65 | ||||
| 9 | 80 | 10 | 1 | 60 | ||||
| 10 | 80 | 10 | 0 | 55 |