Statistics
BUS 308 Week 4 Lecture 3
Moving to correlation and regression opens up new insights into our data sets, but still lets us use what we have learned about Excel tools in setting up and generating our results. Regression lets us use relationships between and among our variables to predict or explain outcomes based upon inputs, factors we think might be related. In our quest to understand what impacts the compa-ratio and salary outcomes we see, we have often been frustrated due to being basically limited to examining only two variables at a time, when we felt that we needed to include many other factors. Regression, particularly multiple regression, is the tool that allows us to do this.
Regression
Regression takes us the next step in the journey. We move from knowing which variables are correlated to finding out which variables can be used to actually predict outcomes or explain the influence of different variables on a result. As we might suspect, linear regression involves a single dependent (outcome) and single independent (input) variable. Linear regression uses at least interval level data for both the dependent and independent variables.
The form of a linear regression equation is:
Y = a + b*X, where Y is the output, X is the input, a is the intercept (the value of Y when X = 0, where the line crosses the vertical axis on a graph), and b is the coefficient (showing the change in Y for every one-unit change in the value of X).
Earlier, we found that the correlation between raise and performance rating was 0.674 (rounded). While we did not make note of this in our correlation discussion, it was part of the correlation table. This correlation relates to a coefficient of determination (CD) of 0.674^2 or 0.45 (rounded). As mentioned, this is not a particularly strong correlation, and we would not expect the graph of these values to show much of a straight line. For purposes of understanding linear regression, let’s look at a graph showing performance rating as an input (an X variable) predicting raise (Y). An example of a regression equation and its graph is:
[Figure: Excel scatter diagram of Raise (Y, axis from 0 to 7) vs Performance Rating (X, axis from 0 to 120), with the fitted trend line y = 0.0512x + 0.5412 and R² = 0.4538.]
This is a scatter diagram produced by Excel. The regression line, equation, and R-squared value have been added. Note that the coefficient of determination (R²) is the 45% we found earlier, and that the data points are not all that close to the regression (AKA trend) line. Note the format of the regression equation, Y = 0.5412 + 0.0512X. This is the same as saying Raise = 0.5412 + 0.0512 * Performance Rating when we substitute the variable names for the algebraic letters.
Let us look at the equation. Since we know that the correlation is significant (it is larger than our 0.278 cut-off discussed in lecture 2), the linear regression equation is significant. The regression says that for every single point increase in the performance rating (our X variable), the raise (the Y variable) increases, on average, by 0.0512%. If we extended the line towards the Y (vertical) axis, it would cross at Y = 0.5412 when X = 0. This is an example where looking at the origin point is not particularly helpful, as no one has a performance rating of 0. This graph does tend to reinforce our earlier comment that raise and performance rating, even though they have the strongest correlation, are not particularly good at predicting each other’s value. We see too much dispersion of data points around the best fit regression line.
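As a minimal sketch of how the trend-line equation turns a performance rating into a predicted raise, the Python below plugs values into Raise = 0.5412 + 0.0512 * Performance Rating and squares the correlation to recover the coefficient of determination. The coefficients and the correlation are the ones quoted in this lecture; the function name and the sample ratings are hypothetical and chosen only for illustration.

```python
# Coefficients read from the Excel trend line quoted in the lecture.
INTERCEPT = 0.5412   # a: predicted raise when the rating is 0
SLOPE = 0.0512       # b: change in raise per one-point change in rating


def predict_raise(performance_rating):
    """Return the predicted raise (in percent) for a performance rating."""
    return INTERCEPT + SLOPE * performance_rating


# Coefficient of determination from the correlation reported earlier.
r = 0.674
r_squared = r ** 2

if __name__ == "__main__":
    print(f"r squared = {r_squared:.2f}")
    # Hypothetical ratings, used only to show the arithmetic.
    for rating in (80, 90, 100):
        print(f"rating {rating}: predicted raise = {predict_raise(rating):.2f}%")
```

Because R² is only about 0.45, predictions like these are best read as rough averages rather than precise forecasts for any one employee.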
Most of us are probably not surprised. Just as we feel compa-ratio is not determined by a single factor, we know raise is more complicated than simply the performance rating. This is where looking at multiple regression, the use of several factors, might be more insightful.
Multiple Regression
Multiple Regression is probably the most powerful tool we will look at in this course. It allows us to examine the impact of multiple inputs (AKA independent variables) on a single output or result (AKA dependent variable). It also allows us to include nominal and ordinal variables in the results when they are used as dummy coded variables.
Multiple regression has an interesting ability that we have not been able to use before. It can use nominal data variables as inputs to our outcomes, rather than using them simply as grouping labels. It does so by assigning either a 0 or 1 to the variable value depending upon whether some characteristic exists or not. For example, with degree we essentially are looking to see if a graduate degree has any impact, since everyone in the sample has at least an undergraduate degree. So, we code the existence of a graduate degree with a 1, and the “non-existence” with a 0. Similarly, with gender we are interested, essentially, in how females are being treated, so we code them 1 (existence of being female). This coding is called Dummy Coding, and involves only using a 0 or 1 in specific situations where the existence of a factor is considered important. Note that, other than some changes in the value of the coefficients, the outcomes would not differ if the codes were reversed. The significance, or non-significance, of degree or gender would remain the same regardless of the code used. We will comment on this more after we see our results.
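The dummy coding described above can be sketched in a few lines of Python. The helper name and the four sample records are hypothetical and are not taken from the course data set; the sketch only shows how a label column becomes a 0/1 column, and how reversing the code simply flips every value.

```python
def dummy_code(values, coded_as_one):
    """Return 1 where a value matches the category of interest, else 0."""
    return [1 if value == coded_as_one else 0 for value in values]


# Hypothetical records, not taken from the course data set.
genders = ["F", "M", "M", "F"]
degrees = ["graduate", "undergraduate", "graduate", "undergraduate"]

gender_coded = dummy_code(genders, "F")          # female = 1, male = 0
degree_coded = dummy_code(degrees, "graduate")   # graduate degree = 1

# Reversing the coding flips every 0 and 1.
gender_reversed = dummy_code(genders, "M")

if __name__ == "__main__":
    print(gender_coded)
    print(degree_coded)
    print(gender_reversed)
```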
Question 2
Question 2 for this week asks for a regression equation that explains the impact of various variables on our output of interest. Of course, in the homework this is salary, while in our lectures it is the compa-ratio.
Linear and multiple regression are set up in the same fashion, so we will look at only the multiple regression situation. For the data, put the dependent variable, the output such as salary or compa-ratio, in one column and then paste the independent (input) variables in sequential columns next to it. Make sure that none of the data cells contain letter characters. It is also a good idea to include the variable labels at the top of each data column.
The Regression function is found in the Data | Analysis block and is labeled Regression. Here is a screen shot of a complete Regression set-up for a regression equation for compa-ratio. Note that unlike the correlation input, we have two ranges to work with. The first is the output, which for this example is compa-ratio (and would be salary for the homework). The second is for the inputs, which should include all of the numeric looking variables, including the Degree and Gender variables as shown below.
Data range entry for the Y (or outcome) and the X (or input) variables is done separately, either by typing in the ranges or by dragging the cursor over the data range after clicking on the up arrow at the right end of the data entry boxes. The same is done with the data entry box after clicking the circle for Output range.
There are a number of options to consider. First, of course, is the need to click the Labels box if your data ranges include labels. A second option is the Constant is Zero box. This would force the regression equation to pass through the X = 0 and Y = 0 origin, even if this is not the best fit. Use this with caution: even when it seems logical that Y should be 0 when all the X variables are 0, using this option may not give us the equation that best fits the data.
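To see why forcing the line through the origin can hurt the fit, here is a minimal Python sketch for the one-input case. It uses the standard least-squares formulas, not Excel itself, and the four data points are hypothetical (they lie exactly on y = 1 + 2x). The function names are assumptions made for this illustration.

```python
def fit_line(x, y):
    """Least squares with an intercept: returns (a, b) for Y = a + b*X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def fit_through_origin(x, y):
    """Least squares with the intercept forced to 0: returns b for Y = b*X."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)


def sum_squared_errors(x, y, a, b):
    """Total squared distance between the data and the line Y = a + b*X."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data lying exactly on y = 1 + 2x.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

a, b = fit_line(x, y)
b_origin = fit_through_origin(x, y)
sse_free = sum_squared_errors(x, y, a, b)
sse_origin = sum_squared_errors(x, y, 0.0, b_origin)

if __name__ == "__main__":
    print(f"with intercept: Y = {a:.2f} + {b:.2f}X, squared error = {sse_free:.3f}")
    print(f"through origin: Y = {b_origin:.2f}X, squared error = {sse_origin:.3f}")
```

The through-origin line has to tilt to compensate for the missing intercept, so its squared error is larger even though the underlying relationship is perfectly linear.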
The Residuals box provides a way to see how well each of the plotted data points fits with the predicted results. This will often allow us to see outliers, cases that do not fit with the rest of the data set. Outliers are sometimes indications of data entry errors or, in the case of salary, of employees who are paid using a different approach. One such example would be a commission salesperson being included with employees who are paid a straight salary; the basis of pay is so different that these two should not be analyzed in the same study. Other options here allow for the results to be turned into Z-scores (Standardized Residuals), plotted on a graph, or have linear plots made for the output and each separate input. Normal Probability Plots are rather complicated to discuss, and it is left to the student to explore this if desired. You are encouraged to play around with some of these options, even though they are not required for the assignment.
Now that the data has been set up, let’s look at our hypothesis testing process, starting with the question of whether or not the regression equation is helpful in explaining what impacts compa-ratio outcomes.
Parts a and b. This part looks at the overall regression.
Step 1: Ho: The regression equation is not significant.
Ha: The regression equation is significant.
Step 2: Alpha = 0.05
Step 3: F stat and ANOVA-Regression, used to test regression significance
Step 4: Decision Rule: Reject the null hypothesis if p-value < 0.05.
Step 5: Conduct the test.
After completing the set-up box, click on OK to produce the result.
Here is a screen shot of a multiple regression analysis for the question of what factors influence compa-ratio. Note: we will split the discussion of the output into two screen shots.
The first table in the output provides some summary statistics. Two are important for us: the multiple correlation, shown as R, which equals 0.655, a moderate value; and the R square, or multiple coefficient of determination, showing that about 43% of the variation in compa-ratio values can be explained by the shared variation in the variables used in the analysis.
The second table shows the results of the actual statistical test of the regression. Similar to the ANOVA tables we looked at last week, it has two rows that are used to generate our F statistic (4.51) and the p-value (Significance F) of 0.0008.
Step 6: Conclusion and Interpretation.
What is the p-value? 0.0008
Decision: Reject or not reject the null? Reject the null hypothesis.
Why? The p-value is less than (<) 0.05.
Conclusion about Compa-ratio factors? The input variables are significantly related to compa-ratio outcomes. Some of the compa-ratio outcomes can be explained by the selected variables. We used the phrase “some of” since the equation only explains 43% of the variance, less than half.
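The link between R square and the F statistic in the ANOVA table can be sketched with the standard formula F = (R² / k) / ((1 - R²) / (n - k - 1)), where n is the sample size and k is the number of inputs. The values n = 50 and k = 7 below are assumptions made for this sketch; the lecture text does not state them. The multiple R, alpha, and p-value are the ones reported above.

```python
def f_statistic(r_squared, n, k):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1))."""
    df_regression = k
    df_residual = n - k - 1
    return (r_squared / df_regression) / ((1 - r_squared) / df_residual)


# Values reported in the lecture.
multiple_r = 0.655
r_squared = multiple_r ** 2
ALPHA = 0.05
p_value = 0.0008          # Significance F from the ANOVA table

# Assumed for illustration only; not stated in the lecture text.
ASSUMED_N = 50            # number of employees in the sample
ASSUMED_K = 7             # number of input variables

f_value = f_statistic(r_squared, ASSUMED_N, ASSUMED_K)
reject_null = p_value < ALPHA   # Step 4 decision rule

if __name__ == "__main__":
    print(f"R square = {r_squared:.2f}")
    print(f"F under the assumed n and k = {f_value:.2f}")
    print("Reject the null hypothesis" if reject_null else "Do not reject the null")
```

Under those assumed values the formula gives an F of about 4.51, which is consistent with the reported table, but the degrees of freedom in your own output are the ones to trust.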
Part c
Once we reject the null hypothesis, our attention changes to the actual equation, the variables and their corresponding coefficients. The third table provides all the details we need to reach our conclusions.
As with the correlations in question 1, we will use the hypothesis testing process, but will write it only once and use the p-values to make decisions on each of the possible equation variables.
Step 1: Ho: The variable coefficient is not significant (b = 0).
Ha: The variable coefficient is significant (b =/= 0).
Step 2: Alpha = 0.05
Step 3: T stat and t-test for coefficients
Step 4: Decision Rule: Reject the null hypothesis if p-value < 0.05.
Step 5: Conduct the test. In this case, the test has already been performed and is part of the regression output. Here is a screen shot of the second half of the Regression output.
Step 6: Conclusions and Interpretation
As with the correlations, we will use a single statement of the 6 steps to interpret the outcomes in this part. Here is the completed table.
The Multiple Regression equation is similar to the linear regression example given above except it has more independent terms: Y = a + b1*X1 + b2*X2 + b3*X3 + …. The b’s stand for the coefficients that are multiplied by the value of each variable (represented by the X’s).
In the first column (L in the screen shot) are the possible regression elements, starting with the intercept, which is always a part of the equation. The next column (M) and the fifth column (P) are the really important columns. Column P, labeled p-value, tells us which variables are statistically significant. Just as with our previous tests, if the p-value is less than (<) our chosen alpha, we reject the related null hypothesis and accept the alternate that the coefficient’s value is different from 0, and the related variable should be included in the final equation.
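Applying the same decision rule to every row of the coefficient table can be sketched as a simple filter. The p-values below are hypothetical and invented for illustration only; the real ones come from the p-value column of your own regression output. The function name is likewise an assumption.

```python
ALPHA = 0.05

# Hypothetical p-values for illustration only; read the real ones
# from the p-value column of your own regression output.
p_values = {
    "Midpoint": 0.001,
    "Performance Rating": 0.020,
    "Degree": 0.500,
    "Gender": 0.010,
}


def significant_variables(p_values, alpha=ALPHA):
    """Return the variable names whose p-value is below alpha."""
    return [name for name, p in p_values.items() if p < alpha]


kept = significant_variables(p_values)

if __name__ == "__main__":
    for name, p in p_values.items():
        decision = "reject Ho, keep" if p < ALPHA else "do not reject Ho, drop"
        print(f"{name}: p = {p:.3f} -> {decision}")
```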
For our example, we find that only 3 variables are statistically significant; the midpoint, the performance rating, and the gender. With these 3 variables and the intercept, the statistically significant regression equation is:
Compa-ratio = 0.954 + 0.003*midpoint - 0.002*performance rating + 0.056*gender.
So, what does this equation mean? How do we interpret it? The intercept (0.9545) is somewhat of a placeholder: it centers the line in the middle of the data points, but has little other meaning for us. The three variables, however, tell us a lot. Changes in each of them impact the compa-ratio outcome independently of the others; it is as if we can consider the other factors being held constant as we examine each factor’s impact. So, all other things the same, each dollar increase in midpoint increases the compa-ratio value by 0.0034. This relates to what we found last week, that compa-ratio is not independent of grade. At the same time, and possibly surprisingly, every one-point increase in an employee’s performance rating causes the compa-ratio to decrease by 0.0024! Finally, the equation says that gender is an important factor. This factor alone means that the company is violating the Equal Pay Act. But what might be surprising is that for a change from male (coded 0) to female (coded 1) the compa-ratio goes up by 0.0562! Females get a higher compa-ratio (percent of midpoint) than males do when all other things are equal, since the female gender results in adding 0.056 * 1 to the compa-ratio while the male gender has 0.056 * 0 (or 0) added to their compa-ratio.
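The "all other things the same" reading can be checked with a short Python sketch that evaluates the equation for two employees who differ only in gender. The coefficients are the unrounded values quoted above; the midpoint of 40 and the performance rating of 90 are hypothetical inputs chosen only to show the arithmetic, and the function name is an assumption.

```python
# Coefficients quoted in the lecture's interpretation of the equation.
INTERCEPT = 0.9545
B_MIDPOINT = 0.0034
B_RATING = -0.0024
B_GENDER = 0.0562     # gender is dummy coded: female = 1, male = 0


def predict_compa_ratio(midpoint, performance_rating, gender):
    """Evaluate Y = a + b1*X1 + b2*X2 + b3*X3 for one employee."""
    return (INTERCEPT
            + B_MIDPOINT * midpoint
            + B_RATING * performance_rating
            + B_GENDER * gender)


# Hypothetical employees with the same midpoint and rating.
male = predict_compa_ratio(midpoint=40, performance_rating=90, gender=0)
female = predict_compa_ratio(midpoint=40, performance_rating=90, gender=1)
gap = female - male

if __name__ == "__main__":
    print(f"male:   {male:.4f}")
    print(f"female: {female:.4f}")
    print(f"gap:    {gap:.4f}")
```

Whatever midpoint and rating are used, the gap between the two predictions is always the gender coefficient, which is what holding the other variables constant means.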
We did have one hint that this might be the case, when we noticed in week 1 that the female mean compa-ratio was higher than the male mean compa-ratio. But then some of the single factor tests minimized this difference. This is one of multiple regression’s greatest strengths: it will show us the impact of a single variable by controlling for, or keeping constant, the impact of all other variables.
Parts d, e, and f
Gender is a significant element in the compa-ratio, as females get a higher value when all other variables are equal. We see this from the significant positive coefficient for the variable gender. Females are coded 1, so they get more added to their result.
Here is a video on Regression: https://screencast-o-matic.com/watch/cb6jfuIk8S
Question 3
This answer will depend on what other factors you would like to see.
Question 4
As of this point, we have some strong evidence, in the compa-ratio regression equation and the t-test on average compa-ratios, that females get more pay for equal work than males. The company is violating the Equal Pay Act, in favor of women.
Question 5
What you say here describes your understanding of regression analysis versus the power of inferential tests of 2 variables at a time.
Please ask your instructor if you have any questions about this material.
When you have finished with this lecture, please respond to Discussion thread 3 for this week with your initial response and responses to others over a couple of days.