Statics
BUS 308 – Week 4 Lecture 2
As in many detective stories, we will often find that when one thing changes, we see that something else has changed as well. The correlation between events is mirrored in data analysis examinations with correlation analysis. This week’s focus changes from detecting and evaluating differences to looking at relationships. As students often comment, finding significant differences in gender based measures does not explain why these differences exist. Correlation, while not always explaining why things happen gives detectives great clues on what to examine more closely and helps move us towards understanding why outcomes exist and what impacts them. If we see correlations in the real world, we often will spend time examining what might underlie them; finding out if they are spurious or causal.
Linear Correlation
When two things seem to move in a somewhat predictable way, we say they are correlated. This correlation could be direct or positive, both move in the same direction, or it could be inverse or negative, where when one increases the other decreases. The Law of Supply and price in economics is a common example of an inverse correlation, where the more supply we have of something, the less we typically can charge for it; the Law of Demand is an example of a direct correlation as the more demand exists for something, the more we can charge for it. Height and weight in young children is another common example of a direct correlation, as one increases so does the other measure.
Probably the most commonly used correlation is the Pearson Correlation Coefficient, symbolized by r. It measures the strength of the association – the extent to which measures change together – between interval or ratio level measures as well as the direction of the relationship (inverse or direct). Several measures in our company data set could use the Pearson Correlation to show relationships; salary and midpoint, salary and years of service, salary and performance rating, etc. The Pearson Correlation runs from -1.0 (perfect negative or inverse correlation) thru 0 (no correlation) to +1.0 (perfect positive or direct correlation).
A perfect correlation means that if we graphed the values, they would fall exactly on a straight line, either increase from bottom left to top right (positive) or from top left to bottom right (negative). The stronger the absolute value (ignoring the sign), the stronger the correlation and the more the data points would form a straight line when plotted on a graph. The Excel Fx function Correl, and the Data Analysis function Correlation both produce Pearson Correlations.
Question 1
When we have a data set with multiple variables, we would want to see what relationships exist – a detective’s sort of “who works with whom” around the result we are looking for. Data set-up for a correlation is perhaps the simplest of any we have seen. It involves simply copying and pasting the variables from the Data tab to the Week 4 worksheet. Again, paste them to the right of the question area. The screenshot below pasted them starting at column V. The only issue is to only paste the variables we want to use.
Question 1 asks for the correlation among the interval/ratio level variables with salary, and says to exclude compa-ratio. Setting you your data would be fairly simple. Copy the salary data to column Q (for example). Then copy the Midpoint thru Service columns and paste them next to salary. Finally copy the Raise column and paste it next to the service column. Notice that our data input range now excludes compa-ratio, Gender, Degree, Gender1, and Grade that are in the original data set. Except for compa-ratio (which was excluded by the question), none of these are at least interval level variables and therefore cannot be used in a Pearson Correlation.
For our example, we will correlation compa-ratio with the other interval/ratio level variables with the exclusion of salary. Since compa-ratio = salary/midpoint, it does not seem reasonable to use salary in predicting compa-ratio or compa-ratio in predicting salary.
Now that the data is set-up, we can proceed with the hypothesis test on the question of which variables are significantly correlated to each other. However, we are going to proceed a bit differently this week. Since you have all had practice in setting up and performing these steps, we are going to look at how to evaluate the correlations in a different way.
The significance of the Pearson Correlation is tested with the t-test, t = r * sqrt(n- 2)/sqrt(1-r^2), df = n-2; where n equals the number of data point pairs used in the correlation. So, we could set up the hypothesis testing steps for each of the correlations (which we will see shortly equals 15), or we can find the value of r that cuts-off the significant and non-significant correlation values. Having this critical value (which is sometimes presented in correlation tables), gives us a quick decision point (much like we use the p-value).
The formal approach is:
Step 1: Ho: Correlation is not significant
Ha: Correlation is significant. (A two-tail test.)
Step 2: Alpha = 0.05
Step 3: Spearman’s r, t, and the correlation t-test to test a correlation
Step 4: Reject the null hypothesis if the correlation value is larger than the critical value. (The critical correlation value has a related t-statistic having a p-value = 0.05. Larger correlations result in smaller p-values. So, we are essentially saying reject the null when the p-value is < 0.05; our usual standard.)
Statistical Significance
The issue is now, what is our critical r value?
Technical Point. If you are interested in how we obtain the formula for determining the minimum r value, the approach is shown below. If you are not interested in the math, you can safely skip this paragraph, and go to The Result paragraph below.
We know that t = r* sqrt(n-2)/sqrt(1-r2)
Multiplying both sides by sqrt(1-r2) gives us t *sqrt (1- r2) = r*sqrt(n-2)
Squaring both sides gives us: t2 * (1- r2) = r2* (n-2)
Multiplying each side out gives us: t2– t2* r2 = n r2-2* r2
Adding t2* r2 both sides gives us: t2= n* r2-2*r2+ t2 *r2
Factoring gives us: t2= r2 *(n -2+ t2)
Dividing both sides by *(n -2+ t2) gives us: t2 / (n -2+ t2) = r2
Taking the square root gives us: r = sqrt (t2 / (n -2+ t2)
The Result. The formula to use in finding the minimum correlation value that is statistically significant is: r = sqrt(t^2/(t^2 + n-2)), where t is the 2-tail value.
We would find the appropriate t value by using the t.inv.2T(alpha, df) function with alpha = 0.05 and df = n-2 or 48 (for our data set of 50 employees). Plugging these values into the gives us a t- value of 2.0106 or 2.011(rounded).
Putting 2.011 and 48 (n-2) into our formula gives us a r value of 0.278; therefore, in a correlation table based on 50 pairs, any correlation greater or equal to 0.278 would be statistically significant.
So, what does all this mean? If we find a correlation based on 50 pairs of data (such as what our data set will produce), any correlation value that exceeds 0.278 would be found to be statistically significant (p-value less than 0.05), and cause us to reject the null hypothesis of not significant.
So, when looking at a table of correlation values, we can identify the significant correlations immediately; these are any correlation above the absolute value of 0.278 (that means larger than + 0.278 (such as + .46) or less than -0.278 (such as -0.53)). Knowing how to interpret table results, we can proceed to creating the correlations.
Step 5: Conduct the test.
Pearson correlations can be performed in two ways within Excel. If we have a single pair of variables we are interested in, for example compa-ratio and performance rating, we could use the fx (or Formulas) function CORREL(array1, array2) (note array means the same as range) to give us the correlation.
However, if we have several variables we want to correlate at the same time, it is more effective to use the Correlation function found in the Analysis ToolPak in the Data Analysis tab. Set up of the input data for Correlation is simple. Just ensure that all of the variables to be correlated are listed together, and only include interval or ratio level data. For our data set, this
would mean we cannot include gender or degree; even though they look like numerical data the 0 and 1 are merely labels as far as correlation is concerned.
In the Correlation data input box, list the entire data range and indicate if your data has labels or not (good idea to include these), select the output cell and click OK. Here is a screen shot of the input box and some of the data.
Here is a screen shot of the output table and part of the data.
Reading the Table. The table only shows correlations below the diagonal (which has a 1.00 correlation of each variable with itself). Values above the line would simply duplicate those below it. The diagonal is a “pivot-point,” so to speak. In reading the correlations we would start with a row such as Age. The correlation of Age and Compa-ratio is 0.195 (rounded), the correlation of Age and Midpoint is 0.567. Then we get to the diagonal. Instead of continuing horizontally, we start going vertical (down the Age column). So, the correlation of Age with Performance rating is 0.139, with Service is 0.565, and with raise -0.180. All correlations, except for the first one (in this case Compa-ratio) would be read this way in Correlation tables.
Step 6: Conclusion and Interpretation.
Normally, we would go thru our questions about the p-value for each value. But since you are familiar with the testing logic, for this question we are going to “shortcut” the process. The first question asked is what is the T value that cuts off the two tails of the distribution with an alpha of 0.05? We calculated this above as 2.011. The second question asks for the associated correlation value for this t-value. Again, we found this above to be 0.278.
Part c asks us to use this information and identify the variables significantly correlated to salary in the homework and to compa-ratio for this example. Looking at the output table above, only Midpoint is significantly correlated to compa-ratio with a correlation of .50 (rounded) which is greater than our cut-off of 0.278.
Part d asks for any surprising results/correlations. This will depend upon your table and what you did or did not expect.
Part e asks if this information helps us answer our equal pay question. The compa-ratio correlations do not seem to be helpful as they do not shed any insight on gender based issues.
While question 1 does not appear to rely upon the hypothesis testing process, by showing the logic behind finding the significant correlation cut-off value, we can see that we are being faithful to the logic even if not the actual step-by-step process while making our decisions on correlation significance.
Multiple Correlation
As interesting as linear correlation is, multiple correlation is even more so. It correlates several independent (input) variables with a single dependent (output) variable. For example, it would show the shared variation (multiple R squared, or Multiple Coefficient of Determination) for compa-ratio with the other variables in the data set at the same time rather than in pairs as we did in question 1.
While we can generate this value by itself, it is a bit complicated and is rarely found except in conjunction with a multiple regression equation. So, having noted that this exists, let’s move on to multiple regression.
Please ask your instructor if you have any questions about this material.
When you have finished with this lecture, please respond to Discussion thread 2 for this week with your initial response and responses to others over a couple of days before reading the third lecture for the week.