week 4

profileamason254
2018cCanvasLectureweek4-3a.pdf

BUS 308 Week 4 Lecture 3

Developing Relationships in Excel

Expected Outcomes

After reading this lecture, the student should be able to:

1. Calculate the t-value for a correlation coefficient 2. Calculate the minimum statistically significant correlation coefficient value. 3. Set-up and interpret a Linear Regression in Excel 4. Set-up and interpret a Multiple Regression in Excel

Overview

Setting up correlations and regressions in Excel is fairly straightforward and follows the approaches we have seen with our previous tools. This involves setting up the data input table, selecting the tools, and inputting information into the appropriate parts of the input window.

Correlations

Question 1

Data set-up for a correlation is perhaps the simplest of any we have seen. It involves simply copying and pasting the variables from the Data tab to the Week 4 worksheet. Again, paste them to the right of the question area. The screenshot below has the data for both the question 1 correlation and the question 2 multiple regression pasted them starting at column V. You can paste all the data at once or add the multiple regression variables later (as long as you do not sort the original data).

Specifically, for Question 1, copy the salary data to column V (for example). Then copy the Midpoint thru Service columns and paste them next to salary. Finally copy the Raise column and paste it next to the service column. Notice that our data input range for this question now includes Salary in Column V and the other interval level variables found in Columns W thru AA.

Question 1 asks for the correlation among the interval/ratio level variables with salary and says to exclude compa-ratio. For our example, we will correlation compa-ratio with the other interval/ratio level variables with the exclusion of salary. Since compa-ratio equals the salary divided by the midpoint, it does not seem reasonable to use salary in predicting compa- ratio or compa-ratio in predicting salary.

Pearson correlations can be performed in two ways within Excel. If we have a single pair of variables we are interested in, for example compa-ratio and performance rating, we could use the fx (or Formulas) function CORREL(array1, array2) (note array means the same as range) to give us the correlation.

However, if we have several variables we want to correlate at the same time, it is more effective to use the Correlation function found in the Analysis ToolPak in the Data Analysis tab. Set up of the input data for Correlation is simple. Just ensure that all of the variables to be correlated are listed together, and only include interval or ratio level data. For our data set, this would mean we cannot include gender or degree; even though they look like numerical data the 0 and 1 are merely labels as far as correlation is concerned.

In the Correlation data input box shown below, list the entire data range, indicate if your data has labels or not (good idea to include these), select the output cell, and click OK. Here is a screen shot of the input box and some of the data.

The result will show up in K08 (in this case).

Statistical Significance

Part b. Normally, we would go thru our questions about the p-value for each value. But since you are familiar with the testing logic, for this question we are going to “shortcut” the process. Now, there is an easier way of determining which of the correlations are statistically significant. This is suggested by the question1 part b that we skipped in lecture 2. We noted that values smaller than the r = .20 that we tested could be assumed to all be non-significant. We could also have assumed that values larger than the tested 0.50 would be assumed to all be significant. So, it would seem to make sense that there is a specific value of r that exactly matches the alpha = 0.05 criteria.

If we can find this value of r, we can compare each correlation with this critical value; correlations larger (absolute values) than this are significant; while smaller correlations are not significant. Having this critical value would give us a quick decision point (much like how we use the p-value).

The issue is now, what is this critical r value?

Technical Point. If you are interested in how we obtain the formula for determining the minimum r value, the approach is shown below. If you are not interested in the math, you can safely skip this paragraph, and go to The Result paragraph below.

We know that t = r* sqrt(n-2)/sqrt(1-r2)

Multiplying both sides by sqrt(1-r2) gives us t *sqrt (1- r2) = r*sqrt(n-2)

Squaring both sides gives us: t2 * (1- r2) = r2* (n-2)

Multiplying each side out gives us: t2– t2* r2 = n r2-2* r2

Adding t2* r2 both sides gives us: t2= n* r2-2*r2+ t2 *r2

Factoring gives us: t2= r2 *(n -2+ t2)

Dividing both sides by *(n -2+ t2) gives us: t2 / (n -2+ t2) = r2

Taking the square root gives us: r = sqrt (t2 / (n -2+ t2)

The Result. The formula to use in finding the minimum correlation value that is statistically significant is: r = sqrt(t^2/(t^2 + n-2)), where t is the 2-tail value for any df count.

We would find the t value associated with a two-tail p-value of 0.05 and a df value of 48 by using the t.inv.2T(alpha, df) function with alpha = 0.05 and df = n-2 or 48 (for our data set of 50 employees). Plugging these values into the gives us a t-value of 2.0106 or 2.011(rounded).

t =t.inv.2T(alpha, df) =t.inv.2t(0.05,48) = 2.011

r = sqrt(t^2/(t^2 + n-2)) = sqrt(2.011^2/(2.011^2 + 50-2) = 0.278.

Therefore, in a correlation table based on 50 pairs, any correlation greater than or equal to 0.278 would be statistically significant.

So, what does all this mean? If we find a correlation based on 50 pairs of data (such as what our data set will produce), any correlation value that exceeds an absolute value of 0.278 would be found to be statistically significant (p-value less than 0.05) and cause us to reject the related null hypothesis of not significant.

So, when looking at a table of correlation values, we can identify the significant correlations immediately; these are any correlation above the absolute value of 0.278 (that means larger than + 0.278 (such as + .46) or less than -0.278 (such as -0.53)). Knowing how to interpret table results, we can proceed making our decisions on what is significant.

So, for part b, the first question asked is what is the T value that cuts off the two tails of the distribution with an alpha of 0.05? We calculated this above as 2.011. The second question asks for the associated correlation value for this t-value. Again, we found this above to be 0.278.

Spearman’s. Note that while the Spearman’s rank order correlation is not asked for in the assignment, you might want to use it at times. For example, some could argue that Performance Rating, since it is based mostly on human judgement, it is really ordinal and requires Spearman’s. The formula for Spearman’s (which needs to be manually input into Excel), is:

Rho = 1 – 6*(Sum of d^2)/(n*(n^2 – 1)); where d is the difference in the rank score for each of the paired variables, and n is the count of paired data used. Remember that the ^ in an Excel formula means take the number to that power, so d^2 means d squared (or d times d).

Regression

Question 2

Both linear and multiple regression are both set up in the same fashion, so we will look at only the multiple regression situation. For the data, put the dependent variable, the output such as salary or compa-ratio, in one column and then paste the independent, input, variables in

sequential columns next to it. Make sure that none of the columns contain letter characters. It is also a good idea to include the variable labels for each data column. The first screen shot above shows the data input required for this question.

The Regression function is found in the Data | Analysis block and is labeled Regression. Here is a screen shot of a complete Regression set-up for a regression equation for compa-ratio. Note that unlike the correlation input, we have two ranges to work with. The first is the output, which for this example is compa-ratio (and would be salary for the homework). The second is for the inputs, which should include all of the numeric looking variables, including the Degree and Gender variables as shown below.

Data range entry for the Y (or outcome) and the X (or input) variables are done separately by either typing in the ranges or using dragging the cursor over the data range after clicking on the up arrow at the right end of the data entry boxes. The same is done with the data entry box after clicking the circle for Output range.

There are a number of options to consider. First, of course, is the need to click the labels box if your data ranges include labels. A second option is the Constant is Zero equation. This would force the regression equation to pass thru the X = 0 and Y = 0 origin, even if this is not the best fit. Use this with caution, even though it might make sense to have Y = 0 when all the X variables are 0, using this option may not give us the equation that best fits the data.

The residuals box provides a way to see how well each of the plotted data points fits with the predicted results. This will often allow us to see outliers – cases that do not fit with the rest of the data set. Outliers are sometimes indications of data entry errors or, in the case of salary, they may be paid using a different approach. One such example would be a commission salesperson being included with employees that are paid on a straight salary, the basis of pay is so different these two should not be analyzed in the same study. Other options here allow for the results to be turned into Z-scores (Standardized Residuals), plotted on a graph, or have linear plots made for the output and each separate input. Normal Probability Plots are rather

complicated to discuss, and it is left to the student to explore this if desired. You are encouraged to play around with some of these options, even though they are not required for the assignment.

Here is a video on Regression: https://screencast-o-matic.com/watch/cb6jfuIk8S

Summary

Pearson Correlations are fairly easy to produce in Excel with either the Analysis ToolPak Correlation function (best used for multiple correlations or when you want the labels shown) or the Fx (or Formulas) function CORREL, best used for a single correlation outcome with no labels. Both are used for the Pearson correlation only. The Spearman’s correlation requires setting up the data in rank order and providing ranks for each variable separately and then summing these and placing them into a cell formula to obtain the correlation.

Setting up the data for a correlation is fairly simple. Just list the variables in a column and select the appropriate columns for the function being used. For the correlation table, have all the variable columns in a continuous range.

The statistical significance of either correlation is found using the t formula t = r* sqrt(n- 2)/sqrt(1-r2), where r is the correlation value and n is the number of data pairs used for the correlation. Once we have a t-value we can use the t.dist.2t(t, df) formula (df -= n-2) to find the two tail p-value. The lecture presents an approach for finding a minimum statistically significant value when we have a table of correlations to look at; correlations with an absolute value equal to or greater than this value would be statistically significant.

Data set-up for a regression is similar to the correlation table. Have the outcome variable at one end of the range so it can be selected alone and have all the other input variables listed in a continuous range.

The regression function (for either a linear or multiple regression) is located in the Analysis ToolPak list and is called Regression. The set-up within the data entry box is similar to the other functions we have done (t-test, ANOVA, etc.) with a data range, data output location, and a label box to fill in.

Please ask your instructor if you have any questions about this material.

When you have finished with this lecture, please respond to Discussion Thread 3 for this week with your initial response and responses to others over a couple of days.