LectureWeek3-2.pdf

BUS308 Week 3 Lecture 2

This week we introduce the ANOVA test for multiple mean equality and the Chi Square tests for distributions. This lecture will focus on setting up and interpreting the ANVOA tests. The third lecture for this week will focus on the Chi Square tests.

The set-up of ANOVA within Excel is very similar to how we set up the F and T tests last week; place the data set in appropriate groups and then use the ANOVA input box. One difference this week is that the Fx (or Formulas) list does not include an option for ANOVA, so we need to use the Data | Analysis tools.

ANOVA

Data Set-up

Single Factor. As with the t-test, ANOVA has a couple of versions to select between. Pretty much the only difference lies in the data table used for each version. We will be working primarily with the ANOAV Single Factor, which deals with examining possible differences between the means of a single variable within different groups. A question of whether or not the mean compa-ratios are equal across the grades is an example of the kind of question answered with this approach.

As with the F and T tests, we need to group the data into distinct groups. For example, if we are going to test the compa-ratio mean across grades, then the data must be set-up in a table with grades across the top, as in the screen shot below. Note that as with the T and F tests, the

raw or initial data was listed next to the screening criteria (in this case grade) and sorted by grade. While the data shows Gender1, that variable is not used in this test of mean equality across grades. Then the appropriate values were copied and pasted in the grade columns of the input table.

Two-Factor. While we will not work with either of the two-factor forms, a brief explanation will help show the difference and usefulness of these forms. The ANOVA Two- Factor without replication allows us to test the means of two factors at once. An example of this kind of question might be are the compa-ratio means equal across grades when sorted by gender?

The outcome of this test gives us the significance of each group (grade average and gender average) as if the other variable was held constant. In other words, it removes some of the variation on what we are measuring.

A data set-up table for this version might look like this:

A B C D E F Male Female The values in each cell would be a measure for each cell. For example, male salary in grade A. For situations where we have multiple values, we could use the average or median value.

For the with replication version, the more significant test is to see if the variables interact with each other rather than simply examining mean equality. This requires multiple data points

A B C D E F Male Female

The values in each cell would be measures for each group. For example, we could use the minimum, maximum, and mean for each grade and gender group.

for each of the groups (females in grade C, for example).

Test Set-up

Setting up each version of ANOVA is pretty much the same, so we will look at just the Single Factor version. Going to the Data | Analysis and selecting ANOVA Single Factor gives us the following completed input screen. Notice that the entire table range, including the top and side labels, is entered into the Input box as a single entry. We do need to check the labels box, as Excel needs to be explicitly told that some of the data range is not numeric. Our normal alpha value of 0.05 is automatically filled in, but you can change this value. The last entry is where we want to the output table to start. As with the T and F tests, this cell is the upper left corner of the output.

The data input for the with and without replication are a bit different quite similar.

Hypothesis Test

The week 3 question 1 asks if the average salary per grade is equal? While this might seem like a no-brainer (we expect each grade to have higher average salaries), we need to test all assumed relationships. This is much like our detectives saying “we need to exclude you from the suspect pool; where were you last night?” This example will, of course use the compa-ratio instead of the salary values you will use in the homework.

Step 1: Ho: All salary means are equal across the grades

Ha: At least one salary mean differs

Step 2: Alpha = 0.05

Step 3: F statistic and Single Factor ANOVA; used to test multiple means

Step 4: Decision Rule: Reject Ho if the p-value < 0.05

Step 5: Perform the test.

Here is a screen shot of the ANOVA output.

Note that after the test name, we have a summary table that gives us the count, sum, average, and variance for the compa-ratios by group (in this case our grades). Note that we are assuming equal variances for this example, and your assignment, which might not be the case partially due to the differing counts in the grades. Please note that the column and row values are present in this screenshot. These will be needed as references in question 2.

The next table is the meat of the test. While for all practical purposes, we are only interested in the highlighted p-value, knowing what the other values are is helpful. When we introduced ANVOA in lecture 1, we discussed the between and within groups variation. As you recall, the between groups focused on the data set as a single group and not distinct groups. For the Between Groups row, we have an Sum of Squares (SS) value, which is a raw estimate of the variation that exists. The degrees of freedom (df) for Between Groups equals the number of groups (k) we have minus 1 (k-1), which equals 5 for our 6 groups. The Mean Square variation estimate equals the SS divided by the df.

The Within Group focuses on the average variation for all our groups. SS gives us the same raw estimate as for the BG row. The df for Within Groups is the total count (N) minus the number of groups (N-k), or 44 for our 50 employees in the 6 groups. MSwg equals SS/df.

The F statistic is calculated by dividing the MSbg by MSwg. The next column gives us our p- value followed by the critical value of F (when the p-value would be exactly 0.05).

The total line is the sum of the SS values and the overall df which equal the total count -1 (N – 1).

Now that we have our test results, we can complete step 6 of the hypothesis testing procedure.

Step 6: Conclusions and Interpretation

What is the p-value? Found in the table, it is 0.0186 (rounded).

Decision: Reject the null hypothesis.

Why? P-value is less than 0.05.

Conclusion: at least one mean differs across the grades.

Here is a video on ANOVA: https://screencast-o-matic.com/watch/cb6jecIkLg

Question 2: Group Comparisons

Now that we know at least one grade compa-ratio mean is not equal to the rest, we need to determine which mean(s) differ. We do this by creating ranges of the possible difference in the population mean values. Remember, that our sample results are only a close approximation of the actual population mean. We can estimate the range of values that the population mean actually equals (remember that discussion of the sampling distribution of the mean from last week). So, using the variation that exists in our groups, we estimate the range of differences between means (the possible outcomes of subtracting one mean from another).

The formula for developing the range to examine mean differences is:

(mean1 – mean2) +/- t* sqrt(mse*(1/n1 + 1/n2))

Ok – breathe. Most of the values we need are in the ANVOA table, and Excel will let us set up a table and do all these actions one step at a time. Let’s look at the completed table, and then step back and develop it one cell at a time.

Let’s look at what this table tells us before focusing on how to develop the values. Looking at the A-B row, we see that we are initially comparing the means of grades A and B, the Mean Diff. column tells us the difference between the sample means. The T value is used to provide our level of confidence for the range. Since we are using the results from the ANOVA table, t is the same value for all the comparisons regardless of each sample size. The +/- term is the most likely maximum difference for each compared pair based on variation (the MSwg value) and the sample sizes. Once we know these values, we can construct our minimum and maximum differences, the Low value is simply the Mean Diff – the +/- term, while the high end of the range is the Mean Diff + the +/- term.

Now, we need to decide which of these ranges indicates a significantly different pair of means (within the population) and which ranges indicate the likelihood of equal population means (non- significant differences). This is fairly simple, if the range contains a 0 (that is, one endpoint is negative and the other is positive), then the difference is not significant (since a mean difference of 0 would never be significant). Notice in the table, that the A-B, A-C, and A-D range all contain 0, and the results are not significant different. The A-E and A-F comparisons, however have positive values for each end, and do not contain 0; these means are different in the population.

Now, let’s take a look at setting up the values in the table. The following screen shot is of the same table, but different cells display the formulas used to create the values rather than the values. This can help us see the relationships.

Let’s take a look at each column and see how the calculations are set up.

In row 31, we start with the groups we want to compare, in this case it is the means for grades A and B, so we just set up a convenient label telling us what the row refers to, something like A-B.

In Column C, labeled Mean Diff., we set up our first value, the difference between the two means. The formula is =ABS($N$11 – N12). First, the ABS stands for absolute value, this merely eliminates any negative signs (as if we always subtracted the smaller value from the larger value). This is not needed; the author just likes it. The ($N$11 – N12) refers to the mean values located in the Summary table from our ANOVA results. Cell N11 refers to the mean of grade A while cell N12 refers to the mean of grade B. Scroll back up the lecture to check this out in an earlier screenshot. Note that the next row contains the reference to grade A (N11) but changes the second reference to the location of the grade C mean (N13). Repeat this pattern all the way down the table, referencing the two grades being compared in each row.

In column D, we have the t-value used to provide our confidence in the range outcomes. The related cell formula is shown in row 34. With our alpha set at 0.05, we will have 95% confidence in the resulting ranges. This is an accepted level of confidence. The lower alpha, the higher our level of confidence and the larger the range. A 100% confidence results in a range from – infinity to + infinity, of no help whatsoever. Since the entire analysis is based on the ANOVA results, the df for the t-value equals the df for the Within Group variation. The T.INV.2T (0.05, df) function returns the t-value that cuts off the middle 95% of the t-distribution.

Column E develops our range constant. Row 37 shows the calculation formula, =D37*SQRT($N$21 * (1/$l$12 + 1/L13)). This row is comparing the means for grades B and C. Let’s break this down. The D37 refers to the T-value found in column D. The SQRT is Excel’s code for taking the square root of whatever is within the ( ). The $N$21 is the cell reference to the MSwg measure in the ANOVA table. This is the common variance estimate for the samples. The (1/$L$12 and 1/L13) are the references to the counts for grades B and C.

Before going on, let’s review. We have seen examples of the calculations done in each column presented in different rows. As we create these formulas for each row, we would use the cell references for that row. For example, in row 31, we would use only row 11 and 12

references (N11, N12, L11, L12). In row 32, this would change to reflect grades A and C, found in rows 11 and 13, etc.

Now, let’s develop the ranges. The low value equals the Mean Diff. (column C) minus the +/- term (column E), so the formula for row 31 would be =c31 – E31; for row 32, the values change to =C32 – E32, etc. The high value equals column C + column E, or =c31 + E31, etc.

The Yes (for significant) or No (not significant difference) is determined by looking at the end points of the range. If one is negative and the other positive, as with the first couple of comparison, then we do not have a significant difference as the real difference could be 0. Column K just asks you to say why you choose a Yes or No for column J.

Now, to make things a bit easier. Notice the dollar signs around some of the cell references. For example, the dollar signs found in N12; these are made by typing N12 and then pressing F4. These tell Excel if we copy this cell keep N12 as a constant. Without these, copying the cell would change values we want to remain the same. What does this mean? If you want to try copying cells rather than writing the formula in each cell, try the following.

• Using just cell C31, move the cursor to the bottom right corner of the cell. When it is place correctly at the corner, the cursor will change to a small +.

• When you see the +, depress the left mouse button and pull the cursor down one cell to C32.

• You should now see =(4N$11 – N13) rather than =($N$11 – N12). The relative reference of cell N12 went down 1 row as you pulled the cell down one row.

What this means is that after you set up the entire row 31 (from column C thru column I) you can highlight the entire range, place the cursor on the far-right corner, and after you see the + drag all of the cells down from row 31 to row 37, where we start to compare grade B. First, delete the mess in row 36, which is just a separator row. Then in cell C37, change the references to $N$12 and N13 (for grades B and C), do the same in cell E37 to the related counts in $L$12 and L13. Highlight and drag the range down to C42, and make the appropriate adjustments again. Do this until you have reached and edited the cells in row 49. You should now have all the table calculations done, and are ready to make your comparison decisions in columns J and K.

Please ask your instructor if you have any questions about this material.

When you have finished with this lecture, please respond to Discussion thread 2 for this week with your initial response and responses to others over a couple of days.