Years ago, a comedy show used to introduce new skits with the phrase “and now for something completely different.” That seems appropriate for this week’s material.
This week we will look at evaluating our data results in somewhat different ways. One of the criticisms of the hypothesis testing procedure is that it only shows one value, when it is reasonably clear that several different values would also cause us to reject or not reject a null hypothesis of no difference. Many managers and researchers would like to see what these values could be and how they might impact decisions. Confidence intervals will help us here.
The other criticism of the hypothesis testing procedure involves the ability to “manage” the results and ensure a “reject the null hypothesis” decision simply by manipulating the sample size. For example, if we have a difference in a customer preference between two products of only 1%, is this a big deal? Given the uncertainty contained in sample results, we might tend to think that we can safely ignore this result. However, if we were to use a large enough sample, say, 10,000, we could well find that this difference is statistically significant. This, for many, seems to fly in the face of reasonableness. To help us here, we will look at a measure of practical significance (is the difference worth paying any attention to?) called the effect size.
Confidence Intervals
A confidence interval is a range of values that, based upon the sample results, most likely contains the actual population parameter. The “most likely” element is the level of confidence attached to the interval, such as a 95% confidence interval, 90% confidence interval, 99% confidence interval, etc. They can be created at any time, with or without performing a statistical test, such as the t-test.
A confidence interval may be expressed as a range (45 to 51% of the town’s population support the proposal) or as a mean or proportion with a margin of error (48% of the town supports the proposal, with a margin of error of 3%). This last format is frequently seen with opinion poll results, and simply means that you should add and subtract this margin of error from the reported proportion to obtain the range. With either format, the confidence percent should also be provided.
Single Variable Confidence Interval
Confidence intervals (CI) for a single mean (or proportion) are fairly straightforward to understand, and relate to t-test outcomes directly. A confidence interval for a mean equals
CI = mean +/- t* sqrt(variance/sample size), where t is the two-tail value for the desired confidence level
In a Week 4 example, we looked at a one-sample test of a mean against a constant, and we found that we would not have rejected the null hypothesis of mean = 75. In that output, we had the mean equal to 75.5, the count equal to 6, the variance equaled 3.5, and the two-tailed t-value was 2.571 (rounded). The related 95% confidence interval for that variable would be:
CI = 75.5 +/- 2.571* sqrt(3.5/6), or 75.5 +/- 1.96 (rounded) = 73.54 to 77.46.
Note: recall that the sqrt(variance/sample size) is the standard error.
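The single-mean calculation can also be sketched in a few lines of Python. This is only an illustration using the values quoted from the Week 4 example; the function name is made up for this sketch, and the t-value is passed in as given in the text rather than looked up.

```python
import math

def mean_confidence_interval(mean, variance, n, t_value):
    """Return (lower, upper, margin) for a single-mean confidence interval."""
    standard_error = math.sqrt(variance / n)   # sqrt(variance / sample size)
    margin = t_value * standard_error          # the +/- term
    return mean - margin, mean + margin, margin

# Values from the Week 4 example; 2.571 is the two-tail t-value for 95% confidence.
lower, upper, margin = mean_confidence_interval(75.5, 3.5, 6, 2.571)
print(f"95% CI: {lower:.2f} to {upper:.2f} (margin of error {margin:.2f})")
```

Printing the margin separately makes it easy to see the difference between the standard error and the full +/- term.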
The Fx (or Formulas) function Confidence.norm or Confidence.t will give us the margin of error (the +/- term). The Input window is identical for both and requires that we input alpha (0.05 for a 95% confidence interval), the standard deviation (Standard_dev), and the sample size (Size).
The output for both would be subtracted from or added to the mean to give us the confidence range.
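For readers who want to see what the normal-based margin of error involves, here is a rough Python sketch that takes the same three inputs (alpha, standard deviation, sample size). The function name is hypothetical, and the sketch assumes the margin is the two-tail critical z-value times the standard deviation divided by the square root of the sample size.

```python
import math
from statistics import NormalDist

def confidence_norm(alpha, standard_dev, size):
    """Normal-based margin of error for a mean (the +/- term)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)    # two-tail critical z for the chosen alpha
    return z * standard_dev / math.sqrt(size)

# Same sample as the example above: variance 3.5 (so sd = sqrt(3.5)) and 6 observations.
margin_norm = confidence_norm(0.05, math.sqrt(3.5), 6)
print(f"Margin of error (normal): {margin_norm:.3f}")
```

Confidence.t uses the t distribution instead of the normal; the Python standard library has no t distribution, so that version is not sketched here. With a small sample the t-based margin will be wider than the normal-based one.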
How can we use confidence intervals? First, this tells us that we can be 95% confident that the population mean is between these two values. This would help us in making a decision based upon our sample results. Second, we can say that any claim about the population mean that is outside of this range would result in our rejecting the null hypothesis of the mean equaling that claim (in a two-tail test at the matching alpha, 0.05 for a 95% interval). So, in some ways, a one sample confidence interval can take the place of testing different population parameter claims.
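As a small illustration of the second use, the sketch below rebuilds the interval from the Week 4 values and checks two claimed population means against it. The helper name and the claim of 78 are made up for this example; 75 is the claim from the original test.

```python
import math

def claim_is_rejected(claimed_mean, lower, upper):
    """True when the claimed population mean lies outside the confidence interval."""
    return claimed_mean < lower or claimed_mean > upper

# Interval from the Week 4 values: mean 75.5, variance 3.5, n = 6, t = 2.571.
margin = 2.571 * math.sqrt(3.5 / 6)
lower, upper = 75.5 - margin, 75.5 + margin

for claim in (75, 78):   # 78 is a hypothetical claim added for contrast
    decision = "reject" if claim_is_rejected(claim, lower, upper) else "do not reject"
    print(f"Claimed mean {claim}: {decision} the null hypothesis")
```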
Confidence intervals allow us to make some informed “gut level” decisions when more precise measures may not be needed. For example, if the means of two variables are fairly close, the one having a wider confidence interval will have more variation within the data, and be less consistent. (This could be verified with the F-test for variance that we covered in week 3.) Comparing the endpoints against the standard used in our one-sample t-test would give a sense of how “close” we came to making the other decision.
We can use individual CIs for related variables, such as male and female mean salaries, to get an idea about differences that may exist within the populations by looking at the extent to which they overlap. There are a couple of guidelines to follow.
· If the bottom quarter of one range overlaps with the top quarter of the other range, the means would be found to be significantly different at the alpha = 0.05 level.
· If the bottom value of one range just touches the top value of the other range, the means would be found to be significantly different at the .01 alpha level.
· If the ranges do not overlap at all, the means are significantly different at an alpha less than .01 level.
· If the intervals overlap more than at the extreme quarters, the means are not significantly different at the alpha = .05 level.
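The guidelines above can be turned into a rough checker. This sketch reads “quarter” as one quarter of the narrower range, and it treats any overlap between “just touching” and a quarter as falling under the .05 guideline; both choices are assumptions made for this example, as are the function name and the interval values.

```python
import math

def overlap_guideline(ci_a, ci_b):
    """Apply the overlap rules of thumb to two (low, high) intervals."""
    lower_ci, upper_ci = sorted([ci_a, ci_b])   # interval with the smaller low end first
    overlap = lower_ci[1] - upper_ci[0]         # negative when there is a gap
    quarter = min(lower_ci[1] - lower_ci[0], upper_ci[1] - upper_ci[0]) / 4
    if math.isclose(overlap, 0.0, abs_tol=1e-9):
        return "just touching: significantly different at alpha = .01"
    if overlap < 0:
        return "no overlap: significantly different at alpha below .01"
    if overlap <= quarter:
        return "quarter overlap or less: significantly different at alpha = .05"
    return "larger overlap: not significantly different at alpha = .05"

# Made-up salary intervals (in thousands) for two groups.
print(overlap_guideline((48, 58), (40, 50)))
```

These are rules of thumb only; a formal test or an interval for the difference gives the more precise answer.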
Confidence Intervals for Differences
Confidence intervals can also be used to examine the difference between means. The most direct way is by constructing a confidence interval for the difference. This result is very similar to the intervals we constructed while doing the ANOVA comparisons. While we use a different calculation formula when comparing only two means (rather than two means at a time with the ANOVA situation), the interpretation is the same. If the range contains a 0, then the population means could be identical and we would not reject the null hypothesis of no difference.
Excel does not have a function to directly provide us with the interval for the difference between means. It does have the tools to give us the t-value to provide the level of confidence we need. The generic equation for the confidence range for differences is
(Mean difference) +/- t*SE of difference.
If we perform a two-sample t-test assuming equal variances, we can get the information needed to construct this range. (Note: the unequal variance t-test result does not give us the correct t value in the output table.) Here is where we need to look:
· The mean difference is simply the absolute difference between the values listed in the Mean row.
· The t-value is found in the T Critical Two Tail row at the bottom.
· The Standard Error of the difference needs to be calculated, and equals the square root of the sum of each variance divided by its sample size: sqrt(variance1/n1 + variance2/n2).
Combining these three values using the general formula above will give us the interval for the difference between the means.
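The three pieces can be combined as in the following sketch. The numbers are invented for illustration (they do not come from the course data set), and the function name is hypothetical; the t-value stands in for whatever the T Critical Two Tail row reports.

```python
import math

def difference_confidence_interval(mean1, mean2, var1, var2, n1, n2, t_critical):
    """Return (lower, upper) for the interval around the difference between two means."""
    mean_difference = abs(mean1 - mean2)                 # from the Mean row
    se_difference = math.sqrt(var1 / n1 + var2 / n2)     # standard error of the difference
    margin = t_critical * se_difference
    return mean_difference - margin, mean_difference + margin

# Hypothetical inputs: means 52 and 48, variances 16 and 20, 25 cases each, t = 2.011.
low, high = difference_confidence_interval(52.0, 48.0, 16.0, 20.0, 25, 25, 2.011)
contains_zero = low <= 0 <= high
print(f"Interval for the difference: {low:.2f} to {high:.2f}")
print("Do not reject the null" if contains_zero else "Reject the null of no difference")
```

In this made-up case the interval does not contain 0, so the null hypothesis of no difference would be rejected.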
Effect Size – Practical Importance
A popular saying a few years ago was “if you torture data long enough, it will confess to anything.” 😊 Unfortunately, many regard statistical analysis this same way. If you do not get the rejection of the null hypothesis that you want, simply repeat the sampling with a larger group. At some sample size, virtually all differences will be found to be statistically significant.
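To see how sample size alone can drive a rejection, consider the sketch below. It uses a two-proportion z-test, which is not one of the tests covered in these notes and is brought in only for illustration; the 50% versus 51% preference rates and the group sizes are made up, and equal group sizes are assumed.

```python
import math
from statistics import NormalDist

def two_proportion_p_value(p1, p2, n_per_group):
    """Two-tail p-value for a difference in proportions, equal group sizes assumed."""
    pooled = (p1 + p2) / 2
    standard_error = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = abs(p1 - p2) / standard_error
    return 2 * (1 - NormalDist().cdf(z))

# The same 1-point difference, tested with larger and larger groups.
for n in (500, 5_000, 50_000):
    p_value = two_proportion_p_value(0.50, 0.51, n)
    verdict = "significant" if p_value < 0.05 else "not significant"
    print(f"n per group = {n:>6}: p-value = {p_value:.4f} ({verdict} at alpha = .05)")
```

The difference itself never changes; only the sample size does.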
But does statistical significance mean the findings should be used in decision making? If, for example, we typically round salary to the nearest thousand dollars when making decisions, does a significant difference based on a $500 difference have any practical importance? Probably not.
So, how do we decide the practical importance of a statistically significant difference? Once, and this is important, we have rejected the null hypothesis – and only if we have rejected the null hypothesis – we calculate a new statistic called the effect size.
The name comes from the effect changing the variable’s value would have on the outcome. To understand this idea, let’s look at the male and female compa-ratios. We found in week 2 that the male and female compa-ratio means were not significantly different. So, the “effect” of changing from male to female when doing an analysis with the compa-ratio mean would not be very big. However, if we switched from the male to female average salary, we would expect to see a large effect or difference in the outcome since their salaries were so different.
The effect size measure – however it is calculated for different statistical tests – can be interpreted in a similar fashion. Effect sizes generally have their value translated into a “large,” “moderate,” or “small” label. If we have a large effect, then we know that the variable interaction caused the rejection of the null hypothesis, and that our results have a strong practical significance. If, however, we have a small effect, then we can be fairly sure that the sample size caused the rejection of the null hypothesis and the results have little to no practical significance for decision making or research results. A moderate outcome is less clear, and we might want to redo the analysis with a different sample.
Now, when do we look at an Effect Size; that is, when should we go to the effort of calculating one? The general consensus is that the Effect Size measure only adds value to our analysis if we have already rejected the null hypothesis. This makes sense: if we found no difference between the variables we were looking at, why try to see what effect changing from one to the other would have? We already know: not much.
When we reject a null hypothesis due to a significant test statistic (one having a p-value less than our chosen alpha level), we can ask a question: was this rejection due to the variable interactions or was it due to the sample size? If due to a large sample size, the practical significance of the outcome is very low. It would often not be “smart business” to decide based on those kinds of results. If, however, we have evidence that the null was rejected due to a significant interaction by the variables, then it makes more sense to use this information in making decisions.
Therefore, when looking at Effect Sizes, we tend to classify them as large, moderate, or small. Large effects mean that the variable interactions caused the rejection of the null, and our results have practical significance. If we have small effect size measures, it indicates that the rejection of the null was more likely to have been caused by the sample size, and thus the rejection has very little practical significance on daily activities and decisions.
OK, so far:
· Effect sizes are examined only after we reject the null hypothesis; they are meaningless when we do not reject a claim of no difference.
· Large effect size values indicate that variable interactions caused the rejection of the null hypothesis, and indicate a strong practical significance to the rejection decision.
· Small effect size values indicate that the sample size was the most likely cause of rejecting the null, and that the outcome is of very limited practical significance.
· Moderate effect sizes are more difficult to interpret. It is not clear what had more influence on the rejection decision and suggests only moderate practical significance. These results might suggest a new sample and analysis.
Different statistical tests have different effect size measures and interpretations of their values. Here are some that relate to the work we have done in this course.
· T-test for independent samples. Cohen’s D is found by dividing the absolute difference between the means by the pooled standard deviation of the entire data set. A large effect is .8 or above, a moderate effect is around .5 to .7, and a small effect is .4 or lower. Interpretation of values between these levels is up to the researcher and/or decision maker. In the two-sample equal variance t-test, the pooled variance value is provided; take its square root to get the pooled standard deviation. In the two-sample unequal variance test, the pooled variance is not provided, and the simplest solution is to find the variance of the entire data set separately with a VAR.S(range) command and use the square root of this value in the calculation of Cohen’s D.
· One-sample T-test. Cohen’s D is found by the absolute difference between the mean and the standard divided by the standard deviation of the tested variable data set. A large effect is .8 or above, a moderate effect is around .5 to .7, and a small effect is .4 or lower. Interpretation of values between these levels is up to the researcher and/or decision maker.
· Paired T-test. Effect size r = square root of (t^2/(t^2 + df)). A large effect is .4 or above, a moderate effect is around .25 to .4, and a small effect is .25 or lower.
· ANOVA. Eta squared equals the SS between/SS total. A large effect is .4 or above, a moderate effect is .25 to .40, and a small effect is .25 or lower.
· Chi Square Goodness of Fit tests (1-row actual tables). The measure is also called effect size r, and r = square root of (Chi Square statistic/(N * (c - 1))), where c equals the number of columns in the table. A large effect is .5 or above, a moderate effect is .3 to .5, and a small effect is .3 or lower.
· Chi Square Contingency Table tests. For a 2x2 table, use phi = square root of (chi square value/N). A large effect is .5 or above, a moderate effect is .3 to .5, and a small effect is .3 or lower.
· Chi Square Contingency Table tests. For larger than a 2x2 table, use Cramer’s V = square root of (chi square value/(N * ((smaller of R or C) - 1))). A large effect is .5 or above, a moderate effect is .3 to .5, and a small effect is .3 or lower.
· Correlation. Use the absolute value of the correlation. A large effect is .4 or above, a moderate effect is .25 to .4, and a small effect is .25 or lower.
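A few of the measures in the list above can be sketched in Python as follows. All function names and input values here are invented for illustration, and each example assumes the null hypothesis has already been rejected. The labeling helper calls anything between the small and large cutoffs “moderate,” which is a simplification: for Cohen’s D, the text leaves in-between values to the researcher.

```python
import math

def cohens_d(mean_a, mean_b, standard_dev):
    """Absolute mean difference divided by the (pooled) standard deviation."""
    return abs(mean_a - mean_b) / standard_dev

def effect_size_r(t_value, df):
    """Paired t-test effect size: sqrt(t^2 / (t^2 + df))."""
    return math.sqrt(t_value ** 2 / (t_value ** 2 + df))

def eta_squared(ss_between, ss_total):
    """SS between divided by SS total."""
    return ss_between / ss_total

def phi_coefficient(chi_square, n):
    """Effect size for a 2x2 contingency table: sqrt(chi square / N)."""
    return math.sqrt(chi_square / n)

def effect_label(value, small_max, large_min):
    """Translate a value into a label; in-between values are called moderate here."""
    if value >= large_min:
        return "large"
    if value <= small_max:
        return "small"
    return "moderate"

# Hypothetical inputs, one per measure, with the cutoffs listed for each test.
d = cohens_d(52.0, 48.0, 4.0)
r = effect_size_r(2.5, 24)
eta = eta_squared(30.0, 100.0)
phi = phi_coefficient(4.0, 100)
print(f"Cohen's D   = {d:.2f} ({effect_label(d, 0.4, 0.8)})")
print(f"r           = {r:.2f} ({effect_label(r, 0.25, 0.4)})")
print(f"Eta squared = {eta:.2f} ({effect_label(eta, 0.25, 0.4)})")
print(f"Phi         = {phi:.2f} ({effect_label(phi, 0.3, 0.5)})")
```

Keeping the cutoffs as arguments, rather than building them into each function, reflects the point that the thresholds differ from test to test.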