STATISTICAL ANALYSIS
Objective: Following completion of this class, the student will be able to carry out some basic statistical testing, will have fundamental knowledge regarding selection of proper tests, and will understand the necessity for consulting statisticians prior to beginning a research project.
In the planning stages of any research you might do in the future, one of your first steps should be to consult an expert in statistics. Because you have no required statistics courses in your curriculum and because coverage in this class will be minimal, the probability is that you will need help. With a simple research design, you might simply be seeking confirmation of your plans for statistical analysis, but in a complex design or when group equivalence is in question, the consultation might help you avoid serious difficulties later. Planning the project is really as important as execution of the study, and it is folly to leave out consideration of what statistical treatment you will eventually use. The expert that you consult can be a statistician in the math or psychology departments or someone in education who is up on the topic. All faculty members with doctorates have taken multiple classes in statistics, but like much of what is learned, what you don’t use, you lose. Your advisor can either help you with statistical planning or can point you to another faculty member for assistance. After graduation, you’ll find that many of your mentors will still be happy to assist you.
The primary challenge is in selecting the proper statistical test. You might use only descriptive statistics, the most important being means and standard deviations. What is the average score? Were scores fairly homogeneous or was there a widespread dispersion in one or both groups? You’ve probably dealt with these concepts in previous classes and these are important questions to address in descriptive studies. However, in your pilot studies you are looking for the effect of an independent variable (your two groups) on a dependent variable (the answer to a survey item). If you’re to test the difference between means of these dependent variables (survey items), how many groups are involved? If there are two, as there will be in your pilot study, use a t-test. You could also use the t-test to determine whether significant differences between two groups exist at pretesting in an experimental design. Testing the differences between means of three or more groups entails using analysis of variance (ANOVA). In the latter case, if the groups are not equivalent at the outset, analysis of covariance (ANCOVA) can be run. Are you looking for relationships between variables? Use a correlation procedure (Relationship? – Correlation!). Are you looking to predict something from a set of scores? For example, graduate schools like to predict graduate performance (GPA) from GRE scores. Regression analysis will do the trick. When working with true/false or yes/no survey items like you will have in your pilot study, a procedure you can use is Fisher’s Exact Probability Test.
This might sound easy, but it isn’t really this simple. There’s more than one kind of t-test. You’ll be using a t-test for independent samples. If you test for differences between pretest and posttest of a single sample, you’d use a t-test for dependent samples. There is a whole set of what are called nonparametric tests which serve the same purpose as the t-tests and analysis of variance mentioned. There are different correlational procedures, depending in part on the kind of data you have. These are but examples of the kinds of things that the statistician will be interested in talking to you about and that will lead to the final decision on the approach to take. For your pilot study in this class, I will serve as your statistician and guide you through the proper statistical analysis procedures.
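The rules of thumb in the last two paragraphs can be summarized as a small lookup. This is only a sketch: the function name `suggest_test` and the goal labels are made up for illustration, it covers nothing beyond the cases named above, and it is no substitute for talking with a statistician.

```python
def suggest_test(goal, groups=2, same_subjects=False, equivalent_at_outset=True):
    """Return the procedure this section suggests for a research goal.

    goal is one of 'describe', 'compare_means', 'relationship',
    'predict', or 'compare_yes_no' (labels invented for this sketch).
    """
    if goal == "describe":
        return "means and standard deviations"
    if goal == "relationship":
        return "correlation"
    if goal == "predict":
        return "regression analysis"
    if goal == "compare_yes_no":
        return "Fisher's Exact Probability Test"
    if goal == "compare_means":
        if groups == 2:
            # Pretest vs. posttest on a single sample is the dependent case.
            if same_subjects:
                return "t-test for dependent samples"
            return "t-test for independent samples"
        if groups >= 3:
            return "ANOVA" if equivalent_at_outset else "ANCOVA"
    raise ValueError(f"no rule of thumb for goal={goal!r}, groups={groups}")


# The two situations that come up in the pilot study.
pilot_likert = suggest_test("compare_means", groups=2)
pilot_yes_no = suggest_test("compare_yes_no")
print(pilot_likert)
print(pilot_yes_no)
```

The nonparametric alternatives and the different correlational procedures mentioned above are deliberately left out, because choosing among them depends on the kind of data you have.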
Illustration of complexity of selecting the proper nonparametric test
In executing your pilot study in this class you will work with mandated small sample sizes, just to go through the procedures while looking for Murphy’s Law to work. With small sample sizes, the power of the test is weak, as is discussed in the Etext section on sampling. A lack of power indicates that it is very difficult to detect significant differences between means. That isn’t a problem with the pilot study, but in a full-blown research project you’d like to have group numbers of 20 or more in some cases. You should place the following statement in the opening paragraph of your results and discussion section: While it is generally understood that statistically insignificant differences are not discussed, this pilot study had a mandated five subjects per group, almost assuring findings of no statistically significant differences. Therefore, there will be at least some discussion of findings from the statistical analysis of each survey item. (Copy and paste freely; there is no need to cite the source in this case.)
Fortunately, there are many easy-to-use software packages that can help you with the statistical treatment. Excel spreadsheets can have some calculation formulas built in and three of those will be provided for your use in this class. Computers have replaced older and more tedious calculation procedures and in doing so have reduced the likelihood of calculation errors. Statistics is no longer something to be feared. It is simply a tool to use in describing what you found and analyzing for significance. Remember, please seek expert assistance unless you have a formal background of several statistics courses. This particular course provides the bare minimum in statistical training. It should help you in planning and give you some understanding of what statistical treatment to choose, but it just might raise more questions than it answers. If all you learned was that an expert’s advice is imperative in planning this part of your research, you’re ahead in this game!
You will have statistics assignments to complete prior to getting into your data analysis. You’ll learn how to use Excel spreadsheets to obtain means and standard deviations, to carry out t-tests, to do Fisher’s Exact Probability Test, and to use the Pearson-r correlation procedure for determining reliability of your Likert-type items. Finally, you’ll have an exercise that will illustrate the effect of sample size on power. This class will give you only a superficial exposure to using statistics in educational research. To really understand statistics, formal classwork in that field would be desirable.
Let’s look at the null hypothesis and make sure you understand what it means and how the statistical results relate to it. First, the null hypothesis is a statement that there is no real difference between the means of the groups being compared. It is tested at a given alpha level, most often set at .05 in educational research. The alpha value is called the level of significance (the corresponding level of confidence is 1 minus alpha, or .95). Basically, what alpha = .05 means is that when the null hypothesis is really true, 5 times in 100 a sampling error will cause you to make a mistake by rejecting it. Remember that in sampling we use random assignment to groups, but there is always the outside possibility that we end up with a biased sample. The silver lining to the 5% error is that when the null hypothesis is really true, 95% of the time we will correctly fail to reject it at alpha = .05.
What we’ve just discussed is called a Type 1 error. There is another possible statistical mistake that we can make, one that we almost certainly will make in these pilot studies. It is a Type 2 error, failing to reject the null hypothesis when it is false. This often happens as a result of inadequate statistical power caused by a too-small sample. I know that dealing with these negatives and double negatives is difficult, but please be patient. When we make this Type 2 error, we do not reject the null hypothesis even though there is truly a difference between the two groups. What will occur in most of your analyses, with your very small samples, is that you will not reject your null hypotheses. You’ll look at the means of the two groups and the differences will appear to be really big, but your statistical testing says no, the difference is not statistically significant. You know in your heart that the two groups’ perceptions are different, but statistics says otherwise. If you’re right, to remedy the situation in a full-blown study, you would increase the power of your test by using a larger sample size and then would be able to reject the null and say with confidence that the means are really different. There is a truth table on the next page that lays out the possible outcomes of statistical testing.
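Did you ever hear of a false negative pregnancy test? That’s when the test kit says that you aren’t pregnant, but you really are. It is like a Type 2 error: a real effect is missed. You can also get a false positive treadmill test. That is when the test shows you to have heart disease but you really don’t. It is like a Type 1 error: an effect is reported that isn’t really there.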
The bottom line in all of this is that you should attempt to avoid Type 2 errors by using large enough samples (20-30 per group as a rule of thumb). There will remain the possibility of making the Type 1 error from just accidentally getting a nonrepresentative sample. There’s not a lot you can do about that beyond doing your best in sampling. We live with that, the 5% of the time error when alpha is set at .05.
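If you would like to see the two kinds of error happen, a small simulation can help. The sketch below is not part of the class spreadsheets. It assumes normally distributed scores with a standard deviation of 1, a true difference of one standard deviation when the null is false, and the standard two-tailed criterion values of t at alpha = .05 (2.306 for 5 per group, 2.024 for 20 per group). The function names and the number of trials are made up for illustration.

```python
import math
import random
import statistics


def t_independent(a, b):
    """Equal-variance t statistic for two independent samples."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se


# Two-tailed criterion values of t at alpha = .05, keyed by group size
# (df = 8 for 5 per group, df = 38 for 20 per group).
CRITICAL_T = {5: 2.306, 20: 2.024}


def rejection_rate(n_per_group, true_difference, trials=2000, seed=1):
    """Fraction of simulated studies that reject the null hypothesis."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        b = [rng.gauss(true_difference, 1.0) for _ in range(n_per_group)]
        if abs(t_independent(a, b)) >= CRITICAL_T[n_per_group]:
            rejections += 1
    return rejections / trials


# Null really true: every rejection is a Type 1 error.
type1_rate = rejection_rate(5, true_difference=0.0)
# Null really false: every failure to reject is a Type 2 error.
power_n5 = rejection_rate(5, true_difference=1.0)
power_n20 = rejection_rate(20, true_difference=1.0)

print(f"Type 1 error rate, 5 per group: {type1_rate:.3f}")
print(f"Power, 5 per group:  {power_n5:.3f} (Type 2 rate {1 - power_n5:.3f})")
print(f"Power, 20 per group: {power_n20:.3f} (Type 2 rate {1 - power_n20:.3f})")
```

The Type 1 rate should land near alpha whatever the sample size, while the Type 2 rate should drop sharply as the groups grow from 5 to 20.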
Table. Truth table for statistical testing outcomes.
| | Fail to reject | Reject |
| --- | --- | --- |
| When the null hypothesis is really true (means are not significantly different) | This is fine, as it should be. | This is a Type 1 error that is made 5% of the time (alpha). (You say the null is not true when it really is: false positive) |
| When the null hypothesis is really false (means are significantly different) | This is a Type 2 error (beta). (You say the null is true when it really isn’t: false negative) | This is fine, as it should be. |
Try to get a good feel for these very basic statistical concepts. As a beginning researcher, it is common to be insecure with statistics. Your professional life might lead you into taking one or more classes in statistics during which you will develop an in-depth understanding. Until that happens, please consult the experts regarding experimental design and statistics.
Suggested reading
Ferguson, G. A. (1981). Statistical analysis in psychology and education (5th ed.). New York: McGraw-Hill.
Wike, E. L. (1971). Data analysis: A statistical primer for psychology students. Chicago: Aldine-Atherton.
Fundamental Statistics Concepts and Practice
Previous students have recommended more explanation and clarification of the statistical tools used in this class and this appendix is a move in that direction. You will need an understanding of mean, standard deviation, the t-test, Fisher’s Exact Probability Test, and correlations. Let’s look at those, one at a time, and do some simple problems that should contribute to your mastery. This is certainly not a substitute for a statistics class, but is rather meant to give you a basic foundation so that you can intelligently handle the requirements of your pilot study.
Means and standard deviations
A mean is simply an average, and calculating a mean is as simple as adding the scores and then dividing by the number of scores. When you work with your Likert items, this is done for you automatically. All you have to do is enter the data on the spreadsheet provided (yellow cells) and read the result. Entering your raw data will give you not only the mean; the standard deviation will be calculated and provided for you as well. The median (score in the center of a distribution) and the mode (score appearing most frequently) will not be used in your research. These, along with the mean, are frequently grouped as measures or indicators of central tendency. We will use only the mean.
The standard deviation is a measure of variability. Are your scores homogeneous, pretty tightly grouped such as samples 3, 4, and 5 in Table 1? If so, the standard deviation is a small value. Are your scores quite different, some strongly agree mixed with strongly disagree, for example (see samples 1 and 2 in Table 1)? If so, you’ll get a relatively large standard deviation. And if one group has a large SD and the other a small SD, then that’s food for thought and discussion. Specifically, why would one group have so much variation in their opinion? Why do they disagree among themselves? That would certainly be something worthy of investigation and discussion.
Let’s get some practice. Open the spreadsheet for doing t-tests and consider the raw data provided in this table.
Table 1. Raw data for statistics practice.
| Sample | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| Subject 1 | 5 | 5 | 3 | 1 | 3 |
| Subject 2 | 4 | 5 | 3 | 2 | 4 |
| Subject 3 | 3 | 3 | 3 | 2 | 5 |
| Subject 4 | 2 | 1 | 3 | 2 | 5 |
| Subject 5 | 1 | 1 | 3 | 1 | 4 |
| Mean | | | | | |
| St Dev | | | | | |
In your t-test spreadsheet, use the solid yellow block in either column C or column D. Enter the scores shown for sample 1 in Table 1 and copy the mean and SD in the spaces provided. Do the same for samples 2-5. For questions 6-8, assume a Likert scale of strongly agree = 5, agree = 4, neutral = 3, disagree = 2, and strongly disagree = 1.
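If you want to cross-check what the spreadsheet gives you, the same two values can be calculated in a few lines of Python. This is a sketch outside the class materials. It assumes the spreadsheet reports the sample standard deviation (dividing by n - 1), which is what `statistics.stdev` calculates.

```python
import statistics

# Raw scores from Table 1, one list per sample (Subject 1 to Subject 5).
table1 = {
    1: [5, 4, 3, 2, 1],
    2: [5, 5, 3, 1, 1],
    3: [3, 3, 3, 3, 3],
    4: [1, 2, 2, 2, 1],
    5: [3, 4, 5, 5, 4],
}

summary = {}
for sample, scores in table1.items():
    mean = statistics.mean(scores)
    # statistics.stdev divides by n - 1 (the sample standard deviation).
    sd = statistics.stdev(scores)
    summary[sample] = (mean, sd)
    print(f"Sample {sample}: mean = {mean:.2f}, SD = {sd:.2f}")
```

Fill in Table 1 from the spreadsheet first, then use this only to confirm that you entered the data correctly.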
Questions on means and standard deviations:
1. One sample shows no variability at all. Which one is it and why is that so?
2. Which two samples show the most variability in scores? What are their SD values?
3. Which two samples show the lowest amount of variation, excluding #3? What are their SD values?
4. By trial and error, list five scores that would be more variable than #2. What is the SD for your hypothetical sample?
5. By trial and error, list five scores that would be less variable than #4, but would show some variability. What is the SD for your sample?
6. Which sample mean indicates agreement with the survey item in question? Is it nearer ‘agree’ or ‘strongly agree’?
7. Which sample mean indicates disagreement with the survey item in question? Is it nearer ‘disagree’ or ‘strongly disagree’?
8. Which sample means indicate neutrality with respect to the question?
t-tests for independent samples
In your pilot study, you’ll compare two sets of sample scores by using the t-test spreadsheet. You will see means and standard deviations calculated for you following data entry and you’ll also see values for t that you can make a decision on regarding significance. With the total sample of ten subjects, a value of t equal to or greater than 2.31 will be required to meet the criterion for significant differences between the means. Look at the two columns in red font on the spreadsheet to see how the criterion value of t changes as the samples get larger.
Given that information, enter pairs of samples into the yellow cell blocks in your t-test spreadsheet after reading this whole paragraph. There’s no point in testing for differences of means between samples 1&2, 1&3, or 2&3. Why? Those means are all the same, 3.0, so there is no difference to test. In the real world, if you enter your pilot study data and it happens that the means are equal, the test for significance will be automatic and you’ll be provided with values of t to report. In such a case, t = 0…as far away from the criterion value of 2.31 as you can get. That makes sense, if the means are equal and you’re testing for significance of differences, doesn’t it? The opposite side of that coin is that the bigger the difference between means, the greater the probability that the two means are significantly different. The variability of scores, shown by the size of the SD is a factor, but we will ignore that at this point. Now you have five sets of means that you had calculated with the data from Table 1. Three of them are equal (3, 3, and 3) and then there are means of 1.6 and 4.2. Let’s do some testing of those pairs that are different. Go ahead now and enter pairs of data that you see in Table 2 into the two yellow columns of your spreadsheet. Then fill out Table 2. The last row will be absolute difference between means, no plus or minus signs (Just subtract the smaller mean from the larger). When you finish, check your results with the key provided at the end of this worksheet to make certain that you know how to use the t-test spreadsheet correctly.
Table 2. t-values and decisions on selected pairs of samples from Table 1.
| Samples | 1&4 | 1&5 | 2&4 | 2&5 | 3&4 | 3&5 | 4&5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| t-value | | | | | | | |
| Sig? Y or N? | | | | | | | |
| SD for Group 1 / SD for Group 2 | | | | | | | |
| Diff betw means | | | | | | | |
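For those who want to see what the spreadsheet is doing behind the scenes, here is a minimal sketch of a t-test for independent samples applied to the pairs in Table 2. It assumes the spreadsheet uses the usual equal-variance (pooled) formula and the criterion value of 2.31 given above for ten subjects in total. The function name `t_independent` is made up for illustration.

```python
import math
import statistics

# Raw scores from Table 1.
table1 = {
    1: [5, 4, 3, 2, 1],
    2: [5, 5, 3, 1, 1],
    3: [3, 3, 3, 3, 3],
    4: [1, 2, 2, 2, 1],
    5: [3, 4, 5, 5, 4],
}

CRITICAL_T = 2.31  # criterion at alpha = .05 for 5 subjects per group


def t_independent(a, b):
    """Equal-variance t statistic for two independent samples."""
    na, nb = len(a), len(b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se


results = {}
for first, second in [(1, 4), (1, 5), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]:
    t = abs(t_independent(table1[first], table1[second]))
    results[(first, second)] = t
    verdict = "Y" if t >= CRITICAL_T else "N"
    print(f"{first}&{second}: t = {t:.2f}, significant? {verdict}")
```

Notice that the standard deviations enter the formula through the pooled variance in the denominator, so the size of the difference between means is not the only thing that determines t.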
Questions on the t-test results:
Once you finish filling out Table 2, you should find that the means of three of the seven pairs of samples are significantly different. For sets 4 & 5, the absolute difference between the means was 2.6 and common sense would lead you to believe those two means would be significantly different and they were. However, look at the difference between the means of the other pairs. They are all either 1.2 or 1.4, but in two cases we have significance and in four we do not. If the magnitude of the difference between means isn’t the only factor at work, explain to me what makes the difference in this table between significance and nonsignificance.
Look carefully at the two columns in red font on the spreadsheet for the answer to the three questions in this item. If you had six subjects per group, what is the criterion value for decision-making at alpha=.05? What would be the value required for ten subjects per group, a total of 20? What is the trend that you see in the relationship of sample size to the value of t required for significance?
You have already determined a t-value for 1 & 4. Now do the same for 4 & 1. In one case, you’ll enter the data for Group 1 in the left column and in the second case you’ll enter the data for Group 4 in the left column. The data for the two groups doesn’t change; only the order of entry is different. What is the only difference that you see in the value of t? Does that explain why we would not report a negative value of t and why the criterion value is always without sign, the absolute value? Please never report a t-value with a negative sign! It makes no sense at all.
Power (ability to avoid a Type 2 error – false negative – see p. 88)
Statistical power relates to the relative ability of detecting significant differences when the means are really different. Elsewhere in the Etext, a point was made that with the small sample sizes used in your pilot study, the probability of finding significance is slim or none. Here are a couple of exercises that will illustrate that point. You have already calculated t = 1.87 for Samples 1 and 4 data in Table 1. Now take that same data and enter it twice in the yellow blocks in your spreadsheet. For group one, enter sample #1 data of 5, 4, 3, 2, and 1 and continue down the column with the same numbers again 5,4,3,2,1. For group 2 on the spreadsheet, enter sample #4 data (see Table 1) twice as you did for sample #1. Be sure to change the values for the numbers in each group – see the two yellow cells to the right. You will then note that the means are identical (see the Table 1 values that you calculated), but the standard deviations are just a little lower. The big difference is in the value of t. Compare it with the criterion value for a sample of 20 subjects. You doubled the sample size and now find significant differences between the mean of 3.0 for sample 1 and the mean of 1.6 for sample 4. That will not always happen. This time we made sure that the additional scores were the same as those we started with. However, if you beef up your study with more subjects, there is no guarantee that the ones you add will think like the others! Random sampling will make that more likely to occur, but we deal with probabilities throughout statistical practice and there are no guarantees!
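The doubling exercise just described can also be sketched in Python. As before, this assumes the equal-variance t formula. The criterion values are the standard two-tailed values at alpha = .05: 2.31 for 10 subjects in total and 2.10 for 20 subjects in total. Check them against the red-font columns on your spreadsheet.

```python
import math
import statistics


def t_independent(a, b):
    """Equal-variance t statistic for two independent samples."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se


sample1 = [5, 4, 3, 2, 1]
sample4 = [1, 2, 2, 2, 1]

# Original data: 5 subjects per group.
t_original = abs(t_independent(sample1, sample4))
# Same scores entered twice: 10 subjects per group, identical means.
t_doubled = abs(t_independent(sample1 * 2, sample4 * 2))

print(f"5 per group:  t = {t_original:.2f}, criterion 2.31")
print(f"10 per group: t = {t_doubled:.2f}, criterion 2.10")
```

The means stay the same, but t rises and the criterion falls, so the same difference crosses the line once the groups are larger.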
Questions on power
Add two more subjects to sample 1, scores that won’t affect the mean. What scores could that be? Let’s use 3 and 3! Now do the same for sample 4, adding scores of 1 and 2 to keep the mean fairly constant. Be sure to change the numbers in the two yellow spaces to the right…7 rather than 5 per group. Place those new numbers in cells F3 and F4.
What is the value of t?
Does this meet the criterion for significance?
What story does this tell about the importance of sample size?
Fisher Exact Probability Test for true/false and yes/no items
Most educational research is done at an alpha level of .05, meaning that if the probability calculated by the Fisher test is less than or equal to .05 we have a statistically significant difference. If the Fisher value exceeds .05 then the two groups’ response distributions are considered not to be significantly different. The size of the Fisher value is important. If you get one at .06 with your small sample, you’ll get excited and want to talk about it. On the other hand, if the value is .6, .8, or 1.0 it is a ho-hum-no difference situation.
Table 3. Hypothetical distributions for Fisher analysis.
| | Yes | No | | True | False |
| --- | --- | --- | --- | --- | --- |
| Sample 1 | 5 | 0 | Sample 7 | 5 | 0 |
| Sample 2 | 4 | 1 | Sample 8 | 2 | 3 |
| Fisher Ex Prob | | | Fisher Ex Prob | | |
| Sample 3 | 5 | 0 | Sample 9 | 5 | 0 |
| Sample 4 | 3 | 2 | Sample 10 | 1 | 4 |
| Fisher Ex Prob | | | Fisher Ex Prob | | |
| Sample 5 | 10 | 0 | Sample 11 | 5 | 0 |
| Sample 6 | 6 | 4 | Sample 12 | 0 | 5 |
| Fisher Ex Prob | | | Fisher Ex Prob | | |
Fisher problems. Enter these six sets of data in the yellow cells in the Fisher spreadsheet. Then copy the Fisher Exact Probability values into the proper cells with brown fill in this table. These are just examples of possible distributions and obviously there are more. As you can see, you have to have remarkable disagreement between samples to find significance. This, again, is a function of sample size and larger samples would increase the power, the probability of finding significance when the populations really are different. Look at the Exact Probability of Samples 3 and 4 and compare it with that of 5 and 6, both having exactly the same proportional distribution. This is another demonstration of the effect of sample size on statistical power.
Assignment on Fisher: Copy and paste the filled out Table 3 as your answers to the six problems.
Additional question: Look at the last two distributions on the left (samples 3 & 4, samples 5 & 6). The proportions are the same for both. The only difference is that samples 5 & 6 are exactly twice as large as samples 3 & 4. You might want to try a distribution with samples three times as large as 3 & 4. You’d have the same proportional distribution in all three cases. What can you conclude regarding the importance of sample size in the Fisher Test?
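For the curious, the Fisher calculation itself is short enough to sketch in Python. One caution: this version calculates a two-tailed probability, and whether the class spreadsheet is one-tailed or two-tailed is an assumption here, so its values may not match this sketch exactly. The function name is made up for illustration.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher exact probability for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        # when the row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Add up every possible table that is no more likely than the observed one.
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= observed * (1 + 1e-9))


# The six pairs of samples from Table 3 as (yes, no, yes, no) counts.
table3 = {
    "1 & 2": (5, 0, 4, 1),
    "3 & 4": (5, 0, 3, 2),
    "5 & 6": (10, 0, 6, 4),
    "7 & 8": (5, 0, 2, 3),
    "9 & 10": (5, 0, 1, 4),
    "11 & 12": (5, 0, 0, 5),
}

p_values = {pair: fisher_exact_two_tailed(*cells) for pair, cells in table3.items()}
for pair, p in p_values.items():
    verdict = "Y" if p <= 0.05 else "N"
    print(f"Samples {pair}: p = {p:.3f}, significant? {verdict}")
```

To try the distribution three times as large as samples 3 & 4, call `fisher_exact_two_tailed(15, 0, 9, 6)` and compare the result with the other two.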
Correlations
Please use the word correlation in this class only as a statistical term used to describe the degree of relationship. People not in the know often speak of things like the correlation between teachers’ and students’ thinking about something when they’re really meaning similarity. Don’t make that common mistake. Let’s look at some legitimate uses of the word correlation.
Suppose we want to describe the strength of the relationship between math ability as shown through final examination class data and math ability as shown in some sort of high stakes testing. How do the two relate? Open your correlations spreadsheet, please, and enter the data from this table. The teachers’ scores will be expressed as percentages for the year and the high stakes test score will be based on a maximum possible of 50. (We will express correlations in this form: r=.76)
Table 4. Hypothetical scores on two types of tests.
| Test | Jim | Bob | Ruff | Sis | Lola | Sam | Joe | Mufufu | Abdul |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Teacher’s | 91 | 49 | 82 | 98 | 88 | 80 | 71 | 85 | 80 |
| High Stakes | 38 | 38 | 40 | 40 | 35 | 45 | 30 | 44 | 40 |
What is the correlation of these two sets of data? What do the results mean?
Values for correlations can range from 1.0 to -1.0 and are expressed in decimal form, often to two decimal places. If it is positive, it means that as values in one set of data increase, values in the other set increase as well. If the correlation is negative, it means that when one set increases, the other decreases. Drawing conclusions from correlations is sometimes done in error. You can never conclude anything in terms of cause and effect – just a relationship and what it might mean. Look at this next example.
Table 5. Incidence of shootings and parties over nine months.
| | Jan | Feb | Mar | Apr | May | June | July | Aug | Sept |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Shootings | 11 | 14 | 15 | 16 | 16 | 20 | 22 | 21 | 24 |
| Parties | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 |
What is the correlation between number of parties each month and shootings each month over this time period? What can we conclude? Do shootings cut down on parties? Does having fewer parties irritate people who then shoot more other folks?
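The Pearson-r calculation used for Tables 4 and 5 can be sketched in Python as a cross-check on the correlations spreadsheet. The function name `pearson_r` is made up for illustration, and the sketch assumes the spreadsheet uses the ordinary Pearson product-moment formula.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # How the two variables vary together, and how each varies by itself.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Table 4: teacher's score (percent) and high stakes score (out of 50).
teacher = [91, 49, 82, 98, 88, 80, 71, 85, 80]
high_stakes = [38, 38, 40, 40, 35, 45, 30, 44, 40]

# Table 5: monthly shootings and parties, January through September.
shootings = [11, 14, 15, 16, 16, 20, 22, 21, 24]
parties = [10, 9, 8, 7, 6, 5, 4, 3, 2]

r_tests = pearson_r(teacher, high_stakes)
r_months = pearson_r(parties, shootings)
print(f"Table 4: r = {r_tests:.2f}")
print(f"Table 5: r = {r_months:.2f}")
```

The sign tells you the direction of the relationship and the size tells you its strength. Neither one tells you what causes what.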
There’s one more practical application of correlations that we will consider. In order to estimate the reliability of a Likert-type item, you could administer your survey to the same group on two occasions and calculate a correlation coefficient that shows the relationship between the two sets of scores. Our criterion for an acceptable survey item will be r ≥ .7 (a correlation of at least .7). Ideally, an individual would give the same response each testing time, but life doesn’t work that way.
Table 6. Data from two administrations of Likert Item #2.
| Likert #2 | Bill | Bo | Bobo | Defran | Dustin | Alice | Jack | Jackie | John |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| First try | 5 | 4 | 2 | 5 | 4 | 3 | 2 | 2 | 1 |
| Second try | 5 | 4 | 3 | 4 | 2 | 3 | 2 | 3 | 1 |
Calculate the correlation for this test-retest determination of the reliability of a Likert-type item. The same survey was administered to these subjects twice, one week apart.
What is the correlation coefficient? Is Likert Item #2 reliable? Explain.
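The test-retest check on Table 6 can be sketched the same way. This assumes the Pearson formula and the class criterion of a correlation of at least .7. The names `pearson_r` and `RELIABILITY_CRITERION` are made up for illustration.

```python
import math

RELIABILITY_CRITERION = 0.7  # class criterion for an acceptable Likert item


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Table 6: the same nine subjects answering Likert Item #2 one week apart.
first_try = [5, 4, 2, 5, 4, 3, 2, 2, 1]
second_try = [5, 4, 3, 4, 2, 3, 2, 3, 1]

r_retest = pearson_r(first_try, second_try)
reliable = r_retest >= RELIABILITY_CRITERION
print(f"Test-retest r = {r_retest:.2f}; meets the criterion? {reliable}")
```

A value close to the criterion with only nine subjects deserves a sentence of caution in your discussion, since a single changed response can move r noticeably.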
In conclusion
Ideally, you’d have taken a statistics class, but realistically you can assume that you’ve mastered a pretty good collection of basic statistics concepts if you successfully answered those questions in red font. You’ll be able to handle the demands of this class, which include analyzing your data and then interpreting it so that you can do a good job of writing a results and discussion section for your paper.