Test Reliability
JOURNAL OF EDUCATIONAL MEASUREMENT VOLUME 23, NO. 2, SUMMER 1986, pp. 171-173
CAN A TEST BE TOO RELIABLE?
HOWARD WAINER
Educational Testing Service
It is shown that summary statistics that are commonly used to measure test quality (reliability, mean rbis, and mean proportion correct) can be seriously misleading. This is demonstrated and explained.
Being too reliable, like being too rich or too thin, is a state that psychometricians have long been taught is impossible to attain. Yet, can an overlarge reliability be indicative of an error? Since Spearman and Brown it has been well known that with most well-developed tests, reliability and test length go hand in hand. In fact, one can often make a fairly accurate guess as to the reliability of a test just by knowing its length (see Gulliksen, 1950, p. 81, for a graph that allows one to do this). Thus, when a relatively short test shows unusually high reliability it should be cause for concern rather than unbridled jubilation.
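The link between test length and reliability is usually expressed through the Spearman-Brown formula, which predicts the reliability of a test lengthened by a factor k with comparable items as k*r / (1 + (k - 1)*r). The sketch below is an illustration only and is not part of the original note. The function names are hypothetical, and the values .60 and .83 are borrowed from the example discussed in this note.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability after changing test length by `length_factor`.

    Assumes the added items are comparable to the existing ones.
    """
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


def length_factor_needed(current: float, target: float) -> float:
    """Length factor required to move reliability from `current` to `target`."""
    return target * (1.0 - current) / (current * (1.0 - target))


# Doubling a test whose reliability is .60
print(round(spearman_brown(0.60, 2.0), 3))        # 0.75

# How much longer must a .60 test be to reach .83?
print(round(length_factor_needed(0.60, 0.83), 2))  # 3.25
```

Under these assumptions, a 28-item test with reliability .60 would need to grow to roughly 91 items to reach .83, which is why .83 from only 28 items is a reason to look more closely.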
To illustrate this, consider an example from our own unfortunate and immediate past. During the course of an item analysis of a 28-item test (n = 2,450) we found a respectable α of .83. More than respectable, it was downright wonderful. With a mean score of 51% correct and a mean rbis of .45, it looked, on the surface, like a test that could stand proudly in the glare of public scrutiny, a shining example of the test maker's skill. A deeper look disappointed us, and taught us a lesson that seems valuable to share.
Shown in Figure 1 is a stem-and-leaf diagram of the item-total biserial correlation coefficients (rbis) for the 28 items of the test. Note that some are very high, astonishingly so, and a few others are negative. This latter aspect of the test led us to question the results. We were told that there must be a mistake since "we do not give tests with negative rbis."
At this point, we looked more carefully at the score distribution (see Figure 2). The "sore thumb" that sticks out at the low end of the score distribution confirmed our suspicions that something was amiss. But what?
The test had 28 items with four choices each, thus a score of 7 represented chance. The peak of the first mode in the score distribution was at 9, just above chance. This provided a further hint. Further detective work revealed what had happened. The data we had analyzed as a single test form were, in fact, two different forms. One form, whose key we used in scoring, had 1,356 examinees. The other had 1,094. Scoring a test with an inappropriate key would yield scores for the affected examinees in the vicinity of chance. In actual fact, we discovered that exactly 9 of the 28 items had the same keyed correct response, so that someone who achieved a perfect score on the incorrectly scored form would have an observed score of 9. Note that the lower mode of the obtained score distribution is 9.
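The arithmetic in this paragraph can be checked with a small sketch. It is illustrative only: the two answer keys below are made up (the actual keys are not given in the note) and are constructed so that they agree on exactly 9 of the 28 items, as reported above. The helper name `score` is hypothetical.

```python
import random

N_ITEMS, N_CHOICES = 28, 4

# Expected score from blind guessing on four-choice items
print(N_ITEMS / N_CHOICES)  # 7.0


def score(responses, key):
    """Number of responses that match the scoring key."""
    return sum(r == k for r, k in zip(responses, key))


rng = random.Random(0)

# Hypothetical key for the form whose key was used in scoring
key_a = [rng.randrange(N_CHOICES) for _ in range(N_ITEMS)]

# Hypothetical key for the other form: identical on the first 9 items,
# shifted to a different option on each of the remaining 19
key_b = [
    k if i < 9 else (k + 1 + rng.randrange(N_CHOICES - 1)) % N_CHOICES
    for i, k in enumerate(key_a)
]

# An examinee who answers every item of the second form correctly
perfect_on_b = list(key_b)

print(score(perfect_on_b, key_b))  # 28 when scored with the right key
print(score(perfect_on_b, key_a))  # 9 when scored with the wrong key
```

A perfect paper on the mis-keyed form is credited only for the items on which the two keys happen to coincide, which is consistent with the lower mode at 9.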
My thanks to Paul Holland.

  .9  277
  .8  6788
  .7  113455677
  .6
  .5  9
  .4  3
  .3
  .2
  .1
  .0  278
 -.0  25
 -.1
 -.2  48
 -.3  14
 -.4  5
Figure 1. Stem-and-Leaf Display of Biserial Correlations for 28 Items

Figure 2. Distribution of Raw Scores for the 2450 examinees (histogram, n = 2450; horizontal axis: TEST SCORE, 0 to 30; vertical axis: FREQUENCY, 0 to 300)

Once we realized the cause of the problem, the explanation for the various observed anomalies became apparent. For example, the biserial correlations would be very high when the two keys disagreed, because those who scored low (near chance) would get the item wrong, and those who scored high would get it right. Similarly, the negative biserials would occur when the two keys agreed, for then the low-scoring individuals would get the item correct. This problem became even worse when the item was relatively difficult on the correctly scored form, and relatively easy on the incorrectly scored form.
Last, we return to our original topic, the reliability. No wonder it was high. The score distribution was broadened, yielding a large group with very low scores and another group with high scores. Any split of the test would yield the same picture. When the test was rescored properly, the reliability dropped to the more usual .6 level for each form.
It seems clear that when a test has a reliability that is too low the test will be scrutinized. The purpose of this note was to emphasize the importance of giving the test close scrutiny when the reliability is too high as well. The problem will probably be of a different cause, but it may be a problem nonetheless. The example presented is but one case of this. It emphasizes that summary statistics for the whole test (mean p+, mean rbis, α) are not sufficient for judging the quality of the test.
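The mechanism can be reproduced in a small simulation. This is a sketch under stated assumptions, not a reanalysis of the original data. Only the group sizes (1,356 and 1,094), the 28 four-choice items, and the 9 agreeing key positions come from the text. The answer keys, the logistic response model, the discrimination value of 0.6, the item difficulties, and the blind-guessing rule are all assumptions. The item-total correlation used here is an ordinary Pearson (point-biserial) coefficient, not the biserial reported in Figure 1, and because agreeing items are given the same difficulty on both forms, the sketch is expected to show lower correlations for those items rather than negative ones.

```python
import math
import random

N_ITEMS, N_CHOICES = 28, 4
N_FORM_A, N_FORM_B = 1356, 1094   # group sizes reported in the text
N_SHARED = 9                      # items on which the two keys agree
DISCRIMINATION = 0.6              # assumed value, not from the source

rng = random.Random(1986)

# Hypothetical keys that agree on the first N_SHARED items only
key_a = [rng.randrange(N_CHOICES) for _ in range(N_ITEMS)]
key_b = [k if i < N_SHARED else (k + 1 + rng.randrange(N_CHOICES - 1)) % N_CHOICES
         for i, k in enumerate(key_a)]
difficulty = [rng.uniform(-1.0, 1.0) for _ in range(N_ITEMS)]  # assumed


def respond(theta, key):
    """Option choices for one examinee: pick the key if known, else guess."""
    answers = []
    for k, b in zip(key, difficulty):
        p_know = 1.0 / (1.0 + math.exp(-DISCRIMINATION * (theta - b)))
        answers.append(k if rng.random() < p_know else rng.randrange(N_CHOICES))
    return answers


def score_items(answers, key):
    """0/1 item scores against a scoring key."""
    return [int(a == k) for a, k in zip(answers, key)]


def pvar(xs):
    """Population variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def cronbach_alpha(rows):
    """Coefficient alpha from a list of 0/1 item-score rows."""
    k = len(rows[0])
    item_var_sum = sum(pvar(col) for col in zip(*rows))
    total_var = pvar([sum(r) for r in rows])
    return k / (k - 1) * (1.0 - item_var_sum / total_var)


def item_total_r(rows, totals, j):
    """Pearson correlation between item j and the total score."""
    item = [r[j] for r in rows]
    mi, mt = sum(item) / len(item), sum(totals) / len(totals)
    cov = sum((x - mi) * (t - mt) for x, t in zip(item, totals)) / len(item)
    return cov / math.sqrt(pvar(item) * pvar(totals))


form_a = [score_items(respond(rng.gauss(0, 1), key_a), key_a) for _ in range(N_FORM_A)]
answers_b = [respond(rng.gauss(0, 1), key_b) for _ in range(N_FORM_B)]
form_b_right = [score_items(a, key_b) for a in answers_b]
form_b_wrong = [score_items(a, key_a) for a in answers_b]  # the scoring error
mixed = form_a + form_b_wrong

alpha_a = cronbach_alpha(form_a)
alpha_b = cronbach_alpha(form_b_right)
alpha_mixed = cronbach_alpha(mixed)

totals = [sum(r) for r in mixed]
r_items = [item_total_r(mixed, totals, j) for j in range(N_ITEMS)]
r_agree = sum(r_items[:N_SHARED]) / N_SHARED
r_disagree = sum(r_items[N_SHARED:]) / (N_ITEMS - N_SHARED)

print(f"alpha, form A scored correctly:  {alpha_a:.2f}")
print(f"alpha, form B scored correctly:  {alpha_b:.2f}")
print(f"alpha, both forms with key A:    {alpha_mixed:.2f}")
print(f"mean item-total r, keys agree:   {r_agree:.2f}")
print(f"mean item-total r, keys differ:  {r_disagree:.2f}")
```

In this setup the pooled, mis-scored data should give a noticeably higher alpha than either form scored with its own key, because most of the total-score variance comes from the gap between the two groups rather than from differences among examinees within a group.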
REFERENCE
GULLIKSEN, H. (1950). Theory of mental tests. New York: John Wiley.
AUTHOR
HOWARD WAINER, Senior Research Scientist, Educational Testing Service, Princeton, NJ 08541. Degrees: BS, Rensselaer Polytechnic Institute; AM, PhD, Princeton University. Specializations: Statistics, psychometrics, graphics.