Education EDU530 Week 6 assignment
Chapter 3
Reliability of Assessment
Chief Chapter Outcome
An understanding of commonly employed indicators of a test’s reliability/precision that is sufficient to identify the types of reliability evidence already collected for a test and, if necessary, to select the kinds of reliability evidence needed for particular uses of educational assessments
Learning Objectives
3.1 Explain the basic conceptual and mathematical principles of reliability.
3.2 Identify and apply the kinds of reliability evidence needed for particular uses of educational assessments.
Reliability is such a cherished commodity. We all want our automobiles, washing machines, and spouses to be reliable. The term reliability simply reeks of solid goodness. It conjures up visions of meat loaf, mashed potatoes, and a supportive family. Clearly, reliability is an attribute to be sought.
In the realm of educational assessment, reliability is also a desired attribute. We definitely want our educational assessments to be reliable. In matters related to measurement, however, reliability has a very restricted meaning. When you encounter the term reliability in any assessment context, you should draw a mental “equals sign” between reliability and consistency, because reliability refers to the consistency with which a test measures whatever it’s measuring:
Reliability = Consistency
From a classroom teacher’s perspective, there are two important ways the concept of reliability can rub up against your day-to-day activities. First, there’s the possibility that your own classroom assessments might lack sufficient
M03_POPH0936_10_SE_C03.indd 78M03_POPH0936_10_SE_C03.indd 78 09/11/23 6:04 PM09/11/23 6:04 PM
reliability to be doing a good job for you and your students. For example, suppose you are using your students’ performances on a tough reading test you’ve developed to determine which students need an additional dose of reading instruction. If you discover that students who take the test early in the day consistently outscore students who take the test later in the day, it seems that your home-grown reading test is not measuring with consistency and that, therefore, you are likely to make at least some inaccurate decisions about which students should receive additional reading instruction. Later in this chapter you will learn about some straightforward ways of collecting reliability evidence regarding your teacher-made assessments.
Second, if your students are obliged to complete any sort of commercially published standardized tests, you’re apt to find a parent or two who might want to discuss the adequacy of those tests. And reliability, as noted in the previous chapter’s preview, is an evaluative criterion by which external standardized tests, such as your state’s accountability tests, are judged. You may need to know enough about reliability’s wrinkles so you’ll be able to talk sensibly with parents about the way reliability is employed to judge the quality of standardized tests. To provide explanations for parents about the meaning or, more accurately, the meanings of assessment reliability, you’ll need more than a superficial understanding of what’s meant by reliability when it is routinely determined by publishers of commercial standardized tests.
As explained in Chapter 2, a particularly influential document in the field of educational assessment is the Standards for Educational and Psychological Testing (2014). Commonly referred to simply as “the Standards” or, perhaps, “the Joint Standards,” this important compilation of dos and don’ts regarding educational and psychological measurement was developed and distributed by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). The Standards provide the nuts and bolts of how educational tests should be created, evaluated, and used. Because the 2014 Standards constituted the first revision of this significant AERA-APA-NCME publication since 1999, assessment specialists everywhere are particularly attentive to its contents, and we can safely predict that most of those specialists attempt to adhere to its mandates. Let’s consider, then, what the 2014 Standards say about reliability.
Well, for openers, the architects of the Standards use a different label to describe what has historically been referred to as the “reliability” of tests. The authors of the new Standards point out that the term reliability has been used not only (1) to represent the traditional reliability coefficients so often employed through the years to describe a test’s quality, but also (2) to refer more generally to the consistency of students’ scores across replications of a testing procedure regardless of the way such consistency is estimated or reported.
In case you are unfamiliar with the meaning of the technical term coefficient as used in this context, that term typically refers to a correlation coefficient, that is, a numerical indicator of the relationship between the same persons’ status on two variables, such as students’ scores on two different tests. The symbol representing such a coefficient is r. If students score pretty much the same on the two tests, the resulting correlation coefficient will be strong and positive. If the individuals score high on one test and low on the other test, the resulting r will be negative. If there’s a high reliability coefficient, this doesn’t necessarily signify that students’ scores on the two testing occasions are identical. Rather, a high r indicates that students’ relative performances on the two testing occasions are quite similar. Correlation coefficients can range from a high of +1.00 to a low of -1.00. Thus, what a reliability coefficient really means is that a correlation coefficient has typically been computed for test-takers’ performances on two sets of variables, as you will soon see.
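To make the point about relative performance concrete, here is a minimal Python sketch of how a correlation coefficient could be computed by hand. The function name pearson_r and all of the scores are hypothetical, invented only for illustration; the sketch assumes the familiar Pearson product-moment formula, which is the usual (though not the only) way such an r is obtained.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from each mean.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when one set of scores has no spread")
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for five students on two testing occasions.
first = [50, 60, 70, 80, 90]
second = [55, 65, 75, 85, 95]          # everyone gains 5 points; same rank order
reversed_order = [90, 80, 70, 60, 50]  # high scorers become low scorers

r_same_order = pearson_r(first, second)        # strong and positive
r_reversed = pearson_r(first, reversed_order)  # strong and negative
```

Notice that no student earns the same score on the two occasions in the first comparison, yet r sits at the top of its range because every student keeps the same standing relative to classmates. In the second comparison the standings flip, and r falls to the bottom of the range.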
To illustrate, if the developers of a new achievement test report that when their test was administered to the same students on two occasions (for example, a month apart) the resulting test-retest correlation coefficient was .86, this would be an instance involving the sort of traditional reliability coefficient that measurement experts have used for roughly a full century. If, however, a test-development company creates a spanking new high school graduation test, and supplies evidence about what percentage of students’ scores can consistently classify those test-takers into “diploma awardees” and “diploma denials,” this too constitutes a useful way of representing a test’s consistency, but a way that’s clearly not the same as reliance on a traditional reliability coefficient. A variety of different indices of classification consistency are often employed these days to supply test users with indications about the reliability with which a test classifies test-takers. Don’t be surprised, then, if you encounter an indicator of a test’s reliability that’s expressed in a manner quite different from oft-encountered reliability coefficients.
Architects of the 2014 Standards wanted to provide a link to traditional conceptions of measurement consistency (in which a single reliability coefficient typically indicated a test’s consistency), but they wished to avoid the ambiguity of using the single label reliability to refer to a wide range of reliability indicators, such as measures of classification consistency. Accordingly, the 2014 Standards architects employ the term reliability/precision to denote the more general notion of score consistency across instances of the testing procedure. The label reliability coefficient is used to describe more conventionally used coefficients representing different forms of test-takers’ consistency.
The descriptor reliability/precision, then, describes not only traditional reliability coefficients but also various indicators of classification consistency and a number of less readily understandable statistical indices of assessment consistency. As the writers of the Standards make clear, the need for precision of measurement increases as the consequences in terms of test-based inferences and resultant decisions become more important.
Whether the label reliability/precision will become widely employed by those who work with educational tests remains to be seen. In this chapter, you will encounter descriptions of reliability in several contexts. Those contexts should make it clear whether the term applies to a quantitative indicator representing a traditional relationship between two sets of test scores (that is, a reliability coefficient) or, instead, refers to another way of representing a test’s consistency of measurement (that is, a reliability/precision procedure).
What does a teacher really need to know about reliability or about reliability/precision? Well, hold off for a bit on that question, for the answer may surprise you. But one thing you do need to recognize is that the fundamental notion of reliability is downright important for those whose professional lives bump up against educational tests in any meaningful manner. This overriding truth about the significance of assessment consistency is well represented in the very first, most basic expectation set forth in the 2014 Standards:
Appropriate evidence of reliability/precision should be provided for the interpretation of each intended score use. (AERA, 2014, p. 42)
Based on this call for appropriate evidence supporting the reliability/precision of an educational test, the Joint Standards include fully 20 subordinate standards dealing with various aspects of reliability. Taken together, they spell out how the authors of the 2014 Standards believe the consistency of educational tests ought to be determined.
Please note, in the above snippet from the 2014 Standards, that evidence of reliability/precision is supposed to be supplied for each intended score use. What this signifies, of course, is that a test itself is not reliable or unreliable in some general, ill-defined way. Rather, based on the evidence that’s available, we reach a judgment about the reliability regarding a test’s use for Intended Use A, Intended Use B, and so on. Although many educators casually refer to the “reliability of a test,” what’s really represented in that phrase is the “reliability of a test for a specified use.” In the next chapter (spoiler alert!) about validity, you will learn that tests are not valid or invalid but, rather, it is the validity of test-based interpretations that is up for grabs. Similarly, regarding the reliability of tests, a test’s reliability revolves around the evidence supporting the consistency of scores for each and every intended test usage.
But how, you might ask, can a teacher’s test be unreliable? That is, how can certain features of a teacher-made test result in a test’s yielding inconsistent scores? This would be a good question, and a few examples might help in responding to it. Let’s say that the directions for a teacher-made test are somewhat ambiguous and that, as a consequence, students arrive at several meanings of what they are being asked to do with the test’s items. Result: Inconsistency. As another illustration, suppose that in the multiple-choice items a teacher has ginned up for her physics tests, many items seem to have not one but two correct answers. Students, unless directed otherwise, can typically make only one choice per item. Result: Inconsistency. And, as a final example, suppose that the teacher’s test contains a flock of true–false items that are arguably either true or false. Students need to guess which is which for each item. Result: Inconsistency.
As you read further in this chapter, you will encounter some fairly technical notions about ways of calculating the reliability with which a test can capture test-takers’ performances. Try not to get so caught up in the techniques of collecting reliability evidence that you begin to believe the unreliability of an educational test is visited on that test by extraterrestrial forces. No, when someone put the test together, mistakes were made that allowed the test to yield inconsistent test-takers’ scores. Perhaps the evildoer was a classroom teacher, perhaps a psychometrician from a major testing firm. Whoever the culprit was, somebody messed up. What we have learned in the field of educational measurement, fortunately, are ways of spotting such mess-ups.
So, recognizing that we can estimate a test’s consistency in a number of ways, and that the expression reliability coefficient refers to particular sorts of evidence that are traditionally served up when educational tests are scrutinized, let’s look at the three traditional reliability coefficients presented in Table 3.1: test-retest coefficients, alternate-form coefficients, and internal consistency coefficients. Along the way, while describing how these three sorts of reliability coefficients are obtained, we will consider other approaches to be employed when describing a test’s reliability/precision.
If you have only recently been tussling with different conceptions of assessments’ consistency (such as in the last few paragraphs), an altogether reasonable question for you to ask about reliability coefficients is: How big must a reliability coefficient be? For instance, looking at the three sorts of reliability coefficients tersely described in Table 3.1, how high do those correlation coefficients need to be for us to regard a test as sufficiently reliable for its intended use? Regrettably, the answer to this straightforward question is apt to be more murky than you might like.
For openers, if a correlation coefficient is involved, it needs to be positive rather than negative. If test developers from an educational measurement company were trying to measure how consistently their newly created history test does its measurement job, regardless of when it is administered, they would typically ask a group of students to take the same test on two separate occasions. Thereupon a test-retest correlation could be computed for the test-takers’ two sets of scores. Clearly, the hope would be that the correlation between students’ scores on the two administrations would be decisively positive; that is, the scores would reveal substantial similarity in the way that students performed on both of the two test administrations. Okay, the test’s developers want a positive test-retest reliability coefficient to signify test-score stability. But how large does that coefficient need to be before the test’s developers start clinking glasses during a subsequent champagne toast?
Table 3.1 Three Types of Reliability Evidence
Type of Reliability Coefficient | Brief Description
Test-Retest | Consistency of results among different testing occasions
Alternate Form | Consistency of results among two or more different forms of a test
Internal Consistency | Consistency in the way an assessment instrument’s items function
Well, get ready for an answer that’s often to be encountered when judging the quality of educational tests. That’s right: It depends. The size that a correlation coefficient needs to be before we regard a test as sufficiently reliable depends on the context in which a test is being used and, in particular, hinges on the nature of the test-based decision that will be influenced by test-takers’ scores.
In general, the higher the stakes involved, the higher should be our expectations for our tests’ reliability coefficients. For instance, let’s suppose that a teacher is working in a state where the awarding of a high school diploma requires a student to perform above a specific cut-score on both a mathematics test and an English language arts test. Given the considerable importance of the decision that’s riding on students’ test performances, we should be demanding a much greater indication of test consistency than we would with, say, a teacher’s exam covering a two-week unit on the topic of punctuation.
But because the contexts in which educational tests are used will vary so considerably, it is simply impossible to set forth a definitive table presenting minimally acceptable levels for certain kinds of reliability coefficients. Experience shows us that when teachers attempt to collect evidence for their own teacher-constructed tests, it is not uncommon to encounter a test-retest r of .60, plus or minus .10, and an alternate-form r reflecting about the same range. Accordingly, when deciding whether a test’s reliability coefficients are sufficiently strong for the test’s intended use, this decision boils down to professional judgment based on the correlation coefficients seen over the years in similar settings. Yes, we use historical precedent to help us arrive at realistic expectations for what’s possible regarding reliability coefficients. For instance, with significant high-stakes examinations such as nationally standardized achievement tests that have been developed and revised many times at great cost, it is not uncommon for the test developers to report internal consistency coefficients hovering slightly above or slightly below .90. However, for a district-made test requiring far fewer developmental dollars, the identical sorts of internal consistency coefficients might be closer to .80 or .70.
In general, reliability coefficients representing test-retest or alternate-form consistency tend to be lower than internal consistency coefficients. However, in certain contexts (when a test has been developed in an effort to provide distinguishable subscores) we should expect internal consistency coefficients to be much lower, because the overall test is not attempting to measure a single, all-encompassing trait but, rather, a set of related subscale scores. As indicated earlier, it depends. The judgment about whether a test’s reliability coefficients are sufficiently strong for a given use should hinge on what is a realistic expectation for such coefficients in such situations.
Similarly, when looking at reliability/precision more generally (for instance, when considering the consistency with which a set of test scores allows us to classify test-takers’ levels of proficiency accurately), once more we need to be guided by historical precedent. That is, based on the experience of others, what expectations about reliability/precision are realistic, irrespective of the quantitative indicator employed? For example, if we look back at recent percentages of identical classifications of students on district-developed end-of-course exams, and we find that these decision-consistency percentages almost always reach at least 80 percent (that is, 80 percent of the test-takers are given the same classifications irrespective of the test forms they completed), then as educators we ought to be wary if a new test yields evidence supporting only a 60 percent estimate of decision consistency.
Test-Retest Reliability Evidence
The first kind of reliability evidence we’ll be looking at is called test-retest. This conception of reliability often comes to people’s minds when someone asserts that reliability equals consistency. Referred to in earlier versions of the Standards as “stability reliability,” test-retest evidence refers to consistency of test results over time. We want our educational assessments of students to yield similar results even if the tests were administered on different occasions. For example, suppose you gave your students a midterm exam on Tuesday, but later in the afternoon a masked thief (1) snatched your briefcase containing the students’ test papers, (2) jumped into a waiting armored personnel carrier, and (3) escaped to an adjacent state or nation. The next day, after describing to your students how their examinations had been purloined by a masked assailant, you then ask them to retake the midterm exam. Because there have been no intervening events of significance, such as more instruction from you on the topics covered by the examination, you would expect your students’ Wednesday examination scores to be somewhat similar to their Tuesday examination scores. And that’s what the test-retest coefficient conception of test reliability refers to: evidence of consistency over time. If the Wednesday scores weren’t rather similar to the Tuesday scores, then your midterm exam would probably be judged to provide insufficient test-retest reliability.
To get a fix on how stable an assessment’s results are over time, we usually test students on one occasion, wait a week or two, and then retest them with the same instrument. Because measurement specialists typically use the descriptors stability reliability and test-retest reliability interchangeably, you are hereby allowed to do the same thing. Simply choose which of the two labels you prefer. It is important, however, for no significant events that might alter students’ performances on the second assessment occasion to have taken place between the two testing occasions. For instance, suppose the test you are administering assessed students’ knowledge regarding World War II. If a widely viewed television mini-series about World War II is presented during the interval between the initial test and the retest, it is likely that the performances of the students who watched the mini-series will be higher on the second test because of their exposure to test-relevant information in the mini-series. Thus, for test-retest coefficients to be interpreted accurately, it is imperative that no significant performance-influencing events transpire during the between-assessments interval.
One reliability/precision procedure for calculating the stability of students’ performances on the same test administered on two assessment occasions is to determine the percentage of student classifications that were consistent over time. Such a classification-consistency approach to the determination of a test’s reliability might be used, for instance, when a teacher is deciding which students should be exempted from further study about Topic X. To illustrate, let’s say that the teacher establishes 80 percent correct as the degree of proficiency required in order to exempt students from further Topic X study. Then, on a test-retest basis, the teacher would simply determine the percentage of students who were classified the same way on the two assessment occasions. The focus in such an approach would not be on the specific scores a student earned, but only on whether the same classification was made about the student on both occasions. Thus, if Jaini Jabour earned an 84 percent correct score on the first testing occasion and a 99 percent correct score on the second testing occasion, Jaini would be exempted from further Topic X study in both cases, because she surpassed the 80 percent correct standard both times. The classifications for Jaini would be consistent, so the teacher’s decisions about Jaini would also be consistent. However, if Harry Harvey received a score of 65 percent correct on the first testing occasion and a score of 82 percent correct on the second testing occasion, different classifications on the two occasions would lead to different decisions being made about Harry’s need to keep plugging away at Topic X. To determine the percentage of a test’s classification consistency, you can simply make the kinds of calculations seen in Table 3.2.
Decision Time: Quibbling over Quizzes
Wayne Wong’s first-year teaching assignment is a group of 28 fifth-grade students in an inner-city elementary school. Because Mr. Wong believes in the importance of frequent assessments as motivational devices for his students, he typically administers one or more surprise quizzes per week to his students. Admittedly, after the first month or so, very few of Mr. Wong’s fifth-graders are really “surprised” when he whips out one of his unannounced quizzes. Students’ scores on the quizzes are used by Mr. Wong to help compute each student’s 6-week grades.
Mrs. Halverson, the principal of Wayne’s school, has visited his class on numerous occasions. Mrs. Halverson believes that it is her “special responsibility” to see that first-year teachers receive adequate instructional support from school administrators.
Recently, Mrs. Halverson completed a master’s degree from the local branch of the state university. As part of her coursework, she was required to take a class in “educational measurement.” She earned an A. Because the professor for that course stressed the importance of “reliability as a crucial ingredient of solid educational tests,” Mrs. Halverson has been pressing Mr. Wong to compute some form of reliability evidence for his surprise quizzes. Mr. Wong has been resisting her suggestions because, in his view, he administers so many quizzes that the computation of reliability indices for the quizzes would surely be a time-consuming pain. He believes that if he’s forced to fuss with reliability estimates for each quiz, he’ll give fewer quizzes. And because he thinks students’ perception that they may be quizzed sufficiently stimulates them to actually be prepared, he is reluctant to reduce the number of quizzes he gives. Even after hearing Wayne’s position, however, Principal Halverson seems unwilling to bend.
If you were Wayne Wong and were faced with this problem, what would your decision be?
Whether you use a correlational approach or a classification-consistency approach to the determination of a test’s consistency over time, it is apparent that you’ll need to test students twice in order to determine the test score’s stability. If a test is yielding rather unstable results between two occasions, it’s really difficult to put much confidence in that test’s results. Just think about it: if you can’t tell whether your students have really performed wonderfully or woefully on a test because your students’ scores might vary depending on the day you test them, how can you proceed to make defensible test-based instructional decisions about those students?
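For readers who like to see the bookkeeping, the classification-consistency tally for a test-retest situation can be sketched in a few lines of Python. Everything here is illustrative: the function name, the five-student roster, and all scores other than the 84/99 and 65/82 pairs mentioned in the Topic X example are hypothetical, and the sketch assumes that a score at or above the cutoff earns an exemption.

```python
def classification_consistency(score_pairs, cutoff):
    """Tally test-retest classifications as percentages of all students.

    score_pairs maps each student to (first score, second score), both in
    percent correct. Assumption: a score at or above the cutoff means the
    student is exempt from further study.
    """
    total = len(score_pairs)
    if total == 0:
        raise ValueError("no students to classify")
    both_exempt = 0
    both_further_study = 0
    for first, second in score_pairs.values():
        exempt_first = first >= cutoff
        exempt_second = second >= cutoff
        if exempt_first and exempt_second:
            both_exempt += 1
        elif not exempt_first and not exempt_second:
            both_further_study += 1
    classified_differently = total - both_exempt - both_further_study

    def to_percent(count):
        return 100 * count / total

    return {
        "A_exempt_both": to_percent(both_exempt),
        "B_further_study_both": to_percent(both_further_study),
        "C_classified_differently": to_percent(classified_differently),
        "D_consistency": to_percent(both_exempt + both_further_study),
    }

# Hypothetical class of five students: (first occasion, second occasion).
scores = {
    "Jaini": (84, 99),      # exempt both times
    "Harry": (65, 82),      # classified differently on the two occasions
    "Student 3": (90, 85),  # exempt both times
    "Student 4": (70, 60),  # further study both times
    "Student 5": (55, 72),  # further study both times
}

result = classification_consistency(scores, cutoff=80)
```

The four dictionary entries mirror rows A through D of the kind of tally the chapter describes, with row D simply the sum of rows A and B. Only classifications matter here: Jaini's 15-point gain changes nothing, while Harry's move across the cutoff does.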
Realistically, of course, why would a sane, nonsadistic classroom teacher administer identical tests to the same students on two different testing occasions? It’s pretty tough to come up with an unembarrassing answer to that question. What’s most important for teachers to realize is that there is always a meaningful level of instability between students’ performances on two different testing occasions, even when the very same test is used. And this realization, of course, should disincline teachers to treat a student’s test score as though it were a super-scientific, impeccably precise representation of the student’s achievement level.
Table 3.2 An Illustration of How Classification Consistency Is Determined in a Test-Retest Context
A. Percent of students identified as exempt from further study on both assessment occasions = 42%
B. Percent of students identified as requiring further study on both assessment occasions = 46%
C. Percent of students classified differently on the two occasions = 12%
D. Percentage of the test’s classification consistency (A + B) = 88%
Alternate-Form Reliability Evidence
The second of our three kinds of reliability evidence for educational assessment instruments focuses on the consistency between two forms of a test, forms that are supposedly equivalent. Alternate-form reliability evidence deals with the question of whether two or more allegedly equivalent test forms do, in fact, yield sufficiently equivalent scores.
In the classroom, teachers rarely have reason to generate two forms of an assessment instrument. Multiple forms of educational tests are more commonly encountered in high-stakes assessment situations, such as when high school students must pass graduation tests before receiving diplomas. In such settings, students who fail an examination when it is initially administered might have the opportunity to pass the examination on a later attempt. Clearly, to make the assessment process fair, the challenge of the assessment hurdle faced by individuals when they take the initial test must be essentially the same as the challenge of the assessment hurdle faced by individuals when they take the make-up examination. Alternate-form reliability evidence bears on the comparability of two (or more) test forms.
Multiple test forms are apt to be found whenever educators fear that if the identical test were simply reused, students who had access to subsequent administrations of the test would have an advantage because those later test-takers would have learned about the test’s contents, and thus have an edge over the first-time test-takers. Typically, then, in a variety of high-stakes settings such as (1) those involving high school diploma tests or (2) the certification examinations governing entry to a profession, multiple test forms are employed in which substantial portions differ.
To collect alternate-form consistency evidence, procedural approaches are employed that are in many ways similar to those used for the determination of test-retest reliability evidence. First, the two test forms are administered to the same individuals. Ideally, there would be little or no delay between the administration of the two test forms. For example, suppose you were interested in determining the comparability of two forms of a district-developed language arts examination. Let’s say you could round up 100 suitable students. Because the examination requires only 20 to 25 minutes to complete, you could administer both forms of the language arts test (Form A and Form B) to each of the 100 students during a single period. To eliminate the impact of the order in which the two forms were completed by students, you could ask 50 of the students to complete Form A first and then Form B. The remaining students would be directed to take Form B first and then Form A.
When you obtain each student’s scores on the two forms, you could compute a correlation coefficient reflecting the relationship between students’ performances on the two forms. As with test-retest reliability, the closer the alternate-form correlation coefficient is to a positive 1.0, the more agreement there is between students’ relative scores on the two forms. Alternatively, you could use the kind of classification-consistency approach for the determination of alternate-form reliability/precision that was described earlier for stability. To illustrate, you could decide on a level of performance that would lead to different classifications for students, and then simply calculate the percentage of identically classified students on the basis of the two test forms. For instance, if a pass/fail cutoff score of 65 percent correct had been chosen, then you would simply add (1) the percent of students who passed both times (scored 65 percent or better) and (2) the percent of students who failed both times (scored 64 percent or lower). The addition of those two percentages yields a classification-consistency estimate of alternate-form reliability for the two test forms under consideration.
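The two-form procedure just described can also be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the six students' scores are invented, and the choice to assign students to the two orderings at random (with a fixed seed for repeatability) is an assumption of the sketch, since the text says only that half the students take Form A first.

```python
import random

def counterbalance(student_ids, seed=0):
    """Split students into two equal-sized order groups.

    Assumption: assignment is random; the seed only makes the split repeatable.
    """
    ids = list(student_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return {"A_then_B": ids[:half], "B_then_A": ids[half:]}

def pass_fail_agreement(form_a_scores, form_b_scores, cutoff=65):
    """Percent of students classified identically (pass/fail) by both forms.

    A score at or above the cutoff counts as a pass, matching the
    '65 percent or better' rule in the example.
    """
    if len(form_a_scores) != len(form_b_scores) or not form_a_scores:
        raise ValueError("need one score per student on each form")
    same = sum(
        (a >= cutoff) == (b >= cutoff)
        for a, b in zip(form_a_scores, form_b_scores)
    )
    return 100 * same / len(form_a_scores)

# 100 students, half taking Form A first and half taking Form B first.
orders = counterbalance(range(1, 101))

# Hypothetical percent-correct scores for six students on the two forms.
form_a = [72, 64, 80, 58, 66, 90]
form_b = [70, 66, 77, 61, 63, 88]
agreement = pass_fail_agreement(form_a, form_b)
```

In this made-up data, four of the six students land on the same side of the 65 percent cutoff on both forms, so the two forms agree for about two-thirds of the group. The two students who disagree both scored within a point or two of the cutoff, which is typically where classification inconsistency shows up.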
As you can see, although both species of reliability evidence we’ve considered thus far are related—in the sense that both deal with consistency—they represent very different conceptions of consistency evidence. Test-retest reliability evidence deals with consistency over time for a single examination. Alternate-form reliability evidence deals with the consistency inherent in two or more supposedly “equivalent” forms of the same examination.
Alternate-form reliability is not established by proclamation. Rather, evidence must be gathered regarding the between-form consistency of the test forms under scrutiny. Accordingly, if you’re ever reviewing a commercially published or state-developed test that claims to have equivalent forms available, be sure you inspect the evidence supporting those claims of equivalence. Determine how the evidence of alternate-form comparability was gathered—that is, under what circumstances. Make sure that what’s described makes sense to you.
Later in the book (Chapter 13), we’ll consider a procedure known as item response theory. Whenever large numbers of students are tested, that approach can be employed to adjust students’ scores on different forms of tests that are not equivalent in difficulty. For purposes of our current discussion, however, simply remember that evidence of alternate-form reliability is a special form of consistency evidence dealing with the comparability of the scores yielded by two or more test forms.
Internal Consistency Reliability Evidence
The final entrant in our reliability evidence sweepstakes is called internal consistency reliability evidence. It really is quite a different creature than stability and alternate-form reliability evidence. Internal consistency evidence does not focus on the consistency of students’ scores on a test. Rather, internal consistency evidence deals with the extent to which the items in an educational assessment instrument are functioning in a consistent fashion.
Whereas evidence of stability and alternate-form reliability requires two administrations of a test, internal consistency reliability can be computed on the basis of only a single test administration. It is for this reason, one suspects, that we tend to encounter internal consistency estimates of reliability far more frequently than we encounter its two reliability cousins. Yet, as you will see, internal consistency reliability evidence is substantively different from stability and alternate-form reliability evidence.
Internal consistency reliability reflects the degree to which the items on a test are doing their measurement job in a consistent manner—that is, the degree to which the test’s items are functioning homogeneously. Many educational tests are designed to measure a single variable, such as students’ “reading achievement” or their “attitude toward school.” If a test’s items are all truly measuring a single variable, then each of the test’s items ought to be doing fundamentally the same assessment job. To the extent that the test’s items are tapping the same variable, of course, the responses to those items by students will tend to be quite similar.
For example, if all the items in a 20-item test on problem solving do, in fact, measure a student’s problem-solving ability, then students who are skilled problem solvers should get most of the test’s items right, whereas unskilled problem solvers will miss most of the test’s 20 items. The more homogeneous the responses yielded by a test’s items, the higher will be the test’s internal consistency evidence.
Parent Talk
One of your strongest students, Raphael Lopez, has recently received his scores on a nationally standardized achievement test used in your school district. Raphael’s subtest percentile scores (in comparison to the test’s norm group) were the following:
Subject           Percentile
Language Arts     85th
Mathematics       92nd
Science           91st
Social Studies    51st
Raphael’s father, a retired U.S. Air Force colonel, has called for an after-school conference with you about the test results. He has used his home computer and the Internet to discover that the internal consistency reliabilities on all four subtests, as published in the test’s technical manual, are higher than .93. When he telephoned you to set up the conference, he said he couldn’t see how the four subtests could all be reliable when Raphael’s score on the social studies subtest was so “out of whack” with the other three subtests. He wants you to explain how this could happen.
If I were you, here’s how I’d respond to Colonel Lopez:
“First off, Colonel Lopez, I’m delighted that you’ve taken the time to look into the standardized test we’re using in the district. Not many parents are willing to expend the energy to do so.
“I’d like to deal immediately with the issue you raised on the phone regarding the reliability of the four subtests, and then discuss Raphael’s social studies result. You may already know some of what I’ll be talking about because of your access to the Internet, but here goes.
“Assessment reliability refers to the consistency of measurement. But there are very different ways that test developers look at measurement consistency. The reliability estimates that are supplied for Raphael’s standardized test, as you pointed out, are called internal consistency correlation coefficients. Those correlations tell us whether the items on a particular subtest are performing in the same way—that is, whether they seem to be measuring the same thing.
“The internal consistency reliability for all four subtests is quite good. But this kind of reliability evidence doesn’t tell us anything about how Raphael would score if he took the test again or if he took a different form of the same test. We don’t know, in other words, whether his performance would be stable across time or would be consistent across different collections of similar test items.
“What we do see in Raphael’s case is a social studies performance that is decisively different from his performance on the other three subtests. I’ve checked his grades for the past few years, and I’ve seen that his grades in social studies are routinely just as high as his other grades. So those grades do cast some doubt on the meaningfulness of his recent lower test performance in social studies.
“Whatever is measured on the social studies subtest seems, from my inspection of the actual items, to be measured by a set of homogeneous items. This doesn’t mean, however, the content of the social studies subtest meshes with the social studies Raphael’s been taught here in our district. To me, Colonel, I think it’s less likely to be a case of measurement unreliability than it is to be a problem of content mismatch between what the standardized examination is testing and what we are trying to teach Raphael about social studies.
“I recommend that you, Mrs. Lopez, and I carefully monitor Raphael’s performance in social studies during the upcoming school year so that we can really see whether we’re dealing with a learning problem or with an assessment that’s not aligned to what we’re teaching.”
Now, how would you respond to Colonel Lopez?
There are several different formulae around for computing a test’s internal consistency.2 Each formula is intended to yield a numerical estimate that reflects the extent to which the assessment procedure’s items are functioning homogeneously. By the way, because internal consistency estimates of test-score reliability are focused on the homogeneity of the items on a test, not on any classifications of test-takers (as we saw with stability and alternate-form reliability), decision-consistency approaches to reliability are not used in this instance.
For tests containing items on which a student can be right or wrong, such as a multiple-choice item, the most commonly used internal consistency approaches are the Kuder-Richardson procedures (usually referred to as the K-R formulae). For tests containing items on which students can be given different numbers of points, such as essay items, the most common internal consistency coefficient is called Cronbach’s coefficient alpha after its originator, Lee J. Cronbach. Incidentally, if you want to impress your colleagues with your newfound and altogether exotic assessment vocabulary, you might want to know that test items scored right or wrong (such as true–false items) are called dichotomous items, and those that yield multiple scores (such as essay items) are called polytomous items. Sometime, try to work polytomous into a casual conversation around the watercooler. Its intimidation power is awesome.
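For readers curious about what the Kuder-Richardson and coefficient alpha procedures actually compute, here is a minimal Python sketch. The chapter does not print these formulas, so the code relies on their standard textbook forms (KR-20 for dichotomous items, Cronbach's alpha for polytomous items), and all of the response data and function names are invented for illustration.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Coefficient alpha. item_scores has one row per student, one column per item."""
    k = len(item_scores[0])                          # number of items
    columns = list(zip(*item_scores))                # scores regrouped by item
    sum_item_vars = sum(pvariance(col) for col in columns)
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_item_vars / total_var)

def kr20(item_scores):
    """Kuder-Richardson formula 20, for items scored 1 (right) or 0 (wrong)."""
    k = len(item_scores[0])
    n = len(item_scores)
    columns = list(zip(*item_scores))
    # p is the proportion answering an item correctly; p * (1 - p) is its variance.
    sum_pq = sum((sum(col) / n) * (1 - sum(col) / n) for col in columns)
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_pq / total_var)

# Invented dichotomous data: six students, five right/wrong items.
right_wrong = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

# Invented polytomous data: five students, three essay items scored 0 to 4.
essays = [
    [4, 3, 4],
    [3, 3, 2],
    [2, 1, 2],
    [1, 2, 1],
    [0, 1, 0],
]

print(f"KR-20 (right/wrong items): {kr20(right_wrong):.2f}")
print(f"alpha (essay items):       {cronbach_alpha(essays):.2f}")
```

When every item is scored right or wrong, the two functions return the same value, because the variance of a 0/1 item is simply p(1 − p). That is why coefficient alpha is often described as the more general of the two.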
Incidentally, other things being equal, the more items there are in an educational assessment device, the more reliability/precision it will tend to possess. To illustrate, if you set out to measure a student’s mathematics achievement with a 100-item test dealing with various aspects of mathematics, you’re apt to get a more precise fix on a student’s overall mathematical prowess than you would if you asked students to solve only a set of near-similar mathematical word problems. The more times you ladle out samples from a pot of soup, the more accurate will be your estimate of what the soup’s ingredients are. One ladleful might fool you. Twenty ladlefuls will give you a much better idea of what’s in the pot. In general, then, more items on educational assessment devices will tend to yield higher reliability/precision estimates than will fewer items.
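The link between test length and reliability can be made concrete with the classical Spearman-Brown formula. The chapter does not name that formula, so treat the following Python sketch as an outside illustration rather than as part of the chapter's own treatment. It assumes the added items are comparable in quality to the existing ones, and the starting reliability of .60 for a 20-item test is invented.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test's length is multiplied by length_factor,
    assuming the added (or removed) items resemble the existing ones."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical starting point: a 20-item test whose reliability is .60.
ORIGINAL_ITEMS = 20
ORIGINAL_RELIABILITY = 0.60

for factor in (0.5, 1, 2, 5):
    projected = spearman_brown(ORIGINAL_RELIABILITY, factor)
    items = int(ORIGINAL_ITEMS * factor)
    print(f"{items:>3} items -> projected reliability {projected:.2f}")
```

Under these assumptions, doubling the hypothetical test from 20 to 40 items raises the projected reliability from .60 to about .75, whereas halving it lowers the projection. The gains shrink as the test grows longer, which is the soup-ladle idea in numerical form: each additional ladleful helps, but a little less than the one before.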
Three Coins in the Reliability/Precision Fountain
You’ve now seen that there are three different ways of conceptualizing the manner in which the consistency of a test’s results is described. Consistency of measurement is a requisite for making much sense out of a test’s results. If the test yields inconsistent results, how can teachers make sensible decisions based on what appears to be a capricious assessment procedure? Yet, as we have seen, reliability evidence comes in three flavors. It is up to you to make sure that the reliability evidence supplied with a test is consonant with the use to which the test’s results will be put—namely, the decision linked to the test’s results. Although there is surely a relationship among the three kinds of reliability evidence we’ve been discussing, the following is also unarguably true:
Test-Retest Reliability Evidence ≠ Alternate-Form Reliability Evidence ≠ Internal Consistency Reliability Evidence
To illustrate, suppose you were a teacher in a school district where a high school diploma test had been developed by an assistant superintendent in collaboration with a committee of volunteer district teachers. The assistant superintendent has claimed the test’s three different forms are essentially interchangeable because each form, when field-tested, yielded a Kuder-Richardson reliability coefficient of .88 or higher. “The three test forms,” claimed the assistant superintendent at a recent school board meeting, “are reliable and, therefore, equivalent.” You now know better.
If the assistant superintendent really wanted to know about between-form comparability, then the kind of reliability evidence needed would be alternate-form reliability, not internal consistency. (Incidentally, it is not recommended that you rise to your feet at the school board meeting to publicly repudiate an assistant superintendent’s motley mastery of measurement reliability. You might, instead, send the assistant superintendent a copy of this text, designating the pages to be read. However, be sure to send it anonymously.)
Yet, even those educators who unquestionably know something about reliability and its importance will sometimes unthinkingly mush the three brands of reliability evidence together. They’ll see a K-R reliability coefficient of .90 and assume not only that the test is internally consistent but also that it will produce stable results. That’s not necessarily so. If these educators make such mistakes in your presence, make sure that any correcting you undertake is laden with thoughtfulness. When taking part in assessment deliberations with colleagues, try your darndest to preserve the dignity of your colleagues.
The Standard Error of Measurement
Before bidding adieu to reliability and all its raptures, there’s one other thing you need to know about consistency of measurement. So far, the kinds of reliability/precision evidence and reliability coefficient evidence we’ve been considering deal with the reliability of a group of students’ scores. For a few paragraphs now, please turn your attention to the consistency with which we measure an individual’s performance. The index used in educational assessment to describe the consistency of a particular person’s performance(s) is referred to as the standard error of measurement (SEM). Often a test’s standard error of measurement is identified as the test’s SEM. The standard error of measurement is another indicator that writers of the 2014 Standards would classify as a reliability/precision procedure.
You should think of the standard error of measurement as a reflection of the consistency of an individual’s scores that would emerge if a given assessment procedure were administered to that individual again, and again, and again. However, as a practical matter, it is impossible to re-administer the same test innumerable times to the same students; such students would revolt or, if exceedingly acquiescent, soon swoon from exhaustion. (Swooning, these days, is rarely encountered—but someday could be.) Accordingly, we need to estimate how much variability there would be if we were able to re-administer a given assessment procedure many times to the same individual. The standard error of measurement is much like the plus-or-minus “sampling errors” or “confidence intervals” so frequently seen in the media these days for various sorts of opinion polls. We are told that “89 percent of telephone interviewees indicated that they would consider brussels sprouts in brownies to be repugnant” (±3 percent margin of error).
Other things being equal, the higher the reliability of a test, the smaller the standard error of measurement will be. For all commercially published tests, a technical manual is available supplying the standard error of measurement for the test. Sometimes, when you have occasion to check out a test’s standard error, you find it’s much larger than you might have suspected. As is true of sampling errors for opinion polls, what you’d prefer to have is small, not large, standard errors of measurement.
In many realms of life, big is better. Most folks like big bank accounts, houses with ample square footage, and basketball players who tower over other folks. But with standard errors of measurement, the reverse is true. Smaller standard errors of measurement signify more accurate assessment.
It’s important that you not think a test’s standard error of measurement is computed by employing some type of measurement mysticism. Accordingly, the formula assessment folks use in order to obtain a standard error of measurement is presented here. If, however, you don’t really care if the standard error of measurement was spawned on Magic Mountain or Mulberry Hill, just skip the formula as well as the explanatory paragraph that follows it.
$$S_e = S_x \sqrt{1 - r_{xx}}$$

where
$S_e$ = standard error of measurement
$S_x$ = standard deviation of the test scores
$r_{xx}$ = reliability of the test
Take a look at the formula for just a moment (especially if you get a kick out of formula-looking), and you’ll see that the size of a particular test’s standard error of measurement ($S_e$) depends on two factors. First, there’s the standard deviation ($S_x$) of the test’s scores—that is, how spread out those scores are. The greater the spread in scores, the higher the scores’ standard deviation will be. Second, there’s the coefficient representing the test’s reliability ($r_{xx}$). Now, if you consider what’s going on in this formula, you’ll see that the larger the standard deviation (score spread), the larger the standard error of measurement. Similarly, the smaller the reliability coefficient, the larger the standard error of measurement. So, in general, a test will have a smaller standard error of measurement if the test’s scores are not too widely spread out and if the test yields more reliable scores. A smaller standard error of measurement signifies that a student’s score is more accurately reflective of the student’s “true” performance level.3
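A few lines of Python show how the two factors in the formula play out. The standard deviations and reliability coefficients below are invented solely to illustrate the arithmetic, and the function name is hypothetical.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = standard deviation of the scores times sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability is expected to lie between 0 and 1")
    return sd * sqrt(1.0 - reliability)

# Invented (standard deviation, reliability) pairs.
examples = [
    (10, 0.91),   # baseline test
    (10, 0.75),   # same score spread, lower reliability
    (5, 0.91),    # same reliability, smaller score spread
]

for sd, r in examples:
    sem = standard_error_of_measurement(sd, r)
    print(f"SD = {sd:>2}, reliability = {r:.2f} -> SEM = {sem:.1f}")
```

With a standard deviation of 10, dropping the reliability from .91 to .75 pushes the standard error of measurement from about 3 points to 5 points. Holding reliability at .91 while halving the standard deviation to 5 cuts the standard error to about 1.5 points. Both results match the verbal description of the formula.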
The standard error of measurement is an important concept because it reminds teachers about the imprecision of the test scores an individual student receives. Novice teachers often ascribe unwarranted precision to a student’s test results. I can remember all too vividly making this mistake myself when I began teaching. While getting ready for my first group of students, I saw as I inspected my students’ records that one of those students, Sally Palmer, had taken a group intelligence test. (Such tests were popular in those days.) Sally had earned a score of 126. Accordingly, for the next year, I was absolutely convinced not merely that Sally was above average in her intellectual abilities. Rather, I was certain that her IQ was exactly 126. I was too ignorant about assessment to realize that there may have been a sizeable standard error of measurement associated with the intelligence test Sally had taken. Her “true” IQ score might have been substantially lower or markedly higher. I doubt that if Sally had retaken the same intelligence test 10 different times, she would ever have received another score of precisely 126. But, in my naivete, I blissfully assumed Sally’s intellectual ability was dead-center 126.
The standard error of measurement helps remind teachers (novice teachers especially) that the scores earned by students on commercial or classroom tests are not so darned exact.
There’s one place a typical teacher is apt to find standard errors of measurement useful, and it is directly linked to the way students’ performances on a state’s accountability tests are reported. Many states classify a student’s performance into one of at least three levels: basic, proficient, or advanced. A student is given one of these three (or more) labels depending on the student’s test scores. So, for instance, on a 60-item accountability test in mathematics, the following classification scheme might have been adopted:
Student’s Mathematics Classification    Student’s Mathematics Accountability Test Score
Advanced                                54–60
Proficient                              44–53
Basic                                   37–43
Below Basic                             36 and below
Now, let’s suppose one of your students has a score near the cutoff for one of this fictitious state’s math classifications. To illustrate, suppose your student answered 53 of the 60 items correctly, and was therefore classified as a proficient student. Well, if the accountability test has a rather large standard error of measurement, you recognize it’s quite likely your proficient student might really be an advanced student who, because of the test’s inaccuracy, didn’t earn the necessary extra point. This is the kind of information you can relay to both the student and the student’s parents. But to do so, of course, you need to possess at least a rudimentary idea about where a standard error of measurement comes from and how it works.
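The reasoning about the student who scored 53 can be sketched in Python. The cut-scores come from the fictitious classification scheme above. The standard error of 2 points is an invented value, and checking a band of one standard error on either side of the observed score is one common convention, not a procedure the chapter prescribes. The function names are hypothetical.

```python
# Cut-scores from the fictitious 60-item classification scheme in the text.
CUT_SCORES = [(54, "Advanced"), (44, "Proficient"), (37, "Basic")]

def classify(score):
    """Return the classification label for a raw score."""
    for minimum, label in CUT_SCORES:
        if score >= minimum:
            return label
    return "Below Basic"

def band_labels(score, sem, max_score=60):
    """Labels at the low and high ends of a band one SEM either side of the score."""
    low = max(0, score - sem)
    high = min(max_score, score + sem)
    return classify(low), classify(high)

# Invented value; a real SEM would come from the test's technical manual.
HYPOTHETICAL_SEM = 2

for observed in (53, 48):
    low_label, high_label = band_labels(observed, HYPOTHETICAL_SEM)
    if low_label == high_label:
        note = "band stays within one classification"
    else:
        note = f"band spans {low_label} to {high_label}"
    print(f"score {observed} ({classify(observed)}): {note}")
```

With this invented standard error, the band around a score of 53 reaches into the Advanced range, whereas the band around a score of 48 stays entirely within Proficient. The closer a score sits to a cut-score, the more caution its label deserves.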
The more students’ scores you find at or near a particular cut-score, the more frequently there will be misclassifications. For example, using the previous table, the cut-score for an “Advanced” classification is 54—that is, a student must get a score of 54 or more to be designated “Advanced.” Well, if there are relatively few students who earned that many points, there will be fewer who might be misclassified. However, for the cut-score between “Basic” and “Proficient,” 44 points, there might be many more students earning scores of approximately 44 points, so the likelihood of misclassifications rises accordingly.
Although a standard error of measurement can reveal how much confidence we can place in a student’s test performance, and is therefore a useful cautionary mechanism for users of test results, few teachers routinely rely on SEMs as they work with test results. Particularly for important tests employed to classify test-takers, SEMs can even be tailored so that they allow test users to make more precise interpretations at or near classification cut-scores.
Here’s how this process works: As you probably know already, in a given distribution of students’ scores on a particular test, we will ordinarily find substantial differences in the number of students who earn certain sorts of scores. Usually, for example, there are more students who score toward the middle of what often turns out to be a “bell-shaped curve” than there are students who earn either very high or very low scores. Well, assume for the moment that five classification categories have been established for a state-administered achievement test, and that there are four cut-scores separating students’ performances into the five groups.
Because it might be important to assign test-takers to one of the five groups, not only is it possible to calculate an overall standard error for the entire array of students’ test scores, but we can also compute SEMs near each of the four cut-scores established for the test. Yes, these are called “conditional standard errors of measurement,” and conditional SEMs can vary considerably in size when calculated for cut-scores representing different areas of an overall distribution of test scores.
It seems unlikely that classroom teachers will ever want to compute standard errors for their own tests, and it is almost certain that they’ll never wish to compute conditional SEMs for those tests. However, for particularly high-stakes tests, such as those affecting students’ admission to universities or students’ attainment of scholarship support, reliability/precision is even more important than usual. Teachers should recognize that it is possible to obtain both overall and conditional SEMs at the cut-score segments of a score distribution, and that access to this information can prove particularly helpful in estimating how near to—or far from—the closest cut-score a given student’s performance is.
What Do Classroom Teachers Really Need to Know About Reliability/Precision?
What do you, as a teacher or teacher in preparation, truly need to know about the reliability/precision of educational tests? Do you, for example, need to gather data from your own classroom assessment procedures so you can actually calculate reliability coefficients for those assessments? If so, do you need to collect all three varieties of reliability evidence? My answers may surprise you. I think you need to know what reliability is, but I don’t think you’ll have much call to use it with your own tests—you won’t, that is, unless certain of your tests are extraordinarily significant to the students. And I haven’t run into teacher-made classroom tests, even rocko-socko final examinations, that I would consider sufficiently significant to warrant your engaging in a reliability-evidence orgy. In general, if you construct your own classroom tests with care, those tests will be sufficiently reliable for the decisions you will base on the tests’ results.
You need to know what reliability is because you may be called on to explain to parents the meaning of a student’s important test scores, and you’ll want to know how reliable are the scores yielded by those important tests. You need to know what a commercial test manual’s authors are talking about and to be wary of those who secure one type of reliability evidence—for instance, a form of internal consistency evidence (because it’s the easiest to obtain)—and then try to proclaim that this form of reliability evidence indicates the test’s stability or the comparability of its multiple forms. In short, you need to be knowledgeable about the fundamental meaning of reliability, but it is not seriously suggested that you make your own classroom tests pass any sort of significant reliability muster.
Reliability is a central concept in measurement. As you’ll see in the next chapter, if an assessment procedure fails to yield consistent results, it is almost impossible to make any truly accurate inferences about what a student’s score really means. Inconsistent measurement is, at least much of the time, almost certain to be inaccurate measurement. Thus, you should realize that as the stakes associated with an assessment procedure become higher, there will typically be more attention given to establishing that the assessment procedure is, indeed, able to produce reliable results. If you’re evaluating an important test developed by others, and you see that only skimpy attention has been given to establishing reliability for the test, you should be critical of the test because evidence regarding an essential attribute of an educational test is missing.
You also ought to possess at least an intuitive understanding of what a standard error of measurement is. This sort of understanding will come in handy when you’re explaining to students, or their parents, how to make sense of a student’s scores on such high-stakes external exams as your state’s accountability tests. It’s also somewhat useful, but not all that often, to know that for very important tests, conditional SEMs can be determined.
The other thing you should know about reliability evidence is that it comes in three brands—three kinds of evidence about a test’s consistency that are not interchangeable. Don’t let someone foist a set of internal consistency results on you and suggest that these results tell you anything of importance about test-retest evidence. Don’t let anyone tell you that a stability reliability coefficient indicates anything about the equivalence of a test’s multiple forms. Although the three types of reliability evidence are related, they really are fairly distinctive kinds of creatures, something along the lines of second or third cousins.
What is clear is that classroom teachers, as professionals in the field of education, need to understand that an important attribute of educational assessment procedures is reliability. The higher the stakes associated with a test’s use, the more that educators should attend to the test’s reliability/precision. Reliability is such a key criterion by which psychometricians evaluate tests that you really ought to know what it is, even if you don’t use it on a daily basis.
But What Does This Have to Do with Teaching?
Students’ questions can sometimes get under a teacher’s skin. For example, a student once asked me a question that, in time, forced me to completely rethink what I believed about reliability—at least insofar as it made a difference in a classroom teacher’s behavior. “Why,” the student asked, “is reliability really important for classroom teachers to know?”
The incident occurred during the early seventies. I can still recall where the student was sitting (the back-right corner of an open-square desk arrangement). I had only recently begun teaching measurement courses at UCLA, and I was “going by the book”; in other words, I was following the course’s textbook almost unthinkingly.
You see, as a graduate student myself, I had never taken any coursework dealing with testing. My doctoral studies dealt with curriculum and instruction, not measurement. I wanted to learn ways of instructing prospective teachers about how to whip up winning instructional plans and deliver them with panache to their students. But after graduate school, I soon began to recognize that what was tested on important tests invariably influenced what was taught by most teachers. I began to read all about educational measurement so I could teach the introductory measurement course in the UCLA Graduate School of Education. Frankly, I was somewhat intimidated by the psychometric shroud with which testing experts sometimes surround their measurement playpens. Thus, as a beginner in the field of measurement, I rarely strayed from the “truths” contained in traditional measurement textbooks. One of those psychometrically sanctioned truths was “reliability is a good thing.” Accordingly, I was in the midst of a lecture extolling the virtues of reliability when this memorable student, a woman who was at the time teaching sixth-graders while also working on her master’s degree, said, “I can’t see any practical reason for teachers to know about reliability or to go to the trouble of computing all these reliability coefficients you’ve been touting. Why should we?”
I can’t recall my answer to that 1970s student, but I’m sure it must have been somewhat insipid. You see, I was merely mouthing a traditional view of reliability as a key attribute of good tests. In my own mind, at that moment, I really didn’t have an answer to her Why question. But her question kept bothering me—actually it did so for several years. Then I finally rethought the realistic value of reliability as a concept for classroom teachers. As you’ll see in this chapter’s wrap-up, I now downplay reliability’s importance for busy teachers. But I suspect I still might be dishing out the same old psychometric party line about reliability—if it hadn’t been for this one student’s perceptive question.
Teachers do need to make sure they evaluate their students’ test performances with consistency, especially if students are supplying essay responses or other sorts of performances that can’t be scored objectively. But as a general rule, classroom teachers need not devote their valuable time to reliability exotica. In reality, reliability has precious little to do with a classroom teacher’s teaching. But this doesn’t make key notions about reliability totally irrelevant to the concerns of classroom teachers. It’s just that reliability is way, way less germane than how to whomp up a winning lesson plan for tomorrow’s class!
The situation regarding your knowledge about reliability is probably somewhat analogous to a health professional’s knowledge about blood pressure and how blood pressure affects health. Even though only a small proportion of health professionals work directly with patients’ blood pressure on a day-by-day basis, there are few health professionals who don’t know at least the fundamentals of how one’s blood pressure can influence one’s health.
You really need not devote any time to calculating the reliability of the scores generated by your own classroom tests, but you should have a general knowledge about what reliability is and why it’s important. Besides, computing too many reliability coefficients for your own classroom tests could give you high blood pressure.
Chapter Summary
This chapter focused on the reliability/precision of educational assessment procedures. Reliability refers to the consistency with which a test measures whatever it’s measuring, that is, the consistency of the test’s scores for any intended use of those scores.
There are three distinct types of reliability evidence. Test-retest reliability refers to the consistency of students’ scores over time. Such reliability is usually represented by a coefficient of correlation between students’ scores on two occasions, but it can also be indicated by the degree of classification consistency displayed for students on two measurement occasions. Alternate-form reliability evidence refers to the consistency of results between two or more forms of the same test. Alternate-form reliability evidence is usually represented by the correlation of students’ scores on two different test forms, but it can also be reflected by classification consistency percentages. Internal consistency reliability evidence refers to the degree of homogeneity in an assessment procedure’s items. Common indices of internal consistency are the Kuder-Richardson formulae and Cronbach’s coefficient alpha. The three forms of reliability evidence should not be used interchangeably, but they should be sought if they are relevant to the educational purpose(s) to which an assessment procedure is being put, that is, to the kind(s) of educational decisions linked to the assessment’s results.
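For readers who like to see the arithmetic, the indices named in this summary can be sketched in a few lines of Python. This is only a minimal illustration, not a recommended routine for classroom use: the function names, the cut-score of 15, and every score below are hypothetical. The alpha function uses the usual coefficient alpha formula, which, for items scored right or wrong, gives the same value as Kuder-Richardson formula 20.

```python
from statistics import mean, pvariance


def pearson_r(x, y):
    """Correlation coefficient between two paired lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def classification_consistency(x, y, cut_score):
    """Proportion of students placed in the same category (at or above
    the cut-score, or below it) on both measurement occasions."""
    same = sum((a >= cut_score) == (b >= cut_score) for a, b in zip(x, y))
    return same / len(x)


def cronbach_alpha(item_scores):
    """Coefficient alpha. item_scores holds one list of item scores per student."""
    k = len(item_scores[0])  # number of items on the test
    item_vars = [pvariance([s[i] for s in item_scores]) for i in range(k)]
    total_var = pvariance([sum(s) for s in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical scores for five students on two occasions (test-retest),
# or equally on two forms of the same test (alternate-form).
first_testing = [10, 12, 15, 18, 20]
second_testing = [11, 13, 14, 19, 20]

# Hypothetical right/wrong (1/0) results for four students on three items.
item_scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]

print(round(pearson_r(first_testing, second_testing), 2))
print(classification_consistency(first_testing, second_testing, cut_score=15))
print(cronbach_alpha(item_scores))
```

The same correlation and classification functions serve for test-retest and alternate-form evidence, because only the source of the second set of scores differs. Internal consistency needs item-by-item results from a single administration, which is one reason the three kinds of evidence are not interchangeable.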
The standard error of measurement supplies an indication regarding the consistency of an individual’s score by estimating person-score consistency from evidence of group-score consistency. The standard error of measurement is interpreted in a manner similar to the plus-or-minus estimates of sampling error that are often provided with national opinion polls. Conditional SEMs can be computed for particular segments of a test’s score scale, such as those scores near any key cut-scores. Classroom teachers are advised to become generally familiar with the key notions of reliability, but not to subject their own classroom tests to reliability analyses unless the tests are extraordinarily important.
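The plus-or-minus interpretation of a standard error of measurement can be illustrated with a short Python sketch. It assumes the classical formula, in which the SEM equals the standard deviation of the test’s scores multiplied by the square root of one minus the reliability coefficient. The function names and the numbers (a standard deviation of 10, a reliability coefficient of .91, an observed score of 75) are hypothetical.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical SEM: score standard deviation times sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability coefficient must be between 0 and 1")
    return score_sd * math.sqrt(1 - reliability)


def score_band(observed_score, sem, n_sems=1):
    """Plus-or-minus band of n_sems standard errors around an observed score."""
    return (observed_score - n_sems * sem, observed_score + n_sems * sem)


# Hypothetical test: scores have a standard deviation of 10 points and
# the reliability coefficient reported for the test is .91.
sem = standard_error_of_measurement(score_sd=10, reliability=0.91)
low, high = score_band(observed_score=75, sem=sem)

print(round(sem, 2), round(low, 2), round(high, 2))
```

With these made-up figures the SEM comes to about 3 points, so a student’s score of 75 is read as roughly 72 to 78. Because the reliability coefficient is an input, substituting a test-retest coefficient for an internal consistency coefficient will change the size of the SEM.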
References
American Educational Research Association. (2014). Standards for educational and psychological testing. Washington, DC: Author.
Kimbell, A. M., & Huzinec, C. (2019). Factors that influence assessment. Pearson Education. Retrieved September 20, 2022, from https://www.pearsonassessments.com/content/dam/school/global/clinical/us/assets/campaign/factors-that-influence-assessment.pdf
McMillan, J. H. (2018). Classroom assessment: Principles and practice for effective standards-based instruction (7th ed.). Pearson.
Miller, M. D. (2019, October). Reliability in educational assessments. ResearchGate. https://www.researchgate.net/publication/336901342_Reliability_in_Educational_Assessments
Miller, M. D., & Linn, R. (2013). Measurement and assessment in teaching (11th ed.). Pearson.
Parkes, J. (2013). Reliability in classroom assessment. In J. H. McMillan (Ed.), SAGE handbook of research on classroom assessment (pp. 107–124). SAGE Publications.
Endnotes
1. The Winter 2014 issue of Educational Measurement: Issues and Practice (Volume 33, No. 4) is devoted to the most recently published Standards. It contains a marvelous collection of articles related to the nature of the Standards and their likely impact on educational practice. Those who are interested in the 2014 Standards are encouraged to consult this informative special issue.
2. Please note the use of the Latin plural for formula. Because I completed 2 years of Latin in high school and 3 years in college, I have vowed to use Latin at least once per month to make those 5 years seem less wasted. Any fool could have said formulas.
3. Because different kinds of reliability coefficients are used to reflect a test’s consistency, the size of a test’s standard error of measurement will depend on the specifics of the particular reliability coefficient that’s used when calculating a standard error of measurement.
A Testing Takeaway
Reliability: A Testing Necessity*
W. James Popham, University of California, Los Angeles
Reliability is one of those concepts that our society has always held in high esteem. Historically, people have yearned for reliability in their autos, their politicians, and their spouses. In educational testing, reliability is also an urgently sought commodity. That’s because it describes the consistency with which a test measures whatever it measures. If evidence is not present to support a test’s reliability, the likelihood of our accurately interpreting a student’s test scores is diminished.
But unlike autos, politicians, and spouses, the evidence to support an educational test’s reliability is collected in three distinctive ways:
• Test-retest reliability evidence describes the consistency between students’ scores after they have completed the same test on two time-separated occasions.
• Alternate-form reliability evidence refers to the consistency of students’ performances after completing two supposedly equivalent forms of the same test.
• Internal consistency reliability evidence tells us how similarly a test’s items are functioning to measure what they’re measuring.
What is important to understand about test reliability is that these three kinds of evidence represent fundamentally different ways of supporting the degree to which a test is measuring with consistency. Although related to one another, somewhat like second cousins, these three officially approved varieties of reliability evidence are not interchangeable.
To illustrate, suppose a school district’s officials have purchased a commercially sold mathematics test to use “whenever a student’s achievement level indicates that the student is ready.” Different students, therefore, take the test at different times during the school year. The test’s publisher supports this use of the test by presenting positive evidence of the test’s internal consistency.
But, as you can see, the consistency evidence in this example should focus on test-retest reliability. That’s because the test must measure consistently regardless of when students completed it during the school year. The publisher’s internal consistency reliability evidence describes the similarity in how a test’s items function, not their test-retest consistency.
When a test fails to yield consistent test scores for the intended use of the test, it is unlikely that accurate—that is, valid—interpretations will be made based on a student’s test performance. Evidence of the right kind of consistency must be at hand. Unlike in a court of law, where the accused is regarded as innocent until proven guilty, in educational measurement a test is seen as guilty until its reliability has been demonstrated—using the right kind of reliability evidence, of course.
*From Chapter 3 of Classroom Assessment: What Teachers Need to Know, 10th ed., by W. James Popham. Copyright 2022 by Pearson, which hereby grants permission for the reproduction and distribution of this Testing Takeaway, with proper attribution, for the intent of increasing assessment literacy. A digitally shareable version is available from https://www.pearson.com/store/en-us/pearsonplus/login.