Education EDU 530 Week 2 Assignment
Chapter 11
Improving Teacher-Developed Assessments
Chief Chapter Outcome
Sufficient comprehension of both judgmental and empirical test-improvement procedures so that accurate decisions can be made about the way teachers are employing these two test-improvement strategies
Learning Objectives
11.1 Apply judgmental-based test-improvement procedures to improve teacher-developed assessments.
11.2 Apply empirical-based test-improvement procedures to improve teacher-developed assessments.
If you’ve ever visited the manuscript room of the British Museum, you’ll recall seeing handwritten manuscripts authored by some of the superstars of English literature. It’s a moving experience. Delightfully, the museum presents not only the final versions of famed works by such authors as Milton and Keats, but also the early drafts of those works. It is somewhat surprising, and genuinely encouraging, to learn that even those giants of literature didn’t get it right the first time. They had to cross out words, delete sentences, and transpose phrases. Many of the museum’s early drafts are genuinely messy, reflecting all sorts of rethinking on the part of the author. Well, if the titans of English literature had to revise their early drafts, is it at all surprising that teachers usually need to spruce up their classroom assessments?
M11_POPH0936_10_SE_C11.indd 275M11_POPH0936_10_SE_C11.indd 275 09/11/23 6:11 PM09/11/23 6:11 PM
This chapter is designed to provide you with several procedures by which you can improve the assessment instruments you develop. It’s a near certainty that if you use the chapter’s recommended procedures, your tests will get better. (But they probably won’t make it to the British Museum—unless you carry them in when visiting.)
Two general improvement strategies are described in this chapter. First, you’ll learn about judgmental item-improvement procedures, in which the chief means of sharpening your tests is human judgment: your own and that of others. Second, you’ll be considering empirical item-improvement procedures based on students’ responses to your assessment procedures. Ideally, if time permits and your motivation abounds, you can use both forms of test-improvement procedures to sharpen up your own classroom-assessment devices.
But here’s a heads-up regarding the upcoming empirical item-repair procedures. What you’ll recognize when you get into this section of the chapter is that there are numbers in it. (There’s even a formula or two.) When encountering this material, perhaps you’ll conclude that you’ve been tricked, lulled into a sense of nonnumeric complacency for 10 chapters and then blindsided with a quagmire of quantitative esoterica. But that’s not so. Don’t be intimidated by the quantitative stuff in the last half of the chapter. It truly constitutes a fairly primitive level of number nudging. Just take it slow and easy. You’ll survive surprisingly well. That’s a promise.
Judgmentally Based Improvement Procedures
Human judgment, although it sometimes gets us in trouble, is a remarkably useful tool. Judgmental approaches to test improvement can be carried out quite systematically or, in contrast, rather informally. Judgmental assessment-improvement strategies differ chiefly in who is supplying the judgments. There are three sources of test-improvement judgments you should consider: those supplied by (1) yourself, (2) your colleagues, and (3) your students. We’ll consider each of these potential judgment sources separately.
Judging Your Own Assessment Instruments
Let’s suppose you’ve created a 50-item combination short-answer and multiple-choice examination for a U.S. government class and, having administered it, you want to improve the examination for your next year’s class. We’ll assume that during item development you did a reasonable amount of in-process honing of the items so that they constituted what, at the time, was your best effort. Now, however, you have an opportunity to revisit the examination to see whether its 50 items are all that marvelous. It’s a good thing to do.
Let’s consider other kinds of assessments—and particularly the directions for them. Suppose you have devised what you think are pretty decent directions for your students to follow when preparing portfolios or when responding to tasks in a performance test. Even if you regard those directions as suitable, it’s always helpful to review such directions after a time to see whether you can now detect shortcomings that, when you originally prepared the set of directions, escaped your attention.
As you probably know from other kinds of writing, when you return to one of your written efforts after it has had time to “cool off,” you’re likely to spot deficits that, in the heat of the original writing, weren’t apparent to you. However, beyond a fairly casual second look, you will typically be able to improve your assessment procedures even more if you approach this test-improvement task systematically. To illustrate, you could use specific review criteria when you judge your earlier assessment efforts. If a test item or a set of directions falls short on any criterion, you should obviously do some modifying. Presented here are five review criteria that you might wish to consider if you set out systematically to improve your classroom-assessment procedures.
• Adherence to item-specific guidelines and general item-writing commandments. When you appraise your assessment procedures, it will be useful to review briefly the general item-writing precepts supplied earlier (in Chapter 6) as well as the particular item-writing guidelines provided for the specific kind of item(s) you’ve developed. If you now see violations of the suggestions put forth in either of those two sets of directives, fix the flaws.
• Contribution to score-based inference. Recall that the real reason teachers assess students is to arrive at score-based interpretations about the status of students for particular assessment purposes. Therefore, it will be helpful for you to reconsider each aspect of a previously developed assessment procedure to see whether it does, in fact, really contribute to the kind of inference you wish to draw about your students.
• Accuracy of content. There’s always the possibility that previously accurate content has now been superseded or contradicted by more recent content. Make sure that the content you included earlier in the assessment instrument is still accurate and that your answer key is still correct.
• Absence of content lacunae. This review criterion gives me a chance to use one of my favorite words, the plural form of lacuna, which, incidentally, means “gap.” Although gaps would have done the communication job in this instance, you’ll admit that gaps look somewhat tawdry when stacked up against lacunae. (This, of course, is a norm-referenced contrast.) Hindsight is a nifty form of vision. Thus, when you take a second look at the content coverage represented in your assessment instrument, you may discover that you originally overlooked some important content. This review criterion is clearly related to the assessment’s contribution to the score-based inference that you want to make. Any meaningful lacunae in content will obviously reduce the accuracy of your inference.
• Fairness. Although you should clearly have tried to eradicate any bias in your assessment instruments when you originally developed them, there’s always the chance that you overlooked something. Undertake another bias review just to make certain you’ve been as attentive to bias elimination as you possibly can be.
I must personally confess to a case of early-career bias blindness when I authored a textbook on educational evaluation way back in the 1970s. I had recognized that a number of people at the time were beginning to take offense at many authors’ use of masculine pronouns when referring to unnamed individuals (for example, “The student lost his lunch”). These days, chiefly because of heightened equity concerns, the number of people voicing warranted dismay over such a male-biased pronoun practice would surely be far, far larger. Anyway, in a quest for equity, I had assiduously written the 70s textbook so that all my illustrative make-believe people were plural. I never was obliged to use his or her because I could always use their. After churning out the last chapter, I was complacently proud of my pluralization prowess. Yet, having been so attentive to a pronoun issue that might offend some people, I had still blithely included a cartoon in the book that showed scantily clad females cavorting before male members of a school board. Talk about dumb! If only I’d taken a serious second look at the book before it hit the presses, I might have spotted my insensitive error. As it was, because the cartoon appeared in all copies of the first edition (not the second!), I received more than a few heated, and richly deserved, complaints from readers of all genders!
This does not suggest that you fill out an elaborate rating form for each of these five review criteria—with each criterion requiring a numerical rating. Rather, it is recommended that you seriously think about these five criteria before you tackle any judgmental review of a previously developed assessment instrument.
This chapter dealing with the improvement of teacher-made classroom tests has a clear mission, and it’s probably a good idea to remind you of the nature of that mission. Remember how, earlier in this book, the three most important concepts of educational measurement were fawned over at length. Yes, validity, reliability, and fairness were applauded so relentlessly that you might have suspected they were being groomed for some sort of sainthood. Well, the “big three” are, indeed, important assessment concepts, and by improving the classroom assessments you’ve churned out, you’re surely hoping to improve one or more of those three concepts in your very own tests. Remember, the stronger your teacher-made tests are, the more defensible will be the test-based interpretations you make about the unseen, covert status of your students. Often, a little tightening up of just a few items in your tests can lead to meaningfully better interpretations and, thereafter, better-educated students.
Collegial Judgments
If you are working with a colleague whose judgment you trust, it’s often helpful to ask this person to review your assessment procedures. To get the most mileage out of such a review, you’ll probably need to provide your coworker with at least a brief description of review criteria such as the five previously cited ones: (1) adherence to item-specific guidelines and general item-writing precepts, (2) contribution to score-based inference, (3) accuracy of content, (4) absence of content gaps or lacunae, and (5) fairness. You will need to describe to your colleague the key inference(s) you intend to base on the assessment procedure. It will also be useful to your colleague if you identify the decisions that will, thereafter, be influenced by your inferences about students.
Collegial judgments are particularly helpful to teachers who employ many performance tests or who use portfolio assessments. Most of the empirically based improvement procedures you’ll be learning about later in this chapter are intended to be used with more traditional sorts of items such as those found in multiple-choice exams. For portfolio assessment and performance tests, judgmental approaches will often prove more useful.
Remember, if you are the creator of the portfolio assessment or performance tests you’re setting out to improve, you’re apt to be biased in their favor. After all, parents almost always adore their progeny. What you need is a good, hard, nonpartisan review of what you’ve been up to assessment-wise.
To do a thorough job of helping you review your assessment approaches, of course, your colleague will need to put in some time. In fairness, you’ll probably feel obliged to toss in a quid pro quo or two whereby you return the favor by reviewing your colleague’s tests (or by resolving a pivotal personal crisis in your colleague’s private life). If your school district is large enough, you might also have access to some central-office supervisorial personnel who know something about assessment. Here’s a nifty opportunity to let them earn their salaries: get one of those district supervisors to review your assessment procedures. It is often asserted that another pair of eyes can help improve almost any written document. This doesn’t necessarily mean the other pair of eyes sees accurately while yours are in need of contact lenses. You should listen to what other reviewers say, but you should ultimately be guided by your own judgments about the virtues of their suggestions.
Student Judgments
When teachers set out to improve assessment procedures, a rich source of data is often overlooked because teachers typically fail to secure advice from their students. Yet, because students have experienced test items in a most meaningful context, akin to the way a potential executionee experiences a firing squad, student judgments can provide teachers with useful insights. Student reactions can help you spot shortcomings in particular items and in other features of your assessment procedures, such as a test’s directions or the time you’ve allowed for completing the test.
The kinds of data secured from students will vary, depending on the type of assessment procedure being used, but questions such as those on the item-improvement questionnaire in Figure 11.1 can profitably be given to students after they have completed an assessment procedure. Although the illustrative questionnaire in Figure 11.1 is intended for use with a selected-response type of test, only minor revisions would be needed to make it suitable for a constructed-response test, a performance test, or a portfolio assessment system.
It is important to let students finish a test prior to their engaging in such a judgmental exercise. If students are asked to simultaneously play the roles of test-takers and test-improvers, they’ll probably botch up both tasks. No student should be expected to perform two such functions, at least not at the same time.
Simply give students the test as usual, collect their answer sheets or test booklets, and provide them with new, blank test booklets. Then distribute a questionnaire asking students to supply per-item reactions. In other words, ask students to play examinees and item reviewers, but to play these roles consecutively, not simultaneously.
Now, how do you treat students’ reactions to test items? Let’s say you’re a classroom teacher and a few students have come up with a violent castigation of one of your favorite items. Do you automatically buckle by scrapping or revising the item? Of course not; teachers are made of sterner stuff. Perhaps the students were miffed about the item because they didn’t know how to answer it. One of the best ways for students to escape responsibility for a dismal test performance is to assail the test itself. Teachers should anticipate a certain amount of carping from low-scoring students or those questing for “perfection only” in their responses.
Yet, allowing for a reasonable degree of complaining, student reactions can sometimes provide useful insights for teachers. A student-castigated item may, indeed, deserve castigation—gobs of it. To overlook students as a source of judgmental test-improvement information, for both selected-response and constructed-response items, would clearly be an error.
Item-Improvement Questionnaire for Students
1. If any of the items seemed confusing, which ones were they?
2. Did any items have more than one correct answer? If so, which ones?
3. Did any items have no correct answers? If so, which ones?
4. Were there words in any items that confused you? If so, which ones?
5. Were the directions for the test, or for particular subsections, unclear? If so, which ones?
Figure 11.1 An Illustrative Item-Improvement Questionnaire for Students
To review, then, judgmentally based test-improvement procedures can rely on the judgments supplied by you, your colleagues, or your students. In the final analysis, you will be the one who decides whether to modify your assessment procedures. Nonetheless, it is almost always helpful to have others react to your tests.
Empirically Based Improvement Procedures
In addition to judgmentally based methods of improving your assessment procedures, there are improvement approaches based on the empirical data that students supply when they respond to the assessment instruments you’ve developed. Let’s turn, then, to the use of student-response data in the improvement of assessment procedures. A variety of empirical item-improvement techniques have been well honed over the years. We will consider the more traditional item-analysis procedures first, turning later to a few more recent wrinkles for using student data to improve a teacher’s classroom assessments.
Most of the procedures employed to improve classroom assessments on the basis of students’ responses to those assessments rely on numbers. Clearly, there are some readers of this text (you may be one) who are definitely put off by numbers. It has been said, typically with feigned sincerity, that there is a secret cult among teachers and prospective teachers who have taken a “tremble pledge,” a vow to tremble when encountering any numerical value larger than a single digit (a number larger than 9). If you are one of these mathphobics, please stay calm because the numbers you’ll be encountering in this and later chapters will really be simple stuff. Tremble not. This will be sublimely pain free. Just work through the easy examples, and you’ll survive with surprising ease.
Difficulty Indices
One useful index of an item’s quality is its difficulty. The most commonly employed item-difficulty index, often referred to simply as a p value, is calculated as follows:
\[ \text{Difficulty } p = \frac{R}{T} \]
where R = the number of students responding correctly (right) to an item, and T = the total number of students responding to the item.
To illustrate, if 50 students answered an item, and only 37 of them answered it correctly, then the p value representing the item’s difficulty would be
\[ \text{Difficulty } p = \frac{37}{50} = .74 \]
It should be clear that such a p value can range from 0 to 1.00, with higher p values indicating items that more students answered correctly. For example, a p value of .98 would signify an item answered correctly by almost all students. Similarly, an item with a p value of .15 would be one that most students (85 percent) missed.
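The difficulty index described above can be sketched in a few lines of Python. This is only an illustration, not a prescribed tool: the function name `p_value` and the choice to represent each student's response as `True` (right) or `False` (wrong) are assumptions made for the example.

```python
def p_value(responses):
    """Return an item's difficulty index: R (number right) divided by T (total)."""
    total = len(responses)                       # T: students responding to the item
    if total == 0:
        raise ValueError("at least one response is needed")
    right = sum(1 for r in responses if r)       # R: students answering correctly
    return right / total


# Hypothetical data mirroring the chapter's example: 37 of 50 students correct.
item_responses = [True] * 37 + [False] * 13
print(p_value(item_responses))  # 0.74
```

Running the function once per item yields a p value for every item on a test, which can then be interpreted in light of the instruction that preceded the test.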
The p value of an item should always be considered in relationship to the probability of the student’s getting the correct response purely by guessing. For example, if a binary-choice item (that is, a two-option item) is involved, then on the basis of chance alone, students who are wildly guessing should still be able to answer the item correctly half of the time, and thus the item would have a p value of .50. On a four-option multiple-choice test, a .25 p value by chance alone would be expected.
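As a rough sketch of the guessing baseline just described, the p value expected from blind guessing on a selected-response item is 1 divided by the number of options. The function names below and the `margin` parameter are hypothetical, chosen only for illustration; how far above chance a p value should sit is a judgment call the text leaves to the teacher.

```python
def chance_p_value(num_options):
    """Expected p value if every student guesses blindly among equally likely options."""
    if num_options < 2:
        raise ValueError("a selected-response item needs at least two options")
    return 1 / num_options


def clearly_above_chance(p, num_options, margin=0.10):
    """Flag whether an observed p value exceeds the guessing baseline.

    The margin of .10 is an arbitrary illustrative threshold, not a standard.
    """
    return p > chance_p_value(num_options) + margin


print(chance_p_value(2))  # binary-choice item: 0.5
print(chance_p_value(4))  # four-option multiple-choice item: 0.25
```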
Educators sometimes err by referring to items with high p values (for instance, p values of .80 and above) as “easy” items, while items with low p values (of, say, .20 and below) are described as “difficult” items. Those descriptions may or may not be accurate. Even though we typically refer to an item’s p value as its difficulty index, the actual difficulty of an item is tied to the instructional program surrounding it. If students are especially well taught, they may perform excellently on a complex item that, by anyone’s estimate, is a tough one. Does the resulting p value of .95 indicate that the item is easy? No. The item’s complicated content may just have been taught effectively. For example, almost all students in a premed course for prospective physicians might correctly answer a technical item about the central nervous system that almost all “people off the street” would answer incorrectly. A p value of .96 based on the premed students’ performances would not render the item intrinsically easy.
What is a suitable p value for a test item? Well, as is so often the answer in educational measurement whenever crystal-clear questions are posed, it depends. For instance, sometimes teachers want to include in one of their tests a wide range of items with decidedly different difficulty levels in an attempt to differentiate among the full range of their students’ mastery of a given body of knowledge. In other instances, the teacher may wish to delete items that are answered incorrectly by too many students, a pattern suggesting that an item’s content is too difficult or, perish the possibility, that the content has not been well taught! Accordingly, p values provide a useful indication of the proportion of students who flew or flopped on a given item, but the varying contexts of educational assessment require teachers to arrive at judgments regarding a given item. That is, is it too tough, is it too easy, or did it hit the Goldilocks magic “just right” mark?
Item-Discrimination Indices
For tests designed to yield norm-referenced inferences, one of the most powerful indicators of an item’s quality is the item-discrimination index. In brief, an item-discrimination index typically tells us how frequently an item is answered correctly by those who perform well on the total test. Fundamentally, an item-discrimination index reflects the relationship between students’ responses for the total test and their responses to a particular test item. One approach to computing an item-discrimination statistic is to calculate a correlation coefficient between students’ total test scores and their performances on a particular item.
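The correlational approach mentioned above can be sketched as follows. The chapter does not say which coefficient to use, so this example assumes an ordinary Pearson correlation between item scores (1 for right, 0 for wrong) and total test scores. The function name, the sample data, and the decision to report 0.0 when there is no variation are all illustrative assumptions.

```python
from math import sqrt


def item_total_correlation(item_scores, total_scores):
    """Pearson correlation between 0/1 item scores and students' total test scores."""
    n = len(item_scores)
    if n != len(total_scores) or n < 2:
        raise ValueError("need two equal-length lists with at least two students")
    mean_item = sum(item_scores) / n
    mean_total = sum(total_scores) / n
    cov = sum((i - mean_item) * (t - mean_total)
              for i, t in zip(item_scores, total_scores))
    var_item = sum((i - mean_item) ** 2 for i in item_scores)
    var_total = sum((t - mean_total) ** 2 for t in total_scores)
    if var_item == 0 or var_total == 0:
        # Assumption: with no variation the coefficient is undefined;
        # this sketch reports 0.0 (no evidence of discrimination).
        return 0.0
    return cov / sqrt(var_item * var_total)


# Hypothetical class of six students: high scorers got the item right.
item = [1, 1, 1, 0, 0, 0]
totals = [48, 45, 40, 30, 25, 20]
print(item_total_correlation(item, totals) > 0)  # True: the item discriminates positively
```

A positive coefficient means that students who answered the item correctly tended to earn higher total scores; a negative coefficient signals the reverse.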
We should pause for just a moment to forestall potential confusion that might arise when the term “discrimination” is applied to two very different sorts of assessment activities. In just a moment, you will be learning that in the empirical improvement of a teacher’s test items, it is often the case that discrimination is something positive, a quality much to be sought. Yet, when you grappled in Chapter 5 with the nuances of assessment bias, you saw that when items discriminate against subgroups of students, such discrimination renders the items unfair enough to warrant their being bounced from the test. As you tangle with subsequent paragraphs in this section of the chapter, remember that there is nothing inherently evil about an item that discriminates. Imagine, for example, that you had just finished a spectacular two-week teaching unit dealing with a set of conventions regarding comma usage. Your test items covering those two weeks of pedagogical bliss should most definitely discriminate between those students who learned what you were teaching and those students who apparently took a two-week cognitive vacation during your dazzling unit called Caring for One’s Commas.
A positively discriminating item is one that is answered correctly more often by those who score well on the total test than by those who score poorly on the total test. A negatively discriminating item is answered correctly more often by those who score poorly on the total test than by those who score well on the total test. A nondiscriminating item is one for which there’s no appreciable difference between the correct-response proportions of those who score well and those who score poorly on the total test. This set of relationships is summarized in the following chart. (Remember that < and > signify less than and more than, respectively.)
Type of Item              Proportion of Correct Responses on Total Test
Positive Discriminator    High Scorers > Low Scorers
Negative Discriminator    High Scorers < Low Scorers
Nondiscriminator          High Scorers = Low Scorers
In general, teachers would like to discover that their items are positive discriminators, because a positively discriminating item tends to be answered correctly by the most knowledgeable students (those who scored high on the total test) and incorrectly by the least knowledgeable students (those who scored low on the total test). Negatively discriminating items indicate something is awry, because the item tends to be missed more often by the most knowledgeable students and answered correctly more frequently by the least knowledgeable students.
Now, how do you go about computing an item’s discrimination index? Increasingly, these days, teachers can send off a batch of machine-scorable answer sheets (for selected-response items) to a district-operated assessment center where electronic machines have been programmed to spit out p values and item-discrimination indices. Happily, a number of easy-to-use computer programs are now available so teachers can also carry out such analyses themselves. In such instances, the following four steps can be employed for the discrimination analyses of classroom-assessment procedures:
1. Order the test papers from high to low by total score. Place the paper having the highest total score on top, then continue with the next highest total score sequentially until the paper with the lowest score is placed on the bottom.
2. Divide the papers into a high group and a low group, with the same number of papers in each group. Split the groups into upper and lower halves. If there is an odd number of papers, simply set aside one of the middle papers. If there are several papers with identical scores at the middle of the distribution, then randomly assign them to the high or low distributions so that there is the same number of papers in both groups. The use of 50-percent groups has the advantage of providing enough papers to permit reliable estimates of upper- and lower-group performances.
3. Calculate a p value for the high group and a p value for the low group. Determine the number of students in the high group who answered the item correctly, and then divide this number by the number of students in the high group. This provides you with $p_h$. Repeat the process for the low group to obtain $p_l$.
4. Subtract $p_l$ from $p_h$ to obtain each item’s discrimination index (D). In essence, then, $D = p_h - p_l$.
Suppose you are in the midst of conducting an item analysis of your midterm examination items. Let’s say you split the papers for your class of 30 youngsters into two equal groups, an upper half and a lower half. All 15 students in the high group answered item 42 correctly, but only 5 of the 15 students in the low group answered it correctly. The item-discrimination index for item 42, therefore, would be 1.00 − .33 = .67.
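The four-step procedure and the item 42 example can be sketched in Python along the following lines. This is a minimal illustration under stated assumptions: each paper is represented as a hypothetical `(total_score, answered_item_correctly)` pair, ties are broken by shuffling before sorting, and the `seed` parameter exists only to make the random tie-breaking repeatable.

```python
import random


def discrimination_index(papers, seed=0):
    """Compute D = p_h - p_l for one item using upper and lower halves.

    papers: list of (total_score, answered_item_correctly) pairs.
    """
    rng = random.Random(seed)
    shuffled = list(papers)
    rng.shuffle(shuffled)  # randomizes the order of papers with identical scores
    # Step 1: order papers from high to low (the sort is stable, so ties stay shuffled).
    ordered = sorted(shuffled, key=lambda paper: paper[0], reverse=True)
    # Step 2: equal-sized high and low groups; with an odd count the middle paper is set aside.
    half = len(ordered) // 2
    if half == 0:
        raise ValueError("need at least two papers")
    high, low = ordered[:half], ordered[-half:]
    # Step 3: p value for each group.
    p_h = sum(1 for _, correct in high if correct) / half
    p_l = sum(1 for _, correct in low if correct) / half
    # Step 4: subtract.
    return p_h - p_l


# Hypothetical class of 30: all 15 high scorers right, 5 of the 15 low scorers right.
high_papers = [(score, True) for score in range(16, 31)]
low_papers = [(score, score <= 5) for score in range(1, 16)]
print(round(discrimination_index(high_papers + low_papers), 2))  # 0.67
```

Because the data here contain no tied scores, the random tie-breaking has no effect on this particular result; with tied scores at the middle of the distribution, different seeds could place different papers in each group.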
Now, how large should an item’s discrimination index be in order for you to consider the item acceptable? Ebel and Frisbie (1991) offered the oft-used experience-based guidelines in Table 11.1 for indicating the quality of items to be used for making norm-referenced interpretations. If you consider their guidelines as approximations, not absolute standards, they’ll usually help you decide whether your items are discriminating satisfactorily.
An item’s ability to discriminate is highly related to its overall difficulty index. For example, an item answered correctly by all students has a total p value of 1.00. For that item, $p_h$ and $p_l$ are also 1.00. Thus, the item’s discrimination index is zero (1.00 − 1.00 = 0). A similar result would be obtained for items in which the overall p value was zero, that is, for items no student had answered correctly.
With items having very high or very low p values, it is thus less likely that substantial discrimination indices can be obtained. Later in the chapter, you will see that this situation has prompted proponents of criterion-referenced interpretations of tests (who often hope that most post-instruction responses from students will be correct) to search for alternative ways to calculate indices of item quality.
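The link between difficulty and discrimination can be made explicit with a short derivation. It is offered here as a supplementary sketch rather than as part of the original guidelines, and it assumes equal-sized high and low groups with no paper set aside, so that the overall p value is the average of the two group p values.

```latex
% Assumption: equal-sized high and low groups, no paper set aside.
p = \frac{p_h + p_l}{2}, \qquad D = p_h - p_l
\quad\Longrightarrow\quad
p_h = p + \frac{D}{2}, \qquad p_l = p - \frac{D}{2}.

% Both group p values must lie between 0 and 1, so
|D| \le 2\,\min(p,\; 1 - p).

% Examples: p = .95 gives |D| \le .10, and p = 1.00 gives D = 0.
```

Under this assumption, an item that 95 percent of students answer correctly cannot produce a discrimination index above .10, however well the item is written.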
In looking back at the use of item-discrimination indices to improve a test’s items, it is important to understand that these indices, as their name signifies, were designed to discriminate. That is, such indices have been employed to help spot items that do a good job in contributing sufficient variation to students’ total-test scores so that those total-test scores can be accurately compared with one another. Educators in the United States have been using most educational tests to do this sort of discriminating between high scorers and low scorers for more than a century. To make the kind of norm-referenced, comparative interpretations that have historically been the chief function of U.S. educational testing, we needed to refine a set of test-improvement procedures suitable for sharpening the accuracy of our comparative interpretations of scores. Item-discrimination indices help us to locate the optimal items for contrasting students’ test performances.
Table 11.1 Guidelines for Evaluating the Discriminating Efficiency of Items
Discrimination Index    Item Evaluation
.40 and above           Very good items
.30–.39                 Reasonably good items, but possibly subject to improvement
.20–.29                 Marginal items, usually needing improvement
.19 and below           Poor items, to be rejected or improved by revision
Source: Ebel, R. L., and Frisbie, D. A. (1991). Essentials of Educational Measurement (5th ed.). Englewood Cliffs, NJ: Prentice Hall.
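The guidelines in Table 11.1 can be expressed as a small lookup function. This is only a sketch: the function name is hypothetical, and rounding the index to two decimal places before comparing is an assumption made so that a value such as .395 falls into one of the table's bands. As with the table itself, the labels are approximations, not absolute standards.

```python
def evaluate_item(discrimination_index):
    """Return the Table 11.1 evaluation for a discrimination index (D)."""
    d = round(discrimination_index, 2)  # assumption: the table's bands use two decimals
    if d >= 0.40:
        return "Very good item"
    if d >= 0.30:
        return "Reasonably good item, but possibly subject to improvement"
    if d >= 0.20:
        return "Marginal item, usually needing improvement"
    return "Poor item, to be rejected or improved by revision"


print(evaluate_item(0.67))  # Very good item
```

Negative indices also land in the lowest band, which fits the chapter's point that a negatively discriminating item signals something awry.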
But although it is certainly appropriate to employ educational tests in an attempt to come up with norm-referenced interpretations of students’ performances, there are educational settings in which a test’s comparative function should be overshadowed by its instructional mission. In those more instructionally oriented settings, traditional item-discrimination indices may be quite inappropriate. To illustrate, suppose a mathematics teacher, Khanh, decides to get her students to become skilled in the use of a really demanding mathematical procedure. (We can call it Procedure X.) Well, when Khanh gave her students a test containing Procedure X items early in the year, students’ performances on the items were rather scattered, and the resulting item-discrimination indices for all the Procedure X items looked quite acceptable, between .30 and .60. However, at the end of the school year, after an awesomely wonderful instructional job by plucky Khanh, almost every student answered nearly every Procedure X item correctly. The p value for every one of those items was .95 or above! As a consequence of this almost nonexistent variation in Khanh’s students’ Procedure X item performances, those items produced particularly low item-discrimination indices, much lower than those regarded as acceptable. Were those Procedure X items flawed? Do they need to be revised or replaced? Of course not! The “culprit” in this caper is simply Khanh’s excellent instruction. When almost all students score perfectly on a related set of test items because their teacher has done a spectacular instructional job teaching whatever those items measure, does this make the items unacceptable? Obviously not.
Decision Time
To Catch a Culprit: Teacher or Test?
Isabella Kappas teaches sixth-grade social studies in Exeter Middle School. During the 7 years she has taught at Exeter, Isabella has always spent considerable time in developing what she refers to as “credible classroom assessments.” She really has put in more than her share of weekends working to create stellar examinations.
This last spring, however, Isabella completed an extension course on educational testing. In the course, she learned how to compute discrimination analyses of her test items. As a consequence, Isabella has been subjecting all of her examinations to such analyses this year.
On one of her examinations containing mostly selected-response items, Isabella discovered to her dismay that 4 of the test’s 30 items turned out to be negative discriminators. In other words, students who performed well on the total test answered the 4 items incorrectly more often than students who didn’t do so well on the total test. To her surprise, all 4 negatively discriminating items dealt with the same topic, that is, relationships among the legislative, judicial, and executive branches of the U.S. government.
Isabella’s first thought was to chuck the 4 items because they were clearly defective. As she considered the problem more closely, however, another possibility occurred to her. Because all 4 items were based on the same instructional content, perhaps she had confused the stronger students with her instructional explanations.
If you were Isabella and wanted to get to the bottom of this issue so you could decide whether to overhaul the items or the instruction, how would you proceed?
Item-discrimination indices, therefore, are often quite inappropriate for evaluating the quality of items in teacher-made tests if those tests are being employed chiefly for instructional purposes. Be wary of unthinkingly submit- ting your classroom assessments to a medley of traditional item-improvement procedures—empirical or judgmental—that do not mesh with the purpose for which the test was created. For norm-referenced oriented measurement mis- sions, item-discrimination indices have much to recommend them; for criterion- referenced assessment, the results of item-discrimination procedures can often be misleading.
Distractor Analyses
For a selected-response item that, perhaps on the basis of its p value or its discrimination index, appears to be in need of revision, it is necessary to look deeper. In the case of multiple-choice items, we can gain further insight by carrying out a distractor analysis, in which we see how the high and low groups are responding to the item’s distractors.
Presented in Table 11.2 is the information typically used when conducting a distractor analysis. Note that the asterisk in Table 11.2 indicates that choice B is the correct answer to the item. For the item in the table, the difficulty index (p) was .50 and the discrimination index (D) was −.33. An inspection of the distractors reveals that something in alternative D seems to be enticing the students in the high group to choose it. Indeed, while over half of the high group opted for choice D, not a single student in the low group went for choice D. Alternative D needs to be reviewed carefully.
Also note that alternative C is doing nothing at all for the item. No students at all selected choice C. In addition to revising choice D, therefore, choice C might be made a bit more appealing. It is possible, of course, particularly if this is a best-answer type of multiple-choice item, that alternative B, the correct answer, needs a bit of massaging as well. For multiple-choice items in particular, but also for matching items, a more intensive analysis of students’ responses to individual distractors can frequently be illuminating. In the same vein, careful scrutiny of students’ responses to essay and short-answer items will often supply useful insights for item-revision purposes.
Table 11.2 A Typical Distractor-Analysis Table
Item No. 28 (p = .50, D = −.33)
Alternatives          A     B*    C     D     Omit
Upper 15 students     2     5     0     8     0
Lower 15 students     4     10    0     0     1
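The numbers in Table 11.2 can be checked with a few lines of code. What follows is only a minimal sketch: it assumes that D is the proportion of the upper group answering correctly minus the proportion of the lower group answering correctly (which reproduces the table's −.33), and the function names and the two flagging rules are illustrative assumptions rather than procedures prescribed in this chapter.

```python
from typing import Dict

# Response counts copied from Table 11.2 (item 28); "B" is the keyed answer.
upper: Dict[str, int] = {"A": 2, "B": 5, "C": 0, "D": 8, "Omit": 0}
lower: Dict[str, int] = {"A": 4, "B": 10, "C": 0, "D": 0, "Omit": 1}
KEY = "B"


def difficulty(upper: Dict[str, int], lower: Dict[str, int], key: str) -> float:
    """p value: proportion of all tabulated students choosing the keyed answer."""
    total = sum(upper.values()) + sum(lower.values())
    return (upper[key] + lower[key]) / total


def discrimination(upper: Dict[str, int], lower: Dict[str, int], key: str) -> float:
    """D: upper-group proportion correct minus lower-group proportion correct."""
    return upper[key] / sum(upper.values()) - lower[key] / sum(lower.values())


def flag_distractors(upper: Dict[str, int], lower: Dict[str, int], key: str) -> Dict[str, str]:
    """Flag distractors using two assumed rules of thumb (not from the chapter)."""
    flags = {}
    for choice in upper:
        if choice in (key, "Omit"):
            continue
        if upper[choice] + lower[choice] == 0:
            flags[choice] = "chosen by no one"
        elif upper[choice] > lower[choice]:
            flags[choice] = "attracts more high scorers than low scorers"
    return flags


print(round(difficulty(upper, lower, KEY), 2))
print(round(discrimination(upper, lower, KEY), 2))
print(flag_distractors(upper, lower, KEY))
```

Run on the table's counts, the sketch flags alternatives C and D, the same two choices singled out in the discussion above. Alternative A is left alone because it draws more low scorers than high scorers, which is how a working distractor is expected to behave.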
Parent Talk
Suppose a parent of one of your students called you this morning, before school, to complain about his son’s poor performance on your classroom tests. He concludes his grousing by asking you, “Just how sure are you that the fault is in Tony and not in your tests?”
If I were you, here’s how I’d respond to the parent:
“I’m glad you called about your son’s test results because I’m sure we both want what’s best for Tony, and I definitely want to be sure I’m making the instructional decisions that will be best for him.
“The way I use my classroom tests is to try to arrive at the most accurate conclusion I can about how well my students have mastered the skills and knowledge I’m trying to teach them. It’s very important, then, that the conclusions I reach about students’ skill levels are valid. So, every year I devote systematic attention to students’ improvement on each of my major exams. You’ll remember that it was on the last two of these exams that Tony scored so badly.
“What I’d like to do is show you the actual exams I’ve been giving my students and the data I use each year to improve those exams. Why don’t we set up an after-school or, if necessary, an evening appointment for you and your wife to look over my classroom assessments? I’d like to show both of you the evidence that I’ve been compiling over the years to make sure my tests help me make valid inferences about what Tony and the rest of my students are learning.”
Now, how would you respond to this parent?
Item Analysis for Criterion-Referenced Measurement
When teachers use tests intended to yield criterion-referenced interpretations, such teachers typically want most students to score well on their tests after instruction has occurred. In such instances, because post-instruction p values may approach 1.0, traditional item-analysis approaches often yield low discrimination indices. Accordingly, several alternative approaches to item analysis for criterion-referenced measurement have been devised in recent years. (As you are reading about these approaches, please try not to be put off by the frequent use of off-putting subscripts such as $p_{\text{post}}$, that is, the proportion of students answering the item correctly on the posttest. What’s going on in these instances is that some truly simple descriptive labels are being represented by usually tiny, often slanting italicized letters. Just say the words aloud, slowly—while thinking about what the words actually mean—and you’ll usually find that those words are simply sloshing in common sense.)
Two general item-analysis schemes have been described thus far, depending on the kinds of student groups available. Both of these item-analysis schemes are roughly comparable to the item-discrimination indices used with tests that yield norm-referenced inferences. The first approach involves administration of the test to the same group of students prior to and following instruction. A disadvantage of this approach is that the teacher must wait for instruction to be completed before securing the item-analysis data. Another problem is that the pretest may be reactive, in the sense that its administration sensitizes students to certain items such that students’ posttest performances are actually a function of the instruction plus the pretest’s administration.
Using the strategy of testing the same groups of students prior to and after instruction, we can employ an item-discrimination index calculated as follows:
$$D_{\text{ppd}} = p_{\text{post}} - p_{\text{pre}}$$

where
$p_{\text{post}}$ = proportion of students answering the item correctly on posttest
$p_{\text{pre}}$ = proportion of students answering the item correctly on pretest

The value of $D_{\text{ppd}}$ (discrimination based on the pretest–posttest difference) can range from −1.00 to +1.00, with high positive values indicating that an item is sensitive to instruction.
For example, if 41 percent of the students answered item 27 correctly in the pretest, and 84 percent answered it correctly on the posttest, then item 27’s $D_{\text{ppd}}$ would be .84 − .41 = .43. A high positive value would indicate that the item is sensitive to the instructional program you’ve provided to your students. Items with low or negative $D_{\text{ppd}}$ values would be earmarked for further analysis because such items are not behaving the way one would expect them to behave if instruction were effective. (It is always possible, particularly if many items fail to reflect large posttest-minus-pretest differences, that the instruction being provided was not so spectacularly wonderful.)
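The pretest–posttest index is simple enough to compute by hand, but a short script can help when a whole test is being screened. The following is a hedged sketch only: the item 27 figures come from the example above, whereas items 31 and 32, the function names, and the .10 review cutoff are hypothetical assumptions introduced for illustration. The chapter itself supplies no numeric cutoff for a "low" value.

```python
def d_ppd(p_pre: float, p_post: float) -> float:
    """Pretest-posttest difference index: posttest proportion minus pretest proportion."""
    return p_post - p_pre


# Assumed cutoff for "low" values; the chapter gives no numeric threshold.
LOW_CUTOFF = 0.10


def needs_review(p_pre: float, p_post: float, cutoff: float = LOW_CUTOFF) -> bool:
    """True when the index is negative or below the assumed cutoff."""
    return d_ppd(p_pre, p_post) < cutoff


# item number -> (pretest proportion correct, posttest proportion correct)
# Item 27 is the chapter's example; items 31 and 32 are made-up illustrations.
items = {27: (0.41, 0.84), 31: (0.60, 0.62), 32: (0.55, 0.48)}

for number, (pre, post) in items.items():
    print(number, round(d_ppd(pre, post), 2), needs_review(pre, post))
```

Under these assumptions, item 27 passes, while the two made-up items are earmarked for a closer look. Whether the fault lies in the items or in the instruction is still a matter for human judgment.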
The second approach to item analysis for criterion-referenced measurement is to locate two different groups of students, one of which has already been instructed and one of which has not. By comparing the performances on items of instructed and uninstructed students, you can often pick up some useful clues regarding item quality. This approach has the advantage of avoiding the delay associated with pretesting and posttesting the same group of students, and it also avoids the possibility of a reactive pretest (a pretest that gives away answers to some of the upcoming posttest items). Its drawback, however, is that you must rely on human judgment in the selection of the “instructed” and “uninstructed” groups. The two groups should be fairly identical in all other relevant respects (for example, in intellectual ability) but different with respect to whether or not they have been instructed on the content being assessed. The isolation of two such groups sounds easier than it usually is. Your best bet would be to prevail on a fellow teacher whose students are studying different topics than yours.
If you use two groups—that is, an instructed group and an uninstructed group—one of the more straightforward item-discrimination indices is $D_{\text{uigd}}$ (discrimination based on uninstructed versus instructed group differences). This index is calculated as follows:

$$D_{\text{uigd}} = p_{i} - p_{u}$$

where
$p_{i}$ = proportion of instructed students answering an item correctly
$p_{u}$ = proportion of uninstructed students answering an item correctly
The $D_{\text{uigd}}$ index can also range in value from −1.00 to +1.00. To illustrate its computation, if an instructed group of students scored 91 percent correct on a particular item, while the same item was answered correctly by only 55 percent of an uninstructed group, then $D_{\text{uigd}}$ would be .91 − .55 = .36. Interpretations of $D_{\text{uigd}}$ are similar to those used with $D_{\text{ppd}}$.
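When item responses are available as raw right/wrong records rather than as percentages, the same index can be computed directly from the two groups' responses. This is a minimal sketch under stated assumptions: the two classes of 100 students each are hypothetical, sized only so that they reproduce the 91 percent and 55 percent figures in the illustration above, and the function names are not from the chapter.

```python
from typing import List


def proportion_correct(responses: List[int]) -> float:
    """responses holds 1 for a correct answer and 0 for an incorrect one."""
    return sum(responses) / len(responses)


def d_uigd(instructed: List[int], uninstructed: List[int]) -> float:
    """Instructed-group proportion correct minus uninstructed-group proportion correct."""
    return proportion_correct(instructed) - proportion_correct(uninstructed)


# Hypothetical groups of 100 students each, matching the text's 91% and 55%.
instructed = [1] * 91 + [0] * 9
uninstructed = [1] * 55 + [0] * 45

print(round(d_uigd(instructed, uninstructed), 2))
```

The two groups need not be the same size, because each proportion is computed within its own group before the subtraction.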
As suggested earlier, clearly there are advantages associated with using both judgmental and empirical approaches to improving your classroom-assessment procedures. Practically speaking, classroom teachers have only so much energy to expend. If you can spare a bit of your allotted energy and/or time to buff up your assessment instruments, you’ll usually see meaningful differences in the quality of those instruments.
But What Does This Have to Do with Teaching?
Because of the demands on most teachers’ time, the typical teacher really doesn’t have time to engage in unbridled test polishing. So, as a practical matter, very few teachers spend any time trying to improve their classroom assessments. It is all too understandable.
But there are two approaches to item improvement described in this chapter, one of which, a judgmentally based strategy, doesn’t burn up much time at all. It’s a judgmental approach to item improvement that represents a realistic item-improvement model for most classroom teachers.
Whereas teachers should understand the basics of how a test can be made better by using empirical data based on students’ responses, they’ll rarely be inclined to use such data-based strategies for improving their items unless a test is thought to be a super-important one.
During the 29 years that I taught courses in the UCLA Graduate School of Education, the only tests that I ever subjected to empirical item-improvement analyses were my midterm and final exams in courses I taught year after year. Oh, to be sure, I applied judgmental improvement techniques to most of my other tests, but the only tests I did a full-throttle, data-based improvement job on were the one or two most important tests for an oft-taught course.
My fear is that the apparent complexity of empirical item-improvement efforts might even dissuade you from routinely employing judgmental item-improvement procedures. Don’t let it. Even if you never use students’ data to sharpen your classroom tests, those tests will almost surely still need sharpening. Judgmental approaches will definitely help do the improvement job.
What Do Classroom Teachers Really Need to Know About Improving Their Assessments?
You ought to know that teacher-made tests can be improved as a consequence of judgmental and/or empirical improvement procedures. Judgmental approaches work well with either selected-response or constructed-response test items. Empirical item improvements have been used chiefly with selected-response tests and typically are more readily employed with such test items. You should realize that most of the more widely used indices of item quality, such as discrimination indices, are intended for use with test items in norm-referenced assessment approaches, and such indices may have less applicability to your tests if you employ criterion-referenced measurement strategies. You must also understand that because educators have far less experience in using (and improving) performance assessments and portfolio assessments, there isn’t really a delightful set of improvement procedures available for those assessment strategies—other than good, solid judgment.
One issue not seriously addressed in the chapter revolves around the question, How do you fix a guilty item? Clearly, because the range of what’s potentially rancid about items can be wide indeed, it would be foolish to toss out a list of “Do this when an item is flawed.” Flawed items can vary immensely, and so can the context of the settings in which they are used.
For example, one of the frequent shortcomings in a teacher-made test arises when a teacher employs a pronoun in an item that, because of the way the item is phrased, defies a student’s ability to discern to whom the offending pronoun refers. Consider the following true–false item, and then indicate whether—even if a student has clearly understood the meaning of a teacher-selected essay—the italicized statement reeks of truth or falsity:
TRUE OR FALSE: In the essay you just read, Felipe’s treatment of Jorge, because of his underlying timidity, resulted in silence.
Now, if Felipe and Jorge were both described in the essay as timid, then because of the item’s faulty pronoun referent, the student can’t tell which of the lads remained silent. Thus, an editing fix is clearly in order, such as:
TRUE OR FALSE: In the essay you just read, Jorge’s underlying timidity led to his silence.
Similarly, if teachers are constantly attentive to potential communicative shortcomings in their tests’ items, then they’ll be more likely to spot loser items in need of first aid.
Your first analysis of what might be done to improve an item will often be your best bet. It is also possible to enlist the aid of a colleague to help you figure out what might be wrong with an item and how best to obliterate that deficit. The chief thing you have going for you is that you definitely believe there is something decidedly tawdry about the item, and this is where you use your wits to dispel some of this shortcoming.
New-Age Assessment Items
The more frequently that most teachers observe the evolving nature of recent educational testing, the greater the likelihood that such teachers—especially those brimming with conscientiousness—are seriously intimidated by a type of education testing of which they know little. More specifically, the focus of their concern is with what is usually described these days as digital, electronic, or (for someone of my age) new-fangled assessment.
That’s right; many of today’s seasoned teachers are understandably worried about being obliged to employ novel assessment tools about which they have received scant preparation. Given the increasingly important role of evaluation, both for groups of teachers (such as when a school board’s members appraise the quality of their district educators’ effectiveness) and for individual educators (such as when administrators advance, retain, or release a teacher), the collection of evaluative evidence via unfamiliar sorts of testing instruments can be intimidating to the person using them. And, to many of today’s educators, the emerging array of electronically controlled testing instruments can seem not only unfamiliar but downright off-putting.
But rather than being overwhelmed by the unfamiliarity of today’s emerging digital tests, teachers need to remember the fundamental mission of educational assessment—whether it is traditional Same-Old, Same-Old or futuristic—we’re using tests to collect evidence from a student allowing us to make warranted inferences about that student. More often than not, the chief contribution of novel electronic measurement is its conveyance capacities, that is, its efficiency in presenting cognitive challenges to students that can elicit the evidence that we then use to draw inferences about student test-takers.
In numerous instances, the test items spawned because of some measurement professional’s infatuation with computer-abetted assessments will have been generated by the same outfit that published a test. Care must be taken to ensure that those assessment specialists expended as much attention on the test item’s quality as they did on the electronic wizardry of its delivery system. Think back, please, to earlier chapters of this book where the essentials of good test development were touted for diverse sorts of items. Well, even if a new-age electronic delivery system is being employed rather than traditional paper-and-pencil test booklets, avoid being so seduced by a much-lauded digital delivery system that insufficient attention is given to validity, reliability, and assessment fairness. Good tests lead to better instructional decisions. And those improved decisions lead to better educated children.
Although it may be tempting for teachers to jettison anything not readily comprehensible, and there are numerous high-tech approaches on hand these days for building assessments or for analyzing students’ performances, most of the recently developed digital assessment exotica can be explained to educators who may be called on to do similar explanations for students’ parents and for students themselves. Assessment tools, including “new-age” assessments, computer-delivered administrations, and analyses, should be, at bottom, clearly comprehensible. The task for instructors in educational assessment courses is to locate a colleague or a willing student who understands this electronic content and who can explain the essentials of its operations to colleagues, parents, or students. Remember, what is being sought is someone’s acquisition of the chief features of today’s electronic assessment. The aim is to promote greater numbers of individuals who can truly understand the main story of how digital tactics can make assessment not only better, but more manageable.
Chapter Summary
This chapter focused on two strategies for improving assessment procedures. It was suggested that, with adequate time and motivation, teachers can use judgmental and/or empirical methods of improving their assessments.
Judgmentally based improvement procedures were described for use by teachers, teachers’ colleagues, and students. Five review criteria for evaluating assessment procedures were presented for use by teachers and their colleagues: (1) adherence to item-specific guidelines and general item-writing commandments, (2) contribution to score-based inferences for tests’ intended purpose(s), (3) accuracy of content, (4) absence of gaps in content coverage, and (5) fairness. A set of possible questions to ask students about test items was also provided.
Two empirical item-improvement indices described in the chapter were p values and item-discrimination indices. A step-by-step procedure for determining item-discrimination values was described. Designed chiefly for use with norm-referenced measurement, item-discrimination indices do not function well when large numbers of students respond correctly (or incorrectly) to the items involved. Distractor analyses have proven highly useful in the improvement of multiple-choice items because the effectiveness of each item’s alternatives can be studied. Finally, two indices of item quality for criterion-referenced measurement were described. Although roughly comparable to the kind of item-discrimination indices widely used with norm-referenced measurement, the two indices for criterion-referenced assessment can be used in settings where many effectively taught students perform well on examinations.
References
Brookhart, S. (2023). Classroom assessment essentials. ASCD.
Ebel, R. L., & Frisbie, D. A. (1991). Essentials of educational measurement (5th ed.). Prentice-Hall.
Hartati, N., & Yogi, H. P. S. (2019, September). Item analysis for a better quality test. English Language in Focus (ELIF), 2(1), 59–70. https://doi.org/10.24853/elif.2.1.59-70
Lane, S., Raymond, M. R., & Haladyna, T. M. (Eds.). (2010). Handbook of test development (2nd ed.). Routledge.
Miller, M. D., & Linn, R. (2013). Measurement and assessment in teaching (11th ed.). Pearson.
Popham, J. W. (2018, September 19). Assessment literacy: The most cost-effective cure for our schools. Teaching Channel. Retrieved September 20, 2022, from https://www.teachingchannel.com/blog/assessment-literacy
Rein, L. M. (2019). Evaluating the assessment qualities of teacher-created tests. In Handbook of research on assessment literacy and teacher-made testing in the language classroom (pp. 263–282). IGI Global. https://doi.org/10.4018/978-1-5225-6986-2.ch013
Waugh, C. K., & Gronlund, N. (2012). Measurement and assessment in teaching (11th ed.). Pearson.
A Testing Takeaway
Improving Teachers’ Test Items: The Oft-Misunderstood p Value*
W. James Popham, University of California, Los Angeles
Perhaps the most frequent empirical indicator of an item’s performance is the p value. Frequently referred to as an item’s “difficulty,” a p value simply indicates what percent of the students who responded to an item answered it correctly. Thus, if an item were answered correctly by 82 percent of test-takers, then this item would have a p value of .82.
Where many educators get in trouble, however, is when they think of items with a high p value as “easy” and of items with a low p value as “hard.” Yet, an item with a low p value of, say, .30, might be associated with a topic that hasn’t been taught yet or was taught less than wonderfully. It is more sensible to consider an item’s p value only in the context of previous instruction regarding the item’s content. We want well-taught students earning huge hunks of high p values on intrinsically challenging items.
Teachers who construct their own classroom tests typically do so with the best of intentions. Indeed, most teachers aspire to create assessments containing only fabulous, flaw-free items. But, of course, despite such laudable aspirations, some teacher-made tests include items exhibiting serious shortcomings. Defective items, obviously, reduce a test’s ability to provide an accurate estimate of a test-taker’s status regarding whatever’s being assessed—with a resultant distortion in score-based interpretations regarding what a test-taker knows and can do.
Fortunately, through the years, assessment specialists have devised a winning two-part strategy to improve test items—a strategy that can be successfully employed with both teacher-made items and items in the standardized tests administered to thousands of students. The strategy’s two item-improvement components are judgmental and empirical.
Judgmental item improvement calls for items to be reviewed, one item at a time, according to a set of evaluative guidelines painstakingly generated over the years regarding how best to construct specific sorts of items such as short-answer questions. Items that don’t adhere to those guidelines can be improved.
Empirical item improvement is based on the way a test’s items function when they are used in an actual examination. Statistical analyses of the items provide such evidence as whether certain items seem to be performing too differently than other items intended to measure the same skill or body of knowledge. Clearly, the more students who complete a test, the more confidence we have in such analyses.
Understandably, classroom teachers use judgmental improvement techniques. In contrast, builders of large-scale tests lean more heavily, but not exclusively, on empirical item improvement.
(Once you grasp fully the meaning of a p value, try to include it casually during an informal chat with colleagues. Its impact is often profound!)
*From Chapter 11 of Classroom Assessment: What Teachers Need to Know, 10th ed., by W. James Popham. Copyright 2022 by Pearson, which hereby grants permission for the reproduction and distribution of this Testing Takeaway, with proper attribution, for the intent of increasing assessment literacy. A digitally shareable version is available from https://www.pearson.com/store/en-us/pearsonplus/login.