Education EDU530 Week 6 assignment
Chapter 5
Fairness
Chief Chapter Outcome
An understanding of the nature of assessment bias and the procedures by which it can be reduced in both large-scale tests and classroom assessments
Learning Objectives
5.1 Demonstrate conceptual knowledge of fairness and assessment bias by identifying biased assessment designs.
5.2 Identify the educational needs of students with disabilities and English language learners in regard to the prevention of assessment bias and including federal protective legislation.
The last of the “big three” criteria for evaluating educational assessment devices is, in sharp contrast to reliability and validity, a relatively new kid on the block. During the past couple of decades, educators have increasingly recognized that the tests they use are often biased against particular groups of students. As a consequence, students in those groups do not perform as well on a test, not because the students are less able but because there are features in the test that distort the nature of the students’ performances. Even if you had never read a single word about educational measurement, you would quickly realize that such a test would be unfair. The third major factor to be used in judging the caliber of educational assessments is fairness and, as just pointed out, its arrival on the test-quality scene is far more recent than that of its two time-honored evaluative cousins, namely, reliability and validity.
Fairness in educational testing, however, has no single, technically sanctioned meaning. This is because the term fairness is used by educators—and noneducators alike—in many different ways. For example, educators might regard fairness in testing as a desirable aspiration in general, and yet regard a particular testing program as flagrantly unfair. Authors of the 2014 Standards for Educational and Psychological Testing (American Educational Research Association, 2014) limit their focus to fairness related to the aspects of testing that are the responsibility of those who develop, use, and interpret the results of educational tests. Moreover, the Standards restrict their attention to those assessment considerations about which there is general professional and technical agreement. Accordingly, in the remainder of this chapter of Classroom Assessment, we’ll be considering fairness in testing against a backdrop of what’s recommended in the 2014 Standards.
More specifically, we’ll first be looking at a potential source of unfairness that can arise when assessment bias is present in a test. Later, we’ll consider the kinds of unfairness that can slither onto the scene when we try to assess students with disabilities or to measure the status of students whose first language is not English. Less than high-quality educational assessment of such students almost always leads to test-based interpretations—and subsequent interpretation-based educational decisions—that frequently are far from fair.
At its most basic level, fairness in educational testing is a validity issue. As you saw in the previous chapter, validity refers to the accuracy of score-based interpretations about test-takers (for intended uses of those scores). What happens to validity, however, if those interpretations are messed up because our tests contain serious assessment bias, or if we failed to accurately assess English language learners or students with disabilities because of those students’ atypicalities? Well, it is almost certain that the usage-focused interpretations we make based on students’ test scores are going to be inaccurate—that is, invalid. Thus, in almost all situations, fairness in testing should be routinely regarded as a requisite precursor to assessment validity; in other words, it should function as a necessary—but not a sufficient—condition for assessment validity.
The Nature of Assessment Bias
Assessment bias refers to qualities of an assessment instrument that offend or unfairly penalize a group of students because of their gender, race, ethnicity, socioeconomic status, religion, or other such group-defining characteristics. Most teachers are sufficiently familiar with the idea of test bias these days that they recognize what’s being talked about when someone says, “That test was biased.” In this chapter, however, we’ll take a deeper look at what test bias is—and what it isn’t. Moreover, we’ll consider some procedures that can help classroom teachers recognize whether bias is present in assessment procedures and, as a consequence, can help reduce or eradicate such bias in teachers’ own classroom tests. Finally, we’ll tackle a particularly difficult issue—namely, how to reduce bias when assessing students with disabilities or students who are English language learners.
Because the other two criteria for evaluating educational tests—reliability and validity—are “good things,” we’re going to use the expression absence-of-bias to describe this evaluative criterion. When bias is absent in tests, this is also a “good thing.” Before looking at procedures to promote the absence-of-bias in educational assessment procedures, let’s first look more directly at what we’re trying to eliminate. In other words, let’s look at assessment bias itself.
As indicated earlier, assessment bias is present when there are elements in an assessment procedure that distort a student’s performance merely because of the student’s personal characteristics, such as gender, ethnicity, and so on. Most of the time, we think about distortions that tend to lower the scores of students because those students are members of certain subgroups. For example, suppose officials in State X installed a high-stakes mathematics test that must be passed before students receive an “honors” diploma. If the mathematics test contained many word problems based on certain competitive sports with which boys were more familiar than girls, then girls might perform less well on the test than boys. The lower performance by girls would not occur because girls were less skilled in mathematics, but because they were less familiar with the sports contexts in which the word problems were embedded. This mathematics test, therefore, appears to be biased against girls. On the other hand, it could also be said that the mathematics test is biased in favor of boys. In the case of assessment bias, what we’re worried about is distortions of students’ test performances—either unwarranted increases or unwarranted decreases.
Let’s try another illustration. Suppose that a scholastic aptitude test featured test items based on five lengthy reading selections, two of which were based on content more likely to be known to members of a particular religious group. If, because of familiarity with the content of the two selections, members of that religious group outperformed others, this would be a distortion in their test performances. It would constitute an assessment bias in their favor.
If you think carefully about the issue we’re considering, you’ll realize that unfairness of testing and assessment bias interfere with the validity of the score-based inferences we draw from our assessment procedures. But because assessment bias constitutes such a distinctive threat to the validity of test-based interpretations, and a threat that can be directly addressed, it is usually regarded as a separate criterion when evaluating educational tests. To the extent that bias is present, for the groups of students against whom the test is biased, valid score-based interpretations are not apt to be forthcoming.
Let’s look, now, at two forms of assessment bias—that is, two ways in which the test performances of individuals in particular groups can be distorted. It was noted earlier in the chapter that assessment bias is present when an educational assessment procedure either offends or unfairly penalizes students because of their membership in a gender, racial, ethnic, religious, or similar subgroup. We’ll now consider both of these forms of assessment bias.
Offensiveness
An assessment procedure is biased if its content (for example, its collection of items) is offensive to a subgroup of students. Such offensiveness often occurs when negative stereotypes of certain subgroup members are presented in a test. For instance, suppose that in all of an exam’s items we saw males portrayed in high-paying and prestigious positions (e.g., attorneys and physicians), and we saw women portrayed in low-paying and less prestigious positions (e.g., housewives and clerks). Because at least some female students will, quite appropriately, be offended by this gender inequality, their resultant distress may lead to less than optimal performances on the exam and, as a consequence, to scores that do not accurately represent their capabilities. (Angry people aren’t the best test-takers.)
Other kinds of offensive content include slurs, blatant or implied, based on stereotypic negative attitudes about how members of certain ethnic or religious groups behave. This kind of offensive content, of course, can distract students in the offended group so that they end up focusing more on the offensiveness of a given item than on accurately displaying their own abilities on that item and subsequent items.
Although most of the nation’s major test-development agencies now employ item-writing and item-review procedures designed to eliminate such offensive content, one encounters far too many examples of offensive content in less carefully developed teacher-made tests. It is not that teachers deliberately set out to offend any of their students. Rather, many teachers have simply not thought seriously about whether the content of their classroom assessment procedures might cause distress for any of their students. Later in the chapter, we’ll consider some practical ways of eradicating offensive content in teachers’ classroom assessment procedures.
Unfair Penalization
A second factor contributing to assessment bias is content in a test that may cause unfair penalization for a student based on the student’s ethnicity, gender, race, religion, and so on. Let’s consider for a moment what an “unfair” penalty really is.
Unfair penalization arises when a student’s test performance is distorted because of content that, although not offensive, disadvantages the student because of the student’s group membership. The previously cited example about girls’ unfamiliarity with certain competitive sports is an illustration of such an unfair penalty. As another example, suppose an assessment procedure includes content apt to be known only to children from affluent families. Let’s say an assessment procedure is installed to see how well students can “collaboratively solve problems” in groups, and the assessment activity culminates when the teacher gives students a new problem to analyze collaboratively during a group discussion. If the content of the new problem deals with a series of locally presented operas and symphonies likely to have been attended only by those students whose parents can afford such performances’ hefty ticket prices, then students from less affluent families will be unfairly penalized because there was probably much less dinner conversation about the unseen local operas and symphonies. Students from lower socioeconomic strata might perform less well on the “collaborative problem solving” assessment, not because they are less skilled at collaborative problem solving but because they are unfamiliar with the content of the particular assessment procedure being employed to gauge students’ collaborative problem-solving skills.
Some penalties, of course, are as fair as they can be. If you’re a classroom teacher who’s generated a test on a unit you’ve been teaching, and some students perform poorly simply because they didn’t pay attention in class or didn’t do their homework assignments, then penalizing their lousy performance on your test is eminently fair. Their “penalty,” in fact, simply reeks of righteousness. Poor performances on an assessment procedure do not necessarily mean that the resultant penalty to the student (such as a low grade) is unfair. Many of the students I taught in high school and college richly deserved the low grades I gave them.
Unfair penalization arises only when it is not the student’s ability that leads to poor performance but, rather, the student’s group membership. If, because of the student’s gender, ethnicity, and so on, the content of the test distorts how the student would otherwise have performed, then a solid case of assessment bias is at hand. As was true in the case of offensiveness of test content, we sometimes find classroom teachers who haven’t thought carefully about whether their assessment procedures’ content gives all students an equal chance to perform well. We’ll soon consider some ways for classroom teachers to ferret out test content that might unfairly penalize certain of their students.
Decision Time
Choose Your Language, Children!
Jaime Jemez teaches mathematics in Wilson Junior High School. He has done so for the past 10 years and is generally regarded as an excellent instructor. In the past few years, however, the student population at Wilson Junior High has been undergoing some dramatic shifts. Whereas there were relatively few Hispanic students in the past, almost 40 percent of the student population is now Hispanic. Moreover, almost half of those students arrived in the United States less than 2 years ago. Most of the newly arrived students in Jaime’s classes came from Mexico and Central America.
Because English is a second language for most of these Hispanic students, and because many of the newly arrived students still have a difficult time with written English, Jaime has been wondering whether his mathematics examinations are biased against many of the Hispanic students in his classes.
Because Jaime reads and writes both English and Spanish fluently, he has been considering whether he should provide Spanish-language versions of his more important examinations. Although such a decision would result in some extra work for him, Jaime believes that language-choice tests would reduce the bias in his assessment procedures for students whose initial language was Spanish.
Jaime is also aware that the demographics of the district are still shifting and that an increasing number of Southeast Asian students are beginning to show up in his classes. He is afraid that if he starts providing Spanish-language versions of his tests, perhaps he will need to provide the tests in other languages as well. Regrettably, however, he is fluent only in Spanish and English.
If you were Jaime, what would your decision be?
Disparate Impact and Assessment Bias
It is sometimes the case that when a statewide, high-stakes test is originally installed, it has disparate impacts on different ethnic groups. For example, Black students or Hispanic American students may perform less well than non-Hispanic White students. Does this disparate impact mean the test is biased? Not necessarily. However, if a test has a disparate impact on members of a specific racial, gender, or religious subgroup, this disparate impact certainly warrants further scrutiny to see whether the test is actually biased. Disparate impact does not equal assessment bias.
Let me tell you about an experience I had in the late 1970s that drove this point home for me so I never forgot it. I was working in a southern state to help its state officials develop a series of statewide tests that, ultimately, would be used to deny diplomas to students who failed to pass the 12th-grade level of the test. Having field-tested all of the potential items for the tests on several thousand children in the state, we had assembled a committee of 25 teachers from throughout the state, roughly half of whom were Black, to review the field-test results. The function of the committee was to see whether there were any items that should be discarded. For a number of the items, the Black children in the statewide tryout had performed significantly less well than the other children. Yet, almost any time it was proposed that such items be jettisoned because of such a disparate impact, it was the Black teachers who said, “Our children need to know what’s in those items. They deal with important content. Don’t you dare toss those items out!”
Those Black teachers were saying, in no uncertain terms, that Black children, just like non-Black children, should know what was being tested by those items, even if a disparate impact was currently present. The teachers wanted to make sure that if Black youngsters could not currently perform well enough on those items, the presence of the items in the test would allow such deficits to be identified and, as a consequence, to be remediated. The teachers wanted to make sure that future disparate impacts would be eliminated.
If an assessment procedure leads to a differential impact on a particular group of students, then it is imperative to discern whether the assessment procedure is biased. But if the assessment procedure does not offend or unfairly penalize any student subgroups, then the most likely explanation of disparate impact would be shortcomings in instruction given to the low-performing group. In other words, there’s nothing wrong with the test, but there is something clearly wrong with the quality of instruction previously provided to the subgroup of students who scored poorly on the examination.
It is well known, of course, that genuine equality of education has not always been present in every locale. Thus, it should come as no surprise that some tests will, in particular settings, have a disparate impact on certain subgroups. But this does not necessarily indicate that the tests are biased. It may well be that the tests are helping identify prior inequities in instruction that, of course, should be ameliorated.
One of the most commonly encountered factors leading to students’ lower than anticipated performance on an exam, particularly a high-stakes exam, is that students have not had an adequate opportunity to learn. Opportunity to learn describes the extent to which students have had exposure to the instruction or information that gives them a reasonable chance to master whatever knowledge or skills are being assessed by a test. Opportunity to learn simply captures the common-sense notion that if students have not been taught to do something that’s to be tested, then those untaught students will surely not shine when the testing trumpet is blown.
A relatively recent illustration of a situation in which opportunity to learn was not present occurred when education officials in some states closed out the 2013–14 school year by administering newly developed assessments intended to measure students’ mastery of a new, often unfamiliar set of curricular aims—namely, the Common Core State Standards (CCSS). Yet, in at least certain of those states, most teachers had not received sufficient professional development dealing with the CCSS. Accordingly, many of those teachers were unable to teach about the new content standards with any instructional effectiveness. As a consequence, many of those states’ students ended up not having been taught about the CCSS and, not surprisingly, scored poorly on the new tests. Did those lower-than-foreseen scores reflect the presence of assessment bias on the new CCSS-focused tests? Of course not. What was going on, quite obviously, was not assessment bias but, rather, a clear-cut case of students’ trying to tackle a test’s targeted content without having been given an opportunity to learn. Teachers must be wary of the opportunity-to-learn issue whenever substantially different content is being assessed by any tests linked to high-stakes decisions.
Let’s conclude our consideration of assessment bias by dealing with two strategies for eliminating, or at least markedly reducing, the extent to which unsuitable content on assessment devices can distort certain students’ scores. Such distortions, of course, erode the validity of the score-based interpretations educators arrive at when they use such assessment devices. And rotten interpretations regarding students’ status, of course, usually lead to rotten decisions about those students.
Judgmental Approaches
One particularly useful way to identify any aspects of an assessment procedure that may be biased is to call on content-knowledgeable reviewers to judgmentally scrutinize the assessment procedure, item by item, to see whether any items offend or unfairly penalize certain subgroups of students. When any important, high-stakes test is developed these days, it is customary to have a bias review panel consider its items in order to isolate and discard any items that seem likely to contribute to assessment bias. After seeing how such a procedure is implemented for high-stakes tests, we’ll consider how classroom teachers can carry out the essential features of an analogous judgmental bias-detection review for their own assessment procedures.
Bias Review Panels
For high-stakes tests such as a state’s accountability exams, a bias review panel of, say, 15 to 25 reviewers is typically assembled. Each of the reviewers should be conversant with the content of the test being reviewed. For example, if a sixth-grade language arts accountability test were under development, the bias review committee might be made up mostly of sixth-grade teachers, along with a few district-level language arts curriculum specialists or university professors. The panel should be composed exclusively, or almost exclusively, of individuals from the subgroups who might be adversely impacted by the test. In other words, if the test is to be administered to minority students who are Asian American, Native American, Black, and Hispanic American, there should be representatives of each of those groups on the panel. There should also be a mix of male and female panelists so that they can attend to possible gender bias in items.
After the bias review panel is assembled, its members should be provided with a thorough orientation regarding the overall meaning of fairness in testing and, more specifically, assessment bias—perhaps along the lines of the explanations presented earlier in this chapter. Ideally, panelists should be given some guided practice in reviewing actual assessment items—some of which have been deliberately designed to illustrate specific deficits—for instance, content that is offensive or content that unfairly penalizes certain subgroups. Discussions of such illustrative items usually help clarify panel members’ understanding of assessment bias.
A Per-Item Absence-of-Bias Judgment
After the orientation, the bias review panelists should be asked to respond to a question such as the following for each of the test’s items:
An Illustrative Absence-of-Bias Question for Bias Review Panelists
Might this item offend or unfairly penalize any group of students on the basis of personal characteristics, such as gender, ethnicity, religion, or race?
Panelists are to respond yes or no to each item using this question. Note that the illustrative question does not ask panelists to judge whether an item would offend or unfairly penalize. Rather, the question asks whether an item might offend or unfairly penalize. In other words, if there’s a chance the item might be biased, then bias review panelists are to answer yes to the absence-of-bias question. Had the item-bias question used would instead of might, you can see that this would make it a less stringent form of per-item scrutiny.
The percentage of no judgments per item is then calculated so that an average per-item absence-of-bias index (for the whole panel) can be computed for each item and, thereafter, for the entire test. The greater the number of no judgments that panelists supply, the less bias they think is present in the items. If the test is still under development, items that are judged to be biased by several panelists are generally discarded. Because panelists are usually encouraged to indicate (via a brief written comment) why an item appears to be biased, an item may be discarded because of a single panelist’s judgment, if it is apparent the panelist spotted a bias-linked flaw that had been overlooked by other panelists.
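The arithmetic behind this index can be sketched in a few lines of Python. The sketch is illustrative only: the five-person panel, the function names, and the cutoff of more than one yes vote are assumptions introduced here, not values prescribed in this chapter.

```python
from statistics import mean

def per_item_index(judgments):
    """Return the percentage of 'no' judgments for one item.

    `judgments` holds one string per panelist: 'yes' (the item might
    offend or unfairly penalize) or 'no' (it would not).
    """
    no_count = sum(1 for j in judgments if j == "no")
    return 100.0 * no_count / len(judgments)

def review_summary(panel_judgments, max_yes=1):
    """Summarize a panel's item-by-item judgments.

    `panel_judgments` maps an item label to its list of judgments.
    `max_yes` is a hypothetical cutoff: items with more 'yes' votes
    than this are flagged as candidates for removal.
    """
    item_indexes = {item: per_item_index(j) for item, j in panel_judgments.items()}
    flagged = [item for item, j in panel_judgments.items() if j.count("yes") > max_yes]
    # The whole-test index is the average of the per-item indexes.
    test_index = mean(list(item_indexes.values()))
    return item_indexes, test_index, flagged

# Hypothetical five-person panel reviewing three items.
panel = {
    "item_1": ["no", "no", "no", "no", "no"],
    "item_2": ["no", "yes", "no", "no", "no"],
    "item_3": ["yes", "yes", "no", "yes", "no"],
}
indexes, test_index, flagged = review_summary(panel)
```

A count like this captures only the "several panelists" rule. Discarding an item on the strength of one panelist's written comment remains a judgment call that no tally can make.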
Because myriad judgmental reviews of potential bias in under-construction items take place each year, it would be patently foolish to assume that those reviews are equivalent in the levels of rigor they require items to experience. Simple changes in a few tiny words can transform a “tough as nails” item-bias review process into a “cuddly, all items are yummy” review procedure. The obvious solution to this reality is for the item-review orientation materials and all training materials to be read carefully, word-by-word, to make certain that the sought-for level of scrutiny is present. Words, particularly in this instance, matter.
An Overall Absence-of-Bias Judgment
Although having a bias review panel scrutinize the individual items in an assessment device is critical, there’s always the chance that the items in aggregate may be biased, particularly with respect to the degree to which they might offend certain students. For instance, as the bias review panelists consider the individual items, they may find few problematic items. Yet, when considered as a collectivity, the items may prove to be offensive. Suppose, for example, that in an extensive series of mathematics word problems, the individuals depicted were almost always females. Although this gender disproportion was not apparent from panelists’ reviews of individual items, it is apt to be discerned from panelists’ responses to an overall absence-of-bias question such as the following:
An Illustrative Overall Absence-of-Bias Question for Bias Review Panelists
Considering all of the items in the assessment device you just reviewed, do the items, taken as a whole, offend or unfairly penalize any group of students on the basis of personal characteristics, such as gender, ethnicity, religion, or race?
Similar to the item-by-item judgments of panelists, the percent of no responses by panelists to this overall absence-of-bias question can provide a useful index of whether the assessment procedure under review is biased. As with the judgments of individual items, bias review panelists who respond yes to an overall absence-of-bias question are usually asked to indicate why. Typically, the deficits identified in response to this overall question can be corrected rather readily. For instance, in the previous example about too many females being described in the mathematics word problems, all that needs to be done is a bit of gender reassignment (only verbally, of course).
In review, then, judgmental scrutiny of an assessment procedure’s items, separately or in toto, can prove effective in identifying items that contribute to assessment bias. Such items should be modified or eliminated. (Please note that your book’s genial author is once more striving to make his five years of studying Latin pay off. If I recall correctly, in toto either signifies “totally” or refers to the innards of a small dog named Toto. You choose.)
Empirical Approaches
If a high-stakes educational test is to be administered to a large number of students, it is typically possible to gather tryout evidence regarding the performances of different groups of students on individual items, and then review any items for which there are substantial disparities between the performances of different groups. There are a number of different technical procedures for identifying items on which subgroups perform differently. Generally, these procedures are characterized as differential item functioning (DIF) procedures, because the analyses are used to identify items that function differently for one group (for example, girls) than for another (for example, boys). Numerous DIF analysis procedures are now available for use in detecting potentially biased items.
Even after an item has been identified as a differentially functioning item, this does not automatically mean the item is biased. Recalling our discussion of the difference between assessment bias and disparate impact, it is still necessary to scrutinize a differentially functioning item to see whether it is biased or, instead, is detecting the effects of prior instructional inadequacies experienced by a particular group of students—such as insufficient opportunity to learn. In most of today’s large-scale test-development projects, even after all items have been subjected to the judgmental scrutiny of a bias review panel, items identified as functioning differentially are “flagged” for a second look by bias reviewers (for example, by an earlier bias review panel or by a different bias review panel). Those items that are, at this point, judgmentally identified as biased are typically excised from the test.
To reiterate, large numbers of students are required in order for empirical bias-reduction approaches to work. Several hundred responses per item for each subgroup under scrutiny are needed before a reasonably accurate estimate of an item’s differential functioning can be made.
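This chapter does not single out a particular DIF procedure. One widely used option is the Mantel-Haenszel approach, which compares two groups' odds of answering an item correctly after matching students on total test score. The Python sketch below illustrates that approach under stated assumptions: the function names, the data layout, and the tiny made-up data set are all hypothetical, and a real analysis would need far more responses per subgroup than appear here.

```python
import math
from collections import defaultdict

def mantel_haenszel_dif(records):
    """Estimate the Mantel-Haenszel common odds ratio for one item.

    `records` is an iterable of (group, total_score, correct) tuples:
      group       -- 'ref' (reference group) or 'focal' (focal group)
      total_score -- matching variable, e.g. the student's total score
      correct     -- True if the student answered this item correctly
    Returns (alpha, delta). alpha is the common odds ratio; delta is
    -2.35 * ln(alpha), a rescaling often reported alongside it.
    """
    # One 2x2 table per score level:
    # [ref right, ref wrong, focal right, focal wrong]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in records:
        offset = 0 if group == "ref" else 2
        tables[score][offset + (0 if correct else 1)] += 1

    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if numerator == 0 or denominator == 0:
        raise ValueError("Too little data to estimate the odds ratio.")
    alpha = numerator / denominator
    delta = -2.35 * math.log(alpha)
    return alpha, delta

def make_records(group, score, n_right, n_wrong):
    """Build made-up response records for one group at one score level."""
    return [(group, score, True)] * n_right + [(group, score, False)] * n_wrong

# Hypothetical tryout data at two matched score levels.
records = (
    make_records("ref", 1, 6, 4) + make_records("focal", 1, 3, 7)
    + make_records("ref", 2, 8, 2) + make_records("focal", 2, 5, 5)
)
alpha, delta = mantel_haenszel_dif(records)
```

An alpha above 1 (and so a negative delta) means that, among students with matched total scores, the reference group had higher odds of answering the item correctly. Such a result only flags the item for a second look by bias reviewers; it does not by itself show the item is biased.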
Bias Detection in the Classroom
If you’re an experienced teacher, or are preparing to be a teacher, what can you do about assessment bias in your own classroom? There is a deceptively simple but accurate answer to this question. The answer is: Become sensitive to the existence of assessment bias and the need to eliminate it.
Assessment bias can transform an otherwise fine educational test into one that falls short of what’s needed for fairness in testing. Assessment bias really can be a serious shortcoming of your own tests. As indicated earlier, a test that’s biased won’t allow you to make valid inferences about your students’ learning levels. And if your inferences are invalid, who cares whether you’re making those inferences reliably? Absence-of-bias, in short, is significant stuff.
Okay, let’s assume you’ve made a commitment to be sensitive to the possibility of bias in your own classroom assessments. How do you go about it? In general, it is usually recommended that you always review your own assessments, insofar as possible, from the perspective of the students you have in your classes—students whose experiences will frequently be decisively different from your own. Even if you’re a Hispanic American teacher, and you have a dozen Hispanic American students, this does not mean that their backgrounds parallel yours.
What you need to do is think seriously about the impact that differing experiential backgrounds will have on the way students respond to your classroom assessments. You’ll obviously not always be able to “get inside the heads” of the many different students you have in class, but you can still give it a serious try.
My first teaching job was in a rural high school in an eastern Oregon town with a population of 1500 people. I had grown up in a fairly large city and had no knowledge about farming or ranching. To me, a “range” was a gas or electric kitchen appliance on which one cooked meals. In retrospect, I’m certain that many of my classroom test items contained “city” content that might have confused my students. I’ll bet that many of my early test items unfairly penalized some of those boys and girls who had rarely, if ever, left their farms or ranches to visit more metropolitan areas. And, given the size of our tiny town, almost any place on earth was “more metropolitan.” But, regrettably, assessment bias was something I simply didn’t think about back then. I hope you will.
A similar “beware of differential affluence” problem lurks either when teachers purchase glitzy computer-controlled classroom assessments or, perhaps, when teachers infuse computer gyrations into their own teacher-made tests. Fundamentally, computer-governed testing of any sort is always accompanied by a lurking potential for biased assessment. Let’s concede that students from more affluent homes are likely to have cavorted with more sophisticated (and more costly) computers than students from less affluent homes. Ritzier and costlier computers can require more challenging per-item analyses than less costly computers. Thus, if a classroom test relies on esoteric computer machinations, we usually find that students who have been obliged to use less costly computers will be unfairly penalized by being asked to engage in digital dances with which they are unfamiliar.
And remember that a corollary of students’ differing levels of affluence is the difference in how readily students can acquire digital equipment containing the very latest “breakthrough” improvements. More parental dollars can often lead to the acquisition of the latest and best versions of school-related digital equipment. The student who has access to the latest school-relevant digital devices has a clear advantage over the student who must get by with only a self-cranked pencil sharpener.
Because some teachers are less familiar with some of the newer forms of assessments you’ll be reading about in later chapters (particularly performance assessment, portfolio assessment, computer-controlled assessment, and a number of informal assessment procedures to be used as part of the formative-assessment process), those types of assessments contain serious potential for assessment bias. To illustrate, most performance assessments call for students to respond to a fairly elaborate task of some sort (such as researching a social issue and presenting an oral report to the class). Given that these tasks sometimes require students to draw heavily on their own experiences, it is imperative to select tasks that present the cherished “level playing field” to all students. You’d certainly not wish to use a performance assessment that gave an up-front advantage to students from more glitzy families.
Then there’s the problem of how to evaluate students’ responses to performance assessments or how to judge the student-generated products in students’ portfolios. All too often we see teachers evaluate students’ responses on the basis of whether those responses displayed “good thinking ability.” What those teachers are doing, unfortunately, is creating their own, off-the-cuff version of a group intelligence test, an assessment approach now generally repudiated. Students’ responses to performance and portfolio assessments are too often judged on the basis of factors that are more background dependent than instructionally promotable. Assessment bias applies to all forms of testing, not merely to paper-and-pencil tests.
If you want to expunge bias from your own assessments, try to review every item in every assessment from the perspective of whether there is anything present in the item that might offend or unfairly penalize any of your students. If you ever find such potential bias in any of your items, then without delay patch them up or plow them under.
You’ll probably never be able to remove all bias from your classroom assessments. Biased content has a nasty habit of slipping past even the most bias-conscious classroom teacher. But if you identify 100 teachers who are seriously sensitized to the possibility of assessment bias in their own tests and 100 teachers who’ve not thought much about assessment bias, there’s no doubt regarding which 100 teachers will create less biased assessments. Such doubt-free judgment, of course, might be biased. But it isn’t!
Parent Talk
Assume that you are White and were raised in a middle-class environment. During your fourth year of teaching, you’ve been transferred to an inner-city elementary school in which you teach a class of 27 fifth-graders, half of whom are Black. Suppose that Mrs. Johnson, a mother of one of your students, telephones you after school to complain about her son’s test results in your class. Mrs. Johnson says, “George always got good grades until he came to your class. But on your tests, he always scores low. I think it’s because your tests are biased against Black children!”
If I were you, here’s how I’d respond to Mrs. Johnson:
“First, Mrs. Johnson, I really appreciate your calling about George’s progress. We both want the very best for him, and now that you’ve phoned, I hope we can work together in his best interest.
“About George’s test performance, I realize he hasn’t done all that well on most of the tests. But I’m afraid it isn’t because of test bias. Because this is my first year in this school, and because about half of my fifth-graders are Black, I was really worried that, as a White person, I might be developing tests that were biased against some of my students.
“So, during my first week on the job, I asked Ms. Fleming—another fifth-grade teacher who’s Black—if she’d review my tests for possible bias. She agreed to do so, and she’s been kind enough to review every single test item that I’ve used this year. I try to be careful myself about identifying biased items, and I’m getting pretty good at it, but Ms. Fleming helped me spot several items, particularly at the start of the year, that might have been biased against Black children. I removed or revised every one of those items.
“What I think is going on in George’s case, Mrs. Johnson, is related to the instruction he’s received in past years. I think George is a very bright child, but I know your family transferred to our district this year from another state. I really fear that where George went to school in the past, he may not have been provided with the building-block skills he needs for my class.
“What I’d like to do, if you’re willing, is to set up a conference either here at school or in your home, if that’s more convenient, to discuss the particular skills George is having difficulty with. I’m confident we can work out an instructional approach that, with your assistance, will help George acquire the skills he needs for our tests.”
Now, how would you respond to Mrs. Johnson?
Assessing Students with Disabilities and English Language Learners
Assessment bias can blossom in often unforeseen ways when teachers attempt to assess two distinctive categories of children, namely, students with disabilities and students who are English language learners. In the past, because these two groups of students often received the bulk of their instruction from educational specialists rather than regular classroom teachers, many teachers gave scant attention to the testing of such students. However, largely due to the enactment of far-reaching federal laws, today’s classroom teachers need to understand, at the very least, some fundamentals about how to assess children with disabilities, as well as English language learners. We’ll briefly consider potential assessment bias as it relates to both of these types of students.
Children with Disabilities and Federal Law
Fully two thirds of children with disabilities either have specific learning disabilities (almost 80 percent of which arise from students’ problems with reading) or impairments in speech or language. There are 13 categories of disability set forth in federal law, including such less-frequent disabilities as autism (2 percent of children with disabilities), emotional disturbance (8 percent), and those hard of hearing (1 percent).
Federal law, you see, has triggered substantially reconceptualized views about how to teach and to test students with disabilities. The Education for All Handicapped Children Act (Public Law 94-142) was enacted in 1975 because of (1) a growing recognition of the need to properly educate children who had disabilities and (2) a series of judicial rulings requiring states to provide a suitable education for students with disabilities if those states were providing one for students without disabilities. Usually referred to simply as “P.L. 94-142,” this influential law supplied federal funding to states, but only if those states appropriately educated children with disabilities.
Public Law 94-142 also mandated the use of an individualized education program (IEP) for a student with disabilities. An IEP is the federally prescribed document developed by parents, teachers, and specialized services providers (such as an audiologist) describing how a particular child with disabilities should be educated. An IEP represents a per-child plan that spells out annual curricular aims for the child, indicates how the student’s attainment of those goals will be determined, specifies whatever related services are needed by the child, and sets forth any adjustments or substitutions in the assessments to be used with the child. The architects of P.L. 94-142 believed it would encourage the education of many students with disabilities in regular classrooms to the greatest extent possible.
In 1990, P.L. 94-142 was renamed the Individuals with Disabilities Education Act (IDEA), and in 1997 the law was reauthorized. That reauthorization called for states and districts to identify curricular expectations for special education students that were as consonant as possible with the curricular expectations that had been established for all other students. Under IDEA, all states and districts were required to include students with disabilities in their assessment programs, and to publicly report those test results. However, because IDEA of 1997 contained few negative consequences for noncompliance, most of the nation’s educators simply failed to comply.
The special education landscape, however, was significantly altered when President George W. Bush affixed his signature to the No Child Left Behind Act (NCLB) in early 2002. Unlike IDEA of 1997, NCLB possessed significantly increased penalties for noncompliance. The act intended to improve the achievement levels of all students, including children with disabilities, and to do so in a way that such improvement would be demonstrated on state-chosen tests linked to each state’s curricular aims. Students’ test scores were to improve on a regular basis over a dozen years in order to reflect substantially improved student learning. Even though NCLB allowed states some flexibility in deciding how many more students must earn “proficient or higher” scores each year on that state’s accountability tests, if a school failed to make adequate yearly progress (AYP) based on the state’s 12-year improvement schedule, then the school was placed on a sanction-laden improvement track.
Importantly, NCLB required adequate yearly progress not only for students overall but also for subgroups reflecting race/ethnicity, the economically disadvantaged, students with limited English proficiency, and students with disabilities. Accordingly, if there were sufficient numbers of any of those subgroups in a school, and one of those subgroups failed to score high enough to make that subgroup’s AYP targets, then the whole school flopped on AYP. Obviously, the assessed performances of students with disabilities became far more significant to a school’s educators than they were before the enactment of NCLB. As today’s educators deal with the challenges of the 2015 Every Student Succeeds Act (ESSA), few of them look back at NCLB and regard it as a flaw-free, rousing success. Yet even fewer would deny NCLB credit as the federal statute that obliged the nation’s educators to look much harder at the caliber of education we provided to these special children.
In 2004, IDEA was reauthorized once again. At that juncture, however, there was already another significant federal statute on the books dealing with the assessment of students with disabilities, namely, NCLB. Federal lawmakers attempted to shape the reauthorized IDEA so it would be as consistent as possible with the assessment provisions of NCLB that had been enacted 2 years earlier. Even so, there were important procedural differences between IDEA and NCLB, though one could find few fundamental contradictions between those two laws. Remember that when P.L. 94-142 first scampered onto the scene, one of its most influential contributions was the IEP. Many teachers came to understand that an individualized education program was supposed to be just what it said it was: an individualized program for a particular student with disabilities. Thus, the push of IDEA has historically been directed toward an individualized approach to instruction and an individualized approach to assessing what a student has learned.
In contrast to the student-tailored, kid-by-kid orientation of IDEA, the accountability strategy embodied in NCLB and ESSA was far more group-based (and subgroup-based). A key congressional committee put it well when it observed, “IDEA and NCLB work in concert to ensure that students with disabilities are included in assessment and accountability systems. While IDEA focuses on the needs of the individual child, NCLB focuses on ensuring improved academic achievement for all students.”1
The most important thing teachers need to understand about the instruction and assessment of students with disabilities is that the education of all but a very small group of those children must be aimed at precisely the same curricular targets as the curricular targets teachers have for all other students. Both ESSA and IDEA require the same content standards for all children, with no exceptions! However, for students with the most severe cognitive disabilities, it is possible to use different definitions of achievement. Thus, this small group of students can use modified ways to display their performance related to the same content standards. These modifications take the form of alternate assessments. Such alternate assessments are to be used with only that super-small percentage of children who have the most severe cognitive disabilities. In other words, whereas the academic content standards must stay the same for all students, the academic achievement standards (performance levels) can differ substantively for children who have the most severe cognitive disabilities.
Because IEPs are central to the assessment of many children with disabilities, we need to deal with the widespread misperception that IEPs are written documents allowing the parents and teachers of children with disabilities to do some serious bar-lowering, but to do so behind closed doors. Some general education teachers are not knowledgeable about the IEP process. To certain teachers, IEPs are born in clandestine meetings in which biased individuals simply soften expectations so that children with disabilities need not pursue the same curricular aims as those sought for all other children. Perhaps such perceptions were warranted earlier on, but IEP regulations have been seriously modified in recent years. If those who create today’s IEPs deliberately aim at anything less than the same content standards being pursued by all students, such IEPs violate some big-stick federal statutes.
ACCOMMODATIONS
A prominent procedure to minimize assessment bias for students with disabilities is to employ assessment accommodations. An accommodation is a procedure or practice that permits students with disabilities to have equitable access to instruction and assessment. The mission of an assessment accommodation is to eliminate or reduce the inference-distorting effects of a student’s disabilities. It is not the purpose of accommodations to lower the aspirations educators have for students with disabilities. Typically, but not always, the accommodations provided to a student in an assessment setting are similar to the ones used during instruction for that student.
Although accommodations are designed to enhance fairness by providing students with opportunities to circumvent their disabilities, there’s a critical limitation on just how much fairness-induction can be provided. Here’s the rub: An assessment accommodation must not fundamentally alter the nature of the skills or knowledge being assessed. This can be illustrated with respect to the skill of reading. Teachers clearly want children to learn to read, whether those children have disabilities or not. A teacher, however, can’t simply look at a student and tell whether that student can read. That’s because the ability to read is a covert skill. To determine a child’s reading ability, teachers need to have a student engage in an overt act, such as silently reading a brief short story and then saying aloud (or in writing) what the story was about. A teacher then can, based on the child’s assessed overt performance, arrive at an inference about the child’s reading skills.
However, what if a teacher proposed to use an accommodation for a child with disabilities that called for a short story to be read aloud to the child? This would surely be a meaningful accommodation, and we can assume that the child would enjoy hearing a story read aloud. But suppose the child were then given a written or oral test about the content of the short story. Do you think the child’s overt responses to that test would allow a teacher to arrive at a valid inference about the child’s reading skills? Of course not. Teachers who devise assessment accommodations for children with disabilities must be vigilant to avoid distorting the essential nature of the skills or knowledge being measured.
ACCESSIBILITY
During the past several decades, accessibility has become a key concept of increasing relevance for assessing students with disabilities, as well as for measuring students with limited English proficiency. Accessibility is intended to minimize assessment bias and, therefore, increase fairness. Accessibility refers to the notion that all test-takers must have an unobstructed opportunity to demonstrate their status with respect to the construct(s) being measured by an educational test. If a student’s access to the assessed construct(s) is impeded by skills and/or characteristics unrelated to what’s being tested, this will limit the validity of score interpretations for particular uses. And those limitations in validity can affect both individuals and particular subgroups of test-takers.
It is generally agreed that because 1997’s Individuals with Disabilities Education Act (IDEA) called for students with disabilities to be given “access to the general curriculum,” educators of such children became sensitized to the concept of accessibility. Although the term access has, in years past, not been used in this particular way, assessment specialists who work with special populations of students have concluded that the concept of accessibility captures a factor that, for many students, plays a prominent role in determining the fairness of testing.
What does it mean to test developers or to classroom teachers that students, whether those students have disabilities or are learning English, must have access to the construct(s) being assessed? Well, for developers of major external exams, such as a state’s standardized accountability tests or a district’s high school graduation test, it means that the test’s developers must be continually on guard to make sure there are no features in a test’s design that might preclude any students from having a chance to “show what they know,” that is, to demonstrate how they stack up against the construct(s) being measured. And for a teacher’s classroom tests, it means the very same thing!
Put differently, both large-scale test developers and teachers who crank out their own tests must be constantly on guard to ensure that their assessments will provide total accessibility to each and every child. Given the wide range of student disabilities and the considerable variety of levels of students’ English proficiency encountered in today’s schools, this constant quest for accessibility often represents a genuine challenge for the test developers and teachers involved.
UNIVERSAL DESIGN
An increasingly applauded strategy for improving accessibility to educational tests is the process of universal design. Universal design describes an approach to the building of tests that from the very outset of the test-construction process attempts to maximize accessibility for all intended test-takers. In years past, tests were first built, and then, often as an afterthought, the developers tried to make the tests accessible to more students. The motivations for those sorts of after-the-fact accessibility enhancements were, of course, commendable. But after-the-fact attempts to change an already completed assessment instrument are almost certain to be less effective than one might wish. Indeed, accessibility of a test’s content to the test-taker has become an increasingly important concern to members of the measurement community, and especially to measurement folks who are devising assessments for groups of special education learners.
When approaching test development from a universal-design perspective, however, the maximizing of accessibility governs the test developer’s decisions from the get-go. Test items and tasks are designed with an overriding motive in mind: to minimize construct-irrelevant features that might reduce access for the complete array of potential test-takers, not merely the typical test-takers. Universal design constitutes an increasingly prevalent mind-set on the part of major test developers. It is a mind-set that classroom teachers can—and should—develop for themselves.
English Language Learners
And what about students whose first language is not English? In other words, how should classroom teachers assess such students? First off, we need to do a bit of term-defining, because there are a number of descriptive labels for such youngsters being used these days. English language learners (ELLs) represent a diverse, fast-growing student population in the United States and Canada. Included in the ELL group are (1) students whose first language is not English and who know little, if any, English; (2) students who are beginning to learn English but could benefit from school instruction; and (3) students who are proficient in English but need additional assistance in academic or social contexts. English language learners also include language-minority students (also called linguistic-minority students) who use a language besides English in the home. The federal government sometimes uses the term limited English proficient (LEP) to describe ELL students. One also encounters the label language learners. Currently, most people, because they believe it more politically correct, describe those students as “English language learners,” but it is equally accurate to employ the governmentally favored descriptor “limited English proficient.” The ELL label will be employed from here on in this book.
It was indicated earlier that ESSA is the current reauthorization of a landmark federal 1965 law, the Elementary and Secondary Education Act (ESEA). So, if you are interested in the assessment requirements of the law insofar as ELLs are concerned, be sure to be attentive to any ESSA adjustments in the regulations for implementing the reauthorized federal statute. Until changes take place, however, two federal requirements spell out the kinds of assessments to be used with language-minority students.
First, there is the requirement to determine the degree to which certain subgroups of students (specifically those subgroups identified in ESSA) are making adequate yearly progress. One of these groups consists of “students with limited English proficiency.” If there are sufficient numbers of ELL students in a school (or in a district) to yield “statistically reliable information,” then a designated proportion of those students must have earned proficient-or-above scores on a state’s accountability tests. If enough students don’t earn those scores, a school (or district) will be regarded as having failed to make its AYP targets.
Interestingly, because federal officials have allowed states some latitude in deciding on the minimum numbers of students (in a subgroup) needed in order to provide “statistically reliable information,” there is considerable variation from state to state in these subgroup minimum numbers. You will easily recognize that if a state had set its minimum numbers fairly high, then fewer schools and districts would be regarded as having failed AYP. This is because low-performing students in a particular subgroup need not be counted in AYP calculations if there were too few students (that is, below state-set minima) in a particular subgroup. Conversely, a state whose subgroup minima had been set fairly low would find more of its schools or districts flopping on AYP because of one or more subgroups’ shortcomings. Yet, because ESSA attempted to locate more of its high-stakes assessment decisions in states rather than in Washington, DC, this localization of such decision making has generally been approved.
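The effect of a state’s subgroup minimum can be seen in a small sketch. The code below is only an illustration under stated assumptions: the function name, the school’s enrollment figures, the 50 percent proficiency target, and the two minimum sizes (50 and 20) are all hypothetical, and treating “all students” as one more group with a single shared target is a simplification rather than a description of any state’s actual rules.

```python
def school_makes_ayp(subgroups, state_minimum, target):
    """Return True if every subgroup large enough to count meets the target.

    subgroups maps a subgroup name to a pair:
    (number of students, proportion scoring proficient or above).
    Subgroups smaller than state_minimum are left out of the calculation.
    """
    for n_students, proportion_proficient in subgroups.values():
        if n_students >= state_minimum and proportion_proficient < target:
            return False
    return True


# Hypothetical school: small ELL and disability subgroups score below target.
school = {
    "all students": (400, 0.62),
    "English language learners": (35, 0.41),
    "students with disabilities": (28, 0.38),
}
target = 0.50  # hypothetical proportion that must be proficient or above

print(school_makes_ayp(school, state_minimum=50, target=target))  # True
print(school_makes_ayp(school, state_minimum=20, target=target))  # False
```

The same school, with the same students and the same scores, makes AYP in a state with a high subgroup minimum and fails it in a state with a low one.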
Second, ESSA contains another assessment requirement related to ELL students. The law called for states to employ separate academic assessments of English language proficiency so that, on an annual basis, such tests can measure ELL students’ skills in “oral language, reading, and writing.” Because most states did not have such tests on hand (or, if they did, those tests usually measured beginning students’ social English skills rather than the students’ academic English skills), we have witnessed substantial test-development activity in individual states or, frequently, in test-development consortia involving several participating states.
The technical quality of these tests to assess students’ English language proficiency, of course, will be the chief determinant regarding whether valid inferences can be made regarding ELL students’ mastery of English. Clearly, considerable attention must be given to evaluating the caliber of this emerging collection of assessment instruments.
Lest you think that the assessment of ELL students, once addressed seriously because of federal requirements, is going to be easy to pull off, think again. Jamal Abedi, one of the nation’s leading authorities on the assessment of ELL students, reminded us about two decades ago of half a dozen issues that must be addressed if federal laws are going to accomplish for LEP students what those laws’ architects intended (Abedi, 2004). He pointed to the following concerns that, it surely seems, still need to be pointed to:
1. Inconsistency in ELL classification across and within states. A variety of different classification criteria are used to identify ELL students among states and even within districts and schools in a state, thus affecting the accuracy of AYP reporting for ELL students.
2. Sparse ELL populations. In a large number of states and districts there are too few ELL students to permit meaningful analyses, thus distorting state and federal policy decisions in such settings.
3. Subgroup instability. Because a student’s ELL status is not stable over time, a school’s ELL population represents a moving, potentially unattainable target. To illustrate, if a given ELL student’s mastery of English has improved sufficiently, that student is moved out of the ELL group, only to be replaced by a new student whose English proficiency is apt to be substantially lower. Therefore, even with superlative instruction, there is little chance of improving the AYP status of the ELL subgroup from year to year.
4. Technical quality of AYP-determining assessment tools. Because studies have shown that assessment instruments constructed for native English speakers have lower reliability and validity when used with ELL populations, results of such tests are likely to be misinterpreted when employed with ELL students.
5. Lower baseline scores. Because schools with large numbers of ELL students are likely to have lower baseline scores and thus may have improvement targets that will be unrealistically challenging, such schools may have difficulty in attaining state-set increments in students’ achievement levels.
6. Cut-scores for ELL students. Unlike the previous version of ESEA, in which a student’s weaker performance in one subject area could be compensated for by a higher performance in another subject (this is referred to as a compensatory model), federal statutes may require use of a conjunctive model in which students’ performances in all assessed subject areas must be acceptable. Students’ performances in mathematics are apt to be less dependent on language prowess than their performances in reading/language arts, where language demands are higher. This, then, renders the AYP expectations for schools with large ELL populations more challenging than those imposed on schools with fewer ELL students.
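The difference between the compensatory and conjunctive models in the sixth concern can be made concrete with a short sketch. Everything specific here is an assumption made for illustration: the function names, the use of a simple average as the compensatory rule, the single cut score of 65 applied to both subjects, and the student’s scores. Real accountability systems typically set subject-specific cut scores and may combine scores differently.

```python
def compensatory_pass(scores, cut_score):
    """Compensatory model: a high score in one subject can offset a low one.

    Sketched here as "the average across subjects meets the cut score".
    """
    return sum(scores.values()) / len(scores) >= cut_score


def conjunctive_pass(scores, cut_score):
    """Conjunctive model: the score in every subject must meet the cut score."""
    return all(score >= cut_score for score in scores.values())


# Hypothetical ELL student: strong in mathematics, weaker where the
# language demands are higher.
student = {"mathematics": 82, "reading/language arts": 58}
cut_score = 65

print(compensatory_pass(student, cut_score))  # True: the average is 70
print(conjunctive_pass(student, cut_score))   # False: 58 is below 65
```

Under these assumed numbers the same student counts as acceptable under the compensatory model but not under the conjunctive one, which is why a conjunctive requirement bears more heavily on schools with large ELL populations.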
Even with all these issues that must be considered when assessing students in the accountability structure of federal assessment-related statutes, Abedi (2004) believed the isolation of subgroup performances was “a step in the right direction.” He argued that state and federal officials appeared to be addressing many of these issues at that time. Abedi regarded resolving these problems as necessary if genuine fairness is to be attained in the assessment-based accountability focused on ELL students. And this quest means that teachers need to make some sensible judgments. It is decisively unsmart to force a child who speaks and reads only Spanish to take an English-language test about mathematics and then, because the child’s test score is low, to conclude that the child can’t do math.
The issue of whether teachers should try to make available versions of their tests in non-English languages is, unfortunately, often decided on totally practical grounds. In many settings, teachers may find a dozen or more first languages spoken by their students. In some urban settings, more than 100 first languages other than English are found in a district’s students. And if an attempt is made to create an alternate assessment only in the most common language a teacher’s students possess, how fair is this to those students whose first languages aren’t found in sufficient numbers to warrant the creation of a special test in their native language?
Assessment accommodations, in the form of giving language-minority students more time or the use of dictionaries, can, depending on students’ levels of English mastery, often help deal with the measurement of such students. In recent years, several of the larger test-development groups, particularly the Smarter Balanced Assessment Consortium, have been experimenting with supplying cartoon-like, pictorial illustrations of the meaning of key words in some test items. If this more generic approach is successful, it would help address the myriad non-English languages that require language-specific dictionaries or glossaries. Results of early experiments with this sort of pictorial-definition approach are too fragmentary, however, for the immediate application of the approach.
It is difficult to attend any conference on educational assessment these days without finding a number of sessions devoted to the measurement of language-minority students or students with disabilities. Clearly, an effort is being made to assess these children in a fashion that is not biased against them. But, just as clearly, the problems of measuring such students with fairness, compassion, and accuracy are nontrivial.
The accurate assessment of students whose first language is not English will continue to pose a serious problem for the nation’s teachers. To provide equivalent tests in English and all other first languages spoken by students represents an essentially impossible task. The policy issue now being hotly contested in the United States is whether to create equivalent tests for the most numerous of our schools’ language-minority students.
Is Fairness a Bona Fide Member of Educational Testing’s “Big Three”?
For roughly a century of serious-minded educational assessment, the roost was unarguably ruled by two powerful concepts, validity and reliability. But with the arrival of the 2014 AERA-APA-NCME Standards, educators were told to make room in their assessments for a third make-a-difference assessment concept, namely, fairness. This chapter has attempted to dig a bit into what’s really going on when we set out to show that an educational test is fair.
But adding a new test-evaluative factor to a pair of traditional evaluative factors is easier to say than to do. This is because educators can find it tempting to endorse the importance of assessment fairness, but then go about their testing of students as usual. Advocacy of fairness might easily become little more than lip-service. Assessment specialists, as well as everyday teachers, may toss out a few sentences in support of assessment fairness but, when the testing rubber really hits the road, slip back into their adulation of only reliability and validity.
If you find yourself committed, at least in spirit, to the importance of educational assessment’s fairness, how can you avoid treating test fairness in a tokenistic fashion? This should be, to you, an important question. Happily, the answer is a simple one. You can make fairness a bona fide member of educational measurement’s “big three” by simply collecting and presenting relevant evidence of assessment fairness.
Think back, if you will, to this book’s earlier treatments of both validity (Chapter 4) and reliability (Chapter 3). In both of those instances, the adequacy of assessment reliability and assessment validity was determined because of evidence. For validity, evidence related to the accuracy of score-based interpretations about test-takers was sought. Although different intended uses of test results might call for different kinds of evidence, what was needed was a validity argument featuring plentiful and potent evidence that a test was apt to provide users with scores from which valid inferences could be made for specific purposes. Similarly, to calculate a test’s reliability, once more we saw that reliability evidence of three sorts could be gathered so that judgments could be made about a test’s ability to yield consistent scores. Thus, with regard to both of these two notions of a test’s quality, the more solid evidence that’s available to support either reliability or validity, the more likely it is that we will approve a test for a given measurement mission.
It is the same thing for fairness. Educators who want test fairness to play a prominent role in appraising the worth of an educational test must be able to present a compelling array of evidence (judgmental, empirical, or both) that a test is up to the measurement purpose it is supposed to accomplish. As this chapter indicated, either judgmental or empirical evidence can be collected that bears on the fairness of a specific educational assessment. But mere verbal support of assessment fairness without compelling evidence bearing on a specific test’s fairness is likely to have little impact on judgments about the virtues of a given educational test.
Typically, educators will find today’s large-scale standardized tests accompanied by increasing quantities of evidence regarding a specific test’s fairness. For instance, you might encounter summaries of the per-item judgments rendered by a specially constituted bias review panel. Such summaries of judgmental evidence might be supplemented with analyses of the differing test performances of subgroups of student test-takers. Similarly, when teachers try to collect evidence regarding their own classroom assessments, much less elaborate records can be made, such as a paragraph or two about whatever analyses were undertaken regarding validity, reliability, and fairness.
To illustrate, suppose you’re a sixth-grade teacher who has asked a fifth-grade fellow teacher to review your midterm exam’s items for any instances of subgroup-linked biases. Based on your colleague’s report, you might simply summarize the percentage of items thought to offend or unfairly penalize your sixth-grade students. Similar brief summaries of analyses dealing with reliability and validity could be prepared for the test you are evaluating. Such once-over-lightly analyses on your own classroom tests are likely to sensitize you to the strengths and weaknesses of the evaluative evidence supplied for larger, higher-stakes tests.
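For teachers who keep their colleague's judgments in a spreadsheet or a short script, the summary just described amounts to a single percentage. The sketch below is only an illustration: the 20-item midterm, the flagged item numbers, and the function name are all hypothetical.

```python
def percent_flagged(review):
    """Return the percentage of items a reviewer flagged as possibly biased.

    `review` maps each item number to True (flagged) or False (not flagged).
    """
    if not review:
        raise ValueError("The review must cover at least one item.")
    flagged = sum(1 for is_flagged in review.values() if is_flagged)
    return 100 * flagged / len(review)


# Hypothetical 20-item midterm on which the colleague flagged items 4, 11, and 17.
review = {item: item in {4, 11, 17} for item in range(1, 21)}
print(f"{percent_flagged(review):.0f}% of items flagged")  # 15% of items flagged
```

A one-line record such as "15% of items flagged by a colleague's bias review" is the kind of modest evidence a classroom teacher can realistically keep on file.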
What Do Classroom Teachers Really Need to Know About Fairness?
Classroom teachers need to know that if assessment bias exists, then assessment fairness toddles out the door. Unfair bias in large-scale educational testing is less prevalent than it was a decade or two ago because many measurement specialists, after having been sensitized to the presence of assessment bias, now strive to eliminate such bias. However, for the kinds of teacher-developed assessment procedures seen in typical classrooms, systematic attention to bias eradication is much less common.
All classroom teachers need routinely to use fairness—and, in particular, absence-of-bias—as one of the evaluative criteria by which they judge their own assessments and those educational assessments developed by others. For instance, if you are ever called on to review the quality of a high-stakes test—such as a district-developed or district-adopted examination whose results will have a meaningful impact on students’ lives—be sure that suitable absence-of-bias procedures, both judgmental and empirical, were employed during the examination’s development.
But what about your own tests? How much effort should you devote to making sure that your tests are fair, and that they don’t offend or unfairly penalize any of your students because of personal characteristics such as ethnicity or gender? A reasonable answer to that important query is that you really do need to devote attention to absence-of-bias for all your classroom assessment procedures. For the least significant of your assessment procedures, you can simply heighten your consciousness about bias eradication as you generate your test items or, having done so, as you review the completed test.
For more important examinations, try to enlist the assistance of a colleague to review your assessment instruments. If possible, attempt to secure the help of colleagues from the same subgroups as those represented among your students. For instance, if many of your students are Hispanic (and you aren’t), then try to get a Hispanic colleague to look over your test’s items to see whether there are any that might offend or unfairly penalize Hispanic students. When you enlist a colleague to help you review your tests for potential bias, try to carry out a mini-version of the bias review panel procedures described in this chapter. Briefly describe to your co-workers how you define assessment bias, give them a succinct orientation to the review task, and structure their reviews with absence-of-bias questions such as those illustrated earlier in the chapter.
Most important, if you personally realize how repugnant all forms of assessment unfairness are, and how assessment bias can distort certain students’ performances even if the bias was inadvertently introduced by the test’s developer, you’ll be far more likely to eliminate assessment bias in your own tests. In education, as in any other field, assessment bias should definitely be absent.
But What Does This Have to Do with Teaching?
Decades ago I was asked to serve as an expert witness in a federal court case taking place in Florida. It was known as the Debra P. v. Turlington case, because Ralph Turlington was the Florida Commissioner of Education and Debra P. was one of nine Black children who were about to be denied high school diplomas because they had not passed a state-administered basic skills test. A class-action lawsuit had been brought against Florida to stop this test-based denial of diplomas.
When I was invited to serve as a witness for the state, I was initially pleased. After all, to be an “expert” witness in a federal court case sounded like a recipe for instant tenure. So my first inclination was to agree to serve as a witness. But then I learned that Florida’s high school graduation test was having a substantially disparate impact on the state’s Black children, far more of whom were going to be denied diplomas than their White counterparts. I pondered whether I wanted to support an assessment program that would surely penalize more minority than majority children.
So, before making up my mind about becoming a witness in the Debra P. case, I consulted a number of my Black friends at UCLA and in the Los Angeles Unified School District. “Should I,” I asked, “take part in the support of a test that denies diplomas to so many Black youngsters?”
Well, without exception, my Black friends urged me to accept the Florida assignment. The essence of their argument was simple. What was going on in Florida, they claimed, was the awarding of “counterfeit” diplomas to Black children. If so many Black youngsters were actually flopping on the state’s basic skills test, then it was quite likely that those children did not possess the necessary basic skills. Yet, even without those basic skills, many of Florida’s Black children had previously been receiving high school diplomas that allowed them to graduate without the skills they needed. My friends made it clear. As one put it, “Get your tail down to Tallahassee and support any test that shows many Black children are being instructionally shortchanged. You can’t fix something if you don’t know it’s broken.”
And that’s where bias detection comes in. If teachers can eliminate bias from their own assessments, then any gaps between the performance levels of minority and majority students can be attributed to instruction, not shortcomings in a test. Achievement gaps, properly identified, can then be ameliorated instructionally. If there’s no recognition of a bona fide gap, then there’s little reason for instructional alterations.
(As a postscript, the Debra P. case turned out to be an important precedent-setter. It was ruled that a state can deny a high school diploma on the basis of a basic skills test, but if the test covers content that students have not had an opportunity to learn, then this violates a student’s—that is, a U.S. citizen’s—constitutionally guaranteed property rights. I liked the ruling a lot.)
Assessment-related federal laws oblige today’s teachers to rethink the ways they should test students with disabilities and ELL students. With few exceptions, children with disabilities are to be assessed (and instructed) in relation to the same curricular goals as all other children. This can often be accomplished through the use of accommodations—that is, alterations in presentations, responses, settings, and timing/scheduling. Assessment accommodations, however, must never alter the fundamental nature of the skill or knowledge being assessed. Assessment accommodations can also be used to reduce potential bias associated with the testing of ELL students. Given the enormous number of first languages other than English that are now found in the nation’s schools, financial limitations tend to prevent the development of suitable assessments for many language-minority students.
We have considered now the three sorts of assessment-related evidence that should influence how much confidence can be placed in a set of test results. Yes, by reviewing the reliability, validity, and fairness evidence associated with a given test, we can decide whether a test’s results should be regarded with indifference or adulation. But even if we are using a test that sits astride a compelling collection of test-supportive evidence, there’s the need to report the results of the investigation in which our wondrous test was employed. A botched report featuring a great educational test is of little value if people can’t understand what the test-based report was trying to communicate.
Chris Domaleski of the Center for Assessment has recently offered some excellent guidance to report writers when he provided a succinct summary of advice from the late Ron Hambleton, an acknowledged master of test-score reporting. Hambleton’s suggestions were based on his belief that while stakeholders cared first about such reports, assessment specialists seemed to focus last on those reports. Domaleski provided us with four Hambleton-recommended practices to assist writers of reports regarding test quality.
Recognizing that those preparing results-focused reports should “lead with the reports” because of the structure they provided to an overall report, Hambleton offered us the following four “best-practice” recommendations (Domaleski, 2022).
1. Say what you mean. An effective score report should include very clear statements of the intended interpretations.
2. Be honest about uncertainty. Because every reported test score is an estimate containing error, score reports should describe that error plainly and clearly.
3. Create visually appealing and informative reports. Not only can visuals help clarify interpretations, they can also streamline a report and thus make it more appealing to users.
4. Conduct a user review. Nothing is a suitable substitute for having teachers, parents, or other key stakeholders examine draft reports and provide informative feedback.
Chapter Summary
This chapter on fairness in testing was, quite naturally, focused on how educators make their tests fair. Assessment bias was described as any element in an assessment procedure that offends or unfairly penalizes students because of personal characteristics, such as their gender or ethnicity. Assessment bias, when present, was seen to distort certain students’ performances on educational tests and hence to reduce the validity of score-based interpretations about those students. The two chief contributors to assessment unfairness were identified as offensiveness and unfair penalization. The essential features of both of these factors were considered. An examination having a disparate impact on a particular subgroup is not necessarily biased, although such a differential impact certainly would warrant further scrutiny of the examination’s content to discern whether assessment bias was actually present. In many instances, disparate impact of an examination simply indicates that certain groups of students have previously received inadequate instruction or an insufficient opportunity to learn.
Two procedures for identifying unfair—that is, biased—segments of educational assessment devices were described. A judgmental approach relies on the considered opinions of properly oriented bias reviewers. Judgmental approaches to bias detection can be formally employed with high-stakes tests or less formally used by classroom teachers. An empirical approach that relies chiefly on differential item functioning can also be used, although its application requires large numbers of students. It was recommended that classroom teachers be vigilant in the identification of bias, whether in their own tests or in the tests of others. For detecting bias in their own tests, classroom teachers were urged to adopt a judgmental review strategy consonant with the importance of the assessment procedures involved.
Issues associated with the assessment of ELL students or students with disabilities were identified near the chapter’s conclusion. In many cases, assessment accommodations can yield more valid inferences about students who have disabilities or are less proficient in English. In other instances, alternate assessments need to be developed. Neither of these assessment modifications, however, is completely satisfactory in coping with this important assessment problem. Fairness, from an assessment perspective, currently appears to be much more difficult to attain in the case of students with disabilities and ELL students. Two relevant concepts, accessibility and universal design, hold promise for assessing such learners.
Finally, it was argued that to avoid having assessment fairness be regarded as a commendable but feckless notion, it is important for those creating tests (as well as those evaluating others’ tests) to ensure that ample evidence bearing on test fairness accompanies educational tests, and even significant classroom assessments. Just as we need solid evidence to reach decisions about assessment reliability and validity, we need such evidence about an educational test’s fairness.
References
Abedi, J. (2004). The No Child Left Behind Act and English language learners: Assessment and accountability issues, Educational Researcher, 33(1): 4–14.
American Educational Research Association. (2014). Standards for educational and psychological testing. Author.
Brookhart, S. (2023). Classroom assessment essentials. ASCD.
Christensen, L., Carver, W., VanDeZande, J., & Lazarus, S. (2011). Accommodations manual: How to select, administer, and evaluate use of accommodations for instruction and assessment of students with disabilities (3rd ed.). Council of Chief State School Officers.
Domaleski, C. (2022, May 11). What I learned about creating effective test score reports from the great Ron Hambleton. Centerline Blog. Center for Assessment. https://www.nciea.org/blog/lead-with-the-reports/
Ferlazzo, L., Gottlieb, M., Micolta Simmons, V., Garcia, C., & Nemeth, K. (2021, April 19). Assessment strategies for English-language learners (Opinion), Education Week. Retrieved September 20, 2022, from https://www.edweek.org/teaching-learning/opinion-assessment-strategies-for-english-language-learners/2021/04
Herman, J., & Cook, L. (2020). Fairness in classroom assessments. In S. M. Brookhart & J. H. McMillan (Eds.), Classroom assessment and educational measurement (1st ed., pp. 243–264). Routledge.
Ketterlin-Geller, L., & Ellis, M. (2020). Designing accessible learning outcomes assessments through intentional test design, Creative Education, 11, 1201–1212. https://doi.org/10.4236/ce.2020.117089
Reibel, A. R. (2021, March 11). Uncovering implicit bias in assessment, feedback and grading. All Things Assessment. Retrieved September 20, 2022, from https://allthingsassessment.info/2021/03/11/uncovering-implicit-bias-in-assessment-feedback-and-grading/
Rogers, C. M., Ressa, V. A., Thurlow, M. L., & Lazarus, S. S. (2022). A summary of the research on the effects of K–12 test accommodations: 2020 (NCEO Report 436). National Center on Educational Outcomes.
Shepard, L. A. (2021, September 15). Ambitious teaching and equitable assessment. American Federation of Teachers. Retrieved September 20, 2022, from https://www.aft.org/ae/fall2021/shepard
Tierney, R. D. (2013). Fairness in classroom assessment. In J. H. McMillan (Ed.), SAGE Handbook of research on classroom assessment (pp. 125–144). SAGE Publications.
Endnote
1. U.S. Congressional Committee on Education and the Workforce. (2005, February 17). IDEA guide to frequently asked questions (p. 10). Washington, DC: Author.
A Testing Takeaway
Assessment Fairness: Newest Member of Testing’s “Big Three”*
W. James Popham, University of California, Los Angeles
For well over 100 years, educational testing’s two most important concepts were—without debate—validity and reliability. Assessment validity refers to the accuracy of score-based interpretations about test-takers regarding a test’s intended uses; assessment reliability represents the consistency with which an educational test measures whatever it is measuring.
But 15 years after its previous edition, a revised version of the Standards for Educational and Psychological Testing was published (AERA, 2014). Those guidelines changed the traditional notion of what’s most important in educational assessment. The revised Standards are particularly influential because they often play a key role in the resolution of courtroom disputes about the uses of educational tests. In the 2014 Standards revision, assessment fairness not only received extensive coverage but was also placed atop a “most important three” pedestal along with validity and reliability. Persuasive evidence regarding all three of those concepts must now be routinely collected and presented when educational tests are developed or evaluated.
Whenever practicable, two kinds of evidence should be assembled to determine an educational test’s fairness:
• Judgmental Evidence. One powerful form of evidence regarding an educational test’s fairness consists of item-by-item judgments rigorously rendered by independent judges. Committees of “bias reviewers” typically provide per-item ratings according to whether an item might either offend or unfairly penalize any group of students because of such personal characteristics as race, gender, or religion.
• Empirical Evidence. If enough students have completed an educational test so that at least several hundred students have responded to every item, it is possible to analyze test-takers’ results to determine whether statistically significant differences exist among subgroup performances. If such differential item functioning (DIF) procedures reveal meaningful between-group differences in performance on any items, those items are then scrutinized more closely to detect and eliminate possible bias.
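To give a feel for what a DIF analysis computes, here is a minimal sketch. This text does not name a particular DIF statistic; the sketch assumes the Mantel-Haenszel common odds ratio, one widely used choice, in which students are first grouped into strata by total test score and the two groups' odds of answering an item correctly are then compared within each stratum. All counts, variable names, and the function name are hypothetical, and an operational analysis would add a significance test and much larger samples.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio for one item across total-score strata.

    Each stratum is a tuple of counts for students with similar total scores:
    (reference_correct, reference_incorrect, focal_correct, focal_incorrect).
    A value near 1.0 suggests the item behaves similarly for both groups.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        total = ref_right + ref_wrong + focal_right + focal_wrong
        if total == 0:
            continue  # an empty stratum contributes nothing
        numerator += ref_right * focal_wrong / total
        denominator += ref_wrong * focal_right / total
    if denominator == 0:
        raise ValueError("Odds ratio is undefined for these counts.")
    return numerator / denominator


# Hypothetical counts for two items, each split into two total-score strata.
balanced_item = [(30, 10, 15, 5), (20, 20, 10, 10)]
suspect_item = [(40, 10, 20, 20), (30, 20, 10, 30)]

print(round(mantel_haenszel_odds_ratio(balanced_item), 2))  # 1.0
print(round(mantel_haenszel_odds_ratio(suspect_item), 2))   # 4.25
```

An odds ratio well above or below 1.0, as for the second item, does not prove bias; it marks the item as one that reviewers should scrutinize more closely.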
For significant high-stakes tests, both judgmental and empirical analyses are usually carried out to locate any biased items in a test. For special learners, such as students with physical or cognitive handicaps, a number of distinctive test adaptations are routinely employed to combat unfairness.
Most important, however, the isolation and elimination of test unfairness is no longer mere test-developer rhetoric. Since the publication of the 2014 revised assessment Standards, evidence of assessment fairness is every bit as important as the evidence provided for validity and reliability. Whenever you consider adopting an educational test, demand to see evidence of its fairness.
*From Chapter 5 of Classroom Assessment: What Teachers Need to Know, 10th ed., by W. James Popham. Copyright 2022 by Pearson, which hereby grants permission for the reproduction and distribution of this Testing Takeaway, with proper attribution, for the intent of increasing assessment literacy. A digitally shareable version is available from https://www.pearson.com/store/en-us/pearsonplus/login.