Education EDU530 Week 6 assignment

Chapter 4

Validity

Chief Chapter Outcome

A sufficiently deep understanding of assessment validity so that its essential nature can be explained, its establishment can be described, and the most appropriate kinds of validity evidence can be selected for specific uses of educational tests

Learning Objectives

4.1 Explain the basic conceptual principles of validity including common shortcomings of assessments with weak validity evidence.

4.2 Identify and differentiate between the four sources of validity evidence described in this chapter.

We’ll be looking at validity in this chapter. Validity is, hands down, the most significant concept in assessment. In order to appreciate the reasons why validity is so all-fired important, however, one first needs to understand why educators carry out assessments in the first place. Thus, let’s set the stage a bit for your consideration of validity by explaining why educators frequently find themselves obliged to mess around with measurement.

Before you eagerly plunge into this important chapter, you need a heads-up warning. This chapter on validity contains some fairly unusual concepts—that is, ideas you’re not apt to see on Saturday night television or even encounter in the “Letters to the Editor” section of your local newspaper (if you still have a local newspaper where you live).

Don’t be put off by these new concepts. When you’ve finished the chapter, having smilingly looked back on its entrancing content, you’ll be better able to focus on the really necessary ideas about validity. These really necessary ideas, when all the hotsy-totsy terminology has been stripped away, are simply gussied-up applications of common sense. Thus, be patient and plug away at this chapter. When it’s over, you’ll be a better person for having done so. And you’ll know oodles more about validity.

M04_POPH0936_10_SE_C04.indd 101M04_POPH0936_10_SE_C04.indd 101 08/11/23 3:35 PM08/11/23 3:35 PM

A Quest for Defensible Interpretations

As noted in Chapter 1, we assess students because we want to determine a student’s status with respect to an educationally relevant variable. One kind of variable of relevance to teachers is a variable that can be altered as a consequence of instruction. How much students have learned about world history is one such variable. Another kind of educationally relevant variable is one that can influence the way a teacher decides to instruct students. Students’ attitudes toward the study of whatever content the teacher is teaching are examples of this sort of variable.

The more knowledge that teachers have about their students’ status with respect to certain educationally relevant variables, the more likely teachers are to make defensible educational decisions regarding their students. To illustrate, if a middle school teacher knows that Lee Lacey is a weak reader and has truly negative attitudes toward reading, the teacher will probably decide not to send Lee scampering off to the school library to tackle an independent research project based on self-directed reading. Similarly, if a mathematics teacher discovers early in the school year that her students know much more about mathematics than she had previously suspected, the teacher is apt to decide that the class will tackle more advanced topics than originally planned. Teachers use the results of assessments to make decisions about students. But appropriate educational decisions depend on the accuracy of educational assessment. That’s because, quite obviously, accurate assessments will improve the quality of decisions, whereas inaccurate assessments will do the opposite. And this is where validity rumbles onstage.

Teachers often need to know how well their students have mastered a curricular aim—for example, a skill or body of knowledge that students are supposed to learn. A curricular aim is also referred to these days as a content standard. To illustrate, if we set out to determine how well students can comprehend what they have read, it is obviously impractical to find out how well students can read everything. There’s just too much out there to read. Nonetheless, teachers would like to get an accurate fix on how well a particular student can handle the full collection of relevant reading tasks implied by a particular curricular aim. Because it is impossible to see how well students can perform with respect to the entire array of skills or knowledge embodied in most curricular aims, we must fall back on a sampling strategy. Thus, when we measure students, we try to sample their mastery of a curricular aim in a representative manner so that, based on students’ performance on the sample (in a test), we can infer what their status is with respect to their mastery of the entire curricular aim. Figure 4.1 portrays this relationship graphically. Note that we start with a curricular aim. In Figure 4.1, that’s the oval at the left. The left-hand oval represents, for illustration purposes, a curricular aim in reading consisting of a student’s ability to comprehend the main ideas of written passages. The oval at the right in Figure 4.1 represents an educational assessment approach—in this instance a 10-item test—that we use to make an inference about a student. If you prefer, you can think of a test-based inference simply as an interpretation of what the test results signify. The interpretation concerns the student’s status with respect to the entire curricular aim (the oval at the left). If this interpretation (inference) is accurate, then the resultant educational decisions are likely to be more defensible. That’s because those decisions will be based on a correct estimate regarding the student’s actual status.
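The sampling strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 200-task domain, the hypothetical student who has mastered 150 of those tasks, the function name `infer_status`, and the fixed random seed are all invented here and do not come from the chapter.

```python
import random


def infer_status(domain_mastery, n_items, seed=0):
    """Estimate mastery of a whole curricular aim from a random sample of its tasks.

    domain_mastery: list of booleans, one per task in the (hypothetical) domain,
                    True where the student can perform the task.
    n_items:        number of tasks sampled to build the test.
    """
    rng = random.Random(seed)                    # seeded so the sketch is repeatable
    sampled = rng.sample(domain_mastery, n_items)
    return sum(sampled) / n_items                # proportion of sampled tasks mastered


# Hypothetical student: can handle 150 of 200 possible reading tasks.
domain = [True] * 150 + [False] * 50
true_status = sum(domain) / len(domain)          # what we would like to know
estimate = infer_status(domain, n_items=10)      # what a 10-item test lets us infer

print(f"actual status: {true_status:.2f}; inferred from 10 items: {estimate:.2f}")
```

The inferred value will usually differ somewhat from the actual status because only 10 of the 200 tasks are sampled; sampling every task would reproduce the actual status exactly.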

The 2014 Standards for Educational and Psychological Testing (American Educational Research Association, 2014) don’t beat around the bush when dealing with the topic of assessment validity. Early on in their consideration of validity, for example, the authors of the Standards indicate that “validity is, therefore, the most fundamental consideration in developing and evaluating tests” (AERA, 2014, p. 11). But in the intervening years between publication of the 1999 Standards and the 2014 Standards, we saw some serious disagreements among members of the educational measurement community about whether assessment validity should emphasize (1) the accuracy of the test-based interpretation regarding what a test-taker knows or can do or (2) the consequences of usage—that is, the appropriateness of the uses of test-based interpretations. Indeed, some of those who stressed the importance of test-score usage attempted to popularize the concept of “consequential validity” to highlight the salience of looking at what action was taken because of test-based inference about test-takers’ status.

Happily, the 2014 Standards resolve this issue decisively by proclaiming that assessment validity involves accumulating relevant evidence to provide a sound scientific basis for “proposed uses of tests.” In other words, those who set out to collect validity evidence should do so in relationship to specific intended uses of the test under consideration. If an educational test is employed to provide an interpretation about a test-taker’s status that’s to be used in a particular way (such as, for instance, to determine a student’s grade-to-grade promotion), the accuracy of the interpretation must be supported not in general but, rather, for the specifically identified use. If another use of the test’s results is proposed, such as to help evaluate a teacher’s instructional effectiveness, then this different usage requires usage-specific evidence to support any interpretation of the test’s results related to teachers’ instructional skills.

[Figure 4.1: two ovals. Left oval, Curricular Aim (illustration: the student’s ability to comprehend the main ideas of written passages). Right oval, Educational Test (illustration: a 10-item short-answer test asking students to supply statements representing 10 written passages’ main ideas). Label between them: A Student’s Inferred Status.]

Figure 4.1 An Illustration of How We Infer a Student’s Status with Respect to a Curricular Aim from the Student’s Performance on an Educational Test Sampling the Aim

The 2014 Standards, then, explicitly link the accuracy of score-based interpretations to the application of those interpretations for particular uses. There is no separate “consequential validity,” because the usage consequences of a test’s results are inherently derived from the accuracy of test-based interpretations for a specific purpose. Thus, clearly, for the very same test, evidence might reveal that for Use X, the test’s score-based interpretations were resoundingly valid, but for Use Y, the test’s score-based interpretations were insufficiently valid.

It is the validity of a score-based interpretation that’s at issue when measurement folks deal with validity. Tests themselves do not possess validity. Educational tests are used so that educators can make interpretations about a student’s status. If high scores lead to one kind of inference, low scores typically lead to an opposite inference. Moreover, because validity hinges on the accuracy of our inferences about students’ status with respect to a curricular aim, it is flat-out inaccurate to talk about “the validity of a test.” A well-constructed test, if used with the wrong group of students or if administered under unsuitable circumstances, can lead to a set of unsound and thoroughly invalid interpretations. Test-based interpretations may or may not be valid. It is the accuracy of the test-based interpretation with which teachers ought to be concerned.

In real life, however, you’ll find a fair number of educators talking about the “validity of a test.” Perhaps they really understand it’s the validity of score-based inferences that is at issue, and they’re simply using a shortcut descriptor. Sadly, it’s more likely that they really don’t know the focus of validity should be on test-based interpretations, not on tests themselves. There is no such thing as “a valid test.” And, just to be evenhanded in our no-such-thing castigations, there is no such thing as “an invalid test.”

Now that you know what’s really at issue in the case of validity, if you ever hear a colleague talking about “a test’s validity,” you’ll have to decide whether you should preserve that colleague’s dignity by letting such psychometric shortcomings go unchallenged. I suggest that, unless there are really critical decisions on the line—decisions that would have an important educational impact on students—you keep your insights regarding validation to yourself. (When I first truly comprehended what was going on with the validity of test-based interpretations, I shared this knowledge rather aggressively with several fellow teachers and, thereby, meaningfully miffed them. No one, after all, likes a psychometric smart ass.)

Decisions about how to marshal a persuasive validity argument, either for a test under development or for a test that’s already in use, can often be abetted by developing a set of claims or propositions that support a proposed interpretation for a specified purpose of testing. Evidence is then collected to evaluate the soundness of those propositions. It is often useful, when generating the propositions for such an argument, to consider rival hypotheses that may challenge the proposed interpretation. Two variations of such hypotheses hinge on whether a test measures less or more of the construct it purports to measure. Let’s look at both of these rival hypotheses that can often, if undetected, meaningfully muck up a validity argument.

Content underrepresentation describes a shortcoming of a test that fails to capture important aspects of the construct being measured. For example, a test of
students’ reading comprehension that systematically fails to include certain kinds of reading passages, yet claims it is supplying an accurate overall estimate of a child’s reading comprehension, embodies a content-underrepresentation mistake.

Construct-irrelevant variance is a test weakness in which test-takers’ scores are influenced by factors that are extraneous to the construct being assessed for a specific purpose. Using a reading example again, if a test included a flock of reading passages that were outlandishly too complicated or embarrassingly too simple, our interpretation of a student’s score on such a reading test would surely be less valid than if the test’s passages had meshed better with the test-takers’ abilities. In some instances, we have seen test developers creating mathematics tests for which the required reading load clearly undermines a test-taker’s mathematics prowess. Construct-irrelevant variance can cripple the validity (that is, accuracy) of the score-based interpretations we make for a test’s avowed purpose.

There are some fairly fancy phrases used in the preceding paragraphs, and the idea of having to generate a compelling validity argument might seem altogether off-putting to you. But let’s strip away the glossy lingo. Here’s what’s really involved when test developers (for their own tests) or test users (generally for someone else’s tests) set out to whomp up a winning validity argument. In turn, we need to (1) spell out clearly the nature of our intended interpretation of test scores in relation to the particular use (that is, decision) to which those scores will be put; (2) come up with propositions that must be supported if those intended interpretations are going to be accurate; (3) collect as much relevant evidence as is practical bearing on those propositions; and (4) synthesize the whole works in a convincing validity argument showing that score-based interpretations are likely to be accurate. Sure, it is easier said than done. But now you can see that, when the exotic nomenclature of psychometrics is expunged, what needs to be done is really not all that intimidating.

Decision Time: Group-Influenced Grading

A junior high school English teacher, Cecilia Celina, has recently installed cooperative learning groups in all five of her classes. The groups are organized so that, although there are individual grades earned by students based on each student’s specific accomplishments, there is also a group-based grade that is dependent on the average performance of a student’s group. Cecilia decides that 60 percent of a student’s grade will be based on the student’s individual effort and 40 percent of the student’s grade will be based on the collective efforts of the student’s group. She uses this 60–40 split when she grades students’ written examinations—as well as when she grades a group’s oral presentation to the rest of the class.
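To see how a 60–40 split of this kind combines the two components, here is a minimal Python sketch. The function name `combined_grade`, the 0–100 score scale, the sample scores, and the assumption that the group average includes the student’s own score are all hypothetical; the scenario does not specify them.

```python
def combined_grade(individual_score, group_scores, individual_weight=60):
    """Blend an individual score with the group's average score.

    individual_weight is a whole-number percentage; the remainder goes to the group.
    Assumes (hypothetically) that all scores are on a 0-100 scale.
    """
    group_average = sum(group_scores) / len(group_scores)
    group_weight = 100 - individual_weight
    return (individual_weight * individual_score + group_weight * group_average) / 100


# A strong student (90) in a group whose scores average 70.
strong = combined_grade(90, [90, 70, 60, 60])

# A weaker student (50) in a group whose scores average 90.
weak = combined_grade(50, [50, 100, 100, 110])

print(strong)  # 82.0
print(weak)    # 66.0
```

With these made-up numbers, the first student’s grade sits 8 points below the individual score and the second student’s grade sits 16 points above it, which shows how much the group component can move a grade away from individual performance.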

Several of Cecilia’s fellow teachers have been interested in her use of cooperative learning because they are considering employing such an approach in their own classes. One of those teachers, Fred Florie, is uncomfortable about Cecilia’s 60–40 split of grading weights. Fred believes that Cecilia cannot arrive at a valid estimate of an individual student’s accomplishment when 40 percent of the student’s grade is based on the efforts of other students. Cecilia responds that this aggregated grading practice is one of the key features of cooperative learning. This is because it is the contribution of the group grade that motivates the students in a group to help each other learn. In most of her groups, for example, she finds that students willingly help other group members prepare for important examinations.

As she considers Fred’s concerns, Cecilia concludes that he seems to be most troubled about the validity of the inferences she makes about her students’ achievements. In her mind, however, she separates an estimate of a student’s accomplishments from the grade she gives a student.

Cecilia believes that she has three decision options facing her. As she sees it, she can (1) leave matters as they are, (2) delete all group-based contributions to an individual student’s grade, or (3) modify the 60–40 split.

If you were Cecilia, what would your decision be?

Validity Evidence

Teachers make scads of decisions, sometimes on a minute-by-minute basis. They decide whether to ask questions of their students and, if questions are to be asked, which student gets which question. Most of a teacher’s instructional decisions are based to a large extent on the teacher’s judgment about students’ achievement levels. For instance, if a third-grade teacher concludes that most of her students are able to read independently, then the teacher may decide to use independent reading as a key element in a social studies lesson. The teacher’s judgment about students’ reading abilities is almost always based on evidence of some sort—gathered either formally or informally.

The major contribution of classroom assessment to a teacher’s decision making is that it provides reasonably accurate evidence about students’ status. Although teachers are often forced to make inferences about students’ knowledge, skills, or attitudes on the basis of informal observations, such unsystematic observations sometimes lead teachers to make inaccurate estimates about a particular student’s status. I’m not knocking teachers’ informal observational skills, mind you, for I certainly relied on my own informal observations when I was in the classroom. Frequently, however, I was off the mark! More than once, I saw what I wanted to see by inferring that my students possessed knowledge and skills that they really didn’t have. Later, when students tackled a midterm or final exam, I often discovered that the conclusions I had drawn from my observation-based judgments were far too generous.

Classroom assessments, if they’re truly going to help teachers make solid instructional decisions, should allow teachers to arrive at valid interpretations about their students’ status. But, more often than not, because a classroom assessment can be carefully structured, an assessment-based interpretation is going to be more valid than an on-the-fly interpretation made while the teacher is often focused on other concerns.

The only reason why teachers should assess their students is to make better educational decisions about those students. Thus, when you think about validity, remember it is not some abstruse measurement mystery, knowable only to those who’ve labored in the validity vineyards for eons. Rather, validity reflects the accuracy of the interpretations teachers make about what their students know and are able to do.

Even though there are different ways of determining whether test-based interpretations are apt to be valid, the overriding focus is on the accuracy of an assessment-based interpretation. It usually helps to think about validity as an overall evaluation of the degree to which a specific interpretation of a test’s results is supported. People who develop important tests often try to construct a validity argument in which they assemble evidence and analyses to show that their tests support the interpretations being claimed.

As a teacher, or a teacher in preparation, you may wonder why you need to devote any time at all to different kinds of validity evidence. Your puzzlement notwithstanding, it’s really important that you understand at least the chief kinds of evidence bearing on the validity of a score-based inference. Here, then, we will pay close attention to the sort of validity evidence with which classroom teachers should be dominantly concerned.

In the last chapter, we saw that there are several kinds of reliability evidence that can be used to help us decide how consistently a test’s scores measure what the test is assessing. There are also several kinds of evidence that can be used to help educators determine whether their score-based interpretations are valid for particular uses. Rarely will one set of evidence be so compelling that, all by itself, the evidence assures us our score-based interpretations are truly accurate for a specific purpose. More commonly, several different sets of validity evidence are needed for educators to be really comfortable about the test-based inferences they make. Putting it another way, to develop a powerful validity argument, it is often necessary to collect varied kinds of validity evidence.

When I was preparing to be a high school teacher, many years ago, my teacher education classmates and I were told that “validity refers to the degree to which a test measures what it purports to measure.” (I really grooved on that definition because it gave me an opportunity to use the word purport. Prior to that time, I’d had few occasions to employ this respect-engendering term.) Although, by modern standards, such a traditional definition of validity seems pretty antiquated, it still contains a solid seed of truth. If a test truly measures with reasonable accuracy what it sets out to measure, then it’s likely that the inferences we make about students based on their test performances will be valid for the purpose at hand. Valid test-based interpretations will be made because we will typically interpret students’ performances according to what the test’s developers set out to measure—and did so with a specific use in mind.

In some instances, however, we find educators making meaningful modifications in the sorts of assessments they use with children who have physical or cognitive disabilities. We’ll look into this issue in the next chapter, particularly as it has been reshaped by federal assessment-related legislation. However, try to think about an assessment procedure in which a reading achievement test is to be read aloud to a child with blindness. Let’s say that the child, having heard the test’s reading passages and each multiple-choice option having been read aloud, scores very well. Does this permit us to make a valid inference that the child with blindness can read? No, even though a reading test is involved, a more valid interpretation is that the child can derive substantial meaning from read-aloud material. It’s an important skill for visually impaired children, and it’s a skill that ought to be nurtured. But it isn’t reading. Validity resides in a test-based interpretation, not in the test itself. And the interpretation must always be rooted in a specified use that’s intended.

Let’s take a look, now, at the four kinds of evidence you are apt to encounter in determining whether the interpretation that’s made from an educational assessment procedure is valid for a specific purpose. Having looked at the four varieties of validity evidence, you’ll then be getting a recommendation regarding what classroom teachers really need to know about validity and what kinds of validity evidence, if any, teachers need to gather regarding their own tests. Table 4.1 previews the four sources of validity evidence we’ll be considering in the remainder of the chapter.

Through the years, measurement specialists have carved up testing’s validity pumpkin in sometimes meaningfully different ways. Going back even earlier than the 1999 Standards, it was sometimes recommended that we were dealing not with types of validity evidence but, rather, with different types of validity itself. Both the 1999 and the 2014 Standards, however, make it clear that when we refer to validity evidence, we are describing different types of evidence—not different types of validity.

The 2014 Standards make the point nicely when their authors say, “Validity is a unitary concept. It is the degree to which all the accumulated evidence supports the intended interpretation of test scores for the proposed use” (AERA, 2014, p. 14). I have taken the liberty of slathering a heap of emphasis on that truly important sentence by italicizing it. The italicizing was added because, if you grasp the meaning of this single sentence, you will have understood what’s really meant by assessment validity.

In Table 4.1 you will find the four types of validity evidence identified in the Joint Standards. Two of those four kinds of evidence are used more frequently, so we will deal with those two categories of evidence more deeply in the remainder of this chapter. Please consider, then, the four sources of validity evidence identified in the 2014 Standards and daintily described in Table 4.1.

Table 4.1 Four Sources of Validity Evidence

Basis of Validity Evidence: Brief Description

Test Content: The extent to which an assessment procedure adequately represents the content of the curricular aim(s) being measured

Response Processes: The degree to which the cognitive processes that test-takers employ during a test support an interpretation for a specific test use

Internal Structure: The extent to which the internal organization of a test coincides with an accurate assessment of the construct supposedly being measured

Relations to Other Variables: The degree to which an inferred construct is predictably correlated to measures of similar and dissimilar variables—including foreseen and unforeseen consequences

As you consider the four sources of validity evidence set forth in Table 4.1, each accompanied by its terse description, you will discover that in many settings, the most significant sources of validity evidence teachers need to be concerned about are the first and the fourth entries: validity evidence based on test content and validity evidence based on relations to other variables. On rare occasions, a teacher might bump into the remaining two sources of validity evidence—the kind based on response processes and the kind based on a test’s internal structure—but such sorts of validity evidence aren’t often encountered by most classroom teachers. Accordingly, we will take a serious look at two of the sources of validity evidence set forth in Table 4.1, and we will give only an abbreviated but appreciative treatment to the other two, less frequently encountered sources of validity evidence.

Validity Evidence Based on Test Content

Remembering that the more evidence of validity we have, the better we’ll know how much confidence to place in score-based inferences for specific uses, let’s look at the first source of validity evidence: evidence based on test content.

Formerly described as content-related evidence of validity (and described, even earlier, as content validity), evidence of validity based on test content refers to the adequacy with which the content of a test represents the content of the curricular aim(s) about which interpretations are to be made for specific uses. When the idea of content representativeness was first dealt with by educational measurement folks many years ago, the focus was dominantly on achievement examinations, such as a test of students’ knowledge of history. If educators thought that eighth-grade students ought to know 124 specific facts about history, then the more of those 124 facts that were represented in a test, the more evidence there was regarding content validity.

These days, however, the notion of content refers to much more than factual knowledge. The content of curricular aims in which teachers are interested can embrace knowledge (such as historical facts), skills (such as higher-order thinking competencies), or attitudes (such as students’ dispositions toward the study of science). Content, therefore, should be conceived of broadly. When we determine the content representativeness of a test, the content in the curricular aims being sampled can consist of whatever is contained in those curricular aims. Remember, the curricular aims for most classroom tests consist of the skills and knowledge included in a teacher’s intended outcomes for a certain instructional period.

During the past decade or two, the term content standard has become a common way to describe the skills and knowledge that educators want their students to learn. Almost all states currently have an officially approved set of content standards for each of the major subject areas taught in that state’s public schools. (As noted in Chapter 2 of this book, in many instances a state’s official curriculum is, in reality, a complete lift or a slightly modified version of the Common Core Content Standards stimulated by federal incentive programs.) Teachers, too, sometimes pursue additional content standards (or curricular aims) associated with the particular grade level or subjects they teach. Clearly, teachers need to familiarize themselves with whatever descriptive language is officially approved in their locale regarding what it is that students are supposed to learn.

But what is adequate representativeness of a set of content standards, and what isn’t? Although this is clearly a situation in which more representativeness is better than less representativeness, let’s illustrate varying levels with which a curricular aim can be represented by a test. Take a look at Figure 4.2, where you see an illustrative curricular aim (represented by the shaded rectangle) and the items from different tests (represented by the dots). The less adequately the test items coincide with the curricular aim, the weaker is the evidence of validity based on test content.

For example, in Illustration A of Figure 4.2, we see that the test’s items effectively sample the full range of the curricular aim’s content represented by the shaded rectangle. In Illustration B, however, note that some of the test’s items don’t even coincide with the curricular aim’s content, and those items that do fall in the curricular aim don’t cover it well. Even in Illustration C, where all the test’s items measure content included in the curricular aim, the breadth of coverage for the curricular aim is clearly insufficient.
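The two ways a test can fall short in these illustrations (items landing outside the curricular aim, and items covering too little of it) can be made concrete with a small Python sketch. Everything in it is assumed for illustration: the curricular aim is modeled as ten numbered topics, each test item is tagged with a single topic, and the function name `representativeness` and the three sample tests are invented. Real content-representativeness judgments rest on expert review rather than on a simple count like this.

```python
def representativeness(aim_topics, item_topics):
    """Return (share of items that fall inside the aim, share of the aim that is covered)."""
    aim = set(aim_topics)
    on_target = [topic for topic in item_topics if topic in aim]
    share_on_target = len(on_target) / len(item_topics)   # do the items belong to the aim?
    breadth = len(set(on_target)) / len(aim)               # how much of the aim do they reach?
    return share_on_target, breadth


aim = range(1, 11)                      # hypothetical curricular aim: topics 1 through 10

test_a = list(range(1, 11))             # one item per topic
test_b = [1, 2, 11, 12]                 # half the items fall outside the aim
test_c = [1, 1, 2, 2]                   # every item is on target, but only two topics appear

print(representativeness(aim, test_a))  # (1.0, 1.0)
print(representativeness(aim, test_b))  # (0.5, 0.2)
print(representativeness(aim, test_c))  # (1.0, 0.2)
```

The three sample tests loosely mirror Illustrations A, B, and C: only the first is both on target and broad in its coverage.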

[Figure 4.2: six panels, each showing a curricular aim (shaded rectangle) and test items (dots). A. Excellent Representativeness; B. Inadequate Representativeness; C. Inadequate Representativeness; D. Inadequate Representativeness; E. Inadequate Representativeness; F. Really Rotten Representativeness.]

Figure 4.2 Varying Degrees to Which a Test’s Items Represent the Curricular Aim About Which Score-Based Inferences Are to Be Made

Trying to put a bit of reality into those rectangles and dots, think about an Algebra I teacher who is trying to measure students’ mastery of a semester’s worth of content by creating a truly comprehensive final examination. Based chiefly on students’ performances on the final examination, the teacher will assign grades that will influence whether students can advance to Algebra II. Let’s assume the content the teacher addressed instructionally in Algebra I—that is, the algebraic skills and knowledge taught during the Algebra I course—is truly prerequisite to Algebra II. Then, if the curricular aims representing the Algebra I content are not satisfactorily represented by the teacher’s final examination, the teacher’s score-based interpretations about students’ end-of-course algebraic capabilities and the teacher’s resultant decisions about students’ readiness for Algebra II are apt to be in error. If teachers’ educational decisions hinge on students’ status regarding the curricular aims’ content, then those decisions are likely to be flawed if inferences about students’ mastery of the curricular aims are based on a test that fails to adequately represent those curricular aims’ content.

How do educators go about gathering evidence of validity derivative from a test’s content? Well, there are generally two approaches to follow. Let’s consider each of them briefly.

Developmental Care

One way of trying to make sure a test’s content adequately taps the content of a curricular aim (or a group of curricular aims) is to employ a set of test-development procedures carefully focused on ensuring that the curricular aim’s content is properly reflected in the assessment procedure itself. The higher the stakes associated with the test’s use, the more effort is typically devoted to making certain that the assessment procedure’s content properly represents the content in the curricular aim(s) involved. For example, if a commercial test publisher were trying to develop an important new nationally standardized test measuring high school students’ knowledge of chemistry, it is likely there would be much effort during the development process to make sure the appropriate sorts of skills and knowledge were being measured by the new test. Similarly, because of the significance of a state’s accountability tests, substantial attention is typically given to the task of verifying that the test’s content suitably represents the state’s most important curricular aims.

Listed here, for example, are the kinds of activities that might be carried out during a state’s test-development process for an important chemistry test to ensure the content covered by the new test properly represents “the content that our state’s high school students ought to know about chemistry.”

Possible Developmental Activities to Enhance a High-Stakes Chemistry Test’s Content Representativeness

• A panel of national content experts, individually by mail, electronically via Internet Zoom sessions, or during extended face-to-face sessions, recommends the knowledge and skills that should be measured by the new test.

• The proposed content of the new test is systematically contrasted with a list of topics derived from a careful analysis of the content included in the five leading textbooks used in the nation’s high school chemistry classes.

• A group of high school chemistry teachers, each judged to be a “teacher of the year” in his or her own state, provides suggestions regarding the key topics (that is, knowledge and skills) to be measured by the new test.

• Several college professors, conceded to be international authorities regarding the teaching of chemistry, having independently reviewed the content suggestions of others for the new test, offer recommendations for additions, deletions, and modifications.

• State and national associations of secondary school chemistry teachers provide Internet and by-mail reviews of the proposed content to be measured by the new test.

With lower-stakes tests, such as the kind of quiz that a high school chemistry teacher might give after a one-week instructional unit, less elaborate content reviews are obviously warranted. Even classroom teachers, however, can be attentive to the content representativeness of their tests. For openers, teachers can give deliberate consideration to deciding whether the content of their classroom tests reflects the instructional outcomes supposedly being measured by those tests. For instance, whatever the test is, a teacher can deliberately try to identify the nature of the curricular aim the test is supposed to represent. Remember, the test itself should not be the focus of the teacher’s concern. Rather, the test should be regarded as simply a “stand in” for a curricular aim—that is, the set of skills and/or knowledge embodied in the teacher’s instructional aspirations.

To illustrate, if a teacher of 10th-grade English wants to create a final examination for a one-semester course, the teacher should first try to identify all the important skills and knowledge to be taught during the semester. An outline of such content, or even a simple list of topics, will usually suffice. Then, after identifying the content of the curricular aims covering the English course’s key content, the teacher can create an assessment instrument that attempts to represent the identified content properly.

As you can see, the important consideration here is that the teacher makes a careful effort to conceptualize the nature of a curricular aim and then tries to see whether the test being constructed actually contains content that is appropriately representative of the content in that curricular aim. Unfortunately, many teachers generate tests without any regard whatsoever for curricular aims. Rather than trying to figure out what knowledge, skills, or attitudes should be promoted instructionally, some teachers simply start churning out test items. Before long, a test is born—a test that, more often than not, does a pretty cruddy job of sampling the skills and knowledge about which the teacher should make inferences.

We have seen, then, that one way of supplying evidence of validity based on test content is to deliberately incorporate test-development activities that increase the likelihood of representative content coverage. Next, these procedures should be documented. It is this documentation, in fact, that constitutes an important form of evidence for the validity argument dealing with the suitability of the test for specific uses. The more important the test—that is, the more important its uses—the more important it is that validity-relevant test-development activities be documented. For most teachers’ classroom assessments, no documentation is requisite.

External Reviews

A second approach to gathering evidence of validity based on test content involves the assembly of judges who rate the content appropriateness of a given test in relationship to the curricular aim(s) the test allegedly represents. For high-stakes tests, such as a state’s accountability tests or a state-developed examination that must be passed before a student receives a high school diploma, these content reviews are typically quite systematic. For run-of-the-mill classroom tests, such external reviews are usually far more informal. For instance, when one teacher asks a colleague to scrutinize the content coverage of a midterm exam, that’s a version of this second approach to reviewing a test’s content. Clearly, the care with which external reviews of an assessment procedure’s content are conducted depends on the consequences associated with students’ performances. The more significant the consequences, the more elaborate the external content-review process. (Generally speaking, teachers can place more trust in a colleague’s take on a set of items gathered during a morning coffee session than during an evening’s cocktail session.) Let’s look at a couple of examples, now, to illustrate this point.


Suppose state department of education officials decide to construct a statewide assessment program in which all sixth-grade students who don’t achieve a specified level of competence in language arts or mathematics must take part in state-designed, but locally delivered, after-school remediation programs. Once the items for the new assessment program are developed (and those items might be fairly traditional or quite innovative), a panel of 20 content reviewers for language arts and a panel of 20 content reviewers for mathematics then consider the test’s items. Such reviewers, typically, are subject-matter experts who have considerable familiarity with the content involved.

Using the mathematics portion of the test for illustrative purposes, the 20 members of the Mathematics Content Review Panel might be asked to render a yes/no judgment for each of the test’s items in response to a question such as the following. Please recognize that this sort of review task must definitely describe the test’s intended use(s). This is because assessment validity revolves around the accuracy of score-based interpretations for specific, spelled-out-in-advance, purposes.

An Illustrative Item-Judgment Question for Content-Review Panelists

Does this item appropriately measure mathematics knowledge and/or skill(s) so significant to the student’s further study that, if those skills and knowledge are not mastered, the student should receive after-school remedial instruction?

Note that the illustrative question is a two-component question. Not only should the mathematical knowledge and/or skill involved, if unmastered, require after-school remedial instruction for the student, but also the item being judged must “appropriately measure” this knowledge and/or skill. In other words, the knowledge and/or skill must be sufficiently important to warrant remedial instruction on the part of the students, and the knowledge and/or skill must be properly measured by the item so that score-based inferences are apt to be accurate. If the content of an item is important, and if the content of the item is also properly measured, the content-review panelists should supply a yes judgment for the item. If either an item’s content is not significant, or the content is badly measured by the item, then the content-review panelists should supply a no judgment for the item.

By calculating the percentage of panelists who rate each item positively, an index of content-related evidence of validity can be obtained for each item. To illustrate the process, suppose we had a five-item subtest whose five items received the following positive per-item ratings from a content-review panel: Item One, 72 percent; Item Two, 92 percent; Item Three, 88 percent; Item Four, 98 percent; and Item Five, 100 percent. The average positive per-item ratings for the entire test, then, would be 90 percent. The higher the average per-item ratings provided by a test’s content reviewers, the stronger the content-related evidence of validity. This first kind of item-by-item judgment from reviewers represents an attempt to isolate (and eliminate) items whose content is insufficiently related to the curricular aim being measured.
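For readers who like to see the arithmetic, here is a minimal sketch of the per-item index just described. The five percentages are the ones given in the text. The function names and the sample panel of 20 yes/no votes are hypothetical, invented only for illustration.

```python
def percent_positive(votes):
    """Percentage of panelists who answered yes (True) for a single item."""
    return 100.0 * sum(votes) / len(votes)


def average_item_rating(item_percentages):
    """Mean of the per-item positive percentages for a whole test or subtest."""
    return sum(item_percentages) / len(item_percentages)


# Per-item positive ratings quoted in the text for the five-item subtest.
item_ratings = {
    "Item One": 72,
    "Item Two": 92,
    "Item Three": 88,
    "Item Four": 98,
    "Item Five": 100,
}

average = average_item_rating(list(item_ratings.values()))
print(f"Average positive per-item rating: {average:.0f} percent")

# Hypothetical panel: 18 of 20 reviewers answer yes for one item.
sample_votes = [True] * 18 + [False] * 2
print(f"Positive rating for the sample item: {percent_positive(sample_votes):.0f} percent")
```

The first function turns a panel’s yes/no judgments into a per-item percentage, and the second averages those percentages across the items, which is all the index amounts to.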

In addition to the content-review panelists’ ratings of the individual items, such panels can also be asked to judge how well the test’s items represent the domain of content that should be assessed to determine whether a student is assigned to the after-school remediation extravaganza. Here is an example of how such a question for content-review panelists might be phrased.

An Illustrative Content-Coverage Question for Content-Review Panelists

First, try to identify—mentally—the full range of mathematics knowledge and/or skills you believe to be so important that, if they are not mastered, the student should be assigned to after-school remediation classes. Second, having mentally identified that domain of mathematics knowledge and/or skills, please estimate the percent of the domain represented by the set of test items you just reviewed. What percent is it? _______________ percent

If a content panel’s review of a test’s content coverage yields an average response to such a question of, say, 85 percent, that’s not bad. If, however, the content-review panel’s average response to the question about content coverage is only 45 percent, this indicates there is a solid chunk of important content not being measured by the test’s items. For high-stakes tests, external reviewers’ average responses to an item-by-item content question for all items and their average responses to a content-coverage question for the whole test constitute solid indicators of judgmentally derived evidence of validity. The more positive the evidence, the greater confidence one can have in making score-based inferences about a student’s status with respect to the curricular aims being measured. Typically, summations of such judgments are placed in a validity (argument) report to accompany the test that’s under scrutiny.

Although it would be possible for classroom teachers to go through these same content-review machinations for their own classroom assessments, few sane teachers would want to expend this much energy to review the content of a typical classroom test. Instead, a teacher might ask another teacher to look over a test’s items and to render judgments akin to those asked of a content-review panel. Because it takes time to provide such judgments, however, a pair of teachers might trade the task of reviewing the content of each other’s classroom assessments. (“You scratch my back/review my test, and I’ll scratch/review yours.”)

One of the significant dividends resulting from having a fellow teacher review the content of your classroom tests is that the mere prospect of having such a review take place will almost always lead to the proactive inclusion of more representative content coverage in your tests. The more carefully you consider your test’s content coverage early on, the more likely it is that the test’s content coverage will be appropriate. The prospect of having a respected colleague scrutinize one or more of your tests to discern its content coverage typically inclines a teacher to spiffy-up a test’s items. Ah, human nature!

In review, we have considered one kind of validity evidence—namely, content-related evidence of validity—that can be used to support the defensibility of score-based inferences about a student’s status with respect to curricular aims. We’ve discussed this form of validity evidence in some detail because, as you will see, it is the most important form of validity evidence that classroom teachers need to be concerned with when they judge their own classroom assessment procedures.

Alignment

During the last decade or two, those who construct high-stakes accountability tests have been urged to make sure those assessments are in alignment with whatever curricular aims are supposed to be measured by the tests under construction. It is not a silly notion. But it is an imprecise one.

What most people mean when they say they want tests to be aligned with curricular targets is that the tests should properly measure students’ status with respect to those curricular targets. But “properly measure” is a fairly murky concept. As one looks at the four sources of validity evidence we are dealing with here, it seems clear that the most appropriate form of alignment evidence is evidence of validity based on a test’s content. Sometimes this test content is determined based on scrutiny of the test itself, and sometimes those determining the degree of alignment will employ the sorts of test-content blueprints described in Chapter 2. Given the diversity of ways that different measurement specialists have attempted to analyze the nature of a test’s alignment, we can understand why some commentators have opined that, “Assessment alignment is a nine-letter word that seems to function like a four-letter word!”

Thus, because federal assessment personnel have been calling for state accountability tests to be satisfactorily aligned with a state’s content standards, a growth industry has emerged in recent years focused on scrutinizing the adequacy of state tests’ content coverage. Remember, in most instances, federal officials have appointed state-specific Technical Advisory Committees whose task it is to verify the legitimacy of a state’s content sampling in its tests. The prospect of such scrutiny invariably inclines state authorities to carry out their tests’ content sampling with greater care. We have seen numerous groups of assessment/curricular specialists spring up to independently determine whether a state’s accountability tests are in alignment with that state’s curricular aims. One of the most popular of the alignment approaches being used these days is the judgmental procedure devised by Norman Webb of the University of Wisconsin. Briefly introduced in Chapter 2, Webb’s method of determining alignment is centered on judgments being made about the following four questions (Webb, 2002).

• Categorical concurrence. Are identical or essentially equivalent categories employed for both curricular aims and assessments?

• Depth-of-knowledge (DOK) consistency. To what degree are the cognitive demands of curricular aims and assessments identical?

• Range-of-knowledge correspondence. Is the breadth of knowledge reflected in curricular aims and assessments the same?

• Balance of representation. To what degree are different curricular targets given equal emphasis on the assessments?

As you can see from Webb’s four evaluative criteria, the emphasis of this approach parallels other, more traditional methods of ascertaining whether the content that’s supposed to be assessed by a test has, indeed, been appropriately assessed. A number of modifications of Webb’s general approach have been employed during the past few years, and it seems likely that we will see other approaches to determine a state test’s alignment with the curricular aims it supposedly assesses. At bottom, however, it is apparent that these alignment approaches are judgmentally rooted strategies for assembling evidence of validity based on test content.

Validity Evidence Based on Response Processes

An infrequently employed source of validity evidence for educational tests is evidence based on response processes. In attempting to measure students’ status with respect to certain constructs, it is assumed that test-takers engage in particular cognitive processes. For example, if a test is intended to measure students’ logical reasoning abilities, it becomes important to know whether students are, in fact, relying on logical reasoning processes as they interact with the test.

Evidence based on response processes typically comes from analyses of individual test-takers’ responses during a test. For example, questions can be posed to test-takers at the conclusion of a test asking for descriptions of the procedures a test-taker employed during the test. More often than not, validation evidence from test-takers’ responses during the testing process will be used in connection with psychological assessments, rather than the somewhat more straightforward assessment of students’ knowledge and skills.

A potentially useful setting in which validity evidence would be based on students’ response processes can be seen when educators monitor the evolving nature of a sought-for skill such as students’ prowess in written composition. By keeping track of the development of students’ writing skills in the form of en route assessments of their written or digital drafts, validity evidence can be assembled regarding how well students can write.

The 2014 Standards suggests that studies of response processes are not limited to test-takers, but can also focus on observers or judges who have been charged with evaluating test-takers’ performances or products. Careful consideration of the 2014 Standards will reveal that the standards consist chiefly of sound recommendations and rarely set forth any “it must be done this way or else” proclamations.

As noted, even though this source of validity evidence has been identified as a legitimate contributor of evidence to support a validity argument, and teachers should probably know that studies of response processes can be germane for certain sorts of assessments, this particular source of validity evidence is rarely of interest to educators.

Validity Evidence Based on a Test’s Internal Structure

For educators, another infrequently used source of validity evidence is the internal structure of a test. For example, let’s say some test developers create an achievement test that they attempted to make unidimensional. Such a test is intended to measure only one construct (such as a student’s overall mathematical prowess). This test, then, ought to behave differently than a test that had been constructed specifically to measure several supposedly different dimensions, such as students’ mastery of, for instance, geometry, algebra, and basic computation.

This third source of validity evidence, of course, requires us to think more carefully about the “construct” supposedly being measured by a test. For many educational tests, the construct to be assessed is abundantly obvious. To illustrate, before the arrival of nifty computer programs that automatically identify one’s misspelled words and, in most instances, instantly re-spell those words, most of the students in our nation’s schools actually needed to be able to spell. The students were expected to be able to spell on their own, unabetted by electrical, digital, or extraterrestrial support. A teacher would typically choose a sample of words, for instance, 20 words randomly chosen from a district-identified set of 400 “slippery spelling” words, and then read the words aloud—one by one—so students could write on their test papers how each word was spelled. The construct being assessed by such spelling tests was, quite clearly, a student’s spelling ability. To tie that to-be-assessed construct down more tightly, what was being measured was a student’s ability to accurately spell the words on the 400-word “slippery spelling” list. A student’s spelling performance on the 20-word sample was regarded as a sample indicative of the student’s ability to spell all 400 words. The construct being assessed so that educators could estimate students’ mastery of the construct was gratifyingly straightforward.


Not all constructs that serve as the focus of educational assessments are as easy to isolate. For example, because of several recent federal initiatives, considerable attention has been given to ascertaining students’ “college and career readiness”—that is, students’ readiness to successfully tackle higher education or embark on a career. But when assessments are created to measure students’ college and career readiness, what is the actual construct to be measured? Must that construct be an amalgam of both college readiness and career readiness, or should those two constructs be measured separately? The nature of the constructs to be measured by educational tests is always in the hands of the test developers.

Moreover, depending on the way the test developers decide to carve out their to-be-assessed construct, it may be obvious that if the construct carving has been done properly, we would expect certain sorts of statistical interactions among items representing the overall construct or certain of its subconstructs. This third source of validity evidence, then, can confirm or disconfirm that the test being considered does the measurement job (based on our understanding of the constructs involved, as well as on relevant previous research dealing with this topic) it should be doing.

One of the more important validity issues addressable by evidence regarding how a test’s items are functioning deals with the unidimensionality of the construct involved. This is a pivotal feature of most significant educational tests, and if you will think back to Chapter 3’s treatment of internal consistency reliability, you will see that we can pick up some useful insights about a test’s internal structure simply by calculating one or more of the available internal consistency reliability coefficients—for instance, Cronbach’s coefficient alpha. Such internal consistency coefficients, if positive and high, indicate that a test appears to be measuring test-takers’ status with regard to a single construct. If the construct to be measured by the test is, in fact, unidimensional in nature, then this form of reliability evidence also can contribute to our understanding of whether we can arrive at valid interpretations from a test—that is, valid interpretations of a test-taker’s score for a specific purpose of the test being considered.
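To make the idea of an internal consistency coefficient a bit more concrete, here is a minimal sketch of how Cronbach’s coefficient alpha can be computed from item-level scores, using the standard formula: the number of items k, divided by k − 1, times 1 minus the ratio of the summed item variances to the variance of total scores. The function name and the tiny data set (five test-takers, four right/wrong items) are hypothetical and serve only as an illustration.

```python
from statistics import variance


def cronbach_alpha(score_rows):
    """Cronbach's coefficient alpha.

    score_rows holds one list of item scores per test-taker;
    every list must contain the same number of items.
    """
    k = len(score_rows[0])                      # number of items
    items = list(zip(*score_rows))              # regroup scores item by item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in score_rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical scores: rows are test-takers, columns are items (1 = correct).
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(scores)
print(f"Cronbach's alpha: {alpha:.2f}")
```

In practice, a coefficient like this would be computed on far more test-takers and items than the handful shown here, usually with statistical software rather than a hand-rolled function.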

In carrying out efforts to secure this sort of validity evidence when judgments must be made by different raters regarding the caliber of a test-taker’s performance, it is frequently necessary to calculate the degree of inter-rater reliability. To do so, we simply verify the extent to which different raters concur in their evaluations of a student’s performance (for example, a student’s impromptu oral presentation). Clearly, the orientation of raters to the task at hand, coupled with their training and supervision, plays a prominent role in guaranteeing accuracy of the resultant qualitative estimates of students’ performances. Estimates of inter-rater reliability are typically reported as correlation coefficients or simply percent of agreements.
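The two ways of reporting inter-rater reliability mentioned above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the two raters’ scores (on an assumed 1-to-5 scale for eight oral presentations) are invented, and the correlation shown is the ordinary Pearson product-moment coefficient.

```python
from math import sqrt


def percent_agreement(rater_a, rater_b):
    """Percentage of performances on which the two raters gave the same score."""
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return 100.0 * matches / len(rater_a)


def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(ss_x * ss_y)


# Hypothetical ratings of eight students' oral presentations.
rater_a = [4, 3, 5, 2, 4, 1, 3, 5]
rater_b = [4, 3, 4, 2, 4, 2, 3, 5]

print(f"Percent of agreements: {percent_agreement(rater_a, rater_b):.0f} percent")
print(f"Correlation between raters: {pearson_r(rater_a, rater_b):.2f}")
```

Percent of agreements counts only exact matches, whereas a correlation coefficient can remain high even when one rater is consistently a bit more generous than the other, so the two indices need not tell the same story.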

As has been noted, however, this source of internal-structure validity evidence is not often relied on by those who develop or use educational tests. Such validity evidence is more apt to be used by those who create and use psychologically focused tests, where the nature of the constructs being measured is often more complex and multidimensional.

Validity Evidence Based on Relations to Other Variables

The final source of validity evidence identified in the 2014 Standards stems from the nature of relationships between students’ scores on the test for which we are assembling validity-relevant evidence and those same students’ scores on other variables. This kind of validity evidence is often seen when educational tests are being evaluated—particularly for high-stakes tests such as we increasingly encounter in education. Consequently, teachers should become familiar with the ways that this kind of validation evidence is typically employed. We will now consider the two most common ways in which validation evidence is obtainable by seeing how the scores of test-takers on a particular test are related to those individuals’ status on other variables.

Test–Criterion Relationships

Whenever students’ performances on a certain test are believed to be predictive of those students’ status on some other relevant variable, a test–criterion relationship exists. In previous years, such a relationship supplied what was referred to as criterion-related evidence of validity. This source of validity evidence is collected only in situations where educators are using an assessment procedure to predict how well students will perform on some subsequent criterion variable.

The easiest way to understand what this kind of validity evidence looks like is to consider the most common educational setting in which it is collected—namely, the relationship between students’ scores on (1) an aptitude test and (2) the grades those students subsequently earn. An aptitude test, as noted in Chapter 1, is an assessment device used in order to predict how well a student will perform academically at some later point. For example, many high school students complete a scholastic predictor test (such as the SAT or the ACT) when they’re still in high school. The test is supposed to be predictive of how well those students are apt to perform in college. More specifically, students’ scores on the predictor test are employed to forecast students’ grade-point averages (GPAs) in college. It is assumed that those students who score well on the predictor test will earn higher GPAs in college than those students who score poorly on the aptitude test.

In Figure 4.3, we see the classic kind of relationship between a predictor test and the criterion it is supposed to predict. As you can see, a test is used to predict students’ subsequent performance on a criterion. The criterion could be GPA, on-the-job proficiency rating, annual income, or some other performance variable in which we are interested. Most of the time in the field of education, we are concerned with a criterion that deals directly with educational matters. Thus, grades earned in later years are often employed as a criterion. Teachers could also be interested in such “citizenship” criteria as the number of times students vote after completing high school (a positive criterion) or the number of misdemeanors or felonies that are subsequently committed by students (a negative criterion).

If we know a predictor test is working pretty well, we can use its results to help us make educational decisions about students. For example, if you discovered that one of your students, Kareem Carver, scored very poorly on a scholastic aptitude test while in high school, but Kareem’s heart is set on attending college, you could devise a set of supplemental instructional activities so that he could try to acquire the needed academic skills and knowledge before he leaves high school. Test results, on predictor tests as well as on any educational assessment procedures, should always be used to make better educational decisions.

In some cases, however, psychometricians can’t afford to wait the length of time that’s needed between administration of the predictor test and the gathering of data regarding the criterion. For example, if the staff of a commercial testing company is developing a new academic aptitude test to predict high school students’ GPAs at the end of college, the company’s staff members might administer the new high school aptitude test to (much older) college seniors only a month or so before those seniors earn their baccalaureate degrees. The correlation between the aptitude test scores and the students’ final college GPAs has been historically referred to as concurrent evidence of validity. Clearly, such evidence is less compelling than properly gathered predictive evidence of validity—that is, when a meaningful length of time has elapsed between the predictor test and the collection of the criterion data. So, for instance, if we were to ask high school students to complete an academic aptitude test as 11th-graders, and then wait for three years to collect evidence of their college grades, this would represent an instance of predictive criterion-relevant validity evidence.

What teachers need to realize about the accuracy of scholastic predictor tests such as the ones we’ve been considering is that those tests are far from perfect. Sometimes the criterion that’s being predicted isn’t all that reliable itself. (How accurate were your college grades?) For instance, the correlations between students’ scores on academic aptitude tests and their subsequent grades rarely exceed .50. A correlation coefficient of .50 indicates that, although the test is surely to some degree predictive of how students will subsequently perform, there are many other factors (such as motivation, study habits, and interpersonal skills) that play a major role in the grades a student earns. In fact, the best predictor of students’ future grades is not their scores on aptitude tests; it is their earlier grades. Previous grades more accurately reflect the full range of important grade-influencing factors, such as students’ perseverance, that are not directly assessed via aptitude assessment devices.

[Figure: Predictor Test (e.g., an academic aptitude test) → “Predictive of Performance on” → The Criterion (e.g., subsequent grades earned)]

Figure 4.3 The Typical Setting in Which Criterion-Related Evidence of Validity Is Sought

Parent Talk

Mrs. Billings, the mother of one of your students, is this year’s president of your school’s parent-teacher organization (PTO). As a consequence of her office, she’s been reading a variety of journal articles dealing with educational issues. After every PTO meeting, or so it seems to you, Mrs. Billings asks you to comment on one of the articles she’s read. Tonight, she asks you to explain what is meant by the phrase content standards and to tell her how your classroom tests are related, if at all, to content standards. She says that the author of one article she just read had argued that “if teachers’ tests didn’t measure national content standards, the tests wouldn’t be valid.”

If I were you, here’s how I’d respond to Mrs. Billings:

“You’ve really identified an important topic for tonight, Mrs. Billings. Content standards are regarded by many of the country’s educators as a key factor in how we organize our national education effort. Put simply, a content standard describes the knowledge or skill we want our students to master. You might think of a content standard as today’s label for what used to be called an instructional objective. What’s most imperative, of course, is that teachers focus their instructional efforts on appropriate content standards.

“As you may know, our state has already established a series of content standards at all grade levels in language arts, mathematics, social studies, and science. Each of these state-approved sets of content standards is based, more or less, on a set of content standards originally developed by the national subject-matter association involved—for example, the National Council of Teachers of Mathematics. So, in a sense, teachers in our school can consider both the state-sanctioned standards as well as those devised by the national content associations. Most of our school’s teachers have looked at both sets of content standards.

“Teachers look over these content standards when they engage in an activity we call test sharing. Each teacher has at least one other colleague look over all the teacher’s major classroom assessments to make sure that most of the truly significant content standards have been addressed.

“Those teachers whose tests do not seem to address the content standards satisfactorily are encouraged to revise their tests so that the content standards are more suitably covered. It really seems to work well for us. We’ve used the system for a couple of years now, and the teachers uniformly think our assessment coverage of key skills and knowledge is far better.”

Now, how would you respond to Mrs. Billings?

Statistically, if you want to determine what proportion of students’ performances on a criterion variable (such as college grades) is meaningfully related to students’ scores on a predictor test (such as the ACT or the SAT), you must square the correlation coefficient between the predictor and criterion variables. Thus, if the coefficient is .50, simply square it—that is, multiply it times itself—and you end up with .50 × .50 = .25. This signifies that about 25 percent of the variation in students’ college grades can be explained by their scores on the aptitude test. Factors such as motivation and study habits account for the other 75 percent. Your students need to understand that when it comes to the predictive accuracy of students’ scores on academic aptitude tests, effort thumps test scores. The more forcefully you can promote this simple but potent message to your students, the more likely it is that they will grasp the potential payoffs of their own levels of effort.
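The squaring step described above can be checked with a few lines of Python. This is only a minimal sketch: the function name `shared_variance` is hypothetical, chosen for illustration, and the .50 coefficient is simply the value used in the text.

```python
def shared_variance(r: float) -> float:
    """Return r squared: the proportion of variation in the criterion
    that is associated with scores on the predictor."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    return r * r


# Correlation between aptitude scores and later grades, as in the text.
r = 0.50
explained = shared_variance(r)
unexplained = 1.0 - explained  # left to factors the test does not capture

print(f"explained: {explained:.0%}, unexplained: {unexplained:.0%}")
```

Trying other values shows how quickly predictive power shrinks: a coefficient of .30, which sounds respectable, corresponds to less than a tenth of the variation in the criterion.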

Convergent and Discriminant Evidence

Another kind of empirical investigation providing evidence of validity is often referred to as a related-measures study. In such a study, we hypothesize that a given kind of relationship will be present between students’ scores on the assessment device we’re scrutinizing and their scores on a related (or unrelated) assessment device. To illustrate, if we churned out a brand-new test of students’ ability to comprehend what they read, we could hypothesize that students’ scores on the new reading test would be positively correlated with their scores on an already established and widely used reading achievement test. To the extent that our hypothesis is confirmed, we have assembled evidence supporting the validity of our score-based interpretation (using the new test’s results) about a student’s reading skill.

When it is hypothesized that two sets of test scores should be related, and evidence is collected to show that positive relationship, this is referred to as convergent evidence of validity. For example, suppose you were a U.S. history teacher and you used a final exam to cover the period from the Civil War to the present. If another teacher of U.S. history in your school also had a final exam covering the same period, you’d predict that the students who scored well on your final exam would also do well on your colleague’s final exam. (Your dazzling students should dazzle on his exam and your clunky students should clunk on his exam.) If you went to the trouble of actually doing this (and, of course, why would you?), this would be a form of convergent validity evidence.
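For readers who like to see the arithmetic, here is a minimal Python sketch of such a related-measures comparison. Everything in it is an assumption made for illustration: the eight students' scores are invented, and `pearson_r` is a hand-written helper, not part of any testing package. It also includes the contrasting case, in which the same history scores are compared with scores from an unrelated subject (algebra), where a noticeably weaker correlation would be expected.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(spread_x * spread_y)


# Invented final-exam scores for the same eight students (illustration only).
my_history = [92, 85, 78, 74, 69, 63, 58, 50]
colleague_history = [90, 88, 75, 70, 72, 60, 55, 52]
algebra = [70, 91, 64, 83, 55, 88, 76, 61]

convergent = pearson_r(my_history, colleague_history)  # same construct
discriminant = pearson_r(my_history, algebra)          # different construct

print(f"history vs. history: r = {convergent:.2f}")
print(f"history vs. algebra: r = {discriminant:.2f}")
```

With these made-up numbers, the two history exams correlate strongly and the history and algebra exams correlate only weakly, which is the pattern a test developer would hope to report.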

In contrast, suppose that as a test-crazed U.S. history teacher, you also tried to compare your students’ final exam scores to their scores on a final exam in an algebra class. You’d predict a weaker relationship between your students’ final exam scores in your history class and their final exam scores in the algebra class. This lower relationship is referred to as discriminant evidence of validity. It simply means that if your test is assessing what you think it is, then scores on your test ought to relate weakly to results of tests designed to measure other constructs. (This sort of test comparison, too, would be an unsound use of your time.)

If you can remember back—way, way back—to Figure 4.1 and its descriptions of the four kinds of validity evidence you’re apt to encounter in a report regarding a test’s quality, the fourth sort of evidence also identifies the relationship between a test’s scores and both the foreseen and unforeseen consequences of using the test. In any serious attempt to ascertain the appropriateness of an educational test, it is important to consider the consequences of the test’s use, and not only the consequences that were intended, such as being able to carry out certain sorts of mathematical procedures, but also those that were not intended, such as developing an antipathy toward math itself. If a teacher develops a new language arts exam that seems to do a terrific job of distinguishing students who need extra help from those who do not, but implementing the test’s results also leads to an increase in language-rooted cliques and bullying, then this would be negative evidence counteracting the apparent positive ranking of students. Effects of testing, anticipated or unanticipated, are crucial commodities. They, too, must be built into a candid validity argument.

Peeves Make Poor Pets

Most people have at least one or two pet peeves. When it comes to educational assessment, a common Top-of-the-Charts Pet Peeve occurs when people use test results—without warrant—for a purpose other than what a test was originally created to accomplish. The “without warrant” is an important qualifier. That’s because, if a test was initially built for Purpose 1 and has the validity evidence showing it can carry out that purpose properly, then if evidence indicates the test can also be used for Purpose 2, there is no problem with the test’s accomplishing this second purpose. (It’s like a 2-for-1 sale!) What ticks critics off, however, is when measurement folks build a test for one well-supported purpose, then cavalierly try to employ that test for a second, unsupported use.

That’s what has been going on in the United States in recent years when we have seen hordes of policymakers call for the nation’s teachers to be evaluated, at least in part, on the basis of their students’ scores on achievement tests. Most of the achievement tests being trotted out for this teacher-evaluation mission were originally constructed to provide inferences about the degree of students’ mastery regarding a large, often diverse, collection of curricular aims such as has been seen in the nation’s annual accountability tests.

But when students’ test scores are to be used to evaluate a teacher’s instructional quality, then there should be ample validity evidence on hand to show that students’ test scores do, indeed, distinguish between effectively taught students and ineffectively taught students. Merely because a test yields a valid interpretation about a student’s mastery of curricular aims does not indicate that the test automatically yields accurate interpretations about whether particular teachers are instructionally dazzling or instructionally dismal.

Nonetheless, the proponents of test-score–based teacher evaluation blithely assume that any old test will supply the evidence we can use to differentiate among teachers’ instructional skills. This is truly a dumb assumption.

When you’re thinking about which test to use for which function, remember that usage governs validity. That is, we need to assemble validity evidence to support any use of a test’s results. When we employ students’ test scores to evaluate teachers, and the tests we use have not been demonstrated to be capable of doing that evaluative job, then you can safely bet that poor inferences will be made regarding which teachers are winners and which teachers are losers. And when those mistakes about teachers’ competence are made, so that weak teachers get tenure while strong teachers get released, guess who gets harmed most? That’s right; it’s the students.

We’ll dig more deeply into test-based teacher evaluation in Chapter 15.

Sanctioned and Unsanctioned Labels for Validity Evidence

Validity is the linchpin of educational measurement. If teachers can’t arrive at valid score-based interpretations about students—interpretations in line with the test’s intended use—there’s simply no reason to measure students in the first place. However, because validity is such a central notion in educational assessment, some folks have attached specialized meanings to it that, although helpful at some level, also may introduce confusion.

One of these is face validity, a notion that has been around for a number of decades. All that’s meant by face validity is that the appearance of a test seems to coincide with the use to which the test is being put. To illustrate, if an assessment procedure is supposed to measure a student’s actual ability to function collaboratively with a group of other students, then a true–false test that focused exclusively on abstract principles of group dynamics would not appear to be face valid. But appearance can be deceiving, as has been noted in a variety of ways, including the injunction to remember that we “can’t judge a book by its cover.” Thus, even though an assessment procedure may appear to be consonant with the assessment procedure’s mission, we must still assemble evidence that inclines us to put confidence in the score-based interpretation we arrive at by using the test for a particular purpose. If an assessment procedure has no face validity, we’re less inclined to assume the assessment procedure is doing its job. Yet even in those circumstances, we still need to assemble one or more of the Standards-approved varieties of validity evidence to help us know how much confidence, if any, we can put in a score-based interpretation derived from using the assessment procedure.

Another more recently introduced variant of validity is something known as consequential validity. Consequential validity, referred to earlier in this chapter, refers to whether the uses of test results are valid. If, for example, a test’s results are inappropriately employed to deny students a reasonable expectation, such as progressing to the next grade level, the test may be said to be consequentially invalid because its results had been used improperly. Yet, because educators should obviously be attentive to the consequences of test use, the notion of consequential validity has been rendered superfluous because of the 2014 Standards’ emphasis on the validity of interpretations for specific uses. In other words, validity evidence is no longer to be collected to support just “any old interpretation.” Now that interpretation must be focused on a particular, announced use of test-based interpretations. In essence, framers of the Joint Standards have skillfully subsumed consequences into a more focused definition of assessment validity. Consequences must be heeded, of course, but they should be heeded in relation to a specified use of a test’s results.

A final point is in order regarding the labels for validity evidence. No one expects classroom teachers to be psychometric whiz-bangs. Teachers should know enough about assessment so that they can do a solid instructional job. In terms of validity evidence and how it ought to be employed by teachers, this means that when it comes to assessment validity, you do not need to keep constantly au courant. (This is either a French phrase for “up-to-date” or a scone with especially small raisins.) However, even though you need not become a validity expert, you probably should know that because the 1999 Standards were the only game in town for about 15 years, some of the labels used in that document still linger today. At this writing, it is unclear when an updated version of the Joint Standards will be released and, once it is, whether it will embody significant modifications.

Some of your colleagues, or even some assessment specialists, might still employ the following phrases to describe validity evidence: “content-related evidence of validity” and “criterion-related evidence of validity.” We’ve already considered both of these under slightly different labels. The one type of validity evidence that, though it was widely used in years past, we have not considered is “construct-related evidence of validity.” It was also sometimes called simply “construct validity.” Those labels refer to a more comprehensive approach to collecting relevant evidence and building a validity argument that supports test-based interpretations regarding the construct that test developers are attempting to measure. Because of the more usage-emphasized manner in which the new Standards have attempted to lay out the validity landscape, there now seems to be no need for this traditional way of characterizing the accuracy with which we make score-based interpretations about test-takers—interpretations focused on particular uses.

The Relationship Between Reliability and Validity

If convincing evidence is gathered that a test is permitting valid score-based interpretations for specific uses, we can be assured the test is also yielding reasonably reliable scores. In other words, valid score-based inferences almost certainly guarantee that consistent test results are present. The reverse, however, is not true. If a test yields reliable results, it may or may not yield valid score-based interpretations. A test could be measuring with remarkable consistency a construct the test developer never even contemplated measuring. For instance, although the test developer thought an assessment procedure was measuring students’ punctuation skills, what was actually being measured might be students’ general intellectual ability, which, not surprisingly, splashes over into how well students can punctuate. Inconsistent results invariably preclude the validity of score-based interpretations. Evidence of valid score-based interpretations almost certainly ensures that consistency of measurement is present.

A Potential Paradox

We’ve just about wrapped up this fourth chapter focused on the gravity and grandeurs of assessment validity. However, before we scurry forward, there’s one final validity issue that warrants your consideration. It’s often referred to as the validity paradox.

In the spring of 2023, Thompson, Rutkowski, and Rutkowski (2023) called our attention to the persistent challenge encountered by educators who, though “assessment literate” in their general understanding of educational assessment’s essentials, do not possess the technical knowledge often needed to understand the subtleties of interpreting test-collected data. Put simply, if limitations in the assessment understandings of classroom teachers, school administrators, and other educational professionals preclude those educators from determining the worth of test-based evidence, how can such evidence contribute to our making defensible educational decisions?

Thompson, Rutkowski, and Rutkowski (2023) propose an analytic model that can be employed to help educators determine whether available positive and negative score-based evidence accompanying tests, particularly large-scale standardized tests, is sufficiently convincing to permit defensible test-based inferences about test-takers. Starting with test-collected data, educators are urged to formulate a results-based claim and its opposite as well as results-based support and nonsupport for both of those claims in order to evaluate which claim is more defensible. Such analyses can then be concluded by educators who have seriously considered test-collected support of an inference-based educational action or an educational policy stance.

What Do Classroom Teachers Really Need to Know About Validity?

Well, we’ve spent a fair number of words worrying about the validity of score-based inferences. How much, if anything, do classroom teachers really need to know about validity? Do classroom teachers need to collect validity evidence for their own tests? If so, what kind(s)?

As with reliability, a classroom teacher needs to understand what the essential nature of the most common kinds of validity evidence is, but classroom teachers don’t need to embark on an evidence-gathering frenzy regarding validity. Clearly, if you’re a teacher or a teacher in preparation, you’ll be far too busy in your classroom keeping ahead of the students to spend much time assembling validity evidence. However, for your more important tests, you should probably devote at least some attention to validity evidence based on test content. Giving serious thought to the content of an assessment domain being represented by a test is a good first step. Having a colleague review your tests’ content is also an effective way to help make sure that your own classroom tests represent satisfactorily the content you’re trying to promote, and that your score-based interpretations about your students’ content-related status are not miles off the mark.

Regarding the other types of validity evidence, however, little more than a reasonable understanding of the nature of those kinds of evidence is needed. If you’re ever asked to help scrutinize a high-stakes educational test, you’ll want to know enough about such versions of validity evidence so that you’re not intimidated when measurement specialists start chanting their “odes to validity.”

A decidedly unusual set of circumstances, however, the worldwide COVID-19 pandemic, currently faces American public educators. Accordingly, it would surely be beneficial if today’s teachers—as well as current teachers-in-training—both recognized and attempted to cope with the consequences of the COVID pandemic. Looking back, during the third academic year of the pandemic, which began in spring 2020, it became increasingly apparent that in many parts of the world, most definitely including the United States, substantial shortages of teachers would soon be present. By the spring of 2022, conservative estimates indicated that well over 1 million U.S. citizens had perished because of COVID. As the pandemic persisted, the predicted teacher shortages became even greater than feared. In response, hosts of stopgap alternative procedures were implemented to deal with this dearth of teachers. The efficacy of these diverse teacher-recruitment tactics has not yet been formally determined.

Nonetheless, today’s classroom teachers must recognize that many of their recently hired colleagues will, of necessity, often be bringing less than a full array of thoughtfully honed instructional and assessment procedures to school with them. What this means, in practical terms, is that if you possess a reasonable mastery of key measurement understandings—such as the notions you encountered while reading this chapter about validity, the “most important” concept in all of educational testing—please consider sharing your assessment insights with these novice colleagues. Perhaps, with their collaboration, you could initiate a series of informal group interactions dealing with such pivotal issues as validity and reliability. Or, maybe, you could set a few informal one-on-one “chats” dealing with what you regard as the most relevant notions of educational assessment.

If you complete this book, and perhaps do so in connection with a formal course to which the book is linked, think hard about whether you are willing to share important assessment insights with your colleagues. Please approach such a task thinking like a teacher who is in “planning mode.” More specifically, regard as a prioritizing challenge any support you might provide to a colleague who knows much less than you about the basics of educational assessment. You’ll learn by the close of your course, or from reading this book on your own, that there are many, many chunks of assessment content that can be shared with a colleague who knows precious little about educational testing. But please, choose with care the chunks you wish to treat. It will be far better, for example, to dig into only a handful of topics, such as validity, reliability, and fairness—attempting to strengthen a colleague’s grasp of the key ideas in those three arenas—than to try to treat many more assessment topics (all important, to be sure). Remember, in most cases, more is definitely less and, conversely, less turns out to be more.

In short, our schools are facing a serious set of instructional and assessment problems. If, given your newly acquired insights regarding the testing of students, you can position yourself to help beginning teachers deal successfully with their assessment problems, then everybody wins—most importantly, your school’s students. Put simply, you will soon know things about assessment that every truly professional teacher should know. So, while skillfully employing any interpersonal skills you can invoke when dealing with a less knowledgeable colleague, share the most significant assessment knowledge with colleagues who need it.

All the while, as this chapter is churning toward its conclusion, please keep in mind that assessment validity—the most important construct in the entire educational testing game—depends on the persuasiveness of evidence offered in support of a test’s ability to provide accurate score-based inferences relevant to a test’s intended use. In a nutshell, that’s assessment validity.


But What Does This Have to Do with Teaching?

Most teachers believe, mistakenly, that their classroom tests are pretty spiffy assessment instruments. Recently I asked five teachers whom I know, “Do you think your classroom tests are pretty spiffy assessment instruments?” Three of the five teachers said, “Yes.” One said, “No.” And one asked, “What the devil does ‘spiffy’ mean?” (I counted this as a negative response.) Thus, based on my extensive opinion poll, I can support my contention that most teachers think their classroom tests are pretty good—that is, spiffy.

And this is where the concept of validity impinges on teaching. Many teachers, you see, believe their classroom assessments produce results that are really rather accurate. Many teachers, in fact, think “their tests are valid.” The idea that validity resides in tests themselves is a prevalent misconception among the nation’s educators. And, unfortunately, when teachers think that their tests are valid, they begin to place unwarranted confidence in the scores their students attain on those “valid” tests—irrespective of the nature of the decisions informed by the so-called “valid” test’s results.

What this chapter has been stressing, restressing, and re-restressing is the idea that validity applies to test-based interpretations, not to the tests themselves. Moreover, for teachers to be confident that their score-based inferences are valid, it’s usually necessary to assemble some compelling evidence to support the accuracy of a teacher’s score-based inferences. And because teaching decisions are often based on a teacher’s estimate of students’ current achievement levels, it is apparent that unwarranted confidence in those test-based estimates can sometimes lead to faulty interpretations and, as a result, to unsound instructional decisions.

Ideally, teachers will accept this text’s central contention that validity of test-based inferences requires teachers to make a judgment about the accuracy of their own test-based inferences. Moreover, truly compelling data are needed for a teacher to be super-sure that test-based judgments about students’ achievement levels are actually on target. Yet, because compelling data are rarely at hand, teachers are more likely to recognize that their instructional decisions must often be based on a fairly fragile assessment foundation.

In today’s high-stakes environment, it is even more important that teachers regard validity as a judgment-derived commodity. For example, suppose your state’s officially approved curriculum contains scads of content standards. Because scads signifies a large number (and is rarely encountered in its singular form, scad), your state clearly has a great many content standards to be taught—and tested. Now let’s suppose your state’s annual accountability tests are intended to assess students’ mastery of those scads of content standards—in a few hours of testing. Well, if that’s the situation, do you really think there is apt to be great accuracy in the following statement by an elected state official? “Mastery of our state’s content standards can be determined on the basis of students’ scores on our state’s valid accountability tests.” Even elected state officials, you see, can make mistakes in their pronouncements. For one thing, there are too many content standards for students’ “mastery” to be determined. Second (and you know this by now), tests themselves aren’t valid (or invalid).

Chapter Summary

This chapter attempted to promote not only an understanding of the nature of assessment validity, but also an appreciation on the part of the reader for the enormously important role that validity plays in educational assessment. Its content had been meaningfully molded by the 2014 Standards for Educational and Psychological Testing. Because the periodically published Standards is developed under the aegis of the three national organizations most responsible for educational assessment, we can safely anticipate that if a subsequent version of the Joint Standards is soon published, its impact on educational testing will be as important as that of the predecessor versions of that important set of guidelines.

The 2014 Standards, far more explicitly than in any earlier version of this significant document, stress that validity consists of the “degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (AERA, 2014, p. 11). The emphasis therein is on the need to collect validity evidence for any intended use of the tests—that is, for any usage-based interpretation of test-takers’ performances. Thus, if a test had been originally created to satisfy one purpose (such as supplying comparative score-interpretations of students’ content mastery), and the test were to be used for another purpose, validity evidence should be collected and fashioned into separate arguments supporting the validity of score-based interpretations for both the original and the new use. Such evidence-laced arguments are typically presented then in some sort of technical report regarding a test’s quality. And, of course, relatively few school districts or state departments of education possess the technical wherewithal or the available staff members to carry out defensible validity analyses of the assessments being used in their situations.

Four sources of validity evidence were described in the chapter. Two of those sources are encountered more frequently by educators and, therefore, were dealt with in greater detail. Those two validity sources were (1) evidence based on test content and (2) evidence based on relations with other variables. The kind of validity evidence most likely to be encountered by teachers involves test-based predictions of students’ performances on a criterion variable (such as high school students’ college grades). A second kind of relationship between students’ test scores and other variables arises when evidence is collected so that those test scores can be correlated with both similar and dissimilar variables. As always, the quest on the part of test developers and test users is to devise a compelling argument attesting to the accuracy of score-based interpretations for particular test uses.

The relationship between validity evidence and reliability evidence was considered, as were certain unsanctioned labels for validity. It was suggested that, in general, more validity evidence is better than less, but that classroom teachers need to be realistic in deciding how much evidence of validity to secure for their own tests. A recommendation was given to have teachers become familiar with all forms of validity evidence, but to focus only on content-based evidence of validity for their own classroom assessment procedures. Most important, it was stressed that teachers must recognize validity as a judgmental inference about the interpretation of students’ test performances—not an attribute of tests themselves.

A concluding plea was offered calling for readers to do what they could to promote their colleagues’ grasp of assessment fundamentals, particularly those associated with the near- certain COVID-generated shortage of trained and certified teachers. It was suggested that by spurring the increased offering of short, informal sessions exploring a modest number of key con- structs such as test validity, fewer measurement- related mistakes would be made by teachers and, as a result, stronger instruction for students would follow.


References

American Educational Research Association. (2014). Standards for educational and psychological testing. Author.

Bonner, S. (2013). Validity in classroom assessment: Purposes, properties, and principles. In J. H. McMillan (Ed.), SAGE Handbook of research on classroom assessment (pp. 87–106). SAGE Publications. https://doi.org/10.4135/9781452218649.n6

Cizek, G. J., Kosh, A. E., & Toutkoushian, E. K. (2018). Generalizing and evaluating validity evidence: The generalized assessment tool. Journal of Educational Measurement, 55(4): 477–512.

Kane, M., & Wools, S. (2019). Perspectives on the validity of classroom assessments. In S. M. Brookhart & J. H. McMillan (Eds.), Classroom assessment and educational measurement (1st ed., pp. 12–26). Routledge.

Learning Sciences International. (2018, September 11). LSI Webinar: There Is No Such Thing as a Valid Test by Dylan Wiliam (part 1) [Video]. YouTube. https://www.youtube.com/watch?v=1IE7yc9Wo04

Popham, W. J. (1997). Consequential validity: Right concern—wrong concept. Educational Measurement: Issues and Practice, 16(2): 9–13.

Thompson, G., Rutkowski, D., & Rutkowski, L. (2023, March). How to make more valid decisions about assessment data. Kappan, 104(6): 34–39.

Webb, N. L. (2002). Alignment study in language arts, mathematics, science, and social studies of state standards and assessment for four states. Council of Chief State School Officers.

Wiliam, D. (2020, May 2). What every teacher needs to know about assessment [Video]. YouTube. https://www.youtube.com/watch?v=waRX-IOR5vE


A Testing Takeaway

Assessment Validity’s Two Requisites*

W. James Popham, University of California, Los Angeles

Assessment validity is, far and away, the most important concept in educational testing. This is asserted in the most recent revision of the Standards for Educational and Psychological Testing, a significant collection of assessment guidelines issued by the three most prestigious national associations concerned with educational testing. As the revised standards put it, “validity is, therefore, the most fundamental consideration in developing and evaluating tests” (American Educational Research Association, 2014, p. 11).

What is this “most fundamental” concept in educational testing, and how do we determine whether it is present? First, a definition:

Assessment validity is the degree to which an evidence-based argument supports the accuracy of a test’s interpretations for a proposed use of the test’s results.

As you can see, in testing, validity is focused on coming up with accurate interpretations. This is understandable, because the things educators are most concerned with, such as students’ cognitive skills and bodies of knowledge, are invisible. We simply can’t see what’s going on inside students’ minds. Accordingly, we use a student’s overt responses to educational tests to help us arrive at interpretations regarding the student’s covert skills and knowledge. Validity is, at bottom, an inference-making strategy for reaching conclusions about what we can’t see.

Assessment validity, therefore, does not refer to tests themselves. There is no such thing as “a valid test” or “an invalid test.” Rather, what’s valid or invalid is the inference about a student test-taker that’s based on the student’s test performance. The danger with ascribing validity to a test itself is that once we regard a test as valid, we begin to believe this so-called “valid” test is suitable for all sorts of purposes, even some for which it isn’t suitable at all.

Please note in the above definition that there are two requirements for assessment validity. When those who are developing or evaluating an educational test assemble evidence supporting assessment validity, both of these requirements must be addressed:

• Interpretation Accuracy: Evidence indicating that the score-based inferences about students’ status are accurate

• Usage Support: Evidence indicating that a test’s score-based inferences contribute to the proposed use of the test

The evidence to confirm an educational test’s ability to provide accurate interpretations and support a test’s intended use is typically presented in the form of a “validity argument.” If this argument fails to supply convincing evidence regarding both interpretation accuracy and usage support, then employing the test will probably lead to unsound educational decisions and, consequently, miseducated children.

*From Chapter 4 of Classroom Assessment: What Teachers Need to Know, 10th ed., by W. James Popham. Copyright 2022 by Pearson, which hereby grants permission for the reproduction and distribution of this Testing Takeaway, with proper attribution, for the intent of increasing assessment literacy. A digitally shareable version is available from https://www.pearson.com/store/en-us/pearsonplus/login.
