

Assessing the Validity of Inferences Made from Assessment Results

Sources of Validity Evidence

• Validity evidence can be gathered during the development of the assessment or after the assessment has been developed.

• Some of the methods used to gather validity evidence can support more than one type of source (e.g., test content, internal structure).

• Large-scale assessment developers and local classroom assessment developers often use different methods to gather validity evidence.

o Large-scale assessment developers use more formal, objective, systematic, and statistical methods to establish validity.

o Teachers use more informal and subjective methods, which often do not involve the use of statistics.

Evidence based on Test Content

• Questions one is striving to answer when gathering validity evidence based on test content or construct:

o Does the content of the items that make up the assessment fully represent the concept or construct the assessment is trying to measure?

o Does the assessment accurately represent the major aspects of the concept or construct and not include material that is irrelevant to it?

o To what extent do the assessment items represent a larger domain of the concept or construct being measured?

• The greater the extent to which an assessment represents all facets of a given concept or construct, the better the validity support based on the test content or construct. There is no specific statistical test associated with this source of evidence.

• Methods used to gather validity evidence based on test content or construct

o Large-scale assessments

▪ Have experts in the concept or construct being measured create the assessment items and the assessment itself.

▪ Have experts in the concept or construct examine the assessment and review it to see how well it measures the concept or construct. These experts would think about the following during the review process:

▪ The extent to which the content of the assessment represents the content or construct’s domain or universe.

▪ How well the items, tasks, or subparts of the assessment fit the definition of the construct and/or the purpose of the assessment.

▪ Whether the content or construct is underrepresented, or whether there are content- or construct-irrelevant aspects of the assessment that may result in unfair advantages for one or more subgroups (e.g., Caucasians, African Americans).

▪ The relevance, importance, clarity, and lack of bias of the assessment’s items or tasks.

o Local classroom assessments

▪ Develop assessment blueprints that indicate what will be assessed as well as the nature of the learning (e.g., knowledge, application, etc.) that should be represented on the assessment.

▪ Build a complete set of learning objectives or targets, showing the number and/or percentage of items/questions on the assessment devoted to each.

▪ Discuss with others (e.g., teachers, administrators, content experts, etc.) what constitutes essential understandings and principles.

▪ Ask another teacher to review your assessment for clarity and purpose.

▪ Review assessments before using them to make judgments about whether the assessment, when considered as a whole, reflects what it purports to measure.

▪ Review assessments to make judgments about whether items on the assessment accurately reflect the manner in which the concepts were taught.

▪ Have procedures in place to ensure that the nature of the scoring criteria reflects important objectives. For example, if students are learning a science skill in which a series of steps needs to be performed, the assessment task should require them to show their work.

▪ When scoring answers, give credit for partially correct answers when possible.

▪ Examine assessment items to see whether they favor groups of students who are more likely to have useful background knowledge (for instance, boys or girls).
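The blueprint idea above (listing the number and percentage of items devoted to each learning objective) can be sketched in a few lines of Python. This is only an illustration: the objective labels and the item-to-objective mapping are hypothetical, and `blueprint_summary` is a helper name invented for this sketch.

```python
from collections import Counter

# Hypothetical blueprint: each item number is tagged with the learning
# objective it is meant to measure. The labels are illustrative only.
item_objectives = {
    1: "define terms",
    2: "define terms",
    3: "apply procedure",
    4: "apply procedure",
    5: "apply procedure",
    6: "interpret results",
    7: "interpret results",
    8: "interpret results",
    9: "interpret results",
    10: "define terms",
}


def blueprint_summary(item_objectives):
    """Return {objective: (item_count, percent_of_assessment)}."""
    counts = Counter(item_objectives.values())
    total = sum(counts.values())
    return {obj: (n, round(100 * n / total, 1)) for obj, n in counts.items()}


summary = blueprint_summary(item_objectives)
for objective, (count, percent) in summary.items():
    print(f"{objective}: {count} items ({percent}%)")
```

A teacher could compare these percentages with the relative emphasis each objective received in instruction; a large mismatch suggests the assessment may over- or under-represent part of the content.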

Evidence based on Response Processes

• Questions one is striving to answer when gathering validity evidence based on response processes:

o Do the assessment takers understand the items on the assessment to mean what the assessment developer intends them to mean?

o To what extent do the actions and thought processes of the assessment takers demonstrate that they understand the concept or construct in the same way it is defined by the assessment developer?

• The greater the extent to which the assessment developer can be certain that the actions and thought processes of assessment takers demonstrate that they understand the concept or construct in the same way he/she (the assessment developer) has defined it, the greater the validity support via evidence based on response processes. There is no specific statistical test associated with this source of evidence.

• Methods used to gather validity evidence based on response processes

o Large-scale assessments

▪ Analyses of individuals’ responses to items or tasks via interviews with respondents

▪ Studies of the similarities and differences in responses supplied by various subgroups of respondents

▪ Studies of the ways that raters, observers, interviewers, and judges collect and interpret data

▪ Longitudinal studies of changes in responses to items or tasks

o Local classroom assessments

▪ Use different methods to measure the same learning objective. For example, one could use teacher observation and quiz performance. The more closely the results of the two match, the greater the validity support.

▪ If the test is meant to assess specific cognitive levels of learning (e.g., application), make sure the items reflect this cognitive level. For example, essay items would be more appropriate than fill-in-the-blank items for measuring application of knowledge.

▪ Reflect on whether or not students could answer items on the assessment without really knowing the content. For example, is a student performing well because of good test-taking skills and not necessarily because he/she knows the content?

▪ Ask students before or after taking the assessment how they interpret the items, to make sure their interpretation is in line with how you expected them to interpret the items. This procedure is called a think-aloud and gives one insight into the thought processes of the student.
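The suggestion above to measure the same learning objective with two methods and see how closely the results match can be made concrete with a simple percent-agreement calculation. This is a minimal sketch under stated assumptions: the mastery judgments below are made-up data, and percent agreement is just one informal way to summarize a match, not a method prescribed by this handout.

```python
def percent_agreement(ratings_a, ratings_b):
    """Percentage of students for whom two methods give the same judgment."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both lists must be non-empty and the same length.")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100 * matches / len(ratings_a)


# Hypothetical data: True means the student appeared to have mastered
# the objective according to that method (same student order in both lists).
observation = [True, True, False, True, False, True, True, False]
quiz = [True, True, False, False, False, True, True, True]

agreement = percent_agreement(observation, quiz)
print(f"Observation and quiz agree for {agreement:.0f}% of students")
```

Students on whom the two methods disagree are good candidates for a follow-up conversation about how they interpreted the items.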

Evidence based on Internal Structure

• Question one is striving to answer when gathering validity evidence based on internal structure:

o To what extent are items in a particular assessment measuring the same thing?

• When there is more than one item (ideally 6 to 8 items) on a test measuring the same thing, the more highly students’ individual answers to these items are related/correlated, the greater the validity support via evidence of internal structure. There are specific statistical tests associated with this source of evidence, but these statistical tests are typically used by large-scale assessment developers and not with local classroom assessments.

• Methods used to gather validity evidence based on internal structure

o Large-scale assessments

▪ Factor-analytic or cluster-analytic studies

▪ Analyses of item interrelationships, using item analysis procedures (e.g., item difficulty, etc.)

▪ Differential item functioning (DIF) studies

o Local classroom assessments

▪ Include in one assessment multiple items (ideally 6 to 8) assessing the same thing. Don’t rely on a single item. For example, on an assessment use several items to measure the same skill, concept, principle, or application. Using 6 to 8 items provides enough consistency to conclude from the results that students do or do not understand the concept measured by that set of items. For example, one would expect that a student who understands the concept would perform well on all the items measuring that concept, and as consistency among the item results increases, so too does the validity support. Consistency of responses provides good evidence for the validity of the inference that a student does or does not know the concept.
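As a rough illustration of what "consistency among the item results" can look like numerically, the sketch below computes item difficulty (proportion correct) and Cronbach's alpha for a small set of items. The handout does not name a specific statistic, so the choice of Cronbach's alpha is an assumption made here for illustration, and the score matrix is made-up data.

```python
from statistics import pvariance

# Hypothetical scores: rows are students, columns are six items that all
# target the same concept (1 = correct, 0 = incorrect).
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def item_difficulty(scores):
    """Proportion of students answering each item correctly."""
    n_students = len(scores)
    return [sum(column) / n_students for column in zip(*scores)]


def cronbach_alpha(scores):
    """Cronbach's alpha: higher values mean more consistent item responses."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("At least two items are needed.")
    item_variances = [pvariance(column) for column in zip(*scores)]
    total_variance = pvariance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("Total scores do not vary; alpha is undefined.")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


difficulty = item_difficulty(scores)
alpha = cronbach_alpha(scores)
print("Item difficulty:", difficulty)
print(f"Cronbach's alpha: {alpha:.2f}")
```

With only a handful of students the result is unstable, so a classroom teacher would treat it as a rough check rather than a formal statistical test.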

Evidence based on Relations to Other Variables

• Questions one is striving to answer when gathering validity evidence based on relationships to other variables:

o To what extent are the results of an assessment related to the results of a different assessment that measures the “same” thing?

o To what extent are the results of an assessment related to the results of a different assessment that measures a different thing?

• If observed relationships match predicted relationships, then the evidence supports the validity of the interpretation. There are specific statistical tests associated with this source of evidence, but these statistical tests are typically used by large-scale assessment developers and not with local classroom assessments.

• Methods used to gather validity evidence based on relationships to other variables

o Large-scale assessments

▪ Correlational studies of

• the strength and direction of the relationships between the measure and external “criterion” variables

• the extent to which scores obtained with the measure predict external “criterion” variables at a later date

▪ Group separation studies, based on decision theory,

• That examine the extent to which a score obtained with an instrument accurately predicts outcome variables.

▪ Convergent validity studies

• That examine the strength and direction of the relationships between the measure and the other variables that the measure should, theoretically, have high correlations with.

▪ Discriminant validity studies

• That examine the strength and direction of the relationships between the measure and the other variables that the measure should, theoretically, have low correlations with.

▪ Experimental studies

• That test hypotheses about the effects of an intervention on scores obtained with an instrument.

▪ Known-group comparison studies

• That test hypotheses about expected differences in average scores across various groups of respondents.

▪ Longitudinal studies

• That test hypotheses about expected differences

o Local classroom assessments

▪ Compare one group of students who are expected to obtain high scores with another group of students who are expected to obtain low scores. One would expect the scores to match the prior expectations (e.g., students expected to score low would actually score low), and the more the results match the expectations, the greater the validity support.

▪ Compare scores obtained before instruction with scores obtained after instruction. The expectation would be that scores would increase, so increases from pre to post would provide validity support.

▪ Compare two measures of the same thing and look for similarities or discrepancies between scores. For example, you could compare homework and quiz performance; the fewer the discrepancies, or the more the similarities, the greater the validity support.
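Two of the comparisons above lend themselves to a quick calculation: the relationship between two measures of the same thing (for example, homework and quiz scores) and the change from pre-instruction to post-instruction. The sketch below uses a Pearson correlation and a mean gain score; both are illustrative choices made here rather than methods named in this handout, and all scores are made-up data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("Need two lists of the same length (at least 2).")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("Scores must vary for a correlation to be defined.")
    return sxy / sqrt(sxx * syy)


def mean_gain(pre, post):
    """Average post-minus-pre difference across students."""
    if len(pre) != len(post) or not pre:
        raise ValueError("Need two non-empty lists of the same length.")
    return sum(after - before for before, after in zip(pre, post)) / len(pre)


# Hypothetical scores for the same five students, in the same order.
homework = [70, 80, 90, 60, 85]
quiz = [65, 82, 88, 58, 90]
pre_test = [40, 55, 50, 35, 60]
post_test = [70, 80, 75, 65, 85]

r = pearson_r(homework, quiz)
gain = mean_gain(pre_test, post_test)
print(f"Homework-quiz correlation: {r:.2f}")
print(f"Mean pre-to-post gain: {gain:.1f} points")
```

A strong positive correlation and a positive mean gain would both match the predicted relationships described above; the same `pearson_r` function could be applied to a measure of a different construct, where a low correlation would be expected.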

Evidence based on Consequences of Testing

• Questions one is striving to answer when gathering validity evidence based on consequences of testing:

o What are possible intended and unintended results of having students take the assessment?

o Are the intended and unintended results of the assessment used to make decisions about the levels of student learning?

• Evidence based on consequences of testing is established by considering both positive and negative consequences of engaging in testing and using the test scores for decision-making purposes. For example, consequences of high-stakes testing could be the increased use of drill-and-practice instructional practices and the neglect of important subjects that are not tested.

• Methods used to gather validity evidence based on consequences of testing

o Large-scale assessments

▪ Longitudinal studies of the extent to which expected or anticipated benefits of testing are realized.

▪ Longitudinal studies of the extent to which unexpected or unanticipated negative consequences of testing occur.

o Local classroom assessments

▪ Before giving the assessment, articulate the intended consequences you expect, such as assessment results that will help you address, through instruction, areas in which students’ skills are lacking. Also articulate possible unintended consequences, such as low scores on an assessment lowering a student’s perceived self-efficacy. Another unintended consequence of heavy use of objective items could be to encourage students to learn for recognition, whereas essay items motivate students to learn in a way that stresses the organization of information.

▪ After giving the assessment compare predicted, intended consequences with actual consequences.


Evidence based on Instruction

• Question one is striving to answer when gathering validity evidence based on instruction:

o How well does the content taught, or the content the students have had the opportunity to learn, align with what is actually assessed on an assessment?

• This is only a consideration for local classroom assessments and not large-scale assessments.

• Methods used to gather validity evidence based on instruction

▪ The teacher should reflect on how he/she taught the information by asking himself/herself questions like: Were my instructional methods appropriate? Did I spend enough time on each concept? What is the match between what I taught and what was assessed? Have students had adequate opportunities to learn what I am assessing? Were the concepts on the assessment actually taught, and taught well enough, so that students can perform well and demonstrate their understanding? Answering these questions should help determine whether performance on the assessment is due to learning or to other factors, such as the format of the assessment, gender, social desirability, and other possible influences on assessment performance besides instruction.