
Promoting Reliability

Both McMillan and Darr (see below) provide suggestions on how to promote reliability in classroom assessments. Following these suggestions can help control both external and internal sources of error, which in turn helps bolster the reliability of test scores.

McMillan’s (2006, p. 51) suggestions on how to help bolster or promote reliability in classroom assessments:

• Motivate students to put forth their best effort on assessments.
• Use a sufficient number of items or tasks. A minimum of 5 items is needed to assess a single trait or skill.
• Construct items, scoring criteria, and tasks that clearly differentiate students on what is being assessed, and make the criteria public.
• Make sure scoring procedures for constructed-response items are consistently applied to all students.
• Use independent raters or observers to score a sample of student responses, and check consistency with your evaluations.
• Build as much objectivity into scoring as possible while still maintaining the integrity of what is being assessed.
• Continue assessments until results are consistent.
• Eliminate or reduce external sources of error.
• Use shorter assessments more frequently rather than fewer, longer assessments.
• Use several types of assessment tasks or methods of assessment.
• Use clearly identifiable anchors and other examples to illustrate scoring criteria.
• To the extent possible, standardize assessment and scoring protocols and procedures.
• Keep an item and test bank or file; don’t release tests for students to keep.

Darr’s (2005, p. 60) Factors Affecting Reliability (and Reliability Coefficients):

• Number of tasks in the assessment or test – more tasks will generally lead to higher reliability.
• Suitability of questions or tasks for the students being assessed – questions that are too hard or too easy for students will not increase reliability.
• The spread of scores produced by the assessment – the larger the spread, the higher the reliability.
• The training of the assessors.
• The clarity of marking guides and the checking of marking procedures.
• The wording of the rubric – carefully worded rubrics make it easier to decide on achievement levels.
• How closely standardized procedures and conditions for assessment are followed.
• How well questions are phrased.
• The anxiety or readiness of students for the assessment – assessing students when they are tired or after an exciting event is less likely to produce reliable results.
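The first factor in Darr's list (more tasks generally lead to higher reliability) is commonly quantified with the Spearman-Brown prophecy formula. Neither source above names this formula, so the sketch below is only an illustration under that assumption. The function name and the quiz figures are hypothetical, and the formula assumes the added items are comparable in quality to the existing ones.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict reliability after a test's length is multiplied by length_factor.

    Spearman-Brown prophecy formula: r_new = n * r / (1 + (n - 1) * r),
    where r is the current reliability and n is the length factor.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical example: a 5-item quiz with reliability 0.60 is doubled to 10 items.
predicted = spearman_brown(0.60, 2)
print(round(predicted, 2))  # 0.75
```

Under these assumptions, doubling the quiz raises the predicted reliability from 0.60 to 0.75, and halving it (a length factor of 0.5) would lower it.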


Reliability as a Continuum

I think that it is helpful to think of reliability as a continuum: absolute, perfect reliability of a score can never be achieved, especially when attempting to measure something that cannot be directly observed, such as student knowledge. A person’s score on a test will always contain some degree of error because it is impossible to control all sources of internal and external error.
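One common way to express "some degree of error" as a number is the standard error of measurement from classical test theory, where an observed score is treated as a true score plus error. This framing is not stated in the handout, so the sketch below is an assumption-laden illustration. The function name, the standard deviation, and the reliability value are all hypothetical.

```python
import math


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability), in the same units as the test scores."""
    if score_sd < 0:
        raise ValueError("score_sd must be non-negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * math.sqrt(1.0 - reliability)


# Hypothetical test: scores have a standard deviation of 10 points
# and an estimated reliability of 0.91.
sem = standard_error_of_measurement(10.0, 0.91)
print(round(sem, 2))  # about 3.0 score points
```

The error shrinks toward zero only as reliability approaches 1, which fits the idea of reliability as a continuum rather than an all-or-nothing property.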

Sources of Reliability Evidence

The 4 basic sources of reliability evidence (stability across time, equivalence across forms, internal consistency, and scorer/rater consistency) capture the ways we can go about assessing reliability within the context of an assessment or decision. However, not all ways of gathering reliability evidence are appropriate for every situation in which you want to assess the reliability of an assessment or decision.

How one goes about collecting reliability evidence for these four sources often differs based on whether the test is a classroom assessment or a large-scale assessment. As Sarah Godlove Evans discusses in her blog, evidence of reliability is collected formally using statistical analyses for large-scale assessments but more informally for classroom-based, teacher-created assessments. “Reliability is a trait achieved through statistical analysis in a process called equating. Equating is one of the many behind-the-scenes functions performed by psychometricians, folks trained in the statistical measurement of knowledge.”

In general, informal, classroom-based, teacher-created assessments do not directly engage with the concept of reliability, as these types of assessments do not require advanced statistical analysis; however, they do informally engage with the concept.

Gathering Sources of Reliability Evidence

Evidence | What does it tell us? | When is it appropriate? | How can we assess it formally in large-scale assessments using statistical methods? | How can we assess it informally in classroom assessment?

Stability across time (test-retest)

What does it tell us? Answers the questions:
--Are scores on a test stable across time?
--Are the results approximately the same when students take the same test on different occasions?
Provides an indication of how consistent the scores are over time within students on the same test.

When is it appropriate? Only when one expects scores from “tests” administered at two different times to the same group of students to remain approximately the same.

How can we assess it formally in large-scale assessments using statistical methods? The same test is administered at two different times to the same group of students.
Example: The scores from the two administrations are correlated, and the resulting correlation is called a test-retest reliability coefficient. High positive correlations support stability across time.

How can we assess it informally in classroom assessment? Not often assessed.
Why: Gathering evidence using stability across time is only a good estimate of reliability when what is being assessed is not expected to change during the time span between the two tests. In a situation where a student has studied between the first time she took the test and the second time, stability across time would not be an appropriate way to gather reliability evidence for these scores. One would not expect her score the first time she took the test to be the same as the second time, since what was being assessed on the test was expected to change due to her studying between the two testing periods.
Example: I can administer the test on one day to half of the students and the next day to the other half of the students. Reliability evidence should show consistent scores across the student population, no matter what day the test was taken.
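The test-retest coefficient described for large-scale assessments is a correlation between two sets of scores from the same students. A minimal sketch, assuming the Pearson correlation is the statistic used (the handout says only "correlated") and using made-up scores for five students:

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical scores for the same five students on two administrations.
first_administration = [70, 82, 65, 90, 75]
second_administration = [72, 80, 68, 91, 74]
print(round(pearson_r(first_administration, second_administration), 2))
```

A value close to 1 would support stability across time for these hypothetical scores. A value near 0 would suggest the scores are not stable, or that what is being measured changed between administrations.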

Stability across test forms (parallel forms)

What does it tell us? Answers the questions:
--Do scores on two forms (re-ordered items or different items) of a test measure the same thing?
--Are results approximately the same when different but equivalent tests are taken at the same time or at different times?
Provides an indication of how comparable or consistent the two forms of the test are.

When is it appropriate? Only when one expects scores from equivalent tests administered at two different times to the same group of students to remain approximately the same.

How can we assess it formally in large-scale assessments using statistical methods? A test and an equivalent version of the test are administered to the same group of students.
Example: The scores from the two administrations are correlated, and the resulting correlation is called a parallel forms reliability coefficient. High positive correlations support stability across test forms.

How can we assess it informally in classroom assessment? Of conceptual use, but teachers typically do not compute an actual reliability coefficient.
When:
--Creating a different but equivalent version of a test for students who need to make up the original test
--Creating two equivalent versions of a test to discourage cheating off other students’ tests
Example: I will establish reliability by using alternative forms, Form 1 and Form 2, of my unit test to measure the same skill or concept. I could have one group of students take Form 1 first and then take Form 2. The second group of students will take Form 2 first and then Form 1. Although it would be rather difficult and time-consuming, I could then correlate the scores on both forms and produce a reliability coefficient. This would show me the strength of the relationship between the two scores.
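The counterbalanced design in the classroom example (one group takes Form 1 first, the other takes Form 2 first) can be sketched as follows. The student names, scores, and both function names are hypothetical. Comparing the form means is an added, informal check of equivalence that the handout does not describe; the paired scores could afterwards be correlated with any standard Pearson routine.

```python
from statistics import mean


def counterbalance(students):
    """Alternate students between the two form orders."""
    orders = {}
    for index, name in enumerate(students):
        if index % 2 == 0:
            orders[name] = ("Form 1", "Form 2")
        else:
            orders[name] = ("Form 2", "Form 1")
    return orders


def form_mean_difference(form1_scores, form2_scores):
    """Mean of Form 1 scores minus mean of Form 2 scores."""
    return mean(form1_scores) - mean(form2_scores)


# Hypothetical class of four students and their scores on each form.
orders = counterbalance(["Ana", "Ben", "Cy", "Dee"])
print(orders["Ana"], orders["Ben"])
print(form_mean_difference([80, 70, 88, 62], [78, 74, 85, 65]))
```

A mean difference near zero is consistent with the two forms being similar in difficulty, although it does not by itself show that they rank students the same way.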

Internal consistency

What does it tell us? Answers the question:
--Do different items on a test measure the same construct?
Provides an indication of how consistently the items or tasks within a test produce the same result.

When is it appropriate? Only appropriate to use if the items on a test are measuring the same thing.

How can we assess it formally in large-scale assessments using statistical methods? The results on different items on a test given to the same group of students are compared to see how well they relate. In other words, this provides an indication of how consistently the items or tasks within a test produce the same result.
Examples: 3 ways to compute internal consistency reliability coefficients:
--Split-half
--Cronbach’s alpha
--Kuder-Richardson formula 20 or 21

How can we assess it informally in classroom assessment? Used conceptually; teachers typically do NOT compute an actual reliability coefficient.
When: Wanting to make sure that performance on items in a test that are measuring the same construct is consistent in terms of results.
Example: I will have several items measuring the same trait. My logic is that scores on these items measuring the same trait should be correlated. This will assist me in estimating how well the items within my test are functioning in a consistent manner. This is particularly useful for me because I only have to administer my test one time to gain this type of reliability.
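Of the three coefficients named for internal consistency, Cronbach's alpha is sketched below using its standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name and the item scores (1 = correct, 0 = incorrect) are hypothetical, and sample variances are an implementation choice rather than something the handout specifies.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a table with one row per student, one column per item."""
    n_students = len(item_scores)
    n_items = len(item_scores[0]) if n_students else 0
    if n_students < 2 or n_items < 2:
        raise ValueError("need at least 2 students and 2 items")
    if any(len(row) != n_items for row in item_scores):
        raise ValueError("every student needs a score on every item")
    # Variance of each item (column) and of each student's total score.
    columns = list(zip(*item_scores))
    item_variance_sum = sum(variance(column) for column in columns)
    total_variance = variance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores must vary")
    return (n_items / (n_items - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical results: five students, four items measuring the same trait.
scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(scores), 2))
```

As the classroom example notes, this kind of evidence needs only one administration of the test, because every input comes from a single set of item scores.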

Inter-rater

What does it tell us? Answers the questions:
--When using different raters, does it matter who does the scoring?
--Is one rater's score similar to/consistent with another rater's score?
Results from different raters/scorers can be compared to ascertain the level of agreement. This method is used to show how consistently two or more assessors are scoring the same items/tasks.

When is it appropriate? When more than one person is scoring/grading a test.

How can we assess it formally in large-scale assessments using statistical methods? A test taken by students is scored/graded by two or more scorers/graders. The scores from the two raters/graders are compared using % agreement on items, or the total scores for each grader/rater are correlated. High % agreement or high positive correlations support stability across raters.

How can we assess it informally in classroom assessment? Used conceptually sometimes, but teachers typically do not compute an actual reliability coefficient.
When: Comparing one teacher's scoring/grading of a test to another teacher’s scoring/grading. It is a good idea to do this when the scoring is more subjective.
Example: Asking another teacher within my department to review a small sample of my students’ answers to the essay and performance components of my unit test. I have already established the criteria they are to be evaluated on, which are specified in the rubric I created. I will then see if the scores I came up with are in agreement with those my colleague assigned to the students’ answers.
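The % agreement comparison between two raters can be sketched as follows. The function name and the rubric scores are hypothetical. The optional `tolerance` argument, which counts scores within a given number of rubric points as agreeing, is an assumption added for illustration and is not something the handout describes.

```python
def percent_agreement(rater_a, rater_b, tolerance=0):
    """Percentage of responses on which two raters' scores agree.

    With tolerance=0 only exact matches count; tolerance=1 also counts
    scores that differ by one rubric point.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if abs(a - b) <= tolerance)
    return 100.0 * matches / len(rater_a)


# Hypothetical rubric scores (1-4) for the same five essays.
my_scores = [4, 3, 2, 4, 1]
colleague_scores = [4, 3, 3, 4, 1]
print(percent_agreement(my_scores, colleague_scores))               # 80.0
print(percent_agreement(my_scores, colleague_scores, tolerance=1))  # 100.0
```

In this made-up sample, the two raters disagree on one essay by a single rubric point, so exact agreement is 80% while agreement within one point is 100%.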

References

Darr, C. (2005). A hitchhiker’s guide to reliability. SET: Research Information for Teachers, 2005(3), 59-60.

Evans, S. G. (2013, November 6). Five characteristics of quality educational assessments – part two [Web log post]. Retrieved from https://www.nwea.org/blog/2013/five-characteristics-quality-educational-assessments-part-two/

McMillan, J. H. (2008). Assessment essentials for standards-based education. Thousand Oaks, CA: Corwin Press.