writing assignment
Promoting Reliability
Both McMillan and Darr (see below) provide suggestions on how to promote reliability in classroom assessments. Doing the things listed
below can help control both external and internal sources of error, which in turn helps bolster the reliability of test scores.
McMillan’s (2006, p. 51) suggestions on how to bolster or promote reliability in classroom assessments:
Motivate students to put forth their best efforts on assessments
Use a sufficient number of items or tasks. A minimum of 5 items is needed to assess a single trait or skill
Construct items, scoring criteria, and tasks that clearly differentiate students on what is being assessed, and make the criteria public
Make sure scoring procedures for constructed-response items are consistently applied to all students
Use independent raters or observers to score a sample of student responses, and check consistency with your evaluations
Build in as much objectivity into scoring as possible and still maintain the integrity of what is being assessed.
Continue assessments until results are consistent
Eliminate or reduce external sources of error
Use shorter assessments more frequently rather than fewer, longer assessments
Use several types of assessment tasks or methods of assessment
Use clearly identifiable anchors and other examples to illustrate scoring criteria
To the extent possible, standardize assessment and scoring protocols and procedures
Keep an item and test bank or file; don’t release tests for students to keep.
Darr’s (2005, p. 60) Factors Affecting Reliability (and Reliability Coefficients):
Number of tasks in assessment or test – more tasks will generally lead to higher reliability
Suitability of questions or tasks for the students being assessed – questions that are too hard or too easy for students will not increase reliability
The spread of scores produced by the assessment – the larger the spread, the higher the reliability
The training of the assessors
The clarity of marking guides and the checking of marking procedures
The wording of the rubric – carefully worded rubrics make it easier to decide on achievement levels
How closely standardized procedures and conditions for assessment are followed
How well questions are phrased
The anxiety or readiness of students for the assessment – assessing students when they are tired or after an exciting event is less
likely to produce reliable results.
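Darr's first factor (more tasks generally lead to higher reliability) can be illustrated with the Spearman-Brown prophecy formula. Neither McMillan nor Darr is quoted here as using this formula; it is brought in as an assumption to make the point concrete, and the starting reliability of 0.60 is a hypothetical value. The sketch also assumes that any added items are of similar quality to the existing ones.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is made length_factor times as long.

    Assumes the added (or removed) items are comparable to the existing ones.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical starting point: a quiz whose scores have a reliability of 0.60.
for factor in (0.5, 1, 2, 3):
    predicted = spearman_brown(0.60, factor)
    print(f"{factor:>3} x length -> predicted reliability {predicted:.2f}")
```

Under these assumptions, doubling the quiz raises the predicted reliability from 0.60 to 0.75, while halving it lowers the prediction to about 0.43.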
Reliability as a Continuum
I think it is helpful to think of reliability as a continuum. Perfect reliability of a score can never be achieved, especially when
attempting to measure something that cannot be directly observed, like student knowledge. A person’s score on a test will always contain some
degree of error because it is impossible to control all sources of internal and external error.
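One common way to picture this continuum is the classical test theory view, in which an observed score is a true score plus error, and reliability is the share of observed-score variance that comes from true scores. That framing is an assumption added here for illustration and is not taken from the readings; the variance values below are hypothetical.

```python
def reliability_from_variances(true_variance: float, error_variance: float) -> float:
    """Share of observed-score variance that is true-score variance."""
    if true_variance < 0 or error_variance < 0:
        raise ValueError("variances cannot be negative")
    # Assumes error is unrelated to true scores, so the variances simply add.
    observed_variance = true_variance + error_variance
    if observed_variance == 0:
        raise ValueError("observed variance is zero; reliability is undefined")
    return true_variance / observed_variance


# Hypothetical numbers: true-score variance held at 80 while error shrinks.
# An error variance of 0 is the unreachable ideal of perfect reliability.
for error_variance in (80, 20, 5, 0):
    value = reliability_from_variances(80, error_variance)
    print(f"error variance {error_variance:>2} -> reliability {value:.2f}")
```

As error shrinks, the value moves along the continuum toward 1.0 but only reaches it when there is no error at all.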
Sources of Reliability Evidence
The four basic sources of reliability evidence (stability across time, equivalence across forms, internal consistency, and scorer/rater
consistency) capture the ways we can go about assessing reliability within the context of an assessment or decision. However, not every source
is appropriate for every situation in which you want to assess the reliability of an assessment or decision.
How one goes about collecting reliability evidence for these four sources often differs based on whether the test is a classroom
assessment or a large-scale assessment. As Sarah Godlove Evans discusses in her blog, evidence of reliability is collected formally using
statistical analyses for large-scale assessments but more informally for classroom-based, teacher-created assessments. “Reliability is a trait
achieved through statistical analysis in a process called equating. Equating is one of the many behind-the-scenes functions performed by
psychometricians, folks trained in the statistical measurement of knowledge.”
In general, informal, classroom-based, teacher-created assessments do not directly engage with the concept of reliability, as these types of
assessments do not require advanced statistical analysis; however, they do informally engage with the concept.
Gathering Sources of Reliability Evidence
The table below summarizes each source of evidence under four questions: What does it tell us? When is it appropriate? How can we assess it
formally in large-scale assessments using statistical methods? How can we assess it informally in classroom assessment?
Evidence: Stability across time (test-retest)
What does it tell us?
Answers the questions:
--Are scores on a test stable across time?
--Are the results approximately the same when students take the same test on different occasions?
Provides an indication of how consistent the scores are over time within students on the same test.
When is it appropriate?
Only when one expects scores from “tests” administered at two different times to the same group of students to remain approximately the same.
How can we assess it formally in large-scale assessments using statistical methods?
The same test is administered at two different times to the same group of students.
Example: The scores from the two administrations are correlated, and the resulting correlation is called a test-retest reliability coefficient.
A high positive correlation supports stability across time.
How can we assess it informally in classroom assessment?
Not often assessed.
Why: Gathering evidence using stability across time is only a good estimate of reliability when what is being assessed is not expected to
change during the time span between the two tests. In a situation where a student has studied between the first time she took the test and the
second time she takes it, stability across time would not be an appropriate way to gather reliability evidence for these scores. One would
not expect her score the first time to be the same as her score the second time, since what was being assessed on the test was expected to
change due to her studying between the two testing periods.
Example: I can administer the test on one day to half of the students and the next day to the other half of the students. Reliability
evidence should show consistent scores across the student population, no matter what day the test was taken.
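The formal test-retest procedure described above (correlating the scores from two administrations of the same test) can be sketched as follows. The score lists are hypothetical, and the hand-written `pearson_r` helper is just one simple way to compute the correlation, not a method prescribed by the readings.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of the same length (at least 2 students)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    sum_xy = sum(a * b for a, b in zip(dev_x, dev_y))
    sum_xx = sum(a * a for a in dev_x)
    sum_yy = sum(b * b for b in dev_y)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical scores for the same six students on two occasions.
first_administration = [72, 85, 90, 64, 78, 95]
second_administration = [70, 88, 91, 60, 80, 93]

test_retest_r = pearson_r(first_administration, second_administration)
print(f"test-retest reliability coefficient: {test_retest_r:.2f}")
```

A coefficient close to 1 suggests that students kept roughly the same relative standing across the two occasions. The same calculation applies to parallel forms if the second list holds scores from the equivalent form.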
Evidence: Stability across test forms (parallel forms)
What does it tell us?
Answers the questions:
--Do scores on two forms (re-ordered items or different items) of a test measure the same thing?
--Are results approximately the same when different but equivalent tests are taken at the same time or at different times?
Provides an indication of how comparable or consistent the two forms of the test are.
When is it appropriate?
Only when one expects scores from equivalent tests administered at two different times to the same group of students to remain approximately
the same.
How can we assess it formally in large-scale assessments using statistical methods?
A test and an equivalent version of the test are administered to the same group of students.
Example: The scores from the two administrations are correlated, and the resulting correlation is called a parallel forms reliability
coefficient. A high positive correlation supports stability across test forms.
How can we assess it informally in classroom assessment?
Of conceptual use, but teachers typically do not compute an actual reliability coefficient.
When:
--Creating a different but equivalent version of a test for students who need to make up the original test
--Creating two equivalent versions of a test to discourage cheating off other students’ tests.
Example: I will gather reliability evidence by using alternative forms, Form 1 and Form 2, of my unit test to measure the same skill or
concept. I could have one group of students take Form 1 first and then take Form 2. The second group of students will take Form 2 first and
then Form 1. Although it would be rather difficult and time consuming, I could then correlate the scores on both forms and produce a
reliability coefficient. This would show me the strength of the relationship between the two scores.
Evidence: Internal consistency
What does it tell us?
Answers the question:
--Do different items on a test measure the same construct?
Provides an indication of how consistently the items or tasks within a test produce the same result.
When is it appropriate?
Only appropriate to use if the items on a test are measuring the same thing.
How can we assess it formally in large-scale assessments using statistical methods?
The results on different items on a test given to the same group of students are compared to see how well they relate. In other words, this
provides an indication of how consistently the items or tasks within a test produce the same result.
Examples: 3 ways to compute internal consistency reliability coefficients:
--Split-half
--Cronbach’s alpha
--Kuder-Richardson formula 20 or 21.
How can we assess it informally in classroom assessment?
Used conceptually; teachers typically do NOT compute an actual reliability coefficient.
When: Wanting to make sure that performance on items in a test that are measuring the same construct is consistent in terms of results.
Example: I will have several items measuring the same trait. My logic is that scores on these items measuring the same trait should be
correlated. This will assist me in estimating how well the items within my test are functioning in a consistent manner. This is particularly
useful for me because I only have to administer my test one time to gain this type of reliability evidence.
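Of the internal consistency coefficients named above, Cronbach's alpha is straightforward to compute from a single administration. The sketch below uses the standard formula, alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores), where k is the number of items. The item scores are hypothetical, and the choice of sample variance (`statistics.variance`) is an assumption of this sketch.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a table with one row per student, one column per item."""
    if len(item_scores) < 2:
        raise ValueError("need scores from at least two students")
    k = len(item_scores[0])
    if k < 2 or any(len(row) != k for row in item_scores):
        raise ValueError("every student needs scores on the same two or more items")
    # Transpose the table so that each tuple holds one item's scores.
    columns = list(zip(*item_scores))
    sum_item_variances = sum(variance(column) for column in columns)
    total_variance = variance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)


# Hypothetical rubric scores (0-4) for five students on four items
# that are all meant to measure the same trait.
scores = [
    [4, 3, 4, 4],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [1, 2, 1, 1],
    [4, 4, 3, 4],
]
print(f"Cronbach's alpha: {cronbach_alpha(scores):.2f}")
```

Values closer to 1 indicate that the items rank students in a similar way, which is only meaningful when the items are intended to measure the same construct.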
Evidence: Inter-rater
What does it tell us?
Answers the questions:
--When using different raters, does it matter who does the scoring?
--Is one rater's score similar/consistent to another rater's score?
Results from different raters/scorers can be compared to ascertain the level of agreement. This method is used to show how consistently two
or more assessors are scoring the same items/tasks.
When is it appropriate?
When more than one person is scoring/grading a test.
How can we assess it formally in large-scale assessments using statistical methods?
When a test taken by students is scored/graded by two or more scorers/graders.
The scores from the two raters/graders are compared using % agreement on items, or the total scores for each grader/rater are correlated.
High % agreement or high positive correlations support stability across raters.
How can we assess it informally in classroom assessment?
Used conceptually sometimes, but teachers typically do not compute an actual reliability coefficient.
When: Comparing one teacher’s scoring/grading of a test to another teacher’s scoring/grading. It is a good idea to do this when the scoring
is more subjective.
Example: Asking another teacher within my department to review a small sample of my students’ answers to the essay and performance
components of my unit test. I have already established the criteria they are to be evaluated on, which are specified in the rubric I created. I
will then see if the scores I came up with are in agreement with those my colleague assigned to the students’ answers.
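The percent agreement check mentioned above can be sketched in a few lines. The rubric scores are hypothetical, and the optional `tolerance` parameter (counting scores within one rubric point as agreeing) is an assumption added for illustration rather than something the readings specify.

```python
def percent_agreement(rater_a, rater_b, tolerance=0):
    """Percentage of responses on which two raters' scores differ by at most tolerance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same, non-empty set of responses")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if abs(a - b) <= tolerance)
    return 100 * matches / len(rater_a)


# Hypothetical rubric scores (1-4) on the same eight essay responses.
my_scores = [4, 3, 2, 4, 1, 3, 2, 4]
colleague_scores = [4, 3, 3, 4, 1, 2, 2, 4]

exact = percent_agreement(my_scores, colleague_scores)
adjacent = percent_agreement(my_scores, colleague_scores, tolerance=1)
print(f"exact agreement: {exact:.1f}%")            # 6 of 8 responses match
print(f"within one point: {adjacent:.1f}%")        # all 8 are within one point
```

Low exact agreement on a small sample is a signal to revisit the rubric wording or the anchors before scoring the rest of the responses.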
References
Darr, C. (2005). A hitchhiker’s guide to reliability. SET: Research Information for Teachers, 2005(3), 59-60.
Evans, S. G. (2013, November 6). Five characteristics of quality educational assessments – part two [Web log post]. Retrieved from
https://www.nwea.org/blog/2013/five-characteristics-quality-educational-assessments-part-two/
McMillan, J. H. (2008). Assessment essentials for standards-based education. Thousand Oaks, CA: Corwin Press.