Reliability-The Same Yesterday, Today, and
Tomorrow
In: Testing and Measurement
By: Mary E. Stafford
Pub. Date: 2011
Access Date: May 19, 2019
Publishing Company: SAGE Publications, Inc.
City: Thousand Oaks
Print ISBN: 9781412910026
Online ISBN: 9781412986106
DOI: https://dx.doi.org/10.4135/9781412986106
Print pages: 121-140
© 2006 SAGE Publications, Inc. All Rights Reserved.
This PDF has been generated from SAGE Research Methods. Please note that the pagination of the
online version will vary from the pagination of the print book.
Reliability-The Same Yesterday, Today, and Tomorrow
When selecting tests for use either in research or in clinical decision making, you want to make sure that
the tests you select are reliable. Reliability can be defined as the trustworthiness or the accuracy of a
measurement. Those of us concerned with measurement issues also use the terms consistency and stability
when discussing reliability. Consistency is the degree to which all parts of a test or different forms of a test
measure the same thing. Stability is the degree to which a test measures the same thing at different times
or in different situations. A reliability coefficient does not refer to the test as a whole, but it refers to scores
obtained on a test. In measurement, we are interested in the consistency and stability of a person's scores.
As measurement specialists, we need to ask ourselves, “Is the score just obtained by Person X (Person X
seems so impersonal; let's call her George) the same score she would get if she took the test tomorrow, or
the next day, or the next week?” We want George's score to be a stable measure of her performance on any
given test. The reliability coefficient is a measure of consistency. We also ask ourselves, “Is the score George
received a true indication of her knowledge, ability, behavior, and so on?” Remember obtained scores, true
scores, and error scores? The more reliable a test, the more George's obtained score is a reflection of her
true score.
A reliability coefficient is a numerical value that can range from 0 to 1.00. A reliability coefficient of zero
indicates the test scores are absolutely unreliable. In contrast, the higher the reliability coefficient, the more
reliable or accurate the test scores. We want tests to have reliabilities above 0.70 for research purposes and
in the 0.80s and 0.90s for clinical decision making. To compute a reliability coefficient, you must do one of
three things:
Administer the test twice and keep track of the time interval between the two administrations.
Administer two different forms of the same test.
Administer the test one time.
More about this later.
Let's Check Your Understanding
The consistency or stability of test scores is called___________.
Does a reliability coefficient reflect
a. Stability of a set of scores over time, or
b. Stability of a set of scores across different tests?
When we say that reliability reflects accuracy of scores, do we mean
a. How accurately the test measures a given concept, or
SAGE
2006 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods
Page 2 of 20 Testing and Measurement
b. How accurately the test measures the true score?
Reliability coefficients can range from _________ to ________.
When we are making clinical decisions (decisions about a person's life), we want the value of our
reliability coefficients to be at least __________.
A test must always be administered twice in order for you to compute a reliability coefficient. True
or false?
Our Model Answers
The consistency or stability of test scores is called reliability.
A reliability coefficient reflects the
a. Stability of a set of scores over time
When we say that reliability reflects accuracy of scores, we mean
b. How accurately the test measures the true score
Reliability coefficients can range from 0 to +1.00.
When we are making clinical decisions (decisions about a person's life), we want the value of our
reliability coefficients to be at least 0.80.
A test must always be administered twice in order for you to compute a reliability coefficient.
This statement is false.
The Mathematical Foundation of Reliability
Now that you've mastered these basic concepts about reliability, we want to review quickly the concepts of
obtained, true, and error score variance. Remember from classical test theory in Chapter 7 that it is assumed
that the variance in obtained scores comprises true score variance and error score variance. One goal of
measurement is to reduce as much error score variance as possible.
The equation $\sigma_o^2 = \sigma_t^2 + \sigma_e^2$ (obtained score variance equals true score variance plus
error score variance) is key to understanding the concept of reliability. If we divide both sides of this
equation by the obtained variance, $\sigma_o^2$, our equation is
$$1 = \frac{\sigma_t^2}{\sigma_o^2} + \frac{\sigma_e^2}{\sigma_o^2}$$
This equation shows us that the ratio of the true score variance to obtained score variance plus the
ratio of the error score variance to obtained score variance sums to 1. The ratio of the true score
variance to obtained score variance reflects the proportion of variance in obtained scores that is attributable
to true scores. This concept is the basic definition of reliability. Therefore, we can substitute the symbol for
reliability, $r_{tt}$, in the equation:
$$1 = r_{tt} + \frac{\sigma_e^2}{\sigma_o^2}$$
If we subtract the ratio of the error score variance to the obtained score variance ($\sigma_e^2 / \sigma_o^2$)
from both sides, we have the basic formula for reliability:
$$r_{tt} = 1 - \frac{\sigma_e^2}{\sigma_o^2}$$
The closer the error variance comes to equaling 0, the closer the reliability coefficient comes to equaling 1.00.
Error variance is never totally controlled, so it can never be equal to 0. This also means that the reliability
will never be 1.00. At the risk of being redundant, in measurement we try to control as many sources of error
variance as possible. To the extent that we can do this, the more accurately (reliably) we are measuring a
person's true knowledge, behavior, personality, and so on.
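To see the ratio at work, here is a minimal Python sketch that is not part of the original text. It assumes normally distributed true scores and errors. The function names, the standard deviations of 12 and 6, and the 5,000 simulated test takers are our own illustrative choices.

```python
import random
import statistics


def theoretical_reliability(true_sd, error_sd):
    """Reliability as true score variance divided by obtained score variance."""
    true_var = true_sd ** 2
    error_var = error_sd ** 2
    return true_var / (true_var + error_var)


def simulate_reliability(true_sd, error_sd, n_people=5000, seed=0):
    """Simulate obtained = true + error and return var(true) / var(obtained)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    true_scores = [rng.gauss(100, true_sd) for _ in range(n_people)]
    obtained = [t + rng.gauss(0, error_sd) for t in true_scores]
    return statistics.pvariance(true_scores) / statistics.pvariance(obtained)


# Hypothetical values: true score SD of 12, error SD of 6.
print(round(theoretical_reliability(12, 6), 2))  # 144 / (144 + 36) = 0.8
print(round(simulate_reliability(12, 6), 2))     # close to 0.8, not exact
print(round(theoretical_reliability(12, 3), 2))  # less error, higher reliability
```

The simulated value only approximates the theoretical one because, in any finite sample, true scores and errors are never perfectly uncorrelated.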
Let's Check Your Understanding
Did you get all of that??? Check your understanding by answering the questions below.
Reliability is mathematically defined as the ratio of ______________ variance in scores to the
____________ variance.
The symbol we use for reliability is _________________.
The lower the error variance, the _____________ the reliability.
Our Model Answers
Reliability is mathematically defined as the ratio of true score variance in scores to the obtained
score variance.
The symbol we use for reliability is rtt.
The lower the error variance, the higher the reliability.
We are so proud of you! Thank you for hanging in with us.
Types of Reliability Estimates
Although there are a variety of types of reliability, we will only cover the four most important and most used.
These four are test-retest reliability, alternate forms reliability, internal consistency reliability, and interrater
reliability.
Test-Retest Reliability
This is the easiest to remember, because its name tells you exactly what you're doing. You're giving the test
and then, after a designated time period, you're giving it again. The most common time periods are 1 week
to 1 month. When test scores are reliable over time (i.e., good test-retest reliability), error variance due to
time is controlled (as much as is possible). The statistical procedure used to examine test-retest reliability
is correlation. A correlation coefficient tells us to what extent people obtain the same scores across the two
testing times. The Pearson correlation coefficient is the statistic used to reflect test-retest reliability when total
scores on a test are continuous.
Let's say that you are trying to create the Flawless IQ Test. Because you want it to compete with the Wechsler
and Binet IQ tests, you create it with a mean of 100 and a SD of 15. You administer it to 100 sophomores (the
human fruit flies) at the beginning of the semester. One month later you give it to them again. You correlate
the students' scores on the first test with their scores when they took it the second time. In our example, the
correlation coefficient you obtain is a measure of 1-month test-retest reliability for the Flawless IQ Test. The
more closely your students' scores on the two administrations agree, the higher the reliability coefficient. If your reliability
coefficient is above 0.70, their scores are relatively stable over time (at least over 1 month). If your reliability
coefficient is in the 0.80s or 0.90s, you have a better claim to stability because more error variance has been
eliminated. Suppose the coefficient you obtain is 0.82. A value this high strongly suggests that students' scores
on this instrument are stable over a 1-month period.
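The test-retest procedure described above can be sketched in a few lines of Python. This is an illustration only: the eight pairs of scores are hypothetical (they are not the Flawless IQ data), and `pearson_r` is our own helper name rather than anything from the text or from SPSS.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list has no variance")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for eight students tested one month apart.
time1 = [85, 92, 100, 104, 110, 115, 121, 96]
time2 = [88, 90, 103, 101, 112, 111, 125, 99]

print(f"test-retest reliability = {pearson_r(time1, time2):.2f}")
```

The same function serves for alternate forms reliability: pass the scores on Form A and Form B instead of the scores from two testing times.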
Let's take another look at the reliability of the Flawless IQ Test when we take race or ethnicity into consideration. You
have 10 Latino students and 90 Euro-American students in your class. The scores for the 10 Latino students
on each of the two test administrations are reported below.
When the Pearson correlation coefficient is calculated for these 10 sets of scores, it is equal to 0.62. If we
depicted this symbolically, it would look like this:
$$r_{tt} = 0.62$$
This reliability coefficient indicates that your Flawless IQ Test is not so flawless. It may be stable across time
for your Euro-American students, but it is not as stable for your Latino students. The poor reliability of scores
for your Latino students may reflect cultural bias in your test. This reliability estimate of 0.62 suggests that
you are not measuring IQ for these students very accurately. We'll discuss cultural bias in Chapter 11 on the
ethics of testing.
Alternate Forms Reliability
This type of reliability is also known as parallel forms reliability. When you want to determine whether
two equivalent forms of the same test are really equivalent, you calculate a correlation coefficient that is
interpreted as alternate forms reliability. In order for two forms to be parallel, they need to have the same
number of items, test the same content, and have the same response format and options. Alternate forms of
a test are usually given at two different times. The beauty of the alternate forms procedure is that it helps to
control for sources of error due to both content variability and time.
Maybe an example will help illustrate what we are talking about. For those of you who have taken aptitude
tests such as the Miller Analogies Test (MAT), you may have discovered that there are multiple forms of the
MAT (if you've taken it more than once to try to improve your score). The following 10 students took one form
of the MAT in January and a different form of the MAT in May. Here are their scores:
The alternate forms reliability coefficient for these scores across this 4-month period is 0.95. If we depicted
this symbolically, it would look like this:
$$r_{tt} = 0.95$$
The measurement specialists who designed these alternate forms of the MAT did a great job. (Let's hope
they were given a bonus for a job well done!) This 0.95 is a very strong reliability coefficient and indicates that
sources of error due to both content variability and time were controlled to a very great extent.
Internal Consistency Reliability
This type of reliability is a bird of a different color. It does not require two testing times or two forms of a
test. You administer the test one time, and you let the computer find the mean of the correlations among all
possible halves of the test. This procedure is also called a split-half procedure. What you are trying to find out
is whether every item on the test correlates with every other item. You are assessing content stability.
Let's look at an example of a five-item test designed to measure Personal Valuing of Education. The five
items are
How strong is your commitment to earning a bachelor's degree?
Getting a college degree will be worth the time required to obtain it.
Getting a college degree will be worth the money spent to obtain it.
Getting a college degree will be worth the work/effort required to get it.
How much do you value a college education?
The continuous response format ranges from 1 (not at all/strongly disagree) to 5 (very much/strongly agree).
Did you remember that this is called a continuous response format because there are more than two response
options for each item? The responses for 10 students to each of these items are presented in Table 9.1.
Table 9.1 Continuous Responses to the Personal Valuing of Education Scale
Because this test used a continuous response format, the type of internal consistency reliability coefficient
we calculated was a Cronbach's alpha. With the assistance of the Scale option in SPSS, we found that
the Cronbach's alpha for responses to these five items was 0.87. This suggests that for each student, their
responses to each of the items were very similar. An internal consistency reliability coefficient of 0.87 is
considered relatively strong and suggests that these five items are measuring the same content.
If you want to check our accuracy, open the Personal Valuing of Education data set on the Sage Web site at
http://www.sagepub.com/kurpius. Under Analyze, choose Scale, then choose Reliability Analysis. From the
window that opens, select and move the five test items from the left to the right using the arrow. Then click
Statistics and check all of the Descriptives in the next window. Click Continue and then OK. You will get results
that look like ours in Figure 9.1.
Figure 9.1 Cronbach's Alpha Reliability Output
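For readers without SPSS, Cronbach's alpha can also be computed directly from its usual formula, $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum s_i^2}{s_{total}^2}\right)$, where $k$ is the number of items, $s_i^2$ is the variance of item $i$, and $s_{total}^2$ is the variance of the total scores. The Python sketch below is ours, not the authors'. The six rows of responses are hypothetical and are not the Table 9.1 data, so the result will not equal 0.87.

```python
import statistics


def cronbach_alpha(responses):
    """Cronbach's alpha for rows of item scores (one row per person)."""
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 people")
    # Transpose so each tuple holds every person's score on one item
    item_columns = list(zip(*responses))
    item_variances = [statistics.variance(col) for col in item_columns]
    total_variance = statistics.variance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical 1-to-5 ratings from six students on five items.
ratings = [
    [5, 5, 4, 5, 5],
    [4, 4, 4, 3, 4],
    [2, 3, 2, 2, 3],
    [5, 4, 5, 5, 4],
    [3, 3, 2, 3, 3],
    [1, 2, 1, 2, 1],
]

print(f"Cronbach's alpha = {cronbach_alpha(ratings):.2f}")
```

Alpha rises when students who score high on one item also score high on the others, because the total score variance then becomes large relative to the sum of the item variances.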
What should we have done if the response options had been dichotomous? Suppose that, instead of being able to rate
the items on a scale from 1 to 5, the students had only two choices, such as yes/no or none/a lot. Table 9.2
presents the responses of these same students when the response format was dichotomous. Scores of 1 reflect responses of
“no” or “none” and scores of 2 reflect “yes” or “a lot.”
Table 9.2 Dichotomous Responses to the Personal Valuing of Education Scale
When students were forced to use a dichotomous response format, the internal consistency reliability
coefficient was 0.78. This is a moderately acceptable reliability coefficient. Based on this internal consistency
reliability of 0.78, you can conclude that the students tended to respond consistently to the five items on the
Personal Valuing of Education scale.
The statistical procedure that was used to arrive at this coefficient is called the Kuder-Richardson 20 (K-R
20). The K-R 20 should only be used with dichotomous data. When using SPSS to calculate a K-R 20, you
should click Analyze, then select Scale, then choose Reliability Analysis, and then choose the Alpha option
for this analysis. (Kuder and Richardson also developed the K-R 21, but it is not generally as acceptable as
the K-R 20.)
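The K-R 20 can be sketched by hand as well. Its standard formula is $KR_{20} = \frac{k}{k-1}\left(1 - \frac{\sum p_i q_i}{\sigma_{total}^2}\right)$, where $p_i$ is the proportion of people scoring 1 on item $i$ and $q_i = 1 - p_i$. Because the formula works with proportions, the sketch below first recodes the 1/2 scoring used in the text to 0/1. The function names and the six rows of data are hypothetical illustrations, not the Table 9.2 responses.

```python
import statistics


def recode_1_2(responses):
    """Recode scores of 1 ("no") and 2 ("yes") to 0 and 1."""
    return [[score - 1 for score in row] for row in responses]


def kr20(responses):
    """Kuder-Richardson 20 for rows of 0/1 item scores (one row per person)."""
    n_people = len(responses)
    n_items = len(responses[0])
    # p = proportion scoring 1 on each item; p * (1 - p) is that item's variance
    proportions = [sum(col) / n_people for col in zip(*responses)]
    sum_pq = sum(p * (1 - p) for p in proportions)
    # Population variance of total scores, to match the p * q item variances
    total_variance = statistics.pvariance([sum(row) for row in responses])
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)


# Hypothetical dichotomous responses, coded 1 = "no" and 2 = "yes" as in the text.
coded = [
    [2, 2, 2, 2, 2],
    [2, 2, 1, 2, 2],
    [1, 1, 1, 1, 2],
    [2, 2, 2, 2, 1],
    [1, 2, 1, 1, 1],
    [1, 1, 1, 1, 1],
]

print(f"K-R 20 = {kr20(recode_1_2(coded)):.2f}")
```

Applied to 0/1 data, this is the same quantity Cronbach's alpha gives, which is why the Alpha option in SPSS can be used for dichotomous items.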
Because an internal consistency reliability coefficient is based on dividing the test in half in order to correlate
it with the other half, we have artificially shortened the test. A 20-item test became two 10-item tests. When a
test is too short, the reliability coefficient is suppressed due to the statistics that are employed. A Spearman-
Brown correction procedure can be used to compensate for this artificial shortening. The Spearman-Brown
should only be used with internal consistency reliability. If you see it reported for either test-retest or alternate
forms reliability, the writer didn't know what he or she was talking about.
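The text does not print the Spearman-Brown formula, so here is the standard version as a hedged sketch: $r_{corrected} = \frac{n\,r}{1 + (n - 1)\,r}$, where $r$ is the reliability of the shortened test and $n$ is the factor by which the test is lengthened ($n = 2$ when two halves are put back together). The function name and the half-test correlation of 0.70 are our own hypothetical choices.

```python
def spearman_brown(r, length_factor=2.0):
    """Estimate reliability after lengthening a test by length_factor.

    With the default factor of 2, this steps a half-test correlation up
    to an estimate for the full-length test.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    denominator = 1 + (length_factor - 1) * r
    if denominator == 0:
        raise ValueError("correction is undefined for this combination of values")
    return length_factor * r / denominator


# Hypothetical correlation of 0.70 between two 10-item halves of a 20-item test.
print(round(spearman_brown(0.70), 2))  # 1.40 / 1.70 = 0.82
```

Notice that the corrected value is higher than the half-test correlation, which is the compensation for artificial shortening described above.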
A final note about internal consistency reliability—never use it with a speeded test. Speeded tests are
designed so that they cannot be finished. Therefore, calculating reliability based on halves produces a
worthless reliability coefficient.
Interrater Reliability
This fourth type of reliability is used when two or more raters are making judgments about something. For
example, let's say you are interested in whether a teacher consistently reinforces students' answers in the
classroom. You train two raters to judge a teacher's response to students as reinforcing or not reinforcing.
They each observe the same teacher at the same time and code the teacher's response to a student as
reinforcing (designated by an X) or not reinforcing (designated by an O). The two raters' judgments might look
like the following:
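To calculate interrater reliability, you count the number of times the raters agreed and divide by the number of
possible times they could have agreed. The formula for this is:
$$\text{interrater reliability} = \frac{\text{number of agreements}}{\text{number of possible agreements}}$$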
The interrater reliability coefficient for these two raters is 0.80. They tended to view the teacher's behavior in
a similar fashion. If you want your reliability coefficient to be stronger, have them discuss the two responses
that they viewed differently and come to some consensus about what they are observing. Another commonly
used statistic for interrater reliability is Cohen's kappa, which adjusts the agreement rate for the agreement
expected by chance.
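Both statistics are easy to sketch in Python. The raters' actual codes are not reproduced here, so the two lists below are hypothetical. We built them so that the raters agree on 8 of 10 observations, which matches the 0.80 reported above. Kappa is computed with its standard formula, $\kappa = (p_o - p_e)/(1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance. The function names are ours.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Number of agreements divided by number of possible agreements."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must code the same non-empty set of observations")
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return agreements / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' proportions for each code
    p_chance = sum(
        (counts_a[code] / n) * (counts_b[code] / n)
        for code in set(rater_a) | set(rater_b)
    )
    if p_chance == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical codes: X = reinforcing, O = not reinforcing.
rater_1 = list("XXOXOXXOXO")
rater_2 = list("XXOXOOXOXX")

print(round(percent_agreement(rater_1, rater_2), 2))  # 8 / 10 = 0.8
print(round(cohens_kappa(rater_1, rater_2), 2))       # (0.80 - 0.52) / 0.48 = 0.58
```

Kappa is lower than simple agreement here because two raters who each code X 60% of the time would agree fairly often by chance alone.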
Let's Check Your Understanding
OK, friends, have you digested this yet or has it left a bad taste in your mouth? It's time to check your
understanding of types of reliability by answering the questions below.
What are the four major types of reliability?
_________________________________
_________________________________
Which type is designed to control for error due to time?
_________________________________
_________________________________
Which type is designed to control for error due to time and content?
_________________________________
_________________________________
For which type do you need to administer the test only one time?
_________________________________
_________________________________
What happens to the reliability coefficient when it is based on halves of a test?
_________________________________
_________________________________
How do you calculate interrater reliability?
_________________________________
_________________________________
Our Model Answers
We suggest that you pay close attention. In our answers, we're throwing in bits of new information that your
instructor may well hold you accountable for on your exam.
What are the four major types of reliability?
Test-retest, alternate forms, internal consistency, and interrater reliabilities.
Which type is designed to control for error due to time?
Test-retest reliability controls for error due to time, because the test is administered at two
different time periods. The time period between the two testings indicates the length of
time that the test scores have been found to be stable.
Which type is designed to control for error due to time and content?
Alternate forms reliability controls for error due to time and content. It controls for the
exact time period between the two testings as well as the equivalency of item content
across the two forms of the test.
For which type do you need to administer the test only one time?
You only need to administer a test once if you are calculating internal consistency
reliability. This type of reliability only controls for sources of error due to content since no
time interval is involved.
What happens to the reliability coefficient when it is based on halves of a test?
If the length of the possible halves of a test contains too few items, the reliability
coefficient may be distorted and typically is too small. The Spearman-Brown correction
somewhat compensates for this statistical artifact.
How do you calculate interrater reliability?
You divide the number of agreements for the two raters by the number of possible
agreements to get interrater reliability.
Standard Error of Measurement
Any discussion of reliability would be incomplete if standard error of measurement (SEm) were not discussed.
Standard error of measurement is directly related to reliability. The formula for standard error of measurement
is
$$SE_m = SD_o\sqrt{1 - r_{tt}}$$
The more reliable your test scores, the smaller your SEm. For example, let's say your reliability is 0.90 and the
SDo is 3 (SDo is the standard deviation for the test). If these values are inserted in the formula, your SEm is
0.95. If your reliability is 0.70, your SEm is 1.64. The higher the reliability, the less error in your measurement.
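Here's how we arrived at our two answers.
$$SE_m = 3\sqrt{1 - 0.90} = 3\sqrt{0.10} = 3(0.316) = 0.95$$
$$SE_m = 3\sqrt{1 - 0.70} = 3\sqrt{0.30} = 3(0.548) = 1.64$$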
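The same arithmetic can be checked with a short Python sketch. It assumes the usual formula, the standard deviation of obtained scores multiplied by the square root of 1 minus the reliability, which reproduces both values quoted above. The function name is our own.

```python
import math


def standard_error_of_measurement(sd_obtained, reliability):
    """SEm = SDo * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return sd_obtained * math.sqrt(1 - reliability)


# Values from the text: SDo = 3, with reliabilities of 0.90 and 0.70.
print(round(standard_error_of_measurement(3, 0.90), 2))  # 0.95
print(round(standard_error_of_measurement(3, 0.70), 2))  # 1.64
```

Try a reliability of 1.00 and the SEm drops to 0. Try 0 and the SEm equals the full standard deviation of the test.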
A standard error of measurement is a deviation score and reflects the area around an obtained score where
you would expect to find the true score. This area is called a confidence interval. Yeah, that sounds like
gobbledygook to us too. Perhaps an example might clarify what we are trying to say.
Mary, one of our star pupils, obtained a score of 15 on our first measurement quiz. If the reliability coefficient
for this quiz was 0.90, the SEm is 0.95. Mary's obtained score of 15 could be viewed as symbolizing the mean
of all possible scores Mary could receive on this test if she took it over and over and over. An obtained score
is analogous to a mean on the normal, bell-shaped curve, and the SEm is equivalent to a SD. Mary's true
score would be within ± 1SEm of her obtained score 68.26% of the time (Xo ± 1SEm). That is, we are 68.26%
confident that Mary's true score would fall between the scores 14.05 and 15.95. We got these numbers by
subtracting 0.95 from Mary's score of 15 and by adding 0.95 to Mary's score of 15. Remember, it helps to
think of Mary's obtained score as representing the mean and the SEm as representing a SD on a normal
curve.
Surprise, surprise: Mary's true score would be within ± 2SEm of her obtained score 95.44% of the
time—between the scores of 13.10 and 16.90. If you guessed that Mary's true score would be within ± 3SEm
of her obtained score 99.74% of the time (between 12.15 and 17.85), you are so smart. If you remembered
standard deviation from Chapter 4, this was a piece of cake for you.
Let's see what happens when the reliability is only 0.70. As we calculated above, the SEm is 1.64. Mary still
received an obtained score of 15. We can be confident 68.26% of the time that Mary's true score is within ± 1
SEm of this obtained score (between scores of 13.36 and 16.64). Mary's true score would be within ± 2SEm
of her obtained score 95.44% of the time (between a score of 11.72 and a score of 18.28). By now you know
that Mary's true score would be within ± 3SEm of this obtained score 99.74% of the time (between 10.08 and
19.92).
Notice that when the test has a high reliability as in the first case, we are more confident that the value of
Mary's true score is closer to her obtained score. The SEm is smaller, which reflects a smaller error score.
Remember that a goal in reliability is to control error.
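Mary's confidence intervals follow one simple rule: take the obtained score and add and subtract 1, 2, or 3 SEm. The Python sketch below applies that rule to her score of 15 for both SEm values used in the text. The function name and the `COVERAGE` lookup are our own illustrative choices.

```python
def true_score_band(obtained, sem, n_sem=1):
    """Return (lower, upper) limits: obtained score plus or minus n_sem SEm."""
    margin = n_sem * sem
    return (obtained - margin, obtained + margin)


# Coverage percentages quoted in the text for 1, 2, and 3 SEm.
COVERAGE = {1: "68.26%", 2: "95.44%", 3: "99.74%"}

for sem in (0.95, 1.64):  # SEm values from the text (reliability 0.90 and 0.70)
    for n_sem, coverage in COVERAGE.items():
        low, high = true_score_band(15, sem, n_sem)
        print(f"SEm = {sem}: {coverage} band runs from {low:.2f} to {high:.2f}")
```

The printed limits can be compared with the ones worked out for Mary above. The bands for the less reliable quiz are wider at every confidence level.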
Let's Check Your Understanding
What is the mathematical symbol for standard error of measurement?
_________________________________
_________________________________
What is the relationship between standard error of measurement and a reliability coefficient?
_________________________________
_________________________________
What does a 68.26% confidence interval mean?
_________________________________
_________________________________
If Mary in our example above had obtained a score of 18, and the SEm was 4, her true score
would be between what two scores at the 95.44% confidence interval?
_________________________________
_________________________________
Our Model Answers
What is the mathematical symbol for standard error of measurement?
SEm
What is the relationship between standard error of measurement and a reliability coefficient?
They have a negative relationship. The greater the reliability coefficient, the smaller the
standard error of measurement.
What does a 68.26% confidence interval mean?
A 68.26% confidence interval indicates that you can expect to find the true score within
±1SEm from the obtained score.
If Mary in our example above had obtained a score of 18, and the SEm was 4, her true score
would be between what two scores at the 95.44% confidence interval?
The two scores are 10 and 26. We arrived at these scores by adding and subtracting 2SEm
to the obtained score of 18. The mathematical value of 2SEm was 2(4) = 8.
Correlation Coefficients as Measures of Reliability
We've already told you that the Pearson product-moment correlation is used to assess test-retest and
alternate forms reliability because total test scores are continuous data and reflect interval-level data. When
we look at individual items to assess internal consistency reliability, we use either the Cronbach's alpha for
continuous response formats or the K-R 20 for dichotomous response formats. Before we end this chapter, we
need to comment on what type of reliability you would use for ordinal or rank-ordered data. Ranked data might
include placement in class or teacher's ratings of students (first, second, third, etc.). The correct procedure
for rank-ordered data is the Spearman rho.
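As a rough illustration, the Spearman rho can be computed from the differences between paired ranks using the standard formula $\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$. This shortcut formula assumes there are no tied ranks. The function name and the two sets of rankings below are hypothetical.

```python
def spearman_rho(ranks_x, ranks_y):
    """Spearman rho from two lists of ranks (assumes no tied ranks)."""
    n = len(ranks_x)
    if n != len(ranks_y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 ranks")
    # d is the difference between the two ranks given to the same student
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_x, ranks_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Hypothetical rankings of six students by the same teacher on two occasions.
first_ranking = [1, 2, 3, 4, 5, 6]
second_ranking = [2, 1, 3, 4, 6, 5]

print(round(spearman_rho(first_ranking, second_ranking), 2))  # 1 - 24/210 = 0.89
```

When ranks are tied, a common alternative is to compute a Pearson correlation on the ranks themselves.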
Some Final Thoughts About Reliability
There are several principles that we need to keep in mind about reliability:
When administering a test, follow the instructions carefully so that your administration matches
that of anyone else who would be administering this same test. This controls for one source of
error.
Try to establish standardized testing conditions such as good lighting in the room, freedom from
noise, and other environmental conditions. These can all be sources of error.
Reliability is affected by the length of a test, so make sure you have a sufficient number of items
to reflect the true reliability.
Key Terms
• Reliability
— Test-retest
— Alternate forms
— Internal consistency
— Interrater
• Standard error of measurement
• Pearson product-moment correlations
• K-R 20
• Cronbach's alpha
• Spearman-Brown correction
Models and Self-instructional Exercises
Our Model
Business and industry are increasingly concerned about employee theft. To help identify potential employees
who may have a problem with honesty, you have created the Honesty Inventory (HI). You have normed it on
employees representative of different groups of workers, such as used car salesmen, construction workers,
lawyers, and land developers. You have been asked to assess applicants for bank teller positions. You know
that the 2-week test-retest reliability of the HI scores for bank tellers is 0.76. You administer the HI to the
applicants. The applicants' scores range from 20 to 55 out of a possible 60 points. Their mean score is 45
with a standard deviation of 2.
What is the SEm for this group of applicants?
One applicant, Tom, scored 43. At a 95% confidence level, between what two scores does his
true score lie?
Between ____________ and _____________.
Based on what you know so far about Tom and the other applicants, would you recommend him
as an employee? Yes or no?
Why did you make this recommendation?
_________________________________
_________________________________
_________________________________
When you calculated the internal consistency for this group of applicants, the Cronbach's alpha
reliability coefficient was 0.65. What does this tell you about the content of the HI for these
applicants?
_________________________________
_________________________________
_________________________________
Our Model Answers
Being conscientious measurement consultants, we want to be conservative in our opinions. Therefore, we
pay close attention to our data when making recommendations. To err is human, but in business they don't
forgive.
What is the SEm for this group of applicants?
$$SE_m = SD_o\sqrt{1 - r_{tt}} = 2\sqrt{1 - 0.76} = 2\sqrt{0.24} = 2(0.49) = 0.98$$
One applicant, Tom, scored 43. At a 95% confidence level, between what two scores does his
true score lie?
We are confident 95% of the time that Tom's true score is between 41.04 and 44.96 (his obtained score of 43
plus or minus 2SEm, which is 1.96).
Based on what you know so far about Tom and the other applicants, would you recommend him
as an employee?
We would not recommend Tom for employment.
Why did you make this recommendation?
At the 95% confidence level, Tom's highest potential true score (44.96) is still just short of the
mean score of 45 for this applicant pool. More than likely, his true score is significantly
below the group mean. We would look for applicants whose 95% confidence interval at a minimum
includes the mean, or lies entirely above it.
When you calculated the internal consistency reliability for this group of applicants, the
Cronbach's alpha reliability coefficient was 0.65. What does this tell you about the content of the
HI for these applicants?
A 0.65 internal consistency reliability coefficient indicates that the applicants did not
respond consistently to all the items on the HI. This suggests significant content error. We
need to tread very lightly in making any recommendations about this group of applicants
based on the internal consistency reliability for the HI for their scores. Remember that the
reliability coefficient is suppressed when a test is shortened, as happens when internal
consistency reliability is calculated.
Now It's Your Turn
The same applicants are also given a multidimensional test that assesses both interpersonal relationships
(IP) and fear of math (FOM). For a norm group of 500 business majors, the reported Cronbach's alpha across
their 30-item IP scores was 0.84. Their K-R 20 was 0.75 across their 30-item FOM scores. For the bank teller
applicants, their mean score on the IP subscale is 50 (SD = 4, possible score range 0 to 60, with higher scores
reflecting stronger interpersonal skills). Their mean score on the FOM subscale is 20 (SD = 2.2, possible
score range 0 to 30, with higher scores reflecting greater fear of math).
What is the SEm for each subscale for this group of applicants?
One applicant, Harold, scored 50 on the IP subscale. At a 68.26% confidence level, between what
two scores does his true score lie?
Between ____________ and ___________.
Harold scored 10 on the FOM subscale. At a 68.26% confidence level, between what two scores
does his true score lie?
Between __________ and __________.
Based on what you know so far about Harold on these two subscales, would you recommend him
as an employee? Yes or no?
Why did you make this recommendation?
_________________________________
_________________________________
_________________________________
_________________________________
Six months later you retested the candidates you hired. The test-retest reliability coefficient was
0.88 for the IP scores and was 0.68 for FOM scores. What does this tell you about the stability of
the two subscales?
_________________________________
_________________________________
_________________________________
_________________________________
Our Model Answers
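What is the SEm for each subscale for this group of applicants?
For the IP subscale:
$$SE_m = 4\sqrt{1 - 0.84} = 4\sqrt{0.16} = 4(0.40) = 1.6$$
For the FOM subscale:
$$SE_m = 2.2\sqrt{1 - 0.75} = 2.2\sqrt{0.25} = 2.2(0.50) = 1.1$$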
One applicant, Harold, scored 50 on the IP subscale. At a 68.26% confidence level, between what
two scores does his true score lie?
Between 48.4 and 51.6. To get these two values, we added and subtracted 1.6 (the SEm)
from Harold's score of 50.
Harold scored 10 on the FOM subscale. At a 68.26% confidence level, between what two scores
does his true score lie?
Between 8.9 and 11.1. To get these two values, we added and subtracted 1.1 (the SEm)
from Harold's score of 10.
Based on what you know so far about Harold on these two subscales, would you recommend him
as an employee?
Yes, we should hire him.
Why did you make this recommendation?
Based only on these two tests, we chose to hire Harold because he has average
interpersonal skills and low fear of math. His scores on these two measures suggest that
he will interact satisfactorily with others (with customers as well as coworkers) and he will
be comfortable dealing with the mathematics related to being a bank teller (we don't know
his actual math ability, however).
Six months later you retested the candidates you hired. The test-retest reliability coefficient was
0.88 for the IP scores and was 0.68 for FOM scores. What does this tell you about the stability of
the two subscales?
The interpersonal relationships subscale scores were quite stable over a 6-month time
period for the pool of bank teller applicants who were actually hired. The scores on the
fear of math subscale for those who were hired were not as stable across time. This might
be due to the restricted range that resulted if only those with lower fear of math scores
on the initial testing were actually hired. The test-retest reliability coefficient of 0.68 might
also be related to the fact that the K-R 20 was also relatively low (0.75).
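The restricted range explanation can be explored with a small simulation. The Python sketch below is ours and rests on several assumptions: scores are normally distributed, the unrestricted test-retest reliability is 0.80, and only applicants scoring below the median on the first testing are "hired." None of these numbers come from the exercise; they are chosen only to show the direction of the effect.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_restriction(n_people=2000, reliability=0.80, seed=1):
    """Return (full-group r, hired-group r) for simulated test and retest scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    true_sd = math.sqrt(reliability)
    error_sd = math.sqrt(1 - reliability)
    true_scores = [rng.gauss(0, true_sd) for _ in range(n_people)]
    test1 = [t + rng.gauss(0, error_sd) for t in true_scores]
    test2 = [t + rng.gauss(0, error_sd) for t in true_scores]
    full_r = pearson_r(test1, test2)
    # "Hire" only those below the median on the first testing
    cutoff = statistics.median(test1)
    hired = [(a, b) for a, b in zip(test1, test2) if a < cutoff]
    hired_r = pearson_r([a for a, _ in hired], [b for _, b in hired])
    return full_r, hired_r


full_r, hired_r = simulate_restriction()
print(f"all applicants: r = {full_r:.2f}")
print(f"hired only:     r = {hired_r:.2f}")
```

The coefficient for the hired group comes out lower than the coefficient for the full pool even though the test itself has not changed. Only the spread of scores has.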
Words of Encouragement
Do you realize that you have successfully completed almost all of this book? Mastering the content in this
chapter is a major accomplishment. We hope you are starting to see how the measurement issues in the
earlier chapters have applicability for this more advanced topic: reliability.
If you want to practice using SPSS to calculate reliability coefficients, we suggest you visit
http://www.sagepub.com/kurpius. Individual test item responses for six different tests for 270 undergraduates
can be found in the Measurement Data Set.
http://dx.doi.org/10.4135/9781412986106.n9