Reliability-The Same Yesterday, Today, and Tomorrow

In: Testing and Measurement

By: Mary E. Stafford

Pub. Date: 2011

Access Date: May 19, 2019

Publishing Company: SAGE Publications, Inc.

City: Thousand Oaks

Print ISBN: 9781412910026

Online ISBN: 9781412986106

DOI: https://dx.doi.org/10.4135/9781412986106

Print pages: 121-140

© 2006 SAGE Publications, Inc. All Rights Reserved.

Reliability-The Same Yesterday, Today, and Tomorrow

When selecting tests for use either in research or in clinical decision making, you want to make sure that

the tests you select are reliable. Reliability can be defined as the trustworthiness or the accuracy of a

measurement. Those of us concerned with measurement issues also use the terms consistency and stability

when discussing reliability. Consistency is the degree to which all parts of a test or different forms of a test

measure the same thing. Stability is the degree to which a test measures the same thing at different times

or in different situations. A reliability coefficient does not refer to the test as a whole, but it refers to scores

obtained on a test. In measurement, we are interested in the consistency and stability of a person's scores.

As measurement specialists, we need to ask ourselves, “Is the score just obtained by Person X (Person X

seems so impersonal; let's call her George) the same score she would get if she took the test tomorrow, or

the next day, or the next week?” We want George's score to be a stable measure of her performance on any

given test. The reliability coefficient is a measure of consistency. We also ask ourselves, “Is the score George

received a true indication of her knowledge, ability, behavior, and so on?” Remember obtained scores, true

scores, and error scores? The more reliable a test, the more George's obtained score is a reflection of her

true score.

A reliability coefficient is a numerical value that can range from 0 to 1.00. A reliability coefficient of zero

indicates the test scores are absolutely unreliable. In contrast, the higher the reliability coefficient, the more

reliable or accurate the test scores. We want tests to have reliabilities above 0.70 for research purposes and

in the 0.80s and 0.90s for clinical decision making. To compute a reliability coefficient, you must do one of

three things:

Administer the test twice and keep track of the time interval between the two administrations.

Administer two different forms of the same test.

Administer the test one time.

More about this later.

Let's Check Your Understanding

The consistency or stability of test scores is called___________.

Does a reliability coefficient reflect

a. Stability of a set of scores over time, or

b. Stability of a set of scores across different tests?

When we say that reliability reflects accuracy of scores, do we mean

a. How accurately the test measures a given concept, or

SAGE

2006 SAGE Publications, Ltd. All Rights Reserved.

SAGE Research Methods

Page 2 of 20 Testing and Measurement

b. How accurately the test measures the true score?

Reliability coefficients can range from _________ to ________.

When we are making clinical decisions (decisions about a person's life), we want the value of our

reliability coefficients to be at least __________.

A test must always be administered twice in order for you to compute a reliability coefficient. True

or false?

Our Model Answers

The consistency or stability of test scores is called reliability.

A reliability coefficient reflects the

a. Stability of a set of scores over time

When we say that reliability reflects accuracy of scores, we mean

b. How accurately the test measures the true score

Reliability coefficients can range from 0 to +1.00.

When we are making clinical decisions (decisions about a person's life), we want the value of our

reliability coefficients to be at least 0.80.

A test must always be administered twice in order for you to compute a reliability coefficient.

This statement is false.

The Mathematical Foundation of Reliability

Now that you've mastered these basic concepts about reliability, we want to review quickly the concepts of

obtained, true, and error score variance. Remember from classical test theory in Chapter 7 that it is assumed

that the variance in obtained scores comprises true score variance and error score variance. One goal of

measurement is to reduce as much error score variance as possible.

The equation $\sigma_o^2 = \sigma_t^2 + \sigma_e^2$ is key to understanding the concept of reliability. If we divide both sides of this equation by the obtained variance ($\sigma_o^2$), our equation is

$$1 = \frac{\sigma_t^2}{\sigma_o^2} + \frac{\sigma_e^2}{\sigma_o^2}$$

This equation shows us that the ratio of the true score variance to obtained score variance ($\sigma_t^2 / \sigma_o^2$) plus the ratio of the error score variance to obtained score variance ($\sigma_e^2 / \sigma_o^2$) sums to 1. The ratio of the true score variance to obtained score variance reflects the proportion of variance in obtained scores that is attributable to true scores. This concept is the basic definition of reliability. Therefore, we can substitute the symbol for reliability ($r_{tt}$) in the equation:

$$1 = r_{tt} + \frac{\sigma_e^2}{\sigma_o^2}$$

If we subtract the ratio of the error score variance ($\sigma_e^2$) to the obtained score variance ($\sigma_o^2$) from both sides, we have the basic formula for reliability:

$$r_{tt} = 1 - \frac{\sigma_e^2}{\sigma_o^2}$$

The closer the error variance comes to equaling 0, the closer the reliability coefficient comes to equaling 1.00.

Error variance is never totally controlled, so it can never be equal to 0. This also means that the reliability

will never be 1.00. At the risk of being redundant, in measurement we try to control as many sources of error

variance as possible. To the extent that we can do this, the more accurately (reliably) we are measuring a

person's true knowledge, behavior, personality, and so on.
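The link between error variance and reliability can be checked numerically. The following is a minimal Python sketch, assuming the classical test theory decomposition described above (obtained variance equals true variance plus error variance). The function name and the variance values are hypothetical and chosen only for illustration.

```python
def reliability_from_variances(true_variance: float, error_variance: float) -> float:
    """Return the proportion of obtained-score variance that is true-score variance."""
    if true_variance < 0 or error_variance < 0:
        raise ValueError("variances cannot be negative")
    # Classical test theory assumption: obtained = true + error.
    obtained_variance = true_variance + error_variance
    if obtained_variance == 0:
        raise ValueError("obtained variance must be positive")
    return true_variance / obtained_variance


# Hypothetical variances: hold true variance fixed and shrink the error variance.
for error_variance in (50.0, 25.0, 5.0):
    r_tt = reliability_from_variances(true_variance=100.0, error_variance=error_variance)
    print(f"error variance = {error_variance:5.1f} -> r_tt = {r_tt:.2f}")
```

As the error variance shrinks toward 0, the printed coefficient climbs toward 1.00 without ever reaching it while any error remains.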

Let's Check Your Understanding

Did you get all of that??? Check your understanding by answering the questions below.

Reliability is mathematically defined as the ratio of ______________ variance in scores to the

____________ variance.

The symbol we use for reliability is _________________.

The lower the error variance, the _____________ the reliability.

Our Model Answers

Reliability is mathematically defined as the ratio of true score variance in scores to the observed

score variance.

The symbol we use for reliability is rtt.

The lower the error variance, the higher the reliability.

We are so proud of you! Thank you for hanging in with us.

Types of Reliability Estimates

Although there are a variety of types of reliability, we will only cover the four most important and most used.

These four are test-retest reliability, alternate forms reliability, internal consistency reliability, and interrater

reliability.

Test-Retest Reliability

This is the easiest to remember, because its name tells you exactly what you're doing. You're giving the test

and then, after a designated time period, you're giving it again. The most common time periods are 1 week

to 1 month. When test scores are reliable over time (i.e., good test-retest reliability), error variance due to

time is controlled (as much as is possible). The statistical procedure used to examine test-retest reliability

is correlation. A correlation coefficient tells us to what extent people obtain the same scores across the two

testing times. The Pearson correlation coefficient is the statistic used to reflect test-retest reliability when total

scores on a test are continuous.

Let's say that you are trying to create the Flawless IQ Test. Because you want it to compete with the Wechsler

and Binet IQ tests, you create it with a mean of 100 and a SD of 15. You administer it to 100 sophomores (the

human fruit flies) at the beginning of the semester. One month later you give it to them again. You correlate

the students' scores on the first test with their scores when they took it the second time. In our example, the

correlation coefficient you obtain is a measure of 1-month test-retest reliability for the Flawless IQ Test. The more closely your students' scores on the two administrations agree, the higher the reliability coefficient. If your reliability

coefficient is above 0.70, their scores are relatively stable over time (at least over 1 month). If your reliability

coefficient is in the 0.80s or 0.90s, you have a better claim to stability because more error variance has been

eliminated. Indeed, your reliability coefficient of 0.82 strongly suggests that this instrument reliably measures

students' scores over a 1-month period.
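The correlation step in this example can be sketched in Python. This is a minimal illustration, assuming paired total scores from the same people on two administrations. The function name `pearson_r` and the eight score pairs are hypothetical; they are not the Flawless IQ data.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between two equal-length lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    n = len(first)
    mean_first = sum(first) / n
    mean_second = sum(second) / n
    dev_first = [score - mean_first for score in first]
    dev_second = [score - mean_second for score in second]
    cross_products = sum(a * b for a, b in zip(dev_first, dev_second))
    ss_first = sum(a * a for a in dev_first)
    ss_second = sum(b * b for b in dev_second)
    if ss_first == 0 or ss_second == 0:
        raise ValueError("scores must vary on both administrations")
    return cross_products / sqrt(ss_first * ss_second)


# Hypothetical scores for eight students tested one month apart.
time_1 = [95, 102, 110, 88, 120, 105, 99, 115]
time_2 = [97, 100, 112, 90, 118, 108, 96, 113]
retest_reliability = pearson_r(time_1, time_2)
print(f"1-month test-retest reliability: {retest_reliability:.2f}")
```

The same function applies to alternate forms reliability; only the meaning of the two score lists changes.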

Let's relook at the reliability of the Flawless IQ Test when we take race or ethnicity into consideration. You

have 10 Latino students and 90 Euro-American students in your class. The scores for the 10 Latino students

on each of the two test administrations are reported below.

When the Pearson correlation coefficient is calculated for these 10 sets of scores, it is equal to 0.62. If we depicted this symbolically, it would look like this:

$$r_{tt} = 0.62$$

This reliability coefficient indicates that your Flawless IQ Test is not so flawless. It may be stable across time

for your Euro-American students, but it is not as stable for your Latino students. The poor reliability of scores

for your Latino students may reflect cultural bias in your test. This reliability estimate of 0.62 suggests that

you are not measuring IQ for these students very accurately. We'll discuss cultural bias in Chapter 11 on the

ethics of testing.

Alternate Forms Reliability

This type of reliability is also known as parallel forms reliability. When you want to determine whether

two equivalent forms of the same test are really equivalent, you calculate a correlation coefficient that is

interpreted as alternate form reliability. In order for two forms to be parallel, they need to have the same

number of items, test the same content, and have the same response format and options. Alternate forms of

a test are usually given at two different times. The beauty of the alternate form procedure is that it helps to

control for sources of error due to both content variability and time.

Maybe an example will help illustrate what we are talking about. For those of you who have taken aptitude

tests such as the Miller Analogies Test (MAT), you may have discovered that there are multiple forms of the

MAT (if you've taken it more than once to try to improve your score). The following 10 students took one form

of the MAT in January and a different form of the MAT in May. Here are their scores:

The alternate forms reliability coefficient for these scores across this 4-month period is 0.95. If we depicted this symbolically, it would look like this:

$$r_{tt} = 0.95$$

The measurement specialists who designed these alternate forms of the MAT did a great job. (Let's hope

they were given a bonus for a job well done!) This 0.95 is a very strong reliability coefficient and indicates that

sources of error due to both content variability and time were controlled to a very great extent.

Internal Consistency Reliability

This type of reliability is a bird of a different color. It does not require two testing times or two forms of a

test. You administer the test one time, and you let the computer find the mean of the correlations among all

possible halves of the test. This procedure is also called a split-half procedure. What you are trying to find out

is whether every item on the test correlates with every other item. You are assessing content stability.

Let's look at an example of a five-item test designed to measure Personal Valuing of Education. The five

items are

How strong is your commitment to earning a bachelor's degree?

Getting a college degree will be worth the time required to obtain it.

Getting a college degree will be worth the money spent to obtain it.

Getting a college degree will be worth the work/effort required to get it.

How much do you value a college education?

The continuous response format ranges from 1 (not at all/strongly disagree) to 5 (very much/strongly agree).

Did you remember that this is called a continuous response format because there are more than two response

options for each item? The responses for 10 students to each of these items are presented in Table 9.1.

Table 9.1 Continuous Responses to the Personal Valuing of Education Scale

Because this test used a continuous response format, the type of internal consistency reliability coefficient

we calculated was a Cronbach's alpha. With the assistance of the Scale option in SPSS, we found that

the Cronbach's alpha for responses to these five items was 0.87. This suggests that for each student, their

responses to each of the items were very similar. An internal consistency reliability coefficient of 0.87 is

considered relatively strong and suggests that these five items are measuring the same content.
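For readers without SPSS, Cronbach's alpha can be sketched in a few lines of Python. This is an illustration only: it uses a common textbook form of alpha (the number of items, the sum of the item variances, and the variance of the total scores), which the text above does not spell out. The function name and the six rows of responses are hypothetical and are not the Table 9.1 data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """responses: one row per respondent, each row a list of item scores."""
    if len(responses) < 2:
        raise ValueError("need at least two respondents")
    n_items = len(responses[0])
    if n_items < 2 or any(len(row) != n_items for row in responses):
        raise ValueError("every respondent needs the same two or more items")
    # Variance of each item (column) and of each respondent's total score.
    item_variance_sum = sum(variance(column) for column in zip(*responses))
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores must vary")
    return (n_items / (n_items - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical 1-to-5 ratings from six students on a five-item scale.
ratings = [
    [5, 5, 4, 5, 5],
    [4, 4, 4, 3, 4],
    [3, 2, 3, 3, 3],
    [5, 4, 5, 5, 4],
    [2, 2, 1, 2, 3],
    [4, 5, 4, 4, 5],
]
print(f"Cronbach's alpha: {cronbach_alpha(ratings):.2f}")
```

When every student gives nearly the same rating to all items, the item variances are small relative to the variance of the totals and alpha approaches 1.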

If you want to check our accuracy, open the Personal Valuing of Education data set on the Sage Web site at

http://www.sagepub.com/kurpius. Under Analyze, choose Scale, then choose Reliability Analysis. From the

window that opens, select and move the five test items from the left to the right using the arrow. Then click

Statistics and check all of the Descriptives in the next window. Click Continue and then OK. You will get results

that look like ours in Figure 9.1.

Figure 9.1 Cronbach's Alpha Reliability Output

What should we have done if the response options had been dichotomous? Instead of being able to rate

the items on a scale from 1 to 5, the students had only two choices, such as yes/no or none/a lot. Table 9.2

presents these same students but the response option was dichotomous. Scores of 1 reflect responses of

“no” or “none” and scores of 2 reflect “yes” or “a lot.”

Table 9.2 Dichotomous Responses to the Personal Valuing of Education Scale

When students were forced to use a dichotomous response format, the internal consistency reliability

coefficient was 0.78. This is a moderately acceptable reliability coefficient. Based on this internal consistency

reliability of 0.78, you can conclude that the students tended to respond consistently to the five items on the

Personal Valuing of Education scale.

The statistical procedure that was used to arrive at this coefficient is called the Kuder-Richardson 20 (K-R

20). The K-R 20 should only be used with dichotomous data. When using SPSS to calculate a K-R 20, you

should click Analyze, then select Scale, then choose Reliability Analysis, and then choose the Alpha option

for this analysis. (Kuder and Richardson also developed the K-R 21, but it is not generally as acceptable as

the K-R 20.)
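The K-R 20 can also be sketched by hand in Python. This is a minimal illustration using the commonly cited form of the formula, which relies on the proportion of respondents endorsing each item; the text above does not show the formula itself. The function name, the recoding of 1/2 responses to 0/1, and the response data are assumptions made for this sketch.

```python
from statistics import pvariance


def kr20(responses):
    """responses: one row per respondent, each row a list of 0/1 item scores."""
    n_people = len(responses)
    if n_people < 2:
        raise ValueError("need at least two respondents")
    n_items = len(responses[0])
    if n_items < 2 or any(len(row) != n_items for row in responses):
        raise ValueError("every respondent needs the same two or more items")
    if any(score not in (0, 1) for row in responses for score in row):
        raise ValueError("items must be scored 0 or 1")
    # For each item, p is the proportion scoring 1 and (1 - p) the proportion scoring 0.
    pq_sum = 0.0
    for column in zip(*responses):
        p = sum(column) / n_people
        pq_sum += p * (1 - p)
    total_variance = pvariance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores must vary")
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_variance)


# Hypothetical responses coded 1 ("no"/"none") and 2 ("yes"/"a lot"), recoded to 0/1.
coded_1_2 = [
    [2, 2, 2, 2, 2],
    [2, 2, 1, 2, 2],
    [1, 1, 1, 1, 2],
    [2, 2, 2, 1, 2],
    [1, 1, 1, 1, 1],
    [2, 1, 2, 2, 2],
]
recoded = [[score - 1 for score in row] for row in coded_1_2]
print(f"K-R 20: {kr20(recoded):.2f}")
```

Because each item's p(1 - p) is simply its variance when scored 0/1, this calculation is the dichotomous special case of Cronbach's alpha, which is why the Alpha option in SPSS serves for both.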

Because an internal consistency reliability coefficient is based on dividing the test in half in order to correlate

it with the other half, we have artificially shortened the test. A 20-item test became two 10-item tests. When a

test is too short, the reliability coefficient is suppressed due to the statistics that are employed. A Spearman-

Brown correction procedure can be used to compensate for this artificial shortening. The Spearman-Brown

should only be used with internal consistency reliability. If you see it reported for either test-retest or alternate

forms reliability, the writer didn't know what he or she was talking about.
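The size of the Spearman-Brown adjustment can be illustrated with a short Python sketch. It assumes the commonly cited split-half form of the correction, in which the correlation between two half-tests is stepped up to estimate the reliability of the full-length test. The text above names the procedure but does not give the formula, so the function name and the sample correlation are hypothetical.

```python
def spearman_brown_split_half(half_test_correlation: float) -> float:
    """Estimate full-length reliability from the correlation between two halves."""
    if half_test_correlation <= -1.0 or half_test_correlation > 1.0:
        raise ValueError("correlation must be greater than -1 and at most 1")
    return (2 * half_test_correlation) / (1 + half_test_correlation)


# Hypothetical correlation between two 10-item halves of a 20-item test.
half_r = 0.70
corrected = spearman_brown_split_half(half_r)
print(f"half-test r = {half_r:.2f} -> corrected estimate = {corrected:.2f}")
```

For any half-test correlation between 0 and 1, the corrected value is at least as large as the uncorrected one, which is the compensation for artificial shortening described above.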

A final note about internal consistency reliability—never use it with a speeded test. Speeded tests are

designed so that they cannot be finished. Therefore, calculating reliability based on halves produces a

worthless reliability coefficient.

Interrater Reliability

This fourth type of reliability is used when two or more raters are making judgments about something. For

example, let's say you are interested in whether a teacher consistently reinforces students' answers in the

classroom. You train two raters to judge a teacher's response to students as reinforcing or not reinforcing.

They each observe the same teacher at the same time and code the teacher's response to a student as

reinforcing (designated by an X) or not reinforcing (designated by an O). The two raters' judgments might look

like the following:

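To calculate interrater reliability, you count the number of times the raters agreed and divide by the number of possible times they could have agreed. The formula for this is:

$$\text{interrater reliability} = \frac{\text{number of agreements}}{\text{number of possible agreements}}$$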

The interrater reliability coefficient for these two raters is 0.80. They tended to view the teacher's behavior in

a similar fashion. If you want your reliability coefficient to be stronger, have them discuss the two responses

that they viewed differently and come to some consensus about what they are observing. Another commonly

used statistic for interrater reliability is Cohen's kappa.
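Both statistics can be sketched in Python. The two lists of codes below are hypothetical: they were made up so that the raters agree on 8 of 10 observations, matching the 0.80 reported above, and they are not the raters' actual judgments. The kappa function follows the usual definition (observed agreement adjusted for the agreement expected by chance), which the text mentions but does not define.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Number of agreements divided by number of possible agreements."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty lists of codes")
    agreements = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return agreements / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Agreement between two raters, adjusted for chance agreement."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's proportion for every code.
    expected = sum(
        (counts_a[code] / n) * (counts_b[code] / n)
        for code in set(rater_a) | set(rater_b)
    )
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical codes: X = reinforcing, O = not reinforcing.
rater_1 = list("XXOXOXXOXX")
rater_2 = list("XXOXXXXOOX")
print(f"percent agreement: {percent_agreement(rater_1, rater_2):.2f}")
print(f"Cohen's kappa: {cohen_kappa(rater_1, rater_2):.2f}")
```

Kappa is usually lower than simple agreement because some agreement would occur even if both raters coded at random.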

Let's Check Your Understanding

OK, friends, have you digested this yet or has it left a bad taste in your mouth? It's time to check your

understanding of types of reliability by answering the questions below.

What are the four major types of reliability?

_________________________________

_________________________________

Which type is designed to control for error due to time?

_________________________________

_________________________________

Which type is designed to control for error due to time and content?

_________________________________

_________________________________

For which type do you need to administer the test only one time?

_________________________________

_________________________________

What happens to the reliability coefficient when it is based on halves of a test?

_________________________________

_________________________________

How do you calculate interrater reliability?

_________________________________

_________________________________

Our Model Answers

We suggest that you pay close attention. In our answers, we're throwing in bits of new information that your

instructor may well hold you accountable for on your exam.

What are the four major types of reliability?

Test-retest, alternate forms, internal consistency, and interrater reliabilities.

Which type is designed to control for error due to time?

Test-retest reliability controls for error due to time, because the test is administered at two

different time periods. The time period between the two testings indicates the length of

time that the test scores have been found to be stable.

Which type is designed to control for error due to time and content?

Alternate forms reliability controls for error due to time and content. It controls for the

exact time period between the two testings as well as the equivalency of item content

across the two forms of the test.

For which type do you need to administer the test only one time?

You only need to administer a test once if you are calculating internal consistency

reliability. This type of reliability only controls for sources of error due to content since no

time interval is involved.

What happens to the reliability coefficient when it is based on halves of a test?

If the length of the possible halves of a test contains too few items, the reliability

coefficient may be distorted and typically is too small. The Spearman-Brown correction

somewhat compensates for this statistical artifact.

How do you calculate interrater reliability?

You divide the number of agreements for the two raters by the number of possible

agreements to get interrater reliability.

Standard Error of Measurement

Any discussion of reliability would be incomplete if standard error of measurement (SEm) were not discussed. Standard error of measurement is directly related to reliability. The formula for standard error of measurement is

$$SE_m = SD_o\sqrt{1 - r_{tt}}$$

The more reliable your test scores, the smaller your SEm. For example, let's say your reliability is 0.90 and the

SDo is 3 (SDo is the standard deviation for the test). If these values are inserted in the formula, your SEm is

0.95. If your reliability is 0.70, your SEm is 1.64. The higher the reliability, the less error in your measurement.

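Here's how we arrived at our two answers.

$$SE_m = 3\sqrt{1 - 0.90} = 3\sqrt{0.10} \approx 3(0.316) \approx 0.95$$

$$SE_m = 3\sqrt{1 - 0.70} = 3\sqrt{0.30} \approx 3(0.548) \approx 1.64$$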
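The same arithmetic can be sketched in Python. This is a minimal illustration, assuming the standard error of measurement is the standard deviation of obtained scores multiplied by the square root of 1 minus the reliability. The function name is hypothetical; the inputs are the values from the example above.

```python
from math import sqrt


def standard_error_of_measurement(sd_obtained: float, reliability: float) -> float:
    """SEm = SDo * sqrt(1 - reliability)."""
    if sd_obtained < 0:
        raise ValueError("standard deviation cannot be negative")
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return sd_obtained * sqrt(1 - reliability)


# Values from the example: SDo = 3 with reliabilities of 0.90 and 0.70.
sem_high = round(standard_error_of_measurement(3, 0.90), 2)  # 0.95
sem_low = round(standard_error_of_measurement(3, 0.70), 2)   # 1.64
print(sem_high, sem_low)
```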

A standard error of measurement is a deviation score and reflects the area around an obtained score where

you would expect to find the true score. This area is called a confidence interval. Yeah, that sounds like

gobbledygook to us too. Perhaps an example might clarify what we are trying to say.

Mary, one of our star pupils, obtained a score of 15 on our first measurement quiz. If the reliability coefficient

for this quiz was 0.90, the SEm is 0.95. Mary's obtained score of 15 could be viewed as symbolizing the mean

of all possible scores Mary could receive on this test if she took it over and over and over. An obtained score

is analogous to a mean on the normal, bell-shaped curve, and the SEm is equivalent to a SD. Mary's true

score would be within ± 1SEm of her obtained score 68.26% of the time (Xo ± 1SEm). That is, we are 68.26%

confident that Mary's true score would fall between the scores 14.05 and 15.95. We got these numbers by

subtracting 0.95 from Mary's score of 15 and by adding 0.95 to Mary's score of 15. Remember, it helps to

think of Mary's obtained score as representing the mean and the SEm as representing a SD on a normal

curve.

Surprise, surprise: Mary's true score would be within ± 2SEm of her obtained score 95.44% of the

time—between the scores of 13.10 and 16.90. If you guessed that Mary's true score would be within ± 3SEm

of her obtained score 99.74% of the time (between 12.15 and 17.85), you are so smart. If you remembered

standard deviation from Chapter 4, this was a piece of cake for you.

Let's see what happens when the reliability is only 0.70. As we calculated above, the SEm is 1.64. Mary still

received an obtained score of 15. We can be confident 68.26% of the time that Mary's true score is within ± 1

SEm of this obtained score (between scores of 13.36 and 16.64). Mary's true score would be within ± 2SEm

of her obtained score 95.44% of the time, between a score of 11.72 and a score of 18.28. By now you know that Mary's true score would be within ± 3SEm of this obtained score 99.74% of the time (between 10.08 and 19.92).

Notice that when the test has a high reliability as in the first case, we are more confident that the value of

Mary's true score is closer to her obtained score. The SEm is smaller, which reflects a smaller error score.

Remember that a goal in reliability is to control error.
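The confidence bands for Mary's score can be reproduced with a short Python sketch. It assumes, as the discussion above does, that each band is the obtained score plus and minus a whole number of SEm units. The function name is hypothetical; the obtained score of 15 and the SEm values of 0.95 and 1.64 come from the example.

```python
def true_score_band(obtained_score: float, sem: float, n_sems: int = 1):
    """Return (low, high) for obtained_score plus or minus n_sems * SEm."""
    if sem < 0 or n_sems < 0:
        raise ValueError("sem and n_sems cannot be negative")
    margin = n_sems * sem
    return obtained_score - margin, obtained_score + margin


# Mary's obtained score of 15 under each reliability from the example.
for reliability_label, sem in (("0.90", 0.95), ("0.70", 1.64)):
    for n_sems in (1, 2, 3):
        low, high = true_score_band(15, sem, n_sems)
        print(f"r_tt = {reliability_label}, +/- {n_sems} SEm: {low:.2f} to {high:.2f}")
```

Comparing the two sets of printed bands shows how much wider each interval becomes when the reliability drops from 0.90 to 0.70.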

Let's Check Your Understanding

What is the mathematical symbol for standard error of measurement?

_________________________________

_________________________________

What is the relationship between standard error of measurement and a reliability coefficient?

_________________________________

_________________________________

What does a 68.26% confidence interval mean?

_________________________________

_________________________________

If Mary in our example above had obtained a score of 18, and the SEm was 4, her true score

would be between what two scores at the 95.44% confidence interval?

_________________________________

_________________________________

Our Model Answers

What is the mathematical symbol for standard error of measurement?

SEm

What is the relationship between standard error of measurement and a reliability coefficient?

They have a negative relationship. The greater the reliability coefficient, the smaller the

standard error of measurement.

What does a 68.26% confidence interval mean?

A 68.26% confidence interval indicates that you can expect to find the true score within

±1SEm from the obtained score.

If Mary in our example above had obtained a score of 18, and the SEm was 4, her true score

would be between what two scores at the 95.44% confidence interval?

The two scores are 10 and 26. We arrived at these scores by adding and subtracting 2SEm

to the obtained score of 18. The mathematical value of 2SEm was 2(4) = 8.

Correlation Coefficients as Measures of Reliability

We've already told you that the Pearson product-moment correlation is used to assess test-retest and

alternate forms reliability because total test scores are continuous data and reflect interval-level data. When

we look at individual items to assess internal consistency reliability, we use either the Cronbach's alpha for

continuous response formats or the K-R 20 for dichotomous response formats. Before we end this chapter, we

need to comment on what type of reliability you would use for ordinal or rank-ordered data. Ranked data might

include placement in class or teacher's ratings of students (first, second, third, etc.). The correct procedure

for rank-ordered data is the Spearman rho.
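A Spearman rho can be sketched in Python as a Pearson correlation computed on ranks. This is an illustration under that common definition, with tied values given the average of the ranks they span; the text above names the statistic but does not show a formula. The function names and the two sets of rankings are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upward, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda index: values[index])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation between the ranks of x and the ranks of y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    mean_x = sum(rank_x) / len(rank_x)
    mean_y = sum(rank_y) / len(rank_y)
    dev_x = [r - mean_x for r in rank_x]
    dev_y = [r - mean_y for r in rank_y]
    ss_x = sum(d * d for d in dev_x)
    ss_y = sum(d * d for d in dev_y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("ranks must vary in both lists")
    return sum(a * b for a, b in zip(dev_x, dev_y)) / sqrt(ss_x * ss_y)


# Hypothetical rankings of five students by two teachers (1 = first place).
teacher_a = [1, 2, 3, 4, 5]
teacher_b = [2, 1, 3, 5, 4]
print(f"Spearman rho: {spearman_rho(teacher_a, teacher_b):.2f}")
```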

Some Final Thoughts About Reliability

There are several principles that we need to keep in mind about reliability:

When administering a test, follow the instructions carefully so that your administration matches

that of anyone else who would be administering this same test. This controls for one source of

error.

Try to establish standardized testing conditions such as good lighting in the room, freedom from

noise, and other environmental conditions. These can all be sources of error.

Reliability is affected by the length of a test, so make sure you have a sufficient number of items

to reflect the true reliability.

Key Terms

• Reliability

— Test-retest

— Alternate forms

— Internal consistency

— Interrater

• Standard error of measurement

• Pearson product-moment correlations

• K-R 20

• Cronbach's alpha

• Spearman-Brown correction

Models and Self-instructional Exercises

Our Model

Business and industry are increasingly concerned about employee theft. To help identify potential employees

who may have a problem with honesty, you have created the Honesty Inventory (HI). You have normed it on

employees representative of different groups of workers, such as used car salesmen, construction workers,

lawyers, and land developers. You have been asked to assess applicants for bank teller positions. You know

that the 2-week test-retest reliability of the HI scores for bank tellers is 0.76. You administer the HI to the

applicants. The applicants' scores range from 20 to 55 out of a possible 60 points. Their mean score is 45

with a standard deviation of 2.

What is the SEm for this group of applicants?

One applicant, Tom, scored 43. At a 95% confidence level, between what two scores does his

true score lie?

Between ____________ and _____________.

Based on what you know so far about Tom and the other applicants, would you recommend him

as an employee? Yes or no?

Why did you make this recommendation?

_________________________________

_________________________________

_________________________________

When you calculated the internal consistency for this group of applicants, the Cronbach's alpha

reliability coefficient was 0.65. What does this tell you about the content of the HI for these

applicants?

_________________________________

_________________________________

_________________________________

Our Model Answers

Being conscientious measurement consultants, we want to be conservative in our opinions. Therefore, we

pay close attention to our data when making recommendations. To err is human, but in business they don't

forgive.

What is the SEm for this group of applicants?

$$SE_m = SD_o\sqrt{1 - r_{tt}} = 2\sqrt{1 - 0.76} = 2\sqrt{0.24} \approx 2(0.49) = 0.98$$

One applicant, Tom, scored 43. At a 95% confidence level, between what two scores does his

true score lie?

We are confident 95% of the time that Tom's true score is between 41.04 and 44.96.

Based on what you know so far about Tom and the other applicants, would you recommend him

as an employee?

We would not recommend Tom for employment.

Why did you make this recommendation?

At the 95% confidence level, Tom's highest potential true score (44.96) still falls just short of the mean score of 45 for this applicant pool. More than likely, his true score is significantly

below the group mean. We would look for applicants whose true score would at a minimum

include the mean and scores above it at the 95% confidence interval.

When you calculated the internal consistency reliability for this group of applicants, the

Cronbach's alpha reliability coefficient was 0.65. What does this tell you about the content of the

HI for these applicants?

A 0.65 internal consistency reliability coefficient indicates that the applicants did not

respond consistently to all the items on the HI. This suggests significant content error. We

need to tread very lightly in making any recommendations about this group of applicants

based on the internal consistency reliability for the HI for their scores. Remember that the

reliability coefficient is suppressed when a test is shortened, as happens when internal

consistency reliability is calculated.

Now It's Your Turn

The same applicants are also given a multidimensional test that assesses both interpersonal relationships

(IP) and fear of math (FOM). For a norm group of 500 business majors, the reported Cronbach's alpha across

their 30-item IP scores was 0.84. Their K-R 20 was 0.75 across their 30-item FOM scores. For the bank teller

applicants, their mean score on the IP subscale is 50 (SD = 4, possible score range 0 to 60, with higher scores

reflecting stronger interpersonal skills). Their mean score on the FOM subscale is 20 (SD = 2.2, possible

score range 0 to 30, with higher scores reflecting greater fear of math).

What is the SEm for each subscale for this group of applicants?

One applicant, Harold, scored 50 on the IP subscale. At a 68.26% confidence level, between what

two scores does his true score lie?

Between ____________ and ___________.

Harold scored 10 on the FOM subscale. At a 68.26% confidence level, between what two scores

does his true score lie?

Between __________ and __________.

Based on what you know so far about Harold on these two subscales, would you recommend him

as an employee? Yes or no?

Why did you make this recommendation?

_________________________________

_________________________________

_________________________________

_________________________________

Six months later you retested the candidates you hired. The test-retest reliability coefficient was

0.88 for the IP scores and was 0.68 for FOM scores. What does this tell you about the stability of

the two subscales?

_________________________________

_________________________________

_________________________________

_________________________________

Our Model Answers

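What is the SEm for each subscale for this group of applicants?

$$SE_m(\text{IP}) = 4\sqrt{1 - 0.84} = 4\sqrt{0.16} = 4(0.40) = 1.6$$

$$SE_m(\text{FOM}) = 2.2\sqrt{1 - 0.75} = 2.2\sqrt{0.25} = 2.2(0.50) = 1.1$$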

One applicant, Harold, scored 50 on the IP subscale. At a 68.26% confidence level, between what

two scores does his true score lie?

Between 48.4 and 51.6. To get these two values, we added and subtracted 1.6 (the SEm)

from Harold's score of 50.

Harold scored 10 on the FOM subscale. At a 68.26% confidence level, between what two scores

does his true score lie?

Between 8.9 and 11.1. To get these two values, we added and subtracted 1.1 (the SEm)

from Harold's score of 10.

Based on what you know so far about Harold on these two subscales, would you recommend him

as an employee?

Yes, we should hire him.

Why did you make this recommendation?

Based only on these two tests, we chose to hire Harold because he has average

interpersonal skills and low fear of math. His scores on these two measures suggest that

he will interact satisfactorily with others (with customers as well as coworkers) and he will

be comfortable dealing with the mathematics related to being a bank teller (we don't know

his actual math ability, however).

Six months later you retested the candidates you hired. The test-retest reliability coefficient was

0.88 for the IP scores and was 0.68 for FOM scores. What does this tell you about the stability of

the two subscales?

The interpersonal relationships subscale scores were quite stable over a 6-month time

period for the pool of bank teller applicants who were actually hired. The scores on the

fear of math subscale for those who were hired were not as stable across time. This might

be due to the restricted range that resulted if only those with lower fear of math scores

on the initial testing were actually hired. The test-retest reliability coefficient of 0.68 might

also be related to the fact that the K-R 20 was also relatively low (0.75).

Words of Encouragement

Do you realize that you have successfully completed almost all of this book? Mastering the content in this

chapter is a major accomplishment. We hope you are starting to see how the measurement issues in the

earlier chapters have applicability for this more advanced topic: reliability.

If you want to practice using SPSS to calculate reliability coefficients, we suggest you visit

http://www.sagepub.com/kurpius. Individual test item responses for six different tests for 270 undergraduates

can be found in the Measurement Data Set.

http://dx.doi.org/10.4135/9781412986106.n9
