Validity: What You See Is Not Always What You Get

In: Testing and Measurement

By: Mary E. Stafford

Pub. Date: 2011

Access Date: May 19, 2019

Publishing Company: SAGE Publications, Inc.

City: Thousand Oaks

Print ISBN: 9781412910026

Online ISBN: 9781412986106

DOI: https://dx.doi.org/10.4135/9781412986106

Print pages: 141-162

© 2006 SAGE Publications, Inc. All Rights Reserved.

This PDF has been generated from SAGE Research Methods. Please note that the pagination of the

online version will vary from the pagination of the print book.

Validity: What You See Is Not Always What You Get

Suppose you've created a test that has perfect reliability (r = +1.00). Anyone who takes this test gets the

same score time after time after time. The obtained score is their true score. There is no error. Well, doesn't

this sound wonderful!? Don't be gullible. If you believe there is such a thing as a perfectly reliable test, could

we interest you in some ocean-front property in the desert? Remember that “what you see is not always what

you get.”

We're sorry to tell you, but having a perfectly reliable test is not enough. Indeed, a perfectly reliable test (or

even a nonreliable test) may not have any value at all. We offer the case of Professor Notsobright to prove our

point. Professor Notsobright wants to know how smart or intelligent everyone in his class is. He knows that

intelligence is related to the brain and decides, therefore, that brain size must surely reflect intelligence. Since

he can't actually measure brain size, he measures the circumference of each student's head. Sure enough,

he gets the same values each time he takes out his handy-dandy tape measure and encircles each student's

head. He has found a reliable measurement. What he has NOT found is a valid measure of the construct

intelligence.

In measurement, our objective is to use tests that are valid as well as reliable. This chapter introduces you to

the most fundamental concept in measurement—validity. Validity is defined as how well a test measures what

it is designed to measure. In addition, validity tells us what can be inferred from test scores. According to the

Standards for Educational and Psychological Testing (1999), “the process of validation involves accumulating

evidence to provide a sound scientific basis for the proposed score interpretations” (p. 9). Evidence of validity

is related to the accuracy of the proposed interpretation of test scores, not to the test itself.

Good ol' Professor Notsobright wanted to measure the construct of intelligence. The approach he mistakenly

chose (measuring the circumference of his students' heads) does not yield valid evidence of intelligence.

He would be totally wrong to interpret any scores he obtained as an indicator of his students' intelligence.

(We think Professor Notsobright got his degree through a mail-order catalog. Furthermore, we suggest

that someone who knows about validity assess Dr. Notsobright's intelligence and suggest he seek different

employment.)

Scores on a test need to be valid and reliable. Evidence of validity is typically reported as a validity coefficient,

which can range from 0 to +1.00. Like the reliability coefficient discussed in Chapter 9, a validity coefficient is

often a correlation coefficient. A validity coefficient of 0 indicates the test scores absolutely do not measure

the construct under investigation. A validity coefficient approaching +1.00 (which you probably will never see

in your lifetime) provides strong evidence that the test scores are measuring the construct under investigation.
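
Because a validity coefficient is often a correlation coefficient, a small worked example may help. The sketch below computes a Pearson correlation between test scores and a criterion. The eight students, their scores, and the names `pearson_r`, `test_scores`, and `criterion_gpa` are all made up for illustration; they do not come from any real test.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: test scores (x) and criterion scores (y) for eight students.
test_scores = [52, 60, 65, 70, 74, 80, 85, 90]
criterion_gpa = [2.1, 2.6, 2.4, 3.0, 2.9, 3.4, 3.3, 3.8]

# The validity coefficient is the correlation between test and criterion.
validity_coefficient = pearson_r(test_scores, criterion_gpa)
print(round(validity_coefficient, 2))
```

With made-up numbers this tidy, the coefficient comes out far higher than anything you are likely to see with real test scores.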

Ideally, test developers should report a validity coefficient for each of the groups for which the test could be

used. That is, if you're going to give an achievement test to middle school students, a validity coefficient for

each middle school grade level should be reported. In addition, validity coefficients for boys and for girls within

each grade level should be reported. Remember when we talked about norm groups in Chapter 6? Well, in a

perfect measurement world, validity coefficients would be reported for all the potential groupings discussed in

that chapter.
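
Reporting a separate validity coefficient for each group can be sketched in a few lines. Everything here is hypothetical: the records, the three grade levels, and the helper names are assumptions made for illustration, not data from any published test manual.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical records: (grade level, test score, criterion score).
records = [
    (6, 40, 55), (6, 50, 60), (6, 60, 72), (6, 70, 78),
    (7, 45, 70), (7, 55, 62), (7, 65, 75), (7, 75, 71),
    (8, 50, 58), (8, 60, 66), (8, 70, 70), (8, 80, 80),
]

# Collect test scores and criterion scores separately for each grade.
by_grade = defaultdict(lambda: ([], []))
for grade, test, criterion in records:
    by_grade[grade][0].append(test)
    by_grade[grade][1].append(criterion)

# One validity coefficient per validation group.
coefficients = {g: pearson_r(xs, ys) for g, (xs, ys) in by_grade.items()}
for grade, r in coefficients.items():
    print(f"grade {grade}: validity coefficient = {r:.2f}")
```

In this invented data set the coefficient for grade 7 is much weaker than for grades 6 and 8, which is exactly the kind of difference a single pooled coefficient would hide.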

The test user also has responsibility for test validation. If the test is going to be used in a setting different from

that reported by the test developers, the user is responsible for evaluating the validity evidence in the new

setting. For example, if a test was originally validated with public school students, but you want to use it with

students in a private parochial school, you have a responsibility for providing evidence of validity in this new

school setting.

Regardless of how evidence of validity is established, we want to stress that validity is a theoretical concept.

It can never actually be measured. A validity coefficient only suggests that test scores are valid for certain

groups in certain circumstances under certain conditions. We never ever “prove” validity, no matter how hard

we try. In spite of this, validity is an absolutely essential characteristic of a strong test. Only when a test is

valid (and of course, reliable) will you “get what you see.”

Let's Check Your Understanding

It's time to check your understanding of what we've told you so far.

Validity is defined as _________________.

When interpreting a test score, what is the role of validity?

_________________________________

_________________________________

Validity coefficients can range in value from _______ to __________.

When we talk about validity, are we referring to a test's scores or the test itself?

_________________________________

Test scores need to be both _________________ and _________________.

If we try hard enough, we can prove that the test scores are valid. True or false?

Our Model Answers

Validity is defined as how well a test measures what it is designed to measure.

When interpreting a test score, what is the role of validity?

Validity tells us what can be inferred from test scores.

Validity coefficients can range in value from 0 to +1.00.

When we talk about validity, are we referring to a test's scores or the test itself?

When we talk about validity, we are referring to a test's scores. Evidence of validity allows

us to interpret someone's test score accurately. We do not interpret a test.

Test scores need to be both valid and reliable.

If we try hard enough, we can prove that the test scores are valid.

This statement is false. Since validity is a theoretical concept, you can never prove its

existence.

Helping You Get What You See

Like the Phantom of the Opera whose presence hovers over and shapes the meaning of Andrew Lloyd

Webber's musical, validity hovers over and shapes the meaning of a test. As the musical evolves, the

phantom becomes more visible; as more evidence of validity evolves, the meaning of test scores becomes

clearer. To develop evidence of validity, attention needs to be given to validation groups, criteria, construct

underrepresentation, and construct-irrelevant variance.

Validation Groups

The groups on which a test is validated are called validation groups. For our achievement test example, the

validation groups were middle school students. The achievement test is valid for students who have the same

characteristics as those in the validation sample of middle school students. Anyone who will potentially use

this achievement test needs to determine how closely his or her students match the characteristics of the

students in the validation group. The more dissimilar the students are from the validation group, the less valid

the achievement test may be for the new group of students. Characteristics of the validation group should be

presented in a test's manual.

Criteria

Validity is always a reflection of some criterion against which it is being measured. A criterion is some

knowledge, behavior, skill, process, or characteristic that is not a component of the test being examined. It is

external to the test itself. For example, scores on the Scholastic Aptitude Test (SAT) or the Graduate Record

Examination (GRE) have typically been validated against the criterion of undergraduate grade point averages

(GPA) or grades in graduate school, respectively. A fairly strong positive relationship has been found between

scores on these tests and later GPAs (the criteria).

Scores on a test may also be validated against multiple criteria, depending on the inferences to be made

from the test scores. For example, scores on the Goody-Two-Shoes (G2S) personality test, which measures

a complex construct, probably need to be validated against several criteria. Appropriate criteria might include

teachers' perceptions of students, interpersonal relationships, potential for career success, and so forth. Each

of these criteria helps to define the goody-two-shoes construct. There would be a separate validity coefficient

for the relationship between the scores of the G2S test and each of these criteria. Collectively, these validity

coefficients provide evidence for the validity of the G2S scores measuring the construct “goody-two-shoes.”

In addition, based on which criterion was used to gather validity evidence, the interpretation of the G2S

test scores would vary. One would suspect, at least we do, that there may be a strong correlation between

students being “goody-two-shoes” and teachers' very favorable perceptions of these students. Therefore, the

criterion of teachers' perceptions of students provides strong evidence for the validity of scores on the G2S

test as a measure of teachers' perceptions of students. The same type of evidence needs to be gathered on

the other criteria in order to interpret scores on the G2S test as reflecting these criteria.

Construct Underrepresentation

When a test fails to capture or assess all important aspects of a construct adequately, this is called construct

underrepresentation. Let's go back to our example of aptitude tests to illustrate construct underrepresentation.

Most of you have probably taken the SAT or GRE. Furthermore, a few of you have probably argued that your

SAT or GRE test scores did not accurately reflect your academic ability. You may not know this, but it really

is possible that some tests don't comprehensively measure the constructs they are designed to measure.

When this happens, these tests are suffering from a serious illness that could even be fatal—construct

underrepresentation.

Let's pretend that the GRE suffers from construct underrepresentation. It doesn't really, but our example will

make more sense if we pretend that it does. Traditionally, the GRE measured quantitative and verbal aptitude.

More recently, reasoning ability was added to the GRE to complement its assessment of quantitative and

verbal abilities. Perhaps the test developers realized that the original GRE measured aptitude too narrowly,

so they added items to broaden the measure to include reasoning ability. Doing this broadened the domain

of behaviors that reflect aptitude. Perhaps this more comprehensive assessment allows the GRE items to

better represent the construct of aptitude.

Construct-Irrelevant Variance

It is also possible that when you took the SAT or GRE some process extraneous to the test's intended

construct affected the test scores. These extraneous variables might include things such as your reading

ability, speed of reading, emotional reactions to test items, familiarity with test content, test anxiety, or items

not related to the construct(s) being measured. Each of these can contribute to construct-irrelevant variance.

This is a source of error in the validity coefficient.

Before we introduce you to the most common sources of validity evidence, it's time to check your

understanding of the concepts just introduced.

Let's Check Your Understanding

The individuals to be tested need to have characteristics similar to those of the

______________________.

A criterion is ____________________.

A test should be validated against one and only one criterion. True or false?

The criteria used as a source of validity evidence are external to the test itself. True or false?

Construct underrepresentation occurs when _______________

_________________________________

A source of error in a validity coefficient that is not related to the test's intended construct is called

________________________.

Examples of this source of error include ________________ and __________________.

Our Model Answers

The individuals to be tested need to have characteristics similar to those of the validation group.

A criterion is some knowledge, behavior, skill, process, or characteristic that is used to

establish the validity of test scores.

A test should be validated against one and only one criterion.

This statement is false. Test scores should be validated on as many criteria as are relevant

to the construct being measured. Multiple sources of validity evidence are particularly

needed when the test measures a complex construct.

The criteria used as a source of validity evidence are external to the test itself.

True. Criteria are not components of the test itself.

Construct underrepresentation occurs when the test does not adequately assess all aspects

of the construct being measured.

A source of error in a validity coefficient that is not related to the test's intended construct is called

construct-irrelevant variance.

Examples of this source of error include reading ability, speed of reading, emotional reactions

to test items, familiarity with test content, test anxiety, or items not related to the

construct(s) being measured.

Sources of Validity Evidence

If you read the measurement literature (don't laugh, we find some of this literature very interesting), you might

have noticed that multiple “types” of validity are presented. Most likely, you'll find the terms content, construct,

concurrent, and predictive validity. Based on the Standards for Educational and Psychological Testing (1999),

validity is viewed as a unitary concept. It is the extent to which all sources of evidence for validity support the

intended interpretation of test scores. Even though validity is indeed a unitary concept, you still need to know

about the traditional types or sources of evidence for validity.

Evidence Based on Test Content

Examination of the content covered by test items and the construct the test is intended to measure can

yield important evidence for content validity. Test developers typically write their items to reflect a specific

content domain. Examples of test content domains might include the Revolutionary War, measures of central

tendency, eating disorders, leadership style, or star constellations. The more clearly every item on a test taps

information from one specific content domain, the greater the evidence for content validity.

Evidence for content validity typically comes from two approaches: (1) an empirical analysis of how well the

test items reflect the content domain and (2) expert ratings of the relationship between the test items and

the content domain. Empirical evidence can be derived from a statistical procedure such as factor analysis

to determine whether all of the test items measure one content domain or construct. The second approach,

expert ratings, requires identification of people who are experts on a content area. These experts then jointly

agree on the parameters of the content domain or construct they will be evaluating. Finally, based on these

parameters, they judge each item as to how well it assesses the desired content.
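
The expert-rating approach can be sketched as a simple tally. The text does not say what rating scale the experts use, so the 1-to-4 relevance scale, the cutoff of 3.0, and the item names below are all assumptions made for illustration, not a published standard.

```python
# Hypothetical ratings: four experts rate each item on an assumed scale of
# 1 (not relevant to the content domain) to 4 (highly relevant).
expert_ratings = {
    "item_1": [4, 4, 3, 4],
    "item_2": [3, 4, 4, 3],
    "item_3": [1, 2, 1, 2],
    "item_4": [4, 3, 4, 4],
}

# Assumed threshold for keeping an item; chosen only for this illustration.
CUTOFF = 3.0

# Average the experts' judgments for each item.
mean_ratings = {item: sum(r) / len(r) for item, r in expert_ratings.items()}

# Items the experts judge to fall outside the content domain.
flagged_items = [item for item, mean in mean_ratings.items() if mean < CUTOFF]

for item, mean in mean_ratings.items():
    print(f"{item}: mean relevance = {mean:.2f}")
print("Items to review or drop:", flagged_items)
```

Here the experts agree that the third item does not tap the content domain, so it would be a candidate for revision or removal.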

Content validity is most easily illustrated with an example from education. Every test you have taken, whether

in your math classes, your English classes, or this measurement class, should only include items that assess

the content or information covered in that class. This information may have been given through lectures,

readings, discussions, or demonstrations. The professor, who is the content expert, develops the class tests

by creating items that reflect the specific information covered in the course. To the extent that the content of

all of the items reflects the course content, evidence of content validity is established. Items that do not reflect

the course content contribute to construct-irrelevant variance. These items cause variation in test scores that

is not related to knowledge of the course content. Not all professors know about construct-irrelevant variance,

and they may or may not appreciate your educating them. So, use your knowledge wisely. (Believe it or not, even

professors can become defensive.)

In the world of work, evidence of content validity is provided by a strong correspondence between specific

job tasks and the content of test items. Experts identify the dimensions of a job or the tasks that the job

comprises. One process of deriving job tasks is to observe job behaviors systematically. Test items are then

evaluated against the specific job tasks. The correspondence between the test items and the job tasks is

referred to as the job relatedness of the test. Indeed, the U.S. Supreme Court has mandated that tests used

for job selection or placement have job relatedness.

The appropriateness of a specific content domain is directly related to any interpretation or inferences to be

made from test scores. In our education example, we may want to draw conclusions about individual student

mastery of a content area such as knowledge of star constellations. Based on their level of mastery, we may

want to make decisions about students passing or not passing the class.

We may also want to interpret test scores to find areas of a curriculum being adequately or inadequately

taught. If the majority of students systematically miss items related to class content, then perhaps this

content was not adequately taught. If we had given the students a comprehensive achievement test that was

designed and validated to measure multiple dimensions of achievement, we could draw conclusions about

content neglected or content taught based on student responses to test items. Information about content

neglected and about content taught both provide evidence of content validity.

Evidence of Criterion-Related Validity

A second “type” or source of validity evidence is criterion-related validity. If the purpose of a test is to predict

some future behavior or to estimate current behavior, you want evidence that the test items will do this

accurately. The relationship between test scores and the variable(s) external to the test (criterion) will provide

this source of evidence for criterion-related validity, as discussed next.

Predictive and Concurrent Validity

If the goal is for test scores to predict future behavior, we are concerned with predictive validity. This source

of evidence indicates that the test scores are strongly related to (predict) some behavior (criterion) that is

measured at a later time. Remember our example of the SAT and GRE aptitude tests? The ability of scores

on these two tests to predict future GPAs accurately provides evidence for their predictive validity.

In contrast, evidence for concurrent validity indicates a strong relationship between test scores and some

criterion measured at the same time. Both assessments are administered concurrently (or in approximately

the same time frame). Concurrent validity is essential in psychodiagnostic tests. For example, if someone

scores high on a test of depression, this person should also score high on any co-occurring criterion related

to depression. Sample criteria for depression could include mental health professionals' ratings, behavioral

observations, or self-reported behaviors. The College Stress Scale (CSS) that we introduced in Chapter 2

should have concurrent validity with behaviors that are indicative of currently experienced stress related to

being in college. The CSS should not be predicting future college stress or reflecting stress experienced in

the distant past.

Similarly, our example of a test measuring elements of a job can be viewed as the test items having

concurrent validity with the job tasks. In work settings, a test is often used because its scores have concurrent

validity with the specific requirements of a job. Unless we're going to do extensive on-the-job training, we

want to know a person's ability to do a specific job immediately. In contrast, if we're going to do extensive

on-the-job training, we're more interested in the ability of test scores to predict a person's ability to perform

successfully after training.

A Special Case: Portfolio Assessment

Evidence for criterion-related validity is essential when portfolio assessment is the approach to measurement.

Let's say you're using a portfolio assessment to select students for a graduate program in psychology. To

keep this example simple, let's focus only on three criteria: (1) ability to use APA style when writing; (2) good

listening skills; and (3) academic potential.

For the first criterion (the ability to use APA style when writing), experts can evaluate an applicant's written

document to determine how closely it adheres to APA style. The document becomes the test, and the experts'

ratings are the criterion. The relationship between the two provides evidence of concurrent validity for the

applicant's ability to use APA style at the time of writing.

For the second criterion, good listening skills, a behavioral observation of the applicant in a structured role-

play situation could yield information about his or her current ability to use listening skills. The relationship

between the applicant's behaviors and the expert's ratings of these behaviors as reflecting listening skills

provides evidence for concurrent validity.

The third criterion, academic potential, would best be measured by an aptitude test, such as the GRE, that

would predict graduate school success.

Scores on this aptitude test would need to have established evidence of predictive validity for success in

graduate school.

Two Dimensions of Concurrent Validity—Convergent and Discriminant Validity

In each of the examples given thus far, we have been talking about how test scores and the criteria converge.

Significant relationships between test scores and other measures designed to assess the same construct

or behavior provide evidence of convergent validity. The test scores and the criterion are theoretically and

empirically linked.

The relationship (or nonrelationship) between test scores and measures of a construct to which the test is

not theoretically related also provides evidence for concurrent validity, known as discriminant validity. For

example, scores on an entrance exam for medical school should have convergent validity with grades in

medical school; however, these same scores may have a weak or no relationship with ratings of physician

bedside manner. This poor relationship provides evidence of discriminant validity. In this example, bedside

manner is a construct different from what is being measured by the medical school entrance exam.

Convergent and discriminant validity indicate not only what a test will predict but also what it will not predict.
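
The medical school example can be sketched with two correlations: one that should be strong (convergent) and one that should be weak or absent (discriminant). All of the numbers below are invented, and the variable names are assumptions made for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical data for eight medical students.
entrance_exam = [500, 520, 540, 560, 580, 600, 620, 640]
med_school_grades = [2.8, 3.0, 2.9, 3.2, 3.3, 3.5, 3.4, 3.8]
bedside_ratings = [4, 2, 5, 3, 3, 5, 2, 4]

# Convergent evidence: exam scores should relate to a theoretically linked criterion.
convergent_r = pearson_r(entrance_exam, med_school_grades)

# Discriminant evidence: exam scores should NOT relate to an unrelated construct.
discriminant_r = pearson_r(entrance_exam, bedside_ratings)

print(f"exam vs. grades:          r = {convergent_r:.2f}")
print(f"exam vs. bedside manner:  r = {discriminant_r:.2f}")
```

A strong first correlation paired with a second correlation near zero is the pattern the text describes: the exam predicts grades, and it does not predict bedside manner.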

Evidence of Construct Validity

Construct validity, sometimes referred to as internal structural validity, indicates the degree to which all items

on a test are interrelated and measure the theoretical trait or construct the test is designed to measure.

Basically, a construct is a theoretical explanation for some behavior. Construct validity is concerned with the

validation of this underlying theory.

Anxiety is a theoretical construct that we can verify only by seeing how it is manifested in current behavior.

Because anxiety is theoretically one construct, a test that measures anxiety should be unidimensional. Factor

analysis is a statistical procedure that tests whether all items on a test contribute to that one construct. If a

test is unidimensional, we expect a one-factor structure to emerge from the factor analysis.
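
A full factor analysis needs statistical software, but a rough first check on unidimensionality can be sketched by hand. The code below is not a factor analysis; it is a simplified stand-in that asks how much of the total item variance lines up with the largest eigenvalue of the inter-item correlation matrix. The four items, six respondents, and all function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def dominant_eigenvalue(matrix, iterations=200):
    """Largest eigenvalue of a symmetric matrix, estimated by power iteration."""
    n = len(matrix)
    vector = [1.0] * n
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]
    # Rayleigh quotient of the converged unit vector gives the eigenvalue.
    product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
    return sum(vector[i] * product[i] for i in range(n))


# Hypothetical responses: rows are respondents, columns are four anxiety items.
responses = [
    [1, 2, 1, 2],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [3, 4, 4, 3],
    [4, 4, 5, 5],
    [5, 5, 4, 5],
]

items = list(zip(*responses))  # one tuple of scores per item
k = len(items)
correlations = [[pearson_r(items[i], items[j]) for j in range(k)] for i in range(k)]

# Share of total standardized variance associated with the first dimension.
first_share = dominant_eigenvalue(correlations) / k
print(f"share of variance on the first dimension: {first_share:.2f}")
```

When one dimension accounts for most of the variance, as it does with these invented responses, the items are behaving as if they measure a single construct. A real validation study would follow up with a proper factor analysis.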

Many tests, however, are multidimensional, making whatever is being assessed more interesting (we think).

For example, the Strong Interest Inventory (SII) measures six different occupational interests: Realistic (R),

Artistic (A), Investigative (I), Social (S), Enterprising (E), and Conventional (C). The theoretical foundation for

the SII is Holland's conceptualization that there are six general work environments and these environments

are characterized by the six occupational interests. Considerable research on the psychometric properties of

the SII has consistently provided evidence supporting its underlying theory regarding a six-factor structure.

Support for a six-factor structure provides some evidence of construct validity (internal structural validity) for

each factor.

Items measuring a factor such as Enterprising (E) are homogeneous and contribute only to the construct

validity of that factor. There is evidence that the SII yields valid scores for adult men and women across a

variety of settings. The multidimensionality of the SII makes it interesting because we can interpret scores for

each of the six occupational interests and create an occupational interest profile for everyone who takes the

test.

A typical profile for business major Ronald Frump might be EAS. Ronald is high on Enterprising, Artistic, and

Social. He can be a “killer” business person, making those hard decisions that influence the bottom line. This

reflects his strong E score. His Artistic bent shows up in his creativity and ingenuity in the business world

and in his extensive art collection. Ronald's propensity to be the center of attention and to be surrounded by

fawning employees is a manifestation of the Social component of his occupational interests. Because there

is strong evidence that the scores on the SII are valid, we can draw conclusions about Ronald and how he

will manifest his occupational profile in the business world. (Watch out competitors!) If his behaviors match

his profile, this lends further support for the construct validity of the SII.

Let's Check Your Understanding

We just fed you a three-course dinner. Let's see if you've started to digest each of these courses. Let's check

your understanding just in case you need to review some aspect of validity.

What are the three major sources of evidence of validity?

_________________________________

_________________________________

For which source of validity evidence do you compare test items with a specific domain of

information?

_________________________________

_________________________________

What are the two approaches for obtaining evidence for content validity?

a. _______________________

b. _______________________

What are the names given to the two major types of criterion-related validity?

a. _______________________

b. _______________________

Criterion-related validity is essential when the purpose of a test is

a. _______________________

b. _______________________

What is convergent validity?

_________________________________

_________________________________

What is discriminant validity?

_________________________________

_________________________________

Convergent and discriminant validity indicate not only what a test ____________________ but

also what it _______________________.

Conceptually, construct validity is ___________________

_________________________________.

Construct validity is also referred to as __________________.

When a single theoretical construct is being measured, the test should be

________________________.

When multiple theoretical constructs are being measured by the same test, the test should

be ________________ and have __________________ for each construct or factor being

assessed.

Our Model Answers

What are the three major sources of evidence of validity?

The three major sources of validity evidence are content validity, criterion-related validity, and

construct validity.

For which source of validity evidence do you compare test items with a specific domain of

information?

For content validity, you compare test items with a specific domain of information.

What are the two approaches for obtaining evidence for content validity?

The two approaches for obtaining evidence for content validity are (a) an empirical

analysis of how well the test items reflect the content domain and (b) expert ratings of the

relationship between the test items and the content domain.

What are the names given to the two major types of criterion-related validity?

The two major types of criterion-related validity are (a) concurrent validity and (b)

predictive validity.

Criterion-related validity is essential when the purpose of a test is (a) to estimate current

behaviors or (b) to predict some future behavior.

What is convergent validity?

Evidence of convergent validity is shown when there is a significant relationship between

test scores and other assessments of the same construct or behavior.

What is discriminant validity?

Evidence of discriminant validity is shown when there is a nonsignificant relationship

between test scores and measures of a construct to which the test is not theoretically

related.

Convergent and discriminant validity indicate not only what a test will predict but also what it will

not predict.

Conceptually, construct validity is the degree to which all items of a test are interrelated

and measure the theoretical trait the test is designed to measure.

Construct validity is also referred to as internal structural validity.

When a single theoretical construct is being measured, the test should be unidimensional.

When multiple theoretical constructs are being measured by the same test, the test should be

multidimensional and have validity evidence for each construct or factor being assessed.

The Marriage of Reliability and Validity—Wedded Bliss

Both reliability and validity are essential characteristics of a good test. Like love and marriage, you can't

have one without the other. Reliability and validity are even wed to each other mathematically. The validity

coefficient ($r_{xy}$) for a test's scores cannot be greater than the square root of the test's reliability ($r_{xx}$). For $r_{xy}$,

x stands for the test scores and y stands for scores on the criterion. The formula for the relationship between

validity and reliability is

$$r_{xy} \le \sqrt{r_{xx}}$$

If the reliability of a test is 0.64, the potential maximum value of the validity coefficient would be 0.80. Notice

our use of the words “potential maximum value.” Rarely does a validity coefficient exactly equal the square

root of the reliability coefficient. It is almost always less than this potential maximum value.
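
The square-root ceiling described above can be checked with a few lines of code. This is a minimal sketch; the function name `max_validity` is our own label for illustration and does not come from the chapter.

```python
import math


def max_validity(reliability: float) -> float:
    """Return the potential maximum validity coefficient, sqrt(r_xx)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("a reliability coefficient must lie between 0 and 1")
    return math.sqrt(reliability)


# The chapter's example: a reliability of 0.64 gives a ceiling of 0.80.
print(f"{max_validity(0.64):.2f}")
```

Keep in mind that the function returns only the upper bound. An observed validity coefficient would normally fall below it.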

Interpreting the Validity of Tests: Intended and Unintended Consequences

We mentioned high-stakes testing earlier. High-stakes testing is when test results are used to make critical

decisions such as whether or not a student receives a high school diploma based on his or her test scores.

This decision is based on social policy, although policy makers have tried to embed it in the realm of validity.

For example, our student, John, takes a comprehensive achievement test during his senior year. This test

assesses multiple dimensions of knowledge, and evidence has been provided to support its content and

construct validities. An inappropriate use of this test would be to differentiate students into two groups: those

who can graduate and those who can't. An achievement test is only designed to measure knowledge of

content. Evidence of validity supports this purpose. Evidence of validity does not support the goal of social

policy makers, to give diplomas only to those who score “high enough.” Validity is always interpreted in light

of the purpose of the test and should not be distorted for alternative purposes.

Some Final Thoughts About Validity

As noted by the Standards for Educational and Psychological Testing (1999), “A sound validity argument

integrates various strands of evidence into a coherent account of the degree to which existing evidence

and theory support the intended interpretation of test scores for specific uses” (p. 17). Two important

concepts related to validity appear in this statement. First, more than one strand of evidence is needed for

sound validity. In addition to the sources we have discussed in this chapter, another approach, the

multitrait-multimethod (MTMM) approach, also addresses the need for more than one strand of evidence for sound

validity. Second, the intended interpretation of test scores is based on validity evidence. The goal of testing is

to provide meaningful information. This is only possible if there is evidence supporting the validity of the test

scores. Without validity, test scores are meaningless! You might as well have read tea leaves.

Key Terms

• Validity

• Validation group

• Criterion

• Construct underrepresentation

• Construct-irrelevant variance

• Sources of validity evidence

— Content

— Criterion related

— Predictive

— Concurrent

— Convergent

— Discriminant

— Internal structure

— Construct

Models and Self-instructional Exercises

Our Model

Remember the Honesty Inventory (HI) you created in Chapter 9 to assess applicants for bank teller positions?

We know the reliability coefficient was 0.76. Now let's see how valid scores for the HI are. Not only do you

administer the HI to all applicants, you also give them the Perfectly Honest Scale (PHS) and a mathematical

aptitude test. The test manual for the PHS reports a validity coefficient of 0.80 as evidence of criterion-

related validity. Experts evaluated whether the items on the PHS measure aspects of the

construct honesty. The manual also reports that when the HI was given to incoming freshmen, 4 years later

it discriminated between undergraduates who were elected as members of a national honor society and

undergraduates who were kicked out of school for cheating. Based on what we've told you thus far:

What is the maximum potential validity value of the HI?

_________________________________

_________________________________

When you correlate applicants' HI scores with their PHS scores, what source of validity evidence

are you assessing?

_________________________________

_________________________________

When you correlate applicants' HI scores with their mathematical aptitude test scores, what

source of validity evidence are you assessing?

_________________________________

_________________________________

What source of validity evidence was established by the HI scores later discriminating between

students in an honor society and students kicked out of school?

_________________________________

_________________________________

Based on the sources of validity evidence you have gathered, have you proven that the HI is a

valid assessment instrument for potential bank tellers?

_________________________________

_________________________________

Based on your answers to questions 1 through 4, would you recommend the HI as a valid test of

honesty and why?

_________________________________

_________________________________

_________________________________

_________________________________

An applicant who is not hired files a complaint about the job relatedness of the assessments

used to screen applicants. How would you address this complaint based on what you know?

_________________________________

_________________________________

_________________________________

_________________________________

Our Model Answers

What is the maximum potential validity value of the HI?

The validity cannot be greater than the square root of the reliability coefficient. To calculate

the maximum potential validity value, we would use the formula

$$r_{xy} \le \sqrt{r_{xx}} = \sqrt{0.76} \approx 0.87$$

Therefore, the maximum potential validity value is 0.87.

When you correlate applicants' HI scores with their PHS scores, what source of validity evidence

are you assessing?

When you correlate two measures administered at the same time and designed to assess

the same theoretical construct, you are providing evidence for concurrent validity. In

addition, you are providing evidence of construct validity since they are both based on the

same theoretical construct.

When you correlate applicants' HI scores with their mathematical aptitude test scores, what

source of validity evidence are you assessing?

We would not expect honesty and mathematical aptitude to be theoretically linked.

Therefore, the nonrelationship between these two tests would provide evidence for

discriminant validity.

What source of validity evidence was established by the HI scores later discriminating between

students in an honor society and students kicked out of school?

Because the HI was administered at the beginning of the students' freshman year and

whether they became members of the honor society or were kicked out of school was

assessed 4 years later, the HI was used to predict the future status of the students. This

process provided evidence of predictive validity for HI scores.

Based on the sources of validity evidence you have gathered, have you proven that the HI is a

valid assessment instrument for potential bank tellers?

Although a variety of sources have provided evidence for validity of the HI, validity can

never be proven.

Based on your answers to questions 1 through 4, would you recommend the HI as a valid test of

honesty?

Multiple sources of evidence for validity of the HI have been provided. Specifically, for the

potential bank tellers, we've gathered evidence of concurrent, construct, and discriminant

validity. We know that the validity coefficient could be as large as 0.87 which would

be very strong. However, we need to remember that this is just a potential highest

value, not the actual value of the validity coefficient. Furthermore, while the evidence

provided for predictive validity is interesting, we need to remember that it was gathered on

undergraduates, not on applicants for bank teller positions. All in all, however, we would

probably recommend the HI as an assessment of honesty for applicants for bank teller

positions.

An applicant who is not hired files a complaint about the job relatedness of the assessments

used to screen applicants. How would you address this complaint based on what you know?

Although honesty is a highly desirable characteristic in someone entrusted with other

people's money, it is not directly related to the job. The mathematical aptitude test,

however, would have job relatedness. Bank tellers need to know how to do math to be

successful on the job. Perhaps this applicant was not hired because of a poor math

aptitude score.

Now It's Your Turn

Based on their scores from the HI and their mathematical aptitude test scores, you select the 50 applicants

who are the most honest and have the highest math aptitude for further testing. You give them the

multidimensional scale described in Chapter 9 to measure interpersonal relationships (IP) and fear of math

(FOM). We know the internal consistency reliability coefficients were 0.84 for the IP scores and 0.75 for the

FOM scores for the norm group of undergraduate business majors. For bank tellers who were hired, the

6-month test-retest reliabilities were 0.88 and 0.68 for these two subscales, respectively.

What is the maximum potential validity value of the IP for bank tellers?

_________________________________

_________________________________

_________________________________

What is the maximum potential validity value of the FOM for bank tellers?

_________________________________

_________________________________

_________________________________

When you correlate applicants' FOM scores with their math aptitude scores, what source of

validity evidence are you assessing?

_________________________________

When you correlate applicants' IP scores with their mathematical aptitude test scores, what

source of validity evidence are you assessing?

_________________________________

What source of validity evidence was established when math aptitude scores were used to later

predict FOM scores for those who were hired?

_________________________________

What source of validity evidence was established when FOM scores were found not to be related

to IP scores for the applicant pool?

_________________________________

Based on your answers to the questions above, would you recommend the IP as a valid and

reliable test of interpersonal relationships and why?

_________________________________

_________________________________

_________________________________

_________________________________

Based on your answers to the questions above, would you recommend the FOM as a valid and

reliable test of fear of math and why?

_________________________________

_________________________________

_________________________________

_________________________________

Our Model Answers

What is the maximum potential validity value of the IP for bank tellers?

The validity cannot be greater than the square root of the reliability coefficient. The

reliability coefficient we must use is the test-retest reliability coefficient, because this

is the only reliability coefficient based just on bank tellers. To calculate the maximum

potential validity value, we would use the formula

$$r_{xy} \le \sqrt{r_{xx}} = \sqrt{0.88} \approx 0.94$$

Therefore, the maximum potential validity value of the IP for bank tellers is 0.94.

What is the maximum potential validity value of the FOM for bank tellers?

Again, the reliability coefficient we must use is the test-retest reliability coefficient,

because this is the only reliability coefficient based just on bank tellers. To calculate the

maximum potential validity value, we would use the formula

$$r_{xy} \le \sqrt{r_{xx}} = \sqrt{0.68} \approx 0.82$$

Therefore, the maximum potential validity value of the FOM for bank tellers is 0.82.

When you correlate applicants' FOM scores with their math aptitude scores, what source of

validity evidence are you assessing?

When you correlate two measures administered at the same time and designed to assess

the same or related theoretical constructs, you are providing evidence for concurrent

validity, specifically convergent validity.

When you correlate applicants' IP scores with their mathematical aptitude test scores, what

source of validity evidence are you assessing?

When you correlate two measures administered at the same time and designed to assess

different or unrelated theoretical constructs, you are also providing evidence for

concurrent validity, specifically discriminant validity.

What source of validity evidence was established when math aptitude scores were used to later

predict FOM scores for those who were hired?

When you use one set of scores to predict scores on a test measuring the same or

a related construct and given at a later time, you are providing evidence for predictive

validity.

What source of validity evidence was established when FOM scores were found not to be related

to IP scores for the applicant pool?

When you correlate two measures administered at the same time and designed to assess

different or unrelated theoretical constructs, you are again providing evidence for

concurrent validity, specifically discriminant validity.

Based on your answers to the questions above, would you recommend the IP as a valid and

reliable test of interpersonal relationships and why?

Yes. Its maximum potential validity coefficient was 0.94. The IP had a test-retest reliability

of 0.88 for bank tellers, and an internal consistency reliability of 0.84 for undergraduate

business majors. In addition, the SEm of 1.6 is relatively small for the applicant pool.

Taken collectively, the IP appears to be a reliable and valid measure of interpersonal

relationships for bank tellers.

Based on your answers to the questions above, would you recommend the FOM as a valid and

reliable test of fear of math and why?

Maybe. Its maximum potential validity coefficient was 0.82. The FOM had a test-retest

reliability of only 0.68 for bank tellers, and an internal consistency reliability of 0.75

for undergraduate business majors. These are weak reliability coefficients. However, the

SEm of 1.1 is relatively small for the applicant pool. Taken collectively, the FOM is not a

particularly reliable measure of fear of math for bank tellers, even though the maximum

potential validity coefficient is 0.82. This would be a good time to emphasize that this value

is only a “maximum potential.” The actual validity could be much lower.
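
The ceilings quoted in the model answers can be reproduced from the reliability coefficients given in the exercises. The sketch below assumes only the square-root rule stated earlier in the chapter. The dictionary names are hypothetical labels chosen for illustration.

```python
import math

# Reliability coefficients quoted in the exercises above.
reliabilities = {
    "HI": 0.76,
    "IP (test-retest, bank tellers)": 0.88,
    "FOM (test-retest, bank tellers)": 0.68,
}

# Maximum potential validity is the square root of each reliability.
ceilings = {name: round(math.sqrt(r_xx), 2) for name, r_xx in reliabilities.items()}

for name, ceiling in ceilings.items():
    print(f"{name}: maximum potential validity = {ceiling:.2f}")
```

These values are upper bounds only. They say nothing about how large the actual validity coefficient is for any of the three instruments.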

Words of Encouragement

Hurray, hurray, hurray! You have mastered all the basic technical aspects of measurement and testing that

we have covered in this user-friendly guide. We hope we have piqued your interest in measurement. If we

have, and you are thinking about pursuing more course work in tests and measurement and then applying

this information in a job setting, our last chapter on the perils and pitfalls of testing will be of particular interest

to you.

http://dx.doi.org/10.4135/9781412986106.n10
