Research X
Validity: What You See Is Not Always What You Get
In: Testing and Measurement
By: Mary E. Stafford
Pub. Date: 2011
Access Date: May 19, 2019
Publishing Company: SAGE Publications, Inc.
City: Thousand Oaks
Print ISBN: 9781412910026
Online ISBN: 9781412986106
DOI: https://dx.doi.org/10.4135/9781412986106
Print pages: 141-162
© 2006 SAGE Publications, Inc. All Rights Reserved.
Validity: What You See Is Not Always What You Get
Suppose you've created a test that has perfect reliability (r = +1.00). Anyone who takes this test gets the
same score time after time after time. The obtained score is their true score. There is no error. Well, doesn't
this sound wonderful!? Don't be gullible. If you believe there is such a thing as a perfectly reliable test, could
we interest you in some ocean-front property in the desert? Remember that “what you see is not always what
you get.”
We're sorry to tell you, but having a perfectly reliable test is not enough. Indeed, a perfectly reliable test (or
even a nonreliable test) may not have any value at all. We offer the case of Professor Notsobright to prove our
point. Professor Notsobright wants to know how smart or intelligent everyone in his class is. He knows that
intelligence is related to the brain and decides, therefore, that brain size must surely reflect intelligence. Since
he can't actually measure brain size, he measures the circumference of each student's head. Sure enough,
he gets the same values each time he takes out his handy-dandy tape measure and encircles each student's
head. He has found a reliable measurement. What he has NOT found is a valid measure of the construct
intelligence.
In measurement, our objective is to use tests that are valid as well as reliable. This chapter introduces you to
the most fundamental concept in measurement—validity. Validity is defined as how well a test measures what
it is designed to measure. In addition, validity tells us what can be inferred from test scores. According to the
Standards for Educational and Psychological Testing (1999), “the process of validation involves accumulating
evidence to provide a sound scientific basis for the proposed score interpretations” (p. 9). Evidence of validity
is related to the accuracy of the proposed interpretation of test scores, not to the test itself.
Good ol' Professor Notsobright wanted to measure the construct of intelligence. The approach he mistakenly
chose (measuring the circumference of his students' heads) does not yield valid evidence of intelligence.
He would be totally wrong to interpret any scores he obtained as an indicator of his students' intelligence.
(We think Professor Notsobright got his degree through a mail-order catalog. Furthermore, we suggest
that someone who knows about validity assess Dr. Notsobright's intelligence and suggest he seek different
employment.)
Scores on a test need to be valid and reliable. Evidence of validity is typically reported as a validity coefficient,
which can range from 0 to +1.00. Like the reliability coefficient discussed in Chapter 9, a validity coefficient is
often a correlation coefficient. A validity coefficient of 0 indicates the test scores absolutely do not measure
the construct under investigation. A validity coefficient approaching +1.00 (which you probably will never see
in your lifetime) provides strong evidence that the test scores are measuring the construct under investigation.
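To make the idea of a validity coefficient concrete, here is a minimal Python sketch that correlates test scores with an external criterion. The scores and GPAs are made-up numbers for illustration only, and the function name `pearson_r` is our own choice, not something taken from a testing manual.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    ss_x = sum(a * a for a in dx)
    ss_y = sum(b * b for b in dy)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable has no variance")
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(ss_x * ss_y)

# Hypothetical data: aptitude test scores (x) and later GPA (y, the criterion)
# for eight students. These values are invented for this sketch.
test_scores = [52, 61, 47, 70, 66, 58, 75, 49]
criterion_gpa = [2.9, 3.1, 2.5, 3.6, 3.2, 3.0, 3.8, 2.7]

validity_coefficient = pearson_r(test_scores, criterion_gpa)
print(f"validity coefficient r_xy = {validity_coefficient:.2f}")
```

With real data you would compute this separately for each group with which the test will be used, since a coefficient found in one group says little about another.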
Ideally, test developers should report a validity coefficient for each of the groups for which the test could be
used. That is, if you're going to give an achievement test to middle school students, a validity coefficient for
each middle school grade level should be reported. In addition, validity coefficients for boys and for girls within
SAGE
2006 SAGE Publications, Ltd. All Rights Reserved.
SAGE Research Methods
Page 2 of 21 Testing and Measurement
each grade level should be reported. Remember when we talked about norm groups in Chapter 6? Well, in a
perfect measurement world, validity coefficients would be reported for all the potential groupings discussed in
that chapter.
The test user also has responsibility for test validation. If the test is going to be used in a setting different from
that reported by the test developers, the user is responsible for evaluating the validity evidence in the new
setting. For example, if a test was originally validated with public school students, but you want to use it with
students in a private parochial school, you have a responsibility for providing evidence of validity in this new
school setting.
Regardless of how evidence of validity is established, we want to stress that validity is a theoretical concept.
It can never actually be measured. A validity coefficient only suggests that test scores are valid for certain
groups in certain circumstances under certain conditions. We never ever “prove” validity, no matter how hard
we try. In spite of this, validity is an absolutely essential characteristic of a strong test. Only when a test is
valid (and of course, reliable) will you “get what you see.”
Let's Check Your Understanding
It's time to check your understanding of what we've told you so far.
Validity is defined as _________________.
When interpreting a test score, what is the role of validity?
_________________________________
_________________________________
Validity coefficients can range in value from _______ to __________.
When we talk about validity, are we referring to a test's scores or the test itself?
_________________________________
Test scores need to be both _________________ and _________________.
If we try hard enough, we can prove that the test scores are valid. True or false?
Our Model Answers
Validity is defined as how well a test measures what it is designed to measure.
When interpreting a test score, what is the role of validity?
Validity tells us what can be inferred from test scores.
Validity coefficients can range in value from 0 to +1.00.
When we talk about validity, are we referring to a test's scores or the test itself?
When we talk about validity, we are referring to a test's scores. Evidence of validity allows
us to make accurate interpretation of someone's test score. We do not interpret a test.
Test scores need to be both valid and reliable.
If we try hard enough, we can prove that the test scores are valid.
This statement is false. Since validity is a theoretical concept, you can never prove its
existence.
Helping You Get What You See
Like the Phantom of the Opera whose presence hovers over and shapes the meaning of Andrew Lloyd
Webber's musical, validity hovers over and shapes the meaning of a test. As the musical evolves, the
phantom becomes more visible; as more evidence of validity evolves, the meaning of test scores becomes
clearer. To develop evidence of validity, attention needs to be given to validation groups, criteria, construct
underrepresentation, and construct-irrelevant variance.
Validation Groups
The groups on which a test is validated are called validation groups. For our achievement test example, the
validation groups were middle school students. The achievement test is valid for students who have the same
characteristics as those in the validation sample of middle school students. Anyone who will potentially use
this achievement test needs to determine how closely his or her students match the characteristics of the
students in the validation group. The more dissimilar the students are from the validation group, the less valid
the achievement test may be for the new group of students. Characteristics of the validation group should be
presented in a test's manual.
Criteria
Validity is always a reflection of some criterion against which it is being measured. A criterion is some
knowledge, behavior, skill, process, or characteristic that is not a component of the test being examined. It is
external to the test itself. For example, scores on the Scholastic Aptitude Test (SAT) or the Graduate Record
Examination (GRE) have typically been validated against the criterion of undergraduate grade point averages
(GPA) or grades in graduate school, respectively. A fairly strong positive relationship has been found between
scores on these tests and later GPAs (the criteria).
Scores on a test may also be validated against multiple criteria, depending on the inferences to be made
from the test scores. For example, scores on the Goody-Two-Shoes (G2S) personality test, which measures
a complex construct, probably need to be validated against several criteria. Appropriate criteria might include
teachers' perceptions of students, interpersonal relationships, potential for career success, and so forth. Each
of these criteria helps to define the goody-two-shoes construct. There would be a separate validity coefficient
for the relationship between the scores of the G2S test and each of these criteria. Collectively, these validity
coefficients provide evidence for the validity of the G2S scores measuring the construct “goody-two-shoes.”
In addition, based on which criterion was used to gather validity evidence, the interpretation of the G2S
test scores would vary. One would suspect, at least we do, that there may be a strong correlation between
students being “goody-two-shoes” and teachers' very favorable perceptions of these students. Therefore, the
criterion of teachers' perceptions of students provides strong evidence for the validity of scores on the G2S
test as a measure of teachers' perceptions of students. The same type of evidence needs to be gathered on
the other criteria in order to interpret scores on the G2S test as reflecting these criteria.
Construct Underrepresentation
When a test fails to capture or assess all important aspects of a construct adequately, this is called construct
underrepresentation. Let's go back to our example of aptitude tests to illustrate construct underrepresentation.
Most of you have probably taken the SAT or GRE. Furthermore, a few of you have probably argued that your
SAT or GRE test scores did not accurately reflect your academic ability. You may not know this, but it really
is possible that some tests don't comprehensively measure the constructs they are designed to measure.
When this happens, these tests are suffering from a serious illness that could even be fatal—construct
underrepresentation.
Let's pretend that the GRE suffers from construct underrepresentation. It doesn't really, but our example will
make more sense if we pretend that it does. Traditionally, the GRE measured quantitative and verbal aptitude.
More recently, reasoning ability was added to the GRE to complement its assessment of quantitative and
verbal abilities. Perhaps the test developers realized that the original GRE measured aptitude too narrowly,
so they added items to broaden the measure to include reasoning ability. Doing this broadened the domain
of behaviors that reflect aptitude. Perhaps, this more comprehensive assessment allows the GRE items to
better represent the construct aptitude.
Construct-Irrelevant Variance
It is also possible that when you took the SAT or GRE some process extraneous to the test's intended
construct affected the test scores. These extraneous variables might include things such as your reading
ability, speed of reading, emotional reactions to test items, familiarity with test content, test anxiety, or items
not related to the construct(s) being measured. Each of these can contribute to construct-irrelevant variance.
This is a source of error in the validity coefficient.
Before we introduce you to the most common sources of validity evidence, it's time to check your
understanding of the concepts just introduced.
Let's Check Your Understanding
The individuals to be tested need to have characteristics similar to those of the
______________________.
A criterion is ____________________.
A test should be validated against one and only one criterion. True or false?
The criteria used as a source of validity evidence are external to the test itself. True or false?
Construct underrepresentation occurs when _______________
_________________________________
A source of error in a validity coefficient that is not related to the test's intended construct is called
________________________.
Examples of this source of error include ________________ and __________________.
Our Model Answers
The individuals to be tested need to have characteristics similar to those of the validation group.
A criterion is some knowledge, behavior, skill, process, or characteristic that is used to
establish the validity of test scores.
A test should be validated against one and only one criterion.
This statement is false. Test scores should be validated on as many criteria as are relevant
to the construct being measured. Multiple sources of validity evidence are particularly
needed when the test measures a complex construct.
The criteria used as a source of validity evidence are external to the test itself.
True. Criteria are not components of the test itself.
Construct underrepresentation occurs when the test does not adequately assess all aspects
of the construct being measured.
A source of error in a validity coefficient that is not related to the test's intended construct is called
construct-irrelevant variance.
Examples of this source of error include reading ability, speed of reading, emotional reactions
to test items, familiarity with test content, test anxiety, or items not related to the
construct(s) being measured.
Sources of Validity Evidence
If you read the measurement literature (don't laugh, we find some of this literature very interesting), you might
have noticed that multiple “types” of validity are presented. Most likely, you'll find the terms content, construct,
concurrent, and predictive validity. Based on the Standards for Educational and Psychological Testing (1999),
validity is viewed as a unitary concept. It is the extent to which all sources of evidence for validity support the
intended interpretation of test scores. Even though validity is indeed a unitary concept, you still need to know
about the traditional types or sources of evidence for validity.
Evidence Based on Test Content
Examination of the content covered by test items and the construct the test is intended to measure can
yield important evidence for content validity. Test developers typically write their items to reflect a specific
content domain. Examples of test content domains might include the Revolutionary War, measures of central
tendency, eating disorders, leadership style, or star constellations. The more clearly every item on a test taps
information from one specific content domain, the greater the evidence for content validity.
Evidence for content validity typically comes from two approaches: (1) an empirical analysis of how well the
test items reflect the content domain and (2) expert ratings of the relationship between the test items and
the content domain. Empirical evidence can be derived from a statistical procedure such as factor analysis
to determine whether all of the test items measure one content domain or construct. The second approach,
expert ratings, requires identification of people who are experts on a content area. These experts then jointly
agree on the parameters of the content domain or construct they will be evaluating. Finally, based on these
parameters, they judge each item as to how well it assesses the desired content.
Content validity is most easily illustrated with an example from education. Every test you have taken, whether
in your math classes, your English classes, or this measurement class, should only include items that assess
the content or information covered in that class. This information may have been given through lectures,
readings, discussions, or demonstrations. The professor, who is the content expert, develops the class tests
by creating items that reflect the specific information covered in the course. To the extent that the content of
all of the items reflects the course content, evidence of content validity is established. Items that do not reflect
the course content contribute to construct-irrelevant variance. These items cause variation in test scores that
is not related to knowledge of the course content. Not all professors know about construct-irrelevant variance,
and they may or may not appreciate your educating them. So, use your knowledge wisely. (Believe it or not, even
professors can become defensive.)
In the world of work, evidence of content validity is provided by a strong correspondence between specific
job tasks and the content of test items. Experts identify the dimensions of a job or the tasks that the job
comprises. One process of deriving job tasks is to observe job behaviors systematically. Test items are then
evaluated against the specific job tasks. The correspondence between the test items and the job tasks is
referred to as the job relatedness of the test. Indeed, the U.S. Supreme Court has mandated that tests used
for job selection or placement have job relatedness.
The appropriateness of a specific content domain is directly related to any interpretation or inferences to be
made from test scores. In our education example, we may want to draw conclusions about individual student
mastery of a content area such as knowledge of star constellations. Based on their level of mastery, we may
want to make decisions about students passing or not passing the class.
We may also want to interpret test scores to find areas of a curriculum being adequately or inadequately
taught. If the majority of students systematically miss items related to class content, then perhaps this
content was not adequately taught. If we had given the students a comprehensive achievement test that was
designed and validated to measure multiple dimensions of achievement, we could draw conclusions about
content neglected or content taught based on student responses to test items. Information about content
neglected and about content taught both provide evidence of content validity.
Evidence of Criterion-Related Validity
A second “type” or source of validity evidence is criterion-related validity. If the purpose of a test is to predict
some future behavior or to estimate current behavior, you want evidence that the test items will do this
accurately. The relationship between test scores and the variable(s) external to the test (criterion) will provide
this source of evidence for criterion-related validity, as discussed next.
Predictive and Concurrent Validity
If the goal is for test scores to predict future behavior, we are concerned with predictive validity. This source
of evidence indicates that the test scores are strongly related to (predict) some behavior (criterion) that is
measured at a later time. Remember our example of the SAT and GRE aptitude tests? The ability of scores
on these two tests to predict future GPAs accurately provides evidence for their predictive validity.
In contrast, evidence for concurrent validity indicates a strong relationship between test scores and some
criterion measured at the same time. Both assessments are administered concurrently (or in approximately
the same time frame). Concurrent validity is essential in psychodiagnostic tests. For example, if someone
scores high on a test of depression, this person should also score high on any co-occurring criterion related
to depression. Sample criteria for depression could include mental health professionals' ratings, behavioral
observations, or self-reported behaviors. The College Stress Scale (CSS) that we introduced in Chapter 2
should have concurrent validity with behaviors that are indicative of currently experienced stress related to
being in college. The CSS should not be predicting future college stress or reflecting stress experienced in
the distant past.
Similarly, our example of a test measuring elements of a job can be viewed as the test items having
concurrent validity with the job tasks. In work settings, a test is often used because its scores have concurrent
validity with the specific requirements of a job. Unless we're going to do extensive on-the-job training, we
want to know a person's ability to do a specific job immediately. In contrast, if we're going to do extensive
on-the-job training, we're more interested in the ability of test scores to predict a person's ability to perform
successfully after training.
A Special Case: Portfolio Assessment
Evidence for criterion-related validity is essential when portfolio assessment is the approach to measurement.
Let's say you're using a portfolio assessment to select students for a graduate program in psychology. To
keep this example simple, let's focus only on three criteria: (1) ability to use APA style when writing; (2) good
listening skills; and (3) academic potential.
For the first criterion (the ability to use APA style when writing), experts can evaluate an applicant's written
document to determine how closely it adheres to APA style. The document becomes the test, and the experts'
ratings are the criterion. Information derived from these two support concurrent validity for the ability of the
applicant at the time of writing to use APA style.
For the second criterion, good listening skills, a behavioral observation of the applicant in a structured role-
play situation could yield information about his or her current ability to use listening skills. The relationship
between the applicant's behaviors and the expert's ratings of these behaviors as reflecting listening skills
provides evidence for concurrent validity.
The third criterion, academic potential, would best be measured by an aptitude test, such as the GRE, that
would predict graduate school success. Scores on this aptitude test would need to have established evidence of
predictive validity for success in graduate school.
Two Dimensions of Concurrent Validity—Convergent and Discriminant Validity
In each of the examples given thus far, we have been talking about how test scores and the criteria converge.
Significant relationships between test scores and other measures designed to assess the same construct
or behavior provide evidence of convergent validity. The test scores and the criterion are theoretically and
empirically linked.
The relationship (or nonrelationship) between test scores and measures of a construct to which the test is
not theoretically related provides the second kind of evidence, known as discriminant validity. For
example, scores on an entrance exam for medical school should have convergent validity with grades in
example, scores on an entrance exam for medical school should have convergent validity with grades in
medical school; however, these same scores may have a weak or no relationship with ratings of physician
bedside manner. This poor relationship provides evidence of discriminant validity. In this example, bedside
manner is a construct different from what is being measured by the medical school entrance exam.
Convergent and discriminant validity indicate not only what a test will predict but also what it will not predict.
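The medical school example can be sketched in a few lines of Python. All numbers below are invented for illustration, and the variable names are our own. The sketch simply computes two correlations: one with a criterion the exam should relate to (grades) and one with a construct it should not relate to (bedside manner).

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy)
    )

# Hypothetical data for eight medical students (invented values).
entrance_exam = [500, 510, 495, 520, 505, 515, 490, 525]
med_school_grades = [3.0, 3.3, 2.9, 3.6, 3.1, 3.4, 2.8, 3.7]
bedside_manner = [4, 2, 3, 3, 5, 4, 3, 3]  # ratings on a 1-5 scale

# Convergent evidence: the exam should correlate strongly with grades.
r_convergent = pearson_r(entrance_exam, med_school_grades)

# Discriminant evidence: the exam should correlate weakly, or not at all,
# with a construct it was never meant to measure.
r_discriminant = pearson_r(entrance_exam, bedside_manner)

print(f"exam vs. grades:         r = {r_convergent:.2f}")
print(f"exam vs. bedside manner: r = {r_discriminant:.2f}")
```

A strong first correlation paired with a near-zero second one is the pattern you hope to see. How strong is "strong" and how weak is "weak" is a judgment call that this sketch does not settle.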
Evidence of Construct Validity
Construct validity, sometimes referred to as internal structural validity, indicates the degree to which all items
on a test are interrelated and measure the theoretical trait or construct the test is designed to measure.
Basically, a construct is a theoretical explanation for some behavior. Construct validity is concerned with the
validation of this underlying theory.
Anxiety is a theoretical construct that we can verify only by seeing how it is manifested in current behavior.
Because anxiety is theoretically one construct, a test that measures anxiety should be unidimensional. Factor
analysis is a statistical procedure that tests whether all items on a test contribute to that one construct. If a
test is unidimensional, we expect a one-factor structure to emerge from the factor analysis.
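A full factor analysis needs statistical software, but a rough version of the same question can be sketched in plain Python. The code below is not a factor analysis. It is a simpler principal-component-style check, which we are substituting for illustration: it finds the largest eigenvalue of the item correlation matrix and reports the share of total item variance it accounts for. The four "anxiety items" and their responses are invented, and all function names are our own.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy)
    )

def correlation_matrix(items):
    """items is a list of columns: one list of responses per test item."""
    k = len(items)
    return [[pearson_r(items[i], items[j]) for j in range(k)] for i in range(k)]

def largest_eigenvalue(matrix, iterations=200):
    """Approximate the dominant eigenvalue of a symmetric matrix by power iteration."""
    k = len(matrix)
    vec = [1.0 / sqrt(k)] * k
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = sqrt(sum(v * v for v in product))
        vec = [v / norm for v in product]
    # Rayleigh quotient of the converged unit vector gives the eigenvalue.
    product = [sum(matrix[i][j] * vec[j] for j in range(k)) for i in range(k)]
    return sum(v * p for v, p in zip(vec, product))

# Hypothetical responses of six people to four anxiety items (1-5 scale).
anxiety_items = [
    [1, 2, 3, 4, 5, 3],
    [2, 2, 3, 5, 4, 3],
    [1, 3, 3, 4, 5, 2],
    [2, 1, 4, 4, 5, 3],
]

corr = correlation_matrix(anxiety_items)
# The eigenvalues of a correlation matrix sum to the number of items,
# so dividing by that number gives a proportion of total variance.
first_factor_share = largest_eigenvalue(corr) / len(anxiety_items)
print(f"share of variance on the first component: {first_factor_share:.2f}")
```

A large share is consistent with a single underlying dimension, but it does not prove one. A proper factor analysis, with a realistic sample size, is still the tool the text has in mind.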
Many tests, however, are multidimensional, making whatever is being assessed more interesting (we think).
For example, the Strong Interest Inventory (SII) measures six different occupational interests: Realistic (R),
Artistic (A), Investigative (I), Social (S), Enterprising (E), and Conventional (C). The theoretical foundation for
the SII is Holland's conceptualization that there are six general work environments and these environments
are characterized by the six occupational interests. Considerable research on the psychometric properties of
the SII has consistently provided evidence supporting its underlying theory regarding a six-factor structure.
Support for a six-factor structure provides some evidence of construct validity (internal structural validity) for
each factor.
Items measuring a factor such as Enterprising (E) are homogeneous and contribute only to the construct
validity of that factor. There is evidence that the SII yields valid scores for adult men and women across a
variety of settings. The multidimensionality of the SII makes it interesting because we can interpret scores for
each of the six occupational interests and create an occupational interest profile for everyone who takes the
test.
A typical profile for business major Ronald Frump might be EAS. Ronald is high on Enterprising, Artistic, and
Social. He can be a “killer” business person, making those hard decisions that influence the bottom line. This
reflects his strong E score. His Artistic bent shows up in his creativity and ingenuity in the business world
and in his extensive art collection. Ronald's propensity to be the center of attention and to be surrounded by
fawning employees is a manifestation of the Social component of his occupational interests. Because there
is strong evidence that the scores on the SII are valid, we can draw conclusions about Ronald and how he
will manifest his occupational profile in the business world. (Watch out competitors!) If his behaviors match
his profile, this lends further support for the construct validity of the SII.
Let's Check Your Understanding
We just fed you a three-course dinner. Let's see if you've started to digest each of these courses. Let's check
your understanding just in case you need to review some aspect of validity.
What are the three major sources of evidence of validity?
_________________________________
_________________________________
For which source of validity evidence do you compare test items with a specific domain of
information?
_________________________________
_________________________________
What are the two approaches for obtaining evidence for content validity?
a. _______________________
b. _______________________
What are the names given to the two major types of criterion-related validity?
a. _______________________
b. _______________________
Criterion-related validity is essential when the purpose of a test is
a. _______________________
b. _______________________
What is convergent validity?
_________________________________
_________________________________
What is discriminant validity?
_________________________________
_________________________________
Convergent and discriminant validity indicate not only what a test ____________________ but
also what it _______________________.
Conceptually, construct validity is ___________________
_________________________________.
Construct validity is also referred to as __________________.
When a single theoretical construct is being measured, the test should be
________________________.
When multiple theoretical constructs are being measured by the same test, the test should
be ________________ and have __________________ for each construct or factor being
assessed.
Our Model Answers
What are the three major sources of evidence of validity?
The three major sources of validity evidence are content validity, criterion-related validity, and
construct validity.
For which source of validity evidence do you compare test items with a specific domain of
information?
For content validity, you compare test items with a specific domain of information.
What are the two approaches for obtaining evidence for content validity?
The two approaches for obtaining evidence for content validity are (a) an empirical
analysis of how well the test items reflect the content domain and (b) expert ratings of the
relationship between the test items and the content domain.
What are the names given to the two major types of criterion-related validity?
The two major types of criterion-related validity are (a) concurrent validity and (b)
predictive validity.
Criterion-related validity is essential when the purpose of a test is (a) to estimate current
behaviors or (b) to predict some future behavior.
What is convergent validity?
Evidence of convergent validity is shown when there is a significant relationship between
test scores and other assessments of the same construct or behavior.
What is discriminant validity?
Evidence of discriminant validity is shown when there is a nonsignificant relationship
between test scores and measures of a construct to which the test is not theoretically
related.
Convergent and discriminant validity indicate not only what a test will predict but also what it will
not predict.
Conceptually, construct validity is the degree to which all items of a test are interrelated
and measure the theoretical trait the test is designed to measure.
Construct validity is also referred to as internal structural validity.
When a single theoretical construct is being measured, the test should be unidimensional.
When multiple theoretical constructs are being measured by the same test, the test should be
multidimensional and have validity evidence for each construct or factor being assessed.
The Marriage of Reliability and Validity—Wedded Bliss
Both reliability and validity are essential characteristics of a good test. Like love and marriage, you can't
have one without the other. Reliability and validity are even wed to each other mathematically. The validity
coefficient (rxy) for a test's scores cannot be greater than the square root of the test's reliability (rxx). For rxy,
x stands for the test scores and y stands for scores on the criterion. The formula for the relationship between
validity and reliability is

$$r_{xy} \le \sqrt{r_{xx}}$$

If the reliability of a test is 0.64, the potential maximum value of the validity coefficient would be 0.80. Notice
our use of the words “potential maximum value.” Rarely does a validity coefficient exactly equal the square
root of the reliability coefficient. It is almost always less than this potential maximum value.
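The arithmetic behind this ceiling is easy to check for yourself. The short Python sketch below takes a reliability coefficient and returns the potential maximum validity coefficient as its square root. The function name `max_validity` and the list of sample reliabilities are our own choices for illustration.

```python
from math import sqrt

def max_validity(reliability):
    """Potential maximum validity coefficient for a given reliability (r_xx)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sqrt(reliability)

# A few sample reliabilities, including the 0.64 used in the text.
for r_xx in (0.49, 0.64, 0.81, 1.00):
    print(f"reliability {r_xx:.2f} -> maximum validity {max_validity(r_xx):.2f}")
```

Keep in mind that this is only a ceiling. The validity coefficient you actually observe will almost always fall below it.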
Interpreting the Validity of Tests: Intended and Unintended Consequences
We mentioned high-stakes testing earlier. High-stakes testing occurs when test results are used to make critical
decisions, such as whether or not a student receives a high school diploma.
This decision is based on social policy, although policy makers have tried to embed it in the realm of validity.
For example, our student, John, takes a comprehensive achievement test during his senior year. This test
assesses multiple dimensions of knowledge, and evidence has been provided to support its content and
construct validities. An inappropriate use of this test would be to differentiate students into two groups: those
who can graduate and those who can't. An achievement test is only designed to measure knowledge of
content. Evidence of validity supports this purpose. Evidence of validity does not support the goal of social
policy makers, to give diplomas only to those who score “high enough.” Validity is always interpreted in light
of the purpose of the test and should not be distorted for alternative purposes.
Some Final Thoughts About Validity
As noted by the Standards for Educational and Psychological Testing (1999), “A sound validity argument
integrates various strands of evidence into a coherent account of the degree to which existing evidence
and theory support the intended interpretation of test scores for specific uses” (p. 17). Two important
concepts related to validity appear in this statement. First, more than one strand of evidence is needed for
sound validity. In addition to what we have discussed in this chapter, another approach is called
multitrait-multimethod (MTMM). This approach addresses the need for more than one strand of evidence for sound
validity. Second, the intended interpretation of test scores is based on validity evidence. The goal of testing is
to provide meaningful information. This is only possible if there is evidence supporting the validity of the test
scores. Without validity, test scores are meaningless! You might as well have read tea leaves.
Key Terms
• Validity
• Validation group
• Criterion
• Construct underrepresentation
• Construct-irrelevant variance
• Sources of validity evidence
— Content
— Criterion related
— Predictive
— Concurrent
— Convergent
— Discriminant
— Internal structure
— Construct
Models and Self-instructional Exercises
Our Model
Remember the Honesty Inventory (HI) you created in Chapter 9 to assess applicants for bank teller positions?
We know the reliability coefficient was 0.76. Now let's see how valid scores for the HI are. Not only do you
administer the HI to all applicants, you also give them the Perfectly Honest Scale (PHS) and a mathematical
aptitude test. The test manual for the PHS reports a validity coefficient of 0.80 as evidence of
criterion-related validity. The items on the PHS were evaluated by experts as to their measuring aspects of the
construct honesty. The manual also reports that when the HI was given to incoming freshmen, 4 years later
it discriminated between undergraduates who were elected as members of a national honor society and
undergraduates who were kicked out of school for cheating. Based on what we've told you thus far:
What is the maximum potential validity value of the HI?
_________________________________
_________________________________
When you correlate applicants' HI scores with their PHS scores, what source of validity evidence
are you assessing?
_________________________________
_________________________________
When you correlate applicants' HI scores with their mathematical aptitude test scores, what
source of validity evidence are you assessing?
_________________________________
_________________________________
What source of validity evidence was established by the HI scores later discriminating between
students in an honor society and students kicked out of school?
_________________________________
_________________________________
Based on the sources of validity evidence you have gathered, have you proven that the HI is a
valid assessment instrument for potential bank tellers?
_________________________________
_________________________________
Based on your answers to questions 1 through 4, would you recommend the HI as a valid test of
honesty and why?
_________________________________
_________________________________
_________________________________
_________________________________
An applicant who is not hired files a complaint about the job relatedness of the assessments
used to screen applicants. How would you address this complaint based on what you know?
_________________________________
_________________________________
_________________________________
_________________________________
Our Model Answers
What is the maximum potential validity value of the HI?
The validity cannot be greater than the square root of the reliability coefficient. To calculate
the maximum potential validity value, we would use the formula
$$r_{xy} \le \sqrt{r_{xx}} = \sqrt{0.76} \approx 0.87$$
Therefore, the maximum potential validity value is 0.87.
When you correlate applicants' HI scores with their PHS scores, what source of validity evidence
are you assessing?
When you correlate two measures administered at the same time and designed to assess
the same theoretical construct, you are providing evidence for concurrent validity. In
addition, you are providing evidence of construct validity since they are both based on the
same theoretical construct.
When you correlate applicants' HI scores with their mathematical aptitude test scores, what
source of validity evidence are you assessing?
We would not expect honesty and mathematical aptitude to be theoretically linked.
Therefore, the nonrelationship between these two tests would provide evidence for
discriminant validity.
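To see what these two kinds of evidence look like in numbers, here is a small Python sketch. The scores are invented purely for illustration and are not data from the HI, the PHS, or any real test. The helper `pearson_r` is our own function, and the sketch only computes correlation coefficients; it does not test them for statistical significance.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical scores for six applicants (invented for illustration only)
hi_scores = [10, 12, 14, 16, 18, 20]
phs_scores = [11, 13, 13, 17, 18, 21]
math_scores = [50, 44, 52, 53, 45, 49]

r_hi_phs = pearson_r(hi_scores, phs_scores)
r_hi_math = pearson_r(hi_scores, math_scores)
print(f"HI with PHS:  r = {r_hi_phs:.2f}")
print(f"HI with math: r = {r_hi_math:.2f}")
```

With these made-up numbers, the HI correlates strongly with the PHS, which measures the same construct, and hardly at all with mathematical aptitude. The first result is the pattern expected for concurrent evidence and the second is the pattern expected for discriminant evidence.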
What source of validity evidence was established by the HI scores later discriminating between
students in an honor society and students kicked out of school?
Because the HI was administered at the beginning of the students' freshman year and
whether they became members of the honor society or were kicked out of school was
assessed 4 years later, the HI was used to predict the future status of the students. This
process provided evidence of predictive validity for HI scores.
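The manual does not say how the discrimination between the two groups was quantified. One common option, assumed here only for illustration, is a point-biserial correlation between the earlier test score and later group membership. The sketch below uses invented freshman scores, and `point_biserial` is our own helper name.

```python
import math
import statistics


def point_biserial(group_one, group_zero):
    """Point-biserial r between a score and membership in one of two groups."""
    n1, n0 = len(group_one), len(group_zero)
    n = n1 + n0
    mean_diff = statistics.mean(group_one) - statistics.mean(group_zero)
    sd_all = statistics.pstdev(group_one + group_zero)  # population SD of all scores
    return (mean_diff / sd_all) * math.sqrt(n1 * n0 / n ** 2)


# Hypothetical freshman HI scores (invented for illustration only)
honor_society = [18, 20, 17, 19]
expelled = [9, 12, 10, 11]

r_pb = point_biserial(honor_society, expelled)
print(f"point-biserial r = {r_pb:.2f}")
```

A large positive value would mean that students who scored higher as freshmen were more likely to end up in the honor society group 4 years later.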
Based on the sources of validity evidence you have gathered, have you proven that the HI is a
valid assessment instrument for potential bank tellers?
Although a variety of sources have provided evidence for validity of the HI, validity can
never be proven.
Based on your answers to questions 1 through 4, would you recommend the HI as a valid test of
honesty?
Multiple sources of evidence for validity of the HI have been provided. Specifically, for the
potential bank tellers, we've gathered evidence of concurrent, construct, and discriminant
validity. We know that the validity coefficient could be as large as 0.87 which would
be very strong. However, we need to remember that this is just a potential highest
value, not the actual value of the validity coefficient. Furthermore, while the evidence
provided for predictive validity is interesting, we need to remember that it was gathered on
undergraduates, not on applicants for bank teller positions. All in all, however, we would
probably recommend the HI as an assessment of honesty for applicants for bank teller
positions.
An applicant who is not hired files a complaint about the job relatedness of the assessments
used to screen applicants. How would you address this complaint based on what you know?
Although honesty is a highly desirable characteristic in someone entrusted with other
people's money, it is not directly related to the job. The mathematical aptitude test,
however, would have job relatedness. Bank tellers need to know how to do math to be
successful on the job. Perhaps this applicant was not hired because of a poor math
aptitude score.
Now It's Your Turn
Based on their scores from the HI and their mathematical aptitude test scores, you select the 50 applicants
who are the most honest and have the highest math aptitude for further testing. You give them the
multidimensional scale described in Chapter 9 to measure interpersonal relationships (IP) and fear of math
(FOM). We know the internal consistency reliability coefficients were 0.84 for the IP scores and 0.75 for the
FOM scores for the norm group of undergraduate business majors. For bank tellers who were hired, the
6-month test-retest reliabilities were 0.88 and 0.68 for these two subscales, respectively.
What is the maximum potential validity value of the IP for bank tellers?
_________________________________
_________________________________
_________________________________
What is the maximum potential validity value of the FOM for bank tellers?
_________________________________
_________________________________
_________________________________
When you correlate applicants' FOM scores with their math aptitude scores, what source of
validity evidence are you assessing?
_________________________________
When you correlate applicants' IP scores with their mathematical aptitude test scores, what
source of validity evidence are you assessing?
_________________________________
What source of validity evidence was established when math aptitude scores were used to later
predict FOM scores for those who were hired?
_________________________________
What source of validity evidence was established when FOM scores were found not to be related
to IP scores for the applicant pool?
_________________________________
Based on your answers to the questions above, would you recommend the IP as a valid and
reliable test of interpersonal relationships and why?
_________________________________
_________________________________
_________________________________
_________________________________
Based on your answers to the questions above, would you recommend the FOM as a valid and
reliable test of fear of math and why?
_________________________________
_________________________________
_________________________________
_________________________________
Our Model Answers
What is the maximum potential validity value of the IP for bank tellers?
The validity cannot be greater than the square root of the reliability coefficient. The
reliability coefficient we must use is the test-retest reliability coefficient, because this
is the only reliability coefficient based just on bank tellers. To calculate the maximum
potential validity value, we would use the formula
$$r_{xy} \le \sqrt{r_{xx}} = \sqrt{0.88} \approx 0.94$$
Therefore, the maximum potential validity value of the IP for bank tellers is 0.94.
What is the maximum potential validity value of the FOM for bank tellers?
Again, the reliability coefficient we must use is the test-retest reliability coefficient,
because this is the only reliability coefficient based just on bank tellers. To calculate the
maximum potential validity value, we would use the formula
$$r_{xy} \le \sqrt{r_{xx}} = \sqrt{0.68} \approx 0.82$$
Therefore, the maximum potential validity value of the FOM for bank tellers is 0.82.
When you correlate applicants' FOM scores with their math aptitude scores, what source of
validity evidence are you assessing?
When you correlate two measures administered at the same time and designed to assess
the same or related theoretical constructs, you are providing evidence for concurrent
validity, specifically convergent validity.
When you correlate applicants' IP scores with their mathematical aptitude test scores, what
source of validity evidence are you assessing?
When you correlate two measures administered at the same time and designed to assess
different or unrelated theoretical constructs, you are also providing evidence for
concurrent validity, specifically discriminant validity.
What source of validity evidence was established when math aptitude scores were used to later
predict FOM scores for those who were hired?
When you use one set of scores to predict scores on a test measuring the same or
a related construct and given at a later time, you are providing evidence for predictive
validity.
What source of validity evidence was established when FOM scores were found not to be related
to IP scores for the applicant pool?
When you correlate two measures administered at the same time and designed to assess
different or unrelated theoretical constructs, you are again providing evidence for
concurrent validity, specifically discriminant validity.
Based on your answers to the questions above, would you recommend the IP as a valid and
reliable test of interpersonal relationships and why?
Yes. Its maximum potential validity coefficient was 0.94. The IP had a test-retest reliability
of 0.88 for bank tellers, and an internal consistency reliability of 0.84 for undergraduate
business majors. In addition, the SEm of 1.6 is relatively small for the applicant pool.
Taken collectively, the IP appears to be a reliable and valid measure of interpersonal
relationships for bank tellers.
Based on your answers to the questions above, would you recommend the FOM as a valid and
reliable test of fear of math and why?
Maybe. Its maximum potential validity coefficient was 0.82. The FOM had a test-retest
reliability of only 0.68 for bank tellers, and an internal consistency reliability of 0.75
for undergraduate business majors. These are weak reliability coefficients. However, the
SEm of 1.1 is relatively small for the applicant pool. Taken collectively, the FOM is not a
particularly reliable measure of fear of math for bank tellers, even though the maximum
potential validity coefficient is 0.82. This would be a good time to emphasize that this value
is only a “maximum potential.” The actual validity could be much lower.
Words of Encouragement
Hurray, hurray, hurray! You have mastered all the basic technical aspects of measurement and testing that
we have covered in this user-friendly guide. We hope we have piqued your interest in measurement. If we
have, and you are thinking about pursuing more course work in tests and measurement and then applying
this information in a job setting, our last chapter on the perils and pitfalls of testing will be of particular interest
to you.
http://dx.doi.org/10.4135/9781412986106.n10