Author: Miller, L. A., & Lovler, R. L. (2020). In Foundations of psychological testing: A practical approach (6th ed., pp. 54–84). SAGE.
7 How Do We Gather Evidence of Validity Based on
Test–Criterion Relationships?
Learning Objectives
After completing your study of this chapter, you should be able to do the following:
• Identify evidence of validity of a test based on its relationships to external criteria, and
describe two methods for obtaining this evidence.
• Read and interpret validity studies.
• Discuss how restriction of range occurs and its consequences.
• Describe the differences between evidence of validity based on test content and evidence
based on relationships with external criteria.
• Describe the difference between reliability/precision and validity.
• Define and give examples of objective and subjective criteria, and explain why criteria
must be reliable and valid.
• Interpret a validity coefficient, calculate the coefficient of determination, and conduct a
test of significance for a validity coefficient.
• Understand why measured validity will be reduced by unreliability in the predictor or
criterion measure and what statistical correction can be applied to adjust for this
reduction.
• Explain the concept of operational or “true” validity and how it is calculated.
• Explain the concept of regression, calculate and interpret a linear regression formula, and
interpret a multiple regression formula.
“The graduate school I’m applying to says they won’t accept anyone who scores less than 1,000 on the GRE. How
did they decide that 1,000 is the magic number?”
“Before we married, my fiancée and I went to a premarital counselor. She gave us a test that predicted how happy
our marriage would be.”
“My company uses a test for hiring salespeople to work as telemarketers. The test is designed for people selling life
insurance and automobiles. Is this a good test for hiring telemarketers?”
Have you ever wondered how psychological tests really work? Can we be comfortable using an
individual’s answers to test questions to make decisions about hiring him for a job or admitting
her to college? Can mental disorders really be diagnosed using scores on standard
questionnaires?
Psychologists who use tests for decision making are constantly asking these questions and others
like them. When psychologists use test scores for making decisions that affect individual lives,
they, as well as the public, want substantial evidence that the correct decisions are being made.
This chapter describes the processes that psychologists use to ensure that tests perform properly
when they are used for making predictions and decisions. We begin by discussing the concept of
validity evidence based on a test’s relationships to other variables, specifically external criteria.
As we discussed in the previous chapter, “How Do We Gather Evidence of Validity Based on the
Content of a Test?” this evidence has traditionally been called criterion-related evidence of
validity. We also discuss the importance of selecting a valid criterion measure, how to evaluate
validity coefficients, and the statistical processes that provide evidence that a test can be used for
making predictions.
What Is Evidence of Validity Based on Test–Criterion
Relationships?
In the last chapter, we introduced you to the concept of evidence of validity based on a test’s
relationship with other variables. We said that one method for obtaining evidence is to
investigate how well the test scores correlate with observed behaviors or events. When test
scores correlate with specific behaviors, attitudes, or events, we can confirm that there is
evidence of validity. In other words, the test scores may be used to predict
those specific behaviors, attitudes, or events. But as you recall, we cannot use such evidence to
make an overall statement that the test is valid. We also said that this evidence has traditionally
been referred to as criterion-related validity (a term that we use occasionally in this chapter, as it
is still widely used by testing practitioners).
For example, when you apply for a job, you might be asked to take a test designed to predict how
well you will perform the job. If the job is clerical and the test really predicts how well you will
perform, your test score will be related to your skill in performing clerical duties such as word
processing and filing. To provide evidence that the test predicts clerical performance,
psychologists correlate test scores with a measure of each individual’s performance on clerical
tasks, such as supervisors’ ratings. The measure of performance that we correlate with test scores
is called the criterion. If higher test scores relate to higher performance ratings, the test shows
evidence of validity based on the relationship between these two variables, traditionally referred
to as criterion-related validity. Educators use admissions tests to forecast how successful an
applicant will be in college or graduate school. The SAT and the Graduate Record Examination
(GRE) are admissions tests used by colleges. The criterion of success in college is often the
student’s first-year grade point average (GPA). In a clinical setting, psychologists often use tests
to diagnose mental disorders. In this case, the criterion is the diagnoses made by several
psychologists or psychiatrists independent of the test. Researchers then correlate the diagnoses
with the test scores to establish evidence of validity.
Methods for Providing Evidence of Validity Based on Test–
Criterion Relationships
There are two methods for demonstrating evidence of validity based on test–criterion
relationships: the predictive method and the concurrent method. This section defines and gives
examples of each method.
The Predictive Method
When it is important to show a relationship between test scores and a future behavior,
researchers use the predictive method to establish evidence of validity. In this case, a large group
of people take the test (the predictor), and their scores are held for a predetermined time interval,
such as 6 months. After the time interval passes, researchers collect a measure of some behavior,
for example, a rating or other measure of performance, on the same people (the criterion). Then
researchers correlate the test scores with the criterion scores. If the test scores and the criterion
scores have a strong relationship, the test has demonstrated predictive evidence of validity.
Researchers at Brigham Young University used the predictive method to demonstrate evidence
of validity of the PREParation for Marriage Questionnaire (PREP-M). For Your Information Box
7.1 describes the study they conducted.
Psychologists might use the predictive method in an organizational setting to establish evidence
of validity for an employment test. To do so, they administer an employment test (predictor) to
candidates for a job. Researchers file test scores in a secure place, and the company does not use
the scores for making hiring decisions. The company makes hiring decisions based on other
criteria, such as interviews or different tests. After a predetermined time interval, usually 3 to 6
months, supervisors evaluate the new hires on how well they perform the job (the criterion). To
determine whether the test scores predict the candidates who were successful and unsuccessful,
researchers correlate the test scores with the ratings of job performance. The resulting correlation
coefficient is called the validity coefficient, a statistic used to infer the strength of the evidence
of validity that the test scores might demonstrate in predicting job performance.
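The calculation behind a validity coefficient can be sketched in a few lines of code. The example below is only an illustration: the test scores and supervisor ratings are hypothetical numbers invented for this sketch, and the function name pearson_r is our own. It computes the Pearson correlation between the filed test scores (the predictor) and the later ratings (the criterion).

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 cases")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable has no variance")
    return sxy / sqrt(sxx * syy)

# Hypothetical data (not from any study): employment test scores filed at
# hiring, and supervisor ratings of the same eight people 6 months later.
test_scores = [52, 61, 47, 70, 66, 58, 74, 49]
supervisor_ratings = [3.1, 3.6, 2.8, 4.2, 3.5, 3.4, 4.4, 3.0]

validity_coefficient = pearson_r(test_scores, supervisor_ratings)
print(f"validity coefficient r = {validity_coefficient:.2f}")
```

A real validity study would use a much larger group than eight people; the small sample here only keeps the arithmetic easy to follow.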
For Your Information Box 7.1 Evidence of Validity Based on Test–Criterion Relationships of a Premarital
Assessment Instrument
In 1991, researchers at Brigham Young University (Holman, Larson, & Harmer, 1994) conducted a study to
determine the evidence of validity based on test–criterion relationships of the PREParation for Marriage
Questionnaire (PREP-M; Holman, Busby, & Larson, 1989). Counselors use the PREP-M with engaged couples who
are participating in premarital courses or counseling. The PREP-M has 206 questions that provide information on
couples’ shared values, readiness for marriage, background, and home environment. The researchers contacted 103
married couples who had taken the PREP-M a year earlier as engaged couples and asked them about their marital
satisfaction and stability.
The researchers predicted that those couples who had high scores on the PREP-M would express high satisfaction
with their marriages. The researchers used two criteria to test their hypothesis. First, they drew questions from the
Marital Comparison Level Index (Sabatelli, 1984) and the Marital Instability Scale (Booth, Johnson, & Edwards,
1983) to construct a criterion that measured each couple’s level of marital satisfaction and marital stability. The
questionnaire showed internal consistency of .83. The researchers also classified each couple as “married satisfied,”
“married dissatisfied,” or “canceled/delayed” and as “married stable,” “married unstable,” or “canceled/delayed.”
These classifications provided a second criterion.
The researchers correlated the couples’ scores on the PREP-M with their scores on the criterion questionnaire. The
husbands’ scores on the PREP-M correlated at .44 (p < .01) with questions on marital satisfaction and at .34 (p <
.01) with questions on marital stability. The wives’ scores on the PREP-M were correlated with the same questions
at .25 (p < .01) and .20 (p < .05), respectively. These correlations show that the PREP-M is a moderate to strong
predictor of marital satisfaction and stability—good evidence of the validity of the PREP-M. (Later in this chapter,
we discuss the size of correlation coefficients needed to establish evidence of validity.)
In addition, the researchers compared the mean scores of those husbands and wives classified as married satisfied,
married dissatisfied, or canceled/delayed and those classified as married stable, married unstable, or
canceled/delayed. As predicted, those who were married satisfied or married stable scored higher on the PREP-M
than did those in the other two respective categories. In practical terms, these analyses show that counselors can use
scores on the PREP-M to make predictions about how satisfying and stable a marriage will be.
To get the best measure of validity, everyone who took the test would need to be hired so that all
test scores could be correlated with a measure of job performance (something that it is not
usually practical to do). This is because it is desirable to get the widest range of test scores
possible (including the very low ones) to understand fully how all the test scores relate to job
performance. Therefore, gathering predictive evidence of validity can present problems for some
organizations because it is important that everyone who took the test is also measured on the
criterion. Some organizations might not be able to hire everyone who applies regardless of
qualifications, and there are usually more applicants than available positions, so not all
applicants can be hired. Also, organizations frequently will be using some other selection tool
such as an interview to make hiring decisions, and typically, only people who do well on the
interview will be hired. Therefore, even predictive studies in organizations may only have access
to the scores of a portion of the candidates who applied for the job. Because those actually hired
are likely to be the higher performers, a restriction of range in the distribution of test scores is
created. In other words, if the test is a valid predictor of job performance and the other selection
tools that are used to make a hiring decision are also valid predictors, then people with lower
scores on the test will be less likely to be hired. This causes the range of test scores to be reduced
or restricted to those who scored relatively higher. Because a validity study conducted on these
data will not have access to the full range of test scores, the validity coefficient calculated only
from this restricted group is likely to be lower than if all candidates had been hired and included
in the study.
Why would the resulting validity coefficient from a range-restricted group be lower than it would
be if the entire group was available to measure? Think of it like this: The worst case of restricted
range would be if everyone obtained exactly the same score on the test (similar to what would
happen if you hired only those people who made a perfect score on the test). If this situation
occurred, the correlation between the test scores and any criterion could not be computed meaningfully and would, in effect, be zero. This is
because if the test scores do not vary from person to person, high performers and lower
performers would all have exactly the same test score. We cannot distinguish high performers
from low performers when everybody gets the same score, and therefore these test scores cannot
be predictive of job performance. Using the full range of test scores enables you to obtain a more
accurate validity coefficient, which usually will be higher than the coefficient you obtained using
the restricted range of scores. However, a correlation coefficient can be statistically adjusted for
restriction of range, which, when used properly, can provide a corrected estimate of the validity
coefficient of the employment test in the unrestricted population. We have much more to say
about corrections to measured validity coefficients later in the chapter when we cover the
relationship between reliability and validity. These problems exist in educational and clinical
settings as well because individuals might not be admitted to an institution or might leave during
the predictive study. For Your Information Box 7.2 describes a validation study that might have
failed to find evidence of validity because of restriction of range.
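The effect of restriction of range described above can be seen in a small simulation. The sketch below rests on several assumptions of our own: the applicant data are randomly generated rather than real, the "true" correlation of .50 is arbitrary, and hiring is based only on the test itself (the top 30% of scorers). The adjustment shown is one commonly used formula for direct selection on the predictor (often called Thorndike's Case 2); the chapter does not say which correction it has in mind, so treat the formula as an illustration rather than as the method the authors intend.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def sd(values):
    """Sample standard deviation."""
    mean = sum(values) / len(values)
    return sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Assumed correction for direct selection on the predictor (Thorndike Case 2)."""
    u = sd_unrestricted / sd_restricted
    return (r_restricted * u) / sqrt(1 + r_restricted ** 2 * (u ** 2 - 1))

random.seed(7)   # fixed seed so the simulation is repeatable
TRUE_R = 0.5     # arbitrary population correlation chosen for the demo

# Simulate 5,000 hypothetical applicants: a test score and later job performance.
applicants = []
for _ in range(5000):
    test = random.gauss(0, 1)
    performance = TRUE_R * test + sqrt(1 - TRUE_R ** 2) * random.gauss(0, 1)
    applicants.append((test, performance))

# "Hire" only the top 30% of test scorers, which restricts the range of scores.
applicants.sort(key=lambda person: person[0], reverse=True)
hired = applicants[: len(applicants) * 3 // 10]

all_tests, all_perf = zip(*applicants)
hired_tests, hired_perf = zip(*hired)

r_full = pearson_r(all_tests, all_perf)
r_restricted = pearson_r(hired_tests, hired_perf)
r_corrected = correct_for_range_restriction(r_restricted, sd(all_tests), sd(hired_tests))

print(f"all applicants:        r = {r_full:.2f}")
print(f"hired only:            r = {r_restricted:.2f}")
print(f"corrected (estimated): r = {r_corrected:.2f}")
```

In practice the standard deviation of the unrestricted group has to come from the full applicant pool or from published norms, and the correction is only as good as the assumption that selection was made on the test scores themselves.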
The Concurrent Method
The method of demonstrating concurrent evidence of validity based on test–criterion
relationships is an alternative to the predictive method. In the concurrent method, test
administration and criterion measurement happen at approximately the same time. This method
does not involve prediction. Instead, it provides information about the present and the status quo
(Cascio, 1991). A study by Maisto and colleagues (2011), described in For Your Information
Box 7.3, is a good example of a study designed to assess concurrent (as well as predictive)
evidence of validity for an instrument used in a clinical setting.
The concurrent method involves administering two measures, the test and a second measure of
the attribute, to the same group of individuals at as close to the same point in time as possible.
For example, the test might be a paper-and-pencil measure of American literature, and the
second measure might be a grade in an American literature course. Usually, the first measure is
the test being validated, and the criterion is another type of measure of performance such as a
rating, grade, or diagnosis. It is very important that the criterion test itself be reliable and valid
(we discuss this further later in this chapter). The researchers then correlate the scores on the two
measures. If the scores correlate, the test scores demonstrate evidence of validity.
In organizational settings, researchers often use concurrent studies as alternatives to predictive
studies because of the difficulties of using a predictive design that we discussed earlier. In this
setting, the process is to administer the test to employees currently in the position for which the
test is being considered as a selection tool and then to collect criterion data on the same people
(such as performance appraisal data). In some cases, the criterion data are specifically designed
to be used in the concurrent study, while in other cases recent, existing data are used. Then the
test scores are correlated with the criterion data and the validity coefficient is calculated.
For Your Information Box 7.2 Did Restriction of Range Decrease the Validity Coefficient?
Does a student’s academic self-concept—how the student views himself or herself in the role of a student—affect
the student’s academic performance? Michael and Smith (1976) developed the Dimensions of Self-Concept
(DOSC), a self-concept measure that emphasizes school-related activities and that has five subscales that measure
level of aspiration, anxiety, academic interest and satisfaction, leadership and initiative, and identification versus
alienation.
Researchers at the University of Southern California (Gribbons, Tobey, & Michael, 1995) examined the evidence of
validity based on test–criterion relationships of the DOSC by correlating DOSC test scores with GPA. They selected
176 new undergraduates from two programs for students considered at risk for academic difficulties. The students
came from a variety of ethnic backgrounds, and 57% were men.
At the beginning of the semester, the researchers administered the DOSC to the students following the guidelines
described in the DOSC manual (Michael, Smith, & Michael, 1989). At the end of the semester, they obtained each
student’s first-semester GPA from university records. When they analyzed the data for evidence of
reliability/precision and validity, the DOSC showed high internal consistency, but scores on the DOSC did not
predict GPA.
Did something go wrong? One conclusion is that self-concept as measured by the DOSC is unrelated to GPA.
However, if the study or the measures were somehow flawed, the predictive evidence of validity of the DOSC might
have gone undetected. The researchers suggested that perhaps academic self-concept lacks stability during students’
first semester. Although the internal consistency of the DOSC was established, the researchers did not measure the
test–retest reliability/precision of the test. Therefore, this possibility cannot be ruled out. The researchers also
suggested that GPA might be an unreliable criterion.
Could restriction of range have caused the validity of the DOSC to go undetected? This is a distinct possibility for
two reasons. First, for this study the researchers chose only those students who were at risk for experiencing
academic difficulties. Because the unrestricted population of students also contains those who are expected to
succeed, the researchers might have restricted the range of both the test and the criterion. Second, the students in the
study enrolled in programs to help them become successful academically. Therefore, participating in the programs
might have enhanced the students’ academic self-concept.
This study demonstrates two pitfalls that researchers designing predictive studies must avoid. Researchers must be
careful to include in their studies participants who represent the entire possible range of performance on both the test
and the criterion. In addition, they must design predictive studies so that participants are unlikely to change over the
course of the study in ways that affect the abilities or traits that are being measured.
Barrett, Phillips, and Alexander (1981) compared the two methods for determining evidence of
validity based on predictor–criteria relationships in an organizational setting using cognitive
ability tests. They found that the two methods provide similar results. However, this may not
always be the case. For Your Information Box 7.3 describes a recent example of the two
approaches producing different results.
Selecting a Criterion
A criterion is an evaluative standard that researchers use to measure outcomes such as
performance, attitude, or motivation. Evidence of validity derived from test–criterion relationships
provides evidence that the test relates to some behavior or event that is independent of the
psychological test. As you recall from For Your Information Box 7.1, the researchers at Brigham
Young University constructed two criteria—a questionnaire and classifications on marital
satisfaction and marital stability—to demonstrate evidence of validity of the PREP-M.
For Your Information Box 7.3 Developing Concurrent and Predictive Evidence of Validity for Three Measures of
Readiness to Change Alcohol Use in Adolescents
Maisto and his colleagues (2011) were interested in motivation or readiness to change and how it relates to alcohol
use in adolescents. They felt that if a good measure of this construct could be identified, it would help in the design
of clinical interventions for the treatment of alcohol abuse. They noted that there was little empirical evidence to
support the validity of any of the existing measures of motivation or readiness to change, especially in an adolescent
population.
The researchers identified three existing measures of “readiness to change” frequently used in substance abuse
contexts. The first was the Stages of Change and Treatment Eagerness Scale (SOCRATES). This tool was designed
to measure two dimensions of readiness to change called Problem Recognition and Taking Steps concerning alcohol
use (Maisto, Chung, Cornelius, & Martin, 2003). The second measure they evaluated was the Readiness Ruler
(Center on Alcoholism, Substance Abuse and Addictions, 1995). This measure is a simple questionnaire that asks
respondents to rate on a scale of 1 to 10 how ready they are to change their alcohol use behavior using anchors such
as not ready to change, unsure, and trying to change. The third measure they investigated is called the Staging
Algorithm (Prochaska, DiClemente, & Norcross, 1992), which places people into five stages based on their
readiness to change: pre-contemplation, contemplation, preparation, action, and maintenance.
The research question was whether these instruments would show concurrent and/or predictive evidence of validity.
That is, would high scores on the readiness-to-change instruments be associated with lower alcohol consumption
reported when both measures were taken at the same point in time (concurrent evidence), and would high scores on
the instruments taken at one point in time predict lower alcohol consumption measured at a later point in time
(predictive evidence)?
The participants were adolescents aged 14 to 18 years who were recruited at their first treatment session from seven
different treatment programs for adolescent substance users. Two criteria were used for the study. The first was the
average percentage of days that the participants reported being abstinent from alcohol (PDA). The second was the
average number of drinks they consumed per drinking day (DDD). These criteria were measured three times during
the study—on the first day of treatment (concerning the previous 30 days) and at 6-month and 12-month follow-up
sessions. The three “readiness-to-change” measures were filled out by the participants each time.
Concurrent evidence of validity was gathered by correlating the scores on each readiness-to-change instrument with
the average alcohol consumption criteria reported by the participants at the same time. The correlation between
baseline alcohol consumption and the initial readiness-to-change measure taken at the same time for PDA was positive
and statistically significant for the Readiness Ruler (.35), the Staging Algorithm (.39), and the SOCRATES Taking
Steps dimension (.22). Higher scores on each of the readiness measures were associated with a larger percentage of
days that the participants reported being abstinent from alcohol. Likewise, higher scores on the readiness
instruments were significantly associated with lower numbers on the DDD measure. The correlations were –.34 for
the Readiness Ruler, –.40 for the Staging Algorithm, and –.22 for the SOCRATES Taking Steps dimension. The
scores from the SOCRATES Problem Recognition measure were not correlated with either PDA or DDD.
Therefore, all the instruments with the exception of SOCRATES Problem Recognition demonstrated evidence of
concurrent validity. A similar pattern of results was observed when data collected at the 6-month follow-up were
analyzed.
Predictive evidence of validity was gathered using a statistical technique, discussed in this chapter, called multiple
regression. This technique can be applied to evaluate the relationship between a criterion variable and more than one
predictor variable. While the researchers in this study were interested only in evaluating how well the readiness-to-
change scores predicted later alcohol consumption (the criterion variable), they recognized that there were other
variables present in the study that might also be related to alcohol consumption. Some of these other variables were
age, gender, race, and how much alcohol each participant reported consuming at the beginning of the study. By
using multiple regression, the researchers were able to statistically “control for” the effects of these other variables.
Then they could estimate how well the readiness-to-change measures by themselves predicted future alcohol
consumption. This is called incremental validity because it shows how much additional variance is accounted for by
the readiness-to-change measures alone, over and above the variance accounted for by the other variables used in the
regression.
The researchers performed two regressions. First, the initial readiness-to-change scores on each instrument were
used to predict alcohol consumption after 6 months of treatment. Then the readiness-to-change scores taken at 6
months of treatment were used to predict alcohol consumption after 12 months of treatment. The results showed that
only the Readiness Ruler had significant predictive evidence of validity for both measures of alcohol consumption
(PDA and DDD) at 6 months and 12 months of treatment.
The interesting finding in this study is that while all the measures showed concurrent evidence of validity, only the
Readiness Ruler showed both concurrent and predictive evidence. There may be a number of plausible explanations
for these seemingly contradictory results. Can you think of some of those reasons?
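The idea of incremental validity described in For Your Information Box 7.3 can be illustrated with a short calculation. The sketch below is not a reanalysis of the Maisto study: the data are simulated, only one control variable (baseline drinking) is used instead of the several the researchers controlled for, and the variable and function names are our own. It fits one regression with the control variable alone and a second regression that adds a readiness-to-change score; the increase in R² is the additional variance accounted for by the readiness measure.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix (copy)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(predictor_columns, y):
    """R^2 from a least squares fit of y on the predictors plus an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in predictor_columns] for i in range(n)]
    k = len(rows[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(rows, y)) for i in range(k)]
    coefs = solve_linear_system(xtx, xty)
    predicted = [sum(c * v for c, v in zip(coefs, row)) for row in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Simulated (hypothetical) data for 300 participants.
random.seed(11)
baseline_drinks, readiness, later_drinks = [], [], []
for _ in range(300):
    base = random.gauss(5, 2)    # drinks per drinking day at intake
    ready = random.gauss(5, 2)   # readiness-to-change score
    later = 0.6 * base - 0.4 * ready + random.gauss(0, 1.5)
    baseline_drinks.append(base)
    readiness.append(ready)
    later_drinks.append(later)

r2_control = r_squared([baseline_drinks], later_drinks)
r2_full = r_squared([baseline_drinks, readiness], later_drinks)
incremental = r2_full - r2_control

print(f"R^2, control variable only:     {r2_control:.3f}")
print(f"R^2, control + readiness score: {r2_full:.3f}")
print(f"incremental variance explained: {incremental:.3f}")
```

Because adding a predictor can never lower R² in the sample used to fit the model, researchers also test whether the increase is statistically significant before treating it as evidence of incremental validity.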
In a business setting, employers use pre-employment tests to predict how well an applicant is
likely to perform a job. In this case, supervisors’ ratings of job performance can serve as a
criterion that represents performance on the job. Other criteria that represent job performance
include accidents on the job, attendance or absenteeism, disciplinary problems, training
performance, and ratings by peers—other employees at the work site. None of these measures
can represent job performance perfectly, but each provides information on important
characteristics of job performance.
Objective and Subjective Criteria
Criteria for job performance fall into two categories: objective and subjective. An objective
criterion is one that is observable and measurable, such as the number of accidents on the job,
the number of days absent, or the number of disciplinary problems in a month. A subjective
criterion is based on a person’s judgment. Supervisor and peer ratings are examples of
subjective criteria.
Each has advantages and disadvantages. Well-defined objective criteria contain less error
because they are usually tallies of observable events or outcomes. Their scope, however, is often
quite narrow. For instance, dollar volume of sales is an objective criterion that might be used to
measure a person’s sales ability. This number is easily calculated, and there is little chance of
disagreement on its numerical value. It does not, however, take into account a person’s
motivation or the availability of customers. On the other hand, a supervisor’s ratings of a
person’s sales ability may provide more information on motivation, but in turn ratings are based
on judgment and might be biased or based on information not related to sales ability, such as
expectations about race or gender. Table 7.1 lists a number of criteria used in educational,
clinical, and organizational settings.
Does the Criterion Measure What It Is Supposed to
Measure?
The concept of validity evidence based on content (addressed in the prior chapter) also applies to
criteria. Criteria must be representative of the events they are supposed to measure. Criterion
scores have evidence of validity to the extent that they match or represent the events in question.
Therefore, a criterion of sales ability must be representative of the entire testing universe of sales
ability. Because there is more to selling than just having the highest dollar volume of sales,
several objective criteria might be used to represent the entire testing universe of sales ability.
For instance, we might add the number of sales calls made each month to measure motivation
and add the size of the target population to measure customer availability.
Table 7.1 ■ Common Criteria
Criterion                      Objective   Subjective
Educational settings
  Grade point average (GPA)        X
  Withdrawal or dismissal          X
  Teacher’s recommendations                    X
Clinical settings
  Diagnosis                                    X
  Behavioral observation           X
  Self-report                                  X
Organizational settings
  Units produced                   X
  Number of errors                 X
  Ratings of performance                       X
Subjective measures such as ratings can often demonstrate better evidence of their validity based
on content because the rater can provide judgments for a number of dimensions specifically
associated with job performance. Rating forms are psychological measures, and we expect them
to be reliable and valid, as we do for any measure. We estimate their reliability/precision using
the test–retest or internal consistency method, and we generate evidence of their validity by
matching their content to the knowledge, skills, abilities, or other characteristics (such as
behaviors, attitudes, personality characteristics, or other mental states) that are presumed to be
present in the test takers. (A later chapter contains more information on various types of rating
scales and their uses in organizations: “How Are Tests Used in Organizational Settings?”) By
reporting the reliability of their criteria, researchers provide us with information on how
consistent their outcome measures are. As you may have noticed, the researchers at Brigham
Young University (Holman, Larson, & Harmer, 1994) who conducted the study on the predictive
validity of the PREP-M reported high reliability/precision for their questionnaire, which was
their subjective criterion.
Sometimes criteria do not represent all of the dimensions in the behavior, attitude, or event being
measured. When this happens, the criterion has decreased evidence of validity based on its
content because it has underrepresented some important characteristics. If the criterion
measures more dimensions than those measured by the test, we say that criterion
contamination is present. For instance, if one were looking at the test–criterion relationship of a
test of sales aptitude, a convenient criterion might be the dollar volume of sales made over some
period of time. However, if the dollar volume of sales of a new salesperson reflected both his or
her own sales as well as sales that resulted from the filling of back orders sold by the former
salesperson, the criterion would be considered contaminated.
In The News Box 7.1 What Are the Criteria for Success?
Choosing criteria for performance can be difficult. Consider the criteria for teacher performance in Tennessee. In
2010, Tennessee won a federal Race to the Top Grant worth about $501 million for the state’s public school system.
As part of the program outlined in the grant application, the criteria for evaluating teachers would be students’ test
scores on state subject matter tests (e.g., writing, math, and reading) and observations by school principals.
Think a minute about these criteria. Are these criteria objective or subjective? Are they well defined? Can these
criteria be reliably measured? Is there likely to be error in the measures of these criteria? Are they valid measures of
teacher performance?
Let’s look closely at these criteria. One criterion is objective and one is subjective. Can you tell which is which? If
you answered that the test scores are objective and the observations are subjective, you are correct. The test scores
have good attributes. They provide data that can be analyzed using statistical procedures. They can be collected,
scored, and secured efficiently. If the tests are constructed correctly the scores will be reliable and valid. On the
other hand, if the tests are not developed correctly, their scores can contain error and bias.
What about the principals’ observations? What are their attributes likely to be? We say that the observations are
subjective, because they are based on personal opinion. They will probably be affected by each principal’s
preconceived notions and opinions. One way those errors can be avoided is by training the principals to use a valid
observation form that relies only on behaviors. Such forms would identify behaviors and form the basis for ratings.
Under the grant guidelines, teachers would receive at least four 10-minute evaluations each year.
The federal grant requirements outline a plan that, if carried out correctly, can yield useful results for improving the
education of Tennessee’s students. These evaluations, however, are most important to the individual teachers in
Tennessee because they will be used to decide which teachers will be retained, given pay raises, promoted, and
granted tenure. As you might expect, some principals and teachers have serious complaints. Those who teach
subjects for which there are no state tests, such as art, music, physical education, and home economics, are allowed
to choose the subjects under which they wish to be evaluated. A physical education teacher could choose to be
evaluated using the school’s scores on the state writing test. Some principals who once did classroom visits regularly
now feel compelled to observe teachers only when they are formally evaluating them.
Has something gone wrong here? Should the state provide more regulations and rules to govern the evaluation
procedures? The state board is looking into the evaluation process. “Evaluations shouldn’t be terribly onerous, so
complex you get lost among the trees,” board chairman Fielding Rolston has said. “We don’t want there to be so
many checkmarks you can’t tell what’s being evaluated” (Crisp, 2011).
Sources: Crisp (2011), Winerip (2011), and Zehr (2011).
As you can see, when evaluating a validation study, it is important to think about the criterion in
the study as well as the predictor. When unreliable or inappropriate criteria are used for
validation, the true validity coefficient might be under- or overestimated. In the News Box 7.1
describes some issues associated with identifying appropriate criteria to evaluate the
performance of school teachers needed to meet the requirements for a large federal grant.
To close this section, we thought it would be useful for you to see some information about the
predictive validity of a test that many of you may have taken as a requirement for college
admission—the SAT. To learn more about the evidence that has been gathered to support the
validity of the SAT, see On the Web Box 7.1.
Calculating and Evaluating Validity Coefficients
You may recall that the correlation coefficient is a quantitative estimate of the linear relationship
between two variables. In validity studies, we refer to the correlation coefficient between the test
and the criterion as the validity coefficient and represent it in formulas and equations as $r_{xy}$.
The x in the subscript refers to the test, and the y refers to the criterion. The validity coefficient
represents the amount or strength of the evidence of validity based on the relationship of the test
and the criterion.
Validity coefficients must be evaluated to determine whether they represent a level of validity
that makes the test useful and meaningful. This section describes two methods for evaluating
validity coefficients and how researchers use test–criterion relationship information to make
predictions about future behavior or performance.
Tests of Significance
A validity coefficient is interpreted in much the same way as a reliability coefficient, except that
our expectations for a very strong relationship are not as great. We cannot expect a test to have
as strong a relationship with another variable (test–criterion evidence of validity) as it does with
itself (reliability). Therefore, we must evaluate the validity coefficient by using a test of
significance and by examining the coefficient of determination.
The first question to ask about a validity coefficient is, “How likely is it that the correlation
between the test and the criterion resulted from chance or sampling error?” In other words, if the
test scores (e.g., SAT scores) and the criterion (e.g., college GPA) are completely unrelated, then
their true correlation is zero. If we conducted a study to determine the relationship between these
two variables and found that the correlation was .4, one question we would need to ask is, “What
is the probability that our study would have yielded the observed correlation by chance alone,
even if the variables were truly unrelated?” If the probability that the correlation occurred by
chance is low—less than 5 chances out of 100 (p < .05)—we can be reasonably sure that the test
and its criterion (in this example, SAT scores and college GPA) are truly related. This process is
called a test of significance. In statistical terms, for this example we would say that the validity
coefficient is significant at the .05 level. In organizational settings, it can be challenging for
validity studies to have statistically significant results at p < .05 because of small sample sizes
and criterion contamination.
Because larger sample sizes reduce sampling error, this test of significance requires that we take
into account the size of the group (N) from which we obtained our data. Appendix E can be used
to determine whether a correlation is significant at varying levels of significance. To use the
table in Appendix E, calculate the degrees of freedom (df) for your correlation using the
formula df = N – 2, and then determine the probability that the correlation occurred by chance by
looking across the row associated with those degrees of freedom. The correlation coefficient you
are evaluating should be larger than the critical value shown in the table. You can determine the
level of significance by looking at the column headings. At the level at which your correlation
coefficient is smaller than the value shown, the correlation can no longer be considered
significantly different from zero. For Your Information Box 7.4 provides an example of this
process.
On the Web Box 7.1 Validity and the SAT
To help make admissions decisions, many colleges and universities rely on applicants’ SAT scores. Because
academic rigor can vary from one high school to the next, the SAT—a standardized test—provides schools with a
fair and accurate way to put students on a level playing field to compare one student with another. However,
whether the SAT truly predicts success in college—namely, 1st-year college grades—is controversial.
To learn more about the validity of the SAT, visit the following websites:
General Information
Website: www.fairtest.org/facts/satvalidity.html
Description: General discussion of the following:
• What the SAT is supposed to measure
• What SAT I validity studies from major colleges and universities show
• How well the SAT I predicts success beyond the freshman year
• How well the SAT I predicts college achievement for women, students of color, and older students
• How colleges and universities should go about conducting their own validity studies
• Alternatives to the SAT
Research Studies
Website: https://research.collegeboard.org/sites/default/files/publications/2012/7/researchreport-2008-5-validity-sat-predicting-first-year-college-grade-point-average.pdf
Description: Research study exploring the predictive validity of the SAT in predicting 1st-year college grade point average (GPA)
Website: www.ucop.edu/news/sat/research.html
Description: Research study presenting findings on the relative contributions of high school GPA, SAT I scores, and SAT II scores in predicting college success for 81,722 freshmen who entered the University of California from fall 1996 through fall 1999
Website: www.collegeboard.com/prod_downloads/sat/newsat_pred_val.pdf
Description: Summary of research exploring the predictive value of the SAT Writing section
Website: www.psychologicalscience.org/pdf/ps/frey.pdf
Description: A paper that looks at the relationship between SAT scores and general cognitive ability
When researchers or test developers report a validity coefficient, they should also report its level
of significance. You might have noted that the validity coefficients of the PREP-M (reported
earlier in this chapter) are followed by the statements p < .01 and p < .05. This information tells
the test user that the likelihood a relationship was found by chance or as a result of sampling
error was less than 5 chances out of 100 (p < .05) or less than 1 chance out of 100 (p < .01).
For Your Information Box 7.4 Test of Significance for a Correlation Coefficient
Here we illustrate how to determine whether a correlation coefficient is significant (evidence of a true relationship)
or not significant (no relationship). Let’s say that we have collected data from 20 students. We have given the
students a test of verbal achievement, and we have correlated students’ scores with their grades from a course on
creative writing. The resulting correlation coefficient is .45.
We now go to the table of critical values for Pearson product–moment correlation coefficients in Appendix E. The
table shows the degrees of freedom (df) and alpha (α) levels for two-tailed and one-tailed tests. Psychologists usually
set their alpha level at 5 chances out of 100 (p < .05) using a two-tailed test, so we use that standard for our example.
Because we used the data from 20 students in our sample, we substitute 20 for N in the formula for degrees of
freedom (df = N – 2). Therefore, df = 20 – 2 or 18. We then go to the table and find 18 in the df column. Finally, we
locate the critical value in that row under the .05 column.
A portion of the table from Appendix E is reproduced in the table below, showing the alpha level for a two-tailed
test. The critical value of .4438 (bolded in the table) is the one we use to test our correlation. Because our correlation
(.45) is greater than the critical value (.4438), we can infer that the probability of finding our correlation by chance is
less than 5 chances out of 100. Therefore, we assume that there is a true relationship and refer to the correlation
coefficient as significant. Note that if we had set our alpha level at a more stringent standard of .01 (1 chance out of
100), our correlation coefficient would have been interpreted as not significant.
If the correlation between the test and the criterion is not as high as the critical value shown in the table, we can say
that the chance of error associated with the test is above generally accepted levels. In such a case, we would
conclude that the validity coefficient does not provide sufficient evidence of validity.
Critical Values for Pearson Product–Moment Correlation Coefficients
df .10 .05 .02 .01 .001
16 .4000 .4683 .5425 .5897 .7084
17 .3887 .4555 .5285 .5751 .6932
18 .3783 .4438 .5155 .5614 .6787
19 .3687 .4329 .5034 .5487 .6652
20 .3598 .4227 .4921 .5368 .6524
Source: From Statistical Tables for Biological, Agricultural and Medical Research by R. A. Fisher and F. Yates. Copyright © 1963. Published by
Pearson Education Limited.
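The table lookup described in this box can be sketched in a few lines of Python. This is only an illustration: the critical values are copied from the table excerpt above (two-tailed test, df = 16 to 20), and the names `CRITICAL_R`, `degrees_of_freedom`, and `is_significant` are hypothetical helpers, not part of the original text.

```python
# Critical values of r for a two-tailed test, copied from the table excerpt above.
# Outer keys are degrees of freedom (df); inner keys are alpha levels.
CRITICAL_R = {
    16: {0.10: 0.4000, 0.05: 0.4683, 0.02: 0.5425, 0.01: 0.5897, 0.001: 0.7084},
    17: {0.10: 0.3887, 0.05: 0.4555, 0.02: 0.5285, 0.01: 0.5751, 0.001: 0.6932},
    18: {0.10: 0.3783, 0.05: 0.4438, 0.02: 0.5155, 0.01: 0.5614, 0.001: 0.6787},
    19: {0.10: 0.3687, 0.05: 0.4329, 0.02: 0.5034, 0.01: 0.5487, 0.001: 0.6652},
    20: {0.10: 0.3598, 0.05: 0.4227, 0.02: 0.4921, 0.01: 0.5368, 0.001: 0.6524},
}


def degrees_of_freedom(n):
    """df = N - 2 for a Pearson correlation based on N people."""
    return n - 2


def is_significant(r, n, alpha=0.05):
    """Return True if |r| exceeds the tabled critical value for df = n - 2."""
    df = degrees_of_freedom(n)
    if df not in CRITICAL_R or alpha not in CRITICAL_R[df]:
        raise ValueError("df or alpha is outside the excerpt of the table")
    return abs(r) > CRITICAL_R[df][alpha]


# The example from the box: r = .45 from N = 20 students.
print(is_significant(0.45, 20, alpha=0.05))
print(is_significant(0.45, 20, alpha=0.01))
```

Only the five rows reproduced above are included, so sample sizes outside N = 18 to 22 raise an error rather than return a guess.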
The Coefficient of Determination
Another way to evaluate the validity coefficient is to determine the amount of variance that the
test and the criterion share. We can determine the amount of shared variance by squaring the
validity coefficient to obtain $r^2$, called the coefficient of determination. For example, if the
correlation ($r$) between a test and a criterion is .30, the coefficient of determination ($r^2$) is .09.
This means that the test and the criterion have 9% of their variance in common. Larger validity
coefficients represent stronger relationships with greater overlap between the test and the
criterion. Therefore, if $r = .50$, then $r^2 = .25$, or 25% shared variance.
We can calculate the coefficient of determination for the correlation of husbands’ scores on the
PREP-M and the questionnaire on marital satisfaction and stability. By squaring the original
coefficient, .44, we obtain the coefficient of determination, $r^2 = .1936$. This outcome means that
the predictor, the PREP-M, and the criterion, the questionnaire, shared (or had in common)
approximately 19% of their variance.
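The arithmetic in the examples above can be checked with a minimal Python sketch. The function name `coefficient_of_determination` is an illustrative assumption; the three correlations are the ones used in the text.

```python
def coefficient_of_determination(r):
    """Return r squared, the proportion of variance shared by test and criterion."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r ** 2


# Validity coefficients taken from the examples in the text.
for r in (0.30, 0.50, 0.44):
    shared = coefficient_of_determination(r)
    print(f"r = {r:.2f} -> r^2 = {shared:.4f}")
```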
Unadjusted validity coefficients rarely exceed .50. Therefore, you can see that even when a
validity coefficient is statistically significant, the test can account for only a small portion of the
variability in the criterion. The coefficient of determination is important to calculate and
remember when using the correlation between the test and the criterion to make predictions
about future behavior or performance.
How Confident Can We Be About Estimates of Validity?
Conducting one validity study that demonstrates a strong relationship between the test and the
criterion is the first step in a process of validation, but it is not the final step. Studies that provide
evidence of a test’s validity should continue for as long as the test is being used. No matter how
well designed the validation study is, elements of chance, error, and situation-specific factors that
can inflate or deflate the estimate of validity are always present. Ongoing investigations of
validity include cross-validation (where the results that are obtained using one sample are used to
predict the results on a second, similar sample) and meta-analyses (where the results from many
studies are statistically combined to provide a more error-free estimate of validity). Psychologists
also inquire about whether validity estimates are stable from one situation or population to
another—a question of validity generalization. We have more to say about these topics in later
chapters.
The Relationship Between Reliability and Validity
You have already learned in your study of reliability that according to classical test theory,
observed scores on a test can be thought of as the sum of two components—the individual’s true
score on the construct that the test was designed to measure and a random component, which we
call measurement error. You have also learned that the reliability coefficient can be conceived of
as a test’s correlation with itself or another parallel test. That is why we often indicate the
reliability coefficient as $R_{xx}$, where the two subscripts are the same. The reason why random error
will always reduce the reliability coefficient is that any event that is random will have a zero
correlation with any other event. That is simply another way of saying that knowledge of one
random event will give you no information that would enable you to predict any other event. So
the more that random events affect a test score, the less that score can correlate with any other
measurement—even itself.
You have now learned that one way we provide evidence of validity is to correlate the scores on
a test with the scores on a criterion measure. This is called a validity coefficient. You also know
that test scores always contain random error so they are never perfectly reliable. And now you
have also learned that reliability is also a concern when we develop criterion measures. So if the
correlation of a test with itself is reduced from the maximum theoretical value 1.0 due to
measurement error, what happens to the correlation coefficient when we correlate test scores that
contain random error with a criterion measure which also contains random error? The answer is
that just as in the case of reliability, the random error in both measures will reduce the degree to
which the two sets of scores can correlate with each other no matter how well the construct that
is measured by the test actually predicts the construct measured by the criterion. This reduction
in the validity coefficient is referred to as attenuation due to unreliability.
It is a simple matter to quantify the degree to which unreliability can affect (attenuate or reduce)
the validity coefficient. Mathematically, the square root of the reliability coefficient of a test
sets the upper limit of the validity coefficient. So if a test has a reliability of .64, the maximum
correlation that the test could have with a perfectly reliable criterion is the square root of the
reliability coefficient. In this example that would be $\sqrt{.64} = .8$, so the maximum validity coefficient
would be .8.
If the criterion is less than perfectly reliable (which will always be the case), the maximum
possible correlation between the test and criterion will be even lower. The maximum validity
coefficient between a test and criterion can also be easily calculated if you know the reliability of
both. It is simply the product of the square roots of both reliability coefficients. So if the
reliability of the criterion measure was .7, the maximum observed correlation the criterion could
have with a test that had a reliability coefficient of .64 would be $\sqrt{.64} \times \sqrt{.7} = .67$ if the constructs
being measured were perfectly related, which also will never be the case. If the “true” correlation
between the constructs were actually .5 (which, like true scores in reliability calculations, you
can never really know), the observed correlation (the validity coefficient) between the test and
criterion for this example would be further reduced and is equal to $.5 \times \sqrt{.64} \times \sqrt{.7} \approx .33$ (or .34 if the rounded value .67 is used). The general
formula that demonstrates that the correlation between a test and a criterion is dependent upon
the “true” correlation between the predictor and criterion constructs and the reliability of the
observed scores is:
$$r_{x_o y_o} = r_{x_t y_t}\sqrt{R_{xx} R_{yy}}$$
where
• $r_{x_o y_o}$ = the observed correlation between the predictor measure (test) and criterion
measure
• $r_{x_t y_t}$ = the “true” correlation between the predictor construct and criterion construct
• $R_{xx}$ = the reliability coefficient of the predictor measure (test)
• $R_{yy}$ = the reliability of the criterion measure
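As a rough illustration of the formula above, the following Python sketch computes the ceiling on a validity coefficient and the observed correlation expected for a given “true” correlation. The function names are hypothetical, and the reliabilities (.64 and .7) and the true correlation (.5) are the values used in the preceding paragraphs.

```python
import math


def max_validity(r_xx, r_yy=1.0):
    """Upper limit of the observed validity coefficient.

    r_xx is the test's reliability; r_yy is the criterion's reliability
    (1.0 stands for a perfectly reliable criterion).
    """
    return math.sqrt(r_xx) * math.sqrt(r_yy)


def observed_validity(true_r, r_xx, r_yy):
    """Observed correlation implied by a 'true' correlation and two reliabilities."""
    return true_r * math.sqrt(r_xx * r_yy)


print(round(max_validity(0.64), 2))                  # perfectly reliable criterion
print(round(max_validity(0.64, 0.7), 2))             # criterion reliability of .7
print(round(observed_validity(0.5, 0.64, 0.7), 3))   # "true" correlation of .5
```

Because the true correlation can never be known in practice, `observed_validity` is useful mainly for seeing how quickly unreliability shrinks a coefficient, not for analyzing real data.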
This attenuation of the “true” validity coefficient due to the unreliability of the test and the
criterion is the reason why observed validity coefficients often range between .2 and .4 while
reliability coefficients for well-designed tests are often greater than .8. Even though both
coefficients are correlations, the reliability coefficient is the correlation between the test and
itself, which will always be higher than the correlation of the test with a less than perfectly reliable
criterion measure. The measurement error that is present in both will attenuate the observed
correlation.
Psychometricians have developed methods for “correcting” validity coefficients for attenuation
due to unreliability. These methods can be controversial because if they are used inappropriately,
they will misrepresent the relationship between a test and criterion (i.e., validity) and could lead
to incorrect inferences being made from the test scores. In Greater Depth Box 7.1 discusses the
correction of validity coefficients for attenuation due to unreliability in more detail, some of the
interpretive challenges such corrections present, and an important concept called operational
validity.
In Greater Depth Box 7.1 Operational Validity and the Correction for Attenuation in Validity Due to Unreliability
Consider the following (not so) hypothetical example. A large company wishes to use a personality test to predict
job performance to help select their employees. In particular, they are interested in the personality trait of
conscientiousness that they believe will be related to overall job performance. So they decide to conduct a predictive
validity study in which everybody hired takes the personality test to measure their level of conscientiousness, but
these results are not used as part of the selection process. One year later, they collect performance evaluation data on
everyone who took the test. To compute a validity coefficient, they correlate the scores on the conscientiousness
scale of the personality test with the scores on the performance evaluations and find that it is .29. This would be
called the observed validity of conscientiousness to predict job performance.
But what if the performance evaluation they used as a criterion measure was unreliable? As we discussed earlier, the
observed validity coefficient will be attenuated or reduced due to this unreliability. Before discussing how the
validity coefficient can be statistically adjusted to account for the unreliability in the criterion, we need to discuss the
difference between the observed validity and something called the true score validity.
You learned in our discussion of reliability that all observed scores consist of two components—true scores and
random error. The true scores represent the degree to which a person possesses a particular knowledge or trait if we
could measure it without error. We refer to this knowledge or trait as the construct that the test was designed to
measure. In our example, there are two constructs that the company needed to measure as part of their validity study.
The first was the personality trait of conscientiousness. This construct was measured via a personality test and can
be called the predictor construct. The second construct was overall job performance, which was measured via job
performance evaluation data. This can be called the criterion construct. If we could, we would really like to know
everyone’s true scores at the construct level. Then we could correlate the true scores on the predictor with the true
scores on the criterion and obtain the true score validity coefficient. But unfortunately, we can’t do that. All we can
do is correlate the scores on the imperfect observed measures that are designed to assess each construct to obtain an
observed validity coefficient. Because these observed scores contain random error (they are not perfectly reliable),
the observed validity coefficient will always be lower than what the true score validity coefficient would have been.
The relationships between the constructs and the observed measures of those constructs are depicted graphically
in Figure 7.1.
The fact that the observed validity coefficient will always be less than the hypothetical true score validity raises some
important questions: Since the observed validity coefficient will be reduced because of the presence of measurement
error, how are we to properly interpret it? Does the attenuated validity coefficient really accurately describe the
predictive relationship between the predictor and the criterion?
To answer these questions, let’s go back to our example. The predictive validity study that we described above used
performance evaluation data as its measure of overall job performance. The observed validity coefficient between
conscientiousness scores on the personality test and the job performance scores was measured to be .29. However,
research has demonstrated that the reliability/precision of performance evaluations as a measure of overall job
performance is relatively poor, with a mean meta-analytic reliability coefficient across many studies of only .52
(Viswesvaran, Ones, & Schmidt, 1996). This means that while 52% of the variance in job performance ratings is due
to the true scores on the construct, 48% of the variance is attributable to measurement error. Therefore, the observed
correlation of .29 between conscientiousness and overall job performance measured by performance evaluation data
will be lower than it would be if we could measure both without error at the construct level. The true
correlation has been attenuated due to measurement error (unreliability) in the observed criterion measure. The
implication of this is that the personality trait of conscientiousness may actually be a better predictor of overall job
performance than the validity coefficient suggests!
Figure 7.1 ■ Graphical Representation of True Score, Observed Score, and Operational Validity
Source: Copyright © 2014 by Society for Industrial and Organizational Psychology.
There is a correction that can be applied to estimate what the validity coefficient would be if we were able to
measure the criterion construct without error. It is called the correction for attenuation. The formula is quite
simple:
$$r_{xy(\text{corrected})} = \frac{r_{xy(\text{as measured})}}{\sqrt{R_{yy}}}$$
where:
• $r_{xy(\text{corrected})}$ = the validity coefficient corrected for attenuation
• $r_{xy(\text{as measured})}$ = the original measured validity coefficient
• $R_{yy}$ = the estimated reliability of the observed criterion
In our example, the measured validity coefficient was determined to be .29. The estimated reliability of the criterion
(the job performance evaluation data) was estimated via previous research to be .52 (Viswesvaran et al., 1996).
Therefore, the validity coefficient corrected for attenuation due to the unreliability in the criterion would
be $.29 / \sqrt{.52} = .40$. The corrected validity coefficient can be interpreted to be the correlation between the predictor
(conscientiousness in our example) and the criterion construct (job performance), not the
criterion measure (performance evaluation data). We call this correlation the operational validity or the true
validity of the predictor for predicting the criterion construct (Viswesvaran et al., 2014). It is an attempt to reflect
the true relationship between the predictor and the criterion once the error in measurement is removed from the
criterion variable. In our example, it suggests that in actual operation, the construct of conscientiousness is
a better predictor of the construct of job performance than is apparent when you simply look at the correlation
between conscientiousness and performance evaluation data based on the observed data.
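The correction just described can be sketched in Python as follows. The function name `correct_for_criterion_unreliability` is an illustrative assumption; the inputs are the observed validity (.29) and the criterion reliability (.52) from the example.

```python
import math


def correct_for_criterion_unreliability(observed_r, criterion_reliability):
    """Divide the observed validity by the square root of the criterion's reliability.

    Only the criterion is corrected here, matching the example in this box;
    the predictor's unreliability is deliberately left in place.
    """
    if not 0.0 < criterion_reliability <= 1.0:
        raise ValueError("reliability must be greater than 0 and at most 1")
    return observed_r / math.sqrt(criterion_reliability)


operational_validity = correct_for_criterion_unreliability(0.29, 0.52)
print(f"{operational_validity:.2f}")
```

A low reliability estimate produces a large upward correction, which is one reason the result depends heavily on how well the criterion reliability itself has been estimated.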
It may have occurred to you that while the validity coefficient is the correlation between a predictor variable and a
criterion variable, we have only applied the correction to the criterion variable. While we could have also applied the
correction to the predictor variable as well, it would not have been appropriate to do so in this case (Society for
Industrial and Organizational Psychology, 2003). This is because once we choose a test, we are interested in how
well that test (with all its error) predicts the criterion. If a test has a lot of error (i.e., is not very reliable), it will do a
poor job of prediction because of that error. There is no reason to correct for that as the correlation between the
unreliable test and the criterion accurately represents the degree to which the test can or cannot predict the criterion.
However, if the criterion measure is unreliable, then the conclusions or inferences we will draw from the validity
coefficient could be very flawed. While the test might be a poor predictor of the criterion at the measurement level
because of criterion unreliability, as you have seen, it might actually be a much better predictor at the construct level. That is, the operational validity of the test-criterion relationship might be very good while the observed
validity is poorer. The benefit or utility of a test actually comes from its operational validity. This is a measure of
how well the criterion will be predicted by the test in actual practice. That’s why operational validity is sometimes
also referred to as true validity (Viswesvaran et al., 2014). Measurement error in the criterion serves to mask the true
relationship between the two.
In closing this section, it is important for us to state that the practice of correcting validity coefficients is something
of a controversial area and there are differing opinions about the appropriateness of the corrections. For a dissenting
view, the interested student should see LeBreton, Scherer, and James (2014). Also, we have only discussed
correcting validity coefficients for unreliability in the criterion measure. Validity coefficients can also be corrected
for range restriction. As we mentioned earlier in this chapter, restriction in range can also reduce a validity
coefficient when tests are used for selection purposes because we only have access to the test and criterion scores for
those people who are actually selected, not the full range of people who might have taken the test. Also, because test
scores will often be correlated with other criteria used to select employees (such as the scores on interviews), those
with the lowest test scores will often not be selected even if the test scores themselves are not used as a selection
criterion as would be the case in a predictive validity study. This will result in an indirect restriction of range on the
test scores that will then artificially reduce the observed validity coefficient. A comprehensive review of the
statistical issues that are present when conducting criterion-related validity studies to gather evidence for validity
based on a test’s relationship with other variables can be found in Van Iddekinge and Ployhart (2008). Finally, the
corrections we have described in this section are most often used when large scale psychometric meta-analyses are
conducted that combine the results of many smaller individual studies to arrive at more accurate estimates of
validity for a particular predictor variable. They are less often used to correct the results obtained in individual
studies like our example because the small sample sizes usually present in these studies would likely result in
unstable corrected estimates that would vary significantly if the study were repeated on a different sample.
Using Validity Information to Make Predictions
When a relationship can be established between a test and a criterion, we can use the test scores of
new individuals to predict how well those individuals will perform on the criterion measure. For
example, some universities use students’ scores on the SAT to predict the students’ success in
college. Organizations use job candidates’ scores on pre-employment tests that have
demonstrated evidence of validity to predict those candidates’ scores on the criteria of job
performance.
Linear Regression
We use the statistical process called linear regression when we use one set of test scores (X) to
predict one set of criterion scores (Y′). While a full description of linear regression is beyond the
scope of this book, we can show you the basic process.
We start by constructing the following linear regression equation:
Y′ = a + bX,
where
• Y′ = the predicted score on the criterion
• a = the intercept
• b = the slope (also called a b weight, regression weight, or regression coefficient)
• X = the score the individual made on the predictor test
You may recognize this formula from previous math courses you have taken as the equation for a
straight line. In linear regression, we refer to this line as the regression line. We calculate
the slope or b weight (b) of the regression line—the expected change in Y for every one-unit
change in X—using the following formula:
$$b = r\frac{s_y}{s_x}$$
where
• $r$ = the correlation coefficient
• $s_x$ = the standard deviation of the distribution of X
• $s_y$ = the standard deviation of the distribution of Y
The intercept is the place where the regression line crosses the y-axis. The intercept (a) is
calculated using the following formula:
$$a = \bar{Y} - b\bar{X}$$
where
• $\bar{Y}$ = the mean of the distribution of Y
• $b$ = the slope
• $\bar{X}$ = the mean of the distribution of X
You may have noticed that we are using the symbols for the sample standard deviation ($s_x$ and $s_y$)
and the sample mean ($\bar{X}$ and $\bar{Y}$) here and in For Your Information Box 7.5 instead of the
symbols for the population values. This is because a regression equation is usually used to make
predictions about a population based on sample data.
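The slope and intercept formulas can be sketched in a few lines of Python. This is only an illustration: the function names are our own, and the summary statistics passed in are hypothetical values chosen so that the arithmetic is easy to follow.

```python
def regression_line(r, s_x, s_y, mean_x, mean_y):
    """Return (a, b) for Y' = a + bX from summary statistics."""
    b = r * (s_y / s_x)          # slope: b = r(sy / sx)
    a = mean_y - b * mean_x      # intercept: a = mean of Y - b * mean of X
    return a, b


def predict(a, b, x):
    """Predicted criterion score Y' for a test score X."""
    return a + b * x


# Hypothetical summary statistics (assumed for illustration only).
a, b = regression_line(r=0.50, s_x=10.0, s_y=2.0, mean_x=50.0, mean_y=20.0)
print(round(b, 3), round(a, 3))          # slope and intercept
print(round(predict(a, b, 60.0), 3))     # predicted Y' when X = 60
```

With these assumed values the slope is 0.1 and the intercept is 15, so a test score of 60 yields a predicted criterion score of 21.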
A test for statistical significance can be performed on b, the regression weight. This is a test that
evaluates whether the slope of the regression line is statistically significantly different from zero.
If b is significantly different from zero, it means that X can be considered to be a valid predictor
of the criterion, Y. Mathematically, a test of b in a simple linear regression will give you exactly
the same results as a test of the correlation between X and Y. If this correlation is statistically
significant, then it also means that the slope of the regression line that predicts Y from X is
statistically significantly different from zero as well. If the correlation between X and Y is not
statistically significant, there would be no reason to perform a regression, as that would mean
that the predictor (X) does not provide any predictive information about criterion (Y).
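The equivalence of the two significance tests can be checked numerically. The sketch below uses the small made-up ASE data set from For Your Information Box 7.5 and computes the t statistic twice: once from the correlation and once from the slope and its standard error. The variable names are our own, and the critical value is taken from a standard t table rather than computed.

```python
import math

# Made-up ASE scores (X) and course grades (Y) from For Your Information Box 7.5.
x = [80, 62, 90, 40, 55, 85, 70, 75, 25, 50]
y = [3, 2, 4, 2, 2, 2, 4, 3, 1, 3]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n
s_xx = sum((xi - mean_x) ** 2 for xi in x)                        # sum of squares for X
s_yy = sum((yi - mean_y) ** 2 for yi in y)                        # sum of squares for Y
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) # sum of cross-products

r = s_xy / math.sqrt(s_xx * s_yy)   # correlation (validity coefficient)
b = s_xy / s_xx                     # slope; algebraically equal to r(sy / sx)

# t test of the correlation, with n - 2 degrees of freedom
t_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# t test of the slope: b divided by its standard error
residual_ss = s_yy - b * s_xy                       # variation in Y not explained by X
se_b = math.sqrt(residual_ss / (n - 2) / s_xx)
t_b = b / se_b

T_CRITICAL = 2.306   # two-tailed .05 critical value of t for df = 8 (standard t table)

print(round(t_r, 3), round(t_b, 3))
print(abs(t_b) > T_CRITICAL)
```

Both routes give the same t statistic (about 2.53 for these data), which exceeds the tabled critical value, so the correlation and the slope are judged significant together.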
For Your Information Box 7.5 shows the calculation of a linear regression equation and how it is
used to predict scores on a criterion.
The process of using correlated data to make predictions is also important in clinical settings. For
Your Information Box 7.6 describes how clinicians use psychological test scores to identify
adolescents at risk for committing suicide.
Multiple Regression
Complex criteria, such as job performance and success in graduate school, are often difficult to
predict with a single test. In these situations, researchers frequently use more than one test to
make a more accurate prediction. A technique called multiple regression is often used in this
situation.
We use the statistical process of multiple regression when we have more than one set of test
scores (X1, X2, … Xn) used for predicting a criterion (Y′). A multiple regression equation
expands the familiar linear regression equation to include more than one predictor or test as
follows:
Y′ = a + b1X1 + b2X2 + b3X3 + … + bnXn,
where
• Y′ = the predicted score on the criterion
• a = the intercept (where the regression line crosses the y-axis)
• Xi = the predictor
• bi = the expected change in Y for every one-unit change in Xi, when all the other
predictors in the equation do not vary or remain constant. As in simple linear regression,
these are also called b weights or regression weights. The b weight is also related to
slope, but when there are more than two predictors, this cannot be graphically represented
because you would actually need to graph in more than three dimensions to see it.
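To make the multiple regression equation concrete, here is a minimal sketch that fits the intercept and two b weights by least squares, using only sums of squares and cross-products. The function names and the five-person data set are hypothetical; the scores were built so that Y = 1 + 2X1 + 3X2 exactly, which makes it easy to see that the fit recovers the weights.

```python
def mean(values):
    return sum(values) / len(values)


def fit_two_predictors(x1, x2, y):
    """Least-squares a, b1, b2 for Y' = a + b1*X1 + b2*X2."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    # Sums of squares and cross-products of deviation scores
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    denom = s11 * s22 - s12 ** 2        # zero when X1 and X2 are perfectly correlated
    b1 = (s22 * s1y - s12 * s2y) / denom
    b2 = (s11 * s2y - s12 * s1y) / denom
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2


# Hypothetical scores for five people, built so that Y = 1 + 2*X1 + 3*X2 exactly.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [9, 8, 19, 18, 29]

a, b1, b2 = fit_two_predictors(x1, x2, y)
print(round(a, 6), round(b1, 6), round(b2, 6))
print(round(a + b1 * 3 + b2 * 4, 6))    # predicted Y' for X1 = 3, X2 = 4
```

Each b weight here is the expected change in Y for a one-unit change in its predictor with the other predictor held constant. Real test data will not fit perfectly, so the weights would then be estimates rather than exact values.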
For Your Information Box 7.5 Making Predictions With a Linear Regression Equation
Research suggests that academic self-efficacy (ASE) and class grades are related. We have made up the following
data to show how we could use the scores on an ASE test to predict a student’s grade. We have also done the various
calculations for you. (Note: Our fake data set is small, to facilitate this illustration.)
Student ASE (X) Grade (Y)
1 80 3
2 62 2
3 90 4
4 40 2
5 55 2
6 85 2
7 70 4
8 75 3
9 25 1
10 50 3
For instance, we can ask, “If a student scores 65 on the ASE test, what course grade would we expect the student to
receive?” We have assigned numbers to each grade to facilitate this analysis: 1 = D, 2 = C, 3 = B, and 4 = A.
• Step 1: Calculate the means and standard deviations of X and Y.
X̄ = 63.2
Ȳ = 2.6
sx = 20.82
sy = .97
• Step 2: Calculate the correlation coefficient (rxy) for X and Y.
rxy = .67
• Step 3: Calculate the slope (b) and intercept (a).
b = r(sy / sx)
b = .67 × (.97 / 20.82)
b = .031
a = Ȳ − bX̄
a = 2.6 − (.031)(63.2)
a = .64
• Step 4: Calculate Y′ (the predicted grade) when X = 65.
Y′ = a + bX
Y′ = .64 + (.031)(65)
Y′ = .64 + 2.02 = 2.66
• Step 5: Translate the number calculated for Y′ back into a letter grade.
A predicted numerical grade of 2.66 converts to a letter grade between C and B, perhaps a C+.
The best prediction we can make is that a person who scored 65 on an ASE test would be expected to earn a course
grade of C+. Note that by substituting any test score for X, we will receive a corresponding prediction for a score
on Y.
This equation actually provides a predicted score on the criterion (Y′) for each test score (X). When the Y′ values are
plotted, they form the linear regression line associated with the correlation between the test and the criterion.
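The steps in this box can be reproduced from the raw scores with a short Python sketch. The variable names are our own. The code uses the sample standard deviation (n − 1 in the denominator), which is what yields the values shown in Step 1.

```python
import math

ase = [80, 62, 90, 40, 55, 85, 70, 75, 25, 50]    # X: ASE test scores
grade = [3, 2, 4, 2, 2, 2, 4, 3, 1, 3]            # Y: 1 = D, 2 = C, 3 = B, 4 = A
n = len(ase)

# Step 1: means and sample standard deviations (n - 1 in the denominator)
mean_x = sum(ase) / n
mean_y = sum(grade) / n
s_x = math.sqrt(sum((x - mean_x) ** 2 for x in ase) / (n - 1))
s_y = math.sqrt(sum((y - mean_y) ** 2 for y in grade) / (n - 1))

# Step 2: correlation between X and Y
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ase, grade)) / (n - 1)
r_xy = cov_xy / (s_x * s_y)

# Step 3: slope and intercept
b = r_xy * (s_y / s_x)
a = mean_y - b * mean_x

# Step 4: predicted grade for an ASE score of 65
y_pred = a + b * 65

print(round(mean_x, 1), round(mean_y, 1), round(s_x, 2), round(s_y, 2))
print(round(r_xy, 2), round(b, 3), round(a, 2), round(y_pred, 2))
```

Rounded to the same number of decimal places, these results agree with the hand calculation in the box. Because the code carries full precision between steps, later decimal places can differ slightly from a calculation that rounds at each step.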
For Your Information Box 7.6 Evidence of Validity of the Suicide Probability Scale Using the Predictive Method
Although the general incidence of suicide has decreased during the past two decades, the rate for people between 15
and 24 years old has tripled. Suicide is generally considered to be the second or third most common cause of death
among adolescents, even though it is underreported (O’Connor, 1997–2014).
If young people who are at risk for committing suicide or making suicide attempts can be identified, greater
vigilance is likely to prevent such actions. Researchers at Father Flanagan’s Boys’ Home, in Boys Town, Nebraska,
conducted a validity study using the predictive method for the Suicide Probability Scale (SPS) that provided
encouraging results for predicting suicidal behaviors in adolescents (Larzelere, Smith, Batenhorst, & Kelly, 1996).
The SPS contains 36 questions that assess suicide risk, including thoughts about suicide, depression, and isolation.
The researchers administered the SPS to 840 boys and girls when they were admitted to the Boys Town residential
treatment program from 1988 through 1993. The criteria for this study were the numbers of suicide attempts, suicide
verbalizations, and self-destructive behaviors recorded in the program’s daily incident reports completed by
supervisors of the group homes. (The interrater reliabilities for reports of verbalizations and reports of self-
destructive behaviors were very high at .97 and .89, respectively. The researchers were unable to calculate a
reliability estimate for suicide attempts because only one attempt was recorded in the reports they selected for the
reliability analysis.)
After controlling for a number of confounding variables, such as gender, age, and prior attempts at suicide, the
researchers determined that the total SPS score and each of its subscales differentiated (p = .05) between those who
attempted suicide and those who did not. In other words, the mean SPS scores of those who attempted suicide were
significantly higher than the mean SPS scores of those who did not attempt suicide. The mean SPS scores of those
who displayed self-destructive behaviors were also significantly higher (p = .01) than the mean SPS scores of those
who did not display self-destructive behaviors. Finally, the total SPS score correlated at .25 (p = .001) with the
suicide verbalization rate. Predictions made by the SPS for those at risk for attempting suicide showed that each 1-
point increase in the total SPS score predicted a 2.4% greater likelihood of a subsequent suicide attempt.
The researchers suggested a cutoff score of 74 for those without prior suicide attempts and a cutoff score of 53 for
those with prior suicide attempts. In other words, if an adolescent who has no history of suicide attempts scores
above 74 on the SPS, the youth would be classified as at risk for suicide and treated accordingly. If an adolescent
who has a history of a suicide attempt scores below 53, the youth would be classified as not at risk for suicide.
The researchers emphasized, however, that although the SPS demonstrated statistically significant validity in
predicting suicide attempts, it is not a perfect predictor. A number of suicide attempts were also recorded for those
with low scores, and therefore a low SPS score does not ensure that an adolescent will not attempt suicide. The SPS
does, however, provide an instrument for accurately identifying adolescents at risk for committing suicide.
In the multiple regression equation, the subscripts following each b and X identify the individual
predictors.
In multiple regression, there is still one criterion (Y), as in simple linear regression, but there are
now multiple predictors. There are a number of different statistics we can use when interpreting
the results of a multiple regression. One of the statistics is analogous to the correlation
coefficient (r) used in simple regression. It is called the multiple correlation coefficient and is
indicated by a capital letter R. R describes the overall relationship between more than one
predictor and a criterion. R is interpreted in a similar fashion to the usual correlation coefficient.
Like any correlation coefficient, R can be subjected to a test of significance. If R is significant, it
indicates that all the predictors in the equation taken together explain a statistically significant
amount of variance in the criterion. However, an even more useful statistic for interpreting the
results of a multiple regression is called the coefficient of multiple determination (R2). Earlier
in this chapter we discussed the coefficient of determination. You will recall that the coefficient
of determination (r2) is simply the square of a correlation coefficient between a single predictor
and a criterion. It is interpreted as the proportion of variance that is shared by the two variables.
Likewise, the coefficient of multiple determination (R2) is the square of the multiple correlation
coefficient. R2 is a statistic that is obtained through multiple regression analysis, which is
interpreted as the total proportion of variance in the criterion variable that is accounted for
by all the predictors in the multiple regression equation.
In multiple regression, we usually expect that all of the included predictors will be correlated
with the criterion—that’s why we chose them as predictors in the first place. However, in most
cases, the predictors will also be correlated with each other as well as with the criterion. This
correlation among the predictor variables in a multiple regression is called multicollinearity and
can create difficulty in interpreting the results. As you know, when variables are correlated, it
indicates that they share something in common. When we have two or more predictors that are
both correlated among themselves and are also correlated with the criterion, we may not know
whether each predictor is accounting for a separate, unique portion of the variance in the
criterion. Sometimes, both predictors may be accounting for the same variance in the criterion. If
this is the case, using two predictors would not provide any more predictive power than simply
using either one by itself. This can complicate the interpretation of the results of a multiple
regression equation. The issue often arises when you need to answer the question of whether
adding an additional test to a test battery is worth the effort and expense. Here is an example.
Suppose a college admissions officer wanted to investigate the degree to which he could predict
students’ 1st-year college GPA (the criterion) using measures of the students’ success in high
school as predictors. One of the predictors that he decides to use is the students’ self-reported
high school GPAs stated on their applications for admission. The other predictor he decides to
use is the students’ GPAs that are reported on their official high school transcripts. You probably
can immediately see that the two predictors would be extremely highly correlated, because they
are both measuring exactly the same thing. As a result, using both predictors in a multiple
regression would not provide any independent predictive information regarding 1st-year college
GPA. Therefore, there would be no reason to include them both as predictors. Anytime we use
multiple predictors to predict a criterion, it is important to evaluate the extent to which they are
predicting unique, nonoverlapping parts of the variance in the criterion. Multiple regression can
help us to make that determination.
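The effect of overlap between predictors can be sketched numerically. With two predictors, R2 can be computed from three correlations: each predictor's correlation with the criterion and the correlation between the two predictors. The function name and every correlation value below are hypothetical, chosen only to resemble the admissions example.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R2 for a criterion regressed on two predictors, from their correlations.

    r_y1, r_y2: correlation of each predictor with the criterion
    r_12: correlation between the two predictors (must not be exactly +1 or -1)
    """
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


# Assumed values: both GPA measures correlate .50 with 1st-year college GPA.
validity = 0.50
gains = []
for r_12 in (0.0, 0.5, 0.9, 0.99):
    r2_both = r_squared_two_predictors(validity, validity, r_12)
    gain = r2_both - validity ** 2      # improvement over using one predictor alone
    gains.append(gain)
    print(r_12, round(r2_both, 4), round(gain, 4))
```

Under these assumptions, the gain from adding the second predictor shrinks steadily as the correlation between the predictors rises. When the two measures correlate .99, as self-reported and transcript GPAs might, the second one adds almost nothing.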
In Greater Depth Box 7.2 explores in more detail the interpretation of multiple regression results.
In Greater Depth Box 7.2 Interpreting Multiple Regression Results
When we interpret the results of a multiple regression analysis, the first thing that we typically look for is whether
the value of R2 is statistically significant. If it is significant, it indicates that all the predictors taken together are able
to predict a significant amount of variance in the criterion. Next we look at the size of R2 because the size tells us
how much variance in the criterion is accounted for simultaneously by all the predictors that were included in the
regression. Finally, we look at which of the b weights (if any) are significant. When a b weight is statistically
significant, this means that the predictor associated with that b weight is explaining a unique, nonredundant amount
of variance in the criterion that isn’t already accounted for by any of the other predictors in the regression.
This ability of multiple regression to provide information on the amount of unique variance a predictor accounts for
in a criterion after the variance accounted for by the other predictors is taken into account is one of its most
important features. We use this information to establish evidence of validity when more than one predictor is used to
predict a criterion. We do this by entering the predictors one at a time in a predetermined order into the regression.
After each predictor is entered into the regression, the total variance accounted for in the criterion (R2) is
recomputed. If R2 significantly increases when a new predictor is entered, there is evidence that the new predictor is
accounting for additional variance in the criterion. If R2 does not significantly increase when the new predictor is
entered, it means that the predictor does not explain any variance in the criterion that has not already been explained by the predictors that have already been entered into the regression. This increment in R2 is called R2 change or,
more simply, R2Δ and is also referred to as incremental validity.
It is important to understand that the change in R2 observed each time a new variable is entered into the regression
will depend on the order we enter the predictors. When the predictors are correlated (which they almost always are),
it will mean that they are both partially explaining the same variance in the criterion. As a result, the predictor
entered first into the regression will be able to account for the largest amount of criterion variance. The next
predictor entered into the regression will be able to account only for that variance in the criterion that wasn’t already
accounted for by the first predictor. If the two predictors are highly correlated, then most of the variance in the
criterion that the second predictor could account for in the criterion would already have been accounted for by the
first predictor. Therefore, the R2Δ for the second predictor would be very low. However, if the order of entry of the
predictors were reversed and the second predictor were entered first into the regression, it would now account for the
larger portion of the variance in the criterion, and the original first predictor would now account for only a little
additional variance. Therefore, the decision on the order that the predictors will be entered into the regression when
investigating incremental validity is critical and must be carefully considered and explained by the researcher. The
conclusions that are reached about the relative importance of the predictors might be very different just because of
the order in which they were entered into the regression.
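The dependence of R2Δ on order of entry can be illustrated with a small sketch. It computes R2 after Step 1 and the R2 change at Step 2 for both possible orders of two correlated predictors. The function names, the labels Test A and Test B, and the three correlations are all hypothetical.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R2 for two predictors, computed from the three pairwise correlations."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


def hierarchical_steps(r_first, r_second, r_between):
    """Return (R2 at Step 1, R2 change at Step 2) for a given order of entry."""
    r2_step1 = r_first ** 2                     # only the first predictor is in the equation
    r2_full = r_squared_two_predictors(r_first, r_second, r_between)
    return r2_step1, r2_full - r2_step1


# Hypothetical validities for Tests A and B and their intercorrelation.
r_a, r_b, r_ab = 0.40, 0.30, 0.70

a_first = hierarchical_steps(r_a, r_b, r_ab)    # Test A entered first, then Test B
b_first = hierarchical_steps(r_b, r_a, r_ab)    # Test B entered first, then Test A

print([round(v, 4) for v in a_first])
print([round(v, 4) for v in b_first])
```

The total R2 is the same in both orders; only the way it is divided between the steps changes. With these assumed values, Test B adds almost nothing when entered second, whereas Test A adds a noticeable increment when it is the one entered second.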
A small example will help clarify what we have explained above. Presume that a human resources (HR) manager
wants to use two well-designed personality tests (call them Test H and Test N) to predict performance in a particular
job. From prior research, she knows that both tests have independently been shown to be valid predictors of job
performance in similar jobs, with validity coefficients of .30. So her thinking is that using both of the tests would be
more predictive than using only one of them. To make sure of this, she gives all employees currently in the job both
personality tests, and also collects the employees’ performance ratings. She analyzes her data using multiple
regression, using the test scores as the predictors and the performance ratings as the criterion. First she enters Test H
into the regression and is pleased to see that R2 is statistically significant for predicting the performance ratings.
Next, she enters Test N into the regression and is surprised to see that the change in R2 that occurs (R2Δ) is not
significant. Test N doesn’t seem to be adding any predictive ability at all. Just to check her results, she repeats the
regression. But this time she enters Test N into the regression first. Now, R2 for Test N is significant, so she
proceeds to add Test H into the equation, and again, R2 does not increase significantly. What the HR manager has
discovered is that both Test H and Test N are explaining the same variance in the criterion. Whichever test is entered
into the regression first is explaining all the variance that can be explained in the criterion by these personality tests,
leaving nothing for the second test to explain. Because no incremental validity would be gained by using
the second test, there is no reason to include it when selecting employees. The HR manager will
need to decide which of the two tests she wants to include based on some other factor, such as cost.
We have a final observation about the example above. We discussed only the R2 value that resulted from the
regression, not the b weights associated with each predictor. This is because our interest was in determining the
incremental validity of the predictors (the tests). Earlier in this section, we said that when a b weight is statistically
significant, it means that the predictor associated with that b weight is explaining a unique, nonredundant amount of
variance in the criterion after all of the variance accounted for in the criterion by every other predictor is taken into
account. In the example above, although R2 was significant, neither b weight would have been significant. This is
because b weights reflect only the amount of variance in the criterion that is not already explained by any other
predictor in the regression. In our example, Test H and Test N accounted for the same variance in the criterion. That
is, neither test was accounting for any unique variance over and above what the other test was already accounting
for. Therefore, although the overall regression was able to account for a significant amount of variance in the
criterion, the variance that each test was individually accounting for was redundant. As a result, neither b weight
would have been statistically significant.
The study described next is a good example of how researchers use multiple regression to gather evidence of
incremental validity when using more than one predictor.
Chibnall and Detrick (2003) published a study that examined the usefulness of three personality inventories—the
Minnesota Multiphasic Personality Inventory–2 (MMPI-2), the Inwald Personality Inventory (IPI; an established
police officer screening test), and the Revised NEO Personality Inventory (NEO PI-R)—for predicting the
performance of police officers. They administered the inventories to 79 police recruits and compared the test scores
with two criteria: academic performance and physical performance. Table 7.2 shows the outcome of the study for
the academic performance criterion.
Table 7.2 ■ Multiple Regression Model for Predicting Academic Performance of Police Recruits (R2 = .55)
Step 1: Demographic Variables | Step 2: IPI Scales | Step 3: MMPI-2 Scales | Step 4: NEO PI-R Scales
Recruit class | Trouble law | Depression | Assertiveness
Marital status | Antisocial | Hypomania | Ideas
Race | Obsessiveness | | Depression
R2Δ = .20 | R2Δ = .16 | R2Δ = .08 | R2Δ = .11
Source: Reprinted with permission from J. T. Chibnall and P. Detrick. (2003). “The NEO PI-R, Inwald Personality Inventory, and MMPI-2 in the
prediction of police academy performance: A case for incremental validity.” American Journal of Criminal Justice, 27(2), 233–248.
Note: Step refers to the order that a predictor is entered into the regression equation for predicting academic performance. Step 1 is the first
predictor entered, Step 2 is the second, and so on. The predictors are the individual demographic characteristics or the subscales that reached
significance. IPI = Inwald Personality Inventory, MMPI-2 = Minnesota Multiphasic Personality Inventory–2, NEO PI-R = Revised NEO
Personality Inventory. R2Δ is the proportion of incremental variance in academic performance contributed by each predictor when entered into
the equation in the order shown.
When the researchers entered the demographic variables of recruit class, marital status, and race into the regression
first, they jointly accounted for 20% of the variance in academic performance. In the second step, the researchers
entered the test scores from three IPI scales. Table 7.2 shows the contribution of the IPI scales that contributed
significantly to the prediction. Together, the three scales of the IPI contributed an additional 16% of the variance in
the criterion (R2Δ). In the third step, the researchers entered two scales of the MMPI-2, and together they accounted
for an additional 8% of the variance in the criterion. Finally, the researchers entered three scales of the NEO PI-R,
and together they accounted for another 11% of the variance. Altogether, the demographic characteristics and the
three inventories accounted for 55% of the variance in academic performance (R2).
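The arithmetic behind the overall R2 can be sketched as a running total of the R2Δ values reported in Table 7.2. The step labels below are shortened versions of the column headings, and the variable names are our own.

```python
# R2 change at each step, as reported in Table 7.2
steps = [
    ("Step 1: demographic variables", 0.20),
    ("Step 2: IPI scales", 0.16),
    ("Step 3: MMPI-2 scales", 0.08),
    ("Step 4: NEO PI-R scales", 0.11),
]

cumulative_r2 = 0.0
running_totals = []
for label, r2_change in steps:
    cumulative_r2 += r2_change                      # R2 after this step is entered
    running_totals.append(round(cumulative_r2, 2))
    print(label, r2_change, round(cumulative_r2, 2))
```

The running total moves from .20 to .36 to .44 and finally to .55, the overall R2 reported for the model.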
Physical performance was not predicted by demographic characteristics or most of the other tests included in the
study. Only three dimensions of the NEO PI-R accounted for a significant amount of variance in physical
performance (20%).
Chapter Summary
Evidence of validity based on test–criteria relations—the extent to which a test is related to independent behavior or
events—is one of the major methods for obtaining evidence of test validity. The usual method for demonstrating this
evidence is to correlate scores on the test with a measure of the behavior we wish to predict. This measure of
independent behavior or performance is called the criterion.
Evidence of validity based on test–criteria relations depends on evidence that the scores on the test correlate
significantly with an independent criterion—a standard used to measure some characteristic of an individual, such as
a person’s performance, attitude, or motivation. Criteria may be objective or subjective, but they must be reliable
and valid. There are two methods for demonstrating evidence of validity based on test–criteria relations: predictive
and concurrent.
There is a strong relationship between reliability and validity. If a test is not reliable, it will not correlate well with
any criterion due to random measurement error. The resulting reduction of the validity coefficient over what it
would have been if there were less measurement error in the predictor is called attenuation. There are statistical
procedures that can be used to correct for attenuation, but their use can be controversial.
We use correlation to describe the relationship between a psychological test and a criterion. In this case, the
correlation coefficient is referred to as the validity coefficient. Psychologists interpret validity coefficients using
tests of significance and the coefficient of determination. Statistical artifacts such as unreliability and restriction of range can
result in a reduction of the observed validity coefficient.
Either a linear regression equation or a multiple regression equation can be used to predict criterion scores from test
scores. Predictions of success or failure on the criterion enable test users to use test scores for making decisions
about hiring. When we use multiple regression we have to be aware that the predictors are likely to be correlated
with one another and that can complicate the interpretation of the results.
Key Concepts
After completing your study of this chapter, you should be able to define each of the following
terms. These terms are bolded in the text of this chapter and defined in the Glossary.
• attenuation due to unreliability
• b weight
• coefficient of determination
• coefficient of multiple determination
• concurrent evidence of validity
• correction for attenuation
• criterion
• criterion contamination
• criterion-related validity
• incremental validity
• intercept
• linear regression
• multiple regression
• objective criterion
• operational validity
• peers
• predictive evidence of validity
• restriction of range
• slope
• subjective criterion
• test of significance
• validity coefficient
Critical Thinking Questions
The following are some critical thinking questions to support the learning objectives for this chapter.
Learning Objectives Critical Thinking Questions
Identify evidence of validity of a test based on
its relationships to external criteria and
describe two methods for obtaining this
evidence.
• What is the benefit of making a distinction between the
predictive and concurrent methods of establishing evidence of
validity based on test–criterion relationships?
• What do you think the impact would be on the results of a
predictive validity study conducted in an organization if an
unexpected layoff of personnel occurred before the study was
completed?
Read and interpret validity studies.
• If you were a test publisher and a client who bought your test
reported that a concurrent validity study they conducted didn’t
show evidence of validity, what are some of the questions you
would want to ask them to help understand their results?
• If you were asked to do an in-class presentation on the topic
of test validity based on a test’s relationship with other
variables, what are some of the criteria you would want to be
included in your professor’s evaluation of your presentation?
Why?
Discuss how restriction of range occurs and
its consequences.
• Under what circumstances would restriction of range not be
of concern when conducting a validity study based on test-
criterion relationships?
• Do you think that restriction of range could also be a
problem when estimating the reliability of a test? Explain.
Describe the differences between evidence of
validity based on test content and evidence
based on relationships with other variables.
• Do you think it would be possible for a test that had evidence
for validity based on content to not show evidence of validity
based on test-criterion relationships? Why or why not?
Describe the difference between
reliability/precision and validity.
• How would you explain, in your own words, how reliability
affects validity?
• Why is the following statement true: “A test can be reliable
but not valid, but it can’t be valid if it is not reliable”?
Define and give examples of objective and
subjective criteria, and explain why criteria
must be reliable and valid.
• If you had to develop an objective criterion to use in a test
validity study, what steps would you take to demonstrate that the
criterion itself was valid?
Interpret a validity coefficient, calculate the
coefficient of determination, and conduct a
test of significance for a validity coefficient.
• If test A was shown to have a validity coefficient of .4, and
test B had one of .3, how could you use the concept of the
coefficient of determination to quantify the differences between
the validities of the two tests?
• If you were tutoring a classmate on the topic of validity, how
would you explain the meaning of a statistically non-significant
validity coefficient?
Understand why measured validity will be
reduced by unreliability in the predictor or
criterion measure and what statistical
correction can be applied to adjust for this
reduction.
• Job performance reviews are usually a subjective criterion
that has been shown to have a fairly low reliability
coefficient of about .52. Nonetheless, they are a frequently used
criterion for validating tests used for employee selection. Why might
this fact be of concern in these types of validity studies?
• How might the conclusions drawn from a validity study be
affected if the correction for attenuation is applied to both the
predictor measure and the criterion measure instead of only to
the criterion measure as is usually recommended?
Explain the concept of operational or “true”
validity and how it is calculated.
• Why is the concept of operational validity sometimes called
“true validity”? In what sense is the validity “true”?
Explain the concept of regression, calculate
and interpret a linear regression formula, and
interpret a multiple regression formula.
• What are some of the characteristics that are shared between
linear regression and multiple regression? What are some of the
differences?
• Why do you think that it is important for a researcher who is
trying to develop a battery of tests to predict a psychological
disorder to understand the concept of incremental validity?
• A linear regression equation includes a number of important
elements that help us understand the relationship between a
predictor variable and a criterion variable. What are they and
how does each one help us understand the relationship?
• Some people may think that if we want to predict an outcome
using tests, the more tests we use, the better the prediction will
be. Why is this statement often false? When might it be true?
Descriptions of Images and Figures
The figure has two circles, which are connected by downward vertical arrows to two boxes. The
circles are adjacent to each other and so are the boxes.
The circle on the left is labeled Predictor Construct (Conscientiousness). An arrow from the
circle is pointing to the box below it, which is labeled Predictor Measure (Personality Test).
The circle on the right is labeled Criterion Construct (Job Performance). An arrow from the
circle is pointing to a box below it, which is labeled Criterion Measure (Performance Appraisal).
A horizontal arrow pointing from Predictor Construct to Criterion Construct is labeled True
Score Validity (construct level).
A diagonal arrow pointing from Predictor Measure to Criterion Construct is labeled Operational
(true) Validity.
A horizontal arrow pointing from Predictor Measure to Criterion Measure is labeled Observed
Score Validity (measurement level).