
Author: Miller, L. A., & Lovler, R. L. (2020). In Foundations of psychological testing: A practical approach (6th ed., pp. 54–84). SAGE.

7 How Do We Gather Evidence of Validity Based on Test–Criterion Relationships?

Learning Objectives

After completing your study of this chapter, you should be able to do the following:

• Identify evidence of validity of a test based on its relationships to external criteria, and describe two methods for obtaining this evidence.

• Read and interpret validity studies.

• Discuss how restriction of range occurs and its consequences.

• Describe the differences between evidence of validity based on test content and evidence based on relationships with external criteria.

• Describe the difference between reliability/precision and validity.

• Define and give examples of objective and subjective criteria, and explain why criteria must be reliable and valid.

• Interpret a validity coefficient, calculate the coefficient of determination, and conduct a test of significance for a validity coefficient.

• Understand why measured validity will be reduced by unreliability in the predictor or criterion measure and what statistical correction can be applied to adjust for this reduction.

• Explain the concept of operational or “true” validity and how it is calculated.

• Explain the concept of regression, calculate and interpret a linear regression formula, and interpret a multiple regression formula.

“The graduate school I’m applying to says they won’t accept anyone who scores less than 1,000 on the GRE. How

did they decide that 1,000 is the magic number?”

“Before we married, my fiancée and I went to a premarital counselor. She gave us a test that predicted how happy

our marriage would be.”

“My company uses a test for hiring salespeople to work as telemarketers. The test is designed for people selling life

insurance and automobiles. Is this a good test for hiring telemarketers?”

Have you ever wondered how psychological tests really work? Can we be comfortable using an

individual’s answers to test questions to make decisions about hiring him for a job or admitting

her to college? Can mental disorders really be diagnosed using scores on standard

questionnaires?

Psychologists who use tests for decision making are constantly asking these questions and others

like them. When psychologists use test scores for making decisions that affect individual lives,

they, as well as the public, want substantial evidence that the correct decisions are being made.

This chapter describes the processes that psychologists use to ensure that tests perform properly

when they are used for making predictions and decisions. We begin by discussing the concept of

validity evidence based on a test’s relationships to other variables, specifically external criteria.

As we discussed in the previous chapter, “How Do We Gather Evidence of Validity Based on the

Content of a Test?” this evidence has traditionally been called criterion-related evidence of

validity. We also discuss the importance of selecting a valid criterion measure, how to evaluate

validity coefficients, and the statistical processes that provide evidence that a test can be used for

making predictions.

What Is Evidence of Validity Based on Test–Criterion Relationships?

In the last chapter, we introduced you to the concept of evidence of validity based on a test’s

relationship with other variables. We said that one method for obtaining evidence is to

investigate how well the test scores correlate with observed behaviors or events. When test

scores correlate with specific behaviors, attitudes, or events, we can confirm that there is

evidence of validity. In other words, the test scores may be used to predict

those specific behaviors, attitudes, or events. But as you recall, we cannot use such evidence to

make an overall statement that the test is valid. We also said that this evidence has traditionally

been referred to as criterion-related validity (a term that we use occasionally in this chapter, as it

is still widely used by testing practitioners).

For example, when you apply for a job, you might be asked to take a test designed to predict how

well you will perform the job. If the job is clerical and the test really predicts how well you will

perform, your test score will be related to your skill in performing clerical duties such as word

processing and filing. To provide evidence that the test predicts clerical performance,

psychologists correlate test scores with a measure of each individual’s performance on clerical

tasks, such as supervisors’ ratings. The measure of performance that we correlate with test scores

is called the criterion. If higher test scores relate to higher performance ratings, the test shows

evidence of validity based on the relationship between these two variables, traditionally referred

to as criterion-related validity. Educators use admissions tests to forecast how successful an

applicant will be in college or graduate school. The SAT and the Graduate Record Examination

(GRE) are admissions tests used by colleges. The criterion of success in college is often the

student’s first-year grade point average (GPA). In a clinical setting, psychologists often use tests

to diagnose mental disorders. In this case, the criterion is the diagnoses made by several

psychologists or psychiatrists independent of the test. Researchers then correlate the diagnoses

with the test scores to establish evidence of validity.

Methods for Providing Evidence of Validity Based on Test–Criterion Relationships

There are two methods for demonstrating evidence of validity based on test–criterion

relationships: the predictive method and the concurrent method. This section defines and gives

examples of each method.

The Predictive Method

When it is important to show a relationship between test scores and a future behavior,

researchers use the predictive method to establish evidence of validity. In this case, a large group

of people take the test (the predictor), and their scores are held for a predetermined time interval,

such as 6 months. After the time interval passes, researchers collect a measure of some behavior,

for example, a rating or other measure of performance, on the same people (the criterion). Then

researchers correlate the test scores with the criterion scores. If the test scores and the criterion

scores have a strong relationship, the test has demonstrated predictive evidence of validity.

Researchers at Brigham Young University used the predictive method to demonstrate evidence

of validity of the PREParation for Marriage Questionnaire (PREP-M). For Your Information Box

7.1 describes the study they conducted.

Psychologists might use the predictive method in an organizational setting to establish evidence

of validity for an employment test. To do so, they administer an employment test (predictor) to

candidates for a job. Researchers file test scores in a secure place, and the company does not use

the scores for making hiring decisions. The company makes hiring decisions based on other

criteria, such as interviews or different tests. After a predetermined time interval, usually 3 to 6

months, supervisors evaluate the new hires on how well they perform the job (the criterion). To

determine whether the test scores predict the candidates who were successful and unsuccessful,

researchers correlate the test scores with the ratings of job performance. The resulting correlation

coefficient is called the validity coefficient, a statistic used to infer the strength of the evidence

of validity that the test scores might demonstrate in predicting job performance.
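
The calculation behind a validity coefficient can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the test scores and supervisor ratings are made-up numbers for ten hypothetical new hires, and the helper name pearson_r is ours, not part of any testing package. It computes the Pearson correlation between the predictor (test scores filed at hiring) and the criterion (ratings collected later).

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # With no variation in one variable, the correlation cannot be computed.
        raise ValueError("correlation is undefined when a variable has no variance")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data (not from any study): employment test scores filed at
# hiring, and supervisor ratings of the same ten people some months later.
test_scores = [52, 61, 47, 70, 58, 66, 43, 75, 55, 63]
supervisor_ratings = [3.1, 3.8, 2.9, 4.2, 3.0, 4.0, 2.5, 4.6, 3.4, 3.5]

validity_coefficient = pearson_r(test_scores, supervisor_ratings)
print(f"validity coefficient r = {validity_coefficient:.2f}")
```

These toy numbers are deliberately tidy, so the coefficient they produce is much larger than the values of .20 to .44 reported for the PREP-M in For Your Information Box 7.1.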

For Your Information Box 7.1 Evidence of Validity Based on Test–Criterion Relationships of a Premarital Assessment Instrument

In 1991, researchers at Brigham Young University (Holman, Larson, & Harmer, 1994) conducted a study to

determine the evidence of validity based on test–criterion relationships of the PREParation for Marriage

Questionnaire (PREP-M; Holman, Busby, & Larson, 1989). Counselors use the PREP-M with engaged couples who

are participating in premarital courses or counseling. The PREP-M has 206 questions that provide information on

couples’ shared values, readiness for marriage, background, and home environment. The researchers contacted 103

married couples who had taken the PREP-M a year earlier as engaged couples and asked them about their marital

satisfaction and stability.

The researchers predicted that those couples who had high scores on the PREP-M would express high satisfaction

with their marriages. The researchers used two criteria to test their hypothesis. First, they drew questions from the

Marital Comparison Level Index (Sabatelli, 1984) and the Marital Instability Scale (Booth, Johnson, & Edwards,

1983) to construct a criterion that measured each couple’s level of marital satisfaction and marital stability. The

questionnaire showed internal consistency of .83. The researchers also classified each couple as “married satisfied,”

“married dissatisfied,” or “canceled/delayed” and as “married stable,” “married unstable,” or “canceled/delayed.”

These classifications provided a second criterion.

The researchers correlated the couples’ scores on the PREP-M with their scores on the criterion questionnaire. The

husbands’ scores on the PREP-M correlated at .44 (p < .01) with questions on marital satisfaction and at .34 (p <

.01) with questions on marital stability. The wives’ scores on the PREP-M were correlated with the same questions

at .25 (p < .01) and .20 (p < .05), respectively. These correlations show that the PREP-M is a moderate to strong

predictor of marital satisfaction and stability—good evidence of the validity of the PREP-M. (Later in this chapter,

we discuss the size of correlation coefficients needed to establish evidence of validity.)

In addition, the researchers compared the mean scores of those husbands and wives classified as married satisfied,

married dissatisfied, or canceled/delayed and those classified as married stable, married unstable, or

canceled/delayed. As predicted, those who were married satisfied or married stable scored higher on the PREP-M

than did those in the other two respective categories. In practical terms, these analyses show that counselors can use

scores on the PREP-M to make predictions about how satisfying and stable a marriage will be.
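
One way to get a feel for the size of the correlations reported in this box is to square them. The squared validity coefficient is the coefficient of determination, the proportion of variance the test and the criterion share. The short Python sketch below applies that arithmetic to the four correlations reported above; the dictionary labels are ours and are used only for display.

```python
# Correlations reported in this box between PREP-M scores and the
# criterion questionnaire (Holman, Larson, & Harmer, 1994).
reported_r = {
    "husbands, marital satisfaction": 0.44,
    "husbands, marital stability": 0.34,
    "wives, marital satisfaction": 0.25,
    "wives, marital stability": 0.20,
}

# Coefficient of determination: the squared validity coefficient.
variance_shared = {label: r ** 2 for label, r in reported_r.items()}

for label, r in reported_r.items():
    print(f"{label:32s} r = {r:.2f}   r squared = {variance_shared[label]:.2f}")
```

Squared, the coefficients range from .04 to about .19, so PREP-M scores share roughly 4% to 19% of their variance with the satisfaction and stability questions.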

To get the best measure of validity, everyone who took the test would need to be hired so that all

test scores could be correlated with a measure of job performance (something that is not

usually practical). This is because it is desirable to get the widest range of test scores

possible (including the very low ones) to understand fully how all the test scores relate to job

performance. Therefore, gathering predictive evidence of validity can present problems for some

organizations because it is important that everyone who took the test is also measured on the

criterion. Few organizations can hire everyone who applies regardless of qualifications,

because there are usually more applicants than available positions. Also, organizations

frequently use some other selection tool

such as an interview to make hiring decisions, and typically, only people who do well on the

interview will be hired. Therefore, even predictive studies in organizations may only have access

to the scores of a portion of the candidates who applied for the job. Because those actually hired

are likely to be the higher performers, a restriction of range in the distribution of test scores is

created. In other words, if the test is a valid predictor of job performance and the other selection

tools that are used to make a hiring decision are also valid predictors, then people with lower

scores on the test will be less likely to be hired. This causes the range of test scores to be reduced

or restricted to those who scored relatively higher. Because a validity study conducted on these

data will not have access to the full range of test scores, the validity coefficient calculated only

from this restricted group is likely to be lower than if all candidates had been hired and included

in the study.

Why would the resulting validity coefficient from a range-restricted group be lower than it would

be if the entire group were available to measure? Think of it like this: The worst case of restricted

range would be if everyone obtained exactly the same score on the test (similar to what would

happen if you hired only those people who made a perfect score on the test). If this situation

occurred, the correlation between the test scores and any criterion would effectively be zero (strictly, it is undefined because the scores do not vary). This is

because if the test scores do not vary from person to person, high performers and lower

performers would all have exactly the same test score. We cannot distinguish high performers

from low performers when everybody gets the same score, and therefore these test scores cannot

be predictive of job performance. Using the full range of test scores enables you to obtain a more

accurate validity coefficient, which usually will be higher than the coefficient you obtained using

the restricted range of scores. However, a correlation coefficient can be statistically adjusted for

restriction of range, which, when used properly, can provide a corrected estimate of the validity

coefficient of the employment test in the unrestricted population. We have much more to say

about corrections to measured validity coefficients later in the chapter when we cover the

relationship between reliability and validity. These problems exist in educational and clinical

settings as well because individuals might not be admitted to an institution or might leave during

the predictive study. For Your Information Box 7.2 describes a validation study that might have

failed to find evidence of validity because of restriction of range.
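
The effect of restriction of range, and the idea of statistically adjusting for it, can be illustrated with a small simulation. The Python sketch below rests on several assumptions that are ours rather than the chapter's: the applicant pool is simulated, test scores and performance are built to correlate about .50 in the full pool, only the top 30% of test scorers are "hired," and the adjustment shown is the commonly used formula for direct restriction on the predictor (often called Thorndike's Case 2). It is one possible correction, not necessarily the one a given study would use.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sample_sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


random.seed(7)  # fixed seed so repeated runs give the same numbers

# Hypothetical applicant pool: test scores and later job performance,
# simulated so that the two correlate about .50 in the full pool.
TRUE_R = 0.50
applicants = []
for _ in range(2000):
    z_test = random.gauss(0, 1)
    z_perf = TRUE_R * z_test + math.sqrt(1 - TRUE_R ** 2) * random.gauss(0, 1)
    applicants.append((50 + 10 * z_test, z_perf))

# Assumption: only the top 30% of test scorers are hired and later rated.
applicants.sort(key=lambda pair: pair[0], reverse=True)
hired = applicants[:600]

all_tests = [t for t, _ in applicants]
all_perf = [p for _, p in applicants]
hired_tests = [t for t, _ in hired]
hired_perf = [p for _, p in hired]

r_full = pearson_r(all_tests, all_perf)
r_hired = pearson_r(hired_tests, hired_perf)

# Correction for direct range restriction on the predictor: rescale by the
# ratio of the unrestricted to the restricted standard deviation of test scores.
sd_ratio = sample_sd(all_tests) / sample_sd(hired_tests)
r_corrected = (r_hired * sd_ratio) / math.sqrt(
    1 - r_hired ** 2 + (r_hired ** 2) * (sd_ratio ** 2)
)

print(f"full applicant pool:      r = {r_full:.2f}")
print(f"hired (restricted) group: r = {r_hired:.2f}")
print(f"corrected estimate:       r = {r_corrected:.2f}")
```

In a run like this, the coefficient computed from the hired group alone falls below the coefficient for the full pool, and the corrected estimate moves back toward it. The correction requires knowing the standard deviation of test scores in the unrestricted applicant group, which a real study would need to obtain from applicant records or test norms.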

The Concurrent Method

The method of demonstrating concurrent evidence of validity based on test–criteria

relationships is an alternative to the predictive method. In the concurrent method, test

administration and criterion measurement happen at approximately the same time. This method

does not involve prediction. Instead, it provides information about the present and the status quo

(Cascio, 1991). A study by Maisto and colleagues (2011), described in For Your Information

Box 7.3, is a good example of a study designed to assess concurrent (as well as predictive)

evidence of validity for an instrument used in a clinical setting.

The concurrent method involves administering two measures, the test and a second measure of

the attribute, to the same group of individuals at as close to the same point in time as possible.

For example, the test might be a paper-and-pencil measure of American literature, and the

second measure might be a grade in an American literature course. Usually, the first measure is

the test being validated, and the criterion is another type of measure of performance such as a

rating, grade, or diagnosis. It is very important that the criterion test itself be reliable and valid

(we discuss this further later in this chapter). The researchers then correlate the scores on the two

measures. If the scores correlate, the test scores demonstrate evidence of validity.

In organizational settings, researchers often use concurrent studies as alternatives to predictive

studies because of the difficulties of using a predictive design that we discussed earlier. In this

setting, the process is to administer the test to employees currently in the position for which the

test is being considered as a selection tool and then to collect criterion data on the same people

(such as performance appraisal data). In some cases, the criterion data are specifically designed

to be used in the concurrent study, while in other cases recent, existing data are used. Then the

test scores are correlated with the criterion data and the validity coefficient is calculated.

For Your Information Box 7.2 Did Restriction of Range Decrease the Validity Coefficient?

Does a student’s academic self-concept—how the student views himself or herself in the role of a student—affect

the student’s academic performance? Michael and Smith (1976) developed the Dimensions of Self-Concept

(DOSC), a self-concept measure that emphasizes school-related activities and that has five subscales that measure

level of aspiration, anxiety, academic interest and satisfaction, leadership and initiative, and identification versus

alienation.

Researchers at the University of Southern California (Gribbons, Tobey, & Michael, 1995) examined the evidence of

validity based on test–criterion relationships of the DOSC by correlating DOSC test scores with GPA. They selected

176 new undergraduates from two programs for students considered at risk for academic difficulties. The students

came from a variety of ethnic backgrounds, and 57% were men.

At the beginning of the semester, the researchers administered the DOSC to the students following the guidelines

described in the DOSC manual (Michael, Smith, & Michael, 1989). At the end of the semester, they obtained each

student’s first-semester GPA from university records. When they analyzed the data for evidence of

reliability/precision and validity, the DOSC showed high internal consistency, but scores on the DOSC did not

predict GPA.

Did something go wrong? One conclusion is that self-concept as measured by the DOSC is unrelated to GPA.

However, if the study or the measures were somehow flawed, the predictive evidence of validity of the DOSC might

have gone undetected. The researchers suggested that perhaps academic self-concept lacks stability during students’

first semester. Although the internal consistency of the DOSC was established, the researchers did not measure the

test–retest reliability/precision of the test. Therefore, this possibility cannot be ruled out. The researchers also

suggested that GPA might be an unreliable criterion.

Could restriction of range have caused the validity of the DOSC to go undetected? This is a distinct possibility for

two reasons. First, for this study the researchers chose only those students who were at risk for experiencing

academic difficulties. Because the unrestricted population of students also contains those who are expected to

succeed, the researchers might have restricted the range of both the test and the criterion. Second, the students in the

study enrolled in programs to help them become successful academically. Therefore, participating in the programs

might have enhanced the students’ academic self-concept.

This study demonstrates two pitfalls that researchers designing predictive studies must avoid. Researchers must be

careful to include in their studies participants who represent the entire possible range of performance on both the test

and the criterion. In addition, they must design predictive studies so that participants are unlikely to change over the

course of the study in ways that affect the abilities or traits that are being measured.

Barrett, Phillips, and Alexander (1981) compared the two methods for determining evidence of

validity based on predictor–criteria relationships in an organizational setting using cognitive

ability tests. They found that the two methods provide similar results. However, this may not

always be the case. For Your Information Box 7.3 describes a recent example of the two

approaches producing different results.

Selecting a Criterion

A criterion is an evaluative standard that researchers use to measure outcomes such as

performance, attitude, or motivation. Evidence of validity derived from test–criteria relationships

provides evidence that the test relates to some behavior or event that is independent of the

psychological test. As you recall from For Your Information Box 7.1, the researchers at Brigham

Young University constructed two criteria—a questionnaire and classifications on marital

satisfaction and marital stability—to demonstrate evidence of validity of the PREP-M.

For Your Information Box 7.3 Developing Concurrent and Predictive Evidence of Validity for Three Measures of Readiness to Change Alcohol Use in Adolescents

Maisto and his colleagues (2011) were interested in motivation or readiness to change and how it relates to alcohol

use in adolescents. They felt that if a good measure of this construct could be identified, it would help in the design

of clinical interventions for the treatment of alcohol abuse. They noted that there was little empirical evidence to

support the validity of any of the existing measures of motivation or readiness to change, especially in an adolescent

population.

The researchers identified three existing measures of “readiness to change” frequently used in substance abuse

contexts. The first was the Stages of Change and Treatment Eagerness Scale (SOCRATES). This tool was designed

to measure two dimensions of readiness to change called Problem Recognition and Taking Steps concerning alcohol

use (Maisto, Chung, Cornelius, & Martin, 2003). The second measure they evaluated was the Readiness Ruler

(Center on Alcoholism, Substance Abuse and Addictions, 1995). This measure is a simple questionnaire that asks

respondents to rate on a scale of 1 to 10 how ready they are to change their alcohol use behavior using anchors such

as not ready to change, unsure, and trying to change. The third measure they investigated is called the Staging

Algorithm (Prochaska, DiClemente, & Norcross, 1992), which places people into five stages based on their

readiness to change: pre-contemplation, contemplation, preparation, action, and maintenance.

The research question was whether these instruments would show concurrent and/or predictive evidence of validity.

That is, would high scores on the readiness-to-change instruments be associated with lower alcohol consumption

reported when both measures were taken at the same point in time (concurrent evidence), and would high scores on

the instruments taken at one point in time predict lower alcohol consumption measured at a later point in time

(predictive evidence)?

The participants were adolescents aged 14 to 18 years who were recruited at their first treatment session from seven

different treatment programs for adolescent substance users. Two criteria were used for the study. The first was the

average percentage of days that the participants reported being abstinent from alcohol (PDA). The second was the

average number of drinks they consumed per drinking day (DDD). These criteria were measured three times during

the study—on the first day of treatment (concerning the previous 30 days) and at 6-month and 12-month follow-up

sessions. The three “readiness-to-change” measures were filled out by the participants each time.

Concurrent evidence of validity was gathered by correlating the scores on each readiness-to-change instrument with

the average alcohol consumption criteria reported by the participants at the same time. The correlation between

baseline alcohol consumption and the initial readiness-to-change measure taken at the same time for PDA was positive

and statistically significant for the Readiness Ruler (.35), the Staging Algorithm (.39), and the SOCRATES Taking

Steps dimension (.22). Higher scores on each of the readiness measures were associated with a larger percentage of

days that the participants reported being abstinent from alcohol. Likewise, higher scores on the readiness

instruments were significantly associated with lower numbers on the DDD measure. The correlations were –.34 for

the Readiness Ruler, –.40 for the Staging Algorithm, and –.22 for the SOCRATES Taking Steps dimension. The

scores from the SOCRATES Problem Recognition measure were not correlated with either PDA or DDD.

Therefore, all the instruments with the exception of SOCRATES Problem Recognition demonstrated evidence of

concurrent validity. A similar pattern of results was observed when data collected at the 6-month follow-up were

analyzed.

Predictive evidence of validity was gathered using a statistical technique, discussed in this chapter, called multiple

regression. This technique can be applied to evaluate the relationship between a criterion variable and more than one

predictor variable. While the researchers in this study were interested only in evaluating how well the readiness-to-

change scores predicted later alcohol consumption (the criterion variable), they recognized that there were other

variables present in the study that might also be related to alcohol consumption. Some of these other variables were

age, gender, race, and how much alcohol each participant reported consuming at the beginning of the study. By

using multiple regression, the researchers were able to statistically “control for” the effects of these other variables.

Then they could estimate how well the readiness-to-change measures by themselves predicted future alcohol

consumption. This is called incremental validity because it shows how much additional variance is accounted for by

the readiness-to-change measures alone, over and above the variance accounted for by the other variables used in the

regression.
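
The logic of incremental validity can be sketched numerically. The Python example below is a simplified illustration, not the analysis Maisto and colleagues ran: it uses made-up scores for ten hypothetical participants, it controls for only one variable (baseline drinking) rather than the several used in the study, and the function names are ours. It computes the proportion of variance in later drinking explained by baseline drinking alone, then by baseline drinking plus a readiness score, and reports the difference.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_squared_two_predictors(y, x1, x2):
    """R squared for regressing y on x1 and x2, from the pairwise correlations."""
    r_y1 = pearson_r(y, x1)
    r_y2 = pearson_r(y, x2)
    r_12 = pearson_r(x1, x2)
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


# Hypothetical data for ten participants (not from the study).
baseline_drinks = [6, 4, 8, 3, 7, 5, 9, 2, 6, 4]   # control variable
readiness_score = [3, 7, 2, 8, 5, 6, 1, 9, 6, 4]   # predictor of interest
later_drinks = [5, 2, 7, 1, 5, 3, 8, 1, 3, 4]      # criterion at follow-up

# Step 1: variance in the criterion explained by the control variable alone.
r2_control_only = pearson_r(later_drinks, baseline_drinks) ** 2

# Step 2: variance explained by the control variable plus the readiness score.
r2_full = r_squared_two_predictors(later_drinks, baseline_drinks, readiness_score)

# Incremental validity: the additional variance the readiness score explains.
incremental = r2_full - r2_control_only

print(f"R squared, baseline only:        {r2_control_only:.3f}")
print(f"R squared, baseline + readiness: {r2_full:.3f}")
print(f"increment due to readiness:      {incremental:.3f}")
```

The two-predictor shortcut used here works only for exactly two predictors. With more control variables, as in the actual study, the same comparison of R squared values would be made with a general multiple regression routine.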

The researchers performed two regressions. First, the initial readiness-to-change scores on each instrument were

used to predict alcohol consumption after 6 months of treatment. Then the readiness-to-change scores taken at 6

months of treatment were used to predict alcohol consumption after 12 months of treatment. The results showed that

only the Readiness Ruler had significant predictive evidence of validity for both measures of alcohol consumption

(PDA and DDD) at 6 months and 12 months of treatment.

The interesting finding in this study is that while all the measures showed concurrent evidence of validity, only the

Readiness Ruler showed both concurrent and predictive evidence. There may be a number of plausible explanations

for these seemingly contradictory results. Can you think of some of those reasons?

In a business setting, employers use pre-employment tests to predict how well an applicant is

likely to perform a job. In this case, supervisors’ ratings of job performance can serve as a

criterion that represents performance on the job. Other criteria that represent job performance

include accidents on the job, attendance or absenteeism, disciplinary problems, training

performance, and ratings by peers—other employees at the work site. None of these measures

can represent job performance perfectly, but each provides information on important

characteristics of job performance.

Objective and Subjective Criteria

Criteria for job performance fall into two categories: objective and subjective. An objective

criterion is one that is observable and measurable, such as the number of accidents on the job,

the number of days absent, or the number of disciplinary problems in a month. A subjective

criterion is based on a person’s judgment. Supervisor and peer ratings are examples of

subjective criteria.

Each has advantages and disadvantages. Well-defined objective criteria contain less error

because they are usually tallies of observable events or outcomes. Their scope, however, is often

quite narrow. For instance, dollar volume of sales is an objective criterion that might be used to

measure a person’s sales ability. This number is easily calculated, and there is little chance of

disagreement on its numerical value. It does not, however, take into account a person’s

motivation or the availability of customers. On the other hand, a supervisor’s ratings of a

person’s sales ability may provide more information on motivation, but in turn ratings are based

on judgment and might be biased or based on information not related to sales ability, such as

expectations about race or gender. Table 7.1 lists a number of criteria used in educational,

clinical, and organizational settings.

Does the Criterion Measure What It Is Supposed to Measure?

The concept of validity evidence based on content (addressed in the prior chapter) also applies to

criteria. Criteria must be representative of the events they are supposed to measure. Criterion

scores have evidence of validity to the extent that they match or represent the events in question.

Therefore, a criterion of sales ability must be representative of the entire testing universe of sales

ability. Because there is more to selling than just having the highest dollar volume of sales,

several objective criteria might be used to represent the entire testing universe of sales ability.

For instance, we might add the number of sales calls made each month to measure motivation

and add the size of the target population to measure customer availability.

Table 7.1 ■ Common Criteria

Setting and criterion             Objective   Subjective
Educational settings
  Grade point average (GPA)           X
  Withdrawal or dismissal             X
  Teacher’s recommendations                       X
Clinical settings
  Diagnosis                                       X
  Behavioral observation              X
  Self-report                                     X
Organizational settings
  Units produced                      X
  Number of errors                    X
  Ratings of performance                          X

Subjective measures such as ratings can often demonstrate better evidence of their validity based

on content because the rater can provide judgments for a number of dimensions specifically

associated with job performance. Rating forms are psychological measures, and we expect them

to be reliable and valid, as we do for any measure. We estimate their reliability/precision using

the test–retest or internal consistency method, and we generate evidence of their validity by

matching their content to the knowledge, skills, abilities, or other characteristics (such as

behaviors, attitudes, personality characteristics, or other mental states) that are presumed to be

present in the test takers. (A later chapter contains more information on various types of rating

scales and their uses in organizations: “How Are Tests Used in Organizational Settings?”) By

reporting the reliability of their criteria, researchers provide us with information on how

consistent their outcome measures are. As you may have noticed, the researchers at Brigham

Young University (Holman, Larson, & Harmer, 1994) who conducted the study on the predictive

validity of the PREP-M reported high reliability/precision for their questionnaire, which was

their subjective criterion.
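
To make the idea of criterion reliability concrete, the sketch below computes one widely used index of internal consistency, coefficient alpha, for a short rating form. Two assumptions should be kept in mind: the chapter does not say which internal consistency statistic the Brigham Young researchers reported, so alpha is used here only as a familiar example, and the item scores are invented for six hypothetical respondents answering four items.

```python
def sample_variance(values):
    """Sample variance (n - 1 in the denominator)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def cronbach_alpha(responses):
    """Coefficient alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])  # number of items on the form
    item_variances = [
        sample_variance([person[i] for person in responses]) for i in range(k)
    ]
    total_variance = sample_variance([sum(person) for person in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings: six respondents, four items scored 1 to 5.
ratings = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [5, 5, 4, 4],
    [3, 3, 3, 4],
    [1, 2, 2, 1],
    [4, 4, 5, 5],
]

alpha = cronbach_alpha(ratings)
print(f"coefficient alpha = {alpha:.2f}")
```

Alpha rises as the items vary together across respondents, which is the sense in which a criterion questionnaire with high internal consistency gives a consistent outcome measure.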

Sometimes criteria do not represent all of the dimensions in the behavior, attitude, or event being

measured. When this happens, the criterion has decreased evidence of validity based on its

content because it has underrepresented some important characteristics. If the criterion

measures more dimensions than those measured by the test, we say that criterion

contamination is present. For instance, if one were looking at the test–criterion relationship of a

test of sales aptitude, a convenient criterion might be the dollar volume of sales made over some

period of time. However, if the dollar volume of sales of a new salesperson reflected both his or

her own sales and sales that resulted from the filling of back orders sold by the former

salesperson, the criterion would be considered contaminated.

In The News Box 7.1 What Are the Criteria for Success?

Choosing criteria for performance can be difficult. Consider the criteria for teacher performance in Tennessee. In

2010, Tennessee won a federal Race to the Top Grant worth about $501 million for the state’s public school system.

As part of the program outlined in the grant application, the criteria for evaluating teachers would be students’ test

scores on state subject matter tests (e.g., writing, math, and reading) and observations by school principals.

Think a minute about these criteria. Are these criteria objective or subjective? Are they well defined? Can these

criteria be reliably measured? Is there likely to be error in the measures of these criteria? Are they valid measures of

teacher performance?

Let’s look closely at these criteria. One criterion is objective and one is subjective. Can you tell which is which? If

you answered that the test scores are objective and the observations are subjective, you are correct. The test scores

have good attributes. They provide data that can be analyzed using statistical procedures. They can be collected,

scored, and secured efficiently. If the tests are constructed correctly the scores will be reliable and valid. On the

other hand, if the tests are not developed correctly, their scores can contain error and bias.

What about the principals’ observations? What are their attributes likely to be? We say that the observations are

subjective, because they are based on personal opinion. They will probably be affected by each principal’s

preconceived notions and opinions. One way those errors can be avoided is by training the principals to use a valid

observation form that relies only on behaviors. Such forms would identify behaviors and form the basis for ratings.

Under the grant guidelines, teachers would receive at least four 10-minute evaluations each year.

The federal grant requirements outline a plan that, if carried out correctly, can yield useful results for improving the

education of Tennessee’s students. These evaluations, however, are most important to the individual teachers in

Tennessee because they will be used to decide which teachers will be retained, given pay raises, promoted, and

granted tenure. As you might expect, some principals and teachers have serious complaints. Those who teach

subjects for which there are no state tests, such as art, music, physical education, and home economics, are allowed

to choose the subjects under which they wish to be evaluated. A physical education teacher could choose to be

evaluated using the school’s scores on the state writing test. Some principals who once did classroom visits regularly

now feel compelled to observe teachers only when they are formally evaluating them.

Has something gone wrong here? Should the state provide more regulations and rules to govern the evaluation

procedures? The state board is looking into the evaluation process. “Evaluations shouldn’t be terribly onerous, so

complex you get lost among the trees,” board chairman Fielding Rolston has said. “We don’t want there to be so

many checkmarks you can’t tell what’s being evaluated” (Crisp, 2011).

Sources: Crisp (2011), Winerip (2011), and Zehr (2011).

As you can see, when evaluating a validation study, it is important to think about the criterion in

the study as well as the predictor. When unreliable or inappropriate criteria are used for

validation, the true validity coefficient might be under- or overestimated. In the News Box 7.1

describes some issues associated with identifying appropriate criteria to evaluate the

performance of school teachers needed to meet the requirements for a large federal grant.

To close this section, we thought it would be useful for you to see some information about the

predictive validity of a test that many of you may have taken as a requirement for college

admission—the SAT. To learn more about the evidence that has been gathered to support the

validity of the SAT, see On the Web Box 7.1.

Calculating and Evaluating Validity Coefficients

You may recall that the correlation coefficient is a quantitative estimate of the linear relationship

between two variables. In validity studies, we refer to the correlation coefficient between the test

and the criterion as the validity coefficient and represent it in formulas and equations as $r_{xy}$.

The x in the subscript refers to the test, and the y refers to the criterion. The validity coefficient

represents the amount or strength of the evidence of validity based on the relationship of the test

and the criterion.

Validity coefficients must be evaluated to determine whether they represent a level of validity

that makes the test useful and meaningful. This section describes two methods for evaluating

validity coefficients and how researchers use test–criterion relationship information to make

predictions about future behavior or performance.

Tests of Significance

A validity coefficient is interpreted in much the same way as a reliability coefficient, except that

our expectations for a very strong relationship are not as great. We cannot expect a test to have

as strong a relationship with another variable (test–criterion evidence of validity) as it does with

itself (reliability). Therefore, we must evaluate the validity coefficient by using a test of

significance and by examining the coefficient of determination.

The first question to ask about a validity coefficient is, “How likely is it that the correlation

between the test and the criterion resulted from chance or sampling error?” In other words, if the

test scores (e.g., SAT scores) and the criterion (e.g., college GPA) are completely unrelated, then

their true correlation is zero. If we conducted a study to determine the relationship between these

two variables and found that the correlation was .4, one question we would need to ask is, “What

is the probability that our study would have yielded the observed correlation by chance alone,

even if the variables were truly unrelated?” If the probability that the correlation occurred by

chance is low—less than 5 chances out of 100 (p < .05)—we can be reasonably sure that the test

and its criterion (in this example, SAT scores and college GPA) are truly related. This process is

called a test of significance. In statistical terms, for this example we would say that the validity

coefficient is significant at the .05 level. In organizational settings, it can be challenging for

validity studies to have statistically significant results at p < .05 because of small sample sizes

and criterion contamination.

Because larger sample sizes reduce sampling error, this test of significance requires that we take

into account the size of the group (N) from which we obtained our data. Appendix E can be used

to determine whether a correlation is significant at varying levels of significance. To use the

table in Appendix E, calculate the degrees of freedom (df) for your correlation using the

formula df = N – 2, and then determine the probability that the correlation occurred by chance by

looking across the row associated with those degrees of freedom. The correlation coefficient you

are evaluating should be larger than the critical value shown in the table. You can determine the

level of significance by looking at the column headings. At the level at which your correlation

coefficient is smaller than the value shown, the correlation can no longer be considered

significantly different from zero. For Your Information Box 7.4 provides an example of this

process.

On the Web Box 7.1 Validity and the SAT

To help make admissions decisions, many colleges and universities rely on applicants’ SAT scores. Because

academic rigor can vary from one high school to the next, the SAT—a standardized test—provides schools with a

fair and accurate way to put students on a level playing field to compare one student with another. However,

whether the SAT truly predicts success in college—namely, 1st-year college grades—is controversial.

To learn more about the validity of the SAT, visit the following websites:

General Information

Website: www.fairtest.org/facts/satvalidity.html

Description: General discussion of the following:

• What the SAT is supposed to measure

• What SAT I validity studies from major colleges and universities show

• How well the SAT I predicts success beyond the freshman year

• How well the SAT I predicts college achievement for women, students of color, and older students

• How colleges and universities should go about conducting their own validity studies

• Alternatives to the SAT

Research Studies

Website: https://research.collegeboard.org/sites/default/files/publications/2012/7/researchreport-2008-5-validity-sat-predicting-first-year-college-grade-point-average.pdf

Description: Research study exploring the predictive validity of the SAT in predicting 1st-year college grade point average (GPA)

Website: www.ucop.edu/news/sat/research.html

Description: Research study presenting findings on the relative contributions of high school GPA, SAT I scores, and SAT II scores in predicting college success for 81,722 freshmen who entered the University of California from fall 1996 through fall 1999

Website: www.collegeboard.com/prod_downloads/sat/newsat_pred_val.pdf

Description: Summary of research exploring the predictive value of the SAT Writing section

Website: www.psychologicalscience.org/pdf/ps/frey.pdf

Description: A paper that looks at the relationship between SAT scores and general cognitive ability

When researchers or test developers report a validity coefficient, they should also report its level

of significance. You might have noted that the validity coefficients of the PREP-M (reported

earlier in this chapter) are followed by the statements p < .01 and p < .05. This information tells

the test user that the likelihood a relationship was found by chance or as a result of sampling

error was less than 5 chances out of 100 (p < .05) or less than 1 chance out of 100 (p < .01).

For Your Information Box 7.4 Test of Significance for a Correlation Coefficient

Here we illustrate how to determine whether a correlation coefficient is significant (evidence of a true relationship)

or not significant (no relationship). Let’s say that we have collected data from 20 students. We have given the

students a test of verbal achievement, and we have correlated students’ scores with their grades from a course on

creative writing. The resulting correlation coefficient is .45.

We now go to the table of critical values for Pearson product–moment correlation coefficients in Appendix E. The

table shows the degrees of freedom (df) and alpha (α) levels for two-tailed and one-tailed tests. Psychologists usually

set their alpha level at 5 chances out of 100 (p < .05) using a two-tailed test, so we use that standard for our example.

Because we used the data from 20 students in our sample, we substitute 20 for N in the formula for degrees of

freedom (df = N – 2). Therefore, df = 20 – 2 or 18. We then go to the table and find 18 in the df column. Finally, we

locate the critical value in that row under the .05 column.

A portion of the table from Appendix E is reproduced in the table below, showing the alpha level for a two-tailed

test. The critical value of .4438 (bolded in the table) is the one we use to test our correlation. Because our correlation

(.45) is greater than the critical value (.4438), we can infer that the probability of finding our correlation by chance is

less than 5 chances out of 100. Therefore, we assume that there is a true relationship and refer to the correlation

coefficient as significant. Note that if we had set our alpha level at a more stringent standard of .01 (1 chance out of

100), our correlation coefficient would have been interpreted as not significant.

If the correlation between the test and the criterion is not as high as the critical value shown in the table, we can say

that the chance of error associated with the test is above generally accepted levels. In such a case, we would

conclude that the validity coefficient does not provide sufficient evidence of validity.

Critical Values for Pearson Product–Moment Correlation Coefficients

df .10 .05 .02 .01 .001

16 .4000 .4683 .5425 .5897 .7084

17 .3887 .4555 .5285 .5751 .6932

18 .3783 .4438 .5155 .5614 .6787

19 .3687 .4329 .5034 .5487 .6652

20 .3598 .4227 .4921 .5368 .6524

Source: From Statistical Tables for Biological, Agricultural and Medical Research by R. A. Fisher and F. Yates. Copyright © 1963. Published by

Pearson Education Limited.
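The lookup procedure in this box can be sketched in a few lines of Python. This is only an illustration: the dictionary holds the five rows excerpted above (two-tailed alpha levels), and the names `CRITICAL_R` and `is_significant` are our own, not part of any statistical package.

```python
# Critical values copied from the excerpt of Appendix E shown above
# (two-tailed test). Only df = 16 through 20 are available in this sketch.
CRITICAL_R = {
    16: {0.10: 0.4000, 0.05: 0.4683, 0.02: 0.5425, 0.01: 0.5897, 0.001: 0.7084},
    17: {0.10: 0.3887, 0.05: 0.4555, 0.02: 0.5285, 0.01: 0.5751, 0.001: 0.6932},
    18: {0.10: 0.3783, 0.05: 0.4438, 0.02: 0.5155, 0.01: 0.5614, 0.001: 0.6787},
    19: {0.10: 0.3687, 0.05: 0.4329, 0.02: 0.5034, 0.01: 0.5487, 0.001: 0.6652},
    20: {0.10: 0.3598, 0.05: 0.4227, 0.02: 0.4921, 0.01: 0.5368, 0.001: 0.6524},
}


def is_significant(r, n, alpha=0.05):
    """Return True if |r| exceeds the tabled critical value for df = n - 2."""
    df = n - 2  # degrees of freedom for a correlation coefficient
    critical = CRITICAL_R[df][alpha]  # raises KeyError outside the excerpt
    return abs(r) > critical


# The example from this box: r = .45 from a sample of 20 students.
print(is_significant(0.45, 20, alpha=0.05))
print(is_significant(0.45, 20, alpha=0.01))
```

The two calls correspond to the two alpha levels discussed in the box: the same correlation can pass at the .05 level and fail at the more stringent .01 level.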

The Coefficient of Determination

Another way to evaluate the validity coefficient is to determine the amount of variance that the

test and the criterion share. We can determine the amount of shared variance by squaring the

validity coefficient to obtain $r^2$, called the coefficient of determination. For example, if the

correlation (r) between a test and a criterion is .30, the coefficient of determination ($r^2$) is .09.

This means that the test and the criterion have 9% of their variance in common. Larger validity

coefficients represent stronger relationships with greater overlap between the test and the

criterion. Therefore, if r = .50, then $r^2$ = .25, or 25% shared variance.

We can calculate the coefficient of determination for the correlation of husbands’ scores on the

PREP-M and the questionnaire on marital satisfaction and stability. By squaring the original

coefficient, .44, we obtain the coefficient of determination, $r^2$ = .1936. This outcome means that

the predictor, the PREP-M, and the criterion, the questionnaire, shared (or had in common)

approximately 19% of their variance.
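As a quick check on the arithmetic above, the calculation can be sketched in Python. The function name `coefficient_of_determination` is our own label for the squaring step.

```python
def coefficient_of_determination(r):
    """Return r squared: the proportion of variance shared by test and criterion."""
    return r ** 2


# The three validity coefficients mentioned in this section.
for r in (0.30, 0.50, 0.44):
    shared = coefficient_of_determination(r)
    print(f"r = {r:.2f} -> r^2 = {shared:.4f} ({shared:.0%} shared variance)")
```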

Unadjusted validity coefficients rarely exceed .50. Therefore, you can see that even when a

validity coefficient is statistically significant, the test can account for only a small portion of the

variability in the criterion. The coefficient of determination is important to calculate and

remember when using the correlation between the test and the criterion to make predictions

about future behavior or performance.

How Confident Can We Be About Estimates of Validity?

Conducting one validity study that demonstrates a strong relationship between the test and the

criterion is the first step in a process of validation, but it is not the final step. Studies that provide

evidence of a test’s validity should continue for as long as the test is being used. No matter how

well designed the validation study is, elements of chance, error, and situation-specific factors that

can inflate or deflate the estimate of validity are always present. Ongoing investigations of

validity include cross-validation (where the results that are obtained using one sample are used to

predict the results on a second, similar sample) and meta-analyses (where the results from many

studies are statistically combined to provide a more error-free estimate of validity). Psychologists

also inquire about whether validity estimates are stable from one situation or population to

another—a question of validity generalization. We have more to say about these topics in later

chapters.

The Relationship Between Reliability and Validity

You have already learned in your study of reliability that according to classical test theory,

observed scores on a test can be thought of as the sum of two components—the individual’s true

score on the construct that the test was designed to measure and a random component, which we

call measurement error. You have also learned that the reliability coefficient can be conceived of

as a test’s correlation with itself or another parallel test. That is why we often indicate the

reliability coefficient as $R_{xx}$, where the two subscripts are the same. The reason why random error

will always reduce the reliability coefficient is that any event that is random will have a zero

correlation with any other event. That is simply another way of saying that knowledge of one

random event will give you no information that would enable you to predict any other event. So

the more that random events affect a test score, the less that score can correlate with any other

measurement—even itself.

You have now learned that one way we provide evidence of validity is to correlate the scores on

a test with the scores on a criterion measure. This is called a validity coefficient. You also know

that test scores always contain random error so they are never perfectly reliable. And now you

have also learned that reliability is a concern when we develop criterion measures. So if the

correlation of a test with itself is reduced from the maximum theoretical value 1.0 due to

measurement error, what happens to the correlation coefficient when we correlate test scores that

contain random error with a criterion measure which also contains random error? The answer is

that just as in the case of reliability, the random error in both measures will reduce the degree to

which the two sets of scores can correlate with each other no matter how well the construct that

is measured by the test actually predicts the construct measured by the criterion. This reduction

in the validity coefficient is referred to as attenuation due to unreliability.

It is a simple matter to quantify the degree to which unreliability can affect (attenuate or reduce)

the validity coefficient. Mathematically, the square root of the reliability coefficient of a test

sets the upper limit of the validity coefficient. So if a test has a reliability of .64, the maximum

correlation that the test could have with a perfectly reliable criterion is the square root of the

reliability coefficient. In this example that would be $\sqrt{.64} = .8$, so the maximum validity coefficient

would be .8.

If the criterion is less than perfectly reliable (which will always be the case), the maximum

possible correlation between the test and criterion will be even lower. The maximum validity

coefficient between a test and a criterion can also be easily calculated if you know the reliability of

both. It is simply the product of the square roots of both reliability coefficients. So if the

reliability of the criterion measure was .7, the maximum observed correlation the criterion could

have with a test that had a reliability coefficient of .64 would be $\sqrt{.64} \times \sqrt{.7} \approx .67$ if the constructs

being measured were perfectly related, which also will never be the case. If the “true” correlation

between the constructs were actually .5 (which, like true scores in reliability calculations, you

can never really know), the observed correlation (the validity coefficient) between the test and

criterion for this example would be further reduced and is equal to $.5 \times \sqrt{.64} \times \sqrt{.7} \approx .5 \times .67 \approx .34$. The general

formula that demonstrates that the correlation between a test and a criterion is dependent upon

the “true” correlation between the predictor and criterion constructs and the reliability of the

observed scores is:

$$r_{x_o y_o} = r_{x_t y_t}\sqrt{R_{xx} R_{yy}}$$

where

• $r_{x_o y_o}$ = the observed correlation between the predictor measure (test) and criterion measure

• $r_{x_t y_t}$ = the “true” correlation between the predictor construct and criterion construct

• $R_{xx}$ = the reliability coefficient of the predictor measure (test)

• $R_{yy}$ = the reliability of the criterion measure

This attenuation of the “true” validity coefficient due to the unreliability of the test and the

criterion is the reason why observed validity coefficients often range between .2 and .4 while

reliability coefficients for well-designed tests are often greater than .8. Even though both

coefficients are correlations, the reliability coefficient is the correlation between the test and

itself, which will always be higher than the correlation of the test with a less than perfectly reliable

criterion measure. The measurement error that is present in both will attenuate the observed

correlation.
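The attenuation relationship described in this section can be sketched in Python. This is a minimal illustration under the classical test theory assumptions stated above; the function names `max_validity` and `observed_validity` are our own, and the inputs are the reliabilities and “true” correlation used in the worked example.

```python
import math


def max_validity(r_xx, r_yy=1.0):
    """Upper limit of the observed validity coefficient.

    r_xx is the reliability of the test; r_yy is the reliability of the
    criterion (1.0 stands for a hypothetical perfectly reliable criterion).
    """
    return math.sqrt(r_xx) * math.sqrt(r_yy)


def observed_validity(true_r, r_xx, r_yy):
    """Observed correlation implied by a given 'true' construct-level correlation."""
    return true_r * math.sqrt(r_xx * r_yy)


print(round(max_validity(0.64), 2))            # test reliability .64, perfect criterion
print(round(max_validity(0.64, 0.7), 2))       # criterion reliability .7
print(round(observed_validity(0.5, 0.64, 0.7), 3))  # "true" correlation of .5
```

Carrying full precision through the last step gives a value that may differ in the second decimal place from a hand calculation that rounds the intermediate product first.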

Psychometricians have developed a method for “correcting” validity coefficients for attenuation

due to unreliability. Such corrections can be controversial because if they are used inappropriately,

they will misrepresent the relationship between a test and criterion (i.e., validity) and could lead

to incorrect inferences being made from the test scores. In Greater Depth Box 7.1 discusses the

correction of validity coefficients for attenuation due to unreliability in more detail, some of the

interpretive challenges such corrections present, and an important concept called operational

validity.

In Greater Depth Box 7.1 Operational Validity and the Correction for Attenuation in Validity Due to Unreliability

Consider the following (not so) hypothetical example. A large company wishes to use a personality test to predict

job performance to help select their employees. In particular, they are interested in the personality trait of

conscientiousness that they believe will be related to overall job performance. So they decide to conduct a predictive

validity study in which everybody hired takes the personality test to measure their level of conscientiousness, but

these results are not used as part of the selection process. One year later, they collect performance evaluation data on

everyone who took the test. To compute a validity coefficient, they correlate the scores on the conscientiousness

scale of the personality test with the scores on the performance evaluations and find that it is .29. This would be

called the observed validity of conscientiousness to predict job performance.

But what if the performance evaluation they used as a criterion measure was unreliable? As we discussed earlier, the

observed validity coefficient will be attenuated or reduced due to this unreliability. Before discussing how the

validity coefficient can be statistically adjusted to account for the unreliability in the criterion, we need to discuss the

difference between the observed validity and something called the true score validity.

You learned in our discussion of reliability that all observed scores consist of two components—true scores and

random error. The true scores represent the degree to which a person possesses a particular knowledge or trait if we

could measure it without error. We refer to this knowledge or trait as the construct that the test was designed to

measure. In our example, there are two constructs that the company needed to measure as part of their validity study.

The first was the personality trait of conscientiousness. This construct was measured via a personality test and can

be called the predictor construct. The second construct was overall job performance, which was measured via job

performance evaluation data. This can be called the criterion construct. If we could, we would really like to know

everyone’s true scores at the construct level. Then we could correlate the true scores on the predictor with the true

scores on the criterion and obtain the true score validity coefficient. But unfortunately, we can’t do that. All we can

do is correlate the scores on the imperfect observed measures that are designed to assess each construct to obtain an

observed validity coefficient. Because these observed scores contain random error (they are not perfectly reliable),

the observed validity coefficient will always be lower than what the true score validity coefficient would have been.

The relationships between the constructs and the observed measures of those constructs are depicted graphically

in Figure 7.1.

The fact that the observed validity coefficient will always be less than the hypothetical true score validity raises some

important questions: Since the observed validity coefficient will be reduced because of the presence of measurement

error, how are we to properly interpret it? Does the attenuated validity coefficient really accurately describe the

predictive relationship between the predictor and the criterion?

To answer these questions, let’s go back to our example. The predictive validity study that we described above used

performance evaluation data as its measure of overall job performance. The observed validity coefficient between

conscientiousness scores on the personality test and the job performance scores was measured to be .29. However,

research has demonstrated that the reliability/precision of performance evaluations as a measure of overall job

performance is relatively poor, with a mean meta-analytic reliability coefficient across many studies of only .52

(Viswesvaran, Ones, & Schmidt, 1996). This means that while 52% of the variance in job performance ratings is due

to the true scores on the construct, 48% of the variance is attributable to measurement error. Therefore, the observed

correlation of .29 between conscientiousness and overall job performance, as measured by performance evaluation

data, will be lower than it would be if we could measure both without error at the construct level. The true

correlation has been attenuated due to measurement error (unreliability) in the observed criterion measure. The

implication is that the personality trait of conscientiousness may actually be a better predictor of overall job

performance than the validity coefficient suggests!

Figure 7.1 ■ Graphical Representation of True Score, Observed Score, and Operational Validity

Source: Copyright © 2014 by Society for Industrial and Organizational Psychology.

There is a correction that can be applied to estimate what the validity coefficient would be if we were able to

measure the criterion construct without error. It is called the correction for attenuation. The formula is quite

simple:

$$r_{xy(\text{corrected})} = \frac{r_{xy(\text{as measured})}}{\sqrt{R_{yy}}}$$

where:

• $r_{xy(\text{corrected})}$ = the validity coefficient corrected for attenuation

• $r_{xy(\text{as measured})}$ = the original measured validity coefficient

• $R_{yy}$ = the estimated reliability of the observed criterion

In our example, the measured validity coefficient was determined to be .29. The reliability of the criterion

(the job performance evaluation data) was estimated via previous research to be .52 (Viswesvaran et al., 1996).

Therefore, the validity coefficient corrected for attenuation due to the unreliability in the criterion would

be $.29 / \sqrt{.52} = .40$. The corrected validity coefficient can be interpreted to be the correlation between the predictor

(conscientiousness in our example) and the criterion construct (job performance), not the

criterion measure (performance evaluation data). We call this correlation the operational validity or the true

validity of the predictor for predicting the criterion construct (Viswesvaran et al., 2014). It is an attempt to reflect

the true relationship between the predictor and the criterion once the error in measurement is removed from the

criterion variable. In our example, it suggests that in actual operation, the construct of conscientiousness is

a better predictor of the construct of job performance than is apparent when you simply look at the correlation

between conscientiousness and performance evaluation data based on the observed data.
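The correction just described can be sketched in Python. This is an illustration only; the function name `correct_for_criterion_unreliability` is our own, and the inputs are the observed validity (.29) and criterion reliability (.52) from the example.

```python
import math


def correct_for_criterion_unreliability(r_observed, r_yy):
    """Estimate operational validity by dividing the observed validity
    coefficient by the square root of the criterion reliability."""
    if not 0 < r_yy <= 1:
        raise ValueError("criterion reliability must be in (0, 1]")
    return r_observed / math.sqrt(r_yy)


corrected = correct_for_criterion_unreliability(0.29, 0.52)
print(round(corrected, 2))
```

Note that the correction is applied to the criterion only. The reliability of the predictor does not enter the calculation.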

It may have occurred to you that while the validity coefficient is the correlation between a predictor variable and a

criterion variable, we have only applied the correction to the criterion variable. While we could have applied the

correction to the predictor variable as well, it would not have been appropriate to do so in this case (Society for

Industrial and Organizational Psychology, 2003). This is because once we choose a test, we are interested in how

well that test (with all its error) predicts the criterion. If a test has a lot of error (i.e., is not very reliable), it will do a

poor job of prediction because of that error. There is no reason to correct for that as the correlation between the

unreliable test and the criterion accurately represents the degree to which the test can or cannot predict the criterion.

However, if the criterion measure is unreliable, then the conclusions or inferences we will draw from the validity

coefficient could be very flawed. While the test might be a poor predictor of the criterion at the measurement level

because of criterion unreliability, as you have seen, it might actually be a much better predictor at the construct level. That is, the operational validity of the test-criterion relationship might be very good while the observed

validity is poorer. The benefit or utility of a test actually comes from its operational validity. This is a measure of

how well the criterion will be predicted by the test in actual practice. That’s why operational validity is sometimes

also referred to as true validity (Viswesvaran et al., 2014). Measurement error in the criterion serves to mask the true

relationship between the two.

In closing this section, it is important for us to state that the practice of correcting validity coefficients is something

of a controversial area and there are differing opinions about the appropriateness of the corrections. For a dissenting

view, the interested student should see LeBreton, Scherer, and James (2014). Also, we have only discussed

correcting validity coefficients for unreliability in the criterion measure. Validity coefficients can also be corrected

for range restriction. As we mentioned earlier in this chapter, restriction in range can also reduce a validity

coefficient when tests are used for selection purposes because we only have access to the test and criterion scores for

those people who are actually selected, not the full range of people who might have taken the test. Also, because test

scores will often be correlated with other criteria used to select employees (such as the scores on interviews), those

with the lowest test scores will often not be selected even if the test scores themselves are not used as a selection

criterion as would be the case in a predictive validity study. This will result in an indirect restriction of range on the

test scores that will then artificially reduce the observed validity coefficient. A comprehensive review of the

statistical issues that are present when conducting criterion-related validity studies to gather evidence for validity

based on a test’s relationship with other variables can be found in Van Iddekinge and Ployhart (2008). Finally, the

corrections we have described in this section are most often used when large-scale psychometric meta-analyses are

conducted that combine the results of many smaller individual studies to arrive at more accurate estimates of

validity for a particular predictor variable. They are less often used to correct the results obtained in individual

studies like our example because the small sample sizes usually present in these studies would likely result in

unstable corrected estimates that would vary significantly if the study were repeated on a different sample.

Using Validity Information to Make Predictions

When a relationship can be established between a test and a criterion, we can use the test scores of

new individuals to predict how well those individuals will perform on the criterion measure. For

example, some universities use students’ scores on the SAT to predict the students’ success in

college. Organizations use job candidates’ scores on pre-employment tests that have

demonstrated evidence of validity to predict those candidates’ scores on the criteria of job

performance.

Linear Regression

We use the statistical process called linear regression when we use one set of test scores (X) to

predict one set of criterion scores (Y′). While a full description of linear regression is beyond the

scope of this book, we can show you the basic process.

We start by constructing the following linear regression equation:

Y′ = a + bX,

where

• Y′ = the predicted score on the criterion

• a = the intercept

• b = the slope (also called a b weight, regression weight, or regression coefficient)

• X = the score the individual made on the predictor test

You may recognize this formula from previous math courses you have taken as the equation for a

straight line. In linear regression, we refer to this line as the regression line. We calculate

the slope or b weight (b) of the regression line—the expected change in Y for every one-unit

change in X—using the following formula:

$b = r\,\frac{s_y}{s_x}$,

where

• r = the correlation coefficient

• sx = the standard deviation of the distribution of X

• sy = the standard deviation of the distribution of Y

The intercept is the place where the regression line crosses the y-axis. The intercept (a) is

calculated using the following formula:

$a = \bar{Y} - b\bar{X}$,

where

• Ȳ = the mean of the distribution of Y

• b = the slope

• X̄ = the mean of the distribution of X

You may have noticed that we are using the symbols for the sample standard deviation (sx and sy)

and the sample mean (X̄ and Ȳ) here and in For Your Information Box 7.5 instead of the

symbols for the population values. This is because a regression equation is usually used to make

predictions about a population based on sample data.

A test for statistical significance can be performed on b, the regression weight. This is a test that

evaluates whether the slope of the regression line is statistically significantly different from zero.

If b is significantly different from zero, it means that X can be considered to be a valid predictor

of the criterion, Y. Mathematically, a test of b in a simple linear regression will give you exactly

the same results as a test of the correlation between X and Y. If this correlation is statistically

significant, then it also means that the slope of the regression line that predicts Y from X is

statistically significantly different from zero as well. If the correlation between X and Y is not

statistically significant, there would be no reason to perform a regression, as that would mean

that the predictor (X) does not provide any predictive information about the criterion (Y).
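The equivalence between the test of b and the test of the correlation can be checked numerically. The following is a minimal sketch rather than part of the original example. It assumes the usual t statistics, t = r√(n − 2)/√(1 − r²) for the correlation and t = b/SE(b) for the slope. The function name and the scores are made up for illustration.

```python
import math
from statistics import mean

def t_for_r_and_b(x, y):
    """Return (t for the correlation, t for the slope) for paired scores.

    Hypothetical helper written for this illustration only.
    """
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)            # sum of squares of X
    syy = sum((yi - my) ** 2 for yi in y)            # sum of squares of Y
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

    r = sxy / math.sqrt(sxx * syy)                   # correlation coefficient
    b = sxy / sxx                                    # slope (b weight)
    a = my - b * mx                                  # intercept

    # t statistic for the correlation, with n - 2 degrees of freedom
    t_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

    # t statistic for the slope: b divided by its standard error
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
    t_b = b / se_b
    return t_r, t_b

# Made-up test scores (x) and criterion scores (y)
x = [12, 15, 9, 20, 17, 11, 14, 18]
y = [3.1, 3.4, 2.5, 3.9, 3.2, 2.9, 3.0, 3.8]
t_r, t_b = t_for_r_and_b(x, y)
print(round(t_r, 6), round(t_b, 6))  # the two values agree
```

Because the two t statistics are algebraically identical in simple linear regression, they lead to the same conclusion about statistical significance.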

For Your Information Box 7.5 shows the calculation of a linear regression equation and how it is

used to predict scores on a criterion.

The process of using correlated data to make predictions is also important in clinical settings. For

Your Information Box 7.6 describes how clinicians use psychological test scores to identify

adolescents at risk for committing suicide.

Multiple Regression

Complex criteria, such as job performance and success in graduate school, are often difficult to

predict with a single test. In these situations, researchers frequently use more than one test to

make a more accurate prediction. A technique called multiple regression is often used in this

situation.

We use the statistical process of multiple regression when we have more than one set of test

scores (X1, X2, … Xn) used for predicting a criterion (Y′). A multiple regression equation

expands the familiar linear regression equation to include more than one predictor or test as

follows:

Y′ = a + b1X1 + b2X2 + b3X3 + … + bnXn,

where

• Y′ = the predicted score on the criterion

• a = the intercept (where the regression line crosses the y-axis)

• Xi = the predictor

• bi = the expected change in Y for every one-unit change in Xi, when all the other

predictors in the equation do not vary or remain constant. As in simple linear regression,

these are also called b weights or regression weights. The b weight is also related to

slope, but when there are more than two predictors, this cannot be graphically represented

because you would actually need to graph in more than three dimensions to see it.
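To make the multiple regression equation concrete, the short sketch below computes Y′ from an intercept, a list of b weights, and a list of predictor scores. The function name, the weights, and the scores are hypothetical and chosen only for illustration. They do not come from a real study.

```python
def predict_criterion(intercept, weights, scores):
    """Return Y' = a + b1*X1 + b2*X2 + ... + bn*Xn."""
    if len(weights) != len(scores):
        raise ValueError("need exactly one b weight per predictor score")
    return intercept + sum(b * x for b, x in zip(weights, scores))

# Hypothetical equation with two predictors
a = 0.50                    # intercept
b_weights = [0.02, 0.30]    # b1 and b2
applicant = [65, 3.2]       # X1 and X2 for one person

print(round(predict_criterion(a, b_weights, applicant), 2))  # 2.76

# Raising X1 by one point while X2 stays constant changes Y' by b1
higher = predict_criterion(a, b_weights, [66, 3.2])
print(round(higher - predict_criterion(a, b_weights, applicant), 2))  # 0.02
```

The second print statement illustrates the meaning of a b weight: the expected change in Y′ for a one-unit change in one predictor when the other predictors do not vary.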

For Your Information Box 7.5 Making Predictions With a Linear Regression Equation

Research suggests that academic self-efficacy (ASE) and class grades are related. We have made up the following

data to show how we could use the scores on an ASE test to predict a student’s grade. We have also done the various

calculations for you. (Note: Our fake data set is small, to facilitate this illustration.)

For instance, we can ask, “If a student scores 65 on the ASE test, what course grade would we expect the student to

receive?” To facilitate this analysis, we have assigned a number to each grade: 1 = D, 2 = C, 3 = B, and 4 = A.

Student   ASE (X)   Grade (Y)
1         80        3
2         62        2
3         90        4
4         40        2
5         55        2
6         85        2
7         70        4
8         75        3
9         25        1
10        50        3

• Step 1: Calculate the means and standard deviations of X and Y.

X̄ = 63.2, Ȳ = 2.6, sx = 20.82, sy = .97

• Step 2: Calculate the correlation coefficient (rxy) for X and Y.

rxy = .67

• Step 3: Calculate the slope (b) and intercept (a).

$b = r\,\frac{s_y}{s_x} = .67 \times \frac{.97}{20.82} = .031$

$a = \bar{Y} - b\bar{X} = 2.6 - (.031)(63.2) = .64$

• Step 4: Calculate Y′ (the predicted grade) when X = 65.

Y′ = a + bX

Y′ = .64 + (.031)(65)

Y′ = .64 + 2.02 = 2.66

• Step 5: Translate the number calculated for Y′ back into a letter grade.

Therefore, a predicted numerical grade of 2.66 converts to a letter grade between C and B, perhaps a C+.

The best prediction we can make is that a person who scored 65 on an ASE test would be expected to earn a course

grade of C+. Note that by substituting any test score for X, we will receive a corresponding prediction for a score

on Y.

This equation actually provides a predicted score on the criterion (Y′) for each test score (X). When the Y′ values are

plotted, they form the linear regression line associated with the correlation between the test and the criterion.
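Readers who want to check the calculations in this box can reproduce them with a few lines of code. The sketch below uses the ten made-up ASE scores and grades from this box and the formulas for b and a given in the chapter. The variable and function names are our own choices for illustration.

```python
from statistics import mean, stdev

# Made-up data from this box: ASE scores (X) and numeric grades (Y)
ase = [80, 62, 90, 40, 55, 85, 70, 75, 25, 50]
grade = [3, 2, 4, 2, 2, 2, 4, 3, 1, 3]

n = len(ase)
x_bar, y_bar = mean(ase), mean(grade)
s_x, s_y = stdev(ase), stdev(grade)     # sample standard deviations (n - 1)

# Pearson correlation from the sum of cross-products
cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(ase, grade))
r = cross / ((n - 1) * s_x * s_y)

b = r * s_y / s_x                       # slope
a = y_bar - b * x_bar                   # intercept

def predict(x):
    """Predicted grade (Y') for an ASE score of x."""
    return a + b * x

print(round(x_bar, 1), round(y_bar, 1), round(s_x, 2), round(s_y, 2))
# 63.2 2.6 20.82 0.97
print(round(r, 2), round(b, 3), round(a, 2), round(predict(65), 2))
# 0.67 0.031 0.64 2.66
```

Calling predict with any other ASE score returns the corresponding point on the regression line.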

For Your Information Box 7.6 Evidence of Validity of the Suicide Probability Scale Using the Predictive Method

Although the general incidence of suicide has decreased during the past two decades, the rate for people between 15

and 24 years old has tripled. Suicide is generally considered to be the second or third most common cause of death

among adolescents, even though it is underreported (O’Connor, 1997–2014).

If young people who are at risk for committing suicide or making suicide attempts can be identified, greater

vigilance is likely to prevent such actions. Researchers at Father Flanagan’s Boys’ Home, in Boys Town, Nebraska,

conducted a validity study using the predictive method for the Suicide Probability Scale (SPS) that provided

encouraging results for predicting suicidal behaviors in adolescents (Larzelere, Smith, Batenhorst, & Kelly, 1996).

The SPS contains 36 questions that assess suicide risk, including thoughts about suicide, depression, and isolation.

The researchers administered the SPS to 840 boys and girls when they were admitted to the Boys Town residential

treatment program from 1988 through 1993. The criteria for this study were the numbers of suicide attempts, suicide

verbalizations, and self-destructive behaviors recorded in the program’s daily incident reports completed by

supervisors of the group homes. (The interrater reliabilities for reports of verbalizations and reports of self-

destructive behaviors were very high at .97 and .89, respectively. The researchers were unable to calculate a

reliability estimate for suicide attempts because only one attempt was recorded in the reports they selected for the

reliability analysis.)

After controlling for a number of confounding variables, such as gender, age, and prior attempts at suicide, the

researchers determined that the total SPS score and each of its subscales differentiated (p = .05) between those who

attempted suicide and those who did not. In other words, the mean SPS scores of those who attempted suicide were

significantly higher than the mean SPS scores of those who did not attempt suicide. The mean SPS scores of those

who displayed self-destructive behaviors were also significantly higher (p = .01) than the mean SPS scores of those

who did not display self-destructive behaviors. Finally, the total SPS score correlated at .25 (p = .001) with the

suicide verbalization rate. Predictions made by the SPS for those at risk for attempting suicide showed that each 1-

point increase in the total SPS score predicted a 2.4% greater likelihood of a subsequent suicide attempt.

The researchers suggested a cutoff score of 74 for those without prior suicide attempts and a cutoff score of 53 for

those with prior suicide attempts. In other words, if an adolescent who has no history of suicide attempts scores

above 74 on the SPS, the youth would be classified as at risk for suicide and treated accordingly. If an adolescent

who has a history of a suicide attempt scores below 53, the youth would be classified as not at risk for suicide.

The researchers emphasized, however, that although the SPS demonstrated statistically significant validity in

predicting suicide attempts, it is not a perfect predictor. A number of suicide attempts were also recorded for those

with low scores, and therefore a low SPS score does not ensure that an adolescent will not attempt suicide. The SPS

does, however, provide an instrument for accurately identifying adolescents at risk for committing suicide.

The subscripts following each b and X are used to identify each predictor in the regression

equation.

In multiple regression, there is still one criterion (Y), as in simple linear regression, but there are

now multiple predictors. There are a number of different statistics we can use when interpreting

the results of a multiple regression. One of the statistics is analogous to the correlation

coefficient (r) used in simple regression. It is called the multiple correlation coefficient and is

indicated by a capital letter R. R describes the overall relationship between more than one

predictor and a criterion. R is interpreted in a similar fashion to the usual correlation coefficient.

Like any correlation coefficient, R can be subjected to a test of significance. If R is significant, it

indicates that all the predictors in the equation taken together explain a statistically significant

amount of variance in the criterion. However, an even more useful statistic for interpreting the

results of a multiple regression is called the coefficient of multiple determination (R²). Earlier

in this chapter we discussed the coefficient of determination. You will recall that the coefficient

of determination (r²) is simply the square of a correlation coefficient between a single predictor

and a criterion. It is interpreted as the proportion of variance that is shared by the two variables.

Likewise, the coefficient of multiple determination (R²) is the square of the multiple correlation

coefficient. R² is a statistic that is obtained through multiple regression analysis, which is

interpreted as the total proportion of variance in the criterion variable that is accounted for

by all the predictors in the multiple regression equation.

In multiple regression, we usually expect that all of the included predictors will be correlated

with the criterion—that’s why we chose them as predictors in the first place. However, in most

cases, the predictors will also be correlated with each other as well as with the criterion. This

correlation among the predictor variables in a multiple regression is called multicollinearity and

can create difficulty in interpreting the results. As you know, when variables are correlated, it

indicates that they share something in common. When we have two or more predictors that are

both correlated among themselves and are also correlated with the criterion, we may not know

whether each predictor is accounting for a separate, unique portion of the variance in the

criterion. Sometimes, both predictors may be accounting for the same variance in the criterion. If

this is the case, using two predictors would not provide any more predictive power than simply

using either one by itself. This can complicate the interpretation of the results of a multiple

regression equation. The issue often arises when you need to answer the question of whether

adding an additional test to a test battery is worth the effort and expense. Here is an example.

Suppose a college admissions officer wanted to investigate the degree to which he could predict

students’ 1st-year college GPA (the criterion) using measures of the student’s success in high

school as predictors. One of the predictors that he decides to use is the students’ self-reported

high school GPAs stated on their applications for admission. The other predictor he decides to

use is the students’ GPAs that are reported on their official high school transcripts. You probably

can immediately see that the two predictors would be extremely highly correlated, because they

are both measuring exactly the same thing. As a result, using both predictors in a multiple

regression would not provide any independent predictive information regarding 1st-year college

GPA. Therefore, there would be no reason to include them both as predictors. Anytime we use

multiple predictors to predict a criterion, it is important to evaluate the extent to which they are

predicting unique, nonoverlapping parts of the variance in the criterion. Multiple regression can

help us to make that determination.
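The effect of correlated predictors can be seen in a standard formula for the two-predictor case. The chapter does not present this formula, so treat it as a supplementary sketch. The symbols are introduced here for illustration: r_{Y1} and r_{Y2} are the correlations of each predictor with the criterion, and r_{12} is the correlation between the two predictors.

```latex
R^2 = \frac{r_{Y1}^2 + r_{Y2}^2 - 2\, r_{Y1}\, r_{Y2}\, r_{12}}{1 - r_{12}^2}

% Uncorrelated predictors (r_{12} = 0): the contributions simply add.
R^2 = r_{Y1}^2 + r_{Y2}^2

% Equally valid predictors (r_{Y1} = r_{Y2} = r):
R^2 = \frac{2r^2 (1 - r_{12})}{(1 - r_{12})(1 + r_{12})} = \frac{2r^2}{1 + r_{12}}
% As r_{12} approaches 1, R^2 approaches r^2, the value for one predictor alone.
```

In the admissions example, the self-reported and official GPAs would correlate almost perfectly, so using both would yield an R² barely larger than the r² for either one by itself.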

In Greater Depth Box 7.2 explores in more detail the interpretation of multiple regression results.

In Greater Depth Box 7.2 Interpreting Multiple Regression Results

When we interpret the results of a multiple regression analysis, the first thing that we typically look for is whether

the value of R² is statistically significant. If it is significant, it indicates that all the predictors taken together are able

to predict a significant amount of variance in the criterion. Next we look at the size of R² because the size tells us

how much variance in the criterion is accounted for simultaneously by all the predictors that were included in the

regression. Finally, we look at which of the b weights (if any) are significant. When a b weight is statistically

significant, this means that the predictor associated with that b weight is explaining a unique, nonredundant amount

of variance in the criterion that isn’t already accounted for by any of the other predictors in the regression.

This ability of multiple regression to provide information on the amount of unique variance a predictor accounts for

in a criterion after the variance accounted for by the other predictors is taken into account is one of its most

important features. We use this information to establish evidence of validity when more than one predictor is used to

predict a criterion. We do this by entering the predictors one at a time in a predetermined order into the regression.

After each predictor is entered into the regression, the total variance accounted for in the criterion (R²) is

recomputed. If R² significantly increases when a new predictor is entered, there is evidence that the new predictor is

accounting for additional variance in the criterion. If R² does not significantly increase when the new predictor is

entered, it means that the predictor does not explain any variance in the criterion that has not already been explained by the predictors that have already been entered into the regression. This increment in R² is called R² change or,

more simply, R²Δ and is also referred to as incremental validity.

It is important to understand that the change in R² observed each time a new variable is entered into the regression

will depend on the order we enter the predictors. When the predictors are correlated (which they almost always are),

it will mean that they are both partially explaining the same variance in the criterion. As a result, the predictor

entered first into the regression will be able to account for the largest amount of criterion variance. The next

predictor entered into the regression will be able to account only for that variance in the criterion that wasn’t already

accounted for by the first predictor. If the two predictors are highly correlated, then most of the variance in the

criterion that the second predictor could account for would already have been accounted for by the

first predictor. Therefore, the R²Δ for the second predictor would be very low. However, if the order of entry of the

predictors were reversed and the second predictor were entered first into the regression, it would now account for the

larger portion of the variance in the criterion, and the original first predictor would now account for only a little

additional variance. Therefore, the decision on the order that the predictors will be entered into the regression when

investigating incremental validity is critical and must be carefully considered and explained by the researcher. The

conclusions that are reached about the relative importance of the predictors might be very different just because of

the order in which they were entered into the regression.

A small example will help clarify what we have explained above. Presume that a human resources (HR) manager

wants to use two well-designed personality tests (call them Test H and Test N) to predict performance in a particular

job. From prior research, she knows that both tests have independently been shown to be valid predictors of job

performance in similar jobs, with validity coefficients of .30. So her thinking is that using both of the tests would be

more predictive than using only one of them. To make sure of this, she gives all employees currently in the job both

personality tests, and also collects the employees’ performance ratings. She analyzes her data using multiple

regression, using the test scores as the predictors and the performance ratings as the criterion. First she enters Test H

into the regression and is pleased to see that R² is statistically significant for predicting the performance ratings.

Next, she enters Test N into the regression and is surprised to see that the change in R² that occurs (R²Δ) is not

significant. Test N doesn’t seem to be adding any predictive ability at all. Just to check her results, she repeats the

regression. But this time she enters Test N into the regression first. Now, R² for Test N is significant, so she

proceeds to add Test H into the equation, and again, R² does not increase significantly. What the HR manager has

discovered is that both Test H and Test N are explaining the same variance in the criterion. Whichever test is entered

into the regression first is explaining all the variance that can be explained in the criterion by these personality tests,

leaving nothing for the second test to explain. Therefore, there is no incremental validity to be gained by using

the second test and no reason to include it when selecting employees. The HR manager will

need to decide which of the two tests she wants to include based on some other factor, such as cost.
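A few lines of code can illustrate the HR manager’s result. The sketch below is hypothetical: the validity coefficients of .30 come from the example, but the correlation of .95 between Test H and Test N is an assumed value, because the example does not report one. The function names are ours, and the calculation relies on a standard formula for R² with two predictors, which is stated in the first function.

```python
def r2_two_predictors(r_y1, r_y2, r_12):
    """R-squared for two predictors, from their correlations.

    Standard two-predictor formula:
    (r_y1^2 + r_y2^2 - 2*r_y1*r_y2*r_12) / (1 - r_12^2)
    """
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

def r2_change(r_first, r_second, r_12):
    """Return (R-squared after step 1, R-squared change at step 2)."""
    r2_step1 = r_first ** 2                      # first predictor entered alone
    r2_full = r2_two_predictors(r_first, r_second, r_12)
    return r2_step1, r2_full - r2_step1

r_h, r_n = 0.30, 0.30   # validity coefficients given in the example
r_hn = 0.95             # ASSUMED correlation between Test H and Test N

# Enter Test H first, then Test N
step1, delta = r2_change(r_h, r_n, r_hn)
print(f"H first: R2 = {step1:.3f}, R2 change for N = {delta:.4f}")

# Enter Test N first, then Test H
step1, delta = r2_change(r_n, r_h, r_hn)
print(f"N first: R2 = {step1:.3f}, R2 change for H = {delta:.4f}")

# Contrast: if the two tests were uncorrelated, the second would add a lot
step1, delta = r2_change(r_h, r_n, 0.0)
print(f"Uncorrelated tests: R2 change for second test = {delta:.3f}")
```

Under these assumed values, whichever test is entered first accounts for about 9% of the variance, and the second test adds less than half a percentage point. If the tests were uncorrelated, the second test would add another 9%.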

We have a final observation about the example above. We discussed only the R² value that resulted from the

regression, not the b weights associated with each predictor. This is because our interest was in determining the

incremental validity of the predictors (the tests). Earlier in this section, we said that when a b weight is statistically

significant, it means that the predictor associated with that b weight is explaining a unique, nonredundant amount of

variance in the criterion after all of the variance accounted for in the criterion by every other predictor is taken into

account. In the example above, although R² was significant, neither b weight would have been significant. This is

because b weights reflect only the amount of variance in the criterion that is not already explained by any other

predictor in the regression. In our example, Test H and Test N accounted for the same variance in the criterion. That

is, neither test was accounting for any unique variance over and above what the other test was already accounting

for. Therefore, although the overall regression was able to account for a significant amount of variance in the

criterion, the variance that each test was individually accounting for was redundant. As a result, neither b weight

would have been statistically significant.

The study described next is a good example of how researchers use multiple regression to gather evidence of

incremental validity when using more than one predictor.

Chibnall and Detrick (2003) published a study that examined the usefulness of three personality inventories—the

Minnesota Multiphasic Personality Inventory–2 (MMPI-2), the Inwald Personality Inventory (IPI; an established

police officer screening test), and the Revised NEO Personality Inventory (NEO PI-R)—for predicting the

performance of police officers. They administered the inventories to 79 police recruits and compared the test scores

with two criteria: academic performance and physical performance. Table 7.2 shows the outcome of the study for

the academic performance criterion.

Table 7.2 ■ Multiple Regression Model for Predicting Academic Performance of Police Recruits (R² = .55)

Step 1                  Step 2          Step 3          Step 4
Demographic Variables   IPI Scales      MMPI-2 Scales   NEO PI-R Scales
Recruit class           Trouble law     Depression      Assertiveness
Marital status          Antisocial      Hypomania       Ideas
Race                    Obsessiveness                   Depression
R²Δ = .20               R²Δ = .16       R²Δ = .08       R²Δ = .11

Source: Reprinted with permission from J. T. Chibnall and P. Detrick. (2003). “The NEO PI-R, Inwald Personality Inventory, and MMPI-2 in the

prediction of police academy performance: A case for incremental validity.” American Journal of Criminal Justice, 27(2), 233–248.

Note: Step refers to the order that a predictor is entered into the regression equation for predicting academic performance. Step 1 is the first

predictor entered, Step 2 is the second, and so on. The predictors are the individual demographic characteristics or the subscales that reached

significance. IPI = Inwald Personality Inventory, MMPI-2 = Minnesota Multiphasic Personality Inventory–2, NEO PI-R = Revised NEO

Personality Inventory. R²Δ is the proportion of incremental variance in academic performance contributed by each predictor when entered into

the equation in the order shown.

When the researchers entered the demographic variables of recruit class, marital status, and race into the regression

first, they jointly accounted for 20% of the variance in academic performance. In the second step, the researchers

entered the test scores from three IPI scales. Table 7.2 shows the contribution of the IPI scales that contributed

significantly to the prediction. Together, the three scales of the IPI contributed an additional 16% of the variance in

the criterion (R²Δ). In the third step, the researchers entered two scales of the MMPI-2, and together they accounted

for an additional 8% of the variance in the criterion. Finally, the researchers entered three scales of the NEO PI-R,

and together they accounted for another 11% of the variance. Altogether, the demographic characteristics and the

three inventories accounted for 55% of the variance in academic performance (R²).
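The way the increments accumulate across steps can be verified with simple arithmetic. The sketch below uses the four R²Δ values reported in Table 7.2. The variable names are our own.

```python
# R-squared change at each step, as reported in Table 7.2
steps = [
    ("Step 1: Demographic variables", 0.20),
    ("Step 2: IPI scales", 0.16),
    ("Step 3: MMPI-2 scales", 0.08),
    ("Step 4: NEO PI-R scales", 0.11),
]

cumulative = 0.0
for label, delta in steps:
    cumulative += delta   # total R-squared after this step
    print(f"{label}: R2 change = {delta:.2f}, cumulative R2 = {cumulative:.2f}")
# The final cumulative value is 0.55, the overall R2 shown in the table title.
```

Each increment reflects only the variance that was not already explained by the predictors entered in earlier steps, which is why the increments sum to the overall R².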

Physical performance was not predicted by demographic characteristics or most of the other tests included in the

study. Only three dimensions of the NEO PI-R accounted for a significant amount of variance in physical

performance (20%).

Chapter Summary

Evidence of validity based on test–criteria relations—the extent to which a test is related to independent behavior or

events—is one of the major methods for obtaining evidence of test validity. The usual method for demonstrating this

evidence is to correlate scores on the test with a measure of the behavior we wish to predict. This measure of

independent behavior or performance is called the criterion.

Evidence of validity based on test–criteria relations depends on evidence that the scores on the test correlate

significantly with an independent criterion—a standard used to measure some characteristic of an individual, such as

a person’s performance, attitude, or motivation. Criteria may be objective or subjective, but they must be reliable

and valid. There are two methods for demonstrating evidence of validity based on test–criteria relations: predictive

and concurrent.

There is a strong relationship between reliability and validity. If a test is not reliable, it will not correlate well with

any criterion due to random measurement error. The resulting reduction of the validity coefficient over what it

would have been if there were less measurement error in the predictor is called attenuation. There are statistical

procedures that can be used to correct for attenuation, but their use can be controversial.

We use correlation to describe the relationship between a psychological test and a criterion. In this case, the

correlation coefficient is referred to as the validity coefficient. Psychologists interpret validity coefficients using

tests of significance and the coefficient of determination. Statistical artifacts such as unreliability and restriction of range can

result in a reduction of the observed validity coefficient.

Either a linear regression equation or a multiple regression equation can be used to predict criterion scores from test

scores. Predictions of success or failure on the criterion enable test users to use test scores for making decisions

about hiring. When we use multiple regression we have to be aware that the predictors are likely to be correlated

with one another and that can complicate the interpretation of the results.

Engaging in the Learning Process

Learning is the process of gaining knowledge and skills through schooling or studying. Although you can learn by

reading the chapter material, attending class, and engaging in discussion with your instructor, more actively

engaging in the learning process may help you better learn and retain chapter information. To help you actively

engage in the learning process, we encourage you to access our new supplementary student workbook. The

workbook contains critical thinking activities to help you understand and apply information and help you make

progress toward learning and retaining material. If you do not have a copy of the workbook, you can purchase a

copy through sagepub.com.

Key Concepts

After completing your study of this chapter, you should be able to define each of the following

terms. These terms are bolded in the text of this chapter and defined in the Glossary.

• attenuation due to unreliability

• b weight

• coefficient of determination

• coefficient of multiple determination

• concurrent evidence of validity

• correction for attenuation

• criterion

• criterion contamination

• criterion-related validity

• incremental validity

• intercept

• linear regression

• multiple regression

• objective criterion

• operational validity

• peers

• predictive evidence of validity

• restriction of range

• slope

• subjective criterion

• test of significance

• validity coefficient

Critical Thinking Questions

The following are some critical thinking questions to support the learning objectives for this chapter.

Learning Objectives Critical Thinking Questions

Identify evidence of validity of a test based on

its relationships to external criteria and

describe two methods for obtaining this

evidence.

• What is the benefit of making a distinction between the

predictive and concurrent methods of establishing evidence of

validity based on test–criterion relationships?

• What do you think the impact would be on the results of a

predictive validity study conducted in an organization if an

unexpected layoff of personnel occurred before the study was

completed?

Read and interpret validity studies.

• If you were a test publisher and a client who bought your test

reported that a concurrent validity study they conducted didn’t

show evidence of validity, what are some of the questions you

would want to ask them to help understand their results?

• If you were asked to do an in-class presentation on the topic

of test validity based on a test’s relationship with other

variables, what are some of the criteria you would want to be

included in your professor’s evaluation of your presentation?

Why?

Discuss how restriction of range occurs and

its consequences.

• Under what circumstances would restriction of range not be

of concern when conducting a validity study based on test-

criterion relationships?

• Do you think that restriction of range could also be a

problem when estimating the reliability of a test? Explain.

Describe the differences between evidence of

validity based on test content and evidence

based on relationships with other variables.

• Do you think it would be possible for a test that had evidence

for validity based on content to not show evidence of validity

based on test-criterion relationships? Why or why not?

Describe the difference between

reliability/precision and validity.

• How would you explain, in your own words, how reliability

affects validity?

• Why is the following statement true: “A test can be reliable

but not valid, but it can’t be valid if it is not reliable”?

Define and give examples of objective and

subjective criteria, and explain why criteria

must be reliable and valid.

• If you had to develop an objective criterion to use in a test

validity study, what steps would you take to demonstrate that the

criterion itself was valid?

Interpret a validity coefficient, calculate the

coefficient of determination, and conduct a

test of significance for a validity coefficient.

• If test X was shown to have a validity coefficient of .4, and

test B had one of .3, how could you use the concept of the

coefficient of determination to quantify the differences between

the validities of the two tests?

• If you were tutoring a classmate on the topic of validity, how

would you explain the meaning of a statistically non-significant

validity coefficient?

Understand why measured validity will be

reduced by unreliability in the predictor or

criterion measure and what statistical

correction can be applied to adjust for this

reduction.

• Job performance reviews are usually a subjective criterion

that has been shown to have a fairly low reliability

coefficient of about .52. Nonetheless, they are a frequently used

criterion to validate tests used for employee selection. Why might

this fact be of concern in these types of validity studies?

• How might the conclusions drawn from a validity study be

affected if the correction for attenuation is applied to both the

predictor measure and the criterion measure instead of only to

the criterion measure as is usually recommended?

Explain the concept of operational or “true”

validity and how it is calculated.

• Why is the concept of operational validity sometimes called

“true validity”? In what sense is the validity “true”?

Explain the concept of regression, calculate

and interpret a linear regression formula, and

interpret a multiple regression formula.

• What are some of the characteristics that are shared between

linear regression and multiple regression? What are some of the

differences?

• Why do you think that it is important for a researcher who is

trying to develop a battery of tests to predict a psychological

disorder to understand the concept of incremental validity?

• A linear regression equation includes a number of important

elements that help us understand the relationship between a

predictor variable and a criterion variable. What are they and

how does each one help us understand the relationship?

• Some people may think that if we want to predict an outcome

using tests, the more tests we use, the better the prediction will

be. Why is this statement often false? When might it be true?

Descriptions of Images and Figures

The figure has two circles, which are connected by downward vertical arrows to two boxes. The

circles are adjacent to each other and so are the boxes.

The circle on the left is labeled Predictor Construct (Conscientiousness). An arrow from the

circle is pointing to the box below it, which is labeled Predictor Measure (Personality Test).

The circle on the right is labeled Criterion Construct (Job Performance). An arrow from the

circle is pointing to a box below it, which is labeled Criterion Measure (Performance Appraisal).

A horizontal arrow pointing from Predictor Construct to Criterion Construct is labeled True

Score Validity (construct level).

A diagonal arrow pointing from Predictor Measure to Criterion Construct is labeled Operational

(true) Validity

A horizontal arrow pointing from Predictor Measure to Criterion Measure is labeled Observed

Score Validity (measurement level).
