
The American Journal of Medicine (2006) 119, 166.e7-166.e16

REVIEW

Current Concepts in Validity and Reliability for Psychometric Instruments: Theory and Application

David A. Cook, MD, MHPE, Thomas J. Beckman, MD, FACP

Division of General Internal Medicine, Mayo Clinic College of Medicine, Rochester, Minn.

ABSTRACT

Validity and reliability relate to the interpretation of scores from psychometric instruments (eg, symptom scales, questionnaires, education tests, and observer ratings) used in clinical practice, research, education, and administration. Emerging paradigms replace prior distinctions of face, content, and criterion validity with the unitary concept “construct validity,” the degree to which a score can be interpreted as representing the intended underlying construct. Evidence to support the validity argument is collected from 5 sources:

● Content: do instrument items completely represent the construct?
● Response process: the relationship between the intended construct and the thought processes of subjects or observers
● Internal structure: acceptable reliability and factor structure
● Relations to other variables: correlation with scores from another instrument assessing the same construct
● Consequences: do scores really make a difference?

Evidence should be sought from a variety of sources to support a given interpretation. Reliable scores are necessary, but not sufficient, for valid interpretation. Increased attention to the systematic collection of validity evidence for scores from psychometric instruments will improve assessments in research, patient care, and education. © 2006 Elsevier Inc. All rights reserved.

KEYWORDS: Construct validity; Reproducibility of results; Educational measurement; Medical education; Quality of life; Questionnaire

Requests for reprints should be addressed to: David A. Cook, MD, MHPE, Baldwin 4-A, Division of General Internal Medicine, Mayo Clinic College of Medicine, 200 First Street SW, Rochester, MN 55905.

E-mail address: [email protected].

0002-9343/$ -see front matter © 2006 Elsevier Inc. All rights reserved. doi:10.1016/j.amjmed.2005.10.036

Physicians must be skilled in assessing the quality of outcomes reported in the literature and obtained from instruments in clinical practice. Frequently these outcomes are assessed using instruments such as scales, questionnaires, education tests, and observer ratings that attempt to measure factors such as symptoms, attitudes, knowledge, or skills in various settings of medical practice (Table 1).1-9 For the purposes of this article, we will refer to all such instruments as psychometric. The term “validity” refers to the degree to which the conclusions (interpretations) derived from the results of any assessment are “well-grounded or justifiable; being at once relevant and meaningful.”10 However, the skills required to assess the validity of results from psychometric assessments are different than the skills used in appraising the medical literature11 or interpreting the results of laboratory tests.12 In a recent review of clinical teaching assessment, we found that validity and reliability were frequently misunderstood and misapplied.13 We also have noted that research studies with sound methods often fail to present a broad spectrum of validity evidence supporting the primary outcome.6,14-16 Thus, we recognized a need for further discussion of validity in the context of psychometric instruments and how this relates to clinical research and practice.

Methods for evaluating the validity of results from psychometric assessments derive from theories of psychology and educational assessment,17,18 and there is extensive literature in these disciplines. However, we are not aware of


166.e8 The American Journal of Medicine, Vol 119, No 2, February 2006

recent reviews for physicians. Furthermore, within the psychologic literature there is variation in terminology and practices. In an attempt to establish a unified approach to validity, the American Psychological Association published standards that integrate emerging concepts.19 These standards readily translate to medical practice and research and provide a comprehensive approach for assessing the validity of results derived from psychometric instruments. This article will discuss this model and its application to clinical medicine, research, and education. Reliability, a necessary element of validity, will also be discussed within this framework.

VALIDITY, CONSTRUCTS, AND MEANINGFUL INTERPRETATION OF INSTRUMENT SCORES

Validity refers to “the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of tests.”19 In other words, validity describes how well one can legitimately trust the results of a test as interpreted for a specific purpose. Many instruments measure a physical quantity such as height, blood pressure, or serum sodium. Interpreting the meaning of such results is straightforward.20 In contrast, results from assessments of patient symptoms, student knowledge, or physician attitudes have no inherent meaning. Rather, they attempt to measure an underlying construct, an “intangible collection of abstract concepts and principles.”21 The results of any psychometric assessment have meaning (validity) only in the context of the construct they purport to assess.17 Table 2 lists constructs (inferences) for selected instruments.3,5,8,22 Because the validity of an instrument’s scores hinges on the construct, a clear definition of the intended construct is the first step in any validity evaluation. Note that many of the constructs listed in Table 2 would benefit from more precision and clarity.

Validity is not a property of the instrument, but of the instrument’s scores and their interpretations.17,19 For example, an instrument originally developed for depression screening might be legitimately considered for assessing anxiety. In contrast, we would expect cardiology board examination scores to accurately assess the construct “knowledge of cardiology,” but not “knowledge of pulmonary medicine” or “procedural skill in coronary angiography.” Note that the instruments in these examples did not change, only the score interpretations.

Because validity is a property of inferences, not instruments, validity must be established for each intended interpretation. In the example above, the depression instrument’s scores would require further study before use in assessing anxiety. Similarly, a patient symptom scale whose scores provided valid inferences under research study conditions or in highly selected patients may need further evaluation before use in a typical clinical practice.

CLINICAL SIGNIFICANCE

● Best clinical, research, and educational practice requires sound assessment methods. This article presents an innovative framework for evaluating the validity of scores from instruments such as symptom scales, questionnaires, education tests, and observer ratings.

● Validity is viewed as a carefully structured argument assembling evidence from a variety of sources to support or refute proposed interpretations of instrument scores.

● A thorough understanding of this framework will transform how physicians approach validity.

Table 1 Examples of psychometric instruments used in medical practice

| Medical setting | Type of instrument | Specific examples |
| --- | --- | --- |
| Clinical practice | Symptom or disease severity scale | AUA-SI symptom score for BPH1 |
| | Screening tool | CAGE screen for alcoholism,2 PRIME-MD3 screen for depression |
| Research | Symptom or disease severity scale | AUA-SI,1 KCCQ4 |
| | Quality of life inventory | LCSS5 |
| | Questionnaire (survey) | Survey of teens regarding tobacco use6 |
| Education | Written examination | USMLE Step 1,7 locally developed multiple-choice exam |
| | Objective structured clinical examination or standardized patient examination | USMLE Step 2 Clinical Skills,7 locally developed test of interviewing skill |
| | Learner or teacher assessment | Mini-CEX,8 SFDP-269 |
| | Course evaluation | Locally developed evaluation form |
| Administration | Questionnaire (survey) | Staff or patient satisfaction survey |

AUA-SI = American Urological Association Symptom Index; PRIME-MD = Primary Care Evaluation of Mental Disorders; USMLE = United States Medical Licensing Exam; Mini-CEX = Mini-clinical evaluation exercise; SFDP-26 = Stanford Faculty Development Program Questionnaire; KCCQ = Kansas City Cardiomyopathy Questionnaire; LCSS = Lung Cancer Symptom Scale; BPH = benign prostatic hypertrophy.

Table 2 Potential inferences and sources of validity evidence for scores from selected psychometric instruments

The last 5 columns list potential sources of information for each validity evidence category.

| Instrument type | Sample instrument | Intended inference from scores* | Content | Response process | Internal structure | Relations to other variables | Consequences |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Multiple-choice exam | Internal medicine certifying exam | “Competence in the diagnosis and treatment of common conditions . . . and excellence in the broad domain of internal medicine”22 | Test blueprint; qualifications of question writers; well-written questions | Clarity of instructions; student thought process as he or she answers the questions; test security and scoring | Internal consistency; item discrimination | Correlation with clinical rotation grades, scores on other tests, or long-term follow-up of patient outcomes | Method of determining exam pass/fail score; differential pass/fail rates among examinees expected to perform similarly |
| Clinical performance evaluation | Mini-CEX | “Clinical competence of candidates for certification”8 | Test blueprint; qualifications of question writers; well-written questions | Rater training; rater thought process as he or she observes performer; test scoring | Inter-rater reliability; factor analysis to identify distinct dimensions of clinical performance | Correlation with scores on other performance assessments | Method of determining pass/fail score; differential pass/fail rates among examinees expected to perform similarly |
| Patient assessment | PRIME-MD | This patient has one or more “of 18 possible current mental disorders”3 | Qualifications of question writers; well-written questions; evidence that questions adequately represent domain | Language barrier; patient thought process as he or she answers the questions | Test-retest reliability; internal consistency | Correlation with clinically diagnosed depression; scores from other depression assessments, or health care use | Method of determining score thresholds; improvement in patient outcomes after implementation of this instrument |
| Questionnaire | Lung Cancer Symptom Scale | “Physical and functional dimensions of quality of life”5 | Well-written questions; evidence that questions adequately represent domain | Language barrier; patient thought process as he or she answers the questions | Internal consistency; factor analysis | Correlation with an objective assessment of quality of life, eg, hospitalization | Improvement in patient outcomes after implementation of this instrument |

Mini-CEX = Mini-clinical evaluation exercise; PRIME-MD = Primary Care Evaluation of Mental Disorders.

*Intended inference as represented by instrument authors in cited publication.

166.e9 Cook and Beckman Validity and Reliability of Psychometric Instruments


A Conceptual Approach to Validity

We often read about “validated instruments.” This conceptualization implies a dichotomy: either the instrument is valid or it is not. This view is inaccurate. First, we must remember that validity is a property of the inference, not the instrument. Second, the validity of interpretations is always a matter of degree. An instrument’s scores will reflect the underlying construct more accurately or less accurately but never perfectly.

Validity is best viewed as a hypothesis or “interpretive argument” for which evidence is collected in support of proposed inferences.17,23,24 As Downing states, “Validity requires an evidentiary chain which clearly links the interpretation of . . . scores . . . to a network of theory, hypotheses, and logic which are presented to support or refute the reasonableness of the desired interpretations.”21 As with any hypothesis-driven research, the hypothesis is clearly stated, evidence is collected to evaluate the most problematic assumptions, and the hypothesis is critically reviewed, leading to a new cycle of tests and evidence “until all inferences in the interpretive argument are plausible, or the interpretive argument is rejected.”25 However, validity can never be proven.

Validity has traditionally been separated into 3 distinct types, namely, content, criterion, and construct validity.26 However, contemporary thinking on the subject suggests that these distinctions are arbitrary17,19 and that all validity should be conceptualized under one overarching framework, “construct validity.” This approach underscores the reasoning that an instrument’s scores are only useful inasmuch as they reflect a construct and that evidence should be collected to support this relationship. The distinct concepts of content and criterion validity are preserved as sources of validity evidence within the construct validity rubric, as discussed below.

Sources of Validity Evidence

Messick17 identifies 5 sources of evidence to support construct validity: content, response process, internal structure, relations to other variables, and consequences. These are not different types of validity but rather they are categories of evidence that can be collected to support the construct validity of inferences made from instrument scores. Evidence should be sought from several different sources to support any given interpretation, and strong evidence from one source does not obviate the need to seek evidence from other sources. While accruing evidence, one should specifically consider two threats to validity: inadequate sampling of the content domain (construct underrepresentation) and factors exerting nonrandom influence on scores (bias, or construct-irrelevant variance).24,27 The sources of validity evidence are discussed below, and examples are provided in Table 2.

Content. Content evidence involves evaluating the “relationship between a test’s content and the construct it is intended to measure.”19 The content should represent the truth (construct), the whole truth (construct), and nothing but the truth (construct). Thus, we look at the construct definition, the instrument’s intended purpose, the process for developing and selecting items (the individual questions, prompts, or cases comprising the instrument), the wording of individual items, and the qualifications of item writers and reviewers. Content evidence is often presented as a detailed description of steps taken to ensure that the items represent the construct.28

Response Process. Reviewing the actions and thought processes of test takers or observers (response process) can illuminate the “fit between the construct and the detailed nature of performance . . . actually engaged in.”19 For example, educators might ask, “Do students taking a test intended to assess diagnostic reasoning actually invoke higher-order thinking processes?” They could approach this problem by asking a group of students to “think aloud” as they answer questions. If an instrument requires one person to rate the performance of another, evidence supporting response process might show that raters have been properly trained. Data security and methods for scoring and reporting results also constitute evidence for this category.21

Internal Structure. Reliability29,30 (discussed below and in Table 3) and factor analysis31,32 data are generally considered evidence of internal structure.21,31 Scores intended to measure a single construct should yield homogeneous results, whereas scores intended to measure multiple constructs should demonstrate heterogeneous responses in a pattern predicted by the constructs. Furthermore, systematic variation in responses to specific items among subgroups who were expected to perform similarly (termed “differential item functioning”) suggests a flaw in internal structure, whereas confirmation of predicted differences provides supporting evidence in this category.19 For example, if Hispanics consistently answer a question one way and Caucasians answer another way, regardless of other responses, this will weaken (or support, if this was expected) the validity of intended interpretations. This contrasts with subgroup variations in total score, which reflect relations to other variables as discussed next.
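As an illustration of the internal consistency evidence mentioned above, the following is a minimal sketch of Cronbach's alpha computed from a respondents-by-items score matrix, using the standard formula alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores). The function name `cronbach_alpha` and the response data are hypothetical and are not taken from any instrument discussed in this article.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items score matrix.

    scores: list of rows, one row per respondent, one column per item.
    """
    n_items = len(scores[0])
    if len(scores) < 2 or n_items < 2:
        raise ValueError("need at least 2 respondents and 2 items")
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: 5 respondents, 4 items scored 1 to 5.
hypothetical_scores = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]
print(f"Cronbach's alpha: {cronbach_alpha(hypothetical_scores):.2f}")
```

A high value suggests that the items vary together, which is consistent with (but does not prove) measurement of a single construct.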

Relations to Other Variables. Correlation with scores from another instrument or outcome for which correlation would be expected, or lack of correlation where it would not, supports interpretation consistent with the underlying construct.18,33 For example, correlation between scores from a questionnaire designed to assess the severity of benign prostatic hypertrophy and the incidence of acute urinary retention would support the validity of the intended inferences. For a quality of life assessment, score differences among patients with varying health states would support validity.
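Evidence of this kind is often summarized with a correlation coefficient. The sketch below computes Pearson's r between scores from two instruments; the function name `pearson_r` and both score lists are hypothetical and serve only to show the calculation.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for the same 6 patients on two symptom instruments.
new_instrument = [5, 12, 8, 20, 15, 3]
established_instrument = [7, 14, 9, 22, 13, 4]
print(f"r = {pearson_r(new_instrument, established_instrument):.2f}")
```

A coefficient near zero where a relationship was predicted, or a strong one where none was predicted, would count against the intended interpretation.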

Consequences. Evaluating intended or unintended consequences of an assessment can reveal previously unnoticed

Table 3 Different ways to assess reliability*

| Source of reliability | Description | Measures | Definitions | Comments |
| --- | --- | --- | --- | --- |
| Internal consistency | Do all the items on an instrument measure the same construct? (If an instrument measures more than one construct, a single score will not measure either construct very well. We would expect high correlation between item scores measuring a single construct.) Note: Internal consistency is probably the most commonly reported reliability statistic, in part because it can be calculated after a single administration of a single instrument. Because instrument halves can be considered “alternate forms,” internal consistency can be viewed as an estimate of parallel forms reliability. | Split-half reliability | Correlation between scores on the first and second halves of a given instrument | Rarely used in practice because the “effective” instrument is only half as long as the actual instrument; the Spearman-Brown† formula can adjust this result |
| | | Kuder-Richardson | Similar concept to split-half, but accounts for all items | Assumes all items are equivalent, measure a single construct, and have dichotomous responses |
| | | Cronbach’s alpha | A generalized form of the Kuder-Richardson formulas | Assumes all items are equivalent and measure a single construct; can be used with dichotomous or continuous data |
| Temporal stability | Does the instrument produce similar results when administered a second time? | Test-retest reliability | Administer the instrument to the same person at different times | Usually quantified using correlation (eg, Pearson’s r) |
| Parallel forms | Do different versions of the “same” instrument produce similar results? | Alternate forms reliability | Administer different versions of the instrument to the same individual at the same or different times | Usually quantified using correlation (eg, Pearson’s r) |
| Agreement (inter-rater reliability) | When using raters, does it matter who does the rating? Is one rater’s score similar to another’s? | Percent agreement | Percent of identical responses | Does not account for agreement that would occur by chance |
| | | Phi | Simple correlation | Does not account for chance |
| | | Kappa | Agreement corrected for chance | |
| | | Kendall’s tau | Agreement on ranked data | |
| | | Intraclass correlation coefficient | Uses analysis of variance to estimate how well ratings from different raters coincide | |
| Generalizability theory | How much of the error in measurement is the result of each factor (eg, item, item grouping, subject, rater, day of administration) involved in the measurement process? | Generalizability coefficient | Complex model that allows estimation of multiple sources of error | As the name implies, this elegant method is “generalizable” to virtually any setting in which reliability is assessed; for example, it can determine the relative contribution of internal consistency and inter-rater reliability to the overall reliability of a given instrument |

For more details regarding the concepts in this table, please see references.30,37-41

This table adapted from Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med. 2004;19:971; used with permission from Blackwell Publishing.

*“Items” are the individual questions on the instrument. The “construct” is what is being measured, such as knowledge, attitude, skill, or symptom in a specific area.

†The Spearman-Brown “prophecy” formula allows one to calculate the reliability of an instrument’s scores when the number of items is increased (or decreased).


sources of invalidity. For example, if a teaching assessment shows that male instructors are consistently rated lower than females it could represent a source of unexpected bias. It could also mean that males are less effective teachers. Evidence of consequences thus requires a link relating the observations back to the original construct before it can truly be said to influence the validity of inferences. Another way to assess evidence of consequences is to explore whether desired results have been achieved and unintended effects avoided. In the example just cited, if highly rated faculty ostracized those with lower scores, this unexpected negative outcome would certainly affect the meaning of the scores and thus their validity.17 On the other hand, if remediation of faculty with lower scores led to improved performance, it would support the validity of these interpretations. Finally, the method used to determine score thresholds (eg, pass/fail cut scores or classification of symptom severity as low, moderate, or high) also falls under this category.21 Evidence of consequences is the most controversial category of validity evidence and was the least reported evidence source in our recent review of instruments used to assess clinical teaching.34

Integrating the Evidence. The words “intended” and “predicted” are used frequently in the above paragraphs. Each line of evidence relates back to the underlying (theoretical) construct and will be most powerful when used to confirm relationships stated a priori.17,25 If evidence does not support the original validity argument, the argument “may be rejected, or it may be improved by adjusting the interpretation and/or the measurement procedure”25 after which the argument must be evaluated anew. Indeed, validity evaluation is an ongoing cycle of testing and revision.17,31,35

The amount of evidence necessary will vary according to the proposed uses of the instrument. Circumstances requiring a high degree of confidence in the accuracy of interpretations (eg, high-stakes board certification or the primary outcome in a research study) will mandate more evidence than settings where a lower degree of confidence is acceptable. Some instrument types will rely more heavily on certain categories of validity evidence than others.21 For example, observer ratings (eg, medical student clinical assessments) should show strong evidence of internal structure characterized by high inter-rater agreement. Interpretations for multiple-choice exams, on the other hand, should have abundant content evidence. Both types of instrument would, of course, benefit greatly from multiple sources of evidence. Interpretations informing important decisions in any setting should be based on substantial validity evidence from multiple sources. Recent authors have proposed that the validity arguments for directly observable attributes (eg, handwashing habits) and those for observations intended to reflect a latent or theoretical trait (eg, feelings about disease prevention) are inherently different.18,25 If accepted, this model will provide additional guidance regarding the relative importance of the various evidence sources.36

What About Face Validity?

Although the expression “face validity” has many meanings, it is usually used to describe the appearance of validity in the absence of empirical testing. This is akin to estimating the speed of a car based on its outward appearance or the structural integrity of a building based on a view from the curb. Such judgments amount to mere guesswork. The concepts of content evidence and face validity bear superficial resemblance but are in fact quite different. Whereas content evidence represents a systematic and documented approach to ensure that the instrument assesses the desired construct, face validity bases judgment on the appearance of the instrument. Downing and Haladyna note, “Superficial qualities . . . may represent an essential characteristic of the assessment, but . . . the appearance of validity is not validity.”27 DeVellis37 cites additional concerns about face validity, including fallibility of judgments based on appearance, differing perceptions among developers and users, and instances in which inferring intent from appearance might be counterproductive. For these reasons, we discourage use of this term.

RELIABILITY: NECESSARY, BUT NOT SUFFICIENT, FOR VALID INFERENCES

Reliability refers to the reproducibility or consistency of scores from one assessment to another.19 Reliability is a necessary, but not sufficient, component of validity.21,29 An instrument that does not yield reliable scores does not permit valid interpretations. Imagine obtaining blood pressure readings of 185/100 mm Hg, 80/40 mm Hg, and 140/70 mm Hg in 3 consecutive measurements over a 3-minute period in an otherwise stable patient. How would we interpret these results? Given the wide variation of readings, we would be unlikely to accept the average (135/70 mm Hg), nor would we rely on the first reading alone. Rather, we would probably conclude that the measurements are unreliable and seek additional information. Scores from psychometric instruments are just as susceptible to unreliability, but with one crucial distinction: It is often impractical or even impossible to obtain multiple measurements in a single individual. Thus, it is essential that ample evidence be accumulated to establish the reliability of scores before using an instrument in practice.
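The arithmetic behind the blood pressure example can be checked with a short sketch. The three readings are those given in the text; the variable names are illustrative only.

```python
from statistics import mean

# The 3 consecutive readings from the example, as (systolic, diastolic) in mm Hg.
readings = [(185, 100), (80, 40), (140, 70)]
systolic = [s for s, _ in readings]
diastolic = [d for _, d in readings]

mean_systolic = mean(systolic)      # (185 + 80 + 140) / 3 = 135
mean_diastolic = mean(diastolic)    # (100 + 40 + 70) / 3 = 70

# The spread shows why the average is not trustworthy on its own.
range_systolic = max(systolic) - min(systolic)      # 185 - 80 = 105
range_diastolic = max(diastolic) - min(diastolic)   # 100 - 40 = 60

print(f"Mean: {mean_systolic}/{mean_diastolic} mm Hg")
print(f"Range: {range_systolic} mm Hg systolic, {range_diastolic} mm Hg diastolic")
```

The mean reproduces the 135/70 mm Hg quoted above, while a systolic range of 105 mm Hg over 3 minutes illustrates how poorly any single reading represents the patient.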

There are numerous ways to categorize and measure reliability (Table 3).30,37-41 The relative importance of each measure will vary according to the instrument type.30 Internal consistency measures how well the scores for individual items on the instrument correlate with each other and provides an approximation of parallel form reliability (see below). We would expect that scores measuring a single construct would correlate highly (high internal consistency). If internal consistency is low, it raises the possibility that the scores are, in fact, measuring more than one construct. Reproducibility over time (test-retest), between different versions of an instrument (parallel forms), and between raters (inter-rater) are other measures of reliability. The Appendix contains more information on interpretation of these measures.
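For inter-rater reliability, Table 3 distinguishes percent agreement from kappa, which corrects for chance. The following minimal sketch contrasts the two for a pair of raters using the standard Cohen's kappa formula, (observed agreement - chance agreement) / (1 - chance agreement). The function names and the pass/fail ratings are hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of cases on which the two raters gave identical ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions,
    # summed over all rating categories.
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n) for c in set(rater_a) | set(rater_b)
    )
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail ratings of 8 learners by two raters.
rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
rater_b = ["pass", "pass", "fail", "pass", "pass", "pass", "fail", "fail"]
print(f"Percent agreement: {percent_agreement(rater_a, rater_b):.2f}")
print(f"Cohen's kappa: {cohen_kappa(rater_a, rater_b):.2f}")
```

Kappa is lower than percent agreement whenever some agreement would be expected by chance alone, which is why the two statistics should not be interpreted interchangeably.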

Generalizability theory42 provides a unifying framework for the various reliability measures. Under this framework the unreliability of scores can be attributed to various sources of error (called facets), such as item variance, rater variance, and subject variance. Generalizability studies use analysis of variance to quantify the contribution of each error source to the overall error (unreliability) of the scores, just as analysis of variance does in clinical research. For further reading, see Shavelson and Webb’s43 primer on generalizability theory.
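To make the variance decomposition concrete, the sketch below works through the simplest case: a fully crossed design in which every subject is rated once by every rater. It estimates variance components from the two-way analysis of variance mean squares and computes a relative generalizability coefficient for the mean rating across raters. The function name `g_study`, the choice of a single rater facet, and the ratings are assumptions made for illustration; real generalizability studies typically involve more facets and dedicated software.

```python
def g_study(ratings):
    """Variance components for a fully crossed subjects-by-raters design.

    ratings[p][r] is the score given to subject p by rater r
    (one observation per cell, no missing data).
    """
    n_p, n_r = len(ratings), len(ratings[0])
    if n_p < 2 or n_r < 2:
        raise ValueError("need at least 2 subjects and 2 raters")
    grand = sum(sum(row) for row in ratings) / (n_p * n_r)
    subject_means = [sum(row) / n_r for row in ratings]
    rater_means = [sum(row[r] for row in ratings) / n_p for r in range(n_r)]

    # Sums of squares for the two-way layout without replication.
    ss_p = n_r * sum((m - grand) ** 2 for m in subject_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_res = ss_total - ss_p - ss_r

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = max(ss_res / ((n_p - 1) * (n_r - 1)), 0.0)

    # Variance components from expected mean squares (negatives set to 0).
    var_res = ms_res
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    var_r = max((ms_r - ms_res) / n_p, 0.0)

    # Relative G coefficient for the mean of n_r raters.
    denominator = var_p + var_res / n_r
    g_coefficient = var_p / denominator if denominator > 0 else float("nan")
    return {
        "subject": var_p,
        "rater": var_r,
        "residual": var_res,
        "g_coefficient": g_coefficient,
    }


# Hypothetical ratings: 4 subjects (rows) scored by 3 raters (columns).
hypothetical_ratings = [
    [7, 8, 7],
    [4, 5, 3],
    [9, 9, 8],
    [5, 6, 6],
]
for name, value in g_study(hypothetical_ratings).items():
    print(f"{name}: {value:.2f}")
```

In this layout the subject component is the "true" variance of interest, while the rater and residual components are sources of error; a large rater component would point toward rater training rather than instrument revision.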

We emphasize that although reliability is prerequisite to validity, it is not sufficient.29 This contrasts with what we have observed in the literature, where reliability is frequently cited as the sole evidence supporting a “valid instrument.”13,34 As noted above, evidence should be accumulated from multiple sources to support the validity of inferences drawn from a given instrument’s scores. Reliability constitutes only one form of evidence. It is also important to note that reliability, like validity, is a property of the score and not the instrument itself.30 The same instrument, used in a different setting or with different subjects, can demonstrate wide variation in reliability.29,41

PRACTICAL APPLICATION OF VALIDITY CONCEPTS IN SELECTING AN INSTRUMENT

Consumers of previously developed psychometric instruments in clinical practice, research, or education need to carefully weigh the evidence supporting the validity of the interpretations they are trying to make. Scores from a popular instrument may not have evidence to justify their use. Many authors cite evidence from only one or two sources, such as reliability or correlation with another instrument’s scores, to support the validity of interpretations. Such instruments should be used with caution. To illustrate the application of these principles in selecting an instrument, we will systematically evaluate an instrument to assess symptoms of benign prostatic hypertrophy in English-speaking men.

First we must identify potential instruments. Reviewing articles from a MEDLINE search using the terms “prostatic hyperplasia” and “symptom” reveals multiple instruments used to assess benign prostatic hypertrophy symptoms.1,44-48 The American Urological Association Symptom Index1 (AUA-SI, also known as the International Prostate Symptom Score) seems to be by far the most commonly used instrument. After confirming our impression with a local expert, we select this instrument for further review.

Content evidence for AUA-SI scores is abundant and fully supportive.1 The instrument authors reviewed both published and unpublished sources to develop an initial item pool that reflected the desired content domain. Word choice, time frame, and response set were carefully defined. Items were deleted or modified after pilot testing.

Some response process evidence is available. Patient ebriefing revealed little ambiguity in wording, except for ne question that was subsequently modified.1 Scores from elf-administration or interview are similar.49

Internal structure is supported by good to excellent in- ernal consistency and test-retest reliability,1,49,50 although ot all studies confirm this.51 Factor analysis confirms two heorized subscales.50,52

In regard to relations to other variables, AUA-SI scores istinguished patients with clinical benign prostatic hyper- rophy from young healthy controls,1 correlated with other ndices of benign prostatic hypertrophy symptoms,53 and mproved after prostatectomy.54 Another study found that atients with a score decrease of 3 points felt slightly im- roved.51 However, a study found no significant association etween scores and urinary peak flow or postvoid esidual.55

Evidence of consequences is minimal. Thresholds for mild, moderate, and severe symptoms were developed by comparing scores with global symptom ratings,1 suggesting that such classifications are meaningful. One study56 found that 81% of patients with mild symptoms did not require therapy over 2 years, again supporting the meaning (validity) of these scores. More meaningful evidence of consequences might come from a study comparing the outcomes of men whose treatment was guided by the AUA-SI with those of men whose treatment was guided by clinical judgment alone, but we are not aware of such a study.

In summary, AUA-SI scores are well supported by evidence of content, internal structure, relations to other variables, and to a lesser extent response process, whereas evidence of consequences is minimal. These scores are likely to be useful, although their meaning (consequences on patient care) could be studied further. For completeness we ought to similarly evaluate some of the other available instruments. Also, because validity and reliability evidence may not generalize to new settings, we should collect confirmatory data in our own clinic.

PRACTICAL APPLICATION OF VALIDITY CONCEPTS IN DEVELOPING AN INSTRUMENT
When developing psychometric instruments, careful attention should again be given to each category of validity evidence in turn. To illustrate the application of these principles, we will discuss how evidence could be planned, collected, and documented when developing an assessment of clinical performance for internal medicine residents.

The first step in developing any instrument is to identify the construct and corresponding content. In our example we could look at residency program objectives and other published objectives such as Accreditation Council for Graduate Medical Education competencies,57 search the literature on qualifications of ideal physicians, or interview faculty and residents. We also should search the literature for previously published instruments, which might be used verbatim or adapted. From the themes (constructs) identified we would develop a blueprint to guide creation of individual questions. Questions would ideally be written by faculty trained in question writing and then checked for clarity by other faculty.

For response process, we would ensure that the response format is familiar to faculty, or if not (eg, if we use computer-based forms), that faculty have a chance to practice with the new format. Faculty should receive training in both learner assessment in general and our form specifically, with the opportunity to ask questions. We would ensure security measures and accurate scoring methods. We could also conduct a pilot study in which we ask faculty to “think out loud” as they observe and rate several residents.

In regard to internal structure, inter-rater reliability is critical so we would need data to calculate this statistic. Internal consistency is of secondary importance for performance ratings,30 but this and factor analysis would be useful to verify that the themes or constructs we identified during development hold true in practice.
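As an illustration of the internal consistency check mentioned above, the following minimal sketch computes Cronbach's alpha, one common internal consistency coefficient. The choice of alpha, the function name `cronbach_alpha`, and the rating data are assumptions made for this example; they are not taken from the instrument described in the text.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a score matrix.

    ratings: list of rows, one per resident; each row holds that
    resident's scores on every item of the form.
    """
    n_items = len(ratings[0])
    # Sample variance of each item (column) across residents.
    item_vars = [variance([row[i] for row in ratings]) for i in range(n_items)]
    # Sample variance of each resident's total score.
    total_var = variance([sum(row) for row in ratings])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 residents rated on 4 items (1-5 scale).
ratings = [
    [4, 4, 5, 4],
    [3, 3, 3, 2],
    [5, 4, 5, 5],
    [2, 3, 2, 2],
    [4, 5, 4, 4],
]
alpha = cronbach_alpha(ratings)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Alpha rises when items vary together across residents, which is why it is read as evidence that the items reflect a common construct. It does not replace an inter-rater reliability estimate, which needs ratings of the same residents by more than one rater.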

For relations to other variables, we could correlate our instrument scores with scores from another instrument assessing clinical performance. Note, however, that this comparison is only as good as the instrument with which comparison is made. Thus, comparing our scores with those from an instrument with little supporting evidence would have limited value. Alternatively, we could compare the scores from our instrument with United States Medical Licensing Examination scores, scores from an in-training exam, or any other variable that we believe is theoretically related to clinical performance. We could also plan to compare results among different subgroups. For example, if we expect performance to improve over time, we could compare scores among postgraduate years. Finally, we could follow residents into fellowship or clinical practice and see whether current scores predict future performance.
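The comparison with another variable described above can be sketched as a Pearson correlation. The function name `pearson_r` and both score lists are hypothetical, chosen only to show the calculation; a real analysis would also report a confidence interval and consider the reliability of both measures.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical data: mean rating on the new form and in-training exam score
# for the same six residents.
instrument_scores = [3.1, 3.8, 4.2, 2.9, 4.6, 3.5]
exam_scores = [410, 480, 520, 400, 560, 450]
print(f"r = {pearson_r(instrument_scores, exam_scores):.2f}")
```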

Last, we should not neglect evidence of consequences. If we have set a minimum passing score below which remedial action will be taken, we must clearly document how this score was determined. If subgroup analysis reveals unexpected relationships (eg, if a minority group is consistently rated lower than other groups), we should investigate whether this finding reflects on the validity of the test. Finally, if low-scoring residents receive remedial action, we could perform follow-up to determine whether this intervention was effective, which would support the inference that intervention was warranted.

It should now be clear that the collection of validity evidence requires foresight and careful planning. Much of the data described above will not be available without conscious effort. We encourage developers or researchers of psychometric instruments to systematically use the 5 sources of validity evidence as a framework when developing or evaluating instruments.

CONCLUSION
A clear understanding of validity and reliability in psychometric assessment is essential for practitioners in diverse medical settings. As Foster and Cone note, “Science rests on the adequacy of its measurement. Poor measures provide a weak foundation for research and clinical endeavors.”18 Validity concerns the degree to which scores reflect the intended underlying construct, and refers to the interpretation of results rather than the instrument itself. It is best viewed as a carefully structured argument in which evidence is assembled to support or refute proposed interpretations of results. Reproducible (reliable) results are necessary, but not sufficient, for valid inferences to be drawn. Although this review focused on psychometric instruments, many of the concepts discussed here have implications for other health care applications such as rater agreement in radiology,58 illness severity scales,59,60 data abstraction forms, and even clinical pathways.61 Increased attention to the systematic collection and appraisal of validity evidence will improve assessments in research, education, and patient care.

ACKNOWLEDGMENTS
We thank Steven M. Downing, PhD (University of Illinois at Chicago, Department of Medical Education), for his insights and constructive critique.

APPENDIX: INTERPRETATION OF RELIABILITY INDICES AND FACTOR ANALYSIS
Reliability is usually reported as a coefficient41 ranging from 0 to 1. The reliability coefficient can be interpreted as the correlation between scores on two administrations of the same instrument, and in fact test-retest and alternate form reliability are usually calculated using statistical tests of correlation. The reliability coefficient can also be interpreted as the proportion of score variance explained by differences between subjects (the remainder being explained by a combination of random and systematic error). A value of 0 represents no correlation (all error), whereas 1 represents perfect correlation (all variance attributable to subjects). Acceptable values will vary according to the purpose of the instrument. For high-stakes settings (eg, licensure examination) reliability should be greater than 0.9, whereas for less important situations values of 0.8 or 0.7 may be acceptable.30 Note that the interpretation of reliability coefficients is different from the interpretation of correlation coefficients in other applications, where a value of 0.6 would often be considered quite high.62 Low reliability can be improved by increasing the number of items or observers and (in education settings) using items of medium difficulty.30 Improvement expected from adding items can be estimated using the Spearman-Brown “prophecy” formula (described elsewhere).41
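The text defers the Spearman-Brown formula to another source. As a hedged sketch, its standard form predicts the reliability of an instrument lengthened by a factor k as k × r / (1 + (k − 1) × r), where r is the current reliability. The function names and the example values below are assumptions for illustration, and the formula presumes that the added items are comparable to the existing ones.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after multiplying the number of items by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def length_factor_needed(reliability, target):
    """Factor by which the instrument must be lengthened to reach the target reliability."""
    return target * (1 - reliability) / (reliability * (1 - target))


# Hypothetical instrument with reliability 0.70.
current = 0.70
print(f"Doubling the items: {spearman_brown(current, 2):.2f}")
print(f"Lengthening needed for 0.90: x{length_factor_needed(current, 0.90):.1f}")
```

The second function is the same formula solved for k, which is useful when planning how many items or observers a high-stakes use would require.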

A less common, but often more useful,63 measure of score variance is the standard error of measurement (SEM) (not to be confused with the standard error of the mean, which is also abbreviated SEM). The SEM, given by the equation SEM = standard deviation × square root (1 − reliability),64 is the “standard deviation of an individual’s observed scores”19 and can be used to develop a confidence interval for an individual’s true score (the true score is the score uninfluenced by random error). For example, 95% of an individual’s scores on retesting should fall within 2 SEM of the individual’s true score. Note, however, that the observed score only estimates the true score; see Harvill64 for further discussion.
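The SEM equation above can be worked through as follows. This is a minimal sketch: the function names and the numbers (standard deviation 10, reliability 0.91, observed score 75) are hypothetical. Because the band is centered on the observed score rather than the unknown true score, it is only an approximation, as the text cautions.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """SEM = standard deviation x square root of (1 - reliability)."""
    return sd * sqrt(1 - reliability)


def score_band(observed, sd, reliability, n_sem=2):
    """Approximate band of +/- n_sem SEM around an observed score."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - n_sem * sem, observed + n_sem * sem


# Hypothetical scale: score SD of 10 points and reliability of 0.91.
sem = standard_error_of_measurement(10, 0.91)
low, high = score_band(75, 10, 0.91)
print(f"SEM = {sem:.1f}; approximate 95% band: {low:.0f} to {high:.0f}")
```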

Agreement between raters on binary outcomes (eg, heart murmur present: yes or no?) is often reported using kappa, which represents agreement corrected for chance.40 A different but related test, weighted kappa, is necessary when determining inter-rater agreement on ordinally ranked data (eg, Likert scaled responses) to account for the variation in intervals between data points in ordinally ranked data (eg, in a typical 5-point Likert scale the “distance” from 1 to 2 is likely different from the distance from 2 to 3). Landis and Koch65 suggest that kappa less than 0.4 is poor, from 0.4 to 0.75 is good, and greater than 0.75 is excellent.
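To make the chance correction concrete, the sketch below computes kappa as 1 minus the ratio of observed to chance-expected disagreement. With no weights this equals the usual (observed agreement − expected agreement) / (1 − expected agreement). The function name, the linear and quadratic weighting options, and all ratings are assumptions for illustration; other weighting schemes exist.

```python
def cohen_kappa(rater_a, rater_b, categories, weights=None):
    """Cohen's kappa for two raters.

    categories: ordered list of possible ratings.
    weights: None (unweighted), "linear", or "quadratic".
    """
    n = len(rater_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    # Cross-tabulate the paired ratings.
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        counts[index[a]][index[b]] += 1
    row = [sum(counts[i]) / n for i in range(k)]
    col = [sum(counts[i][j] for i in range(k)) / n for j in range(k)]

    def disagreement(i, j):
        if weights is None:
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (k - 1)  # near misses are penalized less
        return d if weights == "linear" else d * d

    cells = [(i, j) for i in range(k) for j in range(k)]
    observed = sum(disagreement(i, j) * counts[i][j] / n for i, j in cells)
    expected = sum(disagreement(i, j) * row[i] * col[j] for i, j in cells)
    return 1 - observed / expected


# Hypothetical binary ratings: murmur present in 10 patients, two raters.
a = ["yes"] * 4 + ["no"] * 6
b = ["yes"] * 3 + ["no"] * 6 + ["yes"]
print(f"kappa = {cohen_kappa(a, b, ['yes', 'no']):.2f}")

# Hypothetical 5-point Likert ratings of 6 residents by two raters.
likert_a = [1, 2, 3, 4, 5, 3]
likert_b = [2, 2, 3, 5, 5, 3]
scale = [1, 2, 3, 4, 5]
print(f"unweighted = {cohen_kappa(likert_a, likert_b, scale):.2f}")
print(f"linear weighted = {cohen_kappa(likert_a, likert_b, scale, 'linear'):.2f}")
```

In the Likert example both disagreements are off by a single scale point, so the weighted coefficient is higher than the unweighted one.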

Factor analysis32 is used to investigate relationships between items in an instrument and the constructs they are intended to measure. Some instruments intend to measure a single construct (“symptoms of urinary obstruction”), whereas others try to assess multiple constructs (“depression,” “anxiety,” and “personality disorder”). Factor analysis can determine whether the items intended to measure a given construct actually “cluster” together into “factors” as expected. Items that “load” on more than one factor, or on unexpected factors, may not be measuring their intended constructs.

References
1. Barry MJ, Fowler FJ Jr, O’Leary MP, et al. The American Urological Association symptom index for benign prostatic hyperplasia. J Urol. 1992;148:1549-1557.
2. Ewing JA. Detecting alcoholism: the CAGE questionnaire. JAMA. 1984;252:1905-1907.
3. Spitzer RL, Williams JB, Kroenke K, et al. Utility of a new procedure for diagnosing mental disorders in primary care. The PRIME-MD 1000 study. JAMA. 1994;272:1749-1756.
4. Green C, Porter C, Bresnahan D, Spertus J. Development and evaluation of the Kansas City Cardiomyopathy Questionnaire: a new health status measure for heart failure. J Am Coll Cardiol. 2000;35:1245-1255.
5. Hollen P, Gralla R, Kris M, Potanovich L. Quality of life assessment in individuals with lung cancer: testing the Lung Cancer Symptom Scale (LCSS). Eur J Cancer. 1993;29A(Suppl 1):S51-S58.
6. Bauer UE, Johnson TM, Hopkins RS, Brooks RG. Changes in youth cigarette use and intentions following implementation of a tobacco control program: findings from the Florida Youth Tobacco Survey, 1998-2000. JAMA. 2000;284:723-728.
7. National Board of Medical Examiners. United States Medical Licensing Exam Bulletin. Produced by Federation of State Medical Boards of the United States and the National Board of Medical Examiners. Available at: http://www.usmle.org/bulletin/2005/testing.htm. Accessed March 7, 2005.
8. Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: a method for assessing clinical skills. Ann Intern Med. 2003;138:476-481.
9. Litzelman DK, Stratos GA, Marriott DJ, Skeff KM. Factorial validation of a widely disseminated educational framework for evaluating clinical teachers. Acad Med. 1998;73:688-695.
10. Merriam-Webster Online. Available at: http://www.m-w.com/. Accessed March 10, 2005.
11. Sackett DL, Richardson WS, Rosenberg W, Haynes RB. Evidence-Based Medicine: How to Practice and Teach EBM. Edinburgh: Churchill Livingstone; 1998.
12. Wallach J. Interpretation of Diagnostic Tests. 7th ed. Philadelphia: Lippincott Williams & Wilkins; 2000.
13. Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med. 2004;19:971-977.
14. Shanafelt TD, Bradley KA, Wipf JE, Back AL. Burnout and self-reported patient care in an internal medicine residency program. Ann Intern Med. 2002;136:358-367.
15. Alexander GC, Casalino LP, Meltzer DO. Patient-physician communication about out-of-pocket costs. JAMA. 2003;290:953-958.
16. Pittet D, Simon A, Hugonnet S, Pessoa-Silva CL, Sauvan V, Perneger TV. Hand hygiene among physicians: performance, beliefs, and perceptions. Ann Intern Med. 2004;141:1-8.
17. Messick S. Validity. In: Linn RL, editor. Educational Measurement, 3rd Ed. New York: American Council on Education and Macmillan; 1989.
18. Foster SL, Cone JD. Validity issues in clinical assessment. Psychol Assess. 1995;7:248-260.
19. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
20. Bland JM, Altman DG. Statistics notes: validating scales and indexes. BMJ. 2002;324:606-607.
21. Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ. 2003;37:830-837.
22. 2005 Certification Examination in Internal Medicine Information Booklet. Produced by American Board of Internal Medicine. Available at: http://www.abim.org/resources/publications/IMRegistrationBook.pdf. Accessed September 2, 2005.
23. Kane MT. An argument-based approach to validity. Psychol Bull. 1992;112:527-535.
24. Messick S. Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. Am Psychol. 1995;50:741-749.
25. Kane MT. Current concerns in validity theory. J Educ Meas. 2001;38:319-342.
26. American Psychological Association. Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association; 1966.
27. Downing SM, Haladyna TM. Validity threats: overcoming interference with proposed interpretations of assessment data. Med Educ. 2004;38:327-333.
28. Haynes SN, Richard DC, Kubany ES. Content validity in psychological assessment: a functional approach to concepts and methods. Psychol Assess. 1995;7:238-247.
29. Feldt LS, Brennan RL. Reliability. In: Linn RL, editor. Educational Measurement, 3rd Ed. New York: American Council on Education and Macmillan; 1989.
30. Downing SM. Reliability: on the reproducibility of assessment data. Med Educ. 2004;38:1006-1012.
31. Clark LA, Watson D. Constructing validity: basic issues in objective scale development. Psychol Assess. 1995;7:309-319.
32. Floyd FJ, Widaman KF. Factor analysis in the development and refinement of clinical assessment instruments. Psychol Assess. 1995;7:286-299.
33. Campbell DT, Fiske DW. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychol Bull. 1959;56:81-105.


34. Beckman TJ, Cook DA, Mandrekar JN. What is the validity evidence for assessments of clinical teaching? J Gen Intern Med. 2005;20:1159-1164.
35. Smith GT, McCarthy DM. Methodological considerations in the refinement of clinical assessment instruments. Psychol Assess. 1995;7:300-308.
36. Kane MT. Content-related validity evidence in test development. In: Downing SM, Haladyna TM, editors. Handbook of Test Development. Mahwah, NJ: Lawrence Erlbaum Associates; 2006:131-153.
37. DeVellis RF. Scale Development: Theory and Applications. 2nd ed. Thousand Oaks, CA: Sage Publications; 2003.
38. Nunnally JC, Bernstein IH. Psychometric Theory. 3rd ed. New York: McGraw-Hill; 1994.
39. McMillan J, Schumacher S. Research in Education: A Conceptual Introduction. 5th ed. New York: Addison Wesley Longman; 2001.
40. Howell D. Statistical Methods for Psychology. 5th ed. Pacific Grove, CA: Duxbury; 2002.
41. Traub RE, Rowley GL. An NCME instructional module on understanding reliability. Educational Measurement: Issues and Practice. 1991;10(1):37-45.
42. Brennan RL. Generalizability Theory. New York: Springer-Verlag; 2001.
43. Shavelson R, Webb N. Generalizability Theory: A Primer. Newbury Park: Sage Publications; 1991.
44. Boyarsky S, Jones G, Paulson DF, Prout GR Jr. New look at bladder neck obstruction by the Food and Drug Administration regulators: guidelines for investigation of benign prostatic hypertrophy. Trans Am Assoc Genitourin Surg. 1976;68:29-32.
45. Madsen PO, Iversen P. A point system for selecting operative candidates. In: Hinman F, editor. Benign Prostatic Hypertrophy. New York: Springer-Verlag; 1983:763-765.
46. Fowler FJ Jr, Wennberg JE, Timothy RP, Barry MJ, Mulley AG Jr, Hanley D. Symptom status and quality of life following prostatectomy. JAMA. 1988;259:3018-3022.
47. Hald T, Nordling J, Andersen JT, Bilde T, Meyhoff HH, Walter S. A patient weighted symptom score system in the evaluation of uncomplicated benign prostatic hyperplasia. Scand J Urol Nephrol. 1991;138(suppl):59-62.
48. Donovan JL, Abrams P, Peters TJ, et al. The ICS-“BPH” Study: the psychometric validity and reliability of the ICSmale questionnaire. Br J Urol. 1996;77:554-562.
49. Barry MJ, Fowler FJ, Chang Y, Liss CL, Wilson H, Stek M Jr. The American Urological Association symptom index: does mode of administration affect its psychometric properties? J Urol. 1995;154:1056-1059.
50. Welch G, Kawachi I, Barry MJ, Giovannucci E, Colditz GA, Willett WC. Distinction between symptoms of voiding and filling in benign prostatic hyperplasia: findings from the Health Professionals Follow-up Study. Urology. 1998;51:422-427.
51. Barry MJ, Williford WO, Chang Y, et al. Benign prostatic hyperplasia specific health status measures in clinical research: how much change in the American Urological Association symptom index and the benign prostatic hyperplasia impact index is perceptible to patients? J Urol. 1995;154:1770-1774.
52. Barry MJ, Williford WO, Fowler FJ Jr, Jones KM, Lepor H. Filling and voiding symptoms in the American Urological Association symptom index: the value of their distinction in a Veterans Affairs randomized trial of medical therapy in men with a clinical diagnosis of benign prostatic hyperplasia. J Urol. 2000;164:1559-1564.
53. Barry MJ, Fowler FJ Jr, O’Leary MP, Bruskewitz RC, Holtgrewe HL, Mebust WK. Correlation of the American Urological Association symptom index with self-administered versions of the Madsen-Iversen, Boyarsky and Maine Medical Assessment Program symptom indexes. J Urol. 1992;148:1558-1563.
54. Schwartz EJ, Lepor H. Radical retropubic prostatectomy reduces symptom scores and improves quality of life in men with moderate and severe lower urinary tract symptoms. J Urol. 1999;161:1185-1188.
55. Barry MJ, Cockett AT, Holtgrewe HL, McConnell JD, Sihelnik SA, Winfield HN. Relationship of symptoms of prostatism to commonly used physiological and anatomical measures of the severity of benign prostatic hyperplasia. J Urol. 1993;150:351-358.
56. Kaplan SA, Olsson CA, Te AE. The American Urological Association symptom score in the evaluation of men with lower urinary tract symptoms: at 2 years of followup, does it work? J Urol. 1996;155:1971-1974.
57. Program Requirements for Residency Education in Internal Medicine. Produced by Accreditation Council for Graduate Medical Education. Available at: http://www.acgme.org/. Accessed December 22, 2003.
58. Kundel H, Polansky M. Measurement of observer agreement. Radiology. 2003;228:303-308.
59. Knaus W, Wagner D, Draper E, et al. The APACHE III prognostic system. Risk prediction of hospital mortality for critically ill hospitalized adults. Chest. 1991;100:1619-1636.
60. Fine MJ, Auble TE, Yealy DM, et al. A prediction rule to identify low-risk patients with community-acquired pneumonia. N Engl J Med. 1997;336:243-250.
61. Marrie TJ, Lau CY, Wheeler SL, Wong CJ, Vandervoort MK, Feagan BG. A controlled trial of a critical pathway for treatment of community-acquired pneumonia. CAPITAL Study Investigators. JAMA. 2000;283:749-755.
62. Fraenkel JR, Wallen NE. How to Design and Evaluate Research in Education. New York, NY: McGraw-Hill; 2003.
63. Cronbach LJ. My current thoughts on coefficient alpha and successor procedures. Educ Psychol Meas. 2004;64:391-418.
64. Harvill LM. NCME Instructional module: standard error of measurement. Educational Measurement: Issues and Practice. 1991;10(2):33-41.
65. Landis J, Koch G. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159-174.

  • Current Concepts in Validity and Reliability for Psychometric Instruments: Theory and Application
    • VALIDITY, CONSTRUCTS, AND MEANINGFUL INTERPRETATION OF INSTRUMENT SCORES
      • A Conceptual Approach to Validity
      • Sources of Validity Evidence
        • Content.
        • Response Process.
        • Internal Structure.
        • Relations to Other Variables.
        • Consequences.
        • Integrating the Evidence.
      • What About Face Validity
    • RELIABILITY: NECESSARY, BUT NOT SUFFICIENT, FOR VALID INFERENCES
    • PRACTICAL APPLICATION OF VALIDITY CONCEPTS IN SELECTING AN INSTRUMENT
    • PRACTICAL APPLICATION OF VALIDITY CONCEPTS IN DEVELOPING AN INSTRUMENT
    • CONCLUSION
    • ACKNOWLEDGMENTS
    • APPENDIX: INTERPRETATION OF RELIABILITY INDICES AND FACTOR ANALYSIS
    • References