Are We Accurately Evaluating Depression in Patients With Cancer?

Rebecca M. Saracino Memorial Sloan Kettering Cancer Center, New York, New York, and Fordham University

Ezgi Aytürk and Heining Cham Fordham University

Barry Rosenfeld Memorial Sloan Kettering Cancer Center, New York, New York, and Fordham University

Leah M. Feuerstahler Fordham University

Christian J. Nelson Memorial Sloan Kettering Cancer Center, New York, New York

Depression remains poorly managed in oncology, in part because of the difficulty of reliably screening and assessing for depression in the context of medical illness. Whether somatic items really skew the ability to identify “true” depression, or represent meaningful indicators of depression, remains to be determined. This study utilized item response theory (IRT) to compare the performance of traditional depression criteria with Endicott’s substitutive criteria (ESC; tearfulness or depressed appearance; social withdrawal; brooding; cannot be cheered up). The Patient Health Questionnaire (PHQ-9), ESC, and Center for Epidemiologic Studies Depression Scale (CES-D) were administered to 558 outpatients with cancer. IRT models were utilized to evaluate global and item fit for traditional PHQ-9 items compared with a modified version replacing the 4 somatic items with ESC. The modified PHQ-9 ESC scale was the best fit using a partial credit model; model fit was improved after collapsing the middle 2 response categories and removing psychomotor agitation/retardation. This improved model showed satisfactory scale precision and internal consistency, and was free from differential item functioning for gender, age, and race. Concurrent and criterion validity were supported. Thus, as many have speculated, utilizing the ESC may result in more accurate identification of depressive symptoms in oncology. Depressed mood, anhedonia, and suicidal ideation retained their expected properties in the modified scale, indicating that the traditional underlying syndrome of depression likely remains the same, but the ESC may provide more specificity when assessing patients with cancer.

Public Significance Statement Alternative approaches to assessing depression in patients with cancer may be more accurate than current approaches, which rely heavily on physical symptoms. An improved approach might eliminate physical symptoms and focus more on emotional symptoms.

Keywords: depression, diagnostic criteria, oncology, IRT, screening

Supplemental materials: http://dx.doi.org/10.1037/pas0000765.supp

This article was published Online First August 8, 2019.

Rebecca M. Saracino, Department of Psychiatry and Behavioral Sciences, Memorial Sloan Kettering Cancer Center, New York, New York, and Department of Psychology, Fordham University; Ezgi Aytürk and Heining Cham, Department of Psychology, Fordham University; Barry Rosenfeld, Department of Psychiatry and Behavioral Sciences, Memorial Sloan Kettering Cancer Center, and Department of Psychology, Fordham University; Leah M. Feuerstahler, Department of Psychology, Fordham University; Christian J. Nelson, Department of Psychiatry and Behavioral Sciences, Memorial Sloan Kettering Cancer Center.

This research was supported by funding from the National Institutes of Health (T32CA009461 and P30CA008748).

Correspondence concerning this article should be addressed to Rebecca M. Saracino, Department of Psychiatry and Behavioral Sciences, Memorial Sloan Kettering Cancer Center, 641 Lexington Avenue, 7th Floor, New York, NY 10022. E-mail: [email protected]

This document is copyrighted by the American Psychological Association or one of its allied publishers. This article is intended solely for the personal use of the individual user and is not to be disseminated broadly.

Psychological Assessment, 2020, Vol. 32, No. 1, 98–107. © 2019 American Psychological Association. ISSN: 1040-3590. http://dx.doi.org/10.1037/pas0000765

Accurate assessment of depression in patients with medical illness is critically important, as those with comorbid mood disorders are at significantly greater risk for nonadherence to medical treatments and premature mortality (DiMatteo, Lepper, & Croghan, 2000; Misono, Weiss, Fann, Redman, & Yueh, 2008). Historically, clinicians and researchers have debated whether or not the reliance on somatic items when rendering a depression diagnosis inappropriately inflates the prevalence of depressive disorders among the medically ill, especially in oncology settings (Jones et al., 2015; Krebber et al., 2014; Saracino, Rosenfeld, & Nelson, 2018). Somatic items (i.e., sleep disturbance, fatigue, appetite changes, diminished concentration) may reflect side effects of treatment or the pathology of the underlying illness itself. Despite this concern, the Patient Health Questionnaire-9 item (PHQ-9; Kroenke & Spitzer, 2002), which relies exclusively on Diagnostic and Statistical Manual of Mental Disorders criteria, remains one of the most widely utilized depression screening measures across medical settings (e.g., primary care, oncology, cardiovascular disease; Dyer, Williams, Bombardier, Vannoy, & Fann, 2016; Forkmann, Gauggel, Spangenberg, Brähler, & Glaesmer, 2013; Gothwal, Bagga, & Sumalini, 2014; Kendel et al., 2010; Pedersen, Mathiasen, Christensen, & Makransky, 2016; Williams et al., 2009).

The PHQ-9 consists of nine items, each of which corresponds to one of the nine symptoms required for a diagnosis of a major depressive disorder (MDD) as defined by the Diagnostic and Statistical Manual of Mental Disorders (DSM; American Psychiatric Association, 2013). Respondents are asked to rate how often they have been bothered by each of the nine symptoms over the preceding 2 weeks. Respondents rate each item on a 4-point scale (0 = not at all, 1 = several days, 2 = more than half the days, 3 = nearly every day). Due to its popularity, a handful of studies have used item response theory (IRT) to examine the PHQ-9 in samples of medical patients (Dyer et al., 2016; Forkmann et al., 2013; Gothwal et al., 2014; Kendel et al., 2010; Pedersen et al., 2016; Williams et al., 2009). For example, Kendel et al. (2010) observed that among 1,271 patients undergoing coronary artery bypass graft surgery, most of the somatic items on the PHQ-9 did not meet criteria for a good overall model fit (i.e., according to fit statistics). Instead, they found that six out of seven items on the Hospital Anxiety and Depression Scale Depression subscale (HADS-D; Zigmond & Snaith, 1983), which rely entirely on cognitive and affective symptoms, and the two PHQ-9 items reflecting the DSM gateway symptoms of MDD (i.e., depressed mood and anhedonia) plus fatigue, were the strongest indicators of the underlying construct. They also identified differential item functioning (DIF) across genders on two PHQ-9 items; women were more likely than men to endorse depressed mood and fatigue conditional on the latent trait. In theory, DIF is an undesirable property of an item, as it indicates that respondents from different groups (e.g., males and females) with the same level of the latent trait have different probabilities of endorsing an item (Holland & Wainer, 1993).

A study of 1,531 patients with heart disease and implantable cardioverter defibrillators identified PHQ-9 items reflecting depressed mood, feeling bad about yourself or that you are a failure, and suicidal ideation, as being the best items for discriminating individuals with higher and lower levels of depression (Pedersen et al., 2016). They also found significant DIF for gender for the depressed mood item, such that women were more likely than men to endorse this item at the same underlying level of depression. Additionally, overall model fit was substantially improved after collapsing the two middle response options (several days and more than half the days) in the 4-point scale, indicating that these two response options were not meaningfully distinguished from one another. Another study of 100 adults with a history of traumatic brain injury demonstrated similar findings, as all PHQ-9 items demonstrated good fit when the two intermediate response categories were collapsed (Dyer et al., 2016). Thus, regardless of the relative performance of individual items across clinical samples, a collapsed, three response option format may be most suitable for the PHQ-9.

In oncology settings, alternative approaches to depression assessment have been proposed (e.g., Cavanaugh, 1995; Endicott, 1984) in order to increase the specificity of depression screening measures and decrease the potential overinclusivity of the criteria used by the DSM. The most widely recognized of these approaches are the substitutive criteria proposed by Endicott (1984; ESC), who recommended replacing the four somatic symptoms with four alternative symptoms: tearfulness or depressed appearance in face or body posture; social withdrawal or decreased talkativeness; brooding, self-pity, or pessimism; and cannot be cheered up, doesn’t smile, no response to good news or funny situations. Although widely cited, there is a dearth of published research that has systematically evaluated this proposal.

Only one prior study has utilized IRT to compare the performance of traditional DSM criteria with the Endicott substitutive approach, using a structured clinical interview to rate each of the criteria under investigation. Akechi et al. (2009) examined the utility of the DSM–IV criteria for MDD, along with the Endicott’s substitutive criteria and those proposed by Cavanaugh (1995), who recommended replacing the four DSM somatic items with two behavioral criteria: “not participating in medical treatment in spite of ability to do so” and “functioning at a lower level than medical condition warrants or failure to progress in recovery despite improved medical condition.” In a sample of 728 cancer patients diagnosed with depression (based on DSM–IV criteria), these authors found that the Endicott and Cavanaugh’s criteria were among the symptoms with the most utility in assessing depression across the spectrum of severity. Endicott’s “tearfulness or depressed appearance” and “brooding, self-pity, or pessimism” were particularly good indicators of mild depression, while “not participating in medical care” (Cavanaugh) and “social withdrawal” (Endicott) were good indicators of moderate to severe depression. For patients with severe depression, Endicott’s “cannot be cheered up . . .” symptom was the most salient indicator. Although none of the DSM–IV criteria had a high ability to discriminate between individuals with more or less severe depression in this sample, this finding may have been impacted by their study methodology, because they included only patients that met DSM criteria for MDD (thereby reducing the variability in the DSM–IV symptoms). Nevertheless, the authors suggested that the substitutive criteria proposed by Endicott and Cavanaugh are promising, given their apparent utility in discriminating depressive symptom severity. In addition to a restricted symptom range due to inclusion criteria, this study also relied on clinician interview, which is a costly and unrealistic approach to depression screening, particularly in busy oncology settings in which clinicians do not have the training nor the time to conduct psychiatric diagnostic interviews.

Despite its popularity, no studies to date have utilized IRT to examine the PHQ-9 in patients with cancer, nor have these methods been extended to study the Endicott’s substitutive criteria in a self-report format. Cancer and its treatment have unique disease sequelae and treatment side effects that are not necessarily as salient in other medical conditions such as heart disease or brain injury. While fatigue may be cross-cutting, symptoms such as appetite, concentration, and sleep disturbances are particularly salient in oncology (Akechi et al., 2003). Given its wide popularity and development for specific use in oncology, the present study focused on the classic symptoms of MDD and Endicott’s criteria only; the alternative symptoms proposed by Cavanaugh were not included in the current study as they were developed for general medical settings, not specifically for use with cancer patients. While depression screening measures can identify general distress, dysphoria, and subsyndromal depression (in addition to MDD), the goal of the current study was to evaluate the DSM criteria for MDD (via the PHQ-9) and the Endicott’s substitutive criteria as a first step toward further psychometric validation of the substitutive approach. The present study searched for the best-fitting measurement structure for the 13 items (nine DSM criteria plus four Endicott’s substitutive criteria items) using several IRT models. Differential item functioning (DIF) of the selected measurement structure was also tested across gender (males vs. females), age (40–69 years old vs. 70 or above), and racial groups (non-Hispanic White participants vs. ethnic minority participants), as well as precision and internal consistency of scale scores and concurrent validity of score interpretations.

Method

Participants and Procedure

Participants were recruited from outpatient clinics at Memorial Sloan Kettering Cancer Center (MSK) between January 2016 and May 2016. To be eligible for participation, patients had to be 40 years or older,1 fluent in English, and have a cancer diagnosis. Patients were approached by trained research personnel while awaiting routine clinic appointments; those who were eligible were informed of the study procedures, risks, and benefits, and invited to participate. The study was approved by the MSK and Fordham University Institutional Review Boards.

Measures

All participants completed a packet of questionnaires in a fixed order, including the Patient Health Questionnaire-9 (PHQ-9) and four items assessing the Endicott criteria. Table 1 presents the PHQ-9 items and Endicott’s substitutive criteria (ESC) items, along with the percentage endorsing each response option. As noted above, respondents were asked to rate how often they have been bothered by the symptoms described by the items over the last 2 weeks on a 4-point scale (0 = not at all, 1 = several days, 2 = more than half the days, 3 = nearly every day). Endicott (1984) proposed four alternative symptoms (tearfulness or depressed appearance in face or body posture; social withdrawal or decreased talkativeness; brooding, self-pity, or pessimism; and cannot be cheered up, doesn’t smile, no response to good news or funny situations) as substitutes for four DSM symptoms that are most commonly confounded by medical illness (sleep disturbance, fatigue, appetite changes, diminished concentration). These four items were assessed using the same instructions and response scale as PHQ-9 items.

Participants were also administered the Center for Epidemiologic Studies Depression Scale (CES-D; Radloff, 1977), a self-report measure of 20 depressive symptoms. Past research indicates acceptable psychometric properties and has supported a four-factor structure: depressed affect, positive affect, somatic complaints, and interpersonal problems (Nelson, Cho, Berk, Holland, & Roth, 2010; Saracino, Cham, Rosenfeld, & Nelson, 2018; Vodermaier, Linden, & Siu, 2009). The CES-D was used to examine the concurrent validity of PHQ-9 and ESC item scores; it was not included in IRT analyses as the primary focus was on approximating DSM criteria for MDD, which are more directly assessed by the PHQ-9. Sociodemographic and medical data were also collected by participant self-report.

Data Analyses

Missing data analysis. A total of 663 patients completed the study questionnaires. Missing data rates for the PHQ-9 and ESC items were low (M = 7.2%, range: 6.5% to 7.7%). The differences between the sample with complete data (N = 558) and those with missing observations were small in effect size (all Cohen’s d ≤ .29 and W ≤ .15; Cohen, 1988) across sociodemographic and medical data, indicating that listwise deletion was appropriate to handle cases with missing values.
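As a rough illustration of the listwise deletion step described above, the following sketch keeps only respondents with an observed value on every required item. The function name, the toy records, and the item keys are hypothetical and are not the study's variable names; only the counts 663 and 558 come from the text.

```python
def listwise_delete(records, required_keys):
    """Keep only records with an observed (non-None) value for every key."""
    return [
        r for r in records
        if all(r.get(k) is not None for k in required_keys)
    ]

# Hypothetical toy records (illustrative item keys, not the study data).
toy_records = [
    {"phq1": 0, "phq2": 1, "esc1": 0},
    {"phq1": 2, "phq2": None, "esc1": 1},  # dropped: phq2 is missing
    {"phq1": 1, "phq2": 3},                # dropped: esc1 was never recorded
]
complete_cases = listwise_delete(toy_records, ["phq1", "phq2", "esc1"])

# Share of respondents retained in the study after listwise deletion,
# using the counts reported in the text (558 of 663).
retained_fraction = 558 / 663
```

Listwise deletion is simple but discards every partially observed case, which is why the text first checks that complete and incomplete cases differ only slightly.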

1 Age 40 was selected as the inclusion criteria cut-off in order to differentiate the sample from what the National Comprehensive Cancer Network (Coccia et al., 2018) operationalized as “adolescent and young adult,” which refers to patients from 15 to 39 years of age. This age group was selected as the primary purpose was to examine depression assessment in adults.

IRT analysis. Following prior studies (e.g., Dyer et al., 2016; Forkmann et al., 2013; Gothwal et al., 2014; Kendel et al., 2010; Lamoureux et al., 2009; Pedersen et al., 2016; Williams et al., 2009), two polytomous Rasch models were used: the partial credit model (PCM; Masters, 1982) and the rating scale model (RSM; Andrich, 1978). Two polytomous non-Rasch models were also analyzed: the generalized partial credit model (GPCM; Muraki, 1992) and graded response model (GRM; Samejima, 1969). Rasch models (PCM and RSM) use observed item response patterns to estimate a person’s ability (in this case, depression severity) and an item’s difficulty (depression level that the item represents) on a continuous latent variable (depression). They model the probability of a given response as a logistic function of the difference between a person’s ability and item difficulty (Andrich, 1978). With dichotomous data (e.g., yes/no or correct/incorrect), the higher the person’s ability relative to the item difficulty, the more likely a person is to endorse the item. With polytomous data, Rasch models estimate the response category threshold parameters. Category thresholds refer to the point where the probability of choosing either one of two adjacent response options (e.g., “not at all” vs. “several days”) is equal. RSM is the simplest (most constrained) polytomous Rasch model, which assumes equal category thresholds across all items of a given scale and estimates a difficulty parameter for each item. The PCM is more relaxed than RSM as it estimates separate item thresholds for each item. However, both models assume the same discrimination for all items (i.e., the degree to which an item differentiates people with different depression levels). In these two models, average or sum scores of the items can be used as the overall scale score. The GPCM and GRM differ from the polytomous Rasch models in that they estimate different discrimination parameters for each item (the degree to which an item differentiates people with different depression levels). Because the items can have different discriminating power in GPCM and GRM, both models require specialized algorithms to compute the scale scores. Unlike the GPCM, the GRM estimates the probability of choosing a particular response category or above, but assumes that the item category thresholds are always ordered.
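To make the notion of a category threshold concrete, the sketch below computes response category probabilities under the partial credit model in its standard form, where the probability of category k is proportional to the exponential of the summed differences between the person location and the first k thresholds. The function name and the threshold values are illustrative assumptions, not estimates from this study, and the sketch is not the mirt implementation.

```python
import math

def pcm_probabilities(theta, thresholds):
    """Category probabilities under the partial credit model.

    theta: person location on the latent trait (depression severity).
    thresholds: step parameters delta_1..delta_m for a single item.
    Returns m + 1 probabilities for response categories 0..m.
    """
    # Cumulative sums of (theta - delta_j); category 0 has an empty sum of 0.
    cumulative = [0.0]
    for delta in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical thresholds for a 4-option item (assumed values).
example_thresholds = [0.5, 1.5, 2.5]

# At theta equal to the first threshold, "not at all" (0) and
# "several days" (1) are equally likely, as the text describes.
probs_at_first_threshold = pcm_probabilities(0.5, example_thresholds)
```

Under this parameterization the rating scale model would correspond to sharing one set of threshold spacings across items, while the PCM lets each item carry its own thresholds.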

Three indices of model fit criteria were used to select the best-fitting model(s): (a) C2 goodness-of-fit test statistic (Cai & Monroe, 2014; Maydeu-Olivares & Joe, 2006); (b) Akaike Information Criterion (AIC; small value indicates better model fit); and (c) Bayesian Information Criterion (BIC; small value indicates better model fit). The unidimensional structure was first tested with PHQ-9 items only (termed PHQ-9-Original) and then a unidimensional structure with the four PHQ-9 items (sleep disturbances, fatigue, appetite changes, trouble concentrating) substituted by the ESC items (termed PHQ-9-Substitutive). Both measurement structures were tested with the PCM, RSM, GPCM, and GRM models. Based on the results of these analyses, the models were modified by collapsing the response options of the items and removing items that negatively impacted model fit (described in more detail below).
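For readers less familiar with the information criteria, a minimal sketch of the standard AIC and BIC definitions follows. The log-likelihoods and parameter counts are made-up placeholders chosen only to show the comparison; they are not values from Table 3.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical fitted models: (log-likelihood, number of free parameters).
candidates = {
    "model_a": (-3500.0, 27),
    "model_b": (-3520.0, 11),
}
n_obs = 558  # analytic sample size reported in the text

# Smaller values indicate better fit after penalizing complexity.
preferred_by_bic = min(
    candidates,
    key=lambda name: bic(candidates[name][0], candidates[name][1], n_obs),
)
```

Because the BIC penalty grows with the sample size, it tends to favor the more constrained model more strongly than the AIC does.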

DIF analysis. After deciding on the optimal measurement structure for the IRT analysis, the simultaneous item bias test (SIBTEST; Shealy & Stout, 1993) was used to examine if there was differential functioning of PHQ-9 and Endicott items across gender (males: n = 288 vs. females: n = 270), age (younger: 40–69 years old; n = 380 vs. older: 70 or above; n = 178), and racial groups (non-Hispanic White: n = 455 vs. ethnic minority participants: n = 103). Age 70 was used to bifurcate the sample as patients with cancer who are over 70 years old have been shown to experience significantly more medical comorbidity than those younger than 70 (Bluethmann, Mariotto, & Rowland, 2016). Both uniform DIF and nonuniform DIF were tested with one crossing point (Chalmers, 2018; Li & Stout, 1996). The SIBTEST estimates a standardized mean difference (β) capturing the group differences in correct response probabilities (β = 0 indicates no DIF) and provides a significance test to determine if β is significantly different from zero. β values between zero and .05 are considered small DIF, between .05 and .1 are considered moderate DIF, and .1 or above are considered large DIF (Shealy & Stout, 1993). To avoid inflated Type I error rate due to multiple testing of β for each item, Holm’s (1979) procedure was used to adjust p values (Kim & Oshima, 2013).
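Holm's step-down adjustment mentioned above can be sketched as follows. This is a generic illustration of the procedure rather than the code used in the study, and the p values in the example are invented.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p values in the original order.

    The smallest p value is multiplied by m, the next by m - 1, and so on;
    a running maximum keeps the adjusted values monotone, and each value
    is capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Invented per-item p values, for illustration only.
example_adjusted = holm_adjust([0.01, 0.04, 0.03])
```

An item is then flagged for DIF only if its adjusted p value falls below the chosen alpha level.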

Validity analysis. The proportion of participants who obtained the lowest possible scale score on the PHQ-9-Original and on the selected substitutive measurement structure was calculated. It was expected that there would be a higher proportion of patients with a scale score of zero in the selected substitutive structure. To examine the convergent and discriminant validity of the selected substitutive structure and compare the relative differences between the selected substitutive structure and PHQ-9-Original, we calculated the correlations between the scale scores of the selected substitutive structure, PHQ-9-Original, and the CES-D total score and factors (depressed affect, positive affect, somatic complaints, and interpersonal problems). It was expected that there would be larger correlations between the selected substitutive structure and the CES-D depressed affect factor and total scores, because the depressed affect factor is most closely aligned with the affective DSM criteria. Finally, participants who reported receiving treatment for depression and those who did not were compared on the scale scores of the selected substitutive structure and PHQ-9-Original. It was anticipated that the difference between the two groups would be larger on the selected substitutive structure than the PHQ-9-Original.

All IRT and DIF analyses (except for person separation reliability; described in more detail below) were conducted using the R mirt package (Version 1.29; Chalmers, 2012). Person separation reliability indices were calculated using the R eRm package (Version 0.16–1; Mair & Hatzinger, 2007).

Table 1
Percentage (%) of Response Options of PHQ-9 and Endicott’s Substitutive Criteria Items

Percentage (%) endorsing each response option: Not at all (0), Several days (1), More than half the days (2), Nearly every day (3)

Abbreviated item label                        (0)     (1)     (2)     (3)

Patient Health Questionnaire-9 Item (PHQ-9)
1. Anhedonia                                  64.0    21.5     9.1     5.4
2. Depressed mood                             63.8    26.0     6.4     3.8
3. Sleep disturbances                         44.8    28.7    14.0    12.5
4. Fatigue                                    32.4    37.5    15.4    14.7
5. Appetite changes                           58.6    22.6    10.2     8.6
6. Feeling bad about yourself                 77.8    14.9     4.8     2.5
7. Trouble concentrating                      65.1    23.7     6.4     4.8
8. Psychomotor agitation and retardation      80.3    10.9     5.9     2.9
9. Suicidal ideation                          92.6     6.3      .9      .2

Endicott’s substitutive criteria
1. Socially withdrawn                         75.1    16.5     5.0     3.4
2. Tearfulness                                78.1    14.7     5.0     2.2
3. Brooding                                   71.1    21.0     4.7     3.2
4. Could not be cheered up                    82.8    12.2     3.9     1.1


Results

Participant Characteristics

The sample (N = 558) was approximately evenly split by gender (51.6% male; n = 288) and ranged in age from 40 to 90 years or older2 (M = 64.7, SD = 10.3; see Table 2). Most participants were White (87.6%; n = 489; including n = 455 non-Hispanic and n = 34 Hispanic), married or living with a partner (70.6%; n = 394), and had a college and/or graduate education (70.4%; n = 393). The most common cancer diagnoses were gynecological (16.8%; n = 94), lung (15.2%; n = 85), and prostate (13.1%; n = 73). Over one third of participants reported Stage IV disease (37.5%; n = 209). The majority of participants had received active cancer treatment within the preceding 6 months (71.3%; n = 398).

Initial Analysis of Unidimensionality

Confirmatory factor analysis (CFA) was conducted to test the unidimensionality of the PHQ-9-Original and PHQ-9-Substitutive. Models were estimated using polychoric correlations and diagonally weighted least squares estimation via the R lavaan package (Rosseel, 2012). A full report of the results can be found in the online supplementary materials. The comparative fit index (CFI) and Tucker-Lewis index (TLI) suggested good model fit of both the PHQ-9-Original and PHQ-9-Substitutive (all values ≥ .99); however, the PHQ-9-Original had a slightly worse root mean square error of approximation (RMSEA) than the PHQ-9-Substitutive (i.e., .066 vs. .028, respectively). Taken together, these model fit indices suggest that both PHQ-9-Original and PHQ-9-Substitutive were sufficiently unidimensional for IRT analysis.

IRT Analysis

All the IRT models (PCM, RSM, GPCM, GRM) converged properly in the PHQ-9-Original and PHQ-9-Substitutive measurement structures. Panels A and B in Table 3 present the global model fit results for the IRT models of the two structures. Compared with PHQ-9-Original, the PHQ-9-Substitutive structure had a better model fit in terms of AIC and BIC across all four IRT models. Therefore, the remaining analyses used only the PHQ-9-Substitutive structure. However, the PHQ-9-Substitutive structure generated a significant C2 test statistic (ps < .001) for all four models, indicating that none of the models fit the data well. Since the more complex GPCM and GRM did not fit better than PCM and RSM, they were not considered further.3

Next, the item fit of the PCM and RSM were compared using the PHQ-9-Substitutive structure using: (a) S-χ² item fit test statistic (Kang & Chen, 2008; Orlando & Thissen, 2000); and (b) item infit (information weighted mean square), where a value of 1.0 indicates perfect fit and values between 0.7 and 1.3 are considered acceptable fit (Wright & Linacre, 1994). Results showed that Item 8 on the PHQ-9 (“moving or speaking so slowly that other people could have noticed or the opposite—being so fidgety or restless that you have been moving a lot more than usual”) was the only item that showed both significant S-χ² test statistics, PCM: S-χ²(df = 22) = 39.63, p = .01; RSM: S-χ²(df = 23) = 45.37, p = .004, as well as infit values beyond the acceptable range (PCM: 1.45; RSM: 1.57). Following Forkmann et al. (2013) and Kendel et al. (2010), this item was removed from PHQ-9-Substitutive structure and the PCM and RSM models were fit again to this new structure (termed PHQ-8-Substitutive).4
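The infit statistic can be illustrated with a small sketch that follows its usual definition as the sum of squared residuals divided by the sum of model-implied variances. Here the expected scores and variances are assumed to be supplied by an already fitted Rasch-family model; the function names are hypothetical, and the only values taken from the text are the 0.7 to 1.3 range and the reported infit values for Item 8.

```python
def infit_mean_square(observed, expected, variances):
    """Information-weighted mean square for one item.

    observed: observed item scores, one per person.
    expected: model-implied expected scores for the same persons.
    variances: model-implied score variances for the same persons.
    """
    if not (len(observed) == len(expected) == len(variances)):
        raise ValueError("inputs must have equal length")
    squared_residuals = sum((x - e) ** 2 for x, e in zip(observed, expected))
    return squared_residuals / sum(variances)

def infit_is_acceptable(value, lower=0.7, upper=1.3):
    """Apply the acceptable-fit range cited in the text."""
    return lower <= value <= upper

# Reported infit values for PHQ-9 Item 8 fall outside the range.
item8_pcm_ok = infit_is_acceptable(1.45)
item8_rsm_ok = infit_is_acceptable(1.57)
```

Values well above 1 suggest responses that are noisier than the model predicts, which is consistent with the decision to drop the item.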

2 Due to HIPAA protection, participants who were 90 years or older (n = 2) checked a box indicating they were in this age range.

3 GPCM and GRM results for all steps are available upon request.

4 Analysis of the PCM threshold parameter estimates and item response curves of this item in PHQ-9-Substitutive supported this decision (available upon request).

Table 2
Demographic Characteristics

Demographic                                   Frequency      %

Gender
  Male                                           288        51.6
  Female                                         270        48.4
Race
  White                                          489        87.6
  African American                                29         5.2
  Asian or Pacific Islander                       21         3.8
  Other                                           19         3.4
Ethnicity
  Hispanic                                        48         8.6
  Not Hispanic                                   510        91.4
Marital status
  Single (never married)                          40         7.2
  Married/living with partner                    394        70.6
  Divorced/separated                              75        13.4
  Widowed                                         49         8.8
Education
  Did not graduate high school                    22         4.0
  High school graduate/GED/some college          142        25.5
  College graduate                               168        30.1
  Graduate degree/professional training          225        40.4
  Missing                                          1          .0
Treatment status
  Active treatment                               398        71.3
  Off treatment                                  138        24.7
  Missing                                         22         4.0
Comorbidity
  Present                                        203        36.4
  Absent                                         353        63.3
  Missing                                          2          .3
Disease stage
  In remission/not staged                         24         4.3
  Stage 1                                         34         6.1
  Stage 2                                         34         6.1
  Stage 3                                         77        13.8
  Stage 4                                        209        37.4
  Missing                                        180        32.3
Primary cancer
  Gynecological                                   94        16.8
  Lung                                            85        15.2
  Prostate                                        73        13.1
  Colon                                           47         8.4
Past depression treatment
  Yes                                            131        23.5
  No                                             427        76.5
Current depression treatment
  Yes                                             90        16.1
  No                                             468        83.9

In PHQ-8-Substitutive, the PCM had a significant C2 test statistic, C2(df = 27) = 46.7, p = .01, while RSM did not, C2(df = 41) = 53.1, p = .10. The AIC and BIC estimates for these models were smaller than those generated by the PHQ-9-Substitutive, further supporting the PHQ-8-Substitutive structure. However, there were still poorly fitting items in both models, especially in the RSM. The threshold parameter estimates and item characteristic curves of the eight items of PHQ-8-Substitutive in the PCM (Panel A in Figure S1 in online supplementary materials)5 and RSM were evaluated. Across items, 72.3% of the 95% confidence intervals of the threshold parameters for response 1 (several days) and 2 (more than half the days) overlapped in the PCM. Consistent with previous studies (Forkmann et al., 2013; Gothwal et al., 2014; Lamoureux et al., 2009; Pedersen et al., 2016), these two response options were collapsed. This structure was termed PHQ-8-Substitutive-Collapsed.
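The collapsing step described above amounts to a simple recode of the 4-point scale into three categories. The sketch below shows one way to express it; the mapping name and helper functions are illustrative, and the example row uses the PHQ-9 depressed mood percentages from Table 1.

```python
# Recode: 0 stays 0, the two middle options merge into 1, and 3 becomes 2.
COLLAPSE_MAP = {0: 0, 1: 1, 2: 1, 3: 2}

def collapse_responses(responses):
    """Recode 4-point item responses (0-3) onto the collapsed 0-2 scale."""
    return [COLLAPSE_MAP[r] for r in responses]

def collapse_percentages(percentages):
    """Merge the two middle percentages of a 4-option distribution."""
    not_at_all, several, more_than_half, nearly_every = percentages
    return [not_at_all, several + more_than_half, nearly_every]

# PHQ-9 depressed mood row from Table 1: 63.8, 26.0, 6.4, 3.8.
depressed_mood_collapsed = collapse_percentages([63.8, 26.0, 6.4, 3.8])
```

After recoding, category 1 covers both "several days" and "more than half the days", and category 2 corresponds to "nearly every day".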

The PCM and RSM were fit to the PHQ-8-Substitutive-Collapsed structure. Both models fit the data well, generating nonsignificant C2 test statistics, PCM: C2(df = 27) = 28.7, p = .37; RSM: C2(df = 34) = 37.9, p = .30, and lower AIC and BIC than those in PHQ-8-Substitutive (see “D” in Column 1, Table 3). In the PCM, all items showed nonsignificant S-χ² test statistics and acceptable infit (see Table 4). The item characteristic curves showed little overlap in the response options across items (Panel B in Figure S1 in online supplementary materials). In the RSM, two items had poor fit: the PHQ-9 depressed mood item, S-χ²(df = 9) = 18.88, p = .03, infit = 0.66, and the Endicott “could not be cheered up” item, S-χ²(df = 9) = 13.58, p = .03, infit = .78. Given these findings, PCM was determined to be the best fitting model in the PHQ-8-Substitutive-Collapsed structure.

Table 5 shows the percentages of each response option for the items, along with the threshold parameter and standard error estimates using the PCM analysis for the PHQ-8-Substitutive-Collapsed data. The PHQ-9 suicidal ideation item had the largest threshold parameter estimates, reflecting the smallest percentage of response options 1 and 2 (several/most days and nearly every day, respectively). The PHQ-9 anhedonia and depressed mood items had the lowest threshold parameter estimates, reflecting a relatively higher percentage of response options 1 and 2. Overall, the threshold parameter estimates of response options 1 and 2 were very high across items (≥ 3), reflecting low endorsement of response options reflecting greater depression severity.

To further understand the psychometric properties of this model, the person separation reliability (PSR) was calculated. PSR is based on the replicability of the ordering of persons along the latent trait and is conceptually equivalent to Cronbach’s alpha (Andrich, 1982; de Ayala, 2009; Wright & Masters, 1982). The model had reliability of .81, which is considered acceptable (Nunnally, 1978). Corrected item-total correlations ranged between .51 and .77, reflecting that all items were highly correlated with the rest of the scale.⁶
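For readers less familiar with these indices, the following sketch computes Cronbach's alpha (the classical analog of PSR, not PSR itself) and a corrected item-total correlation, in which each item is correlated with the sum of the remaining items. The data are invented for illustration and the function names are hypothetical; the study's own estimates came from the fitted IRT model.

```python
from math import sqrt
from statistics import mean, variance


def cronbach_alpha(items):
    """items: list of per-item response lists, one value per person."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    item_var_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(items, index):
    """Correlate one item with the sum of all the other items."""
    rest = [sum(person) - person[index] for person in zip(*items)]
    return pearson(items[index], rest)


# Invented responses: four items (rows) by six persons (columns), scored 0-2.
toy_items = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 1, 0, 2, 2],
    [1, 0, 2, 0, 1, 2],
    [0, 1, 2, 1, 1, 1],
]
print("alpha:", round(cronbach_alpha(toy_items), 3))
for i in range(len(toy_items)):
    print("item", i, "corrected r:", round(corrected_item_total(toy_items, i), 3))
```

Removing the item from the total before correlating avoids inflating the coefficient with the item's own variance, which matters most for short scales such as an eight-item measure.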

DIF Analysis

The standardized mean difference β estimates and the significance test results for uniform and nonuniform DIF detection in the items comprising the PHQ-8-Substitutive-Collapsed structure across gender, age, and racial groups are presented in Table S4 of the online supplementary materials. No items showed significant DIF using Holm’s adjusted p values accounting for multiple comparisons.
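Holm's procedure is a step-down adjustment: the smallest p value is multiplied by the number of tests, the next smallest by one fewer, and so on, with the adjusted values forced to be non-decreasing. A minimal sketch follows; the function name and the example p values are hypothetical and do not reproduce the values in Table S4.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p value is multiplied by m, the next by m - 1, etc.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Made-up unadjusted p values for three DIF tests on one item.
raw = [0.01, 0.04, 0.03]
print(holm_adjust(raw))
```

An item would be flagged for DIF only if its adjusted p value fell below the chosen alpha level.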

Validity Analysis

A frequency analysis showed that for the PHQ-9-Original, 117 of the 558 participants (21%) obtained the lowest possible scale score of 0, whereas on the PHQ-8-Substitutive-Collapsed, 267 individuals (47.8%) obtained a scale score of 0. The discrepancy in scores between these two versions of the PHQ supports the hypothesis that the somatic items potentially inflate the scores on the PHQ-9-Original and may overestimate depression severity.

The PHQ-8-Substitutive-Collapsed scale scores had a correlation of .81 with the CES-D depressed affect factor, r = .33 with the positive affect factor, r = .74 with the somatic complaints factor, r = .42 with the interpersonal problems factor, and r = .81 with the CES-D total score.⁷ Conversely, the PHQ-9-Original scale scores had a correlation of .74 with the CES-D depressed affect factor, r = .32 with the positive affect factor, r = .84 with the somatic complaints factor, r = .37 with the interpersonal problems factor, and r = .82 with the CES-D total score. The PHQ-8-Substitutive-Collapsed and PHQ-9-Original scale scores were also highly correlated (r = .86).

⁵ Results of the RSM in PHQ-8-Substitutive were consistent with those of the PCM (available upon request). Across items, 18.7% of the 95% confidence intervals of the threshold parameters for response options 1 and 2 overlapped.

⁶ We have examined the degree to which the local independence assumption was violated in the PCM using the PHQ-8-Substitutive-Collapsed structure. In sum, the assumption was fulfilled and unidimensionality was supported. Detailed results are summarized in the online supplementary materials.

⁷ The PHQ-8-Substitutive-Collapsed maximum a posteriori (MAP) factor scores under PCM had similar correlations: r = .79 with the CES-D depressed affect factor, r = .33 with the positive affect factor, r = .75 with the somatic complaints factor, r = .38 with the interpersonal problems factor, and r = .80 with the CES-D total score.

Table 3
Global Model Fit of the IRT Models

Model                                C2       df    p        AIC       BIC

(A) PHQ-9-Original
    PCM                              90.0     35    <.001    7532.3    7653.4
    RSM                              142.3    51    <.001    7573.1    7625.0
    GPCM                             69.5     27    <.001    7472.4    7628.1
    GRM                              136.7    27    <.001    7444.7    7600.4

(B) PHQ-9-Substitutive
    PCM                              95.8     35    <.001    5525.0    5646.1
    RSM                              134.3    51    <.001    5538.7    5590.6
    GPCM                             82.3     27    <.001    5468.2    5623.9
    GRM                              87.8     27    <.001    5442.9    5598.6

(C) PHQ-8-Substitutive
    PCM                              46.7     27    .01      4863.6    4971.7
    RSM                              53.1     41    .10      4863.8    4911.4

(D) PHQ-8-Substitutive-Collapsed
    PCM                              28.7     27    .37      3956.9    4030.4
    RSM                              37.9     34    .30      3951.7    3994.9

Note. PHQ-9-Original is a unidimensional structure with PHQ-9 items only. PHQ-9-Substitutive is a unidimensional structure with the four PHQ-9 items (Items 3, 4, 5, 7) substituted by the Endicott items. PHQ-8-Substitutive removes PHQ-9 Item 8 from PHQ-9-Substitutive. PHQ-8-Substitutive-Collapsed combines response options 1 (several days) and 2 (more than half the days) from PHQ-8-Substitutive. PCM is partial credit model, RSM is rating scale model, GPCM is generalized partial credit model, GRM is graded response model. AIC = Akaike Information Criterion; BIC = Bayesian Information Criterion.

Scale scores of the PHQ-8-Substitutive-Collapsed and PHQ-9-Original were also compared between participants who reported receiving treatment for depression and those who did not. The Cohen’s d for the PHQ-8-Substitutive-Collapsed (d = .70) was larger than that of PHQ-9-Original (d = .60), suggesting that substituting the somatic items with the ESC items provided a more accurate reflection of depression “caseness” in the oncology setting.
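As a reminder of what this effect size measures, the sketch below computes Cohen's d as the difference between two group means divided by a pooled standard deviation. The pooled-variance form is an assumption, since the variant used is not stated here, and the scores are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Invented scale scores: participants reporting depression treatment vs. not.
treated = [6, 4, 7, 5, 3, 8]
untreated = [2, 3, 1, 4, 2, 5]
print(round(cohens_d(treated, untreated), 2))
```

A larger d for one version of the scale means that its scores separate the two groups by more standard deviation units, which is the sense in which the comparison above is read as evidence of criterion validity.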

Discussion

This is the first study to use IRT to analyze the PHQ-9 in an oncology setting, and to examine whether the Endicott substitutive criteria (ESC) items improve the detection of depression. The preliminary CFA results support the unidimensional IRT analysis for the original PHQ-9 and the revised structure, with somatic PHQ-9 items replaced by the ESC items in patients with cancer. The IRT analyses indicated that the replacement of somatic items in the PHQ-9 with ESC resulted in a large improvement in model fit relative to the original PHQ-9. These analyses supported the often discussed (but rarely investigated) recommendation to replace the somatic depression items with ESC items. Although preliminary, these results indicate that the substitutive items generate a potentially psychometrically superior measure of depression. Overall, an eight-item version of the PHQ that substitutes somatic items with Endicott items performed the best, and a three-option response format (PHQ-8-Substitutive-Collapsed) generated the best model fit.

Among the competing Rasch (i.e., RSM, PCM) and non-Rasch (i.e., GPCM, GRM) polytomous IRT models, PCM was the best fitting model to the final PHQ-8-Substitutive-Collapsed structure. This indicates that the individual scale items differ in their thresholds, that is, some items (e.g., suicidal ideation) require higher levels of depression to choose a higher response option than others do. However, all items have similar levels of discriminating power (i.e., the degree to which an item differentiates people with different depression levels), which allows the use of average/sum scores for evaluation purposes in applied settings. If a non-Rasch model had provided significantly better fit, then sum scores would be suboptimal (and potentially less reliable) estimates.
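The role of item-specific thresholds can be made concrete with the partial credit model's category probabilities. The sketch below uses the Masters (1982) form, in which the probability of category k is proportional to the exponential of the summed differences between the latent trait and the first k thresholds. It assumes that the Table 5 threshold estimates are step parameters on the same logit scale as the latent trait, which may not match the parameterization used by the estimation software; the function name is hypothetical.

```python
from math import exp


def pcm_probs(theta, thresholds):
    """Category probabilities under the partial credit model.

    theta: latent depression level (logits).
    thresholds: step parameters; one fewer than the number of categories.
    """
    cumulative = [0.0]  # category 0 has an empty sum
    for delta in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta))
    peak = max(cumulative)  # subtract the max for numerical stability
    weights = [exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Threshold estimates taken from Table 5 (collapsed three-option format).
ANHEDONIA = (1.13, 4.85)
SUICIDAL_IDEATION = (4.56, 8.21)

for theta in (0.0, 2.0, 4.0):
    p_anh = pcm_probs(theta, ANHEDONIA)
    p_si = pcm_probs(theta, SUICIDAL_IDEATION)
    print(
        f"theta={theta:.1f}  "
        f"P(anhedonia >= 1)={1 - p_anh[0]:.3f}  "
        f"P(suicidal ideation >= 1)={1 - p_si[0]:.3f}"
    )
```

At every trait level shown, any endorsement of anhedonia is more probable than any endorsement of suicidal ideation, which is what "higher thresholds" means in practice. Because the model has no item-specific slope, the sum score carries all of the information about the latent trait.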

In terms of overall item endorsement, only 21% of participants obtained a total score of zero on the original PHQ-9, whereas over 45% of participants obtained a score of zero when the ESC items were used. This observation suggests that, as has been a frequent concern, the somatic symptoms of depression are more likely to be endorsed by patients with cancer. In addition to improved model fit over the traditional PHQ-9, the ESC items on the PHQ-8-Substitutive-Collapsed demonstrated high item-total correlations, indicating strong internal consistency. Support for utilizing the Endicott criteria in lieu of the traditional somatic items is consistent with the findings of Akechi et al. (2009), among a large sample of depressed patients with cancer who completed semistructured diagnostic interviews. They reported that the ESC items “social withdrawal or decreased talkativeness” and “cannot be cheered up, doesn’t smile, no response to good news or funny situations” had moderate difficulty and high discrimination parameters, suggesting the potential utility of these ESC items as markers of moderate to severe depression in oncology settings. The present study extends the findings of Akechi et al. (2009) to a more heterogeneous sample of cancer patients, increasing the generalizability of these findings.

Table 4
Item Fit of Partial Credit Model and Rating Scale Model in PHQ-8-Substitutive-Collapsed Structure

Partial credit model (PCM)

Item                          S-χ²     df    p      Infit    z (Infit)
Anhedonia                     12.59    10    .25    .83      -2.25
Depressed mood                14.92    10    .13    .70      -4.30
Feeling bad about yourself    7.43     8     .49    1.08     .97
Suicidal ideation             13.63    8     .09    .97      -.22
Socially withdrawn            10.43    10    .40    .91      -1.07
Tearfulness                   3.62     7     .82    .90      -1.18
Brooding                      10.35    10    .41    .85      -1.90
Could not be cheered up       13.39    7     .06    .78      -2.59

Rating scale model (RSM)

Item                          S-χ²     df    p      Infit    z (Infit)
Anhedonia                     12.82    10    .23    .83      -2.33
Depressed mood                18.88    9     .03    .66      -5.04
Feeling bad about yourself    6.47     6     .37    1.11     1.31
Suicidal ideation             14.00    7     .05    .97      -.19
Socially withdrawn            11.95    9     .22    .96      -.49
Tearfulness                   3.59     6     .73    .92      -.98
Brooding                      1.07     9     .34    .84      -2.03
Could not be cheered up       13.58    6     .03    .78      -2.60

Note. Anhedonia, depressed mood, feeling bad about yourself, and suicidal ideation are PHQ-9 items. Socially withdrawn, tearfulness, brooding, and could not be cheered up are Endicott items.

Table 5
Percentage (%) of Response Options and Parameter Estimates of the Partial Credit Model in PHQ-8-Substitutive-Collapsed Structure

                              Response option (%)     Parameter estimates
                                                      Threshold 1           Threshold 2           Corrected item-total
Item                          0       1       2       (Response 0 & 1)      (Response 1 & 2)      correlation
Anhedonia                     64.0    30.6    5.4     1.13 (.35)            4.85 (.27)            .72
Depressed mood                63.8    32.4    3.8     1.10 (.37)            5.39 (.29)            .77
Feeling bad about yourself    77.8    19.7    2.5     2.46 (.41)            5.76 (.34)            .61
Suicidal ideation             92.7    7.2     .2      4.56 (1.06)           8.21 (1.03)           .51
Socially withdrawn            75.1    21.5    3.4     2.18 (.39)            5.38 (.31)            .69
Tearfulness                   78.1    19.7    2.2     2.49 (.43)            5.95 (.36)            .68
Brooding                      71.1    25.6    3.2     1.78 (.38)            5.52 (.31)            .72
Could not be cheered up       82.8    16.1    1.1     3.02 (.52)            6.70 (.47)            .70

Note. Standard errors are in parentheses. Anhedonia, depressed mood, feeling bad about yourself, and suicidal ideation are PHQ-9 items. Socially withdrawn, tearfulness, brooding, and could not be cheered up are Endicott items.

This study is also the first to examine the psychometric performance of the Endicott items when administered in a self-report format. The results suggest that the four items can be reliably and meaningfully administered in this format. Relying on the original PHQ-9, however, may increase the risk of overinclusivity when screening for depression. This is problematic in that it can deplete already limited mental health resources and possibly subject patients without clinically significant depressive symptoms to unnecessary treatments with their own side effect profiles. On the other hand, the lower mean scores observed for the PHQ-8-Substitutive-Collapsed indicate that further research may be necessary to determine the optimal algorithm or cut-score for identifying patients with clinically significant depression.

The data also revealed large infit statistics for the psychomotor retardation and agitation item in the PHQ-9-Substitutive model, and removal of this item improved model fit. Previous IRT evaluations of the PHQ-9 have also found this item to detract from model fit in samples of community-dwelling older adults (Forkmann et al., 2013) and patients undergoing coronary artery bypass graft surgery (Kendel et al., 2010). It is unclear whether the wording of this item, which measures both hyper- and hypoactivity, is problematic, or whether psychomotor changes are simply not reliable in self-report format, as individuals may not be able to readily observe these changes in themselves (i.e., compared with a clinician-rated evaluation of this symptom). It is also possible that this item is more appropriately categorized as somatic, and it would therefore not be expected to perform as well as the other exclusively cognitive and affective items on the revised PHQ-9.

The two gateway symptoms, anhedonia and depressed mood, were relatively easier (i.e., were endorsed at lower levels of depression compared to other items). This observation lends support for the approach to MDD diagnosis utilized by the DSM, which requires at least one of these two symptoms to be present for the diagnosis. It suggests that even in oncology, these two symptoms likely represent important “entry criteria” for identifying patients who are experiencing genuine depressive syndromes. In contrast, suicidal ideation was the most difficult item. This finding is hardly surprising, as suicidal ideation, although infrequently endorsed, is a key indicator of the highest levels of depressive symptoms. It also suggests that although some increased preoccupation with death or dying might be expected in the context of a life-threatening illness like cancer, suicidal ideation still remains an important indicator of depression and should not be normalized without further evaluation.

Similar to previous investigations of the PHQ-9 among medical patients (Dyer et al., 2016; Forkmann et al., 2013; Gothwal et al., 2014; Lamoureux et al., 2009; Pedersen et al., 2016), a collapsed range of response options, with several days and more than half the days combined into a single option, was the most suitable scoring approach. The presence of two intermediate response options appears problematic, particularly when the PHQ-9 sum score is used as an indicator of change, since clinically important changes may be obscured if patients do not perceive a meaningful difference between these two response options. Future research with this collapsed set of response options is needed in order to determine optimal clinical cutoffs for identifying levels of depression severity.

There was no differential item functioning identified for gender, age, or racial groups on the proposed PHQ-8-Substitutive-Collapsed, suggesting that the construct is similarly understood across these groups and that, therefore, total scores and individual item endorsement can be interpreted and compared across groups. Studies of the original PHQ-9 in other medical groups have occasionally observed DIF for gender, with women more readily endorsing depressed mood than men (Kendel et al., 2010; Pedersen et al., 2016). It is possible that cancer and its side effects diminish the potential gender differences typically observed on this item in other populations, but future research should examine this further.

Limitations and Suggestions for Future Research

Despite the contributions of the current study in elucidating an improved approach to depression assessment in oncology settings, several limitations warrant note. First, the sample, while diverse in terms of age, cancer type, and stage, was relatively homogeneous in terms of race, ethnicity, and education. The study deliberately recruited a heterogeneous sample of patients with cancer (both by disease and treatment status) to obtain a broad understanding of depressive symptom presentation and item endorsement in oncology outpatients. However, there are potentially important disease-specific and treatment-specific sequelae that may contribute to the relative prominence of affective, cognitive, and somatic items that could not be disentangled in the current study. Similarly, the number of individuals who were approached about participating in the study but declined was not evaluated, and thus conclusions about potential selection bias cannot be drawn. Participants were also healthy enough to receive ambulatory care, while those who were more critically ill are not represented in this sample. Given the limited sociodemographic diversity of the sample, it is unclear if the observed results would be maintained in a more diverse sample. This is important given that previous research has supported different manifestations of depression among population subgroups (e.g., Kalibatseva, Leong, & Ham, 2014). For example, one study utilized IRT to compare depressive symptom endorsement between Asian and European American community-dwelling adults (Kalibatseva et al., 2014). That study detected DIF for nearly one quarter of depression items, with European Americans reporting higher levels of affective symptoms but the same level of somatic symptoms. Thus, the results of this study must be interpreted in light of this limitation, as item utility might be different depending on the subgroup. Future studies should include more racial, ethnic, and socioeconomic diversity, as well as examination of potential differences in item utility depending on disease severity (e.g., inpatient palliative care or hospice, survivorship clinics, etc.).

The findings of the current study do not allow for a determination of classification accuracy, given the absence of a “gold standard” criterion measure such as an expert clinician interview. Therefore, while the findings provide tentative support for a revised version of the PHQ-9 using the ESC and condensed response options, further evaluation is needed, ideally using expert clinician diagnostic interviews to more fully evaluate these modifications. Similarly, because some ESC items included multiple constructs within a single item (e.g., “brooding, self-pity, or pessimism”), future research should separate these constructs in order to clarify which are the most salient and if there are meaningful differences in the psychometric properties of each element that warrant separating compound items into unique items. Moreover, the current study focused on the substitutive criteria proposed by Endicott (1984) for cancer, but other substitutive approaches like that proposed by Cavanaugh (1995) might also warrant additional rigorous examination. Finally, the current study was cross-sectional and therefore the relationship between symptoms (i.e., both affective and somatic), antidepressant treatment, and symptom management/treatment response could not be determined. Repeated assessment of depressive symptoms over time would also allow for the determination of the reliability of the substitutive and somatic symptoms and their predictive validity in the cancer setting.

Conclusion

Notwithstanding these limitations, this study was the first to examine the performance of the PHQ-9 and Endicott’s substitutive criteria using IRT in a large sample of oncology outpatients. As expected, the somatic items included in the PHQ-9 had poorer model fit than a model that replaced these items with the four Endicott substitutive criteria. Likewise, a collapsed response scale further improved the overall model fit. With these modifications, there was no evidence of significant differential item functioning across gender, age, or racial groups. Taken together, this study provides strong preliminary support for utilizing the Endicott substitutive criteria when screening for depression in oncology settings. The potential impact of these findings on clinical practice is substantial, as the growing number of patients with cancer means an even higher burden on already limited mental health resources in these settings. Developing a more precise method (i.e., maximizing both sensitivity and specificity) of identifying patients who are experiencing genuine depressive symptoms above and beyond the somatic symptoms of their illness and treatment will decrease unnecessary triage for additional psychiatric evaluation. Moreover, it lessens the likelihood of patients being unnecessarily prescribed psychotropic medications in the absence of genuinely severe depressive symptoms. Future replication studies and studies with more demographically and socioeconomically heterogeneous patient samples are needed to further determine the robustness of these findings.

References

Akechi, T., Ietsugu, T., Sukigara, M., Okamura, H., Nakano, T., Akizuki, N., . . . Uchitomi, Y. (2009). Symptom indicator of severity of depression in cancer patients: A comparison of the DSM–IV criteria with alternative diagnostic criteria. General Hospital Psychiatry, 31, 225–232. http://dx.doi.org/10.1016/j.genhosppsych.2008.12.004

Akechi, T., Nakano, T., Akizuki, N., Okamura, M., Sakuma, K., Nakanishi, T., . . . Uchitomi, Y. (2003). Somatic symptoms for diagnosing major depression in cancer patients. Psychosomatics, 44, 244–248. http://dx.doi.org/10.1176/appi.psy.44.3.244

American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). Arlington, VA: American Psychiatric Association.

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573. http://dx.doi.org/10.1007/BF02293814

Andrich, D. (1982). An index of person separation in latent trait theory, the traditional KR.20 index, and the Guttman scale response pattern. Education Research and Perspectives, 9, 95–104. Retrieved from https://rasch.org/erp7.htm

Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–246. http://dx.doi.org/10.1037/0033-2909.107.2.238

Bluethmann, S. M., Mariotto, A. B., & Rowland, J. H. (2016). Anticipating the “Silver Tsunami”: Prevalence trajectories and comorbidity burden among older cancer survivors in the United States. Cancer Epidemiology, Biomarkers & Prevention, 25, 1029–1036. http://dx.doi.org/10.1158/1055-9965.EPI-16-0133

Cai, L., & Monroe, S. (2014). A new statistic for evaluating item response theory models for ordinal data (CRESST Report 839). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST). Retrieved from https://files.eric.ed.gov/fulltext/ED555726.pdf

Cavanaugh, S. (1995). Depression in the medically ill. Critical issues in diagnostic assessment. Psychosomatics, 36, 48–59. http://dx.doi.org/10.1016/S0033-3182(95)71707-8

Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48, 1–29. http://dx.doi.org/10.18637/jss.v048.i06

Chalmers, R. P. (2018). Improving the Crossing-SIBTEST statistic for detecting non-uniform DIF. Psychometrika, 83, 376–386. http://dx.doi.org/10.1007/s11336-017-9583-8

Coccia, P. F., Pappo, A. S., Beaupin, L., Borges, V. F., Borinstein, S. C., Chugh, R., . . . Shead, D. A. (2018). Adolescent and young adult oncology. Journal of the National Comprehensive Cancer Network, 16, 66–97. http://dx.doi.org/10.6004/jnccn.2018.0001

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

de Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY: Guilford Press.

DiMatteo, M. R., Lepper, H. S., & Croghan, T. W. (2000). Depression is a risk factor for noncompliance with medical treatment: Meta-analysis of the effects of anxiety and depression on patient adherence. Archives of Internal Medicine, 160, 2101–2107. http://dx.doi.org/10.1001/archinte.160.14.2101

Dyer, J. R., Williams, R., Bombardier, C. H., Vannoy, S., & Fann, J. R. (2016). Evaluating the psychometric properties of 3 depression measures in a sample of persons with traumatic brain injury and major depressive disorder. The Journal of Head Trauma Rehabilitation, 31, 225–232. http://dx.doi.org/10.1097/HTR.0000000000000177

Endicott, J. (1984). Measurement of depression in patients with cancer. Cancer, 53, 2243–2249. http://dx.doi.org/10.1002/cncr.1984.53.s10.2243

Forkmann, T., Gauggel, S., Spangenberg, L., Brähler, E., & Glaesmer, H. (2013). Dimensional assessment of depressive severity in the elderly general population: Psychometric evaluation of the PHQ-9 using Rasch analysis. Journal of Affective Disorders, 148, 323–330. http://dx.doi.org/10.1016/j.jad.2012.12.019


Gothwal, V. K., Bagga, D. K., & Sumalini, R. (2014). Rasch validation of the PHQ-9 in people with visual impairment in South India. Journal of Affective Disorders, 167, 171–177. http://dx.doi.org/10.1016/j.jad.2014.06.019

Holland, P. W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6, 65–70. Retrieved from https://www.jstor.org/stable/4615733

Hu, L. T., & Bentler, P. M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3, 424–453. http://dx.doi.org/10.1037/1082-989X.3.4.424

Jones, S. M., Ludman, E. J., McCorkle, R., Reid, R., Bowles, E. J. A., Penfold, R., & Wagner, E. H. (2015). A differential item function analysis of somatic symptoms of depression in people with cancer. Journal of Affective Disorders, 170, 131–137. http://dx.doi.org/10.1016/j.jad.2014.09.002

Kalibatseva, Z., Leong, F. T. L., & Ham, E. H. (2014). A symptom profile of depression among Asian Americans: Is there evidence for differential item functioning of depressive symptoms? Psychological Medicine, 44, 2567–2578. http://dx.doi.org/10.1017/S0033291714000130

Kang, T., & Chen, T. (2008). Performance of the generalized S-X2 item fit index for polytomous IRT models. Journal of Educational Measurement, 45, 391–406. http://dx.doi.org/10.1111/j.1745-3984.2008.00071.x

Kendel, F., Wirtz, M., Dunkel, A., Lehmkuhl, E., Hetzer, R., & Regitz-Zagrosek, V. (2010). Screening for depression: Rasch analysis of the dimensional structure of the PHQ-9 and the HADS-D. Journal of Affective Disorders, 122, 241–246. http://dx.doi.org/10.1016/j.jad.2009.07.004

Kim, J., & Oshima, T. C. (2013). Effect of multiple testing adjustment in differential item functioning detection. Educational and Psychological Measurement, 73, 458–470. http://dx.doi.org/10.1177/0013164412467033

Krebber, A. M. H., Buffart, L. M., Kleijn, G., Riepma, I. C., de Bree, R., Leemans, C. R., . . . Verdonck-de Leeuw, I. M. (2014). Prevalence of depression in cancer patients: A meta-analysis of diagnostic interviews and self-report instruments. Psycho-Oncology, 23, 121–130. http://dx.doi.org/10.1002/pon.3409

Kroenke, K., & Spitzer, R. L. (2002). The PHQ-9: A new depression diagnostic and severity measure. Psychiatric Annals, 32, 509–515. http://dx.doi.org/10.3928/0048-5713-20020901-06

Lamoureux, E. L., Tee, H. W., Pesudovs, K., Pallant, J. F., Keeffe, J. E., & Rees, G. (2009). Can clinicians use the PHQ-9 to assess depression in people with vision loss? Optometry and Vision Science, 86, 139–145. http://dx.doi.org/10.1097/OPX.0b013e318194eb47

Li, H.-H., & Stout, W. (1996). A new procedure for detection of crossing DIF. Psychometrika, 61, 647–677. http://dx.doi.org/10.1007/BF02294041

Mair, P., & Hatzinger, R. (2007). Extended Rasch modeling: The eRm package for the application of IRT models in R. Journal of Statistical Software, 20, 1–20. http://dx.doi.org/10.18637/jss.v020.i09

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174. http://dx.doi.org/10.1007/BF02296272

Maydeu-Olivares, A., & Joe, H. (2006). Limited information goodness-of-fit testing in multidimensional contingency tables. Psychometrika, 71, 713–732. http://dx.doi.org/10.1007/s11336-005-1295-9

Misono, S., Weiss, N. S., Fann, J. R., Redman, M., & Yueh, B. (2008). Incidence of suicide in persons with cancer. Journal of Clinical Oncology, 26, 4731–4738. http://dx.doi.org/10.1200/JCO.2007.13.8941

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176. http://dx.doi.org/10.1177/014662169201600206

Nelson, C. J., Cho, C., Berk, A. R., Holland, J., & Roth, A. J. (2010). Are gold standard depression measures appropriate for use in geriatric cancer patients? A systematic evaluation of self-report depression instruments used with geriatric, cancer, and geriatric cancer samples. Journal of Clinical Oncology, 28, 348–356. http://dx.doi.org/10.1200/JCO.2009.23.0201

Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York, NY: McGraw Hill.

Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50–64. http://dx.doi.org/10.1177/01466216000241003

Pedersen, S. S., Mathiasen, K., Christensen, K. B., & Makransky, G. (2016). Psychometric analysis of the Patient Health Questionnaire in Danish patients with an implantable cardioverter defibrillator (The DEFIB-WOMEN study). Journal of Psychosomatic Research, 90, 105–112. http://dx.doi.org/10.1016/j.jpsychores.2016.09.010

Radloff, L. S. (1977). The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1, 385–401. http://dx.doi.org/10.1177/014662167700100306

Rosseel, Y. (2012). lavaan: An R Package for structural equation modeling. Journal of Statistical Software, 48, 1–36. http://dx.doi.org/10.18637/jss.v048.i02

Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Richmond, VA: Psychometric Society. Retrieved from https://www.psychometricsociety.org/sites/default/files/pdf/MN17.pdf

Saracino, R. M., Cham, H., Rosenfeld, B., & Nelson, C. (2018). Confirmatory factor analysis of the Center for Epidemiologic Studies Depression scale in oncology with examination of invariance between younger and older patients. European Journal of Psychological Assessment. Advance online publication. http://dx.doi.org/10.1027/1015-5759/a000510

Saracino, R. M., Rosenfeld, B., & Nelson, C. J. (2018). Performance of four diagnostic approaches to depression in adults with cancer. General Hospital Psychiatry, 51, 90–95. http://dx.doi.org/10.1016/j.genhosppsych.2018.01.006

Shealy, R., & Stout, W. (1993). A model-based standardization approach that separates true bias/DIF from group ability differences and detect test bias/DTF as well as item bias/DIF. Psychometrika, 58, 159–194. http://dx.doi.org/10.1007/BF02294572

Steiger, J. H. (1989). EzPATH: A supplementary module for SYSTAT and SYGRAPH. Evanston, IL: SYSTAT.

Vodermaier, A., Linden, W., & Siu, C. (2009). Screening for emotional distress in cancer patients: A systematic review of assessment instruments. Journal of the National Cancer Institute, 101, 1464–1488. http://dx.doi.org/10.1093/jnci/djp336

Williams, R. T., Heinemann, A. W., Bode, R. K., Wilson, C. S., Fann, J. R., & Tate, D. G. (2009). Improving measurement properties of the Patient Health Questionnaire-9 with rating scale analysis. Rehabilitation Psychology, 54, 198–203. http://dx.doi.org/10.1037/a0015529

Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8, 370.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago, IL: MESA Press.

Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213. http://dx.doi.org/10.1111/j.1745-3984.1993.tb00423.x

Zigmond, A. S., & Snaith, R. P. (1983). The hospital anxiety and depression scale. Acta Psychiatrica Scandinavica, 67, 361–370. http://dx.doi.org/10.1111/j.1600-0447.1983.tb09716.x

Received January 2, 2019
Revision received July 8, 2019
Accepted July 11, 2019
