Article

Educational and Psychological Measurement

2015, Vol. 75(3) 389–405 © The Author(s) 2014

Reprints and permissions: sagepub.com/journalsPermissions.nav

DOI: 10.1177/0013164414559071 epm.sagepub.com

Relationships Among Classical Test Theory and Item Response Theory Frameworks via Factor Analytic Models

Nidhi Kohli¹, Jennifer Koran², and Lisa Henn¹

Abstract

There are well-defined theoretical differences between the classical test theory (CTT) and item response theory (IRT) frameworks. It is understood that in the CTT framework, person and item statistics are test- and sample-dependent. This is not the perception with IRT. For this reason, the IRT framework is considered to be theoretically superior to the CTT framework for the purpose of estimating person and item parameters. In previous simulation studies, IRT models were used both as generating and as fitting models. Hence, results favoring the IRT framework could be attributed to IRT being the data-generation framework. Moreover, previous studies only considered the traditional CTT framework for the comparison, yet there is considerable literature suggesting that it may be more appropriate to use CTT statistics based on an underlying normal variable (UNV) assumption. The current study relates the class of CTT-based models with the UNV assumption to that of IRT, using confirmatory factor analysis to delineate the connections. A small Monte Carlo study was carried out to assess the comparability between the item and person statistics obtained from the frameworks of IRT and CTT with UNV assumption. Results show the frameworks of IRT and CTT with UNV assumption to be quite comparable, with neither framework showing an advantage over the other.

¹University of Minnesota, Minneapolis, MN, USA
²Southern Illinois University, Carbondale, IL, USA

Corresponding Author:

Nidhi Kohli, Quantitative Methods in Education Program, Department of Educational Psychology, University of Minnesota, 161 Education Sciences Building, 56 East River Road, Minneapolis, MN 55455, USA.

Email: nkohli@umn.edu

Keywords

classical test theory, item response theory, relationship, factor analysis

Classical test theory (CTT) has been prominent in the field of educational measurement since the 1920s; however, for the last three decades, item response theory (IRT) has been the primary framework for educational measurement and psychometric issues. A commonly held belief is that the IRT framework is theoretically superior to the CTT framework for the estimation of person and item parameters because the person and item statistics based on the CTT framework are test- and sample-dependent, respectively. Specifically, the item statistics derived from the CTT-based models, item difficulty and item discrimination, are dependent on the sample of respondents selected to answer the items. If the same items are given to a different sample, and the item difficulty and item discrimination indices are computed under CTT-based models, they may vary substantially depending on the nature of the sample. Similarly, the scores earned by test takers depend on the items they have been asked to answer. If the test takers are given another set of more or less difficult items, their number-correct test scores are likely to be lower or higher, respectively, than their number-correct scores on the original set of items.

In contrast to CTT, the person and item statistics based on IRT are considered to be stable across different samples of items and persons, respectively. As explained by Lord (1980), this perspective on the ability variable follows from viewing the item response function as a regression function of the observed test outcomes on the ability variable. The probability of observing a particular outcome likely is unaffected by how many among the test subjects have a particular level of ability. The invariance of the item parameters follows conceptually from this framing of the item response function as a regression function, with other elements in the function conceived as fixed parameters.

We have found nothing in the literature that examines this item parameter estimate stability via simulation, that is to say, that examines variation in parameter estimates at moderate sample sizes. Nonetheless, this conceptual stability property often seems to be attributed to parameter estimates as well.

In previous simulation studies, researchers have used IRT models both as generating and as fitting models, which confounds comparability between CTT and IRT frameworks with data-model congruity. An alternate explanation for results favoring the IRT framework is that IRT was used to generate the data, and thus, the data will more naturally fit this model. In addition, researchers have chosen traditional CTT item statistics, proportion correct and point-biserial correlations, and a traditional person statistic, unweighted total scores, for the comparison. But there is considerable literature suggesting that it may be more appropriate to use statistics based on an underlying normal variable (UNV) assumption, such as thresholds and biserial correlations.

Despite well-defined theoretical differences between the CTT and IRT frameworks, the empirical research comparing the two frameworks has failed to exhibit differences between the two in terms of person and item parameter estimates. To explore the distinctions between the two frameworks in greater depth, we first present a review of the empirical literature comparing them. Prior simulation studies have exclusively used IRT models to generate the data, and the literature does not consider models with a UNV assumption common in factor analytic models for categorical data. However, recent literature has introduced models with a UNV assumption as an extension of the CTT framework. As both IRT models and CTT-based models with the UNV assumption have been demonstrated to be members of the class of confirmatory factor analysis models, we expect a high level of comparability between results obtained under related member models. To test this theory, we compare item and person statistics for the extended CTT framework and the IRT framework using data generated under both frameworks. That we simulate data under each framework is, to our knowledge, a unique contribution.

Literature Review

Prior studies comparing CTT and IRT frameworks have not found that differences between the two translate to advantage of one framework over another. Many works have mentioned IRT parameter invariance as part of a general introduction of the IRT model (see, e.g., Hambleton & Jones, 1993; Sharkness & DeAngelo, 2011). Rudner (1983) examined how the magnitude of an item discrimination value should change if the location of the ability variable is not the same for two groups of examinees. Cook, Eignor, and Hessy (1988) compared three administrations of a Biology achievement test in part to examine stability of IRT item parameter estimates. They found lack of stability, noting that it is affected by the advancement in skill level of the test-takers. We have found no prior literature examining IRT parameter invariance via simulation. Prior studies comparing estimates under CTT and IRT paradigms show high correlations between CTT and IRT not only for person ability but also for item difficulty (Courville, 2004; Fan, 1998; Lawson, 1991). Fan's (1998) research in particular "failed to support the IRT framework for its ostensible superiority over CTT in producing invariant item statistics" (p. 378). Item discrimination indices are less highly correlated between the two frameworks, dipping as low as 0.60, particularly when the range of difficulty parameter values exceeds 0.5 in absolute value (Fan, 1998; MacDonald & Paunonen, 2002). These correspondences are highest when the traditional CTT statistics are compared to the corresponding item statistics in one- and two-parameter logistic IRT models (Fan, 1998; MacDonald & Paunonen, 2002). All these studies use IRT as the data generation model.

Some attempts have been made to relate one framework to the other (Hambleton & Jones, 1993; Lord, 1980; Miyazaki, 2005). Lord (1980) presented approximate expressions for IRT item discrimination parameter and item difficulty parameter as functions of CTT item–biserial correlation and pass/fail threshold parameter. He called these relations "crude" and added that they "are given . . . not for practical use but rather to give an idea of the nature of the item discrimination parameter" (pp. 33-34). Miyazaki (2005) used two-level hierarchical generalized linear models as the intermediate framework to relate these two approaches, which additionally requires a normal distributional assumption on the observed test scores and the use of the identity link. This distributional assumption is not part of the core framework of CTT, making this approach for associating the two classes of model more restrictive.

Despite the empirically demonstrated similarities between the two modeling frameworks, the results have led some researchers to nonetheless conclude that the IRT framework is superior to the CTT framework (MacDonald & Paunonen, 2002). We see two problems with the body of prior research comparing the CTT and IRT frameworks. First, prior empirical studies have exclusively used IRT to generate the data; thus, empirical results favoring IRT might be due to the design of the study. Second, prior studies comparing CTT and IRT have not considered CTT-based models or statistics with a UNV assumption. However, recent literature has presented CTT-based models with a UNV assumption as legitimate, desirable extensions of the CTT framework (Raykov & Marcoulides, 2011).

Whereas previous efforts have sought to connect IRT to CTT with approximate expressions or with hierarchical models, other research supports a relation between a class of CTT models and some IRT models using confirmatory factor analysis to delineate the associations. This approach does not require a distributional assumption on the response as with Miyazaki (2005). Where CTT observed scores are considered as a result of tests containing a single, binary item, CTT can be applied to scored responses to individual items. This is possible because CTT assumes only the existence of the mathematical expectation of the observed score, not that the observed score be continuous, contrary to popular misconception (Raykov & Marcoulides, 2011). Furthermore, the item scores from several such single-item tests can be assumed to fit either parallel, tau-equivalent, or congeneric¹ CTT-based models, and these CTT-based models have been demonstrated as members within the family of confirmatory factor analysis models (DeVellis, 1991; Graham, 2006; Jöreskog, 1971). An extension, CTT-based models with a UNV assumption (Ferrando, 2000; Raykov & Marcoulides, 2011), has produced models shown to be members of the family of nonlinear confirmatory factor analysis models (Raykov & Marcoulides, 2011). The two-parameter IRT model (and the one-parameter model nested within it) has likewise been shown to be mathematically equivalent to a nonlinear confirmatory factor analysis model (Kamata & Bauer, 2008; McDonald, 1999; Takane & de Leeuw, 1987; Wirth & Edwards, 2007).

As both the CTT-based models with a UNV assumption and the one- and two-parameter IRT models have been demonstrated to be members of the class of nonlinear confirmatory factor analysis models, we expect a high level of similarity between results when the two frameworks are assessed under comparable conditions. We generate simulated data in each framework, fit the generating model back to the data, and compare its parameter estimates to those resulting from the fit of the analogous model in the other framework. We thereby show relations between the two classes under conditions of parity.

Method

Overview

A small Monte Carlo simulation study was carried out to examine whether results reported in the literature held when the data generation model was varied (i.e., data were generated under both the IRT framework and the CTT with UNV assumption framework). The simulation design was composed of two factors: test length (total number of items) and number of examinees. We chose these two factors for the simulation design because we expect them to affect the magnitude of comparability between the item and person statistics arising from the CTT with UNV assumption framework and those from the IRT framework. Small sample sizes are known to affect item parameter estimates adversely, and short tests are known to affect person ability estimates adversely. The test length factor took values of 20, 40, and 60 items, whereas the number of examinees factor took values of 500 and 1,000 examinees. Thus, the combination of manipulated factors (3 × 2) resulted in a Monte Carlo simulation with six cells or conditions. The conditions are listed in Table 1. For each condition, 100 replications were generated.
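The crossed design described above can be enumerated in a few lines. The following is only an illustrative Python sketch, not the authors' code (the study itself was run in R); the function name `simulation_conditions` and the constant names are hypothetical, while the factor levels and the replication count come from the text.

```python
from itertools import product

# Factor levels as stated in the text.
TEST_LENGTHS = (20, 40, 60)   # number of items
SAMPLE_SIZES = (500, 1000)    # number of examinees
N_REPLICATIONS = 100          # replications generated per condition


def simulation_conditions():
    """Return the 3 x 2 crossed design as (case label, items, examinees) tuples."""
    cells = product(TEST_LENGTHS, SAMPLE_SIZES)
    return [
        (f"Case {index}", n_items, n_persons)
        for index, (n_items, n_persons) in enumerate(cells, start=1)
    ]


if __name__ == "__main__":
    for label, n_items, n_persons in simulation_conditions():
        print(label, n_items, n_persons, N_REPLICATIONS)
```

Iterating test length in the outer position reproduces the case ordering used in the study, with the number of examinees alternating within each test length.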

Within each manipulated condition, data sets were generated for each of four models: two models from CTT with UNV assumption, the parallel and the congeneric models, and two from IRT, the one-parameter logistic (1PL) and two-parameter logistic (2PL) models. The tau-equivalent CTT-based model with the UNV assumption was not used in the simulation study, because with the UNV assumption, the tau-equivalent model is functionally equivalent to the parallel model. This is due to the fact that constraining the loadings also results in the error variances being equal to one another. The four models used in the study are explained in detail in the following sections.

Table 1. Simulation Conditions.

Condition   Number of items   Number of examinees
Case 1      20                500
Case 2      20                1,000
Case 3      40                500
Case 4      40                1,000
Case 5      60                500
Case 6      60                1,000

The One-Parameter Logistic Rasch Model

This model is the more restrictive of the two IRT models. All items are given the same weight in determining the level of the latent construct for an individual. This model is typically presented in its logistic form for an individual as

$$P(X_{ki} = 1 \mid \theta_i) = \frac{1}{1 + \exp[-D(\theta_i - b_k)]}, \tag{1}$$

where $\theta_i$ represents the level of the latent trait for an individual; $b_k$ is the item difficulty parameter on a scale approximating the normal ogive scale (nonlinear factor analysis model with probit transformation), which describes how much of the latent construct an individual must possess to have a 50% probability of endorsing item $k$; and $D = 1.7$ is a scaling constant that when multiplied by an item parameter approximately produces the corresponding value for the parameter on the logistic scale.

In the 1PL model, the relevant person statistic is an estimate of $\theta$. The relevant item statistic is an estimate of the item difficulty parameter $b_k$.

The Two-Parameter Birnbaum Model

This model is considered to be the less restrictive model of the two IRT models. Items that are more discriminating are given greater weight in determining the level of the latent construct for an individual. This model typically is presented in its logistic form as

$$P(X_{ki} = 1 \mid \theta_i) = \frac{1}{1 + \exp[-D a_k(\theta_i - b_k)]}, \tag{2}$$

where $\theta_i$ represents the level of the latent trait for an individual; $b_k$ is the item difficulty parameter describing how much of the latent construct an individual must possess to have a 50% probability of endorsing item $k$; $a_k$ is the item slope (discrimination) parameter on a scale approximating the normal ogive scale (nonlinear factor analysis model with probit transformation), which describes the strength of the relationship between item $k$ and the latent trait $\theta$; and $D = 1.7$ is a scaling constant that when multiplied by an item parameter approximately produces the corresponding value for the parameter on the logistic scale.

In the 2PL model, the relevant person statistic is an estimate of $\theta$. The relevant item statistics are estimates of the item difficulty parameter $b_k$ and the item slope (discrimination) parameter $a_k$.
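As a small illustration of the two logistic forms, the sketch below evaluates the item response function in Python. It is not taken from the study; the function name `irt_probability` is hypothetical, and the 1PL form is obtained here simply by leaving the slope at 1.

```python
import math

D = 1.7  # scaling constant used in the 1PL and 2PL equations


def irt_probability(theta, b, a=1.0):
    """P(X = 1 | theta) under the logistic 2PL form; a = 1 gives the 1PL form."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


if __name__ == "__main__":
    # An examinee located exactly at the item difficulty has a 50% chance.
    print(irt_probability(theta=0.5, b=0.5))  # 0.5
    # A steeper slope separates examinees above and below b more sharply.
    print(irt_probability(theta=1.0, b=0.0, a=1.0))
    print(irt_probability(theta=1.0, b=0.0, a=2.0))
```

The first call shows the interpretation of the difficulty parameter given in the text: when the latent trait equals the difficulty, the exponent is zero and the probability is one half regardless of the slope.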

The Underlying Normal Variable Assumption

The UNV assumption is a popular approach in the latent variable modeling literature (Jöreskog, 1990; Mislevy, 1986; Muthén, 1978, 1984; Muthén & Christoffersson, 1981; Takane & de Leeuw, 1987). Each observed binary item score variable $X_k$ with two categories is assumed to be a coarse representation of an underlying unobserved continuous variable $X_k^*$. If $X_k^*$ is assumed to be univariate normally distributed, then $X_k^*$ is the UNV. A monotonic transformation matches the density of the observed binary distribution to the density of the continuous distribution:

$$X_k = \begin{cases} 0 & \text{if } X_k^* < \tau_k, \\ 1 & \text{if } X_k^* \geq \tau_k. \end{cases} \tag{3}$$

The UNV $X_k^*$ is assumed to have a range from negative infinity to positive infinity. The notation $\tau_k$ will be used to refer to the single estimated threshold for dichotomous item $k$.
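The thresholding rule can be illustrated with a short Python sketch. This is an illustration only, under the added assumption that the underlying variable is standard normal (mean 0, variance 1); the function names, the seed, and the threshold value are hypothetical.

```python
import random
from statistics import NormalDist


def dichotomize(x_star, tau):
    """Threshold rule: the observed score is 1 when X* reaches the threshold."""
    return 1 if x_star >= tau else 0


def expected_proportion_correct(tau):
    """Share of a standard normal X* at or above tau, that is, 1 - Phi(tau)."""
    return 1.0 - NormalDist().cdf(tau)


if __name__ == "__main__":
    rng = random.Random(2015)  # arbitrary seed, for repeatability only
    tau = 0.5                  # hypothetical threshold
    draws = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
    observed = sum(dichotomize(x, tau) for x in draws) / len(draws)
    print(expected_proportion_correct(tau), observed)
```

Under that assumption the simulated share of scores equal to 1 should sit close to the theoretical value, which is why the threshold and the item proportion correct carry the same information.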

The Parallel Model With the UNV Assumption

Applying the UNV assumption to the parallel model concept (Ferrando, 2000), $X_{ki}^*$ for each item $k$ for individual $i$ can be decomposed into true score $T_i$ and error $E_{ki}$ as

$$X_{ki}^* = T_i + E_{ki} \tag{4}$$

and

$$\operatorname{var}(E_{ki}) = \operatorname{var}(E_{k'i}),$$

where $k, k' \in \{1, \ldots, p\}$ are item indices and $k \neq k'$. By definition, $E(E_{ki}) = 0$. Analogous to the parallel model without the UNV assumption, all UNV item scores are assumed to share the same true score $T_i$. In addition, all UNV item scores are assumed to have equal reliability (equal measurement error). The true score $T_i$ carries no item subscript $k$ because it is assumed to be the same across items.

The Congeneric Model With the UNV Assumption

Applying the UNV assumption to the congeneric model concept (Ferrando, 2000), the UNV item scores $X_{ki}^*$ have true scores that are linear functions of the same true score $T_i$, and individual item error variances are not constrained to be equal. In equation form, the model is

$$X_{ki}^* = \lambda_k^* T_i + E_{ki}, \tag{5}$$

where $\lambda_k^*$ is the unique loading for item $k$. By definition, $E(E_{ki}) = 0$. Note that $T_{ki} = \lambda_k^* T_i$.

In the congeneric model with the UNV assumption, the relevant person statistic is an estimate of $T_i$ (such as a factor score). Because $\lambda_k^*$ differs across items, $T_i$ is not necessarily a linear function of the number-correct score. Thus, in the congeneric model, the number-correct score does not necessarily contain the same information as the true score $T_i$. The relevant item statistics are the item threshold $\tau_k$ and the biserial (polyserial) correlation $r_{bs}$. As in the parallel model with the UNV assumption, if the item score $X_{ki}$ is the result of dichotomous scoring $\{0, 1\}$, then $1 - \Phi(\tau_k)$, where $\Phi$ is the standard normal cumulative distribution function, is the proportion of individuals for whom $X_{ki} = 1$ (item proportion correct $p_k$). Thus, in the case of the congeneric model with the UNV assumption, the item proportion correct $p_k$ contains the same information as the item threshold. The biserial (polyserial) correlation $r_{bs}$ is equivalent to the standardized factor loading, which is the correlation between the UNV item score $X_{ki}^*$ and the common factor $T_i$.
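The statement that the standardized loading is the correlation between the UNV item score and the common factor can be checked numerically. The Python sketch below is illustrative only and is not the authors' generation code; it assumes a standard normal true score and an error variance of $1 - \lambda^2$ so that the UNV item score has unit variance, and all function names and parameter values are hypothetical.

```python
import math
import random


def simulate_congeneric_item(loading, tau, n_persons, rng):
    """Generate (T, X*, X) for one item: X* = loading * T + E, then threshold at tau.

    Assumption of this sketch: var(E) = 1 - loading**2, so X* has unit variance
    and the loading is the correlation between X* and T.
    """
    error_sd = math.sqrt(1.0 - loading ** 2)
    true_scores, latent, observed = [], [], []
    for _ in range(n_persons):
        t = rng.gauss(0.0, 1.0)
        x_star = loading * t + rng.gauss(0.0, error_sd)
        true_scores.append(t)
        latent.append(x_star)
        observed.append(1 if x_star >= tau else 0)
    return true_scores, latent, observed


def pearson(x, y):
    """Plain Pearson correlation, written out to keep the sketch dependency-free."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    rng = random.Random(7)  # arbitrary seed
    t, x_star, x = simulate_congeneric_item(0.7, 0.0, 20_000, rng)
    print(pearson(t, x_star), sum(x) / len(x))
```

With a large sample, the printed correlation should fall near the chosen loading and the proportion of ones near $1 - \Phi(\tau)$.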

Data Generation

The discrimination parameter, $a$, for the IRT models was generated on the normal ogive scale, meaning that a scaling factor of $D = 1.7$ was incorporated into the IRT model as per Equation 2. Item difficulty, $b$, values for the IRT models were sampled from a uniform distribution, $b \sim \text{Uni}(-2, 2)$. This is comparable to values used in the MacDonald and Paunonen (2002) study. In the 1PL model, the item discrimination, $a$, was fixed at 1. Item discrimination values for the 2PL were sampled from a uniform distribution, $a \sim \text{Uni}(1, 2)$. Person parameter, $\theta$, values were drawn from the standard normal distribution. Corresponding bounds for sampling distributions for loading, $\lambda$, and threshold, $\tau$, parameters for the CTT-based models with UNV assumption were calculated using the following equations:

$$\lambda_k^* = \frac{a_k / D}{\sqrt{1 + (a_k / D)^2}} \tag{6}$$

and

$$\tau_k = \frac{(a_k / D)\, b_k}{\sqrt{1 + (a_k / D)^2}} \tag{7}$$

(Wirth & Edwards, 2007).

Thus, for the parallel model with the UNV assumption, the threshold, $\tau$, values were sampled from a uniform distribution, with $\tau \sim \text{Uni}(-1.014, 1.014)$. Factor loadings, $\lambda$, were set to 1. For the congeneric model with the UNV assumption, the threshold $\tau \sim \text{Uni}(-1, 1.52)$, while the factor loading $\lambda \sim \text{Uni}(0.5, 0.76)$. As with the IRT model, person parameter, $T$, values for both CTT with UNV assumption models were assumed to be standard normally distributed. Data for the models considered across all the manipulated conditions were generated by using R version 3.0.2 (R Development Core Team, 2013), with data for the CTT with UNV models generated with the R package psych, version 1.4.5 (Revelle, 2014). The IRT models were fit with the R package mirt, version 1.4 (Chalmers, 2012), using the EM algorithm, and CTT with UNV assumption models were fit with the R package lavaan, version 0.5-17.701 (Rosseel, 2012), using a weighted least squares estimator with robust variance estimator, mimicking the WLSMV estimator in Mplus (Muthén & Muthén, 1998-2012).
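Equations 6 and 7 are simple enough to evaluate directly, which makes it easy to see where the sampling bounds for the loadings and thresholds come from. The Python sketch below is illustrative rather than the authors' code; the function name `irt_to_factor` is hypothetical.

```python
import math

D = 1.7  # scaling constant from the IRT equations


def irt_to_factor(a, b):
    """Map IRT (a_k, b_k) to (loading, threshold) following Equations 6 and 7."""
    scaled = a / D
    denominator = math.sqrt(1.0 + scaled ** 2)
    loading = scaled / denominator
    threshold = scaled * b / denominator
    return loading, threshold


if __name__ == "__main__":
    # Endpoints of the generating ranges a in [1, 2] and b in [-2, 2].
    for a, b in [(1.0, -2.0), (1.0, 2.0), (2.0, 2.0)]:
        loading, threshold = irt_to_factor(a, b)
        print(a, b, round(loading, 3), round(threshold, 3))
```

At $a = 1$ and $b = \pm 2$ the threshold evaluates to about $\pm 1.014$, matching the bounds reported for the parallel model, and at $a = 1$ and $a = 2$ the loading evaluates to about 0.51 and 0.76, in line with the range reported for the congeneric model.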

Analysis

On fitting the models, the correlations of person and item estimates were calculated. Correlations were computed across pairings of two sets of statistics: the IRT statistics and the CTT with UNV assumption model statistics. The correlation of theta, $\theta$, the person statistic obtained from the IRT models, and the factor score, $T$, obtained from the CTT with UNV assumption models, was calculated to assess the degree of comparability of person estimates. Correlations between the item difficulty parameter, $b$, obtained from the IRT models, and the item threshold, $\tau$, obtained from the CTT with UNV assumption models, were calculated to assess the degree of comparability of item difficulty estimates. The correlations between the item discrimination parameter, $a$, obtained from the IRT models, and the factor loading, $\lambda$, obtained from the CTT with UNV assumption models, were obtained to assess the degree of comparability of item discrimination estimates. Finally, the median correlation and the range of correlations for each pairing across 100 replicates were computed for reporting. The correlation calculations were handled by R version 3.0.2.
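The reporting step, one correlation per replicate followed by the minimum, median, and maximum across replicates, can be sketched as follows. This Python example is an illustration only (the study used R); the estimates are made-up stand-ins generated inside the demo, not study results, and the function names are hypothetical.

```python
import math
import random
from statistics import median


def pearson(x, y):
    """Pearson correlation between two equally long lists of estimates."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def summarize(correlations):
    """Return (minimum, median, maximum) of the per-replicate correlations."""
    return min(correlations), median(correlations), max(correlations)


if __name__ == "__main__":
    rng = random.Random(3)  # arbitrary seed
    per_replicate = []
    for _ in range(100):  # 100 replicates per condition, as in the study
        # Hypothetical stand-ins for difficulty and threshold estimates of 20 items.
        difficulty = [rng.uniform(-2.0, 2.0) for _ in range(20)]
        threshold = [0.5 * b + rng.gauss(0.0, 0.05) for b in difficulty]
        per_replicate.append(pearson(difficulty, threshold))
    print(summarize(per_replicate))
```

Reporting the median together with the range, rather than a mean, keeps a single poorly behaved replicate from dominating the summary while still exposing it through the minimum.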

Results

Results appear in Tables 2 to 9. Each table presents results for one of the model pairings. The same model used to generate the data was fit back to the data, and the resulting estimates were compared to another model. Tables 2 and 3 concern data generated by the 1PL IRT model. The 1PL IRT model fit is compared with the parallel CTT with UNV assumption model in Table 2 and to the congeneric CTT with UNV assumption model in Table 3. Tables 4 and 5 concern data generated by the 2PL IRT model, compared again to the parallel and congeneric CTT with UNV assumption models.

Tables 6 and 7 use data generated under the parallel CTT with UNV assumption model. The fit of the parallel model was compared with estimates from the 1PL IRT model in Table 6 and the 2PL IRT model in Table 7. Tables 8 and 9 make similar comparisons using data generated under the congeneric CTT with UNV assumption model.

Thus, Tables 2 and 6 make the same comparisons, estimates from the 1PL IRT model and the parallel CTT with UNV assumption model. In Table 2, the data originate from the 1PL IRT model, whereas in Table 6, the data originate from the parallel CTT with UNV assumption model. Analogous pairings occur for Tables 3 and 7, Tables 4 and 8, and Tables 5 and 9.

The results of the analyses comparing IRT item difficulty parameter estimates with those for item threshold $\tau$ from the CTT with UNV assumption appear in the center three columns in each of Tables 2 to 9. The correlations, based on the models paired in each table, indicate a high degree of comparability. Note that there is only one set of threshold estimates for both the parallel and congeneric CTT with UNV assumption models. This occurs because, when fitting the model to the data, the thresholds are estimated as a first step, and the specific model (parallel or congeneric) is fit as a

Table 2. Correlation Between 1PL IRT and Parallel Fits, Data From 1PL IRT Model.

                            Correlation in discrimination    Correlation in difficulty        Correlation in factor scores
                            parameter                        parameter
1PL data vs. parallel fit   Minimum   Median   Maximum       Minimum   Median   Maximum       Minimum   Median   Maximum
Case 1                      --        --       --            1.000     1.000    1.000         0.999     0.999    1.000
Case 2                      --        --       --            1.000     1.000    1.000         0.999     0.999    0.999
Case 3                      --        --       --            1.000     1.000    1.000         0.999     1.000    1.000
Case 4                      --        --       --            1.000     1.000    1.000         0.999     1.000    1.000
Case 5                      --        --       --            1.000     1.000    1.000         1.000     1.000    1.000
Case 6                      --        --       --            1.000     1.000    1.000         1.000     1.000    1.000

Note. 1PL IRT = one-parameter logistic item response theory. -- = Discrimination parameter is not available in the 1PL IRT model, and loadings are not unique in the parallel model. With these models, the correlation is not calculated.

Table 3. Correlations Between Congeneric and 1PL IRT Fits, Data From 1PL IRT Model.

                              Correlation in discrimination    Correlation in difficulty        Correlation in factor scores
                              parameter                        parameter
1PL data vs. congeneric fit   Minimum   Median   Maximum       Minimum   Median   Maximum       Minimum   Median   Maximum
Case 1                        --        --       --            0.993     0.998    0.999         0.997     0.998    0.999
Case 2                        --        --       --            0.995     0.999    1.000         0.998     0.999    0.999
Case 3                        --        --       --            0.986     0.998    0.999         0.999     0.999    0.999
Case 4                        --        --       --            0.997     0.999    1.000         0.999     0.999    1.000
Case 5                        --        --       --            0.995     0.998    0.999         0.999     0.999    1.000
Case 6                        --        --       --            0.998     0.999    1.000         0.999     0.999    1.000

Note. 1PL IRT = one-parameter logistic item response theory. -- = Discrimination parameter is not available in the 1PL IRT model, and loadings are not unique in the parallel model. With these models, the correlation is not calculated.

Table 4. Correlations Between Parallel and 2PL IRT Fits, Data From 2PL IRT Model.

                            Correlation in discrimination    Correlation in difficulty        Correlation in factor scores
                            parameter                        parameter
2PL data vs. parallel fit   Minimum   Median   Maximum       Minimum   Median   Maximum       Minimum   Median   Maximum
Case 1                      --        --       --            1.000     1.000    1.000         0.995     0.997    0.998
Case 2                      --        --       --            1.000     1.000    1.000         0.995     0.997    0.999
Case 3                      --        --       --            1.000     1.000    1.000         0.995     0.998    0.999
Case 4                      --        --       --            1.000     1.000    1.000         0.996     0.998    0.998
Case 5                      --        --       --            1.000     1.000    1.000         0.996     0.998    0.999
Case 6                      --        --       --            1.000     1.000    1.000         0.997     0.998    0.999

Note. 2PL IRT = two-parameter logistic item response theory. -- = Discrimination parameter is not available in the 1PL IRT model, and loadings are not unique in the parallel model. With these models, the correlation is not calculated.

Table 5. Correlations Between Congeneric and 2PL IRT Fits, Data From 2PL IRT Model.

                              Correlation in discrimination    Correlation in difficulty        Correlation in factor scores
                              parameter                        parameter
2PL data vs. congeneric fit   Minimum   Median   Maximum       Minimum   Median   Maximum       Minimum   Median   Maximum
Case 1                        0.656     0.909    0.983         0.991     0.997    0.999         0.998     0.999    1.000
Case 2                        0.877     0.952    0.986         0.992     0.997    0.999         0.999     0.999    1.000
Case 3                        0.505     0.913    0.974         0.992     0.997    0.999         0.998     0.999    1.000
Case 4                        0.887     0.945    0.973         0.995     0.997    0.999         0.999     1.000    1.000
Case 5                        0.696     0.914    0.963         0.994     0.997    0.999         0.998     0.999    1.000
Case 6                        0.887     0.948    0.977         0.995     0.998    0.999         0.999     1.000    1.000

Note. 2PL IRT = two-parameter logistic item response theory.

T a b

le 7 .

C o rr

e la

ti o n s

B e tw

e e n

2 P L

IR T

an d

P ar

al le

l F it s,

D at

a F ro

m P ar

al le

l M

o d e l.

C o rr

e la

ti o n

in d is

cr im

in at

io n

p ar

am e te

r C

o rr

e la

ti o n

in d if fi cu

lt y

p ar

am e te

r C

o rr

e la

ti o n

in fa

ct o r

sc o re

s

P ar

al le

l d at

a vs

. 2 P L

fi t

M in

im u m

M e d ia

n M

ax im

u m

M in

im u m

M e d ia

n M

ax im

u m

M in

im u m

M e d ia

n M

ax im

u m

C as

e 1

� �

� 1 .0

0 0

1 .0

0 0

1 .0

0 0

0 .9

9 5

0 .9

9 7

0 .9

9 9

C as

e 2

� �

� 1 .0

0 0

1 .0

0 0

1 .0

0 0

0 .9

9 7

0 .9

9 8

0 .9

9 9

C as

e 3

� �

� 1 .0

0 0

1 .0

0 0

1 .0

0 0

0 .9

9 7

0 .9

9 8

0 .9

9 9

C as

e 4

� �

� 1 .0

0 0

1 .0

0 0

1 .0

0 0

0 .9

9 9

0 .9

9 9

0 .9

9 9

C as

e 5

� �

� 1 .0

0 0

1 .0

0 0

1 .0

0 0

0 .9

9 8

0 .9

9 9

0 .9

9 9

C as

e 6

� �

� 1 .0

0 0

1 .0

0 0

1 .0

0 0

0 .9

9 9

0 .9

9 9

1 .0

0 0

N o te

. 2 P L

IR T

= tw

o -p

ar am

e te

r lo

gi st

ic it e m

re sp

o n se

th e o ry

. �

= D

is cr

im in

at io

n p ar

am e te

r is

n o t

av ai

la b le

in th

e 1 P L

IR T

m o d e l,

an d

lo ad

in gs

ar e

n o t

u n iq

u e

in

th e

p ar

al le

l m

o d e l.

W it h

th e se

m o d e ls

, th

e co

rr e la

ti o n

is n o t

ca lc

u la

te d .

Table 6. Correlations Between 1PL IRT and Parallel Fits, Data From Parallel Model.

Parallel data    Correlation in                Correlation in                Correlation in
vs. 1PL fit      discrimination parameter      difficulty parameter          factor scores
                 Minimum  Median  Maximum      Minimum  Median  Maximum      Minimum  Median  Maximum
Case 1           --       --      --           1.000    1.000   1.000        0.999    1.000   1.000
Case 2           --       --      --           1.000    1.000   1.000        0.999    1.000   1.000
Case 3           --       --      --           1.000    1.000   1.000        1.000    1.000   1.000
Case 4           --       --      --           1.000    1.000   1.000        1.000    1.000   1.000
Case 5           --       --      --           1.000    1.000   1.000        1.000    1.000   1.000
Case 6           --       --      --           1.000    1.000   1.000        1.000    1.000   1.000

Note. 1PL IRT = one-parameter logistic item response theory. -- = Discrimination parameter is not available in the 1PL IRT model, and loadings are not unique in the parallel model. With these models, the correlation is not calculated.


Table 9. Correlations Between 2PL IRT and Congeneric Fits, Data From Congeneric Model.

Congeneric data  Correlation in                Correlation in                Correlation in
vs. 2PL fit      discrimination parameter      difficulty parameter          factor scores
                 Minimum  Median  Maximum      Minimum  Median  Maximum      Minimum  Median  Maximum
Case 1           0.892    0.963   0.991        0.962    0.989   0.998        0.999    0.999   1.000
Case 2           0.899    0.97    0.991        0.976    0.992   0.997        0.999    0.999   1.000
Case 3           0.903    0.966   0.983        0.966    0.991   0.995        0.999    1.000   1.000
Case 4           0.906    0.966   0.986        0.982    0.991   0.995        0.999    1.000   1.000
Case 5           0.901    0.963   0.986        0.968    0.988   0.996        0.999    1.000   1.000
Case 6           0.918    0.967   0.984        0.985    0.991   0.995        0.999    1.000   1.000

Note. 2PL IRT = two-parameter logistic item response theory.

Table 8. Correlations Between 1PL IRT and Congeneric Fits, Data From Congeneric Model.

Congeneric data  Correlation in                Correlation in                Correlation in
vs. 1PL fit      discrimination parameter      difficulty parameter          factor scores
                 Minimum  Median  Maximum      Minimum  Median  Maximum      Minimum  Median  Maximum
Case 1           --       --      --           0.962    0.989   0.998        0.991    0.995   0.997
Case 2           --       --      --           0.976    0.992   0.997        0.992    0.996   0.998
Case 3           --       --      --           0.966    0.991   0.995        0.996    0.997   0.998
Case 4           --       --      --           0.982    0.991   0.995        0.996    0.997   0.998
Case 5           --       --      --           0.968    0.988   0.996        0.996    0.998   0.999
Case 6           --       --      --           0.985    0.991   0.995        0.997    0.998   0.999

Note. 1PL IRT = one-parameter logistic item response theory. -- = Discrimination parameter is not available in the 1PL IRT model, and loadings are not unique in the parallel model. With these models, the correlation is not calculated.


second step. The median correlation coefficient is consistently quite high across all cases. Furthermore, the range of correlation coefficients is very narrow.

The results of the analysis comparing IRT item discrimination parameter estimates with those for the factor loading λ from the CTT with UNV assumption model appear in the first three columns of Tables 5 and 9. Item discrimination indices are not estimated in the 1PL IRT model, and loadings are not estimated in the parallel CTT with UNV assumption model. As a result, no discrimination correlations are reported in Tables 2 to 4 and Tables 6 to 8. The correlations in Tables 5 and 9, based on the models paired in those tables, were somewhat lower than those for the item difficulty estimates. Of particular interest are the correlations between the 2PL IRT and congeneric fits when the data were generated by the 2PL IRT model (Table 5). The correlation seems more sensitive to sample size than to test length, though in the main all correlations are quite good. The minima are in line with the range reported by Fan (1998), especially considering that Fan's work reports averages across simulations. Table 9 reflects the same trends, though they are not as pronounced.

Finally, the correlations between factor score estimates and ability parameter estimates, shown in the last three columns of each of Tables 2 to 9, are also quite strong across all conditions and model pairings.
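The minimum, median, and maximum columns discussed above summarize one correlation per simulated data set. The following Python sketch illustrates that kind of summary under loudly stated assumptions: the number of replications, the number of items, the noise level, and the way the two sets of "estimates" are produced are hypothetical stand-ins, not the study's design or its fitting software.

```python
import math
import random
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length lists of estimates."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def summarize(correlations):
    """Return the (minimum, median, maximum) triple reported per table cell."""
    return min(correlations), statistics.median(correlations), max(correlations)

def simulate(n_reps=50, n_items=20, noise_sd=0.1, seed=1):
    """Hypothetical stand-in for the Monte Carlo loop: each replication draws
    item parameters and two noisy sets of estimates (one per framework)."""
    rng = random.Random(seed)
    correlations = []
    for _ in range(n_reps):
        true_values = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
        framework_a = [t + rng.gauss(0.0, noise_sd) for t in true_values]
        framework_b = [t + rng.gauss(0.0, noise_sd) for t in true_values]
        correlations.append(pearson(framework_a, framework_b))
    return correlations

lowest, middle, highest = summarize(simulate())
print(f"min={lowest:.3f} median={middle:.3f} max={highest:.3f}")
```

Reporting the minimum alongside the median makes the worst replication visible, which an average across simulations would tend to hide.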

There was almost no difference in correlation values for analogous pairings regardless of the model from which the data were generated. For example, the correlations between the difficulty parameter estimates for 1PL IRT and the threshold parameter estimates for congeneric CTT with UNV assumption when the data were generated from the 1PL IRT model (Table 2) were quite similar to the correlations between those parameters for those models when the data were generated from the congeneric model (Table 6). The exception is the discrimination parameter estimates: correlations were slightly lower when the data were generated with the 2PL IRT model (Table 5) than when the data were generated with the congeneric CTT with UNV assumption model (Table 9).

Discussion

The findings of this study reflect results in the literature. Item difficulty parameter estimates obtained from the IRT and the CTT with UNV assumption models were highly comparable across all conditions and model pairings. This is consistent with the findings of Fan (1998) and MacDonald and Paunonen (2002). The correlations for item discrimination parameter estimates obtained from the IRT and the CTT with UNV assumption models were lower, a pattern also reported by Fan (1998) and MacDonald and Paunonen (2002). We find greater sensitivity to sample size than to test length, though the difference between small and large sample sizes is more pronounced for shorter tests.

For the most part, the correlations were high for all model pairings regardless of which element in the pair had served as the data generation model. Only in the discrimination parameter estimate correlations is any noticeable difference seen. This finding lends support to the idea that the two modeling frameworks have equal merit.

MacDonald and Paunonen (2002) suggested high accuracy for the discrimination parameter, as measured by the correlation of the estimate with the true value, only when the difficulty parameter is restricted to a narrow range. Such a finding can be expected in light of the characterization by McCullagh and Nelder (1989) of the relationship between the logistic and probit functions as “almost linearly related over the interval 0.1 ≤ p ≤ 0.9” (p. 109). As the cumulative probabilities corresponding to the bounds of the threshold parameter for our congeneric model, 0.16 to 0.88, approach these values, the deteriorated correlation in the discrimination parameter may simply reflect the deterioration of the linear approximation between the two functional forms.
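The near-linearity of the logistic and probit links over the middle of the probability scale can be checked numerically. The sketch below is an illustration only, not part of the study: the grid size and the wider comparison interval of 0.01 to 0.99 are assumptions chosen for the example. It fits a straight line through the origin (both links equal zero at a probability of 0.5) and reports the largest departure from that line.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and sd 1

def logit(p):
    """Logistic link: log-odds of probability p."""
    return math.log(p / (1.0 - p))

def probit(p):
    """Probit link: standard normal quantile of probability p."""
    return STD_NORMAL.inv_cdf(p)

def linear_gap(lo, hi, n=201):
    """Fit logit = slope * probit on an even grid of probabilities in [lo, hi].

    Returns the least-squares slope and the largest absolute residual, which
    measures how far the two links are from being linearly related.
    """
    ps = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    x = [probit(p) for p in ps]
    y = [logit(p) for p in ps]
    slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    gap = max(abs(b - slope * a) for a, b in zip(x, y))
    return slope, gap

slope_mid, gap_mid = linear_gap(0.10, 0.90)    # interval quoted in the text
slope_wide, gap_wide = linear_gap(0.01, 0.99)  # assumed wider interval
print(f"0.10-0.90: slope={slope_mid:.3f}, max gap={gap_mid:.3f}")
print(f"0.01-0.99: slope={slope_wide:.3f}, max gap={gap_wide:.3f}")
```

The gap grows once the interval reaches into the tails, which is the sense in which a linear conversion between logistic and probit parameters degrades for extreme thresholds.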

The accepted view of IRT item parameter invariance is founded on the parameters’ function in the abstract model concept. In contrast, the understood susceptibility of item statistics in the CTT framework to variations in data concerns specific sample estimates. The purported superiority of IRT thus results from a comparison of unlike concepts. In this article, we show comparability between the CTT with UNV assumption and IRT, both in concept and through correlation of analogous parameter estimates. The implication of this comparability is that the strength of the model concept in IRT can apply equally to a CTT-based model with UNV assumption. Conversely, cautions regarding parameter estimation in CTT-based models with UNV assumption apply to parameter estimation in IRT models as well.

The invariance of IRT item parameters at the conceptual level is not, in fact, absolute. Lord (1980) notes that the location and scale of the ability variable are arbitrary. This means that, as noted by Rupp and Zumbo (2006), item parameters are invariant only up to a linear transformation unless the location and scale of the ability variable are held constant from test group to test group.
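The indeterminacy of location and scale can be made concrete with the two-parameter logistic item response function. The sketch below uses hypothetical parameter values and a hypothetical rescaling (slope 2.0, shift 0.5); it shows that transforming the ability scale linearly, with a compensating transformation of the item parameters, leaves every response probability unchanged.

```python
import math

def p_2pl(theta, a, b):
    """2PL item response function with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def rescale(theta, a, b, slope, shift):
    """Move to a new ability scale theta* = slope * theta + shift.

    The difficulty moves with the scale (b* = slope * b + shift) and the
    discrimination is divided by the slope (a* = a / slope), so that
    a* (theta* - b*) equals a (theta - b).
    """
    return slope * theta + shift, a / slope, slope * b + shift

# Hypothetical person and item parameters, for illustration only.
theta, a, b = 0.4, 1.3, -0.2
theta_new, a_new, b_new = rescale(theta, a, b, slope=2.0, shift=0.5)

original = p_2pl(theta, a, b)
transformed = p_2pl(theta_new, a_new, b_new)
print(original, transformed)
```

Because the data cannot tell these two parameter sets apart, estimates from different calibrations agree only after the scales are linked, which is the qualification on invariance described above.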

Moreover, the fact that parameter estimates often are sensitive to the sample on which they are based is not inherently a defect. Such sensitivity can allow the researcher to uncover variations that affect the outcome of interest and thus advance the field. Researchers simply need to keep this feature in mind when interpreting model results.

High correlations cannot distinguish between estimates that are stable across the two frameworks and estimates that drift in the same way in each framework. This is a limitation of the present study. This study also does not directly examine parameter estimate stability as a function of either the ability distribution of the test sample or the sample size. An investigation focusing on the latter would represent an important contribution, since most performance attributes regarding parameter estimation rely on an assumption of a sufficiently large, yet unquantified, sample size. But real data are always limited in number.
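The first limitation can be illustrated with a small numerical sketch. The values below are hypothetical, as is the particular drift (every estimate shrunk by a factor of 0.8 and shifted by 0.3 in both frameworks); the point is only that a correlation between the two frameworks is identical in the stable and the drifting scenario.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length lists of estimates."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def drift(values, scale=0.8, shift=0.3):
    """Hypothetical common drift applied to a set of estimates."""
    return [scale * v + shift for v in values]

# Hypothetical difficulty-type estimates from the two frameworks.
irt_estimates = [-1.48, -0.55, 0.03, 0.71, 1.36]
ctt_estimates = [-1.52, -0.47, -0.02, 0.68, 1.41]

r_stable = pearson(irt_estimates, ctt_estimates)
r_drift = pearson(drift(irt_estimates), drift(ctt_estimates))

# The drift moves the estimates themselves even though the correlation is unchanged.
mean_change = statistics.fmean(drift(irt_estimates)) - statistics.fmean(irt_estimates)
print(r_stable, r_drift, mean_change)
```

A study of estimate stability would therefore need a measure that is sensitive to location and scale, such as bias or root mean squared error against the generating values, in addition to correlations.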

The theoretical framework used here, along with the correlation results, shows the frameworks of IRT and CTT with UNV assumption to be quite comparable, with neither framework showing an advantage over the other. This finding presents the opportunity for CTT with UNV models to be applied in contexts where they had not been considered previously.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Note

1. Essentially parallel, essentially tau-equivalent, and essentially congeneric models are not specifically addressed. However, the results and conclusions of this study using models defined without a mean structure are equally applicable to corresponding models defined with a mean structure, as correlations are invariant to the locations of the scales.

References

Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1-29. Retrieved from http://www.jstatsoft.org/v48/i06/

Cook, L. L., Eignor, D. R., & Hessy, L. T. (1988). A comparative study of the effects of recency of instruction on the stability of IRT and conventional item parameter estimates. Journal of Educational Measurement, 25(1), 31-45.

Courville, T. G. (2004). An empirical comparison of item response theory and classical test theory item/person statistics (Doctoral dissertation). Retrieved from ProQuest Dissertations and Theses. (Accession Order No. 3141396)

DeVellis, R. F. (1991). Scale development: Theory and applications. Newbury Park, CA: Sage.

Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58, 357-381.

Ferrando, P. J. (2000). Testing the equivalence among different item response formats in personality measurement: A structural equation modeling approach. Structural Equation Modeling, 7, 271-286.

Graham, J. M. (2006). Congeneric and (essentially) tau-equivalent estimates of score reliability: What they are and how to use them. Educational and Psychological Measurement, 66, 930-944.

Hambleton, R., & Jones, R. (1993). Comparison of classical test theory and item response theory. Educational Measurement: Issues and Practice, Fall, 38-47.

Jöreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109-133.

Jöreskog, K. G. (1990). New developments in LISREL: Analysis of ordinal variables using polychoric correlations and weighted least squares. Quality and Quantity, 24, 387-404.

Kamata, A., & Bauer, D. J. (2008). A note on the relation between factor analytic and item response theory models. Structural Equation Modeling, 15, 136-153.

Lawson, S. (1991). One parameter latent trait measurement: Do the results justify the effort? In B. Thompson (Ed.), Advances in educational research: Substantive findings, methodological developments (Vol. 1, pp. 159-168). Greenwich, CT: JAI Press.

Lord, F. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

MacDonald, P., & Paunonen, S. (2002). A Monte Carlo comparison of item and person statistics based on item response theory versus classical test theory. Educational and Psychological Measurement, 62, 921-943.

McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.

McDonald, R. P. (1999). Test theory: A unified treatment. New York, NY: Psychology Press.

Mislevy, R. J. (1986). Recent developments in the factor analysis of categorical variables. Journal of Educational Statistics, 11, 3-31.

Miyazaki, Y. (2005). Some links between classical and modern test theory via the two-level hierarchical generalized linear model. Journal of Applied Measurement, 6, 289-310.

Muthén, B. (1984). A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49, 115-132.

Muthén, B. O. (1978). Contributions to factor analysis of dichotomous variables. Psychometrika, 43, 551-560.

Muthén, B. O., & Christoffersson, A. (1981). Simultaneous factor analysis of dichotomous variables in several groups. Psychometrika, 46, 407-419.

Muthén, L. K., & Muthén, B. O. (1998-2012). Mplus user’s guide (7th ed.). Los Angeles, CA: Muthén & Muthén.

R Development Core Team. (2013). R: A language and environment for statistical computing (ISBN 3-900051-07-0). Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/

Raykov, T., & Marcoulides, G. A. (2011). Introduction to psychometric theory. New York, NY: Routledge.

Revelle, W. (2014). psych: Procedures for psychological, psychometric, and personality research. Evanston, IL: Northwestern University. Retrieved from http://CRAN.R-project.org/package=psych

Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1-36. Retrieved from http://www.jstatsoft.org/v48/i02/

Rudner, L. M. (1983). A closer look at latent trait parameter invariance. Educational and Psychological Measurement, 43, 951-955.

Rupp, A. A., & Zumbo, B. D. (2006). Understanding parameter invariance in unidimensional IRT models. Educational and Psychological Measurement, 66, 63-84.

Sharkness, J., & DeAngelo, L. (2011). Measuring student involvement: A comparison of classical test theory and item response theory in the construction of scales from student surveys. Research in Higher Education, 52, 480-507.

Takane, Y., & de Leeuw, J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393-408.

Wirth, R. J., & Edwards, M. C. (2007). Item factor analysis: Current approaches and future directions. Psychological Methods, 12, 58-79.