Article
Educational and Psychological Measurement
2015, Vol. 75(3) 389–405 © The Author(s) 2014
Reprints and permissions: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0013164414559071 epm.sagepub.com
Relationships Among Classical Test Theory and Item Response Theory Frameworks via Factor Analytic Models
Nidhi Kohli¹, Jennifer Koran², and Lisa Henn¹
Abstract
There are well-defined theoretical differences between the classical test theory (CTT) and item response theory (IRT) frameworks. It is understood that in the CTT framework, person and item statistics are test- and sample-dependent. This is not the perception with IRT. For this reason, the IRT framework is considered to be theoretically superior to the CTT framework for the purpose of estimating person and item parameters. In previous simulation studies, IRT models were used both as generating and as fitting models. Hence, results favoring the IRT framework could be attributed to IRT being the data-generation framework. Moreover, previous studies only considered the traditional CTT framework for the comparison, yet there is considerable literature suggesting that it may be more appropriate to use CTT statistics based on an underlying normal variable (UNV) assumption. The current study relates the class of CTT-based models with the UNV assumption to that of IRT, using confirmatory factor analysis to delineate the connections. A small Monte Carlo study was carried out to assess the comparability between the item and person statistics obtained from the frameworks of IRT and CTT with UNV assumption. Results show the frameworks of IRT and CTT with UNV assumption to be quite comparable, with neither framework showing an advantage over the other.
¹University of Minnesota, Minneapolis, MN, USA
²Southern Illinois University, Carbondale, IL, USA
Corresponding Author:
Nidhi Kohli, Quantitative Methods in Education Program, Department of Educational Psychology,
University of Minnesota, 161 Education Sciences Building, 56 East River Road, Minneapolis, MN 55455,
USA.
Email: nkohli@umn.edu
Keywords
classical test theory, item response theory, relationship, factor analysis
Classical test theory (CTT) has been prominent in the field of educational measurement since the 1920s; however, for the last three decades, item response theory (IRT) has been the primary framework for educational measurement and psychometric issues. A commonly held belief is that the IRT framework is theoretically superior to the CTT framework for the estimation of person and item parameters because the person and item statistics based on the CTT framework are test- and sample-dependent, respectively. Specifically, the item statistics derived from the CTT-based models, item difficulty and item discrimination, are dependent on the sample of respondents selected to answer the items. If the same items are given to a different sample, and the item difficulty and item discrimination indices are computed on CTT-based models, they may vary substantially depending on the nature of the sample. Similarly, the scores earned by test takers depend on the items they have been asked to answer. If the test takers are given another set of more or less difficult items, their number-correct test scores likely are going to be lower or higher, respectively, than their number-correct scores on the original set of items.
In contrast to CTT, the person and item statistics based on IRT are considered to be stable across different samples of items and persons, respectively. As explained by Lord (1980), this perspective on the ability variable follows from viewing the item response function as a regression function of the observed test outcomes on the ability variable. The probability of observing a particular outcome likely is unaffected by how many among the test subjects have a particular level of ability. The invariance of the item parameters follows conceptually from this framing of the item response function as a regression function, with other elements in the function conceived as fixed parameters.
We have found nothing in the literature that examines this item parameter estimate stability via simulation, that is to say, that examines variation in parameter estimates at moderate sample sizes. Nonetheless, this conceptual stability property often seems to be attributed to parameter estimates as well.
In previous simulation studies, researchers have used IRT models both as generating and as fitting models, which confounds comparability between CTT and IRT frameworks with data-model congruity. An alternate explanation for results favoring the IRT framework is that IRT was used to generate the data, and thus, the data will more naturally fit this model. In addition, researchers have chosen traditional CTT item statistics, proportion correct and point–biserial correlations, and a traditional person statistic, unweighted total scores, for the comparison. But there is considerable literature suggesting that it may be more appropriate to use statistics based on an underlying normal variable (UNV) assumption, such as thresholds and biserial correlations.
Despite well-defined theoretical differences between the CTT and IRT frameworks, the empirical research comparing the two frameworks has failed to exhibit differences between the two in terms of person and item parameter estimates. To explore the distinctions between the two frameworks in greater depth, we first present a review of the empirical literature comparing them. Prior simulation studies have exclusively used IRT models to generate the data, and the literature does not consider models with a UNV assumption common in factor analytic models for categorical data. However, recent literature has introduced models with a UNV assumption as an extension of the CTT framework. As both IRT models and CTT-based models with the UNV assumption have been demonstrated to be members of the class of confirmatory factor analysis models, we expect a high level of comparability between results obtained under related member models. To test this theory, we compare item and person statistics for the extended CTT framework and the IRT framework using data generated under both frameworks. That we simulate data under each framework is, to our knowledge, a unique contribution.
Literature Review
Prior studies comparing CTT and IRT frameworks have not found that differences between the two translate into an advantage of one framework over the other. Many works have mentioned IRT parameter invariance as part of a general introduction of the IRT model (see, e.g., Hambleton & Jones, 1993; Sharkness & DeAngelo, 2011). Rudner (1983) examined how the magnitude of an item discrimination value should change if the location of the ability variable is not the same for two groups of examinees. Cook, Eignor, and Hessy (1988) compared three administrations of a Biology achievement test in part to examine stability of IRT item parameter estimates. They found a lack of stability, noting that it is affected by the advancement in skill level of the test-takers. We have found no prior literature examining IRT parameter invariance via simulation. Prior studies comparing estimates under CTT and IRT paradigms show high correlations between CTT and IRT not only for person ability but also for item difficulty (Courville, 2004; Fan, 1998; Lawson, 1991). Fan's (1998) research in particular "failed to support the IRT framework for its ostensible superiority over CTT in producing invariant item statistics" (p. 378). Item discrimination indices are less highly correlated between the two frameworks, dipping as low as 0.60, particularly when the range of difficulty parameter values exceeds 0.5 in absolute value (Fan, 1998; MacDonald & Paunonen, 2002). These correspondences are highest when the traditional CTT statistics are compared to the corresponding item statistics in one- and two-parameter logistic IRT models (Fan, 1998; MacDonald & Paunonen, 2002). All these studies use IRT as the data generation model.
Some attempts have been made to relate one framework to the other (Hambleton & Jones, 1993; Lord, 1980; Miyazaki, 2005). Lord (1980) presented approximate expressions for the IRT item discrimination parameter and item difficulty parameter as functions of the CTT item–biserial correlation and pass/fail threshold parameter. He called these relations "crude" and added that they "are given . . . not for practical use but rather to give an idea of the nature of the item discrimination parameter" (pp. 33-34). Miyazaki (2005) used two-level hierarchical generalized linear models as the intermediate framework to relate these two approaches, which additionally requires a normal distributional assumption on the observed test scores and the use of the identity link. This distributional assumption is not part of the core framework of CTT, making this approach for associating the two classes of models more restrictive.
Despite the empirically demonstrated similarities between the two modeling frameworks, the results have nonetheless led some researchers to conclude that the IRT framework is superior to the CTT framework (MacDonald & Paunonen, 2002). We see two problems with the body of prior research comparing the CTT and IRT frameworks. First, prior empirical studies have exclusively used IRT to generate the data; thus, empirical results favoring IRT might be due to the design of the study. Second, prior studies comparing CTT and IRT have not considered CTT-based models or statistics with a UNV assumption. However, recent literature has presented CTT-based models with a UNV assumption as legitimate, desirable extensions of the CTT framework (Raykov & Marcoulides, 2011).
Whereas previous efforts have sought to connect IRT to CTT with approximate expressions or with hierarchical models, other research supports a relation between a class of CTT models and some IRT models using confirmatory factor analysis to delineate the associations. This approach does not require a distributional assumption on the response as with Miyazaki (2005). Where CTT observed scores are considered as a result of tests containing a single, binary item, CTT can be applied to scored responses to individual items. This is possible because CTT assumes only the existence of the mathematical expectation of the observed score, not that the observed score be continuous, contrary to popular misconception (Raykov & Marcoulides, 2011). Furthermore, the item scores from several such single-item tests can be assumed to fit either parallel, tau-equivalent, or congeneric¹ CTT-based models, and these CTT-based models have been demonstrated as members within the family of confirmatory factor analysis models (DeVellis, 1991; Graham, 2006; Jöreskog, 1971). An extension, CTT-based models with a UNV assumption (Ferrando, 2000; Raykov & Marcoulides, 2011), has produced models shown to be members of the family of nonlinear confirmatory factor analysis models (Raykov & Marcoulides, 2011). The two-parameter IRT model (and the one-parameter model nested within it) has likewise been shown to be mathematically equivalent to a nonlinear confirmatory factor analysis model (Kamata & Bauer, 2008; McDonald, 1999; Takane & de Leeuw, 1987; Wirth & Edwards, 2007).
As both the CTT-based models with a UNV assumption and the one- and two-parameter IRT models have been demonstrated to be members of the class of nonlinear confirmatory factor analysis models, we expect a high level of similarity between results when the two frameworks are assessed under comparable conditions. We compare the two formulations using simulated data generated in each framework and compare that framework's parameter estimates to those resulting from the fit of the analogous model in the other framework. We thereby show relations between the two classes under conditions of parity.
Method
Overview
A small Monte Carlo simulation study was carried out to examine whether results reported in the literature held when the data generation model was varied (i.e., data were generated under both the IRT framework and the CTT with UNV assumption framework). The simulation design was composed of two conditions/factors: test length (total number of items) and number of examinees. We chose these two factors for the simulation design because we expect them to affect the magnitude of comparability between the item and person statistics arising from the CTT with UNV assumption framework and those from the IRT framework. Small sample sizes are known to affect item parameter estimates adversely, and short tests are known to affect person ability estimates adversely. The test length factor took values of 20, 40, and 60 items, whereas the number of examinees factor took values of 500 and 1,000 examinees. Thus, the combination of manipulated factors (3 × 2) resulted in a Monte Carlo simulation with six cells or conditions. The conditions are listed in Table 1. For each condition, 100 replications were generated.
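As an illustration only, the 3 × 2 design described above can be enumerated in a few lines of Python. The study itself was run in R; the names `TEST_LENGTHS`, `SAMPLE_SIZES`, and `simulation_conditions` are hypothetical and are not taken from the authors' code.

```python
from itertools import product

TEST_LENGTHS = (20, 40, 60)   # number of items, as stated in the text
SAMPLE_SIZES = (500, 1000)    # number of examinees, as stated in the text
N_REPLICATIONS = 100          # replications generated per condition


def simulation_conditions():
    """Return the six (label, n_items, n_examinees) cells of the 3 x 2 design."""
    cells = product(TEST_LENGTHS, SAMPLE_SIZES)
    return [
        (f"Case {idx}", n_items, n_examinees)
        for idx, (n_items, n_examinees) in enumerate(cells, start=1)
    ]


if __name__ == "__main__":
    for label, n_items, n_examinees in simulation_conditions():
        print(label, n_items, n_examinees, N_REPLICATIONS)
```

Crossing test length with sample size in this order reproduces the case numbering used in Table 1.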
Within each manipulated condition, data sets were generated for each of four models: two models from CTT with UNV assumption, the parallel and the congeneric models, and two from IRT, the one-parameter logistic (1PL) and two-parameter logistic (2PL) models. The tau-equivalent CTT-based model with the UNV assumption was not used in the simulation study, because with the UNV assumption, the tau-equivalent model is functionally equivalent to the parallel model. This is due to the fact that constraining the loadings also results in the error variances being equal to one another. The four models used in the study are explained in detail in the following sections.
Table 1. Simulation Conditions.
Condition   Number of items   Number of examinees
Case 1      20                500
Case 2      20                1,000
Case 3      40                500
Case 4      40                1,000
Case 5      60                500
Case 6      60                1,000
The One-Parameter Logistic Rasch Model
This model is the more restrictive of the two IRT models. All items are given the same weight in determining the level of the latent construct for an individual. This model is typically presented in its logistic form for an individual as
$$P(X_{ki} = 1 \mid \theta_i) = \frac{1}{1 + \exp[-D(\theta_i - b_k)]}, \tag{1}$$
where $\theta_i$ represents the level of the latent trait for an individual; $b_k$ is the item difficulty parameter on a scale approximating the normal ogive scale (nonlinear factor analysis model with probit transformation), which describes how much of the latent construct an individual must possess to have a 50% probability of endorsing item $k$; and $D = 1.7$ is a scaling constant that when multiplied by an item parameter approximately produces the corresponding value for the parameter on the logistic scale.
In the 1PL model, the relevant person statistic is an estimate of $\theta$. The relevant item statistic is an estimate of the item difficulty parameter $b_k$.
The Two-Parameter Birnbaum Model
This model is considered to be the less restrictive model of the two IRT models. Items that are more discriminating are given greater weight in determining the level of the latent construct for an individual. This model typically is presented in its logistic form as
$$P(X_{ki} = 1 \mid \theta_i) = \frac{1}{1 + \exp[-D a_k(\theta_i - b_k)]}, \tag{2}$$
where $\theta_i$ represents the level of the latent trait for an individual; $b_k$ is the item difficulty parameter describing how much of the latent construct an individual must possess to have a 50% probability of endorsing item $k$; $a_k$ is the item slope (discrimination) parameter on a scale approximating the normal ogive scale (nonlinear factor analysis model with probit transformation), which describes the strength of the relationship between item $k$ and the latent trait $\theta$; and $D = 1.7$ is a scaling constant that when multiplied by an item parameter approximately produces the corresponding value for the parameter on the logistic scale.
In the 2PL model, the relevant person statistic is an estimate of $\theta$. The relevant item statistics are estimates of the item difficulty parameter $b_k$ and the item slope (discrimination) parameter $a_k$.
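Equations 1 and 2 can be sketched as a single function, since the 1PL model is the 2PL model with the discrimination fixed at 1. This is a minimal Python illustration; the function name `irt_probability` and its argument names are assumptions made here, not part of the original study.

```python
import math

D = 1.7  # scaling constant used in Equations 1 and 2


def irt_probability(theta, b, a=1.0):
    """Probability of a correct response under the 2PL model.

    theta: latent trait level of the person
    b:     item difficulty
    a:     item discrimination; the default a=1.0 gives the 1PL model
    """
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


if __name__ == "__main__":
    # A person whose trait level equals the item difficulty has a 50% chance.
    print(irt_probability(theta=0.5, b=0.5))
    # A more discriminating item separates the same person more sharply.
    print(irt_probability(theta=1.0, b=0.0, a=1.0))
    print(irt_probability(theta=1.0, b=0.0, a=2.0))
```

The first call illustrates the interpretation of the difficulty parameter given in the text: when the trait level equals the difficulty, the exponent is zero and the probability is one half.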
The Underlying Normal Variable Assumption
The UNV assumption is a popular approach in the latent variable modeling literature (Jöreskog, 1990; Mislevy, 1986; Muthén, 1978, 1984; Muthén & Christoffersson, 1981; Takane & de Leeuw, 1987). Each observed binary item score variable $X_k$ with two categories is assumed to be a coarse representation of an underlying unobserved continuous variable $X_k^*$. If $X_k^*$ is assumed to be univariate normally distributed, then $X_k^*$ is the UNV. A monotonic transformation matches the density of the observed binary distribution to the density of the continuous distribution:
$$X_k = \begin{cases} 0 & \text{if } X_k^* < \tau_k, \\ 1 & \text{if } X_k^* \ge \tau_k. \end{cases} \tag{3}$$
The UNV $X_k^*$ is assumed to have a range from negative infinity to positive infinity. The notation $\tau_k$ will be used to refer to the single estimated threshold for dichotomous item $k$.
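The dichotomization in Equation 3 can be illustrated with a short Python sketch. It assumes a standard normal underlying variable, so that the expected proportion of 1s is one minus the standard normal distribution function evaluated at the threshold; the function names are hypothetical.

```python
import random
from statistics import NormalDist


def dichotomize(x_star, tau):
    """Equation 3: score 1 when the underlying value reaches the threshold."""
    return 1 if x_star >= tau else 0


def expected_proportion_correct(tau):
    """P(X* >= tau) for a standard normal X* (an assumption of this sketch)."""
    return 1.0 - NormalDist().cdf(tau)


def simulated_proportion_correct(tau, n=20000, seed=1):
    """Share of simulated standard normal draws that are scored 1."""
    rng = random.Random(seed)
    hits = sum(dichotomize(rng.gauss(0.0, 1.0), tau) for _ in range(n))
    return hits / n


if __name__ == "__main__":
    for tau in (-1.0, 0.0, 0.5):
        print(tau, expected_proportion_correct(tau), simulated_proportion_correct(tau))
```

A higher threshold yields a smaller proportion of 1s, which is why the threshold plays the role of an item difficulty.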
The Parallel Model With the UNV Assumption
Applying the UNV assumption to the parallel model concept (Ferrando, 2000), $X_{ki}^*$ for each item $k$ for individual $i$ can be decomposed into true score $T_i$ and error $E_{ki}$ as
$$X_{ki}^* = T_i + E_{ki} \tag{4}$$
and
$$\operatorname{var}(E_{ki}) = \operatorname{var}(E_{k'i}),$$
where $k, k' \in \{1, \ldots, p\}$ are item indices and $k \neq k'$. By definition, $\mathrm{E}(E_{ki}) = 0$. Analogous to the parallel model without the UNV assumption, all UNV item scores are assumed to share the same true score $T_i$. In addition, all UNV item scores are assumed to have equal reliability (equal measurement error). There is no subscript $k$ on the true score $T_i$. Because the true score $T_i$ is assumed to be the same across items, the distinguishing subscript $k$ is not needed.
The Congeneric Model With the UNV Assumption
Applying the UNV assumption to the congeneric model concept (Ferrando, 2000), the UNV item scores $X_{ki}^*$ are linear functions of the same true score, and individual item error variances are not constrained to be equal. All UNV item scores $X_{ki}^*$ have true scores that are linear functions of the same true score $T_i$. In equation form, the model is
$$X_{ki}^* = \lambda_k^* T_i + E_{ki}, \tag{5}$$
where $\lambda_k^*$ is the unique loading for item $k$. By definition, $\mathrm{E}(E_{ki}) = 0$. Note that $T_{ki} = \lambda_k^* T_i$.
In the congeneric model with the UNV assumption, the relevant person statistic is an estimate of $T_i$ (such as a factor score). Because $\lambda_k^*$ differs across items, $T_i$ is not necessarily a linear function of the number correct score. Thus, in the congeneric model, the number correct score does not necessarily contain the same information as the true score $T_i$. The relevant item statistics are the item threshold $\tau_k$ and the biserial (polyserial) correlation $r_{bs}$. As in the parallel model with the UNV assumption, if the item score $X_{ki}$ is the result of dichotomous scoring $\{0, 1\}$, then $1 - \Phi(\tau_k)$, with $\Phi$ the standard normal cumulative distribution function and $X_{ki}^*$ standardized, is the proportion of individuals for whom $X_{ki} = 1$ (item proportion correct $p_k$). Thus, in the case of the congeneric model with the UNV assumption, the item proportion correct $p_k$ contains the same information as the item threshold. The biserial (polyserial) correlation $r_{bs}$ is equivalent to the standardized factor loading, which is the correlation between the UNV item score $X_{ki}^*$ and the common factor $T_i$.
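A minimal sketch of data generated under Equation 5 follows. It assumes, for illustration, that the true score is standard normal and that the error variance is one minus the squared loading, so that the underlying item score has unit variance and its correlation with the true score equals the loading. The study generated such data with the R package psych; the Python names below are hypothetical.

```python
import math
import random


def simulate_congeneric_item(lam, tau, n_persons, seed=0):
    """Draw (T_i, X*_ki, X_ki) triples for one item under Equation 5.

    Assumption of this sketch: Var(T) = 1 and Var(E) = 1 - lam**2,
    which makes X* standard normal.
    """
    rng = random.Random(seed)
    error_sd = math.sqrt(1.0 - lam ** 2)
    rows = []
    for _ in range(n_persons):
        t = rng.gauss(0.0, 1.0)
        x_star = lam * t + rng.gauss(0.0, error_sd)
        rows.append((t, x_star, 1 if x_star >= tau else 0))
    return rows


def pearson(xs, ys):
    """Plain Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    rows = simulate_congeneric_item(lam=0.7, tau=0.0, n_persons=20000)
    t_vals = [r[0] for r in rows]
    x_star_vals = [r[1] for r in rows]
    # Under the stated assumptions this should be close to the loading.
    print(pearson(t_vals, x_star_vals))
```

Under these assumptions the correlation between the underlying item score and the common factor recovers the standardized loading, which is the equivalence stated in the text.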
Data Generation
The discrimination parameter, $a$, for the IRT models was generated on the normal ogive scale, meaning that a scaling factor of $D = 1.7$ was incorporated into the IRT model as per Equation 2. Item difficulty, $b$, values for the IRT models were sampled from a uniform distribution; $b \sim \mathrm{Uni}(-2, 2)$. This is comparable to values used in the MacDonald and Paunonen (2002) study. In the 1PL model, the item discrimination, $a$, was fixed at 1. Item discrimination values for the 2PL were sampled from a uniform distribution; $a \sim \mathrm{Uni}(1, 2)$. Person parameter, $\theta$, values were drawn from the standard normal distribution. Corresponding bounds for sampling distributions for loading, $\lambda$, and threshold, $\tau$, parameters for the CTT-based models with UNV assumption were calculated using the following equations:
$$\lambda_k^* = \frac{a_k / D}{\sqrt{1 + (a_k / D)^2}} \tag{6}$$
and
$$\tau_k = \frac{(a_k / D)\, b_k}{\sqrt{1 + (a_k / D)^2}} \tag{7}$$
(Wirth & Edwards, 2007).
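Equations 6 and 7 can be evaluated directly to see how bounds on the loadings and thresholds follow from bounds on the IRT parameters. The sketch below is a Python illustration; the helper name `irt_to_unv` is hypothetical.

```python
import math

D = 1.7  # scaling constant from Equations 1 and 2


def irt_to_unv(a, b):
    """Map IRT parameters (a, b) to (loading, threshold) via Equations 6 and 7."""
    scaled = a / D
    denom = math.sqrt(1.0 + scaled ** 2)
    loading = scaled / denom
    threshold = scaled * b / denom
    return loading, threshold


if __name__ == "__main__":
    # Corners of the sampling ranges stated in the text: a in [1, 2], b in [-2, 2].
    for a in (1.0, 2.0):
        for b in (-2.0, 2.0):
            print(a, b, irt_to_unv(a, b))
```

With a discrimination of 1 and difficulties of ±2, the thresholds come out at about ±1.014; with discriminations between 1 and 2, the loadings range from about 0.51 to about 0.76.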
Thus, for the parallel model with the UNV assumption, the threshold, $\tau$, values were sampled from a uniform distribution, with $\tau \sim \mathrm{Uni}(-1.014, 1.014)$. Factor loadings, $\lambda$, were set to 1. For the congeneric model with the UNV assumption, the threshold $\tau \sim \mathrm{Uni}(-1, 1.52)$, while the factor loading $\lambda \sim \mathrm{Uni}(0.5, 0.76)$. As with the IRT model, person parameter, $T$, values for both CTT with UNV assumption models were assumed to be standard normally distributed. Data for the models considered across all the manipulated conditions were generated by using R version 3.0.2 (R Development Core Team, 2013), with data for the CTT with UNV models generated with the R package psych, version 1.4.5 (Revelle, 2014). The IRT models were fit with the R package mirt, version 1.4 (Chalmers, 2012), using the EM algorithm, and CTT with UNV assumption models were fit with the R package lavaan, version 0.5-17.701 (Rosseel, 2012), using a weighted least square estimator with robust variance estimator, mimicking that of the WLSMV estimator in Mplus (Muthén & Muthén, 1998-2012).
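The IRT data generation described in this section might look as follows in Python. This is a sketch under the stated sampling distributions, not the authors' R code; `generate_2pl_data`, its arguments, and the `one_pl` switch are hypothetical names.

```python
import math
import random

D = 1.7  # scaling constant from Equation 2


def generate_2pl_data(n_items, n_persons, seed=0, one_pl=False):
    """Simulate a persons-by-items matrix of 0/1 responses.

    b ~ Uniform(-2, 2), a ~ Uniform(1, 2) (or fixed at 1 when one_pl is True),
    theta ~ Normal(0, 1), as described in the text.
    """
    rng = random.Random(seed)
    a = [1.0 if one_pl else rng.uniform(1.0, 2.0) for _ in range(n_items)]
    b = [rng.uniform(-2.0, 2.0) for _ in range(n_items)]
    theta = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
    responses = []
    for t in theta:
        row = []
        for a_k, b_k in zip(a, b):
            p = 1.0 / (1.0 + math.exp(-D * a_k * (t - b_k)))
            row.append(1 if rng.random() < p else 0)
        responses.append(row)
    return a, b, theta, responses


if __name__ == "__main__":
    a, b, theta, responses = generate_2pl_data(n_items=20, n_persons=500)
    # Proportion correct per item, the traditional CTT difficulty index.
    p_correct = [sum(col) / len(col) for col in zip(*responses)]
    print([round(p, 2) for p in p_correct])
```

Each cell of the design would repeat such a draw once per replication and then fit both the generating model and the comparison model to the same response matrix.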
Analysis
On fitting the models, the correlations of person and item estimates were calculated. Correlations were computed across pairings of two sets of statistics: the IRT statistics and the CTT with UNV assumption model statistics. The correlation of theta, $\theta$, the person statistic obtained from the IRT models, and the factor score, $T$, obtained from the CTT with UNV assumption models, was calculated to assess the degree of comparability of person estimates. Correlations between the item difficulty parameter, $b$, obtained from the IRT models, and the item threshold, $\tau$, obtained from the CTT with UNV assumption models, were calculated to assess the degree of comparability of item difficulty estimates. The correlations between the item discrimination parameter, $a$, obtained from the IRT models, and the factor loading, $\lambda$, obtained from the CTT with UNV assumption models, were obtained to assess the degree of comparability of item discrimination estimates. Finally, the median correlation and the range of correlations for each pairing across 100 replicates were computed for reporting. The correlation calculations were handled by R version 3.0.2.
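The summary step can be sketched as computing one Pearson correlation per replicate and then reporting the minimum, median, and maximum across replicates, which is the layout used in Tables 2 to 9. The study did this in R; the Python helpers `pearson` and `summarize_replicates` are hypothetical.

```python
import math
from statistics import median


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of estimates."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def summarize_replicates(pairs):
    """Return (minimum, median, maximum) of the per-replicate correlations.

    pairs: one (irt_estimates, unv_estimates) tuple per replicate, for
    example difficulty estimates paired with threshold estimates.
    """
    correlations = [pearson(x, y) for x, y in pairs]
    return min(correlations), median(correlations), max(correlations)


if __name__ == "__main__":
    # Three toy "replicates" with made-up numbers, for illustration only.
    toy = [
        ([1, 2, 3], [1, 2, 3]),
        ([1, 2, 3, 4], [1, 3, 2, 4]),
        ([1, 2, 3], [3, 2, 1]),
    ]
    print(summarize_replicates(toy))
```

Reporting the minimum and maximum alongside the median shows how much the agreement between frameworks varies from one simulated data set to the next.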
Results
Results appear in Tables 2 to 9. Each table presents results for one of the model pairings. The same model used to generate the data was fit back to the data, and the resulting estimates were compared to those from another model. Tables 2 and 3 concern data generated by the 1PL IRT model. The 1PL IRT model fit is compared with the parallel CTT with UNV assumption model in Table 2 and with the congeneric CTT with UNV assumption model in Table 3. Tables 4 and 5 concern data generated by the 2PL IRT model, compared again to the parallel and congeneric CTT with UNV assumption models.
Tables 6 and 7 use data generated under the parallel CTT with UNV assumption model. The fit of the parallel model was compared with estimates from the 1PL IRT model in Table 6 and the 2PL IRT model in Table 7. Tables 8 and 9 make similar comparisons using data generated under the congeneric CTT with UNV assumption model.
Thus, Tables 2 and 6 make the same comparisons, estimates from the 1PL IRT model and the parallel CTT with UNV assumption model. In Table 2, the data originate from the 1PL IRT model, whereas in Table 6, the data originate from the parallel CTT with UNV assumption model. Analogous pairings, as the table titles indicate, occur for Tables 3 and 8, Tables 4 and 7, and Tables 5 and 9.
The results of the analyses comparing IRT item difficulty parameter estimates with
those for item threshold t from the CTT with UNV assumption appear in the center
three columns in each of Tables 2 to 9. The correlations, based on the models paired
in each table, indicate a high degree of comparability. Note that there is only one set
of threshold estimates for both the parallel and congeneric CTT with UNV assump-
tion models. This occurs because, when fitting the model to the data, the thresholds
are estimated as a first step, and the specific model (parallel or congeneric) is fit as a
Kohli et al. 397
T a b
le 2 .
C o rr
e la
ti o n
B e tw
e e n
1 P L
IR T
an d
P ar
al le
l F it s,
D at
a F ro
m 1 P L
IR T
M o d e l.
C o rr
e la
ti o n
in d is
cr im
in at
io n
p ar
am e te
r C
o rr
e la
ti o n
in d if fi cu
lt y
p ar
am e te
r C
o rr
e la
ti o n
in fa
ct o r
sc o re
s
1 P L
d at
a vs
. p ar
al le
l fi t
M in
im u m
M e d ia
n M
ax im
u m
M in
im u m
M e d ia
n M
ax im
u m
M in
im u m
M e d ia
n M
ax im
u m
C as
e 1
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 9
0 .9
9 9
1 .0
0 0
C as
e 2
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 9
0 .9
9 9
0 .9
9 9
C as
e 3
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 9
1 .0
0 0
1 .0
0 0
C as
e 4
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 9
1 .0
0 0
1 .0
0 0
C as
e 5
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
C as
e 6
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
N o te
. 1 P L
IR T
= o n e -p
ar am
e te
r lo
gi st
ic it e m
re sp
o n se
th e o ry
. �
= D
is cr
im in
at io
n p ar
am e te
r is
n o t
av ai
la b le
in th
e 1 P L
IR T
m o d e l,
an d
lo ad
in gs
ar e
n o t
u n iq
u e
in
th e
p ar
al le
l m
o d e l.
W it h
th e se
m o d e ls
, th
e co
rr e la
ti o n
is n o t
ca lc
u la
te d .
T a b
le 3 .
C o rr
e la
ti o n s
B e tw
e e n
C o n ge
n e ri
c an
d 1 P L
IR T
F it s,
D at
a F ro
m 1 P L
IR T
M o d e l.
C o rr
e la
ti o n
in d is
cr im
in at
io n
p ar
am e te
r C
o rr
e la
ti o n
in d if fi cu
lt y
p ar
am e te
r C
o rr
e la
ti o n
in fa
ct o r
sc o re
s
1 P L
d at
a vs
. co
n ge
n e ri
c fi t
M in
im u m
M e d ia
n M
ax im
u m
M in
im u m
M e d ia
n M
ax im
u m
M in
im u m
M e d ia
n M
ax im
u m
C as
e 1
� �
� 0 .9
9 3
0 .9
9 8
0 .9
9 9
0 .9
9 7
0 .9
9 8
0 .9
9 9
C as
e 2
� �
� 0 .9
9 5
0 .9
9 9
1 .0
0 0
0 .9
9 8
0 .9
9 9
0 .9
9 9
C as
e 3
� �
� 0 .9
8 6
0 .9
9 8
0 .9
9 9
0 .9
9 9
0 .9
9 9
0 .9
9 9
C as
e 4
� �
� 0 .9
9 7
0 .9
9 9
1 .0
0 0
0 .9
9 9
0 .9
9 9
1 .0
0 0
C as
e 5
� �
� 0 .9
9 5
0 .9
9 8
0 .9
9 9
0 .9
9 9
0 .9
9 9
1 .0
0 0
C as
e 6
� �
� 0 .9
9 8
0 .9
9 9
1 .0
0 0
0 .9
9 9
0 .9
9 9
1 .0
0 0
N o te
. 1 P L
IR T
= o n e -p
ar am
e te
r lo
gi st
ic it e m
re sp
o n se
th e o ry
. �
= D
is cr
im in
at io
n p ar
am e te
r is
n o t
av ai
la b le
in th
e 1 P L
IR T
m o d e l,
an d
lo ad
in gs
ar e
n o t
u n iq
u e
in
th e
p ar
al le
l m
o d e l.
W it h
th e se
m o d e ls
, th
e co
rr e la
ti o n
is n o t
ca lc
u la
te d .
398
T a b
le 5 .
C o rr
e la
ti o n s
B e tw
e e n
C o n ge
n e ri
c an
d 2 P L
IR T
F it s,
D at
a F ro
m 2 P L
IR T
M o d e l.
C o rr
e la
ti o n
in d is
cr im
in at
io n
p ar
am e te
r C
o rr
e la
ti o n
in d if fi cu
lt y
p ar
am e te
r C
o rr
e la
ti o n
in fa
ct o r
sc o re
s
2 P L
d at
a vs
. co
n ge
n e ri
c fi t
M in
im u m
M e d ia
n M
ax im
u m
M in
im u m
M e d ia
n M
ax im
u m
M in
im u m
M e d ia
n M
ax im
u m
C as
e 1
0 .6
5 6
0 .9
0 9
0 .9
8 3
0 .9
9 1
0 .9
9 7
0 .9
9 9
0 .9
9 8
0 .9
9 9
1 .0
0 0
C as
e 2
0 .8
7 7
0 .9
5 2
0 .9
8 6
0 .9
9 2
0 .9
9 7
0 .9
9 9
0 .9
9 9
0 .9
9 9
1 .0
0 0
C as
e 3
0 .5
0 5
0 .9
1 3
0 .9
7 4
0 .9
9 2
0 .9
9 7
0 .9
9 9
0 .9
9 8
0 .9
9 9
1 .0
0 0
C as
e 4
0 .8
8 7
0 .9
4 5
0 .9
7 3
0 .9
9 5
0 .9
9 7
0 .9
9 9
0 .9
9 9
1 .0
0 0
1 .0
0 0
C as
e 5
0 .6
9 6
0 .9
1 4
0 .9
6 3
0 .9
9 4
0 .9
9 7
0 .9
9 9
0 .9
9 8
0 .9
9 9
1 .0
0 0
C as
e 6
0 .8
8 7
0 .9
4 8
0 .9
7 7
0 .9
9 5
0 .9
9 8
0 .9
9 9
0 .9
9 9
1 .0
0 0
1 .0
0 0
N o te
. 2 P L
IR T
= tw
o -p
ar am
e te
r lo
gi st
ic it e m
re sp
o n se
th e o ry
.
T a b
le 4 .
C o rr
e la
ti o n s
B e tw
e e n
P ar
al le
l an
d 2 P L
IR T
F it s,
D at
a F ro
m 2 P L
IR T
M o d e l.
C o rr
e la
ti o n
in d is
cr im
in at
io n
p ar
am e te
r C
o rr
e la
ti o n
in d if fi cu
lt y
p ar
am e te
r C
o rr
e la
ti o n
in fa
ct o r
sc o re
s
2 P L
d at
a vs
. p ar
al le
l fi t
M in
im u m
M e d ia
n M
ax im
u m
M in
im u m
M e d ia
n M
ax im
u m
M in
im u m
M e d ia
n M
ax im
u m
C as
e 1
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 5
0 .9
9 7
0 .9
9 8
C as
e 2
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 5
0 .9
9 7
0 .9
9 9
C as
e 3
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 5
0 .9
9 8
0 .9
9 9
C as
e 4
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 6
0 .9
9 8
0 .9
9 8
C as
e 5
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 6
0 .9
9 8
0 .9
9 9
C as
e 6
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 7
0 .9
9 8
0 .9
9 9
N o te
. 2 P L
IR T
= tw
o -p
ar am
e te
r lo
gi st
ic it e m
re sp
o n se
th e o ry
. �
= D
is cr
im in
at io
n p ar
am e te
r is
n o t
av ai
la b le
in th
e 1 P L
IR T
m o d e l,
an d
lo ad
in gs
ar e
n o t
u n iq
u e
in
th e
p ar
al le
l m
o d e l.
W it h
th e se
m o d e ls
, th
e co
rr e la
ti o n
is n o t
ca lc
u la
te d .
399
T a b
le 7 .
C o rr
e la
ti o n s
B e tw
e e n
2 P L
IR T
an d
P ar
al le
l F it s,
D at
a F ro
m P ar
al le
l M
o d e l.
C o rr
e la
ti o n
in d is
cr im
in at
io n
p ar
am e te
r C
o rr
e la
ti o n
in d if fi cu
lt y
p ar
am e te
r C
o rr
e la
ti o n
in fa
ct o r
sc o re
s
P ar
al le
l d at
a vs
. 2 P L
fi t
M in
im u m
M e d ia
n M
ax im
u m
M in
im u m
M e d ia
n M
ax im
u m
M in
im u m
M e d ia
n M
ax im
u m
C as
e 1
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 5
0 .9
9 7
0 .9
9 9
C as
e 2
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 7
0 .9
9 8
0 .9
9 9
C as
e 3
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 7
0 .9
9 8
0 .9
9 9
C as
e 4
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 9
0 .9
9 9
0 .9
9 9
C as
e 5
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 8
0 .9
9 9
0 .9
9 9
C as
e 6
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 9
0 .9
9 9
1 .0
0 0
N o te
. 2 P L
IR T
= tw
o -p
ar am
e te
r lo
gi st
ic it e m
re sp
o n se
th e o ry
. �
= D
is cr
im in
at io
n p ar
am e te
r is
n o t
av ai
la b le
in th
e 1 P L
IR T
m o d e l,
an d
lo ad
in gs
ar e
n o t
u n iq
u e
in
th e
p ar
al le
l m
o d e l.
W it h
th e se
m o d e ls
, th
e co
rr e la
ti o n
is n o t
ca lc
u la
te d .
T a b
le 6 .
C o rr
e la
ti o n s
B e tw
e e n
1 P L
IR T
an d
P ar
al le
l F it s,
D at
a F ro
m P ar
al le
l M
o d e l.
C o rr
e la
ti o n
in d is
cr im
in at
io n
p ar
am e te
r C
o rr
e la
ti o n
in d if fi cu
lt y
p ar
am e te
r C
o rr
e la
ti o n
in fa
ct o r
sc o re
s
P ar
al le
l d at
a vs
. 1 P L
fi t
M in
im u m
M e d ia
n M
ax im
u m
M in
im u m
M e d ia
n M
ax im
u m
M in
im u m
M e d ia
n M
ax im
u m
C as
e 1
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 9
1 .0
0 0
1 .0
0 0
C as
e 2
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
0 .9
9 9
1 .0
0 0
1 .0
0 0
C as
e 3
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
C as
e 4
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
C as
e 5
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
C as
e 6
� �
� 1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
1 .0
0 0
N o te
. 1 P L
IR T
= o n e -p
ar am
e te
r lo
gi st
ic it e m
re sp
o n se
th e o ry
. �
= D
is cr
im in
at io
n p ar
am e te
r is
n o t
av ai
la b le
in th
e 1 P L
IR T
m o d e l,
an d
lo ad
in gs
ar e
n o t
u n iq
u e
in
th e
p ar
al le
l m
o d e l.
W it h
th e se
m o d e ls
, th
e co
rr e la
ti o n
is n o t
ca lc
u la
te d .
400
Table 9. Correlations Between 2PL IRT and Congeneric Fits, Data From Congeneric Model.

                   Correlation in               Correlation in               Correlation in
Congeneric data    discrimination parameter     difficulty parameter         factor scores
vs. 2PL fit        Minimum  Median  Maximum     Minimum  Median  Maximum     Minimum  Median  Maximum
Case 1             0.892    0.963   0.991       0.962    0.989   0.998       0.999    0.999   1.000
Case 2             0.899    0.97    0.991       0.976    0.992   0.997       0.999    0.999   1.000
Case 3             0.903    0.966   0.983       0.966    0.991   0.995       0.999    1.000   1.000
Case 4             0.906    0.966   0.986       0.982    0.991   0.995       0.999    1.000   1.000
Case 5             0.901    0.963   0.986       0.968    0.988   0.996       0.999    1.000   1.000
Case 6             0.918    0.967   0.984       0.985    0.991   0.995       0.999    1.000   1.000

Note. 2PL IRT = two-parameter logistic item response theory.
Table 8. Correlations Between 1PL IRT and Congeneric Fits, Data From Congeneric Model.

                   Correlation in               Correlation in               Correlation in
Congeneric data    discrimination parameter     difficulty parameter         factor scores
vs. 1PL fit        Minimum  Median  Maximum     Minimum  Median  Maximum     Minimum  Median  Maximum
Case 1             --       --      --          0.962    0.989   0.998       0.991    0.995   0.997
Case 2             --       --      --          0.976    0.992   0.997       0.992    0.996   0.998
Case 3             --       --      --          0.966    0.991   0.995       0.996    0.997   0.998
Case 4             --       --      --          0.982    0.991   0.995       0.996    0.997   0.998
Case 5             --       --      --          0.968    0.988   0.996       0.996    0.998   0.999
Case 6             --       --      --          0.985    0.991   0.995       0.997    0.998   0.999

Note. 1PL IRT = one-parameter logistic item response theory. -- = Discrimination parameter is not available in the 1PL IRT model, and loadings are not unique in the parallel model. With these models, the correlation is not calculated.
second step. The median correlation coefficient consistently is quite high among all
cases. Furthermore, the range of correlation coefficients is very narrow.
The results of the analysis comparing IRT item discrimination parameter estimates
with the factor loadings λ from the CTT with UNV assumption model
appear in the first three columns of Tables 5 and 9. Item discrimination indices are
not estimated in the 1PL IRT model, and loadings are not estimated in the parallel
CTT with UNV assumption model. As a result, no correlations are reported for
Tables 2 to 4 and Tables 6 to 8. The correlations in Tables 5 and 9, based on the
models paired in those tables, were somewhat lower than those for the item difficulty
estimates. Of particular interest are correlations of 2PL IRT and congeneric fits
when data were generated by the 2PL IRT model (Table 5). The correlation seems
more sensitive to sample size than test length, though in the main all correlations are
quite good. Minima reflect the range reported by Fan (1998), especially considering
that work reports averages across simulations. Table 9 reflects the same trends,
though not as pronounced.
Finally, correlations between factor score estimates and ability parameter estimates,
reported in the last three columns of each of Tables 2 to 9, are also quite strong
across all conditions and model pairings.
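
The minimum, median, and maximum entries in Tables 2 to 9 summarize a set of correlation coefficients for each case. The following is a minimal sketch of that summary step, assuming one Pearson correlation is computed per simulated data set between the two frameworks' estimates. The function names, the data layout, and the toy estimates are hypothetical and do not reproduce the study's data or software.

```python
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def summarize(replications):
    """Return (min, median, max) of the per-replication correlations.

    `replications` is a list of (irt_estimates, ctt_estimates) pairs,
    one pair per simulated data set (an assumed layout).
    """
    rs = [pearson(irt, ctt) for irt, ctt in replications]
    return min(rs), median(rs), max(rs)


# Hypothetical difficulty and threshold estimates for three replications
# of a five-item test; the numbers are made up for illustration only.
toy = [
    ([-1.2, -0.5, 0.0, 0.6, 1.1], [-0.70, -0.31, 0.02, 0.35, 0.66]),
    ([-1.0, -0.4, 0.1, 0.5, 1.3], [-0.62, -0.20, 0.05, 0.33, 0.75]),
    ([-1.1, -0.6, 0.2, 0.4, 1.0], [-0.66, -0.38, 0.10, 0.27, 0.58]),
]
low, mid, high = summarize(toy)
print(f"min={low:.3f} median={mid:.3f} max={high:.3f}")
```

Because the Pearson coefficient ignores location and scale, the difficulty and threshold estimates can be compared directly even though the two frameworks place them on different metrics.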
There was almost no difference in correlation values for analogous pairings
regardless of the model from which the data were generated. For example, the corre-
lations calculated between the difficulty parameter estimate for 1PL IRT and thresh-
old parameter estimate for congeneric CTT with UNV assumption when the data
were generated from the 1PL IRT model, Table 2, were quite similar to correlations
calculated between those parameters for those models when the data were generated
from the congeneric model, Table 6. The exception is with discrimination parameter
estimates; correlations were slightly lower when the data were generated with the
2PL IRT model, Table 5, than when the data were generated with the congeneric
CTT with UNV assumption model, Table 9.
Discussion
The findings of this study reflect results in the literature. Item difficulty parameter
estimates obtained from the IRT and the CTT with UNV assumption models were
highly comparable across all conditions and model pairings. This is consistent with
the findings of Fan (1998) and MacDonald and Paunonen (2002). The correlations for
item discrimination parameter estimates obtained from the IRT and the CTT with
UNV assumption models were lower, a pattern also reported by Fan (1998) and MacDonald
and Paunonen (2002). We find greater sensitivity to sample size than to test length,
though the difference between small and large sample size is more pronounced for
shorter tests.
For the most part, the correlations were high for all model pairings regardless of
which element in the pair had served as the data generation model. Only in the
discrimination parameter estimate correlations is any noticeable difference seen. This
finding lends support to the idea that the two modeling frameworks have equal merit.
MacDonald and Paunonen (2002) suggested high accuracy for the discrimination
parameter, as measured by correlation of estimate to true value, only when the diffi-
culty parameter is restricted to a narrow range. Such a finding can be expected in
light of the characterization by McCullagh and Nelder (1989) of the relationship
between logistic and probit functions as “almost linearly related over the interval
0.1 ≤ p ≤ 0.9” (p. 109). As the cumulative probabilities corresponding to the bounds of the threshold parameter for our congeneric model, 0.16 to 0.88, approach
these values, the deteriorated correlation in the discrimination parameter may simply
reflect the deterioration of the linearization approximation between the two func-
tional forms.
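
The near-linearity of the logit and probit functions over the central range, and its breakdown toward the tails, can be checked numerically. The sketch below is an illustration only, not part of the study's analysis: it fits a single slope k so that logit(p) ≈ k · probit(p) and reports the largest departure from that line on an evenly spaced grid of probabilities. The helper name `max_linear_gap` and the grid size are assumptions.

```python
import math
from statistics import NormalDist


def logit(p):
    """Log-odds of probability p."""
    return math.log(p / (1.0 - p))


def probit(p):
    """Standard normal quantile of probability p."""
    return NormalDist().inv_cdf(p)


def max_linear_gap(lo, hi, n=81):
    """Fit logit(p) = k * probit(p) by least squares through the origin
    on an even grid over [lo, hi]; return k and the largest absolute
    residual (in logit units)."""
    ps = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    z = [probit(p) for p in ps]
    l = [logit(p) for p in ps]
    k = sum(a * b for a, b in zip(l, z)) / sum(b * b for b in z)
    return k, max(abs(a - k * b) for a, b in zip(l, z))


k_mid, gap_mid = max_linear_gap(0.1, 0.9)        # interval quoted from McCullagh and Nelder
k_thr, gap_thr = max_linear_gap(0.16, 0.88)      # threshold bounds of the congeneric model
k_wide, gap_wide = max_linear_gap(0.001, 0.999)  # tails included, for contrast

print(f"0.10-0.90:   slope={k_mid:.3f}  max gap={gap_mid:.3f}")
print(f"0.16-0.88:   slope={k_thr:.3f}  max gap={gap_thr:.3f}")
print(f"0.001-0.999: slope={k_wide:.3f}  max gap={gap_wide:.3f}")
```

The gap is small inside the central interval and grows once the grid reaches into the tails, which is the behavior the argument above relies on.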
The accepted view of IRT item parameter invariance is founded on the para-
meters’ function in the abstract model concept. In contrast, the understood suscept-
ibility of item statistics in the CTT framework to variations in data concerns specific
sample estimates. Purported superiority results from a comparison of unlike con-
cepts. In this article, we show comparability between the CTT with UNV assumption
and IRT, both in concept and through correlation of analogous parameter estimates.
The implication of this comparability is that the strength of the model concept in
IRT can apply equally to a CTT-based model with UNV assumption. Conversely,
cautions regarding parameter estimation in CTT-based models with UNV assumption
apply to parameter estimation in IRT models as well.
The invariance of IRT item parameters at the conceptual level is not, in fact,
absolute. Lord (1980) notes that the location and scale of the ability variable are
arbitrary. This means that, as noted by Rupp and Zumbo (2006), item parameters
actually are invariant only up to a linear transformation unless the location and scale
of the ability variable are held constant from test group to test group.
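
The indeterminacy can be made concrete with a small numerical check. The sketch below assumes the usual 2PL logistic form without a scaling constant; the function names and the numeric values are illustrative. Rescaling ability as θ* = Aθ + B, with the compensating changes a* = a/A and b* = Ab + B, leaves the response probability unchanged.

```python
import math


def p_2pl(theta, a, b):
    """2PL response probability (logistic form, no scaling constant assumed)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def rescale(theta, a, b, A, B):
    """Apply theta* = A*theta + B with compensating a* = a/A and b* = A*b + B."""
    return A * theta + B, a / A, A * b + B


theta, a, b = 0.4, 1.3, -0.2   # illustrative ability and item parameters
A, B = 2.0, 5.0                # arbitrary change of scale and location

p_original = p_2pl(theta, a, b)
p_rescaled = p_2pl(*rescale(theta, a, b, A, B))
print(math.isclose(p_original, p_rescaled))
```

The two parameter sets differ numerically yet describe the same model, so item parameters from different groups are comparable only after the ability metric has been fixed or linked.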
Moreover, the fact that parameter estimates often are sensitive to the sample on
which the estimates are based is not inherently a defect. Such an attribute can allow
the researcher to uncover variations that affect the outcome of interest and thus
advance the field. Researchers need simply to keep this feature in mind when inter-
preting model results.
High correlations cannot distinguish estimates that are stable in both
frameworks from estimates that drift in the same way in each framework. This is a limitation
of the present study. This study also does not directly examine parameter estimate
stability as a function of either the ability distribution of the test sample or the sample size.
An investigation focusing on the latter would represent an important contribution,
since most performance attributes regarding parameter estimation rely on an assump-
tion of a sufficiently large, yet unquantified, sample size. But real data are always
limited in number.
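
The first limitation can be illustrated with a toy example. In the sketch below, all values are hypothetical and, for simplicity, both sets of estimates are placed on a common scale. The two frameworks agree almost perfectly with each other in both scenarios, yet in the second scenario a shared item-specific drift pulls both sets of estimates away from the generating values.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical generating values and small framework-specific estimation errors.
truth = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
noise_irt = [0.02, -0.01, 0.03, 0.00, -0.02, 0.01, -0.03]
noise_ctt = [-0.01, 0.02, 0.00, 0.01, 0.03, -0.02, 0.01]
# Hypothetical drift that affects both frameworks identically.
drift = [0.8, -0.6, 0.9, -0.7, 0.5, -0.9, 0.6]

irt_stable = [t + e for t, e in zip(truth, noise_irt)]
ctt_stable = [t + e for t, e in zip(truth, noise_ctt)]
irt_drift = [v + d for v, d in zip(irt_stable, drift)]
ctt_drift = [v + d for v, d in zip(ctt_stable, drift)]

r_stable = pearson(irt_stable, ctt_stable)  # frameworks agree, estimates stable
r_drift = pearson(irt_drift, ctt_drift)     # frameworks agree, estimates drifted
r_truth = pearson(irt_drift, truth)         # drifted estimates vs. generating values

print(f"between frameworks, stable:  {r_stable:.3f}")
print(f"between frameworks, drifted: {r_drift:.3f}")
print(f"drifted vs. generating:      {r_truth:.3f}")
```

A between-framework correlation alone cannot tell these scenarios apart; separating them requires a comparison against generating values or across samples.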
The theoretical framework used here along with the correlation results show the
frameworks of IRT and CTT with UNV assumption to be quite comparable, with nei-
ther framework showing an advantage over the other. This finding presents the
opportunity for CTT with UNV models to be applied in contexts where they had not
been considered previously.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship,
and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of
this article.
Note
1. Essentially parallel, essentially tau-equivalent, and essentially congeneric models are not
specifically addressed. However, the results and conclusions of this study using models
defined without a mean structure are equally applicable to corresponding models defined
with a mean structure, as correlations are invariant to the locations of the scales.
References
Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R
environment. Journal of Statistical Software, 48(6), 1-29. Retrieved from
http://www.jstatsoft.org/v48/i06/
Cook, L. L., Eignor, D. R., & Taft, H. L. (1988). A comparative study of the effects of
recency of instruction on the stability of IRT and conventional item parameter estimates.
Journal of Educational Measurement, 25(1), 31-45.
Courville, T. G. (2004). An empirical comparison of item response theory and classical test
theory item/person statistics (Doctoral dissertation). Retrieved from ProQuest Dissertations
and Theses. (Accession Order No. 3141396)
DeVellis, R. F. (1991). Scale development: Theory and applications. Newbury Park, CA: Sage.
Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of
their item/person statistics. Educational and Psychological Measurement, 58, 357-381.
Ferrando, P. J. (2000). Testing the equivalence among different item response formats in
personality measurement: A structural equation modeling approach. Structural Equation
Modeling, 7, 271-286.
Graham, J. M. (2006). Congeneric and (essentially) tau-equivalent estimates of score
reliability: What they are and how to use them. Educational and Psychological
Measurement, 66, 930-944.
Hambleton, R., & Jones, R. (1993). Comparison of classical test theory and item response
theory. Educational Measurement: Issues and Practice, Fall, 38-47.
Jöreskog, K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36,
109-133.
Jöreskog, K. G. (1990). New developments in LISREL: Analysis of ordinal variables using
polychoric correlations and weighted least squares. Quality and Quantity, 24, 387-404.
Kamata, A., & Bauer, D. J. (2008). A note on the relation between factor analytic and item
response theory models. Structural Equation Modeling, 15, 136-153.
Lawson, S. (1991). One parameter latent trait measurement: Do the results justify the effort?
In B. Thompson (Ed.), Advances in educational research: Substantive findings,
methodological developments (Vol. 1, pp. 159-168). Greenwich, CT: JAI Press.
Lord, F. (1980). Applications of item response theory to practical testing problems. Hillsdale,
NJ: Lawrence Erlbaum.
MacDonald, P., & Paunonen, S. (2002). A Monte Carlo comparison of item and person
statistics based on item response theory versus classical test theory. Educational and
Psychological Measurement, 62, 921-943.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Boca Raton, FL:
Chapman & Hall/CRC.
McDonald, R. P. (1999). Test theory: A unified treatment. New York, NY: Psychology Press.
Mislevy, R. J. (1986). Recent developments in the factor analysis of categorical variables.
Journal of Educational Statistics, 11, 3-31.
Miyazaki, Y. (2005). Some links between classical and modern test theory via the two-level
hierarchical generalized linear model. Journal of Applied Measurement, 6, 289-310.
Muthén, B. (1984). A general structural equation model with dichotomous, ordered categorical,
and continuous latent variable indicators. Psychometrika, 49, 115-132.
Muthén, B. O. (1978). Contributions to factor analysis of dichotomous variables.
Psychometrika, 43, 551-560.
Muthén, B. O., & Christoffersson, A. (1981). Simultaneous factor analysis of dichotomous
variables in several groups. Psychometrika, 46, 407-419.
Muthén, L. K., & Muthén, B. O. (1998-2012). Mplus user’s guide (7th ed.). Los Angeles, CA:
Muthén & Muthén.
R Development Core Team. (2013). R: A language and environment for statistical computing
(ISBN 3-900051-07-0). Vienna, Austria: R Foundation for Statistical Computing. Retrieved
from http://www.R-project.org/
Raykov, T., & Marcoulides, G. A. (2011). Introduction to psychometric theory. New York,
NY: Routledge.
Revelle, W. (2014). psych: Procedures for psychological, psychometric, and personality
research. Evanston, IL: Northwestern University. Retrieved from
http://CRAN.R-project.org/package=psych
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of
Statistical Software, 48(2), 1-36. Retrieved from http://www.jstatsoft.org/v48/i02/
Rudner, L. M. (1983). A closer look at latent trait parameter invariance. Educational and
Psychological Measurement, 43, 951-955.
Rupp, A. A., & Zumbo, B. D. (2006). Understanding parameter invariance in unidimensional
IRT models. Educational and Psychological Measurement, 66, 63-84.
Sharkness, J., & DeAngelo, L. (2011). Measuring student involvement: A comparison of
classical test theory and item response theory in the construction of scales from student
surveys. Research in Higher Education, 52, 480-507.
Takane, Y., & de Leeuw, J. (1987). On the relationship between item response theory and
factor analysis of discretized variables. Psychometrika, 52, 393-408.
Wirth, R. J., & Edwards, M. C. (2007). Item factor analysis: Current approaches and future
directions. Psychological Methods, 12, 58-79.