A Comparison of Methods of
Estimating Subscale Scores
for Mixed-Format Tests
David Shin
Pearson Educational Measurement
May 2007
Using assessment and research to promote learning
rr0701
Pearson Educational Measurement (PEM) is the most comprehensive provider
of educational assessment products, services and solutions. As a pioneer in
educational measurement, PEM has been a trusted partner in district, state
and national assessments for more than 50 years. PEM helps educators and parents
use assessment and research to promote learning and academic achievement.
PEM Research Reports provide dissemination of PEM research and
assessment-related articles prior to publication. PEM reports in .pdf format may be
downloaded at:
http://www.pearsonedmeasurement.com/research/research.htm
Because the world is complex and resources are often limited, test scores often serve to
both rank individuals and provide diagnostic feedback (Wainer, Vevea, Camacho, Reeve,
Rosa, Nelson, Swygert, and Thissen, 2000). These two purposes create challenges from
the standpoint of content coverage. Standardized achievement tests serve the purpose of
ranking very well. To serve the purpose of diagnosis, however, standardized achievement tests
must have clusters of items that yield interpretable scores. These clusters could be
learning objectives, subtests, or learning standards. Scores on these clusters of items will
be referred to as objective scores in this paper.
If there are a large number of items measuring an objective on a test, the
estimation of an objective score might be precise and reliable. However, in many cases
the number of items is fewer than optimal for the level of reliability desired. This
condition is problematic, but it often exists in practice (Pommerich, Nicewander, and
Hanson, 1999). The purpose of this paper is to review and evaluate a number of methods
that attempt to provide a more precise and reliable estimation of objective scores. The
first section of this paper reviews a number of different methods for estimating objective
scores. The second section of this paper presents a study evaluating a subset of the more
practical of these methods.
Review of Current Objective Score Estimation Methods
A number of studies have been devoted to proposing methods for the more
precise and reliable estimation of objective scores (Yen, 1987; Yen, Sykes, Ito, and Julian,
1997; Bock, Thissen, and Zimowski, 1997; Pommerich et al., 1999; Wainer et al., 2000;
Gessaroli, 2004; Kahraman and Kamata, 2004; Tate, 2004; Shin, Ansley, Tsai, and Mao,
2005). This section provides a review of these methods. All of these methods, implicitly
or explicitly, estimate the objective scores using collateral test information. This review
notes how different methods take advantage of collateral information in different ways.
Bock, Thissen, and Zimowski (1997) used the IRT domain score as the estimate of the objective score. Collateral test information is used in the sense that, when the item parameters are calibrated, the data from items in the other objectives contribute to the estimation of the item parameters in the objective of interest. For example, if items 1 to 6 measure the first objective and items 7 to 12 measure the second objective on the same test, then in calibrating the test the data from items 7 to 12 contribute to the estimation of the item parameters of items 1 to 6. Bock et al. compared the proportion-correct score with the IRT domain score computed from theta obtained by either maximum likelihood estimation (MLE) or Bayesian estimation and found that the objective score based on the IRT estimate is more accurate than the proportion-correct score.
Several studies extended the work of Bock, Thissen, and Zimowski (1997). Pommerich et al. (1999) similarly used the IRT domain score to estimate the objective score; the only difference is that they estimated the score at the group level rather than the individual level. Tate (2004) extended Bock's study to the multidimensional case, varying the dimensionality and the degree of correlation between subsets of the test. The main purpose of Tate's study was to determine whether MLE or expected a posteriori (EAP) estimation better estimates the objective score. He found that the choice of estimation approach depends on the intended uses of the objective scores (Tate, 2004, p. 107).
Adopting a somewhat different approach, Kahraman and Kamata (2004) also estimated the objective score using IRT models. They used out-of-scale items, which are explicitly a kind of collateral test information, to help estimate the objective score. Out-of-scale items are items that are on the same test but measure objectives other than the objective of interest; for example, if the score of Objective One is being estimated, then the items in Objectives Two, Three, and so on are out-of-scale items. Kahraman and Kamata manipulated the number of out-of-scale items, the correlation between objectives, and the discrimination of the items, and found that the correlation between objectives needs to be at least 0.5 for moderate-discrimination out-of-scale items and 0.9 for high-discrimination items in order to take advantage of the out-of-scale items.
Wainer et al. (2000) used an empirical Bayes (EB) method to compute the objective score. The basic concept of this method is similar to Kelley's (1927) regressed score; indeed, it is a multivariate version of Kelley's method. In Kelley's method, only a single test score is used to estimate the regressed score, whereas in Wainer's method several other objective scores are used to estimate the objective score of interest.
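For reference, Kelley's univariate formula, which Wainer et al.'s method generalizes, can be written as follows (standard CTT notation, not symbols from this paper):

```latex
% Kelley's (1927) regressed score: shrink the observed score x toward
% the group mean \bar{x} in proportion to the reliability \rho_{XX'}.
\hat{\tau} = \rho_{XX'}\,x + (1 - \rho_{XX'})\,\bar{x}
```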
Yen (1987) and Yen et al. (1997) combined the IRT and EB methods to compute the objective score using a method labeled the Objective Performance Index (OPI). They used the IRT domain score as the prior and assumed that the prior had a beta distribution. With the additional assumption that the likelihood of the objective score follows a binomial distribution, the posterior distribution is then also a beta distribution. The mean and standard deviation (SD) of the posterior distribution are the estimated objective score and its standard error, respectively.
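The conjugacy behind the OPI is the standard beta-binomial result; in generic symbols (not the paper's notation, which appears in the Yen method section below):

```latex
% Beta prior plus binomial likelihood yields a beta posterior:
%   T ~ Beta(p, q),  x | T ~ Binomial(n, T)
%   ==>  T | x ~ Beta(p + x, q + n - x),
% with posterior mean (p + x)/(p + q + n).
T \mid x \sim \mathrm{Beta}(p + x,\; q + n - x)
```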
Gessaroli (2004) used multidimensional IRT (MIRT) to compute the objective score and compared it to Wainer's EB method (Wainer et al., 2000). He found that the EB method produced almost the same results as the MIRT method.
Shin et al. (2005) applied the Markov chain Monte Carlo (MCMC) technique to estimate the objective scores of the IRT, OPI, and Wainer et al. methods and compared these MCMC versions to their original non-MCMC counterparts. They found that the MCMC alternatives performed the same as, or slightly better than, the non-MCMC methods.
Some of the methods reviewed in this paper may not be practical for use in a large scale testing program. For example, the method of Pommerich et al. (1999) estimates objective scores only for groups, so it may not meet the needs of a large scale test that reports individual objective scores. The method of Gessaroli (2004) involves MIRT, which is rarely used in large scale testing. The method of Kahraman and Kamata (2004) requires certain conditions in order to be advantageous, and these conditions may not be met in many large scale tests (p. 417). The study of Tate (2004) was very similar to Bock's study (Bock, Thissen, and Zimowski, 1997). The MCMC IRT and MCMC OPI methods in Shin et al. (2005) are too time-consuming and hence may not be practical.
However, other methods reviewed in this paper may be suitable for use in a large scale testing program. The method of Yen et al. (1997) is currently used for some state tests. The Bock, Thissen, and Zimowski (1997) method is convenient to implement in tests that use IRT. The Wainer et al. (2000) method and the MCMC Wainer method of Shin et al. (2005) performed better than the other methods in the study of Shin et al. (2005). Therefore, these methods were included in the study reported in the next section.
Evaluation of Selected Objective Score Estimation Methods
The study reported in this section compares five methods that use collateral test information to estimate the objective score for a mixed-format test: an adjusted version of Bock et al.'s item response theory (IRT) approach (Bock et al., 1997), Yen's objective performance index (OPI) approach (Yen et al., 1997), Wainer et al.'s regressed score approach (Wainer et al., 2000), Shin et al.'s MCMC regressed score approach (Shin et al., 2005), and the proportion-correct score approach. They are referred to hereafter as the Bock method, the Yen method, the Wainer method, the Shin method, and the proportion-correct method. In addition to comparing these five methods on a common data set, the present study extends earlier work by including mixed-format tests. Only Yen et al. (1997) considered the case of a mixed-format test. As more large scale tests require the reporting of objective scores and mixed-format tests become more common, it is necessary to conduct a study that compares different objective score estimation methods for mixed-format tests.
In previous studies (Yen, 1987; Yen et al., 1997; Wainer et al., 2000; Pommerich et al., 1999; Bock et al., 1997; Shin et al., 2005), the number of items in an objective and the correlation between objectives were the two main factors that affected the estimation results. In the current study, the proportion of polytomous items in a test and the student sample size were also studied, and the performance of the methods under different combinations of these four factors was compared. Six main questions were investigated in this study:
(1) What is the order of the objective score reliabilities estimated by the different methods, and how are they influenced by the four factors studied?
(2) How accurate is the nominal 95% confidence/credibility interval (95CI) of each method, and how is that accuracy influenced by the four factors studied?
(3) What is the order of the widths of the 95CI for each method, and how are they influenced by the four factors studied?
(4) What is the magnitude and order of the absolute bias (Bias) of the different methods, and how are they influenced by the four factors studied?
(5) What is the magnitude and order of the standard deviation of estimation (SD) of the different methods, and how are they influenced by the four factors studied?
(6) What is the magnitude and order of the root-mean-square error (RMSE) of the different methods, and how are they influenced by the four factors studied?
The first question addresses the reliability of the objective score estimated by each method. As mentioned previously, one of the reasons to use methods other than the proportion-correct method is that the proportion-correct objective score is not reliable given the limited number of items in each objective; the objective score estimated by a selected method must therefore be more reliable than the proportion-correct score. However, because some of the estimation methods, such as the IRT and OPI methods, do not provide a way to estimate reliability, it is empirically impossible to compare the reliability of each method. In the present study, since the true score and the estimated score for each method are both available from the simulation process, the correlation between the true score and the estimated score can be computed, and the reliability of each method can be obtained through the following equation:
$\rho_{XX'} = \dfrac{\sigma_T^2}{\sigma_X^2} = \dfrac{\sigma_{TX}^2}{\sigma_T^2 \sigma_X^2} = \rho_{TX}^2$, (1)
where $\rho_{XX'}$, $\sigma_X$, $\sigma_T$, $\sigma_{TX}$, and $\rho_{TX}$ are the reliability, the standard deviation of the estimated score, the standard deviation of the true score, the covariance of the true and estimated scores, and the correlation between the true and estimated scores, respectively.
The second question concerns the accuracy of the nominal 95% confidence/credibility interval (95CI) of each method. For the objective score estimated by the proportion-correct method, the 95CI is a confidence interval; for the other methods, the 95CI is a credibility interval. Very often when an objective score is reported, the 95CI is presented with it. However, the nominal 95CI may not actually cover the true score 95% of the time. Through the simulation process in the current study, the lower and upper bounds of each estimated objective score were computed, and the actual percentage of the time that the 95CI covers the true score, the percent coverage (PC), was then computed for each method to answer this question.
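In code, the percent coverage reduces to a simple counting exercise over the simulated cases. A sketch under the same assumptions as the reliability snippet above:

```python
import numpy as np

def percent_coverage(true_scores, lower, upper):
    """Percentage of cases whose nominal 95CI brackets the true score;
    `lower` and `upper` hold the interval bounds per case."""
    covered = (lower <= true_scores) & (true_scores <= upper)
    return 100.0 * covered.mean()
```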
The third question concerns the widths of the 95CI for each method. It is possible for a method to produce a 95CI that covers the true score 95% of the time but is very wide. For example, a 95CI under the proportion-correct method might range from 4% to 95%. This range will cover the true proportion-correct score well, because the whole proportion-correct scale runs only from 0% to 100%; however, such a 95CI is practically meaningless because it is too wide. Therefore, in this study the widths of the 95CI for each method were compared. It is necessary to consider both the coverage of the 95CI and its width at the same time in order to identify a better estimation method.
The fourth to sixth questions concern the Bias, SD, and RMSE. They are defined as follows:

$\mathrm{Bias} = |E(W - \tau)|$, (2)

$\mathrm{SD} = \sqrt{E[W - E(W)]^2}$, (3)

$\mathrm{RMSE} = \sqrt{E(W - \tau)^2} = \sqrt{\mathrm{Bias}^2 + \mathrm{SD}^2}$, (4)

where $W$ is the point estimator and $\tau$ is the true parameter.
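A sketch of Equations (2) to (4) over the replications of one condition (tau here stands for the true parameter):

```python
import numpy as np

def bias_sd_rmse(estimates, tau):
    """Bias, SD, and RMSE of a point estimator W (Eqs. 2-4);
    `estimates` holds W over replications, `tau` is the true value."""
    bias = abs(np.mean(estimates) - tau)
    sd = np.std(estimates)  # sqrt(E[W - E(W)]^2), population form
    rmse = np.sqrt(np.mean((estimates - tau) ** 2))
    return bias, sd, rmse   # rmse**2 == bias**2 + sd**2 up to rounding
```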
To summarize, six criteria, (1) reliability, (2) percent coverage (PC) of the true score by the nominal 95% confidence/credibility interval, (3) the width of the 95% confidence/credibility interval (95CI), (4) Bias, (5) SD, and (6) RMSE, were used as the dependent variables in the comparisons of the different methods. A more desirable estimation method should have high score reliability, a narrow 95CI, accurate percent coverage of the nominal 95CI, small SD, Bias close to zero, and, consequently, small RMSE.
Procedures
Simulated Responses
For the purpose of this study, data were simulated to represent different conditions found in test data. Four factors were considered in generating the simulated data; detailed information about them is provided in Table 1. In all, 3 × 3 × 3 × 3 = 81 conditions were considered in the simulation study.
Table 1. Simulation Factors and Number of Levels

Factor                          No. of levels   Description
Number of examinees             3               250, 500, 1000
Test length                     3               6, 12, or 18 items per objective
Correlation between objectives  3               Approximately 1.0, 0.8, and 0.5
Ratio of CR/MC items            3               0%, 20%, or 50% (in number of items)
The present study used empirical item parameters from a state testing program. The item pool contains 30 MC items and 18 CR items (each with three score categories: 0, 1, and 2). The ability (θ) values for the examinees on the different objectives were simulated from a standardized multivariate normal distribution with correlation coefficients equal to 1.0, 0.8, or 0.5.
For each of the 81 conditions, 100 simulation trials (i.e., response vectors) were used. For each condition, with the item parameters assumed to be known, the simulation process involved the following steps:
1. Generate θ's for each of the examinees from a standardized multivariate normal distribution. The generated values were restricted to be between -3 and 3. The correlation coefficients between the θ's were .5, .8, or 1.0.
2. Compute Pij(θ) using the IRT 3PL equation for item i in objective j and Pijk(θ) using the generalized partial credit model equation for category k of item i in objective j. The item parameters were randomly selected with replacement from the item pool. Pij(θ) and Pijk(θ) are defined in the "Bock method" section of this paper.
3. Use the Pij(θ) and Pijk(θ) from step 2 to compute the true score for each objective. For example, if items 1 through 6 were in objective 1, the true score of objective 1 was the total of the Pij(θ) and weighted Pijk(θ) of items 1 through 6. This true objective score was used as the baseline against which the different methods of estimating the objective score were compared.
4. Generate responses yij using the Pij(θ) and Pijk(θ) from step 2. Each yij is either a sample from the binomial distribution with probability Pij(θ) or a sample from the multinomial distribution with probabilities equal to the Pijk(θ), where k is the index for the category level and ranges from 0 to 2.
5. Use the data from step 4 to estimate the objective scores with the different estimation methods. The details of each method are described later.
6. Repeat steps 4 and 5 100 times and compute the objective score reliability, 95CI width, percent coverage of the nominal 95CI, Bias, SD, and RMSE for further analyses.
It should be noted that in the simulation process the data were simulated according to the correlation between the θ's rather than the correlation between the objective scores. However, because the objective scores are a monotonically increasing function of θ, the correlation between the objective scores is approximately equal to the correlation between the θ's. A code sketch of the data-generation steps follows.
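To make steps 1 to 4 concrete, here is a minimal sketch of the data generation for a single examinee. The helper functions implement the standard 3PL and generalized partial credit formulas; the specific parameter values are made up for illustration and are not the state item pool used in the study:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_3pl(theta, a, b, c):
    """3PL correct-response probability with the 1.7 scaling constant."""
    z = 1.7 * a * (theta - b)
    return c + (1.0 - c) / (1.0 + np.exp(-z))

def p_gpc(theta, a, b_steps):
    """Generalized partial credit category probabilities for one item;
    category 0 contributes nothing, b_steps holds the step parameters."""
    cum = np.concatenate(([0.0], np.cumsum(a * (theta - b_steps))))
    ez = np.exp(cum - cum.max())            # stabilized softmax
    return ez / ez.sum()

# Step 1: correlated abilities for two objectives, truncated to [-3, 3]
theta = np.clip(rng.multivariate_normal([0, 0], [[1, .8], [.8, 1]]), -3, 3)

# Steps 2 and 4: probabilities, then sampled responses
p_mc = p_3pl(theta[0], a=1.0, b=0.0, c=0.2)               # one MC item
y_mc = rng.binomial(1, p_mc)                              # 0/1 response
p_cr = p_gpc(theta[1], a=1.0, b_steps=np.array([-.5, .5]))
y_cr = rng.choice(3, p=p_cr)                              # 0/1/2 response
```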
Objective Scores
The simulated data were used as input to the different estimation methods to compute the objective scores. These estimated objective scores were then used to compare the estimation methods. Brief descriptions of the estimation methods follow.
Bock method
The data from the simulation procedure were used to estimate the examinees' θ values for the whole test. The estimated abilities, $\hat{\theta}$, were then entered into one of two equations: for MC items, the 3PL model,

$P_{ij}(\hat{\theta}) = c_{ij} + (1 - c_{ij}) \dfrac{\exp[1.7 a_{ij}(\hat{\theta} - b_{ij})]}{1 + \exp[1.7 a_{ij}(\hat{\theta} - b_{ij})]}$,

or, for CR items, the generalized partial credit model,

$P_{ijk}(\hat{\theta}) = \dfrac{\exp \sum_{c=0}^{k} a_{ij}(\hat{\theta} - b_{ijc})}{\sum_{l=0}^{m_i - 1} \exp \sum_{c=0}^{l} a_{ij}(\hat{\theta} - b_{ijc})}$,

to estimate $P_{ij}(\hat{\theta})$ and $P_{ijk}(\hat{\theta})$, where $i$, $j$, $c$, $k$, and $m_i$ represent the item, the objective, the score-level index, the current computed score level, and the total number of score levels for item $i$, respectively. The objective score for objective $j$, IRT T, was then computed by the equation

$\mathrm{IRT\,T}_j = \dfrac{1}{n_j} \sum_{i=1}^{I_j} \xi_{ij}(\hat{\theta})$, (5)

where $i$ indexes items, $I_j$ is the number of items in objective $j$, and $n_j$ is the maximum possible number of points in objective $j$; note that $n_j = \sum_{i=1}^{I_j} (m_i - 1)$.

For MC items,

$\xi_{ij}(\hat{\theta}) = P_{ij}(\hat{\theta})$; (6)

for CR items,

$\xi_{ij}(\hat{\theta}) = \sum_{k=1}^{m_i} (k - 1) P_{ijk}(\hat{\theta})$. (7)
Bayes estimates of the scale scores, $\hat{\theta}$, were used to compute the IRT objective scores, with a normal (0, 1) prior distribution for the abilities. Usually, when objective scores are estimated, the item parameters (i.e., the item pool) already exist; therefore, in this study the item parameters were assumed known.
The variance of the IRT T can be expressed as

$\mathrm{VAR}_{\mathrm{IRT\,T}_j} = \dfrac{\sum_{i=1}^{n_{MC}} P_{ij}(\hat{\theta}) [1 - P_{ij}(\hat{\theta})] + \sum_{i=1}^{n_{CR}} \left\{ \sum_{k=1}^{m_i} (k-1)^2 P_{ijk}(\hat{\theta}) - \left[ \sum_{k=1}^{m_i} (k-1) P_{ijk}(\hat{\theta}) \right]^2 \right\}}{n_j^2}$, (8)

where $n_{MC}$ and $n_{CR}$ are the numbers of MC and CR items, respectively, and $n_j$ is, as defined previously, the maximum possible number of points in objective $j$.
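A sketch of Equations (5) to (8) for a single objective, reusing the p_3pl and p_gpc helpers from the simulation sketch above; the item-parameter lists are hypothetical placeholders:

```python
import numpy as np

def bock_irt_t(theta_hat, mc_items, cr_items, n_j):
    """Bock objective score (Eq. 5) and its variance (Eq. 8);
    mc_items holds (a, b, c) tuples, cr_items holds (a, b_steps)."""
    total, var = 0.0, 0.0
    for a, b, c in mc_items:                  # MC contribution, Eq. (6)
        p = p_3pl(theta_hat, a, b, c)
        total += p
        var += p * (1.0 - p)
    for a, b_steps in cr_items:               # CR contribution, Eq. (7)
        pk = p_gpc(theta_hat, a, b_steps)
        k = np.arange(len(pk))                # possible scores 0..m_i-1
        ev = (k * pk).sum()
        total += ev
        var += (k ** 2 * pk).sum() - ev ** 2
    return total / n_j, var / n_j ** 2        # IRT T_j, VAR(IRT T_j)
```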
Yen method
The following steps were used to estimate Yen's T (Yen et al., 1997):
1. Estimate the IRT item parameters for all selected items.
2. Estimate θ for the whole test (including all objectives).
3. For each objective, calculate the IRT T_j of Equation (5), denoted $\hat{T}_j$, where j represents objective j.
4. Obtain

$Q = \sum_{j=1}^{J} \dfrac{n_j (\hat{T}_j - x_j / n_j)^2}{\hat{T}_j (1 - \hat{T}_j)}$. (9)

If $Q > \chi^2(J, .10)$, then Yen's $T_j$ is $\tilde{T}_j = x_j / n_j$, with

$p_j = x_j$ (10)

and

$q_j = n_j - x_j$, (11)

where $x_j$ is the observed score obtained in objective $j$, $n_j$ is the maximum number of points that can be obtained in objective $j$, and $J$ is the number of objectives.

If $Q \le \chi^2(J, .10)$, then

$p_j = \hat{T}_j n_j^* + x_j$ (12)

and

$q_j = (1 - \hat{T}_j) n_j^* + n_j - x_j$. (13)

Yen's $T_j$, $\tilde{T}_j$, is then defined to be

$\tilde{T}_j = \dfrac{p_j}{p_j + q_j} = \dfrac{\hat{T}_j n_j^* + x_j}{n_j^* + n_j}$,

where

$n_j^* = \dfrac{\hat{T}_j (1 - \hat{T}_j)}{\sigma^2(\hat{T}_j)} - 1$, with $\sigma^2(\hat{T}_j) = \dfrac{[T_j'(\hat{\theta})]^2}{I(\hat{\theta})}$ and $T_j'(\hat{\theta}) = \dfrac{1}{n_j} \sum_{i=1}^{I_j} \xi_{ij}'(\hat{\theta})$. (14)
For MC items,

$P_{ij}'(\hat{\theta}) = \dfrac{1.7 a_{ij} [1 - P_{ij}(\hat{\theta})] [P_{ij}(\hat{\theta}) - c_{ij}]}{1 - c_{ij}}$ (15)

and

$I_{MC}(\hat{\theta}) = \sum_{i=1}^{n_{MC}} \dfrac{[P_{ij}'(\hat{\theta})]^2}{P_{ij}(\hat{\theta}) [1 - P_{ij}(\hat{\theta})]} = \sum_{i=1}^{n_{MC}} \dfrac{1.7^2 a_{ij}^2 [1 - P_{ij}(\hat{\theta})]}{P_{ij}(\hat{\theta})} \left[ \dfrac{P_{ij}(\hat{\theta}) - c_{ij}}{1 - c_{ij}} \right]^2$, (16)

where $n_{MC}$ is the total number of MC items in the whole test.

For CR items,

$\xi_{ij}'(\hat{\theta}) = a_{ij} \left\{ \sum_{k=1}^{m_i} (k-1)^2 P_{ijk}(\hat{\theta}) - \left[ \sum_{k=1}^{m_i} (k-1) P_{ijk}(\hat{\theta}) \right]^2 \right\}$ (17)

and

$I_{CR}(\hat{\theta}) = \sum_{i=1}^{n_{CR}} \dfrac{[\xi_{ij}'(\hat{\theta})]^2}{\sum_{k=1}^{m_i} (k-1)^2 P_{ijk}(\hat{\theta}) - \left[ \sum_{k=1}^{m_i} (k-1) P_{ijk}(\hat{\theta}) \right]^2} = \sum_{i=1}^{n_{CR}} a_{ij}^2 \left\{ \sum_{k=1}^{m_i} (k-1)^2 P_{ijk}(\hat{\theta}) - \left[ \sum_{k=1}^{m_i} (k-1) P_{ijk}(\hat{\theta}) \right]^2 \right\}$, (18)

where $n_{CR}$ is the total number of CR items in the whole test. The definition of $\xi_{ij}(\hat{\theta})$ is given in Equations (6) and (7).
The variance of Yen's T can be expressed as

$\mathrm{Var}(\mathrm{Yen\,T}_j) = \dfrac{p_j q_j}{(p_j + q_j)^2 (p_j + q_j + 1)}$, (19)

where $p_j$ and $q_j$ are defined in Equations (10) to (13).
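The OPI logic can be summarized in a few lines once the IRT estimates and their variances are in hand. A sketch of Equations (9) to (14) and (19), with the chi-square critical value taken from scipy (the cutoff convention here, the upper 10% point, is my reading of the Q test):

```python
import numpy as np
from scipy.stats import chi2

def yen_opi(t_hat, var_t_hat, x, n):
    """Yen T and its variance per objective; all arguments are arrays
    over the J objectives (Eqs. 9-14 and 19)."""
    q = np.sum(n * (t_hat - x / n) ** 2 / (t_hat * (1 - t_hat)))  # Eq. (9)
    if q > chi2.ppf(0.90, df=len(n)):     # fit rejected: fall back to x/n
        p, qq = x, n - x                  # Eqs. (10), (11)
    else:                                 # shrink x/n toward t_hat
        n_star = t_hat * (1 - t_hat) / var_t_hat - 1       # Eq. (14)
        p = t_hat * n_star + x                             # Eq. (12)
        qq = (1 - t_hat) * n_star + n - x                  # Eq. (13)
    t = p / (p + qq)
    var = p * qq / ((p + qq) ** 2 * (p + qq + 1))          # Eq. (19)
    return t, var
```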
Wainer method
In vector notation, for the multivariate situation involving several objective scores collected in the vector $\mathbf{x}$,

$\mathrm{REG\,T} = \bar{\mathbf{x}} + \mathbf{B}(\mathbf{x} - \bar{\mathbf{x}})$. (20)

$\bar{\mathbf{x}}$ is the mean vector containing the mean of each objective involved, and $\mathbf{B}$ is a matrix that is the multivariate analog of the estimated reliability of each objective. All that is needed to calculate REG T is an estimate of $\mathbf{B}$:

$\mathbf{B} = \mathbf{S}_{\mathrm{true}} (\mathbf{S}_{\mathrm{obs}})^{-1}$, (21)

where $\mathbf{S}_{\mathrm{obs}}$ is the observed variance-covariance matrix of the objective scores and $\mathbf{S}_{\mathrm{true}}$ is the variance-covariance matrix of the true objective scores.

Because errors are uncorrelated with true scores, it is easy to see that $\sigma_{\tau_{jv} \tau_{jv'}} = \sigma_{x_{jv} x_{jv'}}$, where $\sigma_{\tau_{jv} \tau_{jv'}}$ and $\sigma_{x_{jv} x_{jv'}}$ are the off-diagonal elements of $\Sigma_{\mathrm{true}}$ and $\Sigma_{\mathrm{obs}}$, the population variance-covariance matrices of the true and observed objective scores. It is in the diagonal elements of $\mathbf{S}_{\mathrm{obs}}$ and $\mathbf{S}_{\mathrm{true}}$ that the difference arises. However, if the diagonal elements of $\mathbf{S}_{\mathrm{obs}}$ are multiplied by the reliability, $\sigma_\tau^2 / \sigma_x^2$, of the subscale in question, the results are the diagonal elements of $\mathbf{S}_{\mathrm{true}}$. It is customary to estimate this reliability with Cronbach's coefficient $\alpha$ (Wainer et al., 2000).

The score variance of the estimates for the $v$th objective is the $v$th diagonal element of the matrix

$\mathbf{C} = \mathbf{S}_{\mathrm{true}} (\mathbf{S}_{\mathrm{obs}})^{-1} \mathbf{S}_{\mathrm{true}}$. (22)
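Equations (20) to (22) translate almost line for line into matrix code. A sketch, with Cronbach's alpha values assumed to be computed elsewhere:

```python
import numpy as np

def wainer_reg_t(x, alphas):
    """Regressed objective scores (Eq. 20) and their score variances
    (Eq. 22); x is examinees-by-objectives, alphas holds one Cronbach's
    alpha estimate per objective."""
    x_bar = x.mean(axis=0)
    s_obs = np.cov(x, rowvar=False)
    s_true = s_obs.copy()                     # off-diagonals carry over
    np.fill_diagonal(s_true, np.diag(s_obs) * np.asarray(alphas))
    b = s_true @ np.linalg.inv(s_obs)         # Eq. (21)
    reg_t = x_bar + (x - x_bar) @ b.T         # Eq. (20), row per examinee
    return reg_t, np.diag(b @ s_true)         # scores, Eq. (22) diagonal
```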
Shin method
Instead of using the empirical Bayes approach of the regressed score method, the MCMC regressed score method uses a fully Bayesian model. Here is an example with two objectives to illustrate the fully Bayesian estimation. If the two objectives are called M and V, the observed scores and true scores on these objectives are $x_m$, $x_v$, $\tau_m$, and $\tau_v$, respectively. The fully Bayesian model yields the posterior distribution of $(\tau_m \mid x_m, x_v)$, which gives MREG $T_m$, and of $(\tau_v \mid x_m, x_v)$, which gives MREG $T_v$.

From classical test theory (CTT),

$x_{pm} = \tau_{pm} + \varepsilon_{pm}$ (23)

and

$x_{pv} = \tau_{pv} + \varepsilon_{pv}$, (24)

where $p$ indexes examinees.
These can be re-parameterized via CTT equations as

$x_{pm} = \tau_{pm} + \varepsilon_{pm}, \quad \varepsilon_{pm} \sim N(0, \sigma_m^2)$, (25)

where

$\tau_{pm} = \mu_m + \eta_{pm}, \quad \eta_{pm} \sim N(0, \psi_m^2)$; (26)

and

$x_{pv} = \tau_{pv} + \varepsilon_{pv}, \quad \varepsilon_{pv} \sim N(0, \sigma_v^2)$, (27)

where

$\tau_{pv} = \mu_v + \eta_{pv}, \quad \eta_{pv} \sim N(0, \psi_v^2)$. (28)

$\varepsilon_{pm}$ and $\varepsilon_{pv}$ are the error terms, $\mu_m$ and $\mu_v$ are the "common" true scores for objective $m$ and objective $v$, and $\eta_{pm}$ and $\eta_{pv}$ are the "unique" true scores for examinee $p$ on objective $m$ and objective $v$.
In the fully Bayesian model, instead of replacing the unknown parameters with maximum likelihood estimates (MLE), as the empirical Bayes method does, it is necessary to place prior distributions on the unknown parameters $\sigma_m^2$, $\sigma_v^2$, $\psi_m^2$, $\psi_v^2$, and $\psi_{mv}$.
If it is assumed that

$\begin{pmatrix} x_{pm} \\ x_{pv} \\ \tau_{pm} \\ \tau_{pv} \end{pmatrix} \sim N\left[ \begin{pmatrix} \mu_m \\ \mu_v \\ \mu_m \\ \mu_v \end{pmatrix}, \begin{pmatrix} \psi_m^2 + \sigma_m^2 & \psi_{mv} & \psi_m^2 & \psi_{mv} \\ \psi_{mv} & \psi_v^2 + \sigma_v^2 & \psi_{mv} & \psi_v^2 \\ \psi_m^2 & \psi_{mv} & \psi_m^2 & \psi_{mv} \\ \psi_{mv} & \psi_v^2 & \psi_{mv} & \psi_v^2 \end{pmatrix} \right]$, (29)

then

$\tau_{pm} \mid x_{pm}, x_{pv} \sim N\left[ \mu_m + \Sigma_{21} \Sigma_{11}^{-1} \begin{pmatrix} x_{pm} - \mu_m \\ x_{pv} - \mu_v \end{pmatrix}, \; \psi_m^2 - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{21}' \right]$ (30)

and

$\tau_{pv} \mid x_{pm}, x_{pv} \sim N\left[ \mu_v + \Sigma_{31} \Sigma_{11}^{-1} \begin{pmatrix} x_{pm} - \mu_m \\ x_{pv} - \mu_v \end{pmatrix}, \; \psi_v^2 - \Sigma_{31} \Sigma_{11}^{-1} \Sigma_{31}' \right]$, (31)

where

$\Sigma_{11} = \begin{pmatrix} \psi_m^2 + \sigma_m^2 & \psi_{mv} \\ \psi_{mv} & \psi_v^2 + \sigma_v^2 \end{pmatrix}$, (32)

$\Sigma_{12} = \Sigma_{22} = \begin{pmatrix} \psi_m^2 & \psi_{mv} \\ \psi_{mv} & \psi_v^2 \end{pmatrix}$, (33)

$\Sigma_{21} = (\psi_m^2, \; \psi_{mv})$, (34)

and

$\Sigma_{31} = (\psi_{mv}, \; \psi_v^2)$. (35)
The priors for $\mu_m$, $\mu_v$, $\sigma_m^2$, $\sigma_v^2$, $\psi_m^2$, $\psi_v^2$, and $\psi_{mv}$ need to be specified. This model was coded in WinBUGS to estimate the fully Bayesian regressed objective scores. The variance of the estimate is the variance of the posterior distribution, and the 95% credibility interval is the interval between the .025 and .975 quantiles of the posterior distribution.
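Shin et al. fit this model by MCMC in WinBUGS, sampling the hyperparameters from their priors. The conditional-mean algebra of Equations (30) to (35) can nonetheless be illustrated directly; the sketch below computes E(τ_pm | x_pm, x_pv) for fixed, known hyperparameters, which is a simplified special case rather than the full model:

```python
import numpy as np

def shin_posterior_mean_m(x_p, mu, psi2_m, psi2_v, psi_mv, sig2_m, sig2_v):
    """E(tau_pm | x_pm, x_pv) from Eq. (30), treating the hyperparameters
    as known; the fully Bayesian model instead samples them by MCMC."""
    x_p, mu = np.asarray(x_p, float), np.asarray(mu, float)
    sigma_11 = np.array([[psi2_m + sig2_m, psi_mv],
                         [psi_mv, psi2_v + sig2_v]])   # Eq. (32)
    sigma_21 = np.array([psi2_m, psi_mv])              # Eq. (34)
    return mu[0] + sigma_21 @ np.linalg.solve(sigma_11, x_p - mu)
```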
Results
The results of the comparisons of the methods are presented in Figures 1 to 24. Figures 1 to 4 show the comparison of estimated reliability under the different factors (the number of examinees, the number of items in each objective, the correlation between objectives, and the ratio of CR/MC items). For brevity, unless otherwise noted, "reliability" in the following text refers to the estimated reliability. Figures 5 to 8 compare the width of the 95CI for each method, Figures 9 to 12 the percent coverage of the 95CI, Figures 13 to 16 the absolute bias under the different factors, Figures 17 to 20 the SD, and Figures 21 to 24 the RMSE.
Comparison of Reliability
[Figure 1. Reliability for different number of examinees. Line plot: x-axis, number of examinees (250, 500, 1000); y-axis, reliability (.70 to .80); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 1 compares reliability for different numbers of examinees. From Figure 1 it can be seen that:
(1) The number of examinees did not impact the reliability of the objective score: the reliability for different numbers of examinees remains approximately the same except for the Bock method. Even there, the largest reliability for the Bock method, .74, is only about .03 higher than the lowest reliability, .71, for that method.
(2) The order of the magnitude of the objective score reliability for each method is: Shin = Wainer > Yen > Bock > proportion-correct. The objective scores estimated from the Shin and Wainer methods have the highest reliability.
[Figure 2. Reliability for different number of items in each objective. Line plot: x-axis, items per objective (6, 12, 18); y-axis, reliability (.5 to .9); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 2 compares reliability for different numbers of items in each objective. From Figure 2 it can be seen that:
(1) The number of items in each objective had an impact on the reliability of the objective score: reliability increased as more items were placed in each objective. The increase was steeper from 6 to 12 items and flatter from 12 to 18 items.
(2) The order of the magnitude of the objective score reliability for each method is: Shin = Wainer > Yen > Bock > proportion-correct. The objective scores estimated from the Shin and Wainer methods have the highest reliability.
(3) For the Bock method, the reliability was the same for the 12- and 18-item cases.
[Figure 3. Reliability for different correlation between objectives. Line plot: x-axis, correlation (0.5, 0.8, 1.0); y-axis, reliability (.5 to .9); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 3 compares reliability for different correlations between objectives. From Figure 3 it can be seen that:
(1) The correlation between objectives had an impact on the reliability of the objective score: reliability increased as the correlation became higher. However, this effect did not appear in the proportion-correct method.
(2) The order of the magnitude of the objective score reliability for each method is approximately: Shin = Wainer > Yen > proportion-correct. The objective scores estimated from the Shin and Wainer methods have the highest reliability.
(3) The Bock method behaved uniquely in this case. When the correlation between objectives equaled .5, it had the lowest reliability, and when the correlation equaled one, it had the highest reliability. This finding is related to the assumption underlying the Bock method. Because Bock's objective score is actually the IRT domain score, it must satisfy the IRT assumption that each objective measures the same thing (i.e., the unidimensionality assumption). When the correlation between objectives is 0.5, the assumption is violated, making it the worst method for estimating the objective score; when the correlation is 1.0, the assumption is met and it is the best method.
[Figure 4. Reliability for different ratio of CR/MC items. Line plot: x-axis, CR/MC ratio (0, 0.2, 0.5); y-axis, reliability (.66 to .82); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 4 compares reliability for different ratios of CR/MC items in each objective. From Figure 4 it can be seen that:
(1) The ratio of CR/MC items in each objective had an impact on the reliability of the objective score: reliability increased as the ratio of CR/MC items in each objective increased.
(2) The order of the magnitude of the objective score reliability for each method is: Shin = Wainer > Yen > Bock > proportion-correct. The objective scores estimated from the Shin and Wainer methods have the highest reliability.
(3) Because the CR items used in this study all have the same number of categories (three), more CR items means more possible points in each objective. That is, if there are more possible points in each objective, the reliability of the estimated objective scores will be higher.
To sum up: as the number of items per objective, the correlation between objectives, and the ratio of CR/MC items increased, the reliability of the estimated objective score also increased. The number of examinees did not affect the reliability of the objective score. Among the five methods studied, the estimated objective scores from the Shin and Wainer methods generally have the highest reliability, except when the correlation between objectives is 1.0; in that case, the Bock method yielded the highest reliability.
Comparison of the Width of 95CI
[Figure 5. Width of 95CI for different number of examinees. Line plot: x-axis, number of examinees (250, 500, 1000); y-axis, width of 95CI (.2 to .6); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 5 compares the width of the 95CI for different numbers of examinees. From Figure 5 it can be seen that:
(1) The number of examinees did not have an impact on the width of the 95CI.
(2) The order of the magnitude of the width of the 95CI for each method is: proportion-correct > Bock > Shin = Wainer > Yen.
[Figure 6. Width of 95CI for different number of items in each objective. Line plot: x-axis, items per objective (6, 12, 18); y-axis, width of 95CI (.1 to .8); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 6 compares the width of the 95CI for different numbers of items per objective. From Figure 6 it can be seen that:
(1) The number of items in each objective had an impact on the width of the 95CI: as the number of items per objective increased, the width of the 95CI decreased.
(2) The order of the magnitude of the width of the 95CI for each method is: proportion-correct > Bock > Shin = Wainer > Yen.
[Figure 7. Width of 95CI for different correlation between objectives. Line plot: x-axis, correlation (0.5, 0.8, 1.0); y-axis, width of 95CI (.2 to .6); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 7 compares the width of the 95CI for different correlations between objectives. From Figure 7 it can be seen that:
(1) The correlation between objectives did not have an impact on the width of the 95CI for the Bock, proportion-correct, and Yen methods, but it had a slight effect on the Wainer and Shin methods. For those two methods, as the correlation between objectives increased, the width of the 95CI slightly decreased.
(2) The order of the magnitude of the width of the 95CI for each method is: proportion-correct > Bock > Shin = Wainer > Yen.
[Figure 8. Width of 95CI for different ratio of CR/MC items. Line plot: x-axis, CR/MC ratio (0, 0.2, 0.5); y-axis, width of 95CI (.2 to .6); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 8 compares the width of the 95CI for different ratios of CR/MC items. From Figure 8 it can be seen that:
(1) The ratio of CR/MC items did not have an impact on the width of the 95CI except for the Bock method. For the Bock method, as the ratio of CR/MC items increased, the width of the 95CI slightly decreased.
(2) The order of the magnitude of the width of the 95CI for each method is: proportion-correct > Bock > Shin = Wainer > Yen.
To sum up: among the four factors studied, only the number of items per objective had an impact on the width of the 95CI. More items per objective tended to lead to a narrower 95CI. The order of the magnitude of the width of the 95CI is consistent across the different factors; the Yen method has the narrowest 95CI.
Comparison of the Percent Coverage of 95CI
[Figure 9. Percent coverage of 95CI for different number of examinees. Line plot: x-axis, number of examinees (250, 500, 1000); y-axis, percent coverage of 95CI (90 to 102); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 9 compares the percent coverage of the 95CI for different numbers of examinees. From Figure 9 it can be seen that:
(1) Generally, the number of examinees did not have an impact on the percent coverage of the 95CI.
(2) Almost all the methods have a conservative 95CI, because their percent coverage of the 95CI is larger than 95%. That is, the nominal 95CI covers the true score more than 95% of the time.
[Figure 10. Percent coverage of 95CI for different number of items in each objective. Line plot: x-axis, items per objective (6, 12, 18); y-axis, percent coverage of 95CI (92 to 102); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 10 compares the percent coverage of the 95CI for different numbers of items in each objective. From Figure 10 it can be seen that:
(1) For the Bock and Yen methods, the number of items in each objective had an impact on the percent coverage of the 95CI. Generally, as the number of items per objective increased, the percent coverage of the 95CI decreased.
(2) Almost all the methods have a conservative 95CI, because their percent coverage of the 95CI is larger than 95%. That is, the nominal 95CI covers the true score more than 95% of the time.
[Figure 11. Percent coverage of 95CI for different correlation between objectives. Line plot: x-axis, correlation (0.5, 0.8, 1.0); y-axis, percent coverage of 95CI (92 to 102); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 11 compares the percent coverage of the 95CI for different correlations between objectives. From Figure 11 it can be seen that:
(1) For the Bock and Yen methods, the correlation between objectives has an obvious impact on the percent coverage of the 95CI.
(2) Almost all the methods have a conservative 95CI, because their percent coverage of the 95CI is larger than 95%. That is, the nominal 95CI covers the true score more than 95% of the time.
[Figure 12. Percent coverage of 95CI for different ratio of CR/MC items. Line plot: x-axis, CR/MC ratio (0, 0.2, 0.5); y-axis, percent coverage of 95CI (90 to 102); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 12 compares the percent coverage of the 95CI for different ratios of CR/MC items. From Figure 12 it can be seen that:
(1) For the Yen method, the CR/MC ratio has some impact on the percent coverage of the 95CI, but the pattern is not consistent across situations.
(2) Almost all the methods have a conservative 95CI, because their percent coverage of the 95CI is larger than 95%. That is, the nominal 95CI covers the true score more than 95% of the time.
To sum up: the 95CIs for all of the methods except the Yen method are conservative. That is, the nominal 95CI covered the true score more than 95% of the time in the simulation study. The Yen method has a 95CI that is closer to its nominal value, 95%. This may be due to the non-symmetric property of the 95CI for the Yen method.
Comparison of Bias
[Figure 13. Bias for different number of examinees. Line plot: x-axis, number of examinees (250, 500, 1000); y-axis, Bias (0.00 to .06); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 13 compares bias for different numbers of examinees. From Figure 13 it can be seen that:
(1) The number of examinees did not impact the Bias of the objective score: the Bias for different numbers of examinees remains approximately the same for each method.
(2) The order of the Bias for each method was: Bock > Wainer > Shin > Yen > proportion-correct. The magnitudes of Bias ranged from approximately 0.01 to 0.05; that is, if the perfect score is 100, the Bias is around 1 to 5 points.
[Figure 14. Bias for different number of items in each objective. Line plot: x-axis, items per objective (6, 12, 18); y-axis, Bias (0.00 to .07); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 14 compares bias for different numbers of items in each objective. From Figure 14 it can be seen that:
(1) The number of items in each objective had an impact on the Bias of the objective scores for the Wainer, Shin, and Yen methods: their Bias increased as more items were placed in each objective, with a steeper increase from 6 to 12 items and a flatter one from 12 to 18 items. The number of items did not affect the Bias for the proportion-correct method. For the Bock method, the Bias decreased as the number of items increased from 6 to 12 but increased slightly as the number of items increased from 12 to 18.
(2) The order of the Bias for each method was approximately: Bock > Wainer > Shin > Yen > proportion-correct.
[Figure 15. Bias for different correlation between objectives. Line plot: x-axis, correlation (0.5, 0.8, 1.0); y-axis, Bias (0.00 to .10); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 15 compares bias for different correlations between objectives. From Figure 15 it can be seen that:
(1) The correlation between objectives had an impact on the Bias for each method: the Bias increased as the correlation became higher. However, this effect did not appear in the proportion-correct method.
(2) The order of the Bias for each method was approximately: Bock > Wainer > Shin > Yen > proportion-correct.
(3) The Bock method behaved uniquely in this case. When the correlation between objectives equaled .5, it had the highest Bias, but when the correlation equaled one, it had relatively low Bias. This finding is related to the assumption underlying the Bock method. Because Bock's objective score is actually the IRT domain score, it must satisfy the IRT assumption that each objective measures the same thing (i.e., the unidimensionality assumption). When the correlation between objectives is 0.5, the assumption is violated, making it the worst method for estimating the objective score; when the correlation was 1.0, the assumption was met and the Bias was lower.
[Figure 16. Bias for different ratio of CR/MC items. Line plot: x-axis, CR/MC ratio (0, 0.2, 0.5); y-axis, Bias (0.00 to .06); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 16 compares bias for different ratios of CR/MC items in each objective. From Figure 16 it can be seen that:
(1) The ratio of CR/MC items in each objective did not impact the Bias of the objective score.
(2) The order of the Bias for each method was approximately: Bock > Wainer > Shin > Yen > proportion-correct.
To sum up: the Bias was affected by two factors, the number of items per objective and the correlation between objectives. As the number of items per objective or the correlation between objectives increased, the Bias of the estimated objective score also increased. The number of examinees and the ratio of CR/MC items did not affect the Bias of the objective score. Among the five methods studied, the order of the magnitude of the Bias is consistent across the four factors; generally, the order was: Bock > Wainer > Shin > Yen > proportion-correct. The magnitude of the Bias ranged approximately from 0.01 to 0.08.
Comparison of the SD
[Figure 17. SD for different number of examinees. Line plot: x-axis, number of examinees (250, 500, 1000); y-axis, SD (.06 to .13); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 17 compares the SD for different numbers of examinees. From Figure 17 it can be seen that:
(1) The number of examinees did not impact the SD.
(2) The order of the SD for each method is: proportion-correct > Yen > Shin > Wainer > Bock.
[Figure 18. SD for different number of items in each objective. Line plot: x-axis, items per objective (6, 12, 18); y-axis, SD (.04 to .18); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 18 compares the SD for different numbers of items per objective. From Figure 18 it can be seen that:
(1) The number of items in each objective had an impact on the SD: as the number of items per objective increased, the SD decreased.
(2) Generally, the order of the SD for each method is: proportion-correct > Yen > Shin > Wainer > Bock.
[Figure 19. SD for different correlation between objectives. Line plot: x-axis, correlation (0.5, 0.8, 1.0); y-axis, SD (.06 to .13); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 19 compares the SD for different correlations between objectives. From Figure 19 it can be seen that:
(1) The correlation between objectives had an impact on the SD except for the proportion-correct method. For the other methods, as the correlation between objectives increased, the SD decreased.
(2) The order of the SD for each method is: proportion-correct > Yen > Shin > Wainer > Bock.
[Figure 20. SD for different ratio of CR/MC items. Line plot: x-axis, CR/MC ratio (0, 0.2, 0.5); y-axis, SD (.06 to .14); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 20 compares the SD for different ratios of CR/MC items. From Figure 20 it can be seen that:
(1) The ratio of CR/MC items did not impact the SD.
(2) The order of the SD for each method is: proportion-correct > Yen > Shin > Wainer > Bock.
To sum up: among the four factors studied, only the number of items per objective and the correlation between objectives had an impact on the SD. More items per objective or a higher correlation between objectives tended to lead to a smaller SD. The order of the SD is consistent across the different factors: proportion-correct > Yen > Shin > Wainer > Bock. The magnitude of the SD ranged approximately from 0.07 to 0.16.
Comparison of the RMSE
[Figure 21. RMSE for different number of examinees. Line plot: x-axis, number of examinees (250, 500, 1000); y-axis, RMSE (.08 to .13); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 21 compares the RMSE for different numbers of examinees. From Figure 21 it can be seen that:
(1) Generally, the number of examinees did not impact the RMSE.
(2) The order of the RMSE for each method is: proportion-correct > Yen > Bock > Shin = Wainer.
[Figure 22. RMSE for different number of items in each objective. Line plot: x-axis, items per objective (6, 12, 18); y-axis, RMSE (.06 to .18); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 22 compares the RMSE for different numbers of items in each objective. From Figure 22 it can be seen that:
(1) The number of items in each objective had an impact on the RMSE for each method. Generally, as the number of items per objective increased, the RMSE decreased.
(2) The order of the RMSE for each method is: proportion-correct > Yen > Bock > Shin = Wainer.
[Figure 23. RMSE for different correlation between objectives. Line plot: x-axis, correlation (0.5, 0.8, 1.0); y-axis, RMSE (.07 to .13); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 23 compares the RMSE for different correlations between objectives. From Figure 23 it can be seen that:
(1) Except for the proportion-correct method, the correlation between objectives had an impact on the RMSE: as the correlation between objectives increased, the RMSE decreased.
(2) The order of the RMSE for each method was generally: proportion-correct > Yen > Bock > Shin = Wainer.
(3) It should be noted that the Bock method was greatly affected by the correlation between objectives. When the correlation equals 0.5, the assumption of the Bock method is badly violated and its RMSE became high. However, when the correlation equals 1.0, the Bock assumption is perfectly met and its RMSE became the lowest.
[Figure 24. RMSE for different ratio of CR/MC items. Line plot: x-axis, CR/MC ratio (0, 0.2, 0.5); y-axis, RMSE (.08 to .14); series: Bock, Yen, Wainer, Shin, proportion-correct.]
Figure 24 compares the RMSE for different ratios of CR/MC items. From Figure 24 it can be seen that:
(1) The CR/MC ratio had a slight impact on the RMSE for each method: as the ratio increased, the RMSE slightly decreased.
(2) The order of the RMSE for each method was generally: proportion-correct > Yen > Bock > Shin = Wainer.
To sum up: among the four factors studied, only the number of items per objective and the correlation between objectives had an impact on the RMSE. More items per objective or a higher correlation between objectives tended to lead to a smaller RMSE. The order of the RMSE is consistent across the different factors and was generally: proportion-correct > Yen > Bock > Shin = Wainer. The magnitude of the RMSE ranged approximately from 0.08 to 0.16.
Discussion
Based on the results of this simulation study, it appears that using estimation methods other than the proportion-correct method improved the estimation of objective scores in terms of score reliability. Factors that affected the reliability of objective scores included the number of items per objective, the correlation between objectives, and the ratio of CR/MC items; the number of examinees did not affect reliability. As the values of these factors increased, the reliability of the estimated objective score also increased. Only the number of items per objective affected the width of the 95CI and the percent coverage of the 95CI. Generally, the objective scores estimated from the Wainer and Shin methods had the highest reliability, ranging approximately from 0.68 to 0.83. The objective scores estimated from the Yen and proportion-correct methods had relatively lower reliability: from 0.59 to 0.75 for the proportion-correct method and from 0.65 to 0.80 for the Yen method. The studied factors had a larger impact on the Bock method, especially the correlation between objectives; when the correlation was 1.0, the Bock method had the highest reliability (around 0.85).
Table 2 shows the effective gain in the number of items for each method versus the proportion-correct method, computed with the Spearman-Brown prophecy formula. It can be seen from Table 2 that the use of the Wainer and Shin methods can lead to an effective gain of up to 1.63 times the number of items in a subscore, and the Bock method up to 1.89 times the number of items in an objective. For example, if there are 6 items in an objective, the Wainer and Shin methods can reach the reliability that the proportion-correct method would reach with 9.78 (6 × 1.63) items in that objective. In other words, to achieve the score reliability of the Wainer/Shin methods with the proportion-correct method, the number of items per objective would have to be increased by a factor of 1.63.
Table 2. The Effective Gain in Number of Items for Each Method

Method                      Wainer/Shin   Yen    Bock   Proportion-correct
Original reliability (min)  0.68          0.65   0.58   0.59
Original reliability (max)  0.83          0.80   0.85   0.75
Effective gain (min)        1.48          1.29   0.96   1.00
Effective gain (max)        1.63          1.33   1.89   1.00
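The gain factors in Table 2 follow from solving the Spearman-Brown prophecy formula for the lengthening factor needed to raise the proportion-correct reliability r0 to a competing method's reliability r1. A sketch:

```python
def effective_gain(r0, r1):
    """Spearman-Brown lengthening factor: how many times more items the
    proportion-correct score needs to match reliability r1 from r0."""
    return r1 * (1 - r0) / (r0 * (1 - r1))

# Reproducing the Wainer/Shin maximum in Table 2:
print(round(effective_gain(0.75, 0.83), 2))  # -> 1.63
```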
On the percentage scale, the widths of the 95CI are approximately 27%, 35%, 50%, and 55% for the Yen method, the Wainer and Shin methods, the Bock method, and the proportion-correct method, respectively. That means that if an objective score is reported based on the Yen method, 95% of the time the true objective score will fall in a range whose width equals 27%; similarly, 95% of the time a score based on the Wainer and Shin methods will fall in a range whose width equals 35%, and so on. Basically, the narrower the width, the more precise the estimate. However, this must be evaluated together with the percent coverage of the nominal CI (confidence or credibility interval). Generally, a good estimator should have a narrow 95CI and accurate percent coverage of the nominal CI. Combining the results on 95CI width and percent coverage, the Yen method provided a narrow 95CI and relatively accurate percent coverage of the 95CI. The other methods are all too conservative, because their 95CIs covered the true value more than 95% of the time. Therefore, an adjusted 95CI should be developed and studied in order to obtain a more precise 95CI for these methods.
In general, using estimation methods other than the proportion-correct method improved the estimation of objective scores in terms of SD and RMSE. The proportion-correct method had the smallest Bias but the largest SD; because the magnitude of the Bias is much smaller than that of the SD, the RMSE of the proportion-correct method was the largest. Therefore, the other methods are expected to estimate the objective scores better than the proportion-correct method.
Although the Bock method had the smallest SD, it also had the largest Bias. The Wainer and Shin methods had SDs similar to the Bock method but smaller Bias; therefore, they had slightly smaller RMSE than the Bock method.
Factors that affected the RMSE, SD, and Bias of objective scores included the number of items per objective and the correlation between objectives. The number of examinees and the ratio of CR/MC items did not affect the RMSE, SD, and Bias of objective scores.
Table 3 shows the maximum and minimum RMSE values for each method. Using methods other than the proportion-correct method reduces the RMSE by as much as 0.055, which is 33 percent of the proportion-correct RMSE.

Table 3. The Maximum and Minimum RMSE Values for Each Method

Method   Wainer/Shin   Yen    Bock    Proportion-correct
Max      0.11          0.12   0.11    0.165
Min      0.08          0.09   0.075   0.13
From the results of this study, it can be seen that the Wainer and Shin methods yield the objective scores with the highest reliability. Theoretically, these two methods may therefore be considered the better methods, and it is worth comparing them in more detail. The two methods perform similarly in reliability, 95CI width, and percent coverage of the 95CI. They differ in three ways: (1) the capability to compute the conditional standard error of measurement (CSEM), (2) the capability to improve the estimation as more prior information becomes available, and (3) the complexity of computation. From Equations (19), (27), and (28), it can be seen that computing the CSEM is possible for the Shin method but not for the Wainer method; for the Wainer method, only an overall SEM for the whole group can be obtained.
As for the capability to improve the estimation as more prior information becomes available, the results show that the number of examinees (within the scope of the present study) does not affect the estimation results. That means that additional student data may not improve the estimation. However, if a state testing program runs for more than one year, data from the previous year can be used as the prior to reduce the size of the credibility intervals when estimating objective scores. The Bayesian approach, which is the basis of the Shin method, lends itself naturally to sequential algorithms: the posterior after the ith data point becomes the prior for the (i+1)th data point, and so on.
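As a schematic of that chaining, consider a normal-normal conjugate update in which each year's posterior for an objective-score mean becomes the next year's prior; this is purely illustrative of the sequential idea, not the operational WinBUGS model:

```python
def sequential_update(prior_mean, prior_var, year_mean, year_var):
    """One normal-normal conjugate step; feed the returned posterior
    back in as next year's prior to chain across years."""
    w = prior_var / (prior_var + year_var)          # shrinkage weight
    post_mean = prior_mean + w * (year_mean - prior_mean)
    post_var = prior_var * year_var / (prior_var + year_var)
    return post_mean, post_var
```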
For example, Figure 25 shows the effect of the sequential algorithm over 5 years (i.e., using the data of year 1 as the prior for year 2, year 2 for year 3, etc.). In Figure 25, the line labeled rmse-0 represents the root-mean-square error (RMSE) for the test with 0% CR items, and the line labeled rmse-5 represents the RMSE for the test with a CR/MC ratio of 50%. The x-axis represents the year and the y-axis the value of the RMSE on the proportion-correct scale; that is, the value 0.12 means 12%. It can be seen from Figure 25 that as the prior changed from year 1 to later years, the RMSE based on the Shin method decreased until year four, after which it remained the same. This result implies that if more information can contribute to the specification of the prior distribution, the Shin method may perform better than the Wainer method.
[Figure 25. The root-mean-square error of objective scores. Line plot: x-axis, year (0 to 6); y-axis, RMSE (0.05 to 0.12); series: rmse-0, rmse-5.]
The price for the Shin method is its computational complexity. Although the model used in this method is relatively simple, MCMC sampling requires more computation time than the other methods. For a test with 20,000 students and 7 objectives, it would take approximately 18 minutes to compute the objective scores with the Shin method. One way to overcome this may be to compute the $\Sigma_{21}$, $\Sigma_{31}$, and $\Sigma_{11}$ of Equations (34), (35), and (32), respectively, from past years' data and then estimate the weights in Equations (30) and (31) for the computation of the objective score. In this way, the estimation time for the objective score would be greatly reduced. Further research on the accuracy of this approach needs to be conducted.
Conclusion, Limitations, and Future Study
From the results of this study, using collateral test information to estimate objective scores did increase the reliability and reduce the width of the 95CI compared to the proportion-correct method. Among the methods studied, the Wainer and Shin methods yielded the objective scores with the highest reliability. The Yen method had the narrowest 95CI and relatively accurate percent coverage of the 95CI. The factors that affected reliability were the number of items per objective, the correlation between objectives, and the CR/MC ratio: as the values of these factors increased, the reliability of the objective scores also increased. Only the number of items per objective affected the width of the 95CI and the percent coverage of the 95CI; the more items per objective, the narrower the 95CI. Using collateral test information to estimate objective scores also reduced the RMSE compared to the proportion-correct method. Among the methods studied, the Wainer and Shin methods yielded the objective scores with the lowest RMSE. The Bock method performed best when the correlation between objectives was equal to one. The factors that affected the Bias, SD, and RMSE were the number of items per objective and the correlation between objectives: the more items per objective or the higher the correlation between objectives, the smaller the Bias, SD, and RMSE.

In the present study, only simulated data were used to compare the performance of the different methods; therefore, the results will generalize best to situations similar to those studied in this paper. Since empirical settings may present various conditions, such as skewed student ability distributions or extreme item parameters, more studies using empirical data are needed. In the simulation process, only fixed correlations between objectives and fixed numbers of categories were used; in practice, these variables should be studied under more mixed and complex conditions, so a more complex simulation design may be beneficial for future study. Since some state testing programs use different IRT models, such as the Rasch model for MC items and the Partial Credit model for polytomous items, simulations under those models may be needed for those programs. The MCMC method is still time-consuming compared to the other methods, partly because the available software, WinBUGS, is not optimized for the model used in this study; if the MCMC method is used operationally, a more efficient programming language could be used. From the results of this study, the 95CIs for the Wainer and Shin methods are too conservative, so a future study to adjust their 95CIs would be beneficial. Under the policy of NCLB, the growth of students' overall achievement is a prominent topic in educational research, but little attention has been paid to the growth of students' achievement at the subscale level. Therefore, once a reliable estimate of the objective score is available, the next step should consider how to link the objective scores from year to year, or even to create a vertical scale at the subscale level for measuring growth. Research on objective scores is important, and there are still many unanswered questions to be studied.
References

Bock, R. D., Thissen, D., & Zimowski, M. F. (1997). IRT estimation of domain scores. Journal of Educational Measurement, 34, 197-211.

Gessaroli, M. E. (2004). Using hierarchical multidimensional item response theory to estimate augmented subscores. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Kahraman, N., & Kamata, A. (2004). Increasing the precision of subscale scores by using out-of-scale information. Applied Psychological Measurement, 28(6), 407-426.

Kelley, T. L. (1927). The interpretation of educational measurements. New York: World Book.

Kolen, M. J., & Brennan, R. L. (2004). Test equating: Methods and practices (2nd ed.). New York: Springer-Verlag.

Patz, R. J., & Junker, B. W. (1999). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342-366.

Pommerich, M., Nicewander, W. A., & Hanson, B. (1999). Estimating average domain scores. Journal of Educational Measurement, 36, 199-216.

Shin, C. D., Ansley, T., Tsai, T., & Mao, X. (2005). A comparison of methods of estimating objective scores. Paper presented at the annual meeting of the National Council on Measurement in Education, Montreal, Quebec, Canada.

Tate, R. L. (2004). Implications of multidimensionality for total score and subscore performance. Applied Measurement in Education, 17(2), 89-112.

Wainer, H., Vevea, J. L., Camacho, F., Reeve, B. B., III, Rosa, K., Nelson, L., Swygert, K. A., & Thissen, D. (2000). Augmented scores: "Borrowing strength" to compute scores based on small numbers of items. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 343-387). Hillsdale, NJ: Lawrence Erlbaum Associates.

Yen, W. M. (1987). A Bayesian/IRT index of objective performance. Paper presented at the annual meeting of the Psychometric Society, Montreal, Quebec, Canada.

Yen, W. M., Sykes, R. C., Ito, K., & Julian, M. (1997). A Bayesian/IRT index of objective performance for tests with mixed item types. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.