Article

Applied Psychological Measurement
36(7) 602–624
© The Author(s) 2012
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0146621612451522
http://apm.sagepub.com

An Evaluation of Item Response Theory Classification Accuracy and Consistency Indices

Adam E. Wyse1 and Shiqi Hao2

1 Michigan Department of Education, Arden Hills, MN, USA
2 Michigan Department of Education, Lansing, USA

Corresponding Author:
Adam E. Wyse, Michigan Department of Education, Bureau of Assessment and Accountability, 1813 Chatham Ave., Arden Hills, MN 55112, USA
Email: [email protected]

    Abstract

This article introduces two new classification consistency indices that can be used when item response theory (IRT) models have been applied. The new indices are shown to be related to Rudner's classification accuracy index and Guo's classification accuracy index. The Rudner- and Guo-based classification accuracy and consistency indices are evaluated and compared with estimates from the more commonly applied IRT-recursive procedure using a simulation study and data from two large-scale assessments. Results from the simulation study and practical examples suggested that the Guo- and Rudner-based indices tended to produce estimates that were closer to the simulated values and exceeded those from the IRT-recursive-based procedure. However, results did suggest that the Rudner- and Guo-based indices can have some undesirable features that are important to keep in mind when applying them in practice. The values of the classification accuracy and consistency indices appeared to be affected by a number of factors, including the distribution of examinees, test length, the placement of the cut-scores, and the proficiency estimators applied to estimate examinee ability. It is suggested that an important part of investigations evaluating classification accuracy and consistency indices should be the creation of figures that show the value of the classification accuracy and classification consistency for individual examinees across the range of possible scores, as these figures can help reveal subtle and important differences between indices.

    Keywords

classification consistency, classification accuracy, item response theory, cut-scores, θ metric, number-correct scores

One common use of test scores after they are computed is to compare the test scores with cut-scores to determine the level of performance that an examinee achieved on the assessment. Based on the scores that students receive, students are classified into different levels on the assessment, and decisions are made based on those classifications. A critical measurement concern when using cut-scores to make decisions is the classification accuracy and the classification consistency expected on the assessment. Classification consistency is the degree to which examinees would be classified into the same performance categories over parallel replications of the same assessment (Lee, 2010). Classification accuracy is the degree to which observed classifications would agree with true classifications assuming known cut-scores on a single assessment (Lee, 2010).

There are numerous procedures for computing classification accuracy and classification consistency. Procedures for computing classification accuracy and consistency have been discussed in Huynh (1976); Subkoviak (1976); Hanson and Brennan (1990); Livingston and Lewis (1995); Schulz, Kolen, and Nicewander (1999); Wang, Kolen, and Harris (2000); Rudner (2001, 2005); Lee, Hanson, and Brennan (2002); Brennan and Wan (2004); Guo (2006); Martineau (2007); Lee, Brennan, and Wan (2009); and Lee (2010). Lee (2010) provided an excellent summary of many of the approaches and gave empirical examples of how several of the procedures work for computing classification accuracy and classification consistency. Most of the procedures assume that scores are reported in the number-correct score metric and differ primarily in the models used in calculating the indices (e.g., beta-binomial model, application of the item response theory [IRT] recursive formula) and in whether an examinee distribution or each examinee score is considered in computing the classification accuracy and consistency indices. Lee refers to methods that use examinee distributions as distribution methods and methods that use examinee scores as person methods.

Rudner's (2001, 2005) classification accuracy index and Guo's (2006) classification accuracy index are somewhat different from the approaches discussed in Lee (2010) in that they can be applied to data that are scored in the IRT θ metric or a linear transformation of this metric. This characteristic distinguishes these two indices from the other methods. No approach for computing classification consistency currently exists when data are reported in the IRT θ metric or a linear transformation of this metric. Given that no such index has been formulated and research has not been conducted to compare Rudner's or Guo's approaches with the other more commonly used indices, such as the IRT-recursive procedure, which assumes that the reporting metric is number-correct scores, an important question is to what extent indices based on Rudner's or Guo's formulations differ from other more commonly used approaches.

The purposes of this article are to introduce two new IRT-based classification consistency indices, one that is an extension of Rudner's classification accuracy index and one that is an extension of Guo's classification accuracy index, and to explore how classification accuracy and consistency indices based on Rudner's and Guo's formulations and the IRT-recursive procedure perform in simulation and practice. In the next section of this article, Rudner's classification accuracy index is reviewed, and the new index for computing classification consistency that is an extension of Rudner's index is introduced. This is followed by a discussion of Guo's classification accuracy index and the introduction of a new classification consistency index that is an extension of Guo's formulation. These indices are then contrasted with the more commonly used approach for computing classification accuracy and consistency with IRT that uses the IRT-recursive formula discussed in Schulz et al. (1999), Wang et al. (2000), Lee et al. (2002), and Lee (2010). A simulation study is then provided to evaluate the performance of the different indices with various proficiency estimators, two different ability distributions, two different test lengths, and three different sets of cut-scores. Practical examples from two large-scale assessments then show the values of the indices with various proficiency estimators in practical situations. The article concludes with discussion and some areas for future research.


    Classification Accuracy and Consistency Indices

    Rudner-Based Indices

Rudner's (2001, 2005) classification accuracy index uses three data vectors to compute classification accuracy (Martineau, 2007). The first vector is a vector of C + 1 cut-scores:

$$\boldsymbol{\kappa} = [\kappa_1 \ \kappa_2 \ \cdots \ \kappa_{C+1}], \quad \text{where } \kappa_1 < \kappa_2 < \cdots < \kappa_{C+1} \text{ and } \kappa_1 = -\infty,\ \kappa_{C+1} = \infty. \quad (1)$$

This vector of cut-scores contains the operational cut-scores on the assessment, and the lower and upper bounds for all categories. For example, if there are three operational cut-scores, the vector in Equation 1 would contain the three operational cut-scores and positive and negative infinities. The second vector is the vector of estimated examinee scores, which can be represented as

$$\hat{\boldsymbol{\theta}} = [\hat{\theta}_1 \ \hat{\theta}_2 \ \cdots \ \hat{\theta}_{N_e}]', \quad (2)$$

where $N_e$ is the number of examinees and $\hat{\theta}_i$ is the IRT ability estimate for examinee i. The vector in Equation 2 contains each examinee's ability estimate and suggests that Rudner's index is a person method. The third vector is a vector of standard error estimates, which can be written as

$$\mathbf{s}_{\hat{\theta}} = [s_{\hat{\theta}_1} \ s_{\hat{\theta}_2} \ \cdots \ s_{\hat{\theta}_{N_e}}]', \quad (3)$$

where $N_e$ is the number of examinees and $s_{\hat{\theta}_i}$ is an IRT standard error estimate for examinee i. The standard errors in Equation 3 can be computed from an individual's IRT test information function. In this case, the estimate of the standard error for an examinee is

$$s_{\hat{\theta}_i} = \frac{1}{\sqrt{I(\hat{\theta}_i)}}, \quad (4)$$

where $I(\hat{\theta}_i)$ is the value of the test information function for examinee i.

One then finds the area between each successive pair of cut points assuming conditional normality of the standard error estimate around each examinee's ability estimate. The normality assumption comes from asymptotic theory and IRT assumptions when using maximum likelihood (ML) estimation, which imply that as the number of items and examinees becomes large, one should expect an examinee's ML estimate to converge to a normal distribution with a mean of θ and a standard deviation equal to the reciprocal of the square root of the individual's test information function. The expected probability of scoring in each performance-level category C based on these assumptions can be written as

$$\hat{p}_{iC} = \phi(\kappa_{C_i}, \kappa_{C_i + 1}, \hat{\theta}_i, s_{\hat{\theta}_i}), \quad (5)$$

where $\phi(a, b, \mu, \sigma)$ is the area under a normal curve from a to b with a mean of $\mu$ and a standard deviation of $\sigma$, and the other terms have the same meanings as before. It is important to point out that although the assumptions underlying Equation 5 come from asymptotic theory and ML estimation, Equation 5 and the normal distribution assumption can be employed with any proficiency estimator.

One can then define an $N_e \times C$ matrix of expected probabilities, $\hat{\mathbf{P}}$, that contains the expected probabilities of each examinee falling into each performance-level category C. The expected probability that corresponds to the performance-level category that the examinee is classified into is assumed to be the expected probability of correct classification, and the other probabilities are assumed to be the expected misclassification probabilities. Define an $N_e \times C$ matrix of weights, $\mathbf{W}$, which is used to flag the performance-level category that the examinee obtained on the assessment, and write the matrix as

$$\mathbf{W} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1C} \\ w_{21} & w_{22} & \cdots & w_{2C} \\ \vdots & \vdots & \ddots & \vdots \\ w_{N_e 1} & w_{N_e 2} & \cdots & w_{N_e C} \end{bmatrix}, \quad (6)$$

where the weight $w_{iC}$ equals 1 if examinee i's score is classified into performance-level category C, and 0 otherwise.

Rudner's expected classification accuracy index can be found by performing element-by-element multiplication of $\hat{\mathbf{P}}$ with $\mathbf{W}$, taking the sum of all the elements in the resultant matrix, and dividing by the number of examinees, $N_e$. Mathematically, the index can be written as

$$\hat{\tau} = \frac{\sum \hat{\mathbf{P}} * \mathbf{W}}{N_e}, \quad (7)$$

where * denotes element-by-element matrix multiplication. As classification accuracy can be found based on the administration of a single assessment, Equation 7 only contains the matrices $\hat{\mathbf{P}}$ and $\mathbf{W}$ and does not involve the product of $\hat{\mathbf{P}}$ with itself.

By comparison, classification consistency provides a measure of the proportion of examinees who would be classified into the same category on parallel replications of the same assessment. This involves taking the product of $\hat{\mathbf{P}}$ with itself and does not involve a matrix to flag the observed performance level of the examinee. The new classification consistency index can therefore be expressed as

$$\hat{\gamma} = \frac{\sum \hat{\mathbf{P}} * \hat{\mathbf{P}}}{N_e}, \quad (8)$$

where * again denotes element-by-element multiplication and $N_e$ again is the number of examinees. The index in Equation 8 almost seems trivial, as it should always be less than or equal to the expected classification accuracy given that Equation 8 involves squaring the elements of the $\hat{\mathbf{P}}$ matrix. Nonetheless, understanding the relationships between the indices and having a classification consistency index that can be computed when data are scored in the θ metric is practically useful, given that such an index has not been formulated to this point and that it is common practice to report classification accuracy and consistency following the administration of an assessment.
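As a concrete illustration of Equations 1 through 8, a minimal R sketch of the Rudner-based computations follows. This is an illustrative sketch rather than the authors' code, and the input names theta (ability estimates), se (standard errors from Equation 4), and cuts (operational cut-scores on the θ metric) are hypothetical.

```r
# Rudner-based classification accuracy and consistency (Equations 1-8).
rudner_indices <- function(theta, se, cuts) {
  k  <- c(-Inf, cuts, Inf)                 # Equation 1: add the outer bounds
  Ne <- length(theta)
  C  <- length(k) - 1                      # number of performance categories
  # P[i, c]: normal area between successive cut points (Equations 4 and 5)
  P <- sapply(1:C, function(cc) pnorm(k[cc + 1], theta, se) - pnorm(k[cc], theta, se))
  # W[i, c]: flags the observed category of each estimate (Equation 6); an
  # estimate exactly at a cut falls in the lower category here, which is one
  # possible convention
  obs <- cut(theta, breaks = k, labels = FALSE)
  W <- matrix(0, Ne, C)
  W[cbind(1:Ne, obs)] <- 1
  c(accuracy    = sum(P * W) / Ne,         # Equation 7
    consistency = sum(P * P) / Ne)         # Equation 8
}

# Example call with three operational cut-scores:
# rudner_indices(theta = c(-0.4, 1.2), se = c(0.25, 0.30), cuts = c(-0.75, 0, 0.75))
```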

Guo's Indices

Guo's (2006) classification accuracy index was originally designed as an extension of Rudner's index in the context of ML estimation, and it can be loosely viewed as a person-based index. The index makes no assumption of normality of an examinee's standard error estimate around their ability estimate, and it calculates expected classification probabilities and the $\hat{\mathbf{P}}$ matrix based on individual examinee likelihood functions from IRT models. The avoidance of the normality assumption is an advantage of the method, as the normality assumption only holds


asymptotically and is never completely satisfied in practice. For dichotomous items, the likelihood function can be written as

$$L(u_{1i}, u_{2i}, \ldots, u_{ni} \mid \theta) = \prod_{j=1}^{n} P_{ij}^{u_{ij}} Q_{ij}^{1 - u_{ij}}, \quad (9)$$

where i is the examinee, j is the item on the test, $u_{ij}$ is the response to item j by examinee i, with 1 signaling a correct response and 0 signaling an incorrect response, $P_{ij}$ is the probability of a correct response to item j given θ, and $Q_{ij}$ is the probability of an incorrect response to item j given θ, which is computed as $1 - P_{ij}$. Similar likelihood functions can be written out for polytomous items and mixed-format tests.

The expected probability of scoring in any particular category can be found using the likelihood functions as

$$\hat{p}_{ic} = \frac{\sum_{\theta = \kappa_{c_i}}^{\kappa_{c_i + 1}} L(u_{1i}, u_{2i}, \ldots, u_{ni} \mid \theta)}{\sum_{h = 1}^{C + 1} \sum_{\theta = \kappa_h}^{\kappa_{h + 1}} L(u_{1i}, u_{2i}, \ldots, u_{ni} \mid \theta)}, \quad (10)$$

where $\sum_{\theta = \kappa_{c_i}}^{\kappa_{c_i + 1}} L(u_{1i}, u_{2i}, \ldots, u_{ni} \mid \theta)$ is the sum of likelihood function values from performance category C to the next higher performance category C + 1 for a set of equally spaced θ points between the cut-scores (e.g., 100 equally spaced points), and the denominator is the sum of the likelihood function values across all performance categories. The fact that Guo's method uses sets of equally spaced θ points between cut-scores suggests that Equation 2 should not be a vector, but rather an $N_e \times ((C + 1) \times NP)$ matrix, where C is the number of performance categories and NP is the number of equally spaced points between cut-scores. This suggests that Guo's method is not a person method in the traditional sense of how person methods are conceptualized.

It is also important to note that to be able to compute the expected probabilities for the highest and lowest categories, the highest and lowest cut-scores in Equation 1 need to be set at arbitrary high and low θ values, such as θ = 6 and -6. That is, the vector of cut-scores should be expressed as

$$\boldsymbol{\kappa} = [\kappa_1 \ \kappa_2 \ \cdots \ \kappa_{C+1}], \quad \text{where } \kappa_1 < \kappa_2 < \cdots < \kappa_{C+1} \text{ and } \kappa_1 = -6,\ \kappa_{C+1} = 6. \quad (11)$$

The use of arbitrary high and low values for the extreme cut-scores instead of positive and negative infinities is also a small difference between the Guo and Rudner approaches.

The computations from Equation 10 can then be put into the $\hat{\mathbf{P}}$ matrix similar to the Rudner-based indices, and the weight matrix $\mathbf{W}$ can be formulated based on the examinee ability estimates and comparing those estimates with the cut-scores. The classification accuracy index and the new classification consistency index based on Guo's formulation can then be determined based on Equations 7 and 8 applied to these $\hat{\mathbf{P}}$ and $\mathbf{W}$ matrices. The relationships between the indices again are fairly clear, and it can be observed that classification consistency should be less than or equal to classification accuracy.
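For illustration, the Guo-based expected probabilities in Equations 9 through 11 might be computed as in the following R sketch. The 3PL item model with a 1.7 scaling constant is an assumption made here for concreteness (the formulation only requires a likelihood function), and U (an Ne × n matrix of 0/1 responses), a, b, and cvec (item parameter vectors) are hypothetical names.

```r
# Guo-based P-hat matrix for dichotomous items (Equations 9-11).
guo_P_matrix <- function(U, a, b, cvec, cuts, bound = 6, np = 100) {
  k <- c(-bound, cuts, bound)       # Equation 11: arbitrary high/low end cuts
  C <- length(k) - 1
  grids <- lapply(1:C, function(cc) seq(k[cc], k[cc + 1], length.out = np))
  # Likelihood of a response pattern u at ability th (Equation 9)
  lik <- function(u, th) {
    p <- cvec + (1 - cvec) / (1 + exp(-1.7 * a * (th - b)))
    prod(p^u * (1 - p)^(1 - u))
  }
  # Each row: likelihood mass within each category, normalized (Equation 10)
  t(apply(U, 1, function(u) {
    seg <- sapply(grids, function(g) sum(sapply(g, function(th) lik(u, th))))
    seg / sum(seg)
  }))
}
```

Because each row depends only on the response pattern and the θ grids, recomputing this matrix under a different proficiency estimator changes nothing, which anticipates a property discussed further below.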

IRT-Recursive-Based Indices

Similar to the Rudner-based indices, one again starts with a vector of cut-scores when computing the IRT-recursive-based classification indices (Lee, 2010; Lee et al., 2002; Schulz et al., 1999; Wang et al., 2000). However, the vector of cut-scores is expressed in the number-correct score metric instead of the θ metric. Denote this vector of cut-scores as

$$\boldsymbol{\kappa} = [\kappa_1 \ \kappa_2 \ \cdots \ \kappa_{C+1}], \quad \text{where } \kappa_1 < \kappa_2 < \cdots < \kappa_{C+1} \text{ and } \kappa_1 = 0,\ \kappa_{C+1} = m. \quad (12)$$

In Equation 12, m represents the maximum possible score on the assessment, and each cut-score is assumed to be determined by translating the θ cut-score to the number-correct score metric. It is important to recognize that in translating these cut-scores to the number-correct score metric, rounding is needed, as the cut-scores in the θ metric may not align perfectly with a particular number-correct score.

As indices based on the IRT-recursive formula can also be computed as a person method, the computation of the indices also includes a vector of ability estimates, which is identical to Equation 2. It is also possible to compute the IRT-recursive-based indices using the quadrature points from a run of an IRT software program as a distribution method. However, using the quadrature points is designed to approximate the full vector of ability estimates, and hence it makes sense to write the indices using the full vector of ability estimates. One uses these ability estimates to create a distribution of the probabilities of receiving each number-correct score using the IRT-recursive formula (Thissen, Pommerich, Billeaud, & Williams, 1995).

To write the IRT-recursive formula for dichotomous items, define $f_n(x \mid \theta_i)$ as the conditional distribution of number-correct scores over the first n items for an examinee with ability $\theta_i$, and $\hat{P}_{ij}$ as the probability of a correct response to item j by examinee i. Define $f_1(x = 0 \mid \theta_i) = 1 - \hat{P}_{i1}$ as the probability of earning a score of zero for examinee i on the first item. For $n > 1$, the recursion formula can be written as follows:

$$f_n(x \mid \theta_i) = \begin{cases} f_{n-1}(x \mid \theta_i)\,(1 - \hat{P}_{in}) & x = 0 \\ f_{n-1}(x \mid \theta_i)\,(1 - \hat{P}_{in}) + f_{n-1}(x - 1 \mid \theta_i)\,\hat{P}_{in} & 0 < x < n \\ f_{n-1}(x - 1 \mid \theta_i)\,\hat{P}_{in} & x = n. \end{cases} \quad (13)$$

This formula can also be extended to polytomous items and mixed-format tests, as is described in Thissen et al. (1995).

Then, the probability of scoring in each performance category can be represented as

$$\hat{p}_{ic} = \sum_{x = \kappa_C}^{\kappa_{C+1}} f_n(X = x \mid \theta). \quad (14)$$

Following similar logic as is used with the Rudner-based indices, one forms the matrices $\hat{\mathbf{P}}$ and $\mathbf{W}$ and computes classification accuracy and consistency using the formulations in Equations 7 and 8. Again, the relationships between the indices are fairly clear, and it can be observed that classification consistency should be less than or equal to classification accuracy.
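A minimal R sketch of the recursion in Equation 13, together with the collapse to category probabilities in Equation 14, might look as follows for dichotomous items. The handling of a score that lands exactly on a cut (counted into the upper category) is an assumed convention here, as the article notes only that rounding is involved.

```r
# Recursion in Equation 13; p holds P-hat_ij for one examinee across n items.
recursive_dist <- function(p) {
  f <- c(1 - p[1], p[1])                       # scores 0..1 after the first item
  for (j in seq_along(p)[-1]) {
    f <- c(f * (1 - p[j]), 0) + c(0, f * p[j]) # scores 0..j after item j
  }
  f                                            # f[x + 1] = Pr(number-correct = x)
}

# Equation 14: sum the score distribution within each pair of number-correct
# cut-scores; nc_cuts are the rounded cuts, and scores at or above a cut are
# placed in the upper category.
category_probs <- function(f, nc_cuts) {
  m <- length(f) - 1                           # maximum possible score
  k <- c(0, nc_cuts, m + 1)                    # half-open intervals [k_c, k_{c+1})
  sapply(1:(length(k) - 1), function(cc) sum(f[(k[cc] + 1):k[cc + 1]]))
}
```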

    Similarities and Differences Between Indices

It is important to highlight some of the key similarities and differences between the indices. First, it is apparent that there are different distributional assumptions with each approach. For the Guo-based indices, a single distribution is not assumed, and the expected probabilities underlying the indices are driven by the likelihood functions. These likelihood functions are typically not symmetrical and can change depending on the response pattern of the examinee. For the IRT-recursive indices, the distribution of number-correct scores is assumed to follow a compound binomial distribution if all the items are dichotomous or a compound multinomial distribution if the test contains some polytomous items. For the Rudner-based indices, a normal distribution for the examinee ability estimates is assumed when calculating the indices. The different assumptions and formulations may give rise to disparate classification accuracy and classification consistency estimates. However, all the formulations are based on properties of IRT models and typical IRT assumptions. This includes the assumption that the examinee ability estimates and item parameters used are good estimates of the underlying parameters.

All three sets of indices can be classified as person-based methods. However, the Guo-based indices are notably different from the Rudner-based or IRT-recursive-based indices. This can be seen in how the $\hat{\mathbf{P}}$ matrix is determined. In the Guo-based indices, each examinee's ability estimate does not enter into the computations in Equation 10 or the $\hat{\mathbf{P}}$ matrix. Rather, it is the response pattern of the examinee and the equally spaced θ points that drive the computations of the likelihood function values and the $\hat{\mathbf{P}}$ matrix. This implies that the choice of proficiency estimator will not affect the $\hat{\mathbf{P}}$ matrix, as the response pattern is unchanged across proficiency estimators; only the weight matrix $\mathbf{W}$ flagging the observed classifications of the examinees can change with different proficiency estimators. This is in contrast to the IRT-recursive-based indices and the Rudner-based indices, in which both the $\hat{\mathbf{P}}$ and $\mathbf{W}$ matrices can change. In the Rudner-based indices, a likelihood function is not applied, and the computation of the $\hat{\mathbf{P}}$ matrix is based on individual examinee test information functions that change in value with different proficiency estimators. Similarly, for the IRT-recursive indices, the likelihood functions are not used, and each examinee's ability estimate is input into the IRT-recursive formula to determine the probability of receiving each number-correct score given their estimated ability. The important implication of the fact that the $\hat{\mathbf{P}}$ matrix does not change for Guo's index is that Guo's classification accuracy index can be potentially different for various proficiency estimators, but the classification consistency index will be identical across proficiency estimators because the $\hat{\mathbf{P}}$ matrix is not changed. This is a potential drawback to the Guo-based classification consistency index, as one would expect that the choice of proficiency estimator would affect classification consistency.

Another difference is that the Rudner- and Guo-based indices perform computations assuming the reporting metric is the θ metric or a linear transformation of this metric, whereas the IRT-recursive-based indices assume that the reporting metric is number-correct scores or a transformation of number-correct scores. This can lead to some small differences in the potential cut-scores, as some form of rounding is often needed to translate the cut-scores from the θ metric to the number-correct score metric.

    Simulation Study

Given that the classification consistency indices based on Rudner's and Guo's formulations are new, an important question is whether the Rudner- and Guo-based indices perform better than the IRT-recursive procedure, which is more commonly used to compute classification accuracy and consistency with IRT models. In addition, as the assumptions used with the Guo- and Rudner-based indices are closely tied to ML estimation and each of the three indices makes different distributional assumptions, another key question is how using different proficiency estimators affects classification accuracy and consistency indices. Different indices may perform better in different conditions.

To investigate these questions, a simulation study was performed in which several different factors were manipulated. Data were simulated for two different ability distributions, two different test lengths, three different sets of cut-scores, and four different proficiency estimators. In the simulation, three cut-scores and four performance categories were assumed in each condition. The number of cut-scores and the number of performance categories were fixed because the effects of the number of cut-scores and performance categories are well known. In particular, it has been shown that classification accuracy and consistency increase as the number of performance categories decreases (Ercikan & Julian, 2002; Lee et al., 2002). Prior investigations with the indices examined in this study indicate that these patterns hold (Lee, 2010; Lee et al., 2002; Martineau, 2007). The number of examinees was also fixed at 2,000, as preliminary investigations with other sample sizes (e.g., 10,000 and 25,000) produced results that were similar to those for 2,000 examinees. A single fixed test form from which the item parameters were drawn in this simulation was also assumed. This test form consisted of 60 three-parameter logistic (3PL) model items that were drawn from an ACT (American College Testing) mathematics test administered to a sample of more than 100,000 students. The estimated parameters from this sample were assumed to be the true known item parameters in the simulation.

    Examinee Distributions

Two different examinee distributions were investigated in this study. The first group of examinees was drawn from a normal distribution with a mean of 0 and a standard deviation of 1. The second group of examinees was drawn from a normal distribution with a mean of 0.5 and a standard deviation of 1.25. These two groups of examinees were chosen arbitrarily. The first group was designed to be similar to a typical group of students and to the assumptions used for ability distributions in many software packages when resolving the IRT indeterminacy problem. The second group was designed to represent a group with slightly more ability and greater dispersion. It is expected that the ability distributions would affect the values of the indices and interact with the placement of the cut-scores. When the distribution of examinees is closer to the cut-scores, the values of the indices are expected to decrease, probably in somewhat similar fashion for all three indices.

    Test Length

Two different test lengths were included in the simulation. The first test length included the full set of 60 items from the ACT mathematics test. The second test length was 30 items and consisted of the odd items from the ACT mathematics test. It is expected that as the length of the test is shortened, the classification accuracy and classification consistency indices will decrease, as examinee ability estimates tend to have more error with shorter test lengths.

    Cut-Scores

Three different sets of cut-scores were considered in this study. The first set of cut-scores was θ = -0.75, 0.00, and 0.75. These cut-scores were designed to represent a situation in which the cut-scores were symmetrically distributed around 0 and centered on the mean of the first ability distribution. The second set of cut-scores was θ = -0.75, -0.35, and 0.75. This allowed the impact of nonsymmetrical cut-scores to be investigated. It is expected that these cut-scores would lower the values of the indices for some examinees between -0.75 and -0.35, as the cut-scores are closer together, and raise the values of the indices for some examinees between -0.35 and 0.75, as these cut-scores are farther apart. As the distance between cut-scores increases, examinees located between the cut-scores should see their classification accuracy and classification consistency estimates rise. However, the values of the indices may be somewhat similar to those when the cuts were set at θ = -0.75, 0.00, and 0.75 due to the trade-off between the individual examinee classification accuracy and consistency estimates at different regions of the scale. The final set of cut-scores investigated was θ = -0.827, -0.034, and 0.694 for the 60-item test and θ = -0.745, -0.042, and 0.706 for the 30-item test. These cut-scores were included to capture the condition in which the θ cuts align as closely to a set of number-correct scores as is possible. One might expect that this condition would result in the most similar values for classification accuracy and consistency across the indices, as the effect of rounding is essentially removed.

    Proficiency Estimators

Four different proficiency estimators were considered in this study. The first proficiency estimator was the IRT true-score (TS) estimator. This estimator was found by estimating the item parameters in BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 1996) and then applying the Newton-Raphson procedure to each number-correct score to determine each examinee's θ estimate. This estimator is important to study, as the IRT-recursive-based procedure assumes that the reporting metric is number-correct scores or a transformation of number-correct scores. One would expect that the IRT-recursive-based indices would perform better with this estimator than with other estimators because the estimator and the philosophy of the index best align in this case.

The other three estimators were the estimators available in BILOG-MG. These are the ML estimator, the expected a posteriori (EAP) estimator, and the maximum a posteriori (MAP) estimator. The EAP and MAP estimators are Bayesian estimators, which tend to be pulled toward the mean of the prior ability distribution in comparison with the ML estimator. Each estimator was computed using the default settings of BILOG-MG, except that the IDIST = 3 option was used with the EAP estimator in the SCORE command, the number of quadrature points was increased to 40, the number of expectation-maximization (EM) cycles was increased to 200, and the number of NEWTON cycles was increased to 100. Kolen and Tong (2010) demonstrated that various proficiency estimators can perform differently for classifying students into performance categories in practical contexts, and it is expected that these findings would translate to the computation of classification accuracy and consistency indices. One might expect that the ML estimator would perform the best for the Rudner- and Guo-based indices, as they have theoretical underpinnings related to ML estimation.

    Simulating and Estimating Classification Accuracy and Consistency

To evaluate the performance of each index, classification accuracy and consistency were simulated and estimated using R. To simulate classification accuracy, the simulated ability distributions were assumed to be the true distributions, and the estimated thetas (i.e., the $\hat{\theta}$s) were computed from the item responses generated from the assumed true distributions and were taken as the observed distributions. The cut-scores were then applied to each distribution, and the proportion of classifications that remained the same in the observed and true distributions was taken as the simulated classification accuracy. To find the simulated classification consistency, the same true known ability distributions were assumed, and two separate sets of item responses were simulated for each group of examinees. The values of the estimators for the two sets of item responses were determined, and the cut-scores were applied to the observed-score distributions. The proportion of classifications that remained the same for the two observed-score distributions was taken as the simulated classification consistency.

Estimated classification accuracy and consistency were found by computing the Rudner-based indices, Guo-based indices, and IRT-recursive-based indices in R, applied to the estimated item and person parameters. For the Guo-based indices, 100 equally spaced θ points were used between each pair of cut-scores. The estimated classification accuracy and consistency were contrasted with the simulated classification accuracy and consistency estimates to determine which indices best recovered the simulated values. To provide a baseline condition, the values of the indices were also found using the assumed known item parameters and θs. For the Guo index, a set of item responses was simulated to compute the $\hat{\mathbf{P}}$ matrix. The correct classifications in the $\mathbf{W}$ matrix were computed by identifying the performance category in which the likelihood function was maximized based on the simulated item responses. The baseline conditions are labeled No Est. in the tables in the results section, as no estimation of item or person parameters was used. Each cut-score in the number-correct score metric needed to apply the IRT-recursive procedure was rounded to the nearest number-correct score when computing the indices. This led to cut-scores on the θ scale that were different for the Rudner- and Guo-based indices in comparison with the IRT-recursive-based indices.
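As an illustration of the simulation logic just described, the simulated accuracy and consistency reduce to simple agreement rates between classified score vectors, as in the following R sketch; classify() and the argument names are hypothetical stand-ins for however the true and estimated θs are produced.

```r
# Assign each theta to a performance category given theta-metric cut-scores.
classify <- function(theta, cuts) cut(theta, c(-Inf, cuts, Inf), labels = FALSE)

# Simulated accuracy: agreement between true and estimated classifications.
sim_accuracy <- function(theta_true, theta_hat, cuts)
  mean(classify(theta_true, cuts) == classify(theta_hat, cuts))

# Simulated consistency: agreement between classifications from two
# independently generated sets of item responses for the same true thetas.
sim_consistency <- function(theta_hat1, theta_hat2, cuts)
  mean(classify(theta_hat1, cuts) == classify(theta_hat2, cuts))
```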

    Results of Simulation Study

Tables 1 to 4 show the results from the simulation study. Table 1 displays the results when the ability distribution was assumed to be normal with a mean of 0 and a standard deviation of 1 for the 60-item test. Table 2 displays the results when the ability distribution was assumed to be normal with a mean of 0.5 and a standard deviation of 1.25 for the 60-item test. Table 3 displays the results when the ability distribution was assumed to be normal with a mean of 0 and a standard deviation of 1 for the 30-item test. Table 4 displays the results when the ability distribution was assumed to be normal with a mean of 0.5 and a standard deviation of 1.25 for the 30-item test. In each table, the results for the Rudner-based indices are shown at the top of the table, the results for the Guo-based indices are shown in the middle of the table, and the results for the IRT-recursive-based indices are shown at the bottom of the table. The results for the different cut-scores are shown under the three column headings in each table.

Several important findings can be observed in the tables. Specifically, the Guo-based indices tended to be the largest, followed by the Rudner-based indices and then the IRT-recursive-based indices. For classification accuracy, the estimated values for the Guo-based indices were often the closest to the simulated values. For classification consistency, the Rudner- or Guo-based indices performed best depending on the estimator. In several cases, the differences between the three indices were trivial, with differences in the third decimal place. However, there were some differences that approached 0.04 or 0.05. Differences of 0.05 between the indices might be viewed as somewhat large given that the indices are restricted to a range of 0.00 to 1.00. This suggests that the index chosen to report classification accuracy and consistency can have key impacts on the numbers that are reported.

In addition, it is important to notice that in the case of no estimation of item or person parameters, the value for the ML estimator was closest to the values computed for the Rudner- and Guo-based indices. This suggests that the ML-based estimates of the indices were very close to the value of the indices with no estimation of item and person parameters. This is somewhat expected, as the Rudner- and Guo-based indices are closely tied to assumptions for ML estimation. For the IRT-recursive indices with no estimation, there was not a single proficiency estimator that was closest to the no estimation condition across the tables.

It can also be seen that the indices tended to be closest in value with the TS estimator in comparison with the other proficiency estimators. In addition, the tables suggest that Rudner's classification accuracy and consistency indices tended to be greatest for the EAP and MAP estimators, Guo's classification accuracy indices tended to be greatest for the ML and EAP estimators, and the IRT-recursive classification accuracy and consistency indices tended to be greatest for the TS and ML estimators. The Guo-based classification consistency index did not change across proficiency estimators, as was expected.


Table 1. Simulated and Estimated Classification Accuracy and Consistency for N(0, 1) Ability Distribution for 60 Items

Cuts (θ = -0.75, 0.00, 0.75)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.832      0.829      0.756       0.763
Rudner     ML         0.847      0.836      0.784       0.768
Rudner     MAP        0.856      0.843      0.788       0.776
Rudner     EAP        0.856      0.848      0.793       0.780
Rudner     No Est.               0.838                  0.770
Guo        TS         0.832      0.822      0.756       0.804
Guo        ML         0.847      0.852      0.784       0.804
Guo        MAP        0.856      0.834      0.788       0.804
Guo        EAP        0.856      0.839      0.793       0.804
Guo        No Est.               0.859                  0.800
Recursive  TS         0.832      0.823      0.756       0.763
Recursive  ML         0.847      0.819      0.784       0.758
Recursive  MAP        0.856      0.811      0.788       0.742
Recursive  EAP        0.856      0.819      0.793       0.759
Recursive  No Est.               0.806                  0.748

Cuts (θ = -0.75, -0.35, 0.75)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.823      0.823      0.755       0.757
Rudner     ML         0.845      0.833      0.783       0.768
Rudner     MAP        0.852      0.839      0.789       0.774
Rudner     EAP        0.852      0.846      0.795       0.781
Rudner     No Est.               0.836                  0.771
Guo        TS         0.823      0.819      0.755       0.806
Guo        ML         0.845      0.852      0.783       0.806
Guo        MAP        0.852      0.844      0.789       0.806
Guo        EAP        0.852      0.847      0.795       0.806
Guo        No Est.               0.858                  0.802
Recursive  TS         0.823      0.817      0.755       0.761
Recursive  ML         0.845      0.815      0.783       0.764
Recursive  MAP        0.852      0.805      0.789       0.751
Recursive  EAP        0.852      0.814      0.795       0.760
Recursive  No Est.               0.806                  0.752

Cuts (θ = -0.827, -0.034, 0.694)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.822      0.831      0.743       0.766
Rudner     ML         0.848      0.839      0.771       0.770
Rudner     MAP        0.850      0.844      0.781       0.778
Rudner     EAP        0.851      0.845      0.781       0.779
Rudner     No Est.               0.836                  0.767
Guo        TS         0.822      0.819      0.743       0.798
Guo        ML         0.848      0.850      0.771       0.798
Guo        MAP        0.850      0.849      0.781       0.798
Guo        EAP        0.851      0.853      0.781       0.798
Guo        No Est.               0.857                  0.798
Recursive  TS         0.822      0.826      0.743       0.764
Recursive  ML         0.848      0.818      0.771       0.761
Recursive  MAP        0.850      0.806      0.781       0.753
Recursive  EAP        0.851      0.808      0.781       0.754
Recursive  No Est.               0.817                  0.748

Note: IRT = item response theory; TS = IRT true-score estimator; ML = IRT maximum likelihood estimator; MAP = IRT maximum a posteriori estimator; EAP = IRT expected a posteriori estimator; No Est. = value of the index assuming no estimation of the item or person parameters. Rudner is the calculation of the index based on Rudner's formulation, Guo is the calculation of the index based on Guo's formulation, and Recursive is the calculation of the index based on the IRT-recursive-based formulation.

Table 2. Simulated and Estimated Classification Accuracy and Consistency for N(0.5, 1.25) Ability Distribution for 60 Items

Cuts (θ = -0.75, 0.00, 0.75)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.873      0.861      0.815       0.801
Rudner     ML         0.881      0.871      0.833       0.815
Rudner     MAP        0.889      0.880      0.843       0.826
Rudner     EAP        0.882      0.879      0.841       0.825
Rudner     No Est.               0.872                  0.816
Guo        TS         0.873      0.871      0.815       0.844
Guo        ML         0.881      0.887      0.833       0.844
Guo        MAP        0.889      0.869      0.843       0.844
Guo        EAP        0.882      0.877      0.841       0.844
Guo        No Est.               0.889                  0.843
Recursive  TS         0.873      0.863      0.815       0.815
Recursive  ML         0.881      0.857      0.833       0.814
Recursive  MAP        0.889      0.850      0.843       0.804
Recursive  EAP        0.882      0.857      0.841       0.812
Recursive  No Est.               0.851                  0.806

Cuts (θ = -0.75, -0.35, 0.75)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.879      0.862      0.824       0.806
Rudner     ML         0.891      0.872      0.841       0.820
Rudner     MAP        0.892      0.876      0.839       0.825
Rudner     EAP        0.889      0.879      0.852       0.828
Rudner     No Est.               0.874                  0.822
Guo        TS         0.879      0.869      0.824       0.848
Guo        ML         0.891      0.888      0.841       0.848
Guo        MAP        0.892      0.882      0.839       0.848
Guo        EAP        0.889      0.888      0.852       0.848
Guo        No Est.               0.889                  0.846
Recursive  TS         0.879      0.864      0.824       0.820
Recursive  ML         0.891      0.860      0.841       0.821
Recursive  MAP        0.892      0.853      0.839       0.811
Recursive  EAP        0.889      0.859      0.852       0.818
Recursive  No Est.               0.854                  0.813

Cuts (θ = -0.827, -0.034, 0.694)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.869      0.861      0.812       0.803
Rudner     ML         0.884      0.872      0.825       0.817
Rudner     MAP        0.876      0.878      0.835       0.824
Rudner     EAP        0.882      0.877      0.834       0.822
Rudner     No Est.               0.872                  0.817
Guo        TS         0.869      0.868      0.812       0.844
Guo        ML         0.884      0.887      0.825       0.844
Guo        MAP        0.876      0.877      0.835       0.844
Guo        EAP        0.882      0.882      0.834       0.844
Guo        No Est.               0.889                  0.843
Recursive  TS         0.869      0.860      0.812       0.813
Recursive  ML         0.884      0.861      0.825       0.813
Recursive  MAP        0.876      0.860      0.835       0.810
Recursive  EAP        0.882      0.861      0.834       0.809
Recursive  No Est.               0.861                  0.807

Note: IRT = item response theory; TS = IRT true-score estimator; ML = IRT maximum likelihood estimator; MAP = IRT maximum a posteriori estimator; EAP = IRT expected a posteriori estimator; No Est. = value of the index assuming no estimation of the item or person parameters. Rudner is the calculation of the index based on Rudner's formulation, Guo is the calculation of the index based on Guo's formulation, and Recursive is the calculation of the index based on the IRT-recursive-based formulation.

Table 3. Simulated and Estimated Classification Accuracy and Consistency for N(0, 1) Ability Distribution for 30 Items

Cuts (θ = -0.75, 0.00, 0.75)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.779      0.766      0.680       0.686
Rudner     ML         0.807      0.792      0.715       0.707
Rudner     MAP        0.801      0.798      0.720       0.713
Rudner     EAP        0.807      0.800      0.734       0.714
Rudner     No Est.               0.789                  0.701
Guo        TS         0.779      0.783      0.680       0.756
Guo        ML         0.807      0.816      0.715       0.756
Guo        MAP        0.801      0.805      0.720       0.756
Guo        EAP        0.807      0.811      0.734       0.756
Guo        No Est.               0.822                  0.751
Recursive  TS         0.779      0.768      0.680       0.703
Recursive  ML         0.807      0.762      0.715       0.699
Recursive  MAP        0.801      0.762      0.720       0.763
Recursive  EAP        0.807      0.763      0.734       0.758
Recursive  No Est.               0.744                  0.687

Cuts (θ = -0.75, -0.35, 0.75)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.779      0.772      0.703       0.698
Rudner     ML         0.812      0.795      0.735       0.717
Rudner     MAP        0.817      0.797      0.744       0.716
Rudner     EAP        0.821      0.803      0.759       0.716
Rudner     No Est.               0.792                  0.712
Guo        TS         0.779      0.789      0.703       0.767
Guo        ML         0.812      0.819      0.735       0.767
Guo        MAP        0.817      0.819      0.744       0.767
Guo        EAP        0.821      0.824      0.759       0.767
Guo        No Est.               0.826                  0.761
Recursive  TS         0.779      0.778      0.703       0.725
Recursive  ML         0.812      0.765      0.735       0.719
Recursive  MAP        0.817      0.759      0.744       0.705
Recursive  EAP        0.821      0.767      0.759       0.712
Recursive  No Est.               0.753                  0.694

Cuts (θ = -0.745, -0.042, 0.706)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.774      0.766      0.679       0.683
Rudner     ML         0.802      0.790      0.720       0.704
Rudner     MAP        0.805      0.798      0.718       0.712
Rudner     EAP        0.812      0.800      0.733       0.712
Rudner     No Est.               0.787                  0.699
Guo        TS         0.774      0.785      0.679       0.752
Guo        ML         0.802      0.814      0.720       0.752
Guo        MAP        0.805      0.811      0.718       0.752
Guo        EAP        0.812      0.817      0.733       0.752
Guo        No Est.               0.821                  0.747
Recursive  TS         0.774      0.767      0.679       0.703
Recursive  ML         0.802      0.763      0.720       0.699
Recursive  MAP        0.805      0.761      0.718       0.695
Recursive  EAP        0.812      0.764      0.733       0.697
Recursive  No Est.               0.766                  0.687

Note: IRT = item response theory; TS = IRT true-score estimator; ML = IRT maximum likelihood estimator; MAP = IRT maximum a posteriori estimator; EAP = IRT expected a posteriori estimator; No Est. = value of the index assuming no estimation of the item or person parameters. Rudner is the calculation of the index based on Rudner's formulation, Guo is the calculation of the index based on Guo's formulation, and Recursive is the calculation of the index based on the IRT-recursive-based formulation.

Table 4. Simulated and Estimated Classification Accuracy and Consistency for N(0.5, 1.25) Ability Distribution for 30 Items

Cuts (θ = -0.75, 0.00, 0.75)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.811      0.770      0.762       0.699
Rudner     ML         0.834      0.828      0.783       0.758
Rudner     MAP        0.835      0.834      0.782       0.763
Rudner     EAP        0.838      0.831      0.780       0.758
Rudner     No Est.               0.825                  0.749
Guo        TS         0.811      0.825      0.762       0.795
Guo        ML         0.834      0.848      0.783       0.795
Guo        MAP        0.835      0.833      0.782       0.795
Guo        EAP        0.838      0.838      0.780       0.795
Guo        No Est.               0.853                  0.794
Recursive  TS         0.811      0.817      0.762       0.769
Recursive  ML         0.834      0.805      0.783       0.767
Recursive  MAP        0.835      0.800      0.782       0.759
Recursive  EAP        0.838      0.795      0.780       0.758
Recursive  No Est.               0.805                  0.758

Cuts (θ = -0.75, -0.35, 0.75)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.831      0.775      0.778       0.712
Rudner     ML         0.848      0.837      0.802       0.774
Rudner     MAP        0.847      0.837      0.799       0.771
Rudner     EAP        0.850      0.838      0.795       0.771
Rudner     No Est.               0.830                  0.761
Guo        TS         0.831      0.834      0.778       0.808
Guo        ML         0.848      0.855      0.802       0.808
Guo        MAP        0.847      0.853      0.799       0.808
Guo        EAP        0.850      0.858      0.795       0.808
Guo        No Est.               0.859                  0.805
Recursive  TS         0.831      0.823      0.778       0.784
Recursive  ML         0.848      0.820      0.802       0.786
Recursive  MAP        0.847      0.810      0.799       0.770
Recursive  EAP        0.850      0.810      0.795       0.772
Recursive  No Est.               0.814                  0.767

Cuts (θ = -0.745, -0.042, 0.706)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.820      0.771      0.762       0.700
Rudner     ML         0.843      0.829      0.796       0.759
Rudner     MAP        0.838      0.834      0.781       0.763
Rudner     EAP        0.840      0.830      0.778       0.758
Rudner     No Est.               0.825                  0.750
Guo        TS         0.820      0.827      0.762       0.796
Guo        ML         0.843      0.850      0.796       0.796
Guo        MAP        0.838      0.835      0.781       0.796
Guo        EAP        0.840      0.837      0.778       0.796
Guo        No Est.               0.852                  0.794
Recursive  TS         0.820      0.817      0.762       0.769
Recursive  ML         0.843      0.816      0.796       0.767
Recursive  MAP        0.838      0.810      0.781       0.759
Recursive  EAP        0.840      0.807      0.778       0.756
Recursive  No Est.               0.822                  0.759

Note: IRT = item response theory; TS = IRT true-score estimator; ML = IRT maximum likelihood estimator; MAP = IRT maximum a posteriori estimator; EAP = IRT expected a posteriori estimator; No Est. = value of the index assuming no estimation of the item or person parameters. Rudner is the calculation of the index based on Rudner's formulation, Guo is the calculation of the index based on Guo's formulation, and Recursive is the calculation of the index based on the IRT-recursive-based formulation.

Clearly, it is possible for interactions to exist between the proficiency estimator that is chosen and the index selected to report classification accuracy or consistency.

    Tables 1 through 4 also indicate that in many situations, the estimated classification accuracy

    and consistency were lower than the values simulated for the Rudner-based and IRT-recursive-

    based indices with a couple of exceptions for a few of the computations with the TS estimator.

    The Guo-based classification accuracy indices tended to be lower than the simulated values for

    the TS, EAP, and MAP estimators and larger for the ML estimator for the 60-item test. For the

    30-item test, the estimated classification accuracy exceeded the simulated classification accu-

    racy in most situations. The Guo-based classification consistency indices were higher than the

    simulated classification consistency in almost all cases. The simulated values tended to be best

recovered with the TS estimators, although there were a few exceptions: when the cut-scores were at θ = −0.827, −0.034, and 0.694, recovery was better with some other proficiency estimators, and for the Guo-based indices with the cuts at θ = −0.75, −0.35, and 0.75, the ML estimator performed the best.

In terms of test length, the results follow what one would expect, with classification accuracy and classification consistency dropping considerably across the board between Tables 1 and 3 and between Tables 2 and 4, that is, when the test length decreased from 60 items to 30 items. When looking at

    these tables, one also notices that the differences between the TS estimator and the EAP, MAP,

    and ML estimators tended to become larger as the test length was decreased for the Rudner-

    and Guo-based indices. In addition, the differences between the IRT-recursive indices and the

    other indices also tended to increase when test length was decreased. This suggests that there

    are important potential interactions between the length of the test, different proficiency estima-

    tors, and the index that one chooses to employ.

In terms of the different cut-scores, Table 1 suggests that when the distribution was assumed to be normal with a mean of 0 and a standard deviation of 1 for the 60-item test, the values of the Rudner-based indices tended to be lowest when the cut-scores were at θ = −0.75, −0.35, and 0.75, and highest when the cut-scores were at θ = −0.827, −0.034, and 0.694. For the Guo-based classification accuracy indices, the TS and ML estimators were highest for θ = −0.75, −0.35, and 0.75, and lowest for θ = −0.827, −0.034, and 0.694. For the EAP and MAP estimators, the θ = −0.827, −0.034, and 0.694 cuts produced the highest classification accuracy estimates and the θ = −0.75, 0.00, and 0.75 cuts produced the lowest. Guo's classification consistency indices were highest for all estimators when the cuts were at θ = −0.75, −0.35, and 0.75. For the IRT-recursive-based indices, the pattern was slightly different: except for the TS estimator, the θ = −0.75, 0.00, and 0.75 cuts produced the highest values of the indices. For the TS estimator, the θ = −0.827, −0.034, and 0.694 cuts produced the highest classification accuracy and consistency. The patterns were not as clear and consistent when the ability distribution was assumed to be normal with a mean of 0.5 and a standard deviation of 1.25 for the 60-item test (see Table 2). In this case, many of the estimated values of classification accuracy and consistency were very similar and trivially different across cut-scores. For the 30-item tests (see Tables 3 and 4), the cuts at θ = −0.75, −0.35, and 0.75 tended to produce the highest classification accuracy and consistency for all three sets of indices for both distributions of examinees. It is also important to notice that when the ability distribution had a mean of 0.5 and a standard deviation of 1.25, as opposed to a mean of 0 and a standard deviation of 1, the values of the indices rose across the board. This is consistent with the understanding that classification accuracy and classification consistency go up as the ability of the examinees moves away from the cut-scores.

Figures 1 and 2 provide pictures of the classification accuracy and classification consistency for the three different indices at various θ locations. Figure 1 is for the 60-item test and Figure 2 is for the 30-item test. The figures do not assume a particular proficiency estimator and were


created based on the assumed known item parameters for the 30- and 60-item tests. The x-axis is the examinee's θ and the y-axis is the value of the classification accuracy or the classification consistency for that θ. The solid lines show the Rudner-based index, the jagged dotted lines show the Guo-based index, and the dashed lines show the IRT-recursive-based index. The lines for the Guo-based indices are not smooth due to the simulation of item responses needed for examinees at each θ to calculate the indices. For the Rudner-based and IRT-recursive-based indices, each θ can be applied in conjunction with the known item parameters without simulating item responses, creating a smooth line. The pictures clearly show the different functional forms of each of the indices and what the value of each index would be for an examinee at each θ value.
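Because this simulation step is what makes the Guo-based curves jagged, a minimal sketch of how a single point on a Guo-style classification accuracy curve might be computed is given below. This is an illustration rather than the exact implementation used here: the 3PL scaling constant D = 1.7, the standard normal latent density, the quadrature grid, and the rule of crediting the category that contains θ are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1234)

def p3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response; D = 1.7 is an assumed scaling constant."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def guo_accuracy_at_theta(theta, a, b, c, cuts, n_quad=201):
    """One point on a Guo-style accuracy curve for an examinee at theta."""
    # Simulate a single response pattern at this theta; redrawing the pattern
    # at each grid point is what produces the jagged dotted lines.
    x = (rng.random(a.size) < p3pl(theta, a, b, c)).astype(int)

    # Likelihood of the pattern across a quadrature grid, weighted by an
    # assumed N(0, 1) latent density (up to a constant).
    grid = np.linspace(-4.0, 4.0, n_quad)
    P = p3pl(grid[:, None], a, b, c)                     # n_quad x n_items
    like = np.prod(np.where(x == 1, P, 1.0 - P), axis=1)
    post = like * np.exp(-0.5 * grid ** 2)

    # Share of the (unnormalized) posterior that falls in the category
    # containing theta, with categories bounded by the cut-scores.
    edges = np.concatenate(([-np.inf], np.asarray(cuts), [np.inf]))
    k = np.searchsorted(edges, theta) - 1
    in_cat = (grid >= edges[k]) & (grid < edges[k + 1])
    return post[in_cat].sum() / post.sum()
```

Sweeping theta across a grid, with a fresh simulated response pattern at each point, traces out curves like the dotted lines in Figures 1 and 2.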

The figures help to explain some of the findings in Tables 1 through 4. In particular, it appears that the Guo-based indices have a slightly different pattern in terms of the value of the indices for examinees at different θs than the Rudner- and IRT-based indices, as they do not go up as high in between cut-scores or as low at the cut-scores as the other two indices. This is probably due in part to the use of the item response patterns and the focus on likelihood functions instead of examinee θs when computing the indices. One can also see that the Rudner-based indices exceeded the IRT-recursive-based indices in between the cut-scores. These two indices have dips at the cut-scores in Figures 1 and 2 that do not align exactly for the first two panels in each figure due to the rounding of the cut-scores. At the extremes of the θ distribution, the Guo-based and IRT-recursive-based indices had higher classification accuracy and consistency. This makes sense because, for the Rudner-based indices, having an extreme score often was associated with an extremely low value of the test information function, which would lower the classification accuracy and consistency. The IRT-recursive-based and Guo-based indices, however, do not consider test information, and extreme scores were associated

Figure 1. Plot of classification accuracy and consistency curves for simulations with 60 items
Note: The solid line in each panel is for the Rudner-based index, the dotted line is for the Guo-based index, and the dashed line is for the IRT-recursive-based index. The top left panel is the classification accuracy curves with cuts at θ = −0.75, 0.00, and 0.75; the top middle panel is the classification accuracy curves with cuts at θ = −0.75, −0.35, and 0.75; the top right panel is the classification accuracy curves with cuts at θ = −0.827, −0.034, and 0.694; the bottom left panel is the classification consistency curves with cuts at θ = −0.75, 0.00, and 0.75; the bottom middle panel is the classification consistency curves with cuts at θ = −0.75, −0.35, and 0.75; and the bottom right panel is the classification consistency curves with cuts at θ = −0.827, −0.034, and 0.694.


with higher classification accuracy and consistency. This is an important difference that is worth noting; it suggests that with a distribution of examinees containing more extreme scores, the Rudner-based indices would probably be lower than the other two indices. This is a potential downside to the Rudner-based indices, as one would anticipate that the probability of accurately and consistently classifying an examinee with an extreme score would be high. In many practical situations, most of the examinees will often be in the regions where the cut-scores are located, and one would probably expect that the Rudner-based indices would work well, as they did in the simulations.
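As a concrete companion to the figures, the sketch below shows how points on the Rudner-based accuracy and consistency curves can be computed for a single true θ under the 3PL model. It follows the normal approximation that underlies Rudner's formulation, with the estimated θ treated as normal around the true θ with standard error 1/√I(θ), and it obtains consistency by squaring and summing the category probabilities; the D = 1.7 constant and the function names are illustrative assumptions rather than the authors' code.

```python
import numpy as np
from scipy.stats import norm

def p3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response; D = 1.7 is an assumed scaling constant."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def test_information(theta, a, b, c, D=1.7):
    """3PL test information at theta, summed over items."""
    P = p3pl(theta, a, b, c, D)
    return np.sum((D * a) ** 2 * ((1.0 - P) / P) * ((P - c) / (1.0 - c)) ** 2)

def rudner_point(theta, a, b, c, cuts):
    """Rudner-style accuracy and consistency for an examinee at a true theta."""
    se = 1.0 / np.sqrt(test_information(theta, a, b, c))
    edges = np.concatenate(([-np.inf], np.asarray(cuts), [np.inf]))

    # Probability that the (approximately normal) estimated theta lands in
    # each category defined by the cut-scores.
    p = norm.cdf(edges[1:], loc=theta, scale=se) - norm.cdf(edges[:-1], loc=theta, scale=se)

    k = np.searchsorted(edges, theta) - 1   # category of the true theta
    accuracy = p[k]                         # same category as the truth
    consistency = np.sum(p ** 2)            # two parallel administrations agree
    return accuracy, consistency
```

Evaluating rudner_point over a grid of θ values reproduces the smooth solid lines: the dips occur where θ sits on a cut-score, and the low values at the extremes follow directly from the small test information there.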

    Michigan Merit Examination (MME) Data

    Data for the practical examples were drawn from the MME. The MME is a large-scale assess-

    ment given to 11th graders and some eligible 12th graders that is used for school accountability

    and adequate yearly progress determinations in Michigan. The MME has five subject tests

    (reading, math, science, writing, and social studies) consisting of items from the ACT,

    WorkKeys, and custom-Michigan-developed components. Subsets of items are selected from

ACT and WorkKeys along with the Michigan-developed components to align with Michigan's high school academic content standards. These items are used to determine an examinee's score

    in each subject. Data from the MME reading and math tests are considered in the examples in

    this article.

    The MME reading test consists of 51 operational multiple-choice items: 32 of the items come

    from the ACT reading test and 19 of the items come from the WorkKeys reading for informa-

tion test. An examinee's reported score is a linear transformation of his or her θ estimate from

Figure 2. Plot of classification accuracy and consistency curves for simulations with 30 items
Note: The solid line in each panel is for the Rudner-based index, the dotted line is for the Guo-based index, and the dashed line is for the IRT-recursive-based index. The top left panel is the classification accuracy curves with cuts at θ = −0.75, 0.00, and 0.75; the top middle panel is the classification accuracy curves with cuts at θ = −0.75, −0.35, and 0.75; the top right panel is the classification accuracy curves with cuts at θ = −0.745, −0.042, and 0.706; the bottom left panel is the classification consistency curves with cuts at θ = −0.75, 0.00, and 0.75; the bottom middle panel is the classification consistency curves with cuts at θ = −0.75, −0.35, and 0.75; and the bottom right panel is the classification consistency curves with cuts at θ = −0.745, −0.042, and 0.706.


applying the 3PL model to these data. The 3PL model exhibited moderate degrees of misfit: the MME technical report indicates that roughly 43% of the items did not fit the model according to the S-X2 fit statistic of Orlando and Thissen (2000). There were 98,423 examinees who received valid scores on the initial form of the MME reading test that were considered in this article. The estimated reliability for these data was .89.

    The MME math test is made up of 67 operational multiple-choice items: 3 of the items come

    from the WorkKeys locating information test, 12 of the items come from the WorkKeys applied

    mathematics test, 36 of the items come from the ACT mathematics test, and the remaining items

are custom-developed items. Scores reported to examinees again are a linear transformation of each examinee's θ estimate from applying the 3PL model to these data. According to the S-X2 fit statistic reported in the MME technical report, 28% of the items did not fit the model.

    There were 97,888 examinees who received valid scores on the MME math test considered in

    this article. The estimated reliability for these data was .87.

    Results for MME Data

    Table 5 displays the results for the classification accuracy and consistency for the MME read-

    ing and math tests for the three cut-scores that are used to make classification decisions on each

    assessment. The results for the Rudner-based indices are shown at the top of the table, the

    results for Guo-based indices are shown in the middle of the table, and the results for the IRT-

    recursive-based indices are shown at the bottom of the table. For all three indices, the classifi-

    cation accuracy and consistency were higher for the reading test compared with the math test,

    except for the MAP and EAP estimators for the Guo-based classification accuracy index.

    For the MME math test, the results were similar to the simulation, where the Guo-

    based indices tended to exceed the Rudner-based indices, which exceeded the IRT-recursive-

    based indices. The TS estimator had a larger classification accuracy value for the Rudner-based

    indices than for the Guo-based indices for these data. For the MME reading test, the results

Table 5. Estimated Classification Accuracy and Consistency for MME Reading and Math Tests

                          Reading (n = 98,423)         Math (n = 97,888)
Index      Estimator      Accuracy    Consistency      Accuracy    Consistency
Rudner     TS             0.829       0.763            0.800       0.727
           ML             0.821       0.761            0.806       0.734
           MAP            0.810       0.750            0.792       0.717
           EAP            0.807       0.746            0.800       0.726
Guo        TS             0.799       0.759            0.792       0.758
           ML             0.821       0.759            0.817       0.758
           MAP            0.801       0.759            0.814       0.758
           EAP            0.811       0.759            0.817       0.758
Recursive  TS             0.800       0.744            0.782       0.720
           ML             0.801       0.740            0.783       0.721
           MAP            0.792       0.730            0.763       0.696
           EAP            0.788       0.725            0.772       0.705

Note: MME = Michigan Merit Examination; IRT = item response theory; TS = IRT true-score estimator; ML = IRT maximum likelihood estimator; MAP = IRT maximum a posteriori estimator; EAP = IRT expected a posteriori estimator. Rudner is the calculation of the index based on Rudner's formulation, Guo is the calculation of the index based on Guo's formulation, and recursive is the calculation of the index based on the IRT-recursive-based formulation.


were different: in some cases, the Rudner-based indices were larger than the Guo-based indices for classification accuracy and consistency. The IRT-recursive-based indices were again the smallest. The largest differences between the three indices across the proficiency estimators

    were around 0.03 for the MME reading test and 0.05 for the MME math test. These levels of

    differences were somewhat similar to some of the differences observed in the simulation.

    Somewhat different from the simulation was the rank ordering of values of the indices across

    the proficiency estimators. In the simulation, the EAP and MAP estimators tended to have the

    highest values for the Rudner-based indices. However, in the practical examples, the EAP and

    MAP estimators had values that were lower than those for the TS and ML estimators for the

    Rudner-based indices. The EAP and MAP estimators also had lower classification accuracy

    and consistency estimates for the IRT-based recursive indices. The ML estimator again had the

    highest value for the Guo-based classification accuracy indices, and the classification consis-

    tency was the same for all estimators.

Figure 3 graphically displays the classification accuracy and classification consistency for the indices at various θ locations for the MME reading and math tests, similar to Figures 1 and 2. The top panels are for the MME reading test and the bottom panels are for the MME math test. The solid lines in the panels are for the Rudner-based indices, the jagged dotted lines are for the Guo-based indices, and the dashed lines are for the IRT-based recursive indices. The dips in the figures for the Rudner-based indices show the placements of the cut-scores. These dips do not lie directly on top of each other for the Rudner- and IRT-based indices due to the rounding needed to calculate the IRT-based recursive indices in the number-correct score metric. The spread and placement of the cut-scores differed between the two tests: for the reading test, the cut-scores were more spread apart, and for the math test, the cut-scores were closer together. The figures show the impact of these cut-score placements on the values of the indices. When the cut-scores were closer together, classification accuracy and consistency for individual examinees tended to

Figure 3. Plot of classification accuracy and consistency curves for MME reading and math
Note: MME = Michigan Merit Examination; IRT = item response theory. The solid line in each panel is for the Rudner-based index, the dotted line is for the Guo-based index, and the dashed line is for the IRT-recursive-based index. The top left panel is the classification accuracy curve with four performance levels for reading, the top right panel is the classification consistency curve with four performance levels for reading, the bottom right panel is the classification accuracy curve with four performance levels for math, and the bottom left panel is the classification consistency curve for math.


decrease in comparison with when they were farther apart. The panels also show that between the first and second cut-scores, the Rudner-based indices tended to exceed the IRT-based recursive indices. The lines for the Guo-based indices again had a different pattern than the Rudner-based or IRT-recursive indices, with dips not going as low and peaks not going as high.

    More notable differences for the Guo-based indices were found for the MME reading test com-

    pared with the MME math test. For the MME reading test, the peak between the first and second

    cut-scores is associated with a smooth dip and lower classification accuracy and consistency for

    the Guo-based index in comparison with the Rudner-based index. As many examinees had

    scores that were between these cut-scores, this may explain why the Rudner-based indices were

    in some cases higher for these data in Table 5.

    Discussion and Conclusion

The purposes of this article were (a) to introduce classification consistency indices based on Rudner's and Guo's formulations and (b) to evaluate the performance of the Rudner-based, Guo-based, and IRT-recursive-based indices in simulation and practice. The development of these new indices is important because many of the current approaches for calculating classification accuracy and consistency assume that the reporting metric is number-correct scores or a transformation of number-correct scores, which may not be the approach used to determine scores when IRT models are applied. This can lead to small, subtle differences in the cut-scores in the IRT θ metric due to the rounding needed to compute the indices. The Rudner- and Guo-based indices do not assume that the reporting metric is number-correct scores and can be applied when tests are scored in the θ metric or a linear transformation of this metric. The Rudner and Guo indices are also simpler to compute and are closely tied to assumptions used with ML estimation, which is an often-used approach for estimating examinee abilities with IRT models.
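For contrast with the θ-metric indices, the sketch below shows the two pieces of number-correct machinery that the IRT-recursive-based indices rest on: the Lord-Wingersky recursion for the distribution of raw scores given θ, and the rounding step that turns a θ-metric cut-score into a raw-score cut via the test characteristic curve. This is a hedged illustration of the general approach rather than the exact algorithm of Lee (2010); the rounding rule and the function names are assumptions of the sketch.

```python
import numpy as np

def p3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response; D = 1.7 is an assumed scaling constant."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def lord_wingersky(p):
    """Distribution of the number-correct score given per-item probabilities p."""
    f = np.array([1.0])                                 # with zero items, P(X = 0) = 1
    for pi in p:
        f = (np.concatenate((f * (1.0 - pi), [0.0]))    # item answered incorrectly
             + np.concatenate(([0.0], f * pi)))         # item answered correctly
    return f                                            # length: number of items + 1

def raw_cut_from_theta(theta_cut, a, b, c):
    """Map a theta-metric cut to a number-correct cut via the test characteristic
    curve; the rounding here is what nudges the recursive indices' cuts off the
    theta-metric cuts."""
    return int(np.round(np.sum(p3pl(theta_cut, a, b, c))))
```

Classification probabilities for the recursive indices then come from summing the recursion's output over the raw scores in each category, so whatever distortion the rounding introduces carries into the resulting accuracy and consistency values.

Despite the conceptual and practical appeal of the Rudner- and Guo-based indices, the per-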

    formance of these indices with various proficiency estimators and across a variety of conditions

    has not been fully investigated. To date, only Martineau (2007) and Wyse (2011) have looked at

    the performance of Rudners classification accuracy index considering some of the factors that

    can affect the index. However, these articles did not consider the classification consistency

    index, did not look at the interaction of the indices with various proficiency estimators, and did

    not look at how the Rudner-based indices perform in comparison with other commonly used

    classification accuracy and consistency indices. The Guo-based classification index has only

    been compared with the Rudner-based index using a practical example when the index was orig-

inally formulated and has not been compared with the Rudner-based or IRT-recursive indices in a systematic way. This study provided an initial investigation of a few of these factors in a simu-

    lated and practical setting.

    Results from these investigations suggested that the Guo-based indices tended to have the

    highest classification accuracy and consistency, followed by the Rudner-based indices and the

IRT-based recursive indices. Guo's classification accuracy index and the Guo- and Rudner-based classification consistency indices performed the best for recovering classification accuracy and consistency. The differences among the three indices were often small, and in some cases trivial, but there were some differences on the magnitude of 0.04 or 0.05 units for the whole population. This finding is important, especially given that Lee (2010) has observed that IRT-based recursive indices also tend to be higher than values estimated with the non-IRT-

    based Livingston and Lewis (1995) procedure. This may suggest that there may be even larger

    differences between Guo- and Rudner-based indices and those from the Livingston and Lewis

    procedure. Future research could compare these indices in a variety of situations. This research


    would be valuable as the simulation in this study, although designed to look at a variety of fac-

    tors that can affect the indices, may not reflect the full range of factors that affect the indices in

    all situations. It is possible that the indices may perform differently with alternate tests, different

    placements of cut-scores, and various other factors, such as skewed score distributions.

    The results from the simulations and investigations also suggest some potential features of

    the Rudner- and Guo-based indices that should be highlighted. Specifically, the Guo-based clas-

    sification consistency tended to be notably higher than other indices and did not change with

    different proficiency estimators. This suggests that one should use caution when applying the

Guo's classification consistency index, and the Guo-based formulation may be better when investigations are focused only on classification accuracy or on a single proficiency estimator, such as the ML estimator. In terms of the Rudner-based indices, results suggested that the indices may be adversely affected when the examinee distribution contains more examinees with extreme scores, as extreme scores often have less test information. This may suggest that

    in these situations, one may want to consider the application of another index.

    A notable finding of this study was that the values of the classification accuracy and classifi-

cation consistency indices can change for various proficiency estimators. These findings are similar to those in Kolen and Tong (2010), who observed that the choice of different proficiency

    estimators can affect the number of students reported in different performance levels. It is well

    known that alternate proficiency estimators have different properties and that choosing different

    estimators can change examinee ability estimates and classifications. This article highlights that

    the choice of proficiency estimator can also affect the value of the classification accuracy and

    consistency. Additional research could evaluate classification accuracy and consistency with

    different proficiency estimators in other contexts.
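To see concretely where such estimator effects can originate, the sketch below contrasts EAP and ML estimates computed from the same 3PL response pattern; the item parameters, the N(0, 1) prior, and the simple grid-search maximization are illustrative assumptions standing in for production scoring routines.

```python
import numpy as np

def pattern_likelihood(x, a, b, c, grid, D=1.7):
    """Likelihood of response pattern x at each theta on the grid (3PL model)."""
    P = c + (1.0 - c) / (1.0 + np.exp(-D * a * (grid[:, None] - b)))
    return np.prod(np.where(x == 1, P, 1.0 - P), axis=1)

# Illustrative three-item test and response pattern (assumed values).
a = np.array([1.0, 1.2, 0.8])
b = np.array([-0.5, 0.0, 0.7])
c = np.array([0.2, 0.2, 0.2])
x = np.array([1, 1, 0])

grid = np.linspace(-4.0, 4.0, 401)
L = pattern_likelihood(x, a, b, c, grid)
prior = np.exp(-0.5 * grid ** 2)          # N(0, 1) prior, up to a constant

theta_eap = np.sum(grid * L * prior) / np.sum(L * prior)  # posterior mean
theta_ml = grid[np.argmax(L)]                             # likelihood maximizer
```

On short tests the prior pulls the EAP estimate toward the center of the latent distribution, while ML can move farther toward the extremes; an examinee near a cut-score can therefore be classified differently under different estimators, which is one route by which the estimator choice feeds into the indices.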

This study also highlights the benefit of creating classification accuracy and classification consistency plots, like those in Figures 1 to 3, when investigating different classification indices. These plots allow the researcher and the practitioner to look across the range of possible scores and identify regions in which the indices are performing differently. These pictures can also help

    identify possible explanations for why the indices tended to produce disparate values in simu-

    lated and practical settings. In this article, the figures helped to show some of the differences in

    how the Rudner-based indices treated extreme scores; the Rudner-based indices tended to have

    lower classification accuracy and consistency for extreme scores because these scores tended to

    be associated with lower values of the IRT test information function. The figures also depicted

some differences in the cut-scores due to the rounding of scores that was needed with the IRT-recursive-based procedure, as well as higher values for the Rudner-based indices between the

    cut-scores. One can also see some of the fundamental differences between the Guo-based and

the other indices due to the use of likelihood functions and response patterns with the Guo indices. The Guo indices produced graphs that were less smooth and in which the peaks and

    valleys between and at cut-scores were less pronounced. In the simulation, this produced results

    in which the Guo-based indices often exceeded the Rudner- and IRT-based indices. However,

    as the MME reading practical example suggested, it will not always hold that the Guo-based

    indices will be larger than the other indices. The figures do suggest that it is possible for the

    indices to be higher or lower depending on the distribution of examinee scores.

    It is also important to point out the rather obvious observation that there are a number of fac-

    tors that can affect classification accuracy and consistency. Some of the factors that can affect

    the classification accuracy and consistency values estimated include the classification accuracy

    or consistency index chosen, the distribution of examinee performance, the number and place-

    ment of the cut-scores, the proficiency estimator and scoring metric chosen, the properties and

    number of items in the assessment, and the models applied to the test data as well as the fit of

    those models. It is also important to note the relationships that exist between classification

    622 Applied Psychological Measurement 36(7)

    http://apm.sagepub.com/
  • 7/30/2019 An Evaluation of Item Response Theory Classification Accuracy and Consistency Indices

    22/23

    accuracy and classification consistency. As Equations 7 and 8 suggest, classification accuracy

    should be greater than or equal to classification consistency given that classification accuracy

    involves computations assuming a single administration of the assessment, and classification

    consistency involves administrations of parallel forms or the squaring of computations from a

    single administration to approximate the similarity of classifications across forms. Both types of

indices can be valuable, but the indices address different questions. This suggests that, depending on the situation and the question of interest related to the classification decisions, one index or the other may be more appropriate. This implies that reporting of both indices may not always be necessary. It also means that a critical consideration is having classification accuracy and consistency indices that come from the same foundation and that can be applied to the same data, because the question asked may better fit one type of index or the other.
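In hedged notation (a restatement of the idea behind Equations 7 and 8, not a reproduction of them), let p_k(θ) be the probability that an examinee at θ is classified into level k and let k* denote the examinee's true level. Whenever the true level is also the most probable assigned level, consistency cannot exceed accuracy, since the category probabilities sum to 1:

```latex
\mathrm{consistency}(\theta)
  = \sum_{k} p_{k}(\theta)^{2}
  \le \Bigl(\max_{k} p_{k}(\theta)\Bigr) \sum_{k} p_{k}(\theta)
  = p_{k^{*}}(\theta)
  = \mathrm{accuracy}(\theta).
```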

    Declaration of Conflicting Interests

    The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or pub-

    lication of this article.

    Funding

    The author(s) received no financial support for the research, authorship, and/or publication of this article.

    References

    Brennan, R. L., & Wan, L. (2004). Bootstrap procedures for estimating decision consistency for single-

    administration complex assessments (CASMA Research Report No. 7). Iowa City: Center for

    Advanced Studies in Measurement and Assessment, University of Iowa.

Ercikan, K., & Julian, M. (2002). Classification accuracy of assigning student performance to proficiency levels: Guidelines for assessment design. Applied Measurement in Education, 15, 269-294.

    Guo, F. (2006). Expected classification accuracy using the latent distribution. Practical Assessment

    Research & Evaluation, 11(6). Retrieved from http://pareonline.net/pdf/v11n6.pdf

    Hanson, B. A., & Brennan, R. L. (1990). An investigation of classification consistency indexes estimated

    under alternative strong true score models. Journal of Educational Measurement, 27, 345-359.

    Huynh, H. (1976). On the reliability of decisions in domain-referenced testing. Journal of Educational

    Measurement, 13, 253-264.

    Kolen, M. J., & Tong, Y. (2010). Psychometric properties of IRT proficiency estimates. Educational

    Measurement: Issues and Practice, 29, 8-14.

    Lee, W. (2010). Classification consistency and accuracy for complex assessments using item response

theory. Journal of Educational Measurement, 47, 1-17.
Lee, W., Brennan, R. L., & Wan, L. (2009). Classification consistency and accuracy for complex assessments

    under the compound multinomial model. Applied Psychological Measurement, 33, 374-390.

    Lee, W., Hanson, B. A., & Brennan, R. L. (2002). Estimating consistency and accuracy indices for

    multiple classifications. Applied Psychological Measurement, 26, 412-432.

    Livingston, S. A., & Lewis, C. (1995). Estimating the consistency and accuracy of classifications based on

    test scores. Journal of Educational Measurement, 32, 179-197.

    Martineau, J. A. (2007). An expansion and practical evaluation of expected classification accuracy.

    Applied Psychological Measurement, 31, 181-194.

    Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response

    theory models. Applied Psychological Measurement, 24, 50-64.

    Rudner, L. M. (2001). Computing the expected proportions of misclassified examinees. Practical

    Assessment Research & Evaluation, 7(14). Retrieved from http://PAREonline.net/getvn.asp?v=7&n=14

    Rudner, L. M. (2005). Expected classification accuracy. Practical Assessment Research & Evaluation,

    10(13). Retrieved from http://pareonline.net/pdf/v10n13.pdf
