Article

Applied Psychological Measurement
36(7) 602–624
© The Author(s) 2012
Reprints and permission: sagepub.com/journalsPermissions.nav
DOI: 10.1177/0146621612451522
http://apm.sagepub.com

An Evaluation of Item Response Theory Classification Accuracy and Consistency Indices

Adam E. Wyse1 and Shiqi Hao2

1 Michigan Department of Education, Arden Hills, MN, USA
2 Michigan Department of Education, Lansing, USA

Corresponding Author:
Adam E. Wyse, Michigan Department of Education, Bureau of Assessment and Accountability, 1813 Chatham Ave., Arden Hills, MN 55112, USA
Email: [email protected]

    Abstract

This article introduces two new classification consistency indices that can be used when item response theory (IRT) models have been applied. The new indices are shown to be related to Rudner's classification accuracy index and Guo's classification accuracy index. The Rudner- and Guo-based classification accuracy and consistency indices are evaluated and compared with estimates from the more commonly applied IRT-recursive procedure using a simulation study and data from two large-scale assessments. Results from the simulation study and practical examples suggested that the Guo- and Rudner-based indices tended to produce estimates that were closer to the simulated values and exceeded those from the IRT-recursive-based procedure. However, results did suggest that the Rudner- and Guo-based indices can have some undesirable features that are important to keep in mind when applying them in practice. The values of the classification accuracy and consistency indices appeared to be affected by a number of factors, including the distribution of examinees, test length, the placement of the cut-scores, and the proficiency estimators applied to estimate examinee ability. It is suggested that an important part of investigations evaluating classification accuracy and consistency indices should be the creation of figures that show the value of the classification accuracy and classification consistency for individual examinees across the range of possible scores, as these figures can help reveal subtle and important differences between indices.

    Keywords

classification consistency, classification accuracy, item response theory, cut-scores, θ metric, number-correct scores

One common use of test scores after they are computed is to compare the test scores with cut-scores to determine the level of performance that an examinee achieved on the assessment. Based on the scores that students receive, students are classified into different levels on the assessment, and decisions are made based on those classifications. A critical measurement concern when using cut-scores to make decisions is the classification accuracy and the classification consistency expected on the assessment. Classification consistency is the degree to which examinees would be classified into the same performance categories over parallel replications of the same assessment (Lee, 2010). Classification accuracy is the degree to which observed classifications would agree with true classifications assuming known cut-scores on a single assessment (Lee, 2010).

There are numerous procedures for computing classification accuracy and classification consistency. Procedures for computing classification accuracy and consistency have been discussed in Huynh (1976); Subkoviak (1976); Hanson and Brennan (1990); Livingston and Lewis (1995); Schulz, Kolen, and Nicewander (1999); Wang, Kolen, and Harris (2000); Rudner (2001, 2005); Lee, Hanson, and Brennan (2002); Brennan and Wan (2004); Guo (2006); Martineau (2007); Lee, Brennan, and Wan (2009); and Lee (2010). Lee (2010) provided an excellent summary of many of the approaches and gave empirical examples of how several of the procedures work for computing classification accuracy and classification consistency. Most of the procedures assume that scores are reported in the number-correct score metric and differ primarily in the models used in calculating the indices (e.g., beta-binomial model, application of the item response theory [IRT] recursive formula) and in whether an examinee distribution or each examinee score is considered in computing the classification accuracy and consistency indices. Lee refers to methods that use examinee distributions as distribution methods and methods that use examinee scores as person methods.

Rudner's (2001, 2005) classification accuracy index and Guo's (2006) classification accuracy index are somewhat different from the approaches discussed in Lee (2010) in that they can be applied to data that are scored in the IRT θ metric or a linear transformation of this metric. This characteristic distinguishes these two indices from the other methods. No approach for computing classification consistency currently exists when data are reported in the IRT θ metric or a linear transformation of this metric. Given that no such index has been formulated and research has not been conducted to compare Rudner's or Guo's approaches with the other more commonly used indices, such as the IRT-recursive procedure, which assumes that the reporting metric is number-correct scores, an important question is to what extent indices based on Rudner's or Guo's formulations differ from other more commonly used approaches.

The purposes of this article are to introduce two new IRT-based classification consistency indices, one that is an extension of Rudner's classification accuracy index and one that is an extension of Guo's classification accuracy index, and to explore how classification accuracy and consistency indices based on Rudner's and Guo's formulations and the IRT-recursive procedure perform in simulation and practice. In the next section of this article, Rudner's classification accuracy index is reviewed, and the new index for computing classification consistency that is an extension of Rudner's index is introduced. This is followed by a discussion of Guo's classification accuracy index and the introduction of a new classification consistency index that is an extension of Guo's formulation. These indices are then contrasted with the more commonly used approach for computing classification accuracy and consistency with IRT that uses the IRT-recursive formula discussed in Schulz et al. (1999), Wang et al. (2000), Lee et al. (2002), and Lee (2010). A simulation study is then provided to evaluate the performance of the different indices with various proficiency estimators, two different ability distributions, two different test lengths, and three different sets of cut-scores. Practical examples from two large-scale assessments then show the values of the indices with various proficiency estimators in practical situations. The article concludes with discussion and some areas for future research.


    Classification Accuracy and Consistency Indices

    Rudner-Based Indices

Rudner's (2001, 2005) classification accuracy index uses three data vectors to compute classification accuracy (Martineau, 2007). The first vector is a vector of C + 1 cut-scores:

$$\boldsymbol{\kappa} = [\kappa_1 \ \kappa_2 \ \cdots \ \kappa_{C+1}], \quad \text{where } \kappa_1 < \kappa_2 < \cdots < \kappa_{C+1} \text{ and } \kappa_1 = -\infty,\ \kappa_{C+1} = \infty. \quad (1)$$

This vector of cut-scores contains the operational cut-scores on the assessment, and the lower and upper bounds for all categories. For example, if there are three operational cut-scores, the vector in Equation 1 would contain the three operational cut-scores and positive and negative infinities. The second vector is the vector of estimated examinee scores, which can be represented as

$$\hat{\boldsymbol{\theta}} = [\hat{\theta}_1 \ \hat{\theta}_2 \ \cdots \ \hat{\theta}_{N_e}]', \quad (2)$$

where $N_e$ is the number of examinees and $\hat{\theta}_i$ is the IRT ability estimate for examinee i. The vector in Equation 2 contains each examinee's ability estimate and suggests that Rudner's index is a person method. The third vector is a vector of standard error estimates, which can be written as

$$\mathbf{s}_{\hat{\theta}} = [s_{\hat{\theta}_1} \ s_{\hat{\theta}_2} \ \cdots \ s_{\hat{\theta}_{N_e}}]', \quad (3)$$

where $N_e$ is the number of examinees and $s_{\hat{\theta}_i}$ is an IRT standard error estimate for examinee i. The standard errors in Equation 3 can be computed from an individual's IRT test information function. In this case, the estimate of the standard error for an examinee is

$$s_{\hat{\theta}_i} = \frac{1}{\sqrt{I(\hat{\theta}_i)}}, \quad (4)$$

where $I(\hat{\theta}_i)$ is the value of the test information function for examinee i.

One then finds the area between each successive pair of cut points assuming conditional normality of the standard error estimate around each examinee's ability estimate. The normality assumption comes from asymptotic theory and IRT assumptions when using maximum likelihood (ML) estimation, which imply that as the number of items and examinees becomes large, one should expect an examinee's ML estimate to converge to a normal distribution with a mean of θ and a standard deviation equal to the reciprocal of the square root of the individual's test information function. The expected probability of scoring in each performance-level category C based on these assumptions can be written as

$$\hat{p}_{iC} = \phi(\kappa_{C_i}, \kappa_{C_i + 1}, \hat{\theta}_i, s_{\hat{\theta}_i}), \quad (5)$$

where $\phi(a, b, \mu, \sigma)$ is the area under a normal curve from a to b with a mean of $\mu$ and a standard deviation of $\sigma$, and the other terms have the same meanings as before. It is important to point out that although the assumptions underlying Equation 5 come from asymptotic theory and ML estimation, Equation 5 and the normal distribution assumption can be employed with any proficiency estimator.

One can then define an $N_e \times C$ matrix of expected probabilities, $\hat{\mathbf{P}}$, that contains the expected probabilities of each examinee falling into each performance-level category C. The expected probability that corresponds to the performance-level category that the examinee is classified into is assumed to be the expected probability of correct classification, and the other probabilities are assumed to be the expected misclassification probabilities. Define an $N_e \times C$ matrix of weights, $\mathbf{W}$, which is used to flag the performance-level category that the examinee obtained on the assessment, and write the matrix as

$$\mathbf{W} = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1C} \\ w_{21} & w_{22} & \cdots & w_{2C} \\ \vdots & \vdots & \ddots & \vdots \\ w_{N_e 1} & w_{N_e 2} & \cdots & w_{N_e C} \end{bmatrix}, \quad (6)$$

where the weight $w_{iC}$ equals 1 if examinee i's score is classified into performance-level category C, and 0 otherwise.

Rudner's expected classification accuracy index can be found by performing element-by-element multiplication of $\hat{\mathbf{P}}$ with $\mathbf{W}$, taking the sum of all the elements in the resultant matrix, and dividing by the number of examinees, $N_e$. Mathematically, the index can be written as

$$\hat{\tau} = \frac{\sum \hat{\mathbf{P}} * \mathbf{W}}{N_e}, \quad (7)$$

where * denotes element-by-element matrix multiplication. As classification accuracy can be found based on the administration of a single assessment, Equation 7 only contains the matrices $\hat{\mathbf{P}}$ and $\mathbf{W}$ and does not involve the product of $\hat{\mathbf{P}}$ with itself.

By comparison, classification consistency provides a measure of the proportion of examinees who would be classified into the same category on parallel replications of the same assessment. This involves taking the product of $\hat{\mathbf{P}}$ with itself and does not involve a matrix to flag the observed performance level of the examinee. The new classification consistency index can therefore be expressed as

$$\hat{\gamma} = \frac{\sum \hat{\mathbf{P}} * \hat{\mathbf{P}}}{N_e}, \quad (8)$$

where * again denotes element-by-element multiplication and $N_e$ again is the number of examinees. The index in Equation 8 almost seems trivial, as it should always be less than or equal to the expected classification accuracy given that Equation 8 involves squaring the elements of the $\hat{\mathbf{P}}$ matrix. Nonetheless, understanding the relationships between the indices and having a classification consistency index that can be computed when data are scored in the θ metric is practically useful, given that such an index has not been formulated to this point and that it is common practice to report classification accuracy and consistency following the administration of an assessment.
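As a concrete illustration of Equations 1 through 8, a minimal R sketch of the Rudner-based computations follows. This is an illustrative sketch rather than the authors' code, and the input names theta (ability estimates), se (standard errors from Equation 4), and cuts (operational cut-scores on the θ metric) are hypothetical.

```r
# Rudner-based classification accuracy and consistency (Equations 1-8).
rudner_indices <- function(theta, se, cuts) {
  k  <- c(-Inf, cuts, Inf)                 # Equation 1: add the outer bounds
  Ne <- length(theta)
  C  <- length(k) - 1                      # number of performance categories
  # P[i, c]: normal area between successive cut points (Equations 4 and 5)
  P <- sapply(1:C, function(cc) pnorm(k[cc + 1], theta, se) - pnorm(k[cc], theta, se))
  # W[i, c]: flags the observed category of each estimate (Equation 6); an
  # estimate exactly at a cut falls in the lower category here, which is one
  # possible convention
  obs <- cut(theta, breaks = k, labels = FALSE)
  W <- matrix(0, Ne, C)
  W[cbind(1:Ne, obs)] <- 1
  c(accuracy    = sum(P * W) / Ne,         # Equation 7
    consistency = sum(P * P) / Ne)         # Equation 8
}

# Example call with three operational cut-scores:
# rudner_indices(theta = c(-0.4, 1.2), se = c(0.25, 0.30), cuts = c(-0.75, 0, 0.75))
```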

Guo's Indices

Guo's (2006) classification accuracy index was originally designed as an extension of Rudner's index in the context of ML estimation, and it can be loosely viewed as a person-based index. The index makes no assumption of normality of an examinee's standard error estimate around their ability estimate, and it calculates expected classification probabilities and the $\hat{\mathbf{P}}$ matrix based on individual examinee likelihood functions from IRT models. The avoidance of the normality assumption is an advantage of the method, as the normality assumption only holds


asymptotically and is never completely satisfied in practice. For dichotomous items, the likelihood function can be written as

$$L(u_{1i}, u_{2i}, \ldots, u_{ni} \mid \theta) = \prod_{j=1}^{n} P_{ij}^{u_{ij}} Q_{ij}^{1 - u_{ij}}, \quad (9)$$

where i is the examinee, j is the item on the test, $u_{ij}$ is the response to item j by examinee i, with 1 signaling a correct response and 0 signaling an incorrect response, $P_{ij}$ is the probability of a correct response to item j given θ, and $Q_{ij}$ is the probability of an incorrect response to item j given θ, which is computed as $1 - P_{ij}$. Similar likelihood functions can be written out for polytomous items and mixed-format tests.

The expected probability of scoring in any particular category can be found using the likelihood functions as

$$\hat{p}_{ic} = \frac{\sum_{\theta = \kappa_{c_i}}^{\kappa_{c_i + 1}} L(u_{1i}, u_{2i}, \ldots, u_{ni} \mid \theta)}{\sum_{h = 1}^{C + 1} \sum_{\theta = \kappa_h}^{\kappa_{h + 1}} L(u_{1i}, u_{2i}, \ldots, u_{ni} \mid \theta)}, \quad (10)$$

where $\sum_{\theta = \kappa_{c_i}}^{\kappa_{c_i + 1}} L(u_{1i}, u_{2i}, \ldots, u_{ni} \mid \theta)$ is the sum of likelihood function values from performance category C to the next higher performance category C + 1 for a set of equally spaced θ points between the cut-scores (e.g., 100 equally spaced points), and the denominator is the sum of the likelihood function values across all performance categories. The fact that Guo's method uses sets of equally spaced θ points between cut-scores suggests that Equation 2 should not be a vector, but rather an $N_e \times ((C + 1) \times NP)$ matrix, where C is the number of performance categories and NP is the number of equally spaced points between cut-scores. This suggests that Guo's method is not a person method in the traditional sense of how person methods are conceptualized.

It is also important to note that to be able to compute the expected probabilities for the highest and lowest categories, the highest and lowest cut-scores in Equation 1 need to be set at arbitrary high and low θ values, such as θ = 6 and -6. That is, the vector of cut-scores should be expressed as

$$\boldsymbol{\kappa} = [\kappa_1 \ \kappa_2 \ \cdots \ \kappa_{C+1}], \quad \text{where } \kappa_1 < \kappa_2 < \cdots < \kappa_{C+1} \text{ and } \kappa_1 = -6,\ \kappa_{C+1} = 6. \quad (11)$$

The use of arbitrary high and low values for the extreme cut-scores instead of positive and negative infinities is also a small difference between the Guo and Rudner approaches.

The computations from Equation 10 can then be put into the $\hat{\mathbf{P}}$ matrix similar to the Rudner-based indices, and the weight matrix $\mathbf{W}$ can be formulated based on the examinee ability estimates and comparing those estimates with the cut-scores. The classification accuracy index and the new classification consistency index based on Guo's formulation can then be determined based on Equations 7 and 8 applied to these $\hat{\mathbf{P}}$ and $\mathbf{W}$ matrices. The relationships between the indices again are fairly clear, and it can be observed that classification consistency should be less than or equal to classification accuracy.
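For illustration, the Guo-based expected probabilities in Equations 9 through 11 might be computed as in the following R sketch. The 3PL item model with a 1.7 scaling constant is an assumption made here for concreteness (the formulation only requires a likelihood function), and U (an Ne × n matrix of 0/1 responses), a, b, and cvec (item parameter vectors) are hypothetical names.

```r
# Guo-based P-hat matrix for dichotomous items (Equations 9-11).
guo_P_matrix <- function(U, a, b, cvec, cuts, bound = 6, np = 100) {
  k <- c(-bound, cuts, bound)       # Equation 11: arbitrary high/low end cuts
  C <- length(k) - 1
  grids <- lapply(1:C, function(cc) seq(k[cc], k[cc + 1], length.out = np))
  # Likelihood of a response pattern u at ability th (Equation 9)
  lik <- function(u, th) {
    p <- cvec + (1 - cvec) / (1 + exp(-1.7 * a * (th - b)))
    prod(p^u * (1 - p)^(1 - u))
  }
  # Each row: likelihood mass within each category, normalized (Equation 10)
  t(apply(U, 1, function(u) {
    seg <- sapply(grids, function(g) sum(sapply(g, function(th) lik(u, th))))
    seg / sum(seg)
  }))
}
```

Because each row depends only on the response pattern and the θ grids, recomputing this matrix under a different proficiency estimator changes nothing, which anticipates a property discussed further below.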

IRT-Recursive-Based Indices

Similar to the Rudner-based indices, one again starts with a vector of cut-scores when computing the IRT-recursive-based classification indices (Lee, 2010; Lee et al., 2002; Schulz et al., 1999; Wang et al., 2000). However, the vector of cut-scores is expressed in the number-correct score metric instead of the θ metric. Denote this vector of cut-scores as

$$\boldsymbol{\kappa} = [\kappa_1 \ \kappa_2 \ \cdots \ \kappa_{C+1}], \quad \text{where } \kappa_1 < \kappa_2 < \cdots < \kappa_{C+1} \text{ and } \kappa_1 = 0,\ \kappa_{C+1} = m. \quad (12)$$

In Equation 12, m represents the maximum possible score on the assessment, and each cut-score is assumed to be determined by translating the θ cut-score to the number-correct score metric. It is important to recognize that in translating these cut-scores to the number-correct score metric, rounding is needed, as the cut-scores in the θ metric may not align perfectly with a particular number-correct score.

As indices based on the IRT-recursive formula can also be computed as a person method, the computation of the indices also includes a vector of ability estimates, which is identical to Equation 2. It is also possible to compute the IRT-recursive-based indices using the quadrature points from a run of an IRT software program as a distribution method. However, using the quadrature points is designed to approximate the full vector of ability estimates, and hence it makes sense to write the indices using the full vector of ability estimates. One uses these ability estimates to create a distribution of the probabilities of receiving each number-correct score using the IRT-recursive formula (Thissen, Pommerich, Billeaud, & Williams, 1995).

To write the IRT-recursive formula for dichotomous items, define $f_n(x \mid \theta_i)$ as the conditional distribution of number-correct scores over the first n items for an examinee with ability $\theta_i$, and $\hat{P}_{ij}$ as the probability of a correct response to item j by examinee i. Define $f_1(x = 0 \mid \theta_i) = 1 - \hat{P}_{i1}$ as the probability of earning a score of zero for examinee i on the first item. For $n > 1$, the recursion formula can be written as follows:

$$f_n(x \mid \theta_i) = \begin{cases} f_{n-1}(x \mid \theta_i)\,(1 - \hat{P}_{in}) & x = 0 \\ f_{n-1}(x \mid \theta_i)\,(1 - \hat{P}_{in}) + f_{n-1}(x - 1 \mid \theta_i)\,\hat{P}_{in} & 0 < x < n \\ f_{n-1}(x - 1 \mid \theta_i)\,\hat{P}_{in} & x = n. \end{cases} \quad (13)$$

This formula can also be extended to polytomous items and mixed-format tests, as is described in Thissen et al. (1995).

Then, the probability of scoring in each performance category can be represented as

$$\hat{p}_{ic} = \sum_{x = \kappa_C}^{\kappa_{C+1}} f_n(X = x \mid \theta). \quad (14)$$

Following similar logic as is used with the Rudner-based indices, one forms the matrices $\hat{\mathbf{P}}$ and $\mathbf{W}$ and computes classification accuracy and consistency using the formulations in Equations 7 and 8. Again, the relationships between the indices are fairly clear, and it can be observed that classification consistency should be less than or equal to classification accuracy.
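A minimal R sketch of the recursion in Equation 13, together with the collapse to category probabilities in Equation 14, might look as follows for dichotomous items. The handling of a score that lands exactly on a cut (counted into the upper category) is an assumed convention here, as the article notes only that rounding is involved.

```r
# Recursion in Equation 13; p holds P-hat_ij for one examinee across n items.
recursive_dist <- function(p) {
  f <- c(1 - p[1], p[1])                       # scores 0..1 after the first item
  for (j in seq_along(p)[-1]) {
    f <- c(f * (1 - p[j]), 0) + c(0, f * p[j]) # scores 0..j after item j
  }
  f                                            # f[x + 1] = Pr(number-correct = x)
}

# Equation 14: sum the score distribution within each pair of number-correct
# cut-scores; nc_cuts are the rounded cuts, and scores at or above a cut are
# placed in the upper category.
category_probs <- function(f, nc_cuts) {
  m <- length(f) - 1                           # maximum possible score
  k <- c(0, nc_cuts, m + 1)                    # half-open intervals [k_c, k_{c+1})
  sapply(1:(length(k) - 1), function(cc) sum(f[(k[cc] + 1):k[cc + 1]]))
}
```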

    Similarities and Differences Between Indices

It is important to highlight some of the key similarities and differences between the indices. First, it is apparent that there are different distributional assumptions with each approach. For the Guo-based indices, a single distribution is not assumed, and the expected probabilities underlying the indices are driven by the likelihood functions. These likelihood functions are typically not symmetrical and can change depending on the response pattern of the examinee. For the IRT-recursive indices, the distribution of number-correct scores is assumed to follow a compound binomial distribution if all the items are dichotomous or a compound multinomial distribution if the test contains some polytomous items. For the Rudner-based indices, a normal distribution for the examinee ability estimates is assumed when calculating the indices. The different assumptions and formulations may give rise to disparate classification accuracy and classification consistency estimates. However, all the formulations are based on properties of IRT models and typical IRT assumptions. This includes the assumption that the examinee ability estimates and item parameters used are good estimates of the underlying parameters.

All three sets of indices can be classified as person-based methods. However, the Guo-based indices are notably different from the Rudner-based or IRT-recursive-based indices. This can be seen in how the $\hat{\mathbf{P}}$ matrix is determined. In the Guo-based indices, each examinee's ability estimate does not enter into the computations in Equation 10 or the $\hat{\mathbf{P}}$ matrix. Rather, it is the response pattern of the examinee and the equally spaced θ points that drive the computations of the likelihood function values and the $\hat{\mathbf{P}}$ matrix. This implies that the choice of proficiency estimator will not affect the $\hat{\mathbf{P}}$ matrix, as the response pattern is unchanged across proficiency estimators; only the weight matrix $\mathbf{W}$ flagging the observed classifications of the examinees can change with different proficiency estimators. This is in contrast to the IRT-recursive-based indices and the Rudner-based indices, in which both the $\hat{\mathbf{P}}$ and $\mathbf{W}$ matrices can change. In the Rudner-based indices, a likelihood function is not applied, and the computation of the $\hat{\mathbf{P}}$ matrix is based on individual examinee test information functions that change in value with different proficiency estimators. Similarly, for the IRT-recursive indices, the likelihood functions are not used, and each examinee's ability estimate is input into the IRT-recursive formula to determine the probability of receiving each number-correct score given their estimated ability. The important implication of the fact that the $\hat{\mathbf{P}}$ matrix does not change for Guo's index is that Guo's classification accuracy index can be potentially different for various proficiency estimators, but the classification consistency index will be identical across proficiency estimators because the $\hat{\mathbf{P}}$ matrix is not changed. This is a potential drawback to the Guo-based classification consistency index, as one would expect that the choice of proficiency estimator would affect classification consistency.

Another difference is that the Rudner- and Guo-based indices perform computations assuming the reporting metric is the θ metric or a linear transformation of this metric, whereas the IRT-recursive-based indices assume that the reporting metric is number-correct scores or a transformation of number-correct scores. This can lead to some small differences in the potential cut-scores, as some form of rounding is often needed to translate the cut-scores from the θ metric to the number-correct score metric.

    Simulation Study

Given that the classification consistency indices based on Rudner's and Guo's formulations are new, an important question is whether the Rudner- and Guo-based indices perform better than the IRT-recursive procedure, which is more commonly used to compute classification accuracy and consistency with IRT models. In addition, as the assumptions used with the Guo- and Rudner-based indices are closely tied to ML estimation and each of the three indices makes different distributional assumptions, another key question is how using different proficiency estimators affects classification accuracy and consistency indices. Different indices may perform better in different conditions.

To investigate these questions, a simulation study was performed in which several different factors were manipulated. Data were simulated for two different ability distributions, two different test lengths, three different sets of cut-scores, and four different proficiency estimators. In the simulation, three cut-scores and four performance categories were assumed in each condition. The number of cut-scores and the number of performance categories were fixed because the effects of the number of cut-scores and performance categories are well known. In particular, it has been shown that classification accuracy and consistency increase as the number of performance categories decreases (Ercikan & Julian, 2002; Lee et al., 2002). Prior investigations with the indices examined in this study indicate that these patterns hold (Lee, 2010; Lee et al., 2002; Martineau, 2007). The number of examinees was also fixed at 2,000, as preliminary investigations with other sample sizes (e.g., 10,000 and 25,000) produced results that were similar to those for 2,000 examinees. A single fixed test form from which the item parameters were drawn in this simulation was also assumed. This test form consisted of 60 three-parameter logistic (3PL) model items that were drawn from an ACT (American College Testing) mathematics test administered to a sample of more than 100,000 students. The estimated parameters from this sample were assumed to be the true known item parameters in the simulation.

    Examinee Distributions

Two different examinee distributions were investigated in this study. The first group of examinees was drawn from a normal distribution with a mean of 0 and a standard deviation of 1. The second group of examinees was drawn from a normal distribution with a mean of 0.5 and a standard deviation of 1.25. These two groups of examinees were chosen arbitrarily. The first group was designed to be similar to a typical group of students and to the assumptions used for ability distributions in many software packages when resolving the IRT indeterminacy problem. The second group was designed to represent a group with slightly more ability and greater dispersion. It is expected that the ability distributions would affect the values of the indices and interact with the placement of the cut-scores. When the distribution of examinees is closer to the cut-scores, the values of the indices are expected to decrease, probably in somewhat similar fashion for all three indices.

    Test Length

Two different test lengths were included in the simulation. The first test length included the full set of 60 items from the ACT mathematics test. The second test length was 30 items and consisted of the odd items from the ACT mathematics test. It is expected that as the length of the test is shortened, the classification accuracy and classification consistency indices will decrease, as examinee ability estimates tend to have more error with shorter test lengths.

    Cut-Scores

Three different sets of cut-scores were considered in this study. The first set of cut-scores was θ = -0.75, 0.00, and 0.75. These cut-scores were designed to represent a situation in which the cut-scores were symmetrically distributed around 0 and centered on the mean of the first ability distribution. The second set of cut-scores was θ = -0.75, -0.35, and 0.75. This allowed the impact of nonsymmetrical cut-scores to be investigated. It is expected that these cut-scores would lower the values of the indices for some examinees between -0.75 and -0.35, as the cut-scores are closer together, and raise the values of the indices for some examinees between -0.35 and 0.75, as these cut-scores are farther apart. As the distance between cut-scores increases, examinees located between the cut-scores should see their classification accuracy and classification consistency estimates rise. However, the values of the indices may be somewhat similar to those when the cuts were set at θ = -0.75, 0.00, and 0.75 due to the trade-off between the individual examinee classification accuracy and consistency estimates at different regions of the scale. The final set of cut-scores investigated was θ = -0.827, -0.034, and 0.694 for the 60-item test and θ = -0.745, -0.042, and 0.706 for the 30-item test. These cut-scores were included to capture the condition in which the θ cuts align as closely to a set of number-correct scores as is possible. One might expect that this condition would result in the most similar values for classification accuracy and consistency across the indices, as the effect of rounding is essentially removed.

    Proficiency Estimators

Four different proficiency estimators were considered in this study. The first proficiency estimator was the IRT true-score (TS) estimator. This estimator was found by estimating the item parameters in BILOG-MG (Zimowski, Muraki, Mislevy, & Bock, 1996) and then applying the Newton-Raphson procedure to each number-correct score to determine each examinee's θ estimate. This estimator is important to study, as the IRT-recursive-based procedure assumes that the reporting metric is number-correct scores or a transformation of number-correct scores. One would expect that the IRT-recursive-based indices would perform better with this estimator than with other estimators because the estimator and the philosophy of the index best align in this case.

The other three estimators were the estimators available in BILOG-MG. These are the ML estimator, the expected a posteriori (EAP) estimator, and the maximum a posteriori (MAP) estimator. The EAP and MAP estimators are Bayesian estimators, which tend to be pulled toward the mean of the prior ability distribution in comparison with the ML estimator. Each estimator was computed using the default settings of BILOG-MG, except that the IDIST = 3 option was used with the EAP estimator in the SCORE command, the number of quadrature points was increased to 40, the number of expectation-maximization (EM) cycles was increased to 200, and the number of NEWTON cycles was increased to 100. Kolen and Tong (2010) demonstrated that various proficiency estimators can perform differently for classifying students into performance categories in practical contexts, and it is expected that these findings would translate to the computation of classification accuracy and consistency indices. One might expect that the ML estimator would perform the best for the Rudner- and Guo-based indices, as they have theoretical underpinnings related to ML estimation.

    Simulating and Estimating Classification Accuracy and Consistency

To evaluate the performance of each index, classification accuracy and consistency were simulated and estimated using R. To simulate classification accuracy, the simulated ability distributions were assumed to be the true distributions, and the estimated thetas (i.e., the $\hat{\theta}$s) were computed from the item responses generated from the assumed true distributions and were taken as the observed distributions. The cut-scores were then applied to each distribution, and the proportion of classifications that remained the same in the observed and true distributions was taken as the simulated classification accuracy. To find the simulated classification consistency, the same true known ability distributions were assumed, and two separate sets of item responses were simulated for each group of examinees. The values of the estimators for the two sets of item responses were determined, and the cut-scores were applied to the observed-score distributions. The proportion of classifications that remained the same for the two observed-score distributions was taken as the simulated classification consistency.

Estimated classification accuracy and consistency were found by computing the Rudner-based indices, Guo-based indices, and IRT-recursive-based indices in R, applied to the estimated item and person parameters. For the Guo-based indices, 100 equally spaced θ points were used between each pair of cut-scores. The estimated classification accuracy and consistency were contrasted with the simulated classification accuracy and consistency estimates to determine which indices best recovered the simulated values. To provide a baseline condition, the values of the indices were also found using the assumed known item parameters and θs. For the Guo index, a set of item responses was simulated to compute the $\hat{\mathbf{P}}$ matrix. The correct classifications in the $\mathbf{W}$ matrix were computed by identifying the performance category in which the likelihood function was maximized based on the simulated item responses. The baseline conditions are labeled No Est. in the tables in the results section, as no estimation of item or person parameters was used. Each cut-score in the number-correct score metric needed to apply the IRT-recursive procedure was rounded to the nearest number-correct score when computing the indices. This led to cut-scores on the θ scale that were different for the Rudner- and Guo-based indices in comparison with the IRT-recursive-based indices.
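As an illustration of the simulation logic just described, the simulated accuracy and consistency reduce to simple agreement rates between classified score vectors, as in the following R sketch; classify() and the argument names are hypothetical stand-ins for however the true and estimated θs are produced.

```r
# Assign each theta to a performance category given theta-metric cut-scores.
classify <- function(theta, cuts) cut(theta, c(-Inf, cuts, Inf), labels = FALSE)

# Simulated accuracy: agreement between true and estimated classifications.
sim_accuracy <- function(theta_true, theta_hat, cuts)
  mean(classify(theta_true, cuts) == classify(theta_hat, cuts))

# Simulated consistency: agreement between classifications from two
# independently generated sets of item responses for the same true thetas.
sim_consistency <- function(theta_hat1, theta_hat2, cuts)
  mean(classify(theta_hat1, cuts) == classify(theta_hat2, cuts))
```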

    Results of Simulation Study

Tables 1 to 4 show the results from the simulation study. Table 1 displays the results when the ability distribution was assumed to be normal with a mean of 0 and a standard deviation of 1 for the 60-item test. Table 2 displays the results when the ability distribution was assumed to be normal with a mean of 0.5 and a standard deviation of 1.25 for the 60-item test. Table 3 displays the results when the ability distribution was assumed to be normal with a mean of 0 and a standard deviation of 1 for the 30-item test. Table 4 displays the results when the ability distribution was assumed to be normal with a mean of 0.5 and a standard deviation of 1.25 for the 30-item test. In each table, the results for the Rudner-based indices are shown at the top of the table, the results for the Guo-based indices are shown in the middle of the table, and the results for the IRT-recursive-based indices are shown at the bottom of the table. The results for the different cut-scores are shown under the three column headings in each table.

Several important findings can be observed in the tables. Specifically, the Guo-based indices tended to be the largest, followed by the Rudner-based indices and then the IRT-recursive-based indices. For classification accuracy, the estimated values for the Guo-based indices were often the closest to the simulated values. For classification consistency, the Rudner- or Guo-based indices performed best depending on the estimator. In several cases, the differences between the three indices were trivial, with differences in the third decimal place. However, there were some differences that approached 0.04 or 0.05. Differences of 0.05 between the indices might be viewed as somewhat large given that the indices are restricted to a range of 0.00 to 1.00. This suggests that the index chosen to report classification accuracy and consistency can have key impacts on the numbers that are reported.

In addition, it is important to notice that in the case of no estimation of item or person parameters, the value for the ML estimator was closest to the values computed for the Rudner- and Guo-based indices. This suggests that the ML-based estimates of the indices were very close to the value of the indices with no estimation of item and person parameters. This is somewhat expected, as the Rudner- and Guo-based indices are closely tied to assumptions for ML estimation. For the IRT-recursive indices with no estimation, there was not a single proficiency estimator that was closest to the no estimation condition across the tables.

It can also be seen that the indices tended to be closest in value with the TS estimator in comparison with the other proficiency estimators. In addition, the tables suggest that Rudner's classification accuracy and consistency indices tended to be greatest for the EAP and MAP estimators, Guo's classification accuracy indices tended to be greatest for the ML and EAP estimators, and the IRT-recursive classification accuracy and consistency indices tended to be greatest for the TS and ML estimators. The Guo-based classification consistency index did not change across proficiency estimators, as was expected.


Table 1. Simulated and Estimated Classification Accuracy and Consistency for N(0, 1) Ability Distribution for 60 Items

Cuts (θ = -0.75, 0.00, 0.75)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.832      0.829      0.756       0.763
Rudner     ML         0.847      0.836      0.784       0.768
Rudner     MAP        0.856      0.843      0.788       0.776
Rudner     EAP        0.856      0.848      0.793       0.780
Rudner     No Est.               0.838                  0.770
Guo        TS         0.832      0.822      0.756       0.804
Guo        ML         0.847      0.852      0.784       0.804
Guo        MAP        0.856      0.834      0.788       0.804
Guo        EAP        0.856      0.839      0.793       0.804
Guo        No Est.               0.859                  0.800
Recursive  TS         0.832      0.823      0.756       0.763
Recursive  ML         0.847      0.819      0.784       0.758
Recursive  MAP        0.856      0.811      0.788       0.742
Recursive  EAP        0.856      0.819      0.793       0.759
Recursive  No Est.               0.806                  0.748

Cuts (θ = -0.75, -0.35, 0.75)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.823      0.823      0.755       0.757
Rudner     ML         0.845      0.833      0.783       0.768
Rudner     MAP        0.852      0.839      0.789       0.774
Rudner     EAP        0.852      0.846      0.795       0.781
Rudner     No Est.               0.836                  0.771
Guo        TS         0.823      0.819      0.755       0.806
Guo        ML         0.845      0.852      0.783       0.806
Guo        MAP        0.852      0.844      0.789       0.806
Guo        EAP        0.852      0.847      0.795       0.806
Guo        No Est.               0.858                  0.802
Recursive  TS         0.823      0.817      0.755       0.761
Recursive  ML         0.845      0.815      0.783       0.764
Recursive  MAP        0.852      0.805      0.789       0.751
Recursive  EAP        0.852      0.814      0.795       0.760
Recursive  No Est.               0.806                  0.752

Cuts (θ = -0.827, -0.034, 0.694)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.822      0.831      0.743       0.766
Rudner     ML         0.848      0.839      0.771       0.770
Rudner     MAP        0.850      0.844      0.781       0.778
Rudner     EAP        0.851      0.845      0.781       0.779
Rudner     No Est.               0.836                  0.767
Guo        TS         0.822      0.819      0.743       0.798
Guo        ML         0.848      0.850      0.771       0.798
Guo        MAP        0.850      0.849      0.781       0.798
Guo        EAP        0.851      0.853      0.781       0.798
Guo        No Est.               0.857                  0.798
Recursive  TS         0.822      0.826      0.743       0.764
Recursive  ML         0.848      0.818      0.771       0.761
Recursive  MAP        0.850      0.806      0.781       0.753
Recursive  EAP        0.851      0.808      0.781       0.754
Recursive  No Est.               0.817                  0.748

Note: IRT = item response theory; TS = IRT true-score estimator; ML = IRT maximum likelihood estimator; MAP = IRT maximum a posteriori estimator; EAP = IRT expected a posteriori estimator; No Est. = value of the index assuming no estimation of the item or person parameters. Rudner is the calculation of the index based on Rudner's formulation, Guo is the calculation of the index based on Guo's formulation, and Recursive is the calculation of the index based on the IRT-recursive-based formulation.

Table 2. Simulated and Estimated Classification Accuracy and Consistency for N(0.5, 1.25) Ability Distribution for 60 Items

Cuts (θ = -0.75, 0.00, 0.75)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.873      0.861      0.815       0.801
Rudner     ML         0.881      0.871      0.833       0.815
Rudner     MAP        0.889      0.880      0.843       0.826
Rudner     EAP        0.882      0.879      0.841       0.825
Rudner     No Est.               0.872                  0.816
Guo        TS         0.873      0.871      0.815       0.844
Guo        ML         0.881      0.887      0.833       0.844
Guo        MAP        0.889      0.869      0.843       0.844
Guo        EAP        0.882      0.877      0.841       0.844
Guo        No Est.               0.889                  0.843
Recursive  TS         0.873      0.863      0.815       0.815
Recursive  ML         0.881      0.857      0.833       0.814
Recursive  MAP        0.889      0.850      0.843       0.804
Recursive  EAP        0.882      0.857      0.841       0.812
Recursive  No Est.               0.851                  0.806

Cuts (θ = -0.75, -0.35, 0.75)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.879      0.862      0.824       0.806
Rudner     ML         0.891      0.872      0.841       0.820
Rudner     MAP        0.892      0.876      0.839       0.825
Rudner     EAP        0.889      0.879      0.852       0.828
Rudner     No Est.               0.874                  0.822
Guo        TS         0.879      0.869      0.824       0.848
Guo        ML         0.891      0.888      0.841       0.848
Guo        MAP        0.892      0.882      0.839       0.848
Guo        EAP        0.889      0.888      0.852       0.848
Guo        No Est.               0.889                  0.846
Recursive  TS         0.879      0.864      0.824       0.820
Recursive  ML         0.891      0.860      0.841       0.821
Recursive  MAP        0.892      0.853      0.839       0.811
Recursive  EAP        0.889      0.859      0.852       0.818
Recursive  No Est.               0.854                  0.813

Cuts (θ = -0.827, -0.034, 0.694)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.869      0.861      0.812       0.803
Rudner     ML         0.884      0.872      0.825       0.817
Rudner     MAP        0.876      0.878      0.835       0.824
Rudner     EAP        0.882      0.877      0.834       0.822
Rudner     No Est.               0.872                  0.817
Guo        TS         0.869      0.868      0.812       0.844
Guo        ML         0.884      0.887      0.825       0.844
Guo        MAP        0.876      0.877      0.835       0.844
Guo        EAP        0.882      0.882      0.834       0.844
Guo        No Est.               0.889                  0.843
Recursive  TS         0.869      0.860      0.812       0.813
Recursive  ML         0.884      0.861      0.825       0.813
Recursive  MAP        0.876      0.860      0.835       0.810
Recursive  EAP        0.882      0.861      0.834       0.809
Recursive  No Est.               0.861                  0.807

Note: IRT = item response theory; TS = IRT true-score estimator; ML = IRT maximum likelihood estimator; MAP = IRT maximum a posteriori estimator; EAP = IRT expected a posteriori estimator; No Est. = value of the index assuming no estimation of the item or person parameters. Rudner is the calculation of the index based on Rudner's formulation, Guo is the calculation of the index based on Guo's formulation, and Recursive is the calculation of the index based on the IRT-recursive-based formulation.

Table 3. Simulated and Estimated Classification Accuracy and Consistency for N(0, 1) Ability Distribution for 30 Items

Cuts (θ = -0.75, 0.00, 0.75)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.779      0.766      0.680       0.686
Rudner     ML         0.807      0.792      0.715       0.707
Rudner     MAP        0.801      0.798      0.720       0.713
Rudner     EAP        0.807      0.800      0.734       0.714
Rudner     No Est.               0.789                  0.701
Guo        TS         0.779      0.783      0.680       0.756
Guo        ML         0.807      0.816      0.715       0.756
Guo        MAP        0.801      0.805      0.720       0.756
Guo        EAP        0.807      0.811      0.734       0.756
Guo        No Est.               0.822                  0.751
Recursive  TS         0.779      0.768      0.680       0.703
Recursive  ML         0.807      0.762      0.715       0.699
Recursive  MAP        0.801      0.762      0.720       0.763
Recursive  EAP        0.807      0.763      0.734       0.758
Recursive  No Est.               0.744                  0.687

Cuts (θ = -0.75, -0.35, 0.75)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.779      0.772      0.703       0.698
Rudner     ML         0.812      0.795      0.735       0.717
Rudner     MAP        0.817      0.797      0.744       0.716
Rudner     EAP        0.821      0.803      0.759       0.716
Rudner     No Est.               0.792                  0.712
Guo        TS         0.779      0.789      0.703       0.767
Guo        ML         0.812      0.819      0.735       0.767
Guo        MAP        0.817      0.819      0.744       0.767
Guo        EAP        0.821      0.824      0.759       0.767
Guo        No Est.               0.826                  0.761
Recursive  TS         0.779      0.778      0.703       0.725
Recursive  ML         0.812      0.765      0.735       0.719
Recursive  MAP        0.817      0.759      0.744       0.705
Recursive  EAP        0.821      0.767      0.759       0.712
Recursive  No Est.               0.753                  0.694

Cuts (θ = -0.745, -0.042, 0.706)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.774      0.766      0.679       0.683
Rudner     ML         0.802      0.790      0.720       0.704
Rudner     MAP        0.805      0.798      0.718       0.712
Rudner     EAP        0.812      0.800      0.733       0.712
Rudner     No Est.               0.787                  0.699
Guo        TS         0.774      0.785      0.679       0.752
Guo        ML         0.802      0.814      0.720       0.752
Guo        MAP        0.805      0.811      0.718       0.752
Guo        EAP        0.812      0.817      0.733       0.752
Guo        No Est.               0.821                  0.747
Recursive  TS         0.774      0.767      0.679       0.703
Recursive  ML         0.802      0.763      0.720       0.699
Recursive  MAP        0.805      0.761      0.718       0.695
Recursive  EAP        0.812      0.764      0.733       0.697
Recursive  No Est.               0.766                  0.687

Note: IRT = item response theory; TS = IRT true-score estimator; ML = IRT maximum likelihood estimator; MAP = IRT maximum a posteriori estimator; EAP = IRT expected a posteriori estimator; No Est. = value of the index assuming no estimation of the item or person parameters. Rudner is the calculation of the index based on Rudner's formulation, Guo is the calculation of the index based on Guo's formulation, and Recursive is the calculation of the index based on the IRT-recursive-based formulation.

Table 4. Simulated and Estimated Classification Accuracy and Consistency for N(0.5, 1.25) Ability Distribution for 30 Items

Cuts (θ = -0.75, 0.00, 0.75)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.811      0.770      0.762       0.699
Rudner     ML         0.834      0.828      0.783       0.758
Rudner     MAP        0.835      0.834      0.782       0.763
Rudner     EAP        0.838      0.831      0.780       0.758
Rudner     No Est.               0.825                  0.749
Guo        TS         0.811      0.825      0.762       0.795
Guo        ML         0.834      0.848      0.783       0.795
Guo        MAP        0.835      0.833      0.782       0.795
Guo        EAP        0.838      0.838      0.780       0.795
Guo        No Est.               0.853                  0.794
Recursive  TS         0.811      0.817      0.762       0.769
Recursive  ML         0.834      0.805      0.783       0.767
Recursive  MAP        0.835      0.800      0.782       0.759
Recursive  EAP        0.838      0.795      0.780       0.758
Recursive  No Est.               0.805                  0.758

Cuts (θ = -0.75, -0.35, 0.75)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.831      0.775      0.778       0.712
Rudner     ML         0.848      0.837      0.802       0.774
Rudner     MAP        0.847      0.837      0.799       0.771
Rudner     EAP        0.850      0.838      0.795       0.771
Rudner     No Est.               0.830                  0.761
Guo        TS         0.831      0.834      0.778       0.808
Guo        ML         0.848      0.855      0.802       0.808
Guo        MAP        0.847      0.853      0.799       0.808
Guo        EAP        0.850      0.858      0.795       0.808
Guo        No Est.               0.859                  0.805
Recursive  TS         0.831      0.823      0.778       0.784
Recursive  ML         0.848      0.820      0.802       0.786
Recursive  MAP        0.847      0.810      0.799       0.770
Recursive  EAP        0.850      0.810      0.795       0.772
Recursive  No Est.               0.814                  0.767

Cuts (θ = -0.745, -0.042, 0.706)

Index      Estimator  Sim. acc.  Est. acc.  Sim. cons.  Est. cons.
Rudner     TS         0.820      0.771      0.762       0.700
Rudner     ML         0.843      0.829      0.796       0.759
Rudner     MAP        0.838      0.834      0.781       0.763
Rudner     EAP        0.840      0.830      0.778       0.758
Rudner     No Est.               0.825                  0.750
Guo        TS         0.820      0.827      0.762       0.796
Guo        ML         0.843      0.850      0.796       0.796
Guo        MAP        0.838      0.835      0.781       0.796
Guo        EAP        0.840      0.837      0.778       0.796
Guo        No Est.               0.852                  0.794
Recursive  TS         0.820      0.817      0.762       0.769
Recursive  ML         0.843      0.816      0.796       0.767
Recursive  MAP        0.838      0.810      0.781       0.759
Recursive  EAP        0.840      0.807      0.778       0.756
Recursive  No Est.               0.822                  0.759

Note: IRT = item response theory; TS = IRT true-score estimator; ML = IRT maximum likelihood estimator; MAP = IRT maximum a posteriori estimator; EAP = IRT expected a posteriori estimator; No Est. = value of the index assuming no estimation of the item or person parameters. Rudner is the calculation of the index based on Rudner's formulation, Guo is the calculation of the index based on Guo's formulation, and Recursive is the calculation of the index based on the IRT-recursive-based formulation.

Clearly, it is possible for interactions to exist between the proficiency estimator that is chosen and the index selected to report classification accuracy or consistency.

    Tables 1 through 4 also indicate that in many situations, the estimated classification accuracy

    and consistency were lower than the values simulated for the Rudner-based and IRT-recursive-

    based indices with a couple of exceptions for a few of the computations with the TS estimator.

    The Guo-based classification accuracy indices tended to be lower than the simulated values for

    the TS, EAP, and MAP estimators and larger for the ML estimator for the 60-item test. For the

    30-item test, the estimated classification accuracy exceeded the simulated classification accu-

    racy in most situations. The Guo-based classification consistency indices were higher than the

    simulated classification consistency in almost all cases. The simulated values tended to be best

recovered with the TS estimators, although there were a few exceptions: when the cut-scores were at θ = −0.827, −0.034, and 0.694, recovery was better with some other proficiency estimators, and for the Guo-based indices with the cuts at θ = −0.75, −0.35, and 0.75, the ML estimator performed the best.

In terms of test length, the results follow what one would expect, with classification accuracy and classification consistency dropping considerably across the board between Tables 1 and 3 and between Tables 2 and 4, that is, when the test length decreased from 60 items to 30 items. When looking at

    these tables, one also notices that the differences between the TS estimator and the EAP, MAP,

    and ML estimators tended to become larger as the test length was decreased for the Rudner-

    and Guo-based indices. In addition, the differences between the IRT-recursive indices and the

    other indices also tended to increase when test length was decreased. This suggests that there

    are important potential interactions between the length of the test, different proficiency estima-

    tors, and the index that one chooses to employ.

In terms of the different cut-scores, Table 1 suggests that when the distribution was assumed to be normal with a mean of 0 and a standard deviation of 1 for the 60-item test, the values of the Rudner-based indices tended to be lowest when the cut-scores were at θ = −0.75, −0.35, and 0.75, and highest when the cut-scores were at θ = −0.827, −0.034, and 0.694. For the Guo-based classification accuracy indices, the TS and ML estimators were highest for θ = −0.75, −0.35, and 0.75, and lowest for θ = −0.827, −0.034, and 0.694. For the EAP and MAP estimators, the θ = −0.827, −0.034, and 0.694 cuts produced the highest classification accuracy estimates and the θ = −0.75, 0.00, and 0.75 cuts produced the lowest. Guo's classification consistency indices were highest for all estimators when the cuts were at θ = −0.75, −0.35, and 0.75. For the IRT-recursive-based indices, the pattern was slightly different: except for the TS estimator, the θ = −0.75, 0.00, and 0.75 cuts produced the highest values of the indices. For the TS estimator, the θ = −0.827, −0.034, and 0.694 cuts produced the highest classification accuracy and consistency. The patterns were not as clear and consistent when the ability distribution was assumed to be normal with a mean of 0.5 and a standard deviation of 1.25 for the 60-item test (see Table 2). In this case, many of the estimated values of classification accuracy and consistency were very similar and trivially different across cut-scores. For the 30-item tests (see Tables 3 and 4), the cuts at θ = −0.75, −0.35, and 0.75 tended to produce the highest classification accuracy and consistency for all three sets of indices for both distributions of examinees. It is also important to notice that when the ability distribution had a mean of 0.5 and a standard deviation of 1.25, as opposed to a mean of 0 and a standard deviation of 1, the values of the indices rose across the board. This is consistent with the understanding that classification accuracy and classification consistency go up as the ability of the examinees moves away from the cut-scores.

Figures 1 and 2 provide pictures of the classification accuracy and classification consistency for the three different indices at various θ locations. Figure 1 is for the 60-item test and Figure 2 is for the 30-item test. The figures do not assume a particular proficiency estimator and were


created based on the assumed known item parameters for the 30- and 60-item tests. The x-axis is the examinee's θ and the y-axis is the value of the classification accuracy or the classification consistency for that θ. The solid lines show the Rudner-based index, the jagged dotted lines show the Guo-based index, and the dashed lines show the IRT-recursive-based index. The lines for the Guo-based indices are not smooth due to the simulation of item responses needed for examinees at each θ to calculate the indices. For the Rudner-based and IRT-recursive-based indices, each θ can be applied in conjunction with the known item parameters without simulating item responses, creating a smooth line. The pictures clearly show the different functional forms of each of the indices and what the value of each index would be for an examinee at each θ value.
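Because this simulation step is what makes the Guo-based curves jagged, a minimal sketch of how a single point on a Guo-style classification accuracy curve might be computed is given below. This is an illustration rather than the exact implementation used here: the 3PL scaling constant D = 1.7, the standard normal latent density, the quadrature grid, and the rule of crediting the category that contains θ are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1234)

def p3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response; D = 1.7 is an assumed scaling constant."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def guo_accuracy_at_theta(theta, a, b, c, cuts, n_quad=201):
    """One point on a Guo-style accuracy curve for an examinee at theta."""
    # Simulate a single response pattern at this theta; redrawing the pattern
    # at each grid point is what produces the jagged dotted lines.
    x = (rng.random(a.size) < p3pl(theta, a, b, c)).astype(int)

    # Likelihood of the pattern across a quadrature grid, weighted by an
    # assumed N(0, 1) latent density (up to a constant).
    grid = np.linspace(-4.0, 4.0, n_quad)
    P = p3pl(grid[:, None], a, b, c)                     # n_quad x n_items
    like = np.prod(np.where(x == 1, P, 1.0 - P), axis=1)
    post = like * np.exp(-0.5 * grid ** 2)

    # Share of the (unnormalized) posterior that falls in the category
    # containing theta, with categories bounded by the cut-scores.
    edges = np.concatenate(([-np.inf], np.asarray(cuts), [np.inf]))
    k = np.searchsorted(edges, theta) - 1
    in_cat = (grid >= edges[k]) & (grid < edges[k + 1])
    return post[in_cat].sum() / post.sum()
```

Sweeping theta across a grid, with a fresh simulated response pattern at each point, traces out curves like the dotted lines in Figures 1 and 2.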

The figures help to explain some of the findings in Tables 1 through 4. In particular, it appears that the Guo-based indices have a slightly different pattern in terms of the value of the indices for examinees at different θs than the Rudner- and IRT-based indices, as they do not go up as high in between cut-scores or as low at the cut-scores as the other two indices. This is probably due in part to the use of the item response patterns and the focus on likelihood functions instead of examinee θs when computing the indices. One can also see that the Rudner-based indices exceeded the IRT-recursive-based indices in between the cut-scores. These two indices have dips at the cut-scores in Figures 1 and 2 that do not align exactly for the first two panels in each figure due to the rounding of the cut-scores. At the extremes of the θ distribution, the Guo-based and IRT-recursive-based indices had higher classification accuracy and consistency. This makes sense because, for the Rudner-based indices, having an extreme score often was associated with an extremely low value of the test information function, which would lower the classification accuracy and consistency. The IRT-recursive-based and Guo-based indices, however, do not consider test information, and extreme scores were associated

Figure 1. Plot of classification accuracy and consistency curves for simulations with 60 items
Note: The solid line in each panel is for the Rudner-based index, the dotted line is for the Guo-based index, and the dashed line is for the IRT-recursive-based index. The top left panel is the classification accuracy curves with cuts at θ = −0.75, 0.00, and 0.75; the top middle panel is the classification accuracy curves with cuts at θ = −0.75, −0.35, and 0.75; the top right panel is the classification accuracy curves with cuts at θ = −0.827, −0.034, and 0.694; the bottom left panel is the classification consistency curves with cuts at θ = −0.75, 0.00, and 0.75; the bottom middle panel is the classification consistency curves with cuts at θ = −0.75, −0.35, and 0.75; and the bottom right panel is the classification consistency curves with cuts at θ = −0.827, −0.034, and 0.694.


with higher classification accuracy and consistency. This is an important difference that is worth noting; it suggests that with a distribution of examinees containing more extreme scores, the Rudner-based indices would probably be lower than the other two indices. This is a potential downside to the Rudner-based indices, as one would anticipate that the probability of accurately and consistently classifying an examinee with an extreme score would be high. In many practical situations, most of the examinees will often be in the regions where the cut-scores are located, and one would probably expect that the Rudner-based indices would work well, as they did in the simulations.
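As a concrete companion to the figures, the sketch below shows how points on the Rudner-based accuracy and consistency curves can be computed for a single true θ under the 3PL model. It follows the normal approximation that underlies Rudner's formulation, with the estimated θ treated as normal around the true θ with standard error 1/√I(θ), and it obtains consistency by squaring and summing the category probabilities; the D = 1.7 constant and the function names are illustrative assumptions rather than the authors' code.

```python
import numpy as np
from scipy.stats import norm

def p3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response; D = 1.7 is an assumed scaling constant."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def test_information(theta, a, b, c, D=1.7):
    """3PL test information at theta, summed over items."""
    P = p3pl(theta, a, b, c, D)
    return np.sum((D * a) ** 2 * ((1.0 - P) / P) * ((P - c) / (1.0 - c)) ** 2)

def rudner_point(theta, a, b, c, cuts):
    """Rudner-style accuracy and consistency for an examinee at a true theta."""
    se = 1.0 / np.sqrt(test_information(theta, a, b, c))
    edges = np.concatenate(([-np.inf], np.asarray(cuts), [np.inf]))

    # Probability that the (approximately normal) estimated theta lands in
    # each category defined by the cut-scores.
    p = norm.cdf(edges[1:], loc=theta, scale=se) - norm.cdf(edges[:-1], loc=theta, scale=se)

    k = np.searchsorted(edges, theta) - 1   # category of the true theta
    accuracy = p[k]                         # same category as the truth
    consistency = np.sum(p ** 2)            # two parallel administrations agree
    return accuracy, consistency
```

Evaluating rudner_point over a grid of θ values reproduces the smooth solid lines: the dips occur where θ sits on a cut-score, and the low values at the extremes follow directly from the small test information there.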

    Michigan Merit Examination (MME) Data

    Data for the practical examples were drawn from the MME. The MME is a large-scale assess-

    ment given to 11th graders and some eligible 12th graders that is used for school accountability

    and adequate yearly progress determinations in Michigan. The MME has five subject tests

    (reading, math, science, writing, and social studies) consisting of items from the ACT,

    WorkKeys, and custom-Michigan-developed components. Subsets of items are selected from

ACT and WorkKeys along with the Michigan-developed components to align with Michigan's high school academic content standards. These items are used to determine an examinee's score

    in each subject. Data from the MME reading and math tests are considered in the examples in

    this article.

    The MME reading test consists of 51 operational multiple-choice items: 32 of the items come

    from the ACT reading test and 19 of the items come from the WorkKeys reading for informa-

tion test. An examinee's reported score is a linear transformation of his or her θ estimate from

Figure 2. Plot of classification accuracy and consistency curves for simulations with 30 items
Note: The solid line in each panel is for the Rudner-based index, the dotted line is for the Guo-based index, and the dashed line is for the IRT-recursive-based index. The top left panel is the classification accuracy curves with cuts at θ = −0.75, 0.00, and 0.75; the top middle panel is the classification accuracy curves with cuts at θ = −0.75, −0.35, and 0.75; the top right panel is the classification accuracy curves with cuts at θ = −0.745, −0.042, and 0.706; the bottom left panel is the classification consistency curves with cuts at θ = −0.75, 0.00, and 0.75; the bottom middle panel is the classification consistency curves with cuts at θ = −0.75, −0.35, and 0.75; and the bottom right panel is the classification consistency curves with cuts at θ = −0.745, −0.042, and 0.706.


applying the 3PL model to these data. The 3PL model exhibited moderate degrees of misfit: the MME technical report indicates that roughly 43% of the items did not fit the model according to the S-X2 fit statistic of Orlando and Thissen (2000). There were 98,423 examinees who received valid scores on the initial form of the MME reading test that were considered in this article. The estimated reliability for these data was .89.

    The MME math test is made up of 67 operational multiple-choice items: 3 of the items come

    from the WorkKeys locating information test, 12 of the items come from the WorkKeys applied

    mathematics test, 36 of the items come from the ACT mathematics test, and the remaining items

are custom-developed items. Scores reported to examinees again are a linear transformation of each examinee's θ estimate from applying the 3PL model to these data. According to the S-X2 fit statistic reported in the MME technical report, 28% of the items did not fit the model.

    There were 97,888 examinees who received valid scores on the MME math test considered in

    this article. The estimated reliability for these data was .87.

    Results for MME Data

    Table 5 displays the results for the classification accuracy and consistency for the MME read-

    ing and math tests for the three cut-scores that are used to make classification decisions on each

    assessment. The results for the Rudner-based indices are shown at the top of the table, the

    results for Guo-based indices are shown in the middle of the table, and the results for the IRT-

    recursive-based indices are shown at the bottom of the table. For all three indices, the classifi-

    cation accuracy and consistency were higher for the reading test compared with the math test,

    except for the MAP and EAP estimators for the Guo-based classification accuracy index.

    For the MME math test, the results were similar to the simulation, where the Guo-

    based indices tended to exceed the Rudner-based indices, which exceeded the IRT-recursive-

    based indices. The TS estimator had a larger classification accuracy value for the Rudner-based

    indices than for the Guo-based indices for these data. For the MME reading test, the results

Table 5. Estimated Classification Accuracy and Consistency for MME Reading and Math Tests

                          Reading (n = 98,423)         Math (n = 97,888)
Index      Estimator      Accuracy    Consistency      Accuracy    Consistency
Rudner     TS             0.829       0.763            0.800       0.727
           ML             0.821       0.761            0.806       0.734
           MAP            0.810       0.750            0.792       0.717
           EAP            0.807       0.746            0.800       0.726
Guo        TS             0.799       0.759            0.792       0.758
           ML             0.821       0.759            0.817       0.758
           MAP            0.801       0.759            0.814       0.758
           EAP            0.811       0.759            0.817       0.758
Recursive  TS             0.800       0.744            0.782       0.720
           ML             0.801       0.740            0.783       0.721
           MAP            0.792       0.730            0.763       0.696
           EAP            0.788       0.725            0.772       0.705

Note: MME = Michigan Merit Examination; IRT = item response theory; TS = IRT true-score estimator; ML = IRT maximum likelihood estimator; MAP = IRT maximum a posteriori estimator; EAP = IRT expected a posteriori estimator. Rudner is the calculation of the index based on Rudner's formulation, Guo is the calculation of the index based on Guo's formulation, and recursive is the calculation of the index based on the IRT-recursive-based formulation.


were different: in some cases, the Rudner-based indices were larger than the Guo-based indices for classification accuracy and consistency. The IRT-recursive-based indices were again the smallest. The largest differences between the three indices across the proficiency estimators

    were around 0.03 for the MME reading test and 0.05 for the MME math test. These levels of

    differences were somewhat similar to some of the differences observed in the simulation.

    Somewhat different from the simulation was the rank ordering of values of the indices across

    the proficiency estimators. In the simulation, the EAP and MAP estimators tended to have the

    highest values for the Rudner-based indices. However, in the practical examples, the EAP and

    MAP estimators had values that were lower than those for the TS and ML estimators for the

    Rudner-based indices. The EAP and MAP estimators also had lower classification accuracy

    and consistency estimates for the IRT-based recursive indices. The ML estimator again had the

    highest value for the Guo-based classification accuracy indices, and the classification consis-

    tency was the same for all estimators.

Figure 3 graphically displays the classification accuracy and classification consistency for the indices at various θ locations for the MME reading and math tests, similar to Figures 1 and 2. The top panels are for the MME reading test and the bottom panels are for the MME math test. The solid lines in the panels are for the Rudner-based indices, the jagged dotted lines are for the Guo-based indices, and the dashed lines are for the IRT-based recursive indices. The dips in the figures for the Rudner-based indices show the placements of the cut-scores. These dips do not lie directly on top of each other for the Rudner- and IRT-based indices due to the rounding needed to calculate the IRT-based recursive indices in the number-correct score metric. The spread and placement of the cut-scores differed between the two tests: for the reading test, the cut-scores were more spread apart, and for the math test, the cut-scores were closer together. The figures show the impact of these cut-score placements on the values of the indices. When the cut-scores were closer together, classification accuracy and consistency for individual examinees tended to

Figure 3. Plot of classification accuracy and consistency curves for MME reading and math
Note: MME = Michigan Merit Examination; IRT = item response theory. The solid line in each panel is for the Rudner-based index, the dotted line is for the Guo-based index, and the dashed line is for the IRT-recursive-based index. The top left panel is the classification accuracy curve with four performance levels for reading, the top right panel is the classification consistency curve with four performance levels for reading, the bottom right panel is the classification accuracy curve with four performance levels for math, and the bottom left panel is the classification consistency curve for math.


decrease in comparison with when they were farther apart. The panels also show that between the first and second cut-scores, the Rudner-based indices tended to exceed the IRT-based recursive indices. The lines for the Guo-based indices again had a different pattern than the Rudner-based or IRT-recursive indices, with dips not going as low and peaks not going as high.

    More notable differences for the Guo-based indices were found for the MME reading test com-

    pared with the MME math test. For the MME reading test, the peak between the first and second

    cut-scores is associated with a smooth dip and lower classification accuracy and consistency for

    the Guo-based index in comparison with the Rudner-based index. As many examinees had

    scores that were between these cut-scores, this may explain why the Rudner-based indices were

    in some cases higher for these data in Table 5.

    Discussion and Conclusion

The purposes of this article were (a) to introduce classification consistency indices based on Rudner's and Guo's formulations and (b) to evaluate the performance of the Rudner-based, Guo-based, and IRT-recursive-based indices in simulation and practice. The development of these new indices is important because many of the current approaches for calculating classification accuracy and consistency assume that the reporting metric is number-correct scores or a transformation of number-correct scores, which may not be the approach used to determine scores when IRT models are applied. This can lead to small, subtle differences in the cut-scores in the IRT θ metric due to the rounding needed to compute the indices. The Rudner- and Guo-based indices do not assume that the reporting metric is number-correct scores and can be applied when tests are scored in the θ metric or a linear transformation of this metric. The Rudner and Guo indices are also simpler to compute and are closely tied to assumptions used with ML estimation, which is an often-used approach for estimating examinee abilities with IRT models.
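For contrast with the θ-metric indices, the sketch below shows the two pieces of number-correct machinery that the IRT-recursive-based indices rest on: the Lord-Wingersky recursion for the distribution of raw scores given θ, and the rounding step that turns a θ-metric cut-score into a raw-score cut via the test characteristic curve. This is a hedged illustration of the general approach rather than the exact algorithm of Lee (2010); the rounding rule and the function names are assumptions of the sketch.

```python
import numpy as np

def p3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response; D = 1.7 is an assumed scaling constant."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

def lord_wingersky(p):
    """Distribution of the number-correct score given per-item probabilities p."""
    f = np.array([1.0])                                 # with zero items, P(X = 0) = 1
    for pi in p:
        f = (np.concatenate((f * (1.0 - pi), [0.0]))    # item answered incorrectly
             + np.concatenate(([0.0], f * pi)))         # item answered correctly
    return f                                            # length: number of items + 1

def raw_cut_from_theta(theta_cut, a, b, c):
    """Map a theta-metric cut to a number-correct cut via the test characteristic
    curve; the rounding here is what nudges the recursive indices' cuts off the
    theta-metric cuts."""
    return int(np.round(np.sum(p3pl(theta_cut, a, b, c))))
```

Classification probabilities for the recursive indices then come from summing the recursion's output over the raw scores in each category, so whatever distortion the rounding introduces carries into the resulting accuracy and consistency values.

Despite the conceptual and practical appeal of the Rudner- and Guo-based indices, the per-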

    formance of these indices with various proficiency estimators and across a variety of conditions

    has not been fully investigated. To date, only Martineau (2007) and Wyse (2011) have looked at

    the performance of Rudners classification accuracy index considering some of the factors that

    can affect the index. However, these articles did not consider the classification consistency

    index, did not look at the interaction of the indices with various proficiency estimators, and did

    not look at how the Rudner-based indices perform in comparison with other commonly used

    classification accuracy and consistency indices. The Guo-based classification index has only

    been compared with the Rudner-based index using a practical example when the index was orig-

inally formulated and has not been compared with the Rudner-based or IRT-recursive indices in a systematic way. This study provided an initial investigation of a few of these factors in a simu-

    lated and practical setting.

    Results from these investigations suggested that the Guo-based indices tended to have the

    highest classification accuracy and consistency, followed by the Rudner-based indices and the

IRT-based recursive indices. Guo's classification accuracy index and the Guo- and Rudner-based classification consistency indices performed the best for recovering classification accuracy and consistency. The differences among the three indices were often small, and in some cases trivial, but there were some differences on the magnitude of 0.04 or 0.05 units for the whole population. This finding is important, especially given that Lee (2010) has observed that IRT-based recursive indices also tend to be higher than values estimated with the non-IRT-

    based Livingston and Lewis (1995) procedure. This may suggest that there may be even larger

    differences between Guo- and Rudner-based indices and those from the Livingston and Lewis

    procedure. Future research could compare these indices in a variety of situations. This research


    would be valuable as the simulation in this study, although designed to look at a variety of fac-

    tors that can affect the indices, may not reflect the full range of factors that affect the indices in

    all situations. It is possible that the indices may perform differently with alternate tests, different

    placements of cut-scores, and various other factors, such as skewed score distributions.

    The results from the simulations and investigations also suggest some potential features of

    the Rudner- and Guo-based indices that should be highlighted. Specifically, the Guo-based clas-

    sification consistency tended to be notably higher than other indices and did not change with

    different proficiency estimators. This suggests that one should use caution when applying the

Guo's classification consistency index, and the Guo-based formulation may be better when investigations are focused only on classification accuracy or on a single proficiency estimator, such as the ML estimator. In terms of the Rudner-based indices, results suggested that the indices may be adversely affected when the examinee distribution contains more examinees with extreme scores, as extreme scores often have less test information. This may suggest that

    in these situations, one may want to consider the application of another index.

    A notable finding of this study was that the values of the classification accuracy and classifi-

cation consistency indices can change for various proficiency estimators. These findings are similar to those in Kolen and Tong (2010), who observed that the choice of different proficiency

    estimators can affect the number of students reported in different performance levels. It is well

    known that alternate proficiency estimators have different properties and that choosing different

    estimators can change examinee ability estimates and classifications. This article highlights that

    the choice of proficiency estimator can also affect the value of the classification accuracy and

    consistency. Additional research could evaluate classification accuracy and consistency with

    different proficiency estimators in other contexts.
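To see concretely where such estimator effects can originate, the sketch below contrasts EAP and ML estimates computed from the same 3PL response pattern; the item parameters, the N(0, 1) prior, and the simple grid-search maximization are illustrative assumptions standing in for production scoring routines.

```python
import numpy as np

def pattern_likelihood(x, a, b, c, grid, D=1.7):
    """Likelihood of response pattern x at each theta on the grid (3PL model)."""
    P = c + (1.0 - c) / (1.0 + np.exp(-D * a * (grid[:, None] - b)))
    return np.prod(np.where(x == 1, P, 1.0 - P), axis=1)

# Illustrative three-item test and response pattern (assumed values).
a = np.array([1.0, 1.2, 0.8])
b = np.array([-0.5, 0.0, 0.7])
c = np.array([0.2, 0.2, 0.2])
x = np.array([1, 1, 0])

grid = np.linspace(-4.0, 4.0, 401)
L = pattern_likelihood(x, a, b, c, grid)
prior = np.exp(-0.5 * grid ** 2)          # N(0, 1) prior, up to a constant

theta_eap = np.sum(grid * L * prior) / np.sum(L * prior)  # posterior mean
theta_ml = grid[np.argmax(L)]                             # likelihood maximizer
```

On short tests the prior pulls the EAP estimate toward the center of the latent distribution, while ML can move farther toward the extremes; an examinee near a cut-score can therefore be classified differently under different estimators, which is one route by which the estimator choice feeds into the indices.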

This study also highlights the benefit of creating classification accuracy and classification consistency plots, like those in Figures 1 to 3, when investigating different classification indices. These plots allow the researcher and the practitioner to look across the range of possible scores and identify regions in which the indices are performing differently. These pictures can also help

    identify possible explanations for why the indices tended to produce disparate values in simu-

    lated and practical settings. In this article, the figures helped to show some of the differences in

    how the Rudner-based indices treated extreme scores; the Rudner-based indices tended to have

    lower classification accuracy and consistency for extreme scores because these scores tended to

    be associated with lower values of the IRT test information function. The figures also depicted

some differences in the cut-scores due to the rounding of scores that was needed with the IRT-recursive-based procedure, as well as higher values for the Rudner-based indices between the

    cut-scores. One can also see some of the fundamental differences between the Guo-based and

the other indices due to the use of likelihood functions and response patterns with the Guo indices. The Guo indices produced graphs that were less smooth and in which the peaks and

    valleys between and at cut-scores were less pronounced. In the simulation, this produced results

    in which the Guo-based indices often exceeded the Rudner- and IRT-based indices. However,

    as the MME reading practical example suggested, it will not always hold that the Guo-based

    indices will be larger than the other indices. The figures do suggest that it is possible for the

    indices to be higher or lower depending on the distribution of examinee scores.

    It is also important to point out the rather obvious observation that there are a number of fac-

    tors that can affect classification accuracy and consistency. Some of the factors that can affect

    the classification accuracy and consistency values estimated include the classification accuracy

    or consistency index chosen, the distribution of examinee performance, the number and place-

    ment of the cut-scores, the proficiency estimator and scoring metric chosen, the properties and

    number of items in the assessment, and the models applied to the test data as well as the fit of

    those models. It is also important to note the relationships that exist between classification

    622 Applied Psychological Measurement 36(7)

    http://apm.sagepub.com/
  • 7/30/2019 An Evaluation of Item Response Theory Classification Accuracy and Consistency Indices

    22/23

    accuracy and classification consistency. As Equations 7 and 8 suggest, classification accuracy

    should be greater than or equal to classification consistency given that classification accuracy

    involves computations assuming a single administration of the assessment, and classification

    consistency involves administrations of parallel forms or the squaring of computations from a

    single administration to approximate the similarity of classifications across forms. Both types of

indices can be valuable, but the indices address different questions. This suggests that, depending on the situation and the question of interest related to the classification decisions, one index or the other may be more appropriate. This implies that reporting of both indices may not always be necessary. It also means that a critical consideration is having classification accuracy and consistency indices that come from the same foundation and that can be applied to the same data, because the question asked may better fit one type of index or the other.
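In hedged notation (a restatement of the idea behind Equations 7 and 8, not a reproduction of them), let p_k(θ) be the probability that an examinee at θ is classified into level k and let k* denote the examinee's true level. Whenever the true level is also the most probable assigned level, consistency cannot exceed accuracy, since the category probabilities sum to 1:

```latex
\mathrm{consistency}(\theta)
  = \sum_{k} p_{k}(\theta)^{2}
  \le \Bigl(\max_{k} p_{k}(\theta)\Bigr) \sum_{k} p_{k}(\theta)
  = p_{k^{*}}(\theta)
  = \mathrm{accuracy}(\theta).
```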

    Declaration of Conflicting Interests

    The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or pub-

    lication of this article.

    Funding

    The author(s) received no financial support for the research, authorship, and/or publication of this article.

    References

    Brennan, R. L., & Wan, L. (2004). Bootstrap procedures for estimating decision consistency for single-

    administration complex assessments (CASMA Research Report No. 7). Iowa City: Center for

    Advanced Studies in Measurement and Assessment, University of Iowa.

Ercikan, K., & Julian, M. (2002). Classification accuracy of assigning student performance to proficiency levels: Guidelines for assessment design. Applied Measurement in Education, 15, 269-294.

    Guo, F. (2006). Expected classification accuracy using the latent distribution. Practical Assessment

    Research & Evaluation, 11(6). Retrieved from http://pareonline.net/pdf/v11n6.pdf

    Hanson, B. A., & Brennan, R. L. (1990). An investigation of classification consistency indexes estimated

    under alternative strong true score models. Journal of Educational Measurement, 27, 345-359.

    Huynh, H. (1976). On the reliability of decisions in domain-referenced testing. Journal of Educational

    Measurement, 13, 253-264.

    Kolen, M. J., & Tong, Y. (2010). Psychometric properties of IRT proficiency estimates. Educational

    Measurement: Issues and Practice, 29, 8-14.

    Lee, W. (2010). Classification consistency and accuracy for complex assessments using item response

theory. Journal of Educational Measurement, 47, 1-17.
Lee, W., Brennan, R. L., & Wan, L. (2009). Classification consistency and accuracy for complex assessments

    under the compound multinomial model. Applied Psychological Measurement, 33, 374-390.

    Lee, W., Hanson, B. A., & Brennan, R. L. (2002). Estimating consistency and accuracy indices for

    multiple classifications. Applied Psychological Measurement, 26, 412-432.

    Livingston, S. A., & Lewis, C. (1995). Estimating the consistency and accuracy of classifications based on

    test scores. Journal of Educational Measurement, 32, 179-197.

    Martineau, J. A. (2007). An expansion and practical evaluation of expected classification accuracy.

    Applied Psychological Measurement, 31, 181-194.

    Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response

    theory models. Applied Psychological Measurement, 24, 50-64.

    Rudner, L. M. (2001). Computing the expected proportions of misclassified examinees. Practical

    Assessment Research & Evaluation, 7(14). Retrieved from http://PAREonline.net/getvn.asp?v=7&n=14

    Rudner, L. M. (2005). Expected classification accuracy. Practical Assessment Research & Evaluation,

    10(13). Retrieved from http://pareonline.net/pdf/v10n13.pdf
