
The Influence of Item Calibration Error on Variable-Length Computerized Adaptive Testing

Transcript
Page 1

The Influence of Item Calibration Error on Variable-Length Computerized Adaptive Testing

Ying Cheng

2/5/2012

Page 2

Outline

• Introduction
• Review prior research investigating the effects of item calibration error on related measurement procedures.
• Purpose
• Method
• Results
• Conclusions

Page 3

Introduction

• Variable-length computerized adaptive testing (VL-CAT): adaptive in terms of both item selection and test length.

• Any CAT program requires a bank of previously calibrated item parameters, but these are often assumed to be the true values.

• However, only estimates of the item parameters are available, and because adaptive item selection involves optimization, capitalization on chance may occur.
  – van der Linden and Glas (2000) examined the effects of capitalization on item calibration error in fixed-length CAT.

Page 4

• Calibration error: the magnitude of sampling variability in item parameter estimates as determined by the size of the calibration sample for a given method of calibration and distribution of the latent trait.
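To make this concrete, the hedged Python sketch below repeatedly "calibrates" a single 2PL item from simulated samples of different sizes and reports how much the estimates spread. It is only an illustration: θ is treated as known and one item is fit at a time, which is much simpler than the marginal calibration used in practice, and every name and setting here is invented rather than taken from the study.

```python
# Illustrative only: sampling variability of 2PL item parameter estimates
# shrinks as the calibration sample size N grows. Theta is treated as known.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
a_true, b_true = 1.2, 0.3   # assumed generating parameters for one item

def p_correct(theta, a, b):
    # 2PL response probability
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def calibrate(theta, y):
    """Estimate (a, b) for one item by maximum likelihood, theta treated as known."""
    def negloglik(pars):
        a, b = pars
        p = np.clip(p_correct(theta, a, b), 1e-9, 1 - 1e-9)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return minimize(negloglik, x0=[1.0, 0.0], method="Nelder-Mead").x

for N in (500, 1000, 2500):
    estimates = []
    for _ in range(200):                         # 200 replications per sample size
        theta = rng.standard_normal(N)           # latent trait ~ N(0, 1)
        y = (rng.random(N) < p_correct(theta, a_true, b_true)).astype(int)
        estimates.append(calibrate(theta, y))
    estimates = np.array(estimates)
    print(f"N={N:4d}  SD(a_hat)={estimates[:, 0].std():.3f}  "
          f"SD(b_hat)={estimates[:, 1].std():.3f}")
```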

Page 5

Termination criteria in VL-CAT

• Conditional standard error (CSE): the test ends when the standard error of the latent trait estimate falls below a predetermined threshold.
  – Achieves roughly uniform measurement precision across the range of ability.
  – Test length depends largely on the examinee's latent trait level.
    • Examinees with extreme true θ values will tend to have long tests.
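A minimal sketch of the CSE check, assuming the SE is computed as 1/√(test information); the .316 threshold and 15-item minimum come from the deck, while the 40-item cap and the function interface are assumptions.

```python
import math

def cse_stop(item_informations, threshold=0.316, min_items=15, max_items=40):
    """CSE rule: stop once SE(theta_hat) = 1/sqrt(sum of item information)
    falls below the threshold, subject to a minimum/maximum test length."""
    n = len(item_informations)
    if n == 0:
        return False
    se = 1.0 / math.sqrt(sum(item_informations))
    return n >= max_items or (n >= min_items and se <= threshold)

print(cse_stop([0.8] * 12))   # False: fewer than 15 items administered
print(cse_stop([0.8] * 15))   # True: SE ~= 0.29 <= .316
```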

Page 6

• Ability confidence interval (ACI): the test stops when the confidence interval for θ (e.g., 95%) falls entirely above or below the cut point.
  – Test length depends on the location of true ability relative to the cut point.
    • Examinees with true θ values near the cut will tend to have very long tests.
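A matching sketch of the ACI check under the same assumptions, with a 95% interval (z = 1.96) and the θ = 0.5 cut used later in the deck; names are invented.

```python
import math

def aci_stop(theta_hat, item_informations, cut=0.5, z=1.96,
             min_items=15, max_items=40):
    """ACI rule: stop once the z-level confidence interval for theta lies
    entirely above or entirely below the cut point."""
    n = len(item_informations)
    if n == 0:
        return False
    se = 1.0 / math.sqrt(sum(item_informations))
    lower, upper = theta_hat - z * se, theta_hat + z * se
    return n >= max_items or (n >= min_items and (lower > cut or upper < cut))

print(aci_stop(1.2, [0.8] * 15))   # True: (0.63, 1.77) lies entirely above the cut
print(aci_stop(0.6, [0.8] * 15))   # False: the interval still straddles 0.5
```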

Page 7

Calibration Error and Latent Trait Estimation

• In item response theory (IRT), the true item parameters are assumed to be known; the SE of the latent trait estimate therefore reflects only measurement error.

• In practice, only estimates of the item parameters can be obtained; the SE will therefore be underestimated when this additional source of error is ignored.

Page 8

• Cheng and Yuan (2010): an "upward correction" to the asymptotic SE of the maximum likelihood ability estimate:

  $SE^{*}(\hat\theta) = \sqrt{I^{-1}(\hat\theta) + I^{-2}(\hat\theta)\, v(\hat\theta)}$, where $v(\hat\theta)$ captures the error in the item parameter estimates.

  – SE* will be larger than the SE based on test information alone.
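A small numeric illustration of the correction as reconstructed above; the values of I(θ̂) and v(θ̂) are made up, and the only point is that SE* exceeds the information-only SE.

```python
import math

I = 10.0   # test information at theta_hat (made-up value)
v = 5.0    # hypothetical correction term reflecting item calibration error

se_plain = math.sqrt(1.0 / I)             # ~0.316: SE from test information only
se_star = math.sqrt(1.0 / I + v / I**2)   # ~0.387: upward-corrected SE*
print(se_plain, se_star)
```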

Page 9

Capitalization on Calibration Error via Item Selection

• Items with large a values are generally preferred in the two- or three-parameter logistic model (2PLM or 3PLM) when maximum item information is used for item selection.

• Calibration sample size: the larger the calibration error, the larger the effect of capitalization on that error.

• The ratio of item bank size to test length: the larger the ratio, the greater the likelihood of selecting only items with large estimation error.
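The hedged sketch below makes this mechanism concrete under invented settings (a 2PL bank, a fixed θ, and noise added directly to the item parameter estimates instead of a real calibration): choosing the top items by information computed from the estimated parameters favors items whose a values happen to be overestimated, so the "estimated" test information overstates the true information of the administered items.

```python
import numpy as np

rng = np.random.default_rng(7)
n_bank, test_len, theta = 500, 20, 0.0    # bank size, test length, fixed ability

a_true = rng.lognormal(mean=0.0, sigma=0.3, size=n_bank)  # true discriminations
b_true = rng.normal(0.0, 1.0, size=n_bank)                # true difficulties
a_hat = a_true + rng.normal(0.0, 0.25, size=n_bank)       # noisy "calibrated" a
b_hat = b_true + rng.normal(0.0, 0.25, size=n_bank)       # noisy "calibrated" b

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

est_info = info_2pl(theta, a_hat, b_hat)
picked = np.argsort(est_info)[-test_len:]   # maximum-(estimated-)information selection

print("mean a_hat of selected items :", round(a_hat[picked].mean(), 3))
print("mean a_true of selected items:", round(a_true[picked].mean(), 3))  # typically smaller
print("estimated test information   :", round(est_info[picked].sum(), 2))
print("true test information        :",
      round(info_2pl(theta, a_true[picked], b_true[picked]).sum(), 2))    # typically smaller
```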

Page 10

Purpose of Study

• Manipulate the magnitude of calibration error via the calibration sample size, and examine the effects on average test length and classification accuracy in several realistic VL-CAT scenarios.

Page 11

Method

• Independent variables:
  – IRT model: 2PLM or 3PLM
  – Termination criterion: CSE (threshold of .316) or ACI (95%)
  – Calibration sample size: N = ∞, 2500, 1000, or 500
• Dependent variables:
  – Average test length
  – Empirical bias of the latent trait estimate
  – Percentage of correctly classified examinees at each true value of θ

  – Relative test efficiency:

    $\text{relative test efficiency} = \dfrac{\sum_{i=1}^{m} I_i(\hat\theta,\,\hat\xi_i)}{\sum_{i=1}^{m} I_i(\hat\theta,\,\xi_i)}$

    where $\hat\xi_i$ and $\xi_i$ are the estimated and true parameters of item $i$, and $m$ is the number of items administered.
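A minimal sketch of this ratio for a 2PL test, following the reconstruction above; the helper names and the toy 10% overestimation of the discriminations are illustrative only.

```python
import numpy as np

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def relative_test_efficiency(theta_hat, a_hat, b_hat, a_true, b_true):
    """Test information under estimated item parameters divided by test
    information under the true item parameters, both evaluated at theta_hat."""
    return (info_2pl(theta_hat, a_hat, b_hat).sum()
            / info_2pl(theta_hat, a_true, b_true).sum())

# Toy example: a 20-item test whose discriminations are overestimated by 10%.
rng = np.random.default_rng(0)
a_true = rng.lognormal(0.0, 0.3, 20)
b_true = rng.normal(0.0, 1.0, 20)
print(relative_test_efficiency(0.0, 1.1 * a_true, b_true, a_true, b_true))  # prints a value above 1
```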

Page 12

Results

• Relative Test Efficiency and Test Length
  – Whether the maximum information criterion capitalized on calibration error: Figure 2
  – Implications of capitalization on chance for test length: Figure 3

• Ability Recovery and Classification Accuracy
  – Conditional bias: Figure 4
  – The effect of calibration error on classification accuracy: Tables 3 & 4

Page 13

Discussion

• CSE Termination Rule
  – Test length was sensitive to the magnitude of item calibration error.
  – The maximum likelihood ability estimator may exhibit non-negligible bias in the presence of calibration error.
  – Classification accuracy tended to suffer for small calibration samples, but because the bias in the latent trait estimate was not large in the vicinity of the cut, the reduction in classification accuracy was also not large (no more than 5%).

Page 14

• ACI Termination Rule
  – Test length was clearly robust to the magnitude of calibration error.
  – The pattern and magnitude of bias were similar for all values of N, so there was no strong or systematic effect of N on classification accuracy.
  – Because the ACI rule is sensitive to the cut location, we suspect that the robustness of bias and classification accuracy to the magnitude of calibration error may hold even for more extreme cut locations.

Page 15

• Limitations of the Current Study
  – Whether alternative item selection criteria would also be sensitive to capitalization on chance.
  – Whether alternative stopping rules might also be sensitive to capitalization on chance.
  – Imposing non-statistical constraints on item selection (e.g., exposure control and content balancing).
  – Using the upward-corrected SE; this method is not currently feasible in adaptive testing scenarios.

Page 16

Figure 2

[Relative efficiency as a function of θ for (a) the 2PLM and (b) the 3PLM; separate curves for N = ∞, 2500, 1000, and 500.]

• Regardless of IRT model, the true and "estimated" item parameters are identical in the N = ∞ conditions, so relative efficiency is equal to one for all values of θ.
• As N decreases, relative efficiency steadily increases: overestimation of item information becomes more severe as N decreases.
• The problem of capitalization on chance is greater for smaller calibration samples and for the more complex model.

Page 17

Figure 3a & 3b

[Average test length as a function of θ for (a) the 2PLM and (b) the 3PLM under the CSE criterion; separate curves for N = ∞, 2500, 1000, and 500.]

• Tests tend to be spuriously short for small values of N, regardless of IRT model.
• The effect of N on test length is relatively uniform for the 2PLM conditions, whereas the effect varies quite a bit for the 3PLM conditions.

Page 18

Figure 3c & 3d

[Average test length as a function of θ for (c) the 2PLM and (d) the 3PLM under the ACI criterion.]

• Tests are quite long near the cut, whereas only the minimum 15 items are required farther from the cut.
• There is only a negligible effect of N on the average test length, save for a small region near the cut in the 3PLM conditions.

Page 19

Figure 4a & 4b

[Bias as a function of θ for (a) the 2PLM and (b) the 3PLM under the CSE criterion; separate curves for N = ∞, 2500, 1000, and 500.]

• As N decreases, there emerges a systematic relationship between bias and ability. In particular, the magnitude of bias is greatest at the extremes.

Page 20

Figure 4c & 4d

[Bias as a function of θ for (c) the 2PLM and (d) the 3PLM under the ACI criterion; the cut at θ = 0.5 is marked in each panel.]

• Bias is negative below the cut (θ = 0.5) and positive above it, and this trend is most apparent in the region near the cut (i.e., 0 < θ < 1).


Page 21

Table 3

                      Cut = 0.5, θ level                   Cut = 1.5, θ level
Model  N            .00    .25    .50    .75   1.00     1.00   1.25   1.50   1.75   2.00
2PLM   ∞           93.6   80.5   48.5   80.7   93.9     95.2   80.4   52.7   79.2   93.7
       2500        94.5   77.6   49.3   78.8   94.0     95.3   78.7   50.0   76.0   93.3
       1000        92.7   78.5   48.8   77.5   94.5     93.7   76.9   46.0   73.2   91.9
       500         94.7   77.7   48.9   75.7   92.9     93.7   84.3   42.0   67.9   88.0
       Difference*  1.1   −2.8    0.4   −4.9   −0.9     −1.5    3.9  −10.7  −11.3   −5.7
3PLM   ∞           92.1   78.7   48.4   77.7   94.9     95.5   78.1   52.9   77.6   93.3
       2500        93.2   75.3   48.9   80.9   94.3     94.0   80.5   46.5   77.3   91.3
       1000        93.2   74.4   49.5   76.7   93.6     91.6   81.6   42.0   72.4   90.1
       500         91.7   73.7   50.1   76.3   93.5     91.7   80.3   38.8   65.2   84.4
       Difference  −0.4   −4.9    1.7   −1.5   −1.5     −3.7    2.1  −14.1  −12.4   −8.9

• In general, classification accuracy decreases as N decreases, but this is not always the case.

Page 22

Table 4

                      Cut = 0.5, θ level
Model  N            .00    .25    .50    .75   1.00
2PLM   ∞           98.5   87.7   49.7   86.7   98.5
       2500        99.1   87.1   48.3   87.5   98.9
       1000        98.7   88.8   47.6   85.6   98.7
       500         98.4   87.1   45.9   83.5   98.8
       Difference* −0.1   −0.7   −3.9   −3.2    0.3
3PLM   ∞           96.0   80.5   47.5   82.0   97.5
       2500        96.3   80.4   50.3   86.1   97.3
       1000        96.3   78.1   49.7   81.9   96.0
       500         94.0   79.7   50.9   82.1   96.5
       Difference  −2.0   −0.8    3.5    0.1   −0.9

• There is no consistent relationship between N and classification accuracy.
• The relationship between bias and true ability depends on the location of the cut.

