
The Influence of Item Calibration Error on Variable-Length Computerized Adaptive Testing

Transcript
Page 1

The Influence of Item Calibration Error on Variable-Length Computerized Adaptive Testing

Ying Cheng

2/5/2012

Page 2

Outline

• Introduction
• Review prior research investigating the effects of item calibration error on related measurement procedures.
• Purpose
• Method
• Results
• Conclusions

Page 3

Introduction

• Variable-length computerized adaptive testing (VL-CAT): adaptive in terms of both item selection and test length.

• Any CAT program requires a bank of previously calibrated item parameters, but these are often assumed to be the true values.

• However, only estimates of the item parameters are available, and because adaptive item selection involves optimization, capitalization on chance may occur.
  – van der Linden and Glas (2000) examined the effects of capitalization on item calibration error in fixed-length CAT.

Page 4

• Calibration error: the magnitude of sampling variability in item parameter estimates as determined by the size of the calibration sample for a given method of calibration and distribution of the latent trait.
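To make this concrete, the hedged Python sketch below repeatedly "calibrates" a single 2PL item from simulated samples of different sizes and reports how much the estimates spread. It is only an illustration: θ is treated as known and one item is fit at a time, which is much simpler than the marginal calibration used in practice, and every name and setting here is invented rather than taken from the study.

```python
# Illustrative only: sampling variability of 2PL item parameter estimates
# shrinks as the calibration sample size N grows. Theta is treated as known.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
a_true, b_true = 1.2, 0.3   # assumed generating parameters for one item

def p_correct(theta, a, b):
    # 2PL response probability
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def calibrate(theta, y):
    """Estimate (a, b) for one item by maximum likelihood, theta treated as known."""
    def negloglik(pars):
        a, b = pars
        p = np.clip(p_correct(theta, a, b), 1e-9, 1 - 1e-9)
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return minimize(negloglik, x0=[1.0, 0.0], method="Nelder-Mead").x

for N in (500, 1000, 2500):
    estimates = []
    for _ in range(200):                         # 200 replications per sample size
        theta = rng.standard_normal(N)           # latent trait ~ N(0, 1)
        y = (rng.random(N) < p_correct(theta, a_true, b_true)).astype(int)
        estimates.append(calibrate(theta, y))
    estimates = np.array(estimates)
    print(f"N={N:4d}  SD(a_hat)={estimates[:, 0].std():.3f}  "
          f"SD(b_hat)={estimates[:, 1].std():.3f}")
```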

Page 5

Termination criteria in VL-CAT

• Conditional standard error (CSE): the test ends when the standard error of the latent trait estimate falls below a predetermined threshold.
  – Achieves roughly uniform measurement precision across the range of ability.
  – Test length depends largely on the examinee's latent trait level.
    • Examinees with extreme true θ values will tend to have long tests.
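A minimal sketch of the CSE check, assuming the SE is computed as 1/√(test information); the .316 threshold and 15-item minimum come from the deck, while the 40-item cap and the function interface are assumptions.

```python
import math

def cse_stop(item_informations, threshold=0.316, min_items=15, max_items=40):
    """CSE rule: stop once SE(theta_hat) = 1/sqrt(sum of item information)
    falls below the threshold, subject to a minimum/maximum test length."""
    n = len(item_informations)
    if n == 0:
        return False
    se = 1.0 / math.sqrt(sum(item_informations))
    return n >= max_items or (n >= min_items and se <= threshold)

print(cse_stop([0.8] * 12))   # False: fewer than 15 items administered
print(cse_stop([0.8] * 15))   # True: SE ~= 0.29 <= .316
```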

Page 6

• Ability confidence interval (ACI): the test stops when the confidence interval for θ (e.g., 95%) falls entirely above or below the cut point.
  – Test length depends on the location of true ability relative to the cut point.
    • Examinees with true θ values near the cut will tend to have very long tests.
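A matching sketch of the ACI check under the same assumptions, with a 95% interval (z = 1.96) and the θ = 0.5 cut used later in the deck; names are invented.

```python
import math

def aci_stop(theta_hat, item_informations, cut=0.5, z=1.96,
             min_items=15, max_items=40):
    """ACI rule: stop once the z-level confidence interval for theta lies
    entirely above or entirely below the cut point."""
    n = len(item_informations)
    if n == 0:
        return False
    se = 1.0 / math.sqrt(sum(item_informations))
    lower, upper = theta_hat - z * se, theta_hat + z * se
    return n >= max_items or (n >= min_items and (lower > cut or upper < cut))

print(aci_stop(1.2, [0.8] * 15))   # True: (0.63, 1.77) lies entirely above the cut
print(aci_stop(0.6, [0.8] * 15))   # False: the interval still straddles 0.5
```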

Page 7

Calibration Error and Latent Trait Estimation

• In item response theory (IRT), the true item parameters are assumed to be known; the SE of the latent trait estimate therefore reflects only measurement error.

• In practice, only estimates of the item parameters can be obtained; the SE will therefore be underestimated when this additional source of error is ignored.

Page 8

• Cheng and Yuan (2010): an "upward correction" to the asymptotic SE of the maximum likelihood ability estimate:

  $SE^{*}(\hat\theta) = \sqrt{I^{-1}(\hat\theta) + I^{-2}(\hat\theta)\, v(\hat\theta)}$, where $v(\hat\theta)$ captures the error in the item parameter estimates.

  – SE* will be larger than the SE based on test information alone.
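A small numeric illustration of the correction as reconstructed above; the values of I(θ̂) and v(θ̂) are made up, and the only point is that SE* exceeds the information-only SE.

```python
import math

I = 10.0   # test information at theta_hat (made-up value)
v = 5.0    # hypothetical correction term reflecting item calibration error

se_plain = math.sqrt(1.0 / I)             # ~0.316: SE from test information only
se_star = math.sqrt(1.0 / I + v / I**2)   # ~0.387: upward-corrected SE*
print(se_plain, se_star)
```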

Page 9

Capitalization on Calibration Error via Item Selection

• Items with large a values are generally preferred in the two- or three-parameter logistic model (2PLM or 3PLM) when maximum item information is used for item selection.

• Calibration sample size: the larger the calibration error, the larger the effect of capitalization on that error.

• The ratio of item bank size to test length: the larger the ratio, the greater the likelihood of selecting only items with large estimation error.
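The hedged sketch below makes this mechanism concrete under invented settings (a 2PL bank, a fixed θ, and noise added directly to the item parameter estimates instead of a real calibration): choosing the top items by information computed from the estimated parameters favors items whose a values happen to be overestimated, so the "estimated" test information overstates the true information of the administered items.

```python
import numpy as np

rng = np.random.default_rng(7)
n_bank, test_len, theta = 500, 20, 0.0    # bank size, test length, fixed ability

a_true = rng.lognormal(mean=0.0, sigma=0.3, size=n_bank)  # true discriminations
b_true = rng.normal(0.0, 1.0, size=n_bank)                # true difficulties
a_hat = a_true + rng.normal(0.0, 0.25, size=n_bank)       # noisy "calibrated" a
b_hat = b_true + rng.normal(0.0, 0.25, size=n_bank)       # noisy "calibrated" b

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

est_info = info_2pl(theta, a_hat, b_hat)
picked = np.argsort(est_info)[-test_len:]   # maximum-(estimated-)information selection

print("mean a_hat of selected items :", round(a_hat[picked].mean(), 3))
print("mean a_true of selected items:", round(a_true[picked].mean(), 3))  # typically smaller
print("estimated test information   :", round(est_info[picked].sum(), 2))
print("true test information        :",
      round(info_2pl(theta, a_true[picked], b_true[picked]).sum(), 2))    # typically smaller
```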

Page 10

Purpose of Study

• Manipulate the magnitude of calibration error via the calibration sample size, and examine the effects on average test length and classification accuracy in several realistic VL-CAT scenarios.

Page 11

Method

• Independent variables:
  – IRT model: 2PLM or 3PLM
  – Termination criterion: CSE (threshold of .316) or ACI (95%)
  – Calibration sample size: N = ∞, 2500, 1000, or 500
• Dependent variables:
  – Average test length
  – Empirical bias of the latent trait estimate
  – Percentage of correctly classified examinees at each true value of θ

  – Relative test efficiency:

    $\text{relative test efficiency} = \dfrac{\sum_{i=1}^{m} I_i(\hat\theta,\,\hat\xi_i)}{\sum_{i=1}^{m} I_i(\hat\theta,\,\xi_i)}$

    where $\hat\xi_i$ and $\xi_i$ are the estimated and true parameters of item $i$, and $m$ is the number of items administered.
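A minimal sketch of this ratio for a 2PL test, following the reconstruction above; the helper names and the toy 10% overestimation of the discriminations are illustrative only.

```python
import numpy as np

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def relative_test_efficiency(theta_hat, a_hat, b_hat, a_true, b_true):
    """Test information under estimated item parameters divided by test
    information under the true item parameters, both evaluated at theta_hat."""
    return (info_2pl(theta_hat, a_hat, b_hat).sum()
            / info_2pl(theta_hat, a_true, b_true).sum())

# Toy example: a 20-item test whose discriminations are overestimated by 10%.
rng = np.random.default_rng(0)
a_true = rng.lognormal(0.0, 0.3, 20)
b_true = rng.normal(0.0, 1.0, 20)
print(relative_test_efficiency(0.0, 1.1 * a_true, b_true, a_true, b_true))  # prints a value above 1
```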

Page 12

Results

• Relative Test Efficiency and Test Length
  – Whether the maximum information criterion capitalized on calibration error: Figure 2
  – Implications of capitalization on chance for test length: Figure 3

• Ability Recovery and Classification Accuracy
  – Conditional bias: Figure 4
  – The effect of calibration error on classification accuracy: Tables 3 & 4

Page 13

Discussion

• CSE Termination Rule
  – Test length was sensitive to the magnitude of item calibration error.
  – The maximum likelihood ability estimator may exhibit non-negligible bias in the presence of calibration error.
  – Classification accuracy tended to suffer for small calibration samples, but because the bias in the latent trait estimate was not large in the vicinity of the cut, the reduction in classification accuracy was also not large (no more than 5%).

Page 14

• ACI Termination Rule
  – Test length was clearly robust to the magnitude of calibration error.
  – The pattern and magnitude of bias were similar for all values of N, so there was no strong or systematic effect of N on classification accuracy.
  – Because the ACI rule is sensitive to the cut location, we suspect that the robustness of bias and classification accuracy to the magnitude of calibration error may hold even for more extreme cut locations.

Page 15

• Limitations of the Current Study
  – Whether alternative item selection criteria would also be sensitive to capitalization on chance.
  – Whether alternative stopping rules might also be sensitive to capitalization on chance.
  – Imposing non-statistical constraints on item selection (e.g., exposure control and content balancing).
  – Using the upward-corrected SE; this method is not currently feasible in adaptive testing scenarios.

Page 16

Figure 2

[Relative efficiency as a function of θ for (a) the 2PLM and (b) the 3PLM; separate curves for N = ∞, 2500, 1000, and 500.]

• Regardless of IRT model, the true and "estimated" item parameters are identical in the N = ∞ conditions, so relative efficiency is equal to one for all values of θ.
• As N decreases, relative efficiency steadily increases: overestimation of item information becomes more severe as N decreases.
• The problem of capitalization on chance is greater for smaller calibration samples and for the more complex model.

Page 17

Figure 3a & 3b

[Average test length as a function of θ for (a) the 2PLM and (b) the 3PLM under the CSE criterion; separate curves for N = ∞, 2500, 1000, and 500.]

• Tests tend to be spuriously short for small values of N, regardless of IRT model.
• The effect of N on test length is relatively uniform for the 2PLM conditions, whereas the effect varies quite a bit for the 3PLM conditions.

Page 18

Figure 3c & 3d

[Average test length as a function of θ for (c) the 2PLM and (d) the 3PLM under the ACI criterion.]

• Tests are quite long near the cut, whereas only the minimum 15 items are required farther from the cut.
• There is only a negligible effect of N on the average test length, save for a small region near the cut in the 3PLM conditions.

Page 19

Figure 4a & 4b

[Bias as a function of θ for (a) the 2PLM and (b) the 3PLM under the CSE criterion; separate curves for N = ∞, 2500, 1000, and 500.]

• As N decreases, there emerges a systematic relationship between bias and ability. In particular, the magnitude of bias is greatest at the extremes.

Page 20

Figure 4c & 4d

[Bias as a function of θ for (c) the 2PLM and (d) the 3PLM under the ACI criterion; the cut at θ = 0.5 is marked in each panel.]

• Bias is negative below the cut (θ = 0.5) and positive above it, and this trend is most apparent in the region near the cut (i.e., 0 < θ < 1).


Page 21

Table 3

                      Cut = 0.5, θ level                   Cut = 1.5, θ level
Model  N            .00    .25    .50    .75   1.00     1.00   1.25   1.50   1.75   2.00
2PLM   ∞           93.6   80.5   48.5   80.7   93.9     95.2   80.4   52.7   79.2   93.7
       2500        94.5   77.6   49.3   78.8   94.0     95.3   78.7   50.0   76.0   93.3
       1000        92.7   78.5   48.8   77.5   94.5     93.7   76.9   46.0   73.2   91.9
       500         94.7   77.7   48.9   75.7   92.9     93.7   84.3   42.0   67.9   88.0
       Difference*  1.1   −2.8    0.4   −4.9   −0.9     −1.5    3.9  −10.7  −11.3   −5.7
3PLM   ∞           92.1   78.7   48.4   77.7   94.9     95.5   78.1   52.9   77.6   93.3
       2500        93.2   75.3   48.9   80.9   94.3     94.0   80.5   46.5   77.3   91.3
       1000        93.2   74.4   49.5   76.7   93.6     91.6   81.6   42.0   72.4   90.1
       500         91.7   73.7   50.1   76.3   93.5     91.7   80.3   38.8   65.2   84.4
       Difference  −0.4   −4.9    1.7   −1.5   −1.5     −3.7    2.1  −14.1  −12.4   −8.9

• In general, classification accuracy decreases as N decreases, but this is not always the case.

Page 22

Table 4

                      Cut = 0.5, θ level
Model  N            .00    .25    .50    .75   1.00
2PLM   ∞           98.5   87.7   49.7   86.7   98.5
       2500        99.1   87.1   48.3   87.5   98.9
       1000        98.7   88.8   47.6   85.6   98.7
       500         98.4   87.1   45.9   83.5   98.8
       Difference* −0.1   −0.7   −3.9   −3.2    0.3
3PLM   ∞           96.0   80.5   47.5   82.0   97.5
       2500        96.3   80.4   50.3   86.1   97.3
       1000        96.3   78.1   49.7   81.9   96.0
       500         94.0   79.7   50.9   82.1   96.5
       Difference  −2.0   −0.8    3.5    0.1   −0.9

• There is no consistent relationship between N and classification accuracy.
• The relationship between bias and true ability depends on the location of the cut.

