
Empowered by Psychometrics: Inside the Black Box of Computerized Adaptive Testing

Jim Wollack and Sonya Sedivy

University of Wisconsin – Madison

#nctaconf15

Purpose of Session

• Introduce several key psychometric concepts and gain an appreciation for the theoretical underpinnings of computerized adaptive testing.

• Conceptual Overview of Computerized Adaptive Testing
• Introduction to Item Response Theory
• Score Estimation in CAT
• Item Selection in CAT
• Practical Issues in CAT

Poll Question

Consider the following topics related to computerized adaptive testing.

I. How items are chosen for administration
II. How a test score is determined
III. Test security issues specific to CAT

Please indicate which of these topics you currently would be comfortable discussing with an examinee/parent.

SECTION I

• Conceptual Overview of Computerized Adaptive Testing

Classical Test Theory

• X = T + E (observed score = true score + error)
• Person characteristics
  • Total test score serves as a proxy for the examinee's level on the construct
• Item characteristics
  • Item difficulty is estimated as the proportion of examinees who answer an item correctly
  • Item discrimination is estimated as the correlation between item score (1/0) and total score

Sample Data

     I1  I2  I3  I4  I5  I6  I7  I8  I9 I10 I11 I12 I13 I14 I15    X
S1    0   0   0   0   0   0   0   0   0   0   1   1   0   0   0    2
S2    0   0   1   0   1   0   1   0   0   0   1   1   0   0   0    5
S3    1   1   1   1   1   0   0   1   0   0   0   1   0   1   0    8
S4    1   0   1   1   1   0   1   0   0   0   1   1   1   1   0    9
S5    1   0   1   0   1   0   1   1   1   1   1   1   1   1   0   11
S6    1   1   1   1   1   0   1   1   1   1   1   1   1   1   0   13
p   .46 .21 .72 .56 .89 .26 .51 .44 .53 .31 .77 .96 .29 .66 .37
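The CTT statistics defined above can be sketched in a few lines of Python, computed here on just the six response vectors shown in the table (the slide's p row evidently comes from a larger sample, so these values will not match it exactly):

```python
# Classical test theory item statistics from the 6x15 sample data above.
responses = [
    [0,0,0,0,0,0,0,0,0,0,1,1,0,0,0],
    [0,0,1,0,1,0,1,0,0,0,1,1,0,0,0],
    [1,1,1,1,1,0,0,1,0,0,0,1,0,1,0],
    [1,0,1,1,1,0,1,0,0,0,1,1,1,1,0],
    [1,0,1,0,1,0,1,1,1,1,1,1,1,1,0],
    [1,1,1,1,1,0,1,1,1,1,1,1,1,1,0],
]
n_items = len(responses[0])
totals = [sum(r) for r in responses]  # X = total score for each examinee

# Item difficulty: proportion of examinees answering the item correctly
p = [sum(r[i] for r in responses) / len(responses) for i in range(n_items)]

def pearson(x, y):
    # Plain Pearson correlation; returns 0 for a zero-variance item
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Item discrimination: correlation between item score (1/0) and total score
disc = [pearson([r[i] for r in responses], totals) for i in range(n_items)]
```

Note that I6, which nobody in this small sample answered correctly, gets p = 0 and a discrimination of 0, illustrating the sample-dependency of CTT statistics discussed later.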

Computerized Adaptive Testing (CAT)

• Each examinee receives a customized exam built to maximize the precision of their score.
  • When an examinee answers a question incorrectly, the next item will be easier.
  • When an examinee answers a question correctly, the next item will be harder.
• Difficulty of assessments varies so that each examinee gets an exam for which they'll answer approximately 50% of the items correctly.

Reliability versus Precision

• Reliability is a measure of the amount of error in a group of test scores.
  • The standard error of measurement (SEM) is a function of reliability, and represents an average amount that an individual's score might vary upon retest.
  • SEM is the same value for all examinees.
• Precision is a measure of the amount of error in an individual's test score.
  • Varies based on the examinee's trait level.
  • A measure of the quality of a specific test score.

Reliability versus Precision

• For a population of examinees, the test will not be equally precise for all.
• Precision is maximized when examinees take items that are challenging, but doable.

Adaptive Algorithm

1. Candidate gets a small set of random items of varying difficulty.
2. Estimate trait level (score).
3. Estimate precision of score.
4. Is the score sufficiently precise? If YES, end test.
5. If NO: has the maximum test length been reached? If YES, end test.
6. If NO: deliver a new item that will maximize precision, then return to step 2.
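The loop above can be sketched as runnable Python. This is a minimal illustration, not the presenters' implementation: it assumes a simple Rasch (1PL) model, a made-up item pool of difficulties, and a crude grid-search score estimate.

```python
import math, random

def p_correct(theta, b):
    # Rasch model: probability of a correct response at trait level theta
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(items, resp):
    # Crude maximum-likelihood estimate via grid search over -4.0..4.0
    grid = [g / 10 for g in range(-40, 41)]
    def loglik(t):
        return sum(math.log(p_correct(t, b) if r else 1 - p_correct(t, b))
                   for b, r in zip(items, resp))
    return max(grid, key=loglik)

def info(theta, b):
    p = p_correct(theta, b)
    return p * (1 - p)  # Rasch item information, peaks when b is near theta

def run_cat(pool, true_theta, max_len=20, target_se=0.45):
    rng = random.Random(1)
    items = rng.sample(pool, 3)  # small random starting set of items
    resp = [rng.random() < p_correct(true_theta, b) for b in items]
    while True:
        theta = estimate_theta(items, resp)                 # estimate score
        se = 1 / math.sqrt(sum(info(theta, b) for b in items))  # precision
        if se <= target_se or len(items) >= max_len:        # stopping rules
            return theta, se
        remaining = [b for b in pool if b not in items]
        b = max(remaining, key=lambda d: info(theta, d))    # most informative
        items.append(b)
        resp.append(rng.random() < p_correct(true_theta, b))

pool = [i / 10 for i in range(-30, 31)]  # difficulties from -3.0 to 3.0
theta_hat, se = run_cat(pool, true_theta=0.8)
```

Each pass through the loop re-estimates the score, checks the precision and length stopping rules, and otherwise administers the remaining item that is most informative at the current estimate.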


Sample Data—CAT

Each examinee answers only the items administered to them:

S1: 0 0 0 1 1 0    (X = 2, N = 6)
S2: 0 1 1 0 1 0 0  (X = 3, N = 7)
S3: 1 1 0 0 0 1    (X = 3, N = 6)
S4: 1 1 1 0 0 0    (X = 3, N = 6)
S5: 1 0 0 1 1 0    (X = 3, N = 6)
S6: 1 1 0 1 1 1    (X = 5, N = 6)

• Everybody sees different items and takes tests of different lengths.
• The CAT algorithm forces most students to a score of approximately 50%.
• How can we assign scores fairly and accurately?

Sample Data—CAT (items ordered by p)

S1: 1 0 1 0 0 0    (X = 2, N = 6)
S2: 1 1 1 0 0 0 0  (X = 3, N = 7)
S3: 0 1 1 0 1 0    (X = 3, N = 6)
S4: 1 0 1 1 0 0    (X = 3, N = 6)
S5: 1 1 0 1 0 0    (X = 3, N = 6)
S6: 1 1 1 1 0 1    (X = 5, N = 6)

• Picture is somewhat improved.
• Still can't report accurate scores with any confidence.

SECTION II

• Introduction to Item Response Theory

Item Response Theory

• Mathematical modeling approach to test scoring and analysis
• Less intuitive, but more sophisticated approach
• Solves many problems with CTT:
  • Sample-dependency of item/exam statistics
  • Test-dependency of total scores
  • Tough to compare people and items
  • Equal item weighting
  • No good way to account for guessing

Trait Level vs. Prob. Correct Response

[Figure: probability of correct response (0.0 to 1.0) plotted against θ, the examinee trait level, from -3.0 (LOW score) to 3.0 (HIGH score)]

An Item Characteristic Curve

[Figure: an item characteristic curve; probability of correct response (0.0 to 1.0) against θ, the examinee trait level (-3.0 to 3.0)]

Sample Independent—Same Curve

[Figure: the same item characteristic curve is obtained regardless of the sample; probability of correct response against θ, the examinee trait level]

Item Response Theory

• Directly models the probability of a candidate getting an item correct based on their overall level on the construct and item characteristics
• Item characteristics:
  • Item difficulty
  • Item discrimination
  • Pseudo-guessing probability
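These three item characteristics correspond to the three-parameter logistic (3PL) model widely used in CAT programs. A minimal sketch, using the common 1.7 scaling-constant convention (the slides do not specify which IRT model the presenters assume):

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response at trait level theta for an item
    with discrimination a, difficulty b, and pseudo-guessing probability c."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))
```

At theta equal to the difficulty b, the probability is halfway between the guessing floor c and 1.0; as theta falls far below b, the probability approaches c rather than zero.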

Item Difficulty

[Figure: two item characteristic curves against θ; the easier item's curve sits to the left, the harder item's to the right]

Item Difficulty

[Figure: the same two curves; at a given trait level, the probability of a correct response is .68 for the easier item and .41 for the harder item]

Item Discrimination

[Figure: two item characteristic curves against θ. Over the same trait-level interval, the less discriminating item rises from .41 to .59 (.59 - .41 = .18), while the more discriminating item rises from .32 to .68 (.68 - .32 = .36), a steeper curve]

Accounting for Guessing

[Figure: probability of correct response against θ, illustrating the pseudo-guessing (lower-asymptote) parameter]

Putting it all Together

[Figure: an item characteristic curve combining the difficulty, discrimination, and pseudo-guessing parameters]

SECTION III

• Score Estimation in CAT

Estimating Examinee Scores

• Requires a large pool of items with known item parameters.
• Two "approaches" to score estimation:
  • Visual approach
  • Conceptual approach

Visual Approach to Score Estimation

• Test Characteristic Curve (TCC)
  • Describes the relationship between total test score and examinee trait level
  • TCC is obtained by "adding" item characteristic curves across all trait levels
  • Each test has its own TCC
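The "adding" of item characteristic curves can be sketched directly: the TCC at a trait level is the sum of each item's probability of a correct response, i.e. the expected number-correct score. The five 2PL items below use made-up illustrative parameters:

```python
import math

def p_2pl(theta, a, b):
    # Two-parameter logistic ICC (1.7 scaling convention)
    return 1 / (1 + math.exp(-1.7 * a * (theta - b)))

# Hypothetical (discrimination, difficulty) pairs for a 5-item test
items = [(1.0, -1.5), (0.8, -0.5), (1.2, 0.0), (1.0, 0.7), (0.9, 1.4)]

def tcc(theta):
    # Test characteristic curve: expected number-correct score at theta,
    # obtained by summing the item characteristic curves
    return sum(p_2pl(theta, a, b) for a, b in items)
```

Because every ICC is increasing in theta, the TCC rises monotonically from near 0 to near the number of items, which is why each number-correct score maps to a single trait level on a fixed test.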

[Figures: test characteristic curves built from 2, 3, 4, and 5 items; expected number-correct score (0.00 to 5.00) plotted against scaled score (200 to 800)]

Sample Data—CAT (items ordered by p)

S1: 1 0 1 0 0 0    (X = 2, N = 6)
S2: 1 1 1 0 0 0 0  (X = 3, N = 7)
S3: 0 1 1 0 1 0    (X = 3, N = 6)
S4: 1 0 1 1 0 0    (X = 3, N = 6)
S5: 1 1 0 1 0 0    (X = 3, N = 6)
S6: 1 1 1 1 0 1    (X = 5, N = 6)

[Figure, repeated across several slides: test characteristic curves for these 6 examinees; expected number-correct score (0.00 to 8.00) plotted against scaled score (200 to 800), with each examinee's scaled score read off their own TCC]

Limitation to Visual Approach

• Suggests that any two students who earn the same number-correct score will earn the same scaled score.
• But the particular pattern of right and wrong answers is important.
• How a number-correct score is obtained really does matter in score estimation.


Conceptual Approach to Score Estimation

[Figure: item characteristic curves for the 6 administered items; at the examinee's trait level, the probabilities of a correct response are .93, .83, .75, .71, .59, and .33]

Conceptual Approach to Score Estimation

[Figure: the same 6 item curves, now showing the probability of each observed response; for items answered incorrectly, P(answer) = 1 - P(Corr)]

P(Corr)  P(answer)
  .93      .93
  .83      .83
  .75      .25
  .71      .71
  .59      .59
  .33      .67

Conceptual Approach to Score Estimation

[Figure: probability of the observed response for each of the 6 items]

• Likelihood of the response pattern is found by multiplying the probabilities of the individual responses:
  .93 × .83 × .25 × .71 × .59 × .67 = .0542


Likelihood Function

[Figure: the likelihood plotted against scaled score (200 to 800); a score of 528 maximizes the likelihood]
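The conceptual approach can be sketched end to end: evaluate the likelihood of the observed right/wrong pattern at each candidate trait level, and report the level that maximizes it. The 2PL item parameters below are made up for illustration; the slide's .0542 and 528 values come from the presenters' actual items and scale.

```python
import math

def p_2pl(theta, a, b):
    # Two-parameter logistic ICC (1.7 scaling convention)
    return 1 / (1 + math.exp(-1.7 * a * (theta - b)))

# Hypothetical (discrimination, difficulty) pairs for 6 administered items
items = [(1.0, -1.2), (0.9, -0.6), (1.1, 0.0), (1.0, 0.4), (0.8, 0.9), (1.2, 1.5)]
resp  = [1, 1, 0, 1, 1, 0]  # observed pattern: four right, two wrong

def likelihood(theta):
    # Multiply the probability of each observed response:
    # P(correct) for a right answer, 1 - P(correct) for a wrong one
    prod = 1.0
    for (a, b), r in zip(items, resp):
        p = p_2pl(theta, a, b)
        prod *= p if r else (1 - p)
    return prod

# Grid search over trait levels from -3.00 to 3.00
grid = [g / 100 for g in range(-300, 301)]
theta_hat = max(grid, key=likelihood)  # maximum-likelihood score estimate
```

Operational programs maximize the log-likelihood with numerical optimizers rather than a grid, but the idea is the same: the score reported is the trait level under which the observed answer pattern was most probable.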

SECTION IV

• Item Selection in CAT

Selecting Items in CAT

• Each new item is selected so as to maximize precision.
• What does this mean?
  • Item information is a measure of the amount that we learn about a person's trait level by administering a particular item.

Selecting Items in CAT

• General characteristics of item information:
  • A specific item provides different amounts of information, depending on the trait level of the examinee and the characteristics of the item.
  • Information is maximized when an item is administered whose difficulty is very close to the examinee's trait level.
  • Information tends to be higher for highly discriminating items.
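Both properties in the bullets above fall out of the standard item information formula. For the 2PL model (assumed here for illustration), information is I(θ) = (1.7a)² · P(θ) · (1 - P(θ)), which peaks where difficulty b is closest to θ and scales with the square of the discrimination a:

```python
import math

def p_2pl(theta, a, b):
    # Two-parameter logistic ICC (1.7 scaling convention)
    return 1 / (1 + math.exp(-1.7 * a * (theta - b)))

def info(theta, a, b):
    # 2PL item information: largest when p = 0.5 (theta near b),
    # and growing with the square of the discrimination a
    p = p_2pl(theta, a, b)
    return (1.7 * a) ** 2 * p * (1 - p)
```

A CAT item-selection step then amounts to scanning the eligible pool and picking the item whose info() is largest at the current score estimate.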

Item Information (worked example)

[Figure, repeated across slides: item information curves for several items, plotted against scaled score (200 to 800); at each step the examinee answers the most informative item, shown by its curve color]

• Current estimate: 400; black item answered correctly; new estimate: 475
• Current estimate: 475; green item answered incorrectly; new estimate: 440
• Current estimate: 440; brown item answered correctly; new estimate: 460
• Current estimate: 460; blue item answered incorrectly; new estimate: 448
• Current estimate: 448; black item answered correctly; new estimate: 456
• Final score estimate = 456

SECTION V

• Practical Issues in CAT

Security Issues in CAT

• Advantages
  • Essentially eliminates answer copying
  • Shows fewer items to each candidate
• Disadvantages
  • Increased reliance on a handful of items
  • Exposes the entire test bank to the group of candidates
  • Gaming

[Figure: item information curves plotted against scaled score (200 to 800)]

Other Practical Issues in CAT

• Item review
• Sparse data matrices

Re-Polling Question

Consider the following topics related to computerized adaptive testing.

I. How items are chosen for administration
II. How a test score is determined
III. Test security issues specific to CAT

After participating in this session, please indicate which of these topics you would now be more comfortable discussing with an examinee/parent.

Thank you

• For more information, please contact

Jim Wollack

University of Wisconsin – Madison

jwollack@wisc.edu