Using Item Response Theory to Improve Assessment

Day 1 PM: Using IRT

Item and test information

Comparison of IRT to Classical Test Theory

How to do IRT analysis

Part 1

Item and test information

Information

Information is the tool that IRT uses

to build tests

It is a statistical term that quantifies

how much something “adds” to a

procedure

Or, alternatively, how much

uncertainty (error) it decreases

A good test has a lot of information!

Item information

IRT calculates information for each

item and test at each level of θ

It is therefore not a single number –

it is a function across ability

Each item has an item information

function

Each test has a test information

function

Item information

Some items provide information for

high-ability students, some for low

Same is true for tests: a test can be

more accurate for certain score

ranges – and IRT will tell you which

Information

Item information is additive, that

is, it can be added up to obtain the

test information function (TIF)

Then we know where to add/subtract

items

Bonus: The TIF can also be inverted

to obtain a predicted SEM curve

Item information

With CTT, “information” can be

conceptualized by jointly considering

the P and rpbis

◦ Obviously, a higher rpbis is better

Definitely don’t want negative!

◦ P represents which examinees it is most

appropriate for

P = 0.95 is easy, good for low examinees

P = 0.50 is hard, good for high examinees
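
As a minimal sketch of the two CTT statistics named above, the snippet below computes P (proportion correct) and rpbis (point-biserial correlation between the item score and the total score) for a small response matrix. The data are hypothetical, and the total score here includes the item itself (an uncorrected rpbis), which is an assumption rather than something the slides specify.

```python
import math


def p_value(item_scores):
    """Proportion of examinees answering the item correctly (CTT difficulty)."""
    return sum(item_scores) / len(item_scores)


def pearson(x, y):
    """Pearson correlation; with a 0/1 item score this is the point-biserial."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: rows are examinees, columns are items (1 = correct).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]
totals = [sum(row) for row in responses]

item_stats = []
for j in range(len(responses[0])):
    column = [row[j] for row in responses]
    item_stats.append((p_value(column), pearson(column, totals)))

for j, (p, r) in enumerate(item_stats, start=1):
    print(f"Item {j}: P = {p:.2f}, rpbis = {r:.2f}")
```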

Item information

But since items and examinees are

not on the same scale, there is no

direct connection

With IRT, there is

Item with b = 0.7 is good for person

with θ = 0.7

◦ This is the basis of adaptive testing –

doing this continually
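
The idea of repeatedly matching items to the examinee can be sketched as maximum-information item selection. This is one common selection rule, not necessarily what any particular adaptive testing engine does. The item bank, the item ids, the 3PL logistic form, and D = 1.7 are all assumptions made for illustration.

```python
import math

D = 1.7  # assumed scaling constant


def irf_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """3PL item information at theta."""
    p = irf_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_next_item(theta_hat, bank, administered):
    """Pick the unused item with the most information at the current estimate."""
    candidates = {k: v for k, v in bank.items() if k not in administered}
    return max(candidates, key=lambda k: item_information(theta_hat, *candidates[k]))


# Hypothetical bank: item id -> (a, b, c)
bank = {
    "item_A": (1.0, -1.0, 0.0),
    "item_B": (1.0, 0.0, 0.0),
    "item_C": (1.0, 0.7, 0.0),
    "item_D": (1.0, 2.0, 0.0),
}

first = select_next_item(0.7, bank, administered=set())
second = select_next_item(0.7, bank, administered={first})
print(first, second)
```

In a real adaptive test the estimate of θ would be updated after each response before the next selection; that step is omitted here.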

Item information

Item information takes this idea and quantifies it across the spectrum

It is therefore a function of θ as well as the item parameters

$$I_i(\theta) = D^2 a_i^2 \, \frac{Q_i(\theta)}{P_i(\theta)} \left( \frac{P_i(\theta) - c_i}{1 - c_i} \right)^2$$

Where P_i(θ) is the probability of a correct answer for a given value of θ and Q_i(θ) is 1 − P_i(θ)
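
A minimal sketch of the computational equation, assuming the standard 3PL logistic form for P(θ) and the conventional value D = 1.7 (the slides use D without stating a value). The function names and the example item are illustrative only.

```python
import math

D = 1.7  # assumed scaling constant; not given in the slides


def irf_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Computational form: D^2 a^2 (Q / P) ((P - c) / (1 - c))^2."""
    p = irf_3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


# Hypothetical item with a = 1.0, b = 0.0 and no guessing.
info_at_b = item_information(0.0, a=1.0, b=0.0, c=0.0)
info_away = item_information(2.0, a=1.0, b=0.0, c=0.0)
print(info_at_b, info_away)
```

With c = 0 the information is largest where θ equals b and drops off as θ moves away from the item's location.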

Item information

That is the computational equation

Conceptual version that is seen in the

literature is

$$I_i(\theta) = \frac{[P_i'(\theta)]^2}{P_i(\theta)\,[1 - P_i(\theta)]}$$

Or the slope squared over the

conditional variance
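
The computational and conceptual forms should agree. The sketch below checks this numerically by estimating the slope of the IRF with a central difference and comparing slope squared over P(1 − P) with the computational formula. The 3PL logistic form, D = 1.7, and the item parameters are assumptions chosen for illustration.

```python
import math

D = 1.7  # assumed scaling constant


def irf_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def info_computational(theta, a, b, c):
    """D^2 a^2 (Q / P) ((P - c) / (1 - c))^2."""
    p = irf_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def info_conceptual(theta, a, b, c, h=1e-5):
    """Slope of the IRF squared, divided by the conditional variance P(1 - P)."""
    slope = (irf_3pl(theta + h, a, b, c) - irf_3pl(theta - h, a, b, c)) / (2.0 * h)
    p = irf_3pl(theta, a, b, c)
    return slope ** 2 / (p * (1.0 - p))


# Largest disagreement between the two forms over theta in [-3, 3].
max_gap = max(
    abs(info_computational(t / 10.0, 0.8, 0.0, 0.22)
        - info_conceptual(t / 10.0, 0.8, 0.0, 0.22))
    for t in range(-30, 31)
)
print(max_gap)
```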

Graphing info

So what does this mean?

We calculate with that equation, and

it will be higher wherever the slope

of the IRF is higher (for a given value

of θ)

This is the item information function

(IIF)

Graphing info

So the location of the item

determines the location of the IIF

The discrimination of the item

determines the spread/peakedness of

the IIF

Information decreases as the guessing

parameter increases
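
These three effects can be checked with a short sketch that finds the peak of the IIF on a grid of θ values. The 3PL logistic form, D = 1.7, the grid, and the parameter values are assumptions for illustration.

```python
import math

D = 1.7  # assumed scaling constant
GRID = [i / 100.0 for i in range(-400, 401)]  # theta from -4 to 4


def irf_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """3PL item information at theta."""
    p = irf_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def peak_information(a, b, c):
    """Height of the IIF peak on the grid."""
    return max(item_information(t, a, b, c) for t in GRID)


def peak_location(a, b, c):
    """Theta value on the grid where the IIF is highest."""
    return max(GRID, key=lambda t: item_information(t, a, b, c))


peak_baseline = peak_information(1.0, 0.0, 0.0)
peak_with_guessing = peak_information(1.0, 0.0, 0.25)  # same item, c raised
peak_low_slope = peak_information(0.5, 0.0, 0.0)       # same item, a lowered
shifted_location = peak_location(1.0, 1.0, 0.0)        # same item, b moved to 1.0
print(peak_baseline, peak_with_guessing, peak_low_slope, shifted_location)
```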

Some example items

Seq   a      b      c
1     1.00   -2.00  0.26
2     0.70   -1.00  0.21
3     0.40   -0.50  0.30
4     0.50    1.00  0.00
5     0.80    0.00  0.22

Example item IRFs

IIFs – example items

Graphing info functions

Note that a lower slope is not ALL

bad

Even though Item 3’s peak is lower, it

provides some info at a much wider

range

So items like that are quite useful

when info is needed across a wide

range

Using item info

Item information is inversely related

to error in measurement

If the item provides more info, it

reduces error

The equation:

$$\mathrm{SEM}(\theta) = \frac{1}{I(\theta)^{1/2}}$$
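
As a small worked example of this relationship, the sketch below converts an information value into a standard error. The function name and the input values are illustrative.

```python
import math


def sem_from_information(information):
    """Standard error of measurement as the inverse square root of information."""
    if information <= 0:
        raise ValueError("information must be positive")
    return 1.0 / math.sqrt(information)


sem_low_info = sem_from_information(1.0)
sem_high_info = sem_from_information(4.0)
print(sem_low_info, sem_high_info)
```

Quadrupling the information halves the standard error, which is why adding information where it is needed pays off directly in precision.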

Using item info

Key point: an item has less error where it has more information

--> where it has more slope

A test has less error where it has more information (items)

Using item info

IIFs are another way to examine

items individually

They are also what adaptive testing

utilizes for item selection

But the best use of item info: test

information and test assembly…

Test information

As a result of the assumption of local

independence, IIFs can be summed to

obtain a test information function

(TIF)

Same is true for IRFs – they can be

summed into a TRF

◦ This converts thetas to estimated raw

score
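
A minimal sketch of the summation, using the a, b, c values of the five example items listed earlier. The 3PL logistic form, D = 1.7, and the function names are assumptions; the CSEM is taken as the inverse square root of the TIF.

```python
import math

D = 1.7  # assumed scaling constant

# (a, b, c) for the five example items
ITEMS = [
    (1.00, -2.00, 0.26),
    (0.70, -1.00, 0.21),
    (0.40, -0.50, 0.30),
    (0.50, 1.00, 0.00),
    (0.80, 0.00, 0.22),
]


def irf_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """3PL item information at theta."""
    p = irf_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def trf(theta, items):
    """Test response function: expected raw score at theta (sum of IRFs)."""
    return sum(irf_3pl(theta, *item) for item in items)


def tif(theta, items):
    """Test information function: sum of the item information functions."""
    return sum(item_information(theta, *item) for item in items)


def csem(theta, items):
    """Conditional standard error of measurement from the TIF."""
    return 1.0 / math.sqrt(tif(theta, items))


for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta={theta:+.1f}  TRF={trf(theta, ITEMS):.2f}  "
          f"TIF={tif(theta, ITEMS):.2f}  CSEM={csem(theta, ITEMS):.2f}")
```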

Test information

Test information, like item

information, shows how well a test

measures at each value of θ

Also inverts to CSEM

This is extremely useful for test

assembly (aka construction, design,

or building)

Test information

Consider the 5 IRFs…

Test information

The TRF is…

Test information

Consider the 5 IIFs…

Test information

The TIF is…

Test information

The CSEM curve is…

Test assembly

Form building is more efficient and

better directed with IRT

Reason: we can predict measurement

error (SEM) at each level of θ, not

just overall reliability

Test assembly

This then allows you to build test

forms with specific TIFs or CSEMs in

mind

Or multiple forms with the same TIF

The following figures have the same

average a (0.9) but differ in where

they provide information
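
To illustrate the point numerically, the sketch below builds two hypothetical three-item forms that share a = 0.9 but have different difficulties, then locates the peak of each TIF on a grid. The forms, the 3PL logistic form with c = 0, and D = 1.7 are assumptions; they are not the forms shown in the figures.

```python
import math

D = 1.7  # assumed scaling constant
GRID = [i / 10.0 for i in range(-30, 31)]  # theta from -3 to 3


def irf_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """3PL item information at theta."""
    p = irf_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def tif(theta, items):
    """Test information: sum of item information at theta."""
    return sum(item_information(theta, *item) for item in items)


def tif_peak_location(items):
    """Theta value on the grid where the TIF is highest."""
    return max(GRID, key=lambda t: tif(t, items))


# Hypothetical forms: same discrimination, different difficulty ranges.
form_easy = [(0.9, -1.5, 0.0), (0.9, -1.0, 0.0), (0.9, -0.5, 0.0)]
form_hard = [(0.9, 0.5, 0.0), (0.9, 1.0, 0.0), (0.9, 1.5, 0.0)]

peak_easy = tif_peak_location(form_easy)
peak_hard = tif_peak_location(form_hard)
print(peak_easy, peak_hard)
```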

TRFs

TIFs

CSEMs

Test development

You can build your test with a specific

TRF/TIF/SEM graph in mind

Peak at cutscore?

This can be done inside item bankers

(FastTEST & FT Web) or in separate

spreadsheets (my Form Building Tool)
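
One simple way to build a form peaked at a cutscore can be sketched as a greedy selection: rank the bank by information at the cutscore and add items until a target amount of information is reached. This is an illustrative heuristic, not the procedure used by FastTEST or the Form Building Tool. The bank, the cutscore, the target, the 3PL logistic form, and D = 1.7 are all assumptions, and the sketch ignores content constraints and item exposure.

```python
import math

D = 1.7  # assumed scaling constant


def irf_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """3PL item information at theta."""
    p = irf_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def assemble_form(bank, cutscore, target_information):
    """Greedily add the most informative items at the cutscore until the target is met."""
    ranked = sorted(bank, key=lambda item: item_information(cutscore, *item), reverse=True)
    form, total = [], 0.0
    for item in ranked:
        if total >= target_information:
            break
        form.append(item)
        total += item_information(cutscore, *item)
    return form, total


# Hypothetical bank of (a, b, c) items with equal discrimination and no guessing.
difficulties = (-2.0, -1.5, -1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 0.75, 1.0, 1.5, 2.0)
bank = [(1.0, b, 0.0) for b in difficulties]

# A target TIF of 4.0 at the cutscore corresponds to a CSEM of 0.5 there.
form, info_at_cut = assemble_form(bank, cutscore=0.0, target_information=4.0)
print(len(form), round(info_at_cut, 3))
```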

Bank development

You can also build the bank for a

testing program with the desired TIF

in mind

If you know you want it to be peaked,

write items at the desired level of

difficulty to build an adequate bank

Bank development

Otherwise you risk overexposure

Don’t use all your best items at once

to make a peaked TIF – or any TIF for

that matter

In the theoretical IRT world, we don’t

have to worry about that, but

exposure is a real issue

Bank development

That is the reason linear-on-the-fly

testing (LOFT) was developed – to massively

reduce exposure and increase

security

◦ Every person gets a very similar TIF, but

a completely different test

◦ These tests are parallel, from an IRT

point of view

◦ Tests are conventional fixed-form

Part 2

A brief comparison of CTT and IRT

CTT and IRT Assumptions

IRT: ◦ Unidimensionality and local independence

◦ Responses modeled by IRF

◦ Parameters, not statistics (sample independence)

CTT: ◦ X = T + E

◦ (1) true scores and error scores are uncorrelated; (2) the average error score in the sample is zero

◦ Statistics (not parameters) are sample-based

Comparing CTT and IRT

CTT is said to have weaker assumptions

◦ Does not explicitly assume

unidimensionality

But if not there, statistics will be iffy, and rpbis

and reliability suffer

Sum scoring implicitly assumes items are

equivalent, which means unidimensional (all

items count equally on one total score)


CTT is said to have weaker assumptions

◦ Does not explicitly assume IRF

But if the idea of an IRF is not working, then the

item isn’t either

And if you use rpbis, you assume a linear IRF,

which is actually impossible!


CTT item statistics are at odds with

each other

◦ P says that there is one common

probability of a correct response

(binomial)

◦ But rpbis says that P increases with total

score (~ability)


Classical SEM: same for everyone

IRT SEM: different for everyone –

depends on the items you see and

your ability

Which is more realistic?


Direct comparison of item statistics

◦ We still use “difficulty” and

“discrimination”

◦ How different are they from CTT?

◦ Difficulty correlates highly (>0.90)

◦ Discrimination does not – because rpbis

is linear and IRT is not


IRT and CTT scores also correlate

>0.95

So why use IRT?

There are distinct advantages…

Advantages of IRT

IRT has parameters, not statistics

Sample-independent… within a linear

transformation

Huh? This means that if you have two

calibration groups of different levels,

we can convert parameters/scores

with a simple y = mx + b

(Linking)
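
The "within a linear transformation" idea can be sketched as follows: if θ is rescaled as mθ + k, then rescaling b the same way and dividing a by m leaves every response probability unchanged. The intercept is named k here only to avoid a clash with the difficulty parameter b. The values of m and k, the item parameters, the 3PL logistic form, and D = 1.7 are assumptions; estimating m and k from real calibration data is a separate step not shown.

```python
import math

D = 1.7  # assumed scaling constant


def irf_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def rescale_theta(theta, m, k):
    """Move an ability value onto the target scale."""
    return m * theta + k


def rescale_item(a, b, c, m, k):
    """Move item parameters onto the target scale; c is unaffected."""
    return a / m, m * b + k, c


m, k = 1.2, -0.5           # hypothetical linking constants
a, b, c = 0.8, 0.0, 0.22   # hypothetical item on the original scale
theta = 0.7

p_original = irf_3pl(theta, a, b, c)
p_linked = irf_3pl(rescale_theta(theta, m, k), *rescale_item(a, b, c, m, k))
print(p_original, p_linked)
```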

Advantages of IRT

Items and people are on the same

scale

Easier to interpret, and allows

adaptive testing

Advantages of IRT

Information provides an important

tool for test building and bank

development

Better match the purposes of a test

IRT CSEM allows far better

description of precision

Advantages of IRT

More precise scores

CTT number correct scoring is limited

to k + 1 scores

3PL has up to 2^k scores (one per response pattern)

Compare with 10 items:

◦ 11 vs 1024 possible scores
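
The count can be verified by enumerating response patterns for k = 10 dichotomous items. This only counts patterns; whether every pattern yields a distinct θ estimate depends on the item parameters.

```python
from itertools import product

k = 10  # number of dichotomous items

# Number-correct scoring: possible scores are 0, 1, ..., k.
number_correct_scores = len(range(k + 1))

# Pattern scoring: every distinct 0/1 response vector is a candidate score.
response_patterns = sum(1 for _ in product((0, 1), repeat=k))

print(number_correct_scores, response_patterns)
```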

Advantages of IRT

Scores take item difficulty into

account

Allows direct comparison of

examinees that saw different sets of

items

Scores also account for guessing

Advantages of IRT

Nonlinear IRF – the linear IRF

assumed by CTT is impossible

Allows for different SEM for every

examinee

Not realistic to assume they are all

the same

Disadvantages of IRT

Sample size

CTT: 50 is OK, 100 is great

◦ It is much easier to fit a straight line

“model” than an IRF because it is an

oversimplification

IRT: 100 is bare minimum for 1PL

◦ 3PL? ~500

◦ Puts it out of reach of small testing

programs


No “native” distractor analysis unless

polytomous models are used

Can adapt the CTT idea of

quantile/distractor plot with IRT

◦ IRT programs will also give you option P

and rpbis


Complexity

◦ Not only do you have to understand it

yourself, but…

◦ You also have to explain it to

stakeholders!


However, note that these are not big

problems

◦ Many places have plenty of sample size

◦ You can still use CTT for distractor

analysis (always use both!!!!)

◦ The complexity is not too bad unless

using complex models

◦ Often, the biggest issue is the

stakeholders!

IRT Analysis

How do I go about doing this?

IRT Analysis

Xcalibre 4 for IRT

CTT analysis with Iteman 4 (not

necessary, but sometimes helps)

Also:

◦ Scoring and graphing tool

◦ Form building tool

◦ Empirical IRFs in Excel

◦ Have we covered these sufficiently?

IRT Analysis

I’m assuming here we are analyzing

just one sample of one test

What would I look for? Basic…

◦ Items with good parameters (keep/clone)

◦ Items with bad parameters (retire)

Evaluate their CTT option statistics

◦ TIF/CSEM – meet our needs? (not

good/bad in absolute sense)

IRT Analysis

What would I look for? Advanced…

◦ Dimensionality assessment (reliability,

any items/sections “off on their own”)

◦ Item fit (also dimensionality, and possible

item issues)

◦ Test sections – any stand out for being

hard, easy, low discriminations, poor

precision, etc?

◦ CSEM/TIF for sections: anything under-

measured?

IRT Analysis

What would I look for? Advanced…

◦ Finally: what do you want to see in the

data, and how will the test be used?

Later, we’ll talk about more

advanced uses like:

◦ Linking and equating multiple forms

◦ Test assembly

◦ Adaptive testing

◦ Dimensionality evaluation

Iteman 4.1

Performs comprehensive classical

analysis

Quantile plots allow broad evaluation

of IRF shape

Advantages:

◦ Easily understandable – can use with SMEs

◦ Includes distractors
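
The data behind a quantile plot can be sketched generically: sort examinees by total score, split them into equal-sized groups, and compute the proportion correct on the item within each group. This is an illustration of the idea only, not Iteman's actual algorithm; the data, the number of groups, and the function name are hypothetical.

```python
def quantile_plot_points(item_scores, total_scores, n_groups=3):
    """Proportion correct on one item within each total-score group, low to high."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    size = len(order) / n_groups
    points = []
    for g in range(n_groups):
        members = order[round(g * size):round((g + 1) * size)]
        points.append(sum(item_scores[i] for i in members) / len(members))
    return points


# Hypothetical data for nine examinees.
total_scores = [2, 3, 4, 5, 6, 7, 8, 9, 10]
item_scores = [0, 0, 1, 0, 1, 1, 1, 1, 1]

points = quantile_plot_points(item_scores, total_scores, n_groups=3)
print(points)
```

A rising sequence of proportions across the groups is the empirical counterpart of an increasing IRF; repeating the calculation for each distractor gives the distractor curves.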

Xcalibre 4.1

Provides a comprehensive and user-

friendly IRT analysis

Allows evaluation of individual items

and test as a whole

All major graphs

Many summary graphs (freqs etc.)

Classical analysis too

Reasons for Xcalibre 4.1

Currently available software (Parscale,

Bilog, Multilog, ConQuest, WinSteps,

ICL) still requires programming skills

Some still run on DOS!

If IRT is to be more widely used, it

needs a user-friendly system

◦ Input and output


Better input

◦ Yes: Point and click buttons

◦ No: DOS programming quasi-language

Better output

◦ Yes: Word docs (RTF), spreadsheets (CSV)

◦ No: DOS txt files with ugly tables


Advanced users with programming

skills and need for customized analysis

can still utilize previous software

Xcalibre 4.1 is designed for a wider

range of users

The following description is of Xcalibre

4, but also applies to Iteman 4

Xcalibre 4.1 Interface

Divided into tabs

Move left to right…

Xcalibre 4.1 Interface

All options are specified with buttons

or simple entry boxes

No code based on keywords

◦ Best example: IRT models (you’ll see)

Also: usable error messages

Specify files/input; choose options

I’ll now show how to use X4, and do

some analysis of real data…