Day 1 PM: Using IRT
Item and test information
Comparison of IRT to Classical Test Theory
How to do IRT analysis
Part 1
Item and test information
Information
Information is the tool that IRT uses
to build tests
It is a statistical term that quantifies
how much something “adds” to a
procedure
Or, alternatively, how much
uncertainty (error) it decreases
A good test has a lot of information!
Item information
IRT calculates information for each
item and test at each level of θ
It is therefore not a single number –
it is a function across ability
Each item has an item information
function
Each test has a test information
function
Item information
Some items provide information for
high students, some for low
Same is true for tests: a test can be
more accurate for certain score
ranges – and IRT will tell you which
Information
Item information is summative, that
is, it can be added up to obtain the
test information function (TIF)
Then we know where to add/subtract
items
Bonus: The TIF can also be inverted
to obtain a predicted SEM curve
Item information
With CTT, “information” can be
conceptualized by jointly considering
the P and rpbis
◦ Obviously, a higher rpbis is better
Definitely don’t want negative!
◦ P represents which examinees it is most
appropriate for
P = 0.95 is easy, good for low examinees
P = 0.50 is hard, good for high examinees
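As a rough illustration of the two classical statistics named above, the sketch below computes P and rpbis for a small made-up response matrix. The data and function names are hypothetical, and the total score here includes the item itself (an uncorrected point-biserial), which is an assumption rather than something the slides specify.

```python
import math

# Hypothetical scored responses: 6 examinees (rows) by 4 items (columns).
RESPONSES = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]


def p_value(item_scores):
    """Classical difficulty: proportion of examinees answering correctly."""
    return sum(item_scores) / len(item_scores)


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item score and the total score."""
    n = len(item_scores)
    mean_i = sum(item_scores) / n
    mean_t = sum(total_scores) / n
    cov = sum((u - mean_i) * (t - mean_t) for u, t in zip(item_scores, total_scores))
    var_i = sum((u - mean_i) ** 2 for u in item_scores)
    var_t = sum((t - mean_t) ** 2 for t in total_scores)
    return cov / math.sqrt(var_i * var_t)


totals = [sum(row) for row in RESPONSES]
item1 = [row[0] for row in RESPONSES]  # the easiest item in this made-up data
p_item1 = p_value(item1)
r_item1 = point_biserial(item1, totals)
print(f"item 1: P = {p_item1:.3f}, rpbis = {r_item1:.3f}")
```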
Item information
But since items and examinees are
not on the same scale, there is no
direct connection
With IRT, there is
Item with b = 0.7 is good for person
with θ = 0.7
◦ This is the basis of adaptive testing –
doing this continually
Item information
Item information takes this idea and quantifies it across the spectrum
It is therefore a function of θ as well as the item parameters
$$ I_i(\theta) = D^2 a_i^2 \, \frac{Q_i(\theta)}{P_i(\theta)} \left[ \frac{P_i(\theta) - c_i}{1 - c_i} \right]^2 $$
where P_i(θ) is the probability of a correct answer at a given θ and Q_i(θ) = 1 − P_i(θ)
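The 3PL item information calculation can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, and D = 1.7 is an assumed scaling constant, since the slides do not give its value.

```python
import math

D = 1.7  # scaling constant; an assumed value, not stated in the slides


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_info(theta, a, b, c):
    """3PL item information: D^2 a^2 (Q/P) ((P - c) / (1 - c))^2."""
    p = p3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


# Hypothetical item located at b = 0, evaluated at theta = 0,
# once without guessing and once with c = 0.25.
info_no_guessing = item_info(0.0, a=1.0, b=0.0, c=0.0)
info_with_guessing = item_info(0.0, a=1.0, b=0.0, c=0.25)
print(info_no_guessing, info_with_guessing)
```

With everything else held fixed, the item with c = 0.25 gives less information at θ = 0 than the item with c = 0.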
Item information
That is the computational equation
Conceptual version that is seen in the
literature is
$$ I_i(\theta) = \frac{[P_i'(\theta)]^2}{P_i(\theta)\,[1 - P_i(\theta)]} $$
Or the slope squared over the
conditional variance
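For the 3PL model, the conceptual and computational forms can be shown to agree. The short derivation below is a sketch; the symbol L(θ) for the logistic part of the model is notation introduced here, not taken from the slides.

```latex
% 3PL model written with a logistic kernel L(\theta)
P(\theta) = c + (1 - c)\,L(\theta), \qquad
L(\theta) = \frac{1}{1 + e^{-Da(\theta - b)}}

% Express L and 1 - L through P and Q = 1 - P
L = \frac{P - c}{1 - c}, \qquad 1 - L = \frac{Q}{1 - c}

% Slope of the IRF
P'(\theta) = (1 - c)\,Da\,L\,(1 - L) = \frac{Da\,(P - c)\,Q}{1 - c}

% Slope squared over the conditional variance PQ
\frac{[P'(\theta)]^2}{P\,Q}
  = \frac{D^2 a^2 (P - c)^2 Q^2}{(1 - c)^2\,P\,Q}
  = D^2 a^2\,\frac{Q}{P}\left[\frac{P - c}{1 - c}\right]^2
```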
Graphing info
So what does this mean?
We calculate with that equation, and
it will be higher wherever the slope
of the IRF is higher (for a given value
of θ)
This is the item information function
(IIF)
Graphing info
So the location of the item
determines the location of the IIF
The discrimination of the item
determines the spread/peakedness of
the IIF
Information decreases as the guessing
parameter increases
Some example items
Seq   a      b      c
1     1.00   -2.00  0.26
2     0.70   -1.00  0.21
3     0.40   -0.50  0.30
4     0.50    1.00  0.00
5     0.80    0.00  0.22
Example item IRFs
IIFs – example items
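The IIFs of the five example items can be approximated numerically. The sketch below evaluates each item on a θ grid and reports where its information peaks; D = 1.7 and the grid from -4 to 4 are assumptions made for illustration.

```python
import math

D = 1.7  # assumed scaling constant

# Example items from the slides: seq -> (a, b, c)
ITEMS = {
    1: (1.00, -2.00, 0.26),
    2: (0.70, -1.00, 0.21),
    3: (0.40, -0.50, 0.30),
    4: (0.50, 1.00, 0.00),
    5: (0.80, 0.00, 0.22),
}

GRID = [g / 100.0 for g in range(-400, 401)]  # theta from -4.00 to 4.00


def item_info(theta, a, b, c):
    """3PL item information at theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def peak(a, b, c):
    """Return (theta, information) at the grid point where the IIF is highest."""
    best = max(GRID, key=lambda t: item_info(t, a, b, c))
    return best, item_info(best, a, b, c)


peaks = {seq: peak(*params) for seq, params in ITEMS.items()}
for seq, (theta, info) in peaks.items():
    print(f"item {seq}: peak near theta = {theta:+.2f}, information = {info:.3f}")
```

Under these assumptions, the item with c = 0 peaks at its b value, items with c > 0 peak slightly above b, and the low-a item (Item 3) has the lowest peak.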
Graphing info functions
Note that a lower slope is not ALL
bad
Even though Item 3’s peak is lower, it
provides some info at a much wider
range
So items like that are quite useful
when info is needed across a wide
range
Using item info
Item information is inversely related
to error in measurement
If the item provides more info, it
reduces error
The equation:
$$ \mathrm{SEM}(\theta) = \frac{1}{\sqrt{I(\theta)}} $$
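As a quick worked example of the inverse relationship, the sketch below converts a few hypothetical information values into standard errors. The function name is made up for illustration.

```python
import math


def sem_from_info(info):
    """Standard error of measurement implied by an information value."""
    if info <= 0:
        raise ValueError("information must be positive")
    return 1.0 / math.sqrt(info)


# Hypothetical information values at some theta.
for info in (1.0, 4.0, 10.0):
    print(f"I = {info:5.1f}  ->  SEM = {sem_from_info(info):.3f}")
```

Quadrupling the information halves the SEM, so each further reduction in error costs more information than the last.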
Using item info
Key point: an item has less error where it has more information
--> where it has more slope
A test has less error where it has more information (items)
Using item info
IIFs are another way to examine
items individually
They are also what adaptive testing
utilizes for item selection
But the best use of item info: test
information and test assembly…
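The item-selection use mentioned above can be sketched as a maximum-information rule: at the current θ estimate, pick the unused item with the highest information. This is a simplified illustration using the five example items as the bank and an assumed D = 1.7. A real adaptive test would re-estimate θ after every response, which is omitted here.

```python
import math

D = 1.7  # assumed scaling constant

# Example items from the slides used as a tiny bank: seq -> (a, b, c)
BANK = {
    1: (1.00, -2.00, 0.26),
    2: (0.70, -1.00, 0.21),
    3: (0.40, -0.50, 0.30),
    4: (0.50, 1.00, 0.00),
    5: (0.80, 0.00, 0.22),
}


def item_info(theta, a, b, c):
    """3PL item information at theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_next(theta_hat, bank, used):
    """Pick the unused item with the most information at theta_hat."""
    candidates = [seq for seq in bank if seq not in used]
    return max(candidates, key=lambda seq: item_info(theta_hat, *bank[seq]))


# theta_hat is held at 0.0 for both picks only to keep the sketch short.
first = select_next(0.0, BANK, used=set())
second = select_next(0.0, BANK, used={first})
print(first, second)
```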
Test information
As a result of the assumption of local
independence, IIFs can be summed to
obtain a test information function
(TIF)
Same is true for IRFs – they can be
summed into a TRF
◦ This converts thetas to estimated raw
score
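Summing across items is all it takes to get the TIF and TRF. The sketch below does this for the five example items, again assuming D = 1.7; the function names are illustrative only.

```python
import math

D = 1.7  # assumed scaling constant

# Example items from the slides: (a, b, c)
ITEMS = [
    (1.00, -2.00, 0.26),
    (0.70, -1.00, 0.21),
    (0.40, -0.50, 0.30),
    (0.50, 1.00, 0.00),
    (0.80, 0.00, 0.22),
]


def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_info(theta, a, b, c):
    """3PL item information at theta."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def tif(theta):
    """Test information: sum of the item information functions."""
    return sum(item_info(theta, *item) for item in ITEMS)


def trf(theta):
    """Test response function: expected raw score at theta."""
    return sum(p3pl(theta, *item) for item in ITEMS)


for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"theta = {theta:+.1f}  TIF = {tif(theta):.3f}  expected raw score = {trf(theta):.2f}")
```

The expected raw score never drops below the sum of the c parameters, because even very low examinees are expected to get some items right by guessing.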
Test information
Test information, like item
information, shows how well a test
measures at each value of θ
Also inverts to CSEM
This is extremely useful for test
assembly (aka construction, design,
or building)
Test information
Consider the 5 IRFs…
Test information
The TRF is…
Test information
Consider the 5 IIFs…
Test information
The TIF is…
Test information
The CSEM curve is…
Test assembly
Form building is more efficient and
better directed with IRT
Reason: we can predict measurement
error (SEM) at each level of θ, not
just overall reliability
Test assembly
This then allows you to build test
forms with specific TIFs or CSEMs in
mind
Or multiple forms with the same TIF
The following figures have the same
average a (0.9) but differ in where
they provide information
TRFs
TIFs
CSEMs
Test development
You can build your test with a specific
TRF/TIF/SEM graph in mind
Peak at cutscore?
This can be done inside item bankers
(FastTEST & FT Web) or in separate
spreadsheets (my Form Building Tool)
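One simple way to get a TIF peaked at a cutscore is to rank the bank by information at the cutscore and take the top items. The sketch below does this with a made-up nine-item bank (a = 0.9, c = 0.20, b from -2.0 to 2.0) and an assumed D = 1.7. It is not how FastTEST or the Form Building Tool work internally, and it ignores content constraints and exposure.

```python
import math

D = 1.7  # assumed scaling constant

# Hypothetical bank: nine items with a = 0.9, c = 0.20 and b from -2.0 to 2.0.
BANK = [{"id": i + 1, "a": 0.9, "b": -2.0 + 0.5 * i, "c": 0.20} for i in range(9)]


def item_info(theta, item):
    """3PL item information at theta for one bank record."""
    a, b, c = item["a"], item["b"], item["c"]
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def assemble_peaked_form(bank, cutscore, n_items):
    """Greedy pick: the n_items with the most information at the cutscore."""
    ranked = sorted(bank, key=lambda item: item_info(cutscore, item), reverse=True)
    return ranked[:n_items]


form = assemble_peaked_form(BANK, cutscore=1.0, n_items=3)
form_tif_at_cut = sum(item_info(1.0, item) for item in form)
print([item["id"] for item in form], round(form_tif_at_cut, 3))
```

Because this always takes the most informative items first, it is exactly the kind of selection that leads to overexposure if it is repeated across forms.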
Bank development
You can also build the bank for a
testing program with the desired TIF
in mind
If you know you want it to be peaked,
write items at the desired level of
difficulty to build an adequate bank
Bank development
Otherwise you risk overexposure
Don’t use all your best items at once
to make a peaked TIF – or any TIF for
that matter
In the theoretical IRT world, we don’t
have to worry about that, but
exposure is a real issue
Bank development
That is the reason linear-on-the-fly
(LOFT) was developed – to massively
reduce exposure and increase
security
◦ Every person gets a very similar TIF, but
a completely different test
◦ These tests are parallel, from an IRT
point of view
◦ Tests are conventional fixed-form
Part 2
A brief comparison of CTT and IRT
CTT and IRT Assumptions
IRT:
◦ Unidimensionality and local independence
◦ Responses modeled by IRF
◦ Parameters, not statistics (sample independence)
CTT:
◦ X = T + E
◦ (1) true scores and error scores are uncorrelated; (2) the average error score in the sample is zero
◦ Statistics (not parameters) are sample-based
Comparing CTT and IRT
CTT is said to have weaker assumptions
◦ Does not explicitly assume
unidimensionality
But if not there, statistics will be iffy, and rpbis
and reliability suffer
Sum scoring implicitly assumes items are
equivalent, which means unidimensional (all
items count equally on one total score)
Comparing CTT and IRT
CTT is said to have weaker assumptions
◦ Does not explicitly assume IRF
But if the idea of an IRF is not working, then the
item isn’t either
And if you use rpbis, you assume a linear IRF,
which is actually impossible!
Comparing CTT and IRT
CTT item statistics are at odds with
each other
◦ P says that there is one common
probability of a correct response
(binomial)
◦ But rpbis says that P increases with total
score (~ability)
Comparing CTT and IRT
Classical SEM: same for everyone
IRT SEM: different for everyone –
depends on the items you see and
your ability
Which is more realistic?
Comparing CTT and IRT
Direct comparison of item statistics
◦ We still use “difficulty” and
“discrimination”
◦ How different are they from CTT?
◦ Difficulty correlates highly (>0.90)
◦ Discrimination does not – because rpbis
is linear and IRT is not
Comparing CTT and IRT
IRT and CTT scores also correlate
>0.95
So why use IRT?
There are distinct advantages…
Advantages of IRT
IRT has parameters, not statistics
Sample-independent… within a linear
transformation
Huh? This means that if you have two
calibration groups of different levels,
we can convert parameters/scores
with a simple y = mx + b
(Linking)
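The y = mx + b conversion can be illustrated with the mean/sigma method, one of several linking approaches and chosen here only because it is short. The anchor-item difficulties below are made up, and A and B play the roles of m and b in the line above.

```python
import statistics


def mean_sigma(b_source, b_target):
    """Slope A and intercept B placing the source scale onto the target scale."""
    A = statistics.pstdev(b_target) / statistics.pstdev(b_source)
    B = statistics.mean(b_target) - A * statistics.mean(b_source)
    return A, B


def rescale_item(a, b, c, A, B):
    """Transform 3PL item parameters: a shrinks by A, b moves linearly, c is unchanged."""
    return a / A, A * b + B, c


def rescale_theta(theta, A, B):
    """Transform a person score with the same line."""
    return A * theta + B


# Hypothetical difficulties of three anchor items calibrated in two groups.
b_source = [-1.0, 0.0, 1.0]
b_target = [-0.7, 0.5, 1.7]

A, B = mean_sigma(b_source, b_target)
print(A, B)
print(rescale_item(1.2, 0.0, 0.2, A, B))
print(rescale_theta(1.0, A, B))
```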
Advantages of IRT
Items and people are on the same
scale
Easier to interpret, and allows
adaptive testing
Advantages of IRT
Information provides an important
tool for test building and bank
development
Better match the purposes of a test
IRT CSEM allows far better
description of precision
Advantages of IRT
More precise scores
CTT number correct scoring is limited
to k + 1 scores
3PL has 2^k scores
Compare with 10 items:
◦ 11 vs 1024 possible scores
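The same comparison can be checked on a smaller scale with the five example items: number-correct scoring gives k + 1 = 6 possible scores, while pattern scoring can give up to 2^5 = 32. The sketch below scores every response pattern with EAP on a coarse grid under a standard normal prior. EAP, the grid, and D = 1.7 are all choices made here for illustration and are not specified in the slides.

```python
import itertools
import math

D = 1.7  # assumed scaling constant

# Example items from the slides: (a, b, c)
ITEMS = [
    (1.00, -2.00, 0.26),
    (0.70, -1.00, 0.21),
    (0.40, -0.50, 0.30),
    (0.50, 1.00, 0.00),
    (0.80, 0.00, 0.22),
]

GRID = [g / 10.0 for g in range(-40, 41)]  # quadrature points from -4.0 to 4.0


def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def eap(pattern):
    """Posterior mean of theta for a 0/1 response pattern."""
    num = den = 0.0
    for theta in GRID:
        weight = math.exp(-0.5 * theta ** 2)  # unnormalised N(0, 1) prior
        for (a, b, c), u in zip(ITEMS, pattern):
            p = p3pl(theta, a, b, c)
            weight *= p if u else 1.0 - p
        num += theta * weight
        den += weight
    return num / den


patterns = list(itertools.product([0, 1], repeat=len(ITEMS)))
raw_scores = {sum(u) for u in patterns}
eap_scores = {round(eap(u), 6) for u in patterns}
print(len(raw_scores), "number-correct scores vs", len(eap_scores), "distinct EAP scores")
```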
Advantages of IRT
Scores take item difficulty into
account
Allows direct comparison of
examinees that saw different sets of
items
Scores also account for guessing
Advantages of IRT
Nonlinear IRF – the linear IRF
assumed by CTT is impossible
Allows for different SEM for every
examinee
Not realistic to assume they are all
the same
Disadvantages of IRT
Sample size
CTT: 50 is OK, 100 is great
◦ It is much easier to fit a straight line
“model” than an IRF because it is an
oversimplification
IRT: 100 is bare minimum for 1PL
◦ 3PL? ~500
◦ Puts it out of reach of small testing
programs
Disadvantages of IRT
No “native” distractor analysis unless
polytomous models
Can adapt the CTT idea of
quantile/distractor plot with IRT
◦ IRT programs will also give you option P
and rpbis
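The numbers behind a quantile/distractor plot are just option proportions within score groups. The sketch below builds that table for one item from made-up data (total score, option chosen) with A as the assumed key. It assumes at least as many examinees as groups.

```python
from collections import Counter

# Hypothetical records for one item: (total test score, option chosen). Key is A.
RECORDS = [
    (1, "C"), (2, "B"), (3, "A"),
    (4, "A"), (5, "B"), (6, "A"),
    (7, "A"), (8, "A"), (9, "A"),
]


def quantile_table(records, n_groups=3, options="ABCD"):
    """Proportion choosing each option within equal-sized score groups, low to high."""
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    table = []
    for g in range(n_groups):
        group = ordered[g * n // n_groups:(g + 1) * n // n_groups]
        counts = Counter(choice for _, choice in group)
        table.append({opt: counts[opt] / len(group) for opt in options})
    return table


table = quantile_table(RECORDS)
for label, row in zip(("low", "middle", "high"), table):
    print(label, {opt: round(prop, 2) for opt, prop in row.items()})
```

For a well-behaved item, the proportion choosing the key rises from the low group to the high group while the distractors fall away.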
Disadvantages of IRT
Complexity
◦ Not only do you have to understand it
yourself, but…
◦ You also have to explain it to
stakeholders!
Disadvantages of IRT
However, note that these are not big
problems
◦ Many places have plenty of sample size
◦ You can still use CTT for distractor
analysis (always use both!!!!)
◦ The complexity is not too bad unless
using complex models
◦ Often, the biggest issue is the
stakeholders!
IRT Analysis
How do I go about doing this?
IRT Analysis
Xcalibre 4 for IRT
CTT analysis with Iteman 4 (not
necessary, but sometimes helps)
Also:
◦ Scoring and graphing tool
◦ Form building tool
◦ Empirical IRFs in Excel
◦ Have we covered these sufficiently?
IRT Analysis
I’m assuming here we are analyzing
just one sample of one test
What would I look for? Basic…
◦ Items with good parameters (keep/clone)
◦ Items with bad parameters (retire)
Evaluate their CTT option statistics
◦ TIF/CSEM – meet our needs? (not
good/bad in absolute sense)
IRT Analysis
What would I look for? Advanced…
◦ Dimensionality assessment (reliability,
any items/sections “off on their own”)
◦ Item fit (also dimensionality, and possible
item issues)
◦ Test sections – any stand out for being
hard, easy, low discriminations, poor
precision, etc?
◦ CSEM/TIF for sections: anything under-
measured?
IRT Analysis
What would I look for? Advanced…
◦ Finally: what do you want to see in the
data, and how will the test be used?
Later, we’ll talk about more
advanced uses like:
◦ Linking and equating multiple forms
◦ Test assembly
◦ Adaptive testing
◦ Dimensionality evaluation
Iteman 4.1
Performs comprehensive classical
analysis
Quantile plots allow broad evaluation
of IRF shape
Advantages:
◦ Easily understandable – can use with SMEs
◦ Includes distractors
Xcalibre 4.1
Provides a comprehensive and user-
friendly IRT analysis
Allows evaluation of individual items
and test as a whole
All major graphs
Many summary graphs (freqs etc.)
Classical analysis too
Reasons for Xcalibre 4.1
Currently available software (Parscale,
Bilog, Multilog, ConQuest, WinSteps,
ICL) still requires programming skills
Some still run on DOS!
If IRT is to be more widely used, it
needs a user-friendly system
◦ Input and output
Reasons for Xcalibre 4.1
Better input
◦ Yes: Point and click buttons
◦ No: DOS programming quasi-language
Better output
◦ Yes: Word docs (RTF), spreadsheets (CSV)
◦ No: DOS txt files with ugly tables
Reasons for Xcalibre 4.1
Advanced users with programming
skills and need for customized analysis
can still utilize previous software
Xcalibre 4.1 is designed for a wider
range of users
The following description is of Xcalibre
4, but also applies to Iteman 4
Xcalibre 4.1 Interface
Divided into tabs
Move left to right…
Xcalibre 4.1 Interface
All options are specified with buttons
or simple entry boxes
No code based on keywords
◦ Best example: IRT models (you’ll see)
Also: usable error messages
Specify files/input; choose options
I’ll now show how to use X4, and do
some analysis of real data…