REL Midwest and ARC
Item Response Theory
Kimberly Maier, Ph.D.
Assistant Professor, Measurement and Quantitative Methods
Michigan State University

Andrew Swanlund
Senior Research Associate and Psychometrician
Learning Point Associates
November 2009
Topics Covered
Rationale for IRT/Rasch methods
The 1PL/Rasch Dichotomous Model
The Rasch Rating Scale Model
Psychometric Analysis/Diagnostics
Advanced IRT Models
What’s a Latent Trait?
An unobservable trait (theorized to exist) that cannot be directly measured, although it can usually be easily described.
Examples include the following:
– Mathematics achievement
– Intelligence
– Attitudes
– Opinions
Tests or questionnaires are used to assess the latent trait.
Test Theory
Test theory or psychometrics is used to:
– Develop questionnaires.
  • Reliability
  • Validity
  • Determine dimensionality
– Provide measures (on a scale) for examinees.
Two “flavors” of test theory:
– Classical Test Theory
– Modern Test Theory (IRT)
Classical Test Theory
True score = Observed score + error
Focus on raw test scores
Item difficulty
– How hard is it to get the item “right”?
– Or, how hard is it to agree with a statement?
– Measured as the proportion of respondents who get the item “correct.”
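The proportion-correct difficulty described above can be computed in a couple of lines of Python (the function name is illustrative):

```python
def p_value(responses):
    """Classical item difficulty: the proportion of respondents
    answering the item correctly (scored 1) rather than incorrectly (0)."""
    return sum(responses) / len(responses)

# 7 of 10 respondents answered correctly, so the item's difficulty (p-value) is 0.7:
print(p_value([1, 1, 0, 1, 1, 0, 1, 1, 0, 1]))   # 0.7
```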
Classical Test Theory
Item discrimination
– How effectively an item differentiates between examinees who are high and those who are low on the latent trait.
– Two types of measures of discrimination:
  • Index of discrimination – cannot perform statistical significance tests to determine whether it is zero.
  • Various correlation coefficients (e.g., point biserial, biserial, tetrachoric, and phi coefficient) – measures of the relationship between responses on an item and performance on the entire test.
What Classical Test Theory Can’t Do
A person’s ability level and the survey item difficulties cannot be estimated separately.
– Implications:
  • A person’s measure of the latent trait depends on the survey items administered.
  • Items’ means depend on the sample of people who took the survey.
– Therefore, ALL estimates of the model are sample dependent and cannot be compared across samples varying in the distribution of the underlying latent trait.
What Classical Test Theory Can’t Do
Doesn’t provide information about how examinees at different ability levels on the trait have performed on individual items.
Difficult to compare performance of examinees who have taken different tests that measure the same trait.
Difficult to apply results to another group to be tested.
Item Response Theory
Items on a test/instrument measure a single latent trait or several latent traits (multidimensional)
Allows one to compare the performance of one group taking Test A with another group taking Test B
The results of an item analysis can be applied to groups of respondents other than the original group used for the analysis
General Ideas of IRT
The item response model gives us an idea of the probability that a person with latent trait level θ will correctly answer an item of difficulty δ.
The relationship between ability (attitude) and item response is characterized by an item characteristic curve.
Each person is assumed to have a level of ability that situates him or her on the item characteristic curve.
Item Characteristic Curve (ICC)
Important Points About the ICC
Item difficulty
– The level of ability at which 50 percent of the respondents are able to correctly answer the item
Item discrimination
– The slope of the item characteristic curve
– Determines the difference in the probabilities that respondents at different ability levels answer the item correctly
Difficulty and discrimination are independent of one another.
Measurement
The value of a latent trait measure θ usually varies from -3 to +3, although the limits are -∞ to +∞ (this can be rescaled to any metric if desired).
The higher one’s ability level, the higher his or her probability of correctly answering the item.
Psychometric Models
Three Main Properties/Assumptions of IRT (including Rasch)*
– Unidimensionality
– Local Independence
– Monotonicity of the Item Response Functions
*These also hold for Factor Analysis and Classical Test Theory, but we discuss them here in the IRT framework.
Psychometric Models
Unidimensionality
– The latent trait (or construct) is represented by a single number (often denoted θ)
– Examples – mathematics ability, self-efficacy
– Some constructs are multidimensional – personality – “the big five” (openness, conscientiousness, extroversion, agreeableness, neuroticism).
Models we talk about today are for unidimensional constructs only.
Can model multidimensional constructs with MIRT (Reckase, 2009)
Psychometric Models
Local Independence
– Responses to items are independent from one another after taking into account examinee/respondent ability
– Items shouldn’t cue one another – that is, knowing the answer to one item shouldn’t give you the answer to another
Psychometric Models
Monotonicity of the Item Characteristic Curve
– Higher ability examinees should have a higher probability of a successful/favorable response than lower ability examinees
The 1PL/Rasch Dichotomous Model
What kind of data can we use with the dichotomous model?
– Multiple-choice or correct/incorrect test data
– Checklist data
– Yes/no survey responses
The 1PL/Rasch Dichotomous Model
The 1-Parameter Logistic (or Rasch) Model:
P(X_ij = 1 | θ_i, δ_j) = exp(θ_i - δ_j) / [1 + exp(θ_i - δ_j)]

Note: The probability of a correct response depends only on the ability of the person (θ_i) and the difficulty of the item (δ_j).
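A minimal Python sketch of this probability (the function name is an assumption):

```python
import math

def rasch_prob(theta, delta):
    """Probability of a correct response under the 1PL/Rasch model:
    P(X=1 | theta, delta) = exp(theta - delta) / (1 + exp(theta - delta))."""
    z = theta - delta
    return math.exp(z) / (1.0 + math.exp(z))

# When ability equals difficulty, the probability is exactly 0.5:
print(rasch_prob(1.0, 1.0))   # 0.5
```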
1PL/Rasch Theoretical ICCs
[Figure: Rasch ICCs plotting probability of a correct response against ability (-3 to +3) for items with difficulty = -1, 0, and +1.]
The 1PL/Rasch Dichotomous Model
Measures of ability and item parameters are reported in logits:
– The mathematical unit of ability is defined as the log odds for succeeding on items of the kind chosen to define the “zero” point on the scale.
– The mathematical unit of an item’s difficulty is defined as the log odds for eliciting failure from persons with “zero” ability.
logit = ln( p_i / (1 - p_i) )
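In code, the logit and its inverse might look like this (names are illustrative):

```python
import math

def logit(p):
    """Log odds of a proportion p (0 < p < 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Inverse logit: maps a logit value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# A proportion of 0.5 corresponds to 0 logits; 0.73 is about +1 logit:
print(logit(0.5))              # 0.0
print(round(logit(0.73), 2))   # 0.99
```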
The 1PL/Rasch Dichotomous Model
Probabilities of correctly answering an item with difficulty of 1.0:

Ability   Probability
 -3.00       0.02
 -2.00       0.05
 -1.00       0.12
  0.00       0.27
  1.00       0.50
  2.00       0.73
  3.00       0.88
2-PL Model
The 2-parameter logistic model:
P(X_ij = 1 | θ_i, a_j, b_j) = exp[a_j(θ_i - b_j)] / {1 + exp[a_j(θ_i - b_j)]}

Note: The probability of a correct response depends on the ability of the person (θ_i), the difficulty of the item (b_j), and the discrimination of the item (a_j).
2-Parameter Logistic ICCs
[Figure: 2PL item characteristic curves plotting probability of a correct response against ability (-3 to +3).]
3-PL Model
The 3-parameter logistic model:

P(X_ij = 1 | θ_i, a_j, b_j, c_j) = c_j + (1 - c_j) · exp[a_j(θ_i - b_j)] / {1 + exp[a_j(θ_i - b_j)]}

Note: The probability of a correct response depends on the ability of the person (θ_i), the difficulty of the item (b_j), the discrimination (a_j), and the guessing parameter (c_j).
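All three logistic models can be sketched as one Python function, with the 2PL as the special case c = 0 and the 1PL/Rasch additionally fixing a = 1 (names are illustrative):

```python
import math

def irt_prob(theta, b, a=1.0, c=0.0):
    """P(X=1) under the 3PL model; a=1, c=0 reduces it to the 1PL/Rasch model."""
    p_star = 1.0 / (1.0 + math.exp(-a * (theta - b)))  # 2PL kernel
    return c + (1.0 - c) * p_star                      # guessing floor

# With a guessing parameter, even very low-ability examinees
# retain a nonzero chance of a correct response:
print(round(irt_prob(-3.0, b=0.0, a=1.0, c=0.2), 3))   # 0.238
```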
Difference Between Rasch and 2/3PL?
2/3PL models will likely fit the data better (higher-parameterized models do that), but they require a lot of data to obtain stable parameter estimates.
Modeling items with high/low discrimination and high guessing parameters can lead to the inclusion of lower quality items (according to some); conversely, Rasch analysis may result in omitting some items.
In 2/3PL, the pattern of responses matters; in Rasch, there is one scale score per raw score.
The Rasch Rating Scale Model
What are some uses of the rating scale model?
– Survey response data with a standard set of responses across many items (SD, D, A, SA)
– Scoring rubrics where the performance levels are defined similarly across all indicators
If score definitions vary across items, the Partial Credit Model is required instead.
The Rasch Rating Scale Model
Here’s the math…
π_nix = exp[ Σ_{k=0}^{x} (θ_n - (δ_i + τ_k)) ] / Σ_{j=0}^{m} exp[ Σ_{k=0}^{j} (θ_n - (δ_i + τ_k)) ]

where π_nix is the probability that person n responds in category x to item i, θ_n is the person measure, δ_i is the item difficulty, τ_k are the rating scale thresholds (with τ_0 ≡ 0), and m is the highest category.
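A minimal Python sketch of these category probabilities (function and argument names are assumptions):

```python
import math

def rsm_probs(theta, delta, taus):
    """Category probabilities for one item under the Rasch Rating Scale Model.
    theta: person measure; delta: item difficulty;
    taus: thresholds tau_1..tau_m (tau_0 is taken as 0)."""
    taus = [0.0] + list(taus)
    # Cumulative numerators: exp( sum_{k=0}^{x} (theta - (delta + tau_k)) )
    nums, s = [], 0.0
    for tau in taus:
        s += theta - (delta + tau)
        nums.append(math.exp(s))
    total = sum(nums)
    return [n / total for n in nums]

# Four categories (e.g., SD/D/A/SA) with thresholds at -1.5, 0, +1.5 around delta:
probs = rsm_probs(theta=0.0, delta=0.0, taus=[-1.5, 0.0, 1.5])
print([round(p, 3) for p in probs])   # [0.091, 0.409, 0.409, 0.091]
```

With the person at the item's difficulty and symmetric thresholds, the middle categories are most probable, as expected.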
Rasch Probability Curve
Category Probability Curves for an Agreement Scale…
[Figure: category probability curves, with the thresholds labeled where adjacent category curves cross.]
“Validation” of an Instrument
Some psychometric properties to consider
– Reliability
– Rating scale functioning
– Item and person fit
– Point-measure correlation
– Differential item functioning (DIF)
– Dimensionality
“Validation” of a Survey
Reliability
– A definition: the degree to which scores are free from measurement error, or how consistent the scores are within an administration and over time.
– In general, reliability increases with the number of items and the ability of those items to spread people out along the scoring metric.
– Rules of thumb – 0.7 = OK, 0.8 = good, 0.9 = excellent (but it depends on the use of scores)
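One common internal-consistency index, Cronbach's alpha (a classical index; Rasch software also reports person separation reliability), can be sketched in Python with illustrative data:

```python
import statistics

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of per-item score lists (one list per item):
    alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores)."""
    k = len(item_scores)
    item_vars = [statistics.pvariance(item) for item in item_scores]
    totals = [sum(person) for person in zip(*item_scores)]
    return k / (k - 1) * (1 - sum(item_vars) / statistics.pvariance(totals))

# Three items answered by five respondents (illustrative data):
items = [[3, 4, 2, 5, 4], [2, 4, 2, 4, 3], [3, 5, 1, 4, 4]]
print(round(cronbach_alpha(items), 3))   # 0.925
```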
Reliability and Score Distribution
A scale with reliability 0.40
[Figure: histogram of TLSS total scores (count vs. score, roughly 20 to 80), illustrating the score distribution of a low-reliability scale.]
“Validation” of a Survey
Rating Scale Functioning
– Are the respondents using the rating scale in a consistent fashion?
– Are any categories being over- or underutilized?
– Is there a good distribution of responses across categories?
– Are the categories “disordered”?
Disordered Categories
A frequency scale with a problem…(never, daily, weekly, biweekly, monthly)
[Figure: category probability curves for this scale, with the never, daily, and weekly curves labeled and the 3/4 and 4/5 thresholds marked out of order.]
What’s wrong with this scale?
A five-point partial credit observation item…
“Validation” of a Survey
Item and Person Fit
– Are there unpredictable responses in the data (under-fit)?
  • Can indicate multidimensionality, confusing wording or multiple meanings, content not consistent with the construct, or multiple classes of respondents
– Are there responses that are too predictable (over-fit)?
  • Can indicate redundancy in the items (or response sets)
– Are those responses made by individuals at the center of the distribution or at the extremes?
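Infit and outfit mean-squares quantify these patterns; a minimal Python sketch for the dichotomous case (the function name is illustrative):

```python
def fit_statistics(observed, expected):
    """Infit and outfit mean-squares for one item (dichotomous case).
    observed: 0/1 responses; expected: model probabilities (strictly between
    0 and 1) for each person. Values near 1.0 indicate good fit; values
    well above 1 suggest under-fit, well below 1 suggest over-fit."""
    variances = [p * (1 - p) for p in expected]
    z_sq = [(x - p) ** 2 / v for x, p, v in zip(observed, expected, variances)]
    outfit = sum(z_sq) / len(z_sq)                      # unweighted mean square
    infit = (sum((x - p) ** 2 for x, p in zip(observed, expected))
             / sum(variances))                          # information-weighted
    return infit, outfit

infit, outfit = fit_statistics([1, 0, 1, 1, 0], [0.8, 0.3, 0.6, 0.9, 0.2])
print(round(infit, 2), round(outfit, 2))   # 0.4 0.34
```

Here every response goes the way the model predicts, so both statistics fall below 1 (over-fit, i.e., responses more predictable than the model expects).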
“Validation” of a Survey
An example of a misfitting person…
KEY: .1.=OBSERVED, 1=EXPECTED, (1)=OBSERVED, BUT VERY UNEXPECTED.
NUMBER - NAME ------------------ MEASURE - INFIT (MNSQ) OUTFIT - S.E.
372 3457102 62.28 4.8 A 4.8 7.85
-10 10 30 50 70 90 110
|---------+---------+---------+---------+---------+---------| NUM Item
(2) 4 10* 3c
.4. 12* 3e
4 (5) 14* 3g
.3. 4 9* 3b
(3) 4 8* 3a
4 (5) 13* 3f
4 (5) 11* 3d
|---------+---------+---------+---------+---------+---------| NUM Item
-10 10 30 50 70 90 110
“Validation” of a Survey
Point-measure correlation
Extent to which item-rating correlates with total score.

+----------------------------------------------------------------------------+
|ENTRY    RAW            MODEL|   INFIT  |  OUTFIT  |PTMEA|           |
|NUMBER  SCORE  COUNT  MEASURE  S.E. |MNSQ  ZSTD|MNSQ  ZSTD|CORR.| Item    G |
|------------------------------------+----------+----------+-----+-----------|
|   115     13     15   35.86  8.33|1.51   1.1|9.90   3.5|A-.24| @51b    D |
|   116     29    377   90.20  2.05|1.23   1.5|4.26   6.0|B-.02| I52     C |
|    79     46    342   83.49  1.72|1.46   3.8|4.25   8.2|C-.16| I32     E |
|   114     15    373   97.81  2.73|1.11    .5|3.86   3.8|D .02| I51     C |
|   120    327    369   34.48  1.75|1.31   2.5|2.98   5.3|E-.04| I54     C |
|   117     17     29   53.96  4.51|1.72   2.8|2.83   3.7|F .05| @52b    D |
|   124    286    368   43.80  1.38|1.20   2.6|1.63   3.6|G .21| avail57 A |
|   113    372    378   12.33  4.16|1.02    .2|1.61    .9|H .07| I50     D |
|   118    144    356   64.60  1.23|1.33   5.8|1.58   6.1|I .20| I53     C |
|   121    263    324   41.06  1.55|1.19   2.1|1.53   2.5|J .21| @54b    D |
|    75    319    376   38.22  1.55|1.10   1.1|1.32   1.5|K .26| I28     E |
|   119    100    142   51.10  2.09|1.16   1.6|1.30   1.5|L .35| @53b    D |
|   122    247    338   47.16  1.36|1.04    .6|1.21   1.6|M .37| I55     C |
|   123    259    370   48.77  1.27|1.19   3.1|1.17   1.5|N .30| avail56 A |
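The PTMEA CORR. column is just the Pearson correlation between one item's responses and the person measures; a hand-rolled Python sketch with illustrative data:

```python
import math

def point_measure_corr(responses, measures):
    """Pearson correlation between one item's scores and the person measures.
    Near-zero or negative values flag items that do not cohere with the scale."""
    n = len(responses)
    mx = sum(responses) / n
    my = sum(measures) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(responses, measures))
    sx = math.sqrt(sum((x - mx) ** 2 for x in responses))
    sy = math.sqrt(sum((y - my) ** 2 for y in measures))
    return cov / (sx * sy)

# Higher-measure persons endorsing the item more gives a positive correlation:
print(round(point_measure_corr([0, 0, 1, 1, 1], [-1.2, -0.4, 0.1, 0.8, 1.5]), 2))   # 0.84
```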
“Validation” of a Survey
Differential Item Functioning Analysis
– A method for looking for item bias (or differing perceptions of the items for a survey)
– Item bias is different from test bias (the cumulative effect of item bias on total score)
– DIF analysis looks for items where similar-ability respondents from different demographics respond in a very different manner
“Validation” of a Survey
Dimensionality (Rasch PCA)
– Similar to exploratory factor analysis
– Examines the variance structure in the model residuals after factoring out the variance explained by scale scores
– Looks at correlations in the residual matrix to identify factors (dimensions) that may be affecting patterns of responses
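The residual PCA idea can be sketched with numpy (an illustration of the principle, not the exact procedure any particular package uses; the function name is an assumption):

```python
import numpy as np

def residual_pca_eigenvalues(X, P):
    """Eigenvalues (largest first) of the correlation matrix of standardized
    residuals. X: persons x items responses; P: model-expected values for each
    cell. A notably large first eigenvalue suggests a secondary dimension
    left in the residuals after the Rasch dimension is removed."""
    Z = (X - P) / np.sqrt(P * (1 - P))   # standardized residuals
    R = np.corrcoef(Z, rowvar=False)     # item-by-item residual correlations
    return np.sort(np.linalg.eigvalsh(R))[::-1]
```

The eigenvalues sum to the number of items, so values well above 1 mark contrasts that absorb more residual variance than a single item's share.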
Estimation of Model Parameters
The values of the respondents’ latent trait measures and the difficulties of the items are all unknown quantities.
Likelihood approaches are commonly used to estimate parameters:
– 1PL/Rasch – joint maximum likelihood (JMLE), conditional maximum likelihood (CMLE)
– All models – marginal maximum likelihood (MMLE)
Bayesian techniques are another option.
Available Software
IRT (1/2/3PL, Graded Response Model, Generalized Partial Credit Model)
– BILOG-MG 3.0, PARSCALE 4.0, TESTFACT 4.0, MULTILOG 7.0, R modules such as eRm
Rasch (1PL, Rasch Rating Scale Model, Partial Credit Model)
– WINSTEPS, RUMM, BIGSTEPS, Conquest, WINMIRA, R modules such as plink
More Advanced Models
Multidimensional random coefficients multinomial logit model (MRCML) – Conquest
Mixture distribution Rasch models (latent class analysis) – WINMIRA
Many-faceted Rasch model (FACETS)
Multilevel Rasch models (HLM)