Statistical and operational complexities of the studies II
Assessment design: Use of IRT and Plausible Values
Andrés Sandoval-Hernández – IEA DPC
Workshop on using PISA, TIMSS & PIRLS, TALIS datasets
Ispra, Italy, June 24-27, 2014
Note: These slides were prepared as part of the IEA training portfolio with the collaboration of IEA staff and resource persons.
Table of Contents
• Introduction
• Producing comparable scores
• Scaling procedures
• Calculating standard errors
Introduction
• Complex Sample Design
• Probabilistic, stratified, multistage sample designs
• Need to take sample design into account when computing estimates
• Complex Assessment Design
• Multiple matrix sample designs where nobody takes all items, and no items are given to all
• Need to produce comparable scores and to take measurement uncertainty into account when computing estimates
Introduction – What do we really want to know?
• How would the students have performed on the test had we been able to administer ALL the items to ALL of them?
• Since we did not test everyone on everything, we need to make our best guess (scientific estimation)
• Remember our goal…
• Administer in a sensible design
• Obtain comparable scores
• Correct for unreliability
Producing Comparable Scores
• Raw scores do not take into account the difficulty of the items
• Different students took different items
• Raw scores therefore cannot be compared across students who took different tests or different subsets of a test
• Instead, student achievement is estimated using scale scores computed based on Item Response Theory (IRT)
Why IRT?
• Many items are needed to assess a domain as broadly defined as, for example, mathematics
• At the same time it is unreasonable to administer the whole item battery to each sampled student because:
• Students' results will be affected by fatigue
• Principals and teachers would be hesitant to free students for very long testing periods which would reduce participation
Students are assigned subsets of the item pool
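A minimal sketch of how a rotated design might assign subsets of the item pool. The block and booklet counts are hypothetical and do not reproduce the actual design of any study; the function names are illustrative only.

```python
def build_booklets(n_blocks, blocks_per_booklet):
    """Cyclic rotation: booklet b holds blocks b, b+1, ... (modulo n_blocks).

    Consecutive booklets share a block, which is what later allows all
    items to be placed on a common scale.
    """
    return [
        [(start + offset) % n_blocks for offset in range(blocks_per_booklet)]
        for start in range(n_blocks)
    ]


def assign_booklets(student_ids, booklets):
    """Hand out booklets in rotation so each one is used about equally often."""
    return {sid: idx % len(booklets) for idx, sid in enumerate(student_ids)}


# Hypothetical pool: 7 item blocks, 2 blocks per booklet, 21 students.
booklets = build_booklets(n_blocks=7, blocks_per_booklet=2)
assignment = assign_booklets(range(21), booklets)
print(booklets)
```

No student sees more than two of the seven blocks, yet every block is answered by some students and every pair of neighbouring booklets overlaps.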
IRT: Item Response Theory
• Response to an item depends on the interaction between the “ability” of the respondent, and characteristics of the item
• Persons of high ability should answer easy items correctly
• Persons of low ability should not answer difficult items correctly
• Does not make assumptions of normal distribution but assumes unidimensionality of measurement
Item Characteristic Curve
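An item characteristic curve can be sketched with a logistic function. The three-parameter logistic (3PL) form below is one common choice; the slides do not state which model is meant, so the parameter names (a, b, c) and the values used are assumptions for illustration.

```python
import math


def icc(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a correct response under a 3PL logistic model.

    theta: respondent ability; a: discrimination; b: difficulty;
    c: lower asymptote (pseudo-guessing). With a=1 and c=0 this is
    the one-parameter (Rasch-type) curve.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item: the probability rises with ability.
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(icc(theta, a=1.2, b=0.5, c=0.2), 3))
```

When ability equals the item difficulty (theta = b), the probability sits halfway between the lower asymptote c and 1.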
Advantages of IRT Models
• IRT models allow us to create a continuum on which both student performance and item difficulty will be located, linked by a probabilistic function
• IRT allows for performance in a subject to be summarized on a common scale even when different students are administered different items
• Facilitates linking when dealing with rotated test forms
Advantages of IRT
• It allows us to:
• Evaluate the effectiveness of a test at different levels of ability
• Design tests to best measure at a specific ability level
• Develop new tests and investigate them without administering them
• Develop item statistics that do not change when the group of examinees changes
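One way to evaluate a test at different ability levels is the test information function. The sketch below assumes a two-parameter logistic model and uses made-up item parameters; it is an illustration, not the procedure of any particular study.

```python
import math


def p_correct(theta, a, b):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def total_information(theta, items):
    """Sum of the item informations a^2 * P * (1 - P) over (a, b) pairs."""
    total = 0.0
    for a, b in items:
        p = p_correct(theta, a, b)
        total += a * a * p * (1.0 - p)
    return total


# Hypothetical item parameters: (discrimination, difficulty).
easy_test = [(1.0, -1.5), (1.2, -1.0), (0.9, -0.5)]
hard_test = [(1.0, 0.5), (1.2, 1.0), (0.9, 1.5)]

for theta in (-1.0, 1.0):
    print(theta, total_information(theta, easy_test), total_information(theta, hard_test))
```

Higher information means a smaller measurement error at that ability level, so the easy test measures low-ability respondents better and the hard test measures high-ability respondents better.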
Scaling Procedures
• Achievement is initially estimated using computed scale scores based on IRT
• IRT allows for performance in a subject to be summarized on a common scale even when different students are administered different items
• In addition to IRT, these studies make use of multiple imputations or “plausible values” methodology
Plausible Values
• Random draws from the estimated ability distribution of students with similar item response patterns and background characteristics
• The variance of these draws reflects the uncertainty of measurement
• Think of a regression where the predictors are item responses and background data
[Figure: ability distribution with five randomly drawn plausible values, PV1-PV5, marked along it]
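A minimal sketch of the idea of drawing plausible values, under simplifying assumptions: a Rasch-type model, known item difficulties, and a normal prior whose mean stands in for the prediction from background data. Operational scaling is considerably more elaborate; all names and numbers here are hypothetical.

```python
import math
import random


def posterior_on_grid(responses, difficulties, prior_mean=0.0, prior_sd=1.0):
    """Posterior of ability on a grid, for a Rasch-type model with a normal prior."""
    grid = [g / 10.0 for g in range(-40, 41)]  # ability values from -4 to 4
    weights = []
    for theta in grid:
        # Log prior (up to a constant) plus log likelihood of the responses.
        log_w = -0.5 * ((theta - prior_mean) / prior_sd) ** 2
        for x, b in zip(responses, difficulties):
            p = 1.0 / (1.0 + math.exp(-(theta - b)))
            log_w += math.log(p if x == 1 else 1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    return grid, [w / total for w in weights]


def draw_plausible_values(responses, difficulties, n_draws=5, seed=0, **prior):
    """Random draws from the posterior; their spread reflects measurement uncertainty."""
    grid, probs = posterior_on_grid(responses, difficulties, **prior)
    rng = random.Random(seed)
    return rng.choices(grid, weights=probs, k=n_draws)


# Hypothetical student: four items answered, three of them correctly.
pvs = draw_plausible_values([1, 1, 0, 1], [-1.0, 0.0, 0.5, 1.0], prior_mean=0.2)
print(pvs)
```

With only four items the posterior is wide, so the five draws differ noticeably from each other; that spread is the measurement uncertainty the plausible values carry into later analyses.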
Using Plausible Values
• Plausible values are optimal for obtaining population estimates
• Plausible values should not be used for individual reporting
• Compute statistics with each plausible value and average results
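The last bullet can be sketched as follows: compute the statistic once per plausible value, then average the results. The records and the variable names (PV1 to PV5, WGT) are placeholders, not the names used in any particular dataset.

```python
def weighted_mean(values, weights):
    """Weighted mean of values using the sampling weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def pv_estimate(records, pv_names, weight_name, statistic=weighted_mean):
    """Compute the statistic once per plausible value, then average the results."""
    weights = [r[weight_name] for r in records]
    per_pv = [statistic([r[name] for r in records], weights) for name in pv_names]
    return sum(per_pv) / len(per_pv), per_pv


# Hypothetical records for three students.
records = [
    {"PV1": 510.0, "PV2": 498.0, "PV3": 505.0, "PV4": 515.0, "PV5": 502.0, "WGT": 1.5},
    {"PV1": 455.0, "PV2": 470.0, "PV3": 462.0, "PV4": 449.0, "PV5": 468.0, "WGT": 2.0},
    {"PV1": 590.0, "PV2": 575.0, "PV3": 584.0, "PV4": 596.0, "PV5": 580.0, "WGT": 0.5},
]
estimate, per_pv = pv_estimate(records, ["PV1", "PV2", "PV3", "PV4", "PV5"], "WGT")
print(estimate, per_pv)
```

The five per-PV results are kept as well as their average, because their spread is needed later for the assessment (imputation) variance.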
Calculating standard errors
• The standard error for any statistic t estimated from a large-scale assessment (LSA) combines sampling variance and assessment variance
• The variances add, not the standard errors:
Variance(t) = Var(sampling) + Var(assessment)
• The standard error is the square root of the total variance:
Standard Error(t) = sqrt( Var(sampling) + Var(assessment) )
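Computing Standard Error (TIMSS & PIRLS)
\[
\mathrm{Variance}(t) =
\underbrace{\sum_{i=1}^{75} \left(t_i - t\right)^2}_{\text{Sampling Variance}}
+
\underbrace{\left(1 + \frac{1}{5}\right) \frac{1}{5 - 1} \sum_{i=1}^{5} \left(t_{PV_i} - \bar{t}\right)^2}_{\text{Assessment Variance}}
\]
In the sampling term, t_i is the statistic computed with the i-th of the 75 sets of jackknife replicate weights and t is the full-sample estimate. In the assessment term, t_{PV_i} is the statistic computed with the i-th of the 5 plausible values and \bar{t} is the average of those five results.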
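The two variance terms for TIMSS & PIRLS can be sketched in code. The function names are hypothetical and the toy input uses only three replicate estimates for brevity, where the studies use 75.

```python
import math


def jackknife_sampling_variance(full_estimate, replicate_estimates):
    """Sum of squared deviations of the replicate estimates from the full-sample estimate."""
    return sum((t_i - full_estimate) ** 2 for t_i in replicate_estimates)


def imputation_variance(pv_estimates):
    """(1 + 1/M) times the variance of the M per-plausible-value estimates."""
    m = len(pv_estimates)
    mean = sum(pv_estimates) / m
    between = sum((t - mean) ** 2 for t in pv_estimates) / (m - 1)
    return (1.0 + 1.0 / m) * between


def standard_error(full_estimate, replicate_estimates, pv_estimates):
    """Square root of sampling variance plus assessment (imputation) variance."""
    total = jackknife_sampling_variance(full_estimate, replicate_estimates)
    total += imputation_variance(pv_estimates)
    return math.sqrt(total)


# Toy numbers, for illustration only.
replicates = [500.3, 499.8, 500.1]
per_pv = [499.0, 501.0, 500.0, 502.0, 498.0]
print(standard_error(500.0, replicates, per_pv))
```

In this toy case the assessment variance (3.0) is much larger than the sampling variance (0.14), which shows why leaving out either term can badly understate the standard error.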
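Computing Standard Error (PISA)
\[
\mathrm{Variance}(t) =
\underbrace{\frac{1}{80\,(1 - 0.5)^2} \sum_{i=1}^{80} \left(t_i - t\right)^2}_{\text{Sampling Variance}}
+
\underbrace{\left(1 + \frac{1}{5}\right) \frac{1}{5 - 1} \sum_{i=1}^{5} \left(t_{PV_i} - \bar{t}\right)^2}_{\text{Imputation Variance}}
\]
In the sampling term, t_i is the statistic computed with the i-th of the 80 sets of replicate weights and t is the full-sample estimate. In the imputation term, t_{PV_i} is the statistic computed with the i-th of the 5 plausible values and \bar{t} is the average of those five results.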
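The PISA sampling term, with its factor 1 / (80 (1 - 0.5)^2) = 0.05, can be sketched as below. The function name and the toy replicate estimates are hypothetical.

```python
def fay_brr_variance(full_estimate, replicate_estimates, fay_k=0.5):
    """Replicate-based sampling variance with the factor 1 / (G * (1 - k)^2).

    G is the number of replicate estimates and k is the Fay factor
    (0.5 in the formula on the slide).
    """
    g = len(replicate_estimates)
    factor = 1.0 / (g * (1.0 - fay_k) ** 2)
    return factor * sum((t_i - full_estimate) ** 2 for t_i in replicate_estimates)


# Toy input: 80 replicate estimates, each one point above or below 500.
toy_replicates = [500.0 + (1.0 if i % 2 == 0 else -1.0) for i in range(80)]
print(fay_brr_variance(500.0, toy_replicates))
```

The squared deviations sum to 80, and 0.05 times 80 gives a sampling variance of 4, that is, a sampling standard error of 2 points before the imputation variance is added.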
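Computing Standard Error (TALIS)
\[
\mathrm{Variance}(t) =
\underbrace{\frac{1}{80\,(1 - 0.5)^2} \sum_{i=1}^{80} \left(t_i - t\right)^2}_{\text{Sampling Variance}}
\]
Here t_i is the statistic computed with the i-th set of replicate weights and t is the full-sample estimate.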
Computing Standard Error (PIAAC)
Variance(t) = Sampling Variance + Imputation Variance
NOTE: In PIAAC the replication method changes from country to country, so the formula for the sampling variance differs accordingly.
Example: Comparing Standard Errors
[Screenshot: the same analysis run in SPSS and in the IDB Analyzer]
• IDB Analyzer: SE taking into account sampling & assessment error
• SPSS: SE calculated assuming simple random sample
The IEA/ETS Research Institute (www.IERInstitute.org)
[Figure callouts: "57 points!" and "Only 1 point!"]
Summarizing… (about the PVs)
• NEVER treat them as an individual score
• NEVER use the average of the plausible values as a score
• ALWAYS repeat the analysis separately with each plausible value
• When selecting variables
• When selecting people
• Report the average of the statistics computed
• When conducting significance testing, combine the assessment variance with the sampling variance
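A small simulation of why the plausible values should not be averaged before the analysis. The setup is hypothetical: each student's ability distribution is assumed normal with its own mean and a common standard deviation, and the statistic of interest is the standard deviation of achievement.

```python
import random
import statistics

rng = random.Random(42)
n_students, n_pvs = 5000, 5

# Hypothetical setup: student-specific means (SD 0.8) and a common
# within-student SD of 0.6, so the population SD is sqrt(0.64 + 0.36) = 1.
student_means = [rng.gauss(0.0, 0.8) for _ in range(n_students)]
pvs = [[rng.gauss(mu, 0.6) for _ in range(n_pvs)] for mu in student_means]

# Wrong: average the plausible values per student, then compute the statistic once.
sd_of_averaged_pvs = statistics.pstdev([sum(row) / n_pvs for row in pvs])

# Right: compute the statistic with each plausible value, then average the results.
sd_per_pv = [statistics.pstdev([row[j] for row in pvs]) for j in range(n_pvs)]
sd_correct = sum(sd_per_pv) / n_pvs

print(sd_of_averaged_pvs, sd_correct)
```

Averaging first shrinks each student's values toward the student mean, so the spread of achievement is underestimated; analysing each plausible value separately recovers a standard deviation close to 1.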
Summarizing… (in general)
• If you do not take the sample and test design into account in your analysis, you simply end up with the wrong answer.
• Sampling weights
• Replicate weights
• Plausible values
• If we did not have to do this, we wouldn’t!!
• Programs like the IDB Analyzer, WesVar and AM take sample and test design into account.
Thank you for your attention!
Any questions?