Statistical and operational complexities of the studies II

Assessment design: Use of IRT and Plausible Values

Andrés Sandoval-Hernández – IEA DPC

Workshop on using PISA, TIMSS & PIRLS, TALIS datasets

Ispra, Italy, June 24-27, 2014

Note: These slides were prepared as part of the IEA training portfolio with the collaboration of IEA staff and resource persons.

Table of Contents

• Introduction

• Producing comparable scores

• Scaling procedures

• Calculating standard errors


Introduction

• Complex Sample Design

• Probabilistic, stratified, multistage sample designs

• Need to take sample design into account when computing estimates

• Complex Assessment Design

• Multiple-matrix sampling designs, where no student takes all items and no item is given to all students

• Need to produce comparable scores and to take measurement uncertainty into account when computing estimates


Introduction: What do we really want to know?

• How would the students have performed on the test had we been able to administer ALL the items to ALL of them?

• Since we did not test everyone on everything, we need to make our best guess (scientific estimation)

• Remember our goal…

• Administer in a sensible design

• Obtain comparable scores

• Correct for unreliability


Producing Comparable Scores

• Raw scores do not take into account the difficulty of the items

• Different students took different items

• Raw scores therefore cannot be compared across students who took different tests or different subsets of a test

• Instead, student achievement is estimated using scale scores computed based on Item Response Theory (IRT)


Why IRT?

• Many items are needed to assess a domain as broadly defined as, for example, mathematics

• At the same time, it is unreasonable to administer the whole item battery to each sampled student because:

• Students' results would be affected by fatigue

• Principals and teachers would hesitate to release students for very long testing periods, which would reduce participation

• Therefore, students are assigned subsets of the item pool


IRT: Item Response Theory

• Response to an item depends on the interaction between the “ability” of the respondent, and characteristics of the item

• Persons of high ability should answer easy items correctly

• Persons of low ability should not answer difficult items correctly

• IRT does not assume that ability is normally distributed, but it does assume that the measurement is unidimensional


Item Characteristic Curve
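
The curve on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming a three-parameter logistic form; the function name `icc` and the parameter names `a`, `b`, `c` are conventions chosen for this example, not taken from the slides, and the slides do not say which IRT model each study uses.

```python
import math

def icc(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under a logistic IRT model.

    theta: respondent ability
    b:     item difficulty
    a:     item discrimination (slope)
    c:     lower asymptote (chance of guessing correctly)
    With a=1 and c=0 this reduces to the one-parameter (Rasch) form.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Evaluate the curve for a hypothetical item of average difficulty (b = 0).
for theta in (-2.0, 0.0, 2.0):
    print(f"theta={theta:+.1f}  P(correct)={icc(theta, b=0.0):.3f}")
```

The probability rises with ability and equals 0.5 where ability matches the item difficulty (when `c` is 0), which is how ability and difficulty end up on the same scale.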


Advantages of IRT Models

• IRT models allow us to create a continuum on which both student performance and item difficulty will be located, linked by a probabilistic function

• IRT allows for performance in a subject to be summarized on a common scale even when different students are administered different items

• Facilitates linking when dealing with rotated test forms
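
To make the common-scale idea concrete, here is a rough sketch assuming a Rasch model with item difficulties that are already known. The booklets, difficulties and function names are hypothetical, and the grid search stands in for the far more elaborate estimation used operationally.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_likelihood(theta, responses):
    """responses: list of (item_difficulty, score) pairs, score in {0, 1}."""
    total = 0.0
    for b, x in responses:
        p = rasch_p(theta, b)
        total += math.log(p) if x == 1 else math.log(1.0 - p)
    return total

def ml_ability(responses, lo=-4.0, hi=4.0, steps=801):
    """Grid-search maximum-likelihood ability estimate on [lo, hi]."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, responses))

# Hypothetical booklets: both students answer 2 of 3 items correctly,
# but student B was assigned harder items.
student_a = [(-1.5, 1), (-1.0, 1), (-0.5, 0)]
student_b = [(0.5, 1), (1.0, 1), (1.5, 0)]

theta_a = ml_ability(student_a)
theta_b = ml_ability(student_b)
print(f"raw scores: 2 and 2; ability estimates: {theta_a:.2f} and {theta_b:.2f}")
```

The two raw scores are identical, yet the estimated abilities differ because the item difficulties enter the likelihood. That is the sense in which IRT scores remain comparable across different item subsets.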


Advantages of IRT

• It allows us to:

• Evaluate the effectiveness of a test at different levels of ability

• Design tests to best measure at a specific ability level

• Develop new tests and investigate them without administering them

• Develop item statistics that do not change when the group of examinees changes


Scaling Procedures

• Achievement is initially estimated using computed scale scores based on IRT

• IRT allows for performance in a subject to be summarized on a common scale even when different students are administered different items

• In addition to IRT, these studies make use of multiple imputations or “plausible values” methodology


Plausible Values

• Random draws from the estimated ability distribution of students with similar item response patterns and background characteristics

• The variance of these draws reflects the uncertainty of measurement

• Think of a regression where the predictors are item responses and background data

[Figure: an ability distribution with five randomly drawn plausible values, PV1-PV5, marked along it]
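
The drawing step can be sketched as follows. This is a simplified illustration only: it assumes a Rasch likelihood with known item difficulties and a normal prior whose mean stands in for the prediction from background data. All names and numbers are hypothetical, and the operational conditioning models are considerably richer.

```python
import math
import random

def rasch_p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def draw_plausible_values(responses, prior_mean, prior_sd, n_draws=5, seed=0):
    """Draw plausible values from a grid approximation of the posterior.

    The posterior is proportional to
    likelihood(item responses | theta) * Normal(theta; prior_mean, prior_sd).
    prior_mean is a stand-in for the prediction from background variables.
    """
    grid = [-5.0 + i * 0.01 for i in range(1001)]
    weights = []
    for theta in grid:
        likelihood = 1.0
        for b, x in responses:
            p = rasch_p(theta, b)
            likelihood *= p if x == 1 else (1.0 - p)
        prior = math.exp(-0.5 * ((theta - prior_mean) / prior_sd) ** 2)
        weights.append(likelihood * prior)
    rng = random.Random(seed)
    # Sample grid points with probability proportional to the posterior.
    return rng.choices(grid, weights=weights, k=n_draws)

# Hypothetical (difficulty, score) pairs for one student.
responses = [(-1.0, 1), (0.0, 1), (1.0, 0)]
pvs = draw_plausible_values(responses, prior_mean=0.2, prior_sd=1.0)
print(pvs)
```

With only three items the posterior is wide, so the five draws spread out. That spread is the measurement uncertainty the plausible values are meant to carry.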

Using Plausible Values

• Plausible values are optimal for obtaining population estimates

• Plausible values should not be used for individual reporting

• Compute statistics with each plausible value and average results
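
A minimal sketch of "compute with each plausible value, then average", using a weighted mean as the statistic. The records, weights and field names below are made up for illustration.

```python
def weighted_mean(values, weights):
    """Weighted mean of values using sampling weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical records: five plausible values and a sampling weight per student.
students = [
    {"weight": 10.0, "pv": [480.0, 495.0, 470.0, 488.0, 502.0]},
    {"weight": 20.0, "pv": [530.0, 515.0, 522.0, 540.0, 518.0]},
    {"weight": 15.0, "pv": [410.0, 432.0, 425.0, 405.0, 418.0]},
]
weights = [s["weight"] for s in students]

# Step 1: one estimate per plausible value.
per_pv = [
    weighted_mean([s["pv"][m] for s in students], weights)
    for m in range(5)
]

# Step 2: the reported statistic is the average of the five estimates.
reported = sum(per_pv) / len(per_pv)
print(per_pv, reported)
```

The five separate estimates are also what the assessment (imputation) variance is computed from, so they should be kept rather than discarded after averaging.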


Calculating standard errors


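• The standard error for any statistic t estimated from a large-scale assessment (LSA) combines sampling variance and assessment variance

$$\mathrm{Var}(t) = \mathrm{Var}_{\text{sampling}}(t) + \mathrm{Var}_{\text{assessment}}(t)$$

$$\mathrm{SE}(t) = \sqrt{\mathrm{Var}_{\text{sampling}}(t) + \mathrm{Var}_{\text{assessment}}(t)}$$

• The standard error is the square root of the variance, so it is the variances that are added, not the standard errors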
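
As a small worked example with made-up variance components: because the variances add, the total standard error is the square root of their sum.

```python
import math

def total_standard_error(sampling_variance, assessment_variance):
    """Combine the two variance components and take the square root."""
    return math.sqrt(sampling_variance + assessment_variance)

# Hypothetical variance components for a mean score.
se = total_standard_error(sampling_variance=9.0, assessment_variance=1.44)

# For comparison: the sum of the two component standard errors (3.0 + 1.2).
sum_of_ses = math.sqrt(9.0) + math.sqrt(1.44)

print(f"combined SE = {se:.3f}, sum of component SEs = {sum_of_ses:.3f}")
```

The combined standard error is smaller than the sum of the two component standard errors, so simply adding the components would overstate the uncertainty.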

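Computing Standard Error (TIMSS & PIRLS)

$$\mathrm{Var}(t) = \underbrace{\sum_{i=1}^{75} (t_i - t_0)^2}_{\text{Sampling Variance}} + \underbrace{\left(1 + \frac{1}{5}\right) \frac{1}{5-1} \sum_{i=1}^{5} (t_i - \bar{t})^2}_{\text{Assessment Variance}}$$

In the first sum, $t_i$ is the estimate computed with the i-th set of replicate weights and $t_0$ is the full-sample estimate. In the second sum, $t_i$ is the estimate computed with the i-th plausible value and $\bar{t}$ is the mean of the five.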
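
The TIMSS & PIRLS computation can be sketched as below, assuming 75 jackknife replicate estimates and five plausible-value estimates have already been computed. The function names and input numbers are hypothetical, and the assessment variance follows the usual multiple-imputation combining rule, a factor of (1 + 1/5) applied to the variance among the five estimates.

```python
def jrr_sampling_variance(full_estimate, replicate_estimates):
    """Jackknife sampling variance: sum of squared deviations of the
    replicate estimates from the full-sample estimate."""
    return sum((t - full_estimate) ** 2 for t in replicate_estimates)

def assessment_variance(pv_estimates):
    """(1 + 1/M) times the variance among the M plausible-value estimates."""
    m = len(pv_estimates)
    mean = sum(pv_estimates) / m
    between = sum((t - mean) ** 2 for t in pv_estimates) / (m - 1)
    return (1.0 + 1.0 / m) * between

# Hypothetical inputs: 75 replicate estimates and 5 plausible-value estimates.
full_estimate = 500.0
replicates = [500.0 + (0.2 if i % 2 == 0 else -0.2) for i in range(75)]
pv_estimates = [499.0, 500.0, 501.0, 500.0, 500.0]

var_sampling = jrr_sampling_variance(full_estimate, replicates)
var_assessment = assessment_variance(pv_estimates)
standard_error = (var_sampling + var_assessment) ** 0.5
print(var_sampling, var_assessment, standard_error)
```

In practice tools such as the IDB Analyzer perform these steps; the sketch only shows how the two components are put together.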

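Computing Standard Error (PISA)

$$\mathrm{Var}(t) = \underbrace{\frac{1}{80\,(1 - 0.5)^2} \sum_{i=1}^{80} (t_i - t_0)^2}_{\text{Sampling Variance}} + \underbrace{\left(1 + \frac{1}{5}\right) \frac{1}{5-1} \sum_{i=1}^{5} (t_i - \bar{t})^2}_{\text{Imputation Variance}}$$

In the first sum, $t_i$ is the estimate computed with the i-th set of replicate weights and $t_0$ is the full-sample estimate. In the second sum, $t_i$ is the estimate computed with the i-th plausible value and $\bar{t}$ is the mean of the five.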
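
A sketch of the PISA computation, assuming 80 replicate estimates with the 0.5 factor shown on the slide and five plausible-value estimates. Function names and inputs are hypothetical; the imputation variance again follows the usual multiple-imputation combining rule.

```python
def fay_brr_sampling_variance(full_estimate, replicate_estimates, k=0.5):
    """Balanced repeated replication with Fay's factor k:
    sum of squared deviations divided by G * (1 - k)^2."""
    g = len(replicate_estimates)
    squared = sum((t - full_estimate) ** 2 for t in replicate_estimates)
    return squared / (g * (1.0 - k) ** 2)

def imputation_variance(pv_estimates):
    """(1 + 1/M) times the variance among the M plausible-value estimates."""
    m = len(pv_estimates)
    mean = sum(pv_estimates) / m
    between = sum((t - mean) ** 2 for t in pv_estimates) / (m - 1)
    return (1.0 + 1.0 / m) * between

# Hypothetical inputs: 80 replicate estimates and 5 plausible-value estimates.
full_estimate = 500.0
replicates = [500.0 + (1.0 if i % 2 == 0 else -1.0) for i in range(80)]
pv_estimates = [499.0, 500.0, 501.0, 500.0, 500.0]

var_sampling = fay_brr_sampling_variance(full_estimate, replicates)
var_imputation = imputation_variance(pv_estimates)
standard_error = (var_sampling + var_imputation) ** 0.5
print(var_sampling, var_imputation, standard_error)
```

With 80 replicates and k = 0.5 the divisor is 80 × 0.25 = 20, so the sampling variance is one twentieth of the sum of squared deviations.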

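Computing Standard Error (TALIS)

$$\mathrm{Var}(t) = \underbrace{\frac{1}{80\,(1 - 0.5)^2} \sum_{i=1}^{80} (t_i - t_0)^2}_{\text{Sampling Variance}}$$

Here $t_i$ is the estimate computed with the i-th set of replicate weights and $t_0$ is the full-sample estimate. The slide shows only a sampling variance term for TALIS.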

Computing Standard Error (PIAAC)

Variance(t) = Sampling Variance + Imputation Variance

NOTE: In PIAAC the replication method changes from country to country, so the sampling variance formula differs accordingly.

Example: Comparing Standard Errors

[Figure: the same analysis run in the IDB Analyzer (SE taking into account sampling and assessment error) and in plain SPSS (SE calculated assuming a simple random sample). Slide callouts: "57 points!" and "Only 1 point!"]

The IEA/ETS Research Institute (www.IERInstitute.org)

Summarizing… (about the PVs)

• NEVER treat them as an individual score

• NEVER use the average of the plausible values as a score

• ALWAYS repeat the analysis separately with each plausible value

• When selecting variables

• When selecting people

• Report the average of the statistics computed

• When conducting significance testing, combine the assessment variance with the sampling variance
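
A small illustration of why averaging the plausible values first is discouraged, using made-up data and the standard deviation as the statistic. The variable names are hypothetical.

```python
import statistics

# Hypothetical data: five plausible values for each of four students.
pvs = [
    [480.0, 510.0, 495.0, 470.0, 505.0],
    [530.0, 500.0, 545.0, 520.0, 515.0],
    [410.0, 450.0, 430.0, 405.0, 440.0],
    [600.0, 570.0, 585.0, 610.0, 590.0],
]

# Discouraged: average the plausible values per student, then analyse.
averaged = [sum(row) / len(row) for row in pvs]
sd_from_average = statistics.pstdev(averaged)

# Recommended: compute the statistic once per plausible value, then average.
sd_per_pv = [statistics.pstdev([row[m] for row in pvs]) for m in range(5)]
sd_reported = sum(sd_per_pv) / len(sd_per_pv)

print(f"SD from averaged PVs: {sd_from_average:.2f}")
print(f"Average of per-PV SDs: {sd_reported:.2f}")
```

Averaging within students smooths out the variation between draws, so the spread computed from the averages can never exceed the average of the per-PV spreads and is usually smaller. The size of the gap in real data depends on how much the plausible values differ within students.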


Summarizing… (in general)

• If you do not take the sample and test design into account in your analysis, you simply end up with the wrong answer. This means using:

• Sampling weights

• Replicate weights

• Plausible values

• If we did not have to do this, we wouldn’t!!

• Programs like the IDB Analyzer, WesVar and AM take sample and test design into account.


Thank you for your attention!

Any questions?
