Applications in educational monitoring and practice: the logic of standard-setting Mark Wilson...

Applications in educational monitoring and practice: the logic of standard-setting

Mark Wilson, University of California, Berkeley

• Presented at the Standard-setting in the Nordic Countries Conference, CEMO, University of Oslo, September 22, 2015

Outline

• The logic of standard-setting

• Applied to two traditional standard-setting methods

• An alternative: Unidimensional constructs

• An alternative: Multidimensional constructs

• Summary, Conclusions, etc.







The logic of standard-setting

I. A way to define the outcome objectives

II. A way to decide what is (qualitatively) “enough” of the standards

III. A way to make manifest student performance on the standards (i.e., a “test”)

IV. A way to decide which performances are acceptable

I. A way to define the outcome objectives: The “Standards”

• Students in Grade X studying subject Y should be able to succeed on the following “standards” ...

• A. First standard

• B. Second standard

• :

• N. Nth standard

II. A way to decide what is (qualitatively) “enough” of the standards

• Success on every standard in {A, B, ...N}?

• Enough success on every standard in {A, B, ...N}?

• Enough success on enough of the standards in {A, B, ...N}?
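The three candidate rules above differ only in where the threshold is applied. A minimal sketch, assuming each standard is summarized as the proportion of its items answered correctly; the function names and the thresholds 0.7 and 0.6 are illustrative assumptions, not values from the presentation:

```python
def success_on_every(props):
    """Rule 1: full success on every standard."""
    return all(p == 1.0 for p in props.values())

def enough_on_every(props, per_standard=0.7):
    """Rule 2: at least `per_standard` success on every standard."""
    return all(p >= per_standard for p in props.values())

def enough_on_enough(props, per_standard=0.7, share=0.6):
    """Rule 3: at least `per_standard` success on at least `share` of the standards."""
    met = sum(p >= per_standard for p in props.values())
    return met >= share * len(props)

# Hypothetical student: proportion correct on standards A, B, C.
student = {"A": 1.0, "B": 0.8, "C": 0.5}
print(success_on_every(student), enough_on_every(student), enough_on_enough(student))
```

The same student can fail the first two rules and pass the third, which is why the qualitative choice has to be made before any cut-score is computed.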

III. A way to make manifest student performance on the standards: The Test.

• Therefore a Test T is constructed,

• consisting, say, of n items for each of the N standards (i.e., nN items altogether), or some approximation to that.

– (Without loss of generality, assume that each item is scored 0 or 1.)

IV. A way to decide which performances are acceptable: The setting of cut-scores or “Standard Setting”

• Students are scored on the Test, and their scores range from 0 to nN.

• (Can calculate a mean, standard deviation, percentiles, etc.)

• Thus, the standard “Standard Setting” problem:

• How to decide what score represents “enough” of subject Y.

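In summary,

I. A way to define the outcome objectives: The “Standards”

II. A way to decide what is (qualitatively) “enough” of the standards

III. A way to make manifest student performance on the standards: The Test

IV. A way to decide which performances are acceptable: The setting of cut-scores or “Standard-Setting”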

However,

• Most “methods” of Standard Setting start at the end, at part IV,

• and proceed to develop and test technical solutions to just this problem.

• In my view, THIS is the real Standard Setting problem.







Outline


• Applied to two traditional standard-setting methods

– The Angoff Method




The Angoff Method

• Concept: “... ‘borderline’ test-taker...one whose knowledge and skills are on the borderline between the upper group and the lower group.”

• “...the judge considers each question as a whole and makes a judgment of the probability that a borderline test-taker would answer the question correctly.”

• “...the passing score is computed from the expected scores for the individual items...”

– Livingston, S.A., and Zieky, M.J. (1982). Passing Scores: A Manual for Setting Standards of Performance on Educational and Occupational Tests. Princeton: Educational Testing Service.
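The computation described in the quotations can be sketched as follows. This is a minimal illustration, assuming the common variant in which the judges' probabilities are averaged item by item and then summed; the function name and the ratings are hypothetical:

```python
def angoff_cut_score(ratings):
    """ratings[j][i] = judge j's probability that a borderline test-taker
    answers item i correctly. Returns the passing score."""
    n_judges = len(ratings)
    n_items = len(ratings[0])
    # Average the judges' probabilities item by item ...
    item_means = [sum(judge[i] for judge in ratings) / n_judges
                  for i in range(n_items)]
    # ... then sum the expected item scores to get the expected total score.
    return sum(item_means)

# Hypothetical ratings: 2 judges, 4 items scored 0 or 1.
ratings = [
    [0.5, 0.75, 0.25, 1.0],
    [0.5, 0.25, 0.75, 1.0],
]
print(angoff_cut_score(ratings))
```

Note that the two judges disagree sharply on the second and third items, yet the averaging step hides that disagreement in the final number.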

Reminder about the logic above




IV. A way to decide which performances are acceptable: The setting of cut-scores or “Standard Setting”

Critique of Angoff Method based on the logic above

• On I: assumes that the standards have been developed

• On II: assumes that a definition of “enough” has been developed, and that the judges know it

• On III: assumes that the test has been developed with this in mind

• On IV: assumes that “taking the average” of the probabilities (expectations) is the best way to summarize the judgments.

Outline

• The logic of standard-setting

• Applied to two traditional standard-setting methods

– The Angoff Method

– The “Matrix” method

• An alternative: Unidimensional constructs

• An alternative: Multidimensional constructs

• Summary, Conclusions, etc.

Practical Context

• Mixed modes of assessment

– e.g., multiple choice (lots) + performance items (few)

• Need to map into criterion levels

– e.g., performance levels

• Need to maintain comparability of criterion levels across administrations:

– Standard-setting

– Standard-propagating

“Matrix” Method

• Based on judgment of teachers and other professionals involved in testing process

• Matrix of multiple choice by open-ended total scores mapped to performance levels

• Final matrix decided by committee consensus
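The kind of matrix described above can be pictured as a lookup table. A minimal sketch with hypothetical score ranges and level assignments; the monotonicity check is an added sanity check, not something the presentation prescribes:

```python
# Hypothetical matrix: rows are multiple-choice total scores (0-4),
# columns are open-ended total scores (0-3); cells are performance levels 1-5.
LEVEL_MATRIX = [
    [1, 1, 2, 2],
    [1, 2, 2, 3],
    [2, 2, 3, 4],
    [2, 3, 4, 5],
    [3, 4, 5, 5],
]

def performance_level(mc_score, oe_score, matrix=LEVEL_MATRIX):
    """Look up the committee-assigned level for a pair of total scores."""
    return matrix[mc_score][oe_score]

def is_monotone(matrix):
    """Check that the level never drops when either score increases."""
    rows_ok = all(r[j] <= r[j + 1] for r in matrix for j in range(len(r) - 1))
    cols_ok = all(matrix[i][j] <= matrix[i + 1][j]
                  for i in range(len(matrix) - 1)
                  for j in range(len(matrix[0])))
    return rows_ok and cols_ok

print(performance_level(3, 2), is_monotone(LEVEL_MATRIX))
```

Every cell is a separate committee judgment, so a check like `is_monotone` is one inexpensive way to catch inconsistent entries.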

Example of Performance Levels (High-school Algebra)

Example of Performance Levels (detail)

An example matrix [figure; regions labelled 1 to 5]

Critique of the “Matrix” Method based on the logic above

• On I: assumes that the standards have been developed and related to the “performance levels,” but, in fact, this is usually not the case

• On II: could be true if there were a relationship between the standards and the performance levels

• On III: assumes that the test has been developed with this in mind; could be true if there were a relationship between the standards and the performance levels, and this had been used to develop the items

• On IV: assumes that the judges can create it

Outline



• An alternative: Unidimensional constructs

– The BEAR Assessment System



How the BEAR Assessment System helps here...





The BEAR Assessment System: BAS

An Example: Assessing Data Modeling and Statistical Reasoning (ADM) project

PIs: Rich Lehrer & Leona Schauble (Vanderbilt U) & Mark Wilson (UC Berkeley)

- Developed an assessment system for a “Statistical Modeling” curriculum for middle school

- Multi-year, multidisciplinary collaborative of teachers, learning science and assessment experts

- Designed “a developmental perspective on learning”: a learning progression with 7 relational construct maps

- Used “reformed curriculum”: conjecture-based whole-class discussions within instruction

- Embedded “new ideas about assessment” into everyday instruction

Conceptualization of measurement variables:

• CoS4 - Investigate and anticipate qualities of a sampling distribution.

• CoS3 - Consider statistics as measures of qualities of a sample distribution.

• CoS2 - Calculate statistics.

• CoS1 - Describe qualities of distribution informally.

A Sample Construct Map for: Conceptions of Statistics


Detail view: CoS3

CoS3F Choose/Evaluate statistic by considering qualities of one or more samples.

CoS3E Predict the effect on a statistic of a change in the process generating the sample.

CoS3D Predict how a statistic is affected by changes in its components or otherwise demonstrate knowledge of relations among components.

CoS3C Generalize the use of a statistic beyond its original context of application or invention.

CoS3B Invent a sharable (replicable) measurement process to quantify a quality of the sample.

CoS3A Invent an idiosyncratic measurement process to quantify a quality of the sample based on tacit knowledge that others may not share.

Items Design: Open Assessment Prompt

Students received their final grades in Science today. In addition to giving each student their grade, the teacher also told the class about the overall class average.

Student    Final grade
Robyn      10
Jake       9
Calvin     6
Sasha      7
Mike       8
Lori       8

When the teacher finished grading Mina’s work and added her final grade into the overall class average, the overall class average stayed the same. What could Mina’s final grade have been? (Show your work).

48/6 = 8
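The worked line above gives the current class average. A short check of the full answer, assuming Mina's grade is simply added as a seventh grade (the variable names are illustrative):

```python
grades = {"Robyn": 10, "Jake": 9, "Calvin": 6, "Sasha": 7, "Mike": 8, "Lori": 8}

total = sum(grades.values())        # sum of the six grades
n = len(grades)
mean_before = total / n             # the class average before Mina is added

# Solve (total + x) / (n + 1) = mean_before for x.
# Rearranging gives x = mean_before * (n + 1) - total, which equals mean_before.
mina = mean_before * (n + 1) - total

mean_after = (total + mina) / (n + 1)
print(mina, mean_after)
```

The average is unchanged only when the new grade equals the old average, which is the relation among components that the item is designed to elicit.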

Where the “Standards” are...

CoS3F Choose/Evaluate statistic by considering qualities of one or more samples.

CoS3E Predict the effect on a statistic of a change in the process generating the sample.

CoS3D Predict how a statistic is affected by changes in its components or otherwise demonstrate knowledge of relations among components.

CoS3C Generalize the use of a statistic beyond its original context of application or invention.

CoS3B Invent a sharable (replicable) measurement process to quantify a quality of the sample.

CoS3A Invent an idiosyncratic measurement process to quantify a quality of the sample based on tacit knowledge that others may not share.

Measurement Model: Wright Map

Initial results for CoS ...

Outline



• An alternative: Unidimensional constructs

– The BEAR Assessment System

– Banding & Construct-Mapping



How Banding & Construct-Mapping help here...




IV. A way to decide which performances are acceptable: The setting of cut-scores or “Standard Setting”: Item-side

Outline

• The logic of standard-setting

• A few traditional standard-setting methods

• An alternative: Unidimensional constructs

– The BEAR Assessment System

– Banding & Construct-Mapping



How the BEAR Assessment System helps here...




IV. A way to decide which performances are acceptable: The setting of cut-scores or “Standard Setting”: Student-side

Construct-Mapping

Aim: to give judges information that will help them balance different aspects of the test

– Could be sub-components of content, or item-types

– E.g. (from Algebra example), MC items and open-ended items

Technique: Wright map relating score levels to item locations to indicate what the response vector tells us about what a student knows and can do

The Wright Map

• The item types are scaled together to estimate the best-fitting composite, according to pre-determined item-weights (substantive decision)

• The calibration is then used to create a map of all item/level locations

• The judging committee, through a consensus-building process, chooses cut points between performance levels on the map
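Once cut points have been chosen on the map, assigning a performance level is a lookup on the calibrated scale. A minimal sketch, assuming proficiencies and cut points are on the same logit scale and that a student exactly at a cut point is placed in the upper level; the cut values are hypothetical:

```python
from bisect import bisect_right

def level_for(theta, cut_points):
    """Return the performance level (1-based) for a proficiency estimate.
    cut_points must be sorted ascending, on the same scale as theta."""
    return bisect_right(cut_points, theta) + 1

# Hypothetical committee choices: four cut points define five levels.
cuts = [-1.0, 0.0, 0.8, 1.9]
print([level_for(t, cuts) for t in (-1.5, -0.2, 0.8, 2.4)])
```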

Software tool

• Dynamic display of item map

• Displays, for any chosen proficiency level

– probability of passing all multiple-choice items

– probability of attaining every level on written response items

– expected total score on multiple-choice section; expected score on each written response item

• Allows choice of weights for item types
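The quantities such a tool displays for the multiple-choice section can be sketched as below. The presentation does not name the measurement model, so the Rasch model used here is an assumption, as are the function names and item locations; written-response items would need a polytomous model, which is not shown:

```python
import math

def p_correct(theta, difficulty):
    """Rasch model (assumed): probability of a correct response to a 0/1 item."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def expected_mc_score(theta, difficulties):
    """Expected total score on the multiple-choice section at proficiency theta."""
    return sum(p_correct(theta, d) for d in difficulties)

# Hypothetical item locations (logits) and one chosen proficiency level.
difficulties = [-1.0, 0.0, 1.0]
theta = 0.0
print([round(p_correct(theta, d), 3) for d in difficulties])
print(round(expected_mc_score(theta, difficulties), 3))
```

Moving `theta` up or down the map shows a judge how the expected response pattern changes at a candidate cut point.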

Final Standards Map

Example of resulting "mapping matrix"







Learning progressions

• Learning progressions are descriptions of the successively more sophisticated ways of thinking about an important domain of knowledge and practice that can follow one another as children learn about and investigate a topic over a broad span of time. They are crucially dependent on instructional practices if they are to occur. (CCII, 2009)

– Aka learning trajectories, progressions of developmental competence, and profile strands

• More than one path leads to competence

• Need to engage in curriculum debate about which learning progressions are most important

– Try and choose them so that we end up with fewer standards per grade level

Image of a Learning Progression: Curriculum version

One possible relationship: the levels of the learning progression are levels of several construct maps

Another possible relationship: the levels are staggered

Making a summative construct map

Making a composite scale for the summative construct

A combined reflective/formative model.

Then the standard-setting scale becomes ...

a derived measure based on the sampling design of the levels across the constructs

Reliability can be controlled by lighter/heavier sampling of items

A “construct map” can be developed in a post-hoc way (similar to the PISA “defined variable”)

Can be combined with other derived measures from other age or grade-appropriate Learning Progressions.
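The presentation does not say how the effect of item sampling on reliability would be quantified. One classical illustration, assuming comparable (parallel) items, is the Spearman-Brown prophecy formula; the function name and the numbers are illustrative only:

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when the number of comparable items is
    multiplied by length_factor (Spearman-Brown prophecy formula)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Doubling or halving the number of items sampled for a scale
# whose current reliability is 0.6.
print(spearman_brown(0.6, 2), spearman_brown(0.6, 0.5))
```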







Summary, Conclusions, etc.

• Standard-setting must be seen as more than a mere “technical exercise”

• It involves much prior work, both substantive and technical, including

– (a) How to develop standards that are “ready” for standard-setting

– (b) How to develop items that support that

– (c) How to decide which student performances on the test are “enough”

• and requires an overarching framework to make all of that coherent

Presentation has offered ...

• Sample of how traditional standard setting methods fall short from this perspective

• A suggestion for one approach that does attempt to address this issue

• It's a complex problem, and hence one should not expect an easy solution

• Simpler for single-dimension constructs, more complex for higher-dimension constructs.

Further Issues

• Items

– Consistency over years

– Large “committee effect” of specifics of “rich” items

• Factors that affect committee judgments

– difficulty of items in a particular year

– committee leadership

– committee membership

• Effects on teaching, policy ...

Teaching/Learning

• Policy-makers, administrators, etc. need the results of standard-setting for large-scale tests.

• In the main, teachers do not!!!

• They need good formative assessments, and the positive effects of good formative assessment are well documented (e.g., the Black & Wiliam meta-analysis).

• Thus, a major requirement of standardized tests and standard-setting methods is that they not damage classroom instruction.

Teaching/Learning

• The approach described above has the virtue that it bases good large-scale test construction (i.e., for standard-setting)

• On good formative assessment.

Thank-you.

• [email protected]

• http://bearcenter.berkeley.edu


Footnote on I.


assumes that the standards have been developed and related to the “performance levels,” but, in fact, this is not the case

If there are no standards, only the performance levels, then this critique may be empty. (E.g., Language Testing)

A further issue ...

- Sometimes the best choice for the relative weight between the item modes is not clear

- This affects the slope of “diagonal sections” in the previous slide

- Can adapt software to allow the committee to also decide which weight to choose

- We have not found that committees are good at this.
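Why the weight changes the slope can be seen in a simplified sketch. It assumes the composite is a linear combination of the two raw total scores, which is a simplification of the scaled composite described in the presentation; the names `w_mc` and `w_oe` are hypothetical:

```python
def composite(mc_score, oe_score, w_mc, w_oe):
    """Weighted composite of multiple-choice and open-ended total scores."""
    return w_mc * mc_score + w_oe * oe_score

def boundary_slope(w_mc, w_oe):
    """Slope of a constant-composite boundary in the (MC, OE) score plane.
    From w_mc * MC + w_oe * OE = c, OE = (c - w_mc * MC) / w_oe."""
    return -w_mc / w_oe

# Equal weights versus open-ended items weighted twice as heavily.
print(boundary_slope(1, 1), boundary_slope(1, 2))
```

Under this simplification, giving the open-ended items more weight flattens the boundary, so fewer open-ended points are traded for each multiple-choice point.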
