Date post: | 20-Jan-2016 |
Applications in educational monitoring and practice:
the logic of standard-setting
Mark Wilson, University of California, Berkeley
• Presented at the Standard-setting in the Nordic Countries Conference,
CEMO, University of Oslo, September 22, 2015
Outline
• The logic of standard-setting
• Applied to two traditional standard-setting methods
• An alternative: Unidimensional constructs
• An alternative: Multidimensional constructs
• Summary, Conclusions, etc.
The logic of standard-setting
I. A way to define the outcome objectives
II. A way to decide what is (qualitatively) “enough” of the standards
III. A way to make manifest student performance on the standards (i.e., a “test”)
IV. A way to decide which performances are acceptable
I. A way to define the outcome objectives: The “Standards”
• Students in Grade X studying subject Y should be able to succeed on the following “standards” ...
• A. First standard
• B. Second standard
• :
• N. Nth standard
II. A way to decide what is (qualitatively) “enough” of the standards
• Success on every standard in {A, B, ...N}?
• Enough success on every standard in {A, B, ...N}?
• Enough success on enough of the standards in {A, B, ...N}?
III. A way to make manifest student performance on the standards: The Test
• Therefore a Test T is constructed,
• Consisting, say, of n items for each of the N standards (i.e., nN items altogether), or some approximation to that.
– (Without loss of generality, assume that each item is scored 0 or 1.)
IV. A way to decide which performances are acceptable: The setting of cut-scores or “Standard Setting”
• Students are scored on the Test, and their scores range from 0 to nN.
• (Can calculate a mean, standard deviation, percentiles, etc.)
• Thus, the standard “Standard Setting” problem:
• How to decide what score represents “enough” of subject Y.
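The four parts above can be made concrete with a small sketch. It assumes one hypothetical student, n = 4 items on each of N = 3 standards, and hypothetical thresholds (3 of 4 items counts as “enough” on a standard, 2 of 3 standards counts as “enough of the standards”). None of these numbers come from the presentation.

```python
# Hypothetical data: one student's 0/1 item scores, grouped by standard.
# n = 4 items per standard, N = 3 standards (illustrative values only).
responses = {
    "A": [1, 1, 1, 1],
    "B": [1, 1, 1, 0],
    "C": [1, 0, 0, 0],
}

ENOUGH_ITEMS = 3      # assumed: items correct needed to "succeed" on a standard
ENOUGH_STANDARDS = 2  # assumed: standards needed for "enough of the standards"

# Part III: the test yields a score per standard and a total from 0 to n*N.
per_standard = {name: sum(items) for name, items in responses.items()}
total_score = sum(per_standard.values())

# Part II: three readings of "enough".
success_on_every_standard = all(
    score == len(responses[name]) for name, score in per_standard.items()
)
enough_on_every_standard = all(
    score >= ENOUGH_ITEMS for score in per_standard.values()
)
enough_on_enough_standards = (
    sum(score >= ENOUGH_ITEMS for score in per_standard.values())
    >= ENOUGH_STANDARDS
)
```

Here the student passes only under the third reading. The total score alone does not say which standards the points came from, which is why a cut-score on the total needs the earlier parts of the logic behind it.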
In summary,
I. A way to define the outcome objectives: The “Standards”
II. A way to decide what is (qualitatively) “enough” of the standards
III. A way to make manifest student performance on the standards: The Test
IV. A way to decide which performances are acceptable: The setting of cut-scores or “Standard-Setting”
However,
• most “methods” of Standard Setting start at the end, at part IV.
• And proceed to develop and test technical solutions to just this problem.
• In my view, THIS is the real Standard Setting problem
Outline
• The logic of standard-setting
• Applied to two traditional standard-setting methods
– The Angoff Method
• An alternative: Unidimensional constructs
• An alternative: Multidimensional constructs
• Summary, Conclusions, etc.
The Angoff Method
• Concept: “... ‘borderline’ test-taker...one whose knowledge and skills are on the borderline between the upper group and the lower group.”
• “...the judge considers each question as a whole and makes a judgment of the probability that a borderline test-taker would answer the question correctly.”
• “...the passing score is computed from the expected scores for the individual items...”
– Livingston, S.A., and Zieky, M.J. (1982). Passing Scores: A Manual for Setting Standards of Performance on Educational and Occupational Tests. Princeton: Educational Testing Service.
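The computation described in these quotations can be sketched as follows. The ratings are hypothetical. Averaging across judges and then summing across items is one common way to combine the judgments, assumed here for illustration; the quoted manual is not reproduced in enough detail to confirm the exact procedure.

```python
from statistics import mean

# Hypothetical ratings: each row is one judge, each column one item.
# Entry = judged probability that a borderline test-taker answers correctly.
ratings = [
    [0.9, 0.6, 0.5, 0.3],   # judge 1
    [0.8, 0.7, 0.4, 0.2],   # judge 2
    [0.7, 0.5, 0.6, 0.4],   # judge 3
]

# Expected score of the borderline test-taker on each item:
# the average of the judges' probabilities for that item (assumed rule).
item_expectations = [mean(item) for item in zip(*ratings)]

# Passing score: the sum of the item expectations.
passing_score = sum(item_expectations)
```

With these ratings the passing score is about 2.2 out of 4 items.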
Reminder about the logic above
I. A way to define the outcome objectives: The “Standards”
II. A way to decide what is (qualitatively) “enough” of the standards
III. A way to make manifest student performance on the standards: The Test
IV. A way to decide which performances are acceptable: The setting of cut-scores or “Standard Setting”
Critique of Angoff Method based on the logic above
I. A way to define the outcome objectives: The “Standards”
assumes that the standards have been developed
II. A way to decide what is (qualitatively) “enough” of the standards
assumes that this has been developed, and that the judges know it
III. A way to make manifest student performance on the standards: The Test
assumes that the test has been developed with this in mind
IV. A way to decide which performances are acceptable: The setting of cut-scores or “Standard Setting”
assumes that this has been developed, and that the judges know it
assumes that “taking the average” of the probabilities (expectations) is the best way to summarize the judgments.
Outline
• The logic of standard-setting
• Applied to two traditional standard-setting methods
– The Angoff Method
– The “Matrix” method
• An alternative: Unidimensional constructs
• An alternative: Multidimensional constructs
• Summary, Conclusions, etc.
Practical Context
• Mixed modes of assessment
– e.g., multiple choice (lots) + performance items (few)
• Need to map into criterion levels
– e.g., performance levels
• Need to maintain comparability of criterion levels across administrations:
– Standard-setting
– Standard-propagating
“Matrix” Method
• Based on judgment of teachers and other professionals involved in testing process
• Matrix of multiple choice by open-ended total scores mapped to performance levels
• Final matrix decided by committee consensus
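One way to picture such a matrix is as a lookup table indexed by the two section totals. The sketch below is illustrative only: the score bands, the matrix entries, and the function name `performance_level` are all hypothetical and are not taken from the committee matrix shown in the talk.

```python
from bisect import bisect_right

# Hypothetical score bands (lower bounds) for each section.
MC_BANDS = [0, 10, 20, 30]    # multiple-choice total: 4 bands
OE_BANDS = [0, 4, 8]          # open-ended total: 3 bands

# Hypothetical committee matrix: rows = MC band, columns = open-ended band,
# entries = performance level (1 = lowest, 5 = highest).
LEVEL_MATRIX = [
    [1, 1, 2],
    [1, 2, 3],
    [2, 3, 4],
    [3, 4, 5],
]

def performance_level(mc_total: int, oe_total: int) -> int:
    """Look up the performance level for a pair of non-negative section totals."""
    row = bisect_right(MC_BANDS, mc_total) - 1
    col = bisect_right(OE_BANDS, oe_total) - 1
    return LEVEL_MATRIX[row][col]
```

For example, `performance_level(25, 5)` falls in the third multiple-choice band and the second open-ended band of this made-up matrix.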
Example of Performance Levels (High-school Algebra) [figure]
Example of Performance Levels (detail) [figure]
An example matrix [figure; labels 1 to 5]
Critique of the “Matrix” Method based on the logic above
I. A way to define the outcome objectives: The “Standards”
assumes that the standards have been developed and related to the “performance levels,” but, in fact, this is not usually the case
II. A way to decide what is (qualitatively) “enough” of the standards
assumes that this has been developed, and that the judges know it
could be true if there was a relationship between the standards and the performance level
III. A way to make manifest student performance on the standards: The Test
assumes that the test has been developed with this in mind
could be true if there was a relationship between the standards and the performance level, and this had been used to develop the items
IV. A way to decide which performances are acceptable: The setting of cut-scores or “Standard Setting”
assumes that the judges can create it
Outline
• The logic of standard-setting
• Applied to two traditional standard-setting methods
• An alternative: Unidimensional constructs
– The BEAR Assessment System
• An alternative: Multidimensional constructs
• Summary, Conclusions, etc.
How the BEAR Assessment System helps here...
I. A way to define the outcome objectives: The “Standards”
II. A way to decide what is (qualitatively) “enough” of the standards
III. A way to make manifest student performance on the standards: The Test
IV. A way to decide which performances are acceptable: The setting of cut-scores or “Standard Setting”
The BEAR Assessment System: BAS
An Example: Assessing Data Modeling and Statistical Reasoning (ADM) project
PIs: Rich Lehrer & Leona Schauble (Vanderbilt U) & Mark Wilson (UC Berkeley)
- Developed an assessment system for a “Statistical Modeling” curriculum for middle school
- Multi-year, multidisciplinary collaborative of teachers, learning science and assessment experts
- Designed “a developmental perspective on learning” - learning progression with 7 relational construct maps
- Used “reformed curriculum” – conjecture-based whole-class discussions within instruction
- Embedded “new ideas about assessment” into everyday instruction
Conceptualization of measurement variables:
• CoS4 - Investigate and anticipate qualities of a sampling distribution.
• CoS3 - Consider statistics as measures of qualities of a sample distribution.
• CoS2 - Calculate statistics.
• CoS1 - Describe qualities of distribution informally.
A Sample Construct Map for: Conceptions of Statistics
Detail view: CoS3
• CoS3F - Choose/Evaluate statistic by considering qualities of one or more samples.
• CoS3E - Predict the effect on a statistic of a change in the process generating the sample.
• CoS3D - Predict how a statistic is affected by changes in its components or otherwise demonstrate knowledge of relations among components.
• CoS3C - Generalize the use of a statistic beyond its original context of application or invention.
• CoS3B - Invent a sharable (replicable) measurement process to quantify a quality of the sample.
• CoS3A - Invent an idiosyncratic measurement process to quantify a quality of the sample based on tacit knowledge that others may not share.
Items Design: Open Assessment Prompt
Students received their final grades in Science today. In addition to giving each student their grade, the teacher also told the class about the overall class average.
Student Final grades
Robyn 10
Jake 9
Calvin 6
Sasha 7
Mike 8
Lori 8
When the teacher finished grading Mina’s work and added her final grade into the overall class average, the overall class average stayed the same. What could Mina’s final grade have been? (Show your work).
48/6 = 8
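The worked answer on the slide (48/6 = 8) is the class average before Mina's grade is added. A short check, written here as an illustration and not as part of the original item, shows that the average stays the same only if Mina's grade equals that average.

```python
from fractions import Fraction

# Final grades from the prompt.
grades = {"Robyn": 10, "Jake": 9, "Calvin": 6, "Sasha": 7, "Mike": 8, "Lori": 8}

total = sum(grades.values())               # 48
old_mean = Fraction(total, len(grades))    # 48/6 = 8

# Solve (total + x) / (count + 1) == old_mean for Mina's grade x.
mina_grade = old_mean * (len(grades) + 1) - total

# Check: the class average with Mina included is unchanged.
new_mean = Fraction(total + mina_grade, len(grades) + 1)
```

The same algebra holds for any class: a new value leaves the mean unchanged exactly when it equals the current mean.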
Where the “Standards” are...
• CoS3F - Choose/Evaluate statistic by considering qualities of one or more samples.
• CoS3E - Predict the effect on a statistic of a change in the process generating the sample.
• CoS3D - Predict how a statistic is affected by changes in its components or otherwise demonstrate knowledge of relations among components.
• CoS3C - Generalize the use of a statistic beyond its original context of application or invention.
• CoS3B - Invent a sharable (replicable) measurement process to quantify a quality of the sample.
• CoS3A - Invent an idiosyncratic measurement process to quantify a quality of the sample based on tacit knowledge that others may not share.
Measurement Model: Wright Map
Initial results for CoS ...
Outline
• The logic of standard-setting
• Applied to two traditional standard-setting methods
• An alternative: Unidimensional constructs
– The BEAR Assessment System
– Banding & Construct-Mapping
• An alternative: Multidimensional constructs
• Summary, Conclusions, etc.
How Banding & Construct-Mapping help here...
I. A way to define the outcome objectives: The “Standards”
II. A way to decide what is (qualitatively) “enough” of the standards
III. A way to make manifest student performance on the standards: The Test
IV. A way to decide which performances are acceptable: The setting of cut-scores or “Standard Setting”: Item-side
Outline
• The logic of standard-setting
• A few traditional standard-setting methods
• An alternative: Unidimensional constructs
– The BEAR Assessment System
– Banding & Construct-Mapping
• An alternative: Multidimensional constructs
• Summary, Conclusions, etc.
How the BEAR Assessment System helps here...
I. A way to define the outcome objectives: The “Standards”
II. A way to decide what is (qualitatively) “enough” of the standards
III. A way to make manifest student performance on the standards: The Test
IV. A way to decide which performances are acceptable: The setting of cut-scores or “Standard Setting”: Student-side
Construct-Mapping
Aim: to give judges information that will help them balance different aspects of the test
– Could be sub-components of content, or item-types
– E.g. (from Algebra example), MC items and open-ended items
Technique: Wright map relating score levels to item locations to indicate what the response vector tells us about what a student knows and can do
The Wright Map
• The item types are scaled together to estimate the best-fitting composite, according to pre-determined item-weights (substantive decision)
• The calibration is then used to create a map of all item/level locations
• The judging committee, through a consensus-building process, chooses cut points between performance levels on the map
Software tool
• Dynamic display of item map
• Displays, for any chosen proficiency level
– probability of passing all multiple-choice items
– probability of attaining every level on written response items
– expected total score on multiple-choice section; expected score on each written response item
• Allows choice of weights for item types
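The kind of display listed above can be sketched for the multiple-choice section. The slides do not state which measurement model the software uses; this sketch assumes a dichotomous Rasch model, which is the usual basis for a Wright map. The item difficulties and function names are hypothetical, and written-response items (which would need a polytomous model) are left out.

```python
import math

def p_correct(theta: float, delta: float) -> float:
    """Rasch probability of success at proficiency theta on an item of difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# Hypothetical multiple-choice item difficulties, in logits.
mc_difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]

def mc_probabilities(theta: float) -> list[float]:
    """Probability of passing each multiple-choice item at proficiency theta."""
    return [p_correct(theta, d) for d in mc_difficulties]

def expected_mc_score(theta: float) -> float:
    """Expected total multiple-choice score: the sum of the item probabilities."""
    return sum(mc_probabilities(theta))
```

A committee member moving a candidate cut point up or down the map would see these probabilities and the expected score change with `theta`; for instance, `expected_mc_score(0.0)` is 2.5 for this symmetric set of made-up difficulties.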
Final Standards Map
Example of resulting "mapping matrix"
Outline
• The logic of standard-setting
• A few traditional standard-setting methods
• An alternative: Unidimensional constructs
• An alternative: Multidimensional constructs
• Summary, Conclusions, etc.
Learning progressions
• Learning progressions are descriptions of the successively more sophisticated ways of thinking about an important domain of knowledge and practice that can follow one another as children learn about and investigate a topic over a broad span of time. They are crucially dependent on instructional practices if they are to occur. (CCII, 2009)
– Aka learning trajectories, progressions of developmental competence, and profile strands
• More than one path leads to competence
• Need to engage in curriculum debate about which learning progressions are most important
– Try to choose them so that we end up with fewer standards per grade level
Image of a Learning Progression: Curriculum version
One possible relationship: the levels of the learning progression are levels of several construct maps
Another possible relationship: the levels are staggered
Making a summative construct map
Making a composite scale for the summative construct
A combined reflective/formative model.
Then the standard-setting scale becomes ...
a derived measure based on the sampling design of the levels across the constructs
Reliability can be controlled by lighter/heavier sampling of items
A “construct map” can be developed in a post-hoc way (similar to the PISA “defined variable”)
Can be combined with other derived measures from other age or grade-appropriate Learning Progressions.
Outline
• The logic of standard-setting
• A few traditional standard-setting methods
• An alternative: Unidimensional constructs
• An alternative: Multidimensional constructs
• Summary, Conclusions, etc.
Summary, Conclusions, etc.
• Standard-setting must be seen as more than a mere “technical exercise”
• It involves much prior work, both substantive and technical, including
– (a) How to develop standards that are “ready” for standard-setting
– (b) How to develop items that support that
– (c) How to decide which student performances on the test are “enough”
• and requires an overarching framework of all of that to be coherent
Presentation has offered ...
• Sample of how traditional standard setting methods fall short from this perspective
• A suggestion for one approach that does attempt to address this issue
• It's a complex problem, and hence one should not expect an easy solution
• Simpler for single-dimension constructs, more complex for higher-dimension constructs.
Further Issues
• Items
– Consistency over years
– Large “committee effect” of specifics of “rich” items
• Factors that affect committee judgments
– difficulty of items in a particular year
– committee leadership
– committee membership
• Effects on teaching, policy ...
Teaching/Learning
• Policy-makers, administrators, etc. need the results of standard-setting for large-scale tests.
• In the main, teachers do not!!!
• They need good formative assessments, and the positive effects of good formative assessment are well-documented
– e.g., the Black & Wiliam meta-analysis
• Thus, a major requirement of standardized tests and standard-setting methods is that they not damage classroom instruction.
Teaching/Learning
• The approach described above has the virtue that it bases good large-scale test construction (i.e., for standard-setting) on good formative assessment.
Footnote on I.
I. A way to define the outcome objectives: The “Standards”
assumes that the standards have been developed and related to the “performance levels,” but, in fact, this is not the case
If there are no standards, only the performance levels, then this critique may be empty (e.g., language testing).
A further issue ...
- Sometimes the best choice for the relative weight between the item modes is not clear
- This affects the slope of “diagonal sections” in the previous slide
- Can adapt software to allow the committee to also decide which weight to choose
- We have not found that committees are good at this.