Comparing standards of examination papers when there are no archived scripts
Ian Jones Loughborough University
Colin Foster Loughborough University
Jodie Hunter Massey University, New Zealand
13th Annual UK Rasch User Group Meeting, Cambridge 2019
Fifty years of A-level mathematics: have standards changed?
Ian Jones (a,*), Chris Wheadon (b), Sara Humphries (c) and Matthew Inglis (a)
(a) Mathematics Education Centre, Loughborough University, UK; (b) No More Marking Ltd.; (c) Ofqual
Advanced-level (A-level) mathematics is a high-profile qualification taken by many school leavers in England, Wales, Northern Ireland and around the world as preparation for university study. Concern has been expressed in these countries that standards in A-level mathematics have declined over time, and that school leavers enter university or the workplace lacking the required mathematical knowledge and skills. The situation in England, Wales and Northern Ireland reflects more general international concerns about decreasing educational standards. However, evidence to support this concern has been of limited scope, rarely subjected to peer-review and of questionable validity. Our study overcame the limitations of previous research into standards over time by applying a comparative judgement technique that enabled the direct comparison of mathematical performance across different examinations. Furthermore, unlike previous research, all examination questions were re-typeset and candidate responses rewritten to reduce bias arising from surface cues. Using this technique, mathematics experts judged A-level scripts from the 1960s, 1990s and the 2010s. We report that the experts believed current A-level mathematics standards to have declined since the 1960s, although there was no evidence that they believed standards have declined since the 1990s. We contrast our findings with those from previous comparison studies and consider implications for future research into standards over time.
Keywords: A-level mathematics; standards; assessment; comparative judgement
Background
Numerous articles and reports have been published over recent years decrying the mathematical knowledge of school leavers in England and Wales (e.g. Walport et al., 2010; ACME, 2011). This includes those who have achieved high grades in Advanced-level (A-level) mathematics (Hawkes & Savage, 2000; Croft et al., 2009), a course usually associated with achieving university entrance to science, engineering and mathematics courses in England and Wales. High-profile and on-going media coverage (e.g. Willis & Paton, 2009) suggests that standards were higher some time in the past, but have declined since. In this article we investigate whether this is in fact the case.
Concerns about declining standards perhaps go back as far as accredited education itself, but of particular relevance to the current debate in England and Wales is the influential Dearing report (National Committee of Inquiry into Higher Education,
*Corresponding author. Mathematics Education Centre, Loughborough University, Loughborough, LE11 3TU, UK. Email: [email protected]
© 2016 British Educational Research Association
British Educational Research Journal, Vol. 42, No. 4, August 2016, pp. 543–560
DOI: 10.1002/berj.3224
BERJ (2016) Results
[Figure: achievement parameter estimate (y-axis, 0 to 4.5) against year of examination (x-axis, 1960 to 2010), plotted separately for A, B and E grades.]
BERJ (2016) Method
BERJ (2016) Preparation
• Question papers typeset for consistency.
• Candidate responses transcribed for consistency.
• 66 scripts divided into 546 questions and uploaded to website for judging procedure.
TIME-CONSUMING AND EXPENSIVE
REQUIRES GRADED ARCHIVE
Can we apply CJ to standards comparison without graded scripts?
Hope 1
Model solutions vs. graded scripts
[Figure: scatter plot of script scores (y-axis) against perfect-solution scores (x-axis); r = .68.]
An investigation of construct relevant and irrelevant features of mathematics problem-solving questions using comparative judgement and Kelly’s Repertory Grid
Stephen D. Holmes, Qingping He and Michelle Meadows
Office of Qualifications and Examinations Regulation, Coventry, UK
ABSTRACT
The relationship between the characteristics of 33 mathematical problem-solving questions answered by 16-year-old students in England and the quality of problem-solving elicited was investigated in two studies. The first study used comparative judgement (CJ) to estimate the quality of the problem-solving elicited by each question, involving 33 mathematics teachers judging pairs of journal-style responses to the questions and the application of the Bradley–Terry model. In the second study a variant of Kelly’s Repertory Grid was used with five mathematics teachers to identify 23 dimensions along which the problem-solving questions varied. Significant relationships between ratings on some dimensions and the problem-solving quality estimated in the first study were found. This suggests that the Kelly’s Repertory Grid approach could be an effective way to identify features of questions that are relevant to the construct being assessed and features that could be potential sources of construct-irrelevant variance in test scores.
ARTICLE HISTORY: Received 29 April 2016; Accepted 28 October 2016
KEYWORDS: summative assessment; mathematical problem-solving; validity
Introduction
In recent years there has been discontent, particularly from employers, about the perceived lack of practical mathematical ability in England’s workforce and concerns that secondary school mathematics is not providing the skills required in the workplace or higher education (ACME, 2011; CBI, 2006; Toner, 2011; Vordermann, Porkess, Budd, Dunne, & Rahman-Hart, 2011). There have also been claims that the examinations in mathematics for 16-year-olds in England are not suitable for assessing the underlying mathematical ability of the students (Jones & Inglis, 2015). One solution suggested is that schools and school qualifications should place more emphasis on problem-solving and non-routine use of mathematics (Ofsted, 2012; Vordermann et al., 2011). This is consistent with the worldwide move towards the desire to train and assess these skills (e.g. ACT, 2006), perhaps best exemplified by the type of items used in the OECD PISA tests (OECD, 2014), the results of which have an ever-increasing impact on policymakers worldwide. The PISA Assessment and Analytical Framework (OECD, 2013) details the use of items which assess mathematical literacy, meaning the flexible application of mathematical
© 2017 Crown Copyright
CONTACT Stephen D. Holmes [email protected]
RESEARCH IN MATHEMATICS EDUCATION, 2017, VOL. 19, NO. 2, 112–129
https://doi.org/10.1080/14794802.2017.1334576
Hope 2
Expected vs Actual Difficulty
A Comparison of Actual and Expected Difficulty, and Assessment of Problem Solving in GCSE Maths
Ofqual 2015 64
2.3.14 Item expected and actual difficulty relationship
Figure 28 shows that there was a moderately strong correlation (see note 28) between the expected difficulty of the items and the difficulty as experienced by students (r=0.66). The disattenuated correlation, which estimates what the correlation would be if the measurement of expected and actual difficulty had been more precise, was reasonably high (r=0.76).
Figure 28: A scatter plot to show the relationship between expected and actual difficulty of items
2.3.15 Residual analysis of the relationship between expected and actual difficulty
Analysis of the residuals of a linear model between expected and actual difficulty revealed no systematic pattern between the independent variable (item difficulty) and the residuals. However, there is a correlation between item order and the residuals
28 This correlation is between the study 1 difficulty parameters and the Rasch model parameters from study 2. The correlations between the study 1 parameters and study 2 item facility values were 0.56 for foundation tier and 0.68 for higher tier. Unlike the Rasch parameters which can be equated, the facility values for the two tiers cannot be combined to obtain one correlation.
r (disattenuated) = .76
From page 64 of Ofqual (2015) A Comparison of Expected Difficulty, Actual Difficulty and Assessment of Problem Solving across GCSE Maths Sample Assessment Materials. Report Ofqual/15/5679.
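A minimal sketch of the disattenuation arithmetic behind the two correlations quoted above, assuming the standard Spearman correction (observed r divided by the square root of the product of the two reliabilities). The report excerpt does not state which reliabilities were used, so the function names and the back-calculated reliability product are illustrative assumptions only.

```python
import math


def disattenuate(r_observed, reliability_x, reliability_y):
    """Spearman's correction for attenuation.

    The reliabilities passed in are assumptions supplied by the caller;
    the excerpt above does not report them.
    """
    return r_observed / math.sqrt(reliability_x * reliability_y)


def implied_reliability_product(r_observed, r_disattenuated):
    """Product of the two reliabilities implied by an observed and a
    disattenuated correlation (the correction solved backwards)."""
    return (r_observed / r_disattenuated) ** 2


# Values quoted in the excerpt: r = 0.66 observed, r = 0.76 disattenuated.
product = implied_reliability_product(0.66, 0.76)
print(f"implied reliability product: {product:.3f}")

# Round trip: putting the implied product back in recovers 0.76.
print(f"check: {disattenuate(0.66, product, 1.0):.2f}")
```

Under this assumption the two quoted values imply a combined reliability of roughly three quarters, which is one way to read "if the measurement ... had been more precise".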
Can we apply CJ to standards comparison without graded scripts?
Study 1
Judging non-typeset items only.
Study 1: comparative judgement
• Exam papers from 1964, 1968, 1996, 2012 (as per BERJ, 2016).
• Split into 42 question items.
• Judged by 8 maths PhD students, total 670 pairwise judgements.
• Internal consistency, SSR = .91.
• Inter-rater reliability (split-halves, 100 iterations), median r = .79.
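A rough sketch of how pairwise judgements like those above can be turned into item scores and an SSR figure. It assumes a Bradley–Terry model (the model named in the Holmes et al. abstract) fitted with a simple iterative update, and an approximate standard error taken from the per-item information. The slides do not say which software or estimator was used, so the function names, the toy judgements and the SSR formula variant are all assumptions.

```python
import math
from collections import defaultdict


def fit_bradley_terry(judgements, n_iter=500):
    """Fit Bradley-Terry scores (logits, mean zero) from (winner, loser) pairs.

    Uses the classic minorise-maximise update. Every item needs at least
    one win and one loss, otherwise its estimate is not finite.
    """
    items = sorted({item for pair in judgements for item in pair})
    wins = defaultdict(int)
    counts = defaultdict(int)  # number of judgements per unordered pair
    for winner, loser in judgements:
        wins[winner] += 1
        counts[frozenset((winner, loser))] += 1

    strength = {item: 1.0 for item in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = sum(n / (strength[i] + strength[j])
                        for pair, n in counts.items() if i in pair
                        for j in pair - {i})
            updated[i] = wins[i] / denom
        # Fix the scale: geometric mean of the strengths equal to 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {i: v / math.exp(log_mean) for i, v in updated.items()}
    return {i: math.log(v) for i, v in strength.items()}


def scale_separation_reliability(scores, judgements):
    """SSR = (observed variance - mean squared error) / observed variance.

    Squared standard errors are approximated by 1 / information, where the
    information for an item sums p * (1 - p) over its judgements.
    """
    info = defaultdict(float)
    for winner, loser in judgements:
        p = 1.0 / (1.0 + math.exp(-(scores[winner] - scores[loser])))
        info[winner] += p * (1.0 - p)
        info[loser] += p * (1.0 - p)
    mean_sq_error = sum(1.0 / info[i] for i in scores) / len(scores)
    mean = sum(scores.values()) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores.values()) / (len(scores) - 1)
    return (variance - mean_sq_error) / variance


# Toy data (hypothetical): four items judged in a chain, 3 wins to 1 each.
judgements = ([("A", "B")] * 3 + [("B", "A")] +
              [("B", "C")] * 3 + [("C", "B")] +
              [("C", "D")] * 3 + [("D", "C")])
scores = fit_bradley_terry(judgements)
ssr = scale_separation_reliability(scores, judgements)
print({item: round(score, 2) for item, score in scores.items()})
print(f"SSR = {ssr:.2f}")
```

With only twelve toy judgements the SSR is far below the .91 reported for Study 1; reliability of that order needs many more judgements per item.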
Study 1: analysis
We compared item scores with
(i) the scores of the perfect candidates from the BERJ paper (“perfect scores”), and
(ii) the scores of the real scripts from the BERJ paper (“script scores”).
(Scores were available for 38 of the 42 questions judged for Study 1.)
Study 1: correlations
[Figure: three scatter plots (item vs perfect, item vs script, perfect vs script), with reported correlations r = 0.63, r = 0.68 and r = 0.49.]
Study 1: variance explained
• Year as a predictor of item score (BERJ, 2016).
• Present study: F(1,36) = 35.83, p < .001, R² = .500, year as predictor: b = -0.06.
• Perfect scores (BERJ, 2016): F(1,36) = 13.94, p < .001, R² = .279, year as predictor: b = -0.03.
• Script scores (BERJ, 2016): F(1,36) = 13.62, p < .001, R² = .274, year as predictor: b = -0.03.
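As a consistency check on the three regressions above, the proportion of variance explained can be recovered from each F statistic and its degrees of freedom using the standard identity R² = F·df1 / (F·df1 + df2). The sketch below is an illustration only; the function name is hypothetical and nothing here re-analyses the underlying data.

```python
def r_squared_from_f(f_value, df_model, df_residual):
    """Recover R-squared from an overall regression F test."""
    return (f_value * df_model) / (f_value * df_model + df_residual)


# (label, F, reported R-squared) as given on the slide, all with F(1, 36).
reported = [
    ("Present study", 35.83, 0.500),
    ("Perfect scores", 13.94, 0.279),
    ("Script scores", 13.62, 0.274),
]

for label, f_value, r2_reported in reported:
    r2 = r_squared_from_f(f_value, 1, 36)
    print(f"{label}: recovered {r2:.3f}, reported {r2_reported:.3f}")
```

The recovered values agree with the reported ones to within about .002, a gap that is plausibly down to rounding of the reported statistics.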
Study 2
Judging (i) typeset papers only, and (ii) typeset papers with perfect solutions.
Can we apply CJ to standards comparison without graded scripts?
Study 2: Exam papers
Year  Boards
1964  JMB*
1968  JMB*
1990  JMB
1996  AEB*, London, UCLES
2000  Edexcel
2006  MEI
2012  AQA*, MEI
2017  MEI
* included in BERJ (2016).
Study 2: comparative judgement
• (i) Papers Only.
• Judged by 5 maths PhD students, total 250 judgements, SSR = .84.
• (ii) Papers and Solutions.
• Judged by 5 different maths PhD students, total 330 judgements, SSR = .87.
Study 2: correlation
[Figure: scatter plot of exam paper scores, papers only (y-axis) against papers and solutions (x-axis), one labelled point per exam paper from 1964 Paper 1 to 2017 MEI; r = .74.]
Study 2: analysis
We compared exam paper scores with
(i) the scores of the perfect candidates from the BERJ paper (“perfect scores”), and
(ii) the scores of the real scripts from the BERJ paper (“script scores”).
Unlike in Study 1, we did this graphically.
Study 2: graphical analysis
[Figure, panel "BERJ": parameter estimates for perfect solutions and for scripts, plotted by exam paper (x-axis from 1964 Paper 1 to 2017 MEI).]
[Figure, panel "Study 2 data": parameter estimates for papers only and for papers & solutions, plotted by exam paper on the same x-axis.]
Limitations
• Standards-based assessment research is nonsense (Goldstein, 1979; Newton, 1997).
• Study 2 had only four data points. No estimate is available of the probability that the results are due to chance.
• Papers vary in length from 8 to 40 pages. CJ score vs length: ρ = –.47, p = .15.
• Cannot say “a candidate who achieved a grade B in 1996 or 2012 appears to have ... performed approximately at the level of a candidate who achieved a grade E in 1964”
Thank you
Ian Jones Loughborough [email protected]
Colin Foster Loughborough [email protected]
Jodie Hunter Massey University, New Zealand