The Validity of Standardized Tests for Evaluating Curricular Interventions in Mathematics and Science
Joshua Sussman
Postdoctoral Scholar, Berkeley Evaluation and Assessment Research (BEAR) Center
University of California, Berkeley
Talk overview
• Three studies that examine the use of standardized academic tests for evaluating the impact of curricular interventions
• Analyze the validity (AERA, APA, & NCME, 2014) of the test for evaluating the intervention
• The studies lead to political and methodological solutions to an enduring problem in applied educational measurement.
Three studies: Research questions
1. How often do investigators use standardized tests to evaluate the impact of educational interventions; are the tests valid for their intended purpose?
2. How much alignment at the item level is necessary for valid evaluation?
3. What research designs can investigators use to mitigate validity problems with standardized tests as outcome measures?
About me
• The goal of my work is to advance applied measurement in schools.
• My research experience includes curriculum development projects funded by the Institute of Education Sciences (IES) and National Science Foundation (NSF). Dissertation research funded by an IES pre-doctoral fellowship in the Research in Cognition and Mathematics Education Program.
• Experience in test construction and validation (Black racial identity, sustained attention, early childhood development, non-cognitive predictors of academic success, mathematics and science).
Reasons to evaluate educational interventions using standardized tests as outcome measures
• They are reliable measures of grade-level academic proficiency, in a major subject area, for groups of students.
• They provide a “fair” measure of the impact of an academic intervention.
• Curriculum-independent and not subject to researcher biases or “training effects.”
• Schools are accountable for improving test scores
Problems with the use of standardized tests as outcome measures
• What if the domain of the educational intervention is narrower than “mathematics”?
  • E.g., fractions
• The broad test design can be problematic.
• A longstanding consensus is that we should evaluate interventions by determining the degree to which the goals of the program are being realized in students (Baker, Chung, & Cai, 2016; Tyler, 1942).
Problems with the use of standardized tests as outcome measures: content mismatch
Problems with the use of standardized tests as outcome measures: cognitive mismatch
• Standardized tests do not measure everything that is important in academic competence (Darling-Hammond et al., 2013; NRC, 2001).
• Specific issues: NRC (2004) found serious problems with the validity of standardized tests in 86 evaluations of 25 different math curricula.
• New standardized tests in mathematics do a better job of measuring modern learning goals, but serious shortcomings continue to exist (Doorey & Polikoff, 2016).
• In science, existing tests are not designed to measure the modern learning goals in the Next Generation Science Standards (DeBarger, Penuel, & Harris, 2013; Wertheim et al., 2016).
Study 1: A focus on prevalence and validity of standardized tests as outcome measures
1. How often do investigators use standardized tests as key outcome measures?
2. Are the tests valid?
  • Do the goals of the intervention appear to align with the measurement target of the standardized test?
  • Do investigators establish validity evidence for the specific use of the test per recommendations in the literature (AERA, APA, & NCME, 2014)?
  • Is the validity evidence adequate?
A focus on the alignment aspects of test validity
• Evaluate the validity evidence with an emphasis on the alignment between the tests and the interventions (Bhola, Impara, & Buckendahl, 2003; Roach, Niebling, & Kurz, 2009; Porter, 2002)
• A principled way to study the match between a test and an intervention
  • Content alignment
  • Cognitive process alignment
• Well-developed investigations into the alignment between standardized tests and interventions are a relatively new area of the literature (e.g., May, Johnson, Haimson, Sattar, & Gleason, 2009)
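One common way to put a number on content-by-cognitive-process alignment is Porter's alignment index, which compares the proportion of emphasis that two documents place on each cell of a content × cognitive-demand grid. The sketch below is illustrative only: the function name and the example proportions are hypothetical, and nothing here implies that the studies in this talk computed the index this way.

```python
def porter_alignment_index(test_props, intervention_props):
    """Porter's alignment index: 1 - (sum of |x_i - y_i|) / 2.

    Each argument is a flat list of cell proportions (one entry per
    content x cognitive-demand cell) that sums to 1.
    Returns 1.0 for identical emphasis and 0.0 for no overlap.
    """
    if len(test_props) != len(intervention_props):
        raise ValueError("both inputs must cover the same cells")
    for props in (test_props, intervention_props):
        if abs(sum(props) - 1.0) > 1e-9:
            raise ValueError("cell proportions must sum to 1")
    total_gap = sum(abs(x - y) for x, y in zip(test_props, intervention_props))
    return 1.0 - total_gap / 2.0


# Hypothetical 2 x 3 grid (two content areas by three complexity levels).
# The test spreads its items evenly; the intervention targets one cell.
broad_test = [1 / 6] * 6
narrow_intervention = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(round(porter_alignment_index(broad_test, narrow_intervention), 3))
```

A broad test paired with a narrow intervention yields a low index, which is the content-mismatch situation described on the earlier slides.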
Method
• A secondary analysis of 85 projects funded by the IES mathematics and science education program (2003–2015).
• Data sources
  a) IES database entries (study goals, description of intervention, key measures, etc.)
  b) Reports to IES received from project PIs
  c) Peer-reviewed articles associated with projects
  d) Test information on the internet
The prevalence of standardized tests as outcome measures
Analysis: Calculate the proportion of the projects that evaluated a curricular intervention using data from a standardized test
Results:
• Most projects developed and evaluated a curricular intervention (82%)
• Most intervention projects used, or planned to use, a standardized test for impact evaluation (72%)
• Thus, evaluation of new curricular interventions using standardized tests is widespread practice
The validity of standardized tests as outcome measures
Analysis: Three raters, using a validity rubric to score each project, reached consensus on the projects with misalignment between the intervention and the standardized test used as an outcome measure.
Results: The raters flagged 54% of the projects for a mismatch between the intervention and the test.
• Tests measured too much academic content
• Learning goals were difficult to measure with a typical standardized test
  • E.g., conducting scientific investigations; participating in a learning community.
The validity of standardized tests as outcome measures
Analysis: For each project flagged for validity issues, the same three raters closely examined the corpus of data for validity evidence and judged the adequacy of that evidence.
Data: Reports from PIs
• Emailed 68 unique PIs for reports and 48 responded (70.6%)
• 33 PIs provided reports
[Figure: reports from PIs. Labels: 33 reports provided; 25 projects flagged; 11 projects.]
The validity of standardized tests as outcome measures
• Analyzed reports and published articles
Results: Validity discussions
• Five out of the 11 did not even mention validity issues.
• Six out of the 11 contained validity discussions.
Results: Adequacy of validity evidence
• Only one established adequate validity evidence
Measurement issues uncovered during the analysis
• The standardized test did not have enough test items that tapped the content taught by the intervention.
• I learned a lesson to “be more specific about the learning outcomes I want to measure and select an assessment that will be more sensitive to measuring those outcomes.”
• One investigator could not evaluate the intervention because the standardized test did not measure the appropriate construct.
• In follow-up research, one investigator selected a subset of items from the test (i.e., the useful ones).
Summary
• Majority of projects engaged in applied research and evaluation using a standardized test
• About half of these projects were flagged as potentially problematic
• Only 6 of 11 projects established any validity evidence for the specific use of the test
• Only 1 of 11 established adequate validity evidence
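The counts reported on the Study 1 slides can be turned into percentages with a few lines of arithmetic. The snippet below is only a convenience sketch; the helper name `percent` is made up for illustration, and the counts are the ones stated on the slides (48 of 68 PIs responded; 6 of 11 and 1 of 11 flagged projects with reports).

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


pi_response_rate = percent(48, 68)   # PIs who responded to the email
any_evidence = percent(6, 11)        # projects with any validity discussion
adequate_evidence = percent(1, 11)   # projects with adequate validity evidence

print(pi_response_rate, any_evidence, adequate_evidence)
```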
Recommendations
• Cautiously interpret evaluations of new curricula that position data from standardized tests as the primary outcome measure; they may not provide accurate and useful information for data-based decision making.
• Careful item selection
• Proposals that include impact evaluation should require investigators to discuss measurement in detail
Study 2: How much alignment is enough for valid evaluation?
• In many cases, only a few items on the standardized test align with the intervention (Sussman, 2016).
• This data simulation study develops a psychometric model of the relationship between alignment and the treatment sensitivity of an evaluation, defined as the ability of an evaluation to detect the effect of an educational intervention (Lipsey, 1990; May et al., 2009).
• The practical goal is to develop a method, akin to power analysis, that helps researchers account for misalignment when they design evaluations.
Alignment between a math test and an intervention
[Figure: grid crossing academic content (addition, subtraction) with cognitive complexity (single digit; double digit; double digit with carrying or borrowing). One region of the grid is marked “Intervention teaches this area.”]
Method
• Data simulation of hypothetical evaluations with an outcome measure that is more or less aligned with an intervention
• The primary outcome is the average statistical power, calculated as a function of test alignment and intervention effect size.
  • Power to detect a true difference between experimental and control groups
• Psychometric models for data generation and for data analysis from the Rasch family of item response models (Rasch, 1960/1980; Adams, Wilson, & Wang, 1997).
Key assumptions of the simulation
• Effective treatments increase the probability that a student succeeds on a test item that is aligned
• The treatment has no impact on an item that is considered not aligned
• The control group is unaffected
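These three assumptions can be written compactly as a dichotomous Rasch model with an added treatment term. The notation is introduced here for illustration and is not taken from the study: $\theta_i$ is the ability of student $i$, $b_j$ the difficulty of item $j$, $T_i = 1$ for the experimental group (0 for control), $a_j = 1$ for an aligned item (0 otherwise), and $\delta$ the size of the treatment effect on the logit scale.

```latex
P(X_{ij} = 1 \mid \theta_i, T_i)
  = \frac{\exp\left(\theta_i - b_j + \delta\, a_j T_i\right)}
         {1 + \exp\left(\theta_i - b_j + \delta\, a_j T_i\right)}
```

The term $\delta\, a_j T_i$ is nonzero only for a treated student on an aligned item, so non-aligned items and the control group follow the ordinary Rasch model.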
Simulation variables
Hold sample size constant but vary alignment over a range of effect sizes for the intervention.
• Fixed sample size (N = 600; 300 each in experimental and control groups)
• Fixed test length (N = 50 items)
• Vary alignment (1–50 items)
• Vary effect size of the intervention (0.1–2.0 SD)
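The design above can be sketched as a small Monte Carlo simulation. This is a simplified illustration under stated assumptions, not the study's code: abilities and item difficulties are drawn from N(0, 1), the effect size is applied as a logit shift on aligned items for treated students only, and the analysis step swaps in a two-sample z-test on sum scores (alpha = .05) in place of the Rasch-family analysis models named on the Method slide. All function and parameter names are hypothetical.

```python
import math
import random
from statistics import mean, variance


def simulate_scores(n_students, difficulties, n_aligned, shift, rng):
    """Sum scores under a Rasch model; `shift` is added to the logit
    of the first `n_aligned` items only (assumed treatment effect)."""
    scores = []
    for _ in range(n_students):
        theta = rng.gauss(0.0, 1.0)  # assumed ability distribution
        total = 0
        for j, b in enumerate(difficulties):
            logit = theta - b + (shift if j < n_aligned else 0.0)
            p_correct = 1.0 / (1.0 + math.exp(-logit))
            total += rng.random() < p_correct
        scores.append(total)
    return scores


def estimate_power(n_aligned, effect_size, n_items=50, n_per_group=300,
                   n_reps=200, seed=0):
    """Share of replications in which a z-test on sum scores rejects
    the null hypothesis of no group difference at alpha = .05."""
    rng = random.Random(seed)
    difficulties = [rng.gauss(0.0, 1.0) for _ in range(n_items)]
    rejections = 0
    for _ in range(n_reps):
        control = simulate_scores(n_per_group, difficulties, n_aligned, 0.0, rng)
        treated = simulate_scores(n_per_group, difficulties, n_aligned,
                                  effect_size, rng)
        std_err = math.sqrt(variance(control) / n_per_group
                            + variance(treated) / n_per_group)
        z = (mean(treated) - mean(control)) / std_err
        if abs(z) > 1.96:
            rejections += 1
    return rejections / n_reps
```

Calling, for example, `estimate_power(n_aligned=30, effect_size=0.2)` for several values of `n_aligned` traces out one power curve. Because the analysis model and effect-size scaling here are simplifications, the numbers it produces should not be expected to match the study's results.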
Results
[Figure: statistical power (y-axis, 0.0 to 1.0) plotted against simulated alignment (x-axis, 0 to 100%), with one curve per effect size: 0.1 to 1.0 in steps of 0.1, plus 1.5 and 2.0.]
Conclusions
• Alignment should be no less than 60% for adequate statistical power to detect treatment effects with an effect size of 0.2 SD.
• Researchers must balance alignment against sensitivity to detect small effect sizes.
• Use of multiple measures with different levels of alignment represents an ideal scenario for developing a compelling evaluation argument (Cronbach, 1963; House, 1977; Penuel, 2016).
Study 3: One solution to the alignment problem
• Research methods that coordinate data and theory can present stronger arguments for the efficacy of an intervention.
• Empirical study that documents the effectiveness of the Learning Mathematics through Representations (LMR) lesson sequence for teaching English Learners (ELs) mathematics.
• The evaluation coordinates data from a researcher-developed test and a standardized test with theory about how the features of LMR meet the needs of ELs in the mathematics classroom.
Learning Mathematics through Representations (LMR)
• LMR is a 19-lesson number line-based curriculum unit that supports upper elementary students’ understandings of integers and fractions (Saxe, de Kirby, Le, Sitabkhan, & Kang, 2015)
• The unit supports mathematical learning through (a) the use of the number line as a central representational context, and (b) the building of mathematical definitions in classroom communities that become resources to support student argumentation and problem solving.
Method
• 571 students in 21 classrooms (4th and 5th grade) containing both ELs and English Proficient (EP) students participated in a quasi-experimental study.
• There were 95 ELs in the sample: 44 ELs in 11 LMR classrooms and 51 ELs in 10 comparison classrooms.
• Students completed a set of four (pre, interim, post-test, and follow-up) researcher-developed assessments of integers and fractions
• Students also completed the state test in mathematics in the prior year and at the end of the intervention year
The empirical results support the efficacy of LMR for students classified as ELs
• Multilevel analysis revealed that the ELs in LMR classrooms gained more in mathematics than the ELs in the matched comparison group on both an assessment of integers and fractions (p = 0.011, ES = 0.68) and a standardized assessment in mathematics (p = 0.010, ES = 0.49)
• LMR eliminated or narrowed the achievement gap between ELs and EPs
• In addition, theory supports LMR’s potential as a mathematics intervention benefitting ELs’ achievement, both narrowly (integers and fractions) and broadly (grade-level achievement)
Two resources for meeting the needs of ELs in the mathematics classroom
1. Participation in mathematical communication & argumentation (Darling-Hammond, 2007; Moschkovich, 2012; NCTM, 2000; Schoenfeld, 2002).
2. Multimodal opportunities for learning using visual and embodied representations (Bustamante & Travis, 1999; Hakuta & Santos, 2012; Moschkovich, 1999, 2002; Schleppegrell, 2007).
[Figure: LMR lesson structure. Labels: Opening Problem; Opening Discussion; Partner Work; Closing Discussion; Closing Problem; Student Thinking & Problem Solving.]
1. Participation in mathematical communication & argumentation
2. Multimodal learning (visual and embodied representations)
Providing ELs access to participating in mathematics lessons
Providing ELs access to mathematical discussions
Visual resources for mathematical learning
Embodied resources for mathematical learning
Meeting the needs of ELs in the mathematics classroom
Conclusions
• Standardized tests have a place and purpose, but they need to be well aligned to serve as outcome measures
• Alignment should be no less than 60% to detect reasonable effect sizes (0.2 SD).
• High-quality evaluations of educational interventions coordinate data and theory.
Plans for future research
• Measurement in special education
• Measuring progress towards Individualized Education Plan goals
• Support data-based decision making for future student eligibility, goals and services (interventions).
References
Sussman, J., & Wilson, M. The use and validity of preexisting achievement tests for evaluating new curricular interventions in science and mathematics. Under review (revise and resubmit): American Journal of Evaluation.
Sussman, J. Standardized tests as outcome measures in applied research: A psychometric simulation of the relationship between alignment and treatment sensitivity. To be submitted to Applied Measurement in Education.
Sussman, J., & Saxe, G. B. Mathematics learning in language inclusive classrooms: Supporting the achievement of English learners as well as their English proficient peers. To be submitted to American Educational Research Journal.