Page 1

Enhancing the Technical Quality of the North Carolina Testing Program: An Overview of Current Research Studies

Nadine McBride, NCDPI
Melinda Taylor, NCDPI
Carrie Perkis, NCDPI

Page 2

Overview

• Comparability
• Consequential validity
• Other projects on the horizon

Page 3

Comparability

• Previous Accountability Conference presentations provided early results
• Research funded by an Enhanced Assessment Grant from the US Department of Education
• Focused on the following topics:
  – Translations
  – Simplified language
  – Computer-based
  – Alternative formats

Page 4

What is Comparability?

Not just “same score”:
• Same content coverage
• Same decision consistency (see the sketch below)
• Same reliability & validity
• Same other technical properties (e.g., factor structure)
• Same interpretations of test results, with the same level of confidence
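
To make one of these criteria concrete, here is a minimal sketch of estimating decision consistency between a general assessment and a test variation; the scores, sample size, and cut score are invented for illustration.

import numpy as np

# Hypothetical paired scores for the same examinees on two test variations.
rng = np.random.default_rng(0)
general = rng.normal(60, 10, 500)            # general-assessment scores
variation = general + rng.normal(0, 4, 500)  # test-variation scores

CUT = 55  # illustrative proficiency cut score

# Decision consistency: the proportion of examinees classified the same
# way (proficient vs. not proficient) by both forms.
same_decision = (general >= CUT) == (variation >= CUT)
print(f"Decision consistency: {same_decision.mean():.3f}")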

Page 5

Goal

• Develop and evaluate methods for determining the comparability of scores from test variations with scores from the general assessments

• It should be possible to draw the same inferences, with the same level of confidence, from variations of the same test.

Page 6

Research Questions

• What methods can be used to evaluate score comparability?

• What types of information are needed to evaluate score comparability?

• How do different methods compare in the types of information about comparability they provide?

Page 7

Products

• Comparability Handbook
  – Current Practice
    • State Test Variations
    • Procedures for Developing Test Variations and Evaluating Comparability
  – Literature Reviews
  – Research Reports
  – Recommendations
    • Designing Test Variations
    • Evaluating Comparability of Scores

Page 8

Results – Translations

• Replication methodology is helpful when faced with small samples and widely different proficiency distributions
  – Gauge variability due to sampling (random) error
  – Gauge variability due to distribution differences
• Multiple methods for evaluating structure are helpful
• Effect size criteria are helpful for DIF (see the sketch below)
• Congruence between structural & DIF results
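
As a rough illustration of the DIF bullet above, this sketch computes a Mantel-Haenszel effect size on the ETS delta scale and applies effect-size-only A/B/C categories; the function names and data layout are assumptions, and the studies' actual criteria (which also involve significance tests) may differ.

import numpy as np

def mh_delta(correct, group, total_score):
    # Mantel-Haenszel common odds ratio pooled over total-score strata,
    # expressed on the ETS delta scale (negative = favors reference group).
    num = den = 0.0
    for s in np.unique(total_score):
        m = total_score == s
        ref, foc = m & (group == "ref"), m & (group == "focal")
        a, b = correct[ref].sum(), (1 - correct[ref]).sum()
        c, d = correct[foc].sum(), (1 - correct[foc]).sum()
        if (a + b) == 0 or (c + d) == 0:
            continue  # skip strata with no reference or no focal examinees
        n = m.sum()
        num += a * d / n
        den += b * c / n
    return -2.35 * np.log(num / den)

def ets_category(delta):
    # Effect-size-only version of the ETS A/B/C classification.
    if abs(delta) < 1.0:
        return "A (negligible)"
    return "C (large)" if abs(delta) >= 1.5 else "B (moderate)"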

Page 9

Results – Simplified Language

• Development procedures that are carefully documented and followed, and that focus on maintaining the item construct, can support comparability arguments.

• Linking/equating approaches can be used to examine and/or establish comparability (see the sketch below).

• Comparing item statistics using the non-target group can provide information about comparability.
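
One simple example of the linking/equating idea in the second bullet is linear (mean-sigma) equating; this is a minimal sketch with invented score distributions, not necessarily the method used in the studies.

import numpy as np

def mean_sigma_equate(x, scores_x, scores_y):
    # Map a form-X score onto the form-Y scale so that equated scores
    # share form Y's mean and standard deviation.
    mu_x, sd_x = scores_x.mean(), scores_x.std(ddof=1)
    mu_y, sd_y = scores_y.mean(), scores_y.std(ddof=1)
    return sd_y / sd_x * (x - mu_x) + mu_y

# Hypothetical: X = simplified-language form, Y = general form.
rng = np.random.default_rng(1)
x_scores = rng.normal(48, 9, 400)
y_scores = rng.normal(52, 11, 400)
print(f"A score of 50 on X maps to {mean_sigma_equate(50, x_scores, y_scores):.1f} on Y")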

Page 10

Results – Computer-based

• Propensity score matching produced results similar to those from studies using within-subjects samples.

• Propensity score method provides a viable alternative to the difficult-to-implement repeated measures study.

• Propensity score method is sensitive to group differences. For instance, the method performed better when 8th and 9th grade groups were matched separately.
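
For readers unfamiliar with the technique, here is a minimal sketch of propensity score matching for a testing-mode comparison, assuming a hypothetical covariate matrix X (e.g., prior scores, demographics) and a 0/1 computer-mode indicator; it is an illustration, not the studies' actual procedure.

import numpy as np
from sklearn.linear_model import LogisticRegression

def match_on_propensity(X, mode):
    # Estimate each examinee's probability of testing on computer, then
    # pair each computer-based examinee (mode == 1) with the paper
    # examinee (mode == 0) whose propensity score is closest
    # (nearest neighbor, with replacement).
    ps = LogisticRegression(max_iter=1000).fit(X, mode).predict_proba(X)[:, 1]
    cb = np.where(mode == 1)[0]
    pp = np.where(mode == 0)[0]
    return [(i, pp[np.abs(ps[pp] - ps[i]).argmin()]) for i in cb]

Per the last bullet, running this separately within each grade (8th and 9th) would be one way to respect group differences when matching.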

Page 11

Results – Alternative Formats

• The burden of proof is much heavier for this type of test variation.

• A study based on students eligible for the general test can provide some, but not solid, evidence of comparability.

• Judgment-based studies combined with empirical studies are needed to evaluate comparability.

• More research is needed on methods for evaluating what constructs each test type is measuring.
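
One empirical tool for the construct question in the last bullet is Tucker's congruence coefficient between factor loadings estimated for each test type; the loading values below are invented for illustration.

import numpy as np

def congruence(a, b):
    # Tucker's congruence coefficient between two factor-loading vectors;
    # values near 1 suggest the two factors are essentially the same.
    return a @ b / np.sqrt((a @ a) * (b @ b))

general_loadings = np.array([0.62, 0.55, 0.70, 0.48, 0.66])
alternate_loadings = np.array([0.58, 0.49, 0.73, 0.41, 0.61])
print(f"Congruence: {congruence(general_loadings, alternate_loadings):.3f}")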

Page 12

Lessons Learned

• It takes a village…

– Cooperative effort of SBE, IT, districts and schools to implement special studies

– Researchers to conduct studies, evaluate results

– Cooperative effort of researchers and TILSA members to review study design and results

– Assessment community to provide insight and explore new ideas

Page 13

Consequential Validity

• What is consequential validity?
  – An amalgamation of evidence regarding the degree to which the use of test results has social consequences
  – Can be both positive and negative; intended and unintended

Page 14

Whose Responsibility?

• Role of the Test Developer versus the Test User?

• Responsibility and roles are not clearly defined in the literature

• State may be designated as both a test developer and a user

Page 15

Test Developer Responsibility

• Generally responsible for…
  – Intended effects
  – Likely side effects
  – Persistent unanticipated effects
  – Promoted use of scores
  – Effects of testing

Page 16

Test Users’ Responsibility

• Generally responsible for…
  – Use of scores
    • The further from the intended uses, the greater the responsibility

Page 17

Role of Peer Review

• Element 4.1
  – For each assessment, including the alternate assessment, has the state documented the issue of validity… with respect to the following categories:
    • g) Has the state ascertained whether the assessment produces intended and unintended consequences?

Page 18

Study Methodology

• Focus Groups
  – Conducted in five regions across the state
  – Led by NC State’s Urban Affairs
  – Completed in December 2009 and January 2010
  – Input from teachers and administrative staff
  – Included large, small, rural, urban, and suburban schools

Page 19

Study Methodology

• Survey Creation
  – Drafts currently modeled after surveys conducted in other states
  – However, most of those were conducted 10+ years ago
  – Surveys will be finalized after focus group results are reviewed

Page 20

Study Methodology

• Survey Administration
  – Testing Coordinators to receive survey notification
  – Survey to be available from late March through April

Page 21

Study Results

• Stay tuned!
  – Hope to make the report publicly available on the DPI testing website

Page 22

Other Research Projects

• Trying out different item types
• Item location effects
• Auditing

Page 23

Contact Information

• Nadine [email protected]

• Melinda [email protected]

• Carrie PerkisData [email protected]

