Steven L. Wise
Senior Research Fellow
Evaluating Test-Taking Effort: When is a Growth Score Not Really a Growth Score?
• The ability to measure student growth across time is a key feature of MAP.
– Growth = RIT_Time2 - RIT_Time1
• A valid growth score, however, requires valid test scores at each time point.
• If either component score is invalid, the growth score is untrustworthy.
• How can we evaluate the validity of individual scores?
Measuring Student Growth
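The dependence of a growth score on both component scores can be sketched in a few lines. This is a minimal illustration, not NWEA's implementation: the `ScoreRecord` class, its `valid` field, and the example RIT values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoreRecord:
    rit: float   # RIT score from one test event
    valid: bool  # hypothetical validity judgment for that score


def growth_score(time1: ScoreRecord, time2: ScoreRecord) -> Optional[float]:
    """Return RIT at Time2 minus RIT at Time1, or None if either score is invalid."""
    if not (time1.valid and time2.valid):
        return None  # an invalid component makes the growth score untrustworthy
    return time2.rit - time1.rit


# Hypothetical scores, for illustration only.
print(growth_score(ScoreRecord(200, True), ScoreRecord(212, True)))   # 12
print(growth_score(ScoreRecord(200, False), ScoreRecord(212, True)))  # None
```

Returning `None` rather than a number keeps an untrustworthy growth score from being averaged or reported by accident.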
• A valid test score requires:
– a well-constructed test with standardized administration procedures
– that construct-irrelevant factors, which introduce construct-irrelevant variance (CIV), do not meaningfully affect test performance.
• ISV: how trustworthy is the score?
• Low ISV scores are distorted by construct-irrelevant factors.
Individual Score Validity (ISV)
• One student, one proctor (e.g., school psychologist).
• Proctor has been trained to observe the student’s test-taking behavior during the test event.
• If warranted, the proctor will terminate the test, take corrective action, or invalidate the score.
• Potential reasons: lack of motivation, anxiety, illness, changes in testing environment.
• Examinees, items, and context are all construct-irrelevant factors.
Scenario 1: Individually Administered Achievement Test
• Many-to-one relationship between students and proctors.
• Monitoring responsibilities of proctors typically do not extend beyond
– maintaining standardized administration
– deterring cheating
• Looking for CIV is not usually part of the proctor’s role (and is impractical).
Scenario 2: Group Administered Achievement Test
• Student effort has been found to be a key construct-irrelevant factor affecting MAP scores.
• We implicitly assume that students give good effort when administered our MAP test.
• When they don’t, the resulting RITs tend to underestimate true proficiency.
Test-Taking Effort and MAP
• If low effort occurs at Time1 but not at Time2:
– Growth will be positively affected.
– Possibly unrealistically high positive growth
• If low effort occurs at Time2 but not at Time1:
– Growth will be negatively affected.
– Possibly negative growth
How Low Effort Distorts Growth Scores
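A small worked example makes both distortions concrete. The numbers are assumptions chosen for illustration: a student whose true proficiency is 200 at Time1 and 208 at Time2, and a low-effort test event that underestimates the RIT by 15 points.

```python
TRUE_RIT_TIME1 = 200     # hypothetical true proficiency at Time1
TRUE_RIT_TIME2 = 208     # hypothetical true proficiency at Time2
LOW_EFFORT_PENALTY = 15  # hypothetical underestimate caused by low effort


def observed_growth(low_effort_time1: bool, low_effort_time2: bool) -> int:
    """Observed growth when low effort depresses the RIT at one or both time points."""
    rit1 = TRUE_RIT_TIME1 - (LOW_EFFORT_PENALTY if low_effort_time1 else 0)
    rit2 = TRUE_RIT_TIME2 - (LOW_EFFORT_PENALTY if low_effort_time2 else 0)
    return rit2 - rit1


print(observed_growth(False, False))  # 8: good effort both times, true growth
print(observed_growth(True, False))   # 23: low effort at Time1 inflates growth
print(observed_growth(False, True))   # -7: low effort at Time2 yields negative growth
```

In this simplified model, equal low effort at both time points would cancel out, so the distortion comes from the difference in effort between the two test events.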
Table 1. Percentages of Fall-Spring growth scores that are negative with magnitude exceeding two RIT standard errors
Content Area   Grade
               2     3     4     5     6     7     8     9
Math           1%    1%    2%    2%    5%    6%    8%    15%
Reading        1%    3%    5%    6%    9%    11%   12%   16%
How Often Do Negative Growth Scores Occur?
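The criterion behind Table 1 (negative growth whose magnitude exceeds two standard errors) can be sketched as follows. The slide does not say how the standard error is computed, so this sketch assumes independent errors at the two time points and combines them as the square root of the sum of squares; the function name and the example values are hypothetical.

```python
import math


def is_large_negative_growth(rit1: float, rit2: float,
                             se1: float, se2: float,
                             n_se: float = 2.0) -> bool:
    """True if growth is negative by more than n_se standard errors."""
    growth = rit2 - rit1
    # Assumption: errors at the two time points are independent.
    se_growth = math.sqrt(se1 ** 2 + se2 ** 2)
    return growth < -n_se * se_growth


# Hypothetical scores with a standard error of 3 RIT points each.
print(is_large_negative_growth(210, 170, 3.0, 3.0))  # True
print(is_large_negative_growth(210, 205, 3.0, 3.0))  # False
```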
• Students who have become disengaged from their test and have stopped giving effort show two types of behaviors:
– They tend to answer questions very rapidly.
– Their answers tend to be correct at about a chance level (as opposed to the expected .50 rate characteristic of an adaptive test).
• Data on these behaviors can be objectively and unobtrusively collected by the computer.
Assessing Student Effort
• Two types of response behaviors
– Rapid-guessing behavior: the student responds before he or she would have been able to read and consider the item.
– Solution behavior: all other behaviors.
• RTE equals the proportion of items for which the examinee exhibited solution behavior.
• Ranges from 0.0 (low) to 1.0 (high).
• RTE measures the effort expended by a student on a test.
Response Time Effort (RTE)
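The RTE definition translates directly into a short calculation. This sketch assumes each item has a time threshold below which a response counts as a rapid guess; how those thresholds are set is not described here, so the threshold values and response times in the example are hypothetical.

```python
from typing import Sequence


def response_time_effort(response_times: Sequence[float],
                         thresholds: Sequence[float]) -> float:
    """Proportion of items answered with solution behavior.

    A response counts as solution behavior when its time is at or above
    the item's rapid-guessing threshold (assumed to be given per item).
    """
    if len(response_times) != len(thresholds):
        raise ValueError("need one threshold per response")
    if not response_times:
        raise ValueError("need at least one response")
    solution_count = sum(
        1 for rt, th in zip(response_times, thresholds) if rt >= th
    )
    return solution_count / len(response_times)


# Hypothetical response times in seconds, with a 3-second threshold on every item.
times = [12.0, 1.5, 20.0, 2.0, 15.0]
print(response_time_effort(times, [3.0] * len(times)))  # 0.6
```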
• We developed five flagging criteria for spotting test events whose scores indicate low ISV.
• Flags based on both RTE and response accuracy.
• They take into account that students often behave non-effortfully during only a portion of the test event.
• We will call “invalid” any test event that triggered at least one of the flags.
Five Effort Flags
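The five criteria themselves are not listed on this slide, so the sketch below does not reproduce them. It only illustrates the general idea with two hypothetical RTE-based flags: one on overall RTE and one on the lowest RTE within any run of consecutive items, which catches low effort confined to part of the test event. All cutoffs and the window size are assumptions, and accuracy-based flags are omitted.

```python
from typing import Sequence


def lowest_window_rte(solution_flags: Sequence[bool], window: int = 10) -> float:
    """Lowest RTE over any run of `window` consecutive items."""
    n = len(solution_flags)
    if n == 0:
        raise ValueError("need at least one item")
    window = min(window, n)
    return min(
        sum(solution_flags[i:i + window]) / window
        for i in range(n - window + 1)
    )


def is_invalid_test_event(solution_flags: Sequence[bool],
                          min_overall_rte: float = 0.90,  # hypothetical cutoff
                          min_window_rte: float = 0.70,   # hypothetical cutoff
                          window: int = 10) -> bool:
    """A test event is called invalid if at least one flag is triggered."""
    overall_rte = sum(solution_flags) / len(solution_flags)
    flags = [
        overall_rte < min_overall_rte,
        lowest_window_rte(solution_flags, window) < min_window_rte,
    ]
    return any(flags)


# Hypothetical 20-item test event with six consecutive rapid guesses.
event = [True] * 12 + [False] * 6 + [True] * 2
print(lowest_window_rte(event))      # 0.4
print(is_invalid_test_event(event))  # True
```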
• Consider a test event as a student seeing a series of items in a particular context.
• Student factors: gender, grade
• Item factors: content area, amount of reading, presence of a table, figure or graph.
• Context factors: item position, time of day, test stakes, heat/cold, noise distractions
Correlates of Student Effort
Average RTE in Math
[Figure: mean RTE (axis range 0.93 to 1.00) by grade (3 to 9), with separate series for females and males.]
Average RTE in Reading
[Figure: mean RTE (axis range 0.93 to 1.00) by grade (3 to 9), with separate series for females and males.]
Invalid Test Events in Math, By Time of Day
[Figure: percent invalid scores (axis range 0 to 25) by time of day (7:00 a.m. to 2:00 p.m.), with separate series for Grades 3 through 9.]
Invalid Test Events in Reading, By Time of Day
[Figure: percent invalid scores (axis range 0 to 25) by time of day (7:00 a.m. to 2:00 p.m.), with separate series for Grades 3 through 9.]
Student   Spr10 RIT   Fa10 RIT   Spr11 RIT
John      229         242        245
Paul      168         190        215
George    201         210        170
Ringo     229         174        241
Yoko      201         159        225
Some Actual MAP Test Events
Which of the score patterns seem reasonable? Which do not?
Student          Spr10 RIT   Fa10 RIT   Spr11 RIT
John             229         242        245
  John’s RTE:    1.0         1.0        1.0
Paul             168         190        215
  Paul’s RTE:    .72         .90        1.0
George           201         210        170
  George’s RTE:  .80         .68        .52
Ringo            229         174        241
  Ringo’s RTE:   1.0         .28        1.0
Yoko             201         159        225
  Yoko’s RTE:    .56         .42        .92
Considering the Test Events in Light of RTE Information
MAP scores with low RTEs are not trustworthy.
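One way to read the table is to mark every test event whose RTE falls below a cutoff. The RITs and RTEs below are taken from the table; the 0.90 cutoff is a hypothetical value chosen for illustration, not a published criterion.

```python
RTE_CUTOFF = 0.90  # hypothetical cutoff, for illustration only
TERMS = ["Spr10", "Fa10", "Spr11"]

# (RIT, RTE) for each test event, copied from the table.
records = {
    "John":   [(229, 1.00), (242, 1.00), (245, 1.00)],
    "Paul":   [(168, 0.72), (190, 0.90), (215, 1.00)],
    "George": [(201, 0.80), (210, 0.68), (170, 0.52)],
    "Ringo":  [(229, 1.00), (174, 0.28), (241, 1.00)],
    "Yoko":   [(201, 0.56), (159, 0.42), (225, 0.92)],
}


def suspect_terms(events, cutoff=RTE_CUTOFF):
    """Return the terms whose RTE falls below the cutoff."""
    return [term for term, (_, rte) in zip(TERMS, events) if rte < cutoff]


for student, events in records.items():
    print(student, suspect_terms(events))
```

Under this cutoff, only John has three trustworthy scores. Ringo's fall score is suspect, which accounts for both his apparent 55-point drop and the 67-point rebound that follows.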
• Identify suspect RITs on our reports.
• Try to preempt non-effortful responding by developing a smart test that monitors effort and displays messages to students and/or proctors.
• Develop methods for adjusting RITs for the amount of non-effortful behavior in a test event.
Addressing the Problem: What NWEA Can Do
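The effort-monitoring idea can be sketched as a small state machine. This is a hypothetical design, not a description of any NWEA product: the `EffortMonitor` class, the rule of alerting after three consecutive rapid guesses, and the 3-second threshold are all assumptions.

```python
class EffortMonitor:
    """Tracks consecutive rapid guesses and signals when to show a message."""

    def __init__(self, max_consecutive_rapid: int = 3):
        self.max_consecutive_rapid = max_consecutive_rapid  # hypothetical rule
        self.consecutive_rapid = 0

    def record_response(self, response_time: float, threshold: float) -> bool:
        """Record one response; return True if a message should be displayed."""
        if response_time < threshold:
            self.consecutive_rapid += 1  # rapid guess
        else:
            self.consecutive_rapid = 0   # solution behavior resets the count
        if self.consecutive_rapid >= self.max_consecutive_rapid:
            self.consecutive_rapid = 0   # start counting again after the alert
            return True
        return False


# Hypothetical response times in seconds, with a 3-second threshold.
monitor = EffortMonitor()
alerts = [monitor.record_response(t, 3.0) for t in [10.0, 1.0, 1.0, 1.0, 12.0]]
print(alerts)  # [False, False, False, True, False]
```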
• Explain to students the importance of giving their best effort on MAP.
• Administer MAP in a setting that is free from construct-irrelevant factors.
• Administer MAP in the morning when possible.
Addressing the Problem: What You Can Do