
Evaluation of Tutoring Systems

Kurt VanLehn

PSLC Summer School 2006

Outline

Formative evaluations: How to improve the tutor?
Summative evaluations: How well does the tutor work?
Parametric evaluations: Why does the tutor work?

User interface testing (usually the first formative evaluation)

When
– Early, before the rest of the tutor is built

Who
– Students
– Instructors

How
– Talk aloud with headset mikes and Camtasia
– Sit with the user and ask them about every frown
– Other?

Also great for finding good problems to assign

The Andes user interface

Pyrenees' GUI for the same problem

Andes’ GUI for defining a velocity vector

Pyrenees' method

T: What would you like to define?
   a) displacement  b) velocity  c) acceleration
S: b
T: Velocity of…
   a) Hailstone  b) Cloud  c) Earth
S: a
T: Type?
   a) instantaneous  b) average
S: a
T: Time point?
   a) T0  b) T1
S: a
T: Orientation?
S: 270 deg
T: Name?
S: vi

Wizard of Oz (a formative evaluation)

Format
– A human in the next room watches a copy of the screen
– The human responds when the student presses the Hint button or makes an error
– The human must type very fast! Paste from stock answers?

User interface evaluation
– Does the human have enough information?
– Can the human intervene early enough?

Knowledge elicitation
– What tutoring tactics were used?
– What conditions determine when each tactic is used?

Snapshot critiques (a late formative evaluation)

Procedure
– ITS keeps a log file
– Afterwards, randomly select events in the log where the student got help (see the sketch below)
– Print the context leading up to the help message
– Expert tutors write their help on the paper

How frequently does an expert's help match the ITS's?
– How frequently do two experts' help messages match?

Add to the ITS the help that the experts agree on
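A minimal sketch of the sampling step, assuming a hypothetical CSV log with event_type, student_id, timestamp, and context columns; a real ITS log will need its own field names:

```python
# Randomly sample help events from an ITS log for snapshot critiques.
# The column names ("event_type", "student_id", "timestamp", "context")
# are assumptions; adapt them to your tutor's actual log format.
import csv
import random

def sample_help_events(log_path, n=30, seed=0):
    """Select n random events where the student received help."""
    with open(log_path, newline="") as f:
        events = [row for row in csv.DictReader(f)
                  if row["event_type"] == "help_given"]
    random.seed(seed)  # fixed seed makes the sample reproducible
    return random.sample(events, min(n, len(events)))

for event in sample_help_events("tutor_log.csv"):  # hypothetical file
    print(event["student_id"], event["timestamp"], event["context"])
```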

Outline

Formative evaluations: How to improve the tutor?
Summative evaluations: How well does the tutor work?  (next)
Parametric evaluations: Why does the tutor work?

Summative evaluations

Question: Is the tutor more effective than a control?

Typical design
– Experimental group uses the tutor
– Control group learns via the "traditional" method
– Pre- and post-tests

Data analysis
– Did the tutor group "do better" than the control?

Three feasible designs. In each, one factor is forced to be equal and the other two vary.

Factor              Like homework     Like seatwork     Mastery learning
Training problems   Tutor = control   Tutor > control?  Tutor < control?
Training duration   Tutor < control?  Tutor = control   Tutor < control?
Post-test score     Tutor > control?  Tutor > control?  Tutor = control

Control conditions

Typical control conditions
– Existing classroom instruction
– Textbook & exercise problems (feedback?)
– Another tutoring system
– Human tutoring
  » A null result does not "prove" computer tutor = human tutor

Define your control condition early
– It drives the design of the tutor

Assessments (tests)
– Your tests
– Instructor's normal tests
– Standardized tests

When to test
– Pre-test
– Immediate post-test
– Delayed post-test: measures retention
– Learning (pre-test, training, post-test): measures acceleration of future learning (also called preparation for learning)

Example of acceleration of future learning (Min Chi & VanLehn, in prep.)

Design
– Training on probability, then physics
– During probability only:
  » Half the students were taught an explicit strategy
  » Half were not taught a strategy (normal instruction)

[Figure: two panels of score vs. pre/post test. The probability-training panel shows ordinary transfer; the physics-training panel shows preparation for learning]

Content of post-tests
– Some problems from the pre-test: determines whether any learning occurred at all
– Some problems similar to the training problems: measures near transfer
– Some problems dissimilar to the training problems: measures far transfer

Use your cognitive task analysis!

Bad tests happen, so pilot, pilot, pilot!

Blatant mistakes (show up in the means)
– Too hard (floor effect)
– Too easy (ceiling effect)
– Too long (mental attrition)

Subtle mistakes (check the variance)
– Test doesn't cover some training content
– Test over-covers some training content
– Test is too sensitive to background knowledge
  » E.g., reading, basic math

Did the conditions differ?

My advice: always do ANCOVAs (a sketch follows this list)
– Condition is the independent variable
– Post-test score is the dependent variable
– Pre-test score is the covariate

Others' advice:
– Do ANOVAs on gain scores
– If pre-test scores are not significantly different, do ANOVAs on post-test scores
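A minimal ANCOVA sketch using statsmodels; the file name and the column names (post, pre, condition) are assumptions, not a standard:

```python
# ANCOVA via ordinary least squares: condition is the independent
# variable, post-test score the dependent variable, and pre-test
# score the covariate. File and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("scores.csv")  # one row per student
model = smf.ols("post ~ pre + C(condition)", data=df).fit()
print(model.summary())  # the C(condition) coefficient estimates the
                        # tutor effect, adjusted for pre-test score
```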

Effect sizes: Cohen's d

Should be based on post-test scores:
d = [mean(experimental) − mean(control)] / SD(control)

A common but misleading usage:
d = [mean(post-test) − mean(pre-test)] / SD(pre-test)
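The recommended form as a minimal NumPy sketch; the example scores are made up:

```python
# Cohen's d as defined above: difference in post-test means divided
# by the control group's standard deviation.
import numpy as np

def cohens_d(experimental, control):
    return (np.mean(experimental) - np.mean(control)) / np.std(control, ddof=1)

# Hypothetical post-test scores for each group.
print(cohens_d([78, 85, 90, 82], [70, 75, 72, 80]))
```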

Error bars help visualize results

[Figure: bar chart of the number of help requests (0–50) with error bars, for four conditions: Complete/Paraphrase, Complete/Self-explain, Incomplete/Paraphrase, Incomplete/Self-explain]
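A minimal matplotlib sketch of such a plot; the means and standard errors below are made up for illustration:

```python
# Bar chart with error bars, in the spirit of the figure above.
import matplotlib.pyplot as plt

conditions = ["Complete\nParaphrase", "Complete\nSelf-explain",
              "Incomplete\nParaphrase", "Incomplete\nSelf-explain"]
means = [22, 15, 38, 27]  # hypothetical mean help requests per condition
sems = [4, 3, 6, 5]       # hypothetical standard errors

plt.bar(conditions, means, yerr=sems, capsize=5)
plt.ylabel("Number of help requests")
plt.show()
```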

Scatter plots help visualize results

[Figure: z-score on the exam vs. GPA, one point per student and one regression line per condition.
Andes: y = 0.9473x − 2.4138, R² = 0.2882
Controls: y = 0.7956x − 2.5202, R² = 0.2048
Schematically: post-test score vs. pre-test score (or GPA), with roughly parallel Andes and Control lines]

If the slopes in the scatter plot above were different, we would have an aptitude-treatment interaction (ATI); a sketch of the interaction test follows.

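A minimal sketch of the corresponding test, assuming the same hypothetical scores.csv as the ANCOVA sketch: add a pretest-by-condition interaction term and check whether the slopes differ.

```python
# Test for an aptitude-treatment interaction: the "*" adds a
# pre x condition interaction term to the ANCOVA model.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("scores.csv")  # hypothetical file, as above
model = smf.ols("post ~ pre * C(condition)", data=df).fit()
print(model.summary())  # a significant pre:C(condition) term means the
                        # regression slopes differ across conditions (ATI)
```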

Which students did the tutor help?
– Divide subjects into high vs. low pre-test groups
– Plot the gains
– Also called an "aptitude-treatment interaction" (ATI) analysis
– Needs more subjects

[Figure: gain for tutored vs. control groups, plotted for low-pretest and high-pretest students]

Which topics did the tutor teach best?
– Divide test items (e.g., into deep vs. shallow knowledge)
– Plot the gains
– Needs more items

[Figure: gain for tutored vs. control groups, plotted for deep and shallow items]

Log file analyses

Did students use the tutor as expected?
– Using help too much (help abusers)
– Using help too little (help refusers)
– Copying a solution from someone else (exclude?)

Correlations with gain
– Errors corrected with or without help
– Proportion of bottom-out hints
– Time spent thinking before/after a hint

Learning curves for productions (see the sketch below)
– If the curve is not smooth, is it really a single production?
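Learning curves for productions are conventionally fit with a power law; here is a minimal SciPy sketch using made-up error rates for one production:

```python
# Fit a power-law learning curve (error rate vs. practice opportunity)
# for a single production. Systematic deviations from the fitted curve
# suggest the "production" may actually bundle several distinct skills.
import numpy as np
from scipy.optimize import curve_fit

def power_law(opportunity, a, b):
    return a * opportunity ** (-b)

opportunities = np.arange(1, 9)  # 1st through 8th practice opportunity
error_rates = np.array([0.55, 0.40, 0.33, 0.28,
                        0.25, 0.22, 0.21, 0.20])  # hypothetical data

(a, b), _ = curve_fit(power_law, opportunities, error_rates)
print(f"error rate ≈ {a:.2f} * opportunity^(-{b:.2f})")
```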

Practical issues

All experiments
– Human subjects institutional review board (IRB)

Lab experiments
– Recruiting subjects over a whole semester; knowledge varies
– Attrition: students quit before they finish

Field (classroom) experiments
– Access to classrooms and teachers
– Instructors' enthusiasm, tech-savviness, and agreement with the pedagogy
– Ethics of requiring use/non-use of the tutor in high-stakes classes
– Their tests vs. your tests

Web experiments
– Ensuring random assignment despite attrition (see the sketch below)
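One common way to keep web assignment random yet stable under attrition and repeat visits is to hash a persistent user ID; a minimal sketch, where the ID scheme and condition names are assumptions:

```python
# Deterministic "random" assignment: hashing the user ID gives each
# student a stable condition, even if they reload or return later.
import hashlib

def assign_condition(user_id, conditions=("tutor", "control")):
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return conditions[int(digest, 16) % len(conditions)]

print(assign_condition("student42"))  # always the same condition for this ID
```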

Outline

Formative evaluations: How to improve the tutor?
Summative evaluations: How well does the tutor work?
Parametric evaluations: Why does the tutor work?  (next)

Parametric evaluations: Why does the tutor work?

Hypothesize sources of benefit, such as:
– Explication of a hidden problem-solving skill
– Novel reification (GUI)
  » E.g., showing goals on the screen
– Novel sequence of exercises/topics
  » E.g., story problems first, then equations
– Immediate feedback & help

Plan an experiment or a sequence of experiments
– Don't try to do all 2^N combinations in one study
– Vary only 1 or 2 factors

Two types of parametric experiments

Removing a putative benefit from the tutor
– Two conditions:
  1. Tutor
  2. Tutor minus a benefit (e.g., immediate feedback)

Adding a putative benefit to the control
– Three conditions:
  1. Control
  2. Control plus a benefit (e.g., explication of a hidden skill)
  3. Tutor

In vivo experimentation

High internal validity required
– Helps us understand human learning
– All but a few factors are controlled
– A summative evaluation of tutoring usually varies too many factors

Often done in the context of tutoring systems
– Parametric
– Offline, but the tutoring system serves as the pre/post test

Evaluations of tutoring systems

Formative evaluations: How to improve the tutor?
– Pilot test the user interface alone
– Wizard of Oz
– Hybrids

Summative evaluations: How well does the tutor work?
– Two conditions: with and without the tutor
– Many supplementary analyses are possible

Parametric evaluations: Why does the tutor work?
– Compare different versions of the tutor
– Try putative benefits of the tutor with the control