Evaluation of Tutoring Systems Kurt VanLehn PSCL Summer School 2006
Page 1: Evaluation of Tutoring Systems Kurt VanLehn PSCL Summer School 2006.

Evaluation of Tutoring Systems

Kurt VanLehn

PSCL Summer School 2006

Page 2

Outline
Formative evaluations: How to improve the tutor?
Summative evaluations: How well does the tutor work?
Parametric evaluations: Why does the tutor work?

Page 3

User interface testing (usually the first formative evaluation)
When
– Early, before the rest of the tutor is built
Who
– Students
– Instructors
How
– Talk aloud with headset mikes and Camtasia
– Sit with the user and ask them about every frown
– Other?
Also great for finding good problems to assign

Page 4

The Andes user interface

Page 5

Pyrenees’ GUI for the same problem

Page 6

Andes’ GUI for defining a velocity vector

Page 7

Pyrenees’ method
T: What would you like to define?
   a) displacement  b) velocity  c) acceleration
S: b
T: Velocity of…
   a) Hailstone  b) Cloud  c) Earth
S: a
T: Type?
   a) instantaneous  b) average
S: a
T: Time point?
   a) T0  b) T1
S: a
T: Orientation?
S: 270 deg
T: Name?
S: vi

Page 8

Wizard of Oz (a formative evaluation)
Format
– Human in the next room watches a copy of the screen
– Human responds when the student presses the Hint button or makes an error
– Human must type very fast! Paste from stock answers?
User interface evaluation
– Does the human have enough information?
– Can the human intervene early enough?
Knowledge elicitation
– What tutoring tactics were used?
– What conditions determine when each tactic is used?

Page 9

Snapshot critiques (a late formative evaluation)
Procedure
– ITS keeps a log file
– Afterwards, randomly select events in the log where the student got help
– Print the context leading up to the help message
– Expert tutors write their help on the paper
How frequently does the expert’s help match the ITS’s?
– How frequently do two experts’ help match?
Add to the ITS the help that the experts agree on

Page 10

Outline
Formative evaluations: How to improve the tutor?
Summative evaluations: How well does the tutor work? (next)
Parametric evaluations: Why does the tutor work?

Page 11

Summative evaluations
Question: Is the tutor more effective than a control?
Typical design
– Experimental group uses the tutor
– Control group learns via the “traditional” method
– Pre & post tests
Data analysis
– Did the tutor group “do better” than the control?

Page 12

Three feasible designs. One factor is forced to be equal; two factors vary.

                     Like homework      Like seatwork      Mastery learning
Training problems    Tutor = control    Tutor > control?   Tutor < control?
Training duration    Tutor < control?   Tutor = control    Tutor < control?
Post-test score      Tutor > control?   Tutor > control?   Tutor = control

Page 13

Control conditions

Typical control conditions
– Existing classroom instruction
– Textbook & exercise problems (feedback?)
– Another tutoring system
– Human tutoring
  » A null result does not “prove” computer tutor = human tutor
Define your control condition early
– It drives the design of the tutor

Page 14

Assessments (tests)

– Your tests
– Instructor’s normal tests
– Standardized tests

Page 15

When to test

– Pre-test
– Immediate post-test
– Delayed post-test
  » Measures retention
– Learning (pre-test, training, post-test)
  » Measures acceleration of future learning (also called preparation for learning)

Page 16

Example of acceleration of future learning (Min Chi & VanLehn, in prep.)
Design
– Training on probability, then physics
– During probability only:
  » Half the students were taught an explicit strategy
  » Half were not taught a strategy (normal instruction)

[Figure: two panels plotting score at Pre and Post, one for probability training and one for physics training, illustrating preparation for learning vs. ordinary transfer.]

Page 17

Content of post-tests

Some problems from the pre-test
– Determines if any learning occurred at all
Some problems similar to the training problems
– Measures near transfer
Some problems dissimilar to the training problems
– Measures far transfer
Use your cognitive task analysis!

Page 18

Bad tests happen, so Pilot Pilot Pilot!

Blatant mistakes (show up in the means)
– Too hard (floor effect)
– Too easy (ceiling effect)
– Too long (mental attrition)
Subtle mistakes (check the variance)
– Test doesn’t cover some training content
– Test over-covers some training content
– Test is too sensitive to background knowledge
  » e.g., reading, basic math

Page 19

Did the conditions differ?

My advice: always do ANCOVAs
– Condition is the independent variable
– Post-test score is the dependent variable
– Pre-test score is the covariate
Others’ advice:
– Do ANOVAs on gains
– If pre-test scores are not significantly different, do ANOVAs on post-test scores
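The ANCOVA recipe above amounts to a linear model with pre-test as a covariate. A minimal sketch with NumPy and synthetic data (all numbers and variable names here are invented; a real analysis would use a statistics package such as R or statsmodels to get p-values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: pre-test scores, a binary condition (1 = tutor),
# and post-test scores with a true tutor effect of +5 points.
n = 100
pre = rng.normal(50, 10, n)
cond = np.repeat([0, 1], n // 2)
post = 0.8 * pre + 5.0 * cond + rng.normal(0, 5, n)

# ANCOVA as a linear model: post ~ intercept + pre (covariate) + condition
X = np.column_stack([np.ones(n), pre, cond])
beta, *_ = np.linalg.lstsq(X, post, rcond=None)
intercept, b_pre, b_cond = beta
print(f"adjusted tutor effect: {b_cond:.1f}")  # should land near the true +5
```

The condition coefficient is the tutor effect adjusted for pre-test differences, which is why ANCOVA is safer than a raw post-test comparison when groups start unequal.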

Page 20

Effect sizes: Cohen’s d

Should be based on post-test scores:
  [mean(experimental) − mean(control)] / standard_deviation(control)
Common but misleading usage:
  [mean(post-test) − mean(pre-test)] / standard_deviation(pre-test)
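The recommended formula can be sketched directly (the score lists are hypothetical; standardizing by the control group's SD, as above, is the variant sometimes called Glass's delta):

```python
import statistics

def cohens_d(experimental, control):
    """Effect size standardized by the control group's SD,
    per the first formula above."""
    diff = statistics.mean(experimental) - statistics.mean(control)
    return diff / statistics.stdev(control)

# Hypothetical post-test scores
tutor   = [78, 85, 90, 72, 88]
control = [70, 75, 80, 65, 72]
print(round(cohens_d(tutor, control), 2))  # → 1.82
```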

Page 21

Error bars help visualize results

[Figure: bar chart with error bars; y-axis is number of help requests (0–50) for four conditions: Complete/Paraphrase, Complete/Self-explain, Incomplete/Paraphrase, Incomplete/Self-explain.]
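Error bars in plots like this are typically the mean ± one standard error per condition. A minimal sketch with hypothetical help-request counts:

```python
import statistics

# Hypothetical help-request counts per student in two of the four conditions
counts = {
    "Complete/Paraphrase":   [12, 15, 9, 14, 11],
    "Complete/Self-explain": [22, 30, 25, 28, 24],
}

for condition, values in counts.items():
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / len(values) ** 0.5  # standard error of the mean
    print(f"{condition}: {mean:.1f} ± {sem:.1f}")
```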

Page 22

Scatter plots help visualize results

[Figure: scatter plot of exam z-score vs. GPA, one point per student, with linear fits.
 Andes: y = 0.9473x − 2.4138, R² = 0.2882
 Controls: y = 0.7956x − 2.5202, R² = 0.2048
 Schematic inset: post-test score vs. pre-test score (or GPA) for Andes and Control.]

Page 23

If the slopes were different, there would be an aptitude-treatment interaction (ATI)

[Figure: the same scatter plot and regression fits as the previous slide.]
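A formal check for different slopes is to add a pretest-by-condition interaction term to the regression; a nonzero interaction coefficient is the ATI. A sketch with synthetic data (the slopes are chosen to echo the fits above; nothing here is real study data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
pre = rng.uniform(1.0, 4.0, n)        # e.g., GPA
cond = np.repeat([0, 1], n // 2)       # 0 = control, 1 = tutor
# Synthetic post scores: control slope 0.80, tutor slope 0.95
post = (0.80 + 0.15 * cond) * pre - 2.5 + rng.normal(0, 0.5, n)

# Model: post ~ 1 + pre + cond + pre*cond ;
# the last coefficient is the slope difference between conditions.
X = np.column_stack([np.ones(n), pre, cond, pre * cond])
beta, *_ = np.linalg.lstsq(X, post, rcond=None)
slope_diff = beta[3]
print(f"slope difference (tutor - control): {slope_diff:.2f}")
```

If the interaction coefficient is reliably nonzero, the tutor helps students with different aptitudes by different amounts.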

Page 24

Which students did the tutor help?
– Divide subjects into high/low pretest
– Plot gains
– Called “aptitude-treatment interaction” (ATI)
– Need more subjects
[Schematic: gain for Tutored vs. Control at low and high pretest]

Page 25

Which topics did the tutor teach best?
– Divide test items (e.g., into deep/shallow knowledge)
– Plot gains
– Need more items
[Schematic: gain for Tutored vs. Control on deep and shallow items]

Page 26

Log file analyses
Did students use the tutor as expected?
– Using help too much (help abusers)
– Using help too little (help refusers)
– Copying a solution from someone else (exclude?)
Correlations with gain
– Errors corrected with or without help
– Proportion of bottom-out hints
– Time spent thinking before/after a hint
Learning curves for productions
– If not a smooth curve, is it really a single production?
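Production learning curves are conventionally fit as a power law, error ≈ a · opportunity^(−b); a poor fit is a hint that the “production” may really be several distinct skills. A minimal sketch with hypothetical log-file data:

```python
import math

# Hypothetical error rates for one production across successive
# practice opportunities (a smooth power-law decline).
opportunities = [1, 2, 3, 4, 5, 6, 7, 8]
error_rates   = [0.50, 0.33, 0.26, 0.22, 0.19, 0.17, 0.155, 0.145]

# Fit error = a * opportunity^(-b) by ordinary least squares in log-log space.
xs = [math.log(o) for o in opportunities]
ys = [math.log(e) for e in error_rates]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = -sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = math.exp(my + b * mx)
print(f"power-law fit: error ≈ {a:.2f} * opportunity^-{b:.2f}")
```

Large residuals from such a fit would suggest splitting the production in the cognitive model and re-plotting the curves.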

Page 27

Practical issues
All experiments
– Human subjects institutional review board (IRB)
Lab experiments
– Recruiting subjects over a whole semester; knowledge varies
– Attrition: students quit before they finish
Field (classroom) experiments
– Access to classrooms and teachers
– Instructors’ enthusiasm, tech-savviness, agreement with the pedagogy
– Ethics of requiring use/non-use of the tutor in high-stakes classes
– Their tests vs. your tests
Web experiments
– Ensuring random assignment despite attrition

Page 28

Outline
Formative evaluations: How to improve the tutor?
Summative evaluations: How well does the tutor work?
Parametric evaluations: Why does the tutor work? (next)

Page 29

Parametric evaluations: Why does the tutor work?
Hypothesize sources of benefit, such as
– Explication of a hidden problem-solving skill
– Novel reification (GUI)
  » E.g., showing goals on the screen
– Novel sequence of exercises/topics
  » E.g., story problems first, then equations
– Immediate feedback & help
Plan an experiment or sequence of experiments
– Don’t try to do all 2^N combinations in one study
– Vary only 1 or 2 factors

Page 30

Two types of parametric experiments
Removing a putative benefit from the tutor
– Two conditions:
  1. Tutor
  2. Tutor minus a benefit (e.g., immediate feedback)
Adding a putative benefit to the control
– Three conditions:
  1. Control
  2. Control plus a benefit (e.g., explication of a hidden skill)
  3. Tutor

Page 31

In vivo experimentation

High internal validity required
– Helps us understand human learning
– All but a few factors are controlled
– Summative evaluations of tutoring usually vary too many factors
Often done in the context of tutoring systems
– Parametric
– Off line, but the tutoring system serves as the pre/post test

Page 32

Evaluations of tutoring systems
Formative evaluations: How to improve the tutor?
– Pilot test the user interface alone
– Wizard of Oz
– Hybrids
Summative evaluations: How well does the tutor work?
– 2 conditions: with and without the tutor
– Many supplementary analyses are possible
Parametric evaluations: Why does the tutor work?
– Compare different versions of the tutor
– Try putative benefits of the tutor with the control

