Detecting Boredom and Engagement During Writing with Keystroke Analysis, Task Appraisals, and Stable Traits

Robert Bixler
University of Notre Dame
384 Fitzpatrick Hall
Notre Dame, IN 46556
[email protected]

Sidney D'Mello
University of Notre Dame
384 Fitzpatrick Hall
Notre Dame, IN 46556
[email protected]

ABSTRACT

It is hypothesized that the ability of a system to automatically detect and respond to users' affective states can greatly enhance the human-computer interaction experience. Although there are currently many options for affect detection, keystroke analysis offers several attractive advantages over traditional methods. In this paper, we consider the possibility of automatically discriminating between natural occurrences of boredom, engagement, and neutral by analyzing keystrokes, task appraisals, and stable traits of 44 individuals engaged in a writing task. The analyses explored several different arrangements of the data: using downsampled and/or standardized data; distinguishing between three different affect states or groups of two; and using keystroke/timing features in isolation or coupled with stable traits and/or task appraisals. The results indicated that the use of raw data and the feature set that combined keystroke/timing features with task appraisals and stable traits yielded accuracies that were 11% to 38% above random guessing and generalized to new individuals. Applications of our affect detector for intelligent interfaces that provide engagement support during writing are discussed.

Author Keywords

Affect detection; keystroke dynamics; boredom; engagement; free text.

ACM Classification Keywords

Categories and subject descriptors: H.5.m [Information Interfaces and Presentation]: Miscellaneous

INTRODUCTION

One of the primary goals of the field of Affective Computing is to develop intelligent systems that can detect and respond to users' affective states [28, 29]. In general, affective systems can be subdivided into three categories: systems that detect affect, systems that express affect, and systems that experience affect (i.e., artificial emotions). Although each type of system has far-reaching applications in a myriad of research domains, detecting affect is an integral component in designing applications that intelligently communicate with users and assist them in performing their tasks.

The Affective Computing community has recognized the importance of detecting affect, and a wide array of methods have been developed [see 3, 27, and 35 for reviews]. The preeminent modality associated with affect is that of the face, presumably due to the well-known links between affect and facial expressions [13, 26, 31]. A second modality that has received considerable attention is speech, and it has been shown that paralinguistic features of speech can serve as an important index into affect [18, 21]. Text analysis is a third modality that has been heavily researched, as evidenced by the nascent field of sentiment analysis [25]. The fourth modality includes physiological signals, such as the electrical conductivity of the skin, electrical activity of the heart, and activation of muscles in the face. These measures of peripheral physiology can be complemented by measures of central physiology such as fMRI and EEG [24]. Other measures that have received somewhat less attention include eye gaze, postures, and gestures [3].

There are some important factors that must be considered when evaluating the usefulness of any particular modality for affect detection. These are: 1) validity of the signal as being diagnostic of affect, 2) reliability of the signal in a real-world environment, 3) time resolution of the signal, and 4) cost of implementation and intrusiveness of the sensors. Although significant advances in the development of functional affect detection systems have been made over the last decade, most current systems fail to achieve one or more of these desirable features.

Taking a somewhat different approach from the common modalities discussed above, the present paper focuses on the use of keystroke analysis to develop fully automated affect detectors while individuals perform a writing task with a computer interface. The working hypothesis is that an analysis of users' keystrokes can yield information that signals their affective states. The focus is not on what specific content is generated, but on how the content is generated.

There are many attractive aspects of keystroke analysis for affect detection. Keystroke analysis requires no extra hardware beyond the already present computer and keyboard. All that is required is software to record and analyze the keystrokes, so the method is unobtrusive, cost-effective, and scalable. Keystrokes are transferred directly from the keyboard to the data collection software, resulting in a reliable signal with negligible interference. Because each keystroke event occurs within a short time frame of the previous keystroke, enough data can be gathered to detect and respond to affect in a timely manner. The only unknown factor is how accurately keystroke analysis predicts affect, which is the focus of the present work. We begin by briefly reviewing some of the research on keystroke analysis as an affect detection channel, followed by an overview of the present study.

Previous Work on Keystroke Analysis and Affect

Although not directly relevant to affect detection, the largest body of research on keystroke analysis is on authentication systems. These systems aim to identify individuals based on their unique typing patterns. The most distinctive variation of keystroke-based authentication systems is fixed text versus free text. Most attempts at keystroke authentication involve developing a profile while users type a pre-determined text [2, 17]. This profile is then compared to any subsequent attempts to log in as that user. Free text analysis consists of logging the keystrokes during the user's session without any constraints on what the user can type, and comparing similarities between the authentication profile and patterns observed during typing [16]. Considerably less research has been done in this area because fixed text analysis has a much lower error rate – a measurement that is paramount in authentication systems [5].

There have been a few studies that have used keystroke analysis for affect detection. Vizer et al. [32] explored the possibility of detecting users' stress by continuously monitoring keyboard interactions. Their features consisted of timing, keystroke, and linguistic information. They tested their models with both raw and standardized data in order to minimize subject variability. They reported that their classification rates using standardized data were consistent with those achieved by other researchers using different methods such as facial recognition, speech analysis, and a pressure-sensing mouse.

Khanna and Sasikumar used a suite of classifiers in WEKA to distinguish between positive, negative, and neutral affect based on fixed text [19]. Their data was collected while users re-typed text from chosen paragraphs. After copying the paragraphs, users were instructed to indicate their affective state during the typing task. The classifiers that performed best were BF Trees and J48, with correct classification rates of 89.0% and 84.1% for detecting negative affect from neutral affect, and 86.4% and 88.9% for detecting positive affect from neutral affect.

A more recent study by Epp et al. [14] collected both fixed and free text and analyzed models for six different affective states. During their experiment they collected 10 minutes of free text prior to a questionnaire that asked users which affect they were experiencing. Following the questionnaire was a passage from Alice in Wonderland, which the user was instructed to type as a sample of fixed text. They then created models based on either the fixed or free text samples. Their best performing models all used fixed text and achieved accuracies of 77.4%-87.8%, with kappas ranging between .55 and .75.

Overview of Current Study

Keystrokes are never generated in a vacuum, but are always coupled to an underlying activity. The target activity in the present study was a writing activity that involved the composition of essays on a variety of different topics. Writing is an activity that lends itself to keystroke analysis, because the purpose of writing is to generate content, which involves keystrokes in most cases. There is also evidence suggesting that affective states might be tied to both the topics and processes of writing, so it is important to understand the affective states that arise during writing and how they influence writing outcomes [12, 22]. In line with this, the purpose of this paper is to develop automated methods to monitor the moment-to-moment affective states that individuals experience while writing about topics that vary in emotional intensity. The hope is that the automated methods can be integrated into intelligent interfaces that can help students develop writing proficiency by detecting and responding to their affective states (this is discussed in more detail in the General Discussion section).

Our proposed approach varies from the previous work on keystroke analysis reviewed above in several key ways. First, we attempt to detect affective states that naturally occur in the context of an ongoing activity, rather than inducing affect or tracking affect before or after writing. Second, we attempt to classify affect in 15-second intervals, which is a considerably shorter time frame compared to other free text methods. Third, the granularity of our affect classification is finer; instead of either distinguishing between positive and negative affect or detecting a particular affective state, we distinguish between either two or three non-basic affective states, as discussed in the next section. Fourth, our models include features, such as number of pauses, which appear in very few studies on keystroke analysis. Fifth, in addition to the keystrokes themselves, our models also include information at two additional levels of granularity: the writer's appraisals or impressions of the writing task prior to writing (task appraisals) and relevant measures of the writer's individual traits (stable traits). Finally, unlike the majority of research on keystroke analysis, which uses fixed text entry methods, we pursue free text analysis because it offers a glimpse into the mechanics of writing, which we hypothesize conveys affect.

We begin by describing the data set used to collect training data for the affect detection models. Next, the feature engineering, selection, and model building phases are discussed. This is followed by the results of a cross-validation analysis. Finally, we present an overview of our major findings and discuss limitations, possible resolutions, and potential applications.

DATA COLLECTION

Participants

The participants were 44 U.S. undergraduate students (68% female; mean age of 19.9 years; 45% Caucasian, 52% African American, and 3% "Other") who participated for course credit.

Essay Topics

The study involved relatively fine-grained tracking of affective states while participants wrote essays on three different topics: (a) academic topics, which were obtained from the writing portion of the American College Testing (ACT) test, (b) socially charged issues, like abortion and the death penalty, and (c) personal emotional experiences, such as recent angry or happy experiences. These three topics were selected because they reflect a continuum with respect to their expected emotional impact, ranging from the affectively neutral academic topics to the affectively charged personal emotional experiences, with the socially charged issues lying in between these two extremes.

In order to increase interest in the writing activity, participants were allowed to choose one subtopic from a list of options under each topic. Academic topics were adapted from the writing portion of the ACT and subtopics included "time spent in high school", "the use of class discussion", and "social skills being taught in schools". Socially charged subtopics included "abortion", "gays in the military", and "the death penalty". Subtopics for the personal emotional experiences condition involved writing about one of six basic affective states (anger, disgust, fear, happiness, sadness, and surprise).

Stable Trait Measures (Individual Differences)

In order to account for individual (between-subject) differences, we collected stable traits for each subject. The trait measures we used were scholastic aptitude, writing apprehension, and exposure to print. Participants' self-reported ACT scores were used as the measure of scholastic aptitude. Self-reported ACT scores have been found to correlate with actual test scores [4], so we have some confidence in this measure.

Writing apprehension was measured with the validated 26-item Writing Apprehension Test (WAT) [9]. The WAT is a self-report inventory where participants respond to a series of statements (e.g., "I avoid writing", "I'm nervous about writing") via a 5-point Likert scale. As recommended by Daly and Miller [9], the test was modified for use outside of the classroom by removing six items pertaining to classroom composition activities. Scores on the WAT can range from 20 to 100, with lower scores indicating more writing apprehension.

Exposure to print was measured with the Author Recognition Test [8]. Participants were presented with a list of the names of 42 popular authors (e.g., J. R. R. Tolkien, Dean Koontz) and had to place a check mark by each author they recognized. The version of the measure we implemented did not have any distracter items; hence, scores were computed as the proportion of authors that were correctly recognized.
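For concreteness, the sketch below shows how these two questionnaire-based trait scores might be computed. This is our own Python illustration, not code from the study; in particular, the reverse-keyed item indices are a placeholder rather than the published WAT scoring key.

```python
def wat_score(responses):
    """Writing Apprehension Test score from the 20 retained items
    (each a 1-5 Likert rating). The WAT mixes positively and negatively
    worded items, so reverse-keyed items must be flipped before summing;
    REVERSED below is a placeholder, not the published key."""
    REVERSED = set()  # indices of reverse-keyed items (placeholder)
    return sum((6 - r) if i in REVERSED else r
               for i, r in enumerate(responses))  # ranges from 20 to 100

def art_score(checked_authors, n_authors=42):
    """Author Recognition Test: proportion of the 42 real authors
    recognized (this version of the measure had no foil items)."""
    return len(checked_authors) / n_authors
```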

Task Appraisal Measures

Participants completed a Task Appraisal Questionnaire prior to writing each essay. The questionnaire used a six-point scale to collect information on participants' (1) interest in the topic, (2) how often they discussed the topic, (3) how comfortable they were with the topic (important for socially charged issues), (4) how confident they were in their ability to write a good essay, (5) how much effort they would put into writing, (6) their perceived enjoyment while writing, and (7) how personally connected they were to the topic.

Procedure

Essay Writing

Participants were given 10 minutes to write an essay on each of the three topics. The order of topics was counterbalanced across participants with a 3 × 3 Latin Square. For each topic, they were asked to select one of the subtopics listed above. They typed their essays on a computer interface. Each keystroke was logged along with timestamps relative to the beginning of the session and of the essay, as well as the number of milliseconds since the last keystroke. Videos of participants' faces and computer screens were also recorded.
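As an illustration of the logging just described, the minimal Python sketch below records each keystroke with its session offset, essay offset, and latency since the previous keystroke. The class and field names are our own assumptions; the paper does not describe its logging software.

```python
import time

class KeystrokeLogger:
    """Minimal sketch of the logging described above: each keystroke is
    stored with its offset from the session start, its offset from the
    current essay's start, and the latency since the previous keystroke."""

    def __init__(self):
        self.session_start = time.monotonic()
        self.essay_start = self.session_start
        self.last_key_time = None
        self.log = []  # one dict per keystroke

    def start_new_essay(self):
        self.essay_start = time.monotonic()
        self.last_key_time = None

    def record(self, key):
        now = time.monotonic()
        latency_ms = (None if self.last_key_time is None
                      else (now - self.last_key_time) * 1000.0)
        self.log.append({
            "key": key,
            "session_ms": (now - self.session_start) * 1000.0,
            "essay_ms": (now - self.essay_start) * 1000.0,
            "latency_ms": latency_ms,  # ms since last keystroke
        })
        self.last_key_time = now
```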

Retrospective Affect Judgment

Participants provided self-judgments of their affective states immediately after the writing session. Similar to a cued-recall procedure [10, 30], the judgments for a participant's writing session began by playing a video of the face along with the screen capture video on a widescreen monitor. The screen capture included the writing prompt and dynamically presented the text as it was written, thereby providing the context of the writing session. Participants were instructed to make judgments on what affective states were present at any moment during the writing session by manually pausing the videos. They were also instructed to make judgments at each 15-second interval. The fifteen affective states available to select from were: anger, contempt, disgust, fear, happiness, sadness, surprise, boredom, confusion, delight, engagement, frustration, anxiety, curiosity, and neutral.


It is important to mention three points pertaining to the present affect judgment methodology. First, offline judgments were used because they allow monitoring of participants' affective states at multiple points, with minimal task interference, and without participants knowing that these states were being monitored during completion of the task. Second, this retrospective affect-judgment method has been previously validated [30], and analyses comparing these offline affect judgments with online measures encompassing self-reports and observations by judges have produced similar distributions of emotions [6, 7]. Third, the offline affect annotations obtained via this retrospective protocol correlate with online recordings of facial activity and gross body movements in expected directions [11]. Although no method is without its limitations, the present method appears to be a viable approach to track emotions at a relatively fine-grained temporal resolution.

Affect Distributions

The retrospective affect judgment protocol yielded 5,551 affect judgments across the 44 participants. The distribution of affective states supports two major conclusions. First, the fourteen affective states cumulatively accounted for 78.9% of the observations, while neutral was reported for the remaining 21.1% of the observations. Second, only six of the 14 affective states (excluding neutral) occurred with some regularity. The most frequent affective state was engagement, with an occurrence rate of 35.4%, followed by boredom at 26.4%; these two states comprised approximately half of the observations (51.8%). We chose to focus on engagement, boredom, and neutral in this study because they comprised 72.9% of the observations; the remaining affective states were either observed at very low frequencies or inconsistently observed across participants.

MODEL BUILDING

Three types of features were used for the present analysis: keystroke/timing features, task appraisals, and stable traits. ACT scores, WAT scores, and exposure to print were used as stable traits for all subjects. The seven pre-essay questions were used as task appraisals for each essay. These two sets of features have been described in the previous section, so we focus on keystroke/timing features here.
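To make the combination concrete, here is a minimal Python sketch of how one training instance might be assembled from the three feature sets; the key names are our own shorthand, not the paper's.

```python
def combined_instance(interval_feats, appraisals, traits):
    """One instance for the full feature set: the 12 keystroke/timing
    features for a 15-second interval, the 7 task-appraisal ratings for
    that essay, and the 3 stable traits for that writer."""
    row = dict(interval_feats)  # from Table 1 (next section)
    row.update({f"appraisal_{i}": v for i, v in enumerate(appraisals, 1)})
    row.update({"act": traits["act"],   # scholastic aptitude
                "wat": traits["wat"],   # writing apprehension
                "art": traits["art"]})  # exposure to print
    return row
```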

Keystroke and Timing Features

The list of keystroke and timing features is presented in Table 1. These 12 features were reduced from a larger set of 19 based on tolerance analyses, which were used to identify multicollinearity among features.

Features were computed in 15-second intervals, with each interval culminating in an affect judgment (see above). The features can be subdivided into four types: relative timing features, keystroke verbosity features, keystroke timing features, and pausing behaviors. Relative timing features consist of the time of an interval relative to the start of the session or essay. Keystroke verbosity features measure the number of certain keystrokes, such as backspaces, or the overall number of keystrokes in an interval. Keystroke timing features consist of various descriptive statistics of the latency between keystrokes, such as the largest latency or the mean latency between keystrokes in an interval. Finally, pausing behavior features consist of the number of each type of pause within an interval.

Outliers were taken to be values greater than three standard deviations away from the mean and were excluded from the data used to build our models. This data set is referred to as the raw data set because no additional transformations were performed other than outlier treatment.

Feature             Description

Relative Timing
  Session Time      Elapsed time from start of session
  Essay Time        Elapsed time from start of essay

Keystroke Verbosity
  Verbosity         Number of keys within the interval
  Backspace Use     Number of backspaces within the interval

Keystroke Timing
  Largest Latency   Largest time difference between keystrokes within the interval
  Smallest Latency  Smallest time difference between keystrokes within the interval
  Mean Latency      Mean of all differences in time between keystrokes within the interval
  Median Latency    Median of all differences in time between keystrokes within the interval

Pausing Behaviors
  0.5 Second Pauses  Number of pauses above 0.5 seconds and below 1 second
  1 Second Pauses    Number of pauses above 1 second and below 1.5 seconds
  1.5 Second Pauses  Number of pauses above 1.5 seconds and below 2 seconds
  2 Second Pauses    Number of pauses above 2 seconds and below 3 seconds

Table 1. Keystroke and Timing Features
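The sketch below computes the Table 1 features for one 15-second interval from the keystroke log sketched earlier, together with the three-standard-deviation outlier rule described above. It is our own Python illustration under assumed names and data layout, not the authors' code.

```python
from statistics import mean, median, stdev

INTERVAL_MS = 15_000  # features are computed per 15-second interval

def interval_features(events, start_ms):
    """The 12 Table 1 features for the interval [start_ms, start_ms + 15 s),
    computed from the per-keystroke log sketched earlier."""
    window = [e for e in events
              if start_ms <= e["essay_ms"] < start_ms + INTERVAL_MS]
    lats = [e["latency_ms"] for e in window if e["latency_ms"] is not None]

    def pauses(lo, hi):  # count latencies in [lo, hi) seconds
        return sum(1 for l in lats if lo * 1000 <= l < hi * 1000)

    return {
        "session_time": window[0]["session_ms"] if window else None,
        "essay_time": start_ms,
        "verbosity": len(window),
        "backspace_use": sum(1 for e in window if e["key"] == "backspace"),
        "largest_latency": max(lats, default=None),
        "smallest_latency": min(lats, default=None),
        "mean_latency": mean(lats) if lats else None,
        "median_latency": median(lats) if lats else None,
        "pauses_0_5": pauses(0.5, 1.0),
        "pauses_1": pauses(1.0, 1.5),
        "pauses_1_5": pauses(1.5, 2.0),
        "pauses_2": pauses(2.0, 3.0),
    }

def drop_outliers(values):
    """Three-standard-deviation outlier rule described in the text."""
    if len(values) < 2:
        return values
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= 3 * s]
```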

Models

We created models with four different combinations of features: keystroke/timing features alone; keystroke/timing features with stable traits; keystroke/timing features with task appraisals; and all three types of features.


In addition to models on the raw features, we also created models with standardized features, models that were downsampled to the affect state with the lowest frequency (boredom), and downsampled models with standardized features. Downsampling was used to eliminate imbalance arising from uneven class labels, since models that are built on uneven data have a tendency to skew classification towards the majority class. Downsampling can improve classification rates in instances where there is an uneven distribution of class labels, as was the case for our data. Standardization (i.e., z-scores) was done by subject on only the keystroke/timing features, and was used to reduce between-subject variability. Standardization was performed after downsampling.
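A minimal sketch of both transformations, assuming the data sit in NumPy arrays with a parallel array of subject IDs (our own layout, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for the illustration

def downsample(X, y):
    """Randomly downsample every class to the size of the rarest one
    (boredom in the paper); the paper repeated this with ten random
    downsampled datasets and averaged the results."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

def standardize_by_subject(X, subjects):
    """Per-subject z-scores over the keystroke/timing columns, meant to
    reduce between-subject variability (performed after downsampling)."""
    Xz = X.astype(float)
    for s in np.unique(subjects):
        rows = subjects == s
        mu = Xz[rows].mean(axis=0)
        sd = Xz[rows].std(axis=0)
        sd[sd == 0] = 1.0  # avoid division by zero for constant features
        Xz[rows] = (Xz[rows] - mu) / sd
    return Xz
```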

In addition to the boredom-engagement-neutral discrimination, we also created three two-affect models: one distinguishing boredom from engagement, one distinguishing engagement from neutral, and one distinguishing boredom from neutral.

We used WEKA implementations [34] of nine different classifiers for supervised classification: J48, NaïveBayes, BayesNet, SMO, DecisionTable, OneR, RandomForest, RandomTree, and REPTree. For models with downsampling, we ran the tests with ten different randomly downsampled datasets and took the average values over the ten runs.

We validated our results using the leave-several-subjects-out method. Specifically, we split our data into two groups, with 66% of subjects in one group and the remaining 33% in the other. We then trained the models on the first 66% and tested on the other 33%. The leave-several-subjects-out method ensures that the models generalize to new subjects because each subject appears in either the training group or the testing group, but never both. By splitting our data by subject, we emulate the situation of encountering new subjects with a previously trained model. We did this a total of ten times, and took the average kappa values and accuracies for each classifier over those ten runs.
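The scheme can be sketched with scikit-learn's grouped splitting as a stand-in for the WEKA setup (our own analogue, not the authors' code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import GroupShuffleSplit

def leave_several_subjects_out(X, y, subjects, clf, n_runs=10):
    """Ten random 66/33 splits *by subject*, so no subject appears in
    both the training and the test fold; returns mean kappa and accuracy."""
    splitter = GroupShuffleSplit(n_splits=n_runs, test_size=0.33,
                                 random_state=0)
    kappas, accs = [], []
    for train_idx, test_idx in splitter.split(X, y, groups=subjects):
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        kappas.append(cohen_kappa_score(y[test_idx], pred))
        accs.append(accuracy_score(y[test_idx], pred))
    return np.mean(kappas), np.mean(accs)

# e.g., a rough analogue of WEKA's RandomForest:
# kappa, acc = leave_several_subjects_out(
#     X, y, subjects, RandomForestClassifier(n_estimators=100))
```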

RESULTS

A total of 5,760 models (4 affect discriminations × 4 data sets × 4 feature sets × 9 classifiers × 10 iterations per classifier) were estimated. The raw data performed better than, or almost as well as, the standardized and downsampled data. Because raw data is more practical for real-world use, we decided to focus on it; the results for the models on standardized and downsampled data are not discussed further.

The results for the classifier that yielded the best average performance across trials for a given combination of affect discrimination and feature set are shown in Table 2. The kappa value of the best feature set within each classification task is marked with an asterisk. We use kappas instead of accuracy to interpret the results, because kappas correct for random guessing, which is of concern due to an uneven distribution of priors across the three affective states.

Model                                                 Best Classifier   Kappa

Boredom-Engagement-Neutral
  Keystroke/Timing                                    DecisionTable     0.036
  Keystroke/Timing + Task Appraisals                  BayesNet          0.100
  Keystroke/Timing + Stable Traits                    NaiveBayes        0.116
  Keystroke/Timing + Task Appraisals + Stable Traits  BayesNet          0.171*

Boredom-Engagement
  Keystroke/Timing                                    NaiveBayes        0.021
  Keystroke/Timing + Task Appraisals                  NaiveBayes        0.270
  Keystroke/Timing + Stable Traits                    RandomForest      0.357
  Keystroke/Timing + Task Appraisals + Stable Traits  REPTree           0.374*

Engagement-Neutral
  Keystroke/Timing                                    NaiveBayes        0.105
  Keystroke/Timing + Task Appraisals                  SMO               0.087
  Keystroke/Timing + Stable Traits                    SMO               0.127
  Keystroke/Timing + Task Appraisals + Stable Traits  NaiveBayes        0.156*

Boredom-Neutral
  Keystroke/Timing                                    BayesNet          0.102
  Keystroke/Timing + Task Appraisals                  NaiveBayes        0.132*
  Keystroke/Timing + Stable Traits                    J48               0.121
  Keystroke/Timing + Task Appraisals + Stable Traits  BayesNet          0.115

Table 2. Kappa for the best classifier for each affect discrimination and feature set (* marks the best feature set within each discrimination)
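To see why kappa, rather than accuracy, is the right yardstick under skewed priors, consider this toy Python example (our own numbers, not the paper's):

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# With skewed priors, always guessing the majority class looks accurate
# but has kappa of 0, i.e., no agreement beyond chance.
y_true = ["engaged"] * 80 + ["bored"] * 20
y_pred = ["engaged"] * 100
print(accuracy_score(y_true, y_pred))     # 0.80
print(cohen_kappa_score(y_true, y_pred))  # 0.0
```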

The table shows that classification between boredom and engagement (kappa = .374; accuracy = 87.0%) was the easiest discrimination. The best model for the three-way boredom-engagement-neutral discrimination yielded a kappa of 0.171 with an accuracy of 56.3%, which is approximately half of the kappa of the boredom-engagement discrimination. This is presumably due to the difficulty associated with discriminating boredom and engagement from neutral (best kappas of 0.132 with an accuracy of 65.0% and 0.156 with an accuracy of 68.4%, respectively).

The effect of the different feature sets is apparent from the table as well. With the exception of the boredom-neutral discrimination, using all three types of features generally resulted in the best kappa values. The average kappa value for the full feature set across the four discriminations was 0.204. The next best feature set was the combination of keystroke/timing features and stable traits, with an average of 0.180. The third best was the combination of keystroke/timing features and task appraisals, with an average of 0.147. The feature set with only keystroke/timing features resulted in the lowest kappas, with an average of 0.066.

The results highlight several key points. First, the best model for discriminating boredom, engagement, and neutral resulted in a kappa value of 0.171, or approximately 17% above chance. This suggests that it is possible to automatically discriminate between boredom, engagement, and neutral by coupling keystroke and timing data with information on task appraisals and stable traits related to the writing task.

Second, the fact that the highest kappa was obtained for the boredom-engagement discrimination, while kappas for the discriminations involving neutral were comparable to each other and lower, suggests that the classifiers had more difficulty separating boredom and engagement from neutral than discriminating boredom from engagement. This is to be expected, since boredom and engagement are on opposite ends of a continuum with neutral in between.

Third, there was an interaction between feature set and affect discrimination. Specifically, the keystroke/timing features alone yielded very low kappas for the three-way classification (kappa = .036, accuracy = 54.4%). This is because they were not very useful in discriminating boredom from engagement (kappa = .021, accuracy = 82.0%), although they could discriminate both engagement and boredom from neutral (kappas of .105 with an accuracy of 68.6% and .102 with an accuracy of 66.0%, respectively). A combination of task appraisals and stable traits along with keystroke/timing features was needed to discriminate between the three states.

Fourth, the feature set with the keystroke/timing features and stable traits generally performed better than either the keystroke/timing features alone or the keystroke/timing features coupled with task appraisals. This indicates that the stable traits were more useful than the task appraisals for these discriminations.

GENERAL DISCUSSION

Scalable and accurate affect detection has been, and always will be, one of the predominant goals of affective computing. As noted in the Introduction, some affect detection methods currently in use have limitations, such as obtrusiveness and cost, which keystroke analysis overcomes. Due to the prevalence of typed entry, we hypothesized that keystroke analysis is an attractive alternative that can either replace or complement other affect detection methods. We tested this hypothesis by investigating how accurately we could discriminate among natural occurrences of boredom, engagement, and neutral during a writing task. In doing so, we expanded upon a relatively small body of previous research on keystroke-based affect detection by using keystroke/timing features in conjunction with stable traits and task appraisals to distinguish between boredom, engagement, and neutral. In this section, we take stock of our major findings, discuss applied implications, and consider limitations and potential areas for future work.

Major Findings

Our results were illuminating in a number of ways. First, we have shown that it is possible to discriminate between boredom, engagement, and neutral using keystroke analysis along with task appraisals and stable traits relevant to writing. Our work differed from previous research on keystroke-based affect detection (see Introduction) in terms of our focus on free text, tracking of non-basic emotions, finer-grained temporal resolution for affect detection, a feature set that couples keystrokes with task appraisals and stable traits, validation methods that generalize to new learners, and the fact that naturalistic experiences of affect were being tracked.

Second, standardizing and downsampling the data did not provide a significant increase in classification rate over raw data. This is an important advantage for real-world implementation because raw keystroke data is immediately available as input to the classifiers. Third, classification was improved by combining keystroke analysis with both task appraisals and relevant stable traits. Although this requires additional data collection (see below), it may be worth the improvement in detection. Fourth, we found that distinguishing between boredom and engagement was easier than distinguishing either boredom or engagement from neutral.

Finally, it is important to note that the models were built using purely free text keystroke analysis. Free text keystroke analysis is a more accurate representation of real-world situations than fixed text keystroke analysis, and as such is more appropriate for dynamic affect detection. Our results indicate that free text keystroke analysis may be a reasonable avenue for consideration in affect detection systems.

Application of Findings to Intelligent User Interfaces

The present findings are applicable to any user interface that uses a keyboard for input. The most immediate application is in systems that monitor affect during writing. This is an important application area because writing is increasingly considered a crucial 21st century skill: the ability to write proficiently is a critical component of success in many professional endeavors. The emphasis on writing can also be seen in the formal education system. Standardized tests in the U.S., such as the ACT, SAT, and GRE, have recently added sections that evaluate an applicant's writing proficiency. Unfortunately, despite the increased focus on writing in the education system, there is evidence to suggest that the average student lacks the necessary skills to effectively communicate ideas in written text. For example, according to a National Assessment of Educational Progress report, only 23% of 8th graders and 31% of 12th graders in the U.S. were considered to be "proficient" writers, and there had been no significant change since 2002 [23].

Considering the high stakes placed on writing competency, several educational applications have been developed to automatically score written essays and provide formative feedback on writing quality. Some of these include the Intelligent Essay Assessor, E-Rater, Summary Street, and Writing Pal [1, 15, 20, 33]. Unfortunately, these systems have focused on the cognitive aspects of writing while ignoring the affective ones. The present research can alleviate this problem by automatically detecting when engagement is waning and the student risks disengaging from the writing task due to boredom. An educational application can use this information to intervene accordingly, such as by changing topics when the student is bored. The affect detectors developed in this study could also be instrumental in furthering our understanding of the role of affect in developing writing proficiency by affording automated measurement of engagement and boredom, which are presumably correlated with writing outcomes.

The present focus on keystroke and timing analysis makes it quite easy to incorporate our models into intelligent writing-support applications that detect and respond to affect. The analysis of keystroke, timing, stable trait, and task appraisal features can all be implemented in software. Data for keystroke and timing features can be collected unobtrusively while users type on the keyboard. Stable traits can be collected with short questionnaires taking no longer than five minutes while a user creates a profile for the first time. Task appraisals can be collected with a brief series of questions before each writing task, which can also be done with minimal interruption. These data are collected before the task begins, so there is little intrusion on the writing process. Since collecting data for task appraisals and stable traits can be done quickly, it is feasible to incorporate these features into real-world applications.

Our research is focused on affect during writing, but its relevance extends beyond educational systems. Detecting affect might be beneficial in the workplace or on a personal computer as well. Continuous monitoring of affect through keystroke analysis could help create systems that soothe a user who is becoming frustrated with an aspect of their work or re-focus a user who is becoming bored. A personal computer could detect when a user is becoming sad and attempt to cheer them up. These situations differ somewhat from essay writing, but interaction with a keyboard invites the opportunity for unobtrusive affect detection regardless of the situation.

Limitations and Future Work

There are a few limitations to the methods we employed. First, the highest accuracy for the boredom, engagement, and neutral discrimination was 56.3%, which, although approximately 17% better than chance, is quite modest. That being said, we have some confidence in the generalizability of our models, because these results were obtained for previously unseen subjects through the leave-several-subjects-out cross-validation method. Second, we did not perform feature selection other than removing features that exhibited multicollinearity. Employing feature selection would reduce the number of features being used and might improve classification as well. Third, we only investigated affect in 15-second intervals because this is when humans provided affect judgments. Different interval lengths might be more or less appropriate for affect detection. Fourth, we investigated whether our models generalize to new learners through the leave-several-subjects-out method, but we did not assess the extent to which they generalize to new topics.

Future work could focus on several areas. It would be useful to investigate whether feature selection improves the performance of the classifiers. Instead of looking at 15-second intervals, it might be possible to analyze a continuous stream of keystrokes. Our models were built from data within the subject but independent of the essay; it might be informative to study trends within each essay as well. Whether these and other advances result in notable improvements in accuracy awaits further research and empirical testing.

ACKNOWLEDGEMENTS

Special thanks to Caitlin Mills for leading the data collection and to Rebekah Combs, Rosaire Daigle, Nia Dowell, Ally Dobbins, Melissa Gross, Blair Lehman, and Amber Strain for help with data collection. This research was supported by the National Science Foundation (NSF) (ITR 0325428, HCC 0834847, and DRL 1235958). Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the NSF.

REFERENCES

1. Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V.2. The Journal of Technology, Learning and Assessment, 4(3).

2. Bleha, S., Slivinsky, C., & Hussien, B. (1990). Computer-access security systems using keystroke dynamics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(12), 1217-1222.

3. Calvo, R. A., & D'Mello, S. K. (2010). Affect detection: An interdisciplinary review of models, methods, and their applications. IEEE Transactions on Affective Computing, 1(1), 18-37. doi: 10.1109/T-AFFC.2010.1

4. Cole, J. S., & Gonyea, R. M. (2010). Accuracy of self-reported SAT and ACT test scores: Implications for research. Research in Higher Education, 51(4), 305-319. doi: 10.1007/s11162-009-9160-9

5. Crawford, H. (2010). Keystroke dynamics: Characteristics and opportunities. In Proceedings of the Eighth Annual International Conference on Privacy, Security and Trust (PST) (pp. 205-212). IEEE.

6. Craig, S., D'Mello, S., Witherspoon, A., & Graesser, A. (2008). Emote aloud during learning with AutoTutor: Applying the facial action coding system to cognitive-affective states during learning. Cognition & Emotion, 22(5), 777-788.

7. Craig, S., Graesser, A., Sullins, J., & Gholson, J. (2004). Affect and learning: An exploratory look into the role of affect in learning. Journal of Educational Media, 29, 241-250.

8. Cunningham, A. E., & Stanovich, K. E. (1997). Early reading acquisition and its relation to reading experience and ability 10 years later. Developmental Psychology, 33(6), 934-945.

9. Daly, J., & Miller, M. (1975). The empirical development of an instrument to measure writing apprehension. Research in the Teaching of English, 9(3), 242-249.

10. D'Mello, S., & Graesser, A. (2011). The half-life of cognitive-affective states during complex learning. Cognition & Emotion, 25(7), 1299-1308.

11. D'Mello, S., & Graesser, A. (2010). Multimodal semi-automated affect detection from conversational cues, gross body language, and facial features. User Modeling and User-Adapted Interaction, 20(2), 147-187.

12. D'Mello, S., & Mills, C. (in review). Emotions during emotional and non-emotional writing.

13. Ekman, P. (1992). An argument for basic emotions. Cognition & Emotion, 6(3-4), 169-200.

14. Epp, C., Lippold, M., & Mandryk, R. L. (2011). Identifying emotional states using keystroke dynamics. In Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems (pp. 715-724). ACM.

15. Foltz, P. W., Laham, D., & Landauer, T. K. (1999). The intelligent essay assessor: Applications to educational technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1(2).

16. Gunetti, D., & Picardi, C. (2005). Keystroke analysis of free text. ACM Transactions on Information and System Security (TISSEC), 8(3), 312-347.

17. Joyce, R., & Gupta, G. (1990). Identity authentication based on keystroke latencies. Communications of the ACM, 33(2), 168-176.

18. Forbes-Riley, K., & Litman, D. J. (2011). Benefits and challenges of real-time uncertainty detection and adaptation in a spoken dialogue computer tutor. Speech Communication, 53(9-10), 1115-1136. doi: 10.1016/j.specom.2011.02.006

19. Khanna, P., & Sasikumar, M. (2010). Recognising emotions from keyboard stroke pattern. International Journal of Computer Applications, 11(9), 24-28.

20. McNamara, D. S., Raine, R., Roscoe, R., Crossley, S., Jackson, G. T., Dai, J., & Graesser, A. C. (2012). The Writing-Pal: Natural language algorithms to support intelligent tutoring on writing strategies. In Applied natural language processing and content analysis: Identification, investigation, and resolution. Hershey, PA: IGI Global.

21. Metallinou, A., Wollmer, M., Katsamanis, A., Eyben, F., Schuller, B., & Narayanan, S. (in press). Context-sensitive learning for enhanced audiovisual emotion classification. IEEE Transactions on Affective Computing.

22. Mills, C., & D'Mello, S. K. (2012). Emotions during writing on topics that align or misalign with personal beliefs. In S. Cerri, W. Clancey, G. Papadourakis & K. Panourgia (Eds.), Proceedings of the 11th International Conference on Intelligent Tutoring Systems (pp. 638-639). Berlin Heidelberg: Springer-Verlag.

23. NAEP. (2007). The Nation's Report Card: Writing 2007.

24. Nijholt, A., & Tan, D. S. (2010). Brain-Computer Interfaces: Applying Our Minds to Human-Computer Interaction. Springer.

25. Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2), 1-135.

26. Pantic, M., & Patras, I. (2006). Dynamics of facial expression: Recognition of facial actions and their temporal segments from face profile image sequences. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 36(2), 433-449. doi: 10.1109/tsmcb.2005.859075

27. Pantic, M., & Rothkrantz, L. (2003). Toward an affect-sensitive multimodal human-computer interaction. Proceedings of the IEEE, 91(9), 1370-1390. doi: 10.1109/jproc.2003.817122

28. Picard, R. (1997). Affective Computing. Cambridge, MA: MIT Press.

29. Picard, R. (2010). Affective Computing: From laughter to IEEE. IEEE Transactions on Affective Computing, 1(1), 11-17.

30. Rosenberg, E., & Ekman, P. (1994). Coherence between expressive and experiential systems in emotion. Cognition & Emotion, 8(3), 201-229.

31. Valstar, M. F., Mehu, M., Jiang, B., Pantic, M., & Scherer, K. (in press). Meta-analysis of the first facial expression recognition challenge. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics.

32. Vizer, L. M., Zhou, L., & Sears, A. (2009). Automated stress detection using keystroke and linguistic features: An exploratory study. International Journal of Human-Computer Studies, 67(10), 870-886.

33. Wade-Stein, D., & Kintsch, E. (2004). Summary Street: Interactive computer support for writing. Cognition and Instruction, 22(3), 333-362.

34. Witten, I. H., Frank, E., Trigg, L., Hall, M., Holmes, G., & Cunningham, S. J. (1999). Weka: Practical machine learning tools and techniques with Java implementations. Hamilton, New Zealand: University of Waikato, Department of Computer Science.

35. Zeng, Z., Pantic, M., Roisman, G., & Huang, T. (2009). A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(1), 39-58.
