Detecting Boredom and Engagement During Writing with Keystroke Analysis, Task Appraisals, and Stable Traits
Robert Bixler
University of Notre Dame
384 Fitzpatrick Hall
Notre Dame, IN 46556
Sidney D’Mello
University of Notre Dame
384 Fitzpatrick Hall
Notre Dame, IN 46556
ABSTRACT
It is hypothesized that the ability for a system to
automatically detect and respond to users’ affective states
can greatly enhance the human-computer interaction
experience. Although there are currently many options for
affect detection, keystroke analysis offers several attractive
advantages over traditional methods. In this paper, we
consider the possibility of automatically discriminating
between natural occurrences of boredom, engagement, and
neutral by analyzing keystrokes, task appraisals, and stable
traits of 44 individuals engaged in a writing task. The
analyses explored several different arrangements of the
data: using downsampled and/or standardized data;
distinguishing between three different affect states or
groups of two; and using keystroke/timing features in
isolation or coupled with stable traits and/or task appraisals.
The results indicated that the raw data, together with the
feature set that combined keystroke/timing features with task
appraisals and stable traits, yielded accuracies that were
11% to 38% above random guessing and generalized to
new individuals. Applications of our affect detector for
intelligent interfaces that provide engagement support
during writing are discussed.
Author Keywords
Affect detection; keystroke dynamics; boredom;
engagement; free text.
ACM Classification Keywords
Categories and subject descriptors: H.5.m [Information
Interfaces and Presentation]: Miscellaneous
INTRODUCTION
One of the primary goals of the field of Affective
Computing is to develop intelligent systems that can detect
and respond to users’ affective states [28, 29]. In general,
affective systems can be subdivided into three categories:
systems that detect affect, systems that express affect, and
systems that experience affect (i.e., artificial emotions).
Although each type of system has far-reaching applications
in a myriad of research domains, detecting affect is an
integral component in designing applications that
intelligently communicate with users and assist them in
performing their tasks.
The Affective Computing community has recognized the
importance of detecting affect and a wide array of methods
have been developed [see 3, 27, and 35 for reviews]. The
preeminent modality associated with affect is that of the
face, presumably due to the well-known links between
affect and facial expressions [13, 26, 31]. A second
modality that has received considerable attention is speech
and it has been shown that paralinguistic features of speech
can serve as an important index into affect [18, 21]. Text
analysis is a third type of modality that has been heavily
researched as evidenced by the nascent field of sentiment
analysis [25]. The fourth modality includes physiological
signals, such as the electrical conductivity of the skin,
electrical activity of the heart, and activation of muscles in
the face. These measures of peripheral physiology can be
complemented by measures of central physiology such as
fMRI and EEG [24]. Other measures that have received
somewhat less attention include eye gaze, postures, and
gestures [3].
There are some important factors that must be considered
when evaluating the usefulness of any particular modality
for affect detection. These are: 1) validity of the signal as
being diagnostic of affect, 2) reliability of the signal in a
real-world environment, 3) time resolution of the signal,
and 4) cost of implementation and intrusiveness of the
sensors. Although significant advances in the development
of functional affect detection systems have been made over
the last decade, most of the current systems fail to achieve
one or more of these desirable features.
Taking a somewhat different approach from the common
modalities discussed above, the present paper focuses on
the use of keystroke analysis to develop fully automated
affect detectors while individuals perform a writing task
with a computer interface. The working hypothesis is that
an analysis of users’ keystrokes can yield information that
signals their affective states. The focus is not on what
specific content is generated, but on how the content is
generated.
IUI’13, March 19–22, 2013, Santa Monica, CA, USA.
Copyright © 2013 ACM 978-1-4503-1965-2/13/03...$15.00.
Session: Emotion and User Modeling IUI'13, March 19–22, 2013, Santa Monica, CA, USA
There are many attractive aspects of keystroke analysis for
affect detection. Keystroke analysis requires no extra
hardware beyond the already present computer and
keyboard. All that is required is software to record and
analyze the keystrokes, so the method is unobtrusive, cost
effective, and scalable. Keystrokes are transferred directly
from the keyboard to the data collection software, resulting
in a reliable signal with negligible interference. Because
each keystroke event occurs within a short time frame of
the previous keystroke, enough data can be gathered to
detect and respond to affect in a timely manner. The only
unknown factor is how accurately keystroke analysis
predicts affect, which is the focus of the present work. We
begin by briefly reviewing some of the research on
keystroke analysis as an affect detection channel followed
by an overview of the present study.
Previous Work on Keystroke Analysis and Affect
Although not directly relevant to affect detection, the
largest body of work on keystroke analysis is in research on
authentication systems. These systems aim to identify
individuals based on their unique typing patterns. The main
distinction among keystroke-based authentication
systems is between fixed text and free text. Most attempts at
keystroke authentication involve developing a profile while
users type a pre-determined text [2, 17]. This profile is then
compared to any subsequent attempts to log in as that user.
Free text analysis consists of logging the keystrokes during
the user’s session without any constraints on what the user
can type, and comparing similarities between the
authentication profile and patterns observed during typing
[16]. Considerably less research has been done in this area
as fixed text analysis has a much lower error rate – a
measurement that is paramount in authentication systems
[5].
There have been a few studies that have used keystroke
analysis for affect detection. Vizer et al. [32] explored the
possibility of detecting users’ stress by continuously
monitoring keyboard interactions. Their features consisted
of timing, keystroke, and linguistic information. They tested
their models with both raw and standardized data in order to
minimize subject variability. They reported that their
classification rates using standardized data were consistent
with those achieved by other researchers using different
methods such as facial recognition, speech analysis, and a
pressure-sensing mouse.
Khanna and Sasikumar used a suite of classifiers in WEKA
to distinguish between positive, negative, and neutral affect
based on fixed text [19]. Their data was collected while
users re-typed text from chosen paragraphs. After copying
the paragraphs, users were instructed to indicate their
affective state during the typing task. The classifiers that
performed best were BF Trees and J48, with correct
classification rates of 89.0% and 84.1% for detecting
negative affect from neutral affect and correct classification
rates of 86.4% and 88.9% for detecting positive affect from
neutral affect.
A more recent study by Epp et al. [14] collected both fixed
and free text and analyzed models for six different affective
states. During their experiment they collected 10 minutes of
free text prior to a questionnaire that asked users which
affect they were experiencing. Following the questionnaire
was a passage from Alice in Wonderland which the user
was instructed to type as a sample of fixed text. They then
created models based on either the fixed or free text
samples. Their best performing models all used fixed text
and achieved accuracies of 77.4%-87.8% with kappas
ranging between .55 and .75.
Overview of Current Study
Keystrokes are never generated in a vacuum, but are always
coupled to an underlying activity. The target activity in the
present study was a writing activity that involved the
composition of essays on a variety of different topics.
Writing is an activity that lends itself to keystroke analysis,
because the purpose of writing is to generate content, which
involves keystrokes in most cases. There is also evidence
which suggests that affective states might be tied to both the
topics and processes of writing, so it is important to
understand the affective states that arise during writing and
how they influence writing outcomes [12, 22]. In line with
this, the purpose of this paper is to develop automated
methods to monitor the moment-to-moment affective states
that individuals experience while writing about topics that
vary in emotional intensity. The hope is that the automated
methods can be integrated into intelligent interfaces that can
help students develop writing proficiency by detecting and
responding to their affective states (this is discussed in
more detail in the General Discussion section).
Our proposed approach differs from the previous work on
keystroke analysis reviewed above in several key ways.
First, we attempt to detect affective states that naturally
occur in the context of an ongoing activity, rather than
inducing affect or tracking affect before or after writing.
Second, we attempt to classify affect in 15 second intervals,
which is a considerably shorter time frame compared to
other free text methods. Third, the granularity of our affect
classification is finer; instead of either distinguishing
between positive and negative affect or detecting a
particular affective state, we are distinguishing between
either two or three non-basic affective states as discussed in
the next section. Fourth, our models include features, such
as number of pauses, which appear in very few studies on
keystroke analysis. Fifth, in addition to the keystrokes
themselves, our models also include information at two
additional levels of granularity. These include the writer’s
appraisals or impressions of the writing task prior to writing
(task appraisals) and relevant measures of writer’s
individual traits (stable traits). Finally, unlike the majority
of research on keystroke analysis which uses fixed text
entry methods, we pursue free text analysis because it offers
a glimpse into the mechanics of writing, which we
hypothesize conveys affect.
We begin by describing the data set used to train
the affect detection models. Next, the
feature engineering, selection, and model building phases
are discussed. This is followed by the results of a cross-
validation analysis. Finally, we present an overview of our major
findings and discuss limitations, possible resolutions,
and potential applications.
DATA COLLECTION
Participants
The participants were 44 U.S. undergraduate students (68%
female; mean age of 19.9 years; 45% Caucasians, 52%
African Americans, and 3% “Other”) who participated for
course credit.
Essay Topics
The study involved relatively fine-grained tracking of
affective states while participants wrote essays on three
different topics: (a) academic topics which were obtained
from the writing portion of the American College Testing
(ACT) test, (b) socially charged issues like abortion and the
death penalty, and (c) personal emotional experiences such
as recent angry or happy experiences. These three topics
were selected because they reflect a continuum with respect
to their expected emotional impact, ranging from the
affectively neutral academic topics to the affectively
charged personal emotional experiences, with the socially
charged topic lying in between these two extremes.
In order to increase interest in the writing activity,
participants were allowed to choose one subtopic from a list
of options under each topic. Academic topics were adapted
from the writing portion of the ACT and subtopics included
“time spent in high school”, “the use of class discussion”,
and “social skills being taught in schools”. Socially charged
subtopics included “abortion”, “gays in the military”, and
“the death penalty”. Subtopics for the personal emotional
experiences condition involved writing about one of six
basic affective states (anger, disgust, fear, happiness,
sadness, and surprise).
Stable Trait Measures (Individual Differences)
In order to account for between-subject differences, we
collected stable traits for each subject. The trait measures
we used were scholastic aptitude, writing apprehension,
and exposure to print. Participants’ self-reported ACT
scores were used as the measure of scholastic aptitude. Self-
reported ACT scores have been found to correlate with
actual test scores [4], so we have some confidence in this
measure.
Writing apprehension was measured with the validated 26-
item Writing Apprehension Test (WAT) [9]. The WAT is a
self-report inventory where participants respond to a series
of statements (e.g., “I avoid writing”, “I'm nervous about
writing”) via a 5-point Likert scale. As recommended by
Daly and Miller [9], the test was modified for use outside of
the classroom by removing six items pertaining to
classroom composition activities. Scores on the WAT can
range from 20 to 100, with lower scores indicating more
writing apprehension.
Exposure to print was measured with the Author
Recognition Test [8]. Participants were presented with a list
of names of 42 popular authors (e.g., J. R. R. Tolkien, Dean
Koontz) and they had to place a check mark by each author
they recognized. The version of the measure we
implemented did not have any distracter items; hence,
scores were computed as the proportion of authors that
were correctly recognized.
Task Appraisal Measures
Participants completed a Task Appraisal Questionnaire
prior to writing each essay. The questionnaire used a six-point
scale to collect information on (1) participants’
interest in the topic, (2) how often they discussed the topic,
(3) how comfortable they were with the topic (important for
socially charged issues), (4) how confident they were in
their ability to write a good essay, (5) how much effort they
would put into writing, (6) their perceived enjoyment while
writing, and (7) how personally connected they were to the
topic.
Procedure
Essay Writing
Participants were given 10 minutes to write an essay on
each of the three topics. The order of topics was
counterbalanced across participants with a 3 × 3 Latin
Square. For each topic, they were asked to select one of the
subtopics listed above. They typed their essays on a
computer interface. Each keystroke was logged along with
timestamps relative to the start of the session and the start
of the essay, and the number of milliseconds since the previous
keystroke. Videos of participants’ faces and computer
screens were also recorded.
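As an illustration of the counterbalancing step described above, the following minimal sketch builds a cyclic 3 × 3 Latin square over the three topics. The topic labels, the function names, and the rule that assigns rows to participants in rotation are assumptions made for the example; the source does not specify which Latin square or assignment rule was used.

```python
# Hypothetical topic labels; the study used academic, socially charged,
# and personal emotional experience topics.
TOPICS = ["academic", "socially charged", "personal emotional"]


def latin_square_orders(items):
    """Return a cyclic Latin square: row i is `items` rotated left by i."""
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]


def order_for_participant(participant_index, items=TOPICS):
    """Assign rows of the square to participants in rotation (an assumption)."""
    square = latin_square_orders(items)
    return square[participant_index % len(square)]


if __name__ == "__main__":
    for row in latin_square_orders(TOPICS):
        print(row)
```

Each topic appears exactly once in every row and every column, so each topic occupies each serial position equally often across the rows of the square.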
Retrospective Affect Judgment
Participants provided self-judgments of their affective states
immediately after the writing session. Similar to a cued-
recall procedure [10, 30], the judgments for a participant’s
writing session began by playing a video of the face along
with the screen capture video on a widescreen monitor. The
screen capture included the writing prompt and dynamically
presented the text as it was written, thereby providing the
context of the writing session. Participants were instructed
to make judgments on what affective states were present at
any moment during the writing session by manually
pausing the videos. They were also instructed to make
judgments at each 15-second interval. The fifteen affective
states available to select were: anger, contempt, disgust,
fear, happiness, sadness, surprise, boredom, confusion,
delight, engagement, frustration, anxiety, curiosity, and
finally, neutral.
It is important to mention three points pertaining to the
present affect judgment methodology. First, offline judgments
were used because they allow monitoring of participants’
affective states at multiple points, with minimal task
interference, and without participants knowing that these
states were being monitored during completion of the
learning task. Second, this retrospective affect-judgment
method has been previously validated [30], and analyses
comparing these offline affect judgments with online
measures encompassing self-reports and observations by
judges have produced similar distributions of emotions
[6,7]. Third, the offline affect annotations obtained via this
retrospective protocol correlate with online recordings of
facial activity and gross body movements in expected
directions [11]. Although no method is without its
limitations, the present method appears to be a viable
approach to track emotions at a relatively fine-grained
temporal resolution.
Affect Distributions
The retrospective affect judgment protocol yielded 5,551
affect judgments across the 44 participants. The distribution
of affective states supports two major conclusions. First,
the fourteen affective states cumulatively accounted for
78.9% of the observations, while neutral was reported for
the remaining 21.1% of the observations. Second, only six
out of the 14 affective states (excluding neutral) occurred
with some regularity. The most frequent affective state was
engagement with an occurrence rate of 35.4% followed by
boredom at 16.4%; these two states comprised
approximately half of the observations (51.8%). We chose
to focus on engagement, boredom, and neutral in this study
because they comprised 72.9% of the observations and the
remaining affective states were either observed at very low
frequencies or inconsistently observed across participants.
MODEL BUILDING
Three types of features were used for the present analysis:
keystroke/timing features, task appraisals, and stable traits.
ACT scores, WAT scores, and exposure to print were used
as stable traits among all subjects. The seven pre-essay
questions were used as task appraisals for each essay. These
two sets of features have been described in the previous
section, so we focus on keystroke/timing features here.
Keystroke and Timing Features
The list of keystroke and timing features is presented in
Table 1. These 12 features were reduced from a larger set of
19 based on tolerance analyses, which were used to identify
multicollinearity among features.
Features were computed in 15 second intervals, with each
interval culminating in an affect judgment (see above). The
features can be subdivided into four types: relative timing
features, keystroke verbosity features, keystroke timing
features, and pausing behaviors. Relative timing features
consist of the time of an interval relative to the start of the
session or essay. Keystroke verbosity features consist of
features measuring the number of certain keystrokes such as
backspaces or the overall number of keystrokes in an
interval. Keystroke timing features consist of various
descriptive statistics of the latency between keystrokes,
such as the largest latency or the mean latency between
keystrokes in an interval. Finally, pausing behavior features
consist of the number of each type of pause within an
interval.
Outliers were taken to be values more than three standard
deviations away from the mean and were excluded from
the data used to build our models. This data set is
referred to as the raw data set because no additional
transformations were performed other than outlier
treatment.
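The outlier rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used in the study: the function names are hypothetical, the sample standard deviation is used, the rule is applied per feature over all intervals, and an interval is dropped if any one of its features is flagged.

```python
from statistics import mean, stdev


def outlier_mask(values, n_sd=3.0):
    """Return True where a value lies more than n_sd sample SDs from the mean."""
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        # A constant feature has no outliers.
        return [False] * len(values)
    return [abs(v - mu) > n_sd * sd for v in values]


def drop_outlier_rows(rows, n_sd=3.0):
    """Drop every row (feature vector) flagged as an outlier on any feature.

    Assumption: the source does not say whether a single outlying value
    removes the whole interval; this sketch removes it.
    """
    if len(rows) < 2:
        return list(rows)
    flagged = [False] * len(rows)
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        for i, is_outlier in enumerate(outlier_mask(column, n_sd)):
            flagged[i] = flagged[i] or is_outlier
    return [row for row, bad in zip(rows, flagged) if not bad]
```

For example, `drop_outlier_rows([[10.0, 1.0]] * 20 + [[1000.0, 1.0]])` removes only the last row, whose first feature lies far outside three standard deviations of the column mean.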
Feature                Description
---------------------  ----------------------------------------------
Relative Timing
  Session Time         Elapsed time from start of session
  Essay Time           Elapsed time from start of essay
Keystroke Verbosity
  Verbosity            Number of keys within the interval
  Backspace Use        Number of backspaces within the interval
Keystroke Timing
  Largest Latency      Largest time difference between keystrokes
                       within the interval
  Smallest Latency     Smallest time difference between keystrokes
                       within the interval
  Mean Latency         The mean of all the differences in time
                       between keystrokes within the interval
  Median Latency       The median of all the differences in time
                       between keystrokes within the interval
Pausing Behaviors
  0.5 Second Pauses    The number of pauses above 0.5 seconds
                       and below 1 second
  1 Second Pauses      The number of pauses above 1 second
                       and below 1.5 seconds
  1.5 Second Pauses    The number of pauses above 1.5 seconds
                       and below 2 seconds
  2 Second Pauses      The number of pauses above 2 seconds
                       and below 3 seconds
Table 1. Keystroke and Timing Features
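To make the feature definitions in Table 1 concrete, the following minimal sketch computes the 12 features for one 15-second interval. Several details are assumptions for illustration rather than the implementation used in the study: the function and key names, the `"BACKSPACE"` key label, timestamps expressed in seconds, latencies taken only between consecutive keystrokes inside the interval, half-open pause bins (lower bound included, upper bound excluded), and `None` for latency statistics when an interval has fewer than two keystrokes.

```python
from statistics import mean, median

# Pause bins in seconds, following Table 1. Treating each bin as
# [low, high) is an assumption; the table only says "above" and "below".
PAUSE_BINS = {
    "pauses_0_5s": (0.5, 1.0),
    "pauses_1s": (1.0, 1.5),
    "pauses_1_5s": (1.5, 2.0),
    "pauses_2s": (2.0, 3.0),
}


def interval_features(events, session_time, essay_time):
    """Compute the Table 1 features for one interval.

    events: list of (timestamp_seconds, key) tuples sorted by time.
    session_time, essay_time: elapsed seconds for this interval,
    measured from the start of the session and of the essay.
    """
    times = [t for t, _ in events]
    latencies = [b - a for a, b in zip(times, times[1:])]
    features = {
        "session_time": session_time,
        "essay_time": essay_time,
        "verbosity": len(events),
        "backspace_use": sum(1 for _, key in events if key == "BACKSPACE"),
        "largest_latency": max(latencies) if latencies else None,
        "smallest_latency": min(latencies) if latencies else None,
        "mean_latency": mean(latencies) if latencies else None,
        "median_latency": median(latencies) if latencies else None,
    }
    for name, (low, high) in PAUSE_BINS.items():
        features[name] = sum(1 for lat in latencies if low <= lat < high)
    return features
```

A call such as `interval_features([(0.0, "h"), (0.25, "i"), (1.0, "BACKSPACE")], 15.0, 15.0)` yields a verbosity of 3, one backspace, and one pause in the 0.5 second bin.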
Models
We created models with four different combinations of
features: keystroke/timing features alone; keystroke/timing
features with stable traits; keystroke/timing features with
task appraisals; and all three types of features.
In addition to models on the raw features, we also created
models with standardized features, models that were
downsampled to the affect state with lowest frequency
(boredom), and downsampled models with standardized
features. Downsampling was used to eliminate imbalance
arising from uneven class labels since models that are built
on uneven data have a tendency to skew classification
towards the majority class. Downsampling can improve
classification rates in instances where there is an uneven
distribution of class labels, as was the case for our data.
Standardization (i.e., z-scores) was done by subject on only
the keystroke/timing features, and was used to reduce
between-subject variability. Standardization was performed
after downsampling.
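The two transformations can be sketched as follows. This is a minimal illustration and not the code used in the study: the row layout (dictionaries with `subject` and `label` keys), the function names, the use of the population standard deviation, and the fixed random seed are all assumptions.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def downsample(rows, label_key="label", seed=0):
    """Randomly keep as many rows per class as the rarest class has."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    n_min = min(len(group) for group in by_label.values())
    sampled = []
    for group in by_label.values():
        sampled.extend(rng.sample(group, n_min))
    return sampled


def zscore_by_subject(rows, feature_names):
    """Return copies of rows with the named features z-scored within subject."""
    by_subject = defaultdict(list)
    for row in rows:
        by_subject[row["subject"]].append(row)
    stats = {}
    for subject, subject_rows in by_subject.items():
        for name in feature_names:
            values = [r[name] for r in subject_rows]
            stats[(subject, name)] = (mean(values), pstdev(values))
    standardized = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for name in feature_names:
            mu, sd = stats[(row["subject"], name)]
            new_row[name] = 0.0 if sd == 0 else (row[name] - mu) / sd
        standardized.append(new_row)
    return standardized
```

Following the order stated above, the combined data set would be obtained with `zscore_by_subject(downsample(rows), keystroke_feature_names)`, where only the keystroke/timing feature names are passed so that task appraisals and stable traits are left unchanged.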
In addition to the boredom, engagement, neutral
discrimination, we also created three two-affect models.
This resulted in a model distinguishing boredom from
engagement, one distinguishing engagement from neutral,
and one distinguishing boredom from neutral.
We used WEKA implementations [34] of nine different
classifiers for supervised classification: J48, NaïveBayes,
BayesNet, SMO, DecisionTable, OneR, RandomForest,
RandomTree, and REPTree. For models with
downsampling, we ran the tests with ten different randomly
downsampled datasets and took the average values over the
ten runs.
We validated our results using the leave-several-subjects-out
method. Specifically, we split our data into two groups,
with 66% of subjects in one group and the remaining 33%
in the other. We then trained the models on the first 66%
and tested on the other 33%. The leave-several-subjects-out
method tests whether the models generalize to new subjects
because each subject is in either the training or the
testing group, but never both. By splitting our data by
subject, we emulate the situation of encountering new
subjects with a previously trained model, ensuring
generalization. We did this a total of ten times, and took the
average kappa values and accuracies for each classifier over
those ten runs.
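A subject-level split of this kind can be sketched as follows. The function names, the seeding scheme, and the rounding of the training-group size are assumptions for illustration; the classifiers themselves were WEKA implementations and are not reproduced here.

```python
import random


def subject_split(subjects, train_fraction=0.66, seed=0):
    """Split unique subject IDs into disjoint training and testing sets."""
    unique = sorted(set(subjects))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_train = round(len(unique) * train_fraction)
    return set(unique[:n_train]), set(unique[n_train:])


def repeated_subject_splits(subjects, n_repeats=10, train_fraction=0.66, seed=0):
    """Draw several independent subject-level splits (ten in the study)."""
    return [
        subject_split(subjects, train_fraction, seed + i)
        for i in range(n_repeats)
    ]


def rows_for(rows, subject_ids):
    """Select the rows whose 'subject' field belongs to the given set."""
    return [row for row in rows if row["subject"] in subject_ids]
```

With 44 subjects and a training fraction of 0.66, each split places 29 subjects in the training group and 15 in the testing group. Kappa and accuracy would then be averaged over the repeated splits.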
RESULTS
A total of 5,760 models (4 affect discriminations × 4 data
sets × 4 feature sets × 9 classifiers × 10 iterations per classifier)
were estimated. The raw data performed better than or almost as
well as the standardized and downsampled data. Because raw
data is more practical for real-world use, we
focus on this data, and the results for the models
on standardized and downsampled data are not discussed
further.
Model                                                 Best Classifier  Kappa
----------------------------------------------------  ---------------  -----
Boredom-Engagement-Neutral
  Keystroke/Timing                                    DecisionTable    0.036
  Keystroke/Timing + Task Appraisals                  BayesNet         0.100
  Keystroke/Timing + Stable Traits                    NaiveBayes       0.116
  Keystroke/Timing + Task Appraisals + Stable Traits  BayesNet         0.171
Boredom - Engagement
  Keystroke/Timing                                    NaiveBayes       0.021
  Keystroke/Timing + Task Appraisals                  NaiveBayes       0.270
  Keystroke/Timing + Stable Traits                    RandomForest     0.357
  Keystroke/Timing + Task Appraisals + Stable Traits  REPTree          0.374
Engagement - Neutral
  Keystroke/Timing                                    NaiveBayes       0.105
  Keystroke/Timing + Task Appraisals                  SMO              0.087
  Keystroke/Timing + Stable Traits                    SMO              0.127
  Keystroke/Timing + Task Appraisals + Stable Traits  NaiveBayes       0.156
Boredom - Neutral
  Keystroke/Timing                                    BayesNet         0.102
  Keystroke/Timing + Task Appraisals                  NaiveBayes       0.132
  Keystroke/Timing + Stable Traits                    J48              0.121
  Keystroke/Timing + Task Appraisals + Stable Traits  BayesNet         0.115
Table 2. Kappa and accuracy for each data and feature set
The results for the classifier that yielded the best average
performance across trials for a given combination of affect
discrimination and feature set are shown in Table 2. The
kappa values of the best feature set within each
classification task have been bolded. We use kappas instead
of accuracy to interpret the results, because kappas correct
for random guessing, which is of concern due to an uneven
distribution of priors of the three affective states.
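To illustrate why kappa is preferred over accuracy when class priors are uneven, the following minimal sketch computes Cohen's kappa, (observed agreement − chance agreement) / (1 − chance agreement), from a confusion matrix. The function name and the example matrices are hypothetical and are not taken from our data; the reported kappas were produced by WEKA.

```python
def cohens_kappa(confusion):
    """Cohen's kappa for a square confusion matrix.

    confusion[i][j] is the number of instances whose true class is i
    and whose predicted class is j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(n_classes)) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [
        sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)
    ]
    # Agreement expected by chance given the marginal distributions.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    if expected == 1:
        return 0.0
    return (observed - expected) / (1 - expected)


# Hypothetical example: always predicting the majority class on an
# 80/20 split gives 80% accuracy but a kappa of zero.
majority_only = [[80, 0], [20, 0]]
print(cohens_kappa(majority_only))
```

A classifier that only echoes the majority class can therefore look accurate while carrying no information beyond the priors, which kappa makes explicit.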
From the table it is easy to see that classification between
boredom and engagement (kappa = .374; accuracy =
87.0%) was the easiest to perform. The best model for the
three-way boredom-engagement-neutral discrimination
yielded a kappa of 0.171 with an accuracy of 56.3%, which
is approximately half of the kappa of the boredom-
engagement discrimination. This is presumably due to
difficulty associated with discriminating boredom and
engagement from neutral (best kappas of 0.132 with an
accuracy of 65.0% and 0.156 with an accuracy of 68.4%,
respectively).
The effect of the different feature sets is apparent from the
table, as well. With the exception of the boredom and
neutral discrimination, in general using all three types of
features resulted in the best kappa values. The average
kappa value for the feature set with all three feature types across
the four discriminations was 0.204. The next best feature
set was the combination of keystroke/timing features and
stable traits, which resulted in an average of 0.180. The
third best feature set was the set with both task appraisals
and keystroke/timing features, with an average of 0.147.
The feature set with only keystroke/timing features resulted
in the lowest kappas, with an average of 0.066.
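These averages can be reproduced from the best kappas listed in Table 2. The short sketch below recomputes them; the dictionary keys and the function name are illustrative only, and the values are listed in the order three-way, boredom-engagement, engagement-neutral, boredom-neutral.

```python
# Best kappa per discrimination for each feature set, copied from Table 2.
KAPPAS = {
    "keystroke/timing": [0.036, 0.021, 0.105, 0.102],
    "keystroke/timing + task appraisals": [0.100, 0.270, 0.087, 0.132],
    "keystroke/timing + stable traits": [0.116, 0.357, 0.127, 0.121],
    "all three feature types": [0.171, 0.374, 0.156, 0.115],
}


def average_kappas(kappas):
    """Mean kappa across the four discriminations for each feature set."""
    return {name: sum(values) / len(values) for name, values in kappas.items()}


if __name__ == "__main__":
    for name, value in average_kappas(KAPPAS).items():
        print(f"{name}: {value:.3f}")
```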
The results highlight several key points. First, the best
model for discriminating boredom, engagement, and neutral
resulted in a kappa value of 0.171, or approximately 17%
above chance. This suggests that it is possible to
automatically discriminate between boredom, engagement,
and neutral by coupling keystrokes and timing data with
information on task appraisals and stable traits related to the
writing task.
Second, the fact that the highest kappa was between
boredom and engagement, while kappas for the other three
discriminations were on par with one another and lower, suggests that
the classifiers had more difficulty separating boredom
and engagement from neutral than discriminating
boredom from engagement. This is to be expected
since boredom and engagement are on two opposite sides of
a continuum with neutral in between.
Third, there was an interaction between feature set and
affect discrimination. Specifically, the keystroke/timing
features alone yielded very low kappas for the three-way
classification (kappa = .036, accuracy = 54.4%). This
is because they were not very useful in discriminating
boredom from engagement (kappa = .021, accuracy =
82.0%), although they could discriminate both engagement and
boredom from neutral (kappas of .105 with an accuracy
of 68.6% and .102 with an accuracy of 66.0%,
respectively). A combination of task appraisals and stable
traits along with keystroke/timing features was needed to
discriminate between the three states.
Fourth, the feature set with the keystroke/timing features
and stable traits generally performed better than either the
keystroke/timing features alone or keystroke/timing
features coupled with task appraisals. This indicates that the
stable traits were more useful than the task appraisals for
the discriminations.
GENERAL DISCUSSION
Scalable and accurate affect detection has been and always will
be one of the predominant goals of affective computing. As
noted in the Introduction, some affect detection methods
currently in use have limitations, such as obtrusiveness and
cost, which keystroke analysis overcomes. Due to the
prevalence of typed entry, we hypothesized that keystroke
analysis is an attractive alternative to either replace or
complement other affect detection methods. We tested this
hypothesis by investigating how accurately we could
discriminate among natural occurrences of boredom,
engagement, and neutral during a writing task. By doing so,
we have expanded upon a relatively small amount of
previous research pertaining to affect detection using
keystroke analysis by using keystroke/timing features in
conjunction with stable traits and task appraisals to
distinguish between boredom, engagement, and neutral. In
this section, we take stock of our major findings, discuss
applied implications, and discuss limitations and potential
areas for future work.
Major Findings
Our results were illuminating in a number of ways. First,
we have shown that it is possible to discriminate between
boredom, engagement, and neutral using keystroke analysis
and both task appraisals and stable traits relevant to writing.
Our work differed from previous research on
keystroke-based affect detection (see Introduction) in terms of our
focus on free text, tracking of non-basic emotions, finer-
grained temporal resolution for affect detection, a feature
set that couples keystrokes with task appraisals and stable
traits, validation methods that generalize to new learners,
and the fact that naturalistic experiences of affect were
being tracked.
Second, standardizing and downsampling the data did not
provide a significant increase in classification rate over raw
data. This is a significant advantage for real-world
implementation because raw keystroke data is immediately
available as input to the classifiers. Third, classification was
improved by combining keystroke analysis with both task
appraisals and relevant stable traits. Although this requires
additional data collection (see below), it may be worth the
improvement in detection. Fourth, we found that
distinguishing between boredom and engagement was
easier than distinguishing either boredom or engagement
from neutral.
Finally, it is important to note that the models were built
using purely free text keystroke analysis. Free text
keystroke analysis is a more accurate representation of real
world situations than fixed text keystroke analysis, and as
such is more appropriate for dynamic affect detection. Our
results indicate that free text keystroke analysis may be a
reasonable avenue for consideration in affect detection
systems.
Application of Findings to Intelligent User Interfaces
The present findings are applicable to any user interface
that uses a keyboard for user input. The most immediate
application is in systems that monitor affect during writing.
This is an important application area because writing is
increasingly being considered to be a crucial 21st century
skill since the ability to write proficiently is a critical
component of success in many professional endeavors. The
impetus on writing can also be witnessed in the formal
education system. Standardized tests in the U.S., such as the
ACT, SAT, and GRE, have recently added sections that
evaluate an applicant’s writing proficiency. Unfortunately,
despite the increased focus on writing in the education
system, there is evidence to suggest that the average student
lacks the necessary skills to effectively communicate their
ideas in written text. For example, according to a National
Assessment of Educational Progress report, only 23% of 8th
graders and 31% of 12th graders in the U.S. were considered
to be “proficient” writers and there was no significant
change since 2002 [23].
Considering the high stakes placed on writing competency,
several educational applications have been developed to
automatically score written essays and provide formative
feedback on writing quality. Some of these include the
Intelligent Essay Assessor, E-Rater, Summary Street, and
Writing Pal [1, 15, 20, 33]. Unfortunately, these systems
have focused on the cognitive aspects of writing, while
ignoring the affective ones. The present research can help
address this gap by automatically detecting when
engagement is waning and the student risks disengaging
from the writing task due to boredom. An educational
application can use this information to intervene
accordingly, for example by changing topics when the student
is bored. The affect detectors developed in this study could
also be instrumental in furthering our understanding of the
role of affect in developing writing proficiency by affording
automated measurement of engagement and boredom,
which are presumably correlated with writing outcomes.
The present focus on keystroke and timing analysis makes
it quite easy to incorporate our models into intelligent
writing-support applications that detect and respond to
affect. The analysis of keystroke, timing, stable trait, and
task appraisal features can all be implemented easily in
software. Data for keystroke and timing features
can be collected unobtrusively while users type on the
keyboard. Stable traits can be collected with short
questionnaires that take no longer than five minutes when a
user creates their profile for the first time. Task appraisals
can be collected with a brief series of questions before each
writing task, which can also be done with minimal
interruption. Because these data are collected before the task
begins, there is little infringement on the writing process.
Since collecting data for task appraisals and stable traits is
quick, it is feasible to incorporate these features into
a real-world application.
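To make the software side of this concrete, a minimal Python sketch is given below. It groups timestamped key events into fixed-length windows, computes a few simple keystroke and timing features per window, and attaches questionnaire scores collected beforehand. The specific feature names (n_keys, n_backspace, mean_latency, max_pause), the event format, and the trait and appraisal keys in the usage example are illustrative assumptions and do not reproduce the feature set used in this study.

```python
from statistics import mean


def window_features(events, window_s=15.0):
    """Compute per-window features from (timestamp_seconds, key) tuples.

    Events are assumed to be sorted by time. Windows are counted from the
    first event; windows containing no keystrokes produce no row.
    """
    if not events:
        return []
    start = events[0][0]
    windows = {}
    for t, key in events:
        idx = int((t - start) // window_s)
        windows.setdefault(idx, []).append((t, key))
    rows = []
    for idx in sorted(windows):
        evs = windows[idx]
        times = [t for t, _ in evs]
        # Latencies between consecutive keystrokes inside the window.
        gaps = [b - a for a, b in zip(times, times[1:])]
        rows.append({
            "window": idx,
            "n_keys": len(evs),
            "n_backspace": sum(1 for _, k in evs if k == "Backspace"),
            "mean_latency": mean(gaps) if gaps else 0.0,
            "max_pause": max(gaps) if gaps else 0.0,
        })
    return rows


def combine(row, traits, appraisals):
    """Attach per-user trait scores and per-task appraisal scores to a row."""
    merged = dict(row)
    merged.update(traits)
    merged.update(appraisals)
    return merged


if __name__ == "__main__":
    events = [(0.0, "a"), (1.0, "b"), (3.0, "Backspace"), (16.0, "c")]
    traits = {"writing_apprehension": 2.5}   # hypothetical questionnaire score
    appraisals = {"topic_interest": 4}       # hypothetical pre-task rating
    for row in window_features(events):
        print(combine(row, traits, appraisals))
```

Because the trait and appraisal values are fixed before typing begins, only the keystroke features need to be recomputed as each new window arrives.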
Our research is focused on affect during writing, which
extends beyond educational systems. Detecting affect
might be beneficial in the workplace or on a personal
computer as well. Continuous monitoring of affect through
keystroke analysis could help create systems that can soothe
a user who is becoming frustrated with an aspect of their work
or re-focus a user who is becoming bored. A personal
computer could detect when a user is becoming sad and
might make an attempt to cheer them up. These situations
differ somewhat from essay writing, but interaction with a
keyboard offers the opportunity for unobtrusive affect
detection regardless of the situation.
Limitations and Future Work
There are a few limitations of the methods we employed.
First, the highest accuracy for the boredom, engagement,
and neutral discrimination was 56.3%, which, although
approximately 17% better than chance, is quite modest.
That being said, we do have some confidence in the
generalizability of our models, because these results were
obtained for previously unseen subjects through the
leave-several-subjects-out cross-validation method. Second, we
did not perform feature selection other than removing
features that exhibited multicollinearity. Employing feature
selection would reduce the number of features being used
and may improve classification as well. Third, we only
investigated affect in 15-second intervals because this is
the interval at which humans provided affect judgments.
Different interval lengths might be more or less appropriate
for affect detection. Fourth, we investigated whether our
models generalize to new learners through the
leave-several-subjects-out method, but we did not assess the
extent to which they generalize to new topics.
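The idea behind subject-level validation can be sketched in Python as follows: every instance from a held-out subject goes into the test set, so no subject contributes to both training and testing within a fold. The function name subject_folds, the number of folds, and the random assignment of subjects to folds are assumptions made for illustration rather than the exact procedure used in our analyses.

```python
import random


def subject_folds(subject_ids, n_folds=5, seed=0):
    """Build (train_indices, test_indices) pairs split by subject.

    subject_ids holds one subject label per instance. Subjects are
    shuffled and dealt into n_folds groups; each group is held out once.
    """
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(subjects)
    held_out_groups = [subjects[i::n_folds] for i in range(n_folds)]
    folds = []
    for group in held_out_groups:
        held = set(group)
        train = [i for i, s in enumerate(subject_ids) if s not in held]
        test = [i for i, s in enumerate(subject_ids) if s in held]
        folds.append((train, test))
    return folds


if __name__ == "__main__":
    ids = ["s1", "s1", "s2", "s3", "s3", "s4"]
    for train, test in subject_folds(ids, n_folds=2):
        print("train:", train, "test:", test)
```

The same grouping idea could in principle be applied to essay topics instead of subjects to probe generalization to new topics, which we did not assess.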
Future work could be focused in several areas. It would be
useful to investigate if feature selection improves
performance of the classifiers. Instead of looking at
15-second intervals, it might be possible to analyze a
continuous stream of keystrokes. Our models were built
from data within the subject but independent of the essay –
it might be informative to study trends within each essay as
well. Whether these and other advances result in notable
improvements in accuracy awaits further research and
empirical testing.
ACKNOWLEDGEMENTS
Special thanks to Caitlin Mills for leading the data
collection and to Rebekah Combs, Rosaire Daigle, Nia
Dowell, Ally Dobbins, Melissa Gross, Blair Lehman, and
Amber Strain for help with data collection. This research
was supported by the National Science Foundation (NSF)
(ITR 0325428, HCC 0834847, and DRL 1235958). Any
opinions, findings and conclusions, or recommendations
expressed in this paper are those of the authors and do not
necessarily reflect the views of the NSF.
REFERENCES
1. Attali, Y., & Burstein, J. (2006). Automated essay
scoring with e-rater® V. 2. The Journal of Technology,
Learning and Assessment, 4(3).
2. Bleha, S., Slivinsky, C., & Hussien, B. (1990).
Computer-access security systems using keystroke
dynamics. Pattern Analysis and Machine Intelligence,
IEEE Transactions on, 12(12), 1217-1222.
3. Calvo, R. A., & D’Mello, S. K. (2010). Affect detection:
An interdisciplinary review of models, methods, and
their applications. IEEE Transactions on Affective
Computing, 1(1), 18-37. doi: 10.1109/T-AFFC.2010.1
4. Cole, J. S., & Gonyea, R. M. (2010). Accuracy of self-
reported SAT and ACT test scores: Implications for
research. Research in Higher Education, 51(4), 305-
319. doi: 10.1007/s11162-009-9160-9
5. Crawford, H. (2010). Keystroke dynamics:
Characteristics and opportunities. In Privacy Security
and Trust (PST), 2010 Eighth Annual International
Conference on (pp. 205-212). IEEE.
6. Craig, S., D'Mello, S., Witherspoon, A., & Graesser, A.
(2008). Emote aloud during learning with AutoTutor:
Applying the facial action coding system to cognitive-
affective states during learning. Cognition & Emotion,
22(5), 777-788.
7. Craig, S., Graesser, A., Sullins, J., & Gholson, J. (2004).
Affect and learning: An exploratory look into the role of
affect in learning. Journal of Educational Media, 29,
241-250.
8. Cunningham, A. E., & Stanovich, K. E. (1997). Early
reading acquisition and its relation to reading experience
and ability 10 years later. Developmental Psychology,
33(6), 934-945.
9. Daly, J., & Miller, M. (1975). The empirical
development of an instrument to measure writing
apprehension. Research in the Teaching of English, 9(3),
242-249.
10. D’Mello, S., & Graesser, A. (2011). The half-life of
cognitive-affective states during complex learning.
Cognition & Emotion, 25(7), 1299-1308.
11. D'Mello, S., & Graesser, A. (2010). Multimodal semi-
automated affect detection from conversational cues,
gross body language, and facial features. User Modeling
and User-adapted Interaction, 20(2), 147-187.
12. D'Mello, S., & Mills, C. (in review). Emotions during
emotional and non-emotional writing.
13. Ekman, P. (1992). An argument for basic emotions.
Cognition & Emotion, 6(3-4), 169-200.
14. Epp, C., Lippold, M., & Mandryk, R. L. (2011).
Identifying emotional states using keystroke dynamics.
In Proceedings of the 2011 annual conference on
Human factors in computing systems (pp. 715-724).
ACM.
15. Foltz, P. W., Laham, D., & Landauer, T. K. (1999). The
intelligent essay assessor: Applications to educational
technology. Interactive Multimedia Electronic Journal
of Computer-Enhanced Learning, 1(2).
16. Gunetti, D., & Picardi, C. (2005). Keystroke analysis of
free text. ACM Transactions on Information and System
Security (TISSEC), 8(3), 312-347.
17. Joyce, R., & Gupta, G. (1990). Identity authentication
based on keystroke latencies. Communications of the
ACM, 33(2), 168-176.
18. Forbes-Riley, K., & Litman, D. J. (2011). Benefits and challenges of
real-time uncertainty detection and adaptation in a
spoken dialogue computer tutor. Speech
Communication, 53(9-10), 1115-1136. doi:
10.1016/j.specom.2011.02.006
19. Khanna, P., & Sasikumar, M. (2010). Recognising
Emotions from Keyboard Stroke Pattern. International
Journal of Computer Applications IJCA, 11(9), 24-28.
20. McNamara, D. S., Raine, R., Roscoe, R., Crossley, S.,
Jackson, G. T., Dai, J., & Graesser, A. C. (2012). The
Writing-Pal: Natural language algorithms to support
intelligent tutoring on writing strategies. Applied natural
language processing and content analysis:
Identification, investigation, and resolution. Hershey,
PA: IGI Global.
21. Metallinou, A., Wollmer, M., Katsamanis, A., Eyben,
F., Schuller, B., & Narayanan, S. (in press). Context-
Sensitive Learning for Enhanced Audiovisual Emotion
Classification. IEEE Transactions on Affective
Computing.
22. Mills, C., & D’Mello, S. K. (2012). Emotions during
writing on topics that align or misalign with personal
beliefs. In S. Cerri, W. Clancey, G. Papadourakis & K.
Panourgia (Eds.), Proceedings of the 11th International
Conference on Intelligent Tutoring Systems (pp. 638-
639). Berlin Heidelberg: Springer-Verlag.
23. NAEP. (2007). The Nation's Report Card: Writing 2007.
24. Nijholt, A., & Tan, D. S. (2010). Brain-
Computer Interfaces: Applying our Minds to Human-
Computer Interaction. Springer.
25. Pang, B., & Lee, L. (2008). Opinion mining and
sentiment analysis. Foundations and Trends in
Information Retrieval, 2(1-2), 1-135.
26. Pantic, M., & Patras, I. (2006). Dynamics of facial
expression: Recognition of facial actions and their
temporal segments from face profile image sequences.
IEEE Transactions on Systems, Man, and Cybernetics,
Part B., 36(2), 433-449. doi:
10.1109/tsmcb.2005.859075
27. Pantic, M., & Rothkrantz, L. (2003). Toward an affect-
sensitive multimodal human-computer interaction.
[Review]. Proceedings of the IEEE, 91(9), 1370-1390.
doi: 10.1109/jproc.2003.817122
28. Picard, R. (1997). Affective Computing. Cambridge,
Mass: MIT Press.
29. Picard, R. (2010). Affective Computing: From Laughter
to IEEE. IEEE Transactions on Affective Computing,
1(1), 11-17.
30. Rosenberg, E., & Ekman, P. (1994). Coherence between
expressive and experiential systems in emotion.
Cognition & Emotion, 8(3), 201-229.
31. Valstar, M. F., Mehu, M., Jiang, B., Pantic, M., &
Scherer, K. (in press). Meta-Analysis of the First Facial
Expression Recognition Challenge. IEEE Transactions
on Systems, Man, and Cybernetics, Part B: Cybernetics.
32. Vizer, L. M., Zhou, L., & Sears, A. (2009). Automated
stress detection using keystroke and linguistic features:
An exploratory study. International Journal of Human-
Computer Studies, 67(10), 870-886.
33. Wade-Stein, D., & Kintsch, E. (2004). Summary Street:
Interactive computer support for writing. Cognition and
Instruction, 22(3), 333-362.
34. Witten, I.H., Frank, E., Trigg, L., Hall, M., Holmes, G.
& Cunningham, S.J. (1999). Weka: Practical machine
learning tools and techniques with Java
implementations. Hamilton, New Zealand: University of
Waikato, Department of Computer Science.
35. Zeng, Z., Pantic, M., Roisman, G., & Huang, T. (2009).
A survey of affect recognition methods: Audio, visual,
and spontaneous expressions. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 31(1), 39-
58.