Head movement and facial expressions as game input
Mirja Ilves*, Yulia Gizatdinova, Veikko Surakka, and Esko Vankka
Research Group for Emotions, Sociality, and Computing, Tampere Unit for Computer-Human
Interaction (TAUCHI), School of Information Sciences, University of Tampere, Kanslerinrinne 1,
FIN-33014 University of Tampere, Finland
E-mail addresses: [email protected] (M. Ilves), [email protected] (Y. Gizatdinova),
[email protected] (V. Surakka), [email protected] (E. Vankka)
*Corresponding author: Mirja Ilves, E-mail: [email protected], Tel: +358 50 318 5848, Postal
address: Mirja Ilves, Tampere Unit for Computer-Human Interaction, School of Information
Sciences, University of Tampere, Kanslerinrinne 1, FIN-33014 University of Tampere, Finland
Abstract
This study aimed to develop and test a hands-free video game that utilizes information on the
player’s real-time face position and facial expressions as intrinsic elements of gameplay. Special focus was given to investigating the user’s subjective experiences of utilizing computer vision input in game interaction. The player’s goal was to steer a drunken character home as quickly as
possible by moving their head. Additionally, the player could influence the behavior of game
characters by using the facial expressions of frowning and smiling. The participants played the
game with computer vision and a conventional joystick and rated the functionality of the control
methods and their emotional and game experiences. The results showed that although the
functionality of the joystick steering was rated higher than that of the computer vision method, the
use of head movements and facial expressions enhanced the experiences of game playing in many
ways. The participants rated playing with the computer vision technique as more entertaining,
interesting, challenging, immersive, and arousing than doing so with a joystick. The results
suggested that a high level of experienced arousal in the case of computer vision-based interaction
may be a key factor for better experiences of game playing.
Keywords: Interfaces and interaction techniques, camera-based video game, computer vision, face
detection and tracking, facial expression classification, head movement, gameplay experience,
emotion
1. Introduction
Enjoyment and other emotion-related factors are central motivators for playing video games. People
seek and play video games that are fun and entertaining or elicit other kinds of emotional
experiences. Games with different characteristics elicit various emotional responses (e.g. [1]), but
control devices for playing can also affect a player’s emotions and game experience. Recent
evidence shows that new, handheld but more natural controllers (e.g. the Wii remote, a steering wheel) can lead to higher feelings of spatial presence and game enjoyment than traditional control devices (e.g. the joystick, gamepad, and keyboard) [2]. However, systems that rely on physical controls have inherent limitations; for example, they cannot detect the presence or identity of players.
More natural, active, and playful gaming has become possible because of advances in computer
vision (CV). Through standard video cameras, CV technology provides a low-cost alternative to
handheld devices and allows entirely unobtrusive detection of head and body movements or hand
gestures, for example, and their use as a game input. Automatic analysis of facial information can
help to understand the identities and number of players, as well as their presence and locations. The
recognition of facial expressions is also possible with the help of CV. The human face and facial
expressions provide a rich source of information about human behavior and emotional state. It can
be argued that faces are the main modality in human nonverbal communication, and many
expressions can be performed voluntarily; thus, the use of facial expressions could provide a natural
method of game control. However, research in the area of automatic face analysis has in the past focused mainly on technological aspects, dealing with performance characteristics of different methods such as speed, accuracy, and robustness [3,4]. Generally, the question of how video games can
successfully leverage facial information remains less understood. The literature analysis reveals that
although automatic face analysis has started to be utilized in gaming, few studies have attempted to
systematically evaluate the usability and user experience aspects of face-responsive games.
This study aimed to investigate the functional and experiential aspects of head movements and
facial expressions detected by CV as game input methods. In contrast to earlier studies, we
combined active and continuous face tracking with facial expression classification in real time in
order to enhance the overall experience of gameplay. For this purpose, we designed and
implemented the game “Take Drunkard Home,” where the player’s goal is to use head movements to steer a drunken character home as quickly as possible without letting him fall down. Along
the way, the player needs to pick up various items and avoid collisions with stationary or moving
obstacles. Additionally, the player can influence the behavior of the other game characters by using
facial expressions with two affective meanings—positive for smiling and negative for frowning.
The game solely relies on automatic face processing and therefore supports better accessibility to
video gaming for those players who have difficulties or simply do not want to use physical input
devices. We conducted a user study in which the participants played the game using CV technology
and a conventional joystick. We recorded the game duration; the number of falls; and the number of
picked flowers, beer cans, and hamburgers when the participants played the game. The participants
evaluated the functionality of the control methods, as well as their emotional and game experiences.
This paper reviews recent advances in face-responsive video gaming and introduces the game
design and CV-based control methods used in this study. Then it presents the results of the
empirical evaluation of the game and further discusses issues related to the future development of
video games that utilize facial information.
2. Background
Quite recently, the game industry has added new input devices and techniques to traditional controls
such as the keyboard, mouse, joystick, and gamepad. The controlling of games is not limited
anymore to the use of the hands; new input devices and techniques allow more natural and more
playful, physically active gaming. For example, the motion-sensing capability of Nintendo’s Wii¹ remote enables the detection of acceleration and movement in three dimensions. There are also
technologies that enable game control without any handheld device. For example, Microsoft Kinect²
enables players to control the Xbox 360 using body movements and gestures without touching the
control device. The Microsoft Kinect sensor consists of an infrared laser projector, combined with
two cameras that detect the positions and movements of people in three dimensions. Some game
technologies, such as Sony EyeToy³ for the Sony PlayStation 2, use CV for the gestural control of
games. Because of advances in CV and hardware processing capabilities, movement- and gesture-
based interaction has also become possible in conventional computer systems. CV technology has been improving constantly, and the field has recently demonstrated a number of face analysis
methods that have proven to function well even in challenging conditions [3,5,6]. Considering that
real-time, accurate, and robust measurement of facial visual information is only a matter of time, it
is important to understand how this information can be used in the context of game interaction. At
this point, we emphasize the need for early user studies as an integral part of the development of
CV-based user interfaces and their successful integration into video games.
From the implementation standpoint, controlling video games or any other graphical user interface
is typically based on pointing and activation methods. The pointing method identifies the object of interest, and the activation method triggers a certain action on it in the virtual game world. From
the design standpoint, video games can be controlled in two ways, implicit and explicit. Implicit
control allows for automatic adaptation or adjustment of the game environment and interaction
modalities to the player’s spontaneous behavior. Explicit control means that the player consciously
produces facial expressions, head movements, and body gestures to directly control the game
interaction. This type of gaming replaces traditional gaming with physical input devices that are
primarily based on the point-and-click concept. Furthermore, the use of faces for game control
(implicit or explicit) has two important advantages. First, the human face is highly expressive, with
more than 40 muscles that alone or in combinations produce visually detectable changes in facial
appearance. Therefore, facial expressions, together with head movements, can potentially provide a
diverse, intuitive, and fine-grained means of game control. Second, growing evidence indicates that
the use of physical movements of the head and face can enhance the overall experience of game
playing.
The face-responsive video games proposed to date can be roughly divided into three categories, according to how facial information is utilized in game interaction:
(1) Digital interactive mirror. The avatar explicitly repeats the player’s head movements and
facial expressions (the top left image of Figure 1). Real-time animation of sometimes
impressively photorealistic avatars has become an increasingly popular area of research
(e.g. [7]), partly due to its potential utilization in the film industry. Additional graphical add-ons such as makeup, various headwear items, or emoticons can be drawn on top of the player’s face to enhance the experience of presence and role-playing [8].
--------------------------------
Insert Figure 1 about here
--------------------------------
Figure 1. Screen shots of face-responsive games: (top left) “Maris head” digital interactive mirror, (top right) “eating game” (arrows show the direction of movement), and (bottom) “walking game.”
(2) Viewpoint, directional navigation, and action trigger control. Information on face position
and head orientation is directly used to change the player’s point of view in first-person
games or to steer the avatar in the game environment. Additionally, the detected head
gestures and facial expressions are used to imply a certain action in the game world.
Conventionally, the navigation of the avatar in a two- or three-dimensional world has been
performed with handheld devices such as the keyboard, mouse, joystick, and gamepad. It
has been shown that head rotation and movement can substitute for the use of physical
devices in navigational tasks and provide a more natural and intuitive means of game control
[8,9]. The top right image in Figure 1 shows the “eating game” [10], which transfers the player’s sideways head movement to the horizontal motion of the “eater” character
that is located at the bottom of the game space. The player controls the character’s mouth-
opening movement by opening his or her own mouth. The bottom image in Figure 1 shows a
top-down, strategy-like “walking game” [10], where a circular movement template is used to
move the character from one cell to another in a labyrinth, by means of head gestures. The
player can also produce facial expressions to pick up different items.
(3) Affective control. Facial expressions are naturally utilized to bring affective information to
the gameplay and, depending on the game design, implicitly or explicitly execute emotion-
related or emotion-guided activities. This idea closely relates to the concept of affective
gaming, meaning that the player’s emotional state influences the game’s difficulty level or
aesthetics, for example [11]. Previously, in order to assess the user’s affective state,
information on the player’s heart rate, skin conductance, and respiration had been detected
and further utilized in manipulating the gameplay [12]. In CV-based games, the player’s
spontaneous facial expressions and body gestures are detected fully unobtrusively, without attached physiological sensors, and used to adapt the game to the supposed affective state
of the player.
Our research focus belongs to the last two categories: augmenting video games with information on the player’s real-time head position and facial expressions. We concentrate our literature review on studies involving empirical verification and explain how the proposed game designs influence user experiences.
2.1. Facial information for viewpoint, directional navigation, and action trigger control
Wang et al. [8] investigated how applying real-time information about face position as an essential
element of gameplay would affect game experiences. They found that using face position information in a first-person shooter (FPS) game for peek-and-dodge movements can effectively
enhance the sense of presence. Sko and Gardner [9] also studied the potential of the head gestural
input in FPS games. Their focus group of expert game developers and experienced end users gave positive feedback on head gestural input (e.g. zooming, peeking, spinning) in FPS games. In a further study [13], the head interaction technique was
improved to make it robust to the variable conditions of home use. The feedback from 2,500 users
showed that head tracking improved the game’s immersion and realism. Furthermore, Gorodnichy
and Roth’s study [14] showed that test participants rated playing the game “Aim-n-shoot
BubbleFrenzy” with the hands-free ‘nose as mouse’ technique as more fun and less tiring than
playing the game with a mouse.
2.2. Facial expressions as affective control
In addition to body postures, gestures, and head position, CV can detect changes in facial
expressions. The human face and facial expressions have a significant role in interpersonal
interaction; thus, the use of facial expressions as an input method also provides a potential
communication channel for the gaming environment. Facial expressions are communicative signals
that reflect both voluntary and involuntary activation (e.g. [15,16]). Involuntary facial expressions that occur spontaneously can reflect emotional states such as fear, anger, and happiness, or cognitive activities such as concentration [17,18,19]. Facial expressions can also be used
voluntarily, for example, to affect the mental state or behavior of another person. For example, a
smile can show friendliness, approval, or encouragement; lifting the eyebrows can communicate
wonder; and lowering the eyebrows can demonstrate disapproval or aggression.
Some authors have developed and tested video games that utilize voluntary or involuntary facial
expressions. Obaid, Han, and Billinghurst [20] designed the game “Feed the Fish,” which responds
to a player’s facial expressions by adjusting the game’s difficulty level. A positive expression
changes the game level to a harder one, and a negative expression lowers the game level to an easier
one. They found that people rated the affect-aware game as more enjoyable, exciting, and
challenging than its non-affective version. Bernhaupt et al. [21] developed the game “Emotional
Flowers” for long-term usage (i.e. three to five days) in a working environment. The game’s main
idea is to grow a flower as fast as possible by using positive facial expressions. The facial
expressions are measured periodically, and the flower grows or shrinks depending on the
detected facial expressions. Multiple players can play the game simultaneously. An ambient display
in a public area shows the flowers of all participants. A user study showed that the game influenced
the players’ emotional states and social communication patterns. Lankes et al. [22] redesigned the
game of Bernhaupt et al. to be suitable for a shopping center. In this kind of environment, the
interaction time has to be short, with immediate feedback. In Lankes et al.’s “EmoFlowers” game,
the players’ facial expressions influence the weather status in the game and in this way, the growth
of a virtual flower. The expressions of sadness lead to rain, and the expressions of joy cause
sunshine. The participants of a user study reported that interaction with the game via facial
expressions was natural and easy to learn. Additionally, the majority (92%) of the players reported a positive user experience while playing.
These studies show the potential of facial expressions as a game input method. Because facial expressions have a central role in face-to-face communication in real life, it is logical to study whether they could also serve a similarly natural interaction purpose within a game. Thus, in our game, the purpose of facial expressions was to affect the behavior of the other
game characters. Moreover, our game did not utilize facial expressions only, but combined them
with information on head movements for avatar steering.
2.3. Subjective experience measures of game playing
Computer games are hedonic in nature, that is, people play games to entertain themselves [23].
Hedonic products are consumed mainly for affective or sensory gratification purposes [24].
Affective experiences are often measured through a certain set of dimensional scales (i.e. valence,
arousal, and dominance) that are formed based on a dimensional theory of emotions [25]. The
valence dimension relates to the pleasantness of a certain experience, ranging from unpleasant to
pleasant. The arousal dimension refers to the level of activation, ranging from relaxed to stimulated.
The dominance dimension involves the feelings of control, ranging from being controlled to being
in control.
In the game environment, factors such as immersion, presence, and flow have also been considered
important for a comprehensive understanding of the subjective game experience. IJsselsteijn et al. [26] developed the Game Experience Questionnaire (GEQ), which measures several dimensions of the playing experience. The questionnaire has been used in many game studies globally, and it can
assess the gameplay experience with high reliability [27,28]. There is also evidence that video
games cause motion sickness in many players [29]. Thus, it is essential to evaluate whether the use of head movements increases motion sickness, compared to a more traditional control device. In this study, we compared the CV and joystick control methods in terms of functionality, motion sickness, emotional experiences, and players’ game experiences.
3. Methods
3.1. Game design
“Take Drunkard Home” (see Figure 2) is a third-person-view obstacle course game, where the player’s main goal is to steer a drunken man home as quickly as possible without letting him fall down. The character—a drunken soldier—walks along a dark street (the character’s forward
movement is automatic). Real-time head position information is processed and used for explicit
control of the character’s walking direction. Thus, the player can steer the character to the left or right with sideways head movements. Because the player’s character is drunk, he is
in constant danger of falling down. That is why the player must actively keep the drunken character
balanced with head movements; the player should move the head to the right when the character is
falling to the left and vice versa.
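As an illustration of this steering scheme, the following is a minimal sketch of how horizontal face position could be mapped to a steering value; the function, the normalization, and the dead-zone threshold are our illustrative assumptions, not the game’s actual implementation.

```python
def head_steering(face_x, center_x, frame_width, dead_zone=0.05):
    """Map the face's horizontal position to a steering value in [-1, 1].

    face_x      -- x-coordinate of the detected face center (pixels)
    center_x    -- neutral head position from calibration (pixels)
    frame_width -- width of the camera frame (pixels)
    dead_zone   -- normalized displacement ignored around the neutral
                   position, so small head movements do not steer
    """
    # Displacement from the neutral position, normalized to [-1, 1].
    offset = (face_x - center_x) / (frame_width / 2.0)
    if abs(offset) < dead_zone:
        return 0.0  # within the dead zone: keep walking straight
    # Clamp to the valid steering range.
    return max(-1.0, min(1.0, offset))
```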
--------------------------------
Insert Figure 2 about here
--------------------------------
Figure 2. Screen shots of the game: (top image) neutral facial expression, (bottom left image) smiling facial
expression, and (bottom right image) frowning facial expression.
The character’s initial level of intoxication is adjustable from the menu settings, varying from sober
(balanced, easy to steer) to very drunk (highly unbalanced, difficult to steer). The player should also
try to pick up as many items as possible along the way. Bonus points are represented as flowers, and
the player should pick them up in order to increase the total score of the gameplay. The other items
that influence the gameplay are beer cans, which increase the character’s walking speed at the cost
of also increasing his intoxication level, and hamburgers, which decrease the character’s
intoxication level and walking speed. Along the way, the player should avoid collisions with obstacles such as boxes, as well as with other characters such as stationary or moving cats or people. The
difficulty and length of the obstacle course are adjustable from the menu settings. A general
workflow diagram of the game is shown in Figure 3.
--------------------------------
Insert Figure 3 about here
--------------------------------
Figure 3. A general workflow diagram of the gameplay with key scenes (green), backend judgment (pink),
and user interaction (yellow).
Our particular interest in designing this game was to integrate facial expressions into the gameplay
in an intuitive, easy, and entertaining way. Although the general idea seems straightforward, the use
of voluntary facial expressions as another axis of control for emotion-guided activities is not easy in
the case of active games. In the “Take Drunkard Home” game, the player navigates through the
obstacle course, strategically thinks of positions and amounts of items to be picked up, and at the
same time, actively steers and balances the character with physical movements. Recognizing situations where expressions can be beneficial, and making the physical effort to produce them, imposes an additional cognitive load that can become demanding for the player.
Therefore, the game design should take special care to ease this task for the player by creating situations with a natural connection to the real-world use of facial expressions. We ended up with
a design that assigns two affective meanings, namely, positive and negative affects, to the player’s
expressions with respect to the other characters of the game (see Figure 4). Thus, a smiling
expression stops moving characters, which become friendly and do not collide with the player’s
character. A frowning expression frightens away stationary characters, which become scared by the
expression of disapproval or anger and give way to the player’s character.
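As a concrete reading of this design, the sketch below pairs the two expressions with the two character reactions described above; the data structure and field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class GameCharacter:
    moving: bool            # True for moving characters, False for stationary
    friendly: bool = False
    scared: bool = False

def react_to_expression(expression, other):
    """Apply the player's expression to a nearby game character."""
    if expression == "smile" and other.moving:
        other.friendly = True   # smiling stops moving characters: no collision
        other.moving = False
    elif expression == "frown" and not other.moving:
        other.scared = True     # frowning frightens stationary characters away
```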
--------------------------------
Insert Figure 4 about here
--------------------------------
Figure 4. A diagram of key actions of the gameplay such as walking, socio-emotional interactions with other
game characters, avoiding obstacles, falling down, and picking up items.
We enhanced visual feedback about facial expression processing and usage by adding an emoticon at the bottom of the game space (see Figure 2). The emoticon remains inactive during the times
when the game does not expect affective input from the player. When the player approaches a
character that moves or stands in the way, the emoticon activates, indicating that the player can now input affective information to the gameplay by producing one of the two predefined
facial expressions. If the player shows neither of those two expressions, the game proceeds. If the
character falls down because of a stationary or moving character, the game gives the player three
seconds to prepare for the continuation of the game (the display shows a timer). Additionally, there
is a two-second period when the character does not collide with the obstacles but walks by them.
This prevents the character from colliding with the same obstacle repeatedly.
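These two timing rules can be captured in a small amount of game state, as in the sketch below; the class and method names are ours, and only the two durations come from the design described above.

```python
import time

class CollisionGrace:
    FALL_PAUSE = 3.0  # seconds of countdown shown after a fall
    NO_COLLIDE = 2.0  # seconds of collision immunity after resuming

    def __init__(self):
        self.fall_time = None
        self.resume_time = None

    def on_fall(self):
        self.fall_time = time.monotonic()

    def paused(self):
        """True while the post-fall countdown is still running."""
        if self.fall_time is None:
            return False
        if time.monotonic() - self.fall_time < self.FALL_PAUSE:
            return True
        # Countdown finished: resume play and start the no-collision window.
        self.resume_time = time.monotonic()
        self.fall_time = None
        return False

    def collisions_enabled(self):
        return (self.resume_time is None or
                time.monotonic() - self.resume_time >= self.NO_COLLIDE)
```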
3.2. Implementation
The game was created using the XNA⁴ Framework, with the Nuclex⁵ Framework’s input library for
the joystick input. The CV algorithms were developed separately and executed in another
application. The results of face processing were sent to the game application using a socket
connection. The participants were able to observe the face-processing output in a separate window to ensure that the face was tracked properly. If the face was lost, for example, when the player moved out of the camera’s field of view, the game paused until the face was found again.
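On the game side, such a loose coupling could look like the following sketch, assuming a line-based JSON protocol; the host, port, and message fields are illustrative assumptions rather than the format actually used.

```python
import json
import socket

def face_updates(host="127.0.0.1", port=5005):
    """Yield face-processing results streamed by the CV application.

    Each message is assumed to be one JSON line, e.g.
    {"found": true, "x": 412, "expression": "smile"}.
    """
    with socket.create_connection((host, port)) as conn:
        for line in conn.makefile("r"):
            yield json.loads(line)

# The game loop would then pause whenever no face is reported:
# for msg in face_updates():
#     if not msg["found"]:
#         pause_gameplay()          # until the face is found again
#     else:
#         steer_character(msg["x"], msg["expression"])
```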
The CV application used the location of the player’s face in the camera image, the so-called camera mouse, to steer the game’s character. Two different methods were utilized for face detection and tracking.
The Viola-Jones face detector [30,31] was used as a primary method for locating the player’s face
and tracking his or her sideways movements. It is a fairly fast and robust method of detecting faces,
but it frequently fails when the head is tilted or rotated toward near-profile views. The tracking-learning-detection (TLD) method [32], in contrast, is more reliable and can learn changes in facial appearance on the fly. However, in our implementation, TLD was not fast enough to support real-time gameplay. For this reason, TLD was used only in those cases when the Viola-Jones method
failed to find the face. The speed of face detection was ~30 frames per second (fps) for the Viola-
Jones method, dropping to 10–15 fps for the TLD method. A moving average was applied to the
five most recent face locations to remove the jitter from the detection output. Using the
anthropometrical measures of the human face [33], the detected region was further segmented into
the upper face (eye-forehead) and lower face (mouth) areas.
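The sketch below illustrates this two-stage scheme, with OpenCV’s Haar cascade standing in for the Viola-Jones detector and an abstract callable standing in for the TLD fallback; the five-location moving average follows the description above, while parameter values are illustrative.

```python
from collections import deque

import cv2  # ships a Viola-Jones style cascade classifier

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
recent_x = deque(maxlen=5)  # five most recent face locations

def tracked_face_x(frame, tld_fallback):
    """Return the smoothed horizontal face position, or None if no face."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, 1.2, 5)  # primary: Viola-Jones
    if len(faces) > 0:
        x, y, w, h = faces[0]
    else:
        # Fallback when Viola-Jones fails (tilted or near-profile head).
        found, box = tld_fallback(frame)
        if not found:
            return None  # no face at all: the game pauses
        x, y, w, h = box
    recent_x.append(x + w / 2.0)
    # Moving average over the five most recent locations removes jitter.
    return sum(recent_x) / len(recent_x)
```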
Facial expression recognition was performed according to the approach presented in [31,34], which
applies support vector machines for the histogram-based image classification with structural and
textural features. The upper face classifier was trained to differentiate between neutral and frowning
expressions, while the lower face classifier distinguished between neutral and smiling expressions.
The classifiers had been evaluated previously in real-time interaction scenarios. Based on the earlier findings reported for this facial expression classification approach [31,34], the expected misclassification rates of the system, namely, false positives and false negatives, were ~10%.
The speed of upper and lower classifiers working simultaneously was ~10 fps. Considering that an
average duration of facial expressions such as frowning is 500±200 ms [30], we assumed that the
speed of facial classification would support capturing the players’ facial expressions. The
expression classification module was implemented so that it did not slow down the speed of the face
detection/tracking module. We note that the game design is independent of the underlying CV methods; therefore, the selection of methods is not limited to those presented in this paper, and other methods may operate better in a given context.
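The following sketch shows the two-classifier arrangement described above, with scikit-learn’s SVM standing in for the classifier implementation and the histogram-based feature extraction elided; it illustrates the described setup rather than reproducing the authors’ code.

```python
from sklearn import svm

upper_clf = svm.SVC(kernel="linear")  # upper face: neutral (0) vs. frown (1)
lower_clf = svm.SVC(kernel="linear")  # lower face: neutral (0) vs. smile (1)

def train(upper_feats, upper_labels, lower_feats, lower_labels):
    """Fit both classifiers on features from the segmented face regions."""
    upper_clf.fit(upper_feats, upper_labels)
    lower_clf.fit(lower_feats, lower_labels)

def classify(upper_feat, lower_feat):
    """Return 'frown', 'smile', or 'neutral' for one segmented face."""
    if upper_clf.predict([upper_feat])[0] == 1:
        return "frown"
    if lower_clf.predict([lower_feat])[0] == 1:
        return "smile"
    return "neutral"
```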
The game can also be played using a conventional physical input device, for example, steering and
balancing the character by tilting the joystick to the left or right. The expressions are controlled with two buttons, one for smiling and one for frowning. If no button is pressed, the expression is neutral.
4. User study
4.1. Participants
Twenty participants (7 females, 13 males) took part in the study. The participants’ mean age was 33
years, ranging between 24 and 51. All the participants played video games to some extent; 3
participants played at least a couple of times a year, 2 at least once a month, 9 at least once a week,
and 6 on a daily basis. All had some experience using both a joystick and body movements for
controlling games (e.g. Nintendo Wii, Microsoft Kinect, or PlayStation Move).
4.2. Equipment
The experiments were conducted on a PC with a 64-bit Windows 7 Professional operating
system. The computer had an Intel® Core™2 Quad CPU and 4 GB of RAM. The display screen
was a 24-inch Samsung SyncMaster 2443 with a resolution of 1920×1080 pixels. For CV-based controlling, we used an off-the-shelf Creative Live!® CAM Sync web camera with a resolution of 800×600 pixels. The camera was installed on top of the monitor, approximately at eye level. The joystick used was a Logitech Attack 3.
4.3. Experimental procedure and measures
When a participant arrived at the laboratory, the experimenter gave an orientation to the laboratory and asked him or her to fill out a consent form and a background questionnaire. Then the participant
was told to sit on a chair in front of the computer screen. The experimenter introduced the idea of
the game and instructed the participant about his or her task to play the game with two controlling
methods and rate the game and game experiences.
The experiment was counterbalanced so that half of the participants played the game first using a
joystick and the other half did so using the CV-based control method. This study utilized a person-
dependent CV system, meaning that it needed a special calibration procedure to fine-tune the CV
methods for each new participant [31]. A calibration window was presented to the participants (the
window width equalled the screen width, and the window height equalled half of the screen height,
with the window positioned on the top half of the screen). The participants were instructed to
continuously point with their faces, one by one at the four corners of the calibration window. The
corners pointed at were highlighted in red, providing visual feedback to the participants. During the
calibration procedure that lasted 2–3 minutes overall, facial image data were collected and further
used to train the TLD face tracker and face classifiers. The training set for the TLD face tracker was
collected first and consisted of 50 images. The face classifiers were trained next, with image inputs
from the segmented upper and lower parts of the face. First, a training set of 50 upper and 50 lower
face images with a neutral expression was collected, followed by 25 images each of a smiling face
and a frowning face. The participants were asked to produce expressions of high intensity in order
to obtain a representative training set of images. Then the CV system was trained with the collected
image data for 1–2 minutes. Finally, the face detection and expression classification were verified to
ensure that the system operated well.
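The composition of the calibration data can be summarized as in the sketch below, which uses the image counts reported above; the collection loop and naming are hypothetical.

```python
# Image counts per training set, as reported above.
TRAINING_PLAN = [
    ("tld_tracker",   "neutral face, pointing at corners", 50),
    ("upper_neutral", "neutral expression",                50),
    ("lower_neutral", "neutral expression",                50),
    ("lower_smile",   "high-intensity smile",              25),
    ("upper_frown",   "high-intensity frown",              25),
]

def collect_training_data(capture_frame):
    """Collect the calibration images set by set (capture details elided)."""
    data = {}
    for name, prompt, count in TRAINING_PLAN:
        print(f"Please hold: {prompt}")
        data[name] = [capture_frame() for _ in range(count)]
    return data
```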
During the calibration, the participants were familiarized with the idea of the CV-based control
method. It was also explained that the CV technology had certain limitations and that the best
performance of the system could be achieved if the face remained close to the up-frontal position.
The general recommendation was to keep the torso active and make small rotations and tilts of the head
to control the application. The participants were instructed to check now and then whether their
faces were still in the camera’s field of view.
With both input methods, there were four playing fields: a practice field and three actual playing
fields. After the participant had played the game using one or the other method, he or she gave the
ratings. First, the participant rated his or her emotional experience with the control method using
three nine-point bipolar scales: pleasantness, arousability, and dominance. The scales varied from -4 (e.g. unpleasant) to +4 (e.g. pleasant). Zero (0) represented a neutral point (e.g. neither unpleasant nor pleasant) on all scales.
Then the user gave eight different ratings of the game and control methods using nine-point bipolar
scales that have previously been used in many studies investigating new interaction methods
[35,36,37]. The scales were as follows: general evaluation (i.e. varying from bad to good), speed
(i.e. from slow to fast), accuracy (i.e. from inaccurate to accurate), efficiency (i.e. from inefficient
to efficient), difficulty (i.e. from difficult to easy), naturalness (i.e. from unnatural to natural),
amusement (i.e. from boring to fun), and interestingness (i.e. from uninteresting to interesting). The
CV-based method was also rated with four additional scales: the pleasantness of smiling, the
pleasantness of frowning, the functionality of the smile, and the functionality of the frown. These
scales varied from -4 (e.g. bad experience) to +4 (e.g. good experience). Zero (0) represented a
neutral value (e.g. neither bad nor good) on all scales.
Finally, the participant filled out the GEQ [26], which covers seven dimensions of the player
experience: sensory and imaginative immersion, tension, competence, flow, negative affect,
positive affect, and challenge. The ending section of the GEQ had four questions about sickness
symptoms taken from the ITC-Sense of Presence Inventory (ITC-SOPI), where they formed a factor
labeled “negative effects.” The 5-point rating scales varied from 0 (not at all) to 4 (extremely). After filling out the questionnaire, the participant played the game once more using the other technique and gave the ratings again. The total duration of the experiment was approximately 50 minutes.
4.4. Data analysis
The game performance measures were compared between the control techniques using paired t-tests. The ratings for the control techniques were compared using Wilcoxon signed-rank tests.
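The two analyses could be carried out as in the following sketch, with SciPy standing in for the statistics package (which is not named here); joystick and cv are paired per-participant arrays of one measure under each control technique.

```python
from scipy import stats

def compare_conditions(joystick, cv):
    """Paired comparisons of one measure between the two control techniques."""
    t_stat, t_p = stats.ttest_rel(joystick, cv)  # paired t-test (performance)
    w_stat, w_p = stats.wilcoxon(joystick, cv)   # Wilcoxon signed-rank (ratings)
    return (t_stat, t_p), (w_stat, w_p)
```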
5. Results
5.1. Game performance measures
Figures 5 and 6 show the means and standard errors of the mean (SEMs) of the game duration, the
number of falls, and the number of picked flowers, beer cans, and hamburgers. The pairwise
comparisons between the control methods showed statistically significant differences: the participants got through the game more quickly, t(19) = -11.83, p < .001, managed to pick up more flowers, t(19) = 5.23, p < .001, and beer cans, t(19) = 4.01, p < .001, and their drunkard character fell less frequently, t(19) = -8.83, p < .001, when they played the game using the joystick than when they did so with CV. The difference in the number of picked hamburgers was not statistically significant.
--------------------------------
Insert Figure 5 about here
--------------------------------
Figure 5. Mean game durations (and SEMs) for both control techniques.
--------------------------------
Insert Figure 6 about here
--------------------------------
Figure 6. Mean numbers of falls, and picked flowers, beer cans, and hamburgers (and SEMs) for both
control techniques.
The pairwise comparisons between the first and third playing fields showed that practicing improved the game performance to some extent for both control methods. With CV, the participants finished the game more quickly, t(19) = 3.03, p < .01, and their drunkard character fell less frequently, t(19) = 3.56, p < .01, in the last field compared to the first one. In the first playing field, the drunkard fell down approximately every 17 seconds; in the third, approximately every 26 seconds. The differences in the numbers of picked flowers, beer cans, and hamburgers were not statistically significant. With the joystick, the participants’ drunkard character fell less frequently, t(19) = 2.81, p < .05, and they managed to pick up more flowers, t(19) = -3.49, p < .01, in the third playing field than in the first one. The differences in the game duration and the numbers of picked beer cans and hamburgers were not statistically significant.
5.2. Emotional ratings for the control techniques
Figure 7 shows the mean ratings and SEMs for experienced valence, arousal, and dominance in
both control techniques. Wilcoxon signed-rank tests showed statistically significant differences in the ratings for valence (Z = 3.04, p < .01) and dominance (Z = 3.91, p < .001), which were higher after the participants had played the game using the joystick than using the CV-based technique. The ratings for arousal also differed significantly, but in the opposite direction: they were higher after the participants had played the game using the CV-based technique than using the joystick (Z = 3.03, p < .01).
--------------------------------
Insert Figure 7 about here
--------------------------------
Figure 7. Mean ratings (and SEMs) for valence, arousal, and dominance in both control techniques.
5.3. Subjective evaluation of the game
Figures 8 and 9 show the mean ratings and SEMs for the evaluations of the game and control
methods. The Wilcoxon signed-rank test between the CV-based technique and the joystick showed no statistically significant difference in the general evaluation. The steering of the game
using the joystick was rated as significantly faster (Z = 3.03, p < .01), more accurate (Z = 3.03, p <
.01), more efficient (Z = 3.03, p < .01), easier (Z = 3.03, p < .01), and more natural (Z = 3.03, p <
.01) than that with the CV-based method. Playing using the CV-based method was rated as
significantly more entertaining (Z = 3.03, p < .01) and interesting (Z = 3.03, p < .01) than doing so
with the joystick.
Pairwise comparisons between the functionality of frowning and smiling, and between the pleasantness of frowning and smiling, showed no statistically significant differences.
--------------------------------
Insert Figure 8 about here
--------------------------------
Figure 8. Mean subjective ratings (and SEMs) for both control techniques.
--------------------------------
Insert Figure 9 about here
--------------------------------
Figure 9. Mean ratings (and SEMs) for the game experience in both control techniques.
5.4. Game experience
Figure 10 shows the mean ratings and SEMs for the seven factors of the GEQ. The ratings for
immersion (Z = 2.05, p < .05), tension (Z = 2.40, p < .05), and challenge (Z = 3.64, p < .001) were
significantly higher after playing the game with the CV-based method than with the joystick. The
ratings for competence (Z = 3.36, p < .001) and negative affect (Z = 1.98, p < .05) were
significantly higher after playing with the joystick than with CV.
The ratings for flow and positive affect showed no statistically significant differences between the control methods.
--------------------------------
Insert Figure 10 about here
--------------------------------
Figure 10. Mean ratings (and SEMs) for the game experience in both control techniques.
5.5. Physical tiredness and sickness symptoms
No significant differences in sickness symptoms (Cronbach’s α = 0.76) were found between playing the game with the joystick and playing it with CV.
6. Discussion
The present results provided evidence both for and against using body- and face-based interactions.
On the one hand, the findings showed that CV enhanced gaming and entertainment experiences in
many ways, compared with the joystick. Playing using CV was rated as more entertaining and
interesting than gaming with the joystick. The participants also experienced playing with CV as
more challenging and immersive than doing so with the joystick. On the other hand, the joystick
was scored as more functional than the CV technique. The participants rated joystick steering as
more accurate, efficient, natural, faster, and easier than CV steering. The lower functionality of the CV method probably also caused the participants to regard themselves as less competent and to feel more tension after gaming with CV than after doing so with the joystick.
The scores for the affective space indicated higher ratings for valence and dominance with joystick
controlling than with the CV-based counterpart, whereas CV controlling was ranked more arousing
than the joystick-controlled playing. The probable reason for these differences is that the joystick
control function was experienced as better and easier than that of CV; the former also evoked more
pleasant and less arousing experiences. On the other hand, the ratings for arousal were consistent
with Isbister’s findings [38] that movement-based steering led to higher scores for arousal than
those obtained using key commands. Furthermore, the results for the affective ratings are interesting
in the light of the recent findings of Poels et al. [23]. They studied how players’ emotions during
gameplay predict playing behavior at a later stage. Their research revealed that pleasure during the
initial gameplay affected short-term playing time and game preferences, while experienced arousal
predicted long-term game preferences best. Thus, experienced arousal seems to be an important
factor for the long-term success of a video game. The level of experienced arousal may also be associated with the amount of body movement while playing, which in turn could better motivate players for future gaming. Previous research has shown that body movement controlling can
lead to more enjoyable and engaging experiences, compared with traditional controllers [39,40].
Clearly, our CV-based application required more movements than the use of the joystick.
The ratings for valence somewhat contradict those for positive and negative affect in the GEQ.
Although joystick steering evoked higher ratings for valence than CV steering did, the scores for
positive affect were not significantly different between these control methods. Furthermore, the
ratings for negative affect were even higher after joystick controlling than after CV controlling.
When the participants gave their scores for valence, arousal, and dominance, they were instructed to
rate their experiences with the control method. In contrast, the negative and positive affect
dimensions in the GEQ measured comprehensive experience during the game (e.g. “I felt bored”).
Thus, although joystick steering was experienced as more pleasant than CV steering, the overall impression of the gameplay with the joystick was more boring and tiresome than that with CV.
The investigation of players’ game experiences, along with the more functional aspects of the game,
is essential for evaluating the potential success of a video game. Sweetser and Wyeth [41, p. 1]
suggested that “player enjoyment is the single most important goal for computer games.” Moreover,
Sherry et al. [42] proposed that challenge and arousal are among the main reasons why people play
video games. Thus, because the participants experienced game playing as markedly more entertaining, arousing, and challenging when they used their own head movements and facial expressions rather than the joystick, CV seems to have considerable potential as a control method.
Our results suggest that CV-based controlling could be an appealing enhancement to traditional
control devices in the game environment. However, on the basis of this study, it is not possible to
predict which of the gameplay conditions people would prefer over the long term. It is possible that
the novelty of the CV method influenced the ratings for interestingness and entertainment, for
example. Alternatively, CV-based controlling could possibly be more rewarding after the players
have learned to use it better. In a future study, it would be interesting to investigate long-term
playing behavior to discover whether the ratings change over time.
As described above, the participants rated the functionality (or usability) of the joystick as better
than that of CV. Joystick steering was scored as faster and easier, as well as more accurate and
efficient than CV steering. Additionally, although the mean ratings for the functionality of facial
expressions as an input method were on the positive side of the scale, these numbers were quite
low. These combined results indicate some problems with the detection of head movements and
facial expressions using the CV method. By improving the robustness and speed of the CV method,
steering the game with head movements and facial expressions could result in more positive ratings
for game functionality and players’ subjective experiences. In future studies, we will extend the scope of facial behavior so that other expressions will be tested for game control.
In the present study, the participants controlled the game directly and consciously with head
movements and facial expressions. It is also possible to apply other kinds of bodily information as
an interaction method in games. Previous studies have provided evidence that direct or explicit
biofeedback can enhance the game experience. Nacke et al. [43] found that people prefer direct
physiological control over indirect control. In the study of Kuikkaniemi et al. [27], implicit
biofeedback had no effects on player experience, whereas the measures that a player could
manipulate explicitly heightened the feelings of immersion and enjoyment. Furthermore, Dekker
and Champion [12] successfully used players’ biofeedback information to increase the feelings of
terror in a horror game. Thus, in addition to facial expressions, using other kinds of consciously
produced physiological information could enhance the game experience. This aspect will be
considered in future work.
In conclusion, even though the functionality of CV was not experienced as effective as that of the joystick, the new kind of control method evoked significantly higher ratings for entertainment, interestingness, challenge, and immersion, for example. Thus, it can be argued that the CV-based technique enhanced the playing experience compared with the traditional joystick.
In the future, CV could provide a promising, hands-free method for controlling games.
Acknowledgments
This research was funded by the Academy of Finland (project 129354). The authors would like to
thank the following students of the University of Tampere: Tek Prasad Gautam, Reza Ahliaraghi,
Henrik Lehtinen, Anju Thapa, Mirjan Merruko, and especially Yanzhao Wen (who worked as a
summer intern in the project), for implementing face-tracking algorithms, and Anu Leppälampi for
serving as the experimenter.
Footnotes
¹ http://www.nintendo.com/wii
² http://www.microsoft.com/en-us/kinectforwindows/
³ http://us.playstation.com/ps2/accessories/eyetoy-usb-camera-ps2.html
⁴ http://www.microsoft.com/en-us/download/details.aspx?id=23714
⁵ http://nuclexframework.codeplex.com/
References
[1] N. Ravaja, M. Salminen, J. Holopainen, T. Saari, J. Laarni, A. Järvinen, Emotional response
patterns and sense of presence during video games: Potential criterion variables for game design, in:
Proceedings of the third Nordic conference on Human-computer interaction, ACM, New York,
2004, pp. 339-347.
[2] P. Skalski, R. Tamborini, A. Shelton, M. Buncher, P. Lindmark, Mapping the road to fun:
Natural video game controllers, presence, and game enjoyment, New Media & Society 13 (2)
(2011) 224-242.
[3] Z. Zeng, M. Pantic, G.I. Roisman, T.S. Huang, A survey of affect recognition methods: audio,
visual, and spontaneous expressions, IEEE Transactions on Pattern Analysis and Machine
Intelligence 31 (1) (2009) 39-58.
[4] C. Manresa-Yee, P. Ponsa, J. Varona, F.J. Perales, User experience to improve the usability of a
vision-based interface, Interacting with Computers 22 (6) (2010) 594-605.
[5] M. Porta, Vision-based user interfaces: Methods and applications, International Journal of
Human-Computer Studies 57 (1) (2002) 27-73.
[6] M. Yang, D. Kriegman, N. Ahuja, Detecting faces in images: A survey, IEEE Transactions on
Pattern Analysis and Machine Intelligence 24 (1) (2002) 34-58.
[7] J.M. Saragih, S. Lucey, J.F. Cohn, Real-time avatar animation from a single image, in: IEEE
International Conference on Automatic Face and Gesture Recognition (FG’11), 2011, pp. 117-124.
[8] S. Wang, X. Xiong, Y. Xu, C. Wang, W. Zhang, X. Dai, D. Zhang, Face-tracking as an
augmented input in video games: enhancing presence, role-playing and control, in: Proceedings of
the SIGCHI conference on Human Factors in computing systems, ACM, New York, 2006, pp.
1097-1106.
[9] T. Sko, H.J. Gardner, Head tracking in First-Person games: interaction using a web-camera, in:
Proceedings of the 12th IFIP TC 13 International Conference on Human-Computer Interaction
(INTERACT '09): Part I, LNCS 5726, 2009, pp. 342-355.
[10] Y. Gizatdinova, V. Surakka, S. Haniff, E. Mäkinen, R. Raisamo, J. Iso-Tuisku, A. Sand, Emerging application areas and challenges of automatic face analysis, Continuum: Journal of Media & Cultural Studies 27 (4) (2013) 572-584.
[11] K. Gilleade, A. Dix, J. Allanson, Affective videogames and modes of affective gaming: assist
me, challenge me, emote me, in: Proceedings of the Digital Games Research Association DiGRA
2005 Conference: Changing Views – Worlds in Play, 2005.
[12] A. Dekker, E. Champion, Please biofeed the zombies: enhancing the gameplay and display of a
horror game using biofeedback, in: Proceedings of the Digital Games Research Association DiGRA
2007 Conference: Situated Play, 2007, pp. 550-558.
[13] T. Sko, H. Gardner, M. Martin, Studying a head tracking technique for first-person-shooter
games in a home setting, in: Proceedings of the International Conference on Human-Computer
Interaction (INTERACT 2013): Part IV, LNCS 8120, 2013, pp. 246-263.
[14] D.O. Gorodnichy, G. Roth, Nouse ‘use your nose as a mouse’ perceptual vision technology for
hands-free games and interfaces, Image and Vision Computing 22 (12) (2004) 931-942.
[15] M.L. Knapp, Nonverbal communication in human interaction, Holt, Rinehart and Winston,
New York, 1978.
[16] V. Surakka, J.K. Hietanen, Facial and emotional reactions to Duchenne and non-Duchenne
smiles, International Journal of Psychophysiology 29 (1998) 23-33.
[17] J.T. Cacioppo, R.E. Petty, K.J. Morris, Semantic, evaluative, and self-referent processing:
memory, cognitive effort, and somatovisceral activity, Psychophysiology 22 (4) (1985) 371-384.
[18] P. Ekman, An argument for basic emotions, Cognition and Emotion 6 (3/4) (1992) 169-200.
[19] J.K. Hietanen, V. Surakka, I. Linnankoski, Facial electromyographic responses to vocal affect
expressions, Psychophysiology 35 (1998) 530-536.
[20] M. Obaid, C. Han, M. Billinghurst, “Feed the Fish”: an affect aware game, in: Proceedings of
the 5th Australasian Conference on Interactive Entertainment, article No. 6, ACM, New York,
2008.
[21] R. Bernhaupt, A. Boldt, T. Mirlacher, D. Wilfinger, M. Tscheligi, Using emotion in games:
emotional flowers, in: Proceedings of the International Conference on Advances in Computer
Entertainment Technology 2007, ACM, New York, 2007, pp. 41-48.
[22] M. Lankes, S. Riegler, A. Weiss, T. Mirlacher, M. Pirker, M. Tscheligi, Facial expressions as
game input with different emotional feedback conditions, in: Proceedings of the International
Conference on Advances in Computer Entertainment Technology 2008, ACM, New York, 2008,
pp. 253-256.
[23] K. Poels, W. van den Hoogen, W. IJsselsteijn, Y. de Kort, Pleasure to play, arousal to stay: the effect of player emotions on digital game preferences and playing time, Cyberpsychology, Behavior, and Social Networking 15 (1) (2012) 1-6.
[24] D.S. Kempf, Attitude formation from product trial: Distinct roles of cognition and affect for
hedonic and functional products, Psychology & Marketing 16 (1) (1999) 35-50.
[25] M.M. Bradley, P.J. Lang, Measuring emotion: The self-assessment manikin and the semantic differential, Journal of Behavior Therapy and Experimental Psychiatry 25 (1) (1994) 49-59.
[26] W. IJsselsteijn, W. van den Hoogen, C. Klimmt, Y. de Kort, C. Lindley, K. Mathiak, K. Poels,
N. Ravaja, M. Turpeinen, P. Vorderer, Measuring the experience of digital game enjoyment, in:
Proceedings of Measuring Behavior, 2008, pp. 26-29.
[27] K. Kuikkaniemi, T. Laitinen, M. Turpeinen, T. Saari, I. Kosunen, The influence of implicit and
explicit biofeedback in first-person shooter game, in: Proceedings of the SIGCHI Conference on
Human Factors in Computing Systems, ACM, New York, 2010, pp. 859-868.
[28] L. Nacke, C.A. Lindley, Flow and immersion in first-person shooters: Measuring the player’s
gameplay experience, in: Proceedings of the 2008 Conference on Future Play: Research, Play,
Share, ACM, New York, 2008, pp. 81-88.
[29] C.-H. Chang, W.-W. Pan, L.Y. Tseng, T.A. Stoffregen, Postural activity and motion sickness
during video game play in children and adults, Experimental Brain Research 217 (2) (2012) 299-
309.
[30] P. Viola, M. Jones, Robust real-time face detection, International Journal of Computer Vision
57 (2) (2004) 137-154.
[31] Y. Gizatdinova, O. Špakov, V. Surakka, Face typing: Visual gesture-based perceptual interface
for typing with a scrollable virtual keyboard, in: IEEE Workshop on the Applications of Computer
Vision (WACV’12), IEEE Computer Society, 2012, pp. 81-87.
[32] Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (7) (2012) 1409-1422.
[33] L. Farkas, Anthropometry of the Head and Face, second ed., Raven, New York, 1994.
[34] Y. Gizatdinova, O. Špakov, V. Surakka, Comparison of video-based pointing and selection
techniques for hands-free text entry, in: Proceedings of International Working Conference on
Advanced Visual Interfaces (AVI’12), ACM, New York, 2012, pp. 132-139.
[35] V. Surakka, M. Illi, P. Isokoski, Gazing and frowning as a new human-computer interaction
technique, ACM Transactions on Applied Perception 1 (1) (2004) 40-56.
[36] V. Surakka, P. Isokoski, M. Illi, K. Salminen, Is it better to gaze and frown or gaze and smile when controlling user interfaces?, in: Proceedings of HCI International 2005.
[37] O. Tuisku, V. Surakka, T. Vanhala, V. Rantanen, J. Lekkala, Wireless Face Interface: Using
voluntary gaze direction and facial muscle activations for human-computer interaction, Interacting
with Computers 24 (1) (2012) 1-9.
[38] K. Isbister, Emotion and motion: Games as inspiration for shaping the future interface,
Interactions 18 (5) (2011) 24-27.
[39] N. Bianchi-Berthouze, W.W. Kim, D. Patel, Does body movement engage you more in digital
game play? And why?, in: Proceedings of the International Conference of Affective Computing and
Intelligent Interaction (ACII 2007), LNCS 4738, 2007, pp. 102-113.
[40] L.F. Teófilo, P.A. Nogueira, P.B. Silva, GEMINI: A generic multi-modal natural interface
framework for videogames, in: Advances in Information Systems and Technologies
(WorldCIST’13), 2013, pp. 873-884.
[41] P. Sweetser, P. Wyeth, GameFlow: A model for evaluating player enjoyment in games, ACM
Computers in Entertainment 3 (3) (2005) 1-24.
[42] J.L. Sherry, K. Lukas, B. Greenberg, K. Lachlan, Video game uses and gratifications as predictors of use and game preference, in: P. Vorderer, J. Bryant (Eds.), Playing Video Games: Motives, Responses, and Consequences, Lawrence Erlbaum Associates, Mahwah, NJ, 2006, pp. 213-224.
[43] L.E. Nacke, M. Kalyn, C. Lough, R.L. Mandryk, Biofeedback game design: using direct and
indirect physiological control to enhance game interaction, in: Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems, ACM, New York, 2011, pp. 103-112.
Highlights
• We present a game that utilizes information on face position and facial expressions
• A user study was conducted to evaluate the game
• The utilization of face analysis enhanced players’ game experiences in many ways
• Facial information detected by computer vision offers a promising way for game control