
Chapter 3

MODELLING AND EVALUATING VERBAL AND NON-VERBAL COMMUNICATION IN TALKING ANIMATED INTERFACE AGENTS

Björn Granström and David House
KTH (Royal Institute of Technology)
Stockholm, Sweden
[email protected], [email protected]

Abstract The use of animated talking agents is a novel feature of many multimodal experimental spoken dialogue systems. The addition and integration of a virtual talking head has direct implications for the way in which users approach and interact with such systems. Established techniques for evaluating the quality, efficiency, and other impacts of this technology have not yet appeared in standard textbooks. The focus of this chapter is to look into the communicative function of the agent, both the capability to increase intelligibility of the spoken interaction and the possibility to make the flow of the dialogue smoother, through different kinds of communicative gestures such as gestures for emphatic stress, emotions, turntaking, and negative or positive system feedback. The chapter reviews state-of-the-art animated agent technologies and their applications primarily in dialogue systems. The chapter also includes examples of methods of evaluating communicative gestures in different contexts.

Keywords Audio-visual speech synthesis; Talking heads; Animated agents; Spoken dialogue systems; Visual prosody.

L. Dybkjær et al. (eds.), Evaluation of Text and Speech Systems, 65–98. © 2007 Springer.

1 Introduction

In our interaction with others, we easily and naturally use all of our sensory modalities as we communicate and exchange information. Our senses are exceptionally well adapted for these tasks, and our brain enables us to effortlessly integrate information from different modalities, fusing data to optimally meet the current communication needs. As we attempt to take advantage of


the effective communication potential of human conversation, we see an increasing need to embody the conversational partner using audio-visual verbal and non-verbal communication in the form of animated talking agents. The use of animated talking agents is currently a novel feature of many multimodal experimental spoken dialogue systems. The addition and integration of a virtual talking head has direct implications for the way in which users approach and interact with such systems (Cassell et al., 2000). However, established techniques for evaluating the quality, efficiency, and other impacts of this technology have not yet appeared in standard textbooks.

Effective interaction in dialogue systems involves both the presentation of information and the flow of interactive dialogue. A talking animated agent can provide the user with an interactive partner whose goal is to take the role of the human agent. An effective agent is one who is capable of supplying the user with relevant information, can fluently answer questions concerning complex user requirements, and can ultimately assist the user in a decision-making process through the interactive flow of conversation. One way to achieve believability is through the use of a talking head where information is transformed through text into speech, articulator movements, speech-related gestures, and conversational gestures. Useful applications of talking heads include aids for the hearing impaired, educational software, audio-visual human perception experiments (Massaro, 1998), entertainment, and high-quality audio-visual text-to-speech synthesis for applications such as news reading. The use of the talking head aims at increasing effectiveness by building on the user's social skills to improve the flow of the dialogue (Bickmore and Cassell, 2005). Visual cues to feedback, turntaking, and signalling the system's internal state are key aspects of effective interaction. There is also currently much interest in the use of visual cues as a means to ensure that participants in a conversation share an understanding of what has been said, i.e. a common ground (Nakano et al., 2003).

The talking head developed at KTH is based on text-to-speech synthesis. Audio speech synthesis is generated from a text representation in synchrony with visual articulator movements of the lips, tongue, and jaw. Linguistic information in the text is used to generate visual cues for relevant prosodic categories such as prominence, phrasing, and emphasis. These cues generally take the form of eyebrow and head movements, which we have termed "visual prosody" (Granström et al., 2001). These types of visual cues, with the addition of, for example, a smiling or frowning face, are also used as conversational gestures to signal such things as positive or negative feedback, turntaking regulation, and the system's internal state. In addition, the head can visually signal attitudes and emotions. Recently, we have been exploring data-driven methods to model articulation and facial parameters of major importance for conveying social signals and emotion.


The focus of this chapter is to look into the communicative function of the agent, both the capability to increase intelligibility of the spoken interaction and the possibility to make the flow of the dialogue smoother, through different kinds of communicative gestures such as gestures for emphatic stress, emotions, turntaking, and negative or positive system feedback. The chapter reviews state-of-the-art animated agent technologies and their applications primarily in dialogue systems. The chapter also includes some examples of methods of evaluating communicative gestures in different contexts.

2 KTH Parametric Multimodal Speech Synthesis

Animated synthetic talking faces and characters have been developed using a number of different techniques and for a variety of purposes for more than two decades. Historically, our approach is based on parameterised, deformable 3D facial models, controlled by rules within a text-to-speech framework (Carlson and Granström, 1997). The rules generate the parameter tracks for the face from a representation of the text, taking coarticulation into account (Beskow, 1995). We employ a generalised parameterisation technique to adapt a static 3D wireframe of a face for visual speech animation (Beskow, 1997). Based on concepts first introduced by Parke (1982), we define a set of parameters that will deform the wireframe by applying weighted transformations to its vertices. One critical difference from Parke's system, however, is that we have decoupled the model definitions from the animation engine. The animation engine uses different definition files that are created for each face. This greatly increases flexibility, allowing models with different topologies to be animated from the same control parameter program, such as a text-to-speech system.
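As an illustration of this parameterisation scheme, the following minimal sketch (not the KTH implementation; the parameter names, weights, and displacement directions are invented) shows how per-face definition data, kept separate from the animation engine, can map a handful of control parameters onto weighted vertex displacements.

```python
import numpy as np

# Hypothetical per-face "definition file" content: for each parameter,
# a list of (vertex_index, weight, displacement_direction) entries.
FACE_DEFINITION = {
    "jaw_rotation":  [(101, 1.0, np.array([0.0, -1.0, 0.0])),
                      (102, 0.6, np.array([0.0, -1.0, 0.1]))],
    "lip_rounding":  [(210, 0.8, np.array([0.0, 0.0, 1.0])),
                      (211, 0.8, np.array([0.0, 0.0, 1.0]))],
    "eyebrow_raise": [(305, 1.0, np.array([0.0, 1.0, 0.0]))],
}

def deform(vertices, parameters, definition):
    """Apply weighted per-vertex displacements for each active parameter.

    vertices   : (N, 3) array of rest-position vertex coordinates
    parameters : dict mapping parameter name -> scalar activation
    definition : per-face definition table (decoupled from this engine)
    """
    out = vertices.copy()
    for name, value in parameters.items():
        for vertex_index, weight, direction in definition.get(name, []):
            out[vertex_index] += value * weight * direction
    return out

# Example frame mixing an articulatory and a non-articulatory parameter.
rest_pose = np.zeros((400, 3))
frame = deform(rest_pose, {"jaw_rotation": 0.4, "eyebrow_raise": 0.2}, FACE_DEFINITION)
```

Because the engine only reads the definition table, swapping in a different face model is a matter of loading another definition file, which is the flexibility the decoupling is meant to provide.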

The models are made up of polygon surfaces that are rendered in 3D using standard computer graphics techniques. The surfaces can be articulated and deformed under the control of a number of parameters. The parameters are designed to allow for intuitive interactive or rule-based control. For the purposes of animation, parameters can be roughly divided into two (overlapping) categories: those controlling speech articulation and those used for non-articulatory cues and emotions. The articulatory parameters include jaw rotation, lip rounding, bilabial occlusion, labiodental occlusion, and tongue tip elevation. The non-articulatory category includes eyebrow raising, eyebrow shape, smile, gaze direction, and head orientation. Furthermore, some of the articulatory parameters such as jaw rotation can be useful in signalling non-verbal elements such as certain emotions. The display can be chosen to show only the surfaces or the polygons for the different components of the face. The surfaces can be made (semi-)transparent to display the internal parts of the model,


Figure 1. Some different versions of the KTH talking head.

including the tongue, palate, jaw, and teeth (Engwall, 2003). The internal parts are based on articulatory measurements using magnetic resonance imaging, electromagnetic articulography, and electropalatography, in order to ensure that the model's physiology and movements are realistic. This is of importance for language learning situations, where the transparency of the skin may be used to explain non-visible articulations (Cole et al., 1999; Massaro et al., 2003; Massaro and Light, 2003; Bälter et al., 2005). Several face models have been developed for different applications, and some of them can be seen in Figure 1. All can be parametrically controlled by the same articulation rules.

For stimuli preparation and explorative investigations, we have developed a control interface that allows fine-grained control over the trajectories for acoustic as well as visual parameters. The interface is implemented as an extension to the WaveSurfer application (http://www.speech.kth.se/wavesurfer) (Sjölander and Beskow, 2000), which is a freeware tool for recording, playing, editing, viewing, printing, and labelling audio data.

The interface makes it possible to start with an utterance synthesised from text, with the articulatory parameters generated by rule, and then interactively edit the parameter tracks for F0 and the visual (non-articulatory) parameters, as well as the durations of individual segments in the utterance, to produce specific cues. An example of the user interface is shown in Figure 2. In the top box a text can be entered in Swedish or English. The selection of language triggers separate text-to-speech systems with different phoneme definitions and rules, built in the Rulsys notation (Carlson and Granström, 1997). One example of a language-dependent rule is the visual realisation of interdentals in English, which does not apply to Swedish. The generated phonetic transcription can be edited. On pushing "Synthesize", rule-generated parameters are created and displayed in different panes below. The selection of parameters is user-controlled. The lower section contains the segmentation and the acoustic waveform. A talking face is displayed in a separate window. The acoustic synthesis can be exchanged for a natural utterance and synchronised to the face


Figure 2. The WaveSurfer user interface for parametric manipulation of the multimodal synthesis.

synthesis on a segment-by-segment basis by running the face synthesis with phoneme durations from the natural utterance. This requires a segmentation of the natural utterance, which can be done (semi-)automatically in, for example, WaveSurfer. The combination of natural and synthetic speech is useful for different experiments on multimodal integration and has been used in the Synface/Teleface project (see below). In language learning applications this feature could be used to add to the naturalness of the tutor's voice in cases when the acoustic synthesis is judged to be inappropriate. The parametric manipulation tool is used to experiment with, and define, gestures. Using this tool we have constructed a library of gestures that can be invoked via XML markup in the output text.
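The chapter does not specify the markup scheme, so the snippet below is only a hypothetical illustration of how a gesture from such a library might be invoked inline in the output text; the element and attribute names are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline markup; tag and attribute names are illustrative only.
marked_up_text = """
<utterance>
  Jag vill åka från Stockholm till
  <gesture name="eyebrow_raise" duration="500"/> Linköping.
</utterance>
"""

root = ET.fromstring(marked_up_text)
for element in root.iter("gesture"):
    # A real system would look the gesture up in the library and merge its
    # parameter trajectory into the face-synthesis control stream.
    print(element.get("name"), element.get("duration"), "ms")
```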

3 Data Collection and Data-Driven Visual Synthesis

More recently, we have begun experimenting with data-driven visual synthesis using a newly developed MPEG-4 compatible talking head (Beskow and Nordenberg, 2005). A data-driven approach enables us to capture the interaction between facial expression and articulation. This is especially important when trying to synthesize emotional expressions (cf. Nordstrand et al., 2004).

To automatically extract important facial movements we have employed a motion capture procedure. We wanted to be able to obtain both articulatory data


as well as other facial movements at the same time, and it was crucial that the accuracy of the measurements was good enough for resynthesis of an animated head. Optical motion tracking systems are gaining popularity for being able to handle the tracking automatically and for having good accuracy as well as good temporal resolution. The Qualisys system that we use has an accuracy of better than 1 mm with a temporal resolution of 60 Hz. The data acquisition and processing is very similar to earlier facial measurements carried out by Beskow et al. (2003). The recording set-up can be seen in Figure 3.

The subject could either pronounce sentences presented on the screen outside the window or be engaged in a (structured) dialogue with another person, as shown in the figure. In the present set-up, the second person cannot be recorded with the Qualisys system but is only video recorded. By attaching infrared reflecting markers to the subject's face (see Figure 3), the system is able to register the 3D coordinates of each marker at a frame rate of 60 Hz, i.e. every 17 ms. We used 30 markers to register lip movements as well as other facial movements such as eyebrows, cheek, chin, and eyelids. Additionally we placed three markers on the chest to register head movements with respect to the torso. A pair of spectacles with four markers attached was used as a reference to be able to factor out head and body movements when looking at the facial movements specifically.
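The chapter does not detail the normalisation procedure; the sketch below shows one standard way it could be done (a Kabsch/Procrustes rigid alignment, assumed here for illustration), using the spectacle reference markers to estimate the rigid head motion and then expressing the facial markers in a head-fixed coordinate system.

```python
import numpy as np

def rigid_transform(reference_rest, reference_frame):
    """Estimate rotation R and translation t mapping rest-pose reference
    markers onto their positions in the current frame (Kabsch algorithm)."""
    mu_rest = reference_rest.mean(axis=0)
    mu_frame = reference_frame.mean(axis=0)
    H = (reference_rest - mu_rest).T @ (reference_frame - mu_frame)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_frame - R @ mu_rest
    return R, t

def remove_head_motion(face_markers, reference_rest, reference_frame):
    """Express the facial markers of one 60 Hz frame in a head-fixed frame
    by inverting the rigid motion estimated from the spectacle markers."""
    R, t = rigid_transform(reference_rest, reference_frame)
    return (face_markers - t) @ R    # equivalent to R.T @ (x - t) per marker
```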

The databases we have thus far collected have enabled us to analyse speech movements such as articulatory variation in expressive speech (Nordstrand et al., 2004), in addition to providing us with data with which to develop data-driven visual synthesis. The data has also been used to directly drive synthetic 3D face models which adhere to the MPEG-4 Facial Animation (FA) standard (Pandzic and Forchheimer, 2002), enabling us to perform comparative

Figure 3. Data collection set-up with video and IR cameras, microphone, and a screen for prompts (left), and test subject with the IR-reflecting markers glued to the face (right).


Figure 4. Visual stimuli generated by data-driven synthesis from the happy database (left) and the angry database (right) using the MPEG-4 compatible talking head.

evaluation studies of different animated faces within the EU-funded PF-Star project (Beskow et al., 2004a, b).

The new talking head is based on the MPEG-4 FA standard and is a textured 3D model of a male face comprising around 15,000 polygons. Current work on data-driven visual synthesis is aimed at synthesising visual speech articulation for different emotions (Beskow and Nordenberg, 2005). The database consists of recordings of a male native Swedish amateur actor who was instructed to produce 75 short sentences with the six emotions happiness, sadness, surprise, disgust, fear, and anger, plus neutral (Beskow et al., 2004c). Using the databases of different emotions results in talking head animations which differ in articulation and visual expression. The audio synthesis used at present is the same as that for the parametric synthesis. Examples of the new head displaying two different emotions taken from the database are shown in Figure 4.
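As background on the MPEG-4 FA representation used by the new head, the following sketch (an assumption for illustration, not the authors' code) shows how a measured marker displacement can be expressed as a Facial Animation Parameter (FAP) value, where most FAPs are given in facial animation parameter units (FAPUs), each defined as a characteristic face distance divided by 1024.

```python
def displacement_to_fap(displacement_mm: float, reference_distance_mm: float) -> int:
    """Convert a displacement in mm to a FAP value in FAPU units
    (assumption: FAPU = reference face distance / 1024)."""
    fapu = reference_distance_mm / 1024.0
    return round(displacement_mm / fapu)

# Example: a 3 mm lower-lip displacement on a face whose mouth-nose
# separation is 30 mm (both numbers invented for illustration).
print(displacement_to_fap(3.0, 30.0))  # -> 102
```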

4 Evaluating Intelligibility and Information Presentation

One of the more striking examples of improvement and effectiveness in speech intelligibility is taken from the Synface project, which aims at improving telephone communication for the hearing impaired (Agelfors et al., 1998). A demonstrator of the system for telephony with a synthetic face that articulates in synchrony with a natural voice has now been implemented as a result of the project. The telephone interface used in the demonstrator is shown in Figure 5.

Evaluation studies within this project were mainly oriented towards investigating differences in intelligibility between speech alone and speech with the addition of a talking head. These evaluation studies were performed off-line, i.e. the speech material was manually labelled so that the visible speech


Figure 5. Telephone interface for Synface.

synthesis always generated the correct phonemes, rather than being generated from the Synface recogniser, which can introduce recognition errors. The results of a series of tests using vowel-consonant-vowel (VCV) words and hearing-impaired subjects showed a significant gain in intelligibility when the talking head was added to a natural voice. With the synthetic face, consonant identification improved from 29% to 54% correct responses. This compares to the 57% correct response result obtained by using the natural face. In certain cases, notably the consonants produced with visible lip movement (i.e., the bilabial and labiodental consonants), the response results were in fact better for the synthetic face than for the natural face. This points to the possibility of using overarticulation strategies for the talking face in these kinds of applications. Recent results indicate that a certain degree of overarticulation can be advantageous in improving intelligibility (Beskow et al., 2002b).

Similar intelligibility tests have been run using normal-hearing subjects where the audio signal was degraded by adding white noise (Agelfors et al., 1998). Similar results were obtained. For example, for a synthetic male voice, consonant identification improved from 31% without the face to 45% with the face.

Hearing-impaired persons often subjectively report that some speakers are much easier to speech-read than others. It is reasonable to hypothesise that this variation depends on a large number of factors, such as rate of speaking, amplitude and dynamics of the articulatory movements, orofacial anatomy of the speaker, presence of facial hair, and so on. Using traditional techniques, however, it is difficult to isolate these factors to get a quantitative measure of


their relative contribution to readability. In an attempt to address this issue, we employ a synthetic talking head that allows us to generate stimuli where each variable can be studied in isolation. In this section we focus on a factor that we will refer to as articulation strength, which is implemented as a global scaling of the amplitude of the articulatory movements.

In one experiment the articulation strength has been adjusted by applying a global scaling factor to the parameters marked with an x in Table 1. They can all be varied between 25% and 200% of normal. Normal is defined as the default articulation produced by the rules, which are hand-tuned to match a target person's articulation.
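A minimal sketch of this global scaling, assuming the parameter tracks are stored as per-frame value lists and scaled around their rest values (the rest-value handling is an assumption):

```python
# Parameter names follow Table 1; the track values and rest values are invented.
SCALED_PARAMETERS = ["jaw_rotation", "lip_protrusion", "mouth_spread",
                     "tongue_tip_elevation"]          # marked "x" in Table 1

def scale_articulation(tracks, rest_values, strength):
    """Scale the amplitude of articulatory movement; strength=1.0 is default."""
    assert 0.25 <= strength <= 2.0                    # 25% to 200% of normal
    scaled = {}
    for name, track in tracks.items():
        if name in SCALED_PARAMETERS:
            rest = rest_values[name]
            scaled[name] = [rest + strength * (value - rest) for value in track]
        else:
            scaled[name] = list(track)
    return scaled
```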

The default parameter settings are chosen to optimise perceived similarity between a target speaker and the synthetic faces. However, it is difficult to know whether these settings are optimal in a lip-reading situation for hearing-impaired persons. An informal experiment was pursued to find out the preferred articulation strength and its variance. Twenty-four subjects, all closely connected to the field of aural rehabilitation either professionally or as hearing impaired, were asked to choose the most intelligible face out of eight recordings of the Swedish sentence "De skrattade mycket högt" (They laughed very loudly). The subjects viewed the eight versions in eight separate windows on a computer screen and were allowed to watch and compare the versions as many times as they wished by clicking on each respective window to activate the recordings. The recordings had 25%, 50%, 75%, 100%, 112%, 125%, 150%, and 175% of the default strength of articulation. The default strength of articulation is based on the phoneme parameter settings for visual speech synthesis as developed by Beskow (1997). The different articulation strengths were implemented as a global scaling of the amplitude of the articulatory movements. The amount of coarticulation was not altered.

The average preferred hyperarticulation was found to be 24%, given the task to optimise the subjective ability to lip-read. The highest and lowest preferred

Table 1. Parameters used for articulatory control of the face. The second column indicates which ones are adjusted in the experiments described here.

Parameter                 Adjusted in experiment
Jaw rotation              x
Labiodental occlusion
Bilabial occlusion
Lip rounding
Lip protrusion            x
Mouth spread              x
Tongue tip elevation      x


values were 150% and 90%, respectively, with a standard deviation of 16%. The option of setting the articulation strength to the user's subjective preference could be included in the Synface application. The question of whether or not the preferred setting genuinely optimises intelligibility and naturalness was also studied, as described below.

Experiment 1: Audio-visual consonant identification. To test the possible quantitative impact of articulation strength, as defined in the previous section, we performed a VCV test. Three different articulation strengths were used: 75%, 100%, and 125% of the default articulation strength for the visual speech synthesis. Stimuli consisted of nonsense words in the form of VCV combinations. Seventeen consonants were used: /p, b, m, f, v, t, d, n, s, l, r, k, g, ng, sj, tj, j/ in two symmetric vowel contexts /a, U/, yielding a total of 34 different VCV words. The task was to identify the consonant. (The consonants are given in Swedish orthography – the non-obvious IPA correspondences are: ng = /ŋ/, sj = /ɧ/, tj = /ɕ/.) Each word was presented with each of the three levels of articulation strength. The list was randomised. To avoid starting and ending effects, five extra stimuli were inserted at the beginning and two at the end.
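The stimulus list can be thought of as the full crossing of consonants, vowel contexts, and articulation strengths; the sketch below reproduces that design (the filler selection and exact ordering details are assumptions).

```python
import itertools
import random

consonants = ["p", "b", "m", "f", "v", "t", "d", "n", "s",
              "l", "r", "k", "g", "ng", "sj", "tj", "j"]     # 17 consonants
vowels = ["a", "U"]                                          # symmetric VCV contexts
strengths = [0.75, 1.00, 1.25]                               # articulation strength

# 34 VCV words x 3 strengths = 102 scored stimuli, in randomised order.
stimuli = [(v + c + v, s)
           for (c, v), s in itertools.product(itertools.product(consonants, vowels),
                                              strengths)]
random.shuffle(stimuli)

# Extra items to absorb starting and ending effects (not scored).
fillers_start = random.sample(stimuli, 5)
fillers_end = random.sample(stimuli, 2)
presentation_list = fillers_start + stimuli + fillers_end
```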

Stimuli were presented audio-visually by the synthetic talking head. The audio was taken from the test material from the Teleface project recordings of natural speech from a male speaker (Agelfors et al., 1998). The audio had previously been segmented and labelled, allowing us to generate control parameter tracks for facial animation using the visual synthesis rules.

The nonsense words were presented in white masking noise at a signal-to-noise ratio of 3 dB.

Twenty-four subjects participated in the experiment. They were all undergraduate students at KTH. The experiments were run in plenary by presenting the animations on a large screen using an overhead projector. The subjects responded on pre-printed answer sheets.

The mean results for the different conditions can be seen in Table 2. For the/a/ context there are only minor differences in the identification rate, while the

Table 2. Percent correct consonant identification in the VCV test with respect to place of articulation, presented according to vowel context and articulation strength.

Articulation strength (%)    /aCa/    /UCU/
75                           78.7     50.5
100                          75.2     62.2
125                          80.9     58.1


results for the /U/ context are generally worse, especially for the 75% articulation strength condition. A plausible reason for this difference lies in the better visibility of tongue articulations in the more open /a/ vowel context than in the context of the rounded /U/ vowel. It can also be speculated that movements observed on the outside of the face can add to the superior readability of consonants in the /a/ context. However, we could not find evidence for this in a study based on simultaneous recordings of face and tongue movements (Beskow et al., 2003; Engwall and Beskow, 2003). In general, the contribution of articulation strength to intelligibility might be different with other speech material such as connected sentences.

Experiment 2: Rating of naturalness. Eighteen sentences from the Teleface project (Agelfors et al., 1998) were used for a small preference test. Each sentence was played twice: once with standard articulation (100%) and once with smaller (75%) or greater (125%) articulation strength. The set of subjects, presentation method, and noise masking of the audio was the same as in Experiment 1 (the VCV test). The subjects were asked to report which of the two variants seemed more natural, or if they were judged to be of equal quality. The test consisted of 15 stimulus pairs, presented to 24 subjects. To avoid starting and ending effects, two extra pairs were inserted at the beginning and one at the end. The results can be seen in Table 3. The only significant preference was for the 75% version, contrary to the initial informal experiment. However, the criterion in the initial test was readability rather than naturalness.

The multimodal synthesis software, together with a control interface based on the WaveSurfer platform (Sjölander and Beskow, 2000), allows for the easy production of material addressing the articulation strength issue. There is a possible conflict between producing the most natural and the most easily lip-read face. However, under some conditions it might be favourable to trade off some naturalness for better readability. For example, the dental viseme cluster [r, n, t, d, s, and l] could possibly gain discriminability in connection with closed vowels if tongue movements could be to some extent hyperarticulated and well rendered. Of course the closed vowels should be as open as possible without jeopardising the overall vowel discriminability.

The optimum trade-off between readability and naturalness is certainly also a personal characteristic. It seems likely that hearing-impaired people would

Table 3. Judged naturalness compared to the default (100%) articulation strength.

Articulation strength (%)    Less natural    Equal    More natural
75                           31.67           23.33    45.00
125                          41.67           19.17    39.17


emphasise readability over naturalness. Therefore it could be considered that in certain applications, like the Synface software, users could be given the option of setting the articulation strength themselves.

Experiment 3: Targeted audio. A different type of application which can potentially benefit from a talking head in terms of improved intelligibility is targeted audio. To transmit highly directional sound, targeted audio makes use of a technique known as "parametric array" (Westervelt, 1963). This type of highly directed sound can be used to communicate a voice message to a single person within a group of people (e.g., in a meeting situation or at a museum exhibit) without disturbing the other people. Within the framework of the EU project CHIL, experiments have been run to evaluate the intelligibility of such targeted audio combined with a talking head (Svanfeldt and Olszewski, 2005).

Using an intelligibility test similar to the ones described above, listeners were asked to identify the consonant in a series of VCV words. The seven consonants to be identified were [f, s, m, n, k, p, t], uttered in an [aCa] frame using both audio and audio-visual speech synthesis. Four conditions were tested: audio with and without the talking head, each with the audio either targeted directly towards the listener (the 0° condition) or targeted 45° away from the listener (the 45° condition). The subjects were seated in front of a computer screen with the targeted audio device next to the screen (see Figure 6).

The addition of the talking head increased listener recognition accuracy from 77% to 93% in the 0° condition and, even more dramatically, from 58% to 88% in the 45° condition. Thus the talking head can serve to increase intelligibility and even help to compensate for situations in which the listener may be moving or not located optimally in the audio beam.

A more detailed analysis of the data revealed that consonant confusions in the audio-only condition tended to occur between [p] and [f] and between [m] and [n]. These confusions were largely resolved by the addition of the talking

Figure 6. Schematic view of the experimental set-up (Svanfeldt and Olszewski, 2005).


head. This is to be expected, since the addition of the visual modality provides place of articulation information for the labial articulations.

5 Evaluating Visual Cues for Prominence

Another quite different example of the contribution of the talking head to information presentation is taken from the results of perception studies in which the percept of emphasis and syllable prominence is enhanced by eyebrow and head movements. In an experiment investigating the contribution of eyebrow movement to the perception of prominence in Swedish (Granström et al., 1999), a test sentence was created using our audio-visual text-to-speech synthesis in which the acoustic cues and lower-face visual cues were the same for all stimuli. Articulatory movements were created by using the text-to-speech rule system. The upper-face cues were eyebrow movements where the eyebrows were raised on successive words in the sentence. The movements were created by hand-editing the eyebrow parameter. The degree of eyebrow raising was chosen to create a subtle movement that was distinctive although not too obvious. The total duration of the movement was 500 ms and comprised a 100 ms dynamic raising part, a 200 ms static raised portion, and a 200 ms dynamic lowering part. In the stimuli, the acoustic signal was always the same, and the sentence was synthesized as one phrase. Six versions were included in the experiment: one with no eyebrow movement and five where eyebrow raising was placed on one of the five content words in the test sentence. The words with concomitant eyebrow movement were generally perceived as more prominent than words without the movement. This tendency was even greater for a subgroup of non-native (L2) listeners. The mean increase in prominence response following an eyebrow movement was 24% for the Swedish native (L1) listeners and 39% for the L2 group. One example result is shown in Figure 7. Similar results have also been obtained for Dutch by Krahmer et al. (2002a).
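The hand-edited eyebrow gesture is essentially a rise-hold-fall trajectory; the sketch below generates such a trajectory sampled at an assumed 60 Hz frame rate (the amplitude value and function names are illustrative, not taken from the rule system).

```python
FRAME_RATE = 60.0  # Hz; assumed animation frame rate

def rise_hold_fall(rise_ms, hold_ms, fall_ms, amplitude):
    """Return per-frame parameter values for a rise-hold-fall gesture."""
    def frames(ms):
        return max(1, round(ms / 1000.0 * FRAME_RATE))
    rise = [amplitude * (i + 1) / frames(rise_ms) for i in range(frames(rise_ms))]
    hold = [amplitude] * frames(hold_ms)
    fall = [amplitude * (1 - (i + 1) / frames(fall_ms)) for i in range(frames(fall_ms))]
    return rise + hold + fall

# The 500 ms eyebrow raise described above: 100 ms rise, 200 ms hold, 200 ms fall.
eyebrow_track = rise_hold_fall(100, 200, 200, amplitude=0.04)  # subtle raise
```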

In another study (House et al., 2001) both eyebrow and head movements were tested as potential cues to prominence. The goal of the study was twofold. First of all, we wanted to see if head movement (nodding) is a more powerful cue to prominence than eyebrow movement, by virtue of being a larger movement. Secondly, we wanted to test the perceptual sensitivity to the timing of both eyebrow and head movement in relation to the syllable.

As in the previous experiment, our rule-based audio-visual synthesiser was used for stimuli preparation. The test sentence used to create the stimuli for the experiment was the same as that used in an earlier perception experiment designed to test acoustic cues only (House, 2001). The sentence, Jag vill bara flyga om vädret är perfekt (I only want to fly if the weather is perfect), was synthesized with focal accent rises on both flyga (fly) (Accent 2) and vädret (weather) (Accent 1). The F0 rise excursions corresponded to the stimulus


Figure 7. Prominence responses in percent for each content word for the acoustically neutral reading of the stimulus sentence "När pappa fiskar stör p/Piper Putte", with eyebrow movement on "Putte". Subjects are grouped as all, Swedish (sw), and foreign (fo).

in the earlier experiment which elicited nearly equal responses for flyga and vädret in terms of the most prominent word in the sentence. The voice used was the Infovox 330 Ingmar MBROLA voice.

Eyebrow and head movements were then created by hand-editing the respective parameters. The eyebrows were raised to create a subtle movement that was distinctive although not too obvious. In quantitative terms the movement comprised 4% of the total possible movement. The head movement was a slight vertical lowering comprising 3% of the total possible vertical head rotation. Statically, the displacement is difficult to perceive, while dynamically, the movement is quite distinct. The total duration of both eyebrow and head movement was 300 ms and comprised a 100 ms dynamic onset, a 100 ms static portion, and a 100 ms dynamic offset.

Two sets of stimuli were created: set 1, in which both eyebrow and head movement occurred simultaneously, and set 2, in which the movements were separated and potentially conflicting with each other. In set 1, six stimuli were created by synchronizing the movement in stimulus 1 with the stressed vowel of flyga. This movement was successively shifted in intervals of 100 ms towards vädret, resulting in the movement in stimulus 6 being synchronized with the stressed vowel of vädret. In set 2, stimuli 1–3 were created by fixing the head movement to synchronize with the stressed vowel of vädret and successively shifting the eyebrow movements from the stressed vowel of flyga towards vädret in steps of 100 ms. Stimuli 4–6 were created by fixing the eyebrow movement to vädret and shifting the head movement from flyga towards vädret. The acoustic signal and articulatory movements were the same for all stimuli. A schematic illustration of the stimuli is presented in Figure 8.
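The timing design of the two stimulus sets can be summarised as onset times shifted in 100 ms steps; in the sketch below the vowel onset times are invented placeholders, and only the step structure follows the description above.

```python
FLYGA_ONSET_MS = 820    # hypothetical onset of the stressed vowel of "flyga"
VADRET_ONSET_MS = 1320  # hypothetical onset of the stressed vowel of "vädret"

# Set 1: eyebrow and head move together, shifted from "flyga" to "vädret".
set1 = [{"eyebrow": FLYGA_ONSET_MS + 100 * i,
         "head":    FLYGA_ONSET_MS + 100 * i} for i in range(6)]

# Set 2, stimuli 1-3: head fixed on "vädret", eyebrows shifted towards it;
# stimuli 4-6: eyebrows fixed on "vädret", head shifted towards it.
set2 = ([{"eyebrow": FLYGA_ONSET_MS + 100 * i, "head": VADRET_ONSET_MS}
         for i in range(3)] +
        [{"eyebrow": VADRET_ONSET_MS, "head": FLYGA_ONSET_MS + 100 * i}
         for i in range(3)])
```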


Figure 8. Schematic illustration of face gesture timing.

Figure 9. Results for stimulus set 1 showing prominence response for vädret and confidence in percent.

The results from stimulus set 1, where eyebrow and head movements occurred simultaneously, clearly reflect the timing aspect of these stimuli, as can be seen in Figure 9, where percent votes for vädret increase successively as the movement is shifted in time from flyga to vädret.

It is clear from the results that combined head and eyebrow movements of the scope used in the experiment are powerful cues to prominence when synchronized with the stressed vowel of the potentially prominent word and when no conflicting acoustic cue is present. The results demonstrate a general sensitivity to the timing of these movements, at least on the order of 100 ms, as


the prominence response moves successively from the word flyga to the word vädret. However, there is a tendency for integration of the movements with the nearest potentially prominent word, thus accounting for the jump in prominence response between stimuli 3 and 4 in set 1. This integration is consistent with the results of similar experiments using visual and auditory segmental cues (Massaro et al., 1996).

As could be expected, the results from set 2, where eyebrow and head movements were in conflict, showed more stimulus ambiguity. Head movement, however, demonstrated a slight advantage in signalling prominence. This advantage can perhaps be explained by the fact that the movement of the head may be visually more salient than the relatively subtle eyebrow movement. The advantage might even be increased if the head is observed from a greater distance. In an informal demonstration, where observers were further away from the computer screen than the subjects in the experiment, the head-movement advantage was quite pronounced.

A number of questions remain to be answered, as a perception experiment of this type is necessarily restricted in scope. Amplitude of movement was not addressed in this investigation. If, for example, eyebrow movement were exaggerated, would this counterbalance the greater power of head movement? A perhaps even more crucial question is the interaction between the acoustic and visual cues. There was a slight bias for flyga to be perceived as more prominent (one subject even chose flyga in 11 of the 12 stimuli), and indeed the F0 excursion was greater for flyga than for vädret, even though this was ambiguous in the previous experiment. In practical terms of multimodal synthesis, however, it will probably be sufficient to combine cues, even though it would be helpful to have some form of quantified weighting factor for the different acoustic and visual cues.

Duration of the eyebrow and head movements is another consideration which was not tested here. It seems plausible that similar onset and offset durations (100 ms) combined with substantially longer static displacements would serve as conversational signals rather than as cues to prominence. In this way, non-synchronous eyebrow and head movements can be combined to signal both prominence and, for example, feedback giving or seeking. Some of the subjects also commented that the face seemed to convey a certain degree of irony in some of the stimuli in set 2, most likely in those stimuli with non-synchronous eyebrow movement. Experimentation with, and evaluation of, potential cues for feedback seeking was pursued in the study reported on in Section 6.

6 Evaluating Prosody and Interaction

The use of a believable talking head can trigger the user's social skills, such as using greetings, addressing the agent by name, and generally socially


chatting with the agent. This was clearly shown by the results of the public use of the August system (Bell and Gustafson, 1999a) during a period of 6 months (see Section 9). These promising results have led to more specific studies on visual cues for feedback (e.g., Granström et al., 2002), in which smile, for example, was found to be the strongest cue for affirmative feedback. Further detailed work on turntaking regulation, feedback seeking and giving, and signalling of the system's internal state will enable us to improve the gesture library available for the animated talking head and continue to improve the effectiveness of multimodal dialogue systems. One of the central claims in many theories of conversation is that dialogue partners seek and provide evidence about the success of their interaction (Clark and Schaefer, 1989; Traum, 1994; Brennan, 1990). That is, partners tend to follow a proof procedure to check whether their utterances were understood correctly or not, and constantly exchange specific forms of feedback that can be affirmative ("go on") or negative ("do not go on"). Previous research has brought to light that conversation partners can monitor the dialogue this way on the basis of at least two kinds of features not encoded in the lexico-syntactic structure of a sentence: namely, prosodic and visual features. First, utterances that function as negative signals appear to differ prosodically from affirmative ones in that they are produced with more "marked" settings (e.g., higher, louder, slower) (Shimojima et al., 2002; Krahmer et al., 2002b). Second, other studies reveal that, in face-to-face interactions, people signal by means of facial expressions and specific body gestures whether or not an utterance was correctly understood (Gill et al., 1999).

Given that current spoken dialogue systems are prone to error, mainly because of problems in the automatic speech recognition (ASR) engine of these systems, a sophisticated use of feedback cues from the system to the user is potentially very helpful to improve human–machine interactions as well (e.g., Hirschberg et al., 2001). There are currently a number of advanced multimodal user interfaces in the form of talking heads that can generate audio-visual speech along with different facial expressions (Beskow, 1995, 1997; Beskow et al., 2001; Granström et al., 2001; Massaro, 2002; Pelachaud, 2002; Tisato et al., 2005). However, while such interfaces can be accurately modified in terms of a number of prosodic and visual parameters, there are as yet no formal models that make explicit how exactly these need to be manipulated to synthesise convincing affirmative and negative cues.

One interesting question, for instance, is what the strength relation is between the potential prosodic and visual cues. The goal of one study (Granström et al., 2002) was to gain more insight into the relative importance of specific prosodic and visual parameters for giving feedback on the success of the interaction. In this study, use is made of a talking head whose prosodic and visual features are orthogonally varied in order to create stimuli that are presented to


subjects who have to respond to these stimuli and judge them as affirmative or negative backchannelling signals.

The stimuli consisted of an exchange between a human, who was intended to represent a client, and the face, representing a travel agent. An observer of these stimuli could only hear the client's voice, but could both see and hear the face. The human utterance was a natural speech recording and was exactly the same in all exchanges, whereas the speech and the facial expressions of the travel agent were synthetic and variable. The fragment that was manipulated always consisted of the following two utterances:

Human: "Jag vill åka från Stockholm till Linköping." ("I want to go from Stockholm to Linköping.")

Head: "Linköping."

The stimuli were created by orthogonally varying six parameters, shown in Table 4, using two possible settings for each parameter: one which was hypothesised to lead to affirmative feedback responses, and one which was hypothesised to lead to negative responses.

The parameter settings were largely created by intuition and observing human productions. However, the affirmative and negative F0 contours were based on two natural utterances. In Figure 10 an example of the all-negative and all-affirmative face can be seen.

The actual testing was done via a group experiment using a projected image on a large screen. The task was to respond to this dialogue exchange in terms of whether the head signals that he understands and accepts the human utterance, or rather signals that the head is uncertain about the human utterance. In addition, the subjects were required to express on a 5-point scale how confident they were about their response. A detailed description of the experiment and the analysis can be found in Granström et al. (2002). Here, we would only like to highlight the strength of the different acoustic and visual cues. In Figure 11

Table 4. Different parameters and parameter settings used to create different stimuli.

Parameter      Affirmative setting       Negative setting
Smile          Head smiles               Neutral expression
Head move      Head nods                 Head leans back
Eyebrows       Eyebrows rise             Eyebrows frown
Eye closure    Eyes narrow slightly      Eyes open wide
F0 contour     Declarative               Interrogative
Delay          Immediate reply           Delayed reply
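Varying the six parameters of Table 4 orthogonally amounts to a full factorial design; the sketch below enumerates the 2^6 = 64 combinations (whether all of them were actually presented is not stated in the chapter).

```python
import itertools

settings = {
    "smile":       ["head smiles", "neutral expression"],
    "head_move":   ["head nods", "head leans back"],
    "eyebrows":    ["eyebrows rise", "eyebrows frown"],
    "eye_closure": ["eyes narrow slightly", "eyes open wide"],
    "f0_contour":  ["declarative", "interrogative"],
    "delay":       ["immediate reply", "delayed reply"],
}

# Every combination of affirmative/negative settings for the reply "Linköping."
stimuli = [dict(zip(settings, combination))
           for combination in itertools.product(*settings.values())]
assert len(stimuli) == 64
```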


Figure 10. The all-negative and all-affirmative faces, sampled at the end of the first syllable of Linköping.

Figure 11. The mean response value difference for stimuli with the indicated cues set to their affirmative and negative values.

the mean difference in response value (the response weighted by the subjects' confidence ratings) is presented for negative and affirmative settings of the different parameters. The effects of Eye closure and Delay are not significant, but the trends observed in the means are clearly in the expected direction. There appears to be a strength order, with Smile being the most important factor, followed by F0 contour, Eyebrows, Head movement, Eye closure, and Delay.
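The confidence-weighted response value can be sketched as follows (the exact coding used by the authors, e.g. the sign convention, is an assumption):

```python
def response_value(judgement_is_affirmative: bool, confidence_1_to_5: int) -> int:
    """Weight an affirmative (+1) or negative (-1) judgement by confidence."""
    sign = 1 if judgement_is_affirmative else -1
    return sign * confidence_1_to_5

responses = [(True, 5), (True, 3), (False, 2)]   # invented example data
mean_value = sum(response_value(j, c) for j, c in responses) / len(responses)
print(mean_value)  # -> 2.0
```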


This study clearly shows that subjects are sensitive to both acoustic and visual parameters when they have to judge utterances as affirmative or negative feedback signals. One obvious next step is to test whether the fluency of human–machine interactions is helped by the inclusion of such feedback cues in the dialogue management component of a system.

7 Evaluating Visual Cues to Sentence Mode

In distinguishing questions from statements, prosody has a well-established role, especially in cases such as echo questions where there is no syntactic cue to the interrogative mode. Almost without exception this has been shown only for the auditory modality. Inspired by the results of the positive and negative feedback experiment presented in Section 6, an experiment was carried out to test whether similar visual cues could influence the perception of question and statement intonation in Swedish (House, 2002). Parameters were hand-manipulated to create two configurations: one expected to elicit more interrogative responses and the other expected to elicit more declarative responses. These configurations were similar, although not identical, to the positive and negative configurations used in the feedback experiment. Hypothesised cues for the interrogative mode consisted of a slow up–down head nod and eyebrow lowering. Hypothesised cues for the declarative mode consisted of a smile, a short up–down head nod, and eye narrowing. The declarative head nod was of the same type as was used in the prominence experiments reported in Section 5. Twelve different intonation contours were used in the stimuli, ranging from a low final falling contour (clearly declarative) to a high final rise (clearly interrogative). A separate perception test using these audio-only stimuli resulted in 100% declarative responses for the low falling contour and 100% interrogative responses for the high final rise, with a continuum of uncertainty in between.

The influence of the visual cues on the audio cues was only marginal. While the hypothesised cues for the declarative mode (smile, short head nod, and eye narrowing) elicited somewhat more declarative responses for the ambiguous and interrogative intonation contours, the hypothesised cues for the interrogative mode (slow head nod and eyebrow lowering) led to more uncertainty in the responses for both the declarative intonation contours and the interrogative intonation contours (i.e., responses for the declarative contours were only slightly more interrogative than in the audio-only condition, while responses for the interrogative contours were actually more declarative). Similar results were obtained for English by Srinivasan and Massaro (2003). Although they were able to demonstrate that the visual cues of eyebrow raising and head tilting, synthesised based on a natural model, reliably conveyed question intonation, their experiments showed a weak visual effect relative to a strong audio effect of intonation. This weak visual effect remained despite attempts to enhance the visual cues and make the audio information more ambiguous.


The dominance of the audio cues in the context of these question/statement experiments may indicate that question intonation is less variable than visual cues for questions, or we simply may not yet know enough about the combination of visual cues and their timing in signalling question mode to successfully override the audio cues. Moreover, a final high rising intonation is generally a very robust cue to question intonation, especially in the context of perception experiments with binary response alternatives.

8 Evaluation of Agent Expressiveness and Attitude

In conjunction with the development of data-driven visual synthesis as reported in Section 3, two different evaluation studies have been carried out. One was designed to evaluate expressive visual speech synthesis in the framework of a virtual language tutor (cf. Granström, 2004). The experiment, reported in detail in Beskow and Cerrato (2006), used a method similar to the one reported on in Section 6. The talking head had the role of a language tutor who was engaged in helping a student of Swedish improve her pronunciation. Each interaction consisted of the student's pronunciation of a sentence including a mispronounced word. The virtual tutor responds by correcting the mispronunciation, after which the student makes a new attempt in one of three ways: with the correct pronunciation, with the same mistake, or with a new mistake. The test subjects hear the student's pronunciation and both see and hear the tutor. The task was to judge which emotion the talking head expressed in its final turn of the interaction.

Visual synthesis derived from a happy, angry, sad, and neutral database was used to drive the new MPEG-4 compatible talking head as described in Section 3. For the audio part of the stimuli, a pre-recorded human voice was used to portray the three emotions, since we have not yet developed suitable audio data-driven synthesis with different emotions. All possible combinations of audio and visual stimuli were tested. The results indicated that for stimuli where the audio and visual emotion matched, listener recognition of each emotion was quite good: 87% for neutral and happy, 70% for sad, and 93% for angry. For the mismatched stimuli, the visual elements seemed to have a stronger influence than the audio elements. These results point to the importance of matching audio and visual emotional content and show that subjects attend to the visual element to a large degree when judging agent expressiveness and attitude.

In another experiment, reported on in House (2006), the new talking head was evaluated in terms of degrees of friendliness. Databases of angry, happy, and neutral emotions were used to synthesise the utterance Vad heter du? (What is your name?). Samples of the three versions of the visual stimuli are presented in Figure 12. The three versions of the visual synthesis were combined with two audio configurations: a low, early pitch peak and a high, late pitch


Figure 12. Visual stimuli generated by data-driven synthesis from the angry database (left), the happy database (middle), and the neutral database (right). All samples are taken from the middle of the second vowel of the utterance Vad heter du? (What is your name?).

Figure 13. Results from the data-driven synthesis test showing the cumulative "friendliness score" for each stimulus.

peak, resulting in six stimuli. Previous experiments showed that the high, late pitch peak elicited more friendly responses (House, 2005). A perception test using these six stimuli was carried out by asking 27 subjects to indicate on a 4-point scale how friendly they felt the agent was.

The results are presented in Figure 13. It is quite clear that the face synthesised from the angry database elicited the lowest friendliness score. However, there is still evidence of interaction from the audio, as the angry face with the late, high peak received a higher friendliness score than did the angry face with the early, low peak. The faces from the other databases (happy and neutral) elicited more friendliness responses, but neither combination of face and


audio received a particularly high friendliness score. The happy face did notelicit more friendliness responses than did the neutral face, but the influence ofthe audio stimuli remained consistent for all the visual stimuli. Nevertheless,the results show that the visual modality can be a powerful signal of attitude.Moreover, the effects of the audio cues for friendliness indicate that subjectsmake use of both modalities in judging speaker attitude. These results stressthe need to carefully consider both the visual and audio aspects of expressivesynthesis.
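The "friendliness score" plotted in Figure 13 is essentially an accumulation of the subjects' ratings per stimulus. The short sketch below illustrates one way such cumulative and mean scores could be computed; the data layout, stimulus labels, and example values are assumptions for illustration, not the actual test data.

```python
from collections import defaultdict

# Each rating: (visual_database, audio_configuration, rating on a 1-4 scale).
# The tuples below are placeholders showing the expected format, not real data.
ratings = [
    ("angry",   "low_early",  1),
    ("angry",   "high_late",  2),
    ("neutral", "high_late",  3),
    ("happy",   "high_late",  3),
]

def friendliness_scores(ratings):
    """Cumulative and mean friendliness per (visual, audio) stimulus."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for visual, audio, score in ratings:
        totals[(visual, audio)] += score
        counts[(visual, audio)] += 1
    return {stim: (totals[stim], totals[stim] / counts[stim]) for stim in totals}

for stim, (cumulative, mean_score) in friendliness_scores(ratings).items():
    print(stim, "cumulative:", cumulative, "mean:", round(mean_score, 2))
```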

9 Agent and System Evaluation Studies

The main emphasis of the evaluation studies reported on in this chapter has been the evaluation of the intelligibility and the dialogue functions of the talking head agent as presented to subjects in experimental test situations. During the last decade, however, a number of experimental applications using the talking head have been developed at KTH (see Gustafson, 2002 for a review). Two examples that will be mentioned here are the August project, which was a dialogue system in public use, and the Adapt multimodal real-estate agent. Finally, we will also report on some studies aimed at evaluating user satisfaction in general during exposure to the August and the Adapt dialogue systems.

9.1 The August System

The Swedish author, August Strindberg, provided inspiration to create the animated talking agent used in a dialogue system that was on display during 1998 as part of the activities celebrating Stockholm as the Cultural Capital of Europe (Gustafson et al., 1999). The system was a fully automatic dialogue system using modules for speech recognition and audio-visual speech synthesis. This dialogue system made it possible to combine several domains, thanks to the modular functionality of the architecture. Each domain had its own dialogue manager, and an example-based topic spotter was used to relay the user utterances to the appropriate dialogue manager. In this system, the animated agent "August" presented different tasks, such as taking visitors on a trip through the Department of Speech, Music, and Hearing, giving street directions, and reciting short excerpts from the works of August Strindberg when waiting for someone to talk to. The system was built into a kiosk and placed in public in central Stockholm for a period of 6 months. One of the main challenges of this arrangement was the open situation, with no explicit instructions being given to the visitor. A simple visual "visitor detector" made August start talking about one of his knowledge domains.
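As an illustration of this modular arrangement, the sketch below shows how an example-based topic spotter could route an incoming utterance to one of several per-domain dialogue managers. It is a minimal, assumed reconstruction of the general idea, not the actual August implementation; the class names, similarity measure, and example phrases are invented for the illustration.

```python
from difflib import SequenceMatcher

class DialogueManager:
    """Stand-in for a per-domain dialogue manager (hypothetical)."""
    def __init__(self, domain, examples):
        self.domain = domain
        self.examples = examples          # example utterances for this domain

    def respond(self, utterance):
        return f"[{self.domain}] handling: {utterance}"

def similarity(a, b):
    # Simple string similarity; a real topic spotter would use richer features.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def route(utterance, managers):
    """Example-based topic spotting: pick the domain whose stored example
    utterances best match the incoming utterance."""
    best = max(managers,
               key=lambda m: max(similarity(utterance, ex) for ex in m.examples))
    return best.respond(utterance)

managers = [
    DialogueManager("directions", ["how do I get to the old town",
                                   "where is the royal palace"]),
    DialogueManager("strindberg", ["who was august strindberg",
                                   "when was strindberg born"]),
    DialogueManager("department", ["what do you do at the department",
                                   "tell me about speech research"]),
]

print(route("where was Strindberg born", managers))
```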

To ensure that the recorded user utterances were actually directed to the system, a push-to-talk button was used to initiate the recordings. The speech recordings resulted in a database consisting of 10,058 utterances from 2,685 speakers. The utterances were transcribed orthographically and labelled for speaker characteristics and utterance types by Bell and Gustafson (1999a,b; see also Gustafson, 2002 and Bell, 2003 for recent reviews of this work). The resulting transcribed and labelled database has subsequently been used as the basis for a number of studies evaluating user behaviour when interacting with the animated agent in this open environment.

Gustafson and Bell (2000) present a study showing that about half the utterances in the database can be classified as socially oriented while the other half is information-seeking. Children used a greater proportion of socialising utterances than did adults. The large proportion of socialising utterances is explained by the presence of an animated agent, and by the fact that the system was designed to handle and respond to social utterances such as greetings and queries concerning some basic facts about the life of Strindberg. Furthermore, it was found that users who received an accurate response to a socialising utterance continued to use the system for a greater number of turns than did those users who were searching for information or those who did not receive an appropriate response to a socially oriented utterance.
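A corpus analysis of this kind boils down to grouping labelled utterances by user and comparing dialogue lengths across groups. The sketch below shows the general shape of such an analysis under assumed labels; the field names and categories are illustrative, not the actual annotation scheme of the August database.

```python
from statistics import mean

# Each record: (user_id, utterance_type, got_appropriate_response),
# with utterance_type in {"social", "information"}; labels are assumptions.
def mean_turns_by_group(records):
    turns = {}            # user_id -> number of turns
    social_ok = {}        # user_id -> received an appropriate social response?
    for user, utt_type, ok in records:
        turns[user] = turns.get(user, 0) + 1
        if utt_type == "social" and ok:
            social_ok[user] = True
    with_social = [n for u, n in turns.items() if social_ok.get(u)]
    without = [n for u, n in turns.items() if not social_ok.get(u)]
    return (mean(with_social) if with_social else 0.0,
            mean(without) if without else 0.0)

# Toy records: compares mean turn counts for users who did and did not
# get an appropriate response to a social utterance.
demo = [(1, "social", True), (1, "information", True), (1, "social", True),
        (2, "information", False), (2, "information", True)]
print(mean_turns_by_group(demo))
```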

In another study concerning phrase-final prosodic characteristics of user utterances comprising wh-questions, House (2005) found that final rises were present in over 20% of the questions. Final rises can indicate a friendlier question attitude and were often present in socially oriented questions. However, rises were also found in information-oriented questions. This could indicate that the intention to continue a social type of contact with the agent may not be restricted to questions that are semantically categorised as social questions; the social intention can also be present in information-oriented questions. Finally, children's wh-question utterances as a group contained the greatest proportion of final rises, followed by women's utterances, with men's utterances containing the lowest proportion. This could also reflect trends in social intent. These results can be compared to findings by Oviatt and Adams (2000), where children interacting with animated undersea animals in a computer application used personal pronouns, and about one-third of the exchanges comprised social questions about the animal's name, birthday, friends, family, etc.
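Identifying a phrase-final rise in this kind of analysis typically amounts to comparing the F0 level at the very end of the utterance with the level just before it. The sketch below shows one simple way to do this on an already extracted F0 contour; the criterion, threshold, and window sizes are assumptions made for the illustration and are not taken from the cited study.

```python
from statistics import median

def has_final_rise(f0_contour, final_frames=10, rise_threshold=1.0):
    """Decide whether an utterance ends in a rise.

    f0_contour: list of F0 values (e.g., in semitones), voiced frames only,
                ordered in time; unvoiced frames are assumed to be removed.
    The utterance is counted as rising if the median F0 of the last
    `final_frames` frames exceeds the median of the preceding region by
    more than `rise_threshold`.
    """
    if len(f0_contour) < 2 * final_frames:
        return False
    tail = f0_contour[-final_frames:]
    body = f0_contour[:-final_frames]
    return median(tail) - median(body) > rise_threshold

# Toy contour that drifts down and then rises towards the end:
contour = [90, 89, 88, 87, 86, 85, 84, 83, 82, 81, 80, 80,
           81, 83, 85, 87, 89, 91, 93, 95, 97, 99]
print(has_final_rise(contour))   # True with the default settings
```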

9.2 The Adapt Multimodal Real-Estate Agent

The practical goal of the Adapt project was to build a system in which a user could collaborate with an animated agent to solve complicated tasks (Gustafson et al., 2000). We chose a domain in which multimodal interaction is highly useful, and which is known to engage a wide variety of people in our surroundings, namely, finding available apartments in Stockholm. In the Adapt project, the agent was given the role of asking questions and providing guidance by retrieving detailed authentic information about apartments. The user interface can be seen in Figure 14.

Figure 14. The agent Urban in the Adapt apartment domain.

Because of the conversational nature of the Adapt domain, there was a great demand for appropriate interactive signals (both verbal and visual) for encouragement, affirmation, confirmation, and turntaking (Cassell et al., 2000; Pelachaud et al., 1996). As the generation of prosodically grammatical utterances (e.g., correct focus assignment with regard to the information structure and dialogue state) was also one of the goals of the system, it was important to maintain modality consistency through the simultaneous use of both visual and verbal prosodic and conversational cues (Nass and Gong, 1999). In particular, facial gestures for turntaking were implemented in which the agent indicated such states as attention, end-of-speech detection, continued attention, and preparing an answer (Beskow et al., 2002a; Gustafson, 2002).
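Such turntaking gestures can be thought of as a mapping from dialogue-manager states to facial gesture cues. The sketch below illustrates that idea with assumed state and gesture names; it is not the actual Adapt gesture inventory or its facial animation parameter values.

```python
from enum import Enum, auto

class DialogueState(Enum):
    ATTENTION = auto()             # agent is listening to the user
    END_OF_SPEECH_DETECTED = auto()
    CONTINUED_ATTENTION = auto()   # user paused but seems likely to continue
    PREPARING_ANSWER = auto()      # system is busy formulating a response

# Hypothetical gesture cues; a real system would map these onto
# facial animation parameters (eyebrows, gaze, head pose, etc.).
GESTURES = {
    DialogueState.ATTENTION:              ["gaze_at_user", "eyebrows_neutral"],
    DialogueState.END_OF_SPEECH_DETECTED: ["small_nod", "gaze_at_user"],
    DialogueState.CONTINUED_ATTENTION:    ["eyebrows_raise", "head_tilt"],
    DialogueState.PREPARING_ANSWER:       ["gaze_away", "slight_frown"],
}

def gestures_for(state):
    """Return the visual feedback cues the agent should display in a state."""
    return GESTURES[state]

print(gestures_for(DialogueState.PREPARING_ANSWER))
```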

Two different sets of data were collected from the Adapt system. The first collection was carried out by means of a Wizard-of-Oz simulation to obtain data for an evaluation of the prototype system under development. This first database represents 32 users and contains 1,845 utterances. The second database was collected in a study where 26 users interacted with a fully automated Adapt system. The second database comprises 3,939 utterances (Gustafson, 2002).


The study used to generate the second database was carried out in order to evaluate user reactions to the agent's facial gestures for feedback (Edlund and Nordstrand, 2002). The users were split into three groups and exposed to three different system configurations. One group was presented with a system which implemented facial gestures for turntaking in the animated agent, the second group saw an hourglass symbol indicating that the system was busy but was given no facial gestures, and the third group had no turntaking feedback at all. The results showed that the feedback gestures did not increase the efficiency of the system, as measured by turntaking errors in which the subjects started to speak while the system was preparing a response. However, users were generally more satisfied with the system configuration having the facial feedback gestures, as reflected by responses in a user satisfaction form based on the PARADISE method (Walker et al., 2000).
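In the PARADISE framework, user satisfaction is related to a weighted combination of a task-success measure and various dialogue cost measures (for example turn counts or turntaking errors). As a rough illustration of that general idea only, and not of the actual questionnaire or weights used in the Adapt evaluation, a simplified performance function could look like the sketch below; the metric names, example values, and weights are assumptions.

```python
from statistics import mean, pstdev

def z_normalise(values):
    """Normalise a list of metric values to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def paradise_style_performance(task_success, costs, alpha=1.0, cost_weights=None):
    """Weighted task success minus weighted dialogue costs, per dialogue.

    task_success: list with one success measure per dialogue.
    costs: dict metric_name -> list of per-dialogue cost values
           (e.g. turn counts, turntaking errors).
    """
    cost_weights = cost_weights or {name: 1.0 for name in costs}
    success_n = z_normalise(task_success)
    costs_n = {name: z_normalise(vals) for name, vals in costs.items()}
    n_dialogues = len(task_success)
    return [alpha * success_n[i]
            - sum(cost_weights[name] * costs_n[name][i] for name in costs)
            for i in range(n_dialogues)]

# Toy numbers, only to show the computation:
scores = paradise_style_performance(
    task_success=[0.9, 0.6, 0.8],
    costs={"turns": [12, 25, 18], "turntaking_errors": [1, 4, 2]},
    cost_weights={"turns": 0.5, "turntaking_errors": 1.0},
)
print([round(s, 2) for s in scores])
```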

In another evaluation of the Adapt corpus, Hjalmarsson (2005) examined the relationship between subjective user satisfaction and changes in a set of evaluation metrics over the approximately 30-minute time span of each user interaction. She found that users with high subjective satisfaction ratings tended to improve markedly during the course of the interaction as measured by task success. Users with low subjective satisfaction ratings showed a smaller initial improvement, which was followed by a deterioration in task success.

9.3 A Comparative Evaluation of the Two Systems

In a study designed to test new metrics for the evaluation of multimodal dialogue systems using animated agents, Cerrato and Ekeklint (2004) compared a subset of the August corpus with the Adapt Wizard-of-Oz simulation. Their hypothesis was that the way in which users ended their dialogues (both semantically and prosodically) would reveal important aspects of user satisfaction and dialogue success. The general characteristics of the dialogues differed substantially between the systems. The August dialogues were characterised by a small number of turns and frequent dialogue errors, while the dialogues in the Adapt simulations were much longer and relatively unproblematic. The final user utterances from both systems were analysed and classified as social closures or non-social closures. The social closures were then grouped into subcategories such as farewell, thanks, other courtesy expressions such as "nice talking to you", and non-courtesy expressions such as insults.
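Classifying final utterances into such closure categories can be approximated with a simple keyword-based scheme. The sketch below is purely illustrative of that kind of classification and is not the annotation procedure used by Cerrato and Ekeklint (2004); the keyword lists are assumptions, given here as English glosses rather than the original Swedish forms.

```python
# Keyword cues per closure subcategory (illustrative English glosses).
CLOSURE_CUES = {
    "farewell":     ["bye", "goodbye", "see you"],
    "thanks":       ["thanks", "thank you"],
    "courtesy":     ["nice talking to you", "have a nice day"],
    "non_courtesy": ["stupid", "useless", "idiot"],
}

def classify_closure(final_utterance):
    """Label a dialogue-final utterance as a social closure subcategory,
    or as a non-social closure if no cue matches."""
    text = final_utterance.lower()
    for label, cues in CLOSURE_CUES.items():
        if any(cue in text for cue in cues):
            return label
    return "non_social"

print(classify_closure("ok thanks, bye!"))        # 'farewell' (first match wins)
print(classify_closure("show me two-room flats")) # 'non_social'
```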

The comparison presented in Figure 15 shows a much greater percentage of thanking expressions in the Adapt interactions than in those from the August corpus. While there were no insults ending Adapt interactions, insults and non-courtesy expressions comprised a fair proportion of the final utterances in the August interactions.


Figure 15. Distribution of social subcategories of the final utterances in the August and Adaptcorpus. (Adapted from Cerrato and Ekeklint, 2004).

In addition to the category of the final utterance, Cerrato and Ekeklint (2004) also analysed prosodic characteristics of the final utterances in the farewell and thanks categories. They found a tendency for a farewell or thanks to have a rising intonation contour following a successful interaction with the system. They also found a tendency for users to end with a falling intonation, higher intensity, or greater duration in those cases where there had not been a successful interaction with the system.

10 Future Challenges in Modelling and Evaluation

In this chapter, we have presented an overview of some of the recent work in audio-visual synthesis, primarily at KTH, regarding data collection methods, modelling and evaluation experiments, and implementation in animated talking agents for dialogue systems. From this point of departure, we can see that many challenges remain before we will be able to create a believable, animated talking agent based on knowledge concerning how audio and visual signals interact in verbal and non-verbal communication. In terms of modelling and evaluation, there is a great need to explore in more detail the coherence between audio and visual prosodic expressions, especially regarding different functional dimensions. As we demonstrated in the section on prominence above, head nods which strengthen the percept of prominence tend to be integrated with the nearest candidate syllable, resulting in audio-visual coherence. However, head nods which indicate dialogue functions such as feedback or turntaking may not be integrated with the audio in the same way. Visual gestures can even be used to contradict or qualify the verbal message, which is often the case in ironic expressions. On the other hand, there are other powerful visual communicative cues, such as the smile, which clearly affect the resulting audio (through articulation) and must by definition be integrated with the speech signal. Modelling of a greater number of parameters is also essential, such as head movement in more dimensions, eye movement and gaze, and other body movements such as hand and arm gestures. To model and evaluate how these parameters combine in different ways to convey individual personality traits while at the same time signalling basic prosodic and dialogue functions is a great challenge.

Acknowledgements  The work reported here was carried out by a large number of researchers at the Centre for Speech Technology, which is gratefully acknowledged. The work has been supported by the EU/IST (projects SYNFACE and PF-Star) and by CTT, the Centre for Speech Technology, a competence centre at KTH, supported by VINNOVA, KTH, and participating Swedish companies and organisations. Marc Swerts collaborated on the feedback study while he was a guest at CTT.

References

Agelfors, E., Beskow, J., Dahlquist, M., Granstrom, B., Lundeberg, M., Spens, K.-E., and Ohman, T. (1998). Synthetic Faces as a Lipreading Support. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), pages 3047–3050, Sydney, Australia.

Balter, O., Engwall, O., Oster, A.-M., and Kjellstrom, H. (2005). Wizard-of-Oz Test of ARTUR – a Computer-Based Speech Training System with Articulation Correction. In Proceedings of the Seventh International ACM SIGACCESS Conference on Computers and Accessibility, pages 36–43, Baltimore, Maryland, USA.

Bell, L. (2003). Linguistic Adaptations in Spoken Human-Computer Dialogues; Empirical Studies of User Behavior. Doctoral dissertation, Department of Speech, Music and Hearing, KTH, Stockholm, Sweden.

Bell, L. and Gustafson, J. (1999a). Interacting with an Animated Agent: An Analysis of a Swedish Database of Spontaneous Computer Directed Speech. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), pages 1143–1146, Budapest, Hungary.

Bell, L. and Gustafson, J. (1999b). Utterance Types in the August System. In Proceedings of the ESCA Tutorial and Research Workshop on Interactive Dialogue in Multi-Modal Systems (IDS), pages 81–84, Kloster Irsee, Germany.


Beskow, J. (1995). Rule-based Visual Speech Synthesis. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), pages 299–302, Madrid, Spain.

Beskow, J. (1997). Animation of Talking Agents. In Proceedings of the ESCA Workshop on Audio-Visual Speech Processing (AVSP), pages 149–152, Rhodes, Greece.

Beskow, J. and Cerrato, L. (2006). Evaluation of the Expressivity of a Swedish Talking Head in the Context of Human-Machine Interaction. In Proceedings of Gruppo di Studio della Comunicazione Parlata (GSCP), Padova, Italy.

Beskow, J., Cerrato, L., Cosi, P., Costantini, E., Nordstrand, M., Pianesi, F., Prete, M., and Svanfeldt, G. (2004a). Preliminary Cross-cultural Evaluation of Expressiveness in Synthetic Faces. In Andre, E., Dybkjær, L., Minker, W., and Heisterkamp, P., editors, Affective Dialogue Systems. Proceedings of the Irsee Tutorial and Research Workshop on Affective Dialogue Systems, volume 3068 of LNAI, pages 301–304, Springer.

Beskow, J., Cerrato, L., Granstrom, B., House, D., Nordenberg, M., Nordstrand, M., and Svanfeldt, G. (2004b). Expressive Animated Agents for Affective Dialogue Systems. In Andre, E., Dybkjær, L., Minker, W., and Heisterkamp, P., editors, Affective Dialogue Systems. Proceedings of the Irsee Tutorial and Research Workshop on Affective Dialogue Systems, volume 3068 of LNAI, pages 240–243, Springer.

Beskow, J., Cerrato, L., Granstrom, B., House, D., Nordstrand, M., and Svanfeldt, G. (2004c). The Swedish PF-Star Multimodal Corpora. In Proceedings of the LREC Workshop on Multimodal Corpora: Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and Output Interfaces, pages 34–37, Lisbon, Portugal.

Beskow, J., Edlund, J., and Nordstrand, M. (2002a). Specification and Realisation of Multimodal Output in Dialogue Systems. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), pages 181–184, Denver, Colorado, USA.

Beskow, J., Engwall, O., and Granstrom, B. (2003). Resynthesis of Facial and Intraoral Articulation from Simultaneous Measurements. In Proceedings of the International Congress of Phonetic Sciences (ICPhS), pages 431–434, Barcelona, Spain.

Beskow, J., Granstrom, B., and House, D. (2001). A Multimodal Speech Synthesis Tool Applied to Audio-Visual Prosody. In Keller, E., Bailly, G., Monaghan, A., Terken, J., and Huckvale, M., editors, Improvements in Speech Synthesis, pages 372–382, John Wiley, New York, USA.

Beskow, J., Granstrom, B., and Spens, K.-E. (2002b). Articulation Strength – Readability Experiments with a Synthetic Talking Face. In The Quarterly Progress and Status Report of the Department of Speech, Music and Hearing (TMH-QPSR), volume 44, pages 97–100, KTH, Stockholm, Sweden.


Beskow, J. and Nordenberg, M. (2005). Data-driven Synthesis of Expressive Visual Speech using an MPEG-4 Talking Head. In Proceedings of the European Conference on Speech Communication and Technology (Interspeech), pages 793–796, Lisbon, Portugal.

Bickmore, T. and Cassell, J. (2005). Social Dialogue with Embodied Conversational Agents. In van Kuppevelt, J., Dybkjær, L., and Bernsen, N. O., editors, Advances in Natural Multimodal Dialogue Systems, pages 23–54, Springer, Dordrecht, The Netherlands.

Brennan, S. E. (1990). Seeking and Providing Evidence for Mutual Understanding. Unpublished doctoral dissertation, Stanford University, Stanford, California, USA.

Carlson, R. and Granstrom, B. (1997). Speech Synthesis. In Hardcastle, W. and Laver, J., editors, The Handbook of Phonetic Sciences, pages 768–788, Blackwell Publishers, Oxford, UK.

Cassell, J., Bickmore, T., Campbell, L., Hannes, V., and Yan, H. (2000). Conversation as a System Framework: Designing Embodied Conversational Agents. In Cassell, J., Sullivan, J., Prevost, S., and Churchill, E., editors, Embodied Conversational Agents, pages 29–63, MIT Press, Cambridge, Massachusetts, USA.

Cerrato, L. and Ekeklint, S. (2004). Evaluating Users' Reactions to Human-like Interfaces: Prosodic and Paralinguistic Features as Measures of User Satisfaction. In Ruttkay, Z. and Pelachaud, C., editors, From Brows to Trust: Evaluating Embodied Conversational Agents, pages 101–124, Kluwer Academic Publishers, Dordrecht, The Netherlands.

Clark, H. H. and Schaeffer, E. F. (1989). Contributing to Discourse. Cognitive Science, 13:259–294.

Cole, R., Massaro, D. W., de Villiers, J., Rundle, B., Shobaki, K., Wouters, J., Cohen, M., Beskow, J., Stone, P., Connors, P., Tarachow, A., and Solcher, D. (1999). New Tools for Interactive Speech and Language Training: Using Animated Conversational Agents in the Classrooms of Profoundly Deaf Children. In Proceedings of the ESCA/Socrates Workshop on Method and Tool Innovations for Speech Science Education (MATISSE), pages 45–52, University College London, London, UK.

Edlund, J. and Nordstrand, M. (2002). Turn-taking Gestures and Hour-Glasses in a Multi-modal Dialogue System. In Proceedings of the ISCA Workshop on Multi-Modal Dialogue in Mobile Environments, pages 181–184, Kloster Irsee, Germany.

Engwall, O. (2003). Combining MRI, EMA and EPG Measurements in a Three-Dimensional Tongue Model. Speech Communication, 41(2–3):303–329.

Engwall, O. and Beskow, J. (2003). Resynthesis of 3D Tongue Movements from Facial Data. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), pages 2261–2264, Geneva, Switzerland.

Gill, S. P., Kawamori, M., Katagiri, Y., and Shimojima, A. (1999). Pragmatics of Body Moves. In Proceedings of the Third International Cognitive Technology Conference, pages 345–358, San Francisco, USA.

Granstrom, B. (2004). Towards a Virtual Language Tutor. In Proceedings of the InSTIL/ICALL Symposium: NLP and Speech Technologies in Advanced Language Learning Systems, pages 1–8, Venice, Italy.

Granstrom, B., House, D., Beskow, J., and Lundeberg, M. (2001). Verbal and Visual Prosody in Multimodal Speech Perception. In von Dommelen, W. and Fretheim, T., editors, Nordic Prosody: Proceedings of the Eighth Conference, Trondheim 2000, pages 77–88, Peter Lang, Frankfurt am Main, Germany.

Granstrom, B., House, D., and Lundeberg, M. (1999). Prosodic Cues in Multimodal Speech Perception. In Proceedings of the International Congress of Phonetic Sciences (ICPhS), pages 655–658, San Francisco, USA.

Granstrom, B., House, D., and Swerts, M. G. (2002). Multimodal Feedback Cues in Human-Machine Interactions. In Bel, B. and Marlien, I., editors, Proceedings of the Speech Prosody Conference, pages 347–350, Laboratoire Parole et Langage, Aix-en-Provence, France.

Gustafson, J. (2002). Developing Multimodal Spoken Dialogue Systems; Empirical Studies of Spoken Human-Computer Interaction. Doctoral dissertation, Department of Speech, Music and Hearing, KTH, Stockholm, Sweden.

Gustafson, J. and Bell, L. (2000). Speech Technology on Trial: Experiences from the August System. Natural Language Engineering, 6(3–4):273–296.

Gustafson, J., Bell, L., Beskow, J., Boye, J., Carlson, R., Edlund, J., Granstrom, B., House, D., and Wiren, M. (2000). Adapt – a Multimodal Conversational Dialogue System in an Apartment Domain. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), volume 2, pages 134–137, Beijing, China.

Gustafson, J., Lindberg, N., and Lundeberg, M. (1999). The August Spoken Dialogue System. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), pages 1151–1154, Budapest, Hungary.

Hirschberg, J., Litman, D., and Swerts, M. (2001). Identifying User Corrections Automatically in Spoken Dialogue Systems. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 208–215, Pittsburgh, USA.

Hjalmarsson, A. (2005). Towards User Modelling in Conversational Dialogue Systems: A Qualitative Study of the Dynamics of Dialogue Parameters. In Proceedings of the European Conference on Speech Communication and Technology (Interspeech), pages 869–872, Lisbon, Portugal.

House, D. (2001). Focal Accent in Swedish: Perception of Rise Properties for Accent 1. In van Dommelen, W. and Fretheim, T., editors, Nordic Prosody 2000: Proceedings of the Eighth Conference, pages 127–136, Trondheim, Norway.

House, D. (2002). Intonational and Visual Cues in the Perception of Interrogative Mode in Swedish. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), pages 1957–1960, Denver, Colorado, USA.

House, D. (2005). Phrase-final Rises as a Prosodic Feature in Wh-Questions in Swedish Human-Machine Dialogue. Speech Communication, 46:268–283.

House, D. (2006). On the Interaction of Audio and Visual Cues to Friendliness in Interrogative Prosody. In Proceedings of the Second Nordic Conference on Multimodal Communication, pages 201–213, Gothenburg University, Sweden.

House, D., Beskow, J., and Granstrom, B. (2001). Timing and Interaction of Visual Cues for Prominence in Audiovisual Speech Perception. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), pages 387–390, Aalborg, Denmark.

Krahmer, E., Ruttkay, Z., Swerts, M., and Wesselink, W. (2002a). Perceptual Evaluation of Audiovisual Cues for Prominence. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), pages 1933–1936, Denver, Colorado, USA.

Krahmer, E., Swerts, M., Theune, M., and Weegels, M. (2002b). The Dual of Denial: Two Uses of Disconfirmations in Dialogue and their Prosodic Correlates. Speech Communication, 36(1–2):133–145.

Massaro, D. W. (1998). Perceiving Talking Faces: From Speech Perception to a Behavioural Principle. MIT Press, Cambridge, Massachusetts, USA.

Massaro, D. W. (2002). Multimodal Speech Perception: A Paradigm for Speech Science. In Granstrom, B., House, D., and Karlsson, I., editors, Multimodality in Language and Speech Systems, pages 45–71, Kluwer Academic Publishers, The Netherlands.

Massaro, D. W., Bosseler, A., and Light, J. (2003). Development and Evaluation of a Computer-Animated Tutor for Language and Vocabulary Learning. In Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS), pages 143–146, Barcelona, Spain.

Massaro, D. W., Cohen, M. M., and Smeele, P. M. T. (1996). Perception of Asynchronous and Conflicting Visual and Auditory Speech. Journal of the Acoustical Society of America, 100(3):1777–1786.

Massaro, D. W. and Light, J. (2003). Read My Tongue Movements: Bimodal Learning to Perceive and Produce Non-Native Speech /r/ and /l/. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), pages 2249–2252, Geneva, Switzerland.

Nakano, Y., Reinstein, G., Stocky, T., and Cassell, J. (2003). Towards a Model of Face-to-Face Grounding. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 553–561, Sapporo, Japan.

Nass, C. and Gong, L. (1999). Maximized Modality or Constrained Consistency? In Proceedings of Auditory-Visual Speech Processing (AVSP), pages 1–5, Santa Cruz, USA.

Nordstrand, M., Svanfeldt, G., Granstrom, B., and House, D. (2004). Measurements of Articulatory Variation in Expressive Speech for a Set of Swedish Vowels. Speech Communication, 44:187–196.

Oviatt, S. L. and Adams, B. (2000). Designing and Evaluating Conversational Interfaces with Animated Characters. In Embodied Conversational Agents, pages 319–343, MIT Press, Cambridge, Massachusetts, USA.

Pandzic, I. S. and Forchheimer, R., editors (2002). MPEG-4 Facial Animation – The Standard, Implementation and Applications. John Wiley, Chichester, UK.

Parke, F. I. (1982). Parameterized Models for Facial Animation. IEEE Computer Graphics, 2(9):61–68.

Pelachaud, C. (2002). Visual Text-to-Speech. In Pandzic, I. S. and Forchheimer, R., editors, MPEG-4 Facial Animation – The Standard, Implementation and Applications. John Wiley, Chichester, UK.

Pelachaud, C., Badler, N. I., and Steedman, M. (1996). Generating Facial Expressions for Speech. Cognitive Science, 28:1–46.

Shimojima, A., Katagiri, Y., Koiso, H., and Swerts, M. (2002). Informational and Dialogue-Coordinating Functions of Prosodic Features of Japanese Echoic Responses. Speech Communication, 36(1–2):113–132.

Sjolander, K. and Beskow, J. (2000). WaveSurfer – an Open Source Speech Tool. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), volume 4, pages 464–467, Beijing, China.

Srinivasan, R. J. and Massaro, D. W. (2003). Perceiving Prosody from the Face and Voice: Distinguishing Statements from Echoic Questions in English. Language and Speech, 46(1):1–22.

Svanfeldt, G. and Olszewski, D. (2005). Perception Experiment Combining a Parametric Loudspeaker and a Synthetic Talking Head. In Proceedings of the European Conference on Speech Communication and Technology (Interspeech), pages 1721–1724, Lisbon, Portugal.

Tisato, G., Cosi, P., Drioli, C., and Tesser, F. (2005). INTERFACE: A New Tool for Building Emotive/Expressive Talking Heads. In Proceedings of the European Conference on Speech Communication and Technology (Interspeech), pages 781–784, Lisbon, Portugal.


Traum, D. R. (1994). A Computational Theory of Grounding in Natural Language Conversation. PhD thesis, University of Rochester, Rochester, USA.

Walker, M. A., Kamm, C. A., and Litman, D. J. (2000). Towards Developing General Models of Usability with PARADISE. Natural Language Engineering, 6(3–4):363–377.

Westervelt, P. J. (1963). Parametric Acoustic Array. Journal of the Acoustical Society of America, 35:535–537.

