Human Communication Research Centre, University of Edinburgh
and
Department of Computer Science and Center for Cognitive Science, Rutgers University
Designing Meaningful Agents
Matthew Stone
Meaning
The content that a speaker conveys and the hearer recognizes with an utterance.
Speaker meaning, in the sense of Grice 1957
Meaningful Agents
Systems that mean what they say.
Meaningfulness:
– not a module or level of representation
– an attribution based on many coordinated processes
Talk goal
Get clearer on processes involved in meaning
– an important way computational thinking about language can contribute to cognitive science
Outline
Introduction
Background
Meaning in language
– Action
– Interpretation
– Convention
Conclusions and future work
Background
My research involves working with testbeds that model face-to-face conversation.
Cassell, Stone, and Yan. INLG 2000.
REA
Developed by Justine Cassell and her team at MIT.
I designed and implemented REA’s architecture for utterance generation.
Methodology
Models are based on human-human talk
– where nonverbal cues mix closely with words
Goal is to account for behaviors and their effects on the conversation
Generation in REA
Model how interlocutors plan behaviors.
– An integrated process across modalities.
– Incrementally add contextually motivated and contextually appropriate actions.
Works in real time.
Matches a corpus of human descriptions.
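The incremental, cross-modal planning idea can be illustrated with a toy greedy loop. This is a minimal sketch under our own assumptions, not REA's actual generator: the `Action` record, its `requires` and `conveys` fields, and the example content are all hypothetical. An action counts as appropriate when its requirements hold in the context and as motivated when it conveys an outstanding goal; speech and gesture compete on equal terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A candidate contribution in one modality (hypothetical representation)."""
    name: str
    modality: str          # e.g. "speech" or "gesture"
    requires: frozenset    # context facts that must hold for the action to be appropriate
    conveys: frozenset     # communicative goals the action helps achieve


def plan_utterance(goals, context, actions):
    """Greedily add appropriate actions, from any modality, until the goals are conveyed."""
    remaining = set(goals)
    plan = []
    while remaining:
        # Appropriate: requirements hold in context. Motivated: conveys an outstanding goal.
        candidates = [a for a in actions
                      if a not in plan and a.requires <= context and a.conveys & remaining]
        if not candidates:
            break  # nothing appropriate can convey what is left
        # Prefer the action that contributes most; ties go to the earliest candidate.
        best = max(candidates, key=lambda a: len(a.conveys & remaining))
        plan.append(best)
        remaining -= best.conveys
    return plan, remaining


# Toy real-estate example in the spirit of REA (all content invented for illustration).
actions = [
    Action("say: it has a garden", "speech",
           frozenset({"house-in-focus"}), frozenset({"has-garden"})),
    Action("gesture: sweep both hands wide", "gesture",
           frozenset({"house-in-focus"}), frozenset({"garden-is-large"})),
    Action("say: the garden is large", "speech",
           frozenset({"garden-in-focus"}), frozenset({"garden-is-large"})),
]
plan, left = plan_utterance({"has-garden", "garden-is-large"},
                            frozenset({"house-in-focus"}), actions)
print([a.name for a in plan])  # → ['say: it has a garden', 'gesture: sweep both hands wide']
```

The spoken alternative for the garden's size is never chosen because its requirement (`garden-in-focus`) is not in the context. A fuller planner would also update the context after each step, so that later choices can build on earlier ones.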
Argument
Modeling cues in face-to-face conversation
– yields useful generalizations about how all communication works.
– yields useful distinctions about how meaning in language is special.
Three case studies
Action in conversation
– seamlessly combines language and other actions
Understanding in conversation
– language and other actions are interpreted analogously
Language and convention
– we need to ground and track special commitments for language
Underlying moral
Computational thinking connects practical systems and questions about cognition.
Cognitive science guides system-building
– use linguistic techniques for multimodal systems
System-building informs cognitive science
– ask clearer questions after practical experience
Modeling communicative action
Key principles
Temporal consistency
– Voice and body emphasize the same important places in the sentence
Semantic coherence
– Voice and body consistently reflect the conversational action the speaker is taking
Theoretical orientation
Nonverbal actions are integral, meaningful parts of contributions to face-to-face conversation.
Sources:
– Cassell in HCI
– Ekman, McNeill, Clark, Bavelas and others in psychology
Temporal consistency
Batons
Quick motions that align with speech accents
– nod down on “similar”
– nod up on “ever”
Temporal consistency
Underliners
Gradual motions over an entire phrase
– tilt and turn on “far greater”
– brow raise on “than any similar object”
Semantic coherence
All communicative actions support message
Here:
– head tilt and turn anticipates the story’s punchline
– brow raise accompanies what’s surprising
– nods underscore key concepts: similar, ever
Describing utterances computationally: coding
“Musical score” for an utterance
– Made up of elements
– Composed on an abstract timeline
– Marking emphasis and synchrony
Representations are very close to those used already for expressive speech synthesis.
Coding for spoken utterances
TEXT        far greater than any similar object ever discovered
INTONATION  L+H* !H* H– L+H* !H* L– L+H* L+!H* L– L%
BROWS       1+2
HEAD        TR D* U*
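The score can be treated as a simple data structure: events on named tiers, each spanning a range of words. The sketch below encodes the example and checks the temporal-consistency principle for batons. Assumptions: the `Event` type and `misaligned_batons` function are hypothetical; the head and brow timings follow the earlier slides, but the word-by-word placement of the six pitch accents is our own reading of the flattened INTONATION tier, and phrase accents and boundary tones (H–, L–, L%) are left out.

```python
from dataclasses import dataclass

WORDS = "far greater than any similar object ever discovered".split()


@dataclass(frozen=True)
class Event:
    tier: str    # "INTONATION", "BROWS" or "HEAD"
    label: str   # e.g. "L+H*", "1+2", "D*"
    start: int   # index into WORDS of the first word the event covers
    end: int     # index of the last word covered (inclusive)


score = [
    # Pitch accents, aligned word by word (our assumed alignment).
    Event("INTONATION", "L+H*", 0, 0), Event("INTONATION", "!H*", 1, 1),
    Event("INTONATION", "L+H*", 4, 4), Event("INTONATION", "!H*", 5, 5),
    Event("INTONATION", "L+H*", 6, 6), Event("INTONATION", "L+!H*", 7, 7),
    # Underliners: tilt/turn on "far greater", brow raise on "than any similar object".
    Event("HEAD", "TR", 0, 1), Event("BROWS", "1+2", 2, 5),
    # Batons: nod down on "similar", nod up on "ever".
    Event("HEAD", "D*", 4, 4), Event("HEAD", "U*", 6, 6),
]


def accented_words(events):
    """Indices of words that carry a pitch accent (starred intonation labels)."""
    return {e.start for e in events if e.tier == "INTONATION" and "*" in e.label}


def misaligned_batons(events):
    """Temporal consistency: starred nonverbal events must fall on accented words."""
    accents = accented_words(events)
    return [e for e in events
            if e.tier != "INTONATION" and e.label.endswith("*") and e.start not in accents]


print(misaligned_batons(score))  # → []
bad = misaligned_batons(score + [Event("HEAD", "D*", 2, 2)])
print([WORDS[e.start] for e in bad])  # → ['than']
```

Underliners such as TR and 1+2 are deliberately exempt from the check, since they stretch over whole phrases rather than marking single accents.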
Our realization
DeCarlo, Stone, Revilla & Venditti, “Specifying and animating facial signals for discourse in embodied conversational agents”, Computer Animation and Virtual Worlds 15(1), 2004.
Implementation
RUTH (Rutgers University Talking Head)
Real-time animation from 3D geometry
Freely available for research: http://www.cs.rutgers.edu/~village/ruth
Modeling utterance interpretation
Trying to get precise
Clearly language and gesture are different
– interpretation of gesture seems more elusive
But they do fit together
– we do understand what gesture portrays
– we do understand how that relates to speech
Trying to get precise
So there are these very low-level phonological errors that tend to not get reported…
…because they are being produced continually by an iterative process below our level of awareness.
Trying to get precise
Now one thing you could do is totally audiotape hours and hours…
…so that you get a large amount of data that you can think of as laid out on a time-line.
Trying to get precise
And exhaustively go through and make sure that you really pick up all the speech errors…
…by individually examining each unit of analysis along the time-line of your data.
Trying to get precise
Allow two different coders to go through it…
…and moreover get them to work independently and reconcile their activities.
Linking gesture and speech
The content of gesture connects with speech by rhetorical relations
[speech] because [gesture]
[speech] so that [gesture]
[speech] by [gesture]
[speech] and moreover [gesture]
Just like successive sentences in discourse connect together.
Parallel extends to formal model
Describe interpretation of gesture and speech in SDRT (Asher & Lascarides, 2003)
– start from underspecified meaning
– assume a specific resolution of meaning
– find a rhetorical connection to the discourse
– to explain why the action is coherent
In fact, REA already used the same (simple) model to interpret words and gesture
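The interpretation steps above share a simple control structure: search over resolutions of the gesture's underspecified meaning until one supports a coherent rhetorical link to the speech. The sketch below shows only that search, under heavy assumptions; it is not a model of SDRT's logic. The reading format, the relation names and their tests are invented for illustration.

```python
from itertools import product

# Toy stand-ins (invented for illustration): a reading is a topic plus the claims it makes.
speech = {"topic": "object", "claims": frozenset({"big"})}
gesture_readings = [                      # candidate resolutions of an underspecified gesture
    {"topic": "room", "claims": frozenset({"rectangular"})},
    {"topic": "object", "claims": frozenset({"big", "wide"})},
]

# Each (illustrative) relation tests whether it can coherently link speech and gesture.
RELATIONS = {
    # Elaboration: same topic, and the gesture adds detail to what is said.
    "Elaboration": lambda s, g: s["topic"] == g["topic"] and s["claims"] < g["claims"],
    # Restatement: same topic, same claims.
    "Restatement": lambda s, g: s["topic"] == g["topic"] and s["claims"] == g["claims"],
}


def interpret(speech, readings, relations):
    """Pick a resolution of the gesture together with a relation that makes it coherent."""
    for reading, (name, holds) in product(readings, relations.items()):
        if holds(speech, reading):
            return reading, name
    return None  # no resolution connects coherently: the gesture stays uninterpreted


reading, relation = interpret(speech, gesture_readings, RELATIONS)
print(relation, sorted(reading["claims"]))  # → Elaboration ['big', 'wide']
```

The point of the sketch is the coupling: a resolution of the gesture is accepted only because some relation explains why it was produced alongside the speech.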
Practical consequences
Can add meaningful nonverbal behaviors very easily to systems designed for language
– possible with deep techniques, as in REA
– possible with shallow techniques too
Example: data-driven synthesis
Speech synthesis for limited domains
– Plan utterances using templates
– Select speech from an application-specific database
Black and Lenzo 2000, Bulyko and Ostendorf 2002, Pan and Wang 2002
Becomes multimodal with a database of speech and motion.
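Selecting recordings from such a database is commonly framed as unit selection: each candidate pays a target cost for how well it suits its slot in the template and a join cost for how smoothly it follows the previous unit, and dynamic programming (Viterbi search) finds the cheapest sequence. The sketch below is a minimal, hypothetical version in which each unit pairs a speech recording with a motion segment; the `Clip` fields and both cost functions are assumptions, not the costs used in the systems cited above.

```python
from typing import NamedTuple


class Clip(NamedTuple):
    """A recorded performance unit (hypothetical fields)."""
    voice: str        # id of a speech recording
    motion: str       # id of a motion-capture segment
    start_pose: str   # body pose at the start of the clip
    end_pose: str     # body pose at the end of the clip
    fit: float        # how poorly the clip suits its slot (0 = perfect)


def target_cost(slot, clip):
    return clip.fit


def join_cost(a, b):
    # Penalize a visible discontinuity when consecutive clips do not meet in the same pose.
    return 0.0 if a.end_pose == b.start_pose else 1.0


def select_units(slots, candidates):
    """Viterbi search: one clip per template slot, minimizing target plus join costs."""
    # Each layer maps a clip to (cost of the best sequence ending in it, previous clip).
    layers = [{c: (target_cost(slots[0], c), None) for c in candidates[slots[0]]}]
    for slot in slots[1:]:
        layer = {}
        for c in candidates[slot]:
            prev, cost = min(((p, pc + join_cost(p, c)) for p, (pc, _) in layers[-1].items()),
                             key=lambda pair: pair[1])
            layer[c] = (cost + target_cost(slot, c), prev)
        layers.append(layer)
    # Trace back from the cheapest clip in the last slot.
    clip = min(layers[-1], key=lambda c: layers[-1][c][0])
    total = layers[-1][clip][0]
    path = [clip]
    for layer in reversed(layers[1:]):
        clip = layer[clip][1]
        path.append(clip)
    return path[::-1], total


candidates = {
    "praise": [Clip("v1", "m1", "rest", "arms-up", 0.0), Clip("v2", "m2", "rest", "rest", 0.5)],
    "address": [Clip("v3", "m3", "rest", "rest", 0.0), Clip("v4", "m4", "arms-up", "rest", 0.2)],
}
path, total = select_units(["praise", "address"], candidates)
print([c.voice for c in path], total)  # → ['v1', 'v4'] 0.2
```

Here the best individual clip for the second slot (`v3`) loses because it cannot follow `v1` smoothly; the search trades a slightly worse fit for continuity of both voice and motion.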
Our proof of concept
Convincing results
Because the animation closely follows human performance.
Performance in our animation
Content                   Voice   Motion
to set up your landing    #155    #185
you didn’t manage         #174    #214
dude                      #172    #122
that was ugly             #041    #091
Flexibility
The character responds dynamically to ongoing events, including player actions,
by selecting recordings for new situations and adapting them to one another.
Another selection of performance
Content                   Voice   Motion
you got it                #094    #114
on this jump              #133    #183
dude                      #172    #192
yikes                     #181    #031
Live Demo
Stone, DeCarlo, Oh, Rodriguez, Stere, Lees & Bregler, “Speaking with Hands: Creating Animated Conversational Characters from Recordings of Human Performance”, ACM Transactions on Graphics 23(3) (SIGGRAPH 2004).
Trying to get precise
Clearly language and gesture are different
– interpretation of gesture seems more elusive
Why might this be?
Gesture lacks convention
Meta-talk is unintelligible:
– What do you mean [demonstration]?
– What does [demonstration] mean?
Language depends on convention
You see this in meta-talk in dialogue:
– What do you mean?
– What does that mean?
You see this in intuitions about reference
– Particularly when we borrow others’ reference
Reference borrowing
Someone, let’s say, a baby, is born; his parents call him by a certain name. They talk about him to their friends. Other people meet him. Through various sorts of talk the name is spread from link to link as if by a chain. A speaker who is on the far end of this chain, who has heard about, say, Richard Feynman, in the market place or elsewhere, may be referring to Richard Feynman even though he can’t remember from whom he first heard of Feynman or from whom he ever heard of Feynman. He knows that Feynman is a famous physicist. A certain passage of communication reaching ultimately to the man himself does reach the speaker. He is then referring to Feynman even though he can’t identify him uniquely.
Kripke 1980
Conventions involve commitments
Agents must work to keep meaning aligned with community
– Clarifying what they say, if asked
– Resolving inconsistencies, if necessary
Pursuing commitments
Clarification
– S: It’s light brown.
  U: Do you mean tan?
  S: Yeah.
Negotiation
– S: It’s a square.
  U: There is no square.
  S: Isn’t the red one a square?
  U: No.
  S: OK, it’s not a square; it’s the red one.
Conventions involve commitments
This sounds very abstract, but it is amenable to strategies already used to implement dialogue
– after Larsson & Traum, Purver
Information-state approach
Formalize the state of the ongoing dialogue
– including commitments and obligations
Formalize the moves interlocutors can make
– including meta-level moves that push side tasks
Create a dialogue strategy
– rules for choosing good moves in each state
This is standard KR methodology
See “Living with Classic” (Brachman et al. 1990)
– Get clear on the relevant objects and meaningful distinctions in the world
– Set up symbols for these objects and properties
– Give clear content to structures that say how things are
– Build algorithms that respect that content
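The three ingredients of the information-state approach (a state, moves that update it, and a strategy that chooses moves) can be sketched in a few lines. This is a hypothetical, much simplified version, not the rule sets of Larsson & Traum or Purver: `InfoState`, `update` and `choose_move` are our own names, and every answer is treated as a yes. The run replays the clarification exchange about "light brown".

```python
from dataclasses import dataclass, field


@dataclass
class InfoState:
    """Hypothetical information state for a two-party dialogue between S and U."""
    commitments: set = field(default_factory=set)    # (speaker, proposition) public commitments
    questions: list = field(default_factory=list)    # open questions under discussion
    obligations: list = field(default_factory=list)  # (who, move, content) still owed


def addressee(speaker):
    return "U" if speaker == "S" else "S"


def update(state, speaker, move, content):
    """Update rules: how each (simplified) move changes the state."""
    if move == "assert":
        state.commitments.add((speaker, content))
    elif move == "ask":       # a yes/no question obliges the addressee to answer
        state.questions.append(content)
        state.obligations.append((addressee(speaker), "answer", content))
    elif move == "answer":    # simplification: every answer here is "yes"
        state.questions.remove(content)
        state.obligations.remove((speaker, "answer", content))
        state.commitments.add((speaker, content))
    else:
        raise ValueError(f"unknown move: {move}")


def choose_move(state, me):
    """Dialogue strategy: discharge pending obligations before doing anything else."""
    for who, move, content in state.obligations:
        if who == me:
            return move, content
    return None


state = InfoState()
update(state, "S", "assert", "it is light brown")
update(state, "U", "ask", "by light brown, S means tan")
next_move = choose_move(state, "S")
print(next_move)  # → ('answer', 'by light brown, S means tan')
update(state, "S", *next_move)
print(state.obligations)  # → []
```

Keeping obligations explicit in the state is what lets a meta-level question, like U's clarification request, temporarily take priority over the main task.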
Meaning and information state
Formalize the state of the ongoing dialogue
– Include links between words and referents that interlocutors are committed to
Formalize the moves interlocutors can make
– Carry out a clarification subdialogue, to agree what a word refers to
– Carry out a negotiation subdialogue, to align concepts of how the world is
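Word-referent commitments and the two kinds of subdialogue can likewise be sketched as operations on a shared record. The `SharedLexicon` class and its methods are hypothetical illustrations, not the formalization in DeVault, Oved and Stone (AAAI 2006): clarification adds an agreed link, while a failed negotiation records the rejected link, withdraws it, and re-refers with a description both parties accept.

```python
class SharedLexicon:
    """Hypothetical store of descriptions whose meaning both interlocutors are committed to."""

    def __init__(self):
        self.links = {}        # description -> what both parties agree it picks out
        self.rejected = set()  # (description, referent) pairs both parties agree are wrong

    def clarify(self, description, candidate, confirmed):
        """Clarification subdialogue: 'Do you mean <candidate>?' followed by yes or no."""
        if confirmed:
            self.links[description] = candidate
        return confirmed

    def negotiate(self, description, referent, accepted, fallback=None):
        """Negotiation subdialogue: does <description> fit <referent>? If not, retract it."""
        if accepted:
            self.links[description] = referent
            return
        self.rejected.add((description, referent))
        if self.links.get(description) == referent:
            del self.links[description]      # withdraw the disputed commitment
        if fallback is not None:
            self.links[fallback] = referent  # re-refer with an agreed description


lexicon = SharedLexicon()
lexicon.clarify("light brown", "tan", confirmed=True)  # S: It's light brown. U: Tan? S: Yeah.
lexicon.negotiate("square", "red-object", accepted=False, fallback="the red one")
print(lexicon.links)     # → {'light brown': 'tan', 'the red one': 'red-object'}
print(lexicon.rejected)  # → {('square', 'red-object')}
```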
Meaning and information state
Dialogue strategy
– Among other things, choose moves so that agents pursue commitments to public meaning
“Societal grounding”
– Formalized and illustrated in AAAI 2006: “Societal grounding is essential for meaningful language use.” David DeVault, Iris Oved and Matthew Stone.
Conclusions
Action in conversation
– seamlessly combines language and other actions
Understanding in conversation
– language and other actions are interpreted in similar ways
Language and convention
– only words have conventional meanings that we need to ground and track in conversation
Designing Meaningful Agents
An attribution based on coordinated processes
– Not a module or level of representation
Using language as a collaboration
[Diagram: observed events → intent recognition → inferred intentions; deliberation, drawing on context (beliefs, desires, commitments) → current intention → coordinated execution → action]
Pursuing sharing, not just outcomes
[Diagram: inferred intentions → deliberation → current intention]
Deliberation that reflects
– Meta-level insight into one’s own meanings
– Commitment to make meanings shared
– Commitment to respect community standards
What makes language special?
A range of answers across cognitive science
– Social underpinning, e.g. Tomasello
– Linguistic processing, e.g. Pickering, Garrod
– Linguistic structure, e.g. Fitch, Hauser, Chomsky
Need to reconcile them rather than arbitrate
– Computational thinking can help make this clear
Semantics at Rutgers
Philosophy: Jerry Fodor, Ernie Lepore, Jason Stanley
Linguistics: Maria Bittner, Veneeta Dayal, Roger Schwarzschild
Computer Science: Chung-chieh Shan, Matthew Stone
New initiative
IGERT in Perceptual Science
– Graduate training program
– Investigations bridging humans and computers
– Includes face-to-face communication as a focus
Acknowledgments
Animation: Doug DeCarlo (Assoc Prof RU), Radu Gruian (U’00), Corey Revilla (U’01), Christian Rodriguez (M’04), Niki Shah (M’01), Adrian Stere (U’05)
Acknowledgments
Data analysis: Jennifer Venditti (Postdoc; PhD Ohio State ’00), Chris Dymek (U’01), Nathan Folsom-Kovarik (U’01), Insuk Oh (PhD Student)
Acknowledgments
Semantics:
Alex Lascarides (University of Edinburgh): gesture
David DeVault (PhD Student, Rutgers): grounding
Iris Oved (PhD Student, Rutgers): grounding
Acknowledgments
Motion capture: Chris Bregler (Assoc Prof, NYU), Alyssa Lees (PhD Student, NYU), Kate Brehm (Performer)
Animation: Loren Runcie (U’04), Jared Silver (Staff, NYU)
Zoe model from Electronic Arts SSX 3