
Planning, Inference and Pragmatics in Sequential Language Games

Fereshte Khani, Stanford University, [email protected]
Noah D. Goodman, Stanford University, [email protected]
Percy Liang, Stanford University, [email protected]

Abstract

We study sequential language games in which two players, each with private information, communicate to achieve a common goal. In such games, a successful player must (i) infer the partner's private information from the partner's messages, (ii) generate messages that are most likely to help with the goal, and (iii) reason pragmatically about the partner's strategy. We propose a model that captures all three characteristics and demonstrate their importance in capturing human behavior on a new goal-oriented dataset we collected using crowdsourcing.

1 Introduction

Human communication is extraordinarily rich. People routinely choose what to say based on their goals (planning), figure out the state of the world based on what others say (inference), all while taking into account that others are strategizing agents too (pragmatics). All three aspects have been studied in both the linguistics and AI communities. For planning, Markov Decision Processes and their extensions can be used to compute utility-maximizing actions via forward-looking recurrences (e.g., Vogel et al. (2013a)). For inference, model-theoretic semantics (Montague, 1973) provides a mechanism for utterances to constrain possible worlds, and this has been implemented recently in semantic parsing (Matuszek et al., 2012; Krishnamurthy and Kollar, 2013). Finally, for pragmatics, the cooperative principle of Grice (1975) can be realized by models in which a speaker simulates a listener (e.g., Franke (2009) and Frank and Goodman (2012)).

[Figure 1: A game of InfoJigsaw played by two human players ("Find B2"). One of the players (P_letter) only sees the letters (B, B, C); the other (P_digit) only sees the digits (2, 3, 2). Their goal is to identify the goal object, B2, by exchanging a few words. Dialogue: P_letter: square; P_digit: circle; P_letter: click (1,3). The clouds show the hypothesized role of planning ("Let me first try square, which is just one possibility."), inference ("The square's letter must be B."), and pragmatics ("The square's digit cannot be 2.") in the players' choice of utterances. In this game, the bottom object is the goal (position (1, 3)).]

There have been a few previous efforts in the language games literature to combine the three aspects. Hawkins et al. (2015) proposed a model of communication between a questioner and an answerer based on only one round of question answering. Vogel et al. (2013b) proposed a model of two agents playing a restricted version of the game from the Cards Corpus (Potts, 2012), where the agents only communicate once.[1] In this work, we seek to capture all three aspects in a single, unified framework which allows for multiple rounds of communication.

[1] Specifically, two agents must both co-locate with a specific card. The agent who finds the card sooner shares the card location information with the other agent.


Transactions of the Association for Computational Linguistics, vol. 6, pp. 543–555, 2018. Action Editor: Mark Steedman. Submission batch: 1/2018; Revision batch: 5/2018; Published 8/2018. © 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.



Specifically, we study human communication in a sequential language game in which two players, each with private knowledge, try to achieve a common goal by talking. We created a particular sequential language game called InfoJigsaw (Figure 1). In InfoJigsaw, there is a set of objects with public properties (shape, color, position) and private properties (digit, letter). One player (P_letter) can only see the letters, while the other player (P_digit) can only see the digits. The two players wish to identify the goal object, which is uniquely defined by a letter and a digit. To do this, the players take turns talking; to encourage strategic language, we allow at most two English words at a time. At any point, a player can end the game by choosing an object.

Even in this relatively constrained game, we can see the three aspects of communication at work. As Figure 1 shows, in the first turn, since P_letter knows that the game is multi-turn, she simply says square; if the other player does not click on the square, she can try the bottom circle in the next turn (planning). In the second turn, P_digit infers from square that the square's letter is probably B (inference). As the digit on the square is not a 2, she says circle. Finally, P_letter infers that the digits of the circles are 2, and in addition she infers from circle that the digit on the square is not a 2, as otherwise P_digit would have clicked on it (pragmatics). Therefore, she correctly clicks on (1,3).

In this paper, we propose a model that captures planning, inference, and pragmatics for sequential language games, which we call PIP. Planning recurrences look forward, inference recurrences look back, and pragmatics recurrences look to simpler interlocutors' models. The principal challenge is to integrate all three types in a coherent way; we present a "two-dimensional" system of recurrences to capture this. Our recurrences bottom out in a very simple, literal semantics (e.g., the context-independent meaning of circle), and we rely on the structure of the recurrences to endow words with their rich context-dependent meaning. As a result, our model is very parsimonious and has only four (hyper)parameters.

As our interest is in modeling human communication in sequential language games, we evaluate PIP on its ability to predict how humans play InfoJigsaw.[2]

We paired up workers on Amazon Mechanical Turk to play InfoJigsaw and collected a total of 1680 games. Our findings are as follows: (i) PIP obtains higher log-likelihood than a baseline that chooses actions to convey maximum information in each round; (ii) PIP obtains higher log-likelihood than ablations that remove the pragmatics or the planning components, supporting their importance in communication; (iii) PIP is better than an ablation with a truncated inference component that forgets the distant past, but only for longer games; for shorter games it is worse. The overall conclusion is that by combining a very simple, context-independent literal semantics with an explicit model of planning, inference, and pragmatics, PIP obtains rich context-dependent meanings that correlate with human behavior.

2 Sequential Language Games

In a sequential language game, there are two players who have a shared world state w. In addition, each player j ∈ {+1, −1} has a private state s_j. At each time step t = 1, 2, ..., the active player j(t) = 2(t mod 2) − 1 (which alternates) chooses an action a_t (including speaking) based on her policy π_{j(t)}(a_t | w, s_{j(t)}, a_{1:t−1}). Importantly, player j(t) can see her own private state s_{j(t)}, but not the partner's s_{−j(t)}. At the end of the game (defined by a terminating action), both players receive utility U(w, s_{+1}, s_{−1}, a_{1:t}) ∈ ℝ. The utility consists of a reward if the players reached the goal or a penalty if they did not, along with a penalty for each action. Because the players have a common utility function that depends on private information, they must communicate the part of their private information that is relevant for maximizing utility. To simplify notation, we use j to represent j(t) in the rest of the paper.
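To make this setup concrete, here is a minimal Python sketch of such a game loop; the callback interfaces (policies, is_terminal, utility) and all names are illustrative assumptions, not the paper's implementation.

```python
def active_player(t: int) -> int:
    # j(t) = 2(t mod 2) - 1 alternates +1, -1, +1, ... starting from t = 1
    return 2 * (t % 2) - 1

def play(w, s, policies, is_terminal, utility, max_turns=20):
    """Run one sequential language game (a sketch). On each turn the active
    player j sees only the shared state w, her own private state s[j], and
    the action history a_{1:t-1}; both players get the same final utility."""
    history = []
    for t in range(1, max_turns + 1):
        j = active_player(t)
        a_t = policies[j](w, s[j], history)  # sample from pi_j(. | w, s_j, a_{1:t-1})
        history.append(a_t)
        if is_terminal(a_t):                 # e.g., a click action in InfoJigsaw
            break
    return utility(w, s[+1], s[-1], history) # common payoff for both players
```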

InfoJigsaw. In InfoJigsaw (see Figure 1 for an example), two players try to identify a goal object, but each only has partial information about its identity. Thus, in order to solve the task, they must communicate, piecing their information together like a jigsaw puzzle. Figure 2 shows the interface that humans use to play the game.

[2] One could in principle solve for an optimal communication strategy for InfoJigsaw, but this would likely result in a solution far from human communication.


[Figure 2: Chat interface that Amazon Mechanical Turk (AMT) workers use to play InfoJigsaw: (a) P_digit view, (b) P_letter view. For readability, objects with the goal digit/letter are bolded.]

More formally, the shared world state w includes the public properties of a set of objects: position on an m × n grid, color (blue, yellow, green), and shape (square, diamond, circle). In addition, w contains the letter and digit of the goal object (e.g., B2). The private state of player P_digit is a digit (e.g., 1, 2, 3) for each object, and the private state of player P_letter is a letter (e.g., A, B, C) for each object. These states are s_{+1} and s_{−1}, depending on which player goes first.
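As a rough illustration of how a scenario decomposes into public and private parts (the field names are our assumptions, not taken from the paper's code):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class InfoJigsawScenario:
    grid: Tuple[int, int]  # (m, n), either (2, 3) or (3, 2)
    colors: List[str]      # public: "blue" | "yellow" | "green" per object
    shapes: List[str]      # public: "square" | "diamond" | "circle" per object
    goal: str              # part of w, e.g., "B2"
    letters: List[str]     # private to P_letter, e.g., ["B", "B", "C", ...]
    digits: List[int]      # private to P_digit, e.g., [2, 3, 2, ...]
```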

On each turn t, player j(t)'s action a_t can be either (i) a message containing one or two English words[3] (e.g., circle), or (ii) a click on an object, specified by its position (e.g., (1,3)). A click action terminates the game. If the clicked object is the goal, a green square visible to both players appears around it; if the clicked object is not the goal, a red square appears instead. To discourage random guessing, we prevent players from clicking in the first time step. Players do not see an explicit utility (U); however, they are instructed to think strategically and choose messages that lead to clicking on the correct object while using a minimum number of messages. Players can see the number of correct clicks, wrong clicks, and the number of words they have sent to each other so far at the top right of the screen.

We would like to study how context-dependent meaning arises out of the interplay between a context-independent literal semantics and context-sensitive planning, inference, and pragmatics. The simplicity of the InfoJigsaw game ensures that this interplay is not obscured by other challenges.

[3] If the words are not in an English dictionary, the sender receives an error and the message is rejected. This prevents players from circumventing the game rules by connecting multiple words without spaces.

        # games   # messages   average score
All     1680      4967         7.50
Kept    1259      3358         7.48

Table 1: Statistics for all 1680 games and for the 1259 games in which each message contains at least one of the 12 most frequent words, or "yes" or "no".


2.1 Data collection

We generated 10 InfoJigsaw scenarios as follows. For each one, we randomly chose the grid to be either 2 × 3 or 3 × 2 (which results in 64 possible private states). We randomly chose the properties of all objects and randomly designated one object as the goal. We randomly chose either P_letter or P_digit to start the game. Finally, to make the scenarios interesting, we kept a scenario only if it satisfied the following: (i) only the goal object (and no other object) has the goal combination of letter and digit; (ii) each player has at least two goal-consistent objects, and the players' goal-consistent objects number at least m × n in total; and (iii) the goal-consistent objects of each player do not all share the same color, shape, or position (i.e., they are not all on the left, right, top, bottom, or middle).

We collected a dataset of InfoJigsaw games on Amazon Mechanical Turk using the framework of Hawkins (2015): 200 pairs of players each played all 10 scenarios in a random order.


[Figure 3: Statistics of the collected corpus. (a) Number of exchanged messages per game. (b) Distribution of final game scores. (c) Average score over multiple rounds of game play, which interestingly remains constant. (d) The 12 most frequent words (left, top, square, yellow, bottom, blue, green, right, diamond, circle, middle, not), which make up 73% of all tokens. (e) The 30 most frequent messages, which make up 49% of all messages.]

Of the 200 pairs, 32 left the game prematurely, leaving 168 pairs who played a total of 1680 games. Players performed 4967 actions (messages and clicks) in total and obtained an average score (correct clicks) of 7.5 per game. The average score per scenario varied from 6.4 to 8.2. Interestingly, there is no significant difference in scores across the 10 rounds of play, suggesting that players do not adapt and become more proficient with more game play (Figure 3c). Figure 3 shows the statistics of the collected corpus. Figure 4 shows one of the games, along with the distribution of messages in the first time step of all games played on this scenario.

To focus on the strategic aspects of InfoJigsaw, we filtered the dataset to reduce the words in the tail. Specifically, we keep a game if all of its messages contain at least one of the 12 most frequent words (shown in Figure 3d), or "yes" or "no". For example, in Figure 4, games containing messages such as what color, mid row, and color are filtered out because these contain no frequent words. Messages such as middle, either middle, middle maybe, and middle objects are mapped to middle. 1259 of the 1680 games survived. Table 1 compares the statistics of all games and of the ones that were kept. Most games that were filtered out contained less frequent synonyms (e.g., round instead of circle). Some questions were filtered out too (e.g., what color). Filtered games are 1.15 times longer on average.

3 Literal Semantics

In order to understand the principles behind how humans perform planning, inference, and pragmatics, we aim to develop a parsimonious, interpretable model with few parameters rather than a highly expressive, data-driven model. Therefore, following the tradition of Rational Speech Acts (RSA) (Frank and Goodman, 2012; Goodman and Frank, 2016), we define in this section a mapping from each word to its literal semantics, and rely on the PIP recurrences (which we will describe in Section 4) to provide context-dependence.


[Figure 4: Bottom: one of the games played by Turkers ("Find A1"; P_digit: middle; P_letter: yellow circle; P_digit: bottom right; P_letter: click (1,2)). Top: the distribution of utterances for the first message, showing the 20 most frequent first-round messages (72% of all first messages). Players choose to explain their private state in different ways: some use more general messages (e.g., square diamond), while some use more specific ones (e.g., blue square).]

One could also learn the literal semantics by backpropagating through these recurrences, as has been done for simpler RSA models (Monroe and Potts, 2015), or learn the literal semantics from data and then put RSA on top (Andreas et al., 2016); we leave this to future work.

Suppose a player utters the single word circle. There are multiple possible context-dependent interpretations:

• Are any circles goal-consistent?
• All the circles are goal-consistent.
• Some circles but no other objects are goal-consistent.
• Most of the circles are goal-consistent.
• At least one circle is goal-consistent.

[Figure 5: Private states of the players and the meaning of two action sequences, for the game of Figure 1 ("Find B2"). The private states are shown as column vectors over the three objects: s_{−1} = [0; 1; 1] and s_{+1} = [1; 0; 1]. Examples of literal semantics: ⟦square⟧ = {s : s ∧ [0; 1; 0] ≠ [0; 0; 0]}; ⟦top bottom⟧ = {s : s ∧ ([1; 0; 0] ∨ [0; 0; 1]) ≠ [0; 0; 0]}; ⟦top blue⟧ = {s : s ∧ ([1; 0; 0] ∧ [1; 1; 1]) ≠ [0; 0; 0]}.]


We will show that most of these interpretations can arise from a simple fixed semantics: roughly, "some circles are goal-consistent". We now define a simple literal semantics of message actions such as circle, which forms the base case of PIP. Recall that the shared world state w contains the goal (e.g., B2) and, assuming P_letter goes first, the private state s_{−1} (s_{+1}) of player P_letter (P_digit) contains the letter (digit) of each object. For notational simplicity, let us define s_{−1} (s_{+1}) to be a matrix corresponding to the spatial locations of the objects, where an entry is 1 if the corresponding object has the goal letter (digit) and 0 otherwise. Thus s_j also represents the set of goal-consistent objects given the private knowledge of that player. Figure 5 shows the private states of the players.

We define two types of message actions: informative (e.g., blue, top) and verifying (e.g., yes, no). Informative messages have immediate meaning, while verifying messages depend on the previous utterance.

Informative messages. Informative messages describe constraints on the speaker's private state (which the partner does not know). For a message a, define ⟦a⟧ to be the set of consistent private states. For example, ⟦bottom⟧ is the set of all private states in which there are goal-consistent objects in the bottom row.



Formally, for each word x that specifies some object property (e.g., blue, top), define v_x to be an n × m matrix where an entry is 1 if the corresponding object has the property x, and 0 otherwise. Then, define the literal semantics of a single-word message x to be ⟦x⟧ ≝ {s : s ∧ v_x ≠ 0}, where ∧ denotes element-wise AND and 0 denotes the zero matrix. That is, single-property messages can be glossed as "some goal-consistent object has property x".

For a two-word message xy, we define the literal semantics depending on the relationship between x and y. If x and y are mutually exclusive, then we interpret xy as x or y (e.g., square circle); otherwise, we interpret xy as x and y (e.g., blue top). Formally, ⟦xy⟧ ≝ {s : s ∧ (v_x ∧ v_y) ≠ 0} if v_x ∧ v_y ≠ 0, and {s : s ∧ (v_x ∨ v_y) ≠ 0} otherwise. See Figure 5 for some examples.
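To make these definitions concrete, here is a minimal Python/numpy sketch of the one- and two-word literal semantics, assuming the three-object arrangement of Figure 5; the property vectors and function names are illustrative, not the paper's implementation.

```python
import numpy as np

# Property indicators v_x for the 3x1 arrangement of Figure 5
# (top circle, square, bottom circle). A private state s is a boolean
# matrix whose entries mark the goal-consistent objects.
V = {
    "circle": np.array([[1], [0], [1]], dtype=bool),
    "square": np.array([[0], [1], [0]], dtype=bool),
    "top":    np.array([[1], [0], [0]], dtype=bool),
    "bottom": np.array([[0], [0], [1]], dtype=bool),
    "blue":   np.array([[1], [1], [1]], dtype=bool),
}

def in_meaning(s: np.ndarray, msg: str) -> bool:
    """Literal semantics: is private state s in [[msg]]?"""
    words = msg.split()
    if len(words) == 1:
        mask = V[words[0]]  # "some goal-consistent object has property x"
    else:
        vx, vy = V[words[0]], V[words[1]]
        # mutually exclusive properties -> "x or y", otherwise "x and y"
        mask = (vx & vy) if (vx & vy).any() else (vx | vy)
    return bool((s & mask).any())

s_digit = np.array([[1], [0], [1]], dtype=bool)  # P_digit's state in Figure 5
print(in_meaning(s_digit, "square"))      # False: the square's digit is not 2
print(in_meaning(s_digit, "top bottom"))  # True: disjoint -> disjunction
print(in_meaning(s_digit, "top blue"))    # True: overlapping -> conjunction
```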

Action sequences. We now define the literal semantics ⟦a_{1:t}⟧_j of an entire action sequence with respect to player j, which is the set of possible partner private states s_{−j}. Intuitively, we want to simply intersect the sets of consistent private states of the informative messages, but we also need to handle verifying messages (yes and no), which are context-dependent. Formally, we say that a private state s_{−j} ∈ ⟦a_{1:t}⟧_j if the following holds: for all informative messages a_i uttered by −j, s_{−j} ∈ ⟦a_i⟧; for all verifying messages a_i uttered by −j, if a_i = yes then s_{−j} ∈ ⟦a_{i−1}⟧, and if a_i = no then s_{−j} ∉ ⟦a_{i−1}⟧.
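The sequence semantics can then be checked message by message, resolving yes and no against the preceding utterance; a sketch that reuses in_meaning from the block above:

```python
def in_sequence_meaning(s_partner, actions, partner_said) -> bool:
    """Is s_partner in [[a_1:t]]_j? actions is the list of message strings;
    partner_said[i] is True iff player -j uttered actions[i]. Assumes a
    verifying message never opens the dialogue."""
    for i, a in enumerate(actions):
        if not partner_said[i]:
            continue                   # only -j's messages constrain s_partner
        if a == "yes":                 # endorses the previous utterance
            if not in_meaning(s_partner, actions[i - 1]):
                return False
        elif a == "no":                # rejects the previous utterance
            if in_meaning(s_partner, actions[i - 1]):
                return False
        elif not in_meaning(s_partner, a):   # informative message
            return False
    return True
```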

4 The Planning-Inference-Pragmatics (PIP) Model

Why does P_digit in Figure 1 choose circle rather than top or click(1,2)? Intuitively, when a player chooses an action, she should take into account her previous actions, her partner's actions, and the effect of her actions on future turns. She should do all this while reasoning pragmatically, keeping in mind that her partner is also a strategic player.

At a high level, PIP defines a system of recurrences revolving around three concepts, depicted in Figure 6: player j's beliefs over the partner's private state, p^k_j(s_{−j} | s_j, a_{1:t}); her expected utility of the game, V^k_j(s_{+1}, s_{−1}, a_{1:t}); and her policy, π^k_j(a_t | s_j, a_{1:t−1}). Here, t indexes the current time and k indexes the depth of pragmatic recursion, which will be explained in Section 4.3. To simplify the notation, we have dropped w (the shared world state), since everything conditions on it.

[Figure 6: PIP is defined via a system of recurrences that simultaneously captures planning, inference, and pragmatics. The arrows show the dependencies between beliefs p, expected utilities V, and policies π.]


4.1 Inference

From player j's point of view, the purpose of inference is to compute a distribution over the partner's private state s_{−j} given all actions thus far, a_{1:t}. We first consider a "level-0" player, which simply assigns a uniform distribution over all states consistent with the literal semantics of a_{1:t}, as defined in Section 3:

p^0_j(s_{−j} | s_j, a_{1:t}) ∝ { 1 if s_{−j} ∈ ⟦a_{1:t}⟧_j, 0 otherwise }.  (1)

For example, Figure 7 shows P_letter's belief about P_digit's private state after observing circle. Recall that we represent a player's private state as a matrix in which an entry is 1 if the corresponding object has the goal letter (digit) and 0 otherwise.

A player's own private state s_j can also constrain her beliefs about her partner's private state s_{−j}. For example, in InfoJigsaw, the active player knows there is a goal, and so we set p^k_j(s_{−j} | s_j, a_{1:t}) = 0 if s_{−j} ∧ s_j = 0.
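Since the grids have at most six cells, Eqn. 1 together with this goal-existence constraint can be realized by brute-force enumeration of the 2^{mn} candidate partner states; a sketch building on the semantics functions above:

```python
from itertools import product
import numpy as np

def level0_belief(s_j, actions, partner_said, shape=(3, 1)):
    """Eqn. 1 plus the goal-existence constraint (a sketch): a uniform
    distribution over partner states consistent with [[a_1:t]]_j, excluding
    states s with s & s_j == 0 (there must be a goal)."""
    support = []
    for bits in product([False, True], repeat=shape[0] * shape[1]):
        s = np.array(bits, dtype=bool).reshape(shape)
        if (s & s_j).any() and in_sequence_meaning(s, actions, partner_said):
            support.append(s)
    if not support:
        return []
    p = 1.0 / len(support)
    return [(s, p) for s in support]   # uniform over the consistent states

# e.g., P_letter's belief after P_digit says "circle" (cf. Figure 7):
# belief = level0_belief(s_letter, ["circle"], [True])
```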


[Figure 7: P_letter's probability distribution over P_digit's private state (shown over the eight candidate states from [0; 0; 0] to [1; 1; 1]) after P_digit says circle in the game shown in Figure 5.]

4.2 Planning

The purpose of planning is to compute a policy π^k_j, which specifies a distribution over player j's actions a_t given all past actions a_{1:t−1}. To construct the policy, we first define an expected utility V^k_j via the following forward-looking recurrence. When the game is over (e.g., in InfoJigsaw, when one player clicks on an object), the expected utility of the dialogue is simply its utility as defined by the game:

V^k_j(s_{+1}, s_{−1}, a_{1:t}) = U(s_{+1}, s_{−1}, a_{1:t}).  (2)

Otherwise, we compute the expected utility assuming that in the next turn, player j chooses action a_{t+1} with probability governed by her policy π^k_j(a_{t+1} | s_j, a_{1:t}):

V^k_j(s_{+1}, s_{−1}, a_{1:t}) = Σ_{a_{t+1}} π^k_j(a_{t+1} | s_j, a_{1:t}) · V^k_{−j}(s_{−1}, s_{+1}, a_{1:t+1}).  (3)

Having defined the expected utility, we now define the policy. First, let D^k_j be the gain in expected utility V^k_{−j}(s_{+1}, s_{−1}, a_{1:t}) over a simple baseline policy that ends the game immediately, yielding utility U(s_{+1}, s_{−1}, a_{1:t−1}) (which is simply a penalty for not finding the correct goal and a penalty for each action). Of course, the partner's private state s_{−j} is unknown and must be marginalized out based on player j's beliefs; let E^k_j be the expected gain. Let the probability of an action a_t be proportional to max(0, E^k_j)^α, where α ∈ [0, ∞) is a hyperparameter that controls the rationality of the agent (a larger α means that the player chooses utility-maximizing actions more aggressively). Formally:

D^k_j = V^k_{−j}(s_{+1}, s_{−1}, a_{1:t}) − U(s_{+1}, s_{−1}, a_{1:t−1}),
E^k_j = Σ_{s_{−j}} p^k_j(s_{−j} | s_j, a_{1:t−1}) · D^k_j,
π^k_j(a_t | s_j, a_{1:t−1}) ∝ max(0, E^k_j)^α.  (4)

In practice, we use a depth-limited recurrence, in which the expected utility is computed assuming that the game will end within f turns and the last action is a click action (meaning that we only consider action sequences of length ≤ f whose last action is a click). Figure 8 shows how P_digit computes the expected gain (Eqn. 4) of saying circle.

[Figure 8: Planning reasoning for the game in Figure 1 (reproduced there in the bottom right). (a) In order to calculate the expected gain (E) of generating circle, P_digit computes, for every state s, the probability of s being P_letter's private state. (b) She then computes the expected utility (V) if she generates circle, assuming P_letter's private state is s.]
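To show how the pieces of Eqn. 4 fit together computationally, here is a schematic Python sketch of the action-selection rule; gain_fn and belief are assumed interfaces (in the full model, gain_fn would itself recurse through Eqn. 3 up to the horizon f), not the paper's code.

```python
def policy_weights(candidate_actions, gain_fn, belief, alpha=10.0):
    """Eqn. 4 (a sketch): pi(a_t) proportional to max(0, E)^alpha, where
    E is the expected gain D = V - U(stop now), averaged over partner
    states under the current belief (a list of (state, prob) pairs)."""
    weights = []
    for a in candidate_actions:
        E = sum(p * gain_fn(a, s_partner) for s_partner, p in belief)
        weights.append(max(0.0, E) ** alpha)
    total = sum(weights)
    if total == 0:                       # no action has positive expected gain
        n = len(candidate_actions)
        return [1.0 / n] * n
    return [w / total for w in weights]
```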

4.3 Pragmatics

The purpose of pragmatics is to take into account the partner's strategizing. We do this by constructing a level-k player that infers the partner's private state, following the tradition of Rational Speech Acts (RSA) (Frank and Goodman, 2012; Goodman and Frank, 2016). Recall that a level-0 player p^0_j (Section 4.1) puts a uniform distribution over all the semantically valid private states of the partner.


[Figure 9: Pragmatic reasoning for the game in Figure 1 (reproduced there in the upper right) at time step 3. Players reason recursively about each other's beliefs: the level-0 player puts a uniform distribution p^0_j over all the states in which at least one circle is goal-consistent, independent of the shared world state and previous actions. The level-1 player assigns probability to the partner's private states s_{−j} proportional to the probability that she would have performed the last action given that state s_{−j}. For example, if [0; 0; 1] were P_digit's private state, then saying bottom would be more probable (given the shared world state); if [1; 1; 1] were P_digit's state, then clicking on the square would be a better option (given the previous actions). But given that P_digit uttered circle, [1; 0; 1] is most likely, as reflected by p^1_j.]

A level-k player assigns probability to the partner's private state proportional to the probability that a level-(k−1) player would have performed the last action a_t:

p^k_j(s_{−j} | s_j, a_{1:t}) ∝ π^{k−1}_{−j}(a_t | s_{−j}, a_{1:t−1}) · p^k_j(s_{−j} | s_j, a_{1:t−2}).  (5)

Figure 9 shows an example of the pragmatic reasoning.
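Computationally, Eqn. 5 is a standard RSA-style reweighting: the level-k listener rescales her belief from two steps back by the probability that a level-(k−1) speaker in each candidate state would have produced a_t. A sketch, with prior and speaker_policy_km1 as assumed interfaces:

```python
def levelk_belief(prior, a_t, speaker_policy_km1):
    """Eqn. 5 (a sketch): p^k(s) proportional to
    pi^{k-1}_{-j}(a_t | s, a_{1:t-1}) * p^k(s | a_{1:t-2}).
    prior: list of (state, prob) pairs; speaker_policy_km1(a, s) gives the
    level-(k-1) partner's probability of choosing action a in state s."""
    scored = [(s, p * speaker_policy_km1(a_t, s)) for s, p in prior]
    z = sum(p for _, p in scored)
    if z == 0:
        return prior                     # degenerate case: keep the old belief
    return [(s, p / z) for s, p in scored]
```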

4.4 A closer look at the meaning of actions

In Section 4.2, we modeled the players as rational agents that choose actions that lead to higher gains in utility. In the pragmatics section (Section 4.3), we described how a player infers the partner's private state, taking into account that her partner is acting cooperatively. The phenomenon that emerges from the combination of the two is the topic of this section.

We first define the belief marginals B_j of a player j to be the marginal probabilities that each object is goal-consistent under the hypothesized partner's private state s_{−j} ∈ ℝ^{m×n}, conditioned on actions a_{1:t}:

B_j(s_j, a_{1:t}) = Σ_{s_{−j}} p^k_j(s_{−j} | s_j, a_{1:t}) · s_{−j}.  (6)

At time t = 0 (before any actions), the belief marginals of both players are m × n matrices with 0.5 in all entries. The change in a belief marginal after observing an action a_t gives a sense of the effective (context-dependent) meaning of that action.
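Under the (state, probability)-pair representation used in the sketches above, Eqn. 6 is a probability-weighted sum of state matrices; a hedged illustration:

```python
import numpy as np

def belief_marginals(belief, shape=(3, 1)):
    """Eqn. 6 (a sketch): B_j = sum_s p(s) * s, the per-object probability
    of being goal-consistent under the partner's hypothesized state."""
    B = np.zeros(shape)
    for s, p in belief:                  # belief: list of (state, prob) pairs
        B += p * s.astype(float)
    return B
```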

We first explain how pragmatics (k > 0 in Eqn. 5) leads to rich action meanings.


[Figure 10: Belief marginals of P_digit (Eqn. 6) after observing sequences of actions for different pragmatic depths k. (a) P_digit's view of a 2 × 2 game with goal A2 (digits 3, 2 / 1, 2); dialogue: P_digit: bottom, P_letter: right. (b) P_digit's estimation of P_letter's state. Without pragmatics (k = 0), P_digit thinks both objects on the right are equally likely to be goal-consistent (marginals [0.500, 0.667; 0.500, 0.667]). With pragmatics (k = 1), P_digit thinks that the object in the bottom right is more likely to be goal-consistent (marginals [0.424, 0.769; 0.423, 0.940]).]

When a player observes her partner's action a_t, she assumes this action was chosen because it results in a higher utility than the alternatives. In other words, she infers that her partner's private state cannot be one in which a_t does not lead to high utility. As an example, saying circle instead of top circle or bottom circle implies that there is more than one goal-consistent circle. The pragmatic depth k governs the extent to which this type of reasoning is applied.

Recall from Section 4.2 that a player chooses an action conditioned on all previous actions; the other player takes this context-dependence into account. As an example, Figure 10 shows how right changes its meaning when it follows bottom.

5 Experiments

5.1 Setup

We set, a priori, the reward for clicking on the goal to +100 and the penalty for clicking on a wrong object to −100. We set α = 10 (Eqn. 4) and the action cost to −50 based on the data. The larger the action cost, the fewer messages will be used before selecting an object. Formally, after k actions:

Utility = −50k + { +100 if the goal object is clicked, −100 otherwise }.  (7)
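As a sanity check of Eqn. 7, a tiny Python helper with the paper's constants (the function name is ours):

```python
def game_utility(num_actions: int, clicked_goal: bool) -> int:
    """Eqn. 7: -50 per action, +100 for clicking the goal, -100 otherwise."""
    return -50 * num_actions + (100 if clicked_goal else -100)

print(game_utility(3, True))   # -50:  three actions, correct click
print(game_utility(2, False))  # -200: two actions, wrong click
```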

We smoothed all policies by adding 0.01 to the probability of each action and re-normalizing. By default, we set the pragmatic depth (Eqn. 4) to k = 1. When computing the expected utility (Eqn. 3) of the game, we use a lookahead of f = 2. Inference looks back b time steps (i.e., Eqn. 1 and Eqn. 5 are based on a_{t−b+1:t} rather than a_{1:t}); we set b = ∞ by default.

We implemented two baseline policies.

Random policy: for player j, the random policy randomly chooses one of the semantically valid (Section 3) actions with respect to s_j, or clicks on a goal-consistent object. Formally, the random policy places a uniform distribution over

{a : s_j ∈ ⟦a⟧} ∪ {click(u, v) : (s_j)_{u,v} = 1}.  (8)

Greedy policy: assigns higher probability to actions that convey more information about the player's private state. We heuristically set the probability of generating an action proportional to how much it shrinks the set of semantically valid states. Formally, for the message actions:

π^msg_j(a_t | a_{1:t−1}, s_j) ∝ |⟦a_{1:t−1}⟧_{−j}| − |⟦a_{1:t}⟧_{−j}|.  (9)

For the clicking actions, we compute the belief state as explained in Section 4.4. Recall that B_{u,v} is the marginal probability that the object in row u and column v is goal-consistent in the partner's private state. Formally, for clicking actions:

π^click_j(click(u, v) | a_{1:t}, s_j) ∝ min((s_j)_{u,v}, B_j(s_j, a_{1:t})_{u,v}).  (10)

Finally, the greedy policy chooses a click action with probability γ and a message action with probability 1 − γ. So that γ increases as the player gets more confident about the position of the goal, we set γ to be the probability of the most probable position of the goal: γ = max_{u,v} π^click_j(click(u, v) | a_{1:t}, s_j).
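A schematic rendering of the whole greedy baseline (Eqns. 9 and 10 plus the γ mixing rule); the shrink callback, standing in for |⟦a_{1:t−1}⟧_{−j}| − |⟦a_{1:t}⟧_{−j}|, and the other interfaces are illustrative assumptions:

```python
def greedy_policy(messages, cells, shrink, s_j, B):
    """Greedy baseline (a sketch). messages: candidate message strings;
    cells: clickable (u, v) positions; shrink(m): Eqn. 9 set-size reduction;
    s_j: own binary state matrix; B: belief marginals from Eqn. 6."""
    click_w = {(u, v): min(float(s_j[u, v]), B[u, v]) for (u, v) in cells}  # Eqn. 10
    z_click = sum(click_w.values())
    click_p = {uv: w / z_click for uv, w in click_w.items()} if z_click else {}
    gamma = max(click_p.values(), default=0.0)  # confidence in the goal position

    msg_w = [max(0.0, shrink(m)) for m in messages]                        # Eqn. 9
    z_msg = sum(msg_w)
    msg_p = ({m: (1 - gamma) * w / z_msg for m, w in zip(messages, msg_w)}
             if z_msg else {})

    # Final mixture: clicks get total mass gamma, messages get 1 - gamma.
    return {**msg_p, **{("click",) + uv: gamma * p for uv, p in click_p.items()}}
```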

5.2 Results

Figure 11 compares the two baselines with PIP on the task of predicting human behavior, as measured by log-likelihood.[4]

[4] We bootstrap the data 1000 times and show 90% confidence intervals.


[Figure 11: Average log-likelihood across messages for the random policy, the greedy policy, PIP, and the ceiling. (a) Performance of PIP and the baselines on all time steps. (b) Performance on only the first time step, along with the ceiling given by the entropy of the human data. Error bars show 90% confidence intervals.]

To estimate the best possible (i.e., ceiling) performance, we compute the entropy of the actions on the first time step based on approximately 100 data points per scenario. For each policy, we rank the actions by their probability in decreasing order (actions with the same probability are randomly ordered), and then compute the average ranking across actions according to the different policies; see Figure 13 for the results.

To assess the different components of PIP (planning, inference, pragmatics), we run PIP ablating one component at a time from the default setting of k = 1, f = 2, and b = ∞ (see Figure 12).

Pragmatics. Let PIP-prag be PIP but with a pragmatic depth (Eqn. 4) of k = 0 rather than k = 1, which means that PIP-prag only draws inferences based on the literal semantics of messages. PIP-prag loses 0.21 in average log-likelihood per action, highlighting the importance of pragmatics in modeling human behavior.

Planning. Let PIP-plan be PIP, but looking ahead only f = 1 step when computing the expected utility (Eqn. 3) rather than f = 2. With a shorter future horizon, PIP-plan tries to give as much information as possible at each turn, whereas human players tend to give information about their state incrementally. PIP-plan cannot capture this behavior and allocates low probability to these kinds of dialogues. PIP-plan has an average log-likelihood 0.05 lower than that of PIP, highlighting the importance of planning.

Inference. Let PIP-infer be PIP, but looking only at the last utterance (b = 1) rather than the full history (b = ∞). The results here are more nuanced. Although PIP-infer actually performs better than PIP over all games, we find that PIP-infer is worse than PIP by an average log-likelihood of 0.15 when predicting messages after time step 3, highlighting the importance of inference, but only in long games. It is likely that the additional noise involved in the inference process decreases performance when backward-looking inference is not actually needed.

6 Related Work and Discussion

Our work touches on ideas in game theory, pragmatic modeling, dialogue modeling, and learning communicative agents, which we highlight below.

Game theory. In game-theoretic terminology (Shoham and Leyton-Brown, 2008), InfoJigsaw is a non-cooperative (there is no offline optimization of the players' policies before the game starts), common-payoff (the players have the same utility), incomplete-information (the players have private state) game with sequential actions. One game-theoretic concept related to our model is rationalizability (Bernheim, 1984; Pearce, 1984): a strategy is rationalizable if it is justifiable to play against a completely rational player. Another related concept is epistemic games (Dekel and Siniscalchi, 2015; Perea, 2012); epistemic game theory studies the behavioral implications of rationality and mutual beliefs in games.

It is important to note that we are not interested in notions of global optima or equilibria; rather, we are interested in modeling human behavior. Communication restricted to a very limited natural language has also been studied in the context of language games (Wittgenstein, 1953; Lewis, 2008; Nowak et al., 1999; Franke, 2009; Huttegger et al., 2010).

Rational speech acts. The pragmatic component of PIP is based on the Rational Speech Acts framework (Frank and Goodman, 2012; Golland et al., 2010), which defines recurrences capturing how one agent reasons about another. Similar ideas were explored in the precursor work of Golland et al. (2010), and much work has ensued (Smith et al., 2013; Qing and Franke, 2014; Monroe and Potts, 2015; Ullman et al., 2016; Andreas and Klein, 2016).


[Figure 12: Performance of ablations of PIP (average log-likelihood per message; whiskers show 90% confidence intervals). (a) Performance over all games and all rounds. (b) Performance over messages after round 3. (c) Parameter setup (top) and expected ranking of human messages according to the different ablations (bottom):

                  PIP    PIP-prag   PIP-plan   PIP-infer
k (pragmatics)     1         0          1          1
f (planning)       2         2          1          2
b (inference)      ∞         ∞          ∞          1
rank (all)        17.1      19.3       17.2       16.9
rank (≥ 3)        10.4      10.8       11.6       13.1

PIP outperforms the planning and pragmatics ablations over all rounds. Looking only one step backward performs better in the first few rounds but worse after round 3.]

[Figure 13: Expected ranking of the human messages according to the different policies (random, greedy, PIP, ceiling), for all rounds and for the first round. Error bars show 90% confidence intervals.]


Most of this work is restricted to production and comprehension of a single utterance. Hawkins et al. (2015) extend these ideas to two utterances (a question and an answer). Vogel et al. (2013b) integrate planning with pragmatics using decentralized partially observable Markov decision processes (DEC-POMDPs). In their task, two bots must find and co-locate with a specific card. In contrast to InfoJigsaw, their task can be completed without communication; their agents communicate only once, sharing the card location. They also only study artificial agents playing together and were not concerned with modeling human behavior.

Learning to communicate. There is a rich literature on multi-agent reinforcement learning (Busoniu et al., 2008). Some works assume the world is completely visible to all agents and have them cooperate without communication (Lauer and Riedmiller, 2000; Littman, 2001); others assume a predefined convention for communication (Zhang and Lesser, 2013; Tan, 1993). There is also work that learns the convention itself (Foerster et al., 2016; Sukhbaatar et al., 2016; Lazaridou et al., 2017; Mordatch and Abbeel, 2018). Lazaridou et al. (2017) put humans in the loop to make the communication more human-interpretable. In comparison to these works, we seek to predict human behavior instead of modeling artificial agents that communicate with each other.

Dialogue. There is also a lot of work in computational linguistics and NLP on modeling dialogue. Allen and Perrault (1980) provide a model that infers the intention/plan of the other agent and uses this plan to generate a response.


Clark and Brennan (1991) explain how two players update their common ground (mutual knowledge, mutual beliefs, and mutual assumptions) in order to coordinate. Recent work in task-oriented dialogue uses POMDPs and end-to-end neural networks (Young, 2000; Young et al., 2013; Wen et al., 2017; He et al., 2017). In this work, instead of learning from a large corpus, we predict human behavior without learning, albeit in a much more strategic, stylized setting (two words per utterance).

7 Conclusion

In this paper, we started with the observation that humans use language in a very contextual way, driven by their goals. We identified three salient aspects (planning, inference, and pragmatics) and proposed a unified model, PIP, that captures all three simultaneously. Our main result is that a very simple, context-independent literal semantics can give rise, via the recurrences, to rich phenomena. We study these phenomena in a new game, InfoJigsaw, and show that PIP is able to capture human behavior.

Reproducibility

All code, data, and experiments for this paper are available on the CodaLab platform at https://worksheets.codalab.org/worksheets/0x052129c7afa9498481185b553d23f0f9/.

Acknowledgments

We would like to thank the anonymous reviewers and the action editor for their helpful comments. We also thank Will Monroe for providing valuable feedback on early drafts.

References

James F. Allen and C. Raymond Perrault. 1980. Analyzing intention in utterances. Artificial Intelligence, 15(3):143–178.

Jacob Andreas and Dan Klein. 2016. Reasoning about pragmatics with neural listeners and speakers. In Empirical Methods in Natural Language Processing (EMNLP), pages 1173–1182.

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. 2016. Learning to compose neural networks for question answering. In Association for Computational Linguistics (ACL), pages 1545–1554.

B. Douglas Bernheim. 1984. Rationalizable strategic behavior. Econometrica: Journal of the Econometric Society, pages 1007–1028.

Lucian Busoniu, Robert Babuska, and Bart De Schutter. 2008. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 38(2):156–172.

Herbert H. Clark and Susan E. Brennan. 1991. Grounding in communication. Perspectives on Socially Shared Cognition.

Eddie Dekel and Marciano Siniscalchi. 2015. Epistemic game theory, volume 4. Handbook of Game Theory with Economic Applications.

Jakob Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), pages 2137–2145.

Michael C. Frank and Noah D. Goodman. 2012. Predicting pragmatic reasoning in language games. Science, 336:998–998.

Michael Franke. 2009. Signal to Act: Game Theory in Pragmatics. Institute for Logic, Language and Computation.

Dave Golland, Percy Liang, and Dan Klein. 2010. A game-theoretic approach to generating spatial descriptions. In Empirical Methods in Natural Language Processing (EMNLP), pages 410–419.

Noah D. Goodman and Michael C. Frank. 2016. Pragmatic language interpretation as probabilistic inference. Trends in Cognitive Sciences, 20(11):818–829.

Herbert P. Grice. 1975. Logic and conversation. Syntax and Semantics, 3:41–58.

Robert X. D. Hawkins, Andreas Stuhlmuller, Judith Degen, and Noah D. Goodman. 2015. Why do you ask? Good questions provoke informative answers. In Proceedings of the Thirty-Seventh Annual Conference of the Cognitive Science Society.

Robert X. D. Hawkins. 2015. Conducting real-time multiplayer experiments on the web. Behavior Research Methods, 47(4):966–976.

He He, Anusha Balakrishnan, Mihail Eric, and Percy Liang. 2017. Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings. In Association for Computational Linguistics (ACL), pages 1766–1776.

Simon M. Huttegger, Brian Skyrms, Rory Smead, and Kevin J. S. Zollman. 2010. Evolutionary dynamics of Lewis signaling games: Signaling systems vs. partial pooling. Synthese, 172(1):177–191.

Jayant Krishnamurthy and Thomas Kollar. 2013. Jointly learning to parse and perceive: Connecting natural language to the physical world. Transactions of the Association for Computational Linguistics (TACL), 1:193–206.

Martin Lauer and Martin Riedmiller. 2000. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In International Conference on Machine Learning (ICML), pages 535–542.

Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2017. Multi-agent cooperation and the emergence of (natural) language. In International Conference on Learning Representations (ICLR).

David Lewis. 2008. Convention: A Philosophical Study. John Wiley & Sons.

Michael L. Littman. 2001. Value-function reinforcement learning in Markov games. Cognitive Systems Research, 2(1):55–66.

Cynthia Matuszek, Nicholas FitzGerald, Luke Zettlemoyer, Liefeng Bo, and Dieter Fox. 2012. A joint model of language and perception for grounded attribute learning. In International Conference on Machine Learning (ICML), pages 1671–1678.

Will Monroe and Christopher Potts. 2015. Learning in the Rational Speech Acts model. In Proceedings of the 20th Amsterdam Colloquium.

Richard Montague. 1973. The proper treatment of quantification in ordinary English. In Approaches to Natural Language, pages 221–242.

Igor Mordatch and Pieter Abbeel. 2018. Emergence of grounded compositional language in multi-agent populations. In Association for the Advancement of Artificial Intelligence (AAAI).

Martin A. Nowak, Joshua B. Plotkin, and David C. Krakauer. 1999. The evolutionary language game. Journal of Theoretical Biology, 200(2):147–162.

David G. Pearce. 1984. Rationalizable strategic behavior and the problem of perfection. Econometrica: Journal of the Econometric Society, pages 1029–1050.

Andrés Perea. 2012. Epistemic Game Theory: Reasoning and Choice. Cambridge University Press.

Christopher Potts. 2012. Goal-driven answers in the Cards dialogue corpus. In Proceedings of the 30th West Coast Conference on Formal Linguistics, pages 1–20.

Ciyang Qing and Michael Franke. 2014. Gradable adjectives, vagueness, and optimal language use: A speaker-oriented model. In Semantics and Linguistic Theory, volume 24, pages 23–41.

Yoav Shoham and Kevin Leyton-Brown. 2008. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press.

Nathaniel J. Smith, Noah D. Goodman, and Michael C. Frank. 2013. Learning and using language via recursive pragmatic reasoning about other agents. In Advances in Neural Information Processing Systems (NIPS), pages 3039–3047.

Sainbayar Sukhbaatar, Rob Fergus, et al. 2016. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems (NIPS), pages 2244–2252.

Ming Tan. 1993. Multi-agent reinforcement learning: Independent vs. cooperative agents. In International Conference on Machine Learning (ICML), pages 330–337.

Tomer D. Ullman, Yang Xu, and Noah D. Goodman. 2016. The pragmatics of spatial language. In Proceedings of the 38th Annual Conference of the Cognitive Science Society.

Adam Vogel, Max Bodoia, Christopher Potts, and Daniel Jurafsky. 2013a. Emergence of Gricean maxims from multi-agent decision theory. In North American Association for Computational Linguistics (NAACL), pages 1072–1081.

Adam Vogel, Christopher Potts, and Dan Jurafsky. 2013b. Implicatures and nested beliefs in approximate decentralized-POMDPs. In Association for Computational Linguistics (ACL), pages 74–80.

Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In European Association for Computational Linguistics (EACL), pages 438–449.

Ludwig Wittgenstein. 1953. Philosophical Investigations. Blackwell, Oxford.

Steve Young, Milica Gasic, Blaise Thomson, and Jason D. Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, number 5, pages 1160–1179.

Steve J. Young. 2000. Probabilistic methods in spoken-dialogue systems. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 358(1769):1389–1402.

Chongjie Zhang and Victor Lesser. 2013. Coordinating multi-agent reinforcement learning with limited communication. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, pages 1101–1108.
