Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 2173–2185, November 16–20, 2020. © 2020 Association for Computational Linguistics


If Beam Search is the Answer, What was the Question?

Clara Meister   Ryan Cotterell   Tim Vieira
ETH Zürich   University of Cambridge   Johns Hopkins University
[email protected]   [email protected]   [email protected]

Abstract

Quite surprisingly, exact maximum a posteriori (MAP) decoding of neural language generators frequently leads to low-quality results (Stahlberg and Byrne, 2019). Rather, most state-of-the-art results on language generation tasks are attained using beam search despite its overwhelmingly high search error rate. This implies that the MAP objective alone does not express the properties we desire in text, which merits the question: if beam search is the answer, what was the question? We frame beam search as the exact solution to a different decoding objective in order to gain insights into why high probability under a model alone may not indicate adequacy. We find that beam search enforces uniform information density in text, a property motivated by cognitive science. We suggest a set of decoding objectives that explicitly enforce this property and find that exact decoding with these objectives alleviates the problems encountered when decoding poorly calibrated language generation models. Additionally, we analyze the text produced using various decoding strategies and see that, in our neural machine translation experiments, the extent to which this property is adhered to strongly correlates with BLEU. Our code is publicly available at https://github.com/rycolab/uid-decoding.

1 Introduction

As a simple search heuristic, beam search has been used to decode models developed by the NLP community for decades. Indeed, it is noteworthy that beam search is one of the few NLP algorithms that has stood the test of time: it has remained a cornerstone of NLP systems since the 1970s (Reddy, 1977). As such, it became the natural choice for decoding neural probabilistic text generators—whose design makes evaluating the full search space impossible (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Vinyals and Le, 2015; Yin et al., 2016). While there is no formal guarantee that beam search will return—or even approximate—the highest-scoring candidate under a model, it has repeatedly proven its merit in practice (Serban et al., 2017; Edunov et al., 2018; Yang et al., 2019) and, thus, has largely been tolerated—even embraced—as NLP's go-to search heuristic. However, in the context of neural machine translation (NMT), a shocking empirical finding has emerged: using beam search to decode sentences from neural text generators almost invariably leads to better text than using exact search (or beam search with a very large beam size). In fact, Stahlberg and Byrne (2019) report that exact search returns the empty string in > 50% of cases,1 showing that the success of beam search does not stem from its ability to approximate exact decoding in practice, but rather from a hidden inductive bias embedded in the algorithm. This inductive bias appears to be paramount for generating desirable text from neural probabilistic text generators. While several works explore this phenomenon (Murray and Chiang, 2018; Yang et al., 2018; Stahlberg and Byrne, 2019; Cohen and Beck, 2019), no one has yet hypothesized what beam search's hidden inductive bias may be. Our work fills this gap.

Figure 1: Average standard deviation σ of surprisals (per sentence) and corpus BLEU for translations generated using exact search over the MAP objective with a greedy regularizer (Eq. (11)) with varying degrees of λ. References for beam search (k = 5 and k = 100) are included. The sub-graph shows the explicit relationship between BLEU and σ. λ and σ axes are log-scaled.

We analyze the beam search blessing by reverse engineering an objective for which beam search returns the exact solution. Specifically, we introduce a regularizer for the standard (MAP) decoding objective for text generation models such that the exact solution to this regularized objective is equivalent to the solution found by beam search under the unmodified objective. Qualitative inspection reveals that our “beam search regularizer” has a clear connection to a theory in cognitive science—the uniform information density hypothesis (UID; Levy and Jaeger, 2007). The UID hypothesis states that—subject to the constraints of the grammar—humans prefer sentences that distribute information (in the sense of information theory) equally across the linguistic signal, e.g., a sentence. In other words, human-produced text, regardless of language, tends to have evenly distributed surprisal, formally defined in information theory as negative log-probability. This connection suggests beam search has an interpretation as exact decoding, but with a UID-promoting regularizer that encourages evenly distributed surprisal in generated text. This insight naturally leads to the development of several new regularizers that likewise enforce the UID property.

Empirically, we experiment with our novel regularizers in the decoding of NMT models. We first observe a close relationship between the standard deviation of surprisals—an operationalization of UID—and BLEU, which suggests that high-quality text does indeed exhibit the UID property. Additionally, we find that even with exact search, our regularized objective leads to performance similar to beam search on standard NMT benchmarks. Both of these observations are reflected in Fig. 1. Lastly, we see that our regularizers alleviate the text-quality degradation typically seen when decoding with larger beam sizes. We take all the above as evidence that our proposed explanation of beam search's inductive bias indeed elucidates why the algorithm performs so well as a search heuristic for language generation tasks.

1 This rate tends to decrease for larger models, although it is often still a considerable percentage.

2 Neural Probabilistic Text Generation

Probabilistic text generators define a probability distribution pθ(y | x) over an output space of hypotheses Y (to be defined in Eq. (1)) conditioned on an input x.2 Modern generators are typically parameterized by a deep neural network—possibly recurrent—with a set of learned weights θ. In the case of text generation, the full set of possible hypotheses grows exponentially with the vocabulary size |V|. We consider the set of complete hypotheses, i.e., valid outputs, as

$$\mathcal{Y} := \{\text{BOS} \circ \mathbf{v} \circ \text{EOS} \mid \mathbf{v} \in \mathcal{V}^*\} \quad (1)$$

where ◦ is string concatenation and V* is the Kleene closure of V. In words, valid hypotheses are text, e.g., sentences or phrases, padded with distinguished tokens, BOS and EOS. In this work, we consider models that are locally normalized, i.e., the model pθ is defined as a product of probability distributions:

$$p_\theta(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{|\mathbf{y}|} p_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{<t}) \quad (2)$$

where each pθ(· | x, y<t) is a distribution with support over V̄ := V ∪ {EOS} and y<1 = y0 := BOS.

The decoding objective for text generation aims to find the most-probable hypothesis among all candidate hypotheses, i.e., we aim to solve the following optimization problem:

$$\mathbf{y}^\star = \underset{\mathbf{y} \in \mathcal{Y}}{\operatorname{argmax}} \; \log p_\theta(\mathbf{y} \mid \mathbf{x}) \quad (3)$$

This is commonly known as maximum a posteriori (MAP) decoding since pθ is a probability model. While there exists a wealth of literature on decoding algorithms for statistical text generation models, e.g., phrase-based machine translation models, many of these methods cannot reasonably be used with neural models. Specifically, due to the non-Markovian structure of most neural text generators, dynamic-programming algorithms for searching over the exponentially large space are not efficient in this setting. Indeed, there are formal results that solving Eq. (3) with a recurrent neural network is NP-hard (Chen et al., 2018). Therefore, decoding is performed almost exclusively with heuristic methods, such as beam search.

2 The input could be another sentence, a semantic structure or an image, to name a few examples.
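To make the scoring scheme concrete, here is a minimal sketch (not from the paper's codebase) of how a hypothesis is scored under a locally normalized model, Eq. (2); the `next_token_logprobs` interface is an assumption for illustration, standing in for whatever model provides per-step distributions over V̄.

```python
from typing import Callable, Dict, List

# Hypothetical interface (an assumption of this sketch, not any particular library):
# next_token_logprobs(x, prefix) returns log p_theta(token | x, prefix) for every
# token in the extended vocabulary V ∪ {EOS}.
NextTokenLogprobs = Callable[[str, List[str]], Dict[str, float]]

def sequence_logprob(x: str, y: List[str],
                     next_token_logprobs: NextTokenLogprobs) -> float:
    """Score of hypothesis y = [BOS, ..., EOS] under Eq. (2): the sum of the
    per-step log-probabilities of each chosen token given its prefix."""
    assert y[0] == "BOS" and y[-1] == "EOS", "valid hypotheses are BOS-/EOS-padded"
    total = 0.0
    for t in range(1, len(y)):
        total += next_token_logprobs(x, y[:t])[y[t]]
    return total

# MAP decoding (Eq. (3)) would then be the argmax of sequence_logprob over all of Y,
# which is intractable to enumerate; hence the heuristics discussed next.
```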

2.1 Beam Search

Beam search is a form of pruned breadth-first search in which the breadth is limited to k ∈ Z+, i.e., a maximum of k hypotheses are expanded at each time step. We express beam search as the following recursion:

$$Y_0 = \{\text{BOS}\} \quad (4)$$

$$Y_t = \underset{Y' \subseteq B_t,\, |Y'| = k}{\operatorname{argmax}} \; \log p_\theta(Y' \mid \mathbf{x}) \quad (5)$$

where we define the candidate set at t > 0 as

$$B_t = \{\mathbf{y}_{t-1} \circ y \mid y \in \bar{\mathcal{V}} \text{ and } \mathbf{y}_{t-1} \in Y_{t-1}\} \quad (6)$$

For notational convenience, we define EOS ◦ EOS = EOS. The above algorithm terminates after a fixed number of iterations nmax,3 and the set Ynmax is returned. We overload pθ(· | x) to take a set of hypotheses as an argument instead of just a single hypothesis; in this case, pθ(Y | x) := ∏_{y ∈ Y} pθ(y | x).4 Using a similar schema, the argmax may also operate over a different objective, e.g., log-probabilities combined with various rewards or penalties, such as those discussed in §2.2.
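For concreteness, a minimal sketch of this recursion in Python follows (assuming the same hypothetical `next_token_logprobs` interface as above; this is not the fairseq implementation used in the experiments).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model interface: log p_theta(token | x, prefix) for every token in V ∪ {EOS}.
NextTokenLogprobs = Callable[[str, Tuple[str, ...]], Dict[str, float]]

def beam_search(x: str, next_token_logprobs: NextTokenLogprobs,
                k: int = 5, n_max: int = 100) -> List[Tuple[Tuple[str, ...], float]]:
    """Pruned breadth-first search implementing Eq. (4)-(6): keep the k highest-scoring
    (partial) hypotheses at each step; EOS-terminated hypotheses are carried forward
    unchanged (EOS ∘ EOS = EOS)."""
    beam = [(("BOS",), 0.0)]                     # Y_0 = {BOS}, with log-probability 0
    for _ in range(n_max):
        candidates = []
        for prefix, score in beam:
            if prefix[-1] == "EOS":              # already complete; keep as-is
                candidates.append((prefix, score))
                continue
            for tok, lp in next_token_logprobs(x, prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        # Y_t: the k candidates with the highest cumulative log-probability
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(p[-1] == "EOS" for p, _ in beam):  # early termination is safe here
            break
    return beam
```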

Beam search has a long history in sequence transduction. For example, many of the decoding strategies used in statistical machine translation (SMT) systems were variants of beam search (Och et al., 1999; Koehn et al., 2003; Koehn, 2004). As language generation systems moved away from phrase-based statistical approaches and towards neural models, beam search remained the de facto decoding algorithm (Sutskever et al., 2014; Vinyals and Le, 2015). However, it has been observed that when used as a decoding algorithm for neural text generation, beam search (for small beams) typically has a large percentage of search errors (Stahlberg and Byrne, 2019).

Counterintuitively, it is widely known that increasing the beam size beyond 5 can hurt model performance in terms of downstream evaluation metrics (e.g., BLEU, ROUGE); while a number of prior works have referred to this phenomenon as a curse (Koehn and Knowles, 2017; Yang et al., 2018; Cohen and Beck, 2019), it should perhaps be seen as a blessing. Beam search typically generates well-formed and coherent text from probabilistic models whose global optimum is, in many cases, the empty string; such models might otherwise fail to produce text at all. As we demonstrate in §4, this text also tends to be human-like. We will subsequently explore possible reasons as to why beam search leads to desirable text from models that are otherwise poorly calibrated, i.e., poor representations of the true distribution p(y | x) (Guo et al., 2017).

3 If all hypotheses in Yt end in EOS for some t < nmax, then we may terminate beam search early, as it is then guaranteed that Yt = Ynmax. We do not consider further early-stopping methods for beam search (Huang et al., 2017; Yang et al., 2018; Meister et al., 2020) as they generally should not affect the quality of the decoded set.

4 There do exist objectives that take into account interactions between hypotheses in a set, e.g., diverse beam search (Vijayakumar et al., 2018), but we do not consider those here.

2.2 Alternative Decoding Objectives

When the MAP objective (Eq. (3)) is used for decoding neural text generators, the results are generally not satisfactory. Among other problems, the generated text is often short and defaults to high-frequency words (Cho et al., 2014; Vinyals and Le, 2015; Shen et al., 2016). Methods such as length and coverage normalization (Jean et al., 2015; Tu et al., 2016; Murray and Chiang, 2018), which augment the MAP objective with an additive term or multiplicative factor, have been adopted to alleviate these issues. For example, two such forms of length5 and coverage normalization use the following modified MAP objectives, respectively, during decoding to produce higher-quality output:

$$\log p_\theta(\mathbf{y} \mid \mathbf{x}) + \lambda |\mathbf{y}| \quad (7)$$

$$\log p_\theta(\mathbf{y} \mid \mathbf{x}) + \lambda \sum_{i=1}^{|\mathbf{x}|} \log \min\Big(1, \sum_{j=1}^{|\mathbf{y}|} \alpha_{ij}\Big) \quad (8)$$

where λ > 0 is the (tunable) strength of the reward and αij is the attention weight (Bahdanau et al., 2015) from the jth decoding step over the ith input. Eq. (7) directly rewards longer outputs (He et al., 2016) while Eq. (8) aims to reward coverage of input words in a prediction, using the attention mechanism of an encoder–decoder model as an oracle (Tu et al., 2016). While such methods help obtain state-of-the-art results in neural MT (Wu et al., 2016; Gehring et al., 2017; Ng et al., 2019), we view them as a patch to the observed problems. The fact that text quality still degrades with increased beam sizes when these rewards are used (Koehn and Knowles, 2017; Ott et al., 2018a) suggests that they do not address the inherent issues with text generation systems. We subsequently hypothesize about the nature of these issues and provide a set of linguistically motivated regularizers—inspired by beam search—that appear to alleviate them.

5 The predominant form of length normalization divides (log) sequence probability by the length of the hypothesis rather than using an additive reward as in He et al. (2016). We present results from the former in our experiments as we find it empirically leads to better performance.
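As an illustration only, these rescoring terms could be computed along the following lines; this is a sketch under the assumption that attention weights are available as a simple nested list, and the function and parameter names are ours rather than those of any particular toolkit.

```python
import math
from typing import List

def additive_length_reward(logprob: float, length: int, lam: float) -> float:
    """Eq. (7): add a reward proportional to hypothesis length, λ|y|."""
    return logprob + lam * length

def length_normalized(logprob: float, length: int) -> float:
    """The variant from footnote 5: divide the sequence log-probability by |y|."""
    return logprob / max(length, 1)

def coverage_reward(logprob: float, attention: List[List[float]], lam: float) -> float:
    """Eq. (8): reward attention coverage of the source. attention[j][i] is the
    attention weight α_ij of decoding step j over source position i."""
    reward = 0.0
    for i in range(len(attention[0])):
        covered = sum(step[i] for step in attention)        # Σ_j α_ij
        reward += math.log(min(1.0, max(covered, 1e-9)))    # small floor avoids log(0)
    return logprob + lam * reward
```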

3 Deriving Beam Search

We introduce a regularized decoding framework. The idea is simple: we seek to decode by solving the regularized optimization problem

$$\mathbf{y}^\star = \underset{\mathbf{y} \in \mathcal{Y}}{\operatorname{argmax}} \; \big(\log p_\theta(\mathbf{y} \mid \mathbf{x}) - \lambda \cdot \mathcal{R}(\mathbf{y})\big) \quad (9)$$

for a strategically chosen R(·). Clearly, for certain R(·), we recover the decoding objectives discussed in §2.2. The question we ask in this work is the following: if we want to view beam search as an exact-decoding algorithm, which R(·) should we choose to recover beam search?

We discovered an elegant answer rooted in information theory and cognitive science (the connections are discussed in depth in §4). We first define the model's time-dependent surprisals, an information-theoretic quantity that characterizes the amount of new information expressed at time t:

$$u_0(\text{BOS}) = 0, \qquad u_t(y) = -\log p_\theta(y \mid \mathbf{x}, \mathbf{y}_{<t}) \;\; \text{for } t \geq 1 \quad (10)$$

Note that minimally surprising means maximally probable. For the special case of greedy decoding, where k = 1, the following choice of regularizer recovers beam search for sufficiently large λ:

$$\mathcal{R}_{\text{greedy}}(\mathbf{y}) = \sum_{t=1}^{|\mathbf{y}|} \Big( u_t(y_t) - \min_{y' \in \bar{\mathcal{V}}} u_t(y') \Big)^2 \quad (11)$$

The intuition behind Eq. (11) is to encourage locally optimal decisions: every local surprisal ut should be close to the minimally surprising choice. In the limiting case where locally optimal decisions are not just encouraged, but rather enforced, we recover greedy search.

Formally, we have the following theorem:

Theorem 3.1. The argmax of log pθ(y | x) − λ · Rgreedy(y) is exactly computed by greedy search in the limiting case as λ → ∞.

Proof. By induction. In App. A.

Theorem 3.1 establishes that greedy search is the limiting case of a regularizer that seeks to encourage decisions that have high probability locally. In contrast, the optimal MAP solution will generally not have this property. This is because a globally optimal MAP decoder may require a locally suboptimal decision for the sake of being able to make a compensatory decision later that leads to global optimality.6
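As a small illustration of the regularized objective in Eq. (9) with the greedy regularizer of Eq. (11), the following sketch operates over precomputed per-step surprisal tables; the data layout is an assumption of ours, not something prescribed by the paper.

```python
from typing import Dict, List

def greedy_regularizer(stepwise_surprisals: List[Dict[str, float]],
                       chosen: List[str]) -> float:
    """Eq. (11): sum of squared gaps between the surprisal of the chosen token and the
    minimum achievable surprisal at that step. stepwise_surprisals[t] maps every
    candidate token to u_t(token); chosen[t] is the token actually selected
    (both indexed from t = 1, i.e. excluding BOS)."""
    penalty = 0.0
    for u_t, y_t in zip(stepwise_surprisals, chosen):
        penalty += (u_t[y_t] - min(u_t.values())) ** 2
    return penalty

def regularized_score(logprob: float, regularizer_value: float, lam: float) -> float:
    """The regularized decoding objective of Eq. (9): log p_theta(y|x) - λ·R(y)."""
    return logprob - lam * regularizer_value
```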

We now consider the generalization of greedy search (k = 1) to full beam search (k ≥ 1). Recall that beam search returns not just a single output, but rather a set of outputs. Thus, we must consider the set-decoding objective

$$Y^\star = \underset{Y \subseteq \mathcal{Y},\, |Y| = k}{\operatorname{argmax}} \; \big(\log p_\theta(Y \mid \mathbf{x}) - \lambda \cdot \mathcal{R}(Y)\big) \quad (12)$$

where, as before, we have used our overloaded notation pθ(· | x) to score sets of hypotheses. Similarly to Rgreedy, we formulate a greedy set-regularizer to recover beam search:

$$\mathcal{R}_{\text{beam}}(Y) = \sum_{t=1}^{n_{\max}} \Big( u_t(Y_t) - \min_{Y' \subseteq B_t,\, |Y'| = k} u_t(Y') \Big)^2 \quad (13)$$

where Yt = {y1:t | y ∈ Y} corresponds to the set of hypotheses expanded by t steps.7 Note that we additionally overload surprisal to operate on sets, ut(Y) = ∑_{y ∈ Y} ut(y). We prove an analogous theorem to Theorem 3.1 for this regularizer.

Theorem 3.2. The argmax of log pθ(Y | x) − λ · Rbeam(Y) is computed by beam search with beam size k = |Y| as λ → ∞.

Proof. The proof follows from the same argument as Theorem 3.1, albeit with sets instead of an individual hypothesis.

6 Indeed, we only have formal guarantees for greedy algorithms when local optimality translates into global optimality (Kleinberg and Tardos, 2005, Chapter 4).

7 This includes both incomplete hypotheses of length t and complete hypotheses that have reached EOS at step ≤ t.


Note that in the (predominant) case where we want to return a single candidate sentence as the output rather than an entire set—as would be generated by Eq. (12)—we can take the highest-probability sequence in the chosen set Y⋆ as our decoded output. The objective in Eq. (12) boils down to a subset-selection problem which, given the size of Y, is a computationally prohibitive optimization problem. Nonetheless, we can use it to analyze the properties enforced on generated text by beam search.

4 From Beam Search to UID

The theoretical crux of this paper hinges on a proposed relationship between beam search and the uniform information density hypothesis (Levy, 2005; Levy and Jaeger, 2007), a concept from cognitive science:

Hypothesis 4.1. “Within the bounds defined by grammar, speakers prefer utterances that distribute information uniformly across the signal (information density). Where speakers have a choice between several variants to encode their message, they prefer the variant with more uniform information density (ceteris paribus)” (Jaeger, 2010).

At its core, the theory seeks to explain various aspects of human language processing in terms of information theory; it is often applied to an area of psycholinguistics known as sentence processing, where the UID hypothesis is used to explain experimental data (Hale, 2001). As the UID hypothesis concerns a cognitive process (virtually) independent of the language in use, the theory should hold across languages (Jaeger and Tily, 2011).

To see the hypothesis in action, consider the classic case of syntactic reduction from Levy and Jaeger (2007):

(1) How big is [NP the family_i [RC (that) you cook for __i]]?

In the above example, the sentence does not require the relativizer that at the start of the relative clause (denoted by RC); it would also be syntactically correct without it. However, many would agree that the relativizer makes the text qualitatively better. The information-theoretic explanation of this perception is that without the relativizer, the first word of a relative clause conveys two pieces of information simultaneously: the onset of a relative clause and part of its internal contents. Including the relativizer spreads this information across two words, thereby distributing information across the sentence more uniformly and avoiding instances of high surprisal—which, from a psycholinguistic perspective, are displeasing. In short, the relativizer helps to ensure the UID property of the sentence.

Importantly, the preference suggested by the UID hypothesis is between possible utterances (i.e., outputs) where grammaticality and information content are held constant. Any violation of these assumptions presents confounding factors when measuring, or optimizing, the information density of the generated text. In our setting, there is reason to believe that grammaticality and information content are approximately held constant while selecting between hypotheses. First, the high-probability outputs of neural generation models tend to be grammatical (Holtzman et al., 2020). Second, because decoding is conditioned on a specific input x, the conditional probability model pθ(y | x) is able to assign high probability to outputs y that are plausible outputs (e.g., translations) of the given x. Thus, even though the various y are not constrained to be semantically equivalent to one another, they tend to express similar information because they are at least relevant to the same x. This is why our regularized optimization problem Eq. (9) combines an information-density regularizer with log pθ(y | x): the term log pθ(y | x) rewards grammaticality and content relevance, whereas the information-density regularizer encourages the human preferences posited by the UID hypothesis. The parameter λ allows the preferences to be calibrated to perform well on downstream evaluation metrics, such as BLEU and ROUGE.

4.1 The UID Bias in Beam Search

It may not be immediately obvious how the UID hypothesis relates to beam search. After all, beam search narrows the scope of the search to only the lowest-surprisal candidates at each time step, which does not clearly lead to a uniform distribution of surprisals in the final decoded sequences. The connection is best seen visually.

Fig. 2 shows the time-dependent surprisals ut under the model of several candidate translations (German to English). Recall that we have ut(y) ∈ [0, ∞) and that the standard decoding objective explicitly minimizes the sum of surprisals, i.e., maximizes log-probability. Therefore, the only way the distribution of a solution can become distinctly non-uniform is when there are several high-surprisal decisions in the mix; we observe this in the orange and red curves. Intuitively, this corresponds to the notion of compensation discussed earlier: a globally optimal decoding scheme may select a high-surprisal step at some point in order to shorten the length of the path or to take a low-surprisal step later on. We observe an extreme example of this behavior above: selecting the EOS character at the first step leads to a very non-uniform distribution, i.e., the degenerate distribution, which violates our operationalization of UID described subsequently. In summary, we see that as λ is decreased, the decoded sentences obey the UID property less strictly. Indeed, setting λ = 0, i.e., exact inference of the MAP objective, results in the empty string.

Figure 2: Surprisals (according to pθ) by time step of sequences generated with various decoding strategies. Values of λ indicate the greedy regularizer was used with the corresponding λ value. Note that beam search (k = 5) and exact search (λ = 1.0) return the same prediction in this example, and thus are represented by the same line.

A number of successful sampling methods (nucleus (top-p) sampling (Holtzman et al., 2020) and top-k sampling (Fan et al., 2018)) enforce the UID property in generated text by the same logic as above. Both methods eliminate many of the high-surprisal choices at any given decoding step by narrowing the set of tokens that may be chosen.

4.2 Cognitive Motivation for Beam Search

The goal of this work is to expose a possible inductive bias of beam search. We now exhibit our primary hypothesis:

Hypothesis 4.2. Beam search is a cognitively motivated search heuristic for decoding language generation models. The success of beam search on such tasks is, in part, due to the fact that it inherently biases the search procedure towards text that humans prefer.

The foundation of the argument for this hypothesis follows naturally from the previous sections: first, we demonstrated in §3 that beam search is an exact decoding algorithm for a certain regularized objective—to wit, the one in Eq. (9). Qualitatively, we related the behavior of the regularizer to the UID hypothesis from cognitive science. As a final step, we next provide operationalizations of UID—in the form of regularizers within our regularized decoding framework—through which we can empirically test the validity of this hypothesis.

5 Generalized UID Decoding

If beam search is trying to optimize for UID, can we beat it at its own game? This section develops a battery of possible sentence-level UID measures, which can be used as regularizers in our regularized decoding framework and compared experimentally on downstream evaluation metrics.

Variance Regularizer. We first consider the variance regularizer from Jain et al. (2018). In essence, UID concerns the distribution of information over the course (i.e., time steps) of a sentence. A natural measure for this is the variance of the surprisals:

$$\mathcal{R}_{\text{var}}(\mathbf{y}) = \frac{1}{|\mathbf{y}|} \sum_{t=1}^{|\mathbf{y}|} \big( u_t(y_t) - \mu \big)^2 \quad (14)$$

where μ = (1/|y|) ∑_{t=1}^{|y|} ut(yt). This regularizer, in contrast to Eq. (11), is a much more straightforward encoding of the UID: it directly operationalizes UID through variance.

Local Consistency. Next we consider a local consistency regularizer, also taken from Jain et al. (2018), that encourages adjacent surprisals to have similar magnitude:

$$\mathcal{R}_{\text{local}}(\mathbf{y}) = \frac{1}{|\mathbf{y}|} \sum_{t=1}^{|\mathbf{y}|} \big( u_t(y_t) - u_{t-1}(y_{t-1}) \big)^2 \quad (15)$$

Again, this is a straightforward encoding of the UID: if every surprisal is similar to its neighbor, the distribution of surprisals will be close to uniform. Note that both of the above regularizers are defined for all decoding steps t > 0 since we define u0(y0) = 0, y0 = BOS for all valid hypotheses.

Max Regularizer. We propose a UID-inspired regularizer of our own design that exploits the nature of MAP decoding, for which the overarching goal is to find a solution with low surprisal. In this setting, one strategy is to penalize decisions that move the distribution away from 0, the lowest possible surprisal. This suggests that

$$\mathcal{R}_{\max}(\mathbf{y}) = \max_{t=1,\dots,|\mathbf{y}|} u_t(y_t) \quad (16)$$

would regularize for UID. Such a regularizer would also directly penalize extreme compensation during decoding (discussed in §3). It is worth noting that this regularizer has a connection to entropy regularization, which can be seen by looking at the formula for Rényi entropy.

Squared Regularizer. Finally, we consider a novel squared penalty that, again, exploits the goal of MAP decoding. If we wish to keep everything uniform, we can try to push all surprisals close to 0, but this time with a squared penalty:

$$\mathcal{R}_{\text{square}}(\mathbf{y}) = \sum_{t=1}^{|\mathbf{y}|} u_t(y_t)^2 \quad (17)$$
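For reference, a sketch of all four regularizers over a list of per-token surprisals u_1, ..., u_|y| is given below; these are pure-Python illustrations of Eq. (14)–(17), not the authors' implementation.

```python
from typing import List

def variance_regularizer(u: List[float]) -> float:
    """Eq. (14): variance of the surprisals u_1, ..., u_|y|."""
    mu = sum(u) / len(u)
    return sum((ut - mu) ** 2 for ut in u) / len(u)

def local_consistency_regularizer(u: List[float]) -> float:
    """Eq. (15): mean squared difference between adjacent surprisals,
    with u_0 = 0 for the BOS position."""
    padded = [0.0] + u
    return sum((padded[t] - padded[t - 1]) ** 2 for t in range(1, len(padded))) / len(u)

def max_regularizer(u: List[float]) -> float:
    """Eq. (16): the largest surprisal in the sequence."""
    return max(u)

def squared_regularizer(u: List[float]) -> float:
    """Eq. (17): sum of squared surprisals, pushing every u_t towards 0."""
    return sum(ut ** 2 for ut in u)
```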

Experimentally, we expect to see the following: if encouraging decoded text to exhibit UID is helpful—and our logic in constructing regularizers is sound—all the regularizers (Eq. (14) to (17)) should lead to roughly the same performance under exact decoding and beam search with large beam widths. Such results would not only validate the connection between UID and high-quality text; comparable performance of optimal beam search8 and exact search under our regularized objective would provide explicit evidence for our declarative explanation of the inductive bias in beam search.

6 Experiments

We explore how encouraging uniform information density in text generated by neural probabilistic text generators affects its downstream quality. To this end, we decode NMT models using the regularized objective (Eq. (9)) with our UID regularizers. We perform exact decoding for a range of λ and observe how text quality (quantified by BLEU (Papineni et al., 2002) using the SacreBLEU (Post, 2018) system) and the distribution of surprisal change. We additionally evaluate our regularizers under the beam search decoding strategy to see if penalizing violations of UID alleviates the text-quality degradation typically seen with increased beam widths.
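For readers reproducing the evaluation, corpus BLEU with SacreBLEU can be computed roughly as follows; the strings are toys for illustration, and the exact tokenization and scoring options used in the paper are not specified here.

```python
import sacrebleu

# Toy hypotheses/references used purely for illustration.
hypotheses = ["the cat sat on the mat .", "a man is playing a guitar ."]
references = ["the cat sat on the mat .", "a man plays the guitar ."]

# corpus_bleu takes a list of hypothesis strings and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```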

Experiments are performed using models trained on the IWSLT'14 De-En (Cettolo et al., 2012) and WMT'14 En-Fr (Bojar et al., 2014) datasets. For reproducibility, we use the model provided by fairseq (Ott et al., 2019) for the WMT'14 task;9 we use the data pre-processing scripts and recommended hyperparameter settings provided by fairseq for training a model on the IWSLT'14 De-En dataset. We use the Newstest'14 dataset as the test set for the WMT'14 model. All model and data information can be found in the fairseq NMT repository.10

6.1 Exact Decoding

To perform exact decoding of neural probabilistic text generators, we build on the decoding framework of Stahlberg et al. (2017), albeit using Dijkstra's algorithm (Dijkstra, 1959) instead of depth-first search, as we find it decreases decoding time. Note that Dijkstra's algorithm is guaranteed to find the global optimum when path cost is monotonically increasing, which is the case for hypotheses under the scoring scheme used by neural probabilistic text generators (see Meister et al. (2020) for a more detailed discussion). While the variance and local consistency regularizers (Eq. (14) and (15)) break this monotonicity property, we can still guarantee optimality by using a stopping criterion similar to the one proposed by Yang et al. (2018): explicitly, we check whether the top-scoring complete hypothesis has a greater score than the maximum possible score of any hypothesis in the queue. All scores are bounded due to the maximum-length criterion. Additionally, we lower-bound each search by the score of the empty string to decrease the memory footprint, i.e., we stop considering hypotheses whose scores (or maximum possible scores in the case of Eq. (14) and (15)) drop below that of the empty string at any time step.

Figure 3: BLEU as a function of beam width for various regularizers. We choose λ for each regularizer by best performance on validation sets (see App. B). y-scales are broken to show minimum BLEU values. The x-axis is log-scaled.

8 By optimal beam search, we mean beam search using the beam width that empirically leads to the best results.

9 This model uses a transformer architecture (Vaswani et al., 2017) and was trained as in Ott et al. (2018b).

10 https://github.com/pytorch/fairseq/tree/master/examples/translation
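Returning to the exact decoding procedure of this subsection, here is a minimal best-first (Dijkstra-style) sketch for the unregularized MAP objective with lower-bound pruning by the empty string. It assumes the hypothetical `next_token_logprobs` interface used in earlier sketches; the actual experiments build on SGNMT and additionally handle the regularized objectives via the adjusted stopping criterion.

```python
import heapq
from typing import Callable, Dict, Tuple

# Hypothetical model interface: log p_theta(token | x, prefix) for every token in V ∪ {EOS}.
NextTokenLogprobs = Callable[[str, Tuple[str, ...]], Dict[str, float]]

def exact_map_decode(x: str, next_token_logprobs: NextTokenLogprobs,
                     n_max: int = 200) -> Tuple[Tuple[str, ...], float]:
    """Best-first search for the MAP hypothesis (Eq. (3)). Because extending a prefix
    can only lower its log-probability, the first EOS-terminated hypothesis popped
    from the max-priority queue is globally optimal. Hypotheses scoring below the
    empty string (BOS EOS) are pruned, mirroring the lower bound described above."""
    empty_score = next_token_logprobs(x, ("BOS",))["EOS"]
    fallback = (("BOS", "EOS"), empty_score)
    heap = [(-0.0, ("BOS",))]            # heapq is a min-heap, so scores are negated
    while heap:
        neg_score, prefix = heapq.heappop(heap)
        score = -neg_score
        if prefix[-1] == "EOS":
            return prefix, score         # first complete hypothesis popped is optimal
        if len(prefix) >= n_max:
            continue                     # enforce the maximum-length criterion
        for tok, lp in next_token_logprobs(x, prefix).items():
            new_score = score + lp
            if new_score < empty_score:  # lower-bound pruning by the empty string
                continue
            heapq.heappush(heap, (-new_score, prefix + (tok,)))
    return fallback
```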

Fig. 1 demonstrates how the addition of the greedy UID regularizer (Eq. (11)) to the regularized MAP objective (Eq. (9)) affects characteristics of the global optimum under the model as we vary λ. Notably, increasing the strength of the regularizer appears to alleviate the text-quality degradation seen with exact search, leading to results that approach the BLEU of those generated using optimal beam search. Fig. 1 also shows a strong inverse relationship between BLEU and the average standard deviation (per sentence) of surprisals. We take these observations as empirical validation of Hyp. 4.2.

6.2 Regularized Beam Search

We next look at how the regularized decoding objective affects text generated using beam search. As previously noted, text quality generally degrades with increased beam size when using the standard MAP objective; this phenomenon is demonstrated in Fig. 3. UID regularization appears to alleviate this problem.

                        k=5     k=10    k=100   k=500
No Regularization       36.42   36.30   32.83   14.66
Squared Regularizer     36.92   36.42   36.13   35.96
Greedy Regularizer      36.45   36.49   36.22   36.15
Combined Regularizers   36.69   36.65   36.48   36.35
Length Normalization    36.02   35.94   35.80   35.11

Table 1: BLEU scores on the first 1000 samples of Newstest2014 for predictions generated with various decoding strategies. Best scores per beam size are bolded.

Notably, the greedy and squared regularizers aid performance for larger beam sizes more so than the other regularizers, for which we still see a slight drop in performance at larger beam sizes. This drop is negligible compared to the one observed for unregularized beam search—a drop which is also frequently observed for length-normalized decoding (Koehn and Knowles, 2017). While, intuitively, variance and local consistency are the purest encodings of UID, they perform the poorest of the regularizers. Arguably, this may be because they do not simultaneously (as the other regularizers do) penalize high surprisal.

We additionally decode with a combination of the UID regularizers in tandem. We collectively tune the λ value for each of the regularizers on validation sets. We report performance in Tab. 1 and see that the results outperform standard and length-normalized (i.e., score divided by sequence length) beam search, with noticeable improvements for larger beams. Search details and parameter settings may be found in App. B. Notably, combining multiple UID regularizers does not lead to as great an increase in performance as one might expect, which hints that a single method for enforcing UID is sufficient for promoting quality in generated text.


7 Related Work

Neural probabilistic text generators are far from perfect; prior work has shown that they often generate text that is generic (Vinyals and Le, 2015; Li et al., 2016), unnatural (Holtzman et al., 2020), and sometimes even non-existent (Stahlberg and Byrne, 2019). In the context of the degenerate behavior of these models, the beam search curse—a specific phenomenon where using a larger beam size leads to worse performance—has been analyzed by a number of authors (Koehn and Knowles, 2017; Murray and Chiang, 2018; Yang et al., 2018; Stahlberg and Byrne, 2019; Jean et al., 2015; Tu et al., 2016; He et al., 2016; Cohen and Beck, 2019). Many of these authors attribute the performance drop (as search becomes better) to an inherent bias of neural sequence models to prefer shorter sentences. Other authors have ascribed fault to the model architectures, or to how they are trained (Cho et al., 2014; Bengio et al., 2015; Sountsov and Sarawagi, 2016; Vinyals et al., 2017; Ott et al., 2018a; Kumar and Sarawagi, 2019). To remedy the problem, a large number of regularized decoding objectives and modified training techniques have been proposed. In contrast, this work analyzes the behavior of neural text generators from a different angle: we provide a plausible answer—inspired by psycholinguistic theory—as to why beam search (with small beams) leads to high-quality text, rather than another explanation of why exact search performs so badly.

8 Conclusion

We analyze beam search as a decoding strategy for text generation models by framing it as the solution to an exact decoding problem. We hypothesize that beam search has an inductive bias which can be linked to the promotion of uniform information density (UID), a theory from cognitive science regarding the even distribution of information in linguistic signals. We observe a strong relationship between the variance of surprisals (an operationalization of UID) and BLEU in our experiments with NMT models. With the aim of further exploring decoding strategies for neural text generators in the context of UID, we design a set of objectives that explicitly encourage uniform information density in text generated from neural probabilistic models and find that they alleviate the quality degradation typically seen with increased beam widths.

Acknowledgments

We would like to thank Ari Holtzman and Jason Eisner for useful feedback and discussion that helped improve this work.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28.

Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Ales Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Workshop on Statistical Machine Translation, Baltimore, Maryland, USA. Association for Computational Linguistics.

Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated talks. In Proceedings of the Conference of the European Association for Machine Translation (EAMT).

Yining Chen, Sorcha Gilroy, Andreas Maletti, Jonathan May, and Kevin Knight. 2018. Recurrent neural networks as weighted language recognizers. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2261–2271, New Orleans, Louisiana. Association for Computational Linguistics.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, Doha, Qatar. Association for Computational Linguistics.

Eldan Cohen and Christopher Beck. 2019. Empirical analysis of beam search performance degradation in neural sequence models. In Proceedings of the International Conference on Machine Learning, volume 97, Long Beach, California, USA. PMLR.

Edsger W. Dijkstra. 1959. A note on two problems in connexion with graphs. Numerische Mathematik, 1(1).


Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium. Association for Computational Linguistics.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. CoRR, abs/1805.04833.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the International Conference on Machine Learning - Volume 70, ICML'17. JMLR.org.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In Proceedings of the International Conference on Machine Learning, ICML'17. JMLR.org.

John Hale. 2001. A probabilistic Earley parser as a psycholinguistic model. In Second Meeting of the North American Chapter of the Association for Computational Linguistics.

Wei He, Zhongjun He, Hua Wu, and Haifeng Wang. 2016. Improved neural machine translation with SMT features. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16. AAAI Press.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. International Conference on Learning Representations.

Liang Huang, Kai Zhao, and Mingbo Ma. 2017. When to finish? Optimal beam search for neural text generation (modulo beam size). In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2134–2139, Copenhagen, Denmark. Association for Computational Linguistics.

T. Florian Jaeger. 2010. Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology, 61(1).

T. Florian Jaeger and Harry Tily. 2011. On language 'utility': Processing complexity and communicative efficiency. Wiley Interdisciplinary Reviews: Cognitive Science, 2.

Ayush Jain, Vishal Singh, Sidharth Ranjan, Rajakrishnan Rajkumar, and Sumeet Agarwal. 2018. Uniform information density effects on syntactic choice in Hindi. In Proceedings of the Workshop on Linguistic Complexity and Natural Language Processing, pages 38–48, Santa Fe, New Mexico. Association for Computational Linguistics.

Sebastien Jean, Orhan Firat, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. Montreal neural machine translation systems for WMT'15. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 134–140, Lisbon, Portugal. Association for Computational Linguistics.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent convolutional neural networks for discourse compositionality. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 119–126, Sofia, Bulgaria. Association for Computational Linguistics.

Jon Kleinberg and Eva Tardos. 2005. Algorithm Design. Addison-Wesley Longman Publishing Co., Inc., USA.

Philipp Koehn. 2004. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Machine Translation: From Real Users to Research, Berlin, Heidelberg. Springer Berlin Heidelberg.

Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.

Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 127–133.

Aviral Kumar and Sunita Sarawagi. 2019. Calibration of encoder decoder models for neural machine translation. CoRR, abs/1903.00802.

Roger Levy. 2005. Probabilistic Models of Word Order and Syntactic Discontinuity. Ph.D. thesis, Stanford, CA, USA.

Roger P. Levy and T. F. Jaeger. 2007. Speakers optimize information density through syntactic reduction. In B. Scholkopf, J. C. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19. MIT Press.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.

Clara Meister, Tim Vieira, and Ryan Cotterell. 2020. Best-first beam search. Transactions of the Association for Computational Linguistics.

Kenton Murray and David Chiang. 2018. Correcting length bias in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 212–223, Belgium, Brussels. Association for Computational Linguistics.


Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 314–319, Florence, Italy. Association for Computational Linguistics.

Franz Josef Och, Christoph Tillmann, and Hermann Ney. 1999. Improved alignment models for statistical machine translation. In 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.

Myle Ott, Michael Auli, David Grangier, and Marc'Aurelio Ranzato. 2018a. Analyzing uncertainty in neural machine translation. ICML.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018b. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 1–9, Belgium, Brussels. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting on Association for Computational Linguistics.

Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Conference on Machine Translation: Research Papers.

Raj Reddy. 1977. Speech understanding systems: A summary of results of the five-year research effort at Carnegie Mellon University.

Iulian Serban, Tim Klinger, Gerald Tesauro, Kartik Talamadupula, Bowen Zhou, Yoshua Bengio, and Aaron Courville. 2017. Multiresolution recurrent neural networks: An application to dialogue response generation.

Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1683–1692, Berlin, Germany. Association for Computational Linguistics.

Pavel Sountsov and Sunita Sarawagi. 2016. Length bias in encoder decoder models and a case for global conditioning. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1516–1525, Austin, Texas. Association for Computational Linguistics.

Felix Stahlberg and Bill Byrne. 2019. On NMT search errors and model errors: Cat got your tongue? In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics.

Felix Stahlberg, Eva Hasler, Danielle Saunders, and Bill Byrne. 2017. SGNMT – a flexible NMT decoding platform for quick prototyping of new models and search strategies. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 25–30, Copenhagen, Denmark. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27. Curran Associates, Inc.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–85, Berlin, Germany. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30.

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. AAAI Press.

Oriol Vinyals and Quoc V. Le. 2015. A neural conversational model. CoRR, abs/1506.05869.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2017. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):652–663.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Yilin Yang, Liang Huang, and Mingbo Ma. 2018. Breaking the beam search curse: A study of (re-)scoring methods and stopping criteria for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3054–3059, Brussels, Belgium. Association for Computational Linguistics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32. Curran Associates, Inc.

Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. 2016. Neural generative question answering. In Proceedings of the Workshop on Human-Computer Question Answering, pages 36–42, San Diego, California. Association for Computational Linguistics.


A Theory

Proof. We prove Theorem 3.1 by induction. We denote the argmax of log pθ(y | x) − λ · Rgreedy(y) as $\mathbf{y}^{\mathcal{R}}$ and the solution found by greedy search as $\mathbf{y}^{\text{greedy}}$. We will show that $y_t^{\text{greedy}} = y_t^{\mathcal{R}}$ for all 0 ≤ t ≤ max(|$\mathbf{y}^{\mathcal{R}}$|, |$\mathbf{y}^{\text{greedy}}$|). The theorem holds trivially for the base case of t = 0 because y0 must be BOS for any valid hypothesis by definition of the hypothesis space (Eq. (1)). Now, by the inductive hypothesis, suppose $y_i^{\text{greedy}} = y_i^{\mathcal{R}}$ for all i < t. We will show that our regularized objective must choose the same word as greedy search at time step t. In the limiting case of Eq. (11), the following function reflects the penalty to the distribution over tokens at position t:

$$\lim_{\lambda \to \infty} \left[ \lambda \cdot \Big( u_t(y_t) - \min_{y' \in \bar{\mathcal{V}}} u_t(y') \Big)^2 \right] = \begin{cases} 0 & \text{if } u_t(y_t) = \min_{y' \in \bar{\mathcal{V}}} u_t(y') \\ \infty & \text{otherwise} \end{cases}$$

Since minimum surprisal implies maximum log-probability, the above function clearly returns either 0 or ∞ depending on whether the decoding choice at time step t is greedy. Therefore, the only choice that would not send the hypothesis score to −∞ is the greedy choice, which implies any feasible solution to our objective must have $y_t^{\mathcal{R}} = y_t^{\text{greedy}}$. By the principle of induction, $y_t^{\text{greedy}} = y_t^{\mathcal{R}}$ for all 0 ≤ t ≤ |$\mathbf{y}^{\mathcal{R}}$| = |$\mathbf{y}^{\text{greedy}}$|, which in turn implies $\mathbf{y}^{\text{greedy}} = \mathbf{y}^{\mathcal{R}}$.

B Parameters

For the values in Fig. 3, we perform grid search over λ ∈ {0.2, 0.5, 0.7, 1, 2, 3, 4, 6, 7, 8, 9, 10} and choose the λ with the best validation-set performance. For combined UID regularization, we perform hyperparameter search over the 5 strength parameters, each sampled uniformly from the following values: {0, 0.2, 0.5, 0.7, 1, 2, 3, 4, 6, 7, 8, 9, 10}. We run 50 trials on the validation set; λ = 5 and λ = 2 yield the best performance for the greedy and squared regularizers, respectively, with all other λ set to 0.

                    IWSLT'14   WMT'14
Greedy                  10        5
Local Consistency        4        6
Max                      5        3
Squared                  3        2
Variance                 7        3

Table 2: λ settings used during decoding in Fig. 3 and reported in Tab. 1.

C Additional Plots

Figure 4: BLEU vs. standard deviation of surprisals for translations generated with beam search on the test sets of IWSLT'14 and WMT'14. The size of each point indicates the beam width used (between 5 and 100). In contrast to the subgraph of Fig. 1, the x-axis is not log-scaled.

