Iterative Search for Weakly Supervised Semantic Parsing · 2019-06-01 · Proceedings of NAACL-HLT...

Proceedings of NAACL-HLT 2019, pages 2669–2680, Minneapolis, Minnesota, June 2 - June 7, 2019. ©2019 Association for Computational Linguistics


Iterative Search for Weakly Supervised Semantic Parsing

Pradeep Dasigi♥, Matt Gardner♥, Shikhar Murty♦, Luke Zettlemoyer♣, and Eduard Hovy♠

♥Allen Institute for Artificial Intelligence, Seattle, Washington ♥Allen Institute for Artificial Intelligence, Irvine, California

♦Mila, Université de Montréal ♣University of Washington ♠Carnegie Mellon University

Abstract

Training semantic parsers from question-answer pairs typically involves searching over an exponentially large space of logical forms, and an unguided search can easily be misled by spurious logical forms that coincidentally evaluate to the correct answer. We propose a novel iterative training algorithm that alternates between searching for consistent logical forms and maximizing the marginal likelihood of the retrieved ones. This training scheme lets us iteratively train models that provide guidance to subsequent ones to search for logical forms of increasing complexity, thus dealing with the problem of spuriousness. We evaluate these techniques on two hard datasets: WIKITABLEQUESTIONS (WTQ) and Cornell Natural Language Visual Reasoning (NLVR), and show that our training algorithm outperforms the previous best systems, on WTQ in a comparable setting, and on NLVR with significantly less supervision.

1 Introduction

Semantic parsing is the task of translating natural language utterances into machine-executable meaning representations, often called programs or logical forms. These logical forms can be executed against some representation of the context in which the utterance occurs, to produce a denotation. This setup allows for complex reasoning over contextual knowledge, and it has been successfully used in several natural language understanding problems such as question answering (Berant et al., 2013), program synthesis (Yin and Neubig, 2017) and building natural language interfaces (Suhr et al., 2018).

Recent work has focused on training semantic parsers via weak supervision from denotations alone (Liang et al., 2011; Berant et al., 2013). This is because obtaining logical form annotations is generally expensive (although recent work has addressed this issue to some extent (Yih et al., 2016)), and not assuming full supervision lets us be agnostic about the logical form language. The second reason is more important in open-domain semantic parsing tasks where it may not be possible to arrive at a complete set of operators required by the task. However, training semantic parsers with weak supervision requires not only searching over an exponentially large space of logical forms (Berant et al., 2013; Artzi and Zettlemoyer, 2013; Pasupat and Liang, 2015; Guu et al., 2017, inter alia) but also dealing with spurious logical forms that evaluate to the correct denotation while not being semantically equivalent to the utterance. For example, if the denotations are binary, 50% of all syntactically valid logical forms evaluate to the correct answer, regardless of their semantics. This problem renders the training signal extremely noisy, making it hard for the model to learn anything without some additional guidance during search.

We introduce two innovations to improve learning from denotations. First, we propose an iterative search procedure for gradually increasing the complexity of candidate logical forms for each training instance, leading to better training data and better parsing accuracy. This procedure is implemented via training our model with two interleaving objectives, one that involves searching for logical forms of limited complexity during training (online search), and another that maximizes the marginal likelihood of retrieved logical forms. Second, we include a notion of coverage over the question in the search step to guide the training algorithm towards logical forms that not only evaluate to the correct denotation, but also have some connection to the words in the utterance.

We demonstrate the effectiveness of these two techniques on two difficult reasoning tasks: WIKITABLEQUESTIONS (WTQ) (Pasupat and Liang, 2015), an open domain task with significant lexical variation, and Cornell Natural Language Visual Reasoning (NLVR) (Suhr et al., 2017), a closed domain task with binary denotations, and thus far less supervision. We show that: 1) interleaving online search and MML over retrieved logical forms (§4) is a more effective training algorithm than each of those objectives alone; 2) coverage guidance during search (§3) is helpful for dealing with weak supervision, more so in the case of NLVR where the supervision is weaker; 3) a combination of the two techniques yields 44.3% test accuracy on WTQ, outperforming the previous best single model in a comparable setting, and 82.9% test accuracy on NLVR, outperforming the best prior model, which also relies on greater supervision.

2 Background

2.1 Weakly supervised semantic parsing

We formally define semantic parsing in a weakly supervised setup as follows. Given a dataset where the $i$th instance is the triple $\{x_i, w_i, d_i\}$, representing a sentence $x_i$, the world $w_i$ associated with the sentence, and the corresponding denotation $d_i$, our goal is to find $y_i$, the translation of $x_i$ in an appropriate logical form language (see §5.3), such that $[\![y_i]\!]^{w_i} = d_i$; i.e., the execution of $y_i$ in world $w_i$ produces the correct denotation $d_i$. A semantic parser defines a distribution over logical forms given an input utterance: $p(Y \mid x_i; \theta)$.

2.2 Training algorithms

In this section we describe prior techniques for training semantic parsers with weak supervision: maximizing marginal likelihood, and reward-based methods.

2.2.1 Maximum marginal likelihood

Most work on training semantic parsers from denotations maximizes the likelihood of the denotation given the utterance. The semantic parsing model itself defines a distribution over logical forms, however, not denotations, so this maximization must be recast as a marginalization over logical forms that evaluate to the correct denotation:

$$\max_{\theta} \prod_{x_i, d_i \in D} \; \sum_{y_i \in Y \,\mid\, [\![y_i]\!]^{w_i} = d_i} p(y_i \mid x_i; \theta) \tag{1}$$

This objective function is called maximum marginal likelihood (MML). The inner summation is in general intractable to perform during training, so it is only approximated.
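As an illustration of how the approximated objective in Equation 1 might be computed, the sketch below takes, for each training instance, the log-probabilities that a model assigns to whatever consistent logical forms a search has retrieved, and returns the negative log marginal likelihood. The function names and the decision to skip instances with no retrieved logical form are assumptions made for this example, not details taken from the paper's implementation.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mml_loss(candidate_log_probs):
    """Negative log of the (approximate) MML objective.

    candidate_log_probs: one list per training instance, holding
    log p(y | x; theta) for each consistent logical form y that the
    search retrieved for that instance (hypothetical input format).
    """
    loss = 0.0
    for log_probs in candidate_log_probs:
        if not log_probs:
            # Assumption: an instance with no retrieved consistent
            # logical form contributes nothing to the loss.
            continue
        # Inner sum of Equation 1, taken over the retrieved candidates only.
        loss -= log_sum_exp(log_probs)
    return loss


# One instance with two consistent candidates of probability 0.2 and 0.3.
example = [[math.log(0.2), math.log(0.3)]]
print(round(mml_loss(example), 4))
```

Taking the product over instances in log space turns it into a sum, which is why the loss accumulates one log-sum-exp term per instance.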

Most prior work (Berant et al., 2013; Goldman et al., 2018, inter alia) approximates the intractable marginalization by summing over logical forms obtained via beam search during training. This typically results in frequent search failures early during training when model parameters are close to random, and in general may only yield spurious logical forms in the absence of any guidance. Since modern semantic parsers typically operate without a lexicon, new techniques are essential to provide guidance to the search procedure (Goldman et al., 2018).

One way of providing this guidance during search is to perform some kind of heuristic search up front to find a set of logical forms that evaluate to the correct denotation, and use those logical forms to approximate the inner summation (Liang et al., 2011; Krishnamurthy et al., 2017). The particulars of the heuristic search can have a large impact on performance; a smaller candidate set has lower noise, while a larger set makes it more likely that the correct logical form is in it, and one needs to strike the right balance. In this paper, we refer to the MML that does search during training as dynamic MML, and the one that does an offline search as static MML.

The main benefit of dynamic MML is that it adapts its training signal over time. As the model learns, it can increasingly focus its probability mass on a small set of very likely logical forms. The main benefit of static MML is that there is no need to search during training, so there is a consistent training signal even at the start of training, and it is typically more computationally efficient than dynamic MML.

2.2.2 Reward-based methods

When training weakly supervised semantic parsers, it is often desirable to inject some prior knowledge into the training procedure by defining arbitrary reward or cost functions. There exists prior work that uses such methods, both in a reinforcement learning setting (Liang et al., 2017, 2018), and otherwise (Iyyer et al., 2017; Guu et al., 2017). In our work, we define a customized cost function that includes a coverage term, and use a Minimum Bayes Risk (MBR) (Goodman, 1996; Goel and Byrne, 2000; Smith and Eisner, 2006) training scheme, which we describe in §3.

3 Coverage-guided search

Weakly-supervised training of semantic parsers relies heavily on lexical cues to guide the initial stages of learning to good logical forms. Traditionally, these lexical cues were provided in the parser’s lexicon. Neural semantic parsers remove the lexicon, however, and so need another mechanism for obtaining these lexical cues. In this section we introduce the use of coverage to inject lexicon-like information into neural semantic parsers.

Coverage is a measure of relevance of the candidate logical form $y_i$ to the input $x_i$, in terms of how well the productions in $y_i$ map to parts of $x_i$. We use a small manually specified lexicon as a mapping from source language to the target language productions, and define coverage of $y_i$ as the number of productions triggered by the input utterance, according to the lexicon, that are included in $y_i$.
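The coverage measure can be made concrete with a small sketch. It assumes a lexicon stored as a dictionary from utterance tokens to sets of production-rule strings, and a logical form represented by the list of productions used to build it; both representations, and the two lexicon entries shown, are hypothetical and chosen only for illustration.

```python
def triggered_productions(utterance_tokens, lexicon):
    """Collect every production that the lexicon associates with a token."""
    triggered = set()
    for token in utterance_tokens:
        triggered.update(lexicon.get(token, ()))
    return triggered


def coverage(logical_form_productions, triggered):
    """Number of triggered productions that appear in the logical form."""
    return len(triggered & set(logical_form_productions))


def missing_count(logical_form_productions, triggered):
    """Number of triggered productions that the logical form leaves out."""
    return len(triggered - set(logical_form_productions))


# Hypothetical lexicon entries; the paper's actual lexicons are in its appendix.
lexicon = {
    "blue": {"color -> blue"},
    "one": {"number -> 1"},
}
tokens = "there is a box with only one item that is blue".split()
triggered = triggered_productions(tokens, lexicon)

candidate = ["bool -> exists", "color -> blue"]
print(coverage(candidate, triggered), missing_count(candidate, triggered))
```

The two counts always add up to the number of triggered productions, so maximizing coverage and minimizing the missing count are the same criterion.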

We use this measure of coverage to augment our loss function, and train using an MBR based algorithm as follows. We use beam search to train a model to minimize the expected value of a cost function $C$:

$$\min_{\theta} \sum_{i=1}^{N} \mathbb{E}_{\tilde{p}(y_i \mid x_i; \theta)} \, C(x_i, y_i, w_i, d_i) \tag{2}$$

where $\tilde{p}$ is a re-normalization (see footnote 1) of the probabilities assigned to all logical forms on the beam.

We define the cost function C as:

$$C(x_i, y_i, w_i, d_i) = \lambda S(y_i, x_i) + (1 - \lambda) T(y_i, w_i, d_i) \tag{3}$$

where the function $S$ measures the number of items that $y_i$ is missing from the actions (or grammar production rules) triggered by the input utterance $x_i$ given the lexicon; and the function $T$ measures the consistency of the evaluation of $y_i$ in $w_i$, meaning that it is 0 if $[\![y_i]\!]^{w_i} = d_i$, or a value $e$ otherwise. We set $e$ as the maximum possible value of the coverage cost for the corresponding instance, to make the two costs comparable in magnitude. $\lambda$ is a hyperparameter that gives the relative weight of the coverage cost.

Footnote 1: Note that without this re-normalization, and with a -1/0 cost function based on denotation accuracy, MBR will maximize the likelihood of correct logical forms on the beam, which is equivalent to dynamic MML.
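A minimal sketch of Equations 2 and 3 for a single training instance follows. It assumes that the beam is given as a list of model log-probabilities with a matching list of costs, that the missing-production count and the consistency flag of each logical form have already been computed, and that the value $e$ is passed in as `max_coverage_cost`. All function and parameter names are illustrative and are not taken from the released code.

```python
import math


def cost(missing_count, is_consistent, max_coverage_cost, lam):
    """Equation 3: lam * S + (1 - lam) * T for one logical form.

    T is 0 for a consistent logical form and e (here max_coverage_cost,
    the largest possible coverage cost for the instance) otherwise.
    """
    t = 0.0 if is_consistent else float(max_coverage_cost)
    return lam * missing_count + (1.0 - lam) * t


def expected_cost(beam_log_probs, beam_costs):
    """Inner term of Equation 2: expected cost under the beam
    probabilities re-normalized to sum to one."""
    m = max(beam_log_probs)
    weights = [math.exp(lp - m) for lp in beam_log_probs]
    z = sum(weights)
    return sum((w / z) * c for w, c in zip(weights, beam_costs))


# Two logical forms on the beam, with two triggered productions (e = 2):
# one is consistent but misses a production, the other covers both
# productions but evaluates to the wrong denotation.
costs = [
    cost(missing_count=1, is_consistent=True, max_coverage_cost=2, lam=0.4),
    cost(missing_count=0, is_consistent=False, max_coverage_cost=2, lam=0.4),
]
print(costs, expected_cost([math.log(0.1), math.log(0.3)], costs))
```

Because the expectation uses re-normalized beam probabilities, lowering the loss moves probability mass among the logical forms on the beam toward those with lower cost.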

4 Iterative search

In this section we describe the iterative technique for refining the set of candidate logical forms associated with each training instance.

As discussed in §2.2, most prior work on weakly-supervised training of semantic parsers uses dynamic MML. This is particularly problematic in domains like NLVR, where the supervision signal is binary; it is very hard for dynamic MML to bootstrap its way to finding good logical forms. To solve this problem, we interleave static MML, which has a consistent supervision signal from the start of training, with the coverage-augmented MBR algorithm described in §3.

In order to use static MML, we need an initial set of candidate logical forms. We obtain this candidate set using a bounded-length exhaustive search, filtered using heuristics based on the same lexical mapping used for coverage in §3. A bounded-length search will not find logical forms for the entire training data, so we can only use a subset of the data for initial training. We train a model to convergence using static MML on these logical forms, then use that model to initialize coverage-augmented MBR training. This gives the model a good starting place for the dynamic learning algorithm, and the search at training time can look for logical forms that are longer than could be found with the bounded-length exhaustive search. We train MBR to convergence, then use beam search on the MBR model to find a new set of candidate logical forms for static MML on the training data. This set of logical forms can have a greater length than those in the initial set, because this search uses model scores to avoid exhaustively exploring all possible paths, and thus will likely cover more of the training data. In this way, we can iteratively improve the candidate logical forms used for static training, which in turn improves the starting place for the online search algorithm.

Algorithm 1 concretely describes this process. Decode in the algorithm refers to running a beam search decoder that returns a set of consistent logical forms (i.e., $T = 0$) for each of the input utterances. We start off with a seed dataset $D^0$ for which consistent logical forms are available.
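The control flow of Algorithm 1 can be sketched as a loop over four pluggable steps. The callables `train_mml`, `train_mbr`, `decode` and `dev_accuracy` are hypothetical placeholders for the actual training and decoding routines. Measuring development accuracy on the MBR model, stopping as soon as it fails to improve, and capping the number of iterations are assumptions of this sketch rather than details stated in the algorithm.

```python
def iterative_search(seed_data, full_data, train_mml, train_mbr, decode,
                     dev_accuracy, max_iterations=10):
    """Alternate static MML and coverage-augmented MBR training.

    seed_data:  instances paired with consistent logical forms (D^0)
    full_data:  the full weakly supervised training set (D)
    Returns the MBR parameters with the best development accuracy seen.
    """
    mml_data = seed_data
    best_accuracy = float("-inf")
    best_params = None
    for _ in range(max_iterations):
        # Maximization step: static MML on the current candidate set.
        mml_params = train_mml(mml_data)
        # Search step: MBR training initialized from the MML model.
        mbr_params = train_mbr(full_data, init=mml_params)
        accuracy = dev_accuracy(mbr_params)
        if accuracy <= best_accuracy:
            break  # development accuracy is no longer increasing
        best_accuracy, best_params = accuracy, mbr_params
        # Retrieve consistent logical forms for the next MML round.
        mml_data = decode(full_data, mbr_params)
    return best_params
```

Passing the steps in as functions keeps the loop independent of any particular parser or toolkit, so the same skeleton applies to both datasets.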

5 Datasets

We will now describe the two datasets we use in this work to evaluate our methods: Cornell NLVR and WIKITABLEQUESTIONS.

Algorithm 1: Iterative coverage-guided search
Input: Dataset $D = \{X, W, D\}$; and seed set $D^0 = \{X^0, Y^0\}$ such that $X^0 \subset X$ and $C(x_i^0, y_i^0, W_i, D_i) = 0$
Output: Model parameters $\theta_{MBR}$
Initialize dataset $D_{MML} = D^0$;
while $Acc(D_{dev})$ is increasing do
    $\theta_{MML} = MML(D_{MML})$;
    Initialize $\theta_{MBR} = \theta_{MML}$;
    Update $\theta_{MBR} = MBR(D; \theta_{MBR})$;
    Update $D_{MML} = Decode(D; \theta_{MBR})$;
end

Figure 1: Example from NLVR dataset showing an utterance associated with two worlds and corresponding binary denotations. Also shown are the logical form and the actions triggered by the lexicon from the utterance.

5.1 Cornell NLVR

Cornell NLVR is a language-grounding dataset containing natural language sentences provided along with synthetically generated visual contexts, and a label for each sentence-image pair indicating whether the sentence is true or false in the given context. Figure 1 shows two example sentence-image pairs from the dataset (with the same sentence). The dataset also comes with structured representations of images, indicating the color, shape, size, and x- and y-coordinates of each of the objects in the image. While we show images in Figure 1 for ease of exposition, we use the structured representations in this work.

Figure 2: Example from WIKITABLEQUESTIONS dataset showing an utterance, a world, associated denotation, corresponding logical form, and actions triggered by the lexicon.

Following the notation introduced in §2.1, $x_i$ in this example is There is a box with only one item that is blue. The structured representations associated with the two images shown are two of the worlds ($w_i^1$ and $w_i^2$), in which $x_i$ could be evaluated. The corresponding labels are the denotations $d_i^1$ and $d_i^2$ that a translation $y_i$ of the sentence $x_i$ is expected to produce, when executed in the two worlds respectively. That the same sentence occurs with multiple worlds is an important property of this dataset, and we make use of it by defining the function $T$ to be 0 only if $\forall_{w_i^j, d_i^j} \; [\![y_i]\!]^{w_i^j} = d_i^j$.
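The multi-world condition on $T$ can be illustrated with a toy check. In this sketch a world is a list of boxes, each box a list of color names, and a logical form is modeled as a Python predicate over a world; these representations and the `execute` helper are simplifications assumed for the example and do not reflect the paper's logical form language.

```python
def execute(logical_form, world):
    """Toy executor: a logical form is just a predicate over a world."""
    return logical_form(world)


def is_consistent(logical_form, worlds, denotations):
    """True only if the logical form yields the labeled denotation in
    every world paired with the sentence (the condition for T = 0)."""
    return all(
        execute(logical_form, world) == denotation
        for world, denotation in zip(worlds, denotations)
    )


# Toy reading of "There is a box with only one item that is blue".
def intended(world):
    return any(box.count("blue") == 1 for box in world)


# A spurious alternative: "there is some blue item".
def spurious(world):
    return any("blue" in box for box in world)


worlds = [
    [["blue"], ["red", "yellow"]],   # labeled True
    [["blue", "blue"], ["red"]],     # labeled False
]
denotations = [True, False]

print(is_consistent(intended, worlds, denotations))
print(is_consistent(spurious, worlds, denotations))
```

The spurious predicate agrees with the label in the first world alone, so checking a single world would accept it, while requiring agreement in all worlds rejects it.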

5.2 WIKITABLEQUESTIONS

WIKITABLEQUESTIONS is a question-answering dataset where the task requires answering complex questions in the context of Wikipedia tables. An example can be seen in Figure 2. Unlike NLVR, the answers are not binary. They can instead be cells in the table or the result of numerical or set-theoretic operations performed on them.

5.3 Logical form languages

For NLVR, we define a typed variable-free functional query language, inspired by the GeoQuery language (Zelle and Mooney, 1996). Our language contains six basic types: box (referring to one of the three gray areas in Figure 1), object (referring to the circles, triangles and squares in Figure 1), shape, color, number and boolean. The constants in our language are color and shape names, the set of all boxes in an image, and the set of all objects in an image. The functions in our language include those for filtering objects and boxes, and making assertions, a higher order function for handling negations, and a function for querying objects in boxes. This type specification of constants and functions gives us a grammar with 115 productions, of which 101 are terminal productions (see Appendix A.1 for the complete set of rules in our grammar). Figure 1 shows an example of a complete logical form in our language.

For WTQ, we use the functional query language used by Liang et al. (2018) as the logical form language. Figure 2 shows an example logical form.

5.4 Lexicons for coverage

The lexicon we use for the coverage measure described in §3 contains under 40 rules for each logical form language. They mainly map words and phrases to constants and unary functions in the target language. The complete lexicons are shown in the Appendix. Figures 1 and 2 also show the actions triggered by the corresponding lexicons for the utterances shown. We find that small but precise lexicons are sufficient to guide the search process away from spurious logical forms. Moreover, as shown empirically in §6.4, the model for NLVR does not learn much without this simple but crucial guidance.

6 Experiments

We evaluate both our contributions on NLVR and WIKITABLEQUESTIONS.

6.1 Model

In this work, we use a grammar-constrained encoder-decoder neural semantic parser for our experiments. Of the many variants of this basic architecture (see §7), all of which are essentially seq2seq models with constrained outputs and/or re-parameterizations, we choose to use the parser of Krishnamurthy et al. (2017), as it is particularly well-suited to the WIKITABLEQUESTIONS dataset, which we evaluate on.

The encoder in the model is a bi-directional recurrent neural network with Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) cells, and the decoder is a grammar-constrained decoder also with LSTM cells. Instead of directly outputting tokens in the logical form, the decoder outputs production rules from a CFG-like grammar. These production rules sequentially build up an abstract syntax tree, which determines the logical form. The model also has an entity linking component for producing table entities in the logical forms; this component is only applicable to WIKITABLEQUESTIONS, and we remove it when running experiments on NLVR. The particulars of the model are not the focus of this work, so we refer the reader to the original paper for more details.

In addition, we slightly modify the constrained decoding architecture from Krishnamurthy et al. (2017) to bias the predicted actions towards those that would decrease the value of $S(y_i, x_i)$. This is done using a coverage vector, $v_i^S$, for each training instance that keeps track of the production rules triggered by $x_i$, and gets updated whenever one of those desired productions is produced by the decoder. That is, $v_i^S$ is a vector of 1s and 0s, with 1s indicating the triggered productions that are yet to be produced by the decoder. This is similar to the idea of checklists used by Kiddon et al. (2016). The decoder in the original architecture scores output actions at each time step by computing a dot product of the predicted action representation with the embeddings of each of the actions. We add a weighted sum of all the actions that are yet to be produced:

$$s_i^a = e^a \cdot (p_i + \gamma * v_i^S \cdot E) \tag{4}$$

where $s_i^a$ is the score of action $a$ at time step $i$, $e^a$ is the embedding of that action, $p_i$ is the predicted action representation, $E$ is the set of embeddings of all the actions, and $\gamma$ is a learned parameter for regularizing the bias towards yet-to-be produced triggered actions.
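A small worked sketch of Equation 4 is given below, using plain Python lists so that it runs without any numerical library. Action embeddings are rows of a list-of-lists, the coverage vector holds a 1 for each triggered action not yet produced, and `gamma` is passed in as a fixed number even though it is a learned parameter in the model. The helper names and the toy values are assumptions made for illustration.

```python
def biased_action_scores(action_embeddings, predicted, coverage_vector, gamma):
    """Score every action as e^a . (p_i + gamma * v^S . E)."""
    dim = len(predicted)
    # v^S . E: sum of the embeddings of triggered actions still pending.
    pending = [
        sum(v * emb[k] for v, emb in zip(coverage_vector, action_embeddings))
        for k in range(dim)
    ]
    query = [p + gamma * b for p, b in zip(predicted, pending)]
    return [sum(e * q for e, q in zip(emb, query)) for emb in action_embeddings]


def mark_produced(coverage_vector, action_index):
    """Return a copy of the coverage vector with one action switched off."""
    updated = list(coverage_vector)
    updated[action_index] = 0
    return updated


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy actions
predicted = [1.0, 0.0]                              # p_i
coverage_vector = [0, 1, 0]                         # action 1 is triggered

print(biased_action_scores(embeddings, predicted, coverage_vector, gamma=0.0))
print(biased_action_scores(embeddings, predicted, coverage_vector, gamma=0.5))
print(mark_produced(coverage_vector, 1))
```

With `gamma` set to zero the scores reduce to the plain dot product of the original decoder; a positive `gamma` raises the score of the pending triggered action and of actions with similar embeddings.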

6.2 Experimental setup

NLVR. We use the standard train-dev-test split for NLVR, containing 12409, 988 and 989 sentence-image pairs respectively. Most of the sentences in NLVR occur in multiple worlds (with an average of 3.9 worlds per sentence). We set the word embedding and action embedding sizes to 50, and the hidden layer size of both the encoder and the decoder to 30. We initialized all the parameters, including the word and action embeddings, using Glorot uniform initialization (Glorot and Bengio, 2010). We found that using pretrained word representations did not help. We added a dropout (Srivastava et al., 2014) of 0.2 on the outputs of the encoder and the decoder and before predicting the next action, set the beam size to 10 both during training and at test time, and trained the model using ADAM (Kingma and Ba, 2014) with a learning rate of 0.001. All the hyperparameters are tuned on the validation set.


WIKITABLEQUESTIONS. This dataset comes with five different cross-validation folds of training data, each containing a different 80/20 split for training and development. We first show results aggregated from all five folds in §6.3, and then show results from controlled experiments on fold 1. We replicated the model presented in Krishnamurthy et al. (2017), and only changed the training algorithm and the language used. We used a beam size of 20 for MBR during training and decoding, and 10 for MML during decoding, and trained the model using Stochastic Gradient Descent (Kiefer et al., 1952) with a learning rate of 0.1, all of which are tuned on the validation sets.

Specifics of iterative search. For our iterative search algorithm, we obtain an initial set of candidate logical forms in both domains by exhaustively searching to a depth of 10 (see footnote 2). During search we retrieve the logical forms that lead to the correct denotations in all the corresponding worlds, and sort them based on their coverage cost using the coverage lexicon described in §5.4, and choose the top-k (see footnote 3). At each iteration of the search step in our iterative training algorithm, we increase the maximum depth of our search with a step-size of 2, finding more complex logical forms and covering a larger proportion of the training data. While exhaustive search is prohibitively expensive beyond a fixed number of steps, our training process that uses beam search based approximation can go deeper.
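The selection of seed candidates described above (keep the consistent logical forms, sort by coverage cost, take the top k) can be sketched in a few lines. The tuple layout `(logical_form, is_consistent, coverage_cost)` and the function name are assumptions made for this illustration; the exhaustive search that produces the candidates is not shown.

```python
def select_seed_candidates(candidates, k):
    """Pick the k consistent logical forms with the lowest coverage cost.

    candidates: iterable of (logical_form, is_consistent, coverage_cost)
    tuples, e.g. as produced by a bounded-depth exhaustive search.
    """
    consistent = [c for c in candidates if c[1]]
    # Python's sort is stable, so ties keep their search order.
    consistent.sort(key=lambda c: c[2])
    return [c[0] for c in consistent[:k]]


candidates = [
    ("form_a", True, 2),
    ("form_b", False, 0),  # dropped: wrong denotation in some world
    ("form_c", True, 0),
    ("form_d", True, 1),
]
print(select_seed_candidates(candidates, k=2))
```

Filtering on consistency before sorting means an inconsistent logical form is never selected, however well it covers the utterance.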

Implementation. We implemented our model and training algorithms within the AllenNLP (Gardner et al., 2018) toolkit. The code and models are publicly available at https://github.com/allenai/iterative-search-semparse.

6.3 Main results

WIKITABLEQUESTIONS. Table 1 compares the performance of a single model trained using Iterative Search, with that of previously published single models. We excluded ensemble models since there are differences in the way ensembles are built for this task in previous work, either in terms of size or how the individual models were chosen. We show both best and average (over 5 folds) single model performance from Liang et al. (2018) (Memory Augmented Policy Optimization). The best model was chosen based on performance on the development set. Our single model performances are computed in the same way. Note that Liang et al. (2018) also use a lexicon similar to ours to prune the seed set of logical forms used to initialize their memory buffer.

In Table 2, we compare the performance of our iterative search algorithm with three baselines: 1) Static MML, as described in §2.2.1, trained on the candidate set of logical forms obtained through the heuristic search technique described in §6.2; 2) Iterative MML, also an iterative technique but unlike iterative search, we skip MBR and iteratively train static MML models while increasing the number of decoding steps; and 3) MAPO (Liang et al., 2018), the current best published system on WTQ. All four algorithms are trained and evaluated on the first fold, use the same language, and the bottom three use the same model and the same set of logical forms used to train static MML.

Approach                               Dev     Test
Pasupat and Liang (2015)               37.0    37.1
Neelakantan et al. (2017)              34.1    34.2
Haug et al. (2018)                     -       34.8
Zhang et al. (2017)                    40.4    43.7
Liang et al. (2018) (MAPO) (avg.)      42.3    43.1
Liang et al. (2018) (MAPO) (best)      42.7    43.8
Iterative Search (avg.)                42.1    43.9
Iterative Search (best)                43.1    44.3

Table 1: Comparison of single model performances of Iterative Search with previously reported single model performances.

Algorithm           Dev acc.    Test acc.
MAPO                42.1        42.7
Static MML          40.0        42.2
Iterative MML       42.5        43.1
Iterative Search    43.0        43.8

Table 2: Comparison of iterative search with static MML, iterative MML, and the previous best result from Liang et al. (2018), all trained on the official split 1 of WIKITABLEQUESTIONS and tested on the official test set.

Footnote 2: It was prohibitively expensive to search beyond depth of 10.

Footnote 3: k is a hyperparameter that is chosen on the dev set at each iteration in iterative search, and is typically 10 or 20.


NLVR. In Table 3, we show a comparison of the performance of our iterative coverage-guided search algorithm with the previously published approaches for NLVR. The first two rows correspond to models that are not semantic parsers. This shows that semantic parsing is a promising direction for this task. The closest work to ours is the weakly supervised parser built by Goldman et al. (2018). They build a lexicon similar to ours for mapping surface forms in input sentences to abstract clusters. But in addition to defining a lexicon, they also manually annotate complete sentences in this abstract space, and use those annotations to perform data augmentation for training a supervised parser, which is then used to initialize a weakly supervised parser. They also explicitly use the abstractions to augment the beam during decoding using caching, and a separately-trained discriminative re-ranker to re-order the logical forms on the beam. As a discriminative re-ranker is orthogonal to our contributions, we show their results with and without it, with “Abs. Sup.” being more comparable to our work. Our model, which uses no data augmentation, no caching during decoding, and no discriminative re-ranker, outperforms their variant without reranking on the public test set, and outperforms their best model on the hidden test set, achieving a new state-of-the-art result on this dataset.

6.4 Effect of coverage-guided search

To evaluate the contribution of coverage-guided search, we compare the performance of the NLVR parser in two different settings: with and without coverage guidance in the cost function. We also compare the performance of the parser in the two settings, when initialized with parameters from an MML model trained to maximize the likelihood of the set of logical forms obtained from exhaustive search. Table 4 shows the results of this comparison. We measure accuracy and consistency of all four models on the publicly available test set, using the official evaluation script. Consistency here refers to the percentage of logical forms that produce the correct denotation in all the corresponding worlds, and is hence a stricter metric than accuracy. The cost weight ($\lambda$ in Equation 3) was tuned based on validation set performance for the runs with coverage, and we found that $\lambda = 0.4$ worked best.
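The difference between the two metrics can be illustrated with a short sketch. It assumes predictions are grouped by sentence, with one boolean per world recording whether the predicted logical form produced the labeled denotation there; this input format and the function name are hypothetical, and the sketch is not the official evaluation script.

```python
def accuracy_and_consistency(correct_by_sentence):
    """Return (accuracy, consistency) as percentages.

    correct_by_sentence: dict mapping a sentence id to a list of booleans,
    one per world, True where the denotation was predicted correctly.
    """
    total_pairs = sum(len(flags) for flags in correct_by_sentence.values())
    correct_pairs = sum(sum(flags) for flags in correct_by_sentence.values())
    # A sentence counts as consistent only if every one of its worlds is correct.
    consistent = sum(all(flags) for flags in correct_by_sentence.values())
    accuracy = 100.0 * correct_pairs / total_pairs
    consistency = 100.0 * consistent / len(correct_by_sentence)
    return accuracy, consistency


results = {
    "sentence_1": [True, True, True, True],
    "sentence_2": [True, False, True, True],
}
print(accuracy_and_consistency(results))
```

In this toy case a single wrong world lowers accuracy only slightly but halves consistency, which is why consistency is the stricter of the two.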

It can be seen that both with and without initialization, coverage guidance helps by a big margin, with the gap being even more prominent in the case where there is no initialization. When there is neither coverage guidance nor a good initialization, the model does not learn much from unguided search and gets a test accuracy not much higher than the majority baseline of 56.2%.

We found that coverage guidance was not as useful for WTQ. The average value of the best performing $\lambda$ was around 0.2, and higher values neither helped nor hurt performance.

6.5 Effect of iterative search

To evaluate the effect of iterative search, we present the accuracy numbers from the search (S) and maximization (M) steps from different iterations in Tables 5 and 6, showing results on NLVR and WTQ, respectively. Additionally, we also show the number of decoding steps used at each iteration, and the percentage of sentences in the training data for which we were able to obtain consistent logical forms from the S step, the set that was used in the M step of the same iteration. It can be seen in both tables that a better MML model gives a better initialization for MBR, and a better MBR model results in a larger set of utterances for which we can retrieve consistent logical forms, thus improving the subsequent MML model. The improvement for NLVR is more pronounced (a gain of 21% absolute) than for WTQ (a gain of 3% absolute), likely because the initial exhaustive search provides a much higher percentage of spurious logical forms for NLVR, and thus the starting place is relatively worse.

Complexity of Logical Forms. We analyzed the logical forms produced by our iterative search algorithm at different iterations to see how they differ. As expected, for NLVR, allowing greater depths lets the parser explore more complex logical forms. Table 7 shows examples from the validation set that indicate this trend.

7 Related Work

Most of the early methods used for training semantic parsers required the training data to come with annotated logical forms (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005). The primary limitation of such methods is that manually producing these logical forms is expensive, making it hard to scale these methods across domains.


                                             Dev.           Test-P         Test-H
Approach                                     Acc.   Cons.   Acc.   Cons.   Acc.   Cons.
MaxEnt (Suhr et al., 2017)                   68.0   -       67.7   -       67.8   -
BiATT-Pointer (Tan and Bansal, 2018)         74.6   -       73.9   -       71.8   -
Abs. Sup. (Goldman et al., 2018)             84.3   66.3    81.7   60.1    -      -
Abs. Sup. + ReRank (Goldman et al., 2018)    85.7   67.4    84.0   65.0    82.5   63.9
Iterative Search                             85.4   64.8    82.4   61.3    82.9   64.3

Table 3: Comparison of our approach with previously published approaches. We show accuracy and consistency on the development set, and public (Test-P) and hidden (Test-H) test sets.

             No coverage      + coverage
             Acc.   Cons.     Acc.   Cons.
No init.     56.4   12.0      73.9   43.6
MML init.    77.7   51.1      80.7   56.4

Table 4: Effect of coverage guidance on NLVR parsers trained with and without initialization from an MML model. Metrics shown are accuracy and consistency on the public test set.

Iter.   Length   % cov.   Step   Dev. Acc
0       10       51       M      64.0
1       12       65       S      81.6
                          M      76.5
2       14       65       S      82.7
                          M      81.8
3       16       73       S      85.4
                          M      83.1
4       18       75       S      84.7
                          M      81.2

Table 5: Effect of iterative search (S) and maximization (M) on NLVR. % cov. is the percentage of training data for which the S step retrieves consistent logical forms.

Iter.   Length   % cov.   Step   Dev. Acc
0       10       83.3     M      40.0
1       12       70.2     S      42.5
                          M      42.5
2       14       71.3     S      43.1
                          M      42.7
3       16       71.0     S      42.8
                          M      42.5
4       18       71.0     S      43.0
                          M      42.7

Table 6: Iterative search on WIKITABLEQUESTIONS. M and S refer to Maximization and Search steps.

More recent research has focused on training semantic parsers with weak supervision (Liang et al., 2011; Berant et al., 2013), or trying to automatically infer logical forms from denotations (Pasupat and Liang, 2016). However, matching the performance of a fully supervised semantic parser with only weak supervision remains a significant challenge (Yih et al., 2016).

The main contributions of this work deal with training semantic parsers with weak supervision, and we gave a detailed discussion of related training methods in §2.2.

We evaluate our contributions on the NLVR and WIKITABLEQUESTIONS datasets. Other work that evaluates on these datasets includes Goldman et al. (2018), Tan and Bansal (2018), Neelakantan et al. (2017), Krishnamurthy et al. (2017), Haug et al. (2018), and Liang et al. (2018). These prior works generally present modeling contributions that are orthogonal (and in some cases complementary) to the contributions of this paper. There has also been a lot of recent work on neural semantic parsing, most of which is also orthogonal to (and could probably benefit from) our contributions (Dong and Lapata, 2016; Jia and Liang, 2016; Yin and Neubig, 2017; Krishnamurthy et al., 2017; Rabinovich et al., 2017). Recent attempts at dealing with the problem of spuriousness include Misra et al. (2018) and Guu et al. (2017).

Coverage has recently been used in machine translation (Tu et al., 2016) and summarization (See et al., 2017). There have also been many methods that use coverage-like mechanisms to give lexical cues to semantic parsers. The abstract examples of Goldman et al. (2018) are the most recent and most closely related work, but the idea is also related to lexicons in pre-neural semantic parsers (Kwiatkowski et al., 2011).

0  There is a tower with four blocks
   (box_exists (member_count_equals all_boxes 4))

1  Atleast one black triangle is not touching the edge
   (object_exists (black (triangle ((negate_filter touch_wall) all_objects))))

2  There is a yellow block as the top of a tower with exactly three blocks.
   (object_exists (yellow (top (object_in_box (member_count_equals all_boxes 3)))))

3  The tower with three blocks has a yellow block over a black block
   (object_count_greater_equals (yellow (above (black (object_in_box (member_count_equals all_boxes 3))))) 1)

Table 7: Complexity of logical forms produced at different iterations, from iteration 0 to iteration 3; each logical form could not be produced at the previous iterations.
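As an illustrative aside (not part of the reported experiments), the growth in complexity across the rows of Table 7 can be made concrete by counting the symbols and the nesting depth of each logical form. The sketch below assumes the logical forms are written as s-expressions with the underscore identifiers of Appendix A.1. The helper names `read_sexpr`, `count_atoms`, and `depth` are hypothetical, and the symbol count is only a rough proxy, not the length limit used during search.

```python
from typing import List, Union

Tree = Union[str, List["Tree"]]


def read_sexpr(text: str) -> Tree:
    """Parse one parenthesized logical form into nested Python lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos: int):
        if tokens[pos] == "(":
            items, pos = [], pos + 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1  # skip the closing parenthesis
        return tokens[pos], pos + 1

    tree, end = read(0)
    if end != len(tokens):
        raise ValueError("trailing tokens after the first expression")
    return tree


def count_atoms(tree: Tree) -> int:
    """Number of symbols (functions and constants) in the logical form."""
    if isinstance(tree, str):
        return 1
    return sum(count_atoms(child) for child in tree)


def depth(tree: Tree) -> int:
    """Maximum nesting depth of parentheses."""
    if isinstance(tree, str):
        return 0
    return 1 + max((depth(child) for child in tree), default=0)


# Logical forms copied from Table 7 (iterations 0 to 3).
TABLE_7 = [
    "(box_exists (member_count_equals all_boxes 4))",
    "(object_exists (black (triangle ((negate_filter touch_wall) all_objects))))",
    "(object_exists (yellow (top (object_in_box (member_count_equals all_boxes 3)))))",
    "(object_count_greater_equals (yellow (above (black (object_in_box "
    "(member_count_equals all_boxes 3))))) 1)",
]

atom_counts = [count_atoms(read_sexpr(lf)) for lf in TABLE_7]
depths = [depth(read_sexpr(lf)) for lf in TABLE_7]

for iteration, (n_atoms, d) in enumerate(zip(atom_counts, depths)):
    print(f"iteration {iteration}: {n_atoms} symbols, depth {d}")
```

On these four examples, neither measure decreases from one iteration to the next.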

8 Conclusion

We have presented a new technique for training semantic parsers with weak supervision. Our key insights are that lexical cues are crucial for guiding search during the early stages of training, and that the particulars of the approximate marginalization in maximum marginal likelihood have a large impact on performance. To address the first issue, we used a simple coverage mechanism for including lexicon-like information in neural semantic parsers that do not have lexicons. For the second issue, we developed an iterative procedure that alternates between statically-computed and dynamically-computed training signals. Together these two contributions greatly improve semantic parsing performance, leading to new state-of-the-art results on NLVR and WIKITABLEQUESTIONS. As these contributions are to the learning algorithm, they are broadly applicable to many models trained with weak supervision. One potential direction for future work is investigating whether they extend to other structured prediction problems beyond semantic parsing.

Acknowledgments

We would like to thank Jonathan Berant and Noah Smith for comments on earlier drafts and Chen Liang for helping us with implementation details of MAPO. Computations on beaker.org were supported in part by credits from Google Cloud.

References

Yoav Artzi and Luke Zettlemoyer. 2013. Weakly supervised learning of semantic parsers for mapping instructions to actions. Transactions of the Association of Computational Linguistics, 1:49–62.

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544.

Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In ACL’16.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew E. Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. CoRR, abs/1803.07640.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256.

Vaibhava Goel and William J Byrne. 2000. Minimum Bayes-risk automatic speech recognition. Computer Speech & Language, 14(2):115–135.

Omer Goldman, Veronica Latcinnik, Udi Naveh, Amir Globerson, and Jonathan Berant. 2018. Weakly-supervised semantic parsing with abstract examples. In ACL.

Joshua Goodman. 1996. Parsing algorithms and metrics. In Proceedings of the 34th annual meeting on Association for Computational Linguistics, pages 177–183. Association for Computational Linguistics.

Kelvin Guu, Panupong Pasupat, Evan Zheran Liu, and Percy Liang. 2017. From language to programs: Bridging reinforcement learning and maximum marginal likelihood. In Association for Computational Linguistics (ACL).

Till Haug, Octavian-Eugen Ganea, and Paulina Grnarova. 2018. Neural multi-step reasoning for question answering on semi-structured tables. In ECIR.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Mohit Iyyer, Wen-tau Yih, and Ming-Wei Chang. 2017. Search-based neural structured learning for sequential question answering. In Association for Computational Linguistics.

Robin Jia and Percy Liang. 2016. Data recombination for neural semantic parsing. In ACL’16.

Chloe Kiddon, Luke Zettlemoyer, and Yejin Choi. 2016. Globally coherent text generation with neural checklist models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 329–339.

Jack Kiefer, Jacob Wolfowitz, et al. 1952. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Jayant Krishnamurthy, Pradeep Dasigi, and Matt Gardner. 2017. Neural semantic parsing with type constraints for semi-structured tables. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1516–1526.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In Proceedings of the conference on empirical methods in natural language processing, pages 1512–1523. Association for Computational Linguistics.

Chen Liang, Jonathan Berant, Quoc Le, Kenneth D Forbus, and Ni Lao. 2017. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 23–33.

Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc Le, and Ni Lao. 2018. Memory augmented policy optimization for program synthesis with generalization. arXiv preprint arXiv:1807.02322.

Percy Liang, Michael I Jordan, and Dan Klein. 2011. Learning dependency-based compositional semantics. In ACL.

Dipendra Misra, Ming-Wei Chang, Xiaodong He, and Wen-tau Yih. 2018. Policy shaping and generalized update equations for semantic parsing from denotations. In EMNLP.

Arvind Neelakantan, Quoc V Le, Martin Abadi, Andrew McCallum, and Dario Amodei. 2017. Learning a natural language interface with neural programmer. In ICLR.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1470–1480.

Panupong Pasupat and Percy Liang. 2016. Inferring logical forms from denotations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 23–32.

Maxim Rabinovich, Mitchell Stern, and Dan Klein. 2017. Abstract syntax networks for code generation and semantic parsing. In ACL.

Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. ACL.

David A Smith and Jason Eisner. 2006. Minimum risk annealing for training log-linear models. In Proceedings of the COLING/ACL on Main conference poster sessions, pages 787–794. Association for Computational Linguistics.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Alane Suhr, Srinivasan Iyer, and Yoav Artzi. 2018. Learning to map context-dependent sentences to executable formal queries. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 2238–2249.

Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi. 2017. A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 217–223.

Hao Tan and Mohit Bansal. 2018. Object ordering with bidirectional matchings for visual reasoning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. ACL.

Wen-tau Yih, Matthew Richardson, Christopher Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In ACL.

Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. In ACL’17.

John M Zelle and Raymond J Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the thirteenth national conference on Artificial intelligence-Volume 2, pages 1050–1055. AAAI Press.

Luke S. Zettlemoyer and Michael Collins. 2005. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In UAI’05.

Yuchen Zhang, Panupong Pasupat, and Percy Liang. 2017. Macro grammars and holistic triggering for efficient semantic parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1214–1223.

A Logical form language and lexicon for NLVR

Basic Types: bool (t), box (b), object (o), shape (s), color (c), number (n)

In the grammar and lexicon that follow, we use the following placeholders:

quantifier ∈ {any, all, none}
comparator ∈ {equals, not_equals, lesser, lesser_equals, greater, greater_equals}
color ∈ {yellow, blue, black}
shape ∈ {square, triangle, circle}
size ∈ {big, medium, small}
location ∈ {above, below, top, left, right, bottom, corner, wall}
number ∈ {1...9}

A.1 Grammar

Constants

b -> all_boxes
c -> color_black, color_blue, color_yellow
n -> 1, 2, ..., 9
o -> all_objects
s -> shape_circle, shape_square, shape_triangle

Object filtering functions

<o,o> -> [location], [color], [shape], [size], same_color, same_shape, touch_object, touch_bottom, touch_top, touch_left, touch_right, touch_corner, touch_wall

Box filtering functions

<b,<s,b>> -> member_shape_[quantifier]_equals
<b,<c,b>> -> member_color_[quantifier]_equals
<b,<n,b>> -> member_count_[comparator], member_color_count_[comparator], member_shape_count_[comparator]
<b,b> -> member_color_different, member_color_same, member_shape_different, member_shape_same

Assertion functions

<b,t> -> box_exists
<o,t> -> object_exists
<b,<n,t>> -> box_count_[comparator]
<o,<c,t>> -> object_color_[quantifier]_equals
<o,<s,t>> -> object_shape_[quantifier]_equals
<o,<n,t>> -> object_color_count_[comparator], object_shape_count_[comparator], object_count_[comparator]

Other functions

<b,o> -> object_in_box
<<o,o>,<o,o>> -> negate_filter
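To illustrate how the type signatures above constrain well-formed logical forms, the following minimal sketch type-checks s-expressions against a small, hand-picked subset of the grammar. It assumes that `<x,y>` denotes a function from type x to type y, applied to one argument at a time, and that bare color, shape, and location names (as in Table 7) are filters of type `<o,o>`. The names `parse_type`, `read_sexpr`, `type_of`, and `SIGNATURES` are hypothetical and are not taken from any released implementation.

```python
from typing import List, Tuple, Union

Type = Union[str, Tuple["Type", "Type"]]
Tree = Union[str, List["Tree"]]

# A subset of the grammar, with placeholders expanded by hand.
SIGNATURES = {
    "all_boxes": "b",
    "all_objects": "o",
    "box_exists": "<b,t>",
    "object_exists": "<o,t>",
    "object_in_box": "<b,o>",
    "negate_filter": "<<o,o>,<o,o>>",
    "member_count_equals": "<b,<n,b>>",
    "object_count_greater_equals": "<o,<n,t>>",
}
for name in ("yellow", "blue", "black", "square", "triangle", "circle",
             "above", "below", "top", "touch_wall"):
    SIGNATURES[name] = "<o,o>"


def parse_type(text: str) -> Type:
    """Parse type notation, e.g. '<b,<n,t>>' -> ('b', ('n', 't'))."""
    def parse(pos: int):
        if text[pos] == "<":
            arg, pos = parse(pos + 1)
            assert text[pos] == ","
            result, pos = parse(pos + 1)
            assert text[pos] == ">"
            return (arg, result), pos + 1
        return text[pos], pos + 1  # basic types are single letters

    parsed, end = parse(0)
    assert end == len(text)
    return parsed


def read_sexpr(text: str) -> Tree:
    """Parse one parenthesized logical form into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos: int):
        if tokens[pos] == "(":
            items, pos = [], pos + 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1
        return tokens[pos], pos + 1

    tree, end = read(0)
    assert end == len(tokens)
    return tree


def type_of(expr: Tree) -> Type:
    """Return the type of a logical form, or raise TypeError if ill-typed."""
    if isinstance(expr, str):
        return "n" if expr.isdigit() else parse_type(SIGNATURES[expr])
    current = type_of(expr[0])
    for argument in expr[1:]:
        arg_type = type_of(argument)
        if not isinstance(current, tuple) or current[0] != arg_type:
            raise TypeError(f"cannot apply {current} to {arg_type}")
        current = current[1]  # consume one argument of the curried function
    return current


form = "(object_exists (black (triangle ((negate_filter touch_wall) all_objects))))"
print(type_of(read_sexpr(form)))
```

Under these assumptions, a complete logical form should have type t, while an expression such as `(box_exists all_objects)` is rejected because `box_exists` expects a box rather than an object.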

A.2 Lexicon for NLVR

there is a box → box_exists
there is a [other] → object_exists
box ... [color] → color_[color]
box ... [shape] → shape_[shape]
not → negate_filter
contains → object_in_box
touch ... [location] → touch_[location]
[location] → [location]
[shape] → [shape]
[color] → [color]
[size] → [size]
[number] → [number]

B Logical form language and lexicon for WIKITABLEQUESTIONS

We use the language from Liang et al. (2018). In addition to triggering productions for numbers, column names, and cell strings in the table, we use the following lexicon for coverage.

B.1 Lexicon for WIKITABLEQUESTIONS

at least → filter≥
[greater|larger|more] than → filter≥
at most → filter≤
no [greater|larger|more] than → filter≤
[next|below|after] → next
[previous|above|before] → previous
[first|top] → top
[last|bottom] → bottom
same → same_as
total → sum
difference → diff
average → average
[least|smallest|lowest|smallest] → argmin
[most|longest|highest|largest] → argmax
[what|when] ... [last|least] → min
[what|when] ... [first|most] → max
how many → count
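A lexicon of this kind can be read as a list of trigger patterns, each mapped to a production. The sketch below is one possible encoding, assuming that `[a|b]` denotes alternatives, that `...` allows arbitrary intervening text, and that matching is case-insensitive on whole words. The production names are copied as plain strings, and `LEXICON` and `triggered_productions` are hypothetical names. How overlapping matches are resolved is not specified by the lexicon, so this naive matcher simply returns every production whose pattern matches.

```python
import re
from typing import Set

# (regular expression, triggered production), following the lexicon entries.
LEXICON = [
    (r"\bat least\b", "filter≥"),
    (r"\b(greater|larger|more) than\b", "filter≥"),
    (r"\bat most\b", "filter≤"),
    (r"\bno (greater|larger|more) than\b", "filter≤"),
    (r"\b(next|below|after)\b", "next"),
    (r"\b(previous|above|before)\b", "previous"),
    (r"\b(first|top)\b", "top"),
    (r"\b(last|bottom)\b", "bottom"),
    (r"\bsame\b", "same_as"),
    (r"\btotal\b", "sum"),
    (r"\bdifference\b", "diff"),
    (r"\baverage\b", "average"),
    (r"\b(least|smallest|lowest)\b", "argmin"),
    (r"\b(most|longest|highest|largest)\b", "argmax"),
    (r"\b(what|when)\b.*\b(last|least)\b", "min"),
    (r"\b(what|when)\b.*\b(first|most)\b", "max"),
    (r"\bhow many\b", "count"),
]


def triggered_productions(question: str) -> Set[str]:
    """Return every production whose trigger pattern occurs in the question."""
    text = question.lower()
    return {
        production
        for pattern, production in LEXICON
        if re.search(pattern, text)
    }


print(sorted(triggered_productions("What was the total in the last year?")))
```

Because the matcher is naive, a phrase such as "at least" triggers both the `filter≥` entry and the `argmin` entry for "least"; a real system might prefer the longest match, which is left open here.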
