
AMR Parsing via Graph-Sequence Iterative Inference∗

Deng Cai
The Chinese University of Hong Kong
[email protected]

Wai Lam
The Chinese University of Hong Kong
[email protected]

Abstract

We propose a new end-to-end model that treats AMR parsing as a series of dual decisions on the input sequence and the incrementally constructed graph. At each time step, our model performs multiple rounds of attention, reasoning, and composition that aim to answer two critical questions: (1) which part of the input sequence to abstract; and (2) where in the output graph to construct the new concept. We show that the answers to these two questions are mutually causal. We design a model based on iterative inference that helps achieve better answers in both perspectives, leading to greatly improved parsing accuracy. Our experimental results significantly outperform all previously reported SMATCH scores by large margins. Remarkably, without the help of any large-scale pre-trained language model (e.g., BERT), our model already surpasses the previous state-of-the-art results obtained with BERT. With the help of BERT, we can push the state-of-the-art results to 80.2% on LDC2017T10 (AMR 2.0) and 75.4% on LDC2014T12 (AMR 1.0).

1 Introduction

Abstract Meaning Representation (AMR) (Banarescu et al., 2013) is a broad-coverage semantic formalism that encodes the meaning of a sentence as a rooted, directed, and labeled graph, where nodes represent concepts and edges represent relations (see the example in Figure 1). AMR parsing is the task of transforming natural language text into AMR. One of the biggest challenges in AMR parsing is the lack of explicit alignments between nodes (concepts) in the graph and words in the text. This characteristic not only poses great difficulty in concept prediction but also tightly couples concept prediction and relation prediction.

∗The work described in this paper is substantially supported by grants from the Research Grant Council of the Hong Kong Special Administrative Region, China (Project Code: 14204418) and the Direct Grant of the Faculty of Engineering, CUHK (Project Code: 4055093).

While most previous works rely on a pre-trained aligner to train a parser, some recent attempts include: modeling the alignments as latent variables (Lyu and Titov, 2018), attention-based sequence-to-sequence transduction models (Barzdins and Gosko, 2016; Konstas et al., 2017; van Noord and Bos, 2017), and attention-based sequence-to-graph transduction models (Cai and Lam, 2019; Zhang et al., 2019b). Sequence-to-graph transduction models build a semantic graph incrementally via spanning one node at every step. This property is appealing in terms of both computational efficiency and cognitive modeling since it mimics what human experts usually do, i.e., first grasping the core ideas and then digging into more details (Banarescu et al., 2013; Cai and Lam, 2019).

Unfortunately, the parsing accuracy of existing works, including the recent state-of-the-art models (Zhang et al., 2019a,b), remains unsatisfactory compared to human-level performance,1 especially in cases where the sentences are rather long and informative, which indicates substantial room for improvement. One possible reason for the deficiency is the inherent defect of the one-pass prediction process; that is, it lacks the capability to model the interactions between concept prediction and relation prediction, which is critical to achieving fully-informed and unambiguous decisions.

We introduce a new approach to AMR parsing, following the incremental sequence-to-graph transduction paradigm. We explicitly characterize each spanning step as finding which part to abstract with respect to the input sequence, and where to construct with respect to the partially constructed output graph.

1The average annotator vs. inter-annotator agreement (SMATCH) was 0.83 for newswire and 0.79 for web text according to Banarescu et al. (2013).


Equivalently, we treat AMR parsing as a series of dual decisions on the input sequence and the incrementally constructed graph. Intuitively, the answer of what concept to abstract decides where to construct (i.e., the relations to existing concepts), while the answer of where to construct determines what concept to abstract. Our proposed model, supported by neural networks with explicit structures for attention, reasoning, and composition, is integrated with an iterative inference algorithm. It iterates between finding supporting text pieces and reading the partially constructed semantic graph, progressively inferring more accurate and harmonious expansion decisions. Our model is aligner-free and can be effectively trained with a limited amount of labeled data. Experiments on two AMR benchmarks demonstrate that our parser outperforms the previous best parsers on both benchmarks. It achieves the best-reported SMATCH scores (F1): 80.2% on LDC2017T10 and 75.4% on LDC2014T12, surpassing the previous state-of-the-art models by large margins.

2 Related Work & Background

On a coarse-grained level, we can categorize existing AMR parsing approaches into two main classes: Two-stage parsing (Flanigan et al., 2014; Lyu and Titov, 2018; Zhang et al., 2019a) uses a pipeline design for concept identification and relation prediction, where the concept decisions precede all relation decisions; One-stage parsing constructs a parse graph incrementally. For a more fine-grained analysis, those one-stage parsing methods can be further categorized into three types: Transition-based parsing (Wang et al., 2016; Damonte et al., 2017; Ballesteros and Al-Onaizan, 2017; Peng et al., 2017; Guo and Lu, 2018; Liu et al., 2018; Wang and Xue, 2017; Naseem et al., 2019) processes a sentence from left to right and constructs the graph incrementally by alternately inserting a new node or building a new edge. Seq2seq-based parsing (Barzdins and Gosko, 2016; Konstas et al., 2017; van Noord and Bos, 2017; Peng et al., 2018) views parsing as sequence-to-sequence transduction via some linearization of the AMR graph; concept and relation prediction are then treated equally with a shared vocabulary. The third class is graph-based parsing (Cai and Lam, 2019; Zhang et al., 2019b), where at each time step a new node along with its connections to existing nodes is jointly decided, either in order (Cai and Lam, 2019) or in parallel (Zhang et al., 2019b). So far, the reciprocal causation of relation prediction and concept prediction has not been closely studied and well utilized.

Figure 1: AMR graph construction given the partially constructed graph for the sentence "The boy must not go"; the current partial (solid) and full (solid + dashed) AMR graphs are shown. (a) One possible expansion resulting in the boy concept. (b) Another possible expansion resulting in the - (negation) concept.

There are also some exceptions beyond the above categorization. Peng et al. (2015) introduce a synchronous hyperedge replacement grammar solution. Pust et al. (2015) regard the task as a machine translation problem, while Artzi et al. (2015) adapt combinatory categorial grammar. Groschwitz et al. (2018) and Lindemann et al. (2019) view AMR graphs as structures of the AM algebra.

3 Motivation

Our approach is inspired by the deliberation process a human expert goes through when deducing a semantic graph from a sentence. The output graph starts as an empty graph and grows incrementally in a node-by-node manner. At any time step of this process, we are distilling the information for the next expansion. We call it an expansion because the new node, as an abstract concept of some specific text fragments in the input sentence, is derived to complete some missing elements in the current semantic graph. Specifically, given the input sentence and the current partially constructed graph, we are answering two critical questions: which part of the input sequence to abstract, and where in the output graph to construct the new concept. For instance, Figure 1(a) and (b) show two possible choices for the next expansion. In Figure 1(a), the word "boy" is abstracted to the concept boy to complement the subject information of the event go-02.


Figure 2: Overview of the dual graph-sequence iterative inference for AMR parsing, given the current graph Gi and the input sequence W. The inference starts with an initial concept decision x0 and follows the inference chain x0 → f(Gi, x0) → y1 → g(W, y1) → x1 → f(Gi, x1) → y2 → g(W, y2) → ···. The details of f and g are shown in red and blue boxes, where nodes in the graph and tokens in the sequence are selected via attention mechanisms.

On the other hand, in Figure 1(b), a polarity attribute of the event go-02 is constructed, which is triggered by the word "not" in the sentence.

We note that the answer to one of the questions can help answer the other. For instance, if we have decided to render the word "not" into the graph, then we will consider adding an edge labeled polarity, and finally determine its attachment to the existing event go-02 (rather than an edge labeled ARG0 to the same event go-02, though it is also present in the gold graph). On the other hand, if we have decided to find the subject (ARG0 relation) of the action go-02, we are confident to locate the word "boy" instead of function words like "not" or "must", thus unambiguously predicting the right concept boy. Another possible circumstance is that we may make a mistake and try to ask for something that is not present in the sentence (e.g., the destination of the go-02 action). This attempt will be rejected by a review of the sentence: literally, we cannot find the destination information in the sentence. Similarly, if we mistakenly propose to abstract some part of the sentence that is not ready for construction yet, the proposal will be rejected by another inspection of the graph, since there is nowhere to place such a new concept.

We believe that these mutual causalities, as described above, are useful for disambiguating actions and making harmonious decisions, which eventually results in more accurate parses. We formulate AMR parsing as a series of dual graph-sequence decisions and design an iterative inference approach to tackle each of them. It is somewhat analogous to the cognition procedure of a person, who might first notice part of the important information on one side (graph or sequence), then try to confirm her decision on the other side, which could refute her former hypothesis and propose a new one, and finally converge to a conclusion after multiple rounds of reasoning.

4 Proposed Model

4.1 Overview

Formally, the parsing model consists of a series of graph expansion procedures {G^0 → . . . → G^i → . . .}, starting from an empty graph G^0. In each turn of expansion, the following iterative inference process is performed:

y^i_t = f(G^i, x^i_t),
x^i_{t+1} = g(W, y^i_t),

where W and G^i are the input sequence and the current semantic graph, respectively. f(·) and g(·) seek where to construct (edge prediction) and what to abstract (node prediction), respectively, and x^i_t and y^i_t are the t-th sequence hypothesis (what to abstract) and the t-th graph hypothesis (where to construct) for the i-th expansion step, respectively. For clarity, we may drop the superscript i in the following descriptions.

Figure 2 depicts an overview of the graph-sequence iterative inference process. Our model has four main components: (1) Sequence Encoder, which generates a set of text memories (per token) to provide grounding for concept alignment and abstraction; (2) Graph Encoder, which generates a set of graph memories (per node) to provide grounding for relation reasoning; (3) Concept Solver, where a previous graph hypothesis is used for concept prediction; and (4) Relation Solver, where a previous concept hypothesis is used for relation prediction. The last two components correspond to the reasoning functions g(·) and f(·), respectively.

The text memories can be computed by the Sequence Encoder once at the beginning of parsing, while the graph memories are constructed by the Graph Encoder incrementally as parsing progresses. During the iterative inference, a semantic representation of the current state is used to attend to both graph and text memories (blue and red arrows in Figure 2) in order to locate the new concept and obtain its relations to the existing graph, both of which subsequently refine each other. Intuitively, after a first glimpse of the input sentence and the current graph, specific sub-areas of both the sequence and the graph are revisited to obtain a better understanding of the current situation. Later steps typically read the text in detail with specific learning aims, either confirming or overturning a previous hypothesis. Finally, after several iterations of reasoning, the refined sequence/graph decisions are used for graph expansion.

4.2 Sequence Encoder

As mentioned above, we employ a sequence encoder to convert the input sentence into vector representations. The sequence encoder follows the multi-layer Transformer architecture described in Vaswani et al. (2017). At the bottom layer, each token is first transformed into the concatenation of features learned by a character-level convolutional neural network (charCNN, Kim et al., 2016) and randomly initialized embeddings for its lemma, part-of-speech tag, and named entity tag. Additionally, we include features learned by the pre-trained language model BERT (Devlin et al., 2019).2

Formally, for an input sequence w1, w2, . . . , wn with length n, we insert a special token BOS at the beginning of the sequence. For clarity, we omit the detailed transformations (Vaswani et al., 2017) and denote the final output of our sequence encoder as h0, h1, . . . , hn ∈ R^d, where h0 corresponds to the special token BOS and serves as an overall representation, while the others are considered contextualized word representations. Note that the sequence encoder only needs to be invoked once, and the produced text memories are used for the whole parsing procedure.

2We obtain word-level representations from pre-trained BERT in the same way as Zhang et al. (2019a,b), where sub-token representations at the last layer are averaged.

4.3 Graph Encoder

We use a similar idea to that of Cai and Lam (2019) to encode the incrementally expanding graph. Specifically, a graph is simply treated as a sequence of nodes (concepts) in the chronological order in which they are inserted into the graph. We employ a multi-layer Transformer architecture with masked self-attention and source-attention, which only allows each position in the node sequence to attend to all positions up to and including that position, and every position in the node sequence to attend over all positions in the input sequence.3 While this design allows for significantly more parallelization during training and computation-saving incrementality during testing,4 it inherently neglects the edge information. We attempted to alleviate this problem by incorporating the idea of Strubell et al. (2018), which applies auxiliary supervision at attention heads to encourage them to attend to each node's parents in the AMR graph. However, we did not observe a performance improvement. We attribute this to the fact that the neural attention mechanisms on their own are already capable of learning to attend to useful graph elements, and the auxiliary supervision is likely to disturb the ultimate parsing goal.

Consequently, for the current graph G with m nodes, we take its concept sequence c1, c2, . . . , cm as input. Similar to the sequence encoder, we insert a special token BOG at the beginning of the concept sequence. Each concept is first transformed into the concatenation of a feature vector learned by a char-CNN and a randomly initialized embedding. Then, a multi-layer Transformer encoder with masked self-attention and source-attention is applied, resulting in vector representations s0, s1, . . . , sm ∈ R^d, where s0 represents the special concept BOG and serves as a dummy node, while the others are considered contextualized node representations.

3It is analogous to a standard Transformer decoder (Vaswani et al., 2017) for sequence-to-sequence learning.

4Trivially employing a graph neural network here can be computationally expensive and intractable since it needs to re-compute all graph representations after every expansion.
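For illustration, a sketch of this decoder-style graph encoder built from standard torch.nn blocks follows: masked self-attention over the concept sequence plus source-attention over the text memories. The embedding-only concept features (no char-CNN) and the module interface are simplifying assumptions, not the released code.

```python
# Sketch of the graph encoder as a Transformer decoder over the concept sequence.
import torch
import torch.nn as nn

class GraphEncoder(nn.Module):
    def __init__(self, n_concepts: int, d_model: int = 512, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        self.concept_emb = nn.Embedding(n_concepts, d_model)    # char-CNN features omitted here
        layer = nn.TransformerDecoderLayer(d_model, n_heads, dim_feedforward=1024,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, concepts, text_mem):
        # concepts: (batch, m+1) ids of BOG, c_1, ..., c_m in insertion order
        # text_mem: (batch, n+1, d) memories from the sequence encoder
        x = self.concept_emb(concepts)
        L = x.size(1)
        # causal mask: each node attends only to itself and earlier nodes
        causal = torch.triu(torch.full((L, L), float("-inf"), device=x.device), diagonal=1)
        # source-attention over all tokens happens inside the decoder layers
        return self.decoder(x, text_mem, tgt_mask=causal)       # graph memories s_{0:m}
```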


4.4 Concept Solver

At each sequence reasoning step t, the concept solver receives a state vector yt that carries the latest graph decision and the input sequence memories h1, . . . , hn from the sequence encoder, and aims to locate the proper parts in the input sequence to abstract and generate a new concept. We employ the scaled dot-product attention proposed in Vaswani et al. (2017) to solve this problem. Concretely, we first calculate an attention distribution over all input tokens:

α_t = softmax((W^Q y_t)^⊤ W^K h_{1:n} / √d_k),

where W^Q, W^K ∈ R^{d_k×d} denote learnable linear projections that transform the input vectors into the query and key subspaces respectively, and d_k represents the dimensionality of those subspaces.

The attention weights α_t ∈ R^n provide a soft alignment between the new concept and the tokens in the input sequence. We then compute the probability distribution of the new concept label through a hybrid of three channels. First, α_t is fed through an MLP and a softmax to obtain a probability distribution over a pre-defined vocabulary:

MLP(α_t) = (W^V h_{1:n}) α_t + y_t,   (1)
P^(vocab) = softmax(W^(vocab) MLP(α_t) + b^(vocab)),

where W^V ∈ R^{d×d} denotes the learnable linear projection that transforms the text memories into the value subspace, and the value vectors are averaged according to α_t for concept label prediction. Second, the attention weights α_t directly serve as a copy mechanism (Gu et al., 2016; See et al., 2017), i.e., the probabilities of copying a token lemma from the input text as the node label. Third, to address attribute values such as person names or numerical strings, we also use α_t for another copy mechanism that directly copies the original strings of input tokens. The above three channels are combined via a soft switch to control the production of the concept label from the different sources:

[p_0, p_1, p_2] = softmax(W^(switch) MLP(α_t)),

where MLP is the same as in Eq. 1, and p_0, p_1, and p_2 are the probabilities of the three prediction channels, respectively. Hence, the final prediction probability of a concept c is given by:

P(c) = p_0 · P^(vocab)(c) + p_1 · Σ_{i∈L(c)} α_t[i] + p_2 · Σ_{i∈T(c)} α_t[i],

where [i] indexes the i-th element, and L(c) and T(c) are the index sets of lemmas and tokens, respectively, that have the same surface form as c.

4.5 Relation Solver

At each graph reasoning step t, the relation solver receives a state vector xt that carries the latest concept decision and the graph memories s0, s1, . . . , sm from the graph encoder, and aims to point out the nodes in the current graph that have an immediate relation to the new concept (source nodes) and to generate the corresponding edges. Similar to Cai and Lam (2019); Zhang et al. (2019b), we factorize the task into two stages: first, a relation identification module points to some preceding nodes as source nodes; then, a relation classification module predicts the relation type between the new concept and the predicted source nodes. We leave the latter to be determined after the iterative inference.

AMR is a rooted, directed, and acyclic graph. The reason for AMR being a graph instead of a tree is that it allows reentrancies, where a concept participates in multiple semantic relations with different semantic roles. Following Cai and Lam (2019), we use multi-head attention for a more compact parsing procedure in which multiple source nodes are determined simultaneously.5 Formally, our relation identification module employs H different attention heads; for each head h, we calculate an attention distribution over all existing nodes (including the dummy node s0):

β_t^h = softmax((W_h^Q x_t)^⊤ W_h^K s_{0:m} / √d_k).

Then, we take the maximum over the different heads as the final edge probabilities:

β_t[i] = max_{h=1,…,H} β_t^h[i].

Therefore, different heads may point to different nodes at the same time. Intuitively, each head represents a distinct relation detector for a particular set of relation types.

5This is different from Zhang et al. (2019b), where an AMR graph is converted into a tree by duplicating nodes that have reentrant relations.


Figure 3: Multi-head attention for relation identification. At left is the attention matrix, where each column corresponds to a unique attention head and each row corresponds to an existing node.

For each attention head, it will point to a source node if certain relations exist between the new node and the existing graph; otherwise it will point to the dummy node. An example with four attention heads and three existing nodes (excluding the dummy node) is illustrated in Figure 3.
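The following sketch illustrates the max-over-heads pointer computation. The tensor shapes, the parameter handling, and the final source-node selection rule (comparing each node against the dummy node) are simplifying assumptions for illustration rather than the exact released procedure.

```python
# Sketch of relation identification: H attention heads over s_{0:m}, max over heads.
import math
import torch
import torch.nn.functional as F

def relation_pointer(x_t, graph_mem, W_q, W_k):
    # x_t: (d,) state carrying the latest concept decision
    # graph_mem: (m+1, d) node memories s_{0:m}; W_q, W_k: (H, d_k, d)
    d_k = W_q.size(1)
    q = W_q @ x_t                                                 # per-head queries, (H, d_k)
    k = torch.einsum('hkd,md->hmk', W_k, graph_mem)               # per-head keys, (H, m+1, d_k)
    beta = F.softmax((k @ q.unsqueeze(-1)).squeeze(-1) / math.sqrt(d_k), dim=-1)  # (H, m+1)
    edge_prob = beta.max(dim=0).values                            # max over heads, per node
    return beta, edge_prob

# toy usage: 8 heads over a graph with a dummy node plus 3 concepts
H, d, d_k, m = 8, 16, 8, 3
beta, edge_prob = relation_pointer(torch.randn(d), torch.randn(m + 1, d),
                                   torch.randn(H, d_k, d), torch.randn(H, d_k, d))
# one simple decision rule (an assumption): nodes that beat the dummy node become source nodes
sources = (edge_prob[1:] > edge_prob[0]).nonzero().squeeze(-1) + 1
```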

4.6 Iterative Inference

As described above, the concept solver and the relation solver are conceptually two attention mechanisms over the sequence and the graph respectively, addressing concept prediction and relation prediction separately. The key is to pass the decisions between the solvers so that they can examine each other's answers and make harmonious decisions. Specifically, at each spanning step i, we start the iterative inference by setting x0 = h0 and solving f(Gi, x0). After the t-th graph reasoning, we compute the state vector y_t, which will be handed over to the concept solver as g(W, y_t), as:

y_t = FFN^(y)(x_t + Σ_{h=1}^{H} (W_h^V s_{0:m}) β_t^h),

where FFN^(y) is a feed-forward network and W_h^V projects the graph memories into a value space for each head h. Similarly, after the t-th sequence reasoning, we update the state vector from y_t to x_{t+1} as:

x_{t+1} = FFN^(x)(y_t + (W^V h_{1:n}) α_t),

where FFN^(x) is a feed-forward network and W^V projects the text memories into a value space. After N steps of iterative inference, i.e.,

x_0 → f(G^i, x_0) → y_1 → g(W, y_1) → x_1 → ··· → f(G^i, x_{N−1}) → y_N → g(W, y_N) → x_N,

we finally employ a deep biaffine classifier (Dozat and Manning, 2016) for edge label prediction.

Algorithm 1 AMR Parsing via Graph-Sequence Iterative Inference
Input: the input sentence W = (w1, w2, . . . , wn)
Output: the corresponding AMR graph G

1: h0, h1, . . . , hn = SequenceEncoder((BOS, w1, . . . , wn))   // compute text memories
2: G0 = (nodes = {BOG}, edges = ∅)   // initialize graph
3: i = 0
4: while True do   // start graph expansions
5:     s0, . . . , si = GraphEncoder(Gi)   // the graph memories can be computed *incrementally*
6:     x0 = h0
7:     for t ← 1 to N do   // iterative inference
8:         yt = f(Gi, xt−1)   // Seq. → Graph
9:         xt = g(W, yt)   // Graph → Seq.
10:    end for
11:    if the concept prediction is EOG then
12:        break
13:    end if
14:    update Gi+1 based on Gi, xN and yN
15:    i = i + 1
16: end while
17: return Gi

The classifier uses a biaffine function to score each label, given the final concept representation xN and the node vectors s1:m as input. The resulting concept, edge, and edge label predictions are added to the new graph Gi+1 if the concept prediction is not EOG, a special concept that we add to indicate termination. Otherwise, the whole parsing process terminates and the current graph is returned as the final result. The complete parsing process adopting the iterative inference is described in Algorithm 1.
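As a rough summary of Sections 4.4-4.6, the sketch below wires two attention modules into the alternating f/g loop that forms the inner loop of Algorithm 1. It uses standard multi-head attention in place of the solvers' exact scoring (e.g., the attention weights are head-averaged and no max over heads is taken), so it should be read as an approximation of the architecture under stated assumptions, not as the released code.

```python
# Sketch of one expansion step with N rounds of graph-sequence iterative inference.
import torch
import torch.nn as nn

class IterativeInference(nn.Module):
    def __init__(self, d_model: int, num_heads: int, num_steps: int = 4):
        super().__init__()
        self.num_steps = num_steps
        # relation solver f: attends over graph memories s_{0:m}
        self.graph_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # concept solver g: attends over text memories h_{1:n}
        self.seq_attn = nn.MultiheadAttention(d_model, 1, batch_first=True)
        self.ffn_y = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                   nn.Linear(d_model, d_model))
        self.ffn_x = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                   nn.Linear(d_model, d_model))

    def forward(self, x0, graph_mem, text_mem):
        # x0: (batch, 1, d) initial concept state (e.g., the BOS vector h_0)
        # graph_mem: (batch, m+1, d) node memories; text_mem: (batch, n, d) token memories
        x = x0
        for _ in range(self.num_steps):
            # graph reasoning f(G, x): read the graph, update the graph-side state y
            g_val, beta = self.graph_attn(x, graph_mem, graph_mem)
            y = self.ffn_y(x + g_val)
            # sequence reasoning g(W, y): read the sentence, update the concept state x
            s_val, alpha = self.seq_attn(y, text_mem, text_mem)
            x = self.ffn_x(y + s_val)
        # the (head-averaged) attention weights play the roles of alpha and beta, which
        # drive the concept prediction and the edge / edge-label prediction respectively
        return x, y, alpha, beta
```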

5 Training & Prediction

Our model is trained with the standard maximum likelihood estimate. The optimization objective is to maximize the sum of the decomposed step-wise log-likelihoods, where each term is the sum of the concept, edge, and edge label log-probabilities. To facilitate training, we create a reference generation order of nodes by running a breadth-first traversal over target AMR graphs, as it is cognitively appealing (core-semantic-first principle, Cai and Lam, 2019) and the effectiveness of pre-order traversal has also been empirically verified by Zhang et al. (2019a) in a depth-first setting.


For the generation order of sibling nodes, we first mix a uniformly random order and a deterministic order sorted by relation frequency in a 1:1 ratio, and then switch to the deterministic order only in the final training steps. We empirically find that this deterministic-after-random strategy slightly improves performance.
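As a concrete illustration of the reference-order construction, here is a small sketch of a breadth-first traversal over an AMR graph represented as an adjacency dictionary; the data structure and the sibling_key hook are assumptions made for illustration, not the paper's preprocessing code.

```python
# Sketch: derive the reference node-generation order by BFS from the AMR root.
from collections import deque

def bfs_node_order(root, children, sibling_key=None):
    """children: dict mapping each node to its child nodes in the AMR graph."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        kids = children.get(node, [])
        if sibling_key is not None:          # e.g., sort siblings by relation frequency
            kids = sorted(kids, key=sibling_key)
        for kid in kids:
            if kid not in seen:              # reentrant nodes are generated only once
                seen.add(kid)
                queue.append(kid)
    return order

# toy AMR for "The boy must not go": obligate-01 is the root
graph = {"obligate-01": ["go-02"], "go-02": ["boy", "-"]}
print(bfs_node_order("obligate-01", graph))   # ['obligate-01', 'go-02', 'boy', '-']
```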

During testing, our model searches for the best output graph through beam search based on the log-likelihood at each spanning step. The time complexity of our model is O(k|V|), where k is the beam size and |V| is the number of nodes.
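The decoding procedure can be pictured with the generic beam-search skeleton below; the expand_fn/is_final callables stand in for the model's scored graph expansions and the EOG test, and are assumptions made for illustration rather than the parser's decoder interface.

```python
# Generic sketch of beam search over graph expansion steps.
def beam_search(initial_graph, expand_fn, is_final, beam_size=8, max_steps=100):
    beam = [(0.0, initial_graph)]                  # (cumulative log-likelihood, partial graph)
    for _ in range(max_steps):
        candidates = []
        for score, graph in beam:
            if is_final(graph):                    # finished hypotheses are kept as-is
                candidates.append((score, graph))
                continue
            for step_logprob, new_graph in expand_fn(graph):
                candidates.append((score + step_logprob, new_graph))
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        beam = candidates[:beam_size]              # keep the k best partial graphs
        if all(is_final(g) for _, g in beam):
            break
    return beam[0][1]                              # highest-scoring complete graph
```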

6 Experiments

6.1 Experimental Setup

Datasets Our evaluation is conducted on two public AMR releases: AMR 2.0 (LDC2017T10) and AMR 1.0 (LDC2014T12). AMR 2.0 is the latest and largest AMR sembank and has been extensively used in recent works. AMR 1.0 shares the same development and test sets with AMR 2.0, while the size of its training set is only about one third of that of AMR 2.0, making it a good testbed to evaluate our model's sensitivity to data size.6

Implementation Details We use Stanford CoreNLP (Manning et al., 2014) for tokenization, lemmatization, part-of-speech tagging, and named entity tagging. The hyper-parameters of our models are chosen on the development set of AMR 2.0. Unless otherwise specified, we perform N = 4 steps of iterative inference. Other hyper-parameter settings can be found in the Appendix. Our models are trained using ADAM (Kingma and Ba, 2014) for up to 60K steps (the first 50K with the random sibling order and the last 10K with the deterministic order), with early stopping based on development set performance. We fix the BERT parameters, similar to Zhang et al. (2019a,b), due to the GPU memory limit. During testing, we use a beam size of 8 for the highest-scored graph approximation.7

AMR Pre- and Post-processing We remove senses as done in Lyu and Titov (2018); Zhang et al. (2019a,b) and simply assign the most frequent sense to nodes in post-processing.

6There are a few annotation revisions from AMR 1.0 to AMR 2.0.

7Our code is released at https://github.com/jcyk/AMR-gs.

Notably, most existing methods, including the state-of-the-art parsers (Zhang et al., 2019a,b; Lyu and Titov, 2018; Guo and Lu, 2018, inter alia), often rely on heavy graph re-categorization to reduce the complexity and sparsity of the original AMR graphs. In graph re-categorization, specific subgraphs of AMR are grouped together and assigned to a single node with a new compound category, which usually involves non-trivial expert-level manual effort for hand-crafting rules. We follow exactly the same pre- and post-processing steps as those of Zhang et al. (2019a,b) for graph re-categorization. More details can be found in the Appendix.

Ablated Models As pointed out by Cai and Lam (2019), the precise set of graph re-categorization rules differs among different works, making it difficult to distinguish the performance improvement due to model optimization from that due to carefully designed rules. In addition, only recent works (Zhang et al., 2019a,b; Lindemann et al., 2019; Naseem et al., 2019) have started to utilize the large-scale pre-trained language model BERT (Devlin et al., 2019; Wolf et al., 2019). Therefore, we also include ablated models to address two questions: (1) How much does our model's performance depend on hand-crafted graph re-categorization rules? (2) How much does BERT help? We accordingly implement three ablated models by removing either one of them or both. The ablation study not only reveals the individual effect of the two model components but also helps facilitate fair comparisons with prior works.

6.2 Experimental Results

Main Results The performance of AMR parsing is conventionally evaluated by the SMATCH (F1) metric (Cai and Knight, 2013). The left block of Table 1 shows the SMATCH scores on the AMR 2.0 test set of our models against the previous best approaches and recent competitors. On AMR 2.0, we outperform the latest push from Zhang et al. (2019b) by 3.2% and, for the first time, obtain a parser with an over 80% SMATCH score. Note that even without BERT, our model, at 77.3%, still outperforms the previous state-of-the-art approaches that use BERT (Zhang et al., 2019b,a). This is particularly remarkable since running BERT is computationally expensive. As shown in Table 2, on AMR 1.0, where the training instances number only around 10K, we improve the best-reported results by 4.1% and reach 75.4%, which is already higher than most models trained on AMR 2.0.
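For reference, SMATCH scores like those in Tables 1 and 2 are typically computed with the standard smatch.py script; the snippet below simply shells out to it. The flag and the printed output format are assumptions about the public Smatch tool, so adjust them to your local copy.

```python
# Hypothetical helper for computing a SMATCH score with the public smatch.py script.
import subprocess

def smatch_score(predicted_file: str, gold_file: str) -> str:
    result = subprocess.run(
        ["python", "smatch.py", "-f", predicted_file, gold_file],  # two files of AMR graphs
        capture_output=True, text=True, check=True)
    return result.stdout.strip()   # e.g. a line like "F-score: 0.80" (format may vary by version)

print(smatch_score("parsed.amr", "gold.amr"))
```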


Model | G.R. | BERT | SMATCH | Unlabeled | No WSD | Concept | SRL | Reent. | Neg. | NER | Wiki
van Noord and Bos (2017) | × | × | 71.0 | 74 | 72 | 82 | 66 | 52 | 62 | 79 | 65
Groschwitz et al. (2018) | ✓ | × | 71.0 | 74 | 72 | 84 | 64 | 49 | 57 | 78 | 71
Lyu and Titov (2018) | ✓ | × | 74.4 | 77.1 | 75.5 | 85.9 | 69.8 | 52.3 | 58.4 | 86.0 | 75.7
Cai and Lam (2019) | × | × | 73.2 | 77.0 | 74.2 | 84.4 | 66.7 | 55.3 | 62.9 | 82.0 | 73.2
Lindemann et al. (2019) | ✓ | ✓ | 75.3 | - | - | - | - | - | - | - | -
Naseem et al. (2019) | ✓ | ✓ | 75.5 | 80 | 76 | 86 | 72 | 56 | 67 | 83 | 80
Zhang et al. (2019a) | ✓ | × | 74.6 | - | - | - | - | - | - | - | -
Zhang et al. (2019a) | ✓ | ✓ | 76.3 | 79.0 | 76.8 | 84.8 | 69.7 | 60.0 | 75.2 | 77.9 | 85.8
Zhang et al. (2019b) | ✓ | ✓ | 77.0 | 80 | 78 | 86 | 71 | 61 | 77 | 79 | 86
Ours | × | × | 74.5 | 77.8 | 75.1 | 85.9 | 68.5 | 57.7 | 65.0 | 82.9 | 81.1
Ours | ✓ | × | 77.3 | 80.1 | 77.9 | 86.4 | 69.4 | 58.5 | 75.6 | 78.4 | 86.1
Ours | × | ✓ | 78.7 | 81.5 | 79.2 | 88.1 | 74.5 | 63.8 | 66.1 | 87.1 | 81.3
Ours | ✓ | ✓ | 80.2 | 82.8 | 80.8 | 88.1 | 74.2 | 64.6 | 78.9 | 81.1 | 86.3

Table 1: SMATCH scores (%) (left) and fine-grained evaluations (%) (right) on the test set of AMR 2.0. G.R./BERT indicate whether or not the results use graph re-categorization/BERT respectively.

Model | G.R. | BERT | SMATCH
Flanigan et al. (2016) | × | × | 66.0
Pust et al. (2015) | × | × | 67.1
Wang and Xue (2017) | ✓ | × | 68.1
Guo and Lu (2018) | ✓ | × | 68.3
Zhang et al. (2019a) | ✓ | ✓ | 70.2
Zhang et al. (2019b) | ✓ | ✓ | 71.3
Ours | × | × | 68.8
Ours | ✓ | × | 71.2
Ours | × | ✓ | 74.0
Ours | ✓ | ✓ | 75.4

Table 2: SMATCH scores on the test set of AMR 1.0.

The even more substantial performance gain on the smaller dataset suggests that our method is both effective and data-efficient. Moreover, our model without BERT again surpasses the previous state-of-the-art results obtained with BERT. For the ablated models, it can be observed that our models yield the best results in all settings wherever there are competitors, indicating that BERT and graph re-categorization are not the exclusive keys to our superior performance.

Fine-grained Results In order to investigate how our parser performs on individual sub-tasks, we also use the fine-grained evaluation tool of Damonte et al. (2017) and compare to systems that reported these scores.8 As shown in the right block of Table 1, our best model obtains the highest scores on almost all sub-tasks. The improvements on all sub-tasks are consistent and uniform (around 2%-3%) compared to the previous state-of-the-art performance (Zhang et al., 2019b), partly confirming that our model boosts performance via consolidated and harmonious decisions rather than by fixing particular phenomena.

8We only list the results on AMR 2.0 since there are few results on AMR 1.0 to compare.

Figure 4: SMATCH scores with different numbers of inference steps. Sentences are grouped by length.

From our ablation study, it is worth noting that the NER scores are much lower when using graph re-categorization. This is because the rule-based system for NER in graph re-categorization does not generalize well to unseen entities, which suggests a potential improvement from adopting better NER taggers.

6.3 More Analysis

Effect of Iterative Inference We now turn to studying the effect of our key idea, namely the iterative inference design. To this end, we run a set of experiments with different values of the number of inference steps N. The results on AMR 2.0 are shown in Figure 4 (solid line). As seen, the performance generally goes up as the number of inference steps increases. The difference is most noticeable between 1 (no iterative reasoning is performed) and 2, while later improvements gradually diminish. One important point here is that the model size, in terms of the number of parameters, is constant regardless of the number of inference steps, making it different from general over-parameterization.


Figure 5: Case study on the sentence "I have little or no pity for you." (viewed in color). Color shading intensity represents the value of the attention score.

For a closer study of the effect of the number of inference steps with respect to the lengths of input sentences, we group sentences into three classes by length and also show the individual results in Figure 4 (dashed lines). As seen, the iterative inference helps more for longer sentences, which confirms our intuition that longer and more complex input needs more rounds of reasoning. Another interesting observation is that the performance on shorter sentences peaks earlier. This suggests that the number of inference steps could be adjusted according to the input sentence, which we leave as future work.

Effect of Beam Size We are also interested in the effect of the beam size during testing. Ideally, if a model is able to make accurate predictions in the first place, it should rely less on the search algorithm. We vary the beam size and plot the curve in Figure 6. The results show that the performance generally gets better with larger beam sizes. However, a small beam size of 2 already captures most of the gains, which suggests that our model is robust enough for time-critical environments.

Visualization We visualize the iterative reasoning process with a case study in Figure 5, illustrating the values of αt and βt as the iterative inference progresses. As seen, the parser makes mistakes in the first step, but gradually corrects its decisions and finally makes the right predictions. Later reasoning steps typically produce a sharper attention distribution than earlier ones, narrowing down the most likely answer with more confidence.

Speed We also report the parsing speed of our non-optimized code: with BERT, the parsing speed of our system is about 300 tokens/s, while without BERT it is about 330 tokens/s, on a single Nvidia P4 GPU. The absolute speed depends on various implementation choices and hardware performance.

Figure 6: SMATCH scores with different beam sizes.

In theory, the time complexity of our parsing algorithm is O(kbn), where k is the number of iterative steps, b is the beam size, and n is the graph size (number of nodes). It is important to note that our algorithm is linear in the graph size.

7 Conclusion

We presented a dual graph-sequence iterative inference method for AMR parsing. Our method constructs an AMR graph incrementally in a node-by-node fashion. Each spanning step is explicitly characterized as answering two questions: which parts of the sequence to abstract, and where in the graph to construct. We leverage the mutual causalities between the two and design an iterative inference algorithm. Our model significantly advances the state-of-the-art results on two AMR corpora. An interesting direction for future work is to make the number of inference steps adaptive to the input sentence. Moreover, the idea proposed in this paper may be applied to a broad range of structured prediction tasks (not restricted to other semantic parsing tasks) where the complex output space can be divided into two interdependent parts, with a similar iterative inference process used to achieve harmonious predictions and better performance.


References

Yoav Artzi, Kenton Lee, and Luke Zettlemoyer. 2015. Broad-coverage CCG semantic parsing with AMR. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1699–1710.

Miguel Ballesteros and Yaser Al-Onaizan. 2017. AMR parsing using stack-LSTMs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1269–1275.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186.

Guntis Barzdins and Didzis Gosko. 2016. RIGA at SemEval-2016 task 8: Impact of Smatch extensions and character-level neural translation on AMR parsing accuracy. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1143–1147.

Deng Cai and Wai Lam. 2019. Core semantic first: A top-down approach for AMR parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3797–3807.

Shu Cai and Kevin Knight. 2013. Smatch: an evaluation metric for semantic feature structures. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 748–752.

Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes. 2013. Improving efficiency and accuracy in multilingual entity extraction. In Proceedings of the 9th International Conference on Semantic Systems, pages 121–124.

Marco Damonte, Shay B. Cohen, and Giorgio Satta. 2017. An incremental parser for Abstract Meaning Representation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 536–546.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. arXiv preprint arXiv:1611.01734.

Jeffrey Flanigan, Chris Dyer, Noah A. Smith, and Jaime Carbonell. 2016. CMU at SemEval-2016 task 8: Graph-based AMR parsing with infinite ramp loss. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1202–1206.

Jeffrey Flanigan, Sam Thomson, Jaime Carbonell, Chris Dyer, and Noah A. Smith. 2014. A discriminative graph-based parser for the Abstract Meaning Representation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436.

Jonas Groschwitz, Matthias Lindemann, Meaghan Fowlie, Mark Johnson, and Alexander Koller. 2018. AMR dependency parsing with a typed semantic algebra. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1831–1841.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640.

Zhijiang Guo and Wei Lu. 2018. Better transition-based AMR parsing with refined search space. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1712–1722.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence, pages 2741–2749.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Ioannis Konstas, Srinivasan Iyer, Mark Yatskar, Yejin Choi, and Luke Zettlemoyer. 2017. Neural AMR: Sequence-to-sequence models for parsing and generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 146–157.

Matthias Lindemann, Jonas Groschwitz, and Alexander Koller. 2019. Compositional semantic parsing across graphbanks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4576–4585.

Yijia Liu, Wanxiang Che, Bo Zheng, Bing Qin, and Ting Liu. 2018. An AMR aligner tuned by transition-based parser. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2422–2430.

Chunchuan Lyu and Ivan Titov. 2018. AMR parsing as graph prediction with latent alignment. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 397–407.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Tahira Naseem, Abhishek Shah, Hui Wan, Radu Florian, Salim Roukos, and Miguel Ballesteros. 2019. Rewarding Smatch: Transition-based AMR parsing with reinforcement learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4586–4592.

Rik van Noord and Johan Bos. 2017. Neural semantic parsing by character-based translation: Experiments with abstract meaning representations. arXiv preprint arXiv:1705.09980.

Xiaochang Peng, Linfeng Song, and Daniel Gildea. 2015. A synchronous hyperedge replacement grammar based approach for AMR parsing. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 32–41.

Xiaochang Peng, Linfeng Song, Daniel Gildea, and Giorgio Satta. 2018. Sequence-to-sequence models for cache transition systems. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1842–1852.

Xiaochang Peng, Chuan Wang, Daniel Gildea, and Nianwen Xue. 2017. Addressing the data sparsity issue in neural AMR parsing. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 366–375.

Michael Pust, Ulf Hermjakob, Kevin Knight, Daniel Marcu, and Jonathan May. 2015. Parsing English into Abstract Meaning Representation using syntax-based machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1143–1154.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5027–5038.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Chuan Wang, Sameer Pradhan, Xiaoman Pan, Heng Ji, and Nianwen Xue. 2016. CAMR at SemEval-2016 task 8: An extended transition-based AMR parser. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1173–1178.

Chuan Wang and Nianwen Xue. 2017. Getting the most out of AMR parsing. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1257–1268.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. 2019a. AMR parsing as sequence-to-graph transduction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 80–94.

Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. 2019b. Broad-coverage semantic parsing as transduction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3784–3796.


A Hyper-parameter Settings

Table 3 lists the hyper-parameters used in our full models. The char-level CNNs and the Transformer layers in the sentence encoder and the graph encoder share the same hyper-parameter settings. The BERT model (Devlin et al., 2019) we use is the Hugging Face implementation (Wolf et al., 2019) (bert-base-cased). To mitigate overfitting, we apply dropout (Srivastava et al., 2014) with a drop rate of 0.2 between different layers. We randomly mask (replacing inputs with a special UNK token) the input lemmas, POS tags, and NER tags with a rate of 0.33. Parameter optimization is performed with the ADAM optimizer (Kingma and Ba, 2014) with β1 = 0.9 and β2 = 0.999. The learning rate schedule is similar to that in Vaswani et al. (2017), with the number of warm-up steps set to 2K. We use early stopping on the development set to choose the best model.
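A sketch of such a warm-up schedule (the inverse-square-root schedule of Vaswani et al. (2017) with 2K warm-up steps) attached to an ADAM optimizer is shown below; the d_model value and the stand-in model are illustrative assumptions rather than the exact training script.

```python
# Sketch: ADAM with a Transformer-style warm-up learning-rate schedule.
import torch

def noam_lr(step: int, d_model: int = 512, warmup: int = 2000) -> float:
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

model = torch.nn.Linear(512, 512)                      # stand-in for the parser parameters
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)

for step in range(5):                                  # training loop placeholder
    optimizer.step()                                   # (loss computation omitted)
    scheduler.step()
    print(step, scheduler.get_last_lr())
```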

B AMR Pre- and Post-processing

We follow exactly the same pre- and post-processing steps as those of Zhang et al. (2019a,b) for graph re-categorization. In preprocessing, we anonymize entities, remove wiki links and polarity attributes, and convert the resulting AMR graphs into a compact format by compressing certain subgraphs. In post-processing, we recover the original AMR format from the compact format, restore Wikipedia links using the DBpedia Spotlight API (Daiber et al., 2013), and add polarity attributes based on rules observed from the training data. More details can be found in Zhang et al. (2019a).
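For illustration, a hypothetical call to the DBpedia Spotlight web service used for wiki-link restoration might look like the following; the endpoint, parameters, and response fields are assumptions about the public API rather than a description of the actual post-processing script.

```python
# Hypothetical sketch: query DBpedia Spotlight to recover :wiki values for entity mentions.
import requests

def spotlight_link(surface_text: str, confidence: float = 0.5):
    response = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",   # assumed public endpoint
        params={"text": surface_text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=10)
    response.raise_for_status()
    # each resource carries a DBpedia URI whose last component can serve as the :wiki value
    return [res["@URI"].rsplit("/", 1)[-1]
            for res in response.json().get("Resources", [])]

print(spotlight_link("The boy met Barack Obama."))   # e.g. ['Barack_Obama']
```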

Embeddings: lemma 300; POS tag 32; NER tag 16; concept 300; char 32
Char-level CNN: #filters 256; ngram filter size [3]; output size 128
Sentence Encoder: #transformer layers 4
Graph Encoder: #transformer layers 2
Transformer Layer: #heads 8; hidden size 512; feed-forward hidden size 1024
Concept Solver: feed-forward hidden size 1024
Relation Solver: #heads 8; feed-forward hidden size 1024
Deep biaffine classifier: hidden size 100

Table 3: Hyper-parameter settings.

